3.10. Visual Modeling

In the previous sections, we explained how the camera motion and calibration, as well as depth estimates for (almost) every pixel, can be obtained. This yields all the information necessary to build different types of visual models. In this section, several types of models are considered. First, the construction of texture-mapped 3D surface models is discussed. Then, a combined image- and geometry-based approach is presented that can render models ranging from pure plenoptic to view-dependent texture and geometry models. Finally, the possibility of combining real and virtual scenes is treated.

3.10.1. 3D surface reconstruction

The 3D surface is approximated by a triangular mesh to reduce geometric complexity and to tailor the model to the requirements of computer graphics visualization systems. A simple approach consists of overlaying a 2D triangular mesh on top of one of the images and then building a corresponding 3D mesh by placing the vertices of the triangles in 3D space according to the values found in the corresponding depth map. To reduce noise, it is recommended to first smooth the depth image (the smoothing kernel can be chosen to be of the same size as the mesh triangles). The image itself can be used as a texture map. While projective texture mapping would normally be required, the small size of the triangles allows us to use standard (affine) texture mapping (the texture coordinates are trivially obtained as the 2D coordinates of the vertices).
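This construction can be illustrated with a minimal sketch in Python, assuming a dense depth map D, camera intrinsics K, and a regular grid of vertices (function and variable names are hypothetical; this is an illustration, not the chapter's implementation):

```python
import numpy as np

def depth_map_to_mesh(D, K, step=4):
    """Overlay a regular 2D triangular mesh on the image grid and
    back-project its vertices into 3D according to the depth map D."""
    h, w = D.shape
    K_inv = np.linalg.inv(K)
    us, vs = np.meshgrid(np.arange(0, w, step), np.arange(0, h, step))
    # Back-project each vertex: X = depth * K^-1 * (u, v, 1)^T.
    rays = np.stack([us, vs, np.ones_like(us)], axis=-1) @ K_inv.T
    verts = (rays * D[vs, us][..., None]).reshape(-1, 3)
    # Texture coordinates are simply the 2D image coordinates of the vertices.
    uvs = np.stack([us / (w - 1.0), vs / (h - 1.0)], axis=-1).reshape(-1, 2)
    # Split every grid cell into two triangles.
    rows, cols = us.shape
    idx = np.arange(rows * cols).reshape(rows, cols)
    a, b, c, d = idx[:-1, :-1], idx[:-1, 1:], idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([np.stack([a, b, c], -1).reshape(-1, 3),
                            np.stack([b, d, c], -1).reshape(-1, 3)])
    return verts, uvs, faces
```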

It can happen that for some vertices no depth value is available or that the confidence is too low. In these cases, the corresponding triangles are not reconstructed. The same happens when triangles are placed over discontinuities. These triangles are detected by thresholding the angle between the normal of a triangle and the line of sight through its center (e.g., at 85 degrees). This simple approach works very well on the dense depth maps as obtained through multiview linking. The surface reconstruction approach is illustrated in Figure 3.19.
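The rejection test itself can be sketched as follows, reusing verts and faces from the previous sketch (vertices without a depth value, or with too low a confidence, would simply be masked out beforehand):

```python
import numpy as np

def visible_triangles(verts, faces, max_angle_deg=85.0):
    """Discard triangles straddling depth discontinuities by thresholding the
    angle between each triangle's normal and the line of sight through its
    center (the camera is assumed to sit at the origin)."""
    p0, p1, p2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    normals = np.cross(p1 - p0, p2 - p0)
    centers = (p0 + p1 + p2) / 3.0
    cos_angle = np.abs(np.sum(normals * centers, axis=1)) / (
        np.linalg.norm(normals, axis=1) * np.linalg.norm(centers, axis=1) + 1e-12)
    # Keep a triangle only if it is viewed at less than the maximum angle.
    return faces[cos_angle > np.cos(np.radians(max_angle_deg))]
```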

Figure 3.19. Surface reconstruction approach (top): Triangular mesh is overlaid on top of the image. The vertices are back-projected in space according to the depth values. From this, a 3D surface model is obtained (bottom).


A further example is shown in Figure 3.20. The video sequence was recorded with a handheld camcorder on an archaeological site in Sagalassos, Turkey (courtesy of Marc Waelkens). It shows a decorative medusa head that was part of a monumental fountain. The video sequence was processed fully automatically using the algorithms discussed in Sections 3.7, 3.8, 3.9, and 3.10. From the bundle adjustment and the multiview linking, the accuracy was estimated relative to the size of the reconstructed object. This has to be compared with the image resolution of 720 x 576. Note that the camera was uncalibrated and, besides having an unknown focal length and principal point, had significant radial distortion and an aspect ratio different from one (i.e., 1.09), all of which were automatically recovered from the video sequence.

Figure 3.20. 3D-from-video: one of the video frames (upper-left), recovered structure and motion (upper-right), textured and shaded 3D model (middle), and more views of textured 3D model (bottom).


To reconstruct more complex shapes, it is necessary to combine results from multiple depth maps. The simplest approach consists of generating separate models independently and then loading them together in the graphics system. Since all depth maps are located in a single coordinate frame, registration is not an issue. Often it is interesting to integrate the different meshes into a single mesh. A possible approach is given in [10].
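Because all partial meshes already live in one coordinate frame, the "load them together" option amounts to concatenating vertex and face arrays with an index offset, as in the minimal sketch below (the integration approach of [10] goes further and merges overlapping surfaces into a single seamless mesh):

```python
import numpy as np

def concatenate_meshes(meshes):
    """meshes: list of (verts, faces) pairs, all in the same coordinate frame."""
    all_verts, all_faces, offset = [], [], 0
    for verts, faces in meshes:
        all_verts.append(verts)
        all_faces.append(faces + offset)  # shift indices past earlier vertices
        offset += len(verts)
    return np.concatenate(all_verts), np.concatenate(all_faces)
```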

3.10.2. Image-based rendering

In the previous section, we presented an approach to construct 3D models. If the goal is to generate novel views, other approaches are available. In recent years, a multitude of image-based approaches have been proposed that render novel views directly from recorded images, without the need for an explicit intermediate 3D model. The best-known approaches are lightfield and lumigraph rendering [37, 18] and image warping [4, 54, 1].

Here we briefly introduce an approach to render novel views directly from images recorded with a handheld camera. If available, some depth information can also be used to refine the underlying geometric assumption. A more extensive discussion of this work can be found in [25, 33]. A related approach was presented in [3]. A lightfield is the collection of the lightrays corresponding to all the pixels in all the recorded images. Therefore, rendering from a lightfield consists of looking up the "closest" ray(s) passing through every pixel of the novel view. Determining the closest ray consists of two steps: (1) determining in which views the closest rays are located, and (2) within each such view, selecting the ray that intersects the implicit geometric assumption in the same point. For example, if the assumption is that the scene is far away, the corresponding implicit geometric assumption might be the plane at infinity, Π∞, so that rays with parallel directions would be selected.
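To make the planar case concrete, the induced mapping can be written in closed form. The following is the standard plane-induced homography (a textbook result, not specific to this chapter's implementation): for a recorded camera K1[I | 0] and a novel camera K2[R | t], a plane with unit normal n at distance d in the first camera's frame induces, up to the sign convention chosen for the plane,

\[
H = K_2 \left( R - \frac{t\,n^{\top}}{d} \right) K_1^{-1},
\qquad
H_{\infty} = K_2\, R\, K_1^{-1} \quad (d \to \infty),
\]

so the far-away assumption above indeed reduces to transferring rays by direction alone.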

In our case, the view selection works as follows. All the camera projection centers are projected in the novel view and Delaunay triangulated. For every pixel within a triangle, the recorded views corresponding to the three vertices are selected as "closest" views. If the implicit geometric assumption is planar, a homography relates the pixels in the novel view with those in a recorded view. Therefore, a complete triangle in the novel view can be efficiently drawn using texture mapping. The contributions of the three cameras can be combined using alpha blending. The geometry can be approximated by one plane for the whole scene, one plane per camera triple, or by several planes for one camera triple. The geometric construction is illustrated in Figure 3.21.
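A minimal sketch of this view-selection step, assuming a recovered 3 x 4 projection matrix P_new for the novel view and the recorded cameras' centers in world coordinates (scipy's Delaunay triangulation stands in for whatever routine the original implementation used):

```python
import numpy as np
from scipy.spatial import Delaunay

def select_closest_views(centers_world, P_new):
    """Project the recorded cameras' centers into the novel view and
    triangulate them; the three vertices of each triangle identify the
    three "closest" recorded views for all pixels inside that triangle."""
    n = len(centers_world)
    hom = np.hstack([centers_world, np.ones((n, 1))])  # homogeneous coordinates
    proj = (P_new @ hom.T).T
    pts2d = proj[:, :2] / proj[:, 2:3]                 # perspective division
    tri = Delaunay(pts2d)
    return pts2d, tri.simplices  # each row holds the indices of three views
```

Each triangle is then texture-mapped from its three selected views via the plane-induced homography; the alpha-blending weights could, for instance, be taken from the barycentric coordinates of each pixel within the triangle (one plausible choice; the text does not specify the weighting).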

Figure 3.21. Drawing triangles of neighboring projected camera centers and approximated scene geometry.


This approach is illustrated in Figure 3.22 with an image sequence of 187 images recorded by waving a camera over a cluttered desk. In the lower part of Figure 3.22, a detail of a view is shown for the different methods. In the case of one global plane (left image), the reconstruction is sharp where the approximating plane intersects the actual scene geometry. The reconstruction is blurred where the scene geometry diverges from this plane. In the case of local planes (middle image), at the corners of the triangles, the reconstruction is almost sharp, because there the scene geometry is considered directly. Within a triangle, ghosting artifacts occur where the scene geometry diverges from the particular local plane. If these triangles are subdivided (right image), these artifacts are reduced further.

Figure 3.22. Top: image of the desk sequence and sparse structure-and-motion result (left), artificial view rendered using one plane per image triple (right). Details of rendered images showing the differences between the approaches (bottom): one global plane of geometry (left), one local plane for each image triple (middle), and refinement of local planes (right).


3.10.3. Match-moving

Another interesting application of the presented algorithms consists of adding virtual elements to real video. This has important applications in the entertainment industry, and several products, such as 2d3's boujou and RealViz's MatchMover, exist that are based on the techniques described in Section 3.7 and Section 3.8. The key issue consists of registering the motion of the real and the virtual camera. The presented techniques can be used to compute the motion of the camera in the real world. This reduces the problem of introducing virtual objects into video to determining the desired position, orientation, and scale with respect to the reconstructed camera motion. More details on this approach can be found in [7]. Note that to achieve a seamless integration, other effects, such as occlusion and lighting, must also be taken care of.
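A minimal sketch of this registration, assuming per-frame intrinsics K and pose (R, t) recovered as in Sections 3.7 and 3.8 (all names here are hypothetical illustrations):

```python
import numpy as np

def project_virtual_object(object_verts, K, R, t):
    """Project the virtual model with exactly the projection matrix
    estimated for the real camera in this frame.
    object_verts: (n, 3) points of the virtual model, already placed at the
    desired position, orientation, and scale in the reconstructed world frame."""
    P = K @ np.hstack([R, t.reshape(3, 1)])  # 3x4 projection matrix K [R | t]
    hom = np.hstack([object_verts, np.ones((len(object_verts), 1))])
    proj = (P @ hom.T).T
    return proj[:, :2] / proj[:, 2:3]  # pixels to composite over the frame
```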

An example is shown in Figure 3.23. The video shows the remains of one of the ancient monumental fountains of Sagalassos. A virtual reconstruction of the monument was overlaid on the original frames. The virtual camera was set up to mimic exactly the computed motion and calibration of the original camera.

Figure 3.23. Augmented video: six frames (out of 250) from a video where a virtual reconstruction of an ancient monument has been added.

