7.5. Camera Tracking and Structure from Motion

With the advent of computer vision techniques in visual effects, directors found that they could move the camera all over the place and the effects crew could still insert digital effects into the image. So, naturally, film directors ran with the ability and never looked back. Now all the large effects shops have dedicated teams of computer vision artists. Almost every shot is tracked in one way or another.

In the computer vision community, the world space location of the camera is typically called the camera's extrinsic parameters. Computing these parameters from information in an image or images is called camera tracking in the visual effects industry. In computer vision, this task has different names depending on how you approach the solution to the problem: camera calibration, pose estimation, structure from motion, and more. All the latest techniques are available to the artists computing the camera motion.

The overriding concern of the artists is accuracy. Once a shot gets assigned to an artist, the artist produces versions of the camera track that are rendered as wireframe over the background image. Until the wireframe objects line up exactly, the artist must continually revisit the sequence, adjusting and tweaking the motion curves of the camera or the objects.

A camera track starts its life on the set. As mentioned before, the data integration team gathers as much information as is practical. At the minimum, a camera calibration chart is photographed, but often a survey of key points is created or the set is photographed with a reference object so that photogrammetry can be used to determine the 3D location of points on the set. This data is used to create a polygonal model of the points on the set. (This model is often quite hard to decipher. It typically has only key points in the set, and they are connected in a rough manner. But it is usually enough to guide the tracking artist.)

The images from the shot are "brought online"—either scanned from the original film or transferred from the high-definition digital camera. Also, any notes from the set are transcribed to a text file and associated with the images.

Before the tracking begins, the lens distortion must be dealt with. While the lenses used in filmmaking are quite extraordinary, they still introduce a certain amount of radial distortion into the image. Fixed lenses are quite good, but zoom lenses and especially anamorphic lens setups are notorious for the distortion they introduce. In fact, for an anamorphic zoom lens, the distortion is quite hard to characterize. For the most part, the intrinsic parameters of the lens are easily computed from the calibration charts shot on set. For more complicated lenses, where the standard radial lens distortion equations do not accurately model the behavior of the lens, the distortion is modeled by hand: a correspondence between the grid shot through the lens and an ideal grid is built and used to warp the image into the ideal state.
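For the common case where the standard radial model is sufficient, the intrinsics and distortion coefficients can be estimated directly from the photographed grid. The following is a minimal sketch using OpenCV's planar calibration; the checkerboard dimensions and file names are placeholders, and production facilities typically use in-house equivalents of this workflow.

```python
# Sketch: estimate intrinsics and radial distortion from a photographed
# calibration grid, then produce an undistorted ("straightened") plate.
# Grid size and file names below are placeholders.
import cv2
import numpy as np

PATTERN = (9, 6)  # inner corners of the assumed checkerboard chart

# Ideal grid points on the chart plane (z = 0), in chart units.
obj_grid = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
obj_grid[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in ["grid_0001.png", "grid_0002.png"]:   # placeholder frames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(obj_grid)
        img_points.append(corners)

# Solve for the intrinsic matrix and radial/tangential distortion terms.
_, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Warp a plate into its pinhole approximation.
plate = cv2.imread("plate_0001.png")
straightened = cv2.undistort(plate, K, dist)
cv2.imwrite("plate_0001_straight.png", straightened)
```

Here `dist` holds the estimated distortion coefficients; applying `cv2.undistort` to every frame produces the straightened plates described next.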

However the lens distortion is computed, the images are warped to produce "straightened images," which approximate what would be seen through a pinhole camera. These images are used online throughout the tracking, modeling, and animation process. The original images are used only when the final results are composited with the rendered components of the shot. This seems like an unnecessary step; surely the intrinsic parameters of the lens can be incorporated into the camera model. The problem is that these images are not just used for tracking but also as background images in the animation packages that the artists use. And in those packages, a simple OpenGL pinhole camera model is the only one available. So, if the original images were used to track the camera motion and the tracking package uses a complicated camera model that includes radial (or arbitrary) lens distortion, the lineup would look incorrect when viewed through a standard animation package that does not support a more complicated camera model.

The implications of this are twofold. First, two different sets of images must be stored on disk, the original and the straightened version. Second, the artists must render their CG elements at a higher resolution because they will eventually be warped to fit the original images and this will soften (blur) the rendered image. Surprisingly, this does not impact the pipeline as much as one would expect; CG elements are typically blurred slightly to blend better with the original filmed plate.

The artist assigned to the shot uses either an inhouse computer vision program or one of many commercially available packages. For the most part, these programs have similar capabilities. They can perform pattern tracking and feature detection, and they can set up a constraint system and then solve for the location of the camera or moving rigid objects in the scene. Recently, most of these program have added the ability to compute structure from motion without the aid of any surveyed information.

In the most likely scenario, the artists begin with the sequence of images and the surveyed geometry. Correspondences between the points on the geometry and the points in the scene are made, and an initial camera location is determined. Typically, the data integration team has computed the intrinsic parameters of the camera based on the reference grid shot with the lens.
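Given those 2D-3D correspondences and the intrinsics, the initial camera location reduces to a standard perspective-n-point solve. A minimal sketch, with a hypothetical intrinsic matrix and made-up point values, might look like this:

```python
# Sketch: recover an initial camera pose from correspondences between
# surveyed 3D set points and their 2D positions in the straightened plate.
# All numeric values are placeholders.
import cv2
import numpy as np

K = np.array([[1800.0,    0.0, 960.0],
              [   0.0, 1800.0, 540.0],
              [   0.0,    0.0,   1.0]])

# Surveyed set points (e.g., from a total station), in set units.
points_3d = np.array([[0.0, 0.0, 0.0],
                      [2.5, 0.0, 0.0],
                      [2.5, 1.2, 0.0],
                      [0.0, 1.2, 0.3],
                      [1.1, 0.6, 2.0],
                      [3.0, 0.4, 1.5]], dtype=np.float64)

# The same points identified in the image (pixels).
points_2d = np.array([[412.0, 633.0], [890.0, 641.0], [902.0, 388.0],
                      [405.0, 371.0], [655.0, 250.0], [1104.0, 300.0]],
                     dtype=np.float64)

# Distortion is None because the plate has already been straightened.
ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, None)
R, _ = cv2.Rodrigues(rvec)
camera_position = -R.T @ tvec   # extrinsic parameters in world space
```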

Once the initial pose is determined, the points associated with each point on the reference geometry are tracked in 2D over time. Pattern trackers are used to do this, though when the pattern tracks fail, the artist often tracks the images by hand.
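A basic pattern track is essentially template matching repeated frame to frame. The sketch below, with illustrative window sizes, follows a single feature by normalized cross-correlation; commercial trackers layer subpixel refinement, pattern updates, and failure detection on top of this idea.

```python
# Sketch: a minimal 2D pattern track using normalized cross-correlation.
# Window sizes and the starting position are illustrative only.
import cv2
import numpy as np

def track_point(frames, start_xy, patch=21, search=60):
    """Follow one feature across a list of grayscale frames."""
    half_p, half_s = patch // 2, search // 2
    x, y = start_xy
    positions = [(x, y)]
    template = frames[0][y - half_p:y + half_p + 1,
                         x - half_p:x + half_p + 1]
    for frame in frames[1:]:
        # Search a window around the last known position.
        region = frame[y - half_s:y + half_s + 1,
                       x - half_s:x + half_s + 1]
        score = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)
        _, _, _, best = cv2.minMaxLoc(score)
        x = x - half_s + best[0] + half_p
        y = y - half_s + best[1] + half_p
        positions.append((x, y))
        # Re-sample the template so slow appearance changes are tolerated.
        template = frame[y - half_p:y + half_p + 1,
                         x - half_p:x + half_p + 1]
    return positions
```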

Why would pattern tracks fail? One of the most common reasons is that the feature points that are easiest to digitize with a total station—the corners of objects—are often not the best points to camera-track. Corners can be difficult to pattern-track because of the large amount of change the pattern undergoes as the camera rotates around the corner, and the background behind the corner often confuses the pattern tracking algorithm. Corner detection algorithms can pick corners out, but the subpixel precision of these algorithms is not as high as that of algorithms based on cross-correlation template matching.

Motion blur poses many problems for the tracking artist. Pattern trackers often fail in the presence of motion blur. And even if the artist is forced to track the pattern by hand, it can be difficult to determine where a feature lies on the image if it is blurred over many pixels. Recently, the artists at Digital Domain were tracking a shot for a movie called xXx. In the shot, the camera was mounted on a speeding car traveling down a road parallel to a river. On the river, a hydrofoil boat was traveling alongside the car. The effect was to replace the top of the boat with a digitally created doomsday device and make it look like it was screaming down the river on some nefarious mission (Figure 7.5). The first task was to track the camera motion of the car with respect to the background landscape. Then the boat was tracked as a rigid, moving object with respect to the motion of the camera. Naturally, the motion blur for both cases was extreme. The automatic tracking program had a very difficult time due to the extreme motion of the camera: feature points would travel nearly halfway across the image in the space of one or two frames, and optical flow algorithms failed for the same reason. Even tracking the points by hand proved quite difficult because of the extreme streaking of the information in the image; the artists placed the point in the center of the streak and hoped for the best.

Figure 7.5. Snapshot of a boat/car chase in the movie xXx. Left: Original. Right: Final composite image with digital hydrofoils inserted.


This worked to some extent. In this extreme example, the motion blur of the boat was at times so severe that the streaks it caused were not linear at all, yet the motion of the tracked camera and boat was represented as a linear interpolation between keyframe positions at every frame. Consequently, the motion blur generated by the renderer did not match the motion blur of the background frame, and it is surprising how much small detail the eye can pick up. When the artists used spline interpolation for the camera motion and added subframe keyframes to correctly account for sharp camera/model motion, the digital imagery fit the background noticeably better.
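In code, the fix amounts to evaluating a spline through the per-frame camera keys at subframe shutter times, rather than linearly interpolating between frames. A small sketch with placeholder keyframe values:

```python
# Sketch: sample camera translation at subframe times so renderer
# motion-blur samples follow a smooth spline path. Values are placeholders.
import numpy as np
from scipy.interpolate import CubicSpline

key_frames = np.array([100, 101, 102, 103, 104])      # frame numbers
key_pos = np.array([[ 0.0, 1.5, 0.0],                  # per-frame camera
                    [ 4.2, 1.6, 0.1],                  # positions
                    [ 9.1, 1.8, 0.4],
                    [14.6, 1.7, 0.9],
                    [20.8, 1.5, 1.6]])

spline = CubicSpline(key_frames, key_pos, axis=0)

# Shutter open for half a frame: sample the path at subframe intervals.
shutter_samples = np.linspace(102.0, 102.5, 8)
blur_path = spline(shutter_samples)   # positions fed to the renderer
```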

Beyond tracking points, digital tracking artists have many tools available for producing a good track. Points can be weighted based on how certain the artist is of them—points that are obscured by objects can still be used; even though they cannot be seen, their positions are interpolated (or guessed) and their uncertainty set very high. Camera pose estimation is still an optimization problem, and often the position computed is the one that minimizes the error in the correspondence of points. That being the case, removing a point from the equation can have quite an effect on the computed location of the camera. This results in the dreaded "pop" in motion after a point leaves the image. The pop can be managed by ramping up the point's uncertainty as it approaches the edge of the image. Alternatively, the point can be used even after it leaves the image: the artist pretends that the point's location is known, and this can minimize any popping.
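One way to picture this is a weighted reprojection-error solve in which each point's weight ramps toward zero as it nears the edge of frame. The sketch below is illustrative only: the ramp width, the helper names, and the use of scipy's least_squares are assumptions, not any particular tracking package's implementation.

```python
# Sketch: per-point confidence weights in a reprojection-error solve, with a
# weight that ramps down as a feature approaches the frame edge so the
# solution does not "pop" when the point drops out. Hypothetical helpers.
import numpy as np
import cv2
from scipy.optimize import least_squares

def edge_ramp_weight(xy, width, height, margin=50.0):
    """1.0 in the interior, falling to 0.0 within `margin` pixels of the edge."""
    d = min(xy[0], xy[1], width - xy[0], height - xy[1])
    return float(np.clip(d / margin, 0.0, 1.0))

def residuals(pose, points_3d, points_2d, weights, K):
    """Weighted reprojection error for a 6-DOF pose (rvec + tvec)."""
    rvec, tvec = pose[:3], pose[3:]
    proj, _ = cv2.projectPoints(points_3d, rvec, tvec, K, None)
    err = (proj.reshape(-1, 2) - points_2d) * weights[:, None]
    return err.ravel()

# Usage, given points_3d, points_2d, K, and an initial pose0 from the
# earlier perspective-n-point solve:
# weights = np.array([edge_ramp_weight(p, 1920, 1080) for p in points_2d])
# result = least_squares(residuals, pose0,
#                        args=(points_3d, points_2d, weights, K))
```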

Other constraints beyond points are available in many packages. Linear constraints are often useful: an edge on the surveyed geometry is constrained to stay on the edge of an object in the image. Pattern trackers can follow the edge of the object as easily as a point on the object.

Of course, it is not strictly necessary to have any surveyed information. Structure from motion techniques can solve for the location of both the camera and the 3D points in the scene from the image sequence alone. However, if there is no absolute measurement of anything on the set, there is no way for structure from motion algorithms to determine the absolute scale of the 3D points and camera translation. Scale is important to digital artists; animated characters and physical effects are built at a specific scale, and having to play with the scale to get things to look right is not something that artists like to do. So, even when relying on structure from motion, some information—like the measurement of the distance between two points on the set—is quite useful for establishing the scale of the scene.
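Fixing the scale from a single measured distance is simple arithmetic: take the ratio between the measured length and the same length in the reconstruction, then apply it to every reconstructed point and camera translation. A tiny sketch with placeholder values:

```python
# Sketch: fix the global scale of a structure-from-motion solve using one
# distance measured on set. All values are placeholders.
import numpy as np

# Two reconstructed points in the (arbitrarily scaled) SfM coordinate frame.
p_a = np.array([0.31, 1.02, 4.87])
p_b = np.array([1.12, 1.05, 4.91])

measured_distance_m = 2.45            # tape-measured on the set, in meters
sfm_distance = np.linalg.norm(p_b - p_a)
scale = measured_distance_m / sfm_distance

# The same scale is then applied to every reconstructed point and to the
# camera translations, e.g.:
# points_scaled = points * scale
# camera_translations_scaled = camera_translations * scale
```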

The convergence of optical flow techniques, good-feature tracking, and structure from motion solvers has caused large ripples through film and video production facilities. Now, with no a priori knowledge of the set, a reasonable track and reconstruction of the set can be created, as long as the camera moves enough and enough points can be followed. Some small visual effects facilities rely almost entirely on structure from motion solutions. Larger facilities are finding it to be an enormous help, but rely on more traditional camera tracking techniques for quite a few shots. Automatic tracking is automatic when it is easy.

Consider a shot where the camera is chasing the hero of the film down a street. At some point, the camera points up to the sky and then back down to center on the main character. Typical effects might be to add some digital buildings to the set (or at least enhance the buildings that are already in the scene) and perhaps add some flying machines or creatures chasing our hero. This makes the camera tracking particularly challenging: when the camera points to the sky, it cannot see any trackable points. But if buildings and flying objects are being added to the scene, the camera move during that time must at least be "plausible" and certainly be smooth, without any popping motion as tracking points are let go. Automatic tracking techniques are generally not used on this kind of shot because the artist will want complete control of the points used to compute the camera motion. The camera is tracked for the beginning and end of the shot, and the camera motion for the stretch where the camera sees only sky or unusable tracking points (like the top of the actor's head) is created by clever interpolation and a gut feeling for what the camera was doing. Often, something can be tracked, even if it is a far-off cloud, and that can at least be used to determine the rotation of the camera.
