3.7. Structure and Motion

In this section, we discuss the problem of structure and motion recovery from a more practical point of view. We present an approach to automatically build up the projective structure and motion from a sequence of images. The presented approach is sequential, which offers the advantage that corresponding features are not required to be visible in all views.

We assume that for each pair of consecutive views we are given a set of potentially corresponding feature points. The feature points are typically obtained by using a corner detector [19]. If the images are not too widely separated, corresponding features can be identified by comparing the local intensity neighborhoods of the feature points on a pixel-by-pixel basis. In this case, typically only feature points that have similar coordinates are compared. In case of video, it might be more appropriate to use a feature tracker [56] that follows features from frame to frame. For more widely separated views, more complex features would have to be used [39, 53, 69, 40].

3.7.1. Two-view geometry computation

The first step of our sequential structure and motion computation approach consists of computing the geometric relation between two consecutive views. As seen in Section 3.5.2, this consists of recovering the fundamental matrix. In principle, the linear method presented in Section 3.6.3 could be used. In this section, a practical algorithm is presented to compute the fundamental matrix from a set of corresponding points perturbed with noise and containing a significant proportion of outliers.

Linear algorithms

Before presenting the complete robust algorithm, we revisit the linear algorithm. Given a number of corresponding points, (3.14) can be used to compute F. This equation can be rewritten in the following form:

Equation 3.29

[x1x2  y1x2  x2  x1y2  y1y2  y2  x1  y1  1] f = 0


with x1 = [x1 y1 1]T, x2 = [x2 y2 1]T and f a vector containing the elements of the fundamental matrix. As discussed before, stacking eight or more of these equations allows for a linear solution. Even for seven corresponding points, the one-parameter family of solutions obtained by solving the linear equations can be restricted to one or three solutions by enforcing the cubic rank-2 constraint det(F1 + λF2) = 0. Note also that, as pointed out by Hartley [22], it is important to normalize the image coordinates before solving the linear equations. Otherwise, the columns of (3.29) would differ by several orders of magnitude and the error would be concentrated on the coefficients corresponding to the smaller columns. This normalization can be achieved by transforming the image center to the origin and scaling the images so that the coordinates have a standard deviation of unity.
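As an illustration, the normalized linear (eight-point) method just described might be sketched as follows. This is a minimal NumPy sketch, not the chapter's implementation; the function names and the unit-standard-deviation normalization details are my own choices.

```python
import numpy as np

def normalize_points(pts):
    """Translate the centroid to the origin and scale so that the
    coordinates have roughly unit standard deviation (Hartley)."""
    centroid = pts.mean(axis=0)
    scale = 1.0 / pts.std(axis=0).mean()
    T = np.array([[scale, 0.0, -scale * centroid[0]],
                  [0.0, scale, -scale * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return pts_h @ T.T, T

def fundamental_8point(x1, x2):
    """Linear estimate of F from n >= 8 correspondences: one row of the
    linear system per correspondence, SVD solution, rank-2 enforcement,
    and denormalization."""
    p1, T1 = normalize_points(x1)
    p2, T2 = normalize_points(x2)
    # Each correspondence contributes one row [x1x2 y1x2 x2 x1y2 y1y2 y2 x1 y1 1].
    A = np.column_stack([
        p1[:, 0] * p2[:, 0], p1[:, 1] * p2[:, 0], p2[:, 0],
        p1[:, 0] * p2[:, 1], p1[:, 1] * p2[:, 1], p2[:, 1],
        p1[:, 0], p1[:, 1], np.ones(len(p1))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1          # undo the normalization
    return F / np.linalg.norm(F)
```

With noise-free correspondences this recovers F up to scale; with noisy data it provides the starting point for the nonlinear refinements below.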

Nonlinear algorithms

The result of the linear equations can be refined by minimizing the following criterion [70]:

Equation 3.30

C(F) = Σi ( d(x2i, F x1i)^2 + d(x1i, FT x2i)^2 )


with d(., .) representing the Euclidean distance in the image. This criterion can be minimized through a Levenberg-Marquardt algorithm [50]. An even better approach consists of computing the maximum likelihood estimation (for Gaussian noise) by minimizing the following criterion:

Equation 3.31

C(F, x̂1i, x̂2i) = Σi ( d(x1i, x̂1i)^2 + d(x2i, x̂2i)^2 )  subject to  x̂2iT F x̂1i = 0 for all i


Although in this case the minimization has to be carried out over a much larger set of variables, this can be achieved efficiently by taking advantage of the sparsity of the problem.

Robust algorithm

To compute the fundamental matrix from a set of matches that were automatically obtained from a pair of real images, it is important to explicitly deal with outliers. If the set of matches is contaminated with even a small set of outliers, the result of the above methods can become unusable. This is typical for all types of least-squares approaches (even nonlinear ones). The problem is that the quadratic penalty allows for a single outlier that is very far away from the true solution to completely bias the final result.

An approach that can be used to cope with this problem is the RANSAC algorithm proposed by Fischler and Bolles [17]. A minimal subset of the data, in this case seven point correspondences, is randomly selected from the set of potential correspondences, and the solution obtained from it is used to segment the remainder of the dataset into inliers and outliers. If the initial subset contains no outliers, most of the correct correspondences will support the solution. However, if one or more outliers are contained in the initial subset, it is highly improbable that the computed solution will find much support among the remainder of the potential correspondences, yielding a low "inlier" ratio. This procedure is repeated until a satisfying solution is obtained, typically defined as a probability in excess of 95% that a good subsample was selected. The expression for this probability is Γ = 1 − (1 − ρ^p)^m, with ρ the fraction of inliers, p the number of features in each sample (seven in this case), and m the number of trials (see [51]).
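The expression for Γ can be inverted to estimate how many random samples are needed for a given inlier fraction. A small sketch (the function name is my own):

```python
import math

def ransac_trials(inlier_fraction, sample_size=7, confidence=0.95):
    """Smallest m such that Gamma = 1 - (1 - rho^p)^m exceeds the
    requested confidence, i.e. the number of trials needed before at
    least one outlier-free minimal sample was drawn with high probability."""
    p_good = inlier_fraction ** sample_size   # probability one sample is pure
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))
```

For example, with 50% inliers and seven-point samples, several hundred trials are required, whereas with 90% inliers a handful suffice; this is why the sample size p is kept minimal.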

Two-view geometry computation

The different algorithms described above can be combined to yield a practical algorithm to compute the two-view geometry from real images:

  1. Compute initial set of potential correspondences (and set ρmax = 0, m = 0).

  2. While (1 − ρmax^7)^m > 5%,

    1. Randomly select a minimal sample (seven pairs of corresponding points),

    2. Compute the solution(s) for F (yielding one or three solutions),

    3. Determine percentage of inliers ρ (for all solutions),

    4. Increment m, update ρmax if ρmax < ρ.

  3. Refine F based on all inliers.

  4. Look for additional matches along epipolar lines.

  5. Refine F based on all correct matches (preferably using (3.31)).

3.7.2. Structure and motion recovery

Once the epipolar geometry has been computed between all consecutive views, the next step consists of reconstructing the structure and motion for the whole sequence. Contrary to the factorization approach of Section 3.6.4, here a sequential approach is presented. First, the structure and motion are initialized for two views and then gradually extended toward the whole sequence. Finally, the solution is refined through a global minimization over all the unknown parameters.

Initializing the structure and motion
Initial motion computation

Two images of the sequence are used to determine a reference frame. The world frame is aligned with the first camera.

The second camera is chosen so that the epipolar geometry corresponds to the retrieved fundamental matrix F:

Equation 3.32

P1 = [ I | 0 ]    P2 = [ [e]×F + evT | σe ]


Eq. (3.32) is not completely determined by the epipolar geometry (i.e., F and e), but has 4 more degrees of freedom (i.e., v and σ). The vector v determines the position of the reference plane (i.e., the plane at infinity in an affine or metric frame), and σ determines the global scale of the reconstruction. The location of the reference plane should not make any difference if the algorithm is projectively invariant. To achieve this, it is important to use homogeneous representations for all 3D entities and to only use image measurements for minimizations. The value for the parameter σ has no importance and can be fixed to one.
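Under the parameterization of (3.32), a camera pair consistent with a given F can be written down directly. The sketch below assumes e is the epipole in the second image (the null vector of FT); the helper names are my own.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ u == np.cross(t, u)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def cameras_from_F(F, v=None, sigma=1.0):
    """Canonical projective camera pair consistent with F, following
    (3.32): P1 = [I | 0], P2 = [[e]x F + e v^T | sigma e], with e the
    epipole in the second image (e^T F = 0)."""
    if v is None:
        v = np.zeros(3)    # any v yields the same epipolar geometry
    e = np.linalg.svd(F.T)[2][-1]   # null vector of F^T
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([skew(e) @ F + np.outer(e, v), sigma * e[:, None]])
    return P1, P2
```

Changing v moves the reference plane and changing σ rescales the reconstruction, but the fundamental matrix induced by the pair (P1, P2) stays proportional to F, which is exactly the projective ambiguity discussed above.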

Initial structure computation

Once two projection matrices have been fully determined, the matches can be reconstructed through triangulation. Due to noise, the lines of sight will not intersect perfectly. In the projective case, the minimizations should be carried out in the images and not in projective 3D space. Therefore, the distance between the reprojected 3D point and the image points should be minimized:

Equation 3.33

d(x1, P1X)^2 + d(x2, P2X)^2


It was noted by Hartley and Sturm [23] that the only important choice is to select in which epipolar plane the point is reconstructed. Once this choice is made, it is trivial to select the optimal point from the plane. Since a bundle of epipolar planes has only one parameter, the dimension of the problem is reduced from three to one. Minimizing the following equation is thus equivalent to minimizing equation (3.33).

Equation 3.34

d(x1, l1(λ))^2 + d(x2, l2(λ))^2


with l1(λ) and l2(λ) the epipolar lines obtained as a function of the parameter λ describing the bundle of epipolar planes. It turns out (see [23]) that this expression is a polynomial of degree 6 in λ. The global minimum of (3.34) can thus easily be computed directly. In both images, the points on the epipolar lines l1(λ) and l2(λ) closest to the points x1 and x2, respectively, are selected. Since these points are in epipolar correspondence, their lines of sight intersect exactly in a 3D point. In the case where (3.31) is minimized to obtain the fundamental matrix F, the procedure described here is unnecessary and the pairs (x̂1, x̂2) can be reconstructed directly.
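The degree-6 optimal procedure is beyond a short sketch, but the simpler linear triangulation, which stacks the homogeneous projection equations of both views and takes the null vector, is commonly used as an initialization. A minimal sketch (function name is my own):

```python
import numpy as np

def triangulate_linear(P1, P2, x1, x2):
    """Linear (DLT-style) triangulation: for each view, eliminating the
    scale factor gives two homogeneous equations in X; stack them and
    take the null vector of the resulting 4x4 system."""
    A = np.array([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]
```

Unlike the optimal method, this minimizes an algebraic rather than an image-based error, so in a projective frame it should only be trusted as a starting point for subsequent refinement.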

Updating the structure and motion

The previous section dealt with obtaining an initial reconstruction from two views. This section discusses how to add a view to an existing reconstruction. First, the pose of the camera is determined, then the structure is updated based on the added view, and finally new points are initialized.

Projective pose estimation

For every additional view, the pose with respect to the preexisting reconstruction is determined. This is illustrated in Figure 3.14. It is assumed that the epipolar geometry has been computed between views i−1 and i. The matches that correspond to already reconstructed points are used to infer correspondences between 2D and 3D. Based on these, the projection matrix Pi is computed using a robust procedure similar to the one laid out for computing the fundamental matrix. In this case, a minimal sample of six matches is needed to compute Pi. A point is considered an inlier if there exists a 3D point that projects sufficiently close to all associated image points. This requires refining the initial solution of X based on all observations, including the last. Because this is computationally expensive (remember that this has to be done for each generated hypothesis), it is advised to use a modified version of RANSAC that aborts the verification of unpromising hypotheses [5]. Once Pi has been determined, the projection of already reconstructed points can be predicted so that some additional matches can be obtained. This means that the search space is gradually reduced from the full image to the epipolar line to the predicted projection of the point.

Figure 3.14. Image matches (xi−1, xi) are found as described before. Since some image points xi−1 relate to object points X, the pose for view i can be computed from the inferred matches (X, xi). A point is accepted as an inlier if a solution for X exists for which d(PkX, xk) < 1 for each view k in which X has been observed.


This procedure relates only the image to the previous image. In fact, it is implicitly assumed that once a point gets out of sight, it will not come back. Although this is true for many sequences, this assumption does not always hold. Assume that a specific 3D point got out of sight, but that it becomes visible again in the two most recent views. This type of point could be interesting to avoid error accumulation. However, the naive approach would just reinstantiate a new independent 3D point. A possible solution to this problem was proposed in [35].
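The linear pose computation inside the robust loop can be sketched as follows: each 2D-3D correspondence contributes two homogeneous equations in the twelve entries of Pi, and six or more correspondences determine Pi up to scale. This is a minimal noiseless sketch without the robust wrapper; the function name is my own.

```python
import numpy as np

def pose_dlt(X, x):
    """Linear estimate of the 3x4 projection matrix P from n >= 6
    2D-3D correspondences. Eliminating the scale factor from P X ~ x
    gives two homogeneous equations per correspondence; the stacked
    system is solved by SVD."""
    n = len(X)
    Xh = np.column_stack([X, np.ones(n)])
    rows = []
    for i in range(n):
        # P^1 . X - x * (P^3 . X) = 0  and  P^2 . X - y * (P^3 . X) = 0
        rows.append(np.concatenate([Xh[i], np.zeros(4), -x[i, 0] * Xh[i]]))
        rows.append(np.concatenate([np.zeros(4), Xh[i], -x[i, 1] * Xh[i]]))
    p = np.linalg.svd(np.array(rows))[2][-1]
    return p.reshape(3, 4)
```

In the robust procedure, this estimation is run on minimal six-point samples and the resulting hypotheses are scored by their reprojection-error inlier counts.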

Refining and extending structure

The structure is refined for each point using an iterated linear reconstruction algorithm. The scale factors can be eliminated from (3.25) so that homogeneous equations in X are obtained:

Equation 3.35

(xP^3 − P^1)X = 0
(yP^3 − P^2)X = 0


with P^i the i-th row of P and (x, y) the image coordinates of the point. An estimate of X is computed by solving the system of linear equations obtained from all views where a corresponding image point is available. To obtain a better solution, the criterion Σ d(PX, x)^2 should be minimized. This can be approximately obtained by iteratively solving the following weighted linear least-squares problem:

Equation 3.36

1/(P^3X̃) (xP^3 − P^1)X = 0
1/(P^3X̃) (yP^3 − P^2)X = 0


where X̃ is the previous solution for X. This procedure can be repeated a few times. By solving this system of equations through SVD, a normalized homogeneous point is automatically obtained. If a 3D point is not observed, its position is not updated. In this case, one can check whether the point was seen in a sufficient number of views to be kept in the final reconstruction. This minimum number of views can, for example, be set to three. This avoids retaining a large number of outliers due to spurious matches.
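The iterated reweighted scheme of (3.35)-(3.36) might be sketched per point as follows (a noiseless sketch; the function name is my own):

```python
import numpy as np

def refine_point(Ps, xs, n_iter=3):
    """Iterated linear reconstruction of one point: solve the stacked
    homogeneous equations from all views, then reweight each pair of
    equations by 1 / (P^3 X~), with X~ the previous solution, and re-solve."""
    w = np.ones(len(Ps))
    for _ in range(n_iter):
        rows = []
        for P, x, wi in zip(Ps, xs, w):
            rows.append(wi * (x[0] * P[2] - P[0]))
            rows.append(wi * (x[1] * P[2] - P[1]))
        X = np.linalg.svd(np.array(rows))[2][-1]   # null vector via SVD
        w = np.array([1.0 / (P[2] @ X) for P in Ps])   # weights for next pass
    return X / X[3]
```

The reweighting makes the algebraic residuals approximate image-plane distances, which is why a few iterations typically suffice.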

Of course, in an image sequence some new features will appear in every new image. If point matches are available that were not related to an existing point in the structure, then a new point can be initialized, as described in Section 3.7.2.

Refining structure and motion

Once the structure and motion has been obtained for the whole sequence, it is recommended to refine it through a global minimization step so that a bias toward the initial views is avoided. A maximum likelihood estimation can be obtained through bundle adjustment [67, 57]. The goal is to find the parameters of the camera view Pk and the 3D points Xi for which the sum of squared distances between the observed image points mki and the reprojected image points Pk(Xi) is minimized. It is advised to extend the camera projection model to also take radial distortion into account. For m views and n points, the following criterion should be minimized:

Equation 3.37

min over Pk, Xi of Σk=1..m Σi=1..n d(mki, Pk(Xi))^2


If the errors on the localization of image features are independent and satisfy a zero-mean Gaussian distribution, then it can be shown that bundle adjustment corresponds to a maximum likelihood estimator. This minimization problem is huge; for example, for a sequence of 20 views and 100 points/view, a minimization problem in more than 6000 variables has to be solved (most of them related to the structure). A straightforward computation is obviously not feasible. However, the special structure of the problem can be exploited to solve the problem much more efficiently [67, 57]. The key reason for this is that a specific residual is only dependent on one point and one camera, which results in a very sparse structure for the normal equations.
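The sparsity just mentioned is easy to make concrete: the residual block of observation (k, i) depends only on the parameters of camera k (11 for a projective camera) and of point i (3). A sketch that builds the Boolean Jacobian sparsity pattern, as used to exploit this structure in sparse solvers (the function name and parameter counts are my own choices):

```python
import numpy as np

def jac_sparsity(visibility, cam_params=11, pt_params=3):
    """Boolean sparsity pattern of the bundle-adjustment Jacobian.
    visibility[k, i] is True if point i is observed in view k; each
    observation contributes a two-row (x, y) residual block that touches
    only camera k's and point i's columns."""
    m, n = visibility.shape
    rows = 2 * int(visibility.sum())
    cols = m * cam_params + n * pt_params
    S = np.zeros((rows, cols), dtype=bool)
    r = 0
    for k in range(m):
        for i in range(n):
            if visibility[k, i]:
                S[r:r + 2, k * cam_params:(k + 1) * cam_params] = True
                pc = m * cam_params + i * pt_params
                S[r:r + 2, pc:pc + pt_params] = True
                r += 2
    return S
```

Each residual row has only cam_params + pt_params nonzero columns out of thousands, which is precisely what makes the normal equations cheap to solve via the Schur complement on the structure parameters.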

Structure and motion recovery algorithm

To conclude this section, an overview of the structure and motion recovery algorithm is given. The whole procedure consists of the following steps:

1. Match or track points over the whole image sequence (see Section 3.5.2).

2. Initialize the structure and motion recovery:

  1. Select two views that are suited for initialization,

  2. Set up the initial frame,

  3. Reconstruct the initial structure.

3. For every additional view,

  1. Infer matches to the existing 3D structure,

  2. Compute the camera pose using a robust algorithm,

  3. Refine the existing structure,

  4. Initialize new structure points.

4. Refine the structure and motion through bundle adjustment.

The results of this algorithm are the camera poses for all the views and the reconstruction of the interest points. For most applications, such as MatchMoving (aligning a virtual camera with the motion of a real camera; see Section 3.10.3), the camera poses are the most useful.
