The previous chapters have been mainly concerned with analysis of labelled landmarks, where there is clear, meaningful correspondence between the landmarks. Matching configurations of points where the correspondence is unknown is an important but challenging problem in many application areas, including in bioinformatics and computer vision.
In this chapter we shall consider Bayesian approaches that have been developed for matching unlabelled point sets. The matching problem, where the sets of points may be of different sizes, is relevant for the comparison of molecules and the comparison of objects from different views in computer vision. For example, if we have two protein surfaces represented by sets of amino acid locations, a question of interest is whether the two surfaces have a region of the same size-and-shape. This region may correspond to a binding site that the proteins have in common; for example they may both bind to the same protein molecule.
Some initial inferential methods to compare unlabelled shape include the MCMC simulation methodology developed by Green and Mardia (2006), Dryden et al. (2007) and Schmidler (2007), which themselves have connections with work stemming from Moss and Hancock (1996), Rangarajan et al. (1997), Chui and Rangarajan (2000, 2003) and Taylor et al. (2003) among others.
Consider the dataset which was described in Section 1.4.9. The active site of protein 1 contains 40 amino acids and the active site of protein 2 contains 63 amino acids. Green and Mardia (2006) developed a Bayesian model and carried out inference using an MCMC algorithm for inference about these proteins, and in particular which amino acids correspond, or match, between the two proteins.
Green and Mardia (2006) introduced a model for unlabelled size-and-shape matching which has proved successful in a number of applications. The model involves a hidden Poisson process of rate λ in a region with volume v. Two configurations X1 and X2 are generated as Gaussian perturbations with means given by some of the points from the hidden Poisson process, then rigid-body transformations are applied (location and rotation) to the configurations. The points in the hidden Poisson process each give rise to one of four cases: (i) points in both X1 and X2; (ii) a point in X1 only; (iii) a point in X2 only; or (iv) no points in X1 and X2. There is a one to one correspondence between the first type of point labels in X1, X2 via the hidden Poisson process. Green and Mardia (2006) then integrate out the underlying Poisson process to give a likelihood which depends on the rigid-body transformation between the configurations, a match matrix Λ = (λij) giving the one to one correspondence between the points in X1 and X2, and the parameter ρ/λ, where ρ is a parameter reflecting the tendency of points to be matched.
We consider an alternative model which is very similar to that of Green and Mardia (2006), leading to a similar solution; where we assume that μ is a fixed N × m configuration and X is a K × m configuration that we apply rigid-body transformations to. In order to specify the labelling or correspondence between the points we use the match matrix Λ, which is a K × N matrix of 1s and 0s to represent a particular matching of the points in X with the points in μ. For 1 ≤ j ≤ N, if λij = 1 then the ith point of X matches to the jth point in μ. If ∑jλij = 0 then the ith point of X does not match to any point in μ, likewise if ∑iλij = 0 then the jth point of μ does not match to any point in X. If we include the constraint that each row and column can contain at most a single 1, then only one to one matches are allowed, and this is the case for Green and Mardia (2006)’s model.
The rotation matrix Γ SO(m) and the translation parameter γ are also parameters in the model. The matched points in X, denoted as XΛ, are taken as Gaussian perturbations of the matching points in μ with variance 1/τ, and we assume that the rows of unmatched points in X, denoted by X− Λ, are distributed uniformly over a bounded region of volume , and the volume is a hyper-parameter in the model. We shall denote the points that are matched in μ as .
We shall concentrate on the m = 3 dimensional case here. Given a K × N match matrix, Λ (with p matching points), rotation matrix Γ and translation vector γ the likelihood is therefore defined as:
where , Γ = Rz(θ12)Ry(θ13)Rx(θ23), and the rotation matrices about the x, y, z axes are:
with Euler angles θ12 [ − π, π), θ13 [ − π/2, π/2], θ23 [ − π, π). The uniform measure is given by:
in this case (e.g. see Khatri and Mardia 1977). Further discussion of rotation matrices and a different representation were given in Section 3.2.2.
Note that Green and Mardia (2006)’s model is constructed with X1 and X2 as Gaussian perturbations from an underlying Poisson process. However, the likelihood is actually of the same form as the one-sided version described above (Kenobi and Dryden, 2012), where X = X1 is perturbed from μ = X2, although the variance parameter is doubled and .
The parameters of the model are τ, Λ, Γ, γ, We take τ, Λ, Γ, γ to be mutually independent a priori. We use the prior distribution τ ∼ Γ(α0, β0). For the prior distribution of Λ, we assume the rows are independently distributed with the ith row having distribution
for 1 ≤ i ≤ K and 0 ≤ ψ ≤ 1, where ψ is a hyperparameter to be chosen. If we insist on one to one matching then the sum of each row or column is at most 1.
We take the prior for γ as γ ~ N3(μγ, σ2γI3), and Green and Mardia (2006) use a matrix Fisher prior for Γ, which is appealing due to conjugacy. The posterior density of (Λ, τ, Γ, γ) conditioned on X is:
As the posterior has a complicated form, MCMC simulation is used to simulate approximately from the posterior distribution.
The full conditional distribution of τ is given by:
where p is the number of matched points in X, that is the number of non-zero rows of Λ. Hence a Gibbs update can be used for τ. The full conditional distribution of γ is given by:
and so we use a Gibbs update for γ. We update the match matrix Λ using the acceptance probability (see Section 14.2.2)
Green and Mardia (2006) provide a Gibbs step for two of the rotation angles (due to a conjugate prior) and use a random walk Metropolis step for the other angle. Green and Mardia (2006) ensure that the matching is one to one between the points. Additional prior assumptions can be included, for example to match similar amino acids in terms of the four groups described in Section 1.4.9.
After running the MCMC algorithm for a large number of iterations the final estimate of the match matrix can be obtained using an appropriate loss function, as discussed by Green and Mardia (2006). In particular let ℓ01 be the loss of declaring a match when there is not one for a pair of points, and ℓ10 is the loss of not declaring a match when there is one. The optimal match is obtained by maximizing the sum of estimated marginal posterior probabilities minus K = ℓ01/(ℓ01 + ℓ10) times the number of matched points.
The Green–Mardia model has been demonstrated to work well in a variety of situations (Mardia et al. 2007). Further examples include Green et al. (2010) for Bayesian modelling for matching and alignment of biomolecules; Ruffieux and Green (2009) alignment of multiple configurations; and additional applications are given in Mardia et al. (2011, 2013b) and Mardia (2013b) including the use of a gap prior to include positional information along the protein backbone.
Example 14.1 The active sites are given in Figure 14.1 for each of the two proteins, with 40 in 1cyd
(red and yellow) and 63 active sites in 1a27
(blue and green). Following application of the Green–Mardia MCMC algorithm with 1 000 000 MCMC iterations the top 35 aligned active sites (in terms of highest probability of match) are displayed in Figure 14.1, and coloured red and blue. The non-matched amino acids are yellow and green. Clearly the common binding region has a very similar size-and-shape for these 35 matches, see Green and Mardia (2006) who have 36 top matches with K = 0.5 whereas we chose K = 0.75 leading to 35 matches.
A Procrustes size-and-shape model was developed by Dryden et al. (2007) and Schmidler (2007) for unlabelled size-and-shape matching. The model is similar to the Green–Mardia model, except that rotation and translation are estimated using Procrustes registration at each step of the algorithm. A MCMC algorithm is used to draw inferences about the match matrix and a concentration parameter. For the Procrustes model given a particular match matrix Λ, we use partial Procrustes registration to register the p points in XΛ to the corresponding matched points in μΛ, in order to define a distance between the size-and-shapes. This aspect of the matching is present in both the Dryden et al. (2007) and Schmidler (2007) approaches. The Procrustes matching involves finding and such that
where dS(XΛ, μΛ) is the Riemannian metric in size-and-shape space (5.5). The Procrustes estimators of rotation and translation were given in Section 7.2.3.
Let Then is the partial Procrustes fit of XΛ onto μΛ. The partial Procrustes tangent coordinates of XΛ at μΛ are given by the p × m matrix
which is in a pm − m(m − 1)/2 − m dimensional linear subspace of .
We denote the unmatched points in X by X− Λ. We transform these points using the same transformation parameters as for XΛ. Let . We consider X− Λ to lie in . Given the match matrix, Λ, the size-and-shape of XΛ lies in SΣpm and X− Λ lies in .
We assume a zero mean isotropic Gaussian model for VΛ in q = pm − m(m − 1)/2 − m dimensions. [There are m(m − 1)/2 + m linear constraints on VΛ due to the Procrustes registration.] We assume that X− Λ, the non-matching part, is uniformly distributed in a bounded region, , with volume in , and this volume is a hyper-parameter in the model.
The likelihood of X given Λ and τ = 1/σ2, a precision parameter where σ2 is a measure of the variability at each point, is:
This likelihood is given by Dryden et al. (2007) and is essentially that of Schmidler (2007) (with q = mp in the latter).
We write π(τ) and π(Λ) for the prior distributions of τ and Λ and assume τ and Λ are independent a priori.
Schmidler (2007) introduced a gap prior to include positional information along the protein backbone, so that matches are encouraged in contiguous regions along the backbone with small numbers of gaps. Mardia et al. (2013b) also use a gap prior in a model for unlabelled shape matching.
The posterior density of τ and Λ conditional on X is:
Again to carry out inference one uses MCMC.
The full conditional distribution of τ is available from the conjugacy of the Gamma distribution,
so we update τ with a Gibbs step. We make updates to the match matrix using a Metropolis–Hastings step. We select a row at random and if the selected point is already matched then it becomes unmatched with probability preject, or it is matched to another point i with probability (1 − preject)/(N − 1). If the selected point is unmatched then it becomes matched to point i with probability 1/N. We accept the new proposal, Λ*, with probability
where
If preject = 1/N then q/q* = 1, which was the value used by Dryden et al. (2007).
Dryden et al. (2007) also describe a computationally faster approximate Metropolis–Hastings update to the match matrix which does not require the use of the whole configuration in the calculation of the density. If we propose the change (i → l1) to (i → l2) then the alternative Hastings ratio, α*Λ is given by:
where
When a new match is accepted the ordinary partial Procrustes registration is carried out on the new matching points to ensure the configuration of matching points has rotation removed. For brevity we shall refer to the size-and-shape model as the ‘Procrustes model’.
Kenobi and Dryden (2012) describe various approximations to the algorithm to help prevent the MCMC routine becoming stuck at local modes, including involving some large jump proposals. Other work where large jumps have been used to help address multimodality includes that by Tjelmeland and Hegstad (2001) and Tjelmeland and Eidsvik (2004). Their scenarios are simpler where they can make use of local optimization to construct reversible jumps between modes. Kenobi and Dryden (2012) do not carry out local optimization involving gradients, but rather use the nearness of points in physical space for constructing a large jump into a different mode in this large discrete combinatorial search space of matchings. It is not clear how to make such moves reversible, and so the moves are just carried out during the initialization phase.
Kenobi and Dryden (2012) compare their implementation of the Green–Mardia model with the Procrustes model, and performance is quite similar in practice. The Green–Mardia method does appear to mix better, as the Procrustes method can become stuck at local modes. However, good matches had a higher probability of a match with the Procrustes method.
The main difference between the inference approaches is that for the Green–Mardia model one marginalizes over the registration parameters, that is they are integrated out, whereas in the Procrustes model one optimizes over the registration parameters. The two methods are often quite similar in performance. In practice it is often reasonable that a Laplace approximation holds, hence explaining the similarity of the two methods (Kenobi and Dryden 2012).
The choice as to whether to marginalize or optimize is also present in Chapter 16 in the analysis of curve shapes. The quotient space model of Section 16.4.2 involves optimization whereas the Bayesian ambient space model of Section 16.4.3 involves marginalization.
Example 14.2 Dryden et al. (2007) apply the Procrustes model to the steroid data of Section 1.4.4. In particular, they consider all pairwise matchings between steroids using the MCMC algorithm above, and then use the posterior modes to provide distances between each steroid molecule. Cluster analysis is then carried out, and it is demonstrated that steroids with similar binding affinities to the CBG receptor protein also have more similar size-and-shapes.
Early methods for unlabelled matching include those by Moss and Hancock (1996) who use graph-based methods; Rangarajan et al. (1997) who developed a soft matching method (softassign) which used a very efficient numerical algorithm for optimization; Walker (1999) for matching electrophoresis gels; Chui and Rangarajan (2000, 2003); and Taylor et al. (2003) who developed an EM algorithm for estimating probabilities of atom matches in bioinformatics. Kent et al. (2010) provide an EM interpretation of the softassign algorithm of Rangarajan et al. (1997).
The Procrustes model is extended by Dryden et al. (2007) to deal with multiple observations. If there are n observations then the parameters include n matching matrices, as well as an overall unknown mean μ. The parameters are estimated again by MCMC, and the overall mean μ can be updated using an efficient Gibbs procedure. Ruffieux and Green (2009) extended the Green–Mardia model to deal with the alignment of multiple configurations. Mardia et al. (2011) further extended the model for estimating pharmacophores in bioinformatics, where it is of interest to find a common structure among a collection of proteins.
The above methods are all for size-and-shape comparisons, as in bioinformatics one usually wants to retain the scale. Mardia et al. (2013b) consider shape analysis, where isotropic scaling is also dealt with, using an adaptation of the Green–Mardia model.
An alternative but related method for unlabelled matching was provided by Czogiel et al. (2011) where the molecules were regarded as continuous fields of some quantity (e.g. partial charge or atomic radius), and observations from the field are available at a finite set of atoms. A Hilbert space representation is used to calculate a distance between the fields, and again a Bayesian model is developed for inference. The advantage with Czogiel et al. (2011)’s procedure is that there is no need to estimate the combinatorially expensive matching matrix. The methodology was applied to the steroids data of Section 1.4.4 and the procedure provided good clustering of the molecules in terms of size-and-shape being associated with binding affinity to the CBG receptor protein. The MCMC methodology has also been extended to outline matching with occlusions with applications in computer vision (Cao et al. 2011). Yu et al. (2014) also consider hierarchical Bayesian modelling for multigroup shape analysis where correspondences need to be inferred, through a joint distribution of shape variables and object boundary data. Posterior inference is via MCMC simulation.
An alternative method for matching unlabelled objects is to use Minimum Description Length (MDL), which is a method for choosing between models using the shortest description of a training dataset. There are strong connections to model choice methods, including the Bayesian Information Criterion (BIC) (Schwarz 1978). Davies et al. (2002) introduce an MDL approach to statistical shape modelling, and many applications including to image and surface matching are given by Davies et al. (2008b). Generalized matching of a group of surfaces is discussed by Davies et al. (2008a) and Twining et al. (2011).
Many of the applications of unlabelled shape analysis have been in bioinformatics, which motivated the methodology of Green and Mardia (2006); Dryden et al. (2007) and Schmidler (2007). Some other examples where shapes of protein structures are compared can be found in Liu et al. (2011) and Rodriguez and Schmidler (2014), who use a gap prior. Schmidler et al. (2002) provided an early method of Bayesian protein structure prediction; Schmidler et al. (2007) considered models of helices in bioinformatics; and Panaretos and Konis (2011) develop sparse approximations of protein structure from noisy random projections from single molecule experiments. For further relevant work, a collection of papers on the topic of Bayesian structural bioinformatics is found in Hamelryck et al. (2012).
In the preceding sections it makes sense to estimate a correspondence between points, and so we assume that there is an underlying unknown labelling. However, in other applications we want to take into account all possible labellings – there is no sensible correspondence that can be estimated. One particular area of development has been that of unlabelled triangles, which work with a subset of the shape space. For example, for the triangle case where reflections and labelling are not important we work with a half-lune which is one-twelfth of the shape space (see Section 2.6.2). Procrustes methods can be adapted to the unlabelled case, and the minimizations are carried out over the similarity transformations and permutations of the landmark labels.
Kendall and Kendall (1980), Kendall (1984), Small (1988, 1996, p. 158) and Stoyan et al. (1995, Chapter 8) describe the analysis of unlabelled shapes of triplets formed from the megalithic standing stones of Cornwall (see Section 1.4.18). The expected number and variance of nearly collinear triangles are obtained and used to test whether the stones could have been deliberately placed in straight lines or whether the stones were placed at random. The set of collinear shapes is denoted by Coll (Kendall 1984). Le (1991a, 1992) obtains Procrustes distances of configurations to the collinearity set Coll, and further applications to testing for collinearities are given by Kendall (1991a) and Le (1992). With the motivation of understanding human vision, Bhavnagri (1995a) also considers distances to Coll and to the set of regular polygons Sph, which was also defined by Kendall (1984). Le and Bhavnagri (1997) also derive detailed results based on distances to Coll.
We can also consider shape distributions for unlabelled i.i.d. points in a region. Kendall (1985), Kendall and Le (1987) and Le (1987) consider shape distributions for triangles generated from i.i.d. triplets in a convex polygon K. A particularly simple result is that if three points are randomly placed inside a square, then the resulting shape density of Kendall’s shape variables (uK, vK), with respect to the uniform shape measure dγ, is given by (Kendall and Le 1987):
If the convex polygon is a rectangle of sides s × 1, then the density is:
Another important result for three i.i.d. points in a general convex region is that the shape density is constant for values of vK = 0 (Small 1982), as is simply seen in the rectangle case where and f2(uK, 0) = (s + s− 1)/8.
Further work has been carried out on probabilistic issues for unlabelled shapes. For example, Watson (1986) and Mannion (1988, 1990a,b) have considered Markovian sequences of shapes of triangles and Gates (1994) has considered the shapes of triangles formed by random lines in a plane. D.G. Kendall (1977), W.S. Kendall (1990a, 1998) and Le (1991b, 1994) have considered the diffusion of shape, and relationships to the offset normal shape distributions of Section 11.1 have been made. In particular, consider k ≥ 3 points (Xj, Yj)T, j = 1, …, k, that are in Brownian motion having started at time t = 0 at (μj, νj)T, j = 1, …, k. The distribution of the points at time t = σ2 is the isotropic model of Equation (11.10), and so the diffused shape at this time has the offset normal shape distribution of Section 11.1.2.
Another application involves the shapes of Delaunay triangles in Dirichlet tessellations. Miles (1970) gave the basic result that the shapes of typical Delaunay triangles from a Poisson process are independent of the circumradius, and the shape distribution is particularly simple. From Kendall (1983), when shape is measured in terms of two of the internal angles of the triangle α and β, the density with respect to the uniform shape measure dγ is:
This distribution is known as the Miles distribution and the mode is the equilateral shape. See Miles (1970) and Kendall (1983) for further discussion.
Example 14.3 Mardia et al. (1977) and Mardia (1989b) have examined whether Delaunay triangles from the Iowa data of Section 1.4.17 could be more equilateral than expected by chance. In Figure 14.2 we see a plot of a spherical blackboard corresponding to AB ≥ AC ≥ BC. Points have been plotted on the bell according to the shapes of triangles from the Delaunay triangles of the Iowa data in Section 1.4.17. There is a tendency for shapes to concentrate towards the top of the bell (i.e. to be more equilateral), although there was no evidence for central place theory in the test of Mardia (1989b).
Dryden et al. (1997) also consider a statistic based on average squared partial Procrustes distance to the equilateral triangle, where the fact that neighbouring triangles are correlated has been taken into account. The application that Dryden et al. (1995, 1997), Taylor et al. (1995) and Faghihi (1996) consider is the examination of regularity (‘equilateralness’) in an image of the cross-section of muscle fibres. Also, Dryden and Zempléni (2006) investigate extreme unlabelled triangle shapes in muscles, using the generalized extreme value and generalized Pareto distributions. Here ‘extreme’ relates to being a long way in shape distance from the equilateral shape. Differences between diseased and healthy muscles in triangle shapes were observed with these methods.
3.144.98.153