Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 13 Fundamentals of Protein Structure Alignment

MARK BRANDT, ALLEN HOLDER, and YOSI SHIBBERU

13.1 Introduction

The central dogma of molecular biology asserts a one-way transfer of information from a cell's genetic code to the expression of proteins. Proteins are the functional workhorses of a cell, and studying these molecules is at the foundation of much of computational biology. Our goal here is to present a succinct introduction to the biological, mathematical, and computational aspects of making pairwise comparisons between protein structures. The presentation is intended to be useful for those who are entering this research area. The chapter begins with a brief introduction to the biology of protein comparison, which is followed by a brief taxonomy of the different mathematical frameworks for protein structure alignment. We conclude with a couple of more recent pairwise comparison techniques that are at the forefront of efficiency and accuracy. Such methods are becoming important as structural databases grow.

13.2 Biological Motivation of Protein Structure Alignment

Proteins are crucially important molecules that are responsible for a large variety of biological functions required for life to exist. The DNA sequence of the genome provides a one-dimensional descriptive code; proteins are self-organizing systems that allow expansion of this one-dimensional code into complex three-dimensional structures possessing a great diversity of functions. Understanding the types of protein structures that are possible is an important part of understanding the existing biological systems, of understanding the aberrant processes that result in genetic disorders, and in the engineering of proteins with novel functions.

Proteins are synthesized as linear polymers of amino acids (Fig. 13.1); the vast majority of proteins consist of a set of 20 different types of amino acids, in defined sequences that, depending on the protein, vary from about 50 to more than 28,000 amino acid residues. Although there are exceptions, in general, the specific sequence of amino acids is specified by the genome; this linear sequence determines the 3D structure of the protein.

Figure 13.1 The structure of one type of amino acid (lysine) is shown on the left, with the sidechain and backbone atoms indicated. Proteins consist of amino acid residues, where the backbone atoms are linked together to form the chain, and the sidechains determine the folded structure and much of the function of the protein. In the partial protein shown on the right, the sidechains are denoted as R, and the three types of dihedral angles (the specific angle formed by the atoms bonded at each position along the backbone) are shown. The $c13-math-0001$ and $c13-math-0002$ angles may vary, within geometric limits imposed by the surrounding atoms; the $c13-math-0003$ angle is fixed, with the six atoms forming the planar structure shown.

The 3D structure of a protein is an emergent property of the linear sequence. Predicting the 3D structure entirely on the basis of the linear sequence has proved to be challenging because the defined structure exhibited by most proteins is a consequence of a large number of relatively weak interactions. The existence of a defined structure is possible because of geometric constraints imposed by the backbone atoms, and because of geometry-dependent hydrogen bonding, electrostatic, and nonpolar interactions between the atoms of the backbone and sidechains.

Prior to the experimental determination of the first protein structures, Linus Pauling predicted the formation of regular repeating structures (especially $c13-math-0004$ helices and $c13-math-0005$ sheets), based on a theoretical understanding of the geometric constraints inherent in the backbone structure. The increasing number of experimentally determined protein structures has confirmed that most proteins consist of arrangements of $c13-math-0006$ helices and $c13-math-0007$ sheets, along with regions of less well-defined structural elements. Because proteins are such large molecules, and because the secondary structural elements are important parts of the overall structure, most proteins are represented in ways that emphasize the arrangement of secondary structural elements (see Fig. 13.2).

Figure 13.2 An $c13-math-0008$ -carbon trace emphasizing the secondary structure of one monomer of the enzyme triose phosphate isomerase (from PDB ID 2YPI). This protein folds into an $c13-math-0009$ barrel structure, with a multistrand $c13-math-0010$ sheet (the thick arrows near the center of the structure) surrounded by $c13-math-0011$ -helical elements. A number of other proteins exhibit this $c13-math-0012$ barrel structure, despite considerable difference in both their sequence and function.

Analysis of protein structures has revealed that many proteins consist of one or more separate domains, which are regions within the protein that fold independently of the remainder of the protein. Protein domains are currently considered to be units of evolution; one major constraint on tolerated mutations is the result of the requirement to maintain the folded structure of the domain.

Comparison of different proteins is crucial to an understanding of the relationships between protein amino acid sequences, protein structure, and protein function. Protein sequence information can be obtained from the genome sequencing projects, and protein sequences can be readily compared. However, many proteins are known to have limited sequence similarity and yet have structures that appear visually similar. This raises the question of how to compare complex 3D structures both quantitatively and in ways that will allow a better understanding of the relationships between their structure and function.

Many proteins exhibit generally similar structures and similar functions. For example, a considerable number of serine protease enzymes have been discovered from species as widely divergent as mammals and bacteria. Although the two proteins shown in Figure 13.3 have very different amino acid sequences, portions of the structure match rather closely. However, in analyzing the structures, it is less clear how important the structural differences are in the subtle differences in function between these proteins. In addition, it is less clear how to best represent the structural differences in a manner that is both consistent and informative.

Figure 13.3 An overlay of two proteins with similar function: protease A from the bacterium Streptomyces griseus (PDB ID 3SGA) and Bos taurus chymotrypsin (PDB ID 1YPH). The two proteins share limited sequence identity ( $c13-math-0013$ 20%), but similar catalytic mechanisms, and considerable structural similarity, especially in the core of the protein. In contrast, another protein, subtilisin from Bacillus amyloliquefaciens, also exhibits a similar catalytic mechanism but has a very different structure.

Another example of the importance of structure is provided by the prion proteins (see Fig. 13.4). Prions are monomeric proteins normally found on the surface of a variety of cells; however, these proteins are capable of undergoing an incompletely understood conformational change that results in oligomerization, with lethal effects to the affected individual. The spongiform encephalopathy diseases are one of a significant number of diseases caused by protein misfolding. An improved understanding of protein structure and protein folding processes might allow intervention in disease processes that are currently untreatable. In addition, many genetic disorders result from altered protein structure and function. While it is apparent that the changes in one of a small number of residues within a large protein cause disease, only a better understanding of the elaboration of sequence information into an overall structure will allow insight into possible approaches for treatment.

Figure 13.4 The same protein in two different conformations: a fragment from the human prion protein. The 1I4M structure is part of a dimer, which may represent a stage in the structural transformation from the largely helical protein shown here to the toxic $c13-math-0014$ -sheet conformation thought to cause the lethal spongiform encephalopathy diseases such as Creutzfeldt–Jakob disease and kuru.

A final purpose of studying protein structure is to allow the design of novel proteins. Enzymes are phenomenal catalysts, which generally exhibit both high reaction rates and high levels of specificity. While biological enzymes catalyze a large range of reactions, no enzymes exist to catalyze many industrially useful reactions. The ability to design new enzyme mechanisms is extremely attractive as a method for carrying out reactions at higher rates, with less expense from heating costs and waste product formation. Current methods for protein design are inefficient and essentially entirely empirical and are largely limited to minor alterations to existing proteins.

We have an increasing database of protein structures; however, we still lack a full understanding of how protein sequence, structure, and function are related. Comparing the structures of existing proteins of different sequences provides important data that will likely lead to an improved understanding of the mechanisms by which existing protein sequences give rise to their corresponding functional protein structures.

13.3 Mathematical Frameworks

The two main mathematical frameworks studied in the protein alignment literature are the contact map overlap (CMO) problem and the largest common point set (LCP) problem under bottleneck distance constraint [17]. Figure 13.5 illustrates a 2D version of each framework.

Figure 13.5 A 2D depiction of protein alignment. Chain A collapses into fold A, creating long-range contact depicted by arcs on chain A. Likewise, chain B collapses into fold B. Then a rigid-body transformation, in this case a vertical translation, superimposes folds A and B to create a fold alignment. In the CMO problem, the arcs on chains A and B are aligned directly (solid arcs) without reference to the superimposed folds. (The consecutive chain contacts are not considered.) In the LCP problem, the superimposed folds determine the chain alignment.

Let $c13-math-0015$ and $c13-math-0016$ be the sets of C $c13-math-0017$ atom coordinates of two proteins, proteins A and B, that we wish to align. (These are the carbon atoms in Fig. 13.1 that bind to the sidechains.) For a given distance cutoff value $c13-math-0018$ , define $c13-math-0019$ and $c13-math-0020$ to be the sets of edges in the contact graphs of proteins A and B, respectively. (Edges are represented by arcs in Fig. 13.5.) Define $c13-math-0021$ to be a bijection (one-to-one, onto map) from a subset $c13-math-0022$ to a subset $c13-math-0023$ . Define $c13-math-0024$ to be a rigid-body transformation of the fold of protein B. (In Figure 13.5, $c13-math-0025$ is simply a translation of fold B onto fold A.) Then solve the two problems as follows:

CMO Problem. Determine the bijection $c13-math-0026$ that maximizes the size of the matched subsets of edges $c13-math-0027$ and $c13-math-0028$ , where an edge $c13-math-0029$ is considered a match if $c13-math-0030$ is an edge in $c13-math-0031$ .

LCP Problem. For all rigid-body transformations $c13-math-0032$ , determine the largest subset $c13-math-0033$ for which there exists a bijection $c13-math-0034$ such that $c13-math-0035$ .

Goldman et al. [9] proved that the CMO problem is NP-hard. Caprara et al. [6] apply linear integer programming methods to obtain exact solutions to the CMO problem for proteins that are similar to one another. Their methods have been improved significantly by Andonov et al. [1] (see Section 13.4). By imposing a proximity requirement for aligned residues and employing packing constraints satisfied by actual protein folds, Xu et al. [25] and Li and Ng [17] have developed polynomial time approximation schemes for the CMO problem. The CMO problem is discussed further in Section 13.4.

The LCP problem appears to be easier to solve than the CMO problem. The LCP problem is more geometric in character, whereas the CMO problem is graph-theoretic in nature. A possible disadvantage of the LCP problem, however, is that it treats proteins as rigid objects. In reality proteins are quite flexible. Hasegawa and Holm [10] claim that alignment methods that allow for flexibility give the most biologically meaningful results.

The main ideas used to develop a polynomial time algorithm for the LCP problem are due to Kolodny and Linial [13] and Poleksic [19, 20]. We describe next a polynomial time algorithm for solving the LCP problem. The basic ideas were developed by Kolodny and Linial [13] and extended by Poleksic [20]. Although their analysies [13, 20] pertain to the 3D alignment problem, here, for simplicity of exposition, we describe the algorithm for the 2D alignment problem.

In two dimensions, we can parametrize rigid-body transformations by $c13-math-0036$ as

where $c13-math-0038$ is the coordinates of a C $c13-math-0039$ atom in protein fold B. For a prescribed distance cutoff value, $c13-math-0040$ , and for a given rigid-body transformation $c13-math-0041$ , define $c13-math-0042$ to be the largest subset of $c13-math-0043$ for which there exists a bijection $c13-math-0044$ such that

The set $c13-math-0046$ and the bijection $c13-math-0047$ can easily be computed in $c13-math-0048$ time by applying dynamic programming, see Section 13.4 for the score matrix $c13-math-0049$ , where

equation

The solution to the LCP problem is then given by $c13-math-0051$ .

A key observation by Kolodny and Linial [13] is that only a finite set of rigid-body transformations needs to be considered in order to optimize commonly used alignment scoring functions. To demonstrate this, first observe that only the compact subset, $c13-math-0052$ of rigid-body transformations needs to be considered because if protein $c13-math-0053$ is translated by $c13-math-0054$ or $c13-math-0055$ , where $c13-math-0056$ is sufficiently large, $c13-math-0057$ will be the empty set since proteins $c13-math-0058$ and $c13-math-0059$ will have no points in common. Since $c13-math-0060$ is continuous and $c13-math-0061$ is compact, $c13-math-0062$ is uniformly continuous on $c13-math-0063$ . Uniform continuity implies that there exists a $c13-math-0064$ such that for any $c13-math-0065$ and $c13-math-0066$ satisfying $c13-math-0067$ we have that $c13-math-0068$ for all $c13-math-0069$ . Now, consider the open balls $c13-math-0070$ where $c13-math-0071$ . The open balls $c13-math-0072$ cover $c13-math-0073$ , that is, $c13-math-0074$ . Since $c13-math-0075$ is compact, a finite subset of these open balls also covers $c13-math-0076$ ; that is, there exist $c13-math-0077$ such that $c13-math-0078$ . Thus, all the rigid-body transformations in the compact set $c13-math-0079$ can be approximated to within a distance of $c13-math-0080$ by the finite set of rigid-body transformations $c13-math-0081$ .

The alignment scoring functions considered by Kolodny and Linial [13] must satisfy a Lipschitz condition. Their algorithm only computes an $c13-math-0082$ approximation of the optimal solution. Poleksic [20] extended Kolodny and Linial's approach to the LCP problem. Moreover, Poleksic's extension computes the exact solution to the LCP problem in expected polynomial time $c13-math-0083$ for globular proteins of size $c13-math-0084$ [19, 20]. We describe Poleksic's algorithm next.

Let $c13-math-0085$ . Also, for $c13-math-0086$ , define the function $c13-math-0087$ , where $c13-math-0088$ is the number of elements in $c13-math-0089$ . Since $c13-math-0090$ , there exists an $c13-math-0091$ such that $c13-math-0092$ , which implies

From this inequality, we can conclude that

because if $c13-math-0095$ and $c13-math-0096$ are in contact for cutoff $c13-math-0097$ , then $c13-math-0098$ and $c13-math-0099$ remain in contact for cutoff $c13-math-0100$ . We also have that

Combining these inequalities, we have

The LCP problem can be solved exactly in a finite number of steps if we can determine an $c13-math-0103$ for which

since this would imply that $c13-math-0105$ . Poleksic [20] showed that such an $c13-math-0106$ exists for all except a finite set of cutoff values $c13-math-0107$ as follows. The function $c13-math-0108$ is a nondecreasing function of $c13-math-0109$ having integer values in the range from $c13-math-0110$ to $c13-math-0111$ . The function $c13-math-0112$ is therefore piecewise constant except for possible jump discontinuities at cutoff values $c13-math-0113$ . If we avoid this finite set of cutoff values, then we can choose an $c13-math-0114$ such that $c13-math-0115$ . The size of the finite set of rigid-body transformations $c13-math-0116$ that we need to search to determine $c13-math-0117$ is determined by how small $c13-math-0118$ is. The size of $c13-math-0119$ is determined by how close $c13-math-0120$ is to one of the jump discontinuity points $c13-math-0121$ , $c13-math-0122$ . These jump discontinuity points are not known in advance. However, Poleksic [20] provides a proof that the overall expected complexity of the algorithm is polynomial in time.

13.3.1 Spectral Methods

Spectral methods have emerged as a new mathematical framework for the protein structure alignment problem. An advantage of spectral methods over CMO methods pointed out by Bhattacharya et al. [4] is that comparisons with spectral methods are based on two residues (one from each protein) rather than the four residues (two from each protein to define an edge) required by CMO methods. Another advantage over CMO pointed out by Lena et al. [15] is that unlike CMO, spectral methods scale well with the size of the distance cutoff parameter $c13-math-0123$ .

In 1988, Umeyama [23] published a spectral, polynomial-time algorithm for computing a bijection between two weighted graphs that are isomorphic. The adjacency matrix of each graph is assume to have distinct eigenvalues. Umeyama's algorithm also works well if the graphs are nearly isomorphic. We describe the details of Umeyama's algorithm next.

Let $c13-math-0124$ and $c13-math-0125$ be the adjacency matrices of two undirected, weighted graphs with the same number of nodes and distinct eigenvalues. The goal is to determine a permutation matrix $c13-math-0126$ that minimizes $c13-math-0127$ . Let $c13-math-0128$ and $c13-math-0129$ be the eigensystem decomposition of $c13-math-0130$ and $c13-math-0131$ . It is possible to prove that

equation

where $c13-math-0133$ and $c13-math-0134$ , $c13-math-0135$ , are the ordered eigenvalues of $c13-math-0136$ and $c13-math-0137$ . Umeyama showed that there exists an orthogonal matrix $c13-math-0138$ for some diagonal matrix $c13-math-0139$ , where $c13-math-0140$ has diagonal elements $c13-math-0141$ or 1, for which

equation

For isomorphic graphs, Umeyama proved that $c13-math-0143$ is a permutation matrix $c13-math-0144$ . Moreover, $c13-math-0145$ can be computed in polynomial time by solving the assignment problem

using the Hungarian algorithm. (The entries of the matrices $c13-math-0147$ and $c13-math-0148$ are the absolute values of the entries in $c13-math-0149$ and $c13-math-0150$ .) Umeyama also observed that the optimal solution can often be obtained for nonisomorphic graphs by using the solution to the assignment problem above as an initial guess and then applying a local optimization algorithm.

Umeyama's algorithm for the graph isomorphism problem does not apply directly to the protein structure alignment problem, which is a subgraph isomorphism problem. In other words, Umeyama's algorithm cannot handle insertions and deletions. In addition, the weighted graphs of protein structures may not be similar to one another, a requirement for Umeyama's algorithm to work reliably.

Bhattacharya et al. [4] attempt to overcome the limitations of Umeyama's algorithm by considering local alignments to identify similar neighborhoods of the same size and then piecing these neighborhoods together. Rather than apply Umeyama's algorithm directly, Bhattacharya et al. normalize each eigenvector by the size of the protein and then compare the eigenvectors without taking absolute values. (Like Umeyama, they do not use the eigenvalues in their algorithm.) A complication arising in the approach used by Bhattacharya et al. is the fact that the eigenvector decomposition of an adjacency matrix does not specify an orientation of the eigenvectors. In other words, if $c13-math-0151$ is an eigenvector, so is $c13-math-0152$ . This requires $c13-math-0153$ eigenvector orientations to be checked, where $c13-math-0154$ equals the number of eigenvectors compared. The time complexity of the resulting algorithm, called Matchprot, is $c13-math-0155$ . Bhattacharya et al. point out that empirical observations suggest that $c13-math-0156$ eigenvectors is sufficient for good results. However, Matchprot has difficulty with alignments involving large insertions and deletions.

Lena et al. [15] introduced a spectral algorithm called Al-Eigen. Unlike Matchprot, Al-Eigen uses a global alignment. Al-Eigen scales eigenvectors by the square root of the corresponding eigenvalues. Like Matchprot, Al-Eigen searches through the $c13-math-0157$ orientations of the $c13-math-0158$ eigenvectors it compares, starting with comparing just one eigenvector from each protein, up to $c13-math-0159$ eigenvectors. The complexity of Al-Eigen is $c13-math-0160$ .

In this section we have tersely reviewed the primary mathematical models associated with optimally aligning protein structures. The intent of the CMO, the LCP, and the spectral frameworks is to discern biologically relevant alignments between two proteins. Each paradigm has advantages and disadvantages, and continued research is important. The algorithmic complexity and resulting solution times were substantial enough that, until quite recently, undertaking the task of completing numerous pairwise comparisons was a weighty computational burden. However, a host of modern algorithms is emerging that hasten the comparison procedure. These are discussed in the next section.

13.4 More Recent Advances with Database Queries

Protein databases contain tens of thousands of structures and continue to grow. One of the research fronts in computational biology is the design of algorithms that can efficiently search and organize these databases. For example, a researcher may want to find those proteins whose structure is similar to one under investigation, or may want to navigate a database by functionally similar proteins. Such queriesrequire comparisons against an entire database. As protein databases grow, undertaking these numerous comparisons requires efficient and accurate comparison algorithms.

We review three methods that are designed for making efficient structural comparisons: (1) a geometric approach that encodes each residue with angle and distance information called globa structure superposition of proteins (GOSSIP) [12] (2) a spectral approach called EIGAS that assigns each residue an eigenvalue associated with a high-dimensional feature [22], and (3) a solution procedure to the CMO problem called A_purva [1]. While these methods represent protein chains differently, they share the common algorithmic solution procedure of dynamic programming (DP). The emerging literature on protein structure alignment points to the important role that DP is fulfilling as an efficient algorithmic framework. Our specific goal here is to highlight this observation and develop the use of DP in each of these alignment procedures.

Dynamic programming was invented by Bellman in the 1940s [3], and its use in biological applications began in 1970 [18]. The discrete version we consider calculates an optimal match between two sequences, say, $c13-math-0161$ and $c13-math-0162$ . The algorithm requires that each possible match be scored, and we let $c13-math-0163$ be the score associated with matching element $c13-math-0164$ of the first sequence with element $c13-math-0165$ of the second. Dynamic programming follows a recursion to calculate an optimal match between the two sequences

13.1 $c13-math-0166$

where $c13-math-0167$ is a gap penalty. The gap penalty can depend on whether a gap is being initiated or continued; in the former the penalty is called a gap opening penalty and in the latter, a gap extension penalty. Completing this recursion over $c13-math-0168$ and $c13-math-0169$ calculates the optimal value of the matching, and backtracing the optimal iterations lists the optimal matching. The optimal match depends not only on the scoring matrix $c13-math-0170$ but also on the penalties to open and continue a gap, and hence, the use of DP to optimally align sequences requires parameter tuning. The computational complexity of calculating an optimal matching is $c13-math-0171$ , showing that DP is an efficient polynomial algorithm.

The coordinates of the C $c13-math-0172$ atoms of each residue are used to describe the protein in each of the techniques presented below. Because of the lack of an absolute coordinate system, a protein is commonly abstracted into pairwise comparisons between C $c13-math-0173$ atoms, which provides a coordinate and rotation free description of the protein. Each of A_purva, GOSSIP, and EIGAS assigns different information to each C $c13-math-0174$ atom, and hence, each imposes a different pairwise relationship between residues. In what follows, we briefly describe each technique soas to highlight the similarities and dissimilarities between the algorithms. In particular, we show how to construct the scoring matrices used for DP so that each method is seen as an application of DP applied to sequence similarity.

13.4.1 GOSSIP

The GOSSIP algorithm uses DP with a scoring matrix created by local 3D geometry. A residue is encoded with eight characteristics that depend on a parameter $c13-math-0175$ that defines the local geometry. The characteristics for residue $c13-math-0176$ depend on the polygon created by residues $c13-math-0177$ , $c13-math-0178$ , $c13-math-0179$ , and $c13-math-0180$ . Five of the characteristics are distances, and we use $c13-math-0181$ to denote the Euclidean distance between the C $c13-math-0182$ atoms of residues $c13-math-0183$ and $c13-math-0184$ . The characteristics are

The angle $c13-math-0193$ is created by the line segments $c13-math-0194$ and $c13-math-0195$ , where $c13-math-0196$ is the centroid of the protein structure and $c13-math-0197$ is the number of residues in the protein chain. See Figure 13.6 for an example. All but the last $c13-math-0198$ residues are encoded.

Figure 13.6 Residue $c13-math-0199$ is characterized by lengths of a polygon, the length of a diagonal, the angle $c13-math-0200$ , and two indices.

The score assigned to matching residue $c13-math-0201$ of the first protein to residue $c13-math-0202$ of the second is

where the coefficients $c13-math-0204$ and $c13-math-0205$ are experimentally decided. DP is applied to the scoring matrix $c13-math-0206$ , but with a few adaptations. First, an indicator function $c13-math-0207$ is used to decide if the characteristics of residues $c13-math-0208$ and $c13-math-0209$ are similar enough to consider a match. Each characteristic is compared, and agreement is decided based on a threshold value that is characteristic dependent. If enough characteristics agree, then $c13-math-0210$ , but if not $c13-math-0211$ . This adaptation alters the recursion in (13.1) to

equation

A second adaptation is that $c13-math-0213$ is not calculated if the difference between $c13-math-0214$ and $c13-math-0215$ exceeds a threshold indicating the maximum number of gaps allowed in a match. GOSSIP further limits the number of the protein chains that are compared by associating with each chain a collection of chains of similar lengths.

Numerical work for GOSSIP is promising. Two datasets described by Kifer et al. [12] were used for testing, one containing 2,930 protein structures from CATH and one containing 3,613 structures from Astral. An all-against-all comparison requires 4,290,985 alignments in the CATH dataset and 6,525,078 comparisons in the Astral dataset. To gauge the effectiveness of GOSSIP, $c13-math-0216$ proteins from the CATH dataset and $c13-math-0217$ from the Astral dataset were selected, and all-against-all comparisons were made on these smaller datasets. Runtimes were then scaled to the larger datasets. The GOSSIP adaptations to DP were estimated to complete all pairwise comparisons in the large datasets in the range from 9 (17.3) h to 5.1 (9.1) min on CATH (Astral), depending on how similar the lengths of the protein chains needed to be to consider an alignment. The longest runtimes were for comparisons in which the chain lengths need to be within only $c13-math-0218$ of each other, and the shortest times required the chain lengths to be within $c13-math-0219$ of each other. The accuracy of the method lies between the results of MultiProt [21] and YAKUSA [7], and while MultiProt better identifies structural similarity, it requires several hundred hours of computational time. GOSSIP bests YAKUSA's results while comparing reasonably with regard to run time.

13.4.2 Eigensystem Alignment with a Spectrum

As with the other methods described in this section, the central theme of aligning protein chains by EIGensystem Alignment with the Spectrum (EIGAS) is to align the residues so that the matching is as similar as possible with regard to a measure of intra-similarity between the residues of the individual proteins. EIGAS uses a scaled version of the Euclidean distance between the C $c13-math-0220$ atoms of any two residues, a measure called smooth contact. A smooth-contact matrix $c13-math-0221$ has as its $c13-math-0222$ th element the smooth contact between residues $c13-math-0223$ and $c13-math-0224$ ,

where $c13-math-0226$ is the Euclidean distance between the C $c13-math-0227$ atoms of residues $c13-math-0228$ and $c13-math-0229$ . Euclidean distances less than the cutoff parameter $c13-math-0230$ are scaled linearly between the maximum smooth-contact value of 1, which occurs if $c13-math-0231$ , and the minimum value of 0, which occurs if the distance between the residues exceeds $c13-math-0232$ . The smooth-contact measure is an assessment of the proximity of the residues. So the smooth contact between a residue and itself is 1. If $c13-math-0233$ is small, then the smooth contact between most residues is zero, but if $c13-math-0234$ is large, then more values are nonzero.

Smooth-contact matrices have the favorable mathematical property that they are positive definite, that is, that all eigenvalues are positive, for suitably small $c13-math-0235$ values. This follows from the fact that the diagonal elements are independent of the value of $c13-math-0236$ since a C_α atom is always in contact with itself, whereas the off-diagonal elements can be made arbitrarily small as $c13-math-0237$ decreases. Hence the sum of the off-diagonal elements of any row is guaranteed to be less than the diagonal element itself for sufficiently small $c13-math-0238$ , a property called diagonal dominance. A well-known result in linear algebra is that a diagonally dominant matrix is positive definite. The positive definite property guarantees that a smooth-contact matrix can be factored as

13.2 $c13-math-0239$

where the columns of $c13-math-0240$ form an orthonormal set of eigenvectors of $c13-math-0241$ and $c13-math-0242$ is a diagonal matrix of the positive eigenvalues. Moreover, a smooth-contact matrix imposes an inner product, and subsequently a norm and a metric, on $c13-math-0243$ , where $c13-math-0244$ is the number of residues in the protein chain. So, each protein chain coincides with a geometric rendering of $c13-math-0245$ .

To illustrate the mathematical and geometric properties induced by a smooth-contact matrix, we consider a three-residue protein chain whose smooth-contact matrix is located in Figure 13.7. The inner product of the vectors $c13-math-0246$ and $c13-math-0247$ induced by the smooth-contact matrix is $c13-math-0248$ , which has the following properties:

$c13-math-0249$ is the norm of $c13-math-0250$ relative to $c13-math-0251$ .
$c13-math-0252$ is the cosine of the angle between $c13-math-0253$ and $c13-math-0254$ .
$c13-math-0255$ is the distance between $c13-math-0256$ and $c13-math-0257$ relative to $c13-math-0258$ .

The inner product supports an embedding of the residues in an $c13-math-0259$ -dimensional space so that the smooth contact between any two residues is the result of an inner product calculation. Let the first residue be associated with $c13-math-0260$ , the second with $c13-math-0261$ , and so on. Then $c13-math-0262$ , which is the smooth contact between residues $c13-math-0263$ and $c13-math-0264$ , for example, $c13-math-0265$ for the smooth-contact matrix in Figure 13.7.

Figure 13.7 The path of the backbone indicates that the Euclidean distances between residues 1 and 3 and between 2 and 3 are shorter than the distance between residues 1 and 2. This is seen in the smooth-contact matrix since both $c13-math-0266$ and $c13-math-0267$ , both of which are greater than $c13-math-0268$ .

The vector space $c13-math-0269$ with the inner product $c13-math-0270$ is called a contact space, and the geometry induced by $c13-math-0271$ is, in some sense, a “skewed” Euclidean geometry. As an illustration we note that a sphere is the collection of unit length vectors, which in contact space means that $c13-math-0272$ is on the sphere only if $c13-math-0273$ . Such a collection is an ellipsoid because the metric is scaled by $c13-math-0274$ . The unit ellipsoid with respect to the contact matrix in Figure 13.7 is depicted in Figure 13.8. The residues are represented by the thick dark vectors $c13-math-0275$ , $c13-math-0276$ , and $c13-math-0277$ , which lie along the standard axes. The angle between $c13-math-0278$ and $c13-math-0279$ is not as it appears in the figure. For example, in Euclidean geometry the angle between $c13-math-0280$ and $c13-math-0281$ is 90°, but in contact space the angle is $c13-math-0282$ . Also, in Euclidean geometry the distance between between $c13-math-0283$ and $c13-math-0284$ is $c13-math-0285$ , but in contact space the distance is $c13-math-0286$ .

Figure 13.8 The ellipsoid is the collection of all unit length vectors with respect to the contact matrix in Figure 13.7. The thick dark vectors represent the residues and are $c13-math-0287$ (pointing to the left), $c13-math-0288$ (pointing out from the page), and $c13-math-0289$ (pointing upward). The thin vectors are the orthonormal eigenvectors that form the columns of $c13-math-0290$ in Equation (13.2). The eigenvectors lie along the principal axes of the ellipsoid, and the length of each axis is $c13-math-0291$ .

The motivation behind EIGAS is that each protein is associated with an $c13-math-0292$ -dimensional geometry, and this perspective allows the problem of aligning protein structures to be restated as a question of optimally aligning the geometries of contact spaces. However, contact spaces vary in dimension just as protein chains vary in length, and the algorithmic design question is to create a method of matching the residues so that the lower-dimensional geometries of the two contact spaces are as similar as possible.

Contact geometries are defined by the eigenvectors and eigenvalues, which give the direction and length of the unit ellipsoid's principal axes. Moreover, the eigenvectors are linear combinations of the $c13-math-0293$ vectors. So, in terms of the protein, each eigenvector represents a weighted sum of the residues in contact space. In some research communities these weighted sums would be called “features” of the protein. Each residue is associated with its nearest eigenvector and is assigned its corresponding eigenvalue. Formally, residue $c13-math-0294$ is assigned the eigenvalue $c13-math-0295$ , where the associated eigenvector $c13-math-0296$ of the smooth-contact matrix solves

where $c13-math-0298$ denotes the eigenvectors that comprise the columns of $c13-math-0299$ and $c13-math-0300$ from Equation (13.2). This eigenvalue assignment associates each residue with one of the principal axes of the ellipsoid defined by the smooth-contact map. A matching of the residues between the protein chains is scored by comparing the eigenvalues associated with the corresponding principal axes.

Consider the problem of matching the protein chain in Figure 13.7 with the chain in Figure 13.9. The middle three residues of the chain in Figure 13.9 are in the same configuration as the residues inFigure 13.7, and one would expect an alignment technique to match $c13-math-0301$ with $c13-math-0302$ , $c13-math-0303$ with $c13-math-0304$ , and $c13-math-0305$ with $c13-math-0306$ . The eigenvalues of the smooth-contact matrix in Figure 13.7 are $c13-math-0307$ , 0.91, and $c13-math-0308$ , and the eigenvalues of the smooth-contact matrix in Figure 13.9 are $c13-math-0309$ , $c13-math-0310$ , $c13-math-0311$ , $c13-math-0312$ , and $c13-math-0313$ . Assigning the residues the eigenvalues of their nearest eigenspaces leads to the matching problem depicted below:

Figure 13.9 Residues $c13-math-0315$ , $c13-math-0316$ , and $c13-math-0317$ are in the same geometric configuration with $c13-math-0318$ , $c13-math-0319$ , and $c13-math-0320$ , respectively, in Figure 13.7

The optimal three-residue matching is shown by the solid edges. The optimal value is the combined eigenvalue difference, $c13-math-0314$ , which is the lowest possible among all matchings of three residues. Note that this is the anticipated alignment.

The value of matching residue $c13-math-0321$ with residue $c13-math-0322$ is estimated by $c13-math-0323$ . If the eigenvalues are similar, then the principal axes of the unit ellipsoids associated with the residues are of approximately the same length. Hence the contact geometries are similar along these particular axes. EIGAS uses the scoring matrix $c13-math-0324$ and DP to find an optimal matching between the residues. EIGAS was reported [22] to correctly identify the SCOP classification of the Skolnick40 dataset by completing all 780 comparisons in about $c13-math-0325$ s. On Proteus300 EIGAS completed all $c13-math-0326$ comparisons in about $c13-math-0327$ h. The best results in terms of quality were achieved with $c13-math-0328$ Å and a gap penalty of $c13-math-0329$ . More recent results [11] show that EIGAS can complete all $c13-math-0330$ comparisons in $c13-math-0331$ seconds if the DP is coded in C $c13-math-0332$ and the computational effort is distributed over eight cores. This more recent numerical work shows that if $c13-math-0333$ Å, the gap opening penalty is $c13-math-0334$ , and the gap extension penalty is $c13-math-0335$ , then EIGAS correctly identifies the SCOP classifications of Proteus300 (perfect classifications were reported for many parameter combinations).

13.4.3 A_purva

As is often the case in applied mathematics, the concept of structure can be represented in terms of graph theory, and this is the manner in which contact map overlap (CMO) is derived. The backbone of each protein chain is represented as a graph, say, $c13-math-0336$ . The vertices in $c13-math-0337$ represent the C $c13-math-0338$ atoms, and two C $c13-math-0339$ atoms are adjacent, meaning that they form an edge in $c13-math-0340$ , if the distance between them satisfies a contact criteria. For example, if $c13-math-0341$ , then residues $c13-math-0342$ and $c13-math-0343$ are in contact and $c13-math-0344$ induces an edge in $c13-math-0345$ . We let $c13-math-0346$ be the index set $c13-math-0347$ and $c13-math-0348$ be the collection of edges $c13-math-0349$ such that $c13-math-0350$ satisfies the contact criteria. A_purva [1] uses an upper bound of $c13-math-0351$ Å, and edges for consecutive residues along the protein backbone are not permitted. If we let $c13-math-0352$ form a binary matrix with a value of $c13-math-0353$ if residues $c13-math-0354$ and $c13-math-0355$ satisfy the contact criteria, then we have represented the protein chain as a graph associated with this adjacency matrix (see Fig. 13.10).

Figure 13.10 A graphical representation of a protein chain of six residues along with its adjacency matrix.

The goal of CMO is to optimize a pairing between the graphical representations of the two proteins, meaning that we want to pair residues so that their contact structures match as well as possible. This question is similar to the classic NP-hard problem of finding a maximum induced subgraph, which is to find the largest possible subgraph of one graph that is a subgraph of the other. However, CMO adds the constraint that the sequential ordering of the nodes must be preserved in the vertex matching. For example, if $c13-math-0356$ of the first protein is matched with $c13-math-0357$ of the second, then $c13-math-0358$ of the first protein cannot be matched to $c13-math-0359$ of the second since this would violate the linear ordering inherited from a protein's backbone. Residue $c13-math-0360$ could be matched to any $c13-math-0361$ of the second protein as long as $c13-math-0362$ .

To accomplish the residue matching, the graphs of two proteins are joined to create a bipartite graph. Let $c13-math-0363$ and $c13-math-0364$ be the graphs of two proteins and $c13-math-0365$ and $c13-math-0366$ be their respective adjacency matrices. Form the complete bipartite graph between the two vertex sets, meaning that every possible edge between $c13-math-0367$ and $c13-math-0368$ is included. The edge sets $c13-math-0369$ and $c13-math-0370$ are used to weight every possible pair of matches, indexed by $c13-math-0371$ to indicate that $c13-math-0372$ of the first protein is matched with $c13-math-0373$ of the second and that $c13-math-0374$ of the first is matched with $c13-math-0375$ of the second. Each $c13-math-0376$ is assigned the product of the corresponding elements of the adjacency matrices:

equation

We say that the paired residue matches $c13-math-0378$ and $c13-math-0379$ share a common contact provided that $c13-math-0380$ is $c13-math-0381$ . As an example, if we are trying to align the protein depicted in Figure 13.10 with a four-residue protein in which the only contacts are between $c13-math-0382$ and $c13-math-0383$ and between $c13-math-0384$ and $c13-math-0385$ , then $c13-math-0386$ but $c13-math-0387$ . A schematic is depicted in Figure 13.11.

Figure 13.11 Diagram (a) depicts two possible pairs of matchings between two proteins. The pair depicted with solid lines is weighted with a one since $c13-math-0388$ is in contact with $c13-math-0389$ and $c13-math-0390$ is in contact with $c13-math-0391$ . The pair depicted by dashed lines is weighted with a zero since $c13-math-0392$ is not in contact with $c13-math-0393$ . Note that not all edges from Figure 13.10 are shown for the top protein. Diagram (b) shows a pairing that would not be allowed since it violates the residue ordering imposed by the proteins' backbones.

The bipartite weights are used to score matchings between the residue sets by adding all possible weights in the matching. If $c13-math-0394$ is matched with $c13-math-0395$ , for $c13-math-0396$ , in the example above, then this matching has a contact score of

This value indicates that the only overlap of this residue matching is due to the fact that matching $c13-math-0398$ to $c13-math-0399$ and $c13-math-0400$ to $c13-math-0401$ yields a common contact. In matrix terms, the score is half of $c13-math-0402$ , where $c13-math-0403$ denotes the residues from the first protein; $c13-math-0404$ , the residues from the second protein; the set subscripts, the submatrices whose rows and columns are listed in the sets; and $c13-math-0405$ , the Hadamard (elementwise) product. The 1-norm is the elementwise norm and not the operator norm. To see the preceding calculation in this matrix form, let $c13-math-0406$ , which gives

equation

As above, the only nonzero elements result from residues $c13-math-0408$ and $c13-math-0409$ being in contact in the first protein and residues $c13-math-0410$ and $c13-math-0411$ being in contact in the second. The symmetry of the adjacency matrices doubles the score. The matrix description shows that we are looking for ordered index sets of the residues from both proteins so that the elementwise product of the resulting adjacency submatrices has as many ones as possible.

The combinatorial problem of maximizing the contact overlap can be stated as a binary optimization problem. For the $c13-math-0412$ th residue in $c13-math-0413$ , with $c13-math-0414$ indicating either the first or second protein, let $c13-math-0415$ and $c13-math-0416$ . We let $c13-math-0417$ be a binary variable indicating that if we match $c13-math-0418$ of the first protein with $c13-math-0419$ of the second and $c13-math-0420$ of the first with $c13-math-0421$ of the second. We also let $c13-math-0422$ be a binary variable indicating if $c13-math-0423$ of the first protein is matched with $c13-math-0424$ of the second. With this notation, the standard formulation of the CMO problem is

13.3 $c13-math-0425$

Solving the CMO problem has drawn much attention, and many have argued favorably for this alignment method. However, the combinatorial complexity of the problem has challenged it with becoming an efficient model for working with entire databases. More recent work [11] shows otherwise. The insight is to reformulate the problem so that the new formulation lends itself to DP.

The reformulation is interpreted on a new graph $c13-math-0426$ , where $c13-math-0427$ . In this new graph each vertex is an ordered pair of vertices, and each edge corresponds to a matched pair of residues that share a contact. In terms of the protein chains, each vertex of this graph is a possible match between the residues of the different proteins, and an edge exists between a pair of possible residue matches only if they share a common contact. A depiction of this graph is found in Figure 13.12.

Figure 13.12 The contacts of the protein from Figure 13.11 are shown on the left, and the contacts for the second protein are shown at the bottom. The new graph links pairs of residue matches if they share a contact. All edges of the new graph are shown as either dashed or solid lines, with the solid lines indicating an optimal solution of $c13-math-0428$ , $c13-math-0429$ , $c13-math-0430$ , and $c13-math-0431$ . The optimal value of the CMO problem is 2 because this is the maximum number of noncrossing edges.

The reformulation hints at how DP can be used to solve the CMO problem. The central concept is to move from the lower left corner toward the upper right corner in an optimal fashion. Unlike the other alignment algorithms discussed in this section, which require a single application of DP, the method of A_purva requires a coupling of two DP algorithms. The two decisions are (1) where to link any possible match, called the local problem; and (2) how to construct on optimal sequence from the local decisions, called the global problem. The example in Figure 13.12 illustrates the decisions. From the match $c13-math-0432$ we could have selected any of $c13-math-0433$ , $c13-math-0434$ , $c13-math-0435$ or $c13-math-0436$ . In all but $c13-math-0437$ we could have found a second nonintersecting edge that would have given a CMO score of 2. For example, if we had instead used the edge from $c13-math-0438$ to $c13-math-0439$ , then we could have added the edge from $c13-math-0440$ to $c13-math-0441$ . The local decision has to assess which edges should be considered. Once we know the optimal local solutions for each $c13-math-0442$ , we use this information to select the global collection of edges so as to maximize the number of edges while maintaining that edges don't cross.

The forward-looking perspective means that we are always making decisions for indices in $c13-math-0443$ , and hence we need to account for the constraints whose summands are over $c13-math-0444$ . Andonov et al. [1] use a classic Lagrangian relaxation that moves these constraints to the objective and penalizes them with multipliers. The relaxed problem maximizes

equation

subject to

equation

We note that this is not directly a Lagrangian relaxation of (13.3) because of a tighter description of the feasible region. Specifically, the authors make the following replacements in (13.3):

equation

The additional terms continue to satisfy the noncrossing property, but they give a more accurate description of the search space.

The Lagrange multipliers $c13-math-0448$ and $c13-math-0449$ are restricted to be nonnegative, and through the application of DP, the relaxed problem can be solved in polynomial time for any collection of Lagrange multipliers, as the complexity is no worse than $c13-math-0450$ . The local use of DP creates a scoring matrix for each $c13-math-0451$ in $c13-math-0452$ . Specifically, let $c13-math-0453$ be the matrix whose rows are indexed by $c13-math-0454$ and whose columns are indexed by $c13-math-0455$ . The score at each position is $c13-math-0456$ , which corresponds to the coefficient of the $c13-math-0457$ variable in the objective. Let $c13-math-0458$ be the optimal value of applying DP to $c13-math-0459$ .

The global problem uses the local scores to create an optimal solution to the relaxed problem. Specifically, let $c13-math-0460$ be the scoring matrix whose components for each $c13-math-0461$ and $c13-math-0462$ are

equation

Note that the last two summations are the coefficients for the $c13-math-0464$ variables, and that the remaining portion of the objective collapses into $c13-math-0465$ . Hence, an optimal solution to the Lagrangian relaxation for any nonnegative collections of $c13-math-0466$ and $c13-math-0467$ can be calculated with a double application of DP.

While the relaxation can be solved efficiently, the relaxed solution is not a solution to the original CMO problem unless the Lagrange multipliers satisfy an optimality condition of their own. The optimization problem that defines a collection of multipliers for which the relaxed problem actually solves the CMO problem is called the dual-optimization problem, which is a minimization problem over the Lagrange multipliers for a fixed collection of $c13-math-0468$ and $c13-math-0469$ variables. A discussion of this topic is outside the scope of this chapter, but interested readers can find descriptions in most nonlinear programming texts [2]. The Lagrangian process is iterative in that starting with initial Lagrange multipliers, the relaxation is solved with DP, and then the Lagrange multipliers are updated by (nearly) solving the dual problem. The process repeats and is stopped once the value of the CMO problem is sufficiently close to the optimal value of the dual problem or due to a time limitation. For details about how A_purva updates the Lagrange multipliers, see the paper by Andonov et al. [1].

Iterative Lagrangian algorithms are a mainstay in the nonlinear programming community, and a common sentiment among experts in optimization is that problems either do or do not lend themselves to this tactic. The numerical tests for A_purva show that the relaxation of CMO lends itself nicely to this solution procedure. A_purva was tested as a database comparison algorithm on both Skolnick40 [6] and Proteus300 [1]. Reported solution times were $c13-math-0470$ s for all $c13-math-0471$ pairwise comparisons for Skolnick40 (approximately $c13-math-0472$ s per comparison) and 13 h and 38 min for all 44,850 comparisons for Proteus300 (approximately 1.09 s per comparison). The comparisons were perfect in their agreement with the SCOP classification via clustering with Chavl [16]. Although not as fast as either of the other techniques discussed in this section, the results are significant advances for CMO, which before A_purva would have been impractical for a dataset like Proteus300.

13.4.4 Comments on Dynamic Programming and Future Research Directions

The application of DP to align protein structures is not restricted to the more recent literature on database applications, and in particular, DP was one of the early solution methods for the CMO problem, [6, 14]. However, the more recent literature indicates that DP is becoming a central algorithmic method to efficiently align protein structures well enough for classification across a sizable database. Other research groups are arriving at the same conclusion about DP [24]. Indeed, as the authors were preparing this document they found additional examples in the more recent literature [5].

With the emerging success that DP is having, we feel compelled to note a few of its limitations. First, DP is a sequential decision process that is optimal only under the Markov property, that is, that the current decision does not depend on past decisions. All the applications of DP discussed here use the natural sequence of the backbone to order the decisions process, but using DP in this fashion does not allow for the possibility of nonsequential alignments. The ability to identify nonsequential alignments is argued by some as a crucial element to identifying protein families [8]. Adapting the DP framework to consider nonsequential alignments is a promising and important avenue for future research.

One of the concerns about three-dimensional alignment methods is that they don't easily adapt to a protein's natural flexibility. Many proteins have several confirmations, and an alignment method that can correctly classify the different confirmations would be beneficial. Moreover, the crystallography and NMR experiments from which we gain three-dimensional coordinates are not without error, and the alignment methods should be robust enough to correctly classify proteins under coordinate perturbations. Whether methods based on DP provide the robustness to handle a protein's flexibility and experimental error is not yet well established. Hasegawa and Holm [10] suggest that more attention should be focused on the robustness of the optimization procedures used to align protein structures.

Finally, DP requires tuning that is experimental in nature. The gap opening and gap extension penalties need to be experimentally set to achieve agreement with biological classifications. Parameter tuning has not been shown to be database-independent, and without such experimental validation, we lack the confidence that DP-based methods can be used to organize a database without an a priori biological classification, which somewhat defeats the purpose of automating database classification.

References

1. Andonov R, Malod-Dognin N, Yanev N, Maximum contact map overlap revisited, J. Comput. Biol. 18(1):27–41 (Jan. 2011).

2. Bazaraa M, Sherali H, Shetty C, Nonlinear Programming: Theory and Algorithms, 3rd ed., Wiley-Interscience, 2006.

3. Bellman R, Dynamic Programming, Princeton Univ. Press, Princeton, NJ, 1957.

4. Bhattacharya S, Bhattacharyya C, Chandra N, Projections for fast protein structure retrieval, BMC Bioinformatics 7(Suppl. 5):S5 (2006).

5. Bonnel N, Marteau P, LNA: Fast Protein Classification Using a Laplacian Characterization of Tertiary Structure, Technical Report, Univ. Bretagne Sud, France, 2011.

6. Caprara A, Carr R, Istrail S, Lancia G, Walenz B, 1001 optimal pdb structure alignments: Integer programming methods for finding the maximum contact map overlap, J. Comput. Biol. 11(1):27–52 (2004).

7. Carpentier M, Brouillet S, Pothier J, Yakusa: A fast structural database scanning method, Proteins 61(1):137–151 (Oct. 2005).

8. Chen L, Wu L, Wang Y, Zhang S, Zhang X, Revealing divergent evolution, identifying circular permutations and detecting active-sites by protein structure comparison, BMC Struct Biol. 6:18 (2006).

9. Goldman D, Istrail S, Papadimitriou C, Algorithmic aspects of protein structure similarity. Proc. 40th Annual Symp. Foundations of Computer Science, 1999, pp. 512–521.

10. Hasegawa H, Holm L, Advances and pitfalls of protein structural alignment, Curr. Opin. Struct. Biol. 19(3):341–348 (June 2009).

11. Holder A, Simon J, Strauser J, Shibberu Y, An Investigation into the Robustness of Algorithms Designed for Efficient Protein Structure Alignment across Databases, Technical Report, Rose-Hulman Inst. Technology, 2012.

12. Kifer I, Nussinov R, Wolfson H, Gossip: A method for fast and accurate global alignment of protein structures, Bioinformatics 27(7):925–932 (2011).

13. Kolodny R, Linial N, Approximate protein structural alignment in polynomial time, Proc Natl Acad Sci USA 101(33):12201–12206 (Aug. 2004).

14. Lancia G, Carr R, Walenz B, Istrail S, 101 optimal pdb structure alignments: A branch-and-cut algorithm for the maximum contact map overlap problem, Proc. 5th Annual Int. Conf. Computational Biology, ACM Press, New York, 2001, pp. 143–202.

15. Di Lena P, Fariselli P, Margara L, Vassura M, Casadio R, Fast overlapping of protein contact maps by alignment of eigenvectors, Bioinformatics 26(18):2250–2258 (Sept. 2010).

16. Lerman I, Likelihood linkage analysis (lla) classification method, Biochimie 75:379–397 (1993).

17. Cheng Li S, Kaow Ng Y, On protein structure alignment under distance constraint, Theor. Comput. Sci. 412(32):4187–4199 (2011) (algorithms and computation).

18. Needleman S, Wunsch C, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48(3):443–453 (March 1970).

19. Poleksic A, Algorithms for optimal protein structure alignment, Bioinformatics, 25(21):2751–2756 (Nov. 2009).

20. Poleksic A, Optimizing a widely used protein structure alignment measure in expected polynomial time, IEEE/ACM Trans. Comput. Biol. Bioinform. 8(6):1716–1720 (2011).

21. Shatsky M, Nussinov R, Wolfson H, A method for simultaneous alignment of multiple protein structures, Proteins 56(1):143–156 (July 2004).

22. Shibberu Y, Holder A, A spectral approach to protein structure alignment, IEEE/ACM Trans. Comput. Biol. Bioinform. 8:867–875 (Feb. 2011).

23. Umeyama S, An eigendecomposition approach to weighted graph matching problems, IEEE Trans. Pattern Anal. Machine Intell. 10(5):695–703 (1988).

24. Wohlers I, Andonov R, Klau G, Algorithm Engineering for Optimal Alignment of Protein Structure Distance Matrices, Technical Report, CWI, Life Sciences Group, Netherlands, 2011.

25. Xu J, Jiao F, Berger B, A parameterized algorithm for protein structure alignment, J. Comput. Biol. 14(5):564–577 (June 2007).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 13: Fundamentals of Protein Structure Alignment

Create new playlist

Sign In

Sign Up

Chapter 13

Fundamentals of Protein Structure Alignment

13.1 Introduction

13.2 Biological Motivation of Protein Structure Alignment

13.3 Mathematical Frameworks

13.3.1 Spectral Methods

13.4 More Recent Advances with Database Queries

13.4.1 GOSSIP

13.4.2 Eigensystem Alignment with a Spectrum

13.4.3 A_purva

13.4.4 Comments on Dynamic Programming and Future Research Directions

References

Table of Contents for
Chapter 13: Fundamentals of Protein Structure Alignment