Chapter 16

Fractal Related Methods for Predicting Protein Structure Classes and Functions

ZU-GUO YU, VO ANH, JIAN-YI YANG, and SHAO-MING ZHU

16.1 Introduction

The molecular function of a protein can be inferred from the protein's structure information [1]. It is well known that an amino acid sequence partially determines the eventual three-dimensional structure of a protein, and that similarity at the protein sequence level implies similarity of function [2, 3]. The prediction of protein structure and function from amino acid sequences is one of the most important and challenging problems in molecular biology.

Protein secondary structure, which is a summary of the general conformation and hydrogen bonding pattern of the amino acid backbone [4, 5], provides some knowledge to further simplify the complicated 3D structure prediction problem. Hence an intermediate but useful step is to predict the protein secondary structure. Since the 1970s, many methods have been developed for predicting protein secondary structure (see the more recent references cited in Ref. [6]).

Four main classes of protein structures, based on the types and arrangement of their secondary structural elements, were recognized [7]: (1) the α class, (2) the β class, and (3, 4) those with a mixture of α and β elements, denoted α + β and α/β. This structural classification has been widely accepted and used in protein structure and function prediction. As a consequence, structural class prediction has become an important problem whose solution can help build protein databases and predict protein function. In fact, Hou et al. [1] (see also a short report published in Science [8]) constructed a map of the protein structure space using the pairwise structural similarity of 1898 protein chains. They found that the space has a defining feature: the four classes cluster together as four elongated arms emerging from a common center. A review by Chou [9] described the development of existing methods for the prediction of protein structural classes. Proteins in the same family usually have similar functions; therefore, family identification is an important problem in the study of proteins.

Because similarity at the protein sequence level normally implies similarity of function and structure, the similarities of protein sequences can be used to detect the biological functions and interactions of proteins [10]. It is necessary to enrich the concept and context of similarity, because some proteins with low sequence identity may nevertheless have similar tertiary structure and function [11].

Fractal geometry provides a mathematical formalism for describing complex spatial and dynamical structures [12, 13]. Fractal methods are known to be useful for detection of similarity. Traditional multifractal analysis is a useful way to characterize the spatial heterogeneity of both theoretical and experimental fractal patterns [14]. More recently it has been applied successfully in many different fields, including time series analysis and financial modeling [15]. Some applications of fractal methods to DNA sequences are provided in References [16–18] and references cited therein.

Wavelet analysis, recurrence quantification analysis (RQA), and empirical mode decomposition (EMD) are related to fractal methods in nonlinear sciences. Wavelet analysis is a useful tool in many applications such as noise reduction, image processing, information compression, synthesis of signals, and the study of biological data. Wavelets are mathematical functions that decompose data into different frequency components and then describe each component with a resolution matched to its scale [19, 20]. The recurrence plot (RP) is a purely graphical tool originally proposed by Eckmann et al. [21] to detect patterns of recurrence in the data. RQA is a relatively new nonlinear technique introduced by Zbilut and Webber [22, 23] that can quantify the information supplied by RP. The traditional EMD for data, which is a highly adaptive scheme serving as a complement to Fourier and wavelet transforms, was originally proposed by Huang et al. [24]. In EMD, a complicated dataset is decomposed into a finite, often small, number of components called intrinsic mode functions (IMFs). Lin et al. [25] presented a new approach to EMD. This method has been used successfully in many applications in analyzing a diverse range of datasets in biological and medical sciences, geophysics, astronomy, engineering, and other fields [26, 27].

Fractal methods have been used to study proteins. Their applications include fractal analysis of proton exchange kinetics [28], chaos game representation of protein structures [29], sequences based on the detailed HP model [30, 31], the fractal dimension of protein mass [32], fractal properties of protein chains [33], fractal dimensions of protein secondary structures [34], multifractal analysis of the solvent accessibility of proteins [35], and the measure representation of protein sequences [36]. The wavelet approach has been used for the prediction of secondary structures of proteins [37–45]. Chen et al. [41] predicted the secondary structure of a protein by the continuous wavelet transform (CWT) and the Chou–Fasman method. Marsolo et al. [42] combined the pairwise distance with wavelet decomposition to generate a set of coefficients and capture some features of proteins. Qiu et al. [43] used the continuous wavelet transform to extract the positions of α helices and short peptides at special scales. Rezaei et al. [44] applied wavelet analysis to membrane proteins. Pando et al. [45] used the discrete wavelet transform to detect protein secondary structures. Webber et al. [46] defined two independent variables to elucidate protein secondary structures based on the RQA of the coordinates of α-carbon atoms. These variables describe the percentages of α carbons that compose an α helix and a β sheet, respectively. The ability of RQA to deal with protein sequences was reviewed by Giuliani et al. [47]. The IMFs obtained by the EMD method were used to discover similarities of protein sequences, and the results showed that IMFs may reflect the functional identities of proteins [48, 49].

More recently, the authors used fractal methods and the related wavelet analysis, RQA, and EMD methods to study the prediction of protein structural classes and functions [50–54]. In this chapter, we review the methods and results of these studies.

16.2 Methods

16.2.1 Measure Representation Based on the Detailed HP Model and Six-Letter Model

The detailed HP model was proposed by Yu et al. [30]. In this model, 20 different kinds of amino acids are divided into four classes: nonpolar, negative polar, uncharged polar, and positive polar. The nonpolar class consists of the eight residues ALA, ILE, LEU, MET, PHE, PRO, TRP, and VAL; the negative polar class consists of the two residues ASP and GLU; the uncharged polar class is made up of the seven residues ASN, CYS, GLN, GLY, SER, THR, and TYR; and the remaining three residues ARG, HIS, and LYS designate the positive polar class.

For a given protein sequence $s = s_1 s_2 \cdots s_N$ with length $N$, where $s_i$ is one of the 20 kinds of amino acids for $i = 1, 2, \ldots, N$, we define

16.1 $a_i = \begin{cases} 0, & \text{if } s_i \text{ is nonpolar} \\ 1, & \text{if } s_i \text{ is negative polar} \\ 2, & \text{if } s_i \text{ is uncharged polar} \\ 3, & \text{if } s_i \text{ is positive polar} \end{cases}$

This results in a sequence $a = a_1 a_2 \cdots a_N$, where $a_i$ is a letter of the alphabet $\{0, 1, 2, 3\}$. The mapping (16.1) is called the detailed HP model [30].
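As a concrete illustration, the detailed HP mapping can be sketched in Python. The class memberships follow the text above; the use of single-letter amino acid codes is an assumption of this sketch.

```python
# Sketch of the detailed HP mapping (16.1): amino acids -> {0, 1, 2, 3}.
# Class memberships follow the text; single-letter codes are an assumption.
NONPOLAR = set("AILMFPWV")   # ALA ILE LEU MET PHE PRO TRP VAL -> 0
NEGATIVE = set("DE")         # ASP GLU -> 1
UNCHARGED = set("NCQGSTY")   # ASN CYS GLN GLY SER THR TYR -> 2
POSITIVE = set("RHK")        # ARG HIS LYS -> 3

CODE = {aa: v
        for cls, v in ((NONPOLAR, 0), (NEGATIVE, 1), (UNCHARGED, 2), (POSITIVE, 3))
        for aa in cls}

def detailed_hp(seq):
    """Map a protein sequence to its detailed HP code sequence."""
    return [CODE[aa] for aa in seq]
```

For example, `detailed_hp("ADNR")` takes one residue from each class in turn and returns `[0, 1, 2, 3]`.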

According to Chou and Fasman [55], the 20 different kinds of amino acids are divided into six classes: strong $\beta$ former ($H_\beta$); $\beta$ former ($h_\beta$); weak $\beta$ former ($I_\beta$); $\beta$ indifferent ($i_\beta$); $\beta$ breaker ($b_\beta$); and strong $\beta$ breaker ($B_\beta$). The $H_\beta$ class consists of the three residues Met, Val, and Ile; the $h_\beta$ class consists of the seven residues Cys, Tyr, Phe, Gln, Leu, Thr, and Trp; the $I_\beta$ class consists of the residue Ala; the $i_\beta$ class consists of the three residues Arg, Gly, and Asp; the $b_\beta$ class is made up of the five residues Lys, Ser, His, Asn, and Pro; and the remaining residue Glu constitutes the $B_\beta$ class.

For a given protein sequence $s = s_1 s_2 \cdots s_N$ with length $N$, where $s_i$ is one of the 20 kinds of amino acids for $i = 1, 2, \ldots, N$, we define

16.2 $u_i = j \quad \text{if } s_i \text{ belongs to the } (j+1)\text{th of the six classes above}, \qquad j = 0, 1, \ldots, 5$

This results in a sequence $u = u_1 u_2 \cdots u_N$, where $u_i$ is a letter of the alphabet $\{0, 1, 2, 3, 4, 5\}$. The mapping (16.2) is called the six-letter model [51].

Here we call any string made up of $k$ letters from the set $\{0, 1, 2, 3\}$ (for the detailed HP model) or $\{0, 1, 2, 3, 4, 5\}$ (for the six-letter model) a $k$-string. For a given $k$, there are in total $4^k$ or $6^k$ different $k$-strings. In order to count the number of $k$-strings in a sequence $a$ or $u$ obtained from a protein sequence $s$, we need $4^k$ or $6^k$ counters. We divide the interval $[0, 1)$ into $4^k$ or $6^k$ disjoint subintervals, and use each subinterval to represent a counter. For a substring $t = t_1 t_2 \cdots t_k$ of $a$ with length $k$, we define

16.3 $x_l(t) = \sum_{i=1}^{k} \frac{t_i}{4^i}, \qquad x_r(t) = x_l(t) + \frac{1}{4^k}$

For a substring $t = t_1 t_2 \cdots t_k$ of $u$ with length $k$, we define

16.4 $x_l(t) = \sum_{i=1}^{k} \frac{t_i}{6^i}, \qquad x_r(t) = x_l(t) + \frac{1}{6^k}$

We then use the subinterval $[x_l(t), x_r(t))$ to represent the substring $t$. Let $N(t)$ be the number of times that a substring $t$ with length $k$ appears in the sequence $a$ or $u$ (when we count these numbers, we open a reading frame of width $k$ and slide the frame one amino acid at a time). We define

16.5 $F(t) = \frac{N(t)}{N - k + 1}$

to be the frequency of the substring $t$. It follows that $\sum_t F(t) = 1$. We can now define a measure $\mu_k$ on $[0, 1)$ by $\mu_k(\mathrm{d}x) = Y_k(x)\,\mathrm{d}x$, where

16.6 $Y_k(x) = 4^k F(t), \qquad x \in [x_l(t), x_r(t))$

when $x_l(t)$ and $x_r(t)$ are defined by (16.3). We define a measure $\nu_k$ on $[0, 1)$ by $\nu_k(\mathrm{d}x) = Z_k(x)\,\mathrm{d}x$, where

16.7 $Z_k(x) = 6^k F(t), \qquad x \in [x_l(t), x_r(t))$

when $x_l(t)$ and $x_r(t)$ are defined by (16.4). We see that $\mu_k([0, 1)) = 1$ and $\nu_k([0, 1)) = 1$. We call $\mu_k$ and $\nu_k$ the measure representations of the protein sequence corresponding to the given $k$, based on the detailed HP model and the six-letter model, respectively.
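A minimal sketch of the frequency computation behind the measure representation: $k$-string counts are taken with a sliding window, and the lexicographic order of the strings corresponds to the left-to-right order of their subintervals of $[0, 1)$. The function name and list-based output are illustrative choices, not the authors' implementation.

```python
from itertools import product

def measure_representation(codes, alphabet_size, k):
    """Frequencies F(t) of all k-strings over {0, ..., alphabet_size - 1},
    counted with a sliding window of width k moved one letter at a time.
    Lexicographic order of the strings matches the left-to-right order
    of their subintervals [x_l(t), x_r(t)) of [0, 1)."""
    counts = {s: 0 for s in product(range(alphabet_size), repeat=k)}
    for i in range(len(codes) - k + 1):
        counts[tuple(codes[i:i + k])] += 1
    total = len(codes) - k + 1
    return [counts[s] / total for s in sorted(counts)]
```

For a detailed-HP code sequence, `measure_representation(codes, 4, k)` returns the $4^k$ frequencies, which sum to 1 as required by Eq. (16.5).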

16.2.2 Measures and Time Series Based on the Physicochemical Features of Amino Acids

Measured in kcal/mol, the hydrophobic free energies of the 20 amino acids are c16-math-0095, c16-math-0096, c16-math-0097, c16-math-0098, c16-math-0099, c16-math-0100, c16-math-0101, c16-math-0102, c16-math-0103, c16-math-0104, c16-math-0105, c16-math-0106, c16-math-0107, c16-math-0108, c16-math-0109, c16-math-0110, c16-math-0111, c16-math-0112, c16-math-0113, and c16-math-0114 [39].

The solvent accessibility (SA) values for solvent exposed area c16-math-0115 are c16-math-0116, c16-math-0117, c16-math-0118, c16-math-0119, c16-math-0120, c16-math-0121, c16-math-0122, c16-math-0123, c16-math-0124, c16-math-0125, c16-math-0126, c16-math-0127, c16-math-0128, c16-math-0129, c16-math-0130, c16-math-0131, c16-math-0132, c16-math-0133, c16-math-0134, and c16-math-0135 [56].

The Schneider–Wrede scale (SWH) values of the 20 kinds of amino acids are c16-math-0136, c16-math-0137, c16-math-0138, c16-math-0139, c16-math-0140, c16-math-0141, c16-math-0142, c16-math-0143, c16-math-0144, c16-math-0145, c16-math-0146, c16-math-0147, c16-math-0148, c16-math-0149, c16-math-0150, c16-math-0151, c16-math-0152, c16-math-0153, c16-math-0154, and c16-math-0155 [47]. Yang et al. [51] added a constant 12.30 to these values to make all the 20 values nonnegative, yielding the revised Schneider–Wrede scale hydrophobicity (RSWH).

Rose et al. [57] proposed different measures of hydrophobicity of proteins. They gave four kinds of values for surface area and hydrophobicity of each amino acid. We use c16-math-0156, the stochastic standard state accessibility, that is, the solvent accessible surface area (SASA) of a residue in standard state. The SASA of the 20 kinds of amino acids are c16-math-0157, c16-math-0158, c16-math-0159, c16-math-0160, c16-math-0161, c16-math-0162, c16-math-0163,c16-math-0164, c16-math-0165, c16-math-0166, c16-math-0167, c16-math-0168, c16-math-0169, c16-math-0170, c16-math-0171, c16-math-0172, c16-math-0173, c16-math-0174, c16-math-0175, and c16-math-0176.

Each amino acid can also be represented by the value of the volume of sidechains of amino acids [58]. These values are c16-math-0177, c16-math-0178, c16-math-0179, c16-math-0180, c16-math-0181, c16-math-0182, c16-math-0183, c16-math-0184, c16-math-0185, c16-math-0186, c16-math-0187, c16-math-0188, c16-math-0189, c16-math-0190, c16-math-0191, c16-math-0192 , c16-math-0193, c16-math-0194, c16-math-0195, and c16-math-0196.

We convert each amino acid to its hydrophobic free energy, SA, RSWH, SASA, and side-chain volume values along the protein sequence to obtain five different numerical sequences, and view them as time series.

Let c16-math-0197, c16-math-0198, be the time series with length c16-math-0199. First, we define c16-math-0200 as the frequency of c16-math-0201. It follows that c16-math-0202. Now we can define a measure c16-math-0203 on the interval c16-math-0204 by c16-math-0205, where

16.8 c16-math-0206

We denote the interval c16-math-0207 by c16-math-0208. It is seen that c16-math-0209 and c16-math-0210. We call c16-math-0211 the measure for the time series.

16.2.3 c16-math-0212-Curve Representation of Proteins

The concept of $Z$-curve representation of a DNA sequence was first proposed by Zhang and Zhang [59]. We proposed a similar concept for proteins [50]. Once we get the sequence $a = a_1 a_2 \cdots a_N$ for a protein, where $a_i$ is a letter of the alphabet $\{0, 1, 2, 3\}$ as in Section 16.2.1, we can define the $Z$-curve representation of this protein as follows. This $Z$ curve consists of a series of nodes $P_n$ ($n = 0, 1, \ldots, N$), whose coordinates are denoted by $x_n$, $y_n$, and $z_n$. These coordinates are defined as

16.9 $x_n = (a_n^{(0)} + a_n^{(1)}) - (a_n^{(2)} + a_n^{(3)}), \qquad y_n = (a_n^{(0)} + a_n^{(2)}) - (a_n^{(1)} + a_n^{(3)}), \qquad z_n = (a_n^{(0)} + a_n^{(3)}) - (a_n^{(1)} + a_n^{(2)})$

where $a_n^{(0)}, a_n^{(1)}, a_n^{(2)}, a_n^{(3)}$ denote the numbers of occurrences of the symbols 0, 1, 2, 3 in the prefix $a_1 a_2 \cdots a_n$, respectively, and $x_0 = y_0 = z_0 = 0$. The connection of the nodes $P_0, P_1, \ldots, P_N$ to one another by straight lines is defined as the $Z$-curve representation of this protein. We then define

16.10 $\Delta x_n = x_n - x_{n-1}, \qquad \Delta y_n = y_n - y_{n-1}, \qquad \Delta z_n = z_n - z_{n-1}$

where $\Delta x_n$, $\Delta y_n$, and $\Delta z_n$ can take only the values 1 and $-1$.
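The node construction can be sketched as follows, assuming a pairing of the four symbols per axis analogous to Zhang's DNA $Z$-curve (an illustrative choice: 0/1 vs. 2/3 along $x$, 0/2 vs. 1/3 along $y$, 0/3 vs. 1/2 along $z$); with any such pairing, each coordinate changes by $\pm 1$ at every step, as the text requires.

```python
def z_curve(codes):
    """Nodes of a Z-curve for a sequence over {0, 1, 2, 3}.  The grouping
    of symbols per axis (0/1 vs 2/3 for x, 0/2 vs 1/3 for y, 0/3 vs 1/2
    for z) is an illustrative analog of Zhang's DNA Z-curve; each step
    changes every coordinate by +1 or -1."""
    nodes = [(0, 0, 0)]              # P_0 at the origin
    x = y = z = 0
    for c in codes:
        x += 1 if c in (0, 1) else -1
        y += 1 if c in (0, 2) else -1
        z += 1 if c in (0, 3) else -1
        nodes.append((x, y, z))
    return nodes
```

Connecting consecutive nodes by straight lines yields the curve; the per-step increments are the $\Delta x_n$, $\Delta y_n$, $\Delta z_n$ series used later for detrended fluctuation analysis.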

16.2.4 Chaos Game Representation of Proteins and Related Time Series

Chaos game representation (CGR) of protein structures was first proposed by Fiser et al. [29]. We denote this CGR by 20-CGR as 20 kinds of letters are used to represent protein sequences. Later Basu et al. [60] and Yu et al. [31] proposed other kinds of CGRs for proteins, in which 12 and 4 kinds of letters were used for protein sequences, respectively. We denote them by 12-CGR and 4-CGR.

16.2.4.1 Reverse Encoding for Amino Acids

It is known that most amino acids are encoded by several synonymous codons. As a result, there are many possible nucleotide sequences for one given protein sequence. We used the encoding method proposed by Deschavanne and Tufféry [61], which is listed in Table 1 of our study [52]. Deschavanne and Tufféry [61] explained that the rationale for the choice of this fixed code is to keep a balance in base composition so as to maximize the difference between the amino acid codes.

After one protein sequence is transformed into a nucleotide sequence, we can use the CGR of nucleotide sequences [62] to analyze it; the CGR obtained is abbreviated AAD-CGR (amino acids to DNA CGR). The CGR for a nucleotide sequence is defined on the square $[0, 1] \times [0, 1]$, where the four vertices correspond to the four letters A, C, G, and T: the first point of the plot is placed halfway between the center of the square and the vertex corresponding to the first letter of the nucleotide sequence; the $i$th point of the plot is then placed halfway between the $(i-1)$th point and the vertex corresponding to the $i$th letter. The plot is then called the CGR of the nucleotide sequence, or the AAD-CGR of the protein sequence.
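The CGR construction can be sketched as follows; the particular corner assignment of A, C, G, T is an assumption of this illustration.

```python
def cgr_points(nt_seq):
    """Chaos game representation of a nucleotide sequence on the unit
    square.  The corner assignment of A, C, G, T is an assumption of
    this sketch; the first point lies halfway between the center and
    the first letter's vertex, as described in the text."""
    vertex = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    px, py = 0.5, 0.5                    # start from the center
    pts = []
    for nt in nt_seq:
        vx, vy = vertex[nt]
        px, py = (px + vx) / 2.0, (py + vy) / 2.0
        pts.append((px, py))
    return pts
```

The plot decomposes naturally into two time series, the $x$ and $y$ coordinates of the successive points, e.g. `xs = [p[0] for p in cgr_points(seq)]`.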

We can decompose the AAD-CGR plot into two time series [52]. Any point in the AAD-CGR plot is determined by its $x$ and $y$ coordinates. Because the AAD-CGR plot can be uniquely reconstructed from these two time series, all the information stored in the plot is contained in the time series, and the information in the plot comes from the primary sequence of the protein. Therefore, any analysis of the two time series is equivalent to an indirect analysis of the protein primary sequence. It is possible that such analysis provides better results than direct analysis of the protein primary sequences.

16.2.5 Time Series Based on 6-Letter Model, 12-Letter Model, and 20-Letter Model

According to the 6-letter model, the protein sequence can be represented by a numerical sequence $u = u_1 u_2 \cdots u_N$; here $u_i \in \{0, 1, 2, 3, 4, 5\}$ for each $i = 1, 2, \ldots, N$.

We now analogously define a 12-letter model and a 20-letter model. With the idea of chaos game representation based on a 12-sided regular polygon [60], we define the 12-letter model as c16-math-0242, c16-math-0243, c16-math-0244, c16-math-0245, c16-math-0246, c16-math-0247, c16-math-0248, c16-math-0249, c16-math-0250, c16-math-0251, c16-math-0252, c16-math-0253, c16-math-0254, c16-math-0255, c16-math-0256, c16-math-0257, c16-math-0258, c16-math-0259, c16-math-0260, and c16-math-0261 according to the order of vertices on the 12-sided regular polygon to represent these amino acids.

The 20-letter model can be similarly defined as c16-math-0262, c16-math-0263, c16-math-0264, c16-math-0265, c16-math-0266, c16-math-0267, c16-math-0268, c16-math-0269, c16-math-0270, c16-math-0271, c16-math-0272, c16-math-0273, c16-math-0274, c16-math-0275, c16-math-0276, c16-math-0277, c16-math-0278, c16-math-0279, c16-math-0280, and c16-math-0281 according to the dictionary order of the 3-letter representation of each amino acid listed in Brown's treatise [63].

These three models can be used to convert a protein sequence into three different numerical sequences (which can also be viewed as time series).

16.2.6 Iterated Function Systems Model

In order to simulate the measure representation of a protein sequence, we proposed the use of the iterated function system (IFS) model [30]. IFS is the name given by Barnsley and Demko [64] to a system of contractive maps $W = \{w_1, w_2, \ldots, w_N\}$. Let $E_0$ be a compact set in a compact metric space, $E_{\sigma_1 \sigma_2 \cdots \sigma_n} = w_{\sigma_1} \circ w_{\sigma_2} \circ \cdots \circ w_{\sigma_n}(E_0)$ and

$$E = \bigcap_{n=1}^{\infty} \; \bigcup_{\sigma_1, \ldots, \sigma_n \in \{1, 2, \ldots, N\}} E_{\sigma_1 \sigma_2 \cdots \sigma_n}$$

Then $E$ is called the attractor of the IFS. The attractor is usually a fractal set, and the IFS is a relatively general model that can generate many well-known fractal sets such as the Cantor set and the Koch curve. Given a set of probabilities $p_i > 0$ with $\sum_{i=1}^{N} p_i = 1$, we pick an initial point $x_0$ and define the iteration sequence

$$x_{n+1} = w_{\sigma_n}(x_n), \qquad n = 0, 1, 2, \ldots$$

where the indices $\sigma_n$ are chosen randomly and independently from the set $\{1, 2, \ldots, N\}$ with probabilities $P(\sigma_n = i) = p_i$. Then every orbit $\{x_n\}$ is dense in the attractor $E$ [64]. For $n$ sufficiently large, we can view the orbit $\{x_0, x_1, \ldots, x_n\}$ as an approximation of $E$. This process is called the chaos game.

Let $\chi_B$ be the characteristic function of a Borel subset $B \subset E$; then, from the ergodic theorem for IFS [64], the limit

$$\mu(B) = \lim_{n \to \infty} \frac{1}{n+1} \sum_{k=0}^{n} \chi_B(x_k)$$

exists. The measure $\mu$ is the invariant measure of the attractor of the IFS. In other words, $\mu(B)$ is the relative visitation frequency of $B$ during the chaos game. A histogram approximation of the invariant measure may then be obtained by counting the number of visits made to each pixel.

The coefficients in the contractive maps and the probabilities in the IFS model are the parameters to be estimated for a real measure that we want to simulate. A moment method [30, 65] can be used to perform this task.

From the measure representation of a protein sequence based on the detailed HP model, it is logical to choose $N = 4$ and

$$w_i(x) = \frac{x}{4} + \frac{i-1}{4}, \qquad i = 1, 2, 3, 4$$

in the IFS model. For a given measure representation of a protein sequence based on the detailed HP model, we obtain the estimated values of the probabilities $p_1, p_2, p_3, p_4$ by solving an optimization problem [65]. Using the estimated values of the probabilities, we can use the chaos game to generate a histogram approximation of the invariant measure of the IFS, which can then be compared with the real measure representation of the protein sequence.
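The chaos game for an IFS of four contractions of $[0, 1)$ (the natural choice for a four-letter alphabet) can be sketched as follows; the maps $w_i(x) = x/4 + (i-1)/4$, the bin count, and the seed are illustrative assumptions of this sketch.

```python
import random

def chaos_game(probs, n_iter=100000, n_bins=64, seed=0):
    """Chaos game for the IFS w_i(x) = x/4 + (i - 1)/4 on [0, 1),
    i = 1, ..., 4.  Returns a normalized histogram approximating the
    invariant measure; bin count and seed are illustrative choices."""
    rng = random.Random(seed)
    x = rng.random()
    bins = [0] * n_bins
    for _ in range(n_iter):
        i = rng.choices(range(4), weights=probs)[0]   # pick a map
        x = x / 4.0 + i / 4.0                         # apply w_{i+1}
        bins[min(int(x * n_bins), n_bins - 1)] += 1
    return [b / n_iter for b in bins]
```

With uniform probabilities the invariant measure is Lebesgue measure on $[0, 1)$, so the histogram is approximately flat; skewed probabilities produce the multiplicative-cascade-like histograms that are compared with the real measure representation.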

16.2.7 Detrended Fluctuation Analysis

The exponent in a detrended fluctuation analysis can be used to characterize the correlations of a time series [17, 18]. We view $\Delta x_n$, $\Delta y_n$, and $\Delta z_n$ ($n = 1, 2, \ldots, N$) in the $Z$-curve representation of proteins as time series. We denote such a time series by $x(i)$, $i = 1, \ldots, N$. First, the time series is integrated as $y(k) = \sum_{i=1}^{k} [x(i) - \bar{x}]$, where $\bar{x}$ is the average over the whole time period. Next, the integrated time series is divided into boxes of equal length $n$. In each box of length $n$, a linear regression is fitted to the data by least squares, representing the trend in that box. We denote the $y$ coordinate of the straight-line segments by $y_n(k)$. We then detrend the integrated time series $y(k)$ by subtracting the local trend $y_n(k)$ in each box. The root-mean-square fluctuation of this integrated and detrended time series is computed as

16.11 $F(n) = \sqrt{\frac{1}{N} \sum_{k=1}^{N} [y(k) - y_n(k)]^2}$

Typically, $F(n)$ increases with box size $n$. A linear relationship on a log–log plot indicates the presence of the scaling $F(n) \propto n^{\alpha}$. Under such conditions, the fluctuations can be characterized by the scaling exponent $\alpha$, the slope of the line in the regression of $\ln F(n)$ against $\ln n$. For uncorrelated data, the integrated time series $y(k)$ corresponds to a random walk, and therefore $\alpha = 0.5$. A value $\alpha > 0.5$ indicates the presence of long memory so that, for example, a large value is likely to be followed by large values. In contrast, the range $0 < \alpha < 0.5$ indicates a different type of power-law correlation such that positive and negative values of the time series are more likely to alternate. We consider the exponents $\alpha$ for the $\Delta x_n$, $\Delta y_n$, and $\Delta z_n$ ($n = 1, 2, \ldots, N$) of the $Z$-curve representation of protein sequences as candidates for constructing parameter spaces for proteins in this chapter. These exponents are denoted by $\alpha_x$, $\alpha_y$, and $\alpha_z$, respectively.
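The fluctuation $F(n)$ of Eq. (16.11) can be sketched as follows; discarding trailing points that do not fill a whole box is a simplifying assumption of this illustration.

```python
import math

def dfa_fluctuation(x, n):
    """Root-mean-square fluctuation F(n) of Eq. (16.11): integrate the
    series, fit a least-squares line in each box of length n, and take
    the RMS of the residuals.  Trailing points that do not fill a whole
    box are discarded (a simplifying assumption)."""
    mean = sum(x) / len(x)
    y, s = [], 0.0
    for v in x:                      # integrated series y(k)
        s += v - mean
        y.append(s)
    resid_sq, used = 0.0, 0
    for start in range(0, len(y) - n + 1, n):
        box = y[start:start + n]
        t = range(n)
        tbar, bbar = (n - 1) / 2.0, sum(box) / n
        num = sum((ti - tbar) * (bi - bbar) for ti, bi in zip(t, box))
        den = sum((ti - tbar) ** 2 for ti in t)
        slope = num / den            # local linear trend y_n(k)
        for ti, bi in zip(t, box):
            resid_sq += (bi - (bbar + slope * (ti - tbar))) ** 2
        used += n
    return math.sqrt(resid_sq / used)
```

The exponent $\alpha$ is then the slope of a least-squares fit of $\ln F(n)$ against $\ln n$ over a range of box sizes.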

16.2.8 Ordinary Multifractal Analysis

The most common algorithms of multifractal analysis are the so-called fixed-size box-counting algorithms [16]. In the one-dimensional case, for a given measure $\mu$ with support $E \subset \mathbb{R}$, we consider the partition sum

16.12 $Z_\epsilon(q) = \sum_{\mu(B) \neq 0} [\mu(B)]^q$

with $q \in \mathbb{R}$, where the sum runs over all different nonempty boxes $B$ of a given side $\epsilon$ in a grid covering of the support $E$, that is, $B = [k\epsilon, (k+1)\epsilon)$. The exponent $\tau(q)$ is defined by

16.13 $\tau(q) = \lim_{\epsilon \to 0} \frac{\ln Z_\epsilon(q)}{\ln \epsilon}$

and the generalized fractal dimensions of the measure are defined as

16.14 $D_q = \frac{\tau(q)}{q - 1} \qquad \text{for } q \neq 1$

16.15 $D_1 = \lim_{\epsilon \to 0} \frac{Z_{1,\epsilon}}{\ln \epsilon} \qquad \text{for } q = 1$

where $Z_{1,\epsilon} = \sum_{\mu(B) \neq 0} \mu(B) \ln \mu(B)$. The generalized fractal dimensions are numerically estimated through a linear regression of $\ln Z_\epsilon(q)/(q - 1)$ against $\ln \epsilon$ for $q \neq 1$, and similarly through a linear regression of $Z_{1,\epsilon}$ against $\ln \epsilon$ for $q = 1$. The value $D_1$ is called the information dimension and $D_2$ the correlation dimension.

The concept of phase transition in multifractal spectra was introduced in studies of logistic maps, Julia sets, and other simple systems. By following the thermodynamic formulation of multifractal measures, Canessa [66] derived an expression for the analogous specific heat as

16.16 $C_q = -\frac{\partial^2 \tau(q)}{\partial q^2} \approx 2\tau(q) - \tau(q+1) - \tau(q-1)$

He showed that the form of $C_q$ resembles a classical phase transition at a critical point for financial time series.

The singularities of a measure are characterized by the Lipschitz–Hölder exponent $\alpha(q)$, which is related to $\tau(q)$ by

16.17 $\alpha(q) = \frac{\mathrm{d}\tau(q)}{\mathrm{d}q}$

Substitution of Equation (16.13) into Equation (16.17) yields

16.18 $\alpha(q) \approx \frac{1}{\ln \epsilon} \, \frac{\sum_{\mu(B) \neq 0} [\mu(B)]^q \ln \mu(B)}{Z_\epsilon(q)}$

Again the exponent $\alpha(q)$ can be estimated through a linear regression of

$$\frac{\sum_{\mu(B) \neq 0} [\mu(B)]^q \ln \mu(B)}{Z_\epsilon(q)}$$

against $\ln \epsilon$ [35]. The multifractal spectrum $f(\alpha)$ versus $\alpha$ can then be calculated according to the relationship $f(\alpha(q)) = q\,\alpha(q) - \tau(q)$.
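The $D_q$ estimation can be sketched for a measure given as box masses on a dyadic grid; the pairwise aggregation into coarser boxes and the regression over all available scales are illustrative choices of this sketch.

```python
import math

def generalized_dimensions(mu, qs):
    """Estimate D_q for a measure given as box masses on the finest
    dyadic grid (length of `mu` a power of two).  Coarser boxes are
    formed by pairwise aggregation; D_q is the regression slope of
    ln Z_eps(q)/(q - 1) (or of sum mu ln mu when q = 1) on ln eps."""
    levels, cur = [], list(mu)
    while len(cur) >= 2:             # finest grid first
        levels.append(cur)
        cur = [cur[i] + cur[i + 1] for i in range(0, len(cur), 2)]
    out = []
    for q in qs:
        xs, ys = [], []
        for level in levels:
            eps = 1.0 / len(level)
            nz = [m for m in level if m > 0]
            if q == 1:
                ys.append(sum(m * math.log(m) for m in nz))
            else:
                ys.append(math.log(sum(m ** q for m in nz)) / (q - 1))
            xs.append(math.log(eps))
        n = len(xs)
        xb, yb = sum(xs) / n, sum(ys) / n
        slope = (sum((a - xb) * (b - yb) for a, b in zip(xs, ys))
                 / sum((a - xb) ** 2 for a in xs))
        out.append(slope)
    return out
```

As a sanity check, the uniform measure on $[0, 1)$ gives $D_q = 1$ for every $q$, while heterogeneous measures yield a decreasing $D_q$ curve.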

16.2.9 Analogous Multifractal Analysis

The analogous multifractal analysis (AMFA) is similar to multiaffinity analysis, and can be briefly sketched as follows [51]. We denote a time series as c16-math-0374. First, the time series is integrated as

16.19 c16-math-0375

16.20 c16-math-0376

where c16-math-0377 is the average over the whole time period. Then two quantities c16-math-0378 and c16-math-0379 are defined as

16.21 c16-math-0380

16.22 c16-math-0381

where c16-math-0382 denotes the average over c16-math-0383 c16-math-0384; c16-math-0385 typically varies from 1 to c16-math-0386 for which the linear fit is good. From the c16-math-0387c16-math-0388 and c16-math-0389c16-math-0390 planes, one can find the following relations:

16.23 c16-math-0391

16.24 c16-math-0392

Linear regressions of c16-math-0393 and c16-math-0394 against c16-math-0395 will result in the exponents c16-math-0396 and c16-math-0397, respectively.

16.2.10 Wavelet Spectrum

As the wavelet transform of a function can be considered an approximation of the function, wavelet algorithms process data at different scales or components. At each scale, many coefficients are obtained, and the wavelet spectrum is calculated on the basis of these coefficients. Hence the wavelet spectrum provides useful information for analyzing data. Given a function $f(t)$, one defines its wavelet transform as [19]

16.25 $W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi\!\left(\frac{t - b}{a}\right) \mathrm{d}t$

where $b$ is the position and $a$ is the scale. The scale $a = 2^j$ in the wavelet transform corresponds to the $j$th resolution of the data. Taking $a = 2^j$ and $b = k 2^j$, we get the wavelet spectrum as

$$E_j = \sum_{k} \left| W(2^j, k 2^j) \right|^2$$

where $j = 1, 2, \ldots, \log_2 N$.

For simplicity, the scales $a$ can instead be selected as $a = 1, 2, 3, \ldots$, which are more closely spaced and can be used to capture more details of the data. The wavelet spectrum is then calculated by summing the squares of the coefficients at each scale $a$. The local wavelet spectrum is defined through the modulus maxima of the coefficients [67] as c16-math-0411, where

equation

The maximum of the wavelet spectrum and the maximum of the local wavelet spectrum were applied in the prediction of structural classes and families of proteins [53].

In our work, we chose the Daubechies wavelet, which is commonly used as a signal processing tool. The Daubechies wavelets are compactly supported, with extremal phase and the highest number of vanishing moments for a given support width. They form orthogonal bases (with associated biorthogonal counterparts). The Daubechies wavelets can improve on the frequency-domain characteristics of other wavelets [19].
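A wavelet spectrum in the sense described above (sums of squared detail coefficients per scale) can be sketched with the Haar wavelet, the simplest member of the Daubechies family; the chapter's actual filter order is not reproduced here, and the power-of-two length requirement is a simplification of this sketch.

```python
import math

def haar_spectrum(x):
    """Per-scale wavelet energies (sums of squared detail coefficients)
    using the Haar wavelet, i.e. the first Daubechies member.  The
    input length is assumed to be a power of two; energies are returned
    finest scale first."""
    energies = []
    approx = list(x)
    while len(approx) >= 2:
        detail = [(approx[i] - approx[i + 1]) / math.sqrt(2)
                  for i in range(0, len(approx), 2)]
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2)
                  for i in range(0, len(approx), 2)]
        energies.append(sum(d * d for d in detail))
    return energies
```

The maximum entry of this list (and, with a modulus-maxima restriction, of the local spectrum) is the kind of feature used for the structural class and family predictions mentioned above.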

16.2.11 Recurrence Quantification Analysis

The recurrence plot (RP) is a purely graphical tool originally proposed by Eckmann et al. [21] to detect patterns of recurrence in the data. For a time series $\{x_i\}$ with length $N$, we can embed it into the space $\mathbb{R}^m$ with embedding dimension $m$ and time delay $\tau$. We write $X_i = (x_i, x_{i+\tau}, \ldots, x_{i+(m-1)\tau})$, where $i = 1, 2, \ldots, N_m$ and $N_m = N - (m - 1)\tau$. In this way we obtain $N_m$ vectors (points) in the embedding space $\mathbb{R}^m$. We gave some numerical explanations for the selection of $m$ and $\tau$ in our earlier paper [52].

From the $N_m$ points, we can calculate the distance matrix (DM), which is an $N_m \times N_m$ square matrix. The elements of DM are the distances between all possible pairs of points $X_i$ and $X_j$, computed according to the selected norm. Generally, the Euclidean norm is used [47]. The DM can be rescaled by dividing each element by a certain value, as this allows systems operating on different scales to be statistically compared. The maximum distance of the entire matrix is the most commonly used (and recommended) rescaling option, which redefines the DM over the unit interval (0.0–100.0%).

Once the rescaled DM $= (d_{ij})$ is calculated, it can be transformed into a recurrence matrix (RM) of distance elements within a threshold $\epsilon$ (namely, the radius): RM $= (r_{ij})$ with $r_{ij} = \Theta(\epsilon - d_{ij})$, where $\Theta$ is the Heaviside function

16.26 $\Theta(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$

The RP is simply a visualization of RM, obtained by plotting a point at $(i, j)$ for those elements of RM with value 1. If $r_{ij} = 1$, we say that the point $X_i$ recurs with reference to the point $X_j$. For any $i$, since $d_{ii} = 0$, the RP always has a black main diagonal line. Furthermore, the RP is symmetric with respect to the main diagonal, as $d_{ij} = d_{ji}$. The radius $\epsilon$ is a crucial parameter of the RP. If $\epsilon$ is chosen too small, there may be almost no recurrence points, and we will not be able to learn about the recurrence structure of the underlying system. On the other hand, if $\epsilon$ is chosen too large, almost every point is a neighbor of every other point, which leads to a large number of artifacts [68]. The selection of $\epsilon$ was discussed numerically in our earlier work [52].

Recurrence quantification analysis (RQA) is a relatively new nonlinear technique proposed by Zbilut and Webber [22, 23] that can quantify the information supplied by RP. Eight recurrence variables are usually used to quantify RP [68]. It should be pointed out that the recurrence points in the following definitions consist only of those in the upper triangle in RP (excluding the main diagonal line). The first recurrence variable is %recurrence (%REC). %REC is ameasure of the density of recurrence points in the RP. This variable can range from 0% (no recurrent points) to 100% (all points are recurrent). The second recurrence variable is %determinism (%DET). %DET measures the proportion of recurrent points forming diagonal line structures. For this variable, we have to first decide at least how many adjacent recurrent points are needed to define a diagonal line segment. Obviously, the minimum number required (and commonly used) is 2. The third recurrence variable is linemax (c16-math-0446), which is simply the length of the longest diagonal line segment in RP. This is a very important recurrence variable because it inversely scales with the largest positive Lyapunov exponent. The fourth recurrence variable is entropy (ENT), which is the Shannon information entropy of the distribution probability of the length of the diagonal lines. The fifth recurrence variable is trend (TND), which quantifies the degree of system stationarity. It is calculated as the slope of the least squares regression of %local recurrence as a function of the displacement from the main diagonal. It should be made clear that the so called %local recurrence is in fact the proportion of recurrent points on certain line parallel to the main diagonal over the length of this line. %recurrence is calculated on the whole upper triangle in RP while %local recurrence is computed on only certain lines in RP, so it is termed as local. Multiplying by 1000 increases the gain of the TND variable. 
The remaining three variables are defined on the basis of vertical line structures. The sixth recurrence variable is %laminarity (%LAM). %LAM is analogous to %DET but is calculated from recurrent points constituting vertical line structures. Similarly, we also take 2 as the minimum number of adjacent recurrent points forming a vertical line segment. The seventh variable, trapping time (TT), is the average length of the vertical line structures. The eighth recurrence variable is the maximal length of the vertical lines in the RP (V_max), which is analogous to L_max.
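Three of these recurrence variables (%REC, %DET, and L_max) can be computed from the upper triangle of a recurrence matrix as in the sketch below; the function name and the small test matrix are illustrative, not the authors' implementation:

```python
def rqa_measures(R, lmin=2):
    """%REC, %DET and L_max computed from the upper triangle of a
    recurrence matrix R, excluding the main diagonal."""
    n = len(R)
    total = n * (n - 1) // 2                 # points in the upper triangle
    rec = sum(R[i][j] for i in range(n) for j in range(i + 1, n))
    lines = []                               # diagonal line segment lengths
    for k in range(1, n):                    # k-th diagonal above the main one
        run = 0
        for i in range(n - k):
            if R[i][i + k]:
                run += 1
            else:
                if run:
                    lines.append(run)
                run = 0
        if run:
            lines.append(run)
    pct_rec = 100.0 * rec / total if total else 0.0
    pct_det = 100.0 * sum(l for l in lines if l >= lmin) / rec if rec else 0.0
    l_max = max(lines) if lines else 0
    return pct_rec, pct_det, l_max

# A small symmetric example with one 3-point diagonal line.
R = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 1, 1]]
pct_rec, pct_det, l_max = rqa_measures(R)
```

For this matrix, 3 of the 6 upper-triangle points are recurrent and all of them lie on one diagonal segment of length 3.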

16.2.12 Empirical Mode Decomposition and Similarity of Proteins

Empirical mode decomposition (EMD) was originally designed for nonlinear and nonstationary data analysis by Huang et al. [24]. The traditional EMD decomposes a time series into components called intrinsic mode functions (IMFs), which define meaningful instantaneous frequencies of a signal.

Lin et al. [25] proposed a new algorithm for EMD. Instead of using the envelopes generated by splines, a lowpass filter is used to generate a moving average that replaces the mean of the envelopes. The essence of the sifting algorithm remains. Let L be a lowpass filter operator, for which L(f) represents a moving average of f. We now define T(f) = f − L(f). In this approach, the lowpass filter L depends on the data f. For a given f, we choose a lowpass filter L_1 accordingly and set T_1 = Id − L_1, where Id is the identity operator. The first IMF in the new EMD is given by I_1 = lim_{n→∞} T_1^n(f), and subsequently the jth IMF I_j is obtained by first selecting a lowpass filter L_j according to the data r_j = f − I_1 − ⋯ − I_{j−1} and then iterating, I_j = lim_{n→∞} T_j^n(r_j). The process stops when the residue has at most one local maximum or local minimum. Lin et al. [25] suggested using a filter L given by a weighted moving average defined by a mask. We selected the mask

equation

in our work [53].

Let r_{j+1} = r_j − I_j, with r_1 = f. The original signal can then be expressed as f = I_1 + I_2 + ⋯ + I_m + r_{m+1}, where the number m of IMFs can be chosen according to a standard deviation criterion. In our work, the number of IMF components was set to 4 due to the short length of some amino acid sequences [53].
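The filter-based sifting described above can be sketched as follows, assuming a simple centered moving average as the lowpass filter and a fixed number of sifting steps in place of the limit; all names and parameter choices are illustrative, not those of Refs. [25, 53]:

```python
def moving_average(f, m):
    """Lowpass filter L: centered moving average with half-width m."""
    n = len(f)
    return [sum(f[max(0, i - m):min(n, i + m + 1)]) /
            (min(n, i + m + 1) - max(0, i - m)) for i in range(n)]

def count_extrema(f):
    """Number of interior local maxima and minima of f."""
    return sum(1 for i in range(1, len(f) - 1)
               if (f[i] - f[i - 1]) * (f[i + 1] - f[i]) < 0)

def iterative_filtering_emd(f, m=2, n_sift=10, max_imfs=4):
    """Each IMF is extracted by repeatedly applying T = Id - L to the
    current residue (a fixed number of sifting steps stands in for the
    limit); decomposition stops when the residue has <= 1 extremum."""
    imfs, residue = [], list(f)
    while len(imfs) < max_imfs and count_extrema(residue) > 1:
        h = list(residue)
        for _ in range(n_sift):
            avg = moving_average(h, m)
            h = [a - b for a, b in zip(h, avg)]
        imfs.append(h)
        residue = [a - b for a, b in zip(residue, h)]
    return imfs, residue

# By construction, the IMFs and the final residue sum back to the signal.
signal = [((i % 7) - 3) * 0.5 + 0.01 * i for i in range(60)]
imfs, res = iterative_filtering_emd(signal)
```

The cap of four IMFs mirrors the choice made in the text for short amino acid sequences.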

The similarity value of two proteins at each component (IMF) is obtained as the maximum absolute value of the correlation coefficient over all lags. In our work [53], a new cross-correlation coefficient C(τ) is defined by

16.27 c16-math-0474

where m(τ) is the length of the intersection of the two signals at lag τ, n_1 is the length of signal f_1, and n_2 is the length of signal f_2. The maximum absolute value C of all the correlation coefficients of the components is taken as the similarity value for the two proteins.
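Since Eq. (16.27) is not reproduced here, the sketch below uses the ordinary Pearson correlation on the overlapping part of the two signals at each lag as a stand-in; taking the maximum absolute value over lags follows the same idea as in the text:

```python
def pearson(x, y):
    """Ordinary correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    dx = sum((a - mx) ** 2 for a in x) ** 0.5
    dy = sum((b - my) ** 2 for b in y) ** 0.5
    return num / (dx * dy) if dx and dy else 0.0

def max_lagged_correlation(x, y, max_lag=5, min_overlap=3):
    """Slide y against x, correlate the overlapping parts at each lag,
    and return the maximum absolute correlation coefficient."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        xs, ys = (x[lag:], y) if lag >= 0 else (x, y[-lag:])
        m = min(len(xs), len(ys))
        if m >= min_overlap:
            best = max(best, abs(pearson(xs[:m], ys[:m])))
    return best
```

Applied per IMF, the maximum of these values over all components would play the role of the similarity value C.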

16.3 Results and Conclusions

In our earlier paper [50], we selected the amino acid sequences of 43 large proteins from the RCSB Protein Data Bank (http://www.rcsb.org/pdb/index.html). These 43 proteins belong to four structural classes according to their secondary structures. Our analysis proceeded in the following steps:

1. We converted the amino acid sequences of these proteins into their measure representations based on the detailed HP model with c16-math-0481. We found that the IFS model corresponding to c16-math-0482 is a good model for simulating the measure representation of protein sequences, and the estimated value of the probability c16-math-0483 from the IFS model contains information useful for the secondary structural classification of proteins [30]. We performed an IFS simulation for the proteins selected and adopted the estimated parameter c16-math-0484 as one parameter to construct the parameter space for proteins.
2. We converted the amino acid sequences of these proteins to their Z curve representations and performed their detrended fluctuation analysis. The exponents c16-math-0485 were estimated and used as candidate parameters to construct the parameter space.
3. We computed the generalized fractal dimensions c16-math-0486 and the related spectra c16-math-0487, multifractal spectra c16-math-0488 of hydrophobic free energy sequences and solvent accessibility sequences of all 43 proteins.
4. For a structural classification of proteins, we considered the following parameters as candidates for constructing parameter spaces: c16-math-0489 from the IFS estimations of the measure representations; the exponents c16-math-0490 from the detrended fluctuation analysis of the Z curve representations; the range of c16-math-0491 (i.e., the value c16-math-0492c16-math-0493 in our frame); the maximum value of c16-math-0494 (denoted c16-math-0495); the value c16-math-0496 of c16-math-0497 that corresponds to the maximum value of c16-math-0498; the maximum value of c16-math-0499 (denoted c16-math-0500); the minimum value of c16-math-0501 (denoted c16-math-0502); and c16-math-0503 (defined by c16-math-0504) from the multifractal analysis of the hydrophobic free-energy sequences and solvent accessibility sequences of proteins.
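As an example of the detrended fluctuation analysis used in step 2, the scaling exponent can be estimated as the log-log slope of the detrended fluctuation function against the box size. The sketch below is illustrative (linear detrending in nonoverlapping boxes), not the implementation of Ref. [50]:

```python
import math
import random

def linfit(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def dfa_exponent(series, scales=(4, 8, 16, 32)):
    """DFA: integrate the mean-centered series, linearly detrend it in
    nonoverlapping boxes of size s, and return the log-log slope of the
    RMS fluctuation F(s) against s."""
    mean = sum(series) / len(series)
    profile, acc = [], 0.0
    for v in series:
        acc += v - mean
        profile.append(acc)
    log_s, log_f = [], []
    for s in scales:
        sq = []
        for start in range(0, len(profile) - s + 1, s):
            box = profile[start:start + s]
            xs = list(range(s))
            a, b = linfit(xs, box)
            sq.append(sum((y - (a * x + b)) ** 2
                          for x, y in zip(xs, box)) / s)
        log_s.append(math.log(s))
        log_f.append(0.5 * math.log(sum(sq) / len(sq)))
    slope, _ = linfit(log_s, log_f)
    return slope

# Uncorrelated noise should give an exponent near 0.5.
random.seed(0)
alpha = dfa_exponent([random.random() for _ in range(1000)])
```

Exponents above 0.5 indicate persistent long-range correlations, which is the property exploited for the Z curve representations.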

In a parameter space, one point represents a protein. We wanted to determine whether the proteins can be separated into their four structural classes in these parameter spaces. We proposed a method consisting of three steps to cluster proteins [50]. We used Fisher's linear discriminant algorithm to give a quantitative assessment of our clustering on the selected proteins. The discriminant accuracies are satisfactory. In particular, they reach 94.12% and 88.89% in separating c16-math-0505 proteins from c16-math-0506 proteins in a 3D space.

In Ref. [51], we considered a set of 49 large proteins that included the 43 proteins studied earlier [50]. Given the amino acid sequence of a protein, we first converted it into its measure representation c16-math-0507 based on the six-letter model with length c16-math-0508. Then we calculated c16-math-0509 and c16-math-0510 for the measures c16-math-0511 of the 49 selected proteins. We next converted the amino acid sequences of the proteins into their RSWH sequences according to the revised Schneider–Wrede hydrophobicity scale and used these sequences to construct the measures c16-math-0512. The ordinary multifractal analysis was then performed on these measures, and the AMFA was also performed on the RSWH sequences. Nine parameters from these analyses were selected as candidates for constructing parameter spaces. We proposed another three-step procedure to cluster protein structures [51]. Fisher's linear discriminant algorithm was used to assess our clustering accuracy on the 49 selected large proteins. The discriminant accuracies are satisfactory. In particular, they reach c16-math-0513 and c16-math-0514 in separating the c16-math-0515 proteins from the c16-math-0516 proteins in one parameter space; c16-math-0517 and c16-math-0518 in separating the c16-math-0519 proteins from the c16-math-0520 proteins in another parameter space; and c16-math-0521 and c16-math-0522 in separating the c16-math-0523 proteins from the c16-math-0524 proteins in the last parameter space.

In Ref. [52], we aimed to predict protein structural classes (α, β, α + β, or α/β) for low-similarity datasets. Two widely used datasets were considered: 1189 (containing 1092 proteins) and 25PDB (containing 1673 proteins), with sequence similarity values of 40% and 25%, respectively. We proposed decomposing the chaos game representation of proteins into two kinds of time series and then applied recurrence quantification analysis to these time series. For a given protein sequence, a total of 16 characteristic parameters can be calculated with RQA, and these are treated as the feature representation of the protein. On the basis of this feature representation, the structural class of each protein was predicted with Fisher's linear discriminant algorithm. The overall accuracies with the step-by-step procedure are 65.8% and 64.2% for the 1189 and 25PDB datasets, respectively. With the widely used one-against-others procedure, we compared our method with five other existing methods. In particular, the overall accuracies of our method are 6.3% and 4.1% higher for the two datasets, respectively. Furthermore, only 16 parameters were used in our method, fewer than the numbers used by the other methods.
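Fisher's linear discriminant for a one-against-others split can be sketched as follows for two features: the discriminant direction is w = S_W^{-1}(m_a − m_b), and samples are assigned by projecting onto w and thresholding at the midpoint of the projected class means. The data below are synthetic, and the two-feature restriction (so the 2×2 scatter can be inverted by hand) is an assumption of this sketch:

```python
def fisher_direction(class_a, class_b):
    """Fisher discriminant direction w = S_W^{-1} (m_a - m_b) for
    2-feature samples; the 2x2 within-class scatter is inverted by hand."""
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]
    def scatter(pts, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    S = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    dm = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

# Synthetic two-class data ("one class" vs "the others").
class_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class_b = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
w = fisher_direction(class_a, class_b)
proj = lambda p: w[0] * p[0] + w[1] * p[1]
threshold = (proj((0.5, 0.5)) + proj((5.5, 5.5))) / 2  # midpoint of class means
```

In the chapter's setting the feature vector has 16 RQA-derived components rather than 2, but the projection-and-threshold logic is the same.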

Family identification is helpful in predicting protein functions. Since most protein sequences are relatively short, we first randomly linked protein sequences from the same family or superfamily together to form 120 longer protein sequences [53], with each structural class containing 30 linked protein sequences. We then used the 6-letter model, the 12-letter model, the 20-letter model, the revised Schneider–Wrede hydrophobicity scale, solvent accessibility, and stochastic standard state accessibility values to convert the linked protein sequences into numerical sequences. Next we calculated the generalized fractal dimensions c16-math-0529, the related spectra c16-math-0530, the multifractal spectra c16-math-0531, and the c16-math-0532 curves of the six kinds of numerical sequences of all 120 linked proteins. The curves of c16-math-0533 and c16-math-0534 showed that the numerical sequences from linked proteins are multifractal-like and sufficiently smooth. The c16-math-0535 curves resemble a phase transition at a certain point, while the c16-math-0536 and c16-math-0537 curves indicate the multifractal scaling features of proteins. In wavelet analysis, the choice of wavelet function should be considered carefully, since different wavelet functions represent a given function with different approximation components. We [53] chose the commonly used Daubechies wavelet db2 and computed the maximum of the wavelet spectrum and the maximum of the local wavelet spectrum for the six kinds of numerical sequences of all 120 linked proteins. The parameters from the multifractal and wavelet analyses were used to construct parameter spaces in which each linked protein is represented by a point. The four classes of proteins were then distinguished in these parameter spaces. The discriminant accuracies obtained through Fisher's linear discriminant algorithm are satisfactory in separating these classes.
We found that the linked proteins from the same family or superfamily tend to group together and can be separated from other linked proteins. The methods are also helpful to identify the family of an unknown protein.
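The wavelet-spectrum features can be illustrated with a dependency-free sketch. The chapter used the Daubechies db2 wavelet; the Haar wavelet is substituted here only to keep the example self-contained, and the energy of the detail coefficients per level serves as a crude wavelet spectrum:

```python
import math

def haar_wavelet_energies(signal):
    """Full Haar decomposition; returns the detail-coefficient energy at
    each level (a crude wavelet spectrum) plus the final approximation."""
    a = list(signal)
    energies = []
    while len(a) >= 2:
        if len(a) % 2:
            a = a[:-1]          # drop a trailing sample if length is odd
        approx = [(a[2*i] + a[2*i+1]) / math.sqrt(2) for i in range(len(a)//2)]
        detail = [(a[2*i] - a[2*i+1]) / math.sqrt(2) for i in range(len(a)//2)]
        energies.append(sum(d * d for d in detail))
        a = approx
    return energies, a

# The maximum of the per-level energies is one candidate feature
# per numerical sequence.
sig = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
energies, approx = haar_wavelet_energies(sig)
feature = max(energies)
```

Because the Haar transform is orthonormal, the level energies plus the final approximation energy recover the total energy of the signal, which makes the decomposition easy to sanity-check.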

Zhu et al. [54] applied component similarity analysis based on EMD and the new cross-correlation coefficient formula (16.27) to protein pairs. They then took the maximum absolute value c16-math-0538 of all the correlation coefficients of the components as the similarity value for two proteins. They also set thresholds on the correlation [54]: two signals are considered strongly correlated if the correlation coefficient exceeds c16-math-0539 and weakly correlated if the coefficient is between c16-math-0540 and c16-math-0541. The results showed that the functional relationships of some proteins may be revealed by component analysis of their IMFs. Compared with traditional alignment methods, component analysis can be evaluated and described easily. This illustrates that EMD and component analysis can complement traditional sequence similarity approaches that focus on the alignment of amino acids.

From our analyses, we found that the measure representations based on the detailed HP model and the six-letter model, the time series representations based on physicochemical features of amino acids, the Z-curve representation, and the chaos game representation of proteins can provide much information for predicting the structural classes and functions of proteins. Fractal methods are useful for analyzing protein sequences, and our methods may play a role complementary to existing methods.

Acknowledgment

This work was supported by the Natural Science Foundation of China (Grant 11071282); the Chinese Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) (Grant No. IRT1179); the Research Foundation of Education Commission of Hunan Province, China (Grant 11A122); the Lotus Scholars Program of Hunan Province of China; the Aid program for Science and Technology Innovative Research Team in Higher Educational Institutions of Hunan Province of China; and the Australian Research Council (Grant DP0559807).

References

1. Hou J, Jun S-R, Zhang C, Kim S-H, Global mapping of the protein structure space and application in structure-based inference of protein function, Proc. Natl. Acad. Sci. USA 102:3651–3656 (2005).

2. Anfinsen C, Principles that govern the folding of protein chains, Science 181:223–230 (1973).

3. Chothia C, One thousand families for the molecular biologists, Nature (Lond.) 357:543–544 (1992).

4. Frishman D, Argos P, Knowledge-based protein secondary structure assignment, Proteins 23:566–579 (1995).

5. Crooks GE, Brenner SE, Protein secondary structure: Entropy, correlation and prediction, Bioinformatics 20:1603–1611 (2004).

6. Adamczak R, Porollo A, Meller J, Combining prediction of secondary structure and solvent accessibility in proteins, Proteins 59:467–475 (2005).

7. Levitt M, Chothia C, Structural patterns in globular proteins, Nature 261:552–558 (1976).

8. Service RF, A dearth of new folds, Science 307:1555 (2005).

9. Chou KC, Progress in protein structural class prediction and its impact to bioinformatics and proteomics, Curr. Protein Peptide Sci. 6(5):423–436 (2005).

10. Trad CH, Fang Q, Cosic I, Protein sequence comparison based on the wavelet transform approach, Protein Eng. 15(3):193–203 (2002).

11. Lesk AM, Computational Molecular Biology: Sources and Methods for Sequence Analysis, Oxford Univ. Press, 1988.

12. Mandelbrot BB, The Fractal Geometry of Nature, Academic Press, New York, 1983.

13. Feder J, Fractals, Plenum Press, New York, 1988.

14. Grassberger P, Procaccia I, Characterization of strange attractors, Phys. Rev. Lett. 50:346–349 (1983).

15. Yu ZG, Anh V, Eastes R, Multifractal analysis of geomagnetic storm and solar flare indices and their class dependence, J. Geophys. Res. 114:A05214 (2009).

16. Yu ZG, Anh VV, Lau KS, Measure representation and multifractal analysis of complete genome, Phys. Rev. E 64:031903 (2001).

17. Yu ZG, Anh VV, Wang B, Correlation property of length sequences based on global structure of complete genome, Phys. Rev. E 63:011903 (2001).

18. Peng CK, Buldyrev S, Goldberg AL, Havlin S, Sciortino F, Simons M, Stanley HE, Long-range correlations in nucleotide sequences, Nature 356:168–170 (1992).

19. Chui CK, An Introduction to Wavelets, Academic Press Professional, San Diego, 1992.

20. Daubechies I, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992.

21. Eckmann JP, Kamphorst SO, Ruelle D, Recurrence plots of dynamical systems, Europhys. Lett. 4:973–977 (1987).

22. Zbilut JP, Webber CL Jr, Embeddings and delays as derived from quantification of recurrence plots, Phys. Lett. A 171:199–203 (1992).

23. Webber CL Jr, Zbilut JP, Dynamical assessment of physiological systems and states using recurrence plot strategies, J. Appl. Physiol. 76:965–973 (1994).

24. Huang N, Shen Z, Long SR, Wu ML, Shih HH, Zheng Q, Yen NC, Tung CC, Liu HH, The empirical mode decomposition and Hilbert spectrum for nonlinear and nonstationary time series analysis, Proc. Roy. Soc. Lond. A 454:903–995 (1998).

25. Lin L, Wang Y, Zhou H, Iterative filtering as an alternative for empirical mode decomposition, Adv. Adapt. Data Anal. 1(4):543–560 (2009).

26. Janosi IM, Muller R, Empirical mode decomposition and correlation properties of long daily ozone records, Phys. Rev. E 71:056126 (2005).

27. Yu ZG, Anh V, Wang Y, Mao D, Wanliss J, Modeling and simulation of the horizontal component of the geomagnetic field by fractional stochastic differential equations in conjunction with empirical mode decomposition, J. Geophys. Res. 115:A10219 (2010).

28. Dewey TG, Fractal analysis of proton-exchange kinetics in lysozyme, Proc. Natl. Acad. Sci. USA 91:12101–12104 (1994).

29. Fiser A, Tusnady GE, Simon I, Chaos game representation of protein structure, J. Mol. Graphics 12:302–304 (1994).

30. Yu ZG, Anh VV, Lau KS, Fractal analysis of large proteins based on the detailed HP model, Physica A 337:171–184 (2004).

31. Yu ZG, Anh VV, Lau KS, Chaos game representation, and multifractal and correlation analysis of protein sequences from complete genome based on detailed HP model, J. Theor. Biol. 226:341–348 (2004).

32. Enright MB, Leitner DM, Mass fractal dimension and the compactness of proteins, Phys. Rev. E 71:011912 (2005).

33. Moret MA, Miranda JGV, Nogueira E, et al, Self-similarity and protein chains, Phys. Rev. E 71:012901 (2005).

34. Pavan YS, Mitra CK, Fractal studies on the protein secondary structure elements, Indian J. Biochem. Biophys. 42:141–144 (2005).

35. Balafas JS, Dewey TG, Multifractal analysis of solvent accessibilities in proteins, Phys. Rev. E 52:880–887 (1995).

36. Yu ZG, Anh VV, Lau KS, Multifractal and correlation analysis of protein sequences from complete genome, Phys. Rev. E 68:021913 (2003).

37. Mandell AJ, Selz KA, Shlesinger MF, Mode matches and their locations in the hydrophobic free energy sequences of peptide ligands and their receptor eigenfunctions, Proc. Natl. Acad. Sci. USA 94:13576–13581 (1997).

38. Mandell AJ, Selz KA, Shlesinger MF, Wavelet transformation of protein hydrophobicity sequences suggests their memberships in structural families, Physica A 244:254–262 (1997).

39. Selz KA, Mandell AJ, Shlesinger MF, Hydrophobic free energy eigenfunctions of pore, channel, and transporter proteins contain β-burst patterns, Biophys. J. 75:2332–2342 (1998).

40. Hirakawa H, Muta S, Kuhara S, The hydrophobic cores of proteins predicted by wavelet analysis, Bioinformatics 15:141–148 (1999).

41. Chen H, Gu F, Liu F, Predicting protein secondary structure using continuous wavelet transform and Chou–Fasman method, Proc. 27th Annual IEEE, Engineering in Medicine and Biology Conf. 2005.

42. Marsolo K, Ramamohanarao K, Structure-based on querying of proteins using wavelets, Proc.15th Int. ACM Conf. Information and Knowledge Management, 2006, pp. 24–33.

43. Qiu JD, Liang RP, Zou XY, Mo JY, Prediction of protein secondary structure based on continuous wavelet transform, Talanta 61(3):285–293 (2003).

44. Rezaei M, Abdolmaleki P, Jahandideh S, Karami Z, Asadabadi EB, Sherafat MA, Abrishami-Moghaddam H, Fadaie M, Foroozan M, Prediction of membrane protein types by means of wavelet analysis and cascaded neural networks, J. Theor. Biol. 254(4):817–820 (2008).

45. Pando J, Sands L, Shaheen SE, Detection of protein secondary structures via the discrete wavelet transform, Phys. Rev. E 80:051909 (2009).

46. Webber CL Jr, Giuliani A, Zbilut JP, Colosimo A, Elucidating protein secondary structures using alpha-carbon recurrence quantifications, Proteins: Struct., Funct., Genet. 44:292–303 (2001).

47. Giuliani A, Benigni R, Zbilut JP, Webber CL Jr., Sirabella P, Colosimo A, Signal analysis methods in elucidation of protein sequence-structure relationships, Chem. Rev. 102:1471–1491 (2002).

48. Shi F, Chen Q, Niu X, Functional similarity analyzing of protein sequences with empirical mode decomposition, Proc. 4th Int. Conf. Fuzzy Systems and Knowledge Discovery, 2007.

49. Shi F, Chen QJ, Li NN, Hilbert Huang transform for predicting proteins subcellular location, J. Biomed. Sci. Eng. 1:59–63 (2008).

50. Yu ZG, Anh VV, Lau KS, Zhou LQ, Fractal and multifractal analysis of hydrophobic free energies and solvent accessibilities in proteins, Phys. Rev. E. 73:031920 (2006).

51. Yang JY, Yu ZG, Anh V, Clustering structure of large proteins using multifractal analyses based on 6-letters model and hydrophobicity scale of amino acids, Chaos, Solitons, Fractals 40:607–620 (2009).

52. Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D, Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation, J. Theor. Biol. 257:618–626 (2009).

53. Zhu SM, Yu ZG, Anh V, Protein structural classification and family identification by multifractal analysis and wavelet spectrum, Chin. Phys. B 20:010505 (2011).

54. Zhu SM, Yu ZG, Anh V, Yang SY, Analysing the similarity of proteins based on a new approach to empirical mode decomposition, Proc. 4th Int. Conf. Bioinformatics and Biomedical Engineering (ICBBE2010), vol. 1, (2010).

55. Chou PY, Fasman GD, Prediction of protein conformation, Biochemistry 13:222–245 (1974).

56. Bordo D, Argos P, Suggestions for safe residue substitutions in site-directed mutagenesis, J. Mol. Biol. 217:721–729 (1991).

57. Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH, Hydrophobicity of amino acid residues in globular proteins, Science 229:834–838 (1985).

58. Krigbaum WR, Komoriya A, Local interactions as a structure determinant for protein molecules: II, Biochim. Biophys. Acta 576:204–228 (1979).

59. Zhang R, Zhang CT, Z curves, an intuitive tool for visualizing and analyzing the DNA sequences, J. Biomol. Struct. Dyn. 11:767–782 (1994).

60. Basu S, Pan A, Dutta C, Das J, Chaos game representation of proteins, J. Mol. Graphics 15:279–289 (1997).

61. Deschavanne P, Tufféry P, Exploring an alignment free approach for protein classification and structural class prediction, Biochimie 90:615–625 (2008).

62. Jeffrey HJ, Chaos game representation of gene structure, Nucleic Acids Res, 18:2163–2170 (1990).

63. Brown TA, Genetics, 3rd ed., Chapman & Hall, London, 1998.

64. Barnsley MF, Demko S, Iterated function systems and the global construction of fractals, Proc. Roy. Soc. (Lond.) 399:243–275 (1985).

65. Vrscay ER, Iterated function systems: Theory, applications and the inverse problem, in Belair J, Dubuc S. (eds.), Fractal Geometry and Analysis, NATO ASI series, Kluwer, 1991.

66. Canessa E, Multifractality in time series, J. Phys. A: Math. Gen. 33:3637–3651 (2000).

67. Arneodo A, Bacry E, Muzy JF, The thermodynamics of fractals revisited with wavelets, Physica A 213:232–275 (1995).

68. Marwan N, Romano MC, Thiel M, Kurths J, Recurrence plots for the analysis of complex systems, Phys. Reports 438:237–329 (2007).
