Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 24 Finding repetitions in biological networks: challenges, trends, and applications

SIMONA E. ROMBO

24.1 Introduction

For many years, analysis of biological sequences (also termed biosequences) associated with proteins and genomes played a key role in understanding the mechanisms inside the cell [7, 47, 49]. After the genome coding of several organisms was completed [13], significant attention began to focus on studying how cellular components interact with each other to accomplish the biological functions of the cell [59].

While biosequences are usually represented by strings defined on a finite alphabet, where symbols are associated with amino acids or nucleotides, interaction data can be instead modeled by graphs, called biological networks, where nodes represent components and edges their interactions. The set of all the protein–protein interactions (PPIs) of a specific organism represents its interactome.

Despite the different models adopted to analyze biological data, there is a common peculiarity characterizing them; this is termed their intrinsic repetitiveness. The presence of repetitions can be considered biologically interesting in many cases, for example, when the presence of repeted elements is discovered among cells belonging to different organisms, or if a specific feature appears several times in the same cell. Often suitable statistical indices can be usefully exploited in order to characterize the significativeness of the repetitions found [51].

Searching for common substrings in a set of biosequences can help classify them and predict their biological function [36]. On the other hand, the presence of repeated modules across biological networks was also shown to be very relevant. Indeed, several studies proved that biological networks can often be understood in terms of coalitions of basic repeated building blocks [40, 44]. Discovering similar subnetworks in the interactome of different organisms is useful in revealing complex mechanisms at the basis of evolutionary conservations, and to infer the biological meaning of groups of interacting cellular components, possibly belonging to organisms not yet wellcharacterized [53].

The problem of finding interesting repetitions in biological data can be formulated according to different perspectives. In general, if $c24-math-0001$ is a suitable data structure storing the specific biological data under consideration (e.g., a string or a graph), we can distinguish the following three main classes of problems:

1. Given a substructure $c24-math-0002$ (e.g., a substring or a subgraph), where some elements can also be unspecified, find all portions of $c24-math-0003$ that are similar to $c24-math-0004$ , according to some specific similarity function.

2. Find a set of significative objects $c24-math-0005$ , not specified a priori, such that each $c24-math-0006$ is repeated in $c24-math-0007$ a number of times over a fixed threshold. A limited number of mismatches are usually allowed in the repetitions.

3. Given a set $c24-math-0008$ of data structures; find a set of significative objects $c24-math-0009$ , not specified a priori, such that each $c24-math-0010$ is repeated (but for possible mismatches) in $c24-math-0011$ of the $c24-math-0012$ , with $c24-math-0013$ over a fixed threshold.

In both the second and the third formulations of the problem, instead of considering as a gauge of interestingness the frequency of repetitions, other statistical evaluations, such as expectation or $c24-math-0014$ score, can be suitably exploited as well.

From sequences to networks, the difficulty in solving the problems listed above increases. While most exact algorithms searching for repetitions in biosequences need polynomial time (despite the number of repetitions that can be generated), the case of biological networks is much more complicated, often involving subgraph isomorphism checking, which is known to be NP-complete [27]. Therefore, using exact algorithms to search for repetitions in biological networks is seldom feasible, except in a few cases, for which fixed-parameter tractability (FPT) algorithms were designed [e.g., 6,15]. However, such exact approaches are suitable only for specific contexts, where the most general techniques are based on approximate and heuristic approaches [e.g., 38,56].

The main goal of this chapter is to analyze how the general problem of finding repetitions in biological data evolved from sequences to networks data, by focusing on the open challenges and specific applications in biological networks.

In particular, we first recall some basic notions about the domain under consideration, that is, biological networks (Section 24.2). Then, in Section 24.3, we describe in detail how the three problem formulations listed above can be specialized in both the biological sequence and network domains, and we also provide an overview of the main applications in biological networks (Section 24.4). Finally, in Section 24.5 we draw our conclusive remarks.

24.2 The Biological Networks Domain

Since the 1990s, both high-throughput experimental techniques [28, 34] and computational methods [43, 59] have contributed to collection of cellular component interactions, stored in public databases [e.g., 8,30,42]. To model such biological information, suitable networks where interacting components are linked together have been exploited.

Three main types of biological networks have been studied in the literature: protein–protein interaction (PPI) networks, metabolic networks, and gene regulatory networks. All these networks may be represented as graphs $c24-math-0015$ , where $c24-math-0016$ is a set of nodes and $c24-math-0017$ is a set of (directed/indirect) edges.

In the following sections, we briefly recall the main characteristics of each kind of biological network.

24.2.1 Protein–Protein Interaction Networks

Proteins are the basic constituents of living beings. It has been shown that studying how proteins interact inside the cell is necessary to understanding the biological processes in which they are involved [59]. Many PPIs have been discovered in the last years [28, 34, 43].

The set of all the PPIs of a given organism is its interactome, usually modeled by an indirect graph, called a protein–protein interaction network (PPI network), where nodes represent involved proteins and edges encode their interactions. Figure 24.1 illustrates such a representation for a small portion of the Saccharomyces cerevisiae interactome.

Figure 24.1 A small portion of the Saccharomyces cerevisiae interactome, drawn by using PIVOT [45].

Often, a PPI network is represented by its adjacency matrix, that is, a binary matrix $c24-math-0018$ where both rows and columns are associated with the proteins in the network, such that $c24-math-0019$ if protein $c24-math-0020$ interacts with protein $c24-math-0021$ , and $c24-math-0022$ otherwise.

24.2.2 Metabolic Networks

Metabolic networks model the whole set of metabolic and physical processes on the basis of the physiological and biochemical properties of the cell, such as the chemical reactions of metabolisms and the corresponding regulatory interactions. Metabolic networks are often employed to simulate and understand the molecular mechanisms that regulate the organisms, often correlating the genome with aspects of molecular physiology [25, 58].

A metabolic network is often modeled by a bipartite graph, where the two types of nodes represent, respectively, reactions and chemical compounds (see Fig. 24.2). Alternatively, a metabolic network can be represented by a compound graph, that is, a compact version of the bipartite graph where only compound vertices are retained and information on the reactions is stored as edge labels. The reaction graph is the symmetric representation of a compound graph (i.e., reaction vertices are retained and information on the compounds is stored as edge labels). When directed versions of these graphs are considered, the direct edges express the reversibility/irreversibility of some reactions.

Figure 24.2 An example of a metabolic network.

24.2.3 Gene Regulatory Networks

Gene regulatory networks denote the genes expressed in the cell in response to biological signals. Gene regulatory networks are also called transcription regulation networks, since they describe the interactions between transcription factor proteins and the genes that they regulate [9, 37]. Transcription factors respond to biological signals and accordingly change the transcription rate of genes, allowing cells to produce the proteins they need at the appropriate times and amounts [41].

A gene regulatory network can be represented as a directed graph, in which nodes represent proteins, their corresponding mRNAs, and protein–protein complexes (see Fig. 24.3). In particular, the nodes of a gene regulatory network are also associated with genes, by referring to the DNA sequences that are transcribed into the mRNAs that translate into proteins. Edges between nodes represent individual molecular reactions, the protein–protein and protein–mRNA interactions through which the products of one gene affect those of another one. When interactions are inductive (i.e., meaning that an increase in the concentration of one component leads to an increase in the other one), they are represented by arrowheads; when interactions are inhibitory (i.e., when an increase in one component leads to a decrease in the other one), then they are represented by filled circles. A series of edges indicates a chain of such dependences, with cycles corresponding to feedback loops.

Figure 24.3 A small portion of the gene regulatory network in mouse embryonic stem cells [63].

A gene regulatory network may also be represented as a connectivity matrix $c24-math-0023$ , such that $c24-math-0024$ if the component associated with node $c24-math-0025$ encodes a transcription factor regulating the component associated to node $c24-math-0026$ , and $c24-math-0027$ otherwise.

24.3 Problem Formulation

We now consider some technical details concerning the problem of searching for repetitions in biological data.

Let $c24-math-0028$ be a complex structure (e.g., a string or a graph) containing basic elements (e.g., characters or nodes). We suppose that all the basic elements in $c24-math-0029$ can be represented as suitable symbols. Let $c24-math-0030$ be the finite alphabet containing all such symbols. The following definition is important to understand when substructures of the same kind of $c24-math-0031$ are repeated in $c24-math-0032$ , or in a set of structures like $c24-math-0033$ .

Definition 24.1: Occurrence

Let $c24-math-0034$ be a complex structure of the same type as $c24-math-0035$ , and $c24-math-0036$ be the finite alphabet containing all the symbols associated with the basic elements in $c24-math-0037$ , such that $c24-math-0038$ . Furthermore, let $c24-math-0039$ be a function having two complex structures as its arguments and returning as output their distance, according to some specific formula depending on the context under consideration. A substructure $c24-math-0040$ is an occurrence of $c24-math-0041$ iff

where $c24-math-0043$ is a suitable threshold value fixed a priori.

In the biosequences domain, suitable distances that can be adopted are for example the Hamming and the Levenshtein distances. When $c24-math-0044$ is a graph modeling a biological network, the concept of distance is less immediately defined. Indeed, the concept of similarity between subgraphs is always to be related to the subgraph isomorphism check, which is a difficult task, as mentioned earlier [27]. Furthermore, several distinct aspects can be taken into account, for example, the focus can be on the similarity between nodes, rather than between edges and structures, according to the specific biological problem under consideration.

Accordingly, different formulations of the problem of finding repetitions in biological data can be provided.

Formulation 24.1: Matching a Query

Let $c24-math-0045$ be a complex structure, called query, of the same type as $c24-math-0046$ , such that the size of $c24-math-0047$ is much less than the size of $c24-math-0048$ , and let $c24-math-0049$ be the finite alphabet containing all the symbols associated with the basic elements in $c24-math-0050$ , such that $c24-math-0051$ . The problem of matching $c24-math-0052$ in $c24-math-0053$ consists in finding all the occurrences of $c24-math-0054$ in $c24-math-0055$ .

In biosequences analysis, Formulation 24.1 is the analog of string pattern matching, where approximate occurrences can be handled by allowing wildcards and/or gaps [26]. As an example, let $c24-math-0056$ be a string $c24-math-0057$ $c24-math-0058$ defined on the finite alphabet $c24-math-0059$ and let $c24-math-0060$ be the string $c24-math-0061$ , defined on $c24-math-0062$ . If $c24-math-0063$ and the Hamming distance is considered, it is easy to see that $c24-math-0064$ has three occurrences in $c24-math-0065$ , thus matching $c24-math-0066$ in $c24-math-0067$ the output result set is $c24-math-0068$ .

In biological networks analysis, matching a query consists in analyzing an input network, called the target network, searching for possible subnetworks that are similar to a query subnetwork of interest. Such a problem, also known as network querying [20, 62], “is aimed at transferring biological knowledge within and across species” [53]. The so-found subnetworks may indeed correspond to cellular components involved in the same biological processes or performing similar functions than components in the query.

When approximate occurrences are searched for, mismatches possibly including both node/edge insertions and deletions can be allowed as well. The following example clarifies how Formulation 24.1 can be applied to biological networks.

Example 24.1

Consider the query and target networks shown in Figure 24.4. Let $c24-math-0069$ be the query network and suppose that, given a subnetwork $c24-math-0070$ of the target network $c24-math-0071$ , the chosen function $c24-math-0072$ returns to output the number of “differences” between $c24-math-0073$ qnd $c24-math-0074$ defined as follows. Let $c24-math-0075$ and $c24-math-0076$ be two sets of pairs individuating the (possibly directed) edges in $c24-math-0077$ and $c24-math-0078$ , respectively. Then, $c24-math-0079$ returns the size of the set $c24-math-0080$ . Note that $c24-math-0081$ is equal to zero for pairs of identical subgraphs, such that both nodes and edges overlap.

Figure 24.4 Matching a query on biological networks.

If $c24-math-0082$ , the two subgraphs highlighted by a green circle in the target network, together with the subgraph circled by a dashed red line, indicate occurrences of $c24-math-0083$ in $c24-math-0084$ with respect to $c24-math-0085$ .

Now consider a different distance; that is, for each edge $c24-math-0086$ in $c24-math-0087$ , assume that an edge $c24-math-0088$ in $c24-math-0089$ is searched for such that $c24-math-0090$ is equal to $c24-math-0091$ and $c24-math-0092$ is equal to $c24-math-0093$ , respectively. For each edge of $c24-math-0094$ such that the corresponding edge in $c24-math-0095$ according to this rule is missing, the distance between $c24-math-0096$ and $c24-math-0097$ is increased by one. For each direct path of length 2 in $c24-math-0098$ , another direct path of length 2 is searched for in $c24-math-0099$ , such that the first node and the last node in the path are equal. For each of such paths in $c24-math-0100$ that is missed in $c24-math-0101$ , the distance between $c24-math-0102$ and $c24-math-0103$ is increased by one.

With this new distance, and by considering a threshold value $c24-math-0104$ , analogously to the previous case, it is easy to see that the subgraph highlighted by the dashed red line in $c24-math-0105$ is no longer an occurrence of $c24-math-0106$ .

Formulation 24.2: Discovering Repetitions in a Single Structure

Let $c24-math-0107$ be a complex structure defined on the finite alphabet $c24-math-0108$ of basic elements. Discovering repetitions in a single structure corresponds to finding a set of significative substructures $c24-math-0109$ , not specified a priori, such that the number of occurrences of each $c24-math-0110$ satisfies some specific constraints making $c24-math-0111$ worth to be considered interesting.

Formulation 24.2 is related to the concept of motif, which has been exploited in different applications of computational biology [3, 47]. Depending on the context, what is a motif may assume sensibly different meanings. In general, motifs are always associated to repetitive objects. For example, a repeated substring can be considered a motif when its frequency is greater than a fixed threshold, or instead when it is much different from that expected [2].

Also in the context of biological networks, a motif can be defined according to its frequency or to its statistical significance [12]. In the first case, a motif is a subgraph that appears more often than a given threshold number of times in an input network; in the second case, a motif is a subgraph that appears more (or less) often than expected. In particular, to measure the statistical significance of the motifs, many works compare the number of appearances of the motifs in the biological network with the number of appearances in a number of randomized networks [17], by exploiting suitable statistical indices such as the p value or the z score [44].

Despite the similarity with sequences, network motifs present important differences with respect to string motifs, the main important of which concerns the computational complexity of the problem of motif extraction, which is polynomial for strings and exponential (in the size of the input structure) for networks.

Figure 24.5 shows two of the most frequent network motifs, called brick and feedforward loop, respectively.

Formulation 24.3: Discovering Repetitions in Multiple Structures

Let $c24-math-0112$ $c24-math-0113$ be a set of complex structures defined on the finite alphabet $c24-math-0114$ of basic elements. Discovering repetitions in multiple structures corresponds to finding a set of significative substructures $c24-math-0115$ , not specified a priori, such that each $c24-math-0116$ has an occurrence in at least $c24-math-0117$ (e.g., two) $c24-math-0118$ .

Figure 24.5 (a) Brick; (b) feedforward loop.

As is Formulation 24.2, Formulation 24.3 is also related to the concept of motif. However, in this case the repetitions to be searched for occur in different structures, rather than in a single one.

Searching for common substrings repeated in a set of biosequences can be useful for classifying them. For example, if there are $c24-math-0119$ amino acid sequences and $c24-math-0120$ of them present one or several similar subsequences, the corresponding proteins most likely belong to the same protein family. In sequence analysis, alignment is often performed to arrange the biosequences in such a way as to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between them. The so-aligned biosequences are usually represented as rows within a matrix. Gaps may also be inserted between the characters, so that identical or similar characters are aligned in successive columns and individuating common subsequences is then simpler.

When biological networks are considered, discovering repetitions in multiple structures consists in finding a set of conserved edges across the input networks, that is, a set of conserved subgraphs.

For both sequences and networks Formulation 24.3 can be performed by aligning pairs of structures (sequences or networks) or more than two structures at a time. Multiple alignments are often useful to individuate evolutionary conservations, but they are computationally difficult to produce. Furthermore, two different cases can be distinguished:

1. For each pair $c24-math-0121$ , $c24-math-0122$ of extracted repetitions, $c24-math-0123$ and $c24-math-0124$ cannot overlap; that is, they have no common elements. This is a case of global alignment.

2. For each pair $c24-math-0125$ , $c24-math-0126$ of extracted repetitions, $c24-math-0127$ and $c24-math-0128$ may overlap; that is, they have common elements. This is a case of local alignment.

24.3.1 Global Alignment

Global sequence alignment consists in aligning every character in every input sequence, and it is useful when the sequences in the query set are similar and of roughly equal size.

Global network alignment aims at finding a unique (possibly the best one) overall alignment between the input networks, in such a way that all the nodes of the networks are mapped.

Global sequence alignment is often used to compare genomic sequences to understand variations between species. Analogously, global network alignment may be used, for example, to compare interactomes and to understand cross-species variations. In contrast to global sequence alignment, global network alignment may also be useful when the input networks have significantly different sizes.

In greater detail, the output of global network alignment is a set of tuples $c24-math-0129$ $c24-math-0130$ where each $c24-math-0131$ is a subgraph of one of the input networks that is similar to the other subgraphs in the same tuple. In terms of repetition, we can see such a tuple as the set of occurrences of the same motif, occurring in $c24-math-0132$ of the input networks. The peculiarity of global alignment as opposed to the local one is that each $c24-math-0133$ cannot share common nodes with any other $c24-math-0134$ of the same input network that is involved in a different output tuple.

24.3.2 Local Alignment

Local sequence alignment aims at identifying regions of similarity within long sequences that are often widely divergent overall. Local sequence alignment is useful to analyzing dissimilar sequences, for which are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

Local network alignment instead consists in finding similar motifs between two or several networks, and the mappings corresponding to different motifs may also be mutually inconsistent. The objective is to find multiple unrelated regions of isomorphism among the input networks, with each region implying a mapping independently of the others. Local network alignment may be applied, for example, to search for a known functional component (e.g., pathways, complexes) in a new species.

The output of local network alignment is, again, a set of tuples $c24-math-0135$ $c24-math-0135a$ , similar to the case of global network alignment. However, in this case, each $c24-math-0136$ can share common nodes with some other $c24-math-0137$ of the same input network, involved in a different output tuple.

24.4 Methods

In this section we summarize the more recent applications for each formulations presented in the previous section with respect to the problem of searching repetitions in biological networks.

24.4.1 Matching a Query

Many approaches presented in the literature that aim at searching for small query subnetworks repeated in a target network focus on queries that are not general graphs, but paths or trees. As an example, Kelley et al. proposed PathBLAST [31], an approach to align two PPI networks by combining interaction topology and protein sequence similarity in order to identify conserved interaction pathways and complexes. Their algorithm searches for high-scoring pathway alignments involving two paths, one for each network, in which proteins of the first path are paired with putative homologs occurring in the same order in the second path. This is achieved by generating a network alignment graph where each node consists of a pair of sequence-similar proteins, each belonging to a different species; thus, links between pairs of nodes may be associated with conserved protein interactions between the corresponding proteins. As a byproduct, the authors use PathBLAST also for network querying by designating one of the two networks in input as the query for the other one.

Pinter et al. [48] proposed MetaPathwayHunter, an algorithm to query metabolic networks, where the queries are multisource trees. In particular, these query trees are directed acyclic graphs for which the correspondent undirected graphs are trees such that some of the nodes may present both incoming and outgoing edges. MetaPathwayHunter searches the networks for approximated matching, allowing node insertions (only one node), whereas no deletions are allowed.

QPath and QNet were proposed by Shlomi et al. [55] and Dost et al. [15], respectively. QPath queries a PPI network by a query pathway consisting of a linear chain of interacting proteins belonging to another organism. The algorithm works similarly to sequence alignment, by aligning the query pathway to putative pathways in the target network, so that proteins in analogous positions have similar sequences. PPI networks interactions reliability scores are considered, and insertions and deletions are allowed. QNet is an extension of QPath in which the queries are trees or graphs with limited treewidth.

GenoLink [16] is a system able to integrate data from different sources (e.g., databases of proteins, genes, organisms, chromosomes) and to query the resulting data graph by graph patterns with constraints attached to both vertices and edges; a query result is the set of all the subgraphs of the target graph that are sufficiently similar to the query pattern and satisfy the constraints.

Yang and Sze [60] considered the two problems of path matching andgraph matching. In particular, they present an exact algorithm to search for subgraphs of arbitrary structure in large graph, grouping related vertices in the target network for each vertex in the query. Although their algorithm is accurate and also efficient enough to be considered exact, the authors state that it is practical for queries having $c24-math-0138$ nodes, and its performance improves if the query is a sparse graph.

Ferro et al. [19] present an approach called NetMatch, a Cytoscape plugin for the network querying allowing for an approximated query, such that certain parts of the subgraph query may be left unspecified. A query for NetMatch is a graph in which some nodes are specified and others are wildcards (which can match an unspecified number of elements). Ferro et al. [19] focus mainly on the topological similarity between the query and target graphs, without taking into account information about node similarities.

Fionda et al. [21] modeled PPI networks through labeled graphs in order to factor in the reliability of interactions, allowing for a more accurate analysis, and a proposed a technique is based on maximum-weight matching of bipartite graphs.

Rombo et al. [50] present an efficient framework based on hidden Markov models that can be used for finding homologous pathways in a network of interest. Given a query path, the method identifies the top $c24-math-0139$ matching paths in the network, which may contain any number of consecutive insertions and deletions. The authors apply their method on PPI networks.

Blin et al. [6] propose an algorithm for matching a query in the shape of graphs, based on dynamic programming and the color coding technique. To transform graphs queries into trees without loss of information, they use a feedback vertex set coupled to a node duplication mechanism. The resulting algorithm is FPT for querying graphs with a bounded size of their feedback vertex set, by giving an alternative to the treewidth parameter, which can be better or worse for a given query. The approach has been successfully applied to retrieve human query networks in the shape of graphs into the fly PPI network.

Ferraro et al. [18] propose AbiNet, an approach for aligning and querying biological networks according to an asymmetric comparison, allowing utilization of differences in the characterization of organisms. They exploit finite automatons suitably coupled with the Viterbi algorithm and use the best characterized organism (master) as a fingerprint to guide the alignment process to the second input network (slave). AbiNet allows for two different uses of the network querying mode: set the network query as the master and the target one as the slave, if attention is focused on finding the pairwise interactions of a specific module (the query) that are conserved in a specific organism (the target). The opposite master/slave arrangement is appropriate when, instead, the objective is to individuate possible diversifications of a module in a more complex target organism.

24.4.2 Discovering Repetitions in a Single Structure

The search of significant repetitions in a single biological network was pioneered by Shen-Orr et al. [54], who defined network motifs as “patterns of interconnections that recur in many different parts of a network at frequencies much higher than those found in randomized networks.” They [54] studied the transcriptional regulation network of Escherichia coli, by searching for small motifs composed by three or four nodes. In particular, three highly significant motifs characterizing such network have been discovered; the most famous is the feedforward loop, shown in Figure 24.5b, whose importance has been shown in further studies [39, 40].

The technique presented by Shen-Orr et al. [54] laid the foundations for different extensions, mainly of three proposed in 2004 and 2008 [5, 10, 61].

Yeger-Lotem et al. [61] identified composite motifs consisting of two kinds of interaction by exploiting edges of different colors in the network modeling. The authors identified an integrated cellular interaction network by two types (colors) of edges, representing protein–protein and transcription–regulation interactions, and developed algorithms for detecting network motifs in such networks with multiple types of edges. Such a study may be considered as a basic framework for detecting the building blocks of networks with multiple types of interaction.

In another study [5], topological motifs derived from families of mutually similar, but not necessarily identical, patterns were discussed and extracted from the gene regulatory network of E. coli. The authors developed a search algorithm to extract topological motifs called graph alignment, in analogy to sequence alignment, that is based on a scoring function. They observed that, in biological networks, functionally related motifs do not need to be topologically identical; thus, they discussed motifs derived from families of mutually similar but not necessarily identical patterns.

In yet another study [10], $c24-math-0140$ -node “bridge” and “brick” motifs were searched for in complex networks, and the presence of such motifs was associated with network topology, but not with network size. The authors proposed a method for performing simultaneously the detection of global statistical features and local connection structures, and the location of functionally and statistically significant network motifs.

All these approaches [5, 10, 54, 61] relate the concept of motif only to network topology, without considering possible properties shared by network nodes in terms of their mutual similarity. According to such a definition of motifs, subgraphs containing the same number of nodes and presenting similar topology may be recognized as motifs, although their nodes are associated with components presenting significant differences (e.g., proteins with very low sequence homology).

Lacroix et al. [35], observed biological networks (e.g., metabolic networks) where a purely topological definition of motifs seems to be inappropriate,as similar topologies can give rise to very different functions. Thus, they [35] introduce a new definition of motifs in the context of metabolic networks, such that the components of the network play a central role and the topology can be added only as a further constraint.

Similarly to the Lacroix et al. study [35], Parida [46] related the concept of motif is to both graph structure and node similarity. In particular, the author presents a three-step exact approach based on the application of the notion of maximality, used extensively in strings, to graphs.

Both works [35, 46] open the way for definition of new, exact, or approximate, approaches for motif extraction, taking into account not only the network topology but also the biological properties of the interacting components.

24.4.3 Discovering Repetitions in Multiple Structures

We consider at first the case of global network alignment, which was proposed for the first time by Singh et al. [56]. In particular, the authors present IsoRank, an algorithm for pairwise global alignment of PPI networks working in two stages: (1) associating a score with each possible match between nodes of the two networks, and then (2) constructing the mapping for the global network alignment by extracting mutually consistent matches according to a bipartite graph weighted matching performed on the two entire networks. IsoRank was extended in Singh's later paper [57], where a multiple alignment among five PPI networks is illustrated. To this aim, the authors exploited an approximate multipartite graph weighted matching.

Liao et al. [38] proposed the IsoRankN (IsoRank-Nibble) tool, which is a global multinetwork alignment tool based on spectral clustering on the induced graph of pairwise alignment scores. As it is based on spectral methods, IsoRankN is both error-tolerant and computationally efficient.

A graph-based maximum structural matching formulation for pairwise global network alignment was introduced [32]. On the basis of such formulation, an approach exploiting the Lagrangian relaxation technique combined with a branch-and-bound method to perform global network alignment is proposed.

In another paper [11] the algorithm PISwap is presented for computing global pairwise alignments of protein interaction networks. The approach is based on a local optimization heuristic that has previously demonstrated its effectiveness for a variety of other NP-hard problems, such as the traveling salesperson problem (TSP). PISwap begins with a sequence-based network alignment and then iteratively adjusts the alignment by incorporating network structure information.

The already discussed AbiNet [18] performs an asymmetric global alignment on pairs of biological networks. It is able to return biologically meaningful associations that a symmetric approach (such as IsoRank) does not find, as shown by an experimental campaign performed on yeast, fly, humans, and two bacteria.

Local network alignment was pioneered by Kelley et al. [31] with Pathblast, whose main characteristics were described in Section 24.4.1. Pathblast has been extended [52] for comparison with several PPI networks. In particular, such an extension is based on the generation of a network alignment graph where each node consists of a group of sequence-similar proteins, one for each species, and each link between a pair of nodes represents a conserved protein interaction between the corresponding protein groups.

In another study [33], a pair of PPI networks are considered as input and a technique using duplication/divergence models through definition of duplications, matches, and mismatches in a graph-theoretic framework is presented. In particular, the alignment problem is reduced to a graph optimization problem, and efficient heuristics are proposed to solve this problem.

Bandyopadhyay et al. [4] proposed a strategy to identify functionally related proteins supplementing sequence-based comparisons with information on conserved protein–protein interactions. The technique first aligns two PPI networks using only sequence similarities, and then performs probabilistic inference (based on Gibbs sampling) to identify pairs of proteins that are likely to retain the same function. The approach has been specifically applied to resolve ambiguous functional orthology relationships in PPI networks.

Functional orthology detection was been considered in another study [23], where Bi-GRAPPIN, a method based on maximum-weight matching of bipartite graphs, is presented. In particular, the bipartite graphs result from comparing the adjacent nodes of pairs of proteins occurring in the input networks. The basic idea is that proteins belonging to different networks should be matched in terms of not only their own sequence similarity but also the similarity of proteins with which they significantly interact with. Bi-GRAPPIN has been exploited [22] as a preliminary step in applying a node collapsing–based technique to extract similar subgraphs from two input networks.

Flannick et al. [24] presented an algorithm for multiple network alignment, named Graemlin. Graemlin aligns an arbitrary number of networks to individuate conserved functional modules, greedily assigning the aligned proteins to nonoverlapping homology classes and progressively aligning multiple input networks. The algorithm also allows searching for different conserved topologies defined by the user. It can be used either to generate an exhaustive list of conserved modules in a set of networks (network-to-network alignment) or to find matches to a particular module within a database of interaction networks (query-to-network alignment).

In another study [14] the algorithm C3Part-M, based on a nonheuristic approach exploiting a correspondence multigraph formalism to extract connected components conserved in multiple networks, is presented and applied on PPInetworks. C3Part-M is compared with NetworkBlast-M [29], another technique more recently proposed on the basis of a novel representation of multiple networks that is linear only with respect to their size.

24.5 Concluding Remarks

In both biosequences and biological networks, the presence of repetitions is usually associated with biologically significant conservations. Thus several techniques have been proposed in the literature to extract interesting repetitions from biological data, and different formulations of such a problem are possible, as we showed in this chapter.

When biological networks are considered, the extraction of interesting similarities involves the subgraph isomorphism problem, which is known to be NP-complete [27]. For this reason, some of the approaches, described [e.g., 48,55] search for repeated structures simpler than graphs, such as trees or paths. Furthermore, the number of existing approximate techniques is notably greater than the number of exact methods. In particular, exact approaches have been proposed when the size of the subgraphs to search for can be fixed a priori, such as in the case of querying and motif extraction. If exact algorithms are exploited, the size of subgraphs has to be restricted to only few nodes.

Some of the approaches reviewed in this chapter focus only on the topological similarity between the conserved subgraphs. Others also factor in information about node similarities in the process of extracting conserved regions, often relaxing topology constraints. Only a few methods have been presented that also apply edge information, such as reliability or different kinds of interaction. Considering both node and edge information seems to be a successful approach for analysis of biological networks. Indeed, in contrast to the biosequences domain, biological interaction data are often affected by noise because much information is retrieved by computational and high-throughput experimental techniques [28, 34].

A major challenge in biological networks analysis is the need to define more efficient algorithms capable of detecting significant repetitions occurring in a large number of networks. Indeed, the amount of available interaction data is increasing exponentially, and searching for common motifs in different networks could be useful in classifying the corresponding interactomes and identifying similar functional groups across species. While multiple sequence alignment has been massively exploited, multiple network alignment is still in its infancy, because such a task is computationally difficult for networks with thousands of nodes and interactions.

Finally, we point out that, although there has been some attempt to define statistical indices in the biological network domain [44], a general and universally accepted methodology for measuring the biological significance of the repetitions detected is still lacking.

References

1. Alon U, Network motifs: Theory and experimental approaches, Nature 8:450–461 (2007).

2. Apostolico A, Bock ME, Lonardi S, Monotony of surprise and large-scale quest for unusual words, J. Comput. Biol. 10(2/3):283–311 (2003).

3. Apostolico A et al, Finding 3d motifs in ribosomal rna structures, Nucleic Acids Res. 37(4):e29 (2008).

4. Bandyopadhyay S, Sharan R, Ideker T, Systematic identification of functional orthologs based on protein network comparison, Genome Res. 16(3):428–435 (2006).

5. Berg J, Lassig M, Local graph alignment and motif search in biological networks, Proc. Natl. Acad. Sci. USA 101(41):14689–14694 (2004).

6. Blin G, Sikora F, Vialette S, Querying graphs in protein-protein interactions networks using feedback vertex set, IEEE/ACM Trans. Comput. Biol. Bioinformatics. 7:628–635 (2010).

7. Brazma A et al, Approaches to the automatic discovery of patterns in biosequences, J. Comput. Biol. 5(2):277–304 (1998).

8. Chatraryamontri A et al, MINT: The molecular interaction database, Nucleic Acids Res. 35(database issue):D572–D574 (2006).

9. Cheadle C et al, Control of gene expression during t cell activation: Alternate regulation of mrna transcription and mrna stability, BMC Genomics 6(1):75 (2005).

10. Cheng C-Y, Huang C-Y, Sun C-T, Mining bridge and brick motifs from complex biological networks for functionally and statistically significant discovery, IEEE Trans. Syst. Man Cybernet. B 38(1):17–24 (2008).

11. Chindelevitch L, Liao C-S, Berger B, Local optimization for global alignment of protein interaction networks, Proc. Pacific Symp. Biocomputing, 2010, pp. 123–132.

12. Ciriello G, Guerra C, A review on models and algorithms for motif discovery in protein-protein interaction network, Brief. Funct. Genomics Proteomics 7(2):147–56 (2008).

13. International Human Genome Sequencing Consortium, Finishing the euchromatic sequence of the human genome, Nature 431(7011):931–945 (2004).

14. Denielou Y-P, Boyer F, Viari A, Sagot M-F, Multiple alignment of biological networks: A flexible approach, Proc. Combinatorial Pattern Matching'09, 2009, pp. 263–273.

15. Dost B et al, Qnet: A tool for querying protein interaction networks, Proc. RECOMB'07, 2007, pp. 1–15.

16. Durand P et al, GenoLink: A graph-based querying and browsing system for investigating the function of genes and proteins, BMC Bioinformatics 7(21):(2006).

17. Erdos P, Renyi A, On the evolution of random graphs, Publ. Mater. Inst. Hung. Acad. Sci. 5:17–61 (1960).

18. Ferraro N, Palopoli L, Panni S, Rombo S, Asymmetric comparison and querying of biological networks, IEEE/ACM Trans. Comput. Biol. Bioinformatics 8:876–889 (2011).

19. Ferro A et al, Netmatch: A cytoscape plugin for searching biological networks, Bioinformatics 23(7):910–912 (2007).

20. Fionda V, Palopoli L, Biological network querying techniques: Analysis and comparison, J. Comput. Biol. 18(4):595–625 (2011).

21. Fionda V, Palopoli L, Panni S, Rombo SE, Protein-protein interaction network querying by a “focus and zoom” approach, Bioinformatics Res. Devel. 13:331–346 (2008).

22. Fionda V, Panni S., Palopoli L, Rombo SE. Extracting similar sub-graphs across ppi networks, Proc. Int. Symp. Computing and Informatics Sciences, 183–188 (2009).

23. Fionda V, Panni S, Palopoli L, Rombo SE, A technique to search functional similarities in ppi networks, Int. J. Data Mining Bioinformatics, 3(4):431–453 (2009).

24. Flannick J et al, Graemlin: General and robust alignment of multiple large interaction networks, Genome Res. 16(9):1169–1181 (2006).

25. Francke C, Siezen RJ, Teusink B, Reconstructing the metabolic network of a bacterium from its genome, Trends Microbiol. 13(11):550–558 (2005).

26. Galil Z, Giancarlo R, Data structures and algorithms for approximate string matching, J. Complex. 4(1):33–72 (1988).

27. Garey M, Johnson D, Computers and Intractability: A Guide to the Theory of NP-completeness, Freeman, New York, 1979.

28. Ito T et al, A comprehensive two-hybrid analysis to explore the yeast protein interactome, Proc. Natl. Acad. Sci. USA 98(8):4569–4574 (2001).

29. Kalaev M, Bafna V, Sharan R, Fast and accurate alignment of multiple protein networks, Proc. RECOMB'08, 2008.

30. Karp PD et al, Expansion of the BioCyc collection of pathway/genome databases to 160 genomes, Nucleic Acids Res. 19:6083–6089 (2005).

31. Kelley BP et al, Conserved pathways within bacteria and yeast as revealed by global protein network alignment, Proc. Natl. Acad. Sci. USA 100(20):11394–11399 (2003).

32. Klau GW, A new graph-based method for pairwise global network alignment, BMC Bioinformatics 10(Suppl. 1):S59 (2009).

33. Koyuturk M et al, Pairwise alignment of protein interaction networks, J. Comput. Biol. 13(2):182–199 (2006).

34. Krogan NJ, Cagney G, et al, Global landscape of protein complexes in the yeast Saccharomyces cerevisiae, Nature 440(7084):637–643 (2006).

35. Lacroix V, Fernandes CG, Sagot M-F, Motif search in graphs: Application to metabolic networks, IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(4):360–368 (2006).

36. Lesk A, Introduction to Protein Science Architecture, Function, and Genomics, Oxford Univ. Press, 2004.

37. Levine M, Davidson EH, Gene regulatory networks for development, Proc. Natl. Acad. Sci. USA 102:4936–4942 (2005).

38. Liao C-S et al, Isorankn: Spectral methods for global alignment of multiple protein networks, Bioinformatics 25:i253–i258 (2009).

39. Mangan S, Alon U, Structure and function of the feed-forward loop network motif, Proc. Natl. Acad. Sci. USA 100(21):11980–11985 (2003).

40. Mangan S, Itzkovitz S, Zaslaver A, Alon U, The incoherent feed-forward loop accelerates the response-time of the gal system of Escherichia coli, J. Mol. Biol. 356(5):1073–1081 (2005).

41. Matys V et al, Transfac and its module transcompel: Transcriptional gene regulation in eukaryotes, Nucleic Acids Res. 34:D108–D110 (2006).

42. Mewes HW et al, MIPS–a database for protein sequences, homology data and yeast genome information, Nucleic Acids Res. 25:28–30 (1997).

43. Miller JP et al, Large-scale identification of yeast integral membrane protein interactions, Proc. Natl. Acad. Sci. USA 102(34):12123–12128 (2005).

44. Milo R et al, Network motifs: Simple building blocks of complex networks, Science 298(5594):824–827 (2002).

45. Orlev N, Shamir R, Shiloh Y, PIVOT: Protein interaction visualization tool, Bioinformatics 20(3):424–425 (2004).

46. Parida L, Discovering topological motifs using a compact notation, J. Comput. Biol. 14(3):46–69 (2007).

47. Parida L, Pattern Discovery in Bioinformatics, Theory and Algorithms, Chapman & HAll/CRC, 2008.

48. Pinter R et al, Alignment of metabolic pathways, Bioinformatics 21(16):3401–3408 (2005).

49. Posada D, Bioinformatics for DNA Sequence Analysis, Humana Press, 2009.

50. Qian X, Sze S-H, Yoon B-J, Querying pathways in protein interaction networks based on hidden markov models, J. Comput. Biol. 16(2):145–57 (2009).

51. Rombo SE, Utro F, Giancarlo R, Basic statistical indices for SeqAn, in Gogol-Döring A, Reinert K, Biological Sequence Analysis Using the SeqAn C++ Library, Chapman&Hall/CRC Mathematical & Computational Biology Series, 2009, pp. 249–260.

52. Sharan R et al, From the cover: Conserved patterns of protein interaction in multiple species, Proc. Natl. Acad. Sci. USA 102(6):1974–1979 (2005).

53. Sharan R, Ideker T, Modeling cellular machinery through biological network comparison, Nat. Biotechnol. 24(4):427–433 (2006).

54. Shen-Orr SS, Milo R, Mangan S, Alon U. Network motifs in the trascriptional regulation network of Escherichia coli, Nature 31:64–68 (2002).

55. Shlomi T et al, Qpath: A method for querying pathways in a protein-protein interaction network, BMC Bioinformatics 7: (2006).

56. Singh R, Xu J, Berger B, Pairwise global alignment of protein interaction networks by matching neighborhood topology, Proc. Res. Comput. Mol. Biol., pp 16–31, (2007).

57. Singh R, Xu J, Berger B, Global alignment of multiple protein interaction networks, Proc. Pacific Symp. Biocomputing, pp. 303–14, (2008).

58. Stelling J, Klamt S, Bettenbrock K, Schuster S, Gilles ED. Metabolic network structure determines key aspects of functionality and regulation, Nature 420:190–193 (2002).

59. von Mering D, Krause C, et al., Comparative assessment of a large-scale data sets of protein-protein interactions, Nature 417(6887):399–403 (2002).

60. Yang Q, Sze S-H, Saga: A subgraph matching tool for biological graphs, J. Comput. Biol. 14(1):56–67 (2007).

61. Yeger-Lotem E et al., Network motifs in integrated cellular networks of transcription regulation and protein protein interaction, Proc. Natl. Acad. Sci. USA 101(16):5934–5939 (2004).

62. Zhang S, Zhang X-S, Chen L, Biomolecular network querying: A promising approach in systems biology, BMC Syst Biol. 2:5 (2008).

63. Zhou Q, Chipperfield H, Melton DA, Wong WH, A gene regulatory network in mouse embryonic stem cells, Proc. Natl. Acad. Sci. USA 104(42):16438–16443 (2007).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 24: Finding repetitions in biological networks: challenges, trends, and applications

Create new playlist

Sign In

Sign Up

Chapter 24

Finding repetitions in biological networks: challenges, trends, and applications