1.6 Discussion

In this chapter, we discussed a number of approaches for the reconstruction and partition of biological networks. We considered the case of both directed and undirected biological networks in each of the above classes of problems. Network reconstruction algorithms presented in this chapter were further categorized based on the type of measurements used in the inference procedure. The type of measurements which we considered were gene expression data and symbol data comprising of gene sets. Gene expression data are numerical matrices containing gene expression levels measured from different experiments, whereas gene sets are sets of genes and do not assume the availability of the corresponding gene expression levels.

For the reconstruction of directed networks, we presented six approaches Boolean networks, probabilistic Boolean networks, Bayesian networks, cGraph, frequency method, and NICO. Among these approaches Boolean networks, probabilistic Boolean networks, and Bayesian networks accommodate gene expression data, whereas cGraph, frequency method, and NICO are suitable to infer the underlying network topologies from gene sets. For the reconstruction of undirected biological networks from gene expression data, we presented two approaches relevance networks and graphical Gaussian models. Nonetheless, the aforementioned algorithms for network inference using gene expression data are also applicable when the inputs are given in the form of gene set compendiums and vice versa. For instance, in order to apply a gene set based approach on gene expression data, an additional data discretization step can be incorporated to derive gene sets. Indeed, genes expressed in an experimental sample discretized using binary labels correspond to a gene set. Similarly, a gene set can be naturally represented as a binary sample by considering the presence or absence of a gene in the set. This equivalence can be used to infer a Bayesian network, Boolean network, or probabilistic Boolean network from a gene set compendium, and to infer mutual information networks, for example, mutual information version of relevance networks which accommodate discrete measurements. Similarly, gene sets obtained after discretizing gene expression data can be utilized to infer a network using NICO or cGraph. Overall, the equivalent representation of a gene set compendium as binary discrete data makes the network inference approaches applicable for both the types of input, gene sets or gene expression data. However, the approaches differ in their output, for example, directed versus undirected networks, and their computational efficiency. In general, the computational inference of undirected networks, for example, relevance networks and graphical Gaussian models is more efficient, as such approaches are based on estimating pairwise associations or similarities. For example, relevance networks measure the strength of pairwise interaction in terms of Pearson's correlation or mutual information, whereas graphical Gaussian models present a more appealing model by taking only direct interactions into account and use partial correlations to estimate the strength of pairwise associations. The two network inference approaches are frequently used in the field of information theory, pattern analysis and machine learning. In the inference of directed networks, Boolean networks and probabilistic Boolean networks present computationally efficient and simpler models, in comparison to Bayesian networks. However, the use of Boolean functions in both Boolean networks and probabilistic Boolean networks may cause the oversimplification of gene regulatory mechanisms. Nonetheless, Boolean networks find applications in many fields including biological systems, circuit theory, and computer science, for example, see Refs. [81, 82]. Bayesian networks provide a sophisticated probabilistic modeling approach to infer gene regulatory mechanisms. As Bayesian networks suffer from nontrivial computational complexity, heuristics are applied to reduce the size of the search space of all possible Bayesian networks defined for a given number of nodes. Bayesian networks are used in a wide range of fields including medicine, document classification, information retrieval, image processing, and financial analysis. Among gene set based approaches, cGraph and frequency method are computationally efficient but they make stringent assumptions in the underlying models. For instance, cGraph adds a weighted edge between every pair of genes which appear in some gene set, whereas frequency method assumes a prior availability of the two end nodes in each gene set and directed edges involved in the corresponding paths. Expectation–maximization based NICO assumes a more general case and reconstructs the underlying network by inferring the order information for each unordered gene set. As the computational complexity of the approach grows exponentially with increase in the lengths of gene sets, an importance sampling based approximation of E-step has been proposed, which guarantees a polynomial time convergence of the EM algorithm. The above algorithms find applications in many real-world scenarios such as sociology, communication networks, and cognitive science.

We reviewed a variety of algorithms for network partitioning. The network partitioning algorithms were categorized as graph clustering algorithms and community detection algorithms. Graph clustering algorithms are applicable to very large-scale integration, distributing jobs on a parallel machine, and other applications found in computer science. Community detection algorithms, on the other hand, are more applicable to biological and social networks.

For graph clustering algorithms, we reviewed the well-known Kernighan–Lin algorithm. The Kernighan–Lin algorithm has complexity O(|V|2 log |V|). Although the Kernighan–Lin algorithm may not be directly applicable to biological networks, its “descendant,” Algorithm 1.5.2, is directly applicable as a postprocessing step for many community detection algorithms [22].

The first community algorithm we presented was the Girvan–Newman algorithm with complexity O(|V||E|2). The essence of the Girvan–Newman algorithm is that edges between communities have high edge-betweenness scores. By focusing on edge-betweenness, the Girvan–Newman algorithm focuses on the flow of the network as opposed to the immediate connection between nodes. Its major drawback is the lack of a proper criterion to determine the cut line of a dendrogram. Modularity was used to remedy the situation, but as seen in Section 1.5.2.3, modularity itself has its own drawbacks. An interesting solution may be replacing modularity as a quality function with the map equation introduced by Rosvall.

Next, we presented Newman's eigenvector method. This method is quite interesting as by defining modularity via Equation 1.37, the modularity matrix B defined in Equation 1.38 takes the position of the graph Laplacian in the spectral bisection algorithm. Newman's eigenvector method is considered to be quite fast with complexity O(|V|2 log |V|). The method focuses on the degrees and connections of immediate nodes as opposed to the flow of information in a given graph. A more useful aspect is that the value of img for a node i corresponds directly to its participation strength in its community. The major drawback of Newman's eigenvector method is the same as spectral bisection method in which its core strength lies in finding the initial bipartition of a graph. There are also drawbacks involved with the choice of modularity as the quality function as seen in Section 1.5.2.3. Finally, extending Equation 1.37 to its directed counterpart Equation 1.45 does not incorporate edge direction correctly.

We then presented Infomap that utilizes information theory to compress good partitions and describe them using the least amount of bits possible. While modularity concentrates on the pairwise relationships between nodes, Infomap focuses on the flow of information within a network similar to the original Girvan–Newman algorithm. Its implementation for directed networks seems the most rigorous of all of the implementations presented. Since Infomap uses a stochastic algorithm, the number of iterations that are needed before a good partitioning is found is unknown.

Finally, we presented the clique percolation method. The clique percolation method is suitable for partitioning biological networks as it allows for overlap between different communities. It has some drawbacks as it may not place all nodes in a community, especially leaf nodes. The complexity of the clique percolation method cannot be expressed in closed form. Moreover, for the case of directed networks, the definition for directed k-clique is rather arbitrary.

Network reconstruction and partitioning are two fundamental problems in computational systems biology, which aim to provide a view of the underlying biomolecular activities at a global or local level. In general, the choice of a network reconstruction approach depends on various factors, for example, type of measurements, number of variables, sample size, amount of prior knowledge, and the type of interactions considered in an analysis. Similarly, network partitioning is dependent on the number of clusters overlapping with different clusters. Nevertheless, the two classes of problems are inherently related and one provides a foundation for the other. Structurally, a large biological network is an ensemble of smaller components or subnetworks. As tightly connected subnetworks often correspond to functional units of a biological network, network partitioning is essential to extract this finer level of detail. On the other hand, subnetworks together provide a global view of the underlying biological processes. In particular, gene set based approaches have recently gained attention in the computational inference of biological networks, for their natural ability to incorporate higher-order gene regulatory relationships. As gene expression data often have a small sample size and excessive noise, such data sets may not capture a complete picture of complex biomolecular activities. Network reconstruction and partitioning, two complementary problems in systems biology, together offer a potential avenue for future researches in gene set based inference of biological networks.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.129.72.176