3.3 Modeling Without Parameter Tuning: A Case Study of Metabolic Networks
In the previous section, we introduced a simple model for reproducing several structural properties observed in real-world biological networks. Furthermore, this model network shows good qualitative and quantitative agreement with real data.
For model fitting, we need to tune model parameters. The parameters may be estimated by minimizing the difference between the model and real data using statistical methods. However, the optimal solutions may require substantial calculation, after which multiple optimal solutions may be obtained. Furthermore, the possibility of trivial agreements (e.g., over-fittings) may remain if the model has several tunable parameters.
In order to avoid this problem, we need to eliminate parameter tunings and construct models in which parameters are determined from observable statistics obtained from real-network data, such as the number of nodes.
In this chapter, we focus on metabolic networks and introduce a network model without parameter tuning.
3.3.1 Network Model
Here, we consider metabolic networks in which nodes and edges represent metabolites and metabolic reactions (substrate–product relationships based on atomic tracing [26]) and propose a simple model that reproduces the structural properties of metabolic networks with two parameters, p and q. These parameters are determined from the statistics of real data (see Section 3.3.3 for details).
In general, it is believed that metabolic networks evolve by a reaction (enzyme) gain due to evolutionary events (see Ref. 27 for details). In this case, we can consider two situations: the case that a new reaction occurs between a new metabolite and an existing metabolite (Event I) and the case that a new reaction occurs between existing metabolites (Event II).
With the above consideration, the modeling is as follows.
i. Event I occurs with the probability 1 −
p (see
Fig. 3.9a). In this case, a new node (the black node) is connected to a randomly selected existing node (the gray node).
ii. Event II occurs with the probability
p (see
Fig. 3.9b and ). In this case, a short-cut path bypasses the path between a node and another node. The short-cut path is generated via a random walk because it may be drawn based on the existing network structure. For example, such short-cut chemical reactions may occur between related metabolic compounds, which are nearly located on the metabolic pathway (see Ref. 27 for details). Hence, we need to consider the length of the bypassed path. However, when we investigate the degree distribution and the degree-dependent clustering coefficient, it is sufficient to consider only two cases: (1) the length is equal to 2 and (2) the length is greater than or equal to 3. This assumption (of considering only two cases) is valid because the degree distribution is independent of the bypassed path length and the clustering coefficient is influenced only when a path of length 2 is bypassed (see Section 3.3.2 for the details). Therefore, we express the bypassed path length using the parameter
q as follows. First, an initial node (the triangular nodes in
Fig. 3.9b and ) is selected at random. Next, with the probability
q, we select a path of length 2 based on a random walk from the initial node. In contrast, with the probability 1 −
q, we select a path with length greater than or equal to 3 based on a random walk from the initial node. Thus, the parameter
q roughly corresponds to the bypassed path length, which is the path length between nodes connected through a short-cut path. Finally, a new edge (short-cut path) is drawn between the initial node (the triangular nodes in
Fig. 3.9b and ) and the terminal node (the quadrangular nodes in
Fig. 3.9b and ). Note that a triangle is accordingly generated with the probability
pq.
3.3.2 Analytical Solution
Degree Distribution. First, we show the analytical solution for degree distribution of the model via mean-field-based analysis [12,22].
We consider the time evolution of ki, which is the degree of node i. When Event I occurs, the degree of node i increases by one with the probability 1/N, where N is the total number of nodes. Further, when Event II occurs, two existing nodes are selected and their respective degrees increase. The degree of one node increases by one with the probability 1/N, because this node is randomly selected. The degree of another node increases by one with the probability ki/∑ jkj, because this node is selected by a random walk from a randomly selected node. It is reported that the probability that a walker arrives at this node equals ki/∑ jkj, irrespective of the number of steps in the random walk [28]. Note that this probability equals to that of the preferential attachment. Thus, the time evolution of ki is
(3.17)
where N = (1 − p)t, because the number of nodes increases by one with the probability 1 − p, and ∑jkj = 2t because one edge is added each time. It should be noted that this equation is independent of the bypassed path length (the parameter q). The solution of the above equation with the initial condition ki(t = s) = 1 is
(3.18)
where A(p) = 2/[p(1 − p)].
From the above equation, because s/t = P(≥ k) as shown in the previous section, the cumulative distribution P(≥ k) is
(3.19)
Since
, we finally obtain the degree distribution
(3.20)
where the degree exponent γ is
(3.21)
As shown in
Equation 3.20, the degree distribution follows a power law with a cutoff within a small degree.
Degree-Dependent Clustering Coefficient. Next, we show an analytical solution for the degree-dependent clustering coefficient of the model via mean-field-based analysis.
The clustering coefficient of node i is defined as Ci = 2Mi/[ki(ki − 1]), where Mi is the number of edges among the neighbors of node i. We consider the time evolution of Mi. The number of edges of Mi increases with the probability pq, because Mi increases when Event II occurs and a path of length 2 is bypassed (a triangle is generated). In other words, we do not need to consider a bypassed path with length greater than or equal to 3. Next, the Mi of each node, which belongs to the triangle, approximately increases by one. The Mi of one node increases by one with the probability 1/N, because this node is selected at random. The Mis of the other two nodes increase by one with the probability ki/∑ jkj, because these nodes are selected by a random walk. Therefore, the time evolution of Mi is
(3.22)
where
N = (1 −
p)
t and ∑
jkj = 2
t. Moreover,
ki = [
A(
p) + 1](
t/
s)
p/2 −
A(
p), as shown in
Equation 3.18.
The solution of the above equation with the initial condition Mi(t = s) = 0 is
(3.23)
where
A(
p) = 2/[
p(1 −
p)]. From
Equation 3.18, since
ki = [
A(
p) + 1](
t/
s)
p/2 −
A(
p), this equation is rewritten as
(3.24)
Substituting this equation into the definition of clustering coefficient described above, we finally obtain the degree-dependent clustering coefficient
(3.25)
Average Clustering Coefficient. Finally, we show an analytical solution for the average clustering coefficient of the model.
The average clustering coefficient is expressed as the summation of the product of the degree distribution and the degree-dependent clustering coefficient:
, where
Km is the maximum degree. We approximate this summation by the integral equation:
(3.26)
The maximum degree indicates that the cumulative probability equals to 1/
N; thus,
P(≥
Km) = 1/
N. From
Equation 3.19,
Km can be expressed as
(3.27)
Equation 3.26 is solved via numerical integration because it is analytically unsolvable.
3.3.3 Estimation of the Parameters
As explained in Section 3.3.1, this model has two parameters p and q. In order to reproduce the structural properties of metabolic networks, we need to estimate these parameters in real-world networks. In this section, we discuss the approach for estimating these parameters.
Parameter p. Here, we consider the time evolutions of the number of nodes
N and the number of edges
E.
By the definition of this model, these time evolutions are described as N = (1 − p)t and E = t. Therefore, the parameter p is estimated as
(3.28)
where N and E are obtained from real metabolic networks.
Parameter q. Here, we consider the number of triangles
T of this model.
In this model, when Event II occurs, the number of triangles approximately increases by one with the probability pq, because a triangle is generated with the probability q. That is,
(3.29)
Since N = (1 − p)t, this equation is rewritten as
(3.30)
From this equation, therefore, the parameter q is estimated by
(3.31)
where T, N, and E are obtained from real metabolic networks.
3.3.4 Comparison with Real Data
Here, we compare structural properties of this model with those of real metabolic networks.
First, we obtain the parameters p and q from the metabolic network of each organism using Equations 3.28 and 3.31, respectively. Substituting the parameters into the equations, shown in Section 3.3.2, we obtain the structural properties predicted from this model.
Figure 3.10 shows a comparison between the degree distribution of our model and that of real metabolic networks. We provide the P(k) values of two well-known organisms.
Figure 3.11 shows a comparison between the degree exponent of our model and that of real metabolic networks. For comparison, the degree exponents are obtained by the maximum likelihood method, considering a cutoff, represented by A(p), as follows:
(3.32)
This value is different from that obtained by the original maximum likelihood method [29].
Figure 3.12 shows a comparison between the degree-dependent clustering coefficient of our model and that of real metabolic networks. We show C(k) for the same two organisms.
Figure 3.13 shows a comparison between the average clustering coefficient of our model and that of real metabolic networks. The predicted average clustering coefficients are obtained with Equation 3.26 via numerical integration.
As shown in Figures 3.10–3.13, the theoretical predictions are in good agreement with the real data, indicating that this model can reproduce the structural properties of real metabolic networks.
As mentioned above, this model has a few parameters, which can be estimated very easily. Furthermore, the model parameters relate to the frequency of evolutionary events such as horizontal gene transfers and gene duplications, through which the metabolic networks expand (e.g., see Refs. [30,31]). This model may be useful for the estimation of missing pathways and the evolutionary origin of metabolic reactions.