10.2 Unsupervised Pattern Discovery in Structured Data

Computational analysis of structural and relational data is a broad and active field of research. Here, we briefly review central approaches to the detection of relevant substructures, including graph mining, optimal subgraph search, graph clustering, biclustering, itemset mining, and higher-order relational data mining.

10.2.1 Graph Mining

Graph mining refers to the search for subgraph patterns with predefined characteristics in a database of one or more graphs. In many cases, it is possible to design algorithms that yield the complete set of solutions; such approaches are called enumerative. In the following presentation, we focus on the most common tasks. We start with the classical problem of frequent subgraph mining [5,6]. Given a database of labeled graphs G1, . . ., Gl, the task is to find all connected subgraphs that occur in at least m graphs, where m is a positive integer referred to as the minimum support threshold. As the same node label may appear multiple times within each graph of the database (e.g., atom names in a database of molecule graphs [7]), these methods generally have to deal with the problem of subgraph isomorphism. In biological networks representing gene or protein relationships, node labels within a graph are typically unique. However, as these graphs are large, the frequency criterion is typically combined with other criteria, such as interaction density or cut thresholds, in order to restrict the size of the output [8,9].
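To make the support criterion concrete, the following minimal Python sketch (using NetworkX; the helper names are illustrative) counts in how many database graphs a given pattern occurs. A full frequent subgraph miner such as those in Refs. [5,6] would additionally enumerate candidate patterns with pruning; here, only the support test for a single candidate is shown.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def support(pattern, database):
    """Count the graphs in `database` that contain `pattern` as a
    label-preserving (induced) subgraph."""
    node_match = isomorphism.categorical_node_match("label", None)
    count = 0
    for G in database:
        # Subgraph isomorphism test; NP-hard in general.
        matcher = isomorphism.GraphMatcher(G, pattern, node_match=node_match)
        if matcher.subgraph_is_isomorphic():
            count += 1
    return count

def is_frequent(pattern, database, m):
    """Frequency criterion: the pattern occurs in at least m graphs."""
    return support(pattern, database) >= m
```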

Other mining approaches search for substructure patterns in a single graph. While a subgraph frequency criterion can be applied to graphs with nonunique node labeling [10], a popular analysis tool for uniquely labeled graphs is clique finding [2].

Definition 10.1 (Clique) Given a graph G with node set V, a clique is defined as a subset of nodes U ⊆ V that induces a complete subgraph, that is, all pairs of nodes in U are connected by an edge. A clique is maximal if it is not contained in any other clique.

In general, clique search is NP-complete [11]. Nevertheless, it is frequently used in practical applications [12,13], and even more flexible pattern definitions have been considered, for example, quasi-cliques [14–17] and pseudo-cliques [18]. They correspond to dense subgraphs rather than complete subgraphs; exact definitions will be given in Section 10.3.
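For uniquely labeled graphs, maximal clique enumeration is readily available in standard libraries; a minimal NetworkX example on a toy graph is given below.

```python
import networkx as nx

# Toy graph consisting of two overlapping triangles.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")])

# Enumerate all maximal cliques (Bron-Kerbosch-style search); worst-case
# exponential, but often fast on sparse biological networks.
for clique in nx.find_cliques(G):
    print(sorted(clique))  # ['A', 'B', 'C'] and ['B', 'C', 'D']
```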

10.2.2 Optimal Subgraph Search

In addition to subgraph enumeration algorithms, there exist multiple approaches to search for (approximately) optimal subgraphs. With respect to the criterion of subgraph density, a number of problems have been studied. First of all, it has been shown that finding a k-node subgraph with the maximum number of edges is NP-hard [19]. Tight approximation bounds have been derived for a simple greedy optimization scheme [20]; the same approximation scheme has been used for directed graphs [21]. Equivalently, the problem of finding a k-node subgraph with the maximum average number of edges per node is NP-hard [22]. On the other hand, without a size constraint, a subgraph with the maximum average number of edges per node can be found in polynomial time by flow-based techniques [22,23].
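A compact sketch of such a greedy scheme, assuming density is measured as the average number of edges per node |E(S)|/|S|: repeatedly peel off a minimum-degree node and remember the best intermediate subgraph. For the unconstrained objective, this peeling strategy is known to return a subgraph whose density is at least half the optimum; the function name is illustrative.

```python
import networkx as nx

def greedy_densest_subgraph(G):
    """Greedy peeling for the (unconstrained) densest-subgraph problem:
    iteratively remove a minimum-degree node and keep the intermediate
    node set that maximizes |E(S)| / |S|."""
    H = G.copy()
    best_nodes = set(H)
    best_density = H.number_of_edges() / H.number_of_nodes()
    while H.number_of_nodes() > 1:
        v = min(H.nodes, key=H.degree)  # cheapest node to peel off
        H.remove_node(v)
        density = H.number_of_edges() / H.number_of_nodes()
        if density > best_density:
            best_nodes, best_density = set(H), density
    return best_nodes, best_density
```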

Moreover, local search approaches have been used to discover dense subgraphs around seed cliques [24,25]. More recently, integer linear programming has been successfully applied to finding connected subgraphs with the maximum sum of node weights, an NP-complete problem [26]. Another type of pattern is the so-called connection subgraph [27], where the aim is to maximize the connectivity between a set of given nodes while removing a large portion of the graph.

10.2.3 Graph Clustering

Graph clustering is the task of partitioning the nodes of a graph into distinct groups (clusters) such that there are many edges within a group and few edges between different groups. This topic has been studied extensively; see Ref. [28] for a review. One seminal work in this area is the Kernighan–Lin algorithm [29], a heuristic strategy to divide a graph into components with a fixed maximum size. If neither the (maximum) size nor the number of partitions is known beforehand, a popular choice is hierarchical clustering, which yields a hierarchy of clusters instead of a single partitioning. Hierarchical methods can be divided into two classes: agglomerative and divisive. Agglomerative strategies build the hierarchy bottom-up, starting from single-node clusters and iteratively merging the “closest” pair into a common cluster; see, for example, Refs. [30–33].

Divisive hierarchical strategies, in contrast, work in a top-down manner, starting with the entire graph and iteratively dividing it into smaller parts. An obvious splitting criterion is the graph cut [34]. Girvan and Newman [35] use the so-called edge betweenness measure, that is, the number of node pairs whose shortest path passes through a specific edge. Since edges bridging between clusters are expected to have large betweenness values, edges are removed from the graph in decreasing order of betweenness. The same technique has also been applied to weighted graphs [36], and Luo et al. [37] combine it with an agglomerative approach. Several alternatives to these measures have been investigated [38,39]. Divisive clustering has also been integrated with graph coarsening to achieve efficient procedures [40].
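NetworkX provides an implementation of the Girvan–Newman scheme; the following minimal example computes the first split of the well-known karate club graph.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()

# Each step removes the edge with the highest betweenness and reports the
# resulting connected components; take the first split into two communities.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])
```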

Other methods directly partition the graph into a set of clusters, without using hierarchical decomposition steps. The most basic approach is to extract the connected components [41]. An increasingly popular method is spectral clustering [42]. Technically, it is based on the eigendecomposition of graph Laplacians and has interpretations related to graph cuts and random walks. Markov clustering [43,44] is also motivated by random walks; it alternates two matrix operations, expansion (matrix multiplication, simulating random-walk steps) and inflation (entrywise exponentiation followed by renormalization, which strengthens intra-cluster transitions). In contrast to spectral clustering, the number of clusters need not be fixed beforehand. Furthermore, probabilistic latent variable models have been used for graph clustering [45]; these often also allow cluster overlaps, that is, the same node may belong to different clusters [46,47].
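As a runnable illustration, scikit-learn's SpectralClustering accepts a precomputed affinity (adjacency) matrix directly; the toy matrix below contains two densely connected groups joined by a single bridge edge.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Two 3-node groups with dense intra-group links and one bridge edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])

# affinity="precomputed" treats A as the graph's weighted adjacency matrix;
# internally, the method eigendecomposes the corresponding graph Laplacian.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)  # e.g., [0 0 0 1 1 1]
```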

In many bioinformatics applications, graph clustering and dense subgraph mining methods are augmented by integrating multiple data sources, the most common scenario being the combination of protein–protein interactions and gene expression data. One straightforward strategy is to build a new network in which protein interaction links and coexpression links are simply pooled [47], or in which the edge weights are determined as a function of multiple data sources [48]. Tanay et al. [49] combine edges of different types in a single bipartite network. In such integrated settings, the edge weights of the different datasets have to be normalized appropriately in order to be comparable. In contrast, other approaches keep the data sources separate and define individual constraints for each of them; a variety of criteria have been proposed, including global and local similarity measures for gene expression profiles [14,45,50–54].
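A minimal sketch of the pooling idea, assuming both sources are given as edge-weight matrices over the same node set; the z-score normalization and the mixing parameter alpha are illustrative choices, not the specific weighting functions used in Refs. [47,48].

```python
import numpy as np

def zscore(W):
    # Bring each data source's edge weights onto a comparable scale.
    return (W - W.mean()) / W.std()

def integrate(W_ppi, W_coexpr, alpha=0.5):
    """Hypothetical pooled network: a convex combination of normalized
    protein-interaction and coexpression edge weights."""
    return alpha * zscore(W_ppi) + (1 - alpha) * zscore(W_coexpr)
```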

10.2.4 Bicluster Analysis

In addition to homogeneous interaction graphs such as protein interaction networks, bipartite graphs occur very frequently in biological data analysis. As in the previous sections, one central problem arising in this context is dense subgraph detection. A pattern of interest then consists of a pair of node subsets, one from each partition, such that each node is connected to a large fraction of the nodes from the other set. This can be seen as a special instance of the biclustering problem, which is very prominent in gene expression analysis [55–58]. Given a data matrix (e.g., the adjacency weights of a bipartite graph), the goal is to extract subsets of rows that are similar with respect to subsets of columns; a particular pair of a row subset and a column subset (defining a submatrix) is called a bicluster. This framework, also known as co-clustering or two-mode clustering [59], contrasts with traditional clustering approaches, which cluster either the rows according to their similarity across all columns, or the columns according to their similarity across all rows [60].

A multitude of bicluster detection methods has been developed during the last decade. First of all, one basic idea is to partition the bipartite graph that corresponds to the data matrix into distinct biclusters [61,62]. Other approaches allow for a more flexible arrangement of biclusters, including bicluster overlap, while still optimizing a global objective function taking the whole matrix into account [63,64]. In contrast, enumerative approaches use local criteria based solely on individual biclusters. Many of them are motivated by the (weighted) bipartite graph formalism and refer to some density property of the subgraph. The widely used SAMBA method [49,65] finds heavy subgraphs around each node. Sim et al. [66] fix the maximum number of missing edges tolerated per node as well as minimum size constraints. In Ref. [67], all maximal bicliques are detected and then further extended. Another work [68] searches for all bicluster patterns that satisfy homogeneity constraints with respect to the weight entries. Moreover, various other methods define specific bicluster criteria and solve the problem in a nonexhaustive way, by greedy strategies or approximation techniques [55].
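As a runnable illustration of the partition-based flavor, scikit-learn offers SpectralCoclustering, which jointly partitions rows and columns via spectral bipartite-graph partitioning (shown here on toy data; not necessarily the exact method of Refs. [61,62]).

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
X = rng.random((8, 6))
X[:4, :3] += 2.0  # plant one heavy bicluster ...
X[4:, 3:] += 2.0  # ... and a second, disjoint one

# Jointly partition rows and columns of the data matrix into biclusters.
model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print(model.row_labels_)     # bicluster assignment per row
print(model.column_labels_)  # bicluster assignment per column
```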

10.2.5 Itemset Mining

For binary-valued data matrices, the simplest bicluster pattern of interest is a submatrix consisting purely of 1-entries. Such patterns can be exhaustively enumerated using itemset mining, an approach originally developed in the context of market basket analysis [69]. There, the data represent a set of transactions, where each transaction consists of a list of products (called items) that were purchased together. The task of frequent itemset mining is to find all sets of items that co-occur in at least m transactions. The frequent itemsets can be used to derive association rules of the following kind: "if a customer bought products A and B, he or she will also buy product C." This information can, for instance, assist in improving shop layouts. Itemset mining has also been applied in the biological domain, for example, to gene expression analysis [70]. The principal algorithmic idea behind itemset mining methods is the observation that any subset of a frequent itemset is frequent as well. The originally proposed Apriori algorithm [69] implements this in a level-wise search strategy, where the frequent sets on one level determine the candidate sets on the next level. Since then, there has been active research on improving efficiency by introducing additional pruning rules and investigating alternative strategies to traverse the search space [71].
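The level-wise strategy admits a compact sketch. The following minimal Apriori implementation (illustrative, without the efficiency refinements of Ref. [71]) mines all itemsets with support at least m:

```python
from itertools import combinations

def apriori(transactions, m):
    """Level-wise frequent itemset mining: since every subset of a
    frequent itemset is frequent, level-k candidates are built only
    from frequent (k-1)-itemsets."""
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent, level = {}, [frozenset([i]) for i in items]
    while level:
        # Count the support of each candidate and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= m}
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates ...
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == len(a) + 1}
        # ... and prune candidates with an infrequent subset.
        level = [c for c in candidates
                 if all(c - {i} in current for i in c)]
    return frequent

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori(baskets, m=3))  # all six 1- and 2-itemsets; {A,B,C} has support 2
```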

10.2.6 Relational Data Mining and Higher-Order Association Analysis

Itemset mining is only suitable for analyzing binary relations such as the transaction-item association data described in the previous section. A natural extension is to consider higher-order relations, which involve more than two partners. For example, one could consider relations between purchased items, regions, and weeks of customer transactions. The generalized mining task can be formulated as follows [72–75]:

Definition 10.2 (Relational Set Mining) Given an n-ary relation R ⊆ D1 × . . . × Dn, find all n-set patterns (S1, . . ., Sn) such that Si ⊆ Di for all i = 1, . . ., n and S1 × . . . × Sn ⊆ R.

Generalizing the frequency criterion from itemset mining (see previous section), one can specify minimum size thresholds for S1, . . ., Sn. In the above definition, all domains Di are assumed to be finite (i.e., categorical); their cardinality is denoted by |Di|. An equivalent representation for such data is a |D1| × . . . × |Dn| array A (also called data cube or tensor), where A(d1, . . ., dn) = 1 if (d1, . . ., dn) ∈ R, and A(d1, . . ., dn) = 0 otherwise (di ∈ {1, . . ., |Di|}, i = 1, . . ., n). Then, an n-set corresponds to a subarray that contains only 1-entries.
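In this tensor view, checking whether a candidate (S1, . . ., Sn) is an n-set pattern reduces to testing a subarray for all-ones, as the following NumPy sketch illustrates (toy data; names are illustrative).

```python
import numpy as np

# Toy 3-ary relation as a binary tensor over domains of sizes 4 x 3 x 2.
A = np.zeros((4, 3, 2), dtype=bool)
A[np.ix_([0, 2], [1, 2], [0, 1])] = True  # plant an all-ones subarray

def is_nset(A, *subsets):
    """(S1, ..., Sn) is an n-set pattern iff the subarray it selects
    contains only 1-entries, i.e., S1 x ... x Sn is contained in R."""
    return bool(A[np.ix_(*subsets)].all())

print(is_nset(A, [0, 2], [1, 2], [0, 1]))  # True
print(is_nset(A, [0, 1], [1], [0]))        # False: (1, 1, 0) is not in R
```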

More generally, one can drop the constraint of binary values and consider tensors with arbitrary weight entries. Such higher-order datasets occur in different application fields such as sales analysis [72], web mining [74,76,77], neuroscience [78,79], and computational biology [75,80,81]. Therefore, methods that deal with multiway arrays have received increasing attention in the data mining community. One of the most prominent topics is tensor decomposition (see Ref. [82] for a review), which can serve as a basis for clustering or anomaly detection [83].

The goal of tensor clustering is to partition each dimension of the tensor into a predefined number of clusters such that the resulting multiway clusters are as homogeneous as possible [84]. This can be approximated by combining the results from clustering individual dimensions separately [85]. Zhao and Zaki [80] also mine for homogeneous clusters, but instead of specifying the number of clusters, they fix thresholds regarding the homogeneity of values along each dimension and detect overlapping cluster patterns (in the three-way case). Relational models [86] focus on binary-valued tensors, aiming at partitioning them into blocks that contain either mostly ones or mostly zeros. Finally, there exist approaches that deal with multiple relations or tensors at the same time, searching for clusters (communities) [84,87] or association rules [88].
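A minimal sketch of the separate-dimension approximation mentioned above: unfold the tensor along each mode and cluster the resulting row vectors with k-means. The combination of the per-mode labels into multiway clusters is omitted, and the helper name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_each_mode(T, n_clusters):
    """Cluster each tensor dimension separately: the mode-n unfolding
    turns every slice of that mode into a row vector, which is then
    clustered with k-means."""
    labels = []
    for mode, k in enumerate(n_clusters):
        X = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        labels.append(KMeans(n_clusters=k, n_init=10,
                             random_state=0).fit_predict(X))
    return labels  # one label vector per tensor dimension

T = np.random.default_rng(0).random((6, 5, 4))
row_lab, col_lab, depth_lab = cluster_each_mode(T, n_clusters=(2, 2, 2))
```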
