candidates. From the example, we can see that LAPIN is a prefix-growth
algorithm with an efficient pruning strategy. It employs a depth-first search of
the lexicographic tree to grow the sequences with the help of the projected
item-last-position databases.
In addition, the same authors introduced some variants of LAPIN,
i.e., LAPIN-SPAM [57] and LAPIN-Web [59]. The first is devised
on the basis of the SPAM algorithm [7] and utilizes bit information to further
improve efficiency. The latter, i.e., LAPIN-Web [59], is introduced
specifically to extract users' frequent access patterns from Web
log data.
Table 6.3: Item A's Projected Last Position Lists

Customer ID    Projected Last Position Lists
10             c_last = 5, a_last = 6, d_last = 7
20             c_last = 5, b_last = 6, d_last = 6
30             c_last = 4, d_last = 4
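The lists in Table 6.3 can be reproduced with a short sketch. The underlying customer database is not shown in this excerpt, so the sequences below are hypothetical, invented only to be consistent with the table; the function itself is a minimal sketch, not LAPIN's actual implementation.

```python
def last_position_list(sequence, prefix_item):
    """Return {item: last position} for the suffix after prefix_item's
    first occurrence. The sequence is a list of itemsets (sets of items);
    positions are 1-based indices into the full sequence."""
    # The projection starts right after the first element containing prefix_item.
    start = next(i for i, s in enumerate(sequence) if prefix_item in s) + 1
    last = {}
    for pos in range(start, len(sequence)):
        for item in sequence[pos]:
            last[item] = pos + 1  # later occurrences overwrite earlier ones
    return last

# Hypothetical customer sequences, chosen to match Table 6.3.
db = {
    10: [{"a"}, {"a"}, {"c"}, {"a"}, {"c"}, {"a"}, {"d"}],
    20: [{"a"}, {"b"}, {"c"}, {"b"}, {"c"}, {"b", "d"}],
    30: [{"a"}, {"c"}, {"c"}, {"c", "d"}],
}
print(last_position_list(db[10], "a"))  # {'a': 6, 'c': 5, 'd': 7}
```

Customer 20 illustrates why two items can share a last position: they occur together in the same itemset.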
As clarified in [36], the LAPIN strategy can be deemed one of the
promising techniques in the sequential pattern mining literature.
6.3 Frequent Subtree Mining
Frequent subtree mining can be seen as an extension of frequent
itemset and sequence mining, since the data structure of the former is
more complex than that of the latter. There are many applications based on
frequent tree mining, such as Web mining, bioinformatics, computer
networks, and so forth. In this section, we introduce the basic concepts
and algorithms for mining frequent subtrees. In essence, most of these
algorithms follow the same spirit as the techniques developed in frequent
itemset mining. More details and a survey on frequent subtree mining can
be found in [11].
6.3.1 Frequent Subtree Mining Problem
The frequent subtree mining problem is defined as follows [11]. Given
a class of trees T, a threshold min_sup, a transitive subtree relation P ⪯ Q
between trees P, Q ∈ T, and a finite data set of trees D ⊆ T, the frequent tree
mining problem is the problem of finding a set of trees P ⊆ T such that no two
trees in P are isomorphic and, for all P ∈ P:

    freq(P, D) = Σ_{Q ∈ D} d(P, Q) ≥ min_sup,

where d is an anti-monotone function such that, for all Q ∈ T, d(P′, Q) ≥ d(P, Q) if
P′ ⪯ P. The simplest choice for the function d is given by the indicator function:
d(P, Q) = 1 if P ⪯ Q, and d(P, Q) = 0 otherwise.
Frequent Pattern Mining 137
138 Applied Data Mining
For ease of exposition, in this chapter we consider only the simple case
in which the frequency of a pattern tree is defined as the number of trees in the
data set that contain the pattern tree. This frequency definition is known
as transaction-based frequency, which is similar to that of itemset or sequence
frequency. Because of the transitivity of the subtree relation, the
indicator function is anti-monotone and can be utilized to improve
mining efficiency. Based on this definition of pattern frequency, the
support can be defined as sup(P, D) = freq(P, D)/|D|, which again is
similar to itemset- or sequence-based support.
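These two definitions translate directly into code. In the sketch below, `contains` is a hypothetical placeholder for a subtree-inclusion test; any transitive, anti-monotone relation works in its place, and the toy example simply uses sets with the subset relation.

```python
# Transaction-based frequency: the number of database trees containing
# the pattern, i.e. the sum of the indicator function over D.
def freq(pattern, database, contains):
    return sum(1 for tree in database if contains(pattern, tree))

# Support normalizes frequency by the database size.
def support(pattern, database, contains):
    return freq(pattern, database, contains) / len(database)

# Toy example: "trees" are sets and inclusion is the subset relation.
db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
print(support({"a", "b"}, db, lambda p, q: p <= q))  # 2 of 3 trees contain it
```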
Following the same strategy as other structured pattern mining
algorithms, we can utilize the classic generate-and-test method [1]
to discover all the frequent subtrees. The common workflow
executes the following two steps recursively, where P is initially set to the
empty tree: (1) calculate freq(P) by finding all Q ∈ D with P ⪯ Q; and
(2) let P = suc(P). Note that suc(P) is a procedure that determines
the successor of the tree P. It should guarantee that all the possible trees are
enumerated exactly once and only once. There are many possible ways to
implement it; they differ in the data structures
used and in performance cost.
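The two-step loop above can be sketched as follows. The successor function and the toy domain (integer "patterns" with divisibility as occurrence) are illustrative stand-ins, not an actual tree enumerator.

```python
# Generate-and-test skeleton: enumerate candidates via a successor
# function `suc`, count each candidate's frequency, and keep the
# frequent ones. `suc` must visit every candidate exactly once.
def generate_and_test(database, min_sup, suc, freq, first):
    frequent = []
    P = first
    while P is not None:
        if freq(P, database) >= min_sup:
            frequent.append(P)
        P = suc(P)  # step (2): move to the next candidate pattern
    return frequent

# Toy instance: pattern n "occurs" in transaction t when n divides t,
# and the successor simply walks the integers 1..5.
db = [2, 4, 6, 9]
f = lambda n, d: sum(1 for t in d if t % n == 0)
s = lambda n: n + 1 if n < 5 else None
print(generate_and_test(db, 2, s, f, 1))  # [1, 2, 3]
```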
Algorithm 4: The TreeMiner algorithm [63]
Input: D, σ, {T^k_1, ..., T^k_m}, F(σ, D, ⪯_e)
Output: F(σ, D, ⪯_e)
 1  for i ← 1 to m do
 2      F^{k+1}_i ← ∅;
 3      for j ← 1 to m do
 4          C^{k+1}_{i,j} ← ∅;
 5          C^{k+1}_{i,j} ← ⊗(T^k_i, T^k_j, ⪯_e);
 6          for all T^{k+1}_{i,j} ∈ C^{k+1}_{i,j} such that supp(T^{k+1}_{i,j}) ≥ σ do
 7              F^{k+1}_i ← F^{k+1}_i ∪ {T^{k+1}_{i,j}};
 8          end
 9      end
10      F(σ, D, ⪯_e) ← F(σ, D, ⪯_e) ∪ {F^{k+1}_i};
11      TreeMiner(D, σ, F^{k+1}_i, F(σ, D, ⪯_e));
12  end
6.3.2 Data Structures for Storing Trees
There are many possible data structures can be used for storing trees. For
example, the adjacency matrix and the fi rst-child-next-sibling are commonly
utilized. In addition to these data structures, some other tree representations
have been also introduced for different purposes. For example, to save space,
some canonical representations are proposed because they are more compact
than the commonly used data structures. Another reason is that because
there are always many possibilities to represent the same tree information
for labeled trees, using a unique way is important and essential for mining
process. An effective representation, therefore, facilitates the comparison
process. We will introduce different approaches that were proposed for
frequent subtree mining.
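As a minimal illustration of the first-child/next-sibling representation mentioned above (the class and function names are our own, not from any specific library): every node keeps only two links, so a labeled tree of arbitrary arity is stored as a binary structure.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.first_child = None   # link to the leftmost child
        self.next_sibling = None  # link to the next child of the same parent

def add_child(parent, child):
    """Append `child` as the rightmost child of `parent`."""
    if parent.first_child is None:
        parent.first_child = child
    else:
        n = parent.first_child
        while n.next_sibling is not None:
            n = n.next_sibling
        n.next_sibling = child

def preorder(node):
    """Yield labels in preorder, following only the two links."""
    while node is not None:
        yield node.label
        yield from preorder(node.first_child)
        node = node.next_sibling

root = Node("a")
for lbl in "bcd":
    add_child(root, Node(lbl))
add_child(root.first_child, Node("e"))  # e becomes a child of b
print(list(preorder(root)))             # ['a', 'b', 'e', 'c', 'd']
```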
6.3.2.1 TreeMiner
Zaki introduced the TreeMiner algorithm [63] to mine frequently ordered
subtrees. The basic idea is that it applies both breadth fi rst search (BFS)
and depth fi rst search (DFS) to traverse the whole search space fi nding
the frequent subtrees. Similar to other structure mining algorithms in
the literature, TreeMiner also applies the Apriori rule, i.e., all subtrees of
a frequent tree are frequent. Moreover, the author introduces an effective
strategy that by fi nding an observation if we remove either one of the last
two vertices at the end of the string encoding of a rooted ordered tree, we
can obtain the string encoding of a valid embedded subtree. Based on this
observation, Zaki proposed to use BFS and DFS integratedly that generates
the candidate (k+1)-subtrees by joining two frequent k-subtrees which have
the same prefi x string encodings with (k-1)-length. This idea is similar to
that of SPADE [62] for sequential pattern mining.
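The prefix-based join can be sketched as follows. This is only the grouping skeleton: the encodings are plain strings standing in for Zaki's tree string encodings, and the join shown merely appends a symbol, whereas the real ⊗ join must also respect the tree positions of the joined nodes.

```python
from collections import defaultdict

def join_candidates(frequent_k):
    """Group frequent k-encodings by their length-(k-1) prefix and join
    members of the same equivalence class pairwise (self-joins included)."""
    classes = defaultdict(list)
    for enc in frequent_k:
        classes[enc[:-1]].append(enc)      # same class = same (k-1)-prefix
    candidates = set()
    for members in classes.values():
        for x in members:
            for y in members:
                candidates.add(x + y[-1])  # candidate (k+1)-encoding
    return candidates

print(sorted(join_candidates(["ab", "ac"])))  # ['abb', 'abc', 'acb', 'acc']
```

Only encodings sharing a prefix are ever joined, which is what keeps the candidate space small compared with joining all pairs.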
Algorithm 5: The FREQT algorithm [6]
Input: D, σ, {T^k_1, ..., T^k_m}, F(σ, D, ⪯_i)
Output: F(σ, D, ⪯_i)
 1  for i ← 1 to m do
 2      F^{k+1}_i ← ∅;
 3      C^{k+1}_i ← ∅;
 4      C^{k+1}_i ← extension(T^k_i, OCL(T^k_i));
 5      for all T^{k+1}_i ∈ C^{k+1}_i such that supp(T^{k+1}_i) ≥ σ do
 6          F^{k+1}_i ← F^{k+1}_i ∪ {T^{k+1}_i};
 7      end
 8      F(σ, D, ⪯_i) ← F(σ, D, ⪯_i) ∪ {F^{k+1}_i};
 9      FREQT(D, σ, F^{k+1}_i, F(σ, D, ⪯_i));
10  end
To count the frequency of the candidate subtrees, TreeMiner, similar to
SPADE [62], introduces a vertical format to represent the data. Specifically,
the scope of a node is defined as the interval between its own preorder
number and the preorder number of its rightmost descendant. From this
definition, we can see that the representation of the data tree could become very
large, which is an issue for large datasets. The pseudo code of the TreeMiner
algorithm is shown in Algorithm 4; refer to [63] for details.
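The scope definition can be illustrated with a short sketch (the tree representation and names below are our own): a node's scope is the pair of its preorder number and the largest preorder number in its subtree.

```python
def scopes(tree, counter=None, out=None):
    """Compute [preorder number, preorder number of rightmost descendant]
    for every node. Trees are nested tuples: (label, [children...])."""
    if counter is None:
        counter, out = [0], []
    label, children = tree
    my_num = counter[0]
    counter[0] += 1
    entry = [label, my_num, my_num]   # upper bound is updated after recursion
    out.append(entry)
    for child in children:
        scopes(child, counter, out)
    entry[2] = counter[0] - 1         # last preorder number issued in this subtree
    return [(lbl, (lo, hi)) for lbl, lo, hi in out]

# a has children b and c; b has child d.
tree = ("a", [("b", [("d", [])]), ("c", [])])
print(scopes(tree))  # [('a', (0, 3)), ('b', (1, 2)), ('d', (2, 2)), ('c', (3, 3))]
```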
6.3.2.2 FREQT
Asai et al. proposed the FREQT algorithm [6] to fi nd the frequent induced
subtrees. The basic idea follows the well known property, i.e., Apriori. To
generate the candidates, FREQT applies the rightmost extension strategy
that, a k-tree is extended to a candidate (k+1)-tree by adding a new node to
the node at the rightmost branch of the k-tree. By this way, we know that
the parent tree can be uniquely determined. This strategy also guarantees
that each candidate subtrees are traversed exactly only once. Similar to other
structure pattern mining (i.e., itemset or sequence mining), the algorithm
starts to fi nd 1-patterns and then grows the pattern by increasing 1 and so
forth to fi nd all the frequent subtrees. The mining process is terminated
when there is no possible extension can be made. Figure 5 shows the pseudo
code of the FREQT algorithm.
To extend a frequent k-tree to candidate (k+1)-trees, the FREQT
algorithm utilizes the rightmost extension strategy. First, all the siblings
of the nodes on the rightmost path of the k-tree are determined, and then
the children of the rightmost leaf are found. Based on these
nodes, the candidate (k+1)-trees can be determined. An intuitive way to
implement this strategy is to scan only a small part of each tree, instead
of the whole data, to improve efficiency. Some techniques have
been introduced in [6] to this end: a list of pointers is kept for
each tree, pointing to the nodes where the pattern maps, and
only the occurrences of the rightmost leaf of the pattern are stored, to
reduce the space cost.
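Rightmost extension itself is compact to express under the assumption (common in FREQT-style implementations, though not spelled out here) that trees are stored as preorder lists of (depth, label) pairs: appending a node at depth d makes it the rightmost child of the node at depth d-1 on the rightmost path.

```python
def rightmost_extensions(tree, labels):
    """Yield all (k+1)-trees obtained by attaching one new node somewhere
    on the rightmost path. `tree` is a preorder list of (depth, label)."""
    last_depth = tree[-1][0]              # depth of the rightmost leaf
    for d in range(1, last_depth + 2):    # child of any rightmost-path node
        for lbl in labels:
            yield tree + [(d, lbl)]

# The 2-node tree a-b (b is a child of a):
t = [(0, "a"), (1, "b")]
cands = list(rightmost_extensions(t, ["c"]))
print(cands)  # c as a sibling of b, then c as a child of b
```

Removing the last pair of any extended tree recovers t, which mirrors the uniqueness argument in the text: each candidate has exactly one parent tree.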
6.3.2.3 HybridTreeMiner
To further improve the effi ciency of frequent subtree mining, Chi et al. [14]
has proposed the HybridTreeMiner algorithm which, similar to TreeMiner,
applies both the breadth fi rst search and the depth fi rst search strategies.
The basic idea of HybridTreeMiner also follows the traditional generate-
and-test technique. To effi ciently generate the candidates to be tested, the
authors introduced the tree representation, i.e., breadth fi rst canonical form,
to facilitate traversal of all possible subtree candidates. A disadvantage of
the algorithm is that it cannot generate all the candidates in constant time
because of the complexity cost.
Similar to TreeMiner, the HybridTreeMiner algorithm joins two k-trees
that share a common (k-1)-tree prefix to generate the candidate subtrees.
For those trees that cannot be generated by joining, HybridTreeMiner
borrows the idea of FREQT and extends the frequent subtrees to obtain
larger candidates. There are several effective strategies proposed in [13, 14]
to address the issue of tree automorphisms during the joining and extending
processes. Moreover, for different types of trees, the authors introduced
different approaches to improve the efficiency of the algorithm. For
example, to deal with free trees, the algorithm is extended by utilizing the
breadth-first tree encoding; in this way, it only needs to account for a small part of
all the rooted trees. Another contribution of HybridTreeMiner is the use of
occurrence lists of the subtrees; Chi et al. [14] explain
how these lists are joined when generating the candidates.
6.3.2.4 Gaston
Nijssen et al. proposed another algorithm named Gaston [40]. Based on the
similar idea of TreeMiner and HybridTreeMiner, the Gaston algorithm applies
both the breadth fi rst search and depth fi rst search strategies. There are
several phases introduced for the whole mining process. Firstly, it extracts
the frequent undirected paths by traversing all the possible ways. To
facilitate the process, the authors introduced an effective representation for
trees which can be built in reasonable time for large data; then it deals with
these paths as the start point of a rooted tree, and joins or extends them with
rightmost path extension technique to generate the candidates and test.
6.3.3 Maximal and Closed Frequent Subtrees
A main issue of all the previous work is that the set of resulting frequent subtrees
may be very large and can grow exponentially as the size of the original
data increases. As a result, how to obtain them efficiently is important.
Moreover, the sheer size of the result makes it very difficult to interpret
as a whole. To tackle these problems, maximal and closed frequent
subtrees have been introduced [50, 54, 12], borrowing the idea from the
literature of itemset and sequence mining. A maximal frequent subtree
is a frequent subtree none of whose proper supertrees is frequent; similarly,
a closed frequent subtree is one that has no proper supertree with the same
support. In this way, the number of discovered frequent patterns can be reduced
dramatically, which facilitates both the mining process and the interpretation of
the results. The basic idea of [50, 54] is to first find all the frequent
subtrees and then filter out the non-maximal patterns. This technique,
although simple to implement, is time-consuming. To tackle this issue,
Chi et al. [12] proposed the CMTreeMiner algorithm, which extracts the
maximal patterns without first finding all the frequent ones. Furthermore,