candidates. From the example, we can see that LAPIN is a prefix-growth
algorithm with an efficient pruning strategy. It employs a depth-first search of
the lexicographic tree to grow the sequences with the help of the projected
item-last-position databases.
In addition, the same authors introduced some variants of LAPIN,
i.e., LAPIN-SPAM [57] and LAPIN-Web [59]. The first is devised
on the basis of the SPAM algorithm [7] and utilizes bit information to further
improve efficiency. The latter, i.e., LAPIN-Web [59], is introduced
specifically to extract users' frequent access patterns from Web
log data.
Table 6.3: Item A's Projected Last Position Lists

Customer ID    Projected Last Position Lists
10             c_last = 5, a_last = 6, d_last = 7
20             c_last = 5, b_last = 6, d_last = 6
30             c_last = 4, d_last = 4
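The lists in Table 6.3 can be reproduced with a short sketch. The underlying customer database is not shown in this excerpt, so the sequences below are hypothetical, invented only to be consistent with the table; the function itself is a minimal sketch, not LAPIN's actual implementation.

```python
def last_position_list(sequence, prefix_item):
    """Return {item: last position} for the suffix after prefix_item's
    first occurrence. The sequence is a list of itemsets (sets of items);
    positions are 1-based indices into the full sequence."""
    # The projection starts right after the first element containing prefix_item.
    start = next(i for i, s in enumerate(sequence) if prefix_item in s) + 1
    last = {}
    for pos in range(start, len(sequence)):
        for item in sequence[pos]:
            last[item] = pos + 1  # later occurrences overwrite earlier ones
    return last

# Hypothetical customer sequences, chosen to match Table 6.3.
db = {
    10: [{"a"}, {"a"}, {"c"}, {"a"}, {"c"}, {"a"}, {"d"}],
    20: [{"a"}, {"b"}, {"c"}, {"b"}, {"c"}, {"b", "d"}],
    30: [{"a"}, {"c"}, {"c"}, {"c", "d"}],
}
print(last_position_list(db[10], "a"))  # {'a': 6, 'c': 5, 'd': 7}
```

Customer 20 illustrates why two items can share a last position: they occur together in the same itemset.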
As clarified in [36], the LAPIN strategy can be deemed one of the
promising techniques in the sequential pattern mining literature.
6.3 Frequent Subtree Mining
Frequent subtree mining can be seen as an extension of frequent
itemset and sequence mining, since the data structure of the former is
more complex than that of the latter. There are many applications based on
frequent tree mining, such as Web mining, bioinformatics, computer
networks, and so forth. In this section, we introduce the basic concepts
and algorithms for mining frequent subtrees. In essence, most of these
algorithms follow the same spirit as the techniques developed in frequent
itemset mining. More details and a survey on frequent subtree mining can
be found in [11].
6.3.1 Frequent Subtree Mining Problem
The frequent subtree mining problem is defined as follows [11]. Given
a class of trees T, a threshold min_sup, a transitive subtree relation P ⪯ Q
between trees P, Q ∈ T, and a finite data set of trees D ⊆ T, the frequent tree
mining problem is the problem of finding a set of trees P ⊆ T such that no two
trees in P are isomorphic and, for all P ∈ P:

    freq(P, D) = Σ_{Q ∈ D} d(P, Q) ≥ min_sup,

where d is an anti-monotone function such that, for all Q ∈ T, d(P′, Q) ≥ d(P, Q) if
P′ ⪯ P. The simplest choice for the function d is given by the indicator function:
d(P, Q) = 1 if P ⪯ Q, and d(P, Q) = 0 otherwise.
Frequent Pattern Mining 137
138 Applied Data Mining
For ease of exposition, in this chapter we consider only the simple case
in which the frequency of a pattern tree is defined as the number of trees in the
data set that contain the pattern tree. This frequency definition is known
as transaction-based frequency, which is similar to that of itemset or sequence
frequency. Because of the transitivity of the subtree relation, the
indicator function is anti-monotone and can be utilized to improve
mining efficiency. Based on this definition of pattern frequency, the
support can be defined as sup(P, D) = freq(P, D)/|D|, which again is
similar to itemset- or sequence-based support.
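These two definitions translate directly into code. In the sketch below, `contains` is a hypothetical placeholder for a subtree-inclusion test; any transitive, anti-monotone relation works in its place, and the toy example simply uses sets with the subset relation.

```python
# Transaction-based frequency: the number of database trees containing
# the pattern, i.e. the sum of the indicator function over D.
def freq(pattern, database, contains):
    return sum(1 for tree in database if contains(pattern, tree))

# Support normalizes frequency by the database size.
def support(pattern, database, contains):
    return freq(pattern, database, contains) / len(database)

# Toy example: "trees" are sets and inclusion is the subset relation.
db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
print(support({"a", "b"}, db, lambda p, q: p <= q))  # 2 of 3 trees contain it
```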
Following the same strategy as other structured pattern mining
algorithms, we can utilize the classic generate-and-test method [1]
to discover all the frequent subtrees. The common workflow
executes the following two steps recursively, where P is initially set to the
empty tree: (1) calculate freq(P) by finding all Q ∈ D with P ⪯ Q; and
(2) let P = suc(P). Note that suc(P) is a procedure that determines
the successor of the tree P. It should guarantee that all the possible trees are
enumerated exactly once and only once. There are many possible ways to
implement it; they differ in the data structures
used and in performance cost.
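The two-step loop above can be sketched as follows. The successor function and the toy domain (integer "patterns" with divisibility as occurrence) are illustrative stand-ins, not an actual tree enumerator.

```python
# Generate-and-test skeleton: enumerate candidates via a successor
# function `suc`, count each candidate's frequency, and keep the
# frequent ones. `suc` must visit every candidate exactly once.
def generate_and_test(database, min_sup, suc, freq, first):
    frequent = []
    P = first
    while P is not None:
        if freq(P, database) >= min_sup:
            frequent.append(P)
        P = suc(P)  # step (2): move to the next candidate pattern
    return frequent

# Toy instance: pattern n "occurs" in transaction t when n divides t,
# and the successor simply walks the integers 1..5.
db = [2, 4, 6, 9]
f = lambda n, d: sum(1 for t in d if t % n == 0)
s = lambda n: n + 1 if n < 5 else None
print(generate_and_test(db, 2, s, f, 1))  # [1, 2, 3]
```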
Algorithm 4: The TreeMiner algorithm [63]
Input: D, σ, {T^k_1, ..., T^k_m}, F(σ, D, ⪯_e)
Output: F(σ, D, ⪯_e)
 1  for i ← 1 to m do
 2      F^{k+1}_i ← ∅;
 3      for j ← 1 to m do
 4          C^{k+1}_{i,j} ← ∅;
 5          C^{k+1}_{i,j} ← ⊗(T^k_i, T^k_j, ⪯_e);
 6          for all T^{k+1}_{i,j} ∈ C^{k+1}_{i,j} such that supp(T^{k+1}_{i,j}) ≥ σ do
 7              F^{k+1}_i ← F^{k+1}_i ∪ {T^{k+1}_{i,j}};
 8          end
 9      end
10      F(σ, D, ⪯_e) ← F(σ, D, ⪯_e) ∪ {F^{k+1}_i};
11      TreeMiner(D, σ, F^{k+1}_i, F(σ, D, ⪯_e));
12  end
6.3.2 Data Structures for Storing Trees
There are many possible data structures can be used for storing trees. For
example, the adjacency matrix and the fi rst-child-next-sibling are commonly
utilized. In addition to these data structures, some other tree representations
have been also introduced for different purposes. For example, to save space,
some canonical representations are proposed because they are more compact
than the commonly used data structures. Another reason is that because
there are always many possibilities to represent the same tree information
for labeled trees, using a unique way is important and essential for mining
process. An effective representation, therefore, facilitates the comparison
process. We will introduce different approaches that were proposed for
frequent subtree mining.
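As a minimal illustration of the first-child/next-sibling representation mentioned above (the class and function names are our own, not from any specific library): every node keeps only two links, so a labeled tree of arbitrary arity is stored as a binary structure.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.first_child = None   # link to the leftmost child
        self.next_sibling = None  # link to the next child of the same parent

def add_child(parent, child):
    """Append `child` as the rightmost child of `parent`."""
    if parent.first_child is None:
        parent.first_child = child
    else:
        n = parent.first_child
        while n.next_sibling is not None:
            n = n.next_sibling
        n.next_sibling = child

def preorder(node):
    """Yield labels in preorder, following only the two links."""
    while node is not None:
        yield node.label
        yield from preorder(node.first_child)
        node = node.next_sibling

root = Node("a")
for lbl in "bcd":
    add_child(root, Node(lbl))
add_child(root.first_child, Node("e"))  # e becomes a child of b
print(list(preorder(root)))             # ['a', 'b', 'e', 'c', 'd']
```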
6.3.2.1 TreeMiner
Zaki introduced the TreeMiner algorithm [63] to mine frequently ordered
subtrees. The basic idea is that it applies both breadth fi rst search (BFS)
and depth fi rst search (DFS) to traverse the whole search space fi nding
the frequent subtrees. Similar to other structure mining algorithms in
the literature, TreeMiner also applies the Apriori rule, i.e., all subtrees of
a frequent tree are frequent. Moreover, the author introduces an effective
strategy that by fi nding an observation if we remove either one of the last
two vertices at the end of the string encoding of a rooted ordered tree, we
can obtain the string encoding of a valid embedded subtree. Based on this
observation, Zaki proposed to use BFS and DFS integratedly that generates
the candidate (k+1)-subtrees by joining two frequent k-subtrees which have
the same prefi x string encodings with (k-1)-length. This idea is similar to
that of SPADE [62] for sequential pattern mining.
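The prefix-based join can be sketched as follows. This is only the grouping skeleton: the encodings are plain strings standing in for Zaki's tree string encodings, and the join shown merely appends a symbol, whereas the real ⊗ join must also respect the tree positions of the joined nodes.

```python
from collections import defaultdict

def join_candidates(frequent_k):
    """Group frequent k-encodings by their length-(k-1) prefix and join
    members of the same equivalence class pairwise (self-joins included)."""
    classes = defaultdict(list)
    for enc in frequent_k:
        classes[enc[:-1]].append(enc)      # same class = same (k-1)-prefix
    candidates = set()
    for members in classes.values():
        for x in members:
            for y in members:
                candidates.add(x + y[-1])  # candidate (k+1)-encoding
    return candidates

print(sorted(join_candidates(["ab", "ac"])))  # ['abb', 'abc', 'acb', 'acc']
```

Only encodings sharing a prefix are ever joined, which is what keeps the candidate space small compared with joining all pairs.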
Algorithm 5: The FREQT algorithm [6]
Input: D, σ, {T^k_1, ..., T^k_m}, F(σ, D, ⪯_i)
Output: F(σ, D, ⪯_i)
 1  for i ← 1 to m do
 2      F^{k+1}_i ← ∅;
 3      C^{k+1}_i ← ∅;
 4      C^{k+1}_i ← extension(T^k_i, OCL(T^k_i));
 5      for all T^{k+1}_i ∈ C^{k+1}_i such that supp(T^{k+1}_i) ≥ σ do
 6          F^{k+1}_i ← F^{k+1}_i ∪ {T^{k+1}_i};
 7      end
 8      F(σ, D, ⪯_i) ← F(σ, D, ⪯_i) ∪ {F^{k+1}_i};
 9      FREQT(D, σ, F^{k+1}_i, F(σ, D, ⪯_i));
10  end
To count the frequency of the candidate subtrees, TreeMiner, similar to
SPADE [62], introduces a vertical format to represent the data. Specifically,
the scope of a node is defined as the interval between its own preorder
number and the preorder number of its rightmost descendant. From this
definition, we can see that the representation of the data tree could become very
large, which is an issue for large datasets. The pseudo code of the TreeMiner
algorithm is shown in Algorithm 4; refer to [63] for details.
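The scope definition can be illustrated with a short sketch (the tree representation and names below are our own): a node's scope is the pair of its preorder number and the largest preorder number in its subtree.

```python
def scopes(tree, counter=None, out=None):
    """Compute [preorder number, preorder number of rightmost descendant]
    for every node. Trees are nested tuples: (label, [children...])."""
    if counter is None:
        counter, out = [0], []
    label, children = tree
    my_num = counter[0]
    counter[0] += 1
    entry = [label, my_num, my_num]   # upper bound is updated after recursion
    out.append(entry)
    for child in children:
        scopes(child, counter, out)
    entry[2] = counter[0] - 1         # last preorder number issued in this subtree
    return [(lbl, (lo, hi)) for lbl, lo, hi in out]

# a has children b and c; b has child d.
tree = ("a", [("b", [("d", [])]), ("c", [])])
print(scopes(tree))  # [('a', (0, 3)), ('b', (1, 2)), ('d', (2, 2)), ('c', (3, 3))]
```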
6.3.2.2 FREQT
Asai et al. proposed the FREQT algorithm [6] to fi nd the frequent induced
subtrees. The basic idea follows the well known property, i.e., Apriori. To
generate the candidates, FREQT applies the rightmost extension strategy
that, a k-tree is extended to a candidate (k+1)-tree by adding a new node to
the node at the rightmost branch of the k-tree. By this way, we know that
the parent tree can be uniquely determined. This strategy also guarantees
that each candidate subtrees are traversed exactly only once. Similar to other
structure pattern mining (i.e., itemset or sequence mining), the algorithm
starts to fi nd 1-patterns and then grows the pattern by increasing 1 and so
forth to fi nd all the frequent subtrees. The mining process is terminated
when there is no possible extension can be made. Figure 5 shows the pseudo
code of the FREQT algorithm.
To extend a frequent k-tree to candidate (k+1)-trees, the FREQT
algorithm utilizes the rightmost extension strategy. First, all the siblings
of the nodes on the rightmost path of the k-tree are determined, and then
the children of the rightmost leaf are found. Based on these
nodes, the candidate (k+1)-trees can be determined. An intuitive way to
implement this strategy is to scan only a small part of each tree, instead
of the whole data, to improve efficiency. Some techniques have
been introduced in [6] to this end: a list of pointers is kept for
each tree, pointing to the nodes where the pattern maps, and
only the occurrences of the rightmost leaf of the pattern are stored, to
reduce the space cost.
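Rightmost extension itself is compact to express under the assumption (common in FREQT-style implementations, though not spelled out here) that trees are stored as preorder lists of (depth, label) pairs: appending a node at depth d makes it the rightmost child of the node at depth d-1 on the rightmost path.

```python
def rightmost_extensions(tree, labels):
    """Yield all (k+1)-trees obtained by attaching one new node somewhere
    on the rightmost path. `tree` is a preorder list of (depth, label)."""
    last_depth = tree[-1][0]              # depth of the rightmost leaf
    for d in range(1, last_depth + 2):    # child of any rightmost-path node
        for lbl in labels:
            yield tree + [(d, lbl)]

# The 2-node tree a-b (b is a child of a):
t = [(0, "a"), (1, "b")]
cands = list(rightmost_extensions(t, ["c"]))
print(cands)  # c as a sibling of b, then c as a child of b
```

Removing the last pair of any extended tree recovers t, which mirrors the uniqueness argument in the text: each candidate has exactly one parent tree.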
6.3.2.3 HybridTreeMiner
To further improve the effi ciency of frequent subtree mining, Chi et al. [14]
has proposed the HybridTreeMiner algorithm which, similar to TreeMiner,
applies both the breadth fi rst search and the depth fi rst search strategies.
The basic idea of HybridTreeMiner also follows the traditional generate-
and-test technique. To effi ciently generate the candidates to be tested, the
authors introduced the tree representation, i.e., breadth fi rst canonical form,
to facilitate traversal of all possible subtree candidates. A disadvantage of
the algorithm is that it cannot generate all the candidates in constant time
because of the complexity cost.
Similar to TreeMiner, the HybridTreeMiner algorithm joins two k-trees
that share a common (k-1)-tree prefix to generate the candidate subtrees.
For those trees that cannot be generated by joining, HybridTreeMiner
borrows the idea of FREQT and extends the frequent subtrees to obtain
larger candidates. There are several effective strategies proposed in [13, 14]
to address the issue of tree automorphisms during the joining and extending
processes. Moreover, for different types of trees, the authors introduced
different approaches to improve the efficiency of the algorithm. For
example, to deal with free trees, the algorithm is extended by utilizing the
breadth-first tree encoding; in this way, it only needs to account for a small part of
all the rooted trees. Another contribution of HybridTreeMiner is the use of
occurrence lists of the subtrees; Chi et al. [14] explain
how these lists are joined when generating the candidates.
6.3.2.4 Gaston
Nijssen et al. proposed another algorithm named Gaston [40]. Based on the
similar idea of TreeMiner and HybridTreeMiner, the Gaston algorithm applies
both the breadth fi rst search and depth fi rst search strategies. There are
several phases introduced for the whole mining process. Firstly, it extracts
the frequent undirected paths by traversing all the possible ways. To
facilitate the process, the authors introduced an effective representation for
trees which can be built in reasonable time for large data; then it deals with
these paths as the start point of a rooted tree, and joins or extends them with
rightmost path extension technique to generate the candidates and test.
6.3.3 Maximal and Closed Frequent Subtrees
A main issue of all the previous work is that the set of resulting frequent subtrees
may be very large and can grow exponentially as the size of the original
data increases. As a result, how to obtain them efficiently is important.
Moreover, the sheer size of the result makes it very difficult to interpret
as a whole. To tackle these problems, maximal and closed frequent
subtrees have been introduced [50, 54, 12], borrowing the idea from the
literature of itemset and sequence mining. A maximal frequent subtree
is a frequent subtree none of whose proper supertrees is frequent; similarly,
a closed frequent subtree is one that has no proper supertree with the same
support. In this way, the number of discovered frequent patterns can be reduced
dramatically, which facilitates both the mining process and the interpretation of
the results. The basic idea of [50, 54] is to first find all the frequent
subtrees and then filter out the non-maximal patterns. This technique,
although simple to implement, is time-consuming. To tackle this issue,
Chi et al. [12] proposed the CMTreeMiner algorithm, which extracts the
maximal patterns without first finding all the frequent ones. Furthermore,