CHAPTER 6
Frequent Pattern Mining
Frequent pattern mining is one of the most fundamental research issues in
data mining; it aims to extract useful information from huge volumes of
data. Searching for frequent patterns (i.e., association rules) was originally
motivated by the analysis of historical supermarket transaction data, where
the goal is to discover customer behavior from the items purchased.
Association rules express how frequently items are bought together. For
example, an association rule "beer → diaper (75%)" indicates that 75% of
the customers who bought beer also bought diapers. Such rules can be used
to make predictions and recommendations for customers and to guide store
layout. Stemming from the basic itemset data, rule discovery on more general
and complex data (i.e., sequences, trees, and graphs) has been thoroughly
explored in the past decade. In this chapter, we introduce the basic techniques
of frequent pattern mining on different types of data, i.e., itemsets, sequences,
trees, and graphs.
In the following sections, the most classic algorithms and techniques for
frequent pattern mining will be introduced. Association rule mining will be
presented in Section 6.1. Sequential pattern mining will be introduced in
Section 6.2. Frequent tree and graph mining will be presented in Section 6.3
and Section 6.4, respectively. A chapter summary will be given in Section 6.5.
6.1 Association Rule Mining
Data mining is the task of finding valid, novel, potentially useful, and
ultimately understandable patterns in data [18]. The most fundamental and
important issue in data mining is association rule mining [1], which was first
introduced in the early 1990s.
The purpose of mining association rules is to analyze the co-occurrence
relations between items, which can then be used to make appropriate
recommendations. The issue has attracted a great deal of interest during
the recent surge in data mining research because it is the basis of many
applications, such as customer behavior analysis, stock trend prediction,
and DNA sequence analysis. For example, an association rule "bread →
milk (90%)" indicates that nine out of ten customers who bought bread also
bought milk. These rules can be useful for store layout, stock prediction,
DNA structure analysis, and so forth.
Table 6.1: A database
Tid Transaction
10 bread, milk
20 bread, chocolate, cookie
30 chocolate, cookie
40 milk
50 bread, cookie
6.1.1 Association Rule Mining Problem
The problem of association rule discovery can be stated as follows [1]: Let
I = {i_1, i_2, . . . , i_k} be a set of items. A subset of I is called an itemset. An
itemset t_j is denoted as {x_1, x_2, . . . , x_m}, where x_k is an item, i.e., x_k ∈ I
for 1 ≤ k ≤ m. The number of items in an itemset is called the length of the
itemset. An itemset with length l is called an l-itemset. An itemset t_a =
{a_1, a_2, . . . , a_n} is contained in another itemset t_b = {b_1, b_2, . . . , b_m}
if there exist integers 1 ≤ i_1 < i_2 < . . . < i_n ≤ m such that a_1 = b_{i_1},
a_2 = b_{i_2}, . . . , a_n = b_{i_n}. We denote t_a a subset of t_b, and t_b a
superset of t_a.
The support of an itemset X, denoted as support(X), is the number of
transactions in which it occurs as a subset. A subset of length k of an itemset
is called a k-subset. An itemset is frequent if its support is greater than a
user-specified minimum support (min_sup) value. The set of frequent
k-itemsets is denoted by F_k.
An association rule is an expression A → B, where A and B are itemsets.
The support of the rule is given as support(A → B) = support(AB), and the
confidence of the rule is given as conf(A → B) = support(AB)/support(A)
(i.e., the conditional probability that a transaction contains B, given that it
contains A). A rule is confident if its confidence is greater than a user-specified
minimum confidence (min_conf).
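To make these definitions concrete, the following Python sketch computes the support and confidence of a rule over the transactions of Table 6.1 (the helper functions support and confidence are our own illustrative names, not part of the original formulation):

# Transactions of Table 6.1, keyed by Tid.
D = {
    10: {"bread", "milk"},
    20: {"bread", "chocolate", "cookie"},
    30: {"chocolate", "cookie"},
    40: {"milk"},
    50: {"bread", "cookie"},
}

def support(itemset, db):
    # Number of transactions containing the itemset as a subset.
    return sum(1 for t in db.values() if itemset <= t)

def confidence(antecedent, consequent, db):
    # Conditional probability of the consequent given the antecedent.
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"chocolate", "cookie"}, D))       # 2
print(confidence({"chocolate"}, {"cookie"}, D))  # 1.0, i.e., 100%

The printed values match the row "chocolate → cookie" in Table 6.3 below.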
The association rule mining task is to generate all rules whose supports
are greater than min_sup and whose confidences are greater than min_conf.
The issue can be tackled by a two-stage strategy [2]:
• Find all frequent itemsets. This stage is the most time-consuming part:
given k items, there are potentially 2^k frequent itemsets. Therefore,
almost all work so far has focused on devising efficient algorithms that
discover the frequent itemsets while avoiding traversal of unnecessary
parts of the search space. In this chapter, we mainly introduce the basic
algorithms for finding frequent itemsets.
• Generate confident rules. This stage is relatively straightforward and
can be completed easily.
Almost all the association rule mining algorithms apply the two-stage
rule discovery approach. We will discuss it in more detail in the next few
sections.
Example 1. Let our example database be the database D shown in Table 6.1,
with min_sup = 1 and min_conf = 30%. Table 6.2 shows all frequent itemsets
in D. Table 6.3 illustrates all the association rules. For the sake of simplicity
and without loss of generality, we assume that items in transactions and
itemsets are kept sorted in lexicographic order unless stated otherwise.
Table 6.2: Frequent itemsets

Frequent Itemset         | Transactions | Support
bread                    | 10, 20, 50   | 3
milk                     | 10, 40       | 2
chocolate                | 20, 30       | 2
cookie                   | 20, 30, 50   | 3
bread, milk              | 10           | 1
bread, chocolate         | 20           | 1
bread, cookie            | 20, 50       | 2
chocolate, cookie        | 20, 30       | 2
bread, chocolate, cookie | 20           | 1
Table 6.3: Association rules

Association Rule            | Support | Confidence
bread → cookie              | 2       | 67%
milk → bread                | 1       | 50%
chocolate → bread           | 1       | 50%
chocolate → cookie          | 2       | 100%
cookie → bread              | 2       | 67%
cookie → chocolate          | 2       | 67%
bread, chocolate → cookie   | 1       | 100%
bread, cookie → chocolate   | 1       | 50%
chocolate, cookie → bread   | 1       | 50%
chocolate → bread, cookie   | 1       | 50%
6.1.2 Basic Algorithms for Association Rule Mining
6.1.2.1 Apriori
The first algorithm to address the association rule mining issue was
introduced by Agrawal et al. [1]. The same authors introduced another
algorithm, named Apriori, in a later paper [4], exploiting the monotonicity
property of association rules to improve performance. Mannila et al. [39]
presented independent work with a similar idea. Apriori applies a two-stage
approach to discover frequent itemsets and confident association rules.
Frequent Itemset Discovery. To find all frequent itemsets, Apriori
introduces a candidate generation-and-test strategy. The basic idea is
that it first generates the candidate k-itemsets (k is 1 at the beginning
and is incremented by 1 in each cycle), and these candidates are then
evaluated as to whether they are frequent or not.
Specifically, the algorithm first scans the dataset and finds the frequent
1-itemsets. To discover the frequent 2-itemsets, Apriori generates candidate
2-itemsets by joining 1-itemsets. These candidates are evaluated by scanning
the original dataset again. In the same way, all frequent (k+1)-itemsets can
be found based on the already known frequent k-itemsets.
To improve performance by avoiding the generation of too many
unnecessary candidates, Apriori introduced a monotonicity property: a
(k+1)-itemset becomes a candidate only if all its k-subsets are frequent.
As demonstrated by the authors [4] and many later works, this simple yet
efficient strategy largely reduces the number of candidates to be evaluated.
The frequent itemset mining of the Apriori algorithm is presented in
Algorithm 1. The algorithm is executed in a breadth-first search manner.
To generate the candidate itemsets of length k+1, two k-itemsets with the
same (k-1)-prefix are joined together (lines 12–13). The joined itemset is
inserted into C_{k+1} only if all its k-subsets are frequent (line 14).
To test the candidate k-itemsets (i.e., count their supports), the database
is scanned sequentially and each candidate itemset is tested for inclusion
in the scanned transaction; in this way the corresponding supports are
accumulated (lines 5–9). Finally, the frequent itemsets are collected
(line 10).
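The following Python sketch mirrors the structure of Algorithm 1: a breadth-first generate-and-test loop with monotonicity pruning. It is a minimal illustration with names of our own choosing, not a reference implementation; for brevity, the prefix join of lines 12–13 is replaced by joining any two frequent k-itemsets that differ in a single item, which yields the same candidate set once the subset-pruning test of line 14 is applied.

from itertools import combinations

def apriori(db, min_sup):
    # C1: every individual item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in db for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Test step (lines 5-10 of Algorithm 1): one database scan
        # counts the support of every candidate.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        F_k = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(F_k)
        # Generate step (lines 12-14): join frequent k-itemsets that
        # differ in one item, then prune any candidate that has an
        # infrequent k-subset (the monotonicity property).
        candidates = set()
        for a in F_k:
            for b in F_k:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(s) in F_k for s in combinations(u, k)):
                    candidates.add(u)
        k += 1
    return frequent

D = [{"bread", "milk"}, {"bread", "chocolate", "cookie"},
     {"chocolate", "cookie"}, {"milk"}, {"bread", "cookie"}]
print(apriori(D, min_sup=2))

Run on the database of Table 6.1 with min_sup = 2, this returns exactly the itemsets of Table 6.2 whose support is at least 2.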
Association Rule Mining. After discovering the frequent itemsets, we
can find the frequent and confident association rules straightforwardly.
The approach is similar to the frequent itemset mining algorithm.
Because the cost of finding frequent itemsets is high and accounts for
most of the overall cost of discovering association rules, almost all
research so far has focused on the frequent itemset generation step.
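As a minimal sketch of this second stage, assuming a dictionary mapping each frequent itemset to its support (such as the one produced by the apriori sketch above): every non-empty proper subset A of a frequent itemset is tried as an antecedent, and the rule A → (itemset − A) is kept if its confidence reaches min_conf.

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: dict mapping frozenset(itemset) -> support.
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as the antecedent A;
        # by monotonicity, support(A) is already in `frequent`.
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[A]
                if conf >= min_conf:
                    yield A, itemset - A, sup, conf

Applied to the output of apriori on Table 6.1 with min_sup = 1 and min_conf = 0.3, this enumerates rules such as those listed in Table 6.3.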
6.1.2.2 Eclat
Many algorithms have been proposed based on the Apriori idea. Among
them, Eclat [64, 61] is distinct in that it was the first proposed to generate
all frequent itemsets in a depth-first manner, while employing the vertical
database layout and using an intersection-based approach to compute the
support of an itemset.
Figure 6.1.1 illustrates the key idea of Eclat for candidate support
counting. During the first scan of the dataset, it converts the original format
(i.e., Table 6.1) into the vertical TID-list format, as shown in Fig. 6.1.1. For
example, the TID list of the itemset {bread} is {10, 20, 50}, which indicates
the transactions in which the itemset occurs in the original dataset.
To count the support of a candidate k-itemset, the algorithm intersects
the TID lists of two of its (k-1)-subsets. For example, as shown in Fig. 6.1.1,
to count the support of the itemset {bread, chocolate}, it intersects the TID
lists of {bread} and {chocolate}, resulting in {20}. The support is therefore 1.
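The vertical layout and the intersection step can be sketched in a few lines of Python (the data is that of Table 6.1; variable names are our own):

from collections import defaultdict

# Horizontal layout of Table 6.1: Tid -> transaction.
horizontal = {
    10: {"bread", "milk"},
    20: {"bread", "chocolate", "cookie"},
    30: {"chocolate", "cookie"},
    40: {"milk"},
    50: {"bread", "cookie"},
}

# One scan builds the vertical layout: item -> TID list (here a set).
tidlists = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        tidlists[item].add(tid)

# support({bread, chocolate}) via TID-list intersection:
tids = tidlists["bread"] & tidlists["chocolate"]
print(sorted(tids), len(tids))   # [20] 1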
Algorithm 1: Apriori—Frequent Itemset Mining [4]
Input: A transaction database D, a user-specified threshold min_sup
Output: Frequent itemsets F
1  C_1 = {1-itemsets};
2  k = 1;
3  while C_k ≠ ∅ do
4      // Test candidate itemsets
5      for transaction T ∈ D do
6          for candidate itemset X ∈ C_k do
7              if X ⊆ T then X.support++;
8          end
9      end
10     F_k = F_k ∪ {X}, where X.support ≥ min_sup;
11     // Generate candidate itemsets
12     for all {i_1, . . . , i_{k-1}, i_k}, {i_1, . . . , i_{k-1}, i'_k} ∈ F_k such that i_k < i'_k do
13         c = {i_1, . . . , i_{k-1}, i_k, i'_k};
14         if all k-subsets of c are frequent then C_{k+1} = C_{k+1} ∪ {c};
15     end
16     k++;
17 end