CHAPTER 6
Frequent Pattern Mining
Frequent pattern mining is one of the most fundamental research issues in
data mining; it aims to extract useful information from huge volumes of
data. Searching for frequent patterns (i.e., association rules) was originally
motivated by the analysis of historical supermarket transaction data, where
the goal is to discover customer behavior from the items purchased.
Association rules express how frequently items are bought together. For
example, an association rule "beer → diaper (75%)" indicates that 75% of
the customers who bought beer also bought diapers. Such rules can be used
to make predictions and recommendations for customers and to guide store
layout. Stemming from the basic itemset data, rule discovery on more general
and complex data (i.e., sequences, trees, and graphs) has been thoroughly
explored in the past decade. In this chapter, we introduce the basic techniques
of frequent pattern mining on different types of data, i.e., itemsets, sequences,
trees, and graphs.
In the following sections, the most classic algorithms and techniques for
frequent pattern mining will be introduced. Association rule mining will be
presented in Section 6.1. Sequential pattern mining will be introduced in
Section 6.2. Frequent tree and graph mining will be presented in Section 6.3
and Section 6.4, respectively. A chapter summary will be given in Section 6.5.
6.1 Association Rule Mining
Data mining is the task of finding valid, novel, potentially useful, and
ultimately understandable patterns in data [18]. The most fundamental and
important issue in data mining is association rule mining [1], which was first
introduced in the early 1990s.
The purpose of mining association rules is to analyze the co-occurrence
relations between items, which can then be used to make appropriate
recommendations. The issue has attracted a great deal of interest during
the recent surge in data mining research because it is the basis of many
applications, such as customer behavior analysis, stock trend prediction,
and DNA sequence analysis. For example, an association rule "bread →
milk (90%)" indicates that nine out of ten customers who bought bread also
bought milk. These rules can be useful for store layout, stock prediction,
DNA structure analysis, and so forth.
Table 6.1: A database
Tid Transaction
10 bread, milk
20 bread, chocolate, cookie
30 chocolate, cookie
40 milk
50 bread, cookie
6.1.1 Association Rule Mining Problem
The problem of association rule discovery can be stated as follows [1]: Let
I = {i_1, i_2, . . . , i_k} be a set of items. A subset of I is called an itemset. An
itemset t_j is denoted as {x_1, x_2, . . . , x_m}, where x_k is an item, i.e., x_k ∈ I
for 1 ≤ k ≤ m. The number of items in an itemset is called the length of the
itemset. An itemset with length l is called an l-itemset. An itemset t_a =
{a_1, a_2, . . . , a_n} is contained in another itemset t_b = {b_1, b_2, . . . , b_m}
if there exist integers 1 ≤ i_1 < i_2 < . . . < i_n ≤ m such that a_1 = b_{i_1},
a_2 = b_{i_2}, . . . , a_n = b_{i_n}. We denote t_a a subset of t_b, and t_b a
superset of t_a.
The support of an itemset X, denoted as support(X), is the number of
transactions in which it occurs as a subset. A subset of length k of an itemset
is called a k-subset. An itemset is frequent if its support is greater than a
user-specified minimum support (min_sup) value. The set of frequent
k-itemsets is denoted by F_k.
An association rule is an expression A → B, where A and B are itemsets.
The support of the rule is given as support(A → B) = support(AB), and the
confidence of the rule is given as conf(A → B) = support(AB)/support(A)
(i.e., the conditional probability that a transaction contains B, given that it
contains A). A rule is confident if its confidence is greater than a user-specified
minimum confidence (min_conf).
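To make these definitions concrete, the following Python sketch computes the support and confidence of a rule over the transactions of Table 6.1 (the helper functions support and confidence are our own illustrative names, not part of the original formulation):

# Transactions of Table 6.1, keyed by Tid.
D = {
    10: {"bread", "milk"},
    20: {"bread", "chocolate", "cookie"},
    30: {"chocolate", "cookie"},
    40: {"milk"},
    50: {"bread", "cookie"},
}

def support(itemset, db):
    # Number of transactions containing the itemset as a subset.
    return sum(1 for t in db.values() if itemset <= t)

def confidence(antecedent, consequent, db):
    # Conditional probability of the consequent given the antecedent.
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"chocolate", "cookie"}, D))       # 2
print(confidence({"chocolate"}, {"cookie"}, D))  # 1.0, i.e., 100%

The printed values match the row "chocolate → cookie" in Table 6.3 below.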
The association rule mining task is to generate all rules whose supports
are greater than min_sup and whose confidences are greater than min_conf.
The issue can be tackled by a two-stage strategy [2]:
• Find all frequent itemsets. This stage is the most time-consuming part:
given k items, there are potentially 2^k frequent itemsets. Therefore,
almost all work so far has focused on devising efficient algorithms that
discover the frequent itemsets while avoiding traversal of unnecessary
parts of the search space. In this chapter, we mainly introduce the basic
algorithms for finding frequent itemsets.
• Generate confident rules. This stage is relatively straightforward and
can be completed easily.
Almost all the association rule mining algorithms apply the two-stage
rule discovery approach. We will discuss it in more detail in the next few
sections.
Example 1. Let our example database be the database D shown in Table 6.1,
with min_sup = 1 and min_conf = 30%. Table 6.2 shows all frequent itemsets
in D. Table 6.3 illustrates all the association rules. For the sake of simplicity
and without loss of generality, we assume that items in transactions and
itemsets are kept sorted in lexicographic order unless stated otherwise.
Table 6.2: Frequent itemsets

Frequent Itemset         | Transactions | Support
bread                    | 10, 20, 50   | 3
milk                     | 10, 40       | 2
chocolate                | 20, 30       | 2
cookie                   | 20, 30, 50   | 3
bread, milk              | 10           | 1
bread, chocolate         | 20           | 1
bread, cookie            | 20, 50       | 2
chocolate, cookie        | 20, 30       | 2
bread, chocolate, cookie | 20           | 1
Table 6.3: Association rules

Association Rule            | Support | Confidence
bread → cookie              | 2       | 67%
milk → bread                | 1       | 50%
chocolate → bread           | 1       | 50%
chocolate → cookie          | 2       | 100%
cookie → bread              | 2       | 67%
cookie → chocolate          | 2       | 67%
bread, chocolate → cookie   | 1       | 100%
bread, cookie → chocolate   | 1       | 50%
chocolate, cookie → bread   | 1       | 50%
chocolate → bread, cookie   | 1       | 50%
6.1.2 Basic Algorithms for Association Rule Mining
6.1.2.1 Apriori
The first algorithm to address the association rule mining issue was
introduced by Agrawal et al. [1]. The same authors introduced another
algorithm, named Apriori, in a later paper [4], exploiting the monotonicity
property of association rules to improve performance. Mannila et al. [39]
presented independent work with a similar idea. Apriori applies a two-stage
approach to discover frequent itemsets and confident association rules.
Frequent Itemset Discovery. To find all frequent itemsets, Apriori
introduces a candidate generation-and-test strategy. The basic idea is
that it first generates the candidate k-itemsets (k is 1 at the beginning
and is incremented by 1 in each cycle), and these candidates are then
evaluated as to whether they are frequent or not.
Specifically, the algorithm first scans the dataset and finds the frequent
1-itemsets. To discover the frequent 2-itemsets, Apriori generates candidate
2-itemsets by joining 1-itemsets. These candidates are evaluated by scanning
the original dataset again. In the same way, all frequent (k+1)-itemsets can
be found based on the already known frequent k-itemsets.
To improve performance by avoiding the generation of too many
unnecessary candidates, Apriori introduced a monotonicity property: a
(k+1)-itemset becomes a candidate only if all its k-subsets are frequent.
As demonstrated by the authors [4] and many later works, this simple yet
efficient strategy largely reduces the number of candidates to be evaluated.
The frequent itemset mining of the Apriori algorithm is presented in
Algorithm 1. The algorithm is executed in a breadth-first search manner.
To generate the candidate itemsets of length k+1, two k-itemsets with the
same (k-1)-prefix are joined together (lines 12–13). The joined itemset is
inserted into C_{k+1} only if all its k-subsets are frequent (line 14).
To test the candidate k-itemsets (i.e., count their supports), the database
is scanned sequentially and each candidate itemset is tested for inclusion
in the scanned transaction; in this way the corresponding supports are
accumulated (lines 5–9). Finally, the frequent itemsets are collected
(line 10).
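The following Python sketch mirrors the structure of Algorithm 1: a breadth-first generate-and-test loop with monotonicity pruning. It is a minimal illustration with names of our own choosing, not a reference implementation; for brevity, the prefix join of lines 12–13 is replaced by joining any two frequent k-itemsets that differ in a single item, which yields the same candidate set once the subset-pruning test of line 14 is applied.

from itertools import combinations

def apriori(db, min_sup):
    # C1: every individual item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in db for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Test step (lines 5-10 of Algorithm 1): one database scan
        # counts the support of every candidate.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        F_k = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(F_k)
        # Generate step (lines 12-14): join frequent k-itemsets that
        # differ in one item, then prune any candidate that has an
        # infrequent k-subset (the monotonicity property).
        candidates = set()
        for a in F_k:
            for b in F_k:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(s) in F_k for s in combinations(u, k)):
                    candidates.add(u)
        k += 1
    return frequent

D = [{"bread", "milk"}, {"bread", "chocolate", "cookie"},
     {"chocolate", "cookie"}, {"milk"}, {"bread", "cookie"}]
print(apriori(D, min_sup=2))

Run on the database of Table 6.1 with min_sup = 2, this returns exactly the itemsets of Table 6.2 whose support is at least 2.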
Association Rule Mining. After discovering the frequent itemsets, we
can find the frequent and confident association rules straightforwardly.
The approach is similar to the frequent itemset mining algorithm.
Because the cost of finding frequent itemsets is high and accounts for
most of the overall cost of discovering association rules, almost all
research so far has focused on the frequent itemset generation step.
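As a minimal sketch of this second stage, assuming a dictionary mapping each frequent itemset to its support (such as the one produced by the apriori sketch above): every non-empty proper subset A of a frequent itemset is tried as an antecedent, and the rule A → (itemset − A) is kept if its confidence reaches min_conf.

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: dict mapping frozenset(itemset) -> support.
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as the antecedent A;
        # by monotonicity, support(A) is already in `frequent`.
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[A]
                if conf >= min_conf:
                    yield A, itemset - A, sup, conf

Applied to the output of apriori on Table 6.1 with min_sup = 1 and min_conf = 0.3, this enumerates rules such as those listed in Table 6.3.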
6.1.2.2 Eclat
Many algorithms have been proposed based on the Apriori idea. Among
them, Eclat [64, 61] is distinct in that it was the first proposed to generate
all frequent itemsets in a depth-first manner, while employing the vertical
database layout and using an intersection-based approach to compute the
support of an itemset.
Figure 6.1.1 illustrates the key idea of Eclat for candidate support
counting. During the first scan of the dataset, it converts the original format
(i.e., Table 6.1) into the vertical TID-list format, as shown in Fig. 6.1.1. For
example, the TID list of the itemset {bread} is {10, 20, 50}, which indicates
the transactions in which the itemset occurs in the original dataset.
To count the support of a candidate k-itemset, the algorithm intersects
the TID lists of two of its (k-1)-subsets. For example, as shown in Fig. 6.1.1,
to count the support of the itemset {bread, chocolate}, it intersects the TID
lists of {bread} and {chocolate}, resulting in {20}. The support is therefore 1.
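The vertical layout and the intersection step can be sketched in a few lines of Python (the data is that of Table 6.1; variable names are our own):

from collections import defaultdict

# Horizontal layout of Table 6.1: Tid -> transaction.
horizontal = {
    10: {"bread", "milk"},
    20: {"bread", "chocolate", "cookie"},
    30: {"chocolate", "cookie"},
    40: {"milk"},
    50: {"bread", "cookie"},
}

# One scan builds the vertical layout: item -> TID list (here a set).
tidlists = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        tidlists[item].add(tid)

# support({bread, chocolate}) via TID-list intersection:
tids = tidlists["bread"] & tidlists["chocolate"]
print(sorted(tids), len(tids))   # [20] 1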
Algorithm 1: Apriori—Frequent Itemset Mining [4]
Input: A transaction database D, a user-specified threshold min_sup
Output: Frequent itemsets F
1  C_1 = {1-itemsets};
2  k = 1;
3  while C_k ≠ ∅ do
4      // Test candidate itemsets
5      for transaction T ∈ D do
6          for candidate itemset X ∈ C_k do
7              if X ⊆ T then X.support++;
8          end
9      end
10     F_k = F_k ∪ {X}, where X.support ≥ min_sup;
11     // Generate candidate itemsets
12     for all {i_1, . . . , i_{k-1}, i_k}, {i_1, . . . , i_{k-1}, i'_k} ∈ F_k such that i_k < i'_k do
13         c = {i_1, . . . , i_{k-1}, i_k, i'_k};
14         if all k-subsets of c are frequent then C_{k+1} = C_{k+1} ∪ {c};
15     end
16     k++;
17 end