threshold. Suppose the minimal support is 70%; in this case the minimal
support count is 2, and the resulting large 1-itemsets are listed in Fig. 6.2.2.
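The litemset step can be sketched as follows. This is a minimal illustration on the running example's transformed customer sequences (the function and variable names are ours, not the book's); an item is "large" when its support count, i.e., the number of customer sequences containing it, reaches the minimal support count of 2.

```python
# The running example's transformed customer sequences (Fig. 6.2.3);
# each inner set is one transaction's itemset.
sequences = [
    [{"a"}, {"c"}, {"b", "c"}, {"d"}, {"a", "b", "c"}, {"a"}, {"d"}],  # cust. 10
    [{"b"}, {"c", "d"}, {"a"}, {"c"}, {"b", "d"}],                     # cust. 20
    [{"d"}, {"b", "c"}, {"a", "c"}, {"c", "d"}],                       # cust. 30
]

def large_one_itemsets(sequences, min_count=2):
    """Items contained in at least min_count customer sequences.
    A sequence supports an item once, no matter how often it recurs."""
    items = {i for seq in sequences for txn in seq for i in txn}
    return {
        i for i in items
        if sum(any(i in txn for txn in seq) for seq in sequences) >= min_count
    }
```

Here every item appears in all three sequences, so all four survive the threshold.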
Transformation Step: We map the large itemsets into a series of integers
and the original database is converted by replacing the itemsets.
For example, with the help of the mapping table in Fig. 6.2.2, the
transformed database is obtained, as shown in Fig. 6.2.3.
Sequence Step: The transformed database is scanned and mined to
find all the frequent patterns.
Maximal Step: We remove those patterns which are contained in other
sequential patterns. In other words, only maximal sequential patterns
remain.
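The maximal step can be sketched as follows; this is our own minimal illustration (the helper names are not from the book), where `is_subsequence` is the usual containment test between sequences of itemsets.

```python
def is_subsequence(s, t):
    """True if sequence s (a list of itemsets) is contained in sequence t:
    each element of s must be a subset of some later-and-later element of t."""
    pos = 0
    for element in s:
        # advance greedily to the first remaining element of t covering this one
        while pos < len(t) and not element <= t[pos]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True

def maximal_step(patterns):
    """Keep only the patterns not contained in any other pattern."""
    return [
        p for p in patterns
        if not any(p is not q and is_subsequence(p, q) for q in patterns)
    ]
```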
Large Itemsets   Mapped To
apple            a
banana           b
strawberry       c
pear             d

Figure 6.2.2: Large Itemsets

Customer ID   Customer Sequence
10            <ac(bc)d(abc)ad>
20            <b(cd)ac(bd)>
30            <d(bc)(ac)(cd)>

Figure 6.2.3: Transformed Database
Figure 6.2.1: Database Sorted by Customer ID and Transaction Time

CID   Transaction Time   Items
10    Sep. 5, 2011       bread
10    Sep. 9, 2011       cookie
10    Sep. 10, 2011      banana, cookie
10    Sep. 12, 2011      chocolate
10    Sep. 20, 2011      bread, milk, cookie
10    Sep. 23, 2011      bread
10    Sep. 26, 2011      chocolate
20    Sep. 7, 2011       milk
20    Sep. 11, 2011      cookie, chocolate
20    Sep. 13, 2011      bread
20    Sep. 16, 2011      cookie
20    Sep. 22, 2011      milk, chocolate
30    Sep. 6, 2011       chocolate
30    Sep. 9, 2011       milk, cookie
30    Sep. 11, 2011      bread, cookie
30    Sep. 15, 2011      cookie, chocolate
Frequent Pattern Mining 127
128 Applied Data Mining
Table 6.1: AprioriAll Candidate Generation L4 to C5 [5]

Large 4-sequences   Candidate 5-sequences
<b(ac)d>            <(bc)(ac)d>
<bcad>              <d(bc)ad>
<bdad>              <d(bc)da>
<bdcd>              <d(bc)(ad)>
<(bc)ad>
<(bc)(ac)>
<(bc)cd>
<c(ac)d>
<d(ac)d>
<dbad>
<d(bc)a>
<d(bc)d>
<dcad>
Among all these steps, the sequence step is the most time-consuming
one, and therefore researchers have focused on this step. AprioriAll [5] was first
proposed based on the Apriori algorithm in association rule mining [3]. Two
phases are utilized to mine sequential patterns, i.e., candidate generation
and test.
The phase for generating candidates is similar to AprioriGen in
[3]. The Apriori property is applied to prune those candidate sequences
which have an infrequent subsequence. The difference is that when the
candidates are generated by joining the frequent patterns of the previous pass,
different orders of combination make different candidates. For example, from
the items a and b, three candidates <ab>, <ba> and <(ab)> can be generated,
but in association rule mining only (ab) is generated. The reason is that in
association rule mining, the time order is not taken into account. Obviously,
the number of candidate sequences in sequential pattern mining is much
larger than the number of candidate itemsets in association rule mining
during the generation of candidate sequences. Table 6.1 shows how to
generate candidate 5-sequences by joining large 4-sequences. By scanning
the large 4-sequences, the algorithm finds that the sequences <(bc)ad> and
<(bc)(ac)> share their first three items; according to the join condition
of Apriori, they are joined to produce the candidate sequence <(bc)(ac)d>.
Similarly, the other candidate 5-sequences are generated.
The second phase, i.e., the test phase, is simple and straightforward. The
database is scanned to count the supports of the candidate sequences. As
a result, the frequent sequential patterns can be found.
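The order-sensitive part of candidate generation, i.e., producing <ab>, <ba> and <(ab)> from a frequent pair of items, can be sketched as follows (a minimal illustration with names of our own, restricted to 2-sequence candidates):

```python
from itertools import combinations

def candidate_2_sequences(frequent_items):
    """Each frequent pair {x, y} yields three sequence candidates,
    unlike association rule mining, which would yield only the itemset (xy).
    A sequence is represented as a list of itemsets."""
    candidates = []
    for x, y in combinations(sorted(frequent_items), 2):
        candidates.append([{x}, {y}])   # <xy>: x before y
        candidates.append([{y}, {x}])   # <yx>: y before x
        candidates.append([{x, y}])     # <(xy)>: same transaction
    return candidates
```

This is why the candidate space of sequential pattern mining grows so much faster than that of association rule mining: every unordered pair spawns three ordered possibilities.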
Due to the efficiency and simplicity of the AprioriAll algorithm, which was
the first algorithm for mining sequential patterns, its core idea
is applied by many other algorithms. The problems of AprioriAll are that
many candidates are generated and that the multiple passes over the database
are very time-consuming.
6.2.2.1.1 GSP
Srikant and Agrawal generalized the definition of the sequential pattern mining
problem in [47] by incorporating some new properties, i.e., time constraints,
transaction relaxation, and taxonomy. For the time constraints, the maximum
gap and the minimal gap are defined to specify the gap between any two
adjacent transactions in the sequence. When testing a candidate, if any gap
of the candidate falls out of the range between the minimal gap and the
maximum gap, then the candidate is not a pattern. Furthermore, the authors
relaxed the definition of transaction by using a sliding window: when
the time range between two items is smaller than the sliding window, these
two items are considered to be in the same transaction. The taxonomy is
used to generate multiple-level sequential patterns.
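The gap constraints can be sketched as below. This is a simplified version of our own making, treating each element of an occurrence as a single time point, whereas [47] uses per-element start and end times:

```python
def gaps_ok(times, min_gap, max_gap):
    """times: transaction times of a candidate occurrence's consecutive
    elements.  Every gap between adjacent elements must be more than
    min_gap and at most max_gap, otherwise the occurrence is rejected."""
    return all(
        min_gap < later - earlier <= max_gap
        for earlier, later in zip(times, times[1:])
    )
```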
In [47], the authors proposed a new algorithm, named GSP, to
efficiently find the generalized sequential patterns. Similar to the AprioriAll
algorithm, there are two phases in GSP, i.e., candidate generation and
test.
In the candidate generation process, the candidate k-sequences are
generated based on the frequent (k-1)-sequences. Given a sequence
s = <s_1, s_2, . . . , s_n> and a subsequence c, c is a contiguous subsequence of s if any of
the following conditions holds: (1) c is derived from s by dropping an item
from either s_1 or s_n; (2) c is derived from s by dropping an item from an
element s_j that has at least 2 items; and (3) c is a contiguous subsequence
of ĉ, and ĉ is a contiguous subsequence of s. Specifically, the candidates are
generated in two phases:
Join Phase: Candidate k-sequences are generated by joining two (k-1)-
sequences that have the same contiguous subsequences. When we join
the two sequences, the item can be inserted as a part of an element or
as a separate element. For example, because <d(bc)a> and <d(bc)d> have
the same contiguous subsequence <d(bc)>, the candidate
5-sequences <d(bc)(ad)>, <d(bc)ad> and <d(bc)da> can be generated.
Prune Phase: The algorithm removes the candidate sequences which
have a contiguous subsequence whose support count is less than the
minimal support. Moreover, it uses a hash-tree structure [41] to reduce
the number of candidates.
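The join example above can be sketched as follows. Note this is a deliberately simplified case of our own (two (k-1)-sequences whose last elements are single items and whose prefixes coincide, as in <d(bc)a> and <d(bc)d>); the actual GSP join condition, which matches one sequence with its first item dropped against the other with its last item dropped, is more general.

```python
def join_on_shared_prefix(s1, s2):
    """Combine the last items x, y of two sequences sharing a prefix:
    as separate elements in either order, or merged into one element."""
    # precondition of this simplified sketch
    assert s1[:-1] == s2[:-1] and len(s1[-1]) == len(s2[-1]) == 1
    prefix = [set(e) for e in s1[:-1]]
    (x,), (y,) = tuple(s1[-1]), tuple(s2[-1])
    return [
        prefix + [{x}, {y}],   # <...xy>
        prefix + [{y}, {x}],   # <...yx>
        prefix + [{x, y}],     # <...(xy)>
    ]
```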
The process for generating candidates in the example database is shown
in Table 6.2. For GSP, the difficulty is that the support of candidate sequences is
not easy to count due to the introduced generalization rules, while this is not
a problem for AprioriAll. GSP devises an efficient strategy which includes
two phases, i.e., forward and backward phases (which are repeated until all
the elements are found): (1) Forward Phase: It looks for successive elements
of s in d, as long as the difference between the end-time of the element and
the start-time of the previous element is less than the maximum gap. If the
difference is greater than the maximum gap, it switches to the backward
phase. If an element is not found, then s is not contained in d; (2) Backward
Phase: It tries to pull up the previous element. Suppose s_i is the current
element and end-time(s_i) = t. It checks whether there are some transactions
containing s_{i-1} whose transaction-times are larger than t minus the
maximum gap. Since after pulling up s_{i-1}, the difference between s_{i-1} and
s_{i-2} may not satisfy the gap constraints, the backward phase keeps pulling back
until the difference of s_{i-1} and s_{i-2} satisfies the maximum gap or the first
element has been pulled up. Then the algorithm switches to the forward phase. If all
the elements cannot be pulled up, then d does not contain s.
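A brute-force equivalent of this containment test can be sketched as follows (our own simplification: single transaction times, a maximum gap only, and an exhaustive search in place of GSP's forward/backward alternation, which exists precisely to avoid this kind of search):

```python
def contains(d, s, max_gap):
    """Does data-sequence d (a time-sorted list of (time, itemset) pairs)
    contain candidate s (a list of itemsets) such that every gap between
    the chosen transactions is at most max_gap?"""
    def search(i, prev_time):
        if i == len(s):
            return True          # all elements of s were placed
        for time, items in d:
            ok_time = prev_time is None or (
                prev_time < time <= prev_time + max_gap
            )
            if ok_time and s[i] <= items:
                if search(i + 1, time):
                    return True
        return False
    return search(0, None)
```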
Table 6.2: GSP Candidate Generation L4 to C5 [47]

Large 4-sequences   Candidate 5-sequences   Candidate 5-sequences
                    after joining           after pruning
<b(ac)d>            <(bc)(ac)d>             <(bc)(ac)d>
<bcad>              <d(bc)ad>               <d(bc)ad>
<bdad>              <d(bc)da>
<bdcd>              <d(bc)(ad)>
<(bc)ad>
<(bc)(ac)>
<(bc)cd>
<c(ac)d>
<d(ac)d>
<dbad>
<d(bc)a>
<d(bc)d>
<dcad>
For the generalized rules, the authors [47] introduced taxonomy knowledge
into the mining process. The taxonomies are incorporated by extending
sequences with the corresponding taxonomies; the original sequences are
therefore replaced by their extended versions. As a result, the number
of rules becomes larger because the sequences become more dense, and
redundant rules are produced. To avoid uninteresting rules, the ancestors
are first precomputed for each item, and those that are not in the candidates are
removed. Moreover, the algorithm does not count the sequential patterns
that contain both an item and its ancestor. In summary, the generalized
sequential patterns take more attributes into account and thus can be
applied to real applications easily.
6.2.2.1.2 SPADE
Zaki introduced another efficient algorithm, i.e., SPADE [62], to find
frequent sequences using efficient lattice search techniques and simple
joins. To discover all the patterns, SPADE needs to scan the database three
times. It divides the mining problem into smaller ones to conquer, and at
the same time makes it possible for all the necessary data to be located in
memory. The core idea of SPADE is devised based on that of Eclat [64], one
of the efficient algorithms for association rule mining. From the extensive
experimental evaluation [62], we can see that SPADE is very efficient in
finding sequential patterns.
The mining process of SPADE can be illustrated through a concrete
example. Firstly, the sequential database is transformed into a vertical
format, i.e., an id-list database, in which each id is associated with its
corresponding customer sequence and transaction. The vertical version of
the original database (as shown in Fig. 6.2.1) is illustrated in Fig. 6.2.4. For
example, we know that the id-list of item a is (10, 1), (10, 5), (10, 6), (20,
3), and (30, 3), where each pair (SID, TID) indicates the specific sequence
and transaction in which a is located. By scanning the vertical database, frequent
1-sequences can be easily obtained. To find the frequent 2-sequences,
the original database is scanned again and a new vertical-to-horizontal
database is constructed by grouping the items by SID in increasing
order of TID, which is shown in Fig. 6.2.5. By scanning this database, 2-length
patterns can be discovered. A lattice is constructed based on these 2-length
patterns, and the lattice can be further decomposed into different classes,
where the patterns that have the same prefix belong to the same class.
Such a decomposition makes it possible for the partitions to be small
enough to be loaded into memory. SPADE then applies temporal joins
to find all other longer patterns by enumerating the lattice [62].
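The vertical transformation and a temporal join can be sketched on the running example as follows (the function names are ours; TIDs number a customer's transactions from 1, matching Fig. 6.2.4):

```python
from collections import defaultdict

def to_vertical(sequences):
    """SPADE-style id-lists: item -> list of (SID, TID) pairs."""
    idlists = defaultdict(list)
    for sid, seq in sequences.items():
        for tid, itemset in enumerate(seq, start=1):
            for item in itemset:
                idlists[item].append((sid, tid))
    return idlists

def temporal_join(idlist_x, idlist_y):
    """id-list of the 2-sequence <xy>: occurrences of y preceded by an
    occurrence of x in the same sequence (duplicates removed)."""
    return sorted({
        (sy, ty)
        for sx, tx in idlist_x
        for sy, ty in idlist_y
        if sx == sy and tx < ty
    })

# The running example's transformed database (Fig. 6.2.3).
db = {
    10: [{"a"}, {"c"}, {"b", "c"}, {"d"}, {"a", "b", "c"}, {"a"}, {"d"}],
    20: [{"b"}, {"c", "d"}, {"a"}, {"c"}, {"b", "d"}],
    30: [{"d"}, {"b", "c"}, {"a", "c"}, {"c", "d"}],
}
```

For instance, joining the id-lists of a and b this way gives the id-list of the 2-sequence <ab>, whose support is the number of distinct SIDs in the result.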
a            b            c            d
SID  TID     SID  TID     SID  TID     SID  TID
10   1       10   3       10   2       10   4
10   5       10   5       10   3       10   7
10   6       20   1       10   5       20   2
20   3       20   5       20   2       20   5
30   3       30   2       20   4       30   1
                          30   2       30   4
                          30   3
                          30   4

Figure 6.2.4: Vertical id-List