threshold. Suppose the minimal support is 70%; in this case the minimal
support count is 2, and the resulting large 1-itemsets are listed in Fig. 6.2.2.
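The litemset step can be sketched as follows. This is a minimal illustration on the running example's transformed customer sequences (the function and variable names are ours, not the book's); an item is "large" when its support count, i.e., the number of customer sequences containing it, reaches the minimal support count of 2.

```python
# The running example's transformed customer sequences (Fig. 6.2.3);
# each inner set is one transaction's itemset.
sequences = [
    [{"a"}, {"c"}, {"b", "c"}, {"d"}, {"a", "b", "c"}, {"a"}, {"d"}],  # cust. 10
    [{"b"}, {"c", "d"}, {"a"}, {"c"}, {"b", "d"}],                     # cust. 20
    [{"d"}, {"b", "c"}, {"a", "c"}, {"c", "d"}],                       # cust. 30
]

def large_one_itemsets(sequences, min_count=2):
    """Items contained in at least min_count customer sequences.
    A sequence supports an item once, no matter how often it recurs."""
    items = {i for seq in sequences for txn in seq for i in txn}
    return {
        i for i in items
        if sum(any(i in txn for txn in seq) for seq in sequences) >= min_count
    }
```

Here every item appears in all three sequences, so all four survive the threshold.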
Transformation Step: We map the large itemsets into a series of integers
and the original database is converted by replacing the itemsets.
For example, with the help of the mapping table in Fig. 6.2.2, the
transformed database is obtained, as shown in Fig. 6.2.3.
Sequence Step: The transformed database is scanned and mined to
find all the frequent patterns.
Maximal Step: We remove those patterns which are contained in other
sequential patterns. In other words, only maximal sequential patterns
remain.
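The maximal step can be sketched as follows; this is our own minimal illustration (the helper names are not from the book), where `is_subsequence` is the usual containment test between sequences of itemsets.

```python
def is_subsequence(s, t):
    """True if sequence s (a list of itemsets) is contained in sequence t:
    each element of s must be a subset of some later-and-later element of t."""
    pos = 0
    for element in s:
        # advance greedily to the first remaining element of t covering this one
        while pos < len(t) and not element <= t[pos]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True

def maximal_step(patterns):
    """Keep only the patterns not contained in any other pattern."""
    return [
        p for p in patterns
        if not any(p is not q and is_subsequence(p, q) for q in patterns)
    ]
```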
Large Itemsets   Mapped To
apple            a
banana           b
strawberry       c
pear             d

Figure 6.2.2: Large Itemsets

Customer ID   Customer Sequence
10            <ac(bc)d(abc)ad>
20            <b(cd)ac(bd)>
30            <d(bc)(ac)(cd)>

Figure 6.2.3: Transformed Database
Figure 6.2.1: Database Sorted by Customer ID and Transaction Time

CID   Transaction Time   Items
10    Sep. 5, 2011       bread
10    Sep. 9, 2011       cookie
10    Sep. 10, 2011      banana, cookie
10    Sep. 12, 2011      chocolate
10    Sep. 20, 2011      bread, milk, cookie
10    Sep. 23, 2011      bread
10    Sep. 26, 2011      chocolate
20    Sep. 7, 2011       milk
20    Sep. 11, 2011      cookie, chocolate
20    Sep. 13, 2011      bread
20    Sep. 16, 2011      cookie
20    Sep. 22, 2011      milk, chocolate
30    Sep. 6, 2011       chocolate
30    Sep. 9, 2011       milk, cookie
30    Sep. 11, 2011      bread, cookie
30    Sep. 15, 2011      cookie, chocolate
Frequent Pattern Mining 127
128 Applied Data Mining
Table 6.1: AprioriAll Candidate Generation L4 to C5 [5]

Large 4-sequences   Candidate 5-sequences
<b(ac)d>            <(bc)(ac)d>
<bcad>              <d(bc)ad>
<bdad>              <d(bc)da>
<bdcd>              <d(bc)(ad)>
<(bc)ad>
<(bc)(ac)>
<(bc)cd>
<c(ac)d>
<d(ac)d>
<dbad>
<d(bc)a>
<d(bc)d>
<dcad>
Among all these steps, the sequence step is the most time-consuming
one, and therefore researchers have focused on this step. AprioriAll [5] was first
proposed based on the Apriori algorithm in association rule mining [3]. Two
phases are utilized to mine sequential patterns, i.e., candidate generation
and test.
The phase for generating candidates is similar to AprioriGen in
[3]. The Apriori property is applied to prune those candidate sequences
which have an infrequent subsequence. The difference is that when the
candidates are generated by joining the frequent patterns of the previous pass,
different orders of combination make different candidates. For example, from
the items a and b, three candidates <ab>, <ba> and <(ab)> can be generated,
but in association rule mining only (ab) is generated. The reason is that in
association rule mining, the time order is not taken into account. Obviously,
the number of candidate sequences in sequential pattern mining is much
larger than the number of candidate itemsets in association rule mining
during the generation of candidate sequences. Table 6.1 shows how to
generate candidate 5-sequences by joining large 4-sequences. By scanning
the large 4-sequences, the algorithm finds that the sequences <(bc)ad> and
<(bc)(ac)> share their first three items; according to the join condition
of Apriori, they are joined to produce the candidate sequence <(bc)(ac)d>.
Similarly, the other candidate 5-sequences are generated.
The second phase, i.e., the test phase, is simple and straightforward. The
database is scanned to count the supports of the candidate sequences. As
a result, the frequent sequential patterns can be found.
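The order-sensitive part of candidate generation, i.e., producing <ab>, <ba> and <(ab)> from a frequent pair of items, can be sketched as follows (a minimal illustration with names of our own, restricted to 2-sequence candidates):

```python
from itertools import combinations

def candidate_2_sequences(frequent_items):
    """Each frequent pair {x, y} yields three sequence candidates,
    unlike association rule mining, which would yield only the itemset (xy).
    A sequence is represented as a list of itemsets."""
    candidates = []
    for x, y in combinations(sorted(frequent_items), 2):
        candidates.append([{x}, {y}])   # <xy>: x before y
        candidates.append([{y}, {x}])   # <yx>: y before x
        candidates.append([{x, y}])     # <(xy)>: same transaction
    return candidates
```

This is why the candidate space of sequential pattern mining grows so much faster than that of association rule mining: every unordered pair spawns three ordered possibilities.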
Due to the efficiency and simplicity of the AprioriAll algorithm, which was
the first algorithm for mining sequential patterns, its core idea
is applied by many other algorithms. The problems of AprioriAll are that
many candidates are generated and that the multiple passes over the database
are very time-consuming.
6.2.2.1.1 GSP
Srikant and Agrawal generalized the definition of the sequential pattern mining
problem in [47] by incorporating some new properties, i.e., time constraints,
transaction relaxation, and taxonomy. For the time constraints, the maximum
gap and the minimal gap are defined to specify the gap between any two
adjacent transactions in the sequence. When testing a candidate, if any gap
of the candidate falls out of the range between the minimal gap and the
maximum gap, then the candidate is not a pattern. Furthermore, the authors
relaxed the definition of transaction by using a sliding window: when
the time range between two items is smaller than the sliding window, these
two items are considered to be in the same transaction. The taxonomy is
used to generate multiple-level sequential patterns.
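The gap constraints can be sketched as below. This is a simplified version of our own making, treating each element of an occurrence as a single time point, whereas [47] uses per-element start and end times:

```python
def gaps_ok(times, min_gap, max_gap):
    """times: transaction times of a candidate occurrence's consecutive
    elements.  Every gap between adjacent elements must be more than
    min_gap and at most max_gap, otherwise the occurrence is rejected."""
    return all(
        min_gap < later - earlier <= max_gap
        for earlier, later in zip(times, times[1:])
    )
```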
In [47], the authors proposed a new algorithm, named GSP, to
efficiently find the generalized sequential patterns. Similar to the AprioriAll
algorithm, there are two phases in GSP, i.e., candidate generation and
test.
In the candidate generation process, the candidate k-sequences are
generated based on the frequent (k-1)-sequences. Given a sequence
s = <s_1, s_2, . . . , s_n> and a subsequence c, c is a contiguous subsequence of s if any of
the following conditions holds: (1) c is derived from s by dropping an item
from either s_1 or s_n; (2) c is derived from s by dropping an item from an
element s_j that has at least 2 items; and (3) c is a contiguous subsequence
of ĉ, and ĉ is a contiguous subsequence of s. Specifically, the candidates are
generated in two phases:
Join Phase: Candidate k-sequences are generated by joining two (k-1)-
sequences that have the same contiguous subsequences. When we join
the two sequences, the item can be inserted as a part of an element or
as a separate element. For example, because <d(bc)a> and <d(bc)d> have
the same contiguous subsequence <d(bc)>, the candidate
5-sequences <d(bc)(ad)>, <d(bc)ad> and <d(bc)da> can be generated.
Prune Phase: The algorithm removes the candidate sequences which
have a contiguous subsequence whose support count is less than the
minimal support. Moreover, it uses a hash-tree structure [41] to reduce
the number of candidates.
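The join example above can be sketched as follows. Note this is a deliberately simplified case of our own (two (k-1)-sequences whose last elements are single items and whose prefixes coincide, as in <d(bc)a> and <d(bc)d>); the actual GSP join condition, which matches one sequence with its first item dropped against the other with its last item dropped, is more general.

```python
def join_on_shared_prefix(s1, s2):
    """Combine the last items x, y of two sequences sharing a prefix:
    as separate elements in either order, or merged into one element."""
    # precondition of this simplified sketch
    assert s1[:-1] == s2[:-1] and len(s1[-1]) == len(s2[-1]) == 1
    prefix = [set(e) for e in s1[:-1]]
    (x,), (y,) = tuple(s1[-1]), tuple(s2[-1])
    return [
        prefix + [{x}, {y}],   # <...xy>
        prefix + [{y}, {x}],   # <...yx>
        prefix + [{x, y}],     # <...(xy)>
    ]
```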
The process for generating candidates in the example database is shown
in Table 6.2. For GSP, the difficulty is that the support of candidate sequences is
not easy to count due to the introduced generalization rules, while this is not
a problem for AprioriAll. GSP devises an efficient strategy which includes
two phases, i.e., forward and backward phases (which are repeated until all
the elements are found): (1) Forward Phase: It looks for successive elements
of s in d, as long as the difference between the end-time of the element and
the start-time of the previous element is less than the maximum gap. If the
difference is greater than the maximum gap, it switches to the backward
phase. If an element is not found, then s is not contained in d; (2) Backward
Phase: It tries to pull up the previous element. Suppose s_i is the current
element and end-time(s_i) = t. It checks whether there are some transactions
containing s_{i-1} whose transaction-times are larger than t minus the
maximum gap. Since after pulling up s_{i-1}, the difference between s_{i-1} and
s_{i-2} may not satisfy the gap constraints, the backward phase keeps pulling back
until the difference of s_{i-1} and s_{i-2} satisfies the maximum gap or the first
element has been pulled up. Then the algorithm switches to the forward phase. If all
the elements cannot be pulled up, then d does not contain s.
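A brute-force equivalent of this containment test can be sketched as follows (our own simplification: single transaction times, a maximum gap only, and an exhaustive search in place of GSP's forward/backward alternation, which exists precisely to avoid this kind of search):

```python
def contains(d, s, max_gap):
    """Does data-sequence d (a time-sorted list of (time, itemset) pairs)
    contain candidate s (a list of itemsets) such that every gap between
    the chosen transactions is at most max_gap?"""
    def search(i, prev_time):
        if i == len(s):
            return True          # all elements of s were placed
        for time, items in d:
            ok_time = prev_time is None or (
                prev_time < time <= prev_time + max_gap
            )
            if ok_time and s[i] <= items:
                if search(i + 1, time):
                    return True
        return False
    return search(0, None)
```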
Table 6.2: GSP Candidate Generation L4 to C5 [47]

Large 4-sequences   Candidate 5-sequences   Candidate 5-sequences
                    after joining           after pruning
<b(ac)d>            <(bc)(ac)d>             <(bc)(ac)d>
<bcad>              <d(bc)ad>               <d(bc)ad>
<bdad>              <d(bc)da>
<bdcd>              <d(bc)(ad)>
<(bc)ad>
<(bc)(ac)>
<(bc)cd>
<c(ac)d>
<d(ac)d>
<dbad>
<d(bc)a>
<d(bc)d>
<dcad>
For the generalized rules, the authors [47] introduced taxonomy knowledge
into the mining process. The taxonomies are incorporated by extending
sequences with the corresponding taxonomies; the original sequences are
therefore replaced by their extended versions. As a result, the number
of rules becomes larger because the sequences become more dense, and
redundant rules are produced. To avoid uninteresting rules, the ancestors
are first precomputed for each item, and those that are not in the candidates are
removed. Moreover, the algorithm does not count the sequential patterns
that contain both an item and its ancestor. In summary, the generalized
sequential patterns take more attributes into account and thus can be
applied to real applications easily.
6.2.2.1.2 SPADE
Zaki introduced another efficient algorithm, i.e., SPADE [62], to find
frequent sequences using efficient lattice search techniques and simple
joins. To discover all the patterns, SPADE needs to scan the database three
times. It divides the mining problem into smaller ones to conquer, and at
the same time makes it possible for all the necessary data to be located in
memory. The core idea of SPADE is devised based on that of Eclat [64], one
of the efficient algorithms for association rule mining. From the extensive
experimental evaluation [62], we can see that SPADE is very efficient in
finding sequential patterns.
The mining process of SPADE can be illustrated through a concrete
example. Firstly, the sequential database is transformed into a vertical
format, i.e., an id-list database, in which each id is associated with its
corresponding customer sequence and transaction. The vertical version of
the original database (as shown in Fig. 6.2.1) is illustrated in Fig. 6.2.4. For
example, we know that the id-list of item a is (10, 1), (10, 5), (10, 6), (20,
3), and (30, 3), where each pair (SID, TID) indicates the specific sequence
and transaction in which a is located. By scanning the vertical database, frequent
1-sequences can be easily obtained. To find the frequent 2-sequences,
the original database is scanned again and a new vertical-to-horizontal
database is constructed by grouping the items by SID in increasing
order of TID, which is shown in Fig. 6.2.5. By scanning this database, 2-length
patterns can be discovered. A lattice is constructed based on these 2-length
patterns, and the lattice can be further decomposed into different classes,
where the patterns that have the same prefix belong to the same class.
Such a decomposition makes it possible for the partitions to be small
enough to be loaded into memory. SPADE then applies temporal joins
to find all other longer patterns by enumerating the lattice [62].
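The vertical transformation and a temporal join can be sketched on the running example as follows (the function names are ours; TIDs number a customer's transactions from 1, matching Fig. 6.2.4):

```python
from collections import defaultdict

def to_vertical(sequences):
    """SPADE-style id-lists: item -> list of (SID, TID) pairs."""
    idlists = defaultdict(list)
    for sid, seq in sequences.items():
        for tid, itemset in enumerate(seq, start=1):
            for item in itemset:
                idlists[item].append((sid, tid))
    return idlists

def temporal_join(idlist_x, idlist_y):
    """id-list of the 2-sequence <xy>: occurrences of y preceded by an
    occurrence of x in the same sequence (duplicates removed)."""
    return sorted({
        (sy, ty)
        for sx, tx in idlist_x
        for sy, ty in idlist_y
        if sx == sy and tx < ty
    })

# The running example's transformed database (Fig. 6.2.3).
db = {
    10: [{"a"}, {"c"}, {"b", "c"}, {"d"}, {"a", "b", "c"}, {"a"}, {"d"}],
    20: [{"b"}, {"c", "d"}, {"a"}, {"c"}, {"b", "d"}],
    30: [{"d"}, {"b", "c"}, {"a", "c"}, {"c", "d"}],
}
```

For instance, joining the id-lists of a and b this way gives the id-list of the 2-sequence <ab>, whose support is the number of distinct SIDs in the result.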
a            b            c            d
SID  TID     SID  TID     SID  TID     SID  TID
10   1       10   3       10   2       10   4
10   5       10   5       10   3       10   7
10   6       20   1       10   5       20   2
20   3       20   5       20   2       20   5
30   3       30   2       20   4       30   1
                          30   2       30   4
                          30   3
                          30   4

Figure 6.2.4: Vertical id-List