is applied by many other algorithms. The problems of AprioriAll are that
there are many candidates generated and multiple passes over the databases
are very time consuming.
6.2.2.1.1 GSP
Srikant and Agrawal generalized the defi nition of sequential pattern mining
problem in [47] by incorporating some new properties, i.e., time constraints,
transaction relaxation, and taxonomy. For the time constraints, the maximum
gap and the minimal gap are defi ned to specifi ed the gap between any two
adjacent transactions in the sequence. When testing a candidate, if any gap
of the candidate falls out of the range between the maximum gap and the
minimal gap, then the candidate is not a pattern. Furthermore, the authors
relaxed the defi nition of transaction by using a sliding window, that when
the time range between two items is smaller than the sliding window, these
two items are considered to be in the same transaction. The taxonomy is
used to generate multiple level sequential patterns.
In [47], the authors proposed a new algorithm which is named GSP to
effi ciently fi nd the generalized sequential patterns. Similar to the AprioriAll
algorithm, there are two phases in GSP, i.e., candidate generation and
test.
In the candidate generation process, the candidate k-sequences are
generated based on the frequent (k-1)-sequences. Given a sequence s = µs
1
,
s
2
, . . . , s
n
Å and subsequence c, c is a contiguous subsequence of s if any of
the following conditions holds: (1) c is derived from s by dropping an item
from either s
1
or s
n
; (2) c is derived from s by dropping an item from an
element s
j
that has at least 2 items; and (3) c is a contiguous subsequence
of ĉ, and ĉ is a contiguous subsequence of s. Specifi cally, the candidates are
generated in two phases:
• Joint Phase: Candidate k-sequences are generated by joining two (k-1)-
sequences that have the same contiguous subsequences. When we join
the two sequences, the item can be inserted as a part of the element or
as a separated element. For example, because µd(bc)aÅ and µd(bc)dÅ have
the same contiguous subsequence µd(bc)Å, then we know that candidate
5-sequence µd(bc)(ad)Å, µd(bc)adÅ and µd(bc)daÅ can be generated.
• Prune Phase: The algorithm removes the candidate sequences which
have a contiguous subsequence whose support count is less than the
minimal support. Moreover, it uses a hash-tree structure [41] to reduce
the number of candidates.
The process for generating candidates in the example database is shown
in Fig. 6.2. For GSP, the diffi culty is that the support of candidate sequences is
not easy to count due to the introduced generalization rules, while this is not
Frequent Pattern Mining 129