Sequential pattern mining with prefix span

Turning to sequential pattern mining, the prefix span algorithm is a little more involved than association rules, so we need to take a step back and explain the basics first. Prefix span was first described in http://hanj.cs.illinois.edu/pdf/tkde04_spgjn.pdf as a natural extension of the so-called FreeSpan algorithm. The algorithm itself represents a notable improvement over earlier approaches, such as Generalized Sequential Patterns (GSP). The latter is based on the apriori principle, and all the drawbacks we discussed earlier for apriori-based algorithms carry over to sequential mining as well, that is, expensive candidate generation, multiple database scans, and so on.

Prefix span, in its basic formulation, uses the same fundamental idea as FP-growth, namely projecting the original database onto a usually smaller structure to analyze. While in FP-growth we recursively built new FP-trees for each suffix of a branch in the original FP-tree, prefix span grows, or spans, new structures by considering prefixes, as the name suggests.

Let's first properly define the intuitive notions of prefix and suffix in the context of sequences. In what follows, we'll always assume that the items within a sequence item are alphabetically ordered; that is, if s = <s1, s2, ..., sl> is a sequence in S and each si is a concatenation of items, that is, si = (ai1 ... aik), where the aij are items in I, we assume that all aij are in alphabetical order within si. In such a situation, a sequence s' = <s'1, s'2, ..., s'm> with m ≤ l is called a prefix of s if and only if the following three properties are satisfied:

  • For all i < m, we have equality of sequence items, that is, s'i = si
  • s'm < sm, that is, the last item of s' is a subpattern of sm
  • If we subtract s'm from sm, that is, delete the subpattern s'm from sm, all the frequent items left in sm - s'm have to come after all the items in s'm in alphabetical order

While the first two points come fairly naturally, the last one might seem a little strange, so let's explain it with an example. Given a sequence, <a(abc)>, from a database D, in which a, b, and c are indeed frequent, then <aa> and <a(ab)> are prefixes of <a(abc)>, but <ab> is not, because in the difference of the last sequence items, <(abc)> - <b> = <(ac)>, the letter a does not alphabetically come after the b from <ab>. Essentially, the third property tells us that a prefix can only cut out parts at the beginning of the last sequence item it affects.
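To make these three properties concrete, here is a minimal Scala sketch of the prefix check, assuming we encode a sequence as a list of alphabetically sorted item lists. Both the encoding and the isPrefix helper are our own illustration, not part of any library, and for simplicity we treat every item as frequent, so the frequency filter in the third property is ignored:

type Sequence = List[List[Char]]

def isPrefix(p: Sequence, s: Sequence): Boolean =
  p.nonEmpty && p.length <= s.length && {
    val m = p.length
    val headsMatch = p.init == s.take(m - 1)                   // property one
    val lastIsSubpattern = p.last.forall(s(m - 1).contains)    // property two
    val remainder = s(m - 1).filterNot(p.last.contains)
    val ordered = remainder.forall(r => p.last.forall(_ < r))  // property three
    headsMatch && lastIsSubpattern && ordered
  }

val seq = List(List('a'), List('a', 'b', 'c'))           // <a(abc)>
println(isPrefix(List(List('a'), List('a')), seq))       // <aa>    -> true
println(isPrefix(List(List('a'), List('a', 'b')), seq))  // <a(ab)> -> true
println(isPrefix(List(List('a'), List('b')), seq))       // <ab>    -> false

The last call fails precisely because of the third property: after removing b from (abc), the leftover a does not come after b alphabetically.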

With the notion of the prefix defined, it is now easy to say what a suffix is. With the same notation as before, if s' is a prefix of s, then s'' = <(sm - s'm), sm+1, ..., sl> is a suffix for this prefix, which we denote as s'' = s / s'. Furthermore, we will write s = s's'' in a product notation. For instance, given that <a(abc)> is the original sequence and <aa> is the prefix, we denote the suffix for this prefix as follows:

<(_bc)> = <a(abc)> / <aa>

Note that we use an underscore to mark the remainder of a sequence item that has been partially consumed by the prefix.
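Reusing the encoding and Sequence alias from the previous snippet, the suffix operation amounts to deleting the prefix's last item set from the matched element and keeping everything that follows; suffixOf is again a hypothetical helper for illustration only:

def suffixOf(p: Sequence, s: Sequence): Sequence = {
  val m = p.length
  // Delete the items of the prefix's last element from the m-th element of s;
  // a non-empty remainder corresponds to the underscore notation.
  val remainder = s(m - 1).filterNot(p.last.contains)
  if (remainder.isEmpty) s.drop(m) else remainder :: s.drop(m)
}

// <a(abc)> / <aa> = <(_bc)>
println(suffixOf(List(List('a'), List('a')), List(List('a'), List('a', 'b', 'c'))))
// prints List(List(b, c))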

Both the prefix and suffix notions are useful to split up, or partition, the original sequential pattern mining problem into smaller parts, as follows. Let {<p1>, ..., <pn>} be the complete set of sequential patterns of length 1. Then, we can make the following observations:

  • All the sequential patterns start with one of the pi. This, in turn, means that we can partition all sequential patterns into n disjoint sets, namely those starting with pi, for i between 1 and n.
  • Applying this reasoning recursively, we end up with the following statement: if s is a given sequential pattern of length l and {s1, ..., sm} is the complete list of length l+1 sequential patterns that have s as a prefix, then all sequential patterns with the prefix s can be partitioned into m sets prefixed by one of the si.

Both these statements are easy to arrive at but provide a powerful tool for subdividing the original problem into disjoint smaller problems. Such a strategy is called divide and conquer. With this in mind, we can now proceed very similarly to what we did with conditional databases in FP-growth, namely project databases with respect to a given prefix. Given a sequential pattern database S and a prefix s, the s-projected database, S|s, is the set of all the suffixes for s in S.
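As a rough illustration of this definition, the following sketch computes the projected database for a single-item prefix under the same encoding as before. The projectByItem helper is our own simplification; real implementations, including Spark's, are considerably more refined, but the core idea of collecting suffixes is the same:

def projectByItem(db: List[Sequence], item: Char): List[Sequence] =
  db.flatMap { s =>
    // Find the first element containing the item and keep what follows it.
    val i = s.indexWhere(_.contains(item))
    if (i == -1) None
    else {
      // Items are sorted, so dropping everything up to and including the
      // matched item leaves the "(_...)" remainder of that element.
      val remainder = s(i).dropWhile(_ <= item)
      val suffix = if (remainder.isEmpty) s.drop(i + 1) else remainder :: s.drop(i + 1)
      if (suffix.isEmpty) None else Some(suffix)
    }
  }

// Projecting <(ad)c(bc)(ae)> by the prefix <a> yields <(_d)c(bc)(ae)>:
println(projectByItem(List(List(List('a', 'd'), List('c'), List('b', 'c'), List('a', 'e'))), 'a'))
// prints List(List(List(d), List(c), List(b, c), List(a, e)))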

We need one last definition to state and analyze the prefix span algorithm. If s is a sequential pattern in S and x is a pattern with the prefix s, then the support count of x in S|s, denoted by suppS|s(x), is the number of sequences y in S|s such that x < sy, where sy denotes the product of s and y in the notation introduced earlier; that is, we simply carry over the notion of support to s-projected databases. There are a few interesting properties we can derive from this definition that make our situation much easier. For instance, by definition, we see that for any sequence x with the prefix s, we have the following:

suppS(x) = suppS|s(x)

That is, it does not matter whether we count support in the original or the projected database in this case. Moreover, if s' is a prefix of s, it is clear that S|s = (S|s')|s, meaning we can project consecutively without losing information. The last and, from a computational complexity perspective, most important statement is that a projected database cannot exceed its original size. This property should again be clear from the definitions, but it is immensely helpful in justifying the recursive nature of prefix span.

Given all this information, we can now sketch the prefix span algorithm in pseudocode as follows. Note that we distinguish between appending an item s' to the last item of a sequential pattern s and appending the one-element sequence <s'> to the end of s. To give an example, we could either add the letter e to <a(abc)> to form <a(abce)> or add <e> at the end to form <a(abc)e>:

def prefixSpan(s: Prefix, l: Length, S: ProjectedDatabase) {
  S' = set of all s' in S|s such that
    (s' appended to s is a sequential pattern) or
    (<s'> appended to s is a sequential pattern)
  for s' in S' {
    s'' = s' appended to s
    output s''
    call prefixSpan(s'', l + 1, S|s'')
  }
}
call prefixSpan(<>, 0, S)

The prefix span algorithm, as outlined, finds all sequential patterns; that is, it represents a solution to the sequential pattern mining problem. We cannot outline the proof of this statement here, but hopefully we have provided you with enough intuition to see how and why it works.
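To tie the pieces together, here is a compact, naive rendering of this recursion in plain Scala for the special case in which every sequence element holds just a single item. It deliberately ignores itemset extensions (the first of the two append cases) as well as all of the paper's performance refinements, so treat it as an illustration of the divide-and-conquer idea rather than a faithful implementation:

// Each sequence is a plain list of single items here, for example, "abcb".toList.
def naivePrefixSpan(db: List[List[Char]], minCount: Int): List[(List[Char], Int)] = {
  // The projected database: what remains of each sequence after the first
  // occurrence of the item, dropping empty suffixes.
  def project(d: List[List[Char]], item: Char): List[List[Char]] =
    d.flatMap { s =>
      val i = s.indexOf(item)
      if (i == -1) None else Some(s.drop(i + 1))
    }.filter(_.nonEmpty)

  def grow(prefix: List[Char], d: List[List[Char]]): List[(List[Char], Int)] =
    d.flatten.distinct.flatMap { item =>
      val count = d.count(_.contains(item)) // support of prefix :+ item
      if (count < minCount) Nil
      else {
        val p = prefix :+ item
        (p, count) :: grow(p, project(d, item))
      }
    }

  grow(Nil, db)
}

// naivePrefixSpan(List("abcb".toList, "acb".toList, "cab".toList), 2)
// finds, among others, (List(a, b), 3) and (List(a, c, b), 2).

Note how the recursion mirrors the pseudocode: each frequent item in the projected database extends the prefix by one, and the corresponding projection shrinks with every step.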

Turning to Spark for an example, note that we did not discuss how to effectively parallelize the baseline algorithm. If you are interested in the implementation details, see https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala, as the parallel version is a little too involved to justify presenting it here. We will study the example first provided in Table 2, that is, the four sequences <a(abc)(ac)d(cf)>, <(ad)c(bc)(ae)>, <(ef)(ab)(df)cb>, and <eg(af)cbc>. To encode the nested structure of sequences, we use arrays of arrays of strings and parallelize them to create an RDD. Initializing and running an instance of PrefixSpan works pretty much the same way as it did for the other two algorithms. The only thing noteworthy here is that, apart from setting the minimum support threshold to 0.7 via setMinSupport, we also limit the maximum length of the patterns to 5 through setMaxPatternLength. This last parameter is there to limit the recursion depth, because despite the clever implementation, the algorithm (and particularly the computation of database projections) can take a prohibitive amount of time:

import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.rdd.RDD

// The four example sequences, encoded as arrays of arrays of strings.
val sequences: RDD[Array[Array[String]]] = sc.parallelize(Seq(
  Array(Array("a"), Array("a", "b", "c"), Array("a", "c"), Array("d"), Array("c", "f")),
  Array(Array("a", "d"), Array("c"), Array("b", "c"), Array("a", "e")),
  Array(Array("e", "f"), Array("a", "b"), Array("d", "f"), Array("c"), Array("b")),
  Array(Array("e"), Array("g"), Array("a", "f"), Array("c"), Array("b"), Array("c"))))

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.7)
  .setMaxPatternLength(5)

val model = prefixSpan.run(sequences)

// Print each frequent sequential pattern together with its support count.
model.freqSequences.collect().foreach { freqSequence =>
  println(freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") +
    ", " + freqSequence.freq)
}

Running this code in your Spark shell should yield the following 14 sequential patterns, each with its support count (the ordering of the collected output may differ):

[[a]], 4
[[b]], 4
[[c]], 4
[[d]], 3
[[e]], 3
[[f]], 3
[[a], [b]], 4
[[a], [c]], 4
[[b], [c]], 3
[[c], [b]], 3
[[c], [c]], 3
[[d], [c]], 3
[[a], [c], [b]], 3
[[a], [c], [c]], 3