Association rule mining

Recall from the association rule introduction that, once we have the frequent item sets, that is, the patterns that meet the specified minimum support threshold, we are about halfway to computing association rules. In fact, Spark's implementation of association rules assumes that we provide an RDD of FreqItemset[Item], which we have already seen an example of in the preceding call to model.freqItemsets. On top of that, computing association rules is not only available as a standalone algorithm but is also exposed through FPGrowth itself.

Before showing how to run the algorithm on our running example, let's quickly explain how association rules are implemented in Spark:

  1. The algorithm is provided with frequent item sets up front, so we don't have to compute them again.
  2. For each pair of patterns X and Y, compute the frequency with which X and Y co-occur and store (X, (Y, supp(X ∪ Y))). We call such pairs of patterns candidate pairs, where X acts as a potential antecedent and Y as a potential consequent.
  3. Join all the patterns with the candidate pairs to obtain statements of the form (X, ((Y, supp(X ∪ Y)), supp(X))).
  4. We can then filter expressions of the form (X, ((Y, supp(X ∪ Y)), supp(X))) by the desired minimum confidence value, the confidence of X ⇒ Y being supp(X ∪ Y) / supp(X), to return all rules X ⇒ Y with at least that level of confidence; a rough sketch of these steps follows the list.
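
Although Spark hides these steps behind AssociationRules, it can help to see them spelled out. The following is only a rough, illustrative sketch of steps 2 to 4 in plain RDD operations, not Spark's actual internal implementation; it assumes an RDD[FreqItemset[String]] named freqItemsets (for instance, the patterns RDD defined below) and a minimum confidence threshold:

val minConfidence = 0.7

// Step 2: candidate pairs (X, (Y, supp(X ∪ Y))) with a single-item consequent Y
val candidates = freqItemsets.flatMap { itemset =>
  val items = itemset.items
  items.map { consequent =>
    val antecedent = items.filter(_ != consequent).sorted.toSeq
    (antecedent, (consequent, itemset.freq))
  }.filter { case (antecedent, _) => antecedent.nonEmpty }
}

// Support of every pattern, keyed the same way as the candidate antecedents
val supports = freqItemsets.map(itemset => (itemset.items.sorted.toSeq, itemset.freq))

// Steps 3 and 4: join to (X, ((Y, supp(X ∪ Y)), supp(X))) and filter by confidence
val derivedRules = candidates.join(supports).flatMap {
  case (antecedent, ((consequent, unionSupport), antecedentSupport)) =>
    val confidence = unionSupport.toDouble / antecedentSupport
    if (confidence >= minConfidence) Some((antecedent, consequent, confidence)) else None
}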

Assuming we didn't compute the patterns through FP-growth in the last section but, instead, were just given the full list of these item sets, we can create an RDD from a sequence of FreqItemset from scratch and then run a new instance of AssociationRules on it:

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
import org.apache.spark.rdd.RDD

val patterns: RDD[FreqItemset[String]] = sc.parallelize(Seq(
  new FreqItemset(Array("m"), 3L),
  new FreqItemset(Array("m", "c"), 3L),
  new FreqItemset(Array("m", "c", "f"), 3L),
  new FreqItemset(Array("m", "a"), 3L),
  new FreqItemset(Array("m", "a", "c"), 3L),
  new FreqItemset(Array("m", "a", "c", "f"), 3L),
  new FreqItemset(Array("m", "a", "f"), 3L),
  new FreqItemset(Array("m", "f"), 3L),
  new FreqItemset(Array("f"), 4L),
  new FreqItemset(Array("c"), 4L),
  new FreqItemset(Array("c", "f"), 3L),
  new FreqItemset(Array("p"), 3L),
  new FreqItemset(Array("p", "c"), 3L),
  new FreqItemset(Array("a"), 3L),
  new FreqItemset(Array("a", "c"), 3L),
  new FreqItemset(Array("a", "c", "f"), 3L),
  new FreqItemset(Array("a", "f"), 3L),
  new FreqItemset(Array("b"), 3L)
))

val associationRules = new AssociationRules().setMinConfidence(0.7)
val rules = associationRules.run(patterns)

rules.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
}

Note that after initializing the algorithm, we set the minimum confidence to 0.7 before running it and collecting the results. Running AssociationRules returns an RDD of rules of the Rule type; these rule objects have accessors for antecedent, consequent, and confidence, which we use above to print each rule together with its confidence.
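
Since the returned Rule objects are plain values, we can also post-process the RDD with the usual operations before collecting. The following snippet is merely an illustrative sketch (the rulesWithM name is ours, not part of the API); it keeps the rules whose antecedent contains the item m and sorts them by descending confidence:

val rulesWithM = rules
  .filter(_.antecedent.contains("m"))
  .sortBy(_.confidence, ascending = false)
  .collect()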

The reason we started this example from scratch is to convey the idea that association rules are indeed a standalone algorithm in Spark. That said, since the only built-in way to compute frequent patterns in Spark is currently through FP-growth, and AssociationRules depends on the concept of FreqItemset (imported from the FPGrowth submodule) anyway, running it in isolation is a bit impractical. Using our results from the previous FP-growth example, we could just as well have written the following to achieve the same:

val patterns = model.freqItemsets
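
From there, running the standalone algorithm on these patterns works exactly as before. The following minimal sketch assumes that model is the FPGrowthModel obtained in the previous FP-growth section (the rulesFromPatterns name is only for illustration):

val rulesFromPatterns = new AssociationRules()
  .setMinConfidence(0.7)
  .run(patterns)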

Interestingly, association rules can also be computed directly through the FP-growth model's interface. Continuing with the notation from the earlier example, we can simply write the following to end up with the same set of rules as before:

val rules = model.generateAssociationRules(confidence = 0.7)

In practical terms, while both formulations can be useful, the latter is certainly more concise.
