Finding associations in data with the Apriori algorithm

One of the main goals of data mining and clustering is to learn the implicit relationships in the data. The Apriori algorithm helps to do this by teasing out such relationships into an explicit set of association rules. A common example of this type of analysis is what grocery stores do. They analyze receipts to see which items are commonly bought together, and then they can modify the store layout and marketing to suggest the second item once you've decided to buy the first.

In this recipe, we'll use this algorithm to extract the relationships from the mushroom dataset that we've already seen several times in this chapter.

Getting ready

First, we'll use the same dependencies that we did in the Loading CSV and ARFF files into Weka recipe.

We'll use only one import in our script or REPL:

(import [weka.associations Apriori])

We'll also use the mushroom dataset that we introduced in the Classifying data with decision trees recipe. We'll set the class attribute to the column indicating whether the mushroom is edible or poisonous:

(def shrooms (doto (load-arff "data/UCI/mushroom.arff")
               (.setClassIndex 22)))

Finally, we'll use the defanalysis macro from the Discovering groups of data using K-Means clustering recipe.

How to do it…

We'll train an instance of the Apriori class, extract the association rules it finds, and print them:

  1. First, we need a way to convert keywords into the integer codes that the Apriori class expects for its rank metric option. A Clojure map works here, since a map can be called as a function:
    (def rank-metrics
      {:confidence 0 :lift 1 :leverage 2 :conviction 3})
  2. Now we'll use that in our definition of a wrapper function for the Apriori class:
    (defanalysis
      apriori Apriori buildAssociations
      [["-N" rules 10]
       ["-T" rank-metric :confidence rank-metrics]
       ["-C" min-metric 0.9]
       ["-D" min-support-delta 0.05]
       [["-M" "-U"] min-support-bounds [0.1 1.0] :seq]
       ["-S" significance nil :not-nil]
       ["-I" output-itemsets false :flag-true]
       ["-R" remove-missing-value-columns
        false :flag-true]
       ["-V" progress false :flag-true]
       ["-A" mine-class-rules false :flag-true]
       ["-c" class-index nil :not-nil]])
  3. With this in place, we can use it the way we have used the other wrapper functions:
    (def a (apriori shrooms))
  4. Then we can print the association rules:
    user=> (doseq [r (.. a getAssociationRules getRules)]
             (println
               (format "%s => %s %s = %.4f"
                       (mapv str (.getPremise r))
                       (mapv str (.getConsequence r))
                       (.getPrimaryMetricName r)
                       (.getPrimaryMetricValue r))))
    ["veil-color=w"] => ["veil-type=p"] Confidence = 1.0000
    ["gill-attachment=f"] => ["veil-type=p"] Confidence = 1.0000
    …

How it works…

The Apriori algorithm looks for items that are often associated together within a transaction. This can be used for things such as analyzing shopping patterns. In this case, we're viewing the constellation of attributes related to each mushroom as a transaction, and we're using the Apriori algorithm to see which traits are associated with which other traits.
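
To make the idea of items that are often associated together more concrete, here is a minimal sketch of the level-wise search that gives the algorithm its name. It is not Weka's implementation. The toy transactions and the function names (support, frequent, next-candidates, and frequent-itemsets) are all made up for this illustration, and it assumes a single fixed minimum support rather than the bounds and delta that the wrapper above passes to Weka.

```clojure
(require '[clojure.set :as set])

;; Hypothetical toy data: each transaction is a set of attribute=value items.
(def transactions
  [#{"veil-color=w" "veil-type=p" "gill-attachment=f"}
   #{"veil-color=w" "veil-type=p" "gill-attachment=f"}
   #{"veil-color=w" "veil-type=p" "gill-attachment=f"}
   #{"veil-color=n" "veil-type=p" "gill-attachment=a"}])

(defn support
  "Fraction of transactions that contain every item in itemset."
  [transactions itemset]
  (/ (count (filter #(set/subset? itemset %) transactions))
     (count transactions)))

(defn frequent
  "Keeps the candidate itemsets whose support is at least min-support."
  [transactions min-support candidates]
  (set (filter #(>= (support transactions %) min-support) candidates)))

(defn next-candidates
  "Joins frequent k-itemsets into (k+1)-itemsets and keeps only those whose
  k-item subsets are all frequent. This is the pruning step: an itemset
  can't be frequent unless all of its subsets are."
  [freq-k]
  (let [k (count (first freq-k))]
    (set (for [a freq-k, b freq-k
               :let [c (set/union a b)]
               :when (and (= (count c) (inc k))
                          (every? #(contains? freq-k (disj c %)) c))]
           c))))

(defn frequent-itemsets
  "Returns every itemset with support >= min-support, built level by level."
  [transactions min-support]
  (let [singles (set (map hash-set (apply set/union transactions)))]
    (loop [level (frequent transactions min-support singles)
           found #{}]
      (if (empty? level)
        found
        (recur (frequent transactions min-support (next-candidates level))
               (into found level))))))

;; Every itemset that appears in at least three of the four transactions.
(frequent-itemsets transactions 3/4)
```

Association rules are then read off the frequent itemsets by splitting each one into a premise and a consequence.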

The algorithm attempts to find the premises that imply a set of consequences. For instance, a white veil color (the premise) implies a partial veil type with a confidence of 1.0, so whenever the premise is found, the consequence is also found. A white veil color also implies a free gill attachment, but the confidence is about 99.8 percent (0.9977), so we know that these two aren't associated all of the time.
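
Confidence is only one of the four metrics named in rank-metrics. As a rough guide to what the numbers mean, the following sketch computes all four for a single rule using the usual textbook definitions. We're assuming that these match what Weka reports, but we haven't checked them against its source. The toy transactions and the rule-metrics function are hypothetical and aren't part of Weka or of this recipe's code.

```clojure
(require '[clojure.set :as set])

;; Hypothetical shopping baskets, not the mushroom dataset.
(def toy-transactions
  [#{:bread :butter}
   #{:bread :butter :milk}
   #{:bread :milk}
   #{:milk}])

(defn support
  "Fraction of transactions that contain every item in itemset."
  [transactions itemset]
  (/ (count (filter #(set/subset? itemset %) transactions))
     (count transactions)))

(defn rule-metrics
  "Computes the metrics for the rule premise => consequence.
  Assumes that both the premise and the consequence occur at least once."
  [transactions premise consequence]
  (let [p-a  (support transactions premise)
        p-b  (support transactions consequence)
        p-ab (support transactions (set/union premise consequence))]
    {;; How often the consequence holds when the premise does.
     :confidence (/ p-ab p-a)
     ;; Confidence relative to how common the consequence is anyway.
     :lift       (/ p-ab (* p-a p-b))
     ;; Observed co-occurrence minus what independence would predict.
     :leverage   (- p-ab (* p-a p-b))
     ;; Undefined when the rule never fails, so we return nil in that case.
     :conviction (when (< p-ab p-a)
                   (/ (* p-a (- 1 p-b)) (- p-a p-ab)))}))

(rule-metrics toy-transactions #{:bread} #{:butter})
```

Using Clojure's ratios keeps the arithmetic exact. In this toy data, butter only ever appears with bread, so the reverse rule (butter implies bread) has a confidence of 1, just like the veil-color=w rule above.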

The abbreviated data dump of the preceding traits isn't particularly legible, so here's the same information as a table:

Premise                            Consequence                       Confidence
---------------------------------  --------------------------------  ----------
veil-color=w                       veil-type=p                       1.0000
gill-attachment=f                  veil-type=p                       1.0000
gill-attachment=f, veil-color=w    veil-type=p                       1.0000
gill-attachment=f                  veil-color=w                      0.9990
gill-attachment=f, veil-type=p     veil-color=w                      0.9990
gill-attachment=f                  veil-type=p, veil-color=w         0.9990
veil-color=w                       gill-attachment=f                 0.9977
veil-type=p, veil-color=w          gill-attachment=f                 0.9977
veil-color=w                       gill-attachment=f, veil-type=p    0.9977
veil-type=p                        veil-color=w                      0.9754

From this, we can see that a white veil is associated with a partial veil type, a free gill attachment is associated with a partial white veil, and so on. If we want more information, we can request more rules using the rules parameter.

There's more…

The Weka documentation at http://weka.sourceforge.net/doc.dev/weka/associations/Apriori.html has more information about the Apriori class and its options.

Wikipedia's page on the Apriori algorithm at http://en.wikipedia.org/wiki/Apriori_algorithm has more about the algorithm itself.
