Chapter 7. Exploring Association Rules with Apriori

Association rules allow us to explore relationships between items and sets of items. Such items can be as diverse as the contents of a market basket, the words used in sentences, the components of food products, and so on. Let's go back to the first example: transactions in a shop. Each transaction is composed of one or more items. We are only interested in transactions of at least two items because, of course, there can be no relationship between items in the purchase of a single item. Imagine customers purchase the following sets of items, where each row represents a transaction. We will return to this example throughout this section:

  • Cherry coke, chips, lemon
  • Cherry coke, chicken wings, lemon
  • Cherry coke, chips, chicken wings, lemon
  • Chips, chicken wings, lemon
  • Cherry coke, lemon, chips, chocolate cake

At first sight, you will notice that there seems to be an association between purchases of cherry coke and lemon, as four out of five (80 percent) transactions contain both items. Other possible associations are hidden in this short list of transactions. Can you discover them?
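One mechanical way to spot such candidate associations is to count how often each pair of items co-occurs across transactions. Here is a minimal sketch in Python for illustration (the chapter's own analyses use R); the `transactions` list simply encodes the five purchases above:

```python
from itertools import combinations
from collections import Counter

# The five example transactions from the text
transactions = [
    {"cherry coke", "chips", "lemon"},
    {"cherry coke", "chicken wings", "lemon"},
    {"cherry coke", "chips", "chicken wings", "lemon"},
    {"chips", "chicken wings", "lemon"},
    {"cherry coke", "lemon", "chips", "chocolate cake"},
]

# Count how often each pair of items appears in the same transaction
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# The most frequent pairs are the most promising candidate associations
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Both {Cherry Coke, Lemon} and {Chips, Lemon} appear in four of the five transactions, which is exactly the kind of pattern association rule mining automates.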

Now, imagine doing this task for lists of thousands of transactions, comprising dozens of items. I bet you'd be bored before finishing this task, and you might miss important associations. The point of mining association rules is to do exactly that job in an automated way and derive indicators of the reliability of these associations.

In this chapter, we will:

  • Examine the important concepts in association rules
  • Examine how apriori, an algorithm frequently used for such analysis, works
  • Discover the use of market basket analysis with apriori in R

Apriori – basic concepts

There are some concepts about apriori that need to be understood before going further in this chapter: association rules, itemsets, support, confidence, and lift.

Association rules

An association rule is an explicit statement of a relationship in the data, in the form X => Y, where X (the antecedent) can be composed of one or several items. X is called an itemset. In the examples we will see, Y (the consequent) is always a single item. We might, for instance, be interested in the antecedents of lemon if we want to promote the purchase of lemons.

Itemsets

Frequent itemsets are items or collections of items that occur frequently in transactions. {Lemon} is the most frequent itemset in the previous example, followed by {Cherry Coke} and {Chips}. Itemsets are considered frequent if they occur more often than a specified threshold, called the minimal support. The omission of itemsets with support lower than the minimal support is called support pruning. Itemsets are usually written with their items between curly braces: {items}.

Support

The support for an itemset is the proportion of all transactions in which the itemset is present. As such, it allows us to estimate how interesting an itemset or a rule is: when support is low, interest is limited. The support for {Lemon} in our example is 1, because all transactions contain the purchase of lemon. The support for {Cherry Coke} is 0.8, because cherry coke is purchased in four of the five transactions (4/5 = 0.8). The support for {Cherry Coke, Chips} is 0.6, as three transactions contain both cherry coke and chips. It is now your turn to do some math. Can you find the support for {Chips, Chicken wings}?
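The definition above is easy to verify in code. A minimal Python sketch for illustration, where `support` is a hypothetical helper (not part of any library) that computes the proportion of transactions containing an itemset:

```python
# The five example transactions from the text
transactions = [
    {"cherry coke", "chips", "lemon"},
    {"cherry coke", "chicken wings", "lemon"},
    {"cherry coke", "chips", "chicken wings", "lemon"},
    {"chips", "chicken wings", "lemon"},
    {"cherry coke", "lemon", "chips", "chocolate cake"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)  # <= tests subset
    return hits / len(transactions)

print(support({"lemon"}, transactions))                   # 1.0
print(support({"cherry coke"}, transactions))             # 0.8
print(support({"cherry coke", "chips"}, transactions))    # 0.6
```

Running `support({"chips", "chicken wings"}, transactions)` answers the question posed above: two of the five transactions contain both, giving 0.4.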

Confidence

Confidence is the proportion of transactions containing X that also contain Y. It can be computed as the number of transactions featuring both X and Y divided by the number of transactions featuring X. Let's consider the example of the association rule {Cherry Coke, Chips} => Chicken wings. As we have previously mentioned, the {Cherry Coke, Chips} itemset is present in three of the five transactions. Of these three transactions, chicken wings are purchased in only one. So the confidence for the {Cherry Coke, Chips} => Chicken wings rule is 1/3 ≈ 0.33.
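The same computation can be expressed directly from the definition. A minimal Python sketch for illustration, with `confidence` as a hypothetical helper name:

```python
# The five example transactions from the text
transactions = [
    {"cherry coke", "chips", "lemon"},
    {"cherry coke", "chicken wings", "lemon"},
    {"cherry coke", "chips", "chicken wings", "lemon"},
    {"chips", "chicken wings", "lemon"},
    {"cherry coke", "lemon", "chips", "chocolate cake"},
]

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    # Transactions containing both the antecedent and the consequent
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    # Transactions containing the antecedent
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / ante

print(confidence({"cherry coke", "chips"}, {"chicken wings"}, transactions))
```

This prints 0.333…, matching the worked example: one transaction with chicken wings out of the three containing {Cherry Coke, Chips}.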

Lift

Imagine both the antecedent and the consequent are frequent. For instance, consider the association rule {Lemon} => Cherry Coke, in which lemon has a support of 1 and cherry coke a support of 0.8. Even without a true relationship between the items, they could co-occur quite often. The proportion of transactions in which this is expected to happen by chance is support(X) * support(Y); in our case, 1 * 0.8 = 0.8. Lift is a measure of the improvement of the rule's support over what can be expected by chance, that is, in comparison to the value we just computed. It is computed as support(X => Y) / (support(X) * support(Y)).

In the current case:

Lift = support({Lemon, Cherry Coke}) / (support({Lemon}) * support({Cherry Coke})) = (4/5) / ((5/5) * (4/5)) = 1

As the lift value is not higher than 1, the rule does not explain the relationship between lemon and cherry coke better than could be expected by chance.
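The lift formula follows directly from the support computation. A minimal Python sketch for illustration, with `support` and `lift` as hypothetical helper names:

```python
# The five example transactions from the text
transactions = [
    {"cherry coke", "chips", "lemon"},
    {"cherry coke", "chicken wings", "lemon"},
    {"cherry coke", "chips", "chicken wings", "lemon"},
    {"chips", "chicken wings", "lemon"},
    {"cherry coke", "lemon", "chips", "chocolate cake"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(antecedent, consequent, transactions):
    """Ratio of the rule's observed support to the support expected by chance."""
    observed = support(antecedent | consequent, transactions)
    expected = support(antecedent, transactions) * support(consequent, transactions)
    return observed / expected

print(lift({"lemon"}, {"cherry coke"}, transactions))  # 1.0
```

A lift of 1 means the rule performs no better than chance; values above 1 indicate that the antecedent and consequent co-occur more often than their individual supports would predict.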

Now that we have discussed some basic terminology, we can continue with describing how the frequently used algorithm, apriori, works.
