Towards association rules

All of this frequent itemset stuff is fine, but we are ultimately on the hunt for association rules, which are much more exciting. Association rules are formed from frequent itemsets, with a few small twists. We are interested in making a statement about the frequent itemsets like this: people who buy vanilla wafers also buy bananas 60% of the time. In order to do so, we need to learn how to calculate a few additional metrics, starting with two we call support and confidence.

Support

If we are looking for frequent itemsets, then we also need a way to express how often we see these sets occurring in baskets, and whether that number qualifies as frequent. If I see {vanilla wafers, bananas} in 90% of baskets, is that considered frequent? What about 50% of baskets? What about 5%? We call this number the support of the itemset. The support is just the number of baskets containing that itemset, divided by the total number of baskets.

To make support more meaningful, and to begin talking about "interestingness," we need to set a minimum support threshold. The minimum support threshold is any percentage between 0 and 100 that makes sense for our problem domain. If we set the minimum support threshold to 5%, this means that itemsets will only be considered frequent if they are found in at least 5% of all our baskets.

Support for a 2-itemset is typically written using a probability notation, like this:

support(X -> Y) = P(X ∪ Y)

In other words, we can read this equation as: the support of X -> Y equals the percentage of baskets that contain both X and Y. Item X could be vanilla wafers and item Y could be bananas in this example. To calculate the support of an itemset, we count how many baskets contain both these items, and divide by the total number of baskets. If the support of an itemset exceeds the minimum support threshold, then we can consider the itemset potentially interesting.
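To make this concrete, here is a minimal Python sketch of the support calculation; the three sample baskets and the support() helper are invented just for illustration, not the implementation we build later in this chapter:

baskets = [
    {"vanilla wafers", "bananas", "dog food"},
    {"bananas", "bread", "yogurt"},
    {"vanilla wafers", "bananas", "whipped cream"},
]

def support(itemset, baskets):
    # Fraction of baskets that contain every item in the itemset
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

print(support({"vanilla wafers", "bananas"}, baskets))  # 2 of 3 baskets, about 0.67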

Confidence

Once we have discovered the frequent itemsets, we can start to consider whether one or more items in the set are directing the purchase of other items. For example, it would be interesting to know that 75% of customers who have vanilla wafers in their basket will also have bananas in that same basket. But, on the other hand, maybe only 1% of customers with bananas in their basket will also buy vanilla wafers. Why? This is because many, many more people buy bananas than buy vanilla wafers. Bananas are common and vanilla wafers are rare. So the direction of the purchasing relationship here is not necessarily symmetrical.

This brings us to a critical concept called confidence. The confidence of a directional relationship is written like this:

confidence(X -> Y) = P(Y|X)

We can read this as the confidence of X leading to Y is the probability of Y given X. Or written differently:

confidence(X -> Y) = support(X ∪ Y) / support(X)

The confidence of X -> Y is just the percentage of baskets that contain both X and Y, divided by the percentage of baskets that contain X.
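Continuing the illustrative sketch from the Support section, confidence follows directly from two support calls:

def confidence(X, Y, baskets):
    # P(Y|X): support of X and Y together, divided by support of X
    return support(X | Y, baskets) / support(X, baskets)

print(confidence({"vanilla wafers"}, {"bananas"}, baskets))  # 1.0 for the three sample baskets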

Once we have both support and confidence, we can begin to extend frequent itemsets into association rules.

Association rules

Now that we know how to determine if an itemset is frequent, and how to set up support and confidence levels, we can make a possible association rule from that frequent itemset.

An example of an association rule might look like this:

vanilla wafers -> bananas, whipped cream
[support=1%, confidence=40%]

We read this rule as follows: 1% of all baskets have the combination of vanilla wafers, bananas, and whipped cream; and 40% of customers who purchased vanilla wafers also purchased bananas and whipped cream.

The left-hand side of that rule is the determining item, called the antecedent. The right-hand side is the resulting item(s), called the consequent. If we switch around the items on the left-hand and right-hand sides, we need to calculate a different association rule, which, due to the high popularity of bananas, might look something like this:

bananas -> vanilla wafers, whipped cream
[support=0.001%, confidence=10%]

An example with data

Imagine we have a store, and we have 10 baskets of goods, as shown in the following table. Right away you can see that this is clearly a contrived case, as all the baskets in this store have exactly three items in them and that is highly unlikely in a real store:

Basket   Item1            Item2            Item3
1        vanilla wafers   bananas          dog food
2        bananas          bread            yogurt
3        bananas          apples           yogurt
4        vanilla wafers   bananas          whipped cream
5        bread            vanilla wafers   yogurt
6        milk             bread            bananas
7        vanilla wafers   apples           bananas
8        yogurt           apples           vanilla wafers
9        vanilla wafers   bananas          milk
10       bananas          bread            peanut butter

First, we need to calculate the support of all the individual items in this store. We have nine items across those 10 baskets:

Item             Support (baskets)
apples           3
bananas          8
bread            4
dog food         1
milk             2
peanut butter    1
vanilla wafers   6
yogurt           4
whipped cream    1

To make the example easy, let's consider only one frequent itemset, {vanilla wafers, bananas}, at this point. The support of the itemset, {vanilla wafers, bananas}, is the percentage of baskets that contain both vanilla wafers and bananas. There are four baskets (numbers 1, 4, 7, and 9) that contain both of these items. Thus the support for either rule vanilla wafers -> bananas or bananas -> vanilla wafers is 40%, because 4/10 baskets contain both of these items.

Now we can use these support values to calculate the confidence for two proposed association rules:

vanilla wafers -> bananas

bananas -> vanilla wafers

confidence(vanilla wafers -> bananas) = support(vanilla wafers ∪ bananas) / support(vanilla wafers) = 4/6 = 67%

confidence(bananas -> vanilla wafers) = support(vanilla wafers ∪ bananas) / support(bananas) = 4/8 = 50%

Written as association rules, we have:

vanilla wafers -> bananas [support=40%, confidence=67%]
bananas -> vanilla wafers [support=40%, confidence=50%]

The rule vanilla wafers -> bananas is stronger (same support, higher confidence) than the rule bananas -> vanilla wafers.
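As a sanity check, here is a short sketch that recomputes these numbers from the 10 baskets in the preceding table, reusing the support() and confidence() helpers sketched earlier:

baskets = [
    {"vanilla wafers", "bananas", "dog food"},
    {"bananas", "bread", "yogurt"},
    {"bananas", "apples", "yogurt"},
    {"vanilla wafers", "bananas", "whipped cream"},
    {"bread", "vanilla wafers", "yogurt"},
    {"milk", "bread", "bananas"},
    {"vanilla wafers", "apples", "bananas"},
    {"yogurt", "apples", "vanilla wafers"},
    {"vanilla wafers", "bananas", "milk"},
    {"bananas", "bread", "peanut butter"},
]

wafers, bananas = {"vanilla wafers"}, {"bananas"}
print(support(wafers | bananas, baskets))    # 0.4
print(confidence(wafers, bananas, baskets))  # 4/6, about 0.67
print(confidence(bananas, wafers, baskets))  # 4/8 = 0.5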

Added value – fixing a flaw in the plan

It is very appealing to look at a rule like vanilla wafers -> bananas [support=40%, confidence=67%] and feel satisfied. However, this is a very small dataset and was contrived just for this example. In some cases, association rules can be very misleading, and we should proceed with caution. Consider the following example.

Imagine in a different store, where the support numbers for vanilla wafers and bananas look more like this:

Item                        Support
{vanilla wafers}            50%
{bananas}                   80%
{vanilla wafers, bananas}   30%

In this case, the support for the items individually is quite high, but the support for the items together is lower.

The confidence of vanilla wafers -> bananas in this scenario is .3/.5 = 60%, following the confidence formula given earlier (we divide by the support of the antecedent, vanilla wafers).

So what is the problem? Well, it turns out that some items might do better on their own than as the consequence of an association rule. Even if the rule meets some minimum support threshold, we need to consider how the items behaved outside of the rule. To do this, we calculate a measure called the added value of a given association rule. The added value of the rule vanilla wafers -> bananas is calculated by subtracting the support of bananas from the confidence of the rule. If the added value number is large and positive, then the rule is good and interesting. If the added value number is close to zero, then the rule may be true, but boring. If the added value number is large and negative, then the items in the rule are actually negatively associated and would do better on their own.

We calculate added value like this:

added value = confidence of rule - support of right-hand side

Using the preceding table, here are some calculations:

confidence of rule = .6
support of right-hand side (bananas) = .8

added value = .6 - .8 = -0.2

This number tells us that bananas actually would have done better by themselves. Furthermore, our proposed move of the banana display next to the vanilla wafers in this store might be a mistake, since there is nothing to be gained from attempting to exploit a relationship between bananas and wafers.
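In code, the added value check for this scenario is just a few lines; the support figures below are taken straight from the preceding table:

sup_wafers  = 0.5   # support of {vanilla wafers}
sup_bananas = 0.8   # support of {bananas}
sup_both    = 0.3   # support of {vanilla wafers, bananas}

conf = sup_both / sup_wafers       # confidence(vanilla wafers -> bananas) = 0.6
added_value = conf - sup_bananas   # 0.6 - 0.8 = -0.2
print(added_value)                 # negative: bananas do better on their own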

We can change the data slightly to contrive a positive, interesting rule:

Item                        Support
{vanilla wafers}            50%
{bananas}                   30%
{vanilla wafers, bananas}   30%

The confidence of vanilla wafers -> bananas in this scenario is .3/.5 = 60%

confidence of rule = .6
support of bananas = .3
added value = .6 - .3 = .3

In this case, bananas really should be placed with the vanilla wafers in the store, because apparently no one is buying bananas without also buying vanilla wafers!

There are many more ways to measure the interestingness of association rules, but these are beyond the scope of this book. I would encourage interested readers to search Google Scholar for current academic papers covering this topic. Multiple good sources can be found by using a search phrase such as interestingness measures for association rules. In these papers, there are many good "interestingness" measures for different types of data and problems.

Methods for finding frequent itemsets

So far we have learned that finding association rules is based on first finding frequent itemsets. After that, we are simply doing arithmetic on counts we have already collected. An important principle that will help us find frequent itemsets faster is called the upward closure property. Upward closure states that an itemset can only be frequent if all the items in it are also frequent. In other words, there is no sense in calculating the support for an itemset unless all of the smaller itemsets contained in it are also frequent.

Why is it important to know about closure? Because knowing this rule will save us a lot of time in calculating the possible itemsets. Calculating the support for every possible itemset in a store that has hundreds of thousands of items is clearly not practical! It is definitely to our advantage to reduce the number of itemsets as much as possible. One strategy to reduce the number of itemsets is to take advantage of upward closure to construct an algorithm that works as follows:

  1. First, we will set a support threshold.
  2. Construct a list of 1-itemsets, or singletons:
    • To do this, start with a list of every possible item. This list is called CandidateSingletonList.
    • Calculate the support for every individual item in CandidateSingletonList.
    • Keep only the singletons that meet the support threshold, and add them to a list called SingletonList.
  3. Construct a list of 2-itemsets, or doubletons:
    • To do this, start with SingletonList.
    • Make a list of every possible pairing of the items from SingletonList. This is called CandidateDoubletonList.
    • Keep only the candidate doubletons that meet the support threshold, and add them to a list called DoubletonList.
  4. Construct a list of 3-itemsets, or tripletons:
    • To do this, start with DoubletonList.
    • Make a list of every single item that appears in DoubletonList, and combine each of these with every doubleton in DoubletonList, making triples. This is called CandidateTripletonList.
    • Keep only the candidate tripletons that meet the support threshold, and add them to a list called TripletonList.
  5. Repeat step 4, growing the n-itemsets by one item at a time, using the single items from the previously constructed list, until you run out of frequent itemsets (a code sketch of this process follows the list).
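Here is a compact Python sketch of that level-wise process. The names mirror the steps above, but this is an illustration of the idea rather than the full implementation we will write in the next section:

def apriori(baskets, min_support):
    # Fraction of baskets that contain every item in the candidate itemset
    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / len(baskets)

    # Step 2: keep only the singletons that meet the support threshold
    all_items = {item for basket in baskets for item in basket}
    current = [frozenset([i]) for i in all_items
               if support(frozenset([i])) >= min_support]
    frequent = list(current)

    # Steps 3-5: grow the itemsets by one item at a time, using only
    # single items that survived the previous pass (upward closure)
    while current:
        singles = {i for itemset in current for i in itemset}
        candidates = {itemset | {i} for itemset in current for i in singles
                      if i not in itemset}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
    return frequent

With the 10-basket store from earlier and a 30% threshold, this finds frequent itemsets such as {bananas}, {vanilla wafers}, and {vanilla wafers, bananas}.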

This algorithm is called Apriori, and it was first outlined in a 1994 paper written by Agrawal and Srikant called Fast algorithms for mining association rules in large databases. Since that time, numerous other algorithms have been proposed that attempt to optimize Apriori, including those that exploit parallelism and more interesting data structures, such as trees. There are also separate algorithms for some special types of basket data; for example, baskets with sequential items, or baskets with categorical or hierarchical data. Still, for basic frequent itemset generation, Apriori is a classic choice.

Before we implement Apriori, we will draw special attention to a few important guidelines for generating candidate itemsets. While it is definitely time-consuming to count the 2-itemsets, this is by far the most intensive work of the entire process. Due to the closure property mentioned earlier, each successive pass over the data will construct fewer potential itemsets. As such, it is definitely to our advantage to reduce the number of items that have to be compared at the doubleton phase. To do this, we will set a minimum support threshold; this threshold can be adjusted depending on the needs of the project you are working on.

In the next section, we will implement the Apriori algorithm in Python and use it to find association rules in a real-world database.
