Chapter 2. Association Rule Mining

In our data mining toolbox, measuring the frequency of a pattern is a critical task. In some cases, more frequently occurring patterns may end up being more important patterns. If we can find frequently occurring pairs of items, or triples of items, those may be even more interesting.

In this chapter, we begin our exploration of frequent itemsets, and then we extend those to a type of pattern called association rules. We will cover the following topics:

  • What is a frequent itemset? What are the techniques for finding frequent itemsets? Where are the bottlenecks and how can we speed up the process?
  • How can we extend a frequent itemset to become an association rule?
  • What makes a good association rule? We will learn to describe the value of a particular association rule, given its level of support in the database, our confidence in the rule itself, and the value added by the rule we found.

To do this, we will write a program to find frequent itemsets in an open dataset of metadata (facts) about a group of software projects. Then we will learn to find frequent itemsets among the tags used to describe those projects. Next, we will learn how to extend a frequent itemset into an association rule by calculating its support in the database and then adding a probabilistic direction (X implies Y) confidence interval. Finally, we will learn how to interpret an association rule. Specifically, we want to understand what an association rule shows, and what it does not show.

What are frequent itemsets?

Finding frequent itemsets is a type of counting activity. But unlike producing a simple tally of items we observe in a dataset (today we sold 80 carrots and 100 tomatoes), finding frequent itemsets is slightly different. Specifically, to find frequent itemsets we look for co-occurring sets of items within some larger group. These larger groups are sometimes imagined as supermarket transactions or shopping baskets, and the entire exercise is sometimes called market basket analysis. Staying with the supermarket analogy, the items co-occurring within those baskets are sometimes imagined to be combinations of products purchased at the supermarket. For example, given a set of supermarket transactions or baskets, we might be interested in whether the itemset of {carrots, tomatoes} occurs more frequently in baskets than does the {cucumbers, lemons} itemset.

The purpose of frequent itemset mining is to make interesting discoveries of co-occurring items within a set of transactions. In other words, it may be useful if we found out that there are some combinations of items that occur together frequently across multiple baskets. It will be especially interesting if the frequent itemsets we found are slightly unusual or unexpected in some way. The canonical example of an interesting, highly desirable rule in frequent itemset mining is usually described by retelling an urban legend called the diapers and beer story.

The diapers and beer urban legend

I remember first hearing this story in a data mining graduate course I took in 1998. My professor was trying to explain the usefulness of frequent itemsets and association rules, and he told our class the following story:

"A midwestern grocery store chain mined for frequent itemsets in order to discover what interesting combinations of groceries were purchased together. Their plan was to optimize the sales by co-locating these products in the store. To their delight, the grocery store data mining team learned that on Thursday nights between 5 and 7pm, men would frequently purchase a combination of diapers and beer. The grocery store placed these items together by moving a small display of diapers into the beer aisle, and sales of both products soared."

Being a skeptic, I had many questions about this story right away. How did the store know it was men who were doing the buying? After all, this was well before the advent of electronic loyalty cards or rewards cards in grocery stores. How could a store possibly fit a proper selection of diapers in a small display case in the middle of the beer aisle? After all, diapers come in five different sizes and there are usually at least three brands, and - as I quickly learned as a new parent - it is not a good idea to substitute one size or brand for another on a whim, or you may have disastrous results.

It turns out that several other folks were suspicious as well, and some went as far as to attempt to track down the history of this urban legend. The best-researched examples include Dan Powers' DSS Resources newsletter, with its November 10, 2002 issue (Volume 3, Number 23) dedicated to finding out the true origin of this story. This fascinating piece of work is available at http://www.dssresources.com/newsletters/66.php. Later, in 2006, The Register in the UK also ran a story about this urban legend. This article is available at http://www.theregister.co.uk/2006/08/15/beer_diapers.

If you believe the details relayed in both these pieces, the diapers and beer story started as a working example of what was possible with early attempts at data mining: Use our database product and you can query for unusual patterns like diapers and beer! Somehow, that working example extended to a this really happened to me story, which then morphed into an urban legend as the facts were stretched to accommodate various details and motives of the storytellers. As told in recent years, the common variants of this story include:

  • It was Walmart that did the data mining
  • The retailer used its discovered knowledge to raise the price of beer on Thursdays
  • The motivation for buying the beer was as a reward for taking care of the children (presumably the children were the reason for buying the diapers)
  • The retailer was particularly excited about these patterns because diapers are a profitable item

The truth of this story is indeed more mundane than the legend, but its popularity as a motivating case endures. It is very likely that if you do any research into frequent itemsets or association rule mining, the diapers and beer story, showing how market basket analysis is used in the real world, will come up as a case-in-point. It is used in nearly every book, article, and presentation ever given about association rules.

Frequent itemset mining basics

For our purposes, we will use the diapers and beer story as a useful metaphor. Specifically, we can use the terminology in this story to help define the three salient pieces in so-called market basket analysis, or frequent itemset mining:

  • First, to do market basket analysis, we need a market. In the metaphor, the market is an actual supermarket.
  • Next, we need a basket. In the example, the basket is a single shopping transaction. Sometimes we will use the word "basket," sometimes you will hear "transaction" used.
  • We need items. In the metaphor, the grocery items are placed into the basket, or transaction, for purchase.

As long as we have the concepts of a market, a basket, and items, and as long as these things work the way we describe here, we are well on our way to having a dataset that we can mine for frequent itemsets.

There are a few more assumptions buried in the market basket story though, and these will affect whether or not we have a minable dataset. So let's be explicit about those now:

  • There should be a many-to-many relationship between items and baskets. Baskets are made up of many items, and an item can appear in many baskets.
  • The quantity of items is not considered. If the person bought six pack of diapers or one pack of diapers, the relevant fact is that the basket contains diapers.
  • An item may not appear in any basket (I am sure we can all think of an unpopular grocery item), but every basket will contain at least one item. Empty baskets are not interesting!
  • The order of the items in a basket does not matter. In terms of the metaphor, it does not matter whether your beer or your package of diapers was placed in the shopping basket first, nor does it matter which one was placed on the conveyer belt first, nor does it matter which one was entered into the cash register first. Instead, we will metaphorically group items together that were purchased as a single transaction or basket, regardless of their position in that basket.

At this stage in the analysis of our market baskets, we are most interested in finding frequent itemsets. These are groups of items that are found frequently together in baskets. In a grocery store, some combinations of items that people purchase together can be guessed easily using common sense, but some combinations are rare. Cake mix and frosting is a predictable set of items that will be purchased together, but beer and diapers would be a more unusual pair.

Sometimes certain combinations are more expected than others due to the weather, holidays, or regional preferences. As with any data mining exercise, it is important to understand the domain you are studying. In the case of shopping baskets, there are probably wide regional variations due to different food preferences. For example:

  • I live in the southern part of the United States, and there are many interesting combinations that we have in our stores that might seem unusual elsewhere. For example, vanilla wafer cookies and bananas are often purchased together in order to make banana pudding, a popular dessert.
  • A common meal to serve on New Year's Day in my state contains black-eyed peas (a legume) and collard greens (a leafy vegetable), so baskets containing these items may increase around the end of the year.
  • I also live in a place where snow events are fairly rare. Each time the weather forecast calls for snow for our area, everyone panics and buys up all the milk and bread in the store. Even though milk and bread are both very common items to purchase on any day, regardless of the weather, on these snow event days, we may find that the combination of milk and bread is a more common frequent itemset.

We can express these itemsets using a set notation like this:

itemset1 = {vanilla wafers, bananas, whipped cream}
itemset2 = {black eyed peas, collard greens}
itemset3 = {milk, bread}

Itemsets with two items in them are called 2-itemsets, or pairs, and itemsets with three items in them are called 3-itemsets, or triples, and so on. Sometimes pairs and triples are called doubletons and tripletons, respectively.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.63.64