In our data mining toolbox, measuring the frequency of a pattern is a critical task. In some cases, more frequently occurring patterns may end up being more important patterns. If we can find frequently occurring pairs of items, or triples of items, those may be even more interesting.
In this chapter, we begin our exploration of frequent itemsets, and then we extend those to a type of pattern called association rules. We will cover the following topics:
To do this, we will write a program to find frequent itemsets in an open dataset of metadata (facts) about a group of software projects. Then we will learn to find frequent itemsets among the tags used to describe those projects. Next, we will learn how to extend a frequent itemset into an association rule by calculating its support in the database and then adding a probabilistic direction (X implies Y) confidence interval. Finally, we will learn how to interpret an association rule. Specifically, we want to understand what an association rule shows, and what it does not show.
Finding frequent itemsets is a type of counting activity. But unlike producing a simple tally of items we observe in a dataset (today we sold 80 carrots and 100 tomatoes), finding frequent itemsets is slightly different. Specifically, to find frequent itemsets we look for co-occurring sets of items within some larger group. These larger groups are sometimes imagined as supermarket transactions or shopping baskets, and the entire exercise is sometimes called market basket analysis. Staying with the supermarket analogy, the items co-occurring within those baskets are sometimes imagined to be combinations of products purchased at the supermarket. For example, given a set of supermarket transactions or baskets, we might be interested in whether the itemset of {carrots, tomatoes}
occurs more frequently in baskets than does the {cucumbers, lemons}
itemset.
The purpose of frequent itemset mining is to make interesting discoveries of co-occurring items within a set of transactions. In other words, it may be useful if we found out that there are some combinations of items that occur together frequently across multiple baskets. It will be especially interesting if the frequent itemsets we found are slightly unusual or unexpected in some way. The canonical example of an interesting, highly desirable rule in frequent itemset mining is usually described by retelling an urban legend called the diapers and beer story.
I remember first hearing this story in a data mining graduate course I took in 1998. My professor was trying to explain the usefulness of frequent itemsets and association rules, and he told our class the following story:
"A midwestern grocery store chain mined for frequent itemsets in order to discover what interesting combinations of groceries were purchased together. Their plan was to optimize the sales by co-locating these products in the store. To their delight, the grocery store data mining team learned that on Thursday nights between 5 and 7pm, men would frequently purchase a combination of diapers and beer. The grocery store placed these items together by moving a small display of diapers into the beer aisle, and sales of both products soared."
Being a skeptic, I had many questions about this story right away. How did the store know it was men who were doing the buying? After all, this was well before the advent of electronic loyalty cards or rewards cards in grocery stores. How could a store possibly fit a proper selection of diapers in a small display case in the middle of the beer aisle? After all, diapers come in five different sizes and there are usually at least three brands, and - as I quickly learned as a new parent - it is not a good idea to substitute one size or brand for another on a whim, or you may have disastrous results.
It turns out that several other folks were suspicious as well, and some went as far as to attempt to track down the history of this urban legend. The best-researched examples include Dan Powers' DSS Resources newsletter, with its November 10, 2002 issue (Volume 3, Number 23) dedicated to finding out the true origin of this story. This fascinating piece of work is available at http://www.dssresources.com/newsletters/66.php. Later, in 2006, The Register in the UK also ran a story about this urban legend. This article is available at http://www.theregister.co.uk/2006/08/15/beer_diapers.
If you believe the details relayed in both these pieces, the diapers and beer story started as a working example of what was possible with early attempts at data mining: Use our database product and you can query for unusual patterns like diapers and beer! Somehow, that working example extended to a this really happened to me story, which then morphed into an urban legend as the facts were stretched to accommodate various details and motives of the storytellers. As told in recent years, the common variants of this story include:
The truth of this story is indeed more mundane than the legend, but its popularity as a motivating case endures. It is very likely that if you do any research into frequent itemsets or association rule mining, the diapers and beer story, showing how market basket analysis is used in the real world, will come up as a case-in-point. It is used in nearly every book, article, and presentation ever given about association rules.
For our purposes, we will use the diapers and beer story as a useful metaphor. Specifically, we can use the terminology in this story to help define the three salient pieces in so-called market basket analysis, or frequent itemset mining:
As long as we have the concepts of a market, a basket, and items, and as long as these things work the way we describe here, we are well on our way to having a dataset that we can mine for frequent itemsets.
There are a few more assumptions buried in the market basket story though, and these will affect whether or not we have a minable dataset. So let's be explicit about those now:
At this stage in the analysis of our market baskets, we are most interested in finding frequent itemsets. These are groups of items that are found frequently together in baskets. In a grocery store, some combinations of items that people purchase together can be guessed easily using common sense, but some combinations are rare. Cake mix and frosting is a predictable set of items that will be purchased together, but beer and diapers would be a more unusual pair.
Sometimes certain combinations are more expected than others due to the weather, holidays, or regional preferences. As with any data mining exercise, it is important to understand the domain you are studying. In the case of shopping baskets, there are probably wide regional variations due to different food preferences. For example:
We can express these itemsets using a set notation like this:
itemset1 = {vanilla wafers, bananas, whipped cream} itemset2 = {black eyed peas, collard greens} itemset3 = {milk, bread}
Itemsets with two items in them are called 2-itemsets, or pairs, and itemsets with three items in them are called 3-itemsets, or triples, and so on. Sometimes pairs and triples are called doubletons and tripletons, respectively.
18.116.63.64