Modeling and evaluation

We'll start by mining the data for the overall association rules before moving on to our rules for beer specifically. Throughout the modeling process, we'll use the apriori algorithm, implemented in the appropriately named apriori() function in the arules package. The two main things that we'll need to specify in the function are the dataset and the parameters. As for the parameters, you'll need to apply judgment when determining the minimum support, the minimum confidence, and the minimum and/or maximum number of items in an itemset. Using item frequency plots, along with trial and error, let's set the minimum support at 1 in 1,000 transactions and the minimum confidence at 90%.
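To make the three measures concrete, here is a minimal base R sketch that computes support, confidence, and lift by hand. The baskets are hypothetical toy data, not taken from Groceries, and the helper name support() is invented for this illustration; it is not how arules computes the measures internally.

```r
# Hypothetical toy baskets (NOT from the Groceries data), one vector per transaction
baskets <- list(
  c("liquor", "red wine", "bottled beer"),
  c("liquor", "bottled beer"),
  c("red wine", "cheese"),
  c("bottled beer", "chips"),
  c("milk", "bread")
)

# Illustrative helper: share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "liquor"
rhs <- "bottled beer"

supp_rule <- support(c(lhs, rhs))      # share of baskets with both lhs and rhs
conf_rule <- supp_rule / support(lhs)  # share of lhs baskets that also hold rhs
lift_rule <- conf_rule / support(rhs)  # confidence relative to the baseline rate of rhs

round(c(support = supp_rule, confidence = conf_rule, lift = lift_rule), 2)
```

A lift above 1 means the right-hand item shows up more often in baskets containing the left-hand items than it does overall.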

Additionally, let's establish the maximum number of items to be associated as 4. The following code creates the object that we'll call rules:

> rules <- arules::apriori(Groceries, parameter = list(
+   supp = 0.001,
+   conf = 0.9,
+   maxlen = 4))

Calling the object shows how many rules the algorithm produced:

> rules
set of 67 rules

There are many ways to examine rules. The first thing that I recommend is setting the number of displayed digits to only two, with the options() function in base R. Then, sort and inspect the top five rules based on the lift that they provide, as follows:

> options(digits = 2)

> rules <- arules::sort(rules, by = "lift", decreasing = TRUE)

> arules::inspect(rules[1:5])
  lhs                                             rhs                support confidence lift
1 {liquor, red/blush wine}                     => {bottled beer}      0.0019       0.90 11.2
2 {root vegetables, butter, cream cheese}      => {yogurt}            0.0010       0.91  6.5
3 {citrus fruit, root vegetables, soft cheese} => {other vegetables}  0.0010       1.00  5.2
4 {pip fruit, whipped/sour cream, brown bread} => {other vegetables}  0.0011       1.00  5.2
5 {butter, whipped/sour cream, soda}           => {other vegetables}  0.0013       0.93  4.8

Lo and behold! The rule that offers the best overall lift is that purchasing liquor and red wine raises the probability of purchasing bottled beer. I have to admit that this is pure chance and not intended on my part. As I always say, it's better to be lucky than good. Still, it's not a very common transaction, with a support of only 1.9 per 1,000.
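To get a feel for how rare that is, a support value can be converted back into an approximate number of baskets. The sketch below assumes the 9,835 transactions quoted later in this section; support_to_count() is a hypothetical helper written for this illustration, and because the printed supports are rounded, the counts are approximate.

```r
n_transactions <- 9835  # number of transactions in Groceries, as quoted in this section

# Hypothetical helper: approximate basket count implied by a (rounded) support value
support_to_count <- function(support, n = n_transactions) round(support * n)

min_count  <- support_to_count(0.001)   # the minimum support threshold used above
rule_count <- support_to_count(0.0019)  # {liquor, red/blush wine} => {bottled beer}

c(min_count = min_count, rule_count = rule_count)
```

In other words, the top rule rests on roughly 19 baskets, against a threshold of roughly 10.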

You can also sort by support and confidence, so let's have a look at the first five rules with by = "confidence" in descending order, as follows:

 > rules <- arules::sort(rules, by = "confidence", decreasing = TRUE)

> arules::inspect(rules[1:5])
  lhs                                             rhs                support confidence lift
1 {citrus fruit, root vegetables, soft cheese} => {other vegetables}  0.0010          1  5.2
2 {pip fruit, whipped/sour cream, brown bread} => {other vegetables}  0.0011          1  5.2
3 {rice, sugar}                                => {whole milk}        0.0012          1  3.9
4 {canned fish, hygiene articles}              => {whole milk}        0.0011          1  3.9
5 {root vegetables, butter, rice}              => {whole milk}        0.0010          1  3.9

You can see in the table that confidence for these rules is 100%. Moving on to our specific study of beer, we can utilize a function in arules to develop cross-tabulations, the crossTable() function, and then examine whatever suits our needs. The first step is to create a table with our dataset:

 > tab <- arules::crossTable(Groceries)

With tab created, we can now investigate joint occurrences between the items. Here, we'll look at just the first three rows and columns:

 > tab[1:3, 1:3]
            frankfurter sausage liver loaf
frankfurter         580      99          7
sausage              99     924         10
liver loaf            7      10         50

As you might imagine, shoppers only selected liver loaf 50 times out of the 9,835 transactions. Additionally, of the 924 times people gravitated toward sausage, ten times they felt compelled to grab liver loaf. (Desperate times call for desperate measures!) If you want to look at a specific example, you can either specify the row and column number or spell that item out:

> tab["bottled beer","bottled beer"]
[1] 792

This tells us that there were 792 transactions of bottled beer. Let's see what the joint occurrence between bottled beer and canned beer is:

> tab["bottled beer","canned beer"]
[1] 26

I expected this to be low, and it supports my idea that people lean toward drinking beer from either a bottle or a can. I strongly prefer a bottle. It also makes a handy weapon to protect yourself from all these ruffian protesters such as Occupy Wall Street and the like.
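The counts in the cross table can be turned into supports and conditional shares by hand. The following sketch re-enters the 3 x 3 corner of the table shown earlier as a plain matrix (tab_small is a name made up for this illustration) rather than calling crossTable() again, so it runs without arules.

```r
# Counts copied from the tab[1:3, 1:3] output shown earlier
items <- c("frankfurter", "sausage", "liver loaf")
tab_small <- matrix(
  c(580,  99,  7,
     99, 924, 10,
      7,  10, 50),
  nrow = 3, byrow = TRUE, dimnames = list(items, items)
)
n_transactions <- 9835  # total transactions, as quoted in the text

# Diagonal: baskets containing each item; dividing by n gives the item's support
item_support <- diag(tab_small) / n_transactions

# Each row divided by its own diagonal entry: of the baskets holding the row item,
# the share that also hold the column item
cond_share <- tab_small / diag(tab_small)

all(tab_small == t(tab_small))                 # joint counts do not depend on order
round(cond_share["sausage", "liver loaf"], 3)  # 10 / 924
round(cond_share["liver loaf", "sausage"], 3)  # 10 / 50
```

The joint count is symmetric, but the conditional shares are not: liver loaf is rare among sausage buyers, while sausage is fairly common among the few liver loaf buyers.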

We can now move on and derive specific rules for bottled beer. We'll again use the apriori() function, but this time, we'll add the appearance argument. With it, we specify that bottled beer must be on the right-hand side of every rule, while the left-hand side holds the items that increase the probability of purchasing it. In the following code, notice that I've adjusted the support and confidence numbers. Feel free to experiment with your settings:

> beer.rules <- arules::apriori(
+   data = Groceries,
+   parameter = list(support = 0.0015, confidence = 0.3),
+   appearance = list(default = "lhs", rhs = "bottled beer"))

> beer.rules
set of 4 rules

We find ourselves with only four association rules. We've seen one of them already; now let's bring in the other three rules in descending order by lift:

 > beer.rules <- arules::sort(beer.rules, decreasing = TRUE, by = "lift")
> arules::inspect(beer.rules)
  lhs                                   rhs            support confidence lift
1 {liquor, red/blush wine}           => {bottled beer}  0.0019       0.90 11.2
2 {liquor}                           => {bottled beer}  0.0047       0.42  5.2
3 {soda, red/blush wine}             => {bottled beer}  0.0016       0.36  4.4
4 {other vegetables, red/blush wine} => {bottled beer}  0.0015       0.31
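As a sanity check, lift can be approximated from the printed output as the rule's confidence divided by the baseline support of bottled beer (792 of 9,835 transactions, both figures from this section). This is only a sketch: the confidences are the rounded values shown above, so the results can differ from the reported lift in the last digit, and the value it yields for the last rule is an estimate rather than the package's output.

```r
# Baseline rate of bottled beer: 792 of 9,835 transactions (figures quoted in the text)
beer_support <- 792 / 9835

# Confidence values as printed (rounded) in the beer.rules output
confidence <- c(
  "liquor, red/blush wine"           = 0.90,
  "liquor"                           = 0.42,
  "soda, red/blush wine"             = 0.36,
  "other vegetables, red/blush wine" = 0.31
)

# lift = confidence / support(rhs): how many times more likely than the baseline
approx_lift <- confidence / beer_support
round(approx_lift, 1)
```

Reading lift this way makes the top rule easier to interpret: a basket with liquor and red wine is roughly eleven times as likely to contain bottled beer as an average basket.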

In all of the instances, the purchase of bottled beer is associated with booze (liquor, red wine, or both), which is no surprise to anyone. What's interesting is that white wine isn't in the mix here. Let's take a closer look at this and compare the joint occurrences of bottled beer and types of wine:

    > tab["bottled beer", "red/blush wine"]
[1] 48

> tab["red/blush wine", "red/blush wine"]
[1] 189

> 48/189
[1] 0.25

> tab["white wine", "white wine"]
[1] 187

> tab["bottled beer", "white wine"]
[1] 22
> 22/187
[1] 0.12
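The same comparison can be made in one step and set against the overall rate at which bottled beer is bought. The sketch below re-enters the counts printed above as plain vectors (the variable names are made up for this illustration) and divides each wine's share by the baseline of 792 beer transactions out of 9,835.

```r
n_transactions <- 9835               # total transactions, as quoted in the text
beer  <- 792                         # transactions with bottled beer
wine  <- c(red = 189, white = 187)   # transactions with each wine type
joint <- c(red = 48,  white = 22)    # transactions with that wine and bottled beer

share_with_beer <- joint / wine                    # the 48/189 and 22/187 ratios
lift <- share_with_beer / (beer / n_transactions)  # relative to the baseline beer rate

round(rbind(share_with_beer, lift), 2)
```

On these counts, both wine types come out with a lift above 1, so white wine buyers still pick up bottled beer more often than the average shopper; the association is just much weaker than for red wine.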

It's interesting that 25% of the time when someone purchased red wine, they also purchased bottled beer, but with white wine, a joint purchase happened in only 12% of the instances. This analysis can't tell us why, but the finding could help us determine how we should position our product in this grocery store. One more thing before we move on is to look at a plot of the rules. This is done with the plot() function in the arulesViz package.

There are many graphics options available. For this example, let's specify that we want a graph showing the rules, sized by lift and shaded by confidence. The following syntax will provide this accordingly:

> library(arulesViz)
Loading required package: grid

> plot(beer.rules,
+ method = "graph",
+ measure = "lift",
+ shading = "confidence")

The following is the output of the preceding command:

This graph shows that liquor/red wine provides the best lift and the highest level of confidence, as indicated by both the size of the circle and its shading.

What we've just done in this simple exercise is to show how easy it is to conduct a market basket analysis with R. It doesn't take much imagination to see the analytical possibilities that this technique opens up, for example, incorporating customer segmentation or longitudinal purchase history, as well as applying it to advert displays, co-promotions, and so on.

