Analyzing data with apriori in R

In this section, we will continue with another supermarket example and analyze associations in the Groceries dataset. In order to use this dataset and to explore association rules in R, we need to install and load the arules package:

install.packages("arules")
library(arules)
data(Groceries)
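
Before mining, you might want to get a first feel for the transactions object (an optional step, not part of the original listing); summary() reports the number of transactions, the number of items, and the most frequent items:

summary(Groceries)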

Using apriori for basic analysis

We can now explore relationships between purchased products in this dataset. This dataset is already in a form exploitable by apriori (transactions). We will first use the default parameters as follows:

rules = apriori(Groceries)

The output is provided in the following screenshot:

Running apriori on the Groceries dataset with default parameters

We can see on the first line the parameters used in the analysis (in this case, the defaults). Around the middle of the output (where the arrow is), we see that there are 169 items in 9835 transactions in this dataset, and that 0 rules have been found (see the second-to-last line). If you try this with your own data, you might find rules with the default parameters if your data contains very strong associations. Here, the confidence and support thresholds are clearly too strict. Therefore, we will try again with more relaxed minimal support and confidence values, as follows:

rules = apriori(Groceries, parameter = 
   list(support = 0.05, confidence = .1))

The output is provided on the top part of the following screenshot. We notice that five rules that satisfy both minimal support and confidence have been generated. We can examine these rules using the inspect() function (see the bottom part of the screenshot) as follows:

inspect(rules)

Let's examine the output:

Running apriori on the Groceries dataset with different support and confidence thresholds

The first column (lhs) displays the antecedents of each rule (X itemsets in our description in the first section). The second column (rhs) displays the consequents of the rules. We can see that whole milk is bought relatively frequently when yogurt, rolls/buns, and other vegetables are bought, and that other vegetables are bought frequently when milk is bought.
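
Since there are only five rules, it can help to see the strongest associations first. A small optional variant of the call above, using the sort() method from arules to order the rules by confidence before inspection:

inspect(sort(rules, by = "confidence"))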

Detailed analysis with apriori

In this section, we will examine more complex relationships using the ICU dataset, which describes the outcomes of hospitalizations in an intensive care unit (ICU). This dataset has 200 observations and 22 attributes. In order to access this dataset, we first need to install the package that contains it (vcdExtra). We will then have a look at the attributes:

install.packages("vcdExtra")
library(vcdExtra)
data(ICU)
summary(ICU)

The output is provided in the screenshot that follows. The attribute died refers to whether the patient died or not. The attributes age, sex, and race refer to the age (in years), sex (Female, Male), and race (black, white, or other) of the patient. The attribute service is the type of ICU the patient has been admitted into (medical or surgical). The attributes cancer, renal, infect, and fracture refer to the conditions the patient suffered during their stay in the ICU. The cpr attribute refers to whether or not the patient underwent cardiopulmonary resuscitation. The systolic and hrtrate attributes refer to measures of cardiac activity. The previcu and admit attributes refer to whether the patient has been in the ICU previously, and whether the admission was elective or an emergency. The attributes po2, ph, pco, bic, and creatin refer to blood measures. The attributes coma and uncons refer to whether the patient was in a coma or unconscious at any moment during the stay in the ICU.

The summary of the ICU dataset
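
If you prefer a more compact overview of the attribute types than the full summary, str() from base R provides one (an optional check, not part of the original listing):

str(ICU)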

Preparing the data

As can be seen in the preceding screenshot, the race and white attributes are somewhat redundant. We will therefore remove the race attribute (the fourth column). One can also see that there are both numerical (age, systolic, hrtrate) and categorical (died, sex, race) attributes in the dataset. We need only categorical attributes. Therefore, we will recode the numeric attributes into categorical attributes. We will do this with the cut() function on a copy of our dataset. This function simply creates bins of equal width by dividing the range between the minimal and maximal values by the number of bins. As always, domain knowledge would be useful to create more meaningful bins. The reader is advised to take some time to become familiar with the dataset by typing ?ICU:

ICU2 = ICU[-4]
ICU2$age = cut(ICU2$age, breaks = 4)
ICU2$systolic = cut(ICU2$systolic, breaks = 4)
ICU2$hrtrate = cut(ICU2$hrtrate, breaks = 4)
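
To see which intervals cut() produced and how many patients fall into each bin, you can tabulate one of the recoded attributes (a quick optional check):

table(ICU2$age)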

The dataset isn't yet in a format readily usable with apriori (transactions). We first need to coerce it to the transactions format before we can use it:

ICU_tr = as(ICU2, "transactions")
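
To verify that the coercion worked, you can summarize the resulting transactions object or plot the most frequent items (optional checks using functions from the arules package):

summary(ICU_tr)
itemFrequencyPlot(ICU_tr, topN = 20)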

Note

Using the discretize() function from the arules package allows the use of different types of binning. For instance, the following code line creates a new attribute named agerec with four bins of approximately equal frequency:

agerec = discretize(ICU$age, method="frequency", categories=4)
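
To check that the bins indeed contain roughly equal numbers of patients, you could tabulate the result (a quick optional check):

table(agerec)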

Analyzing the data

We will first perform an analysis of all associations with thresholds of .85 for support and .95 for confidence, as follows:

rules = apriori (ICU_tr, 
   parameter = list(support = .85, confidence = .95))

This leads to 43 rules. Let's have a closer look at these. Only the first 10 will be displayed in the following screenshot:

A view of association rules in the modified ICU dataset
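
The same view can be reproduced at the console by inspecting only the first ten rules (one possible way to do it):

inspect(rules[1:10])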

With high confidence and support, the absence of a cancer is associated with creatin levels lower than or equal to 2. A low arterial concentration of carbon dioxide (<= 45) is associated with a high blood concentration of oxygen (>= 60). Patients who did not have a history of renal failure did not need CPR, had a blood pH higher than or equal to 7.25, and had a creatin level lower than or equal to 2. Patients with a blood concentration of oxygen higher than or equal to 60 had a blood pH higher than or equal to 7.25 and a creatin level lower than or equal to 2. We let the reader interpret the rest of the relationships.

Interestingly, even though the confidence and support of the rules are high, the lift value is not higher than 1; that is, the rules are no better than could be expected by chance given the support of the antecedent and the consequent. Further testing might allow us to know more about this. The Fisher exact test permits the testing of statistical interdependence in 2x2 tables. Each of our rules can be represented in a 2x2 table, for instance, with the antecedent itemset in the rows (yes versus no) and the consequent in the columns (yes versus no). This test is available in the interestMeasure() function, as are other tests and measures. I am not giving too much detail about this measure; instead, I am focusing on interpreting the results. Only the significance of the test is returned here. If you need the test value, please refer to the next subsection about how to export rules to a data frame, and then use the fisher.test() function from the stats package.

Regarding the significance value returned here (also known as the p-value): when it is lower than 0.05, the antecedent and the consequent are considered related for that particular rule. If it is higher than 0.05, the result is considered non-significant, which means we cannot trust the rule. We will discover more about statistical distributions in the next chapter, but you might want to have a look now! Let's use this test to investigate the rules we generated before, rounding the results to two digits after the decimal point:

IM = interestMeasure(rules, "fishersExactTest", ICU_tr)
round(IM, digits=2)

The results are provided below, in the order of the rules:

 [1] 1    0.66 0    0    0.02 0.02 0    0    0.57 0    0
[12] 0    0.17 0    0.17 0    0    0.55 0.13 0.13 0    0
[23] 0    0    0    0    0    0    0    0    0    0    0
[34] 0    0    0    0.17 0    0    0.02 0.01 0    0.02

We can see that the first and second rules are non-significant, but most of the following rules are significant. For instance, the {cancer=No} => {creatin=<=2} rule, as well as {pco=<=45} => {po2=>60}, is significant, which means that, when the antecedent is present, the consequent is present relatively more often than absent (and conversely).
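
If you want to keep only the rules for which the test is significant, you can subset the rules object with a logical vector built from the IM values computed above (a minimal sketch):

rulesSig = rules[IM < 0.05]
inspect(rulesSig)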

We might have higher lift values when looking at what the antecedents of death in the ICU are. For this analysis, we will set the rhs parameter of the appearance argument to died=Yes. We will use lower confidence and support thresholds for this analysis as follows:

rulesDeath = apriori(ICU_tr,
   parameter = list(confidence = 0.3,support=.1),
   appearance = list(rhs = c("died=Yes"), default="lhs"))

The analysis returned 63 association rules with these confidence and support thresholds. Let's have a look. Again, we only display the first 10 association rules in the following screenshot:

View of association rules in the modified ICU dataset, with patient death as a consequence
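
As before, the first ten rules can also be inspected directly at the console (one possible way to do it):

inspect(rulesDeath[1:10])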

Note

Instead of running apriori again, it is also possible to use the subset() function to select rules. The following line of code will create an object called rulesComa containing only the rules where the consequent is coma=None, using the existing rules. In the previous and following code, using lhs instead of rhs would have selected the rules containing the given items as antecedents instead of as consequents:

rulesComa = subset(rules, subset = rhs %in% "coma=None")

We can see that 34 percent of patients who were admitted as an emergency and had an infection died in the ICU, as did 30 percent of patients who were non-white and whose ICU admission service was medical. Skipping ahead to rules 9 and 10, 31 percent of patients who did not have cancer but had an infection and were non-white died in the ICU, as did 31 percent of non-white patients who had an infection and a blood oxygen concentration of 60 or more. We will let the reader examine the rest of the results. Looking at the lift values for all the rules here, we can see that they are a bit higher than 1, suggesting that the rules are more reliable than could be expected by chance.
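
To confirm that the lift values exceed 1 without reading every rule, you can summarize the quality measures of the rule set (a quick optional check using the quality() accessor from arules):

summary(quality(rulesDeath)$lift)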

Coercing association rules to a data frame

We have seen how to generate association rules using apriori with constraints, such as minimal support and confidence, or a given consequent. Suppose we want to perform some operation on the rules, for example, sort them by decreasing lift values. The easiest way to do this is to coerce the association rules to a data frame, and then perform the operations as we usually would. In what follows, we will use the rules we generated last and coerce them to a data frame using the as() function. We will then sort the data frame by decreasing lift and display the first five lines of the sorted data frame:

rulesDeath.df = as(rulesDeath,"data.frame")
rulesDeath.df.sorted =   
   rulesDeath.df[order(rulesDeath.df$lift,decreasing = T),]
head(rulesDeath.df.sorted, 5)

The following screenshot shows the output:

Data frame displaying the five highest lift values

The output shows that the following association rules have the highest lift values, and therefore have the highest performance compared to a random model:

  • {cancer=No,infect=Yes,admit=Emergency,po2=>60,white=Non-white} => {died=Yes}
  • {infect=Yes,admit=Emergency,po2=>60,white=Non-white} => {died=Yes}
  • {cancer=No,infect=Yes,admit=Emergency,fracture=No,po2=>60} => {died=Yes}

Also note that the same result could have been obtained without coercing the association rules to a data frame, with the following code:

rulesDeath.sorted = sort(rulesDeath, by ="lift")
inspect(head(rulesDeath.sorted,5))

Visualizing association rules

As for other analyses, visualization is an important tool when examining association rules. The arulesViz package provides visualization tools for association rules. There are plenty of examples in the Visualizing Association Rules: Introduction to the R-extension Package arulesViz article by Hahsler and Chelluboina (2011). Here, we will only cover a plotting method that adds great value to the existing plotting tools in R, because of the informative graphics it provides and because of its simplicity. This method is called grouped matrix-based visualization. It uses clustering to group association rules. The reader is advised to read the mentioned paper to learn more about it. Here are a few examples using our data. Let's start by installing and loading the package:

install.packages("arulesViz"); library(arulesViz).

For this purpose, we will first create a new set of association rules for the ICU_tr object, using a minimal support of 0.5. As confidence will not be displayed on the graph, we will set its threshold to 0.95; that is, all included rules will have high confidence. We will also use the minlen = 2 and maxlen = 2 parameters in order to obtain only rules with exactly one item in the antecedent, as follows:

morerules = apriori(ICU_tr, parameter=list(confidence=.95, 
   support=.5, minlen=2,maxlen=2))
plot(morerules, method = "grouped")

The resulting graphic is shown in the following screenshot. The graphic displays antecedents in the columns and consequents in the rows. The size of the circles displays support (bigger circles mean higher support) and their color displays the lift value (a darker color means a higher lift). Looking at the graph, we can see that the {uncons=No} => {coma=None} rule has high support and a high lift value. The {coma=None} => {uncons=No} rule, of course, displays the same pattern. We can see that rules with creatin=<=2 as the consequent generally have a low lift value. Even if their support is high, these rules must be interpreted with caution, as they do not show an improvement compared to a random model. The reader is free to interpret the other rules displayed in the following screenshot:

Grouped matrix-based visualization of association rules in the ICU dataset
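
The arulesViz package offers other plotting methods as well; for instance, a graph-based view of the same rules can be obtained with the following call (one possible alternative, not shown in a screenshot here):

plot(morerules, method = "graph")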

Note

A final word regarding visualization of association rules: as you now know, it is easy to coerce association rules to a data frame. Once this is done, you can use the tools we discussed in Chapter 2, Visualizing and Manipulating Data Using R, and Chapter 3, Data Visualization with Lattice, and others to visualize the support, confidence, or lift of the rules, or perform analyses using these values.
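
For instance, here is a minimal base-R sketch that plots support against lift after coercing the rules to a data frame (the object name morerules.df is ours, chosen for illustration):

morerules.df = as(morerules, "data.frame")
plot(morerules.df$support, morerules.df$lift,
   xlab = "Support", ylab = "Lift")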
