The Apriori algorithm

The building blocks of the algorithm are the items that are found in any given transaction. Each transaction could have one or more items in it. The items that form a transaction are called an itemset. An example of a transaction is an invoice.

Given the transactions dataset, the objective is to find the items that are associated with each other. Association is measured as the frequency with which items occur together in the same transactions. For example, purchasing one product when another product is purchased represents an association rule. Association rules thus capture the common co-occurrence of items.

More formally, we can define association-rule mining as follows: given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn}, where each transaction ti = {Ii1, Ii2, ..., Iik} and each Iik is an element of I, an association rule is an implication X -> Y, where X and Y are itemsets (X, Y ⊆ I) and X ∩ Y = φ. In short, association rules express an implication from X to Y, where X and Y are itemsets.

The algorithm can be better understood by an example. So, let's consider the following table, which shows a representative list of sample transactions in a supermarket:

Transaction    Items
1              Milk, curd, chocolate
2              Bread, butter
3              Coke, jam
4              Bread, milk, butter, Coke
5              Bread, milk, butter, jam

Sample transactions in a supermarket

Let's try to explore some fundamental concepts that will help us understand how the Apriori algorithm works:

  • Item: An item is any individual product that is part of a transaction. For example, milk, Coke, and butter are all termed items.
  • Itemset: A collection of one or more items. For example, {butter, milk, Coke} and {butter, milk}.
  • Support count: The frequency of occurrence of an itemset, denoted σ. For example, σ({butter, bread, milk}) = 2.
  • Support: The fraction of transactions that contain an itemset. For example, s({butter, bread, milk}) = 2/5.
  • Frequent itemset: An itemset whose support is greater than or equal to the minimum support threshold.
  • Support of a rule: For a rule X -> Y, the fraction of transactions that contain both X and Y, that is, s(X -> Y) = σ(X ∪ Y)/N.

So, s for {milk, butter} -> {bread} will be s = σ({milk, butter, bread})/N = 2/5 = 0.4

  • Confidence: Measures the strength of the rule, whereas support measures how often the rule occurs in the database. Confidence computes how often the items in Y occur in transactions containing X, through the following formula: c(X -> Y) = σ(X ∪ Y)/σ(X)

For example, for {bread} -> {butter}:

c or α = σ({butter, bread})/σ({bread}) = 3/3 = 1

Let's consider another example, the confidence for {curd} -> {bread}:

c or α = σ({curd, bread})/σ({curd}) = 0/1 = 0
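These hand calculations can be reproduced with a few lines of base R. The following is a minimal sketch, assuming the five sample supermarket transactions from the earlier table; the item names and the binary incidence matrix are built purely for illustration:

# a minimal sketch reproducing the support and confidence calculations by hand;
# the transactions mirror the sample supermarket table shown earlier
items <- c("milk", "curd", "chocolate", "bread", "butter", "coke", "jam")
trans_list <- list(
  c("milk", "curd", "chocolate"),
  c("bread", "butter"),
  c("coke", "jam"),
  c("bread", "milk", "butter", "coke"),
  c("bread", "milk", "butter", "jam")
)
# binary incidence matrix: one row per transaction, one column per item
mat <- t(sapply(trans_list, function(tr) as.integer(items %in% tr)))
colnames(mat) <- items
N <- nrow(mat)
# support count of {butter, bread, milk}: rows containing all three items
sigma_bbm <- sum(rowSums(mat[, c("butter", "bread", "milk")]) == 3)
sigma_bbm / N                                                  # support = 2/5 = 0.4
# confidence of {bread} -> {butter}: sigma({bread, butter}) / sigma({bread})
sum(mat[, "bread"] & mat[, "butter"]) / sum(mat[, "bread"])    # = 3/3 = 1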

The Apriori algorithm intends to generate all possible combinations of the itemsets from the list of items and then prunes the itemsets that do not meet the predefined support and confidence parameter values passed to the algorithm. So, the Apriori algorithm may be understood as a two-step algorithm:

  1. Generating itemsets from the items
  2. Evaluating and pruning the itemsets based on predefined support and confidence

Let's discuss step 1 in detail. Assume there are n items in the collection. The number of itemsets one could create is 2^n, and all these need to be evaluated in the second step in order to come up with the final results. Even considering just 100 different items, the number of itemsets generated is 1.27e+30! The huge number of itemsets poses a severe computational challenge.
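As a quick sanity check of that figure, R confirms the count of possible itemsets (including the empty set) for 100 items:

# number of possible itemsets (including the empty set) for 100 items
2^100
# [1] 1.267651e+30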

The Apriori algorithm overcomes this challenge by preempting the itemsets that are rare or less important. The Apriori principle states that if an itemset is frequent, all of its subsets must also be frequent; conversely, if an itemset is infrequent, all of its supersets must also be infrequent. This means that if an item does not meet the predefined support threshold, it does not participate in the creation of larger itemsets. The Apriori algorithm thus comes up with a restricted number of itemsets that can be evaluated without encountering a computational challenge.

The first step of the algorithm is iterative in nature. In the first iteration, it considers all itemsets of length 1, that is, each itemset contains only one item. Each of these itemsets is evaluated, and those that do not meet the preset support threshold are eliminated. The output of the first iteration is the set of all itemsets of length 1 that meet the required support. This becomes the input for iteration 2, where itemsets of length 2 are formed using only the itemsets that survived the first iteration. Each of the itemsets formed in iteration 2 is again checked against the support threshold, and those that do not meet it are eliminated. The iterations continue until no new itemsets can be created. The process of itemset creation is illustrated in the following diagram:

Illustration showing itemset creation in the Apriori algorithm
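To make step 1 concrete, here is a minimal sketch using the arules package on the toy supermarket transactions from earlier. It mines only the frequent itemsets (step 1) with a 40% support threshold; the 40% value is an illustrative choice, not a setting from the original text:

# a minimal sketch of step 1 on the toy supermarket transactions;
# only frequent itemsets are mined here, rule generation comes in step 2
library(arules)
trans_list <- list(
  c("milk", "curd", "chocolate"),
  c("bread", "butter"),
  c("coke", "jam"),
  c("bread", "milk", "butter", "coke"),
  c("bread", "milk", "butter", "jam")
)
trans <- as(trans_list, "transactions")
# itemsets whose support falls below 40% are pruned at every level
freq_itemsets <- apriori(trans,
                         parameter = list(supp = 0.4, target = "frequent itemsets"))
inspect(sort(freq_itemsets, by = "support"))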

Once all the step 1 iterations of the algorithm are complete and we have the frequent itemsets, step 2 kicks in. Candidate rules are generated from these itemsets, and each rule is tested to check whether it meets the predefined confidence value. Rules that do not meet the threshold are eliminated from the final output.

At the stage where all iterations are complete and the final rules are output from Apriori, we make use of a metric called lift to select the relevant rules from the final output. Lift measures how much more likely one item or itemset is to be purchased, relative to its typical rate of purchase, given that we know another item or itemset has been purchased. For each rule X -> Y, we get the lift measurement using the following formula: lift(X -> Y) = c(X -> Y)/s(Y) = s(X ∪ Y)/(s(X) × s(Y))

Let's delve a little deeper into the lift metric. Assume that, in a supermarket, milk and bread are each popular items, so a large number of transactions are expected to contain both milk and bread simply by chance. A lift (milk -> bread) of more than 1 implies that these items are found together more often than they would be if they were purchased independently. We generally look for lift values greater than 1 when evaluating rules for their usefulness in business: a lift value higher than 1 indicates that the rule is strong and therefore worth considering for implementation.
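To make the metric concrete, here is a small sketch that computes the lift of the rule {milk, butter} -> {bread} using the counts from the sample supermarket transactions:

# lift of {milk, butter} -> {bread} using counts from the sample transactions
N <- 5
sigma_mbb <- 2          # transactions containing milk, butter and bread
sigma_mb  <- 2          # transactions containing milk and butter
s_bread   <- 3 / N      # support of {bread}
conf <- sigma_mbb / sigma_mb   # confidence = 1
conf / s_bread                 # lift = 1.67, well above 1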

Now, let's implement the recommendation system using the Apriori algorithm:

# load the required libraries
library(data.table)
library(arules)
library(recommenderlab)
# set the seed so that the results are replicable
set.seed(42)
# reading the Jester5k data
data(Jester5k)
class(Jester5k)

This will result in the following output:

[1] "realRatingMatrix"
attr(,"package")
[1] "recommenderlab"

We can see from the output that the Jester5k data in the recommenderlab library is in the realRatingMatrix format. We also know that the cells in this matrix contain the ratings provided by the users for various jokes, and that the ratings range between -10 and +10.
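As an optional check (not part of the original listing), we can confirm the dimensions of the matrix and the rating range directly, using recommenderlab's getRatings() function:

# optional sanity checks on the Jester5k data: 5000 users and 100 jokes,
# with ratings expected to lie in the -10 to +10 range
dim(Jester5k)
range(getRatings(Jester5k))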

Applying the Apriori algorithm to the Jester5k dataset gives us an opportunity to understand the associations between the jokes. However, prior to applying the Apriori algorithm, we need to transform the dataset to binary values, where 1 represents a positive rating and 0 represents a negative rating or no rating. The recommenderlab library provides the binarize() function, which can perform the required operation for us. The following code binarizes the ratings matrix:

# binarizing the Jester ratings
Jester5k_bin <- binarize(Jester5k, minRating=1)
# let us verify the binarized object
class(Jester5k_bin)

This will result in the following output:

[1] "binaryRatingMatrix"
attr(,"package")
[1] "recommenderlab"

We can observe from the output that the realRatingMatrix has been successfully converted into a binaryRatingMatrix. The apriori() function that mines the associations expects a matrix to be passed as input rather than a binaryRatingMatrix. We can very easily convert the Jester5k_bin object to the matrix format with the following code:

# converting the binaryratingsmatrix to matrix format
Jester5k_bin_mat <- as(Jester5k_bin,"matrix")
# visualizing the matrix object
View(Jester5k_bin_mat)

This will result in the following output:

We see from the output that all the cells of the matrix are represented as TRUE and FALSE, but Apriori expects the cells to be numeric. Let's now convert the cells into 1 and 0 for TRUE and FALSE, respectively, with the following code:

# converting the cell values to 1 and 0
Jester5k_bin_mat_num <- 1*Jester5k_bin_mat
# viewing the matrix
View(Jester5k_bin_mat_num)

This will result in the following output:

Now we are all set to apply the Apriori algorithm on the dataset. There are two parameters, support and confidence, that we need to pass to the algorithm. The algorithm mines the dataset based on these two parameter values. We pass 0.5 as the value for support and 0.8 as the value for confidence. The following line of code extracts the joke associations that exist in our Jester jokes dataset:

rules <- apriori(data = Jester5k_bin_mat_num, parameter = list(supp = 0.5, conf = 0.8))

This will result in the following output:

Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE       5     0.5      1     10  rules FALSE
Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
Absolute minimum support count: 2500
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[100 item(s), 5000 transaction(s)] done [0.02s].
sorting and recoding items ... [29 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.01s].
writing ... [78 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

The rules object that was created from the execution of the Apriori algorithm now holds all the joke associations that were extracted and mined from the dataset. As we can see from the output, 78 joke associations were extracted in total. We can examine the rules with the following line of code:

inspect(rules)

This will result in the following output:

    lhs        rhs   support confidence lift     count
[1] {j48} => {j50}   0.5068  0.8376860  1.084523 2534
[2] {j56} => {j36}   0.5036  0.8310231  1.105672 2518
[3] {j56} => {j50}   0.5246  0.8656766  1.120762 2623
[4] {j42} => {j50}   0.5150  0.8475971  1.097355 2575
[5] {j31} => {j27}   0.5196  0.8255481  1.146276 2598

The output shown is just five rules out of the 78 rules in the list. The way to read each rule is that the joke shown in the left column (lhs) leads to the joke in the right column (rhs); that is, a user who liked the joke on the lhs of the rule generally tends to like the joke shown on the rhs. For example, in the first rule, if a user has liked joke j48, it is likely that they will also like j50, and therefore it is worth recommending joke j50 to a user who has only read joke j48.

While several rules are generated by the Apriori algorithm, the strength of each rule is specified by a metric called lift, which describes the worthiness of a rule in a business context. Note that a rule with a lift less than or equal to 1 is no better than what would be expected by chance, whereas a lift value greater than 1 signifies a rule that is better suited for implementing in business. The aim of the following lines of code is to get such strong rules to the top of the list:

# converting the rules object into a dataframe
rulesdf <- as(rules, "data.frame")
# sorting the rules dataframe with lift and confidence as the keys;
# the - sign indicates that lift and confidence are sorted in descending order
rulesdf[order(-rulesdf$lift, -rulesdf$confidence), ]

This will result in the following output:

It may be observed that the output shown is only a subset of the rules output. The first rule indicates that j35 is a joke that can be recommended to a user that has already read jokes j29 and j50.

Likewise, we could write a script that looks up all the jokes a user has already read and matches them against the left-hand side of each rule; if a match is found, the joke on the corresponding right-hand side of the rule can be recommended to the user, as sketched below.
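A minimal sketch of such a script might look like the following. The recommend_jokes() helper is hypothetical (it is not part of arules or recommenderlab) and simply matches a user's liked jokes against the lhs of the rules object mined above:

# a hypothetical helper that recommends jokes by matching the jokes a user has
# liked against the lhs of the mined rules; assumes the 'rules' object from above
library(arules)
recommend_jokes <- function(rules, liked_jokes) {
  lhs_list <- as(lhs(rules), "list")
  rhs_list <- as(rhs(rules), "list")
  # a rule fires only if every joke on its lhs has been liked by the user
  fires <- sapply(lhs_list, function(x) all(x %in% liked_jokes))
  candidates <- unique(unlist(rhs_list[fires]))
  # drop jokes the user has already liked
  setdiff(candidates, liked_jokes)
}
# example: candidate recommendations for a user who liked jokes j48 and j56
recommend_jokes(rules, c("j48", "j56"))

Requiring the full lhs to be covered by the user's liked jokes is a strict but simple matching strategy; since the mined rules here are short, it works well enough for illustration.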
