The Apriori algorithm

The building blocks of the algorithm are the items that are found in any given transaction. Each transaction could have one or more items in it. The items that form a transaction are called an itemset. An example of a transaction is an invoice.

Given the transactions dataset, the objective is to find the items that are associated with each other. Association is measured as the frequency with which items occur together in the same transactions. For example, purchasing one product when another product is purchased represents an association rule. Association rules thus capture the common co-occurrence of items.

More formally, we can define association-rule mining as follows: given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn}, where each transaction ti = {Ii1, Ii2, ..., Iik} and each Iik is an element of I, an association rule is an implication X -> Y, where X and Y are itemsets (X, Y ⊆ I) and X ∩ Y = φ. In short, association rules express an implication from X to Y, where X and Y are itemsets.

The algorithm can be better understood by an example. So, let's consider the following table, which shows a representative list of sample transactions in a supermarket:

Transaction    Items
1              Milk, curd, chocolate
2              Bread, butter
3              Coke, jam
4              Bread, milk, butter, Coke
5              Bread, milk, butter, jam

Sample transactions in a supermarket

Let's try to explore some fundamental concepts that will help us understand how the Apriori algorithm works:

  • Item: An item is any individual product that is part of a transaction. For example, milk, Coke, and butter are all termed items.
  • Itemset: A collection of one or more items. For example, {butter, milk, Coke} and {butter, milk}.
  • Support count: The frequency of occurrence of an itemset, denoted σ. For example, σ({butter, bread, milk}) = 2.
  • Support: The fraction of transactions that contain an itemset. For example, s({butter, bread, milk}) = 2/5.
  • Frequent itemset: An itemset whose support is greater than or equal to the minimum support threshold.
  • Support of a rule: For a rule X -> Y, the fraction of transactions that contain both X and Y, that is, s(X -> Y) = σ(X ∪ Y)/N.

So, s for {milk, butter} -> {bread} will be s = σ({milk, butter, bread})/N = 2/5 = 0.4

  • Confidence: Measures the strength of the rule, whereas support measures how often the rule occurs in the database. Confidence computes how often the items in Y occur in transactions containing X, through the following formula: c(X -> Y) = σ(X ∪ Y)/σ(X)

For example, for {bread} -> {butter}:

c or α = σ({butter, bread})/σ({bread}) = 3/3 = 1

Let's consider another example, the confidence for {curd} -> {bread}:

c or α = σ({curd, bread})/σ({curd}) = 0/1 = 0
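These hand calculations can be reproduced with a few lines of base R. The following is a minimal sketch, assuming the five sample supermarket transactions from the earlier table; the item names and the binary incidence matrix are built purely for illustration:

# a minimal sketch reproducing the support and confidence calculations by hand;
# the transactions mirror the sample supermarket table shown earlier
items <- c("milk", "curd", "chocolate", "bread", "butter", "coke", "jam")
trans_list <- list(
  c("milk", "curd", "chocolate"),
  c("bread", "butter"),
  c("coke", "jam"),
  c("bread", "milk", "butter", "coke"),
  c("bread", "milk", "butter", "jam")
)
# binary incidence matrix: one row per transaction, one column per item
mat <- t(sapply(trans_list, function(tr) as.integer(items %in% tr)))
colnames(mat) <- items
N <- nrow(mat)
# support count of {butter, bread, milk}: rows containing all three items
sigma_bbm <- sum(rowSums(mat[, c("butter", "bread", "milk")]) == 3)
sigma_bbm / N                                                  # support = 2/5 = 0.4
# confidence of {bread} -> {butter}: sigma({bread, butter}) / sigma({bread})
sum(mat[, "bread"] & mat[, "butter"]) / sum(mat[, "bread"])    # = 3/3 = 1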

The Apriori algorithm intends to generate all possible combinations of the itemsets from the list of items and then prunes the itemsets that do not meet the predefined support and confidence parameter values passed to the algorithm. So, the Apriori algorithm may be understood as a two-step algorithm:

  1. Generating itemsets from the items
  2. Evaluating and pruning the itemsets based on predefined support and confidence

Let's discuss step 1 in detail. Assume there are n items in the collection. The number of itemsets one could create is 2^n, and all these need to be evaluated in the second step in order to come up with the final results. Even considering just 100 different items, the number of itemsets generated is 1.27e+30! The huge number of itemsets poses a severe computational challenge.
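As a quick sanity check of that figure, R confirms the count of possible itemsets (including the empty set) for 100 items:

# number of possible itemsets (including the empty set) for 100 items
2^100
# [1] 1.267651e+30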

The Apriori algorithm overcomes this challenge by preempting the itemsets that are rare or less important. The Apriori principle states that if an itemset is frequent, all of its subsets must also be frequent; conversely, if an itemset is infrequent, all of its supersets must also be infrequent. This means that if an item does not meet the predefined support threshold, it does not participate in the creation of larger itemsets. The Apriori algorithm thus comes up with a restricted number of itemsets that can be evaluated without encountering a computational challenge.

The first step of the algorithm is iterative in nature. In the first iteration, it considers all itemsets of length 1, that is, each itemset contains only one item. Each of these itemsets is evaluated, and those that do not meet the preset support threshold are eliminated. The output of the first iteration is the set of all itemsets of length 1 that meet the required support. This becomes the input for iteration 2, where itemsets of length 2 are formed using only the itemsets that survived the first iteration. Each of the itemsets formed in iteration 2 is again checked against the support threshold, and those that do not meet it are eliminated. The iterations continue until no new itemsets can be created. The process of itemset creation is illustrated in the following diagram:

Illustration showing itemset creation in the Apriori algorithm
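To make step 1 concrete, here is a minimal sketch using the arules package on the toy supermarket transactions from earlier. It mines only the frequent itemsets (step 1) with a 40% support threshold; the 40% value is an illustrative choice, not a setting from the original text:

# a minimal sketch of step 1 on the toy supermarket transactions;
# only frequent itemsets are mined here, rule generation comes in step 2
library(arules)
trans_list <- list(
  c("milk", "curd", "chocolate"),
  c("bread", "butter"),
  c("coke", "jam"),
  c("bread", "milk", "butter", "coke"),
  c("bread", "milk", "butter", "jam")
)
trans <- as(trans_list, "transactions")
# itemsets whose support falls below 40% are pruned at every level
freq_itemsets <- apriori(trans,
                         parameter = list(supp = 0.4, target = "frequent itemsets"))
inspect(sort(freq_itemsets, by = "support"))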

Once all the step 1 iterations of the algorithm are complete and we have the frequent itemsets, step 2 kicks in. Candidate rules are generated from these itemsets, and each rule is tested to check whether it meets the predefined confidence value. Rules that do not meet the threshold are eliminated from the final output.

At the stage where all iterations are complete and the final rules are output from Apriori, we make use of a metric called lift to select the relevant rules from the final output. Lift measures how much more likely one item or itemset is to be purchased, relative to its typical rate of purchase, given that we know another item or itemset has been purchased. For each rule X -> Y, we get the lift measurement using the following formula: lift(X -> Y) = c(X -> Y)/s(Y) = s(X ∪ Y)/(s(X) × s(Y))

Let's delve a little deeper into the lift metric. Assume that, in a supermarket, milk and bread are each popular items, so a large number of transactions are expected to contain both milk and bread simply by chance. A lift (milk -> bread) of more than 1 implies that these items are found together more often than they would be if they were purchased independently. We generally look for lift values greater than 1 when evaluating rules for their usefulness in business: a lift value higher than 1 indicates that the rule is strong and therefore worth considering for implementation.
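To make the metric concrete, here is a small sketch that computes the lift of the rule {milk, butter} -> {bread} using the counts from the sample supermarket transactions:

# lift of {milk, butter} -> {bread} using counts from the sample transactions
N <- 5
sigma_mbb <- 2          # transactions containing milk, butter and bread
sigma_mb  <- 2          # transactions containing milk and butter
s_bread   <- 3 / N      # support of {bread}
conf <- sigma_mbb / sigma_mb   # confidence = 1
conf / s_bread                 # lift = 1.67, well above 1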

Now, let's implement the recommendation system using the Apriori algorithm:

# load the required libraries
library(data.table)
library(arules)
library(recommenderlab)
# set the seed so that the results are replicable
set.seed(42)
# reading the Jester5k data
data(Jester5k)
class(Jester5k)

This will result in the following output:

[1] "realRatingMatrix"
attr(,"package")
[1] "recommenderlab"

We can see from the output that the Jester5k data in the recommenderlab library is in the realRatingMatrix format. We also know that the cells in this matrix contain the ratings provided by the users for various jokes, and that the ratings range between -10 and +10.
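As an optional check (not part of the original listing), we can confirm the dimensions of the matrix and the rating range directly, using recommenderlab's getRatings() function:

# optional sanity checks on the Jester5k data: 5000 users and 100 jokes,
# with ratings expected to lie in the -10 to +10 range
dim(Jester5k)
range(getRatings(Jester5k))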

Applying the Apriori algorithm to the Jester5k dataset gives us an opportunity to understand the associations between the jokes. However, prior to applying the Apriori algorithm, we need to transform the dataset to binary values, where 1 represents a positive rating and 0 represents a negative rating or no rating. The recommenderlab library provides the binarize() function, which can perform the required operation for us. The following code binarizes the ratings matrix:

# binarizing the Jester ratings
Jester5k_bin <- binarize(Jester5k, minRating=1)
# let us verify the binarized object
class(Jester5k_bin)

This will result in the following output:

[1] "binaryRatingMatrix"
attr(,"package")
[1] "recommenderlab"

We can observe from the output that the realRatingMatrix has been successfully converted into a binaryRatingMatrix. The apriori() function that mines the associations expects a matrix to be passed as input rather than a binaryRatingMatrix. We can very easily convert the Jester5k_bin object to the matrix format with the following code:

# converting the binaryratingsmatrix to matrix format
Jester5k_bin_mat <- as(Jester5k_bin,"matrix")
# visualizing the matrix object
View(Jester5k_bin_mat)

This will result in the following output:

We see from the output that all the cells of the matrix are represented as TRUE and FALSE, but Apriori expects the cells to be numeric. Let's now convert the cells into 1 and 0 for TRUE and FALSE, respectively, with the following code:

# converting the cell values to 1 and 0
Jester5k_bin_mat_num <- 1*Jester5k_bin_mat
# viewing the matrix
View(Jester5k_bin_mat_num)

This will result in the following output:

Now we are all set to apply the Apriori algorithm on the dataset. There are two parameters, support and confidence, that we need to pass to the algorithm. The algorithm mines the dataset based on these two parameter values. We pass 0.5 as the value for support and 0.8 as the value for confidence. The following line of code extracts the joke associations that exist in our Jester jokes dataset:

rules <- apriori(data = Jester5k_bin_mat_num, parameter = list(supp = 0.5, conf = 0.8))

This will result in the following output:

Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE       5     0.5      1     10  rules FALSE
Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
Absolute minimum support count: 2500
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[100 item(s), 5000 transaction(s)] done [0.02s].
sorting and recoding items ... [29 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.01s].
writing ... [78 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

The rules object that was created from the execution of the Apriori algorithm now holds all the joke associations that were extracted and mined from the dataset. As we can see from the output, 78 joke associations were extracted in total. We can examine the rules with the following line of code:

inspect(rules)

This will result in the following output:

    lhs        rhs   support confidence lift     count
[1] {j48} => {j50}   0.5068  0.8376860  1.084523 2534
[2] {j56} => {j36}   0.5036  0.8310231  1.105672 2518
[3] {j56} => {j50}   0.5246  0.8656766  1.120762 2623
[4] {j42} => {j50}   0.5150  0.8475971  1.097355 2575
[5] {j31} => {j27}   0.5196  0.8255481  1.146276 2598

The output shown is just five rules out of the 78 rules in the list. The way to read each rule is that the joke shown in the left column (lhs) leads to the joke in the right column (rhs); that is, a user who liked the joke on the lhs of the rule generally tends to like the joke shown on the rhs. For example, in the first rule, if a user has liked joke j48, it is likely that they will also like j50, and therefore it is worth recommending joke j50 to a user who has only read joke j48.

While several rules are generated by the Apriori algorithm, the strength of each rule is specified by a metric called lift, which describes the worthiness of a rule in a business context. Note that a rule with a lift less than or equal to 1 is no better than what would be expected by chance, whereas a lift value greater than 1 signifies a rule that is better suited for implementing in business. The aim of the following lines of code is to get such strong rules to the top of the list:

# converting the rules object into a dataframe
rulesdf <- as(rules, "data.frame")
# sorting the rules dataframe with lift and confidence as the keys;
# the - sign indicates that lift and confidence are sorted in descending order
rulesdf[order(-rulesdf$lift, -rulesdf$confidence), ]

This will result in the following output:

It may be observed that the output shown is only a subset of the rules output. The first rule indicates that j35 is a joke that can be recommended to a user that has already read jokes j29 and j50.

Likewise, we could write a script that looks up all the jokes a user has already read and matches them against the left-hand side of each rule; if a match is found, the joke on the corresponding right-hand side of the rule can be recommended to the user, as sketched below.
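A minimal sketch of such a script might look like the following. The recommend_jokes() helper is hypothetical (it is not part of arules or recommenderlab) and simply matches a user's liked jokes against the lhs of the rules object mined above:

# a hypothetical helper that recommends jokes by matching the jokes a user has
# liked against the lhs of the mined rules; assumes the 'rules' object from above
library(arules)
recommend_jokes <- function(rules, liked_jokes) {
  lhs_list <- as(lhs(rules), "list")
  rhs_list <- as(rhs(rules), "list")
  # a rule fires only if every joke on its lhs has been liked by the user
  fires <- sapply(lhs_list, function(x) all(x %in% liked_jokes))
  candidates <- unique(unlist(rhs_list[fires]))
  # drop jokes the user has already liked
  setdiff(candidates, liked_jokes)
}
# example: candidate recommendations for a user who liked jokes j48 and j56
recommend_jokes(rules, c("j48", "j56"))

Requiring the full lhs to be covered by the user's liked jokes is a strict but simple matching strategy; since the mined rules here are short, it works well enough for illustration.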
