© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Kulkarni et al., Applied Recommender Systems with Python, https://doi.org/10.1007/978-1-4842-8954-9_2

2. Market Basket Analysis (Association Rule Mining)

Akshay Kulkarni (1), Adarsha Shivananda (2), Anoosh Kulkarni (3) and V Adithya Krishnan (4)
(1) Bangalore, Karnataka, India
(2) Hosanagara tq, Shimoga dt, Karnataka, India
(3) Bangalore, India
(4) Navi Mumbai, India

Market basket analysis (MBA) is a technique used in data mining by retail companies to increase sales by better understanding customer buying patterns. It involves analyzing large datasets, such as customer purchase history, to uncover item groupings and products that are likely to be frequently purchased together.

Figure 2-1 explains MBA at a high level.

The framework of MBA explains the market basket transaction data. An example of a frequent itemset is at the bottom.

Figure 2-1

MBA explained

This chapter explores the implementation of market basket analysis with the help of an open source e-commerce dataset. You start with exploratory data analysis (EDA) of the dataset and focus on critical insights. You then learn about the implementation of various techniques in MBA, plot a graphical representation of the associations, and draw insights.

Implementation

Let’s import the required libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style
%matplotlib inline
from mlxtend.frequent_patterns import apriori,association_rules
from collections import Counter
from IPython.display import Image

Data Collection

Let’s look at an open source e-commerce dataset hosted on Kaggle. Download the dataset from www.kaggle.com/carrie1/ecommerce-data?select=data.csv.

Importing the Data as a DataFrame (pandas)

The following imports the data.
data = pd.read_csv('data.csv', encoding= 'unicode_escape')
data.shape
The following is the output.
(541909, 8)
Let’s print the top five rows of the DataFrame.
data.head()
Figure 2-2 shows the output of the first five rows.

An output sample explains the data of invoice number, stock code, description, quantity, invoice date, unit price, customer ID, and country name.

Figure 2-2

The output

Check for nulls in the data.
data.isnull().sum().sort_values(ascending=False)
The following is the output.
CustomerID     135080
Description      1454
Country             0
UnitPrice           0
InvoiceDate         0
Quantity            0
StockCode           0
InvoiceNo           0
dtype: int64

Cleaning the Data

The following drops nulls and describes the data.
data1 = data.dropna()
data1.describe()
Figure 2-3 shows the output after dropping nulls.

An output sample of quantity, unit price, and customer ID, count, mean, minimum and maximum values.

Figure 2-3

The output

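The Quantity column has some negative values, which are treated here as incorrect data, so let’s drop these entries.

The following selects only data in which the quantity is greater than 0. The .copy() call makes data1 an independent DataFrame, so that columns can be added to it later without a SettingWithCopyWarning.
data1 = data1[data1.Quantity > 0].copy()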
data1.describe()
Figure 2-4 shows the output after filtering the data in the Quantity column.

An output sample of quantity, unit price, and customer ID, count, mean, minimum and maximum values.

Figure 2-4

The output
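The cleaning steps can be tried on a small frame before running them on the full file. The following is a minimal sketch on made-up rows (the values are hypothetical and not taken from the Kaggle data); it uses .copy() so that the filtered frame can be extended safely.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in this chapter
toy = pd.DataFrame({
    'InvoiceNo': ['1001', '1001', '1002', '1003'],
    'Quantity': [6, -2, 3, 1],
    'UnitPrice': [2.5, 2.5, 0.0, 4.0],
    'CustomerID': [17850.0, 17850.0, None, 13047.0],
})

# Drop rows that contain a null, then keep positive quantities only
toy_clean = toy.dropna()
toy_clean = toy_clean[toy_clean.Quantity > 0].copy()

# toy_clean is an independent copy, so adding a column is safe
toy_clean['Amount'] = toy_clean['Quantity'] * toy_clean['UnitPrice']
print(toy_clean)
```

Only the first and last rows survive: the second has a negative quantity and the third has no customer ID.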

Insights from the Dataset

Customer Insights

This segment answers the following questions.
  • Who are my loyal customers?

  • Which customers have ordered most frequently?

  • Which customers contribute the most to my revenue?

Loyal Customers
Let’s create a new Amount feature/column, which is the product of the quantity and its unit price.
data1['Amount'] = data1['Quantity'] * data1['UnitPrice']
Now let’s use the groupby function to highlight the customers with the greatest number of orders. Note that count() counts rows, so each item line on an invoice contributes to a customer's total.
orders = data1.groupby(by=['CustomerID','Country'], as_index=False)['InvoiceNo'].count()
print('The TOP 5 loyal customers with the most number of orders...')
orders.sort_values(by='InvoiceNo', ascending=False).head()
Figure 2-5 shows the top five loyal customers.

An output sample of data from the top 5 loyal customers with the highest number of orders. The data includes customer ID, country, and invoice number.

Figure 2-5

The output
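Because count() works on rows, a customer with many items on a few invoices can rank above a customer with many small invoices. If "number of orders" should mean distinct invoices, nunique() is one alternative. The following sketch uses made-up data to show the difference; it is an illustration, not the method used for Figure 2-5.

```python
import pandas as pd

# Hypothetical line items: customer 1 has three rows spread over two invoices
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
})

# count() counts rows (invoice line items) per customer
line_items = toy.groupby('CustomerID')['InvoiceNo'].count()

# nunique() counts distinct invoices per customer
distinct_orders = toy.groupby('CustomerID')['InvoiceNo'].nunique()

print(line_items.to_dict())
print(distinct_orders.to_dict())
```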

Number of Orders per Customer

Let’s plot the orders by different customers.

Create a subplot of size 15×6.
plt.subplots(figsize=(15,6))
Use bmh for better visualization.
plt.style.use('bmh')
The x axis indicates the customer ID, and the y axis indicates the number of orders.
plt.plot(orders.CustomerID, orders.InvoiceNo)
Let’s label the x axis and the y axis.
plt.xlabel('Customers ID')
plt.ylabel('Number of Orders')
Give a suitable title to the plot.
plt.title('Number of Orders by different Customers')
plt.show()
Figure 2-6 shows the number of orders by different customers.

A graph of the number of orders by different customers. The line begins at 0, peaks around 4500, then decreases, and rises up to 7000.

Figure 2-6

The output

Let’s use the group by function again to get the customers with the highest amount spent (invoices).
money_spent = data1.groupby(by=['CustomerID','Country'], as_index=False)['Amount'].sum()
print('The TOP 5 profitable customers with the highest money spent...')
money_spent.sort_values(by='Amount', ascending=False).head()
Figure 2-7 shows the top five profitable customers.

A screenshot of the input and output of the top 5 profitable customers with the highest money spent. The data frame of the output exhibits the customer ID, country, and amount.

Figure 2-7

The output

Money Spent per Customer
Create a subplot of size 15×6.
plt.subplots(figsize=(15,6))
The x axis indicates the customer ID, and y axis indicates the amount spent.
plt.plot(money_spent.CustomerID, money_spent.Amount)
Let’s use bmh for better visualization.
plt.style.use('bmh')
The following labels the x axis and the y axis.
plt.xlabel('Customers ID')
plt.ylabel('Money spent')
Let’s give a suitable title to the plot.
plt.title('Money Spent by different Customers')
plt.show()
Figure 2-8 shows money spent by different customers.

A graph of the output of money spent by different customers. 250000 is the highest amount spent by a customer. The graph has a fluctuating trend.

Figure 2-8

The output

Patterns Based on DateTime

This segment answers questions like the following.
  • In which month is the highest number of orders placed?

  • On which day of the week is the highest number of orders placed?

  • At what time of the day is the store the busiest?

Preprocessing the Data
The following imports the DateTime library.
import datetime
The following converts InvoiceDate from an object to a DateTime format.
data1['InvoiceDate'] = pd.to_datetime(data1.InvoiceDate, format='%m/%d/%Y %H:%M')
Let’s create a new feature using the month and year.
data1.insert(loc=2, column='year_month', value=data1['InvoiceDate'].map(lambda x: 100*x.year + x.month))
Create a new feature for the month.
data1.insert(loc=3, column='month', value=data1.InvoiceDate.dt.month)
Create a new feature for the day of the week; for example, Monday=1 through Sunday=7.
data1.insert(loc=4, column='day', value=(data1.InvoiceDate.dt.dayofweek)+1)
Create a new feature for the hour.
data1.insert(loc=5, column='hour', value=data1.InvoiceDate.dt.hour)
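The same date features can also be derived with vectorized .dt accessors instead of a per-row lambda. The following sketch uses two made-up timestamps in the same text format as InvoiceDate; the column names follow the chapter, and the data is hypothetical.

```python
import pandas as pd

# Hypothetical timestamps in the same text format as InvoiceDate
toy = pd.DataFrame({'InvoiceDate': ['12/1/2010 8:26', '12/5/2010 14:05']})
toy['InvoiceDate'] = pd.to_datetime(toy['InvoiceDate'], format='%m/%d/%Y %H:%M')

# Year and month combined into one sortable integer, for example 201012
toy['year_month'] = toy['InvoiceDate'].dt.year * 100 + toy['InvoiceDate'].dt.month

# dayofweek is 0 for Monday, so add 1 to get Monday=1 through Sunday=7
toy['day'] = toy['InvoiceDate'].dt.dayofweek + 1
toy['hour'] = toy['InvoiceDate'].dt.hour

print(toy)
```

December 1, 2010 was a Wednesday (day 3) and December 5, 2010 was a Sunday (day 7).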
How Many Orders Are Placed per Month?
Use bmh style for better visualization.
plt.style.use('bmh')
Let’s use group by to extract the number of invoices per year and month.
ax = data1.groupby('InvoiceNo')['year_month'].unique().value_counts().sort_index().plot(kind='bar',figsize=(15,6))
The following labels the x axis and the y axis.
ax.set_xlabel('Month',fontsize=15)
ax.set_ylabel('Number of Orders',fontsize=15)
Let’s give a suitable title to the plot.
ax.set_title(' # orders for various months (Dec 2010 - Dec 2011)',fontsize=15)
Provide X tick labels.
ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','Jun_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11','Dec_11'), rotation='horizontal', fontsize=13)
plt.show()
Figure 2-9 shows the number of orders in different months.

A bar chart exhibits the output data of a number of orders calculated in different months. November 2011 has the highest number of orders, while December 11 has the least.

Figure 2-9

The output

How Many Orders Are Placed per Day?
Day = 6 is Saturday; there are no orders placed on Saturdays.
data1[data1['day']==6].shape[0]
Let’s use groupby to count the number of invoices by day.
ax = data1.groupby('InvoiceNo')['day'].unique().value_counts().sort_index().plot(kind='bar',figsize=(15,6))
The following labels the x axis and the y axis.
ax.set_xlabel('Day',fontsize=15)
ax.set_ylabel('Number of Orders',fontsize=15)
Let’s give a suitable title to the plot.
ax.set_title('Number of orders for different Days',fontsize=15)

Provide X tick labels.

Since no orders were placed on Saturdays, it is excluded from xticklabels.
ax.set_xticklabels(('Mon','Tue','Wed','Thur','Fri','Sun'), rotation='horizontal', fontsize=15)
plt.show()
Figure 2-10 shows the number of orders for different days.

A graph exhibits the output data of the number of orders calculated on different days. Thursday has the highest number of orders at over 4000.

Figure 2-10

The output
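Leaving Saturday out of the tick labels by hand works, but it depends on knowing in advance which days are missing. One alternative is to reindex the counts over all seven days so that a day without orders shows up as zero. The following is a sketch on made-up invoices; it counts distinct invoices per day with nunique(), which is an assumption about what "number of orders" should mean.

```python
import pandas as pd

# Hypothetical line items with the chapter's day encoding (Monday=1 ... Sunday=7)
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1', 'C1'],
    'day': [1, 1, 1, 4, 7],
})

# Distinct invoices per day; reindex adds the missing days with a count of 0
orders_per_day = (
    toy.groupby('day')['InvoiceNo'].nunique()
       .reindex(range(1, 8), fill_value=0)
)

# Replace the numeric index with readable labels for plotting
orders_per_day.index = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
print(orders_per_day)
```

Calling orders_per_day.plot(kind='bar') on the result would then show all seven days without a separate set_xticklabels call.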

How Many Orders Are Placed per Hour?
Let’s use groupby to count the number of invoices by the hour.
ax = data1.groupby('InvoiceNo')['hour'].unique().value_counts().iloc[:-1].sort_index().plot(kind='bar',figsize=(15,6))
The following labels the x axis and the y axis.
ax.set_xlabel('Hour',fontsize=15)
ax.set_ylabel('Number of Orders',fontsize=15)
Give a suitable title to the plot.
ax.set_title('Number of orders for different Hours',fontsize=15)
Provide X tick labels (all orders are placed between hours 6 and 20).
ax.set_xticklabels(range(6,21), rotation='horizontal', fontsize=15)
plt.show()
Figure 2-11 shows the number of orders for different hours.

A bar chart of the number of orders calculated versus hours. The highest number of orders is at the twelfth hour.

Figure 2-11

The output

Free Items and Sales

This segment displays how “free” items impact the number of orders. It answers how discounts and other offers impact sales.
data1.UnitPrice.describe()
The following is the output.
count    397924.000000
mean          3.116174
std          22.096788
min           0.000000
25%           1.250000
50%           1.950000
75%           3.750000
max        8142.750000
Name: UnitPrice, dtype: float64

Since the minimum unit price = 0, there are either incorrect entries or free items.

Let’s check the distribution of unit prices.
plt.subplots(figsize=(12,6))
Use the darkgrid style for better visualization.
sns.set_style('darkgrid')
Apply boxplot visualization to the unit price.
sns.boxplot(x=data1.UnitPrice)
plt.show()
Figure 2-12 shows the boxplot for unit price.

A box plot of unit price. The distribution is observed from 0 to 4200 and at 8200-unit prices.

Figure 2-12

The output

Items with UnitPrice = 0 are not outliers. These are the “free” items.

Create a new DataFrame for free items.
free_items_df = data1[data1['UnitPrice'] == 0]
free_items_df.head()
Figure 2-13 shows the filtered data output (unit price = 0).

A screenshot of the filtered output data includes invoice number, stock code, year_month, month, day, hour, description, quantity, invoice date, unit price, customer ID, and amount.

Figure 2-13

The output

Let’s count the number of free items given away by month and year.
free_items_df.year_month.value_counts().sort_index()
The following is the output.
201012     3
201101     3
201102     1
201103     2
201104     2
201105     2
201107     2
201108     6
201109     2
201110     3
201111    14
Name: year_month, dtype: int64

There is at least one free item in every month except June 2011. The output also has no entry for December 2011.

Let’s count the number of free items per year and month.
ax = free_items_df.year_month.value_counts().sort_index().plot(kind='bar',figsize=(12,6))
Let’s label the x axis and the y axis.
ax.set_xlabel('Month',fontsize=15)
ax.set_ylabel('Frequency',fontsize=15)
Give a suitable title to the plot.
ax.set_title('Frequency for different Months (Dec 2010 - Dec 2011)',fontsize=15)

Provide X tick labels.

Since there were no free items in June 2011, it is excluded.
ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11'), rotation='horizontal', fontsize=13)
plt.show()
Figure 2-14 shows the frequency for different months.

A bar chart of frequency calculated in different months. November 2011 has the highest range of frequency, while February has the least.

Figure 2-14

The output

The greatest number of free items were given out in November 2011. The greatest number of orders were also placed in November 2011.

Use bmh.
plt.style.use('bmh')
Use groupby to count the unique number of invoices by year and month.
ax = data1.groupby('InvoiceNo')['year_month'].unique().value_counts().sort_index().plot(kind='bar',figsize=(15,6))
The following labels the x axis.
ax.set_xlabel('Month',fontsize=15)
The following labels the y axis.
ax.set_ylabel('Number of Orders',fontsize=15)
Give a suitable title to the plot.
ax.set_title('# Number of orders for different Months (Dec 2010 - Dec 2011)',fontsize=15)
Provide X tick labels.
ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','Jun_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11','Dec_11'), rotation='horizontal', fontsize=13)
plt.show()
Figure 2-15 shows the number of orders for different months.

A bar chart of the number of orders calculated in different months. November 2011 has the highest number of orders, while December has the least.

Figure 2-15

The output

Compared to May, sales declined in August even though more free items were given away in August (six versus two), which suggests that the number of free items has only a slight effect on sales.

Use bmh.
plt.style.use('bmh')
Let’s use groupby to sum the amount spent per year and month.
ax = data1.groupby('year_month')['Amount'].sum().sort_index().plot(kind='bar',figsize=(15,6))
The following labels the x axis and the y axis.
ax.set_xlabel('Month',fontsize=15)
ax.set_ylabel('Amount',fontsize=15)
Give a suitable title to the plot.
ax.set_title('Revenue Generated for different Months (Dec 2010 - Dec 2011)',fontsize=15)
Provide X tick labels.
ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','Jun_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11','Dec_11'), rotation='horizontal', fontsize=13)
plt.show()
Figure 2-16 shows the output of revenue generated for different months.

A bar chart of revenue generated for different months. The highest amount is generated in November 2011.

Figure 2-16

The output

Item Insights

This segment answers questions like the following.
  • Which item was purchased by the greatest number of customers?

  • Which is the most sold item based on the sum of sales?

  • Which is the most sold item based on the count of orders?

  • What are the “first choice” items for the greatest number of invoices?

Most Sold Items Based on Quantity

Create a new pivot table that sums the quantity ordered for each item.
most_sold_items_df = data1.pivot_table(index=['StockCode','Description'], values='Quantity', aggfunc='sum').sort_values(by='Quantity', ascending=False)
most_sold_items_df.reset_index(inplace=True)
sns.set_style('white')
Let’s create a bar plot of the ten most ordered items.
sns.barplot(y='Description', x='Quantity', data=most_sold_items_df.head(10))
Give a suitable title to the plot.
plt.title('Top 10 Items based on No. of Sales', fontsize=14)
plt.ylabel('Item')
Figure 2-17 shows the output of the top ten items based on sales.

A horizontal bar chart of the top 10 items based on the number of sales. Papercraft and a little birdie have the highest quantity.

Figure 2-17

The output

Items Bought by the Highest Number of Customers

Let’s choose WHITE HANGING HEART T-LIGHT HOLDER as an example.
product_white_df = data1[data1['Description']=='WHITE HANGING HEART T-LIGHT HOLDER']
product_white_df.shape
The following is the output.
(2028, 13)
It denotes that WHITE HANGING HEART T-LIGHT HOLDER has been ordered 2028 times.
len(product_white_df.CustomerID.unique())
The following is the output.
856

This means 856 customers ordered WHITE HANGING HEART T-LIGHT HOLDER.

Create a pivot table that displays the number of unique customers who bought a particular item.
most_bought = data1.pivot_table(index=['StockCode','Description'], values='CustomerID', aggfunc=lambda x: len(x.unique())).sort_values(by='CustomerID', ascending=False)
most_bought
Figure 2-18 shows the output of unique customers who bought a particular item.

A screenshot of stock codes and descriptions.

Figure 2-18

The output

Since the WHITE HANGING HEART T-LIGHT HOLDER count matches length 856, the pivot table looks correct for all items.
most_bought.reset_index(inplace=True)
sns.set_style('white')

Create a bar plot of description (or the item) on the y axis and the number of unique customers on the x axis.

Plot only the ten most frequently purchased items.
sns.barplot(y='Description', x='CustomerID', data=most_bought.head(10))
Give a suitable title to the plot.
plt.title('Top 10 Items bought by Most no. of Customers', fontsize=14)
plt.ylabel('Item')
Figure 2-19 shows the top ten items bought by the greatest number of customers.

A horizontal bar chart of the top 10 items bought by the highest number of customers. REGENCY CAKESTAND 3 TIER is the most bought item of all.

Figure 2-19

The output

Most Frequently Ordered Items

Let’s prepare data for the word cloud.
data1['items'] = data1['Description'].str.replace(' ', '_')
Plot the word cloud by using the word cloud library.
from wordcloud import WordCloud
plt.rcParams['figure.figsize'] = (20, 20)
wordcloud = WordCloud(background_color = 'white', width = 1200,  height = 1200, max_words = 121).generate(str(data1['items']))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Frequently Bought Items',fontsize = 22)
plt.show()
Figure 2-20 shows the word cloud of frequently ordered items.

A word cloud of the words of habitually ordered items.

Figure 2-20

The output
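One detail worth checking in the word cloud listing is the input text. Calling str() on a long pandas Series returns its abbreviated printout (the first and last few rows), not every value, so the word cloud may be built from only a handful of item names. The following sketch demonstrates the difference on a made-up Series; joining the values is one possible alternative, shown here as an assumption rather than as the way Figure 2-20 was produced.

```python
import pandas as pd

# Hypothetical item names; 500 rows is enough to trigger the abbreviated display
items = pd.Series([f'ITEM_{i}' for i in range(500)])

# 60 is the pandas default for display.max_rows; it is set here only for repeatability
with pd.option_context('display.max_rows', 60):
    as_repr = str(items)      # abbreviated printout: head, '...', tail

# Joining the values keeps every item name
as_text = ' '.join(items)

print('ITEM_250' in as_repr, 'ITEM_250' in as_text)
```

With the chapter's data, the joined text would be passed as WordCloud(...).generate(' '.join(data1['items'])).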

Top Ten First Choices

Store all the invoice numbers into a list called l.
l = data1['InvoiceNo']
l = l.to_list()
The following finds the length of l.
len(l)
The following is the output.
397924
Use the set function to find unique invoice numbers only and store them in the invoices list.
invoices_list = list(set(l))
The following finds the length of the invoices (or the count of unique invoice numbers).
len(invoices_list)
The following is the output.
18536
Create an empty list.
first_choices_list = []
Loop over the list of unique invoice numbers.
for i in invoices_list:
    first_purchase_list = data1[data1['InvoiceNo']==i]['items'].reset_index(drop=True)[0]
    # Appending
    first_choices_list.append(first_purchase_list)
The following shows the first five entries of the first choices list.
first_choices_list[:5]
The following is the output.
['ROCKING_HORSE_GREEN_CHRISTMAS_',
 'POTTERING_MUG',
 'JAM_MAKING_SET_WITH_JARS',
 'TRAVEL_CARD_WALLET_PANTRY',
 'PACK_OF_12_PAISLEY_PARK_TISSUES_']
The length of the first choices matches the length of the invoices.
len(first_choices_list)
The following is the output.
18536
Use a counter to count repeating first choices.
count = Counter(first_choices_list)
Store the counter in a DataFrame.
df_first_choices = pd.DataFrame.from_dict(count, orient='index').reset_index()
Rename the columns as 'item' and 'count'.
df_first_choices.rename(columns={'index':'item', 0:'count'},inplace=True)
Sort the DataFrame based on the count.
df_first_choices.sort_values(by='count',ascending=False)
Figure 2-21 shows the output of the top ten first choices.

A screenshot of the top 10 first choices. The data of items and counts are represented.

Figure 2-21

The output
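The loop above filters the whole DataFrame once per invoice, which can be slow with 18,536 invoices. A grouped version produces the same kind of result in a single pass. The following is a sketch on made-up rows; the column names follow the chapter, and first() is assumed to be acceptable because the items column has no nulls after cleaning.

```python
import pandas as pd

# Hypothetical line items in their original row order
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1', 'B1'],
    'items': ['POTTERING_MUG', 'JAM_MAKING_SET_WITH_JARS',
              'POTTERING_MUG', 'WHITE_METAL_LANTERN', 'POTTERING_MUG'],
})

# first() returns the first non-null item listed on each invoice;
# sort=False keeps invoices in order of first appearance
first_items = toy.groupby('InvoiceNo', sort=False)['items'].first()

# Count how often each item is a first choice, most common first
first_choice_counts = first_items.value_counts()
print(first_choice_counts)
```

On the chapter's data this would replace the list, the loop, and the Counter with data1.groupby('InvoiceNo')['items'].first().value_counts().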

plt.subplots(figsize=(20,10))
sns.set_style('white')
Let’s create a bar plot.
sns.barplot(y='item', x='count', data=df_first_choices.sort_values(by='count',ascending=False).head(10))
Give a suitable title to the plot.
plt.title('Top 10 First Choices', fontsize=14)
plt.ylabel('Item')
Figure 2-22 shows the output of the top ten first choices.

A horizontal bar chart of the top 10 first choices.

Figure 2-22

The output

Frequently Bought Together (MBA)

This segment answers questions like the following.
  • Which items are frequently bought together?

  • If a user buys an item X, which item is he/she likely to buy next?

Let’s use the groupby function to create a market basket DataFrame with one row per invoice and one column per item, which specifies whether an item is present in a particular invoice.

The following sums the quantity of each item in each invoice; these quantities are converted to purchased/not-purchased flags in the next step.
market_basket = (data1.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo'))
market_basket.head(10)
Figure 2-23 shows the output of total quantity, grouped by invoice and description.

An output of the total quantity assembled by description and invoice.

Figure 2-23

The output

This output gets the quantity ordered (e.g., 48,24,126), but we just want to know if an item was purchased or not.

So, let’s encode the units as 1 (if purchased) or 0 (not purchased).
def encode_units(x):
    if x < 1:
        return 0
    if x >= 1:
        return 1
market_basket = market_basket.applymap(encode_units)
market_basket.head(10)

An output of the total quantity assembled by description and invoice.

Figure 2-24

The output
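The per-cell function can also be replaced by a single element-wise comparison, which is typically faster on a wide basket table and avoids DataFrame.applymap, which recent pandas releases deprecate in favor of DataFrame.map. The following sketch builds a tiny basket from made-up line items and encodes it; the item names and quantities are hypothetical.

```python
import pandas as pd

# Hypothetical line items: invoice A1 has two items, invoice A2 has one
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A2'],
    'Description': ['MUG', 'LANTERN', 'MUG'],
    'Quantity': [48, 2, 6],
})

# One row per invoice, one column per item, cells hold the total quantity
basket = (toy.groupby(['InvoiceNo', 'Description'])['Quantity']
             .sum().unstack().fillna(0))

# 1 if at least one unit was purchased, otherwise 0
basket_encoded = (basket >= 1).astype(int)
print(basket_encoded)
```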

Apriori Algorithm Concepts

Refer to Chapter 1 for more information.

Figure 2-25 explains the support.

An illustration of the apriori-support. Support is equal to 10 over 100, which is equal to 10 percent.

Figure 2-25

Support

Let’s look at an example. If 10 out of 100 users purchase milk, support for milk is 10/100 = 10%. The calculation formula is shown in Figure 2-26.

A set of 2 formulas for calculating movie recommendations and market basket optimization.

Figure 2-26

Formula

Suppose you are looking to build a relationship between milk and bread. If 7 out of 40 milk buyers also buy bread, then confidence = 7/40 = 17.5%

Figure 2-27 explains confidence.

An illustration explains the percentage obtained in confidence. Confidence is equal to 7 over 40, which is equal to 17.5 percent.

Figure 2-27

Confidence

The formula to calculate confidence is shown in Figure 2-28.

A set of 2 formulas for calculating movie recommendation and market basket optimization.

Figure 2-28

Formula
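Since the formulas in Figures 2-26 and 2-28 are images, they can be written out as follows. This is the standard textbook form, which is assumed to match the figures; A and B stand for items or itemsets.

```latex
\[
\mathrm{support}(A) =
  \frac{\text{number of transactions containing } A}{\text{total number of transactions}},
\qquad \text{for example } \frac{10}{100} = 10\%
\]
\[
\mathrm{confidence}(A \Rightarrow B) =
  \frac{\text{number of transactions containing both } A \text{ and } B}
       {\text{number of transactions containing } A},
\qquad \text{for example } \frac{7}{40} = 17.5\%
\]
```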

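The basic formula is lift = confidence/support, where the support in the denominator is that of the consequent (the item being recommended). A lift greater than 1 means the two items are bought together more often than would be expected if they were independent.

So here, using the 10% support figure, lift = 17.5/10 = 1.75.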

Figure 2-29 explains lift and the formula.

An illustration of the value of lift. Lift is equal to 17.5 % over 10 %, which is equal to 1.75. Two formulas of movie recommendation and market basket optimization to calculate lift.

Figure 2-29

Lift
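The three measures can be computed directly from a list of baskets. The following is a minimal sketch in plain Python on made-up transactions; the function names are illustrative and are not part of mlxtend.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Hypothetical baskets; each one is a set of item names
transactions = [
    {'milk', 'bread'},
    {'milk', 'bread', 'eggs'},
    {'milk'},
    {'bread'},
    {'eggs'},
]

print(support(transactions, {'milk'}))
print(confidence(transactions, {'milk'}, {'bread'}))
print(lift(transactions, {'milk'}, {'bread'}))
```

Here milk appears in 3 of 5 baskets, 2 of those 3 also contain bread, and bread itself appears in 3 of 5 baskets, so the lift of milk => bread is (2/3)/(3/5), or roughly 1.11.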

Association Rules

Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs across transactions. A market basket analysis is performed based on the rules created from the dataset.

Figure 2-30 explains the association rule.

An illustration of five transactions and the frequent itemset and the association rule on the right.

Figure 2-30

The output

Figure 2-30 shows that out of the five transactions in which a mobile phone was purchased, three included a mobile screen guard. Thus, it should be recommended.
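To make the idea concrete, the following sketch derives rules from five made-up transactions in which three of the five mobile phone purchases include a screen guard. It tests every candidate itemset by brute force, so it illustrates what frequent itemsets and rules are rather than the pruning that the apriori algorithm performs; the transactions and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions (not the ones drawn in Figure 2-30)
transactions = [
    {'mobile phone', 'screen guard'},
    {'mobile phone', 'screen guard', 'case'},
    {'mobile phone', 'screen guard'},
    {'mobile phone', 'case'},
    {'mobile phone'},
]

def frequent_itemsets(transactions, min_support):
    """Brute force: check every candidate itemset, with no Apriori pruning."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in transactions)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
    return result

def rules(itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) for single-item consequents."""
    found = []
    for itemset, sup in itemsets.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # Every subset of a frequent itemset is frequent, so the lookup succeeds
            conf = sup / itemsets[antecedent]
            if conf >= min_confidence:
                found.append((antecedent, consequent, conf))
    return found

itemsets = frequent_itemsets(transactions, min_support=0.5)
for antecedent, consequent, conf in rules(itemsets, min_confidence=0.6):
    print(set(antecedent), '=>', consequent, round(conf, 2))
```

With these thresholds, the rule mobile phone => screen guard has a confidence of 3/5 = 0.6.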

Implementation Using mlxtend

Let’s look at a sample item.
product_wooden_star_df = market_basket.loc[market_basket['WOODEN STAR CHRISTMAS SCANDINAVIAN']==1]

If A, then B (written A => B)

Use the apriori algorithm and create association rules for the sample item.

Apply the apriori algorithm to product_wooden_star_df.
itemsets_frequent = apriori(product_wooden_star_df, min_support=0.15, use_colnames=True)
Store the association rules into rules.
prod_wooden_star_rules = association_rules(itemsets_frequent, metric="lift", min_threshold=1)
Sort the rules on lift and support.
prod_wooden_star_rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True).head()
Figure 2-31 shows the output of apriori algorithm.

A sample output. It includes data on antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, and conviction.

Figure 2-31

The output

Creating a Function

Create a new function to pass an item name. It returns the items that are bought together frequently. In other words, it returns the items that are likely to be bought by the user because they bought the item passed into the function.
def bought_together_frequently(item):
    # df of item passed
    df_item = market_basket.loc[market_basket[item]==1]
    # Apriori algorithm
    itemsets_frequent = apriori(df_item, min_support=0.15, use_colnames=True)
    # Storing association rules
    a_rules = association_rules(itemsets_frequent, metric="lift", min_threshold=1)
    # Sorting on lift and support (sort_values returns a new DataFrame, so assign it)
    a_rules = a_rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True)
    print('Items frequently bought together with {0}'.format(item))
    # Returning top 6 items with highest lift and support
    return a_rules['consequents'].unique()[:6]
Example 1 is as follows.
bought_together_frequently('WOODEN STAR CHRISTMAS SCANDINAVIAN')
The following is the output.
Items frequently bought together with WOODEN STAR CHRISTMAS SCANDINAVIAN
array([frozenset({"PAPER CHAIN KIT 50'S CHRISTMAS "}),
       frozenset({'WOODEN HEART CHRISTMAS SCANDINAVIAN'}),
       frozenset({'WOODEN STAR CHRISTMAS SCANDINAVIAN'}),
       frozenset({'SET OF 3 WOODEN HEART DECORATIONS'}),
       frozenset({'SET OF 3 WOODEN SLEIGH DECORATIONS'}),
       frozenset({'SET OF 3 WOODEN STOCKING DECORATION'})], dtype=object)
Example 2 is as follows.
bought_together_frequently('WHITE METAL LANTERN')
The following is the output.
Items frequently bought together with WHITE METAL LANTERN
array([frozenset({'LANTERN CREAM GAZEBO '}),
       frozenset({'WHITE METAL LANTERN'}),
       frozenset({'REGENCY CAKESTAND 3 TIER'}),
       frozenset({'WHITE HANGING HEART T-LIGHT HOLDER'})], dtype=object)
Example 3 is as follows.
bought_together_frequently('JAM MAKING SET WITH JARS')
The following is the output.
Items frequently bought together with JAM MAKING SET WITH JARS
array([frozenset({'JAM MAKING SET WITH JARS'}),
       frozenset({'JAM MAKING SET PRINTED'}),
       frozenset({'PACK OF 72 RETROSPOT CAKE CASES'}),
       frozenset({'RECIPE BOX PANTRY YELLOW DESIGN'}),
       frozenset({'REGENCY CAKESTAND 3 TIER'}),
       frozenset({'SET OF 3 CAKE TINS PANTRY DESIGN '})], dtype=object)
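In each example, the queried item appears among its own recommendations, because every basket in the filtered DataFrame contains it. A small post-processing step can remove it. The following helper is hypothetical (it is not part of the chapter's function) and is applied here to the consequents copied from Example 2.

```python
def drop_self_recommendations(item, consequents):
    """Keep only the consequent sets that do not contain the queried item."""
    return [c for c in consequents if item not in c]

# Consequents copied from the output of Example 2
example = [
    frozenset({'LANTERN CREAM GAZEBO '}),
    frozenset({'WHITE METAL LANTERN'}),
    frozenset({'REGENCY CAKESTAND 3 TIER'}),
    frozenset({'WHITE HANGING HEART T-LIGHT HOLDER'}),
]

filtered = drop_self_recommendations('WHITE METAL LANTERN', example)
for consequent in filtered:
    print(set(consequent))
```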

Validation

JAM MAKING SET PRINTED is a part of invoice 536390, so let’s print all the items from this invoice and cross-check it.
data1[data1['InvoiceNo']=='536390']
Figure 2-32 shows the output of filtered data.

A validation output. It includes the year, day, hour, description, quantity, country, and amount.

Figure 2-32

 The output

There are some common items between the recommendations from the bought_together_frequently function and the invoice.

This is only a spot check on a single invoice, but it suggests that the recommendations are reasonable.

Visualization of Association Rules

Let’s try visualization techniques on the WOODEN STAR DataFrame used earlier.
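# Copy the arrays so that the jitter added below does not alter the rules DataFrame
support = prod_wooden_star_rules.support.values.copy()
confidence = prod_wooden_star_rules.confidence.values.copy()
The following adds a small random jitter to each point, so that overlapping rules remain visible, and creates a scatter plot.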
import networkx as nx
import random
import matplotlib.pyplot as plt
for i in range (len(support)):
    support[i] = support[i] + 0.0025 * (random.randint(1,10) - 5)
    confidence[i] = confidence[i] + 0.0025 * (random.randint(1,10) - 5)
# Creating a scatter plot of support v confidence
plt.scatter(support, confidence,   alpha=0.5, marker="*")
plt.xlabel('support')
plt.ylabel('confidence')
plt.show()
Figure 2-33 shows the confidence vs. support.

A scatterplot of support versus confidence. The plots are observed under 0.2, 0.5, 0.6 and 0.7.

Figure 2-33

The output
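The jitter loop can also be written as one vectorized step that returns a new array and leaves its input untouched. The following sketch uses NumPy's random generator with the same offset range as random.randint(1, 10) - 5; the seed and the sample values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, only for repeatability

def jitter(values, step=0.0025):
    """Return a jittered copy; each offset is step * k with k between -4 and 5."""
    values = np.asarray(values, dtype=float)
    # integers(1, 11) draws from 1..10, matching random.randint(1, 10)
    k = rng.integers(1, 11, size=values.shape) - 5
    return values + step * k

# Hypothetical support values
support = np.array([0.2, 0.5, 0.6])
support_jittered = jitter(support)

print(support)            # unchanged
print(support_jittered)   # each value shifted by at most 0.0125
```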

Let’s plot a graphical representation.
def graphing_wooden_star(wooden_star_rules, no_of_rules):
    Graph1 = nx.DiGraph()
    color_map=[]
    N = 50
    colors = np.random.rand(N)
    strs=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9', 'R10', 'R11']
    for i in range(no_of_rules):
        # adding as many nodes as number of rules requested by user
        Graph1.add_nodes_from(["R"+str(i)])
        # adding antecedents of rule i, with edges pointing into the rule node
        for a in wooden_star_rules.iloc[i]['antecedents']:
            Graph1.add_nodes_from([a])
            Graph1.add_edge(a, "R"+str(i), color=colors[i], weight=2)
        # adding consequents of rule i, with edges pointing out of the rule node
        for c in wooden_star_rules.iloc[i]['consequents']:
            Graph1.add_nodes_from([c])
            Graph1.add_edge("R"+str(i), c, color=colors[i], weight=2)
    for node in Graph1:
        found_a_string = False
        for item in strs:
            if node==item:
                found_a_string = True
        if found_a_string:
            color_map.append('yellow')
        else:
            color_map.append('green')
    edges = Graph1.edges()
    colors = [Graph1[u][v]['color'] for u,v in edges]
    weights = [Graph1[u][v]['weight'] for u,v in edges]
    pos = nx.spring_layout(Graph1, k=16, scale=1)
    nx.draw(Graph1, pos, edgelist=list(edges), node_color=color_map, edge_color=colors, width=weights, font_size=16, with_labels=False)
    for p in pos:  # raise text positions
        pos[p][1] += 0.07
    nx.draw_networkx_labels(Graph1, pos)
    plt.show()
Figure 2-34 shows the graphical representation.

A graphical representation of the interconnection between wooden star Christmas Scandinavian, paper chain kit 50's Christmas, R1, R2, R3, R4, and R0.

Figure 2-34

The output

def visualize_rules(item, no_of_rules):
    # df of item passed
    df_item = market_basket.loc[market_basket[item]==1]
    # Apriori algorithm
    itemsets_frequent = apriori(df_item, min_support=0.15, use_colnames=True)
    # Storing association rules
    a_rules = association_rules(itemsets_frequent, metric="lift", min_threshold=1)
    # Sorting on lift and support (sort_values returns a new DataFrame, so assign it)
    a_rules = a_rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True)
    print('Items frequently bought together with {0}'.format(item))
    # Returning top 6 items with highest lift and support
    print(a_rules['consequents'].unique()[:6])
    support = a_rules.support.values
    confidence = a_rules.confidence.values
    for i in range (len(support)):
        support[i] = support[i] + 0.0025 * (random.randint(1,10) - 5)
        confidence[i] = confidence[i] + 0.0025 * (random.randint(1,10) - 5)
    # Creating scatter plot of support v confidence
    plt.scatter(support, confidence, alpha=0.5, marker="*")
    plt.title('Support vs Confidence graph')
    plt.xlabel('support')
    plt.ylabel('confidence')
    plt.show()
    # Creating a new digraph
    Graph2 = nx.DiGraph()
    color_map=[]
    N = 50
    colors = np.random.rand(N)
    strs=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9', 'R10', 'R11']
    # adding as many nodes as number of rules requested by user
    for i in range (no_of_rules):
        Graph2.add_nodes_from(["R"+str(i)])
        # adding antecedents of rule i, with edges pointing into the rule node
        for a in a_rules.iloc[i]['antecedents']:
            Graph2.add_nodes_from([a])
            Graph2.add_edge(a, "R"+str(i), color=colors[i], weight=2)
        # adding consequents of rule i, with edges pointing out of the rule node
        for c in a_rules.iloc[i]['consequents']:
            Graph2.add_nodes_from([c])
            Graph2.add_edge("R"+str(i), c, color=colors[i], weight=2)
    for node in Graph2:
        found_a_string = False
        for item in strs:
            if node==item:
                found_a_string = True
        if found_a_string:
            color_map.append('yellow')
        else:
            color_map.append('green')
    print('Visualization of Rules:')
    edges = Graph2.edges()
    colors = [Graph2[u][v]['color'] for u,v in edges]
    weights = [Graph2[u][v]['weight'] for u,v in edges]
    pos = nx.spring_layout(Graph2, k=16, scale=1)
    nx.draw(Graph2, pos, edgelist=list(edges), node_color=color_map, edge_color=colors, width=weights, font_size=16, with_labels=False)
    for p in pos:  # raise text positions
        pos[p][1] += 0.07
    nx.draw_networkx_labels(Graph2, pos)
    plt.show()
Example 1 is as follows.
visualize_rules('WOODEN STAR CHRISTMAS SCANDINAVIAN',4)
Figure 2-35 shows items frequently bought along with WOODEN STAR CHRISTMAS SCANDINAVIAN.

A scatterplot of support versus confidence. The graph has the highest value of above 0.8.

Figure 2-35

The output

Figure 2-36 shows the visualization of rules.
[frozenset({'WOODEN HEART CHRISTMAS SCANDINAVIAN'})
 frozenset({"PAPER CHAIN KIT 50'S CHRISTMAS "})
 frozenset({'WOODEN STAR CHRISTMAS SCANDINAVIAN'})
 frozenset({'SET OF 3 WOODEN HEART DECORATIONS'})
 frozenset({'SET OF 3 WOODEN SLEIGH DECORATIONS'})
 frozenset({'SET OF 3 WOODEN STOCKING DECORATION'})]

A graphical representation of the links of R1, R0, R2 R3, wooden star Christmas Scandinavian, paper chain kit 50's Christmas and wooden heart Christmas Scandinavian.

Figure 2-36

The output

Example 2 is as follows.
visualize_rules('JAM MAKING SET WITH JARS',6)
Figure 2-37 shows the items frequently bought together with JAM MAKING SET WITH JARS.

A scatterplot of support versus confidence. The graph has the highest value above 0.8.

Figure 2-37

The output

Figure 2-38 shows the visualization of rules.
[frozenset({'JAM MAKING SET WITH JARS'})
 frozenset({'JAM MAKING SET PRINTED'})
 frozenset({'PACK OF 72 RETROSPOT CAKE CASES'})
 frozenset({'RECIPE BOX PANTRY YELLOW DESIGN'})
 frozenset({'REGENCY CAKESTAND 3 TIER'})
 frozenset({'SET OF 3 CAKE TINS PANTRY DESIGN '})]

A graphical representation of the links of R 0, 1, 2, 3, 4, 5, jam-making set printed, jam-making set with jars, recipe box pantry yellow design, and a pack of 72 retro spot cake cases.

Figure 2-38

The output

Summary

In this chapter, you learned how to build a recommendation system based on market basket analysis. You also learned how to fetch items that are frequently purchased together and offer suggestions to users. Most e-commerce sites use this method to showcase items bought together. This chapter implemented this method in Python using an e-commerce example.
