A. Kulkarni et al., Applied Recommender Systems with Python, https://doi.org/10.1007/978-1-4842-8954-9_2

2. Market Basket Analysis (Association Rule Mining)

Akshay Kulkarni¹, Adarsha Shivananda², Anoosh Kulkarni³ and V Adithya Krishnan⁴

(1) Bangalore, Karnataka, India

(2) Hosanagara tq, Shimoga dt, Karnataka, India

(3) Bangalore, India

(4) Navi Mumbai, India

Market basket analysis (MBA) is a technique used in data mining by retail companies to increase sales by better understanding customer buying patterns. It involves analyzing large datasets, such as customer purchase history, to uncover item groupings and products that are likely to be frequently purchased together.

Figure 2-1 explains MBA at a high level.

This chapter explores the implementation of market basket analysis with the help of an open source e-commerce dataset. You start with exploratory data analysis (EDA) of the dataset and focus on critical insights. You then learn about the implementation of various techniques in MBA, plot a graphical representation of the associations, and draw insights.

Implementation

Let’s import the required libraries.

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

import matplotlib.style

%matplotlib inline

from mlxtend.frequent_patterns import apriori,association_rules

from collections import Counter

from IPython.display import Image

Data Collection

Let’s look at an open source e-commerce dataset hosted on Kaggle. Download the dataset from www.kaggle.com/carrie1/ecommerce-data?select=data.csv.

Importing the Data as a DataFrame (pandas)

The following imports the data.

data = pd.read_csv('data.csv', encoding= 'unicode_escape')

data.shape

The following is the output.

(541909, 8)

Let’s print the top five rows of the DataFrame.

data.head()

Figure 2-2 shows the output of the first five rows.

Check for nulls in the data.

data.isnull().sum().sort_values(ascending=False)

The following is the output.

CustomerID 135080

Description 1454

Country 0

UnitPrice 0

InvoiceDate 0

Quantity 0

StockCode 0

InvoiceNo 0

dtype: int64

Cleaning the Data

The following drops nulls and describes the data.

data1 = data.dropna()

data1.describe()

Figure 2-3 shows the output after dropping nulls.

The Quantity column has some negative values, which are treated as incorrect data here, so let’s drop these entries.

The following keeps only the rows in which the quantity is greater than 0. The copy makes data1 an independent DataFrame, so columns can be added to it later.

data1 = data1[data1.Quantity > 0].copy()

data1.describe()

Figure 2-4 shows the output after filtering the data in the Quantity column.
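The two cleaning steps can be sketched on a tiny, made-up DataFrame. The rows and the names toy and toy_clean are illustrative assumptions, not values from data.csv. The .copy() call makes the filtered result an independent DataFrame, which is assumed here to help avoid pandas' SettingWithCopyWarning when columns are added later.

```python
import pandas as pd

# Toy rows (hypothetical values) mimicking the relevant columns
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3"],
    "Quantity": [6, -2, 3, 4],
    "CustomerID": [17850.0, 17850.0, None, 13047.0],
})

# Step 1: drop rows that contain any missing value
toy_clean = toy.dropna()

# Step 2: keep only positive quantities; copy so that later column
# assignments act on an independent DataFrame
toy_clean = toy_clean[toy_clean.Quantity > 0].copy()

print(toy_clean.shape)
```

Only the rows with a known customer and a positive quantity remain.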

Insights from the Dataset

Customer Insights

This segment answers the following questions.

Who are my loyal customers?
Which customers have ordered most frequently?
Which customers contribute the most to my revenue?

Loyal Customers

Let’s create a new Amount feature/column, which is the product of the quantity and its unit price.

data1['Amount'] = data1['Quantity'] * data1['UnitPrice']

Now let’s use the group by function to highlight the customers with the greatest number of orders.

orders = data1.groupby(by=['CustomerID','Country'], as_index=False)['InvoiceNo'].count()

print('The TOP 5 loyal customers with the most number of orders...')

orders.sort_values(by='InvoiceNo', ascending=False).head()

Figure 2-5 shows the top five loyal customers.
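Note that count() tallies rows, so an invoice with several line items is counted several times. If an order is meant as a distinct invoice, nunique() is one possible alternative. The following is a minimal sketch on made-up rows; the DataFrame toy and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: customer 1 has two invoices (A has two line items)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A", "A", "B", "C"],
})

# count() counts rows (line items) per customer
line_items = toy.groupby("CustomerID")["InvoiceNo"].count()

# nunique() counts distinct invoice numbers per customer
distinct_orders = toy.groupby("CustomerID")["InvoiceNo"].nunique()

print(line_items.to_dict())
print(distinct_orders.to_dict())
```

Which of the two is the better loyalty measure depends on whether basket size or visit frequency matters more.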

Number of Orders per Customer

Let’s plot the orders by different customers.

Create a subplot of size 15×6.

plt.subplots(figsize=(15,6))

Use bmh for better visualization.

plt.style.use('bmh')

The x axis indicates the customer ID, and the y axis indicates the number of orders.

plt.plot(orders.CustomerID, orders.InvoiceNo)

Let’s label the x axis and the y axis.

plt.xlabel('Customers ID')

plt.ylabel('Number of Orders')

Give a suitable title to the plot.

plt.title('Number of Orders by different Customers')

plt.show()

Figure 2-6 shows the number of orders by different customers.

Let’s use the group by function again to get the customers with the highest amount spent (invoices).

money_spent = data1.groupby(by=['CustomerID','Country'], as_index=False)['Amount'].sum()

print('The TOP 5 profitable customers with the highest money spent...')

money_spent.sort_values(by='Amount', ascending=False).head()

Figure 2-7 shows the top five profitable customers.

Money Spent per Customer

Create a subplot of size 15×6.

plt.subplots(figsize=(15,6))

The x axis indicates the customer ID, and y axis indicates the amount spent.

plt.plot(money_spent.CustomerID, money_spent.Amount)

Let’s use bmh for better visualization.

plt.style.use('bmh')

The following labels the x axis and the y axis.

plt.xlabel('Customers ID')

plt.ylabel('Money spent')

Let’s give a suitable title to the plot.

plt.title('Money Spent by different Customers')

plt.show()

Figure 2-8 shows money spent by different customers.

Patterns Based on DateTime

This segment answers questions like the following.

In which month is the highest number of orders placed?
On which day of the week is the highest number of orders placed?
At what time of the day is the store the busiest?

Preprocessing the Data

The following imports the datetime module.

import datetime

The following converts InvoiceDate from an object to a DateTime format.

data1['InvoiceDate'] = pd.to_datetime(data1.InvoiceDate, format='%m/%d/%Y %H:%M')

Let’s create a new feature using the month and year.

data1.insert(loc=2, column='year_month', value=data1['InvoiceDate'].map(lambda x: 100*x.year + x.month))

Create a new feature for the month.

data1.insert(loc=3, column='month', value=data1.InvoiceDate.dt.month)

Create a new feature for the day of the week, where Monday = 1 through Sunday = 7.

data1.insert(loc=4, column='day', value=(data1.InvoiceDate.dt.dayofweek)+1)

Create a new feature for the hour.

data1.insert(loc=5, column='hour', value=data1.InvoiceDate.dt.hour)
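The date features above can be checked on two made-up timestamps written in the same text format as InvoiceDate. This is only a sketch: the DataFrame toy is hypothetical, and plain column assignment is used instead of insert(), so the column positions differ from data1.

```python
import pandas as pd

# Two hypothetical timestamps in the dataset's text format
toy = pd.DataFrame({"InvoiceDate": ["12/1/2010 8:26", "12/5/2010 14:05"]})
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")

# Year and month combined into one integer, e.g. 201012
toy["year_month"] = toy["InvoiceDate"].dt.year * 100 + toy["InvoiceDate"].dt.month

# dayofweek is 0 for Monday, so add 1 to get Monday=1 ... Sunday=7
toy["day"] = toy["InvoiceDate"].dt.dayofweek + 1

# Hour of the day, 0 to 23
toy["hour"] = toy["InvoiceDate"].dt.hour

print(toy)
```

December 1, 2010 was a Wednesday (day 3) and December 5, 2010 was a Sunday (day 7).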

How Many Orders Are Placed per Month?

Use bmh style for better visualization.

plt.style.use('bmh')

Let’s use group by to extract the number of invoices per year and month.

ax = data1.groupby('InvoiceNo')['year_month'].unique().value_counts().sort_index().plot(kind='bar',figsize=(15,6))

The following labels the x axis and the y axis.

ax.set_xlabel('Month',fontsize=15)

ax.set_ylabel('Number of Orders',fontsize=15)

Let’s give a suitable title to the plot.

ax.set_title(' # orders for various months (Dec 2010 - Dec 2011)',fontsize=15)

Provide X tick labels.

ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','Jun_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11','Dec_11'), rotation='horizontal', fontsize=13)

plt.show()

Figure 2-9 shows the number of orders in different months.
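The number of orders per month can also be computed by counting distinct invoice numbers within each month. The following sketch assumes that every invoice belongs to a single month, and it uses a made-up DataFrame named toy rather than data1.

```python
import pandas as pd

# Hypothetical line items: invoices A and B in Dec 2010, C in Jan 2011
toy = pd.DataFrame({
    "InvoiceNo": ["A", "A", "B", "C"],
    "year_month": [201012, 201012, 201012, 201101],
})

# Distinct invoices per month, ordered chronologically
orders_per_month = toy.groupby("year_month")["InvoiceNo"].nunique().sort_index()

print(orders_per_month.to_dict())
```

The resulting Series could be passed to .plot(kind='bar') in the same way as the expression above.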

How Many Orders Are Placed per Day?

Day = 6 is Saturday; there are no orders placed on Saturdays.

data1[data1['day']==6].shape[0]

Let’s use groupby to count the number of invoices by day.

ax = data1.groupby('InvoiceNo')['day'].unique().value_counts().sort_index().plot(kind='bar',figsize=(15,6))

The following labels the x axis and the y axis.

ax.set_xlabel('Day',fontsize=15)

ax.set_ylabel('Number of Orders',fontsize=15)

Let’s give a suitable title to the plot.

ax.set_title('Number of orders for different Days',fontsize=15)

Provide X tick labels.

Since no orders were placed on Saturdays, it is excluded from xticklabels.

ax.set_xticklabels(('Mon','Tue','Wed','Thur','Fri','Sun'), rotation='horizontal', fontsize=15)

plt.show()

Figure 2-10 shows the number of orders for different days.

How Many Orders Are Placed per Hour?

Let’s use groupby to count the number of invoices by the hour.

ax = data1.groupby('InvoiceNo')['hour'].unique().value_counts().iloc[:-1].sort_index().plot(kind='bar',figsize=(15,6))

The following labels the x axis and the y axis.

ax.set_xlabel('Hour',fontsize=15)

ax.set_ylabel('Number of Orders',fontsize=15)

Give a suitable title to the plot.

ax.set_title('Number of orders for different Hours',fontsize=15)

Provide X tick labels (all orders are placed between hours 6 and 20).

ax.set_xticklabels(range(6,21), rotation='horizontal', fontsize=15)

plt.show()

Figure 2-11 shows the number of orders for different hours.

Free Items and Sales

This segment displays how “free” items impact the number of orders. It answers how discounts and other offers impact sales.

data1.UnitPrice.describe()

The following is the output.

count 397924.000000

mean 3.116174

std 22.096788

min 0.000000

25% 1.250000

50% 1.950000

75% 3.750000

max 8142.750000

Name: UnitPrice, dtype: float64

Since the minimum unit price = 0, there are either incorrect entries or free items.

Let’s check the distribution of unit prices.

plt.subplots(figsize=(12,6))

Use the darkgrid style for better visualization.

sns.set_style('darkgrid')

Apply boxplot visualization to the unit price.

sns.boxplot(x=data1.UnitPrice)

plt.show()

Figure 2-12 shows the boxplot for unit price.

Items with UnitPrice = 0 are not outliers. These are the “free” items.

Create a new DataFrame for free items.

free_items_df = data1[data1['UnitPrice'] == 0]

free_items_df.head()

Figure 2-13 shows the filtered data output (unit price = 0).

Let’s count the number of free items given away by month and year.

free_items_df.year_month.value_counts().sort_index()

The following is the output.

201012 3

201101 3

201102 1

201103 2

201104 2

201105 2

201107 2

201108 6

201109 2

201110 3

201111 14

Name: year_month, dtype: int64

There is at least one free item in every month except June 2011 and December 2011, neither of which appears in the output.
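One way to check which months have no free items is to reindex the counts against the full list of months. In the following sketch, the counts are copied from the output above, while the names observed, all_months, and missing are illustrative.

```python
import pandas as pd

# Free-item counts per year_month, copied from the output above
observed = pd.Series(
    [3, 3, 1, 2, 2, 2, 2, 6, 2, 3, 14],
    index=[201012, 201101, 201102, 201103, 201104, 201105,
           201107, 201108, 201109, 201110, 201111],
)

# Every month from Dec 2010 to Dec 2011, encoded as year*100 + month
all_months = [p.year * 100 + p.month
              for p in pd.period_range("2010-12", "2011-12", freq="M")]

# Months absent from the counts get an explicit 0
full = observed.reindex(all_months, fill_value=0)
missing = full[full == 0].index.tolist()

print(missing)
```

Plotting full instead of observed would keep a zero-height bar for each missing month, so the tick labels would not need to be adjusted by hand.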

Let’s count the number of free items per year and month.

ax = free_items_df.year_month.value_counts().sort_index().plot(kind='bar',figsize=(12,6))

Let’s label the x axis and the y axis.

ax.set_xlabel('Month',fontsize=15)

ax.set_ylabel('Frequency',fontsize=15)

Give a suitable title to the plot.

ax.set_title('Frequency for different Months (Dec 2010 - Dec 2011)',fontsize=15)

Provide X tick labels.

Since there were no free items in June 2011 or December 2011, both months are excluded.

ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11'), rotation='horizontal', fontsize=13)

plt.show()

Figure 2-14 shows the frequency for different months.

The greatest number of free items were given out in November 2011. The greatest number of orders were also placed in November 2011.

Use bmh.

plt.style.use('bmh')

Use groupby to count the unique number of invoices by year and month.

ax = data1.groupby('InvoiceNo')['year_month'].unique().value_counts().sort_index().plot(kind='bar',figsize=(15,6))

The following labels the x axis.

ax.set_xlabel('Month',fontsize=15)

The following labels the y axis.

ax.set_ylabel('Number of Orders',fontsize=15)

Give a suitable title to the plot.

ax.set_title('# Number of orders for different Months (Dec 2010 - Dec 2011)',fontsize=15)

Provide X tick labels.

ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','Jun_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11','Dec_11'), rotation='horizontal', fontsize=13)

plt.show()

Figure 2-15 shows the number of orders for different months.

Compared to May, the number of orders in August declined even though more free items were given away in August, which suggests that the number of free items has only a slight effect on orders.

Use bmh.

plt.style.use('bmh')

Let’s use groupby to sum the amount spent per year and month.

ax = data1.groupby('year_month')['Amount'].sum().sort_index().plot(kind='bar',figsize=(15,6))

The following labels the x axis and the y axis.

ax.set_xlabel('Month',fontsize=15)

ax.set_ylabel('Amount',fontsize=15)

Give a suitable title to the plot.

ax.set_title('Revenue Generated for different Months (Dec 2010 - Dec 2011)',fontsize=15)

Provide X tick labels.

ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','Jun_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11','Dec_11'), rotation='horizontal', fontsize=13)

plt.show()

Figure 2-16 shows the output of revenue generated for different months.

Item Insights

This segment answers questions like the following.

Which item was purchased by the greatest number of customers?
Which is the most sold item based on the sum of sales?
Which is the most sold item based on the count of orders?
What are the “first choice” items for the greatest number of invoices?

Most Sold Items Based on Quantity

Create a new pivot table that sums the quantity ordered for each item.

most_sold_items_df = data1.pivot_table(index=['StockCode','Description'], values='Quantity', aggfunc='sum').sort_values(by='Quantity', ascending=False)

most_sold_items_df.reset_index(inplace=True)

sns.set_style('white')

Let’s create a bar plot of the ten most ordered items.

sns.barplot(y='Description', x='Quantity', data=most_sold_items_df.head(10))

Give a suitable title to the plot.

plt.title('Top 10 Items based on No. of Sales', fontsize=14)

plt.ylabel('Item')

Figure 2-17 shows the output of the top ten items based on sales.

Items Bought by the Highest Number of Customers

Let’s choose WHITE HANGING HEART T-LIGHT HOLDER as an example.

product_white_df = data1[data1['Description']=='WHITE HANGING HEART T-LIGHT HOLDER']

product_white_df.shape

The following is the output.

(2028, 13)

It denotes that WHITE HANGING HEART T-LIGHT HOLDER has been ordered 2028 times.

len(product_white_df.CustomerID.unique())

The following is the output.

856

This means 856 customers ordered WHITE HANGING HEART T-LIGHT HOLDER.

Create a pivot table that displays the number of unique customers who bought a particular item.

most_bought = data1.pivot_table(index=['StockCode','Description'], values='CustomerID', aggfunc=lambda x: len(x.unique())).sort_values(by='CustomerID', ascending=False)

most_bought

Figure 2-18 shows the output of unique customers who bought a particular item.

Since the WHITE HANGING HEART T-LIGHT HOLDER count matches length 856, the pivot table looks correct for all items.

most_bought.reset_index(inplace=True)

sns.set_style('white')

Create a bar plot of description (or the item) on the y axis and the number of unique customers on the x axis.

Plot only the ten most frequently purchased items.

sns.barplot(y='Description', x='CustomerID', data=most_bought.head(10))

Give a suitable title to the plot.

plt.title('Top 10 Items bought by Most no. of Customers', fontsize=14)

plt.ylabel('Item')

Figure 2-19 shows the top ten items bought by the greatest number of customers.

Most Frequently Ordered Items

Let’s prepare data for the word cloud.

data1['items'] = data1['Description'].str.replace(' ', '_')

Plot the word cloud by using the word cloud library.

from wordcloud import WordCloud

plt.rcParams['figure.figsize'] = (20, 20)

wordcloud = WordCloud(background_color = 'white', width = 1200, height = 1200, max_words = 121).generate(str(data1['items']))

plt.imshow(wordcloud)

plt.axis('off')

plt.title('Most Frequently Bought Items',fontsize = 22)

plt.show()

Figure 2-20 shows the word cloud of frequently ordered items.

Top Ten First Choices

Store all the invoice numbers into a list called l.

l = data1['InvoiceNo']

l = l.to_list()

The following finds the length of l.

len(l)

The following is the output.

397924

Use the set function to find unique invoice numbers only and store them in the invoices list.

invoices_list = list(set(l))

The following finds the length of the invoices (or the count of unique invoice numbers).

len(invoices_list)

The following is the output.

18536

Create an empty list.

first_choices_list = []

Loop into a list of unique invoice numbers.

for i in invoices_list:

    first_purchase_list = data1[data1['InvoiceNo']==i]['items'].reset_index(drop=True)[0]

    # Appending

    first_choices_list.append(first_purchase_list)

The following displays the first five entries of the first choices list.

first_choices_list[:5]

The following is the output.

['ROCKING_HORSE_GREEN_CHRISTMAS_',

'POTTERING_MUG',

'JAM_MAKING_SET_WITH_JARS',

'TRAVEL_CARD_WALLET_PANTRY',

'PACK_OF_12_PAISLEY_PARK_TISSUES_']

The length of the first choices matches the length of the invoices.

len(first_choices_list)

The following is the output.

18536
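The loop above filters the whole DataFrame once per invoice, which can be slow for thousands of invoices. A possible alternative, sketched here on made-up rows, takes the first item of each invoice with groupby(...).first() and counts the results with value_counts(). The DataFrame toy and the item names are hypothetical. Note that first() skips missing values, which is assumed to be harmless because nulls were dropped earlier.

```python
import pandas as pd

# Hypothetical line items, in the order they appear on each invoice
toy = pd.DataFrame({
    "InvoiceNo": ["A", "A", "B", "C", "C"],
    "items": ["MUG", "PLATE", "MUG", "LAMP", "MUG"],
})

# First listed item of every invoice, in a single pass
first_choices = toy.groupby("InvoiceNo", sort=False)["items"].first()

# How often each item is the first choice
first_choice_counts = first_choices.value_counts()

print(first_choice_counts.to_dict())
```

The result plays the role of the Counter built in the next step, already sorted in descending order of count.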

Use a counter to count repeating first choices.

count = Counter(first_choices_list)

Store the counter in a DataFrame.

df_first_choices = pd.DataFrame.from_dict(count, orient='index').reset_index()

Rename the columns as 'item' and 'count'.

df_first_choices.rename(columns={'index':'item', 0:'count'},inplace=True)

Sort the DataFrame based on the count.

df_first_choices.sort_values(by='count',ascending=False)

Figure 2-21 shows the output of the top ten first choices.

plt.subplots(figsize=(20,10))

sns.set_style('white')

Let’s create a bar plot.

sns.barplot(y='item', x='count', data=df_first_choices.sort_values(by='count',ascending=False).head(10))

Give a suitable title to the plot.

plt.title('Top 10 First Choices', fontsize=14)

plt.ylabel('Item')

Figure 2-22 shows the output of the top ten first choices.

Frequently Bought Together (MBA)

This segment answers questions like the following.

Which items are frequently bought together?
If a user buys an item X, which item are they likely to buy next?

Let’s use the groupby function to create a market basket DataFrame, which specifies whether an item is present in a particular invoice, for all items and all invoices.

The following sums the quantity of each item per invoice; these quantities still need to be converted to presence flags.

market_basket = (data1.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo'))

market_basket.head(10)

Figure 2-23 shows the output of total quantity, grouped by invoice and description.

This output shows the quantity ordered (e.g., 48, 24, 126), but we just want to know whether an item was purchased.

So, let’s encode the units as 1 (if purchased) or 0 (not purchased).

def encode_units(x):

    if x < 1:

        return 0

    if x >= 1:

        return 1

market_basket = market_basket.applymap(encode_units)

market_basket.head(10)
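The same 0/1 encoding can be expressed as a single vectorized comparison, which avoids calling a Python function for every cell. The following is a sketch on a made-up basket; toy_basket and its values are assumptions, not rows from the real market_basket.

```python
import pandas as pd

# Hypothetical basket: quantities per invoice (rows) and item (columns)
toy_basket = pd.DataFrame(
    {"MUG": [48.0, 0.0], "LAMP": [0.0, 24.0], "PLATE": [126.0, 1.0]},
    index=pd.Index(["A", "B"], name="InvoiceNo"),
)

# True where at least one unit was bought, then cast to 1/0
encoded = (toy_basket >= 1).astype(int)

print(encoded)
```

Every quantity of 1 or more becomes 1 and everything else becomes 0, matching encode_units for the non-missing values produced by fillna(0).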

Apriori Algorithm Concepts

Refer to Chapter 1 for more information.

Figure 2-25 explains the support.

Let’s look at an example. If 10 out of 100 users purchase milk, support for milk is 10/100 = 10%. The calculation formula is shown in Figure 2-26.

Suppose you want to measure the relationship between milk and bread. If 7 out of 40 milk buyers also buy bread, then confidence(milk → bread) = 7/40 = 17.5%.

Figure 2-27 explains confidence.

The formula to calculate confidence is shown in Figure 2-28.

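The basic formula is lift(A → B) = confidence(A → B)/support(B), where B is the consequent.

So here, taking the support for bread to be 10% for illustration, lift = 17.5/10 = 1.75. A lift above 1 suggests that the two items are bought together more often than would be expected if they were purchased independently.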

Figure 2-29 explains lift and the formula.
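The three measures can be computed by hand on a handful of transactions. The following is a minimal sketch in plain Python; the transactions and the helper names support, confidence, and lift are made up for illustration and are not part of mlxtend.

```python
# Five hypothetical transactions
transactions = [
    {"milk", "bread"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of both itemsets together divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"milk"}))
print(confidence({"milk"}, {"bread"}))
print(lift({"milk"}, {"bread"}))
```

Here milk appears in 3 of 5 transactions, and 2 of those 3 also contain bread, so the confidence of milk → bread is 2/3. Dividing by the support of bread (3/5) gives a lift slightly above 1.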

Association Rules

Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an item set occurs in a transaction. A market basket analysis is performed based on the rules created from the dataset.

Figure 2-30 explains the association rule.

Figure 2-30 shows that out of the five transactions in which a mobile phone was purchased, three included a mobile screen guard. Thus, it should be recommended.

Implementation Using mlxtend

Let’s look at a sample item.

product_wooden_star_df = market_basket.loc[market_basket['WOODEN STAR CHRISTMAS SCANDINAVIAN']==1]

If A => then B

Use the apriori algorithm and create association rules for the sample item.

Apply the apriori algorithm to product_wooden_star_df.

itemsets_frequent = apriori(product_wooden_star_df, min_support=0.15, use_colnames=True)

Store the association rules into rules.

prod_wooden_star_rules = association_rules(itemsets_frequent, metric="lift", min_threshold=1)

Sort the rules on lift and support.

prod_wooden_star_rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True).head()

Figure 2-31 shows the output of apriori algorithm.

Creating a Function

Create a new function to pass an item name. It returns the items that are bought together frequently. In other words, it returns the items that are likely to be bought by the user because they bought the item passed into the function.

def bought_together_frequently(item):

    # df of item passed

    df_item = market_basket.loc[market_basket[item]==1]

    # Apriori algorithm

    itemsets_frequent = apriori(df_item, min_support=0.15, use_colnames=True)

    # Storing association rules

    a_rules = association_rules(itemsets_frequent, metric="lift", min_threshold=1)

    # Sorting on lift and support (assigned back so the order is kept)

    a_rules = a_rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True)

    print('Items frequently bought together with {0}'.format(item))

    # Returning up to six unique consequents, highest lift and support first

    return a_rules['consequents'].unique()[:6]

Example 1 is as follows.

bought_together_frequently('WOODEN STAR CHRISTMAS SCANDINAVIAN')

The following is the output.

Items frequently bought together with WOODEN STAR CHRISTMAS SCANDINAVIAN

array([frozenset({"PAPER CHAIN KIT 50'S CHRISTMAS "}),

frozenset({'WOODEN HEART CHRISTMAS SCANDINAVIAN'}),

frozenset({'WOODEN STAR CHRISTMAS SCANDINAVIAN'}),

frozenset({'SET OF 3 WOODEN HEART DECORATIONS'}),

frozenset({'SET OF 3 WOODEN SLEIGH DECORATIONS'}),

frozenset({'SET OF 3 WOODEN STOCKING DECORATION'})], dtype=object)

Example 2 is as follows.

bought_together_frequently('WHITE METAL LANTERN')

The following is the output.

Items frequently bought together with WHITE METAL LANTERN

array([frozenset({'LANTERN CREAM GAZEBO '}),

frozenset({'WHITE METAL LANTERN'}),

frozenset({'REGENCY CAKESTAND 3 TIER'}),

frozenset({'WHITE HANGING HEART T-LIGHT HOLDER'})], dtype=object)

Example 3 is as follows.

bought_together_frequently('JAM MAKING SET WITH JARS')

The following is the output.

Items frequently bought together with JAM MAKING SET WITH JARS

array([frozenset({'JAM MAKING SET WITH JARS'}),

frozenset({'JAM MAKING SET PRINTED'}),

frozenset({'PACK OF 72 RETROSPOT CAKE CASES'}),

frozenset({'RECIPE BOX PANTRY YELLOW DESIGN'}),

frozenset({'REGENCY CAKESTAND 3 TIER'}),

frozenset({'SET OF 3 CAKE TINS PANTRY DESIGN '})], dtype=object)
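In each example, the queried item itself appears among the returned consequents. This is likely because the item's column is still present in the filtered basket, where it is purchased in every row. A small post-processing step can remove it. The helper drop_queried_item below is hypothetical, and the sample list reuses the output of Example 2.

```python
def drop_queried_item(consequents, item):
    """Return the consequent itemsets that do not contain the queried item."""
    return [c for c in consequents if item not in c]

# Consequents copied from the WHITE METAL LANTERN example
recs = [
    frozenset({"LANTERN CREAM GAZEBO "}),
    frozenset({"WHITE METAL LANTERN"}),
    frozenset({"REGENCY CAKESTAND 3 TIER"}),
    frozenset({"WHITE HANGING HEART T-LIGHT HOLDER"}),
]

filtered = drop_queried_item(recs, "WHITE METAL LANTERN")

print(filtered)
```

Three recommendations remain, none of which is the item the user already has in the basket.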

Validation

JAM MAKING SET PRINTED is a part of invoice 536390, so let’s print all the items from this invoice and cross-check it.

data1[data1['InvoiceNo']=='536390']

Figure 2-32 shows the output of filtered data.

There are some common items between the recommendations from the bought_together_frequently function and the invoice.

This suggests that the recommendations are plausible, although a single invoice is only a spot check.
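The cross-check can also be done programmatically by intersecting the recommended items with the items on an invoice. In the following sketch, the helper overlap is hypothetical, the recommendations are copied from Example 3, and the invoice contents are made up, except for JAM MAKING SET PRINTED, which the text states is part of invoice 536390.

```python
def overlap(recommended, invoice_items):
    """Items that appear both in the recommendations and on the invoice."""
    flat = set().union(*recommended)   # flatten the frozensets into one set
    return flat & set(invoice_items)

# Consequents copied from the JAM MAKING SET WITH JARS example
recommended = [
    frozenset({"JAM MAKING SET WITH JARS"}),
    frozenset({"JAM MAKING SET PRINTED"}),
    frozenset({"PACK OF 72 RETROSPOT CAKE CASES"}),
]

# Hypothetical invoice contents; only the first entry comes from the text
invoice_items = ["JAM MAKING SET PRINTED", "HYPOTHETICAL OTHER ITEM"]

common = overlap(recommended, invoice_items)

print(common)
```

Repeating this over many invoices, rather than one, would give a rough hit rate for the recommendations.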

Visualization of Association Rules

Let’s try visualization techniques on the WOODEN STAR DataFrame used earlier.

support=prod_wooden_star_rules.support.values

confidence=prod_wooden_star_rules.confidence.values

The following creates a scatter plot.

import networkx as nx

import random

import matplotlib.pyplot as plt

for i in range (len(support)):

    support[i] = support[i] + 0.0025 * (random.randint(1,10) - 5)

    confidence[i] = confidence[i] + 0.0025 * (random.randint(1,10) - 5)

# Creating a scatter plot of support v confidence

plt.scatter(support, confidence, alpha=0.5, marker="*")

plt.xlabel('support')

plt.ylabel('confidence')

plt.show()

Figure 2-33 shows the confidence vs. support.
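The loop above adds a small random offset (jitter) to each point so that rules with identical support and confidence do not hide one another in the scatter plot. A variant that returns a jittered copy, leaving the original values untouched, could look like the following. The function jitter and its parameters are assumptions for illustration.

```python
import random

def jitter(values, scale=0.0025, seed=None):
    """Return a jittered copy of values; the input is left unchanged."""
    rng = random.Random(seed)
    # randint(1, 10) - 5 lies in [-4, 5], so offsets lie in [-0.01, 0.0125]
    return [v + scale * (rng.randint(1, 10) - 5) for v in values]

support = [0.15, 0.15, 0.20]
jittered_support = jitter(support, seed=0)

print(support)
print(jittered_support)
```

Because the offsets range from -4 to 5 steps, the jitter is slightly biased toward positive values. The offsets are small enough that this should not matter for a visual check.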

Let’s plot a graphical representation.

def graphing_wooden_star(wooden_star_rules, no_of_rules):

    Graph1 = nx.DiGraph()

    color_map=[]

    N = 50

    colors = np.random.rand(N)

    strs=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9', 'R10', 'R11']

    for i in range (no_of_rules):

        # adding as many nodes as number of rules requested by user

        Graph1.add_nodes_from(["R"+str(i)])

        # adding antecedents to the nodes

        for a in wooden_star_rules.iloc[i]['antecedents']:

            Graph1.add_nodes_from([a])

            Graph1.add_edge(a, "R"+str(i), color=colors[i] , weight = 2)

        # adding consequents to the nodes

        for c in wooden_star_rules.iloc[i]['consequents']:

            Graph1.add_nodes_from([c])

            Graph1.add_edge("R"+str(i), c, color=colors[i], weight=2)

    # rule nodes are drawn in yellow, item nodes in green

    for node in Graph1:

        found_a_string = False

        for item in strs:

            if node==item:

                found_a_string = True

        if found_a_string:

            color_map.append('yellow')

        else:

            color_map.append('green')

    edges = Graph1.edges()

    colors = [Graph1[u][v]['color'] for u,v in edges]

    weights = [Graph1[u][v]['weight'] for u,v in edges]

    pos = nx.spring_layout(Graph1, k=16, scale=1)

    nx.draw(Graph1, pos, edgelist=edges, node_color = color_map, edge_color=colors, width=weights, font_size=16, with_labels=False)

    for p in pos: # raise text positions

        pos[p][1] += 0.07

    nx.draw_networkx_labels(Graph1, pos)

    plt.show()

Figure 2-34 shows the graphical representation.

def visualize_rules(item, no_of_rules):

    # df of item passed

    df_item = market_basket.loc[market_basket[item]==1]

    # Apriori algorithm

    itemsets_frequent = apriori(df_item, min_support=0.15, use_colnames=True)

    # Storing association rules

    a_rules = association_rules(itemsets_frequent, metric="lift", min_threshold=1)

    # Sorting on lift and support (assigned back so the order is kept)

    a_rules = a_rules.sort_values(['lift','support'],ascending=False).reset_index(drop=True)

    print('Items frequently bought together with {0}'.format(item))

    # Printing up to six unique consequents, highest lift and support first

    print(a_rules['consequents'].unique()[:6])

    support = a_rules.support.values

    confidence = a_rules.confidence.values

    for i in range (len(support)):

        support[i] = support[i] + 0.0025 * (random.randint(1,10) - 5)

        confidence[i] = confidence[i] + 0.0025 * (random.randint(1,10) - 5)

    # Creating scatter plot of support v confidence

    plt.scatter(support, confidence, alpha=0.5, marker="*")

    plt.title('Support vs Confidence graph')

    plt.xlabel('support')

    plt.ylabel('confidence')

    plt.show()

    # Creating a new digraph

    Graph2 = nx.DiGraph()

    color_map=[]

    N = 50

    colors = np.random.rand(N)

    strs=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9', 'R10', 'R11']

    # adding as many nodes as number of rules requested by user

    for i in range (no_of_rules):

        Graph2.add_nodes_from(["R"+str(i)])

        # adding antecedents to the nodes

        for a in a_rules.iloc[i]['antecedents']:

            Graph2.add_nodes_from([a])

            Graph2.add_edge(a, "R"+str(i), color=colors[i] , weight = 2)

        # adding consequents to the nodes

        for c in a_rules.iloc[i]['consequents']:

            Graph2.add_nodes_from([c])

            Graph2.add_edge("R"+str(i), c, color=colors[i], weight=2)

    # rule nodes are drawn in yellow, item nodes in green

    for node in Graph2:

        found_a_string = False

        for item in strs:

            if node==item:

                found_a_string = True

        if found_a_string:

            color_map.append('yellow')

        else:

            color_map.append('green')

    print('Visualization of Rules:')

    edges = Graph2.edges()

    colors = [Graph2[u][v]['color'] for u,v in edges]

    weights = [Graph2[u][v]['weight'] for u,v in edges]

    pos = nx.spring_layout(Graph2, k=16, scale=1)

    nx.draw(Graph2, pos, edgelist=edges, node_color = color_map, edge_color=colors, width=weights, font_size=16, with_labels=False)

    for p in pos: # raise text positions

        pos[p][1] += 0.07

    nx.draw_networkx_labels(Graph2, pos)

    plt.show()

Example 1 is as follows.

visualize_rules('WOODEN STAR CHRISTMAS SCANDINAVIAN',4)

Figure 2-35 shows items frequently bought along with WOODEN STAR CHRISTMAS SCANDINAVIAN.

Figure 2-36 shows the visualization of rules.

[frozenset({'WOODEN HEART CHRISTMAS SCANDINAVIAN'})

frozenset({"PAPER CHAIN KIT 50'S CHRISTMAS "})

frozenset({'WOODEN STAR CHRISTMAS SCANDINAVIAN'})

frozenset({'SET OF 3 WOODEN HEART DECORATIONS'})

frozenset({'SET OF 3 WOODEN SLEIGH DECORATIONS'})

frozenset({'SET OF 3 WOODEN STOCKING DECORATION'})]

Example 2 is as follows.

visualize_rules('JAM MAKING SET WITH JARS',6)

Figure 2-37 shows the items frequently bought together with JAM MAKING SET WITH JARS.

Figure 2-38 shows the visualization of rules.

[frozenset({'JAM MAKING SET WITH JARS'})

frozenset({'JAM MAKING SET PRINTED'})

frozenset({'PACK OF 72 RETROSPOT CAKE CASES'})

frozenset({'RECIPE BOX PANTRY YELLOW DESIGN'})

frozenset({'REGENCY CAKESTAND 3 TIER'})

frozenset({'SET OF 3 CAKE TINS PANTRY DESIGN '})]

Summary

In this chapter, you learned how to build a recommendation system based on market basket analysis. You also learned how to fetch items that are frequently purchased together and offer suggestions to users. Many e-commerce sites use this method to showcase items bought together. This chapter implemented this method in Python using an e-commerce example.
