Analysis of a set size of two

To keep matters simple, let's study the affinity between two products/services, which can also be phrased as a set size of two. The customers in our data hold up to eight products/services in their baskets, so the maximum possible set size is eight. Currently, the data is in a wide format, as represented in Figure 7.3. We will transpose this data, find the count of the various products/services, and create some macro variables to store information. The macro variables will primarily hold the per-product customer counts and the total number of customers; a couple of others will aid us in executing the macros.

Here is the data transformation and macro variable initiation:

Proc transpose data=wide out=long (rename=(col1=PRODUCTS) drop=_label_); 
By custid; 
Var pr1-pr8; 
run; 
 
proc sql;  
   create table product_customer_count as 
   select products, count(distinct(custid)) as ANALYSIS_UNIT_FREQ 
   from long 
   where products ne "" 
   group by 1 
   ; 
quit; 
 
proc sql noprint; 
   select count(distinct(products)) into: product_count 
from long  
where products ne ""; 
    
select count(distinct(custid)) into: population from long; 
    
select distinct products into :product_1-:product_%trim(&product_count) from product_customer_count where products ne "" order by products; 
 
select analysis_unit_freq into :analysis_unit_freq_1-:analysis_unit_freq_%trim(&product_count) from product_customer_count where products ne "" order by products; 
quit; 
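The counts that these PROC SQL steps store in macro variables can be sanity-checked outside SAS. The following Python sketch (illustrative only, with hypothetical custid/product pairs, not the book's data) reproduces the per-product customer counts (ANALYSIS_UNIT_FREQ) and the overall population:

```python
from collections import defaultdict

# Hypothetical long-format records: (custid, product) pairs
long_rows = [
    ("C1", "BTL_Mortgage"), ("C1", "Credit_Card"),
    ("C2", "Credit_Card"), ("C2", "Savings_Account"),
    ("C3", "BTL_Mortgage"),
]

# Distinct customers per product, i.e. ANALYSIS_UNIT_FREQ
customers_by_product = defaultdict(set)
for custid, product in long_rows:
    customers_by_product[product].add(custid)

analysis_unit_freq = {p: len(c) for p, c in customers_by_product.items()}
# Distinct customers overall, i.e. the &population macro variable
population = len({custid for custid, _ in long_rows})

print(analysis_unit_freq)  # {'BTL_Mortgage': 2, 'Credit_Card': 2, 'Savings_Account': 1}
print(population)          # 3
```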

After the creation of the macro variables, a macro, product_tot, has been written to find the number of instances where products co-occur in the same basket. The loop runs 12 times (the count of products and services the bank has) and produces an equal number of tables. In the following code, the %put statement shows that when i=1, the LHS product is Buy to Let Mortgage.

Here is the common basket count (frequency co-occur):

%macro product_tot; 
%do i = 1 %to &product_count; 
Proc sql; 
   create table product_tot_&i as 
   select products, count(distinct(custid)) as FREQ_CO_OCCUR 
   from 
   (select custid, products 
   from long 
   where custid in (select custid from long where products eq "&&product_&i")) 
   where products ne "" 
   group by 1 
   order by 1 
   ; 
quit; 
%end; 
%mend; 
%product_tot; 
 
%put product is &product_1; 
product is BTL_Mortgage 
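For readers who want to follow the logic of product_tot outside SAS, here is a minimal Python sketch (with hypothetical baskets, not the book's data): for a given LHS product, it restricts to customers who hold that product and then counts how many of those baskets contain each other product, which is what FREQ_CO_OCCUR captures:

```python
# Hypothetical customer baskets: custid -> set of products held
baskets = {
    "C1": {"BTL_Mortgage", "Credit_Card"},
    "C2": {"Credit_Card", "Savings_Account"},
    "C3": {"BTL_Mortgage", "Business_Current_Account"},
}

def freq_co_occur(lhs, baskets):
    """Count, for each product, the distinct customers who hold it
    alongside the given LHS product (mirrors FREQ_CO_OCCUR)."""
    holders = {c for c, b in baskets.items() if lhs in b}
    counts = {}
    for c in holders:
        for product in baskets[c]:
            counts[product] = counts.get(product, 0) + 1
    return counts

print(freq_co_occur("BTL_Mortgage", baskets))
```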

In Figure 7.4, we have a count of baskets in which the LHS and the RHS products are found together. We can ignore the first row, as the LHS and RHS products are the same. For the second row, we can say that there are 94 customers who have a Buy to Let Mortgage product and who went on to open a Business_Current_Account at some stage:

Figure 7.4: Illustration - frequency co-occur

After counting the common baskets for products, let's add a few metrics to understand the results. The following code adds the metrics and helps us choose the product affinities that are meaningful. The metrics macro calculates the SUPPORT, CONFIDENCE, LIFT, EXPECTED_CONFIDENCE, CHISQ, and P-value.

Here are the metrics and relevant affinities:

%macro metrics; 
%do i = 1 %to &product_count; 
Proc sql; 
   create table metric_&i as 
select distinct "&&product_&i" as LHAND, a.PRODUCTS as RHAND, a.ANALYSIS_UNIT_FREQ, b.freq_co_occur, (b.FREQ_CO_OCCUR/&&analysis_unit_freq_&i)*100 as CONFIDENCE format 5.2,  
(b.FREQ_CO_OCCUR/&population)*100 as SUPPORT format 5.2, (a.ANALYSIS_UNIT_FREQ/&population)*100 as EXPECTED_CONFIDENCE format 5.2, 
calculated CONFIDENCE/calculated EXPECTED_CONFIDENCE as LIFT format 5.2, 
(FREQ_CO_OCCUR*(calculated LIFT-1)**2)*((calculated SUPPORT/100)*(calculated CONFIDENCE/100)) 
         / 
((calculated CONFIDENCE/100 - calculated SUPPORT/100)*(calculated LIFT - calculated CONFIDENCE/100)) as CHISQ format 5.2, 
      1 - Probchi(calculated CHISQ,1) as P format 5.4 
   from product_customer_count as a left join product_tot_&i as b 
   on a.products=b.products 
   order by calculated P 
   ; 
quit; 
%end; 
%mend; 
%metrics; 
 
Proc sql noprint; 
Select memname into :datasets separated by ' ' 
From dictionary.tables 
Where libname = "WORK" and memname like "METRIC%"; 
quit; 
 
data summary; 
format LHAND $50.; 
set &datasets; 
if LHAND ne RHAND; 
run; 
 
data relevant (drop=analysis_unit_freq); 
set summary; 
if freq_co_occur ge 50; 
if P le 0.05; 
if confidence ge 60; 
if lift gt 1; 
run; 
 
proc sort data=relevant; 
by descending lift; 
run; 

So, what are SUPPORT, CONFIDENCE, LIFT, and EXPECTED_CONFIDENCE? Let's look at the output in Figure 7.5 to understand these metrics. Earlier on, we mentioned that there are 500 customers; this is the population that we are dealing with. Before we move on to the metrics, notice the LHS and RHS columns. In the first row, we can see that in 93 instances out of 500, a customer with Premium_Current_Account in their basket on the LHS subsequently took out a personal loan. Dividing 93 by 500 (the population of baskets analyzed) gives us the SUPPORT of the rule (18.60%). These two products co-occur 93 times in customer baskets, and Premium_Current_Account is found in customers' purchase baskets 132 times.

Hence, the CONFIDENCE of observing the rule, if Premium_Current_Account then Personal_Loans, is (93/132)*100, which is equal to 70.45%. EXPECTED_CONFIDENCE is simply the frequency of baskets with Personal_Loans divided by the population (that is, (218/500)*100=43.60%). LIFT is calculated by dividing CONFIDENCE by EXPECTED_CONFIDENCE (70.45/43.60=1.62):
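These calculations can be verified with a few lines of Python, using the figures quoted above (500 customers, 132 Premium_Current_Account baskets, 218 Personal_Loans baskets, 93 co-occurrences):

```python
population = 500          # total customers (baskets analyzed)
lhs_freq = 132            # baskets containing Premium_Current_Account
rhs_freq = 218            # baskets containing Personal_Loans
co_occur = 93             # baskets containing both products

support = co_occur / population * 100              # how often the rule occurs overall
confidence = co_occur / lhs_freq * 100             # share of LHS baskets that also have the RHS
expected_confidence = rhs_freq / population * 100  # baseline rate of the RHS product
lift = confidence / expected_confidence            # strength of the rule versus the baseline

print(round(support, 2), round(confidence, 2),
      round(expected_confidence, 2), round(lift, 2))
# 18.6 70.45 43.6 1.62
```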

Figure 7.5: Set size of 2 relevant insights

Later in the chapter, you will see the inbuilt procedures that are available in some SAS versions. Those procedures produce the SUPPORT, CONFIDENCE, EXPECTED_CONFIDENCE, and LIFT metrics. As mentioned in the assumptions about MBA, the code we are using is fairly detailed. However, it gives us an opportunity to understand how the metrics are calculated. The code has also been enhanced to include a test of significance. While working through the code, you can find the denominator values of CONFIDENCE and EXPECTED_CONFIDENCE in the table called product_customer_count.
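The significance test uses SAS's PROBCHI function with one degree of freedom. As a rough check outside SAS, the same p-value can be computed with standard-library math, since for one degree of freedom 1 - PROBCHI(x, 1) equals erfc(sqrt(x/2)) (an illustrative equivalence, not code from the book):

```python
import math

def chisq_pvalue_df1(chisq):
    """P-value for a chi-square statistic with 1 degree of freedom,
    matching 1 - Probchi(chisq, 1) in SAS."""
    return math.erfc(math.sqrt(chisq / 2.0))

# 3.841 is the familiar 5% critical value for 1 degree of freedom
print(round(chisq_pvalue_df1(3.841), 3))  # 0.05
```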

The summary dataset produced a lot of two-way relationships between the products. We have placed some restrictions on the previous code to generate the relevant dataset. This was done to only output the insights of potential interest in Figure 7.5. Only rules where the products and services co-occur at least 10% of the time in the population were considered. All the rules generated are statistically significant. They all have a minimum confidence of 60%, which is an arbitrary measure that we have added to help focus on key rules.

Lastly, all the rules have a LIFT greater than 1. From an insight implementation perspective, we can further condense the list of 11 insights of interest to 4 (Figure 7.6):

Figure 7.6: Set size of 2 insight implementation perspective

We already knew about the relationship between savings accounts, current accounts, and credit cards. The metrics related to them only strengthened our belief in that relationship. They helped us quantify the strength of the relationship, but it didn't qualify as an insight that we would be eager to implement. However, if you now look at the remaining relationships in Figure 7.6, you can see insights that the campaign managers might be delighted to implement. Premium_Current_Account is a product of interest for the bank as it is a fee-paying current account. It is interesting to note that these customers went on to open other higher-margin products and services. Personal_Loans have the highest LIFT in this list. LIFT is a measure of the strength of the relationship between Premium_Current_Account and Personal_Loans. LIFT values greater than 1 indicate that customers opening Premium_Current_Account are more likely to have Personal_Loans than customers without a Premium_Current_Account. SUPPORT merely states how many times the rule has been observed.

A campaign manager can use the preceding rules to cross-sell RHS products and services to customers. Remember, we don't know from this analysis the time taken by 93 customers to apply for a personal loan after opening a Premium_Current_Account. But since we know about the strength of this relationship, we can try and sell Personal_Loans to customers who don't yet have this product. We can also try and reduce the time that a customer might take to buy both these products. This is helpful as, even though we expect 43.60% (our EXPECTED_CONFIDENCE) of customers to buy both products, we can help reduce the time between the opening of these products.

For simplicity's sake, we have only looked at a set size of two. However, the inbuilt SAS procedure produces output for product relationships with a set size of more than two.

The following is code that produces a three-way MBA.

Here is the data transformation for a set size of three:

data one; 
   input id $ product1 $50.; 
datalines; 
1 BTL_Mortgage| 
2 Business_Current_Account| 
3 Credit_Card| 
4 Currency_Services| 
5 Insurance| 
6 Locker| 
7 Personal_Current_Account|  
8 Personal_Loans| 
9 Premium_Current_Account| 
10 Residential_Mortgage| 
11 Savings_Account| 
12 Trading_Account|  
; 
data two; 
   input id $ product2 $50.; 
datalines; 
1 BTL_Mortgage| 
2 Business_Current_Account| 
3 Credit_Card| 
4 Currency_Services| 
5 Insurance| 
6 Locker| 
7 Personal_Current_Account|  
8 Personal_Loans| 
9 Premium_Current_Account| 
10 Residential_Mortgage| 
11 Savings_Account| 
12 Trading_Account| 
; 
data three; 
   input id $ product3 $50.; 
datalines; 
1 BTL_Mortgage 
2 Business_Current_Account 
3 Credit_Card 
4 Currency_Services 
5 Insurance 
6 Locker 
7 Personal_Current_Account 
8 Personal_Loans 
9 Premium_Current_Account 
10 Residential_Mortgage 
11 Savings_Account 
12 Trading_Account 
; 
run; 
 
data stage1; 
set one; 
   do i = 1 to n; 
      set two point=i nobs=n; 
   output; 
 end; 
run; 
 
data stage2; 
set stage1; 
   do i = 1 to n; 
      set three point=i nobs=n; 
   output; 
 end; 
run; 
 
data matrix (drop = lhand1 lhand2); 
set stage2 (drop = id); 
if product1 ne product2; 
if product1 ne product3; 
if product2 ne product3; 
combo=compress(product1||product2||product3); 
lhand1=scan(combo,1); 
lhand2=scan(combo,2); 
lhand=compress(lhand1||"|"||lhand2); 
run; 

In the preceding code, we have created a dataset matrix with all of the possible three-way sequences that can exist between the product and services offered by the bank. After this, we will run a macro to find out the number of instances of each combination in the data.
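The two chained point= data steps build a Cartesian product, and the three inequality filters then keep only sequences of distinct products. The same enumeration can be expressed compactly in Python with itertools.permutations (a sketch with a hypothetical four-product list):

```python
from itertools import permutations

products = ["BTL_Mortgage", "Credit_Card", "Savings_Account", "Personal_Loans"]

# Ordered three-way sequences of distinct products, i.e. what the
# stage1/stage2 joins plus the three inequality filters produce
combos = ["|".join(seq) for seq in permutations(products, 3)]

print(len(combos))  # 4 * 3 * 2 = 24; with the bank's 12 products: 12 * 11 * 10 = 1320
```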

Here is the pattern finding for a set size of three:

Proc sql noprint; 
   Select count(distinct(combo)) into:combo_count from matrix; 
   Select distinct combo into:combo_1 -:combo_%trim(&combo_count) from matrix; 
   select count(distinct(lhand)) into:lhand_count from matrix; 
   select distinct lhand into:lhand_1 -:lhand_%trim(&lhand_count) from matrix; 
quit; 
%put lhand count is &lhand_count; 
 
%macro combo_find; 
%do i = 1 %to &combo_count; 
Proc sql; 
   create table combos_main_&i as 
   select a.*, compress(pr1||"|"||pr2||"|"||pr3||"|"||pr4||"|"||pr5||"|"||pr6||"|"||pr7||"|"||pr8) as combo, "&&combo_&i" as pattern_found, 
   case when calculated combo contains "&&combo_&i" then 1 else 0 end as combo_count 
   from wide as a 
   ; 
   create table combos_sum_&I as 
   select pattern_found, sum(combo_count) as freq_co_occur 
   from combos_main_&i 
   group by 1 
   having freq_co_occur ge 1 
   order by calculated freq_co_occur desc 
   ; 
quit; 
%end; 
%mend; 
%combo_find; 
 
%macro lhand_find; 
%do i = 1 %to &lhand_count; 
Proc sql; 
   create table lhand_main_&i as 
   select a.*, compress(pr1||"|"||pr2||"|"||pr3||"|"||pr4||"|"||pr5||"|"||pr6||"|"||pr7||"|"||pr8) as combo, 
   "&&lhand_&i" as lhand, 
   case when calculated combo contains "&&lhand_&i" then 1 else 0 end as lhand_count 
   from wide as a 
   ; 
   create table lhand_sum_&i as 
   select lhand, sum(lhand_count) as lhand_co_occur 
   from lhand_main_&i 
   group by 1 
   having lhand_co_occur ge 1 
   order by calculated lhand_co_occur desc 
   ; 
quit; 
%end; 
%mend; 
%lhand_find; 

Once the product combination pattern has been found, we will calculate the metrics related to various rules that have been generated.

Here are the metrics for a set size of three:

Proc sql noprint; 
   Select memname into:combosets separated by ' ' 
   From dictionary.tables 
   where libname eq "WORK" and memname like "COMBOS_SUM_%" 
   ; 
quit; 
 
data combos_main (drop = lhand1 lhand2); 
format pattern_found $200.; 
set &combosets; 
lhand1=scan(pattern_found,1); 
lhand2=scan(pattern_found,2); 
lhand=compress(lhand1||"|"||lhand2); 
rhand=scan(pattern_found,-1); 
run; 
 
proc sort data=combos_main; 
by descending freq_co_occur; 
run; 
 
proc sql noprint; 
   select memname into:lhandsets separated by ' ' 
   from dictionary.tables 
   where libname eq "WORK" and memname like "LHAND_SUM_%" 
   ; 
quit; 
 
data lhand_main; 
format lhand $200.; 
set &lhandsets; 
run; 
 
proc sql; 
   create table metric_three_way as 
   select distinct a.LHAND as LHS, a.RHAND as RHS, c.lhand_co_occur as ANALYSIS_UNIT_FREQ, a.freq_co_occur, (a.FREQ_CO_OCCUR/c.lhand_co_occur)*100 as CONFIDENCE format 5.2,  
   (a.FREQ_CO_OCCUR/&population)*100 as SUPPORT format 5.2, (d.analysis_unit_freq/&population)*100 as EXPECTED_CONFIDENCE format 5.2, 
   calculated CONFIDENCE/calculated EXPECTED_CONFIDENCE as LIFT format 5.2, (a.FREQ_CO_OCCUR*(calculated LIFT-1)**2)*((calculated SUPPORT/100)*(calculated CONFIDENCE/100)) 
         / 
         ((calculated CONFIDENCE/100 - calculated SUPPORT/100)*(calculated LIFT - calculated CONFIDENCE/100)) as CHISQ format 5.2, 
         1 - Probchi(calculated CHISQ,1) as P format 5.4 
   From combos_main as a left join product_customer_count as b 
   On a.lhand=b.products 
   left join lhand_main as c 
   on a.lhand=c.lhand 
   left join product_customer_count as d 
   on a.rhand=d.products 
   order by calculated P, calculated confidence desc, calculated expected_confidence desc 
   ; 
quit; 
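The scan calls in the combos_main data step earlier split each three-product pattern into a two-product LHS and a single-product RHS. A Python equivalent of that string handling (with a hypothetical pattern) looks like this:

```python
pattern_found = "Premium_Current_Account|Savings_Account|Personal_Loans"  # hypothetical

parts = pattern_found.split("|")
lhand = "|".join(parts[:2])  # first two products, like scan(pattern_found, 1) and scan(pattern_found, 2)
rhand = parts[-1]            # last product, like scan(pattern_found, -1)

print(lhand)  # Premium_Current_Account|Savings_Account
print(rhand)  # Personal_Loans
```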

Similar conditions to the one used for a set size of two have been applied. However, you will notice that the threshold for accepting CONFIDENCE and the frequency of co-occurrence has been reduced. In a three-way set size, there is less likelihood of observing a higher co-occurring frequency than a two-way set size.

Here are the relevant rules for a set size of three:

Data relevant_three_way; 
Set metric_three_way; 
if P le 0.05; 
if confidence ge 40; 
if lift gt 1; 
if freq_co_occur ge 15; 
run; 
 
proc sort data=relevant_three_way; 
by descending lift; 
run; 

Our relevant_three_way dataset contains no observations, as the subsetting conditions are not met. The metric_three_way dataset contains 162 distinct sequences of a set size of three. Figure 7.7 contains a partial output with 10 different rules. The standard SAS inbuilt procedures don't take the sequence into account in a set size of three or larger. This is an important differentiation to consider while evaluating the metric values of metric_three_way. In this output, there are only six rules that are statistically significant.

None of these rules have a LIFT greater than 1. This means that these insights aren't meaningful. Does this mean there are no relevant rules where more than two products and services are involved?

Figure 7.7: Partial output of metrics for a set size of three

EXPECTED_CONFIDENCE is heavily influencing the LIFT and P values. In a three-product rule, adding the significance values distorts the analysis output. The significance values are an addition that we included in the analysis to focus on some key results. Figure 7.8 contains the insights for a set size of three after the removal of the restrictions placed on the significance measures. Only rules with a LIFT greater than 1 that occur in at least 25 baskets have been included in the output:

Figure 7.8: Partial output of relevant set size of three

The following code can be used by non-SAS University Edition users who have access to the full procedures offered by SAS. As discussed earlier, the code uses inbuilt procedures to find the rules.

Here is the MBA code for full SAS access users:

Proc dmdb batch data=long out=dmassoc dmdbcat=catseq; 
   id custid; 
   class products; 
run; 
 
proc assoc data=long dmdbcat=catseq out=output items=8 support=15; 
   cust custid; 
   target products; 
run; 
 
proc rulegen in=output out=datarule minconf=75; 
run; 
 
proc sort  data=datarule; 
   by descending lift; 
run; 