10 Churn demographics and firmographics

This chapter covers

  • Creating a dataset that includes demographic or firmographic information
  • Converting date information to intervals and analyzing the relationship to churn
  • Analyzing text categories for the relationship to churn
  • Forecasting churn probability with demographic or firmographic information
  • Segmenting customers with demographic or firmographic information

You now know all about using customer behavior data to segment your customers for the purpose of creating interventions to increase engagement. These strategies are the most important ones for increasing customer engagement and retention, which is why they are the focus of the book. But one other way to reduce your customer churn is not about intervening with your existing customers: find new customers who are more likely to be engaged to begin with. Identify facts about customers who tend to be more engaged, and then focus your customer acquisition efforts on finding more customers like them. Such facts are generally known as demographic data (data about individuals) and firmographic data (data about companies). For the purpose of this discussion, I use the following definitions.

DEFINITION Demographics are facts about individual customers, and firmographics are facts about customers that are companies (firms).

Demographics and firmographics generally are unchanging facts about the customer or facts that change only rarely. Demographics and firmographics do not include product use or subscription-derived metrics, but they can include facts about how the customer signed up or about the hardware a customer uses to access an online service. Normally, a business-to-consumer (B2C) or direct-to-consumer (D2C) company uses demographic data, whereas a business-to-business (B2B) company uses firmographic data. As you will see, demographics and firmographics differ in the specific pieces of information that are normally available. But the characteristics of that information are similar in either case, and for that reason, the techniques for handling demographics and firmographics are the same.

NOTE This chapter uses the example of the social network simulation from the GitHub repository for the book (https://github.com/carl24k/fight-churn), which is a consumer product. For that reason, I generally speak about demographics, but the same techniques apply to firmographics.

It is worth noting at the outset that targeting demographics is the least direct method of reducing churn because it doesn’t help your existing customers become more engaged. You can sometimes influence your customer’s behavior, but you cannot change the demographic or firmographic facts about them! Also, targeting acquisitions usually has limited impact because most products and services cannot get all the customers they would like from only one or a few preferred channels. Still, this approach can move the needle on churn over time, and it is worth your while to try every means at your disposal.

This chapter is organized as follows:

  • Section 10.1 describes typical demographic and firmographic data types and database schemas that contain it and teaches you how to extract such data as part of a dataset.

  • Section 10.2 shows you how to individually analyze textual demographic data fields with category cohort analysis, which is a bit different from metric cohort analysis because it uses a new concept: confidence intervals.

  • Section 10.3 teaches you to handle large numbers of demographic categories by combining them.

  • Section 10.4 demonstrates analyzing a date field for its relationship to churn (the same as metric cohort analysis after the date has been converted to a time interval).

  • Section 10.5 teaches you the techniques necessary to fit churn probability models like regression and XGBoost when your data includes demographic data fields.

  • Section 10.6 extends the modeling in section 10.5 to forecasting and segmenting active customers by using demographic fields.

NOTE No real personal information was used to create this chapter. All examples are created from simulated data, designed to be similar to real case studies I have worked on.

10.1 Demographic and firmographic datasets

First, I will explain what exactly I mean by demographic and firmographic data and how it differs from the metrics you have looked at throughout most of this book. Then I will use a social network simulation to demonstrate a typical method for creating a dataset that includes demographic data along with metrics.

10.1.1 Types of demographic and firmographic data

Table 10.1 provides examples of demographic and firmographic data. Although this table covers the most common examples, there are many more possibilities. As you can see, some types of data are common to both consumers and firms in slightly different forms. An individual has a birthdate, and a company has a founding date, for example; a household has a number of members, and a company has a number of employees. Other items are specific to consumers or to businesses, such as a person's education level or a company's industry. Table 10.1 also shows the data type for the items listed.

Table 10.1 Examples of demographic and firmographic data

Demographic                  | Firmographic                                        | Data type
Date of birth                | Founding date                                       | Date
Sales channel                | Sales channel                                       | String
Place of residence           | Company domicile or geography                       | String
Occupation                   | Industry or vertical                                | String
Hardware and OS information  | Technology stack information                        | String
Number of household members  | Number of employees                                 | Number
Education level attained     | Company stage (start-up, funding round, or public)  | String
Gender                       | B2B or B2C business model                           | String

In principle, there isn’t much difference between using demographic facts and metrics to understand churn and segment customers.

TAKEAWAY To understand churn and form customer segments with demographic data, you form cohorts of customers based on the values of the demographic fields and compare the churn rates in each cohort.

The non-numeric types are the reason why separate techniques are needed for demographic and firmographic data in comparison with metrics. If you are looking at numeric demographic data, the technique is the same as for metrics except for where the data comes from.

10.1.2 Account data model for the social network simulation

Because demographic data is tied to each account and rarely changes, it is standard to store it in a single database table indexed by account ID, as shown in table 10.2. Table 10.2 includes some of the demographic fields that are part of the social network simulation:

  • Channel (short for the sales channel) —The sales channel refers to how the customer found the product and signed up. Every user signs up through exactly one channel, so the channel is a required field with no null values in the social network simulation. The simulated sales channels are as follows:

    • App store 1

    • App store 2

    • Web sign-up

  • Date of birth —Many products require a customer to enter their date of birth as a statement that they are of (or older than) the minimum age to use the product. Because all users are required to enter something, the date of birth is a required field with no null values for the social network simulation.

  • Country —The country in which the user lives can often be derived from the user’s payment information or their localization choices in the software. In the social network simulation, users come from more than 20 countries, which are represented by two-character codes (from the ISO 3166-1 alpha-2 standard). The country field can include missing values (null values in the database): the country is assumed to be an optional setting that some users don’t bother to set.

These three fields represent the minimal set necessary to demonstrate the techniques in this chapter. In a real product, there probably would be more fields, although the number varies considerably by product area. Many B2B companies know a great deal about their customers, but demographics can be sparse for consumer products with minimal sign-up requirements.

Table 10.2 Typical account data schema

Column          | Type                       | Notes
account_id      | integer or char            | The account ID linking to subscriptions, events, and metrics
channel         | char                       | The channel through which the customer purchased the app
date_of_birth   | date                       | The birthdate entered by the customer for age verification when they signed up
country         | char                       | The country in which the user lives, represented by a two-character string
...             | ...                        | ...
optional fields | char, float, int, or date  | Optional; platform specific

In the rest of this section, I’ll show you how to put the data in such a schema to work fighting churn.

10.1.3 Demographic dataset SQL

Given a schema of demographic data keyed by the account ID, the first step is exporting it from the database along with the dataset you usually create for the metrics. This way, you reuse all the existing code you have, showing when accounts renew and who has churned. Also, you will eventually combine the demographic data with the metrics in a single forecasting model, and by exporting the metrics and demographic fields together, you start with everything you are going to need.

Figure 10.1 shows a typical result of such a data extraction. As in the dataset you have used since chapter 4, each row starts with the account ID, the observation date, and the churn indicator. The demographic fields come after those fields and before the metrics.

Figure 10.1 Social network simulation dataset with demographic fields (result of listing 10.1)

Listing 10.1 shows the SQL statement that creates a dataset like the one shown in figure 10.1. Instead of the date_of_birth field, which was in the database, the dataset contains a field called customer_age. The one new technique listing 10.1 introduces is the conversion of the date field for the birthdate to a time interval in years: the customer’s age.

TAKEAWAY You convert demographic date fields to time intervals because then the numeric interval can be used for customer analysis and segmentation in the same way as a metric.

At a high level, the conversion is accomplished by subtracting the demographic date from the observation date, or vice versa:

  • When the demographic date is in the past (such as a birthdate), you subtract the demographic date from the observation date, and the result is a positive interval representing the time elapsed since the demographic date, as of the observation.

  • If the demographic date is in the future (such as the day of college graduation), subtract the observation date from the future date to keep the interval positive. Then the interval represents the time from the observation date until the date from the demographic data.

Because the birthdate is in the past, listing 10.1 subtracts the birthdate from the observation date to get the customer’s age. In PostgreSQL, the interval is converted to an age in years by using the date_part function with the 'day' parameter to get the interval length in days and then dividing by 365 (taking care with type conversions).
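If you assemble your dataset in pandas rather than SQL, the same conversion is straightforward. The following is a minimal sketch under the assumption that you already have the observation date and birthdate as datetime columns (the example rows are made up; in practice, the dataset comes from a query like listing 10.1):

import pandas as pd

# Hypothetical example frame for illustration only
df = pd.DataFrame({
    'observation_date': pd.to_datetime(['2020-03-01', '2020-04-01']),
    'date_of_birth': pd.to_datetime(['1985-06-15', '2001-01-20']),
})

# Subtract the past date from the observation date, then convert days to years
df['customer_age'] = (df['observation_date'] - df['date_of_birth']).dt.days / 365.0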

Listing 10.1 Exporting a dataset with demographic data fields

WITH observation_params AS                                           
(
   SELECT  interval '%metric_interval' AS metric_period,
   '%from_yyyy-mm-dd'::timestamp AS obs_start,
   '%to_yyyy-mm-dd'::timestamp AS obs_end
)
SELECT m.account_id, o.observation_date, is_churn,
a.channel,                                                           
a.country,                                                           
date_part('day',o.observation_date::timestamp                        
        - a.date_of_birth::timestamp)::float/365.0 AS customer_age,
SUM(CASE WHEN metric_name_id=0 THEN metric_value else 0 END)
    AS like_per_month,
SUM(CASE WHEN metric_name_id=1 THEN metric_value else 0 END)
    AS newfriend_per_month,
SUM(CASE WHEN metric_name_id=2 THEN metric_value else 0 END)
    AS post_per_month,
SUM(CASE WHEN metric_name_id=3 THEN metric_value else 0 END)
    AS adview_per_month,
SUM(CASE WHEN metric_name_id=4 THEN metric_value else 0 END)
    AS dislike_per_month,
SUM(CASE WHEN metric_name_id=34 THEN metric_value else 0 END)
    AS unfriend_per_month,
SUM(CASE WHEN metric_name_id=6 THEN metric_value else 0 END)
    AS message_per_month,
SUM(CASE WHEN metric_name_id=7 THEN metric_value else 0 END)
    AS reply_per_month,
SUM(CASE WHEN metric_name_id=21 THEN metric_value else 0 END)
    AS adview_per_post,
SUM(CASE WHEN metric_name_id=22 THEN metric_value else 0 END)
    AS reply_per_message,
SUM(CASE WHEN metric_name_id=23 THEN metric_value else 0 END)
    AS like_per_post,
SUM(CASE WHEN metric_name_id=24 THEN metric_value else 0 END)
    AS post_per_message,
SUM(CASE WHEN metric_name_id=25 THEN metric_value else 0 END)
    AS unfriend_per_newfriend,
SUM(CASE WHEN metric_name_id=27 THEN metric_value else 0 END)
    AS dislike_pcnt,
SUM(CASE WHEN metric_name_id=30 THEN metric_value else 0 END)
    AS newfriend_pcnt_chng,
SUM(CASE WHEN metric_name_id=31 THEN metric_value else 0 END)
    AS days_since_newfriend
FROM metric m INNER JOIN observation_params
ON metric_time BETWEEN obs_start AND obs_end
INNER JOIN observation o ON m.account_id = o.account_id
   AND m.metric_time > (o.observation_date - metric_period)::timestamp
   AND m.metric_time <= o.observation_date::timestamp
INNER JOIN account a ON m.account_id = a.id                          
GROUP BY m.account_id, metric_time, observation_date, 
         is_churn, a.channel, date_of_birth, country                 
ORDER BY observation_date,m.account_id

Most of this listing is the same as listings 7.2 and 4.5.

The channel string from the account table

The country string from the account table

Subtracts the date of birth from the observation date

JOINs with the account table

Includes the demographic fields in the GROUP BY clause

Most of listing 10.1 is the same as the previous listings you’ve used to extract a dataset: observation dates are selected from the observation table and joined with metrics by using an aggregation to flatten the data. The other new aspects of listing 10.1 follow:

  • The query makes an INNER JOIN on the account table (table 10.2) to select the fields for the channel, country, and date of birth.

  • Because these demographic fields are one per account in the account table, there is no need to aggregate these fields. Instead, the demographic fields are included in the GROUP BY clause.

You should run listing 10.1 on the social network simulation to create the dataset that will be used throughout the rest of the chapter. Assuming that you are using the Python wrapper program to run the listings, the command is

fight-churn/listings/run_churn_listing.py --chapter 10 --listing 1

The result of listing 10.1 (saved in the output directory) should appear similar to figure 10.1 at the start of this section.

Tracking demographic and firmographic data changes and avoiding lookahead biases

In this section, I describe storing demographic data as a single, unchanging value. But not all demographic or firmographic fields are truly unchanging: people and companies can move, companies can reach new stages of development, people can attain higher levels of education, and so on. To model such changes better, some companies track demographic data in a time-sensitive manner, either by adding effective-date timestamps to the account table or by tracking demographic fields in tables separate from the account itself (known as slowly changing dimensions in data warehouse terminology). Because these more advanced methods are not common, I don’t cover them in this book. If that is your situation, listing 10.1 would be modified to join the demographic data’s effective dates to the observation date.

The reason why the more complicated approach can be advantageous is that in some scenarios, treating demographic fields as static when they are not can result in a kind of lookahead bias when predicting churn from the demographic field. You see something about a customer in your historical dataset paired with a churn or renewal status in the past, but in a nonhistorical context (at the time of the observation timestamp), you would not have known that information. To take an example from firmographics, consider the company stage of a start-up versus a public company. Start-ups that go public must be successful and are less likely to go out of business and churn. If the data includes start-ups that went public in the past, the firmographic data identifies them as public companies because that was the current status when you created the dataset. But only successful start-ups go public, so the data becomes biased.

Such a bias can also confer unrealistic forecasting accuracy to a model. That said, this type of scenario is usually a second-order effect, which justifies the usual practice of ignoring the time-changing component of demographic and firmographic data.

10.2 Churn cohorts with demographic and firmographic categories

Now that you’ve got a dataset with demographic data, you will compare the demographic cohorts by their churn rates to see how the demographic data is related to churn. At the start of the chapter, I told you that there are three types of demographics fields: dates, numbers, and strings. Earlier, I showed you that you should convert the dates to numeric intervals. In the cohort analysis, there are only two types: numbers and strings.

Churn cohort analysis with numeric demographic data is exactly the same as cohorts based on metrics, as I will show briefly in section 10.4. This section is about the new subject of comparing churn rates in cohorts by using demographic information described by strings.

10.2.1 Churn rate cohorts for demographic categories

The section is about demographic categories, so I start with a definition.

DEFINITION For the purposes of this book, a category is one possible value of a demographic field described by a string.

In the social network simulation, the categories associated with the channel field are appstore1, appstore2, and web. The categories associated with the country field are two-character codes such as BR, CA, and CN. It is possible for a value to be missing in a demographic field, so you can consider no value (null in the database) to be one additional category for every field.

NOTE For each demographic field, a customer can belong to only one category or have no value as a category.

In principle, churn cohort analysis for demographic categories is simple: define a cohort with each category, and calculate the churn rates. But there are important differences between cohorts made from categories and cohorts made from metrics. As a result, you need to be more careful in how you compare the churn rates in cohorts defined by categories. Following are some important differences between cohorts based on metrics and cohorts based on categories:

  • With metrics, the cohorts have a natural order given by the metrics. In most cases, categories do not have a meaningful order. Category-based cohorts, therefore, are harder to interpret because you cannot use the trend you see across categories as a guide for interpreting the differences in the churn rates.

  • For metrics of product use, you have natural expectations, such as “More use leads to lower churn” and “More cost for use leads to higher churn.” But there is no obvious expectation with categories.

  • When you define metric cohorts, you guarantee that each cohort has a significant portion of the observations—typically, 10% or more. With category-based cohorts, there is no guarantee of the minimum or maximum percentage of the data that might be captured in each cohort.

Based on my own experience, cohorts from demographics have weaker relationships to churn than cohorts based on product-use metrics.

TAKEAWAY You must be more careful making comparisons of churn rates in cohorts based on demographic categories than in cohorts based on metrics.

By careful, I mean that you need to rely on strong evidence to make sure that the difference is significant. For that reason, you will use a new technique known as confidence intervals to make the comparison.

10.2.2 Churn rate confidence intervals

To be more careful with churn rate comparisons between demographic cohorts, you should not simply calculate the churn rates in each cohort; you should also estimate best- and worst-case scenario churn rates in each cohort. This process is known as calculating confidence intervals.

DEFINITION Confidence intervals for a metric like the churn rate are the range from the best-case (lowest) estimate of the churn rate to the worst-case (highest) estimate for the churn rate.

Understanding confidence intervals starts with realizing that the churn rate you calculate on your customers is not the churn rate you want to measure. Consider the following:

  • What you want to know is what the churn rate would be on all the possible customers in the world who would match your cohort demographic category. That estimate would be the best estimate of future churn for that type of customer.

  • You can measure only the churn rate you have seen for the customers you have had.

This scenario is illustrated in figure 10.2. You can’t be sure that the churn rate you have seen in past customers is what the churn rate is going to be for future customers. You may see a different churn rate in the future. Maybe you got lucky in the past and got better-than-average customers, or maybe the opposite is true; you never know. But you can expect two things:

  • The churn rate you would see in the full universe of customers should be close to what you have seen in the past, assuming that you observed a reasonable number of customers in each cohort.

  • The more customers you see, the closer the churn rate you have seen in the past should be to the churn rate in the entire universe. Put another way, the more customers you see, the less uncertainty exists about the range of possible churn rates for the universe.

For this reason, people usually talk about confidence intervals as the range around the measured churn rate, which is known to be near the center of the best- and worst-case scenarios (but, as you will see, not necessarily at the center). To describe the measured churn and the best-case and worst-case estimates, we’ll use the following definitions.

DEFINITION The measured churn rate on past customers is referred to as the expected value, and it is considered to be the most likely value for the universal churn rate. The upper confidence interval is the range from the expected churn rate to the worst-case estimate, described by the size of that range, or the worst-case churn minus the expected churn. The lower confidence interval is the range from the best-case estimate to the expected churn, described by the size of that range, or the expected churn minus the best-case estimate.

Figure 10.2 illustrates the differences among the universal churn rate, your estimate, and the upper and lower confidence intervals for the estimate.

Figure 10.2 Confidence intervals assess best- and worst-case scenarios.

I said the churn rate in each cohort should be close to the universal churn rate for such customers, assuming that you observed enough of them. How many is enough was discussed at length in chapter 5: ideally, you want to observe thousands of customers in each category, but hundreds may be enough.

When you use confidence intervals, the number of customers translates into the size of the confidence intervals: the more customers you measure churn on, the narrower the range of uncertainty around the churn rate. In section 10.2.3, you will learn how to calculate confidence intervals and compare them.

TAKEAWAY Because you can’t calculate the universal churn rate measurement, you will instead calculate best- and worst-case estimates for the universal churn rate, given the available data.

10.2.3 Comparing demographic cohorts with confidence intervals

Figure 10.3 shows an example of comparing demographic cohorts with confidence intervals, which is the result for the channel category in the social network simulation. The basic idea is the same as the metric cohort plots you saw in earlier chapters, but there are a few significant differences:

  • The data is displayed in a bar chart instead of a line chart. The churn rate in each cohort is shown by the height of each bar.

  • Each bar has a pair of lines above and below the main bar, showing the extent of the confidence intervals. The lines showing the confidence intervals in a plot are often known as error bars or whiskers.

  • The x-axis still identifies the cohort, but now it is a string label showing the category that the cohort represents.

Figure 10.3 Channel churn rates with confidence intervals (output of listing 10.2)

In the category cohort plot, you are looking at not only the expected universal churn rates but also the best- and worst-case estimates, so you should use the confidence intervals as a guide when comparing the category churn rates. This kind of comparison is known as testing for statistical significance.

DEFINITION The difference between the churn rates in two different categories is statistically significant if the best-case churn rate (lower confidence interval) for one category is greater than the worst-case churn rate (upper confidence interval) for the other category. In that case, the two confidence intervals do not overlap.
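In code, the non-overlap test is trivial. The following is a hypothetical helper (not one of the book’s listings) that applies the definition to the interval endpoints of two cohorts:

def intervals_do_not_overlap(lo_a, hi_a, lo_b, hi_b):
    # True when one interval ends before the other begins, meaning the churn rate
    # difference is statistically significant by the strict definition
    return hi_a < lo_b or hi_b < lo_a

For example, intervals_do_not_overlap(0.030, 0.035, 0.045, 0.054) returns True because the first interval ends below where the second begins.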

Considering figure 10.3, you would say that the difference between the churn rates for the appstore1 and appstore2 categories is statistically significant because the confidence intervals are far apart. The worst-case churn rate for appstore2 is around 3.5%, and the best-case churn rate for appstore1 is around 4.5%, so the two are not touching.

But the difference between the churn rates for the appstore1 and web customers is borderline for statistical significance because the confidence intervals are practically touching. The best-case churn for the web channel is around 5.4%, and the worst-case churn for appstore1 is also around 5.4%. According to a strict definition, you might say that the difference is not statistically significant. But in practice, statistical significance is not applied as a hard rule. If you have some reason to think that a difference is meaningful, you might still act on a difference in churn rates when there is a little overlap in the confidence intervals. In this case, I would say the fact that appstore2 is so different lends credibility to differences between the channels and, by extension, to the difference between web and appstore1. As you will see in figure 10.4, the confidence intervals for the web and appstore1 churn rates miss touching by just 0.02%, which you can’t tell from the figure. But whether confidence intervals overlap or miss by such a small amount shouldn’t make a difference in your interpretation.

TAKEAWAY In practice, whether a difference in churn rates is statistically significant or not is not black and white when the edges of the confidence intervals are nearly touching or overlap a little bit.

Listing 10.2 shows the code that produces figure 10.3. Listing 10.2 consists of a main function, category_churn_cohorts, that calls three helper functions:

  • prepare_category_data—Loads the data and fills any missing categories with the string '-na-'. This string clearly marks any customers that are missing a category.

  • category_churn_summary—Calculates the churn rates and the confidence intervals and puts all the results in a DataFrame, which is saved as a .csv file. (Details on the calculation follow the listing.)

  • category_churn_plot—Plots the results in a bar chart, showing the confidence intervals and adding annotations. Confidence intervals are added by setting the yerr param of the bar function, which stands for y error bar.

Listing 10.2 Analyzing category churn rates with confidence intervals

import pandas as pd
import matplotlib.pyplot as plt
import os
import statsmodels.stats.proportion as sp

def category_churn_cohorts(data_set_path, cat_col):
   churn_data = prepare_category_data(data_set_path, cat_col)
   summary = category_churn_summary(churn_data, cat_col, data_set_path)
   category_churn_plot(cat_col, summary, data_set_path)

def prepare_category_data(data_set_path, cat_col):
   assert os.path.isfile(data_set_path), \
      '"{}" is not a valid dataset path'.format(data_set_path)
   churn_data = pd.read_csv(data_set_path, index_col=[0, 1])
   churn_data[cat_col].fillna('-na-', inplace=True)
   return churn_data

def category_churn_summary(churn_data, cat_col, data_set_path):
   summary = churn_data.groupby(cat_col).agg(
      {
         cat_col: 'count',
         'is_churn': ['sum', 'mean']
      }
   )

   intervals = sp.proportion_confint(summary[('is_churn', 'sum')],
                                     summary[(cat_col, 'count')],
                                     method='wilson')

   summary[cat_col + '_percent'] = (
      1.0 / churn_data.shape[0]) * summary[(cat_col, 'count')]

   summary['lo_conf'] = intervals[0]
   summary['hi_conf'] = intervals[1]

   summary['lo_int'] = summary[('is_churn', 'mean')] - summary['lo_conf']
   summary['hi_int'] = summary['hi_conf'] - summary[('is_churn', 'mean')]
   save_path = data_set_path.replace('.csv', '_' + cat_col + '_churn_category.csv')
   summary.to_csv(save_path)
   return summary

def category_churn_plot(cat_col, summary, data_set_path):
   n_category = summary.shape[0]

   plt.figure(figsize=(max(4, .5 * n_category), 4))
   plt.bar(x=summary.index,
           height=summary[('is_churn', 'mean')],
           yerr=summary[['lo_int', 'hi_int']].transpose().values,
           capsize=80 / n_category)
   plt.xlabel('Average Churn for "%s"' % cat_col)
   plt.ylabel('Category Churn Rate')
   plt.grid()
   save_path = data_set_path.replace('.csv', '_' + cat_col + '_churn_category.png')
   plt.savefig(save_path)
   print('Saving plot to %s' % save_path)

Main function for the category analysis and plot

Helper function prepare_category_data reads the dataset.

Calls category_churn_summary to perform the analysis

Calls category_churn_plot to make the plot

Fills any missing values with a string '-na-'

Uses category_churn_summary to analyze the categories

Uses the Pandas aggregation function to group data by the category

Calculates the confidence intervals

Divides the category count by the total number of rows

Copies the results into the summary DataFrame

Lower confidence interval = mean minus lower confidence bound

Upper confidence interval = upper confidence minus mean

Saves the result

Uses category_churn_plot to plot the result

Scales the size of the plot based on the number of categories

The percentage of churns is the bar height.

The Y error bar is given by confidence intervals.

Annotates the figure and saves it

You should run the Python wrapper program to produce your own plot like figure 10.3 for the simulated dataset. The command and its arguments to the wrapper program are

fight-churn/listings/run_churn_listing.py --chapter 10 --listing 2

Turning to the details of the calculation of the cohort churn rates in listing 10.2, the average churn rate is calculated in the category_churn_summary function, using the Pandas DataFrame groupby and agg functions:

summary = churn_data.groupby(cat_col).agg({cat_col:'count','is_churn': ['sum','mean']}) 

The following breaks down the details of this dense line:

  1. The groupby function is called with the category as the grouping variable. The result of this function is a specialized DataFrameGroupBy object that can be used to retrieve different results based on the grouping.

  2. After grouping, the desired measures are found by calling the aggregation function agg on the DataFrameGroupBy object. The results to be created are specified in a dictionary in which each key is a column on which to calculate aggregates, and the value for each key is one or more aggregate functions. In this case, you use the following:

    {
       cat_col: 'count',
       'is_churn': ['sum', 'mean']
    }
    • The first entry in the dictionary indicates that the column containing the category (the variable cat_col) should be aggregated with a count. For every category, show the number of rows in the dataset that had the category.

    • The second entry in the dictionary indicates that the column containing the churn indicator should be aggregated by summing the number of churns and also calculating the mean, which results in the observed churn rate for the category.

The result of the call is a DataFrame with one row per category and columns containing the three aggregation results. The columns are labeled by tuples combining the column name and the aggregation. For example, the column labeled (cat_col, 'count') contains the row count for each category, and the column labeled ('is_churn', 'mean') contains the mean of the churn indicator, which is the churn rate.
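The following small, self-contained sketch (with made-up rows rather than the simulation data) shows the shape of such an aggregation result and how the tuple labels select columns. It aggregates only the is_churn column, which is a slight simplification of the line in listing 10.2:

import pandas as pd

# Made-up example rows for illustration only
churn_data = pd.DataFrame({
    'channel': ['web', 'web', 'appstore1', 'appstore1', 'appstore1'],
    'is_churn': [1, 0, 0, 0, 1],
})

summary = churn_data.groupby('channel').agg({'is_churn': ['count', 'sum', 'mean']})

print(summary[('is_churn', 'count')])   # observations per category
print(summary[('is_churn', 'mean')])    # measured churn rate per category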

The function category_churn_summary in listing 10.2 uses the statsmodels module to calculate the confidence intervals. The function used is statsmodels.stats.proportion.proportion_confint, which calculates confidence intervals on percentages measured from binary trials (which is what measuring churn rates amounts to, from a statistician’s point of view). The function proportion_confint takes as parameters the number of churns and the number of observations in each category (passed by selecting from the aggregation result DataFrame, using the tuple labels I’ve described).

As mentioned earlier in the chapter, the number of observations and the number of churns form the basis for calculating the confidence intervals. The call to proportion_confint also passes the optional method parameter method='wilson'. The Wilson method for calculating confidence intervals is the best choice for churn because it is known to produce the most accurate results when the proportion of events (in this case, churns) in the binary trials is small. I won’t go into details on how the Wilson method calculates confidence intervals, but there are many good resources online.

Figure 10.4 shows the data file output from the category churn cohort analysis with confidence intervals. This output contains all the information used to produce the channel cohort bar chart (figure 10.3) and more details. One important piece of information available in this file and not in the bar chart is the percentage of observations from each channel. Most organizations that acquire customers through different channels already have a good idea of the percentage of customers acquired through each channel. In such a case, you should compare the number in your dataset with the number measured by the sales department for quality assurance (to make sure that there are no problems in the data feed and so on).

Figure 10.4 Data output of category churn cohorts for the channel field

The file output of listing 10.2 also shows the sizes of the lower and upper confidence intervals. In figure 10.4, you can see that the upper interval is a little larger than the lower interval. This asymmetry occurs because the churn probability is a small percentage. If the churn rate were 50%, the confidence intervals would be symmetric.

The other demographic field with categories in the simulated social network is the country. Figure 10.5 shows the churn cohort plot for the country categories. The churn cohort results for the country differ from those for the channel because there are many more countries. Because some of the countries account for only a small percentage of the customers, some of the confidence intervals are large compared with the churn rates. In fact, as a result of the large confidence intervals, there are no statistically significant churn rate differences among the countries: every country’s confidence interval overlaps the others by a large amount. (Figure 10.5 shows no borderline cases in which the confidence intervals overlap only slightly.)

Figure 10.5 Country cohort churn rates with confidence intervals (output of listing 10.2)

Figure 10.6 displays the data file output for the country cohort churn analysis. It shows that most countries have less than 10% of the data and that some have as little as 1%. The countries with the smallest number of customer observations have the largest confidence intervals for the churn rate. SE had just 1% of the observations (236 observations), and with a measured churn of 5.9%, the lower end of the confidence interval is 3.6% and the upper end is 9.7%: a span of around 6%. US, on the other hand, represents 15% of the observations (3,710 observations) with a similar observed churn rate of 5.3%, and the confidence intervals range from 4.7% to 6.1%—a span of only 1.5%.

Figure 10.6 Data output of category churn cohorts for the country field

The results in figures 10.5 and 10.6 show that having too many categories is a problem for doing an effective churn cohort analysis. Section 10.3 teaches you a simple and effective way to deal with this problem.

The significance level of the confidence intervals

The function proportion_confint has another parameter: the significance level, which I leave at the default value in my code. If you check the documentation for proportion_confint, you will find that the default significance level is 0.05. This parameter corresponds to what people call the 95% confidence level and represents the degree of certainty that the true universal churn rate is within the range defined by the best- and worst-case estimates.

Like most things in statistics, the best- and worst-case churn rates are estimates, and the significance level determines the possibility that these estimates are also wrong. When people say “95% confidence,” they’re saying 100% minus this significance level. In other words, there is a 5% chance that the true universal churn rate is not within the stated bounds and a 95% chance that the universal churn rate is within the bounds.

Lowering the significance-level parameter below 0.05 results in larger confidence intervals, that is, a larger difference between the best-case and worst-case estimates. If you use a lower significance level, it takes a larger difference between the churn rates for two categories to qualify as statistically significant (by having confidence intervals that do not touch). On the other hand, a higher significance level (greater than 0.05) makes for smaller confidence intervals, so it is easier to say that differences are statistically significant, but you will be less sure that the universal churn rate for each category is within the stated bounds.

Choosing the significance level and interpreting confidence intervals is a controversial topic in statistics, and I’m trying to give you some simple best practices. My advice is to leave the significance parameter as the default. In principle, you should use a lower significance level for a demographic field that has a large number of categories (more than a few dozen). That way, you would apply more stringent criteria in determining which differences are significant.

In section 10.3, I will teach you another way of handling a large number of categories: grouping those that are less common. Overall, my advice is to leave this parameter unchanged. I mention it here only because you might be asked what significance level you used to calculate the confidence intervals. (The answer is that you used the standard 0.05 significance level.)

10.3 Grouping demographic categories

In section 10.2.3, I showed you that if you have a lot of categories, you run the risk that the number of observations in the rare categories will be too small to produce useful results. With few observations, the confidence intervals can become large, depending on the amount of data you have to work with. If you have millions of customers, you can have statistical significance for the results in even the rarest categories. Still, information overload can be a problem, and it can be desirable to look at fewer categories for that reason as well.

10.3.1 Representing groups with a mapping dictionary

The solution to the problem of having a lot of categories that represent small fractions of the data is to group rare categories that are related. Countries can be grouped into regions, for example. Figure 10.7 illustrates mapping countries into regions by using a Python dictionary. Note that the dictionary in figure 10.7 actually maps each region to a list of countries, because that direction is a more compact way to express the relationship.

Figure 10.7 Mapping dictionary grouping simulated country categories into regions

The code on which figure 10.7 is based is in the GitHub repository for this book, in the file fight-churn/listings/conf/socialnet_listings.json; look for the chapter 10 section and the key listing_10_3_grouped_category_cohorts. I’ll say more about how and why this particular mapping was chosen later, but for now, I will show you how this kind of grouping helps with the category cohort analysis.
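For reference, an abbreviated, hypothetical version of such a mapping dictionary might look like the following (the region names and country lists here are illustrative; the exact grouping used for the simulation is in the JSON configuration file mentioned above):

country_groups = {
    'NoAm': ['US', 'CA'],
    'LaAm': ['MX', 'BR', 'AR'],
    'Europe': ['DE', 'FR', 'GB', 'SE'],
    'APac': ['JP', 'AU', 'IN'],
}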

10.3.2 Cohort analysis with grouped categories

Figure 10.8 shows the result of rerunning the cohort analysis based on regions instead of countries. As a result of the grouping, there are six categories. If you look at the data output that goes with the plot (not shown in the figure), you will see that every one of the new categories represents no less than 10% of the data; the smallest category is now the customers who do not have any country (-na-), which is 11%. As a result of the larger number of observations, the size of the confidence interval on every category in figure 10.8 is smaller than when the countries were separate (figure 10.5).

TAKEAWAY If your demographics include rare categories, you can simplify by grouping related categories. This approach reduces the churn rate confidence intervals and information overload.

Despite the smaller confidence intervals, figure 10.8 shows no statistically significant differences between the churn rates in any region. The confidence intervals around the churn rate in every region overlap significantly with all the others. The fact that there is no statistically significant difference in this simulated dataset doesn’t mean that you won’t find important relationships in your own product or service.

Figure 10.8 Churn cohorts for country categories grouped in regions

Listing 10.3 provides the code for performing the grouping and rerunning the category cohort analysis. This listing uses all the helper functions from the category cohort churn analysis (without grouping), and it adds only one new function to perform the grouping: group_category_column. This function has two main parts:

  • The first part inverts the mapping dictionary so that it maps country to region rather than region to countries. Inverting the dictionary can be done in a Python one-liner, using a dictionary comprehension with two for clauses. The first for clause iterates over the keys (the regions), and the second iterates over the values under each key (the countries). The result is a dictionary mapping the old values to the old keys (country to region).

  • After the mapping dictionary has been inverted, a new column is created in the DataFrame, using the DataFrame apply function. The apply function takes another function as a parameter, and that function is applied to all the elements in the column. In this case, the purpose is to look up the value in the inverted dictionary if one is present; otherwise, it returns the original value. The result of applying this function to the column is that every country that is part of one of the region groups will be mapped, and any country that is not will be copied as is. After this mapping, the code in listing 10.3 uses the analysis and plotting functions from listing 10.2, which did category cohort analysis on ungrouped categories.

The group_category_column function names the new column by appending '_group' to the original column name and drops the original column from the result.

Listing 10.3 Grouped category cohort analysis

import pandas as pd
import os

from listing_10_2_category_churn_cohorts import category_churn_summary, \
   category_churn_plot, prepare_category_data

def grouped_category_cohorts(data_set_path, cat_col, groups):
   churn_data = prepare_category_data(data_set_path, cat_col)
   group_cat_col = group_category_column(churn_data, cat_col, groups)
   summary = category_churn_summary(churn_data, group_cat_col, data_set_path)
   category_churn_plot(group_cat_col, summary, data_set_path)


def group_category_column(df, cat_col, group_dict):
   group_lookup = {
                     value: key for key in group_dict.keys()
                                for value in group_dict[key]
                  }
   group_cat_col = cat_col + '_group'
   df[group_cat_col] = df[cat_col].apply(lambda x:
                                group_lookup[x] if x in group_lookup else x)

   df.drop(cat_col, axis=1, inplace=True)
   return group_cat_col

This listing reuses the helper functions from listing 10.2.

The main function is mostly the same as the regular category plot.

Calls helper function to map the category column to groups

Helper function from listing 10.2 analyzes the categories.

This function maps the categories into groups, using the mapping dict.

Inverts the dictionary

Makes a new name for the group column

Transforms data with the DataFrame apply method and lambda

Drops the original category column

Returns the new column name as the result

You should run listing 10.3 to create your own cohort analysis where the countries are grouped into regions. Do this with the usual command to the Python wrapper program and these arguments:

fight-churn/listings/run_churn_listing.py --chapter 10 --listing 3

You should get a result that is qualitatively similar to figure 10.8, but don’t expect the specific churn rates in each group to match when you create your own version. The reason is that in the simulation, the countries have no relationship to churn and engagement; the assignments are random. (Believe me: I know because I created the simulation.) Although you should get confidence intervals of similar size to those in figure 10.8, don’t expect to get the same churn rates.

NOTE For the most part, this book has avoided having you analyze anything in the simulation that did not relate to churn in some way, to save you the time of generating and exploring meaningless data. But in real data from actual products and services, you should expect to find both events and demographic information that are unrelated to customer retention and churn.

WARNING Do not take the results from the social network simulation from the book’s GitHub repository as a guide to what you can expect from your own product or service. The examples are a realistic-looking set of data for the purpose of demonstrating the methods to use on real data, but nothing more. The simulated results cannot be expected to predict the results for any real product or service.

10.3.3 Designing category groups

Now that you know how to implement category groupings for a cohort analysis, I will give you some advice on how to pick such groupings. First, consider the scenario in which you do not have a lot of data, so you are grouping categories to get enough observations in your cohorts (and thus reasonable-size confidence intervals around the churn rates). If that is your situation, you don’t have the option of a data-driven approach; not having enough data to analyze the differences between the categories is exactly the problem. In this case, you should group categories based on your knowledge of how the categories relate to one another. Apart from the country-region example, some sensible groupings you may want to use include the following:

  • If you have a lot of categories for operating system versions, you can group them by major releases.

  • If you have categories for industry sectors, you can group related ones such as banking and finance in one group and consumer products and retail in another.

  • If you have categories for occupations, you can group related fields such as doctors and dentists in one group and software engineers and data scientists in another.

  • If you have categories for education levels, you can group rare ones such as master’s degrees, doctorates, and so on.

Remember that your goal is to group the rare categories in a reasonable way and try to get a sense of any relationships. If you find some relationships, you can always revise your grouping to take advantage of the structure you discovered (as described later in this section).

Also note that you don’t have to slavishly follow standard definitions of groups: you should customize them based on the details of your product or service. In my own mapping from country to region, I made the following editorial decisions:

  • I didn’t include China (CN) in the Asia Pacific (APac) group because China alone represented more than 10% of the data samples, which is enough on its own.

  • I chose to include Mexico (MX) with Latin America (LaAm) rather than North America because if this were a real social network, I would expect language and culture to be more strongly related to engagement than geography. (If my product or service had to do with industrial manufacturing and transportation, I probably would have focused on geographical rather than cultural relationships.)

These are examples of some of the considerations you might want to use. My last piece of advice on the subject follows.

WARNING Do not overthink your category groups or spend too much time on them. Remember the need for agility in your analysis. Do something that gives you a manageable result for a first pass, take feedback from your business colleagues, and iterate from there.

On the other hand, suppose that you have enough data to get narrow confidence intervals around every churn rate, and your problem is information overload from too many categories (or you reached a similar point after your first attempt at grouping). In that case, you can take a more data-driven approach:

  • Run the category cohort analysis on the ungrouped categories; then use the churn rates you see from the first iteration to decide on groups to use in a second iteration:

    • Group categories that are related according to your knowledge and that have similar churn rates.

    • In this context, a similar churn rate means that the two categories do not have a statistically significant difference in their churn rates. (Confidence intervals overlap.)

    • If the two churn rates are different by a statistically significant amount (confidence intervals do not overlap), do not group them, even if you know that the categories are related.

  • You should still use groups based on knowledge as described. Do not group categories only on the grounds that the two categories have similar churn rates or other metrics.

You can also use the correlation analysis described in section 10.5 as an additional way to assess the similarity between your groups based on their relationship to other metrics. But as you will see, the grouping algorithm you used for metrics does not work for categories, and I do not recommend using an automated method for this kind of grouping.

If you have too many categories to handle by designing a grouping scheme from your knowledge (hundreds or thousands of categories), chances are that the information is not going to be helpful in your fight against churn. The businesspeople probably wouldn’t segment customers into such confusing categories.

10.4 Churn analysis for date- and numeric-based demographics

As I mentioned earlier, you should look at numeric demographic information with cohorts the same way that you do metrics. In section 10.1, I taught you that date type demographic and firmographic information can easily be converted to numeric intervals, so you can also use metric-style cohort analysis with date type demographic data. Because you learned how to analyze numeric customer data in chapter 5, this section is going to be a short demonstration.

The demographic information for the social network simulation includes the date of birth that the customer entered when they signed up, and listing 10.1 converted this date to a numeric field in the social network simulation dataset: customer_age. Figure 10.9 shows the result of running a standard metric cohort analysis on customer age. The figure shows that in the social network simulation, higher customer age is associated with higher churn. The lowest age cohort, with an average age around 15 years, has a churn rate around 4%, whereas the higher age cohorts (older than 60 years) have an average churn rate around 5.5%. The change in churn rates across cohorts is a little irregular, but it is consistent with the finding that older customers churn more (although the effect is weak compared with the influence of customer behavior demonstrated in chapters 5 and 7).

Figure 10.9 Customer demographic age cohort analysis

To create your own version of figure 10.9 from the data you simulated, you must reuse the metric cohort listing 5.1 (with the filename listing_5_1_cohort_plot.py). The configuration already has a version that you run as follows:

run_churn_listing.py --chapter 5 --listing 1 --version 17

Your result can be somewhat different from figure 10.9 because the relationship is not strong and the data is randomly simulated. This example demonstrates that after you extract demographic information in numeric format in your dataset, you can analyze it with cohorts the same way that you would a metric.

Confidence intervals for metric cohorts

I interpret metric cohorts based on the consistency of the trend, but by now, you have probably realized that you could add confidence intervals around every point in a metric cohort plot. I don’t do this normally because it makes the plots too cluttered to show to businesspeople, and it’s usually not necessary for interpretation of the relationship to churn. But confidence intervals can help interpret metric cohort plots when the trend and significance are weak. Here’s one strategy I have used:

  1. Divide the metric into three cohorts. You are comparing customers who are low versus medium versus high in the metric. Large groups help make narrow confidence bounds.

  2. Plot the cohort averages with confidence bounds, and see whether the confidence bounds overlap. If the confidence intervals do not overlap, statistically significant differences exist between customers who are low versus medium versus high in the metric.

I leave this exercise to interested readers.
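For interested readers, the following is a minimal sketch of that strategy, reusing the dataset from listing 10.1 and the same confidence interval function as listing 10.2 (the function below is an illustration, not part of the book’s repository):

import pandas as pd
from statsmodels.stats.proportion import proportion_confint

def metric_tercile_churn(data_set_path, metric_col):
    churn_data = pd.read_csv(data_set_path, index_col=[0, 1])

    # Split the metric into three equal-size cohorts: low, medium, and high
    churn_data['cohort'] = pd.qcut(churn_data[metric_col], q=3,
                                   labels=['low', 'medium', 'high'])

    summary = churn_data.groupby('cohort')['is_churn'].agg(['count', 'sum', 'mean'])

    # Wilson confidence intervals around each cohort's churn rate
    lo, hi = proportion_confint(summary['sum'], summary['count'], method='wilson')
    summary['lo_conf'] = lo
    summary['hi_conf'] = hi
    return summary

If the confidence intervals of the low and high cohorts do not overlap, the metric has a statistically significant relationship to churn.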

10.5 Churn forecasting with demographic data

You have learned the techniques to analyze single demographic fields for their relationship to customer churn and retention. As with metrics, you may want to look at the influence on churn of all your demographic fields together to see how the combination predicts churn. Also, you should test forecasting with the demographic or firmographic data combined with your metrics. To do that, you need to convert string demographic information to an equivalent numeric form because the regression and XGBoost forecasting algorithms that you learned accept only numeric inputs.

10.5.1 Converting text fields to dummy variables

To use your string-type demographic information for forecasting, you will convert it to numeric data by using a technique known as dummy variables.

DEFINITION A dummy variable is a binary variable that represents membership in a category, with 1 representing all customers in the category and 0 representing all customers that are not in the category.

If you studied data science in a computer science or engineering program, you may have learned about this technique, called one-hot encoding.

Figure 10.10 shows the process of creating dummy variables. Using dummy variables is similar to flattening metric data to create a dataset. In this case, a string demographic field is a tall data format in the sense that all the possible categories are stored in one column (using strings). To replace the column of strings with numeric data, you add one dummy variable column per unique string in the original data. Each column is the dummy variable for one string category: all the customers who had a particular string get 1 in the column for that category and 0 in all the other columns. Then you drop the original string column and are left with a purely numeric dataset that still represents the same category information as the dataset that included strings.

Figure 10.10 Flattening a string variable to dummy variable columns
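
To make the flattening concrete, here is a toy example of the same idea; the column name (channel) and the category strings are made up for illustration, and the dtype=int argument simply keeps the dummies as 0/1 integers.

import pandas as pd

# Toy "tall" format: one string column holding each customer's category
df = pd.DataFrame({'channel': ['web', 'appstore1', 'web', 'appstore2']})

# One 0/1 column per unique string; dummy_na=True adds a column for missing values
dummies = pd.get_dummies(df, dummy_na=True, dtype=int)

# The columns are now channel_appstore1, channel_appstore2, channel_web,
# and channel_nan; the original string column has been dropped
print(dummies.columns.tolist())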

Figure 10.11 shows the result of creating dummy variables for the social network simulation. You can see that the string category labels for the channel and country are removed from the dataset. Instead, a set of new columns containing only zeros and ones represents the categories. Figure 10.11 also shows dummy variable columns for the country field grouped into regions, as they were earlier. The countries are still grouped because the same concerns about an overabundance of sparsely populated categories apply to forecasting, the same way that they did when you were looking at the country alone.

Figure 10.11 Result of creating dummy variables for the simulated social network dataset

Listing 10.4 provides the code to create a dataset with dummy variables like the one in figure 10.11. Creating dummy variables is handled by a standard Pandas function, get_dummies, which automatically detects all the string-type columns in your dataset and replaces them with the appropriate binary dummy variables. The names of the dummy variable columns are created by concatenating the original column name with the category string.

Listing 10.4 Creating dummy variables

import pandas as pd

from listing_10_3_grouped_category_cohorts import group_category_column

def dummy_variables(data_set_path, groups={}, current=False):
    raw_data = pd.read_csv(data_set_path,
                           index_col=[0, 1])

    for cat in groups.keys():
        group_category_column(raw_data, cat, groups[cat])

    data_w_dummies = pd.get_dummies(raw_data, dummy_na=True)

    data_w_dummies.to_csv(
        data_set_path.replace('.csv', '_xgbdummies.csv'))
    new_cols = sorted(list(set(
        data_w_dummies.columns).difference(set(raw_data.columns))))
    cat_cols = sorted(list(set(
        raw_data.columns).difference(set(data_w_dummies.columns))))
    dummy_col_df = pd.DataFrame(new_cols, index=new_cols, columns=['metrics'])
    dummy_col_df.to_csv(
        data_set_path.replace('.csv', '_dummies_groupmets.csv'))
    if not current:
        new_cols.append('is_churn')
    dummies_only = data_w_dummies[new_cols]
    save_path = data_set_path.replace('.csv', '_dummies_groupscore.csv')
    print('Saved dummy variable (only) dataset ' + save_path)
    dummies_only.to_csv(save_path)

    raw_data.drop(cat_cols, axis=1, inplace=True)
    save_path = data_set_path.replace('.csv', '_nocat.csv')
    print('Saved no category dataset ' + save_path)
    raw_data.to_csv(save_path)

Imports the group category mapping function from listing 10.3

Reads in the raw data

Keys of the mapping dictionary are the categories to be mapped.

Calls the group mapping function

Uses the Pandas get_dummies function

This version of the dataset is for XGBoost forecasting.

Determines the dummy variable columns by set difference

Determines the original category columns by set difference

Saves a list of columns for consistency with grouped datasets

Includes churn if not being used for current customers

Saves a dataset with only dummy variables

Names the dataset consistently with a regular dataset

Saves the dataset with no demographic categories

Calling the Pandas function get_dummies is not all that happens in listing 10.4. First, the listing applies the optional grouping of categories that you learned in section 10.3. Then the dataset is saved in three versions: one with the original metrics and any numeric demographic information, one with only the dummy variables, and one with everything together. Each version has a purpose, as follows:

  • The metrics and numeric demographic information must be converted to scores and run through the metric grouping algorithm. This process should happen without the dummy variables.

  • Saving the dummy variables by themselves facilitates running a regression analysis on the dummy variables alone.

  • The version with everything together is for XGBoost, which uses the untransformed metrics together with the dummy variables.

These points are explained further throughout the rest of this chapter; for now, I will focus on the rest of listing 10.4. The code is mostly a mechanical use of the Pandas library to separate out the parts of the dataset. The only trick is using set differences to figure out which columns were added (and which were removed) when the dummy variables were created.
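
As a toy illustration of that trick (the column names here are made up), comparing the column sets before and after get_dummies is all it takes:

# Columns before and after get_dummies (made-up names for illustration)
before = {'like_per_month', 'channel', 'country'}
after = {'like_per_month', 'channel_web', 'channel_appstore1', 'country_US'}

new_cols = sorted(after - before)   # dummy columns added by get_dummies
cat_cols = sorted(before - after)   # original string columns that were replaced
print(new_cols)   # ['channel_appstore1', 'channel_web', 'country_US']
print(cat_cols)   # ['channel', 'country']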

Listing 10.4 saves multiple versions of the dataset with different filename extensions:

  • The file with the postfix _dummies_groupscore is the dataset with only the dummy variables. It uses the groupscore postfix because that convention is expected when you run the regression code on it. A listing of its columns is also saved with the postfix _dummies_groupmets because that, too, is expected by the regression code, even though there are no groups among the dummy variables.

  • The file with the postfix _nocat is the file with the numeric metrics and demographic fields. This file is simply saved and will be run through the usual scoring and grouping.

  • The file with the postfix _xgbdummies will be reloaded by the XGBoost cross-validation.

You should run listing 10.4 to create your own version of the dataset with the string categories replaced by dummy variables (and the files described previously). If you are using the Python wrapper program, use the usual form of the command and these arguments:

fight-churn/listings/run_churn_listing.py --chapter 10 --listing 4

Your results should look similar to figure 10.11, although the specific accounts and their demographics will be different because the data is randomly generated.

10.5.2 Forecasting churn with categorical dummy variables alone

Now that you have a dataset with demographic dummy variables, it is instructive to try churn forecasting in a regression model with the demographic data alone. This exercise is intended to increase your understanding of the combined influence of the demographic variables on churn probabilities. As you will see, if you want to forecast churn as accurately as possible, use the demographic dummy variables and the metrics together, as described in section 10.5.4.

If you run a regression cross-validation and then fit the model at the optimal C parameter, you get the results shown in figure 10.12. The results show that the demographic dummy variables are weakly predictive of churn. The best AUC measurement found in the cross-validation is around 0.56, and the maximum lift is around 1.5. If you recall from chapter 9, the regression using metrics resulted in an AUC higher than 0.7 and a lift higher than 4.0. You can use a low value of the C parameter, which removes most of the dummy variables without significantly affecting the AUC, but the lift is best with a higher value of the C parameter: 0.32 or greater.

Figure 10.12 also shows the regression coefficients and the impact on retention probability with the C parameter set to 0.32. The dummy variables for the two appstore channels are assigned fairly large weights, which translate into positive (churn-reducing) retention impacts of 1.2% and 2.8%, respectively. The web channel gets zero weight even though it has the highest churn: because both of the other channels have a positive impact, the zero weight makes web the default, or baseline, and the other categories represent improvements relative to it.

The get_dummies function also created a variable for when the channel is not available (nan), and this dummy also got zero weight because in this dataset, every customer has a channel assigned. (Pandas makes a nan column for every string variable when the dummy_na parameter is set to True.) These effects are in line with the churn-rate differences you saw in the category cohort plot (figure 10.3).

Figure 10.12 also shows much smaller coefficients and retention impacts for the country group dummy variables. In this case, CN, Eur, and the missing data have a slight positive retention impact (churn rate lower), and LaAm and APac have a negative retention impact (churn rate higher). Again, these results are in line with what you saw in the cohort plot for the country groups (figure 10.8).

Figure 10.12 Regression results with a dummy category variable dataset

Figure 10.12 was created from the listings from previous chapters, and there are already versions of the configuration prepared for you to do this. To create the regression cross-validation chart from figure 10.12, use the command for regression cross-validation, version 2, as follows:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 5 --version 2

To find the coefficients with the C parameter fixed at 0.32, use the command to run the regression with a fixed value of C:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 4 --version 4

Your result for cross-validation should be similar to figure 10.12, and so should your result for coefficients on the channels, which are randomly assigned to customers but in such a way that they produce consistent results in the simulation. You may get different results for the small weights and impact of the country group because in the simulation they are random.

10.5.3 Combining dummy variables with numeric data

In earlier sections, I mentioned that you cannot use the type of grouping that you use for metrics when you are working with dummy variables derived from categories. Instead, I suggested separating the dummy variables from the metrics and processing the metrics as usual. In this section, I explain why and walk through the process. I start with some facts about correlations involving dummy variables, because they make it clear why you should not group categorical dummy variables along with the metrics.

Figure 10.13 shows the portion of the correlation matrix from the social network simulation that relates the demographic categories for channel and country to one another and to the metrics. (You haven’t run the code to create this correlation matrix yet, but you will soon.) The portion of the correlation matrix with metric-to-metric correlations is omitted in figure 10.13. One distinctive feature that might surprise you is that the dummy variables from each field are negatively correlated with the other dummy variables from the same field. This is especially true for the channel field, which has only three categories; there the correlation is as low as -0.74. For the country groups, the negative correlations between the regions are around -0.2.

Figure 10.13 Correlation matrix for the social network simulation demographic categories

The negative correlations between the categories are due to the exclusive nature of category membership: a customer in one category gets 1 for that category’s dummy variable and must have 0 for the other dummy variables from the same field. That exclusivity produces a negative measured correlation by the definition of the correlation coefficient: when one dummy variable takes a high value (1), the others take low values (0). This also explains why the kind of grouping you used for the metric variables will not group demographic categories from the same demographic field: that algorithm uses high correlation to indicate that variables belong in the same group.
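
You can see the effect in a toy example: dummy variables created from a single field (the channel names here are made up) are negatively correlated with one another by construction.

import pandas as pd

# Toy single-field example with made-up channel names
channel = pd.Series(['web', 'web', 'appstore1', 'appstore2', 'web', 'appstore1'])
dummies = pd.get_dummies(channel, dtype=int)

# Membership is exclusive: a 1 in one dummy forces 0s in the others,
# so every pairwise correlation within the field is negative
print(dummies.corr().round(2))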

Considering the rest of figure 10.13, the demographic category dummy variables are mostly uncorrelated with the metrics, but there are a few exceptions:

  • The channels appstore1 and web have negative correlation with messages and replies.

  • The channel appstore2 has positive correlation with messages and replies.

  • The channel web also has positive correlation with posts.

When you use demographic categories to understand customer churn and retention, it can be worthwhile to look at the correlation matrix using the dummy variables, because it can reveal things about how different groups of your customers use the product. But you should not group the demographic dummy variables with your metric groups, even when they are correlated.

TAKEAWAY The correlations between demographic dummy variables and other metrics can help you understand your customers better, but you should not group dummy variables with other dummy variables or with metrics.

Back in chapter 6, I advised you to use correlation between metrics as a way of assessing the relatedness of the metrics and determining which should be grouped. But there are a few reasons why this same approach doesn’t carry over with dummy variables created from demographic categories:

  • You can calculate correlation coefficients for 0/1 binary variables, but correlation coefficients are not meant for that purpose. In statistics, other measures are better suited to assessing the relatedness of binary variables. When you calculate a correlation coefficient with your dummy variables, it is not as good a measure of relatedness as a correlation between metrics.

  • The demographic categories are not related in the same way as behaviors that you group by using correlation. When two behaviors (such as using two product features) are correlated, usually they are part of a single activity or process. Therefore, it is reasonable to represent the overall process with an average of the scores, which is not normally the case for a demographic category and any other metric.

For these reasons, my advice is that if you want to use demographic dummy variables to forecast churn, you should keep all dummy variables separate from the groups.

TAKEAWAY Run the metrics and numeric demographic fields through a standard preparation process without demographic dummy variables, and then combine them with dummy variables at the end.

The resulting combined dataset is illustrated in figure 10.14.

Figure 10.14 Metric groups, metric scores, and categories in one dataset

To create your own dataset like the one in figure 10.14, the first step is running the data-preparation process that you learned in earlier chapters on the version of the dataset that has the metrics and numeric demographic information. There is a version of the listing configuration prepared for you to do that with one command. Recall that listing 8.1 (with the filename listing_8_1_prepare_data.py) is the combined data-preparation function; this is the third use of it (version 3):

fight-churn/listings/run_churn_listing.py --chapter 8 --listing 1 --version 3

After processing the metrics, combine them with the dummy variables. The new function shown in listing 10.5 is a straightforward Pandas DataFrame manipulation. The group scores produced from the metrics are merged with the file of dummy variables. The merge is performed with the Pandas DataFrame merge function, using the indices of both DataFrames to perform an INNER JOIN. The final step in listing 10.5 combines the DataFrame that lists the group metrics with the names of the dummy variables; that file will be expected by the code that runs the regression on the combined dataset.

Listing 10.5 Merging dummy variables with grouped metric scores

import pandas as pd

def merge_groups_dummies(data_set_path):

    dummies_path = data_set_path.replace('.csv', '_dummies_groupscore.csv')
    dummies_df = pd.read_csv(dummies_path, index_col=[0, 1])
    dummies_df.drop(['is_churn'], axis=1, inplace=True)

    groups_path = data_set_path.replace('.csv', '_nocat_groupscore.csv')
    groups_df = pd.read_csv(groups_path, index_col=[0, 1])

    merged_df = groups_df.merge(dummies_df, left_index=True, right_index=True)
    save_path = data_set_path.replace('.csv', '_groupscore.csv')
    merged_df.to_csv(save_path)
    print('Saved merged group score + dummy dataset ' + save_path)

    standard_group_metrics = pd.read_csv(
        data_set_path.replace('.csv', '_nocat_groupmets.csv'), index_col=0)
    dummies_group_metrics = pd.read_csv(
        data_set_path.replace('.csv', '_dummies_groupmets.csv'), index_col=0)
    # DataFrame.append was removed in pandas 2.0; with a newer pandas, use
    # pd.concat([standard_group_metrics, dummies_group_metrics]) instead
    merged_col_df = standard_group_metrics.append(dummies_group_metrics)
    merged_col_df.to_csv(data_set_path.replace('.csv', '_groupmets.csv'))

Loads the file containing the dummy variable dataset

Drops the churn column

Loads the metric group scores

Merges the dummy variables and metric group scores

Saves the merged file under the name for the group scores

Loads the group metric listing from the metric-only data

Loads the listing of the dummy variables

Combines the two metric lists and saves it

You should run listing 10.5 on your own simulated social network dataset to prepare for forecasting in section 10.5.4. Issue the usual command to the Python wrapper program with these arguments:

fight-churn/listings/run_churn_listing.py --chapter 10 --listing 5

After running listing 10.5, one of the results should be a dataset like the one you saw in figure 10.14. Also, now that you have created the combined dataset, you can make a correlation matrix like the one I showed you at the start of this section (figure 10.13). Use a version of the correlation matrix listing configuration by issuing the following command with these arguments:

fight-churn/listings/run_churn_listing.py --chapter 6 --listing 2 --version 3

Running listing 6.2 with parameter configuration version 3 creates the raw data for a correlation matrix like the one shown in figure 10.13. The formatting for figure 10.13 was done in a spreadsheet program (as explained in chapter 6).

10.5.4 Forecasting churn with demographics and metrics combined

Now that you have created a dataset combining the group metric scores and the demographic category dummy variables, you can run a regression or machine learning model to forecast churn probabilities. Figure 10.15 shows the result of the regression.

The cross-validation of the C parameter shows that many of the variables can be assigned zero weight before accuracy is affected. Figure 10.15 also shows the weights resulting from the regression when the C parameter is set to 0.04. Nearly all the demographic dummy variables have zero weight and retention impact (and a few of the metrics as well).

Figure 10.15 was created by using listings from chapter 9. To run your own regression on the dataset with dummy variables and metrics combined, you can use prepared versions of the configuration. To run the cross-validation of the regression C parameter (listing 9.5) shown in figure 10.15, use the following command:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 5 --version 3

To run the regression with the C parameter fixed (listing 9.4) at 0.04 on the combined dummy variables and metrics dataset, use the command

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 4 --version 5

Those commands produce results similar to figure 10.15, although you may have different weights on the country group dummy variables because they are assigned randomly in the simulation.

You may wonder why the regression coefficients in figure 10.15 show that the channel demographic variable had no influence on the churn prediction when, earlier in the chapter, both the cohort churn analysis with confidence intervals and the regression on the dummy variables alone showed that the channel was strongly predictive of churn (and retention). What’s going on here? Is something wrong in the regression?

Figure 10.15 Regression result for dataset combining metric scores and category dummy variables

Nothing is wrong. When taken together with the behavioral metrics, the channel provides no additional information about churn, and the regression discovers this fact. The customer channels are correlated with certain behaviors, and behavior causes customer churn and retention in the simulation. When you look at the channel alone, it is related to churn rates, but when it is combined with the behavioral metrics in a regression, the regression algorithm automatically determines the most explanatory factors and removes the others. The regression correctly determines that customer engagement is best predicted by the metrics rather than the channels.

TAKEAWAY Demographic categories are often related to churn and engagement because customers from different demographics behave differently. But if you use detailed behavioral metrics, you will usually find that behaviors are the underlying drivers of retention in a predictive forecast.

I told you that understanding demographics and firmographics is a secondary method of fighting churn because behavior can (sometimes) be modified by interventions but demographics cannot (ever). The fact that demographics usually add little predictive power once behavioral metrics are included is another reason why I emphasize understanding behavior with metrics when fighting churn. But even if a demographic field is not useful for predicting churn, that does not detract from the primary use of demographics in fighting churn.

TAKEAWAY If you see a strong relationship between demographics and retention in your cohort analysis, you should try to emphasize your best demographics in your acquisition efforts. It doesn’t matter if those same demographics are not predictive of engagement in a regression with behavioral metrics.

WARNING Do not assume that your own product or service’s churn data will show exactly the same result as I presented here from the simulation. The social network simulation was designed to mimic the result that I have most commonly seen when studying customer churn, but there can always be exceptions, and your product may be one of them.

If you find that your own demographics are strongly predictive of churn, even when you have factored in behavioral metrics, you should check your data to see whether it can be improved. Make sure that all relevant customer behaviors are represented by your events and that your metrics adequately capture the relationships between your events and churn. Demographic correlations with unmeasured behaviors can lead to a result in which demographics predict churn, even when including metrics. If that’s the case, you would be better off figuring out what those behaviors are so that you can measure them and attempt to influence them for the better.

You can also test how much improvement demographic variables make for prediction with a machine learning model like XGBoost. The result of such an experiment is shown in figure 10.16. The demographic variables add around 0.005 to the AUC of XGBoost, or one-half of 1%. Figure 10.16 also shows the improvement in the regression AUC, which is even smaller (but an improvement nonetheless).

Figure 10.16 Accuracy comparison with demographic data

TAKEAWAY The highest predictive accuracy comes from XGBoost using demographic data combined with detailed customer metrics. XGBoost may find demographics more helpful in prediction than regression does.

To reproduce the XGBoost result in figure 10.16, you can run a version of the XGBoost cross-validation listing configuration with the following command (listing_9_6_crossvalidate_xgb.py):

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 6 --version 2

Note that the listing and configuration create the results for XGBoost with demographic variables. If you have been following along, you should have already found the accuracy for the other models and datasets.

10.6 Segmenting current customers with demographic data

The final subject for this chapter is how to use demographic information as part of the effort to segment customers. As the data person, you’re not responsible for defining the segments or intervening with customers, but you do need to provide the data so that the businesspeople can do their jobs effectively. The final dataset for segmenting customers should include the following elements:

  • All customers active on the most recently available date

  • Scores for metric groups

  • The original (unscaled) metric values for metrics that were not grouped

  • Categorical demographic information in string format

  • Categories grouped where appropriate

  • Churn forecast probabilities (optional)

Figure 10.17 is an example of a dataset that has all those features.

Creating such a dataset requires a few steps:

  1. Extract all the metrics and demographic information for current customers from the database.

  2. Reprocess the metric information to form groups, using the score parameters and loading matrix from the historical data.

  3. Save a version of the dataset that has all the desired features.

Note that this process also creates a dataset ready for churn probability forecasting on active customers. That version combines scores for the metrics and numeric demographic data with dummy variables for the demographic categories.

Figure 10.17 Dataset to segment customers with metric group scores, metrics, and demographic information

Listing 10.6 provides the SQL statement to extract demographic data along with all metrics for currently active customers. This listing is almost the same as similar listings in chapters 4 and 8, so I’ll explain it only briefly. The main portion of the SQL program is the aggregation that flattens the metrics. The new feature is the join on the account table, which also selects the channel, the country, and the date of birth. The date of birth is converted to a time interval representing the customer’s age in years (following the pattern used to create the historical dataset with demographic data presented earlier in this chapter).

Listing 10.6 Exporting metrics and demographic data for currently active customers

WITH metric_date AS                                               
(
   SELECT  max(metric_time) AS last_metric_time FROM metric
),
account_tenures AS (
   SELECT account_id, metric_value AS account_tenure
   FROM metric m INNER JOIN metric_date ON metric_time =last_metric_time
   WHERE metric_name_id = 8
   AND metric_value >= 14
)
SELECT s.account_id, d.last_metric_time AS observation_date,
a.channel,                                                        
a.country,                                                        
date_part('day',d.last_metric_time::timestamp                     
    - a.date_of_birth::timestamp)::float/365.0 AS customer_age,
SUM(CASE WHEN metric_name_id=0 THEN metric_value else 0 END)
    AS like_per_month,
SUM(CASE WHEN metric_name_id=1 THEN metric_value else 0 END)
    AS newfriend_per_month,
SUM(CASE WHEN metric_name_id=2 THEN metric_value else 0 END)
    AS post_per_month,
SUM(CASE WHEN metric_name_id=3 THEN metric_value else 0 END)
    AS adview_per_month,
SUM(CASE WHEN metric_name_id=4 THEN metric_value else 0 END)
    AS dislike_per_month,
SUM(CASE WHEN metric_name_id=34 THEN metric_value else 0 END)
    AS unfriend_per_month,
SUM(CASE WHEN metric_name_id=6 THEN metric_value else 0 END)
    AS message_per_month,
SUM(CASE WHEN metric_name_id=7 THEN metric_value else 0 END)
    AS reply_per_month,
SUM(CASE WHEN metric_name_id=21 THEN metric_value else 0 END)
    AS adview_per_post,
SUM(CASE WHEN metric_name_id=22 THEN metric_value else 0 END)
    AS reply_per_message,
SUM(CASE WHEN metric_name_id=23 THEN metric_value else 0 END)
    AS like_per_post,
SUM(CASE WHEN metric_name_id=24 THEN metric_value else 0 END)
    AS post_per_message,
SUM(CASE WHEN metric_name_id=25 THEN metric_value else 0 END)
    AS unfriend_per_newfriend,
SUM(CASE WHEN metric_name_id=27 THEN metric_value else 0 END)
    AS dislike_pcnt,
SUM(CASE WHEN metric_name_id=30 THEN metric_value else 0 END)
    AS newfriend_pcnt_chng,
SUM(CASE WHEN metric_name_id=31 THEN metric_value else 0 END)
    AS days_since_newfriend
FROM metric m INNER JOIN metric_date d ON m.metric_time = d.last_metric_time
INNER JOIN account_tenures t ON t.account_id = m.account_id
INNER JOIN subscription s ON m.account_id=s.account_id
INNER JOIN account a ON m.account_id = a.id                      
WHERE s.start_date <= d.last_metric_time
AND (s.end_date >=d.last_metric_time OR s.end_date IS null)
GROUP BY s.account_id, d.last_metric_time, 
    a.channel, a.country, a.date_of_birth                        
ORDER BY s.account_id

Most of this listing is the same as listings 4.6 and 8.3.

The channel string from the account table

The country string from the account table

Subtracts the date of birth from the observation date

JOINs with the account table

Includes the demographic fields in the GROUP BY clause

You can run listing 10.6 on your own simulated social network dataset to create your own dataset file for the current customers by running the following command and these arguments:

fight-churn/listings/run_churn_listing.py --chapter 10 --listing 6

The Python program that converts the raw data for current customers to versions that can be used for forecasting and segmenting is shown in listing 10.7. Much of listing 10.7 is similar to the transformation that you saw in chapter 8, and it includes several helper functions from chapters 7, 8, and 10. But listing 10.7 also includes a few new steps to accommodate the demographic data.

The one important new technique in listing 10.7 is what I call aligning the dummy variables in the historical and current datasets. The Pandas get_dummies function (called from the dummy_variables function in listing 10.4) creates dummy variable columns for every category in the data frame, but the categories in the historical dataset and the current dataset may not match. Typically, the historical dataset has enough customer observations that you will see a rare category in a few customers, but the current dataset has fewer customers and may not include any examples of the rare category. The result in that case would be that the historical dataset has a column that the current dataset does not have. This situation would cause a failure when you try to forecast churn probabilities on the current dataset.

The same problem would happen if a category goes out of use historically and is no longer present in the current dataset. The reverse problem would occur if a new category comes into use: the historical dataset may lack the category, and only the current dataset includes it. In summary, aligning the categories does two things:

  • Adds to the current dataset, for any category in the historical data that is missing, a new dummy variable column containing zeros. This way, the current dataset has the same columns as the historical dataset, and zero is the correct value for a category that no current customer belongs to.

  • Drops any categories from the dummy variables for the current dataset that were missing in the historical dataset. Again, this step aligns the columns in the historical and current datasets. If the category was not available in the historical dataset, you don’t know whether or how it’s predictive of churn, so removing it is correct for the purpose of forecasting.
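
As a sketch of the idea (this is not the helper in listing 10.7, which loops over the column lists explicitly), the Pandas reindex method can express both alignment steps in a single call; the column names below are made up for illustration.

import pandas as pd

# Made-up example: dummy columns seen historically versus in the current data
historical_cols = ['channel_web', 'channel_appstore1', 'channel_appstore2']
current = pd.DataFrame({'channel_web': [1, 0], 'channel_newstore': [0, 1]})

# Add any historical dummy missing from the current data as a column of zeros,
# and drop any current-only dummy that the historical model never saw
aligned = current.reindex(columns=historical_cols, fill_value=0)
print(aligned.columns.tolist())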

Overall, the main steps in listing 10.7 are

  1. Run the dummy_variables creation listing from earlier in this chapter (listing 10.4), using the path to the current dataset. This code saves three versions of the data:

    • Only the numeric fields for further processing by scoring and grouping

    • Only the dummy variables to merge back together with the scores and groups later

    • The numeric fields and dummy variables together, which is used by XGBoost (this file is saved from within the dummy_variables function)

  2. Load the dummy variables derived from the current dataset.

  3. Run the align_dummies helper function that takes care of inconsistencies between the two sets of dummy variables.

  4. Load the dataset with only numeric fields that were created from the current data by the dummy_variables function. Also load the loading matrix and score parameters created from the historical dataset. Run this current dataset through the reprocessing steps you learned in chapter 8:

    1. Transform any skewed columns.

    2. Transform any columns with fat tails.

    3. Rescale the data so all fields are scores with a mean near 0 and a standard deviation near 1.

    4. Combine any correlated metrics by using the loading matrix created on the historical data.

  5. Merge the dummy variables with the metric group and score data, and save this version of the dataset. This version can be used for forecasting churn probabilities for current customers.

  6. Create the version of the dataset designed to be used by businesspeople for segmenting. This version of the dataset combines the following elements:

    • Scores for the grouped metrics

    • The original (untransformed) metrics for those that are not grouped

    • The original strings (not the dummy variables) for the demographic categories

Listing 10.7 Preparing a current customer dataset with demographic fields

import pandas as pd

from listing_7_5_fat_tail_scores import transform_fattail_columns, transform_skew_columns
from listing_8_4_rescore_metrics import score_current_data, group_current_data, reload_churn_data
from listing_10_4_dummy_variables import dummy_variables

def rescore_wcats(data_set_path, categories, groups):

    current_path = data_set_path.replace('.csv', '_current.csv')

    dummy_variables(current_path, groups, current=True)
    current_dummies = reload_churn_data(data_set_path,
        'current_dummies_groupscore', '10.7', is_customer_data=True)
    align_dummies(current_dummies, data_set_path)

    nocat_path = data_set_path.replace('.csv', '_nocat.csv')
    load_mat_df = reload_churn_data(nocat_path,
                                    'load_mat', '6.4', is_customer_data=False)
    score_df = reload_churn_data(nocat_path,
                                 'score_params', '7.5', is_customer_data=False)
    current_nocat = reload_churn_data(data_set_path,
                                      'current_nocat', '10.7', is_customer_data=True)
    assert set(score_df.index.values) == set(current_nocat.columns.values), \
        "Data to re-score does not match transform params"
    assert set(load_mat_df.index.values) == set(current_nocat.columns.values), \
        "Data to re-score does not match loading matrix"
    transform_skew_columns(current_nocat,
        score_df[score_df['skew_score']].index.values)
    transform_fattail_columns(current_nocat,
        score_df[score_df['fattail_score']].index.values)
    scaled_data = score_current_data(current_nocat, score_df, data_set_path)
    grouped_data = group_current_data(scaled_data, load_mat_df, data_set_path)

    group_dum_df = grouped_data.merge(current_dummies,
                                      left_index=True, right_index=True)
    group_dum_df.to_csv(
        data_set_path.replace('.csv', '_current_groupscore.csv'), header=True)

    current_df = reload_churn_data(data_set_path,
                                   'current', '10.7', is_customer_data=True)
    save_segment_data_wcats(
        grouped_data, current_df, load_mat_df, data_set_path, categories)


def align_dummies(current_data, data_set_path):

    current_groupmets = pd.read_csv(
        data_set_path.replace('.csv', '_current_dummies_groupmets.csv'),
        index_col=0)

    new_dummies = set(current_groupmets['metrics'])
    original_groupmets = pd.read_csv(
        data_set_path.replace('.csv', '_dummies_groupmets.csv'),
        index_col=0)

    old_dummies = set(original_groupmets['metrics'])
    missing_in_new = old_dummies.difference(new_dummies)
    for col in missing_in_new:
        current_data[col] = 0.0
    missing_in_old = new_dummies.difference(old_dummies)
    for col in missing_in_old:
        current_data.drop(col, axis=1, inplace=True)


def save_segment_data_wcats(current_data_grouped, current_data,
                            load_mat_df, data_set_path, categories):
    group_cols = load_mat_df.columns[load_mat_df.astype(bool).sum(axis=0) > 1]
    no_group_cols = list(
        load_mat_df.columns[load_mat_df.astype(bool).sum(axis=0) == 1])
    no_group_cols.extend(categories)
    segment_df = current_data_grouped[group_cols].join(current_data[no_group_cols])

    segment_df.to_csv(
        data_set_path.replace('.csv', '_current_groupmets_segment.csv'),
        header=True)

Runs the function dummy_variables on the current dataset

Calls helper function to align current dummies with historical ones

Prepares the current data without categories

Merges the group score data with the dummy data

Saves the result using the original dataset name

Uses the function to prepare the data for use in segmenting

Makes a set from the file listing current dummy variables

Makes a set from the file listing original dummy variables

Set difference finds dummy columns in original but not current.

For any dummy missing in the new data, adds a column of zeros

Set difference finds dummy columns in current but not original.

Drops dummy columns in current but not original

Group columns have more than one loading matrix entry.

Standard metric columns have one loading matrix entry.

Adds the category variable names to the list

Makes the segmenting data

You can run listing 10.7 with the Python wrapper program with the following command and these arguments:

fight-churn/listings/run_churn_listing.py --chapter 10 --listing 7

This code creates three files for the current customer data for the purposes described previously:

  • Forecasting with regression

  • Forecasting with XGBoost

  • Segmenting by businesspeople

If you want to forecast with the regression model, use listing 8.5 (with the filename listing_8_5_churn_forecast.py) with the following command and these parameters:

fight-churn/listings/run_churn_listing.py --chapter 8 --listing 5 --version 2

If you want to forecast with the XGBoost model (with the filename listing_9_7_churn_forecast_xgb.py), use

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 7 --version 2

Regarding the dataset for segmenting customers that your business colleagues will use, it is important to realize that for businesspeople, demographic data is important even when it does not relate to churn and retention. The marketing department, for example, will need to write different copy for engagement campaigns targeting customers in different countries or regions. In a large organization, the marketing department probably has access to all of that kind of information through its own system, but I include everything here for the sake of completeness.

TAKEAWAY Demographic information can be relevant to designing interventions with customers, even when it is not related to engagement and retention.

Summary

  • Demographic and firmographic data are facts about the customers that, unlike metrics, do not change over time. Demographic and firmographic fields can be of date, numeric, or string type.

  • Date type information about customers can be converted to intervals and analyzed using the same techniques as metrics.

  • To compare the churn rate in cohorts defined by demographic category strings, you use confidence intervals that are best- and worst-case estimates for the churn rate.

  • Churn rates in different categories are said to be different by a statistically significant amount when the confidence intervals around their churn rates do not overlap.

  • If you have many categories representing small percentages of the customer population, you should group related categories before analyzing them.

  • Grouping demographic categories is usually done using previous knowledge, and the mapping can be efficiently represented by using a dictionary.

  • To use demographic categories in regression or machine learning forecasting, convert them to columns of binary dummy variables.

  • Dummy variables are not grouped with metric scores, but investigating the correlations between dummy variables and metrics can provide useful information.

  • Using demographic information can improve forecasting accuracy, but it is usually a secondary contribution compared with behavior-based metrics.
