9 Forecast accuracy and machine learning

This chapter covers

  • Calculating measurements of forecasting accuracy for churn
  • Backtesting a model in a historical simulation
  • Setting the regression parameter for the minimum metric contribution
  • Picking the best value of the regression parameter by testing (cross-validation)
  • Forecasting churn risk with the XGBoost machine learning model
  • Setting the parameters of the XGBoost model with cross-validation

You know how to forecast the probability of customer churn, and you also know how to check the calibration of your forecasts. Another important measurement of a forecasting model is whether the customers predicted to be highly at risk are really more at risk than those predicted to be safe. This type of predictive performance is generally known as accuracy, although as you will see, there is more than one way to measure accuracy.

Back in chapter 1, I told you that forecasting churn with a predictive model was not the emphasis of this book because it isn’t helpful in many situations. The focus of this book is on having a good set of metrics that segment customers into healthy and unhealthy populations based on behavior. But there are a few reasons why it’s good to have accurate predictive churn forecasts, so this chapter will round out your skill set and ensure that you can forecast accurately when necessary.

One time when it can be useful to forecast churn risk accurately is when an intervention is particularly expensive. An onsite training session with a product expert will be more expensive to deliver than an email, for example. If you’re selecting customers for onsite training with the intention of reducing churn risk, it makes sense to enroll only customers whose risk profile justifies the cost. Alternatively, you might not select the most at-risk customers because they may be beyond saving; it is often better to select customers with above-average but not maximum risk. (Also, you probably would screen the customers by particular metrics to make sure that they would benefit from this hypothetical training.)

Another reason it is worth your time to forecast churn accurately is that doing so validates your entire data and analytic process; you can compare the accuracy of your predictions with known benchmarks, as I explain in this chapter. If you find that the performance of your process is below typical, that result suggests that you need to correct some aspect of your data or process. You may need to improve the way you clean your data by removing invalid examples, for example, or you might need to calculate better metrics. On the other hand, if you find that the performance of your analysis is in the high range of the benchmark, you can be confident that you have done a thorough analysis and there may not be much more to discover. You may even find that your accuracy is impossibly high, which might suggest the need for corrections and improvements in your data preparation, such as increasing the lookahead period you use to make your observations (chapter 4).

This chapter is organized as follows:

  • Section 9.1 explains ways to measure forecasting accuracy and teaches you some accuracy measurements that are particularly useful for churn.

  • Section 9.2 teaches you how to calculate accuracy measurements using a historical simulation.

  • Section 9.3 returns to the regression model from chapter 8 and explains how you can use an optional control parameter to control the number of weights that the regression uses.

  • Section 9.4 teaches you how to pick the best value of the regression control parameter based on the accuracy test results.

  • Section 9.5 teaches you how to predict churn risk by using a machine learning model called XGBoost, which is usually more accurate than regression. You also learn about some of the pitfalls of the machine learning approach and see benchmark results from real case studies.

  • Section 9.6 covers some practical issues involved in forecasting with the machine learning model.

The sections build on one another, so you should read them in order.

9.1 Measuring the accuracy of churn forecasts

To start, you learn what accuracy means in the context of churn forecasting and how to measure it. In fact, measuring the accuracy of churn forecasts is not straightforward.

9.1.1 Why you don’t use the standard accuracy measurement for churn

When you’re talking about the accuracy of a forecast (such as churn probability predictions), the word accuracy has both a general and a specific meaning. First, the general definition.

DEFINITION Accuracy (in the general sense) means the correctness or truthfulness of forecasts.

All methods of measuring the accuracy of churn forecasting involve comparing the predictions of risk with actual churn events, but there are many ways to measure accuracy. To make matters more confusing, one particular measurement of forecasting accuracy is called accuracy. This measurement is specific, but it is not a useful measurement for churn, as you’re about to see. I am going to start with that measurement, which I will call the standard accuracy measurement to prevent confusion with the more general meaning of accuracy. (When I say accuracy, I mean the word in the general sense.)

Figure 9.1 illustrates the standard accuracy measurement. In chapter 8, you learned how to assign a churn or retention forecast probability to each customer. The standard accuracy measurement further assumes that on the basis of those forecasts, you divide the customers into two groups: those who are expected to be retained and those who are expected to churn. I will return to the question of how you might divide customers into those two groups when I finish explaining the standard accuracy measurement.

Figure 9.1 The standard accuracy measurement

After customers are divided into expected retention and expected churn groups, the assigned categories are compared with what really happened. To define the standard accuracy measurement, you need to use the following terms:

  • A true-positive (TP) prediction is a predicted churn that churns.

  • A true-negative (TN) prediction is a predicted retention that stays.

  • A false-positive (FP) prediction is a predicted churn that stays instead of churning.

  • A false-negative (FN) prediction is a predicted retention that churns.

Using these definitions, the standard accuracy measurement is defined as follows.

DEFINITION The standard accuracy is the percentage of forecasts that are either true positives or true negatives. As an equation, Standard Accuracy = (#TP + #TN) / #Total.

Standard accuracy is meant to represent the percentage of predictions that were correct in a particular literal sense: the percentage of the category assignments that came true. That sounds reasonable, but in fact, standard accuracy is inappropriate for measuring the validity of churn forecasts. Standard accuracy has two problems when it comes to churn:

  • Churn is rare, so standard accuracy is dominated by nonchurns.

  • The basic assumption of the standard accuracy measurement is that you divide customers into two groups: expected churns and expected retentions. But that division isn’t a useful portrayal of customer segmentation use cases.

I will explain each of these problems in detail.

Standard accuracy is dominated by nonchurns because churns are rare, so true positives cannot possibly have much impact on the numerator in the standard accuracy ratio. As a result, the measurement doesn’t always do a good job of showing whether forecasts are appropriate. To make this point, note that there is an easy way to get a high standard accuracy measurement, as illustrated in figure 9.2. If you were to predict that no customers would churn (all customers in the nonchurn group), you would have true-negative predictions for the majority. If you have all the true negatives correctly assigned, the resulting accuracy is the retention rate, and you would have a high standard accuracy measurement without having predicted anything about churn.

Figure 9.2 Gaming the standard accuracy measurement for churn

TAKEAWAY The standard accuracy measurement is inappropriate for churn because churn is rare, so the measurement can be gamed by predicting that no one will churn. More generally, accuracy on churned customers makes only a small contribution to the measurement.
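Here is a minimal sketch (not one of the book's listings) that makes the problem concrete with scikit-learn's accuracy_score; the customer counts are made up for illustration:

from sklearn.metrics import accuracy_score

# 1,000 customers with a 5% churn rate
y_true = [1] * 50 + [0] * 950    # 1 = churned
y_pred = [0] * 1000              # "predict" that no one churns

print(accuracy_score(y_true, y_pred))   # prints 0.95: high accuracy, zero insight about churn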

One possible remedy for this weakness in the standard accuracy measurement is to augment it with measurements based on not only true positives and true negatives but also false positives and false negatives. I don’t recommend this approach either, however, because there is another way in which standard accuracy measurement is inappropriate for churn use cases. Calculating standard accuracy relies on the assumption that you divided the customers into two groups: expected churns and expected retentions. Dividing predictions into two exclusive groups is standard for some forecasting use cases, but it is rarely done that way for customer churn.

As mentioned at the start of the chapter, the most common use case for churn and retention forecasts is to select customers for relatively expensive interventions to reduce churn. In that case, the churn or retention probability is used like any other segmenting metric, in that the department organizing the intervention orders the customers by the metric and then uses its own criteria to pick the most appropriate customers. If the intervention has a specific budget, for example, the department might pick a fixed number of customers who are most at risk for churn or a fixed number of customers who are not most at risk. A common strategy is to select customers with above-average risk who still use the product a little because the most at-risk customers who do not use the product may not be savable. You (the data person) aren’t dividing the customers into expected churns and nonchurns as presumed by the standard accuracy measurement.

TAKEAWAY Churn forecasting use cases rely on using the ranking provided by the churn forecast as a segmenting metric but do not involve categorizing the customers into two groups: expected churns and nonchurns.

Because real churn use cases depend on the model’s ability to rank customers by risk but not divide them into two groups per se, it makes more sense to turn to alternative (nonstandard) measurements of accuracy that better reflect the situation. As described in section 9.1.2, these measurements also remedy the problems in the standard accuracy measurement caused by the rarity of churn.

9.1.2 Measuring churn forecast accuracy with the AUC

The first accuracy measurement that you should use for churn is area under the curve (AUC), where the curve refers to an analytic technique known as the receiver operating characteristic (ROC) curve. This naming is unfortunate, because AUC is a technical description of the way in which the metric is calculated but doesn’t convey clearly what it means. But everyone uses this name, so we have no choice but to stick with it; I won’t refer to the ROC curve anymore because it is not necessary for understanding or applying the metric. As you will see, my advice is not to even mention this measurement to your business colleagues. If you want more details, it is easy to find resources online.

The meaning of AUC is simpler than the name, as summarized in figure 9.3. As in the standard accuracy measurement, you start with a dataset in which you made a forecast for every customer and know which customers churned. Consider the following test. Take one customer who churned and one customer who didn’t churn. If your model is good, it should have forecast a higher churn risk for the customer who churned than for the one who didn’t. If the model did so, consider that comparison to be a success. Now consider the same test for every possible comparison. One by one, compare every churn with every nonchurn to see whether the model predicted higher churn risk for true churn. The overall proportion of successful predictions is the AUC.

Figure 9.3 Measuring accuracy with the AUC

DEFINITION AUC is the percentage of comparisons in which the model forecasts higher churn risk for a churn than for a nonchurn, considering pairwise comparisons of all churns and nonchurns.

AUC avoids the problem in standard accuracy, which is that prediction on churns doesn’t matter much because churns are such a small percentage of the population. In the AUC calculation, accurate prediction of churns is central because every comparison involves one churn, even if churns are only a small percentage of the data. At the same time, AUC is based on the ranking of risks and doesn’t require an artificial categorization of customers into two groups.

If you think about the definition of AUC, the measurement could involve a lot of comparisons. The total number of pairwise comparisons is the product of the number of churned customers and the number of nonchurned customers. Fortunately, there is a more efficient way to do the calculation, involving that ROC curve, but I’m not going to teach you how to use it. Instead, you will use an open source package to do the calculation (listing 9.1). It’s true that AUC is more expensive to calculate than the standard accuracy metric, but the difference is not enough to cause concern.
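That said, the brute-force definition is easy to code, and seeing it run can make the AUC feel less abstract. The following is a small sketch (not one of the book's listings) that compares every churn with every nonchurn and checks the result against roc_auc_score; the outcomes and risk scores are made up:

from itertools import product
from sklearn.metrics import roc_auc_score

y    = [1, 0, 1, 0, 0, 1, 0, 0]                  # 1 = churned
risk = [0.8, 0.3, 0.6, 0.7, 0.2, 0.9, 0.4, 0.1]  # model's churn risk scores

churn_scores    = [r for r, churned in zip(risk, y) if churned == 1]
nonchurn_scores = [r for r, churned in zip(risk, y) if churned == 0]

# Count the pairwise comparisons the churn "wins" (ties get half credit)
wins = sum(c > n for c, n in product(churn_scores, nonchurn_scores))
ties = sum(c == n for c, n in product(churn_scores, nonchurn_scores))
auc_brute = (wins + 0.5 * ties) / (len(churn_scores) * len(nonchurn_scores))

print(auc_brute)                 # 0.933...
print(roc_auc_score(y, risk))    # same value from the library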

If you run listing 9.1, you’ll see the short output in figure 9.4—a first demonstration. You will be using the AUC measurement throughout this chapter.

Figure 9.4 Output of listing 9.1 for calculating the forecast model AUC

To demonstrate the AUC, listing 9.1 reloads the logistic regression model that you saved in chapter 8; it also reloads the dataset used to train the model (the historical dataset with labeled churns and retentions, not the current customer dataset). The model’s predict_proba function is used to create forecasts, and these forecasts are passed to the function roc_auc_score from the sklearn.metrics package. You should run listing 9.1 on your own saved data and regression model with the following standard command and these arguments:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 1

Listing 9.1 Calculating the forecast model AUC

import os
import pickle
from sklearn.metrics import roc_auc_score                              
from listing_8_2_logistic_regression import prepare_data               
 
def reload_regression(data_set_path):                                  
   pickle_path = data_set_path.replace('.csv', '_logreg_model.pkl')
   assert os.path.isfile(pickle_path), 'Run listing 8.2 to save a log reg model'
   with open(pickle_path, 'rb') as fid:
       logreg_model = pickle.load(fid)
   return logreg_model
 
def regression_auc(data_set_path):
   logreg_model = reload_regression(data_set_path)                     
   X,y = prepare_data(data_set_path)                                   
   predictions = logreg_model.predict_proba(X)                         
   auc_score = roc_auc_score(y,predictions[:,1])                       
   print('Regression AUC score={:.3f}'.format(auc_score))

sklearn has a function to calculate the AUC.

Reuses the prepare_data function from listing 8.2

Reloads the regression model pickle

Calls the reload_regression function

Calls the prepare_data function from listing 8.2

predict_proba returns probability predictions.

Calls the function to calculate the AUC

You should find that the regression model has an AUC of around 0.7, which raises the question of whether 0.7 is good. AUC is a percentage, like accuracy, and 100% is the best possible. If you had 100% AUC, all the churns were ranked higher in risk than all the nonchurns. But you will never find a real churn-prediction system that has an AUC anywhere near that high.

On the other hand, consider the worst you could possibly do. Zero percent sounds bad, but that result would mean that you had all the nonchurns ranked as a higher risk than the churns. If you think about it, that result would be fine, because then you could use your model as a perfect predictor of retention. Probably, though, something went wrong in your model setup to make it predict backward.

In fact, the worst AUC would be 0.50, which would mean that your predictions were like coin flips: right half the time and wrong half the time. If a forecast model has an AUC of 0.5, it has the worst possible performance—the same as random guessing.

TAKEAWAY AUC ranges from 0.5, which is equivalent to random guessing (no predictive power), to 1.0, which is perfect ranking of churns versus nonchurns.
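One consequence of this definition: if you ever see an AUC below 0.5, the model’s scores are ranked backward, and flipping the direction of the scores flips the AUC around 0.5. A tiny illustration with toy numbers, assuming scikit-learn's roc_auc_score:

from sklearn.metrics import roc_auc_score

y      = [1, 0, 1, 0, 0]
scores = [0.2, 0.9, 0.1, 0.8, 0.7]                 # backward: churns get the lowest scores

print(roc_auc_score(y, scores))                    # 0.0
print(roc_auc_score(y, [1 - s for s in scores]))   # 1.0 after flipping the direction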

Table 9.1 shows a list of benchmarks for what you can consider to be healthy and unhealthy AUC. Generally, churn forecasting AUC is healthy in the range from around 0.6 to 0.8. If it’s less than 0.6 or greater than 0.8, something is probably wrong, and you need to check the data in your model. You may not think that high accuracy would be cause for concern, but it could be. I’ll say more about that subject in section 9.2.3.

Table 9.1 Churn forecasting AUC benchmarks

AUC result | Diagnosis
< 0.45 | Something is wrong! The model is predicting backward. Check your data and the code calculating the AUC; is it using the wrong column of the predict_proba result?
0.45-0.55 | No different from random guessing (0.5). Check your data.
0.55-0.6 | Better than random guessing but not good. Check your data, collect better events, or make better metrics.
0.6-0.7 | Healthy range for weakly predictable churn.
0.7-0.8 | Healthy range for highly predictable churn.
0.8-0.85 | Extremely predictable churn. This result is suspicious for a consumer product and usually is possible only for a business product with informative events and advanced metrics.
> 0.85 | Something probably is wrong. Normally, churn is not this predictable, even for business products. Check your data to make sure that you’re not using too short a lead time to construct the dataset and that there are no lookahead events or customer data fields (described in section 9.2.3).

NOTE The AUC benchmarks in table 9.1 apply only to customer churn. For other problem domains, the expected range of forecasting AUC can be higher or lower.

AUC is used throughout the rest of this chapter, but first, you should be aware of one other nonstandard accuracy measurement: the lift.

9.1.3 Measuring churn forecast accuracy with the lift

AUC is a useful metric, but it has one downside: it is abstract and hard to explain. I recommend a different metric for churn accuracy, primarily because it is easy for businesspeople to understand. In fact, this metric, known as the lift, originated in marketing. I’ll explain first the general use of lift in marketing and then its specific application to churn.

DEFINITION Lift is the increase in responses due to some treatment, relative to the baseline.

If 1% of people who visit a website sign up for the product, and a promotion increases the sign-up rate to 2%, the lift caused by the promotion is 2.0 (2% divided by 1%). According to that definition, a lift of 1.0 means no improvement. One thing to notice about lift is that it emphasizes improvement over the baseline, so it is suitable for measuring improvement in things that are rare to begin with. For measuring the accuracy of prediction models, you can use a more specific version of lift called the top decile lift.

DEFINITION The top decile lift of a predictive churn model is the ratio of the churn rate in the top decile of customers predicted to be most at risk to the overall churn rate.

Figure 9.5 illustrates this definition. The top decile lift is like a regular lift measurement, but the baseline is the overall churn rate, and the treatment is that you picked the 10% most at-risk customers according to the model.

Figure 9.5 Measuring accuracy with the lift

IMPORTANT Because this definition is the most common definition of lift for churn forecasting, when I use the term lift, you should be aware from the context that I mean top decile lift.

Why is the overall churn rate the baseline? That’s how accurate you would be in predicting churn if you were randomly guessing. If you have a 5% churn rate, you will find churns 5% of the time if you pick customers at random. If you can do better than random guessing (lift greater than 1.0), your result improves. You might respond that you could do better than guessing randomly, and you probably could, especially if you use segments based on data-driven metrics like the ones you learn how to make in this book. But the point is that the overall churn rate is a reasonable baseline at all companies, regardless of whatever else you might be doing.

TAKEAWAY Top decile lift is good for measuring accuracy because it emphasizes improvement from a low baseline level of prediction.
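For example, if the overall churn rate is 4% and the 10% of customers the model ranks as most at risk churn at a 16% rate, the top decile lift is 16% / 4% = 4.0. Picking 10% of customers at random would catch churns at roughly the overall 4% rate, for a lift of about 1.0.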

Listing 9.2 shows how to calculate the lift with Python, assuming that you have a model saved (as in listing 9.1). Again, the output, as shown in figure 9.6, is a simple printout of the results and only a demonstration.

Figure 9.6 Output of listing 9.2 (lift)

Listing 9.2 doesn’t use an open source package to calculate the lift. At the time of this writing, no open source package makes this calculation, so I have made an implementation for you in the function calc_lift. The steps to calculate the lift are as follows:

  1. Validate the data to make sure you have a sufficient number of distinct forecasts.

  2. Calculate the overall churn rate in the sample.

  3. Sort the predictions by the churn risk forecast.

  4. Locate the position of the top decile.

  5. Calculate the number of churns in the top decile and the top decile churn rate. The result is the top decile churn rate divided by the overall churn rate.

The lift calculation I provide requires at least 10 unique values or levels for the forecasts. Too few distinct forecasts can be a sign of bad data or a misspecified model; the most common manifestation is that all accounts get the same forecast, but other variants are possible. The criterion of 10 is a rule of thumb, not a hard rule. (In principle, the forecasts should allow you to select exactly the 10% of customers who are most at risk for the comparison. It would be okay for the purpose of calculating lift to have just two distinct predictions coming from the model, for example, as long as exactly 10% of the population gets one prediction or the other. The 10-unique-values rule of thumb catches the most egregious model or data failures, and matching the condition precisely is not really necessary anyway.)

Listing 9.2 Calculating the forecast model lift

from listing_8_2_logistic_regression import prepare_data
from listing_9_1_regression_auc import reload_regression
import numpy
 
def calc_lift(y_true, y_pred):
   if numpy.unique(y_pred).size < 10:
       return 1.0
   overall_churn = sum(y_true)/len(y_true)
   sort_by_pred = [(p,t) for p,t in sorted(zip(y_pred, y_true))]
   i90 = int(round(len(y_true)*0.9))
   top_decile_count = sum([p[1] for p in sort_by_pred[i90:]])
   top_decile_churn = top_decile_count/(len(y_true)-i90)
   lift = top_decile_churn/overall_churn
   return lift
 
def top_decile_lift(data_set_path):
   logreg_model = reload_regression(data_set_path)
   X,y = prepare_data(data_set_path,as_retention=False)
   predictions = logreg_model.predict_proba(X)
   lift = calc_lift(y,predictions[:,0])
   print('Regression Lift score={:.3f}'.format(lift))

Uses the prepare_data function from listing 8.2

Uses the reload_regression function from listing 9.1

Parameters are series of true churn outcomes and predictions.

Checks to make sure that the predictions are valid

Calculates the overall churn rate

Sorts the predictions

Calculates the index of the 90th percentile

Counts the churns in the top decile

Calculates the top decile churn rate

Returns the ratio of the top decile churn to the overall churn

Loads the model, and generates predictions as in listing 8.1

Loads the data but doesn’t invert the outcome to retention

Calls the lift calculation function

You should run listing 9.2 with the following arguments to check the result yourself:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 2

You should find that the regression model achieves a lift of around 4.0 on the simulated data. I’ve already mentioned that the minimum lift is 1.0, which indicates that your model is no better than random guessing because it can’t find more churns than the overall churn rate. A lift less than 1.0 is akin to an AUC less than 0.5, which means that your model is predicting risk in reverse because the top decile has fewer churns than the overall sample.

You can also deduce the maximum possible lift if the top decile of customers most at risk contained only customers who churned. The lift would be 100% divided by the overall churn rate. So the maximum lift depends on the overall churn rate. Here are some examples:

  • If the churn rate is 20%, the maximum possible lift would occur if the top decile of forecasts were all churns. Then the lift would be 5 (100%/20% = 5).

  • If the churn rate is 5%, the maximum lift would be if all those 5% churns were in the top decile forecast group. Then the top decile churn rate would be 50%, and the lift would be 10 (50%/5% = 10).

The pattern is that the higher the overall churn rate is, the lower the maximum possible lift. You are not going to get anywhere close to those maximums, but the relationship between churn rates and more typical lift values is the same.

TAKEAWAY The higher the overall churn rate, the lower the lift you should expect from a predictive model.

Table 9.2 lists benchmarks for what you can expect to find for lift in real churn prediction use cases. Unlike for the AUC, the reasonable range of lift values depends on the churn rate. If the churn rate is low, it’s easier to get a somewhat greater lift. If the churn rate is high (greater than 10%), the lift is likely to be lower. As explained in the preceding paragraph, the maximum lift is reduced when the churn rate is high. That property carries over to expecting lower lift scores generally because you’re not likely to find so many churns in the top decile. For low-churn products, a healthy lift is in the range from 2.0 to 5.0, whereas for high-churn products, the healthy range is around 1.5 to 3.0.

Table 9.2 Churn forecasting lift benchmarks

Low churn (< 10%) lift result | High churn (> 10%) lift result | Diagnosis
< 0.8 | < 0.8 | Something is wrong! The model is predicting backward. Check your data and the code calculating the lift. Is it using the wrong column of the predict_proba result?
0.8-1.5 | 0.8-1.2 | Random guessing (1.0), or not very different from random guessing. Check your data.
1.5-2.0 | 1.2-1.5 | Better than random guessing but not good. Check your data, collect better events, or make new metrics.
2.0-3.5 | 1.5-2.25 | Healthy range of weakly predictable churn.
3.5-5.0 | 2.25-3.0 | Healthy range of highly predictable churn.
5.0-6.0 | 3.0-3.5 | Extremely predictable churn. This result is suspicious for a consumer product and usually is possible only for a business product with good events and metrics.
> 6.0 | > 3.5 | Something probably is wrong. Normally, churn is not this predictable, even for business products. Check your data to make sure that you’re not using too short a lead time to construct the dataset and that there are no lookahead events or customer data fields (described in section 9.2.3).

I like to use the lift when I explain accuracy to businesspeople because the term is intuitive and related to metrics that they already understand. But there is one problem with the lift: it can be unstable, particularly with small datasets. Small changes in the metrics or model you use to predict may create big changes in the result.

WARNING The lift can be unstable, especially for small datasets. The result can vary significantly, comparing different time periods and forecasting models. To measure lift, you should have thousands of observations and hundreds of churns in the dataset (or more). The lower the churn rate, the more observations you need to make the lift measurement stable.

Suppose that you have only 500 customer observations and a 5% churn rate, so you have only 25 churns. In that case, the lift is based on how many of those 25 are in the top 10% forecast at risk, with the baseline being an expected (average) 2.5. The addition or removal of a few churns from the top decile will make big swings in the lift. Generally, you should use the lift when you have thousands of observations or more. The AUC avoids this type of problem because it always uses every churn in the dataset and maximizes their use (by comparing every churn with every nonchurn).
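If you want to see this instability for yourself, the following is a minimal simulation sketch (not one of the book's listings) that reuses the calc_lift function from listing 9.2 on synthetic outcomes with a weakly informative risk score. The spread between the lowest and highest lift across random draws is typically much wider at 500 observations than at 5,000:

import numpy as np
from listing_9_2_top_decile_lift import calc_lift

def simulated_lift(n_customers, churn_rate, seed):
    rng = np.random.default_rng(seed)
    y = (rng.random(n_customers) < churn_rate).astype(int)   # true churn outcomes
    score = y + rng.normal(0.0, 2.0, n_customers)            # weakly informative risk score
    return calc_lift(y, score)

for n in (500, 5000):
    lifts = [simulated_lift(n, 0.05, seed) for seed in range(20)]
    print('n={:5d}  lift min={:.2f}  max={:.2f}'.format(n, min(lifts), max(lifts)))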

TAKEAWAY Use the AUC to evaluate your model accuracy for your own understanding. Use the lift to explain the churn accuracy to businesspeople.

Another nice property of the lift is that it makes the imprecise business of forecasting churn sound more impressive. Compare these two statements:

  • This model is three times better than the baseline.

  • This model ranks a customer who churns 70% of the time as more risky than a customer who doesn’t.

Even though both statements imply the same level of improvement above random guessing, three times is a more impressive statistic than 70%.

9.2 Historical accuracy simulation: Backtesting

Now you know the right way to measure accuracy of churn forecasts and what is typical in churn forecasting. But I ignored an important detail: the observations on which you should measure accuracy. As with many parts of the analysis, the situation is a little different for churn.

9.2.1 What and why of backtesting

Earlier, I demonstrated the accuracy measurements you learned by calculating the accuracy of the forecast on the dataset with which you created the model. This demonstration is not the best practice, however; it’s like testing a student on questions that they have already seen in the sense that the same customer observations were used to fit the model. The best practice in forecasting is to test the accuracy of a model on observations that were not used to fit the model. This type of testing is known as out-of-sample testing because it tests observations that were not in the data sample given to the algorithm for determining the model.

In general, accuracy is lower for new customer observations than for the ones used in the model fitting. How different in-sample and out-of-sample accuracy are depends on many factors. For regression on churn problems, the difference is usually slight; for the machine learning model shown in section 9.5, using in-sample observations for testing can create a large overestimate of accuracy.

TAKEAWAY Forecasting models should be tested on out-of-sample data that was not used to fit the model.

Do you need to wait to see how well the model predicts new churns on live customers to see how accurate it is? Waiting would work, and you should do that, but there’s an easier way: hold back some of the observations from the data when you fit the model and then test the accuracy on those held-back observations. Then you can see how accurate the model would be on new customers it hasn’t seen without waiting to get fresh new customers. After testing, you refit the model on all the data without holding anything back and use that final version to make the real new forecasts on active customers.

The next question is which data to use and how much you should hold back for testing. The most realistic way to test the accuracy of a churn-forecasting model is to use a historical simulation. This procedure is called backtesting and is illustrated in figure 9.7.

Figure 9.7 The backtesting process for measuring forecasting performance

DEFINITION Backtesting is the historical simulation of a forecasting model’s accuracy, as though it had been repeatedly fit and then used to forecast out of sample for consecutive periods in the past.

Here’s how backtesting works:

  1. Decide on a point in time in the past, somewhere around one-third to one-half of the way through the period spanned by your dataset.

  2. Use all the observations that correspond to points in time before that date to fit your model.

  3. Test on the observations from the one to three months after the date you chose. This procedure tests what the accuracy of your model would have been if it had been forecasting churn at that point in the past, on customer data that came in after the model was fit.

  4. Assuming that you have more data, advance the target date to the end of your test period.

  5. Repeat the process: refit the model on all the data from the first fit plus the observations you just tested on, and then test on the next one to three months.

My advice for churn forecasting is a bit different from what is taught in most data science and statistics courses, which rarely mention backtesting. Students usually learn a random shuffling procedure to create out-of-sample tests that pay no attention to timing. The procedure of backtesting originated in financial forecasting on Wall Street. Backtesting was created because markets are changing all the time, so predictive models perform differently on randomly shuffled accuracy tests than on live forecasting. Accuracy tests based on a realistic historical simulation do the best job of estimating how a model would have done if it had been live at the time.

The reason why live-prediction accuracy can differ from a shuffled data test is that if economic conditions change, such as at the start of a recession, a live model fit before the recession probably won’t predict as well under the new recession conditions. For the model to do better, the new conditions have to be observed for some time; then the model could be refit. But with a shuffled data test, it is as though you fit a model that knows about the recession by observing the future before it happened. Such a model can appear to forecast well, but the real results will likely be worse than the test.

The same reasoning applies to churn forecasting. If your market, product, or competition changes during the time spanned by your dataset, it might be hard to forecast churn accurately in the time after the change. If you shuffle the data, you can get a different result than you would have if you had been forecasting for your customers at that time. The most realistic simulation is to have your model run through the data and forecast out of sample in the order in which events happened. You may not know whether the conditions driving your customer churn behavior changed during your period of observation. But what you don’t know can hurt you, so backtesting is the best practice. Although the historical simulation I described sounds complicated, open source packages take care of all the details for you.
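To see concretely what that simulation looks like, here is a small sketch (not one of the book's listings) of the split pattern produced by scikit-learn's TimeSeriesSplit, the helper used in listing 9.3. Note that TimeSeriesSplit assumes the rows are already in chronological order, so sort your dataset by observation date before backtesting:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(24).reshape(12, 2)   # 12 time-ordered observations, 2 stand-in metrics
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_demo)):
    print('split', fold, 'fit on rows', train_idx, 'test on rows', test_idx)

Each split fits on an expanding window of the earliest rows and tests on the block of rows that immediately follows, which is exactly the historical simulation described above.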

9.2.2 Backtesting code

Open source Python packages provide functions that run historical simulations like the one described in section 9.2.1. You provide the package your data and the type of model you’re fitting and tell it how many tests you want to divide your data into.

Figure 9.8 shows example output from a historical simulation, including the lift and the AUC for each out-of-sample test as well as the averages. For the simulated dataset, you will probably find that the AUC and lift in the backtest are similar to the AUC and lift from the in-sample data, but that will not necessarily be the case for a real product dataset.

Figure 9.8 Output of backtesting (listing 9.3)

In figure 9.8, each testing period is known as a split, in reference to the fact that the data is split into a dataset for fitting the model and a holdout dataset for testing.

DEFINITION Split is a generic term for the division of a dataset into separate parts for model fitting and testing.

Listing 9.3 contains the Python code that produced the output shown in figure 9.8. This listing contains many of the same elements as the regression fitting code in chapter 8 and the accuracy measurements discussed in section 9.1. But there are three important new classes from the sklearn.model_selection package:

  • GridSearchCV—A utility that performs a variety of tests on forecasting models. The name of the class derives from the fact that it specializes in searching for the best models through a process known as cross-validation (the CV in GridSearchCV). You’ll learn more about cross-validation in section 9.4; for now, you use the object to test a single model.

  • TimeSeriesSplit—A helper object that tells GridSearchCV that the testing should be performed by historical simulation, rather than another type of test (typically, random shuffling). The name of the class is TimeSeriesSplit, but I recommend that you stick with the original Wall Street term that your business colleagues are most likely to understand: backtesting.

  • scorer—An object that wraps a scoring function. When you use a nonstandard scoring function with GridSearchCV, you must wrap it in such an object. This task is easy: call the make_scorer function, provided by the package for this purpose. You pass your scoring function as a parameter when making the scorer object. In listing 9.3, this technique is used for the top decile lift calculation.

Other than TimeSeriesSplit, the parameters required to create GridSearchCV are the regression model object and a dictionary containing the two accuracy measurement functions. The lift measurement is passed as the scorer object, and the AUC is passed as the string 'roc_auc', which names a scorer that is built into the package.

Other parameters that control the details of the test include the following:

  • return_train_score—Controls whether to also test for in-sample accuracy (also known as training accuracy)

  • param_grid—Tests parameters to find a better model (a subject you learn more about in section 9.4)

  • refit—Tells the model to refit a final model on all the data (which you will do in section 9.4)

In other respects, listing 9.3 combines elements you have already seen: loading and preparing data, creating a regression model, and saving results. One thing to note is that the test is triggered by calling the fit function on GridSearchCV rather than on the regression object itself.

Listing 9.3 Backtesting with Python time-series cross-validation

import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
 
from listing_8_2_logistic_regression import prepare_data
from listing_9_2_top_decile_lift import calc_lift
 
def backtest(data_set_path,n_test_split):
 
   X,y = prepare_data(data_set_path,as_retention=False)
 
   tscv = TimeSeriesSplit(n_splits=n_test_split)
 
   lift_scorer = make_scorer(calc_lift, needs_proba=True)
   score_models = {'lift': lift_scorer, 'AUC': 'roc_auc'}
 
   retain_reg = LogisticRegression(penalty='l1',
                                   solver='liblinear', fit_intercept=True)
 
   gsearch = GridSearchCV(estimator=retain_reg,
                          scoring=score_models, cv=tscv,
                          return_train_score=False, param_grid={'C' : [1]}, refit=False)
 
   gsearch.fit(X,y)
 
   result_df = pd.DataFrame(gsearch.cv_results_)
   save_path = data_set_path.replace('.csv', '_backtest.csv')
   result_df.to_csv(save_path, index=False)
   print('Saved test scores to ' + save_path)

These classes run the tests.

Defines a custom score function: the lift score

Reuses listing 8.2 to reload data

Reuses listing 9.2 to calculate the lift

Loads the data, keeping the outcome as a churn flag

Creates an object that controls the splits

Creates a scorer object that wraps the lift function

Creates a dictionary that defines the scoring functions

Creates a new LogisticRegression object

Creates a GridSearchCV object

Runs the test

Saves the results in a DataFrame

You should run listing 9.3 on your own data from the social network simulation (chapter 8) and confirm that your result is similar to the one in figure 9.8. With the Python wrapper program, the command to run is the following:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 3

9.2.3 Backtesting considerations and pitfalls

For the simulation, only two tests were used because the entire dataset spans only six months. If more tests were specified for a larger dataset, the additional results would appear as additional columns in the same file. But in backtesting for churn prediction, it is typical to test with a few splits. By contrast, the procedure you may have learned for randomly shuffled tests usually calls for 10 random tests or more. You should pick the number of splits based on the length of time spanned by your data sample and how often you would be likely to refit the model.

Although you may optimistically think you would refit a new model every month, in reality, many companies “set it and forget it.” Even if you are very determined, you will probably refit your own model only a few times a year after you finish the initial development. (Refitting the simulation model every two months may be overly optimistic; I use this example for demonstration purposes.) Also, frequent model changes are confusing to businesspeople. In fact, some companies mandate an annual refitting of production models to prevent “moving the goal posts” when business metrics are tied to the model outputs. For example, if customer support representative compensation is linked to reducing churn probability, then the model must remain fixed for the fiscal year.

If you’re worried that using a few splits for the test is not as rigorous as using 10 tests, don’t worry. These measurements should be made with the spirit of agility and parsimony that I advocated in chapter 1. Using a few tests will tell you whether you are predicting well or have work to do on your model; doing more tests wastes time. Also, if a high number of test splits implies an unrealistic rate of refitting your model when it is live, your test may overestimate the accuracy you would achieve in the real world, where you refit less often.

One other pitfall to be aware of in backtesting for accuracy is the possibility of adverse effects due to mistakes in how times are recorded in your database or data warehouse. This problem occurs mainly if events, subscriptions, or other customer data records were backdated when they were added to your database. In that case, you would calculate historical metrics and run your test with information that may not be available in real time for live forecasting on active customers. This type of error is known as a lookahead error or bias in forecasting.

DEFINITION A lookahead bias is an error that occurs when you estimate accuracy in a historical simulation using information that would not be available in real time for forecasting on active customers.

WARNING Backdated records for events, subscriptions, or other customer data can lead to lookahead bias in your forecasts and cause the backtest to appear more accurate than what you would achieve in real-time forecasting.

The fix for lookahead bias is to be aware of any backdating of records in your database and, if necessary, to correct it with custom lags in the event selection when you calculate metrics. If you know that all events are loaded into the data warehouse with a one-week delay and backdated to the time the event occurred, for example, you should include this delay when you calculate your metrics. The trick is that you won’t notice the one-week delay when you run your historical analysis, but you will when you try to forecast churn probabilities in real time and find that all your metrics are a week old.
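As a concrete illustration, here is a hypothetical sketch (not from the book, with made-up column and function names) of building the delay into a simple event-count metric. The idea is that when you compute a metric as of a given date in the historical simulation, you only count events old enough to have been loaded by that date:

import pandas as pd

def count_events_with_lag(events, metric_date, window_days=28, load_delay_days=7):
    # Only count events old enough to have been loaded into the warehouse by metric_date
    cutoff = pd.Timestamp(metric_date) - pd.Timedelta(days=load_delay_days)
    start = cutoff - pd.Timedelta(days=window_days)
    in_window = (events['event_time'] > start) & (events['event_time'] <= cutoff)
    return in_window.sum()

The same lag then has to be accepted in live forecasting, where your metrics really will be a week old.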

9.3 The regression control parameter

After measuring the accuracy of your forecasts, you’re probably wondering whether there is any way to be more accurate. Another problem I’ve mentioned is that a regression can result in many small weights on unimportant metrics. You have a way to adjust the regression that can help with both issues.

9.3.1 Controlling the strength and number of regression weights

In chapter 8, I mentioned that regression models can have too many small weights and that you can remove them. This technique is illustrated in figure 9.9, which shows the relative size of the regression weights from the social network simulation (figure 8.7 in chapter 8). Most of the weights are greater than 0.1, one weight is 0.00, and two weights are 0.01; those 0.01 weights are extraneous. Two small weights may not seem like a problem, but remember that real data can have a dozen or more smaller weights, which can make it harder for you and the businesspeople to understand the result.

Figure 9.9 Regressions result in small weights that can be removed.

It might seem that the simplest thing to do would be to set those very small weights to zero. But the decision about which weights to keep and which to remove may not be so clear-cut. Also, if some weights are removed, others should be readjusted. The regression algorithm has a more principled way to handle this situation with a parameter controlling the total weight available for the algorithm to distribute across all the metrics.

When the control parameter is set to a high value, the regression weights tend to be larger, and there will be fewer zeros. When the control parameter is set to a low value, the weights tend to be lower, and the lower the parameter is set, the more weights will be zero. The precise weights are optimized by the algorithm. Unfortunately, this controlling parameter has no good, generally accepted name. Because there is only one relevant parameter for the regression, I will call it the control parameter. Conveniently, the Python code refers to the parameter as (capital) C, so calling it the control parameter is clear.

DEFINITION The regression control parameter sets the size and number of weights that result from a regression. Higher C settings yield more and higher weights, and lower C settings yield fewer and lower weights.

The Python nomenclature C derives from something called a cost parameter in the regression algorithm. It’s called a cost because the algorithm includes a penalty cost for the size of the weights. But the documentation states that the cost is 1/C, so C is the inverse or reciprocal of the cost. It is confusing to have a parameter that you call the cost, but where the cost is higher for lower parameter values, so I stick with calling it a control parameter, or C.
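For reference, the objective that scikit-learn roughly minimizes for an L1-penalized logistic regression (in the notation of its documentation, with outcomes coded as +1 and -1) is

sum over weights of |w_j|  +  C × sum over observations of log(1 + exp(-y_i(x_i · w + b)))

so a larger C makes fitting the data relatively more important than keeping the weights small, while a smaller C makes the weight penalty relatively more expensive and pushes more weights to exactly zero.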

9.3.2 Regression with the control parameter

Listing 9.4 shows a new version of the regression using the control parameter. This listing reuses all the helper functions from listing 8.2 (chapter 8), so there’s not much to it. The only difference is that listing 9.4 takes a value for C in the function call, passes it in when it creates the object, and then passes it as an extension to the output files. The output files are the same as those produced by listing 8.2.

You should run listing 9.4 on your simulated data. To see the effect of setting the C parameter, you can run three versions. These three versions have the C parameter set to 0.02, 0.01 and 0.005, respectively. Run these versions with the Python wrapper program, using the version argument as follows:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 4 --version 1 2 3

The results of running two of these versions of listing 9.4 are compared in figure 9.10, along with the result from the original regression (listing 8.2):

  • In the original listing, all but one weight is nonzero, and the highest-magnitude weight is 0.68.

  • With the C parameter set to 0.02, four weights are zero, and the highest-magnitude weight is 0.61.

  • With the C parameter set to 0.005, eight weights are zero, and the highest-magnitude weight is 0.42.

Figure 9.10 Comparison of regression weights resulting from different values of the control parameter, C

This overall pattern is what happens as the C parameter is reduced.

Listing 9.4 Regression using the control parameter C

from sklearn.linear_model import LogisticRegression
from listing_8_2_logistic_regression import prepare_data, save_regression_model
from listing_8_2_logistic_regression import save_regression_summary, save_dataset_predictions
 
def regression_cparam(data_set_path, C_param):
   X,y = prepare_data(data_set_path)
   retain_reg = LogisticRegression( C=C_param,
                                    penalty='l1',
                                    solver='liblinear', fit_intercept=True)
   retain_reg.fit(X, y)
   c_ext = '_c{:.3f}'.format(C_param)
   save_regression_summary(data_set_path,retain_reg,ext=c_ext)
   save_regression_model(data_set_path,retain_reg,ext=c_ext)
   save_dataset_predictions(data_set_path,retain_reg,X,ext=c_ext)

This listing uses all the helper functions from listing 8.2.

There is an additional parameter, C, for the regression.

Passes the parameter when the regression is created

Fits the regression, as in listing 8.2

Adds the parameter to the result filename

Calls the save functions

Note that the response of the algorithm to the C parameter setting is irregular. Reducing the C parameter from 1 to 0.02 removes three additional metrics from the regression results, and a further reduction from 0.02 to 0.005 removes four more. The way that the parameter is defined in the algorithm, you need to consider values of the control parameter that range below 1.0 (the default) and above zero, but the impact varies on a logarithmic scale as the parameter gets smaller.

When I say that the impact varies on a logarithmic scale, I mean that changes in the parameter must be significantly different in the logarithm of the parameter to make a big difference in the algorithm. The impact of going from 1.0 to 0.9 is not going to be much, and the impact of going from 1 to 0.1 is likely to be about the same as going from 0.1 to 0.01. It is inefficient to test the range of parameters between 0.1 and 1 on a linear scale like [1, 0.9, 0.8, ..., 0.1] because the best value can be below 0.1, and you will probably not see much change between values like 0.9 and 0.8. Instead, you should test parameters decreasing by a divisive factor, such as dividing by 10: [1, 0.1, 0.01, 0.001]. How small you have to go to see the right impact depends on your data. If you want to do a more detailed search of the parameter space, divide by a smaller factor like 2, as in [0.64, 0.32, 0.16, ...].

TAKEAWAY When you check smaller values of C, you must check values that are orders of magnitude smaller than 1.0. Usually, a C parameter around 1.0 assigns weights to all (or most) of the metrics that are even a little bit related to churn. To reduce the number of nonzero weights, try values of C like 0.1, 0.01, and 0.001.
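If you want to generate such a grid in code rather than typing it out, numpy makes it easy. This small sketch (not from the book) produces a coarse power-of-10 grid and a halving grid that matches the values tested later in listing 9.5:

import numpy as np

coarse = np.logspace(0, -3, num=4)            # [1.0, 0.1, 0.01, 0.001]
finer = 0.64 / np.power(2.0, np.arange(9))    # [0.64, 0.32, 0.16, ..., 0.005, 0.0025]
print(coarse)
print(finer)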

9.4 Picking the regression parameter by testing (cross-validation)

At this point, you should be wondering how low you should go with the control parameter. It makes sense to remove metrics with small weights, but at what point should you stop? This decision is best made by looking at the accuracy that results from each parameter setting.

9.4.1 Cross-validation

It should come as no surprise that you can remove metrics with small weights from a regression, and it won’t make much difference in the accuracy. (Because these weights are the small weights, they don’t make much difference.) A logical approach is to remove weights until you find that doing so harms accuracy. What can be more surprising is that removing some metrics improves accuracy.

You’re going to take different values of the C parameter and run a backtest with the parameter to see how accurate the resulting models are. At the same time, you can check how many metrics get zero and nonzero weights in the regression. Figure 9.11 illustrates this process. The general term for this type of procedure in machine learning and statistics is cross-validation.

Figure 9.11 Cross-validation to select the regression parameter

DEFINITION Cross-validation is the process of optimizing a forecasting model by comparing the accuracy and other characteristics of models created with different parameters.

Cross-validation is a common task in data science and machine learning, and it is what the CV stands for in the GridSearchCV object you were introduced to earlier. The GridSearch part of the name refers to the fact that a typical cross-validation works on a sequence or multiple sequences of parameters. If there were two parameters, each with its own sequence of values, the combinations of those two sequences would define a grid. In fact, there can be any number of parameters. For the regression model, you will do a cross-validation of one parameter. Later, you use higher-dimensional cross-validation for a machine learning model.
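To make the grid idea concrete, here is a sketch of a two-parameter grid in the dictionary format GridSearchCV expects (the parameter names are placeholders, not recommendations):

param_grid = {'param_one': [1.0, 0.1, 0.01],
              'param_two': [10, 20]}    # 3 x 2 = 6 parameter combinations to test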

9.4.2 Cross-validation code

Figure 9.12 shows the main result of cross-validation, which plots the AUC, the lift, and the number of weights that you get when running the regression for a sequence of values of the C parameter. This result confirms that small-weight metrics can be removed without hurting accuracy: the number of metrics can be reduced from 13 to 9 before any noticeable change in accuracy occurs. In the simulation, there was a slight gain in the lift but no gain in the AUC when the less important metrics were removed.

Figure 9.12 Cross-validation result plot

Listing 9.5 contains the code that produced figure 9.12. The listing contains multiple function definitions, but note that much of the code is for plotting and analysis. The Python open source package takes care of the cross-validation in a few lines. The functions in listing 9.5 follow:

  • crossvalidate_regression—This main function performs cross-validation, and it is almost the same as that in listing 9.4. The most important difference is that a sequence of C parameter values is passed instead of a single value. The other difference is that after the fit function on the GridSearchCV object returns, helper functions are called to perform additional analysis and to save the results.

  • test_n_weights—The GridSearchCV object tests each parameter for the accuracy of the model on the backtest, but it doesn’t test the number of weights returned by the regression. A separate loop is called to fit a regression at each C parameter in the sequence, and the number of nonzero weights is counted. This is done on the full dataset, so it is not a backtest but a measurement of the final model.

  • plot_regression_test—This function creates the plot shown in figure 9.12 by combining the results for AUC, lift, and the number of metrics with nonzero weights.

  • one_subplot—This helper function creates and formats each subplot.

Listing 9.5 also saves the results from figure 9.12 in a .csv file, shown in figure 9.13. This result is the output from GridSearchCV (as in section 9.2), but instead of a single row, there is one row in the table per value of the C parameter that was tested. There is also an extra column with the result from testing the number of weights. In the output from a cross-validation with multiple parameters, the columns labeled rank_test_lift and rank_test_AUC show how the models fit with the different parameter values rank on each accuracy metric. (Some of these columns may have seemed extraneous when you first saw them in section 9.2.)

Figure 9.13 Cross-validation result table

You should run listing 9.5 with the following command-line arguments to generate your own plot like figure 9.12 and a .csv file like figure 9.13:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 5

Listing 9.5 Regression C parameter cross-validation

import pandas as pd
import ntpath
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
 
from listing_8_2_logistic_regression import prepare_data
from listing_9_2_top_decile_lift import calc_lift
 
def crossvalidate_regression(data_set_path,
                             n_test_split):                                
 
   X,y = prepare_data(data_set_path,as_retention=False)
   tscv = TimeSeriesSplit(n_splits=n_test_split)
   score_models = {                                                        
      'lift': make_scorer(calc_lift, needs_proba=True), 
      'AUC': 'roc_auc'
   }
   retain_reg = LogisticRegression(penalty='l1', 
                                   solver='liblinear', fit_intercept=True)
   test_params = {'C' : [0.64, 0.32, 0.16, 0.08,                           
                         0.04, 0.02, 0.01, 0.005, 0.0025]}
   gsearch = GridSearchCV(estimator=retain_reg,                            
                          scoring=score_models, cv=tscv, 
                          verbose=1,return_train_score=False,
                          param_grid=test_params, refit=False)
   gsearch.fit(X,y)
   result_df = pd.DataFrame(gsearch.cv_results_)                           
   result_df['n_weight'] = test_n_weights(X,y,test_params)                    
   result_df.to_csv(data_set_path.replace('.csv', '_crossval.csv'), index=False)
   plot_regression_test(data_set_path,result_df)                           
 
def test_n_weights(X,y,test_params):                                       
   n_weights=[]
   for c in test_params['C']:                                              
       lr = LogisticRegression(penalty='l1',C=c,                           
                               solver='liblinear', fit_intercept=True)
       res=lr.fit(X,~y)                                                    
       n_weights.append(
          res.coef_[0].astype(bool).sum(axis=0))                           
   return n_weights
 
def plot_regression_test(data_set_path, result_df):                        
   result_df['plot_C']=result_df['param_C'].astype(str)                    
   plt.figure(figsize=(4,6))
   plt.rcParams.update({'font.size':8})
   one_subplot(result_df,1,'mean_test_AUC',                                
               ylim=(0.6,0.8),ytick=0.05)
   plt.title(                                                              
      ntpath.basename(data_set_path).replace(
                                       '_dataset.csv',' cross-validation'))
   one_subplot(result_df,2,'mean_test_lift',                               
               ylim=(2, 6),ytick=0.5)
   one_subplot(result_df,3,'n_weight',                                     
                ylim=(0,int(1+result_df['n_weight'].max())),ytick=2)
   plt.xlabel('Regression C Param')                                        
   plt.savefig(data_set_path.replace('.csv', '_crossval_regression.png'))
   plt.close()

def one_subplot(result_df,plot_n,var_name,ylim,ytick):
   ax = plt.subplot(3,1,plot_n)                                            
   ax.plot('plot_C', var_name,                                             
           data=result_df, marker='o', label=var_name)
   plt.ylim(ylim[0],ylim[1])                                               
   plt.yticks(np.arange(ylim[0],ylim[1], step=ytick))                      
   plt.legend()
   plt.grid()

The number of test splits is a parameter.

The score function wraps the lift in a scorer object.

Instead of one C parameter, tests a list

Creates the cross-validation object, and calls the fit method

Puts the result in a DataFrame

Adds another column with the result of the weight test

Makes a plot with plot_regression_test

Tests the number of weights for different C parameters

Loops over the parameters

Creates a logistic regression with one value of C

Fits the model

Counts the number of nonzero weights

Makes a plot from the result of the regression tests

String version of C parameter to use as the x-axis

Calls a helper function to plot the AUC

Adds a title above the first of three subplots

Calls a helper function to plot the lift

Calls a helper function to plot the number of nonzero weights

Adds an x-label after the third subplot

Starts the subplot given by the parameter

Plots the named variable against the string version of C

Sets the y-limits based on the parameters

Sets the ticks based on the parameters

9.4.3 Regression cross-validation case studies

Figure 9.14 shows examples of regression cross-validation from real company case studies. The number of nonzero weights is shown as a percentage rather than a count; otherwise, these results are read the same way as figure 9.12.

Figure 9.14 Cross-validation case study results

Following are some interesting features of the case study results:

  • The forecasts have AUC in the range 0.6 to 0.8.

  • The forecasts have lift in the range 2.0 to 3.5.

  • For two of the three case studies, a noticeable improvement in AUC and lift occurs when many metrics get zero weight from the regression. (This result is a clear example of simplicity also benefiting accuracy.) In these cases, the optimal values of the C parameter are in the range of around 0.02 to 0.08. The improvement over including all the features is a few percentage points of AUC.

  • For the third case study, the optimal AUC is achieved with all the metrics; removing any metrics results in a significant loss of accuracy.

These results are typical, but you may see more diversity in real case studies than I can present here.

9.5 Forecasting churn risk with machine learning

So far, you have learned about forecasting churn with a regression in which predictions are made by multiplying metrics by a set of weights. You can also predict churn with other kinds of forecasting models that are collectively known as machine learning. There is no official definition of what constitutes a machine learning model, but for the purpose of this book, I use the following.

DEFINITION A machine learning model is any predictive algorithm that has the following two characteristics: (1) the algorithm learns to make the prediction by processing sample data (as compared with making predictions with rules set by a human programmer), and (2) the algorithm is not the regression algorithm.

The second condition may seem strange because the regression algorithm certainly meets the first condition. The distinction is historical because the regression approach predates machine learning methods by decades.

9.5.1 The XGBoost learning model

This book covers only one machine learning algorithm—XGBoost—but the same techniques for fitting the model and forecasting apply to most other algorithms you may consider. The XGBoost algorithm is based on the concept of a decision tree, illustrated in figure 9.15.

Figure 9.15 Making predictions with a tree of rules

DEFINITION A decision tree is an algorithm for predicting an outcome (such as a customer’s churning or not churning) that consists of a binary tree made up of rules or tests.

Each test in a decision tree takes a single metric and checks whether it is greater than or less than a predetermined cut point. The prediction for an observation (of a customer) is determined by starting at the root of the tree and performing the first test. The result of the test determines which of the two branches to follow from the node, leading to one of the second-level tests. The results of all the tests determine a path through the tree, and each leaf of the tree has a designated prediction.
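To make the tree-of-rules idea concrete, here is a minimal hand-written sketch of a two-level tree; the metric names and cut points are invented for illustration and are not taken from the book's datasets.

def predict_churn_risk(obs):
    """Walk a tiny, hypothetical decision tree of metric cut points.

    obs is a dict of metric values for one customer observation. Each
    test compares one metric to a cut point and picks a branch; the
    leaf that is reached determines the prediction.
    """
    if obs['logins_per_month'] < 4:            # root test
        if obs['support_tickets'] > 2:         # second-level test, left branch
            return 'high risk'
        return 'medium risk'
    else:
        if obs['posts_per_month'] < 1:         # second-level test, right branch
            return 'medium risk'
        return 'low risk'

print(predict_churn_risk({'logins_per_month': 2,
                          'support_tickets': 3,
                          'posts_per_month': 0}))    # prints 'high risk'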

Small decision trees seem to be simple, and they were once considered to be easy-to-interpret machine learning models. But in practice, large decision trees for datasets with many metrics become hard to interpret. Fortunately, no one has to read the rules in the tree to make a prediction.

An algorithm chooses which metrics to test and where to place the cut points so that predictions on the sample data are as accurate as possible. If a backtest shows that the results are accurate, you can make predictions by using a decision tree without being too concerned about the substance of the rules. Methods to interpret a decision tree exist, but they are beyond the scope of this book. If you have more than a few metrics, understanding the influence of metrics on the likelihood of churn is best done through the grouping and regression methods shown in earlier chapters, so I won't spend time on interpreting decision trees.

Apart from being difficult to interpret, decision trees are no longer state of the art in terms of prediction accuracy. But decision trees are actually the building blocks for more accurate machine learning models. One example is a random forest, illustrated in figure 9.16.

Figure 9.16 Making predictions with a forest of rule trees

DEFINITION A random forest is an algorithm for predicting an outcome such as a customer’s churning by randomly generating a large set of decision trees (a forest). All the trees try to predict the same outcome, but each does so according to a different set of learned rules. The final prediction is made by averaging the predictions of the forest.

The random forest is an example of what is called an ensemble prediction algorithm because the final prediction is made from the combination of a group of other machine learning algorithms. Ensemble means a group evaluated as a whole rather than individually. A random forest is a simple type of ensemble in that each tree gets an equal vote in the outcome, and additional trees are added at random. Boosting is a name for machine learning algorithms that make some important improvements over ensembles such as random forest.
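The following sketch shows the equal-vote idea on synthetic stand-in data (not the book's dataset): scikit-learn's RandomForestClassifier exposes its individual trees in the estimators_ attribute, and averaging their probability votes reproduces the forest's own prediction.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 observations, 5 metrics, a binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the probability vote of every tree for the first observation
tree_votes = [tree.predict_proba(X[:1])[0, 1] for tree in forest.estimators_]
print(np.mean(tree_votes))                 # average of the individual trees' votes
print(forest.predict_proba(X[:1])[0, 1])   # the forest's prediction: the same value

In a real random forest, each tree is also trained on a different bootstrap sample and considers a random subset of metrics at each split, which is where the randomness comes from.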

DEFINITION Boosting is a machine learning ensemble in which the ensemble members are added so that they correct the errors of the existing ensemble.

Rather than adding decision trees at random, as in a random forest, a boosting ensemble creates each new tree to correct the wrong answers of the existing ensemble instead of repredicting the examples it already gets right. Internally, the boosting algorithm generates successive trees to correct the observations that earlier trees classified incorrectly. Also, the weight assigned to each successive tree's vote is chosen to best correct those mistakes, rather than every tree getting an equal vote as in a random forest. These improvements make boosted forests of decision trees more accurate than a truly random forest of decision trees.
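One way to see the error-correcting behavior is to watch accuracy improve as trees are added. This sketch uses scikit-learn's GradientBoostingClassifier on synthetic stand-in data, not XGBoost or the book's dataset, because its staged_predict_proba method conveniently reports the ensemble's prediction after each tree.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data, split into training and test portions
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

booster = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                     max_depth=2).fit(X_train, y_train)

# staged_predict_proba yields the ensemble's forecast after each added tree,
# so the test AUC typically rises as later trees correct earlier mistakes
for n_trees, proba in enumerate(booster.staged_predict_proba(X_test), start=1):
    if n_trees in (1, 10, 50):
        print(n_trees, round(roc_auc_score(y_test, proba[:, 1]), 3))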

XGBoost (short for extreme gradient boosting) is a machine learning model that (at the time of this writing) is the most popular and successful model for general-purpose prediction. XGBoost is popular because it delivers state-of-the-art performance, and the algorithm to fit the model is relatively fast (compared with other boosting algorithms, but not as fast as regression). Details about the XGBoost algorithm are beyond the scope of this book, but there are many excellent free resources online.

9.5.2 XGBoost cross-validation

Machine learning algorithms like XGBoost can make accurate predictions, but this accuracy comes with some additional complexity. One area of complexity is that the algorithms have multiple optional parameters that you must choose correctly to get the best results. The optional parameters for XGBoost include ones that control how the individual decision trees are generated, as well as parameters that control how the votes of different decision trees are combined. Here are a few of the most important parameters for XGBoost:

  • max_depth—The maximum depth of rules in each decision tree

  • n_estimators—The number of decision trees to generate

  • learning_rate—How much each new tree's correction contributes to the combined prediction (smaller values make more conservative updates)

  • min_child_weight—The minimum total instance weight required in a tree node for it to be split further; larger values keep the trees from fitting rules to only a handful of observations

Because there is no straightforward way to select the values for so many parameters, the values are set by out-of-sample cross-validation. You used this approach for the control parameter on the regression in section 9.4.
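For reference, the parameters listed above would be set on the XGBClassifier as follows; the values shown are only illustrative placeholders, because listing 9.6 chooses the actual values by cross-validation.

import xgboost as xgb

# Illustrative values only; listing 9.6 picks the real values by cross-validation
xgb_model = xgb.XGBClassifier(objective='binary:logistic',
                              max_depth=2,           # depth of each decision tree
                              n_estimators=80,       # number of trees in the ensemble
                              learning_rate=0.2,     # contribution of each new tree
                              min_child_weight=6)    # minimum instance weight in a node
# xgb_model.fit(X, y) would then fit the boosted ensemble on the prepared data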

TAKEAWAY State-of-the-art machine learning models have so many parameters that the only way to make sure you pick the best values is to cross-validate a large number of them. That is, you test a sequence of plausible values for each parameter and choose the combination that performs best in a cross-validation test.

Figure 9.17 XGBoost code output

Figure 9.17 shows an example of such a cross-validation result.

Figure 9.17 was created by running listing 9.6 on the simulated social network dataset used in earlier chapters. It is similar to the cross-validation results you saw for picking the regression C parameter, but it has both more columns and more rows:

  • There are four columns of parameters because four parameters were part of the test: max_depth, n_estimators, learning_rate, and min_child_weight.

  • There are many more rows in the output table—256 parameter combinations, to be precise. The reason for 256 parameter combinations becomes clear when you inspect listing 9.6: the test is made over four parameters, and the sequence of values for each parameter has four entries. The total number of combinations is the product of the number of values for each parameter—in this case, 4 × 4 × 4 × 4 = 256.

You should run listing 9.6 on your own simulated data, using the usual Python wrapper program with these arguments:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 6

Do not be surprised if the cross-validation for the XGBoost model takes a lot longer than it did for the regression. There are a lot more parameter combinations to test, and each time a model is fit, the process takes significantly longer. The precise time can vary (depending on your hardware), but for me, the XGBoost model takes about 40 times longer to fit in comparison with the regression model. As shown in figure 9.8, the regression takes only a few hundredths of a second to fit on average; figure 9.17 shows that the XGBoost fits take around 1 to 4 seconds.

NOTE XGBoost is in its own Python package, so if you have not used it before, you need to install it before running listing 9.6.

Listing 9.6 XGBoost cross-validation

import pandas as pd
import pickle
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import make_scorer
import xgboost as xgb                                                          
 
from listing_8_2_logistic_regression import prepare_data
from listing_9_2_top_decile_lift import calc_lift
 
 
def crossvalidate_xgb(data_set_path,n_test_split):
 
   X,y = prepare_data(data_set_path,ext='',as_retention=False)                 
   tscv = TimeSeriesSplit(n_splits=n_test_split)
   score_models = {'lift': make_scorer(calc_lift, needs_proba=True), 'AUC': 'roc_auc'}
 
   xgb_model = xgb.XGBClassifier(objective='binary:logistic')                  
   test_params = { 'max_depth': [1,2,4,6],                                     
                   'learning_rate': [0.1,0.2,0.3,0.4],                         
                   'n_estimators': [20,40,80,120],                             
                   'min_child_weight' : [3,6,9,12]}                            
 
   gsearch = GridSearchCV(estimator=xgb_model,n_jobs=-1, scoring=score_models, 
                          cv=tscv, verbose=1, return_train_score=False,                                   
                          param_grid=test_params,refit='AUC')                  
   gsearch.fit(X.values,y)                                                     
 
   result_df = pd.DataFrame(gsearch.cv_results_)                               
   result_df.sort_values('mean_test_AUC',ascending=False,inplace=True)         
   save_path = data_set_path.replace('.csv', '_crossval_xgb.csv')
   result_df.to_csv(save_path, index=False)
   print('Saved test scores to ' + save_path)
 
   pickle_path = data_set_path.replace('.csv', '_xgb_model.pkl')               
   with open(pickle_path, 'wb') as fid:
        pickle.dump(gsearch.best_estimator_, fid)                              
   print('Saved model pickle to ' + pickle_path)

Imports XGBoost, which is in a separate package

Most of this function is the same as listing 9.5: the regression cross-validation.

Creates an XGBClassifier object for a binary outcome

Tests tree depths from 1 to 6

Tests learning rates from 0.1 to 0.4

Tests the number of estimators from 20 to 120

Tests minimum weights from 3 to 12

Creates the GridSearchCV object with the XGBoost model object, and tests parameters

Refits the best model according to AUC after cross-validation

Passes as values, not a DataFrame, to avoid a known package issue at the time of this writing

Transfers the results to a DataFrame

Sorts the result so the best AUC is first

Creates a pickle of the best result

The best model is in the best_estimator_ attribute of the GridSearchCV object.

The code in listing 9.6 is similar to the one for cross-validating the regression (listing 9.5). The main steps are

  1. Prepare the data.

  2. Create a model instance (in this case, an XGBoost model).

  3. Define the accuracy measurement functions to use (lift and AUC).

  4. Define the sequences of parameters to test.

  5. Pass the prepared parameters to the GridSearchCV object and call the fit function.

  6. Save the results (without the additional weight-testing analysis performed in the regression cross-validation).

One important and slightly subtle difference between listing 9.6 and the regression cross-validation in listing 9.5 is that the dataset is created from the original unscaled metrics; it doesn't use the scores or groups that you use for the regression. There is no reason to rescale metrics for XGBoost (or decision trees generally) because the cut points in the rules work just as well on the raw metrics, regardless of their scale or skew. Also, grouping correlated metrics doesn't provide any benefit; in fact, it can hurt the performance of this type of machine learning model. Grouping correlated metrics is beneficial for interpretation, and it averts the problems that correlated metrics can cause in regression.

On the other hand, for XGBoost, a diversity of metrics is beneficial, and correlation does no harm. (If two metrics are correlated, either can make a suitable rule node in a tree.) For these reasons, the prepare_data function from chapter 8 is called with an empty extension argument so that it loads the original dataset rather than the grouped scores (the default behavior).

9.5.3 Comparison of XGBoost accuracy to regression

Because XGBoost has many more parameters to tune and takes much longer to fit, you should expect it to provide some improvement in forecasting accuracy. This expectation is confirmed in figure 9.18, which compares the AUC and lift achieved by the regression and XGBoost models on the simulation, as well as on three real company case study datasets from the companies introduced in chapter 1. The AUC improvement ranges from 0.02 to 0.06, and XGBoost always produces more accurate forecasts than the regression does. In terms of lift, the improvement is 0.1 to 0.5.

Figure 9.18 Comparison of regression and XGBoost lift

Are those improvements significant? Remember that the full range of AUCs you're likely to see in churn forecasting is around 0.6 to 0.8. The maximum AUC, therefore, is 0.2 more than the minimum, so in relative terms, an improvement of 0.02 in AUC represents 10% of the overall possible range. By the same token, a 0.05 improvement in AUC represents 25% of the difference between worst and best in class, so these improvements are significant. Still, the forecasting is not perfect, even with machine learning, which is why I advised in chapter 1 that predicting churn with machine learning is not likely to live up to some of the hype in the machine learning field.

TAKEAWAY Though machine learning algorithms can produce forecasts that are significantly more accurate than regression, churn will always be hard to predict due to factors such as subjectivity, imperfect information, rarity, and extraneous factors that influence the timing of churn.

9.5.4 Comparison of advanced and basic metrics

Another important question is how much of the improvement in accuracy can be attributed to the work you did to create advanced metrics back in chapter 7. So far, you may have assumed that because the advanced metrics showed a relationship to churn in cohort analysis, they must have improved the model. But just as you validate your data and modeling by showing that your model can predict out of sample, it makes sense to confirm empirically that the work you did creating more metrics contributed to accuracy.

To make the comparison on the simulated social network datasets, you can run additional versions of the cross-validation testing command on the original dataset from chapter 4. That is, you run the dataset without the advanced metrics from chapter 7—you use only the basic metrics from chapter 3. To run the regression cross-validation on the basic metric dataset, use the following:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 5 --version 1

The result is a cross-validation table like the one shown in figure 9.13. You will probably find that the maximum accuracy of any model is somewhat lower for the data with basic metrics than for the data with advanced metrics. As illustrated in the bar chart in figure 9.19, the maximum AUC that I got from the regression on the simulated data with basic metrics was 0.63; for the regression on the simulated data with advanced metrics, the maximum AUC was 0.75. The time spent creating advanced metrics was well spent. In fact, the regression accuracy with advanced metrics is significantly better than that of XGBoost with basic metrics, and the additional improvement for the machine learning algorithm when it uses advanced metrics is relatively small.

Figure 9.19 Comparison of AUC using basic and advanced metrics

You can perform the same check on the XGBoost model by running the second version of the XGBoost cross-validation command with these arguments:

fight-churn/listings/run_churn_listing.py --chapter 9 --listing 6 --version 1

In this case, you will probably find that the XGBoost forecasts did a bit better with the advanced metrics. I got an AUC of 0.774 by using XGBoost with basic metrics compared with 0.797 for XGBoost with advanced metrics; the improvement attributable to advanced metrics is 0.023.

Figure 9.19 also contains similar comparisons for forecasts made on the three real company case studies introduced in chapter 1. These comparisons show different relationships between accuracy with and without advanced metrics. These three cases illustrate the range of scenarios you may encounter in your own case studies:

  1. In the first case study, the regression accuracy is significantly improved by the advanced metrics, but XGBoost doesn’t get any improvement, and XGBoost is best overall. This result shows that you can’t always expect advanced metrics to improve machine learning.

  2. In the second case study, both the regression and XGBoost are significantly improved by the addition of advanced metrics. The regression accuracy with advanced metrics is about the same as the XGBoost accuracy with basic metrics. The XGBoost accuracy with advanced metrics is the highest of all by a significant amount: around 0.1 more than regression with basic metrics.

  3. In the third case study, the regression using advanced metrics has higher accuracy than XGBoost without advanced metrics. But the highest accuracy of all is achieved by XGBoost using advanced metrics: more than 0.1 improvement over basic metrics and regression. This case study is most similar to the social network simulation.

These cases demonstrate that if high accuracy on churn forecasts is a high priority for you, both machine learning and advanced metrics are important. In my experience, advanced metrics usually improve the accuracy of churn forecasts for both regression and machine learning models like XGBoost.

9.6 Segmenting customers with machine learning forecasts

Listing 9.6 found a set of parameters that produces a machine learning model with high accuracy. The program also saved the best model in a pickle file. If you want to use the model to forecast on your active customers, you need to reload the saved model and use it on an active customer list. The code is shown in listing 9.7, which is practically the same as listing 8.5, the listing you used to make forecasts with the saved regression model. The listing does the following:

  1. Reloads the saved model pickle

  2. Loads the current customer dataset

  3. Calls the predict_proba function on the model, passing the data as a parameter

  4. Saves the results as a DataFrame of predictions and a histogram summarizing the result

As in the XGBoost cross-validation in listing 9.6, the data is kept in its original form, unscaled and ungrouped. The preparation of the data for forecasting must match the way the data was prepared when the model was trained.

Listing 9.7 XGBoost forecasting

import pandas as pd
import os
import pickle
from listing_8_4_rescore_metrics import reload_churn_data
from listing_8_5_churn_forecast import forecast_histogram
 
def churn_forecast_xgb(data_set_path):
   pickle_path = data_set_path.replace('.csv', '_xgb_model.pkl')              
   assert os.path.isfile(pickle_path), 'Run listing 9.6 to save an XGB model'
   with open(pickle_path, 'rb') as fid:
       xgb_model = pickle.load(fid)
 
   current_df = reload_churn_data(data_set_path,                        
                                 'current','8.3',is_customer_data=True)
   predictions = xgb_model.predict_proba(current_df.values)                   
   predict_df = pd.DataFrame(predictions,                               
                             index=current_df.index,
                             columns=['retain_prob','churn_prob'])
 
   forecast_save_path = data_set_path.replace('.csv', '_current_xgb_predictions.csv')
   print('Saving results to %s' % forecast_save_path)
   predict_df.to_csv(forecast_save_path, header=True)
 
   forecast_histogram(data_set_path,                                    
                      predict_df,ext='xgb')

Reloads the XGBoost model saved in the pickle file

Reloads the current customer metric data

Makes the predictions

Makes a DataFrame from the predictions

This function from listing 8.5 makes a histogram.

Listing 9.7 also creates a histogram of the XGBoost churn forecasts on current customers. It’s not shown because it is similar to the plot that you made for the churn probability forecasts with the regression model (made with the same function).

NOTE You should check the calibration and distribution of XGBoost forecasts as you learned to do for the regression forecasts in chapter 8.

For the social network simulation, the distribution and calibration of XGBoost forecasts turned out to be similar to the regression, but this result is a coincidence, not something you can always expect. You can’t expect XGBoost forecasts to be calibrated and distributed like the regression forecasts because the XGBoost forecast probabilities are not probabilities in the same sense as the regression forecast probabilities.

Recall that calibration refers to the property that your forecasts are in accordance with the true probability of the events occurring. On the other hand, accuracy measured by the AUC and lift depends on the ordering or ranking of the forecasts, not their precise values. The regression model is designed so that the forecast probabilities are calibrated to the sample data, as well as being as accurate as the model allows. When the XGBoost model gives a forecast probability, that probability is the weighted vote of the ensemble's decision trees. Those votes are optimized to rank the risk of churn, something at which XGBoost is successful, as shown by the accuracy results. But the vote of the ensemble decision trees is not designed to produce forecasts calibrated to actual churn rates.
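If you want to check the calibration of the XGBoost forecasts yourself, scikit-learn's calibration_curve compares forecast probabilities to observed outcome rates in probability bins. The following is a minimal sketch; it assumes you have a held-out DataFrame of metrics X_test and churn outcomes y_test, plus the fitted xgb_model from listing 9.6.

from sklearn.calibration import calibration_curve

# Churn probability forecasts on held-out observations (assumed X_test, y_test)
churn_prob = xgb_model.predict_proba(X_test.values)[:, 1]

# Compare the observed churn rate in each bin to the average forecast in that bin
observed_rate, mean_forecast = calibration_curve(y_test, churn_prob, n_bins=10)
for forecast, observed in zip(mean_forecast, observed_rate):
    print('forecast ~%.2f  observed churn rate %.2f' % (forecast, observed))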

TAKEAWAY XGBoost doesn’t necessarily give calibrated churn probability forecasts. The XGBoost model is optimized for accuracy as measured by the classification of churns, not matching observed churn rates.

Because the XGBoost model's forecasts are not reliably calibrated, they are not suitable for estimating customer lifetime value the way the regression forecasts were used in chapter 8.

WARNING Do not use XGBoost for predicting customer lifetime value or any other use case that depends on the churn probability forecasts matching real churn probabilities. The same applies to most machine learning models: read the literature for the model you’re using to confirm whether it produces forecasts that are calibrated in addition to being accurate.

Summary

  • Because of the rarity of churn, the accuracy of churn forecasts cannot be measured with the standard accuracy measurements.

  • The area under the curve (AUC) is the percentage of times that the model ranks a churn as having higher risk than a nonchurn, considering all pairs of churns and nonchurns.

  • The lift is the ratio of the churn rate in the top decile of churn risk forecasts to the overall churn rate.

  • The AUC and lift are good measurements of the accuracy of churn forecasts.

  • Accuracy should be measured on samples that were not used to train the model.

  • For churn, accuracy should be measured in a backtesting (historical) simulation that reflects the fact that product and market conditions may change over time.

  • The regression model taught in this book includes a control parameter that sets the overall size of the weights and the number of nonzero weights.

  • The best value to use for the regression control parameter can be found by testing the accuracy of versions of the model using different values of the regression parameter.

  • Setting a forecasting model parameter by testing is known as cross-validation.

  • For regression, you choose the value of the control parameter that minimizes the number of nonzero weights while helping, or at least not harming, accuracy.

  • Usually, a significant fraction of the metrics can be assigned zero weights in a regression; the accuracy either improves or doesn’t get worse.

  • A machine learning model is a forecasting model that is fit from the data (not programmed) and is not the regression model.

  • A decision tree is a simple machine learning model that forecasts by analyzing customers with a tree of metric comparison rules.

  • XGBoost is a state-of-the-art machine learning model that uses an ensemble of decision trees and weights their predictions together to maximize accuracy.

  • XGBoost and other machine learning models have many parameters that must be set using cross-validation.

  • The accuracy of XGBoost forecasts generally exceeds the accuracy of regression forecasts.

  • Using advanced metrics in addition to basic metrics usually makes forecasts more accurate for both regression and machine learning models.

  • XGBoost churn probability forecasts are not calibrated to actual churn rates, so XGBoost churn forecasts should not be used for customer lifetime value or other use cases that depend on matching the actual churn probabilities.
