Chapter 4. Customer Relationship Prediction with Ensembles

Any company offering a service, product, or experience needs a solid understanding of its relationship with its customers; Customer Relationship Management (CRM) is therefore a key element of modern marketing strategies. One of the biggest challenges that businesses face is understanding exactly what causes a customer to buy new products.

In this chapter, we will work on a real-world marketing database provided by the French telecom company, Orange. The task will be to estimate the following likelihoods for customer actions:

  • Switch provider (churn)
  • Buy new products or services (appetency)
  • Buy upgrades or add-ons proposed to them to make the sale more profitable (upselling)

We will tackle the Knowledge Discovery and Data Mining (KDD) Cup 2009 challenge (KDD Cup, 2009) and show the steps to process the data using Weka. First, we will parse and load the data and implement the basic baseline models. Later, we will address advanced modeling techniques, including data pre-processing, attribute selection, model selection, and evaluation.

Note

KDD Cup is the leading data mining competition in the world. It is organized annually by the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). The winners are announced at the Conference on Knowledge Discovery and Data Mining, which is usually held in August.

Yearly archives, including all the corresponding datasets, are available here: http://www.kdd.org/kdd-cup.

Customer relationship database

The most practical way to build knowledge on customer behavior is to produce scores that explain a target variable such as churn, appetency, or upselling. The score is computed by a model using input variables describing customers, for example, current subscription, purchased devices, consumed minutes, and so on. The scores are then used by the information system, for example, to provide relevant personalized marketing actions.

In 2009, the KDD conference organized a machine learning challenge on customer relationship prediction (KDD Cup, 2009).

Challenge

Given a large set of customer attributes, the task was to estimate the following three target variables (KDD Cup, 2009):

  • Churn probability, in our context, is the likelihood a customer will switch providers:

    Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.

  • Appetency probability, in our context, is the propensity to buy a service or product
  • Upselling probability is the likelihood that a customer will buy an add-on or upgrade:

    Upselling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Upselling usually involves marketing more profitable services or products, but upselling can also be simply exposing the customer to other options he or she may not have considered previously. Upselling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

The challenge was to beat the in-house system developed by Orange Labs. This was an opportunity for participants to prove that they could handle a large database with heterogeneous, noisy data and unbalanced class distributions.

Dataset

For the challenge, the company Orange released a large dataset of customer data, containing about one million customers, described in ten tables with hundreds of fields. In the first step, they resampled the data to select a less unbalanced subset containing 100,000 customers. In the second step, they used an automatic feature construction tool that generated 20,000 features describing customers, which was then narrowed down to 15,000 features. In the third step, the dataset was anonymized by randomizing the order of features, discarding attribute names, replacing nominal variables with randomly generated strings, and multiplying continuous attributes by a random factor. Finally, all the instances were split randomly into training and test sets.

The KDD Cup provided two datasets, a large set and a small set, corresponding to the fast and the slow challenge, respectively. They are described at the KDD Cup site as follows:

Both training and test sets contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training and within the test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical.

In this chapter, we will work with the small dataset, consisting of 50,000 instances described with 230 variables each. Each of the 50,000 rows corresponds to a client and is associated with three binary outcomes, one for each of the three tasks (churn, appetency, and upselling).

To make this clearer, the following image illustrates the dataset in a table format:

Dataset

The table depicts the first 25 instances, that is, customers, each described with 230 attributes. For this example, only a selected subset of 10 attributes is shown. The dataset contains many missing values, and even some empty or constant attributes. The last three columns of the table correspond to the three class labels, that is, the ground truth of whether the customer indeed switched providers (churn), bought a service (appetency), or bought an upgrade (upselling). However, note that the labels are provided separately from the data in three distinct files; hence, it is essential to retain the order of the instances and the corresponding class labels to ensure proper correspondence.
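Because the labels ship separately, loading the small dataset is a two-step job: read the tab-separated data file, then append the label column in row order. The following is a minimal sketch using Weka's CSVLoader; the file names and the choice of the churn task are assumptions for illustration (the same pattern applies to the appetency and upselling label files):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadOrangeData {
    public static void main(String[] args) throws Exception {
        // Load the tab-separated data; empty fields are treated as missing.
        CSVLoader loader = new CSVLoader();
        loader.setFieldSeparator("\t");
        loader.setMissingValue("");
        loader.setSource(new File("orange_small_train.data")); // assumed file name
        Instances data = loader.getDataSet();

        // Read the churn labels: one value per line (1 or -1),
        // line i corresponding to data row i.
        List<String> labels = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new FileReader("orange_small_train_churn.labels"))) { // assumed file name
            String line;
            while ((line = in.readLine()) != null) {
                labels.add(line.trim());
            }
        }

        // Append a nominal class attribute and fill it in instance order;
        // do not reorder the data before this step.
        List<String> classValues = new ArrayList<>();
        classValues.add("1");   // churned
        classValues.add("-1");  // did not churn
        data.insertAttributeAt(new Attribute("churn", classValues), data.numAttributes());
        data.setClassIndex(data.numAttributes() - 1);
        for (int i = 0; i < data.numInstances(); i++) {
            data.instance(i).setValue(data.classIndex(), labels.get(i));
        }

        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}
```

Keeping the label merge explicit like this makes the order-preservation requirement visible in code: any shuffling or resampling must happen only after the class attribute is attached.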

Evaluation

The submissions were evaluated according to the arithmetic mean of the area under the ROC curve (AUC) over the three tasks, that is, churn, appetency, and upselling. A ROC curve shows the performance of a model as a curve obtained by plotting sensitivity (the true positive rate) against 1 − specificity (the false positive rate) for various threshold values used to determine the classification result (refer to Chapter 1, Applied Machine Learning Quick Start, section ROC curves). The AUC is the area under this curve: the larger the area, the better the classifier. Most toolboxes, including Weka, provide an API to calculate the AUC score.
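In Weka, the per-task AUC is available through the Evaluation class (its areaUnderROC(classIndex) method), but the statistic itself is easy to compute directly: it is the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one, with ties counted as half. A self-contained sketch of this pairwise computation and of the challenge's averaged score (the class and method names here are ours, not Weka's):

```java
public class AucScore {

    // AUC via pairwise comparison (Wilcoxon-Mann-Whitney statistic):
    // the fraction of (positive, negative) pairs where the positive
    // instance receives the higher score; ties count as 0.5.
    static double auc(double[] scores, int[] labels) { // labels: 1 or -1
        long pairs = 0;
        double wins = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) continue;
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != -1) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }

    // The challenge score: arithmetic mean of the churn, appetency,
    // and upselling AUCs.
    static double meanAuc(double churn, double appetency, double upselling) {
        return (churn + appetency + upselling) / 3.0;
    }

    public static void main(String[] args) {
        // Toy example: a model that ranks every positive above every
        // negative gets AUC 1.0, regardless of the absolute scores.
        double[] scores = {0.9, 0.8, 0.3, 0.1};
        int[] labels = {1, 1, -1, -1};
        System.out.println(auc(scores, labels));      // prints 1.0
        System.out.println(meanAuc(0.5, 0.75, 1.0));  // prints 0.75
    }
}
```

The pairwise formulation also explains why AUC suits this challenge's unbalanced classes: it depends only on the ranking of positives against negatives, not on the classification threshold or the class proportions.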
