Dataset

For the challenge, Orange released a large dataset of customer data, covering about one million customers described in ten tables with hundreds of fields. In the first step, the data was resampled to select a less unbalanced subset of 100,000 customers. In the second step, an automatic feature construction tool generated 20,000 features describing the customers, which were then narrowed down to 15,000 features. In the third step, the dataset was anonymized by randomizing the order of the features, discarding the attribute names, replacing the nominal variables with randomly generated strings, and multiplying the continuous attributes by a random factor. Finally, all of the instances were split randomly into training and testing datasets.
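The anonymization step described above can be sketched in a few lines of Python. This is only an illustration of the four operations listed, not the tooling Orange actually used: the function name `anonymize`, the placeholder names `Var1`, `Var2`, ..., the six-letter tokens, and the 0.5 to 2.0 scaling range are all assumptions made for the example.

```python
import random
import string


def anonymize(rows, names, nominal, seed=0):
    """Return (new_names, new_rows) for a list of equal-length rows.

    rows    -- list of lists, one value per feature (None = missing)
    names   -- original feature names, in the same order as the row values
    nominal -- set of names whose values are categories, not numbers
    """
    rng = random.Random(seed)

    # 1. Randomize the order of the features.
    order = list(range(len(names)))
    rng.shuffle(order)

    # 2. Discard the attribute names (placeholder names are an assumption).
    new_names = [f"Var{i + 1}" for i in range(len(order))]

    # 3. One random-string code book per nominal column,
    #    one random positive scale factor per continuous column.
    codebooks = {j: {} for j in order if names[j] in nominal}
    factors = {j: rng.uniform(0.5, 2.0) for j in order if names[j] not in nominal}

    def random_token():
        return "".join(rng.choice(string.ascii_letters) for _ in range(6))

    new_rows = []
    for row in rows:
        out = []
        for j in order:
            value = row[j]
            if value is None:
                out.append(None)  # missing values stay missing
            elif j in codebooks:
                book = codebooks[j]
                if value not in book:
                    # Same category -> same token; tokens are unique per column.
                    token = random_token()
                    while token in book.values():
                        token = random_token()
                    book[value] = token
                out.append(book[value])
            else:
                out.append(value * factors[j])
        new_rows.append(out)
    return new_names, new_rows
```

Because each continuous column is multiplied by a single factor, the relative ordering and ratios of values within a column are preserved, which is why models can still learn from the anonymized data even though the original units are hidden.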

The KDD Cup provided two sets of data, a large set and a small set, corresponding to the fast and slow challenges, respectively. In both sets, the training and testing parts each contained 50,000 examples. The two sets were split in the same way, but the samples were ordered differently within each set.

In this chapter, we will work with the small dataset, consisting of 50,000 instances, each described with 230 variables. Each of the 50,000 rows of data corresponds to a client, and they are associated with three binary outcomes, one for each of the three challenges (upselling, churn, and appetency).

To make this clearer, the following table illustrates the dataset:

The table depicts the first 25 instances, that is, customers, each described with 230 attributes. For this example, only a selected subset of 10 attributes is shown. The dataset contains many missing values, and even empty or constant attributes. The last three columns of the table correspond to the three distinct class labels, which hold the ground truth, that is, whether the customer indeed switched providers (churn), bought a service (appetency), or bought an upgrade (upselling). Note, however, that the labels are provided separately from the data, in three distinct files. The only link between a customer and its labels is the row position, so it is essential to retain the order of the instances and of the class labels in every file.
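Since the labels live in separate files, one way to guard against misalignment is to join them to the data by row position and fail loudly if the row counts differ. The following is a minimal sketch under assumptions not stated in the text: the data is taken to be tab-separated with a header line, each label file is taken to hold one label per line, and the function name `attach_labels` and the tiny in-memory sample are hypothetical.

```python
import csv
import io


def attach_labels(data_file, label_files):
    """Pair each data row with its labels, relying purely on row order.

    data_file   -- open text file, tab-separated, first line is a header
    label_files -- dict: outcome name -> open text file, one label per line
    """
    reader = csv.reader(data_file, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]  # skip completely blank lines

    labels = {}
    for outcome, handle in label_files.items():
        values = [line.strip() for line in handle if line.strip()]
        # A length mismatch means the files no longer correspond row by row.
        if len(values) != len(rows):
            raise ValueError(
                f"{outcome}: {len(values)} labels for {len(rows)} rows"
            )
        labels[outcome] = values

    outcomes = list(labels)
    merged = [
        row + [labels[name][i] for name in outcomes]
        for i, row in enumerate(rows)
    ]
    return header + outcomes, merged


# Tiny in-memory stand-in for the real files (two customers, two attributes).
data = io.StringIO("Var1\tVar2\n1\t\n\tx\n")
label_files = {
    "churn": io.StringIO("-1\n1\n"),
    "appetency": io.StringIO("-1\n-1\n"),
    "upselling": io.StringIO("1\n-1\n"),
}
header, merged = attach_labels(data, label_files)
```

Empty fields are kept as empty strings, so missing values remain visible after the join. The length check catches dropped or extra rows, but it cannot detect rows that were reordered, so any shuffling or filtering should be applied only after the labels have been attached.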
