Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens

A categorical variable, as the name suggests, is used to represent categories or labels. For instance, a categorical variable could represent major cities in the world, the four seasons in a year, or the industry (oil, travel, technology) of a company. The number of category values is always finite in a real-world dataset. The values may be represented numerically. However, unlike other numeric variables, the values of a categorical variable cannot be ordered with respect to one another. (Oil is neither greater than nor less than travel as an industry type.) They are called nonordinal.

A simple question can serve as litmus test for whether something should be a categorical variable: “Does it matter how different two values are, or only that they are different?” A stock price of $500 is five times higher than a price of $100. So, stock price should be represented by a continuous numeric variable. The industry of the company (oil, travel, tech, etc.), on the other hand, should probably be categorical.

Large categorical variables are particularly common in transactional records. For instance, many web services track users using an ID, which is a categorical variable with hundreds to hundreds of millions of values, depending on the number of unique users of the service. The IP address of an internet transaction is another example of a large categorical variable. They are categorical variables because, even though user IDs and IP addresses are numeric, their magnitude is usually not relevant to the task at hand. For instance, the IP address might be relevant when doing fraud detection on individual transactions—some IP addresses or subnets may generate more fraudulent transactions than others. But a subnet of 164.203.x.x is not inherently more fraudulent than 164.202.x.x; the numeric value of the subnet does not matter.

The vocabulary of a document corpus can be interpreted as a large categorical variable, with the categories being unique words. It can be computationally expensive to represent so many distinct categories. If a category (e.g., word) appears multiple times in a data point (document), then we can represent it as a count, and represent all of the categories through their count statistics. This is called bin counting. We start this discussion with common representations of categorical variables, and eventually meander our way to a discussion of bin counting for large categorical variables, which are very common in modern datasets.

Encoding Categorical Variables

The categories of a categorical variable are usually not numeric.¹ For example, eye color can be “black,” “blue,” “brown,” etc. Thus, an encoding method is needed to turn these nonnumeric categories into numbers. It is tempting to simply assign an integer, say from 1 to k, to each of k possible categories—but the resulting values would be orderable against each other, which should not be permissible for categories. So, let’s look at some alternatives.

One-Hot Encoding

A better method is to use a group of bits. Each bit represents a possible category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on.” This is called one-hot encoding, and it is implemented in scikit-learn as sklearn.preprocessing.OneHotEncoder. Each of the bits is a feature. Thus, a categorical variable with k possible categories is encoded as a feature vector of length k. Table 5-1 shows an example.

Table 5-1. One-hot encoding of a category of three cities
	e₁	e₂	e₃
San Francisco	1	0	0
New York	0	1	0
Seattle	0	0	1

One-hot encoding is very simple to understand, but it uses one more bit than is strictly necessary. If we see that k–1 of the bits are 0, then the last bit must be 1 because the variable must take on one of the k values. Mathematically, one can write this constraint as “the sum of all bits must be equal to 1”:

e_{1} + e_{2} + . . . + e_{k} = 1

$e_{1} + e_{2} + . . . + e_{k} = 1$

Thus, we have a linear dependency on our hands. Linear dependent features, as we discovered in Chapter 4, are slightly annoying because they mean that the trained linear models will not be unique. Different linear combinations of the features can make the same predictions, so we would need to jump through extra hoops to understand the effect of a feature on the prediction.

Dummy Coding

The problem with one-hot encoding is that it allows for k degrees of freedom, while the variable itself needs only k–1. Dummy coding² removes the extra degree of freedom by using only k–1 features in the representation (see Table 5-2). One feature is thrown under the bus and represented by the vector of all zeros. This is known as the reference category. Dummy coding and one-hot encoding are both implemented in Pandas as pandas.get_dummies.

Table 5-2. Dummy coding of a category of three cities
	e₁	e₂
San Francisco	1	0
New York	0	1
Seattle	0	0

The outcome of modeling with dummy coding is more interpretable than with one-hot encoding. This is easy to see in a simple linear regression problem. Suppose we have some data about apartment rental prices in three cities: San Francisco, New York, and Seattle (see Table 5-3).

Table 5-3. Toy dataset of apartment prices in three cities
	City	Rent
0	SF	3999
1	SF	4000
2	SF	4001
3	NYC	3499
4	NYC	3500
5	NYC	3501
6	Seattle	2499
7	Seattle	2500
8	Seattle	2501

We can train a linear regressor to predict rental price based solely on the identity of the city (see Example 5-1).

The linear regression model can be written as:

y = w_{1} x_{1} + . . . + w_{n} x_{n}

$y = w_{1} x_{1} + . . . + w_{n} x_{n}$

It is customary to fit an extra constant term called the intercept, so that y can be a nonzero value when the x’s are zeros:

y = w_{1} x_{1} + . . . + w_{n} x_{n} + b

$y = w_{1} x_{1} + . . . + w_{n} x_{n} + b$

Example 5-1. Linear regression on a categorical variable using one-hot and dummy codes

>>> import pandas
>>> from sklearn import linear_model

# Define a toy dataset of apartment rental prices in 
# New York, San Francisco, and Seattle
>>> df = pd.DataFrame({
...     'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC', 
...              'Seattle', 'Seattle', 'Seattle'],
...     'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]
... })
>>> df['Rent'].mean()
3333.3333333333335

# Convert the categorical variables in the DataFrame to one-hot encoding
# and fit a linear regression model
>>> one_hot_df = pd.get_dummies(df, prefix=['city'])
>>> one_hot_df
   Rent  city_NYC  city_SF  city_Seattle
0  3999    0.0       1.0       0.0
1  4000    0.0       1.0       0.0
2  4001    0.0       1.0       0.0
3  3499    1.0       0.0       0.0
4  3500    1.0       0.0       0.0
5  3501    1.0       0.0       0.0
6  2499    0.0       0.0       1.0
7  2500    0.0       0.0       1.0
8  2501    0.0       0.0       1.0

>>> model = linear_regression.LinearRegression()
>>> model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
...           one_hot_df['Rent'])
>>> model.coef_
array([ 166.66666667,  666.66666667, -833.33333333])
>>> model.intercept_
3333.3333333333335

# Train a linear regression model on dummy code
# Specify the 'drop_first' flag to get dummy coding
>>> dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
>>> dummy_df
   Rent  city_SF  city_Seattle
0  3999    1.0       0.0
1  4000    1.0       0.0
2  4001    1.0       0.0
3  3499    0.0       0.0
4  3500    0.0       0.0
5  3501    0.0       0.0
6  2499    0.0       1.0
7  2500    0.0       1.0
8  2501    0.0       1.0

>>> model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
>>> model.coef_
array([  500., -1000.])
>>> model.intercept_
3500.0

With one-hot encoding, the intercept term represents the global mean of the target variable, Rent, and each of the linear coefficients represents how much that city’s average rent differs from the global mean.

With dummy coding, the bias coefficient represents the mean value of the response variable y for the reference category, which in the example is the city NYC. The coefficient for the ith feature is equal to the difference between the mean response value for the ith category and the mean of the reference category.

You can see pretty clearly in Table 5-4 how these methods produce very different coefficients for linear models.

Table 5-4. Linear regression learned coefficients
	x₁	x₂	x₃	b
One-hot encoding	166.67	666.67	–833.33	3333.33
Dummy coding	0	500	–1000	3500

Effect Coding

Yet another variant of categorical variable encoding is effect coding. Effect coding is very similar to dummy coding, with the difference that the reference category is now represented by the vector of all –1’s.

Table 5-5. Effect coding of a categorical variable representing three cities
	e₁	e₂
San Francisco	1	0
New York	0	1
Seattle	–1	–1

Effect coding is very similar to dummy coding, but results in linear regression models that are even simpler to interpret. Example 5-2 demonstrates what happens with effect coding as input. The intercept term represents the global mean of the target variable, and the individual coefficients indicate how much the means of the individual categories differ from the global mean. (This is called the main effect of the category or level, hence the name “effect coding.”) One-hot encoding actually came up with the same intercept and coefficients, but in that case there are linear coefficients for each city. In effect coding, no single feature represents the reference category, so the effect of the reference category needs to be separately computed as the negative sum of the coefficients of all other categories. (See “FAQ: What is effect coding?” on the UCLA IDRE website for more details.)

Example 5-2. Linear regression with effect coding

>>> effect_df = dummy_df.copy()
>>> effect_df.ix[3:5, ['city_SF', 'city_Seattle']] = -1.0
>>> effect_df
   Rent  city_SF  city_Seattle
0  3999    1.0       0.0
1  4000    1.0       0.0
2  4001    1.0       0.0
3  3499   -1.0      -1.0
4  3500   -1.0      -1.0
5  3501   -1.0      -1.0
6  2499    0.0       1.0
7  2500    0.0       1.0
8  2501    0.0       1.0

>>> model.fit(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])
>>> model.coef_
array([ 666.66666667, -833.33333333])
>>> model.intercept_
3333.3333333333335

Pros and Cons of Categorical Variable Encodings

One-hot, dummy, and effect coding are very similar to one another. They each have pros and cons. One-hot encoding is redundant, which allows for multiple valid models for the same problem. The nonuniqueness is sometimes problematic for interpretation, but the advantage is that each feature clearly corresponds to a category. Moreover, missing data can be encoded as the all-zeros vector, and the output should be the overall mean of the target variable.

Dummy coding and effect coding are not redundant. They give rise to unique and interpretable models. The downside of dummy coding is that it cannot easily handle missing data, since the all-zeros vector is already mapped to the reference category. It also encodes the effect of each category relative to the reference category, which may look strange.

Effect coding avoids this problem by using a different code for the reference category, but the vector of all –1’s is a dense vector, which is expensive for both storage and computation. For this reason, popular ML software packages such as Pandas and scikit-learn have opted for dummy coding or one-hot encoding instead of effect coding.

All three encoding techniques break down when the number of categories becomes very large. Different strategies are needed to handle extremely large categorical variables.

Dealing with Large Categorical Variables

Automated data collection on the internet can generate large categorical variables. This is common in applications such as targeted advertising and fraud detection.

In targeted advertising, the task is to match a user with a set of ads. Features include the user ID, the website domain for the ad, the search query, the current page, and all possible pairwise conjunctions of those features. (The query is a text string that can be chopped up and turned into the usual text features. However, queries are generally short and are often composed of phrases, so the best course of action in this case is usually to keep them intact, or pass them through a hash function to make storage and comparisons easier. We will discuss hashing in more detail later.) Each of these is a very large categorical variable. The challenge is to find a good feature representation that is memory efficient, yet produces accurate models that are fast to train.

Existing solutions can be categorized (haha) thus:

Do nothing fancy with the encoding. Use a simple model that is cheap to train. Feed one-hot encoding into a linear model (logistic regression or linear support vector machine) on lots of machines.
Compress the features. There are two choices:
1. Feature hashing, popular with linear models
2. Bin counting, popular with linear models as well as trees

Using the vanilla one-hot encoding is a valid option. For Microsoft’s search advertising engine, Graepel et al. (2010) report using such binary-valued features in a Bayesian probit regression model that can be trained online using simple updates. Meanwhile, other groups argue for the compression approach. Researchers from Yahoo! swear by feature hashing (Weinberger et al., 2009), though McMahan et al. (2013) experimented with feature hashing on Google’s advertising engine and did not find significant improvements. Yet other folks at Microsoft are taken with the idea of bin counting (Bilenko, 2015).

As we shall see, all of these ideas have pros and cons. We will first describe the solutions themselves, then discuss their trade-offs.

Feature Hashing

A hash function is a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m]. Since the input domain is potentially larger than the output range, multiple numbers may get mapped to the same output. This is called a collision. A uniform hash function ensures that roughly the same number of numbers are mapped into each of the m bins.

Visually, we can think of a hash function as a machine that intakes numbered balls (keys) and routes them to one of m bins. Balls with the same number will always get routed to the same bin (see Figure 5-1). This maintains the feature space while reducing the storage and processing time during machine learning training and evaluation cycles.

Hash functions can be constructed for any object that can be represented numerically (which is true for any data that can be stored on a computer): numbers, strings, complex structures, etc.

When there are very many features, storing the feature vector could take up a lot of space. Feature hashing compresses the original feature vector into an m-dimensional vector by applying a hash function to the feature ID, as shown in Example 5-3. For instance, if the original features were words in a document, then the hashed version would have a fixed vocabulary size of m, no matter how many unique words there are in the input.

Example 5-3. Feature hashing for word features

>>> def hash_features(word_list, m):
...     output = [0] * m
...     for word in word_list:
...         index = hash_fcn(word) % m
...         output[index] += 1
...     return output

Another variation of feature hashing adds a sign component, so that counts are either added to or subtracted from the hashed bin (see Example 5-4). Statistically speaking, this ensures that the inner products between hashed features are equal in expectation to those of the original features.

Example 5-4. Signed feature hashing

>>> def hash_features(word_list, m):
...     output = [0] * m
...     for word in word_list:
...         index = hash_fcn(word) % m
...         sign_bit = sign_hash(word) % 2
...         if (sign_bit == 0):
...             output[index] -= 1
...         else:
...             output[index] += 1
...     return output

The value of the inner product after hashing is within $upper O left-parenthesis StartFraction 1 Over StartRoot m EndRoot EndFraction right-parenthesis$ $upper O left-parenthesis StartFraction 1 Over StartRoot m EndRoot EndFraction right-parenthesis$ of the original inner product, so the size of the hash table m can be selected based on acceptable errors. In practice, picking the right m could take some trial and error.

Feature hashing can be used for models that involve the inner product of feature vectors and coefficients, such as linear models and kernel methods. It has been demonstrated to be successful in the task of spam filtering (Weinberger et al., 2009). In the case of targeted advertising, McMahan et al. (2013) report not being able to get the prediction errors down to an acceptable level unless m is on the order of billions, which does not constitute enough saving in space.

One downside to feature hashing is that the hashed features, being aggregates of original features, are no longer interpretable.

In Example 5-5, we use the Yelp reviews dataset to demonstrate storage and interpretability trade-offs using scikit-learn’s FeatureHasher.

Example 5-5. Feature hashing (a.k.a. “the hashing trick”)

>>> import pandas as pd
>>> import json

# Load the first 10,000 reviews
>>> f = open('yelp_academic_dataset_review.json')
>>> js = []
>>> for i in range(10000):
...     js.append(json.loads(f.readline()))
>>> f.close()
>>> review_df = pd.DataFrame(js)

# Define m as equal to the unique number of business_ids
>>> m = len(review_df.business_id.unique())
>>> m
528

>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=m, input_type='string')
>>> f = h.transform(review_df['business_id'])

# How does this affect feature interpretability?
>>> review_df['business_id'].unique().tolist()[0:5]
['vcNAWiLM4dR7D2nwwJ7nCA',
 'UsFtqoBl7naz8AVUBZMjQQ',
 'cE27W9VPgO88Qxe4ol6y_g',
 'HZdLhv6COCleJMo7nPl-RA',
 'mVHrayjG3uZ_RLHkLj-AMg']

>>> f.toarray()
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

# Not great. BUT, let's see the storage size of our features.
>>> from sys import getsizeof
>>> print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))
>>> print('Our hashed numpy array, in bytes: ', getsizeof(f))
Our pandas Series, in bytes:  790104
Our hashed numpy array, in bytes:  56

We can clearly see how using feature hashing will benefit us computationally, sacrificing immediate user interpretability. This is an easy trade-off to accept when progressing from data exploration and visualization into a machine learning pipeline for large datasets.

Bin Counting

Bin counting is one of the perennial rediscoveries in machine learning. It has been reinvented and used in a variety of applications, from ad click-through rate prediction to hardware branch prediction (Yeh and Patt, 1991; Lee et al., 1998; Chen et al., 2009; Li et al., 2010). Yet because it is a feature engineering technique and not a modeling or optimization method, there is no research paper on the topic. The most detailed description of the technique can be found in Misha Bilenko’s (2015) blog post “Big Learning Made Easy—with Counts!” and the associated slides.

The idea of bin counting is deviously simple: rather than using the value of the categorical variable as the feature, instead use the conditional probability of the target under that value. In other words, instead of encoding the identity of the categorical value, we compute the association statistics between that value and the target that we wish to predict. For those familiar with naive Bayes classifiers, this statistic should ring a bell, because it is the conditional probability of the class under the assumption that all features are independent. It is best illustrated with an example (see Table 5-6).

Table 5-6. Example of bin-counting features (reproduced from “Big Learning Made Easy—with Counts!” with permission)
User	Number of clicks	Number of nonclicks	Probability of click	QueryHash, AdDomain	Number of clicks	Number of nonclicks	Probability of click
Alice	5	120	0.0400	0x598fd4fe, foo.com	5,000	30,000	0.167
Bob	20	230	0.0800	0x50fa3cc0, bar.org	100	900	0.100
…				…			…
Joe	2	3	0.400	0x437a45e1, qux.net	6	18	0.250

Bin counting assumes that historical data is available for computing the statistics. Table 5-6 contains aggregated historical counts for each possible value of the categorical variables. Based on the number of times the user “Alice” has clicked on any ad and the number of times she has not clicked, we can calculate the probability of her clicking on any ad. Similarly, we can compute the probability of a click for any query–ad domain combination. At training time, every time we see “Alice,” we can use her probability of click as the input feature to the model. The same goes for QueryHash–AdDomain pairs like “0x437a45e1, qux.net.”

Suppose there were 10,000 users. One-hot encoding would generate a sparse vector of length 10,000, with a single 1 in the column that corresponds to the value of the current data point. Bin counting would encode all 10,000 binary columns as a single feature with a real value between 0 and 1.

We can include other features in addition to the historical click-through probability: the raw counts themselves (number of clicks and nonclicks), the log-odds ratio, or any other derivatives of probability. Our example here is for predicting ad click-through rates, but the technique readily applies to general binary classification. It can also be readily extended to multiclass classification using the usual techniques to extend binary classifiers to multiclass; i.e., via one-against-many odds ratios or other multiclass label encodings.

Odds Ratio and Log Odds Ratio for Bin Counting

The odds ratio is usually defined between two binary variables. It looks at their strength of association by asking the question, “How much more likely is it for Y to be true when X is true?” For instance, we might ask, “How much more likely is Alice to click on an ad than the general population?” Here, X is the binary variable “Alice is the current user,” and Y is the variable “click on ad or not.” The computation uses what’s called the two-way contingency table (basically, four numbers that correspond to the four possible combinations of X and Y), as seen in Table 5-7.

Table 5-7. Contingency table for ad click and user
	Click	Nonclick	Total
Alice	5	120	125
Not Alice	995	18,880	19,875
Total	1,000	19,000	20,000

Given an input variable X and a target variable Y, the odds ratio is defined as:

odds ratio = \frac{P (Y = 1 | X = 1) / P (Y = 0 | X = 1)}{P (Y = 1 | X = 0) / P (Y = 0 | X = 0)}

$odds ratio = \frac{P (Y = 1 | X = 1) / P (Y = 0 | X = 1)}{P (Y = 1 | X = 0) / P (Y = 0 | X = 0)}$

In our example, this translates as the ratio between “how much more likely is it that Alice clicks on an ad rather than does not click” and “how much more likely is it that other people click rather than not click.” The number, in this case, is:

odds ratio (user, ad click) = \frac{(5 / 125) / (120 / 125)}{(995 / 19,875) / (18,880 / 19,875)} = 0.7906

$odds ratio (user, ad click) = \frac{(5 / 125) / (120 / 125)}{(995 / 19,875) / (18,880 / 19,875)} = 0.7906$

More simply, we can just look at the numerator, which examines how much more likely it is that a single user (Alice) clicks on an ad versus not clicking. This is suitable for large categorical variables with many values, not just two:

odds ratio (Alice, ad click) = \frac{5 / 125}{120 / 125} = 0.04166

$odds ratio (Alice, ad click) = \frac{5 / 125}{120 / 125} = 0.04166$

Probability ratios can easily become very small or very large. (For instance, there will be users who almost never click on ads, and perhaps users who click on ads much more frequently than not.) The log transform again comes to our rescue. Another useful property of the logarithm is that it turns a division into a subtraction:

log-odds ratio (Alice, ad click) = l o g (\frac{5}{125}) - l o g (\frac{120}{125}) = - 3.178

$log-odds ratio (Alice, ad click) = l o g (\frac{5}{125}) - l o g (\frac{120}{125}) = - 3.178$

In short, bin counting converts a categorical variable into statistics about the value. It turns a large, sparse, binary representation of the categorical variable, such as that produced by one-hot encoding, into a very small, dense, real-valued numeric representation (Figure 5-2).

In terms of implementation, bin counting requires storing a map between each category and its associated counts. (The rest of the statistics can be derived on the fly from the raw counts.) Hence it requires O(k) space, where k is the number of unique values of the categorical variable.

To illustrate bin counting in practice, we’ll use data from a Kaggle competition hosted by Avazu. Here are some relevant statistics about the dataset:

There are 24 variables, including click, a binary click/no click counter, and device_id, which tracks which device an ad was displayed on.
The full dataset contains 40,428,967 observations, with 2,686,408 unique devices.

The aim of the Avazu competition was to predict click-through rate using ad data, but we will use the dataset to demonstrate how bin counting can greatly reduce the feature space for large amounts of streaming data (see Example 5-6).

Example 5-6. Bin-counting example

>>> import pandas as pd

# train_subset data is first 10K rows of 6+GB set
>>> df = pd.read_csv('data/train_subset.csv')

# How many unique features should we have after?
>>> len(df['device_id'].unique())
7201

# For each category, we want to calculate:
# Theta = [counts, p(click), p(no click), p(click)/p(no click)]

>>> def click_counting(x, bin_column):
...     clicks = pd.Series(x[x['click'] > 0][bin_column].value_counts(), 
...                        name='clicks')
...     no_clicks = pd.Series(x[x['click'] < 1][bin_column].value_counts(), 
...                           name='no_clicks')
    
...     counts = pd.DataFrame([clicks,no_clicks]).T.fillna('0')
...     counts['total_clicks'] = counts['clicks'].astype('int64') +
...                              counts['no_clicks'].astype('int64')
...     return counts

>>> def bin_counting(counts):
...     counts['N+'] = counts['clicks']
...                    .astype('int64')
...                    .divide(counts['total_clicks'].astype('int64'))
...     counts['N-'] = counts['no_clicks']
...                    .astype('int64')
...                    .divide(counts['total_clicks'].astype('int64'))
...     counts['log_N+'] = counts['N+'].divide(counts['N-'])
...     # If we wanted to only return bin-counting properties, 
...     # we would filter here
...     bin_counts = counts.filter(items= ['N+', 'N-', 'log_N+'])
...     return counts, bin_counts

# Bin counts example: device_id
>>> bin_column = 'device_id'
>>> device_clicks = click_counting(df.filter(items=[bin_column, 'click']), 
...                                bin_column)
>>> device_all, device_bin_counts = bin_counting(device_clicks)

# Check to make sure we have all the devices
>>> len(device_bin_counts)
7201

>>> device_all.sort_values(by = 'total_clicks', ascending=False).head(4)

         clicks  no_clicks  total  N+        N-        log_N+
a99f214a 15729   71206      86935  0.180928  0.819072  0.220894
c357dbff 33      134        167    0.197605  0.802395  0.246269
31da1bd0 0       62         62     0.000000  1.000000  0.000000
936e92fb 5       54         59     0.084746  0.915254  0.092593

What about rare categories?

Just like rare words, rare categories require special treatment. Think about a user who logs in once a year: there will be very little data to reliably estimate that user’s click-through rate for ads. Moreover, rare categories waste space in the counts table.

One way to deal with this is through back-off, a simple technique that accumulates the counts of all rare categories in a special bin (see Figure 5-3). If the count is greater than a certain threshold, then the category gets its own count statistics. Otherwise, we use the statistics from the back-off bin. This essentially reverts the statistics for a single rare category to the statistics computed on all rare categories. When using the back-off method, it helps to also add a binary indicator for whether or not the statistics come from the back-off bin.

There is another way to deal with this problem, called the count-min sketch (Cormode and Muthukrishnan, 2005). In this method, all the categories, rare or frequent alike, are mapped through multiple hash functions with an output range, m, much smaller than the number of categories, k. When retrieving a statistic, recompute all the hashes of the category and return the smallest statistic. Having multiple hash functions mitigates the probability of collision within a single hash function. The scheme works because the number of hash functions times m, the size of the hash table, can be made smaller than k, the number of categories, and still retain low overall collision probability.

Figure 5-4 illustrates. Each item i is mapped to one cell in each row of the array of counts. When an update of c_t to item i_t arrives, c_t is added to each of these cells, hashed using functions h₁…h_d.

Guarding against data leakage

Since bin counting relies on historical data to generate the necessary statistics, it requires waiting through a data collection period, incurring a slight delay in the learning pipeline. Also, when the data distribution changes, the counts need to be updated. The faster the data changes, the more frequently the counts need to be recomputed. This is particularly important for applications like targeted advertising, where user preferences and popular queries change very quickly, and lack of adaptation to the current distribution could mean huge losses for the advertising platform.

One might ask, why not use the same dataset to compute the relevant statistics and train the model? The idea seems innocent enough. The big problem here is that the statistics involve the target variable, which is what the model tries to predict. Using the output to compute the input features leads to a pernicious problem known as leakage. In short, leakage means that information is revealed to the model that gives it an unrealistic advantage to make better predictions. This could happen when test data is leaked into the training set, or when data from the future is leaked to the past. Any time that a model is given information that it shouldn’t have access to when it is making predictions in real time in production, there is leakage. Kaggle’s wiki gives more examples of leakage and why it is bad for machine learning applications.

If the bin-counting procedure used the current data point’s label to compute part of the input statistic, that would constitute direct leakage. One way to prevent that is by instituting strict separation between count collection (for computing bin-count statistics) and training, as illustrated in Figure 5-5—i.e., use an earlier batch of data points for counting, use the current data points for training (mapping categorical variables to historical statistics we just collected), and use future data points for testing. This fixes the problem of leakage, but introduces the aforementioned delay (the input statistics and therefore the model will trail behind current data).

It turns out that there is another solution, based on differential privacy. A statistic is approximately leakage-proof if its distribution stays roughly the same with or without any one data point. In practice, adding a small random noise with distribution Laplace(0,1) is sufficient to cover up any potential leakage from a single data point. This idea can be combined with leaving-one-out counting to formulate statistics on current data (Zhang, 2015).

Counts without bounds

If the statistics are updated continuously given more and more historical data, the raw counts will grow without bounds. This could be a problem for the model. A trained model “knows” the input data up to the observed scale. A trained decision tree might say, “When x is greater than 3, predict 1.” A trained linear model might say, “Multiply x by 0.7 and see if the result is greater than the global average.” These might be the correct decisions when x lies between 0 and 5. But what happens beyond that? No one knows.

When the input counts increase, the model will need to be retrained to adapt to the current scale. If the counts accumulate rather slowly, then the effective scale won’t change too fast, and the model will not need to be retrained too frequently. But when counts increment very quickly, frequent retraining will be a nuisance.

For this reason, it is often better to use normalized counts that are guaranteed to be bounded in a known interval. For instance, the estimated click-through probability is bounded between [0, 1]. Another method is to take the log transform, which imposes a strict bound, but the rate of increase will be very slow when the count is very large.

Neither method will guard against shifting input distributions (e.g., last year’s Barbie dolls are now out of style and people will no longer click on those ads). The model will need to be retrained to accommodate these more fundamental changes in input data distribution, or the whole pipeline will need to move to an online learning setting where the model is continuously adapting to the input.

Summary

Each of the approaches detailed in this chapter has its pros and cons. Here is a rundown of the trade-offs.

Plain one-hot encoding
Space requirement	O(n) using the sparse vector format, where n is the number of data points
Computation requirement	O(nk) under a linear model, where k is the number of categories
Pros	Easiest to implement Potentially most accurate Feasible for online learning
Cons	Computationally inefficient Does not adapt to growing categories Not feasible for anything other than linear models Requires large-scale distributed optimization with truly large datasets

Feature hashing
Space requirement	O(n) using the sparse matrix format, where n is the number of data points
Computation requirement	O(nm) under a linear or kernel model, where m is the number of hash bins
Pros	Easy to implement Makes model training cheaper Easily adaptable to new categories Easily handles rare categories Feasible for online learning
Cons	Only suitable for linear or kernelized models Hashed features not interpretable Mixed reports of accuracy

Bin-counting
Space requirement	O(n+k) for small, dense representation of each data point, plus the count statistics that must be kept for each category
Computation requirement	O(n) for linear models; also usable for nonlinear models such as trees
Pros	Smallest computational burden at training time Enables tree-based models Relatively easy to adapt to new categories Handles rare categories with back-off or count-min sketch Interpretable
Cons	Requires historical data Delayed updates required, not completely suitable for online learning Higher potential for leakage

As we can see, none of the methods are perfect. Which one to use depends on the desired model. Linear models are cheaper to train and therefore can handle noncompressed representations such as one-hot encoding. Tree-based models, on the other hand, need to do repeated searches over all features for the right split, and are thus limited to small representations such as bin counting. Feature hashing sits in between those two extremes, but with mixed reports on the resulting accuracy.

Bibliography

Agarwal, Alekh, Oliveier Chapelle, Miroslav Dudík, and John Langford. “A Reliable Effective Terascale Linear Learning System.” Journal of Machine Learning Research 15 (2015): 1111−1133.

Bilenko, Misha. “Big Learning Made Easy—with Counts!” Cortana Intelligence and Machine Learning Blog, February 17, 2015. https://blogs.technet.microsoft.com/machinelearning/2015/02/17/big-learning-made-easy-with-counts/.

Chen, Ye, Dmitry Pavlov, and John F. Canny. “Large-Scale Behavioral Targeting.” Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009): 209–218.

Cormode, Graham, and S. Muthukrishnan. “An Improved Data Stream Summary: The Count-Min Sketch and Its Applications.” Algorithms 55 (2005): 29–38.

Graepel, Thore, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. “Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine.” Proceedings of the 27th International Conference on Machine Learning (2010): 13–20.

He, Xinran, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. “Practical Lessons from Predicting Clicks on Ads at Facebook.” Proceedings of the 8th International Workshop on Data Mining for Online Advertising (2014): 1–9.

Lee, Wenke, Salvatore J. Stolfo, and Kui W. Mok. 1998. “Mining Audit Data to Build Intrusion Detection Models.” Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1998): 66–72.

Li, Wei, Xuerui Wang, Ruofei Zhang, Ying Cui, Jianchang Mao, and Rong Jin. “Exploitation and Exploration in a Performance Based Contextual Advertising System.” Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010): 27–36.

McMahan, H. Brendan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. “Ad Click Prediction: A View from the Trenches.” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013): 1222–1230.

Weinberger, Kilian, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. 2009. “Feature Hashing for Large Scale Multitask Learning.” Proceedings of the 26th International Conference on Machine Learning (2009): 1113–1120.

Yeh, Tse-Yu, and Yale N. Patt. “Two-Level Adaptive Training Branch Prediction.” Proceedings of the 24th Annual International Symposium on Microarchitecture (1991): 51–61.

Zhang, Owen. 2015. “Tips for data science competitions.” SlideShare presentation. Retrieved from http://bit.ly/2DjuhBD.

¹ In standard statistics literature, the technical term for the categories is levels. A categorical variable with two distinct categories has two levels. But there are a number of other things in statistics that are also called levels, so we do not use that terminology here; instead we use the more colloquial and unambiguous term “categories.”

² Curious readers might wonder why one is called coding and the other encoding. This is largely convention. My guess is that one-hot encoding first became popular in electrical engineering, where information is encoded and decoded all the time. Dummy coding and effect coding, on the other hand, were invented in the statistics community. Somehow the “en” didn’t make its way over the academic divide.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for
5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens

Chapter 5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens

Encoding Categorical Variables

One-Hot Encoding

Dummy Coding

Example 5-1. Linear regression on a categorical variable using one-hot and dummy codes

Effect Coding

Example 5-2. Linear regression with effect coding

Pros and Cons of Categorical Variable Encodings

Dealing with Large Categorical Variables

Feature Hashing

Figure 5-1. Hash functions map keys to bins

Example 5-3. Feature hashing for word features

Example 5-4. Signed feature hashing

Example 5-5. Feature hashing (a.k.a. “the hashing trick”)

Bin Counting

Figure 5-2. An illustration of one-hot encoding versus bin-counting statistics for categorical variables

Example 5-6. Bin-counting example

What about rare categories?

Figure 5-3. If a rare category gains counts, it can move above the threshold for the back-off bin, using its own count statistics for modeling

Figure 5-4. The count-min sketch

Guarding against data leakage

Figure 5-5. Using time windows can prevent data leakage during bin counting

Counts without bounds

Summary

Bibliography

Table of Contents for 5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens

Create new playlist

Sign In

Sign Up

Chapter 5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens

Encoding Categorical Variables

One-Hot Encoding

Dummy Coding

Example 5-1. Linear regression on a categorical variable using one-hot and dummy codes

Effect Coding

Example 5-2. Linear regression with effect coding

Pros and Cons of Categorical Variable Encodings

Dealing with Large Categorical Variables

Feature Hashing

Figure 5-1. Hash functions map keys to bins

Example 5-3. Feature hashing for word features

Example 5-4. Signed feature hashing

Example 5-5. Feature hashing (a.k.a. “the hashing trick”)

Bin Counting

Figure 5-2. An illustration of one-hot encoding versus bin-counting statistics for categorical variables

Example 5-6. Bin-counting example

What about rare categories?

Figure 5-3. If a rare category gains counts, it can move above the threshold for the back-off bin, using its own count statistics for modeling

Figure 5-4. The count-min sketch

Guarding against data leakage

Figure 5-5. Using time windows can prevent data leakage during bin counting

Counts without bounds

Summary

Bibliography

Table of Contents for
5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens