A categorical variable, as the name suggests, is used to represent categories or labels. For instance, a categorical variable could represent major cities in the world, the four seasons in a year, or the industry (oil, travel, technology) of a company. The number of category values is always finite in a real-world dataset. The values may be represented numerically. However, unlike other numeric variables, the values of a categorical variable cannot be ordered with respect to one another. (Oil is neither greater than nor less than travel as an industry type.) They are called nonordinal.
A simple question can serve as a litmus test for whether something should be a categorical variable: “Does it matter how different two values are, or only that they are different?” A stock price of $500 is five times as high as a price of $100, and that difference is meaningful, so stock price should be represented by a continuous numeric variable. The industry of the company (oil, travel, tech, etc.), on the other hand, should probably be categorical.
Large categorical variables are particularly common in transactional records. For instance, many web services track users using an ID, which is a categorical variable with hundreds to hundreds of millions of values, depending on the number of unique users of the service. The IP address of an internet transaction is another example of a large categorical variable. They are categorical variables because, even though user IDs and IP addresses are numeric, their magnitude is usually not relevant to the task at hand. For instance, the IP address might be relevant when doing fraud detection on individual transactions—some IP addresses or subnets may generate more fraudulent transactions than others. But a subnet of 164.203.x.x is not inherently more fraudulent than 164.202.x.x; the numeric value of the subnet does not matter.
The vocabulary of a document corpus can be interpreted as a large categorical variable, with the categories being unique words. It can be computationally expensive to represent so many distinct categories. If a category (e.g., word) appears multiple times in a data point (document), then we can represent it as a count, and represent all of the categories through their count statistics. This is called bin counting. We start this discussion with common representations of categorical variables, and eventually meander our way to a discussion of bin counting for large categorical variables, which are very common in modern datasets.
The categories of a categorical variable are usually not numeric.1 For example, eye color can be “black,” “blue,” “brown,” etc. Thus, an encoding method is needed to turn these nonnumeric categories into numbers. It is tempting to simply assign an integer, say from 1 to k, to each of k possible categories—but the resulting values would be orderable against each other, which should not be permissible for categories. So, let’s look at some alternatives.
A better method is to use a group of bits. Each bit represents a possible category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on.” This is called one-hot encoding, and it is implemented in scikit-learn as sklearn.preprocessing.OneHotEncoder. Each of the bits is a feature. Thus, a categorical variable with k possible categories is encoded as a feature vector of length k. Table 5-1 shows an example.
| City | e1 | e2 | e3 |
|---|---|---|---|
| San Francisco | 1 | 0 | 0 |
| New York | 0 | 1 | 0 |
| Seattle | 0 | 0 | 1 |
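As a minimal sketch of the scikit-learn class mentioned above (not from the book's examples, and assuming a scikit-learn version recent enough to accept string categories directly), the following reproduces the encoding of Table 5-1, with the column order determined by the sorted category names:

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical column, one row per observation
cities = [['San Francisco'], ['New York'], ['Seattle']]

encoder = OneHotEncoder()                 # one bit per category
one_hot = encoder.fit_transform(cities)   # sparse matrix of shape (3, 3)

print(encoder.categories_)   # discovered categories, in sorted order
print(one_hot.toarray())     # dense 0/1 matrix, one column per category
```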
One-hot encoding is very simple to understand, but it uses one more bit than is strictly necessary. If we see that k–1 of the bits are 0, then the last bit must be 1, because the variable must take on one of the k values. Mathematically, one can write this constraint as “the sum of all bits must be equal to 1”:

$e_1 + e_2 + \cdots + e_k = 1$

Thus, we have a linear dependency on our hands. Linearly dependent features, as we discovered in Chapter 4, are slightly annoying because they mean that the trained linear models will not be unique. Different linear combinations of the features can make the same predictions, so we would need to jump through extra hoops to understand the effect of a feature on the prediction.
The problem with one-hot encoding is that it allows for k degrees of freedom, while the variable itself needs only k–1. Dummy coding2 removes the extra degree of freedom by using only k–1 features in the representation (see Table 5-2). One feature is thrown under the bus and represented by the vector of all zeros. This is known as the reference category. Dummy coding and one-hot encoding are both implemented in Pandas as pandas.get_dummies.
| City | e1 | e2 |
|---|---|---|
| San Francisco | 1 | 0 |
| New York | 0 | 1 |
| Seattle | 0 | 0 |
The outcome of modeling with dummy coding is more interpretable than with one-hot encoding. This is easy to see in a simple linear regression problem. Suppose we have some data about apartment rental prices in three cities: San Francisco, New York, and Seattle (see Table 5-3).
| | City | Rent |
|---|---|---|
| 0 | SF | 3999 |
| 1 | SF | 4000 |
| 2 | SF | 4001 |
| 3 | NYC | 3499 |
| 4 | NYC | 3500 |
| 5 | NYC | 3501 |
| 6 | Seattle | 2499 |
| 7 | Seattle | 2500 |
| 8 | Seattle | 2501 |
We can train a linear regressor to predict rental price based solely on the identity of the city (see Example 5-1).
The linear regression model can be written as:

$y = w_1 x_1 + w_2 x_2 + w_3 x_3$

It is customary to fit an extra constant term called the intercept, so that y can be a nonzero value when the x’s are zeros:

$y = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$
```python
>>> import pandas as pd
>>> from sklearn import linear_model

# Define a toy dataset of apartment rental prices in
# New York, San Francisco, and Seattle
>>> df = pd.DataFrame({
...     'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
...              'Seattle', 'Seattle', 'Seattle'],
...     'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]
... })
>>> df['Rent'].mean()
3333.3333333333335

# Convert the categorical variables in the DataFrame to one-hot encoding
# and fit a linear regression model
>>> one_hot_df = pd.get_dummies(df, prefix=['city'])
>>> one_hot_df
   Rent  city_NYC  city_SF  city_Seattle
0  3999       0.0      1.0           0.0
1  4000       0.0      1.0           0.0
2  4001       0.0      1.0           0.0
3  3499       1.0      0.0           0.0
4  3500       1.0      0.0           0.0
5  3501       1.0      0.0           0.0
6  2499       0.0      0.0           1.0
7  2500       0.0      0.0           1.0
8  2501       0.0      0.0           1.0

>>> model = linear_model.LinearRegression()
>>> model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
...           one_hot_df['Rent'])
>>> model.coef_
array([ 166.66666667,  666.66666667, -833.33333333])
>>> model.intercept_
3333.3333333333335

# Train a linear regression model on dummy code
# Specify the 'drop_first' flag to get dummy coding
>>> dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
>>> dummy_df
   Rent  city_SF  city_Seattle
0  3999      1.0           0.0
1  4000      1.0           0.0
2  4001      1.0           0.0
3  3499      0.0           0.0
4  3500      0.0           0.0
5  3501      0.0           0.0
6  2499      0.0           1.0
7  2500      0.0           1.0
8  2501      0.0           1.0

>>> model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
>>> model.coef_
array([  500., -1000.])
>>> model.intercept_
3500.0
```
With one-hot encoding, the intercept term represents the global mean of the target variable, Rent, and each of the linear coefficients represents how much that city’s average rent differs from the global mean.
With dummy coding, the bias coefficient represents the mean value of the response variable y for the reference category, which in the example is the city NYC. The coefficient for the ith feature is equal to the difference between the mean response value for the ith category and the mean of the reference category.
You can see pretty clearly in Table 5-4 how these methods produce very different coefficients for linear models.
| Encoding | x1 | x2 | x3 | b |
|---|---|---|---|---|
| One-hot encoding | 166.67 | 666.67 | –833.33 | 3333.33 |
| Dummy coding | 0 | 500 | –1000 | 3500 |
Yet another variant of categorical variable encoding is effect coding. Effect coding is very similar to dummy coding, with the difference that the reference category is now represented by the vector of all –1’s.
| City | e1 | e2 |
|---|---|---|
| San Francisco | 1 | 0 |
| New York | 0 | 1 |
| Seattle | –1 | –1 |
Effect coding is very similar to dummy coding, but results in linear regression models that are even simpler to interpret. Example 5-2 demonstrates what happens with effect coding as input. The intercept term represents the global mean of the target variable, and the individual coefficients indicate how much the means of the individual categories differ from the global mean. (This is called the main effect of the category or level, hence the name “effect coding.”) One-hot encoding actually came up with the same intercept and coefficients, but in that case there are linear coefficients for each city. In effect coding, no single feature represents the reference category, so the effect of the reference category needs to be separately computed as the negative sum of the coefficients of all other categories. (See “FAQ: What is effect coding?” on the UCLA IDRE website for more details.)
```python
>>> effect_df = dummy_df.copy()
# .loc is label-based and inclusive, so rows 3 through 5 (the NYC rows)
# are set to -1 in both dummy columns
>>> effect_df.loc[3:5, ['city_SF', 'city_Seattle']] = -1.0
>>> effect_df
   Rent  city_SF  city_Seattle
0  3999      1.0           0.0
1  4000      1.0           0.0
2  4001      1.0           0.0
3  3499     -1.0          -1.0
4  3500     -1.0          -1.0
5  3501     -1.0          -1.0
6  2499      0.0           1.0
7  2500      0.0           1.0
8  2501      0.0           1.0

>>> model.fit(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])
>>> model.coef_
array([ 666.66666667, -833.33333333])
>>> model.intercept_
3333.3333333333335
```
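As a quick check of the interpretation above, the effect of the reference category (NYC) is the negative sum of the other two coefficients, which matches the city_NYC coefficient from the one-hot model:

```python
# Effect of the reference category (NYC) = -(sum of the other coefficients)
>>> -(model.coef_[0] + model.coef_[1])   # about 166.67, the one-hot city_NYC coefficient
```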
One-hot, dummy, and effect coding are very similar to one another. They each have pros and cons. One-hot encoding is redundant, which allows for multiple valid models for the same problem. The nonuniqueness is sometimes problematic for interpretation, but the advantage is that each feature clearly corresponds to a category. Moreover, missing data can be encoded as the all-zeros vector, and the output should be the overall mean of the target variable.
Dummy coding and effect coding are not redundant. They give rise to unique and interpretable models. The downside of dummy coding is that it cannot easily handle missing data, since the all-zeros vector is already mapped to the reference category. It also encodes the effect of each category relative to the reference category, which may look strange.
Effect coding avoids this problem by using a different code for the reference category, but the vector of all –1’s is a dense vector, which is expensive for both storage and computation. For this reason, popular ML software packages such as Pandas and scikit-learn have opted for dummy coding or one-hot encoding instead of effect coding.
All three encoding techniques break down when the number of categories becomes very large. Different strategies are needed to handle extremely large categorical variables.
Automated data collection on the internet can generate large categorical variables. This is common in applications such as targeted advertising and fraud detection.
In targeted advertising, the task is to match a user with a set of ads. Features include the user ID, the website domain for the ad, the search query, the current page, and all possible pairwise conjunctions of those features. (The query is a text string that can be chopped up and turned into the usual text features. However, queries are generally short and are often composed of phrases, so the best course of action in this case is usually to keep them intact, or pass them through a hash function to make storage and comparisons easier. We will discuss hashing in more detail later.) Each of these is a very large categorical variable. The challenge is to find a good feature representation that is memory efficient, yet produces accurate models that are fast to train.
Existing solutions can be categorized (haha) thus: either do nothing fancy and use the simple encoding as is, or compress the features, whether through feature hashing or through bin counting.
Using the vanilla one-hot encoding is a valid option. For Microsoft’s search advertising engine, Graepel et al. (2010) report using such binary-valued features in a Bayesian probit regression model that can be trained online using simple updates. Meanwhile, other groups argue for the compression approach. Researchers from Yahoo! swear by feature hashing (Weinberger et al., 2009), though McMahan et al. (2013) experimented with feature hashing on Google’s advertising engine and did not find significant improvements. Yet other folks at Microsoft are taken with the idea of bin counting (Bilenko, 2015).
As we shall see, all of these ideas have pros and cons. We will first describe the solutions themselves, then discuss their trade-offs.
A hash function is a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m]. Since the input domain is potentially larger than the output range, multiple numbers may get mapped to the same output. This is called a collision. A uniform hash function ensures that roughly the same number of numbers are mapped into each of the m bins.
Visually, we can think of a hash function as a machine that intakes numbered balls (keys) and routes them to one of m bins. Balls with the same number will always get routed to the same bin (see Figure 5-1). This maintains the feature space while reducing the storage and processing time during machine learning training and evaluation cycles.
Hash functions can be constructed for any object that can be represented numerically (which is true for any data that can be stored on a computer): numbers, strings, complex structures, etc.
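As a toy illustration (not from the book’s examples), the following maps string keys into m bins using a hash from Python’s hashlib, so that the same key always lands in the same bin across runs:

```python
import hashlib

def bucket(key, m):
    """Map any string key deterministically to one of m bins."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % m

for word in ['oil', 'travel', 'tech', 'oil']:
    print(word, '->', bucket(word, m=8))   # repeated keys get the same bin
```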
When there are very many features, storing the feature vector could take up a lot of space. Feature hashing compresses the original feature vector into an m-dimensional vector by applying a hash function to the feature ID, as shown in Example 5-3. For instance, if the original features were words in a document, then the hashed version would have a fixed vocabulary size of m, no matter how many unique words there are in the input.
```python
>>> def hash_features(word_list, m):
...     output = [0] * m
...     for word in word_list:
...         # any deterministic hash function will do; Python's built-in
...         # hash() is used here as a stand-in
...         index = hash(word) % m
...         output[index] += 1
...     return output
```
Another variation of feature hashing adds a sign component, so that counts are either added to or subtracted from the hashed bin (see Example 5-4). Statistically speaking, this ensures that the inner products between hashed features are equal in expectation to those of the original features.
```python
>>> def hash_features(word_list, m):
...     output = [0] * m
...     for word in word_list:
...         # two independent hash functions: one for the bin index and
...         # one for the sign (built-in hash() used as a stand-in)
...         index = hash(word) % m
...         sign_bit = hash('sign_' + word) % 2
...         if sign_bit == 0:
...             output[index] -= 1
...         else:
...             output[index] += 1
...     return output
```
The value of the inner product after hashing is within $O(1/\sqrt{m})$ of the original inner product, so the size of the hash table m can be selected based on acceptable errors. In practice, picking the right m could take some trial and error.
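As a rough empirical check of this property, here is a small sketch (not the book’s code, and assuming any pair of independent hash functions will do) that compares the inner product of two bag-of-words vectors before and after signed hashing:

```python
import hashlib
import numpy as np

def signed_hash_vector(words, m):
    """Hash a list of words into a signed m-dimensional count vector."""
    vec = np.zeros(m)
    for w in words:
        h = int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16)
        index = h % m                      # bin index
        sign = 1 if (h >> 64) % 2 else -1  # independent bits decide the sign
        vec[index] += sign
    return vec

doc_a = 'the quick brown fox jumps over the lazy dog'.split()
doc_b = 'the lazy dog sleeps while the quick fox runs'.split()

# Exact bag-of-words inner product over the shared vocabulary
vocab = sorted(set(doc_a) | set(doc_b))
bow = lambda doc: np.array([doc.count(w) for w in vocab])
print(bow(doc_a) @ bow(doc_b))

# Hashed inner product; it gets closer to the exact value as m grows
ha, hb = signed_hash_vector(doc_a, 256), signed_hash_vector(doc_b, 256)
print(ha @ hb)
```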
Feature hashing can be used for models that involve the inner product of feature vectors and coefficients, such as linear models and kernel methods. It has been demonstrated to be successful in the task of spam filtering (Weinberger et al., 2009). In the case of targeted advertising, McMahan et al. (2013) report not being able to get the prediction errors down to an acceptable level unless m is on the order of billions, which does not constitute enough saving in space.
One downside to feature hashing is that the hashed features, being aggregates of original features, are no longer interpretable.
In Example 5-5, we use the Yelp reviews dataset to demonstrate storage and interpretability trade-offs using scikit-learn’s FeatureHasher.
```python
>>> import pandas as pd
>>> import json

# Load the first 10,000 reviews
>>> f = open('yelp_academic_dataset_review.json')
>>> js = []
>>> for i in range(10000):
...     js.append(json.loads(f.readline()))
>>> f.close()
>>> review_df = pd.DataFrame(js)

# Define m as equal to the unique number of business_ids
>>> m = len(review_df.business_id.unique())
>>> m
528

>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=m, input_type='string')
>>> f = h.transform(review_df['business_id'])

# How does this affect feature interpretability?
>>> review_df['business_id'].unique().tolist()[0:5]
['vcNAWiLM4dR7D2nwwJ7nCA',
 'UsFtqoBl7naz8AVUBZMjQQ',
 'cE27W9VPgO88Qxe4ol6y_g',
 'HZdLhv6COCleJMo7nPl-RA',
 'mVHrayjG3uZ_RLHkLj-AMg']
>>> f.toarray()
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

# Not great. BUT, let's see the storage size of our features.
>>> from sys import getsizeof
>>> print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))
Our pandas Series, in bytes:  790104
>>> print('Our hashed numpy array, in bytes: ', getsizeof(f))
Our hashed numpy array, in bytes:  56
```
We can clearly see how feature hashing benefits us computationally, at the cost of immediate interpretability of the features. This is an easy trade-off to accept when progressing from data exploration and visualization into a machine learning pipeline for large datasets.
Bin counting is one of the perennial rediscoveries in machine learning. It has been reinvented and used in a variety of applications, from ad click-through rate prediction to hardware branch prediction (Yeh and Patt, 1991; Lee et al., 1998; Chen et al., 2009; Li et al., 2010). Yet because it is a feature engineering technique and not a modeling or optimization method, there is no research paper on the topic. The most detailed description of the technique can be found in Misha Bilenko’s (2015) blog post “Big Learning Made Easy—with Counts!” and the associated slides.
The idea of bin counting is deviously simple: rather than using the value of the categorical variable as the feature, instead use the conditional probability of the target under that value. In other words, instead of encoding the identity of the categorical value, we compute the association statistics between that value and the target that we wish to predict. For those familiar with naive Bayes classifiers, this statistic should ring a bell, because it is the conditional probability of the class under the assumption that all features are independent. It is best illustrated with an example (see Table 5-6).
| User | Number of clicks | Number of nonclicks | Probability of click | QueryHash, AdDomain | Number of clicks | Number of nonclicks | Probability of click |
|---|---|---|---|---|---|---|---|
| Alice | 5 | 120 | 0.0400 | 0x598fd4fe, foo.com | 5,000 | 30,000 | 0.167 |
| Bob | 20 | 230 | 0.0800 | 0x50fa3cc0, bar.org | 100 | 900 | 0.100 |
| … | … | … | … | … | … | … | … |
| Joe | 2 | 3 | 0.400 | 0x437a45e1, qux.net | 6 | 18 | 0.250 |
Bin counting assumes that historical data is available for computing the statistics. Table 5-6 contains aggregated historical counts for each possible value of the categorical variables. Based on the number of times the user “Alice” has clicked on any ad and the number of times she has not clicked, we can calculate the probability of her clicking on any ad. Similarly, we can compute the probability of a click for any query–ad domain combination. At training time, every time we see “Alice,” we can use her probability of click as the input feature to the model. The same goes for QueryHash–AdDomain pairs like “0x437a45e1, qux.net.”
Suppose there were 10,000 users. One-hot encoding would generate a sparse vector of length 10,000, with a single 1 in the column that corresponds to the value of the current data point. Bin counting would encode all 10,000 binary columns as a single feature with a real value between 0 and 1.
We can include other features in addition to the historical click-through probability: the raw counts themselves (number of clicks and nonclicks), the log-odds ratio, or any other derivatives of probability. Our example here is for predicting ad click-through rates, but the technique readily applies to general binary classification. It can also be readily extended to multiclass classification using the usual techniques to extend binary classifiers to multiclass; i.e., via one-against-many odds ratios or other multiclass label encodings.
In short, bin counting converts a categorical variable into statistics about the value. It turns a large, sparse, binary representation of the categorical variable, such as that produced by one-hot encoding, into a very small, dense, real-valued numeric representation (Figure 5-2).
In terms of implementation, bin counting requires storing a map between each category and its associated counts. (The rest of the statistics can be derived on the fly from the raw counts.) Hence it requires O(k) space, where k is the number of unique values of the categorical variable.
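As a minimal sketch of such a map (hypothetical helper names, not the book’s implementation), a plain dictionary keyed by category is enough to store the raw counts and derive the click probability on demand:

```python
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])   # category -> [clicks, nonclicks]

def update(category, clicked):
    """Record one observation for a category."""
    counts[category][0 if clicked else 1] += 1

def click_probability(category):
    """Derive the bin-counting feature from the raw counts."""
    clicks, nonclicks = counts[category]
    total = clicks + nonclicks
    return clicks / total if total else 0.0

# Historical data: (user, clicked)
for user, clicked in [('Alice', True), ('Alice', False), ('Alice', False),
                      ('Bob', True), ('Bob', True), ('Bob', False)]:
    update(user, clicked)

print(click_probability('Alice'))   # 1/3
print(click_probability('Bob'))     # 2/3
```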
To illustrate bin counting in practice, we’ll use data from a Kaggle competition hosted by Avazu. Among the dataset’s fields are click, a binary click/no-click counter, and device_id, which tracks which device an ad was displayed on. The aim of the Avazu competition was to predict click-through rate using ad data, but we will use the dataset to demonstrate how bin counting can greatly reduce the feature space for large amounts of streaming data (see Example 5-6).
```python
>>> import pandas as pd

# train_subset data is first 10K rows of 6+GB set
>>> df = pd.read_csv('data/train_subset.csv')

# How many unique features should we have after?
>>> len(df['device_id'].unique())
7201

# For each category, we want to calculate:
# Theta = [counts, p(click), p(no click), p(click)/p(no click)]
>>> def click_counting(x, bin_column):
...     clicks = pd.Series(x[x['click'] > 0][bin_column].value_counts(),
...                        name='clicks')
...     no_clicks = pd.Series(x[x['click'] < 1][bin_column].value_counts(),
...                           name='no_clicks')
...     counts = pd.DataFrame([clicks, no_clicks]).T.fillna('0')
...     counts['total_clicks'] = counts['clicks'].astype('int64') + \
...                              counts['no_clicks'].astype('int64')
...     return counts

>>> def bin_counting(counts):
...     counts['N+'] = counts['clicks'] \
...                        .astype('int64') \
...                        .divide(counts['total_clicks'].astype('int64'))
...     counts['N-'] = counts['no_clicks'] \
...                        .astype('int64') \
...                        .divide(counts['total_clicks'].astype('int64'))
...     counts['log_N+'] = counts['N+'].divide(counts['N-'])
...     # If we wanted to only return bin-counting properties,
...     # we would filter here
...     bin_counts = counts.filter(items=['N+', 'N-', 'log_N+'])
...     return counts, bin_counts

# Bin counts example: device_id
>>> bin_column = 'device_id'
>>> device_clicks = click_counting(df.filter(items=[bin_column, 'click']),
...                                bin_column)
>>> device_all, device_bin_counts = bin_counting(device_clicks)

# Check to make sure we have all the devices
>>> len(device_bin_counts)
7201

>>> device_all.sort_values(by='total_clicks', ascending=False).head(4)
          clicks  no_clicks  total        N+        N-    log_N+
a99f214a   15729      71206  86935  0.180928  0.819072  0.220894
c357dbff      33        134    167  0.197605  0.802395  0.246269
31da1bd0       0         62     62  0.000000  1.000000  0.000000
936e92fb       5         54     59  0.084746  0.915254  0.092593
```
Just like rare words, rare categories require special treatment. Think about a user who logs in once a year: there will be very little data to reliably estimate that user’s click-through rate for ads. Moreover, rare categories waste space in the counts table.
One way to deal with this is through back-off, a simple technique that accumulates the counts of all rare categories in a special bin (see Figure 5-3). If the count is greater than a certain threshold, then the category gets its own count statistics. Otherwise, we use the statistics from the back-off bin. This essentially reverts the statistics for a single rare category to the statistics computed on all rare categories. When using the back-off method, it helps to also add a binary indicator for whether or not the statistics come from the back-off bin.
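As a rough sketch of the back-off idea (assuming a dictionary counts that maps each category to a (clicks, nonclicks) pair, and a hypothetical threshold), rare categories share one pooled bin and are flagged with an indicator:

```python
def backed_off_stats(counts, category, threshold=20):
    """Return (p_click, used_backoff_bin) for one category."""
    clicks, nonclicks = counts.get(category, (0, 0))
    if clicks + nonclicks >= threshold:
        # Frequent category: use its own statistics
        return clicks / (clicks + nonclicks), 0
    # Rare category: pool every rare category into a single back-off bin
    rare = [(c, n) for c, n in counts.values() if c + n < threshold]
    rare_clicks = sum(c for c, n in rare)
    rare_total = sum(c + n for c, n in rare)
    return (rare_clicks / rare_total if rare_total else 0.0), 1
```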
There is another way to deal with this problem, called the count-min sketch (Cormode and Muthukrishnan, 2005). In this method, all the categories, rare or frequent alike, are mapped through multiple hash functions with an output range, m, much smaller than the number of categories, k. When retrieving a statistic, recompute all the hashes of the category and return the smallest statistic. Having multiple hash functions mitigates the probability of collision within a single hash function. The scheme works because the number of hash functions times m, the size of the hash table, can be made smaller than k, the number of categories, and still retain low overall collision probability.
Figure 5-4 illustrates. Each item i is mapped to one cell in each row of the array of counts. When an update of c_t to item i_t arrives, c_t is added to each of these cells, hashed using the functions h_1, …, h_d.
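The data structure can be written down compactly. Here is a sketch of a count-min sketch as described above (a hypothetical class, not from the book), with d hash functions, each hashing into a row of width m:

```python
import hashlib

class CountMinSketch:
    def __init__(self, d=4, m=1000):
        self.d, self.m = d, m
        self.table = [[0] * m for _ in range(d)]

    def _index(self, row, key):
        # each row uses a differently salted hash function
        digest = hashlib.md5('{}:{}'.format(row, key).encode('utf-8')).hexdigest()
        return int(digest, 16) % self.m

    def update(self, key, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, key)] += count

    def query(self, key):
        # collisions can only inflate a cell, so the minimum across rows
        # is the tightest estimate of the true count
        return min(self.table[row][self._index(row, key)] for row in range(self.d))

sketch = CountMinSketch(d=4, m=256)
sketch.update('a99f214a', 5)
print(sketch.query('a99f214a'))   # 5, unless a collision inflates every row
```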
Since bin counting relies on historical data to generate the necessary statistics, it requires waiting through a data collection period, incurring a slight delay in the learning pipeline. Also, when the data distribution changes, the counts need to be updated. The faster the data changes, the more frequently the counts need to be recomputed. This is particularly important for applications like targeted advertising, where user preferences and popular queries change very quickly, and lack of adaptation to the current distribution could mean huge losses for the advertising platform.
One might ask, why not use the same dataset to compute the relevant statistics and train the model? The idea seems innocent enough. The big problem here is that the statistics involve the target variable, which is what the model tries to predict. Using the output to compute the input features leads to a pernicious problem known as leakage. In short, leakage means that information is revealed to the model that gives it an unrealistic advantage to make better predictions. This could happen when test data is leaked into the training set, or when data from the future is leaked to the past. Any time that a model is given information that it shouldn’t have access to when it is making predictions in real time in production, there is leakage. Kaggle’s wiki gives more examples of leakage and why it is bad for machine learning applications.
If the bin-counting procedure used the current data point’s label to compute part of the input statistic, that would constitute direct leakage. One way to prevent that is by instituting strict separation between count collection (for computing bin-count statistics) and training, as illustrated in Figure 5-5—i.e., use an earlier batch of data points for counting, use the current data points for training (mapping categorical variables to historical statistics we just collected), and use future data points for testing. This fixes the problem of leakage, but introduces the aforementioned delay (the input statistics and therefore the model will trail behind current data).
It turns out that there is another solution, based on differential privacy. A statistic is approximately leakage-proof if its distribution stays roughly the same with or without any one data point. In practice, adding a small random noise with distribution Laplace(0,1) is sufficient to cover up any potential leakage from a single data point. This idea can be combined with leaving-one-out counting to formulate statistics on current data (Zhang, 2015).
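As a rough sketch of how such noise could be folded into the count statistics (a hypothetical helper, not a full leave-one-out implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_click_rate(clicks, nonclicks):
    """Click-through rate with Laplace(0, 1) noise added to the raw counts."""
    noisy_clicks = clicks + rng.laplace(0, 1)
    noisy_total = clicks + nonclicks + rng.laplace(0, 1)
    # clamp to [0, 1] so the feature stays a valid probability
    return min(1.0, max(0.0, noisy_clicks / noisy_total))

print(noisy_click_rate(5, 120))   # close to 0.04, but not exactly
```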
If the statistics are updated continuously given more and more historical data, the raw counts will grow without bounds. This could be a problem for the model. A trained model “knows” the input data up to the observed scale. A trained decision tree might say, “When x is greater than 3, predict 1.” A trained linear model might say, “Multiply x by 0.7 and see if the result is greater than the global average.” These might be the correct decisions when x lies between 0 and 5. But what happens beyond that? No one knows.
When the input counts increase, the model will need to be retrained to adapt to the current scale. If the counts accumulate rather slowly, then the effective scale won’t change too fast, and the model will not need to be retrained too frequently. But when counts increment very quickly, frequent retraining will be a nuisance.
For this reason, it is often better to use normalized counts that are guaranteed to be bounded in a known interval. For instance, the estimated click-through probability is bounded between 0 and 1. Another method is to take the log transform, which does not impose a strict bound but whose rate of increase is very slow when the count is very large.
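As a tiny illustration of both options, using the counts of the most frequent device from Example 5-6:

```python
import numpy as np

clicks, nonclicks = 15729, 71206       # device a99f214a from Example 5-6
rate = clicks / (clicks + nonclicks)   # normalized count, bounded in [0, 1]
log_clicks = np.log1p(clicks)          # not strictly bounded, but grows very slowly
print(rate, log_clicks)
```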
Neither method will guard against shifting input distributions (e.g., last year’s Barbie dolls are now out of style and people will no longer click on those ads). The model will need to be retrained to accommodate these more fundamental changes in input data distribution, or the whole pipeline will need to move to an online learning setting where the model is continuously adapting to the input.
Each of the approaches detailed in this chapter has its pros and cons. Here is a rundown of the trade-offs.
| Plain one-hot encoding | |
|---|---|
| Space requirement | O(n) using the sparse vector format, where n is the number of data points |
| Computation requirement | O(nk) under a linear model, where k is the number of categories |
| Pros | Simple to understand and implement; each feature corresponds directly to a category; missing data can be encoded as the all-zeros vector |
| Cons | Redundant, so trained linear models are not unique; computationally inefficient; breaks down when the number of categories becomes very large |
As we can see, none of the methods are perfect. Which one to use depends on the desired model. Linear models are cheaper to train and therefore can handle noncompressed representations such as one-hot encoding. Tree-based models, on the other hand, need to do repeated searches over all features for the right split, and are thus limited to small representations such as bin counting. Feature hashing sits in between those two extremes, but with mixed reports on the resulting accuracy.
Agarwal, Alekh, Olivier Chapelle, Miroslav Dudík, and John Langford. “A Reliable Effective Terascale Linear Learning System.” Journal of Machine Learning Research 15 (2015): 1111−1133.
Bilenko, Misha. “Big Learning Made Easy—with Counts!” Cortana Intelligence and Machine Learning Blog, February 17, 2015. https://blogs.technet.microsoft.com/machinelearning/2015/02/17/big-learning-made-easy-with-counts/.
Chen, Ye, Dmitry Pavlov, and John F. Canny. “Large-Scale Behavioral Targeting.” Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009): 209–218.
Cormode, Graham, and S. Muthukrishnan. “An Improved Data Stream Summary: The Count-Min Sketch and Its Applications.” Algorithms 55 (2005): 29–38.
Graepel, Thore, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. “Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine.” Proceedings of the 27th International Conference on Machine Learning (2010): 13–20.
He, Xinran, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. “Practical Lessons from Predicting Clicks on Ads at Facebook.” Proceedings of the 8th International Workshop on Data Mining for Online Advertising (2014): 1–9.
Lee, Wenke, Salvatore J. Stolfo, and Kui W. Mok. “Mining Audit Data to Build Intrusion Detection Models.” Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1998): 66–72.
Li, Wei, Xuerui Wang, Ruofei Zhang, Ying Cui, Jianchang Mao, and Rong Jin. “Exploitation and Exploration in a Performance Based Contextual Advertising System.” Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010): 27–36.
McMahan, H. Brendan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. “Ad Click Prediction: A View from the Trenches.” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013): 1222–1230.
Weinberger, Kilian, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. “Feature Hashing for Large Scale Multitask Learning.” Proceedings of the 26th International Conference on Machine Learning (2009): 1113–1120.
Yeh, Tse-Yu, and Yale N. Patt. “Two-Level Adaptive Training Branch Prediction.” Proceedings of the 24th Annual International Symposium on Microarchitecture (1991): 51–61.
Zhang, Owen. “Tips for Data Science Competitions.” SlideShare presentation, 2015. http://bit.ly/2DjuhBD.
1 In standard statistics literature, the technical term for the categories is levels. A categorical variable with two distinct categories has two levels. But there are a number of other things in statistics that are also called levels, so we do not use that terminology here; instead we use the more colloquial and unambiguous term “categories.”
2 Curious readers might wonder why one is called coding and the other encoding. This is largely convention. My guess is that one-hot encoding first became popular in electrical engineering, where information is encoded and decoded all the time. Dummy coding and effect coding, on the other hand, were invented in the statistics community. Somehow the “en” didn’t make its way over the academic divide.