Hashing categorical features

In machine learning, feature hashing (also called the hashing trick) is an efficient way to encode categorical features. It is based on hash functions, which map data of arbitrary size to data of a fixed (and usually smaller) size. Feature hashing is easiest to understand through an example.

Let's say we have three features, gender, site_domain, and device_model, for example:

With one-hot encoding, these become feature vectors of size 9, which comes from 2 (for gender) + 4 (for site_domain) + 3 (for device_model). With feature hashing, we want to obtain a feature vector of size 4 instead. We define a hash function that sums the Unicode code points of every character in a row's string values, divides the sum by 4, and takes the remainder as the hashed output. Take the first row as an example: ord(m) + ord(a) + ord(l) + ord(e) + … + ord(s) + ord(u) + ord(n) + ord(g) = 109 + 97 + 108 + 101 + … + 115 + 117 + 110 + 103 = 1500, and 1500 % 4 = 0, which means the row is hashed into [1, 0, 0, 0]. A sample with a remainder of 1 is hashed into [0, 1, 0, 0], a sample with a remainder of 2 into [0, 0, 1, 0], and a sample with a remainder of 3 into [0, 0, 0, 1].
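
To make this mapping concrete, here is a minimal Python sketch of the toy hash function just described. The helper name hash_row and the example row passed to it are hypothetical and not taken from the table above:
>>> # Toy hash function: sum the Unicode code points of all characters in
>>> # the row's string values and one-hot encode the remainder modulo size
>>> def hash_row(values, size=4):
...     total = sum(ord(char) for char in ''.join(values))
...     vector = [0] * size
...     vector[total % size] = 1
...     return vector
...
>>> hash_row(['male', 'cnn.com', 'samsung'])  # hypothetical example row
[0, 1, 0, 0]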

Similarly, for other rows, we have the following:

In the end, we use the four-dimensional hashed vectors to represent the original data, instead of the nine-dimensional one-hot encoded ones.

There are a few things to note about feature hashing:

  • The same input is always converted to the same output, for instance, the second and fifth rows.
  • Two different inputs might be converted to the same output, such as the first and fourth rows. This phenomenon is called a hash collision (see the sketch after this list).
  • Hence, the choice of the resulting fixed size is important. A size that is too small causes serious collisions and information loss, while a size that is too large is basically a redundant one-hot encoding. With the right size, hashing is space-efficient and, at the same time, preserves important information, which further benefits downstream tasks.
  • Hashing is one-way, which means we cannot revert the output to its input, whereas one-hot encoding is a two-way mapping.
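
Using the toy hash_row sketch from earlier, a collision is easy to reproduce. The two inputs below are hypothetical and were chosen only because their code point sums leave the same remainder when divided by 4:
>>> hash_row(['ab'])  # 97 + 98 = 195, and 195 % 4 = 3
[0, 0, 0, 1]
>>> hash_row(['cd'])  # 99 + 100 = 199, and 199 % 4 = 3
[0, 0, 0, 1]

Different inputs, same hashed output; and from [0, 0, 0, 1] alone there is no way to recover 'ab' or 'cd', which also illustrates the one-way nature of hashing.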

Let's now adopt feature hashing for our click prediction project. Recall that the one-hot encoded vectors are of size 31,458. If we choose 10,000 as the fixed hashing size, we will cut the space to less than one third of that and reduce the memory consumed by training the classification model. We will also see how much quicker feature hashing is than one-hot encoding, as there is no need to keep track of all unique values across all columns; it simply maps each individual row of string values to a sparse vector through its internal hash functions, as follows:

  1. We begin by importing FeatureHasher from PySpark and initializing a feature hasher with an output size of 10,000 (here, categorical is the list of categorical column names defined earlier):
>>> from pyspark.ml.feature import FeatureHasher
>>> hasher = FeatureHasher(numFeatures=10000,
                           inputCols=categorical,
                           outputCol="features")
 
  2. Use the defined hasher to convert the input DataFrame:
>>> hasher.transform(df_train).select("features").show()
+--------------------+
| features|
+--------------------+
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[29,1228,1...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[1228,1289...|
|(10000,[746,1060,...|
|(10000,[675,1228,...|
|(10000,[1289,1695...|
+--------------------+
only showing top 20 rows

As we can see, the size of the resulting column, features, is 10,000. Again, there is no training or fitting in feature hashing. The hasher is a predefined mapping.
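
If you want to convince yourself that the hasher is a fixed mapping that needs no fitting, you can apply it directly to a tiny hand-made DataFrame. This is only an illustrative sketch: the toy rows below are hypothetical, and spark is assumed to be the SparkSession created earlier:
>>> # Two identical rows are always hashed to the same sparse vector
>>> toy_df = spark.createDataFrame(
        [('male', 'cnn', 'samsung'), ('male', 'cnn', 'samsung')],
        ['gender', 'site_domain', 'device_model'])
>>> toy_hasher = FeatureHasher(numFeatures=16,
                               inputCols=['gender', 'site_domain', 'device_model'],
                               outputCol='features')
>>> toy_hasher.transform(toy_df).show(truncate=False)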

  3. For better organization of the entire workflow, we chain the hasher and the classification model together in a pipeline:
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml import Pipeline
>>> classifier = LogisticRegression(maxIter=20, regParam=0.000,
                                    elasticNetParam=0.000)
>>> stages = [hasher, classifier]
>>> pipeline = Pipeline(stages=stages)
  4. Fit the pipelined model on the training set as follows:
>>> model = pipeline.fit(df_train)
  5. Apply the trained model to the testing set and record the prediction results:
>>> predictions = model.transform(df_test)
>>> predictions.cache()

  6. Evaluate its performance in terms of the AUC of ROC:
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
>>> ev = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                       metricName="areaUnderROC")
>>> print(ev.evaluate(predictions))
0.7448097180769776

We achieve an AUC of 74.48%, which is close to the 74.89% obtained previously with one-hot encoding. At the end of the day, we save a substantial amount of computational resources and attain comparable prediction accuracy. That is a win.

With feature hashing, we lose interpretability but gain computational advantage.