OneHotEncoder

We are only halfway through. Although machine learning algorithms are capable of making use of the colorIndexed column, they often perform better if we one-hot encode it, since a numeric index suggests an ordering among the colors that does not exist. This means that, instead of having a single colorIndexed column containing label indexes between zero and two, it is better to have three columns, one for each color, with the constraint that every row sets exactly one of these columns to one and the others to zero. Let's do it:

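import org.apache.spark.ml.feature.OneHotEncoder

// Reads the label indexes from colorIndexed and writes a vector column, colorVec
val encoder = new OneHotEncoder()
  .setInputCol("colorIndexed")
  .setOutputCol("colorVec")

// Spark 2.x API; from Spark 3.0 on, use encoder.fit(indexed).transform(indexed)
val encoded = encoder.transform(indexed)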

Intuitively, we would expect to get three additional columns in the encoded DataFrame, for example, colorIndexedRed, colorIndexedGreen, and colorIndexedBlue. However, this is not the case. Instead, we get just one additional column, colorVec, whose type is org.apache.spark.ml.linalg.Vector. It uses its own internal representation, which we basically don't have to care about, as all Apache SparkML transformers and estimators are compatible with that format.
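To build some intuition for what such a vector holds, here is a minimal sketch in plain Scala, without any Spark dependency. It is not Spark's implementation; the object OneHotSketch and its methods oneHot and toSparse are hypothetical names made up for this illustration. It assumes three colors with zero-based indexes and mirrors the OneHotEncoder default of dropLast = true, under which the last category is encoded as an all-zeros vector, so three categories yield a vector of length two.

```scala
object OneHotSketch {
  // Dense one-hot vector for a zero-based category index.
  // With dropLast = true the last category is encoded as all zeros,
  // so the vector has numCategories - 1 entries.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val size = if (dropLast) numCategories - 1 else numCategories
    val vec = Array.fill(size)(0.0)
    if (index < size) vec(index) = 1.0
    vec
  }

  // Sparse view (size, indices, values) that lists only the non-zero entries.
  def toSparse(dense: Array[Double]): (Int, Seq[Int], Seq[Double]) = {
    val nonZero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    (dense.length, nonZero.map(_._2).toSeq, nonZero.map(_._1).toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Walk through the three assumed color indexes 0, 1, and 2
    for (i <- 0 until 3) {
      val dense = oneHot(i, numCategories = 3)
      println(s"index $i -> dense ${dense.mkString("[", ", ", "]")}, sparse ${toSparse(dense)}")
    }
  }
}
```

With dropLast left at its default, index 0 becomes [1.0, 0.0], index 1 becomes [0.0, 1.0], and index 2 becomes [0.0, 0.0]. The sparse triple of size, indices, and values follows the same idea as the compact form in which Spark typically displays such vectors, for example (2,[0],[1.0]).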
