OneHotEncoder

We are only halfway through. Although machine learning algorithms are capable of making use of the colorIndexed column, they often perform better if we one-hot encode it. This means that, instead of having a single colorIndexed column containing label indexes between zero and two, it is better to have three columns (one for each color) with the constraint that in every row exactly one of these columns is set to one and the others to zero. Let's do it:

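import org.apache.spark.ml.feature.OneHotEncoder

// Read the label indexes from colorIndexed and write the encoded vectors to colorVec
val encoder = new OneHotEncoder()
  .setInputCol("colorIndexed")
  .setOutputCol("colorVec")

// Spark 2.x: OneHotEncoder is a Transformer. On Spark 3.x it is an Estimator,
// so call encoder.fit(indexed).transform(indexed) instead.
val encoded = encoder.transform(indexed)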

Intuitively, we would expect to get three additional columns in the encoded DataFrame, for example, colorIndexedRed, colorIndexedGreen, and colorIndexedBlue. However, this is not the case. Instead, we get just one additional column in the DataFrame, and its type is org.apache.spark.ml.linalg.Vector. The vector uses its own internal representation, which we basically don't have to care about, as all Apache SparkML transformers and estimators are compatible with that format.
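To make the content of that vector column less abstract, here is a minimal sketch in plain Scala of how a label index maps to a one-hot vector. It is an illustration of the idea only, not Spark's implementation: the object OneHotSketch, the helper oneHot, and the three sample colors are hypothetical names chosen for this example. The dropLast parameter mirrors the option of the same name on Spark's OneHotEncoder, which defaults to true and therefore encodes the last category as an all-zero vector.

```scala
object OneHotSketch {
  // Hypothetical helper: turns a category index into a dense one-hot array.
  // With dropLast = true the vector has one slot fewer and the last category
  // becomes all zeros, mirroring the default of Spark's OneHotEncoder.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val size = if (dropLast) numCategories - 1 else numCategories
    val vec = Array.fill(size)(0.0)
    if (index < size) vec(index) = 1.0
    vec
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample labels; the position in the Seq plays the role of the label index.
    val colors = Seq("red", "green", "blue")
    for ((color, idx) <- colors.zipWithIndex) {
      val full = oneHot(idx, colors.size, dropLast = false)
      val dropped = oneHot(idx, colors.size)
      println(s"$color -> ${full.mkString("[", ", ", "]")} | dropLast: ${dropped.mkString("[", ", ", "]")}")
    }
  }
}
```

With dropLast disabled, index 0 becomes [1.0, 0.0, 0.0], which is the three-column layout we intuitively expected, just packed into a single value. Spark usually stores such vectors sparsely and prints them as (size,[indices],[values]), so a row of the colorVec column may look like (2,[0],[1.0]) rather than a list of ones and zeros.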
