OneHotEncoder

A one-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features. Suppose you have some categorical data in the following format (the same data that we used to describe the StringIndexer in the previous section):

Figure 14: DataFrame for applying OneHotEncoder

First, we index the name column so that the most frequent name in the dataset (that is, Jason in our case) gets index 0. However, an index alone is of limited use: a model could read the numbers as an ordering among the names that does not exist. You can therefore encode the indices further as vectors and then feed the DataFrame to ML models easily. Since we have already seen how to create a DataFrame in the previous section, here we will just show how to encode the indexed column as vectors:

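import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Index the name column; the most frequent name gets index 0
val indexer = new StringIndexer()
  .setInputCol("name")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

// Encode each category index as a binary vector
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")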

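Now let's use the encoder, which is a Transformer in Spark 2.x, to turn the category indices into vectors and then see the contents, as follows (in Spark 3.0 and later, OneHotEncoder is an Estimator, so you would call encoder.fit(indexed).transform(indexed) instead):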

val encoded = encoder.transform(indexed)
encoded.show()

The resulting DataFrame is as follows:

Figure 15: Creating category index and vector using OneHotEncoder

Now you can see that a new column, categoryVec, containing the feature vectors has been added to the resulting DataFrame.
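
To make the contents of such a vector column more concrete, the following is a minimal plain-Scala sketch of the two steps: indexing by frequency and one-hot encoding. It is not Spark's implementation. The OneHotSketch object, its two helper functions, and the sample names are hypothetical and only serve as an illustration. The sketch assumes that ties in frequency are broken alphabetically and that the last category is dropped, which mirrors the default dropLast setting of Spark's OneHotEncoder. It also uses dense vectors, whereas Spark stores the result as sparse vectors:

```scala
object OneHotSketch {
  // Assign indices by descending frequency, so the most frequent label gets 0.
  // Ties are broken alphabetically (an assumption made here for determinism).
  def frequencyIndex(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Build a binary vector that contains at most a single 1.0.
  // With dropLast = true the vector has (numCategories - 1) slots, and the
  // category with the highest index is encoded as all zeros.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data; "Jason" is the most frequent name
    val names = Seq("Jason", "David", "Martin", "Jason", "David", "Jason")
    val index = frequencyIndex(names)
    names.distinct.foreach { name =>
      println(s"$name -> ${index(name)} -> ${oneHot(index(name), index.size)}")
    }
  }
}
```

With three distinct names and the last category dropped, each vector in this sketch has two slots: the most frequent name sets the first slot, the second most frequent sets the second slot, and the least frequent name is represented by the all-zero vector. This is why the definition says "at most" a single one-value.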
