String indexer

Let's assume that we have a DataFrame df containing a column called colors of categorical labels: red, green, and blue. We want to encode them as numeric values. This is where org.apache.spark.ml.feature.StringIndexer kicks in. It automatically determines the cardinality of the category set and assigns each category a distinct index. Indices start at 0 and, by default, are ordered by label frequency, so the most frequent label gets index 0. In our example, a list of categories such as red, red, green, red, blue, green is therefore transformed into 0.0, 0.0, 1.0, 0.0, 2.0, 1.0:

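import org.apache.spark.ml.feature.StringIndexer

// Configure the indexer: read labels from "colors", write indices to "colorsIndexed"
val indexer = new StringIndexer()
  .setInputCol("colors")
  .setOutputCol("colorsIndexed")

// fit() learns the label-to-index mapping; transform() appends the index column
val indexed = indexer.fit(df).transform(df)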

The result of this transformation is a DataFrame called indexed that, in addition to the original colors column of type String, now contains a column called colorsIndexed of type Double.
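To try the transformation end to end, the following is a minimal sketch rather than a definitive setup. It assumes Spark is on the classpath and builds a small, hypothetical df from the six sample labels used above. The local SparkSession and the application name are illustrative assumptions; in spark-shell the spark value already exists and that part can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StringIndexerSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: the six labels from the example in the text.
val df = Seq("red", "red", "green", "red", "blue", "green").toDF("colors")

val indexer = new StringIndexer()
  .setInputCol("colors")
  .setOutputCol("colorsIndexed")

// Learn the label-to-index mapping, then append the index column.
val indexed = indexer.fit(df).transform(df)

// Collect the new column to the driver to inspect the assigned indices.
// red occurs 3 times, green 2 times, blue once, so red -> 0.0, green -> 1.0, blue -> 2.0.
val values = indexed.select("colorsIndexed").as[Double].collect()
println(values.mkString(", "))  // 0.0, 0.0, 1.0, 0.0, 2.0, 1.0
```

Because the mapping depends on label frequencies in the data passed to fit, a different dataset can assign different indices to the same labels. When the same encoding must be applied to new data, reuse the fitted model returned by fit instead of fitting again.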
