Feature preparation

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation in Apache Spark. All the techniques discussed there can be applied to our data here, especially those that use time series and feature comparison to create new features. For example, the change in customer satisfaction responses over time is considered a potentially excellent predictor.
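
As a minimal sketch of such a time-based feature, the plain Scala snippet below computes, for each customer, the difference between the latest and the earliest satisfaction score. The Response record, its field names, and the sample values are hypothetical and made up for illustration; on real data the same logic would run over a Spark dataset rather than a local Seq.

```scala
// Hypothetical record: one satisfaction survey response per customer and period.
case class Response(customerId: String, period: Int, score: Double)

// Made-up sample data, for illustration only.
val responses = Seq(
  Response("c1", 1, 4.0), Response("c1", 2, 3.0), Response("c1", 3, 2.0),
  Response("c2", 1, 3.0), Response("c2", 2, 5.0)
)

// Change between the earliest and the latest response of each customer.
def satisfactionChange(rs: Seq[Response]): Map[String, Double] =
  rs.groupBy(_.customerId).map { case (id, group) =>
    val ordered = group.sortBy(_.period)
    id -> (ordered.last.score - ordered.head.score)
  }

val changes = satisfactionChange(responses)
```

With this sample, customer c1 gets a change of -2.0 (a falling score) and c2 gets 2.0. A negative value could then serve as one candidate input feature for the model.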

For this project, we need to conduct both feature extraction and feature selection, which lets us use all the techniques discussed in Chapter 2, Data Preparation for Spark ML, and in Chapter 3, A Holistic View on Spark.

Data merging is also necessary, but its implementation is similar to what was described in the previous chapters, so it can be completed with ease.

Feature extraction

In the previous chapters, we used Spark SQL and R for feature extraction. For this real-life project, we will use MLlib for feature extraction, although in practice users may combine all the tools available.

A complete guide for MLlib feature extraction can be found at http://spark.apache.org/docs/latest/mllib-feature-extraction.html.

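Here, we will use the Word2Vec method for extracting features from the social media data. The following code can be used to load a text file, parse it as an RDD of Seq[String], construct a Word2Vec instance, and then fit a Word2VecModel with the input data. Finally, we display the top 40 synonyms of a specific word. The sample code below reads a file named text8 and queries the word china; for our data, the query would be a word such as leave. Note that the model looks up single tokens, so a phrase such as bad service has to be queried word by word unless it was joined into one token before training.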

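import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// sc is the SparkContext provided by the Spark shell.
// Each line of the file becomes one sequence of space-separated tokens.
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

// Create a Word2Vec instance with default parameters.
val word2vec = new Word2Vec()

// Fit the model on the tokenized input.
val model = word2vec.fit(input)

// Find the 40 words closest to the query word.
val synonyms = model.findSynonyms("china", 40)

// Print each synonym with its cosine similarity to the query word.
for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

// Save and load model
model.save(sc, "myModelPath")
val sameModel = Word2VecModel.load(sc, "myModelPath")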
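
The synonyms are mainly useful for inspecting the model. To use the embeddings as input features, one common option is to average the vectors of the words in a post or phrase. The sketch below is an illustration, not part of the original listing: averageVector is a hypothetical helper, and toyVectors is a tiny hand-made map that stands in for the Map[String, Array[Float]] returned by model.getVectors.

```scala
// Average the embeddings of the tokens that are present in the vocabulary.
// `vectors` has the same shape as the result of Word2VecModel.getVectors.
// Returns None when no token is known to the model.
def averageVector(
    vectors: Map[String, Array[Float]],
    tokens: Seq[String]): Option[Array[Double]] = {
  val known = tokens.flatMap(t => vectors.get(t))
  if (known.isEmpty) None
  else {
    val dim = known.head.length
    val sum = new Array[Double](dim)
    for (v <- known; i <- 0 until dim) sum(i) += v(i)
    Some(sum.map(_ / known.size))
  }
}

// Hand-made two-dimensional vectors, for illustration only.
val toyVectors = Map(
  "bad" -> Array(1.0f, 0.0f),
  "service" -> Array(0.0f, 1.0f)
)

// Unknown tokens are skipped, so only "bad" and "service" are averaged.
val phraseFeature = averageVector(toyVectors, Seq("bad", "service", "unknownword"))
```

With a fitted model, the call would be averageVector(model.getVectors, tokens), where tokens is one tokenized post. Averaging ignores word order, which keeps the sketch simple but loses some information.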

Feature selection

MLlib also provides a few functions for feature selection. They are similar to what you learned in the previous chapters, so we will not repeat them here.

An online guide on feature selection with MLlib can be found at http://spark.apache.org/docs/latest/mllib-feature-extraction.html#feature-selection.
