Feature extraction

In this section, we turn our focus to feature extraction, which is the process of developing new features or variables from the information available in our working datasets. Along the way, we will discuss some of Apache Spark's special capabilities for feature extraction, as well as some related feature work that Spark makes easy.

After completing this section, we will be able to develop and organize features for various machine learning projects.

Feature development challenges

For most big data machine learning projects, the big datasets we take in often cannot be used immediately. For example, web log data is very messy and often arrives as a collection of semi-structured text, from which we need to extract useful information and derive features that are ready for machine learning. We might, for instance, need to extract the number of clicks and the number of impressions from the web log, for which many text mining tools and algorithms are readily available.
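
For instance, assuming each log line carries an event type field such as click or impression (a made-up format, used here only for illustration), a rough first pass at counting these events per user might look like the following sketch in the Spark shell:

// Hypothetical log format: "timestamp,userId,eventType,url"
val logLines = sc.textFile("weblog.csv")
val counts = logLines
  .map(_.split(","))
  .filter(_.length >= 3)
  .map(fields => ((fields(1), fields(2)), 1))  // key: (userId, eventType)
  .reduceByKey(_ + _)
counts.take(10).foreach(println)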

With any feature extraction, machine learning professionals need to decide:

  • What information to use and what features to create
  • What methods and algorithms to use

Which features to extract depends on the following:

  • Data availability, as well as data properties such as how easy it is to handle missing values
  • The available algorithms, as there are many algorithms for numeric combinations of data elements but fewer for text manipulation
  • Domain knowledge, as the explainability of features is often a concern

Overall, there are a few commonly used techniques for deriving features (a brief example of the first follows the list):

  • Data description
  • Data aggregation
  • Time series transformations
  • Geographical
  • PCA
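
To make the first of these concrete, MLlib's Statistics.colStats computes column-wise summary statistics, which are often used as descriptive features; the rows below are made-up values:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Made-up rows with three numeric features each
val observations = sc.parallelize(Seq(
  Vectors.dense(23.0, 1.0, 54000.0),
  Vectors.dense(31.0, 0.0, 61000.0),
  Vectors.dense(45.0, 1.0, 78000.0)))

// Column-wise summary statistics over the RDD of vectors
val summary = Statistics.colStats(observations)
println(summary.mean)      // column means
println(summary.variance)  // column variances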

Another task in feature preparation is to select features from hundreds or perhaps thousands of available features, and then make the selected subset available to our ML projects. In machine learning, specifically in supervised learning, the general problem at hand is to predict an outcome from a set of predictive features. At first glance, in our big data era, it is tempting to say that the more features we have, the better our predictions will be. However, problems arise as the number of features increases, such as longer computing times and greater difficulty in interpreting the results.

In most cases, at the feature preparation stage, machine learning professionals use feature selection methods and algorithms, many of which are associated with regression modeling.
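
Besides regression-based approaches, Spark MLlib also ships a simple selector of its own, ChiSqSelector, which ranks categorical (or discretized) features by a chi-squared test of independence against the label. A minimal sketch on made-up data:

import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Made-up labeled points with four discrete-valued features
val labeled = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 2.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 2.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 1.0, 0.0, 1.0))))

// Keep the two features most predictive of the label
val selector = new ChiSqSelector(2)
val model = selector.fit(labeled)
val reduced = labeled.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))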

Feature development with Spark MLlib

Feature extraction can be implemented with Spark SQL, while Spark's MLlib also provides dedicated functions for this task, such as TF-IDF and Word2Vec.

Both MLlib and R provide implementations of principal component analysis (PCA), which is often employed for feature development.
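
On the MLlib side, for example, the PCA transformer in org.apache.spark.mllib.feature projects vectors onto the top principal components; the data below is made up:

import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

// Made-up rows with four numeric features
val rows = sc.parallelize(Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0),
  Vectors.dense(4.0, 1.0, 3.0, 1.0),
  Vectors.dense(6.0, 2.0, 0.0, 2.0)))

// Fit PCA and project each row onto the top two principal components
val pca = new PCA(2).fit(rows)
val projected = rows.map(pca.transform)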

As we may recall from the Data cleaning made easy section, we have four data tables to work with for the purpose of illustration:

  • Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
  • Events(userId INT, action INT, Default)
  • WebLog(userId, webAction)
  • Demographic(memberId, age, edu, income)

Here, we can apply our feature extraction techniques to the third table (WebLog) and then apply feature selection to the final merged (joined) dataset.
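
The merge itself can be expressed in Spark SQL; the following sketch assumes the tables have been registered as temporary tables under the names shown previously, with the Spark shell's sqlContext in scope:

// Join the user, demographic, and web log tables on the user identifier
val merged = sqlContext.sql("""
  SELECT u.userId, u.age, d.edu, d.income, w.webAction
  FROM Users u
  JOIN Demographic d ON u.userId = d.memberId
  JOIN WebLog w ON w.userId = u.userId""")
merged.show(5)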

With Spark MLlib, we can compute the term frequencies for TF-IDF with the following commands (here, documents is an RDD[Seq[String]] of tokenized documents):

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hash each document's terms into a fixed-length term frequency vector
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
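
Note that HashingTF alone only yields the term frequency (TF) part; to arrive at TF-IDF we still fit and apply IDF, as in the MLlib documentation:

import org.apache.spark.mllib.feature.IDF

// Cache tf, since IDF makes two passes: one to fit, one to transform
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)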

Alternatively, we can apply Word2Vec, as illustrated by the following Scala example. It first loads a text file and parses it as an RDD of Seq[String], then constructs a Word2Vec instance and fits a Word2VecModel to the data. Finally, we display the top 40 synonyms of a specified word. Here, we assume that the extracted file named text8 is in the same directory in which you run the Spark shell. Run the following code:

import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("china", 40)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

// Save and load model
model.save(sc, "myModelPath")
val sameModel = Word2VecModel.load(sc, "myModelPath")

For more information about using Spark MLlib for feature extraction, go to:

http://spark.apache.org/docs/latest/mllib-feature-extraction.html

Feature development with R

As with the four tables listed in the previous section (Users, Events, WebLog, and Demographic), we can apply our feature extraction techniques to the third table (WebLog) and then apply feature selection to the final merged (joined) dataset.

To implement this in R, using an R notebook on Spark, we need to utilize some R packages. For example, with the ReporteRs package we can extract text from Word documents, as the following commands show:

library(ReporteRs)

# Extract text from the example Word document shipped with ReporteRs
doc <- docx(title = "My example",
            template = file.path(find.package("ReporteRs"),
                                 "templates/bookmark_example.docx"))
text_extract(doc)                                  # all of the text
text_extract(doc, header = FALSE, footer = FALSE)  # skip headers and footers
text_extract(doc, bookmark = "author")             # only the "author" bookmark

Note

For more information on the ReporteRs R package, go to https://cran.r-project.org/web/packages/ReporteRs/ReporteRs.pdf.
