Feature extraction and transformation

Suppose you are going to build a machine learning model that predicts whether a credit card transaction is fraudulent. Based on the available background knowledge and data analysis, you might decide which data fields (aka features) are important for training your model. For example, the amount, the customer name, the merchant name, and the address of the credit card owner are worth providing to the learning process. These choices matter: if you provide only a randomly generated transaction ID, it carries no information and so would not be useful at all. Thus, once you have decided which features to include in your training set, you then need to transform those features so the model can learn from them effectively. Feature transformations help you add background information to the training data, and it is this extra information that the machine learning model ultimately benefits from. To make the preceding discussion more concrete, suppose you have the address of one of the customers represented as the following string:

"123 Main Street, Seattle, WA 98101"

As it stands, this address lacks semantic structure; in other words, the string has limited expressive power. It would be useful only for learning patterns associated with that exact address in a database, for example. However, breaking it up into its fundamental parts can provide additional features such as the following (a minimal parsing sketch follows the list):

  • "Address" (123 Main Street)
  • "City" (Seattle)
  • "State" (WA)
  • "Zip" (98101)

With these more granular features, your ML algorithm can group a wider range of transactions together and discover broader patterns; for example, some zip codes are associated with more fraudulent activity than others. Spark provides several built-in algorithms for feature extraction and for making transformation easier. For example, the current version provides the following algorithms for feature extraction (a brief CountVectorizer sketch follows the list):

  • TF-IDF
  • Word2vec
  • CountVectorizer
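
As a quick illustration, the following minimal sketch fits a CountVectorizer on a toy two-document corpus and turns each token sequence into a term-count vector. The corpus, column names, and local master setting are assumptions made for the example:

    import org.apache.spark.ml.feature.CountVectorizer
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CountVectorizerExample")
      .master("local[*]") // assumption: run locally for illustration
      .getOrCreate()
    import spark.implicits._

    // Toy corpus: each row holds an already-tokenized document
    val df = Seq(
      (0, Array("credit", "card", "fraud")),
      (1, Array("credit", "card", "purchase", "purchase"))
    ).toDF("id", "words")

    // fit() learns the vocabulary; transform() appends sparse count vectors
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .fit(df)

    cvModel.transform(df).select("id", "features").show(false)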

On the other hand, a Transformer is an abstraction that covers both feature transformers and learned models. Technically, a Transformer implements a method named transform(), which converts one DataFrame into another, generally by appending one or more columns. Spark supports the following transformers for DataFrames (a short usage sketch follows the list):

  • Tokenizer
  • StopWordsRemover
  • n-gram
  • Binarizer
  • PCA
  • PolynomialExpansion
  • Discrete cosine transform (DCT)
  • StringIndexer
  • IndexToString
  • OneHotEncoder
  • VectorIndexer
  • Interaction
  • Normalizer
  • StandardScaler
  • MinMaxScaler
  • MaxAbsScaler
  • Bucketizer
  • ElementwiseProduct
  • SQLTransformer
  • VectorAssembler
  • QuantileDiscretizer
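
The following minimal sketch chains two of these transformers by hand: Tokenizer appends a column of lowercase tokens, and StopWordsRemover appends a column with common stop words removed. The sample sentences and column names are assumptions for the example:

    import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("TransformerExample")
      .master("local[*]") // assumption: run locally for illustration
      .getOrCreate()
    import spark.implicits._

    val sentences = Seq(
      (0, "This transaction looks suspicious"),
      (1, "A normal purchase at the grocery store")
    ).toDF("id", "sentence")

    // Each transform() call returns a new DataFrame with an extra column
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val tokenized = tokenizer.transform(sentences)

    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    remover.transform(tokenized).show(false)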

Due to page limitations, we cannot describe all of them, but we will discuss some widely used ones such as CountVectorizer, Tokenizer, StringIndexer, StopWordsRemover, and OneHotEncoder. PCA, which is commonly used for dimensionality reduction, will be discussed in the next section.
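
As a small preview of the distinction between a plain transformer and a learned model, the sketch below fits a StringIndexer: fit() learns a label-to-index mapping from the data, and the resulting StringIndexerModel is itself a Transformer. The toy state column is an assumption for the example:

    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("StringIndexerExample")
      .master("local[*]") // assumption: run locally for illustration
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0, "WA"), (1, "CA"), (2, "WA"), (3, "NY")).toDF("id", "state")

    // fit() learns the mapping (the most frequent label gets index 0.0);
    // the fitted model's transform() appends the numeric index column
    val indexed = new StringIndexer()
      .setInputCol("state")
      .setOutputCol("stateIndex")
      .fit(df)
      .transform(df)

    indexed.show()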
