Creating a simple pipeline

Spark provides pipeline APIs under Spark ML. A pipeline comprises a sequence of stages, and there are two basic types of pipeline stage, called transformer and estimator:

  • A transformer takes a dataset as input and produces an augmented dataset as output, so that the output can be fed to the next step. For example, Tokenizer and HashingTF are two transformers. Tokenizer transforms a dataset with text into a dataset with tokenized words. HashingTF, on the other hand, produces the term frequencies. Tokenization and term-frequency hashing are commonly used in text mining and text analytics.
  • In contrast, an estimator must first be fit on the input dataset to produce a model. The model itself is then used as the transformer that turns the input dataset into the augmented output dataset. For example, logistic regression or linear regression can be used as an estimator, which is fit on a training dataset with the corresponding labels and features.
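The difference between the two stage types can be sketched in plain Python. This is a toy illustration only, not Spark's API: the class names (ToyTokenizer, ToyMajorityLabelEstimator, ToyMajorityLabelModel) are hypothetical, a dataset is assumed to be a list of dictionaries, and a trivial majority-label estimator stands in for logistic regression.

```python
# Toy, plain-Python illustration of the two stage types. This is NOT Spark's API.
# A "dataset" here is simply a list of dicts, one dict per row.


class ToyTokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        # Return new rows so the input dataset is left untouched.
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class ToyMajorityLabelModel:
    """Model produced by fitting an estimator; it is itself a transformer."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        # Augment every row with a 'prediction' column.
        return [{**row, "prediction": self.label} for row in rows]


class ToyMajorityLabelEstimator:
    """Estimator: must be fit on labeled data before anything can be transformed."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        majority = max(set(labels), key=labels.count)
        return ToyMajorityLabelModel(majority)


training = [
    {"text": "Spark ML pipeline", "label": 1.0},
    {"text": "hadoop mapreduce", "label": 0.0},
    {"text": "spark streaming", "label": 1.0},
]

tokenized = ToyTokenizer().transform(training)      # a transformer needs no fitting
model = ToyMajorityLabelEstimator().fit(tokenized)  # fitting an estimator yields a model
predicted = model.transform(tokenized)              # the model then acts as a transformer
```

The point of the sketch is the shape of the two contracts: a transformer exposes only transform, whereas an estimator exposes fit and hands back a model that is used as a transformer.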

Fitting such an estimator produces a logistic or linear regression model. Developing a pipeline is therefore straightforward: declare the required stages, configure each stage's parameters, and finally chain them in a pipeline object, as shown in the following figure:

Figure 17: Spark ML pipeline model using logistic regression estimator (DS indicates data store, and the steps inside the dashed line only happen during pipeline fitting)

If you look at Figure 17, the fitted model consists of a Tokenizer, a HashingTF feature extractor, and a fitted logistic regression model. The fitted pipeline model acts as a transformer that can be used for prediction, model validation, model inspection, and, finally, model deployment. However, to improve prediction accuracy, the model itself needs to be tuned.

Now that we know about the available algorithms in Spark MLlib and ML, it's time to get prepared before using them in a formal way to solve supervised and unsupervised learning problems. In the next section, we will start with feature extraction and transformation.
