Estimators

We have already used an estimator: StringIndexer. As stated earlier, an estimator holds state that it builds up while looking at data, whereas a transformer does not. So why is StringIndexer an estimator? Because it has to remember every string it has seen and maintain a mapping table between those strings and their label indexes.
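The mapping table described above can be illustrated with a minimal sketch in plain Python. This is not Spark's implementation, and the names build_label_index and index_labels are hypothetical. It orders labels by descending frequency, which mirrors StringIndexer's default ordering; the alphabetical tie-break is an assumption made here to keep the result deterministic.

```python
from collections import Counter


def build_label_index(labels):
    """Learn a label -> index table from data (the state an estimator keeps)."""
    counts = Counter(labels)
    # The most frequent label gets index 0; ties are broken alphabetically
    # (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def index_labels(table, labels):
    """Replace each label with its index using the learned table."""
    return [table[label] for label in labels]


training = ["a", "b", "c", "a", "a", "c"]
table = build_label_index(training)      # state learned from the data
indexed = index_labels(table, training)  # stateless lookup using that state
print(table)
print(indexed)
```

The table is the only thing learned from the data. Once it exists, indexing is a plain lookup, which is why the learning step and the lookup step can be separated.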

In machine learning, it is common to split the available data into at least a training and a testing subset. An estimator in the pipeline, such as StringIndexer, may then not have seen every string label while looking at the training dataset. In that case you'll get an exception when evaluating the model on the test dataset, because StringIndexer encounters labels that it has not seen before. This should be rare, and it may indicate that the sampling function used to separate the training and testing datasets is not working properly. If it does happen, calling setHandleInvalid("skip") on the StringIndexer makes it drop the rows with unseen labels instead of throwing an exception.
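The difference between failing on an unseen label and skipping it can be sketched in plain Python. Again, this is only an illustration of the behavior, not Spark code: the function apply_index and its handle_invalid parameter are hypothetical stand-ins for what setHandleInvalid controls, and the mapping table is written out by hand.

```python
def apply_index(table, labels, handle_invalid="error"):
    """Look up each label; unseen labels either raise or are dropped."""
    result = []
    for label in labels:
        if label in table:
            result.append((label, table[label]))
        elif handle_invalid == "skip":
            # Drop the row entirely, as the "skip" option does.
            continue
        else:
            raise ValueError(f"Unseen label: {label!r}")
    return result


# Mapping table learned from the training data (hand-written for this sketch).
table = {"a": 0, "c": 1, "b": 2}
# The test data contains "d", which was never seen during training.
test_labels = ["a", "d", "b"]

skipped = apply_index(table, test_labels, handle_invalid="skip")

try:
    apply_index(table, test_labels)  # default behavior: fail on "d"
    raised = False
except ValueError:
    raised = True

print(skipped)
print(raised)
```

Note that skipping silently shrinks the dataset, so it is worth checking how many rows were dropped before trusting the evaluation result.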

Another easy way to tell an estimator from a transformer is the additional fit method that estimators provide. Calling fit on a dataset computes the estimator's internal state and returns it as a fitted model; for StringIndexer, that is a StringIndexerModel holding the mapping table between label strings and label indexes. Now let's take a look at another estimator, an actual machine learning algorithm.
