Training and testing a logistic regression model

With the encoded training and testing sets ready, we can now train our classification model. We use logistic regression as an example, but PySpark supports many other classification models, such as decision tree classifiers, random forests, neural networks (which we will be studying in Chapter 9, Stock Price Prediction with Regression Algorithms), linear SVM, and Naïve Bayes. For further details, please refer to the following link: https://spark.apache.org/docs/latest/ml-classification-regression.html#classification.

We train and test a logistic regression model with the following steps:

We first import the logistic regression module and initialize a model:

>>> from pyspark.ml.classification import LogisticRegression
>>> classifier = LogisticRegression(maxIter=20, regParam=0.001, 
                                    elasticNetParam=0.001)

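Here, we set the maximum number of iterations to 20 and the regularization parameter to 0.001. The elasticNetParam value controls the mix between the L1 and L2 penalties: 0 gives a pure L2 penalty and 1 gives a pure L1 penalty, so a value of 0.001 means the model is regularized almost entirely with L2.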
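To make the roles of regParam and elasticNetParam concrete, the following is a minimal pure-Python sketch of the elastic net penalty as described in the Spark documentation. The function name elastic_net_penalty and the example weights are hypothetical, and the sketch ignores details such as feature standardization that Spark handles internally.

```python
def elastic_net_penalty(weights, reg_param=0.001, elastic_net_param=0.001):
    """Elastic net penalty: reg_param * (a * L1 + (1 - a) / 2 * L2^2),
    where a is elastic_net_param."""
    l1 = sum(abs(w) for w in weights)          # L1 norm of the weights
    l2_squared = sum(w * w for w in weights)   # squared L2 norm
    mix = elastic_net_param
    return reg_param * (mix * l1 + (1.0 - mix) / 2.0 * l2_squared)


# Hypothetical weight vector, for illustration only
weights = [0.5, -1.0, 2.0]

penalty = elastic_net_penalty(weights)                         # settings used above
pure_l2 = elastic_net_penalty(weights, elastic_net_param=0.0)  # ridge-style penalty
pure_l1 = elastic_net_penalty(weights, elastic_net_param=1.0)  # lasso-style penalty

print(penalty, pure_l2, pure_l1)
```

With elasticNetParam at 0.001, the penalty is nearly identical to the pure L2 case, which is why this setting behaves much like ridge regularization.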

Now, fit the model on the encoded training set:

>>> lr_model = classifier.fit(df_train_encoded)

Be aware that this might take a while. You can check the running or completed jobs in the Spark UI in the meantime. Refer to the following screenshot for some completed jobs:

Note that each RDDLossFunction represents an iteration of optimizing the logistic regression classifier.

After all iterations, we apply the trained model on the testing set:

>>> predictions = lr_model.transform(df_test_encoded)

Cache the prediction results, as we will reuse them when computing the prediction performance:

>>> predictions.cache()
DataFrame[label: int, features: vector, rawPrediction: vector, probability: vector, prediction: double]

Take a look at the prediction DataFrame:

>>> predictions.show()
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|(31458,[5,7,788,4...|[2.80267740289335...|[0.94282033454271...|       0.0|
|    0|(31458,[5,7,788,4...|[2.72243908463177...|[0.93833781006061...|       0.0|
|    0|(31458,[5,7,788,4...|[2.72243908463177...|[0.93833781006061...|       0.0|
|    0|(31458,[5,7,788,4...|[2.82083664358057...|[0.94379146612755...|       0.0|
|    0|(31458,[5,7,788,4...|[2.82083664358057...|[0.94379146612755...|       0.0|
|    0|(31458,[5,7,14,45...|[4.44920221201642...|[0.98844714081261...|       0.0|
|    0|(31458,[5,7,14,45...|[4.44920221201642...|[0.98844714081261...|       0.0|
|    0|(31458,[5,7,14,45...|[4.44920221201642...|[0.98844714081261...|       0.0|
|    0|(31458,[5,7,14,45...|[4.54759977096521...|[0.98951842852058...|       0.0|
|    0|(31458,[5,7,14,45...|[4.54759977096521...|[0.98951842852058...|       0.0|
|    0|(31458,[5,7,14,45...|[4.38991492595212...|[0.98775013592573...|       0.0|
|    0|(31458,[5,7,14,45...|[4.38991492595212...|[0.98775013592573...|       0.0|
|    0|(31458,[5,7,14,45...|[4.38991492595212...|[0.98775013592573...|       0.0|
|    0|(31458,[5,7,14,45...|[4.38991492595212...|[0.98775013592573...|       0.0|
|    0|(31458,[5,7,14,45...|[5.58870435258071...|[0.99627406423617...|       0.0|
|    0|(31458,[5,7,14,45...|[5.66066729150822...|[0.99653187592454...|       0.0|
|    0|(31458,[5,7,14,45...|[5.66066729150822...|[0.99653187592454...|       0.0|
|    0|(31458,[5,7,14,45...|[5.61336061100621...|[0.99636447866332...|       0.0|
|    0|(31458,[5,7,2859,...|[5.47553763410082...|[0.99582948965297...|       0.0|
|    0|(31458,[1,7,651,4...|[1.33424801682849...|[0.79154243844810...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

The DataFrame contains the predictive features, the ground truth labels, the raw prediction scores, the probabilities of the two classes, and the final prediction (with a decision threshold of 0.5).
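As a rough illustration of how these columns relate for binary logistic regression, the sketch below applies the logistic (sigmoid) function to the first rawPrediction entry of the first row and derives the class probabilities and the thresholded prediction. The input value is copied from the truncated output above, so the result is only approximate; the variable names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# First rawPrediction entry of the first row (truncated in the output above)
raw_class0 = 2.80267740289335

prob_class0 = sigmoid(raw_class0)   # probability of class 0
prob_class1 = 1.0 - prob_class0     # probability of class 1

# Predict class 1 only when its probability exceeds the 0.5 threshold
threshold = 0.5
prediction = 1.0 if prob_class1 > threshold else 0.0

print(prob_class0, prob_class1, prediction)
```

The computed class 0 probability is approximately 0.9428, in line with the first entry of the probability column for that row, and the resulting prediction is 0.0.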

We evaluate the AUC of ROC on the testing set using the BinaryClassificationEvaluator class with the areaUnderROC evaluation metric:

>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
>>> ev = BinaryClassificationEvaluator(rawPredictionCol = 
                  "rawPrediction", metricName = "areaUnderROC")
>>> print(ev.evaluate(predictions))
0.7488839207716323

The model achieves an AUC of about 0.7489 (74.89%) on the testing set.
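For intuition about what this metric measures, AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following pure-Python sketch computes it by comparing all positive/negative pairs on a tiny made-up dataset. The function roc_auc and the data are hypothetical; this is not how Spark computes the metric internally, and its result on real data may differ slightly.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the toy AUC is 0.75. The pairwise loop is quadratic in the number of samples, so it is only suitable for small examples like this one.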
