P. Singh, Machine Learning with PySpark, https://doi.org/10.1007/978-1-4842-7777-5_5

5. Logistic Regression

Pramod Singh¹

(1)

Bangalore, Karnataka, India

This chapter focuses on building a logistic regression model with PySpark and on understanding the ideas behind logistic regression. Logistic regression is used for classification problems, which we covered in earlier chapters. It is still called regression because, under the hood, a linear regression equation captures the relationship between the input variables and the target variable. The main distinction between linear and logistic regression is that logistic regression passes the output of that linear equation through a nonlinear function to convert it into a probability between 0 and 1. For example, we can use logistic regression to predict whether a user will buy a product; the model returns a buying probability for each user. Logistic regression is widely used in business applications.

Probability

To understand logistic regression, we first have to go over the concept of probability. When all outcomes are equally likely, it is defined as the number of outcomes of interest divided by the number of all possible outcomes. For example, if we flip a fair coin, the chances of getting heads or tails are equal (50%), as shown in Figure 5-1.

If we roll a fair die, the probability of getting any one of the numbers between 1 and 6 is equal to 16.7%.

If we pick a ball from a bag that contains four green balls and one blue ball, the probability of picking a green ball is 80%.
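The three examples above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `probability` is our own, and it assumes every outcome is equally likely.

```python
from fractions import Fraction

def probability(favorable, total):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(favorable, total)

coin_heads = probability(1, 2)   # one head among two faces
die_face = probability(1, 6)     # one face among six
green_ball = probability(4, 5)   # four green balls among five balls

for name, p in [("heads", coin_heads), ("die face", die_face), ("green ball", green_ball)]:
    print(f"{name}: {float(p):.1%}")
```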

Logistic regression is used to predict the probability of each target class. In the case of binary classification (only two classes), it returns the probability associated with each class for every record. As mentioned, it uses a linear regression equation behind the scenes to capture the relationship between input and output variables, and then applies one more element (a nonlinear function) to convert the continuous output into a probability. Let’s understand this with the help of an example. Suppose we have to predict whether a particular user will buy a product, using a single input variable: the time the user spends on the website. The data is given in Table 5-1.

Table 5-1

Sample Data

Sr. No	Time Spent (mins)	Converted
1	1	No
2	2	No
3	5	No
4	15	Yes
5	17	Yes
6	18	Yes

Let us visualize this data in order to see the distinction between converted and non-converted users as shown in Figure 5-2.

Figure 5-2
Conversion status vs. time spent

Using Linear Regression

Let’s try using linear regression instead of logistic regression to understand the reasons why logistic regression makes more sense in classification scenarios. In order to use linear regression, we will have to convert the target variable from categorical into numeric form. So let’s reassign the values for the Converted column:

Yes = 1

No = 0

Now, our data looks as given in Table 5-2.

Table 5-2

Regression Output

Sr. No	Time Spent (mins)	Converted
1	1	0
2	2	0
3	5	0
4	15	1
5	17	1
6	18	1

This process of converting a categorical variable to numerical is also critical, and we will go over this in detail in a later part of this chapter. For now, let’s plot these data points to visualize and understand it better (Figure 5-3).

Figure 5-3
Conversion status (1 and 0) vs. time spent

As we can observe, there are only two values in our target column (1 and 0), and every point lies on either of these two values. Now, let’s suppose we do linear regression on these data points and come up with a “best-fit line,” which is shown in Figure 5-4.

The regression equation for this line would be

$y = B_0 + B_1 \ast x$

$y_{(1,0)} = B_0 + B_1 \ast \text{Time Spent}$

All looks good so far in terms of coming up with a straight line to distinguish between 1 and 0 values. It seems like linear regression is also doing a good job of differentiating between converted and non-converted users, but there is a slight problem with this approach.

Suppose a new user spends 20 minutes on the website and we have to predict whether this user will convert, using the linear regression line. We use the preceding regression equation to predict the y value for 20 minutes of time spent.

We can simply calculate the value of y by either calculating

$y = B_0 + B_1 \ast (20)$

Or we can also simply draw a perpendicular line from the time spent axis on to the best-fit line to predict the value of y. Clearly, the predicted value of y, which is 1.7, seems way above 1 as shown in Figure 5-5. This approach doesn’t make any sense since we want to predict only between 0 and 1 values.

Figure 5-5
Predictions using a regression line
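To make the problem concrete, here is a minimal sketch that fits an ordinary least-squares line to the six points of Table 5-2 in plain Python and evaluates it at a time-spent value of 20. The helper name `fit_line` is our own. The exact prediction depends on the fitted line, so it need not match the value read off the figure; the point is only that it lands above 1.

```python
# Data from Table 5-2: time spent (minutes) and conversion status (1/0).
time_spent = [1, 2, 5, 15, 17, 18]
converted = [0, 0, 0, 1, 1, 1]

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = fit_line(time_spent, converted)
prediction_at_20 = b0 + b1 * 20   # the line is not bounded by 0 and 1
print(f"B0={b0:.3f}, B1={b1:.3f}, prediction at 20: {prediction_at_20:.2f}")
```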

So, if we use linear regression for classification cases, it creates a situation where the predicted output values can range from –infinity to +infinity. Hence, we need another approach that can tie these values between 0 and 1 only. The notion of values between 0 and 1 is not unfamiliar anymore as we have already seen probability. So, essentially, logistic regression comes up with a decision boundary between positive and negative classes that is associated with a probability value.

Using Logit

To convert the output value into a probability, we apply a nonlinear transformation to the linear equation so that the output always falls between 0 and 1. In logistic regression, that nonlinear function is the sigmoid (logistic) function, which is the inverse of the logit function and looks like this:

$\frac{1}{1+e^{-x}}$

And it always produces values between 0 and 1 independent of values of x.

So let’s go back to our earlier linear regression equation

$y = B_0 + B_1 \ast \text{Time Spent}$

We pass our output (y) through this nonlinear function (sigmoid) to change its values between 0 and 1:

Probability = $\frac{1}{1+e^{-y}}$

Probability = $\frac{1}{1+e^{-\left(B_0 + B_1 \ast \text{Time Spent}\right)}}$

Using the preceding equation, the predicted value gets limited between 0 and 1, and the output now looks as shown in Figure 5-6.

The advantage of using the nonlinear function is that, irrespective of the value of the input (time spent), the output is always a valid probability of conversion. This curve is also known as the logistic curve. Logistic regression still assumes a linear relationship, but between the input variables and the log-odds of the target rather than the target itself, and hence the optimal values of the intercept and coefficients are found to capture this relationship.
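The effect of the sigmoid can be seen in a short sketch. The coefficients `B0` and `B1` below are hypothetical values chosen only for illustration, not fitted ones. The linear output grows without limit, while the transformed value stays between 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic function; the two branches avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients, chosen only for illustration (not fitted values).
B0, B1 = -7.0, 0.75

for minutes in [1, 5, 10, 20]:
    y = B0 + B1 * minutes   # unbounded linear output
    p = sigmoid(y)          # mapped to a value between 0 and 1
    print(f"time spent={minutes:>2}  linear output={y:6.2f}  probability={p:.4f}")
```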

Interpretation (Coefficients)

The coefficients of the input variables are found using a technique known as gradient descent, which looks for optimizing the loss function in such a way that the total error is minimized. We can look at the logistic regression equation and understand the interpretation of coefficients:

$y = \frac{1}{1+e^{-\left(B_0 + B_1 \ast x\right)}}$

Let’s say after calculating for the data points in our example, we get the coefficient value of time spent as 0.75.

In order to understand what this 0.75 means, we have to take the exponential value of this coefficient:

$e^{0.75} = 2.12$

This 2.12 is known as the odds ratio, and it suggests that for every unit increase in time spent on the website, the odds of customer conversion increase by 112%.
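The arithmetic can be reproduced as follows. The sketch also checks that the ratio of the odds at two inputs one unit apart equals the exponential of the coefficient, whatever the intercept is. The intercept `b0` and the helper `odds` are assumptions made for this illustration.

```python
import math

coefficient = 0.75                     # coefficient of time spent, from the text
odds_ratio = math.exp(coefficient)     # multiplicative change in odds per unit
pct_increase = (odds_ratio - 1) * 100

def odds(x, b0, b1):
    """Odds p / (1 - p) under the logistic model, which equals exp(b0 + b1 * x)."""
    return math.exp(b0 + b1 * x)

# The intercept is a hypothetical value; the ratio does not depend on it.
b0 = -7.0
ratio_check = odds(11, b0, coefficient) / odds(10, b0, coefficient)

print(f"odds ratio = {odds_ratio:.2f}, increase in odds = {pct_increase:.0f}%")
```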

Dummy Variables

So far, we have only dealt with continuous/numerical variables, but the presence of categorical variables in a dataset is inevitable. So let’s understand how to use categorical values for modeling purposes. Since machine learning models only consume data in numerical format, we have to adopt some technique to convert the categorical data into a numerical form. We have already seen one example, where we converted our target class (Yes/No) into numerical values (1 or 0). This is known as label encoding: we assign a unique numerical value to each category present in that particular column. Another approach that works really well is known as dummification or one-hot encoding. Let’s understand this with the help of an example. Suppose our existing example data has one additional column that contains the search engine the user used. Our data then looks as shown in Table 5-3.

Table 5-3

Additional Data Column

Sr. No	Time Spent (mins)	Search Engine	Converted
1	5	Google	0
2	2	Bing	0
3	10	Yahoo	1
4	15	Bing	1
5	1	Yahoo	0
6	12	Google	1

So, to consume the additional information provided in the Search Engine column, we have to convert this into numerical format using dummification. As a result, we would get an additional number of dummy variables (columns), which would be equal to the number of distinct categories in the Search Engine column. The following steps explain the entire process of converting a categorical feature into numerical:

1. Find out the number of distinct categories in the categorical column. We have only three distinct categories as of now (Google, Bing, Yahoo).
2. Create a new column for each distinct category, and set its value to 1 when the corresponding search engine is used or else 0, as shown in Table 5-4.
3. Remove the original category column. The dataset now contains five columns in total (excluding the index) because we have three additional dummy variables, as shown in Table 5-5.

Table 5-4

Column Representation

Sr. No	Time Spent (mins)	Search Engine	SE_Google	SE_Bing	SE_Yahoo	Converted
1	1	Google	1	0	0	0
2	2	Bing	0	1	0	0
3	5	Yahoo	0	0	1	0
4	15	Bing	0	1	0	1
5	17	Yahoo	0	0	1	1
6	18	Google	1	0	0	1

Table 5-5

Refined Column Representation

Sr. No	Time Spent (mins)	SE_Google	SE_Bing	SE_Yahoo	Converted
1	1	1	0	0	0
2	2	0	1	0	0
3	5	0	0	1	0
4	15	0	1	0	1
5	17	0	0	1	1
6	18	1	0	0	1

The whole idea is to represent the same information in a different manner so that the machine learning model can learn from categorical values as well.
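As a rough illustration of the three steps, the following plain-Python sketch builds the dummy columns for the Search Engine values of Table 5-3. The helper name `one_hot` and the `SE_` prefix handling are our own; this is not how Spark does it internally.

```python
# Search Engine column from Table 5-3.
search_engine = ["Google", "Bing", "Yahoo", "Bing", "Yahoo", "Google"]

def one_hot(values, prefix="SE_"):
    """Return (column_names, rows) with one 0/1 dummy column per distinct category."""
    categories = list(dict.fromkeys(values))   # distinct values, first-seen order
    names = [prefix + c for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

names, rows = one_hot(search_engine)
print(names)
for value, row in zip(search_engine, rows):
    print(value, row)
```

Each row contains exactly one 1, in the column of the search engine that was used.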

Model Evaluation

To measure the performance of the logistic regression model, we can use multiple metrics. The most obvious one is accuracy: the percentage of correct predictions made by the model. However, accuracy is not always the preferred metric. To understand the performance of the logistic model in more detail, we should use a confusion matrix. It consists of the counts of predictions vs. actual values. A confusion matrix for a binary class looks like Table 5-6.

Table 5-6

Confusion Matrix

Actual/Prediction	Predicted Class (Yes)	Predicted Class (No)
Actual Class (Yes)	True Positives (TP)	False Negatives (FN)
Actual Class (No)	False Positives (FP)	True Negatives (TN)

Let us understand the individual values in the confusion matrix.

True Positives

These are the values that are of the positive class in actuality, and the model also correctly predicted them to be of the positive class.

Actual Class: Positive (1)
ML Model Prediction Class: Positive (1)

True Negatives

These are the values that are of negative class in actuality, and the model also correctly predicted them to be of the negative class.

Actual Class: Negative (0)
ML Model Prediction Class: Negative (0)

False Positives

These are the values that are of the negative class in actuality, but the model incorrectly predicted them to be of the positive class.

Actual Class: Negative (0)
ML Model Prediction Class: Positive (1)

False Negatives

These are the values that are of the positive class in actuality, but the model incorrectly predicted them to be of the negative class.

Actual Class: Positive (1)
ML Model Prediction Class: Negative (0)

Accuracy

Accuracy is the sum of true positives and true negatives divided by the total number of records:

$\frac{(TP + TN)}{\text{Total number of Records}}$

But as said earlier, it is not always the preferred metric because of target class imbalance. Most of the time, the target class frequency is skewed (many more negative examples than positive examples). For example, suppose a fraud detection dataset contains 99% genuine transactions and only 1% fraudulent ones. If our logistic regression model predicts every transaction as genuine and flags no fraud, it still ends up with 99% accuracy. The whole point is to measure the performance with regard to the positive class. Hence, there are a couple of other evaluation metrics that we can use.

Recall

Recall helps in evaluating the performance of the model from a positive class standpoint. It tells us the percentage of actual positive cases that the model predicts correctly out of the total number of positive cases:

$\frac{(TP)}{(TP + FN)}$

It talks about the quality of the machine learning model when it comes to predicting the positive class. So out of the total positive class, it tells how many the model was able to predict correctly. This metric is widely used as an evaluation criterion for classification models.

Precision

Precision is the number of actual positive cases out of all the cases the model predicted as positive:

$\frac{(TP)}{(TP + FP)}$

This can also be used as an evaluation criterion.

F1 Score

F1 Score = $2 \ast \frac{(\text{Precision} \ast \text{Recall})}{(\text{Precision} + \text{Recall})}$
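The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses made-up counts for a fraud-like dataset (1,000 transactions, 10 of them fraud) to show how a model with 99% accuracy can still have zero recall. The function name and the convention of reporting 0.0 for a zero denominator are assumptions of this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts.

    A ratio with a zero denominator is reported as 0.0 (a convention chosen here).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Hypothetical fraud data: 1,000 transactions, 10 of them fraud (positive class).
# A model that labels everything as genuine finds no fraud at all.
always_genuine = classification_metrics(tp=0, tn=990, fp=0, fn=10)

# A hypothetical model that catches 8 of the 10 frauds with 12 false alarms.
better_model = classification_metrics(tp=8, tn=978, fp=12, fn=2)

print(always_genuine)
print(better_model)
```

The second model has slightly lower accuracy than the first, yet it is clearly the more useful one for the positive class.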

Probability Cut-Off/Threshold

Since the output of the logistic regression model is a probability score, it is very important to decide the cut-off or threshold limit of probability for prediction. By default, the probability threshold is set at 50%. It means that if the probability output of the model is below 50%, the record is assigned the negative class (0), and if it is equal to or above 50%, it is assigned the positive class (1).

If the threshold is very low, the model will predict a lot of positive cases and have a high recall rate, usually at the cost of lower precision. Conversely, if the threshold is very high, the model will predict very few positive cases: it might miss out on positive cases, so the recall rate would be low, but precision would be higher. Deciding on a good threshold value is often challenging. A Receiver Operating Characteristic curve, or ROC curve, can help to decide which value of the threshold is best.
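The trade-off can be seen on a small set of scored records. The probabilities and labels below are made up for illustration, and `recall_precision` is a hypothetical helper: lowering the threshold raises recall, while raising it favors precision.

```python
# Hypothetical (probability, actual label) pairs, made up for illustration.
scored = [(0.95, 1), (0.85, 1), (0.70, 0), (0.60, 1), (0.45, 1),
          (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0), (0.05, 0)]

def recall_precision(scored, threshold):
    """Label a record positive when its probability is >= threshold."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

for threshold in (0.1, 0.5, 0.9):
    recall, precision = recall_precision(scored, threshold)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  precision={precision:.2f}")
```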

ROC Curve

The ROC curve is used to decide the threshold value for the model. It plots the true positive rate (recall, also known as sensitivity) against the false positive rate (1 - specificity) at different threshold values, as shown in Figure 5-7.

One would like to pick a threshold that offers a good balance between a high true positive rate and a low false positive rate. So, now that we have understood various components associated with logistic regression, we can go ahead and build a logistic regression model using PySpark.
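For readers who want to see where the points of an ROC curve come from, here is a minimal sketch that computes the false positive rate and true positive rate at each candidate threshold. The scores are made up, `roc_points` is a hypothetical helper, and the sketch assumes that both classes are present in the data.

```python
# Hypothetical (probability, actual label) pairs, made up for illustration.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]

def roc_points(scored):
    """Return (threshold, FPR, TPR) for each distinct score used as a cut-off."""
    positives = sum(1 for _, y in scored if y == 1)
    negatives = len(scored) - positives
    points = []
    for threshold in sorted({p for p, _ in scored}, reverse=True):
        tp = sum(1 for p, y in scored if p >= threshold and y == 1)
        fp = sum(1 for p, y in scored if p >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points

for threshold, fpr, tpr in roc_points(scored):
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting TPR against FPR for these points traces the curve; as the threshold drops, both rates can only stay the same or increase.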

Logistic Regression Code

This section of the chapter focuses on building a logistic regression model from scratch using PySpark and a Jupyter notebook.

Note

The complete dataset along with the code is available for reference on the GitHub repo of this book and executes best on Spark 3.1 or higher.

Let’s build a logistic regression model using Spark’s MLlib library and predict the target class label.

Data Info

The dataset that we are going to use is a sample dataset that contains a total of 20000 rows and 6 columns. This dataset contains information regarding online users of a retail sports merchandise company. The data captures the country of the user, platform used, age, repeat visitor or first-time visitor, and number of web pages viewed at the website. It also has the information if the customer ultimately bought the product or not (conversion status). We will make use of five input variables to predict the target class using a logistic regression model.

We start the Databricks notebook and import pyspark and SparkSession to create a new SparkSession object to use Spark:

[In]: import pyspark

[In]: from pyspark.sql import SparkSession

[In]: spark = SparkSession.builder.appName("LRwithPySpark").getOrCreate()

We can now read the dataset within Spark using the Dataframe. Since we’re using Databricks, we can mention the file location of the dataset:

[In]: file_location = "/FileStore/tables/data.csv"

[In]: file_type = "csv"

[In]: infer_schema = "true"

[In]: first_row_is_header = "true"

[In]: delimiter = ","

[In]: df = (spark.read.format(file_type)
          .option("inferSchema", infer_schema)
          .option("header", first_row_is_header)
          .option("sep", delimiter)
          .load(file_location))

[In]: display(df)

[Out]:

Now we look deeper into the dataset by viewing the dataset, validating the shape of the dataset, and discussing various statistical measures of the variables. We start with checking the shape of the dataset:

[In]: print((df.count(), len(df.columns)))

[Out]: (20000, 6)

The output confirms the size of our dataset. We can now validate the datatypes of the input values to check if we need to change/cast any attribute’s datatype:

[In]: df.printSchema()

As we can see, there are two columns (Country, Platform) that are categorical in nature and hence need to be converted into numerical form later. We can now use the describe function to go over statistical measures of the dataset:

[In]: df.describe().show()

[Out]:

As we can observe, the average age of visitors is close to 28 years, and they view around nine web pages during the website visit. Let us explore individual columns to understand data in more detail. The groupBy function used along with count returns us the frequency of each of the categories in the data. This is similar to value_counts in Pandas:

[In]: df.groupBy('Country').count().show()

[Out]:

So the maximum number of visitors are from Indonesia followed by India.

[In]: df.groupBy('Platform').count().show()

[Out]:

The Yahoo platform users are the highest in number compared with the other platforms.

[In]: df.groupBy('Status').count().show()

[Out]:

+------+-----+

|Status|count|

+------+-----+

| 1|10000|

| 0|10000|

+------+-----+

We seem to have a balanced target class in this dataset as there are an equal number of users who have converted and not converted. Let’s use the groupBy function along with mean to know more about the dataset:

[In]: df.groupBy('Country').mean().show()

[Out]:

We have the highest conversion rate from Malaysia followed by India. The average number of web page visits is highest in Malaysia and lowest in Brazil.

[In]: df.groupBy('Platform').mean().show()

[Out]:

We have the highest conversion rate from users of the Google platform followed by Yahoo.

[In]: df.groupBy('Status').mean().show()

[Out]:

We can see there is a strong connection between the conversion status and number of pages viewed along with repeat visits.

Now we move on to convert the categorical variables into numerical form using an encoder. We then use VectorAssembler to create a single vector combining all input features:

[In]: from pyspark.ml.feature import StringIndexer

[In]: from pyspark.ml.feature import VectorAssembler

Since we are dealing with two categorical columns, we will have to convert the Country and Platform columns into numerical form.

The first step is to convert each column into numerical form using StringIndexer. It allocates a unique value to each category of the column. So, in the following example, the three values of Platform (Yahoo, Google, Bing) are assigned values 0.0, 1.0, and 2.0. This is visible in the column named Platform_Index. Similarly, for the Country column, we can observe the assigned values in the Country_Index column. The values are assigned based on the value counts for each category, with the most frequent category getting 0.0. So, in the case of the Country column, Indonesia has the highest occurrences and hence gets assigned 0.0, followed by India with 1.0:

[In]: si_country = StringIndexer(inputCol='Country', outputCol='Country_Index')

[In]: df = si_country.fit(df).transform(df)

[In]: si_platform = StringIndexer(inputCol='Platform', outputCol='Platform_Index')

[In]: df = si_platform.fit(df).transform(df)

[In]: df.show(10)
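The frequency-based assignment described above can be imitated in plain Python. This is only a sketch of the idea, not Spark's implementation: `frequency_index` is a hypothetical helper, the sample values are made up, and the alphabetical tie-break is an assumption of this example.

```python
from collections import Counter

def frequency_index(values):
    """Map each category to 0.0, 1.0, ... in order of decreasing frequency.

    Ties are broken alphabetically here; that tie-break is an assumption of
    this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: float(i) for i, category in enumerate(ordered)}

# Hypothetical sample of the Platform column, made up for illustration.
platform = ["Yahoo", "Google", "Yahoo", "Bing", "Google", "Yahoo"]
mapping = frequency_index(platform)
print(mapping)
print([mapping[p] for p in platform])
```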

The next step is to represent each of these values in one form with a one-hot encoded vector. This new vector looks a little different compared with the traditional one-hot encoder in Pandas in terms of representation as it captures the values and positions of the values in the vector:

[In]: from pyspark.ml.feature import OneHotEncoder

[In]: encoder = OneHotEncoder(inputCols=['Country_Index', 'Platform_Index'],outputCols=['Country_vec', 'Platform_vec'])

[In]: df = encoder.fit(df).transform(df)

[In]: df.show(10)

[Out]:

[In]: df.groupBy('Country_vec').count().orderBy('count', ascending=False).show(5, False)

[Out]:

As we can observe, the count values are the same as those for each category in the Country column before one-hot encoding. Let’s interpret the new one-hot encoded vector to understand the components better.

(3,[0],[1.0]) represents a vector of length 3 that contains a single value of 1.0:

Size of vector: 3

Value contained in vector: 1.0

Position of 1.0 value in vector: 0th place
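To see what this compact notation stands for, the following sketch expands a (size, positions, values) triple into an ordinary list. The helper `sparse_to_dense` is our own, written only for illustration. The all-zero vector in the last call is how the one category without its own position would appear, assuming the encoder's default behavior of dropping the last category.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a plain list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

print(sparse_to_dense(3, [0], [1.0]))   # 1.0 at position 0
print(sparse_to_dense(3, [2], [1.0]))   # 1.0 at position 2
print(sparse_to_dense(3, [], []))       # no position set: all zeros
```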

This kind of representation saves memory and hence is faster to compute. The length of the vector is one less than the number of distinct categories, since the four countries can be represented with just three columns (the last category is encoded as all zeros):

[In]: df.groupBy('Platform_vec').count().orderBy('count', ascending=False).show(5, False)

[Out]:

In the case of the Platform vector, we need a vector of size 2 only, as the total number of unique values in the Platform column is just three. Now that we have converted both the categorical columns into numerical form, we need to assemble all of the input columns into one single vector that acts as the input feature for the model. So we select the input columns that we need to create the single feature vector and name the output vector features:

[In]: from pyspark.ml.feature import VectorAssembler

[In]: df_assembler = VectorAssembler(inputCols=[ 'Age', 'Repeat_Visitor','Web_pages_viewed','Country_vec','Platform_vec'], outputCol="features")

[In]: df = df_assembler.transform(df)

[In]: df.show()

[Out]:

As we can see, we now have one extra column named features, which is nothing but a combination of all the input features represented as a single vector.

[In]: df.select(['features','Status']).show(10,False)

[Out]:

Let us select only the features column as input and the Status column as output for training the logistic regression model:

[In]: model_df=df.select(['features','Status'])

We must split the Dataframe into train and test sets in order to train and evaluate the performance of the logistic regression model. We split it in 75/25 ratio and train our model on 75% of the dataset. We can print the shape of train and test data to validate the size:

[In]: train_df,test_df=model_df.randomSplit([0.75,0.25])

[In]: print(train_df.count())

[Out]: (14972)

[In]: print(test_df.count())

[Out]: (5028)

Let us also validate that the target class is balanced in the train and test sets. If it is not, we may need some mechanism to restore the class balance in order to improve prediction accuracy:

[In]: train_df.groupBy('Status').count().show()

[Out]:

+------+-----+

|Status|count|

+------+-----+

| 1| 7502|

| 0| 7470|

+------+-----+

[In]: test_df.groupBy('Status').count().show()

[Out]:

+------+-----+

|Status|count|

+------+-----+

| 1| 2498|

| 0| 2530|

+------+-----+

As we can observe, the class balance is well maintained in both train and test sets. We can now go ahead and build the logistic regression model using features as input and Status as output on the train data:

[In]: from pyspark.ml.classification import LogisticRegression

[In]: log_reg=LogisticRegression(labelCol='Status').fit(train_df)

[In]: train_results=log_reg.evaluate(train_df).predictions

[In]: train_results.printSchema()

[Out]:

We can access the predictions made by the model using the evaluate function in Spark, which executes all the steps in an optimized way. That results in another Dataframe that contains the original columns along with the model’s output columns, including prediction and probability. The prediction column signifies the class label the model has predicted for the given row, and the probability column contains two associated probabilities (probability for the negative class at the 0th index and probability for the positive class at the 1st index):

[In]: train_results.filter(train_results['Status']==1).filter(train_results['prediction']==1).select(['Status','prediction','probability']).show(10,False)

[Out]:

So, in the preceding results, probability at the 0th index is for Status with value 0, and probability at the 1st index is for Status value of 1. As we can see, in some cases the model is very confident of the target class and predicts status 1 with almost 99% probability. The next part is to check the performance of the model on unseen or test data. We again make use of the evaluate function to make predictions on test data:

[In]: results = log_reg.evaluate(test_df).predictions

[In]: results.select(['Status','prediction']).show(10,False)

[Out]:

+------+----------+

|Status|prediction|

+------+----------+

|0 |0.0 |

|1 |0.0 |

|0 |0.0 |

|1 |1.0 |

|0 |1.0 |

|1 |1.0 |

+------+----------+

As we can observe, there are some cases where the model misclassifies the target class, whereas in the majority of instances the model does a fine job of predicting both classes accurately.

Confusion Matrix

We will manually create the variables for true positives, true negatives, false positives, and false negatives to calculate the performance metrics of the preceding model on test data:

[In]: true_postives = results[(results.Status == 1) & (results.prediction == 1)].count()

[In]: true_negatives = results[(results.Status == 0) & (results.prediction == 0)].count()

[In]: false_positives = results[(results.Status == 0) & (results.prediction == 1)].count()

[In]: false_negatives = results[(results.Status == 1) & (results.prediction == 0)].count()