Summary

In this chapter, we continued working on the online advertising click-through prediction project. This time, we overcame the categorical feature challenge by means of the one-hot encoding technique. We then turned to a new classification algorithm, logistic regression, for its high scalability to large datasets. The in-depth discussion of the logistic regression algorithm started with an introduction to the logistic function, which led to the mechanics of the algorithm itself. This was followed by how to train a logistic regression model using gradient descent.

After implementing a logistic regression classifier by hand and testing it on our click-through dataset, we learned how to train the model in a more scalable manner, using stochastic gradient descent, and adjusted our algorithm accordingly. We also practiced using the SGD-based logistic regression classifier from scikit-learn and applied it to our project. We then tackled problems we might face when using logistic regression, including L1 and L2 regularization to reduce overfitting, online learning techniques for training on large-scale datasets, and handling multiclass scenarios. We also learned how to implement logistic regression with TensorFlow. Finally, the chapter ended with applying the random forest model to feature selection, as an alternative to L1-regularized logistic regression.
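The core ideas recapped above, the logistic function and weight updates via stochastic gradient descent on one-hot encoded features, can be condensed into a minimal NumPy sketch. This is only an illustration with hypothetical toy data and an assumed learning rate, not the full implementation developed in the chapter:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real value into (0, 1)
    return 1 / (1 + np.exp(-z))

def train_sgd(X, y, epochs=20, lr=0.1, seed=0):
    # Stochastic gradient descent: visit samples one at a time,
    # in a shuffled order each epoch, and update the weights using
    # the gradient of the log loss for that single sample.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            pred = sigmoid(X[i] @ w)
            w -= lr * (pred - y[i]) * X[i]
    return w

# Hypothetical toy data: a bias column followed by a one-hot
# encoded categorical feature with two possible values
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

w = train_sgd(X, y)
probs = sigmoid(X @ w)  # predicted click probabilities
```

On this toy data the model quickly learns that the first category is associated with clicks, so the first two predicted probabilities end up above 0.5 and the last two below it.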

You might be curious as to how we can efficiently train the model on the entire dataset of 40 million samples. In the next chapter, we will use tools such as Spark and the PySpark module to scale up our solution.
