15.8 Wrap-Up

In this chapter we began our study of machine learning, using the popular scikit-learn library. We saw that machine learning is divided into two types: supervised machine learning, which works with labeled data, and unsupervised machine learning, which works with unlabeled data. Throughout this chapter, we continued emphasizing visualizations using Matplotlib and Seaborn, particularly for getting to know your data.

We discussed how scikit-learn conveniently packages machine-learning algorithms as estimators. Each is encapsulated so you can create your models quickly with a small amount of code, even if you don’t know the intricate details of how these algorithms work.
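
For example, the following minimal sketch shows the interface that scikit-learn estimators share, here using a LinearRegression estimator on a tiny made-up dataset: you create the estimator, call its fit method to train it, then call its predict method to make predictions:

```python
from sklearn.linear_model import LinearRegression

# tiny made-up dataset in which y is always 2 * x;
# scikit-learn expects a two-dimensional array of feature values
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

estimator = LinearRegression()   # create the estimator
estimator.fit(X, y)              # train it
print(estimator.predict([[5]]))  # predict for a new sample; displays [10.]
```

Every estimator follows this same create/fit/predict pattern, which is what makes it so easy to swap one model for another.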

We looked at supervised machine learning with classification, then regression. We used one of the simplest classification algorithms, k-nearest neighbors (k-NN), to analyze the Digits dataset bundled with scikit-learn. You saw that classification algorithms predict the classes to which samples belong. Binary classification uses two classes (such as “spam” or “not spam”), and multiclass classification uses more than two classes (such as the 10 classes in the Digits dataset).
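
As a minimal sketch of multiclass classification (training on the full dataset purely for brevity; the complete workflow below splits the data first), a KNeighborsClassifier can predict the Digits dataset's classes:

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()               # 1,797 8-by-8 digit images, 10 classes
knn = KNeighborsClassifier()         # k-NN with the default k (5)
knn.fit(digits.data, digits.target)  # train on the full dataset for brevity
print(knn.predict(digits.data[:5]))  # predicted classes for five samples
print(digits.target[:5])             # the expected classes
```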

We performed the steps of a typical machine-learning case study, including loading the dataset, exploring the data with pandas and visualizations, splitting the data for training and testing, creating the model, training the model and making predictions. We discussed why you should partition your data into a training set and a testing set. You saw ways to evaluate a classification estimator’s accuracy via a confusion matrix and a classification report.
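
Those steps might look like the following sketch, which assumes an arbitrary random_state seed for a reproducible train/test split:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

digits = load_digits()

# partition the data: by default, 75% for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)        # train on the training set only
predicted = knn.predict(X_test)  # predict classes for the unseen test set

print(f'accuracy: {knn.score(X_test, y_test):.2%}')
print(confusion_matrix(y_true=y_test, y_pred=predicted))
print(classification_report(y_test, predicted))
```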

We mentioned that it’s difficult to know in advance which model(s) will perform best on your data, so you typically try many models and pick the one that performs best. We showed that it’s easy to run multiple estimators. We also used hyperparameter tuning with k-fold cross-validation to choose the best value of k for the k-NN algorithm.
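
A sketch of that tuning loop, assuming 10 folds and an arbitrary random_state seed:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
kfold = KFold(n_splits=10, shuffle=True, random_state=11)

# try several k values for k-NN and pick the one with the best mean accuracy
for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, digits.data, digits.target, cv=kfold)
    print(f'k={k}: mean accuracy {scores.mean():.2%}')
```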

We revisited the time series and simple linear regression example from Chapter 10’s Intro to Data Science section, this time implementing it using a scikit-learn LinearRegression estimator. Next, we used a LinearRegression estimator to perform multiple linear regression with the California Housing dataset that’s bundled with scikit-learn. You saw that the LinearRegression estimator, by default, uses all the numerical features in a dataset to make more sophisticated predictions than you can with simple linear regression. Again, we ran multiple scikit-learn estimators to compare how they performed and choose the best one.
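
A condensed sketch of the multiple linear regression case study (fetch_california_housing downloads the dataset the first time you call it; random_state is again an arbitrary seed):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

california = fetch_california_housing()  # 20,640 samples, 8 numerical features
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target, random_state=11)

lr = LinearRegression()
lr.fit(X_train, y_train)  # fits one coefficient per feature, plus an intercept
print(f'R^2 on the test set: {lr.score(X_test, y_test):.3f}')
```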

Next, we introduced unsupervised machine learning and mentioned that it’s typically accomplished with clustering algorithms. We also introduced dimensionality reduction (with scikit-learn’s TSNE estimator) and used it to compress the Digits dataset’s 64 features down to two for visualization purposes. This enabled us to see the clustering of the digits data.
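
A minimal sketch of that reduction (random_state is an arbitrary seed, and t-SNE can take a while on all 1,797 samples):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
tsne = TSNE(n_components=2, random_state=11)  # reduce 64 features to 2
reduced = tsne.fit_transform(digits.data)     # shape (1797, 2)

# color each point by its known digit class to reveal the clusters
plt.scatter(reduced[:, 0], reduced[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar()
plt.show()
```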

We presented one of the simplest unsupervised machine learning algorithms, k-means clustering, and demonstrated clustering on the Iris dataset that’s also bundled with scikit-learn. We used dimensionality reduction (with scikit-learn’s PCA estimator) to compress the Iris dataset’s four features to two for visualization purposes, showing the clustering of the three Iris species in the dataset and their centroids. Finally, we ran multiple clustering estimators to compare how well they divided the Iris dataset’s samples into three clusters.
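
A sketch of those steps together (random_state is once more an arbitrary seed):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
kmeans = KMeans(n_clusters=3, random_state=11)  # one cluster per species
kmeans.fit(iris.data)

pca = PCA(n_components=2, random_state=11)  # 4 features -> 2 for plotting
reduced = pca.fit_transform(iris.data)
centers = pca.transform(kmeans.cluster_centers_)  # centroids in 2D

plt.scatter(reduced[:, 0], reduced[:, 1], c=kmeans.labels_, s=20)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=100)
plt.show()
```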

In the next chapter, we’ll continue our study of machine learning technologies with discussions of deep learning and reinforcement learning. We’ll tackle some fascinating and challenging problems.
