Scikit-learn is one of the most important and most widely used packages for machine learning with Python. It provides functions for many predictive algorithms. In this chapter, we examine some of the algorithms included in the Scikit-learn package. Given the breadth of the subject, the examples presented reflect the most commonly used models. Readers with no prior knowledge of machine learning may find some of the techniques presented in this chapter difficult to follow, because they are not explained in detail.
What Is Machine Learning?
Machine learning is a branch of data analysis that transforms datasets built in a particular way into predictions that can be applied to new data. Machine learning uses data we already have to predict future behaviors. Machine-learning techniques have been a real revolution in data mining and they have a great impact on a variety of fields of application.
Machine learning is widespread among many applications used every day. Large companies such as Amazon, Netflix, Google, Apple, and Facebook use machine-learning algorithms for various reasons. For instance, Facebook uses machine learning to recognize faces in images; Amazon and Netflix analyze customer preferences (the last thing you viewed or bought) to propose new products that might match your interests.
Google, for example, uses machine learning in translation and self-driving cars, and to suggest the route with the least traffic based on our habits or on the places we usually go on a given day of the week. Machine learning also helps us distinguish spam from non-spam messages, often using probabilistic methods or by combining multiple methods (for instance, adding probabilistic methods to keywords and user-defined rules).
Apple and Microsoft use machine learning to provide us with a voice assistant that helps us work with a phone or tablet using only our voice. Other companies are currently refining artificial intelligence methods, automated driving, and more.
The field of machine learning gained wider attention when a supercomputer (Watson), developed by IBM, took part in the Jeopardy! quiz program.
Machine learning has also been used to predict election results, notably by Nate Silver, a statistician who, in October 2012, published a forecast of the US elections whose results were very close to the actual outcome.
Predictive data mining is used in the healthcare field. Patient data and clinical records can help identify people who are at greater risk of developing certain conditions and illnesses, such as diabetes or heart disease. DNA analysis and genetic kits have been used, for example, to detect genes responsible for or otherwise related to certain types of cancer, including breast cancer.
One of the oldest uses of machine learning is handwriting recognition, in particular of handwritten addresses and zip codes. Recognition is learned from many examples of each handwritten digit using neural networks (research conducted at Bell Labs).
Research in machine learning and related topics, such as deep learning and artificial intelligence, advances every day and is at the forefront of the computing world. Some web sites, such as Kaggle, regularly publish contests in which subscribers try to solve a given problem. One of the most famous contests of this kind was announced by Netflix in 2006, with a $1,000,000 USD prize for a recommendation system; the prize was awarded in 2009. The winning system was never fully implemented by Netflix because it was too complex and computationally expensive.
Let’s look at the various modules and techniques in the Scikit-learn package.
Import Datasets Included in Scikit-learn
The iris dataset is made up of measurements of the petals and sepals of three different species of iris: versicolor, virginica, and setosa. It contains 150 cases, divided equally among the three species (50 each), and five variables: four measurements plus the species label.
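As a minimal sketch of how the dataset can be loaded, the following uses `load_iris` from `sklearn.datasets` and wraps the result in a pandas DataFrame. The variable names (`iris_df`, `species`) are illustrative choices, not names taken from the chapter.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris dataset (no download is needed).
iris = load_iris()
X = iris.data      # 150 rows, 4 numeric measurements
y = iris.target    # integer codes 0, 1, 2 for the three species

# Build a DataFrame with the four measurements plus a species column.
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(y, iris.target_names)

print(iris_df.shape)
print(iris_df["species"].value_counts())
```

The DataFrame has four measurement columns and one label column, matching the five variables described above.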
Creation of Training and Testing Datasets
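One common way to create training and testing datasets is `train_test_split` from `sklearn.model_selection`. The sketch below assumes the iris data and a 70/30 split; the split ratio, the `random_state` value, and the use of `stratify` are illustrative assumptions rather than settings prescribed by the chapter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing. stratify=y keeps the three
# species in the same proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same held-out data.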
Preprocessing
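A typical preprocessing step is standardization, which rescales each feature to zero mean and unit variance. The following is a minimal sketch using `StandardScaler`; the choice of scaler and of the iris data is an assumption made for illustration. The scaler is fitted on the training data only, so that no information from the test set leaks into the transformation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Learn the mean and standard deviation from the training set only,
# then apply the same transformation to both subsets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_train_scaled.std(axis=0), 3))
```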
Regression
Regression analysis is used to explain the relationship between a variable, y, called the response variable or dependent variable, and one or more independent variables.
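As an illustration, the sketch below fits a simple linear regression with `LinearRegression`, using petal length as the independent variable and petal width as the response. The choice of these two iris columns is an assumption made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris = load_iris()
petal_length = iris.data[:, [2]]  # independent variable, kept 2-D for scikit-learn
petal_width = iris.data[:, 3]     # response (dependent) variable

model = LinearRegression()
model.fit(petal_length, petal_width)

# Slope, intercept, and coefficient of determination on the fitted data.
r2 = model.score(petal_length, petal_width)
print(model.coef_[0], model.intercept_, r2)
```

Here `score` returns R squared, the share of the variance in the response that the fitted line explains.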
K-Nearest Neighbors
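K-nearest neighbors classifies a new observation by a majority vote among the k closest training observations. A minimal sketch with `KNeighborsClassifier` follows; the value `n_neighbors=5` and the iris split are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Each test flower is assigned the most common species among
# its 5 nearest training flowers.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(accuracy)
```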
Cross-validation
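Cross-validation estimates how well a model generalizes by repeatedly training on part of the data and evaluating on the remainder. The sketch below uses `cross_val_score` with five folds; the classifier and the number of folds are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Split the data into 5 folds; each fold serves once as the test set
# while the other 4 are used for training.
scores = cross_val_score(knn, X, y, cv=5)

print(scores)
print(scores.mean())
```

Averaging the fold scores gives a more stable estimate than a single train/test split.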
Support Vector Machine
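A support vector machine classifier can be sketched with `SVC`. Because SVMs are sensitive to feature scale, this example chains a `StandardScaler` and the classifier in a pipeline. The RBF kernel and `C=1.0` are scikit-learn defaults written out explicitly, and the iris split is an illustrative assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training data and applies the
# same scaling automatically whenever predictions are requested.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)

accuracy = svm.score(X_test, y_test)
print(accuracy)
```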
Decision Trees
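A decision tree learns a sequence of threshold rules on the features. The following sketch fits a `DecisionTreeClassifier` and prints its rules with `export_text`; the depth limit of 3 and the iris split are assumptions made to keep the printed tree small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Limit the depth so the learned rules stay readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
rules = export_text(tree, feature_names=list(iris.feature_names))
print(accuracy)
print(rules)
```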
KMeans
KMeans is an unsupervised clustering method, which means we do not have labels to guide us when grouping the data. For this reason, clustering is useful as an exploratory analysis method: it allows us to group elements of a dataset based on how similar or dissimilar they are.
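As a minimal sketch, the example below clusters the iris measurements into three groups with `KMeans`, without using the species labels during fitting. The number of clusters, `n_init`, and `random_state` are illustrative assumptions. The species are used only afterward, to compare the clusters with the known groups.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# Fit on the measurements only; the species labels are not used here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(iris.data)

# Cross-tabulate known species against the discovered clusters.
comparison = pd.crosstab(
    iris.target_names[iris.target], cluster_labels,
    rownames=["species"], colnames=["cluster"]
)
print(comparison)
```

Cluster numbers are arbitrary, so cluster 0 does not necessarily correspond to the first species.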
This was just a cursory discussion of machine learning using the Scikit-learn package. Machine learning is a challenging topic and therefore not easy to sum up in a few pages. I thought it would be helpful to expose you to some predictive data mining concepts and to various Scikit-learn modules that can be used for machine learning.
Managing Dates
A reference for the format codes used to parse dates is available at http://strftime.org/.
Dates that are not recognized are identified as NaT.
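The behavior described above can be sketched with `pd.to_datetime`. The sample strings are made up for illustration, and the sketch assumes `errors="coerce"`, which is the setting that turns unparseable values into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical input: two valid dates and one string that is not a date.
raw = pd.Series(["2017-03-15", "2017-04-01", "not a date"])

# format uses strftime-style codes; errors="coerce" converts
# values that cannot be parsed into NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna())
```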
Data Sources
As we have seen, the Scikit-learn package also includes datasets that can be imported. For more information on Scikit-learn and the featured datasets, you can browse the package documentation at http://scikit-learn.org/stable/datasets/.
A companion package to pandas, called pandas-datareader, features tools to extract data from some online sources (https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#google-finance), particularly those dealing with stock exchange repositories, such as Yahoo! Finance and Google Finance.