Chapter 1. Introduction to Practical Machine Learning Using Python

In the technology industry, the skill of analyzing and mining commercial data is becoming more and more important. Every company with an online presence generates data that can be exploited to improve its business or be sold to other companies. This huge amount of commercially useful information needs to be restructured and analyzed using the expertise of data science (or data mining) professionals. Data science employs techniques known as machine-learning algorithms to transform the data into models that are able to predict the behavior of entities of interest to the business. This book is about these algorithms and techniques, which are so crucial in today's technology business world, and about how to deploy them efficiently in a real commercial environment. You will learn the most relevant machine-learning techniques and will have the chance to apply them in a series of exercises and applications designed to enhance commercial awareness, so that you can use the skills learned in this book in your professional work. To fully absorb the topics discussed in this book, you are expected to already be familiar with the Python programming language, linear algebra, and statistical methods.

  • There are many tutorials and classes available online on these subjects, but we recommend you read the official Python documentation (https://docs.python.org/), the books Elementary Statistics by A. Bluman and Statistical Inference by G. Casella and R. L. Berger to understand the main statistical concepts and methods, and Linear Algebra and Its Applications by G. Strang to learn about linear algebra.

The purpose of this introductory chapter is to familiarize you with the more advanced libraries and tools used by machine-learning professionals in Python, such as NumPy, pandas, and matplotlib, which will help you to grasp the necessary technical knowledge to implement the techniques presented in the following chapters. Before continuing with the tutorials and description of the libraries used in this book, we would like to clarify the main concepts of the machine-learning field, and give a practical example of how a machine-learning algorithm can predict useful information in a real context.

General machine-learning concepts

In this book, the most relevant machine-learning algorithms are discussed and used in exercises to make you familiar with them. To explain these algorithms and to follow the content of this book, we first need to introduce a few general concepts, which are described in this section.

First of all, a good definition of machine learning is the subfield of computer science that has developed from the fields of pattern recognition, artificial intelligence, and computational learning theory. Machine learning can also be seen as a data-mining tool, which focuses more on the data analysis aspects to understand the data provided. The purpose of this discipline is the development of programs that are able to learn from previously seen data through tunable parameters (usually arrays of double-precision values), which are designed to be adjusted automatically to improve the resulting predictions. In this way, computers can predict a behavior by generalizing the underlying structure of the data, instead of just storing (or retrieving) the values as usual database systems do. For this reason, machine learning is associated with computational statistics, which also attempts to predict a behavior based on previous data. Common industrial applications of machine-learning algorithms are spam filtering, search engines, optical character recognition, and computer vision. Now that we have defined the discipline, we can describe the terminology used in each machine-learning problem in more detail.

Any learning problem starts with a dataset of N samples, which are used to predict the properties of future unknown data. Each sample is typically composed of more than a single value, so it is a vector. The components of this vector are called features. For example, imagine predicting the price of a second-hand car based on its characteristics: year of manufacture, color, engine size, and so on. Each car i in the dataset will be a vector of features x(i) that corresponds to its color, engine size, and many others. In this case, there is also a target (or label) variable y(i) associated with each car i, which is the second-hand car price. A training example is formed by a pair (x(i), y(i)), and therefore the complete set of N data points used to learn is called a training dataset {(x(i), y(i)); i=1,…,N}. The symbol x will denote the space of feature (input) values, and y the space of target (output) values. The machine-learning algorithm chosen to solve the problem will be described by a mathematical model, with some parameters to tune on the training set. After the training phase is completed, the performance of the prediction is evaluated using two further sets: the validation and testing sets. The validation set is used to choose, among multiple models, the one that returns the best results, while the testing set is usually used to determine the actual performance of the chosen model. Typically, the dataset is divided into 50% training set, 25% validation set, and 25% testing set.
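The 50/25/25 division described above can be sketched with NumPy as follows. This is only a minimal illustration: the toy car data, the variable names, and the fixed random seed are assumptions made for the example and are not taken from the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Hypothetical toy dataset: N = 8 cars with 2 features each
# (year, engine size) and the price as target. Values are made up.
X = np.array([[2010, 1.6], [2012, 2.0], [2008, 1.2], [2015, 1.4],
              [2011, 1.8], [2013, 2.2], [2009, 1.0], [2014, 1.6]])
y = np.array([5500., 8200., 3100., 9900., 6400., 9100., 2800., 8700.])

N = len(X)
order = rng.permutation(N)      # shuffled row indices
n_train = int(0.50 * N)         # 50% training
n_val = int(0.25 * N)           # 25% validation, the rest is testing

train_idx = order[:n_train]
val_idx = order[n_train:n_train + n_val]
test_idx = order[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_val), len(X_test))  # 4 2 2
```

Shuffling the indices rather than the arrays keeps each feature vector x(i) paired with its target y(i) in all three sets.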

The learning problems can be divided into two main categories (both of which are extensively covered in this book):

  • Unsupervised learning: The training dataset is given by input feature vectors x without any corresponding label values. The usual objective is to find similar examples within the data using clustering algorithms, or to project the data from a high-dimensional space down to a few dimensions (blind signal separation algorithms such as principal component analysis). Since there is usually no target value for each training example, it is not possible to evaluate the errors of the model directly from the data; you need to use a technique that evaluates how similar the elements within each cluster are to each other and how different they are from the members of the other clusters. This is one of the major differences between unsupervised learning and supervised learning.
  • Supervised learning: Each data sample is given in a pair consisting of an input feature vector and a label value. The task is to infer the parameters to predict the target values of the test data. These types of problems can be further divided into:
    • Classification: The data targets belong to two or more classes, and the goal is to learn from the training set how to predict the class of unlabeled data. Classification is a discrete (as opposed to continuous) form of supervised learning, where the label has a limited number of categories. A practical example of a classification problem is handwritten digit recognition, in which the objective is to match each feature vector to one of a finite number of discrete categories.
    • Regression: The label is a continuous variable. For example, the prediction of the height of a child based on his age and weight is a regression problem.
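To make the distinction between the two supervised tasks concrete, here is a minimal NumPy sketch. All numbers are made up for illustration, and the two methods used (ordinary least squares and a nearest-centroid rule) are simple stand-ins chosen for brevity, not the algorithms presented later in the book.

```python
import numpy as np

# Made-up data: each row is a child described by (age in years, weight in kg).
X = np.array([[4., 16.], [6., 21.], [8., 26.], [10., 32.], [12., 40.]])
new_child = np.array([9., 29.])

# Regression: the label is continuous (height in cm, hypothetical values).
height = np.array([102., 115., 128., 138., 149.])
A = np.hstack([X, np.ones((len(X), 1))])        # append an intercept column
w, *_ = np.linalg.lstsq(A, height, rcond=None)  # least-squares parameters
predicted_height = np.append(new_child, 1.0) @ w
print('predicted height:', predicted_height)    # a real number

# Classification: the label is one of a few discrete values (here 0 or 1,
# two hypothetical classes).
group = np.array([0, 0, 0, 1, 1])
centroids = np.array([X[group == k].mean(axis=0) for k in (0, 1)])
distances = np.linalg.norm(centroids - new_child, axis=1)
predicted_group = int(np.argmin(distances))     # class of the closest centroid
print('predicted class:', predicted_group)      # either 0 or 1
```

In both cases the parameters (the weights w, or the class centroids) are computed from the labeled training pairs and then used on a sample that was not in the training data.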

We are going to focus on unsupervised learning methods in Chapter 2, Machine Learning Techniques: Unsupervised Learning, while the most relevant supervised learning algorithms are discussed in Chapter 3, Supervised Machine Learning. Chapter 4, Web Mining Techniques will approach the field of web-mining techniques that can also be considered as both supervised and unsupervised methods. The recommendation systems, which are again part of the supervised learning category, are described in Chapter 5, Recommendation Systems. The Django web framework is then introduced in Chapter 6, Getting Started with Django, and then an example of the recommendation system (using both the Django framework and the algorithms explained in Chapter 5, Recommendation Systems) is detailed in Chapter 7, Movie Recommendation System Web Application. We finish the book with an example of a Django web-mining application, using some of the techniques learned in Chapter 4, Web Mining Techniques. By the end of the book you should be able to understand the different machine-learning methods and be able to deploy them in a real working web application using Django.

We continue the chapter with an example of how machine learning can be used in real business problems, followed by tutorials on the Python libraries (NumPy, pandas, and matplotlib) that are essential for putting the algorithms learned in each of the following chapters into practice.

Machine-learning example

To explain further what machine learning can do with real data, we consider the following example (the code is available in the author's GitHub book folder https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_1/). We have taken the Internet Advertisements Data Set from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu). Web advertisements have been collected from various web pages, and each of them has been transformed into a numeric feature vector. From the ad.names file we can see that the first three features represent the image size in the page, and the other features are related to the presence of specific words or phrases in the URL of the image or in the text (1558 features in total). The label values are either ad or nonad, depending on whether the page has an advert or not. As an example, a web page in ad.data is given by:

125, 125, ...., 1. 0, 1, 0, ad.

Based on this data, a classical machine-learning task is to find a model to predict which pages are adverts and which are not (classification). To start with, we consider the data file ad.data, which contains the full feature vectors and labels but also has missing values, indicated with a ?. We can use the pandas Python library to transform each ? to -1 (see the next section for a full tutorial on the pandas library):

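import numpy as np
import pandas as pd
# Load the raw file, which has no header row
df = pd.read_csv('ad-dataset/ad.data',header=None)
# Missing values appear as '?' padded with a varying number of spaces
df=df.replace({'?': np.nan})
df=df.replace({'  ?': np.nan})
df=df.replace({'   ?': np.nan})
df=df.replace({'    ?': np.nan})
df=df.replace({'     ?': np.nan})
# Substitute -1 for every missing value
df=df.fillna(-1)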

A DataFrame is created with the data from the ad.data file, and each ? is first replaced with the NaN value (the replace function), then with -1 (the fillna function). Now each label has to be transformed into a numerical value (and so do all the other values in the data):
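The same cleaning step can be written more compactly. The sketch below is an alternative technique and not the author's code: it relies on pd.to_numeric with errors='coerce', which turns any cell that cannot be parsed as a number into NaN regardless of how many spaces precede the ?. The three-row input is a made-up stand-in for ad.data.

```python
import io
import pandas as pd

# Tiny made-up stand-in for ad.data: three features plus the label column,
# with missing values written as '?' padded by a varying number of spaces.
raw = "125,125,1.0,ad.\n   ?,  ?,     ?,nonad.\n60,468,7.8,ad.\n"
df = pd.read_csv(io.StringIO(raw), header=None)

features = list(df.columns[:-1])   # every column except the label
# errors='coerce' converts non-numeric cells (the '?' entries) to NaN;
# fillna then substitutes -1, as in the text.
df[features] = df[features].apply(pd.to_numeric, errors='coerce').fillna(-1)
print(df)
```

One trade-off of this approach is that any other malformed cell would also silently become -1, so the explicit replace calls are stricter about what counts as missing.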

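# Map the 'ad.' label to 1 and the 'nonad.' label to 0
adindices = df[df.columns[-1]]== 'ad.'
df.loc[adindices,df.columns[-1]]=1
nonadindices = df[df.columns[-1]]=='nonad.'
df.loc[nonadindices,df.columns[-1]]=0
df[df.columns[-1]]=df[df.columns[-1]].astype(float)
# Convert every column to a numeric type; apply returns a new
# DataFrame, so the result must be assigned back to df
df=df.apply(lambda x: pd.to_numeric(x))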

Each ad. label has been transformed into 1, while the nonad. values have been replaced by 0. All the columns (features) need to be of a numeric type, so the label column is cast to float with the astype function and every column is converted with the to_numeric function, applied column by column through a lambda function.
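As a side note, the label conversion can also be expressed with the pandas Series.map method and a dictionary. This is a minimal sketch on a made-up label column, offered as an alternative to the boolean-mask approach used in the text.

```python
import pandas as pd

# Made-up label column in the same format as the last column of ad.data.
labels = pd.Series(['ad.', 'nonad.', 'nonad.', 'ad.'])

# Each label is looked up in the dictionary and replaced by its value.
numeric = labels.map({'ad.': 1.0, 'nonad.': 0.0})
print(numeric.tolist())   # [1.0, 0.0, 0.0, 1.0]
```

Any label that is not a key of the dictionary would become NaN, which makes unexpected values easy to spot.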

We want to use the Support Vector Machine (SVM) algorithm provided by the scikit-learn library (see Chapter 3, Supervised Machine Learning) to predict 20% of the labels in the data. First, we split the data into two sets: a training set (80%) and a test set (20%):

import numpy as np
dataset = df.values[:,:]
np.random.shuffle(dataset)
data = dataset[:,:-1]
labels = dataset[:,-1].astype(float)
ntrainrows = int(len(data)*.8)
train = data[:ntrainrows,:]
trainlabels = labels[:ntrainrows]
test = data[ntrainrows:,:]
testlabels = labels[ntrainrows:]

Using the functions provided by NumPy (a tutorial is provided in the next section), the data is shuffled (the random.shuffle function) before being split, to ensure that the rows in the two sets are randomly selected. In the slice [:,:-1], the -1 indicates that the last column of the array is excluded, so data contains only the features, while the index -1 in [:,-1] selects only the last column, which holds the labels.
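The two slicing expressions and the in-place shuffle can be checked on a small array. The values below are made up, and the seed is set only so that the example is repeatable.

```python
import numpy as np

# Made-up 4x3 array: two feature columns followed by one label column.
dataset = np.array([[125., 125., 1.],
                    [ 60., 468., 0.],
                    [ 90.,  90., 1.],
                    [ 33., 230., 0.]])

np.random.seed(0)            # seed only so that the shuffle is repeatable
np.random.shuffle(dataset)   # reorders the rows in place; each row stays intact

data = dataset[:, :-1]       # every column except the last: the features
labels = dataset[:, -1]      # only the last column: the labels

print(data.shape, labels.shape)   # (4, 2) (4,)
```

Because whole rows are moved, each feature vector keeps its own label after the shuffle.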

Now we train our SVM model using the training data:

from sklearn.svm import SVC
clf = SVC(gamma=0.001, C=100.)
clf.fit(train, trainlabels)

We have defined the clf variable, which holds the SVM model with the chosen parameter values. Then the fit function is called to fit the model to the training data (see Chapter 3, Supervised Machine Learning for further details). The mean accuracy in predicting the 20% of test cases is computed as follows, using the score function:

score=clf.score(test,testlabels)
print('score:',score)

Running the preceding code (the full code is available in the chapter_1 folder of the author's GitHub account) gives an accuracy of 92%, which means that the predicted label agrees with the true label in 92% of the test cases. This is the power of machine learning: from previous data, we are able to infer whether a page will contain an advert or not. To achieve that, we have essentially prepared and manipulated the data using the NumPy and pandas libraries, and then applied the SVM algorithm to the cleaned data using the scikit-learn library. Since this book will largely employ the NumPy and pandas (and some matplotlib) libraries, the following sections will discuss how to install the libraries and how the data can be manipulated (or even created) using these libraries.
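The mean accuracy mentioned above is simply the fraction of test cases in which the predicted label equals the true label. The following sketch computes it by hand with NumPy on ten hypothetical predictions; the arrays are made up and are not output from the SVM model.

```python
import numpy as np

# Hypothetical predictions and true labels for ten test cases.
predicted = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
true      = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# Comparing the arrays gives True/False per test case; the mean of
# those booleans is the fraction of correct predictions.
accuracy = np.mean(predicted == true)
print('score:', accuracy)   # score: 0.9
```

Here nine of the ten labels match, so the accuracy is 0.9.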

Installing and importing a module (library)

Before continuing with the discussion on the libraries, we need to clarify how to install each module we want to use in Python. The usual way to install a module is through the pip command using the terminal:

$ sudo pip install modulename

The module is then usually imported into the code using the statement:

import numpy as np

Here, numpy is the library name and np is the alias through which any function X in the library can be accessed, using np.X instead of numpy.X. We are going to assume that all the libraries (scipy, scikit-learn, pandas, scrapy, nltk, and all others) have been installed and imported in this way.
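A short check, assuming NumPy is already installed, shows that the alias and the full name refer to the same module, so np.X and numpy.X are interchangeable:

```python
import numpy
import numpy as np

# np is just a second name bound to the same module object.
print(np is numpy)                          # True
print(np.sqrt(16.0) == numpy.sqrt(16.0))    # True
```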
