Supervised learning algorithms

We will take a brief tour of some well-known supervised learning algorithms and see how we can apply them to the Titanic survival prediction problem described earlier.

Constructing a model using Patsy for scikit-learn

Before we start our tour of the machine learning algorithms, we need to know a little bit about the Patsy library. We will make use of Patsy to design features that will be used in conjunction with scikit-learn. Patsy is a package for creating what are known as design matrices. These design matrices are transformations of the features in our input data. The transformations are specified by expressions known as formulas, which specify which features we wish the machine learning algorithm to use in learning.

A simple example of this is as follows:

Suppose that we want to regress y against the variables x, a, and b, and the interaction between a and b; then, we can specify the model as follows:

import patsy as pts
pts.dmatrices("y ~ x + a + b + a:b", data)

In the preceding line of code, the formula is specified by the following expression: y ~ x + a + b + a:b.
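
To make this concrete, here is a small, self-contained illustration of the design matrices that dmatrices produces; the tiny DataFrame is made up purely for this example:

import pandas as pd
import patsy as pts

# A toy dataset with an outcome y and predictors x, a, and b
data = pd.DataFrame({'y': [1.0, 2.0, 3.0, 4.0],
                     'x': [0.5, 1.5, 2.5, 3.5],
                     'a': [1, 0, 1, 0],
                     'b': [0, 1, 1, 0]})

# dmatrices returns two design matrices: one for the outcome and one for the
# predictors, including an intercept column and the a:b interaction term
y, X = pts.dmatrices("y ~ x + a + b + a:b", data, return_type='dataframe')
print(X.columns.tolist())
# ['Intercept', 'x', 'a', 'b', 'a:b']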

Note

For further reference, look at: http://patsy.readthedocs.org/en/latest/overview.html

General boilerplate code explanation

In this section, we will introduce boilerplate code for implementing the various algorithms that follow, using Patsy and scikit-learn. The reason for doing this is that most of the code for these algorithms is common to all of them.

In the following sections, the workings of the algorithms will be described and the code specific to each algorithm will be provided as attachments to the chapter.

  1. First, let's make sure that we're in the correct folder by using the following command line. Assuming that the working directory is located at ~/devel/Titanic, we have:
    In [17]: %cd ~/devel/Titanic
             /home/youruser/devel/Titanic
    
  2. Here, we import the needed packages and read in our training and test datasets:
    In [18]: import matplotlib.pyplot as plt
             import pandas as pd
             import numpy as np
             import patsy as pt
    In [19]: train_df = pd.read_csv('csv/train.csv', header=0)
             test_df = pd.read_csv('csv/test.csv', header=0)
    
  3. Next, we specify the formulas we would like to submit to Patsy:
    In [21]: formula1 = 'C(Pclass) + C(Sex) + Fare'
             formula2 = 'C(Pclass) + C(Sex)'
             formula3 = 'C(Sex)'
             formula4 = 'C(Pclass) + C(Sex) + Age + SibSp + Parch'
             formula5 = 'C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'
             formula6 = 'C(Pclass) + C(Sex) + C(Embarked)'
             formula7 = 'C(Pclass) + C(Sex) + Age + Parch + C(Embarked)'
             formula8 = 'C(Pclass) + C(Sex) + SibSp + Parch + C(Embarked)'
    
    In [23]: formula_map = {'PClass_Sex_Fare' : formula1,
                            'PClass_Sex' : formula2,
                            'Sex' : formula3,
                            'PClass_Sex_Age_Sibsp_Parch' : formula4,
                            'PClass_Sex_Age_Sibsp_Parch_Embarked' : formula5,
                            'PClass_Sex_Embarked' : formula6,
                            'PClass_Sex_Age_Parch_Embarked' : formula7,
                            'PClass_Sex_SibSp_Parch_Embarked' : formula8
                           }
    

We will define a function that helps us handle missing values. The following function finds the cells within the DataFrame that have null values, obtains the set of similar passengers, and sets the null value to the mean value of that feature for the set of similar passengers. Similar passengers are defined as those having the same gender and passenger class as the passengers with the null feature value.

In [24]: 
def fill_null_vals(df, col_name):
    """Fill nulls in col_name with the mean of that column for
    similar passengers (same Sex and Pclass)."""
    null_passengers = df[df[col_name].isnull()]
    passenger_id_list = null_passengers['PassengerId'].tolist()
    df_filled = df.copy()
    for pass_id in passenger_id_list:
        # Locate the row containing the null value
        idx = df[df['PassengerId'] == pass_id].index[0]
        # Passengers with the same gender and passenger class
        similar_passengers = df[(df['Sex'] == null_passengers['Sex'][idx]) &
                                (df['Pclass'] == null_passengers['Pclass'][idx])]
        mean_val = np.mean(similar_passengers[col_name].dropna())
        df_filled.loc[idx, col_name] = mean_val
    return df_filled

Here, we create filled versions of our training and test DataFrames.

The test DataFrame is what the fitted scikit-learn model will generate predictions on, producing the output that will be submitted to Kaggle for evaluation:

In [28]: train_df_filled = fill_null_vals(train_df, 'Fare')
         train_df_filled = fill_null_vals(train_df_filled, 'Age')
         assert len(train_df_filled) == len(train_df)
         test_df_filled = fill_null_vals(test_df, 'Fare')
         test_df_filled = fill_null_vals(test_df_filled, 'Age')
         assert len(test_df_filled) == len(test_df)
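
As a quick sanity check (this is not part of the original listing), we can verify that the columns we just filled no longer contain any nulls:

# Illustrative check: the filled Age and Fare columns should have no nulls left
for name, df in [('train', train_df_filled), ('test', test_df_filled)]:
    for col in ['Age', 'Fare']:
        assert df[col].isnull().sum() == 0, '%s still has nulls in %s' % (name, col)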

Here is the actual implementation of the call to scikit-learn to learn from the training data by fitting a model and then generate predictions on the test dataset. Note that even though this is boilerplate code, for the purpose of illustration, an actual call is made to a specific algorithm, in this case, DecisionTreeClassifier.

The output data is written to files with descriptive names, for example, csv/dt_PClass_Sex_Age_Sibsp_Parch_1.csv and csv/dt_PClass_Sex_Fare_1.csv.

In [29]: 
from sklearn import metrics, svm, tree

for formula_name, formula in formula_map.items():
    print("name=%s formula=%s" % (formula_name, formula))
    # Design matrices for the training labels and features
    y_train, X_train = pt.dmatrices('Survived ~ ' + formula,
                                    train_df_filled, return_type='dataframe')
    y_train = np.ravel(y_train)
    model = tree.DecisionTreeClassifier(criterion='entropy',
                                        max_depth=3, min_samples_leaf=5)
    print("About to fit...")
    dt_model = model.fit(X_train, y_train)
    print("Training score:%s" % dt_model.score(X_train, y_train))
    # Design matrix for the test features, then predict
    X_test = pt.dmatrix(formula, test_df_filled)
    predicted = dt_model.predict(X_test)
    print("predicted:%s" % predicted[:5])
    assert len(predicted) == len(test_df)
    pred_results = pd.Series(predicted, name='Survived')
    dt_results = pd.concat([test_df['PassengerId'], pred_results], axis=1)
    dt_results.Survived = dt_results.Survived.astype(int)
    results_file = 'csv/dt_%s_1.csv' % formula_name
    print("output file: %s\n" % results_file)
    dt_results.to_csv(results_file, index=False)

The preceding code follows a standard recipe, and the synopsis is as follows:

  1. Read in the training and test datasets
  2. Fill in any missing values for the features we wish to consider in both datasets
  3. Define formulas for the various feature combinations we wish to generate machine learning models for in Patsy
  4. For each formula, perform the following set of steps:
    1. Call Patsy to create design matrices for our training feature set and training label set (designated by X_train and y_train).
    2. Instantiate the appropriate scikit-learn classifier. In this case, we use DecisionTreeClassifier.
    3. Fit the model by calling the fit(..) method.
    4. Call Patsy to create a design matrix (X_test) for the test dataset via patsy.dmatrix(..).
    5. Predict on the X_test design matrix, and save the results in the variable predicted.
    6. Write our predictions to an output file, which will be submitted to Kaggle.

We will consider the following supervised learning algorithms:

  • Logistic regression
  • Support vector machine
  • Decision tree
  • Random forest

Logistic regression

In logistic regression, we attempt to predict the outcome of a categorical (that is, discrete-valued) dependent variable on the basis of one or more input predictor variables.

Logistic regression can be thought of as the equivalent of applying linear regression but to discrete or categorical variables. However, in the case of binary logistic regression (which applies to the Titanic problem), the function we're trying to fit is not a linear one, as we're trying to predict an outcome that can take only two values: 0 and 1. Using a linear function for our regression doesn't make sense, as its output is unbounded and can take values well outside 0 and 1. Ideally, what we need to model for the regression of a binary-valued output is some sort of step function between the values 0 and 1. However, such a step function is discontinuous and not differentiable, so an approximation with nicer properties was defined: the logistic function. The logistic function takes values between 0 and 1 but saturates towards the extreme values of 0 and 1, and it can be used as a good approximation for the regression of categorical variables.

The formal definition of the logistic regression function is as follows:

f(x) = 1 / (1 + e^(-ax))

The following graph is a good illustration as to why the logistic function is suitable for binary logistic regression:

[Figure: the logistic function plotted for increasing values of a]

We can see that as we increase the value of the parameter a, the function gets closer to taking on the values 0 and 1 and to the step function we wish to model. A simple application of the preceding function would be to set the output value to 0 if f(x) < 0.5, and to 1 otherwise.

The code for plotting the function is included in plot_logistic.py.
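
The contents of plot_logistic.py are not reproduced here; the following is a minimal sketch, using only NumPy and matplotlib, of how such a plot could be produced:

# Plot the logistic function 1/(1 + e^(-ax)) for several values of a
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
for a in (0.5, 1, 4):
    plt.plot(x, 1.0 / (1.0 + np.exp(-a * x)), label='a = %s' % a)
plt.axhline(0.5, linestyle='--', color='gray')  # the 0.5 decision threshold
plt.legend()
plt.title('Logistic function for increasing values of a')
plt.show()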

Note

A more detailed examination of logistic regression may be found at http://en.wikipedia.org/wiki/Logit and http://logisticregressionanalysis.com/86-what-is-logistic-regression.

In applying logistic regression to the Titanic problem, we wish to predict a binary outcome, that is, whether a passenger survived or not.

We adapted the boilerplate code to use the sklearn.linear_model.LogisticRegression class of scikit-learn.
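
Only the model instantiation in the boilerplate loop changes; a sketch of the adaptation (the surrounding loop and the design matrices are exactly as shown earlier) is:

from sklearn.linear_model import LogisticRegression

# Replace the DecisionTreeClassifier instantiation in the boilerplate loop
model = LogisticRegression(C=1.0)   # C is the inverse regularization strength (1.0 is the default)
lr_model = model.fit(X_train, y_train)
print("Training score:%s" % lr_model.score(X_train, y_train))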

Upon submitting our data to Kaggle, the following results were obtained:

Formula                                                    Kaggle Score
-------------------------------------------------------    ------------
C(Pclass) + C(Sex) + Fare                                  0.76077
C(Pclass) + C(Sex)                                         0.76555
C(Sex)                                                     0.76555
C(Pclass) + C(Sex) + Age + SibSp + Parch                   0.74641
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)     0.75598

The code implementing logistic regression can be found in the run_logistic_regression_titanic.py file.

Support vector machine

Support vector machine (SVM) is a powerful supervised learning algorithm used for classification and regression. It is a discriminative classifier: it draws a boundary between clusters or classes of data, so that new points can be classified on the basis of the cluster that they fall into.

SVMs do not just find a boundary line; they also try to determine margins for the boundary on either side. The SVM algorithm tries to find the boundary with the largest possible margin around it.

Support vectors are the points that define the largest margin around the boundary; remove these points, and a larger margin can possibly be found.

Hence the name support: these points support the margin around the boundary line. This is illustrated in the following diagram:

[Figure: Support vector machine: the maximum-margin boundary and its support vectors]

Note

For more information on this, refer to http://winfwiki.wi-fom.de/images/c/cf/Support_vector_2.png.
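
To make the idea of support vectors concrete, the following sketch fits a linear SVM on a small made-up two-dimensional dataset and inspects the points scikit-learn identifies as support vectors:

import numpy as np
from sklearn import svm

# Two well-separated clusters of points (made-up data)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear').fit(X, y)
print(clf.support_vectors_)   # the points that define the maximum margin
print(clf.predict([[3, 3]]))  # classify a new point by which side of the boundary it falls on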

To use the SVM algorithm for classification, we specify one of the following three kernels: linear, poly, or rbf (radial basis function).

Then, we import the support vector classifier (SVC):

from sklearn import svm

We then instantiate an SVM classifier, fit the model, and generate predictions:

model = svm.SVC(kernel=kernel)   # kernel is one of 'linear', 'poly', or 'rbf'
svm_model = model.fit(X_train, y_train)
X_test = pt.dmatrix(formula, test_df_filled)
. . .
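
The rest of the loop mirrors the boilerplate shown earlier; a sketch of the remaining steps (the formula_name and kernel variables come from the surrounding loop, and the output file name is illustrative) is:

# Predict on the test design matrix and write the submission file
predicted = svm_model.predict(X_test)
svm_results = pd.concat([test_df['PassengerId'],
                         pd.Series(predicted, name='Survived')], axis=1)
svm_results.Survived = svm_results.Survived.astype(int)
svm_results.to_csv('csv/svm_%s_%s_1.csv' % (kernel, formula_name), index=False)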

Upon submitting our data to Kaggle, the following results were obtained:

Formula                                                    Kernel Type    Kaggle Score
-------------------------------------------------------    -----------    ------------
C(Pclass) + C(Sex) + Fare                                  poly           0.71292
C(Pclass) + C(Sex)                                         poly           0.76555
C(Sex)                                                     poly           0.76555
C(Pclass) + C(Sex) + Age + SibSp + Parch                   poly           0.75598
C(Pclass) + C(Sex) + Age + Parch + C(Embarked)             poly           0.77512
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)     poly           0.79426
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)     rbf            0.7512

The code can be seen in its entirety in the following file: run_svm_titanic.py.

Here, we see that the SVM with a kernel type of poly (polynomial) and the combination of the Pclass, Sex, Age, SibSp, Parch, and Embarked features produces the best results when submitted to Kaggle. Surprisingly, it seems as if the embarkation point (Embarked) and whether the passenger travelled alone or with family members (SibSp + Parch) do have a material effect on a passenger's chances of survival.

The latter effect was probably due to the women-and-children-first policy on the Titanic.

Decision trees

The basic idea behind decision trees is to use the training dataset to create a tree of decisions in order to make a prediction.

The algorithm recursively splits the training dataset into subsets on the basis of the value of a single feature. Each split corresponds to a node in the decision tree. The splitting process is continued until every subset is pure, that is, all of its elements belong to a single class. This always works except in cases where training examples with identical feature values fall into different classes; in that case, the majority class wins.

The end result is a rule set for making predictions on the test dataset.

Decision trees encode a sequence of binary choices in a process that mimics how a human might classify things, but they decide which question is most useful at each step by using an information criterion, such as information gain (based on entropy) or Gini impurity.
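
As a concrete illustration of such a criterion, the entropy of a set of class labels, which underlies the criterion='entropy' setting used in our boilerplate, can be computed as follows; a split is preferred if it reduces the weighted entropy of the resulting subsets (that is, if its information gain is high):

import numpy as np

def entropy(labels):
    # Entropy in bits: the sum of p * log2(1/p) over the class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0, 0, 1, 1]))   # 1.0: a maximally mixed subset
print(entropy([0, 0, 0, 0]))   # 0.0: a pure subset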

An example of such a sequence of questions would be if you wished to determine whether an animal x is a mammal, bird, fish, reptile, or amphibian; in this case, we would ask the following questions:

- Does x have fur?
  - Yes: x is a mammal
  - No: Does x have feathers?
    - Yes: x is a bird
    - No: Does x have scales?
      - Yes: Does x have gills?
        - Yes: x is a fish
        - No: x is a reptile
      - No: x is an amphibian

This generates a decision tree that looks similar to the following:

[Figure: the decision tree for classifying animal x]

Note

Refer to the following link for more information:

http://bit.ly/1C0cM2e.

The binary splitting of questions at each node is the essence of a decision tree algorithm. A major drawback of decision trees is that they can overfit the data.

They are so flexible that, given a large depth, they can memorize the inputs, and this results in poor performance when they are used to classify unseen data.

The way to fix this is to use multiple decision trees, and this is known as using an ensemble estimator. An example of an ensemble estimator is the random forest algorithm, which we will address next.

To use a decision tree in scikit-learn, we import the tree module:

from sklearn import tree

We then instantiate a decision tree classifier, fit the model, and generate predictions:

model = tree.DecisionTreeClassifier(criterion='entropy',
                                    max_depth=3, min_samples_leaf=5)
dt_model = model.fit(X_train, y_train)
X_test = pt.dmatrix(formula, test_df_filled)
#. . .
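
To inspect the rule set a fitted tree has learned, scikit-learn can render it as nested if/else conditions (export_text is available in scikit-learn 0.21 and later); a sketch:

from sklearn.tree import export_text

# Print the learned decision rules using the design matrix column names
print(export_text(dt_model, feature_names=list(X_train.columns)))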

Upon submitting our data to Kaggle, the following results are obtained:

Formula                                                    Kaggle Score
-------------------------------------------------------    ------------
C(Pclass) + C(Sex) + Fare                                  0.77033
C(Pclass) + C(Sex)                                         0.76555
C(Sex)                                                     0.76555
C(Pclass) + C(Sex) + Age + SibSp + Parch                   0.76555
C(Pclass) + C(Sex) + Age + Parch + C(Embarked)             0.78947
C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)     0.79426

Random forest

The random forest, like the decision tree, is an example of a non-parametric model. Random forests are based on decision trees: the decision boundary is learned from the data itself, and it doesn't have to be a line, a polynomial, or a radial basis function. The random forest model builds upon the decision tree concept by producing a large number, or forest, of decision trees. It takes a random sample of the data and identifies a set of features to grow each decision tree. The error rate of the model is compared across sets of decision trees to find the set of features that produces the strongest classification model.
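
To illustrate how the forest combines its trees, the following sketch (on a small made-up dataset) fits a random forest and compares the individual trees' votes with the forest's aggregated prediction (scikit-learn averages the trees' predicted class probabilities):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Two clusters of made-up points
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [2, 2], [3, 2], [2, 3], [3, 3]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
votes = [t.predict([[1.5, 1.5]])[0] for t in forest.estimators_]  # one vote per tree
print(votes)
print(forest.predict([[1.5, 1.5]]))  # the forest's combined prediction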

To use a random forest in scikit-learn, we import the RandomForestClassifier class from the ensemble module:

from sklearn.ensemble import RandomForestClassifier

We then instantiate a random forest classifier, fit the model, and generate predictions:

model = RandomForestClassifier(n_estimators=num_estimators,
                               random_state=0)
rf_model = model.fit(X_train, y_train)
X_test = pt.dmatrix(formula, test_df_filled)
. . .
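
The table that follows varies only the number of estimators while holding the formula fixed; a sketch of the loop that would generate those runs (the X_train, y_train, X_test, and formula_name variables come from the boilerplate, and the output file names are illustrative) is:

# Vary the number of trees and write one submission file per setting
for num_estimators in (10, 100, 1000, 10000, 100000):
    model = RandomForestClassifier(n_estimators=num_estimators, random_state=0)
    rf_model = model.fit(X_train, y_train)
    predicted = rf_model.predict(X_test)
    rf_results = pd.concat([test_df['PassengerId'],
                            pd.Series(predicted, name='Survived').astype(int)], axis=1)
    rf_results.to_csv('csv/rf_%s_%d.csv' % (formula_name, num_estimators), index=False)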

Upon submitting our data to Kaggle (formula: C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)), the following results are obtained:

Number of estimators    Kaggle Score
--------------------    ------------
10                      0.74163
100                     0.76077
1000                    0.76077
10000                   0.77990
100000                  0.77990
