A naïve approach to the Titanic problem

Our first attempt at classifying the Titanic data is to use a naïve, yet very intuitive, approach. This approach involves the following steps:

  1. Select a set of features, S, that influence whether a person survives or not.
  2. For each possible combination of these features, use the training data to determine whether the majority of cases survived or not. The results are recorded in what is known as a survival table.
  3. For each test example whose survival we wish to predict, look up the row of the survival table whose feature combination matches the example's feature values and use that row's survival value as the prediction. This can be seen as a naïve variant of the K-nearest neighbor approach.

Based on what we have seen earlier in our analysis, there are three features that seem to have the most influence on the survival rate:

  • Passenger class
  • Gender
  • Passenger fare (bucketed)

We include passenger fare as it is related to passenger class.
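
The exact fare-bucket boundaries come from the code used to prepare the data; as a hedged illustration, one plausible way to derive a four-level PriceBucket feature is to cut the fares into quartiles with pandas, which yields the bucket labels 0 through 3 that appear in the table below. The csv/train.csv path and the use of pd.qcut are assumptions for this sketch, not details confirmed by the text:

import pandas as pd

# Load the Kaggle training data (the csv/train.csv path is an assumption).
train_df = pd.read_csv('csv/train.csv')

# Derive a four-level PriceBucket feature from the fare.
# Quartile-based buckets via pd.qcut are one plausible choice here;
# the actual preparation code may use different boundaries.
train_df['PriceBucket'] = pd.qcut(train_df['Fare'], 4, labels=False)

print(train_df[['Fare', 'PriceBucket']].head())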

The survival table looks similar to the following:

     NumberOfPeople  Pclass  PriceBucket     Sex  Survived
0                0       1            0  female         0
1                1       1            0    male         0
2                0       1            1  female         0
3                0       1            1    male         0
4                7       1            2  female         1
5               34       1            2    male         0
6                1       1            3  female         1
7               19       1            3    male         0
8                0       2            0  female         0
9                0       2            0    male         0
10              35       2            1  female         1
11              63       2            1    male         0
12              31       2            2  female         1
13              25       2            2    male         0
14               4       2            3  female         1
15               6       2            3    male         0
16              64       3            0  female         1
17             256       3            0    male         0
18              43       3            1  female         1
19              38       3            1    male         0
20              21       3            2  female         0
21              24       3            2    male         0
22              10       3            3  female         0
23               5       3            3    male         0

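One way to construct such a table is to group the training data by the three features and take the majority survival outcome within each group. The following sketch illustrates the idea; it assumes the train_df and PriceBucket column from the previous sketch, and the decision to list unseen feature combinations with NumberOfPeople = 0 and a default of Survived = 0 is an assumption made here to match the shape of the table above, not necessarily how the attached code works:

# Majority survival outcome and group size for each observed combination
# of the three chosen features.
grouped = train_df.groupby(['Pclass', 'PriceBucket', 'Sex']).agg(
    NumberOfPeople=('Survived', 'size'),
    Survived=('Survived', lambda s: int(s.mean() > 0.5)),  # majority vote
)

# Reindex against every possible combination so that combinations unseen
# in the training data also appear, with NumberOfPeople = 0 and a default
# prediction of Survived = 0 (perished).
full_index = pd.MultiIndex.from_product(
    [[1, 2, 3], [0, 1, 2, 3], ['female', 'male']],
    names=['Pclass', 'PriceBucket', 'Sex'])
survival_table = (grouped.reindex(full_index)
                  .fillna(0)
                  .astype(int)
                  .reset_index())
print(survival_table)

Reindexing against the full Cartesian product of feature values guarantees that every possible lookup key has a row, so no test passenger falls outside the table.
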
The code for generating this table can be found in the attached file survival_data.py. To see how this table is used, let us look at a snippet of our test data:

In [192]: test_df.head(3)[['PassengerId','Pclass','Sex','Fare']]
Out[192]: 
   PassengerId  Pclass     Sex    Fare
0          892       3    male  7.8292
1          893       3  female  7.0000
2          894       2    male  9.6875

For passenger 892, we see that he is male, his ticket price was 7.8292, and he travelled in third class.

Hence, the key for survival table lookup for this passenger is {Sex='male', Pclass=3, PriceBucket=0 (since 7.8292 falls in bucket 0)}.

If we look up the survival value corresponding to this key in our survival table (row 17), we see that the value is 0 = Perished; this is the value that we will predict.

Similarly, for passenger 893, we have key={Sex='female', Pclass=3, PriceBucket=0}.

This corresponds to row 16; hence, her predicted survival is 1, that is, survived.
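
Repeating this lookup for every test passenger can be done in one step with a merge on the three key columns. The sketch below assumes the survival_table and quartile boundaries built in the earlier sketches, a csv/test.csv input and csv/surv_results.csv output path, and it fills the single missing test fare with the training median; these details are illustrative assumptions, not the attached implementation:

# Load the test data and bucket its fares with the boundaries fitted
# on the training fares, so both sets share the same PriceBucket scale.
test_df = pd.read_csv('csv/test.csv')
_, fare_bins = pd.qcut(train_df['Fare'], 4, retbins=True, labels=False)
test_df['PriceBucket'] = (
    pd.cut(test_df['Fare'].fillna(train_df['Fare'].median()),
           bins=fare_bins, labels=False, include_lowest=True)
    .fillna(0)        # fares outside the training range fall back to bucket 0
    .astype(int))

# The merge performs the survival table lookup for every test passenger
# at once, keyed on (Pclass, PriceBucket, Sex).
predictions = test_df.merge(
    survival_table[['Pclass', 'PriceBucket', 'Sex', 'Survived']],
    on=['Pclass', 'PriceBucket', 'Sex'], how='left')

predictions[['PassengerId', 'Survived']].to_csv('csv/surv_results.csv',
                                                index=False)

Because the survival table contains a row for every (Pclass, PriceBucket, Sex) combination, the left merge always finds a match and the Survived column needs no further cleanup.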

Thus, the first few rows of our results file look like the following:

> head -4 csv/surv_results.csv 
PassengerId,Survived
892,0
893,1
894,0

The source of this information is at: http://bit.ly/1FU7mXj.

Using the survival table approach outlined earlier, one is able to achieve an accuracy of 0.77990 on Kaggle (http://www.kaggle.com).

The survival table approach, while intuitive, is a very basic approach that represents only the tip of the iceberg of possibilities in machine learning.

In the following sections, we will take a whirlwind tour of various machine learning algorithms that will help you, the reader, to get a feel for what is available in the machine learning universe.
