Data analysis and preprocessing using pandas

In this section, we use pandas to analyze and preprocess the data before passing it as input to scikit-learn.

Examining the data

To start preprocessing the data, we read the training dataset into a pandas DataFrame and display the first three rows to see what it looks like:

In [2]: import pandas as pd
        import numpy as np
        # For .read_csv, always use header=0 when you know row 0 is the header row
        train_df = pd.read_csv('csv/train.csv', header=0)
In [3]: train_df.head(3)

The output is as follows:

[Figure: output of train_df.head(3)]

Thus, we can see the various features: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked, alongside the Survived column that we want to predict. One question springs to mind immediately: which of the features are likely to influence whether a passenger survived or not?

PassengerId, Ticket, and Name are identifier variables, so they should have no influence on survivability. We will skip these in our analysis.
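As a minimal sketch of skipping the identifier columns, the following uses a tiny stand-in DataFrame (toy_df and its values are invented for illustration and are not the real training data) and removes the three identifiers with DataFrame.drop:

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Name': ['A', 'B', 'C'],
    'Sex': ['male', 'female', 'female'],
    'Ticket': ['T1', 'T2', 'T3'],
})

# Identifier-like columns that we do not want to analyze
id_columns = ['PassengerId', 'Name', 'Ticket']

# drop returns a new DataFrame; toy_df itself is left unchanged
features_df = toy_df.drop(columns=id_columns)
print(features_df.columns.tolist())
```

Because drop returns a copy, the original frame keeps its identifiers in case they are needed later, for example to label predictions.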

Handling missing values

One issue that we have to deal with in datasets for machine learning is how to handle missing values in the training set.

Let's visually identify where we have missing values in our feature set.

For that, we can make use of an equivalent of the missmap function in R, written by Tom Augspurger. The next graphic shows how much data is missing for the various features in an intuitively pleasing manner:

[Figure: map of missing values for each feature in the training set]

For more information and the code used to generate this graphic, see the following: http://bit.ly/1C0a24U.

We can also calculate how much data is missing for each of the features:

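In [83]: # count() returns the number of non-null entries in each column
         missing_perc=train_df.apply(lambda x: 100*(1-x.count()/(1.0*len(x))))
In [85]: # sort_values replaces the older Series.order method
         sorted_missing_perc=missing_perc.sort_values(ascending=False)
         sorted_missing_perc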
Out[85]: Cabin          77.104377
         Age            19.865320
         Embarked        0.224467
         Fare            0.000000
         Ticket          0.000000
         Parch           0.000000
         SibSp           0.000000
         Sex             0.000000
         Name            0.000000
         Pclass          0.000000
         Survived        0.000000
         PassengerId     0.000000
         dtype: float64
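The same percentages can also be computed without apply. The sketch below assumes a small invented frame (toy_df is hypothetical, not the Titanic data) and relies on the fact that the mean of a Boolean column is the fraction of True values:

```python
import numpy as np
import pandas as pd

# Toy frame with some missing cells; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

# isna() marks missing cells as True; the column-wise mean is the missing fraction
missing_perc = toy_df.isna().mean() * 100
sorted_missing_perc = missing_perc.sort_values(ascending=False)
print(sorted_missing_perc)
```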

Thus, we can see that most of the Cabin data is missing (77%), while around 20% of the Age data is missing. We therefore drop the Cabin data from our learning feature set, as it is too sparse to be of much use.
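One way to make this decision repeatable is to drop every column whose missing percentage exceeds a cutoff. This is only a sketch: the 50% threshold and the toy_df frame are assumptions chosen for illustration, not values taken from the text.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

threshold = 50.0  # hypothetical cutoff, in percent

# Percentage of missing values per column
missing_perc = toy_df.isna().mean() * 100

# Columns that are too sparse to keep
sparse_cols = missing_perc[missing_perc > threshold].index.tolist()
reduced_df = toy_df.drop(columns=sparse_cols)
print(sparse_cols, reduced_df.columns.tolist())
```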

Let us do a further breakdown of the various features that we would like to examine. In the case of categorical/discrete features, we use bar plots; for continuous valued features, we use histograms:

In [137]: import matplotlib.pyplot as plt
          # Display label -> stored value for each categorical feature
          categories_map={'Pclass':{'First':1,'Second':2,'Third':3},
                          'Sex':{'Female':'female','Male':'male'},
                          'Survived':{'Perished':0,'Survived':1},
                          'Embarked':{'Cherbourg':'C','Queenstown':'Q','Southampton':'S'},
                          'SibSp':{str(x):x for x in [0,1,2,3,4,5,8]},
                          'Parch':{str(x):x for x in range(7)}}
          keyorder=['Survived','Sex','Pclass','Embarked','SibSp','Parch']
          hist_features=['Age','Fare']
          # One subplot per categorical feature plus one per continuous feature
          fig,ax=plt.subplots(len(keyorder)+len(hist_features),figsize=(10,12))
          cIdx=0
          for category_key in keyorder:
              category_items=categories_map[category_key]
              xlabels=sorted(category_items.keys())
              for idx,cat_name in enumerate(xlabels):
                  cat_val=category_items[cat_name]
                  # A random RGB triple gives each bar its own color
                  ax[cIdx].bar(idx,len(train_df[train_df[category_key]==cat_val]),
                               label=cat_name,color=tuple(np.random.rand(3)))
              ax[cIdx].set_title('%s Breakdown' % category_key)
              ax[cIdx].set_xticks(np.arange(len(xlabels)))
              ax[cIdx].set_xticklabels(xlabels)
              ax[cIdx].set_ylabel('Count')
              cIdx+=1
          for hcat in hist_features:
              # Histograms cannot bin NaN values, so drop them first
              ax[cIdx].hist(train_df[hcat].dropna(),color=tuple(np.random.rand(3)))
              ax[cIdx].set_title('%s Breakdown' % hcat)
              ax[cIdx].set_ylabel('Frequency')
              cIdx+=1
          fig.subplots_adjust(hspace=0.8)
          plt.show()

[Figure: breakdown plots for Survived, Sex, Pclass, Embarked, SibSp, Parch, Age, and Fare]

From the data and illustration in the preceding figure, we can observe the following:

  • About 1.6 times as many passengers perished as survived (62% versus 38%).
  • There were nearly twice as many male passengers as female passengers (65% versus 35%).
  • There were about 20% more passengers in third class than in first and second class combined (55% versus 45%).
  • Most passengers traveled alone, that is, with no children, parents, siblings, or spouse on board.
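These percentages can be checked numerically. The sketch below does not read train.csv; as an assumption made for the sake of a self-contained example, it rebuilds one row per passenger from the class-by-gender counts shown in the crosstab output later in this section, then uses value_counts(normalize=True) to get the share of each category:

```python
import pandas as pd

# (Pclass, Sex, Survived) -> passenger count, copied from the crosstab
# output later in this section
counts = {
    (1, 'female', 0): 3,  (1, 'female', 1): 91,
    (1, 'male', 0): 77,   (1, 'male', 1): 45,
    (2, 'female', 0): 6,  (2, 'female', 1): 70,
    (2, 'male', 0): 91,   (2, 'male', 1): 17,
    (3, 'female', 0): 72, (3, 'female', 1): 72,
    (3, 'male', 0): 300,  (3, 'male', 1): 47,
}

# Expand the counts into one row per passenger
rows = [key for key, n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=['Pclass', 'Sex', 'Survived'])

# Share of each category, as a percentage
shares = {col: df[col].value_counts(normalize=True).mul(100).round(1)
          for col in ['Survived', 'Sex', 'Pclass']}
for col, share in shares.items():
    print(share)
```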

These observations suggest digging deeper to investigate whether survival correlates with gender and with fare class. Two facts make this especially plausible: the Titanic followed a women-and-children-first policy (http://en.wikipedia.org/wiki/Women_and_children_first), and it was carrying fewer lifeboats (20) than it was designed to carry (32).

In light of this, let us further examine the relationships between survival and some of these features. We start with gender:

In [85]: from collections import OrderedDict
         num_passengers=len(train_df)
         num_men=len(train_df[train_df['Sex']=='male'])
         men_survived=train_df[(train_df['Survived']==1 ) & (train_df['Sex']=='male')]
         num_men_survived=len(men_survived)
         num_men_perished=num_men-num_men_survived
         num_women=num_passengers-num_men
         women_survived=train_df[(train_df['Survived']==1) & (train_df['Sex']=='female')]
         num_women_survived=len(women_survived)
         num_women_perished=num_women-num_women_survived
         gender_survival_dict=OrderedDict()
         gender_survival_dict['Survived']={'Men':num_men_survived,'Women':num_women_survived}
         gender_survival_dict['Perished']={'Men':num_men_perished,'Women':num_women_perished}
         gender_survival_dict['Survival Rate']= {'Men' : round(100.0*num_men_survived/num_men,2),'Women':round(100.0*num_women_survived/num_women,2)}
         pd.DataFrame(gender_survival_dict)
Out[85]:
         Gender   Survived   Perished   Survival Rate
         Men           109        468           18.89
         Women         233         81           74.20
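The same summary can be produced more compactly with groupby. This is a sketch rather than the approach used above; to stay self-contained, it rebuilds one row per passenger from the counts in the preceding table instead of reading train.csv:

```python
import pandas as pd

# Rebuild one row per passenger from the counts in the table above
df = pd.DataFrame(
    [('male', 1)] * 109 + [('male', 0)] * 468 +
    [('female', 1)] * 233 + [('female', 0)] * 81,
    columns=['Sex', 'Survived'])

# Survived is 0 or 1, so sum() counts survivors and mean() is the survival rate
grouped = df.groupby('Sex')['Survived']
summary = pd.DataFrame({
    'Survived': grouped.sum(),
    'Perished': grouped.count() - grouped.sum(),
    'Survival Rate': (grouped.mean() * 100).round(2),
})
print(summary)
```

Grouping by a different column, such as Pclass, would give the corresponding table for that feature without any further bookkeeping variables.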

We now illustrate this data in a bar chart using the following command:

In [76]: #code to display survival by gender
         fig = plt.figure()
         ax = fig.add_subplot(111)
         perished_data=[num_men_perished, num_women_perished]
         survived_data=[num_men_survived, num_women_survived]
         N=2
         ind = np.arange(N)     # the y locations for the groups
         width = 0.35

         survived_rects = ax.barh(ind, survived_data, width,color='green')
         perished_rects = ax.barh(ind+width, perished_data, width,color='red')

         ax.set_xlabel('Count')
         ax.set_title('Count of Survival by Gender')
         yTickMarks = ['Men','Women']
         ax.set_yticks(ind+width)
         ytickNames = ax.set_yticklabels(yTickMarks)
         plt.setp(ytickNames, rotation=45, fontsize=10)

         ## add a legend
         ax.legend((survived_rects[0], perished_rects[0]), ('Survived', 'Perished') )
         plt.show()

The preceding code produces the following bar graph:

[Figure: Count of Survival by Gender]

From the preceding plot, we can see that a majority of the women survived (74%), while most of the men perished (only 19% survived).

This leads us to the conclusion that the gender of the passenger may be a contributing factor to whether a passenger survived or not.

Next, let us look at passenger class. First, we generate the survived and perished counts for each of the three passenger classes, together with the survival rates, and show them in a table:

In [86]: 
from collections import OrderedDict
num_passengers=len(train_df)
num_class1=len(train_df[train_df['Pclass']==1])
class1_survived=train_df[(train_df['Survived']==1 ) & (train_df['Pclass']==1)]
num_class1_survived=len(class1_survived)
num_class1_perished=num_class1-num_class1_survived
num_class2=len(train_df[train_df['Pclass']==2])
class2_survived=train_df[(train_df['Survived']==1) & (train_df['Pclass']==2)]
num_class2_survived=len(class2_survived)
num_class2_perished=num_class2-num_class2_survived
num_class3=num_passengers-num_class1-num_class2
class3_survived=train_df[(train_df['Survived']==1 ) & (train_df['Pclass']==3)]
num_class3_survived=len(class3_survived)
num_class3_perished=num_class3-num_class3_survived
pclass_survival_dict=OrderedDict()
pclass_survival_dict['Survived']={'1st Class':num_class1_survived,
                                  '2nd Class':num_class2_survived,
                                  '3rd Class':num_class3_survived}
pclass_survival_dict['Perished']={'1st Class':num_class1_perished,
                                  '2nd Class':num_class2_perished,
                                 '3rd Class':num_class3_perished}
pclass_survival_dict['Survival Rate']= {'1st Class' : round(100.0*num_class1_survived/num_class1,2),
               '2nd Class':round(100.0*num_class2_survived/num_class2,2),
               '3rd Class':round(100.0*num_class3_survived/num_class3,2),}
pd.DataFrame(pclass_survival_dict)

Out[86]:
         Passenger Class   Survived   Perished   Survival Rate
         First Class            136         80           62.96
         Second Class            87         97           47.28
         Third Class            119        372           24.24

We can then plot the data by using matplotlib in a similar manner to that for the survivor count by gender as described earlier:

In [186]:
fig = plt.figure()
ax = fig.add_subplot(111)
perished_data=[num_class1_perished, num_class2_perished, num_class3_perished]
survived_data=[num_class1_survived, num_class2_survived, num_class3_survived]
N=3
ind = np.arange(N)                # the x locations for the groups
width = 0.35
survived_rects = ax.barh(ind, survived_data, width,color='blue')
perished_rects = ax.barh(ind+width, perished_data, width,color='red')
ax.set_xlabel('Count')
ax.set_title('Survivor Count by Passenger class')
yTickMarks = ['1st Class','2nd Class', '3rd Class']
ax.set_yticks(ind+width)
ytickNames = ax.set_yticklabels(yTickMarks)
plt.setp(ytickNames, rotation=45, fontsize=10)
## add a legend
ax.legend( (survived_rects[0], perished_rects[0]), ('Survived', 'Perished'),
          loc=10 )
plt.show()

This produces the following bar plot:

[Figure: Survivor Count by Passenger class]

It seems clear from the preceding data and illustration that the higher a passenger's fare class, the greater the chances of survival.

Given that both gender and fare class seem to influence the chances of a passenger's survival, let's see what happens when we combine these two features and plot a combination of both. For this, we shall use the crosstab function in pandas.

In [173]: survival_counts=pd.crosstab([train_df.Pclass,train_df.Sex],train_df.Survived.astype(bool))
          survival_counts
Out[173]:               Survived False  True
           Pclass       Sex             
           1            female    3     91
                        male     77     45
           2            female    6     70
                        male     91     17
           3            female   72     72
                        male    300     47
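Raw counts are hard to compare across groups of different sizes, so it can help to turn each row into percentages. The sketch below assumes a frame that mirrors the counts above (rebuilt by hand here, with the columns named Perished and Survived for readability) and divides each row by its total:

```python
import pandas as pd

# Counts copied from the crosstab output above
index = pd.MultiIndex.from_product([[1, 2, 3], ['female', 'male']],
                                   names=['Pclass', 'Sex'])
counts = pd.DataFrame({'Perished': [3, 77, 6, 91, 72, 300],
                       'Survived': [91, 45, 70, 17, 72, 47]},
                      index=index)

# Divide each row by its total so that every row sums to 100%
row_totals = counts.sum(axis=1)
survival_rates = counts.div(row_totals, axis=0).mul(100).round(1)
print(survival_rates)
```

When working from the original DataFrame rather than from precomputed counts, passing normalize='index' to pd.crosstab yields the same row-wise fractions directly.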

Let us now display this data using matplotlib. First, let's do some re-labeling for display purposes:

In [183]: survival_counts.index=survival_counts.index.set_levels([['1st', '2nd', '3rd'], ['Women', 'Men']])
In [184]: survival_counts.columns=['Perished','Survived']
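Note that set_levels relabels by position within each level, and the Sex level is stored in sorted order (female, then male), which is why 'Women' is listed before 'Men'. A minimal sketch on a hand-built index (assumed here to have the same shape as the crosstab index) shows the mapping:

```python
import pandas as pd

# Hand-built index with the same levels as the crosstab result
index = pd.MultiIndex.from_product([[1, 2, 3], ['female', 'male']],
                                   names=['Pclass', 'Sex'])

# The existing level values, in the order set_levels will replace them
print(index.levels[1].tolist())

# Replace both levels at once; each new list lines up with the old level values
relabeled = index.set_levels([['1st', '2nd', '3rd'], ['Women', 'Men']])
print(relabeled.tolist())
```

Listing the new labels in the wrong order would silently swap the genders, so it is worth checking index.levels before relabeling.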

Now, we plot the data by using the plot function of a pandas DataFrame:

In [185]: fig = plt.figure()
          ax = fig.add_subplot(111)
          ax.set_xlabel('Count')
          ax.set_title('Survivor Count by Passenger class, Gender')
          survival_counts.plot(kind='barh',ax=ax,width=0.75,
                               color=['red','black'], xlim=(0,400))
Out[185]: <matplotlib.axes._subplots.AxesSubplot at 0x7f714b187e90>

[Figure: Survivor Count by Passenger class, Gender]