Application of machine learning – Kaggle Titanic competition

In order to illustrate how we can use pandas at the start of our machine learning journey, we will apply it to a classic problem hosted on the Kaggle website (http://www.kaggle.com). Kaggle is a competition platform for machine learning problems. The idea behind Kaggle is to enable companies that are interested in solving predictive analytics problems with their data to post that data on Kaggle and invite data scientists to propose solutions. Each competition runs over a period of time, and the rankings of the competitors are posted on a leaderboard. At the end of the competition, the top-ranked competitors receive cash prizes.

The classic problem that we will study in order to illustrate the use of pandas for machine learning with scikit-learn is the Titanic: Machine Learning from Disaster problem, hosted on Kaggle as its classic introductory machine learning problem. The dataset for this problem is raw and uncleaned; hence, pandas is very useful for preprocessing and cleansing the data before it is fed as input to the machine learning algorithms implemented in scikit-learn.

The Titanic: Machine Learning from Disaster problem

The dataset for the Titanic consists of the passenger manifest for the doomed voyage, along with various features and an indicator variable that tells whether each passenger survived the sinking of the ship or not. The essence of the problem is to predict, given a passenger and his/her associated features, whether that passenger survived the sinking of the Titanic.

The data is split into two datasets: a training dataset and a test dataset. The training dataset consists of 891 passenger records, and the test dataset consists of 418 passenger records.

The training dataset consists of 12 variables, of which 11 are features and 1 is the dependent/indicator variable Survived, which indicates whether the passenger survived the disaster or not.

The feature variables are as follows (a short loading and inspection sketch follows the list):

  • PassengerId
  • Cabin
  • Name
  • Sex
  • Pclass (passenger class)
  • Fare
  • Ticket
  • Parch (number of parents/children aboard)
  • Age
  • SibSp (number of siblings/spouses aboard)
  • Embarked (port of embarkation)
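
The following is a minimal sketch of loading and inspecting this data with pandas. It assumes that the train.csv and test.csv files have been downloaded from the Kaggle competition page into the current working directory:

    import pandas as pd

    # Load the training and test datasets downloaded from Kaggle
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')

    # Inspect the dimensions, column names, and missing values
    print(train_df.shape)            # (891, 12)
    print(train_df.columns.tolist())
    print(train_df.isnull().sum())   # Age, Cabin, and Embarked contain missing values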

We can make use of pandas to help us preprocess the data in the following ways (a preprocessing sketch follows the list):

  • Data cleaning and categorization of some variables
  • Exclusion of unnecessary features that are unlikely to have a bearing on the survivability of the passenger, for example, the passenger's name
  • Handling missing data
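
As a minimal sketch of these preprocessing steps, we might drop the columns we do not intend to use, fill in the missing values, and encode the string variables as numeric categories. The specific choices below (dropping Cabin, median imputation for Age, and so on) are illustrative assumptions rather than the only reasonable approach:

    import pandas as pd

    train_df = pd.read_csv('train.csv')

    # Exclude features that we do not intend to feed to the model
    train_df = train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

    # Handle missing data: fill missing ages with the median age and
    # missing embarkation ports with the most frequent port
    train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
    train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])

    # Categorize/encode string variables as numeric codes
    train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
    train_df['Embarked'] = train_df['Embarked'].astype('category').cat.codes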

There are various algorithms that we can use to tackle this classification problem, including the following (a model-fitting sketch follows the list):

  • Decision trees
  • Neural networks
  • Random forests
  • Support vector machines
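
As a sketch of fitting one of these models with scikit-learn, we can train a random forest on the preprocessed train_df from the previous sketch. The feature selection and hyperparameters here are illustrative assumptions, not a prescribed solution:

    from sklearn.ensemble import RandomForestClassifier

    # Separate the features from the Survived indicator variable
    X_train = train_df.drop(['Survived', 'PassengerId'], axis=1)
    y_train = train_df['Survived']

    # Fit a random forest classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    # Accuracy on the training data itself; this is an optimistic estimate,
    # as discussed in the section on overfitting that follows
    print(clf.score(X_train, y_train))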

The problem of overfitting

Overfitting is a well-known problem in machine learning, whereby the model memorizes the specifics of the training data instead of learning patterns that generalize, leading to near-perfect results on the training data and abysmal results on unseen test data.

In order to detect and mitigate overfitting, the 10-fold cross-validation technique can be used: the training data is split into 10 folds, the model is repeatedly trained on 9 folds and evaluated on the remaining held-out fold, and the scores are averaged to give a more realistic estimate of performance on unseen data, as sketched below.
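
The following is a minimal sketch of 10-fold cross-validation with scikit-learn, assuming the X_train and y_train objects from the earlier model-fitting sketch:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # Split the training data into 10 folds; train on 9 and score on the
    # held-out fold, so that every fold serves once as the validation set
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    print(scores.mean(), scores.std())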
