To illustrate how we can use pandas to assist us at the start of our machine learning journey, we will apply it to a classic problem hosted on the Kaggle website (http://www.kaggle.com). Kaggle is a competition platform for machine learning problems. The idea behind Kaggle is to enable companies that are interested in solving predictive analytics problems with their data to post their data on Kaggle and invite data scientists to propose solutions to their problems. A competition runs over a period of time, and the rankings of the competitors are posted on a leaderboard. At the end of the competition, the top-ranked competitors receive cash prizes.
The classic problem that we will study to illustrate the use of pandas for machine learning with scikit-learn is the Titanic: Machine Learning from Disaster problem, hosted on Kaggle as its introductory machine learning competition. The dataset involved is a raw dataset, so pandas is very useful in preprocessing and cleansing the data before it is fed as input to the machine learning algorithms implemented in scikit-learn.
The dataset for the Titanic consists of the passenger manifest for the doomed voyage, along with various features and an indicator variable telling whether the passenger survived the sinking of the ship or not. The essence of the problem is to predict, given a passenger and his/her associated features, whether this passenger survived the sinking of the Titanic.
The data consists of two datasets: a training dataset and a test dataset. The training dataset consists of 891 passenger cases, and the test dataset consists of 418 passenger cases.
The training dataset also contains 11 variables, of which 10 are features and 1 is the dependent/indicator variable, Survived, which indicates whether the passenger survived the disaster or not.
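As a minimal sketch of loading and inspecting such data with pandas, the snippet below parses a hypothetical two-row sample in the same column layout as Kaggle's training file; in practice you would call pd.read_csv("train.csv") on the downloaded file (the file name is Kaggle's, the sample rows here are illustrative):

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout of Kaggle's train.csv;
# replace the StringIO buffer with the path to the real downloaded file.
csv_data = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
)
train_df = pd.read_csv(csv_data)

# Quick structural checks: dimensions and the indicator variable.
print(train_df.shape)
print(train_df["Survived"].tolist())
```

The `shape` attribute confirms the number of passenger cases and variables, and selecting the Survived column isolates the indicator variable from the features.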
The feature variables are as follows: Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
We can make use of pandas to help us preprocess the data, for example by handling missing values, encoding categorical variables as numbers, and dropping uninformative columns.
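These preprocessing steps can be sketched as follows on a small illustrative frame (the exact recipe is an assumption, not the book's; the column names mirror the Titanic features):

```python
import pandas as pd

# Tiny illustrative frame with the kinds of gaps found in the raw Titanic data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 38.0],
    "Embarked": ["S", "C", None],
    "Cabin": [None, "C85", None],
})

# 1. Fill missing numeric values with the column median.
df["Age"] = df["Age"].fillna(df["Age"].median())
# 2. Fill missing categorical values with the most frequent value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
# 3. Encode a categorical feature as numbers so scikit-learn can consume it.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
# 4. Drop a mostly empty column such as Cabin.
df = df.drop(columns=["Cabin"])

print(df)
```

Each step uses a single pandas method, which is exactly what makes pandas convenient for this kind of cleansing before handing the data to scikit-learn.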
There are various classification algorithms in scikit-learn that we can use to tackle this problem.
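As a hedged sketch of how such algorithms are applied, the snippet below fits two common scikit-learn classifiers on a tiny hypothetical feature matrix (the two columns stand in for an encoded Pclass and Sex; the data and choice of models are illustrative, not the book's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [Pclass, Sex (0=male, 1=female)] and Survived labels.
X = np.array([[3, 0], [1, 1], [3, 1], [1, 0], [2, 1], [3, 0]])
y = np.array([0, 1, 1, 0, 1, 0])

# Every scikit-learn classifier follows the same fit/score interface.
results = {}
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X, y)
    results[type(model).__name__] = model.score(X, y)

print(results)
```

The uniform fit/predict/score interface is what makes it easy to swap one algorithm for another when tackling the problem.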
Overfitting is a well-known problem in machine learning, whereby the program memorizes the specific data that it is fed as input, leading to perfect results on the training data and abysmal results on the test data.
In order to reduce overfitting, the 10-fold cross-validation technique can be used: the training data is split into 10 folds, and the model is repeatedly trained on 9 folds and validated on the remaining one, giving a more reliable estimate of its performance on unseen data.
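The procedure just described can be sketched with scikit-learn's cross_val_score helper; the synthetic data below merely stands in for the preprocessed Titanic features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data standing in for the preprocessed training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# 10-fold cross-validation: each fold serves once as the validation set
# while the model is trained on the other nine folds.
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores.mean())
```

The mean of the 10 fold scores is a far better guide to test-set performance than the training accuracy, which is what makes cross-validation an effective check against overfitting.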