Loading the dataset

While maybe not the most fun part of a machine learning problem, loading the data is an important step. I'll cover my data loading methodology here so that you can get a feel for how I handle it.

from sklearn.preprocessing import StandardScaler
import pandas as pd

TRAIN_DATA = "./data/train/train_data.csv"
VAL_DATA = "./data/val/val_data.csv"
TEST_DATA = "./data/test/test_data.csv"

def load_data():
"""Loads train, val, and test datasets from disk"""
train = pd.read_csv(TRAIN_DATA)
val = pd.read_csv(VAL_DATA)
test = pd.read_csv(TEST_DATA)

# we will use sklearn's StandardScaler to scale our data to 0 mean, unit variance.
scaler = StandardScaler()
train = scaler.fit_transform(train)
val = scaler.transform(val)
test = scaler.transform(test)
# we will use a dict to keep all this data tidy.
data = dict()

data["train_y"] = train[:, 10]
data["train_X"] = train[:, 0:9]
data["val_y"] = val[:, 10]
data["val_X"] = val[:, 0:9]
data["test_y"] = test[:, 10]
data["test_X"] = test[:, 0:9]
# it's a good idea to keep the scaler (or at least the mean/variance) so we can unscale predictions
data["scaler"] = scaler
return data

When I'm reading data from CSV, Excel, or even a DBMS, my first step is usually loading it into a pandas DataFrame.
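To make that concrete, here is a minimal sketch of loading from each of those sources. The file names, sheet name, and connection string are placeholders for illustration, not files from this project.

import pandas as pd
from sqlalchemy import create_engine

# from a CSV file
df_csv = pd.read_csv("some_table.csv")

# from an Excel workbook (requires an Excel engine such as openpyxl)
df_xlsx = pd.read_excel("some_workbook.xlsx", sheet_name="Sheet1")

# from a DBMS, via a SQLAlchemy engine
engine = create_engine("sqlite:///some_database.db")
df_sql = pd.read_sql("SELECT * FROM some_table", engine)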

It's important to normalize our data so that each feature is on a comparable scale, and so that all those scales fall within the bounds of our activation functions. Here, I used scikit-learn's StandardScaler to accomplish this task.
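For intuition, StandardScaler simply subtracts each column's training-set mean and divides by its standard deviation. The tiny array below is made up purely to illustrate that behavior; it is not part of this project's data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(X)

# manual equivalent: per-column (x - mean) / std, using the population std
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))  # True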

This gives us an overall dataset with shape (4898, 10). Our target variable, alcohol, is given as a percentage between 8% and 14.2%.

I've randomly sampled and divided the data into train, val, and test datasets prior to loading the data, so we don't have to worry about that here.
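If you'd like to reproduce that kind of split yourself, here is one way it could be done with scikit-learn's train_test_split. The source file name wine.csv, the 80/10/10 proportions, and the random seed are my assumptions for illustration, not the exact split used for this project.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("./data/wine.csv")  # hypothetical single source file

# hold out 20% of the rows, then split that holdout evenly into val and test
train, holdout = train_test_split(df, test_size=0.2, random_state=42)
val, test = train_test_split(holdout, test_size=0.5, random_state=42)

train.to_csv("./data/train/train_data.csv", index=False)
val.to_csv("./data/val/val_data.csv", index=False)
test.to_csv("./data/test/test_data.csv", index=False)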

Lastly, the load_data() function returns a dictionary that keeps everything tidy and in one place. If you see me reference data["train_X"] later, just know that I'm referencing the training features that I've stored in that dictionary of data.
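As a quick usage sketch, here's how that dictionary might be consumed later, including mapping a scaled target value back to an alcohol percentage with the stored scaler's statistics. The variable names here are mine, chosen just for illustration.

data = load_data()

print(data["train_X"].shape)  # feature matrix for training
print(data["train_y"].shape)  # scaled alcohol values for training

# because the whole row was scaled, a scaled target value can be mapped back
# to an alcohol percentage using the target column's mean and scale
scaled_value = data["val_y"][0]   # stand-in for a model prediction
scaler = data["scaler"]
alcohol_pct = scaled_value * scaler.scale_[-1] + scaler.mean_[-1]
print(alcohol_pct)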

The code and data for this project are both available on the book's GitHub site (https://github.com/mbernico/deep_learning_quick_reference).
