Loading the dataset

While maybe not the most fun part of a machine learning problem, loading the data is an important step. I'll cover my data loading methodology here so that you can get a feel for how I handle it.

from sklearn.preprocessing import StandardScaler
import pandas as pd

TRAIN_DATA = "./data/train/train_data.csv"
VAL_DATA = "./data/val/val_data.csv"
TEST_DATA = "./data/test/test_data.csv"

def load_data():
"""Loads train, val, and test datasets from disk"""
train = pd.read_csv(TRAIN_DATA)
val = pd.read_csv(VAL_DATA)
test = pd.read_csv(TEST_DATA)

# we will use sklearn's StandardScaler to scale our data to 0 mean, unit variance.
scaler = StandardScaler()
train = scaler.fit_transform(train)
val = scaler.transform(val)
test = scaler.transform(test)
# we will use a dict to keep all this data tidy.
data = dict()

data["train_y"] = train[:, 10]
data["train_X"] = train[:, 0:9]
data["val_y"] = val[:, 10]
data["val_X"] = val[:, 0:9]
data["test_y"] = test[:, 10]
data["test_X"] = test[:, 0:9]
# it's a good idea to keep the scaler (or at least the mean/variance) so we can unscale predictions
data["scaler"] = scaler
return data

When I'm reading data from CSV, Excel, or even a DBMS, my first step is usually loading it into a pandas DataFrame.
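To make that concrete, here is a minimal sketch of loading from each of those sources. The file names, sheet name, and connection string are placeholders for illustration, not files from this project.

import pandas as pd
from sqlalchemy import create_engine

# from a CSV file
df_csv = pd.read_csv("some_table.csv")

# from an Excel workbook (requires an Excel engine such as openpyxl)
df_xlsx = pd.read_excel("some_workbook.xlsx", sheet_name="Sheet1")

# from a DBMS, via a SQLAlchemy engine
engine = create_engine("sqlite:///some_database.db")
df_sql = pd.read_sql("SELECT * FROM some_table", engine)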

It's important to normalize our data so that each feature is on a comparable scale, and so that all those scales fall within the bounds of our activation functions. Here, I used scikit-learn's StandardScaler to accomplish this task.
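For intuition, StandardScaler simply subtracts each column's training-set mean and divides by its standard deviation. The tiny array below is made up purely to illustrate that behavior; it is not part of this project's data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(X)

# manual equivalent: per-column (x - mean) / std, using the population std
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))  # True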

This gives us an overall dataset with shape (4898, 10). Our target variable, alcohol, is given as a percentage between 8% and 14.2%.

I've randomly sampled and divided the data into train, val, and test datasets prior to loading the data, so we don't have to worry about that here.
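If you'd like to reproduce that kind of split yourself, here is one way it could be done with scikit-learn's train_test_split. The source file name wine.csv, the 80/10/10 proportions, and the random seed are my assumptions for illustration, not the exact split used for this project.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("./data/wine.csv")  # hypothetical single source file

# hold out 20% of the rows, then split that holdout evenly into val and test
train, holdout = train_test_split(df, test_size=0.2, random_state=42)
val, test = train_test_split(holdout, test_size=0.5, random_state=42)

train.to_csv("./data/train/train_data.csv", index=False)
val.to_csv("./data/val/val_data.csv", index=False)
test.to_csv("./data/test/test_data.csv", index=False)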

Lastly, the load_data() function returns a dictionary that keeps everything tidy and in one place. If you see me reference data["train_X"] later, just know that I'm referencing the training features that I've stored in that dictionary of data.
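As a quick usage sketch, here's how that dictionary might be consumed later, including mapping a scaled target value back to an alcohol percentage with the stored scaler's statistics. The variable names here are mine, chosen just for illustration.

data = load_data()

print(data["train_X"].shape)  # feature matrix for training
print(data["train_y"].shape)  # scaled alcohol values for training

# because the whole row was scaled, a scaled target value can be mapped back
# to an alcohol percentage using the target column's mean and scale
scaled_value = data["val_y"][0]   # stand-in for a model prediction
scaler = data["scaler"]
alcohol_pct = scaled_value * scaler.scale_[-1] + scaler.mean_[-1]
print(alcohol_pct)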

The code and data for this project are both available on the book's GitHub site (https://github.com/mbernico/deep_learning_quick_reference).
