Training and testing data

In this section, we're going to look at pulling in training and testing data. We'll be looking at loading the actual data, then we'll revisit normalization and one-hot encoding, and then we'll have a quick discussion about why we actually use training and testing datasets.

In this section, we'll be taking what we learned in the previous chapter about preparing image data and condensing it into just a few lines of code, as shown in the following screenshot:

Loading data
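
In code, that condensed preparation might look something like the following. This is a minimal sketch assuming the keras.datasets.mnist loader and the keras.utils.to_categorical helper; the variable names are illustrative, not necessarily the ones in the screenshot:

from keras.datasets import mnist
from keras.utils import to_categorical

# Load the training and testing images along with their labels.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize: pixel values run from 0 to 255, so divide by the maximum.
x_train = x_train / 255.0
x_test = x_test / 255.0

# One-hot encode the labels, so the digit 3 becomes
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)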

We load the training and testing data along with the training and testing outputs. Then, we normalize, which simply means dividing by the maximum value, which we know is going to be 255. Then, we convert the output variables into categorical, or one-hot, encodings. We do these two things (normalization and one-hot encoding) in exactly the same fashion for both our training and testing datasets; it's important that all of our data is prepared the same way before we attempt to use it in our machine learning model. Here's a quick note about shapes: the training data (both x and y) share the same first dimension:

Loading .shape (training)
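
You can reproduce what the screenshot shows by printing the .shape attribute of each array. Assuming the variables from the sketch above, the values in the comments are what the standard MNIST training split produces:

print(x_train.shape)  # (60000, 28, 28): 60000 images, each 28 x 28 pixels
print(y_train.shape)  # (60000, 10): 60000 labels, one-hot encoded over 10 classes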

The first dimension is 60000 in both cases, but look at the second and third dimensions (28 and 28), which are the width and height of an input image, and then at the 10. Those don't have to match, because what we're doing when we run this data through a model is transforming each 28 x 28 input into a 10-dimensional output.

In addition, look at the testing data. You can see that x is 10000 in the first dimension, followed by the same 28, 28 image size, and y is 10000 by 10, as shown in the following screenshot:

Loading .shape (testing)
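
The same check on the testing arrays, again assuming the variables from the sketch above:

print(x_test.shape)  # (10000, 28, 28): 10000 test images, same 28 x 28 size
print(y_test.shape)  # (10000, 10): 10000 one-hot labels, same 10 classes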

It's really important that these dimensions match up in the appropriate fashion. For the training set, the first dimension of your x and y values (your inputs and your outputs) must match, and the same must be true for your testing set. But also note that the second and third dimensions, 28 and 28, are the same for both the training and testing data, and the output dimension of 10 is the same for both as well. Failing to line these datasets up is one of the most common mistakes made when preparing data.
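
A cheap way to catch a mismatch early is to assert the alignment before training. This is just an illustrative sketch using the variable names from above, not something the chapter requires:

# The inputs and outputs must have the same number of examples.
assert x_train.shape[0] == y_train.shape[0]
assert x_test.shape[0] == y_test.shape[0]

# The per-example shapes must agree between the training and testing sets.
assert x_train.shape[1:] == x_test.shape[1:]  # image size: (28, 28)
assert y_train.shape[1:] == y_test.shape[1:]  # output classes: (10,)

But why do we bother holding out separate training and testing data at all? In a word: overfitting.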

Overfitting is essentially when your machine learning model memorizes a set of inputs. You can think of it as a very sophisticated hash table that has encoded the input and output mappings in a large set of numbers. But with machine learning, we don't want a hash table, even though we could easily build one. Instead, we want a model that can deal with unknown inputs and then predict the appropriate outputs. The testing data represents those unknown inputs. When you train your model on the training data and hold out the testing data, the held-out set lets you validate that your model can handle and predict on data it has never seen before.

All right, now that we've got our training and testing data loaded up, we'll move on to learning about Dropout and Flatten, and putting together an actual neural network.
