Each input observation is a 28 pixel by 28 pixel grayscale image. An image like this one is represented on disk as a 28x28 matrix of values between 0 and 255, where each value is the intensity of that pixel. At this point, we only know how to train networks on flat input vectors (we will learn a better way to handle images later), so we will flatten each 28x28 matrix into a 1 x 784 input vector.
Once we stack all those 1x784 vectors, we are left with a 50,000 x 784 training set.
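The flattening and stacking can be sketched in a few lines of NumPy. The array of random values below is a hypothetical stand-in for the real MNIST training images; the variable names are illustrative, not from any particular loader:

```python
import numpy as np

# Stand-in for the MNIST training set:
# 50,000 grayscale images, each 28x28, pixel values 0-255.
images = np.random.randint(0, 256, size=(50000, 28, 28), dtype=np.uint8)

# Flatten each 28x28 matrix into a 784-element row vector,
# stacking them into one 50,000 x 784 training matrix.
X_train = images.reshape(50000, 28 * 28)

print(X_train.shape)  # (50000, 784)
```

Note that `reshape` only changes how the same pixel values are indexed; no data is copied or lost, but the spatial relationship between neighboring rows of the image is discarded.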
If you are experienced with convolutional neural networks, you're probably rolling your eyes right now; if you aren't, you'll see a much better way to do this soon. Either way, don't skip this chapter too fast. I think a flattened MNIST is a really great dataset because it looks and behaves a lot like many of the complex real-life problems we encounter in domains with many inputs (for example, IoT, manufacturing, biological, pharma, and medical use cases).