MNIST was a grayscale example and we could represent each image as a pixel intensity value from 0 to 255, in a two-dimensional matrix. However, most of the time, we will be working with color images. Color images are actually three-dimensional matrices, where the dimensions are the image height, image width, and color. This results in a matrix with separate red, blue, and green values for each pixel in the image.
While we were previously showing two-dimensional filters, we can adapt the idea to three dimensions quite simply by performing the convolution between a (height, width, 3 (colors)) matrix and a 3 x 3 x 3 filter. In the end, we're still left with a two-dimensional output, as we take the elementwise product across all three axes of the matrix. As a reminder, these high-dimension matrices are typically called tensors and what we're doing is making them flow, as it were.