Have you ever heard a person humming a melody, and identified the song? It might be easy for you, but I’m comically tone-deaf when it comes to music. Humming, of itself, is an approximation of a song. An even better approximation could be singing. Include some instrumentals, and sometimes a cover of a song sounds indistinguishable from the original.
Instead of songs, in this chapter, you’ll approximate functions. Functions are a general notion of relations between inputs and outputs. In machine learning, you typically want to find the function that relates inputs to outputs. Finding the best possible function fit is difficult, but approximating the function is much easier.
Conveniently, artificial neural networks are a model in machine learning that can approximate any function. As you’ve learned, your model is a function that gives the output you’re looking for, given the inputs you have. In ML terms, given training data, you want to build a neural network model that best approximates the implicit function that might have generated the data—one that might not give you the exact answer but that’s good enough to be useful.
So far, you’ve generated models by explicitly designing a function, whether it be linear, polynomial, or something more complicated. Neural networks enable a little bit of leeway when it comes to picking out the right function, and consequently the right model. In theory, a neural network can model general-purpose types of transformation—where you don’t need to know much at all about the function being modeled!
After section 7.1 introduces neural networks, you’ll learn how to use autoencoders, which encode data into smaller, faster representations, in section 7.2.
If you’ve heard about neural networks, you’ve probably seen diagrams of nodes and edges connected in a complicated mesh. That visualization is mostly inspired by biology—specifically, neurons in the brain. As it turns out, it’s also a convenient way to visualize functions, such as f(x) = w × x + b, shown in figure 7.1.
As a reminder, a linear model is a set of linear functions; for example, f(x) = w × x + b, where (w, b) is the vector of parameters. The learning algorithm adjusts the values of w and b until it finds a combination that best matches the data. After the algorithm successfully converges, it’ll have found the best possible linear function to describe the data.
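To make the idea of adjusting w and b concrete, here is a minimal sketch of gradient descent on a linear model. The toy data, the function name fit_linear, the learning rate, and the step count are all assumptions chosen for illustration; they are not taken from the chapter.

```python
# Hypothetical toy data generated from y = 2x + 1 (not from the chapter).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def fit_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit f(x) = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Move each parameter a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_linear(xs, ys)
print(w, b)  # close to 2 and 1
```

Because the toy data is exactly linear, the learned parameters settle near the values that generated it. With noisy data, the result would be the best-fitting line rather than an exact match.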
Linear is a good place to start, but the real world isn’t always that pretty. And thus, we dive into the type of machine learning responsible for TensorFlow’s inception; this chapter is your introduction to a type of model called an artificial neural network, which can approximate arbitrary functions (not just linear ones).
Is f(x) = |x| a linear function?
ANSWER
No. It’s two linear functions stitched together at zero, and that’s not a single straight line.
To incorporate the concept of nonlinearity, it’s effective to apply a nonlinear function, called the activation function, to each neuron’s output. Three of the most commonly used activation functions are sigmoid (sig), hyperbolic tangent (tanh), and a type of ramp function called a rectified linear unit (ReLU), plotted in figure 7.2.
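The three activation functions can be sketched in a few lines of plain Python. This is only an illustration of their definitions, not numerically hardened code: the sigmoid below would overflow for very large negative inputs.

```python
import math

def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squash z into the range (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Pass positive values through unchanged and clamp negatives to zero."""
    return max(0.0, z)

# Print each function at a negative, zero, and positive input.
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```

Sigmoid and tanh flatten out at both extremes, whereas ReLU grows without bound for positive inputs.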
You don’t have to worry too much about which activation function is better under what circumstances. That’s still an active research topic. Feel free to experiment with the three shown in figure 7.2. Usually, the best one is chosen by using cross-validation to determine which one gives the best model, given the dataset you’re working with. Remember our confusion matrix in chapter 4? You test which model gives the fewest false positives or false negatives, or whatever other criterion best suits your needs.
The sigmoid function isn’t new to you. As you may recall, the logistic regression classifier in chapter 4 applied this sigmoid function to the linear function w × x + b. The neural network model in figure 7.3 represents the function f(x) = sig(w × x + b). It’s a one-input, one-output network, where w and b are the parameters of this model.
If you have two inputs (x1 and x2), you can modify your neural network to look like the one in figure 7.4. Given training data and a cost function, the parameters to be learned are w1, w2, and b. When trying to model data, having multiple inputs to a function is common. For example, image classification takes the entire image (pixel by pixel) as the input.
Naturally, you can generalize to an arbitrary number of inputs (x1, x2, ..., xn). The corresponding neural network represents the function f(x1, ..., xn) = sig(wn × xn + ... + w1 × x1 + b), as shown in figure 7.5.
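As a rough sketch, the n-input formula above translates directly into a weighted sum followed by the sigmoid. The function name neuron and the example numbers are assumptions made up for illustration.

```python
import math

def neuron(inputs, weights, bias):
    """Compute sig(w1 * x1 + ... + wn * xn + b) for one neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs and parameters whose weighted sum is exactly zero.
out = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.0], 0.0)
print(out)  # sigmoid of 0
```

Training would adjust the weights and the bias; here they are fixed by hand to show only the forward computation.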
So far, you’ve dealt with only an input layer and an output layer. Nothing’s stopping you from arbitrarily adding neurons in between. Neurons that are used as neither input nor output are called hidden neurons. They’re hidden from the input and output interfaces of the neural network, so no one can directly influence their values. A hidden layer is any collection of hidden neurons that don’t connect to each other, as shown in figure 7.6. Adding more hidden layers greatly improves the expressive power of the network.
As long as the activation function is nonlinear, a neural network with at least one hidden layer can approximate arbitrary continuous functions as closely as you like, provided the hidden layer has enough neurons. In linear models, no matter what parameters are learned, the function remains linear. The nonlinear neural network model with a hidden layer, on the other hand, is flexible enough to approximately represent any function! What a time to be alive!
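A small example may help show what a hidden layer buys you. The XOR function cannot be written as w1 × x1 + w2 × x2 + b, because any such function satisfies f(0, 0) + f(1, 1) = f(0, 1) + f(1, 0), whereas XOR gives 0 on one side and 2 on the other. The sketch below represents XOR with two hidden ReLU neurons. The weights are hand-picked assumptions for illustration, not learned values.

```python
def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)

def xor_network(x1, x2):
    """One hidden layer of two ReLU neurons followed by a linear output."""
    # Hidden layer: both neurons look at x1 + x2 with different biases.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output neuron: a linear combination of the hidden values.
    return h1 - 2.0 * h2

# Evaluate the network on all four binary input pairs.
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, xor_network(a, b))
```

In practice a learning algorithm finds such weights from data; the point here is only that a hidden layer with a nonlinear activation makes the function representable at all.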
TensorFlow comes with many helper functions to help you obtain the parameters of a neural network in an efficient way. You’ll see how to invoke those tools in this chapter when you start using your first neural network architecture: an autoencoder.
An autoencoder is a type of neural network that tries to learn parameters that make the output as close to the input as possible. An obvious way to do so is to return the input directly, as shown in figure 7.7.
But an autoencoder is more interesting than that. It contains a small hidden layer! If that hidden layer has a smaller dimension than the input, the hidden layer is a compression of your data, called encoding.
A couple of audio formats are out there, but the most popular may be MP3 because of its relatively small file size. You may have already guessed that such efficient storage comes with a trade-off. The algorithm to generate an MP3 file takes original uncompressed audio and shrinks it into a much smaller file that sounds approximately the same to your ears. But it’s lossy, meaning that you won’t be able to completely recover the original uncompressed audio from the encoded version. Similarly, in this chapter, we want to reduce the dimensionality of the data to make it more workable, but not necessarily create a perfect reproduction.
The process of reconstructing the input from the hidden layer is called decoding. Figure 7.8 shows an exaggerated example of an autoencoder.
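Encoding is a great way to reduce the dimensions of the input. For example, a 256 × 256 grayscale image has 65,536 pixel values, so if you can represent it in just 100 hidden nodes, you’ve reduced each data item by a factor of more than 600 (or nearly 2,000 if the image has three color channels)!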
Let x denote the input vector (x1, x2, ..., xn), and let y denote the output vector (y1, y2, ..., yn). Lastly, let w and w' denote the encoder and decoder weights, respectively. What’s a possible cost function to train this neural network?
ANSWER
See the loss function in listing 7.3.
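One possible cost, written out as a sketch, is the root mean squared difference between input and output. It assumes a tanh encoder and a linear decoder, as in this chapter's constructor listing. The bias symbols b and b' are notation introduced here for illustration.

```latex
h = \tanh(x\,w + b), \qquad
y = h\,w' + b', \qquad
\mathrm{cost}(x, y) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - y_i\right)^2}
```

The cost is zero only when every output component matches its input, so minimizing it pushes the decoder toward reproducing x from the smaller hidden vector h.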
It makes sense to use an object-oriented programming style to implement an autoencoder. That way, you can later reuse the class in other applications without worrying about tightly coupled code. Creating your code as outlined in listing 7.1 helps build deeper architectures, such as a stacked autoencoder, which has been known to perform better empirically.
Generally, with neural networks, adding more hidden layers seems to improve performance if you have enough data to not overfit the model.
class Autoencoder:
    def __init__(self, input_dim, hidden_dim):
        pass  # initialize variables

    def train(self, data):
        pass  # train on a dataset

    def test(self, data):
        pass  # test on new data
Open a new Python source file, and call it autoencoder.py. This file will define the autoencoder class that you’ll use from a separate piece of code.
The constructor will set up all the TensorFlow variables, placeholders, optimizers, and operators. Anything that doesn’t immediately need a session can go in the constructor. Because you’re dealing with two sets of weights and biases (one for the encoding step and the other for the decoding step), you can use TensorFlow’s name scopes to disambiguate a variable’s name.
For instance, the following listing shows an example of defining a variable within a named scope. Now you can seamlessly save and restore this variable without worrying about name collisions.
with tf.name_scope('encode'):
    weights = tf.Variable(tf.random_normal([input_dim, hidden_dim], dtype=tf.float32), name='weights')
    biases = tf.Variable(tf.zeros([hidden_dim]), name='biases')
Moving on, let’s implement the constructor, as shown in the following listing.
import tensorflow as tf  # TensorFlow 1.x API
import numpy as np

class Autoencoder:
    def __init__(self, input_dim, hidden_dim, epoch=250, learning_rate=0.001):
        self.epoch = epoch                  # number of learning cycles
        self.learning_rate = learning_rate  # step size for the optimizer

        # Input layer: any number of rows, each with input_dim features
        x = tf.placeholder(dtype=tf.float32, shape=[None, input_dim])

        # Encoder weights and biases, kept apart by a name scope
        with tf.name_scope('encode'):
            weights = tf.Variable(tf.random_normal([input_dim, hidden_dim], dtype=tf.float32), name='weights')
            biases = tf.Variable(tf.zeros([hidden_dim]), name='biases')
            encoded = tf.nn.tanh(tf.matmul(x, weights) + biases)

        # Decoder weights and biases under their own name scope
        with tf.name_scope('decode'):
            weights = tf.Variable(tf.random_normal([hidden_dim, input_dim], dtype=tf.float32), name='weights')
            biases = tf.Variable(tf.zeros([input_dim]), name='biases')
            decoded = tf.matmul(encoded, weights) + biases

        self.x = x
        self.encoded = encoded
        self.decoded = decoded

        # Reconstruction cost: root mean squared error between input and output
        self.loss = tf.sqrt(tf.reduce_mean(tf.square(tf.subtract(self.x, self.decoded))))
        self.train_op = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)
        self.saver = tf.train.Saver()  # saves and restores the learned parameters
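To see what the encode and decode steps compute without running a TensorFlow session, here is a plain-Python sketch of the same forward pass. The helper names encode and decode and all parameter values are assumptions for illustration; in the real class the weights start out random and are learned during training.

```python
import math

def encode(x, weights, biases):
    """Map an input vector to the hidden layer: tanh(x . W + b)."""
    hidden_dim = len(biases)
    return [math.tanh(sum(x[i] * weights[i][k] for i in range(len(x))) + biases[k])
            for k in range(hidden_dim)]

def decode(h, weights, biases):
    """Map the hidden vector back to input space: h . W' + b' (no activation)."""
    output_dim = len(biases)
    return [sum(h[k] * weights[k][j] for k in range(len(h))) + biases[j]
            for j in range(output_dim)]

# Hypothetical hand-picked parameters: 4 inputs, 1 hidden neuron.
enc_w = [[0.1], [0.1], [0.1], [0.1]]  # shape [input_dim, hidden_dim]
enc_b = [0.0]
dec_w = [[1.0, 2.0, 3.0, 4.0]]        # shape [hidden_dim, input_dim]
dec_b = [0.0, 0.0, 0.0, 0.0]

x = [1.0, 2.0, 3.0, 4.0]
hidden = encode(x, enc_w, enc_b)
reconstructed = decode(hidden, dec_w, dec_b)
print(len(hidden), len(reconstructed))  # hidden is smaller than the input
```

The weight shapes mirror the [input_dim, hidden_dim] and [hidden_dim, input_dim] shapes in the constructor, which is why the decoder output has the same length as the input.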
Now, in the next listing, you’ll define a class method called train that will receive a dataset and learn parameters to minimize its loss.
    def train(self, data):
        num_samples = len(data)
        with tf.Session() as sess:  # start a session and initialize all variables
            sess.run(tf.global_variables_initializer())
            for i in range(self.epoch):  # iterate over the number of cycles
                for j in range(num_samples):  # train one sample at a time
                    l, _ = sess.run([self.loss, self.train_op],
                                    feed_dict={self.x: [data[j]]})
                if i % 10 == 0:  # report progress every 10 epochs
                    print('epoch {0}: loss = {1}'.format(i, l))
                    self.saver.save(sess, './model.ckpt')
            self.saver.save(sess, './model.ckpt')  # save the learned parameters to file
You now have enough code to design an algorithm that learns an autoencoder from arbitrary data. Before you start using this class, let’s create one more method. As shown in the next listing, the test method will let you evaluate the autoencoder on new data.
    def test(self, data):
        with tf.Session() as sess:
            self.saver.restore(sess, './model.ckpt')  # load the learned parameters
            # Encode the input and reconstruct it
            hidden, reconstructed = sess.run([self.encoded, self.decoded],
                                             feed_dict={self.x: data})
        print('input', data)
        print('compressed', hidden)
        print('reconstructed', reconstructed)
        return reconstructed
Finally, create a new Python source file called main.py, and use your Autoencoder class, as shown in the following listing.
from autoencoder import Autoencoder
from sklearn import datasets

hidden_dim = 1
data = datasets.load_iris().data
input_dim = len(data[0])
ae = Autoencoder(input_dim, hidden_dim)
ae.train(data)
ae.test([[8, 4, 6, 2]])
Running the train function will output debug info about how the loss decreases over the epochs. The test function shows info about the encoding and decoding process:
('input', [[8, 4, 6, 2]])
('compressed', array([[ 0.78238308]], dtype=float32))
('reconstructed', array([[ 6.87756062, 2.79838109, 6.25144577, 2.23120356]], dtype=float32))
Notice that you’re able to compress a four-dimensional vector into just one dimension and then decode it back into a four-dimensional vector with some loss in data.
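To put a number on that loss, you could compare the printed input and reconstruction directly. The sketch below assumes the same root-mean-squared measure that the constructor uses for its loss. The helper name rmse is introduced here for illustration.

```python
import math

# Values copied from the sample output shown above.
original = [8, 4, 6, 2]
reconstructed = [6.87756062, 2.79838109, 6.25144577, 2.23120356]

def rmse(a, b):
    """Root mean squared error between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

error = rmse(original, reconstructed)
print(error)  # average size of the per-component reconstruction error
```

A value of zero would mean a perfect reconstruction, which is not expected when four numbers are squeezed through a single hidden neuron.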
Training a network one sample at a time is the safest bet if you’re not pressured by time. But if your network training is taking longer than desired, one solution is to train it with multiple data inputs at a time, called batch training.
Typically, as the batch size increases, the algorithm speeds up but has a lower likelihood of successfully converging. It’s a double-edged sword. The following listing defines a helper function, get_batch, that draws a random batch of samples without replacement; you’ll use it later.
def get_batch(X, size):
    # Pick `size` distinct row indices at random and return those rows
    a = np.random.choice(len(X), size, replace=False)
    return X[a]
To use batch learning, you’ll need to modify the train method from listing 7.4. The batch version is shown in the following listing. It inserts an additional inner loop for each batch of data. Typically, the number of batch iterations should be enough so that all data is covered in the same epoch.
    def train(self, data, batch_size=10):
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for i in range(self.epoch):
                for j in range(500):  # loop through several batch selections
                    # Draw a random batch of batch_size samples
                    batch_data = get_batch(data, batch_size)
                    l, _ = sess.run([self.loss, self.train_op],
                                    feed_dict={self.x: batch_data})
                if i % 10 == 0:
                    print('epoch {0}: loss = {1}'.format(i, l))
                    self.saver.save(sess, './model.ckpt')
            self.saver.save(sess, './model.ckpt')
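If you want each epoch to cover every sample exactly once, rather than relying on enough random draws, one common alternative is to shuffle the indices once per epoch and slice them into batches. This is a sketch of that alternative, not the strategy used by get_batch. The name epoch_batches is hypothetical, and 150 is chosen to match the size of the iris dataset.

```python
import math
import random

def epoch_batches(num_samples, batch_size):
    """Yield lists of sample indices that together cover every sample once."""
    indices = list(range(num_samples))
    random.shuffle(indices)  # new random order each time the function is called
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

num_samples, batch_size = 150, 10
batches = list(epoch_batches(num_samples, batch_size))

# Number of batch iterations needed to see all data in one epoch.
batches_per_epoch = math.ceil(num_samples / batch_size)
print(len(batches), batches_per_epoch)
```

Each yielded list could be used to index the data array in place of the random batch, and the inner loop would then run batches_per_epoch times instead of a fixed count.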
Most neural networks, like your autoencoder, accept only one-dimensional input. Pixels of an image, on the other hand, are indexed by both rows and columns. Moreover, if a pixel is in color, it has a value for its red, green, and blue concentration, as shown in figure 7.9.
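A convenient way to manage the higher dimensions of an image involves two steps: first, convert the image to grayscale by merging the red, green, and blue values into a single intensity per pixel; second, rearrange the pixels in row-major order so that the image becomes a one-dimensional vector.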
You can use images in TensorFlow in many ways. If you have pictures lying around on your hard drive, you can load them using SciPy. The following listing shows you how to load an image in grayscale, resize it, and represent it in row-major order. Note that imread and imresize were deprecated in SciPy 1.0 and removed in later releases, so the listing needs an older SciPy version.
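from scipy.misc import imread, imresize  # available only in older SciPy releases

gray_image = imread(filepath, True)               # load the image as grayscale
small_gray_image = imresize(gray_image, 1. / 8.)  # shrink it to one-eighth size
x = small_gray_image.flatten()                    # convert to a one-dimensional vector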
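Row-major order means that the rows of the image are laid end to end, so the pixel at (row, col) ends up at index row × width + col. The following plain-Python sketch illustrates the layout on a tiny made-up image. The helper name flatten_row_major is hypothetical.

```python
def flatten_row_major(image):
    """Concatenate the rows of a 2-D list of pixel values into one flat list."""
    return [pixel for row in image for pixel in row]

# Hypothetical 2 x 3 grayscale image.
image = [[10, 20, 30],
         [40, 50, 60]]

flat = flatten_row_major(image)
width = len(image[0])

# The pixel at (row, col) lands at index row * width + col.
row, col = 1, 2
print(flat, flat[row * width + col])
```

NumPy's flatten method uses this same row-major layout by default.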
Image processing is a lively field of research, so datasets are readily available for you to use, instead of using your own limited images. For instance, a dataset called CIFAR-10 contains 60,000 labeled images, each 32 × 32 in size.
Can you name other online image datasets? Search online and look around for more!
ANSWER
Perhaps the most used in the deep-learning community is ImageNet (www.image-net.org). A great list can also be found online at http://deeplearning.net/datasets.
Download the Python dataset from www.cs.toronto.edu/~kriz/cifar.html. Place the extracted cifar-10-batches-py folder in your working directory. The following listing is provided from the CIFAR-10 web page; add the code to a new file called main_imgs.py.
import pickle

def unpickle(file):
    # Read a CIFAR-10 batch file and return its contents as a dictionary
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='latin1')
    return batch
Let’s read each of the dataset files by using the unpickle function you just created. The CIFAR-10 dataset contains five training files, each prefixed with data_batch_ and followed by a number, plus a separate test_batch file. Each file contains information about the image data and corresponding label. The following listing shows how to loop through the training files and append the datasets to memory.
import numpy as np

names = unpickle('./cifar-10-batches-py/batches.meta')['label_names']
data, labels = [], []
for i in range(1, 6):  # loop over the five training batch files
    filename = './cifar-10-batches-py/data_batch_' + str(i)
    batch_data = unpickle(filename)  # load the file into a Python dictionary
    if len(data) > 0:
        data = np.vstack((data, batch_data['data']))        # rows are image samples
        labels = np.hstack((labels, batch_data['labels']))  # labels are one-dimensional
    else:
        data = batch_data['data']
        labels = batch_data['labels']
Each image is represented as a series of red pixels, followed by green pixels, and then blue pixels. Listing 7.12 creates a helper function to convert the image into grayscale by averaging the red, green, and blue values.
You can achieve more-realistic grayscale in other ways, but this approach of averaging the three values gets the job done. Human perception is more sensitive to green light, so in some other versions of grayscaling, green values might have a higher weight in the averaging.
def grayscale(a):
    # Split each row into 3 color planes of 32 x 32, average them, and flatten again
    return a.reshape(a.shape[0], 3, 32, 32).mean(1).reshape(a.shape[0], -1)

data = grayscale(data)
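For comparison, here is a sketch of the weighted alternative mentioned in the note above. It assumes the widely used luma weights 0.299, 0.587, and 0.114 for red, green, and blue, and the same layout as CIFAR-10 rows (all red values, then all green, then all blue). The function name weighted_grayscale and the two-pixel example are made up for illustration.

```python
def weighted_grayscale(row, weights=(0.299, 0.587, 0.114)):
    """Convert one flat image row (all reds, then greens, then blues) to gray."""
    n = len(row) // 3
    red, green, blue = row[:n], row[n:2 * n], row[2 * n:]
    wr, wg, wb = weights
    # Weighted sum per pixel; green contributes the most.
    return [wr * r + wg * g + wb * b for r, g, b in zip(red, green, blue)]

# Hypothetical two-pixel image: pixel 0 is pure green, pixel 1 is pure blue.
row = [0, 0, 255, 0, 0, 255]
gray = weighted_grayscale(row)
print(gray)  # the green pixel comes out brighter than the blue one
```

With plain averaging, both pixels would map to the same gray value of 85, so the weighting is what preserves the perceived difference in brightness.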
Lastly, let’s collect all images of a certain class, such as horse. You’ll run your autoencoder on all pictures of horses, as shown in the following listing.
from autoencoder import Autoencoder

x = np.array(data)
y = np.array(labels)
horse_indices = np.where(y == 7)[0]  # label 7 is the horse class
horse_x = x[horse_indices]
print(np.shape(horse_x))  # (5000, 1024) after the grayscale conversion

input_dim = np.shape(horse_x)[1]
hidden_dim = 100
ae = Autoencoder(input_dim, hidden_dim)
ae.train(horse_x)
You can now encode images similar to your training dataset into just 100 numbers. This autoencoder model is one of the simplest, so clearly it’ll be a lossy encoding. Beware: running this code may take up to 10 minutes. The output traces the loss value every 10 epochs:
epoch 0: loss = 99.8635025024
epoch 10: loss = 35.3869667053
epoch 20: loss = 15.9411172867
epoch 30: loss = 7.66391372681
epoch 40: loss = 1.39575612545
epoch 50: loss = 0.00389165547676
epoch 60: loss = 0.00203850422986
epoch 70: loss = 0.00186171964742
epoch 80: loss = 0.00231492402963
epoch 90: loss = 0.00166488380637
epoch 100: loss = 0.00172081717756
epoch 110: loss = 0.0018497039564
epoch 120: loss = 0.00220602494664
epoch 130: loss = 0.00179589167237
epoch 140: loss = 0.00122790911701
epoch 150: loss = 0.0027100709267
epoch 160: loss = 0.00213225837797
epoch 170: loss = 0.00215123943053
epoch 180: loss = 0.00148373935372
epoch 190: loss = 0.00171591725666
See the book’s website or GitHub repo for a full example of the output: https://www.manning.com/books/machine-learning-with-tensorflow or http://mng.bz/D0Na.
This chapter introduced the most straightforward type of autoencoder, but other variants have been studied, each with their benefits and applications. Let’s take a look at a few: