Stacked Denoising Autoencoders

While autoencoders are valuable tools in themselves, significant accuracy gains can be obtained by stacking autoencoders to form a deep network. This is achieved by feeding the representation created by one layer's encoder into the next layer's encoder as that layer's input.
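As a minimal sketch of this chaining (using an illustrative numpy encoder rather than the dA class we'll use later in this chapter), the flow of data looks like this:

import numpy

# Minimal sketch of stacking: each encoder consumes the previous
# layer's hidden representation. The sigmoid and weight matrices
# here are illustrative stand-ins, not the dA class used later.
def encode(x, W, b):
    return 1.0 / (1.0 + numpy.exp(-(numpy.dot(x, W) + b)))

rng = numpy.random.RandomState(0)
x = rng.rand(1, 280)                      # one 280-dimensional input
W1, b1 = rng.randn(280, 500) * 0.01, numpy.zeros(500)
W2, b2 = rng.randn(500, 500) * 0.01, numpy.zeros(500)

h1 = encode(x, W1, b1)   # first layer's representation of the input
h2 = encode(h1, W2, b2)  # second layer encodes the first layer's output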

Stacked denoising autoencoders (SdAs) are currently in use in many leading data science teams for sophisticated natural language analyses, as well as for a broad range of signal, image, and text analysis tasks.

The implementation of an SdA will be very familiar after the previous chapter's discussion of deep belief networks. The SdA is used in much the same way as the RBMs in our deep belief networks were used. Each layer of the deep architecture will have a dA and sigmoid component, with the autoencoder component being used to pretrain the sigmoid network. The performance measure used by a stacked denoising autoencoder is the training set error, with an intensive period of layer-wise pretraining used to gradually align network parameters before a final period of fine-tuning. During fine-tuning, the network is trained over fewer epochs but with larger update steps, with validation and test data used to measure its performance. The goal is to have the network converge by the end of fine-tuning in order to deliver an accurate result.
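Schematically, the two-stage schedule looks like the following sketch, in which network, pretrain_layer, finetune_step, and validate are all hypothetical helpers used only to convey the order of operations:

# Hypothetical helpers, shown only to convey the order of operations.
for layer in network.layers:                 # stage 1: layer-wise pretraining
    for epoch in xrange(pretraining_epochs):
        pretrain_layer(layer, train_data)    # unsupervised, one layer at a time

for epoch in xrange(finetune_epochs):        # stage 2: supervised fine-tuning
    finetune_step(network, train_data)       # gradient steps over labeled data
    score = validate(network, valid_data)    # monitor convergence as we go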

In addition to delivering on the typical advantages of deep networks (the ability to learn feature representations for complex or high-dimensional datasets, and the ability to train a model without extensive feature engineering), stacked autoencoders have an additional, interesting property.

Correctly configured stacked autoencoders can capture a hierarchical grouping of their input data. Successive layers of a stacked denoising autoencoder may learn increasingly high-level features. Where the first layer might learn some first-order features from the input data (such as edges in a photographic image), a second layer may learn some grouping of those first-order features (for instance, configurations of edges that correspond to contours or structural elements in the input image).

There's no golden rule for determining how many layers a network should have, or how large those layers should be, for a given problem. The best solution is usually to experiment with these hyperparameters until you find an optimal point. This experimentation is best done with a hyperparameter optimization technique or a genetic algorithm (subjects we'll discuss in later chapters of this book).
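As a minimal sketch of that kind of experiment, a simple random search over layer counts and sizes might look like the following (train_and_score is a hypothetical helper that trains an SdA with the given sizes and returns its validation error):

import random

# Minimal random search over layer configurations; train_and_score
# is a hypothetical helper assumed to train an SdA and return its
# validation error.
best_sizes, best_error = None, float('inf')
for trial in xrange(20):
    n_layers = random.choice([2, 3, 4])
    sizes = [random.choice([100, 170, 240, 500]) for _ in xrange(n_layers)]
    error = train_and_score(hidden_layers_sizes=sizes)
    if error < best_error:
        best_sizes, best_error = sizes, error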

Higher layers may learn increasingly high-order configurations, enabling a stacked denoising autoencoder to learn to recognize facial features, alphanumeric characters, or generalized forms of objects (such as a bird). This is what gives SdAs their unique capability to learn very sophisticated, high-level abstractions of their input data.

Autoencoders can be stacked indefinitely, and it has been demonstrated that continuing to stack autoencoders can improve the effectiveness of the deep architecture (with the main constraint becoming compute time). In this chapter, we'll look at stacking three autoencoders to solve a natural language processing challenge.

Applying the SdA

Now that we've had a chance to understand the advantages and power of the SdA as a deep learning architecture, let's test our skills on a real-world dataset.

For this chapter, let's step away from image datasets and work with the OpinRank Review dataset, a text dataset of around 259,000 hotel reviews from TripAdvisor, accessible via the UCI Machine Learning Repository. This freely available dataset provides review scores (as floating-point numbers from 1 to 5) and review text for a broad range of hotels; we'll be applying our stacked dA to attempt to identify the score associated with each review from its text.

Note

We'll be applying our autoencoder to analyze a preprocessed version of this data, which is accessible from the GitHub share accompanying this chapter. We'll be discussing the techniques by which we prepare text data in an upcoming chapter. For the interested reader, the source data is available at https://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset.
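Although the loading code ships with the repository, the standard Theano idiom for wrapping such a dataset as GPU-resident shared variables looks roughly like this (the data_x and data_y arrays are assumed to come from the preprocessed OpinRank files):

import numpy
import theano
import theano.tensor as T

# Standard Theano idiom for wrapping a (features, labels) pair as
# shared variables, so that minibatches can be sliced on the GPU.
def shared_dataset(data_x, data_y, borrow=True):
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX),
                             borrow=borrow)
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX),
                             borrow=borrow)
    # Labels are stored as floats for GPU transfer, then cast back
    # to integers for use as class indices.
    return shared_x, T.cast(shared_y, 'int32')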

In order to get started, we're going to need a stacked denoising autoencoder (hereafter SdA) class:

class SdA(object):

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        n_ins=280,
        hidden_layers_sizes=[500, 500],
        n_outs=5,
        corruption_levels=[0.1, 0.1]
    ):

As we previously discussed, the SdA is created by feeding the encoding from one layer's autoencoder as the input to the subsequent layer. This class supports the configuration of the layer count (set by the length of the hidden_layers_sizes vector, which should match the length of corruption_levels). It also supports differentiated layer sizes (in nodes) at each layer, which can be set using hidden_layers_sizes. As we discussed, the ability to configure successive layers of the autoencoder is critical to developing successful representations.
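For example, a deeper, tapering configuration with per-layer corruption might be declared as follows (the sizes and corruption levels here are purely illustrative, not tuned values):

# Illustrative configuration only; sizes and corruption levels
# would need tuning for a real problem.
sda = SdA(
    numpy_rng=numpy.random.RandomState(123),
    n_ins=280,
    hidden_layers_sizes=[300, 200, 100],  # three layers, tapering in size
    n_outs=5,
    corruption_levels=[0.1, 0.2, 0.3]     # one corruption level per layer
)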

Next, we need containers to store the MLP (self.sigmoid_layers) and dA (self.dA_layers) elements of the SdA. To specify the depth of our architecture, we use self.n_layers to record the number of sigmoid and dA layers required:

self.sigmoid_layers = []
self.dA_layers = []
self.params = []
self.n_layers = len(hidden_layers_sizes)

assert self.n_layers > 0

Next, we need to construct our sigmoid and dA layers. We begin by setting each layer's input size, drawing either from the size of the input vector (for the first layer) or from the output of the preceding layer. Following this, sigmoid_layer and dA_layer components are created, with the dA layer drawing from the dA class that we discussed earlier in this chapter:

for i in xrange(self.n_layers):
    if i == 0:
        input_size = n_ins
    else:
        input_size = hidden_layers_sizes[i - 1]

    if i == 0:
        layer_input = self.x
    else:
        layer_input = self.sigmoid_layers[-1].output

    sigmoid_layer = HiddenLayer(
        rng=numpy_rng,
        input=layer_input,
        n_in=input_size,
        n_out=hidden_layers_sizes[i],
        activation=T.nnet.sigmoid
    )

    self.sigmoid_layers.append(sigmoid_layer)
    self.params.extend(sigmoid_layer.params)

    dA_layer = dA(
        numpy_rng=numpy_rng,
        theano_rng=theano_rng,
        input=layer_input,
        n_visible=input_size,
        n_hidden=hidden_layers_sizes[i],
        W=sigmoid_layer.W,
        bhid=sigmoid_layer.b
    )

    self.dA_layers.append(dA_layer)

Having implemented the layers of our stacked dA, we'll need a final, logistic regression layer to complete the MLP component of the network:

self.logLayer = LogisticRegression(
   input=self.sigmoid_layers[-1].output,
   n_in=hidden_layers_sizes[-1],
   n_out=n_outs
)

self.params.extend(self.logLayer.params)
self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)
self.errors = self.logLayer.errors(self.y)

This completes the architecture of our SdA. Next up, we need to generate the training functions used by the SdA class. Each function will take the minibatch index (index) as an argument, together with several other elements; the corruption_level and learning_rate are exposed here so that we can adjust them (for example, gradually increase or decrease them) during training. Additionally, we create the batch_begin and batch_end variables, which identify where each batch starts and ends:

Note

The ability to dynamically adjust the learning rate is particularly helpful and may be applied in one of two ways. Once a technique has begun to converge on an appropriate solution, it is very helpful to be able to reduce the learning rate. If you do not do this, you risk creating a situation in which the network oscillates between values located around the optimum without ever hitting it. In some contexts, it can be helpful to tie the learning rate to the network's performance measure. If the error rate is high, it makes sense to make larger adjustments until the error rate begins to decrease!
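A minimal sketch of both schemes follows (the decay factor and error threshold are arbitrary illustrative choices); either value can then be passed to the pretraining functions defined next through their lr parameter:

# Two simple schemes for adjusting the learning rate during
# training; the constants here are illustrative, not tuned values.

def decayed_rate(epoch, base_rate=0.1, decay=0.05):
    # Scheme 1: decay the rate as training proceeds, damping
    # oscillation around the optimum.
    return base_rate / (1.0 + decay * epoch)

def error_scaled_rate(error, base_rate=0.1, threshold=0.5):
    # Scheme 2: tie the rate to the current error measure, taking
    # larger steps while the error remains high.
    return base_rate * 2.0 if error > threshold else base_rate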

def pretraining_functions(self, train_set_x, batch_size):
    index = T.lscalar('index')  
    corruption_level = T.scalar('corruption')  
    learning_rate = T.scalar('lr')  
    batch_begin = index * batch_size
    batch_end = batch_begin + batch_size

    pretrain_fns = []
    for dA in self.dA_layers:
        cost, updates = dA.get_cost_updates(corruption_level, learning_rate)
        fn = theano.function(
            inputs=[
                index,
                theano.Param(corruption_level, default=0.2),
                theano.Param(learning_rate, default=0.1)
            ],
            outputs=cost,
            updates=updates,
            givens={
                self.x: train_set_x[batch_begin: batch_end]
            }
        )
        pretrain_fns.append(fn)

    return pretrain_fns

The pretraining functions that we've created take the minibatch index and can optionally take the corruption level or learning rate. Each function performs one step of pretraining, applies the resulting weight updates, and returns the cost value.

In addition to pretraining, we need to build functions to support the fine-tuning stage, wherein the network is trained iteratively over the training data, with validation and test data used to measure how well its parameters are being optimized. The training function (train_fn) seen in the code below implements a single step of fine-tuning. The valid_score is a Python function that computes a validation score using the error measure produced by the SdA over the validation data. Similarly, test_score computes the error score over the test data.

To get this process off the ground, we first need to set up training, validation, and test datasets. Each stage requires two datasets (set x and set y) containing the features and class labels, respectively. The required number of minibatches for validation and test is determined, and an index is created to track the batch size (and provide a means of identifying where each batch starts and ends). Training, validation, and testing occur for each batch; afterward, both valid_score and test_score are calculated across all batches:

def build_finetune_functions(self, datasets, batch_size, learning_rate):

    (train_set_x, train_set_y) = datasets[0]
    (valid_set_x, valid_set_y) = datasets[1]
    (test_set_x, test_set_y) = datasets[2]

    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
    n_valid_batches /= batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0]
    n_test_batches /= batch_size

    index = T.lscalar('index')

    gparams = T.grad(self.finetune_cost, self.params)

    updates = [
        (param, param - gparam * learning_rate)
        for param, gparam in zip(self.params, gparams)
    ]

    train_fn = theano.function(
        inputs=[index],
        outputs=self.finetune_cost,
        updates=updates,
        givens={
            self.x: train_set_x[
                index * batch_size: (index + 1) * batch_size
            ],
            self.y: train_set_y[
                index * batch_size: (index + 1) * batch_size
            ]
        },
        name='train'
    )

    test_score_i = theano.function(
        [index],
        self.errors,
        givens={
            self.x: test_set_x[
                index * batch_size: (index + 1) * batch_size
            ],
            self.y: test_set_y[
                index * batch_size: (index + 1) * batch_size
            ]
        },
        name='test'
    )

    valid_score_i = theano.function(
        [index],
        self.errors,
        givens={
            self.x: valid_set_x[
                index * batch_size: (index + 1) * batch_size
            ],
            self.y: valid_set_y[
                index * batch_size: (index + 1) * batch_size
            ]
        },
        name='valid'
    )


    def valid_score():
        return [valid_score_i(i) for i in xrange(n_valid_batches)]

    def test_score():
        return [test_score_i(i) for i in xrange(n_test_batches)]

    return train_fn, valid_score, test_score

With the training functionality in place, the following code initializes our stacked dA:

numpy_rng = numpy.random.RandomState(89677)
print '... building the model'
sda = SdA(
    numpy_rng=numpy_rng,
    n_ins=280,
    hidden_layers_sizes=[240, 170, 100],
    n_outs=5
)

It should be noted that, at this point, we should try an initial configuration of layer sizes and see how we do; in this case, the layer sizes used are the product of some initial testing. As we discussed, training the SdA occurs in two stages. The first is a layer-wise pretraining process that loops over all of the SdA's layers. The second is a fine-tuning process whose progress is measured over the validation and test data.

To pretrain the SdA, we provide the required corruption levels to train each layer and iterate over the layers using our previously defined pretraining_fns:

print '... getting the pretraining functions'
pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size)

print '... pre-training the model'
start_time = time.clock()
corruption_levels = [.1, .2, .2]
for i in xrange(sda.n_layers):
    for epoch in xrange(pretraining_epochs):
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        corruption=corruption_levels[i],
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = time.clock()

print >> sys.stderr, ('The pretraining code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % ((end_time - start_time) / 60.))

At this point, we're able to initialize our SdA class by calling the preceding code, which is stored within this book's GitHub repository: MasteringMLWithPython/Chapter3/SdA.py
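For completeness, here is a condensed sketch of how the fine-tuning stage might then be driven (the full version, with patience-based early stopping, is in the same SdA.py; datasets, batch_size, finetune_lr, training_epochs, and n_train_batches are assumed to be defined as in that script):

print '... getting the finetuning functions'
train_fn, valid_score, test_score = sda.build_finetune_functions(
    datasets=datasets,
    batch_size=batch_size,
    learning_rate=finetune_lr
)

print '... finetuning the model'
best_validation_loss = numpy.inf
for epoch in xrange(training_epochs):
    for minibatch_index in xrange(n_train_batches):
        train_fn(minibatch_index)
    this_validation_loss = numpy.mean(valid_score())
    print 'epoch %i, validation error %f %%' % (
        epoch, this_validation_loss * 100.)
    if this_validation_loss < best_validation_loss:
        # Keep the best validation score seen so far and check the
        # corresponding test error.
        best_validation_loss = this_validation_loss
        print '    test error %f %%' % (numpy.mean(test_score()) * 100.)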

Assessing SdA performance

The SdA will take a significant length of time to run. With 15 epochs per layer and each epoch typically taking an average of 11 minutes, the three-layer network will run for around 500 minutes on a modern desktop system with GPU acceleration and a single-threaded GotoBLAS.

On a system without GPU acceleration, the network will take substantially longer to train, and it is recommended that you use the alternative, which runs over a significantly smaller input dataset: MasteringMLWithPython/Chapter3/SdA_no_blas.py

The results are of high quality, with a validation error score of 3.22% and test error score of 3.14%. These results are particularly impressive given the ambiguous and sometimes challenging nature of natural language processing applications.

It was noticeable that the network classified the 1-star and 5-star rating cases more accurately than the intermediate levels. This is largely due to the ambiguous nature of unpolarized or unemotional language.

Part of the reason that this input data was classifiable is the significant feature engineering that went into preparing it. While time-consuming and sometimes problematic, we've seen that well-executed feature engineering combined with an optimized model can deliver an excellent level of accuracy. In Chapter 6, Text Feature Engineering, we'll apply the techniques used to prepare this dataset ourselves.
