Sentiment analysis

Sentiment analysis involves the evaluation and classification of words based on their context, meaning, and emotional implications. Typically, if we look up a word in a dictionary we will find a meaning or definition for the word but, taken out of the context of a sentence, we may not be able to ascribe a detailed and precise meaning to it.

For example, the word toast could be defined as simply a slice of heated and browned bread. But in the context of the sentence He's toast!, the meaning changes completely. Sentiment analysis seeks to derive meanings of words based on their context and usage.

It is important to note that advanced sentiment analysis will expand beyond simple positive or negative classification and ascribe detailed emotional meaning to words. It is far simpler to classify words as positive or negative but far more useful to classify them as happy, furious, indifferent, or anxious.

This type of analysis falls into the category of affective computing, a field of computing concerned with the emotional implications and uses of technological tools. This type of computing is especially significant given the growing amount of emotionally charged data readily available for analysis on social media sites today.

Being able to determine the emotional content of text enables a more targeted and appropriate response. For example, being able to judge the emotional tone of a chat session between a customer and a technical representative can allow the representative to do a better job. This can be especially important when there is a cultural or language gap between them.

This type of analysis can also be applied to visual images. It could be used to gauge someone's response to a new product, such as when conducting a taste test, or to judge how people react to scenes of a movie or commercial.

As part of our example we will be using a bag-of-words model. Bag-of-words models simplify word representation for natural language processing by reducing text to a multiset, known as the bag, of its words, irrespective of grammar or word order. The words have features used for classification, most importantly the frequency of each word. Because some words such as the, a, or and will naturally have a higher frequency in any text, the words are also given a weight. Common words with less contextual significance receive a smaller weight and factor less into the text analysis.
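
To make this concrete, the following is a minimal sketch, not part of the DL4J example, that builds a bag-of-words as a weighted frequency map. The stop-word list and the weights are arbitrary choices for illustration:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BagOfWordsDemo {
    // Common words carry little contextual significance, so they are downweighted
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to"));

    // Build the "bag": a map from each word to its weighted frequency
    public static Map<String, Double> bagOfWords(String text) {
        Map<String, Double> bag = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            double weight = STOP_WORDS.contains(token) ? 0.1 : 1.0;
            bag.merge(token, weight, Double::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(bagOfWords(
                "The movie was a waste of time and the acting was poor"));
    }
}

Notice that word order is discarded entirely; only the weighted counts are kept, which is what makes the representation simple but also blind to context.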

Downloading and extracting the Word2Vec model

To demonstrate sentiment analysis, we will use Google's Word2Vec model in conjunction with DL4J to classify movie reviews as either positive or negative based on the words used in the review. This example is adapted from work done by Alex Black (https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/word2vecsentiment/Word2VecSentimentRNN.java). As discussed previously in this chapter, Word2Vec consists of two-layer neural networks trained to reconstruct the linguistic context of words. We will also be using a large set of movie reviews from http://ai.stanford.edu/~amaas/data/sentiment/.

Before we begin, you will need to download the Word2Vec data from https://code.google.com/p/word2vec/. The basic process includes:

  • Downloading and extracting the movie reviews
  • Loading the Word2Vec Google News vectors
  • Loading each movie review

The words within the reviews are then broken into vectors and used to train the network. We will train the network across five epochs and evaluate the network's performance after each epoch.

To begin, we first declare three final variables. The first is the URL to retrieve the training data, the second is the location to store our extracted data, and the third is the location of the Google News vectors on the local machine. Modify this third variable to reflect the location on your local machine:

public static final String TRAINING_DATA_URL =  
    "http://ai.stanford.edu/~amaas/" +  
    "data/sentiment/aclImdb_v1.tar.gz"; 
public static final String EXTRACT_DATA_PATH =  
    FilenameUtils.concat(System.getProperty( 
    "java.io.tmpdir"), "dl4j_w2vSentiment/"); 
public static final String GNEWS_VECTORS_PATH =  
    "C:/YOUR_PATH/GoogleNews-vectors-negative300.bin" +  
    "/GoogleNews-vectors-negative300.bin"; 
 

Next we download and extract our model data. The next two methods are modelled after the code found in the DL4J example. We first create a new method, getModelData. The method is shown next in its entirety.

First we create a new File using the EXTRACT_DATA_PATH we defined previously. If the directory does not already exist, we create it. Next, we create two more File objects, one for the path to the archived TAR file and one for the path to the extracted data. Before we attempt to extract the data, we check whether these two files exist. If the archive does not exist, we download the data from the TRAINING_DATA_URL and then extract it. If the archive exists but the extracted data does not, we only extract the data:

  
private static void getModelData() throws Exception { 
    File modelDir = new File(EXTRACT_DATA_PATH); 
    if (!modelDir.exists()) { 
        modelDir.mkdir(); 
    } 
    String archivePath = EXTRACT_DATA_PATH + "aclImdb_v1.tar.gz"; 
    File archiveName = new File(archivePath); 
    String extractPath = EXTRACT_DATA_PATH + "aclImdb"; 
    File extractName = new File(extractPath); 
    if (!archiveName.exists()) { 
        // Archive is missing: download it and then extract it
        FileUtils.copyURLToFile(new URL(TRAINING_DATA_URL), 
                archiveName); 
        extractTar(archivePath, EXTRACT_DATA_PATH); 
    } else if (!extractName.exists()) { 
        // Archive is present but has not been extracted yet
        extractTar(archivePath, EXTRACT_DATA_PATH); 
    } 
} 

To extract our data, we will create another method called extractTar. We will provide two inputs to the method, the archivePath and the EXTRACT_DATA_PATH defined before. We also need to define our buffer size to use in the extraction process:

private static final int BUFFER_SIZE = 4096; 

We first create a new TarArchiveInputStream. We use the GzipCompressorInputStream because it provides support for extracting .gz files. We also use the BufferedInputStream to improve performance in our extraction process. The compressed file is very large and may take some time to download and extract.

Next we create a TarArchiveEntry and begin reading in data using the TarArchiveInputStream getNextEntry method. As we process each entry in the compressed file, we first check whether the entry is a directory. If it is, we create a new directory in our extraction location. Otherwise, we create a new FileOutputStream and BufferedOutputStream, wrapped in a try-with-resources block so they are flushed and closed after each entry, and use the write method to write the data to the extraction location:

private static void extractTar(String dataIn, String dataOut)
    throws IOException {
        try (TarArchiveInputStream inStream =
            new TarArchiveInputStream(
                new GzipCompressorInputStream(
                    new BufferedInputStream(
                        new FileInputStream(dataIn))))) {
            TarArchiveEntry tarFile;
            while ((tarFile = (TarArchiveEntry) inStream.getNextEntry())
                != null) {
                if (tarFile.isDirectory()) {
                    new File(dataOut + tarFile.getName()).mkdirs();
                } else {
                    int count;
                    byte[] data = new byte[BUFFER_SIZE];
                    // try-with-resources ensures each file is flushed
                    // and closed once its entry has been written
                    try (FileOutputStream fileOutStream =
                          new FileOutputStream(dataOut + tarFile.getName());
                         BufferedOutputStream outStream =
                          new BufferedOutputStream(fileOutStream,
                            BUFFER_SIZE)) {
                        while ((count = inStream.read(data, 0, BUFFER_SIZE))
                            != -1) {
                                outStream.write(data, 0, count);
                        }
                    }
                }
            }
        }
    }
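
For context, the following is a minimal, illustrative sketch, not part of the original listing, of how these two helpers fit into the program: the data-preparation step reduces to a single call before the network is configured:

public static void main(String[] args) throws Exception {
    // Download and extract the review data if it is not already present
    getModelData();
    // Network configuration, training, and evaluation follow,
    // as described in the next section
}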

Building our model and classifying text

Now that we have created methods to download and extract our data, we need to declare and initialize variables used to control the execution of our model. Our batchSize refers to the number of reviews we process in each minibatch, in this case 50. Our vectorSize determines the size of the word vectors; the Google News model has word vectors of size 300. nEpochs refers to the number of passes we make through our training data. Finally, truncateReviewsToLength specifies the maximum review length; for memory utilization purposes, reviews that exceed this length are truncated. We have chosen to truncate reviews longer than 300 words:

int batchSize = 50;      
int vectorSize = 300; 
int nEpochs = 5;         
int truncateReviewsToLength = 300;   

Now we can set up our neural network. We will use a MultiLayerConfiguration network, as discussed in Chapter 8, Deep Learning. In fact, our example here is very similar to the model built in Configuring and building a model, with a few differences. In particular, in this model we will use a faster learning rate and a GravesLSTM recurrent network in layer 0. The number of input neurons matches the dimensionality of our word vectors, in this case 300. We also use gradientNormalization, here configured to clip element-wise absolute values at a threshold of 1.0, which keeps gradients from growing too large during training. Notice we are using the softmax activation function in the output layer, which was discussed in Chapter 8, Deep Learning. This function generalizes logistic regression to multiple classes and is especially suited for classification:

MultiLayerConfiguration sentimentNN =  
         new NeuralNetConfiguration.Builder() 
        .optimizationAlgo(OptimizationAlgorithm 
                 .STOCHASTIC_GRADIENT_DESCENT).iterations(1) 
        .updater(Updater.RMSPROP) 
        .regularization(true).l2(1e-5) 
        .weightInit(WeightInit.XAVIER) 
        .gradientNormalization(GradientNormalization 
                 .ClipElementWiseAbsoluteValue) 
                 .gradientNormalizationThreshold(1.0) 
        .learningRate(0.0018) 
        .list() 
        .layer(0, new GravesLSTM.Builder() 
                 .nIn(vectorSize).nOut(200) 
                .activation("softsign").build()) 
        .layer(1, new RnnOutputLayer.Builder() 
                .activation("softmax") 
                .lossFunction(LossFunctions.LossFunction.MCXENT) 
                .nIn(200).nOut(2).build()) 
        .pretrain(false).backprop(true).build(); 
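
As a reminder (the formula below is not part of the example code), softmax converts the two raw output activations into probabilities that sum to one, which is what makes it well suited to deciding between the negative and positive classes:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{e^{z_1} + e^{z_2}}, \qquad i \in \{1, 2\}$$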
 

We can then create our MultiLayerNetwork, initialize the network, and set listeners:

MultiLayerNetwork net = new MultiLayerNetwork(sentimentNN); 
net.init(); 
net.setListeners(new ScoreIterationListener(1)); 

Next we create a WordVectors object to load our Google News vectors. We use a DataSetIterator for each of our training and test datasets; the SentimentExampleIterator, taken from the DL4J example, reads the reviews and converts their words to Word2Vec vectors. The AsyncDataSetIterator allows us to load our data in a separate thread to improve performance. This process requires a large amount of memory, so improvements such as this are essential for optimal performance:

// Load the pre-trained Google News vectors (binary format)
WordVectors wordVectors = WordVectorSerializer.loadGoogleModel(
    new File(GNEWS_VECTORS_PATH), true, false);
DataSetIterator trainData = new AsyncDataSetIterator(
    new SentimentExampleIterator(EXTRACT_DATA_PATH, wordVectors,
        batchSize, truncateReviewsToLength, true), 1);
DataSetIterator testData = new AsyncDataSetIterator(
    new SentimentExampleIterator(EXTRACT_DATA_PATH, wordVectors,
        100, truncateReviewsToLength, false), 1);

Finally, we are ready to train and evaluate our data. We run through our data nEpochs times; in this case, we have five epochs. Each epoch executes the fit method against our training data and then creates a new Evaluation object to evaluate our model using testData. The evaluation is based on around 25,000 movie reviews and can take a significant amount of time to run. As we evaluate the data, we create INDArray objects to store information, including the feature matrix and labels from our data. This data is used later in the evalTimeSeries method for evaluation. Finally, we print out our evaluation statistics:

for (int i = 0; i < nEpochs; i++) { 
    net.fit(trainData); 
    trainData.reset(); 
 
    Evaluation evaluation = new Evaluation(); 
    while (testData.hasNext()) { 
        DataSet t = testData.next(); 
        INDArray dataFeatures = t.getFeatureMatrix(); 
        INDArray dataLabels = t.getLabels(); 
        INDArray inMask = t.getFeaturesMaskArray(); 
        INDArray outMask = t.getLabelsMaskArray(); 
        INDArray predicted = net.output(dataFeatures, false,  
            inMask, outMask); 
 
        evaluation.evalTimeSeries(dataLabels, predicted, outMask); 
    } 
    testData.reset(); 
 
    out.println(evaluation.stats()); 
} 

The output from the final iteration is shown next. Our examples classified as 0 are considered negative reviews and the ones classified as 1 are considered positive reviews:

Epoch 4 complete. Starting evaluation:
Examples labeled as 0 classified by model as 0: 11122 times
Examples labeled as 0 classified by model as 1: 1378 times
Examples labeled as 1 classified by model as 0: 3193 times
Examples labeled as 1 classified by model as 1: 9307 times
==========================Scores===================================
Accuracy: 0.8172
Precision: 0.824
Recall: 0.8172
F1 Score: 0.8206
===================================================================

Compared with previous epochs, you should notice the score and accuracy improving with each evaluation. With each epoch, our model improves its accuracy in classifying movie reviews as either negative or positive.
