Additional setup

Additional setup is required to include the libraries we need for text processing. Work through the following steps:

  1. First is Bazel. On Ubuntu, follow the official installation guide at https://docs.bazel.build/versions/master/install-ubuntu.html. On macOS, you can use Homebrew to install Bazel as follows:
      $ brew install bazel
  2. Then, we will install swig, which lets us wrap C/C++ functions so that they can be called from Python. On Ubuntu, you can install it using:
      $ sudo apt-get install swig

     On macOS, we will again use brew, as follows:

      $ brew install swig
  3. Next, we'll install protocol buffer support, which will allow us to store and retrieve serialized data more efficiently than with XML. We specifically need version 3.3.0, which we install as follows:
      $ pip install -U protobuf==3.3.0
  4. Our parses will be represented as trees, so we'll need a library to display trees on the command line. We will install it as follows:
      $ pip install asciitree
  5. Finally, we'll need a scientific computing library. If you worked through the image classification chapters, you are already familiar with NumPy. If not, install it, along with autograd, as follows:
      $ pip install numpy autograd 

With all this in place, we'll now install SyntaxNet, which does the heavy lifting for our NLP. SyntaxNet is an open source framework built on TensorFlow (https://www.tensorflow.org/) that provides the base parsing functionality. Google trained a SyntaxNet model on English text and named it Parsey McParseface; it is included in our installation. We'll be able to either train our own, better or more specific, English models, or train models in other languages altogether.

Training data will pose a challenge, as always, so we'll start by using the pre-trained English model, Parsey McParseface.

So, let's grab the package and configure it, as shown in the following command line:

$ git clone --recursive https://github.com/tensorflow/models.git
$ cd models/research/syntaxnet/tensorflow
$ ./configure

Finally, let's test the system as follows:

$ cd ..
$ bazel test ...

This will take a while, so be patient. If you followed all the instructions closely, all the tests will pass. You may, however, encounter some errors; here are the ones that appeared on our machine, along with fixes:

  • If you find that Bazel can't download a package, try the following command and then run the test command again:
      $ bazel clean --expunge
  • If you encounter failed tests, we suggest adding the following line to the .bazelrc file in your home directory to get more detailed error output for debugging:
      test --test_output=errors

Now, let's perform a more run-of-the-mill test. Let's provide an English sentence and see how it is parsed:

$ echo 'Faaris likes to feed the kittens.' | bash ./syntaxnet/demo.sh

We are feeding a sentence in via the echo statement and piping it into the SyntaxNet demo script, which accepts standard input from the console. Note that, to make the example more interesting, I used an uncommon name, Faaris. Running this command will produce a great deal of debugging information, shown as follows; I cut out stack traces with ellipses (...):

I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.digit input.hyphen; input.prefix(length="2") input(1).prefix(length="2") input(2).prefix(length="2") input(3).prefix(length="2") input(-1).prefix(length="2")...
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: other;prefix2;prefix3;suffix2;suffix3;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 8;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1)....
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;64
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [12 20 20] domain_sizes: [ 49 51 64038]
INFO:tensorflow:Building training network with parameters: feature_sizes: [2 8 8 8 8 8] domain_sizes: [ 5 10665 10665 8970 8970 64038]
I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1)....
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;64
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 46 terms from syntaxnet/models/parsey_mcparseface/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.digit input.hyphen; input.prefix(length="2") input(1).prefix(length="2") input(2).prefix(length="2") input(3).prefix(length="2") input(-1).prefix(length="2")...
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: other;prefix2;prefix3;suffix2;suffix3;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 8;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
INFO:tensorflow:Processed 1 documents
INFO:tensorflow:Total processed documents: 1
INFO:tensorflow:num correct tokens: 0
INFO:tensorflow:total tokens: 7
INFO:tensorflow:Seconds elapsed in evaluation: 0.12, eval metric: 0.00%
INFO:tensorflow:Processed 1 documents
INFO:tensorflow:Total processed documents: 1
INFO:tensorflow:num correct tokens: 1
INFO:tensorflow:total tokens: 6
INFO:tensorflow:Seconds elapsed in evaluation: 0.47, eval metric: 16.67%
INFO:tensorflow:Read 1 documents
Input: Faaris likes to feed the kittens .
Parse:
likes VBZ ROOT
+-- Faaris NNP nsubj
+-- feed VB xcomp
|   +-- to TO aux
|   +-- kittens NNS dobj
|       +-- the DT det
+-- . . punct

The final section, starting with Input:, is the most interesting part, and the output we will consume when we use this foundation programmatically. Notice how the sentence is broken down into parts of speech and into entity-action-object relationships? Some of the word designations we see are nsubj, xcomp, aux, dobj, det, and punct. Some of these designations are obvious, while others are not. If you want to dive deeper, we suggest perusing the Stanford dependency hierarchy at https://nlp-ml.io/jg/software/pac/standep.html.

Let's try another sentence before we proceed:

Input: Stop speaking so loudly and be quiet !
Parse:
Stop VB ROOT
+-- speaking VBG xcomp
|   +-- loudly RB advmod
|       +-- so RB advmod
|       +-- and CC cc
|       +-- quiet JJ conj
|           +-- be VB cop
+-- ! . punct

Here again, we find that the model performs pretty well in dissecting the phrase. Try some sentences of your own.

Next, let's actually train a model. Training SyntaxNet is fairly trivial, as it is a compiled system. So far, we've piped in data via standard input (stdin), but we can also point it at a corpus of text. Remember the protocol buffer library we installed? We will use it now, to edit the source file, syntaxnet/models/parsey_mcparseface/context.pbtxt.

In that file, we can point the input definitions at other training sources, or at our own, as shown in the following piece of code:

input {
  name: 'wsj-data'
  record_format: 'conll-sentence'
  Part {
    file_pattern: './wsj.conll'
  }
}
input {
  name: 'wsj-data-tagged'
  record_format: 'conll-sentence'
  Part {
    file_pattern: './wsj-tagged.conll'
  }
}

This is how we would bring in our own training data; however, it will be pretty challenging to do better than the natively trained model, Parsey McParseface. So let's train on an interesting dataset using a new model: a convolutional neural network (CNN) for processing text.

I'm a little biased in favor of my alma mater, so we'll use movie review data that Cornell University's department of computer science compiled. The dataset is available at http://www.cs.cornell.edu/people/pabo/movie-review-data/.

We'll first download and process the movie reviews dataset, then train on it, and finally evaluate based on it.

All our code is available at https://github.com/dennybritz/cnn-text-classification-tf.

The code was inspired by Yoon Kim's paper on the subject, Convolutional Neural Networks for Sentence Classification, and was implemented and is maintained by Google's Denny Britz. Now, we will walk through the code to see how he implemented the network.

We start, in figure 1, with the usual helpers. The only new entrant here is the data helper, which downloads and prepares this particular dataset.
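As a minimal sketch of those helpers, based on the repository linked above (data_helpers and text_cnn are modules that ship with it), the top of train.py looks roughly like this:

# Top of train.py (sketch): standard imports plus the repository's helpers.
import os
import time
import datetime

import numpy as np
import tensorflow as tf
from tensorflow.contrib import learn   # vocabulary-processing utilities

import data_helpers                    # loads and cleans the review data
from text_cnn import TextCNN           # the CNN model class we train below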

We start by defining parameters. The training parameters will be very familiar by now: they define the batch size that gets processed on each sweep and how many epochs, or full runs, we'll undertake. We will also define how often we evaluate progress (every 100 steps here) and how often we save checkpoints for the model (to allow evaluation and resumption). Next, we have the code to load and prepare the dataset, in figure 2.
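Taken together, the parameter definitions and the data preparation look roughly like the following sketch. It follows the repository's train.py; the flag names and the 10 percent dev split come from there, but treat the exact data_helpers signature as a version-dependent assumption:

# Training parameter flags (sketch, following the repository's train.py).
tf.flags.DEFINE_integer("batch_size", 64, "Batch size")
tf.flags.DEFINE_integer("num_epochs", 200, "Number of training epochs")
tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate on the dev set after this many steps")
tf.flags.DEFINE_integer("checkpoint_every", 100, "Save a checkpoint after this many steps")
FLAGS = tf.flags.FLAGS

# Load sentences and labels, then map each sentence to a padded
# sequence of word ids using a vocabulary built from the data.
x_text, y = data_helpers.load_data_and_labels()  # exact signature varies by version
max_document_length = max(len(s.split(" ")) for s in x_text)
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
x = np.array(list(vocab_processor.fit_transform(x_text)))

# Shuffle, then hold out the last 10% of examples as a dev set.
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled, y_shuffled = x[shuffle_indices], y[shuffle_indices]
dev_index = -1 * int(0.1 * float(len(y)))
x_train, x_dev = x_shuffled[:dev_index], x_shuffled[dev_index:]
y_train, y_dev = y_shuffled[:dev_index], y_shuffled[dev_index:]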

Then, we will take a look at the training part of the code:

Figure 3 shows us instantiating our CNN, a natural language CNN, with some of the parameters we defined earlier. We also set up the code to enable TensorBoard visualization.
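A sketch of that instantiation follows, with names taken from the repository; the hyperparameter values shown, such as the 3-, 4-, and 5-word filter sizes and 128 filters, are its defaults:

# Build the graph and instantiate the text CNN (sketch).
with tf.Graph().as_default():
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    with sess.as_default():
        cnn = TextCNN(
            sequence_length=x_train.shape[1],
            num_classes=y_train.shape[1],
            vocab_size=len(vocab_processor.vocabulary_),
            embedding_size=128,
            filter_sizes=[3, 4, 5],   # convolve over 3-, 4-, and 5-word windows
            num_filters=128)

        # Adam optimizer, with a global step so checkpoints are numbered.
        global_step = tf.Variable(0, name="global_step", trainable=False)
        optimizer = tf.train.AdamOptimizer(1e-3)
        grads_and_vars = optimizer.compute_gradients(cnn.loss)
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)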

Figure 4 shows more items we're capturing for TensorBoard: the loss and accuracy for the training and evaluation sets:
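In code, those summaries look roughly like this; out_dir is a hypothetical output directory, and the tf.summary calls are the TensorFlow 1.x API:

# Scalar summaries for loss and accuracy, written out for TensorBoard.
loss_summary = tf.summary.scalar("loss", cnn.loss)
acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)

# Separate writers so training and evaluation curves can be compared.
train_summary_op = tf.summary.merge([loss_summary, acc_summary])
train_summary_writer = tf.summary.FileWriter(
    os.path.join(out_dir, "summaries", "train"), sess.graph)

dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
dev_summary_writer = tf.summary.FileWriter(
    os.path.join(out_dir, "summaries", "dev"), sess.graph)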

Next, in figure 5, we define the training and evaluation methods, which are very similar to those we used for image processing. We receive a set of training data and labels and house them in a feed dictionary. Then, we run our TensorFlow session on that dictionary of data, capturing the performance metrics returned.

We will set up the methods at the top and then loop through the training data in batches, applying the training and evaluation methods to each batch of data.

At select intervals, we will also save checkpoints for optional evaluation:
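Put together, the two methods and the batch loop look roughly as follows; saver and checkpoint_prefix are assumed to have been created earlier with tf.train.Saver() and an output path of your choosing:

def train_step(x_batch, y_batch):
    """One gradient update; dropout is active during training."""
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_keep_prob: 0.5,
    }
    _, step, summaries, loss, accuracy = sess.run(
        [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
        feed_dict)
    print("step {}, loss {:g}, acc {:g}".format(step, loss, accuracy))
    train_summary_writer.add_summary(summaries, step)

def dev_step(x_batch, y_batch):
    """Evaluate on held-out data; dropout is disabled (keep prob 1.0)."""
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_keep_prob: 1.0,
    }
    step, summaries, loss, accuracy = sess.run(
        [global_step, dev_summary_op, cnn.loss, cnn.accuracy], feed_dict)
    dev_summary_writer.add_summary(summaries, step)

# Loop over shuffled training batches, evaluating and checkpointing
# at the intervals set by the flags defined earlier.
batches = data_helpers.batch_iter(
    list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
for batch in batches:
    x_batch, y_batch = zip(*batch)
    train_step(x_batch, y_batch)
    current_step = tf.train.global_step(sess, global_step)
    if current_step % FLAGS.evaluate_every == 0:
        dev_step(x_dev, y_dev)
    if current_step % FLAGS.checkpoint_every == 0:
        saver.save(sess, checkpoint_prefix, global_step=current_step)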

We can run this and end up with a trained model, after a good hour of training on a CPU-only machine. The trained model will be stored as a checkpoint file, which can then be fed into the evaluation program shown in figure 6:

The evaluation program is just an example of usage, but let's go through it. We will start with the typical imports and parameter settings. Here, we will also take the checkpoint directory as an input and we will load some test data; however, you should use your own data.

Next, let's examine the heart of the evaluation code:

We will start with the checkpoint file by just loading it up and recreating a TensorFlow session from it. This allows us to evaluate against the model we just trained, and reuse it over and over.

Next, we will run the test data in batches. In regular use, we will not use a loop or batches, but we have a sizeable set of test data, so we'll do it as a loop.

We will simply run the session against each batch of test data and keep the returned predictions (negative versus positive), as in the sketch below.
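Here is a minimal sketch of that restore-and-predict loop, based on the repository's eval.py. The tensor names (input_x, dropout_keep_prob, output/predictions) are the ones assigned inside the TextCNN class, and x_test is assumed to be the vectorized test data loaded earlier:

# Recreate the session from the newest checkpoint in the directory.
checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
graph = tf.Graph()
with graph.as_default():
    sess = tf.Session()
    with sess.as_default():
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        saver.restore(sess, checkpoint_file)

        # Look up the placeholders and the prediction op by name.
        input_x = graph.get_operation_by_name("input_x").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
        predictions = graph.get_operation_by_name("output/predictions").outputs[0]

        # Run the test data through in batches, collecting predictions.
        batches = data_helpers.batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False)
        all_predictions = []
        for x_test_batch in batches:
            batch_predictions = sess.run(
                predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])

The following is some sample positive review data: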

insomnia loses points when it surrenders to a formulaic bang-bang , shoot-em-up scene at the conclusion . but the performances of pacino , williams , and swank keep the viewer wide-awake all the way through .
what might have been readily dismissed as the tiresome rant of an aging filmmaker still thumbing his nose at convention takes a surprising , subtle turn at the midway point .
at a time when commercialism has squeezed the life out of whatever idealism american moviemaking ever had , godfrey reggio's career shines like a lonely beacon .
an inuit masterpiece that will give you goosebumps as its uncanny tale of love , communal discord , and justice unfolds .
this is popcorn movie fun with equal doses of action , cheese , ham and cheek ( as well as a serious debt to the road warrior ) , but it feels like unrealized potential
it's a testament to de niro and director michael caton-jones that by movie's end , we accept the characters and the film , flaws and all .
performances are potent , and the women's stories are ably intercut and involving .
an enormously entertaining movie , like nothing we've ever seen before , and yet completely familiar .
lan yu is a genuine love story , full of traditional layers of awakening and ripening and separation and recovery .
your children will be occupied for 72 minutes .
pull[s] off the rare trick of recreating not only the look of a certain era , but also the feel .
twohy's a good yarn-spinner , and ultimately the story compels .
'tobey maguire is a poster boy for the geek generation . '
. . . a sweetly affecting story about four sisters who are coping , in one way or another , with life's endgame .
passion , melodrama , sorrow , laugther , and tears cascade over the screen effortlessly . . .
road to perdition does display greatness , and it's worth seeing . but it also comes with the laziness and arrogance of a thing that already knows it's won .

Similarly, we have negative data. They are all in the data folder as rt-polarity.pos and rt-polarity.neg.

Here is the network architecture we used: an embedding layer feeding parallel convolution and max-pooling branches over several filter widths, followed by dropout and a softmax output.

It is very similar to the architecture we used for images. In fact, the entire effort looks very similar, and it is. The beauty of many of these techniques is their generalizability.
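To make the parallel concrete, here is a condensed sketch of the model class, following Kim's architecture as the repository implements it. It omits regularization and initialization details, so treat it as illustrative rather than a drop-in replacement:

class TextCNN(object):
    """Embedding -> parallel conv/max-pool branches -> dropout -> softmax."""
    def __init__(self, sequence_length, num_classes, vocab_size,
                 embedding_size, filter_sizes, num_filters):
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

        # Embedding layer: word ids -> dense vectors, plus a channel
        # dimension so we can reuse the 2D convolution ops from image models.
        W_embed = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
        embedded = tf.expand_dims(tf.nn.embedding_lookup(W_embed, self.input_x), -1)

        # One convolution + max-pool branch per filter width (e.g. 3, 4, 5 words).
        pooled_outputs = []
        for filter_size in filter_sizes:
            filt = tf.Variable(tf.truncated_normal(
                [filter_size, embedding_size, 1, num_filters], stddev=0.1))
            conv = tf.nn.conv2d(embedded, filt, strides=[1, 1, 1, 1], padding="VALID")
            h = tf.nn.relu(conv)
            pooled_outputs.append(tf.nn.max_pool(
                h, ksize=[1, sequence_length - filter_size + 1, 1, 1],
                strides=[1, 1, 1, 1], padding="VALID"))

        # Concatenate the branches and apply dropout.
        num_filters_total = num_filters * len(filter_sizes)
        h_pool_flat = tf.reshape(tf.concat(pooled_outputs, 3), [-1, num_filters_total])
        h_drop = tf.nn.dropout(h_pool_flat, self.dropout_keep_prob)

        # Final affine layer plus softmax cross-entropy loss.
        with tf.name_scope("output"):
            W_out = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1))
            b_out = tf.Variable(tf.constant(0.1, shape=[num_classes]))
            self.scores = tf.nn.xw_plus_b(h_drop, W_out, b_out, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        losses = tf.nn.softmax_cross_entropy_with_logits(
            logits=self.scores, labels=self.input_y)
        self.loss = tf.reduce_mean(losses)
        correct = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))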

Let's examine the output of training first, which is as follows:

$ ./train.py
...
2017-06-15T04:42:08.793884: step 30101, loss 0, acc 1
2017-06-15T04:42:08.934489: step 30102, loss 1.54599e-07, acc 1
2017-06-15T04:42:09.082239: step 30103, loss 3.53902e-08, acc 1
2017-06-15T04:42:09.225435: step 30104, loss 0, acc 1
2017-06-15T04:42:09.369348: step 30105, loss 2.04891e-08, acc 1
2017-06-15T04:42:09.520073: step 30106, loss 0.0386909, acc 0.984375
2017-06-15T04:42:09.676975: step 30107, loss 8.00917e-07, acc 1
2017-06-15T04:42:09.821703: step 30108, loss 7.83049e-06, acc 1
...
2017-06-15T04:42:23.220202: step 30199, loss 1.49012e-08, acc 1
2017-06-15T04:42:23.366740: step 30200, loss 5.67226e-05, acc 1

Evaluation:
2017-06-15T04:42:23.781196: step 30200, loss 9.74802, acc 0.721
...
Saved model checkpoint to /Users/saif/Documents/BOOK/cnn-text-classification-tf/runs/1465950150/checkpoints/model-30200

Now let's look at the evaluation step:

$ ./eval.py --eval_train --checkpoint_dir=./runs/1465950150/checkpoints/

Parameters:
ALLOW_SOFT_PLACEMENT=True
BATCH_SIZE=64
CHECKPOINT_DIR=/Users/saif/Documents/BOOK/cnn-text-classification-tf/runs/1465950150/checkpoints/
LOG_DEVICE_PLACEMENT=False

Loading data...
Vocabulary size: 18765
Test set size 10662

Evaluating...

Total number of test examples: 10662
Accuracy: 0.973832

That is pretty good accuracy on the dataset we have. The next step will be to apply the trained model in regular use. Some interesting experiments might be to obtain movie review data from another source, perhaps IMDb or Amazon. As that data will not necessarily be tagged, we can use the percentage of positive reviews as a metric of general agreement across sites.

We can then use the model in the field. Imagine you were a product manufacturer: you could track, in real time, all reviews from myriad sources and filter for just the highly negative ones, which your field representatives could then try to address. The possibilities are endless, so we propose an interesting project you could undertake, combining the two things we've learned.

Write a Twitter stream reader that takes each tweet and extracts its subject. For a specific set of subjects, say companies, evaluate whether each tweet is positive or negative. Then build running metrics of the positive and negative percentages, evaluating each subject over different time scales.
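As one possible starting point, here is a sketch using the tweepy library's 3.x streaming API. The helpers extract_subject() and predict_sentiment() are hypothetical wrappers you would write yourself, for example around the SyntaxNet parse (pulling out the nsubj token) and the CNN evaluation code above:

import collections

import tweepy  # assumes tweepy 3.x; the streaming API changed in 4.x

# Hypothetical wrappers around the two systems built in this chapter.
from my_nlp_helpers import extract_subject, predict_sentiment

TRACKED_COMPANIES = {"acme", "globex"}
counts = collections.defaultdict(lambda: {"pos": 0, "neg": 0})

class SentimentListener(tweepy.StreamListener):
    def on_status(self, status):
        subject = extract_subject(status.text)      # e.g. the nsubj from the parse
        if subject and subject.lower() in TRACKED_COMPANIES:
            label = predict_sentiment(status.text)  # "pos" or "neg" from our CNN
            counts[subject.lower()][label] += 1
            total = sum(counts[subject.lower()].values())
            pct = 100.0 * counts[subject.lower()]["pos"] / total
            print("{}: {:.1f}% positive over {} tweets".format(subject, pct, total))

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, SentimentListener())
stream.filter(track=list(TRACKED_COMPANIES))

Extending this from simple running totals to windows over different time scales, say hourly and daily, is a natural next step.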
