CNN document model

We previously saw how word embeddings are capable of capturing many semantic relationships between the concepts they represent. We will now introduce a ConvNet document model that builds hierarchical distributed representations of documents. This was published in the paper https://arxiv.org/pdf/1406.3830.pdf by Misha Denil et al. The model is divided into two levels, a sentence level and a document level, both of which are implemented using ConvNets. At the sentence level, a ConvNet is used to transform embeddings for the words in each sentence into an embedding for the entire sentence. At the document level, another ConvNet is used to transform sentence embeddings to a document embedding.

In any ConvNet architecture, a convolution layer is followed by a sub-sampling/pooling layer. Here, we use k-max pooling. A k-max pooling operation is slightly different from normal max pooling, which takes the max from a sliding window of neurons. In a k-max pooling operation, the k largest neurons are taken from all the neurons in the layer below, preserving their original order. For example, applying 2-max pooling to [3, 1, 5, 2] yields [3, 5]; here, normal max pooling with kernel size 2 and stride 2 gives the same result. Let's take another case: applying max pooling with kernel size 3 and stride 2 to [1, 2, 3, 4, 5] gives [3, 5], but 2-max pooling gives [4, 5]. K-max pooling can also be applied to variable-sized inputs while still producing the same number of output units.
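
As a quick sanity check of these numbers, here is a small, self-contained NumPy sketch (an illustration only, not part of the model code) that performs k-max pooling on a one-dimensional array while preserving the original order of the selected values:

import numpy as np

def k_max_pool_1d(values, k):
    """Return the k largest values, preserving their original order."""
    values = np.asarray(values)
    # indices of the k largest entries, sorted back into input order
    top_idx = np.sort(np.argsort(values)[-k:])
    return values[top_idx]

print(k_max_pool_1d([3, 1, 5, 2], k=2))     # [3 5]
print(k_max_pool_1d([1, 2, 3, 4, 5], k=2))  # [4 5]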

The following diagram depicts the Convolutional Neural Network (CNN) architecture. We have adapted this architecture slightly for the various use cases discussed here:

The input layer to this network is not shown here. The input layer is the sequence of sentences in the document, in order, where each sentence is represented by a sequence of word indices. The following code snippet describes how the word indices are defined, given a training corpus. The indices 0 and 1 are reserved for the empty (padding) token and OOV words. First, the documents in the corpus are tokenized into words, and tokens that are not plain lowercase words or basic punctuation are filtered out. The frequency of each word in the entire corpus is also computed; for a large corpus, infrequent words can be dropped from the vocabulary (a short sketch of this follows the output below). Then, an integer index is assigned to each word in the vocabulary:

from nltk.tokenize import sent_tokenize, wordpunct_tokenize
import re

corpus = ['The cat sat on the mat . It was a nice mat !',
          'The rat sat on the mat . The mat was damaged found at 2 places.']

vocab = {}
word_index = {}
for doc in corpus:
    for sentence in sent_tokenize(doc):
        tokens = wordpunct_tokenize(sentence)
        tokens = [token.lower().strip() for token in tokens]
        tokens = [token for token in tokens
                  if re.match('^[a-z,.;!?]+$', token) is not None]
        for token in tokens:
            vocab[token] = vocab.get(token, 0) + 1

# index 0 is reserved for empty (padding), 1 for OOV
i = 2
for word, count in vocab.items():
    word_index[word] = i
    i += 1

print(word_index.items())

# Here is the output:
dict_items([('the', 2), ('cat', 3), ('sat', 4), ('on', 5), ('mat', 6), ('.', 7), ('it', 8), ('was', 9), ('a', 10), ('nice', 11), ('!', 12), ('rat', 13), ('damaged', 14), ('found', 15), ('at', 16), ('places', 17)])
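
The snippet above keeps every word it encounters. For a large corpus, infrequent words can be dropped before indices are assigned, for example as follows (MIN_COUNT is an illustrative threshold, not part of the original code):

MIN_COUNT = 5  # hypothetical frequency threshold
word_index = {}
i = 2  # 0: empty/padding, 1: OOV
for word, count in vocab.items():
    if count >= MIN_COUNT:
        word_index[word] = i
        i += 1
# rare words are now absent from word_index and will map to the OOV index 1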

Now, the corpus can be converted to an array of word indices. In the corpus, different sentences and documents have different lengths. Although convolutions can handle inputs of arbitrary width, for simplicity of implementation we define a fixed-size input to the network. We zero-pad short sentences and truncate longer sentences to fit the fixed sentence length, and do the same at the document level. In the following code snippet, we show how to use the keras.preprocessing module to zero-pad the sentences and documents and prepare the data:

from keras.preprocessing.sequence import pad_sequences

SENTENCE_LEN = 10; NUM_SENTENCES = 3

for doc in corpus:
    doc2wordseq = []
    sent_num = 0
    for sentence in sent_tokenize(doc):
        words = wordpunct_tokenize(sentence)
        words = [token.lower().strip() for token in words]
        word_id_seq = [word_index[word] if word_index.get(word) is not None
                       else 1 for word in words]
        padded_word_id_seq = pad_sequences([word_id_seq],
                                           maxlen=SENTENCE_LEN,
                                           padding='post',
                                           truncating='post')
        if sent_num < NUM_SENTENCES:
            doc2wordseq = doc2wordseq + list(padded_word_id_seq[0])
        sent_num += 1
    doc2wordseq = pad_sequences([doc2wordseq],
                                maxlen=SENTENCE_LEN * NUM_SENTENCES,
                                padding='post',
                                truncating='post')[0]
    print(doc2wordseq)

# sample output
[ 2 3 4 5 2 6 7 0 0 0 8 9 10 11 6 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 2 13 4 5 2 6 7 0 0 0 2 6 9 14 15 16 1 17 7 0 0 0 0 0 0 0 0 0 0 0]

So, you can see that each document input is a one-dimensional tensor of size doc_length = SENTENCE_LEN * NUM_SENTENCES. These tensors are passed through the first layer of the network, the embedding layer, which converts word indices to dense word representations, giving a two-dimensional tensor of shape doc_length × embedding_dimension. All of the preceding preprocessing code is bundled in the Preprocess class, which has fit and transform methods, like scikit-learn estimators. The fit method takes the training corpus as input, builds the vocabulary, and assigns a word index to each word in the vocabulary. The transform method can then be used to convert the test or hold-out set to padded word index sequences, as shown previously. The transform method uses the word indices computed by fit.
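
The Preprocess class itself is not listed in this section; a minimal sketch of what its interface could look like, reusing the tokenization and padding logic shown above (the exact class in the accompanying code may differ), is as follows:

from nltk.tokenize import sent_tokenize, wordpunct_tokenize
from keras.preprocessing.sequence import pad_sequences
import re

class Preprocess:
    def __init__(self, sentence_len=30, num_sentences=10):
        self.sentence_len = sentence_len
        self.num_sentences = num_sentences
        self.word_index = {}

    def fit(self, corpus):
        # build the vocabulary and assign word indices (0: empty, 1: OOV)
        vocab = {}
        for doc in corpus:
            for sentence in sent_tokenize(doc):
                tokens = [t.lower().strip() for t in wordpunct_tokenize(sentence)]
                tokens = [t for t in tokens if re.match('^[a-z,.;!?]+$', t)]
                for token in tokens:
                    vocab[token] = vocab.get(token, 0) + 1
        self.word_index = {w: i for i, w in enumerate(vocab, start=2)}
        return self

    def transform(self, corpus):
        # convert each document to a fixed-length, padded sequence of word indices
        docs = []
        for doc in corpus:
            doc_seq = []
            for sent_num, sentence in enumerate(sent_tokenize(doc)):
                if sent_num >= self.num_sentences:
                    break
                words = [t.lower().strip() for t in wordpunct_tokenize(sentence)]
                ids = [self.word_index.get(w, 1) for w in words]
                ids = pad_sequences([ids], maxlen=self.sentence_len,
                                    padding='post', truncating='post')[0]
                doc_seq += list(ids)
            doc_seq = pad_sequences([doc_seq],
                                    maxlen=self.sentence_len * self.num_sentences,
                                    padding='post', truncating='post')[0]
            docs.append(doc_seq)
        return docs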

The embedding matrix can be initialized with GloVe or Word2vec. Here, we have used 50-dimensional GloVe embeddings to initialize the embedding matrix. OOV words and words not found in GloVe are initialized as follows:

  • OOV words (words excluded from the training vocabulary, mapped to index 1) are initialized with the mean of all the embedding vectors
  • Words in the vocabulary but not found in GloVe are initialized with the sum of that mean vector and a random vector of the same dimension

The following code snippet does this in the _init_embedding_matrix method of the GloVe class discussed earlier:

# OOV (index 1): mean of all embedding vectors
self.embedding_matrix[1] = np.mean(self.embedding_matrix, axis=0)
# words missing from GloVe: mean vector plus a random vector
for indx in missing_word_index:
    self.embedding_matrix[indx] = (np.random.rand(self.EMBEDDING_DIM) +
                                   self.embedding_matrix[1])
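
For context, the complete initialization might look roughly like the following sketch; build_embedding_matrix, glove_file, and the surrounding structure are illustrative assumptions rather than the actual code of the GloVe class:

import numpy as np

def build_embedding_matrix(glove_file, word_index, vocab_size, embedding_dim=50):
    # load pre-trained GloVe vectors into a word -> vector dict
    glove = {}
    with open(glove_file, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    missing_word_index = []
    for word, idx in word_index.items():
        if word in glove:
            embedding_matrix[idx] = glove[word]
        else:
            missing_word_index.append(idx)

    # index 1 (OOV): mean of all embedding vectors
    embedding_matrix[1] = np.mean(embedding_matrix, axis=0)
    # words missing from GloVe: mean vector plus a random vector
    for idx in missing_word_index:
        embedding_matrix[idx] = np.random.rand(embedding_dim) + embedding_matrix[1]
    return embedding_matrix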

Having initialized the embedding matrix, we are now ready to build the first layer, the embedding layer, as follows:

from keras.layers import Embedding

embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights=[embedding_weights],
                            input_length=max_seq_length,
                            trainable=True,
                            name='embedding')

Next, we have to build the word convolution layer. We want the same one-dimensional convolution filters to be applied across all sentences, that is, the same convolution filter weights to be shared across all sentences. First, we use a Lambda layer to split the input into sentences. Then, if we use C convolution filters, each sentence, a two-dimensional tensor of shape (SENTENCE_LEN × EMBEDDING_DIM), is converted into a tensor of shape ((SENTENCE_LEN - filter_width + 1) × C). The following code does this:

# Let's take sentence_len=30, embedding_dim=50, num_sentences=10.
# The same convolution filters are used for all sentences.
from keras.layers import Conv1D, Lambda

word_conv_model = Conv1D(filters=6,
                         kernel_size=5,
                         padding="valid",
                         activation="relu",
                         trainable=True,
                         name="word_conv",
                         strides=1)

for sent in range(num_sentences):
    ## get one sentence from the input document
    sentence = Lambda(lambda x: x[:, sent*sentence_len:
                                  (sent+1)*sentence_len, :])(z)
    ## sentence shape : (None, 30, 50)
    conv = word_conv_model(sentence)
    ## convolution shape : (None, 26, 6)

The k-max pooling layer is not available in Keras, so we implement k-max pooling as a custom layer. To implement a custom layer, we need to implement three methods:

  • call(inputs): This is where the layer's logic is implemented
  • compute_output_shape(input_shape): Needed because the custom layer modifies the shape of its input
  • build(input_shape): Defines the layer's weights (we don't need this, as our layer has no weights)

Here is the full code of the k-max pooling layer:

import tensorflow as tf
from keras.layers import Layer, InputSpec, Flatten

class KMaxPooling(Layer):
    def __init__(self, k=1, **kwargs):
        super().__init__(**kwargs)
        self.input_spec = InputSpec(ndim=3)
        self.k = k

    def compute_output_shape(self, input_shape):
        return (input_shape[0], (input_shape[2] * self.k))

    def call(self, inputs):
        # swap last two dimensions since top_k will be
        # applied along the last dimension
        shifted_input = tf.transpose(inputs, [0, 2, 1])

        # extract top_k, returns two tensors [values, indices]
        top_k = tf.nn.top_k(shifted_input, k=self.k, sorted=True,
                            name=None)[0]

        # return flattened output of shape (batch, channels * k)
        return Flatten()(top_k)

Applying the preceding k-max pooling layer on the word convolutions, we get the sentence embedding layer:

from keras.layers import Reshape, Concatenate, Permute

conv_blocks = []
for sent in range(num_sentences):
    ## get one sentence from the input document
    sentence = Lambda(lambda x: x[:, sent*sentence_len:
                                  (sent+1)*sentence_len, :])(z)
    ## sentence shape : (None, 30, 50)
    conv = word_conv_model(sentence)
    ## convolution shape : (None, 26, 6)
    conv = KMaxPooling(k=3)(conv)
    ## transpose pooled values per sentence
    conv = Reshape([word_filters*sent_k_maxpool, 1])(conv)
    ## shape post k-max pooling and reshape : (None, 18=6*3, 1)
    conv_blocks.append(conv)

So, we convert each sentence of shape 30 × 50 to 18 × 1, and then we concatenate these tensors to get the sentence embeddings. We use the Concatenate layer in Keras to implement this:

z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Permute([2,1], name='sentence_embeddings')(z)
## output shape of sentence embedding is : (None, 10, 18)

As before, one-dimensional convolution followed by k-max pooling is applied to the preceding sentence embeddings to get the document embedding. This completes the document model for text. Based on the learning task at hand, the next layers can be defined. For a classification task, the document embedding can be connected to a dense layer, followed by a final output layer with K units, using a softmax activation for a K-class classification problem (or a sigmoid activation, as in the following snippet, for binary or multi-label tasks). We can have more than one dense layer before the final layer. The following code snippet implements this:

from keras.layers import Dense

sent_conv = Conv1D(filters=16,
                   kernel_size=3,
                   padding="valid",
                   activation="relu",
                   trainable=True,
                   name='sentence_conv',
                   strides=1)(z)

z = KMaxPooling(k=5)(sent_conv)
z = Flatten(name='document_embedding')(z)

for i in range(num_hidden_layers):
    layer_name = 'hidden_{}'.format(i)
    z = Dense(hidden_dims, activation=hidden_activation,
              name=layer_name)(z)

model_output = Dense(K, activation='sigmoid', name='final')(z)

The whole code is included in the cnn_document_model module.
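
For reference, here is a minimal sketch of how the pieces above could be wired into a single Keras model and compiled. It reuses the KMaxPooling layer defined earlier; vocab_size, the hyperparameter values, and the single-unit sigmoid output head are placeholder assumptions, not necessarily what cnn_document_model uses:

from keras.layers import (Input, Embedding, Conv1D, Lambda, Reshape,
                          Concatenate, Permute, Flatten, Dense)
from keras.models import Model

sentence_len, num_sentences, embedding_dim = 30, 10, 50
max_seq_length = sentence_len * num_sentences
vocab_size = 20000   # placeholder; len(word_index) + 2 in practice
num_classes = 1      # placeholder: one sigmoid unit for a binary task

doc_input = Input(shape=(max_seq_length,), dtype='int32')
# pass weights=[embedding_matrix] here to start from pre-trained GloVe vectors
z = Embedding(vocab_size, embedding_dim,
              input_length=max_seq_length,
              trainable=True, name='embedding')(doc_input)

word_conv_model = Conv1D(filters=6, kernel_size=5, padding='valid',
                         activation='relu', strides=1, name='word_conv')
conv_blocks = []
for sent in range(num_sentences):
    # bind the loop variable so each Lambda slices out its own sentence
    sentence = Lambda(lambda x, s=sent: x[:, s*sentence_len:(s+1)*sentence_len, :])(z)
    conv = word_conv_model(sentence)       # (None, 26, 6)
    conv = KMaxPooling(k=3)(conv)          # (None, 18)
    conv = Reshape([6 * 3, 1])(conv)       # (None, 18, 1)
    conv_blocks.append(conv)

z = Concatenate()(conv_blocks)                             # (None, 18, 10)
z = Permute([2, 1], name='sentence_embeddings')(z)         # (None, 10, 18)

z = Conv1D(filters=16, kernel_size=3, padding='valid', activation='relu',
           strides=1, name='sentence_conv')(z)             # (None, 8, 16)
z = KMaxPooling(k=5)(z)                                    # (None, 80)
z = Flatten(name='document_embedding')(z)
z = Dense(64, activation='relu', name='hidden_0')(z)
model_output = Dense(num_classes, activation='sigmoid', name='final')(z)

model = Model(inputs=doc_input, outputs=model_output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()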
