Classifying text

Classifying text is an important part of machine learning and data science, with applications that include document retrieval and web search. We often need to assign labels to data before we can determine its usefulness for a particular application or search result.

In this chapter, we are going to demonstrate a technique involving the use of paragraph vectors and labeled data with DL4J classes. This example allows us to read in documents and, based on the text inside of the document, assign a label (or classification) to the document. We are also going to show an example of classifying text by similarity. This means we will match phrases and words that have similar structure. This example will also use DL4J.

Word2Vec and Doc2Vec

We will be using Word2Vec and Doc2Vec in several examples in this chapter. Word2Vec is a two-layer neural network used for text processing. Given a body of text, the network produces a feature vector for each word contained in the text. These vectors are simply mathematical representations of a word's features and can be numerically compared to other vectors. This comparison is often referred to as the distance between two words.
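This distance is typically measured as the cosine similarity between the two feature vectors. The following is a minimal, self-contained sketch of that computation; the method name and the plain double[] vectors are ours for illustration:

// Illustrative only: cosine similarity between two word vectors.
// Values near 1.0 mean the words occur in similar contexts;
// values near 0.0 mean they are largely unrelated.
public static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}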

Word2Vec operates on the premise that words can be characterized by the probability that they occur together. Because of this methodology, Word2Vec can be used for more than the classification of sentences: any object or data that can be represented by text labels can be classified with this network.
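As a quick illustration of Word2Vec on its own, the following sketch trains a model on a plain-text corpus and compares two words. It uses the same DL4J classes that appear later in this chapter; the file name corpus.txt and the pair of words being compared are placeholders:

// Sketch: train Word2Vec on a one-sentence-per-line text file
// and compare the feature vectors of two words.
SentenceIterator sentences = new BasicLineIterator(new File("corpus.txt"));
TokenizerFactory tokenizer = new DefaultTokenizerFactory();
tokenizer.setTokenPreProcessor(new CommonPreprocessor());

Word2Vec w2v = new Word2Vec.Builder()
        .minWordFrequency(5)    // ignore very rare words
        .layerSize(100)         // dimensionality of the feature vectors
        .windowSize(5)          // context window for co-occurrence
        .iterate(sentences)
        .tokenizerFactory(tokenizer)
        .build();
w2v.fit();

double distance = w2v.similarity("day", "night");  // cosine similarity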

Doc2Vec is an extension of Word2Vec. Rather than building vectors representing the features of individual words compared to other words, as Word2Vec does, this network compares words to given labels. The vectors are set up to represent the theme or overall meaning of a document. Our next example shows how these feature vectors are then associated with specific documents.

Classifying text by labels

In our first example using Doc2Vec, we will associate our documents with three labels: health, finance, and science. But before we can associate the data with labels, we have to define those labels and train our model to recognize the labels. Each label represents the meaning or classification of a particular piece of text.

In this example, we will use sample documents, each pre-labeled with one of our categories: health, finance, or science. We will use these documents to train our model and then, as in previous examples, use a set of test data to test the model. We will be using the files found at https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/resources/paravec. We have based this example on sample code written for DL4J, which can be found at https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/nlp/paragraphvectors/ParagraphVectorsClassifierExample.java.
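The FileLabelAwareIterator we use below takes each document's label from the name of the subfolder containing it, so the labeled training data is laid out along these lines (the layout shown here is illustrative):

paravec/labeled/finance/   <- training documents labeled finance
paravec/labeled/health/    <- training documents labeled health
paravec/labeled/science/   <- training documents labeled science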

First we need to set up some instance variables to use later in our code. We will be using a ParagraphVectors object to create our vectors, a LabelAwareIterator object to iterate through our data, and a TokenizerFactory object to tokenize our data:

ParagraphVectors pVect; 
LabelAwareIterator iter; 
TokenizerFactory tFact; 
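The exact package names vary between DL4J releases, but the classes used in this example are imported along these lines (verify them against your own version; note that MeansBuilder and LabelSeeker, used later in the example, ship with the dl4j-examples project rather than the core library):

import org.datavec.api.util.ClassPathResource;
import org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.text.documentiterator.FileLabelAwareIterator;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
import org.deeplearning4j.text.documentiterator.LabelledDocument;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;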
 

Then we will set up our ClassPathResource. This specifies the directory within our project that contains the data files to be classified. The first resource contains our labeled data used for training purposes. We then direct our iterator and tokenizer to use the resources specified as the ClassPathResource. We also specify that we will use the CommonPreprocessor to preprocess our data:

ClassPathResource resource = new ClassPathResource("paravec/labeled");
 
iter = new FileLabelAwareIterator.Builder() 
        .addSourceFolder(resource.getFile()) 
        .build(); 
 
tFact = new DefaultTokenizerFactory(); 
tFact.setTokenPreProcessor(new CommonPreprocessor()); 

Next, we build our ParagraphVectors. This is where we specify the learning rate, batch size, and number of training epochs. We include our iterator and tokenizer in the setup process as well. Once we've built our ParagraphVectors, we call the fit method to train our model using the training data in the paravec/labeled directory:

pVect = new ParagraphVectors.Builder() 
        .learningRate(0.025) 
        .minLearningRate(0.001) 
        .batchSize(1000) 
        .epochs(20) 
        .iterate(iter) 
        .trainWordVectors(true) 
        .tokenizerFactory(tFact) 
        .build(); 
 
pVect.fit(); 

Now that we have trained our model, we can use our unlabeled data to test it. We create a new ClassPathResource for our unlabeled data and a new FileLabelAwareIterator as well:

ClassPathResource unlabeledText = new ClassPathResource("paravec/unlabeled");
FileLabelAwareIterator unlabeledIter = new FileLabelAwareIterator.Builder()
        .addSourceFolder(unlabeledText.getFile())
        .build();

The next step involves iterating through our unlabeled data and identifying the correct label for each document. In general, each document will receive a score for every label, with a different weight, or degree of match, for each. So, while one article may be classified primarily as a health article, it may contain enough related content to also be classified, to a lesser degree, as a science article.

Next, we set up MeansBuilder and LabelSeeker objects. These helper classes work with the lookup table of word and label vectors built by our ParagraphVectors model: MeansBuilder converts a document into a single vector, and LabelSeeker scores that vector against each of our labels. The InMemoryLookupTable class provides access to the in-memory table used for word lookup:

MeansBuilder mBuilder = new MeansBuilder(
        (InMemoryLookupTable<VocabWord>) pVect.getLookupTable(), tFact);
LabelSeeker lSeeker = new LabelSeeker(
        iter.getLabelsSource().getLabels(),
        (InMemoryLookupTable<VocabWord>) pVect.getLookupTable());
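Conceptually, the MeansBuilder used here turns a document into a single vector by averaging the vectors of the document's words. The following is a minimal sketch of that idea, not the actual DL4J implementation; the helper method and the word-vector Map are ours:

import java.util.List;
import java.util.Map;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// Illustrative helper: a document centroid is the mean of the vectors
// of those words in the document that exist in the vocabulary.
static INDArray documentCentroid(List<String> tokens,
                                 Map<String, INDArray> wordVectors,
                                 int layerSize) {
    INDArray sum = Nd4j.zeros(1, layerSize);
    int known = 0;
    for (String token : tokens) {
        INDArray vector = wordVectors.get(token);
        if (vector != null) {
            sum.addi(vector);   // accumulate in place
            known++;
        }
    }
    return known > 0 ? sum.divi(known) : sum;  // mean of the known words
}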

Finally, we iterate through our unlabeled documents. For each document, we convert the document into a vector and use our LabelSeeker to get a score for each label. We then print each label along with its score for every document:

while (unlabeledIter.hasNextDocument()) {
    LabelledDocument doc = unlabeledIter.nextDocument();
    INDArray docCentroid = mBuilder.documentAsVector(doc);
    List<Pair<String, Double>> scores = lSeeker.getScores(docCentroid);
    out.println("Document '" + doc.getLabel()
            + "' falls into the following categories: ");
    for (Pair<String, Double> score : scores) {
        out.println("        " + score.getFirst() + ": " + score.getSecond());
    }
}
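The scores themselves are cosine similarities: LabelSeeker essentially compares the document centroid against the vector learned for each label. A sketch of that single scoring step, with docCentroid and labelVector standing in for the actual vectors:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

// Illustrative scoring step: one score per label.
static double labelScore(INDArray docCentroid, INDArray labelVector) {
    // Ranges from -1.0 (opposed) through 0.0 (unrelated) to 1.0 (identical)
    return Transforms.cosineSim(docCentroid, labelVector);
}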

The output from our preceding print statements is as follows:

Document 'finance' falls into the following categories: 
finance: 0.2889593541622162
health: 0.11753179132938385
science: 0.021202782168984413
Document 'health' falls into the following categories: 
finance: 0.059537000954151154
health: 0.27373185753822327
science: 0.07699354737997055

In each instance, our documents were classified properly, as demonstrated by the highest score being assigned to the correct label. This classification can be used in conjunction with other data analysis techniques to draw additional conclusions about the data contained in the files. Text classification is often an initial or early step in a data analysis process, as documents are sorted into groups for further analysis.

Classifying text by similarity

In this next example, we will match different text samples based on their structure and similarity. We will still be using the ParagraphVectors class we used in the previous example. To begin, download the raw_sentences.txt file from GitHub (https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/resources) and add it to your project. This file contains a list of sentences which we will read in, label, and then compare.

First, we set up our ClassPathResource and assign an iterator to handle our file data. We have used a SentenceIterator for this example:

ClassPathResource srcFile = new ClassPathResource("/raw_sentences.txt");
File file = srcFile.getFile();
SentenceIterator iter = new BasicLineIterator(file);

Next, we will again use a TokenizerFactory to tokenize our data. We also want to create a new LabelsSource object. This allows us to define the format of our sentence labels. We have chosen to prefix each label with LINE_, so the sentences are labeled LINE_0, LINE_1, and so on, in the order they are read:

TokenizerFactory tFact = new DefaultTokenizerFactory(); 
tFact.setTokenPreProcessor(new CommonPreprocessor()); 
LabelsSource labelFormat = new LabelsSource("LINE_"); 

Now we are ready to build our ParagraphVectors. Our setup process includes these methods: minWordFrequency, which specifies the minimum frequency a word must have in the training corpus to be used, and iterations, which specifies the number of iterations over each mini-batch. We also set the number of epochs, the layer size, the learning rate, and the window size, which determines the context window used during training. Additionally, we include our LabelsSource, defined previously, and our iterator and tokenizer. The trainWordVectors method specifies whether word and document representations should be built together. Finally, sampling controls the subsampling of frequent words; passing 0 disables it. We then call our build and fit methods:

ParagraphVectors vec = new ParagraphVectors.Builder() 
        .minWordFrequency(1) 
        .iterations(5) 
        .epochs(1) 
        .layerSize(100) 
        .learningRate(0.025) 
        .labelsSource(labelFormat) 
        .windowSize(5) 
        .iterate(iter) 
        .trainWordVectors(false) 
        .tokenizerFactory(tFact) 
        .sampling(0) 
        .build(); 
 
vec.fit(); 
 

Next, we will include some statements to evaluate the accuracy of our classifications. It is important to note that, while line numbering in the document starts at 1, label indexing begins at 0; so, for example, line 9836 in the document is associated with the label LINE_9835. We will first compare three pairs of sentences that should be classified as somewhat similar, and then two pairs of dissimilar sentences. The similarity method takes two labels and returns the similarity between them as a double (a cosine similarity, so higher values indicate more similar sentences):

double similar1 = vec.similarity("LINE_9835", "LINE_12492");
out.println("Comparing lines 9836 & 12493 "
        + "('This is my house .'/'This is my world .') "
        + "Similarity = " + similar1);

double similar2 = vec.similarity("LINE_3720", "LINE_16392");
out.println("Comparing lines 3721 & 16393 "
        + "('This is my way .'/'This is my work .') "
        + "Similarity = " + similar2);

double similar3 = vec.similarity("LINE_6347", "LINE_3720");
out.println("Comparing lines 6348 & 3721 "
        + "('This is my case .'/'This is my way .') "
        + "Similarity = " + similar3);

double dissimilar1 = vec.similarity("LINE_3720", "LINE_9852");
out.println("Comparing lines 3721 & 9853 "
        + "('This is my way .'/'We now have one .') "
        + "Similarity = " + dissimilar1);

double dissimilar2 = vec.similarity("LINE_3720", "LINE_3719");
out.println("Comparing lines 3721 & 3720 "
        + "('This is my way .'/'At first he says no .') "
        + "Similarity = " + dissimilar2);

The output of our print statements is shown as follows. Compare the results of the similarity method for the three similar pairs of sentences and the two dissimilar pairs. Of particular note, the similarity method returned a negative number for the last example, two very dissimilar sentences. Because the score is a cosine similarity, a negative value indicates an even more significant disparity:

16:56:15.423 [main] INFO o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [3171540]; Lines vectorized so far: [485810]; learningRate: [1.0E-4]
Comparing lines 9836 & 12493 ('This is my house .'/'This is my world .') Similarity = 0.7641470432281494
Comparing lines 3721 & 16393 ('This is my way .'/'This is my work .') Similarity = 0.7246013879776001
Comparing lines 6348 & 3721 ('This is my case .'/'This is my way .') Similarity = 0.8988922834396362
Comparing lines 3721 & 9853 ('This is my way .'/'We now have one .') Similarity = 0.5840312242507935
Comparing lines 3721 & 3720 ('This is my way .'/'At first he says no .') Similarity = -0.6491150259971619

Although this example uses the same ParagraphVectors class as our first classification example, it demonstrates the flexibility of our approach: the same DL4J libraries can be used to classify data in more than one way.
