Building Conditional Random Fields for sequential text data

Conditional Random Fields (CRFs) are probabilistic models used to analyze structured data. They are frequently used to label and segment sequential data. CRFs are discriminative models, as opposed to HMMs, which are generative models. CRFs are used extensively to analyze sequences such as stock prices, speech, and words. In these models, given a particular observation sequence, we define a conditional probability distribution over the possible label sequences. This is in contrast with HMMs, where we define a joint distribution over the labels and the observed sequence.
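
To make the conditional distribution concrete, here is a minimal sketch that computes p(y | x) for a tiny linear-chain model by enumerating every label sequence. It is an illustration only: the two-letter label set, the `unary` and `pairwise` weight tables, and the function names are made up for this example and are not part of pystruct.

```python
import itertools
import math

LABELS = ["a", "b"]  # toy label set (assumption for this sketch)


def score(x, y, unary, pairwise):
    """Unnormalized log-score of label sequence y for observations x."""
    total = sum(unary[(obs, lab)] for obs, lab in zip(x, y))
    total += sum(pairwise[(prev, cur)] for prev, cur in zip(y, y[1:]))
    return total


def conditional(x, unary, pairwise):
    """Return p(y | x) for every label sequence y by brute-force enumeration."""
    candidates = list(itertools.product(LABELS, repeat=len(x)))
    weights = [math.exp(score(x, y, unary, pairwise)) for y in candidates]
    z = sum(weights)  # normalizer Z(x): sums over label sequences only
    return {y: w / z for y, w in zip(candidates, weights)}


# Toy weights, invented for illustration
unary = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 0.0, (1, "b"): 1.0}
pairwise = {("a", "a"): 0.5, ("a", "b"): 0.0, ("b", "a"): 0.0, ("b", "b"): 0.5}

probs = conditional([0, 1, 1], unary, pairwise)
best = max(probs, key=probs.get)  # most probable label sequence
```

Note that the normalizer is computed for one fixed observation sequence, so nothing is modeled about how the observations themselves are distributed. Enumerating all label sequences is only feasible for toy inputs; practical implementations rely on dynamic programming instead.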

Getting ready

HMMs assume that the current output is statistically independent of the previous outputs, given the current hidden state. HMMs need this assumption to keep inference tractable. However, this assumption need not always be true! The current output in a time series setup, more often than not, depends on previous outputs. One of the main advantages of CRFs over HMMs is that they are conditional by nature, which means that we are not assuming any independence between output observations. CRFs also tend to outperform HMMs in a number of applications, such as linguistics, bioinformatics, and speech analysis. In this recipe, we will learn how to use CRFs to analyze sequences of letters.
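
As a rough illustration of why sequence context matters for letters, the following sketch compares labeling each position on its own with decoding the whole chain at once. It is not pystruct's inference code: the three-letter label set, the score tables, and both function names are assumptions made for this example.

```python
LETTERS = "quv"  # toy label set: index 0 = 'q', 1 = 'u', 2 = 'v'


def independent_decode(unary):
    """Pick the best label at each position, ignoring the neighbours."""
    return [max(range(len(row)), key=row.__getitem__) for row in unary]


def viterbi_decode(unary, pairwise):
    """Best label sequence under per-position plus transition scores."""
    n_labels = len(unary[0])
    best = list(unary[0])  # best score of any path ending in each label
    back = []              # back-pointers, one list per later position
    for row in unary[1:]:
        pointers, scores = [], []
        for cur in range(n_labels):
            prev = max(range(n_labels),
                       key=lambda p: best[p] + pairwise[p][cur])
            pointers.append(prev)
            scores.append(best[prev] + pairwise[prev][cur] + row[cur])
        back.append(pointers)
        best = scores
    label = max(range(n_labels), key=best.__getitem__)
    path = [label]
    for pointers in reversed(back):  # follow the back-pointers to the start
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Invented scores: the second letter looks slightly more like 'v' than 'u',
# but the transition table rewards 'u' following 'q'.
unary = [[2.0, 0.0, 0.0], [0.0, 0.9, 1.0]]
pairwise = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

alone = independent_decode(unary)          # decodes to "qv"
chained = viterbi_decode(unary, pairwise)  # decodes to "qu"
```

With these made-up numbers, the per-position decision picks "qv", while the chain model prefers "qu" because the transition score outweighs the small difference in the per-letter scores.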

We will use a library called pystruct to build and train CRFs. Make sure that you install this before you proceed. You can find the installation instructions at https://pystruct.github.io/installation.html.

How to do it…

  1. Create a new Python file, and import the following packages:
    import os
    import argparse
    import pickle
    
    import numpy as np
    import matplotlib.pyplot as plt
    from pystruct.datasets import load_letters
    from pystruct.models import ChainCRF
    from pystruct.learners import FrankWolfeSSVM
  2. Define an argument parser to take the C value as an input argument. C is a regularization hyperparameter that controls the trade-off between fitting the training data closely and keeping the power to generalize:
    def build_arg_parser():
        parser = argparse.ArgumentParser(description='Trains the CRF classifier')
        parser.add_argument("--c-value", dest="c_value", required=False, type=float,
                default=1.0, help="The C value that will be used for training")
        return parser
  3. Define a class to handle all CRF-related processing:
    class CRFTrainer(object):
  4. Define an init function to initialize the values:
        def __init__(self, c_value, classifier_name='ChainCRF'):
            self.c_value = c_value
            self.classifier_name = classifier_name
  5. We will use a chain CRF to analyze the data. We add a check on the classifier name so that any other value is rejected, as follows:
            if self.classifier_name == 'ChainCRF':
                model = ChainCRF()
  6. Define the learner that we will use to train our CRF model. We will use a structural Support Vector Machine (SSVM), optimized with the Frank-Wolfe algorithm, to achieve this:
                self.clf = FrankWolfeSSVM(model=model, C=self.c_value, max_iter=50) 
            else:
                raise TypeError('Invalid classifier type')
  7. Load the letters dataset. This dataset consists of segmented letters and their associated feature vectors. We will not analyze the images because we already have the feature vectors. The first letter of each word has been removed, so all the remaining letters are lowercase:
        def load_data(self):
            letters = load_letters()
  8. Load the data and labels into their respective variables:
            X, y, folds = letters['data'], letters['labels'], letters['folds']
            # Words differ in length, so store the samples as object arrays
            X, y = np.array(X, dtype=object), np.array(y, dtype=object)
            return X, y, folds
  9. Define a training method, as follows:
        # X is a numpy array of samples where each sample
        # has the shape (n_letters, n_features) 
        def train(self, X_train, y_train):
            self.clf.fit(X_train, y_train)
  10. Define a method to evaluate the performance of the model:
        def evaluate(self, X_test, y_test):
            return self.clf.score(X_test, y_test)
  11. Define a method to classify new data:
        # Run the classifier on input data
        def classify(self, input_data):
            return self.clf.predict(input_data)[0]
  12. The labels are stored as integer indices. In order to check the output and make it readable, we need to transform these numbers into letters. Define a function to do this:
    def decoder(arr):
        alphabets = 'abcdefghijklmnopqrstuvwxyz'
        output = ''
        for i in arr:
            output += alphabets[i] 
    
        return output
  13. Define the main function and parse the input arguments:
    if __name__=='__main__':
        args = build_arg_parser().parse_args()
        c_value = args.c_value
  14. Create a CRFTrainer object with the C value:
        crf = CRFTrainer(c_value)
  15. Load the letters data:
        X, y, folds = crf.load_data()
  16. Separate the data into training and testing datasets. We train on fold 1 and test on all the remaining folds:
        X_train, X_test = X[folds == 1], X[folds != 1]
        y_train, y_test = y[folds == 1], y[folds != 1]
  17. Train the CRF model, as follows:
        print("\nTraining the CRF model...")
        crf.train(X_train, y_train)
  18. Evaluate the performance of the CRF model:
        score = crf.evaluate(X_test, y_test)
        print("\nAccuracy score = " + str(round(score*100, 2)) + '%')
  19. Let's take the first test sample and predict the output using the model:
        print("\nTrue label = " + decoder(y_test[0]))
        predicted_output = crf.classify([X_test[0]])
        print("Predicted output = " + decoder(predicted_output))
  20. The full code is given in the crf.py file that is already provided to you. If you run this code, the accuracy score, the true label, and the predicted output are printed on your Terminal. The word is supposed to be "commanding" (without its first letter, which is missing from the dataset), and the CRF does a pretty good job of predicting all the letters.
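
The two helper functions from the steps above can be tried out on their own, without pystruct or the dataset. The following is a minimal sketch: `build_arg_parser` is copied from the recipe, `decoder` is an equivalent version written with `join`, and the list of label indices is hand-written for illustration rather than taken from the dataset.

```python
import argparse


def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains the CRF classifier')
    parser.add_argument("--c-value", dest="c_value", required=False, type=float,
            default=1.0, help="The C value that will be used for training")
    return parser


def decoder(arr):
    # Map each integer label (0-25) to its lowercase letter
    alphabets = 'abcdefghijklmnopqrstuvwxyz'
    return ''.join(alphabets[i] for i in arr)


# Passing an explicit list stands in for the command line
default_args = build_arg_parser().parse_args([])
custom_args = build_arg_parser().parse_args(['--c-value', '0.5'])

# Hypothetical label indices spelling "commanding" without its first letter
word = decoder([14, 12, 12, 0, 13, 3, 8, 13, 6])
```

Here `default_args.c_value` falls back to 1.0, `custom_args.c_value` is parsed as the float 0.5, and `word` decodes to "ommanding".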