Hidden Markov models

A Hidden Markov model, often abbreviated to HMM (the abbreviation we will use from here on), is a Bayesian network with a repeating structure that is commonly used to model and predict sequences. In this section, we'll see two applications of this model: one to model DNA gene sequences, and another to model the sequences of letters that make up English text. The basic diagram for an HMM is shown here:

[Figure: Hidden Markov models]

As we can see in the diagram, the sequence flows from left to right, and we have a pair of nodes for every entry in the sequence that we are trying to model. The nodes labeled Ci are known as latent states, hidden states, or simply states, as they are typically not observable. The nodes labeled Oi are observed states or observations. We will use the terms states and observations from here on.

Now, as this is a Bayesian network, we can immediately identify some key properties. All the observations are independent of each other given their corresponding states. Also, every state is independent of all earlier states in the sequence, given the state that immediately precedes it (its parent in the network). The key idea behind an HMM, therefore, is that the model moves in a linear fashion from one state to the next.
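Put another way, these independence assumptions mean that the joint probability of a state sequence and its observations factorizes into simple local terms. Written below in standard HMM notation, with Ci for the states and Oi for the observations as in the diagram:

P(C_1, \dots, C_T, O_1, \dots, O_T) \;=\; P(C_1)\, P(O_1 \mid C_1) \prod_{i=2}^{T} P(C_i \mid C_{i-1})\, P(O_i \mid C_i)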

In each latent state, the model produces an observation, also known as an emitted symbol. These symbols make up the observed part of the sequence. Hidden Markov models are very common in natural language processing, and a good example of their use is part of speech tagging. The task of a part of speech tagger is to read a sentence and return the sequence of corresponding part of speech labels for the words in that sentence. For example, given the previous sentence, a part of speech tagger might return determiner for the word The, singular noun for the word task, and so on.

To model this using an HMM, we would have the words be the emitted symbols, and the part of speech tags be the latent states, as the former are observable and the latter are what we want to determine. There are many other sequence labeling tasks in natural language processing to which Hidden Markov models have been applied, such as named entity recognition, where the goal is to identify the words in a sentence that refer to names of individuals, locations, organizations, and other entities.

A Hidden Markov model comprises five core components. The first of these is the set of possible latent class labels. For the part of speech tagger example, this might be a list of all the part of speech tags that we will use. The second component is the set of all possible emitted symbols. For an English part of speech tagger, this is the dictionary of English words.

The next three components involve probabilities. The starting probability vector tells us the probability of starting the sequence in each latent state. For part of speech tagging, we may, for example, have a high probability of starting with a determiner such as the. The transition probability matrix gives the probability of moving to state Cj when the current state is Ci; for our part of speech example, this includes the probability of moving from a determiner to a noun. Finally, the emission probability matrix tells us the probability of emitting each symbol in our dictionary from each state that we can be in. Note that some words (such as bank, which is both a noun and a verb) can be labeled with more than one part of speech tag, and so will have nonzero probabilities of being emitted from more than one state.
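A minimal sketch in Python may help make these five components concrete. The tag set, dictionary, and all of the probabilities below are toy values invented purely for illustration; they are not estimated from any real corpus:

# A minimal sketch of the five HMM components for a toy part of speech
# tagger. All values below are invented purely for illustration.

states = ["DET", "NOUN", "VERB"]              # possible latent class labels
symbols = ["the", "task", "bank", "reads"]    # possible emitted symbols (the dictionary)

# Starting probability vector: probability of beginning a sequence in each state.
start_prob = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}

# Transition probability matrix: P(next state | current state).
trans_prob = {
    "DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}

# Emission probability matrix: P(symbol | state). Note that an ambiguous
# word such as "bank" has nonzero probability under more than one state.
emit_prob = {
    "DET":  {"the": 1.0, "task": 0.0, "bank": 0.0, "reads": 0.0},
    "NOUN": {"the": 0.0, "task": 0.5, "bank": 0.4, "reads": 0.1},
    "VERB": {"the": 0.0, "task": 0.1, "bank": 0.3, "reads": 0.6},
}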

In circumstances such as part of speech tagging, we usually have a collection of labeled sequences, so our data contain both the sequences of observations and their corresponding states. In this case, just as with the Naïve Bayes model, we use relative frequency counts to populate the probability components of our model.

For example, to find a suitable starting probability vector, we could tabulate the starting state for every sequence in our data set and use this to get the relative frequency of beginning in each state. When all we have are unlabeled sequences, the task is significantly harder because we might not even know how many states we need to include in our model. One method for training an HMM on unlabeled observation sequences, by iteratively estimating its probability components, is known as the Baum-Welch algorithm.
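To make the relative frequency idea concrete, here is a minimal sketch in Python. The tiny labeled corpus is invented for illustration; the code simply tabulates the starting tags and consecutive tag pairs, then normalizes the counts:

from collections import Counter

# Invented toy training data: each sequence is a list of (word, tag) pairs.
labeled_sequences = [
    [("the", "DET"), ("task", "NOUN"), ("reads", "VERB")],
    [("the", "DET"), ("bank", "NOUN")],
]

# Starting probabilities: tabulate the first tag of every sequence.
start_counts = Counter(seq[0][1] for seq in labeled_sequences)
total_starts = sum(start_counts.values())
estimated_start_prob = {tag: n / total_starts for tag, n in start_counts.items()}

# Transition probabilities: count consecutive tag pairs and normalize by
# how often each tag occurs as a predecessor.
pair_counts = Counter()
prev_counts = Counter()
for seq in labeled_sequences:
    tags = [tag for _, tag in seq]
    for prev, curr in zip(tags, tags[1:]):
        pair_counts[(prev, curr)] += 1
        prev_counts[prev] += 1

estimated_trans_prob = {(p, c): n / prev_counts[p] for (p, c), n in pair_counts.items()}

print(estimated_start_prob)   # {'DET': 1.0}
print(estimated_trans_prob)   # {('DET', 'NOUN'): 1.0, ('NOUN', 'VERB'): 1.0}

The emission probability matrix can be estimated in the same way, by counting how often each word appears with each tag and normalizing by the tag counts.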

Once we know the parameters of our model, the question becomes how to predict the most likely sequence of states behind a sequence of observations. Given an unlabeled sentence in English, a part of speech tagger based on an HMM must predict the sequence of part of speech labels. The most commonly used algorithm for this relies on a technique called dynamic programming and is known as the Viterbi algorithm.
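To give a flavor of how dynamic programming applies here, the following is a bare-bones sketch of the Viterbi algorithm in Python. It assumes the toy states, start_prob, trans_prob, and emit_prob tables from the first sketch above and is only meant to illustrate the idea:

# A minimal Viterbi sketch: returns the most likely state sequence for a
# sequence of observations, given the toy HMM tables defined earlier.
def viterbi(observations, states, start_prob, trans_prob, emit_prob):
    # best[i][s] is the probability of the best path ending in state s
    # after i+1 observations; back[i][s] remembers the preceding state.
    best = [{s: start_prob[s] * emit_prob[s][observations[0]] for s in states}]
    back = [{}]
    for i, obs in enumerate(observations[1:], start=1):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[i - 1][p] * trans_prob[p][s] * emit_prob[s][obs], p)
                for p in states
            )
            best[i][s] = prob
            back[i][s] = prev
    # Trace back from the most probable final state.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "bank", "reads"], states, start_prob, trans_prob, emit_prob))
# ['DET', 'NOUN', 'VERB']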

The algorithms we have discussed for the Hidden Markov model are beyond the scope of this book but are quite intuitive and well worth studying. Given a basic understanding of the core components of the model and its assumptions, our next goal is to see how we can apply them to some real-world situations. We will first see an example with labeled sequences and later an example with unlabeled sequences.

Tip

Perhaps the most definitive and thorough introduction to Hidden Markov models is the seminal paper A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition by L. R. Rabiner, published in the Proceedings of the IEEE in 1989. The Jurafsky and Martin textbook we mentioned earlier is also an ideal reference to learn more about HMMs, including details on the Baum-Welch and Viterbi algorithms, as well as applications such as part of speech tagging and named entity recognition.
