Probabilistic graphical models (PGMs), also known simply as graphical models, capture the relationships between variables and represent probability distributions. PGMs capture joint probability distributions and can be used to answer queries and make inferences, allowing us to make predictions on unseen data. PGMs have the great advantage of capturing expert domain knowledge and causal relationships between variables when modeling systems. By representing structure explicitly, PGMs capture knowledge in a representational framework that makes the domain and models easier to share and understand. PGMs capture uncertainty, or the probabilistic nature of a system, very well and are thus very useful in applications that need scoring or uncertainty-based approaches. PGMs are used in a wide variety of machine learning applications in domains such as language processing, text mining and information extraction, computer vision, disease diagnosis, and DNA structure prediction.
Judea Pearl pioneered the area of PGMs and was the first to introduce Bayesian networks (References [2] and [7]). Although covering all there is to know about PGMs is beyond the scope of this chapter, our goal is to cover the most important aspects of PGMs—Bayesian networks and directed PGMs—in some detail. We will divide the subject into the areas of representation, inference, and learning, and will discuss specific algorithms and sub-topics in each of these areas. We will cover Markov networks and undirected PGMs, summarizing some differences from and similarities with directed PGMs, and addressing related areas such as inference and learning. Finally, we will discuss specialized networks such as tree augmented networks (TAN), Markov chains, and hidden Markov models (HMMs). For an in-depth treatment of the subject, see Probabilistic Graphical Models, by Koller and Friedman (References [1]).
Many basic concepts of probability are detailed in Appendix B, Probability. Some of the key ideas in probability theory form the building blocks of probabilistic graph models. A good grasp of the relevant theory can help a great deal in understanding PGMs and how they are used to make inferences from data.
In this section, we will discuss important concepts related to probability theory that will be used in the discussion later in this chapter.
The essence of conditional probability, given two related events α and β, is to capture how we assign a probability to one of the events when the other is known to have occurred. The conditional probability, or the conditional distribution, is represented by P(α | β), that is, the probability of event α happening given that event β has occurred (equivalently, given that β is true), and is formally defined as:

P(α | β) = P(α ∩ β) / P(β)

P(α ∩ β) captures the events where both α and β occur.
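As a minimal sketch of this definition, the following computes a conditional distribution from a toy joint distribution over two binary events. The event names and probabilities are hypothetical illustration values, not from the text.

```python
# Hypothetical joint distribution over two events, keyed by (alpha, beta) outcomes.
joint = {
    ("rain", "traffic"): 0.3,
    ("rain", "no_traffic"): 0.1,
    ("no_rain", "traffic"): 0.2,
    ("no_rain", "no_traffic"): 0.4,
}

def p_alpha_given_beta(joint, beta):
    """P(alpha | beta) for every alpha, via P(alpha AND beta) / P(beta)."""
    p_beta = sum(p for (a, b), p in joint.items() if b == beta)
    return {a: p / p_beta for (a, b), p in joint.items() if b == beta}

print(p_alpha_given_beta(joint, "traffic"))
# P(rain | traffic) = 0.3 / (0.3 + 0.2) = 0.6
```

Note that the conditional distribution renormalizes the slice of the joint consistent with the conditioning event, so its values always sum to 1.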
The conditional probability definition gives rise to the chain rule of conditional probabilities, which says that when there are multiple events α1, α2, …, αn, then:

P(α1 ∩ α2 ∩ … ∩ αn) = P(α1)P(α2 | α1)P(α3 | α1 ∩ α2) … P(αn | α1 ∩ α2 ∩ … ∩ αn-1)
The probability of several events can be expressed as the probability of the first, times the probability of the second given the first, and so on. Thus, the probability of αn is conditioned on all of α1 to αn-1, and the factorization holds regardless of the order in which the events are expanded.
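The chain rule can be checked numerically. The sketch below builds a hypothetical joint distribution over three binary events and verifies that the product of the conditionals recovers the joint probability for every assignment.

```python
import itertools
import random

# Build a random normalized joint distribution over three binary events.
random.seed(0)
keys = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in keys]
total = sum(weights)
joint = {k: w / total for k, w in zip(keys, weights)}

def prob(event):
    """Probability of a partial assignment, given as {position: value}."""
    return sum(p for k, p in joint.items()
               if all(k[i] == v for i, v in event.items()))

# Verify P(a1, a2, a3) = P(a1) * P(a2 | a1) * P(a3 | a1, a2) everywhere.
for a1, a2, a3 in keys:
    p1 = prob({0: a1})                                        # P(a1)
    p2_given_1 = prob({0: a1, 1: a2}) / p1                    # P(a2 | a1)
    p3_given_12 = joint[(a1, a2, a3)] / prob({0: a1, 1: a2})  # P(a3 | a1, a2)
    assert abs(p1 * p2_given_1 * p3_given_12 - joint[(a1, a2, a3)]) < 1e-12
print("chain rule holds for every assignment")
```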
Bayes' rule also follows from the conditional probability rule and can be given formally as:

P(α | β) = P(β | α)P(α) / P(β)
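A brief sketch of Bayes' rule, P(α | β) = P(β | α)P(α)/P(β), in the familiar diagnostic-test setting. All numbers here are hypothetical illustration values.

```python
# Hypothetical prior and likelihoods for a diagnostic test.
p_disease = 0.01              # prior P(alpha)
p_pos_given_disease = 0.95    # P(beta | alpha)
p_pos_given_healthy = 0.05    # P(beta | not alpha)

# P(beta) expanded by the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: posterior P(alpha | beta).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))  # -> 0.161
```

Even with a highly accurate test, the low prior keeps the posterior small, which is exactly the kind of reasoning PGMs automate at scale.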
It is natural to map event spaces and outcomes to attributes and values. Random variables are defined as attributes that take on known specific values. For example, if Grade is an attribute associated with Student and has values {A, B, C}, then Grade is a random variable and P(Grade = A) is the probability of one of its outcomes.
Random variables are generally denoted by capital letters, such as X, Y, and Z; the set of values a variable X can take is denoted by Val(X), with a lowercase letter such as x denoting a particular value. In this chapter, we will primarily discuss values that are categorical in nature, that is, that take a fixed number of discrete values. In the real world, variables can have continuous representations too. The distribution of a variable with categories {x1, x2, …, xn} can be represented as a set of probabilities P(X = xi), one per category, summing to 1.
Such a distribution over many categories is called a multinomial distribution. In the special case when there are only two categories, the distribution is said to be the Bernoulli distribution.
Given a random variable, the probability distribution over all the events described by that variable is known as the marginal distribution. For example, if Grade is the random variable, the marginal distribution can be defined as P(Grade = A) = 0.25, P(Grade = B) = 0.37, and P(Grade = C) = 0.38.
In many real-world models, there is more than one random variable, and the distribution that considers all of these random variables is called the joint distribution. For example, if the Intelligence of the student is considered as another variable, denoted by P(Intelligence) or P(I), with binary outcomes {low, high}, then the distribution over Intelligence and Grade, represented as P(Intelligence, Grade) or P(I, G), is the joint distribution.
The marginal distribution of one random variable can be computed from the joint distribution by summing out all of the other variables. The marginal distribution over Grade can be obtained by summing over all the rows, as shown in Table 1, and that over Intelligence by summing over the columns.
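Summing out is mechanical enough to sketch directly. The joint table below is hypothetical but is chosen so that summing out Intelligence reproduces the Grade marginal used earlier (0.25, 0.37, 0.38).

```python
# Hypothetical joint distribution P(I, G), keyed by (Intelligence, Grade).
joint = {
    ("low", "A"): 0.05, ("low", "B"): 0.25, ("low", "C"): 0.30,
    ("high", "A"): 0.20, ("high", "B"): 0.12, ("high", "C"): 0.08,
}

def marginal(joint, axis):
    """Sum out all other variables; axis 0 = Intelligence, axis 1 = Grade."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

print(marginal(joint, 1))  # P(Grade): A=0.25, B=0.37, C=0.38
print(marginal(joint, 0))  # P(Intelligence): low=0.60, high=0.40
```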
Marginal independence is defined as follows. Consider two random variables X and Y; then P(X | Y) = P(X) means random variable X is independent of Y. This is formally represented as P ⊨ (X ⊥ Y), that is, P satisfies the statement that X is independent of Y.
This means the joint distribution can be given as:
P(X, Y) = P(X)P(Y)
If the difficulty level of the exam (D) and the intelligence of the student (I) together determine the grade (G), and we know that the difficulty level of the exam is independent of the intelligence of the student, then (D ⊥ I), which implies P(D, I) = P(D)P(I).
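Marginal independence can be tested directly against a joint table by comparing each entry with the product of the marginals. The joint below is hypothetical and is constructed so that it factorizes as P(D)P(I).

```python
# Hypothetical joint P(D, I) over Difficulty {easy, hard} and Intelligence {low, high}.
joint = {
    ("easy", "low"): 0.42, ("easy", "high"): 0.18,
    ("hard", "low"): 0.28, ("hard", "high"): 0.12,
}

# Compute both marginals by summing out the other variable.
p_d, p_i = {}, {}
for (d, i), p in joint.items():
    p_d[d] = p_d.get(d, 0.0) + p
    p_i[i] = p_i.get(i, 0.0) + p

# (D ⊥ I) holds iff every joint entry equals the product of its marginals.
independent = all(abs(joint[(d, i)] - p_d[d] * p_i[i]) < 1e-9
                  for (d, i) in joint)
print(independent)  # -> True
```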
When two random variables are independent given a third variable, the independence is called conditional independence. Given a set of three random variables X, Y, and Z, we can say that (X ⊥ Y | Z); that is, variable X is independent of Y given Z. The necessary condition for conditional independence is:

P(X, Y | Z) = P(X | Z)P(Y | Z)
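The condition P(X, Y | Z) = P(X | Z)P(Y | Z) can also be verified numerically. The sketch below builds a hypothetical joint over three binary variables so that (X ⊥ Y | Z) holds by construction, then checks the factorization for each value of Z.

```python
import itertools

# Hypothetical CPDs: the joint factorizes as P(Z) P(X | Z) P(Y | Z),
# so (X ⊥ Y | Z) holds by construction.
p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p_y_given_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in itertools.product([0, 1], repeat=3)}

def cond(joint, z):
    """P(X, Y | Z=z) as a table over (x, y)."""
    pz = sum(p for (x, y, zz), p in joint.items() if zz == z)
    return {(x, y): p / pz for (x, y, zz), p in joint.items() if zz == z}

# For every z, each entry must equal the product of its conditional marginals.
for z in (0, 1):
    table = cond(joint, z)
    for (x, y), p in table.items():
        px = sum(q for (xx, yy), q in table.items() if xx == x)
        py = sum(q for (xx, yy), q in table.items() if yy == y)
        assert abs(p - px * py) < 1e-9
print("X is conditionally independent of Y given Z")
```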
Factors are the basic building blocks for defining the probability distributions in high-dimensional (large number of variables) spaces. They give basic operations that help in manipulating the probability distributions.
A "factor" is defined as a function that takes as input a set of random variables, known as its "scope", and gives a real-valued output.
Formally, a factor is a function φ(X1, X2, …, Xk) that maps each assignment of the variables in its scope (X1, X2, …, Xk) to a real value.
Different types of factors are as follows:
Various operations are performed on factors, such as:
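As a minimal sketch of how factors are manipulated, the code below implements two standard factor operations, factor product and marginalization (summing out a variable), over binary-valued variables stored as plain tables. The factor names and values are hypothetical, and restricting to binary domains is a simplification for brevity.

```python
import itertools

def factor_product(f1, scope1, f2, scope2):
    """Multiply two factors; the result's scope is the union of the scopes."""
    scope = scope1 + [v for v in scope2 if v not in scope1]
    out = {}
    for assign in itertools.product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, assign))
        v1 = f1[tuple(a[v] for v in scope1)]
        v2 = f2[tuple(a[v] for v in scope2)]
        out[assign] = v1 * v2
    return out, scope

def marginalize(f, scope, var):
    """Sum out one variable from a factor."""
    idx = scope.index(var)
    new_scope = [v for v in scope if v != var]
    out = {}
    for assign, val in f.items():
        key = tuple(v for i, v in enumerate(assign) if i != idx)
        out[key] = out.get(key, 0.0) + val
    return out, new_scope

# phi1(A, B) and phi2(B, C) as tables over binary variables.
phi1 = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.0}
phi2 = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.9, (1, 1): 0.1}

prod, scope = factor_product(phi1, ["A", "B"], phi2, ["B", "C"])
marg, scope2 = marginalize(prod, scope, "B")
print(scope2)  # -> ['A', 'C']
```

Chaining products and marginalizations in this way is the computational core of exact inference algorithms such as variable elimination.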
Given the probability over random variables, many queries can be performed to answer certain questions. Some common types of queries are explained in the subsequent sections.
This is one of the most common types of query, and it has two parts: the evidence, a subset E of random variables with an observed assignment e, and the query variables, a subset X of random variables. The task is to compute the posterior distribution:

P(X|E = e)
Examples of probabilistic queries are posterior marginal estimations, such as P(I = high | L = bad, S = low) = ?, and evidential probabilities, such as P(L = bad, S = low) = ?.
MAP queries are used to find the most likely assignment to a subset of variables and hence are also called most probable explanation (MPE) queries. The difference from probabilistic queries is that, instead of getting a probability, we get the most likely value for each of the variables.
Formally, if we have variables W = X – E, where E = e is the evidence and we are interested in finding the most likely assignment to the variables in W, then:

MAP(W|e) = argmaxw P(w, e)
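A brute-force sketch of a MAP query on a toy joint distribution: fix the evidence and search over the remaining assignments for the one maximizing P(w, e). The variable ordering and probabilities below are hypothetical; enumerating the whole joint is only feasible for tiny models, which is why real systems use dedicated inference algorithms instead.

```python
import itertools

# Hypothetical joint over three binary variables, tuple positions (I, G, L).
probs = [0.10, 0.02, 0.06, 0.02, 0.05, 0.15, 0.10, 0.50]
joint = dict(zip(itertools.product([0, 1], repeat=3), probs))

def map_query(joint, evidence):
    """Return the full assignment maximizing P(w, e); evidence = {index: value}."""
    candidates = {a: p for a, p in joint.items()
                  if all(a[i] == v for i, v in evidence.items())}
    return max(candidates, key=candidates.get)

print(map_query(joint, {2: 1}))  # most likely assignment with L = 1 -> (1, 1, 1)
```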
A much more general form of the MAP query arises when we have a subset of variables Y that forms our query and, with evidence E = e, we are interested in finding the most likely assignment to the variables in Y. Using the MAP definition, we get:
MAP(Y|e) = argmaxyP(y|e)
Let's say Z = X – Y – E; then the marginal MAP query is:

MAP(Y|e) = argmaxy ΣZ P(y, Z|e)
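The marginal MAP query can be sketched the same brute-force way: sum out the nuisance variables Z first, then maximize over the query variable Y. The joint values are hypothetical, with tuple positions (Y, Z, E), all binary.

```python
import itertools

# Hypothetical joint over (Y, Z, E), all binary.
probs = [0.08, 0.12, 0.05, 0.05, 0.20, 0.10, 0.15, 0.25]
joint = dict(zip(itertools.product([0, 1], repeat=3), probs))

def marginal_map(joint, e):
    """argmax_y sum_z P(y, z, e): sum out Z, then maximize over Y."""
    scores = {}
    for (y, z, ee), p in joint.items():
        if ee == e:
            scores[y] = scores.get(y, 0.0) + p
    return max(scores, key=scores.get)

print(marginal_map(joint, 1))  # -> 1, since 0.10 + 0.25 > 0.12 + 0.05
```

The order of operations matters: summing out Z before maximizing can give a different answer than taking the Y component of a full MAP assignment, which is why marginal MAP is treated as a distinct, and computationally harder, query.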