Chapter 6. Probabilistic Graph Modeling

Probabilistic graph models (PGMs), also known as graphical models, capture the relationships between different variables and represent their probability distributions. PGMs encode joint probability distributions and can be used to answer queries and make inferences, allowing us to make predictions on unseen data. A great advantage of PGMs is their ability to capture expert domain knowledge and the causal relationships between variables when modeling a system. Because PGMs encode this knowledge in an explicit representational structure, the resulting models are easier to share and understand. PGMs also capture uncertainty, or the probabilistic nature of a system, very well, and are thus very useful in applications that need scoring or uncertainty-based approaches. PGMs are used in a wide variety of machine learning applications, including language processing, text mining and information extraction, computer vision, disease diagnosis, and DNA structure prediction.

Judea Pearl is a pioneer in the area of PGMs and was the first to introduce the topic of Bayesian networks (References [2] and [7]). Although covering all there is to know about PGMs is beyond the scope of this chapter, our goal is to cover the most important aspects of PGMs, namely Bayesian networks and directed PGMs, in some detail. We will divide the subject into the areas of representation, inference, and learning, and will discuss specific algorithms and sub-topics in each of these areas. We will then cover Markov networks and undirected PGMs, summarizing some differences from and similarities to directed PGMs, and addressing related areas such as inference and learning. Finally, we will discuss specialized networks such as tree augmented networks (TAN), Markov chains, and hidden Markov models (HMM). For an in-depth treatment of the subject, see Probabilistic Graphical Models by Koller and Friedman (References [1]).

Probability revisited

Many basic concepts of probability are detailed in Appendix B, Probability. Some of the key ideas in probability theory form the building blocks of probabilistic graph models. A good grasp of the relevant theory can help a great deal in understanding PGMs and how they are used to make inferences from data.

Concepts in probability

In this section, we will discuss important concepts related to probability theory that will be used in the discussion later in this chapter.

Conditional probability

The essence of conditional probability, given two related events α and β, is to capture how we assign a value to one of the events when the other is known to have occurred. The conditional probability, or the conditional distribution, is represented by P(α | β), that is, the probability of event α happening given that the event β has occurred (equivalently, given that β is true), and is formally defined as:

P(α | β) = P(α ∩ β) / P(β)

The term P(α ∩ β) captures the events where both α and β occur.

Chain rule and Bayes' theorem

The definition of conditional probability gives rise to the chain rule of conditional probabilities, which says that when there are multiple events α1, α2, …, αn, then:

P(α1 ∩ α2 ∩ … ∩ αn) = P(α1)P(α2 | α1)P(α3 | α1 ∩ α2) … P(αn | α1 ∩ α2 ∩ … ∩ αn-1)

The probability of several events can be expressed as the probability of the first, times the probability of the second given the first, and so on. Thus, the probability of αn is conditioned on all of α1 through αn-1, and the expansion holds for any ordering of the events.
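The chain rule can be checked with a small numeric example. The scenario below (drawing three aces in a row from a standard 52-card deck without replacement) is illustrative and not from the text:

```python
from fractions import Fraction

# Chain rule: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2).
# Drawing three aces in a row from a 52-card deck without replacement.
p_first = Fraction(4, 52)    # P(A1): 4 aces among 52 cards
p_second = Fraction(3, 51)   # P(A2 | A1): 3 aces left among 51 cards
p_third = Fraction(2, 50)    # P(A3 | A1 ∩ A2): 2 aces left among 50 cards

p_all_three = p_first * p_second * p_third
print(p_all_three)  # 1/5525
```

Using Fraction keeps the arithmetic exact, so the product reduces to exactly 1/5525.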

Bayes' rule also follows from the definition of conditional probability and can be stated formally as:

P(α | β) = P(β | α)P(α) / P(β)
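Bayes' rule can also be exercised numerically. The diagnostic-test numbers below (prior, true-positive rate, false-positive rate) are purely illustrative:

```python
# Bayes' rule on a hypothetical diagnostic test (numbers are illustrative).
p_disease = 0.01          # prior P(D)
p_pos_given_d = 0.90      # likelihood P(+ | D)
p_pos_given_not_d = 0.05  # false-positive rate P(+ | not D)

# P(+) by the law of total probability.
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Posterior P(D | +) = P(+ | D) P(D) / P(+).
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(round(p_d_given_pos, 4))  # 0.1538
```

Even with a fairly accurate test, the low prior keeps the posterior probability of disease below 16 percent, which is the kind of reasoning Bayes' rule makes explicit.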

Random variables, joint, and marginal distributions

It is natural to map event spaces and outcomes to attributes and their values. A random variable is an attribute that takes values from a known set. For example, if Grade is an attribute associated with Student and has values {A, B, C}, then Grade is a random variable and P(Grade = A) is the probability of one of its outcomes.

Random variables are generally denoted by capital letters, such as X, Y, and Z, and the values they take are denoted by lowercase letters, such as x; Val(X) denotes the set of values X can take. In this chapter, we will primarily discuss variables that are categorical in nature, that is, they take a fixed number of discrete values. In the real world, variables can have continuous representations too. The distribution of a variable with categories {x1, x2, …, xn} can be represented as:

P(X = xi) = θi, where θi ≥ 0 for all i and θ1 + θ2 + … + θn = 1

Such a distribution over many categories is called a multinomial distribution. In the special case where there are only two categories, it is called a Bernoulli distribution.

Given a random variable, the probability distribution over all the events described by that variable is known as the marginal distribution. For example, if Grade is the random variable, its marginal distribution could be P(Grade = A) = 0.25, P(Grade = B) = 0.37, and P(Grade = C) = 0.38.

In many real-world models, there is more than one random variable, and the distribution that considers all of these random variables together is called the joint distribution. For example, if the Intelligence of the student is considered as another variable, denoted by P(Intelligence) or P(I), with binary outcomes {low, high}, then the distribution considering both Intelligence and Grade, represented as P(Intelligence, Grade) or P(I, G), is the joint distribution.

The marginal distribution of one random variable can be computed from the joint distribution by summing the values over all of the other variables. The marginal distribution over Grade can be obtained by summing over the rows, as shown in Table 1, and that over Intelligence can be obtained by summing over the columns.
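This summing-out operation is easy to sketch in code. The joint values for P(I, G) below are assumptions chosen to be consistent with the marginal P(Grade) stated earlier; they are not taken from Table 1:

```python
# Hypothetical joint distribution P(I, G) over Intelligence {low, high} and
# Grade {A, B, C}; the numbers are illustrative, not from the book's Table 1.
joint = {
    ('low',  'A'): 0.05, ('low',  'B'): 0.25, ('low',  'C'): 0.30,
    ('high', 'A'): 0.20, ('high', 'B'): 0.12, ('high', 'C'): 0.08,
}

# Marginalize over Intelligence to get P(G): sum out the other variable.
p_grade = {}
for (i, g), p in joint.items():
    p_grade[g] = p_grade.get(g, 0.0) + p

# Marginalize over Grade to get P(I).
p_intel = {}
for (i, g), p in joint.items():
    p_intel[i] = p_intel.get(i, 0.0) + p

# p_grade matches P(Grade = A) = 0.25, B = 0.37, C = 0.38 up to float rounding.
print(p_grade)
print(p_intel)
```
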


Table 1. Marginal distributions over I and G

Marginal independence and conditional independence

Marginal independence is defined as follows. Consider two random variables X and Y; if P(X | Y) = P(X), then the random variable X is independent of Y. This is formally written as P ⊨ (X ⊥ Y), read as: P satisfies the statement that X is independent of Y.

This means the joint distribution can be given as:

P(X, Y) = P(X)P(Y)

If the difficulty level of the exam (D) and the intelligence of the student (I) together determine the grade (G), and we know that the difficulty level of the exam is independent of the intelligence of the student, then (D ⊥ I), which implies P(D, I) = P(D)P(I).

When two random variables are independent given a third variable, the independence is called conditional independence. Given a set of three random variables X, Y, and Z, we can write P ⊨ (X ⊥ Y | Z); that is, variable X is independent of Y given Z. The necessary condition for conditional independence is:

P(X, Y | Z) = P(X | Z)P(Y | Z)
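The distinction between marginal and conditional independence can be demonstrated with a small constructed distribution. Everything below (variable names and all numbers) is illustrative: the joint is built as P(Z)P(X|Z)P(Y|Z), so X and Y are conditionally independent given Z by construction, yet they are not marginally independent:

```python
# Illustrative sketch: X and Y are conditionally independent given Z,
# yet marginally dependent. All numbers are made up for the example.
p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(X | Z)
p_y_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}   # P(Y | Z)

# Joint built so that P(X, Y | Z) = P(X | Z) P(Y | Z) holds by construction.
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# Marginals P(X), P(Y), and the joint P(X, Y) with Z summed out.
p_x = {x: sum(joint[(x, y, z)] for y in (0, 1) for z in (0, 1)) for x in (0, 1)}
p_y = {y: sum(joint[(x, y, z)] for x in (0, 1) for z in (0, 1)) for y in (0, 1)}
p_xy = {(x, y): sum(joint[(x, y, z)] for z in (0, 1))
        for x in (0, 1) for y in (0, 1)}

# Marginally, X and Y are NOT independent: P(X, Y) != P(X) P(Y).
print(abs(p_xy[(0, 0)] - p_x[0] * p_y[0]) > 1e-6)  # True
```

Intuitively, observing X tells us something about Z, which in turn tells us something about Y, so the marginal dependence is real even though the conditional independence holds exactly.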

Factors

Factors are the basic building blocks for defining the probability distributions in high-dimensional (large number of variables) spaces. They give basic operations that help in manipulating the probability distributions.

A "factor" is defined as a function that takes as input the random variables known as "scope" and gives a real-value output.

Formally, a factor is represented as ϕ(X1, X2, …, Xk), a function from Val(X1, X2, …, Xk) to ℝ, where the scope is (X1, X2, …, Xk).

Factor types

Different types of factors are as follows:

  • Joint distribution: For every combination of variables, you get a real-valued output.
  • Unnormalized measure: When, in a joint distribution, one of the variables is fixed to a constant value, the output is still real-valued, but it is unnormalized, as the values no longer sum to one. It is nevertheless still a factor.
  • Conditional probability distribution: A probability distribution of the form P(G|I) is also a factor.

Various operations are performed on factors, such as:

  • Factor product: If two factors ϕ1(X1, X2) and ϕ2(X2, X3) are multiplied, the result is ϕ3(X1, X2, X3). In effect, the table corresponding to ϕ1 is multiplied with the table corresponding to ϕ2, matching rows on the shared variable X2.
  • Factor marginalization: This is the same as marginalizing a distribution: ϕ1(X1, X2, X3) can be marginalized over a variable, say X2, to give ϕ2(X1, X3).
  • Factor reduction: This fixes one of the variables to a constant value and keeps only the entries of the factor that agree with that value.
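The three factor operations above can be sketched with factors stored as dictionaries mapping assignments to real values. The factor tables and variable positions here are illustrative assumptions:

```python
# A minimal factor sketch: a factor maps assignments of its scope to reals.

def factor_product(phi1, phi2):
    """Multiply phi1(X1, X2) and phi2(X2, X3) into phi3(X1, X2, X3)."""
    out = {}
    for (x1, x2a), v1 in phi1.items():
        for (x2b, x3), v2 in phi2.items():
            if x2a == x2b:            # rows must agree on the shared variable X2
                out[(x1, x2a, x3)] = v1 * v2
    return out

def factor_marginalize(phi, axis):
    """Sum out the variable at position `axis` of the scope."""
    out = {}
    for assignment, v in phi.items():
        key = assignment[:axis] + assignment[axis + 1:]
        out[key] = out.get(key, 0.0) + v
    return out

def factor_reduce(phi, axis, value):
    """Keep only rows where the variable at `axis` equals `value`, then drop it."""
    return {a[:axis] + a[axis + 1:]: v for a, v in phi.items() if a[axis] == value}

phi1 = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}   # phi1(X1, X2)
phi2 = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.7}   # phi2(X2, X3)

phi3 = factor_product(phi1, phi2)      # phi3(X1, X2, X3)
phi13 = factor_marginalize(phi3, 1)    # sum out X2 to get phi(X1, X3)
phi_x2_0 = factor_reduce(phi1, 1, 0)   # fix X2 = 0 to get phi(X1)
print(phi13)
print(phi_x2_0)
```

Note that the product function above is hard-coded to the sharing pattern in the bullet (second position of ϕ1 matches first position of ϕ2); a general implementation would match variables by name rather than by position.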

Distribution queries

Given the probability over random variables, many queries can be performed to answer certain questions. Some common types of queries are explained in the subsequent sections.

Probabilistic queries

This is one of the most common types of query and it has two parts:

  • Evidence: A subset of variables with a well-known outcome or category. For example, a random variable E = e.
  • Query: A random variable from the rest of the variables. For example, a random variable X.

    P(X|E = e)

Examples of probabilistic queries are posterior marginal estimations such as P(I = high|L = bad, S = low) = ? and evidential probability such as P(L = bad, S = low) = ?.
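A probabilistic query can be answered from a joint table by reducing on the evidence and renormalizing. The joint P(X, E) below is an illustrative assumption:

```python
# Answering P(X | E = e) from a joint table: fix the evidence, then normalize.
# The joint P(X, E) below is illustrative.
joint = {('x0', 'e0'): 0.20, ('x0', 'e1'): 0.30,
         ('x1', 'e0'): 0.10, ('x1', 'e1'): 0.40}

evidence = 'e1'
reduced = {x: p for (x, e), p in joint.items() if e == evidence}  # unnormalized
z = sum(reduced.values())                                         # P(E = e1)
posterior = {x: p / z for x, p in reduced.items()}                # P(X | E = e1)
print(posterior)  # roughly 0.43 for x0 and 0.57 for x1
```

The normalizing constant z is itself the answer to an evidential probability query, P(E = e1), while the normalized table answers the posterior marginal query.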

MAP queries and marginal MAP queries

MAP queries are used to find the most likely joint assignment to a subset of variables and hence are also called the most probable explanation (MPE). The difference between these and probabilistic queries is that, instead of getting a probability distribution, we get the most likely value for each of the variables.

Formally, if we have variables W = X − E, where E = e is the evidence, and we are interested in finding the most likely assignment to the variables in W, then:

MAP(W|e) = argmaxwP(w,e)

A more general form is the marginal MAP query: we have a subset of variables, say Y, that forms our query, and with evidence E = e, we are interested in finding the most likely assignment to the variables in Y. Using the MAP definition, we get:

MAP(Y|e) = argmaxyP(y|e)

Let's say Z = X − Y − E; then the marginal MAP query is:

MAP(Y | e) = argmaxY ΣZ P(Y, Z | e)
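A marginal MAP query over a small table can be sketched as: sum out the non-query variables Z, then take the argmax over the query variable Y. The table of values for P(Y, Z | e) below is illustrative:

```python
# Hypothetical P(Y, Z | e), already reduced on the evidence; numbers illustrative.
p_yz = {('y0', 'z0'): 0.10, ('y0', 'z1'): 0.30,
        ('y1', 'z0'): 0.35, ('y1', 'z1'): 0.25}

# Marginal MAP over Y: first sum out Z, then take the argmax over y.
p_y = {}
for (y, z), p in p_yz.items():
    p_y[y] = p_y.get(y, 0.0) + p

map_y = max(p_y, key=p_y.get)
print(map_y)  # 'y1', since P(y1) = 0.6 > P(y0) = 0.4
```

Note that summing before maximizing matters: the single most probable joint cell here is ('y1', 'z0'), but a marginal MAP query asks for the best assignment to Y alone after Z has been summed out.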