Generally, all probabilistic graphical models have three basic elements, which form the important sections that follow:
We will use the well-known student network as an example of a Bayesian network in our discussions to illustrate the concepts and theory. The student network has five random variables capturing the relationship between various attributes defined as follows:
Each of these attributes takes categorical values; for example, the variable Difficulty (D) has two categories (d0, d1) corresponding to low and high, respectively, while Grade (G) has three categorical values corresponding to the grades (A, B, C). The arrows, as described in the section on graphs, indicate the dependencies encoded from domain knowledge: for example, Grade can be determined given that we know the Difficulty of the exam and the Intelligence of the student, while the Recommendation Letter is completely determined if we know just the Grade (Figure 2). It can further be observed that the absence of an explicit edge between two variables indicates that they are independent of each other; for example, the Difficulty of the exam and the Intelligence of the student are independent variables.
Figure 2. The "Student" network
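The structure just described can be captured as a simple parent map. This is a sketch, not any particular library's API; the variable abbreviations follow Figure 2.

```python
# Structure of the "Student" network as a parent map (Figure 2).
# Absence of an edge (e.g., between D and I) encodes independence.
parents = {
    "D": [],          # Difficulty of the exam
    "I": [],          # Intelligence of the student
    "G": ["D", "I"],  # Grade depends on Difficulty and Intelligence
    "S": ["I"],       # SAT score depends on Intelligence only
    "L": ["G"],       # Recommendation Letter depends on Grade only
}

# Roots (variables with no parents) are the independent variables.
roots = sorted(v for v, ps in parents.items() if not ps)
```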
A graph compactly represents the complex relationships between random variables, allowing fast algorithms to make queries where a full enumeration would be prohibitive. In the concepts defined here, we show how directed acyclic graph structures and conditional independence make problems involving large numbers of variables tractable.
A Bayesian network is defined as a model of a system with:
P(D,I,G,S,L) = P(D)P(I)P(G|D,I)P(S|I)P(L|G)
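A minimal sketch of this factorization in Python, using a binarized version of the network. All CPT numbers below are illustrative assumptions, not values from the text.

```python
# Hypothetical CPTs for a binarized student network (illustrative only).
P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {("easy", "low"): {"good": 0.6, "bad": 0.4},
       ("easy", "high"): {"good": 0.9, "bad": 0.1},
       ("hard", "low"): {"good": 0.2, "bad": 0.8},
       ("hard", "high"): {"good": 0.7, "bad": 0.3}}
P_S = {"low": {"good": 0.1, "bad": 0.9}, "high": {"good": 0.8, "bad": 0.2}}
P_L = {"good": {"good": 0.9, "bad": 0.1}, "bad": {"good": 0.2, "bad": 0.8}}

def joint(d, i, g, s, l):
    """P(D,I,G,S,L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G)."""
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_S[i][s] * P_L[g][l]

# The factorized entries sum to 1 over all joint assignments.
total = sum(joint(d, i, g, s, l)
            for d in P_D for i in P_I
            for g in ("good", "bad") for s in ("good", "bad")
            for l in ("good", "bad"))
```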
Bayesian networks help in answering various queries given some data and facts; these reasoning patterns are discussed here.
If evidence is given as, for example, "low intelligence", then what would be the chances of getting a "good letter"? This question, shown in the top-right (first) quadrant of Figure 3, is addressed by causal reasoning, which flows from the top down.
If evidence such as a "bad letter" is given, what would be the chances that the student got a "good grade"? This question, shown in the top-left (second) quadrant of Figure 3, is addressed by evidential reasoning, which flows from the bottom up.
Obtaining interesting patterns from finding a "related cause" is the objective of intercausal reasoning. If evidence of "grade C" and "high intelligence" is given, then what would be the chance of course difficulty being "high"? This type of reasoning is also called "explaining away" as one cause explains the reason for another cause and this is illustrated in the third quadrant, in the bottom-left of Figure 3.
If a student takes an "easy" course and has a "bad letter", what would be the chances of getting a "grade C"? This is explained by queries with combined reasoning patterns. Note that such a query mixes information and does not flow in a single fixed direction as the other reasoning patterns do; it is shown in the bottom-right of the figure, in quadrant 4:
Figure 3. Reasoning patterns
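The causal and evidential queries above can be answered by brute-force enumeration over the joint distribution. The sketch below uses hypothetical CPT numbers (illustrative assumptions, not values from the text) and a helper `posterior` of my own naming.

```python
from itertools import product

# Hypothetical CPTs for a binarized student network (illustrative only).
P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {("easy", "low"): {"good": 0.6, "bad": 0.4},
       ("easy", "high"): {"good": 0.9, "bad": 0.1},
       ("hard", "low"): {"good": 0.2, "bad": 0.8},
       ("hard", "high"): {"good": 0.7, "bad": 0.3}}
P_S = {"low": {"good": 0.1, "bad": 0.9}, "high": {"good": 0.8, "bad": 0.2}}
P_L = {"good": {"good": 0.9, "bad": 0.1}, "bad": {"good": 0.2, "bad": 0.8}}

VALS = {"D": ("easy", "hard"), "I": ("low", "high"), "G": ("good", "bad"),
        "S": ("good", "bad"), "L": ("good", "bad")}

def joint(d, i, g, s, l):
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_S[i][s] * P_L[g][l]

def posterior(query_var, query_val, evidence):
    """P(query_var = query_val | evidence) by full enumeration."""
    num = den = 0.0
    for combo in product(*(VALS[v] for v in "DIGSL")):
        assign = dict(zip("DIGSL", combo))
        if any(assign[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(*combo)
        den += p
        if assign[query_var] == query_val:
            num += p
    return num / den

causal = posterior("L", "good", {"I": "low"})       # top-down reasoning
evidential = posterior("G", "good", {"L": "bad"})   # bottom-up reasoning
```

Full enumeration is exponential in the number of variables; the inference algorithms discussed later exist precisely to avoid it.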
The conditional independencies between the nodes can be exploited to reduce the computations when performing queries. In this section, we will discuss some of the important concepts that are associated with independencies.
Influence is the effect whereby the condition or outcome of one variable changes the value of, or the belief associated with, another variable. We have seen from the reasoning patterns that influence flows between variables in direct relationships (parent/child), along causal/evidential chains (parent and child with intermediates), and in combined structures.
The only case where influence does not flow is when there is a "v-structure". That is, given edges Xi-1 → Xi ← Xi+1 between three variables, there is a v-structure, and no influence flows between Xi-1 and Xi+1 (as long as neither Xi nor any of its descendants is observed). For example, no influence flows between the Difficulty of the course and the Intelligence of the student.
Random variables X and Y are said to be d-separated in graph G given Z if there is no active trail between X and Y in G given Z. It is formally denoted by:
dsepG(X, Y | Z)
The point of d-separation is that it maps exactly to conditional independence between the variables. This gives rise to an interesting property: in a Bayesian network, any variable is independent of its non-descendants given its parents.
In the Student network example, the node/variable Letter is d-separated from Difficulty, Intelligence, and SAT given the grade.
From the d-separations in graph G, we can collect all the independencies, and these independencies are formally represented as:
I(G) = {(X ⊥ Y | Z) : dsepG(X, Y | Z)}
If P satisfies I(G) then we say the G is an independency-map or I-Map of P.
The main point of the I-Map is that it can be formally proven that if P factorizes over G, then the independencies hold, that is, G is an I-Map of P; the converse can also be proved.
In simple terms, one can read in the Bayesian network graph G, all the independencies that hold in the distribution P regardless of any parameters!
Consider the student network—its whole distribution can be shown as:
P(D,I,G,S,L) = P(D)P(I|D)P(G|D,I)P(S|D,I,G)P(L|D,I,G,S)
Now, consider the independence from I-Maps:
P(D,I,G,S,L) = P(D)P(I)P(G|D,I)P(S|I)P(L|G)
Thus, we have shown that I-Map helps in factorization given just the graph network!
The biggest advantage of probabilistic graphical models is their ability to answer probability queries, whether conditional, MAP, or marginal MAP, given some evidence.
Formally, the probability of evidence E = e is given by summing the joint distribution over all assignments to the remaining, unobserved variables X:
P(E = e) = ΣX P(X, E = e)
But the problem has been shown to be NP-Hard (References [3]), or more specifically, #P-complete. This means that it is intractable when there is a large number of variables. Even for a tree-width (the number of variables in the largest clique) of 25, the problem is intractable, and most real-world models have tree-widths larger than this.
So, if the exact inference discussed before is intractable, can some approximation be used so that, within some bound on the error, the problem becomes tractable? It has been shown that even approximate inference with an error ε < 0.5, that is, finding a number ρ such that |P(E = e) − ρ| < ε, is also NP-Hard.
But the good news is that this is among the "worst case" results that show exponential time complexity. In the "general case" there can be heuristics applied to reduce the computation time both for exact and approximate algorithms.
Some of the well-known techniques for performing exact and approximate inferencing are depicted in Figure 4, which covers most probabilistic graph models in addition to Bayesian networks.
It is beyond the scope of this chapter to discuss each of these in detail. We will explain a few of the algorithms in some detail accompanied by references to give the reader a better understanding.
Here we will describe two techniques, the variable elimination algorithm and the clique-tree or junction-tree algorithm.
The basics of the Variable elimination (VE) algorithm lie in the distributive property as shown:
ab + ac + ad = a(b + c + d)
In other words, five arithmetic operations (three multiplications and two additions) can be reduced to three arithmetic operations (one multiplication and two additions) by taking the common factor a out.
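The saving can be checked numerically; the values of a, b, c, and d below are arbitrary.

```python
a, b, c, d = 2.0, 3.0, 5.0, 7.0

# Left side: three multiplications and two additions (five operations).
lhs = a * b + a * c + a * d
# Right side: one multiplication and two additions (three operations).
rhs = a * (b + c + d)

assert lhs == rhs  # the distributive law preserves the result
```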
Let us understand the reduction in computation with a simple example in the student network. Suppose we have to compute a probability query such as the difficulty of the exam given that the letter was good, that is, P(D | L = good).
Using Bayes' theorem:
P(D | L = good) = P(D, L = good)/P(L = good)
To compute P(D, L = good), we can use the chain rule and the joint probability:
If we rearrange the terms on the right-hand side:
Since ΣS P(S | I) = 1 for every value of I, the factor involving S sums out entirely, and we get:
Thus, if we proceed carefully, eliminating one variable at a time, we effectively reduce the O(2^n) terms of the naive summation to O(nk^2) factor computations, where n is the number of variables and k is the number of values each variable takes.
Thus, the main idea of the VE algorithm is to impose an order on the variables such that the query variable comes last. A list of factors is maintained over the ordered variables, and summation is performed one variable at a time. Dynamic programming is generally used in implementations of the VE algorithm (References [4]).
Inputs:
Output:
The algorithm calls the eliminate function in a loop, as shown here:
VariableElimination:
eliminate (F, Z)
Consider the same example of the student network with P(D, L = good) as the goal.
List: P(S|I) P(I) P(D) P(G|I,D) P(L|G) δ(L = good)
List: P(I) P(D) P(G|I,D) P(L|G) δ(L = good) τ1(I)
List: P(D) P(L|G) δ(L = good) τ2(G,D)
List: P(D) τ3(G) τ2(G,D)
List: P(D) τ4(D)
Thus, with two values, P(D=high)τ4(D=high) and P(D=low)τ4(D=low), we get the answer.
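These elimination steps can be sketched directly in Python. The CPT numbers below are illustrative assumptions, and the intermediate factor names tau1 through tau4 are mine, corresponding to the eliminations of S, I, L, and G in turn.

```python
# Hypothetical CPTs (illustrative assumptions, not values from the text).
P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {("easy", "low"): {"good": 0.6, "bad": 0.4},
       ("easy", "high"): {"good": 0.9, "bad": 0.1},
       ("hard", "low"): {"good": 0.2, "bad": 0.8},
       ("hard", "high"): {"good": 0.7, "bad": 0.3}}
P_S = {"low": {"good": 0.1, "bad": 0.9}, "high": {"good": 0.8, "bad": 0.2}}
P_L = {"good": {"good": 0.9, "bad": 0.1}, "bad": {"good": 0.2, "bad": 0.8}}
GRADES = ("good", "bad")

def p_d_given_l_good():
    # Eliminate S: sum_S P(S|I) = 1, so tau1(I) is trivially 1.
    tau1 = {i: sum(P_S[i].values()) for i in P_I}
    # Eliminate I: tau2(G, D) = sum_I P(I) tau1(I) P(G|D, I).
    tau2 = {(g, d): sum(P_I[i] * tau1[i] * P_G[(d, i)][g] for i in P_I)
            for g in GRADES for d in P_D}
    # Eliminate L against the evidence: tau3(G) = P(L=good | G).
    tau3 = {g: P_L[g]["good"] for g in GRADES}
    # Eliminate G: tau4(D) = sum_G tau3(G) tau2(G, D).
    tau4 = {d: sum(tau3[g] * tau2[(g, d)] for g in GRADES) for d in P_D}
    # P(D, L=good) = P(D) tau4(D); normalize to get P(D | L=good).
    unnorm = {d: P_D[d] * tau4[d] for d in P_D}
    z = sum(unnorm.values())
    return {d: p / z for d, p in unnorm.items()}

probs = p_d_given_l_good()
```

At no point does the sketch materialize a factor over more than two variables, which is exactly the saving VE provides over full enumeration.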
The advantages and limitations are as follows:
Junction trees, or clique trees, are more efficient forms of variable elimination-based techniques.
Inputs:
Output:
The steps involved are as follows:
If the preceding graph contains a cycle, all separators in the cycle contain the same variable. Remove the cycle by creating a minimum spanning tree while including the maximum separators. The entire transformation process is shown in Figure 7:
Thus, when the message passing reaches the tree root, the joint probability distribution is completed.
Here we discuss belief propagation, a commonly used message-passing algorithm for performing inference, by introducing factor graphs and the messages that flow in these graphs.
Belief propagation is one of the most practical inference techniques that has applicability across most probabilistic graph models including directed, undirected, chain-based, and temporal graphs. To understand the belief propagation algorithm, we need to first define factor graphs.
We know from basic probability theory that the entire joint distribution can be represented as a product of factors, each over a subset of the variables:
P(X) = Πs fs(Xs)
In a DAG or Bayesian network, fs(Xs) is a conditional distribution. Thus, there is a great advantage in expressing the joint distribution as factors over subsets of the variables.
A factor graph is a representation of the network in which both the variables and the factors involving those variables are made into explicit nodes (References [11]). For the simplified student network from the previous section, the factor graph is shown in Figure 9.
A factor graph is a bipartite graph, that is, it has two types of nodes, variables and factors.
The edges flow between two opposite types, that is, from variables to factors and vice versa.
Converting the Bayesian network to a factor graph is a straightforward procedure as shown previously where you start adding variable nodes and conditional probability distributions as factor nodes. The relationship between the Bayesian network and factor graphs is one-to-many, that is, the same Bayesian network can be represented in many factor graphs and is not unique.
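The conversion can be sketched as building a bipartite adjacency structure. The four-variable simplification below is an assumption for brevity, and the factor-naming convention `f_X` is mine.

```python
# Parent map of a simplified student network (structure from Figure 2;
# restricting to four variables is an illustrative simplification).
parents = {"D": [], "I": [], "G": ["D", "I"], "L": ["G"]}

def to_factor_graph(parents):
    """Build the bipartite factor graph: one factor node per CPD
    P(X | parents(X)), with an edge from each factor to every
    variable in its scope."""
    variables = sorted(parents)
    factors = {"f_" + v: [v] + list(ps) for v, ps in parents.items()}
    edges = [(f, v) for f, scope in factors.items() for v in scope]
    return variables, factors, edges

variables, factors, edges = to_factor_graph(parents)
```

This produces one of the many possible factor graphs for the network; grouping several CPDs into a single factor node would yield another valid one.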
There are two distinct messages that flow in these factor graphs that form the bulk of all computations through communication.
Thus, all the messages from factors coming into the node xm are multiplied, except for the message from the factor that the node is sending to.
Inputs:
Output:
The advantages and limitations are as follows:
We will discuss a simple approach using particles and sampling to illustrate the process of generating the distribution P(X) from the random variables. The idea is to repeatedly sample from the Bayesian network and use the samples with counts to approximate the inferences.
The key idea is to generate i.i.d. samples iterating over the variables using a topological order. In case of some evidence, for example, P(X|E = e) that contradicts the sample generated, the easiest way is to reject the sample and proceed.
Inputs:
Output:
An example of one sample generated for the student network: sample Difficulty and get low; next, sample Intelligence and get high; next, sample Grade using the CPD table for Difficulty=low and Intelligence=high and get Grade=A; sample SAT using the CPD for Intelligence=high and get SAT=good; and finally, use Grade=A to sample Letter and get Letter=good. Thus, we get the first sample (Difficulty=low, Intelligence=high, Grade=A, SAT=good, Letter=good).
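Forward sampling with rejection can be sketched as follows, again with hypothetical CPT numbers (illustrative assumptions, not values from the text).

```python
import random

# Hypothetical CPTs (illustrative assumptions, not values from the text).
P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {("easy", "low"): {"good": 0.6, "bad": 0.4},
       ("easy", "high"): {"good": 0.9, "bad": 0.1},
       ("hard", "low"): {"good": 0.2, "bad": 0.8},
       ("hard", "high"): {"good": 0.7, "bad": 0.3}}
P_S = {"low": {"good": 0.1, "bad": 0.9}, "high": {"good": 0.8, "bad": 0.2}}
P_L = {"good": {"good": 0.9, "bad": 0.1}, "bad": {"good": 0.2, "bad": 0.8}}

def draw(dist, rng):
    """Sample one value from a {value: probability} table."""
    r, acc = rng.random(), 0.0
    for val, p in dist.items():
        acc += p
        if r < acc:
            return val
    return val  # guard against floating-point shortfall

def sample_once(rng):
    """Forward-sample in the topological order D, I, G, S, L."""
    d = draw(P_D, rng)
    i = draw(P_I, rng)
    g = draw(P_G[(d, i)], rng)
    s = draw(P_S[i], rng)
    l = draw(P_L[g], rng)
    return {"D": d, "I": i, "G": g, "S": s, "L": l}

def rejection_estimate(query_var, query_val, evidence, n=20000, seed=0):
    """Estimate P(query | evidence), rejecting contradicting samples."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        sample = sample_once(rng)
        if any(sample[k] != v for k, v in evidence.items()):
            continue  # reject: sample contradicts the evidence
        kept += 1
        hits += sample[query_var] == query_val
    return hits / kept

estimate = rejection_estimate("L", "good", {"I": "low"})
```

Note the weakness the text alludes to: the rarer the evidence, the more samples are rejected, and the estimate degrades accordingly.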
The idea behind learning is to generate either a structure or find parameters or both, given the data and the domain experts.
The goals of learning are as follows:
Learning, in general, can be characterized by Figure 12. The assumption is that there is a known probability distribution P* that may or may not have been generated from a Bayesian network G*. The observed data samples are assumed to be generated or sampled from that known probability distribution P*. The domain expert may or may not be present to include the knowledge or prior beliefs about the structure. Bayesian networks are one of the few techniques where domain experts' inputs in terms of relationships in variables or prior probabilities can be used directly, in contrast to other machine learning algorithms. At the end of the process of knowledge elicitation and learning from data, we get as an output a Bayesian network with defined structure and parameters (CPTs).
Based on data quality (missing or complete data) and knowledge of the structure from the expert (known or unknown), learning in Bayesian networks falls into the following four classes, as shown in Table 2:
Table 2. Classes of Bayesian network learning
In this section, we will discuss two broadly used methodologies for estimating parameters given the structure. We will discuss only the complete-data case; readers can refer to the discussion in (References [8]) for parameter estimation with incomplete data.
Maximum likelihood estimation (MLE) is a very generic method and can be defined as follows: given a dataset D, choose parameters θ that satisfy:
θ̂ = argmaxθ P(D | θ)
Maximum likelihood is the technique of choosing parameters of the Bayesian network given the training data. For a detailed discussion, see (References [6]).
Given the known Bayesian network structure of graph G and the training data D, we want to learn the parameters or CPDs (CPTs, to be precise). This can be formulated as:
Now each example or instance can be written in terms of the variables. If the variables are represented by Xi, and the parents of each are given by parentXi, then:
Interchanging the variables and instances:
Each such term is the conditional likelihood of a particular variable Xi given its parents parentXi. The parameters for these conditional likelihoods are the subset θXi|parentXi of θ. Thus:
Here, each factored term is called the local likelihood function. This is very important, as the total likelihood decomposes into independent local likelihood terms; this is known as the global decomposition property of the likelihood function. For a tabular CPD, these local likelihood functions can be decomposed further by simply counting the different outcomes in the training data.
Let Nijk be the number of times we observe the variable or node i in state k with the parent configuration j.
For example, we can estimate a simple entry corresponding to Xi = a and parentXi = b from the training data as the ratio of counts:
θ̂ = N[a, b]/N[b]
where N[a, b] is the number of instances satisfying both values and N[b] is the number satisfying parentXi = b.
Consider two cases as an example. In the first, Xi = a is observed in 10 instances out of the 100 in which parentXi = b; in the second, in 100 instances out of 1,000. Notice that both probabilities come to the same value (0.1), whereas the second is backed by 10 times more data and so is the "more likely" estimate! Similarly, familiarity with the domain, prior knowledge, or the lack of it due to uncertainty, is not captured by MLE. Thus, when the number of samples is limited, or when domain experts are aware of the priors, this method suffers from serious issues.
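The count-based MLE for a tabular CPD can be sketched as follows; the data mirrors the first case above (10 of 100 parent-matching instances have the value `a`), and all names and numbers are illustrative.

```python
from collections import Counter

# Toy complete data: (parent value, child value) pairs, hypothetical.
data = [("b", "a")] * 10 + [("b", "other")] * 90

def mle_cpt(pairs):
    """Count-based MLE: theta[x | u] = N(u, x) / N(u)."""
    joint_counts = Counter(pairs)
    parent_counts = Counter(u for u, _ in pairs)
    return {(u, x): n / parent_counts[u]
            for (u, x), n in joint_counts.items()}

theta = mle_cpt(data)
```

Running the same function on 100 of 1,000 instances would return the identical 0.1, which is exactly the weakness described above.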
This technique overcomes the issues of MLE by encoding prior knowledge about the parameter θ as a probability distribution. Thus, we can encode our beliefs or prior knowledge about the parameter space as a probability distribution, and the joint distribution of variables and parameters is then used in estimation.
Let us consider single-variable parameter learning, where we have instances x[1], x[2], …, x[M] that all share the parameter θX.
Thus, the network is a joint probability model over the parameters and the data. The advantage is that we can use it to obtain the posterior distribution:
P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ)P(θ)/P(x[1], …, x[M]), where P(θ) is the prior.
Thus, the difference between the maximum likelihood and Bayesian estimation is the use of the priors.
Generalizing it to a Bayesian network G given the dataset D:
If we assume global independence of parameters:
P(θ) = Πi P(θXi|parentXi)
Thus, we get:
P(θ | D) = Πi P(θXi|parentXi | D)
Again, as before, the subset θXi|parentXi of θ is local, and thus the entire posterior can be computed in local terms!
Often, in practice, a continuous probability distribution known as the Dirichlet distribution, which generalizes the Beta distribution, is used to represent priors over the parameters.
Probability density function:
P(θij) = (1/B(αij)) Πk θijk^(αijk − 1)
Here, the α terms are known as hyperparameters, with αijk > 0. The sum αij = Σk αijk is the pseudo count, also known as the equivalent sample size, and it gives us a measure of the strength of the prior.
The Beta function, B(αij), is normally expressed in terms of the gamma function as follows:
B(αij) = Πk Γ(αijk)/Γ(Σk αijk)
The advantage of using Dirichlet distribution is it is conjugate in nature, that is, irrespective of the likelihood, the posterior is also a Dirichlet if the prior is Dirichlet!
It can be shown that the posterior distribution for the parameters θijk is a Dirichlet with updated hyperparameters and has a closed-form solution:
α′ijk = αijk + Nijk
If we use the maximum a posteriori estimate or the posterior mean, they can be shown to have closed forms; the posterior mean, for example, is:
θ̂ijk = (αijk + Nijk)/Σk (αijk + Nijk)
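A sketch of the posterior-mean estimate under a Dirichlet prior, using the same 10-of-100 counts as the MLE example and a uniform pseudo count of 1 per state (an illustrative equivalent sample size of 2; all names and numbers are assumptions).

```python
def bayes_estimate(counts, alphas):
    """Posterior-mean estimate under a Dirichlet prior:
    theta_k = (alpha_k + N_k) / sum_j (alpha_j + N_j)."""
    total = sum(alphas[k] + counts[k] for k in counts)
    return {k: (alphas[k] + counts[k]) / total for k in counts}

# 10 of 100 observations in state "a"; pseudo count of 1 per state.
counts = {"a": 10, "other": 90}
alphas = {"a": 1, "other": 1}
theta = bayes_estimate(counts, alphas)
```

With more data the pseudo counts are swamped and the estimate approaches the MLE; with little data the prior dominates, which is the behavior MLE lacks.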
Learning Bayesian network without any domain knowledge or understanding of structures includes learning the structure and the parameters. We will first discuss some measures that are used for evaluating the network structures and then discuss a few well-known algorithms for building optimal structures.
The measures used to evaluate a Bayesian network structure, given the dataset, can be broadly divided into the following categories; details of many of them are available in (References [14]).
Given the dataset D of M samples and two variables Xi and Xj, Pearson's chi-squared statistic measuring the deviance from independence is:
dχ2(D) = Σxi,xj (M[xi, xj] − M·P̂(xi)·P̂(xj))^2 / (M·P̂(xi)·P̂(xj))
where M[xi, xj] is the number of samples with those two values and P̂ denotes the empirical marginal.
dχ2(D) is 0 when the variables are independent; larger values indicate that there is a dependency between the variables.
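A minimal sketch of the statistic over an observed contingency table; the table layout and counts are illustrative assumptions.

```python
def chi_squared(table):
    """Pearson chi-squared over a contingency table {(xi, xj): count};
    returns 0 exactly when the empirical joint factorizes."""
    M = sum(table.values())
    xi, xj = {}, {}
    for (a, b), n in table.items():
        xi[a] = xi.get(a, 0) + n  # marginal counts for Xi
        xj[b] = xj.get(b, 0) + n  # marginal counts for Xj
    stat = 0.0
    for a in xi:
        for b in xj:
            expected = xi[a] * xj[b] / M
            observed = table.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

independent = {(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25}
dependent = {(0, 0): 45, (0, 1): 5, (1, 0): 5, (1, 1): 45}
```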
The Kullback-Leibler (mutual information) based deviance measure is:
dI(D) = Σxi,xj P̂(xi, xj) log [P̂(xi, xj)/(P̂(xi)·P̂(xj))]
dI(D) is again 0 when the variables are independent, and larger values indicate dependency. Using statistical hypothesis tests, a threshold can be chosen to determine significance.
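The empirical mutual information can be sketched the same way; the contingency tables are illustrative assumptions.

```python
import math

def mutual_information(table):
    """Empirical mutual information between Xi and Xj from a
    contingency table {(xi, xj): count}; 0 when the empirical
    joint equals the product of its marginals."""
    M = sum(table.values())
    xi, xj = {}, {}
    for (a, b), n in table.items():
        xi[a] = xi.get(a, 0) + n
        xj[b] = xj.get(b, 0) + n
    mi = 0.0
    for (a, b), n in table.items():
        if n == 0:
            continue  # 0 * log(0) contributes nothing
        p_ab = n / M
        mi += p_ab * math.log(p_ab / ((xi[a] / M) * (xj[b] / M)))
    return mi

independent = {(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25}
dependent = {(0, 0): 45, (0, 1): 5, (1, 0): 5, (1, 1): 45}
```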
The penalty term is logarithmic in M, so as the number of samples increases, the penalty becomes relatively less severe for complex structures.
The Akaike information criterion (AIC) score uses a similar penalty-based scoring and is:
scoreAIC(G : D) = l(θ̂G : D) − Dim(G)
where l is the log-likelihood and Dim(G) is the number of independent parameters in G.
Bayesian scores discussed in parameter learning are also employed as scoring measures.
We will discuss a few algorithms that are used for learning structures in this section; details can be found here (References [15]).
Constraint-based algorithms use independence tests of various variables, trying to find different structural dependencies that we discussed in previous sections such as the d-separation, v-structure, and so on, by following the step-by-step process discussed here.
The input is the dataset D with all the variables {X, Y, ...} known for every instance {1, 2, ..., m} and no missing values. The output is a Bayesian network graph G with all edges and directions known in E, together with the CPT tables.
The search-and-score method can be seen as heuristic optimization: the structure is changed iteratively through small perturbations, and measures such as BIC or MLE are used to score the candidate structures in search of the optimal score and structure. Hill climbing, depth-first search, genetic algorithms, and so on have all been used to search and score.
Input is dataset D with all the variables {X,Y..} known for every instance {1,2, ... m} and no missing values. The output is a Bayesian network graph G with all edges and directions known in E.
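A minimal search-and-score sketch, assuming a toy dataset over two binary variables and a search space of just two candidate structures (no edge versus X → Y) scored with BIC; the dataset, parameter counts, and names are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy complete data over two binary variables; X strongly predicts Y.
data = [(0, 0)] * 40 + [(1, 1)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10

def log_likelihood(data, with_edge):
    """Log-likelihood under X -> Y (with_edge=True) or no edge."""
    M = len(data)
    cx = Counter(x for x, _ in data)
    ll = sum(n * math.log(n / M) for n in cx.values())  # P(X) terms
    if with_edge:
        cxy = Counter(data)
        ll += sum(n * math.log(n / cx[x]) for (x, y), n in cxy.items())
    else:
        cy = Counter(y for _, y in data)
        ll += sum(n * math.log(n / M) for n in cy.values())
    return ll

def bic(data, with_edge):
    """BIC: log-likelihood minus (log M / 2) * number of free parameters."""
    dim = 1 + (2 if with_edge else 1)  # P(X): 1; P(Y|X): 2 or P(Y): 1
    return log_likelihood(data, with_edge) - 0.5 * math.log(len(data)) * dim

# Score both candidate structures and keep the better one.
best = max([False, True], key=lambda e: bic(data, e))
```

With more variables the candidate space explodes, which is why hill climbing and similar heuristics are applied instead of exhaustive scoring.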