How to do it...

In this recipe, we will work with a dataset containing several features for the employees of a certain company: the area where they live, their sleep quality, whether they recently had a child, their diet quality, their travel time, and their performance. In this example, every feature has two levels, but that is not a requirement:

First, we load the dataset, define the network, and plot it. Several network structures would be plausible, but this one is a reasonable starting point. Note that performance (probably the most relevant variable for us here) is directly affected by two variables: diet_quality and travel_time. Presumably, people who must travel more are more tired and perform worse at work; likewise, people who are not eating well may feel too tired to work well. Both variables depend in turn on other variables:

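library(bnlearn)
# bnlearn needs factors; since R 4.0, read.csv() no longer creates them by default
data = read.csv("./employee_data.csv", stringsAsFactors = TRUE)[-1]
# Build the model string in pieces so that it contains no line break or stray space
dag = model2network(paste0(
  "[Area][travel_time|Area][performance|travel_time:diet_quality]",
  "[Recently_had_child][Sleep_quality|Recently_had_child:Area][diet_quality|Sleep_quality]"))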
plot(dag)

This plot shows how the different variables are connected:
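
Besides plotting it, the structure can also be checked programmatically. The following is a minimal sketch that rebuilds the same network and lists the parents of performance, the children of Area, and all the arcs. It uses the bnlearn functions parents(), children(), and arcs(), and assumes only that bnlearn is installed:

```r
library(bnlearn)

# Same structure as in the recipe, built without a line break in the string
dag = model2network(paste0(
  "[Area][travel_time|Area][performance|travel_time:diet_quality]",
  "[Recently_had_child][Sleep_quality|Recently_had_child:Area]",
  "[diet_quality|Sleep_quality]"))

perf_parents = parents(dag, "performance")   # nodes with an arc into performance
area_children = children(dag, "Area")        # nodes that Area points to
all_arcs = arcs(dag)                         # two-column matrix: from, to

print(perf_parents)
print(area_children)
print(all_arcs)
```

A quick check like this is a cheap way to catch a typo in the model string before fitting.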

We now fit the model using two arguments: the structure that we specified and the data. The column names of the data frame must match the node names of the model exactly, and the data frame cannot contain any extra variables:

fitted = bn.fit(dag, data)
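
The fitted object stores one conditional probability table per node. The sketch below checks that the column names match the node names and then extracts the table for performance. Because employee_data.csv is not available here, it uses a made-up stand-in dataset of random two-level factors; the level names (such as YES and NO) are assumptions, and the resulting probabilities are meaningless:

```r
library(bnlearn)

# Made-up stand-in for employee_data.csv: random two-level factors,
# only so that the snippet runs on its own (level names are assumptions)
set.seed(1)
two = function(a, b) factor(sample(c(a, b), 500, replace = TRUE))
data = data.frame(
  Area = two("URBAN", "SUBURBAN"), Recently_had_child = two("YES", "NO"),
  Sleep_quality = two("HIGH", "LOW"), diet_quality = two("HIGH", "LOW"),
  travel_time = two("HIGH", "LOW"), performance = two("HIGH", "LOW"))

dag = model2network(paste0(
  "[Area][travel_time|Area][performance|travel_time:diet_quality]",
  "[Recently_had_child][Sleep_quality|Recently_had_child:Area]",
  "[diet_quality|Sleep_quality]"))

# Column names and node names must be the same set
stopifnot(setequal(names(data), nodes(dag)))

fitted = bn.fit(dag, data)

# Conditional probability table of performance given its two parents
cpt = coef(fitted$performance)
print(cpt)
```

Each column of the table (one per combination of parent levels) sums to 1.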

Once the model has been fitted, we can execute queries, which are essentially predictions for conditional probabilities. For example, let's see what the predicted probabilities are that an employee has a HIGH performance level given that they live in an URBAN or a SUBURBAN area. Note that cpquery() estimates these probabilities by random sampling, so the results vary slightly from run to run:

cpquery(fitted, (performance=="HIGH"), (Area=="URBAN"))
cpquery(fitted, (performance=="HIGH"), (Area=="SUBURBAN"))

The following screenshot shows the result:
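
Since cpquery() works by simulation, it can help to fix the random seed and to raise the number of samples through the n argument. The sketch below illustrates this on a made-up stand-in dataset (random factors with assumed level names, not the real employee data), so the numbers it prints carry no meaning:

```r
library(bnlearn)

# Made-up stand-in for employee_data.csv (random factors, assumed level names)
set.seed(1)
two = function(a, b) factor(sample(c(a, b), 500, replace = TRUE))
data = data.frame(
  Area = two("URBAN", "SUBURBAN"), Recently_had_child = two("YES", "NO"),
  Sleep_quality = two("HIGH", "LOW"), diet_quality = two("HIGH", "LOW"),
  travel_time = two("HIGH", "LOW"), performance = two("HIGH", "LOW"))

dag = model2network(paste0(
  "[Area][travel_time|Area][performance|travel_time:diet_quality]",
  "[Recently_had_child][Sleep_quality|Recently_had_child:Area]",
  "[diet_quality|Sleep_quality]"))
fitted = bn.fit(dag, data)

set.seed(42)  # fix the random draws so that the estimates are reproducible
p_default = cpquery(fitted, (performance == "HIGH"), (Area == "URBAN"))
# More samples give a more stable estimate at the cost of run time
p_large = cpquery(fitted, (performance == "HIGH"), (Area == "URBAN"), n = 10^6)

print(c(default = p_default, large = p_large))
```

If two runs of the same query disagree in the second decimal place, a larger n is usually the first thing to try.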

We do a similar exercise, now querying the probability that performance is HIGH given that travel_time is HIGH and Sleep_quality is HIGH, and then given that travel_time is HIGH and Sleep_quality is LOW. These are predictive queries:

cpquery(fitted, (performance=="HIGH"), (travel_time=="HIGH" & Sleep_quality=="HIGH"))
cpquery(fitted, (performance=="HIGH"), (travel_time=="HIGH" & Sleep_quality=="LOW"))

The following screenshot shows the results of the query:

A different kind of query reasons from effect to cause: given that someone's performance is HIGH, what is the probability that the person is sleeping well, or that they are not?

cpquery(fitted, (Sleep_quality=="HIGH"), (performance=="HIGH"))
cpquery(fitted, (Sleep_quality=="LOW"), (performance=="HIGH"))

Take a look at the following screenshot:
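
Both queries condition on the same evidence, and Sleep_quality has only two levels, so the two results should add up to 1 (up to sampling noise). Queries of this kind are applications of Bayes' rule. The arithmetic can be sketched in base R with made-up numbers; the prior and the likelihoods below are purely illustrative and are not taken from the employee data:

```r
# Made-up numbers, for illustration only (not taken from the employee data)
prior = c(HIGH = 0.6, LOW = 0.4)        # P(Sleep_quality)
likelihood = c(HIGH = 0.7, LOW = 0.4)   # P(performance == HIGH | Sleep_quality)

joint = prior * likelihood              # P(Sleep_quality, performance == HIGH)
posterior = joint / sum(joint)          # P(Sleep_quality | performance == HIGH)

print(round(posterior, 3))
```

With these numbers, the posterior probability of HIGH sleep quality is 0.42 / 0.58, roughly 0.724, and the two posterior values sum to 1.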

We can plot the results for each node. For example, let's see how the conditional probabilities for diet_quality change according to its parent node (Sleep_quality). The orange columns refer to each Sleep_quality level, and the rows refer to each diet_quality level:

bn.fit.dotplot(fitted$diet_quality)

The following screenshot shows the conditional probabilities for diet_quality:

We have already stated that there are essentially two ways of building Bayesian networks: the expert approach (which is what we have used so far, specifying how the nodes are connected), and the automatic way. The latter relies on several sophisticated algorithms to estimate the best structure. We can do this using the hc() function, which performs a hill-climbing search over candidate structures. Unfortunately, when we have lots of variables, it is very difficult not to rely on this fully automatic approach (the maxp= parameter specifies the maximum number of parents that a node can have):

dag2 = hc(data, maxp=2)
plot(dag2)

This plot shows how the different variables are connected:
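
One way to compare the expert structure with the automatic one is to score both against the same data. The sketch below uses bnlearn's score() function with the BIC score, which in bnlearn is scaled so that higher values are better. It runs on a made-up stand-in dataset (random factors with assumed level names), so the printed scores say nothing about the real employee data:

```r
library(bnlearn)

# Made-up stand-in for employee_data.csv (random factors, assumed level names)
set.seed(1)
two = function(a, b) factor(sample(c(a, b), 500, replace = TRUE))
data = data.frame(
  Area = two("URBAN", "SUBURBAN"), Recently_had_child = two("YES", "NO"),
  Sleep_quality = two("HIGH", "LOW"), diet_quality = two("HIGH", "LOW"),
  travel_time = two("HIGH", "LOW"), performance = two("HIGH", "LOW"))

# Expert structure from the recipe
dag = model2network(paste0(
  "[Area][travel_time|Area][performance|travel_time:diet_quality]",
  "[Recently_had_child][Sleep_quality|Recently_had_child:Area]",
  "[diet_quality|Sleep_quality]"))

# Structure learned automatically by hill climbing
dag2 = hc(data, maxp = 2)

bic_expert = score(dag, data, type = "bic")
bic_auto = score(dag2, data, type = "bic")
print(c(expert = bic_expert, automatic = bic_auto))
```

A higher score only means a better statistical fit after penalizing complexity; it does not guarantee that the arcs make sense from a domain point of view.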

The automatic model that we get doesn't make any sense at all. The beautiful thing about BNs is that we can build hybrid structures, which combine some expert knowledge with an automatic approach. We have two interesting parameters: the blacklist= parameter specifies which connections we don't want to have, and the whitelist= parameter specifies which connections we want to force into the network. The automatic algorithm then completes the structure around the connections that we specify in whitelist.

Once we have the structure, we can fit it as usual:

whitelist = data.frame(from=c("travel_time", "diet_quality"), to=c("performance", "performance"))
dag2 = hc(data, maxp=2, whitelist=whitelist)
plot(dag2)
fitted2 = bn.fit(dag2, data)

This plot shows how the different variables are connected:
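
The blacklist= parameter works in the same way as whitelist=. The sketch below combines both; the blacklisted arcs are a hypothetical expert constraint (performance should not influence where people live or whether they had a child), and the data is a made-up stand-in with assumed level names rather than the real employee data:

```r
library(bnlearn)

# Made-up stand-in for employee_data.csv (random factors, assumed level names)
set.seed(1)
two = function(a, b) factor(sample(c(a, b), 500, replace = TRUE))
data = data.frame(
  Area = two("URBAN", "SUBURBAN"), Recently_had_child = two("YES", "NO"),
  Sleep_quality = two("HIGH", "LOW"), diet_quality = two("HIGH", "LOW"),
  travel_time = two("HIGH", "LOW"), performance = two("HIGH", "LOW"))

# Arcs that must be present (same as in the recipe)
whitelist = data.frame(from = c("travel_time", "diet_quality"),
                       to = c("performance", "performance"))
# Hypothetical constraint: arcs that must never be present
blacklist = data.frame(from = c("performance", "performance"),
                       to = c("Area", "Recently_had_child"))

dag3 = hc(data, maxp = 2, whitelist = whitelist, blacklist = blacklist)

# Helper to test whether a given arc is in the learned structure
a = arcs(dag3)
has_arc = function(f, t) any(a[, "from"] == f & a[, "to"] == t)

print(has_arc("travel_time", "performance"))  # whitelisted arc
print(has_arc("performance", "Area"))         # blacklisted arc
```

Whitelisted arcs are always kept and blacklisted arcs are never added, so the search only decides on the remaining connections.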
