How to do it...

In the following example, we will generate a synthetic dataset containing three variables that are relevant and three variables that are irrelevant. We would expect our variable selection routine to flag the latter as irrelevant.

This is a good example to introduce JAGS, which is an alternative to STAN. JAGS is not as sophisticated as STAN, and in some situations, such as when the posterior densities are correlated, it is not as efficient. Nevertheless, JAGS is even easier to use and can accommodate complicated models in just a few lines. It should be noted that JAGS has a less declarative syntax: there is no need to declare the types of variables and parameters:

  1. We first generate our synthetic dataset. It will have three relevant features and three irrelevant ones:
# Three relevant predictors
v1_1 = rnorm(1000,10,1)
v1_2 = rnorm(1000,10,1)
v1_3 = rnorm(1000,10,1)
# Three irrelevant predictors: they do not enter the formula for Y
v2_1 = rnorm(1000,10,1)
v2_2 = rnorm(1000,10,1)
v2_3 = rnorm(1000,10,1)
# Gaussian noise
U = rnorm(1000,0,3)
Y = v1_1 + v1_2 + v1_3 + U
# Bundle the data into the list that JAGS expects
lista = list(x=cbind(v1_1,v1_2,v1_3,v2_1,v2_2,v2_3),y=Y,n=length(Y))
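
Before moving on, a quick sanity check (an optional sketch, not part of the original recipe) confirms the structure of the data: an ordinary least squares fit should recover coefficients close to 1 for the three relevant variables and close to 0 for the irrelevant ones:
fit <- lm(Y ~ v1_1 + v1_2 + v1_3 + v2_1 + v2_2 + v2_3)
# The v1_* coefficients should be near 1; the v2_* ones near 0
summary(fit)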
  2. We load JAGS, as shown in the following code snippet:
library('rjags')

  3. We define our model. For each observation, we compute indicator * coefficient * variable for every predictor and sum the results. The intuition behind this approach is that, if a variable is irrelevant, the posterior distribution of its indicator variable will be concentrated around zero. Our prior for each indicator is a Bernoulli(0.5), meaning that, a priori, each variable has a 50% probability of being excluded:
mod <- " model{
for (i in 1:n){
mu[i] = id[1]*b[1]*x[i,1] + id[2]*b[2]*x[i,2] + id[3]*b[3]*x[i,3] + id[4]*b[4]*x[i,4] + id[5]*b[5]*x[i,5] + id[6]*b[6]*x[i,6]
y[i] ~ dnorm(mu[i], prec)
}
for (i in 1:6){
b[i] ~ dnorm(0.0, 1/2)
id[i] ~ dbern(0.5)
}
prec ~ dgamma(1, 2)
}"
  4. We compile the model and pass the data to it. We use 100 iterations for adaptation, which means that JAGS will use them to tune the sampler's acceptance rate accordingly:
jags <- jags.model(textConnection(mod), data = lista, n.chains = 1, n.adapt = 100)
  5. Next, we run 5,000 burn-in iterations (update advances the chain without recording samples):
update(jags, 5000)

We then extract 1,000 posterior samples for the b coefficients and the Bernoulli indicator variables:

samps <- coda.samples(jags, c("b", "id"), n.iter = 1000)
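
Before summarizing, it is worth checking that the chain has mixed reasonably well. An optional sketch using coda's standard diagnostics (the coda package is loaded automatically with rjags):
effectiveSize(samps)  # effective sample size per parameter; very small values indicate poor mixing
plot(samps)           # trace and density plots for a visual inspection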
  6. We print a summary. As we can see, the first three Bernoulli indicators have a posterior mean around one, which makes sense because their variables are relevant. The last three indicators have a posterior mean around zero. Consequently, we should rerun this model without the variables associated with those last three indicators (and obviously without any of the Bernoulli indicators, as they were only used for variable selection); a sketch of such a reduced model is given at the end of this recipe:
summary(samps)

The summary output shows the posterior density statistics; note the contrast between the first three and the last three indicator variables.
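
The posterior means of the id parameters can also be read off programmatically as inclusion probabilities; a minimal sketch:
post_means <- colMeans(as.matrix(samps))
# The id[...] entries give the proportion of draws in which each variable was included
post_means[grep("^id", names(post_means))]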

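Finally, as suggested in the last step, we can rerun the model keeping only the three relevant predictors and dropping the indicators. The following is a sketch of what that reduced model could look like (the mod_reduced, lista_reduced, jags_reduced, and samps_reduced names are illustrative, not from the original recipe):
mod_reduced <- "model{
  for (i in 1:n){
    mu[i] <- b[1]*x[i,1] + b[2]*x[i,2] + b[3]*x[i,3]
    y[i] ~ dnorm(mu[i], prec)
  }
  for (i in 1:3){
    b[i] ~ dnorm(0.0, 1/2)
  }
  prec ~ dgamma(1, 2)
}"
lista_reduced <- list(x = cbind(v1_1, v1_2, v1_3), y = Y, n = length(Y))
jags_reduced <- jags.model(textConnection(mod_reduced), data = lista_reduced, n.chains = 1, n.adapt = 100)
update(jags_reduced, 5000)
samps_reduced <- coda.samples(jags_reduced, "b", n.iter = 1000)
summary(samps_reduced)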