The Bayesian logistic regression model

The name logistic regression comes from the fact that the model relates the probability of the response to a linear function of the explanatory variables through the logistic function. It is one of the most widely used models for problems where the response is a binary variable (for example, fraud or not fraud, click or no click, and so on).

A logistic function is defined by the following equation:

$$f(y) = \frac{1}{1 + e^{-y}}$$

It has the particular feature that, as $y$ varies from $-\infty$ to $\infty$, the function value varies from 0 to 1. Hence, the logistic function is ideal for modeling any binary response as the input signal is varied.

The inverse of the logistic function is called logit. It is defined as follows:

$$y = \ln\!\left(\frac{f}{1 - f}\right)$$
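
As an aside, both functions are available in base R: plogis() computes the logistic function and qlogis() computes the logit. A quick check:

>y <- seq(-6, 6, by = 0.1)
>f <- 1/(1 + exp(-y))      # the logistic function computed by hand
>all.equal(f, plogis(y))   # TRUE: plogis() is the logistic function
>all.equal(qlogis(f), y)   # TRUE: qlogis() is its inverse, the logit
>plot(y, f, type = "l")    # S-shaped curve rising from 0 to 1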

In logistic regression, y is treated as a linear function of explanatory variables X. Therefore, the logistic regression model can be defined as follows:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-\sum_{i} \beta_i \phi_i(X)}}$$

Here, $\phi_i(X)$ is the set of basis functions and $\beta_i$ are the model parameters, as explained in the case of linear regression in Chapter 4, Machine Learning Using Bayesian Inference. From the definition of GLM in Chapter 5, Bayesian Regression Models, one can immediately recognize that logistic regression is a special case of GLM with the logit function as the link function.
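
To make the GLM connection concrete, the classical (non-Bayesian) version of this model can be fit with R's built-in glm() function and the binomial family, whose canonical link is the logit. A minimal sketch using the built-in mtcars data (chosen here purely for illustration):

># ordinary maximum-likelihood logistic regression, shown only for comparison
>fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial(link = "logit"))
>summary(fit)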

Bayesian treatment of logistic regression is more difficult than that of linear regression. Here, the likelihood function is a product of logistic functions, one for each data point, and to compute the posterior one has to normalize this product multiplied by the prior (to obtain the denominator of the Bayes formula). One approach is to use the Laplace approximation, as explained in Chapter 3, Introducing Bayesian Inference. Readers might recall that in the Laplace approximation, the posterior is approximated as a Gaussian (normal) distribution centered at the maximum of the posterior. This is achieved by first finding the maximum a posteriori (MAP) solution and then computing the second derivative (Hessian) of the negative log posterior at the MAP solution. Interested readers can find the details of the Laplace approximation for logistic regression in the paper by D.J.C. MacKay (reference 2 in the References section of this chapter).
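For intuition, here is a minimal sketch of the Laplace approximation for logistic regression, assuming a standard normal prior on the coefficients and using the built-in mtcars data as a stand-in; it is a simplified illustration, not the full treatment of reference 2:

># negative log posterior: logistic negative log likelihood plus N(0, I) prior term
>neg.log.post <- function(beta, X, y) {
+   eta <- as.vector(X %*% beta)
+   -sum(y * eta - log(1 + exp(eta))) + 0.5 * sum(beta^2)
+ }
>X <- cbind(1, mtcars$mpg, mtcars$wt)
>y <- mtcars$am
>opt <- optim(rep(0, ncol(X)), neg.log.post, X = X, y = y, method = "BFGS", hessian = TRUE)
>beta.map <- opt$par           # the MAP estimate: center of the Gaussian approximation
>Sigma <- solve(opt$hessian)   # inverse Hessian: covariance of the Gaussian approximation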

Instead of using an analytical approximation, Polson and Scott recently proposed a fully Bayesian treatment of this problem using a data augmentation strategy based on Pólya-Gamma latent variables (reference 3 in the References section of this chapter). The authors have implemented their method in the R package BayesLogit. We will use this package to illustrate Bayesian logistic regression in this chapter.

The BayesLogit R package

The package can be downloaded from the CRAN website at http://cran.r-project.org/web/packages/BayesLogit/index.html. The package contains the logit function that can be used to perform a Bayesian logistic regression. The syntax for calling this function is as follows:

>logit(Y, X, n = rep(1, length(Y)), m0 = rep(0, ncol(X)), P0 = matrix(0, nrow = ncol(X), ncol = ncol(X)), samp = 1000, burn = 500)

Here, Y is an N-dimensional vector containing the response values; X is an N x P dimensional matrix containing the values of the independent variables; n is an N-dimensional vector giving the number of trials for each observation (all ones, the default, for a binary response); m0 is a P-dimensional prior mean; and P0 is a P x P dimensional prior precision matrix. The other two arguments are MCMC simulation parameters: samp denotes the number of MCMC samples saved, and burn denotes the number of MCMC samples discarded at the beginning of the run before saving.
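For example, one can replace the flat default prior (zero precision) with a weakly informative Gaussian prior by passing a nonzero precision matrix; the values below are illustrative choices, not recommendations:

>m0 <- rep(0, ncol(X))      # prior mean zero for every coefficient
>P0 <- diag(0.01, ncol(X))  # prior precision 0.01, that is, prior variance 100 per coefficient
>blmodel <- logit(Y, X, n = rep(1, length(Y)), m0 = m0, P0 = P0, samp = 1000, burn = 500)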

The dataset

To illustrate Bayesian logistic regression, we use the Parkinsons dataset from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/Parkinsons). The dataset was used by Little et al. to detect Parkinson's disease by analyzing voice disorders (reference 4 in the References section of this chapter). The dataset consists of voice measurements from 31 people, of whom 23 have Parkinson's disease. There are 195 rows in total, each corresponding to one voice recording; multiple recordings were taken from each individual. The measurements can be grouped into the following sets:

  • The vocal fundamental frequency
  • Jitter
  • Shimmer
  • The ratio of noise to tonal components
  • The nonlinear dynamical complexity measures
  • The signal fractal scaling exponent
  • The nonlinear measures of fundamental frequency variation

In total, there are 22 numerical attributes.

Preparation of the training and testing datasets

Before we can train the Bayesian logistic model, we need to do some preprocessing of the data. The dataset contains multiple measurements from the same individual. To keep recordings from the same person out of both sets, we split by individual: a random subset of individuals is sampled, and all of their observations form the test set, while the observations of the remaining individuals form the training set. We also need to separate the dependent variable (class label Y) from the independent variables (X). The following R code does this job:

>#install.packages("BayesLogit") # one-time installation of the package
>library(BayesLogit)
>PDdata <- read.table("parkinsons.csv", sep = ",", header = TRUE, row.names = 1)
>rnames <- row.names(PDdata)
>cnames <- colnames(PDdata, do.NULL = TRUE, prefix = "col")
>colnames(PDdata)[17] <- "y"  # the 17th column is 'status': 1 = Parkinson's, 0 = healthy
># y is kept as a numeric 0/1 variable because logit() expects a numeric response

>rnames.strip <- substr(rnames,10,12)  # extract the subject ID (e.g., S01) from names like phon_R01_S01_1
>PDdata1 <- cbind(PDdata,rnames.strip)
>rnames.unique <- unique(rnames.strip)
>set.seed(123)
>samp <- sample(rnames.unique, as.integer(length(rnames.unique)*0.2), replace = F) # 20% of subjects for testing
>PDtest <- PDdata1[PDdata1$rnames.strip %in% samp, -24]     # -24 removes the subject ID column
>PDtrain <- PDdata1[!(PDdata1$rnames.strip %in% samp), -24] # -24 removes the subject ID column
>xtrain <- PDtrain[,-17]
>ytrain <- PDtrain[,17]
>xtest <- PDtest[,-17]
>ytest <- PDtest[,17]

Using the Bayesian logistic model

We can use xtrain and ytrain to train the Bayesian logistic regression model using the logit() function:

>blmodel <- logit(ytrain, xtrain, n = rep(1, length(ytrain)), m0 = rep(0, ncol(xtrain)), P0 = matrix(0, nrow = ncol(xtrain), ncol = ncol(xtrain)), samp = 1000, burn = 500)

The summary() function will give a high-level summary of the fitted model:

>summary(blmodel)
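
The returned object also contains the matrix of posterior draws in blmodel$beta, with one row per saved MCMC sample and one column per coefficient, so the coefficients can be summarized directly; for example:

>beta.mean <- colMeans(blmodel$beta)  # posterior mean of each coefficient
>beta.ci <- apply(blmodel$beta, 2, quantile, probs = c(0.025, 0.975))  # 95% credible intervals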

To predict the values of Y for a new dataset, we need to write a custom script, as follows:

>psi <- blmodel$beta %*% t(xtest)   # samp x n matrix of linear predictor draws
>p   <- exp(psi) / (1 + exp(psi))   # logistic transform: posterior draws of P(y = 1)
>ypred.bayes <- colMeans(p)         # posterior mean probability for each test observation

The prediction error can be computed by thresholding the posterior mean probabilities at 0.5 and comparing the resulting class labels with the actual values of Y present in ytest:

>table(ypred.bayes > 0.5, ytest)  # confusion matrix on the test set
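
From the same thresholded predictions, an overall misclassification rate follows directly:

>ypred.class <- as.numeric(ypred.bayes > 0.5)  # hard class labels from the posterior mean probabilities
>mean(ypred.class != ytest)                    # fraction of misclassified test observations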

One can plot the ROC curve using the pROC package as follows:

>library(pROC)
>roc(ytest, ypred.bayes, plot = T)

[Figure: ROC curve for the Bayesian logistic regression model]

The ROC curve has an AUC of 0.942, suggesting good classification accuracy. Again, the model is presented here for illustration purposes and has not been tuned for maximum performance.
