Predicting heart disease

We'll put logistic regression for the binary classification task to the test with a real-world data set from the UCI Machine Learning Repository. This time, we will be working with the Statlog (Heart) data set, which we will refer to as the heart data set henceforth for brevity. The data set can be downloaded from the UCI Machine Learning Repository's website at http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29. The data contain 270 observations of patients tested for heart disease. Of these, 120 patients were found to have heart disease, so the split between the two classes is fairly even. The task is to predict whether a patient has heart disease based on their profile and a series of medical tests. First, we'll load the data into a data frame and rename the columns according to the definitions on the website:

> heart <- read.table("heart.dat", quote = "\"")
> names(heart) <- c("AGE", "SEX", "CHESTPAIN", "RESTBP", "CHOL", "SUGAR", "ECG", "MAXHR", "ANGINA", "DEP", "EXERCISE", "FLUOR", "THAL", "OUTPUT")
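
As a quick check that the file loaded correctly, we can confirm that we have 270 rows and 14 columns (the 13 features plus the output):

> dim(heart)
[1] 270  14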

The following table contains the definitions of our input features and the output:

Column name   Type                  Definition
AGE           Numerical             Age (years)
SEX           Binary                Gender
CHESTPAIN     Categorical           4-valued chest pain type
RESTBP        Numerical             Resting blood pressure (mm Hg)
CHOL          Numerical             Serum cholesterol (mg/dl)
SUGAR         Binary                Is the fasting blood sugar level > 120 mg/dl?
ECG           Categorical           3-valued resting electrocardiographic results
MAXHR         Numerical             Maximum heart rate achieved (beats per minute)
ANGINA        Binary                Was angina induced by exercise?
DEP           Numerical             ST depression induced by exercise relative to rest
EXERCISE      Ordered categorical   Slope of the peak exercise ST segment
FLUOR         Numerical             The number of major vessels colored by fluoroscopy
THAL          Categorical           3-valued Thal
OUTPUT        Binary                Presence or absence of heart disease

Before we train a logistic regression model on these data, there are a couple of preprocessing steps that we should perform. A common pitfall when working with numerical data is failing to notice that a feature is actually categorical rather than numerical because its levels happen to be coded as numbers. In the heart data set, we have four such features. The CHESTPAIN, THAL, and ECG features are all categorical. The EXERCISE variable, although an ordered categorical variable, is nonetheless categorical, so it will have to be coded as a factor as well:

> heart$CHESTPAIN <- factor(heart$CHESTPAIN)
> heart$ECG <- factor(heart$ECG)
> heart$THAL <- factor(heart$THAL)
> heart$EXERCISE <- factor(heart$EXERCISE)

In Chapter 1, Gearing Up for Predictive Modeling, we saw how we can transform categorical features with many levels into a series of binary-valued indicator variables. By doing this, we can use them in a model, such as linear or logistic regression, that requires all of its inputs to be numerical. As long as the relevant categorical variables in a data frame have been coded as factors, R will automatically apply such a coding scheme when performing logistic regression. Concretely, R treats one of the k factor levels as a reference level and creates k-1 binary features from the remaining levels. We'll see visual evidence of this when we study the summary output of the logistic regression model that we'll train.
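
For instance, assuming R's default treatment contrasts, we can preview the coding scheme for the 4-valued CHESTPAIN factor; level 1 serves as the reference level, and the remaining three levels each receive an indicator column:

> contrasts(heart$CHESTPAIN)
  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1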

Next, we should observe that the OUTPUT variable is coded so that class 1 corresponds to the absence of heart disease and class 2 to its presence. As a final change, we'll recode the OUTPUT variable to the familiar class labels of 0 and 1, respectively. This is done by simply subtracting 1:

> heart$OUTPUT <- heart$OUTPUT - 1
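
Tabulating the recoded output confirms the class balance we noted when introducing the data set, namely 150 patients without heart disease and 120 with it:

> table(heart$OUTPUT)

  0   1 
150 120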

Our data frame is now ready. Before training our model, however, we will split the data into two parts, for training and testing, exactly as we did for linear regression. Once again, we'll use an 85-15 split:

> library(caret)
> set.seed(987954)
> heart_sampling_vector <- 
  createDataPartition(heart$OUTPUT, p = 0.85, list = FALSE)
> heart_train <- heart[heart_sampling_vector,]
> heart_train_labels <- heart$OUTPUT[heart_sampling_vector]
> heart_test <- heart[-heart_sampling_vector,]
> heart_test_labels <- heart$OUTPUT[-heart_sampling_vector]
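
We can confirm how the 270 observations were divided between the two partitions:

> nrow(heart_train)
[1] 230
> nrow(heart_test)
[1] 40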

This gives us 230 observations in our training set and 40 observations in our test set. To train a logistic regression model in R, we use the glm() function, which stands for generalized linear model. This function can be used to train various generalized linear models, but we'll focus on the syntax and usage for logistic regression here. The call is as follows:

> heart_model <- 
  glm(OUTPUT ~ ., data = heart_train, family = binomial("logit"))

Note that the format is very similar to what we saw with linear regression. The first parameter is the model formula, which identifies the output variable and the features we want to use (in this case, all of them). The second parameter is the data frame, and the final family parameter specifies that we want a binomial model with the logit link function, which is exactly logistic regression. We can use the summary() function to find out more about the model we just trained, as follows:

> summary(heart_model)

Call:
glm(formula = OUTPUT ~ ., family = binomial("logit"), data = heart_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7137  -0.4421  -0.1382   0.3588   2.8118  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.946051   3.477686  -2.285 0.022321 *  
AGE         -0.020538   0.029580  -0.694 0.487482    
SEX          1.641327   0.656291   2.501 0.012387 *  
CHESTPAIN2   1.308530   1.000913   1.307 0.191098    
CHESTPAIN3   0.560233   0.865114   0.648 0.517255    
CHESTPAIN4   2.356442   0.820521   2.872 0.004080 ** 
RESTBP       0.026588   0.013357   1.991 0.046529 *  
CHOL         0.008105   0.004790   1.692 0.090593 .  
SUGAR       -1.263606   0.732414  -1.725 0.084480 .  
ECG1         1.352751   3.287293   0.412 0.680699    
ECG2         0.563430   0.461872   1.220 0.222509    
MAXHR       -0.013585   0.012873  -1.055 0.291283    
ANGINA       0.999906   0.525996   1.901 0.057305 .  
DEP          0.196349   0.282891   0.694 0.487632    
EXERCISE2    0.743530   0.560700   1.326 0.184815    
EXERCISE3    0.946718   1.165567   0.812 0.416655    
FLUOR        1.310240   0.308348   4.249 2.15e-05 ***
THAL6        0.304117   0.995464   0.306 0.759983    
THAL7        1.717886   0.510986   3.362 0.000774 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 315.90  on 229  degrees of freedom
Residual deviance: 140.36  on 211  degrees of freedom
AIC: 178.36

Number of Fisher Scoring iterations: 6
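
Notice that each categorical feature appears as k-1 binary indicators, just as promised: CHESTPAIN2 through CHESTPAIN4, ECG1 and ECG2, EXERCISE2 and EXERCISE3, and THAL6 and THAL7. The coefficient estimates themselves are on the log-odds scale; the model converts the resulting linear predictor into a probability via the logistic function. As a minimal sketch of how we might use the trained model (the variable names and the 0.5 probability cutoff here are our own choices for illustration):

> # Predicted probabilities of heart disease, one per training observation
> train_probs <- predict(heart_model, newdata = heart_train, type = "response")
> # Convert probabilities into 0/1 class predictions using a 0.5 cutoff
> train_predictions <- as.numeric(train_probs > 0.5)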