Boosting algorithms

Boosting is a technique that uses a set of weak learners, such as shallow decision trees, together with example weights in order to improve model performance. Boosting assigns weights to the training data based on model misclassification, so that the learners created later in the boosting process focus on the examples that earlier learners got wrong. Examples that were correctly classified are reassigned new weights, which will generally be lower than those of the examples that were not correctly classified. Each learner's contribution to the final prediction can in turn be weighted by its performance, so the ensemble behaves like a weighted majority vote rather than a simple one.

In simple and non-technical terms, boosting uses a series of weak learners, and each learner 'learns' from the mistakes of the prior learners.
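
To make the reweighting idea concrete, the following is a minimal, illustrative sketch (not the book's code) of AdaBoost-style boosting written by hand in R. It assumes depth-one rpart trees (stumps) as the weak learners and uses the PimaIndiansDiabetes data from the mlbench package, the same data set used for the XGBoost example later in this section; the number of rounds and the stump depth are arbitrary choices made for brevity.

# Illustrative sketch of AdaBoost-style reweighting with rpart stumps
library(rpart)
library(mlbench)

data("PimaIndiansDiabetes")
pima <- PimaIndiansDiabetes
y    <- ifelse(pima$diabetes == "pos", 1, -1)   # encode the label as +1 / -1

n <- nrow(pima)
w <- rep(1 / n, n)                              # start with equal example weights
stumps <- list()
alphas <- numeric(0)

for (m in 1:10) {
  # fit a depth-1 tree (a stump) using the current example weights
  fit  <- rpart(diabetes ~ ., data = pima, weights = w,
                control = rpart.control(maxdepth = 1, cp = 0))
  pred <- ifelse(predict(fit, pima, type = "class") == "pos", 1, -1)

  err   <- sum(w * (pred != y)) / sum(w)        # weighted error of this learner
  alpha <- 0.5 * log((1 - err) / err)           # its say in the final weighted vote

  w <- w * exp(-alpha * y * pred)               # raise weights on misclassified rows
  w <- w / sum(w)                               # renormalize

  stumps[[m]] <- fit
  alphas[m]   <- alpha
}

# ensemble prediction: a weighted vote over all stumps
votes <- sapply(seq_along(stumps), function(m)
  alphas[m] * ifelse(predict(stumps[[m]], pima, type = "class") == "pos", 1, -1))
ensemble_pred <- ifelse(rowSums(votes) > 0, "pos", "neg")
mean(ensemble_pred == pima$diabetes)            # training accuracy of the ensemble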

Boosting is generally more popular than bagging because it assigns weights to observations based on model performance, rather than giving every data point an equal weight as bagging does. This is conceptually similar to the difference between a weighted average and a simple, unweighted average.
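
As a toy illustration of that difference (the numbers below are made up), compare an unweighted mean, where every learner has an equal say, with a weighted mean, where the stronger learner counts for more:

scores  <- c(0.60, 0.70, 0.90)         # hypothetical scores from three learners
weights <- c(0.15, 0.25, 0.60)         # more weight on the stronger learner

mean(scores)                           # equal say for every learner: 0.7333
weighted.mean(scores, w = weights)     # performance-weighted combination: 0.805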

There are several packages in R for boosting algorithms; some of the commonly used ones are listed below, followed by a short gbm example:

  • AdaBoost
  • GBM (Stochastic Gradient Boosting)
  • XGBoost
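
For instance, a stochastic gradient boosting model on the Pima Indians diabetes data could be fit with the gbm package roughly as follows. This is a hedged sketch rather than a tuned model: the number of trees, the interaction depth, and the shrinkage value are illustrative assumptions.

# Illustrative gbm (stochastic gradient boosting) sketch
library(gbm)
library(mlbench)

data("PimaIndiansDiabetes")
pima <- PimaIndiansDiabetes
pima$diabetes <- as.numeric(pima$diabetes == "pos")   # bernoulli loss needs a 0/1 response

gbm_model <- gbm(diabetes ~ ., data = pima,
                 distribution = "bernoulli",
                 n.trees = 500,
                 interaction.depth = 3,
                 shrinkage = 0.05)

# predicted probabilities of diabetes == "pos" for the first few rows
head(predict(gbm_model, pima, n.trees = 500, type = "response"))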

Of these, XGBoost is a widely popular machine learning package that has been used very successfully on competitive machine learning platforms such as Kaggle. XGBoost has a very elegant and computationally efficient way of creating ensemble models. Because it is both accurate and extremely fast, it is often the tool of choice for compute-intensive ML challenges. You can learn more about Kaggle at http://www.kaggle.com.

# Creating an XGBoost model in R

library(caret)
library(xgboost)
library(Matrix)      # provides sparse.model.matrix()
library(mlbench)     # provides the PimaIndiansDiabetes data set

data("PimaIndiansDiabetes")

# Create an 80/20 train/test split
set.seed(123)
train_ind <- sample(nrow(PimaIndiansDiabetes), as.integer(nrow(PimaIndiansDiabetes) * .80))
training_diab <- PimaIndiansDiabetes[train_ind, ]
test_diab <- PimaIndiansDiabetes[-train_ind, ]

# Build sparse model matrices (predictors only) and xgb.DMatrix objects,
# using diabetes == "pos" as the binary label
diab_train <- sparse.model.matrix(~ . - 1, data = training_diab[, -ncol(training_diab)])
diab_train_dmatrix <- xgb.DMatrix(data = diab_train, label = training_diab$diabetes == "pos")

diab_test <- sparse.model.matrix(~ . - 1, data = test_diab[, -ncol(test_diab)])
diab_test_dmatrix <- xgb.DMatrix(data = diab_test, label = test_diab$diabetes == "pos")

# Training parameters: binary classification with tree boosting
param_diab <- list(objective   = "binary:logistic",
                   eval_metric = "error",
                   booster     = "gbtree",
                   max_depth   = 5,
                   eta         = 0.1)

# Train for 1000 rounds, reporting the train/test error every 10 rounds
xgb_model <- xgb.train(data = diab_train_dmatrix,
                       params = param_diab,
                       nrounds = 1000,
                       watchlist = list(train = diab_train_dmatrix, test = diab_test_dmatrix),
                       print_every_n = 10)

# Predicted probabilities on the test set, thresholded at 0.5
predicted <- predict(xgb_model, diab_test_dmatrix)
predicted <- predicted > 0.5
actual <- test_diab$diabetes == "pos"

# Wrap the logical vectors in factors so confusionMatrix() accepts them
confusionMatrix(factor(actual), factor(predicted))

# RESULT
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE    80   17
     TRUE     21   36

               Accuracy : 0.7532
                 95% CI : (0.6774, 0.8191)
    No Information Rate : 0.6558
    P-Value [Acc > NIR] : 0.005956

                  Kappa : 0.463
 Mcnemar's Test P-Value : 0.626496

            Sensitivity : 0.7921
            Specificity : 0.6792
         Pos Pred Value : 0.8247
         Neg Pred Value : 0.6316
             Prevalence : 0.6558
         Detection Rate : 0.5195
   Detection Prevalence : 0.6299
      Balanced Accuracy : 0.7357

       'Positive' Class : FALSE
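
As an optional follow-up (not part of the original listing), xgboost also ships with helpers for inspecting which features the boosted trees rely on; a quick look at the importance matrix for the xgb_model object trained above might look like this:

# Inspect feature importance for the trained model (illustrative follow-up)
importance_diab <- xgb.importance(model = xgb_model)
head(importance_diab)                   # Gain, Cover and Frequency per feature
xgb.plot.importance(importance_diab)    # bar chart of relative importance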