Boosting algorithms

Boosting is a technique that uses a set of weak learners, such as shallow decision trees, together with example weights in order to improve model performance. Boosting assigns weights to the training data based on model misclassification, so that the learners created later in the boosting process focus on the examples that earlier learners got wrong. Examples that were correctly classified are reassigned new weights, which will generally be lower than those of the examples that were not correctly classified. Each learner's contribution to the final prediction can in turn be weighted by its performance, so the ensemble behaves like a weighted majority vote rather than a simple one.

In simple and non-technical terms, boosting uses a series of weak learners, and each learner 'learns' from the mistakes of the prior learners.
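
To make the reweighting idea concrete, the following is a minimal, illustrative sketch (not the book's code) of AdaBoost-style boosting written by hand in R. It assumes depth-one rpart trees (stumps) as the weak learners and uses the PimaIndiansDiabetes data from the mlbench package, the same data set used for the XGBoost example later in this section; the number of rounds and the stump depth are arbitrary choices made for brevity.

# Illustrative sketch of AdaBoost-style reweighting with rpart stumps
library(rpart)
library(mlbench)

data("PimaIndiansDiabetes")
pima <- PimaIndiansDiabetes
y    <- ifelse(pima$diabetes == "pos", 1, -1)   # encode the label as +1 / -1

n <- nrow(pima)
w <- rep(1 / n, n)                              # start with equal example weights
stumps <- list()
alphas <- numeric(0)

for (m in 1:10) {
  # fit a depth-1 tree (a stump) using the current example weights
  fit  <- rpart(diabetes ~ ., data = pima, weights = w,
                control = rpart.control(maxdepth = 1, cp = 0))
  pred <- ifelse(predict(fit, pima, type = "class") == "pos", 1, -1)

  err   <- sum(w * (pred != y)) / sum(w)        # weighted error of this learner
  alpha <- 0.5 * log((1 - err) / err)           # its say in the final weighted vote

  w <- w * exp(-alpha * y * pred)               # raise weights on misclassified rows
  w <- w / sum(w)                               # renormalize

  stumps[[m]] <- fit
  alphas[m]   <- alpha
}

# ensemble prediction: a weighted vote over all stumps
votes <- sapply(seq_along(stumps), function(m)
  alphas[m] * ifelse(predict(stumps[[m]], pima, type = "class") == "pos", 1, -1))
ensemble_pred <- ifelse(rowSums(votes) > 0, "pos", "neg")
mean(ensemble_pred == pima$diabetes)            # training accuracy of the ensemble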

Boosting is generally more popular than bagging because it assigns weights to observations based on model performance, rather than giving every data point an equal weight as bagging does. This is conceptually similar to the difference between a weighted average and a simple, unweighted average.
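
As a toy illustration of that difference (the numbers below are made up), compare an unweighted mean, where every learner has an equal say, with a weighted mean, where the stronger learner counts for more:

scores  <- c(0.60, 0.70, 0.90)         # hypothetical scores from three learners
weights <- c(0.15, 0.25, 0.60)         # more weight on the stronger learner

mean(scores)                           # equal say for every learner: 0.7333
weighted.mean(scores, w = weights)     # performance-weighted combination: 0.805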

There are several packages in R for boosting algorithms; some of the commonly used ones are listed below, followed by a short gbm example:

  • AdaBoost
  • GBM (Stochastic Gradient Boosting)
  • XGBoost
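
For instance, a stochastic gradient boosting model on the Pima Indians diabetes data could be fit with the gbm package roughly as follows. This is a hedged sketch rather than a tuned model: the number of trees, the interaction depth, and the shrinkage value are illustrative assumptions.

# Illustrative gbm (stochastic gradient boosting) sketch
library(gbm)
library(mlbench)

data("PimaIndiansDiabetes")
pima <- PimaIndiansDiabetes
pima$diabetes <- as.numeric(pima$diabetes == "pos")   # bernoulli loss needs a 0/1 response

gbm_model <- gbm(diabetes ~ ., data = pima,
                 distribution = "bernoulli",
                 n.trees = 500,
                 interaction.depth = 3,
                 shrinkage = 0.05)

# predicted probabilities of diabetes == "pos" for the first few rows
head(predict(gbm_model, pima, n.trees = 500, type = "response"))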

Of these, XGBoost is a widely popular machine learning package that has been used very successfully on competitive machine learning platforms such as Kaggle. XGBoost has a very elegant and computationally efficient way of creating ensemble models. Because it is both accurate and extremely fast, it is often the tool of choice for compute-intensive ML challenges. You can learn more about Kaggle at http://www.kaggle.com.

# Creating an XGBoost model in R

library(caret)
library(xgboost)
library(Matrix)      # provides sparse.model.matrix()
library(mlbench)     # provides the PimaIndiansDiabetes data set

data("PimaIndiansDiabetes")

# Create an 80/20 train/test split
set.seed(123)
train_ind <- sample(nrow(PimaIndiansDiabetes), as.integer(nrow(PimaIndiansDiabetes) * .80))
training_diab <- PimaIndiansDiabetes[train_ind, ]
test_diab <- PimaIndiansDiabetes[-train_ind, ]

# Build sparse model matrices (predictors only) and xgb.DMatrix objects,
# using diabetes == "pos" as the binary label
diab_train <- sparse.model.matrix(~ . - 1, data = training_diab[, -ncol(training_diab)])
diab_train_dmatrix <- xgb.DMatrix(data = diab_train, label = training_diab$diabetes == "pos")

diab_test <- sparse.model.matrix(~ . - 1, data = test_diab[, -ncol(test_diab)])
diab_test_dmatrix <- xgb.DMatrix(data = diab_test, label = test_diab$diabetes == "pos")

# Training parameters: binary classification with tree boosting
param_diab <- list(objective   = "binary:logistic",
                   eval_metric = "error",
                   booster     = "gbtree",
                   max_depth   = 5,
                   eta         = 0.1)

# Train for 1000 rounds, reporting the train/test error every 10 rounds
xgb_model <- xgb.train(data = diab_train_dmatrix,
                       params = param_diab,
                       nrounds = 1000,
                       watchlist = list(train = diab_train_dmatrix, test = diab_test_dmatrix),
                       print_every_n = 10)

# Predicted probabilities on the test set, thresholded at 0.5
predicted <- predict(xgb_model, diab_test_dmatrix)
predicted <- predicted > 0.5
actual <- test_diab$diabetes == "pos"

# Wrap the logical vectors in factors so confusionMatrix() accepts them
confusionMatrix(factor(actual), factor(predicted))

# RESULT
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE    80   17
     TRUE     21   36

               Accuracy : 0.7532
                 95% CI : (0.6774, 0.8191)
    No Information Rate : 0.6558
    P-Value [Acc > NIR] : 0.005956

                  Kappa : 0.463
 Mcnemar's Test P-Value : 0.626496

            Sensitivity : 0.7921
            Specificity : 0.6792
         Pos Pred Value : 0.8247
         Neg Pred Value : 0.6316
             Prevalence : 0.6558
         Detection Rate : 0.5195
   Detection Prevalence : 0.6299
      Balanced Accuracy : 0.7357

       'Positive' Class : FALSE
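
As an optional follow-up (not part of the original listing), xgboost also ships with helpers for inspecting which features the boosted trees rely on; a quick look at the importance matrix for the xgb_model object trained above might look like this:

# Inspect feature importance for the trained model (illustrative follow-up)
importance_diab <- xgb.importance(model = xgb_model)
head(importance_diab)                   # Gain, Cover and Frequency per feature
xgb.plot.importance(importance_diab)    # bar chart of relative importance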