We have just discussed that there are no options for parallel processing when using GBM from Scikit-learn, and this is exactly where XGBoost comes in. Expanding on GBM, XGBoost introduces more scalable methods that leverage multithreading on a single machine and parallel processing on clusters of multiple servers (using sharding). The most important improvement of XGBoost over GBM is its ability to manage sparse data. XGBoost automatically accepts sparse data as input without storing zero values in memory. A second benefit of XGBoost lies in the way in which the best node split values are calculated while branching the tree, a method named quantile sketch. This method weights the data points so that candidate split values are proposed at quantiles of the weighted distribution, up to a chosen level of approximation accuracy. For more information, read the article at http://arxiv.org/pdf/1603.02754v3.pdf.
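To make the idea of proposing split candidates from weighted quantiles more concrete, here is a minimal sketch in plain Python. It is an exact, in-memory simplification and not XGBoost's actual streaming sketch structure; the function name weighted_quantile_candidates and the eps parameter are assumptions made for illustration only.

```python
def weighted_quantile_candidates(values, weights, eps):
    """Propose split candidates so that consecutive candidates are roughly
    eps apart in weighted rank (illustrative simplification, not XGBoost code)."""
    if not values:
        return []
    pairs = sorted(zip(values, weights))      # sort points by feature value
    total = sum(w for _, w in pairs)
    candidates = [pairs[0][0]]                # always keep the minimum value
    cumulative = 0.0
    last_rank = 0.0
    for value, weight in pairs:
        cumulative += weight
        rank = cumulative / total             # weighted rank in (0, 1]
        if rank - last_rank >= eps and value != candidates[-1]:
            candidates.append(value)
            last_rank = rank
    return candidates


# Ten equally weighted points: only a few values are kept as candidates.
uniform = weighted_quantile_candidates(list(range(1, 11)), [1.0] * 10, 0.25)
# A heavily weighted point pulls a candidate towards itself.
skewed = weighted_quantile_candidates([1, 2, 3, 4], [1, 1, 1, 7], 0.25)
print(uniform, skewed)
```

A smaller eps yields more candidates and a finer search over split values at a higher computational cost.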
XGBoost stands for Extreme Gradient Boosting, an open source gradient boosting algorithm that has gained a lot of popularity in data science competitions such as Kaggle (https://www.kaggle.com/) and KDD-cup 2015. (The code is available on GitHub at https://github.com/dmlc/XGBoost, as we described in Chapter 1, First Steps to Scalability.) As the authors (Tianqi Chen, Tong He, and Carlos Guestrin) report in the papers that they wrote on their algorithm, among 29 challenges held on Kaggle during 2015, 17 winning solutions used XGBoost either standalone or as part of an ensemble of multiple models. In their paper XGBoost: A Scalable Tree Boosting System (which can be found at http://learningsys.org/papers/LearningSys_2015_paper_32.pdf), the authors report that, in KDD-cup 2015, XGBoost was used by every team that ended in the top ten of the competition. Apart from its strong results in both accuracy and computational efficiency, our principal concern in this book is scalability, and XGBoost is indeed a scalable solution from different points of view. XGBoost is a new generation of GBM algorithms with important tweaks to the initial tree boost GBM algorithm. XGBoost provides parallel processing; the scalability offered by the algorithm is due to quite a few new tweaks and additions developed by its authors.
From a practical point of view, XGBoost features mostly the same parameters as GBM. XGBoost is also quite capable of dealing with missing data. Other tree ensembles, based on standard decision trees, require missing data to be imputed first using an off-scale value (such as a large negative number) in order to develop an appropriate branching of the tree to deal with missing values. XGBoost, instead, first fits all the non-missing values and, after having created the branching for the variable, decides which branch is better for the missing values to take in order to minimize the prediction error. Such an approach leads to more compact trees and to an effective imputation strategy, and therefore to more predictive power.
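The choice of a branch for missing values can be sketched as follows. This is a simplified illustration that compares the squared error of the two options for one fixed split threshold; XGBoost itself works with gradient statistics rather than raw squared errors, and the names sse and best_default_direction are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_default_direction(x, y, threshold):
    """Split the non-missing rows at `threshold`, then send the rows with a
    missing feature value (None) to whichever side gives the lower error."""
    left = [t for f, t in zip(x, y) if f is not None and f < threshold]
    right = [t for f, t in zip(x, y) if f is not None and f >= threshold]
    missing = [t for f, t in zip(x, y) if f is None]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right


# The rows with a missing feature have targets close to the right-hand group.
x = [1, 2, None, 8, 9, None]
y = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]
direction, loss = best_default_direction(x, y, threshold=5)
print(direction, round(loss, 4))
```

No imputed value is ever written into the data; the model only learns which branch missing values should follow.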
The most important XGBoost parameters are as follows:
eta (default=0.3): This is the equivalent of the learning rate in Scikit-learn's GBM
min_child_weight (default=1): Higher values prevent overfitting and limit tree complexity
max_depth (default=6): This is the maximum depth of each tree, which bounds the number of interactions between features
subsample (default=1): This is the fraction of samples of the training data that we take in each iteration
colsample_bytree (default=1): This is the fraction of features sampled in each iteration
lambda (default=1): This is the weight of the L2 regularization
seed (default=0): This is the equivalent of Scikit-learn's random_state parameter, allowing reproducibility of learning processes across multiple tests and different machines

Now that we know XGBoost's most important parameters, let's run an XGBoost example on the same dataset that we used for GBM with the same parameter settings (as much as possible). XGBoost is a little less straightforward to use than the Scikit-learn package, so we will provide some basic examples that you can use as a starting point for more complex models. Before we dive deeper into XGBoost applications, let's compare it to the GBM method in sklearn on the spam dataset; we have already loaded the data in-memory:
    import xgboost as xgb
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import classification_report

    # X_train, X_test, y_train, y_test hold the spam data loaded earlier
    clf = xgb.XGBClassifier(n_estimators=100, max_depth=8,
                            learning_rate=.1, subsample=.5)
    clf1 = GradientBoostingClassifier(n_estimators=100, max_depth=8,
                                      learning_rate=.1, subsample=.5)
    # %timeit is an IPython/Jupyter magic; names assigned inside it are not kept
    %timeit clf.fit(X_train, y_train)
    %timeit clf1.fit(X_train, y_train)
    # fit once more outside %timeit to keep the fitted models
    xgm = clf.fit(X_train, y_train)
    gbmf = clf1.fit(X_train, y_train)
    y_pred = xgm.predict(X_test)
    y_pred2 = gbmf.predict(X_test)
    print('XGBoost results\n' + classification_report(y_test, y_pred))
    print('gbm results\n' + classification_report(y_test, y_pred2))

OUTPUT:
    1 loop, best of 3: 1.71 s per loop
    1 loop, best of 3: 2.91 s per loop
    XGBoost results
                 precision    recall  f1-score   support
              0       0.95      0.97      0.96       835
              1       0.95      0.93      0.94       546
    avg / total       0.95      0.95      0.95      1381
    gbm results
                 precision    recall  f1-score   support
              0       0.95      0.97      0.96       835
              1       0.95      0.92      0.93       546
    avg / total       0.95      0.95      0.95      1381
We can clearly see that XGBoost is considerably faster than GBM (1.71s versus 2.91s), even though we didn't use parallelization for XGBoost. Later, we can arrive at an even greater speedup when we use parallelization and out-of-core methods for XGBoost with out-of-memory streaming. In some cases, the XGBoost model results in a higher accuracy than GBM, but (almost) never the other way around.
Boosting methods are often used for classification but can be very powerful for regression tasks as well. As regression is often overlooked, let's run a regression example and walk through the key issues. Let's fit a boosting model on the California housing dataset with gridsearch. The California housing dataset has recently been added to Scikit-learn, which saves us some preprocessing steps:
    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.datasets import fetch_california_housing
    from sklearn.metrics import mean_squared_error

    # named "housing" rather than "pd" to avoid shadowing the usual pandas alias
    housing = fetch_california_housing()
    # because the y variable is highly skewed we apply the log transformation
    y = np.log(housing.target)
    X_train, X_test, y_train, y_test = train_test_split(
        housing.data, y, test_size=0.15, random_state=111)
    names = housing.feature_names
    print(names)

    # newer XGBoost releases name the "reg:linear" objective "reg:squarederror"
    # and use n_jobs in place of nthread in the Scikit-learn interface
    clf = xgb.XGBRegressor(gamma=0, objective="reg:squarederror", n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('score before gridsearch %r' % mean_squared_error(y_test, y_pred))

    params = {
        'max_depth': [4, 6, 8],
        'n_estimators': [1000],
        'min_child_weight': [1, 2],
        'learning_rate': [.1, .01, .001],
        'colsample_bytree': [.8, .9, 1],
        'gamma': [0, 1]}
    # with the parameter n_jobs we specify XGBoost for parallelisation
    cvx = xgb.XGBRegressor(objective="reg:squarederror", n_jobs=-1)
    # cv=3 matches the "3 folds" reported in the output
    clf = GridSearchCV(estimator=cvx, param_grid=params, n_jobs=-1, cv=3,
                       scoring='neg_mean_absolute_error', verbose=True)
    clf.fit(X_train, y_train)
    print(clf.best_params_)
    y_pred = clf.predict(X_test)
    print('score after gridsearch %r' % mean_squared_error(y_test, y_pred))
    # Your output might look a little different based on your hardware.

OUTPUT:
    ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
    score before gridsearch 0.07110580252173157
    Fitting 3 folds for each of 108 candidates, totalling 324 fits
    [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.9min
    [Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 11.3min
    [Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed: 22.3min finished
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'min_child_weight': 1, 'n_estimators': 1000, 'max_depth': 8, 'gamma': 0}
    score after gridsearch 0.049878294113796254
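The "108 candidates, totalling 324 fits" line in the output follows directly from the size of the parameter grid. The short sketch below reproduces that arithmetic with the standard library only; the variable names param_grid and n_folds are chosen here for illustration.

```python
from itertools import product

# The same grid that is passed to GridSearchCV in the example.
param_grid = {
    'max_depth': [4, 6, 8],
    'n_estimators': [1000],
    'min_child_weight': [1, 2],
    'learning_rate': [.1, .01, .001],
    'colsample_bytree': [.8, .9, 1],
    'gamma': [0, 1],
}
n_folds = 3  # number of cross-validation folds reported in the output

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

Because the counts multiply, adding a single extra value to any one parameter list can add dozens of fits, which is worth keeping in mind given the 22 minutes this grid took.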
We have been able to improve our score quite a bit with gridsearch; you can see the optimal parameters of our gridsearch. You can see its resemblance with regular boosting methods in sklearn. However, XGBoost by default parallelizes the algorithm over all available cores. You can improve the performance of the model by increasing the n_estimators parameter to around 2,500 or 3,000. However, we found that the training time would be a little too long for readers with less powerful computers.
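Why more estimators help, particularly at small learning rates, can be illustrated with a toy calculation. The sketch below assumes squared-error loss and a model in which every boosting round fits a single leaf with weight -G / (H + lambda), where G and H are the sums of gradients and Hessians; the function names are hypothetical and the setup is far simpler than a real tree ensemble.

```python
def leaf_weight(gradients, hessians, reg_lambda):
    """Regularized optimal leaf weight: -G / (H + lambda)."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)


def boost_constant_target(y, eta, reg_lambda, rounds):
    """Boost a single-leaf model on targets y, starting from a prediction of 0."""
    pred = 0.0
    for _ in range(rounds):
        gradients = [pred - t for t in y]   # gradient of 0.5 * (pred - t) ** 2
        hessians = [1.0] * len(y)           # second derivative is constant
        pred += eta * leaf_weight(gradients, hessians, reg_lambda)
    return pred


y = [2.0, 2.0, 2.0, 2.0]
for eta in (0.1, 0.01):
    for rounds in (100, 1000):
        print(eta, rounds, round(boost_constant_target(y, eta, 0.0, rounds), 4))
```

In this toy setting, each round closes only a fraction eta of the remaining gap to the target, so a learning rate of 0.01 is still far from the target after 100 rounds while 0.1 has essentially converged.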
XGBoost has some very practical built-in functionality to plot the variable importance, which is a handy tool for feature selection relative to the model at hand. As you probably know, variable importance is based on the relative influence of each feature during tree construction. It provides practical methods for feature selection and insight into the nature of the predictive model. So let's see how we can plot importance with XGBoost:
    import numpy as np
    from matplotlib import pyplot as plt
    # %matplotlib inline  <- this only works in jupyter notebook

    # our best parameter set
    # {'colsample_bytree': 1, 'learning_rate': 0.1, 'min_child_weight': 1, 'n_estimators': 500,
    #  'max_depth': 8, 'gamma': 0}
    params = {'objective': "reg:squarederror",  # called "reg:linear" in older releases
              'eval_metric': 'rmse',
              'eta': 0.1,
              'max_depth': 8,
              # min_samples_leaf is a Scikit-learn name that XGBoost does not use;
              # min_child_weight is the closest XGBoost parameter (1 is its default)
              'min_child_weight': 1,
              'subsample': .5,
              'gamma': 0}
    dm = xgb.DMatrix(X_train, label=y_train, feature_names=names)
    regbgb = xgb.train(params, dm, num_boost_round=100)
    print(regbgb.feature_names)
    # number of times each feature is used to split, per feature
    print(regbgb.get_fscore())
    xgb.plot_importance(regbgb, color='magenta',
                        title='california-housing|variable importance')
    plt.show()
Feature importance should be used with some caution (for GBM and random forest as well). The feature importance metrics are purely based on the tree structure of the specific model, trained with the parameters of that model. This means that if we change the parameters of the model, the importance metrics, and some of the rankings, will change as well. Therefore, an importance metric should not be taken as a general conclusion about a variable that holds across models.
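The dependence of importance on the model can be seen in a toy example that counts how often each feature is used to split. The nested-dictionary tree representation and the helper names below are hypothetical and only mimic the idea of a split-count score; they are not XGBoost's internal structures.

```python
from collections import Counter

LEAF = {"leaf": 0.0}


def node(feature, left=LEAF, right=LEAF):
    """Build an internal node that splits on `feature` (toy representation)."""
    return {"feature": feature, "left": left, "right": right}


def count_splits(tree, counts):
    """Recursively count how many times each feature is used to split."""
    if "feature" in tree:
        counts[tree["feature"]] += 1
        count_splits(tree["left"], counts)
        count_splits(tree["right"], counts)


def importance(model):
    """Split-count importance over all trees of a model, most used first."""
    counts = Counter()
    for tree in model:
        count_splits(tree, counts)
    return counts.most_common()


# Two made-up models over the same features: shallow trees versus deeper trees.
shallow_model = [node("MedInc"), node("MedInc"), node("Latitude")]
deep_model = [
    node("MedInc",
         left=node("Latitude", right=node("Longitude")),
         right=node("Latitude")),
    node("Longitude", left=node("Latitude")),
]
print(importance(shallow_model))
print(importance(deep_model))
```

In this constructed case, the feature ranked first by the shallow model is ranked last by the deeper one, although both models use the same features.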
In terms of the trade-off between accuracy and performance, XGBoost is simply the best desktop solution. We saw that, with the previous random forest example, we needed to perform subsampling in order to prevent overloading our main memory.
An often-overlooked capability of XGBoost is streaming data through memory. This method passes data through the main memory in stages and feeds it into XGBoost model training. It is a prerequisite for training models on datasets that are too large to fit in the main memory. Streaming with XGBoost only works with LIBSVM files, which means that we first have to convert our dataset to the LIBSVM format and import it into the memory cache reserved for XGBoost. Another thing to note is that we use a different method to instantiate XGBoost models here, because the Scikit-learn-like interface for XGBoost only works on regular NumPy objects. So let's look at how this works.
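For readers who have not met the LIBSVM text format before, each line holds a label followed by index:value pairs for the non-zero features only. The sketch below writes and reads such lines in plain Python; the helper names are hypothetical, and zero-based feature indices are assumed (many published LIBSVM files are one-based).

```python
def to_libsvm_line(label, features):
    """Encode one row as 'label index:value ...', skipping zero features."""
    pairs = [f"{i}:{v!r}" for i, v in enumerate(features) if v != 0]
    return " ".join([format(label, "g")] + pairs)


def parse_libsvm_line(line, n_features):
    """Decode one LIBSVM line back into a label and a dense feature list."""
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * n_features
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


line = to_libsvm_line(1, [0.5, 0, 0, 2.0])
print(line)
print(parse_libsvm_line(line, n_features=4))
```

Because zeros are never written, the file size grows with the number of non-zero entries rather than with the full size of the matrix.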
First, we need to load our dataset in the LIBSVM format and split it into train and test sets before we proceed with preprocessing and training. Parameter tuning with gridsearch is unfortunately not possible with this XGBoost method. If you want to tune parameters, you need to transform the LIBSVM file into a NumPy object, which will dump the data from the memory cache to the main memory. This is unfortunately not scalable, so if you want to perform tuning on large datasets, we suggest using the reservoir sampling tools we previously introduced and applying tuning to subsamples:
    import urllib.request
    from sklearn.datasets import dump_svmlight_file, load_svmlight_file

    # download the training set and rewrite it as an uncompressed LIBSVM file
    urllib.request.urlretrieve(
        "http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/poker.bz2",
        "pokertrain.bz2")
    X, y = load_svmlight_file('pokertrain.bz2')
    dump_svmlight_file(X, y, 'pokertrain', zero_based=True, query_id=None, multilabel=False)

    # do the same for the test set
    urllib.request.urlretrieve(
        "http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/poker.t.bz2",
        "pokertest.bz2")
    X, y = load_svmlight_file('pokertest.bz2')
    dump_svmlight_file(X, y, 'pokertest', zero_based=True, query_id=None, multilabel=False)
    del X, y

    import xgboost as xgb

    # the "#name.cache" suffix enables the external-memory cache; recent XGBoost
    # releases also expect the format in the path, for example
    # '/yourpath/pokertrain?format=libsvm#dtrain.cache'
    dtrain = xgb.DMatrix('/yourpath/pokertrain#dtrain.cache')
    dtest = xgb.DMatrix('/yourpath/pokertest#dtestin.cache')

    # For parallelisation it is better to instruct "nthread" to match the exact
    # amount of cpu cores you want to use.
    param = {'max_depth': 8, 'objective': 'multi:softmax', 'nthread': 2, 'num_class': 10}
    num_round = 100
    # the metrics for every entry in the watchlist are printed after each round
    watchlist = [(dtest, 'eval'), (dtrain, 'train')]
    bst = xgb.train(param, dtrain, num_round, evals=watchlist)
    print(bst)

OUTPUT:
    [89] eval-merror:0.228659 train-merror:0.016913
    [90] eval-merror:0.228599 train-merror:0.015954
    [91] eval-merror:0.227671 train-merror:0.015354
    [92] eval-merror:0.227777 train-merror:0.014914
    [93] eval-merror:0.226247 train-merror:0.013355
    [94] eval-merror:0.225397 train-merror:0.012155
    [95] eval-merror:0.224070 train-merror:0.011875
    [96] eval-merror:0.222421 train-merror:0.010676
    [97] eval-merror:0.221881 train-merror:0.010116
    [98] eval-merror:0.221922 train-merror:0.009676
    [99] eval-merror:0.221733 train-merror:0.009316
We experience a great speedup compared with in-memory XGBoost; we would have needed a lot more training time if we had used the in-memory version. In this example, we already included the test set as a validation round in the watchlist. However, if we want to predict values on unseen data, we can simply use the same prediction procedure as with any other model in Scikit-learn and XGBoost:
    bst.predict(dtest)

OUTPUT:
    array([ 0., 0., 1., ..., 0., 0., 1.], dtype=float32)
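With the multi:softmax objective, the predictions are class labels, and the merror metric in the training log is the fraction of rows whose predicted class differs from the true one. A minimal sketch of that metric computed by hand follows; the function name merror and the small label lists are made up for illustration.

```python
def merror(y_true, y_pred):
    """Multiclass error rate: number of wrong predictions / number of rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical labels and predictions: one of the four rows is misclassified.
print(merror([0, 1, 9, 1], [0.0, 0.0, 9.0, 1.0]))
```

Read this way, the final log line reports about 22 percent misclassified rows on the evaluation set against under 1 percent on the training set, a gap that suggests the model may be overfitting.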
In the previous chapter, we covered how to store a GBM model to disk to later import and use it for predictions. XGBoost provides the same functionality. So let's see how we can store and import the model:
    # save_model writes the booster to disk itself, so pickle is not needed
    bst.save_model('xgb.model')
Now you can import the saved model from the directory that you previously specified:
    imported_model = xgb.Booster(model_file='xgb.model')
Great, now you can use this model for predictions:
    imported_model.predict(dtest)

OUTPUT:
    array([ 9., 9., 9., ..., 1., 1., 1.], dtype=float32)