In a batch context, feature selection is commonly performed by iteratively fitting and testing models on different subsets of features. Applying such approaches in an online setting is still possible, but quite expensive in terms of time, because each candidate model requires streaming the whole dataset again. Recursive approaches based on a large number of iterations and tests really call for a dataset small enough to fit in memory. As mentioned earlier, in such a case, subsampling would be a good option in order to figure out features and models later to be applied at a larger scale.
Sticking to our out-of-core approach, regularization is the ideal way to select variables while streaming and to filter out noisy or redundant features. Regularization works well with online algorithms because it operates while the algorithm is fitting its coefficients from the examples, without any need for additional passes over the data for the purpose of selection. Regularization is, in fact, just a penalty added to the optimization objective of the learning process. It depends on the features' coefficients and on a parameter named alpha, which sets the strength of the regularization. Regularization intervenes when the coefficients' weights are updated by the model: it shrinks the resulting weights unless the value of the update is large enough. The trick of excluding or attenuating redundant variables is thus achieved through the alpha parameter, which has to be empirically set at the correct magnitude for the best result on each specific dataset to be learned.
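To make the mechanism concrete, here is a minimal sketch (not scikit-learn's actual implementation) of how an L2 penalty enters a single SGD weight update for a squared loss; the function name, learning rate, and values are illustrative assumptions:

```python
def sgd_step_l2(w, x, y, eta=0.01, alpha=0.0001):
    """One SGD update on weights w for example (x, y) with an L2 penalty."""
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    error = prediction - y
    # Gradient of the squared loss plus the gradient of the L2 penalty
    # (alpha * wi): the penalty term constantly shrinks each weight toward
    # zero, so only weights receiving strong, consistent updates stay large.
    return [wi - eta * (error * xi + alpha * wi) for wi, xi in zip(w, x)]

# Even when the prediction is perfect (zero error), the weight is
# slightly shrunk by the penalty:
w = sgd_step_l2([1.0], [1.0], 1.0)
```

Note how, with zero prediction error, the only remaining force on the weight is the shrinkage caused by alpha.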
SGD implements the same regularization strategies found in batch algorithms:
L1 regularization is the perfect strategy when there are irrelevant or redundant variables, as it pushes the coefficients of such features to exactly zero, making them irrelevant when calculating the prediction.
L2 is suitable when there are many correlated variables, as its strategy is to reduce the weights of the features whose variation matters less for minimizing the loss function. With L2, all the variables keep contributing to the prediction, though some less so.
Elastic Net mixes L1 and L2 using a weighted sum. This solution is interesting because L1 regularization can be unstable when dealing with highly correlated variables, choosing one or the other depending on the examples seen. With Elastic Net, many irrelevant features will still be pushed to zero, as in L1 regularization, but correlated ones will be attenuated, as in L2.
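The different behaviors of the three penalties can be verified on a small synthetic problem where one feature is pure noise; the dataset, alpha value, and sample sizes below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(1)
X_informative = rng.randn(1000, 2)
y = (X_informative[:, 0] + X_informative[:, 1] > 0).astype(int)
# Append a third column of pure noise, unrelated to the target
X = np.hstack([X_informative, rng.randn(1000, 1)])

coefs = {}
for penalty in ('l1', 'l2', 'elasticnet'):
    clf = SGDClassifier(loss='hinge', penalty=penalty, alpha=0.01,
                        l1_ratio=0.5, random_state=1)
    clf.fit(X, y)
    coefs[penalty] = clf.coef_[0]
    print(penalty, np.round(clf.coef_[0], 3))
```

With L1, the coefficient of the noisy third feature is typically driven to (or very near) zero, whereas L2 merely keeps it small while all coefficients stay nonzero.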
Both SGDClassifier and SGDRegressor can implement L1, L2, and Elastic Net regularization using the penalty, alpha, and l1_ratio parameters.
If penalty determines what kind of regularization is chosen, alpha, as mentioned, determines its strength. As alpha is a constant that multiplies the penalization term, low alpha values have little influence on the final coefficients, whereas high values affect them significantly. Finally, l1_ratio represents, when penalty='elasticnet', the proportion of the L1 penalization with respect to L2.
Setting regularization with SGD is very easy. For instance, you may try changing the previous code example by adding an L2 penalty to SGDClassifier:
SGD = SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, random_state=1, average=True)
If you prefer to test an Elastic Net, mixing the effects of the two regularization approaches, all you have to do is make the ratio between L1 and L2 explicit by setting l1_ratio:
SGD = SGDClassifier(loss='hinge', penalty='elasticnet', alpha=0.001, l1_ratio=0.5, random_state=1, average=True)
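Since the focus here is out-of-core learning, the regularized classifier would typically be fit incrementally with partial_fit rather than with a single call to fit. A minimal sketch on simulated mini-batches (the data, batch size, and number of batches are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

SGD = SGDClassifier(loss='hinge', penalty='elasticnet', alpha=0.001,
                    l1_ratio=0.5, random_state=1, average=True)

rng = np.random.RandomState(1)
classes = np.array([0, 1])
# Simulate a stream of mini-batches; in a real out-of-core setting each
# batch would be read from disk or a network source instead.
for _ in range(50):
    X_batch = rng.randn(100, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    # The Elastic Net penalty is applied at every incremental update,
    # so selection happens while the data is streaming.
    SGD.partial_fit(X_batch, y_batch, classes=classes)
```

Note that classes must be passed to the first partial_fit call so that the model knows all the labels it may encounter in the stream.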
As the success of regularization depends on choosing the right kind of penalty and the best alpha, regularization will be seen in action in our examples when dealing with the problem of hyperparameter optimization.