The fastest way to add non-linearity to a linear SGD learner (and basically a no-brainer) is to transform the vector of the example received from the stream into a new vector that includes both power transformations and combinations of the features up to a certain degree.
Combinations can represent interactions between the features (making explicit when two features act together to have a particular impact on the response), thus helping the linear SVM model to include a certain amount of non-linearity. For instance, a two-way interaction is made by multiplying two features. A three-way interaction is made by multiplying three features, and so on, creating ever more complex interactions for higher-degree expansions.
In Scikit-learn, the preprocessing module contains the PolynomialFeatures class, which can automatically transform the vector of features by polynomial expansion of the desired degree:
In:
import os
import time
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import PolynomialFeatures

# explore() and pull_examples() are helper functions not defined in this snippet
source = os.path.join('bikesharing', 'hour.csv')
local_path = os.getcwd()
target_path = os.path.join(local_path, source)
b_vars = ['holiday','hr','mnth', 'season','weathersit','weekday','workingday','yr']
n_vars = ['hum', 'temp', 'atemp', 'windspeed']
std_row, min_max = explore(target_file=target_path, binary_features=b_vars, numeric_features=n_vars)
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
SGD = SGDRegressor(loss='epsilon_insensitive', epsilon=0.001, penalty=None, random_state=1, average=True)
val_rmse = 0
val_rmsle = 0
predictions_start = 16000

def apply_log(x):
    return np.log(x + 1.0)

def apply_exp(x):
    return np.exp(x) - 1.0

for x,y,n in pull_examples(target_file=target_path, vectorizer=std_row, min_max=min_max, sparse=False, binary_features=b_vars, numeric_features=n_vars, target='cnt'):
    y_log = apply_log(y)
    # Extract only quantitative features and expand them
    num_index = [j for j, i in enumerate(std_row.feature_names_) if i in n_vars]
    x_poly = poly.fit_transform(x[:,num_index])[:,len(num_index):]
    new_x = np.concatenate((x, x_poly), axis=1)
    # MACHINE LEARNING
    if (n+1) >= predictions_start:
        # HOLDOUT AFTER N PHASE
        predicted = SGD.predict(new_x)
        val_rmse += (apply_exp(predicted) - y)**2
        val_rmsle += (predicted - y_log)**2
        if (n-predictions_start+1) % 250 == 0 and (n+1) > predictions_start:
            print('%i %s holdout RMSE: %0.3f holdout RMSLE: %0.3f' % (n, time.strftime('%X'), (val_rmse / float(n-predictions_start+1))**0.5, (val_rmsle / float(n-predictions_start+1))**0.5))
    else:
        # LEARNING PHASE
        SGD.partial_fit(new_x, y_log)
print('%s FINAL holdout RMSE: %0.3f' % (time.strftime('%X'), (val_rmse / float(n-predictions_start+1))**0.5))
print('%s FINAL holdout RMSLE: %0.3f' % (time.strftime('%X'), (val_rmsle / float(n-predictions_start+1))**0.5))
Out:
...
21:49:24 FINAL holdout RMSE: 219.191
21:49:24 FINAL holdout RMSLE: 1.480
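To make the column slicing in the preceding listing easier to follow, here is a minimal sketch of what PolynomialFeatures produces for a single row and of how the number of generated columns grows with the degree. The toy values and the names row, expanded, only_new, and n_outputs are illustrative assumptions and are not part of the bike-sharing example:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy row with two numeric features, a=2 and b=3 (illustrative values)
row = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
expanded = poly.fit_transform(row)
# Columns are ordered as a, b, a^2, a*b, b^2
print(expanded)

# Dropping the first two columns keeps only the newly created terms,
# which is the role of the slice [:, len(num_index):] in the streaming loop
only_new = expanded[:, row.shape[1]:]
print(only_new)

# Number of output columns when expanding four features to growing degrees
n_outputs = []
for degree in (2, 3, 4):
    p = PolynomialFeatures(degree=degree, include_bias=False)
    p.fit(np.zeros((1, 4)))
    n_outputs.append(p.n_output_features_)
print(n_outputs)
```

With four numeric features, the expansion yields 14 columns at degree 2, 34 at degree 3, and 69 at degree 4, so the size of the vector fed to the learner grows quickly as the degree increases.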
Though polynomial expansions are quite a powerful transformation, they can be computationally expensive when we expand to higher degrees, and the benefit of capturing important non-linearity can quickly be offset by overfitting caused by over-parameterization (when you have too many redundant and useless features). As seen in SVC and SVR, kernel transformations can come to our aid. However, SVM kernel transformations, being implicit, require the data matrix in memory in order to work. There is a class of transformations in Scikit-learn, based on random approximations, which, in the context of a linear model, can achieve results very similar to those of a kernel SVM.
The sklearn.kernel_approximation module contains a few such algorithms:
RBFSampler: This approximates a feature map of an RBF kernel
Nystroem: This approximates a kernel map using a subset of the training data
AdditiveChi2Sampler: This approximates feature mapping for an additive chi2 kernel, a kernel used in computer vision
SkewedChi2Sampler: This approximates feature mapping similar to the skewed chi-squared kernel also used in computer vision
Apart from the Nystroem method, none of the preceding classes needs to learn from a sample of your data, making them perfect for online learning. They just need to know how an example vector is shaped (how many features there are) and then they will produce many random non-linearities that can, hopefully, fit your data problem well.
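The claim that only the shape of the example matters can be checked with a small sketch. It assumes a fixed numeric gamma and a fixed random_state; the arrays data_a, data_b, and probe are made-up illustrations:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

rng = np.random.RandomState(42)
data_a = rng.normal(size=(20, 5))        # one hypothetical sample
data_b = rng.uniform(size=(3, 5)) * 100  # a very different sample, same width

# gamma is a fixed number here, so fit() only needs the number of columns
sampler_a = RBFSampler(gamma=0.5, n_components=10, random_state=0).fit(data_a)
sampler_b = RBFSampler(gamma=0.5, n_components=10, random_state=0).fit(data_b)

# Both samplers map the same new examples to the same random features
probe = rng.normal(size=(4, 5))
same_map = np.allclose(sampler_a.transform(probe), sampler_b.transform(probe))
print(same_map)
```

Because the two samplers were fitted on unrelated data yet agree on the probe examples, fitting on the first example of a stream should be enough to fix the transformation for everything that follows.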
There are no complex optimization algorithms to explain in these approximation algorithms; in fact, optimization itself is replaced by randomization and the results largely depend on the number of output features, set by the n_components parameter. The more output features you request, the higher the probability that, by chance, you'll get the right non-linearities for your problem.
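As a rough illustration of the role of n_components, the following sketch compares the kernel implied by the random features with the exact RBF kernel on a small random matrix. The data, the gamma value, and the list of component counts are arbitrary assumptions chosen only for the demonstration:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))          # small made-up dataset
exact = rbf_kernel(X, gamma=0.5)      # exact kernel matrix for reference

errors = []
for n_components in (10, 100, 2000):
    sampler = RBFSampler(gamma=0.5, n_components=n_components, random_state=0)
    Z = sampler.fit_transform(X)
    # Inner products of the random features approximate the kernel values
    approx = np.dot(Z, Z.T)
    errors.append(np.abs(approx - exact).mean())
    print('%5i components: mean abs error %0.4f' % (n_components, errors[-1]))
```

The error should shrink as more components are used, at the price of a wider vector for the linear learner to process.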
It is important to notice that, if chance really has such a great role in creating the right features to improve your predictions, then reproducibility of the results becomes essential and you should strive to obtain it, or you won't be able to consistently retrain and tune your algorithm in the same way. Notably, each class is provided with a random_state parameter, which lets you control the random feature generation and recreate it later, on the same computer or on a different one.
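A minimal sketch of the effect of random_state follows; the single example x and the seed values are arbitrary assumptions:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

x = np.array([[0.1, 0.2, 0.3]])  # a single illustrative example

first = RBFSampler(gamma=0.5, n_components=5, random_state=0).fit_transform(x)
again = RBFSampler(gamma=0.5, n_components=5, random_state=0).fit_transform(x)
other = RBFSampler(gamma=0.5, n_components=5, random_state=1).fit_transform(x)

# Same seed: identical random features; different seed: a different mapping
same_seed_matches = np.allclose(first, again)
different_seed_matches = np.allclose(first, other)
print('same seed: %s, different seed: %s' % (same_seed_matches, different_seed_matches))
```

Recording the seed together with gamma and n_components is therefore enough to rebuild the same feature map when retraining.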
The theoretical fundamentals of such feature creation techniques are explained in the scientific articles, Random Features for Large-Scale Kernel Machines by A. Rahimi and Benjamin Recht (http://www.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf) and Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning by A. Rahimi and Benjamin Recht (http://www.eecs.berkeley.edu/~brecht/papers/08.rah.rec.nips.pdf).
For our purposes, it will suffice to know how to implement the technique and have it contribute to improving our SGD models, both linear and SVM-based:
In:
import os
import time
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler

# explore() and pull_examples() are helper functions not defined in this snippet
source = 'shuffled_covtype.data'
local_path = os.getcwd()
target_path = os.path.join(local_path, source)
n_vars = ['var_'+str(j) for j in range(54)]
std_row, min_max = explore(target_file=target_path, binary_features=list(), fieldnames=n_vars+['covertype'], numeric_features=n_vars, max_rows=50000)
SGD = SGDClassifier(loss='hinge', penalty=None, random_state=1, average=True)
rbf_feature = RBFSampler(gamma=0.5, n_components=300, random_state=0)
accuracy = 0
accuracy_record = list()
predictions_start = 50
sample = 5000
early_stop = 50000
for x,y,n in pull_examples(target_file=target_path, vectorizer=std_row, min_max=min_max, binary_features=list(), numeric_features=n_vars, fieldnames=n_vars+['covertype'], target='covertype', sparse=False):
    # The fixed random_state makes every refit draw the same random features
    rbf_x = rbf_feature.fit_transform(x)
    # LEARNING PHASE
    if n > predictions_start:
        accuracy += int(int(SGD.predict(rbf_x)[0])==y[0])
        if n % sample == 0:
            accuracy_record.append(accuracy / float(sample))
            print('%s Progressive accuracy at example %i: %0.3f' % (time.strftime('%X'), n, np.mean(accuracy_record[-sample:])))
            accuracy = 0
    if early_stop and n >= early_stop:
        break
    SGD.partial_fit(rbf_x, y, classes=list(range(1,8)))
Out:
...
07:57:45 Progressive accuracy at example 50000: 0.707
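Since the preceding listing depends on helper functions and on a local data file, a self-contained sketch of the same idea may be useful. It swaps in synthetic data from make_circles for the covertype stream and feeds mini-batches instead of single examples; the dataset, the batch size, and the gamma and n_components values are all assumptions made for illustration, not settings taken from the example above:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

# Synthetic two-class data that no straight line can separate
# (an assumption, used here instead of the covertype stream)
X, y = make_circles(n_samples=4000, noise=0.05, factor=0.5, random_state=0)
X_train, y_train = X[:3000], y[:3000]
X_test, y_test = X[3000:], y[3000:]

rbf_feature = RBFSampler(gamma=2.0, n_components=200, random_state=0)
rbf_feature.fit(X_train[:1])  # with a numeric gamma, only the column count matters

linear_sgd = SGDClassifier(loss='hinge', random_state=1, average=True)
rbf_sgd = SGDClassifier(loss='hinge', random_state=1, average=True)

# Simulate a stream by passing over the training data once, in small batches
batch = 100
for start in range(0, len(X_train), batch):
    xb = X_train[start:start + batch]
    yb = y_train[start:start + batch]
    linear_sgd.partial_fit(xb, yb, classes=[0, 1])
    rbf_sgd.partial_fit(rbf_feature.transform(xb), yb, classes=[0, 1])

linear_acc = linear_sgd.score(X_test, y_test)
rbf_acc = rbf_sgd.score(rbf_feature.transform(X_test), y_test)
print('linear SGD accuracy: %0.3f' % linear_acc)
print('RBF + SGD accuracy:  %0.3f' % rbf_acc)
```

On data of this kind, the model trained on the random RBF features would be expected to score well above the purely linear one, although the exact figures depend on the library version and on the chosen gamma and n_components.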