Neural networks and hyperparameter optimization

The parameter space of neural networks and deep learning models is so large that optimization is a hard and computationally very expensive task. A wrong neural network architecture can be a recipe for failure: these models can only be accurate if we choose the right architecture and the right parameters for our problem. Unfortunately, only a few libraries provide tuning methods. We found that the best parameter tuning method at the moment is randomized search, an algorithm that samples the parameter space at random for a fixed number of iterations and thereby spares computational resources. As far as we found, sknn is really the only neural network library that offers this option, since its models can be passed to scikit-learn's RandomizedSearchCV. Let's walk through the parameter tuning methods with the following example based on the wine-quality dataset.
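Before turning to the wine data, the idea behind randomized search can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: sample_config, toy_score, and random_search are hypothetical names, and toy_score is a made-up stand-in for a cross-validated accuracy, not a real model.

```python
import random


def sample_config(rng):
    """Draw one random configuration from a hypothetical search space."""
    return {
        "learning_rate": rng.uniform(0.001, 0.051),
        # random.randint includes both end points, so this yields 4..19
        "hidden0_units": rng.randint(4, 19),
        "hidden1_units": rng.randint(4, 19),
        "hidden2_units": rng.randint(4, 19),
    }


def toy_score(config):
    """Stand-in for a cross-validated score (assumption, not a real model).

    The score is highest at 12 units per layer and a learning rate of 0.03.
    """
    unit_penalty = sum(abs(value - 12)
                       for name, value in config.items()
                       if name.endswith("units"))
    rate_penalty = abs(config["learning_rate"] - 0.03)
    return 1.0 / (1.0 + 0.05 * unit_penalty + 10.0 * rate_penalty)


def random_search(n_iter, seed=0):
    """Evaluate n_iter random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_iter):
        config = sample_config(rng)
        score = toy_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(n_iter=25)
print("best score %.3f" % best_score)
print("best parameters %s" % best_config)
```

The budget is set by n_iter rather than by the size of the parameter space, which is what keeps the search affordable. The example that follows applies the same idea to a real model.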

In this example, we first load the wine dataset. Then we transform the data and tune our model over the chosen parameters. Note that this dataset has 13 features; we specify the units within each layer to be between 4 and 20. We don't use mini-batches in this case; the dataset is simply too small:

import numpy as np
import scipy as sp
import pandas as pd
from scipy import stats  # also makes sp.stats available
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.cross_validation import train_test_split
from sknn.mlp import Layer, Regressor, Classifier as skClassifier

# Load data
df = pd.read_csv(' ' , sep = ';')
X = df.drop('quality', axis=1).values # drop target variable

y1 = df['quality'].values # original target variable
y = y1 <= 5 # new target variable: is the rating <= 5?

# Split the data into a test set and a training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)

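# Three hidden Rectifier layers and a Softmax output layer; the unit
# counts given here are placeholders that the search overrides
max_net = skClassifier(layers=[Layer("Rectifier", units=10),
                               Layer("Rectifier", units=10),
                               Layer("Rectifier", units=10),
                               Layer("Softmax")])

# Distributions (or lists of values) to sample each parameter from
params = {'learning_rate': sp.stats.uniform(0.001, 0.05),  # loc, scale
          'hidden0__units': sp.stats.randint(4, 20),  # integers 4..19
          'hidden0__type': ["Rectifier"],
          'hidden1__units': sp.stats.randint(4, 20),
          'hidden1__type': ["Rectifier"],
          'hidden2__units': sp.stats.randint(4, 20),
          'hidden2__type': ["Rectifier"]}
# batch_size and learning_rule, which appear in the output below,
# would be added to params in the same way

max_net2 = RandomizedSearchCV(max_net, param_distributions=params, n_iter=25,
                              cv=3, scoring='accuracy', verbose=100, n_jobs=1)
model_tuning = max_net2.fit(X_train, y_train)

print("best score %s" % model_tuning.best_score_)
print("best parameters %s" % model_tuning.best_params_)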

[CV]  hidden0__units=11, learning_rate=0.100932183167, hidden2__units=4, hidden2__type=Rectifier, batch_size=30, hidden1__units=11, learning_rule=adagrad, hidden1__type=Rectifier, hidden0__type=Rectifier, score=0.655914 -   3.0s
[Parallel(n_jobs=1)]: Done  74 tasks       | elapsed:  3.0min
[CV] hidden0__units=11, learning_rate=0.100932183167, hidden2__units=4, hidden2__type=Rectifier, batch_size=30, hidden1__units=11, learning_rule=adagrad, hidden1__type=Rectifier, hidden0__type=Rectifier 
[CV]  hidden0__units=11, learning_rate=0.100932183167, hidden2__units=4, hidden2__type=Rectifier, batch_size=30, hidden1__units=11, learning_rule=adagrad, hidden1__type=Rectifier, hidden0__type=Rectifier, score=0.750000 -   3.3s
[Parallel(n_jobs=1)]: Done  75 tasks       | elapsed:  3.0min
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:  3.0min finished
best score 0.721366278222

best parameters {'hidden0__units': 14, 'learning_rate': 0.03202394348494512, 'hidden2__units': 19, 'hidden2__type': 'Rectifier', 'batch_size': 30, 'hidden1__units': 17, 'learning_rule': 'adagrad', 'hidden1__type': 'Rectifier', 'hidden0__type': 'Rectifier'}


Warning: As the parameter space is searched at random, the results can be inconsistent.
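The usual remedy for this inconsistency is to fix the seed of the random number generator. The sketch below shows the principle with Python's standard library only; draw_units is a hypothetical helper, not part of sknn or scikit-learn. RandomizedSearchCV accepts a random_state argument for the same purpose, although the training of the network itself may add further randomness.

```python
import random


def draw_units(seed, n_layers=3):
    """Sample a unit count (4..19) for each hidden layer from a seeded generator."""
    rng = random.Random(seed)
    return [rng.randint(4, 19) for _ in range(n_layers)]


first_run = draw_units(seed=42)
second_run = draw_units(seed=42)

# The same seed reproduces the same sampled architecture
print(first_run == second_run)  # True
```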

We can see that the best model found has 14 units in the first hidden layer, 17 units in the second, and 19 units in the third. This is quite a complex architecture that we might never have been able to deduce ourselves, which demonstrates the importance of hyperparameter optimization.
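To put the cost of this run in perspective, the short calculation below compares the number of model fits in the randomized search with what an exhaustive grid over the unit counts alone would need. The numbers come from the call above (n_iter=25, cv=3, and randint(4, 20), which yields the integers 4 to 19); the variable names are only illustrative.

```python
n_iter = 25                            # configurations sampled by the search
cv_folds = 3                           # cross-validation folds per configuration
values_per_layer = len(range(4, 20))   # randint(4, 20) yields 4..19, so 16 values
n_layers = 3                           # hidden layers being tuned

random_search_fits = n_iter * cv_folds
grid_unit_combinations = values_per_layer ** n_layers
grid_search_fits = grid_unit_combinations * cv_folds

print(random_search_fits)      # 75, matching "75 out of 75" in the log
print(grid_unit_combinations)  # 4096
print(grid_search_fits)        # 12288
```

An exhaustive grid would also have to discretize the learning rate, which would multiply the number of fits further.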

