Finding the most important Haar-like features for face classification with the random forest ensemble classifier

A random forest classifier is trained in order to select the most salient features for face classification. The idea is to check which features are used most often by the ensemble of trees. By using only the most salient features in subsequent steps, the computation speed can be increased while retaining accuracy. The following code snippet shows how to compute the feature importances for the classifier and displays the top 25 most important Haar-like features:

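The snippet below assumes that the images array and the extract_feature_image() helper have been defined earlier in the chapter. For reference, here is a minimal sketch of both, following the scikit-image gallery example this workflow is based on (the dataset is assumed to be scikit-image's LFW subset: 200 grayscale images of size 25x25, the first 100 of which are faces):

from skimage.data import lfw_subset
from skimage.transform import integral_image
from skimage.feature import haar_like_feature

images = lfw_subset()  # shape (200, 25, 25): 100 faces, then 100 non-faces

def extract_feature_image(img, feature_type, feature_coord=None):
    """Extract the Haar-like features for the given image."""
    ii = integral_image(img)
    return haar_like_feature(ii, 0, 0, ii.shape[1], ii.shape[0],
                             feature_type=feature_type,
                             feature_coord=feature_coord)
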
import numpy as np
from time import time
from dask import delayed
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# For speed, only extract the first two types of features
feature_types = ['type-2-x', 'type-2-y']
# Build a computation graph using dask. This allows using multiple CPUs for
# the computation step
X = delayed(extract_feature_image(img, feature_types)
            for img in images)
# Compute the result using the "processes" dask backend
t_start = time()
X = np.array(X.compute(scheduler='processes'))
time_full_feature_comp = time() - t_start
# Labels: the first 100 images are faces, the remaining 100 are non-faces
y = np.array([1] * 100 + [0] * 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=150,
                                                    random_state=0, stratify=y)
print(time_full_feature_comp)
# 104.87986302375793
print(X.shape, X_train.shape)
# (200, 101088) (150, 101088)

from sklearn.metrics import roc_curve, auc, roc_auc_score
from skimage.feature import haar_like_feature_coord, draw_haar_like_feature

# Extract the coordinates of all possible features to be able to select the
# most salient ones later
feature_coord, feature_type = \
    haar_like_feature_coord(width=images.shape[2], height=images.shape[1],
                            feature_type=feature_types)

# Train a random forest classifier and check performance
clf = RandomForestClassifier(n_estimators=1000, max_depth=None,
                             max_features=100, n_jobs=-1, random_state=0)
t_start = time()
clf.fit(X_train, y_train)
time_full_train = time() - t_start
print(time_full_train)
# 1.6583366394042969
auc_full_features = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(auc_full_features)
# 1.0
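
Since roc_curve() and auc() are imported above, the full ROC curve on the test set can also be plotted; a short sketch (the plot styling below is an assumption, not from the original text):

import matplotlib.pylab as pylab

fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
pylab.figure()
pylab.plot(fpr, tpr, label='ROC curve (AUC = %.3f)' % auc(fpr, tpr))
pylab.xlabel('False positive rate')
pylab.ylabel('True positive rate')
pylab.legend(loc='lower right')
pylab.show()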

# Sort the features in order of importance and plot the 25 most significant ones
idx_sorted = np.argsort(clf.feature_importances_)[::-1]

fig, axes = pylab.subplots(5, 5, figsize=(10, 10))
for idx, ax in enumerate(axes.ravel()):
    # Draw the idx-th most important feature on top of a face image
    image = images[1]
    image = draw_haar_like_feature(image, 0, 0, images.shape[2], images.shape[1],
                                   [feature_coord[idx_sorted[idx]]])
    ax.imshow(image)
    ax.set_xticks([])
    ax.set_yticks([])
fig.suptitle('The most important features', size=30)
pylab.show()

The following screenshot shows the output of the preceding code block, that is, the top 25 most important Haar-like features for face detection:

By keeping only a few of the most important features (about 3% of all the features), most (about 70%) of the total feature importance can be preserved. By training the random forest classifier with only those features, we should be able to retain the accuracy on the validation dataset (obtained by training the classifier with all the features), but with a much smaller time required for feature extraction and for training the classifier. The code is left as an exercise for the reader.
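
One possible sketch of this exercise, following the scikit-image gallery example (the 70% cutoff and the variable names are assumptions; extract_feature_image() is assumed to forward feature_coord to haar_like_feature(), as in the helper sketch above):

# Cumulative feature importance, sorted from most to least important
cdf_importances = np.cumsum(np.sort(clf.feature_importances_)[::-1])
cdf_importances /= cdf_importances[-1]
# Number of top features needed to preserve ~70% of the total importance
sig_feature_count = np.count_nonzero(cdf_importances < 0.7)
feature_coord_sel = feature_coord[idx_sorted[:sig_feature_count]]
feature_type_sel = feature_type[idx_sorted[:sig_feature_count]]

# Recompute only the selected features and retrain on them
X_sel = delayed(extract_feature_image(img, feature_type_sel, feature_coord_sel)
                for img in images)
X_sel = np.array(X_sel.compute(scheduler='processes'))
X_train, X_test, y_train, y_test = train_test_split(X_sel, y, train_size=150,
                                                    random_state=0, stratify=y)
clf.fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))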
