The row normalization method

Our final normalization method works row-wise instead of column-wise. Instead of calculating statistics (mean, min, max, and so on) on each column, the row normalization technique scales each row of data so that it has a unit norm, meaning that every row ends up with the same vector length. Imagine each row of data as a point in an n-dimensional space; each one has a vector norm, or length. Another way to put it is to consider every row to be a vector in space:

x = (x1, x2, ..., xn)

where n, in the case of Pima, would be 8, one for each feature (not including the response). The norm would be calculated as:

||x|| = √(x1² + x2² + ... + xn²)

This is called the L2 norm. Other types of norms exist, but we will not get into them in this text. Instead, we are concerned with making sure that every single row has the same (unit) norm. This comes in handy, especially when working with text data or clustering algorithms.
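To make the formula concrete, here is a minimal sketch in plain NumPy, using one made-up row of eight feature values, showing that squaring every element, summing, and taking the square root gives the same answer as NumPy's built-in L2 norm:

import numpy as np

x = np.array([6., 148., 72., 35., 0., 33.6, 0.627, 50.])  # one illustrative row (made-up values)

manual_norm = np.sqrt((x**2).sum())  # the formula above, computed by hand
builtin_norm = np.linalg.norm(x)     # NumPy's built-in L2 norm

print(manual_norm, builtin_norm)  # both print the same vector length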

Before doing anything, let's see the average norm of our mean-imputed matrix, using the following code:

np.sqrt((pima_imputed**2).sum(axis=1)).mean() 
# average vector length of imputed matrix

223.36222025823744

Now, let's bring in our row-normalizer, as shown in the following code:

from sklearn.preprocessing import Normalizer # our row normalizer

normalize = Normalizer()

pima_normalized = pd.DataFrame(normalize.fit_transform(pima_imputed), columns=pima_column_names)

np.sqrt((pima_normalized**2).sum(axis=1)).mean()
# average vector length of row normalized imputed matrix

1.0
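The output above is only the average. As a quick sanity check (a short sketch reusing the pima_normalized DataFrame we just built), we can confirm that every individual row, not just the mean, has a length of one:

row_norms = np.sqrt((pima_normalized**2).sum(axis=1))  # length of each individual row
print(row_norms.min(), row_norms.max())  # both should be (essentially) 1.0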

After normalizing, we see that every single row now has a norm of one. Let's see how this method fares in our pipeline:

knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors':[1, 2, 3, 4, 5, 6, 7]}

mean_impute_normalize = Pipeline([('imputer', Imputer()), ('normalize', Normalizer()), ('classify', knn)])
X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute_normalize, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.682291666667 {'imputer__strategy': 'mean', 'classify__n_neighbors': 6}

Ouch, not great, but worth a try. Now that we have seen three different methods of data normalization, let's put it all together and see how we did on this dataset.

Many learning algorithms are affected by the scale of the data. Here is a list of some popular ones, along with the reason scale matters for each:

  • KNN, due to its reliance on the Euclidean distance (see the sketch after this list)
  • K-means clustering, for the same reason as KNN
  • Logistic regression, SVM, and neural networks, if you are using gradient descent to learn the weights
  • Principal component analysis, because the eigenvectors will be skewed towards the columns with larger scales
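To see why the Euclidean distance makes KNN (and K-means) sensitive to scale, here is a minimal sketch with made-up numbers: two patients described only by pregnancies and glucose. The column with the larger numeric range dominates the distance almost entirely:

import numpy as np

a = np.array([1., 110.])  # hypothetical patient: [pregnancies, glucose]
b = np.array([8., 140.])  # another hypothetical patient

diff = a - b
print(diff**2)               # [ 49. 900.] -- glucose dominates the squared distance
print(np.linalg.norm(diff))  # ~30.8, driven almost entirely by the glucose gap

Scaling the columns first, with z-score or min-max normalization as we did earlier, puts both features on a comparable footing before any distances are computed.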