Bias, variance, and regularization

Bias, variance, and the closely related topic of regularization hold very special and fundamental positions in the field of machine learning.

Bias happens when a machine learning model is too 'simple', leading to results that are consistently off from the actual values.

Variance happens when a model is too 'complex', leading to results that are very accurate on the training dataset, but that do not perform well on unseen/new datasets.

Once users become familiar with the process of creating machine learning models, it may seem that the process is quite simple - get the data, create a training set and a test set, create a model, apply the model to the test dataset, and the exercise is complete. Creating models is easy; creating a good model is a much more challenging task. But how can one test the quality of a model? And, perhaps more importantly, how does one go about building a 'good' model?

The answer lies in a term called regularization. It's arguably a fancy word, but all it means is that, during the process of creating a model, one benefits from penalizing a model whose performance on the training dataset is suspiciously impressive, and from relaxing that penalty for a model that fits the training data less closely.

To understand regularization, it helps to know the concepts of overfitting and underfitting. For this, let us look at a simple but familiar example of drawing lines of best fit. If you have used Microsoft Excel, you may have noticed the option to draw a line of best fit - in essence, given a set of points, you can draw a line that represents the data and approximates the function that the points represent.

The following table shows the prices versus square footage of a few properties. In order to determine the relationship between house prices and the size of the house, we can draw a line of best fit, or a trend line, as shown below:

Sq. ft.    Price ($)
  862      170,982
 1235      227,932
  932      183,280
 1624      237,945
 1757      275,921
 1630      274,713
 1236      201,428
 1002      193,128
 1118      187,073
 1339      202,422
 1753      283,989
 1239      228,170
 1364      230,662
  995      169,369
 1000      157,305

If we were to draw a line of best fit using a linear trend line, the chart would look somewhat like this:

Excel provides a useful additional feature that allows users to draw an extension of the trend line, which can provide an estimate, or a prediction, of unknown values. In this case, extending the trend line will show us, based on the function, what the prices of houses in the 1,800-2,000 sq. ft. range are likely to be.

The linear function that describes the data is as follows:

y = 126.13x + 54,466.81

The following chart with an extended trend line shows that the price is most likely between $275,000 and $300,000:

However, one may argue that the line is not the best approximation and that it may be possible to increase the value of R², which in this case is 0.87. In general, the higher the R², the better the model describes the data. There are various types of R² values, but for the purposes of this section, we'll assume that the higher the R², the better the model.
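The same trend line can be reproduced outside Excel. The following is a minimal sketch (not part of the original exercise) that assumes NumPy is available: it fits a straight line to the table above, computes R², and extends the line into the 1,800-2,000 sq. ft. range.

```python
import numpy as np

# Square footage and prices from the table above.
sqft = np.array([862, 1235, 932, 1624, 1757, 1630, 1236, 1002,
                 1118, 1339, 1753, 1239, 1364, 995, 1000], dtype=float)
price = np.array([170982, 227932, 183280, 237945, 275921, 274713, 201428,
                  193128, 187073, 202422, 283989, 228170, 230662, 169369,
                  157305], dtype=float)

# Fit the linear trend line y = slope * x + intercept.
slope, intercept = np.polyfit(sqft, price, deg=1)

# R² = 1 - (residual sum of squares / total sum of squares).
pred = slope * sqft + intercept
r2 = 1 - np.sum((price - pred) ** 2) / np.sum((price - price.mean()) ** 2)
print(f"y = {slope:.2f}x + {intercept:.2f}, R² = {r2:.2f}")

# Extend the trend line to estimate prices in the 1,800-2,000 sq. ft. range.
for x in (1800, 1900, 2000):
    print(x, "sq. ft. ->", round(float(slope * x + intercept)))
```

The fitted coefficients and R² should land close to the y = 126.13x + 54,466.81 and 0.87 quoted above.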

Next, we will draw a new trend line using a polynomial function. This function has a higher R² (0.91 vs 0.87) and visually appears to be closer to the points on average.

The function in this case is a 6th-order polynomial:

y = -0.00x⁶ + 0.00x⁵ - 0.00x⁴ + 2.50x³ - 2,313.40x² + 1,125,401.77x - 224,923,813.17

But, even though the line has a higher R², if we extend the trend line to find what the prices of houses in the 1,800-2,000 sq. ft. range are likely to be, we get the following result.

Houses in the 1,800-2,000 sq. ft. range go from approximately $280,000 down to negative $2 million at 2,000 sq. ft. In other words, people purchasing houses of 1,800 sq. ft. are expected to spend $280,000, while those purchasing houses of 2,000 sq. ft. should, according to this function with its 'higher R²', receive $2 million! This, of course, is not accurate; what we have just witnessed is known as overfitting. The image below illustrates this phenomenon.
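As a rough illustration of the same effect, the sketch below (again an assumption, not the book's code) fits a 6th-order polynomial with NumPy, reusing the sqft and price arrays from the previous snippet. The coefficients will not match Excel's exactly, and NumPy may warn that the fit is poorly conditioned, but the pattern is the same: a better in-sample fit paired with unstable extrapolation beyond the training range.

```python
# Continues from the previous snippet: np, sqft, and price are assumed defined.
coeffs = np.polyfit(sqft, price, deg=6)   # may warn that the fit is poorly conditioned
poly = np.poly1d(coeffs)

# The in-sample R² should come out higher than the straight line's,
# because the 6th-order polynomial nests the linear fit.
pred = poly(sqft)
r2 = 1 - np.sum((price - pred) ** 2) / np.sum((price - price.mean()) ** 2)
print(f"6th-order polynomial R² = {r2:.2f}")

# Extrapolating past the training range (max 1,757 sq. ft.) swings wildly.
for x in (1800, 1900, 2000):
    print(x, "sq. ft. ->", round(float(poly(x))))
```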

At the other end of the spectrum is underfitting. This happens when the model we build does not describe the data. In the following chart, the function y = 0.25x - 200 is one such example:

In brief, this section can be summarized as follows:

  • A function that fits the data too well, such that the function can approximate nearly all of the points in the training dataset is considered overfitting.
  • A function that does not fit the data at all, or in other words is far from the actual points in the training dataset, is considered underfitting.
  • Machine learning is the process of balancing between overfitting and underfitting the data. This is arguably not an easy exercise, which is why even though building a model may be trivial, building a model that is reasonably good is a much more difficult challenge.
  • Underfitting is when your function is not thinking at all - it has a high bias.
  • Overfitting is when your function is thinking too hard - it has a high variance.
  • Another example of underfitting and overfitting is given next.

Say we are tasked with determining whether a collection of fruits are oranges or apples, and have been given their location in a fruit basket (left side or right side), their size, and their weight:

Basket 1 (Training Dataset)

Basket 2 (Test Dataset)

An example of overfitting would be concluding, based on the training dataset (Basket 1), that the only fruits located on the right-hand side of the basket are oranges and that those on the left are all apples.

An example of underfitting would be concluding that the basket contains only oranges.

Model 1: In the first case - for overfitting - I have, in essence, memorized the locations.

Model 2: In the second case - for underfitting - I could not remember anything precisely at all.

Now, given a second basket - the test dataset, where the positions of the apples and oranges are switched - if I were to use Model 1, I would incorrectly conclude that all the fruits on the right-hand side are oranges and those on the left-hand side are apples (since I memorized the training data).

If I were to use Model 2, I would, again, incorrectly conclude that all the fruits are oranges.

There are, however, ways to manage the balance between underfitting and overfitting - or in other words, between high bias and high variance.

One of the methods commonly used to manage the bias-variance trade-off is known as regularization. This refers to the process of penalizing the model (for example, penalizing the size of the model's coefficients in a regression) in order to produce an output that generalizes well across a range of data points.
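As a concrete sketch of that idea (an illustration using scikit-learn, not the book's own code), ridge regression adds a penalty on the size of the regression coefficients. Applied to the same housing data with the same 6th-order polynomial basis, the penalty shrinks the coefficients relative to the unpenalized fit; the alpha value below is an arbitrary illustrative choice.

```python
# Continues from the earlier snippets: np, sqft, and price are assumed defined.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = sqft.reshape(-1, 1)    # scikit-learn expects a 2-D feature matrix

def sixth_order(regressor):
    # 6th-order polynomial basis, standardized so a single penalty applies evenly.
    return make_pipeline(PolynomialFeatures(degree=6, include_bias=False),
                         StandardScaler(),
                         regressor)

plain = sixth_order(LinearRegression()).fit(X, price)
ridged = sixth_order(Ridge(alpha=10.0)).fit(X, price)   # alpha chosen purely for illustration

# The penalty shrinks the coefficients toward zero ...
print("unpenalized coefficient norm:", np.linalg.norm(plain[-1].coef_))
print("ridge coefficient norm:      ", np.linalg.norm(ridged[-1].coef_))

# ... which tends to tame the wild extrapolation seen with the unpenalized polynomial.
for x in (1800, 1900, 2000):
    print(x, "sq. ft. ->", round(float(ridged.predict([[x]])[0])))
```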

The following table illustrates some of the key concepts of bias and variance and lists options for remedial steps when a model has high bias or high variance:

In terms of the modeling process, high bias is generally indicated by the fact that both the training set error and the test set error remain consistently high. With high variance (overfitting), the training set error decreases rapidly, but the test set error remains high, leaving a large gap between the two.
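To make this diagnostic concrete, the following sketch (an illustration on synthetic data, not from the book) fits a deliberately too-simple and a deliberately too-complex model to the same noisy quadratic data and prints training versus test error: high error on both suggests high bias, while low training error with much higher test error suggests high variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy quadratic relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 3.0 * X[:, 0] ** 2 + rng.normal(0, 5, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree, label in [(1, "degree 1 (likely high bias)"),
                      (12, "degree 12 (likely high variance)")]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{label}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")
```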
