In this chapter, we're going to talk about the challenges of dealing with real-world data, and some of the quirks you might run into. The chapter starts by talking about the bias-variance trade-off, which is a more principled way of thinking about the different ways you might overfit or underfit data, and how the two relate to each other. We then talk about the k-fold cross-validation technique, an important tool in your toolbox for combating overfitting, and look at how to implement it using Python.
Next, we analyze the importance of cleaning and normalizing your data before applying any algorithms to it. We work through an example that determines the most popular pages on a website, which demonstrates just how important cleaning your data is. The chapter also covers the importance of remembering to normalize numerical data. Finally, we look at how to detect outliers and deal with them.
Specifically, this chapter covers the following topics:
- Analyzing the bias-variance trade-off
- The concept of k-fold cross-validation and its implementation
- The importance of cleaning and normalizing data
- An example of determining the most popular pages on a website
- Normalizing numerical data
- Detecting outliers and dealing with them
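To give you a taste of where we're headed, here is a minimal sketch of k-fold cross-validation using scikit-learn's `cross_val_score`. The choice of the Iris dataset and a linear SVC classifier here is purely illustrative, not the example we'll build later in the chapter:

```python
# A minimal sketch of k-fold cross-validation with scikit-learn.
# The dataset (Iris) and model (a linear SVC) are illustrative choices only.
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
clf = SVC(kernel='linear', C=1)

# Split the data into 5 folds; train on 4 folds and evaluate on the held-out
# fold, rotating through all 5, then report the accuracy for each fold.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print("Mean accuracy:", scores.mean())
```

If the scores for the individual folds vary wildly, that in itself is a warning sign that your model is sensitive to which slice of the data it sees, which we'll explore when we discuss the bias-variance trade-off.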