Chapter 14

What is overfitting?

Many ML models (for example, decision trees) fit themselves as closely as possible to the training set at hand. At some point, this process goes beyond capturing generalizable patterns that are valuable for the task and starts memorizing specifics of the training data that are irrelevant to the test set. This not only adds nothing useful but also hurts the model's performance on other data. This phenomenon is known as overfitting, and there are ways to overcome it.
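
For illustration, here is a minimal sketch (using scikit-learn and a synthetic dataset, both of which are assumptions for the example) in which an unconstrained decision tree scores almost perfectly on its training data but worse on held-out data than a depth-limited tree:

# A minimal sketch of overfitting: an unconstrained decision tree memorizes
# the training set but generalizes worse than a depth-limited one.
# Dataset and parameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 4):  # None = grow the tree until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")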

Why should we use cross-validation?

Cross-validation is a technique aimed at overcoming the issue of overfitting when evaluating a model. In its basic form, it splits a training set into multiple folds, trains several models with the same settings on different combinations of those folds, evaluates each model on the fold it was not trained on, and then averages the performance across all models. Because every evaluation is made on data the model never saw, the reported score cannot be inflated by fitting to the specifics of a single split. Thus, cross-validation provides a safe basis for feature engineering and model adjustments.
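
As a minimal sketch of the basic form described above (scikit-learn and the synthetic dataset are assumed choices for illustration):

# A minimal sketch of k-fold cross-validation: the model is trained and
# evaluated on 5 different train/validation splits, and the fold scores
# are averaged to get a more reliable estimate of performance.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = DecisionTreeClassifier(max_depth=4, random_state=0)

scores = cross_val_score(model, X, y, cv=5)  # 5 folds, one score per fold
print(scores.mean(), scores.std())           # averaged out-of-fold performance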

Why can it be bad if our metrics are improving on the test set? Which features are useful for improving model performance on cross-validation?

Improvements on the test set can reflect an actual increase in model performance, or they can be the result of overfitting to that test set. To improve a model's "actual" performance, you need to tune the model's hyperparameters, add new features, or process existing features so that they better represent the underlying dependencies. One common trick, for example, is to convert date features into a set of features representing cycles, such as day of the week, time of day, month or day of the year, and so on.
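
A minimal sketch of such a cyclical encoding (the column name and sample timestamps are illustrative assumptions); using sine/cosine pairs keeps 23:00 numerically close to 00:00 instead of far apart:

# Turn a timestamp into cyclic features: hour of day and day of week,
# each encoded as a sin/cos pair so the cycle "wraps around" correctly.
import numpy as np
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2023-01-01 00:00", "2023-01-01 06:00", "2023-01-01 12:00",
    "2023-01-01 18:00", "2023-01-02 00:00", "2023-01-06 23:00"])})

hour = df["timestamp"].dt.hour
dow = df["timestamp"].dt.dayofweek

df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
df["dow_sin"] = np.sin(2 * np.pi * dow / 7)
df["dow_cos"] = np.cos(2 * np.pi * dow / 7)
print(df)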

Why do some features decrease the performance of a decision tree on test data or in cross-validation?

Certain features with little to no predictive value but high enough variance can "appear" useful to decision trees (and other algorithms) on the training set: the model finds splits on them that do not generalize, which decreases out-of-sample performance. To prevent this, you need to either filter features thoroughly or use algorithms that are less prone to overfitting.
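
A minimal sketch of this effect (synthetic data and the number of noise features are assumptions for illustration), where adding pure-noise columns can lower a decision tree's cross-validated score:

# Adding irrelevant, high-variance noise features can reduce a decision
# tree's cross-validated performance: the tree finds splits on noise that
# only look useful within the training folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
X_noisy = np.hstack([X, rng.normal(size=(500, 50))])  # 50 irrelevant features

tree = DecisionTreeClassifier(random_state=0)
print(cross_val_score(tree, X, y, cv=5).mean())        # informative features only
print(cross_val_score(tree, X_noisy, y, cv=5).mean())  # with noise features added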

What is the difference between the random search and grid search algorithms for parameter optimization?

Both algorithms are designed to find the best combination of a model's hyperparameters: parameters that cannot be learned from the data directly and can only be optimized by repeatedly running the model on a specific dataset. Grid search is the simplest, brute-force solution: all it does is run the model over a finite set of combinations (a "grid") of those parameters. Because the number of combinations grows exponentially with the number of parameters, even a small number of choices for each one quickly leads to huge computations. Random search, on the other hand, does not require a finite set of choices; instead, it samples each parameter from a specified distribution and evaluates a fixed budget of randomly drawn combinations. As a result, given the same budget, it is usually faster than grid search and often finds better configurations.
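
A minimal sketch contrasting the two in scikit-learn (the parameter ranges and dataset are illustrative assumptions):

# Grid search enumerates every combination of the listed values; randomized
# search draws a fixed number of candidates from distributions.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

grid = GridSearchCV(tree, {"max_depth": [2, 4, 8],
                           "min_samples_leaf": [1, 5, 10]},
                    cv=5)                                   # 3 x 3 = 9 combinations, all tried
rand = RandomizedSearchCV(tree, {"max_depth": randint(2, 16),
                                 "min_samples_leaf": randint(1, 20)},
                          n_iter=9, cv=5, random_state=0)   # 9 random draws from distributions

grid.fit(X, y)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)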

Why is Git not sufficient for data version control?

Git, at its core, is an immutable, file-based system: on each commit, it stores a snapshot of every file that changed since the last commit. Thus, even a small change in a dataset results in a new copy of the whole file being stored, which quickly leads to enormous repository size. In addition, GitHub prevents uploading files above a certain size threshold, so Git alone is not a viable option for data version control.

What are the alternatives to DVC for data version control and experimentation logging?

Currently, there are plenty of alternative solutions for data version control and experiment logging, each with its own flavor and focus. Among the most popular alternatives are MLflow and Sacred, but both are language-specific and require some custom code.
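
As a minimal sketch of what that custom code looks like with MLflow (an assumed choice among the alternatives above; the parameter and metric values are placeholders):

# Log a run's hyperparameters and metrics to MLflow's tracking store.
import mlflow

with mlflow.start_run(run_name="decision-tree-baseline"):
    mlflow.log_param("max_depth", 4)          # hyperparameter used for this run
    mlflow.log_metric("cv_accuracy", 0.87)    # placeholder result to track
    # mlflow.log_artifact("model.pkl")        # optionally attach files such as a saved model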
