Preprocessing data

The more disciplined we are in handling our data, the better results we are likely to achieve in the end. The first step in this procedure is known as data preprocessing, and it comes in (at least) three different flavors:

  • Data formatting: The data may not be in a format that is suitable for us to work with; for example, the data might be provided in a proprietary file format, which our favorite machine learning algorithm does not understand.
  • Data cleaning: The data may contain invalid or missing entries, which need to be cleaned up or removed.
  • Data sampling: The data may be far too large for our specific purpose, forcing us to sample the data intelligently.
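To make the last two steps concrete, here is a minimal sketch (using NumPy, with a made-up toy dataset) of cleaning missing entries and then sampling a random subset of rows:

```python
import numpy as np

# Toy dataset with an invalid (NaN) entry, as might remain after
# converting data from an unsupported file format.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0]])

# Data cleaning: replace each missing entry with its column mean.
col_mean = np.nanmean(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_mean[cols]

# Data sampling: draw a random subset of rows if the dataset is
# too large to process in full.
rng = np.random.default_rng(42)
subset = X[rng.choice(len(X), size=2, replace=False)]
```

Imputing with the column mean is only one of several reasonable cleaning strategies; depending on the task, dropping incomplete rows entirely may be just as valid.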

Once the data has been preprocessed, we are ready for the actual feature engineering: transforming the preprocessed data to fit our specific machine learning algorithm. This step usually involves one or more of three possible procedures:

  • Scaling: Many machine learning algorithms require the data to lie within a common range, for example, with zero mean and unit variance. Scaling is the process of bringing all features (which might have different physical units) into a common range of values.
  • Decomposition: Datasets often have many more features than we could possibly process. Feature decomposition is the process of compressing data into a smaller number of highly informative data components.
  • Aggregation: Sometimes, it is possible to group multiple features into a single, more meaningful one. For example, a database might contain the date and time for each user who logged into a web-based system. Depending on the task, this data might be better represented by simply counting the number of logins per user.
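All three procedures can be sketched in a few lines of NumPy. The example below standardizes features to zero mean and unit variance, projects the data onto its top principal component via SVD (a minimal stand-in for PCA), and aggregates per-login records into a login count per user; the user names and timestamps are invented for illustration:

```python
import numpy as np
from collections import Counter

# Two features measured in very different ranges (e.g., different units).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Scaling: standardize each feature to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Decomposition: project the centered data onto its top principal
# component, compressing two features into one informative component.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:1].T   # shape (3, 1)

# Aggregation: collapse per-login timestamps into a count per user.
logins = [("alice", "2023-01-01 09:00"), ("bob", "2023-01-01 09:05"),
          ("alice", "2023-01-02 10:00")]
login_counts = Counter(user for user, _ in logins)
```

In practice, libraries such as scikit-learn wrap the first two steps in ready-made transformers (e.g., `StandardScaler` and `PCA`), but the underlying arithmetic is exactly this.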

Let's look at some of these processes in more detail.
