Representing Data and Engineering Features

In the last chapter, we built our very first supervised learning models and applied them to some classic datasets, such as the Iris and Boston datasets. In the real world, however, data rarely comes as a neat <n_samples x n_features> feature matrix that is part of a pre-packaged database. Instead, it is our responsibility to represent the data in a meaningful way. The process of finding the best way to represent our data is known as feature engineering, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems.
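The toy datasets from the last chapter already arrive in exactly this layout, which is what makes them so convenient. As a minimal sketch (assuming scikit-learn is installed), here is what such a feature matrix looks like for the Iris data:

    from sklearn.datasets import load_iris

    # Load the Iris data; scikit-learn hands it to us as a ready-made
    # feature matrix of shape (n_samples, n_features).
    iris = load_iris()
    print(iris.data.shape)      # (150, 4): 150 flowers, 4 measurements each
    print(iris.feature_names)   # the four columns: sepal/petal length and width
    print(iris.target.shape)    # (150,): one class label per sample

Most of this chapter is about what to do when our own data does not show up in this convenient form.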

I know you would rather jump right to the end and build the deepest neural network mankind has ever seen. But, trust me, this stuff is important! Representing our data in the right way can have a much greater influence on the performance of our supervised model than the exact parameters we choose. And we get to invent our own features, too. In this chapter, we will, therefore, go over some common feature engineering tasks. We will cover preprocessing and scaling techniques, along with dimensionality reduction. We will also learn to represent categorical variables, text features, and images.

In this chapter, we will cover the following topics:

  • Common preprocessing techniques that everyone uses but nobody talks about
  • Centering and scaling data
  • Representing categorical variables
  • Reducing the dimensions of data using techniques such as PCA
  • Representing text features
  • Learning the best way to encode images

Let's start at the top.
