Working with categorical data

Another important class of data commonly encountered is categorical data. Categorical features have discrete values that belong to a finite set of classes. These classes may be represented as text or numbers. Depending upon whether there is any order to the classes or not, categorical features are termed as ordinal and nominal respectively.

Nominal features are those categorical features that have a finite set of values but do not have any natural ordering to them. For instance, weather seasons, movie genres, and so on are all nominal features. Categorical features that have a finite set of classes with a natural ordering to them are termed as ordinal features. For instance, days of the week, dress sizes, and so on are ordinals.

Typically, any standard workflow in feature engineering involves some form of transformation of these categorical values into numeric labels and then the application of some encoding scheme on these values. Popular encoding schemes are briefly mentioned as follows:

  • One-hot encoding: This strategy creates n binary valued columns for a categorical attribute assuming there are n number of distinct categories
  • Dummy coding: This strategy creates n-1 binary valued columns for a categorical attribute assuming there are n number of distinct categories
  • Feature hashing: This strategy is leveraged where we use a hash function to add several features into a single bin or bucket (new feature), which is popularly used when we have a large number of features

Code snippets to better understand feature engineering for categorical data are available in the notebook feature_engineering_numerical_and_categorical_data.ipynb.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.219.36.41