How to do it...

Preprocessing typically refers to the stage where you read in the data and perform the basic steps that make it usable in your domain. In our case, this means getting the data into a pandas DataFrame or a NumPy array. The two convert to each other with minimal code, which is why both are so common in data science. In the following example, you'll read in a dataset and convert its categorical variables to numerical variables. This process can easily be converted to a method later in this chapter.
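To illustrate how little code the conversion between the two formats takes, here is a minimal sketch. The toy data and its column names are hypothetical and are not part of the Income dataset used in the recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"age": [39, 50, 38], "hours_per_week": [40, 13, 40]})

# DataFrame -> NumPy array (the column labels are dropped).
arr = df.to_numpy()

# NumPy array -> DataFrame (the column labels must be supplied again).
df_roundtrip = pd.DataFrame(arr, columns=df.columns)
```

Note that the array carries only the values, so the column names have to be kept separately if you need to rebuild the DataFrame.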

Here's the recipe:

  1. Import the following packages to start the work:
#!/usr/bin/env python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  1. Read the UCI Machine Learning Income data (https://archive.ics.uci.edu/ml/datasets/adult) from the /data directory. The file can be read in several ways. With no extra arguments, read_csv incorrectly treats the first row of data as the header:
df0 = pd.read_csv('/data/adult.data')
  1. Passing header=None numbers the columns instead of naming them, whereas the names argument labels each column explicitly:
df2 = pd.read_csv('/data/adult.data', names=['age', 'workclass',
    'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain',
    'capital-loss', 'hours-per-week', 'native-country', 'Label'])
  1. Create an empty dictionary:
mappings = {}
  1. Run through all columns in the CSV:
for col_name in df2.columns:
    if df2[col_name].dtype == 'object':
If a variable is categorical, pandas reads it in as the object type.
  1. Create a mapping from categorical to numerical variables:
        df2[col_name] = df2[col_name].astype('category')
        df2[col_name], mapping_index = pd.Series(df2[col_name]).factorize()
  1. Store the mappings in the dictionary:
        mappings[col_name] = {}
        # factorize() numbers values by first appearance, so index
        # mapping_index directly rather than its sorted .categories
        for i in range(len(mapping_index)):
            mappings[col_name][i] = mapping_index[i]
  1. Store a continuous tag for variables that are already numerical:
    else:
        mappings[col_name] = 'continuous'
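
The steps above can be put together as a single function. The following is a minimal sketch rather than the book's own method: the function name encode_categoricals, the three-row inline sample, and the shortened column list are all hypothetical, and the sketch tests for non-numeric columns with pd.api.types.is_numeric_dtype instead of comparing the dtype to 'object'. It also passes skipinitialspace=True, which the recipe does not use, to strip the blank that follows each comma.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout,
# trimmed to four columns so the example stays self-contained.
sample = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "50, Self-emp-not-inc, Bachelors, <=50K\n"
    "38, Private, HS-grad, >50K\n"
)
columns = ["age", "workclass", "education", "Label"]

# skipinitialspace=True is an assumption added here, not part of the recipe.
df = pd.read_csv(sample, names=columns, skipinitialspace=True)


def encode_categoricals(frame):
    """Replace non-numeric columns with integer codes; return the mappings."""
    mappings = {}
    for col_name in frame.columns:
        if not pd.api.types.is_numeric_dtype(frame[col_name]):
            # factorize() assigns codes in order of first appearance and
            # returns the unique values in that same order.
            codes, uniques = frame[col_name].factorize()
            frame[col_name] = codes
            mappings[col_name] = dict(enumerate(uniques))
        else:
            mappings[col_name] = "continuous"
    return mappings


mappings = encode_categoricals(df)
```

After the call, df holds integer codes in place of the text columns, and mappings lets you translate each code back to its original label.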

We'll cover the details of the recipes in the next section.
