How it works...

This section covers how the recipe works: which components drive this particular code snippet. As we progress through this recipe, I encourage you to work through it with me. The first thing we need to do in the script is tell the interpreter where our Python is and import our core libraries for use in the script:

#!/usr/bin/env python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

After importing the right libraries, we need to talk about the right way to read in the data. I have provided a few examples of the wrong way and the correct way to read in the data. It's important to understand how the data can be read in and what happens when you don't specify any arguments, like this:

# Notice this incorrectly reads the first line as the header

df0 = pd.read_csv('/data/adult.data')

After executing this command, you'll notice that the pandas read_csv method incorrectly treats the first row of data as the header. Next, we will tell pandas that the file has no header row. Let's look at the result:

# header=None numbers the columns instead of naming them

df1 = pd.read_csv('/data/adult.data', header = None)

The data is correctly read in except for the fact that we don't have any header names for each of the columns. Finally, by using the data description, we are able to specify the header and name the columns appropriately:

# Specifying the column names lets read_csv read the file correctly

df2 = pd.read_csv('/data/adult.data',
                  names=['age', 'workclass', 'fnlwgt', 'education',
                         'education-num', 'marital-status', 'occupation',
                         'relationship', 'race', 'sex', 'capital-gain',
                         'capital-loss', 'hours-per-week', 'native-country',
                         'Label'])
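
The difference between the three calls is easy to see on a tiny in-memory sample. The following is a minimal sketch, assuming a two-row, three-column stand-in for the file; the sample rows and the shortened list of column names are made up for illustration and are not the full dataset:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real file
sample = "39, State-gov, 77516\n50, Self-emp-not-inc, 83311\n"

# Default: the first data row is consumed as the header
df_default = pd.read_csv(io.StringIO(sample))

# header=None: every row is kept and the columns are labeled 0, 1, 2
df_no_header = pd.read_csv(io.StringIO(sample), header=None)

# names=[...]: every row is kept and the columns get readable labels
df_named = pd.read_csv(io.StringIO(sample),
                       names=['age', 'workclass', 'fnlwgt'])

print(len(df_default), list(df_default.columns))
print(len(df_no_header), list(df_no_header.columns))
print(len(df_named), list(df_named.columns))
```

With the default call, one row of data silently disappears into the column labels, which is why the row count is worth checking after any read.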

The data is now correctly read into a DataFrame and ready for use in the next section. Now that we have a set of data we can manipulate, we need to convert all of our categorical variables to numerical variables. The following method creates a numerical mapping for every categorical variable automatically. Here's the general method:

# Create an empty dictionary

mappings = {}

# Run through all columns in the CSV

for col_name in df2.columns:

    # Categorical columns are read in with the 'object' dtype

    if df2[col_name].dtype == 'object':

        # Create a mapping from categorical to numerical variables;
        # sort=True makes code i refer to the i-th label in sorted order

        df2[col_name], mapping_index = df2[col_name].factorize(sort=True)

        # Store the mappings in dictionary

        mappings[col_name] = {}
        for i in range(len(mapping_index)):
            mappings[col_name][i] = mapping_index[i]

    # Store a continuous tag for variables that are already numerical

    else:
        mappings[col_name] = 'continuous'

This block of code is fairly simple: it detects whether each column contains categorical or numerical data. One issue you will notice with this method is that it naively assumes that each column is entirely one type or the other (numerical or categorical). Part of your exercise will be to handle cases where a column contains a mix of data types.
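
As a starting point for that exercise, the following is a minimal sketch of how mixed columns could be detected; it does not decide how to handle them. The frame, its values, and the helper name find_mixed_columns are hypothetical and exist only for illustration:

```python
import pandas as pd

# Hypothetical frame: 'code' mixes integers and strings in one column
df = pd.DataFrame({
    'age': [39, 50, 38],
    'code': [7, 'unknown', 12],
    'sex': ['Male', 'Female', 'Male'],
})


def find_mixed_columns(frame):
    """Return {column: sorted type names} for columns holding more than one Python type."""
    mixed = {}
    for col_name in frame.columns:
        # Collect the distinct Python types of the non-missing values
        kinds = {type(value).__name__ for value in frame[col_name].dropna()}
        if len(kinds) > 1:
            mixed[col_name] = sorted(kinds)
    return mixed


mixed_columns = find_mixed_columns(df)
print(mixed_columns)
```

A column flagged this way would be read in with the 'object' dtype, so the mapping loop would treat it as categorical unless it is cleaned first.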

Let's walk through how this function starts:

# Create an empty dictionary

mappings = {}

# Run through all columns in the CSV

for col_name in df2.columns:

We've created an empty dictionary, and we're going to walk through each column name in the .columns attribute. Now, for every column, we are going to check whether the data is categorical and then do an operation:

    # Categorical columns are read in with the 'object' dtype

    if df2[col_name].dtype == 'object':

        # Create a mapping from categorical to numerical variables;
        # sort=True makes code i refer to the i-th label in sorted order

        df2[col_name], mapping_index = df2[col_name].factorize(sort=True)

        # Store the mappings in dictionary

        mappings[col_name] = {}
        for i in range(len(mapping_index)):
            mappings[col_name][i] = mapping_index[i]

The first step creates a mapping index using the factorize method, which assigns a numerical index to each distinct value of the categorical variable. Once we have this mapping, we create a dictionary that we can use in the future to convert back to the categorical variable. After all, an indexed value for country is meaningless without the keys to know what each number means. Next, let's see what we do when the variable isn't a categorical variable:

    # Store a continuous tag for variables that are already numerical

    else:
        mappings[col_name] = 'continuous'

We simply store the continuous tag in the dictionary for that column. In the future, a simple check for the continuous tag tells us whether a column was encoded.

Now, let's check out the results for our mapping index. The following is a portion of the mapping dictionary:

 'education-num': 'continuous',
 'fnlwgt': 'continuous',
 'hours-per-week': 'continuous',
 'marital-status': {0: ' Divorced',
  1: ' Married-AF-spouse',
  2: ' Married-civ-spouse',
  3: ' Married-spouse-absent',
  4: ' Never-married',
  5: ' Separated',
  6: ' Widowed'},

We now have a method that converts the data so that every column is numerical and ready for learning.
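
The stored mappings can also be used in the other direction. The following is a minimal sketch of converting encoded columns back to their labels, assuming a mappings dictionary shaped like the one shown above; the small encoded frame, its values, and the helper name decode are hypothetical:

```python
import pandas as pd

# Hypothetical encoded data and mapping, shaped like the mapping dictionary
mappings = {
    'age': 'continuous',
    'sex': {0: ' Female', 1: ' Male'},
}
encoded = pd.DataFrame({'age': [39, 50, 38], 'sex': [1, 1, 0]})


def decode(frame, mappings):
    """Return a copy with integer codes replaced by their original labels."""
    decoded = frame.copy()
    for col_name, mapping in mappings.items():
        # Columns tagged 'continuous' were never encoded, so skip them
        if mapping == 'continuous':
            continue
        decoded[col_name] = decoded[col_name].map(mapping)
    return decoded


decoded = decode(encoded, mappings)
print(decoded)
```

Working on a copy leaves the numerical frame untouched, so it can still be passed to a learning algorithm after the labels have been inspected.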
