Importing and exploring the world's top incomes dataset

Once you have downloaded and installed everything from the previous recipe, you can read the dataset into Python and start some preliminary analysis to get a sense of what the data looks like.

The dataset that we'll explore in this chapter was created by Facundo Alvaredo, Anthony B. Atkinson, Thomas Piketty, and Emmanuel Saez: The World Top Incomes Database, http://topincomes.g-mond.parisschoolofeconomics.eu/, accessed 10/12/2013. It contains global information about the highest incomes per country for approximately the past 100 years, gleaned from tax records.

Getting ready

If you've completed the previous recipe, you should have everything you need to continue.

How to do it...

Let's use the following sequence of steps to import the data and start our exploration of this dataset in Python:

  1. With the following snippet, we create a Python list in memory that contains dictionaries of each row, where the keys are the column names (the first row of the CSV contains the header information) and the values are the values for that particular row:
    #Reading this data is straightforward with the built-in csv module:
    import csv
    data_file = "../data/income_dist.csv"
    with open(data_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        data = list(reader)
    

    Note

    Note that the input file, income_dist.csv, might be in a different directory on your system depending on where you place it.

  2. We perform a quick check with len to reveal the number of records:
    len(data)
    2180
    
  3. When utilizing CSV data with headers, we check the field names on the CSV reader itself, as well as getting the number of variables:
    print reader.fieldnames
    ['Country', 'Year', 'Top 10% income share', ...]
    len(reader.fieldnames)
    354
    
  4. While this data is not too large, let's start using best practices when accessing it. Rather than holding all of the data in memory, we use a generator to access the data one row at a time.

    Generator functions use the yield keyword to behave as iterables; rather than returning all of the data at once, they yield data one "part" at a time in a memory-efficient iteration context. As our datasets get larger, generators are useful for filtering on demand and cleaning data as it is read.

    def dataset(path):
        with open(path, 'r') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield row
    

    Also, take note of the with open(path, 'r') as csvfile statement. This statement ensures that the CSV file is closed when the with block is exited, even (or especially) if there is an exception. Python with blocks replace the equivalent try and finally cleanup boilerplate, and are syntactically briefer while being semantically more correct.
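
    For comparison, here is a minimal hand-written equivalent (our own sketch, not part of the original recipe); the finally clause guarantees that the file is closed no matter how the block exits:

    def dataset(path):
        csvfile = open(path, 'r')
        try:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield row
        finally:
            # Runs whether iteration finishes normally or an exception is raised
            csvfile.close()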

  5. Using our new function, we can take a look to determine which countries are involved in our dataset:
    print set([row["Country"] for row in dataset(data_file)])
    set(['Canada', 'Italy', 'France', 'Netherlands', 'Ireland', ...])
    
  6. We can also inspect the range of years that this dataset covers, as follows:
    print min(set([int(row["Year"]) for row in dataset(data_file)]))
    1875
    print max(set([int(row["Year"]) for row in dataset(data_file)]))
    2010
    

    In both of the previous examples, we used a Python list comprehension to generate a set. A comprehension is a concise expression that generates an iterable, much like the memory-safe generators discussed earlier: the output expression comes first, followed by the for keyword, the iterable that supplies the variable, and an optional if condition. In Python 2.7, set and dictionary comprehensions also exist (a dictionary sketch follows the next snippet). The previous country set could also be expressed as follows:

    {row["Country"] for row in dataset(data_file)}set(['Canada', 'Italy', 'France', 'Netherlands', 'Ireland', ...])
    
  7. Finally, let's filter just the data for the United States so we can analyze it exclusively:
    filter(lambda row: row["Country"] == "United States",
               dataset(data_file))
    

    The Python filter function creates a list of all the elements of a sequence or iterable (the second argument) for which the function passed as the first argument returns true. In this case, we use an anonymous function (a lambda) to check whether the value in the row's Country column is equal to United States.
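
    The same filter can also be written as a list comprehension, which many find more readable; this equivalent sketch produces the same list of rows:

    [row for row in dataset(data_file) if row["Country"] == "United States"]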

  8. With this initial discovery and exploration of the dataset, we can now take a look at some of the data using matplotlib, one of the main scientific plotting packages available for Python and very similar to the plotting capabilities of MATLAB:
    import csv
    import numpy as np
    import matplotlib.pyplot as plt
    
    def dataset(path, filter_field=None, filter_value=None):
        with open(path, 'r') as csvfile:
            reader = csv.DictReader(csvfile)
            if filter_field:
                for row in filter(lambda row: row[filter_field]==filter_value, reader):
                    yield row
            else:
                for row in reader:
                    yield row
    
    def main(path):
        data  = [(row["Year"], float(row["Average income per tax unit"]))
                 for row in dataset(path, "Country", "United States")]
    
        width = 0.35
        ind   = np.arange(len(data))
        fig   = plt.figure()
        ax    = plt.subplot(111)
    
    ax.bar(ind, list(d[1] for d in data), width)
        ax.set_xticks(np.arange(0, len(data), 4))
        ax.set_xticklabels(list(d[0] for d in data)[0::4], rotation=45)
        ax.set_ylabel("Income in USD")
        plt.title("U.S. Average Income 1913-2008")
    
        plt.show()
    
    if __name__ == "__main__":
        main("income_dist.csv")
    

    The preceding snippet will give us the following output:

    (The output is a bar chart titled U.S. Average Income 1913-2008.)

    The preceding example of data exploration with Python should seem familiar from many of the R chapters. Loading the dataset, filtering, and computing ranges required a few more lines of code and specific typecasting, but we quickly created analyses in a memory-safe fashion.

  9. When we moved on to creating charts, we started using NumPy and matplotlib a bit more. NumPy can be used in a very similar fashion to R, to load data from a CSV file to an array in memory and dynamically determine the type of each column. To do this, the following two module functions can be used:
    • genfromtxt: This function creates an array from tabular data stored in a text file, using two main loops. The first converts each line of the file into a sequence of strings, and the second converts each string to an appropriate datatype. It is a bit slower and less memory efficient than simpler readers, but the result is a convenient data table stored in memory. This function also handles missing data, which other faster and simpler functions cannot.
    • recfromcsv: This function is a helper based on genfromtxt, with default arguments set for reading CSV files.

    Have a look at the following snippet:

    import numpy as np
    dataset = np.recfromcsv(data_file, skip_header=1)
    dataset
    array([[             nan,   1.93200000e+03,              nan, ...,
                     nan,   1.65900000e+00,   2.51700000e+00],
       [             nan,   1.93300000e+03,              nan, ...,
                     nan,   1.67400000e+00,   2.48400000e+00],
       [             nan,   1.93400000e+03,              nan, ...,
                     nan,   1.65200000e+00,   2.53400000e+00],
       ..., 
       [             nan,   2.00600000e+03,   4.52600000e+01, ...,
          1.11936337e+07,   1.54600000e+00,   2.83000000e+00],
       [             nan,   2.00700000e+03,   4.55100000e+01, ...,
          1.19172976e+07,   1.53000000e+00,   2.88500000e+00],
       [             nan,   2.00800000e+03,   4.56000000e+01, ...,
          9.14119000e+06,   1.55500000e+00,   2.80300000e+00]])
    

    The first argument to the function should be the data source: either a string that points to a local or remote file, or a file-like object with a read method. URLs are downloaded to the current working directory before they are loaded. The input can also be a compressed file; the function recognizes gzip and bzip2, provided the files have the .gz or .bz2 extensions. Notable optional arguments to genfromtxt include delimiter, which defaults to , (comma) in recfromcsv; skip_header and skip_footer, which take the number of lines to skip at the top or bottom, respectively; and dtype, which specifies the datatype of the cells. By default, dtype is None, and NumPy will attempt to detect the correct format.
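
    As a brief sketch of the names argument in action (using a hypothetical simple.csv whose header line contains no embedded commas; see the caveat in the steps that follow):

    # Hypothetical file; the header row supplies the column names, so the
    # columns become addressable by name and skip_header is not needed.
    simple = np.genfromtxt("simple.csv", delimiter=",", names=True)
    simple["Year"]   # the column whose header reads Year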

  10. We can now get an overall sense of the scope of our data table:
    dataset.size
    771720
    dataset.shape
    (2179, 354)
    

    Note

    Depending on your version of NumPy, you might see slightly different output. The dataset.size statement might report back the number of rows of data (2179), and the shape might output as (2179,).

    The size property on ndarray returns the number of elements in the matrix. The shape property returns a tuple of the dimensions of our array. CSVs are naturally two-dimensional; therefore, the (m, n) tuple indicates the number of rows and columns, respectively.

    However, there are a couple of gotchas with this approach. First, note that we had to skip our header line; genfromtxt does allow named columns by setting the keyword argument names to True (in which case you won't set skip_header=1). Unfortunately, in this particular dataset the column names contain commas. The csv reader handles this correctly, since strings that contain commas are quoted, but genfromtxt is not a CSV reader in general. To fix this, either the headers have to be fixed or other names need to be supplied. Secondly, the Country column has been reduced to NaN, and the Year column has been turned into a floating-point number, neither of which is ideal.

  11. A manual fix on the dataset is necessary, and this is not uncommon. Since we know that there are 354 columns and the first two columns are Country and Year, we can precompute our column names and datatypes:
    names = ["country", "year"]
    names.extend(["col%i" % (idx+1) for idx in xrange(352)])
    dtype = "S64,i4," + ",".join(["f18" for idx in xrange(352)])
    
    dataset = np.genfromtxt(data_file, dtype=dtype, names=names, delimiter=",", skip_header=1, autostrip=2)
    

    We name the first two columns country and year and assign them the datatypes S64 (a 64-character string) and i4 (a 4-byte integer), respectively. Each of the remaining columns is named coln, where n is an integer from 1 to 352, with the datatype f8 (an 8-byte, or 64-bit, float). The wide string and 64-bit floats allow us to capture as much data as possible, including exponential floating-point representations.

    Unfortunately, as we look through the data, we see a lot of nan values. NaN, Not a Number, is a fixture of floating-point arithmetic used to represent a value that is neither a number nor infinity. Missing data is a very common issue in the data wrangling and cleaning stage of the pipeline; this dataset simply contains many missing or invalid entries, which makes sense for historical data and for countries that may not have collected particular measures in particular years.
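
    A couple of lines at the interpreter show why nan values are troublesome in computations:

    np.nan == np.nan   # False; NaN is not even equal to itself
    np.nan + 1         # nan; NaN propagates through arithmetic, poisoning sums and means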

  12. In order to clean the data, we use a NumPy masked array, which is actually a combination of a standard NumPy array and a mask, a set of Boolean values that indicate whether the data in that position should be used in computations or not. This can be done as follows:
    import numpy.ma as ma
    ma.masked_invalid(dataset['col1'])
    masked_array(data = [-- -- -- ..., 45.2599983215332 45.5099983215332 45.599998474121094],
                 mask = [ True  True  True ..., False False False],
                 fill_value = 1e+20)
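
    The masked array can then be used in computations directly, and the masked entries are simply ignored; for example (a small sketch building on the array above):

    col1 = ma.masked_invalid(dataset['col1'])
    col1.mean()    # the mean of the valid entries only
    col1.count()   # the number of unmasked values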
    

How it works...

Our dataset function has been modified to filter on a single field and value if desired. If no filter has been specified, it generates the entire CSV. The main piece of interest is what happens in the main function. Here, we generate a bar chart of average incomes in the United States per year using matplotlib. Let's walk through the code.

We collect our data as (year, avg_income) tuples in a list comprehension that utilizes our special dataset method to filter data only for the United States.

Note

We have to cast the average income per tax unit to a float in order to compute on it. In this case, we leave the year as a string, since it simply acts as a label; however, in order to perform datetime computations, we might want to convert that year to a date using datetime.strptime(row['Year'], '%Y').date().
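
A minimal sketch of the conversion mentioned in the preceding note:

from datetime import datetime
datetime.strptime("1913", "%Y").date()   # datetime.date(1913, 1, 1)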

After we have performed our data collection, filtering, and conversions, we set up the chart. The width is the maximum width of a bar. The ind iterable (an ndarray) gives the x axis location of each bar; in this case, we want one location for every data point in our set. The np.arange function is similar to the built-in xrange function; it returns an iterable (ndarray) of evenly spaced values in a given interval. In this case, we provide a stop value that is the length of the list and use the default start value of 0 and step size of 1, but these can also be specified. Unlike xrange, arange accepts floating-point arguments, and it is typically much faster than instantiating the full list of values in pure Python.
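
A couple of interpreter lines make arange's behavior concrete:

np.arange(5)           # array([0, 1, 2, 3, 4])
np.arange(0, 2, 0.5)   # array([ 0. ,  0.5,  1. ,  1.5]); float steps are allowed
np.arange(0, 12, 4)    # array([0, 4, 8]); the start, stop, step form used for our ticks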

The figure and subplot functions from the matplotlib.pyplot module create the base figure and axes, respectively. The figure function creates a new figure, or returns a reference to a previously created one. The subplot function returns a subplot axis positioned by a grid definition with the following arguments: the number of rows, the number of columns, and the plot number. The function offers a convenience when all three arguments are less than 10: simply supply a three-digit number with the respective values; for example, plt.subplot(111) creates 1 x 1 axes as subplot 1.
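
For instance, the following two calls are equivalent ways to ask for the first plot in a two-row, one-column grid:

ax1 = plt.subplot(2, 1, 1)   # number of rows, number of columns, plot number
ax1 = plt.subplot(211)       # the same grid position, as a three-digit shorthand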

We then use the subplot to create a bar chart from our data. Note the use of another comprehension that passes the income values from our dataset along with the indices we created with np.arange. When setting the x axis labels, however, we notice that adding every year as an individual label renders the x axis unreadable. Instead, we add ticks every four years, starting with the first year. In this case, we use a step size of 4 in np.arange to set our ticks and, similarly, slice the Python list of labels to take every fourth one. For a given list, a slice has the following form:

mylist[s:e:t]

The slice of the list starts at s, ends before e, and has the step size t. Negative numbers are also supported and are counted from the end of the list; for example, mylist[-1] returns the last item of the list.
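
A short sketch with hypothetical values shows the stepping and negative indexing in one place:

years = ['1913', '1914', '1915', '1916', '1917', '1918']
years[0::4]   # ['1913', '1917']; every fourth item, starting from the first
years[-1]     # '1918'; the last item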

There's more...

NumPy is an incredibly useful and powerful library, but we should note some very important differences. The list datatype in Python is wildly different from the numpy array. The Python list can contain any number of different datatypes, including lists. Thus, the following example list is perfectly valid:

python_list = ['bob', 5.1, True, 1, [5, 3, 'sam']]

Underneath the hood, the Python list contains pointers to the memory locations of the elements of the list. To access the first element of the list, Python goes to the memory location for the list and grabs the first value stored there. This value is a memory address for the first element. Python then jumps to this new memory location to grab the value for the actual first element. Thus, "grabbing" the first element of the list requires two memory lookups.

NumPy arrays are very much like arrays in C. They must contain a single datatype, which allows the array to be stored in a contiguous block of memory, and this makes reading the array significantly faster. When reading the first element of the array, Python goes to the appropriate memory address and grabs the actual value directly. When the next element is needed, it sits right next to the first element in memory, which makes sequential reads much faster.
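
The single-datatype rule is easy to see at the interpreter: NumPy silently coerces mixed inputs to a common type (a quick sketch; exact dtypes vary by platform):

import numpy as np
np.array([1, 2, 3]).dtype      # dtype('int64') on a typical 64-bit platform
np.array([1, 2.5, 3]).dtype    # dtype('float64'); the integers are upcast
np.array([1, 'sam']).dtype     # a string dtype; every element becomes a string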

See also
