Once you have downloaded and installed everything in the previous recipe, you can read the dataset with Python and then start doing some preliminary analysis to get a sense of what the data you have looks like.
The dataset that we'll explore in this chapter was created by Alvaredo, Facundo, Anthony B. Atkinson, Thomas Piketty, and Emmanuel Saez, The World Top Incomes Database, http://topincomes.g-mond.parisschoolofeconomics.eu/, 10/12/2013. It contains global information about the highest incomes per country for approximately the past 100 years, gleaned from tax records.
Let's use the following sequence of steps to import the data and start our exploration of this dataset in Python:
Reading this data is straightforward with the built-in `csv` module:

```python
import csv

data_file = "../data/income_dist.csv"

with open(data_file, 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    data = list(reader)
```
We can use `len` to reveal the number of records:

```python
>>> len(data)
2180
```

The reader also exposes the column names:

```python
>>> print reader.fieldnames
['Country', 'Year', 'Top 10% income share', ...]
>>> len(reader.fieldnames)
354
```
Generator functions act as iterables: rather than returning all of the data at once, they yield it one row at a time in a memory-efficient iteration context. As our datasets get larger, it's useful to use generators to perform filtering on demand and clean data as you read it.
```python
def dataset(path):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row
```
Also, take note of the `with open(path, 'r') as csvfile` statement. It ensures that the CSV file is closed when the `with` block is exited, even (or especially) if there is an exception. `with` blocks replace the equivalent `try`/`finally` boilerplate with a construct that is syntactically brief yet semantically more correct.
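For comparison, here is a sketch of roughly what a `with` block expands to. This is a simplification: the real context-manager protocol calls `__enter__` and `__exit__` under the hood, and the filename used here is made up for illustration.

```python
# Roughly what `with open(path) as f:` does for you.
f = open("example.txt", "w")
try:
    f.write("hello\n")
finally:
    # Runs even if the write raises an exception.
    f.close()

# The equivalent, shorter with-block:
with open("example.txt", "w") as f:
    f.write("world\n")
```

Either way, the file is guaranteed to be closed when the block finishes.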
```python
>>> print set([row["Country"] for row in dataset(data_file)])
set(['Canada', 'Italy', 'France', 'Netherlands', 'Ireland', ...])
```
```python
>>> print min(set([int(row["Year"]) for row in dataset(data_file)]))
1875
>>> print max(set([int(row["Year"]) for row in dataset(data_file)]))
2010
```
In both of these previous examples, we used a Python list comprehension to generate a set. A comprehension is a concise expression that generates an iterable, much like the earlier memory-safe generators. The output expression is specified first, followed by the `for` keyword, the iterable that supplies the variable, and an optional `if` condition. In Python 2.7, set and dictionary comprehensions also exist. The previous country set could also be expressed as follows:
```python
>>> {row["Country"] for row in dataset(data_file)}
set(['Canada', 'Italy', 'France', 'Netherlands', 'Ireland', ...])
```
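As a quick self-contained illustration using an inline list of rows (made up here, rather than the income dataset), set and dictionary comprehensions look like this:

```python
rows = [
    {"Country": "Canada", "Year": "2008"},
    {"Country": "France", "Year": "2008"},
    {"Country": "Canada", "Year": "2009"},
]

# Set comprehension: unique country names.
countries = {row["Country"] for row in rows}

# Dictionary comprehension: country -> last Year value seen
# (later entries overwrite earlier ones).
latest = {row["Country"]: row["Year"] for row in rows}
```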
```python
filter(lambda row: row["Country"] == "United States", dataset(data_file))
```
The Python `filter` function creates a list from all of the values of a sequence or iterable (the second parameter) for which the function specified by the first parameter returns true. In this case, we use an anonymous function (a `lambda`) to check whether the value in the specified row's `Country` column is equal to `United States`.
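Note that in Python 3, `filter` returns a lazy iterator rather than a list, so you must wrap it in `list()` to materialize all matches. A small sketch with made-up rows:

```python
rows = [
    {"Country": "United States", "Year": "2008"},
    {"Country": "Canada", "Year": "2008"},
    {"Country": "United States", "Year": "2009"},
]

matches = filter(lambda row: row["Country"] == "United States", rows)

# In Python 3, `filter` is lazy -- materialize it with list().
us_rows = list(matches)
```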
```python
import csv

import numpy as np
import matplotlib.pyplot as plt


def dataset(path, filter_field=None, filter_value=None):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        if filter_field:
            for row in filter(lambda row:
                              row[filter_field] == filter_value, reader):
                yield row
        else:
            for row in reader:
                yield row


def main(path):
    data = [(row["Year"], float(row["Average income per tax unit"]))
            for row in dataset(path, "Country", "United States")]

    width = 0.35
    ind = np.arange(len(data))
    fig = plt.figure()
    ax = plt.subplot(111)
    ax.bar(ind, list(d[1] for d in data))
    ax.set_xticks(np.arange(0, len(data), 4))
    ax.set_xticklabels(list(d[0] for d in data)[0::4], rotation=45)
    ax.set_ylabel("Income in USD")
    plt.title("U.S. Average Income 1913-2008")
    plt.show()


if __name__ == "__main__":
    main("income_dist.csv")
```
The preceding snippet produces a bar chart of average U.S. income per year from 1913 to 2008.
The preceding example of data exploration with Python should seem familiar from many of the R chapters. Loading the dataset, filtering, and computing ranges required a few more lines of code and specific typecasting, but we quickly created analyses in a memory-safe fashion.
NumPy provides two related functions for loading this kind of tabular data:

- `genfromtxt`: This function creates an array from tabular data stored in a text file using two main loops. The first converts each line of the file to a sequence of strings, and the second converts each string to an appropriate datatype. It is a bit slower and not as memory efficient as simpler loaders, but the result is a convenient data table stored in memory. This function also handles missing data, which other faster and simpler functions cannot.
- `recfromcsv`: This is a helper function based on `genfromtxt` whose default arguments are set for reading CSV files.

Have a look at the following snippet:
```python
>>> import numpy as np
>>> dataset = np.recfromcsv(data_file, skip_header=1)
>>> dataset
array([[            nan,  1.93200000e+03,             nan, ...,
                    nan,  1.65900000e+00,  2.51700000e+00],
       [            nan,  1.93300000e+03,             nan, ...,
                    nan,  1.67400000e+00,  2.48400000e+00],
       [            nan,  1.93400000e+03,             nan, ...,
                    nan,  1.65200000e+00,  2.53400000e+00],
       ...,
       [            nan,  2.00600000e+03,  4.52600000e+01, ...,
         1.11936337e+07,  1.54600000e+00,  2.83000000e+00],
       [            nan,  2.00700000e+03,  4.55100000e+01, ...,
         1.19172976e+07,  1.53000000e+00,  2.88500000e+00],
       [            nan,  2.00800000e+03,  4.56000000e+01, ...,
         9.14119000e+06,  1.55500000e+00,  2.80300000e+00]])
```
The first argument to the function should be the data source: either a string that points to a local or remote file, or a file-like object with a `read` method. URLs will be downloaded to the current working directory before they are loaded. Additionally, the input can be either text or a compressed file; the function recognizes gzip and bzip2, but these files need to have the `.gz` or `.bz2` extensions to be readable. Notable optional arguments to `genfromtxt` include `delimiter`, which is `,` (comma) by default in `recfromcsv`; `skip_header` and `skip_footer`, which take an optional number of lines to skip from the top or bottom, respectively; and `dtype`, which specifies the datatype of the cells. By default, `dtype` is `None`, and NumPy will attempt to detect the correct format.
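To see these arguments in action without the income file, here is a small sketch using an in-memory CSV. The column names and values below are made up for illustration; the `encoding` argument is needed in recent NumPy versions to get plain strings back for text columns.

```python
import io

import numpy as np

# A tiny CSV with one header line; the data is hypothetical.
text = """country,year,share
Canada,1990,32.5
France,1991,33.1
"""

data = np.genfromtxt(
    io.StringIO(text),
    delimiter=",",      # split fields on commas, as recfromcsv does
    skip_header=1,      # skip the header line
    dtype=None,         # let NumPy guess a datatype per column
    encoding="utf-8",   # decode text columns to str
)
```

With `dtype=None`, the result is a structured array of two records whose fields default to `f0`, `f1`, `f2`.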
```python
>>> dataset.size
771720
>>> dataset.shape
(2179, 354)
```
The `size` property on an `ndarray` returns the number of elements in the matrix. The `shape` property returns a tuple of the dimensions of the array. CSVs are naturally two-dimensional, so the `(m, n)` tuple indicates the number of rows and columns, respectively.
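The relationship between `size` and `shape` is easy to check on a small array (a quick sketch):

```python
import numpy as np

# A 3 x 4 matrix built from the numbers 0..11.
matrix = np.arange(12).reshape(3, 4)

rows, cols = matrix.shape   # the (m, n) dimension tuple
total = matrix.size         # total element count, m * n
```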
However, there are a couple of gotchas with this method. First, note that we had to skip our header line; `genfromtxt` does allow named columns by setting the keyword argument `names=True` (in which case you won't set `skip_header=1`). Unfortunately, in this particular dataset, the column names may contain commas. The CSV reader deals with this correctly, since strings containing commas are quoted, but `genfromtxt` is not a CSV reader in general. To fix this, either the headers have to be repaired or some other names need to be supplied. Secondly, the `Country` column has been reduced to `NaN`, and the `Year` column has been turned into a floating point number, neither of which is ideal.
To fix the `Country` and `Year` columns, we can precompute our column names and datatypes:

```python
names = ["country", "year"]
names.extend(["col%i" % (idx + 1) for idx in xrange(352)])
dtype = "S64,i4," + ",".join(["f18" for idx in xrange(352)])

dataset = np.genfromtxt(data_file, dtype=dtype, names=names,
                        delimiter=",", skip_header=1, autostrip=2)
```
We name the first two columns `country` and `year`, respectively, assigning the `country` column the datatype `S64`, or string-64, and the `year` column `i4`, or integer-4. For the rest of the columns, we assign the name `coln`, where `n` is an integer from 1 to 352, and the datatype `f18`, or float-18. These character lengths allow us to capture as much data as possible, including exponential floating point representations.
Unfortunately, as we look through the data, we can see a lot of `nan` values. `nan` represents Not a Number, a fixture of floating point arithmetic used for values that are neither numbers nor equivalent to infinity. Missing data is a very common issue in the data wrangling and cleaning stage of the pipeline. It appears that the dataset contains many missing or invalid entries, which makes sense given the historical span of the data and the countries that may not have had effective data collection for given columns.
```python
>>> import numpy.ma as ma
>>> ma.masked_invalid(dataset['col1'])
masked_array(data = [-- -- -- ..., 45.2599983215332 45.5099983215332
 45.599998474121094],
             mask = [ True  True  True ..., False False False],
       fill_value = 1e+20)
```
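Masked arrays are useful because aggregations skip the masked entries entirely. A small sketch with made-up values standing in for one of the dataset's columns:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical column with missing (NaN) entries.
values = np.array([np.nan, np.nan, 45.26, 45.51, 45.60])

masked = ma.masked_invalid(values)

# Statistics ignore the masked entries.
mean = masked.mean()     # average of the three valid values only
valid = masked.count()   # number of unmasked entries
```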
Our dataset function has been modified to filter on a single field and value if desired. If no filter has been specified, it generates the entire CSV. The main piece of interest is what happens in the main function. Here, we generate a bar chart of average incomes in the United States per year using matplotlib. Let's walk through the code.
We collect our data as `(year, avg_income)` tuples in a list comprehension that uses our special `dataset` function to filter data only for the United States.
We have to cast the average income per tax unit to a float in order to compute on it. In this case, we leave the year as a string, since it simply acts as a label; however, in order to perform `datetime` computations, we might want to convert that year to a date using `datetime.strptime(row['Year'], '%Y').date()`.
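For instance, a sketch of that year-to-date conversion on a single hypothetical row:

```python
from datetime import datetime

row = {"Year": "1913"}  # a made-up row standing in for the dataset

# Parse the four-digit year string; strptime defaults the
# month and day to January 1, and .date() drops the time part.
year_date = datetime.strptime(row["Year"], "%Y").date()
```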
After we have performed our data collection, filtering, and conversions, we set up the chart. The width is the maximum width of a bar. The `ind` iterable (an `ndarray`) holds the x-axis locations for each bar; in this case, we want one location for every data point in our set. The NumPy `np.arange` function is similar to the built-in `xrange` function; it returns an `ndarray` of evenly spaced values in the given interval. In this case, we provide a stop value that is the length of the list and use the default start value of `0` and step size of `1`, but these can also be specified. Unlike `xrange`, `arange` allows floating point arguments, and it is typically much faster than simply instantiating the full array of values.
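A quick sketch of `np.arange` with default and explicit arguments:

```python
import numpy as np

# Default start 0 and step 1, like range(5) / xrange(5).
ind = np.arange(5)

# Explicit start, stop, and step -- including a float step,
# which the built-in range does not allow.
ticks = np.arange(0, 2.0, 0.5)
```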
The `figure` and `subplot` functions from the `matplotlib.pyplot` module create the base figure and axes, respectively. The `figure` function creates a new figure, or returns a reference to a previously created one. The `subplot` function returns a subplot axis positioned by the grid definition given in its arguments: the number of rows, the number of columns, and the plot number. The function has a convenient shorthand when all three arguments are less than 10: simply supply a three-digit number with the respective values; for example, `plt.subplot(111)` creates a 1 x 1 grid of axes and selects the first subplot.
We then use the subplot to create a bar chart from our data. Note the use of another comprehension that passes the income values from our dataset along with the indices we created with `np.arange`. On setting the x-axis labels, however, we notice that if we add all years as individual labels, the x axis is unreadable. Instead, we add ticks every four years, starting with the first year. In this case, you can see that we use a step size of `4` in `np.arange` to set our ticks and, similarly, use a slice on the Python list of labels to step through every fourth label. For example, for a given list, we will use:
```python
mylist[s:e:t]
```
The slice of the list starts at index `s`, ends just before index `e` (the end is exclusive), and has step size `t`. Negative numbers are also supported; used as indices, they count from the end of the list, so, for example, `mylist[-1]` returns the last item of the list.
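A few concrete slices make the `[s:e:t]` behavior plain (a quick sketch):

```python
mylist = [10, 20, 30, 40, 50, 60, 70]

every_other = mylist[0::2]    # start 0, default end, step 2
middle = mylist[1:4]          # indices 1, 2, 3 (end is exclusive)
last = mylist[-1]             # negative index counts from the end
reversed_copy = mylist[::-1]  # negative step walks backwards
```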
NumPy is an incredibly useful and powerful library, but we should note some very important differences. The list datatype in Python is wildly different from the NumPy array. A Python list can contain any number of different datatypes, including other lists. Thus, the following example list is perfectly valid:

```python
python_list = ['bob', 5.1, True, 1, [5, 3, 'sam']]
```
Underneath the hood, the Python list contains pointers to the memory locations of the elements of the list. To access the first element of the list, Python goes to the memory location for the list and grabs the first value stored there. This value is a memory address for the first element. Python then jumps to this new memory location to grab the value for the actual first element. Thus, "grabbing" the first element of the list requires two memory lookups.
NumPy arrays are very much like arrays in C. They must contain a single datatype, which allows the array to be stored in a contiguous block of memory and makes reading the array significantly faster. When reading the first element of the array, Python goes to the appropriate memory address and finds the actual value stored there. When the next element in the array is needed, it sits right next to the first element in memory, which makes reading much faster.
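This single-datatype constraint is easy to observe: hand NumPy mixed values and it upcasts everything to one common type, with every element occupying the same fixed number of bytes (a quick sketch):

```python
import numpy as np

# Mixing an int with a float: everything becomes float64.
arr = np.array([1, 5.1, 2])

# Fixed-size elements are what make contiguous storage possible.
item_bytes = arr.itemsize   # bytes per element (8 for float64)
total_bytes = arr.nbytes    # 3 elements * 8 bytes each
```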