Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data, enabling fast data loading, manipulation, alignment, and merging, among other functions. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.
Why should you use a programming language like Python and a tool like Pandas to work with data? It boils down to automation and reproducibility. If a particular set of analyses needs to be performed on multiple data sets, a programming language can automate the analysis on each of them. Although many spreadsheet programs have their own macro programming languages, many users do not use them. Furthermore, not all spreadsheet programs are available on all operating systems. Performing data analysis using a programming language forces the user to maintain a running record of all steps performed on the data. I, like many people, have accidentally hit a key while viewing data in a spreadsheet program, only to find out that my results no longer make any sense due to bad data. This is not to say that spreadsheet programs are bad or that they do not have their place in the data workflow; they do. Rather, my point is that there are better and more reliable tools out there.
1. Prior knowledge needed (appendix)
a. relative directories
b. calling functions
c. dot notation
d. primitive Python containers
e. variable assignment
2. This chapter
a. loading data
b. subsetting data
c. slicing
d. basic Pandas data structures (Series, DataFrame)
e. how they resemble other Python containers (list, numpy.ndarray)
f. basic indexing
This chapter will cover:
1. Loading a simple delimited data file
2. Counting how many rows and columns were loaded
3. Determining which type of data was loaded
4. Looking at different parts of the data by subsetting rows and columns
When given a data set, we first load it and begin looking at its structure and contents. The simplest way of looking at a data set is to examine and subset specific rows and columns. We can see which type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics.
Since Pandas is not part of the Python standard library, we have to first tell Python to load (import) the library.
import pandas
With the library loaded, we can use the read_csv function to load a CSV data file. To access the read_csv function from Pandas, we use dot notation. More on dot notation can be found in Appendices H, O, and S.
About the Gapminder Data Set
The Gapminder data set originally comes from www.gapminder.org. The version of the Gapminder data used in this book was prepared by Jennifer Bryan from the University of British Columbia. The repository can be found at: www.github.com/jennybc/gapminder.
# by default the read_csv function will read a comma-separated file;
# our Gapminder data are separated by tabs,
# so we use the sep parameter and indicate a tab with \t
df = pandas.read_csv('../data/gapminder.tsv', sep='\t')
# we use the head method so Python shows us only the first 5 rows
print(df.head())
When working with Pandas functions, it is common practice to give pandas the alias pd. Thus the following code is equivalent to the preceding example:
import pandas as pd
df = pd.read_csv('../data/gapminder.tsv', sep='\t')
We can check whether we are working with a Pandas DataFrame by using the built-in type function (i.e., it comes directly from Python, not from any package such as Pandas).
print(type(df))
The type function is handy when you begin working with many different types of Python objects and need to know which object you are currently working with.
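As a quick illustration, type works on any Python object, not just dataframes:
# type reports the class of any Python object
print(type(1))    # <class 'int'>
print(type('a'))  # <class 'str'>
print(type(df))   # <class 'pandas.core.frame.DataFrame'>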
The data set we loaded is currently saved as a Pandas DataFrame object and is relatively small. Every DataFrame object has a shape attribute that gives us the number of rows and columns of the DataFrame.
# get the number of rows and columns
print(df.shape)
The shape attribute returns a tuple (Appendix J) in which the first value is the number of rows and the second is the number of columns. From the preceding results, we see that our Gapminder data set has 1704 rows and 6 columns.
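Because shape is a tuple, its values can also be unpacked into separate variables; a small sketch (the variable names are ours):
# unpack the shape tuple into row and column counts
num_rows, num_cols = df.shape
print(num_rows)  # 1704
print(num_cols)  # 6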
Since shape is an attribute of the dataframe, and not a function or method of the DataFrame, it does not have parentheses after the period. If you made the mistake of putting parentheses after the shape attribute, it would return an error.
# shape is an attribute, not a method
# this will cause an error
print(df.shape())
Typically, when first looking at a data set, we want to know how many rows and columns there are (we just did that). To get the gist of which information it contains, we look at the columns. The column names, like shape, are given by the columns attribute of the dataframe object.
# get column names
print(df.columns)
The Pandas DataFrame object is similar to the DataFrame-like objects found in other languages (e.g., Julia and R). Each column (Series) has to be the same type, whereas each row can contain mixed types. In our current example, we can expect the country column to be all strings and the year column to be all integers. However, it’s best to make sure that is the case by using the dtypes attribute or the info method. Table 1.1 compares the types in Pandas to the types in native Python.
# get the dtype of each column
print(df.dtypes)
# get more information about our data
print(df.info())
Pandas Type | Python Type | Description
object | string | Most common data type
int64 | int | Whole numbers
float64 | float | Numbers with decimals
datetime64 | datetime | Dates and times
Now that we’re able to load a simple data file, we want to be able to inspect its contents. We could print out the contents of the dataframe, but with today’s data, there are often too many cells to make sense of all the printed information. Instead, the best way to look at our data is to inspect it in parts by looking at various subsets of the data. We already saw that we can use the head method of a dataframe to look at the first five rows of our data. This is useful to see whether our data loaded properly and to get a sense of each column, its name, and its contents. Sometimes, however, we may want to see only particular rows, columns, or values from our data.
Before continuing, make sure you are familiar with Python containers (Appendices I, J, and K).
We can select columns by name, position, or range. If we want only a specific column from our data, we can access it using square brackets.
# just get the country column and save it to its own variable
country_df = df['country']
# show the first 5 observations
print(country_df.head())
# show the last 5 observations
print(country_df.tail())
To specify multiple columns by column name, we need to pass in a Python list between the square brackets. This may look a bit strange, since there will be two sets of square brackets.
# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]
print(subset.head())
print(subset.tail())
Again, you can opt to print the entire subset dataframe. We won’t use this option in this book, as it would take up an unnecessary amount of space.
At times, you may want to get a particular column by its position rather than its name. For example, you may want the first column (“country”) and the third column (“year”), or just the last column (“gdpPercap”).
As of Pandas v0.20, you are no longer able to pass a list of integers into the square brackets to subset columns. For example, df[[1]], df[[0, -1]], and df[list(range(5))] no longer work. There are other ways of subsetting columns (Section 1.3.3), but they build on the technique used to subset rows, as sketched below.
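As a preview of Section 1.3.3, here is a minimal sketch of the replacement syntax, using the iloc attribute introduced below (the column positions here are only illustrative):
# preview: subset columns by position with iloc
# the colon selects all rows; the list selects column positions
subset = df.iloc[:, [0, -1]]
print(subset.head())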
Rows can be subset in multiple ways, by row name or row index. Table 1.2 gives a quick overview of the various methods.
Subset method | Description
loc | Subset based on index label (row name)
iloc | Subset based on row index (row number)
ix | Subset based on index label or row index (no longer works as of Pandas v0.20)
Let’s take a look at part of our Gapminder data.
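Here we print the first few rows again (the same head call used earlier) so the row index is visible:
# print the head of the dataframe; the row index appears on the left
print(df.head())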
On the left side of the printed dataframe, we see what appear to be row numbers. This nameless column of values is the row index label of the dataframe. Think of the index label as being like a column name, but for rows instead of columns. By default, Pandas will fill in the index labels with the row numbers (note that it starts counting from 0). A common example where the row index labels are not the same as the row number is when we work with time series data. In that case, the index label will be a timestamp of sorts. For now, though, we will keep the default row number values.
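As an illustration (not something we will carry forward in the running example), we could replace the default numeric labels with values from one of the columns by using the set_index method:
# illustrative only: use the country column as the row index labels
indexed_df = df.set_index('country')
print(indexed_df.head())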
We can use the loc attribute of the dataframe to subset rows based on the index label.
# get the first row
# Python counts from 0
print(df.loc[0])
# get the 100th row
# Python counts from 0
print(df.loc[99])
# get the last row
# this will cause an error
print(df.loc[-1])
Note that passing -1 into loc will cause an error, because loc is actually looking for the row index label (row name) ‘-1’, which does not exist in our example. Instead, we can use a bit of Python to calculate the number of rows and pass that value into loc.
# get the last row (correctly)
# use the first value given from shape to get the number of rows
number_of_rows = df.shape[0]
# subtract 1 from the value since we want the last index value
last_row_index = number_of_rows - 1
# now do the subset using the index of the last row
print(df.loc[last_row_index])
Alternatively, we can use the tail method to return the last 1 row, instead of the default 5.
# there are many ways of doing what you want
print(df.tail(n=1))
Notice that when we used tail() and loc, the results were printed out differently. Let’s look at which type is returned when we use these methods.
subset_loc = df.loc[0]
subset_head = df.head(n=1)
# type using loc of 1 row
print(type(subset_loc))
# type using head of 1 row
print(type(subset_head))
At the beginning of this chapter, we mentioned that Pandas introduces two new data types into Python. Depending on which method we use and how many rows we return, Pandas will return a different object. The way an object gets printed to the screen can be an indicator of its type, but it’s always best to use the type function to be sure. We go into more detail about these objects in Chapter 2.
Subsetting Multiple Rows
Just as for columns, we can select multiple rows.
# select the first, 100th, and 1000th rows
# note the double square brackets similar to the syntax used to
# subset multiple columns
print(df.loc[[0, 99, 999]])
iloc does the same thing as loc but is used to subset by the row index number. In our current example, iloc and loc will behave in exactly the same way, since the index labels are the row numbers. However, keep in mind that the index labels do not necessarily have to be row numbers.
# get the 2nd row
print(df.iloc[1])
# get the 100th row
print(df.iloc[99])
Note that when we pass 1 into iloc, we actually get the second row, rather than the first. This follows Python’s zero-indexed behavior, meaning that the first item of a container is at index 0 (i.e., the 0th item of the container). More details about this kind of behavior can be found in Appendices I, L, and P; a small illustration follows.
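A minimal illustration of zero-indexing with a plain Python list:
# the first item of a Python list is at index 0
letters = ['a', 'b', 'c']
print(letters[0])  # 'a' -- the first item
print(letters[1])  # 'b' -- the second item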
With iloc, we can pass in -1 to get the last row, something we couldn’t do with loc.
# using -1 to get the last row
print(df.iloc[-1])
Just as before, we can pass in a list of integers to get multiple rows.
# get the first, 100th, and 1000th rows
print(df.iloc[[0, 99, 999]])
The ix attribute does not work in versions later than Pandas v0.20, since it can be confusing. Nevertheless, this section quickly reviews ix for completeness.
ix can be thought of as a combination of loc and iloc, as it allowed us to subset by label or by integer. By default, it searched for labels; if it could not find the corresponding label, it fell back to integer indexing. This was the cause of a lot of confusion, which is why the feature has been removed. Code using ix looks exactly like code written using loc or iloc.
# first row
df.ix[0]
# 100th row
df.ix[99]
# 1st, 100th, and 1000th rows
df.ix[[0, 99, 999]]
The loc and iloc attributes can be used to obtain subsets of columns, rows, or both. The general syntax for loc and iloc uses square brackets with a comma. The part to the left of the comma is the row values to subset; the part to the right of the comma is the column values to subset. That is, df.loc[[rows], [columns]] or df.iloc[[rows], [columns]].
If we want to use these techniques to subset only columns, we must use Python’s slicing syntax (Appendix L). We need to do this because if we are subsetting columns, we are still getting all the rows for the specified columns, so we need a way to capture all the rows.
The Python slicing syntax uses a colon, :. If we have just a colon, it refers to everything. So, if we want to get the first column using the loc or iloc syntax, we can write something like df.loc[:, [columns]] to subset the column(s).
# subset columns with loc
# note the position of the colon
# it is used to select all rows
subset = df.loc[:, ['year', 'pop']]
print(subset.head())
# subset columns with iloc
# iloc will allow us to use integers
# -1 will select the last column
subset = df.iloc[:, [2, 4, -1]]
print(subset.head())
We will get an error if we don’t use loc and iloc correctly.
# subset columns with loc
# but pass in integer values
# this will cause an error
subset = df.loc[:, [2, 4, -1]]
print(subset.head())
# subset columns with iloc
# but pass in index names
# this will cause an error
subset = df.iloc[:, ['year', 'pop']]
print(subset.head())
You can use the built-in range function to create a range of values in Python. This way, you can specify start and end values, and Python will automatically create a range of values in between. By default, every value between the start and the end (inclusive on the left, exclusive on the right; see Appendix L) will be created, unless you specify a step (Appendices L and P). In Python 3, the range function returns a lazy range object that generates its values on demand, similar to a generator (Appendix P). If you are using Python 2, the range function returns a list (Appendix I), and the xrange function returns a generator.
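To see the difference in Python 3, printing range(5) shows the lazy object rather than its values:
# in Python 3, range is lazy; wrap it in list() to materialize the values
print(range(5))        # range(0, 5)
print(list(range(5)))  # [0, 1, 2, 3, 4]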
If we look at the code given earlier (Section 1.3.1.2), we see that we subset columns using a list of integers. Since range does not return a list directly, we have to convert its output to a list first.
Note that when range(5) is called, five integers are returned: 0 to 4.
# create a range of integers from 0 to 4 inclusive
small_range = list(range(5))
print(small_range)
# subset the dataframe with the range
subset = df.iloc[:, small_range]
print(subset.head())
# create a range from 3 to 5 inclusive
small_range = list(range(3, 6))
print(small_range)
subset = df.iloc[:, small_range]
print(subset.head())
Again, note that the values are specified in a way such that the range is inclusive on the left, and exclusive on the right.
# create a range from 0 to 5 inclusive, every other integer
small_range = list(range(0, 6, 2))
subset = df.iloc[:, small_range]
print(subset.head())
Converting the output of range to a list is a bit awkward; we can use Python’s slicing syntax instead.
Python’s slicing syntax, :, is similar to the range syntax. Instead of a function call with start, stop, and step values delimited by commas, we separate the values with colons.
If you understood what was going on with the range function earlier, then slicing can be seen as a shorthand for the same thing. While range creates a sequence of values that can be converted to a list, the colon syntax for slicing has meaning only when subsetting and slicing values; it has no inherent meaning on its own.
small_range = list(range(3))
subset = df.iloc[:, small_range]
print(subset.head())
# slice the first 3 columns
subset = df.iloc[:, :3]
print(subset.head())
small_range = list(range(3, 6))
subset = df.iloc[:, small_range]
print(subset.head())
# slice columns 3 to 5 inclusive
subset = df.iloc[:, 3:6]
print(subset.head())
small_range = list(range(0, 6, 2))
subset = df.iloc[:, small_range]
print(subset.head())
# slice every other column among the first six (columns 0, 2, 4)
subset = df.iloc[:, 0:6:2]
print(subset.head())
Question
What happens if you use the slicing method with two colons, but leave a value out? For example, what is the result in each of the following cases?
■ df.iloc[:, 0:6:]
■ df.iloc[:, 0::2]
■ df.iloc[:, ::2]
■ df.iloc[:, ::]
We’ve been using the colon, :, in loc and iloc to the left of the comma. When we do so, we select all the rows in our dataframe. However, we can put specific values to the left of the comma if we want to select specific rows along with specific columns.
# using loc
print(df.loc[42, 'country'])
# using iloc
print(df.iloc[42, 0])
Just make sure you don’t forget the differences between loc and iloc.
# will cause an error
print(df.loc[42, 0])
Now, look at how confusing ix can be. Good thing it no longer works.
# get the 43rd country in our data
df.ix[42, 'country']
# instead of 'country' I used the index 0
df.ix[42, 0]
We can combine the row and column subsetting syntax with the multiple-row and multiple-column subsetting syntax to get various slices of our data.
# get the 1st, 100th, and 1000th rows
# from the 1st, 4th, and 6th columns
# the columns we are hoping to get are
# country, lifeExp, and gdpPercap
print(df.iloc[[0, 99, 999], [0, 3, 5]])
In my own work, I try to pass in the actual column names when subsetting data whenever possible. That approach makes the code more readable, since you do not need to look up the column names to know which index is being used. Additionally, using absolute indexes can lead to problems if the column order changes for some reason. This is just a general rule of thumb, as there will be exceptions where using the index position is the better option (e.g., concatenating data in Section 4.3).
# if we use the column names directly,
# it makes the code a bit easier to read
# note now we have to use loc, instead of iloc
print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])
Remember, you can use the slicing syntax on the row portion of the loc and iloc attributes.
print(df.loc[10:13, ['country', 'lifeExp', 'gdpPercap']])
If you’ve worked with other numeric libraries or languages, you know that many basic statistical calculations either come with the library or are built into the language. Let’s look at our Gapminder data again.
print(df.head(n=10))
There are several initial questions that we can ask ourselves:
1. For each year in our data, what was the average life expectancy? What were the average population and GDP?
2. What if we stratify the data by continent and perform the same calculations?
3. How many countries are listed in each continent?
To answer the questions just posed, we need to perform a grouped (i.e., aggregate) calculation. In other words, we need to perform a calculation, be it an average or a frequency count, and apply it to each subset of a variable. Another way to think about grouped calculations is as a split–apply–combine process. We first split our data into various parts, then apply a function (or calculation) of our choosing to each of the split parts, and finally combine all the individual calculations into a single dataframe. We accomplish grouped/aggregate computations by using the groupby method on dataframes.
# For each year in our data, what was the average life expectancy?
# To answer this question,
# we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean
print(df.groupby('year')['lifeExp'].mean())
Let’s unpack the statement we used in this example. We first create a grouped object. Notice that if we printed the grouped dataframe, Pandas would return only the memory location.
grouped_year_df = df.groupby('year')
print(type(grouped_year_df))
print(grouped_year_df)
From the grouped data, we can subset the columns of interest on which we want to perform our calculations. To answer our question, we need the lifeExp column. We can use the subsetting methods described in Section 1.3.1.1.
grouped_year_df_lifeExp = grouped_year_df['lifeExp']
print(type(grouped_year_df_lifeExp))
print(grouped_year_df_lifeExp)
Notice that we now are given a series (because we asked for only one column) in which the contents of the series are grouped (in our example by year).
Finally, we know the lifeExp column is of type float64. An operation we can perform on a vector of numbers is to calculate the mean to get our final desired result.
mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
print(mean_lifeExp_by_year)
We can perform a similar set of calculations for the population and GDP, since they are of types int64 and float64, respectively. But what if we want to group and stratify the data by more than one variable? And what if we want to perform the same calculation on multiple columns? We can build on the material earlier in this chapter by using a list!
# the backslash allows us to break up 1 long line of Python code
# into multiple lines
# df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
# is the same as the following code
multi_group_var = df.\
    groupby(['year', 'continent'])\
    [['lifeExp', 'gdpPercap']].\
    mean()
print(multi_group_var)
The output data is grouped by year and continent. For each year–continent pair, we calculated the average life expectancy and average GDP. The data is also printed out a little differently. Notice the year and continent “column names” are not on the same line as the life expectancy and GDP “column names.” There is some hierarchical structure between the year and continent row indices. We’ll discuss working with these types of data in more detail in Chapter 10.
If you need to “flatten” the dataframe, you can use the reset_index method.
flat = multi_group_var.reset_index()
print(flat.head(15))
Question
Does the order of the list we used to group the data matter?
Another common data-related task is to calculate frequencies. We can use the nunique and value_counts methods, respectively, to get counts of unique values and frequency counts on a Pandas Series.
# use the nunique (number unique)
# to calculate the number of unique values in a series
print(df.groupby('continent')['country'].nunique())
Question
What do you get if you use value_counts instead of nunique?
Visualizations are extremely important in almost every step of the data process. They help us identify trends in data when we are trying to understand and clean the data, and they help us convey our final findings. More information about visualization and plotting is described in Chapter 3.
Let’s look at the yearly life expectancies for the world population again.
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)
We can use Pandas to create some basic plots as shown in Figure 1.1.
global_yearly_life_expectancy.plot()
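Pandas delegates plotting to matplotlib by default. If you are running the code as a script rather than interactively, a minimal sketch (assuming the default matplotlib backend) to render the figure explicitly would look like this:
# assumes the default matplotlib plotting backend
import matplotlib.pyplot as plt
global_yearly_life_expectancy.plot()
plt.show()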
This chapter explained how to load a simple data set and start looking at specific observations. It may seem tedious at first to look at observations this way, especially if you are already familiar with the use of a spreadsheet program. Keep in mind that when doing data analytics, the goal is to produce reproducible results and to avoid manually repeating tasks. Scripting languages give you that ability and flexibility.
Along the way you learned about some of the fundamental programming abilities and data structures that Python has to offer. You also encountered a quick way to obtain aggregated statistics and plots. The next chapter goes into more detail about the Pandas DataFrame and Series objects, as well as other ways you can subset and visualize your data.
As you work your way through this book, if there is a concept or data structure that is foreign to you, check the various appendices for more information on that topic. Many fundamental programming features of Python are covered in the appendices.