Since a pandas DataFrame is structured similarly to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or running the following command:
$ pip install Quandl
Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, refer to the Where to find help and references section in Chapter 1, Getting Started with Python Libraries, or search through the Python docs at https://docs.python.org/2/.
Without further preamble, let's take a look at how to query data in a pandas DataFrame:
import Quandl # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual # PyPi url https://pypi.python.org/pypi/Quandl sunspots = Quandl.get("SIDC/SUNSPOTS_A")
head()
and tail()
methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter:print "Head 2", sunspots.head(2) print "Tail 2", sunspots.tail(2)
This gives us the first two and last two rows of the sunspot data:
Head 2 Number Year 1700-12-31 5 1701-12-31 11 [2 rows x 1 columns] Tail 2 Number Year 2012-12-31 57.7 2013-12-31 64.9 [2 rows x 1 columns]
Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index.
last_date = sunspots.index[-1] print "Last value", sunspots.loc[last_date]
You can check the following output with the result from the previous step:
Last value Number 64.9 Name: 2013-12-31 00:00:00, dtype: float64
YYYYMMDD
format as follows:print "Values slice by date", sunspots["20020101": "20131231"]
This gives the records from 2002 through 2013:
Values slice by date Number Year 2002-12-31 104.0 [TRUNCATED] 2013-12-31 64.9 [12 rows x 1 columns]
print "Slice from a list of indices", sunspots.iloc[[2, 4, -4, -2]]
The preceding code selects the following rows:
Slice from a list of indices Number Year 1702-12-31 16.0 1704-12-31 36.0 2010-12-31 16.0 2012-12-31 57.7 [4 rows x 1 columns]
print "Scalar with Iloc", sunspots.iloc[0, 0] print "Scalar with iat", sunspots.iat[1, 0]
This gives us the first and second values of the dataset as scalars:
Scalar with Iloc 5.0 Scalar with iat 11.0
Where
clause of SQL. The following code queries for values larger than the arithmetic mean. Notice that there is a difference when we perform the query on the whole DataFrame and when we perform it on a single column:print "Boolean selection", sunspots[sunspots > sunspots.mean()] print "Boolean selection with column label", sunspots[sunspots. Number > sunspots.Number.mean()]
The notable difference is that the first query yields all the rows, with rows not conforming to the condition that has a value of NaN
. The second query returns only the rows where the value is larger than the mean:
Boolean selection Number Year 1700-12-31 NaN [TRUNCATED] 1759-12-31 54.0 ... [314 rows x 1 columns] Boolean selection with column label Number Year 1705-12-31 58.0 [TRUNCATED] 1870-12-31 139.1 ... [127 rows x 1 columns]
The example code is in the query_demo.py
file of this book's code bundle:
import Quandl # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual # PyPi url https://pypi.python.org/pypi/Quandl sunspots = Quandl.get("SIDC/SUNSPOTS_A") print "Head 2", sunspots.head(2) print "Tail 2", sunspots.tail(2) last_date = sunspots.index[-1] print "Last value", sunspots.loc[last_date] print "Values slice by date", sunspots["20020101": "20131231"] print "Slice from a list of indices", sunspots.iloc[[2, 4, -4, -2]] print "Scalar with Iloc", sunspots.iloc[0, 0] print "Scalar with iat", sunspots.iat[1, 0] print "Boolean selection", sunspots[sunspots > sunspots.mean()] print "Boolean selection with column label", sunspots[sunspots.Number > sunspots.Number.mean()]
3.143.0.85