Querying data in pandas

Since a pandas DataFrame is structured similarly to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or running the following command:

$ pip install Quandl

Note

Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, refer to the Where to find help and references section in Chapter 1, Getting Started with Python Libraries, or search through the Python docs at https://docs.python.org/2/.

Without further preamble, let's take a look at how to query data in a pandas DataFrame:

  1. As a first step, we obviously have to download the data. After importing the Quandl API, get the data as follows:
    import Quandl
    
    # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual
    # PyPi url https://pypi.python.org/pypi/Quandl
    sunspots = Quandl.get("SIDC/SUNSPOTS_A")
  2. The head() and tail() methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter:
    print "Head 2", sunspots.head(2) 
    print "Tail 2", sunspots.tail(2)

    This gives us the first two and last two rows of the sunspot data:

    Head 2             Number
    Year              
    1700-12-31       5
    1701-12-31      11
    
    [2 rows x 1 columns]
    Tail 2             Number
    Year              
    2012-12-31    57.7
    2013-12-31    64.9
    
    [2 rows x 1 columns]
    

    Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index.

  3. The following is the query for the last value using the last date:
    last_date = sunspots.index[-1]
    print "Last value", sunspots.loc[last_date]

    You can check the following output with the result from the previous step:

    Last value Number    64.9
    Name: 2013-12-31 00:00:00, dtype: float64
    
  4. Query the date with date strings in the YYYYMMDD format as follows:
    print "Values slice by date", sunspots["20020101": "20131231"]

    This gives the records from 2002 through 2013:

    Values slice by date             Number
    Year              
    2002-12-31   104.0
    [TRUNCATED]
    2013-12-31    64.9
    
    [12 rows x 1 columns]
    
  5. A list of indices can be used to query as well:
    print "Slice from a list of indices", sunspots.iloc[[2, 4, -4, -2]]

    The preceding code selects the following rows:

    Slice from a list of indices             Number
    Year              
    1702-12-31    16.0
    1704-12-31    36.0
    2010-12-31    16.0
    2012-12-31    57.7
    
    [4 rows x 1 columns]
    
  6. To select scalar values, we have two options. The second option given here should be faster. Two integers are required, the first for the row and the second for the column:
    print "Scalar with Iloc", sunspots.iloc[0, 0]
    print "Scalar with iat", sunspots.iat[1, 0]

    This gives us the first and second values of the dataset as scalars:

    Scalar with Iloc 5.0
    Scalar with iat 11.0
    
  7. Querying with Booleans works much like the Where clause of SQL. The following code queries for values larger than the arithmetic mean. Notice that there is a difference when we perform the query on the whole DataFrame and when we perform it on a single column:
    print "Boolean selection", sunspots[sunspots > sunspots.mean()]
    print "Boolean selection with column label", sunspots[sunspots. Number > sunspots.Number.mean()]

    The notable difference is that the first query yields all the rows, with rows not conforming to the condition that has a value of NaN. The second query returns only the rows where the value is larger than the mean:

    Boolean selection             Number
    Year              
    1700-12-31     NaN
    [TRUNCATED]
    1759-12-31    54.0
                   ...
    
    [314 rows x 1 columns]
    Boolean selection with column label             Number
    Year              
    1705-12-31    58.0
    [TRUNCATED]
    1870-12-31   139.1
                   ...
    
    [127 rows x 1 columns]
    

The example code is in the query_demo.py file of this book's code bundle:

import Quandl

# Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual
# PyPi url https://pypi.python.org/pypi/Quandl
sunspots = Quandl.get("SIDC/SUNSPOTS_A")
print "Head 2", sunspots.head(2)
print "Tail 2", sunspots.tail(2)

last_date = sunspots.index[-1]
print "Last value", sunspots.loc[last_date]

print "Values slice by date", sunspots["20020101": "20131231"]

print "Slice from a list of indices", sunspots.iloc[[2, 4, -4, -2]]

print "Scalar with Iloc", sunspots.iloc[0, 0]
print "Scalar with iat", sunspots.iat[1, 0]

print "Boolean selection", sunspots[sunspots > sunspots.mean()]
print "Boolean selection with column label", sunspots[sunspots.Number > sunspots.Number.mean()]
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.143.0.85