Chapter 4. pandas Primer

The name pandas derives from panel data (an econometric term) and Python data analysis. pandas is a popular open source Python project. This chapter is a tutorial on basic pandas functionality, in which we will learn about pandas data structures and operations.

Note

The official pandas documentation insists on naming the project pandas in all lowercase letters. The other convention they insist on is this import statement: import pandas as pd. We will try to follow these conventions as much as possible.

In this chapter, we will install and explore pandas. Then, we will acquaint ourselves with the two central pandas data structures: DataFrame and Series. After this, we will learn how to perform SQL-like operations on the data contained in these data structures. pandas also has statistical utilities, including time-series routines, some of which will be demonstrated. The topics we will cover are as follows:

  • Installing and exploring pandas
  • DataFrame and Series data structures
  • Querying data in pandas
  • Statistics with pandas DataFrames
  • Data aggregation with pandas DataFrames
  • Concatenating, joining, and appending DataFrames
  • Handling missing values
  • Dealing with dates
  • Pivot tables
  • Remote data access

Installing and exploring pandas

The minimal set of dependencies required by pandas is as follows:

  • NumPy: This is the fundamental numerical array package that we installed and covered extensively in the preceding chapters
  • python-dateutil: This is a date-handling library
  • pytz: This handles time zone definitions
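Before installing pandas, it can be useful to see which of these dependencies are already present. The following is a minimal sketch, not part of pandas itself; the helper names installed_version and check_dependencies are made up for this illustration. It relies only on the standard library and on the fact that python-dateutil is imported under the name dateutil.

```python
import importlib
import importlib.util

# Import names of the three required dependencies.
# Note that the python-dateutil distribution is imported as "dateutil".
REQUIRED = ["numpy", "dateutil", "pytz"]


def installed_version(module_name):
    """Return the module's version string, "unknown" if the module
    does not expose __version__, or None if it is not installed."""
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    return getattr(module, "__version__", "unknown")


def check_dependencies(names=REQUIRED):
    """Map each import name to its installed version (or None)."""
    return {name: installed_version(name) for name in names}


if __name__ == "__main__":
    for name, version in check_dependencies().items():
        print(name, version if version else "NOT INSTALLED")
```

Any dependency reported as not installed is normally pulled in automatically when pandas is installed with pip.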

This list is the bare minimum; a longer list of optional dependencies can be found at http://pandas.pydata.org/pandas-docs/stable/install.html. We can install pandas via PyPI with pip or easy_install, with a binary installer, with the aid of our operating system package manager, or from the source by checking out the code. The binary installers can be downloaded from http://pandas.pydata.org/getpandas.html.

The command to install pandas with pip is as follows:

$ pip install pandas

You may have to prepend the preceding command with sudo if your user account doesn't have sufficient rights. For most, if not all, Linux distributions, the pandas package name is python-pandas. Please refer to the manual pages of your package manager for the correct command to install. These commands should be the same as the ones summarized in Chapter 1, Getting Started with Python Libraries. To install from the source, we need to execute the following commands from the command line:

$ git clone git://github.com/pydata/pandas.git 
$ cd pandas
$ python setup.py install

This procedure requires a correctly configured compiler and other dependencies; therefore, it is recommended only if you really need the most up-to-date version of pandas. Once we have installed pandas, we can explore it further by adding pandas-related lines to our documentation-scanning script, pkg_check.py, from the previous chapter. The program prints the following output:

pandas version 0.13.1
pandas.compat DESCRIPTION compat  Cross-compatible functions for Python 2 and 3. Key items to import for 2/3 compatible code: * iterators: range(), map(),
pandas.computation 
pandas.core
pandas.io 
pandas.rpy 
pandas.sandbox 
pandas.sparse 
pandas.stats 
pandas.tests 
pandas.tools 
pandas.tseries 
pandas.util 

Unfortunately, the documentation of the pandas subpackages lacks informative descriptions; however, the subpackage names are descriptive enough to get an idea of what they are about.
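The pkg_check.py script itself is not reproduced in this section. The following is a minimal sketch of how such a scan could be written with the standard library; it is an assumption about the approach, not the script from the previous chapter. The function name describe_subpackages and the max_chars parameter are hypothetical.

```python
import importlib
import inspect
import pkgutil


def describe_subpackages(package_name, max_chars=150):
    """Return (full_name, summary) pairs for the direct subpackages
    of the given package. The summary is the start of the subpackage
    docstring, or an empty string if there is none."""
    package = importlib.import_module(package_name)
    results = []
    # Plain modules have no __path__, so they yield no subpackages.
    for _, name, is_pkg in pkgutil.iter_modules(getattr(package, "__path__", [])):
        if not is_pkg or name.startswith("_"):
            continue
        full_name = package_name + "." + name
        try:
            module = importlib.import_module(full_name)
        except Exception:
            # Some subpackages need optional dependencies to import.
            results.append((full_name, ""))
            continue
        doc = inspect.getdoc(module) or ""
        # Collapse whitespace so the summary fits on one line.
        summary = " ".join(doc.split())[:max_chars]
        results.append((full_name, summary))
    return results
```

With pandas installed, looping over describe_subpackages("pandas") and printing each pair gives a listing similar in spirit to the one above, although the subpackage names depend on the installed pandas version.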
