Chapter 1. A Tour of pandas

In this chapter, we will take a look at pandas, an open source Python-based data analysis library. It provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. The pandas library brings many of the good things from R, specifically the DataFrame object and R packages such as plyr and reshape2, and places them in a single library that you can use in your Python applications.

The development of pandas was begun in 2008 by Wes McKinney while he worked at AQR Capital Management. It was open sourced in 2009 and is currently supported and actively developed by various organizations and contributors. It was initially designed with finance in mind, particularly for manipulating time series data, but it emphasizes the data manipulation part of the equation and leaves statistical, financial, and other types of analyses to other Python libraries.

In this chapter, we will take a brief tour of pandas and some of the associated tools such as IPython notebooks. You will be introduced to a variety of concepts in pandas for data organization and manipulation, in an effort to form both a base understanding and a frame of reference for deeper coverage in later sections of this book. By the end of this chapter, you will have a good understanding of the fundamentals of pandas, be able to perform basic data manipulations, and be ready to continue with the more detailed material later in this book.

This chapter will introduce you to:

  • pandas and why it is important
  • IPython and IPython Notebooks
  • Referencing pandas in your application
  • The Series and DataFrame objects of pandas
  • How to load data from files and the Web
  • The simplicity of visualizing pandas data

Note

By convention, pandas is always written in lowercase in the pandas documentation, and this book follows the same convention.

pandas and why it is important

pandas is a library containing high-level data structures and tools created to help Python programmers perform powerful data manipulations and discover information in that data in a simple and fast way.

Simple and effective data analysis requires the ability to index, retrieve, tidy, reshape, combine, slice, and perform various analyses on both single and multidimensional data, including heterogeneously typed data that is automatically aligned along index labels. To enable these capabilities, pandas provides the following features (and many more not explicitly mentioned here):

  • High-performance array and table structures for representation of homogeneous and heterogeneous data sets: the Series and DataFrame objects
  • Flexible reshaping of data structure, allowing the ability to insert and delete both rows and columns of tabular data
  • Hierarchical indexing of data along multiple axes (both rows and columns), allowing multiple labels per data item
  • Labeling of series and tabular data to facilitate indexing and automatic alignment of data
  • Ability to easily identify and fix missing data, in both floating point and non-floating point formats
  • Powerful grouping capabilities and functionality to perform split-apply-combine operations on series and tabular data
  • Simple conversion from ragged and differently indexed data of both NumPy and Python data structures to pandas objects
  • Smart label-based slicing and subsetting of data sets, along with intuitive and flexible merging and joining of data with SQL-like constructs
  • Extensive I/O facilities to load and save data from multiple formats including CSV, Excel, relational and non-relational databases, HDF5 format, and JSON
  • Explicit support for time series-specific functionality, including date range generation, moving window statistics, time shifting, lagging, and so on
  • Built-in support to retrieve and automatically parse data from various web-based data sources such as Yahoo!, Google Finance, the World Bank, and several others
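
To make a few of these features concrete, here is a minimal sketch that touches on label alignment, missing data, split-apply-combine, and time series support. All values, labels, and column names (team, score) are made up for illustration and do not come from any data set used in this book. The calls shown are standard pandas methods, although exact behavior may vary slightly between pandas versions.

```python
import pandas as pd

# Two Series with partially overlapping index labels (made-up values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on index labels; a label found in only one
# Series produces a missing value (NaN) in the result.
total = s1 + s2

# Identify and fix the missing data.
missing_count = int(total.isna().sum())
filled = total.fillna(0.0)

# A DataFrame can hold heterogeneous columns (strings and floats here).
df = pd.DataFrame({
    "team": ["x", "y", "x", "y"],     # hypothetical grouping column
    "score": [1.0, 2.0, 3.0, 4.0],    # hypothetical values
})

# Split-apply-combine: split rows by team, sum each group,
# and combine the results into a new Series.
per_team = df.groupby("team")["score"].sum()

# Time series: generate a date range, then compute a moving
# window statistic and a lagged copy of the data.
dates = pd.date_range("2015-01-01", periods=4, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)
rolling_mean = ts.rolling(window=2).mean()
lagged = ts.shift(1)

print(total)
print(per_team)
print(rolling_mean)
```

Note that no explicit loops or index bookkeeping are needed in this sketch: alignment, grouping, and windowing are all driven by the labels attached to the data.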

For those desiring to get into data analysis and the emerging field of data science, pandas offers an excellent means for a Python programmer (or just an enthusiast) to learn data manipulation. For those just learning or coming from a statistical language like R, pandas can offer an excellent introduction to Python as a programming language.

pandas itself is not a data science toolkit. It does provide some statistical methods as a matter of convenience, but to draw conclusions from data, it leans upon other packages in the Python ecosystem, such as SciPy, NumPy, and scikit-learn, and upon graphics libraries such as matplotlib for data visualization. This is actually a strength of pandas compared with languages such as R, as pandas applications are able to leverage an extensive network of robust Python frameworks already built and tested elsewhere.
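
As a rough illustration of this division of labor, the following sketch computes a convenience statistic with pandas and then hands the same data to NumPy for further numerical work. The column names (height, width) and values are hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, for illustration only.
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "width": [2.0, 4.0, 6.0]})

# A convenience statistic provided directly by pandas.
col_means = df.mean()

# Hand the underlying values to NumPy as a plain two-dimensional array.
values = df.to_numpy()
area = np.prod(values, axis=1)  # row-wise product computed by NumPy

# NumPy functions applied to a Series return a Series,
# so the index labels are preserved.
log_height = np.log(df["height"])

print(col_means)
print(area)
print(log_height)
```

The same pattern, preparing and labeling data in pandas and then passing it to a specialized library, applies to the other packages mentioned above.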

In this book, we will look at how to use pandas for data manipulation, with a specific focus on gathering, cleaning, and manipulating various forms of data. Detailed coverage of data science, finance, econometrics, social network analysis, Python, and IPython is beyond the scope of this book. For these topics, you can refer to other excellent books already available at https://www.packtpub.com/.
