1. Pandas DataFrame Basics

1.1 Introduction

Pandas is an open-source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data, with fast loading, manipulating, aligning, and merging, among other operations. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. A DataFrame represents your entire spreadsheet or rectangular data, whereas a Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series.
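
To make the "dictionary of Series" idea concrete, here is a minimal sketch. The column names and values are made up for illustration and are not one of the book's data sets.

```python
import pandas

# Hypothetical toy data: each dictionary key becomes a column name,
# and each list becomes that column's values.
df = pandas.DataFrame({
    "country": ["Kenya", "Peru", "Japan"],
    "population_millions": [54, 33, 125],
})

# Pulling out a single column by name gives back a Series.
population = df["population_millions"]

print(type(df).__name__)          # DataFrame
print(type(population).__name__)  # Series
```

The DataFrame holds the whole rectangular table, while each column taken on its own is a Series.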

Why should you use a programming language like Python and a tool like Pandas to work with data? It boils down to automation and reproducibility. If there is a particular set of analyses that needs to be performed on multiple data sets, a programming language can automate the analysis on the data sets. Although many spreadsheet programs have their own macro programming languages, many users do not use them. Furthermore, not all spreadsheet programs are available on all operating systems. Performing data tasks using a programming language forces the user to have a running record of all steps performed on the data. I, like many people, have accidentally hit a key while viewing data in a spreadsheet program, only to find out that my results do not make any sense anymore due to bad data. This is not to say spreadsheet programs are bad or do not have their place in the data workflow. They do, but there are better and more reliable tools out there. These better tools can work in tandem with spreadsheet programs while providing more reliable data manipulation, and introduce the possibility of incorporating data from other data sets and databases.

Learning Objectives

The concept map for this chapter can be found in Figure A.1.

Use Pandas functions to load a simple delimited data file
Calculate how many rows and columns were loaded
Identify the type of data that were loaded
Name differences between functions, methods, and attributes
Use methods and attributes to subset rows and columns
Calculate basic grouped and aggregated statistics from data
Use methods and attributes to create a simple figure from data

1.2 Load Your First Data Set

When given a data set, we first load it and begin looking at its structure and contents. The simplest way of looking at a data set is to look at and subset specific rows and columns. We can see what type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics.

Since Pandas is not part of the Python standard library, we first have to tell Python to load (i.e., import) the library. If you have not installed the data and packages needed to go through the book, please see Appendix B.

import pandas

With the library loaded, we can use the read_csv() function to load a CSV data file. To access the read_csv() function from pandas, we use something called “dot notation.” More on dot notation can be found in Appendix L, Appendix P, and Appendix E. We write pandas.read_csv() to say: within the pandas library we just loaded, look inside for the read_csv() function.
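
As a rough sketch of what this looks like in practice, the example below reads a small comma-separated table. To keep it self-contained, the data lives in a string wrapped in io.StringIO instead of a file on disk; the column names and values are hypothetical stand-ins. With a real file you would pass its path to read_csv() instead.

```python
import io

import pandas

# Hypothetical CSV content standing in for a file on disk.
csv_text = (
    "country,year,lifeExp\n"
    "Kenya,2007,54.1\n"
    "Peru,2007,71.4\n"
)

# Dot notation: look inside the pandas library for read_csv().
df = pandas.read_csv(io.StringIO(csv_text))

# The shape attribute reports (number of rows, number of columns).
print(df.shape)  # (2, 3)
```

Note that shape is an attribute, not a method, so it is written without parentheses.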

| Pandas | Python | Description |
| --- | --- | --- |
| object | string | most common data type |
| int64 | int | whole numbers |
| float64 | float | numbers with decimals |
| datetime64 | datetime | datetime is found in the Python standard library (i.e., it is not loaded by default and needs to be imported) |
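
One way to see these types for yourself is the dtypes attribute, sketched below on a small made-up DataFrame. The column names and values are assumptions for illustration. Depending on your Pandas version, the text column may be reported as a dedicated string type instead of object, so the exact printout is not shown here.

```python
import pandas

# Hypothetical toy data with one column per kind of value.
df = pandas.DataFrame({
    "country": ["Kenya", "Peru"],                       # text
    "year": [2007, 2007],                               # whole numbers
    "lifeExp": [54.1, 71.4],                            # numbers with decimals
    "recorded": pandas.to_datetime(["2007-01-01", "2007-06-30"]),  # dates
})

# dtypes lists the Pandas type of every column.
print(df.dtypes)
```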

| Subset attribute | Description |
| --- | --- |
| `.loc[]` | Subset based on index label (row name) |
| `.iloc[]` | Subset based on row index (row number) |
| ~~`.ix[]` (deprecated in `Pandas v0.20`, removed in `Pandas v1.0`)~~ | ~~Subset based on index label or row index~~ |
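
The difference between the two working attributes is easiest to see when the row labels are not numbers. The sketch below uses a made-up DataFrame whose index labels are country names (an assumption for illustration).

```python
import pandas

# Hypothetical toy data with text labels as the row index.
df = pandas.DataFrame(
    {"lifeExp": [54.1, 71.4, 82.6]},
    index=["Kenya", "Peru", "Japan"],
)

# .loc[] looks a row up by its index label.
by_label = df.loc["Peru"]

# .iloc[] looks a row up by its position; counting starts at 0,
# so 1 is the second row.
by_position = df.iloc[1]

print(by_label.equals(by_position))  # True
```

Both calls return the same row here, but only because "Peru" happens to be the second row; if the rows were reordered, .loc["Peru"] would still find Peru while .iloc[1] would return whatever row is now second.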

Table of Contents for 1. Pandas DataFrame Basics

1 Pandas DataFrame Basics
1.1 Introduction
Learning Objectives
1.2 Load Your First Data Set
1.3 Look at Columns, Rows, and Cells
1.3.1 Select and Subset Columns by Name
1.3.1.1 Single Value Returns `DataFrame` or `Series`
1.3.1.2 Using Dot Notation to Pull a Column of Values
1.3.2 Subset Rows
1.3.2.1 Subset Rows by `index` Label - `.loc[]`
1.3.2.2 Subsetting Multiple Rows
1.3.3 Subset Rows by Row Number: `.iloc[]`
1.3.4 Mix It Up
1.3.4.1 Selecting Columns
1.3.4.2 Subsetting with `range()`
1.3.4.3 Subsetting with Slicing `:`
1.3.5 Subsetting Rows and Columns
1.3.5.1 Subsetting Multiple Rows and Columns
1.4 Grouped and Aggregated Calculations
1.4.1 Grouped Means
1.4.2 Grouped Frequency Counts
1.5 Basic Plot
Conclusion
