Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data, enabling fast data loading, manipulation, alignment, and merging, among other functions. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.
Why should you use a programming language like Python and a tool like Pandas to work with data? It boils down to automation and reproducibility. If a particular set of analyses needs to be performed on multiple data sets, a programming language can automate the analysis on each of them. Although many spreadsheet programs have their own macro programming languages, many users do not use them. Furthermore, not all spreadsheet programs are available on all operating systems. Performing data analysis using a programming language forces the user to maintain a running record of all steps performed on the data. I, like many people, have accidentally hit a key while viewing data in a spreadsheet program, only to find out that my results no longer make any sense due to bad data. This is not to say that spreadsheet programs are bad or that they do not have their place in the data workflow; they do. Rather, my point is that there are better and more reliable tools out there.
1. Prior knowledge needed (appendix)
a. relative directories
b. calling functions
c. dot notation
d. primitive Python containers
e. variable assignment
2. This chapter
a. loading data
b. subsetting data
c. slicing
d. basic Pandas data structures (Series, DataFrame)
e. how they resemble other Python containers (list, numpy.ndarray)
f. basic indexing
This chapter will cover:
1. Loading a simple delimited data file
2. Counting how many rows and columns were loaded
3. Determining which type of data was loaded
4. Looking at different parts of the data by subsetting rows and columns
When given a data set, we first load it and begin looking at its structure and contents. The simplest way of looking at a data set is to examine and subset specific rows and columns. We can see which type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics.
Since Pandas is not part of the Python standard library, we have to first tell Python to load (import) the library.
import pandas
With the library loaded, we can use the read_csv function to load a CSV data file. To access the read_csv function from Pandas, we use dot notation. More on dot notation can be found in Appendices H, O, and S.
About the Gapminder Data Set
The Gapminder data set originally comes from www.gapminder.org. The version of the Gapminder data used in this book was prepared by Jennifer Bryan from the University of British Columbia. The repository can be found at: www.github.com/jennybc/gapminder.
# by default the read_csv function will read a comma-separated file;
# our Gapminder data are separated by tabs,
# so we use the sep parameter and indicate a tab with \t
df = pandas.read_csv('../data/gapminder.tsv', sep='\t')
# we use the head method so Python shows us only the first 5 rows
print(df.head())
When working with Pandas functions, it is common practice to give pandas the alias pd. Thus the following code is equivalent to the preceding example:
import pandas as pd
df = pd.read_csv('../data/gapminder.tsv', sep='\t')
We can check whether we are working with a Pandas DataFrame by using the built-in type function (i.e., it comes directly from Python, not from any package such as Pandas).
print(type(df))
The type function is handy when you begin working with many different types of Python objects and need to know which object you are currently working with.
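As a quick illustration, type works on any Python object, not just dataframes:
# type reports the class of any Python object
print(type(1))    # <class 'int'>
print(type('a'))  # <class 'str'>
print(type(df))   # <class 'pandas.core.frame.DataFrame'>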
The data set we loaded is currently saved as a Pandas DataFrame object and is relatively small. Every DataFrame object has a shape attribute that gives us the number of rows and columns of the DataFrame.
# get the number of rows and columns
print(df.shape)
The shape attribute returns a tuple (Appendix J) in which the first value is the number of rows and the second is the number of columns. From the preceding results, we see that our Gapminder data set has 1704 rows and 6 columns.
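Because shape is a tuple, its values can also be unpacked into separate variables; a small sketch (the variable names are ours):
# unpack the shape tuple into row and column counts
num_rows, num_cols = df.shape
print(num_rows)  # 1704
print(num_cols)  # 6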
Since shape is an attribute of the dataframe, and not a function or method of the DataFrame, it does not have parentheses after the period. If you made the mistake of putting parentheses after the shape attribute, it would return an error.
# shape is an attribute, not a method
# this will cause an error
print(df.shape())
Typically, when first looking at a data set, we want to know how many rows and columns there are (we just did that). To get the gist of which information it contains, we look at the columns. The column names, like shape, are given by the columns attribute of the dataframe object.
# get column names
print(df.columns)
The Pandas DataFrame object is similar to the DataFrame-like objects found in other languages (e.g., Julia and R). Each column (Series) has to be the same type, whereas each row can contain mixed types. In our current example, we can expect the country column to be all strings and the year column to be all integers. However, it’s best to make sure that is the case by using the dtypes attribute or the info method. Table 1.1 compares the types in Pandas to the types in native Python.
# get the dtype of each column
print(df.dtypes)
# get more information about our data
print(df.info())
Pandas Type | Python Type | Description
object | string | Most common data type
int64 | int | Whole numbers
float64 | float | Numbers with decimals
datetime64 | datetime | Dates and times
Now that we’re able to load a simple data file, we want to be able to inspect its contents. We could print out the contents of the dataframe, but with today’s data, there are often too many cells to make sense of all the printed information. Instead, the best way to look at our data is to inspect it in parts by looking at various subsets of the data. We already saw that we can use the head method of a dataframe to look at the first five rows of our data. This is useful to see whether our data loaded properly and to get a sense of each column, its name, and its contents. Sometimes, however, we may want to see only particular rows, columns, or values from our data.
Before continuing, make sure you are familiar with Python containers (Appendices I, J, and K).
We can select columns by name, position, or range. If we want only a specific column from our data, we can access it using square brackets.
# just get the country column and save it to its own variable
country_df = df['country']
# show the first 5 observations
print(country_df.head())
# show the last 5 observations
print(country_df.tail())
To specify multiple columns by column name, we need to pass in a Python list between the square brackets. This may look a bit strange, since there will be two sets of square brackets.
# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]
print(subset.head())
print(subset.tail())
Again, you can opt to print the entire subset dataframe. We won’t use this option in this book, as it would take up an unnecessary amount of space.
At times, you may want to get a particular column by its position rather than its name. For example, you may want the first column (“country”) and the third column (“year”), or just the last column (“gdpPercap”).
As of Pandas v0.20, you are no longer able to pass a list of integers into the square brackets to subset columns. For example, df[[1]], df[[0, -1]], and df[list(range(5))] no longer work. There are other ways of subsetting columns (Section 1.3.3), but they build on the technique used to subset rows, as sketched below.
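As a preview of Section 1.3.3, here is a minimal sketch of the replacement syntax, using the iloc attribute introduced below (the column positions here are only illustrative):
# preview: subset columns by position with iloc
# the colon selects all rows; the list selects column positions
subset = df.iloc[:, [0, -1]]
print(subset.head())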
Rows can be subset in multiple ways, by row name or row index. Table 1.2 gives a quick overview of the various methods.
Subset method | Description
loc | Subset based on index label (row name)
iloc | Subset based on row index (row number)
ix | Subset based on index label or row index (no longer works as of Pandas v0.20)
Let’s take a look at part of our Gapminder data.
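Here we print the first few rows again (the same head call used earlier) so the row index is visible:
# print the head of the dataframe; the row index appears on the left
print(df.head())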
On the left side of the printed dataframe, we see what appear to be row numbers. This nameless column of values is the row index label of the dataframe. Think of the index label as being like a column name, but for rows instead of columns. By default, Pandas will fill in the index labels with the row numbers (note that it starts counting from 0). A common example where the row index labels are not the same as the row number is when we work with time series data. In that case, the index label will be a timestamp of sorts. For now, though, we will keep the default row number values.
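As an illustration (not something we will carry forward in the running example), we could replace the default numeric labels with values from one of the columns by using the set_index method:
# illustrative only: use the country column as the row index labels
indexed_df = df.set_index('country')
print(indexed_df.head())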
We can use the loc attribute of the dataframe to subset rows based on the index label.
# get the first row
# Python counts from 0
print(df.loc[0])
# get the 100th row
# Python counts from 0
print(df.loc[99])
# get the last row
# this will cause an error
print(df.loc[-1])
Note that passing -1 into loc will cause an error, because loc is actually looking for the row index label (row name) ‘-1’, which does not exist in our example. Instead, we can use a bit of Python to calculate the number of rows and pass that value into loc.
# get the last row (correctly)
# use the first value given from shape to get the number of rows
number_of_rows = df.shape[0]
# subtract 1 from the value since we want the last index value
last_row_index = number_of_rows - 1
# now do the subset using the index of the last row
print(df.loc[last_row_index])
Alternatively, we can use the tail method to return the last 1 row, instead of the default 5.
# there are many ways of doing what you want
print(df.tail(n=1))
Notice that when we used tail() and loc, the results were printed out differently. Let’s look at which type is returned when we use these methods.
subset_loc = df.loc[0]
subset_head = df.head(n=1)
# type using loc of 1 row
print(type(subset_loc))
# type using head of 1 row
print(type(subset_head))
At the beginning of this chapter, we mentioned that Pandas introduces two new data types into Python. Depending on which method we use and how many rows we return, Pandas will return a different object. The way an object gets printed to the screen can be an indicator of its type, but it’s always best to use the type function to be sure. We go into more detail about these objects in Chapter 2.
Subsetting Multiple Rows
Just as for columns, we can select multiple rows.
# select the first, 100th, and 1000th rows
# note the double square brackets similar to the syntax used to
# subset multiple columns
print(df.loc[[0, 99, 999]])
iloc does the same thing as loc but is used to subset by the row index number. In our current example, iloc and loc will behave in exactly the same way, since the index labels are the row numbers. However, keep in mind that the index labels do not necessarily have to be row numbers.
# get the 2nd row
print(df.iloc[1])
# get the 100th row
print(df.iloc[99])
Note that when we pass 1 into iloc, we actually get the second row, rather than the first. This follows Python’s zero-indexed behavior, meaning that the first item of a container is at index 0 (i.e., the 0th item of the container). More details about this kind of behavior can be found in Appendices I, L, and P; a small illustration follows.
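A minimal illustration of zero-indexing with a plain Python list:
# the first item of a Python list is at index 0
letters = ['a', 'b', 'c']
print(letters[0])  # 'a' -- the first item
print(letters[1])  # 'b' -- the second item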
With iloc, we can pass in -1 to get the last row, something we couldn’t do with loc.
# using -1 to get the last row
print(df.iloc[-1])
Just as before, we can pass in a list of integers to get multiple rows.
# get the first, 100th, and 1000th rows
print(df.iloc[[0, 99, 999]])
The ix attribute does not work in versions later than Pandas v0.20, since it can be confusing. Nevertheless, this section quickly reviews ix for completeness.
ix can be thought of as a combination of loc and iloc, as it allowed us to subset by label or by integer. By default, it searched for labels; if it could not find the corresponding label, it fell back to integer indexing. This was the cause of a lot of confusion, which is why the feature has been removed. Code using ix looks exactly like code written using loc or iloc.
# first row
df.ix[0]
# 100th row
df.ix[99]
# 1st, 100th, and 1000th rows
df.ix[[0, 99, 999]]
The loc and iloc attributes can be used to obtain subsets of columns, rows, or both. The general syntax for loc and iloc uses square brackets with a comma. The part to the left of the comma is the row values to subset; the part to the right of the comma is the column values to subset. That is, df.loc[[rows], [columns]] or df.iloc[[rows], [columns]].
If we want to use these techniques to subset only columns, we must use Python’s slicing syntax (Appendix L). We need to do this because if we are subsetting columns, we are still getting all the rows for the specified columns, so we need a way to capture all the rows.
The Python slicing syntax uses a colon, :. If we have just a colon, it refers to everything. So, if we want to get the first column using the loc or iloc syntax, we can write something like df.loc[:, [columns]] to subset the column(s).
# subset columns with loc
# note the position of the colon
# it is used to select all rows
subset = df.loc[:, ['year', 'pop']]
print(subset.head())
# subset columns with iloc
# iloc will allow us to use integers
# -1 will select the last column
subset = df.iloc[:, [2, 4, -1]]
print(subset.head())
We will get an error if we don’t use loc and iloc correctly.
# subset columns with loc
# but pass in integer values
# this will cause an error
subset = df.loc[:, [2, 4, -1]]
print(subset.head())
# subset columns with iloc
# but pass in index names
# this will cause an error
subset = df.iloc[:, ['year', 'pop']]
print(subset.head())
You can use the built-in range function to create a range of values in Python. This way, you can specify start and end values, and Python will automatically create a range of values in between. By default, every value between the start and the end (inclusive on the left, exclusive on the right; see Appendix L) will be created, unless you specify a step (Appendices L and P). In Python 3, the range function returns a lazy range object that generates its values on demand, similar to a generator (Appendix P). If you are using Python 2, the range function returns a list (Appendix I), and the xrange function returns a generator.
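To see the difference in Python 3, printing range(5) shows the lazy object rather than its values:
# in Python 3, range is lazy; wrap it in list() to materialize the values
print(range(5))        # range(0, 5)
print(list(range(5)))  # [0, 1, 2, 3, 4]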
If we look at the code given earlier (Section 1.3.1.2), we see that we subset columns using a list of integers. Since range does not return a list directly, we have to convert its output to a list first.
Note that when range(5) is called, five integers are returned: 0 to 4.
# create a range of integers from 0 to 4 inclusive
small_range = list(range(5))
print(small_range)
# subset the dataframe with the range
subset = df.iloc[:, small_range]
print(subset.head())
# create a range from 3 to 5 inclusive
small_range = list(range(3, 6))
print(small_range)
subset = df.iloc[:, small_range]
print(subset.head())
Again, note that the values are specified in a way such that the range is inclusive on the left, and exclusive on the right.
# create a range from 0 to 5 inclusive, every other integer
small_range = list(range(0, 6, 2))
subset = df.iloc[:, small_range]
print(subset.head())
Converting the output of range to a list is a bit awkward; we can use Python’s slicing syntax instead.
Python’s slicing syntax, :, is similar to the range syntax. Instead of a function call with start, stop, and step values delimited by commas, we separate the values with colons.
If you understood what was going on with the range function earlier, then slicing can be seen as a shorthand for the same thing. While range creates a sequence of values that can be converted to a list, the colon syntax for slicing has meaning only when subsetting and slicing values; it has no inherent meaning on its own.
small_range = list(range(3))
subset = df.iloc[:, small_range]
print(subset.head())
# slice the first 3 columns
subset = df.iloc[:, :3]
print(subset.head())
small_range = list(range(3, 6))
subset = df.iloc[:, small_range]
print(subset.head())
# slice columns 3 to 5 inclusive
subset = df.iloc[:, 3:6]
print(subset.head())
small_range = list(range(0, 6, 2))
subset = df.iloc[:, small_range]
print(subset.head())
# slice every other column among the first six (columns 0, 2, 4)
subset = df.iloc[:, 0:6:2]
print(subset.head())
Question
What happens if you use the slicing method with two colons, but leave a value out? For example, what is the result in each of the following cases?
■ df.iloc[:, 0:6:]
■ df.iloc[:, 0::2]
■ df.iloc[:, ::2]
■ df.iloc[:, ::]
We’ve been using the colon, :, in loc and iloc to the left of the comma. When we do so, we select all the rows in our dataframe. However, we can put specific values to the left of the comma if we want to select specific rows along with specific columns.
# using loc
print(df.loc[42, 'country'])
# using iloc
print(df.iloc[42, 0])
Just make sure you don’t forget the differences between loc and iloc.
# will cause an error
print(df.loc[42, 0])
Now, look at how confusing ix can be. Good thing it no longer works.
# get the 43rd country in our data
df.ix[42, 'country']
# instead of 'country' I used the index 0
df.ix[42, 0]
We can combine the row and column subsetting syntax with the multiple-row and multiple-column subsetting syntax to get various slices of our data.
# get the 1st, 100th, and 1000th rows
# from the 1st, 4th, and 6th columns
# the columns we are hoping to get are
# country, lifeExp, and gdpPercap
print(df.iloc[[0, 99, 999], [0, 3, 5]])
In my own work, I try to pass in the actual column names when subsetting data whenever possible. That approach makes the code more readable, since you do not need to look up the column names to know which index is being used. Additionally, using absolute indexes can lead to problems if the column order changes for some reason. This is just a general rule of thumb, as there will be exceptions where using the index position is the better option (e.g., concatenating data in Section 4.3).
# if we use the column names directly,
# it makes the code a bit easier to read
# note now we have to use loc, instead of iloc
print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])
Remember, you can use the slicing syntax on the row portion of the loc and iloc attributes.
print(df.loc[10:13, ['country', 'lifeExp', 'gdpPercap']])
If you’ve worked with other numeric libraries or languages, you know that many basic statistical calculations either come with the library or are built into the language. Let’s look at our Gapminder data again.
print(df.head(n=10))
There are several initial questions that we can ask ourselves:
1. For each year in our data, what was the average life expectancy? What were the average population and GDP?
2. What if we stratify the data by continent and perform the same calculations?
3. How many countries are listed in each continent?
To answer the questions just posed, we need to perform a grouped (i.e., aggregate) calculation. In other words, we need to perform a calculation, be it an average or a frequency count, and apply it to each subset of a variable. Another way to think about grouped calculations is as a split–apply–combine process. We first split our data into various parts, then apply a function (or calculation) of our choosing to each of the split parts, and finally combine all the individual calculations into a single dataframe. We accomplish grouped/aggregate computations by using the groupby method on dataframes.
# For each year in our data, what was the average life expectancy?
# To answer this question,
# we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean
print(df.groupby('year')['lifeExp'].mean())
Let’s unpack the statement we used in this example. We first create a grouped object. Notice that if we printed the grouped dataframe, Pandas would return only the memory location.
grouped_year_df = df.groupby('year')
print(type(grouped_year_df))
print(grouped_year_df)
From the grouped data, we can subset the columns of interest on which we want to perform our calculations. To answer our question, we need the lifeExp column. We can use the subsetting methods described in Section 1.3.1.1.
grouped_year_df_lifeExp = grouped_year_df['lifeExp']
print(type(grouped_year_df_lifeExp))
print(grouped_year_df_lifeExp)
Notice that we now are given a series (because we asked for only one column) in which the contents of the series are grouped (in our example by year).
Finally, we know the lifeExp column is of type float64. An operation we can perform on a vector of numbers is to calculate the mean to get our final desired result.
mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
print(mean_lifeExp_by_year)
We can perform a similar set of calculations for the population and GDP, since they are of types int64 and float64, respectively. But what if we want to group and stratify the data by more than one variable? And what if we want to perform the same calculation on multiple columns? We can build on the material earlier in this chapter by using a list!
# the backslash allows us to break up 1 long line of Python code
# into multiple lines
# df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
# is the same as the following code
multi_group_var = df.\
    groupby(['year', 'continent'])\
    [['lifeExp', 'gdpPercap']].\
    mean()
print(multi_group_var)
The output data is grouped by year and continent. For each year–continent pair, we calculated the average life expectancy and average GDP. The data is also printed out a little differently. Notice the year and continent “column names” are not on the same line as the life expectancy and GDP “column names.” There is some hierarchical structure between the year and continent row indices. We’ll discuss working with these types of data in more detail in Chapter 10.
If you need to “flatten” the dataframe, you can use the reset_index method.
flat = multi_group_var.reset_index()
print(flat.head(15))
Question
Does the order of the list we used to group the data matter?
Another common data-related task is to calculate frequencies. We can use the nunique and value_counts methods, respectively, to get counts of unique values and frequency counts on a Pandas Series.
# use the nunique (number unique)
# to calculate the number of unique values in a series
print(df.groupby('continent')['country'].nunique())
Question
What do you get if you use value_counts instead of nunique?
Visualizations are extremely important in almost every step of the data process. They help us identify trends in data when we are trying to understand and clean the data, and they help us convey our final findings. More information about visualization and plotting is described in Chapter 3.
Let’s look at the yearly life expectancies for the world population again.
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)
We can use Pandas to create some basic plots as shown in Figure 1.1.
global_yearly_life_expectancy.plot()
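Pandas delegates plotting to matplotlib by default. If you are running the code as a script rather than interactively, a minimal sketch (assuming the default matplotlib backend) to render the figure explicitly would look like this:
# assumes the default matplotlib plotting backend
import matplotlib.pyplot as plt
global_yearly_life_expectancy.plot()
plt.show()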
This chapter explained how to load a simple data set and start looking at specific observations. It may seem tedious at first to look at observations this way, especially if you are already familiar with the use of a spreadsheet program. Keep in mind that when doing data analytics, the goal is to produce reproducible results and to avoid manually repeating tasks. Scripting languages give you that ability and flexibility.
Along the way you learned about some of the fundamental programming abilities and data structures that Python has to offer. You also encountered a quick way to obtain aggregated statistics and plots. The next chapter goes into more detail about the Pandas DataFrame and Series objects, as well as other ways you can subset and visualize your data.
As you work your way through this book, if there is a concept or data structure that is foreign to you, check the various appendices for more information on that topic. Many fundamental programming features of Python are covered in the appendices.