2. Pandas Data Structures Basics

Chapter 1 introduced the Pandas DataFrame and Series objects. These data structures resemble the primitive Python data containers (lists and dictionaries) for indexing and labeling, but have additional features that make working with data easier.

Learning Objectives

The concept map for this chapter can be found in Figure A.2.

Use functions to create and load manual data
Describe the Series object
Describe the DataFrame object
Identify basic operations on Series objects
Identify basic operations on DataFrame objects
Perform conditional subsetting, fancy slicing, and indexing
Use methods to save data

2.1 Create Your Own Data

Whether you are manually inputting data or creating a small test example, knowing how to create DataFrames without loading data from a file is a useful skill. It is especially helpful when you need a small, reproducible example for a question about an error on Stack Overflow.

2.1.1 Create a Series

The Pandas Series is a one-dimensional container (i.e., a Python Iterable), similar to the built-in Python list. It is the data type that represents each column of the DataFrame. Table 1.1 lists the possible dtypes for Pandas DataFrame columns. Each value in a DataFrame column must be stored as the same dtype. For example, if a column contains the number 1 and the string "pizza", the dtype of the entire column will be a string (Pandas calls this the object dtype).

Since a DataFrame can be thought of as a dictionary of Series objects, where each key is the column name and the value is the Series, we can conclude that a Series is very similar to a Python list, except that each element must be the same dtype. Those who have used the numpy library will realize this is the same behavior as demonstrated by the ndarray.
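
The dictionary-of-Series view can be sketched in a few lines. The names, ages, and column labels below are made up for illustration and are not data from this book.

```python
import pandas as pd

# Hypothetical example data; values and column names are illustrative only.
names = pd.Series(["Rosaline Franklin", "William Gosset"])
ages = pd.Series([37, 61])

# Each dictionary key becomes a column name; each Series becomes that column.
scientists = pd.DataFrame({"name": names, "age": ages})

# Pulling a column back out returns a Series with a single dtype.
age_column = scientists["age"]
print(type(age_column))
print(scientists.dtypes)
```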

The easiest way to create a Series is to pass in a Python list. If we pass in a list of mixed types, Pandas uses the most general representation that can hold every value. Typically the dtype will be object.
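
As a minimal sketch of how the dtype is inferred from a list, the following creates three Series. The values and index labels are illustrative assumptions.

```python
import pandas as pd

# A list holding one type: Pandas infers an integer dtype.
numbers = pd.Series([1, 2, 3])

# Integers mixed with floats are stored as floats.
upcast = pd.Series([1, 2.5])

# A string mixed with an integer falls back to the object dtype.
mixed = pd.Series(["banana", 42])

print(numbers.dtype)
print(upcast.dtype)  # float64
print(mixed.dtype)   # object

# The index defaults to 0, 1, 2, ...; custom labels can be passed with index=.
labeled = pd.Series(["Wes McKinney", "Creator of Pandas"], index=["Person", "Who"])
print(labeled)
```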

Series attributes	Description
`.loc`	Subset using index label
`.iloc`	Subset using index position
`.dtype` or `.dtypes`	The type of the `Series` contents
`.T`	Transpose of the `Series`
`.shape`	Dimensions of the data
`.size`	Number of elements in the `Series`
`.values`	`ndarray` or `ndarray`-like of the `Series`
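
The attributes in this table can be tried on a small Series. This is only a sketch; the values and the index labels `a`, `b`, `c` are assumptions chosen for the example.

```python
import pandas as pd

# Made-up values with string labels, so labels and positions differ.
ages = pd.Series([37, 61, 45], index=["a", "b", "c"])

by_label = ages.loc["b"]     # subset by index label
by_position = ages.iloc[1]   # subset by integer position
print(by_label, by_position)

print(ages.dtype)            # dtype of the contents
print(ages.shape)            # (3,)
print(ages.size)             # 3
print(type(ages.values))     # the underlying NumPy array

# A Series is one-dimensional, so transposing it changes nothing.
print(ages.T.equals(ages))   # True
```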

Series methods	Description
`.append()`	Concatenates two or more `Series` (removed in Pandas 2.0; use `pd.concat()` instead)
`.corr()`	Calculates the correlation with another `Series`
`.cov()`	Calculates the covariance with another `Series`
`.describe()`	Calculates summary statistics
`.drop_duplicates()`	Returns a `Series` without duplicates
`.equals()`	Determines whether two `Series` contain the same elements
`.get_values()`	Gets the values of the `Series`; same as the `values` attribute (removed in Pandas 1.0; use `.values` or `.to_numpy()`)
`.hist()`	Draws a histogram
`.isin()`	Checks whether each value in the `Series` is contained in a given set of values
`.min()`	Returns the minimum value
`.max()`	Returns the maximum value
`.mean()`	Returns the arithmetic mean
`.median()`	Returns the median
`.mode()`	Returns the mode(s)
`.quantile()`	Returns the value at a given quantile
`.replace()`	Replaces values in the `Series` with a specified value
`.sample()`	Returns a random sample of values from the `Series`
`.sort_values()`	Sorts values
`.to_frame()`	Converts a `Series` to a `DataFrame`
`.transpose()`	Returns the transpose
`.unique()`	Returns a `numpy.ndarray` of unique values
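
A few of these methods are sketched below on made-up values. `pd.concat()` stands in for `.append()`, which recent Pandas releases no longer provide.

```python
import pandas as pd

# Made-up values, used only for illustration.
ages = pd.Series([37, 61, 45, 61])
more_ages = pd.Series([29, 52])

# Summary statistics.
print(ages.min(), ages.max())  # 37 61
print(ages.mean())             # 51.0
print(ages.median())           # 53.0

# Duplicates and membership.
unique_ages = ages.unique()            # NumPy array of distinct values
deduplicated = ages.drop_duplicates()  # Series, keeps the first occurrence
is_61 = ages.isin([61])                # boolean Series

# Sorting, replacing, and converting.
sorted_ages = ages.sort_values()
replaced = ages.replace(61, 60)
as_frame = ages.to_frame(name="age")

# Concatenate two Series and renumber the index.
combined = pd.concat([ages, more_ages], ignore_index=True)
print(combined)
```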

Syntax	Selection Result
`df[column_name]`	`Series`
`df[[column1, column2, ... ]]`	`DataFrame`
`df.loc[row_label]`	Row by row index label (row name)
`df.loc[[label1, label2, ...]]`	Multiple rows by index label
`df.iloc[row_number]`	Row by row number
`df.iloc[[row1, row2, ...]]`	Multiple rows by row number
`df[bool]`	Row based on `bool`
`df[[bool1, bool2, ...]]`	Multiple rows based on `bool`
`df[start:stop:step]`	Rows based on slicing notation

Export Method	Description
`.to_clipboard()`	Save data into the system clipboard for pasting
`.to_dense()`	Convert data into a regular “dense” `DataFrame` (removed in Pandas 1.0)
`.to_dict()`	Convert data into a Python `dict`
`.to_gbq()`	Convert data into a Google BigQuery table
`.to_hdf()`	Save data into a hierarchical data format (HDF)
`.to_msgpack()`	Save data into a portable JSON-like binary (removed in Pandas 1.0)
`.to_html()`	Convert data into an HTML table
`.to_json()`	Convert data into a JSON string
`.to_latex()`	Convert data into a LaTeX tabular environment
`.to_records()`	Convert data into a record array
`.to_string()`	Show `DataFrame` as a string for `stdout`
`.to_sparse()`	Convert data into a `SparseDataFrame` (removed in Pandas 1.0)
`.to_sql()`	Save data into a SQL database
`.to_stata()`	Convert data into a Stata `dta` file
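
The in-memory conversions from this table need no files or extra services, so they are easy to try. The sketch below uses a made-up two-row DataFrame and only methods that return a Python object or a string.

```python
import json

import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [37, 61]})

as_dict = df.to_dict()    # nested dict: {column: {index: value}}
as_json = df.to_json()    # JSON string, organized by column
as_html = df.to_html()    # HTML table markup as a string
as_text = df.to_string()  # plain-text rendering

# The JSON string can be parsed back with the standard library.
round_trip = json.loads(as_json)

print(as_dict)
print(as_text)
```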

Table of Contents for 2. Pandas Data Structures Basics

Learning Objectives

2.1 Create Your Own Data

2.1.1 Create a Series

2.1.2 Create a DataFrame

2.2 The Series

2.2.1 The Series Is ndarray-like

2.2.1.1 Series Methods

2.2.2 Boolean Subsetting: Series

2.2.3 Operations Are Automatically Aligned and Vectorized (Broadcasting)

2.2.3.1 Vectors of the Same Length

2.2.3.2 Vectors With Integers (Scalars)

2.2.3.3 Vectors With Different Lengths

2.2.3.4 Vectors With Common Index Labels (Automatic Alignment)

2.3 The DataFrame

2.3.1 Parts of a DataFrame

2.3.2 Boolean Subsetting: DataFrames

2.3.3 Operations Are Automatically Aligned and Vectorized (Broadcasting)

2.4 Making Changes to Series and DataFrames

2.4.1 Add Additional Columns

2.4.2 Directly Change a Column

2.4.3 Modifying Columns with .assign()

2.4.4 Dropping Values

2.5 Exporting and Importing Data

2.5.1 Pickle

2.5.1.1 Series

2.5.1.2 DataFrame

2.5.1.3 Read pickle data

2.5.2 Comma-Separated Values (CSV)

2.5.2.1 Import CSV Data

2.5.3 Excel

2.5.3.1 Series

2.5.3.2 DataFrames

2.5.4 Feather

2.5.5 Arrow

2.5.6 Dictionary

2.5.7 JSON (JavaScript Object Notation)

2.5.8 Other Data Output Types

Conclusion