The pandas Series data structure is a one-dimensional heterogeneous array with labels. We can create a pandas Series data structure from a Python dict, a NumPy array, or a single scalar value.
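As a minimal sketch of these input types, the following snippet builds a few small Series. It assumes Python 3 and a recent pandas release, and all values and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a NumPy array: labels default to 0, 1, 2, ...
from_array = pd.Series(np.array([10.0, 20.0, 30.0]))

# From a NumPy array with explicit axis labels passed as the index.
labeled = pd.Series(np.array([10.0, 20.0, 30.0]), index=["a", "b", "c"])

# From a dict: the keys become the index labels.
from_dict = pd.Series({"x": 1, "y": 2})

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5, index=["p", "q", "r"])

print(from_array.index.tolist())
print(labeled["b"])
print(from_dict.index.tolist())
print(from_scalar.tolist())
```

The labels in the index work like dictionary keys, so `labeled["b"]` looks up a value by label rather than by position.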
When creating a Series, we can hand the constructor a list of axis labels, commonly referred to as the index. The index is an optional parameter. By default, if we use a NumPy array as the input data, pandas indexes the values with integers that increment from 0. If the data handed to the constructor is a Python dict, the dict keys become the index (sorted in older pandas versions; recent versions keep the insertion order). In the case of a scalar value as the input data, we are required to supply the index. The scalar input value is repeated for each label in the index.
The pandas Series and DataFrame interfaces have features and behaviors borrowed from NumPy arrays and Python dictionaries, such as slicing, lookup using a key, and vectorized operations. Performing a lookup on a DataFrame column returns a Series. We will demonstrate this and other features of Series by going back to the previous section and loading the CSV file again.
Start by selecting the Country column, which happens to be the first column in the datafile. Then, show the type of the object currently in the local scope:
country_col = df["Country"]
print "Type df", type(df)
print "Type country col", type(country_col)
We can now confirm that we get a Series when we select a column of a data frame:
Type df <class 'pandas.core.frame.DataFrame'>
Type country col <class 'pandas.core.series.Series'>
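The same check can be tried without the CSV file. The sketch below assumes Python 3 and a recent pandas release, and uses a small made-up DataFrame as a stand-in for the WHO data.

```python
import pandas as pd

# Hypothetical stand-in for the WHO data; the values are made up.
df = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Syldavia"],
    "Population": [10.0, 20.0, 30.0],
})

# A single column label returns a Series.
country_col = df["Country"]

# A list of labels returns a DataFrame, even with one column.
country_frame = df[["Country"]]

print(type(country_col).__name__)
print(type(country_frame).__name__)
```

The difference between the two lookups matters when later code expects Series-only attributes such as `name`.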
The pandas Series data structure shares some of the attributes of DataFrame and also has a name attribute. Explore these properties as follows:
print "Series shape", country_col.shape
print "Series index", country_col.index
print "Series values", country_col.values
print "Series name", country_col.name
The output (truncated to save space) is given as follows:
Series shape (202,)
Series index Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...], dtype='int64')
Series values ['Afghanistan' … 'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']
Series name Country
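These attributes can also be inspected on a small hand-built Series. This is a sketch under the assumption of Python 3 and a recent pandas release, with made-up country names. Note that recent pandas versions report a RangeIndex for default labels instead of the Int64Index shown above.

```python
import pandas as pd

# Made-up data; the name argument plays the role of the column label.
s = pd.Series(["Aland", "Borduria", "Syldavia"], name="Country")

print(s.shape)          # a one-element tuple, since a Series is 1-D
print(list(s.index))    # default integer labels
print(list(s.values))   # the underlying NumPy array, shown as a list
print(s.name)
```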
To demonstrate slicing, select the last two countries of the Country Series and print the type:
print "Last 2 countries", country_col[-2:]
print "Last 2 countries type", type(country_col[-2:])
Slicing yields another Series as demonstrated:
Last 2 countries 200 Zambia
201 Zimbabwe
Name: Country, dtype: object
Last 2 countries type <class 'pandas.core.series.Series'>
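A small sketch of the same slicing behavior follows, assuming Python 3 and a recent pandas release, with made-up data. It uses `iloc`, the explicit positional indexer, as an unambiguous way to express the `[-2:]` slice.

```python
import pandas as pd

# Made-up data for illustration.
s = pd.Series(["Aland", "Borduria", "Syldavia", "Zubrowka"], name="Country")

# Positional slice of the last two elements.
last_two = s.iloc[-2:]

print(last_two.index.tolist())   # the original labels are kept
print(last_two.tolist())
print(type(last_two).__name__)
```

The slice keeps the original index labels and the name, which is why the output above shows 200 and 201 rather than 0 and 1.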
NumPy functions can operate on pandas DataFrame and Series. We can, for instance, apply the NumPy sign() function, which yields the sign of a number: 1 is returned for positive numbers, -1 for negative numbers, and 0 for zeros. Apply the function to the DataFrame and to its last column, which happens to be the population for each country in the dataset:
print "df signs", np.sign(df)
last_col = df.columns[-1]
print "Last df column signs", last_col, np.sign(df[last_col])
The output is truncated here to save space and is as follows:
df signs Country CountryID Continent Adolescent fertility rate (%)
0 1 1 1 1
[TRUNCATED]
59 1 1 ... ...
[202 rows x 9 columns]
Last df column signs Population (in thousands) total 0 1
1 1
[TRUNCATED]
198 NaN
199 1
200 1
201 1
Name: Population (in thousands) total, Length: 202, dtype: float64
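The following sketch applies sign() to a small made-up DataFrame, assuming Python 3 and a recent pandas release. It is restricted to numeric columns, which is an assumption made here because the handling of text columns may differ between versions.

```python
import numpy as np
import pandas as pd

# Made-up numeric data, including one missing value.
df = pd.DataFrame({
    "a": [-2.0, 0.0, 3.5],
    "b": [1.0, np.nan, -4.0],
})

# Applied to a DataFrame, the result is a DataFrame of signs.
signs = np.sign(df)

# Applied to a column, the result is a Series; NaN stays NaN.
last_col = df.columns[-1]
col_signs = np.sign(df[last_col])

print(signs)
print(col_signs)
```

As in the output above, a missing value yields NaN instead of a sign.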
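We can perform all sorts of numerical operations between DataFrames, Series, and NumPy arrays. If we get the underlying NumPy array of a pandas Series and subtract this array from the Series, we can reasonably expect one of the following two outcomes: first, an array filled with zeros and at least one NaN (we saw a NaN in the previous step); second, an array filled with only zeros.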
The rule for NumPy functions is to produce NaNs for most operations involving NaNs, as illustrated by the following IPython session:
In: np.sum([0, np.nan])
Out: nan
Write the following code to perform the subtraction:
print np.sum(df[last_col] - df[last_col].values)
The snippet yields the result predicted by the second option, because summing a pandas Series skips NaN values by default:
0.0
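The contrast between the NumPy rule and the result above can be reproduced on a tiny made-up Series. This is a sketch assuming Python 3 and a recent pandas release; `skipna` and `np.nansum` are existing pandas and NumPy features, but they do not appear in the original listing.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

# Series minus its own NumPy array: NaN - NaN is still NaN.
diff = s - s.values

print(diff.tolist())               # zeros with a NaN in the middle
print(np.sum(diff))                # summing the Series skips the NaN
print(diff.sum(skipna=False))      # opting out of skipping gives nan
print(np.sum(diff.values))         # plain NumPy array: nan
print(np.nansum(diff.values))      # NumPy's NaN-ignoring sum
```

So the subtraction itself does contain a NaN; it is the summation of the Series that hides it.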
Please refer to the series_demo.py file in the book's code bundle:
from pandas.io.parsers import read_csv
import numpy as np
df = read_csv("WHO_first9cols.csv")
country_col = df["Country"]
print "Type df", type(df)
print "Type country col", type(country_col)
print "Series shape", country_col.shape
print "Series index", country_col.index
print "Series values", country_col.values
print "Series name", country_col.name
print "Last 2 countries", country_col[-2:]
print "Last 2 countries type", type(country_col[-2:])
print "df signs", np.sign(df)
last_col = df.columns[-1]
print "Last df column signs", last_col, np.sign(df[last_col])
print np.sum(df[last_col] - df[last_col].values)