pandas Series

The pandas Series data structure is a one-dimensional labeled array that can hold values of any data type. We can create a Series from the following inputs:

  • From a Python dict
  • From a NumPy array
  • From a single scalar value

When creating a Series, we can hand the constructor a list of axis labels, commonly referred to as the index. The index is an optional parameter. By default, if we use a NumPy array as the input data, pandas indexes the values with integers that increment from 0. If the data handed to the constructor is a Python dict, the sorted dict keys become the index (recent pandas versions keep the dict's insertion order instead). If the input data is a scalar value, we are required to supply the index, and the scalar is repeated for each label in the index.

The pandas Series and DataFrame interfaces have features and behaviors borrowed from NumPy arrays and Python dictionaries, such as slicing, lookup using a key, and vectorized operations. Performing a lookup on a DataFrame column returns a Series. We will demonstrate this and other features of Series by going back to the previous section and loading the CSV file again.
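Before returning to the CSV file, the three construction routes listed above can be sketched as follows. This is a minimal illustration rather than part of series_demo.py; the variable names and the sample values are made up for the example. Each print call uses a single formatted string so that the sketch runs on both Python 2 and Python 3.

```python
import numpy as np
import pandas as pd

# From a Python dict: the keys become the index labels.
from_dict = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})

# From a NumPy array without an index: labels default to 0, 1, 2, ...
from_array = pd.Series(np.array([10, 20, 30]))

# From a NumPy array with an explicit list of axis labels.
labeled = pd.Series(np.array([10, 20, 30]), index=["x", "y", "z"])

# From a scalar: the index is required and the value is repeated per label.
from_scalar = pd.Series(5.0, index=["x", "y", "z"])

print("dict index: %s" % list(from_dict.index))
print("array index: %s" % list(from_array.index))
print("lookup by label: %s" % labeled["y"])
print("scalar values: %s" % list(from_scalar.values))
```

The dict keys here are already in sorted order, so the resulting index is the same whether pandas sorts the keys or keeps their insertion order.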

  1. We will start by selecting the Country column, which happens to be the first column in the datafile. Then, show the type of the object currently in the local scope:
    country_col = df["Country"]
    print "Type df", type(df)
    print "Type country col", type(country_col)

    The output confirms that selecting a column of a DataFrame returns a Series:

    Type df <class 'pandas.core.frame.DataFrame'>
    Type country col <class 'pandas.core.series.Series'>
    

    Note

    If you want, you can open a Python or IPython shell, import pandas, and use the dir() function to list the functions and attributes of the classes shown in the previous printout. Be aware that the list is long in both cases.

  2. The pandas Series data structure shares some of the attributes of DataFrame and also has a name attribute. Explore these properties as follows:
    print "Series shape", country_col.shape
    print "Series index", country_col.index
    print "Series values", country_col.values
    print "Series name", country_col.name

    The output (truncated to save space) is given as follows:

    Series shape (202,)
    Series index Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...], dtype='int64')
    Series values ['Afghanistan' … 'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']
    Series name Country
    
  3. To demonstrate the slicing of a Series, select the last two countries of the Country Series and print the type:
    print "Last 2 countries", country_col[-2:]
    print "Last 2 countries type", type(country_col[-2:])

    Slicing yields another Series as demonstrated:

    Last 2 countries 200      Zambia
    201    Zimbabwe
    Name: Country, dtype: object
    Last 2 countries type <class 'pandas.core.series.Series'>
    
  4. NumPy functions can operate on pandas DataFrame and Series objects. We can, for instance, apply the NumPy sign() function, which yields the sign of a number: 1 for positive numbers, -1 for negative numbers, and 0 for zeros. Apply the function to the whole DataFrame and then to its last column, which happens to be the population for each country in the dataset:
    print "df signs", np.sign(df)
    last_col = df.columns[-1]
    print "Last df column signs", last_col, np.sign(df[last_col])

    The output is truncated here to save space and is as follows:

    df signs    Country CountryID Continent Adolescent fertility rate (%)  
    0        1         1         1                             1
    [TRUNCATED]
    59                                           1                               1  
                                               ...                             ...  
    
    [202 rows x 9 columns]
    Last df column signs Population (in thousands) total 0     1
    1     1
    [TRUNCATED]
    198   NaN
    199     1
    200     1
    201     1
    Name: Population (in thousands) total, Length: 202, dtype: float64
    

    Note

    Note that the sign at index 198 is NaN because the population value is missing from the matching record, which is given as follows:

    West Bank and Gaza,199,1,,,,,,
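The steps above can be tried without the WHO file on a small stand-in DataFrame. The sketch below is only an approximation of the walkthrough: the column name Population and the numeric values are hypothetical placeholders, not data from WHO_first9cols.csv, and .iloc is used as the explicitly positional form of the slice from step 3.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the WHO data; the population values are
# placeholders, and one of them is deliberately missing.
df = pd.DataFrame({
    "Country": ["Zambia", "West Bank and Gaza", "Zimbabwe"],
    "Population": [100.0, np.nan, 200.0],
})

country_col = df["Country"]          # column lookup returns a Series
last_two = country_col.iloc[-2:]     # positional slice, also a Series
signs = np.sign(df["Population"])    # elementwise; a missing value stays NaN

print("Type country col: %s" % type(country_col).__name__)
print("Series shape: %s, name: %s" % (country_col.shape, country_col.name))
print("Last 2 countries: %s" % list(last_two))
print("Signs: %s" % list(signs))
```

As in the real dataset, the row with the missing population produces NaN instead of a sign.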
    

We can perform all sorts of numerical operations between DataFrames, Series, and NumPy arrays. If we get the underlying NumPy array of a pandas Series and subtract this array from the Series, we can reasonably expect one of the following two outcomes:

  • An array filled with zeros and at least one NaN (we saw one NaN in the previous step)
  • An array containing only zeros

The rule for NumPy functions is to produce NaNs for most operations involving NaNs, as illustrated by the following IPython session:

In: np.sum([0, np.nan])
Out: nan
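The same rule can be checked on a plain NumPy array. The following sketch, with made-up sample values, contrasts np.sum(), which propagates NaN, with np.nansum(), which treats NaN as zero:

```python
import numpy as np

values = np.array([0.0, np.nan, 2.0])   # sample data with one NaN

plain_total = np.sum(values)       # NaN propagates through the sum
nan_safe_total = np.nansum(values) # NaN entries are treated as zero
nan_difference = np.nan - np.nan   # arithmetic on NaN also gives NaN

print("np.sum: %s" % plain_total)
print("np.nansum: %s" % nan_safe_total)
print("nan - nan is NaN: %s" % np.isnan(nan_difference))
```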

Write the following code to perform the subtraction:

print np.sum(df[last_col] - df[last_col].values)

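The snippet prints a sum of zero, which matches the second option. Keep in mind that np.sum() applied to a Series delegates to the pandas sum() method, which skips NaN values by default, so the NaN at index 198 does not show up in the total: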

0.0
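To see why the total is 0.0 even though the column contains a missing value, it can help to compare the sum of the Series with the sum of its underlying array. The sketch below uses a short hypothetical Series in place of df[last_col]:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # hypothetical stand-in for df[last_col]
diff = s - s.values                 # elementwise subtraction; NaN - NaN is NaN

series_sum = np.sum(diff)           # handled by Series.sum(), which skips NaN
array_sum = np.sum(diff.values)     # plain ndarray sum, so NaN propagates

print("NaN count in diff: %s" % diff.isnull().sum())
print("Sum of the Series: %s" % series_sum)
print("Sum of the ndarray: %s" % array_sum)
```

In this sketch the difference itself still contains a NaN; only the Series sum hides it.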

Please refer to the series_demo.py file in the book's code bundle:

from pandas.io.parsers import read_csv
import numpy as np

df = read_csv("WHO_first9cols.csv")
country_col = df["Country"]
print "Type df", type(df)
print "Type country col", type(country_col)

print "Series shape", country_col.shape
print "Series index", country_col.index
print "Series values", country_col.values
print "Series name", country_col.name

print "Last 2 countries", country_col[-2:]
print "Last 2 countries type", type(country_col[-2:])

print "df signs", np.sign(df)
last_col = df.columns[-1]
print "Last df column signs", last_col, np.sign(df[last_col])

print np.sum(df[last_col] - df[last_col].values)