Data structures in pandas

The pandas was created by Wed McKinney in 2008 as a result of frustrations he encountered while working on time series data in R. It is built on top of NumPy and provides features not available in it. It provides fast, easy-to-understand data structures and helps fill the gap between Python and a language such as R.

A key reference for the various operations I demonstrate here is the official pandas data structure documentation: http://pandas.pydata.org/pandas-docs/dev/dsintro.html.

There are three main data structures in pandas:

  • Series
  • DataFrame
  • Panel

Series

Series is really a 1D NumPy array under the hood. It consists of a NumPy array coupled with an array of labels.

Series creation

The general construct for creating a Series data structure is as follows:

import pandas as pd  
ser=pd.Series(data, index=idx)

where data can be one of the following:

  • An ndarray
  • A Python dictionary
  • A scalar value

Using numpy.ndarray

In this case, the index must be the same length as the data. If an index is not specified, the following default index [0,... n-1] will be created, where n is the length of the data. The following example creates a Series structure of seven random numbers between 0 and 1; the index is not specified:

In [466]: import numpy as np
          np.random.seed(100)
          ser=pd.Series(np.random.rand(7)); ser
Out[466]:0    0.543405
         1    0.278369
         2    0.424518
         3    0.844776
         4    0.004719
         5    0.121569
         6    0.670749
         dtype: float64

The following example creates a Series structure of the first 5 months of the year with a specified index of month names:

In [481]: import calendar as cal
       monthNames=[cal.month_name[i] for i in np.arange(1,6)]
    months=pd.Series(np.arrange(1,6),index=monthNames);months

Out[481]: January     1
    February    2
    March       3
    April       4
    May         5
    dtype: int64

In [482]: months.index
Out[482]: Index([u'January', u'February', u'March', u'April', u'May'], dtype=object)

Using Python dictionary

If the data is a dictionary and an index is provided, the labels will be constructed from it; else, the keys of the dictionary will be used for the labels. The values of the dictionary are used to populate the Series structure.

In [486]: currDict={'US' : 'dollar', 'UK' : 'pound', 
                    'Germany': 'euro', 'Mexico':'peso',
                    'Nigeria':'naira',
                    'China':'yuan', 'Japan':'yen'}
          currSeries=pd.Series(currDict); currSeries
Out[486]: China        yuan
         Germany      euro
         Japan         yen
         Mexico       peso
         Nigeria     naira
         UK          pound
         US         dollar
         dtype: object

The index of a pandas Series structure is of type pandas.core.index.Index and can be viewed as an ordered multiset.

In the following case, we specify an index, but the index contains one entry that isn't a key in the corresponding dict. The result is that the value for the key is assigned as NaN, indicating that it is missing. We will deal with handling missing values in a later section.

In [488]: stockPrices = {'GOOG':1180.97,'FB':62.57, 
                         'TWTR': 64.50, 'AMZN':358.69,
                         'AAPL':500.6}
            stockPriceSeries=pd.Series(stockPrices,
                          index=['GOOG','FB','YHOO', 
                                 'TWTR','AMZN','AAPL'],
                           name='stockPrices')
         stockPriceSeries
Out[488]: GOOG    1180.97
         FB        62.57
         YHOO        NaN
         TWTR      64.50
         AMZN     358.69
         AAPL     500.60
         Name: stockPrices, dtype: float64

Note that Series also has a name attribute that can be set as shown in the preceding snippet. The name attribute is useful in tasks such as combining Series objects into a DataFrame structure.

Using scalar values

For scalar data, an index must be provided. The value will be repeated for as many index values as possible. One possible use of this method is to provide a quick and dirty method of initialization, with the Series structure to be filled in later. Let us see how to create a Series using scalar values:

In [491]: dogSeries=pd.Series('chihuahua', 
                   index=['breed','countryOfOrigin',
                   'name', 'gender'])
          dogSeries
Out[491]: breed              chihuahua
         countryOfOrigin    chihuahua
         name               chihuahua
         gender             chihuahua
         dtype: object

Failure to provide an index just results in a scalar value being returned as follows:

In [494]: dogSeries=pd.Series('pekingese'), dogSeries
Out[494]: 'pekingese'

In [495]: type(dogSeries)
Out[495]: str

Operations on Series

The behavior of Series is very similar to that of numpy arrays discussed in a previous section, with one caveat being that an operation such as slicing also slices the index.

Assignment

Values can be set and accessed using the index label in a dictionary-like manner:

In [503]: currDict['China']
Out[503]: 'yuan'

In [505]: stockPriceSeries['GOOG']=1200.0
            stockPriceSeries
Out[505]: GOOG    1200.00
         FB        62.57
         YHOO        NaN
         TWTR      64.50
         AMZN     358.69
         AAPL     500.60
         dtype: float64

Just as in the case of dict, KeyError is raised if you try to retrieve a missing label:

In [506]: stockPriceSeries['MSFT']
KeyError: 'MSFT'

This error can be avoided by explicitly using get as follows:

In [507]: stockPriceSeries.get('MSFT',np.NaN)
Out[507]: nan

In this case, the default value of np.NaN is specified as the value to return when the key does not exist in the Series structure.

Slicing

The slice operation behaves the same way as a NumPy array:

In [498]: stockPriceSeries[:4]
Out[498]: GOOG    1180.97
         FB        62.57
         YHOO        NaN
         TWTR      64.50
         dtype: float64

Logical slicing also works as follows:

In [500]: stockPriceSeries[stockPriceSeries > 100]
Out[500]: GOOG    1180.97
         AMZN     358.69
         AAPL     500.60
         dtype: float64

Other operations

Arithmetic and statistical operations can be applied, just as with a NumPy array:

In [501]: np.mean(stockPriceSeries)
Out[501]: 433.46600000000001
In [502]: np.std(stockPriceSeries)
Out[502]: 410.50223047384287

Element-wise operations can also be performed on series:

In [506]: ser
Out[506]: 0    0.543405
          1    0.278369
          2    0.424518
          3    0.844776
          4    0.004719
          5    0.121569
          6    0.670749
          dtype: float64
In [508]: ser*ser
Out[508]: 0    0.295289
          1    0.077490
          2    0.180215
          3    0.713647
          4    0.000022
          5    0.014779
          6    0.449904
         dtype: float64
In [510]: np.sqrt(ser)
Out[510]: 0    0.737160
          1    0.527607
          2    0.651550
          3    0.919117
          4    0.068694
          5    0.348668
          6    0.818993
         dtype: float64

An important feature of Series is that the data is automatically aligned on the basis of the label:

In [514]: ser[1:]
Out[514]: 1    0.278369
          2    0.424518
          3    0.844776
          4    0.004719
          5    0.121569
          6    0.670749
         dtype: float64
In [516]:ser[1:] + ser[:-2]
Out[516]: 0         NaN
         1    0.556739
         2    0.849035
         3    1.689552
         4    0.009438
         5         NaN
         6         NaN
         dtype: float64

Thus, we can see that for non-matching labels, NaN is inserted. The default behavior is that the union of the indexes is produced for unaligned Series structures. This is preferable as information is preserved rather than lost. We will handle missing values in pandas in a later chapter of the book.

DataFrame

DataFrame is an 2-dimensional labeled array. Its column types can be heterogeneous: that is, of varying types. It is similar to structured arrays in NumPy with mutability added. It has the following properties:

  • Conceptually analogous to a table or spreadsheet of data.
  • Similar to a NumPy ndarray but not a subclass of np.ndarray.
  • Columns can be of heterogeneous types: float64, int, bool, and so on.
  • A DataFrame column is a Series structure.
  • It can be thought of as a dictionary of Series structures where both the columns and the rows are indexed, denoted as 'index' in the case of rows and 'columns' in the case of columns.
  • It is size mutable: columns can be inserted and deleted.

Every axis in a Series/DataFrame has an index, whether default or not. Indexes are needed for fast lookups as well as proper aligning and joining of data in pandas. The axes can also be named-for example in the form of month for the array of columns Jan Feb Mar... Dec. Here is a representation of an indexed DataFrame, with named columns across and an index column of characters V, W, X, Y, Z:

 columns nums strs bools decs 
 index   
 V            11       cat   True   1.4
 W            -6      hat   False  6.9 
 X             25     bat   False  -0.6
 Y               8     mat  True   3.7
 Z             -17    sat    False  18.

DataFrame Creation

DataFrame is the most commonly used data structure in pandas. The constructor accepts many different types of arguments:

  • Dictionary of 1D ndarrays, lists, dictionaries, or Series structures
  • 2D NumPy array
  • Structured or record ndarray
  • Series structures
  • Another DataFrame structure

Row label indexes and column labels can be specified along with the data. If they're not specified, they will be generated from the input data in an intuitive fashion, for example, from the keys of dict. (in case of column labels) or by using np.range(n) in the case of row labels, where n corresponds to the number of rows.

Using dictionaries of Series

Here, we create a DataFrame structure by using a dictionary of Series objects.

In [97]:stockSummaries={
'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88], 
        index=['Closing price','EPS',
                'Shares Outstanding(M)',
                'Beta', 'P/E','Market Cap(B)']),
'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64],
        index=['Closing price','EPS','Shares Outstanding(M)',
               'Beta','P/E','Market Cap(B)']),
'FB': pd.Series([61.48,0.59,2450,104.93,150.92], 
      index=['Closing price','EPS','Shares Outstanding(M)',
             'P/E', 'Market Cap(B)']),
'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36],
        index=['Closing price','EPS','Shares Outstanding(M)',
               'P/E','Beta', 'Market Cap(B)']),
'TWTR':pd.Series([65.25,-0.3,555.2,36.23],
       index=['Closing price','EPS','Shares Outstanding(M)',
              'Market Cap(B)']), 
'AAPL':pd.Series([501.53,40.32,892.45,12.44,447.59,0.84],
       index=['Closing price','EPS','Shares Outstanding(M)','P/E',
              'Market Cap(B)','Beta'])}

In [99]: stockDF=pd.DataFrame(stockSummaries); stockDF
Out[99]:
 

AAPL

AMZN

FB

GOOG

TWTR

YHOO

Beta

0.84

0.52

NaN

0.87

NaN

0.66

Closing price

501.53

346.15

61.48

1133.43

65.25

34.9

EPS

40.32

0.59

0.59

36.05

-0.3

1.27

Market Cap(B)

447.59

158.88

150.92

380.64

36.23

35.36

P/E

12.44

589.8

104.93

31.44

NaN

27.48

Shares Outstanding(M)

892.45

459

2450

335.83

555.2

1010

In [100]:stockDF=pd.DataFrame(stockSummaries,
                 index=['Closing price','EPS',
                        'Shares Outstanding(M)',
                        'P/E', 'Market Cap(B)','Beta']);stockDF
Out [100]:
 

AAPL

AMZN

FB

GOOG

TWTR

YHOO

Closing price

501.53

346.15

61.48

1133.43

65.25

34.9

EPS

40.32

0.59

0.59

36.05

-0.3

1.27

Shares Outstanding(M)

892.45

459

2450

NaN

555.2

1010

P/E

12.44

589.8

104.93

31.44

NaN

27.48

Market Cap(B)

447.59

158.88

150.92

380.64

36.23

35.36

Beta

0.84

0.52

NaN

0.87

NaN

0.66

In [102]:stockDF=pd.DataFrame(stockSummaries,
                 index=['Closing price','EPS','Shares Outstanding(M)',
                        'P/E', 'Market Cap(B)','Beta'],
                 columns=['FB','TWTR','SCNW'])
          stockDF
Out [102]:

 

FB

TWTR

SCNW

Closing price

61.48

65.25

NaN

EPS

0.59

-0.3

NaN

Shares Outstanding(M)

2450

555.2

NaN

P/E

104.93

NaN

NaN

Market Cap(B)

150.92

36.23

NaN

Beta

NaN

NaN

NaN

The row index labels and column labels can be accessed via the index and column attributes:

In [527]: stockDF.index
Out[527]: Index([u'Closing price', u'EPS',
          u'Shares      Outstanding(M)', 
          u'P/E', u'Market Cap(B)', u'Beta'], dtype=object)
In [528]: stockDF.columns
Out[528]: Index([u'AAPL', u'AMZN', u'FB', u'GOOG', u'TWTR',
          u'YHOO'], dtype=object)

The source for the preceding data is Google Finance, accessed on 2/3/2014: http://finance.google.com.

Using a dictionary of ndarrays/lists

Here, we create a DataFrame structure from a dictionary of lists. The keys become the column labels in the DataFrame structure and the data in the list becomes the column values. Note how the row label indexes are generated using np.range(n).

In [529]:algos={'search':['DFS','BFS','Binary Search',
                 'Linear','ShortestPath (Djikstra)'],
      'sorting': ['Quicksort','Mergesort', 'Heapsort',
                   'Bubble Sort', 'Insertion Sort'],
      'machine learning':['RandomForest',
                           'K Nearest Neighbor',
                           'Logistic Regression',
                           'K-Means Clustering',
                           'Linear Regression']}
algoDF=pd.DataFrame(algos); algoDF
Out[529]:       
     machine learning  search   sorting
0   RandomForest   DFS   Quicksort
1   K Nearest Neighbor   BFS   Mergesort
2   Logistic Regression   Binary   Search   Heapsort
3   K-Means Clustering	 Linear   Bubble Sort
4   Linear Regression   ShortestPath (Djikstra)   Insertion Sort

In [530]: pd.DataFrame(algos,index=['algo_1','algo_2','algo_3','algo_4',
'algo_5'])
Out[530]:               
             machine learning  search     sorting
algo_1   RandomForest        DFS     Quicksort
algo_2   K Nearest Neighbor  BFS     Mergesort
algo_3   Logistic Regression  Binary Search Heapsort
algo_4   K-Means Clustering  Linear     Bubble Sort
algo_5   Linear Regression  ShortestPath (Djikstra) Insertion Sort

Using a structured array

In this case, we use a structured array, which is an array of records or structs. For more information on structured arrays, refer to the following: http://docs.scipy.org/doc/numpy/user/basics.rec.html.

In [533]: memberData = np.zeros((4,), 
                dtype=[('Name','a15'),
                       ('Age','i4'),
                      ('Weight','f4')])
          memberData[:] = [('Sanjeev',37,162.4),
                           ('Yingluck',45,137.8),
                           ('Emeka',28,153.2),
                           ('Amy',67,101.3)]
            memberDF=pd.DataFrame(memberData);memberDF
Out[533]:         Name       Age    Weight
        0   Sanjeev    37  162.4
        1   Yingluck   45  137.8
        2   Emeka        28  153.2
        3   Amy        67  101.3
In [534]: pd.DataFrame(memberData, index=['a','b','c','d'])
Out[534]:    Name       Age    Weight
        a   Sanjeev    37  162.4
        b   Yingluck   45  137.8
        c   Emeka        28  153.2
        d   Amy        67  101.3

Using a Series structure

Here, we show how to construct a DataFrame structure from a Series structure:

In [ 540]: currSeries.name='currency'
            pd.DataFrame(currSeries)
Out[540]:        currency
        China   yuan
        Germany euro
        Japan   yen
        Mexico   peso
        Nigeria naira
        UK   pound
        US   dollar

There are also alternative constructors for DataFrame; they can be summarized as follows:

  • DataFrame.from_dict: It takes a dictionary of dictionaries or sequences and returns DataFrame.
  • DataFrame.from_records: It takes a list of tuples or structured ndarray.
  • DataFrame.from_items: It takes a sequence of (key, value) pairs. The keys are the column or index names, and the values are the column or row values. If you wish the keys to be row index names, you must specify orient='index' as a parameter and specify the column names.
  • pandas.io.parsers.read_csv: This is a helper function that reads a CSV file into a pandas DataFrame structure.
  • pandas.io.parsers.read_table: This is a helper function that reads a delimited file into a pandas DataFrame structure.
  • pandas.io.parsers.read_fwf: This is a helper function that reads a table of fixed-width lines into a pandas DataFrame structure.

Operations

Here, I will briefly describe the various DataFrame operations.

Selection

A specific column can be obtained as a Series structure:

In [543]: memberDF['Name']
Out[543]: 0    Sanjeev
         1    Yingluck
         2    Emeka
         3    Amy
    Name: Name, dtype: object

Assignment

A new column can be added via assignment, as follows:

In [545]:   memberDF['Height']=60;memberDF
Out[545]:          Name        Age  Weight  Height
  0                Sanjeev     37   162.4   60
  1                Yingluck    45   137.8   60
  2                Emeka       28   153.2   60
  3                Amy         67   101.3   60

Deletion

A column can be deleted, as you would in the case of dict:

In [546]: del memberDF['Height']; memberDF
Out[546]:         Name       Age  Weight
  0               Sanjeev    37   162.4
  1               Yingluck   45   137.8
  2               Emeka      28   153.2
  3
               
Amy        67   101.3

It can also be popped, as with a dictionary:

In [547]: memberDF['BloodType']='O'
           bloodType=memberDF.pop('BloodType'), bloodType
Out[547]: 0    O
          1    O
          2    O
          3    O
    Name: BloodType, dtype: object

Basically, a DataFrame structure can be treated as if it were a dictionary of Series objects. Columns get inserted at the end; to insert a column at a specific location, you can use the insert function:

In [552]: memberDF.insert(2,'isSenior',memberDF['Age']>60);
          memberDF
Out[552]:      Name      Age  isSenior  Weight
  0            Sanjeev   37   False      162.4
  1            Yingluck  45   False     137.8
  2            Emeka     28   False     153.2
  3            Amy       67   True      101.3

Alignment

DataFrame objects align in a manner similar to Series objects, except that they align on both column and index labels. The resulting object is the union of the column and row labels:

In [559]: ore1DF=pd.DataFrame(np.array([[20,35,25,20],
                                       [11,28,32,29]]),
                              columns=['iron','magnesium',
                                       'copper','silver'])
       ore2DF=pd.DataFrame(np.array([[14,34,26,26],
                                   [33,19,25,23]]),
                             columns=['iron','magnesium',
                                     'gold','silver'])
    ore1DF+ore2DF
Out[559]:     copper  gold  iron  magnesium  silver
  0           NaN     NaN   34    69         46
  1           NaN     NaN   44    47         52

In the case where there are no row labels or column labels in common, the value is filled with NaN, for example, copper and gold. If you combine a DataFrame object and a Series object, the default behavior is to broadcast the Series object across the rows:

In [562]: ore1DF + pd.Series([25,25,25,25],
                             index=['iron','magnesium',
                             'copper','silver'])
Out[562]:    iron  magnesium   copper   silver
  0          45    60          50       45
  1          36    53          57       54

Other mathematical operations

Mathematical operators can be applied element wise on DataFrame structures:

In [565]: np.sqrt(ore1DF)
Out[565]:        iron       magnesium   copper         silver
  0              4.472136   5.916080    5.000000       4.472136
  1              3.316625   5.291503    5.656854       5.385165

Panel

Panel is a 3D array. It is not as widely used as Series or DataFrame. It is not as easily displayed on screen or visualized as the other two because of its 3D nature. The Panel data structure is the final piece of the data structure jigsaw puzzle in pandas. It is less widely used, and is used for 3D data. The three axis names are as follows:

  • items: This is axis 0. Each each item corresponds to a DataFrame structure.
  • major_axis: This is axis 1. Each item corresponds to the rows of the DataFrame structure.
  • minor_axis: This is axis 2. Each item corresponds to the columns of each DataFrame structure.

As for Series and DataFrame, there are different ways to create Panel objects. They are explained in the upcoming sections.

Using 3D NumPy array with axis labels

Here, we show how to construct a Panel object from a 3D NumPy array.

In 586[]: stockData=np.array([[[63.03,61.48,75],
                           [62.05,62.75,46],
                            [62.74,62.19,53]],
                              [[411.90, 404.38, 2.9],
                               [405.45, 405.91, 2.6],
                               [403.15, 404.42, 2.4]]])
          stockData
Out[586]: array([[[  63.03,   61.48,   75.  ],
                  [  62.05,   62.75,   46.  ],
                  [  62.74,   62.19,   53.  ]],
                 [[ 411.9 ,  404.38,    2.9 ],
                  [ 405.45,  405.91,    2.6 ],
                  [ 403.15,  404.42,    2.4 ]]])
In [587]: stockHistoricalPrices = pd.Panel(stockData, 
                                           items=['FB', 'NFLX'],
                                  major_axis=pd.date_range('2/3/2014', periods=3),
minor_axis=['open price', 'closing price', 'volume'])
          stockHistoricalPrices
Out[587]: <class 'pandas.core.panel.Panel'>
          Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
          Items axis: FB to NFLX
          Major_axis axis: 2014-02-03 00:00:00 to 2014-02-05 00:00:00
          Minor_axis axis: open price to volume

Using a Python dictionary of DataFrame objects

We construct a Panel structure by using a Python dictionary of DataFrame structures.

In [591]: USData=pd.DataFrame(np.array([[249.62  , 8900],
                                       [ 282.16,12680],
                                       [309.35,14940]]),
                    columns=['Population(M)','GDP($B)'],          
                    index=[1990,2000,2010])
          USData
Out[591]:       Population(M)   GDP($B)
        1990    249.62          8900
        2000    282.16          12680
        2010    309.35          14940
In [590]: ChinaData=pd.DataFrame(np.array([[1133.68, 390.28],
                                           [ 1266.83,1198.48],
                                           [1339.72, 6988.47]]),
                                                
                    columns=['Population(M)','GDP($B)'],
                    index=[1990,2000,2010])
          ChinaData
Out[590]:          Population(M)   GDP($B)
           1990    1133.68         390.28
           2000    1266.83         1198.48
           2010    1339.72         6988.47
In [592]:US_ChinaData={'US' : USData,
                       'China': ChinaData}
          pd.Panel(US_ChinaData)
Out[592]: <class 'pandas.core.panel.Panel'>
          Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
          Items axis: China to US
          Major_axis axis: 1990 to 2010

Using the DataFrame.to_panel method

This method converts a DataFrame structure having a MultiIndex to a Panel structure:

In [617]: mIdx = pd.MultiIndex(levels=[['US', 'China'],    
                                       [1990,2000, 2010]],
                     labels=[[1,1,1,0,0,0],[0,1,2,0,1,2]])
mIdx
Out[617]: MultiIndex
          [(u'China', 1990), (u'China', 2000), (u'China', 2010), 
           (u'US', 1990), (u'US', 2000), (u'US', 2010)]

ChinaUSDF = pd.DataFrame({'Population(M)' : [1133.68, 1266.83, 
                                             1339.72, 249.62,       
                                             282.16,309.35], 
                         'GDB($B)': [390.28, 1198.48, 6988.47, 
                                     8900,12680, 14940]}, index=mIdx)
ChinaUSDF
In [618]: ChinaUSDF = pd.DataFrame({'Population(M)' : [1133.68,   
                                                       1266.83, 
                                                       1339.72, 
                                                        249.62,         
                                                        282.16,
                                                        309.35], 
                                     'GDB($B)': [390.28, 1198.48, 
                                                 6988.47, 8900,
                                                 12680,14940]}, 
                          index=mIdx)
          ChinaUSDF

Out[618]:                       GDB($B)       Population(M)
            China       1990     390.28        1133.68
                        2000     1198.48       1266.83
                        2010     6988.47       1339.72
            US          1990     8900.00        249.62
                        2000     12680.00       282.16
                        2010     14940.00       309.35
In [622]: ChinaUSDF.to_panel()
Out[622]: <class 'pandas.core.panel.Panel'>
          Dimensions: 2 (items) x 2 (major_axis) x 3 (minor_axis)
          Items axis: GDB($B) to Population(M)
          Major_axis axis: US to China
          Minor_axis axis: 1990 to 2010

The sources of US/China Economic data are the following sites:

Other operations

Insertion, deletion, and item-wise operations behave the same as in the case of DataFrame. Panel structures can be re-arranged via transpose. The feature set of the operations of Panel is relatively underdeveloped and not as rich as for Series and DataFrame.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.14.144.108