Chapter 6. Missing Data, Time Series, and Plotting Using Matplotlib

In this chapter, we take a tour of some topics that are necessary to develop expertise in using pandas. Knowledge of these topics is very useful for the preparation of data as input for programs or code that process data for analysis, prediction, or visualization. The topics that we'll discuss are as follows:

  • Handling missing data
  • Handling time series and dates
  • Plotting using matplotlib

By the end of this chapter the user should be proficient in these critical areas.

Handling missing data

Missing data refers to data points that show up as NULL or N/A in our datasets for some reason; for example, we may have a time series that spans all calendar days of the month that shows the closing price of a stock for each day, and the closing price for nonbusiness days would show up as missing. An example of corrupted data would be a financial dataset that shows the activity date of a transaction in the wrong format; for example, YYYY-MM-DD instead of YYYYMMDD due to an error on the part of the data provider.

In the case of pandas, missing values are generally represented by the NaN value.

Other than appearing natively in the source dataset, missing values can be added to a dataset by an operation such as reindexing, or changing frequencies in the case of a time series:

In [84]: import numpy as np
         import pandas as pd
         import matplotlib.pyplot as plt
         %matplotlib inline
In [85]: date_stngs = ['2014-05-01','2014-05-02',
                       '2014-05-05','2014-05-06','2014-05-07']
         tradeDates = pd.to_datetime(pd.Series(date_stngs))
In [86]: closingPrices=[531.35,527.93,527.81,515.14,509.96]
In [87]: googClosingPrices=pd.DataFrame(data=closingPrices,
                                        columns=['closingPrice'],
                                        index=tradeDates)
         googClosingPrices
Out[87]:                closingPrice
        tradeDates      
        2014-05-01       531.35
        2014-05-02       527.93
        2014-05-05       527.81
        2014-05-06       515.14
        2014-05-07       509.96
        5 rows 1 columns

The source of the preceding data can be found at http://yhoo.it/1dmJqW6.

The pandas also provides an API to read stock data from various data providers, such as Yahoo:

In [29]: import pandas.io.data as web
In [32]: import datetime
         googPrices = web.get_data_yahoo("GOOG",
                      start=datetime.datetime(2014, 5, 1),
                      end=datetime.datetime(2014, 5, 7))
In [38]: googFinalPrices=pd.DataFrame(googPrices['Close'],
                                        index=tradeDates)
In [39]: googFinalPrices
Out[39]:           Close
       2014-05-01  531.34998
       2014-05-02  527.92999
       2014-05-05  527.81000
       2014-05-06  515.14001
       2014-05-07  509.95999

For more details, refer to http://pandas.pydata.org/pandas-docs/stable/remote_data.html.

We now have a time series that depicts the closing price of Google's stock from May 1, 2014 to May 7, 2014 with gaps in the date range since the trading only occur on business days. If we want to change the date range so that it shows calendar days (that is, along with the weekend), we can change the frequency of the time series index from business days to calendar days as follows:

In [90]: googClosingPricesCDays=googClosingPrices.asfreq('D')
        googClosingPricesCDays
Out[90]:    closingPrice
  2014-05-01  531.35
  2014-05-02  527.93
  2014-05-03  NaN
  2014-05-04  NaN
  2014-05-05  527.81
  2014-05-06  515.14
  2014-05-07  509.96
  7 rows 1 columns

Note that we have now introduced NaN values for the closingPrice for the weekend dates of May 3, 2014 and May 4, 2014.

We can check which values are missing by using the isnull and notnull functions as follows:

In [17]: googClosingPricesCDays.isnull()
Out[17]: closingPrice
   2014-05-01   False
   2014-05-02   False
   2014-05-03   True
   2014-05-04   True
   2014-05-05   False
   2014-05-06   False
   2014-05-07   False
   7 rows 1 columns

In [18]: googClosingPricesCDays.notnull()
Out[18]: closingPrice
   2014-05-01   True
   2014-05-02   True
   2014-05-03   False
   2014-05-04   False
   2014-05-05   True
   2014-05-06   True
   2014-05-07   True
   7 rows 1 columns

A Boolean DataFrame is returned in each case. In datetime and pandas Timestamps, missing values are represented by the NaT value. This is the equivalent of NaN in pandas for time-based types.

In [27]: tDates=tradeDates.copy()
        tDates[1]=np.NaN
        tDates[4]=np.NaN

In [28]: tDates
Out[28]: 0   2014-05-01
         1          NaT
         2   2014-05-05
         3   2014-05-06
         4          NaT
         Name: tradeDates, dtype: datetime64[ns]

In [4]: FBVolume=[82.34,54.11,45.99,55.86,78.5]
        TWTRVolume=[15.74,12.71,10.39,134.62,68.84]

In [5]: socialTradingVolume=pd.concat([pd.Series(FBVolume),
                         pd.Series(TWTRVolume),
                                       tradeDates], axis=1,
               keys=['FB','TWTR','TradeDate'])
       socialTradingVolume
Out[5]:      FB       TWTR    TradeDate
         0   82.34    15.74   2014-05-01
         1   54.11    12.71   2014-05-02
         2   45.99    10.39   2014-05-05
         3   55.86    134.62  2014-05-06
         4   78.50    68.84   2014-05-07
         5 rows × 3 columns

In [6]: socialTradingVolTS=socialTradingVolume.set_index('TradeDate')
socialTradingVolTS
Out[6]:
  TradeDate    FB      TWTR
  2014-05-01   82.34   15.74
  2014-05-02   54.11   12.71
  2014-05-05   45.99   10.39
  2014-05-06   55.86   134.62
  2014-05-07   78.50   68.84
  5 rows × 2 columns

In [7]: socialTradingVolTSCal=socialTradingVolTS.asfreq('D')
       socialTradingVolTSCal
Out[7]:
              FB      TWTR
  2014-05-01  82.34   15.74
  2014-05-02  54.11   12.71
  2014-05-03  NaN     NaN
  2014-05-04  NaN     NaN
  2014-05-05  45.99   10.39
  2014-05-06  55.86   134.62
  2014-05-07  78.50   68.84
  7 rows × 2 columns

We can perform arithmetic operations on data containing missing values. For example, we can calculate the total trading volume (in millions of shares) across the two stocks for Facebook and Twitter as follows:

In [8]: socialTradingVolTSCal['FB']+socialTradingVolTSCal['TWTR']
Out[8]: 2014-05-01     98.08
       2014-05-02     66.82
       2014-05-03       NaN
       2014-05-04       NaN
       2014-05-05     56.38
       2014-05-06    190.48
       2014-05-07    147.34
       Freq: D, dtype: float64

By default, any operation performed on an object that contains missing values will return a missing value at that position as shown in the following command:

In [12]: pd.Series([1.0,np.NaN,5.9,6])+pd.Series([3,5,2,5.6])
Out[12]: 0     4.0
        1     NaN
        2     7.9
        3    11.6
        dtype: float64
In [13]: pd.Series([1.0,25.0,5.5,6])/pd.Series([3,np.NaN,2,5.6])
Out[13]: 0    0.333333
        1         NaN
        2    2.750000
        3    1.071429
       dtype: float64

There is a difference, however, in the way NumPy treats aggregate calculations versus what pandas does.

In pandas, the default is to treat the missing value as 0 and do the aggregate calculation, whereas for NumPy, NaN is returned if any of the values are missing. Here is an illustration:

In [15]: np.mean([1.0,np.NaN,5.9,6])
Out[15]: nan

In [16]: np.sum([1.0,np.NaN,5.9,6])
Out[16]: nan

However, if this data is in a pandas Series, we will get the following output:

In [17]: pd.Series([1.0,np.NaN,5.9,6]).sum()
Out[17]: 12.9
In [18]: pd.Series([1.0,np.NaN,5.9,6]).mean()
Out[18]: 4.2999999999999998

It is important to be aware of this difference in behavior between pandas and NumPy. However, if we wish to get NumPy to behave the same way as pandas, we can use the np.nanmean and np.nansum functions, which are illustrated as follows:

In [41]: np.nanmean([1.0,np.NaN,5.9,6])
Out[41]: 4.2999999999999998

In [43]: np.nansum([1.0,np.NaN,5.9,6])
Out[43]: 12.9

For more information on NumPy np.nan* aggregation functions, refer to http://docs.scipy.org/doc/numpy-dev/reference/routines.statistics.html.

Handling missing values

There are various ways to handle missing values, which are as follows:

  1. By using the fillna() function to fill in the NA values. This is an example:
    In [19]: socialTradingVolTSCal
    Out[19]:    FB   TWTR
            2014-05-01  82.34   15.74
            2014-05-02  54.11   12.71
            2014-05-03  NaN     NaN
            2014-05-04  NaN     NaN
            2014-05-05  45.99   10.39
            2014-05-06  55.86   134.62
            2014-05-07  78.50   68.84
            7 rows × 2 columns
    
    In [20]: socialTradingVolTSCal.fillna(100)
    Out[20]:            FB      TWTR
           2014-05-01   82.34   15.74
           2014-05-02   54.11   12.71
           2014-05-03   100.00  100.00
           2014-05-04   100.00  100.00
           2014-05-05   45.99   10.39
           2014-05-06   55.86   134.62
           2014-05-07   78.50   68.84
         7 rows × 2 columns
    

    We can also fill forward or backward values using the ffill or bfill arguments:

    In [23]: socialTradingVolTSCal.fillna(method='ffill')
    Out[23]:            FB      TWTR
           2014-05-01   82.34   15.74
           2014-05-02   54.11   12.71
           2014-05-03   54.11   12.71
           2014-05-04   54.11   12.71
           2014-05-05   45.99   10.39
           2014-05-06   55.86   134.62
           2014-05-07   78.50   68.84
           7 rows × 2 columns
    
    In [24]: socialTradingVolTSCal.fillna(method='bfill')
    Out[24]:            FB      TWTR
           2014-05-01   82.34   15.74
           2014-05-02   54.11   12.71
           2014-05-03   45.99   10.39
           2014-05-04   45.99   10.39
           2014-05-05   45.99   10.39
           2014-05-06   55.86   134.62
           2014-05-07   78.50   68.84
           7 rows × 2 columns
    

    The pad method is an alternative name for ffill. For more details, you can go to http://bit.ly/1f4jvDq.

  2. By using the dropna() function to drop/delete rows and columns with missing values. The following is an example of this:
    In [21]: socialTradingVolTSCal.dropna()
    Out[21]:      FB      TWTR
      2014-05-01  82.34   15.74
      2014-05-02  54.11   12.71
      2014-05-05  45.99   10.39
      2014-05-06  55.86   134.62
      2014-05-07  78.50   68.84
      5 rows × 2 columns
    
  3. We can also interpolate and fill in the missing values by using the interpolate() function, as explained in the following commands:
    In [27]: pd.set_option('display.precision',4)
            socialTradingVolTSCal.interpolate()
    Out[27]:       FB       TWTR
      2014-05-01   82.340   15.740
      2014-05-02   54.110   12.710
      2014-05-03   51.403   11.937
      2014-05-04   48.697   11.163
      2014-05-05   45.990   10.390
      2014-05-06   55.860   134.620
      2014-05-07   78.500   68.840
      7 rows × 2 columns
    

    The interpolate() function also takes an argument—method that denotes the method. These methods include linear, quadratic, cubic spline, and so on. You can obtain more information from the official documentation at http://pandas.pydata.org/pandas-docs/stable/missing_data.html#interpolation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.178.181