In this chapter, we take a tour of some topics that are necessary to develop expertise in using pandas. Knowledge of these topics is very useful for the preparation of data as input for programs or code that process data for analysis, prediction, or visualization. The topics that we'll discuss are as follows:
matplotlib
By the end of this chapter the user should be proficient in these critical areas.
Missing data refers to data points that show up as NULL or N/A in our datasets for some reason; for example, we may have a time series that spans all calendar days of the month that shows the closing price of a stock for each day, and the closing price for nonbusiness days would show up as missing. An example of corrupted data would be a financial dataset that shows the activity date of a transaction in the wrong format; for example, YYYY-MM-DD instead of YYYYMMDD due to an error on the part of the data provider.
In the case of pandas, missing values are generally represented by the NaN value.
Other than appearing natively in the source dataset, missing values can be added to a dataset by an operation such as reindexing, or changing frequencies in the case of a time series:
In [84]: import numpy as np import pandas as pd import matplotlib.pyplot as plt %matplotlib inline In [85]: date_stngs = ['2014-05-01','2014-05-02', '2014-05-05','2014-05-06','2014-05-07'] tradeDates = pd.to_datetime(pd.Series(date_stngs)) In [86]: closingPrices=[531.35,527.93,527.81,515.14,509.96] In [87]: googClosingPrices=pd.DataFrame(data=closingPrices, columns=['closingPrice'], index=tradeDates) googClosingPrices Out[87]: closingPrice tradeDates 2014-05-01 531.35 2014-05-02 527.93 2014-05-05 527.81 2014-05-06 515.14 2014-05-07 509.96 5 rows 1 columns
The source of the preceding data can be found at http://yhoo.it/1dmJqW6.
The pandas also provides an API to read stock data from various data providers, such as Yahoo:
In [29]: import pandas.io.data as web In [32]: import datetime googPrices = web.get_data_yahoo("GOOG", start=datetime.datetime(2014, 5, 1), end=datetime.datetime(2014, 5, 7)) In [38]: googFinalPrices=pd.DataFrame(googPrices['Close'], index=tradeDates) In [39]: googFinalPrices Out[39]: Close 2014-05-01 531.34998 2014-05-02 527.92999 2014-05-05 527.81000 2014-05-06 515.14001 2014-05-07 509.95999
For more details, refer to http://pandas.pydata.org/pandas-docs/stable/remote_data.html.
We now have a time series that depicts the closing price of Google's stock from May 1, 2014 to May 7, 2014 with gaps in the date range since the trading only occur on business days. If we want to change the date range so that it shows calendar days (that is, along with the weekend), we can change the frequency of the time series index from business days to calendar days as follows:
In [90]: googClosingPricesCDays=googClosingPrices.asfreq('D') googClosingPricesCDays Out[90]: closingPrice 2014-05-01 531.35 2014-05-02 527.93 2014-05-03 NaN 2014-05-04 NaN 2014-05-05 527.81 2014-05-06 515.14 2014-05-07 509.96 7 rows 1 columns
Note that we have now introduced NaN
values for the closingPrice
for the weekend dates of May 3, 2014 and May 4, 2014.
We can check which values are missing by using the isnull
and notnull
functions as follows:
In [17]: googClosingPricesCDays.isnull() Out[17]: closingPrice 2014-05-01 False 2014-05-02 False 2014-05-03 True 2014-05-04 True 2014-05-05 False 2014-05-06 False 2014-05-07 False 7 rows 1 columns In [18]: googClosingPricesCDays.notnull() Out[18]: closingPrice 2014-05-01 True 2014-05-02 True 2014-05-03 False 2014-05-04 False 2014-05-05 True 2014-05-06 True 2014-05-07 True 7 rows 1 columns
A Boolean DataFrame is returned in each case. In datetime
and pandas Timestamps, missing values are represented by the NaT
value. This is the equivalent of NaN
in pandas for time-based types.
In [27]: tDates=tradeDates.copy() tDates[1]=np.NaN tDates[4]=np.NaN In [28]: tDates Out[28]: 0 2014-05-01 1 NaT 2 2014-05-05 3 2014-05-06 4 NaT Name: tradeDates, dtype: datetime64[ns] In [4]: FBVolume=[82.34,54.11,45.99,55.86,78.5] TWTRVolume=[15.74,12.71,10.39,134.62,68.84] In [5]: socialTradingVolume=pd.concat([pd.Series(FBVolume), pd.Series(TWTRVolume), tradeDates], axis=1, keys=['FB','TWTR','TradeDate']) socialTradingVolume Out[5]: FB TWTR TradeDate 0 82.34 15.74 2014-05-01 1 54.11 12.71 2014-05-02 2 45.99 10.39 2014-05-05 3 55.86 134.62 2014-05-06 4 78.50 68.84 2014-05-07 5 rows × 3 columns In [6]: socialTradingVolTS=socialTradingVolume.set_index('TradeDate') socialTradingVolTS Out[6]: TradeDate FB TWTR 2014-05-01 82.34 15.74 2014-05-02 54.11 12.71 2014-05-05 45.99 10.39 2014-05-06 55.86 134.62 2014-05-07 78.50 68.84 5 rows × 2 columns In [7]: socialTradingVolTSCal=socialTradingVolTS.asfreq('D') socialTradingVolTSCal Out[7]: FB TWTR 2014-05-01 82.34 15.74 2014-05-02 54.11 12.71 2014-05-03 NaN NaN 2014-05-04 NaN NaN 2014-05-05 45.99 10.39 2014-05-06 55.86 134.62 2014-05-07 78.50 68.84 7 rows × 2 columns
We can perform arithmetic operations on data containing missing values. For example, we can calculate the total trading volume (in millions of shares) across the two stocks for Facebook and Twitter as follows:
In [8]: socialTradingVolTSCal['FB']+socialTradingVolTSCal['TWTR'] Out[8]: 2014-05-01 98.08 2014-05-02 66.82 2014-05-03 NaN 2014-05-04 NaN 2014-05-05 56.38 2014-05-06 190.48 2014-05-07 147.34 Freq: D, dtype: float64
By default, any operation performed on an object that contains missing values will return a missing value at that position as shown in the following command:
In [12]: pd.Series([1.0,np.NaN,5.9,6])+pd.Series([3,5,2,5.6]) Out[12]: 0 4.0 1 NaN 2 7.9 3 11.6 dtype: float64 In [13]: pd.Series([1.0,25.0,5.5,6])/pd.Series([3,np.NaN,2,5.6]) Out[13]: 0 0.333333 1 NaN 2 2.750000 3 1.071429 dtype: float64
There is a difference, however, in the way NumPy treats aggregate calculations versus what pandas does.
In pandas, the default is to treat the missing value as 0
and do the aggregate calculation, whereas for NumPy, NaN
is returned if any of the values are missing. Here is an illustration:
In [15]: np.mean([1.0,np.NaN,5.9,6]) Out[15]: nan In [16]: np.sum([1.0,np.NaN,5.9,6]) Out[16]: nan
However, if this data is in a pandas Series, we will get the following output:
In [17]: pd.Series([1.0,np.NaN,5.9,6]).sum() Out[17]: 12.9 In [18]: pd.Series([1.0,np.NaN,5.9,6]).mean() Out[18]: 4.2999999999999998
It is important to be aware of this difference in behavior between pandas and NumPy. However, if we wish to get NumPy to behave the same way as pandas, we can use the np.nanmean
and np.nansum
functions, which are illustrated as follows:
In [41]: np.nanmean([1.0,np.NaN,5.9,6]) Out[41]: 4.2999999999999998 In [43]: np.nansum([1.0,np.NaN,5.9,6]) Out[43]: 12.9
For more information on NumPy np.nan* aggregation functions, refer to http://docs.scipy.org/doc/numpy-dev/reference/routines.statistics.html.
There are various ways to handle missing values, which are as follows:
fillna()
function to fill in the NA values. This is an example:In [19]: socialTradingVolTSCal Out[19]: FB TWTR 2014-05-01 82.34 15.74 2014-05-02 54.11 12.71 2014-05-03 NaN NaN 2014-05-04 NaN NaN 2014-05-05 45.99 10.39 2014-05-06 55.86 134.62 2014-05-07 78.50 68.84 7 rows × 2 columns In [20]: socialTradingVolTSCal.fillna(100) Out[20]: FB TWTR 2014-05-01 82.34 15.74 2014-05-02 54.11 12.71 2014-05-03 100.00 100.00 2014-05-04 100.00 100.00 2014-05-05 45.99 10.39 2014-05-06 55.86 134.62 2014-05-07 78.50 68.84 7 rows × 2 columns
We can also fill forward or backward values using the ffill
or bfill
arguments:
In [23]: socialTradingVolTSCal.fillna(method='ffill') Out[23]: FB TWTR 2014-05-01 82.34 15.74 2014-05-02 54.11 12.71 2014-05-03 54.11 12.71 2014-05-04 54.11 12.71 2014-05-05 45.99 10.39 2014-05-06 55.86 134.62 2014-05-07 78.50 68.84 7 rows × 2 columns In [24]: socialTradingVolTSCal.fillna(method='bfill') Out[24]: FB TWTR 2014-05-01 82.34 15.74 2014-05-02 54.11 12.71 2014-05-03 45.99 10.39 2014-05-04 45.99 10.39 2014-05-05 45.99 10.39 2014-05-06 55.86 134.62 2014-05-07 78.50 68.84 7 rows × 2 columns
The pad
method is an alternative name for ffill
. For more details, you can go to http://bit.ly/1f4jvDq.
dropna()
function to drop/delete rows and columns with missing values. The following is an example of this:In [21]: socialTradingVolTSCal.dropna() Out[21]: FB TWTR 2014-05-01 82.34 15.74 2014-05-02 54.11 12.71 2014-05-05 45.99 10.39 2014-05-06 55.86 134.62 2014-05-07 78.50 68.84 5 rows × 2 columns
interpolate()
function, as explained in the following commands:In [27]: pd.set_option('display.precision',4) socialTradingVolTSCal.interpolate() Out[27]: FB TWTR 2014-05-01 82.340 15.740 2014-05-02 54.110 12.710 2014-05-03 51.403 11.937 2014-05-04 48.697 11.163 2014-05-05 45.990 10.390 2014-05-06 55.860 134.620 2014-05-07 78.500 68.840 7 rows × 2 columns
The interpolate()
function also takes an argument—method that denotes the method. These methods include linear, quadratic, cubic spline, and so on. You can obtain more information from the official documentation at http://pandas.pydata.org/pandas-docs/stable/missing_data.html#interpolation.
18.188.178.181