Summarized data and descriptive statistics

pandas provides several classes of statistical operations that can be applied to a Series or DataFrame object. These reductive methods, when applied to a Series, result in a single value. When applied to a DataFrame, an axis can be specified and the method will then be either applied to each column or row and results in a Series.

The average value is calculated using .mean(). The following calculates the average of the prices for AAPL and MSFT:

In [119]:
   # calc the mean of the values in each column
   one_mon_hist.mean()

Out[119]:
   MSFT     47.493182
   AAPL    112.411364
   dtype: float64

pandas has taken each column and independently calculated the mean for each and returned the results as values in a Series that is indexed with the column names.

The default is to apply the method on axis=0, applying the function to each column. The following code calculates the mean across axis=1:

In [120]:
   # calc the mean of the values in each row
   one_mon_hist.mean(axis=1)

Out[120]:
   0    81.845
   1    81.545
   2    82.005
   ...
   19    80.680
   20    79.770
   21    78.415
   Length: 22, dtype: float64

Variance is calculated using the .var() method. The following code calculates the variance of the price for both stocks during the period represented in the DataFrame object:

In [121]:
   # calc the variance of the values in each column
   one_mon_hist.var()

Out[121]:
   MSFT    0.870632
   AAPL    5.706231
   dtype: float64

The median of the values is determined using the .median() method:

In [122]:
   # calc the median of the values in each column
   one_mon_hist.median()

   Out[122]:
   MSFT     47.625
   AAPL    112.530
   dtype: float64

Although not a reductive calculation, the minimum and maximum values can be found with the .min() and .max() methods:

In [123]:
   # location of min price for both stocks
   one_mon_hist[['MSFT', 'AAPL']].min()

Out[123]:
   MSFT     45.16
   AAPL    106.75
   dtype: float64

In [124]:
   # and location of the max
   one_mon_hist[['MSFT', 'AAPL']].max()

Out[124]:
   MSFT     48.84
   AAPL    115.93
   dtype: float64

Some pandas statistical methods are referred to as indirect statistics, for example, .idxmin() and .idxmax() return the index location where the minimum and maximum values exist, respectively. The following code determines the location of the minimum prices for both stocks:

In [125]:
   # location of the min price for both stocks
   one_mon_hist[['MSFT', 'AAPL']].idxmin()

Out[125]:
   MSFT    11
   AAPL    11
   dtype: int64

In [126]:
   # and location of the max
   one_mon_hist[['MSFT', 'AAPL']].idxmax()

Out[126]:
   MSFT    3
   AAPL    2
   dtype: int64

The most common value of a Series, the mode, can be determined with .mode(). Determining the mode is best demonstrated with data different from sp500. The following code determines the mode of the given Series:

In [127]:
   # find the mode of this Series
   s = pd.Series([1, 2, 3, 3, 5])
   s.mode()

Out[127]:
   0    3
   dtype: int64

This has not returned a scalar value representing the mode, but a Series. This is because there can be more than one value for the mode of a Series, as demonstrated in the following sample:

In [128]:
   # there can be more than one mode
   s = pd.Series([1, 2, 3, 3, 5, 1])
   s.mode()

Out[128]:
   0    1
   1    3
   dtype: int64

Accumulations in pandas are statistical methods that determine a value, by continuously applying the next value in a Series to the current result. Good examples are the cumulative product and cumulative sum of a Series. To demonstrate, we can use the following DataFrame that calculates both on a simple Series of data:

In [129]:
   # calculate a cumulative product
   pd.Series([1, 2, 3, 4]).cumprod()

Out[129]:
   0     1
   1     2
   2     6
   3    24
   dtype: int64

In [130]:
   # calculate a cumulative sum
   pd.Series([1, 2, 3, 4]).cumsum()

Out[130]:
   0     1
   1     3
   2     6
   3    10
   dtype: int64

Note that these do not reduce to a single value, but represent the cumulative value at each location across the Series.

The .describe() returns a simple set of summary statistics about a Series or DataFrame. The values returned are, themselves, a Series where the index label contains the name of the specific statistics that are computed. This function is handy if you want to get a quick and easy overview of the important statistics of a Series or DataFrame.

The following code returns summary statistics on the monthly stock data, including the count of items that are not part of NaN; the mean and standard deviation; minimum and maximum values; and the values of the 25, 50, and 75 percentiles. The code is as follows:

In [131]:
   # summary statistics
   one_mon_hist.describe()

Out[131]:
               MSFT        AAPL
   count  22.000000   22.000000
   mean   47.493182  112.411364
   std     0.933077    2.388772
   min    45.160000  106.750000
   25%    46.967500  111.660000
   50%    47.625000  112.530000
   75%    48.125000  114.087500
   max    48.840000  115.930000

Non-numerical data will result in a slightly different set of summary statistics:

In [132]:
   # get summary stats on non-numeric data
   s = pd.Series(['a', 'a', 'b', 'c', np.NaN])
   s.describe()

Out[132]:
   count     4
   unique    3
   top       a
   freq      2
   dtype: object

This has given us the count variable of items that are not part of NaN, the number of unique items that are not part of NaN, the most common item (top), and the number of times the most frequent item occurred (freq).

This example leads into a category of descriptive methods that assist in determining counts of items or unique values. The .count() method will return the number of values that are not part of NaN values in a Series or on a specific axis of a DataFrame:

In [133]:
   # get summary stats on non-numeric data
   s.count()

Out[133]:
   4

A list of unique items can be obtained using the .unique() method:

In [134]:
   # return a list of unique items
   s.unique()

Out[134]:
   array(['a', 'b', 'c', nan], dtype=object)

The number of occurrences of each unique (value that is not part of NaN) value can be determined with the .value_counts() method:

In [135]:
   # number of occurrences of each unique value
   s.value_counts()

Out[135]:
   a    2
   b    1
   c    1
   dtype: int64

There are quite a few more built-in statistical functions, but the focus of this chapter (and book) is not on statistical analysis and their discovery will be left to you.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.149.240.196