pandas provides several classes of statistical operations that can be applied to a Series
or DataFrame
object. These reductive methods, when applied to a Series
, result in a single value. When applied to a DataFrame
, an axis can be specified and the method will then be either applied to each column or row and results in a Series
.
The average value is calculated using .mean()
. The following calculates the average of the prices for AAPL
and MSFT
:
In [119]: # calc the mean of the values in each column one_mon_hist.mean() Out[119]: MSFT 47.493182 AAPL 112.411364 dtype: float64
pandas has taken each column and independently calculated the mean for each and returned the results as values in a Series
that is indexed with the column names.
The default is to apply the method on axis=0
, applying the function to each column. The following code calculates the mean across axis=1
:
In [120]: # calc the mean of the values in each row one_mon_hist.mean(axis=1) Out[120]: 0 81.845 1 81.545 2 82.005 ... 19 80.680 20 79.770 21 78.415 Length: 22, dtype: float64
Variance is calculated using the .var()
method. The following code calculates the variance of the price for both stocks during the period represented in the DataFrame
object:
In [121]: # calc the variance of the values in each column one_mon_hist.var() Out[121]: MSFT 0.870632 AAPL 5.706231 dtype: float64
The median of the values is determined using the .median()
method:
In [122]: # calc the median of the values in each column one_mon_hist.median() Out[122]: MSFT 47.625 AAPL 112.530 dtype: float64
Although not a reductive calculation, the minimum and maximum values can be found with the .min()
and .max()
methods:
In [123]: # location of min price for both stocks one_mon_hist[['MSFT', 'AAPL']].min() Out[123]: MSFT 45.16 AAPL 106.75 dtype: float64 In [124]: # and location of the max one_mon_hist[['MSFT', 'AAPL']].max() Out[124]: MSFT 48.84 AAPL 115.93 dtype: float64
Some pandas statistical methods are referred to as indirect statistics, for example, .idxmin()
and .idxmax()
return the index location where the minimum and maximum values exist, respectively. The following code determines the location of the minimum prices for both stocks:
In [125]: # location of the min price for both stocks one_mon_hist[['MSFT', 'AAPL']].idxmin() Out[125]: MSFT 11 AAPL 11 dtype: int64 In [126]: # and location of the max one_mon_hist[['MSFT', 'AAPL']].idxmax() Out[126]: MSFT 3 AAPL 2 dtype: int64
The most common value of a Series
, the mode, can be determined with .mode().
Determining the mode is best demonstrated with data different from sp500
. The following code determines the mode of the given Series
:
In [127]: # find the mode of this Series s = pd.Series([1, 2, 3, 3, 5]) s.mode() Out[127]: 0 3 dtype: int64
This has not returned a scalar value representing the mode, but a Series
. This is because there can be more than one value for the mode of a Series
, as demonstrated in the following sample:
In [128]: # there can be more than one mode s = pd.Series([1, 2, 3, 3, 5, 1]) s.mode() Out[128]: 0 1 1 3 dtype: int64
Accumulations in pandas are statistical methods that determine a value, by continuously applying the next value in a Series
to the current result. Good examples are the cumulative product and cumulative sum of a Series
. To demonstrate, we can use the following DataFrame
that calculates both on a simple Series
of data:
In [129]: # calculate a cumulative product pd.Series([1, 2, 3, 4]).cumprod() Out[129]: 0 1 1 2 2 6 3 24 dtype: int64 In [130]: # calculate a cumulative sum pd.Series([1, 2, 3, 4]).cumsum() Out[130]: 0 1 1 3 2 6 3 10 dtype: int64
Note that these do not reduce to a single value, but represent the cumulative value at each location across the Series
.
The .describe()
returns a simple set of summary statistics about a Series
or DataFrame
. The values returned are, themselves, a Series
where the index label contains the name of the specific statistics that are computed. This function is handy if you want to get a quick and easy overview of the important statistics of a Series
or DataFrame
.
The following code returns summary statistics on the monthly stock data, including the count of items that are not part of NaN
; the mean and standard deviation; minimum and maximum values; and the values of the 25, 50, and 75 percentiles. The code is as follows:
In [131]: # summary statistics one_mon_hist.describe() Out[131]: MSFT AAPL count 22.000000 22.000000 mean 47.493182 112.411364 std 0.933077 2.388772 min 45.160000 106.750000 25% 46.967500 111.660000 50% 47.625000 112.530000 75% 48.125000 114.087500 max 48.840000 115.930000
Non-numerical data will result in a slightly different set of summary statistics:
In [132]: # get summary stats on non-numeric data s = pd.Series(['a', 'a', 'b', 'c', np.NaN]) s.describe() Out[132]: count 4 unique 3 top a freq 2 dtype: object
This has given us the count
variable of items that are not part of NaN
, the number of unique
items that are not part of NaN
, the most common item (top
), and the number of times the most frequent item occurred (freq
).
This example leads into a category of descriptive methods that assist in determining counts of items or unique values. The .count()
method will return the number of values that are not part of NaN
values in a Series
or on a specific axis of a DataFrame
:
In [133]: # get summary stats on non-numeric data s.count() Out[133]: 4
A list of unique items can be obtained using the .unique()
method:
In [134]: # return a list of unique items s.unique() Out[134]: array(['a', 'b', 'c', nan], dtype=object)
The number of occurrences of each unique (value that is not part of NaN
) value can be determined with the .value_counts()
method:
In [135]: # number of occurrences of each unique value s.value_counts() Out[135]: a 2 b 1 c 1 dtype: int64
There are quite a few more built-in statistical functions, but the focus of this chapter (and book) is not on statistical analysis and their discovery will be left to you.
3.149.240.196