Chapter 7. Descriptive Statistics

This chapter and the next are mainly aimed at SAS, SPSS, or Minitab users, and especially at those employing the languages R or S for statistical computing. We will develop an environment for working effectively in the field of data analysis, with the aid of IPython sessions powered by the following resources from the SciPy stack:

  • The probability and statistics submodule of the library for symbolic computation, sympy.stats.
  • The two libraries of statistical functions scipy.stats and scipy.stats.mstats (the latter for data provided as masked arrays), together with the module statsmodels, for data exploration, estimation of statistical models, and performing statistical tests in a numerical setting. The package statsmodels uses, under the hood, the powerful library patsy to describe statistical models and build design matrices in Python (R or S users will find patsy compatible with their formula mini-language; a short illustrative sketch follows this list).
  • For statistical inference, we again use scipy.stats and statsmodels (for frequentist and likelihood inference) and the module pymc that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo.
  • Two incredibly powerful libraries of high-level data manipulation tools.
    • The Python Data Analysis library pandas, created by Wes McKinney to provide convenient functionality for time series, data alignment, and the treatment of databases in a fashion similar to SQL.
    • The package PyTables, created by Francesc Alted, Ivan Vilata and others, for managing hierarchical datasets. It is designed to efficiently and easily cope with extremely large amounts of data.
  • The clustering module scipy.cluster for vector quantization, the k-means algorithm, hierarchical and agglomerative clustering.
  • A few SciPy toolkits (SciKits for short):
    • scikit-learn: A set of Python modules for machine learning and data mining.
    • scikits.bootstrap: Bootstrap confidence interval estimation routines.
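
As a quick, illustrative taste of two of these libraries, the following sketch shows a symbolic probability computation with sympy.stats and an R-style formula turned into design matrices with patsy. It uses made-up data; the data frame and its columns are hypothetical and not part of the chapter's running example:

# sympy.stats: symbolic probability computations
from sympy.stats import Normal, P
X = Normal('X', 0, 1)        # a standard normal random variable
prob = P(X > 1)              # exact symbolic expression (in terms of the error function)

# patsy: R/S-style formulas turned into design matrices
import numpy as np, pandas as pd
from patsy import dmatrices
df = pd.DataFrame({'y':  np.random.randn(6),
                   'x1': np.random.randn(6),
                   'x2': ['a', 'b', 'a', 'b', 'a', 'b']})
# 'y ~ x1 + x2' reads exactly as an R formula; the categorical x2 is dummy-coded automatically
y, X_design = dmatrices('y ~ x1 + x2', data=df, return_type='dataframe')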

A working knowledge of statistics is needed to follow the techniques in these chapters. Any good basic textbook with a solid selection of examples and problems will suffice. For a more in-depth study of inference, we recommend the second edition of the book Statistical Inference, by George Casella and Roger L. Berger, published by Duxbury in 2002.

Documentation for the Python libraries used in these chapters can be obtained online through their corresponding official pages.

The best way to get acquainted with pandas is, without a doubt, the book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, written by Wes McKinney, the creator of this amazing library. Familiarity with SQL is also a must and, for this, our recommendation is to get good training online.
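
For readers who already think in SQL terms, the correspondence with pandas operations is quite direct. The following sketch uses two made-up tables (the customers and orders data frames are hypothetical, not part of this chapter's example) to mimic a JOIN and a GROUP BY:

import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Ann', 'Bob', 'Carl']})
orders = pd.DataFrame({'customer_id': [1, 1, 2],
                       'amount': [250., 40., 75.]})

# SELECT ... FROM orders JOIN customers USING (customer_id)
joined = pd.merge(orders, customers, on='customer_id')

# SELECT name, SUM(amount) FROM joined GROUP BY name
totals = joined.groupby('name')['amount'].sum()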

One of the best resources to understand the topic of model estimation is the book Methods of Statistical Model Estimation, by Joseph Hilbe and Andrew Robinson. Although all the code in that book is written for R, it is easily portable to a combination of routines and classes from scipy.stats, statsmodels, PyMC, and scikit-learn.

The package statsmodels used to be distributed as a SciPy toolkit (scikits.statsmodels) before becoming an independent project. Good documentation to learn the usage and power of this package, and of its underlying library for describing statistical models (patsy), can be found in the tutorials offered online by their creators.

For the SciPy toolkits, the best resource is found via their page at http://scikits.appspot.com/. Browsing the different toolkits will point us to good tutorials and further references. In particular, for scikit-learn, two must-reads are the official page at http://scikit-learn.org/stable/, and the seminal article Scikit-learn: Machine Learning in Python, by Fabian Pedregosa et al., published in the Journal of Machine Learning Research in 2011.

As is our custom in this book, we will develop the material through examples, letting the subject matter itself drive the exposition. We have thus divided the exposition into two chapters, the first of which is concerned with the most basic topics in Probability and Statistics:

  • Probability—Random variables and their distributions.
  • Data Exploration.

In the next chapter, we will address more advanced topics in Statistics and Data Analysis:

  • Statistical inference.
  • Machine learning. The construction and study of systems that can learn from data. Machine learning focuses on prediction based on known properties learned from some training data.
  • Data mining. Discovering patterns in large data sets. Data mining focuses on the discovery of a priori unknown properties in the data.

Motivation

On Tuesday, September 8, 1857, the steamboat SS Central America left Havana at 9 A.M. for New York, carrying about 600 passengers and crew members. Inside this vessel, precious cargo was stored—a set of manuscripts by John James Audubon, and three tons of gold bars and coins. The manuscripts documented an expedition through the yet uncharted southwestern United States and California, and contained 200 sketches and paintings of its wildlife. The gold, fruit of many years of prospecting and mining during the California Gold Rush, was meant to start anew the lives of many of the passengers aboard.

On the 9th, the vessel ran into a storm that developed into a hurricane. The steamboat endured four hard days at sea, and by Saturday morning the ship was doomed. The captain arranged to have women and children taken off to the brig Marine, which offered them assistance at about noon. In spite of the efforts of the remaining crew and passengers to save the ship, the inevitable happened at about 8 P.M. that same day. The wreck claimed the lives of 425 men and carried the valuable cargo to the bottom of the sea.

It was not until the late 1980s that technology allowed the recovery of shipwrecks in the deep sea. However, no technology would be of any help without an accurate location of the site. In the following paragraphs, we would like to illustrate the power of the SciPy stack by performing a simple simulation. The objective is the creation of a dataset of possible locations for the wreck of the SS Central America. We then mine this data to attempt to pinpoint the most probable target.

We simulate several possible paths of the steamboat (say, 10,000 randomly generated possibilities) between 7 A.M. on that Saturday and 13 hours later, at 8 P.M. the same day. At 7 A.M., the ship's captain, William Herndon, took a celestial fix and verbally relayed the position to the schooner El Dorado. The fix was 31º25'N, 77º10'W. Since the ship was no longer operative at that point (no engine, no sails), for the next thirteen hours its course was solely subject to the effect of ocean currents and winds. With enough information, it is possible to model the drift and leeway on different possible paths.

We start by creating a DataFrame—a computational structure that will hold all the values we need in a very efficient way. We do so with the help of the pandas library:

In [1]: from datetime import datetime, timedelta; 
   ...: from dateutil.parser import parse
In [2]: interval = [parse("9/12/1857 7 am")]
In [3]: for k in range(14*2-1):
   ...:     if k % 2 == 0:
   ...:         interval.append(interval[-1])
   ...:     else:
   ...:         interval.append(interval[-1] + timedelta(hours=1))
   ...:
In [4]: import numpy as np, pandas as pd
In [5]: herndon = pd.DataFrame(np.zeros((28, 10000)),
   ...:                        index = [interval, ['Lat', 'Lon']*14])

Each column of the DataFrame herndon is to hold the latitude and longitude of a possible path of the SS Central America, sampled every hour. For instance, to observe the first path, we issue the following command:

In [6]: herndon[0]
Out[6]:
1857-09-12 07:00:00  Lat    0
                     Lon    0
1857-09-12 08:00:00  Lat    0
                     Lon    0
1857-09-12 09:00:00  Lat    0
                     Lon    0
1857-09-12 10:00:00  Lat    0
                     Lon    0
1857-09-12 11:00:00  Lat    0
                     Lon    0
1857-09-12 12:00:00  Lat    0
                     Lon    0
1857-09-12 13:00:00  Lat    0
                     Lon    0
1857-09-12 14:00:00  Lat    0
                     Lon    0
1857-09-12 15:00:00  Lat    0
                     Lon    0
1857-09-12 16:00:00  Lat    0
                     Lon    0
1857-09-12 17:00:00  Lat    0
                     Lon    0
1857-09-12 18:00:00  Lat    0
                     Lon    0
1857-09-12 19:00:00  Lat    0
                     Lon    0
1857-09-12 20:00:00  Lat    0
                     Lon    0
Name: 0, dtype: float64

Let's populate this data frame following an analysis similar to the one carried out by the Columbus-America Discovery Group, as explained by Lawrence D. Stone in the article Revisiting the SS Central America Search, from the 2010 International Conference on Information Fusion.

The celestial fix obtained by Capt. Herndon at 7 A.M. was taken with a sextant in the middle of a storm. There are some uncertainties in the estimation of latitude and longitude with this method and under those weather conditions, which are modeled by a bivariate normally distributed random variable with mean (0,0), and standard deviations of 0.9 nautical miles (for latitude) and 3.9 nautical miles (for longitude). We first create a random variable with those characteristics. Let's use this idea to populate the data frame with several random initial locations:

In [7]: from scipy.stats import multivariate_normal
In [8]: celestial_fix = multivariate_normal(cov = np.diag((0.9, 3.9)))

Tip

To estimate the corresponding celestial fixes, as well as all further geodetic computations, we will use the accurate formulas of Vincenty for ellipsoids, assuming a radius at the equator of a = 6378137 meters and a flattening of the ellipsoid of f = 1/298.257223563 (these figures are regarded as one of the standards for use in cartography, geodesy, and navigation, and are referred to by the community as the World Geodetic System WGS-84 ellipsoid).

A very good set of formulas coded in Python can be found at https://github.com/blancosilva/Mastering-Scipy/blob/master/chapter7/Geodetic_py.py. For a description and the theory behind these formulas, read the excellent survey on Wikipedia at https://en.wikipedia.org/wiki/Vincenty%27s_formulae.

In particular, for this example, we will be using Vincenty's direct formula, which computes the resulting latitude phi2, longitude L2, and back azimuth alpha2 of an object starting at latitude phi1, longitude L1 and traveling s meters with initial azimuth alpha1. Latitudes, longitudes, and azimuths are given in degrees, and distances in meters. We also use the convention of assigning negative values to longitudes to the west. To convert from nautical miles or knots to their respective SI units, we employ the system of units in scipy.constants.
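
As a quick sanity check of these conventions (angles in degrees, distances in meters, negative longitudes to the west), the following sketch moves 10 nautical miles due north from Capt. Herndon's reported fix. It assumes that the file Geodetic_py.py linked above has been downloaded to the working directory. Since one nautical mile is roughly one arc-minute of latitude, the result should land near 31º35'N, with the longitude essentially unchanged:

from Geodetic_py import vinc_pt             # Vincenty's direct formula
from scipy.constants import nautical_mile   # one nautical mile, in meters

a = 6378137.0                 # WGS-84 equatorial radius (meters)
f = 1./298.257223563          # WGS-84 flattening

# start at 31º25'N, 77º10'W and travel 10 nautical miles with azimuth 0 (due north)
lat2, lon2, back_azimuth = vinc_pt(f, a, 31 + 25./60, -77 - 10./60,
                                   0.0, 10 * nautical_mile)
# lat2 is roughly 31º35'N, while lon2 stays essentially at 77º10'W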

In [9]: from Geodetic_py import vinc_pt; 
   ...: from scipy.constants import nautical_mile
In [10]: a = 6378137.0; 
   ....: f = 1./298.257223563
In [11]: for k in range(10000):
   ....:     lat_delta,lon_delta = celestial_fix.rvs() * nautical_mile
   ....:     azimuth = 90 - np.angle(lat_delta+1j*lon_delta, deg=True)
   ....:     distance = np.hypot(lat_delta, lon_delta)
   ....:     output = vinc_pt(f, a, 31+25./60,
   ....:                      -77-10./60, azimuth, distance)
   ....:     herndon.ix['1857-09-12 07:00:00',:][k] = output[0:2]
   ....:
In [12]: herndon.ix['1857-09-12 07:00:00',:]
Out[12]:
          0          1          2          3          4          5    
Lat  31.455345  31.452572  31.439491  31.444000  31.462029  31.406287  
Lon -77.148860 -77.168941 -77.173416 -77.163484 -77.169911 -77.168462 

          6          7          8          9       ...           9990 
Lat  31.390807  31.420929  31.441248  31.367623    ...      31.405862  
Lon -77.178367 -77.187680 -77.176924 -77.172941    ...     -77.146794  

          9991       9992       9993       9994       9995       9996 
Lat  31.394365  31.428827  31.415392  31.443225  31.350158  31.392087  
Lon -77.179720 -77.182885 -77.159965 -77.186102 -77.183292 -77.168586  

          9997       9998       9999 
Lat  31.443154  31.438852  31.401723 
Lon -77.169504 -77.151137 -77.134298 
[2 rows x 10000 columns]

We simulate the drift according to the formula D = V + leeway * W. In this formula, V (the ocean current) is modeled as a vector pointing roughly northeast (around 45 degrees of azimuth) with a variable speed between 1 and 1.5 knots. The other random variable, W, represents the action of the winds in the area during the hurricane, which we choose to represent by directions ranging between south and east, and speeds with a mean of 0.2 knots and a standard deviation of 1/30 knots. Both random variables are coded as bivariate normal. Finally, we account for the leeway factor: according to a study performed on the blueprints of the SS Central America, we have estimated this leeway to be about 3 percent:

Tip

This choice of random variables to represent the ocean current and wind differs from the ones used in the aforementioned paper. In our version, we have not used the actual covariance matrices as computed by Stone from data received from the Naval Oceanographic Data Center. Rather, we have presented a very simplified version.

In [13]: current = multivariate_normal((np.pi/4, 1.25),
   ....:                 cov=np.diag((np.pi/270, .25/3))); 
   ....: wind = multivariate_normal((np.pi/4, .3),
   ....:                 cov=np.diag((np.pi/12, 1./30))); 
   ....: leeway = 3./100
In [14]: for date in pd.date_range('1857-9-12 08:00:00',
   ....:                           periods=13, freq='1h'):
   ....:      before  = herndon.ix[date-timedelta(hours=1)]
   ....:      for k in range(10000):
   ....:           angle, speed = current.rvs()
   ....:           current_v = speed * nautical_mile * (np.cos(angle)
   ....:                             + 1j * np.sin(angle))
   ....:           angle, speed  = wind.rvs()
   ....:           wind_v = speed * nautical_mile * (np.cos(angle)
   ....:                             + 1j * np.sin(angle))
   ....:           drift = current_v + leeway * wind_v
   ....:           azimuth = 90 - np.angle(drift, deg=True)
   ....:           distance = abs(drift)
   ....:           output = vinc_pt(f, a, before.ix['Lat'][k],
   ....:                            before.ix['Lon'][k],
   ....:                            azimuth, distance)
   ....:           herndon.ix[date,:][k] = output[:2]

Let's plot the first three of those simulated paths:

In [15]: import matplotlib.pyplot as plt; 
   ....: from mpl_toolkits.basemap import Basemap
In [16]: m = Basemap(llcrnrlon=-77.4, llcrnrlat=31.2,urcrnrlon=-76.6,
   ....:             urcrnrlat=31.8, projection='lcc', lat_0 = 31.5,
   ....:             lon_0=-77, resolution='l', area_thresh=1000.)
In [17]: m.drawmeridians(np.arange(-77.4,-76.6,0.1),   
   ....:                 labels=[0,0,1,1]); 
   ....: m.drawparallels(np.arange(31.2,32.8,0.1),labels=[1,1,0,0]);
   ....: m.drawmapboundary()
In [18]: colors = ['r', 'b', 'k']; 
   ....: styles = ['-', '--', ':']
In [19]: for k in range(3):
   ....:     latitudes = herndon[k][:,'Lat'].values
   ....:     longitudes = herndon[k][:,'Lon'].values
   ....:     longitudes, latitudes = m(longitudes, latitudes)
   ....:     m.plot(longitudes, latitudes, color=colors[k],
   ....:            lw=3, ls=styles[k])
   ....:
In [20]: plt.show()

This presents us with three possible paths followed by the SS Central America during its drift in the storm. As expected, they follow a generally north-easterly direction, with occasional deviations caused by the strong winds:

[Figure: the three simulated drift paths plotted on the map]

The focus of this simulation is, nonetheless, on the final location of all these paths. Let's plot them all on the same map first, for a quick visual evaluation:

In [21]: latitudes, longitudes = herndon.ix['1857-9-12 20:00:00'].values
In [22]: m = Basemap(llcrnrlon=-82., llcrnrlat=31, urcrnrlon=-76,
   ....:             urcrnrlat=32.5, projection='lcc', lat_0 = 31.5,
   ....:             lon_0=-78, resolution='h', area_thresh=1000.)
In [23]: X, Y = m(longitudes, latitudes)
In [24]: x, y = m(-81.2003759, 32.0405369)   # Savannah, GA
In [25]: m.plot(X, Y, 'ko', markersize=1); 
   ....: m.plot(x,y,'bo'); 
   ....: plt.text(x-10000, y+10000, 'Savannah, GA'); 
   ....: m.drawmeridians(np.arange(-82,-76,1), labels=[1,1,1,1]); 
   ....: m.drawparallels(np.arange(31,32.5,0.25), labels=[1,1,0,0]);
   ....: m.drawcoastlines(); 
   ....: m.drawcountries(); 
   ....: m.fillcontinents(color='coral'); 
   ....: m.drawmapboundary(); 
   ....: plt.show()
[Figure: the simulated final locations plotted on a map of the region, with Savannah, GA marked for reference]

To obtain a better estimate of the true location of the shipwreck, it is possible to expand the simulation by using information from Captain Johnson of the Norwegian bark Ellen. This ship rescued several survivors at 8 A.M. on Sunday, at a recorded position of 31º55'N, 76º13'W. We can employ a similar technique to trace back to the location where the ship sank, using a reverse drift. For this case, the uncertainty in the celestial fix is modeled by a bivariate normal distribution with standard deviations of 0.9 nautical miles (latitude) and 5.4 nautical miles (longitude).

Tip

A third simulation is also possible, using information from the El Dorado, but we do not factor this in our computations.

Since at this point the only relevant information is the location of the wreck, we do not need to keep the intermediate locations in our simulated paths. We record our data in a pandas Series instead:

In [26]: interval = []
In [27]: for k in range(10000):
   ....:     interval.append(k)
   ....:     interval.append(k)
   ....:
In [28]: ellen = pd.Series(index = [interval, ['Lat','Lon']*10000]);
   ....: celestial_fix =multivariate_normal(cov=np.diag((0.9,5.4)));
   ....: current = multivariate_normal((225, 1.25),
   ....:                               cov=np.diag((2./3, .25/3)))
In [29]: for k in range(10000):
   ....:    lat_delta, lon_delta = celestial_fix.rvs()*nautical_mile
   ....:    azimuth = 90 - np.angle(lat_delta+1j*lon_delta, deg=True)
   ....:    distance = np.hypot(lat_delta, lon_delta)
   ....:    output = vinc_pt(f, a, 31+55./60,
   ....:                     -76-13./60, azimuth, distance)
   ....:    ellen[k] = output[0:2]
   ....:
In [30]: for date in pd.date_range('1857-9-13 07:00:00', periods=12,
   ....:                           freq='-1h'):
   ....:     for k in range(10000):
   ....:         angle, speed = current.rvs()
   ....:         output = vinc_pt(f, a, ellen[k,'Lat'],
   ....:                          ellen[k,'Lon'], 90-angle,
   ....:                          speed*nautical_mile)  # knots to meters per hour
   ....:         ellen[k]=output[0:2]
   ....:

The purpose of the simulation is the construction of a map that indicates the probability of finding the shipwreck depending on latitude and longitude. We can construct it by performing a kernel density estimation on the simulated data. The difficulty in this case lies in using the correct metric. Unfortunately, we are not able to use a metric based upon Vincenty's formulas for this operation in SciPy; instead, we have two options:

  • A linear approximation in a small area, using the routine gaussian_kde from the library scipy.stats
    • A spherical approximation, using the class KernelDensity from the toolkit scikit-learn, imposing a haversine metric and a ball tree algorithm

The advantage of the first method is that it is faster, and the computations for optimal bandwidth are done internally. The second method is more accurate if we are able to provide the correct bandwidth. In any case, we prepare the data in the same way, using the simulation as training data:

In [31]: training_latitudes, training_longitudes = herndon.ix['1857-9-12 20:00:00'].values; 
   ....: training_latitudes = np.concatenate((training_latitudes,
   ....:                                       ellen[:,'Lat'])); 
   ....: training_longitudes = np.concatenate((training_longitudes,
   ....:                                       ellen[:,'Lon'])); 
   ....: values = np.vstack([training_latitudes,
   ....:                     training_longitudes]) * np.pi/180.

For the linear approximation, we perform the following computations:

In [32]: from scipy.stats import gaussian_kde
In [33]: kernel_scipy = gaussian_kde(values)

For the spherical approximation, and assuming a less than optimal bandwidth of 1e-7, we instead issue the following:

In [32]: from sklearn.neighbors import KernelDensity
In [33]: kernel_sklearn = KernelDensity(metric='haversine',
   ....:                                bandwidth=1.e-7,
   ....:                                kernel='gaussian',
   ....:                                algorithm='ball_tree')
   ....: kernel_sklearn.fit(values.T)
Out[33]:
KernelDensity(algorithm='ball_tree', atol=0, bandwidth=1e-07,
       breadth_first=True, kernel='gaussian', leaf_size=40,
       metric='haversine', metric_params=None, rtol=0)
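
The bandwidth above is only a guess. If we preferred a data-driven choice, a standard approach is to cross-validate over a grid of candidate bandwidths, using the total log-likelihood that KernelDensity exposes through its score method. The following is just a sketch of that idea (the candidate grid is arbitrary, and older scikit-learn releases import GridSearchCV from sklearn.grid_search instead of sklearn.model_selection):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

params = {'bandwidth': np.logspace(-9, -3, 13)}   # candidate bandwidths (radians)
grid = GridSearchCV(KernelDensity(metric='haversine',
                                  kernel='gaussian',
                                  algorithm='ball_tree'),
                    params, cv=5)
grid.fit(values.T)                      # values as prepared in In [31]
kernel_sklearn = grid.best_estimator_   # KernelDensity refit with the best bandwidth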

From here, all we need to do is generate a map, construct a grid of points on it, and evaluate the computed kernel on those points. This gives us the probability density function (PDF) of the corresponding distribution:

In [34]: plt.figure(); 
   ....: m = Basemap(llcrnrlon=-77.1, llcrnrlat=31.4,urcrnrlon=-75.9,
   ....:             urcrnrlat=32.6, projection='lcc', lat_0 = 32,
   ....:             lon_0=-76.5, resolution='l', area_thresh=1000);
   ....: m.drawmeridians(np.arange(-77.5,-75.5,0.2),
   ....:                 labels=[0,0,1,1]); 
   ....: m.drawparallels(np.arange(31,33,0.2), labels=[1,1,0,0]); 
   ....: grid_lon, grid_lat = m.makegrid(25, 25); 
   ....: xy = np.vstack([grid_lat.ravel(),
   ....:                 grid_lon.ravel()]) * np.pi/180.

The computations of the PDF are done, depending on the kernel implemented, as follows:

In [35]: data = kernel_scipy(xy)
In [35]: data = np.exp(kernel_sklearn.score_samples(xy.T))

All that remains is to plot the results. We show the results of the first method, and leave the second as a nice exercise:

In [36]: levels = np.linspace(data.min(), data.max(), 6); 
   ....: data = data.reshape(grid_lon.shape)
In [37]: grid_lon, grid_lat = m(grid_lon, grid_lat); 
   ....: cs = m.contourf(grid_lon, grid_lat, data,
   ....:                 levels=levels, cmap=plt.cm.Greys); 
   ....: cbar = m.colorbar(cs, location='bottom', pad="10%"); 
   ....: plt.show()

This presents us with a region of roughly 50 x 50 nautical miles, colored by the corresponding density. The darker regions indicate a higher probability of finding the shipwreck:

[Figure: contour map of the estimated probability density for the location of the wreck]

Note

The actual location of the remains of the SS Central America is at 31º35'N, 77º02'W, not too far from the results of our rough approximation—and as a matter of fact, very close to Captain Herndon's fix as communicated to the Marine.

This short motivational example illustrates the power of the SciPy stack to perform statistical simulations, store and manipulate the resulting data in optimal ways, and analyze them using state-of-the-art algorithms to extract valuable information. In the following pages, we will cover these techniques in more depth.
