Basic descriptive statistics with NumPy

In this book, we use as wide a variety of datasets as the available data allows. Unfortunately, this means that the subject of the data might not exactly match your interests. Every dataset has its own quirks, but the general skills you acquire in this book should transfer to your own field. In this chapter, we will load a number of comma-separated values (CSV) files into NumPy arrays in order to analyze the data.

To load the data, we will use the NumPy loadtxt() function as follows:

Note

The code for this example can be found in basic_stats.py in the code bundle.

import numpy as np
from scipy.stats import scoreatpercentile

data = np.loadtxt("mdrtb_2012.csv", delimiter=',', usecols=(1,), skiprows=1, unpack=True)

# Each statistic is available both as an ndarray method and as a NumPy function.
print("Max method", data.max())
print("Max function", np.max(data))

print("Min method", data.min())
print("Min function", np.min(data))

print("Mean method", data.mean())
print("Mean function", np.mean(data))

print("Std method", data.std())
print("Std function", np.std(data))

# The median is the 50th percentile of the data.
print("Median", np.median(data))
print("Score at percentile 50", scoreatpercentile(data, 50))

The preceding code computes the mean, median, maximum, minimum, and standard deviation of a NumPy array.

Note

If these terms sound unfamiliar to you, please take some time to learn about them from Wikipedia or any other source. As mentioned in the Preface, we will assume familiarity with basic mathematical and statistical concepts.

The data comes from the mdrtb_2012.csv file, which can be found in the code bundle. This is an edited version of the CSV file that can be downloaded from the WHO website at https://extranet.who.int/tme/generateCSV.asp?ds=mdr_estimates. It contains data about multidrug-resistant tuberculosis (MDR-TB). The file we are going to use is a reduced version of the original file and contains only two columns: the country and the percentage of new cases. Here are the first two lines of the file:

country,e_new_mdr_pcnt
Afghanistan,3.5
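
To see how a file with this layout is parsed, the following minimal sketch loads a tiny in-memory stand-in for the CSV file. Only the header and the Afghanistan row come from the real file; the other two rows are made-up placeholder values, not WHO data, and io.StringIO is used only so that the example runs without the file from the code bundle.

```python
import io

import numpy as np

# Hypothetical stand-in for mdrtb_2012.csv. The header and the Afghanistan
# row match the real file; the remaining rows are invented for illustration.
sample_csv = io.StringIO(
    "country,e_new_mdr_pcnt\n"
    "Afghanistan,3.5\n"
    "Exampleland,1.0\n"
    "Samplestan,0.5\n"
)

# delimiter=',' : fields are separated by commas
# usecols=(1,)  : keep only the second column (column indices start at 0)
# skiprows=1    : skip the header line
sample = np.loadtxt(sample_csv, delimiter=',', usecols=(1,), skiprows=1)

print(sample.shape)
print(sample)
```

Because only the numeric column is selected, the country names are never converted to floating-point numbers, and the result is a one-dimensional array with one value per data row.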

Now, let's walk through the computation of the mean, median, maximum, minimum, and standard deviation step by step:

  1. First, we will load the data with the following function call:
    data = np.loadtxt("mdrtb_2012.csv", delimiter=',', usecols=(1,), skiprows=1, unpack=True)

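    In the preceding call, we specify the name of the file, a comma as the delimiter, the second column as the column to load data from (column indices start at 0), and that we want to skip the header line. The unpack=True argument transposes the result, which makes no difference when only a single column is loaded. We assume that the file is in the current directory; otherwise, we will have to specify the correct path.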

  2. The maximum of an array can be obtained either via a method of the ndarray or via a NumPy function. The same goes for the minimum, mean, and standard deviation. The following code snippet prints the various statistics:
    print("Max method", data.max())
    print("Max function", np.max(data))

    print("Min method", data.min())
    print("Min function", np.min(data))

    print("Mean method", data.mean())
    print("Mean function", np.mean(data))

    print("Std method", data.std())
    print("Std function", np.std(data))

    The output is as follows (the number of digits printed may differ depending on your Python version):

    Max method 50.0
    Max function 50.0
    Min method 0.0
    Min function 0.0
    Mean method 3.2787037037
    Mean function 3.2787037037
    Std method 5.76332073654
    Std function 5.76332073654
    
  3. The median can be computed with the NumPy median() function or with the SciPy scoreatpercentile() function, which estimates the 50th percentile of the data:
    print("Median", np.median(data))
    print("Score at percentile 50", scoreatpercentile(data, 50))

    The following is printed:

    Median 1.8
    Score at percentile 50 1.8
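
The steps above can be checked on a small array whose statistics are easy to verify by hand. The following sketch uses made-up values rather than the WHO data. It also goes slightly beyond the original example in two ways, both of which are additions for illustration: it uses np.percentile() instead of the SciPy function to show that the median is the 50th percentile, and it passes the ddof argument to std() to show that NumPy computes the population standard deviation by default.

```python
import numpy as np

# Hypothetical values chosen so that the results are easy to check by hand.
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# The method and function forms return the same results.
assert values.max() == np.max(values)
assert values.min() == np.min(values)
assert values.mean() == np.mean(values)
assert values.std() == np.std(values)

print("Max", values.max())
print("Min", values.min())
print("Mean", values.mean())

# The median is the 50th percentile of the data.
print("Median", np.median(values))
print("50th percentile", np.percentile(values, 50))

# By default, std() divides by N (population standard deviation).
# Passing ddof=1 divides by N - 1 instead (sample standard deviation).
population_std = values.std()
sample_std = values.std(ddof=1)
print("Population std", population_std)
print("Sample std", sample_std)
```

Note that the single large value pulls the mean (4.0) above the median (3.0), the same effect that makes the mean of the tuberculosis data larger than its median.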
    