Computing percentiles in Python

Let's look at some more examples of percentiles using Python and kind of get our hands on it and conceptualize this a little bit more. Go ahead and open the Percentiles.ipynb file if you'd like to follow along, and again I encourage you to do so because I want you to play around with this a little bit later.

Let's start off by generating some normally distributed random data; refer to the following code block:

%matplotlib inline 
import numpy as np 
import matplotlib.pyplot as plt 
 
vals = np.random.normal(0, 0.5, 10000) 
 
plt.hist(vals, 50) 
plt.show() 

In this example, what we're going to do is generate some data centered around zero, that is with a mean of zero, with a standard deviation of 0.5, and I'm going to make 10000 data points with that distribution. Then, we're going to plot a histogram and see what we come up with.
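Your numbers will differ from mine on every run because the data is random. If you want reproducible output while you experiment, you can seed NumPy's generator first; the seed value below is arbitrary and is my addition, not part of the notebook's code:

```python
import numpy as np

np.random.seed(12345)  # arbitrary seed; the notebook leaves this out
vals = np.random.normal(0, 0.5, 10000)

# Sanity check: the sample mean and standard deviation should land
# close to the parameters we asked for (0 and 0.5)
print(vals.mean(), vals.std())
```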

The generated histogram looks very much like a normal distribution, but because there is a random component, we have a little outlier near -2 in this example. The distribution is also tipped a little bit at the mean, with a little random variation there to make things interesting.

NumPy provides a very handy percentile function that will compute the percentile values of this distribution for you. So, we created our vals array of data using np.random.normal, and I can just call the np.percentile function to figure out the 50th percentile value using the following code:

np.percentile(vals, 50) 

The following is the output of the preceding code:

0.0053397035195310248

The output turns out to be 0.005. So remember, the 50th percentile is just another name for the median, and it turns out the median is very close to zero in this data. You can see in the graph that we're tipped a little bit to the right, so that's not too surprising.
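You can confirm that the 50th percentile and the median are the same statistic by computing both on the same data. This is a seeded sketch for reproducibility; the notebook's own data is unseeded, so your exact numbers will differ:

```python
import numpy as np

np.random.seed(0)  # seeded so the result is repeatable; the notebook doesn't do this
vals = np.random.normal(0, 0.5, 10000)

# The 50th percentile and the median are the same statistic
p50 = np.percentile(vals, 50)
med = np.median(vals)
print(p50, med)
```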

I want to compute the 90th percentile, which gives me the point at which 90% of the data is less than that value. We can easily do that with the following code:

np.percentile(vals, 90) 

Here is the output of that code:

Out[4]: 0.64099069837340827 

The 90th percentile of this data turns out to be about 0.64, so at that point, 90% of the data is less than that value. I can believe that: 10% of the data is greater than 0.64, and 90% of it is less.
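You can check that claim directly by measuring what fraction of the data actually falls below the 90th percentile. Again, the seed here is my addition so the numbers are repeatable:

```python
import numpy as np

np.random.seed(0)
vals = np.random.normal(0, 0.5, 10000)  # same shape of data as the notebook

p90 = np.percentile(vals, 90)
# Fraction of points strictly below the 90th percentile: should be close to 0.9
frac_below = np.mean(vals < p90)
print(p90, frac_below)
```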

Let's compute the 20th percentile value, which gives me the point at which 20% of the values are less than that number. Again, we just need a very simple alteration to the code:

np.percentile(vals, 20) 

This gives the following output:

Out[5]: -0.41810340026619164 

The 20th percentile point works out to be -0.4, roughly, and again I believe that. It's saying that 20% of the data lies to the left of -0.4, and conversely, 80% is greater.
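Rather than calling np.percentile three separate times, you can also pass it a sequence of percentiles and get them all back at once as an array; a small seeded sketch of that, not from the notebook itself:

```python
import numpy as np

np.random.seed(0)
vals = np.random.normal(0, 0.5, 10000)

# np.percentile accepts a sequence of percentiles
# and returns them all in one array
p20, p50, p90 = np.percentile(vals, [20, 50, 90])
print(p20, p50, p90)
```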

If you want to get a feel as to where those breaking points are in a dataset, the percentile function is an easy way to compute them. If this were a dataset representing income distribution, we could just call np.percentile(vals, 99) and figure out what the 99th percentile is. You could figure out who those one-percenters people keep talking about really are, and if you're one of them.

Alright, now to get your hands dirty. I want you to play around with this data. This is an IPython Notebook for a reason, so you can mess with the code: try different standard deviation values and see what effect they have on the shape of the data and where those percentiles end up lying, for example. Try using smaller dataset sizes and adding a little more random variation to the data. Just get comfortable with it, play around with it, and find that you can actually do this stuff and write some real code that works.
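As a starting point for that experimentation, here is one hedged sketch of the suggested exercise: for normal data, doubling the standard deviation roughly doubles how far the 90th percentile sits from the mean. The seed is my addition so that the runs are comparable:

```python
import numpy as np

np.random.seed(0)

# Sketch of the suggested experiment: see how the 90th percentile
# moves as the standard deviation of the data grows
p90s = {}
for sigma in (0.5, 1.0, 2.0):
    vals = np.random.normal(0, sigma, 10000)
    p90s[sigma] = np.percentile(vals, 90)
    print(sigma, p90s[sigma])
```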
