Generating controlled random datasets

In this recipe, we will show different ways of generating random number sequences and word sequences. Some of the examples use standard Python modules, and others use NumPy/SciPy functions.

We will go through some statistics terminology but will explain every term, so you don't have to have a statistical reference book with you while reading this recipe.

We generate artificial datasets using common Python modules. By doing so, we are able to understand distributions, variance, sampling, and similar statistical terminology. More importantly, we can use this fake data as a way to understand if our statistical method is capable of discovering models we want to discover. We can do that because we know the model in advance and verify our statistical method by applying it over our known data. In real life, we don't have that ability and there is always a percentage of uncertainty that we must assume, giving way to errors.

Getting ready

We don't need anything new installed on the system in order to exercise these examples. Having some knowledge of statistics is useful, although not required.

To refresh our statistical knowledge, here's a little glossary we will use in this and the following chapters:

  • Distribution or probability distribution: This links the outcome of a statistical experiment with the probability of occurrence of that experiment.
  • Standard deviation: This is a numerical value that indicates how individuals vary in comparison to a group. If they vary more, the standard derivation will be big, and in the opposite condition—if all the individual experiments are more or less the same across the whole group, the standard derivation will be small.
  • Variance: This equals the square of standard derivation.
  • Population or statistical population: This is a total set of all the potentially observable cases. For example, all the grades of all the students in the world if we are interested in getting the student average of the world.
  • Sample: This is a subset of the population. We cannot obtain all the grades of all the students in the world, so we have to gather only a sample of data and model it.

How to do it...

We can generate a simple random sample using Python's module random. Here's an example of this:

import pylab
import random

SAMPLE_SIZE = 100

# seed random generator
# if no argument provided
# uses system current time
random.seed()

# store generated random values here
real_rand_vars = []

# pick some random values
real_rand_vars = [random.random() for val in xrange(SIZE)]
# create histogram from data in 10 buckets
pylab.hist(real_rand_vars, 10)
# define x and y labels
pylab.xlabel("Number range")
pylab.ylabel("Count")

# show figure
pylab.show()

This is a uniformly distributed sample. When we run this example, we should see something similar to the following plot:

How to do it...

Try setting SAMPLE_SIZE to a big number (say 10000) and see how the histogram behaves.

If we want to have values that range not from 0 to 1, but say from 1 to 6 (by simulating single dice throws), we could use random.randint(min, max); here, min and max are the lower and upper inclusive bounds respectively. If what you want to generate are floats and not integers, there is a random.uniform(min, max) function to provide that.

In a similar fashion and using the same tools, we can generate a time series plot of fictional price growth data with some random noise, as shown here:

import pylab
import random

# days to generate data for
duration = 100
# mean value
mean_inc = 0.2

# standard deviation
std_dev_inc = 1.2

# time series
x = range(duration)
y = []
price_today = 0

for i in x:
    next_delta = random.normalvariate(mean_inc, std_dev_inc)
    price_today += next_delta
    y.append(price_today)

pylab.plot(x,y)
pylab.xlabel("Time")
pylab.xlabel("Time")
pylab.ylabel("Value")
pylab.show()

This code defines a series of 100 data points (fictional days). For every next day, we pick a random value from the normal distribution (random.normalvariate()) ranging from mean_inc to std_dev_inc and add that value to yesterday's price value (price_today).

If we wanted more control, we could use different distributions. The following code illustrates and visualizes different distributions. We will comment separate code sections as we present them. We start by importing required modules and defining a number of histogram buckets. We also create a figure that will hold our histograms as shown in the following lines of code:

# coding: utf-8
import random
import matplotlib
import matplotlib.pyplot as plt

SAMPLE_SIZE = 1000
# histogram buckets
buckets = 100

plt.figure()

# we need to update font size just for this example
matplotlib.rcParams.update({'font.size': 7})

To lay out all the required plots, we define a grid of six by two subplots for all the histograms. The first plot is a uniformly distributed random variable as seen in the following lines of code:

plt.subplot(621)
plt.xlabel("random.random")
# Return the next random floating point number in the range [0.0, 1.0).
res = [random.random() for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

For the second plot, we plot a uniformly distributed random variable as shown here:

plt.subplot(622)
plt.xlabel("random.uniform")
# Return a random floating point number N such that a <= N <= b for a <= b and b <= N <= a for b < a.
# The end-point value b may or may not be included in the range depending on floating-point rounding in the equation a + (b-a) * random().
a = 1
b = SAMPLE_SIZE
res = [random.uniform(a, b) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

Here is the third plot which is a triangular distribution:

plt.subplot(623)
plt.xlabel("random.triangular")

# Return a random floating point number N such that low <= N <= high and with the specified  # mode between those bounds. The low and high bounds default to zero and one. The mode 
# argument defaults to the midpoint between the bounds, giving a symmetric distribution.
low = 1
high = SAMPLE_SIZE
res = [random.triangular(low, high) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

The fourth plot is a beta distribution. The condition on the parameters is that alpha and beta should be greater than zero. The returned values range between 0 and 1.

plt.subplot(624)
plt.xlabel("random.betavariate")
alpha = 1
beta = 10
res = [random.betavariate(alpha, beta) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

The fifth plot visualizes an exponential distribution. lambd is 1.0 divided by the desired mean. It should be non-zero (the parameter would be called lambda, but that is a reserved word in Python). The returned values range from 0 to positive infinity if lambd is positive, and from negative infinity to 0 if lambd is negative, as shown here:

plt.subplot(625)
plt.xlabel("random.expovariate")
lambd = 1.0 / ((SAMPLE_SIZE + 1) / 2.)
res = [random.expovariate(lambd) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

Our next plot is the gamma distribution, where the condition on the parameters is that alpha and beta are greater than 0. The probability distribution function is shown here:

How to do it...

Here's the code for the gamma distribution:

plt.subplot(626)
plt.xlabel("random.gammavariate")


alpha = 1
beta = 10
res = [random.gammavariate(alpha, beta) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

Log normal distribution is our next plot. If you take the natural logarithm of this distribution, you'll get a normal distribution with the mean mu and the standard deviation sigma. mu can have any value; moreover, sigma must be greater than zero as shown here:

plt.subplot(627)
plt.xlabel("random.lognormvariate")
mu = 1
sigma = 0.5
res = [random.lognormvariate(mu, sigma) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

The next plot is normal distribution, where mu is the mean and sigma is the standard deviation as shown here:

plt.subplot(628)
plt.xlabel("random.normalvariate")
mu = 1
sigma = 0.5
res = [random.normalvariate(mu, sigma) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

Here is the last plot which is the Pareto distribution and alpha is the shape parameter:

plt.subplot(629)
plt.xlabel("random.paretovariate")
alpha = 1
res = [random.paretovariate(alpha) for _ in xrange(1, SAMPLE_SIZE)]
plt.hist(res, buckets)

plt.tight_layout()
plt.show()

This was a big code example, but basically we pick 1,000 random numbers according to various distributions. These are common distributions used in different statistical branches (economics, sociology, bio-sciences, and so on).

We should see differences in the histogram based on the distribution algorithm used. Take a moment to understand the following nine plots:

How to do it...

Use seed() to initialize the pseudo-random generator, so random() produces the same expected random values. This is sometimes useful and it is better than pregenerating random data and saving it to a file. The latter technique is not always feasible as it requires saving (possibly huge amounts of) data on a filesystem.

If you want to prevent any repeatability of your randomly generated sequences, we recommend using random.SystemRandom, which uses os.urandom underneath; os.urandom provides access to more entropy sources. If using this random generator interface, seed() and setstate() have no effect; hence these samples are not reproducible.

If we want to have some random words, the easiest way (on Linux) is probably to use /usr/share/dicts/words. We can see how that is done in the following example:

import random

with open('/usr/share/dict/words', 'rt') as f:
    words = f.readlines()
words = [w.rstrip() for w in words]

for w in random.sample(words, 5):
    print w

This solution is for Unix only and will not work on Windows (but, it will work on Mac OS). For Windows, you could use a file constructed from various free sources (Project Gutenberg, Wiktionary, British National Corpus, or http://norvig.com/big.txt by Dr Peter Norvig).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.128.173.53