Making histograms

Histograms are simple, yet it is important to get the right data into them. For now, we will cover only 2D histograms.

Histograms are used to visualize estimates of the distribution of data. A few terms come up whenever we speak of histograms. Vertical rectangles represent the frequencies of data points within a particular interval called a bin. Bins are created at fixed intervals, so the bar heights sum to the number of data points and the total area of the histogram is proportional to that number.

Instead of absolute counts, histograms can display relative frequencies of data. When the histogram is normalized as a probability density, the total area equals 1.
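The difference between absolute counts, relative frequencies, and a density can be checked numerically. The following is a minimal sketch using numpy.histogram(); the seeded random generator, the sample size, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded only so the sketch is repeatable
data = rng.normal(100, 15, 10000)

# Absolute frequencies: one count per bin.
counts, edges = np.histogram(data, bins=35)
widths = np.diff(edges)               # width of each bin

# Relative frequencies: each count divided by the sample size.
rel_freq = counts / counts.sum()

# Density: counts rescaled so that the total bar area is 1.
density, _ = np.histogram(data, bins=35, density=True)

print(counts.sum())                   # number of data points
print(rel_freq.sum())                 # bar heights sum to 1 (up to rounding)
print((density * widths).sum())       # bar area sums to 1 (up to rounding)
```

With relative frequencies it is the bar heights that sum to 1, whereas with a density it is the area (height times bin width) that sums to 1.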

Histograms are often used in image manipulation software as a way to visualize image properties such as the distribution of light in a particular color channel. Further, these image histograms can be used in computer vision algorithms to detect peaks, which aids edge detection, image segmentation, and so on.

In Chapter 5, Making 3D Visualizations, we have recipes that deal with 3D histograms.

Getting ready

The number of bins is the value we want to get right, but that is hard because there are no strict rules for the optimal number of bins. There are different theories on how to calculate it. The simplest is based on the ceiling function, where the number of bins k equals ceil((max(x) - min(x)) / h), with x being the dataset plotted and h the desired bin width. This is just one option, as the number of bins required to display the data properly depends on the actual distribution of the data.
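The ceiling rule described above can be sketched in a few lines. The helper name number_of_bins and the sample values are hypothetical and used only for illustration.

```python
import math
import numpy as np

def number_of_bins(x, h):
    """Hypothetical helper: k = ceil((max(x) - min(x)) / h) for bin width h."""
    x = np.asarray(x, dtype=float)
    return math.ceil((x.max() - x.min()) / h)

sample = [2.0, 3.5, 7.0, 9.0, 12.0]

# The data span 12 - 2 = 10 units.
k_exact = number_of_bins(sample, h=2.5)    # 10 / 2.5 divides evenly
k_rounded = number_of_bins(sample, h=3.0)  # 10 / 3 is rounded up
print(k_exact, k_rounded)
```

The resulting integer can be passed to hist() as the bins argument.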

How to do it...

We create a histogram by calling matplotlib.pyplot.hist() with a set of parameters. Here are some of the most useful ones:

  • bins: This is either an integer number of bins or a sequence giving the bins. The default is 10.
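  • range: This is the lower and upper range of the bins, and it is not used if bins are given as a sequence. Values outside the range are ignored. The default is None, in which case the minimum and maximum of the data are used.
  • normed: If the value for this is True, histogram values are normalized so that they form a probability density. The default is False. Newer Matplotlib releases replace this parameter with density.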
  • histtype: This parameter allows us to specify the type of histogram that we want. The default value is 'bar' and the other options are shown here:
    • barstacked: This gives stacked-view histograms for multiple data
    • step: This creates a line plot that is left unfilled
    • stepfilled: This creates a line plot that is filled by default
  • align: This controls how bars are aligned relative to the bin edges. The default is mid, which centers bars between bin edges. Other values are left and right.
  • color: This specifies the color of the histogram. It may be a single value or have a sequence of colors. If multiple datasets are specified, the color sequence will be used in the same order. If not specified, a default line color sequence is used.
  • orientation: This allows the creation of histograms that are horizontal by setting orientation to horizontal. The default is vertical.
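Several of these parameters can be combined in one call. The following is a minimal sketch, not the recipe's own code; the two synthetic datasets, the seed, and the chosen range are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)        # seeded only so the sketch is repeatable
a = rng.normal(100, 15, 5000)
b = rng.normal(120, 10, 5000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Two datasets as stacked bars; values outside [50, 160] are ignored.
# The color sequence is applied to the datasets in order.
counts, edges, _ = ax1.hist([a, b], bins=20, range=(50, 160),
                            histtype='barstacked', color=['r', 'b'])

# An unfilled outline, drawn horizontally.
ax2.hist(a, bins=20, histtype='step', orientation='horizontal', color='g')

# Calling plt.show() displays the figure in an interactive session.
```

Because range is given together with an integer bins value, the 20 bins are spread evenly between 50 and 160.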

The following code demonstrates how hist() is used:

import numpy as np
import matplotlib.pyplot as plt

mu = 100
sigma = 15
x = np.random.normal(mu, sigma, 10000)

ax = plt.gca()

# the histogram of the data
ax.hist(x, bins=35, color='r')

ax.set_xlabel('Values')
ax.set_ylabel('Frequency')

ax.set_title(r'$\mathrm{Histogram:}\ \mu=%d,\ \sigma=%d$' % (mu, sigma))

plt.show()

This creates a neat, red-colored histogram for our data sample.

How it works...

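We start by generating some normally distributed data. The histogram is plotted with the specified number of bins (35), and we set the color to red (r). Because normed is left at its default value of False, the vertical axis shows raw frequencies. Setting normed to True (or 1) would normalize the histogram instead.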

After that, we set labels and a title for the plot. Here, we used the ability to write LaTeX expressions for math symbols and mixed that with Python format strings.
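As a variation on the recipe, the histogram can be normalized and the LaTeX title checked in code. This is a minimal sketch that assumes a Matplotlib release in which the density parameter is available (it took over the role of normed); the seed and the changed axis label are also assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 100, 15
x = np.random.default_rng(2).normal(mu, sigma, 10000)

fig, ax = plt.subplots()

# density=True scales the bars so that their total area is 1.
heights, edges, _ = ax.hist(x, bins=35, density=True, color='r')

ax.set_xlabel('Values')
ax.set_ylabel('Probability density')

# The raw string keeps the backslashes that the math renderer needs;
# the % operator then fills in the two numbers.
ax.set_title(r'$\mathrm{Histogram:}\ \mu=%d,\ \sigma=%d$' % (mu, sigma))

# Calling plt.show() displays the figure in an interactive session.
```

The title is written as a raw string so that sequences such as \m and \s reach the math renderer unchanged instead of being interpreted by Python.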
