Common plots used in statistical analyses

Having seen how to create, lay out, and annotate time-series charts, we will now look at several other kinds of charts that are commonplace when presenting statistical information.

Bar plots

Bar plots are useful for visualizing the relative differences in values of non-time-series data. Bar plots can be created using the kind='bar' parameter of the .plot() method:

In [24]:
   # make a bar plot
   # create a small series of 10 random values centered at 0.0
   np.random.seed(seedval)
   s = pd.Series(np.random.rand(10) - 0.5)
   # plot the bar chart
   s.plot(kind='bar');

If the data being plotted consists of multiple columns, a multiple series bar plot will be created:

In [25]:
   # draw a multiple series bar chart
   # generate 4 columns of 10 random values
   np.random.seed(seedval)
   df2 = pd.DataFrame(np.random.rand(10, 4), 
                      columns=['a', 'b', 'c', 'd'])
   # draw the multi-series bar chart
   df2.plot(kind='bar');

If you would prefer stacked bars, you can use the stacked parameter, setting it to True:

In [26]:
   # vertical stacked bar chart
   df2.plot(kind='bar', stacked=True);

If you want the bars to be horizontally aligned, you can use kind='barh':

In [27]:
   # horizontal stacked bar chart
   df2.plot(kind='barh', stacked=True);
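
To make the meaning of stacked=True concrete, the following is a minimal sketch showing that the top of each stack equals the sum of that row. The df_demo values are made up for illustration, and the Agg backend is selected only so that the sketch runs without a display; neither comes from the examples above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import numpy as np
import pandas as pd

# small, fixed, strictly positive data (illustrative values only)
df_demo = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                        'b': [0.5, 0.5, 0.5],
                        'c': [2.0, 1.0, 0.25]})

# draw the stacked bars and keep the returned matplotlib Axes
ax = df_demo.plot(kind='bar', stacked=True)

# matplotlib keeps one bar container per column; the last one is the top layer
top_bars = ax.containers[-1]
stack_tops = np.array([bar.get_y() + bar.get_height() for bar in top_bars])

# the top edge of each stack should match the row total
row_totals = df_demo.sum(axis=1).to_numpy()
print(stack_tops)
print(row_totals)
```

Reading the bar geometry back from the Axes in this way is only a check on our understanding; for normal use, the single .plot() call is all that is needed.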

Histograms

Histograms are useful for visualizing distributions of data. The following draws a histogram of 1000 values generated from the normal distribution:

In [28]:
   # create a histogram
   np.random.seed(seedval)
   # 1000 random numbers
   dfh = pd.DataFrame(np.random.randn(1000))
   # draw the histogram
   dfh.hist();

The resolution of a histogram can be controlled by specifying the number of bins to allocate to the graph. The default is 10, and increasing the number of bins gives finer detail to the histogram. The following increases the number of bins to 100:

In [29]:
   # histogram again, but with more bins
   dfh.hist(bins = 100);
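
The effect of the bin count can also be inspected numerically. The following sketch uses np.histogram() directly, with an arbitrary seed of our own choosing, to show that the same 1000 values are simply divided into narrower intervals as the number of bins grows:

```python
import numpy as np

# arbitrary seed chosen for this sketch (not the chapter's seedval)
rng = np.random.RandomState(42)
values = rng.randn(1000)

# count the values falling into 10 and into 100 equal-width bins
counts_10, edges_10 = np.histogram(values, bins=10)
counts_100, edges_100 = np.histogram(values, bins=100)

# every value lands in exactly one bin, whatever the resolution
total_10 = counts_10.sum()
total_100 = counts_100.sum()

# both histograms span the same range, so the bins become ten times narrower
width_10 = edges_10[1] - edges_10[0]
width_100 = edges_100[1] - edges_100[0]
print(total_10, total_100, width_10 / width_100)
```

More bins reveal finer structure, but each bin then holds fewer values, so the outline of the histogram becomes noisier.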

If the data has multiple series, the histogram function will automatically generate multiple histograms, one for each series:

In [30]:
   # generate a multiple histogram plot
   # create DataFrame with 4 columns of 1000 random values
   np.random.seed(seedval)
   dfh = pd.DataFrame(np.random.randn(1000, 4), 
                      columns=['a', 'b', 'c', 'd'])
   # draw the chart. There are four columns, so pandas draws
   # four histograms
   dfh.hist();

If you want to overlay multiple histograms on the same graph (to give a quick visual comparison of the distributions), you can call the pyplot.hist() function multiple times before .show() is called to render the chart:

In [31]:
   # directly use pyplot to overlay multiple histograms
   # generate two distributions, each with a different
   # mean and standard deviation
   np.random.seed(seedval)
   x = [np.random.normal(3,1) for _ in range(400)]
   y = [np.random.normal(4,2) for _ in range(400)]

   # specify the bin edges (100 edges from -10 to 10, giving 99 bins)
   bins = np.linspace(-10, 10, 100)

   # generate plot x using plt.hist, 50% transparent
   plt.hist(x, bins, alpha=0.5, label='x')
   # generate plot y using plt.hist, 50% transparent
   plt.hist(y, bins, alpha=0.5, label='y')
   plt.legend(loc='upper right');

Box and whisker charts

Box plots come from descriptive statistics and are a useful way of graphically depicting the distribution of data using quartiles. Each box represents the values between the first and third quartiles of the data, with a line across the box at the median. By default, each whisker extends to the most extreme data point that lies within 1.5 interquartile ranges below the first quartile or above the third quartile:

In [32]:
   # create a box plot
   # generate the series
   np.random.seed(seedval)
   dfb = pd.DataFrame(np.random.randn(10,5))
   # generate the plot
   dfb.boxplot(return_type='axes');
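
The quantities that a box plot draws can be computed by hand. The following is a rough sketch, assuming the whiskers use matplotlib's default factor of 1.5 interquartile ranges; the data values are invented for illustration and include one deliberately extreme point:

```python
import numpy as np

# illustrative values with one obvious outlier (not from the text)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# the box: first quartile, median, and third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# fences at 1.5 interquartile ranges beyond the box (assumed default)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers stop at the most extreme data points still inside the fences
whisker_low = data[data >= lower_fence].min()
whisker_high = data[data <= upper_fence].max()

# anything beyond the fences would be drawn as an individual outlier
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Note that the whiskers end at actual data points rather than at the fences themselves, which is why the two whiskers of a box are often of different lengths.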

Note

There are ways to overlay dots and show outliers, but for brevity, they will not be covered in this text.

Area plots

Area plots are used to represent cumulative totals over time, to demonstrate the change in trends over time among related attributes. They can also be "stacked" to demonstrate representative totals across all variables.

Area plots are generated by specifying kind='area'. A stacked area chart is the default:

In [33]:
   # create a stacked area plot
   # generate a 4-column data frame of random data
   np.random.seed(seedval)
   dfa = pd.DataFrame(np.random.rand(10, 4), 
                      columns=['a', 'b', 'c', 'd'])
   # create the area plot
   dfa.plot(kind='area');

To produce an unstacked plot, specify stacked=False:

In [34]:
   # do not stack the area plot
   dfa.plot(kind='area', stacked=False);

Note

By default, unstacked area plots have an alpha value of 0.5, so that it is possible to see how the data series overlap.
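
In a stacked area chart, the upper edge of each band is the running total of the columns drawn so far. As a small sketch with made-up numbers, those boundaries can be reproduced with a cumulative sum across the columns:

```python
import pandas as pd

# illustrative data: three columns, three rows (not from the text)
df_demo = pd.DataFrame({'a': [1, 2, 3],
                        'b': [2, 2, 2],
                        'c': [1, 0, 1]})

# running total across the columns gives the top edge of each stacked band
boundaries = df_demo.cumsum(axis=1)

# the top of the last band is the total of the whole row
row_totals = df_demo.sum(axis=1)
print(boundaries)
print(row_totals)
```

With stacked=False, by contrast, each column is drawn from zero using its own values, which is why the bands overlap.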

Scatter plots

A scatter plot displays the relationship between a pair of variables, which makes any correlation between them visible. A scatter plot can be created from a DataFrame using .plot() and specifying kind='scatter', along with the x and y columns from the DataFrame source:

In [35]:
   # generate a scatter plot of two series of normally
   # distributed random values
   # we would expect this to cluster around 0,0
   np.random.seed(111111)
   sp_df = pd.DataFrame(np.random.randn(10000, 2), 
                        columns=['a', 'b'])
   sp_df.plot(kind='scatter', x='a', y='b');
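
The visual impression given by a scatter plot can be backed up with a number. The following sketch, using an arbitrary seed and made-up variable names, computes the Pearson correlation coefficient for an independent pair (a round cloud like the one above) and for a pair with a strong linear relationship:

```python
import numpy as np

# arbitrary seed chosen for this sketch
rng = np.random.RandomState(0)
a = rng.randn(10000)

# an independent second variable: the scatter is a round cloud
b_independent = rng.randn(10000)
# a linearly related second variable with a little noise: the scatter is a line
b_related = 2.0 * a + 0.1 * rng.randn(10000)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_independent = np.corrcoef(a, b_independent)[0, 1]
r_related = np.corrcoef(a, b_related)[0, 1]
print(r_independent, r_related)
```

A coefficient near 0 corresponds to the shapeless cluster around (0, 0), while a coefficient near 1 corresponds to points falling close to a rising straight line.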

We can easily create more elaborate scatter plots by dropping down a little lower into matplotlib. The following code gets Google stock data for the year 2011, calculates the delta in the closing price per day, and renders each day's delta against the next day's delta as bubbles whose color and size are derived from the price and volume values in the data:

In [36]:
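   # get Google stock data from 1/1/2011 to 12/31/2011
   # (pandas.io.data has been removed from pandas; DataReader now
   # lives in the separate pandas-datareader package, and the
   # "yahoo" source may no longer be available)
   from datetime import datetime
   from pandas_datareader.data import DataReader
   stock_data = DataReader("GOOGL", "yahoo", 
                           datetime(2011, 1, 1), 
                           datetime(2011, 12, 31))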

   # % change per day
   delta = np.diff(stock_data["Adj Close"])/stock_data["Adj Close"][:-1]

   # this calculates size of markers
   volume = (15 * stock_data.Volume[:-2] / stock_data.Volume[0])**2
   close = 0.003 * stock_data.Close[:-2] / 0.003 * stock_data.Open[:-2]

   # generate scatter plot
   fig, ax = plt.subplots()
   ax.scatter(delta[:-1], delta[1:], c=close, s=volume, alpha=0.5)

   # add some labels and style
   ax.set_xlabel(r'$\Delta_i$', fontsize=20)
   ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=20)
   ax.set_title('Volume and percent change')
   ax.grid(True);

Note

Note the LaTeX-style notation used for the x and y axis labels, which renders them in a mathematical style.

Density plot

You can create kernel density estimation plots using the .plot() method and setting the kind='kde' parameter. A kernel density estimate plot, instead of being a purely empirical representation of the data, attempts to estimate the true distribution of the data, and hence smooths it into a continuous curve. The following generates a normally distributed set of numbers, displays it as a histogram, and overlays the kde plot:

In [37]:
   # create a kde density plot
   # generate a series of 1000 random numbers
   np.random.seed(seedval)
   s = pd.Series(np.random.randn(1000))
   # generate the plot
   s.hist(density=True) # shows the bars (density replaces the removed normed argument)
   s.plot(kind='kde');
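
To see what the smoothing involves, the following is a simplified sketch of a Gaussian kernel density estimate written with NumPy alone. The bandwidth uses Scott's rule of thumb, which is an assumption made for this illustration, as are the seed and the evaluation grid:

```python
import numpy as np

# arbitrary seed chosen for this sketch
rng = np.random.RandomState(1)
samples = rng.randn(1000)
n = samples.size

# Scott's rule of thumb for one-dimensional data (assumed bandwidth choice)
bandwidth = samples.std(ddof=1) * n ** (-1.0 / 5.0)

# points at which the estimated density is evaluated
grid = np.linspace(-6.0, 6.0, 601)

# place one Gaussian bump on every sample and average them at each grid point
z = (grid[:, None] - samples[None, :]) / bandwidth
kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
density = kernels.sum(axis=1) / (n * bandwidth)

# a density should integrate to roughly 1 over the grid
dx = grid[1] - grid[0]
area = density.sum() * dx
print(bandwidth, area)
```

A smaller bandwidth follows the individual samples more closely, and a larger one gives a smoother but flatter curve, which is the same trade-off that the number of bins controls in a histogram.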

The scatter plot matrix

The final composite graph we'll look at in this chapter is one that is provided by pandas in its plotting tools subcomponent: the scatter plot matrix. A scatter plot matrix is a popular way of determining whether there is a linear correlation between multiple variables. The following creates a scatter plot matrix with random values, which then shows a scatter plot for each combination, as well as a kde graph for each variable:

In [38]:
   # create a scatter plot matrix
   # import this function (it lives in pandas.plotting in current
   # pandas releases; pandas.tools.plotting has been removed)
   from pandas.plotting import scatter_matrix

   # generate DataFrame with 4 columns of 1000 random numbers
   np.random.seed(111111)
   df_spm = pd.DataFrame(np.random.randn(1000, 4), 
                         columns=['a', 'b', 'c', 'd'])
   # create the scatter matrix
   scatter_matrix(df_spm, alpha=0.2, figsize=(6, 6), diagonal='kde');

Note

We will see this plot again, as it is applied to finance in the next chapter, where we look at correlations of various stocks.

Heatmaps

A heatmap is a graphical representation of data, where values within a matrix are represented by colors. This is an effective means of showing relationships between values that are measured at the intersection of two variables, at each intersection of the rows and the columns of the matrix. A common scenario is to have the values in the matrix normalized to 0.0 through 1.0 and have the intersection of a row and a column represent the correlation between the two variables. Values with the least correlation (0.0) are the darkest, and those with the highest correlation (1.0) are white.

Heatmaps are easily created with pandas and matplotlib using the .imshow() function:

In [39]:
   # create a heatmap
   # start with data for the heatmap
   s = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4],
                 ['V', 'W', 'X', 'Y', 'Z'])
   heatmap_data = pd.DataFrame({'A' : s + 0.0,
                                'B' : s + 0.1,
                                'C' : s + 0.2,
                                'D' : s + 0.3,
                                'E' : s + 0.4,
                                'F' : s + 0.5,
                                'G' : s + 0.6
                        })
   heatmap_data

Out [39]:
        A    B    C    D    E    F    G
   V  0.0  0.1  0.2  0.3  0.4  0.5  0.6
   W  0.1  0.2  0.3  0.4  0.5  0.6  0.7
   X  0.2  0.3  0.4  0.5  0.6  0.7  0.8
   Y  0.3  0.4  0.5  0.6  0.7  0.8  0.9
   Z  0.4  0.5  0.6  0.7  0.8  0.9  1.0

In [40]:
   # generate the heatmap
   plt.imshow(heatmap_data, cmap='hot', interpolation='none')
   plt.colorbar()  # add the scale of colors bar
   # set the labels
   plt.xticks(range(len(heatmap_data.columns)), heatmap_data.columns)
   plt.yticks(range(len(heatmap_data)), heatmap_data.index);
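
The sample data above already lies between 0.0 and 1.0. When it does not, one simple option is min-max scaling over the whole matrix before passing it to .imshow(). The following sketch uses a made-up raw DataFrame; the variable names are ours and other scaling schemes are equally possible:

```python
import pandas as pd

# illustrative unnormalized values (not from the text)
raw = pd.DataFrame({'A': [10.0, 20.0],
                    'B': [30.0, 50.0]},
                   index=['V', 'W'])

# smallest and largest values across the entire matrix
lowest = raw.min().min()
highest = raw.max().max()

# rescale so the smallest value maps to 0.0 and the largest to 1.0
normalized = (raw - lowest) / (highest - lowest)
print(normalized)
```

Scaling over the whole matrix, rather than column by column, keeps the colors comparable across every cell of the heatmap.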

Note

We will see an example of heatmaps to show correlations in the next chapter.
