Having seen how to create, lay out, and annotate time-series charts, we will now look at a number of other chart types that are commonplace in presenting statistical information.
Bar plots are useful for visualizing the relative differences in values of non-time-series data. Bar plots can be created using the kind='bar' parameter of the .plot() method:
In [24]: # make a bar plot
         # create a small series of 10 random values centered at 0.0
         np.random.seed(seedval)
         s = pd.Series(np.random.rand(10) - 0.5)
         # plot the bar chart
         s.plot(kind='bar');
If the data being plotted consists of multiple columns, a multiple series bar plot will be created:
In [25]: # draw a multiple series bar chart
         # generate 4 columns of 10 random values
         np.random.seed(seedval)
         df2 = pd.DataFrame(np.random.rand(10, 4),
                            columns=['a', 'b', 'c', 'd'])
         # draw the multi-series bar chart
         df2.plot(kind='bar');
If you would prefer stacked bars, you can use the stacked parameter, setting it to True:
In [26]: # vertical stacked bar chart
         df2.plot(kind='bar', stacked=True);
If you want the bars to be horizontally aligned, you can use kind='barh':
In [27]: # horizontal stacked bar chart
         df2.plot(kind='barh', stacked=True);
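To make the idea of stacking concrete, the following is a minimal sketch of where each segment of a stacked bar starts and ends, assuming non-negative values. The small DataFrame and the names tops, bottoms, and totals are hypothetical and only illustrate the arithmetic; they are not part of the pandas plotting API.

```python
import pandas as pd

# Hypothetical data: 3 bars (rows), each built from 3 segments (columns)
df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [0.5, 0.5, 0.5],
                   'c': [2.0, 1.0, 0.0]})

# Top edge of each segment: running total across the columns of a row
tops = df.cumsum(axis=1)

# Bottom edge of each segment: the top of the previous segment
# (0 for the first column)
bottoms = tops - df

# The full height of each stacked bar is the row total
totals = df.sum(axis=1)

print(tops)
print(totals)
```

The last column of tops matches totals, which is why a stacked chart shows both the individual contributions and the overall total for each row.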
Histograms are useful for visualizing distributions of data. The following shows a histogram of 1000 values drawn from the normal distribution:
In [28]: # create a histogram
         np.random.seed(seedval)
         # 1000 random numbers
         dfh = pd.DataFrame(np.random.randn(1000))
         # draw the histogram
         dfh.hist();
The resolution of a histogram can be controlled by specifying the number of bins to allocate to the graph. The default is 10, and increasing the number of bins gives finer detail. The following increases the number of bins to 100:
In [29]: # histogram again, but with more bins
         dfh.hist(bins=100);
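The effect of the bin count can also be inspected numerically. The following sketch uses np.histogram to compute the counts that a histogram would draw. The fixed seed of 0 and the variable names are assumptions made for this illustration.

```python
import numpy as np

# 1000 normally distributed values (seed chosen arbitrarily for the sketch)
rng = np.random.RandomState(0)
values = rng.randn(1000)

# np.histogram returns the count per bin and the bin edges
counts_10, edges_10 = np.histogram(values, bins=10)
counts_100, edges_100 = np.histogram(values, bins=100)

# N bins are delimited by N + 1 edges
print(len(counts_10), len(edges_10))
print(len(counts_100), len(edges_100))

# Every value lands in exactly one bin, whatever the resolution
print(counts_10.sum(), counts_100.sum())

# Each of the 100 bins is one tenth as wide as each of the 10 bins
width_10 = edges_10[1] - edges_10[0]
width_100 = edges_100[1] - edges_100[0]
print(width_10 / width_100)
```

More bins show finer structure, but each bin then holds fewer values, so the outline becomes noisier.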
If the data has multiple series, the histogram function will automatically generate multiple histograms, one for each series:
In [30]: # generate a multiple histogram plot
         # create DataFrame with 4 columns of 1000 random values
         np.random.seed(seedval)
         dfh = pd.DataFrame(np.random.randn(1000, 4),
                            columns=['a', 'b', 'c', 'd'])
         # draw the chart. There are four columns so pandas draws
         # four histograms
         dfh.hist();
If you want to overlay multiple histograms on the same graph (to give a quick visual comparison of the distributions), you can call the pyplot.hist() function multiple times before .show() is called to render the chart:
In [31]: # directly use pyplot to overlay multiple histograms
         # generate two distributions, each with a different
         # mean and standard deviation
         np.random.seed(seedval)
         x = [np.random.normal(3,1) for _ in range(400)]
         y = [np.random.normal(4,2) for _ in range(400)]
         # specify the bins (100 evenly spaced edges from -10 to 10)
         bins = np.linspace(-10, 10, 100)
         # generate plot x using plt.hist, 50% transparent
         plt.hist(x, bins, alpha=0.5, label='x')
         # generate plot y using plt.hist, 50% transparent
         plt.hist(y, bins, alpha=0.5, label='y')
         plt.legend(loc='upper right');
Box plots come from descriptive statistics and are a useful way of graphically depicting the distributions of data using quartiles, typically with one box per category or column. Each box represents the values between the first and third quartiles of the data, with a line across the box at the median. By default, each whisker extends to the most extreme data point that lies within 1.5 interquartile ranges below the first quartile or above the third quartile, and points beyond the whiskers are drawn individually as outliers:
In [32]: # create a box plot
         # generate the data: 5 columns of 10 random values
         np.random.seed(seedval)
         dfb = pd.DataFrame(np.random.randn(10,5))
         # generate the plot
         dfb.boxplot(return_type='axes');
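The numbers behind a single box can be computed by hand. The following is a minimal sketch, assuming the common convention of whiskers limited to 1.5 interquartile ranges (the matplotlib default) and NumPy's default linear interpolation for percentiles. The sample data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which points count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1.0 9.0
print(outliers)                   # [30.]
```

Note that the whiskers stop at actual data points (1.0 and 9.0 here), not at the fence values themselves.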
Area plots are used to represent cumulative totals over time, to demonstrate the change in trends over time among related attributes. They can also be "stacked" to demonstrate representative totals across all variables.
Area plots are generated by specifying kind='area'. A stacked area chart is the default:
In [33]: # create a stacked area plot
         # generate a 4-column data frame of random data
         np.random.seed(seedval)
         dfa = pd.DataFrame(np.random.rand(10, 4),
                            columns=['a', 'b', 'c', 'd'])
         # create the area plot
         dfa.plot(kind='area');
To produce an unstacked plot, specify stacked=False:
In [34]: # do not stack the area plot
         dfa.plot(kind='area', stacked=False);
A scatter plot displays the correlation between a pair of variables. A scatter plot can be created from a DataFrame using .plot() and specifying kind='scatter', as well as specifying the x and y columns from the source DataFrame:
In [35]: # generate a scatter plot of two series of normally
         # distributed random values
         # we would expect this to cluster around 0,0
         np.random.seed(111111)
         sp_df = pd.DataFrame(np.random.randn(10000, 2),
                              columns=['a', 'b'])
         sp_df.plot(kind='scatter', x='a', y='b')
We can easily create more elaborate scatter plots by dropping down a little lower into matplotlib. The following code gets Google stock data for the year 2011, calculates the daily change in the adjusted closing price, and plots each day's change against the next day's change as bubbles whose sizes are derived from the trading volume and whose colors are derived from the closing and opening prices:
In [36]: # get Google stock data from 1/1/2011 to 12/31/2011
         # (pandas.io.data has been removed from pandas; DataReader
         # now lives in the separate pandas-datareader package)
         from datetime import datetime
         from pandas_datareader.data import DataReader
         stock_data = DataReader("GOOGL", "yahoo",
                                 datetime(2011, 1, 1),
                                 datetime(2011, 12, 31))
         # % change per day
         delta = np.diff(stock_data["Adj Close"])/stock_data["Adj Close"][:-1]
         # this calculates size of markers
         volume = (15 * stock_data.Volume[:-2] / stock_data.Volume.iloc[0])**2
         # values used to color the markers
         close = 0.003 * stock_data.Close[:-2] / 0.003 * stock_data.Open[:-2]
         # generate scatter plot
         fig, ax = plt.subplots()
         ax.scatter(delta[:-1], delta[1:], c=close, s=volume, alpha=0.5)
         # add some labels and style
         ax.set_xlabel(r'$\Delta_i$', fontsize=20)
         ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=20)
         ax.set_title('Volume and percent change')
         ax.grid(True);
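The delta calculation and the pairing of each day's change with the next day's change can be checked on a tiny example that needs no download. The following is a sketch using a hypothetical price series; the prices and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for four consecutive days
prices = pd.Series([100.0, 102.0, 99.96, 104.958])

# Same formula as above: the difference between consecutive
# prices divided by the earlier price
delta_np = np.diff(prices.values) / prices.values[:-1]

# pandas offers the same calculation as pct_change(); the first
# entry is NaN because there is no earlier price to compare with
delta_pd = prices.pct_change().dropna().values

print(delta_np)  # approximately [ 0.02 -0.02  0.05]

# Pair each change with the change on the following day; these
# pairs are the x and y coordinates of the scatter plot
today = delta_np[:-1]
tomorrow = delta_np[1:]
print(list(zip(today, tomorrow)))
```

Note that the values are fractions (0.02 rather than 2), so they would need to be multiplied by 100 to be read as percentages.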
You can create kernel density estimation plots using the .plot() method and setting the kind='kde' parameter. A kernel density estimate plot, instead of being a pure empirical representation of the data, attempts to estimate the true distribution of the data, and hence smooths it into a continuous plot. The following generates a normally distributed set of numbers, displays it as a histogram, and overlays the kde plot:
In [37]: # create a kde density plot
         # generate a series of 1000 random numbers
         np.random.seed(seedval)
         s = pd.Series(np.random.randn(1000))
         # generate the plot
         # (density=True replaces normed=True, which newer
         # matplotlib versions no longer accept)
         s.hist(density=True) # shows the bars
         s.plot(kind='kde');
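To show what a kernel density estimate computes, the following sketch builds one by hand with NumPy: a small Gaussian bump is centered on every data point and the bumps are averaged. The bandwidth uses a Scott-style rule of thumb, which is an assumption made for this illustration and may differ in detail from what pandas does internally; the seed and variable names are also arbitrary.

```python
import numpy as np

# 1000 normally distributed values (seed chosen arbitrarily)
rng = np.random.RandomState(0)
sample = rng.randn(1000)
n = sample.size

# Rule-of-thumb bandwidth (assumed): sample std * n^(-1/5)
bandwidth = sample.std(ddof=1) * n ** (-1.0 / 5.0)

# Points at which to evaluate the estimated density
grid = np.linspace(sample.min() - 3 * bandwidth,
                   sample.max() + 3 * bandwidth, 400)

# One Gaussian kernel per (grid point, data point) pair
z = (grid[:, None] - sample[None, :]) / bandwidth
kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

# Average the kernels and rescale so the result is a density
density = kernels.sum(axis=1) / (n * bandwidth)

# A density should integrate to roughly 1 (trapezoidal rule)
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(area)
```

A smaller bandwidth follows the data more closely and gives a bumpier curve, while a larger one smooths more aggressively.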
The final composite graph we'll look at in this chapter is one that is provided by pandas in its plotting tools subcomponent: the scatter plot matrix. A scatter plot matrix is a popular way of determining whether there is a linear correlation between multiple variables. The following creates a scatter plot matrix with random values, which shows a scatter plot for each pair of variables, as well as a kde graph for each variable:
In [38]: # create a scatter plot matrix
         # import this function (it lived in pandas.tools.plotting
         # in older versions of pandas)
         from pandas.plotting import scatter_matrix
         # generate DataFrame with 4 columns of 1000 random numbers
         np.random.seed(111111)
         df_spm = pd.DataFrame(np.random.randn(1000, 4),
                               columns=['a', 'b', 'c', 'd'])
         # create the scatter matrix
         scatter_matrix(df_spm, alpha=0.2, figsize=(6, 6), diagonal='kde');
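A scatter plot matrix gives a visual impression of linear correlation; the .corr() method of a DataFrame puts a number on it. The following sketch uses hypothetical columns x, y, and z, where y is constructed to depend linearly on x and z is independent of both.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = rng.randn(1000)

df = pd.DataFrame({
    'x': x,
    'y': 2.0 * x + 0.1 * rng.randn(1000),  # nearly linear in x
    'z': rng.randn(1000),                  # unrelated to x and y
})

# Pairwise Pearson correlation coefficients, one per pair of columns
corr = df.corr()
print(corr)
```

In the result, the x/y entry is close to 1.0 and the entries involving z are close to 0.0, matching what the corresponding panels of a scatter plot matrix would show: a tight diagonal band for x versus y and a shapeless cloud for the pairs involving z.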
A heatmap is a graphical representation of data, where values within a matrix are represented by colors. This is an effective means to show relationships of values that are measured at the intersection of two variables, at each intersection of the rows and the columns of the matrix. A common scenario is to have the values in the matrix normalized to 0.0 through 1.0 and have the intersections between a row and column represent the correlation between the two variables. With the color map used in the following example, values with the least correlation (0.0) are the darkest, and those with the highest correlation (1.0) are white.
Heatmaps are easily created with pandas and matplotlib using the .imshow() function:
In [39]: # create a heatmap
         # start with data for the heatmap
         s = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4],
                       ['V', 'W', 'X', 'Y', 'Z'])
         heatmap_data = pd.DataFrame({'A' : s + 0.0,
                                      'B' : s + 0.1,
                                      'C' : s + 0.2,
                                      'D' : s + 0.3,
                                      'E' : s + 0.4,
                                      'F' : s + 0.5,
                                      'G' : s + 0.6
                                     })
         heatmap_data

Out [39]:
     A    B    C    D    E    F    G
V  0.0  0.1  0.2  0.3  0.4  0.5  0.6
W  0.1  0.2  0.3  0.4  0.5  0.6  0.7
X  0.2  0.3  0.4  0.5  0.6  0.7  0.8
Y  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Z  0.4  0.5  0.6  0.7  0.8  0.9  1.0

In [40]: # generate the heatmap
         plt.imshow(heatmap_data, cmap='hot', interpolation='none')
         plt.colorbar() # add the scale of colors bar
         # set the labels
         plt.xticks(range(len(heatmap_data.columns)), heatmap_data.columns)
         plt.yticks(range(len(heatmap_data)), heatmap_data.index);
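When the values to be shown are not already in the range 0.0 through 1.0, they can be rescaled first. The following is a minimal sketch of min-max normalization over a whole DataFrame; the data and the names raw and scaled are hypothetical, and other normalization schemes may suit other data better.

```python
import pandas as pd

# Hypothetical measurements on an arbitrary scale
raw = pd.DataFrame([[10.0, 20.0],
                    [30.0, 50.0]],
                   index=['V', 'W'], columns=['A', 'B'])

# Smallest and largest values over the entire matrix
lo = raw.values.min()
hi = raw.values.max()

# Map the smallest value to 0.0 and the largest to 1.0
scaled = (raw - lo) / (hi - lo)
print(scaled)
```

The scaled DataFrame keeps the row and column labels of the original, so it can be used for a heatmap in the same way as heatmap_data.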