How it works...

In step 1, we set out to analyze movie budgets by finding the median budget per year in millions of dollars. After finding the median budget for each year, we smooth it out, because there is quite a lot of variability from year to year. We smooth the data because we are looking for a general trend and are not necessarily interested in the exact value of any one year.

In this step, we use the rolling method to calculate a new value for each year based on the average of the last five years of data. For example, the median budgets from the years 2011 through 2015 are grouped together and averaged. The result is the new value for the year 2015. The only required parameter for the rolling method is the size of the window, which, by default, ends at the current year.
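As a rough illustration of the windowing described above, the following sketch smooths a small, made-up Series of yearly values. The numbers are hypothetical and are not the recipe's data; the min_periods=1 argument is an assumption added here so that the first few years, which have fewer than five prior values, still produce a result.

```python
import pandas as pd

# Hypothetical median budgets in millions of dollars, indexed by year
med_budget = pd.Series(
    [20.0, 18.0, 22.0, 15.0, 17.0, 16.0, 19.0],
    index=range(2010, 2017),
)

# Window of five rows ending at the current year. min_periods=1 (an
# assumption for this sketch) allows shorter windows at the start.
med_budget_roll = med_budget.rolling(5, min_periods=1).mean()

# The smoothed value for 2015 is the plain average of 2011 through 2015
manual_2015 = med_budget.loc[2011:2015].mean()
print(med_budget_roll.loc[2015], manual_2015)
```

Without min_periods, the first four years of the result would be missing values, since a full five-year window is not yet available for them.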

The rolling method returns a groupby-like object that must have its groups acted on with another function to produce a result. Let's manually verify that the rolling method works as expected for a few of the previous years:

>>> med_budget.loc[2012:2016].mean()
17.78

>>> med_budget.loc[2011:2015].mean()
17.98

>>> med_budget.loc[2010:2014].mean()
19.1

These values are the same as the output from step 1. In step 2, we get ready to use matplotlib by putting our data into NumPy arrays. In step 3, we create our Figure and Axes to set up the object-oriented interface. The plt.subplots function supports a large number of inputs. See the documentation to view all possible parameters for both this function and the figure function (http://bit.ly/2ydM8ZU and http://bit.ly/2ycno40).

The first two parameters in the plot method represent the x and y values for a line plot. All of the line properties are available to be changed inside the call to plot. The set_title Axes method provides a title and can set all the available text properties inside its call. The same goes for the set_ylabel method. If you are setting the same properties for many objects, you can pack them together into a dictionary and pass this dictionary as one of the arguments, as done with **text_kwargs.
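The pattern of passing line properties to plot and sharing text properties through a dictionary might look like the following minimal sketch. The data values, the specific property choices, and the title text are hypothetical; only the names plot, set_title, set_ylabel, and text_kwargs come from the discussion above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data standing in for the smoothed median budgets
years = np.arange(2010, 2017)
budget = np.array([20.0, 18.0, 22.0, 15.0, 17.0, 16.0, 19.0])

fig, ax = plt.subplots(figsize=(8, 4))

# x and y values come first; line properties follow as keyword arguments
lines = ax.plot(years, budget, linestyle="--", linewidth=3,
                color="0.2", label="All Movies")

# Text properties shared by several labels, packed into one dictionary
text_kwargs = dict(fontsize=14, fontweight="bold")
title = ax.set_title("Median Movie Budget", **text_kwargs)
ylabel = ax.set_ylabel("Millions of Dollars", **text_kwargs)
```

Keeping the shared properties in one dictionary means a later change, such as a different font size, only has to be made in one place.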

In step 4, we notice an unexpected downward trend in median budget beginning around the year 2000 and suspect that the number of movies collected per year might play an explanatory role. We choose to add this dimension to the graph by creating a bar plot of every fifth year of data beginning from 1970. We use boolean selection on our NumPy data arrays in the same manner as we do for the pandas Series in step 5.

The bar method takes the x-value, the height, and the width of the bars as its first three arguments and places the center of the bars directly at each x-value. The bar height is derived from the movie count, which is first scaled down to be between zero and one and then multiplied by the maximum median budget. These bar heights are stored in the variable ct_norm_5. To label each bar correctly, we first zip together the bar center, its height, and the actual movie count. We then loop through this zipped object and place the count just above the bar with the text method, which accepts an x-value, y-value, and a string. We adjust the y-value slightly upwards and use the horizontal alignment parameter, ha, to center the text.
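The scaling and labeling described above can be sketched as follows. The years, counts, maximum budget, bar width, and vertical offset are all hypothetical values chosen for illustration; ct_norm_5 is the only name taken from the recipe, and dividing by the maximum count is one assumed way of scaling the counts to lie between zero and one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical movie counts for every fifth year
years_5 = np.array([1970, 1975, 1980])
counts_5 = np.array([12, 30, 60])
max_budget = 20.0  # assumed maximum of the median budget line

# Scale counts to at most one, then stretch to the height of the budget line
ct_norm_5 = counts_5 / counts_5.max() * max_budget

fig, ax = plt.subplots()
# x-value, height, width; bars are centered on each x-value by default
bars = ax.bar(years_5, ct_norm_5, 3, facecolor="0.5", alpha=0.3)

# Write the actual count slightly above each bar, centered horizontally
for x, y, count in zip(years_5, ct_norm_5, counts_5):
    ax.text(x, y + 0.5, str(count), ha="center")
```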

Look back at step 3, and you will notice the plot method with the label parameter equal to All Movies. This is the value that matplotlib uses when you create a legend for your plot. A call to the legend Axes method puts all the plots with assigned labels in the legend.
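A small sketch of how labels feed the legend, using made-up data: only the line that was given a label appears when legend is called.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 2, 3], label="All Movies")  # labeled, so it gets a legend entry
ax.plot([1, 2, 3], [3, 2, 1])                      # no label, so it is left out

legend = ax.legend()
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```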

To investigate the unexpected dip in the median budget, we can focus on just the top 10 budgeted movies for each year. Step 6 uses a custom aggregation function after grouping by year to do so, and then smooths the result in the same manner as before. These results could be plotted directly on the same graph, but because the values are so much greater, we opt to create an entire new Figure with two Axes.
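One possible shape for such a custom aggregation is sketched below. The DataFrame, its column names (title_year and budget), the random values, and the choice of the median as the summary of the top 10 are all assumptions made for this illustration and may differ from the recipe's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 30 movies per year with random budgets (millions)
rng = np.random.default_rng(0)
movie = pd.DataFrame({
    "title_year": np.repeat([2014, 2015, 2016], 30),
    "budget": rng.uniform(1, 200, size=90),
})

def top10_median(s):
    # Keep the ten largest budgets in the group, then summarize them
    return s.nlargest(10).median()

# Custom aggregation per year, then the same smoothing as before
top10 = movie.groupby("title_year")["budget"].agg(top10_median)
top10_roll = top10.rolling(5, min_periods=1).mean()
print(top10_roll)
```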

We start step 7 by creating a Figure with two subplots in a two-row by one-column grid. Remember that when creating more than one subplot, all the Axes get stored in a NumPy array. The final result from step 5 is recreated in the top Axes. We plot the top 10 budgeted movies in the bottom Axes. Notice that the years align for both the bottom and top Axes because the sharex parameter was set to True in the Figure creation. When sharing an axis, matplotlib removes the labels for all the ticks but keeps those tiny vertical lines for each tick. To make these tick lines invisible, we use the setp pyplot function. Although this isn't directly object-oriented, it is explicit and very useful when we want to set properties for an entire sequence of plotting objects.
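The layout and the tick-line cleanup can be sketched roughly as follows, with hypothetical data in place of the recipe's budget arrays.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(2000, 2017)

# Two rows, one column; the Axes come back in a NumPy array
fig, ax_array = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
ax1, ax2 = ax_array

ax1.plot(years, np.linspace(20, 15, len(years)))    # stand-in for all movies
ax2.plot(years, np.linspace(100, 180, len(years)))  # stand-in for the top 10

# setp applies a property to every object in the sequence at once
plt.setp(ax1.get_xticklines(), visible=False)
```

Because sharex=True ties the two x-axes together, changing the x-limits of one Axes changes the other as well.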

Finally, we make several calls to Figure methods. This is a departure from our normal calling of Axes methods. The tight_layout method adjusts the subplots to look much nicer by removing extra space and ensuring that different Axes don't overlap. The suptitle method creates a title for the entire Figure, as opposed to the set_title Axes method, which creates titles for individual Axes. It accepts an x and y location to represent a place in the figure coordinate system, in which (0, 0) represents the bottom left and (1, 1) represents the top right. By default, the y-value is 0.98, but we move it up slightly to 1.02.
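A minimal sketch of these Figure-level calls, with placeholder titles, is shown here. The y-value of 1.02 follows the text above; everything else is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.set_title("Top Axes")      # title for one Axes only
ax2.set_title("Bottom Axes")

# Tidy the spacing between subplots
fig.tight_layout()

# Title for the whole Figure, placed just above the top edge (y > 1)
sup = fig.suptitle("Figure Title", y=1.02, fontsize=16)
```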

Each Axes also has a coordinate system in which (0, 0) is used for the bottom left and (1, 1) for the top right. In addition to that coordinate system, each Axes also has a data coordinate system, which is more natural to most people and represents the bounds of the x- and y-axes. These bounds may be retrieved with ax.get_xlim() and ax.get_ylim() respectively. All the plotting before this used the data coordinate system. See the Transformations tutorial to learn more about the coordinate systems (http://bit.ly/2gxDkX3).
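To make the three coordinate systems concrete, the following sketch places text in each of them and converts the center of the Axes into data coordinates. The limits and strings are hypothetical; transAxes and transData are the standard matplotlib transform attributes on an Axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([2000, 2010], [5, 25])
ax.set_xlim(2000, 2010)
ax.set_ylim(0, 30)

# Data coordinates are the default for Axes methods
ax.text(2005, 15, "data coords", ha="center")
# Axes coordinates: (0, 0) bottom left, (1, 1) top right of this Axes
ax.text(0.5, 0.9, "axes coords", transform=ax.transAxes, ha="center")
# Figure.text uses figure coordinates by default
fig.text(0.5, 0.01, "figure coords", ha="center")

xlim = ax.get_xlim()
ylim = ax.get_ylim()

# The center of the Axes, (0.5, 0.5), expressed in data coordinates
center_display = ax.transAxes.transform((0.5, 0.5))
center_data = ax.transData.inverted().transform(center_display)
print(xlim, ylim, center_data)
```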

As both Axes use the same units for the y-axis, we use the text Figure method to place a custom y-axis label directly between the two Axes, using the figure coordinate system. Finally, we save the Figure to our desktop. The tilde, ~, in the path represents the home directory, but the savefig method won't understand what this means. You must use the expanduser function from the os.path module to create the full path. For instance, the path variable becomes the following on my machine:

>>> os.path.expanduser('~/Desktop/movie_budget.png')
'/Users/Ted/Desktop/movie_budget.png'

The savefig method can now create the file in the correct location. By default, savefig will save only what is plotted within (0, 0) to (1, 1) of the figure coordinate system. As our title is slightly outside of this area, some of it will be cropped. Setting the bbox_inches parameter to 'tight' will have matplotlib include any titles or labels that extend outside of this region.
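The saving step might be sketched as follows. To keep the example runnable on any machine, the file is written to a temporary directory rather than the desktop, which is a departure from the recipe; the plotted data and title are placeholders.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
fig.suptitle("Title Above the Default Area", y=1.02)

# expanduser replaces the leading ~ with the home directory
desktop_path = os.path.expanduser("~/Desktop/movie_budget.png")

# Hypothetical output location used only for this sketch
out_path = os.path.join(tempfile.mkdtemp(), "movie_budget.png")

# bbox_inches='tight' grows the saved region to include the raised title
fig.savefig(out_path, bbox_inches="tight")
```

Passing desktop_path to savefig instead of out_path would reproduce the recipe's behavior, provided a Desktop directory exists in the home directory.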
