Creating graphs is an important step in exploratory analysis, and one of the first stages in data analysis. We can use Matplotlib to construct a variety of analytical graphs that display our data in different ways.
Basic Plots
To represent our data graphically, we can use the Matplotlib library, one of the most widely used plotting packages. Its documentation can be found at https://matplotlib.org. The gallery section of the Matplotlib site (at https://matplotlib.org/gallery.html) features a series of example charts with their code.
Matplotlib is installed with Anaconda, so we only have to import it.
# we import the Matplotlib library
>>> import matplotlib as mpl
>>> import matplotlib.pyplot as plt
>>> %matplotlib inline
# this last line of code allows us to view the charts directly using Jupyter
We can plot values by passing them directly to the function as a list. The following code produces the plot in Figure 10-1.
>>> plt.plot([5,7,2,4])
>>> plt.plot([5,7,2,4], [4,6,9,2], 'ro')
# 'ro' is a format string: 'r' means red and 'o' means circle markers
The plot of red circle markers is shown in Figure 10-2.
As Figure 10-1 shows, plt.plot() draws a line by default, but we can also customize the color and the type of representation by changing its arguments.
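As a minimal sketch of this kind of customization, the example below reuses the two lists from the code above and draws them three ways. The variable names are only illustrative. The format strings and keyword arguments are standard plt.plot() options.

```python
import matplotlib.pyplot as plt

x = [5, 7, 2, 4]
y = [4, 6, 9, 2]

# 'g--' is a format string: green ('g') dashed line ('--')
dashed_line, = plt.plot(x, y, "g--")

# 'bs' draws blue ('b') square markers ('s') with no connecting line
squares, = plt.plot(x, y, "bs")

# keyword arguments spell out the same kind of options explicitly
explicit, = plt.plot(x, y, color="red", marker="o", linestyle="None")
```

A format string is compact, while keyword arguments are easier to read when several properties are set at once.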
Now let’s see how to create pie charts: a pie chart can be used to show the composition of something (like a market). To plot a pie we can use the plt.pie() function:
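The following is a minimal sketch of a pie chart. The labels and market shares are invented for illustration and do not come from a real dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical market shares (percentages), invented for illustration
labels = ["Company A", "Company B", "Company C", "Others"]
shares = [40, 30, 20, 10]

# autopct prints the percentage of each slice on the chart
wedges, texts, autotexts = plt.pie(shares, labels=labels, autopct="%1.0f%%")
plt.axis("equal")  # equal axis scaling keeps the pie circular
plt.title("Market composition (illustrative data)")
```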
We can create yet other types of plots and charts. For example, we can build a scatterplot. A scatterplot is very useful to see the relationship between two variables.
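As a hedged illustration, the sketch below builds a scatterplot with plt.scatter() from two synthetic variables. The data are randomly generated so that y depends roughly linearly on x. They are an assumption made for the example only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y is a noisy linear function of x (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)

points = plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
```

With data like these, the points cluster around a rising diagonal, which is how a positive relationship between two variables appears in a scatterplot.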
We can create bar charts with the plt.bar() function. Bar charts and histograms are very useful to compare our data and also to display categorical variables:
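A minimal sketch of a bar chart follows. The category names and counts are hypothetical values chosen for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and counts, invented for illustration
categories = ["A", "B", "C", "D"]
counts = [12, 7, 15, 9]

# one bar per category, with its height given by the count
bars = plt.bar(categories, counts)
plt.ylabel("Count")
```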
We can create a chart from a data frame. To do this, we must import pandas for dataset management and NumPy to generate the data. Let’s generate a random set of ten cases and four variables.
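One possible way to build such a data frame is sketched below. The column names var1 to var4 are an assumption inferred from the names var3 and var4 used later in this chapter. The seed is fixed only to make the example reproducible.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the random values are reproducible

# ten cases (rows) and four variables (columns); column names are assumed
df1 = pd.DataFrame(np.random.randn(10, 4),
                   columns=["var1", "var2", "var3", "var4"])

# DataFrame.plot() draws one line per column by default
ax = df1.plot()
```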
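We can also select a column with an indexer such as .loc instead of bracket notation. The first argument of .loc selects rows and the second selects columns, so df1.loc[1] on its own would return the row labeled 1 rather than a column.
>>> df1.loc[:, "var3"].hist()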
We create box plots by using the boxplot() function. This kind of visualization can be used to show the shape of the distribution, its central value, and its variability:
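As a sketch, the code below draws one box per column of a randomly generated data frame. The data frame is recreated here so the example runs on its own. The column names var1 to var4 are assumed.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4),
                   columns=["var1", "var2", "var3", "var4"])

# each box shows the median, the quartiles, and the spread of one variable
ax = df1.boxplot()
```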
Each function that we use has its own parameters, which we can change, as we saw in the first section of this chapter. For instance, we can change the colors of the area chart by applying the palette we already created:
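The palette referred to above is defined elsewhere in the chapter and is not reproduced here, so the sketch below substitutes a hypothetical list of hex colors. It also uses non-negative random data, because pandas stacks area charts by default and stacking requires values that are all positive or all negative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# non-negative values so the stacked area chart is valid
df_pos = pd.DataFrame(rng.random((10, 4)),
                      columns=["var1", "var2", "var3", "var4"])

# hypothetical palette: one color per column (not the book's own palette)
palette = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a"]

ax = df_pos.plot(kind="area", color=palette)
```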
We can save our plots and charts with the plt.savefig() function, specifying the file name and the resolution (dots per inch):
>>> df1.plot(kind = "scatter", x = "var3", y = "var4")
# we save the image in the working directory in the following way
>>> plt.savefig('graph1.png', dpi = 600)
Let’s check whether the chart has been saved successfully to our working directory. We can use the saved image in a presentation, for example, include it in a report after the data analysis, or use it to explain our data in an exploratory phase.
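One way to run this check from Python rather than from a file browser is sketched below, using the standard os module. The plotted values are placeholders, and the file name graph1.png follows the example above.

```python
import os
import matplotlib.pyplot as plt

plt.plot([5, 7, 2, 4])                # placeholder chart
plt.savefig("graph1.png", dpi=600)    # written to the working directory

# confirm that the file exists and is not empty
saved = os.path.exists("graph1.png") and os.path.getsize("graph1.png") > 0
print(saved)
```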
Selecting Plot and Chart Styles
Matplotlib also includes a set of styles that can be applied to charts. We can view these styles by typing:
>>> plt.style.available
['bmh',
'classic',
'dark_background',
'fivethirtyeight',
'ggplot',
'grayscale',
'seaborn-bright',
'seaborn-colorblind',
'seaborn-dark-palette',
'seaborn-dark',
'seaborn-darkgrid',
'seaborn-deep',
'seaborn-muted',
'seaborn-notebook',
'seaborn-paper',
'seaborn-pastel',
'seaborn-poster',
'seaborn-talk',
'seaborn-ticks',
'seaborn-white',
'seaborn-whitegrid',
'seaborn']
To apply a style, we must insert a line of code that features the theme name:
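For instance, a minimal sketch using the 'ggplot' style from the list above could look like this. The plotted values are placeholders.

```python
import matplotlib.pyplot as plt

# apply the style to every chart created from this point on
plt.style.use("ggplot")
plt.plot([5, 7, 2, 4])

# alternatively, limit a style to a single block of code
with plt.style.context("grayscale"):
    plt.figure()
    plt.plot([5, 7, 2, 4])
```

The list printed by plt.style.available depends on the installed Matplotlib version. In recent releases the seaborn styles carry a 'seaborn-v0_8-' prefix, so it is worth checking the list before choosing a name.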
# we present the two datasets together and define whether we want color, transparency through the alpha parameter, and the number of intervals into which we want data to be divided.
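A sketch of what the comment above describes might look like the following: two overlapping histograms, each with its own color, a transparency set through alpha, and the number of intervals set through bins. The two datasets are synthetic and are generated only for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# two synthetic datasets with different means (illustration only)
rng = np.random.default_rng(0)
first = rng.normal(loc=0, scale=1, size=1000)
second = rng.normal(loc=2, scale=1, size=1000)

# alpha makes the bars semi-transparent so the overlap stays visible;
# bins sets the number of intervals the data are divided into
counts_a, edges_a, _ = plt.hist(first, bins=20, color="blue", alpha=0.5,
                                label="first")
counts_b, edges_b, _ = plt.hist(second, bins=20, color="green", alpha=0.5,
                                label="second")
plt.legend()
```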
Matplotlib is just one of many Python packages that can be used to display data. Other chart creation packages can be found at http://pbpython.com/visualization-tools-1.html. One of the most used charting packages in data mining, for example, is seaborn.
Summary
Matplotlib is one of the most basic libraries for plotting data. Plotting datasets for data analysis is crucial to understanding the relationships among variables.