Chapter 11

Visualizing the Data

IN THIS CHAPTER

Selecting the right graph for the job

Working with advanced scatterplots

Exploring time-related and geographical data

Creating graphs

Chapter 10 helped you understand the mechanics of working with MatPlotLib, which is an important first step toward using it. This chapter takes the next step in helping you use MatPlotLib to perform useful work. The main goal of this chapter is to help you visualize your data in various ways. Creating a graphic presentation of your data is essential if you want to help other people understand what you’re trying to say. Even though you can see what the numbers mean in your mind, other people will likely need graphics to see what point you’re trying to make by manipulating data in various ways.

The chapter starts by looking at some basic graph types that MatPlotLib supports. You don’t find the full list of graphs and plots listed in this chapter — it could take an entire book to explore them all in detail. However, you do find the most common types.

In the remainder of the chapter, you begin exploring specific sorts of plotting as it relates to data science. Of course, no book on data science would be complete without exploring scatterplots, which are used to help people see patterns in seemingly unrelated data points. Because much of the data that you work with today is time related or geographic in nature, the chapter devotes two special sections to these topics. You also get to work with both directed and undirected graphs, which are useful for social media analysis.

You don’t have to type the source code for this chapter manually. In fact, it’s a lot easier if you use the downloadable source. The source code for this chapter appears in the P4DS4D2_11_Visualizing_the_Data.ipynb source code file (see the Introduction for details on how to find that source file).

Choosing the Right Graph

The kind of graph you choose determines how people view the associated data, so choosing the right graph from the outset is important. For example, if you want to show how various data elements contribute toward a whole, you really need to use a pie chart. On the other hand, when you want people to form opinions on how data elements compare, you use a bar chart. The idea is to choose a graph that naturally leads people to draw the conclusion that you need them to draw about the data that you’ve carefully massaged from various data sources. (You also have the option of using line graphs — a technique demonstrated in Chapter 10.) The following sections describe the various graph types and provide you with basic examples of how to use them.

Showing parts of a whole with pie charts

Pie charts focus on showing parts of a whole. The entire pie would be 100 percent. The question is how much of that percentage each value occupies. The following example shows how to create a pie chart with many of the special features in place:

import matplotlib.pyplot as plt

%matplotlib inline

values = [5, 8, 9, 10, 4, 7]

colors = ['b', 'g', 'r', 'c', 'm', 'y']

labels = ['A', 'B', 'C', 'D', 'E', 'F']

explode = (0, 0.2, 0, 0, 0, 0)

plt.pie(values, colors=colors, labels=labels,

explode=explode, autopct='%1.1f%%',

counterclock=False, shadow=True)

plt.title('Values')

plt.show()

The essential part of a pie chart is the values. You could create a basic pie chart using just the values as input.

The colors parameter lets you choose custom colors for each pie wedge. You use the labels parameter to identify each wedge. In many cases, you need to make one wedge stand out from the others, so you add the explode parameter with a list of explode values, one per wedge. A value of 0 keeps the wedge in place; any other value moves the wedge out from the center of the pie.

Each pie wedge can show various kinds of information. This example shows the percentage occupied by each wedge with the autopct parameter. You must provide a format string to format the percentages.
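To see what the format string does, the following sketch reproduces the wedge labels by hand without drawing anything. It assumes that each wedge's percentage is its value divided by the sum of all values, which is how pie() computes it; the name pct_labels is used only for illustration.

```python
# Reproduce the autopct labels without drawing a chart.
values = [5, 8, 9, 10, 4, 7]
total = sum(values)

# '%1.1f%%' means: one digit after the decimal point,
# followed by a literal percent sign.
pct_labels = ['%1.1f%%' % (100.0 * v / total) for v in values]

print(pct_labels)
```

Changing the format string to '%d%%', for example, would show whole-number percentages instead.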

Some parameters affect how the pie chart is drawn. Use the counterclock parameter to determine the direction of the wedges. The shadow parameter determines whether the pie appears with a shadow beneath it (for a 3-D effect). You can find other parameters at https://matplotlib.org/api/pyplot_api.html.

In most cases, you also want to give your pie chart a title so that others know what it represents. You do this using the title() function. Figure 11-1 shows the output from this example.

FIGURE 11-1: Pie charts show a percentage of the whole.

Creating comparisons with bar charts

Bar charts make comparing values easy. The wide bars and segregated measurements emphasize the differences between values, rather than the flow of one value to another as a line graph would do. Fortunately, you have all sorts of methods at your disposal for emphasizing specific values and performing other tricks. The following example shows just some of the things you can do with a vertical bar chart.

import matplotlib.pyplot as plt

%matplotlib inline

values = [5, 8, 9, 10, 4, 7]

widths = [0.7, 0.8, 0.7, 0.7, 0.7, 0.7]

colors = ['b', 'r', 'b', 'b', 'b', 'b']

plt.bar(range(0, 6), values, width=widths,

color=colors, align='center')

plt.show()

To create even a basic bar chart, you must provide a series of x coordinates and the heights of the bars. The example uses the range() function to create the x coordinates, and values contains the heights.

Of course, you may want more than a basic bar chart, and MatPlotLib provides a number of ways to get the job done. In this case, the example uses the width parameter to control the width of each bar, emphasizing the second bar by making it slightly larger. The larger width would show up even in a black-and-white printout. It also uses the color parameter to change the color of the target bar to red (the rest are blue).

As with other chart types, the bar chart provides some special features that you can use to make your presentation stand out. The example uses the align parameter to center the data on the x coordinate (centering is the default in MatPlotLib 2.0 and later; older versions aligned bars to the left edge). You can also use other parameters, such as hatch, to enhance the visual appearance of your bar chart. Figure 11-2 shows the output of this example.

FIGURE 11-2: Bar charts make it easier to perform comparisons.

This chapter helps you get started using MatPlotLib to create a variety of chart and graph types. Of course, more examples are better, so you can also find some more advanced examples on the MatPlotLib site at https://matplotlib.org/1.2.1/examples/index.html. Some of the examples, such as those that demonstrate animation techniques, become quite advanced, but with practice you can use any of them to improve your own charts and graphs.

Showing distributions using histograms

Histograms categorize data by breaking it into bins, where each bin contains a subset of the data range. A histogram then displays the number of items in each bin so that you can see the distribution of data and the progression of data from bin to bin. In most cases, you see a curve of some type, such as a bell curve. The following example shows how to create a histogram with randomized data:

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

x = 20 * np.random.randn(10000)

plt.hist(x, 25, range=(-50, 50), histtype='stepfilled',

align='mid', color='g', label='Test Data')

plt.legend()

plt.title('Step Filled Histogram')

plt.show()

In this case, the input values are a series of random numbers. The distribution of these numbers should show a type of bell curve. As a minimum, you must provide a series of values, x in this case, to plot. The second argument contains the number of bins to use when creating the data intervals. The default value is 10. Using the range parameter helps you focus the histogram on the relevant data and exclude any outliers.

You can create multiple histogram types. The default setting creates a bar chart. You can also create a stacked bar chart, stepped graph, or filled stepped graph (the type shown in the example). In addition, it’s possible to control the orientation of the output, with vertical as the default.

As with most other charts and graphs in this chapter, you can add special features to the output. For example, the align parameter determines the alignment of each bar along the baseline. Use the color parameter to control the colors of the bars. The label parameter doesn’t actually appear unless you also create a legend (as shown in this example). Figure 11-3 shows typical output from this example.

FIGURE 11-3: Histograms let you see distributions of numbers.

Random data varies call by call. Every time you run the example, you see slightly different results because the random-generation process differs.
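If you want to inspect the bins as numbers, or get the same data on every run, a sketch along these lines may help. It uses NumPy's histogram() function with the same bin count and range as the hist() call; the seed value of 0 is an arbitrary choice made for this illustration.

```python
import numpy as np

# Fixing the seed makes the "random" data repeatable
# (the seed value 0 is arbitrary).
np.random.seed(0)
x = 20 * np.random.randn(10000)

# Same binning as the hist() call: 25 bins spanning -50 to 50.
counts, edges = np.histogram(x, bins=25, range=(-50, 50))

print(len(counts), 'bins, each', edges[1] - edges[0], 'units wide')
print(counts.sum(), 'of', len(x), 'values fall inside the range')
```

Values outside the range are simply not counted, which is how the range parameter excludes outliers from the histogram.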

Depicting groups using boxplots

Boxplots provide a means of depicting groups of numbers through their quartiles (three points dividing a group into four equal parts). A boxplot may also have lines, called whiskers, indicating data outside the upper and lower quartiles. The spacing shown within a boxplot helps indicate the skew and dispersion of the data. The following example shows how to create a boxplot with randomized data.

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

spread = 100 * np.random.rand(100)

center = np.ones(50) * 50

flier_high = 100 * np.random.rand(10) + 100

flier_low = -100 * np.random.rand(10)

data = np.concatenate((spread, center,

flier_high, flier_low))

plt.boxplot(data, sym='gx', widths=.75, notch=True)

plt.show()

To create a usable dataset, you need to combine several different number-generation techniques, as shown at the beginning of the example. Here is how these techniques work:

spread: Contains a set of 100 random numbers between 0 and 100
center: Provides 50 values of exactly 50, the center of the range
flier_high: Simulates outliers between 100 and 200
flier_low: Simulates outliers between 0 and –100

The code combines all these values into a single dataset using concatenate(). Because the data is randomly generated with specific characteristics (such as a large number of points in the middle), the output shows those characteristics on every run, which works fine for the example.

The call to boxplot() requires only data as input. All other parameters have default settings. In this case, the code sets the presentation of outliers to green Xs by setting the sym parameter. You use widths to modify the size of the box (made extra large in this case to make the box easier to see). Finally, you can create a square box or a box with a notch using the notch parameter (which normally defaults to False). Figure 11-4 shows typical output from this example.

FIGURE 11-4: Use boxplots to present groups of numbers.

The box spans the lower and upper quartiles, with the red line in the middle marking the median. The two black horizontal lines connected to the box by whiskers show the upper and lower limits of the data that isn’t treated as an outlier. The outliers appear above and below the upper and lower limit lines as green Xs.
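The numbers behind the box can be computed directly. The sketch below is an illustration only: it replaces the random data with evenly spaced stand-ins so that the result is repeatable, and it assumes the default whisker setting, in which points farther than 1.5 times the interquartile range from the box are drawn as outliers.

```python
import numpy as np

# Evenly spaced stand-ins for the random data, so the numbers are repeatable.
spread = np.linspace(0, 100, 100)
center = np.ones(50) * 50
flier_high = np.linspace(100, 200, 10)
flier_low = np.linspace(-100, 0, 10)
data = np.concatenate((spread, center, flier_high, flier_low))

# The box runs from the first to the third quartile;
# the line inside the box is the median.
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Points beyond 1.5 times the interquartile range from the box
# are treated as outliers under the default setting.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3)
print(len(outliers), 'points would be drawn as outliers')
```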

Seeing data patterns using scatterplots

Scatterplots show clusters of data rather than trends (as with line graphs) or discrete values (as with bar charts). The purpose of a scatterplot is to help you see data patterns. The following example shows how to create a scatterplot using randomized data:

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

x1 = 5 * np.random.rand(40)

x2 = 5 * np.random.rand(40) + 25

x3 = 25 * np.random.rand(20)

x = np.concatenate((x1, x2, x3))

y1 = 5 * np.random.rand(40)

y2 = 5 * np.random.rand(40) + 25

y3 = 25 * np.random.rand(20)

y = np.concatenate((y1, y2, y3))

plt.scatter(x, y, s=[100], marker='^', c='m')

plt.show()

The example begins by generating random x and y coordinates. For each x coordinate, you must have a corresponding y coordinate. It’s possible to create a scatterplot using just the x and y coordinates.

It’s possible to dress up a scatterplot in a number of ways. In this case, the s parameter determines the size of each data point. The marker parameter determines the data point shape. You use the c parameter to define the colors for all the data points, or you can define a separate color for individual data points. Figure 11-5 shows the output from this example.

FIGURE 11-5: Use scatterplots to show groups of data points and their associated patterns.

Creating Advanced Scatterplots

Scatterplots are especially important for data science because they can show data patterns that aren’t obvious when viewed in other ways. You can see data groupings with relative ease and help the viewer understand when data belongs to a particular group. You can also show overlaps between groups and even demonstrate when certain data is outside the expected range. Showing these various kinds of relationships in the data is an advanced technique that you need to know in order to make the best use of MatPlotLib. The following sections demonstrate how to perform these advanced techniques on the scatterplot you created earlier in the chapter.

Depicting groups

Color is the third axis when working with a scatterplot. Using color lets you highlight groups so that others can see them with greater ease. The following example shows how you can use color to show groups within a scatterplot:

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

x1 = 5 * np.random.rand(50)

x2 = 5 * np.random.rand(50) + 25

x3 = 30 * np.random.rand(25)

x = np.concatenate((x1, x2, x3))

y1 = 5 * np.random.rand(50)

y2 = 5 * np.random.rand(50) + 25

y3 = 30 * np.random.rand(25)

y = np.concatenate((y1, y2, y3))

color_array = ['b'] * 50 + ['g'] * 50 + ['r'] * 25

plt.scatter(x, y, s=[50], marker='D', c=color_array)

plt.show()

The example works essentially the same as the scatterplot example in the previous section, except that this example uses an array for the colors. Unfortunately, if you’re seeing this in the printed book, the differences between the shades of gray in Figure 11-6 will be hard to see. However, the first group is blue, followed by green for the second group. Any outliers appear in red.

FIGURE 11-6: Color arrays can make the scatterplot groups stand out better.

Showing correlations

In some cases, you need to know the general direction that your data is taking when looking at a scatterplot. Even if you create a clear depiction of the groups, the actual direction that the data is taking as a whole may not be clear. In this case, you add a trendline to the output. Here’s an example of adding a trendline to a scatterplot that includes groups but isn’t quite as clear as the scatterplot shown previously in Figure 11-6.

import numpy as np

import matplotlib.pyplot as plt

import matplotlib.pylab as plb

%matplotlib inline

x1 = 15 * np.random.rand(50)

x2 = 15 * np.random.rand(50) + 15

x3 = 30 * np.random.rand(25)

x = np.concatenate((x1, x2, x3))

y1 = 15 * np.random.rand(50)

y2 = 15 * np.random.rand(50) + 15

y3 = 30 * np.random.rand(25)

y = np.concatenate((y1, y2, y3))

color_array = ['b'] * 50 + ['g'] * 50 + ['r'] * 25

plt.scatter(x, y, s=[90], marker='*', c=color_array)

z = np.polyfit(x, y, 1)

p = np.poly1d(z)

plb.plot(x, p(x), 'm-')

plt.show()

The code for creating the scatterplot is essentially the same as in the example in the “Depicting groups” section, earlier in the chapter, but the plot doesn’t define the groups as clearly. Adding a trendline means calling the NumPy polyfit() function with the data, which returns a vector of coefficients, z, that minimizes the least-squares error. (Least-squares regression is a method for finding a line that summarizes the relationship between two variables, x and y in this case, at least within the domain of the explanatory variable x. The third polyfit() parameter expresses the degree of the polynomial fit; a value of 1 produces a straight line.)

The vector output of polyfit() is used as input to poly1d(), which calculates the actual y axis data points. The call to plot() creates the trendline on the scatterplot. You can see a typical result of this example in Figure 11-7.
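A quick way to convince yourself of what polyfit() and poly1d() return is to feed them points whose answer you already know. The data in this sketch is made up so that every point lies exactly on the line y = 2x + 1.

```python
import numpy as np

# Points that lie exactly on the line y = 2x + 1 (made-up data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

z = np.polyfit(x, y, 1)   # coefficients, highest power first: [slope, intercept]
p = np.poly1d(z)          # callable polynomial built from those coefficients

print(z)
print(p(5.0))             # y value the trendline gives at x = 5
```

Because the coefficients come back highest power first, z[0] is the slope and z[1] is the intercept of the trendline.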

FIGURE 11-7: Scatterplot trendlines can show you the general data direction.

Plotting Time Series

Nothing is truly static. When you view most data, you see an instant of time — a snapshot of how the data appeared at one particular moment. Of course, such views are both common and useful. However, sometimes you need to view data as it moves through time — to see it as it changes. Only by viewing the data as it changes can you expect to understand the underlying forces that shape it. The following sections describe how to work with data on a time-related basis.

Representing time on axes

Many times, you need to present data over time. The data could come in many forms, but generally you have some type of time tick (one unit of time), followed by one or more features that describe what happens during that particular tick. The following example shows a simple set of days and sales on those days for a particular item in whole (integer) amounts.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import datetime as dt

%matplotlib inline

start_date = dt.datetime(2018, 7, 30)

end_date = dt.datetime(2018, 8, 5)

daterange = pd.date_range(start_date, end_date)

sales = (np.random.rand(len(daterange)) * 50).astype(int)

df = pd.DataFrame(sales, index=daterange,

columns=['Sales'])

df.loc['Jul 30 2018':'Aug 05 2018'].plot()

plt.ylim(0, 50)

plt.xlabel('Sales Date')

plt.ylabel('Sale Value')

plt.title('Plotting Time')

plt.show()

The example begins by creating a DataFrame to hold the information. The source of the information could be anything, but the example generates it randomly. Notice that the example uses date_range() to create one entry for each day between the starting and ending dates, and then uses that range as the DataFrame index.

An essential part of this example is the index. Each row is indexed by an actual time value (a timestamp rather than a string) so that you don’t lose information, and pandas uses these values to label the time axis in the plot.

Using loc[] lets you select a range of dates from the total number of entries available. In this example the selected range happens to cover all the generated data, but you could narrow it to show only part of the data. The example then adds some amplifying information about the plot and displays it onscreen. Because the dates are the index and Sales is the only column, the call to plot() needs no arguments. Figure 11-8 shows typical output from the randomly generated data.
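The following sketch shows date-based selection with loc[] on a small, fixed dataset. The sales figures are invented for illustration, and the dates are written in ISO form (year-month-day) rather than the form used in the example.

```python
import pandas as pd

# One week of made-up sales figures, indexed by date.
daterange = pd.date_range('2018-07-30', '2018-08-05')
df = pd.DataFrame({'Sales': [12, 30, 7, 45, 21, 38, 16]},
                  index=daterange)

# Both endpoints of a label-based slice are included in the result.
midweek = df.loc['2018-07-31':'2018-08-02']
print(midweek)
```

Unlike ordinary Python slices, a label-based slice includes its end point, so this selection contains three days.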

FIGURE 11-8: Use line graphs to show the flow of data over time.

Plotting trends over time

As with any other data presentation, sometimes you really can’t see what direction the data is headed in without help. The following example starts with the plot from the previous section and adds a trendline to it:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import datetime as dt

%matplotlib inline

start_date = dt.datetime(2018, 7, 29)

end_date = dt.datetime(2018, 8, 7)

daterange = pd.date_range(start_date, end_date)

sales = (np.random.rand(len(daterange)) * 50).astype(int)

df = pd.DataFrame(sales, index=daterange,

columns=['Sales'])

lr_coef = np.polyfit(range(0, len(df)), df['Sales'], 1)

lr_func = np.poly1d(lr_coef)

trend = lr_func(range(0, len(df)))

df['trend'] = trend

df.loc['Jul 30 2018':'Aug 05 2018'].plot()

plt.xlabel('Sales Date')

plt.ylabel('Sale Value')

plt.title('Plotting Time')

plt.legend(['Sales', 'Trend'])

plt.show()

Remember: The “Showing correlations” section, earlier in this chapter, shows how most people add a trendline to their graph. In fact, this is the approach that you often see used online. You’ll also notice that a lot of people have trouble using this approach in some situations. This example takes a slightly different approach by adding the trendline directly to the DataFrame. If you print df after the call to df['trend'] = trend, you see trendline data similar to the values shown here:

Sales trend

2018-07-29 6 18.890909

2018-07-30 13 20.715152

2018-07-31 38 22.539394

2018-08-01 22 24.363636

2018-08-02 40 26.187879

2018-08-03 39 28.012121

2018-08-04 36 29.836364

2018-08-05 21 31.660606

2018-08-06 7 33.484848

2018-08-07 49 35.309091

Using this approach ultimately makes it easier to plot the data. You call plot() only once and avoid relying on the MatPlotLib pylab module, as the example in the “Showing correlations” section does. The resulting code is simpler and less likely to cause the issues you see online.

When you plot the DataFrame, the call to plot() automatically generates a legend from the column names (Sales and trend in this case). The call to legend() replaces those entries with the labels you supply. Figure 11-9 shows typical output from this example using randomly generated data.

FIGURE 11-9: Add a trendline to show the average direction of change in a chart or graph.

Plotting Geographical Data

Knowing where data comes from or how it applies to a specific place can be important. For example, if you want to know where food shortages have occurred and plan how to deal with them, you need to match the data you have to geographical locations. The same holds true for predicting where future sales will occur. You may find that you need to use existing data to determine where to put new stores. Otherwise, you could put a store in a location that won’t receive much in the way of sales, and the effort will lose money rather than make it. The following sections describe how to work with Basemap to interact with geographical data.

You must shut the Notebook environment down before you make any changes to it or else conda will complain that some files are in use. To shut the Notebook environment down, close and halt the kernel for any Notebook files you have open and then press Ctrl+C in the Notebook terminal window. Wait a few seconds before you attempt to do anything to give the files time to close properly.

Using an environment in Notebook

Some of the packages you install have a tendency to also change your Notebook environment by installing other packages that may not work well with your baseline setup. Consequently, you see problems with code that functioned earlier. Normally, these problems consist mostly of warning messages, such as deprecation warnings as discussed in the “Dealing with deprecated library issues” section, later in this chapter. In some cases, however, the changed packages can also tweak the output you obtain from code. Perhaps a newer package uses an updated algorithm or interacts with the code differently. When you have a package, such as Basemap, that makes changes to the overall baseline configuration and you want to maintain your current configuration, you need to set up an environment for it. An environment keeps your baseline configuration intact but also allows the new package to create the environment it needs to execute properly. The following steps help you create the Basemap environment used for this chapter:

1. Open an Anaconda Prompt.

Notice that the prompt shows the location of your folder on your system, but that it’s preceded by (base). The (base) indicator tells you that you’re in your baseline environment, the one you want to preserve.

2. Type conda create -n Basemap python=3 anaconda=5.2.0 and press Enter.

This action creates a new Basemap environment. This new environment will use Python 3.6 and Anaconda 5.2.0. You get precisely the same baseline as you’ve been using so far.

3. Type source activate Basemap if you’re using OS X or Linux, or activate Basemap if you’re using Windows, and press Enter.

You have now changed over to the Basemap environment. Notice that the prompt no longer says (base); it says (Basemap) instead.

4. Follow the instructions in the “Getting the Basemap toolkit” section to install your copy of Basemap.

5. Type jupyter notebook and press Enter.

You see Notebook start, but it uses the Basemap environment, rather than the baseline environment. This copy of Notebook works precisely the same as any other copy of Notebook that you’ve used. The only difference is the environment in which it operates.

This same technique works for any special package that you want to install. You should reserve it for packages that you don’t intend to use every day. For example, this book uses Basemap for just one example, so it’s appropriate to create an environment for it.

After you have finished using the Basemap environment, type deactivate at the prompt and press Enter. You see the prompt change back to (base).
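If you are ever unsure which environment a Notebook or script is running in, you can ask Python itself. This sketch relies on the CONDA_DEFAULT_ENV environment variable, which conda sets when an environment is active; the fallback text is just a placeholder chosen for this illustration.

```python
import os
import sys

# conda sets CONDA_DEFAULT_ENV when an environment is active;
# the fallback text is only for display when it is not set.
env_name = os.environ.get('CONDA_DEFAULT_ENV',
                          '(no conda environment detected)')

print('Environment:', env_name)
print('Python lives in:', sys.prefix)
```

When run from the environment created in the preceding steps, the first line of output should name Basemap rather than base.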

Getting the Basemap toolkit

Before you can work with mapping data, you need a library that supports the required mapping functionality. A number of such packages are available, but the easiest to work with and install is the Basemap Toolkit. You can obtain this toolkit from https://matplotlib.org/basemap/users/intro.html. (Make sure you close Notebook and stop the server before you proceed in this section to avoid file access errors.) However, the easiest method is to use the conda tool from the Anaconda Prompt to enter the following commands:

conda install -c conda-forge basemap=1.1.0

conda install -c conda-forge basemap-data-hires

conda install -c conda-forge proj4=5.2.0

The site does include supplementary information about the toolkit, so you may want to visit it anyway. Unlike some other packages, this one does include instructions for Mac, Windows, and Linux users. In addition, you can obtain a Windows-specific installer. Make sure to also check out the usage video at http://nbviewer.ipython.org/github/mqlaql/geospatial-data/blob/master/Geospatial-Data-with-Python.ipynb.

You need the following code to use the toolkit once you have it installed:

import numpy as np

import matplotlib.pyplot as plt

from mpl_toolkits.basemap import Basemap

%matplotlib inline

Dealing with deprecated library issues

One of the major advantages of working with Python is the huge number of packages that it supports. Unfortunately, not every package receives updates quickly enough to avoid using deprecated features in other packages. A deprecated feature is one that still exists in the target package, but the developers of that package plan to remove it in an upcoming update. Consequently, you receive a deprecated package warning when you run your code. Even though the deprecation warning doesn’t keep your code from running, it does tend to make people leery of your application. After all, no one wants to see what appears to be an error message as part of the output. The fact that Notebook displays these messages in light red by default doesn’t help matters.

Unfortunately, your copy of the Basemap toolkit may produce a deprecated feature warning message, so this section tells you how to overcome this issue. You can discover more about the potential issues you see at https://github.com/matplotlib/basemap/issues/382. These messages look something like this:

C:\Users\Luca\Anaconda3\lib\site-packages\mpl_toolkits

\basemap\__init__.py:1708: MatplotlibDeprecationWarning:

The axesPatch function was deprecated in version 2.1.

Use Axes.patch instead.

limb = ax.axesPatch

C:\Users\Luca\Anaconda3\lib\site-packages\mpl_toolkits

\basemap\__init__.py:1711: MatplotlibDeprecationWarning:

The axesPatch function was deprecated in version 2.1. Use

Axes.patch instead.

if limb is not ax.axesPatch:

That looks like a lot of really terrifying text, but these messages point out two things. The first is that the problem is in MatPlotLib and revolves around the axesPatch call. The second is that this particular call is deprecated starting with version 2.1. Use this code to check your version of MatPlotLib:

import matplotlib

print(matplotlib.__version__)

If you installed Anaconda using the instructions in Chapter 3, you see that you have MatPlotLib 2.2.2 as a minimum. Consequently, one way to deal with this problem is to downgrade your copy of MatPlotLib by using the following command at the Anaconda Prompt:

conda install -c conda-forge matplotlib=2.0.2

The problem with this approach is that it can also cause problems for any code that uses the newer features found in MatPlotLib 2.2.2. It’s not optimal, but if you use Basemap in your application a lot, it might be a practical solution.
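Before deciding whether to downgrade, you can compare a version string against the 2.1 threshold named in the warning. The helper functions in this sketch are hypothetical (they are not part of MatPlotLib), and the check looks only at the version number; it doesn’t examine the installed package.

```python
import re


def version_tuple(version):
    """Turn '2.2.2' into (2, 2, 2); ignores suffixes such as 'rc1'."""
    numbers = []
    for piece in version.split('.')[:3]:
        match = re.match(r'\d+', piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def at_or_past_deprecation(version):
    """True if the version is 2.1 or later, per the warning text."""
    return version_tuple(version) >= (2, 1)


for candidate in ('2.0.2', '2.2.2'):
    print(candidate, at_or_past_deprecation(candidate))
```

In a Notebook, you could pass matplotlib.__version__ to at_or_past_deprecation() instead of the sample strings.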

A better solution is to simply admit that the problem exists by documenting it as part of your code. Documenting the problem and its specific cause makes it easier to check for the problem later after a package update. To do this, you add the two lines of code shown here:

import warnings

warnings.filterwarnings("ignore")

Remember: The call to filterwarnings() performs the specified action, which is "ignore" in this case. To cancel the effects of filtering the warnings, you call resetwarnings(). Notice that the call as shown supplies no other arguments, so it hides every warning, not just the ones that come from Basemap. You can define a narrower filter by using the module argument (set to match the source of the problems in the warning messages) or the category argument (set to the kind of warning you want to hide).
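A narrower filter might look like the following sketch. It uses Python’s built-in DeprecationWarning as a stand-in for the category; the warning class that MatPlotLib raises is its own type, so treat the category here as an assumption made for illustration. The catch_warnings() context manager records the warnings so that you can see which ones get through, and it restores the original filters on exit.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')      # start by recording everything
    # Narrow filter: hide only deprecation warnings.
    warnings.filterwarnings('ignore', category=DeprecationWarning)

    warnings.warn('old feature', DeprecationWarning)   # hidden
    warnings.warn('something else', UserWarning)       # still reported

print([w.category.__name__ for w in caught])
```

Only the second warning is recorded, which shows that unrelated warnings remain visible when the filter names a category.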

Using Basemap to plot geographic data

Now that you have a good installation of Basemap, you can do something with it. The following example shows how to draw a map and place pointers to specific locations on it:

austin = (-97.75, 30.25)

hawaii = (-157.8, 21.3)

washington = (-77.01, 38.90)

chicago = (-87.68, 41.83)

losangeles = (-118.25, 34.05)

m = Basemap(projection='merc',llcrnrlat=10,urcrnrlat=50,

llcrnrlon=-160,urcrnrlon=-60)

m.drawcoastlines()

m.fillcontinents(color='lightgray',lake_color='lightblue')

m.drawparallels(np.arange(-90.,91.,30.))

m.drawmeridians(np.arange(-180.,181.,60.))

m.drawmapboundary(fill_color='aqua')

m.drawcountries()

x, y = m(*zip(*[hawaii, austin, washington,

chicago, losangeles]))

m.plot(x, y, marker='o', markersize=6,

markerfacecolor='red', linewidth=0)

plt.title("Mercator Projection")

plt.show()

The example begins by defining the longitude and latitude for various cities. It then creates the basic map. The projection parameter defines the basic map appearance. The next four parameters, llcrnrlat, urcrnrlat, llcrnrlon, and urcrnrlon define the sides of the map. You can define other parameters, but these parameters generally create a useful map.

The next set of calls defines the map particulars. For example, drawcoastlines() determines whether the coastlines are highlighted to make them easy to see. To make landmasses easy to discern from water, you want to call fillcontinents() with the colors of your choice. When working with specific locations, as the example does, you want to call drawcountries() to ensure that the country boundaries appear on the map. At this point, you have a map that’s ready to fill in with data.

In this case, the example converts the previously stored longitude and latitude values to x and y map coordinates by calling the map object, m. It then plots these locations on the map in a contrasting color so that you can easily see them. The final step is to display the map, as shown in Figure 11-10.
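The m(*zip(*[...])) expression in the example is compact, so it can help to look at the inner zip() step on its own. This sketch uses plain Python and three of the example's cities. It shows how a list of (longitude, latitude) pairs becomes one sequence of longitudes and one of latitudes.

```python
hawaii = (-157.8, 21.3)
austin = (-97.75, 30.25)
washington = (-77.01, 38.90)

# zip(*pairs) transposes a list of (lon, lat) pairs into
# one tuple of longitudes and one tuple of latitudes.
lons, lats = zip(*[hawaii, austin, washington])

print(lons)
print(lats)
```

In the example, the outer * then passes these two tuples to m() as separate arguments, so the call is equivalent to m(lons, lats).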

FIGURE 11-10: Maps can illustrate data in ways other graphics can’t.

Visualizing Graphs

A graph is a depiction of data showing the connections between data points using lines. The purpose is to show that some data points relate to other data points, but not necessarily to all the data points that appear on the graph. Think about a map of a subway system. Each of the stations connects to other stations, but no single station connects to all the stations in the subway system. Graphs are a popular data science topic because of their use in social media analysis. When performing social media analysis, you depict and analyze networks of relationships, such as friends or business connections, from social hubs such as Facebook, Google+, Twitter, or LinkedIn.

The two common depictions of graphs are undirected, where the graph simply shows lines between data elements, and directed, where arrows added to the lines show that data flows in a particular direction. For example, consider a depiction of a water system. The water would flow in just one direction in most cases, so you could use a directed graph to depict not only the connections between sources and targets for the water but also the direction of flow, shown with arrows. The following sections help you understand the two types of graphs better and show you how to create them.
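The difference between the two depictions can be sketched without any graph library. The following plain-Python example is only an illustration (the node names and the connect() helper are hypothetical). An undirected connection is recorded in both directions, while a directed one is recorded only from source to target.

```python
from collections import defaultdict

def connect(adjacency, source, target, directed=False):
    """Record a connection; undirected edges are stored both ways."""
    adjacency[source].add(target)
    if not directed:
        adjacency[target].add(source)

# Undirected: a subway line joins two stations in both directions.
subway = defaultdict(set)
connect(subway, "Central", "Harbor")

# Directed: water flows from the reservoir to the town, not back.
water = defaultdict(set)
connect(water, "reservoir", "town", directed=True)

print("Central" in subway["Harbor"])    # can you travel back to Central?
print("reservoir" in water["town"])     # does water flow back to the reservoir?
```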

Developing undirected graphs

As previously stated, an undirected graph simply shows connections between nodes. The output doesn’t provide a direction from one node to the next. For example, when establishing connectivity between web pages, no direction is implied. The following example shows how to create an undirected graph:

import networkx as nx

import matplotlib.pyplot as plt

%matplotlib inline

G = nx.Graph()

H = nx.Graph()

G.add_node(1)

G.add_nodes_from([2, 3])

G.add_nodes_from(range(4, 7))

H.add_node(7)

G.add_nodes_from(H)

G.add_edge(1, 2)

G.add_edge(1, 1)

G.add_edges_from([(2,3), (3,6), (4,6), (5,6)])

H.add_edges_from([(4,7), (5,7), (6,7)])

G.add_edges_from(H.edges())

nx.draw_networkx(G)

plt.show()

In contrast to the canned example found in the “Using NetworkX basics” section of Chapter 8, this example builds the graph using a number of different techniques. It begins by importing the NetworkX package you use in Chapter 8. To create a new undirected graph, the code calls the Graph() constructor, which can take a number of input arguments to use as attributes. However, you can build a perfectly usable graph without using attributes, which is what this example does.

The easiest way to add a node is to call add_node() with a node number. You can also add a list, dictionary, or range() of nodes using add_nodes_from(). In fact, you can import nodes from other graphs if you want.

Even though the nodes used in the example rely on numbers, you don’t have to use numbers for your nodes. A node can use a single letter, a string, or even a date. Nodes do have some restrictions. For example, a node must be a hashable object, so you can’t use a list as a node, and you can’t use None.
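As a sketch of what can serve as a node, assuming NetworkX is installed: any hashable Python object works, which is why strings and dates are fine while a list is not. The node values below are made up for illustration.

```python
import datetime
import networkx as nx

G = nx.Graph()
G.add_node("A")                          # a single letter
G.add_node("Chicago")                    # a string
G.add_node(datetime.date(2024, 1, 15))   # a date

# A list is mutable and therefore unhashable, so it can't be a node.
try:
    G.add_node([1, 2])
    list_accepted = True
except TypeError:
    list_accepted = False

# True compares equal to 1 in Python, so the two collapse into one node.
G.add_node(1)
G.add_node(True)

print(G.number_of_nodes(), list_accepted)
```

Because of that last behavior, Boolean values make poor node identifiers: they silently merge with the integers 0 and 1.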

Nodes don’t have any connectivity at the outset. You must define connections (edges) between them. To add a single edge, you call add_edge() with the numbers of the two nodes that you want to connect. As with nodes, you can use add_edges_from() to create more than one edge using a list, dictionary, or another graph as input. Figure 11-11 shows the output from this example (your output may differ slightly but should have the same connections).
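To confirm what the example built, you can inspect the graph instead of relying on the drawing alone. This sketch repeats the same construction steps and then queries the result. It assumes NetworkX is installed and uses only the graph's standard reporting methods.

```python
import networkx as nx

G = nx.Graph()
H = nx.Graph()
G.add_node(1)
G.add_nodes_from([2, 3])
G.add_nodes_from(range(4, 7))
H.add_node(7)
G.add_nodes_from(H)              # import the nodes of another graph
G.add_edge(1, 2)
G.add_edge(1, 1)                 # a self-loop on node 1
G.add_edges_from([(2, 3), (3, 6), (4, 6), (5, 6)])
H.add_edges_from([(4, 7), (5, 7), (6, 7)])
G.add_edges_from(H.edges())      # import the edges of another graph

print(sorted(G.nodes()))         # every node in the graph
print(G.number_of_edges())       # how many edges were created
print(sorted(G.neighbors(6)))    # the nodes connected to node 6
print(G.has_edge(2, 1))          # undirected: (1, 2) also means (2, 1)
```

Checks like these stay the same from run to run, whereas the drawn layout can change each time you plot the graph.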

FIGURE 11-11: Undirected graphs connect nodes together to form patterns.

Developing directed graphs

You use directed graphs when you need to show a direction, say from a start point to an end point. When you get a map that shows you how to get from one specific point to another, the starting node and ending node are marked as such, and the lines between these nodes (and all the intermediate nodes) show direction.

Your graphs need not be boring. You can dress them up in all sorts of ways so that the viewer gains additional information in different ways. For example, you can create custom labels, use specific colors for certain nodes, or rely on color to help people see the meaning behind your graphs. You can also change edge line weight and use other techniques to mark a specific path between nodes as the better one to choose. The following example shows many (but not nearly all) of the ways in which you can dress up a directed graph and make it more interesting:

import networkx as nx

import matplotlib.pyplot as plt

%matplotlib inline

G = nx.DiGraph()

G.add_node(1)

G.add_nodes_from([2, 3])

G.add_nodes_from(range(4, 6))

# nx.add_path() replaces the Graph.add_path() method of older NetworkX releases.

nx.add_path(G, [6, 7, 8])

G.add_edge(1, 2)

G.add_edges_from([(1,4), (4,5), (2,3), (3,6), (5,6)])

colors = ['r', 'g', 'g', 'g', 'g', 'm', 'm', 'r']

labels = {1:'Start', 2:'2', 3:'3', 4:'4',

          5:'5', 6:'6', 7:'7', 8:'End'}

sizes = [800, 300, 300, 300, 300, 600, 300, 800]

nx.draw_networkx(G, node_color=colors, node_shape='D',

                 with_labels=True, labels=labels,

                 node_size=sizes)

plt.show()

The example begins by creating a directed graph using the DiGraph() constructor. You should note that the NetworkX package also supports MultiGraph() and MultiDiGraph() graph types. You can see a listing of all the graph types at https://networkx.lanl.gov/reference/classes.html.

Adding nodes is much like working with an undirected graph. You can add single nodes using add_node() and multiple nodes using add_nodes_from(). The add_path() call lets you create nodes and edges at the same time. (In current NetworkX releases, add_path() is a module-level function that takes the graph as its first argument, as in nx.add_path(G, [6, 7, 8]).) The order of nodes in the call is important. The flow from one node to another is from left to right in the list supplied to the call.

Adding edges is much the same as working with an undirected graph, too. You can use add_edge() to add a single edge or add_edges_from() to add multiple edges at one time. However, the order of the node numbers is important. The flow goes from the left node to the right node in each pair.
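A short sketch can make the effect of node order visible. It assumes NetworkX is installed and uses the module-level nx.add_path() function found in current releases. The tiny graph here is illustrative rather than the chapter's full example.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge(1, 2)                 # flow goes from node 1 to node 2
nx.add_path(G, [6, 7, 8])        # creates edges 6 -> 7 and 7 -> 8

print(G.has_edge(1, 2))          # is there an edge from 1 to 2?
print(G.has_edge(2, 1))          # is there an edge from 2 back to 1?
print(list(G.successors(7)))     # nodes that node 7 points to
print(list(G.predecessors(7)))   # nodes that point to node 7
```

Reversing a pair, such as calling G.add_edge(2, 1), would create a separate edge that points the other way.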

This example adds special node colors, labels, shape (only one shape is used), and sizes to the output. You still call on draw_networkx() to perform the task. However, adding the parameters shown changes the appearance of the graph. Note that with_labels must be True (the default for draw_networkx()) in order to see the labels provided by the labels parameter. Figure 11-12 shows the output from this example.
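Because node_color and node_size are matched to nodes by position, one hedged way to avoid mismatches is to key the styling by node and build the lists from G.nodes(). The sketch below assumes NetworkX is installed. The dictionary names and the choice of highlighted path are illustrative, not part of the chapter's example. It also builds a list of edge widths that marks one path as the preferred route.

```python
import networkx as nx

G = nx.DiGraph()
nx.add_path(G, [1, 2, 3, 6, 7, 8])   # one route from node 1 to node 8
nx.add_path(G, [1, 4, 5, 6])         # an alternative way to reach node 6

# Styling keyed by node; any node not listed falls back to a default.
special_colors = {1: 'r', 8: 'r', 6: 'm', 7: 'm'}
special_sizes = {1: 800, 8: 800, 6: 600}

colors = [special_colors.get(n, 'g') for n in G.nodes()]
sizes = [special_sizes.get(n, 300) for n in G.nodes()]

# Mark one route as preferred by giving its edges a heavier line.
preferred = [1, 2, 3, 6, 7, 8]
preferred_edges = set(zip(preferred, preferred[1:]))
widths = [3.0 if edge in preferred_edges else 1.0 for edge in G.edges()]

print(dict(zip(G.nodes(), colors)))
print(widths)
```

You could then pass these lists to the drawing call, for example nx.draw_networkx(G, node_color=colors, node_size=sizes, width=widths), and the styling stays aligned with the nodes and edges even if you add them in a different order.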

FIGURE 11-12: Use directed graphs to show direction between nodes.
