Big data visualization with datashader

Big data also needs to be visualized! Big data visualizations are somewhat rare, partly because they are hard to build, but also because their insights are hard to interpret and communicate. A big data visualization is usually either a network, a map, or a mapping (a similarity-based, computed 2- or 3-dimensional distribution). They are usually astonishing and complex! In fact, a few early pioneers of big data visualization, such as Eric Fischer, became famous precisely for this kind of work.

As we mentioned, big data visualizations are generally hard due to the sheer size of the dataset. Standard tools won't work: matplotlib, even with a raster engine, will take hours to plot millions of points, and Altair won't handle them at all. For a long time, there was no easy solution to this problem. This changed with the arrival of yet another Python library: datashader. datashader leverages a few modern packages for fast computation (specifically Numba, a package that achieves speed through just-in-time compilation, which we'll discuss in Chapter 20, Best Practices and Python Performance), together with a smart approach to the visualization itself: it bins the data into a grid of pixels, computing an aggregate value for each pixel under the hood. Binning can save a lot of time on visualization, and using a pixel grid resolves the usual drawbacks of larger bins, since you can't see anything within a single pixel anyway. On top of that, once datashader has computed the matrix of pixel values, we can change the appearance of the picture without having to re-aggregate the values.

Let's try datashader on an example. None of the datasets we've worked with so far are large enough, so we'll use a new one: an open dataset of 311 complaints for New York City for the whole of 2018. We briefly mentioned this data, and shared code to collect it, in Chapter 6, First Script – Geocoding with Web API. Just in case, the code we used to collect the data is also stored in this chapter's folder, as the _pull_311.py script. To get the data, just run this script from the Terminal: python _pull_311.py. The code for the visualization is stored in the 3_big_data_viz_311.ipynb notebook.

311 is a municipal public service meant to process citizens' input on non-urgent issues; in other words, it is similar to 911, but for non-threatening matters such as noise, litter, fallen trees, graffiti, and so on. Complaints can be filed via a phone call, text message, email, application, or web form. The city of New York shares anonymized records of these complaints daily, including the time of the complaint, its coordinates, the type of complaint, the relevant city department or institution, the time the complaint was closed, and some other information.

The data we collected is stored in 12 CSV files, one for each month of 2018, and includes 2,747,985 records. This is not big data per se; at the very least, it fits in memory on a modern machine, but it is hard to work with and already non-trivial to visualize.

Let's try to load the data first. Because we're dealing with multiple CSVs, we'll use glob, a function from Python's built-in glob module that collects file paths matching a given pattern:

import pandas as pd
from glob import glob

Now, we need to specify a pattern, and run glob on it:

>>> paths = './data/311/*.csv'
>>> files = glob(paths)
>>> files
['./data/311/2018-06.csv',
'./data/311/2018-12.csv',
'./data/311/2018-07.csv',
'./data/311/2018-11.csv',
'./data/311/2018-05.csv',
'./data/311/2018-04.csv',
'./data/311/2018-10.csv',
'./data/311/2018-01.csv',
'./data/311/2018-03.csv',
'./data/311/2018-02.csv',
'./data/311/2018-09.csv',
'./data/311/2018-08.csv']

Finally, we can traverse those files, load them one by one, and concatenate them into a single dataframe:

data = pd.concat([pd.read_csv(p, low_memory=False, index_col=0) for p in files])

Here, we used the low_memory=False flag, which makes pandas read each file in one pass rather than in chunks, so that it can infer a consistent data type for each column.
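
If you prefer not to rely on type inference at all, you can also spell the types out explicitly. The following is only a sketch, under the assumption that the coordinate and date columns we use later in this chapter are present in every file:

# An optional, more explicit alternative: declare the types up front
# (the column names are the ones we use later in this chapter)
dtypes = {
    'x_coordinate_state_plane': 'float64',
    'y_coordinate_state_plane': 'float64',
}
data = pd.concat([
    pd.read_csv(p, index_col=0, low_memory=False, dtype=dtypes,
                parse_dates=['created_date', 'closed_date'])
    for p in files
])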

If your machine has limited memory, you might want to load fewer months. Alternatively, and only for some maps, the data can be read month by month, aggregated with datashader separately for each chunk, and then summed together (see the sketch below). datashader stores its aggregates as plain 2-dimensional numeric arrays (xarray DataArrays backed by numpy), so this is easy to do.
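
The following is a minimal sketch of that month-by-month idea. For the per-month aggregates to be summable, the canvas has to use a fixed extent, so that every month is binned onto exactly the same pixel grid; the coordinate bounds below are approximate NYC state plane values (in feet) and are only an illustration:

import pandas as pd
import datashader as ds
from glob import glob

# A shared x_range/y_range guarantees that every month is binned onto
# the same pixel grid, so the per-month aggregates can simply be added.
cvs = ds.Canvas(plot_width=1000, plot_height=1000,
                x_range=(910_000, 1_070_000), y_range=(120_000, 275_000))

total = None
for path in glob('./data/311/*.csv'):
    month = pd.read_csv(path, low_memory=False, index_col=0)
    agg = cvs.points(month, 'x_coordinate_state_plane',
                     'y_coordinate_state_plane', ds.count())
    total = agg if total is None else total + agg  # element-wise sum of counts

# total can now be passed to tf.shade(), just like a full aggregate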

Now, we can plot a simple density distribution of complaints. First, let's load datashader:

import datashader as ds
import datashader.transfer_functions as tf
from datashader.colors import inferno

Now, we'll create a canvas (essentially, a 2-dimensional matrix of pixels) and use it to aggregate our data. The last argument, ds.count(), is an aggregation function; in this case, it counts the number of records (complaints) falling into each pixel:

cvs = ds.Canvas(plot_width=1000, plot_height=1000)
agg = cvs.points(data, 'x_coordinate_state_plane', 'y_coordinate_state_plane', ds.count())
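
Before shading, it can be instructive to peek at what agg actually is: a 2-dimensional xarray DataArray backed by a plain numpy array, with one value per pixel. A quick, optional check:

# agg holds one aggregated value (here, a complaint count) per pixel
print(type(agg))         # xarray.core.dataarray.DataArray
print(agg.shape)         # (1000, 1000), matching the canvas
print(float(agg.max()))  # the complaint count of the busiest pixel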

Once this is done, we can move to the actual visualization with a single command:

tf.shade(agg, cmap=inferno, how='eq_hist')

Here, we essentially just colorize the agg matrix, converting values into colors. Notice the last argument: it describes how to distribute values along the color map. A linear strategy maps values to colors without any distortion: the maximum values get the colors at the end of the map, and all values in between get colors proportionally. However, data is often not distributed evenly; there are spikes and long tails of relatively small values, which leads to a few small blobs of distinct color with everything else in the same shade. To fight that, other strategies can be used: for example, log and cbrt colorize the logarithm and the cube root of the values, respectively. The weapon of last resort, eq_hist, essentially colorizes the rank of each value, so that there are about the same number of pixels of each tone. The choice of strategy depends on the specifics of the dataset.
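
If you want to see the difference between these strategies on your own data, you can shade the same aggregate several times; in recent datashader versions, tf.Images lays the resulting images out side by side in a notebook. A quick sketch:

# Shade the same aggregate with three different value-to-color strategies
linear = tf.shade(agg, cmap=inferno, how='linear')
logged = tf.shade(agg, cmap=inferno, how='log')
ranked = tf.shade(agg, cmap=inferno, how='eq_hist')
tf.Images(linear, logged, ranked)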

Because of the high density of elements on the plot, and the nature of the visualization itself, we couldn't use anything but color to encode the values; since not all colors are properly represented in black-and-white print, we recommend viewing these visualizations in the repository (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications/blob/master/Chapter12/3_big_data_viz-311.ipynb).

Here is the resulting chart. The map is gorgeous, and very detailed. From the distribution, we can make out the distinct shapes of the two islands, along with roads, towns, and individual structures:

Let's look closer at the chart. Here, lighter colors (yellow in the original) represent a higher density of complaints, while darker colors mean fewer complaints.

Since the colors are not visible here, please refer to the graphics bundle link (https://static.packt-cdn.com/downloads/9781789535365_ColorImages.pdf) for all images in the book.

As you can see, density generally decreases from the center of the city to its edges, which makes sense. At the same time, we can eyeball areas with a higher or lower number of complaints than their surroundings. Most of them are quite meaningful to anyone familiar with the city: for example, why is Bergen Beach (the dark corner in the lower center of the image) so different from its surroundings? Why is the eastern edge of Central Park (the white rectangle on Manhattan) so dark compared to its surroundings? Why is the density so much higher on the eastern side of Prospect Park (the right-hand white shape in the middle of Brooklyn) than on the western side? The high resolution allows us to drill down to the smallest elements on the map, questioning even the tiniest spatial irregularities.
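
Because everything is driven by the canvas, drilling down is simply a matter of re-aggregating over a narrower extent. The coordinate ranges below are made-up state plane values (in feet), purely to illustrate the idea:

# Zoom in by constraining the canvas extent and re-aggregating
zoom = ds.Canvas(plot_width=1000, plot_height=1000,
                 x_range=(990_000, 1_010_000),   # hypothetical bounds
                 y_range=(180_000, 200_000))
agg_zoom = zoom.points(data, 'x_coordinate_state_plane',
                       'y_coordinate_state_plane', ds.count())
tf.shade(agg_zoom, cmap=inferno, how='eq_hist')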

Let's build a similar map, but this time aggregate by the most frequent source of complaint, which is stored in the open_data_channel_type column. First, let's check all of the possible sources:

>>> data['open_data_channel_type'].value_counts()
PHONE 1469034
ONLINE 565348
UNKNOWN 366890
MOBILE 314247
OTHER 32466
Name: open_data_channel_type, dtype: int64

As there are only five sources, it is easiest to manually assign a color to each:

colors = {
    'PHONE': 'red',
    'ONLINE': 'blue',
    'UNKNOWN': 'grey',
    'MOBILE': 'green',
    'OTHER': 'brown'
}

We also have to convert this column into the category data type. The category data type is far more compact than strings (it stores an integer code for each category), and it is also required by datashader for category-based operations; otherwise, datashader simply won't work:

data['open_data_channel_type'] = data['open_data_channel_type'].astype('category')
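
If you are curious about the effect of this conversion, you can compare the column's memory footprint before and after; this check is entirely optional:

# Compare the memory usage of raw strings versus category codes
as_strings = data['open_data_channel_type'].astype('object')
print(as_strings.memory_usage(deep=True))                      # plain strings
print(data['open_data_channel_type'].memory_usage(deep=True))  # categories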

Now, let's aggregate by most typical cause:

agg_cat = cvs.points(data, 'x_coordinate_state_plane', 'y_coordinate_state_plane',
                     ds.count_cat('open_data_channel_type'))

With color keys, there is no need to specify a coloring method; you just pass the colors:

tf.shade(agg_cat, color_key=colors)

Here is our result: we're looking at exactly the same dataset, just colored by the most frequent complaint source for each pixel:

As with the previous map, this one contains multiple interesting patterns; we could eyeball it for hours! For example, notice the chains of pixels on Staten Island (lower-left corner): perhaps those are highways. Throughout the city, we can spot clusters of blue (online) and green (mobile); those can most likely be attributed to offices and transit areas.
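
Rare categories (such as OTHER) can be hard to spot at single-pixel size. datashader's tf.spread function can dilate every non-empty pixel to make isolated points visible; a small sketch:

# Grow each non-empty pixel by one pixel in every direction
img = tf.shade(agg_cat, color_key=colors)
tf.spread(img, px=1)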

As a final step, let's see how the average time it takes to close a complaint is distributed across the city. Before mapping, we need to compute this time as a number: we subtract the two timestamps and convert the resulting timedelta objects into a number of minutes:

data['created_date'] = pd.to_datetime(data['created_date'])
data['closed_date'] = pd.to_datetime(data['closed_date'])
data['time_to_close'] = (data['closed_date'] - data['created_date']).dt.total_seconds() / 60  # minutes

Now, we can calculate the average of time_to_close for each pixel:

agg_time = cvs.points(data, 'x_coordinate_state_plane', 'y_coordinate_state_plane', ds.mean('time_to_close'))

tf.shade(agg_time, cmap=inferno, how='eq_hist')

And here is the outcome:

Again, there is plenty of interesting material here (most likely, large differences in time to close are the result of the different nature of the complaints). It is even more interesting to compare the different maps: some areas share similar properties in one context, but look drastically different in another. The best part is that we, relatively easily, created a set of insanely detailed maps that communicate both the overall picture of the dataset and all of the intricacies of specific locations. Indeed, big data visualizations often reveal unexpected patterns that are hard to catch in any other way.
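
One caveat with per-pixel averages is that pixels containing only a handful of complaints produce very noisy values. One possible remedy, sketched below with an arbitrary threshold, is to mask those pixels out using the count aggregate we computed earlier:

# Hide pixels with too few complaints before shading the averages;
# the threshold of 10 is arbitrary and only for illustration
agg_time_dense = agg_time.where(agg > 10)
tf.shade(agg_time_dense, cmap=inferno, how='eq_hist')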
