15 Designing Data Visualizations

Data visualization, when done well, allows you to reveal patterns in your data and communicate insights to your audience. This chapter describes the conceptual and design skills necessary to craft effective and expressive visual representations of your data. In doing so, it introduces skills for each of the following steps in the visualization process:

Understanding the purpose of visualization
Selecting a visual layout based on your question and data type
Choosing optimal graphical encodings for your variables
Identifying visualizations that are able to express your data
Improving the aesthetics (i.e., making it readable and informative)

15.1 The Purpose of Visualization

“The purpose of visualization is insight, not pictures.”¹

¹Card, S. K., Mackinlay, J. D., & Shneiderman, B. (1999). Readings in information visualization: Using vision to think. Burlington, MA: Morgan Kaufmann.

Generating visual displays of your data is a key step in the analytical process. While you should strive to design aesthetically pleasing visuals, it’s important to remember that visualization is a means to an end. Devising appropriate renderings of your data can help expose underlying patterns in your data that were previously unseen, or that were undetectable by other tests.

To demonstrate how visualization makes a distinct contribution to the data analysis process (beyond statistical tests), consider the canonical data set Anscombe’s Quartet (which is included with the R software as the data set anscombe). This data set consists of four pairs of x and y data: (x₁, y₁), (x₂, y₂), and so on. The data set is shown in Table 15.1.

Table 15.1 Anscombe’s Quartet: four data sets with two features each

x₁	y₁	x₂	y₂	x₃	y₃	x₄	y₄
10.00	8.04	10.00	9.14	10.00	7.46	8.00	6.58
8.00	6.95	8.00	8.14	8.00	6.77	8.00	5.76
13.00	7.58	13.00	8.74	13.00	12.74	8.00	7.71
9.00	8.81	9.00	8.77	9.00	7.11	8.00	8.84
11.00	8.33	11.00	9.26	11.00	7.81	8.00	8.47
14.00	9.96	14.00	8.10	14.00	8.84	8.00	7.04
6.00	7.24	6.00	6.13	6.00	6.08	8.00	5.25
4.00	4.26	4.00	3.10	4.00	5.39	19.00	12.50
12.00	10.84	12.00	9.13	12.00	8.15	8.00	5.56
7.00	4.82	7.00	7.26	7.00	6.42	8.00	7.91
5.00	5.68	5.00	4.74	5.00	5.73	8.00	6.89

The challenge of Anscombe’s Quartet is to identify differences between the four pairs of columns. For example, how does the (x₁, y₁) pair differ from the (x₂, y₂) pair? Using a nonvisual approach to answer this question, you could compute a variety of descriptive statistics for each set, as shown in Table 15.2. Given these six statistical assessments, these four data sets appear to be identical. However, if you graphically represent the relationship between each x and y pair, as in Figure 15.1, you reveal the distinct nature of their relationships.

Table 15.2 Anscombe’s Quartet: the (X, Y) pairs share identical summary statistics

Set	Mean X	Std. Deviation X	Mean Y	Std. Deviation Y	Correlation	Linear Fit
1	9.00	3.32	7.50	2.03	0.82	y = 3 + 0.5x
2	9.00	3.32	7.50	2.03	0.82	y = 3 + 0.5x
3	9.00	3.32	7.50	2.03	0.82	y = 3 + 0.5x
4	9.00	3.32	7.50	2.03	0.82	y = 3 + 0.5x

Figure 15.1 Anscombe’s Quartet: scatterplots reveal four different (x, y) relationships that are not detectable using descriptive statistics.

The four sets of Anscombe's Quartet are plotted on four different graphs to reveal different (x, y) relationships. The first set shows a scattered, positive correlation; the second set shows the points starting to fall after a steady ascent; the third set shows a direct, positive correlation with a single outlier; and the fourth set shows the points stacked in a vertical line at a constant x value, with one outlier.

While computing summary statistics is an important part of the data exploration process, it is only through visual representations that differences across these sets emerge. The simple graphics in Figure 15.1 expose variations in the distributions of x and y values, as well as in the relationships between them. Thus the choice of representation becomes paramount when analyzing and presenting data. The following sections introduce basic principles for making that choice.
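
As a rough check on Table 15.2, the summary statistics can be recomputed directly from the values in Table 15.1. The sketch below uses Python's standard library rather than R's built-in anscombe data set, which is a choice made here for illustration; the `summarize` helper and its dictionary keys are hypothetical names, not part of any package.

```python
from statistics import mean, stdev

# Anscombe's Quartet, transcribed from Table 15.1.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x1, x2, and x3 are identical
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return the six summary statistics for one (x, y) pair (hypothetical helper)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "mean_x": mx,
        "sd_x": stdev(x),
        "mean_y": my,
        "sd_y": stdev(y),
        "correlation": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "intercept": my - slope * mx,
        "slope": slope,
    }

summaries = {k: summarize(x, y) for k, (x, y) in quartet.items()}
for k, s in summaries.items():
    print(k, {name: round(value, 2) for name, value in s.items()})
```

Rounded to two decimal places, the four printed rows agree with one another, which is exactly why the statistics alone cannot tell the sets apart.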

15.2 Selecting Visual Layouts

The challenge of visualization, like many design challenges, is to identify an optimal solution (i.e., a visual layout) given a set of constraints. In visualization design, the primary constraints are:

The specific question of interest you are attempting to answer in your domain
The type of data you have available for answering that question
The limitations of the human visual processing system
The spatial limitations in the medium you are using (pixels on the screen, inches on the page, etc.)

This section focuses on the second of these constraints (data type); the last two constraints are addressed in Section 15.3 and Section 15.4. The first constraint (the question of interest) is closely tied to Chapter 10 on understanding data. Based on your domain, you need to home in on a question of interest, and identify a data set that is well suited for answering your question. This section will expand upon the same data set and question from Chapter 10:

“What is the worst disease in the United States?”

As with the Anscombe’s Quartet example, most basic exploratory data questions can be reduced to investigating how a variable is distributed or how variables are related to one another. Once you have mapped from your question of interest to a specific data set, your visualization type will largely depend on the data type of your variables. The data type of each column—nominal, ordinal, or continuous—will dictate how the information can be represented. The following sections describe techniques for visually exploring each variable, as well as making comparisons across variables.

15.2.1 Visualizing a Single Variable

Before assessing relationships across variables, it is important to understand how each individual variable (i.e., column or feature) is distributed. The primary question of interest is often “What does this variable look like?” The specific visual layout you choose when answering this question will depend on whether the variable is categorical or continuous. To use the disease burden data set as an example, you may want to know the range of the number of deaths attributable to each disease.

For continuous variables, a histogram will allow you to see the distribution and range of values, as shown in Figure 15.2. Alternatively, you can use a box plot or a violin plot, both of which are shown in Figure 15.3. Note that outliers (extreme values) in the data set have been removed to better express the information in the charts.
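
To make concrete what a histogram and a box plot compute before anything is drawn, here is a minimal sketch in Python. The `deaths` values are invented for illustration and are not taken from the disease burden data set; the bin width of 5,000 is likewise an arbitrary assumption.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical death counts per cause (illustrative values only).
deaths = [120, 450, 800, 1500, 1900, 2300, 4200, 7600, 9800, 15000, 31000]

bin_width = 5000  # assumed bin width; real charts often pick this automatically

# Histogram: count how many causes fall into each fixed-width bin,
# keyed by the lower edge of the bin.
histogram = Counter((d // bin_width) * bin_width for d in deaths)
for lower_edge in sorted(histogram):
    upper_edge = lower_edge + bin_width - 1
    print(f"{lower_edge:>6} to {upper_edge:<6} {'#' * histogram[lower_edge]}")

# Box plot ingredients: the three quartiles of the same values.
q1, median, q3 = quantiles(deaths, n=4)
print("Q1:", q1, "median:", median, "Q3:", q3)
```

Both summaries describe the same column; the histogram keeps the shape of the distribution, while the quartiles reduce it to the handful of numbers a box plot draws.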

Figure 15.2 The distribution of the number of deaths attributable to each disease in the United States (a continuous variable) using a histogram. Some outliers have been removed for demonstration.

A histogram shows the distribution of the number of deaths due to each disease in the United States. The horizontal axis represents the number of deaths (marked from 0 to 40k) and the vertical axis represents the frequency (ranging from 0 to 40, in increments of 10). The frequency is highest (over 45) when the number of deaths is around 1-2k, and gradually decreases beyond the 10k mark. The frequency is consistently very low when the number of deaths is beyond 20k.

Figure 15.3 Alternative visualizations for showing distributions of the number of deaths in the United States: violin plot (left) and box plot (right). Some outliers have been removed for demonstration.

The number of deaths for each cause in the US is represented using a violin plot (left) and a box plot (right). Both plots show similar data - the violin plot is broader at the base and narrow toward the peak. The box plot shows the box ranging from 0 to 10k on the vertical axis with the median line at the point 2k on the vertical axis. Outliers are in the range 25k to 40k.

While these visualizations display information about the distribution of the number of deaths by cause, they all leave an obvious question unanswered: what are the names of these diseases? Figure 15.4 uses a bar chart to label the top 10 causes of death, but due to the constraint of the page size, this display is able to express just a small subset of the data. In other words, bar charts don’t easily scale to hundreds or thousands of observations because they are inefficient to scan, or won’t fit in a given medium.

Figure 15.4 Top causes of death in the United States as shown in a bar chart.

A horizontal bar chart shows the top 10 causes of death in the United States in the year 2016. The diseases are (in order from most to fewest deaths): Ischemic heart disease; Alzheimer disease and other dementias; Tracheal, bronchus, and lung cancer; Cerebrovascular disease; Chronic obstructive pulmonary disease; Lower respiratory disease; Lower respiratory infections; Chronic kidney disease; Colon and rectum cancer; Diabetes mellitus; and Breast cancer. Ischemic heart disease caused the most deaths (over 500k). Alzheimer disease and other dementias led to a little over 200k deaths. Lung cancer, Cerebrovascular disease, and Chronic obstructive pulmonary disease each caused over 100k deaths, while the remaining diseases each caused fewer than 100k deaths in 2016.

15.2.1.1 Proportional Representations

Depending on the data stored in a given column, you may be interested in showing each value relative to the total of the column. For example, using the disease burden data set, you may want to express each value proportional to the total number of deaths. This allows you to answer the question, “Of all deaths, what percentage is attributable to each disease?” To do this, you can transform the data to percentages, or use a representation that more clearly expresses parts of a whole. Figure 15.5 shows the use of a stacked bar chart and a pie chart, both of which more intuitively express proportionality. You can also use a treemap, as shown later in Figure 15.14, though the true benefit of a treemap is expressing hierarchical data (more on this later in the chapter). Later sections explore the trade-offs in perceptual accuracy associated with each of these representations.
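
The transformation to percentages is simple arithmetic, sketched below in Python. The cause names and counts are placeholders invented for this example, and the `segments` layout is only one assumed way a stacked bar could be positioned.

```python
# Hypothetical death counts (illustrative values only, not the book's data).
deaths_by_cause = {"Cause A": 500, "Cause B": 200, "Cause C": 200, "Cause D": 100}

total = sum(deaths_by_cause.values())

# Each value expressed as a percentage of the column total.
percent = {cause: 100 * n / total for cause, n in deaths_by_cause.items()}

# A stacked bar places each segment where the previous one ended,
# so each cause occupies the interval (start, end) on a 0-100 scale.
segments = {}
start = 0.0
for cause, share in percent.items():
    segments[cause] = (start, start + share)
    start += share

for cause in deaths_by_cause:
    print(cause, f"{percent[cause]:.1f}%", segments[cause])
```

A pie chart would use the same percentages but map them to angles (a share of 360 degrees) instead of lengths along a bar.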

Figure 15.5 Proportional representations of the top causes of death in the United States: stacked bar chart (top) and pie chart (bottom).

A figure graphically depicts the top 10 causes of death in the United States using a stacked bar chart (top) and a pie chart (bottom). Both charts show the same data in different representations. They convey the following: Ischemic heart disease caused the most deaths (over 500k). Alzheimer disease and other dementias led to a little over 200k deaths. Lung cancer, Cerebrovascular disease, and Chronic obstructive pulmonary disease each caused over 100k deaths, while the remaining diseases each caused fewer than 100k deaths in 2016. Note: the numbers are shown in the bar chart, and the pie chart shows only the proportions.

If your variable of interest is a categorical variable, you will need to aggregate your data (e.g., count the number of occurrences of different categories) to ask similar questions about the distribution. Once you have done so, you can use similar techniques to show the data (e.g., bar chart, pie chart, treemap). For example, the diseases in this data set are categorized into three types of diseases: non-communicable diseases, such as heart disease or lung cancer; communicable diseases, such as tuberculosis or whooping cough; and injuries, such as road traffic accidents or self-harm. To understand how this categorical variable (disease type) is distributed, you can count the number of rows for each category, then display those quantitative values, as in Figure 15.6.
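
Counting rows per category is the aggregation step behind a chart like Figure 15.6. A minimal Python sketch follows; the six rows reuse only the example diseases named in the text, so the resulting counts are illustrative and do not match the figure.

```python
from collections import Counter

# One row per cause, using only the examples named in the text.
causes = [
    ("Heart disease", "Non-communicable"),
    ("Lung cancer", "Non-communicable"),
    ("Tuberculosis", "Communicable"),
    ("Whooping cough", "Communicable"),
    ("Road traffic accidents", "Injuries"),
    ("Self-harm", "Injuries"),
]

# Aggregate: count how many rows fall into each disease category.
category_counts = Counter(category for _, category in causes)

# The counts are now quantitative values that a bar chart can display.
for category, n in category_counts.most_common():
    print(f"{category:<18} {'#' * n} ({n})")
```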

Figure 15.6 A visual representation of the number of causes in each disease category: non-communicable diseases, communicable diseases, and injuries.

A horizontal bar chart shows the number of causes for non-communicable diseases, communicable diseases, and injuries. The horizontal axis represents the number of causes (ranging from 0 to 80, in increments of 20) and the vertical axis shows the cause group (non-communicable, communicable, and injuries). The chart shows the highest number of causes (around 80) for non-communicable diseases, around 50 for communicable diseases, and the lowest (around 20) for injuries.

15.2.2 Visualizing Multiple Variables

Once you have explored each variable independently, you will likely want to assess relationships between or across variables. The type of visual layout necessary for making these comparisons will (again) depend largely on the type of data you have for each variable.

For comparing relationships between two continuous variables, the best choice is a scatterplot. The visual processing system is quite good at estimating the linearity in a field of points created by a scatterplot, allowing you to describe how two variables are related. For example, using the disease burden data set, you can compare different metrics for measuring health loss. Figure 15.7 compares the disease burden as measured by the number of deaths due to each cause to the number of years of life lost (a metric that accounts for the age at death for each individual).

Figure 15.7 Using a scatterplot to compare two continuous variables: the number of deaths versus the years of life lost for each disease in the United States.

A scatterplot compares the number of years of life lost with the number of deaths in the United States. The concentration of points is highest when the number of deaths is less than roughly 40k. In this region, the number of years of life lost varies around 5-10 million. There is one outlier: at over 500k deaths, the number of years of life lost is around 75 million.

You can extend this approach to multiple continuous variables by creating a scatterplot matrix of all continuous features in the data set. Figure 15.8 compares all pairs of metrics of disease burden, including number of deaths, years of life lost (YLLs), years lived with disability (YLDs, a measure of the disability experienced by the population), and disability-adjusted life years (DALYs, a combined measure of life lost and disability).
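
The structure of a scatterplot matrix can be sketched without any plotting library. The Python below enumerates the panels for the four metrics named above; the `panel_kind` function is a hypothetical helper, and its choice of what goes on, above, and below the diagonal is an assumption modeled on the description of Figure 15.8.

```python
from itertools import combinations

metrics = ["DALYs", "Deaths", "YLDs", "YLLs"]

def panel_kind(row, col):
    """Decide what one cell of the matrix shows (assumed layout)."""
    if row == col:
        return "distribution"  # a variable paired with itself
    return "correlation" if col > row else "scatterplot"

# Every unordered pair of metrics appears once below the diagonal.
unique_pairs = list(combinations(metrics, 2))

layout = [[panel_kind(r, c) for c in range(len(metrics))]
          for r in range(len(metrics))]

for name, row in zip(metrics, layout):
    print(f"{name:>6}: {row}")
print(len(unique_pairs), "distinct pairs to compare")
```

With four variables the matrix has sixteen cells but only six distinct comparisons, which is why the two triangles are often used to show different summaries of the same pairs.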

Figure 15.8 Comparing multiple continuous measurements of disease burden using a scatterplot matrix.

The figure is a four-by-four matrix of panels. The columns of the matrix represent DALYs, Deaths, YLDs, and YLLs, from left to right. The rows of the matrix from top to bottom represent DALYs, Deaths, YLDs, and YLLs. The horizontal axis of the DALYs column ranges from 0 to 50M. The horizontal axis of the Deaths column ranges from 0 to 400k, in increments of 200k. The horizontal axis of the YLDs column ranges from 0 to 40M, in increments of 20M. The horizontal axis of the YLLs column ranges from 0 to 50M. The vertical axis of the DALYs row ranges from 0 to 400, in increments of 100. The vertical axis of the Deaths row ranges from 0 to 400k, in increments of 200k. The vertical axis of the YLDs row ranges from 0 to 50M, in increments of 10M. The vertical axis of the YLLs row ranges from 0 to 60M, in increments of 10M. For example, when a graph is plotted for YLDs against Deaths, the horizontal axis will range from 0 to 400k, in increments of 200k, and the vertical axis will range from 0 to 50M, in increments of 10M. The panels along the diagonal of the matrix show vertical bars. The panels above the diagonal show correlation values. The panels below the diagonal show plotted dots, which are denser near the origin.

When comparing relationships between one continuous variable and one categorical variable, you can compute summary statistics for each group (see Figure 15.6), use a violin plot to display distributions for each category (see Figure 15.9), or use faceting to show the distribution for each category (see Figure 15.10).
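
Each of these approaches starts by splitting the continuous values by category. A minimal Python sketch of that split, followed by one summary statistic per group, is shown below; the death counts are invented for illustration.

```python
from collections import defaultdict
from statistics import median

# (category, number of deaths) per cause; hypothetical values only.
rows = [
    ("Communicable", 300), ("Communicable", 1200), ("Communicable", 900),
    ("Injuries", 4000), ("Injuries", 9000),
    ("Non-communicable", 2000), ("Non-communicable", 15000),
    ("Non-communicable", 7000),
]

# Split the continuous variable into one list of values per category.
groups = defaultdict(list)
for category, deaths in rows:
    groups[category].append(deaths)

# One summary statistic per group; a violin plot or faceted histogram
# would instead draw the full list of values for each group.
medians = {category: median(values) for category, values in groups.items()}
print(medians)
```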

Figure 15.9 A violin plot showing the continuous distributions of the number of deaths for each cause (by category). Some outliers have been removed for demonstration.

A violin plot shows the distribution of the number of deaths for 3 causes - communicable diseases, injuries, and non-communicable diseases. The numbers are: around 10k for communicable diseases and around 45k for injuries and non-communicable diseases. The average deaths, as indicated by the thickness of the violin plot, is: 0-2k for communicable diseases, 0-10k for injuries, and 0-18k for non-communicable diseases. Note: the values are approximate.

Figure 15.10 A faceted layout of histograms showing the continuous distributions of the number of deaths for each cause (by category). Some outliers have been removed for demonstration.

Three histograms show the continuous distribution of the number of deaths for each cause (communicable diseases, injuries, and non-communicable diseases). The frequency is highest for communicable diseases when the number of deaths is less than 1k. The total number of deaths is not higher than 10k. Injuries have lower frequencies (less than 5 on average) and the number of deaths is nearly nil beyond 10k deaths. For non-communicable diseases, the frequency lowers as the number of deaths increases.

For assessing relationships between two categorical variables, you need a layout that enables you to assess the co-occurrences of nominal values (that is, whether an observation contains both values). A great way to do this is to count the co-occurrences and show a heatmap. As an example, consider a broader set of population health data that evaluates the leading cause of death in each country (also from the Global Burden of Disease study). Figure 15.11 shows a subset of this data, including the disease type (communicable, non-communicable) for each disease, and the region where each country is found.

Figure 15.11 The leading cause of death in each country. The category of each disease (communicable, non-communicable) is shown, as is the region in which each country is found.

The table lists 10 countries along with the region, leading cause of death, and category of the disease. The data observed is as follows. Botswana - Southern Sub-Saharan Africa, HIV/AIDS, Communicable; Brazil - Tropical Latin America, Ischemic heart disease, Non-communicable; Brunei - High-income Asia Pacific, Ischemic heart disease, Non-communicable; Bulgaria - Central Europe, Ischemic heart disease, Non-communicable; Burkina Faso - Western Sub-Saharan Africa, Malaria, Communicable; Burundi - Eastern Sub-Saharan Africa, Diarrheal diseases, Communicable; Cambodia - Southeast Asia, Lower respiratory infections, Communicable; Cameroon - Western Sub-Saharan Africa, HIV/AIDS, Communicable; Canada - High-income North America, Ischemic heart disease, Non-communicable; and Cape Verde - Western Sub-Saharan Africa, Ischemic heart disease, Non-communicable.

One question you may ask about this categorical data is:

“In each region, how often is the leading cause of death a communicable disease versus a non-communicable disease?”

To answer this question, you can aggregate the data by region, and count the number of times each disease category (communicable, non-communicable) appears as the category for the leading cause of death. This aggregated data (shown in Figure 15.12) can then be displayed as a heatmap, as in Figure 15.13.
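
The aggregation can be sketched in a few lines of Python using the ten rows shown in Figure 15.11. Because this is only a subset of the full data, the counts differ from those in Figure 15.12; the variable names are illustrative.

```python
from collections import Counter

# (country, region, category of leading cause), from Figure 15.11.
rows = [
    ("Botswana", "Southern Sub-Saharan Africa", "Communicable"),
    ("Brazil", "Tropical Latin America", "Non-communicable"),
    ("Brunei", "High-income Asia Pacific", "Non-communicable"),
    ("Bulgaria", "Central Europe", "Non-communicable"),
    ("Burkina Faso", "Western Sub-Saharan Africa", "Communicable"),
    ("Burundi", "Eastern Sub-Saharan Africa", "Communicable"),
    ("Cambodia", "Southeast Asia", "Communicable"),
    ("Cameroon", "Western Sub-Saharan Africa", "Communicable"),
    ("Canada", "High-income North America", "Non-communicable"),
    ("Cape Verde", "Western Sub-Saharan Africa", "Non-communicable"),
]

# Count the co-occurrences of each (region, category) pair.
counts = Counter((region, category) for _, region, category in rows)

# Arrange the counts as a grid: one row per region, one column per
# category. A missing combination counts as zero.
categories = ["Communicable", "Non-communicable"]
regions = sorted({region for _, region, _ in rows})
heatmap = {region: [counts[(region, c)] for c in categories]
           for region in regions}

for region, cells in heatmap.items():
    print(f"{region:<28} {cells}")
```

A heatmap then colors each cell of this grid by its count, so the numeric grid and the picture carry the same information.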

Figure 15.12 Number of countries in each region in which the leading cause of death is communicable/non-communicable.

The table lists 8 regions according to the category of leading cause (communicable/non-communicable) and the number of countries in the region. The data observed from the table is as follows. Andean Latin America - Communicable, 1; Andean Latin America - Non-Communicable, 2; Australasia - Non-Communicable, 2; Caribbean - Non-Communicable, 18; Central Asia - Non-Communicable, 9; Central Europe - Non-Communicable, 13; Central Latin America - Communicable, 1; and Central Latin America - Non-Communicable, 8.

Figure 15.13 A heatmap of the number of countries in each region in which the leading cause of death is communicable/non-communicable.

The horizontal axis represents the leading cause of death. The vertical axis represents the regions: Western Sub-Saharan Africa, Western Europe, Tropical Latin America, Southern Sub-Saharan Africa, Southern Latin America, Southeast Asia, South Asia, Oceania, North Africa and Middle East, High-income North America, Eastern Sub-Saharan Africa, Eastern Europe, East Asia, Central Sub-Saharan Africa, Central Latin America, Central Europe, Central Asia, Caribbean, Australasia, and Andean Latin America. The graph shows two bars representing communicable and non-communicable and in different shades indicating the number of countries from 1 to 20.

15.2.3 Visualizing Hierarchical Data

One distinct challenge is showing a hierarchy that exists in your data. If your data naturally has a nested structure in which each observation is a member of a group, visually expressing that hierarchy can be critical to your analysis. Note that there may be multiple levels of nesting for each observation (observations may be part of a group, and that group may be part of a larger group). For example, in the disease burden data set, each country is found within a particular region, which can be further categorized into larger groupings called super-regions. Similarly, each cause of death (e.g., lung cancer) is a member of a family of causes (e.g., cancers), which can be further grouped into overarching categories (e.g., non-communicable diseases). Hierarchical data can be visualized using treemaps (Figure 15.14), circle packing (Figure 15.15), sunburst diagrams (Figure 15.16), or other layouts. Each of these visualizations uses an area encoding to represent a numeric value. These shapes (rectangles, circles, or arcs) are organized in a layout that clearly expresses the hierarchy of information.
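
Layouts like these need a total for every group in the hierarchy, not just for the leaves. The Python sketch below rolls leaf values up a nested structure; the numbers are invented, and the "Cardiovascular diseases" grouping is a hypothetical label added only to give the example a second family.

```python
# A nested hierarchy: category -> family -> cause -> value.
# All values are hypothetical; only the nesting pattern matters.
hierarchy = {
    "Non-communicable diseases": {
        "Cancers": {"Lung cancer": 30, "Breast cancer": 10},
        "Cardiovascular diseases": {"Ischemic heart disease": 50},
    },
    "Injuries": {"Road injuries": 10},
}

def total(node):
    """Sum the leaf values beneath a node, at any depth."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node  # a leaf holds its own value

grand_total = total(hierarchy)

# The area given to each top-level group is its share of the whole.
for name, subtree in hierarchy.items():
    share = total(subtree) / grand_total
    print(f"{name:<28} {share:.0%} of the area")
```

The same recursion applies to rectangles in a treemap, circles in a circle pack, or arcs in a sunburst diagram, since each sizes a parent by the sum of its children.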

Figure 15.14 A treemap of the number of deaths in the United States from each cause (both sexes, all ages, 2016). Screenshot from GBD Compare, a visualization tool for the global burden of disease (https://vizhub.healthdata.org/gbd-compare/).

The treemap shows three regions. The first region, filled in blue, includes IHD, stroke, lung c, liver c, stomach c, colorect C, breast c, oth neopla, leukemia, cervic C, lymphoma, prostate C, Esophag C, Lip oral C, Kidney C, Pancreas C, Ovary C, Bladder C, Uterus C, Myeloma, Melanoma, Skin C, Aort An, A fib, HTN HD, Oth Cardio, RHD, CMP, Endocar, PAD, Brain C, Drugs, Oth MSK, Diabetes CKD, Alzheimer, Urinary, Endocrine, Oth Neuro, Parkinson, MS, ALS, COPD, Asthma, Cirr HepC, Cirr Alc, Oth Cirr, Ileus, Oth Digest, Gall Bile, IBD, and Vasc Intest. The second region, filled in red, includes LRI and HIV. The third region, filled in green, includes Falls, Road Inj, Violence, Self Harm, Fire, and F body.

Figure 15.15 A re-creation of the treemap visualization (of disease burden in the United States) using a circle pack layout. Created using the d3.js library (https://d3js.org).

The circle pack layout encloses a bigger circle and two smaller circles. The various diseases visualized in the bigger circle with varying sizes of inner circles are (filled with blue) Breast C, COPD, Colorect C, CMP, Alzheimer's, Stroke, Drugs, Lung C, Prostate C, Oth Cardio, HTN HD, CKD, Diabetes, Pancreas C, and IHD. The diseases visualized in the two smaller circles are LRI (filled with red); Road Inj, Falls, and Self Harm (filled with green).

Figure 15.16 A re-creation of the treemap visualization (of disease burden in the United States) using a sunburst diagram. Created using the d3.js library (https://d3js.org).

The sunburst chart has a root node followed by two levels of concentric rings that are sliced to represent each category of diseases, the size of the sections corresponding to the value. Level 1 has three nodes, "Non-communicable" (covers the major portion, filled in blue) and two other parent nodes: "Injuries," that has an unnamed leaf node in the next level (both levels filled in green) and "Comm" parent node has its leaf node "LRI" in the next level (both levels filled in red). Level 2 has various nodes (segments): Alzheimer's, IHD, Stroke, COPD, Diabetes, CKD, Lung C, and Colorect C (all filled in blue).

The benefit of visualizing the hierarchy of a data set, however, is not without its costs. As described in Section 15.3, it is quite difficult to visually decipher and compare values encoded in a treemap (especially with rectangles of different aspect ratios). However, these displays provide a great summary overview of hierarchies, which is an important starting point for visually exploring data.

15.3 Choosing Effective Graphical Encodings

The guidelines above for selecting a visual layout based on the data relationship you want to explore are a good place to start, but there are often multiple ways to represent the same data set. Representing data in another format (e.g., visually) is called encoding that data. When you encode data, you use a particular “code” such as color or size to represent each value. These visual representations are then visually decoded by anyone trying to interpret the underlying values.

Your task is thus to select the encodings that are most accurately decoded by users, answering the question:

“What visual form best allows you to exploit the human visual system and available space to accurately display your data values?”

Choosing the most accurately decoded encodings means that, for every value in your data, your audience’s interpretation of that value should be as accurate as possible. The accuracy of these perceptions is referred to as the effectiveness of a graphical encoding. Academic research² measuring the perceptibility of different visual encodings has established a common set of possible encodings for quantitative information, listed here in order from most effective to least effective:

Position: the horizontal or vertical position of an element along a common scale
Length: the length of a segment, typically used in a stacked bar chart
Area: the area of an element, such as a circle or a rectangle, typically used in a bubble chart (a scatterplot with differently sized markers) or a treemap
Angle: the rotational angle of each marker, typically used in a circular layout like a pie chart
Color: the color of each marker, usually along a continuous color scale
Volume: the volume of a three-dimensional shape, typically used in a 3D bar chart

²Most notably, Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.1080/01621459.1984.10478080

As an example, consider the very simple data set in Table 15.3. An effective visualization of this data set would enable you to easily distinguish between the values of each group (e.g., between the values 10 and 11). While this identification is simple for a position encoding, detecting this 10% difference is very difficult for other encodings. Comparisons between encodings of this data set are shown in Figure 15.17.

Table 15.3 A simple data set to demonstrate the perceptibility of different graphical encodings (shown in Figure 15.17). Users should be able to visually distinguish between these values.

group	value
a	1
b	10
c	11
d	7
e	8
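
Some arithmetic helps explain why the difference between 10 and 11 is harder to see in some encodings. The Python sketch below maps the values in Table 15.3 to three visual properties. The mappings are assumptions made for illustration: area is taken to be proportional to the value, and lightness is interpolated linearly between two arbitrary endpoints. The numbers describe the geometry only, not measured human perception.

```python
import math

# The values from Table 15.3.
data = {"a": 1, "b": 10, "c": 11, "d": 7, "e": 8}

def radius_for(value):
    """Radius of a circle whose AREA equals the value (assumed mapping)."""
    return math.sqrt(value / math.pi)

def lightness_for(value, lo=1, hi=11, light=0.9, dark=0.3):
    """Linear interpolation from a light color (lo) to a dark one (hi)."""
    t = (value - lo) / (hi - lo)
    return light + t * (dark - light)

# Position or length: the drawn distance grows by the full 10%.
position_ratio = data["c"] / data["b"]

# Area: the circle's radius grows only by the square root of that ratio.
radius_ratio = radius_for(data["c"]) / radius_for(data["b"])

# Color: the two values differ by a small step in lightness.
lightness_gap = lightness_for(data["b"]) - lightness_for(data["c"])

print(f"position ratio: {position_ratio:.3f}")
print(f"radius ratio:   {radius_ratio:.3f}")
print(f"lightness gap:  {lightness_gap:.3f}")
```

Under these assumptions, a 10% difference in value shows up as a full 10% difference in position but as a change of roughly 5% in a circle's radius, which is one geometric reason the area encoding is harder to read.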

Figure 15.17 Different graphical encodings of the same data. Note the variation in perceptibility of differences between values!

The horizontal axis for all the encoding types represents "group" that reads from 'a' to 'e'. The vertical axis for the position encoding represents "value" ranging from 0 to 9 in increments of 3. It is visualized as dots and it infers the following data: (a, 1), (b, 10), (c, 11.5), (d, 7), and (e, 8.5). The vertical axis for the length encoding represents "value" ranging from 0 to 30 in increments of 10. It is visualized in the form of bands from 'a' to 'e' and it infers the following data (from bottom to top): (band e, 0 to 8), (band d, 8 to 15), (band c, 15 to 25), (band b, 25 to 35), and (band a, 35 to 37). For area encoding, the vertical axis represents "value" that includes 0.0, 2.5, 5.0, 7.5, and 10.0. It is visualized as dots and the size of the dots increases as the value increases. All the plots are marked at the middle and thus infers the data: (a, 0.0), (b, 7.5), (c, 7.5), (d, 5.0), and (e, 5.0). For color encoding, the vertical axis represents "value" that includes 0.0, 2.5, 5.0, 7.5, and 10.0. It is visualized as small circles and the shades of the circles decrease as the value increases. All the circles are marked at the middle and thus infers the data: (a, 0.0), (b, 9.6), (c, 10.0), (d, 6.5), and (e, 7.5). Note: All the values are marked approximately.

Thus when a visualization designer makes a blanket claim like “You should always use a bar chart rather than a pie chart,” the designer is really saying, “A bar chart, which uses position encoding along a common scale, is more accurately visually decoded compared to a pie chart (which uses an angle encoding).”

To design your visualization, you should begin by encoding the most important data features with the most accurately decoded visual features (position, then length, then area, and so on). This will provide you with guidance as you compare different chart options and begin to explore more creative layouts.
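
One way to read this guideline is as a simple pairing of two ranked lists. The Python sketch below encodes the effectiveness ranking from this section and assigns encodings in order; the `assign_encodings` function and the priority order of the example variables are hypothetical. In practice a chart often uses position twice (the x and y axes), so treat this as a starting point rather than a rule.

```python
# Encodings for quantitative data, most effective first (from this section).
ENCODINGS_BY_EFFECTIVENESS = [
    "position", "length", "area", "angle", "color", "volume",
]

def assign_encodings(variables_by_priority):
    """Pair the most important variable with the most effective encoding."""
    if len(variables_by_priority) > len(ENCODINGS_BY_EFFECTIVENESS):
        raise ValueError("more variables than available encodings")
    return dict(zip(variables_by_priority, ENCODINGS_BY_EFFECTIVENESS))

# Hypothetical priority order for three disease burden metrics.
plan = assign_encodings(
    ["deaths", "years of life lost", "years lived with disability"]
)
print(plan)
```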

While these guidelines may feel intuitive, the volume and distribution of your data often make this task more challenging. You may struggle to display all of your data, requiring you to also work to maximize the expressiveness of your visualizations (see Section 15.4).

15.3.1 Effective Colors

Color is one of the most prominent visual encodings, so it deserves special consideration. To describe how to use color effectively in visualizations, it is important to understand how color is measured. While there are many different conceptualizations of color spaces, a useful one for visualization is the hue–saturation–lightness (HSL) model, which defines a color using three attributes:

The hue of a color, which is likely how you think of describing a color (e.g., “green” or “blue”)
The saturation or intensity of a color, which describes how “rich” the color is on a linear scale between gray (0%) and the full display of the hue (100%)
The lightness of the color, which describes how “bright” the color is on a linear scale from black (0%) to white (100%)

This color model can be seen in Figure 15.18, which is an example of an interactive color selector³ that allows you to manipulate each attribute independently to pick a color. The HSL model provides a good foundation for color selection in data visualization.

³HSL Calculator by w3schools: https://www.w3schools.com/colors/colors_hsl.asp

A snapshot of the HSL Calculator. — Figure 15.18 An interactive hue–saturation–lightness color picker, from w3schools.

The screen shows a thumbnail view of the selected color on the left. The text box on the right reads, hsl (153, 87 percent, 55 percent). Three fields along with sliders are shown at the bottom: H set to 153, S set to 87, and L set to 55.
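To make the three HSL attributes concrete, the following is a minimal sketch of the standard HSL-to-RGB conversion in base R. The function name `hsl_to_hex()` and its argument names are hypothetical (they are not part of R or of any package discussed in this book); saturation and lightness are assumed to be given as proportions between 0 and 1 rather than as percentages.

```r
# Convert an HSL color (hue in degrees, saturation and lightness in 0-1)
# to a hex string, using the standard HSL-to-RGB conversion.
# NOTE: hsl_to_hex() is a hypothetical helper written for this illustration.
hsl_to_hex <- function(hue, saturation, lightness) {
  stopifnot(saturation >= 0, saturation <= 1, lightness >= 0, lightness <= 1)
  sector <- (hue %% 360) / 60                          # position on the color wheel (0 to 6)
  chroma <- (1 - abs(2 * lightness - 1)) * saturation  # strength of the hue
  second <- chroma * (1 - abs(sector %% 2 - 1))        # second-largest RGB component
  base <- switch(floor(sector) + 1,
    c(chroma, second, 0),
    c(second, chroma, 0),
    c(0, chroma, second),
    c(0, second, chroma),
    c(second, 0, chroma),
    c(chroma, 0, second)
  )
  # Shift all three channels so that the color has the requested lightness
  channels <- pmin(pmax(base + (lightness - chroma / 2), 0), 1)
  rgb(channels[1], channels[2], channels[3])
}

figure_color <- hsl_to_hex(153, 0.87, 0.55) # the color selected in Figure 15.18
gray_version <- hsl_to_hex(153, 0, 0.55)    # same lightness with the saturation removed
print(c(figure_color, gray_version))
```

Setting the saturation to 0 produces a gray regardless of the hue, which is one way to see that the three attributes can be manipulated independently.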

When selecting colors for visualization, the data type of your variable should drive your decisions. Depending on the data type (categorical or continuous), the purpose of your encoding will likely be different:

For categorical variables, a color encoding is used to distinguish between groups. Therefore, you should select colors with different hues that are visually distinct and do not imply a rank ordering.
For continuous variables, a color encoding is used to estimate values. Therefore, colors should be picked using a linear interpolation between color points (i.e., different lightness values).
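As a rough illustration of these two goals, the sketch below builds one palette of each kind using only base R. The group count and the specific color values are assumptions chosen for demonstration. Note that `hcl()` uses the related hue–chroma–luminance model rather than HSL; it is used here because it ships with R.

```r
# Categorical variable: hues spaced evenly around the color wheel, with chroma
# and luminance held fixed so that no group appears to outrank another.
group_count <- 4
group_hues <- seq(0, 360, length.out = group_count + 1)[seq_len(group_count)]
categorical_colors <- hcl(h = group_hues, c = 60, l = 65)

# Continuous variable: interpolate linearly from a light color to a dark color,
# so that darker values read as larger values.
purple_ramp <- colorRampPalette(c("#FFFFFF", "#54278F"))
sequential_colors <- purple_ramp(5)

print(categorical_colors)
print(sequential_colors)
```

In practice, a curated palette is usually a better choice than hand-picked endpoints like these.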

Picking colors that most effectively satisfy these goals is trickier than it seems (and beyond the scope of this short section). But as with any other challenge in data science, you can build upon the open source work of other people. One of the most popular tools for picking colors (especially for maps) is Cynthia Brewer’s ColorBrewer.⁴ This tool provides a wonderful set of color palettes that differ in hue for categorical data (e.g., “Set3”) and in lightness for continuous data (e.g., “Purples”); see Figure 15.19. Moreover, these palettes have been carefully designed to be viewable to people with certain forms of color blindness. These palettes are available in R through the RColorBrewer package; see Chapter 16 for details on how to use this package as part of your visualization process.

⁴ColorBrewer: http://colorbrewer2.org

All palettes in the RColorBrewer package are displayed. — Figure 15.19 All palettes made available by the `RColorBrewer` package in `R`. Run the `display.brewer.all()` function to see them in RStudio.

The package includes the following: YlOrRd, YlOrBr, YlGnBu, YlGn, Reds, RdPu, Purples, PuRd, PuBuGn, PuBu, OrRd, Oranges, Greys, Greens, GnBu, BuPu, BuGn, Blues, Set3, Set2, Set1, Pastel2, Pastel1, Paired, Dark2, Accent, Spectral, RdYlGn, RdYlBu, RdGy, RdBu, PuOr, PRGn, PiYG, and BrBG.
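As a brief preview, the sketch below retrieves colors from two of these palettes with the `brewer.pal()` function. It assumes that the RColorBrewer package has been installed, and it prints a message instead of failing if the package is unavailable. The choice of five colors is arbitrary.

```r
# Retrieve five colors from a categorical palette and from a sequential palette.
# This only runs the palette code if RColorBrewer is installed.
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  set3_colors <- RColorBrewer::brewer.pal(n = 5, name = "Set3")       # differ in hue
  purples_colors <- RColorBrewer::brewer.pal(n = 5, name = "Purples") # differ in lightness
  print(set3_colors)
  print(purples_colors)
} else {
  message("Install the RColorBrewer package to run this example.")
}
```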

The appropriate type of color palette depends on the semantic meaning of the data. This choice is illustrated in Figure 15.20, which shows map visualizations of the population of each county in Washington state. For continuous data, the options include the following types of color scales:

Sequential color scales are often best for displaying continuous values along a linear scale (e.g., for this population data).
Diverging color scales are most appropriate when the divergence from a center value is meaningful (e.g., the midpoint is zero). For example, if you were showing changes in population over time, you could use a diverging scale to show increases in population using one hue, and decreases in population using another hue.
Multi-hue color scales afford an increase in contrast between colors by providing a broader color range. While this allows for more precise interpretations than a (single hue) sequential color scale, the user may misinterpret or misjudge the differences in hue if the scale is not carefully chosen.
Black and white color scales are equivalent to sequential color scales (just with a hue of gray!) and may be required for your medium (e.g., when printing in a book or newspaper).

Figure 15.20 Population data in Washington represented with four ColorBrewer scales. The sequential and black/white scales accurately represent continuous data, while the diverging scale (inappropriately) implies divergence from a meaningful center point. Colors in the multi-hue scale may be misinterpreted as having different meanings.
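To illustrate when a diverging scale is appropriate, the sketch below assigns colors to changes in population so that the neutral middle color corresponds to zero change. The data values, the endpoint colors, and the number of steps are all assumptions made for this demonstration.

```r
# Hypothetical change in population for five counties (negative = decrease)
population_change <- c(-1200, -300, 0, 450, 2000)

# An odd number of steps gives the palette a neutral middle color
step_count <- 11
diverging_palette <- colorRampPalette(c("#B2182B", "#F7F7F7", "#2166AC"))(step_count)

# Make the bins symmetric around zero so the neutral color lands on "no change"
largest_change <- max(abs(population_change))
bin_edges <- seq(-largest_change, largest_change, length.out = step_count + 1)
bin_index <- findInterval(population_change, bin_edges, all.inside = TRUE)
county_colors <- diverging_palette[bin_index]

print(data.frame(population_change, county_colors))
```

The key design choice is the symmetric set of bin edges: if the bins were instead spread from the minimum to the maximum value, the neutral color would no longer mark the meaningful center point.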

Overall, the choice of color will depend on the data. Your goal is to make sure that the color scale chosen enables the viewer to most effectively distinguish between the data’s values and meanings.

15.3.2 Leveraging Preattentive Attributes

You often want to draw attention to particular observations in your visualizations. This can help you drive the viewer’s focus toward specific instances that best convey the information or intended interpretation (to “tell a story” about the data). The most effective way to do this is to leverage the natural tendencies of the human visual processing system to direct a user’s attention. This class of natural tendencies is referred to as preattentive processing: the cognitive work that your brain does without you deliberately paying attention to something. More specifically, these are the “[perceptual] tasks that can be performed on large multi-element displays in less than 200 to 250 milliseconds.”⁵ As detailed by Colin Ware,⁶ the visual processing system will automatically process certain stimuli without any conscious effort. As a visualization designer, you want to take advantage of visual attributes that are processed preattentively, making your graphics as rapidly understood as possible.

⁵Healey, C. G., & Enns, J. T. (2012). Attention and visual memory in visualization and computer graphics. IEEE Transactions on Visualization and Computer Graphics, 18(7), 1170–1188. https://doi.org/10.1109/TVCG.2011.127. Also at: https://www.csc2.ncsu.edu/faculty/healey/PP/

⁶Ware, C. (2012). Information visualization: Perception for design. Philadelphia, PA: Elsevier.

As an example, consider Figure 15.21, in which you are able to count the occurrences of the number 3 at dramatically different speeds in each graphic. This is possible because your brain naturally identifies elements of the same color (more specifically, opacity) without having to put forth any effort. This technique can be used to drive focus in a visualization, thereby helping people quickly identify pertinent information.

Two graphics on the left and right depict preattentive attributes. — Figure 15.21 Because opacity is processed preattentively, the visual processing system identifies elements of interest (the number 3) without effort in the right graphic, but not in the left graphic.

Each graphic shows a long sequence of numbers below the question "How many 3s are there?" In the graphic on the right, the 3s are drawn with a higher opacity than the other numbers.

In addition to color, you can use other visual attributes that help viewers preattentively distinguish observations from those around them, as illustrated in Figure 15.22. Notice how quickly you can identify the “selected” point, though this identification happens more rapidly with some encodings (e.g., color) than with others!

A set of five graphs illustrates the preattentive attributes. — Figure 15.22 Driving focus with preattentive attributes. The selected point is clear in each graph, but especially easy to detect using color.

The horizontal (x) and vertical (y) axes of all the graphs range from 0.0 to 10.0, in increments of 2.5. The first graph, titled color, shows all points in black except one in red. The second graph, titled shape, shows all points as circles except one drawn as a cross mark. The third graph, titled enclosure, shows all points unenclosed except one, which is enclosed in a rectangle. The fourth graph, titled opacity, shows all points dimmed except one, which is bold. The fifth graph, titled size, shows all points at the same size except one, which is larger than the rest.

As you can see, color and opacity are two of the most powerful ways to grab attention. However, you may find that you are already using color and opacity to encode a feature of your data, and thus can’t also use these encodings to draw attention to particular observations. In that case, you can consider the remaining options (e.g., shape, size, enclosure) to direct attention to a specific set of observations.
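As a minimal sketch of this idea, the base R code below emphasizes a single observation in two ways: first with color, and then with size (for situations in which color already encodes another variable). The data values and the choice of which point to emphasize are hypothetical.

```r
# Hypothetical observations; the fourth point is the one to emphasize
x_values <- c(1.5, 3.0, 4.5, 6.0, 8.5)
y_values <- c(2.0, 7.5, 4.0, 9.0, 5.5)
is_selected <- seq_along(x_values) == 4

# Color: one saturated hue against otherwise neutral points
point_colors <- ifelse(is_selected, "red", "black")

# Size: an alternative when color is already used to encode data
point_sizes <- ifelse(is_selected, 2.5, 1)

par(mfrow = c(1, 2)) # draw the two versions side by side
plot(x_values, y_values, pch = 19, col = point_colors, main = "color", xlab = "x", ylab = "y")
plot(x_values, y_values, pch = 19, cex = point_sizes, main = "size", xlab = "x", ylab = "y")
```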

15.4 Expressive Data Displays

The other principle you should use to guide your visualization design is to choose layouts that allow you to express as much data as possible. This goal was originally articulated as Mackinlay’s Expressiveness Criteria⁷ (clarifications added):

⁷Mackinlay, J. (1986). Automating the design of graphical presentations of relational information. ACM Transactions on Graphics, 5(2), 110–141. https://doi.org/10.1145/22949.22950. Restatement by Jeffrey Heer.

A set of facts [data] is expressible in a language [visual layout] if that language contains a sentence [form] that

encodes all the facts in the set,
encodes only the facts in the set.

The point of this expressiveness aim is to devise visualizations that express all of (and only) the data in your data set. The most common barrier to expressiveness is occlusion (overlapping data points). As an example, consider Figure 15.23, which visualizes the distribution of the number of deaths attributable to different causes in the United States. This chart uses the most accurately decoded visual encoding (position), but fails to express all of the data due to the overlap in values.

A chart representing the distribution of Number of deaths in the United States. — Figure 15.23 Position encoding of the number of deaths from each cause in the United States. Notice how the overlapping points (occlusion) prevent this layout from expressing all of the data. Some outliers have been removed for demonstration.

The horizontal band representing the number of deaths for each cause ranges from 0 to 40k, in increments of 10k. The chart shows solid circles marked along the axis representing the data. The circles from 0 to 20k are dense and overlap each other. The circles from 20k to 30k are fewer than in the previous range, yet still overlap each other. The circles from 30k to 40k are very few in number and do not overlap.

There are two common approaches to address the failure of expressiveness caused by overlapping data points:

Adjust the opacity of each marker to reveal overlapping data.
Break the data into different groupings or facets to alleviate the overlap (by showing only a subset of the data at a time).

Both approaches are combined in Figure 15.24.

A set of three charts representing the distribution of Number of deaths from each cause in the United States. — Figure 15.24 Position encoding of the number of deaths from each cause in the United States, faceted by the category of each cause. The use of a lower opacity in conjunction with the faceting enhances the expressiveness of the plots. Some outliers have been removed for demonstration.

The horizontal band of all the charts represents the number of deaths for each cause and it ranges from 0 to 40k, in increments of 10k. The first chart titled communicable shows solid circles overlapping in the range 0 to 10k. The second chart titled injuries shows solid circles overlapping in the ranges 0 to 10k, near 20k, near 35k, and after 40k. The third chart titled non-communicable shows solid circles overlapping each other throughout the band. The circles are dense near the left end.
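The two approaches for reducing occlusion can be sketched in base R as follows. The data frame below is a small hypothetical stand-in for the actual data set (its column names and values are assumptions), and the opacity of 0.3 is an arbitrary choice.

```r
# Hypothetical data: the number of deaths for each cause, with each cause's category
deaths <- data.frame(
  category = c("communicable", "communicable", "injuries", "injuries",
               "non-communicable", "non-communicable", "non-communicable"),
  count = c(1200, 1500, 800, 21000, 900, 1100, 36000)
)

# Approach 1: lower the opacity so that overlapping markers appear darker
translucent <- adjustcolor("black", alpha.f = 0.3)

# Approach 2: break the data into one group (facet) per category
by_category <- split(deaths$count, deaths$category)

# Draw one strip of points per category, sharing the same horizontal scale
par(mfrow = c(length(by_category), 1))
for (category_name in names(by_category)) {
  stripchart(by_category[[category_name]], pch = 19, col = translucent,
             xlim = range(deaths$count), main = category_name,
             xlab = "Number of deaths")
}
```

Keeping the same `xlim` in every panel preserves the common scale, so positions remain comparable across the facets.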

Alternatively, you could consider changing the data that you are visualizing by aggregating it in an appropriate way. For example, you could group your data by values that have similar numbers of deaths (putting each into a “bin”), and then use a position encoding to show the number of observations per bin. The result is the commonly used layout known as a histogram, as shown in Figure 15.25. While this visualization does communicate summary information to your audience, it is unable to express each individual observation in the data (which would communicate more information through the chart).

A histogram named "Distribution of the Number of Deaths for Each Cause" represents the number of deaths attributable to each cause. — Figure 15.25 Histogram of the number of deaths attributable to each cause.

The horizontal axis of the histogram represents "Number of Deaths" ranging from 0 to 40k in increments of 10k. The vertical axis represents "Number of Causes" ranging from 0 to 40 in increments of 10. The graph shows the following data: (0, 47), (1k, 15), (2k, 10), (6k, 10), (9k, 7), (20k, 0), (39k, 1), and (44k, 1). Note: All the values are approximate.
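The binning step behind a histogram can be sketched with the base R `hist()` function. The vector of values and the bin width of 10,000 are assumptions chosen for illustration; they are not the data shown in Figure 15.25.

```r
# Hypothetical number of deaths attributable to each of nine causes
deaths_per_cause <- c(300, 800, 1200, 1500, 2500, 6000, 9000, 21000, 36000)

# Group the values into bins that are each 10,000 deaths wide
bin_edges <- seq(0, 40000, by = 10000)
binned <- hist(deaths_per_cause, breaks = bin_edges, plot = FALSE)
print(binned$counts) # the number of causes that fall into each bin

# Draw the histogram, which encodes each bin's count using position (bar height)
hist(deaths_per_cause, breaks = bin_edges,
     main = "Distribution of the Number of Deaths for Each Cause",
     xlab = "Number of Deaths", ylab = "Number of Causes")
```

The counts summarize the distribution, but the individual values inside each bin can no longer be recovered from the chart.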

At times, the expressiveness and effectiveness principles are at odds with one another. In an attempt to maximize expressiveness (and minimize the overlap of your symbols), you may have to choose a less effective encoding. While there are multiple strategies for this—for example, breaking the data into multiple plots, aggregating the data, and changing the opacity of your symbols—the most appropriate choice will depend on the distribution and volume of your data, as well as the specific question you wish to answer.

15.5 Enhancing Aesthetics

Following the principles described in this chapter will go a long way toward helping you devise informative visualizations. But to gain trust and attention from your potential audiences, you will also want to invest time in the aesthetics (i.e., beauty) of your graphics.

Tip

Making beautiful charts is a practice of removing clutter, not adding design.

One of the most renowned data visualization theorists, Edward Tufte, frames this idea in terms of the data–ink ratio.⁸ Tufte argues that in every chart, you should maximize the ink dedicated to displaying the data (and in turn, minimize the non-data ink). This can translate to a number of actions:

⁸Tufte, E. R. (1986). The visual display of quantitative information. Cheshire, CT: Graphics Press.

Remove unnecessary encodings. For example, if you have a bar chart, the bars should have different colors only if that information isn’t otherwise expressed.
Avoid visual effects. Any 3D effects, unnecessary shading, or other distracting formatting should be avoided. Tufte refers to this as “chartjunk.”
Include chart and axis labels. Provide a title for your chart, as well as meaningful labels for your axes.
Lighten legends/labels. Reduce the size or opacity of axis labels. Avoid using striking colors.

It’s easy to look at a chart such as the chart on the left side of Figure 15.26 and claim that it looks unpleasant. However, describing why it looks distracting and how to improve it can be more challenging. If you follow the tips in this section and strive for simplicity, you can remove unnecessary elements and drive focus to the data (as shown on the right-hand side of Figure 15.26).

Two bar charts, on the left and right, illustrate the removal of distracting features and the addition of informative labels. — Figure 15.26 Removing distracting and uninformative visual features (left) and adding informative labels to create a cleaner chart (right).

The graph on the left shows the horizontal axis marked with values: a, b, c, d, and e. The vertical axis ranges from 0 to 12, in increments of 2. The graph shows five bars in different colors. The graph on the right shows the horizontal axis labeled group marked with the values: a, b, c, d, and e. The vertical axis labeled size ranges from 0 to 9, in increments of 3. The graph represents the group size data.
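Several of these actions can be sketched with the base R `barplot()` function. The group sizes below are hypothetical, and the specific styling choices (a single gray fill, no bar borders) are only one possible way to reduce non-data ink.

```r
# Hypothetical size of each group
group_sizes <- c(a = 1, b = 8, c = 9, d = 6, e = 7)

bar_midpoints <- barplot(
  group_sizes,
  col = "gray40",                # a single color: the groups are already labeled on the axis
  border = NA,                   # remove the outlines around the bars
  las = 1,                       # draw axis labels horizontally so they are easier to read
  main = "Size of Each Group",   # chart title
  xlab = "Group",                # meaningful axis labels
  ylab = "Size"
)
```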

Luckily, many of these optimal choices are built into the default R packages for visualization, or are otherwise readily implemented. That being said, you may have to adhere to the aesthetics of your organization (or your own preferences!), so choosing an easily configurable visualization package (such as ggplot2, described in Chapter 16) is crucial.

As you begin to design and build visualizations, remember the following guidelines:

Dedicate each visualization to answering a specific question of interest.
Select a visual layout based on your data type.
Choose optimal graphical encodings based on how well they are visually decoded.
Ensure that your layout is able to express your data.
Enhance the aesthetics by removing visual effects, and by including clear labels.

These guidelines are a helpful start. Above all, remember that visualization is about insight, not pictures.
