6. A Visualization Tool for Mining Large Correlation Tables: The Association Navigator (3/6)


FIGURE 6.7
Scatterplots and histograms/barplots for three variable pairs. Corr, correlation; pval, p-value.
[Figure not reproduced. Frame 1: n = 1887, Corr = 0.824 (pval = 0), variables ssc_diagnosis_verbal_iq_p1.CDV and ssc_diagnosis_nonverbal_iq_p1.CDV. Frame 2: n = 1880, Corr = 0.707 (pval = 0), variables ados_module_p1.CDV and ssc_diagnosis_vma_p1.CDV. Frame 3: n = 1887, Corr = 0.76 (pval = 0), variables ssc_diagnosis_verbal_iq_type_p1.CDV and ssc_diagnosis_nonverbal_iq_type_p1.CDV.]

marginal histograms (for quantitative variables) and barcharts (for categorical variables).

From Figure 6.7, we can draw a few conclusions and recommendations:

• A most basic use of the plots is to note the type of the variables: In Figure 6.7, both variables on the left (..nonverbal iq.. and ..verbal iq..) and the y-variable in the center (..vma..) are quantitative; the x-variable in the center (ados module..) is apparently ordinal with four levels, and both variables on the right (nonverbal iq type and verbal iq type) are binary. Quantitative variables can have strong marginal features: it might be of interest to observe that the x-variable on the left is slightly bimodal, with a major mode around x = 90 and a minor mode around x = 30.∗ The y-variable in the center scatterplot is partially censored on the upper side at about y = 210, as can be seen both in the scatterplot and in the (lower) histogram.

• Categorical variables, when scored numerically, can be gainfully displayed in scatterplots. It is useful to jitter them to avoid being misled by overplotting. In Figure 6.7, jittering is applied to the x-variable in the center scatterplot and to both binary variables in the right-hand scatterplot.

∗ The bimodality of the IQ distribution is a measurement artifact: for cognitively highly impaired probands, a different and more appropriate IQ test is administered. In theory, this alternative test should be scaled to cohere with the test administered to the majority, but in practice it creates a minor mode in the low end of the IQ distribution, more so for verbal IQ than for nonverbal IQ.


• To enhance the perception of the association, the scatterplots can be decorated with smooths for continuous variables and with traces of group means when the x-variable is categorical with fewer than, say, eight groups (default, can be changed). In the left and center scatterplots of Figure 6.7, the associations of the y-variables with the x-variables are seen to be somewhat nonlinear, but compared to the linear component of the association, the nonlinearities are relatively modest.∗
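The decorations described in the list above (jittering, traces of group means, and smooths) can be sketched in base R. This is a minimal illustration on simulated data, not the AN's own code; the data and the names module, vma, iq_x, and iq_y are hypothetical stand-ins.

```r
set.seed(1)
n <- 500

# Hypothetical data: an ordinal x with four levels and a quantitative y
module <- sample(1:4, n, replace = TRUE)
vma    <- 40 * module + rnorm(n, sd = 25)

# Jitter the categorical x so that overplotted points become visible
x_jit <- jitter(module, amount = 0.2)

# Trace of group means, suitable when x has only a few levels
group_means <- tapply(vma, module, mean)

plot(x_jit, vma, pch = 16, col = "grey50",
     xlab = "module (jittered)", ylab = "vma")
lines(as.numeric(names(group_means)), group_means, type = "b", lwd = 2)

# For a continuous x, a smooth (here lowess) plays the same role
iq_x <- rnorm(n, mean = 90, sd = 20)
iq_y <- 0.8 * iq_x + rnorm(n, sd = 12)
plot(iq_x, iq_y, pch = 16, col = "grey50")
lines(lowess(iq_x, iq_y), lwd = 2)
```

The jitter amount and the choice of lowess as smoother are arbitrary here; the AN may use different defaults.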

The AN shows scatterplots and histograms/barcharts in a window separate from the blockplot window, one triple of plots at a time. To overcome the one-at-a-time limitation, the AN also offers scatterplot matrices (sometimes called sploms) of arbitrary numbers of variables. An example, involving four variables (different from those in Figure 6.7), is shown in Figure 6.8. For readers not familiar with scatterplot matrices, note that each variable pair is shown twice, in plots located symmetrically off the diagonal, and with reversed roles as x- and y-variables. Each diagonal cell shows a variable label that indicates (1) the common x-axis in the column of the cell and (2) the common y-axis in the row of the cell. For readers familiar with scatterplot matrices, note that we show the vertical order of the variables ascending from bottom to top, the reason being consistency with the convention we use in the blockplots.
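For readers who want to reproduce the bottom-to-top convention outside the AN, base R's pairs() has a row1attop argument that gives a comparable layout. The sketch below uses simulated data with hypothetical variable names v1 to v4; it is not the AN's splom code.

```r
set.seed(2)
n <- 300

# Hypothetical stand-in data: three quantitative variables and one binary
z <- rnorm(n)
dat <- data.frame(v1 = z + rnorm(n),
                  v2 = z + rnorm(n),
                  v3 = -z + rnorm(n),
                  v4 = as.numeric(z + rnorm(n) > 0))

# row1attop = FALSE places the first variable in the bottom row, so the
# labeled diagonal ascends from bottom left to top right
pairs(dat, row1attop = FALSE, pch = 16, cex = 0.5)
```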

FIGURE 6.8
Scatterplot matrix of four variables. (Note the convention for the vertical order of the variables: bottom to top, for consistency with the blockplots.)
[Figure not reproduced. Diagonal labels as extracted: ados2 algorithm p1.OCUV; srs teacher score p1.CDV; srs parent score p1.CDV; vineland composite standard score p1.CDV.]

∗ The nonlinearity on the left could be due to the marginal distributions. The nonlinearity in the center is expected by the expert: verbal mental age (vma) on the y-axis should be considerably higher on average in ADOS modules 3 and especially 4 because these modules or levels are formed from a simple test of language competence.


As for particulars of the scatterplot matrix shown in Figure 6.8, the visually most striking

features concern marginal distributions, not associations: The ﬁrst variable is capped at the

maximal value +90, and the fourth variable is binary. Otherwise the associations look simply

monotone and seem well summarized by correlations.

6.3.6 Variations on Blockplots

Blockplots are not the most common visualizations of correlation tables. As a Google search

of correlation plot reveals, the most frequent visual rendering of correlation tables is in terms

of heatmaps where square cells are always ﬁlled and numeric values are coded on a gray or

color scale. An example is shown in the left frame of Figure 6.9; for comparison, the right

frame shows the corresponding blockplot. Here are a few observations about the two types

of plots:

• Color or gray scale is generally a weaker visual cue than size. This argument favors

blockplots as long as the blocks are not too small, that is, as long as the view is not

zoomed out too much. The superiority of blockplots over heatmaps is also noted by

Wickham et al. (2006, Figure 2).

• In heatmaps, color fuses adjacent cells when they are close in value. This may or may

not be a problem for the trained eye, but there is a loss of identity of the rows and

columns in heatmaps.

• Heatmaps do not permit markup with background color because they ﬁll the square or

rectangular cells completely. This problem can be overcome by shrinking the heatmap

cells somewhat to allow some surrounding space to be freed up that can be ﬁlled with

background color for markup, as shown in the center frame of Figure 6.9. This method

of rendering, however, seems to further decrease the crispness of heatmaps.

• Heatmaps perform nicely when the view is heavily zoomed out, in which case the

individual blocks are so small that size is no longer visually functional as a cue. In this

case, color coding works well and gives an accurate impression of global structure. We

solve this problem for blockplots by showing only 10,000 or so of the largest correlations

when heavily zoomed out. Thinning the table in this manner works well even when the

visible table is so large that each cell is strictly speaking below the pixel resolution of a

raster screen.
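The thinning idea in the last point can be sketched as follows. The helper thin_correlations is hypothetical and only illustrates keeping the k largest correlations in absolute value; how the AN selects and draws them is not specified in the text.

```r
# Hypothetical helper: keep the k largest off-diagonal cells in absolute
# value and blank out the rest with NA. Because the matrix is symmetric,
# each variable pair occupies two cells, so k counts cells, not pairs.
thin_correlations <- function(cm, k) {
  off <- row(cm) != col(cm)
  a <- abs(cm[off])
  if (k >= length(a)) return(cm)
  cutoff <- sort(a, decreasing = TRUE)[k]
  out <- cm
  out[off & abs(cm) < cutoff] <- NA
  out
}

set.seed(3)
cm <- cor(matrix(rnorm(50 * 30), ncol = 30))
thinned <- thin_correlations(cm, k = 100)

# Number of off-diagonal cells that survive the thinning
sum(!is.na(thinned[row(thinned) != col(thinned)]))
```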

Because neither of the two types of plots, blockplots or heatmaps, is uniformly superior at all scales, the AN provides both, and with one keystroke, one can toggle between the two rendering methods. Varying block size allows for the mixed variant shown in the center of Figure 6.9.
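To make the comparison concrete, here is a minimal base R sketch that draws the same correlation table once as a heatmap and once as a blockplot. It is not the AN's rendering code: the data are simulated, block side length is taken proportional to the absolute correlation for simplicity, and the colors used for sign are an arbitrary choice.

```r
set.seed(4)
p <- 12
cm <- cor(matrix(rnorm(40 * p), ncol = p))

# One entry per cell: column index gives x, row index gives y
grid <- expand.grid(row = 1:p, col = 1:p)
r <- cm[cbind(grid$row, grid$col)]
cols <- ifelse(r > 0, "blue", "red")

op <- par(mfrow = c(1, 2), pty = "s")

# Heatmap: every cell is filled completely; the value is coded by gray level
image(1:p, 1:p, t(cm), zlim = c(-1, 1),
      col = gray(seq(0, 1, length.out = 64)),
      xlab = "", ylab = "", main = "heatmap")

# Blockplot: the value is coded by the size of a square, the sign by color;
# a correlation of +1 or -1 fills its unit cell completely
plot(0, 0, type = "n", xlim = c(0.5, p + 0.5), ylim = c(0.5, p + 0.5),
     xlab = "", ylab = "", main = "blockplot", asp = 1)
symbols(grid$col, grid$row, squares = abs(r), inches = FALSE, add = TRUE,
        bg = cols, fg = cols)

par(op)
```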

Visualization of correlation tables has a small literature in statistics. An early reference

that addresses large correlation tables is Hills (1969), who applies half-normal plots to

tell statistically signiﬁcant from insigniﬁcant correlations and clusters variables visually

in two-dimensional projections. Closer to the present work are articles by Murdoch and

Chow (1996) and Friendly (2002). Both propose relatively complex renderings of correlations

with ellipses or augmented circles that may not scale up to the sizes of tables we have in

mind but may be useful for conveying richer information for tables that are smaller, yet too

large for numeric table display. Blockplot coding, which uses squares, has the advantage

that these shapes can completely ﬁll their cells to represent extremal correlations as these

are geometrically similar to the shapes of the containing cells (at least if the default

aspect ratio of the blockplot is maintained), whereas all other shapes leave residual space

even when maximally expanded.

What we prefer to call descriptively blockplots, possibly contracted to blots, has previously been named fluctuation diagrams (Hofmann 2000). Under this term, one can find a static implementation in the R-package extracat on the CRAN site authored by Pilhoefer and Unwin (2013). Static software for heatmaps is readily available, for example, in the R-function heatmap(). Heatmaps are often applied to raw data tables, but they can be equally applied to correlation tables. Many variations of glyph coding can be found in the classic book by Bertin (1983).

FIGURE 6.9
A heatmap (left) compared with a corresponding blockplot (right), as well as a shrunk heatmap (center).

An interesting aspect of blockplots is that there exists science regarding the perception

of area size. A general theory holds that most continuous stimuli (continua such as length,

area, volume, weight, brightness, and loudness) result in perceptions according to Stevens’

power law (Stevens 1957; Stevens and Galanter 1957; Stevens 1975). That is, a quantitative

stimulus x translates to a quantitative perception p(x) through a law of the form $p(x) = c\,x^{\beta}$. As discussed by Cleveland (1985, p. 243) with reference to Stevens (1975), for area perception the power is about $\beta = 0.7$, meaning that an actual area ratio of 2:1 is on average perceived as a ratio of $(2{:}1)^{0.7} \approx 1.62$. This law can be leveraged to determine the transformation that should be used to map correlations to squares in a blockplot. In R the symbol size is parametrized in terms of a linear expansion factor called cex (character expansion). Our goal is to use block sizes such that their perceived ratios faithfully reflect the ratios of respective correlations. This results in the condition $\mathrm{cor} \sim p(\mathrm{cex}^2) \sim (\mathrm{cex}^2)^{0.7} = \mathrm{cex}^{1.4}$; hence, $\mathrm{cex} \sim \mathrm{cor}^{1/1.4} \approx \mathrm{cor}^{0.7}$. This is indeed the default power transformation

in the AN, although users can change it (see Appendix B). If most correlations are very

small, a power closer to zero will expand the range of small values, resulting in enhanced

discrimination at the low end at the cost of attenuated discrimination at the high end.
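The effect of the power transformation can be checked numerically. The sketch below follows the argument in the text; the function name cor_to_cex is hypothetical and is not part of the AN's interface.

```r
# Map |correlation| to a linear block size (cex) with a power transform.
# power = 0.7 is the value motivated by Stevens' law for area perception.
cor_to_cex <- function(r, power = 0.7) abs(r)^power

cors <- c(0.1, 0.25, 0.5, 1)
cex  <- cor_to_cex(cors)

# Perceived size under Stevens' law: (cex^2)^0.7 = cex^1.4
perceived <- (cex^2)^0.7

# Ratios of perceived sizes stay close to the ratios of the correlations
round(perceived / perceived[4], 3)

# A power closer to zero expands the range of small correlations
round(cor_to_cex(cors, power = 0.3), 3)
```

Because 0.7 is a rounded version of 1/1.4, the perceived sizes are proportional to cor^0.98 rather than exactly to cor.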

6.4 Operation of the AN

The purpose of the AN is to generate the displays described above in rapid order and

even with real-time motion. Numerous real-time operations are under mouse and keyboard

control, while a few text-based operations are under dialog and menu control. Further

parameters can be controlled from the R language (see Appendix B), but this will not be

necessary for most users. This section describes the operations of the AN, the purposes they

serve, as well as a minimal set of R-related instructions that concern one-time setup, regular

starting up, and saving of state. The software will be available as an R-package, but the

instructions below do not reﬂect this and get the reader going by sourcing the software from

the ﬁrst author’s site.


6.4.1 Starting Up the AN

In order to simply see some AN running, the reader may paste the following code into an

R interpreter:

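source("http://stat.wharton.upenn.edu/~buja/association-navigator.R")  # download and source the AN software
p <- 200                                    # number of variables (columns)
mymatrix <- matrix(rnorm(20000), ncol = p)  # 100 rows by 200 columns of normal random numbers
colnames(mymatrix) <- paste("V", 1:p, "_", c(rep("A", p/2), rep("B", p/2)), sep = "")  # suffixes A and B define two blocks
a.n <- a.nav.create(mymatrix)               # create an AN instance from the data
a.nav.run(a.n)                              # start the AN; type Q in its window to quit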

This code will download and source the software, generate an artificial data matrix of normal random numbers, generate an instance of an AN from it, and start up by creating a window showing a blockplot of correlations as they arise from pure random association among 200 variables given a sample size of 100, divided into two blocks of 100 variables each, suffixed A and B, respectively. The reader may left-drag the mouse in the plot to see a first real-time response.
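As a quick check of what this artificial example contains, the following sketch rebuilds the same kind of matrix without the AN and looks at the size of the purely random correlations. The seed is arbitrary; nothing here depends on the AN software.

```r
set.seed(5)
p <- 200
mymatrix <- matrix(rnorm(20000), ncol = p)
dim(mymatrix)              # 100 rows (cases) by 200 columns (variables)

cm <- cor(mymatrix)
r  <- cm[upper.tri(cm)]    # the 200 * 199 / 2 distinct variable pairs
length(r)
sd(r)                      # roughly 1 / sqrt(n - 1), about 0.1 for n = 100
max(abs(r))                # the largest correlation arising by chance alone
```

With a sample size of only 100, null correlations of a few tenths occur among the nearly 20,000 pairs, which is what the blockplot of this example displays.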

To prevent confusion in the operation of an AN, users should note the following

fundamental points:

• Important: While the AN is running, the R interpreter (R Gui) is blocked by the

execution of the AN’s event loop! All interactions must be directed at the master window

of the AN, which usually shows a blockplot.

• Quitting the AN and returning to the R interpreter is done by typing the capital letter Q

into the AN master window. The master window will remain as a passive R plot window.

It will no longer respond to user input, but the R interpreter (R Gui) will be responsive

again. (A live AN can also be stopped violently by typing interrupt characters ctrl-C

into the R interpreter or by killing the AN master window, but an educated R user

wouldn’t be this crude.)

• Help: On typing the letter h into a live AN, a help window will appear with terse

documentation of all AN interactions. The window is meant to give reminders to

previously initiated AN users, not introductions to beginners. The help window is

actually a menu such that selecting a line documenting a keystroke will emulate the

eﬀects of the keystroke. Because the help window is a menu, it must be closed in order

to regain the AN’s attention. (This behavior will be changed in a future version.)

• Notion of state: An AN instance has an internal state. As a consequence, whenever a

user stops a live AN and restarts it, it will resume in the exact state in which it was

stopped.

• Saving state: From the previous point it follows that the state of an AN is saved across R sessions if the core image has been saved (save.image()) before quitting the R session.
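The mechanics of saving and restoring an R object across sessions can be sketched as below. The list a.n is only a stand-in for an AN instance, and the file name is hypothetical; the sketch uses save() on a single object so that it is self-contained, whereas save.image() as mentioned above stores the entire workspace.

```r
# Stand-in for an AN instance; in practice this would be the object
# returned by a.nav.create()
a.n <- list(note = "hypothetical stand-in for AN state")

# Hypothetical file location for the saved state
img <- file.path(tempdir(), "an-session.RData")

# Save the object; save.image(file = img) would save the whole workspace
save(a.n, file = img)

# Simulate a fresh session: remove the object, then restore it from disk
rm(a.n)
load(img)
exists("a.n")
```

Whether an AN instance carries all of its state inside the object itself is an assumption here; saving the full image with save.image() is the route the text describes.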

6.4.2 Moving Around: Crosshair Placement, Panning, and Zooming

When an AN is run for the ﬁrst time, it shows an overview of the complete correlation

table, which may comprise hundreds of variables. Most likely the variables will be organized

in variable groups that are characterized by shared suﬃxes of variable names and visually

form a series of highlight squares along the ascending diagonal. The ﬁrst order of business

is to zoom in and pan up and down the ascending diagonal to gain an overview of these

sub-tables. Here are the steps:
