98 Handbook of Big Data
dialog (Sort Adjustees...): The result is a sorted list of the adjustees according to the R² values from the regression of the adjustees/y-variables onto the adjustors/x-variables. See Figure 6.14 for an example.
6.4.12 Future of AN
The functionality described here reflects the 2015 implementation of the AN. Changes to
the AN are planned, the major one being a redesign to give the lens windows interactive
responsiveness as well. Currently all interaction is funneled through the blockplot window,
even if the actions affect the lens window.
Other obvious functionality is still missing, above all sorting of variables, manual and
algorithmic. A limited set of sorting operations may be added in a future version of the AN.
If readers of this chapter and users of the AN have further suggestions, the authors would
appreciate hearing them.
Appendix A: Versatility of Correlation Analysis
We return to the apparent limitations of correlations as measures of association, which were
left as a loose end in Section 6.2. We address the objections that (1) correlations are
measures of linear association only, (2) correlations reflect bivariate association only, and
(3) correlations apply to quantitative variables only. Toward this end, we make the following
observations and recommendations:
1. Although it is true that correlation is strictly speaking a measure of linear
association among quantitative variables, it is also a fact that correlation is useful
as a measure of monotone association in general, even when it is nonlinear. As
long as the association is roughly monotone, correlation will be positive when
the association is increasing and negative when it is decreasing. Admittedly,
correlation is not an optimal measure of nonlinear monotone association, but
it is still a useful one, in particular in the large-p problem. Finally, if gross
nonlinearity is discovered, it is always possible to replace a variable X with a
nonlinear transform f(X) [often log(X)] so its association with other variables
becomes more linear.
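Both points, that correlation picks up monotone association even when it is nonlinear, and that a log transform can make the association more linear, are easy to check numerically. The following is a small sketch (Python here for a self-contained check; the data and the exponential relationship are hypothetical):

```python
import math

# Hypothetical monotone but strongly nonlinear association: Y grows exponentially in X
x = [i / 19 for i in range(20)]            # 20 equally spaced points on [0, 1]
y = [math.exp(4 * xi) for xi in x]         # increasing, hence monotone, but nonlinear

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

r_raw = pearson(x, y)                          # positive, but well below 1
r_log = pearson(x, [math.log(v) for v in y])   # log(Y) = 4X is exactly linear

print(r_raw > 0, r_log > r_raw)
```

The raw correlation is positive, correctly signaling the increasing association, and replacing Y with log(Y) drives the correlation to 1 because the transformed association is exactly linear.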
2. The objection that correlations only reflect bivariate association is factually
correct but practically not very relevant. In practical data analysis, it is too
contrived to entertain the possibility that, for example, there exists association
among three variables, but there exists no monotone association among each pair
of variables.
In general, one follows the principle that lower-order association
is more likely than higher-order association; hence, pairwise association is more
likely than true interaction among three variables. Therefore, data analysts look
first for groups of variables that are linked by pairwise association, and thereafter
they may examine whether these variables also exhibit higher-order association.
Note, however, that even multivariate methods such as principal components
analysis (PCA) do not detect true higher-order interaction because they, too, rely
on correlations only. Finally, we are not asserting that simple correlation analysis
should be the end of data analysis, but it should certainly be near the beginning
in the large-p problems envisioned here, namely, in the analysis of relatively noisy
data as they arise in many social science and medical contexts.
Footnote: Linearity of association is not a simple concept. For one thing, it is asymmetric: if Y is linearly
associated with X, it does not follow that X is linearly associated with Y. The reason is that the definition
of linear association, E[Y | X] = β₀ + β₁X, is not symmetric in X and Y. Linearity of association in both
directions holds only for certain nice distributions such as bivariate Gaussians. A counterexample is as
follows: let X be uniformly distributed on an interval and Y = β₀ + β₁X + ε with independent Gaussian
ε; then Y is linearly associated with X by construction, yet X is not linearly associated with Y.
Footnote: An example would be three variables jointly uniformly distributed on the surface of a two-sphere in
three-space.
3. The final objection we consider is that correlations do not apply to categorical
variables. This objection can be refuted with very practical advice on how to make
categorical data quantitative and how to interpret the meaning of the resulting
correlations. We discuss several cases in turn:
a. If a categorical variable X is ordinal (its categories have a natural order),
it is common practice to simply number the categories in order and use the
resulting integer variable as a quantitative variable. The resulting correlations
will be able to reflect monotone association with other variables that may
be expressed by saying “the higher categories of X tend to be associated
with higher/lower values/categories of other variables.” An obvious objection
is that the equi-spaced integers may not be a good quantification of the
categories. If this is a serious concern worth some effort, one may want
to look into optimal scoring procedures [see, e.g., De Leeuw and van
Rijckevorsel (1980) and Gifi (1990)]. The idea behind these methods is to
estimate new scores for the categorical variables by making them as linearly
associated as possible through optimization of the fit of a joint PCA.
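A one-pair special case conveys the idea behind such scoring methods: if the scores of a categorical X are chosen to maximize its correlation with a single quantitative Y, the optimal scores turn out to be the category means of Y. The sketch below (Python, hypothetical data) contrasts this with naive equi-spaced integer coding; the full HOMALS/PRINCALS machinery of the cited references handles many variables jointly rather than one pair:

```python
import math
from collections import defaultdict

# Hypothetical ordinal variable X with three categories and a quantitative Y
x = ["low", "low", "mid", "mid", "mid", "high", "high", "high", "low", "high"]
y = [1.0, 1.4, 3.9, 4.2, 3.6, 5.0, 5.5, 4.8, 1.2, 5.2]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

# Naive equi-spaced integer coding of the ordered categories
codes = {"low": 1, "mid": 2, "high": 3}
r_int = pearson([codes[c] for c in x], y)

# Optimal scores for this single pair: the mean of Y within each category
groups = defaultdict(list)
for c, v in zip(x, y):
    groups[c].append(v)
means = {c: sum(vals) / len(vals) for c, vals in groups.items()}
r_opt = pearson([means[c] for c in x], y)

print(r_opt >= r_int)  # optimal scoring can only improve on integer coding
```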
b. If a categorical variable X is binary, it is common practice to numerically code
its two categories with the values 0 and 1, thereby creating a so-called dummy
variable. This practice is pervasive in the analysis of variance (ANOVA), but
its usefulness is less well known in multivariate analysis, which is our concern.
The interpretation of correlations with dummy variables is highly interesting
as it solves two seemingly different association problems:
i. First-order association between a binary variable X and a quantitative
variable Y means that there exists a difference between the two means
of Y in the two groups denoted by X. As it turns out, the correlation
of a dummy variable X with a quantitative variable Y is mathematically
equivalent to forming the t-statistic for a two-sample comparison of the
two means of Y in the two categories of X (t ∝ r/(1 − r²)^1/2). Even more,
the statistical test for a zero correlation is virtually identical to the t-test
for equality of the two means. Thus, two-sample mean comparisons can be
subsumed under correlation analysis.
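This equivalence is easy to verify numerically. The sketch below (Python, hypothetical data) recovers the t-statistic from the dummy-variable correlation via the classical identity t = r√(n − 2)/√(1 − r²) and compares it with the pooled two-sample t-statistic:

```python
import math

# Hypothetical two groups: Y values for the X = 0 and X = 1 categories
y0 = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8]
y1 = [2.9, 3.1, 2.7, 3.3, 2.8, 3.0]
y = y0 + y1
x = [0] * len(y0) + [1] * len(y1)   # the 0/1 dummy variable
n = len(y)

def mean(v):
    return sum(v) / len(v)

# Pearson correlation of the dummy x with y (the point-biserial correlation)
mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)

# t-statistic recovered from the correlation
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Classical pooled two-sample t-statistic
s2 = (sum((v - mean(y0)) ** 2 for v in y0) +
      sum((v - mean(y1)) ** 2 for v in y1)) / (n - 2)
t_pooled = (mean(y1) - mean(y0)) / math.sqrt(s2 * (1 / len(y0) + 1 / len(y1)))

print(abs(t_from_r - t_pooled) < 1e-9)  # the two statistics coincide
```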
ii. Association between two binary variables means that their 2×2 table shows
dependence. This situation is usually addressed with Fisher’s exact test of
independence. It turns out, however, that Fisher’s exact test is equivalent to
testing the correlation between the dummy variables, the only discrepancy
being that the normal approximation used to calculate the p-value of a
correlation is just that, an approximation, although an adequate one in
most cases.
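The binary-binary case can be checked through a closely related classical identity: the correlation of two dummy variables is the φ coefficient of their 2 × 2 table, and the Pearson chi-square statistic of that table equals nφ². (Fisher's exact test itself conditions on the margins; the chi-square statistic corresponds to the normal approximation mentioned above.) A sketch with hypothetical data:

```python
import math

# Two hypothetical binary variables as paired 0/1 lists
x = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]
z = [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
n = len(x)

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

r = pearson(x, z)  # the phi coefficient of the 2x2 table

# Pearson chi-square statistic of the 2x2 contingency table
counts = {(i, j): sum(1 for u, v in zip(x, z) if (u, v) == (i, j))
          for i in (0, 1) for j in (0, 1)}
row = {i: counts[(i, 0)] + counts[(i, 1)] for i in (0, 1)}
col = {j: counts[(0, j)] + counts[(1, j)] for j in (0, 1)}
chi2 = sum((counts[(i, j)] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in (0, 1) for j in (0, 1))

print(abs(chi2 - n * r ** 2) < 1e-9)  # chi-square = n * phi^2 for a 2x2 table
```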
c. If a categorical variable X is truly nominal with more than two values, that
is, neither binary nor ordinal, we may again follow the lead of ANOVA and
replace X with a collection of dummy variables, one per category. For example,
if in a medical context data are collected at multiple sites, it will be of interest
to see whether substantive variables at some sites are systematically different
from those at other sites. It is then useful to introduce dummy variables for the sites
and examine their correlations with the substantive variables. A significant
correlation indicates a significant mean difference at that site compared to the
other sites.
Footnote: In other large-p problems, the variables may be so highly structured that they become intrinsically low-dimensional, as, for example, in the analysis of libraries of registered images, where each variable corresponds to a pixel location and its values consist of intensities at that location across the images. The problem here is not to locate groups of variables with association but to describe the manifold formed by the images in very high-dimensional pixel space. A sensible approach in this case would be nonlinear dimension reduction.
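The site-dummy device of case (c) can be sketched as follows (Python, hypothetical data: three sites, with the substantive variable shifted upward at site B):

```python
import math

# Hypothetical data: site labels and one substantive variable Y
sites = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
y = [5.0, 5.2, 4.8, 7.1, 6.9, 7.3, 5.1, 4.9, 5.0]  # site B is shifted upward

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

# One 0/1 dummy variable per site, each correlated with the substantive variable
cors = {s: pearson([1 if t == s else 0 for t in sites], y)
        for s in sorted(set(sites))}

# The dummy for site B shows a large positive correlation, reflecting the higher
# mean of Y at that site; the dummies for A and C show small negative ones.
print(max(cors, key=lambda s: cors[s]))
```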
This discussion shows that categorical variables can be fruitfully included in
correlation analysis, with either numerical coding of ordinal variables or dummy
coding of binary and nominal variables.
This concludes our discussion of the versatility of correlation analysis.
Appendix B: Creating and Programming AN Instances
To create a new instance of an AN for a given dataset, use the following R statement:
a.n <- a.nav.create(datamatrix)
where datamatrix is a numeric matrix, not a dataframe. The new AN instance a.n can be
run with the following R statement:
a.nav.run(a.n)
These steps are completely general and may be useful for arbitrary numeric data matrices
with up to about 2000 variables.
Table B.1 shows a template for forming potentially useful instances of ANs that display
large numbers of SSC phenotype variables. As written, the statement would produce an
AN displaying on the order of 3000 variables.
ANs are implemented not as lists but as environments, a data structure relatively little
known among R users. Environments have some interesting properties. One can
look inside an AN with the R idiom
with(a.n, objects())
in order to list the internal state variables of the AN instance a.n. Assignments and any
other kind of programming of the internal state variables can be achieved the same way. For
example, if one desires a change of the color of the highlight strips to mistyrose, one can achieve
this with the following:
with(a.n, { strips.col <- "mistyrose"; a.nav.blockplot() })
The call to a.nav.blockplot() redisplays the blockplot with the new parameter setting.
Changing the blockplot glyph from square to diamond is achieved with
with(a.n, {blot.pch <- 18; a.nav.blockplot() })
and reversing the color convention from “blue = positive” to “red = positive” in the style
of heatmaps is done with
with(a.n, { blot.col.pos <- 2; blot.col.neg <- 4; a.nav.blockplot() })
Note, however, that this affects only blockplots, not heatmaps, the latter requiring
computation of a color scale, not just a binary color decision. Still, there is plenty of
opportunity for playfulness by experimenting with display parameters. A more sophisticated
example concerns changing the power transformation that maps correlations to glyph sizes:
with(a.n, {blot.pow <- .7; a.nav.cors.trans(); a.nav.blockplot() })
In addition to redisplay with a.nav.blockplot(), this also requires recomputation of the
display table with a.nav.cors.trans().
TABLE B.1
Template for joining large numbers of SSC tables and creating an AN for them.
a.n <- a.nav.create(cbind(
"family.ID"=as.numeric(v.families),
v.sites, v.srs.bg, v.individual,
v.family, v.parent.race, v.parent.common,
v.proband.cdv, v.proband.ocuv, v.sibling.s1, v.sibling.s2,
v.ados.common,
v.ados.1, v.ados.1.raw, v.ados.2, v.ados.2.raw,
v.ados.3, v.ados.3.raw, v.ados.4, v.ados.4.raw,
v.adi.r.diagnostic, v.adi.r.pca, v.adi.r,
v.adi.r.dum, v.adi.r.loss,
v.ssc.diagnosis,
v.vineland.ii.p1, v.vineland.ii.s1,
v.cbcl.2.5.p1, v.cbcl.2.5.s1,
v.cbcl.6.18.p1, v.cbcl.6.18.s1,
v.abc, v.abc.raw, v.rbs.r, v.rbs.r.raw,
v.srs.parent.p1, v.srs.parent.recode.p1,
v.srs.teacher.p1, v.srs.teacher.recode.p1,
v.srs.parent.s1, v.srs.parent.recode.s1,
v.srs.teacher.s1, v.srs.teacher.recode.s1,
v.srs.adult.fa, v.srs.adult.recode.fa,
v.srs.adult.mo, v.srs.adult.recode.mo,
v.bapq.fa, v.bapq.recode.fa, v.bapq.mo, v.bapq.recode.mo,
v.fhi.interviewer.fa, v.fhi.interviewer.mo,
v.scq.current.p1, v.scq.life.p1,
v.scq.current.s1, v.scq.life.s1,
v.ctopp.nr, v.purdue.pegboard, v.dcdq, v.ppvt,
v.das.ii.early.years, v.das.ii.school.age,
v.ctrf.2.5, v.trf.6.18,
v.ssc.med.hx.v2.autoimmune.disorders, v.ssc.med.hx.v2.birth.defects,
v.ssc.med.hx.v2.chronic.illnesses, v.ssc.med.hx.v2.diet.medication.sleep,
v.ssc.med.hx.v2.genetic.disorders, v.ssc.med.hx.v2.labor.delivery.birth.feeding,
v.ssc.med.hx.v2.language.disorders,
v.ssc.med.hx.v2.medical.history.child.1, v.ssc.med.hx.v2.medical.history.child.2,
v.ssc.med.hx.v2.medical.history.child.3,
v.ssc.med.hx.v2.medications.drugs.mother,
v.ssc.med.hx.v2.neurological.conditions,
v.ssc.med.hx.v2.other.developmental.disorders, v.ssc.med.hx.v2.pdd,
v.ssc.med.hx.v2.pregnancy.history, v.ssc.med.hx.v2.pregnancy.illness.vaccinations,
v.ssc.psuh.fa, v.ssc.psuh.mo,
v.temperature.form.raw
), remove=T )
Readers should make a selection from this template as the full collection creates a data
matrix with about 3000 variables.
R environments represent one of the few data types that disobey the functional
programming paradigm otherwise fundamental to R. As a consequence, assignment
of an AN does not allocate a new copy but passes a reference instead. In particular, the
R statement
b.n <- a.n
creates a variable b.n that will be a reference to the same environment as the variable a.n.
Hence, the two statements
a.nav.run(a.n)
a.nav.run(b.n)
will run off the same AN instance. They have identical effects in the sense that interactive
operations affect the same instance.
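This reference semantics can be mimicked in Python, where assignment of an object likewise binds a new name to the same object rather than copying it (an analogy only; the class and parameter name below are hypothetical stand-ins for an AN environment):

```python
# Analogy to R environments: assignment binds a second name to the SAME object
class ANInstance:
    def __init__(self):
        self.strips_col = "lightblue"   # hypothetical display parameter

a_n = ANInstance()
b_n = a_n                       # no copy is made: b_n references the same instance

b_n.strips_col = "mistyrose"    # an "interactive operation" through the alias
print(a_n.strips_col)           # prints "mistyrose": visible through a_n as well
```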
Acknowledgments
This work was partially supported by a grant from the Simons Foundation (SFARI award
#121221 to A.M.K.). We appreciate obtaining access to the phenotypic data on SFARI
Base (https://base.sfari.org). Partial support was also provided by the National Science
Foundation Grant DMS-1007689 to A.B.
References
J. Bertin. Semiology of Graphics. Madison, WI: University of Wisconsin Press, 1983.
W. S. Cleveland. The Elements of Graphing Data. Pacific Grove, CA: Wadsworth &
Brooks/Cole, 1985.
J. De Leeuw and J. van Rijckevorsel. HOMALS and PRINCALS—Some generalizations of
principal components analysis. In E. Diday et al., editors, Data Analysis and Informatics
II, pp. 231–242. Amsterdam, the Netherlands: Elsevier Science Publishers, 1980.
M. Friendly. Corrgrams: Exploratory displays of correlation matrices. The American
Statistician, 56(4):316–324, 2002.
A. Gifi. Nonlinear Multivariate Analysis. New York: John Wiley & Sons, 1990.
M. Hills. On looking at large correlation matrices. Biometrika, 56(2):249–253, 1969.
H. Hofmann. Exploring categorical data: Interactive mosaic plots. Metrika, 51(1):11–26,
2000.
D. J. Murdoch and E. D. Chow. A graphical display of large correlation matrices. The
American Statistician, 50(2):178–180, 1996.
A. Pilhoefer and A. Unwin. New approaches in visualization of categorical data: R package
extracat. Journal of Statistical Software, 53(7):1–25, 2013.
S. S. Stevens. On the psychophysical law. The Psychological Review, 64(3):153–181, 1957.
S. S. Stevens. Psychophysics. New York: John Wiley & Sons, 1975.
S. S. Stevens and E. H. Galanter. Ratio scales and category scales for a dozen perceptual
continua. Journal of Experimental Psychology, 54(6):377–411, 1957.
H. Wickham, H. Hofmann, and D. Cook. Exploring cluster analysis. http://www.had.co.nz/
model-vis/clusters.pdf, 2006.