98 Handbook of Big Data
dialog (Sort Adjustees...): The result is a sorted list of the adjustees according to the R² values from the regression of the adjustees/y-variables onto the adjustors/x-variables. See Figure 6.14 for an example.
6.4.12 Future of AN
The functionality described here reflects the 2015 implementation of the AN. Changes to
the AN are planned, the major one being a redesign to give the lens windows interactive
responsiveness as well. Currently all interaction is funneled through the blockplot window,
even if the actions affect the lens window.
Other obvious functionality is still missing, above all sorting of variables, manual and
algorithmic. A limited set of sorting operations may be added in a future version of the AN.
If readers of this chapter and users of the AN have further suggestions, the authors would
appreciate hearing them.
Appendix A: Versatility of Correlation Analysis
We return to the apparent limitations of correlations as measures of association, which were
left as a loose end in Section 6.2. We address the objections that (1) correlations are
measures of linear association only, (2) correlations reflect bivariate association only, and
(3) correlations apply to quantitative variables only. Toward this end, we make the following
observations and recommendations:
1. Although it is true that correlation is strictly speaking a measure of linear
association among quantitative variables, it is also a fact that correlation is useful
as a measure of monotone association in general, even when it is nonlinear. As
long as the association is roughly monotone, correlation will be positive when
the association is increasing and negative when it is decreasing. Admittedly,
correlation is not an optimal measure of nonlinear monotone association, but
it is still a useful one, in particular in the large-p problem. Finally, if gross
nonlinearity is discovered, it is always possible to replace a variable X with a
nonlinear transform f(X) [often log(X)] so its association with other variables
becomes more linear.
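Both points, that correlation picks up monotone association even when it is nonlinear, and that a log transform can make the association more linear, are easy to check numerically. The following is a small sketch (Python here for a self-contained check; the data and the exponential relationship are hypothetical):

```python
import math

# Hypothetical monotone but strongly nonlinear association: Y grows exponentially in X
x = [i / 19 for i in range(20)]            # 20 equally spaced points on [0, 1]
y = [math.exp(4 * xi) for xi in x]         # increasing, hence monotone, but nonlinear

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

r_raw = pearson(x, y)                          # positive, but well below 1
r_log = pearson(x, [math.log(v) for v in y])   # log(Y) = 4X is exactly linear

print(r_raw > 0, r_log > r_raw)
```

The raw correlation is positive, correctly signaling the increasing association, and replacing Y with log(Y) drives the correlation to 1 because the transformed association is exactly linear.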
2. The objection that correlations only reflect bivariate association is factually
correct but practically not very relevant. In practical data analysis, it is too
contrived to entertain the possibility that, for example, there exists association
among three variables, but there exists no monotone association among each pair
of variables.
In general, one follows the principle that lower-order association
is more likely than higher-order association; hence, pairwise association is more
likely than true interaction among three variables. Therefore, data analysts look
first for groups of variables that are linked by pairwise association, and thereafter
they may examine whether these variables also exhibit higher-order association.
Note, however, that even multivariate methods such as principal components
analysis (PCA) do not detect true higher-order interaction because they, too, rely
on correlations only. Finally, we are not asserting that simple correlation analysis
should be the end of data analysis, but it should certainly be near the beginning
in the large-p problems envisioned here, namely, in the analysis of relatively noisy
data as they arise in many social science and medical contexts.
Footnote: Linearity of association is not a simple concept. For one thing, it is asymmetric: if Y is linearly
associated with X, it does not follow that X is linearly associated with Y. The reason is that the definition
of linear association, E[Y | X] = β₀ + β₁X, is not symmetric in X and Y. Linearity of association in both
directions holds only for certain nice distributions such as bivariate Gaussians. A counterexample is as
follows: let X be uniformly distributed on an interval and Y = β₀ + β₁X + ε with independent Gaussian
ε; then Y is linearly associated with X by construction, yet X is not linearly associated with Y.
Footnote: An example would be three variables jointly uniformly distributed on the surface of a two-sphere in
three-space.
3. The final objection we consider is that correlations do not apply to categorical
variables. This objection can be refuted with very practical advice on how to make
categorical data quantitative and how to interpret the meaning of the resulting
correlations. We discuss several cases in turn:
a. If a categorical variable X is ordinal (its categories have a natural order),
it is common practice to simply number the categories in order and use the
resulting integer variable as a quantitative variable. The resulting correlations
will be able to reflect monotone association with other variables that may
be expressed by saying “the higher categories of X tend to be associated
with higher/lower values/categories of other variables.” An obvious objection
is that the equi-spaced integers may not be a good quantification of the
categories. If this is a serious concern worth some effort, one may want
to look into optimal scoring procedures [see, e.g., De Leeuw and van
Rijckevorsel (1980) and Gifi (1990)]. The idea behind these methods is to
estimate new scores for the categorical variables by making them as linearly
associated as possible through optimization of the fit of a joint PCA.
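A one-pair special case conveys the idea behind such scoring methods: if the scores of a categorical X are chosen to maximize its correlation with a single quantitative Y, the optimal scores turn out to be the category means of Y. The sketch below (Python, hypothetical data) contrasts this with naive equi-spaced integer coding; the full HOMALS/PRINCALS machinery of the cited references handles many variables jointly rather than one pair:

```python
import math
from collections import defaultdict

# Hypothetical ordinal variable X with three categories and a quantitative Y
x = ["low", "low", "mid", "mid", "mid", "high", "high", "high", "low", "high"]
y = [1.0, 1.4, 3.9, 4.2, 3.6, 5.0, 5.5, 4.8, 1.2, 5.2]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

# Naive equi-spaced integer coding of the ordered categories
codes = {"low": 1, "mid": 2, "high": 3}
r_int = pearson([codes[c] for c in x], y)

# Optimal scores for this single pair: the mean of Y within each category
groups = defaultdict(list)
for c, v in zip(x, y):
    groups[c].append(v)
means = {c: sum(vals) / len(vals) for c, vals in groups.items()}
r_opt = pearson([means[c] for c in x], y)

print(r_opt >= r_int)  # optimal scoring can only improve on integer coding
```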
b. If a categorical variable X is binary, it is common practice to numerically code
its two categories with the values 0 and 1, thereby creating a so-called dummy
variable. This practice is pervasive in the analysis of variance (ANOVA), but
its usefulness is less well known in multivariate analysis, which is our concern.
The interpretation of correlations with dummy variables is highly interesting
as it solves two seemingly different association problems:
i. First-order association between a binary variable X and a quantitative
variable Y means that there exists a difference between the two means
of Y in the two groups denoted by X. As it turns out, the correlation
of a dummy variable X with a quantitative variable Y is mathematically
equivalent to forming the t-statistic for a two-sample comparison of the
two means of Y in the two categories of X (t ∝ r/(1 − r²)^1/2). Even more,
the statistical test for a zero correlation is virtually identical to the t-test
for equality of the two means. Thus, two-sample mean comparisons can be
subsumed under correlation analysis.
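This equivalence is easy to verify numerically. The sketch below (Python, hypothetical data) recovers the t-statistic from the dummy-variable correlation via the classical identity t = r√(n − 2)/√(1 − r²) and compares it with the pooled two-sample t-statistic:

```python
import math

# Hypothetical two groups: Y values for the X = 0 and X = 1 categories
y0 = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8]
y1 = [2.9, 3.1, 2.7, 3.3, 2.8, 3.0]
y = y0 + y1
x = [0] * len(y0) + [1] * len(y1)   # the 0/1 dummy variable
n = len(y)

def mean(v):
    return sum(v) / len(v)

# Pearson correlation of the dummy x with y (the point-biserial correlation)
mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)

# t-statistic recovered from the correlation
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Classical pooled two-sample t-statistic
s2 = (sum((v - mean(y0)) ** 2 for v in y0) +
      sum((v - mean(y1)) ** 2 for v in y1)) / (n - 2)
t_pooled = (mean(y1) - mean(y0)) / math.sqrt(s2 * (1 / len(y0) + 1 / len(y1)))

print(abs(t_from_r - t_pooled) < 1e-9)  # the two statistics coincide
```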
ii. Association between two binary variables means that their 2×2 table shows
dependence. This situation is usually addressed with Fisher’s exact test of
independence. It turns out, however, that Fisher’s exact test is equivalent to
testing the correlation between the dummy variables, the only discrepancy
being that the normal approximation used to calculate the p-value of a
correlation is just that, an approximation, although an adequate one in
most cases.
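The binary-binary case can be checked through a closely related classical identity: the correlation of two dummy variables is the φ coefficient of their 2 × 2 table, and the Pearson chi-square statistic of that table equals nφ². (Fisher's exact test itself conditions on the margins; the chi-square statistic corresponds to the normal approximation mentioned above.) A sketch with hypothetical data:

```python
import math

# Two hypothetical binary variables as paired 0/1 lists
x = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]
z = [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
n = len(x)

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

r = pearson(x, z)  # the phi coefficient of the 2x2 table

# Pearson chi-square statistic of the 2x2 contingency table
counts = {(i, j): sum(1 for u, v in zip(x, z) if (u, v) == (i, j))
          for i in (0, 1) for j in (0, 1)}
row = {i: counts[(i, 0)] + counts[(i, 1)] for i in (0, 1)}
col = {j: counts[(0, j)] + counts[(1, j)] for j in (0, 1)}
chi2 = sum((counts[(i, j)] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in (0, 1) for j in (0, 1))

print(abs(chi2 - n * r ** 2) < 1e-9)  # chi-square = n * phi^2 for a 2x2 table
```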
c. If a categorical variable X is truly nominal with more than two values, that
is, neither binary nor ordinal, we may again follow the lead of ANOVA and
replace X with a collection of dummy variables, one per category. For example,
if in a medical context data are collected at multiple sites, it will be of interest
to see whether substantive variables at some sites are systematically different
from those at other sites. It is then useful to introduce dummy variables for the sites
and examine their correlations with the substantive variables. A significant
correlation indicates a significant mean difference at that site compared to the
other sites.
Footnote: In other large-p problems, the variables may be so highly structured that they become intrinsically low-dimensional, as, for example, in the analysis of libraries of registered images, where each variable corresponds to a pixel location and its values consist of intensities at that location across the images. The problem here is not to locate groups of variables with association but to describe the manifold formed by the images in very high-dimensional pixel space. A sensible approach in this case would be nonlinear dimension reduction.
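The site-dummy device of case (c) can be sketched as follows (Python, hypothetical data: three sites, with the substantive variable shifted upward at site B):

```python
import math

# Hypothetical data: site labels and one substantive variable Y
sites = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
y = [5.0, 5.2, 4.8, 7.1, 6.9, 7.3, 5.1, 4.9, 5.0]  # site B is shifted upward

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                    sum((v - mb) ** 2 for v in b))
    return num / den

# One 0/1 dummy variable per site, each correlated with the substantive variable
cors = {s: pearson([1 if t == s else 0 for t in sites], y)
        for s in sorted(set(sites))}

# The dummy for site B shows a large positive correlation, reflecting the higher
# mean of Y at that site; the dummies for A and C show small negative ones.
print(max(cors, key=lambda s: cors[s]))
```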
This discussion shows that categorical variables can be fruitfully included in
correlation analysis, with either numerical coding of ordinal variables or dummy
coding of binary and nominal variables.
This concludes our discussion of the versatility of correlation analysis.
Appendix B: Creating and Programming AN Instances
To create a new instance of an AN for a given dataset, use the following R statement:
a.n <- a.nav.create(datamatrix)
where datamatrix is a numeric matrix, not a dataframe. The new AN instance a.n can be
run with the following R statement:
a.nav.run(a.n)
These steps are completely general and may be useful for arbitrary numeric data matrices
with up to about 2000 variables.
Table B.1 shows a template for forming potentially useful instances of ANs that display
large numbers of SSC phenotype variables. As written, the statement would produce an
AN displaying on the order of 3000 variables.
ANs are implemented not as lists but as environments, a data structure relatively little
known among R users. Environments have some interesting properties. One can
look inside an AN with the R idiom
with(a.n, objects())
in order to list the internal state variables of the AN instance a.n. Assignments and any
other kind of programming of the internal state variables can be achieved the same way. For
example, if one desires a change of the color of the highlight strips to mistyrose, one can achieve
this with the following:
with(a.n, { strips.col <- "mistyrose"; a.nav.blockplot() })
The call to a.nav.blockplot() redisplays the blockplot with the new parameter setting.
Changing the blockplot glyph from square to diamond is achieved with
with(a.n, {blot.pch <- 18; a.nav.blockplot() })
and reversing the color convention from “blue = positive” to “red = positive” in the style
of heatmaps is done with
with(a.n, { blot.col.pos <- 2; blot.col.neg <- 4; a.nav.blockplot() })
Note, however, that this affects only blockplots, not heatmaps, the latter requiring
computation of a color scale, not just a binary color decision. Still, there is plenty of
opportunity for playfulness by experimenting with display parameters. A more sophisticated
example concerns changing the power transformation that maps correlations to glyph sizes:
with(a.n, {blot.pow <- .7; a.nav.cors.trans(); a.nav.blockplot() })
In addition to redisplay with a.nav.blockplot(), this also requires recomputation of the
display table with a.nav.cors.trans().
TABLE B.1
Template for joining large numbers of SSC tables and creating an AN for them.
a.n <- a.nav.create(cbind(
"family.ID"=as.numeric(v.families),
v.sites, v.srs.bg, v.individual,
v.family, v.parent.race, v.parent.common,
v.proband.cdv, v.proband.ocuv, v.sibling.s1, v.sibling.s2,
v.ados.common,
v.ados.1, v.ados.1.raw, v.ados.2, v.ados.2.raw,
v.ados.3, v.ados.3.raw, v.ados.4, v.ados.4.raw,
v.adi.r.diagnostic, v.adi.r.pca, v.adi.r,
v.adi.r.dum, v.adi.r.loss,
v.ssc.diagnosis,
v.vineland.ii.p1, v.vineland.ii.s1,
v.cbcl.2.5.p1, v.cbcl.2.5.s1,
v.cbcl.6.18.p1, v.cbcl.6.18.s1,
v.abc, v.abc.raw, v.rbs.r, v.rbs.r.raw,
v.srs.parent.p1, v.srs.parent.recode.p1,
v.srs.teacher.p1, v.srs.teacher.recode.p1,
v.srs.parent.s1, v.srs.parent.recode.s1,
v.srs.teacher.s1, v.srs.teacher.recode.s1,
v.srs.adult.fa, v.srs.adult.recode.fa,
v.srs.adult.mo, v.srs.adult.recode.mo,
v.bapq.fa, v.bapq.recode.fa, v.bapq.mo, v.bapq.recode.mo,
v.fhi.interviewer.fa, v.fhi.interviewer.mo,
v.scq.current.p1, v.scq.life.p1,
v.scq.current.s1, v.scq.life.s1,
v.ctopp.nr, v.purdue.pegboard, v.dcdq, v.ppvt,
v.das.ii.early.years, v.das.ii.school.age,
v.ctrf.2.5, v.trf.6.18,
v.ssc.med.hx.v2.autoimmune.disorders, v.ssc.med.hx.v2.birth.defects,
v.ssc.med.hx.v2.chronic.illnesses, v.ssc.med.hx.v2.diet.medication.sleep,
v.ssc.med.hx.v2.genetic.disorders, v.ssc.med.hx.v2.labor.delivery.birth.feeding,
v.ssc.med.hx.v2.language.disorders,
v.ssc.med.hx.v2.medical.history.child.1, v.ssc.med.hx.v2.medical.history.child.2,
v.ssc.med.hx.v2.medical.history.child.3,
v.ssc.med.hx.v2.medications.drugs.mother,
v.ssc.med.hx.v2.neurological.conditions,
v.ssc.med.hx.v2.other.developmental.disorders, v.ssc.med.hx.v2.pdd,
v.ssc.med.hx.v2.pregnancy.history, v.ssc.med.hx.v2.pregnancy.illness.vaccinations,
v.ssc.psuh.fa, v.ssc.psuh.mo,
v.temperature.form.raw
), remove=T )
Readers should make a selection from this template as the full collection creates a data
matrix with about 3000 variables.
R environments represent one of the few data types that disobey the functional
programming paradigm otherwise fundamental to R. As a consequence, assignment
of an AN does not allocate a new copy but passes a reference instead. In particular, the
R statement
b.n <- a.n
creates a variable b.n that will be a reference to the same environment as the variable a.n.
Hence, the two statements
a.nav.run(a.n)
a.nav.run(b.n)
will run off the same AN instance. They have identical effects in the sense that interactive
operations affect the same instance.
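This reference semantics can be mimicked in Python, where assignment of an object likewise binds a new name to the same object rather than copying it (an analogy only; the class and parameter name below are hypothetical stand-ins for an AN environment):

```python
# Analogy to R environments: assignment binds a second name to the SAME object
class ANInstance:
    def __init__(self):
        self.strips_col = "lightblue"   # hypothetical display parameter

a_n = ANInstance()
b_n = a_n                       # no copy is made: b_n references the same instance

b_n.strips_col = "mistyrose"    # an "interactive operation" through the alias
print(a_n.strips_col)           # prints "mistyrose": visible through a_n as well
```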
Acknowledgments
This work was partially supported by a grant from the Simons Foundation (SFARI award
#121221 to A.M.K.). We appreciate obtaining access to the phenotypic data on SFARI
Base (https://base.sfari.org). Partial support was also provided by the National Science
Foundation Grant DMS-1007689 to A.B.
References
J. Bertin. Semiology of Graphics. Madison, WI: University of Wisconsin Press, 1983.
W. S. Cleveland. The Elements of Graphing Data. Pacific Grove, CA: Wadsworth &
Brooks/Cole, 1985.
J. De Leeuw and J. van Rijckevorsel. HOMALS and PRINCALS—Some generalizations of
principal components analysis. In E. Diday et al., editors, Data Analysis and Informatics
II, pp. 231–242. Amsterdam, the Netherlands: Elsevier Science Publishers, 1980.
M. Friendly. Corrgrams: Exploratory displays of correlation matrices. The American
Statistician, 56(4):316–324, 2002.
A. Gifi. Nonlinear Multivariate Analysis. New York: John Wiley & Sons, 1990.
M. Hills. On looking at large correlation matrices. Biometrika, 56(2):249–253, 1969.
H. Hofmann. Exploring categorical data: Interactive mosaic plots. Metrika, 51(1):11–26,
2000.
D. J. Murdoch and E. D. Chow. A graphical display of large correlation matrices. The
American Statistician, 50(2):178–180, 1996.
A. Pilhoefer and A. Unwin. New approaches in visualization of categorical data: R package
extracat. Journal of Statistical Software, 53(7):1–25, 2013.
S. S. Stevens. On the psychophysical law. The Psychological Review, 64(3):153–181, 1957.
S. S. Stevens. Psychophysics. New York: John Wiley & Sons, 1975.
S. S. Stevens and E. H. Galanter. Ratio scales and category scales for a dozen perceptual
continua. Journal of Experimental Psychology, 54(6):377–411, 1957.
H. Wickham, H. Hofmann, and D. Cook. Exploring cluster analysis. http://www.had.co.nz/
model-vis/clusters.pdf, 2006.