Guy: “I used to think correlation implied causation. Then I took a statistics class. Now I don’t.”
Girl: “Sounds like the class helped.”
Guy: “Well, maybe.”
—Dialogue in an XKCD.com cartoon: https://xkcd.com/552/
There are certain headlines that are impossible for any journalist to resist. I’ll give you one: “Eat More Chocolate, Become Smarter!” You may think that I’m joking, but here’s a real 2012 headline from Reuters: “Eat chocolate, win the Nobel Prize?”
The story—written tongue-in-cheek, or so it seems—was based on a “study” by Dr. Franz H. Messerli, director of the Hypertension Program at New York City’s St. Luke’s-Roosevelt Hospital, published in the well-respected New England Journal of Medicine. Reuters wasn’t the only news organization that picked up Dr. Messerli’s results.1
1 Dr. Messerli’s article is “Chocolate Consumption, Cognitive Function, and Nobel Laureates,” http://tinyurl.com/bh3eeea, and the story by Reuters is here: http://tinyurl.com/8bbweav. For comments about this case, read: “Chocolate consumption and Nobel Prizes: A bizarre juxtaposition if there ever was one,” http://tinyurl.com/pzlbuf6.
Dr. Messerli’s “study” didn’t truly qualify as such. It was just a short piece in a section called “Occasional Notes,” and it was based on data downloaded from Wikipedia: the chocolate consumption per capita for each country (kilograms per year) and the number of Nobel Prize winners per 10 million people.
Throw those two variables together in a scatter plot, and you’ll get Figure 9.1. You can see an r = 0.79 in there. We’ll get to what that means in a minute. For now, just keep in mind that r measures the strength of the linear relationship between two quantitative variables. The highest value that r can adopt is 1.0, so 0.79 denotes a pretty strong linear relationship.
Dr. Messerli told Reuters, “I started plotting this in a hotel room in Kathmandu, because I had nothing else to do, and I could not believe my eyes.” Well, it’s better to believe them. Figure 9.2 exposes the even stronger relationship (r = 0.87) that exists between the age of Miss America and the number of murders by steam, hot vapors, or hot objects in the United States. I may have just proved that the Miss America jury has a grave responsibility in reducing murder rates by picking younger winners.
As demonstrated by the website this example came from, Spurious Correlations (http://tylervigen.com/spurious-correlations), you don’t need to look into data too hard to find stirring, but utterly absurd, associations between disparate variables.
Some researchers devoted time to write rebuttals to Dr. Messerli’s meditation.2 The most amusing one was published in The Journal of Nutrition,3 and it stated the obvious: chocolate consumption is indeed correlated with the number of Nobel laureates in a country. But so is wine consumption—and the number of IKEA stores (Figure 9.3).
2 To be fair to Dr. Messerli, his article has plenty of caveats: it’s based on a known fact—chocolate improves cognitive function—and it’s written in a cheeky, half-serious tone. When observing that Sweden is an outlier, Dr. Messerli wrote, “Either the Nobel committee in Stockholm has some inherent patriotic bias when assessing the candidates for these awards or, perhaps, the Swedes are particularly sensitive to chocolate, and even minuscule amounts greatly enhance their cognition.” At the end of the article, he adds, “Dr. Messerli reports regular daily chocolate consumption, mostly but not exclusively in the form of Lindt’s dark varieties.”
What all these variables have in common is that they are related to each country’s wealth. That’s the lurking variable we’re overlooking. The higher the median income is in a country, the more money citizens have to invest in education, wine and chocolate, or impossible-to-assemble furniture from IKEA.
If dealing with a single variable is tricky, exploring how different variables influence each other is even worse. In this chapter, we’re going to enter the world of quantitative relationships, correlation, regression, and the multiple ways we can display them visually. We’ll also address a big elephant that has made a home of our tiny living room (packed with IKEA furniture, of course): is it possible to make the leap between correlation and causation?
We say that two variables are related when changes in one of them are accompanied by variations on the other one. Take the small data set of student scores on Figure 9.4. You can represent the positive linear relationship between the variables with a scatter plot and a trend line.
In a scatter plot, the closer the data points are to this trend line, the stronger the association between our variables is. In our current data set, all points are sitting on the line, meaning that the association is perfect. The line is an excellent model for our data. We can express it with an equation or function. Let’s call English scores X and math scores Y. Here’s the function:
Y = (X × 1.5) + 1
You can double-check the function yourself. When the English score is 14, the math score is 22, which is the result of (14 times 1.5) + 1.
Our function doesn’t just take actual values into account, but it can also predict missing or inexistent ones. For instance, not a single student got a score of 20 in the English exam. If that were the case, we could estimate her math score: (20 × 1.5) + 1 = 31. With some minimal effort, just by reversing the previous formula, we can also estimate English scores based on math scores.
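Here is a minimal sketch of that function and its reverse in Python. The function names are made up for illustration, and the sketch follows the worked examples above in treating the English score as the input and the math score as the output.

```python
def math_from_english(english_score):
    """Apply the trend-line function from the text: score times 1.5, plus 1."""
    return english_score * 1.5 + 1


def english_from_math(math_score):
    """Reverse the function: undo the +1 first, then undo the multiplication."""
    return (math_score - 1) / 1.5


print(math_from_english(14))  # 22.0, the pair checked in the text
print(math_from_english(20))  # 31.0, the estimate for a score nobody obtained
print(english_from_math(31))  # 20.0, going back from math to English
```

Because every point sits exactly on the line, the round trip is exact here. With real data, predictions in either direction would only be approximations.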
Unfortunately, the world is far messier than made-up examples. Associations between variables are rarely perfect. Just check the relationship between IKEA stores and chocolate consumption in Figure 9.3 again. There’s a trend line in there, too, but it doesn’t fit the data perfectly. The points aren’t sitting on the trend line but cluster close to it.
If the relationship between two quantitative variables is linear, we can talk about a correlation between them. Its strength can be expressed with that puzzling r we saw before. That’s the correlation coefficient.4 The formula for r is pretty straightforward, but not relevant for this discussion.5 Here are a few important things to remember about it:
4 As per usual in this book, I’m just giving you an overview, so refer to the recommendations at the end of the chapter for more information.
5 In case you’re interested in how to do this manually, here we go: first, remember z scores, discussed in previous chapters? You need to calculate all z scores for X and Y values. Then, multiply the z score of each X value by the z score of its paired Y value and add up all the products. Finally, divide the result by the number of X-Y pairs.
• r may adopt any value between –1.0 and 1.0.
• Correlation has a direction. If r is a negative number, each increment of X will result in a consistent decrease of Y. If r is a positive number, X and Y increase and decrease together. In both cases, the ratio of change is constant.
• Authors disagree on what constitutes a weak or a strong correlation, but here’s a useful scale: 0.1–0.3 is modest; 0.3–0.5 is moderate; 0.5–0.8 is strong; and 0.8–0.9 is very strong. The same applies to negative correlations: just put a minus symbol in front of those figures.
• Let’s mention this again, just in case: in any scatter plot, the closer most of the data points are to the trend line, the greater r will be, and vice versa.
• Correlation is very sensitive to outliers. It’s a non-resistant statistic. A single outlier may distort r badly. There are methods to deal with this challenge. You can refer to the bibliography and learn, for instance, about Spearman’s resistant correlation coefficient or the Kendall rank correlation coefficient.
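The manual recipe for r and the warning about outliers can be sketched in a few lines of Python. Everything here is an illustration under stated assumptions: the data are invented, the function names are hypothetical, and the rank-based coefficient is a bare-bones version of Spearman’s idea that assumes no tied values.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson_r(xs, ys):
    """Multiply paired z scores, add the products, divide by the number of pairs."""
    return sum(a * b for a, b in zip(z_scores(xs), z_scores(ys))) / len(xs)


def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(xs, ys):
    """Rank-based correlation: the same recipe applied to ranks, not raw values."""
    return pearson_r(ranks(xs), ranks(ys))


# Invented data: a perfect line, then the same line with one extreme value.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [2, 3, 4, 5, 6, 7, 8, 9]
y_outlier = [2, 3, 4, 5, 6, 7, 8, 90]

print(round(pearson_r(x, y_clean), 2))       # 1.0
print(round(pearson_r(x, y_outlier), 2))     # about 0.63
print(round(spearman_rho(x, y_outlier), 2))  # 1.0, the ordering did not change
```

A single extreme value pulls r well below 1.0, while the rank-based coefficient only cares that the last point is still the largest.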
I’ll reiterate that correlation applies only when the relationship between variables is linear. This isn’t always the case. Sometimes, the association between two variables is better described as a curve.
I have an old data set of fuel efficiency with two variables: speed and miles per gallon consumed by 15 cars tested in 1984. I calculated the correlation between speed and average mileage and got: r = –0.27. Should I move on? First, let’s see the data on a connected scatter plot. Figure 9.5 shows the average of the 1984 cars compared to the average of nine cars tested in 1997.
It is clear to me that a simple linear association is inadequate in this case: both 1984 and 1997 cars are inefficient at low speeds and at high speeds. The high-efficiency peak is reached at medium speeds.
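To see why a weak r can hide a strong but curved relationship, here is a small sketch with invented numbers (these are not the 1984 or 1997 test data). Efficiency rises, peaks, and falls symmetrically, so r over the whole range is essentially zero even though each half follows a nearly straight line.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Correlation coefficient: average product of paired deviations, rescaled."""
    mx, my = mean(xs), mean(ys)
    covariance = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return covariance / (pstdev(xs) * pstdev(ys))


# Hypothetical speeds (mph) and fuel efficiency (mpg), shaped like an inverted U.
speeds = [15, 25, 35, 45, 55, 65, 75]
mpg = [20, 28, 33, 35, 33, 28, 20]

r_all = pearson_r(speeds, mpg)            # whole curve
r_low = pearson_r(speeds[:4], mpg[:4])    # rising half only
r_high = pearson_r(speeds[3:], mpg[3:])   # falling half only

print(abs(round(r_all, 2)), round(r_low, 2), round(r_high, 2))
```

The two halves show strong correlations of opposite sign that cancel each other out, which is why plotting the data before trusting r matters so much.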
An interesting pattern I spotted was that the curve for the 1997 cars has a distinctive two-peak shape, and it is flatter than the one of 1984 cars. Newer cars are more efficient at low and high speeds but slightly less efficient at medium speeds. This might mean nothing, as I don’t know if the models tested in 1984 and 1997 were the same. We cannot assess if this comparison makes sense.
Out of curiosity I decided to plot all nine models tested in 1997 to see if the two-peak pattern is common. It is (Figure 9.6). Several models show a sudden drop in efficiency in the middle of the curve. Call me a dummy—my ignorance about cars is as epic as my ignorance about sports, and I’m not proud of either—but this was surprising. I had always thought that fuel efficiency invariably follows a smooth up-and-down curve, like the one for 1984 cars. I’ve learned something new, thanks to a chart (again).
Data puzzles are often the best starting point to spot potential stories, so here’s another one: I once read that the best-performing U.S. state in the SAT (Scholastic Aptitude Test) is North Dakota. That sounded interesting. Inspired by an exercise devised by statisticians David S. Moore and George P. McCabe, I searched the Internet and downloaded the average state-level 2014 SAT scores and the participation rate and designed a scatter plot with a straight trend line (first chart on Figure 9.7).
There are several intriguing patterns in that chart. I can see at least three data clusters and a few exceptions (second chart on Figure 9.7):
1. States with very low participation rates and very high average scores,
2. States with high participation and lower scores, and
3. States with full participation and even lower scores.
A single linear association, then, isn’t a good summary of these data, and r alone, even being high, can mislead us. A solution could be to divide the data into groups of identical size and then calculate trend lines for each.
This is, roughly speaking, the idea behind locally weighted scatter plot smoothing. That’s quite a mouthful, so let’s use the acronym LOWESS instead. Many tools and programming languages for visualization can calculate a LOWESS model and curve for us. After that, we may color-code the states by region (Figure 9.8).
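For readers who want to see the idea in code, here is a deliberately simplified sketch of locally weighted smoothing in Python. It is not what any particular tool does: it fits one weighted straight line around each point using the common tricube weighting and skips the extra robustness passes that full implementations usually add. The function name, the `frac` parameter, and the data are all assumptions made for illustration.

```python
def lowess_sketch(xs, ys, frac=0.5):
    """Return one smoothed y per x, each from a local weighted straight-line fit."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # how many neighbors define each window
    smoothed = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the width of the window.
        width = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        # Tricube weights: close to 1 near x0, fading to 0 at the window edge.
        weights = [max(0.0, 1 - (abs(x - x0) / width) ** 3) ** 3 for x in xs]
        total = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / total
        my = sum(w * y for w, y in zip(weights, ys)) / total
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Invented participation rates (%) and average scores, loosely shaped like the chart.
participation = [2, 4, 5, 8, 12, 50, 55, 60, 70, 75, 100, 100]
scores = [1800, 1780, 1790, 1750, 1700, 1550, 1540, 1500, 1490, 1480, 1350, 1310]

smoothed = lowess_sketch(participation, scores, frac=0.5)
print([round(value) for value in smoothed])
```

A smaller `frac` makes the curve follow local wiggles more closely; a larger one pushes it toward a single straight trend line.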
Imagine that someone were to report just the average SAT scores. The resulting bar chart would be distorted by the fact that in states with low participation rates, probably just the best high schoolers take the SAT.6 North Dakota, for instance, has a participation rate of just 2 percent and a score of 1,816, while in Washington D.C., where all students take the exam, the average score is just 1,309. Our scatter plot is a much better depiction of the situation.
6 This is just a guess. Read “Why The Midwest Dominates the SAT,” http://www.forbes.com/sites/bentaylor/2014/07/17/why-the-midwest-dominates-the-sat/.
When I first learned about this participation-score mismatch, I felt so curious that I did some digging. If high schoolers don’t take the SAT, what do they do? I read that most students in Midwestern states take the ACT (American College Testing), so I decided to see if the inverse relationship between participation and average scores still held.
I downloaded data from www.act.org and designed Figure 9.9. It’s like the mirror image of Figure 9.8, with a handful of exceptions: states in the Midwest have higher participation rates this time, but their average scores don’t drop as dramatically as I was expecting. It would hardly be possible to discover such quirky morsels of information without these color-coded scatter plots.
A single scatter plot displays the relationship between two variables. But what if we wish to compare more?
The data set of SAT scores I’ve been playing with doesn’t just include the average state scores, but also the scores for each test: critical reading, math, and writing. Figure 9.10 is a series of scatter plots that share the same Y-axis (score) and X-axis (participation). If this graphic were interactive, whenever a reader selected a state or group of states in one of the scatter plots, they’d also be highlighted on the other charts, enabling the identification of patterns of similitude and difference. Math scores seem to be higher than writing ones. Interesting, isn’t it?
The limitation of this kind of graphic is that you can compare participation to any other score, but you can’t see the correlation between individual tests (math versus writing, for instance).
Enter the scatter plot matrix (Figure 9.11), a common tool in scientific research that unfortunately doesn’t get much attention from news organizations. Scatter plot matrices are tailored to explore multivariate data. They can provide a very rich big-picture view of the relationships between numerous variables.
If you’ve never seen a matrix like this before, it will take you a minute to learn how to read it, so I have designed a little infographic guide (Figure 9.12). As you can notice, scatter plot matrices can be color-coded according to the strength of the correlations and even simplified as heat maps (Figure 9.13).
My data set of SAT scores isn’t particularly interesting, as correlation coefficients are barely distinguishable from 1.0, so I designed another example of scatter plot matrix and heat map (Figure 9.14) where I compared several state-level metrics, such as poverty, food stamp recipients, obesity rates, educational attainment, and so on.
Deciding between a scatter plot matrix and a heat map depends on many factors. For instance, heat maps alone are not appropriate when the relationship between variables isn’t always clearly linear, as in the example before. However, a heat map can be a concise summary of large data sets. When in doubt, always do a detailed scatter plot first.
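The numbers behind a correlation heat map are just the correlation coefficients for every pair of variables. The sketch below computes that table in Python with invented values for five hypothetical states; the variable names and figures are assumptions, not the data behind the figures.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Correlation coefficient between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    covariance = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return covariance / (pstdev(xs) * pstdev(ys))


# Invented state-level percentages, one value per state in each list.
metrics = {
    "poverty": [10, 12, 15, 18, 21],
    "obesity": [24, 27, 26, 31, 33],
    "college": [38, 35, 30, 27, 22],
}

# One coefficient per pair of variables: this is what a heat map colors.
matrix = {
    a: {b: pearson_r(metrics[a], metrics[b]) for b in metrics}
    for a in metrics
}

for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Each cell summarizes a whole scatter plot as one number, which is why the table is compact but silent about curves and clusters.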
Another powerful way to visualize multivariate data, which we saw briefly in Chapter 5, is the parallel coordinates plot, invented by mathematician Al Inselberg in 1959. A strength of parallel coordinates plots is that they can be used to show relationships between any kind of data, categorical or quantitative, as shown in Figure 9.15, designed by Stephen Few for his book Signal. It shows the performance of 10 products (divided into three groups) in four different regions.
At first glance, what we have here looks like a cluttered mess, but suspend your disbelief in its potential until I’ve explained a bit more about (the insights we can gain). Here are a few that stand out:
• No technology products have been on the market for more than three years.
• All products have either been on the market for less than three years or for more than six years.
• The four top-grossing items are all technology products.
• The top-grossing item is probably doing well because of a major investment in marketing in a particular region, resulting in a fairly low profit margin.
• The lowest-grossing items are all office supplies.
• The products that have been on the market for the shortest time are all furniture products that earn better than average annualized revenues.
• Four of the top five profit margins are associated with technology products.
• Two office supply products are losing money in particular regions, which in one case appears to be tied to an expensive marketing campaign.7
7 Few, Stephen. Signal: Understanding What Matters in a World of Noise. Analytics Press, 2015.
Not bad for what we initially thought to be a hopelessly busy chart, I’d say.
Two features are critical in the design of parallel coordinates plots. First, ordering variables in different ways helps identify relevant patterns. Second, adding interactivity helps readers. Parallel coordinates with many lines are usually more effective when people can highlight the portions they’re interested in, while graying out the rest.
The SAT data I explored before reminded me of an interactive visualization I made with my colleagues Gerardo Rodríguez and Camila Guimarães when I was the head of infographics and visualization at Época magazine, in Brazil. In September 2010, the Ministry of Education released the results of the ENEM, the Brazilian equivalent of the SAT.
We designed an online tool that first let parents search the entire database to find any school in the country. But we also wanted to play with visualizations a bit. One of our sources, a statistician who specializes in educational data, suggested that we look into the relationship between average school scores and participation rates.
The reason, according to him, was that ENEM scores are distorted because a large percentage of students from good schools take the test, but a much smaller proportion from bad schools do so. Some bad schools allegedly try to look better by discouraging their low-performing students from participating.
Our source also suggested color-coding public and private schools. Brazil’s educational system is very unequal: most public schools are much worse than most private schools. Curious about the results? See Figure 9.16. Each dot is a school. The X-axis is the average ENEM score. The Y-axis encodes participation rate (as a percentage). The vertical and horizontal lines are the averages of those variables, and they divide the plot into quadrants. Most public schools are on the lower-left quadrant (bad scores, low participation), while most private schools are on the upper right (good scores, high participation). There are plenty of exceptions, but the pattern is clear.
In the past five years, data teams have flourished in Brazilian media. The most active one is Estadão Dados, from O Estado de S. Paulo, one of the main national newspapers in the country. Figure 9.17 is, in my opinion, one of its finest projects. Its title reads, “How the Bolsa Família influenced voting for Dilma Rousseff.” It was designed by Rodrigo Burdarelli and José Roberto de Toledo.
The Bolsa Família is a welfare program that aids poor families in exchange for keeping their kids in school and vaccinated. Dilma Rousseff was elected president of Brazil twice, in 2010 and in 2014, when this graphic was published.
Rousseff belongs to a leftist party (the PT, Partido dos Trabalhadores). Estadão Dados’ goal was to show that there’s a clear relationship between percentage of families who receive help from Bolsa Família (X-axis) and vote for Rousseff (Y-axis).
In the graphic, each circle is a municipal district. Bubble size represents population. The association is solid, and it becomes even clearer when you only show northeastern districts or southern ones. Brazil’s northeast is very poor; southern cities and towns are much richer, on average.
I really liked this visualization the first time I saw it, but its title made me feel uneasy. Scatter plots can be deceptive: the relationship between Bolsa Família and votes for Rousseff exists, but does that mean that benefitting from this welfare program leads to an increase in votes for leftist candidates? We could be getting the sequence of events wrong: it may be that voting for leftist candidates in poor regions was already strong when Bolsa Família was launched in 2003. Longitudinal analyses are always more illuminating than cross-sectional ones.
I asked Estadão Dados about my misgivings, and its members told me that the headline was based on studies published in 1989, 1994, and 1998, which showed that before Bolsa Família, poorer areas didn’t vote massively for PT candidates. Moreover, a 2010 analysis by the Brazilian Institute of Public Opinion and Statistics (Ibope) revealed that the factor that better explained voting for the PT presidential candidate was Bolsa Família.
From Mexico’s El Financiero comes an elegant example of an interactive scatter plot (Figure 9.18). Designer-developers Hugo López and Jhasua Razo were interested in analyzing if the surface area of houses and apartments in Mexico City is proportional to their price. The relationship between the two variables is robust, but there are some curious outliers.
The visualization lets readers select which neighborhoods to show, and the X- and Y-axis vary accordingly. When you visit the graphic, notice how beautifully animation effects are used here, not just to make the presentation flashier but easier to understand as well.
In the years that I’ve been a New York Times subscriber (since 2012), its enormous graphics desk—around 30 people at the time of this writing—has produced plenty of fine relationship charts. Two of them have remained in my memory to this day. The first one, by Hannah Fairfield and Graham Roberts (Figure 9.19), is visualization expert Robert Kosara’s favorite chart ever:
It shows men’s versus women’s weekly earnings, with men on the horizontal axis and women on the vertical. A heavy black diagonal line shows equal wages, three additional lines show where women make 10%, 20%, and 30% less. Any point to the bottom right of the line means that women make less money than men. The diagonal lines are a stroke of genius (pun fully intended). When you see a line in a scatterplot, it’s usually a regression line that models the data; i.e., a line that follows the points. But such a line only helps reinforce the difficulty of judging the differences between the two axes, which is something we’re not good at, and which is not typically something you do in a scatterplot anyway. But the diagonal line, as simple as it is, makes it not just possible, but effortless. It’s such a simple device and yet so clear and effective. All the points on the line indicate occupations where men and women make the same amount of money. To the top left of the line is the area where women make more money than men, and to the bottom right where women make less.8
8 “My Favorite Charts,” https://eagereyes.org/blog/2014/my-favorite-charts.
Amen to that. My other personal favorite is the heat map in Figure 9.20, designed by Jon Huang and Aron Pilhofer and launched the day after Osama bin Laden was killed in Pakistan. The New York Times asked readers to send their opinions about the event: was it significant or not? Was their view positive or negative? Readers were also prompted to plot their responses on the chart. The result gives you an idea of how nearly 14,000 NYT’s readers reacted to this news: most of them are on the upper-right quadrant, that is, positive and significant.
On Monday, October 27, 2014, John Tory was elected mayor of Toronto by a comfortable margin of 6 percentage points. His main opponent was Doug Ford, the seedy older brother of the even seedier ex-mayor Rob Ford, who couldn’t run for re-election because he was diagnosed with cancer. Rob Ford’s tenure as a mayor had been punctuated by repeated incidents related to substance abuse, anything from alcohol to crack.
In a long story published by Global News, investigative reporter Patrick Cain asked “Who is Ford nation?”9 As he explained in the methodology section of the story, he first gathered the results of several polls. Then he found the center geographic point of where each poll had taken place with the help of a mapping program. He placed that point in, or as close as possible to, its corresponding electoral track. That way, he could compare polling estimates to indicators such as unemployment and income.
The plots are striking. Figure 9.21 shows just three of them: even as right-wingers, the Ford brothers found strong support in old areas of Toronto, where many citizens have low income, only a high school education, weren’t born in Canada, and English is not their primary language.
The connection between poverty and educational success, which is bidirectional and confounded by many factors, has been explored by social scientists and journalists for decades. However, it is rare to find a narrative as convincing as the one that ties together a 2015 series of articles and visualizations by the Daily Herald, a newspaper that serves Chicago’s suburbs, in collaboration with WBEZ (Chicago Public Radio).10
10 First story: http://www.dailyherald.com/article/20150622/news/150629873/. All data and the methodology used: http://reportcards.dailyherald.com/lowincome/.
Reporters and designers at both organizations looked at a decade of test scores and neighborhood-level poverty rates and revealed a very strong connection between low income and kids’ performance at school. See the scatter plots on Figure 9.22, created by Tim Broderick.
Journalists also looked into the past and discovered that a variation in the proportion of poor students in a school is a very good predictor of test scores in the following time period. They also explored data at the district level, and the same relationships exist. In some of the counties that encompass the Chicago urban sprawl, correlation is very close to being 1.0.
Moreover, the stories written by the Herald and WBEZ are inspiring examples of how to present complex data: not as charts alone—which may open the door for misinterpretation—but by putting the data in context by interviewing experts, school officials, politicians, parents, and children, and explaining the many nuances, caveats, exceptions, and gray areas in the analysis.
There is a difference between the scatter plots I designed at the beginning of this chapter and the ones coming from Global News and, above all, the Daily Herald. In the charts I used to explain correlation, it didn’t matter much which variable went into the X-axis and which into the Y-axis: they were interchangeable. In the charts from the Daily Herald, they aren’t: X (low income) is an explanatory variable, and Y (performance in tests) is a response variable (a variation in low-income rates can explain variations in test scores, at least in part).11 This isn’t just a correlational chart. It’s called a regression model.
11 In many books, explanatory variables are called “independent” variables, and response variables are called “dependent” variables. I find that nomenclature needlessly confusing. Some statisticians prefer the terms “input” and “output” variables.
In regression, X has some limited predictive value. If you anticipate a future value of X, you’ll also be able to roughly predict the corresponding Y value. How? I’m going to explain just the simplest kind of regression, univariate linear least squares regression. To learn about other kinds (logistic regression, Bayesian regression, etc., both univariate and multivariate), please go to the recommended readings at the end of the chapter.
Remember what we learned about scatter plots: the linear relationship between two quantitative variables can be expressed as an equation, a function of the form:
Y = X modified in some way
The formula for simple linear regression will look bewildering for a second, but don’t worry. I am including it just for you to understand the results you’ll get when you ask a software tool to calculate it:
Y = Intercept + X × Slope
When you calculate regression on a computer, you’ll usually get at least four estimated values: the intercept, the slope, a certain amount of error, and a confidence level. (We’ll get to the latter two soon.) What are the intercept and the slope? Look at Figure 9.23, which displays the relationship between SAT participation rates and scores.
The intercept is the value of Y at which the regression line crosses the 0 point on the X-axis.
The slope is the rate at which values of Y vary when values of X change. In our data set, this value is -3.68. That means that for each increment of 1 in the participation rate, there’s very roughly a 3.68 decrease in average SAT scores. This “very roughly” is an indispensable qualifier. In the scatter plot, many states are far from the regression line. Still, we can predict what the average score might be for a participation rate of 25 percent, for instance:
Intercept + X × Slope = Y
1,742 + 25 × (-3.68) = Average score of 1,650
You can check this value on the chart itself. Just put your finger on the 25 point on the horizontal scale and see at what point on the Y-axis the vertical gridline crosses the regression line: exactly at the 1,650 point. This is not considering the error in the model, though. As you can see in the chart, the actual SAT scores close to the 25-percent participation rate differ from that estimate quite a bit. No model is ever perfect.
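The prediction above, and the way software arrives at an intercept and a slope, can be sketched in Python. The first part uses the intercept (1,742) and slope (-3.68) quoted in the text. The second part fits a line to a handful of invented points, so its output says nothing about the real SAT data. Function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares fit: return (intercept, slope) of the best straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(x, intercept, slope):
    """Y = Intercept + X * Slope."""
    return intercept + x * slope


# Values quoted in the text for the SAT chart.
estimate = predict(25, intercept=1742, slope=-3.68)
print(round(estimate))  # 1650

# Invented participation rates and scores, only to show the fitting step.
xs = [2, 10, 40, 70, 100]
ys = [1800, 1690, 1610, 1480, 1390]
intercept, slope = fit_line(xs, ys)
print(round(intercept), round(slope, 2))
```

The fitted line is the one that keeps the squared vertical distances between the points and the line as small as possible, which is where the name "least squares" comes from.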
A couple of warnings are relevant at this point: first, visualization designers and journalists regularly deal with high-level aggregated data, such as country or state averages. Be aware that association between variables is usually much stronger in this kind of data than it is at lower levels of analysis—by city block, for instance.
Second, it is absolutely crucial to keep in mind that inferences from data should be done only at the same level of aggregation as the data. Group-level data can’t be used to analyze individual-level phenomena.12 In the current case, we’re dealing with state-level data. Our model would not work well to predict the average SAT score of a single school based on its rate of participation.
12 This is called the ecological fallacy. Read “Ecological Inference and the Ecological Fallacy,” http://web.stanford.edu/class/ed260/freedman549.pdf.
Let’s go back to Figure 9.23. Besides r, the correlation coefficient, there’s also an r² by the chart. This is the coefficient of determination, which is simply the result of squaring the correlation coefficient. If r = –0.91, as in this example, then r² = (–0.91) × (–0.91) = 0.83.
The coefficient of determination is a pretty useful statistic. It is a measure of how much of the variation in a response variable (Y) depends on the explanatory variable (X). You can think of it as a percentage. If r² = 0.83, you can say that 83 percent of the variation in SAT scores can be explained by the participation rate. We saw r² in the Daily Herald charts before (Figure 9.22), with values such as 0.63. Therefore, the proportion of students from low-income households (X-axis, explanatory variable) explains 63 percent of the variation in school test results.
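A short Python sketch can show why squaring r yields a "percentage explained." It computes the coefficient of determination two ways, by squaring r and by measuring how much of the variation in Y is left over after fitting the regression line, and the two agree. The data are invented and the function name is hypothetical.

```python
def r_squared_two_ways(xs, ys):
    """Return (r squared, share of the variation in Y explained by the line)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in Y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variation the line fails to account for: squared residuals.
    leftover = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return r ** 2, 1 - leftover / syy


# The arithmetic from the text.
print(round((-0.91) ** 2, 2))  # 0.83

# Invented participation rates and scores.
xs = [2, 10, 40, 70, 100]
ys = [1800, 1690, 1610, 1480, 1390]
squared, explained = r_squared_two_ways(xs, ys)
print(round(squared, 3), round(explained, 3))  # the two numbers match
```

The match between the two numbers holds for regression with a single explanatory variable fitted by least squares, which is the only case discussed here.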
There’s a difference between using one or more explanatory variables to estimate the variation of a response variable, and assuming that our explanatory variables are the causes for the changes on the response variables. Neither correlation nor regression implies causation, although they can be used as first clues when they are solid enough.
Ultimately, the only way to determine causation with reasonable certainty is to run randomized controlled experiments that can rule out extraneous factors. However, running experiments is difficult or isn’t even an option in many cases. If you wish to determine if poverty really causes worse performance in school, you’d need, for instance, to test a group of middle-income kids one year, then make them poor for a few years, and test them again, comparing them to another group of students whose living conditions don’t change (this is the control group). That’s not something that an academic ethics committee would ever approve!
What happens, then, when an experiment is out of the question? It is still possible to make the case for a causal connection if some strict criteria are met. Statisticians David S. Moore and George P. McCabe (see the recommended readings section) suggest these:
• The strength of the association—a high coefficient of determination, for instance—between the variables you’re studying.
• The association with other alternative explanatory variables is weaker.
• Several observational studies, using different data sets than yours, show a consistent strong connection between the variables.
• The explanatory variable precedes the response. For instance, to infer causation from a correlation between educational attainment and poverty levels, you need to examine if changes in education policies precede variations in poverty rates in a consistent manner.
• The cause makes logical sense and is rational. Remember Chapter 4, particularly the part about coming up with good explanations.
Moore and McCabe use the link between smoking and lung cancer as an example. The evidence we have is overwhelming, but is observational, not experimental.13 No sane person would dare put people at risk of getting lung cancer just to test if cigarette consumption is the cause.14
13 A more controversial case, at least in some parts of the United States, is that of the association between gun legislation and gun violence, which I believe fulfills all of Moore’s and McCabe’s criteria. To learn more, see Adam Gopnik’s “Armed Correlations,” http://www.newyorker.com/news/daily-comment/armed-correlations.
14 This hasn’t been the case in the past. The history of unethical medical experimentation on human subjects is long and painful to read. And its protagonists were usually quite sane, as books like Dark Medicine: Rationalizing Unethical Medical Research (2008) and Against Their Will: The Secret History of Medical Experimentation on Children in Cold War America prove. The discussion about whether it is ethical to use data generated by unethical experiments has kept philosophers and bioethicists busy for ages. For a good introduction, see Jonathan Steinberg’s “The Ethical Use of Unethical Human Research,” at http://tinyurl.com/qhoe346.
Relationship charts can be transformed in a similar way as we did when exploring time series and other kinds of data in other chapters: we can separate the smooth from the rough and then study the residuals. Or we can transform the magnitudes of the axes to clarify the relationship.
A classic example to illustrate this idea in statistics and biology courses is the correlation between body weights and brain weights of several species of mammals (Figure 9.24). When you design a scatter plot with the raw scores, most dots lie at the lower-left end because of the three glaring outliers. Correlation is very strong (r = 0.93), but our scatter plot is hard to read. If we take the logs, as I did on the second chart, gaps in the data disappear, and our model becomes clearer and more effective.
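Here is a rough Python sketch of what taking the logs does to the layout of a scatter plot. The weights are invented (they are not the mammal data behind Figure 9.24), and the "share in the bottom tenth of the axis" measure is just a hypothetical way of putting a number on how crowded the points are.

```python
from math import log10
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Correlation coefficient between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    covariance = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return covariance / (pstdev(xs) * pstdev(ys))


def share_in_bottom_tenth(values):
    """Fraction of points that fall in the lowest 10 percent of the axis range."""
    low, high = min(values), max(values)
    cutoff = low + 0.1 * (high - low)
    return sum(1 for v in values if v <= cutoff) / len(values)


# Invented body weights (kg) and brain weights (g): a few giants, many small animals.
body = [0.02, 0.1, 1, 10, 100, 1000, 5000]
brain = [0.6, 1.4, 11, 51, 411, 1245, 5946]

log_body = [log10(v) for v in body]
log_brain = [log10(v) for v in brain]

share_raw = share_in_bottom_tenth(body)
share_log = share_in_bottom_tenth(log_body)
r_log = pearson_r(log_body, log_brain)

print(round(share_raw, 2), round(share_log, 2), round(r_log, 2))
```

On the raw scale most of the animals are squeezed into a sliver of the horizontal axis. After the transformation they spread out, and the straight-line pattern becomes visible.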
• Few, Stephen. Signal: Understanding What Matters in a World of Noise. Burlingame, CA: Analytics Press, 2015. Steve’s fourth book is a concise and meaty overview of exploratory data analysis.
• Moore, David S., and George P. McCabe. Introduction to the Practice of Statistics (5th edition). New York, NY: W.H. Freeman and Company, 2005. Besides including a crystal-clear explanation of correlation and regression, this whole book is a treat.