Chapter Two
Pitfall 1: Epistemic Errors

“Everybody gets so much information all day long that they lose their common sense.”

Gertrude Stein

How We Think About Data

Epistemology is the branch of philosophy that deals with the nature, origin, and scope of our knowledge. It comes from the Greek words episteme (knowledge) and logos (word/speech) – knowledge speech, or, in other words, talking about knowledge.

Let's talk about knowledge as it relates to working with data. But first, why spend any time on this at all? A fair question, since clearly practitioners of many disciplines ignore the fundamental principles underlying their area of focus. Take driving an automobile, for example. Most drivers can't explain how the internal combustion engine works, or how the batteries in their electric vehicle function. But that doesn't stop them from driving around town, does it?

But working with data isn't like driving a car in this respect. It's more like cooking. In order to do it well, we need a working knowledge of the ways heat is transferred to food over time and how different flavors combine to produce a final result. We can't just start throwing ingredients together at random and hope to make a great dish. That's just one of the things I learned living with roommates in college, by the way.

But that's what happens when we start cooking things up with data before we have an understanding of the basic principles of knowledge. Epistemology is our data cookbook of sorts. Let's see what's in it.

Pitfall 1A: The Data-Reality Gap

The first epistemic principle to embrace is that there is always a gap between our data and the real world. We fall headfirst into a pitfall when we forget that this gap exists, that our data isn't a perfect reflection of the real-world phenomena it's representing. Do people really fail to remember this? It sounds so basic. How could anyone fall into such an obvious trap?

I'm not exaggerating when I say that I fail to avoid this trap almost every single time. The first pitfall is a gaping hole, and pretty much everyone falls right into it at first.

It works like this: I get my hands on some data and I perform some analysis, but I don't stop to think about where the data came from, who collected it, what it tells me, and, importantly, what it doesn't tell me.

It's easy when working with data to treat it as reality itself rather than data collected about reality. Here are some examples:

  • It's not crime, it's reported crime.
  • It's not the outer diameter of a mechanical part, it's the measured outer diameter.
  • It's not how the public feels about a topic, it's how people who responded to the survey are willing to say they feel.

You get the picture. This distinction may seem like a technicality, and sometimes it might be (the number of home runs Hank Aaron “reportedly” hit?), but it can also be a very big deal. Let's see some examples of how we can fall into this pitfall.

Example 1: All the Meteorites We Don't See

The Meteoritical Society provides data for 34,513 meteorites that struck the surface of the earth between 2,500 BCE and 2012.1 If you and I were to take this figure and run with it, we might make a number of incorrect assumptions that stem from the first data pitfall we're considering together.

Let's look more closely to get a better understanding of the impact of falling into this pitfall.

A friend of mine, Ramon Martinez, created a map (Figure 2.1) depicting where each of these 34,513 meteorites had struck the surface of the earth.

What do you notice about the data now that we're looking at it on a map? Doesn't it seem uncanny that meteorites are so much more likely to hit the surface of the earth where there's land, as opposed to where there's ocean? And what about areas like the Amazon (not the one in Seattle), or Greenland, or parts of Central Africa? Is there some kind of shield over those areas, or some deity protecting those areas from damage? We humans are great at coming up with bullshit theories like this, aren't we?

The explanation is obvious, and Ramon actually gives it to us right in the title at the top of the visualization: “Every Recorded Meteorite Impact.” In order for a meteorite to be in the database, it had to be recorded. And in order for it to be recorded, it had to be observed by someone. Not just anyone, but someone who knew whom to tell about it. And the person they told had to be faithful in carrying out their job of recording it. That's much more likely to occur in areas of higher population density of developed nations.

FIGURE 2.1 Meteorite strikes by Ramon Martinez.

Source: Ramon Martinez, https://public.tableau.com/profile/ramon.martinez#!/vizhome/meteorite_fall_on_earth/Meteoritefallonearthvisualized.

The map, then, isn't showing us where meteorites are more likely to strike the earth. It's telling us where meteorites are more likely to have fallen (in the past), and were observed by someone who reported it to someone who recorded it faithfully.

Now, that's a mouthful, isn't it? And you may roll your eyes and say it's all just a bunch of technicalities. But think again about the 34,513 figure. If we began with this figure, and if we assumed like I did at first that experts or enthusiasts observe and record every single meteorite strike no matter where it falls, then we'd have a pretty inaccurate idea of how often this kind of event actually occurs on the planet.

It's not that the data that the Meteoritical Society provides is wrong; it's just that there's a gap between the number of meteorites that have actually hit the earth since 2,500 BCE, and those that have been observed, reported, and recorded. It's safe to say that there's a massive difference between the unknowable total count and the number in the database. After all, about 71% of the earth's surface is covered by water, and some of the land itself is also completely uninhabited.

But the number of meteorites that are missing from the database because they aren't seen due to geographic reasons pales in comparison to the ones that are missing due to lack of historical record-keeping. If we look at a dot plot (Figure 2.2) that shows the number of meteorites recorded by calendar year – each year having its own dot – we see that records weren't kept until the twentieth century.

There's a huge gap in time between the oldest known meteorite (traced to Iraq, from about 2,500 BCE) and the second oldest (traced to Poland, from about 600 BCE). No year prior to 1800 includes more than two recorded meteorites. And then in the twentieth century the numbers increase dramatically, with over 3,000 recorded in 1979 and 1988 alone. It's safe to assume that ancient times saw a plethora of meteorites as well; it's just that humans didn't see them. Or if they did, they had nowhere to record it – at least not in a place that was preserved over the ages.
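
If you'd like to recreate a count-by-year view like the one in Figure 2.2, a minimal sketch in Python might look like this. It assumes you've exported the Meteoritical Bulletin data to a local CSV with a "year" column; the file name and column name here are my assumptions, so adjust them to match your actual download.

```python
# Minimal sketch: count recorded meteorite falls per year.
# Assumes a local CSV export of the Meteoritical Bulletin data with a "year"
# column; the file name and column name are assumptions, not the official schema.
import pandas as pd
import matplotlib.pyplot as plt

meteorites = pd.read_csv("meteorites.csv")

# Coerce the year to a number and drop records that don't have one.
meteorites["year"] = pd.to_numeric(meteorites["year"], errors="coerce")
per_year = meteorites.dropna(subset=["year"]).groupby("year").size()

# One dot per year, as in Figure 2.2: nearly everything sits near zero until the 1900s.
plt.scatter(per_year.index, per_year.values, s=10)
plt.xlabel("Year")
plt.ylabel("Recorded meteorites")
plt.show()
```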

FIGURE 2.2 A timeline of recorded meteorite falls, 2,500 BCE–2012

Example 2: Are Earthquakes Really on the Rise?

Let's consider another geological phenomenon: earthquakes. I grew up in Southern California, and I remember the early morning of January 17, 1994, quite vividly. At 4:31 a.m., a magnitude 6.7 earthquake struck the San Fernando Valley region of Los Angeles, killing 57 people, injuring more than 8,700, and causing widespread damage.

The United States Geological Survey provides an Earthquake Archive Search form that lets visitors obtain a list of historical earthquakes that meet various criteria.2 A query of earthquakes of magnitude 6.0 and above from 1900 to 2013 yields a somewhat alarming line plot (Figure 2.3).
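
As a rough illustration of how a yearly count like the one behind Figure 2.3 can be assembled, here's a sketch that assumes you've saved the search results as a CSV with "time" and "mag" columns; the file name and column names are assumptions to verify against your own download.

```python
# Minimal sketch: recorded earthquakes of magnitude 6.0 and above, per year.
# Assumes a CSV saved from the USGS Earthquake Archive Search with "time" and
# "mag" columns; verify these names against your own download.
import pandas as pd
import matplotlib.pyplot as plt

quakes = pd.read_csv("usgs_earthquakes_1900_2013.csv", parse_dates=["time"])

big = quakes[quakes["mag"] >= 6.0].copy()
big["year"] = big["time"].dt.year

big.groupby("year").size().plot()  # a line plot, like Figure 2.3
plt.xlabel("Year")
plt.ylabel("Recorded earthquakes (magnitude 6.0+)")
plt.show()
```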

Are we really to believe that earthquakes have increased in frequency by this much? Obviously not. The world that measured and collected earthquakes in the early twentieth century was very different than the one that did so in the last decade. Comparisons across decades, and even within some decades (the 1960s, for example), aren't “apples-to-apples” due to the changes in technology.

FIGURE 2.3 A line plot of worldwide earthquakes of magnitude 6.0 and greater, 1900–2013.

If we separate the line plot by magnitude and add annotations that describe advances in seismology, we see that the rise is only in the smaller group (magnitude 6.0–6.9), and coincides with dramatic improvements in instrumentation (Figure 2.4).

It's safe to say that the rise in recorded earthquakes is primarily due to the improvements in our ability to detect them. There may also be an upward trend in actual earthquakes over this time, but it's impossible for us to know for sure due to the continual changes in the quality of the measurement system. For all we know, the actual trend could be a decreasing one.

When it comes to earthquakes, the gap between data and reality is getting smaller. As much as this is a marvelous technical development to be applauded, a by-product of our advancement is that it's difficult to discern historical trends.

FIGURE 2.4 Breaking out worldwide earthquakes by magnitude group.

The fundamental epistemic problem is that the “data-reality gap” is dramatically changing over the time period we're considering. It's hard to know for sure exactly how many magnitude 6.0 earthquakes we missed in any particular year.

Let's look at another example: counting bicycles that cross a bridge.

Example 3: Counting Bicycles

Every day on my way to work from 2013 to 2015 I would walk across the Fremont Bridge in Seattle, Washington. It's a bright blue and orange double-leaf bascule bridge that was built in 1917. Since it sits so close to the water, it opens on average 35 times a day, which supposedly makes it the most opened drawbridge in the United States. Here's what it looks like (Figure 2.5).

Seattle is a city of avid bicyclists, and the City of Seattle Department of Transportation has installed two inductive loops on the pedestrian/bicycle pathways of the bridge that are designed to count the number of bicycles that cross the bridge in either direction, all day every day. The city also provides hourly counts going back to October 2, 2012, at data.seattle.gov.3 Downloading this data and visualizing it yields the timeline in Figure 2.6.
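
If you'd like to follow along, here's a minimal sketch of that download-and-visualize step in Python. The file name is a placeholder, and the direction column names ("Fremont Bridge NB" and "Fremont Bridge SB") are taken from the labels the data set uses, as mentioned later in this chapter; check them against your actual export.

```python
# Minimal sketch: daily bicycle counts by direction from the hourly data.
# The file name is a placeholder and the column names are assumptions; check
# them against the CSV you download from data.seattle.gov.
import pandas as pd
import matplotlib.pyplot as plt

hourly = pd.read_csv("fremont_bridge_hourly.csv", parse_dates=["Date"])
daily = (
    hourly.set_index("Date")[["Fremont Bridge NB", "Fremont Bridge SB"]]
    .resample("D")
    .sum()
)

daily.plot()  # two lines, one per direction, as in Figure 2.6
plt.ylabel("Bicycles counted per day")
plt.show()
```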

FIGURE 2.5 The Fremont Bridge as seen from the Aurora Bridge in Seattle, Washington.

I showed this timeline during a presentation at a luncheon of marketing researchers not far from the bridge one day, and I asked the attendees what they thought of these spikes in late April 2014. In that moment, I had no idea what had caused them.

A few ideas quickly sprang from the crowd. Was it “bike to work day”? Maybe the weather was abnormally beautiful and everyone got bike fever at the same time. It seemed strange, though, that there was a spike on only one side of the bridge and not the other. How did they get home? Was there an actual spike that gave them all flat tires so they couldn't ride home? Or maybe there was an organized bicycle race or club event with a looping route, so the riders crossed the water somewhere else instead of turning around and coming back over the Fremont Bridge.

FIGURE 2.6 A time series of the counts of bicycles crossing the Fremont bridge, October 2012–December 2014.

Notice how each of these ideas is based on the assumption that there actually were more bikes that crossed the bridge on those two days. No one in the room thought to question that basic assumption, myself included. We collectively shrugged our shoulders and I moved on with my presentation.

About 20 minutes later, one of the attendees in the back of the audience held up his smartphone (presenting is such a treacherous endeavor these days) and shouted out that he had found the reason for the spikes. It was a case of equipment error.

The counter had glitched for a period of time in April of that year, but only the counter on the eastbound side of the bridge (labeled “Fremont Bridge NB” in the data set for some reason). You can read all the details of these anomalous readings and the correspondence between a local blogger and a city employee at the Seattle Bike Blog.4 The title of the blog post says it all: “Monday appears to smash Fremont Bridge bike counter record – UPDATE: Probably not.”

According to correspondence between the blogger and the city employee published in updates to the blog post, there were actually four hour-long spikes in bicycle counts on the mornings of April 23, 25, 28, and 29. If you look closely at the timeline in Figure 2.6, you'll see the higher values in the blue line just before the huge spike as well. They never figured out what was wrong with the counter, if anything, but they validated that it was working properly and replaced some hardware and the battery.

What's interesting is that if you go to download this data today, you won't see the four daily spikes at all. They have adjusted the data, and the abnormally high values have been replaced with “typical volumes.” Fascinating.
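
One practical safeguard is to screen the measurements themselves for values that look implausible for the measurement system before theorizing about causes out in the real world. Here's a minimal sketch of such a screen; the file and column names are the same assumptions as before, the threshold is arbitrary, and a real check would be tuned to the counter's known behavior.

```python
# Minimal sketch: flag hourly counts that sit far above the typical level for
# that hour of the week -- a crude screen for possible counter glitches, not a
# verdict. The threshold multiplier is an arbitrary assumption.
import pandas as pd

hourly = pd.read_csv("fremont_bridge_hourly.csv", parse_dates=["Date"])
counts = hourly.set_index("Date")["Fremont Bridge NB"]

# Typical (median) count for each day-of-week / hour-of-day slot.
typical = counts.groupby([counts.index.dayofweek, counts.index.hour]).transform("median")

# Flag hours more than five times their slot's typical value (ignoring tiny counts).
suspicious = counts[(counts > 5 * typical) & (counts > 100)]
print(suspicious.sort_values(ascending=False).head(10))
```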

What really stuck with me was the fact that everyone in the room – myself included – immediately started proposing root causes from inside a specific box, and none of us thought to think outside of that box. The box in which we were stuck has a simple equation that describes it: data = reality. I find myself in that box over and over again. Every time I do, it makes me laugh at how easy it is to fall into this pitfall when working with data.

Let's consider yet another example of this first epistemic pitfall: counting Ebola deaths.

Example 4: When Cumulative Counts Go Down

In 2014, the whole world watched in horror as Ebola ravaged West Africa. During the crisis, the World Health Organization (WHO) provided data about fatalities in weekly situation reports.

Let's take a look at a timeline of cumulative deaths from Ebola as reported by the WHO and the Centers for Disease Control (CDC) from March 2014 through the end of that year (Figure 2.7). Notice the drops in cumulative death counts – the handful of times when the lines slope downward.

At first blush, this seems somewhat odd. How can there be fewer total people dead from the disease on one day than there were at the end of the previous day? The way I've worded the question shows that I've already fallen into the pitfall.

Let's ask it a different way: How can the total number of reported deaths due to the disease decrease from one day to the next?

Of course, it makes perfect sense: the task of diagnosing disease and ascertaining causes of death in some of the more remote locations, where the equipment and staff are often severely limited, must be incredibly difficult.

And the cause of death of any particular person isn't always so obvious to the professionals producing the figures. Often test results that are received days or even weeks later can change the recorded cause of death. In the case of a fast-moving pandemic, guesses have to be made that are proven either true or false at a later point in time.

FIGURE 2.7 Cumulative timeline of Ebola fatality count in West Africa, 2014.

That's exactly why, if you read the WHO situation report, you'll notice that they classify cases as “suspected,” “probable,” and “confirmed.” The criteria are shown in Figure 2.8.

The WHO and CDC actually do a very good job of speaking clearly about “reported” cases (the December 31 WHO situation report includes the word “reported” no less than 61 times).

I don't bring up this example to criticize the people or organizations involved with fighting and documenting the Ebola outbreak. Far from it. If anything, I commend their heroic efforts at fighting the disease and caring for those who were suffering and dying. And I commend them for clearly communicating to us the uncertainty inherent in their reporting of the data. But you can see how easy it would be for someone who downloaded the data, like me, to be confused by what I saw.

FIGURE 2.8 World Health Organization classification table of Ebola cases.

It turns out that classifying diseases and deaths in chaotic conditions can be tricky business indeed. This example merely demonstrates that the gap between data and reality can exist even when the stakes are high, and when the whole world is watching.

That's because this gap always exists. It's not a question of whether there's a gap, it's a question of how big it is.

Remember the supposedly trivial example of Hank Aaron's “reported” home run tally that I mentioned in passing earlier? Well, he hit an astounding 755 home runs over the course of his career in Major League Baseball (MLB), which stood as a record for 33 years. But what about the six home runs he hit in the playoffs, when it mattered the most? Or the two home runs he hit in the All-Star Games in which he played while representing the National League in 1971 and 1972? And let's talk about the five home runs he hit in 26 official games while playing professionally for the Indianapolis Clowns of the Negro American League prior to joining the Braves. Shouldn't those count too? They're not incorporated into the official tally, which only includes regular-season home runs hit in MLB. But one could make the argument that those additional 13 home runs that he hit while playing professional baseball should bring his official career tally to 768.

There's always a gap.

Pitfall 1B: All Too Human Data

So let's recap. In Pitfall 1A, we saw data-reality gaps that result from measurement systems that change in resolution over time (seismology), from equipment with unknown glitches (bicycle counters), from human observation and record-keeping that leaves data missing (meteorites), from classifications that are later corrected (Ebola deaths), and from tallies based on unstated or unclear criteria (Hank Aaron's home runs).

But there's another type of gap that we humans create very frequently when we record values that we ourselves measure and then manually key in: we round, we fudge, and we guesstimate. We're not perfect, and we certainly don't record data with perfect precision.

The first example of rounding in human-keyed data that I'd like to look at is shown in Figure 2.9. It's the number of minutes past the hour that pilots provide when they report to the FAA that their aircraft struck wildlife at a particular moment on the runway or in flight. Now, I'm not familiar with the process that these pilots follow when they capture and report these incidents, but I'm willing to bet from looking at this chart that they're either writing down, dictating to someone, or keying in the time of day.

Of course, we know that the likelihood of an airplane hitting a bird or other creature doesn't change as a function of the number of minutes past the hour in this way. It's not as if the clock changes from 1:04 p.m. to 1:05 p.m. and then suddenly the actual frequency of wildlife strikes more than quadruples. The dramatic rise in the columns is due to our tendency to round time when we glance at a watch or a clock. We see 1:04 and we write down 1:05, or, heck with it, we just call it 1:00 even – close enough, right? Pilots are just like us.
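
If you want to look for this kind of rounding pattern in your own data, a sketch like the one below works for any human-keyed timestamp. The file and column names are placeholders; the idea is simply to tally how often each minute value appears and how many records land on the “round” minutes.

```python
# Minimal sketch: how often does each "minute past the hour" appear?
# The file and column names are placeholders for whatever time-of-day field
# your data set carries.
import pandas as pd
import matplotlib.pyplot as plt

strikes = pd.read_csv("wildlife_strikes.csv")
minutes = pd.to_datetime(strikes["incident_time"], errors="coerce").dt.minute.dropna()

minutes.value_counts().sort_index().plot(kind="bar")  # expect spikes at 0, 15, 30, 45
plt.xlabel("Reported minute of the hour")
plt.ylabel("Number of records")
plt.show()

# Share of records landing exactly on the hour or on 5- and 15-minute marks.
print("on :00          ", (minutes == 0).mean())
print("divisible by 15 ", (minutes % 15 == 0).mean())
print("divisible by 5  ", (minutes % 5 == 0).mean())
```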

FIGURE 2.9 Example of rounding in human-keyed data: number of records by reported minute of the hour.

If this data had been generated by some sort of sensing mechanism mounted to the aircraft that automatically logged each strike, and included a time-date stamp with each record, you can bet this triangular pattern would completely disappear. And it's not like data created by such a nonhuman measurement system would be perfect, either. It would have its own unique quirks, idiosyncrasies, and patterns imposed by the equipment itself. But it wouldn't round like this – not unless we programmed it to. And in that case, there wouldn't be any nonrounded time entries at all.

What's fascinating to me, though, is the geometric regularity of this chart. Think about it: the plot comes from over 85,000 reported wildlife strikes that took place over the course of 18 years. Data provided by thousands and thousands of individual pilots all over the country over almost two decades ended up producing this pattern that feels like it was generated by a mathematical formula. Just take a look at how the column heights reach up to very interesting frequency lines (Figure 2.10).

And it's not just wildlife strikes that produce this pattern. Here's data that my friend Jay Lewis collected that shows the minutes past the hour recorded for the first 1,976 diaper changings of his child (Figure 2.11). Doesn't the pattern look familiar? Now that's what I call dirty data.

We do this kind of fudging or rounding when we report other quantitative variables too, not just with time. Let's look at another example of this kind of human-keyed rounding behavior (Figure 2.12).

The weights of NBA players from the 2017–18 season can be plotted using a histogram, and at first blush, if we use a bin size of 10 pounds, we don't see any evidence of rounding or lack of precision whatsoever.

FIGURE 2.10 Geometric regularity: reported strikes by minute of the hour, with reference lines.

FIGURE 2.11 Diapers changed by minute of timestamp, over a period of five months.

Let's look a little bit deeper, though. What happens if we change the bin size from 10 pounds to 1 pound? Now instead of grouping players in bins of 10 pounds (all six players with weights between 160 and 169 pounds get grouped together, the 20 players with weights between 170 and 179 pounds get grouped together, etc.), we create a bin for each integer weight: the two players listed at 160 pounds are grouped together, the single player listed at 161 pounds gets his own group, and so on.

When we do this, another interesting pattern emerges that tells us something's going on with the measurement system here. The process of capturing and recording data is resulting in a fingerprint of human-keyed data again, this time with a different pattern than the one we saw when we looked at time data (Figure 2.13).

What's going on here? Almost half of the players have a listed weight that's divisible by 10, and almost 3 out of every 4 players (74%) have a weight that's divisible by 5. There are players, though, with listed weights that don't fall into these neat buckets. Just over 1 in 4 (26% to be precise) has a listed weight that isn't divisible by 5, such as the three players listed at 201 pounds, for example – clearly a number that begs to be rounded, if any does. But players with weights like this are in the minority.
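
Here's a sketch of that same divisibility check, assuming a table of listed weights in whole pounds; the file and column names are mine, not those of any particular roster source.

```python
# Minimal sketch: look for "chunky" human-keyed weights.
# Assumes a CSV of roster data with an integer "weight" column; the file and
# column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

players = pd.read_csv("nba_rosters_2017_18.csv")
weights = players["weight"].dropna().astype(int)

print("share divisible by 10:", (weights % 10 == 0).mean())
print("share divisible by 5: ", (weights % 5 == 0).mean())

# A 1-pound bin histogram makes the spikes at multiples of 5 and 10 obvious.
weights.plot(kind="hist", bins=range(weights.min(), weights.max() + 2))
plt.xlabel("Listed weight (pounds)")
plt.show()
```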

FIGURE 2.12 A histogram of NBA player weights, with bin size of 10 pounds.

Of course, the actual weights of the players, if we were to weigh them all and capture the readings automatically using a digital scale, wouldn't produce this type of “chunky” data, would it? There might be some grouping of players around certain values, but this is clearly caused by humans reporting approximate values.

I'm also quite certain that the doctors and trainers employed by the basketball teams have much more granular biometric data about these players than the figures posted to rosters published on the Internet. But the processes that produce the particular values that you and I can see on the web definitely show fingerprints of human-keyed data.

FIGURE 2.13 An adjusted histogram of NBA player weights, this time with bin size of 1 pound.

Let's look elsewhere. If we scrape the online rosters of over 2,800 North American professional football players who were listed as active on preseason rosters before the 2018 season, we see similar grouping around weights divisible by 5 and 10, but not quite to the same degree; only half of the players fall into these neat buckets, and the other half fall into buckets not divisible by 5 or 10 (Figure 2.14).

The measurement system and process of capturing and recording the weights of American football players and publishing these figures to online rosters is roughly twice as likely as its basketball counterpart to result in a value that isn't divisible by 5. It could be that player weight is much more of a critical factor in this sport, and therefore one that's more closely tracked and monitored. But that's just a guess, if I'm honest. We'd have to map out the measurement system for both leagues to figure out the source of the difference in resolution.

And you may say, well – who cares? We're talking about a minute here or there in the pilot wildlife strike example, and a pound here or there in the basketball player weight example. But the thing is, sometimes that level of precision really does matter, and the data might reflect that.

FIGURE 2.14 A histogram of weights of North American football players.

There's actually a perfect scenario that shows what the data looks like when the precision of player weight data matters a whole lot more than it does for online team rosters. Every year, American football players who are entering the professional draft are tracked, scrutinized, and measured by team scouts with extreme precision in an event called the NFL Combine. These players are put through a gauntlet of physical performance tests, and practically everything except the number of hairs on their head is counted and measured. What type of weight profile does this event produce?

If we look at the 1,305 players who entered the Combine from 2013 through 2018 and who ended up playing in the NFL, we find that more than 3 in 4 players have recorded and published weights that don't end in either 0 or 5 (Figure 2.15).

The lack of a human-keyed, rounded process is evident. If we look at the frequency of player weight by last digit, we see that the Combine produces a very uniform distribution, and players have been no more likely to have a recorded weight ending in 0 or 5 than in any other digit (Figure 2.16).
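
A last-digit tally like the one behind Figure 2.16 takes only a few lines; once again, the file and column names below are assumptions.

```python
# Minimal sketch: frequency of the last digit of recorded weights.
# A roughly uniform result (about 10% per digit) suggests little human rounding.
import pandas as pd

combine = pd.read_csv("nfl_combine_2013_2018.csv")  # file and column names assumed
last_digit = combine["weight"].dropna().astype(int) % 10

print(last_digit.value_counts(normalize=True).sort_index().round(3))
```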

FIGURE 2.15 Histogram of weights of 283 players in the 2018 NFL Combine.

FIGURE 2.16 Histogram of the number (and %) of players having weights ending in 0 through 9.

So what does this mean? Well, measurement systems can be very different, even when they measure the exact same variable (weight) of the same type of object (American football players). Some measurement systems involve a high amount of rounding, fudging, and guesstimating by humans, some involve less, and some don't involve much at all. We can't know which we're dealing with unless we obtain a deep understanding of the measurement system process and a deep understanding of the data that it produces.

When we obtain such understanding, we'll be a little bit closer to knowing where the data-reality gap is coming from this time.

Pitfall 1C: Inconsistent Ratings

The Internet thrives on ratings generated by humans. Just in the past month, I've been prompted to provide a rating for a greasy spoon (but amazing) New Orleans breakfast restaurant, countless car share rides, a few meditations I listened to on a popular app, three audiobooks, a print book written by none other than my own mother – you name it. And that just scratches the surface.

Before we leave the section on human imperfections in the data collection process, let's talk bananas. Yes, bananas.

I like to run silly polls on social media now and then. Sometimes I throw my followers a little curveball with my polls, like the one in Figure 2.17:

So what this says is that 1 in 5 people who responded to a poll on social media say that they're unwilling to respond to a poll on social media. Another 1 in 3 respond that they prefer not to say whether or not they're willing to respond. Hmm …

But I digress. Bananas. I ran another not-so-scientific little poll this past year asking my social media pals to rate a series of 10 banana photos on a ripeness scale. Each photo was classified by respondents as unripe, almost ripe, ripe, very ripe, or overripe. These five different categories of ripeness have not been vetted by the National Association of Banana Raters, or any other such body, if one exists. They came from my brain – clearly a surprising place.

FIGURE 2.17 Social media poll.

Figure 2.18 presents the photos of the bananas I showed them. Each photo was shown once, and every respondent was shown the exact same bananas in the exact same order.

What will shock no one is that people don't tend to think of banana ripeness quite the same way. A banana that's ripe to me may be considered almost ripe to you, and it's most definitely overripe to someone else.

What was mildly startling to me, though, was just how differently respondents rated them. Only two of the ten photos received fewer than three different ripeness levels among the 231 respondents. Four of the photos got put into four different categories. And one of the photos was put into each of the five ripeness buckets at least once. The results appear in Figure 2.19 for you to see for yourself.

FIGURE 2.18 Bananas in various stages of ripeness.

But this wasn't the point of this fun little informal survey at all. Again, while I think it was interesting that there was that much disparity in categorization, the truth is I was testing something else altogether. I wasn't so much interested in consistency between raters as I was interested in consistency within each rater.

Did you notice the trick? Take a look at the photos again. One of the ten images is exactly the same as another one, only mirrored. The image of the bunch of bananas that was shown second in the survey was shown again at the end of the survey, but flipped horizontally. The survey didn't mention anything about this; it simply asked for a ripeness rating for each photo.

FIGURE 2.19 Results of the banana ripeness assessment.

What I was really interested in finding out was how many people rated these two photos the same, and how many people rated them differently.

My hypothesis was that one in ten or maybe one in twenty would change their rating. In actuality, more than one in three – a full 37% – changed their rating. Of the 231 respondents, 146 rated the tenth photo the same way they rated the second one. But 85 of them rated the photos differently.

This Sankey diagram shows the flow of raters from the way they rated photo 2 on the left to how they rated photo 10 on the right (Figure 2.20).

Another way of looking at the change gives a clue as to what might be happening. If we plot the way respondents rated the second photograph on the rows and the way they rated the tenth photograph on the columns of this five-by-five matrix, we notice that the majority of people who changed their rating increased the level of ripeness. In fact, 77 of the 85 people who changed their rating increased their level of ripeness (e.g. from “almost ripe” to “ripe,” or from “ripe” to “very ripe”) while only eight decreased their rating (Figure 2.21).
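
A matrix like the one in Figure 2.21 is essentially a cross-tabulation of the two ratings. Here's a sketch, assuming one row per respondent with their photo 2 and photo 10 answers; the file and column names are mine.

```python
# Minimal sketch: compare each respondent's rating of photo 2 with photo 10.
# Assumes one row per respondent with "photo2" and "photo10" columns; the file
# and column names are assumptions.
import pandas as pd

levels = ["unripe", "almost ripe", "ripe", "very ripe", "overripe"]
responses = pd.read_csv("banana_survey.csv")

for col in ["photo2", "photo10"]:
    responses[col] = pd.Categorical(responses[col], categories=levels, ordered=True)

# The 5x5 matrix: rows = photo 2 rating, columns = photo 10 rating.
print(pd.crosstab(responses["photo2"], responses["photo10"]))

# Who kept the same rating, who moved up the ripeness scale, who moved down?
delta = responses["photo10"].cat.codes - responses["photo2"].cat.codes
print("same   :", (delta == 0).sum())
print("riper  :", (delta > 0).sum())
print("greener:", (delta < 0).sum())
```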

FIGURE 2.20 Respondents' changes in ripeness ratings.

So why did such a high percentage of the raters who changed their rating increase their level of ripeness? Well, let's consider the photo that was shown ninth, the one immediately before the tricky tenth flipped photo (Figure 2.22).

These bananas look a little green, don't they? Now, again, this survey was highly unscientific, informal, and not at all controlled. While it's theoretically possible that respondents could've just been selecting at random to get through it, I don't really have a reason to think that that's what occurred en masse. Respondents weren't offered any incentive or honorarium, so I'll assume for the sake of argument that they gave it a decent shot.

FIGURE 2.21 Ratings of photo 2 vs ratings of photo 10.

FIGURE 2.22 The ninth banana photograph shown.

What does it mean, then? To me, it says that it's possible that we're not perfect models of objectivity and consistency when it comes to rating things, that our ratings and opinions have a degree of noise in them, even over short time horizons, and that we're possibly influenced to some degree by the context, or the order in which we provide our opinions.

How is this related to the topic at hand? Every measurement system has some degree of error due to challenges with repeatability and reproducibility. It's true of more than just rating bananas. Your data was created by a measurement system, and that measurement system isn't perfect. Different people performing the measurement process get different results, and even the same person repeating the procedure can end up with different readings sometimes due to sources of noise and error. It's a fact of life that means our data isn't a perfect reflection of reality.

How to Avoid Confusing Data with Reality

Notice that in each of these cases, something in the view of the data itself alerted us to a potential “data-reality gap.” Visualizing the data can be one of the best ways to find the gaps.

Earlier in the game, though, it helps to remind ourselves that every data point that exists was collected, stored, accessed, and analyzed via imperfect processes by fallible human beings dealing with equipment that has built-in measurement error.

The more we know about these processes – the equipment used, the protocol followed, the people involved, the steps they took, their motivations – the better equipped we will be to assess the data-reality gap.

Here are seven suggestions to help you avoid confusing data with reality:

  • Clearly understand the operational definitions of all metrics.
  • Draw the data collection steps as a process flow diagram.
  • Understand the limitations and inaccuracies of each step in the process.
  • Identify any changes in method or equipment over time.
  • Seek to understand the motives of the people collecting and reporting. Could there be any biases or incentives involved?
  • Visualize the data and investigate any shifts, outliers, and trends for possible discrepancies.
  • Think carefully about data formatting, processing, and transformations.

Ultimately, each data collection activity is unique, and there are too many possible sources of error to list them all. These are some typical ones that I've come across, and you may have your own suggestions.

At the core of the data-reality gap pitfall is our attitude toward data. Do we arrogantly or naively see ourselves as experts on a topic as soon as we get our hands on some data, or do we humbly realize that our knowledge is imperfect, and we may not know the full story?

We can't ever perfectly know the data-reality gap because that would require perfect data. What we can do, though, is seek to identify any gaps that may exist, and take them into account when we use data to shape our understanding of the world we live in.

If the data-reality gap deals with what data is and what it isn't, then the next section seeks to clarify what data can be used for and what it can't be used for.

Pitfall 1D: The Black Swan Pitfall

The way the popular thinking goes, we put data to its best possible use by employing it as a tool to verify truths about the world we live in. And I can see where this idea comes from. I want to know how many bicycles cross the Fremont bridge in a month, so I download the data from the government department website, I carry out a very simple computation, and I come up with my answer.

Bam! Question, meet answer. What could be better than that?

As useful as this information might be to us, there's only one problem with the thinking that affirmative answers are the best thing we can get from data.

It's dead wrong, and it actually plays right into a psychological deficiency we all have as humans.

In fact, the exact opposite is true. The best possible use of data is to teach us what isn't true about our previously held conceptions about the world we live in, and to suggest additional questions for which we don't have any answers yet. Embracing this means letting go of our egotistical need to be right all the damn time.

Before explaining, I need to distinguish between two kinds of statements that we deal with when we work with data. In his seminal 1959 work on the epistemology of science, The Logic of Scientific Discovery, Austrian-born philosopher Dr. Karl Popper expounded upon these two types of statements: singular statements and universal statements.

  • A singular statement (e.g. “That swan over there is white”) is a basic observation about the world we live in. It's an empirical fact.
  • A universal statement (e.g. “All swans are white”) is a hypothesis or theory that divides the world into two kinds of singular statements: those that the universal statement permits, and those that it does not permit. These latter statements that it does not permit, if observed in the real world, would falsify the universal statement.

What Popper taught us is that no amount of corroborating observations of singular statements can prove a universal statement to be true. No matter how many white swans we encounter in our search, we haven't come any closer to proving that non-white swans don't exist anywhere in the universe.

But here's the problem: it sure does feel that way. We see nothing but white swans for our entire life. Then we encounter yet another white swan, and our conviction that all swans are white just gets stronger.

It's called induction – arguing from the specific to the general. It's incredibly useful for forming hypotheses to test, but not useful at all for proving those hypotheses true or false. It definitely can give us the feeling of certainty, though, which often leads to very strong convictions.

As Popper pointed out, mere conviction is no basis for accepting a theory into the body of knowledge that's called science. That's faith. There's nothing inherently wrong with faith; we just can't use it to prove something to someone else.

On the other hand, it only takes a single observation of a non-white swan in order to debunk our universal statement and show it to be unequivocally false. That's exactly what happened when, in 1697, Willem de Vlamingh led a group of Dutch explorers to Western Australia and they became the first Europeans to observe a black swan, immediately dispelling the commonly held belief that all swans were white (Figure 2.23).

Like the Europeans and their erroneous induction from repeated observations of white swans to the belief that all swans are white, we often assume that singular statements that we encounter in data verify universal truths. We infer that something we see in the data applies well beyond the time, place, and conditions in which it happened to surface:

FIGURE 2.23 A black swan I photographed on a recent trip to Maui.

  • It's not just how many times bikes crossed the Fremont bridge in April 2014, it's how many bikes cross the bridge in general.
  • It's not just the preference of certain particular customers, it's the preference of all other potential customers as well.
  • It's not just that the pilot manufacturing line had high yields during qualification, it's that the process will also have high yields at full volume production as well.
  • It's not just that a particular mutual fund outperformed all others last year, it's that it'll be the best investment going forward.

How often do we come to find out afterward that these inductive leaps from the specific to the general are wrong? It's as if we have a default setting in our brain that assumes that any facts we discover are immutable properties of the universe that will most certainly apply going forward. It's a subtle but insidious mistake in the way we think about data. We even fall into the swan pitfall when there are warning signs right on the prospectus itself: “past performance does not predict future returns.”

That's why it's so important that we understand the difference between singular and universal statements, and that any time we consciously decide to work in the realm of universal statements, we commit to construct universal statements that are falsifiable. That is, the set of all possible singular statements that could prove our hypothesis wrong must not be empty. The universal statement “All swans are white” can and was shown to be false. That's a good thing.

But what kind of statement isn't falsifiable? Isn't it theoretically possible to prove anything someone says false? No, not necessarily. Popper pointed out that basic existential statements (e.g. “such-and-such a thing exists.”) aren't actually falsifiable. Why not?

Take the singular statement “There exists a black swan.” It's pretty easy to show it to be true – all we need to do is to find one. But what if we can't? Have we shown the statement to be false? Actually, no, we haven't, because as much searching as we might've done, there's always the possibility that we missed it, or that it's somewhere we haven't yet looked. Maddening, right?

Pitfall 1E: Falsifiability and the God Pitfall

That's why the statement “God exists” doesn't belong in the realm of science or data analysis. No matter what we do, we can't ever prove it to be false. She/he/it might just be hiding from us, or just not detectable by our senses. That's why it bothers me when people use science or data to try to disprove the existence of a god. It's a pointless exercise, because the hypothesis isn't falsifiable in the first place; it's a basic existential statement. You're welcome to believe it or not. If you don't, just don't delude yourself that you have evidence that one isn't there.

But on the other hand, if you do believe it, don't go spouting a bunch of statements that aren't falsifiable either about how such a god created the universe, and call it “science,” because it's not. That's exactly why Judge William Overton ruled that creationism couldn't be taught in schools as science in the 1982 ruling in McLean v Arkansas Board of Education. Among other things, the court found that the claims the creationists were making weren't falsifiable, and therefore not science. The Supreme Court agreed with him five years later when a similar case against Louisiana was brought before it.

This is the twofold nature of what I call the “God pitfall” in data analysis – either we form a hypothesis that isn't falsifiable, or we do our best to protect our hypothesis from any possible attempt to show it to be false.

Unlike people who like to get into arguments about religion, do we actively seek to prove our own hypotheses to be false, to debunk our own myths, or do we mostly try to prove ourselves right and others wrong?

If you think about it, we should actually feel more excited about data that proves that the universal truths we have adopted are false and in need of updating. It might feel nice when we get corroborating evidence, but the big leaps in our knowledge are made when we realize we've been wrong the whole time. There should be high fives all around when we come across data like this.

If only we were wired that way, but we're just not. Luckily for us, though, those moments of reluctant discovery can be so painful that we never, ever forget them. Reality has a very persistent way of piercing through our delusions time and again.

Because sooner or later our epistemic errors become clear, and we realize that we have fallen into a pitfall yet again.

Avoiding the Swan Pitfall and the God Pitfall

How do we avoid falling into these two epistemic pitfalls? Let's start by considering the process that gets us into trouble. Here's the way that process, and our thinking, often goes:

1. Basic question ➔ 2. Data analysis ➔ 3. Singular statement ➔ {unaware of the inductive leap} ➔ 4. Belief in a universal statement

For example, let's see how that played out in the Fremont Bridge bike counter example:

  1. I heard there's a bicycle counter on the Fremont bridge. That's pretty cool, I wonder what I can learn about ridership in my city.
  2. Okay, I found some data from the Seattle Department of Transportation, and it looks like …
  3. 49,718 crossed in the eastbound direction, and 44,859 crossed headed west in April 2014.
  4. Hmm, so more bicycles cross the bridge headed east than west, then. I wonder why that is? Maybe some riders cross to get to work in the morning but ride the bus home.

The evidence of the leap can be found in something that might seem insignificant – a verb tense shift. I put the verbs in bold so you'd be more likely to catch it. In step 3 above, we refer to bicycles that “crossed” the bridge (or, more precisely, crossings that were measured and recorded). But in step 4, we switched from past tense to present tense and used the word “cross.” As soon as we did that, we fell prey yet again to the inductive goof.

Instead, I propose we go about things like this:

1. Basic question ➔ 2. Data analysis ➔ 3. Singular statement ➔ 4. Falsifiable universal statement hypothesis ➔ 5. An honest attempt to disprove it

  1. How many bicycles cross the Fremont Bridge in a month?
  2. Well, I got some data from the Seattle Department of Transportation, and it looks like …
  3. The data suggest that the bike counters recorded counts of 49,718 in the eastbound direction, and 44,859 in the westbound direction in April 2014.
  4. Hmm, so the counters registered higher counts in the eastbound direction as compared to westbound that month. I wonder whether all months have seen higher counts going east as opposed to west?
  5. Let me see whether that's not the case.
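
Step 5 is straightforward to carry out on the same hourly file we used earlier; here's a sketch, with the same assumed file and column names as before.

```python
# Minimal sketch of step 5: in how many months did the eastbound counter
# ("Fremont Bridge NB") register more crossings than the westbound one
# ("Fremont Bridge SB")? File and column names are assumptions.
import pandas as pd

hourly = pd.read_csv("fremont_bridge_hourly.csv", parse_dates=["Date"])
monthly = (
    hourly.set_index("Date")[["Fremont Bridge NB", "Fremont Bridge SB"]]
    .resample("MS")
    .sum()
)

east_higher = monthly["Fremont Bridge NB"] > monthly["Fremont Bridge SB"]
print(east_higher.value_counts())  # how often did the data support the hypothesis?
print(monthly.loc[east_higher].index.strftime("%Y-%m").tolist())  # which months were those?
```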

A further analysis shows that it is indeed atypical (Figure 2.24).

Okay, so over the past couple years, it looks like my hypothesis was false – counts are typically higher for the westbound (“Fremont Bridge SB”) bike counters than for the eastbound ones (“Fremont Bridge NB”), and I can see a seasonal pattern with higher counts in the summer months. I wonder what happened in April 2014, and I'll have to watch the data going forward to see if this seasonal trend holds up, or whether things shift.

Do you see how subtle changes in the way we think about data, and the way we talk about it, result in fewer epistemic errors being made, better follow-up questions to ask, and a more accurate understanding of the world we live in? Also notice that I took care to avoid falling into the data-reality gap by talking about bike counter counts instead of actual bicycles crossing the bridge.

FIGURE 2.24 Fremont Bridge bike counter measurements.

As always, the devil is in the details. And the devil cares a whole lot about the details about how we think about things.

Notes

  1. https://www.lpi.usra.edu/meteor/metbull.php.
  2. https://earthquake.usgs.gov/earthquakes/search/.
  3. https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k.
  4. https://www.seattlebikeblog.com/2014/04/29/monday-appears-to-smash-fremont-bridge-bike-counter-record/.