Chapter Nine
Conclusion

“Mistakes are a fact of life. It is the response to error that counts.”

Nikki Giovanni

Well, we've traveled a long road together, and we've considered many aspects of the data working pathway, both pinnacles of wonder and pitfalls of despair. And if you're like me, you have a looming sense that the types of mistakes we've mapped out in this book are only the beginning.

The fact is, they're necessary in order for us to grow.

To drive this point home, I'd like to relate a famous story to you that I've only ever seen attributed to an anonymous source.

There once was a journalist who was interviewing a wealthy and successful bank president.

“Sir,” he said, “what is the secret to your success?”

“Two words,” replied the bank president.

“And what are they, sir?” asked the journalist.

“Good decisions,” replied the president.

“And how, sir, do you make good decisions?”

“One word.”

“And what word is that?”

“Experience.”

“And how do you get experience?”

“Two words.”

“And what are they, sir?”

“Bad decisions.”

Working with data involves lots of decisions. Some of them are good decisions and some of them are bad decisions. It's not reasonable to expect ourselves and others to only ever make good decisions. We're going to make mistakes on the road to success with data, just as sure as we're going to come across data that's flawed and dirty. It's the way things are.

We can choose how we react to this state of affairs. We can throw in the towel right now. Why try? Or we can stick our head in the sand and pretend we don't make the kinds of mistakes that all the other knuckleheads out there seem to make so often. Sheesh. Or we can beat ourselves up every time we fall into a data pitfall. “How could I be so stupid?”

Or we can pick ourselves up, recognize the mistake, accept that it happened, put a little tick on the tick sheet, investigate – without judging – how and why we didn't catch it upfront this time, and tell ourselves it's going to be okay. I believe that you and I are far less likely to make this mistake the next time around if we react like this.

I'll never forget a time when I was a young mechanical design engineer at an automotive sensor company in Southern California. My boss, Mr. Chen, was a very wise man, sort of like an amused father or grandfather who mentored me with care. This company had been struggling financially for some time. No one had been getting raises. Layoffs seemed imminent. They had even begun counting rolls of toilet paper. It was a tough spell for the company.

There was a time in just my third month on the job when I was responsible for creating a batch of sensors in the lab that were going to be sold to a customer. It was a small batch, but the value of the raw materials added up to about $15,000, which at that time was about a third of my entry-level engineer salary. The sensors needed to be calibrated before they were cured in the oven and the assembly completed. The drawings were clear about that.

But I messed it up. I told the technicians, who made far less money than I did, to go ahead and finish off the assemblies, thinking that we could calibrate the sensors afterward. We couldn't, and the batch was ruined. Hundreds of high-precision sensors had to be thrown in the big scrap bin in the back of the building. Ouch.

I realized right away with a sinking feeling of dread that the error was mine, not the technicians’. I made up my mind that I would tell my boss, preparing myself for the obvious outcome that I would be fired on the spot. I had a pregnant wife at the time, and the temptation to cover it up or blame someone else was enormous.

When I told Mr. Chen what had happened, he walked up to the lab with me, drawings in hand. He and I went around and he asked the technicians what had happened. They looked at me, looked at the drawings, and then at Mr. Chen, and told him that I had given them instructions that differed from what the drawings said. I nodded in agreement, and said they weren't to blame.

Mr. Chen and I went back to his office, and we each took a seat. Before he had a chance to say anything, I asked him if he was going to fire me for the costly mistake. He smiled and said, “Are you kidding me? I just paid $15,000 for you to be trained on how not to calibrate a batch of sensors. I can't afford to pay another junior engineer to learn that same lesson.”

Obviously, I was relieved. And it's true, I had learned a valuable lesson about how to calibrate pressure sensors for heavy-duty trucks. But I learned something even more valuable – that making mistakes is inevitable, that there's no sense in browbeating yourself or another person for making one, and that if you think of it as a training cost, it turns the whole mess around.

So it's my sincere hope that you aren't discouraged by all the possible ways to get it wrong. That's not my purpose in writing this book at all. Quite the contrary. Possibility of failure is built into the very nature of our world, but we can't let this fact, while daunting, totally confine us to the sofa.

Data is going to be a huge part of the future of our species on this planet. In our generation it has increased in importance by leaps and bounds in a very short amount of time, and this trend is showing no sign of abating. We're in the infant phase of the data-working lifespan of our species.

Let's think about babies for a minute. A newborn's immune system learns to fight harmful viruses for the first time. A newborn grows into a toddler who learns to avoid walking into the coffee table or falling down the stairs. A toddler becomes a young child who learns not to touch the stovetop. These early experiences lead a human being to develop a strong immune system, a keen sense of balance, and an aversion to sharp objects or hot surfaces.

That's what we're doing for future generations right now, as it relates to data. It's up to us to continue to face these potential pitfalls, to learn from them, to build a better sense of where they are, what they look like, and how to avoid them. We're building the immune system, the defense mechanisms, and the habits of our species to make good use of data without shooting ourselves in the foot. If future generations look back at our mistakes and roll their eyes or shake their heads in derision, then we'll have done our part.

This book is just the beginning. It's a brief catalog of the pitfalls I've come across in my data working days so far. I imagine it amounts to a small portion of the total number of different kinds of pitfalls I will come across before my lifetime is over. I have many more things to learn, and I'll keep jotting down new pitfalls that I come across – hopefully some of them from the top looking down.

And I'd love to know if you think I've missed any big ones, or if I've gotten any of these wrong. I bet I have. As ego-bruising as it would be to fall into a pitfall while writing a book about pitfalls, I'm prepared for that. So don't hold back.

As promised, I've created a checklist for you to use as a set of reminders about these pitfalls, and how to avoid them. By all means change it or add to it to suit your purposes. It's a living document, and it's just as susceptible as anything else to falling into the very pitfalls it's seeking to help you avoid.

All the best to you on your journey to a higher place!

Avoiding Data Pitfalls Checklist

Pitfall 1: Epistemic Errors: How we think about data

  • 1A. The Data Reality Gap: Identify ways in which the data differs from reality.
  • 1B. All Too Human Data: Identify any human-keyed data and associated processes.
  • 1C. Inconsistent Ratings: Test the repeatability and reproducibility of ratings and measurements.
  • 1D. The Black Swan Pitfall: See if you're making any inductive leaps to universal statements.
  • 1E. The God Pitfall: Ask whether hypotheses formed and statements made are falsifiable.

Pitfall 2: Technical Trespasses: How we process data

  • 2A. Dirty Data: Consider the values in each variable, visualize the data, and scan for anomalies.
  • 2B. Bad Blends and Joins: Investigate the input and output of every join, blend, and union.
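
As one possible way to act on item 2B, the sketch below compares row counts going into and coming out of a left join, and reports keys that matched nothing or matched more than once. It is a minimal illustration in plain Python, not a prescribed method: the `audit_left_join` helper, the `orders` and `customers` tables, and their field names are all hypothetical and invented for this example.

```python
from collections import Counter


def audit_left_join(left, right, key):
    """Left-join two lists of dicts on `key` and report what changed.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # Index the right-hand rows by key so that duplicate keys stay visible.
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)

    joined = []
    unmatched = []
    for row in left:
        matches = right_index.get(row[key])
        if not matches:
            # Keep the left row, as a left join would, but remember the miss.
            unmatched.append(row[key])
            joined.append(dict(row))
            continue
        for match in matches:
            # One output row per matching right-hand row.
            joined.append({**match, **row})

    right_counts = Counter(r[key] for r in right)
    duplicated = sorted(k for k, n in right_counts.items() if n > 1)

    return {
        "rows_in": len(left),
        "rows_out": len(joined),
        "unmatched_keys": unmatched,
        "duplicated_right_keys": duplicated,
        "joined": joined,
    }


# Made-up sample data (assumption, for illustration only).
orders = [
    {"order_id": 1, "customer_id": "A"},
    {"order_id": 2, "customer_id": "B"},
    {"order_id": 3, "customer_id": "C"},
]
customers = [
    {"customer_id": "A", "region": "West"},
    {"customer_id": "B", "region": "East"},
    {"customer_id": "B", "region": "South"},  # duplicate key inflates the join
]

report = audit_left_join(orders, customers, "customer_id")
print(report["rows_in"], report["rows_out"])
print(report["unmatched_keys"], report["duplicated_right_keys"])
```

With this sample data, three order rows go in and four come out: customer "B" appears twice on the right-hand side, and customer "C" has no match at all. Either condition is worth investigating before trusting any totals computed from the joined result.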

Pitfall 3: Mathematical Miscues: How we calculate data

  • 3A. Aggravating Aggregations: Explore the contours of your data and look for partial categories.
  • 3B. Missing Values: Scan for nulls and look for missing values between adjacent category levels.
  • 3C. Tripping on Totals: Determine whether any category levels or rows are totals or subtotals.
  • 3D. Preposterous Percents: Consider numerators and denominators of all added rates and percents.
  • 3E. Unmatching Units: Check that formulas involve variables with the correct units of measure.
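
Item 3B names two different kinds of gap: a row that is present but holds a null, and a category level that is absent altogether. The sketch below separates the two, assuming you can list the levels you expect to see. The `find_gaps` helper, the `monthly_sales` records, and the field names are hypothetical, made up for illustration.

```python
def find_gaps(records, level_key, value_key, expected_levels):
    """Return (null_levels, missing_levels) for a list of dict records.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # Map each level that actually appears to its recorded value.
    seen = {r[level_key]: r[value_key] for r in records}

    # Levels that are present but hold no value.
    null_levels = [lvl for lvl in expected_levels
                   if lvl in seen and seen[lvl] is None]
    # Levels that should exist but have no row at all.
    missing_levels = [lvl for lvl in expected_levels if lvl not in seen]
    return null_levels, missing_levels


# Made-up sample data: month 2 is null, month 3 has no row at all.
monthly_sales = [
    {"month": 1, "sales": 120},
    {"month": 2, "sales": None},
    {"month": 4, "sales": 95},
]

null_months, missing_months = find_gaps(
    monthly_sales, "month", "sales", range(1, 5)
)
print(null_months)
print(missing_months)
```

Here month 2 is reported as null and month 3 as missing. A simple scan for nulls would catch only the first; the second shows up only when the data is compared against the levels that ought to be there.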

Pitfall 4: Statistical Slipups: How we compare data

  • 4A. Descriptive Debacles: Consider distributions when communicating mean, median, or mode.
  • 4B. Inferential Infernos: When inferring about populations, verify statistical significance.
  • 4C. Slippery Sampling: Make sure samples are random, unbiased, and, if necessary, stratified.
  • 4D. Insensitivity to Sample Size: Look for very small sample sizes, occurrences, or rates.

Pitfall 5: Analytical Aberrations: How we analyze data

  • 5A. The Intuition/Analysis False Dichotomy: Ask whether you're honoring your human intuition.
  • 5B. Exuberant Extrapolations: Check for times that you're projecting far into the future.
  • 5C. Ill-Advised Interpolations: Consider if there should be more values between adjacent ones.
  • 5D. Funky Forecasts: Think about how you're forecasting values, and whether that's valid.
  • 5E. Moronic Measures: Check whether what you're measuring and visualizing really matters.

Pitfall 6: Graphical Gaffes: How we visualize data

  • 6A. Challenging Charts: Identify the core purpose of your visual and validate that it achieves it.
  • 6B. Data Dogmatism: Ask whether you've failed to consider a valid solution due to rigid rules.
  • 6C. The Optimize/Satisfice False Dichotomy: Decide if you need to optimize or satisfice.

Pitfall 7: Design Dangers: How we dress up data

  • 7A. Confusing Colors: Strive for one and only one color encoding; add another only if you must.
  • 7B. Omitted Opportunities: Stop and consider adding judicious embellishment of charts.
  • 7C. Usability Uh-Ohs: Test whether those using your visualization can actually use it well.

Pitfall 8: Biased Baseline: Who has a voice in data

  • 8A. The Unheard Voice: Make sure you are hearing and considering voices that have historically been undervalued or ignored.

The Pitfall of the Unheard Voice

Hello, dear reader. I am sitting here on my couch near Seattle, on a sunny Sunday morning in July after having read through the edits and proofs of each of the nine chapters of a draft version of this book. I have made a ghastly realization that is giving me pause to reflect and ask myself some important questions.

I knew when I set out to write this book that I would be sure to fall into numerous pitfalls while writing a book about pitfalls. I steeled myself against this ironic inevitability, and decided I would need to be the doctor who takes his own medicine: learn to laugh about it and move on as a wiser person.

But I'm not laughing about this one. I've discovered that I've fallen into an egregious pitfall that I didn't include or describe anywhere in the pages of this book. It wouldn't be accurate to say that this pitfall went unnoticed for ages, because the vast majority didn't even see it as a pitfall at all until recently. They saw it as the right and best path to take.

What is this pitfall I'm talking about? Did you notice that the checklist above included an eighth type of pitfall? Look again if you didn't.

There are nine chapters in this book, and these nine chapters each have a quote at the beginning – an epigraph. Researching and choosing an epigraph is a part of the writing process that I really enjoy, because I get to scan the thoughts and words of brilliant people and take inspiration from them. It's a very rewarding process for me.

But the first draft of this book contained nine quotes from nine men. There was not a single inspirational quote from a woman.

None.

Zero.

How did this happen? How did I not come to select even one quote from a woman in the initial writing of this book, which took place over the course of four years?

Even worse, how did I not even realize it until the very end, with the printers warming up and getting ready to send my book to people around the world, including women of all ages who are looking to build data skills and contribute to this booming data dialogue around us?

How would I feel if I were one of them and I got my hands on a copy of the book and wondered what it would take for my voice to be heard?

I feel strongly enough about this pitfall that I decided not only to fix it, but to write about it in the hopes that I can help raise awareness about it among my fellow males in the data world.

This pitfall is endemic in our discipline, in the broader STEM world, and in Western society at large. At the time I'm writing this, only 17% of the 1.5 million Wikipedia biography pages are about women.1

The voices of talented women have been ignored for too long, and too often their contributions have been attributed to men. I'm currently working on an article for the Nightingale, the new journal published by the Data Visualization Society, about a twentieth-century data visualization practitioner and author named Mary Eleanor Spear. In 1969, she wrote a fabulous book I've recently discovered called Practical Charting Techniques, and I'm currently awaiting shipment of an earlier book of hers titled Charting Statistics, published in 1952.

Why do I mention Spear in this final section on unheard voices? For starters, there is, of course, no Wikipedia page for her at the time I'm writing this sentence (a fact that will soon change).

But it's more than that. There's a common and popular statistical chart type called the box plot that depicts the quartiles of quantitative data in a compact and convenient way. If you research its origin, you'll find that the innovation of the box plot is commonly attributed to mathematician John W. Tukey, who “introduced this type of visual data display in 1969.”2

At present, the name Mary Eleanor Spear is not to be found anywhere on the Wikipedia page about the box plot, even though her name can be found in research papers on the topic of this chart type.3

But her name should be there. In her 1952 book, Spear depicted an early version of the box plot, which she called the range-bar. This was evidently a chart type that Tukey modified to create the box plot we know and use today.

While many see Tukey's modifications as helpful, why is Spear's name so often left out of the discussion of the origin of this chart? If Tukey had improved upon the work of another man, would that man's name have been left out of the record? Why did I first hear of her name just a few months ago, in spite of working in the business intelligence industry for over a decade?

And why, as I'm looking to go back and include quotes by women for the epigraphs of each chapter of this book, do I find so few of them? Why, in an article that purports to list the “100 greatest data quotes,” are the words of only seven women to be found?4 That's just one article. There were many other “top 20” or “top whatever” data and analytics quote lists I found that included no women whatsoever.

These are questions we have to ask ourselves. It just shouldn't be that way, and I'm mortified to think that I almost perpetuated this incredible bias in this book with lazily selected epigraphs. There was nothing wrong with the quotes I originally selected. They were brilliant thoughts from brilliant people. And I'm not seeking to take away anything from them, or from Tukey, or from those who have taken time to compile helpful quotes for all of us.

But it's time to start amplifying voices that have traditionally been drowned out. Voices of all types. A tradition that filters out so many bright contributors is much too costly. Imagine a world where people's thoughts are heard and considered in proportion to the value of the words themselves, as opposed to the value the culture places on the demographic of the person who said them. Wouldn't this be a better world?

What can we do to move our world in that direction? Let's do that.

Notes

  1. https://www.thelily.com/wikipedia-has-15-million-biographies-in-english-only-17-percent-are-about-women/
  2. https://en.wikipedia.org/wiki/Box_plot
  3. https://vita.had.co.nz/papers/boxplots.pdf
  4. https://analyticsweek.com/content/100-greatest-data-quotes/