Chapter 15. The Dark Side of Data Science

Marck Vaisman

More often than not, data scientists hit roadblocks that arise not from problems with the data itself, but from organizational and technical issues. This chapter focuses on some of these issues and offers practical advice on dealing with them, from both human and technical perspectives. The anecdotes and examples in this chapter are drawn from real-world experiences working with many clients over the last five years and helping them overcome many of these challenges.

Although the ideas presented in this chapter are not new, my main purpose is to highlight common pitfalls that can derail analytical efforts. Put into context, these guidelines will help both data scientists and organizations succeed.

Avoid These Pitfalls

The subject of running a successful analytics organization has been explored at length; many books, articles, and opinions have been written about it, and I will not revisit that material here. However, if you would like to succeed in executing and/or managing analytical efforts within your organization, you should not heed the “commandments” listed below.

  1. Know nothing about thy data

  2. Thou shalt provide your data scientists with a single tool for all tasks

  3. Thou shalt analyze for analysis’ sake only

  4. Thou shalt compartmentalize learnings

  5. Thou shalt expect omnipotence from data scientists

These commandments attempt to cluster related ideas, which I will explore in the following sections. If you choose to obey one or more of them, despite my explicit warning not to, you will most likely head down the path of not achieving your goals.

Know Nothing About Thy Data

You have to know your data, period. This cannot be stressed enough. Real world data is messy and dirty; that is a fact. Regardless of how messy or dirty your data is, you need to understand all of its nuances. You need to understand the metadata about the data. If your data is dirty, know that. If there are missing values, know that, and know why they are missing. If you have multiple sources with different formatting, know that.

Knowing thy data is a crucial step in a successful analysis effort. Time spent up-front understanding all of the nuances and intricacies of the data is time well spent. The rule of thumb says that 80% of the time spent on analytics projects goes to cleaning, munging, transforming, and so on; the more you know about the data, the less time you’ll spend on these tasks.

In the following sections, I’ll highlight some examples of times when organizations knew about their data, but did not know enough. (In my experience, knowing nothing about the data is the exception. Usually people do have some level of knowledge.)

Be Inconsistent in Cleaning and Organizing the Data

The first step in an analysis effort is to establish consistent processes to clean the data. This way, other people know what to expect when they work with it.

Case study: A client had defined several processes to move, clean, and archive the data into tab-delimited flat files, which in turn served as the source data for many other analytics efforts. Most analyses of this data seemed to work well, but one particular case yielded unexpected results. A visual inspection didn’t reveal any obvious flaws in the data, but a closer investigation revealed that the troublesome files used a mix of spaces and tabs as delimiters. The culprit? The process that had generated these files had inserted some unexpected spaces instead of tabs.
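A quick structural check would have caught this early. The sketch below is a hypothetical helper in Python (the function name and signature are my own, not the client’s process): it flags any line of a supposedly tab-delimited file that does not split into the expected number of fields on tabs, and also reports how many fields the same line yields when split on arbitrary whitespace. A line that parses correctly on whitespace but not on tabs is the telltale sign of spaces standing in for tabs.

```python
def find_bad_lines(path, expected_fields):
    """Scan a tab-delimited file and return (line_number,
    tab_field_count, whitespace_field_count) for every line that does
    not produce the expected number of fields when split on tabs.

    If the whitespace count matches `expected_fields` while the tab
    count does not, spaces have likely crept in where tabs belong.
    """
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            n_tab = len(line.rstrip("\n").split("\t"))
            n_ws = len(line.split())  # splits on any run of whitespace
            if n_tab != expected_fields:
                bad.append((lineno, n_tab, n_ws))
    return bad
```

Running a check like this over every delivered file, before any analysis starts, turns a silent data corruption into a loud, cheap-to-fix report.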

Assume Data Is Correct and Complete

A common pitfall is to assume that you’re working with correct and complete data. Usually, a round of simple checks—counting records, aggregating totals, plotting, and comparing to known quantities—will reveal any problems. If your data is incorrect or incomplete, you need to know that, so your decisions can take that fact into account.

Case study: A client had created a data product that calculated certain metrics on a daily basis, over a period of time. All production processes were running without issue and the consumers of the data were using the data provided without question. The size of the processed daily datasets was on the order of hundreds of millions of data points. Certain decisions with large economic impact were made based on the results of these data products.

The client later launched a separate research investigation, using the same datasets, to try to understand if and how the daily distributions changed over time. This project yielded some unexpected results, raising the question of whether the data was correct and complete during certain time periods. Further analysis revealed that, during those time periods, about twenty percent of the data was missing. This means that the client had made (wrong) decisions, simply because no one knew that the data was incomplete! Simple summaries, made early on, would have indicated missing data.
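Summaries like the ones that were missing here are cheap to produce. Below is a minimal sketch of one: count the records that arrive per day and flag any day whose volume falls well below the median. The function name and the 80% threshold are illustrative assumptions, not part of the client’s system; the point is that a few lines of counting would have surfaced the gap.

```python
from collections import Counter
from statistics import median

def flag_short_days(dates, threshold=0.8):
    """Count records per day and return the days whose record count
    falls below `threshold` times the median daily count.

    `dates` is an iterable with one date key (any hashable value,
    e.g. a string like '2012-07-06') per record.
    """
    counts = Counter(dates)
    typical = median(counts.values())
    return {day: n for day, n in counts.items()
            if n < threshold * typical}
```

A report like this, run daily and eyeballed weekly, is exactly the kind of simple check that would have revealed the missing twenty percent long before any decisions were made.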

Spillover of Time-Bound Data

It’s common that organizations partition their data by some time interval—such as day, hour, or minute—and then organize the files in directories named accordingly. For example, given a directory called data_20120706, one could reasonably expect that every file therein would hold data somehow related to July 6, 2012.

In my experience, though, this doesn’t always hold true. Instead, many projects exhibit “spillover,” which is a nice way of saying that a path for one time interval contains data from other intervals. In this example it would mean that the directory data_20120706 would also contain data from July 5, 2012, or July 7, 2012.

Spillover can happen for any number of reasons, including inadequate accuracy in the partitioning scheme. While you may not be able to completely eliminate spillover, you can at least be aware of it. Don’t expect that the data is partitioned perfectly.
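Being aware of spillover can be as simple as comparing each record’s timestamp against the date encoded in the partition name. The sketch below assumes the `data_YYYYMMDD` naming scheme from the example and `datetime` timestamps for the records; the function name and layout are illustrative, so adapt them to your own partitioning convention.

```python
from datetime import datetime

def spillover_rate(dir_name, timestamps, prefix="data_"):
    """Given a partition directory name like 'data_20120706' and the
    record timestamps found inside it, return the fraction of records
    whose date differs from the date the partition claims to hold."""
    expected = datetime.strptime(dir_name[len(prefix):], "%Y%m%d").date()
    spilled = sum(1 for ts in timestamps if ts.date() != expected)
    return spilled / len(timestamps) if timestamps else 0.0
```

Even a crude measurement like this tells you whether spillover is a rounding error or a systematic problem, and which partitions deserve a closer look.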

Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks

There is no single tool that allows you to perform all of your data science tasks. Many different tools exist, and each tool has a specific purpose. In order to be successful, data scientists should have access to the tools they need and also the ability to configure these tools as needed—at least in a research and development (R&D) environment—without having to jump through hoops to do their work. Providing one fixed tool (or set of tools) to perform all tasks is unrealistic and unreasonable.

Using a Production Environment for Ad-Hoc Analysis

The use cases of performing exploratory analysis or any other data R&D effort are very different than the use cases for running production analytics processes. Generally, production systems are designed to meet certain service level agreements (SLAs), such as for uptime (availability) and speed. These systems are maintained by an operations or devops team, and are usually locked down, have very tight user space quotas, and may be located in self-contained environments for protection. The production processes that run on these systems are clearly defined, consistent, repeatable, and reliable.

In contrast, the process of performing ad-hoc analytical tasks is nonlinear, error-prone, and usually requires tools that are in varying states of development, especially when using open source software. Using a production environment for ad-hoc and exploratory data science work is inefficient because of the limitations described above.

Case study: A client had two Hadoop clusters, one for R&D and the other for production. All R&D work was performed on the R&D cluster, which included an ample set of tools (R, Python, Pig, and Hive, among others). However, the R&D cluster was managed as a production system: R&D users did not have administrative privileges. This meant that even though users could run jobs, the R or Python streaming scripts were limited to only using the core libraries. Therefore, it took more time to develop analysis jobs because users had to implement “creative” solutions to work around these limitations.

One such workaround, to use special R or Python libraries, involved moving data out of the Hadoop cluster to a separate machine where analysts had administrative access. The entire process was cumbersome and added unnecessary time and headaches to a project.

You need to carefully plan out the architecture, configuration, and administration of your tools and environments. The use cases are different, and therefore the management and operation of the system should be different as well.

I certainly don’t advocate that all users or data scientists have administrative privileges and do as they please, but they should have enough privileges to set up the environment to suit their analytics needs. If this is not possible, it would make sense to have the operations teams and the analytics users work in partnership and devise workable solutions.

The Ideal Data Science Environment

At the time of this writing, it is relatively easy and inexpensive to set up an ideal data science environment. With low hardware costs (processors, storage, and memory) and the ease of access to cloud computing resources (Amazon EC2, Rackspace, or other cloud services), organizations should get the tools their data scientists need and, as mentioned earlier, let them manage those tools as they see fit.

At a minimum, I recommend you set up one or more multi-core analytics machines running Linux, with lots of RAM and ample storage space. I recommend Linux as an operating system because most analytics tools and programming languages are designed with Linux in mind (e.g., Hadoop), and many external libraries run on Linux only.

If you have a cluster, you should try to have your analytics machine within the same environment as your cluster, especially if you store data in a distributed file system such as Hadoop’s HDFS. Data scientists should have some level of administrative rights so they can compile and install libraries as needed.

Setting up the right environment is not only a matter of using the right tools, but also of having the right organizational mindset. Ideally, a data science environment is not used for other development purposes so data scientists can take advantage of all available resources for their needs, especially when running large scale analysis in a parallel fashion.

Thou Shalt Analyze for Analysis’ Sake Only

There are many kinds of analytical exercises you can do. Some begin as an exploration without a specific question in mind, though it could be argued that even exploration is guided by questions that simply haven’t been formulated yet. Other exercises begin with one question in mind and end up answering another. Regardless, before you embark on a research investigation, you should have some idea of where you are going. Be practical, and know when to stop and move on to something else. But again, start with some end in mind. Just because you have a lot of data does not mean you have to do analysis for analysis’ sake. Very often, that approach ends with wasted time and no results.

Additionally, before you embark on a data science project, you should assess your analytics readiness level. Understanding where you fall in this readiness spectrum will help you set priorities and define an end goal, and will keep you from going down a rabbit hole. Some of the possible readiness levels are:

  • We don’t even know where to begin.

  • We don’t know what we have, and we’ve never done any analysis before.

  • We have an idea of what we have, but we’ve never done any analysis before.

  • We know what we have, and we’ve tried answering specific questions, but we’re stuck.

Case study: A Fortune 500 technology company had a process that generated data based on day-to-day operations. Historically, their data warehousing team performed most analytics tasks, such as traditional Business Intelligence (BI). This team’s primary responsibility was to develop relational database-driven tools, though they had also been dabbling in less-conventional analytics projects.

They decided to bring in a data scientist to help with the latter endeavor. The belief was that the data scientist would magically find a golden nugget hidden in their data, which they could easily translate into some results. Management’s directive was, quite simply: “Go and find me the value in the data!” They did not follow up with any context or direction; they simply set an unrealistic expectation that a wizard was off to do some magic and return with all the answers.

The proper response in this case—and the one the data scientist provided, mind you—was to object to this directive and ask for direction, get context, and set the parameters for engaging in an analytics exercise. Part of a data scientist’s role and value is to help the organization ask the right questions, and steer clear of unnecessary work.

Thou Shalt Compartmentalize Learnings

This commandment is pretty straightforward. The idea here is that, as an organization, you should share your knowledge. This is especially important when analytical efforts are performed by different areas throughout the company.

Share your findings. If you are doing analysis and you find something related to any of the pitfalls mentioned previously—whether you find missing data, formatting errors, shortcuts to get things done—share them. If you’ve already processed data into aggregates for some reason and you think that could be useful for other analyses, share it. When you finish a body of work, share the findings (code, results, charts, or other documentation). Document your assumptions. Document your code. Have informal gatherings to share and discuss.

It is amazing, especially in large organizations, how often you spend time working on something, only to consult with colleagues in other areas and hear: “oh, yes, we looked at that a while ago and we have some results in a file somewhere.” The amount of time (which usually translates into economic value) saved by sharing can be quite large. By sharing your knowledge, you are also contributing to learning across the organization.

Thou Shalt Expect Omnipotence from Data Scientists

Data scientists come in all shapes, sizes, and colors, and hail from traditional and unusual career paths. They blend skills in programming, mathematics, statistics, computer science, business, and machine learning, among others. Above all, great data scientists are very curious about everything and have broad knowledge across many different domains.

The current availability of computing power, analytically focused programming languages (such as R and Python), and parallel computing frameworks (such as Hadoop) allows data scientists to be very effective. Data scientists are able to perform many tasks across the analytics spectrum, from spinning up a cluster in the cloud, to understanding the subtleties and trade-offs of different technologies and system quirks, to running a clustering algorithm and building predictive models.

Omnipotence is defined as having unlimited powers, and there seems to be an expectation that data scientists can and should do it all. While they are willing and able to work on many tasks across the data science process, from munging and modeling to visualizing and presenting, it is quite rare to find talent with extensive experience in all aspects of data science.

Organizations and managers would do well to adjust their expectations accordingly. A successful data science function is made up not of one person, but of at least two or three individuals whose broad skills overlap substantially while their areas of deep expertise do not.

Where Do Data Scientists Live Within the Organization?

Finding a place for data scientists can be a bit tricky. Sometimes you’ll find them living within an engineering organization, sometimes within a product organization, sometimes within a research organization, and other times they live under some other umbrella or on their own. Wherever data scientists live in your organization, make sure there is unified guidance and management that understands how to use data science as an asset.

Final Thoughts

I hope that this chapter’s case studies helped you think about the higher-level organizational issues that may arise in an analytics effort. If you want success in your data science endeavors, please heed my advice and do not follow the commandments I’ve outlined here.

It is hard to quantify the impact of the pitfalls outlined in this chapter. Suffice it to say that the impact is usually economic, and in some cases can be quite large.
