Chapter 2. Setting goals by asking good questions

This chapter covers

  • Putting yourself in the customer’s shoes
  • Asking specific, useful questions of the data
  • Understanding the strengths and limitations of the data in answering those questions
  • Connecting those questions and answers to project goals
  • Planning backward from the desired goal, not forward from data and software tools

Figure 2.1 shows where we are in the data science process: setting goals, which is the first step of the preparation phase. In a data science project, as in many other fields, the main goals should be set at the beginning of the project. All the work you do after setting goals is making use of data, statistics, and programming to move toward and achieve those goals. This chapter emphasizes how important this initial phase is and gives some guidance on how to develop and state goals in a useful way.

Figure 2.1. The first step of the preparation phase of the data science process: setting goals

2.1. Listening to the customer

Every project in data science has a customer. Sometimes the customer is someone who pays you or your business to do the project—for example, a client or contracting agency. In academia, the customer might be a laboratory scientist who has asked you to analyze their data. Sometimes the customer is you, your boss, or another colleague. No matter who the customer might be, they have some expectations about what they might receive from you, the data scientist who has been given the project. Often, these expectations relate to the following:

  • Questions that need to be answered or problems that need to be solved
  • A tangible final product, such as a report or software application
  • Summaries of prior research or related projects and products

Expectations can come from almost anywhere. Some are hopes and dreams, and others are drawn from experience or knowledge of similar projects. But a typical discussion of expectations boils down to two sides: what the customer wants versus what the data scientist thinks is possible. This could be described as wishes versus pragmatism, with the customer describing their desires and the data scientist approving, rejecting, or qualifying each one based on apparent feasibility. On the other hand, if you’d like to think of yourself, the data scientist, as a genie, a granter of wishes, you wouldn’t be the only one to do so!

2.1.1. Resolving wishes and pragmatism

A customer's wishes can range from completely reasonable to utterly outlandish, and this is OK. Much of business development and hard science is driven by intuition. CEOs, biologists, marketers, and physicists alike use their experience and knowledge to develop theories about how the world works. Some of these theories are backed by solid data and analysis, but others come more from intuition, the conceptual framework a person develops while working extensively in their field. A notable difference between data science and many other fields is that even an experienced data scientist may not know whether a customer's wish is feasible. Whereas a software engineer usually knows what tasks software tools are capable of performing, and a biologist knows more or less what the laboratory can do, a data scientist who has not yet seen or worked with the relevant data faces a large amount of uncertainty, principally about what specific data is available and about how much evidence it can provide to answer any given question. Uncertainty is, again, a major factor in the data science process and should be kept at the forefront of your mind when talking with customers about their wishes.

For example, during the few years that I worked with biologists and gene expression data, I began to develop my own conceptual ideas about how RNA is transcribed from DNA and how strands of RNA float around in a cell and interact with other molecules. I'm a visual person, so I often found myself picturing a strand of RNA comprising hundreds or maybe thousands of nucleotides, each one appearing like one of four letters representing a base compound (A, C, G, or T; I'll use T in place of U for convenience) and the whole strand looking like a long, flexible chain, a sentence that makes sense only to the machinery within the cell. Because of the chemistry of RNA and its nucleotides, complementary sequences like to bind to one another; A likes to bind to T, and C likes to bind to G. When two strands of RNA contain near-complementary sequences, they may very well stick to each other. A single strand of RNA might also fold in upon and stick to itself if it's flexible enough and contains mutually complementary sequences. I've used this conceptual framework on many occasions to make guesses about the types of things that can happen when a bunch of RNA is floating around in a cell.
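This binding rule is easy to express in code. Here's a minimal sketch, in Python, of the complementarity idea described above (using T in place of U, as in the text); the function names and the scoring scheme are my own illustrative choices, not anything from the project:

  # The base-pairing rule: A binds T, and C binds G. Two strands tend to
  # stick together where one matches the reverse complement of the other.
  COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

  def reverse_complement(seq):
      """Return the sequence that would bind perfectly to seq."""
      return "".join(COMPLEMENT[base] for base in reversed(seq))

  def fraction_complementary(strand_a, strand_b):
      """Fraction of aligned positions at which the strands would pair."""
      target = reverse_complement(strand_b)
      n = min(len(strand_a), len(target))
      return sum(strand_a[i] == target[i] for i in range(n)) / n

  print(fraction_complementary("ACGTT", "AACGT"))  # 1.0: perfectly complementary

A pair scoring close to 1.0 would be a candidate for sticking together; real sequence-alignment tools are far more sophisticated, but the intuition is the same.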

When I began to work with microRNA data, it made sense to me that microRNA—short sequences of about 20 nucleotides—might bind to a section of a genetic mRNA sequence (RNA transcribed directly from a strand of DNA corresponding to a specific gene, which is typically much longer) and inhibit other molecules from interacting with the gene's mRNA, effectively rendering that gene sequence useless. It makes conceptual sense to me that one bit of RNA can stick to a section of genetic RNA and end up blocking another molecule from sticking to the same section. This concept is supported by scientific journal articles and hard data showing that microRNA can inhibit expression or function of genetic mRNA if they have complementary sequences.

A professor of biology I was working with had a much more nuanced conceptual framework describing how he saw this system of genes, microRNA, and mRNA. In particular, he had been working with the biology of Mus musculus—a common mouse—for decades and could list any number of notable genes, their functions, related genes, and the physical systems and characteristics that are measurably affected if one runs experiments that knock out those genes. Because the professor knew more than I will ever know about the genetics of mice, and because it would be impossible for him to share all of his knowledge with me, it was incredibly important for us to talk through the goals and expectations of a project before spending too much time working on any aspect of it. Without his input, I would be guessing at what the biologically relevant goals were. If I was wrong, which was likely, that work would be wasted. For example, certain specific microRNAs have been well studied and are known to accomplish basic functions within a cell and little more. If one of the goals of the project was to discover new functions of little-studied microRNAs, we would probably want to exclude certain families of microRNAs from the analysis. If we didn't exclude them, they would most likely add to the noise of an already very noisy genetic conversation within a cell. This is merely one of a large number of important things that the professor knew and I didn't, making a lengthy discussion of goals, expectations, and caveats necessary before starting the project in earnest.

In one sense, and it's a very common one, a project can be deemed successful if and only if the customer is satisfied with the results. There are exceptions to this guideline, but it's nevertheless important to keep the expectations and goals in mind during every step of a data science project. Unfortunately, in my own experience, expectations aren't usually clear or obvious at the very beginning of a project, or they're not easy to formulate concisely. I've settled on a few practices that help me figure out reasonable goals that can guide me through each step of a project involving data science.

2.1.2. The customer is probably not a data scientist

A funny thing about customer expectations is that they may not be appropriate. It’s not always—or even usually—the customer’s fault, because the problems that data science addresses are inherently complex, and if the customer understood their own problem fully, they likely wouldn’t need a data scientist to help them. That’s why I always cut customers some slack when they’re unclear in their language or understanding, and I view the process of setting expectations and goals as a joint exercise that could be said to resemble conflict resolution or relationship therapy.

You—the data scientist—and the customer share a mutual interest in completing the project successfully, but the two of you likely have different specific motivations, different skills, and, most important, different perspectives. Even if you are the customer, you can think of yourself as having two halves, one (the data scientist) who is focused on getting results and another (the customer) who is focused on using those results to do something real, or external to the project itself. In this way, a project in data science begins by finding agreement between two personalities, two perspectives, that if they aren’t conflicting are at the very least disparate.

Although there is not, strictly speaking, a conflict between you and the customer, sometimes it can seem that way as you both muddle your way toward some semblance of a set of goals that are both achievable (for the data scientist) and helpful (for the customer). And, as in conflict resolution and relationship therapy, feelings are involved. These feelings can be ideological and driven by personal experience, preference, or opinion and may not make sense to the other party. A little patience and understanding, without too much judgment, can be extremely beneficial to both of you and, more importantly, to the project.

2.1.3. Asking specific questions to uncover facts, not opinions

When a customer is describing a theory or hypothesis about the system that you're about to investigate, they are almost certainly expressing a mixture of fact and opinion, and it can be important to distinguish between the two. For example, in a study of cancer development in mice, the biology professor told me, “It is well known which genes are cancer related, and this study is concerned with only those genes and the microRNAs that inhibit them.” One might be tempted to take this statement at face value and analyze data from only the cancer-related genes, but this could be a mistake, because there is some ambiguity in the statement. Principally, it's not clear whether other supposedly non-cancer-related genes can play auxiliary roles within the complex reactions induced by the experiments or whether it is well known and proven that the expression of cancer-related genes is entirely independent of other genes. In the former case, it wouldn't be a good idea to ignore the data corresponding to non-cancer-related genes, whereas in the latter case, it might be. Without resolving this ambiguity, it's not clear which is the appropriate choice. Therefore, it's important to ask.

It’s also important that the question itself be formulated in a way that the customer understands. It wouldn’t be wise to ask, for example, “Should I ignore the data from the non-cancer-related genes?” This is a question about the practice of data science in this specific case, and it falls under your domain, not a biologist’s. You should ask, rather, something similar to, “Do you have any evidence that the expression of cancer-related genes is independent, in general, of other genes?” This is a question about biology, and hopefully the biology professor would understand it.

In his answer, it is important to distinguish between what he thinks and what he knows. If the professor merely thinks that the expression of these genes is independent of others, then it’s certainly something to keep in mind throughout the project, but you shouldn’t make any important decisions—such as ignoring certain data—based on it. If, on the other hand, the professor can cite scientific research supporting his claim, then it’s advisable to use this fact to make decisions.

In any project, you, the data scientist, are an expert in statistics and in software tools, but the principal subject-matter expert is very often someone else, as in the case involving the professor of biology. In learning from this subject-matter expert, you should ask questions that give you some intuitive sense of how the system under investigation works, as well as questions that attempt to separate fact from opinion and intuition. Basing practical decisions on fact is always a good idea, but basing them on opinion can be dangerous. The maxim “Trust, but verify” is appropriate here. If I had ignored any of the genes in the data set, I may very well have missed a crucial aspect of the complex interaction taking place among various types of RNA in the cancer experiments. Cancer, it turns out, is a very complex disease on the genetic level as well as on the medical one.

2.1.4. Suggesting deliverables: guess and check

Your customer probably doesn’t understand data science and what it can do. Asking them “What would you like to appear in the final report?” or “What should this analytic application do?” can easily result in “I don’t know” or, even worse, a suggestion that doesn’t make sense. Data science is not their area of expertise, and they’re probably not fully aware of the possibilities and limitations of software and data. It’s usually best to approach the question of final product with a series of suggestions and then to note the customer’s reaction.

One of my favorite questions to ask a customer is “Can you give me an example of a sentence that you might like to see in a final report?” I might get responses such as “I’d like to see something like, ‘MicroRNA-X seems to inhibit Gene Y significantly,’” or “Gene Y and Gene Z seem to be expressed at the same levels in all samples tested.” Answers like these give a great starting point for conceiving the format of the final product. If the customer can give you seed ideas like these, you can expand on them to make suggestions of final products. You might then ask, “What if I gave you a table of the strongest interactions between specific microRNAs and genetic mRNAs?” Maybe the customer would say that this would be valuable—or maybe not.

It’s most likely, however, that a customer makes less-clear statements, such as “I’d like to know which microRNAs are important in cancer development.” For this you’ll need clarification if you hope to complete the project successfully. What does important mean in a biological sense? How might this importance manifest itself in the available data? It’s vital to get answers to these questions before proceeding; if you don’t know how microRNA importance might manifest itself in the data, how will you know when you’ve found it?

One mistake that many others and I have made on occasion is to conflate correlation with significance. Consider first the more familiar confusion of correlation with causation. Here's an example: a higher percentage of helmet-wearing cyclists are involved in accidents than non-helmet-wearing cyclists. It might be tempting to conclude that helmets cause accidents, but this is probably fallacious. The correlation between helmets and accidents doesn't imply that helmets cause accidents; nor does it imply that accidents (directly) cause the wearing of helmets. In reality, cyclists who ride on busier and more dangerous roads are more likely to wear helmets and also more likely to get into accidents. Riding on more dangerous roads causes both, so there's no direct causation between helmets and accidents despite the correlation.

Causation, in turn, is merely one way in which a correlation might be significant. If you're conducting a study on the use of helmets and the rates of accidents, then this correlation might be significant even if it doesn't imply causation. I should stress that significance, as I use the term, is determined by the project's goals: knowledge of the helmet–accident correlation could lead to considering (and modeling) the level of traffic and danger on each road as part of the project. Nor is significance guaranteed by correlation. I'm fairly certain that more cycling accidents happen on sunny days, but that's because more cyclists are on the road on sunny days (barring rain), not because of any other meaningful relationship. It's not immediately clear how I might use this information to further my goals, so I wouldn't spend much time exploring it; that correlation doesn't seem to have any significance in this particular case.
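To make the helmet example concrete, here's a small simulation in Python. Every number in it is invented for illustration: road danger (the confounder) raises both the chance of wearing a helmet and the chance of an accident, and the two end up correlated with no causal link between them.

  # Road danger drives both helmet wearing and accidents, so the two
  # correlate even though neither causes the other.
  import random

  random.seed(0)
  rows = []
  for _ in range(100_000):
      dangerous_road = random.random() < 0.3
      # Helmets and accidents each depend on road danger, not on each other.
      helmet = random.random() < (0.8 if dangerous_road else 0.4)
      accident = random.random() < (0.02 if dangerous_road else 0.005)
      rows.append((helmet, accident))

  def accident_rate(wearing_helmet):
      group = [acc for helm, acc in rows if helm == wearing_helmet]
      return sum(group) / len(group)

  print("accident rate, helmet:   ", accident_rate(True))    # roughly 0.012
  print("accident rate, no helmet:", accident_rate(False))   # roughly 0.007

Helmet wearers show a markedly higher accident rate even though, by construction, helmets have no effect on accidents at all.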

In gene/RNA expression experiments, thousands of RNA sequences are measured within only 10–20 biological samples. Such an analysis, with far more variables (expression levels for each RNA sequence or gene) than data points (samples), is called high-dimensional or often underdetermined. With so many variables, some of them will be strongly correlated purely by random chance, and it would be fallacious to say that they're related in a real biological sense. If you present a list of strong correlations to the biology professor, he'll spot immediately that some of your reported correlations are unimportant or, worse, contrary to established research, and you'll have to go back and do more analyses.
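You can see the danger directly with random numbers. In this sketch (numpy assumed available; the sizes mirror the 10–20-sample scenario above but are otherwise arbitrary), every “gene” is pure noise, yet some pairs still look strongly correlated:

  # With 2,000 random "genes" measured across only 15 samples, some pairs
  # will look strongly correlated by chance alone.
  import numpy as np

  rng = np.random.default_rng(42)
  expression = rng.normal(size=(2000, 15))    # rows: genes; columns: samples

  corr = np.corrcoef(expression)              # gene-by-gene correlation matrix
  np.fill_diagonal(corr, 0.0)                 # ignore self-correlation
  print("max |correlation| between random genes:", np.abs(corr).max())

Running this typically prints a value near 0.9 even though there's no real relationship anywhere in the data, which is exactly why a raw list of strong correlations from an underdetermined data set should be treated with suspicion.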

2.1.5. Iterate your ideas based on knowledge, not wishes

Just as it's important, within your acquired domain knowledge, to separate fact from opinion, it's also important to avoid letting excessive optimism blind you to obstacles and difficulties. I've long thought that an invaluable skill of a good data scientist is the ability to foresee potential difficulties and to leave open a path around them.

It's popular in the software industry today to make claims about analytic capabilities while they're still under development. This, I've learned, is a tactic of salesmanship that often seems necessary, particularly for young startups trying to get ahead in a competitive industry. When I work with a startup, it always makes me nervous when a colleague is actively selling a piece of analytic software that I've said I think I can build but that I'm not 100% sure will work exactly as planned, given some limitation of the data I have available. When I make bold statements about a hypothetical product, I try to keep them, as much as possible, in the realm of things that I'm almost certain I can do. In case I can't, I try to have a backup plan that doesn't involve the trickiest parts of the original plan.

Imagine you want to develop an application that summarizes news articles. You’d need to create an algorithm that can parse the sentences and paragraphs in the article and extract the main ideas. It’s possible to write an algorithm that does this, but it’s not clear how well it will perform. Summaries may be successful in some sense for a majority of articles, but there’s a big difference between 51% successful and 99% successful, and you won’t know where your particular algorithm falls within that range until you’ve built a first version at least. Blindly selling and feverishly developing this algorithm might seem like the best idea; hard work will pay off, right? Maybe. This task is hard. It’s entirely possible that, try as you might, you never get better than 75% success, and maybe that’s not good enough from a business perspective. What do you do then? Do you give up and close up shop? Do you, only after this failure, begin looking for alternatives?

Good data scientists know when a task is hard even before they begin. Sentences and paragraphs are complicated, random variables that often seem designed specifically to thwart any algorithm you might throw at them. In case of failure, I always go back to first principles, in a sense. I ask myself: what problem am I trying to solve? What is the end goal, beyond summarization?

If the goal is to build a product that makes reading news more efficient, maybe there’s another way to address the problem of inefficient news readers. Perhaps it’s easier to aggregate similar articles and present them together to the reader. Maybe it’s possible to design a better news reader through friendlier design or by incorporating social media.

No one ever wants to declare failure, but data science is a risky business, and to pretend that failure never happens is a failure in itself. There are always multiple ways to address a problem, and formulating a plan that acknowledges a likelihood of obstacles and failure can allow you to gain value from minor successes along the way, even if the main goal isn’t achieved.

A far greater mistake would be to ignore the possibility of failure and also the need to test and evaluate the performance of the application. If you assume that the product works nearly perfectly, but it doesn’t, delivering the product to your customer could be a huge mistake. Can you imagine if you began selling an untested application that supposedly summarized news articles, but soon thereafter your users began to complain that the summaries were completely wrong? Not only would the application be a failure, but you and your company might gain a reputation for software that doesn’t work.

2.2. Ask good questions—of the data

It may seem at first glance that this section could be included with the previous one, and I’ve even mentioned a few ways in which good questions may be asked of the customer. But in this section I discuss the question as an inquiry not only into the knowledge of the customer but also into the capabilities of the data. A data set will tell us no more than what we ask of it, and even then, the data may not be capable of answering the question. These are the two most dangerous pitfalls:

  • Expecting the data to be able to answer a question it can’t
  • Asking questions of the data that don’t solve the original problem

Asking questions that lead to informative answers and subsequently improved results is an important and nuanced challenge that deserves much more discussion than it typically receives. The examples of good, or at least helpful, questions I've mentioned in previous sections were somewhat specific in their phrasing and scope, even if they can apply to many types of projects. In the following subsections, I attempt to define and describe a good question, with the intent of delivering a framework or thought process for generating good questions for an arbitrary project. Hopefully you'll see how asking yourself a few preliminary questions can lead you to useful, good questions to ask of the data.

2.2.1. Good questions are concrete in their assumptions

No question is quite as tricky to answer as one that’s based on faulty assumptions. But a question based on unclear assumptions is a close second. Every question has assumptions, and if those assumptions don’t hold, it could spell disaster for your project. It’s important to think about the assumptions that your questions require and decide whether these assumptions are safe. And in order for you to figure out if the assumptions are safe, they need to be concrete, meaning well defined and able to be tested.

For a brief while I worked at a hedge fund. I was in the quantitative research department, and our principal goal was, as with any hedge fund, to find patterns in financial markets that might be exploited for monetary benefit. A key aspect of the trading algorithms that I worked with was a method for model selection. Model selection is to mathematical modeling what trying on pants is to shopping: we try many of them, judge them, and then select one or a few that seem to work well for us, hoping that they serve us well in the future.

Several months after I began working at this hedge fund, another mathematician was hired, fresh out of graduate school. She began working directly with the model selection aspect of the algorithms. One day, while walking to lunch, she began to describe to me how a number of the mathematical models of the commodities markets had begun to diverge widely from their long-term average success rates. For example, let's assume that Model A has correctly predicted whether the daily price of crude oil would go up or down 55% of the time over the last three years but has been correct only 32% of the time over the past four weeks. My colleague informed me that because the success rate of Model A had fallen below its long-term average, it was bound to pick back up over the next several weeks, and we should bet on the predictions of Model A.

Frankly, I was disappointed with my colleague, but hers was an easy mistake to make. When a certain quantity—in this case the success rate of Model A—typically returns to its long-term mean, it’s known as mean reversion, and it’s a famously contested assumption of many real-life systems, not the least of which are the world’s financial markets.

Innumerable systems in this world don’t subscribe to mean reversion. Flipping a standard coin is one of them. If you flip a coin 100 times and you see heads only 32 times, do you think you’re going to see more than 50 heads in the next 100 tosses? I certainly don’t, at least to the point that I would bet on it. The history of a (fair) coin being tossed doesn’t affect the future of the coin, and commodities markets are in general the same way. Granted, many funds find exploitable patterns in financial markets, but these are the exceptions rather than the rule.
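The coin-flip claim is easy to check with a simulation. Here's a minimal sketch in Python (the 40-heads cutoff and the sample sizes are arbitrary choices of mine): generate many sequences of 200 fair flips, keep only those with an unlucky first 100, and ask whether the next 100 flips compensate.

  # After an unlucky streak, a fair coin is not "due" for extra heads.
  import random

  random.seed(1)
  second_half_heads = []
  for _ in range(50_000):
      flips = [random.random() < 0.5 for _ in range(200)]
      if sum(flips[:100]) <= 40:                  # an unlucky first 100 flips
          second_half_heads.append(sum(flips[100:]))

  print("sequences with unlucky starts:", len(second_half_heads))
  print("mean heads in the next 100 flips:",
        sum(second_half_heads) / len(second_half_heads))

The mean comes out near 50, not 60: the coin doesn't revert to make up the deficit, and under the no-mean-reversion view, neither should Model A.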

The assumption of mean reversion is a great example of a fallacious assumption in a question that you might ask the data. In this case, my colleague was asking “Will Model A’s success rate return to its long-term average?” and, based on the assumption of mean reversion, the answer would be yes: mean reversion implies that Model A will be correct more often when it has recently been on a streak of incorrectness. But if you don’t assume mean reversion in this case, the answer would be “I have no idea.”

It’s extremely important to acknowledge your assumptions—there are always assumptions—and to make sure that they are true, or at least to make sure that your results won’t be ruined if the assumptions turn out not to be true. But this is easier said than done. One way to accomplish this is to break down all of the reasoning between your analysis and your conclusion into specific logical steps and to make sure that all of the gaps are filled in. In the case of my former colleague, the original steps of reasoning were these:

  1. The success rate of Model A has recently been relatively low.
  2. Therefore, the success rate of Model A will be relatively high in the near future.

The data tells you 1, and then 2 is the conclusion you draw. If it isn’t obvious that a logical step is missing here, it might be easier to see it when you replace the success rate of Model A with an arbitrary quantity X that might go up or down over time:

  1. X has gone down recently.
  2. Therefore, X will go up soon.

Think of all of the things X could be: stock price, rainfall, grades in school, bank account balance. For how many of these does the previous logic make sense? Is there a missing step? I would argue that there is indeed a missing step. The logic should be like this:

  1. X has gone down recently.
  2. Because X always corrects itself toward a certain value, V,
  3. X will go up soon, toward V.

Note that the data has told you 1, as before, and you'd like to be able to draw the conclusion in 3, but 3 depends on 2 being true. Is 2 true? Again, think of all of the things X could be. Certainly, 2 is not true for a bank account balance or rainfall, so it can't always be true. You must ask yourself whether it's true for the particular quantity you're examining: do you have any reason to believe that, for an arbitrary period, Model A should be correct in its predictions 55% of the time? In this case, the only evidence you have that Model A is correct 55% of the time is that Model A historically has been correct 55% of the time. This is something like circular reasoning, and it isn't enough real evidence to justify the assumption. Mean reversion shouldn't be taken as truth, and the conclusion that Model A should be correct 55% of the time (or more) in the near future isn't justified.

As a mathematician, I've been trained to separate all analysis, argument, and conclusion into logical steps, and this experience has proven invaluable in making and justifying real-life conclusions and predictions through data science. Formal reasoning is probably the skill I value most among those I learned through my mathematics course work in college. One important fact about reasoning, to emphasize again the point of this section, is that a false or unclear assumption starts you out in a questionable place, and you should make every effort to avoid relying on such assumptions.

2.2.2. Good answers: measurable success without too much cost

Perhaps shifting focus to the answers to good questions can shed more light on what a good question comprises, as well as help you decide when your answers are sufficient. The answer to a good question should measurably improve the project's situation in some way. The point is that you should be asking questions that, whatever the answer, make your job a little easier by moving you closer to a practical result.

How do you know if answering a question will move you closer to a useful, practical result? Let’s return to the idea that one of a data scientist’s most valuable traits is their awareness of what might occur and their ability to prepare for that. If you can imagine all (or at least most) possible outcomes, then you can follow the logical conclusions from them. If you know the logical conclusions—the additional knowledge that you can deduce from your new outcome—then you can figure out whether they will help you with the goals of your project.

There can be a wide range of possible outcomes, many of which can be helpful. Though this is not an exhaustive list, you can move closer to the goals of your project if you ask and answer questions that lead to positive or negative results, the elimination of possible paths or conclusions, or increased situational awareness.

Both positive and negative results can be helpful. What I call positive results are those that confirm what you suspected and/or hoped for when you initially asked the question. These are helpful, obviously, because they fit into your thought processes about the project and also move you directly toward your goals. After all, goals are yet-unrealized positive results that, if confirmed, give some tangible benefit to your customer.

Negative results are helpful because they inform you that something you thought was probably true is in fact false. These results usually feel like setbacks, but, practically speaking, they're the most informative of all possible results. What if you found out that the sun was not going to rise tomorrow, despite all of the historical evidence to the contrary? This is an extreme example, but can you imagine how informative that would be, if it were confirmed true? It would change everything, and you would very likely be one of very few people who knew it, given that it was so counterintuitive. In that way, negative results can be the most helpful, though often they require you to readjust your goals based on the new information. At the very least, negative results force you to rethink your project to account for those results, a process that leads to more informed choices and a more realistic path for your project.

As I mentioned in chapter 1, data science is fraught with uncertainty. There are always many possible paths to a solution, many possible paths to failure, and even more paths to the gray area between success and failure. Evidence of improbability or outright elimination of any of these possible paths or conclusions can be helpful to inform and focus the next steps of the project. A path can be eliminated or deemed improbable in many ways, which might include the following:

  • New information making a path far less likely
  • New information making other paths far more likely
  • Technical challenges that make exploring certain paths very difficult or impossible

If eliminating a path doesn’t seem like it’s helping—maybe it was one of the only paths that might have succeeded—keep in mind that your situation has become simpler regardless, which can be good. Or take the chance to rethink your set of paths and your knowledge of the project. Maybe there’s more data, more resources, or something else that you haven’t thought of yet that might help you gain a new perspective on the challenges.

In data science, increasing situational awareness is always good. What you don’t know can hurt you, because an unknown quantity will sneak into some aspect of your project and ruin the results. A question can be good if it helps you gain insight into how a system works or what peripheral events are occurring that affect the data set. If you find yourself saying “I wonder if...” at some point, or if a colleague does the same, ask yourself if that thought relates to a question that can help you gain some context for the project—if not answer some larger, more direct question. Being introspective in this way brings some formality and procedure to the often fuzzy task of looking for good results.

2.3. Answering the question using data

You have good questions, and now you want answers. Answers that provide solutions to problems are, after all, the goal of your entire project. Getting an answer from a project in data science usually looks something like the formula, or recipe, in figure 2.2. Although sometimes one of the ingredients—good question, relevant data, or insightful analysis—is simpler to obtain than the others, all three are crucial to getting a useful answer. Also, the four adjectives I chose, one in each ingredient (good, relevant, insightful) and one in the result (useful), should not be ignored, because without them the formula doesn’t always work. The product of any old question, data, and analysis isn’t always an answer, much less a useful one. It’s worth repeating that you always need to be deliberate and thoughtful in every step of a project, and the elements of this formula are not exceptions. For example, if you have a good question but irrelevant data, an answer will be difficult to find.

Figure 2.2. The recipe for a useful answer in a data science project

2.3.1. Is the data relevant and sufficient?

It’s not always easy to tell whether data is relevant. For example, let’s say that you’re building a beer recommendation algorithm. A user will choose a few beers that they like, and the algorithm will recommend other beers that they might like. You might hypothesize that a beer drinker typically likes certain types of beer but not others, and so their favorites tend to cluster into their favorite beer types. This is the good question: do beer drinkers like beers of certain types significantly more than others? You have access to a data set from a popular beer-rating website, composed of one- to five-star ratings by thousands of site users. You’d like to test your hypothesis using this data; is this data relevant?

Fairly soon, you realize that the data set is a CSV file containing only three columns: USER_NAME, BEER_NAME, and RATING. (Drat! No beer types.) A data set that seemed immensely relevant before now seems less so for this particular question. Certainly, for a question about beer types, you need to know, for each beer, what type it is. Therefore, to answer the question you must either find a data set matching beers to beer types or try to infer the beer types from the data you already have, perhaps based on the name of the beer.
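As an illustration of the second option, inferring types from names, here's a rough sketch in Python. The CSV columns are the three from the text; the keyword table, file name, and function are hypothetical stand-ins, and a real mapping would need far more care:

  # Guess each beer's type from its name with simple keyword matching.
  import csv

  TYPE_KEYWORDS = {
      "ipa": "IPA", "stout": "Stout", "porter": "Porter",
      "pilsner": "Pilsner", "lager": "Lager", "wheat": "Wheat",
  }

  def guess_type(beer_name):
      lowered = beer_name.lower()
      for keyword, beer_type in TYPE_KEYWORDS.items():
          if keyword in lowered:
              return beer_type
      return None  # unknown: flag for manual lookup or another data source

  with open("beer_ratings.csv", newline="") as f:
      for row in csv.DictReader(f):  # columns: USER_NAME, BEER_NAME, RATING
          print(row["BEER_NAME"], "->", guess_type(row["BEER_NAME"]))

The fraction of names that come back as None would tell you quickly whether this shortcut is viable or whether you need an external beer-type data set after all.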

Either way, it should be apparent that the data set that at first glance seemed perfectly capable of answering the question needs some additional resources in order to do so. A data scientist with some foresight and awareness can anticipate these sorts of problems before they cost you or your colleagues time and/or money. The first step is to outline specifically how data will help you answer your question. In the case of the investigation of affinity for beer types, such a statement might suffice:

In order to find out whether beer drinkers like certain types of beers significantly more than others, we need data containing beer name, beer type, user names, and their individual ratings for the beers. With this data, we can perform a statistical test, such as an analysis of variance (ANOVA), with each beer type as a variable, and examine whether beer type is a significant influencer of rating for individual users.

Disregarding the lack of specific detail in the description of the statistical test (it’s not important here, but we’ll return to it in a later chapter), you have here a basic outline of what might be done to answer the question using data you believe to be available. There may be other such outlines that serve the purpose equally well or better, but a good outline states what data you would need and how you would use that data to answer the question. By stating what specific properties the data set should have, you—or anyone you’re working with—can check quickly to see if a data set (or more than one) fulfills the requirements.
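As a sketch of how that outline might translate into code, here's one way to run the named ANOVA, assuming a hypothetical CSV that already includes a BEER_TYPE column and pooling all users together (the outline's per-user angle would need a more elaborate model):

  # One-way ANOVA: does mean rating differ across beer types?
  import pandas as pd
  from scipy.stats import f_oneway

  ratings = pd.read_csv("beer_ratings_with_types.csv")  # hypothetical file

  # One group of ratings per beer type.
  groups = [g["RATING"].values for _, g in ratings.groupby("BEER_TYPE")]

  f_stat, p_value = f_oneway(*groups)
  print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

A small p-value would suggest that beer type does influence ratings, which is the hypothesis the question was built around; it would not yet say anything about individual drinkers' type preferences.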

I (like many people I know) have on occasion begun creating algorithms based on data I thought I had instead of the data I actually had. When I realized my mistake, I had wasted some time (thankfully not much) on code that was worthless. If you make a short outline of how you're going to answer a question using data, it's easy to check that a data set you're considering contains all the information you've listed as requirements—before you start coding. If the data set is lacking a vital piece of information, you can then adjust your plan either to find the missing piece elsewhere or to devise another plan/outline that doesn't require it. Planning during this stage of a data science project can save you quite a lot of time and effort later. Having to modify code heavily at a late stage of a project, or to scrap it altogether, is not usually an efficient use of time.
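That check can even be automated. Here's a tiny sketch; the required-column set follows the beer example, and the file name is again hypothetical:

  # Fail fast if a data set lacks the columns the plan requires.
  import csv

  REQUIRED_COLUMNS = {"USER_NAME", "BEER_NAME", "BEER_TYPE", "RATING"}

  def check_columns(path):
      with open(path, newline="") as f:
          header = set(next(csv.reader(f)))
      missing = REQUIRED_COLUMNS - header
      if missing:
          raise ValueError(f"{path} is missing required columns: {missing}")

  check_columns("beer_ratings.csv")  # raises: no BEER_TYPE column yet

Running a guard like this before writing any analysis code surfaces the beer-type problem in seconds rather than after days of work.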

In the following subsections, I outline some steps you can use to develop a solid but detailed plan to find and use data to answer a specific question. In later chapters, I’ll discuss gathering and using data in detail, but here I’d like to cover the basic ideas of what to note and keep in mind throughout the project. You usually have many choices to make, and if those choices later turn out to be inefficient—or plain wrong—then it helps to have a record of other choices you could have made instead.

2.3.2. Has someone done this before?

This should always be the first step in developing a plan: check the internet for blog posts, scientific articles, open-source projects, research descriptions from universities, or anything else you can find related to the project you’re starting. If someone else has done something similar, you may gain a lot of insight into the challenges and capabilities that you haven’t yet encountered. Again, awareness is very helpful.

If others have done similar projects, you're likely to encounter similar problems and find similar solutions in your own, so it's best to learn as much as you can about what to watch out for and how to handle it.

Sometimes, a little searching will lead you to useful data sets you hadn’t seen yet, analytic methods you may not have considered, or, best of all, results or conclusions that you can use in your own project. Presuming that the analyses were rigorous—but it’s always best to verify—someone else may have done a lot of your work for you.

2.3.3. Figuring out what data and software you could use

Now that you’ve searched around, both for prior similar projects and for relevant data sets, you should take stock of what you have and what you still need.

Data

If you can imagine a data set that would help you tremendously, but that data set doesn't exist anywhere, it often helps to make a note of it in case creating or acquiring it becomes feasible later. For instance, if you're missing the beer type labels in the data set of beer ratings, you may find that many breweries have the types listed on their websites or that other web pages do. This provides the opportunity to collect them yourself, potentially at a cost of time and effort. If you make a note of this need and this potential solution, then at any point in the future you can reevaluate the costs and benefits and make a decision about how to proceed. Because of the many uncertainties associated with data-centric projects, the costs and benefits of possible choices may change in scope or scale at virtually any time.

Software

Most data scientists have a favorite tool for data analysis, but that tool may not always be appropriate for what you intend to do. It’s usually a good idea to think about the analysis you want to do conceptually before you try to match that concept to a piece of software that can turn the concept into a reality. You may decide that a statistical test can provide a good answer if the data supports it, that machine learning can figure out the classifications you need, or that a simple database query can answer the question. In any of these cases, you may know of a good way to use your favorite tool to perform the analysis, but first you might want to consider the format of your data, any data transformations that might have to take place, the amount of data you have, the method of loading the data into analytic tools, and finally the manner of the analysis in the tools. Thinking through all of these steps before you perform them and considering how they might work in practice can certainly lead to better choice of software tools.

If you aren’t familiar with many software tools and techniques, you can skip this step for now and continue reading, because I cover these in later chapters. But for now I merely want to emphasize the importance of thinking through—deliberately—various options before choosing any. The decision can have large implications and should not be taken lightly.

2.3.4. Anticipate obstacles to getting everything you want

Here are some questions you might want to ask yourself at this planning stage of the project:

  • Is the data easily accessed and extracted?
  • Is there anything you might not know about the data that could be important?
  • Do you have enough data?
  • Do you have too much data? Will it take too long to process?
  • Is there missing data?
  • If you're combining multiple data sets, are you sure everything will integrate correctly? Are names, IDs, and codes the same in both data sets? (See the sketch after this list.)
  • What happens if your statistical test or other algorithm doesn’t give the result you expect?
  • Do you have a way to spot-check your results? What if the checks show errors somewhere?
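On the integration question in particular, a quick sanity check on the join keys costs almost nothing. Here's a minimal sketch; the file and column names are invented for illustration:

  # Before joining two data sets on an ID, see how many IDs actually match.
  import pandas as pd

  left = pd.read_csv("accounts.csv")        # hypothetical data set A
  right = pd.read_csv("transactions.csv")   # hypothetical data set B

  left_ids = set(left["account_id"])
  right_ids = set(right["account_id"])
  overlap = left_ids & right_ids

  print(f"{len(overlap)} of {len(left_ids)} left IDs appear on the right")
  print(f"{len(right_ids - left_ids)} right IDs have no match on the left")

A surprisingly low overlap usually points to mismatched formats: different casing, zero-padding, or code systems in the two data sets.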

Some of these may seem obvious, but I've seen enough people—data scientists, software engineers, and others—forget to consider these things and pay for their negligence later. It's better to be reminded to be skeptical at the very beginning so that the uncertainty of the task doesn't cost you nearly as much later.

2.4. Setting goals

I’ve mentioned goals several times in this chapter, but I haven’t yet addressed them directly. Though you probably began the project with some goals in mind, now is a good time to evaluate them in the context of the questions, data, and answers that you expect to be working with.

Typically, initial goals are set with some business purpose in mind. If you’re not in business—you’re in research, for example—then the purpose is usually some external use of the results, such as furthering scientific knowledge in a particular field or providing an analytic tool for someone else to use. Though goals originate outside the context of the project itself, each goal should be put through a pragmatic filter based on data science. This filter includes asking these questions:

  • What is possible?
  • What is valuable?
  • What is efficient?

Applying this filter to all putative goals within the context of the good questions, possible answers, available data, and foreseen obstacles can help you arrive at a solid set of project goals that are, well, possible, valuable, and efficient to achieve.

2.4.1. What is possible?

Sometimes it's obvious what is possible, but sometimes it's not. In the following chapters, I'll describe how some tasks that might seem easy are not. For instance, finding appropriate data, wrangling the data, exploiting the data to get answers, designing software to perform the analysis, and confronting any other obstacle such as those I referenced in the previous section may all affect your ability to achieve a certain goal. The more complex a task is and the less you know about it beforehand, the less likely it is to be possible. For example, if you think a certain data set exists but you haven't confirmed it yet, then achieving any goal requiring that data set might not be possible. For any goal with uncertainties, consider the possibility that achieving the goal might be impossible.

2.4.2. What is valuable?

Some goals give more benefit than others. If resources are scarce, everything else being equal, it’s better to pursue the more beneficial goal. In business, this might mean pursuing the goal(s) that are expected to give the highest profit increase. In academia, this might mean aiming for the most impactful scientific publication. Formally considering the expected value of achieving a goal, as I suggest here, creates a deliberate context and framework for project planning.

2.4.3. What is efficient?

After considering what is possible and what is valuable, you can consider the effort and resources it might take to achieve each goal. Then you can approximate efficiency via this equation:

efficiency = (value / effort) × possibility

The overall efficiency of achieving a goal is the value of achieving it divided by the effort required (the value gained per unit of effort), multiplied by the possibility (the probability that the goal will be achieved at all). Efficiency goes up with the value of the goal, down as more effort is required, and down again as the goal seems less likely to be achieved. This is only a rough calculation, and it means more to me conceptually than it does practically, but I do find it helpful.
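To see how the heuristic ranks competing goals, here's a toy calculation with invented numbers: value in arbitrary benefit units, effort in person-weeks, and possibility as a subjective probability of success.

  # Rough efficiency scores for two hypothetical project goals.
  def efficiency(value, effort, possibility):
      return (value / effort) * possibility

  goals = {
      "full recommendation engine": efficiency(value=100, effort=12, possibility=0.5),
      "simple type-affinity report": efficiency(value=40, effort=2, possibility=0.9),
  }
  for goal, score in sorted(goals.items(), key=lambda kv: -kv[1]):
      print(f"{goal}: {score:.1f}")

The modest report scores 18.0 against the engine's 4.2: a reminder that high value alone doesn't make a goal the most efficient one to pursue.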

2.5. Planning: be flexible

Given all your knowledge of the project, all the research you’ve done so far, and all the hypothetical questions you’ve asked yourself about the data and the software tools you might use, it’s time to formulate a plan. This should not be a plan containing sequential steps with outcomes that are presumed beforehand. The uncertainty of data and data science virtually guarantees that something won’t turn out the way you expect. It’s a good strategy to think of a few different ways that you might achieve your goals. Even the goals themselves can be flexible.

These alternative paths might represent different overarching strategies, but in most cases two paths in the plan will diverge wherever there's an anticipated uncertainty, with the two most likely scenarios indicating two different strategies for addressing the outcome of that uncertainty. It's definitely advisable to make a plan from the beginning to the first major uncertainty. Stopping there might save you some planning time now, but it's even better to map out all of the most likely paths, particularly if you're working with multiple people. That way, everyone can see where the project is headed and knows from the very beginning that there will be problems and detours, even though no one yet knows exactly which ones they will be. Such is the life of a data scientist!

Last, the plans you formulate here will be revisited periodically throughout the project (and this book), so the early stages of the plans are the most important ones. A good goal is to plan the next steps to put you in the best position to be well informed the next time you revisit plans and goals. Increasing knowledge and reducing uncertainty are always good things.

Exercises

Consider the following scenario:

You're working at a firm that consolidates personal finance data for its customers, who are primarily individual consumers. Let's call this firm Filthy Money, Inc., or FMI. Through FMI's primary product, a web application, customers can view all their data in one place instead of needing to log into the websites of each of their financial accounts separately. The typical customer has connected several of their accounts, such as bank accounts and credit cards, to the FMI app.

One of FMI’s lead product designers has asked you to help build a new product component called Filthy Money Forecasting, which was conceived to provide users of the FMI app with near-term forecasts of their accounts and overall financial status based on their spending and earning habits. The product designer wants to collaborate with you to figure out what is possible and what some good product features might be:

  1. What are three questions you would ask the product designer?
  2. What are three good questions you might ask of the data?
  3. What are three possible goals for the project?

Summary

  • Stay aware: experience, domain experts, and knowledge of other related projects help you plan and anticipate problems before they arise.
  • Be aware of the customer’s perspective and potential lack of data science knowledge.
  • Make sure the project focuses on answering questions that are good.
  • Take time to think through all possible paths to answering those good questions.
  • Set goals using a pragmatic perspective of what the customer wants, the questions you’ve developed, and the possible paths to getting answers.