Chapter 10. Plan execution: putting it all together

This chapter covers

  • Tips for putting your statistics (chapter 7) and software (chapters 8 and 9) into action
  • When to modify the plan (formulated as in chapter 6)
  • Understanding the significance of results and how it relates to practical usefulness

Figure 10.1 shows where we are in the data science process: executing the build plan for the product. In the last three chapters, I covered statistics, statistical software, and some supplemental software. Those chapters provide a survey of technical options available to data scientists in the course of their projects, but they don’t continue along the data science process from the previous chapters. Because of this, in this chapter I bring you back to that process by illustrating how you can go from the formulation of a plan (chapter 6) to applying statistics (chapter 7) and software (chapters 8 and 9) in order to achieve good results. I point out some helpful strategies as well as some potential pitfalls, and I discuss what it might mean to have good results. Finally, I give a thorough case study from a project early in my career, with a focus on applying ideas from the current chapter as well as the previous few.

Figure 10.1. The final step of the build phase of the data science process: executing the plan efficiently and carefully

10.1. Tips for executing the plan

In chapters 8 and 9, I discussed various software related to statistical applications, when and where different types might be best used, and how to think about ways that the software relates to the statistics that you intend to do. But the process of building that software is another story. Even if you know exactly what you want to build and how you want the result to look, the act of creating it can be fraught with obstacles and setbacks, particularly the more complicated the tool that you’re trying to build.

Most software engineers are probably familiar with the trials and tribulations of building a complicated piece of software, but they may not be familiar with the difficulty of building software that deals with data of dubious quality. Statisticians, on the other hand, know what it’s like to have dirty data but may have little experience with building higher-quality software. Likewise, individuals in different roles relating to the project, each of whom might possess various experiences and training, will expect and prepare for different things. As part of the project awareness that I’ve emphasized throughout this book, I’ll consider briefly the types of experiences and difficulties that different people might have and a few ways that problems can be prevented. I don’t presume to know what others are thinking, but in my experience people with similar backgrounds tend to make similar mistakes, and I’ll describe those here with the hope that they’re helpful to you.

10.1.1. If you’re a statistician

If you’re a statistician, you know dirty data, and you know about bias and overstating the significance of results. These things are familiar to you, so you innately watch out for them. On the other hand, you may not have much experience building software for business, particularly production software—by which I mean software that’s used directly by a customer to gain insight into their data. Many things can go wrong with production software.

Consult a software engineer

Statisticians are smart people; they can learn and apply a lot of knowledge in a short time. For any smart person, it can be tempting to learn a new technology as you need it and to trust your own ability to use it properly. This is fine if you’re creating something that you’ll use yourself or that’s primarily a prototype. But in cases where bugs and mistakes will have a significant negative impact on your project and your team, it’s best to at least consult a software engineer before, during, and after building an analytic software tool. If nothing else, the software engineer will give you a thumbs-up and tell you your design or your software is great. More likely—if the engineer is paying attention—they’ll be able to point out a few areas where you can improve in order to make your software tool more robust and less likely to fail for unknown reasons. If you’re not a software engineer, building a piece of production software yourself is like building a deck for your house when you have no training in carpentry or construction. In theory, you can learn most of what you need to know from books and other references, but putting the wood and the nails and the joints together can get a little messy. It can be very helpful to ask someone with hands-on experience to make sure you’re on the right track.

Have someone test your software thoroughly

If you plan on handing software to a customer and letting them use it directly, you can bet they’ll find a dozen ways to break it. It’s difficult to eliminate all bugs and handle all possible edge-case outcomes in a nice way, but you can find the most obvious bugs and problems if you give the software to a co-worker—ideally one who has a background similar to the customer’s—and tell them to use all aspects of the tool and to try to break it. Better still, give the software to several people and have them all use it and try to break it. This is often called a bug bash, but it can extend beyond bugs into the realm of user experience as well as the general usefulness of the tool. Feedback here should not be taken lightly, because if your co-workers can find a bug in a few hours, I can almost guarantee that the customers will find it twice as fast, and that can cost you time, money, and reputation.

Customers take a lot of time

If you’ve never delivered software to a customer before, it may come as a surprise that an astonishingly large number of customers won’t use your software without significant prompting—and customers who do use your software will bombard you with questions, problems, and insinuations that you did everything wrong.

Presuming that you want people to use your software, it can be worth spending time with customers to make sure that they’re comfortable with the software and that they’re using it correctly. This means you may need to send some emails, make phone calls, or show up in person, depending on your situation. Projects in data science often depend on successfully using this new piece of software, and in the common case where the customer doesn’t fully understand the future impact of your new data-centric solution, you may have to guide them down the right path.

Customers bombarding you is a good sign. It means they’re already engaged and they really want the software to work. The downside is that either there are many problems with it or they don’t know how to use it properly. Both of these can presumably be fixed by you or others on your team. Be aware ahead of time that customers can require maintenance at least as much as the software.

10.1.2. If you’re a software engineer

If you’re a software engineer, you know what a development lifecycle looks like, and you know how to test software before deployment and delivery. But you may not know about data and all the ways it can break your beautiful piece of programmed machinery. As I’ve mentioned before, uncertainty is the absolute enemy of the software engineer, yet uncertainty is inevitable in data science. No matter how good you are at software design and development, data will eventually break your application in ways that had never occurred to you. This requires new patterns of thought when building software and a new level of tolerance for errors and bugs because they’ll happen that much more often.

Consult a statistician

Software engineers are smart people; they can follow the flow of logic and information through complex structures. Data and statistics, though, introduce a level of uncertainty that logic and rigid structure don’t handle well innately. Statisticians are well versed in foreseeing and handling problematic data such as outliers, missing values, and corrupted values. It can be helpful to have a conversation with a statistician, focusing on the sources of your data and what you intend to do with it. A statistician might be able to provide some insight into the types of problems and edge cases that may occur once you get your software up and running. Without consulting a statistician or a statistics-oriented data scientist, you run the risk of having overlooked a potentially significant special case that can break your software or otherwise cause problems for it.

Data can break your software

Software engineers are good at connecting disparate systems and making them work together. A critical part of getting two software systems to work together is the agreement, or contract, between the two systems that states how they communicate with one another. If one of those systems is a statistical system, the output or state often can’t be guaranteed to meet a specific set of contractual guidelines. Special and edge cases of data values can make statistical systems do weird things, and when software components are asked to do weird things, they often break. When dealing with data and statistics, it’s best to forgive them in advance. Consider the broadest possible set of outcomes or states, and plan for that. If you’re feeling particularly magnanimous, you may want to enclose statistical statements in try-catch blocks (or similar) such that nothing breaks in the strict sense, and then weird or unacceptable outcomes can be handled, logged, reported, or raised as an exception, whatever seems appropriate.
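Here’s a minimal sketch of that defensive style in R. The fitting function fit_model() and the value column are hypothetical stand-ins, not part of any real library; the point is only that the statistical step is wrapped so that pathological input produces a logged, handled outcome instead of a crash.

```r
# A minimal sketch: wrap a statistical step so that bad data is handled
# and logged instead of crashing the surrounding application.
# fit_model() and the 'value' column are hypothetical stand-ins.
fit_model <- function(df) lm(value ~ 1, data = df)    # placeholder statistical step

safe_fit <- function(df) {
  tryCatch({
    stopifnot(nrow(df) > 0, all(is.finite(df$value))) # reject empty or non-finite input
    fit_model(df)
  },
  error = function(e) {
    message("Statistical step failed: ", conditionMessage(e))
    NULL                                              # caller decides what NULL means
  },
  warning = function(w) {
    message("Statistical step warned: ", conditionMessage(w))
    NULL
  })
}

safe_fit(data.frame(value = c(1.2, 3.4, NA)))  # logs a failure, returns NULL
safe_fit(data.frame(value = rnorm(10)))        # returns a fitted model
```

The design choice is that the caller, not the statistical routine, decides what to do with a NULL result: log it, skip the record, or raise an alert.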

Check the final results

This may seem obvious to most of you, but when time is short it’s incredible how often this step gets skipped. I suggest to statisticians that they ask some people to try to break their software, and I strongly suggest to software engineers that they run through a few full examples of whatever data they’re analyzing and make sure the results are 100% correct. (Really, everyone should do this, but I hope that statisticians are trained well enough to do this by default.) It can be a tedious process to begin with a small amount of raw data and trace it all the way through to an outcome, but without doing an end-to-end correctness test, there’s no way to guarantee that your software is doing what it’s supposed to do. Even performing a few such tests doesn’t guarantee perfect software, but at least you know that you’re getting some correct answers. If you want to take your testing to the next level, translate your end-to-end tests into formal integration tests so that if you make changes to your software in the future, you’ll know immediately if you’ve made a mistake, because the integration test will fail.
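As a sketch of what such a test can look like in base R (the pipeline function summarize_by_group() and the tiny input are invented for illustration), you can encode a hand-verified example so that any future change that breaks it fails loudly:

```r
# Sketch: an end-to-end check encoded as a repeatable test.
# summarize_by_group() is a hypothetical stand-in for the real pipeline.
summarize_by_group <- function(df) aggregate(value ~ group, data = df, FUN = mean)

# A tiny input whose correct output was verified by hand.
raw    <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))
result <- summarize_by_group(raw)

# Hand-computed expectations: mean of group a is 2, group b is 10.
stopifnot(
  result$value[result$group == "a"] == 2,
  result$value[result$group == "b"] == 10
)
cat("End-to-end check passed\n")
```

Wrapping the same checks in a test framework such as testthat turns this into a formal integration test that runs automatically whenever the code changes.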

10.1.3. If you’re a beginner

If you’re starting out in data science, without much experience in statistics or software engineering, first of all, good for you! It’s a big step into a broad field, and you need a good amount of courage to take it. Second, be careful. You can make many mistakes if you go in without the awareness I’ve emphasized throughout this book. The good news is that there are many people around who can help you; if they’re not at your company, find them elsewhere, such as at other similar companies, local technology organizations, or anywhere on the internet. For some reason, people in the software industry love to help others out. Anyone with some experience can probably give you some solid advice if you can explain your project and your goals to them. More specifically, though, it’s best to follow the advice I give in this chapter to both statisticians and software engineers. As a beginner, you have double duty at this stage of the process to make up for lack of experience.

10.1.4. If you’re a member of a team

If you’re merely one member of a team for the purposes of this project, communication and coordination are paramount. It isn’t necessary that you know everything that’s going on within the team, but it is necessary that goals and expectations are clear and that someone is managing the team as a whole.

Make sure someone is managing

I’ve seen some odd cases in which a team had no manager or leader. On exceptional teams, this can work. Sometimes everyone understands the problem, handles their part, and gets the job done. This is rare. But even in these rare cases, it’s usually inefficient if everyone on the team is keeping track of everything everyone else is doing. It’s usually better if one person is keeping track of all of the things that are happening, and this person can answer any questions about the status of the project that may come from anyone on the team or someone outside the team—for instance, a customer. It’s not necessary, but it’s usually advisable to have a team member designated as the one who keeps track of all things related to project status. This role may be as simple as taking notes, or it may be as complex as an official manager who holds formal meetings and sets deadlines. As a member of the team, you should know who this person is and the extent of their management role. If some aspect of management is lacking, you may want to bring it up with your own boss or another person of authority.

Make sure there’s a plan

Everyone who has held more than a couple jobs has most likely had a boss who didn’t do a good job. Some bosses are nice but not effective, and some are the opposite. In chapter 6, I discussed how to make a plan for your project; if you’re working on a team, you probably didn’t make the plan yourself, but you probably participated in a discussion of what should be done, when, and by whom. This should have resulted in some sort of plan, and you should know who is keeping track of this plan. If that’s not the case, there may be a problem. Probably the group leader or manager has a plan, and that person should be able to describe or outline it on demand; if this plan is nonexistent, incoherent, or bad, you may want to start a serious and probably difficult conversation with team leadership. It may not be your personal responsibility to manage the plan, but it benefits the whole group to make sure that someone is handling it in a reasonable way.

Be specific about expectations

Personnel issues aside, there’s almost nothing worse when working on a team than having unclear direction with your own work. If you don’t know exactly what you’re supposed to be doing and what the expectations are for your results, it’s tough to do a good job. On the other hand, it’s OK to have some open-ended goals as long as everyone is aware of that. In any case, if your part of the project isn’t quite clear to you, make sure to ask someone (or everyone) in order to get the issue settled.

10.1.5. If you’re leading a team

If, as in the previous section, you’re part of a team that’s taking on a data science project, all of those suggestions still apply. But if in addition to that you have a position of leadership, there are a few more to add.

Make sure you know what everyone is doing

A team is nothing if it doesn’t know what it’s doing, cohesively. Not everyone needs to know everything, but at least one person should know almost everything that’s going on, and if you’re the team leader, that person should be you. I’m not suggesting that you be a micromanager, but I am suggesting that you take an active interest in the status of each part of the project. This active interest should result in an awareness of the team and project status such that you can answer most general questions about the project status without consulting anyone else. If you can’t answer questions about project timelines and whether you think you’ll meet certain deadlines, your interest in team activities probably isn’t active enough. For more specific questions, such as implementation details, it’s probably OK to ask the relevant team member. If you’re the team leader and manager, it’s part of your job to be the representative of the team in front of non–team members, such as customers.

Be the keeper of the plan

In chapter 6, I discussed the process of making a plan for your project, with different paths and alternatives for different intermediate outcomes. If you have a reasonably sophisticated project, you probably have developed a plan that takes some time to understand. It would likely be inefficient if everyone on the team took the time to consult and understand the plan every time they had to make a decision. It’s a good idea, as team leader, to take responsibility for the plan and field all questions related to the plan over the course of the project. That is not to say that the plan belongs to you and you alone; quite the contrary. The plan should have been developed with the input of the whole group, and certain aspects of the plan might still be owned by the most appropriate members of the group. But it might be a good idea that you, as the team leader, are the only one who is thoroughly familiar with the plan as a whole, as well as the team’s status within it. If a customer asks, “Where are you in the development process?” you should be able to explain the plan summary to them and then say where the team is within the framework of the plan.

Delegate wisely

Beyond having a plan, a team that’s taking on a data science project needs to work together in such a way that work is distributed relatively evenly and to the people who are best suited to the given tasks. Software engineers should handle the more programming- and architecture-oriented aspects, data scientists should be concerning themselves more with data and statistics, subject matter experts should be handling anything related directly to the project domain, and anyone else with a certain set of skills should be handling the tasks most relevant to those skills. I don’t suggest that anyone should be pigeonholed based merely on what they’re good at, but each team member’s expertise and limitations are relevant to the division of tasks. I’ve worked on teams where the few data scientists were treated like the many software engineers, and the results were not positive. Weighing each person’s strengths and limitations against the tasks to be done should be enough to avoid that mistake.

10.2. Modifying the plan in progress

In chapter 6, I discussed formulating a plan for completing your data science project. The plan should contain multiple paths and options, all depending on the outcomes, goals, and deadlines of the project. No matter how good a plan is, there’s always a chance that it should be revised as the project progresses. Even if you thought of all uncertainties and were aware of every possible outcome, things outside the scope of the plan may change. The most common reason for a plan needing to change is that new information comes to light, from a source external to the project, and either one or more of the plan’s paths change or the goals themselves change. I’ll briefly discuss these possibilities here.

10.2.1. Sometimes the goals change

When the goals of the project change, it can have large implications for the plan. Usually goals change because the customers have either changed their mind about something or they’ve communicated information that they didn’t mention before for one reason or another. It’s a common phenomenon—discussed in chapter 2—that customers may not know which information is important to you, a data scientist, so information gathering and goal setting can seem more like elicitation than business. If you’ve done a good job asking the customer the right questions along the way, you probably aren’t far from a good, useful set of goals. But if new information enters the picture, changing the plan may be necessary.

Because you’re already part of the way through the original plan, you probably have something to show for it: preliminary results, some software components, and the like. If the change in goals is dramatic, these things may no longer be as useful as they were, and it can be hard to convince yourself to jettison them. But the previous cost of having built something should not, by itself, factor into decisions about the future; in economics this is called a sunk cost, a cost you can’t recover no matter what you do next. Because the money and time have already been spent, any new plan (for the future) shouldn’t consider them. But whatever you’ve already produced can certainly be useful, and that definitely should be taken into account when formulating a new plan. For example, if you’ve already built a system to load and format the raw data you intend to use, this system is probably going to be useful no matter what the new goals are. On the other hand, if you’ve built a statistical model that answers questions that are particular to the original goals but not the new ones, you might want to throw out that model and start over.

The main focus when goals change is to go through the process of making a plan again, like in chapter 6, but this time around you have some additional resources—whatever you’ve already produced from the completed part of the original plan—and you have to be very careful not to let sunk costs and other inertia prevent you from making the right choices. It’s usually worth it to formally run through the planning process again and make sure that every ongoing aspect of the project is in the best interest of the goals and the new plan that you formulate.

10.2.2. Something might be more difficult than you thought

This happens to me a lot. I remember setting out in 2008 to use MapReduce on Amazon Web Services (AWS) to compute the results of a rather hefty algorithm for bioinformatics that I had written in R. Documentation, tutorials, and simple tools for AWS were rather sparse back then, and the same was true for the MapReduce-related packages in R. I was also rather naïve, I must confess. To make a long story short, many hours later, I knew neither how to set up a cluster on AWS nor how to use one with R. Needless to say, I changed my plan.

When a step within your plan that you thought would be reasonably simple turns into a nightmare, that’s a good reason to change the plan. This doesn’t usually have as large of an impact on the overall plan as a change in goals would, but it can still be significant. Sometimes you might be able to swap out the difficult thing for an easier one. For instance, if you can’t figure out how to use MapReduce, you might gain access to a compute cluster and do your analysis there. Or if a piece of analytic software is overly complex, you might trade it for a simpler one.

If the difficult thing isn’t easily avoided—such as when there’s no comparable software tool to replace the one that has proven difficult to use—you may have to change the plan entirely based on the fact that a particular step must either be left out or changed. The key to making this decision is recognizing early—and correctly—that figuring out how to do the difficult thing is much more costly than doing something else.

10.2.3. Sometimes you realize you made a bad choice

I do this a lot, too. There are any number of reasons why a plan that seemed good when you made it would begin to seem less good as you make some progress. You might not have been aware of certain software tools or statistical methods, for example, and you realize that those are better choices. Or after beginning to use a certain tool, you realize that it has a limitation that you weren’t aware of before. Another possibility is that you had incorrect assumptions or got bad advice about which tools to use.

In any case, if you start to realize that a previous choice, and its inclusion in the plan, was a bad idea, it’s never too late to reevaluate the situation and reformulate a plan based on the most current information. It’s advisable to take into account all the progress up to that point, ignoring purely sunk costs.

10.3. Results: knowing when they’re good enough

As a project progresses, you usually see more and more results accumulate, giving you a chance to make sure they meet your expectations. Generally speaking, in a data science project involving statistics, expectations are based either on a notion of statistical significance or on some other concept of the practical usefulness or applicability of those results or both. Statistical significance and practical usefulness are often closely related and are certainly not mutually exclusive. I’ll briefly discuss the virtues of each and their relationship with one another.

Note that throughout this section, I use the term statistical significance loosely to mean the general levels of accuracy or precision, ranging from the concept of p-values to Bayesian probabilities to out-of-sample accuracies of machine learning methods.

10.3.1. Statistical significance

I mentioned statistical significance in chapter 7 but provided relatively little guidance about choosing a particular significance level. That’s because the appropriate level of significance depends greatly on the purpose of the project. In sociological and biological research, for example, significance levels of 95% or 99% are common. In particle physics, though, researchers typically require a 5-sigma level of significance before accepting results as significant; for reference, 5-sigma (five standard deviations from the mean) is approximately 99.99997% significance.

Depending on the type of statistical model you’re using and the statistical approach, there are different formal notions of significance, ranging from confidence to credibility to probability. I don’t want to discuss the nuances of each of these here, but I will highlight that significance can take many forms, though all of them indicate that if you repeat the analysis or gather more data that’s similar, you’ll see the same results with a certainty level matching the significance level. If you use a 95% significance level, 19 out of 20 comparable analyses would be expected to give the same result. This interpretation doesn’t formally match every type of statistical analysis, but it’s close enough for the discussion here.

Let’s say you’re doing a project in genomics, and you’re trying to find genes that are related to metabolism. Given a good statistical model that you developed for this project, and using the previous notion of repeated analyses with a 95% significance level, you’d expect that any gene that meets this significance level would also meet that significance level in 19 out of 20 repeated experiments. That clearly leaves one experiment in which it wouldn’t meet the significance level. Assuming that the gene truly is involved with metabolism, this one non-significant result would be considered a false negative, meaning that the result was negative (not significant) but it shouldn’t have been. If you analyzed data from thousands of genes, you’d expect to see many false negatives.

On the other hand, because you did only one experiment and subsequent analysis for each gene, surely some genes that are not involved in metabolism nevertheless met the 95% significance level. In theory, these genes should not give significant results most of the time, but by chance you conducted one of the rare experiments whose data makes them appear significant. These are called false positives.

In practice, choosing a significance level means choosing the right balance between false negatives and false positives. If you absolutely need almost all of your positives to be true, then you need a very high significance level. If you’re more concerned with capturing nearly all of the true things (for example, all true metabolism-related genes) in your set of positives, then a lower significance level is more appropriate. This is the essence of statistical significance.
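A small simulation makes that balance concrete. In the sketch below (gene counts and effect sizes are invented), each “gene” is tested at a 95% significance level, and the false positives among truly null genes and the false negatives among truly active genes are counted:

```r
set.seed(1)
is_active <- c(rep(TRUE, 200), rep(FALSE, 1800))   # 200 of 2000 genes truly differ

# Hypothetical data: 4 replicate measurements per gene; active genes shifted by 1.5.
p_values <- sapply(is_active, function(active) {
  x <- rnorm(4, mean = ifelse(active, 1.5, 0), sd = 1)
  t.test(x, mu = 0)$p.value
})

called <- p_values < 0.05                          # 95% significance level
cat("False positives:", sum(called & !is_active), "of", sum(!is_active), "null genes\n")
cat("False negatives:", sum(!called & is_active), "of", sum(is_active), "active genes\n")
```

Tightening the cutoff from 0.05 to 0.01 in this sketch reduces the false positives but misses more of the truly active genes, which is exactly the trade-off described above.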

10.3.2. Practical usefulness

What I’m calling practical usefulness is very much like statistical significance as I’ve described it, but with more of a focus on what you intend to do with the results instead of a purely statistical notion of confidence. What you plan to do next with the results should play a large role in how significant you need them to be.

In the example of metabolism-related genes, a possible next step would be to take the set of significant genes and to run a specific experiment on each of those genes in order to verify at a much higher level of precision whether they’re truly involved with metabolism. If these experiments are costly in terms of time and/or money, then you’ll probably want to use a high level of significance in your analyses so that you perform relatively few of these follow-up experiments, only on the genes that you’re certain about.

In some cases, possibly even this example, the specific level of significance is almost irrelevant because you know that you want to take some fixed number of the most significant results. You could, for example, take the 10 most significant genes and perform the follow-up experiment on them. It might not matter whether you make the cutoff 99% or 99.9% if you’re not going to take more than 10 anyway. A significance level of less than 95%, though, is probably not advisable, and if there aren’t 10 genes meeting that level, it might be best to focus only on the ones that have statistical evidence in their favor and meet at least a minimal level of significance.
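A sketch of that selection rule in R, with invented gene names and p-values: keep only results meeting a minimal 95% level, then take at most the top 10.

```r
set.seed(5)
# Hypothetical results table: gene identifiers and p-values from the analysis.
results <- data.frame(
  gene    = paste0("gene_", 1:50),
  p_value = runif(50, min = 0.0001, max = 0.2)
)

# Keep only genes meeting a minimal 95% level, then take at most the top 10.
candidates <- results[results$p_value < 0.05, ]
candidates <- candidates[order(candidates$p_value), ]
follow_up  <- head(candidates, 10)
follow_up
```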

You can begin to decide on a significance level by first asking yourself the question, what am I going to do next with the significant results? Considering the specific things you intend to do next with the set of significant results, answer these questions:

  • How many significant results do you want or need?
  • How many significant results can you handle?
  • What is your tolerance for false negatives?
  • What is your tolerance for false positives?

By answering these questions and combining the answers with your knowledge of statistical significance, it should be possible to select a set of significant results from your analyses that will serve the purposes of your project.

10.3.3. Reevaluating your original accuracy and significance goals

As part of your plan for the project, you probably included a goal of achieving some accuracy or significance in the results of your statistical analyses. Meeting these goals would be considered a success for the project. In light of the previous section on statistical significance and practical usefulness, it’s worth reconsidering the desired level of significance of the results at this stage of the process, for a few reasons that I outline here.

You have more information now

You didn’t have as much information when you began the project as you do now. The desired accuracy or significance may have been dictated to you by the customer, or you may have chosen it yourself. But in either case, now that you’re getting to the end of the project and you have some real results, you’re arriving at a position from which you can better determine whether that level of significance is the most effective.

Now that you have some results, you can ask yourself these questions:

  • If I give a sample of results to the customer, do they seem pleased or excited?
  • Do these results answer an important question that was posed at the beginning of the project?
  • Could I—or the customer—act on these results?

If you can answer yes to these questions, then you’re in good shape; maybe you don’t need to adjust your significance levels. If you can’t answer yes, it would be helpful to reconsider your thresholds and other aspects of how you select important results.

The number of results might not be what you want

No matter how you chose your significance levels previously, you might end up with more significant results than you can handle or too few results for them to be useful. The solution to too many results is to raise the threshold for significance, and for too few results, the solution might be to lower the threshold. But you should be careful of a few things.

Raising or lowering the threshold because you’d like more or fewer results can be a good idea as long as this doesn’t violate any assumptions or goals of your project. For example, if you’re working on a project involving the classification of documents as either relevant or not relevant to a legal case, it’s important that you have few false negatives. Classifying an important document as not relevant can be a big problem. If you happen to realize that your classification algorithm missed an important document or two, lowering the significance threshold of the algorithm to include these documents will indeed lower the number of false negatives. But it will also presumably increase the number of false positives, which in turn will require more time to subsequently review all the positive results manually. Here, decreasing the level of significance directly increases the amount of manual work that needs to be performed later, which can be costly in legal contexts. Rather than merely change the threshold, it might be better to go back to the algorithm and model and see if you can make them better for the task at hand.
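The following sketch, using an invented classifier score and invented relevance labels, shows the mechanics of that trade-off: lowering the decision threshold recovers missed documents but inflates the pile that has to be reviewed manually.

```r
set.seed(2)
n <- 1000
relevant <- runif(n) < 0.1                               # ~10% of documents truly relevant
# Hypothetical classifier scores: relevant documents tend to score higher.
score <- ifelse(relevant, rnorm(n, 0.7, 0.2), rnorm(n, 0.3, 0.2))

report <- function(threshold) {
  flagged <- score >= threshold
  cat(sprintf("threshold %.2f: %d flagged, %d false negatives, %d false positives\n",
              threshold, sum(flagged), sum(relevant & !flagged), sum(flagged & !relevant)))
}
report(0.60)   # stricter: fewer documents to review, more relevant ones missed
report(0.45)   # looser: fewer misses, many more documents for manual review
```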

The main point is that it can be a good idea to increase or decrease significance thresholds in order to decrease or increase arbitrarily the number of significant results, but only if it doesn’t adversely affect the project’s other assumptions and goals. It’s good to think through all possible implications of a threshold change so that you can avoid problems.

The results might not be quite what you expected

Sometimes, despite all your best intentions, and after considering all the uncertainties, you end up with results that don’t seem like what you’d expect. You might have a set of significant results that, generally speaking, don’t seem to be what you’re looking for. This is obviously a problem.

One potential solution is to raise the significance threshold and make sure the most significant results, the very top, do indeed meet your expectations. If they look good, then you can possibly use the new threshold as long as the change doesn’t adversely affect anything else in the project. If they don’t look good, you likely have a bigger problem than significance. You may have to go back to the statistical model and try to diagnose the problem.

Generally speaking, the higher you set the significance threshold, the more closely your set of results should match what you expect. For example, documents should be more relevant in the legal example, genes should be more obviously related to metabolism in the genomics example, and so on. If this isn’t the case, it would be best to investigate the cause.

10.4. Case study: protocols for measurement of gene activity

I’ll illustrate the concepts from this and the previous few chapters by giving an in-depth explanation of a project from early in my career. With a master’s degree and two years of work experience, I decided to go back to school to get a PhD. I soon joined a research group in Vienna whose focus was the development of effective statistical methods for applications in bioinformatics.

I hadn’t worked in bioinformatics before, but I’d long had an interest in the primary language of biology—DNA sequences—and so I was looking forward to the challenge. I would have to learn about bioinformatics and relevant biology as well as some software and programming tools because my prior programming experience consisted mainly of MATLAB and a little C. But I had the support of two advisers and a small group of other researchers working in my lab, each with varied experience in bioinformatics, statistics, and programming.

Soon after getting settled at my new desk, one of my advisers came to me with a prospective first project and asked me to have a look. The general idea was to compare laboratory protocols for microarrays in a rigorously statistical manner. My adviser had already considered the experimental setup as well as a possible mathematical model that could be applied to the resulting data, so the first step had already been taken, which was probably good for me as a beginner. As the outcome of the project, we wanted to know which lab protocol was the best, and we intended to publish the result in a scientific journal not only for the laboratory implications but also for the statistical ones.

As I worked on this project, I learned a lot about bioinformatics, mathematics, statistics, and software, all of which, when put together, fit squarely in the field we now know as data science. In the rest of this section, I describe this project in terms of concepts from preceding chapters in this book, with the hope that this case study illuminates how they might work in practice.

10.4.1. The project

The goal of the project was to evaluate and compare the reliability and accuracy of several laboratory protocols for measuring gene expression. Each protocol is a chemical process by which RNA extracted from biological samples can be prepared for application to a microarray. Microarray technologies, which in the last decade have largely been replaced by high-throughput genetic sequencing, can measure the expression level (or activity level) of tens of thousands of genes in an RNA sample. The protocols that prepared the RNA for microarrays varied in their complexity as well as the required amount of RNA input needed for each microarray. The amount of RNA input needed for the protocols ranged from about a microgram down to a few nanograms, according to the developers of the protocols, which were often private companies that probably had reasons to mislead researchers about the reliability of their protocols in order to sell more of the required kits. Nonetheless, it would be beneficial to be able to use less RNA per microarray, because maintaining and extracting biological samples can be expensive. We wanted to hold a head-to-head competition between the protocols as they’d be used in the lab to see if any of the promises held up and in order to get the most out of our limited lab budgets.

We had four protocols in total, and one of them was well known to be quite reliable, so it was the closest thing to a gold standard that we had. For each protocol, we would run four microarrays whose putative experimental goal was to compare gene expression between male and female fruit flies—Drosophila melanogaster, a common model organism that’s better understood than most organisms. There are some large differences between expression in male and female flies in some genes, particularly those known to be associated with sexual development and function, and in other genes there shouldn’t be much of a difference. Each set of four microarrays would be run in a dye-swap configuration, which means that male RNA is labeled with the green fluorescent dye on two arrays and with the red fluorescent dye on the other two arrays; female RNA is labeled with the opposite color on each array. In the end, each microarray, for each of about 10,000 genes, gives a measurement of the ratio of gene expression in males to that in females.

Because we were using one of the protocols as a sort of gold standard, we ran two sets of four microarrays using it. Beyond having two sets of a reliable protocol available for comparison with other protocols, we could compare the two sets with each other to get an idea of the reliability of this protocol. If the two sets gave widely differing results, that would be evidence that even this gold standard protocol wasn’t reliable.

In addition to all of that, for two of the four protocols we ran experiments using less RNA than the protocol usually requires, so that we could compare these protocols with other protocols that typically require less RNA, in a sort of fair-fight scenario. One set of four microarrays for each of the four protocols (16 arrays), plus an extra set for the gold standard protocol (4 arrays) and two sets for the low-RNA versions (8 arrays), gives 28 microarrays in total. This was the entire data set we would be using.

10.4.2. What I knew

Upon starting this project, I knew mathematics and statistics—at least to the master’s degree level—and a fair amount of MATLAB. I knew the basics of DNA and RNA transcription and the general principles about how genes are translated and expressed within cells. In a relatively short time, I also learned the basics regarding the project description, including the foundations of how microarrays work and how the experiments are configured.

10.4.3. What I needed to learn

I had a lot to learn about bioinformatics, but strangely that wasn’t the bulk of what I had to learn. It seemed like I learned the relevant knowledge about genes and microarrays in a relatively short time, but there were specific aspects of the mathematics and statistics that I hadn’t seen before, and I also didn’t know R, which was the preferred programming language of the lab because of its strengths specific to bioinformatics.

On the mathematics side, although I was very familiar with probability and statistics, I hadn’t ever formulated and applied a mathematical model to data. I was a Bayesian-leaning mathematician with a fully Bayesian adviser, and so I needed to commit to learning all the implications of formulating and applying a Bayesian model.

On the programming side, I was a complete beginner with R, but on the advice of my advisers, that’s the language I would use. The R libraries for loading microarray data are very good, and the statistical libraries are comprehensive as well, so I’d need to learn a lot of R in order to use it on this project.

10.4.4. The resources

Beyond my two advisers, I had several colleagues with varied experience in bioinformatics, mathematics, and programming in R. I was definitely in a good environment in which to learn R. When I encountered a problem or a weird error, I had only to ask aloud, “Has anyone had this problem before?” and someone usually had a helpful comment or even a solution to my problem. My colleagues were certainly helpful. I tried to pay back my knowledge debt by telling the rest of the group whenever I discovered a programming trick that I thought they might not have seen before.

Beyond human resources, we also had some technological ones. Most important, my group had a lab capable of performing microarray experiments from beginning to end. Though microarrays aren’t cheap, if it seemed prudent we could create any amount of data that we wanted for the analysis.

On the computational side, I had access to two university-owned servers, each of which had many computing cores and therefore could compute results several times faster than I could on my local machine. I kept this in mind while writing my code and made sure that everything I did could run in parallel on multiple cores.
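A minimal sketch of the kind of parallelism I mean, using R’s built-in parallel package; the per-gene function here is a placeholder for the real computation, and mclapply spreads the work across the cores of a single server.

```r
library(parallel)

# Hypothetical per-gene computation standing in for the real model fitting.
analyze_gene <- function(g) {
  x <- rnorm(1000, mean = g %% 5)
  c(gene = g, estimate = mean(x))
}

n_cores <- max(1, detectCores() - 1, na.rm = TRUE)   # leave a core free
# mclapply relies on forking, so multiple cores work on Unix-alikes (not Windows).
results <- mclapply(1:5000, analyze_gene, mc.cores = n_cores)
results_mat <- do.call(rbind, results)
head(results_mat)
```

Writing the analysis as a function of a single gene is what makes this kind of parallelism nearly free.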

10.4.5. The statistical model

Quite a few variables might come into play in this project. First and foremost among them was true gene expression. The main goal of the project was to evaluate how closely the measurements for each protocol matched the true gene expression level. We would want a variable in the model representing the true gene expression. We didn’t have any perfect measurements of this—the best we had was the gold standard protocol that we knew was less than perfect—so this true gene expression variable would have to be a latent one. In addition to the true gene expression, we would need variables representing the measurements that the protocols produced. These are obviously measured quantities because we had the data for them, and there might be an associated error term because measurements on a genetic level are often noisy.

In addition to the true gene expression values and their various measurements, several types of variance were involved. Usually, we’d be looking at RNA samples from various individual flies, and there would be a variance between individuals depending on their own genetic composition. But in this case we mixed all the samples of female flies together, and likewise for the males, so there would be no biological variance among individuals. Microarrays notoriously don’t produce the same results every time you run one with the same biological sample. That’s why we were running four microarrays per protocol: to get an estimate of the technical variance resulting from each of the protocols. Lower technical variance is generally better, because it means that multiple measurements of the same thing will give results that are close to each other. On the other hand, lower technical variance isn’t always better; a protocol that totally fails and always reports a measurement of zero for every gene will have perfect technical variance of zero but would be completely useless. We would want a notion of technical variance somewhere in our model as well.

The model formulation that we ultimately settled on assumed that the measurements reported by the microarrays for each protocol were normally distributed random variables based on the true gene expression values. Specifically, for each gene g, the measurement x_{n,g} indicates the gene expression value reported by microarray n from the gold standard protocol, and y_{m,g} represents the expression value reported by microarray m from another protocol. The formulation is as follows:

\[ x_{n,g} \sim \mathcal{N}\!\left( \mu_g,\ 1/\lambda \right) \]

\[ y_{m,g} \sim \mathcal{N}\!\left( \mu_g + \beta,\ 1/(\alpha\lambda) \right) \]

Here μ_g is the true gene expression value of gene g, and λ is the technical precision (inverse variance) of the gold standard protocol. The variables β and α capture inherent differences between a protocol and the gold standard. β allows for a possible systematic shift in expression values; if one protocol tends to report lower or higher values across all genes, we wanted to allow for that (and not penalize it), because it doesn’t directly imply that the shifted numbers are wrong. Lastly, α scales the protocol’s technical precision relative to the gold standard’s: a higher α means that protocol’s technical variance is lower.

I’ve mentioned that I like to consider every variable a random variable until I’ve convinced myself that I’m allowed to fix the values in place. Therefore, parameters in the aforementioned probability distributions need to have their distributions specified as well, as in the following:

\[ \mu_g \sim \mathcal{N}\!\left( 0,\ 1/(\gamma\lambda) \right) \]

\[ \beta \sim \mathcal{N}\!\left( 0,\ 1/(\nu\lambda) \right) \]

\[ \lambda \sim \mathrm{Gamma}(\varphi, \kappa) \]

A gamma distribution is the standard conjugate choice for the precision (inverse variance) of a normal distribution, which makes it useful and convenient for variance parameters like λ. The rest of the yet-undiscussed model parameters appearing in the equations—γ, ν, φ, and κ—I didn’t treat as random variables, but I was careful about it.

Each of these model parameters is at least two steps away from the data—by that, I mean that none of them appears directly in one of the equations describing the observed data, x_{n,g} or y_{m,g}. Because they’re removed from the data in this way, such parameters are often called hyper-parameters. In addition, these parameters can be used in a non-informative fashion—meaning that their values can be chosen so as not to exert too much influence on the rest of the model. I attempted to make the hyper-parameters almost irrelevant to the rest of the model, but I checked to make sure this was the case. After finding the optimal parameter values (see the section on model fitting in this chapter), I checked that each hyper-parameter really was almost irrelevant to the model and the results by performing a sort of sensitivity analysis: I changed the parameter values dramatically and looked to see whether the results changed at all. In this case, even when I multiplied the hyper-parameters by 10 or 1,000, the conclusions didn’t change in a significant way.
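To make the generative structure concrete, here’s a small sketch that simulates data from the model as written above. The hyper-parameter values are illustrative only, and mapping φ to the gamma shape and κ to its rate is my assumption; rerunning the sketch with deliberately extreme hyper-parameter values is, in spirit, the sensitivity check just described.

```r
set.seed(3)
# Sketch: simulate gene expression measurements from the hierarchical model.
# Hyper-parameter values and the shape/rate mapping for Gamma(phi, kappa) are assumptions.
simulate_protocol_data <- function(n_genes = 1000, n_arrays = 4,
                                   gamma = 1, nu = 1, phi = 1, kappa = 1,
                                   alpha = 0.5) {
  lambda <- rgamma(1, shape = phi, rate = kappa)           # gold standard precision
  beta   <- rnorm(1, 0, sqrt(1 / (nu * lambda)))           # systematic shift of the other protocol
  mu     <- rnorm(n_genes, 0, sqrt(1 / (gamma * lambda)))  # true expression per gene

  x <- matrix(rnorm(n_genes * n_arrays, mean = mu, sd = sqrt(1 / lambda)),
              nrow = n_genes)                              # gold standard measurements
  y <- matrix(rnorm(n_genes * n_arrays, mean = mu + beta, sd = sqrt(1 / (alpha * lambda))),
              nrow = n_genes)                              # other protocol's measurements
  list(x = x, y = y, mu = mu, beta = beta, lambda = lambda)
}

sim <- simulate_protocol_data()
# alpha < 1 means the other protocol is noisier than the gold standard:
c(var_x = mean(apply(sim$x, 1, var)), var_y = mean(apply(sim$y, 1, var)))
```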

I’ve described a fairly complex model with several paragraphs and equations, but I’m a visual person, so I like to make diagrams of models. A good visual representation of a mathematical or statistical model comes in the form of a directed acyclic graph (DAG). Figure 10.2 shows the DAG for the model of multi-protocol measurement of gene expression. In the DAG, you can see all of the variables and parameters that I’ve discussed, each inside its own circle. The gray shaded circles are observed variables, whereas the unshaded circles are latent variables. An arrow from one variable to another indicates that the origin/first variable appears as a parameter in the distribution of the target/second variable. The rectangles, or sheets, in the background show that there are multiple genes g and microarray replicates n and m, each of which possesses a different instance of each of the variables contained in the sheet. For example, for each gene g, there is a different true gene expression value μ_g as well as a set of gold standard measurements x_{n,g} and a set of measurements by another protocol y_{m,g}. Such a visual representation helps me keep all the variables straight.

Figure 10.2. A directed acyclic graph (DAG) representing a model of the comparison of gene expression measurements based on different laboratory techniques

10.4.6. The software

I was learning and using the R language while working on this project. R has a bunch of great bioinformatics packages, but I used only the limma package, which is handy for loading and manipulating microarray data, among other things. Being a beginner with R, I decided to use it only to manipulate the data into a familiar format: a tab-separated file containing gene expression values.
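A hedged sketch of that step, assuming two-color arrays read with limma’s standard functions; the target file name, the image-analysis source, and the annotation column are illustrative and depend on the scanner software actually used.

```r
# Sketch only: reading two-color microarray data with limma and writing a
# tab-separated expression table. File names, the 'source' format, and the
# gene annotation column are illustrative assumptions.
library(limma)

targets <- readTargets("targets.txt")                # sample sheet listing the array files
RG <- read.maimages(targets, source = "genepix")     # raw red/green intensities
MA <- normalizeWithinArrays(RG, method = "loess")    # log-ratios (M) and averages (A)

expr <- data.frame(gene = MA$genes$Name, MA$M)       # one column of log-ratios per array
write.table(expr, "expression_values.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE)
```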

After manipulating and formatting the raw data in R, I wrote the code that fit the statistical model in MATLAB, a language that was more familiar to me and one that is very good at performing operations on large matrices, an important computational aspect of my code.

I had R code that processed and reformatted the microarray data into a familiar format, and then I had a considerable amount of MATLAB code that loaded the processed data and applied the statistical model to the data. At this point in my career, this was the most complex piece of software I had written.

10.4.7. The plan

Academic timelines are usually rather slow. There was no real deadline for this project, except for an upcoming conference for which I might apply to give a talk. The conference application deadline was a few months away, so I had a good amount of time to make sure everything was in order before submitting.

The main goal of the project, as with most academic projects, was to have a paper accepted into a good scientific journal. In order to have a paper accepted, the research must be original, meaning it contains something no one has done before, and it must be rigorous, meaning the paper shows that the author didn’t make any mistakes or fallacies.

Therefore, my first primary goal was to make sure that the main scientific results were rock solid. The next goal was to compile additional statistics and supporting evidence that the methods used in the project were consistent with the common knowledge and methods of bioinformatics. Finally, I would write a compelling scientific paper based on the research and submit it to the conference and/or a good scientific journal.

My plan, therefore, was a relatively simple one. Given my level of knowledge at the time, the plan was approximately the following:

  1. Learn R and use it to manipulate the data into a familiar format.
  2. Write the statistical methods in MATLAB and apply them to the data using one of the university’s high-powered computing servers.
  3. Compile a set of known statistics measuring microarray data quality and compare them to the results from the main statistical model. Reconcile any discrepancies as necessary.
  4. Write a compelling paper.
  5. Show the paper to my adviser, and go around and around editing and improving it. Some iterations may require additional analysis.
  6. Once the paper is good enough, submit it to the conference and/or a journal.
  7. If rejected, edit the paper based on feedback from journal reviewers and submit again.

This was a fairly straightforward plan, without too many competing interests or potential roadblocks. The most time was spent on developing the sophisticated statistical model, building the software, and iteratively checking and improving various aspects of the analysis.

One problem that we did run into during the course of this project was that the data quality seemed very poor for one of the microarray protocols. After weeks of investigation, our laboratory researchers figured out that one of the chemical reagents used in the protocol expired much sooner than expected. It became ineffective after only a few months, and we hadn’t realized that—probably because we weren’t using the original packaging and were sharing the reagents with nearby labs. Once we figured that out, we ordered some new reagent and reran the affected microarray experiments, with better results.

Other than the reagent snafu, there were no major issues, and everything ran according to plan. The biggest uncertainty at the beginning was in how good the results would be. Once I had calculated them and compared them to the known statistics for microarray data quality, that obstacle had largely been overcome. On the other hand, there was considerable discussion between my advisers and me about what exactly constitutes good results and what additional work, if any, would improve them.

10.4.8. The results

The purpose of this project was to compare objectively the fidelity of several microarray protocols in the laboratory and to decide which protocols and which amounts of RNA are required to produce reliable results. The main results were from the statistical model described earlier, but as in most bioinformatic analyses, no one trusts a novel statistical model unless one can show that it doesn’t contradict known applicable models. I calculated four other statistics that measured different aspects of the fidelity we intended to capture with the main statistical model. If these other statistics generally supported the main statistical model, other researchers might be convinced that the model is a good one.

The results table excerpted from a draft of the scientific paper can be seen in figure 10.3. The descriptions of the other statistics are given in the original, clipped caption, but the specifics aren’t important here. What is important is that in the combination of the four other statistics—technical variance (TV), correlation coefficient (CC), gene list overlap (GLO), and the number of significant genes (Sig. Genes)—there was ample evidence that the log marginal likelihood from our statistical model (log ML) was a reliable measure of protocol fidelity. These supplementary statistics and analyses functioned like descriptive statistics—they’re much closer to the data—and provided easy-to-interpret results that are hard to doubt. And because they generally supported the results from the statistical model, I was confident that others would see the value in a statistical model that considers all of these valuable aspects of fidelity at once.

Figure 10.3. The main table of results for the microarray protocol comparison project, as clipped from a draft submitted to a scientific journal
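The simpler of those supplementary statistics are easy to sketch. Given a genes-by-replicates matrix of log-ratios for one protocol (simulated here with made-up noise levels), technical variance and the correlation between replicate arrays take only a few lines of R:

```r
# Sketch with simulated data: 'arrays' is a genes-by-replicates matrix of
# log-ratios for one protocol.
set.seed(4)
true_ratio <- rnorm(5000)                                   # per-gene male/female log-ratio
arrays <- sapply(1:4, function(i) true_ratio + rnorm(5000, sd = 0.4))

tv <- mean(apply(arrays, 1, var))        # technical variance: spread across replicate arrays
cc_mat <- cor(arrays)                    # pairwise correlations between replicate arrays
cc <- mean(cc_mat[lower.tri(cc_mat)])    # average correlation between replicates
c(technical_variance = tv, correlation = cc)
```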

10.4.9. Submitting for publication and feedback

In scientific research, as in data science in industry, what you know to be true because you’ve proven it through rigorous research may not be accepted by the community at large. It takes most people some time to accept new knowledge into their canon, and so it’s rarely surprising to experience some resistance from people whom you think would know better.

A few weeks after submitting a version of my research paper to the bioinformatics conference, I received an email informing me that it was not accepted to be the topic of a talk at the conference. I was disappointed—but not too disappointed—because the acceptance rate at this particular conference was known to be well under 50%, and a first-year PhD student is probably at a distinct disadvantage.

The rejection letter came with minimal feedback from scientists who had read my paper and had judged its worthiness. From what I could tell, no one had questioned the rigor of the paper, but they thought it was boring. Exciting science definitely gets more attention and press, but someone does need to do the boring stuff, which I fully acknowledge I was doing.

After the rejection, I went back to the later steps in my plan, focusing on how to make the paper more compelling (and exciting, if possible) before submitting the paper again to a scientific journal.

10.4.10. How it ended

Not every data science project ends well. The initial rejection by the conference was the beginning of an extended phase of redefining the exact goals of the paper that we would resubmit to a scientific journal. From shortly before the initial submission until the end of the project (well over a year later), some of the goals of the project and paper were continually moving. In that way, this academic experience was a lot like my later experiences at software companies. In both cases, goals rarely stayed in one place throughout a project. In software and data science, business leadership and customers often modify the goals for business reasons. In my microarray protocol project, the goals were changing because of our impression of what good results might mean to potential paper reviewers.

Because the end goal of the project was to get a paper published in a scientific journal, we needed to be aware of what the journal’s reviewers might say. Each step of the way, we looked for holes in our own arguments and gaps in the evidence that our research provided. In addition to that, we needed to take into consideration feedback from other researchers who weren’t involved in our project, because these researchers are peers of those who would eventually become our reviewers.

It can be frustrating to have goals that move constantly. Thankfully, there were no large goal changes, but there certainly were dozens of small ones. Because of the goal changes, progress through the project was riddled with small plan changes, and I spent several months juggling and prioritizing the changes that might have the most significant impact on our chances of acceptance into a good journal.

Ultimately, no paper based on this research was ever published. The project leadership was quite fickle and couldn’t settle on a single set of goals, and they were never satisfied with the state of the research and paper no matter how many modifications they or I made. I was also rather inexperienced in working with a research team and publishing an academic paper, and no matter how much I pushed for it, without the approval of all authors, a paper generally can’t be published at all.

This is certainly not a tale of happily ever after, but I think the project as a whole provides examples of both good and bad things that can happen during a data science project. Things were going rather well until the later stages of the project—I think the analysis and results of the project were good—but I was forced to make some tough decisions when put in a difficult spot, and the plan was modified several times before being thrown out the window. Data science is not, as the press sometimes seems to believe, always sunshine and rainbows, but it can help solve many problems. Don’t let the possibility of failure prevent you from doing good work, but be aware of signs indicating that the plan and the project might be running off track; catching it early can give you the opportunity to correct the problems.

Exercises

Continuing with the Filthy Money Forecasting personal finance app scenario first described in chapter 2, and relating to previous chapters’ exercises, try these:

1.

List three people (by role or expertise) at FMI with whom you will probably be talking the most while executing your project plan and briefly state why you will probably talk to them so much.

2.

Suppose that the product designer has spoken with the management team, and they all agree that your statistical application must generate a forecast for all user accounts, including ones with extremely sparse data. Priorities have shifted from making sure the forecasts are good to making sure that every forecast exists. What would you do to address that?

Summary

  • A project plan can unfold in a number of ways; maintaining an awareness of outcomes as they occur can mitigate risk and problems.
  • If you’re a software engineer, be careful with statistics.
  • If you’re a statistician, be careful with software.
  • If you’re a member of a team, do your part to make a plan and track its progress.
  • Modifying a plan in progress is an option when new, external information becomes available, but make modifications deliberately and with care.
  • Good project results are good because they’re useful in some way, and statistical significance might be a part of that.