© Paul Rissen 2019
P. Rissen, Experiment-Driven Product Development, https://doi.org/10.1007/978-1-4842-5528-5_7

7. Scale and Method

Paul Rissen
Middlesex, UK

By now, you should be almost ready to get started running an experiment. In the previous chapter, we covered the first part of the Design Phase—formulating a hypothesis where applicable, identifying the measures of evidence by which we’ll answer the question, and determining conditions for being part of the experiment.

In this chapter, we’ll examine how we can determine the amount of evidence we should collect, in order to discover an answer that will be of any use to us. This what is meant by the scale of the experiment. Once this is settled, we can decide which method to use—what is the simplest, useful thing we could do, to gather the required evidence?

Determining scale

Scale—how much evidence we need to collect as part of an experiment—plays a crucial role in determining how useful the answer we emerge with will be.

With any experiment, there is no predetermined solution for how much evidence you need. It’s up to you to decide, but there are two pivotal factors which can help you make the decision:
  • How certain do you need the answer to be?

  • How precise do you need the answer to be?

Certainty here refers to how much you care about the possibility of error in any given experiment. This depends on how important you feel getting a particular answer right will be, which in turn depends on how important the decision you’re likely to make as a result will be.

Precision here refers to how valuable it is to have a really granular answer. Depending on the context, moving the needle on a particular measure by 0.2%, being able to spot the difference between two very particular demographic groups of users, or capturing as many varieties of response as possible will have different amounts of value to you, the team, the stakeholders, and the business.

How long should an experiment run for?

A question that is often asked when starting to design an experiment is—how long do we need to run the experiment for? The short answer is—for as long as it takes, in order to get a useful answer.

The long answer is—it depends. The length of an experiment should be determined by how long you're willing for it to run, and that should ultimately depend on the answers to the preceding questions of certainty and precision.

When designing an experiment, you should consider how certain, and how precise, you need the answer to be. Doing this will allow you to determine how much evidence you need to collect. If it looks like you’ll need an unreasonable amount of evidence in order to guarantee the certainty and precision you desire, you need to make a choice.

Both options involve a compromise on your standards of certainty and precision. One route to take would be to lower your standards in order to come up with an amount of evidence that seems reasonable and then determine the time needed to gather this new amount of evidence. The other is to set a fixed time limit for the experiment and then determine how certain and precise this will allow you to be.

In either case, compromising on your standards of certainty and precision will ensure you get some kind of answer in a reasonable amount of time, but it will need to be heavily caveated. By compromising on these standards, you leave yourself open to the possibility that the answer you receive is unreliable and thus cannot be used to make a product decision.

Either way, making a conscious decision to design the scale of the experiment will have wide-ranging consequences for the reliability, and thus usefulness, of the answer you end up with.

Remember!

Take the time to decide how certain and how precise you need the answer from an experiment to be.

How certain do you need to be?

When it comes to belief-led questions, the experiments where we are testing our hypotheses and seeking strong enough evidence to reject our null hypothesis, there are generally four possible outcomes.
  1. Evidence for our alternative hypothesis exists in reality and is detected.

  2. Evidence exists in reality, but you fail to detect it.

  3. Evidence doesn't exist in reality, and so you cannot detect it.

  4. Evidence doesn't exist in reality, but you detect something that leads you to think it exists.

Outcome number 1 is the happy path—you ran a well-designed experiment, and the thing you hypothesized is true. Success! Well done! Give yourself a pat on the back.

Outcome number 3 is a bittersweet one. You ran a well-designed experiment, but the thing you hypothesized wasn’t strong enough to dislodge the null hypothesis. Yes, it’s disappointing. You’ll most likely be sad, and/or go through the five stages of grief, but eventually you’ll move on, having learned something valuable, and crucially, avoided wasting time going down a rabbit hole.

It’s outcomes 2 and 4 that you need to watch out for. These are the risks involved when you don’t take the time to design an experiment properly.

With outcome 2, the danger is that after the experiment, you continue to stumble around in the darkness, assuming that the difference or change doesn’t exist, when really you just failed to find it. This is known in statistics as a Type II error, or a false negative. We can use a concept known as statistical power to avoid this.

With outcome 4, you’re equally misled—you carry on thinking that you were right, when really you’re (possibly dangerously) wrong. This is known in statistics as a Type I error, or a false positive—mistakenly assuming that something exists, when in fact it doesn’t. To avoid making Type I errors, we can make use of a concept called the significance level.

Note

Although these outcomes, errors, and specific statistical techniques seem particular to dealing with belief-led questions and testing hypotheses, similar phenomena exist when dealing with exploratory questions. Be aware of the ways in which poor experiment design can lead you into making the wrong decisions!

Let’s take a closer look at the ways in which we can avoid false positives and false negatives. Determining the statistical power and setting a significance level are the two main ways that we can bring our required degree of certainty to an experiment in which we test a hypothesis. Therefore, these play a crucial role in calculating the necessary sample size for an experiment.

Avoid missing crucial evidence using statistical power

Statistical power is the probability that if evidence to support your alternative hypothesis exists, you’ll find it. The higher the statistical power of an experiment, the lower the chances are that you’ll stick with the null hypothesis when in fact there was enough evidence to reject it.

Statistical power is usually expressed as a percentage, and generally people choose a level between 80% and 95%.1 With a statistical power of 80%, there's a 20% chance that you'll miss evidence which would have supported your alternative hypothesis. Turn the power up to 95%, and that chance drops to 5%.

The price you pay for selecting a higher statistical power is the amount of evidence you’ll need to collect. With a higher power level, you’re choosing a higher degree of certainty, and thus you’ll likely need to wait longer to gather enough evidence to be certain.

This is why selecting an appropriate statistical power for your experiment is a choice you have to make. You can think of it like a dial that you can turn up or down, depending on how certain of finding evidence (presuming it exists) you wish to be.

If you’re running an experiment to test something really, really crucial—something that potentially could make millions, or conversely, seriously damage your brand—you probably want to be as certain as you can be, before making any kind of recommendation to stakeholders, deploying a feature, or basing design decisions on those results.

In contrast, there may be times when you're working on an experiment which, although valuable, isn't a big deal. With a lower statistical power, you're lowering the standard of evidence needed to support the alternative hypothesis and reducing the amount of evidence that needs to be gathered overall, so you're more likely to be able to run the experiment quickly. However, this comes at a price—turn your statistical power down to, say, 60%, and there's a 40% chance that evidence which could have been found will be missed.

Remember!

The higher the level of statistical power you choose, the more certain you can be of detecting evidence to support your alternative hypothesis, where it exists. The price you pay for increased certainty—a larger amount of total evidence needed, so a longer amount of time to run the experiment.

Avoid being misled by chance with a significance level

Significance is the probability that what you detect from the evidence gathered in an experiment is down to chance. That is to say, you see an effect where in reality, it doesn’t actually exist. It’s always a possibility that you’ll detect some evidence to support your alternative hypothesis, but sometimes this could be a freak, natural occurrence, rather than actually being a significant event.

The lower the significance level of an experiment, the less chance of detecting evidence supporting your alternative hypothesis which has occurred due to chance. In other words, the more stringent your standard of certainty.

Typically, people tend to set a significance level of 5%. This means that there's still a 5% chance of being misled by a "false positive," and it doesn't guarantee you'll find evidence. What you're doing with a significance level is setting a certainty bar, against which you'll place the evidence you gather from the experiment.

In Chapter 8, we’ll look at the concept of a p value, which tells us how likely any actual evidence gathered in support of our alternative hypothesis is due to chance. If the p value is less than the significance level we set, prior to running the experiment, then we can reject the null hypothesis, knowing that this evidence occurred below the threshold of chance that we set as acceptable and thus can be trusted.

As with the statistical power, you can choose a higher degree of certainty, in this instance by lowering the significance level. For instance, by lowering the significance level from 5% to 3%, we’re saying that we want to be even more sure that the evidence supporting an alternative hypothesis gathered didn’t occur due to chance.

With a 3% significance level, you’re saying that you’ll accept no more than a 3% probability that the evidence supporting an alternative hypothesis occurred by chance. This, as you can expect, has an impact on the amount of evidence we need to gather in total.

Remember!

The lower the significance level you choose, the more certain you can be that any evidence you detect in support of your alternative hypothesis did not occur due to chance. Again, the price you pay for this increased certainty—a larger amount of total evidence needed, so a longer amount of time to run the experiment.
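
To get a feel for how these two dials translate into evidence, here is a small Python sketch using the standard normal-approximation sample size formula for comparing two proportions. The 3% baseline and the 1 percentage point difference are illustrative assumptions of mine, anticipating the example later in this chapter.

```python
# How the certainty "dials" affect the evidence required, using the standard
# normal-approximation sample size formula for comparing two proportions.
# The 3% baseline and the rise to 4% are illustrative numbers only.
from statistics import NormalDist

def n_per_condition(p1: float, p2: float, alpha: float, power: float) -> int:
    """Approximate sample size per condition for a two-sided test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)          # critical value for the significance level
    z_power = z(power)                  # critical value for the statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return round((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

for alpha, power in [(0.05, 0.80), (0.05, 0.95), (0.03, 0.80), (0.01, 0.95)]:
    n = n_per_condition(0.03, 0.04, alpha, power)
    print(f"significance {alpha:.0%}, power {power:.0%}: ~{n:,} per condition")
```

Run under these assumptions, the evidence required per condition more than doubles between the loosest and the strictest settings shown; that is the price of extra certainty in concrete terms.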

Beware the danger of stopping an experiment when it “reaches significance”

When I first joined the team that practiced a form of experiment-driven product development, it was common practice to be constantly watching the results of an experiment as they came in, and run it “until it had reached significance,” rather than predetermining, and allowing, a minimum amount of time to pass. This was a huge error on our part.

While techniques do exist which allow teams to dynamically stop experiments when a certain level of certainty is reached, they are complex, often misunderstood, and should be avoided when starting out.2

Why is it better, then, to not stop an experiment once a level of significance has been reached? The fact is that while you’re running an experiment, the randomness of life is occurring, and that randomness is itself, random.

Although you may believe that at a certain point in time, the results you are seeing are “statistically significant” (i.e., that the results fall below your agreed significance level), all this means is that the evidence gathered so far happens not to have been affected (enough) by chance. But that could change at any moment. One or two more minutes, and something might occur which throws any notion of “not being affected by chance” out the window.

This line of logic would then suggest that literally as soon as an experiment is declared significant, regardless of the amount of evidence gathered, it should be stopped. But by doing this, you’ve made a mockery of the significance level and biased the results in favor of your alternative hypothesis. Indeed, the more frequently you check the progress of an experiment, the less evidence will have been gathered, and thus the less accurate the calculation of significance will be.3

Instead, before running your experiment, set a statistical power level and a significance level, alongside one other factor, the minimum detectable effect, which we'll examine in just a moment. Use these to calculate a minimum amount of evidence that must be collected (regardless of whether or not it supports your alternative hypothesis) before any analysis or judgment can be made.

From this “sample size” calculation, you can then estimate how long it is likely to take to gather that amount of evidence. Ensure that everyone commits to a rule of no peeking at the results, until such time as enough evidence has been collected. You should only stop an experiment once you have gathered that minimum amount of evidence.
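
To see why peeking is so dangerous, here is a small simulation sketch in Python. All of the numbers (a 3% click-through rate in both conditions, 2,000 visitors per condition, 20 interim checks, 2,000 simulated experiments) are illustrative assumptions of mine, and the snippet needs numpy and scipy. It runs many "A/A" experiments, in which no real difference exists, and compares how often a naive test looks significant if you peek at every checkpoint versus testing only once the planned amount of evidence is in.

```python
# Simulating the "peeking" problem: in an A/A test there is no real
# difference, so every "significant" result is a false positive.
# Illustrative numbers only; requires numpy and scipy.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
baseline = 0.03           # true click-through rate in BOTH conditions
n_per_condition = 2000    # planned amount of evidence per condition
checkpoints = np.linspace(100, n_per_condition, 20, dtype=int)  # 20 interim peeks
alpha = 0.05              # significance level
simulations = 2000

def p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p value from a simple two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (clicks_a / n_a - clicks_b / n_b) / se
    return 2 * (1 - norm.cdf(abs(z)))

peeking_false_positives = 0
end_only_false_positives = 0
for _ in range(simulations):
    a = rng.random(n_per_condition) < baseline
    b = rng.random(n_per_condition) < baseline
    # Peeking: declare success if ANY interim check dips below alpha.
    if any(p_value(a[:n].sum(), n, b[:n].sum(), n) < alpha for n in checkpoints):
        peeking_false_positives += 1
    # Disciplined: test once, at the planned amount of evidence.
    if p_value(a.sum(), n_per_condition, b.sum(), n_per_condition) < alpha:
        end_only_false_positives += 1

print(f"False positive rate with peeking: {peeking_false_positives / simulations:.1%}")
print(f"False positive rate, end only:    {end_only_false_positives / simulations:.1%}")
```

Under these assumptions, the peeking rule cries "significant!" far more often than the 5% that the significance level is supposed to allow, while the single end-of-experiment test stays close to it.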

How precise do you need to be?

If statistical power and significance level, when taken together, represent the level of certainty you require from your experiment, then a third concept, known as the minimum detectable effect, is used to describe how precise a change or difference you will be able to detect when running your experiment.

Determining a minimum detectable effect

In order to work out how long you need to run an experiment for, in addition to the level of certainty you’re comfortable with, you need to ask yourself—what size of difference would be interesting, or useful, for us to be able to see? This is what’s known in statistics as deciding upon your minimum detectable effect (MDE)—the minimum amount of difference that we could detect within a time period, given an agreed significance level and power.

With belief-led questions, this tends to be fairly simple. Assuming your null hypothesis to be true, when we run an experiment, we wouldn’t see any difference between one condition and another. The question is—if we’re trying to determine whether there’s strong enough evidence to support our alternative hypothesis, how much of a difference would be valuable for us to be able to see?

For instance—say we have a click-through rate to articles on a news company’s home page of 3%. Being able to detect that a proposed change to the layout of the home page caused the click-through rate to rise to 3.2% (or indeed fall to 2.8%)—is that interesting enough for us? Perhaps not. But knowing that it rose or fell to 4% or 2%—that might be valuable, particularly if there’s a high amount of traffic in play.
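
To put rough numbers on that, here is a sketch of the same example using the statsmodels library (my choice of tool, not one prescribed here), with a 5% significance level and 80% power assumed. It compares the evidence needed to detect a rise to 3.2% against a rise to 4% from the 3% baseline.

```python
# How the choice of minimum detectable effect changes the evidence needed,
# for the home page example: 3% baseline click-through, 5% significance,
# 80% power. Requires statsmodels; numbers are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03
alpha, power = 0.05, 0.80

for target in (0.032, 0.04):   # a rise to 3.2% vs. a rise to 4%
    effect = proportion_effectsize(target, baseline)   # Cohen's h for proportions
    n = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
    )
    print(f"Detecting {baseline:.1%} -> {target:.1%}: ~{n:,.0f} visitors per condition")
```

Under these assumptions, spotting the 0.2 percentage point rise demands roughly twenty times more evidence than spotting the full percentage point. Different calculators use slightly different formulas, so treat such output as a ballpark figure rather than a precise target.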

The larger you set the minimum detectable effect, the coarser the change from an existing baseline you'll be able to detect, given an agreed level of certainty. In contrast, the smaller you set the minimum detectable effect, the more precise the level of change from an existing baseline you'll be able to detect, again with an agreed level of certainty.

The upshot of this is that if you set, say, a 10% MDE, then a result which shows evidence of any change less than that cannot be regarded as significant or trustworthy.

Larger MDEs, because they detect bigger movements from a baseline, need less evidence to be gathered. With a larger MDE, you’re saying that you’re not interested if the needle moves a little—but this will ignore evidence which says your proposed change would have a smaller effect.

Smaller MDEs thus need more evidence to be gathered, because you’re trying to be more precise—you want to capture smaller differences or changes.

Bear in mind, though, that the smaller your starting baseline, the more evidence you'll need when setting a smaller MDE. This is because detecting smaller changes on something that is already small requires even more precision—even if you're willing to compromise on certainty.

Understanding absolute vs. relative MDEs

When setting a minimum detectable effect, it’s important to understand whether you’re interested in an absolute change or a relative change. Some calculators that will help you determine the amount of evidence needed for an experiment allow you to choose between absolute and relative MDEs, but others simply assume one or the other.

From my experience so far, plenty of tools assume you're interested in a relative change, so you need to bear this in mind when inputting your desired MDE. If in doubt, use Evan Miller's sample size calculator, which lets you play around with both kinds, as well as see how different levels of statistical power and significance affect the sample size (i.e., the amount of evidence) needed.4

For clarity, let’s take a quick look at the definition of “absolute” and “relative” change.

Absolute changes

An absolute change is the “real” difference, or gap, between two numbers. It’s akin to simply adding or subtracting the change from the number you start with.

So, an absolute change of 1% on 10% would be either 9% or 11%, and an absolute change of 0.1% on 10% would be 9.9% or 10.1%.

Relative changes

A relative change, however, is the “fractional” difference between two numbers. It’s akin to taking a certain slice of the original number and adding or subtracting that from the original number.

For instance, a relative change of 1% on 10% would be 9.9% or 10.1%, because 1% of 10% is 0.1%.

Similarly, a relative change of 10% on 10% would be 9% or 11%.

It can be a little confusing, but the key thing is to be aware of which type of change you're selecting. You can set the size of effect you want, regardless of whether it's absolute or relative, but choosing the wrong kind, and mistakenly setting an MDE of completely the wrong order of magnitude, can ruin an experiment's utility.
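
To make the distinction concrete, here is a tiny Python sketch using the same illustrative numbers as above (a 10% baseline and a "1%" change), showing the target rates each interpretation implies.

```python
# Converting an MDE into target rates under the two interpretations.
# Illustrative numbers: a 10% baseline and a "1%" MDE.
baseline = 0.10   # current rate, e.g. 10% click-through
mde = 0.01        # the "1%" you typed into a calculator

# Absolute: add or subtract the MDE directly (percentage points).
absolute_low, absolute_high = baseline - mde, baseline + mde              # 9% or 11%

# Relative: add or subtract a fraction OF the baseline.
relative_low, relative_high = baseline * (1 - mde), baseline * (1 + mde)  # 9.9% or 10.1%

print(f"Absolute 1% change on 10%: {absolute_low:.1%} or {absolute_high:.1%}")
print(f"Relative 1% change on 10%: {relative_low:.2%} or {relative_high:.2%}")
```

Mix the two up, and a calculator may quietly size your experiment for a change ten times smaller, or larger, than the one you actually care about.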

Selecting an appropriate MDE

Ultimately, then, it’s up to you, in the context of your experiment, to set an appropriate MDE. It boils down to how valuable it would be to you, the team, and the organization within which you are working, to know about small or large changes or differences. This, of course, depends on which measure you’re interested in and its importance to the health or value of your product.

Calculating scale for belief-led experiments

Now that we have decided upon the necessary certainty for our experiment, using statistical power and significance level, we can combine that with our preferred minimum detectable effect and thus get the sample size needed for each condition in our experiment.

Note that this is the sample size for one condition—and if we’re testing two, as per our null and alternative hypotheses, then we need to double the sample size in order to get our total sample size number.

Next, we estimate how quickly we might be able to get to that size. This, in part, depends on the method you’re going to choose, and so taking the total sample size into account is going to be a major factor when deciding which method to employ.

Think about how many samples you’d be able to get in an hour, a day, and a week. If you’re thinking of running the typical A/B feature test, then depending on the level of traffic to your product or service, you can consider how many hours or days it should take to reach the sample size. If you’re considering other methods, such as user interviews, think about how much effort is required to conduct each interview, and thus how many you could reasonably do in a day or week.
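
Once you have a per-condition sample size, the duration estimate is simple arithmetic. Here is a back-of-the-envelope sketch; the per-condition figure and the daily traffic number are hypothetical placeholders for your own values.

```python
# Rough duration estimate for a typical A/B feature test.
# Both input numbers are hypothetical placeholders.
import math

sample_per_condition = 5_000       # from your power/significance/MDE calculation
conditions = 2                     # the null and alternative conditions (A and B)
eligible_visitors_per_day = 1_200  # traffic actually exposed to the experiment

total_sample = sample_per_condition * conditions
days_needed = math.ceil(total_sample / eligible_visitors_per_day)

print(f"Total evidence needed: {total_sample:,} visitors")
print(f"Estimated minimum run time: {days_needed} days")
```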

All of which is to say—what is the simplest, useful thing you could do, to reach that sample size as quickly and cheaply as possible, without compromising on your agreed level of certainty and precision?

If it seems like it's going to be prohibitively lengthy to reach the minimum amount of evidence needed, then you'll need to compromise on your certainty and precision, either by lowering the standards or by seeing what would be possible within a specific amount of time.5 As mentioned earlier in the chapter, however, this should only be done if there really is no other option.

A note on baseline experiments

The combination of certainty, precision, and MDE works very well in cases where you're hypothesizing a change or difference between an established thing and something different. However, there will be times when you're going to want to understand the potential for something new.

Baseline experiments are ones in which you have a hypothesis—that introducing something new will do something positive for the metrics you care about. The issue at hand, however, is that you’re going to introduce it into an environment where that metric currently isn’t in play.

For example—the team that I worked with when first practicing XDPD was developing a recommendations product. We had deployed a pop-up window on certain pages across the company’s sites, and were seeing a very healthy click-through rate, as well as sign-up numbers for our email service. We wanted to bring the product to a new area of the site, but pop-ups weren’t going to be appropriate. So, we designed something that could sit within the page.

Our first instinct was to compare it against the pop-up. But we realized that this wouldn’t be a fair test. It wasn’t really a case of deciding whether the in-page version was “better” than the pop-up version. That wasn’t the question we were trying to answer.

What we were interested in was whether the in-page version would get any click-through at all, given it was less eye-catching and further down the page than a pop-up that appeared shortly after a user had started to scroll.

The problem was, when trying to decide our MDE, we realized that the current performance on those pages… was a 0% click-through rate—because it didn’t exist. Try calculating a sample size based on a 0% current performance. It’s not possible.

As a result, it was impossible to accurately calculate a total sample size and thus a minimum amount of time to run the experiment. In which case, we had to go for a fixed amount of time—two weeks, in our case—just to get a baseline for how our new version of the product, in a completely new environment, would perform.

It was only after we had the answer to our question—is this even going to be of interest?—that we could start iterating and getting back to calculating our scale properly.

Deciding scale for exploratory experiments

So far, when discussing scale, we’ve focused on experiments which intend to answer belief-led questions. However, a variation of the same problematic outcomes that we saw with those kinds of experiments can occur with exploratory experiments.

Even if you aren’t engaged in testing a hypothesis, you would still want to ensure that you set up the experiment to give you the best possible chance of finding interesting knowledge. Similarly, although a single response can tell you a great deal about how someone makes sense of the question at hand, basing a product decision purely on one person’s interpretation can be problematic—if it wasn’t, we could just ask our most senior team member or stakeholder for their opinion and be done with it.

A “representative” sample

Determining a representative sample size to draw upon for your experiment is still important, but what “representative” means in any particular scenario will depend a lot more on the conditions and criteria you’ve selected for your experiment.

When determining the necessary scale needed for an exploratory experiment, therefore, you should take into account the availability of resources—existing evidence, likely number of participants, and so on—that would qualify as being relevant for the conditions you’ve chosen.

For instance, if you were interested in discovering current customers’ attitudes to your product, you’d want to ensure a representative sample of customers, so that you get an accurate understanding of the general mood. In order to do this, you’d need to take into account not only the number of customers you have but the number of customers you’re likely to be able to get some kind of feedback from, in a reasonable amount of time.

Some may not respond—not because they don't have an opinion but because they're not comfortable or motivated enough to contribute. Others may use it as an opportunity to give you completely different feedback, in addition to, or instead of, helping you answer the question you're trying to pose.

Think about the conditions needed, and how much evidence you’d be comfortable basing decisions upon. The basic principles of certainty still apply—no matter the kind of question being posed, the more evidence you can collect, the more certain you can be, but the longer it will take to gather.

Certainty comes in patterns, not anecdotes

What you’re looking for, when it comes to exploratory experiments, is certainty in patterns . Just as false positives and false negatives can throw you off the scent with belief-led experiments, collecting a number of isolated anecdotes, although interesting, won’t necessarily always be helpful in answering the question. Instead, you get certainty from having spotted a pattern—insights that occur often within, and sometimes even across, the conditions of your experiment.

Precision and granularity

Similarly, with precision, it all depends on the granularity of knowledge that you’re interested in. Being more precise in this context comes down to how important it is to capture every last detail of the evidence. As with belief-led questions, the real trick is not to focus on how long it will take to run the experiment but the volume of information required to be able to answer the question effectively.

Would the answer to the question really change if you captured the exact time of day, precise location, body temperature, and facial expression of participants? In most cases, probably not.

All of the preceding information will help you determine the simplest, useful thing you can do, to help you gather enough evidence in a reasonable amount of time. It may not be quite as neat and tidy as the mathematical formulae underlying hypothesis testing, but it's still an element of your experiment worth designing.

Selecting the most appropriate method

By this stage in your experiment design, you will have filled in all but one of the sections of the experiment card. Now that we know all of the various ingredients for our experiment, there’s only one thing left to decide. What’s the simplest, useful thing we could do, to answer the question?

The difference between objects of inquiry and methods

When designing your experiment, it’s sometimes tough to avoid lapsing into specifying a method up front. For instance, let’s say you’re proposing that a change in your product would affect a particular metric, or perhaps you want to find out the reasons why users are no longer interacting with a certain feature. Your mind may naturally leap to suggesting an A/B feature test for the former, or a series of user interviews for the latter, even before you’ve worked out the appropriate hypothesis, measures, conditions, and scale.

Specifying the method up front, however, risks entering into the "method-question-answering" loop I mentioned in the last chapter. The question you're answering is not whether method X will get you an answer. It's whether the change in the product will affect the metric—or why users feel the way they do.

One way to avoid this trap is to recognize the difference between the object of inquiry and the method.

The object of inquiry is the thing that you’re investigating in the experiment, regardless of method. The method is the way in which you will perform the investigation.

For instance, in the case of our first example earlier, we’re proposing to make an intervention—a change that we believe will make a difference. We could do this by running an A/B feature test. Alternatively, we could mock up a paper prototype and test reactions to it. Or, we could gather some user interaction data and then simulate the change. These are our possible methods.

In our second example, we're seeking to perform some observation. How might we do this? Run a survey on the site? Go and canvass opinions at a conference? Arrange one-to-one interviews? Again, all possible ways of performing the investigation, but whether they are the most appropriate depends on the conditions, measures, and scale.

Let’s try one more example. Think back to the case of Etsy I mentioned in Chapter 5. One premise they identified was that showing more items in search results would be better for business. The object of inquiry was their hypothesis—that they would try increasing the number of items loaded on a search result page. The method was to run an A/B feature test which would allow them to compare data on how much was bought in the version with more items vs. that with the usual.

Not every experiment is an A/B test; not every A/B test requires a code change

In the world of scientific research, experiments typically have hypotheses and typically use the framework of the A/B test, comparing a control against an alternative condition. In experiment-driven product development, this is not the case.

In XDPD, the most important thing is the question, and finding a useful, meaningful answer to it. Regardless of whether we have a knowledge gap we’re trying to fill, or we want to test one of our assumptions, claims, or premises, each of these has a question (or indeed many questions!) at their root, and thus they are all candidates for experiments.

The A/B framework is especially important, as we’ve seen, for our belief-led questions. It can give us important things to consider, mainly when it comes to scale, for our exploratory questions too. That said, don’t feel you have to force every experiment to be answered via an A/B test.

Similarly, it’s important to remember that the “A/B” idea is a framework rather than a method. “A/B” is really just about our two hypotheses—the null and the alternative. Whether we seek to test them via a code change and a new feature, or whether we can collect some evidence without changing the product at all—it doesn’t matter. The “A/B” framework doesn’t specify the method by which you’ll test those hypotheses. So again, even if you are designing an experiment around a belief led question, don’t feel you have to surface that via an actual change in your live product. What really matters is, is this the simplest, useful thing we could do?

Simplest, useful thing

In this and the preceding chapter, we’ve delved deep into how we can try and define what “useful” means, in the context of an experiment. By designing the parameters of our experiment, we know what we would need in order to get an answer we trust, enough that we can base a decision upon it. But how should we try and achieve that?

This is where it pays to avoid choosing the method ahead of time. Deciding up front on a method forces you to compromise on the usefulness of the experiment. If you do this, you’ll find yourself torturing the experiment design in order to wrangle at least something useful out of it, often at the expense of time, effort, and ultimately, the usefulness of the answer you get out at the end.

Instead, focus on how you might achieve a useful answer. Think of all the possible ways you could find the answer, and select the one that is simplest to set up but will still achieve the required conditions, measures, and scale.

Ultimately this comes down to what’s practical and possible given your circumstances. Every team is different, with different resources available to them. What might be cheap, quick, and dirty for one can be hideously expensive just to set up, for another.

If it seems that it’s not going to be possible to find something which is both simple and will be useful, then take a step back. Is the question we’ve posed too large? Are there smaller, still useful, questions we could ask as a way of taking small steps toward the bigger question?

Similarly, if you choose a method, and then find yourself spending days just setting up the experiment, ask yourself—is this really the simplest, useful thing we could do? Is there not something else we could do which would get us a step closer toward the answer?

Again, this may mean that you can’t answer the original question you posed in one experiment, but that’s OK. Is there a simpler, useful thing you could do, which would still test an assumption, claim, or premise, or help fill in a knowledge gap? Do that. Realizing that a question is too big to answer in one go is still a valuable thing to learn.

Your experiment, your responsibility

We’ve touched on an element of responsibility previously, when discussing the dangers of collecting user data without any particular reason in mind. Choosing your method for an experiment also has consequences for you and the people who may be included within it while it runs.

The ethics of experimentation are a vast subject and could easily take up their own book. I will, however, touch on one aspect. As much as you should be looking to do the simplest, useful thing in an experiment, you should make sure, too, that you aren’t unintentionally causing harm.

Earlier, we discussed the concept of “health metrics”—metrics that aren’t always directly under investigation in an experiment but which might indicate serious problems for the long-term health of our product if they are negatively affected.

Just as we wish to do no harm to those health metrics, we must consider the ways in which the method we choose might, unintentionally, harm our users. One framework for doing so is to ask the following questions, adapted from the Nuremberg Code6 by Kim Goodwin:7
  • Is it the only way?

  • Is the risk proportional to the benefit?

  • What kinds of harm are possible?

  • How will you minimize harm?

Asking these questions will help flush out potential issues with the experiment—all part of designing something that will allow you to move, and learn, fast—without breaking the things that truly matter.

Running experiments in parallel

It can sometimes be frustrating, waiting for an experiment to finish before starting the next. Depending on your team’s capacity, it’s possible to run experiments in parallel—at the same time—as long as you avoid breaking the golden rule. Never run two (or more) experiments on the same thing, at the same time.

"Same thing" is a little vague, I know. So let's be more specific. Say you've got two different experiments you'd like to run, both involving a user journey across a particular page, screen, or interaction within your product. If you ran both of these at the same time—and crucially, exposed a single user to both experiments within the same session—there's no way of knowing whether their response is due to the first or the second thing you're interested in discovering. If this could have happened to anyone taking part, then the responses can't reliably be tied to either experiment, and both answers are potentially invalid.

However, if you can guarantee that the two experiments are totally independent, or that the chances of crossover are so negligible as to not affect the answers, then by all means, save time and run the experiments in parallel. Just always be aware of the golden rule—experiments must be totally independent of each other.
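
One practical way to guarantee that kind of independence, when both experiments run through code, is to make the assignment mutually exclusive: for example, by hashing each user ID into a bucket and reserving disjoint bucket ranges for each experiment. The sketch below is just one possible approach under my own assumptions; the experiment names and the 50/50 split are made up for illustration, and the trade-off is that each experiment only sees a share of your traffic.

```python
# A sketch of mutually exclusive experiment assignment: each user is hashed
# into one of 100 buckets, and each experiment "owns" a disjoint bucket range,
# so no user can ever be exposed to both experiments.
# Experiment names and the 50/50 split are made up for illustration.
import hashlib
from typing import Optional

EXPERIMENT_BUCKETS = {
    "homepage_layout_test": range(0, 50),    # buckets 0-49
    "recommendations_test": range(50, 100),  # buckets 50-99
}

def bucket_for(user_id: str, total_buckets: int = 100) -> int:
    """Deterministically map a user ID to a bucket number."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % total_buckets

def experiment_for(user_id: str) -> Optional[str]:
    """Return the single experiment (at most one) this user is eligible for."""
    bucket = bucket_for(user_id)
    for experiment, buckets in EXPERIMENT_BUCKETS.items():
        if bucket in buckets:
            return experiment
    return None

print(experiment_for("user-12345"))  # same user always gets the same answer
```

Within each experiment's bucket range, you would then randomize users into the A and B conditions as usual.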

Summary

Congratulations—you’ve completed the Design Phase! At last, we’re ready to run an experiment, and watch the results come in. We’ve covered a lot in this chapter, so let’s review. What have we learned?
  • Deciding on the scale and method of your experiment—how many participants, how long to run it for, and how to gather evidence—deserves some thought.

  • Determining the necessary scale has two aspects:
    • Certainty

    • Precision

  • That there are four possible outcomes to an experiment:
    • Evidence for the answer exists, and you find it.

    • Evidence for the answer exists, but you don’t find it (Type II error).

    • Evidence for the answer doesn’t exist, and you don’t find anything suggesting that it does.

    • Evidence for the answer doesn’t exist, but you’re misled by what you collect during the experiment, so that you believe it does exist (Type I error).

  • Statistical power is the chance of capturing crucial evidence when it does exist, and thus not missing out on something vital:
    • The higher the power level, the lower your chances of missing out

  • Significance level is the probability of being misled by false positives:
    • The lower the significance level, the less chance of being misled.

    • Significance level fluctuates—so don’t stop an experiment early.

  • Minimum detectable effect is the smallest level of detail you are interested in being able to capture.

  • All three concepts have an effect on the minimum sample size you’ll need for your experiment, and thus the time it will take to run the experiment.

  • In exploratory experiments, we can bear the preceding concepts in mind, but we should aim for
    • A representative sample, depending on your conditions

    • An appropriate level of granularity for detail, depending on your measures

    • Identifying patterns, not simply gathering anecdotes

  • When selecting an appropriate method for the experiment, you should aim to do the simplest, useful thing.

  • Remember that you have a responsibility to ensure no harm comes to participants in your experiment.

  • And finally, that running parallel experiments is possible, but they must be kept totally independent of one another.

Now, we’ve gone through every part of the experiment card, feel free to start designing some experiments. Don’t worry about getting everything spot on the first time around—use all the resources you have to hand, within and outside the team, to help you. The most important thing is to try and to learn—which brings us on to the final phase of experiment-driven product development—the Analysis Phase, which we’ll look at in Chapter 8.
