A/B testing – a brief introduction and a practical example with R

Unlike the other tests that we've seen so far, A/B tests do not rely on a single test statistic and a single distribution to derive inference; as a matter of fact, they can draw on any test statistic and distribution. A/B testing is, rather, a name given to a broad technique for comparing versions, one that dictates everything from how to sample all the way to obtaining your p-value and confidence interval and making a decision. These tests are wildly popular in the field of web analytics, for very good reasons, but are not restricted to it.

These tests are capable of delivering statistically grounded answers to a broad set of "is A better than B?" problems. Will layout A attract more clicks than layout B? Will color A be more profitable than color B? Will campaign A work better than campaign B? A/B tests tend to guide decisions better than gut feeling and guesses. This section aims to introduce the reader to them.

The two versions of something you wish to test are called the control treatment and the variant treatment.

There are a couple of things that deserve a great deal of attention when setting up these tests. The first thing to decide is what we are looking for, in other words, the variable of interest. The variable of interest determines how we check whether version A does better than version B; from the perspective of web analytics, it could be the click rate, the revenue per paying user, or the time spent on the website. The kind of variable chosen will determine which statistical test to use later (t-test, chi-square test, Fisher's exact test, and so on).
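
For instance, if the variable of interest were the time spent on the website, a continuous measure, a t-test would be a natural choice. The sketch below uses simulated data purely to illustrate the idea; the means, standard deviation, and sample sizes are made up:

set.seed(1)
time_a <- rnorm(200, mean = 62, sd = 15)   # simulated seconds spent on version A
time_b <- rnorm(200, mean = 65, sd = 15)   # simulated seconds spent on version B
t.test(time_a, time_b)                     # two-sample t-test comparing the means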

Once the variable is chosen, it's time to gather samples. Samples of the two (or more) versions must be gathered simultaneously. On the web, this is done by redirecting part of the traffic to one version and the rest to the alternative version. It's important that this split is done randomly.
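
A minimal sketch of such a random split, assuming every visitor is assigned to a version with equal probability (the visitor count and the seed are made up for illustration):

set.seed(42)                                # for a reproducible illustration
n_visitors <- 20000                         # hypothetical number of visitors
assignment <- sample(c('control', 'variant'),
                     size = n_visitors, replace = TRUE)
table(assignment)                           # roughly 10,000 visitors per version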

Samples from the control and variant treatments must have similar sizes. They must also be large enough to support inference. The user must let the test run long enough to gather enough data, but not for too long. There are two main arguments against letting the test run for too long:

  • Economic argument: If two versions are running simultaneously, there is a great chance you are not getting optimal results
  • Statistical argument: The population might change drastically during a long-running test; the results might then reflect something other than what the test was designed to investigate

It's very common to watch live results from tests like these as the samples keep growing. Some users stop their tests as soon as the results show statistical significance, and that's not a good way to go: even A/A tests, where both groups are shown the same version, will reject the null hypothesis once in a while, as the simulation below illustrates. Good practices include fixing a minimum sample size and a time limit in advance.
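
A small simulation makes the point; it draws two groups from the same hypothetical click rate (an A/A test) many times and counts how often Fisher's exact test rejects the null at the 5% level. The click rate, group size, and seed are made up for the sketch:

set.seed(123)
p_values <- replicate(1000, {
  clicks <- rbinom(2, size = 10000, prob = 0.015)   # same true click rate in both groups
  tab_aa <- cbind(clicks, 10000 - clicks)           # 2 x 2 table for one A/A test
  fisher.test(tab_aa)$p.value
})
mean(p_values < 0.05)                               # close to 0.05: ~5% false positives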

Once you get your test running, you can do the calculations with R and even determine whether your sample is sufficiently large. Let's check this with an example. Imagine that you are running an A/B test on a website, testing whether a green button (variant treatment) does better than a red one (control treatment). After running the test for one week with the traffic split randomly between the two versions, you get the following click data:

                           Clicks    No-clicks
Red button (control)          130         9870
Green button (variant)        170         9830


We can reproduce this table using R with the following code:

control <- c(130, 9870)          # clicks and no-clicks for the red button
variant <- c(170, 9830)          # clicks and no-clicks for the green button
tab <- rbind(control, variant)   # 2 x 2 contingency table

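Optionally (this labelling is not part of the original example), you can name the columns of tab and let R compute the click rates directly, which reproduces the 1.3% and 1.7% mentioned below:

colnames(tab) <- c('clicks', 'no_clicks')   # rows are already named by rbind()
prop.table(tab, margin = 1)                 # row-wise proportions: ~0.013 vs ~0.017
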
Now the object tab holds exactly the same table, showing how many clicks each button received during the experiment. The red button had a click rate of 1.3%, while the green one did slightly better with 1.7%. How likely are values like these to show up even if there is no real difference between the buttons? Now, it's time to hand these numbers to Fisher's exact test, a test suitable for comparing two proportions:

fisher.test(tab, alternative = 'less')
# Fisher's Exact Test for Count Data
#
# data: tab
# p-value = 0.01156
# alternative hypothesis: true odds ratio is less than 1
# 95 percent confidence interval:
# 0.0000000 0.9296408
# sample estimates:
# odds ratio
# 0.7616196

The odds ratio shows how likely the red button is to be clicked in comparison with the green button. The null hypothesis states that they're equally likely to be clicked, in other words, that the odds ratio is equal to one. Since I suspect that the green button does better, I set up the test with an alternative hypothesis stating that the true odds ratio is less than one.
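
If you want to see where that estimate comes from, the sample odds ratio can be computed by hand from the table; note that fisher.test reports a conditional maximum likelihood estimate, so the two values may differ slightly, although here they essentially agree (about 0.76):

(130 / 9870) / (170 / 9830)   # odds of a click on red divided by odds of a click on green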

We got a p-value of 0.01156 (1.156%), which favors the alternative hypothesis: green is better. These tests also seek to minimize the chances of a type II error by improving what is called statistical power. Power improves with larger sample sizes. There is an easy way to estimate how many observations per group we may need to achieve a given power using R:

power.prop.test(p1 = 130/(130+9870), p2 = 170/(170+9830),
                power = 0.8, sig.level = 0.012,
                alternative = 'one.sided')

# Two-sample comparison of proportions power calculation
#
# n = 17732.86
# p1 = 0.013
# p2 = 0.017
# sig.level = 0.012
# power = 0.8
# alternative = one.sided
#
# NOTE: n is number in *each* group

The p1 and p2 arguments take, respectively, the click rates of the red and green buttons; power and sig.level ask for the minimum statistical power and significance level that we are looking for. About 17,733 observations per group are needed in order to achieve a power of 80% (a very popular target). Given that 10,000 observations per group were gathered in the first week, I would say that it is reasonable to let the experiment run for one more week in order to get even better results.
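
As a complementary check (not shown above), you can also ask power.prop.test for the power achieved with the 10,000 observations per group already collected; with these click rates and this significance level, it comes out below the 0.8 target, which supports letting the test run longer:

power.prop.test(n = 10000, p1 = 130/(130+9870), p2 = 170/(170+9830),
                sig.level = 0.012, alternative = 'one.sided')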
