What do pharmaceutical trials have in common with designing websites and mobile apps? A lot, as it turns out. When we want to test out a new drug, we conduct an experiment to see whether a test group that’s been given the drug outperforms a control group.
And when we want to test changes to our website at scale, we conduct A/B tests to see if our changes end up improving conversion rates.
In a drug trial, the worst-case scenario is a false positive: our experiment makes it seem like the drug is effective, but in reality, it’s no better than (or even worse than) a placebo. This means an ineffective or even harmful drug gets released to the public.
When you conduct A/B tests on your website, there are similar concerns. If you’re not disciplined about how you run and evaluate A/B test experiments, you’ll get false positives that do nothing to improve your website.
In fact, it’s even possible that improper A/B testing suggests changes that decrease your conversion rates.
How does this happen? By stopping your tests too early. Drug trials follow strict “stopping rules” precisely because ending a trial whenever the results happen to look good produces false positives. The same statistical logic applies when running A/B tests, but we often ignore these rules (or don’t even know they exist!).
When running an A/B test through a service like Optimizely, it’s easy to check the results while the test is still running. Instead of letting the test run all the way through, many people (especially startups!) save time and money by stopping a test as soon as it has reached statistical significance. Doing this will cause the rate of false positives to skyrocket.
I’ll illustrate how this can happen with an example. At Heap, we ran a simple A/B test a few months ago between two different headlines on our homepage: “No code required” and “Capture everything.” We decided to run an A/B test for 1000 users each. So far, so good. But a few hours after we deployed the test, we saw that the “No code required” variation was winning by a statistically significant margin. We selected that as the new headline without letting the experiment run all the way through.
This is where we went wrong. As it turned out, we had accidentally left the A/B test running on a portion of our userbase. When we looked at the results 4 days later, after a few thousand visitors, the two headlines showed no difference in conversion rate. If we had let the experiment run all the way through, the early randomness would have evened out and neither variation would have won. By acting on the results before the experiment finished, we ended up with a false positive.
In many experiments, we set the significance threshold to be 5% (or a p-value threshold of 0.05). This means that we’ll accept that Variation A is better than Variation B only if A beats B by a margin so large that, if there were truly no difference between them, a gap that big would show up just 5% of the time. Phrased another way, Variation A needs to do a lot better than Variation B to be considered the “winner” of the A/B test; if it’s only a little bit better, then it might just be random chance. This helps us be confident that our experiments are improving conversion rates, and not just making random, useless (or even detrimental) changes to our product. But the 5% guarantee only holds if we evaluate the results once, at the planned end of the experiment. If we stop as soon as we see significance, we relax the 5% constraint, sometimes by a huge amount. The more often we check the experiment (with the intent of stopping it if it shows significance), the less we can trust the results of our A/B testing.
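To make the 5% threshold concrete, here is a minimal Python sketch of how a p-value for the difference between two conversion rates might be computed. It assumes a pooled two-proportion z-test with a normal approximation, which is one common choice and not necessarily the test your A/B testing tool uses. The function name `two_proportion_p_value` and the visitor counts are made up for illustration.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value for the difference between two conversion rates.

    Assumption: pooled two-proportion z-test (normal approximation).
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Nobody converted (or everybody did): no evidence of a difference.
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / std_err
    # erfc(|z| / sqrt(2)) is the two-tailed tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: conversions out of 1000 visitors per variation.
for conv_a, conv_b in [(120, 100), (130, 100)]:
    p = two_proportion_p_value(conv_a, 1000, conv_b, 1000)
    print(f"{conv_a} vs. {conv_b} conversions: p = {p:.3f}, significant: {p < 0.05}")
```

With these made-up numbers, 120 vs. 100 conversions does not clear the 0.05 threshold, while 130 vs. 100 does. A modest-looking lead is often not enough.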
I ran some simulated A/B tests to see what would happen if we check our experiments while they’re still running. The simulation was as follows:
- We have two variations of our product, A and B, and we want to see which converts better.
- I set up the simulation so that the conversion rate for both variations was exactly 10%. So if an A/B test experiment reported that one variation converted better than the other, then that would be a false positive.
- I ran both variations against 1000 simulated visitors each, measured the final conversion rate for each variation, and calculated the p-value based on the difference in conversion rates. I set the p-value threshold to 0.05, so that we expect a false positive rate of 5%. Sure enough, when I ran several A/B test simulations, about 5% of them resulted in a false positive.
- Then I simulated what would happen if we checked for statistical significance midway through, after just 500 visitors had seen each variation (as well as once more at the end). Now what percentage led to false positives? This time, I saw a false positive rate of around 8.4% (out of 100,000 simulations, 8,426 of them were false positives). Even just one check mid-way through increased our false positive rate significantly (from 5% to 8.4%).
- Now I decided to see what would happen if we checked even more often. What if I had checked my Optimizely dashboard every 100 visitors (10 total checks throughout the 1000 visitors), and stopped the A/B test if I saw statistical significance at any one of those checks? What about every 50 visitors? What about every visitor, i.e. we stop the test as soon as we hit statistical significance at all? Here are the results:
| Number of checks | Simulated false positive rate |
|---|---|
| 1, at the end (like we’re supposed to) | 5.0% |
| 2 (every 500 visitors) | 8.4% |
| 5 (every 200 visitors) | 14.3% |
| 10 (every 100 visitors) | 19.5% |
| 20 (every 50 visitors) | 25.5% |
| 100 (every 10 visitors) | 40.1% |
| 1000 (every visitor) | 63.5% |
This means that if we’re monitoring a 1000-visitor A/B test continuously and stopping it as soon as we hit significance, the false positive rate will be over 60%, even though we set our threshold at 5%. That’s worse than useless! In cases like this, even a worse variation has a decent chance of winning.
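The simulation described above can be sketched in a few lines of Python. This is a rough reconstruction under stated assumptions, not the original simulation code: it uses a pooled two-proportion z-test for the p-value, and the names `p_value`, `run_test`, and `false_positive_rate` are hypothetical. Because the exact test and random draws differ, the numbers it produces will not match the table digit for digit, but they should show the same upward trend.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test (an assumption)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def run_test(rng, n_visitors, check_every, rate=0.10, alpha=0.05):
    """Simulate one A/B test where both variations convert at the same rate.

    Returns True if any check reports significance, which is a false
    positive because there is no real difference between A and B.
    """
    conv_a = conv_b = 0
    for n in range(1, n_visitors + 1):
        conv_a += rng.random() < rate  # one more visitor sees variation A
        conv_b += rng.random() < rate  # one more visitor sees variation B
        if n % check_every == 0 and p_value(conv_a, n, conv_b, n) < alpha:
            return True  # we would have stopped here and declared a winner
    return False


def false_positive_rate(n_sims, n_visitors=1000, check_every=1000, seed=0):
    """Fraction of simulated tests that end in a false positive."""
    rng = random.Random(seed)
    hits = sum(run_test(rng, n_visitors, check_every) for _ in range(n_sims))
    return hits / n_sims
```

Calling `false_positive_rate(10_000, check_every=1000)` corresponds to a single check at the end, while `check_every=100` corresponds to ten checks. Lowering `check_every` further makes the rate climb, at the cost of a slower simulation in pure Python.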
The fix to this problem is simple: don’t stop your A/B tests part-way through! Let them run their course, and then determine whether the results are significant.
It is possible to design an A/B test experiment such that it’s okay to stop it before completion, or even just let it run indefinitely until it hits significance. However, the statistics involved are a lot more complicated than the two-tailed test we use in traditional A/B testing.
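As one illustration, and not necessarily the kind of design meant here, the simplest way to make interim checks safe is a Bonferroni correction: decide in advance how many times you will check, and require the p-value at each check to fall below the overall threshold divided by the number of checks. This keeps the overall false positive rate at or below 5%, but it is conservative, and purpose-built group-sequential designs give up less sensitivity. The sketch below again assumes a pooled two-proportion z-test, and all function names are hypothetical.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test (an assumption)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def corrected_false_positive_rate(n_sims, n_visitors=1000, n_checks=10,
                                  alpha=0.05, rate=0.10, seed=0):
    """False positive rate when each planned check uses alpha / n_checks."""
    rng = random.Random(seed)
    check_every = n_visitors // n_checks
    per_check_alpha = alpha / n_checks  # Bonferroni: split alpha across checks
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        for n in range(1, n_visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % check_every == 0 and p_value(conv_a, n, conv_b, n) < per_check_alpha:
                false_positives += 1  # stopped early on a difference that is not real
                break
    return false_positives / n_sims
```

With ten planned checks, each one is held to a threshold of 0.005 instead of 0.05. The price is that a real improvement has to be larger, or the test longer, before it registers as significant.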