If you clicked through to this from the A/B testing tool on this website, then you already know that I built it using a Bayesian approach. You may well have noticed some differences between this tool and the normal A/B testing tools you've seen or used.
This blog isn't going to tell you everything you need to know about Bayesian statistics. It is, I assure you, a really interesting topic and a powerful approach to a lot of things. In our particular field of e-commerce and marketing, it puts focus on our expertise rather than casting us as impartial observers awaiting the verdict of a machine. If you want to find out more about Bayesian statistics, a really good place to start is with Will Kurt's book, 'Bayesian Statistics the Fun Way' (it is fun, I promise) or on his blog, the pleasingly named 'Count Bayesie'.
I want to talk about what makes my A/B testing tool a bit different, how this comes about and why I've chosen to do it this way.
Firstly - the bit of Bayes you need to know for this is that in a Bayesian approach, you use your knowledge of a problem to form a prior view of how likely something is, then combine that with real-life observations, perform a calculation, and end up with a posterior view of the likelihood. In other words, a very human process: 'I thought this, then I saw something happen, and now I think this.'
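To make that concrete, here's a minimal sketch in Python of the standard Beta-Binomial version of that update, which is the usual way to handle conversion rates (the numbers and variable names are mine, purely for illustration, not taken from the tool):

```python
from scipy import stats

# Prior view of the conversion rate: Beta(alpha, beta).
# Beta(1, 1) is flat - "I have no particular expectation about the rate."
prior_alpha, prior_beta = 1, 1

# Real-life observations (made-up numbers): 20 conversions from 100 visitors.
conversions, visitors = 20, 100

# The update: add successes to alpha, failures to beta.
posterior = stats.beta(prior_alpha + conversions,
                       prior_beta + (visitors - conversions))

low, high = posterior.interval(0.95)
print(f"Posterior mean conversion rate: {posterior.mean():.3f}")
print(f"95% credible interval: {low:.3f} to {high:.3f}")
```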
This is useful to us as marketers for a few reasons.
Essentially, the way that my tool works is that it takes the inputs (priors for each arm, number of successful outcomes, total outcomes) and fits them into something called a beta distribution for each test arm. For something as simple as A/B testing, the beta distribution is closely related to the binomial distribution that a frequentist approach uses (more on that in a minute): where the binomial describes counts of conversions, the beta describes your belief about the underlying conversion rate. The tool then samples from the distribution a large number of times for each arm (essentially, rolling virtual dice and checking the output), noting the difference in output on each draw and how many times each arm wins. Once it's done this, you end up with the percentage of times each arm wins and the average amount by which the winner beats the loser.
In other words, an output in the form 'There is a 70% chance that A is the winning arm, and on average it is 1.4x higher than B'.
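Here's a sketch of that simulation, assuming the standard Beta-Binomial setup described above. The arm data, priors and variable names are illustrative; the tool's actual code may differ in the details, for instance in exactly how it averages the winner's lift:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 100_000

# Inputs per arm: prior (alpha, beta), successes, total trials (illustrative numbers).
arms = {
    "A": {"prior": (1, 1), "successes": 120, "trials": 1000},
    "B": {"prior": (1, 1), "successes": 100, "trials": 1000},
}

# Draw samples from each arm's posterior beta distribution.
samples = {}
for name, arm in arms.items():
    alpha = arm["prior"][0] + arm["successes"]
    beta = arm["prior"][1] + (arm["trials"] - arm["successes"])
    samples[name] = rng.beta(alpha, beta, n_draws)

# How often does A beat B, and by how much on average when it does?
a_wins = samples["A"] > samples["B"]
prob_a_wins = a_wins.mean()
avg_lift = (samples["A"][a_wins] / samples["B"][a_wins]).mean()

print(f"There is a {prob_a_wins:.0%} chance that A is the winning arm,")
print(f"and on average it is {avg_lift:.2f}x higher than B.")
```

Because it's a simulation, the answer wobbles slightly from run to run; with 100,000 draws the wobble is small enough not to matter for a business decision.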
In terms of making a business decision, this is ideal - we have an idea of how likely the winning test version is to actually be better performing, and also of how big the opportunity is. As someone making a business decision, you can decide how risk averse you want to be, something that may vary depending on how much value there is to be gained.
Most A/B testing tools out there (I keep saying most because I'm sure some work using my approach, but these are the ones I see most commonly) use what is called a Frequentist approach. Essentially, this asks you to create something called a Null Hypothesis ('There is no difference between the two arms'), input the test data and then calculate the p value under this Null Hypothesis. Typically, a p value of less than 0.05 is taken to mean that a difference as big as the one observed would be unlikely to turn up through random chance alone if the arms really performed the same, and so the result is declared 'statistically significant' at the 95% level.
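For contrast, here's what that conventional calculation looks like on the same illustrative numbers as the sketch above. I've used a chi-squared test of independence; other tools may use a z-test or Fisher's exact test, but the shape of the answer is the same:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative data: conversions and non-conversions for each arm.
table = np.array([
    [120, 880],   # Arm A: 120 conversions, 880 non-conversions
    [100, 900],   # Arm B: 100 conversions, 900 non-conversions
])

chi2, p_value, dof, expected = chi2_contingency(table)

print(f"p value: {p_value:.3f}")
if p_value < 0.05:
    print("Reject the Null Hypothesis: the arms appear to perform differently.")
else:
    print("Fail to reject the Null Hypothesis: no 'significant' difference detected.")
```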
This presents a number of obstacles:
There are a lot of ways in which I see these tests being invalidated, from people stopping tests early, to running them longer to get more data, to just deciding to go ahead at a lower level of significance. If you are going to do these things, then there is no point in setting up a hypothesis to be disproved and performing calculations - you might as well save yourself time and money and pick the one you like more at the start.
The advantage of the 95% certainty threshold is that it minimises Type I errors (false positives) - in other words, you are less likely to declare the wrong arm the winner (you can see why minimising false positives matters in scientific research). However, it creates more Type II errors (false negatives) - you miss out on a change that could improve customer experience because you haven't reached an arbitrary level of significance. This is why it's important to have a tool that lets you make a business decision with all the context, rather than one that simply does or doesn't disprove a narrow hypothesis.
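To illustrate that trade-off, here's a small simulation (entirely my own illustrative numbers, nothing to do with the tool) that counts how often a genuinely better arm fails to reach p < 0.05 at a modest sample size - every one of those misses is a Type II error:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_experiments = 2000
n_visitors = 1000            # visitors per arm in each simulated test
rate_a, rate_b = 0.10, 0.12  # arm B really is better

misses = 0
for _ in range(n_experiments):
    conv_a = rng.binomial(n_visitors, rate_a)
    conv_b = rng.binomial(n_visitors, rate_b)
    table = [[conv_a, n_visitors - conv_a],
             [conv_b, n_visitors - conv_b]]
    _, p_value, _, _ = chi2_contingency(table)
    if p_value >= 0.05:
        misses += 1  # a real improvement that never reached 'significance'

print(f"Type II error rate at n={n_visitors} per arm: {misses / n_experiments:.0%}")
```

With these particular numbers the real improvement is missed far more often than it is found - a decision-maker relying solely on the 0.05 threshold would leave that uplift on the table.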
I introduced the concept of priors above but didn't really go into it. In the tool, I suggest that if you don't have any priors you should enter 1 for each arm. This is essentially the same as saying 'I don't have any information about the expected outcomes, nor do I have a view on which version is likely to perform better.' This is fine - you will not damage the calculations by doing this. However, you can get to a better answer more quickly if you can input a view. Generally speaking, you have two levers: the version you think will win (ie, you input a higher prior for that arm) and the weighting of your prior. The more certain of something you are, the bigger you would make your inputs (eg, 100 and 40 vs 10 and 4). This way, it takes more data to shift your prior (or, if you prefer, you have to see something more times to change your mind).
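As a sketch of how that weighting plays out (again using the standard Beta update with made-up data, not the tool's actual code), the same evidence moves a weak prior much further than a strong one:

```python
from scipy import stats

# New evidence: 50 successes out of 100 trials (an observed rate of 50%).
successes, trials = 50, 100

# Two priors with the same expected rate (about 71%) but very different weight
# behind them, matching the (10, 4) vs (100, 40) example in the text.
priors = {"weak prior (10, 4)": (10, 4),
          "strong prior (100, 40)": (100, 40)}

for label, (a, b) in priors.items():
    posterior = stats.beta(a + successes, b + (trials - successes))
    print(f"{label}: posterior mean = {posterior.mean():.3f}")

# The weak prior ends up close to the observed 50%; the strong prior barely
# budges - it takes much more data to change a firmly held view.
```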
This all relates to the free version of this tool on my site; I am working on some software that will allow for better calculation of priors that I hope to release at some point (if there are any software developers reading this who are interested in this project, please get in touch).
If you've got any questions or feedback on the tool, please get in touch using the contact form below.