Multi-armed bandits: the smart alternative for A/B testing

Written by: Corné de Ruijt


For a few years now, Google Analytics has offered an alternative way of A/B testing, called 'multi-armed bandits', or MAB. Although this method can be very beneficial compared to 'classic' A/B testing, it is used only sporadically, perhaps because it is less well known. In this blog, our Data Scientist Corné de Ruijt therefore compares A/B testing and MAB in the context of optimising career websites.

A/B testing: a short recap

Say that we are currently using a rather long application form on our career website, which is likely to deter job seekers from applying. To improve the application ratio, which we define as follows:

Application ratio = Number of applicants / Number of times application form is shown,

we test a new, much shorter application form. Hopefully, this shorter form will increase the application ratio. We will further refer to the current application form as 'form A', whereas the new form is 'form B', hence the name A/B testing. Since our hypothesis is that option B will do better, we will test the following hypothesis:

H0: The application ratio of form B is smaller than or equal to the application ratio of form A

H1: The application ratio of form B is greater than the application ratio of form A

Both form A and B receive 1000 visitors. Say that in reality, visitors to form A have a 5% chance of applying (an application ratio of 0.05), whereas visitors to form B have an 8% chance of applying (an application ratio of 0.08). Of course, these real application ratios are unknown to us, but I'll use them for illustrative purposes.

Figure 1 shows how our estimate of the real application probability becomes more certain as we increase the number of visitors to either form A or B. In this example, we would need to send both forms to at least 682 visitors to be able to conclude, with 95% certainty, that form B is better.

Figure 1. Estimation of application ratios of forms A and B for an increasing number of visitors
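The convergence in Figure 1 is easy to reproduce with a small simulation. The sketch below, using the example's (normally unknown) true ratios of 0.05 and 0.08, draws visitors as Bernoulli trials and shows how the observed application ratio approaches the true one as the number of visitors grows; the function name and seeds are just illustrative choices.

```python
import random

def estimate_ratio(true_p, n_visitors, seed=42):
    """Simulate n_visitors seeing a form whose true application
    probability is true_p; return the observed application ratio."""
    rng = random.Random(seed)
    applications = sum(rng.random() < true_p for _ in range(n_visitors))
    return applications / n_visitors

# True ratios from the example: form A = 0.05, form B = 0.08.
for n in (100, 1000, 10000):
    est_a = estimate_ratio(0.05, n, seed=1)
    est_b = estimate_ratio(0.08, n, seed=2)
    print(f"n={n:5d}  form A ~ {est_a:.3f}  form B ~ {est_b:.3f}")
```

With small n the two estimates can easily overlap, which is exactly why the test in the example needs several hundred visitors per form before the difference becomes statistically convincing.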

Exploration vs Exploitation: the flaw of A/B testing

Now consider the following thought experiment: say that we already knew that form A and B have an application ratio of 0.05 and 0.08 respectively, but we still run the A/B test. How many applications would we miss by not always showing form B?

Figure 2 gives the answer to this question. It shows how the regret, which is the number of applicants we missed because we were also showing form A, increases with the number of visitors. During the exploration phase, in which we show both form A and B, the regret gradually increases as we are showing the suboptimal form A 50% of the time. The exploitation phase starts when we decide that we have sufficient evidence to show only form A or B. Since we still have a 5% chance of drawing the wrong conclusion, the regret keeps increasing during the exploitation phase.

Figure 2. Regret of A/B testing
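The regret during the exploration phase can be written down directly: every visitor sent to the suboptimal form A "costs" us, in expectation, the difference between the two application ratios. A minimal sketch, using the example's true ratios (the function name is my own):

```python
def expected_regret_ab(n_visitors, p_a=0.05, p_b=0.08):
    """Expected regret of a 50/50 A/B test during exploration:
    half the visitors see the suboptimal form A, and each of them
    costs p_b - p_a applications in expectation."""
    visitors_to_a = n_visitors / 2
    return visitors_to_a * (p_b - p_a)

# 2000 visitors split 50/50: 1000 see form A, each missing
# 0.03 applications in expectation, i.e. ~30 missed applicants.
print(expected_regret_ab(2000))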

Although we can benefit from the knowledge we obtain via an A/B test, this might come at a high cost: we first have to explore. But what if we don’t have much time or budget to go through this exploration phase and want results right away? In those scenarios, multi-armed bandits come into play.

Multi-armed bandits for recruitment

Imagine you are standing in a completely abandoned casino, in front of two slot machines, which together are also called a 'two-armed bandit'. Although you are not aware of this, the slot machine on your left, which we call 'slot machine A', pays a reward of €1,000 with probability 0.05, whereas the slot machine on your right, 'slot machine B', pays €1,000 with probability 0.08. You have the privilege of playing these slot machines for free for as long as you like. However, you are not planning to play these machines forever, so your objective is to obtain a cumulative reward of €100,000 as quickly as possible. How will you proceed?

As you might have noticed, this problem is completely analogous to the example we have used so far: replace the slot machines with application forms and the €1,000 reward with one applicant, and we have our original example back. The convenience of this analogy is that we can draw on the extensive body of research that has been done on multi-armed bandit problems.

Figure 3 shows how three well-known multi-armed bandit algorithms (UCB, ε-greedy, and decaying ε-greedy) perform compared to the A/B test. What these algorithms have in common is that they do not show forms A and B 50% of the time each, as in the A/B test, but try to infer early on which option is better, and show that option to more visitors. This results in a smaller regret during the experiment.

Figure 3. Comparison of regret for multi-armed bandit algorithms and A/B testing
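To make this concrete, here is a minimal sketch of the simplest of the three, ε-greedy: with probability ε it explores a random form, otherwise it exploits the form with the highest observed application ratio so far. The true ratios, ε value, and seed below are illustrative assumptions, not results from the figure.

```python
import random

def epsilon_greedy(true_ps, n_visitors, epsilon=0.1, seed=0):
    """Run an epsilon-greedy bandit over forms with the given true
    application probabilities. Returns (shows, wins) per form."""
    rng = random.Random(seed)
    shows = [0] * len(true_ps)  # times each form was shown
    wins = [0] * len(true_ps)   # applications each form received
    for _ in range(n_visitors):
        if rng.random() < epsilon or min(shows) == 0:
            # Explore: pick a random form (always explore until
            # every form has been shown at least once).
            arm = rng.randrange(len(true_ps))
        else:
            # Exploit: pick the form with the best observed ratio.
            ratios = [w / s for w, s in zip(wins, shows)]
            arm = ratios.index(max(ratios))
        shows[arm] += 1
        wins[arm] += rng.random() < true_ps[arm]
    return shows, wins

shows, wins = epsilon_greedy([0.05, 0.08], 10000)
print("visitors per form:", shows)
print("applications per form:", wins)
```

Because the better form quickly attracts most of the traffic, the suboptimal form is shown far less than the 50% an A/B test would give it, which is where the smaller regret in Figure 3 comes from. UCB and decaying ε-greedy follow the same idea but balance exploration and exploitation more carefully.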

Alternative to A/B test?

Multi-armed bandit algorithms try to overcome the high missed-opportunity cost involved in learning by exploiting and exploring at the same time. These methods are therefore particularly interesting when the experiment involves a high lost-opportunity cost, and when exploring and exploiting must be performed within a limited time interval. Conversely, if there are no costs involved in running an experiment, then there is no need to apply multi-armed bandit algorithms: A/B testing is still the fastest way to learn.