Dynamic Yield uses a score called the Probability to be Best to determine how to allocate variations across your site's traffic. The score is based on a Bayesian statistical approach.
This article describes the reason Dynamic Yield uses the Bayesian approach, and gets into some of the heavy math behind how it works.
Frequentist vs. Bayesian
In the field of statistical inference, there are two very different, yet mainstream, schools of thought: the frequentist approach and the Bayesian approach.
The difference between these two rival schools can be explained through the different interpretations each gives to the term probability. The intro given here is adapted from this series of blog posts.
Let’s take a concrete case-in-point: say we are interested in discovering the average height of American citizens nowadays. For a frequentist, this number is unknown but fixed. This is a natural intuitive view, as you can imagine that if you go through all American citizens one by one, measure their height and average the list, you will get the actual number.
However, since you do not have access to all American citizens, you take a sample of, say, a thousand citizens, measure and average their height to produce a point estimate, and then calculate the estimate of your error. The point is that the frequentist looks at the average height as a single unknown number.
A Bayesian statistician, however, would have an entirely different take on the situation. A Bayesian would look at the average height of an American citizen not as a fixed number, but instead as an unknown distribution (you might imagine here a “bell” shaped normal distribution).
The Bayesian engine provides answers to questions such as:
What is the probability that A is better than B? (contrast this with the contrived p-value mentioned earlier)
If I declare B as a winner, and it is not really better, how much should I expect to lose in terms of conversion rate?
Let’s summarize how the two frameworks compare:
Hypothesis Testing | Bayesian A/B Testing | |
---|---|---|
Knowledge of Baseline Performance | Required | Not Required |
Intuitiveness | Less, as p-value is a convoluted term | More, as we directly calculate the probability of A being better than B |
Sample size | Pre-defined | No need to pre-define |
Peeking at the data while the test runs | Not allowed | Allowed (with caution) |
Quick to make decisions | Less, as it has more restrictive assumptions on distributions | More, as it has less restrictive assumptions |
Representing uncertainty | Confidence Interval (again, a convoluted interpretation which is often misunderstood) | Highest Posterior Density Region – highly intuitive interpretation |
Declaring a winner | When sample size is reached and p-value is below a certain threshold | When either “probability to be best” goes above a threshold or the expected loss is below a threshold (in which case a “tie” can be declared between multiple variations) |
I should note though that generally speaking, a frequentist A/B testing framework which performs as well as the Bayesian framework described here is possible, but further development would be needed above what is usually implemented.
Bayesian Statistics and Probability: How it Breaks Down
Initially, the Bayesian statistician has some basic prior knowledge which is being assumed: for example, that the average height is somewhere between 50cm and 250cm.
Then, the Bayesian begins to measure heights of specific American citizens, and with each measurement updates the distribution to become a bit more “bell shaped” around the average height measured so far. As more data is collected, the “bell” becomes sharper and more concentrated around the measured average height.
For Bayesians, probabilities are fundamentally related to their knowledge about an event. This means, for example, that in a Bayesian view, we can meaningfully talk about the probability that the true conversion rate lies in a given range, and that probability codifies our knowledge of the value based on prior information and/or available data.
For Bayesians, the concept of probability is extended to cover degrees of certainty about any given statement on reality. However, in a strict frequentist view, it is meaningless to talk about the probability of the true conversion rate. For frequentists, the true conversion rate is by definition a single fixed number, and to talk about a probability distribution for a fixed number is mathematically nonsensical.
The same logic applies when seeking to measure the conversion rate of a web-based purchase funnel. Sure, probability can certainly be estimated in a frequentist fashion by measuring the ratio of how many times a conversion was made out of a huge number of trials. But this is not fundamental to the Bayesian, who can stop the test at any point and calculate probabilities from data.
To illustrate the convergence process of the distribution as more data is collected, here is a plot based on test data. Notice how the bell shape becomes sharper (more certain) as data streams in:
The surprising thing is that this arguably subtle difference in philosophy between these schools leads in practice to vastly different approaches to the statistical analysis of data.
Declaring a Winner
Once the posterior distributions are mapped for the variations, to conclude a winner you sample a large amount of observations. Let’s say hypothetically you sample 1000 times from 2 variations looking like this:
and 999 times out of the 1000 times you see the orange variation having higher probability. So the probability that variation orange is better than variation blue will be: (999/1000) * 100 which is 99.9%. In dynamic yield we sample 300,000 times from every variation to calculate the probability to be best.
The Mathematics Behind the Bayesian Model
In Bayesian the real mean is a distribution, but the observations are fixed, which models real-life behavior much better. To be more precise, taking the case of a Bernoulli distribution, the probability mass function (pmf) is defined as:
π being the probability of clicking. Here, according to the Bayesian approach, Pi should also have a distribution of its own, its own parameters etc..
To calculate the mean click-through rate, similar to the maximum likelihood mean value in a traditional A/B test, we try to solve for the value in the below equation:
Where X = the observed data
We apply the Bayesian conditional probability equation:
Here, p(X) can be treated as a normalizing constant, given its independence from π.
Therefore:
- p(π)= probability of click before the experiment began - the Prior
- p(X|π)= the observed data samples - the Likelihood
- p(π|X) = the probability of click after observing the sample – the Posterior
Prior, Likelihood & Posterior Probability Distributions
Prior probability is the probability to click on a variation before any sample data is collected, likelihood is the probability distribution of the collected sample data & posterior probability probability is the probability to click a variation after the taking the sampled data observations into account.
In Bayesian probability theory, if the posterior distribution has the same probability distribution as the prior probability distribution given a likelihood function, then the prior and posterior are called conjugate distributions and the prior is called a conjugate prior for the likelihood function.
For example, in our case: if the likelihood function is Bernoulli distributed, choosing a beta prior over the mean will ensure that the posterior distribution is also beta distributed. So in essence the beta distribution is a conjugate prior for the likelihood that is Bernoulli distributed!
As we are dealing with a Bernoulli distribution, we only have to deal with one random variable (). So assuming our likelihood function follows a prior-beta distribution:
Assuming the experiment begins with no prejudice, a beta distribution for the prior with α=1; β=1would be a good starting point as beta(1,1) is a uniform distribution.
Grouping the similar terms together:
We can see that the posterior is simply a beta distribution of the form:
Where:
which is exactly the same as our prior probability distribution, which was:
Thus, confirming the conjugate priors concept for binary outcomes.
Therefore to solve for a posterior probability for binary outcomes the blueprint would be:
Blueprint:
- Beta(1,1): assume the prior distribution is uniform
- Sample x1
- Beta(1+x1 ,1+1- x1): Update the beta distribution to account for the sample observed data,
- Sample x2
- Beta(1+x2 ,1+2- x2)
- Repeat From Step 2
In the end we reach a beta distribution that progresses from a uniform distribution to a skinny normal distribution.
Sample Implementation in Python
This script below gives a demo how the Bayesian probability changes over time as the number of samples increase.
from __future__ import print_function, division
#! usr/bin/env python"
__author__ = "SivaGabbi"
__copyright__ = "Copyright 2019, Dynamic Yield"
from builtins import rangev
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta
NUM_TRIALS = 2000
CLICK_PROBABILITIES = [0.35,0.75]
class Variation(object):
def __init__(self, p):
self.p = p
self.a = 1
self.b = 1
def showVariation(self):
return np.random.random() < self.p
def sampleVariation(self):
return np.random.beta(self.a, self.b)
def updateVariation(self, x):
self.a += x
self.b += 1 - x
def plot(variations, trial):
x = np.linspace(0, 1, 200)
for b in variations:
y = beta.pdf(x, b.a, b.b)
plt.plot(x, y, label="real p: %.4f" % b.p)
plt.title("variation distributions after %s trials" % trial)
plt.legend()
plt.show()
def experiment():
variations = [Variation(p) for p in CLICK_PROBABILITIES]
sample_points = [5,10,20,50,100,200,500,1000,1500,1999]
for i in range(NUM_TRIALS):
# take a sample from each variation
bestv = None
maxsample = -1
allsamples = []
for v in variations:
sample = v.sampleVariation()
allsamples.append("%.4f" % sample)
if sample > maxsample:
maxsample = sample
bestv = v
if i in sample_points:
print("current samples: %s" % allsamples)
plot(variations, i)
# show the variation with the largest sample
x = bestv.showVariation()
# update the distribution for the variation which was just sampled
bestv.updateVariation(x)
if __name__ == "__main__":
experiment()