A/B tests are experiences that use the A/B test allocation method and have at least two active variations. You can view a report for each experience by clicking a campaign in one of the campaign lists and clicking any experience.
Result types
Uplift
Uplift is the percent difference between the performance of a variation and the performance of a baseline variation (usually the control group). For example, if one variation has a revenue per user of $5, and the control has a revenue per user of $4, the uplift is 25%.
Note: Uplift isn't calculated until the baseline variation has at least one conversion for the metric.
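For reference, the uplift calculation boils down to a single line of arithmetic. The snippet below is a minimal illustration of the formula described above (not Dynamic Yield code), using the numbers from the example:

```python
def uplift(variation_value: float, baseline_value: float) -> float:
    """Percent difference between a variation's metric and the baseline's."""
    return (variation_value - baseline_value) / baseline_value * 100

# Example from the text: $5 revenue per user vs. $4 for the control.
print(uplift(5.0, 4.0))  # 25.0 (%)
```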
Probability to Be Best
Probability to Be Best is the chance that a variation outperforms all other variations. It's the most actionable metric in the report and is used to define the winner of an A/B test. Whereas uplift can fluctuate due to chance when sample sizes are small, probability to be best takes sample size into account (it's calculated using a Bayesian approach).
Note: Probability to be best doesn't begin calculating until there are at least 30 converters and 1000 samples (users, sessions, or pageviews, depending on the selected stickiness) for every active variation in the A/B test. For purchases or revenue per user metrics, converters are distinct users who've made a purchase. For a metric like CTR, converters are clicks.
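To make the idea concrete, here is a minimal sketch of how probability to be best could be approximated for a conversion-rate metric: sample plausible rates from a Bayesian posterior for each variation and count how often each one comes out on top. This illustrates the general Bayesian approach, not the stats engine's actual model, and the converter and sample counts are invented:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical data: converters and samples per variation (made-up numbers).
data = {
    "Control":     {"converters": 120, "samples": 4000},
    "Variation B": {"converters": 150, "samples": 4100},
    "Variation C": {"converters": 135, "samples": 3900},
}

# Sample conversion rates from a Beta(1 + converters, 1 + non-converters)
# posterior for each variation (uniform prior), then count how often each
# variation has the highest sampled rate.
draws = 100_000
samples = np.column_stack([
    rng.beta(1 + d["converters"], 1 + d["samples"] - d["converters"], draws)
    for d in data.values()
])
best_counts = np.bincount(samples.argmax(axis=1), minlength=len(data))

for name, count in zip(data, best_counts):
    print(f"{name}: probability to be best ≈ {count / draws:.1%}")
```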
Probability to Beat Control
Probability to Beat Control is the chance that a variation outperforms the baseline. It's equivalent to probability to be best, except that each variation competes only against the baseline rather than against all other variations. This is useful in tests with more than two variations, where multiple variations might outperform the control but perform similarly to each other: in that case, no single variation has a high probability to be best, yet each can have a high Probability to Beat Control.
In summary, Probability to Be Best and Probability to Beat Control answer the question “Which variation is better?” while uplift answers the question “By how much?”.
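Here is a sketch of probability to beat control under the same assumptions as the earlier example (Beta posteriors for a conversion-rate metric, invented counts). Each variation's samples are compared only against the baseline's:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
draws = 100_000

def posterior(converters: int, samples: int) -> np.ndarray:
    """Draw conversion-rate samples from a Beta posterior with a uniform prior."""
    return rng.beta(1 + converters, 1 + samples - converters, draws)

control = posterior(120, 4000)   # baseline (made-up counts)
variations = {"B": posterior(150, 4100), "C": posterior(135, 3900)}

for name, v in variations.items():
    print(f"Variation {name}: probability to beat control ≈ {(v > control).mean():.1%}")
```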
Credible intervals
A credible interval is a range that captures the likely true value of a metric with a certain probability. Credible intervals are the Bayesian counterpart to frequentist confidence intervals, but unlike the latter, they can be interpreted at face value: A 95% credible interval contains the likely true value of the metric with 95% probability. Credible intervals of 95% and 50% probability are displayed for both the metric (primary or secondary) and the uplift.
While both the metric and uplift are estimates calculated directly from the data collected, credible intervals are an output of the statistics engine and represent our certainty about the estimate. For example, we might estimate the purchases per user for a variation at 0.05. This estimate would be the same whether there are 5 purchases and 100 users or 500 purchases and 10,000 users, though intuitively, the second estimate might seem more credible because it's based on more evidence. Credible intervals effectively capture how credible our estimate is, given the amount of evidence we have collected.
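As a rough illustration (again assuming a Beta posterior for a conversion-rate metric, not the stats engine's exact model), the example above can be reproduced: both datasets yield the same 0.05 point estimate, but the interval is much narrower when there is more evidence:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def credible_interval(converters: int, samples: int, level: float = 0.95):
    """Equal-tailed credible interval for a conversion rate (Beta posterior, uniform prior)."""
    draws = rng.beta(1 + converters, 1 + samples - converters, 100_000)
    return np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])

# Both datasets give the same point estimate (0.05) but different amounts of evidence.
for converters, samples in [(5, 100), (500, 10_000)]:
    lo, hi = credible_interval(converters, samples)
    print(f"{converters}/{samples}: 95% credible interval ≈ [{lo:.3f}, {hi:.3f}]")
```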
Basic analysis
The report's version overview indicates whether a winner has been declared for all users or for one of your primary audiences.
A winner is declared if the following conditions are met:
- One variation has a probability to be best score above 95% (the threshold can be changed using the winner significance level setting).
- The minimum test duration has passed (the default of 2 weeks can be changed using the test duration settings). This is designed to limit the effect of daily fluctuations on the results.
- The expected loss for the variation is lower than 1%. Expected loss can be interpreted as the average uplift you would lose if you deployed a variation and it turned out not to be the best. It effectively weighs how much loss you'd incur (the downlift) by the probability of that loss occurring (that is, the probability that the variation is not actually the best). A rough sketch of this calculation follows the note below.
Note: Expected loss is calculated by our stats engine for the purpose of declaring winning variations, but is not displayed in the Experience OS console.
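Although expected loss isn't displayed, the idea can be sketched from posterior samples: in each draw, measure how far the candidate falls short of the best competing variation, then average that shortfall. The snippet below is a simplified, hypothetical illustration (Beta posteriors, invented counts, one of several ways to define the relative loss), not the stats engine's implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
draws = 100_000

def posterior(converters: int, samples: int) -> np.ndarray:
    """Conversion-rate samples from a Beta posterior with a uniform prior."""
    return rng.beta(1 + converters, 1 + samples - converters, draws)

candidate = posterior(150, 4100)                                     # variation we'd deploy
best_rival = np.maximum(posterior(120, 4000), posterior(135, 3900))  # best competitor per draw

# Relative shortfall versus the best competitor, counted only in draws where the candidate loses.
shortfall = np.maximum(best_rival - candidate, 0) / candidate
print(f"Expected loss ≈ {shortfall.mean():.2%}")  # compare against the 1% threshold
```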
Secondary metrics analysis
While the winners of each test are based on the primary metric, Dynamic Yield also measures additional metrics called secondary metrics. You don't need to select secondary metrics in advance. They're tracked automatically and are available in your experience reports. We recommend reviewing the secondary metrics before applying the winning variation to all users for the following reasons:
- It can prevent mistakes (for example, your primary metric is CTR, but the winning variation might reduce purchases, revenue, or AOV).
- It can lead to interesting insights (for example, purchases per user dropped, but AOV increased, meaning the variation led to users purchasing fewer but more expensive products and overall generating more revenue).
For each secondary metric, look at the uplift and probability to be best scores to see how each variation performed.
After your analysis, you can determine whether to serve the winning variation to all your traffic, or adjust your allocation based on what you've learned.
Audience breakdown analysis
A good way to dig deeper is to break down your results by audience. This can help answer questions such as:
- How did traffic from different sources behave in the test?
- Which variation won for mobile and which for desktop?
- Which variation was most effective for new users?
We recommend selecting audiences that are meaningful to your business, as well as audiences that are likely to have different intent.
For each audience, look at the uplift and probability to be best to see how each variation performed.
After your analysis, you can determine whether to serve the winning variation to all of your traffic, or adjust your targeting based on what you've learned.
Note: Audience breakdown considers users to be part of an audience only if they were in that audience at the moment of their first interaction with the test version. If a user enters the audience after interacting with the variation for the first time, the user is not considered part of that audience in the audience breakdown.
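To illustrate that rule, here is a hypothetical sketch of the attribution logic: the audiences recorded for a user are frozen at their first interaction with the test version. All data structures and field names below are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Exposure:
    user_id: str
    timestamp: int        # time of an interaction with the test version
    audiences: frozenset  # audiences the user belonged to at that moment

def audience_at_first_interaction(exposures: list[Exposure]) -> dict[str, frozenset]:
    """Map each user to the audiences they were in at their FIRST interaction."""
    first_seen: dict[str, Exposure] = {}
    for e in sorted(exposures, key=lambda e: e.timestamp):
        first_seen.setdefault(e.user_id, e)  # keep only the earliest exposure
    return {user: e.audiences for user, e in first_seen.items()}

# A user who joins "Returning Users" only after their first interaction
# is not counted in that audience for the breakdown.
log = [
    Exposure("u1", 10, frozenset({"Mobile"})),
    Exposure("u1", 20, frozenset({"Mobile", "Returning Users"})),
]
print(audience_at_first_interaction(log))  # {'u1': frozenset({'Mobile'})}
```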
Predictive targeting
Sometimes, you might see a message that a personalization opportunity was detected. This means there is a way to increase uplift by serving a specific audience with one of the losing variations, instead of serving the winning variation to all of your traffic. You can analyze the report based on this specific audience to understand their behavior. For more details, see Predictive Targeting.
What if a test doesn’t reach significant results?
Tests might take some time to reach significance, depending on how much traffic they receive. However, every now and then, you might see a test that's been running for a long time without reaching statistically significant results. Here are some recommended actions you can take to still learn from the test:
- Explore secondary metrics: A test might not reach significant results in the primary metric, but if one of the variations performs significantly better in the secondary metrics, it might still be optimal to serve it to all users.
- Explore the audience breakdown: Different audiences can prefer different variations, and sometimes two variations appear to cancel each other out in the overall results but tell a different story when you zoom in. For example, if one variation is better for mobile and the other is better for desktop, there might not be a clear overall winner, but broken down by audience, each variation can provide uplift.
- Identify the losing variations: If you're testing 3 or more variations and one of them is performing very poorly after the minimum test duration has passed, you should apply one of the leading variations. To determine whether a losing variation has reached significance as a loser, use the following formula: (1 / number of variations) / 10. So if there are 3 variations, a PTBB of approximately 3.3% or lower marks a statistically significant loser; for 4 variations, approximately 2.5% or lower is required (see the sketch after this list).
- Identify variations that beat the control group: If your test includes a control group or has a baseline variation, you can look at each variation's probability to beat control. Look for variations with statistically significant scores that are higher than the significance level set for declaring a winner, and take action accordingly.
- Compare the variation metrics using credible intervals: The true value of each metric is likely to lie within its credible interval. Compare the variations' ranges to identify which one is more likely to perform best.
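As a quick companion to the loser-threshold rule above, here is a tiny sketch that applies the (1 / number of variations) / 10 formula to a set of hypothetical probability to be best scores:

```python
def loser_threshold(num_variations: int) -> float:
    """PTBB threshold below which a variation can be considered a significant loser."""
    return (1 / num_variations) / 10

# Hypothetical PTBB scores for a 3-variation test.
ptbb = {"Control": 0.45, "Variation B": 0.52, "Variation C": 0.03}
threshold = loser_threshold(len(ptbb))  # ≈ 0.033 for 3 variations

for name, score in ptbb.items():
    if score <= threshold:
        print(f"{name} is a statistically significant loser (PTBB {score:.0%} <= {threshold:.1%})")
```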