A/B tests are experiences that use the A/B test allocation method and have at least two active variations. You can view a report for each experience by clicking a campaign in one of the campaign lists and then clicking any experience.
Uplift vs. Probability to Be Best
Before analyzing a report, it is important that you understand these two key metrics.
- Uplift: The difference between the performance of a variation and the performance of a baseline variation (usually the control group). For example, if one variation has a revenue per user of $5, and the control has a revenue per user of $4, the uplift is 25%.
  Notice: Uplift is not calculated until the baseline variation has at least one conversion for the metric.
- Probability to Be Best: The chance that a variation will have the best performance in the long term. This is the most actionable metric in the report, and it is used to determine the winner of A/B tests. Whereas uplift can vary by chance when sample sizes are small, the probability to be best takes sample size into account (based on the Bayesian approach). The probability to be best is not calculated until there have been at least 30 conversions or 1,000 samples.
Put simply, the Probability to Be Best answers the question “Who is better?”, and uplift answers the question “By how much?”
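As an illustration only (this is not Dynamic Yield's exact implementation), the following sketch computes both metrics for a simple conversion-rate metric, assuming a Bayesian model with a Beta posterior per variation:

```python
import numpy as np

def uplift(variation_value: float, baseline_value: float) -> float:
    """Relative difference between a variation and the baseline,
    e.g. $5 vs. $4 revenue per user -> 0.25 (25%)."""
    return (variation_value - baseline_value) / baseline_value

def probability_to_be_best(conversions, samples, n_draws=100_000, seed=0):
    """Monte Carlo estimate of each variation's chance of having the
    highest true conversion rate, using Beta(1 + conversions,
    1 + failures) posteriors for every variation."""
    rng = np.random.default_rng(seed)
    draws = np.stack([
        rng.beta(1 + c, 1 + (n - c), size=n_draws)
        for c, n in zip(conversions, samples)
    ])
    best = np.argmax(draws, axis=0)          # index of the best variation per draw
    return np.bincount(best, minlength=len(conversions)) / n_draws

print(uplift(5.0, 4.0))                                  # 0.25
print(probability_to_be_best([120, 150], [2000, 2000]))  # roughly [0.03, 0.97]
```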
Basic Analysis
Check the top of your A/B test report to see whether a winner has been declared, or for other information such as which variation is currently winning.
A winner will be declared if the following conditions are met (a simple sketch of this check appears after the list):
- One variation has a Probability to Be Best score above 95% (the threshold can be changed using the winner significance level setting).
- The minimum test duration has passed (default is 2 weeks). This is designed to make sure the results are not affected by seasonality.
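As a rough sketch of this decision rule (the parameter names are illustrative and the defaults are the ones mentioned above):

```python
def is_winner(prob_to_be_best: float, days_running: int,
              significance_level: float = 0.95,
              min_duration_days: int = 14) -> bool:
    """Declare a variation the winner only if its Probability to Be Best
    exceeds the significance threshold and the minimum test duration
    (default 2 weeks) has passed, to guard against seasonality."""
    return prob_to_be_best > significance_level and days_running >= min_duration_days

print(is_winner(0.97, days_running=16))  # True
print(is_winner(0.97, days_running=5))   # False: minimum duration not yet reached
```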
Secondary Metrics Analysis
While the winner of each test is based on the primary metric, Dynamic Yield also measures additional metrics called secondary metrics. There is no need to select secondary metrics in advance; they are tracked automatically and available in your experience reports. We recommend reviewing the secondary metrics before applying the winning variation to all users, for a few reasons:
- It can save you from making a mistake (e.g., your primary metric is CTR, but the winning variation may reduce purchases, revenue, or AOV).
- It can lead to interesting insights (e.g., purchases per user dropped but AOV increased, meaning the variation led users to buy fewer but more expensive products, generating more revenue overall).
For each secondary metric, look at the uplift and probability to be best scores to see how each variation performed.
After your analysis, you can determine if you should serve all of your traffic with the winning variation, or adjust your allocation based on what you have learned.
Audience Breakdown Analysis
Another good way to dig deeper is to break your results down by audience. This can help answer questions such as:
- How did traffic from different sources behave in the test?
- Which variation won for mobile and which for desktop?
- Which variation was most effective for new users?
We recommend selecting audiences that are meaningful to your business, as well as audiences that are likely to have different intent.
For each audience, look at the uplift and probability to be best scores to see how each variation performed.
After your analysis, you can determine if you should serve all of your traffic with the winning variation, or adjust your allocation based on what you have learned.
Note: A user's data appears in the audience breakdown only if the user belonged to the selected audience at the moment of their first impression in the test. If a user entered the audience after viewing the variation for the first time, any data related to that user will not appear in the breakdown.
Predictive Targeting
Sometimes, you might see a message that a personalization opportunity was detected. This means there is a way to increase uplift by serving a specific audience with one of the losing variations, instead of serving all of your traffic with the winning variation. You can click Apply to accept the analysis and adjust your traffic. For details, see Predictive Targeting.
What If a Test Doesn’t Reach Significant Results
Tests may take some time to reach significance, depending on how much traffic they receive. However, every now and then you might see a test that has been running for a long time without reaching a statistically significant result. Here are three recommended actions to take in such cases, to ensure you still gain insights from the test:
- Explore Secondary Metrics: A test might not reach significant results in the primary metric, but if one of the variations performs significantly better on a secondary metric, it might still be optimal to serve it to all users.
- Explore Audience Breakdown: Different audiences prefer different variations, and sometimes two variations appear to cancel each other out when looking at the big picture. For example, if one variation is better for mobile and the other is better for desktop, you might see that no variation is a clear overall winner. However, if you break down the results by audience, you will see that each audience has a variation that results in an uplift.
- Identify the losing variation(s): If you are testing 3 or more variations, and one of them is performing very poorly after the minimum test duration has passed, you should apply one of the leading variations. To determine whether a losing variation has reached significance as a loser, use the following formula: (1 / number of variations) / 10.
So if there are 3 variations, a PTBB of 3% or lower is statistically significant; for 4 variations, 2% or lower is required, as illustrated in the sketch below.
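A minimal sketch of this rule of thumb (the function name and structure are illustrative, not part of the product):

```python
def loser_significance_threshold(num_variations: int) -> float:
    """Probability to Be Best at or below which a variation can be
    considered a statistically significant loser: (1 / variations) / 10."""
    return (1 / num_variations) / 10

print(loser_significance_threshold(3))  # 0.0333...
print(loser_significance_threshold(4))  # 0.025
```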