A/A testing is a methodology in which identical variations are tested against each other to validate the testing environment rather than to determine a winning variation. By ensuring that testing identical variations produces the expected results, experimenters can gain confidence in the reliability of subsequent A/B tests.
A/A tests can be particularly useful to:
- Verify correct data collection: Ensure implementation and tracking of events is aligned with your analytics platform or internal sources of truth.
- Verify correct traffic allocation: Ensure that traffic is allocated randomly according to the split, and avoid underlying biases.
- Evaluate the statistical engine: Assess whether the guarantees offered by the statistical method in use are satisfied. It is important to note that different methods offer different guarantees. For example:
- Frequentist methods used in null hypothesis testing promise to keep Type 1 errors (false positives) under a certain threshold, provided sample size requirements are satisfied, and only the intended metric is considered in the evaluation. Rejections of the null hypothesis in A/A tests are false positives, making A/A tests well suited to validate that the system keeps its promise.
- The Bayesian method used by Dynamic Yield does not focus on Type 1 error control, but rather tries to limit the loss (downlift) incurred by accepting a false discovery. In other words, it limits the negative consequences of a mistake, rather than limiting the number of mistakes. In A/A tests, where variations are identical, no loss is possible, and therefore A/A tests are not well suited to evaluate the system’s guarantees. In particular, because the system doesn't focus on false-positive control, there should be no expectation that false positives are bounded by a certain threshold. Learn more about this in the Why false-positive control is not our primary focus section.
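To make the distinction concrete, here is a minimal, illustrative sketch of an expected-loss calculation for a conversion metric. It is not Dynamic Yield's implementation: the counts, the normal approximation to the posterior, and the function names are assumptions made purely for illustration:

// Illustrative sketch only: Monte Carlo estimate of the Bayesian "expected loss"
// of declaring variation B the winner, using a normal approximation to the
// posterior of a conversion rate. This is NOT Dynamic Yield's implementation;
// it only shows that the controlled quantity is the expected downlift of a
// wrong decision, not the rate of wrong decisions.

// Standard normal sample via the Box-Muller transform.
function randNormal() {
  const u = 1 - Math.random();
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Approximate posterior draw of a conversion rate given conversions and visitors.
function samplePosteriorRate(conversions, visitors) {
  const mean = (conversions + 1) / (visitors + 2);            // smoothed estimate
  const sd = Math.sqrt((mean * (1 - mean)) / (visitors + 2)); // approximate posterior sd
  return mean + sd * randNormal();
}

// Expected loss of declaring B the winner: the conversion rate we expect to
// give up in the cases where B is in fact worse than A.
function expectedLossOfChoosingB(a, b, draws = 100000) {
  let totalLoss = 0;
  for (let i = 0; i < draws; i++) {
    const pA = samplePosteriorRate(a.conversions, a.visitors);
    const pB = samplePosteriorRate(b.conversions, b.visitors);
    totalLoss += Math.max(0, pA - pB); // loss is zero whenever B is at least as good
  }
  return totalLoss / draws;
}

// In an A/A test both "variations" share the same true rate, so even when one
// side looks slightly better, the expected loss of switching to it is tiny
// (a few hundredths of a percentage point with the placeholder counts below).
const loss = expectedLossOfChoosingB(
  { conversions: 510, visitors: 10000 },
  { conversions: 540, visitors: 10000 }
);
console.log(`Expected loss of declaring B the winner: ${(loss * 100).toFixed(3)} percentage points`);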
Setting up an A/A test
- Create a new Custom Code campaign by going to Site Personalization › New Campaign › Custom Code.
- Give your campaign a name (for example, "A/A Test 1").
- If you have an analytics integration (such as Google Analytics), make sure it is enabled, and click Next.
- In the targeting tab, click Next without changing any settings.
- In the variation tab, click New Variation and select the Custom Code template.
- In the JS tab, add the following code:
console.log('A/A test variation A');
- Click Save Variation.
- Click New Variation and create a second variation with the following code in the JS tab:
console.log('A/A test variation B');
- After saving the second variation, use the Allocation column to allocate 50% of the traffic to each variation.
- Use the default settings for the primary metric (for example, purchases). Also keep the default advanced settings: variations that are sticky for the user (multi-session) and an attribution window that starts when the variation is served and ends when the session ends.
- Click Next and set the experience status to Active.
- Click Save Experience and Publish. Don't worry, this won't impact your visitors' experience: users assigned to a variation will only trigger a console.log message in the browser.
- Navigate back to the campaigns page, find your newly created A/A test campaign, and click the Duplicate button to create an additional campaign called "A/A Test 2". Repeat the process until you have at least 20 campaigns.
Why duplicate the A/A test?
A single A/A test might be sufficient to spot issues with data collection or traffic allocation. However, to validate the reliability of the statistical engine, it is important to run as many A/A tests as possible because statistical guarantees are generally based on the premise of repeated trials. It's an industry standard to accept a small percentage of false positives, and creating multiple A/A tests ensures that you'll gain insight into the statistical engine across an A/B testing program as a whole.
For this reason, repeat the setup process described above as many times as possible (we recommend 20 campaigns). You can simply duplicate the test after you create the first one.
Evaluating A/A test results
After launching your A/A tests, we recommend collecting data for two weeks before evaluating the results. It is important to evaluate the results in the following order, as each step builds on the previous one:
Step 1: Verify data collection
Expectation: Collected data matches an external system of record with a discrepancy that is smaller than 5%.
If you see a larger discrepancy, consider the following:
- By default, Dynamic Yield excludes outliers from results, so turn off outlier exclusion when comparing against a system that does not exclude them.
- Ensure the comparison is valid. For example, if using Google Analytics, ensure the property aligns with the pages where the A/B testing platform script is implemented.
- Consider dual tracking: Monitor users in the analytics platform and purchases in the e-commerce platform analytics.
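The discrepancy in Step 1 is a simple relative difference between the two systems. Below is a minimal sketch; the counts and variable names are placeholders, not taken from any real report:

// Illustrative only: relative discrepancy between purchases counted by
// Dynamic Yield and by an external system of record (e.g. Google Analytics).
// The counts below are placeholders.
function discrepancyPercent(dyCount, externalCount) {
  return (Math.abs(dyCount - externalCount) / externalCount) * 100;
}

const purchasesInDY = 1180;        // purchases counted across the A/A test
const purchasesInAnalytics = 1225; // purchases for the same pages and period

console.log(`Discrepancy: ${discrepancyPercent(purchasesInDY, purchasesInAnalytics).toFixed(1)}%`);
// ~3.7%, which is within the 5% expectation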
Step 2: Verify allocation accuracy
Expectation: No sample ratio mismatch is detected. The split of traffic is similar to the split defined in the test setup. Due to randomness, the allocation will never be exactly as defined, but you can use this calculator to check whether there’s an issue by entering the number of users in each variation.
If you detect sample ratio mismatch, investigate whether there is any known automation running on your site (such as machine-generated traffic for QA purposes) or external bot traffic (such as scrapers), and address it if possible. If the issue persists, contact support.
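If you prefer to script this check rather than use a calculator, a one-degree-of-freedom chi-square test on the observed user counts is a common way to detect sample ratio mismatch. The sketch below assumes a 50/50 intended split and uses placeholder counts; it is a generic statistical check, not a Dynamic Yield feature:

// Illustrative sketch: chi-square test for sample ratio mismatch on a 50/50
// split. A very small p-value (commonly below 0.01) suggests the observed
// split is unlikely under the intended allocation. Counts are placeholders.
function srmChiSquare(usersA, usersB, expectedShareA = 0.5) {
  const total = usersA + usersB;
  const expectedA = total * expectedShareA;
  const expectedB = total * (1 - expectedShareA);
  const chi2 =
    (usersA - expectedA) ** 2 / expectedA +
    (usersB - expectedB) ** 2 / expectedB;
  // p-value for a chi-square statistic with 1 degree of freedom:
  // P(X > chi2) = erfc(sqrt(chi2 / 2))
  return { chi2, pValue: erfc(Math.sqrt(chi2 / 2)) };
}

// Complementary error function (Abramowitz & Stegun 7.1.26 approximation).
function erfc(x) {
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 +
    t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

const { chi2, pValue } = srmChiSquare(50120, 49880);
console.log(`chi-square: ${chi2.toFixed(3)}, p-value: ${pValue.toFixed(3)}`);
// p-value around 0.45 here, so there is no evidence of sample ratio mismatch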
Step 3: Evaluate the statistical engine
Expectation: Most tests don't have a winning variation declared. In standard A/B tests, you can expect approximately a 5% false-positive rate (that is, a declared winner that is not necessarily better, is possibly even worse, or, in most cases, simply has no advantage over the other variation). However, in A/A tests, because neither variation is better than the other, the chance of a false positive is about 10% (5% for each of the two variations).
If one of the variations has a Probability to Be Best score that is 95% or higher:
- Ensure you have successfully passed Steps 1 and 2 of the A/A test analysis.
- Check whether both the uplift and the credible interval on the uplift are small. In such cases, a false-positive declaration occurred, but the system evaluated the potential loss (downlift) of switching to the winning variation as minimal, and therefore produced a recommendation to switch. This is expected, given that the variations are identical and there is no possible loss in switching.
The Dynamic Yield stats engine prioritizes limiting expected loss rather than minimizing false-positive declarations, so you shouldn't expect the total number of declarations to be bounded by a certain threshold. However, if an unusually high number of declarations occurs (for example, 10 out of 20), and they are associated with extreme uplifts or downlifts, it might signal that the metric’s variance is higher than expected, or that there is an issue with the data generation or collection processes. In this case, contact support.
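To put "unusually high" in perspective, you can treat the number of tests that declare a winner as a rough binomial count, using the ~10% per-test figure from the expectation above as a reference point. This is only a reference, not a guarantee of the Bayesian engine, and the function names below are illustrative:

// Rough sanity check: if each A/A test independently declared a winner with
// probability p (about 10%, per the expectation above), how surprising is it
// to see k or more declarations out of n tests? This is only a reference
// point; the Bayesian engine does not promise to stay under any such bound.
function binomialCoefficient(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

function probabilityOfAtLeast(k, n, p) {
  let prob = 0;
  for (let i = k; i <= n; i++) {
    prob += binomialCoefficient(n, i) * p ** i * (1 - p) ** (n - i);
  }
  return prob;
}

// With 20 A/A tests and a ~10% chance per test, a couple of declarations are
// entirely normal, while 10 or more would be extremely unlikely by chance.
console.log(probabilityOfAtLeast(2, 20, 0.1).toFixed(3));  // ~0.608
console.log(probabilityOfAtLeast(10, 20, 0.1).toFixed(6)); // ~0.000007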
Why false-positive control is not our primary focus
In a frequentist approach, A/A tests are used to ensure that the false-positive rate stays within the desired threshold. For instance, if you conduct a large number of A/A tests with a 95% significance threshold (and don't compare multiple metrics), no more than 5% should reject the null hypothesis.
However, the Dynamic Yield statistical engine uses the Bayesian method, which doesn't center on false-positive control. Instead, it aims to minimize expected loss. To clarify: You might not care about making a wrong decision often if the decision has little to no impact. On the other hand, you would probably be unhappy with even a small number of wrong decisions if those decisions have a massive impact.
While controlling the false-positive rate is crucial in scientific research to avoid publishing false discoveries (say, a cure that doesn't work), it's less important in a business context and potentially even counterproductive. Expected loss balances these considerations by weighing the frequency of incorrect decisions against their potential negative consequences. In simpler terms, we prioritize decisions based on both the frequency and the magnitude of their potential negative impact. This is why the Bayesian approach has gained so much popularity in recent years and is used by many modern A/B testing platforms.
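To see the frequentist benchmark described at the start of this section in action, the sketch below simulates repeated A/A tests on a conversion metric and applies a generic two-sided, two-proportion z-test at the 5% level. The true conversion rate and sample sizes are arbitrary, and the test shown is a textbook z-test, not Dynamic Yield's statistical engine (contrast it with the expected-loss sketch earlier in this article):

// Illustrative simulation of the frequentist guarantee: in repeated A/A tests,
// a two-sided two-proportion z-test at the 5% level rejects the null (a false
// positive, since the variations are identical) about 5% of the time.
// Parameters are arbitrary; this is a generic z-test, not Dynamic Yield's engine.
function simulateOneAATest(trueRate, usersPerVariation) {
  let convA = 0;
  let convB = 0;
  for (let i = 0; i < usersPerVariation; i++) {
    if (Math.random() < trueRate) convA++;
    if (Math.random() < trueRate) convB++;
  }
  const pA = convA / usersPerVariation;
  const pB = convB / usersPerVariation;
  const pooled = (convA + convB) / (2 * usersPerVariation);
  const se = Math.sqrt(pooled * (1 - pooled) * (2 / usersPerVariation));
  const z = se > 0 ? (pB - pA) / se : 0;
  return Math.abs(z) > 1.96; // true -> null rejected, i.e. a false positive
}

let falsePositives = 0;
const simulations = 2000;
for (let i = 0; i < simulations; i++) {
  if (simulateOneAATest(0.05, 5000)) falsePositives++;
}
console.log(`False-positive rate: ${((falsePositives / simulations) * 100).toFixed(1)}%`); // ~5%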