Pitfalls of A/B testing
The advent of the internet has given rise to application-based companies that frequently release updates to optimize user interaction. To decide on these changes, companies are increasingly relying on statistical experimentation, and A/B testing has been the most common method among them.
A/B testing is a method of testing a new feature (more commonly called the treatment) against an existing feature (commonly called the control) by randomly assigning users to either the control or the treatment group and then measuring their response. It helps companies build and test hypotheses based on this feedback and continuously improve the platform based on the results.
Despite its popularity, A/B testing suffers from serious pitfalls, both in its inherent nature and in the fashion in which it is implemented at most companies:
1. Running the test for a short period of time
In order to speed up the process of improving the platform, managers make the mistake of stopping the experiment too soon. They receive an initial result and suspend the experiment before it has run its course. It is possible that a feature shows an improvement or decline in the success metric in the short run but that the effect changes in the long run. This shortsightedness leads to incorrect inference, but it can be easily remedied by letting the experiment run its course.
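The cost of stopping early can be seen in a small simulation. Below is a sketch (hypothetical parameter values, not any company's actual setup) of an A/A experiment, where both arms are identical: a manager who "peeks" at a significance test after every batch of users and stops at the first significant-looking result will declare far more than the nominal 5% of these no-effect experiments a win.

```python
import random

random.seed(0)

def peeking_false_positive_rate(n_sims=500, n_steps=20, batch=100, z_crit=1.96):
    """A/A simulation: both arms draw from the same distribution, yet an
    experimenter who checks a z-test after every batch and stops at the
    first 'significant' result reports many spurious wins."""
    false_positives = 0
    for _ in range(n_sims):
        a_sum = b_sum = n = 0
        for _ in range(n_steps):
            # Both arms are Bernoulli(0.5): there is no true effect.
            a_sum += sum(random.random() < 0.5 for _ in range(batch))
            b_sum += sum(random.random() < 0.5 for _ in range(batch))
            n += batch
            p_pool = (a_sum + b_sum) / (2 * n)
            se = (2 * p_pool * (1 - p_pool) / n) ** 0.5
            if se > 0 and abs(a_sum / n - b_sum / n) / se > z_crit:
                false_positives += 1  # stopped early on a spurious "win"
                break
    return false_positives / n_sims

rate = peeking_false_positive_rate()
```

With repeated peeking, the realized false-positive rate lands well above the 5% a single fixed-horizon test would give, which is exactly why the experiment should be allowed to run its planned course.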
2. Taking averages as the success metrics
When we look at averages, we are essentially assigning a single value to all users taking part in the experiment, ignoring the variation in how users in the treatment group receive the new feature. It is quite possible that most of the observed difference in user behavior is driven by a small percentage of the total user base. Devising recommendations based on a metric that takes average values can degrade the user experience and hence jeopardize the product. Different ways to remedy this are:
➢ Choose a non-average success metric
Clicks are one of the most important metrics tracked by companies while running an A/B experiment. Instead of defining a metric like the average number of clicks per user, we can define a click-through rate (the percentage of all visitors to the platform who clicked) or the total number of clicks. This helps us avoid the erroneous conclusion we might reach using average clicks per user when a small number of users are using the feature at a much higher rate than the rest of the user base.
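A toy example (entirely hypothetical click counts) makes the distortion concrete: a handful of power users can make the treatment's average look like a win even when the typical user clicks less.

```python
from statistics import mean, median

# Hypothetical click counts: 100 users per arm; in the treatment arm five
# power users click heavily while the typical user actually clicks less.
control = [2] * 100                   # every control user clicks twice
treatment = [1] * 95 + [60] * 5       # five power users dominate the arm

avg_control = mean(control)           # 2.0
avg_treatment = mean(treatment)       # 3.95 -- the average says "win"

# Non-average views of the same data tell the opposite story:
median_control = median(control)      # 2
median_treatment = median(treatment)  # 1 -- the typical user clicks less
```

Here the average would recommend shipping the feature, while a median- or rate-based metric reveals that most users are worse off.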
➢ Segmented A/B testing
Traditional A/B testing can be modified to run in segments. Social media companies such as Facebook and LinkedIn run their tests treating a group of users, rather than a single user, as the experimental unit. This can be combined with a non-average success metric to strengthen the causal inference.
➢ Modified A/B testing
There are many avatars of A/B testing that can be used depending on which assumptions are violated. One unique example of such a modification is interleaved A/B testing, where each user interacts with results from multiple algorithms together. This helps the organization quickly filter out the bad algorithms, after which traditional A/B testing can be applied to the reduced set of options.
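The core mechanic of interleaving can be sketched as follows (a simplified alternating merge, not any production system's algorithm): two rankings are merged onto one page, each slot remembers which algorithm supplied it, and clicks are credited accordingly.

```python
def interleave(ranking_a, ranking_b, k=6):
    """Alternate picks from two rankings, skipping duplicates, and record
    which algorithm contributed each slot so clicks can be attributed."""
    merged, credit = [], {}
    ia = ib = 0
    turn_a = True
    while len(merged) < k and (ia < len(ranking_a) or ib < len(ranking_b)):
        if turn_a and ia < len(ranking_a):
            item, who = ranking_a[ia], "A"
            ia += 1
        elif ib < len(ranking_b):
            item, who = ranking_b[ib], "B"
            ib += 1
        else:
            item, who = ranking_a[ia], "A"
            ia += 1
        if item not in credit:  # skip results both algorithms returned
            merged.append(item)
            credit[item] = who
        turn_a = not turn_a
    return merged, credit

merged, credit = interleave(["a", "b", "c", "d"], ["c", "a", "e", "f"])
```

The algorithm whose contributed results collect more clicks wins the comparison, using every user's session rather than splitting traffic.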
3. Falsely assuming stable unit treatment value assumption (SUTVA)
Managers often assume that there is no interference (this is the SUTVA assumption), i.e., that each user responds only to their own treatment and is not affected by other users in the same group or a different one. Violation of this assumption leads to biased estimates. There are two main ways in which SUTVA is violated:
➢ Interference effects
For O2O companies (companies with two-sided markets) like Uber and DoorDash, a user's response to the feature is affected by how the experimental units are randomly assigned to treatment or control, since there is constant interaction between people on both sides of the market. This leads to a biased estimate of the success metric and hence may lead to erroneous inference.
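One design used for such marketplaces is the switchback experiment (see the DoorDash and switchback references at the end of this article): instead of splitting users, the entire market is assigned to one arm per time window, so supply and demand inside a window never mix treatments. A minimal sketch, with hypothetical window lengths:

```python
import random
from datetime import datetime, timedelta

def switchback_schedule(start, hours=24, window_hours=2, seed=7):
    """Assign the whole marketplace to one arm per time window, so riders
    and drivers within a window all experience the same treatment."""
    rng = random.Random(seed)
    schedule = []
    t = start
    for _ in range(hours // window_hours):
        schedule.append((t, rng.choice(["control", "treatment"])))
        t += timedelta(hours=window_hours)
    return schedule

schedule = switchback_schedule(datetime(2024, 1, 1))
```

The treatment effect is then estimated by comparing the metric across treated and control windows rather than across users.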
➢ Network effects
For social media companies like LinkedIn and Facebook, and for online gaming companies, there is constant interaction between different users. If we take a single user as our experimental unit, it is more than likely that users in different groups will interact, violating SUTVA and biasing the estimates. LinkedIn has developed a new algorithm to tackle this issue: instead of randomly choosing individuals as experimental units, it combines users into interaction groups such that there is no spillover between different groups, leading to a reduction in network effects.
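The grouping idea can be sketched with a toy connection graph (the edges below are invented; in practice the clusters would be derived from the real social graph so that cross-cluster links are rare): users are grouped into connected clusters, and randomization happens at the cluster level.

```python
import random

# Hypothetical friendship edges standing in for a real connection graph.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f"), ("g", "h")]

def connected_components(edges):
    """Group users into interaction clusters via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    comps = {}
    for node in list(parent):
        comps.setdefault(find(node), set()).add(node)
    return list(comps.values())

clusters = connected_components(edges)
rng = random.Random(3)
# Randomize at the cluster level: connected users always share one arm.
cluster_arm = {frozenset(c): rng.choice(["control", "treatment"])
               for c in clusters}
```

Because every user shares an arm with everyone they interact with, spillover between treatment and control is eliminated within this toy graph.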
I hope this article is useful for anyone starting out on their journey in the world of statistics, as I am trying to do.
- Author unknown, "A/B Testing", Optimizely, 2020
- Kabir, Isak, "How to Conduct A/B Testing?", Towards Data Science, 2020
- Bojinov, Iavor, "Avoid the Pitfalls of A/B Testing", Harvard Business Review, 2020
- Harsh, Harsh, "Different Avatars of A/B Testing", Medium, 2021
- Iskold, Alex, "Understanding Network Effects", LinkedIn, 2016
- Bojinov, Iavor, "Design and Analysis of Switchback Experiments", HBS, 2020
- Kastelman, David, "Switchback Tests and Randomized Experimentation under Network Effects at DoorDash", Medium, 2018