A to Z of A/B Testing for Product Managers
A/B test, statistically significant results, p-value, null hypothesis & much more!
Welcome to my third newsletter! This one is on A/B testing.
BTW, thank you for the unprecedented love you have shown for my second newsletter on APIs. It was overwhelming, to say the least.
In today’s newsletter, I will cover -
What is A/B testing?
How long should you run an A/B test?
What do we mean by statistically significant results?
What is p-value?
What is null hypothesis?
What are the pitfalls of an A/B test?
And many other things!
1. What is A/B testing?
At face value, A/B testing sounds extremely simple!
Suppose you have created a personal branding course and plan to sell it for ₹99. You are excited & have built a registration page for it. The page has a green-coloured sign-up button. Somehow, you came across an article that said - Bro! Blue evokes trust & brings more conversions.
You want to try blue! But you aren’t sure either! What if it makes fewer people sign up?
You want to play safe! So, you launch two versions of the same registration page - one with a green button and the other with a blue button.
You show one half of your users the green sign-up button and the other half the blue sign-up button. Boom!
That’s an A/B test.
The button colour that gets more sign-ups is the winner. That’s the button colour you would want to go with eventually!
1.1 What’s next?
Note ➖ When we run the test, the group of users who see the old green sign-up button as usual is called the control group, whereas the other group of users who see the new blue-coloured sign-up button is called the treatment group.
But what if the registration page with the blue sign-up button is bringing more registrations simply because it’s being shown to more users who are genuinely interested in learning personal branding? Quite possible, ain’t it?
Therefore, it’s important to split your traffic randomly to avoid sampling bias.
Avoiding sampling bias is simple! You just need to ensure that any user visiting your registration page has an equal chance of seeing either the blue or the green sign-up button.
1.2 How to split your traffic randomly in an A/B test?
If you are conducting an A/B test for your logged-in users, you can split the traffic on the basis of user_id. If the user_id is an even number, show them the existing design; else, show the new design.
If you are conducting an A/B test for new users (i.e. new users visiting your registration page), your developer can set a cookie in the user’s browser & the cookie can randomly decide which variant to show to a particular user.
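To make this concrete, here is a minimal sketch of both splitting approaches in Python. The function names and variant labels below (assign_logged_in_user, assign_new_user, "green_control", "blue_treatment") are my own illustrative assumptions, not part of any specific framework - in practice this logic would live on your server or in the cookie-setting code your developer writes.

```python
import random
from typing import Optional

def assign_logged_in_user(user_id: int) -> str:
    """Split logged-in users by user_id parity: even -> existing (green) design, odd -> new (blue) design."""
    return "green_control" if user_id % 2 == 0 else "blue_treatment"

def assign_new_user(existing_cookie: Optional[str] = None) -> str:
    """Assign new users randomly once, then reuse the stored choice (e.g. read back from a
    browser cookie) so a returning visitor keeps seeing the same variant."""
    if existing_cookie in ("green_control", "blue_treatment"):
        return existing_cookie  # already bucketed on an earlier visit
    return random.choice(["green_control", "blue_treatment"])

# Hypothetical usage:
print(assign_logged_in_user(1042))  # even user_id  -> "green_control"
print(assign_logged_in_user(1043))  # odd user_id   -> "blue_treatment"
print(assign_new_user())            # fresh visitor -> random 50/50 bucket, to be saved in a cookie
```

One practical note: many teams prefer hashing the user_id together with an experiment name over raw even/odd splitting, so that the same users don’t land in the treatment group of every experiment they ever run.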
Now, let’s move ahead to the next big question - How long should you run an A/B test?
2. How long should you run an A/B test?
This is quite an irritating question for data scientists. The product team always wants the results fast and they keep on asking - how long should we run the A/B experiment?
We always run an A/B test until we get a statistically significant result.
Wait! What the fuck is a statistically significant result?
Does it mean we should run the test until we see a large/big improvement? (i.e. the blue button gets 10% more sign-ups than the green button)
After all, the literal meaning of significant is big/large/huge.
2.1 What do we mean by statistically significant results?
Reading the Wikipedia definition of statistical significance can give you nightmares :p (Maybe you haven’t read anything more complex than that in a while!)
Let’s make it simple!
At Internshala, we revamped the design of our Application Tracking System (ATS)! If you don’t know what an ATS is - think of an ATS as a place where all the resumes of job seekers are dumped and managed!
I won’t go into the details of the design changes we made, but the idea was to improve the processing rate (i.e. more applications should be hired/rejected/shortlisted out of the total applications received).
* Processing rate - If 10 applications were hired OR rejected OR shortlisted out of a total of 100 applications, the processing rate is 10%.
Let’s suppose we ran three different experiments simultaneously. Can you identify which of the test results are statistically significant?
Guess, please!
Doesn’t common wisdom say that Experiment C definitely has a statistically significant result? A 205.56% improvement over the base rate!
Experiment A can also have a statistically significant result as the change over the base rate isn’t bad. It’s 6.64%. Right?
Experiment B can never have a statistically significant result as the change is just 3% more than the old design. Right?
After all, significant means big/large/huge.
2.2 What statistically significant difference isn’t?
Damn! Experiment C (as expected) and Experiment B (what 😳) have shown statistically significant results.
It’s so important to remember that -
Statistically significant doesn’t mean large/big/significant change.
We can’t say anything about statistical significance by just looking at the absolute change or change over the base rate!
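To see why, here is a small sketch in Python. The numbers are made up purely for illustration (they are not the actual Internshala experiment data), and the two-proportion z-test shown here is just one common way to check significance for conversion-style metrics such as sign-up rate or processing rate.

```python
import math

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both variants convert equally well
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Same ~3% relative lift (10.0% -> 10.3%), very different sample sizes:
print(two_proportion_z_test(100, 1_000, 103, 1_000))            # z ≈ 0.22, p ≈ 0.82 -> not significant
print(two_proportion_z_test(10_000, 100_000, 10_300, 100_000))  # z ≈ 2.22, p ≈ 0.03 -> significant at 0.05
```

The relative lift is identical in both calls; only the amount of traffic changes. With 1,000 visitors per variant the p-value is about 0.82, so we can’t rule out plain luck, while with 100,000 visitors per variant it drops to about 0.03. That is exactly why a small-looking change like Experiment B can still turn out to be statistically significant.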
2.3 Then what do statistically significant results mean?
This may sound absolutely bizarre at this point, but please bear with me!