Rules for split testing for startups

In a startup you have limited resources, and you want to get as much bang for your buck as possible. Here are some rules that have worked for me:

Err on the side of bold testing

It’s nice to know exactly what caused the uplift in your test, but if I have the opportunity to run three tests, I want them to give me the biggest uplift, not the most knowledge. This means that your risk of failure will be higher, but when you win, you win big. Meek testing (e.g. button colours, CTA buttons) might work if you’re Google or Booking.com, but with limited traffic and resources you want to focus on things that are going to move the needle for you.

Make falsifiable hypotheses which will help you learn even if the test doesn’t win

As a corollary to the above, you want to learn as much as possible from every test. I like this hypothesis format:

As a result of this evidence

We believe that the following changes will result in this effect

We will measure this through this metric

If your hypothesis teaches you something whether or not the test wins, it becomes easier to create better future tests and iterate on your testing.

Each test should run for a minimum of 2 weeks

If you run your test for less than 2 weeks, you probably won’t get enough traffic, and you won’t capture how behaviour changes across the week (i.e. your results may skew towards how people behave on a weekend or a weekday). The right duration also depends on how much traffic you have: you may need to run a test for months on low-traffic pages (which raises the question of why you are bothering to test that page at all; see “Test the most important things” below).
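As a rough sketch of how traffic translates into test duration, the snippet below uses the standard normal-approximation sample size formula for comparing two conversion rates. Everything in it is an assumption for illustration rather than something any particular testing tool does: the function names, the one-sided test, the 80% power default and the example numbers are all made up.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_arm(baseline_rate, relative_uplift, confidence=0.75, power=0.8):
    """Approximate visitors needed in EACH arm to detect the given uplift.

    Normal-approximation formula for two proportions, one-sided test.
    The confidence and power defaults are illustrative assumptions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(confidence)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_run(weekly_visitors, baseline_rate, relative_uplift,
                 arms=2, min_weeks=2, **kwargs):
    """Weeks needed at the given weekly traffic, never fewer than min_weeks."""
    needed = visitors_per_arm(baseline_rate, relative_uplift, **kwargs) * arms
    return max(min_weeks, ceil(needed / weekly_visitors))


if __name__ == "__main__":
    # Hypothetical page: 1,000 visitors a week, 5% baseline conversion,
    # hoping to detect a 20% relative uplift.
    print(weeks_to_run(1000, 0.05, 0.2))
    print(weeks_to_run(1000, 0.05, 0.2, confidence=0.95))
```

Smaller expected uplifts and lower baseline rates both push the required duration up quickly, which is one more argument for bold tests on high-traffic pages.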

If after a month the variant has a significance level of less than 65%, kill the test

You need to keep a decent testing velocity. If the variant is only slightly better than the control after a month, it is unlikely to become a clear winner, and it’s not worth tying up the slot any longer.
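To make the thresholds in this post concrete (2 weeks minimum, kill below 65% after a month, and the 75% bar covered in the next rule), here is a minimal sketch. It assumes that the "significance" your tool reports is roughly one minus the one-sided p-value of a pooled two-proportion z-test. Tools differ in how they calculate this, so treat the function names, the 30-day reading of "a month" and the example numbers as hypothetical.

```python
from math import erf, sqrt


def confidence_variant_beats_control(control_visitors, control_conversions,
                                     variant_visitors, variant_conversions):
    """One-sided confidence (1 - p-value) from a pooled two-proportion z-test."""
    p_control = control_conversions / control_visitors
    p_variant = variant_conversions / variant_visitors
    pooled = ((control_conversions + variant_conversions)
              / (control_visitors + variant_visitors))
    se = sqrt(pooled * (1 - pooled)
              * (1 / control_visitors + 1 / variant_visitors))
    if se == 0:
        return 0.5  # no conversions (or all conversions): nothing to tell apart
    z = (p_variant - p_control) / se
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z


def decide(confidence, days_running, ship_at=0.75, kill_below=0.65,
           min_days=14, review_after_days=30):
    """Apply the rules of thumb from this post to one running test."""
    if days_running < min_days:
        return "keep running"
    if confidence >= ship_at:
        return "ship variant"
    if days_running >= review_after_days and confidence < kill_below:
        return "kill test"
    return "keep running"


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 visitors per arm, 50 vs 65 conversions.
    conf = confidence_variant_beats_control(1000, 50, 1000, 65)
    print(round(conf, 3), decide(conf, days_running=20))
```

Keeping the thresholds as parameters makes it easy to hold a payment page to a stricter bar, for example `ship_at=0.95`.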

Significance level of at least 75%

All the tools tell you to test to 95% confidence. If this is your payment page or another crucial page, I’d agree that it’s worth testing to a high significance level. For smaller pages, however, you have the challenge of getting enough traffic while a number of other tests wait in your pipeline. In that case a lower significance level is justified, especially if it’s backed up by:

- performance over time

It’s also interesting to look at how the test has performed over time. If the variation has been winning consistently since the start, it is reasonable to assume that it will continue winning.

- secondary metrics/micro-conversions

Aim for micro-conversions (i.e. next step in process) as opposed to sales/far away goals

Conversions and revenue are your obvious goals, but often you don’t have enough traffic or enough conversions to reach statistical significance on these. Micro-conversions can help here: clicks on “Add to basket”, reaching the basket, or reaching the next step after the elements you’re testing add further colour to the picture, and it is easier to achieve significance on them than on metrics further down the funnel. They can act as proxies for conversion, especially if you have a few of them to review alongside your main metrics.

Test the most important things

Don’t test your “About us” page or other pages that have a limited impact on the purchase journey. Elements like sitewide navigation, signup, checkout etc. affect everyone on the site, and so have a much better chance of moving the needle and delivering meaningful increases.

Have a roadmap

Having a number of tests planned for each slot helps, because you don’t have to go back and puzzle out a new test every time you conclude one. If you invest the time in having one or two tests scheduled for each slot, your process will run more smoothly.

These are some elements that have helped me keep a healthy test trajectory with limited resources.