In my PPC for travel post from a couple of weeks ago I recommended splittester for testing ad variations. I got a bit fed up with only being able to test two ads at once and having to type the information in for each test. I thought it would be a lot easier if I could do the same thing in a spreadsheet, so I emailed Brian Teasley, the creator of splittester, to find out what sort of statistical test he used so that I could implement it myself. He offered to sell me his own spreadsheet for $950. **NINE** hundred and fifty dollars! Nine **HUNDRED** and fifty dollars! Nine hundred and **FIFTY**! *For a spreadsheet!* Brian, I could buy a bank for that.

I wasn’t going to pay $950 for a spreadsheet, so I began doing my own research into statistical testing with the aim of using it for our own ads. Having read quite a few SEO blog posts on the subject, I think a lot of people don’t really know what they’re talking about when it comes to ad testing. Throughout this article I’m talking as if you’re optimising for CTR; the procedure for optimising for conversion rate is very similar: for impressions read clicks, and for clicks read conversions. Here are 5 of the most insidious errors:

**1. You must have x clicks (mistake 1).**

It isn’t a higher number of clicks that makes a result more statistically significant; the thing that matters, your sample size, is the number of impressions. Think of it this way: if you had an ad with 1,000,000 impressions and no clicks, you wouldn’t wait for it to get a certain number of clicks (30 seems popular for some reason) before deciding it was a bad ad. In my own model I treat each impression as a Bernoulli trial, with a click counting as a success. Then I estimate the binomial parameter p, the variance of which gets smaller as the number of impressions increases.
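To make that concrete, here is a minimal pure-Python sketch of the Bernoulli model (the impression and click counts are made up for illustration): the estimated CTR is clicks divided by impressions, and the standard error of that estimate shrinks as impressions grow, regardless of the raw click count.

```python
import math

def ctr_estimate(impressions, clicks):
    """Treat each impression as a Bernoulli trial; a click is a success.
    Returns the estimated binomial parameter p (the CTR) and the standard
    error of that estimate, which shrinks as impressions increase."""
    p = clicks / impressions
    se = math.sqrt(p * (1 - p) / impressions)
    return p, se

# Hypothetical numbers: the same 3% CTR, measured on two sample sizes.
p1, se1 = ctr_estimate(1_000, 30)
p2, se2 = ctr_estimate(100_000, 3_000)
# p1 and p2 are both 0.03, but the larger sample's estimate is far tighter.
```

The point of the sketch is that the uncertainty depends on impressions, not on hitting some magic click count.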

**2. You must test two ads.**

People only ever talk about A/B testing of adverts, but you can still get valid results if you’re testing 1,000 ad variations at the same time. The main problem is that the test would have to run for a long time to produce significant results. It is also hard to say why the best ad is best, which makes it difficult to write your next ad variation.

**3. Use a two-tailed test.**

Some of you might not know what a two-tailed test is. Imagine you have two ads, A and B, that you’re testing for CTR. To test the hypothesis “A and B have different CTRs” you’d use a two-tailed test, summing the probability that A is better than B and the probability that B is better than A. But if the test tells you your hypothesis is true, what do you do? You’re no further forward, because you already thought the ads would have different CTRs; otherwise you wouldn’t be testing them. Telling someone “you are approximately 99% confident that the ads will have different long term response rates” is useless. To establish how sure you can be that ad A is better than ad B, you must use a one-tailed test.
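As a sketch of the one-tailed version, here is a standard two-proportion z-test in pure Python. This is not necessarily the exact test splittester runs, and it assumes samples large enough for the normal approximation; the impression and click figures in the example are hypothetical.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def one_tailed_confidence(imp_a, clicks_a, imp_b, clicks_b):
    """One-tailed confidence that ad A's true CTR exceeds ad B's,
    via a two-proportion z-test (assumes large samples)."""
    pa, pb = clicks_a / imp_a, clicks_b / imp_b
    # Pooled CTR under the null hypothesis that the ads are the same
    p_pool = (clicks_a + clicks_b) / (imp_a + imp_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / imp_a + 1 / imp_b))
    z = (pa - pb) / se
    return normal_cdf(z)  # one tail: only "A beats B" counts
```

Calling `one_tailed_confidence(10_000, 250, 10_000, 200)` answers the question you actually care about, “how sure am I that A is better?”, rather than “how sure am I that they differ?”.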

**4. You must have x clicks (mistake 2).**

It is true that if the test runs for long enough it will become obvious which ad is actually the best. But how long is long enough? Waiting 6 months to see if an ad with an apparent CTR of 0.5% is going to have a late surge and beat one with a CTR of 4.5% just causes your business to miss out on clicks. To know when a split test has been running long enough you must use a statistical test. I’d say there are three possibilities to choose from:

- The simplest test to use is the z-test. To use it you must assume that CTR is normally distributed, and your sample size must be big enough that the sample CTR variance is a close approximation to the true variance.
- For small sample sizes use a t-test, or more specifically Welch’s t-test. This test does not assume that you know the population variance, so it is a better test than the z-test. For large sample sizes the t-distribution converges to the normal distribution (used in the z-test), so since the t-test is more complicated to apply, I’d use the z-test for large samples.
- The above two tests implicitly assume that CTR is normally distributed (on a bell curve). I think this is actually the case, but if you disagree, let me know why in the comments below and start using the Mann-Whitney U test instead. Wikipedia says that for normally distributed data the Mann-Whitney U test is about 95% as efficient as a t-test, and it is less likely to give spurious results based on outliers. I would consider using it when the CTR is small, since then anyone who clicked the ad could be considered an outlier.
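To make the second option concrete, here is a rough pure-Python sketch of Welch’s t statistic applied to click data, under the binary click/no-click model described earlier. The counts are hypothetical, and for real work I’d reach for a statistics library rather than hand-rolling this.

```python
import math

def welch_t(imp_a, clicks_a, imp_b, clicks_b):
    """Welch's t statistic for two binary (click / no-click) samples.
    Unlike the pooled z-test, it uses each sample's own variance."""
    pa, pb = clicks_a / imp_a, clicks_b / imp_b
    var_a = pa * (1 - pa) / imp_a  # variance of the sample mean
    var_b = pb * (1 - pb) / imp_b
    t = (pa - pb) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (imp_a - 1) + var_b ** 2 / (imp_b - 1)
    )
    return t, df
```

With impression counts in the thousands the degrees of freedom are huge and the t-distribution is effectively normal, which is exactly why I’d fall back to the simpler z-test for large samples.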

**5. Any difference in CTR is because of the Ad.**

I haven’t seen this view all that much, but it’s one I believed myself until quite recently. I thought that any difference between the CTRs of the ads being tested must be due entirely to the difference between the ad texts. I didn’t consider that one ad might have been shown on slightly more relevant keywords, or at a time of day when it was more likely to get clicked. How do you compensate for this when testing? For a real campaign ‘in the wild’ I don’t think you can completely avoid the problem; it is impractical to have only one exact-match keyword per ad group. Instead, do the best you can by following AdWords best practice and keeping the keywords in each ad group tightly themed.

I hope to blog a bit more about ad variations and statistical testing; I’ve had some pretty weird things come up as being statistically significant. Put any questions or comments in the form below and I’ll try to address them in my next post.