I’m not the first person to have considered the possibility that Paul the Octopus is the spawn of Cthulhu, based on his “remarkable” predictive powers. However, being unconvinced, I presented the possibility that he is not a normal octopus to my students last week, as an example of a basic non-parametric test (the runs test). I thought I’d present a couple of results here, and contemplate some of the complexities of hypothesis testing against a backdrop of crawling chaos.


So the basic tale is that Paul predicted the outcome of 5 German games and then one Holland/Spain game successfully, and he had an 80% success rate in the European cup (4 games out of 5 predicted correctly). We will present some statistical tests of this situation, and finish up with a few discussion details.


The aim here is to test whether the Oberhausen Sea Life aquarium is housing one of the gibbering dark ones from beyond time and space or whether, in fact, Paul is just a normal octopus who happens to be lucky. Additionally, are the cult of the Ancient Ones who surround him actually a bunch of charlatans making money from our credulous belief in the crawling abominations of the netherworld? Should we sacrifice Paul, perhaps lightly-battered with a slice of lemon, for the good of all humanity; or should we accept his fundamental normality and get on with our lives safe in the knowledge that the Nameless Ones do not, in fact, inhabit our mortal realm?[1]


We can posit the fundamental question as to Paul’s normality or infinite evil in terms of the null and alternative hypothesis of a non-parametric statistical test, as follows. Let the random variable X measure the outcome of Paul’s attempt to guess the result of the next Germany match. Then let X=1 if Paul is successful in his prediction, and X=0 if he fails. Define the probability that Paul successfully predicts a soccer match as p=P(X=1). Then, we can write the null and alternative hypotheses as:

H0: Paul is a normal octopus (p=1/2)

H1: Paul is a crawling abomination from the pits of hell (p>1/2)

In this case we can test the possibility that p>1/2 by means of a runs test. That is, under the null hypothesis, is the chance that Paul would predict 5 games correctly in a row unusually low, such that we might reject the null hypothesis with some confidence? We will choose a confidence level of 95% and reject the null hypothesis if the probability of 5 games predicted correctly in a row is less than 5%. Note that we are using a runs test here, requiring sequential successes; we might instead want to allow the possibility that he makes a mistake at some point in the process, in which case we are interested in the probability that he gets a given number of games out of 5 correct, in any order.

This second test is important because in 2008 Paul predicted 4 games out of 5, for 80% accuracy. I’m not sure whether this happened sequentially or not, but it seems reasonable to suppose that his mistake could occur at any point in the chain of games, so then we need to calculate the probability of 4 games correct out of 5, in any order, and identify whether this is less than 5% (for a one-sided test), in order to reject the null hypothesis in favour of the terrible omens of destruction and chaos.


So, the probability that he correctly predicts 5 games in a row under the null is (1/2)^5, because the predictions are independent events and the probability is thus the product of their separate probabilities. This gives a probability of 1/32, or about 3%, which is less than 5%. We reject the null hypothesis of normality, and conclude that in fact the Elder Gods stalk the (aquariums of the) Earth.

However, the probability of 4 out of 5 correct in any order is (5 4) (1/2)^5 under the null hypothesis, where (5 4) is my crappy non-latex way of writing “5 choose 4”. This gives us 5/32, or roughly 1/6 (about 16%), so we retain the null hypothesis that Paul is a normal octopus. Note that the probability of 4 predictions in a row is exactly 1/16, or about 6%, so no dice…
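For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two calculations above. The function names are my own (hypothetical) choices, and exact fractions are used so the results can be compared directly with 1/32 and 5/32. It also prints the "at least 4 out of 5" tail, which is arguably the more conventional one-sided p-value than "exactly 4"; it leads to the same decision here.

```python
from fractions import Fraction
from math import comb


def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k, n, p):
    """P(at least k successes in n trials), i.e. a one-sided p-value."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


p_null = Fraction(1, 2)  # H0: Paul is a normal octopus

five_in_a_row = p_null**5                          # the run of 5 successes
exactly_4_of_5 = binom_pmf(4, 5, p_null)           # 4 correct, mistake anywhere
at_least_4_of_5 = binom_upper_tail(4, 5, p_null)   # 4 or 5 correct
four_in_a_row = p_null**4                          # a run of only 4 successes

for label, prob in [
    ("5 in a row", five_in_a_row),
    ("exactly 4 of 5", exactly_4_of_5),
    ("at least 4 of 5", at_least_4_of_5),
    ("4 in a row", four_in_a_row),
]:
    verdict = "reject H0" if prob < Fraction(5, 100) else "retain H0"
    print(f"{label}: {prob} = {float(prob):.3f} -> {verdict}")
```

Only the run of 5 falls below the 5% threshold; every version of the 4-out-of-5 calculation leaves Paul a normal octopus.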

So, we have contradictory results concerning the nature of evil. Having proven statistically that British people are idiots and the Australian government didn’t burn the house down, I’m a little disappointed at this mixed result. I’m sure no priest of Sigmar would accept such equivocation where the agents of chaos are concerned. What to do?


We could combine the results of the two tournaments, to get a total of 10 games with 9 correct results, but we don’t really have 10 games, because the 5 predictions of each series are correlated – Paul was a younger, and presumably less infinitely evil, octopus 2 years ago, and maybe had a different predictive method/ritual, plus of course his cult followers were probably making different/smaller human sacrifices. So we need to consider the possibility that those 5 games are more similar to each other than they are to the next 5 games. Without any knowledge of the degree of correlation in the octopus’s predictions under the null hypothesis, we can’t make a judgement.

There is also a question of inter-rater agreement here. It’s possible that Paul always goes for the same box, and the staff don’t randomly assign flags to boxes, or just by luck the Germany box is more likely to be on the side Paul favours. We should probably consider the randomization sequence of the boxes in some way. A variable for the side on which the box is placed, or better still random assignment of the flags to the boxes, would have solved this problem.

But I think there is a more sinister trick at work here. We know that Germany are a strong team, and we know that Paul is lured into the boxes by mussels. So, since the staff can be confident that the German team will likely win most games, it is quite easy to rig the process by training Paul to prefer the German flag[2]. Remember that octopuses are very smart and can learn to tell visual patterns apart, so it could be possible to train a preference. Then, the probability of success in each predictive effort increases significantly. The probability of success is P(Paul picks Germany and Germany win) + P(Paul picks the opposition and Germany lose) = P(Paul picks Germany)*P(Germany win) + (1-P(Paul picks Germany))*P(Germany lose), by the independence of the prediction and the outcome. So if P(Paul picks Germany)>1/2 and P(Germany win)>1/2, the total probability rises above 1/2. We know Germany won 3 games out of 5 this time around, so we could estimate P(Germany win)=0.6; if P(Paul picks Germany)=0.8, then we have the total p=0.8*0.6+0.2*0.4=0.56, p>1/2. If Germany’s win probability is really 0.8 (because Serbia were a pack of cheating bastards), then the probability increases to p=0.68.
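The effect of a trained preference is easy to play with numerically. The sketch below is a rough illustration only: the function name is hypothetical, it assumes (as the formula above does) that Paul's pick is independent of the result and that every game ends in a win or a loss, and the 0.8 preference is the made-up figure from the text, not a measured one.

```python
def success_probability(p_pick_germany, p_germany_win):
    """P(correct prediction) when the pick and the result are independent.

    Assumes two outcomes only (no draws), so P(Germany lose) = 1 - P(Germany win).
    """
    p_germany_lose = 1 - p_germany_win
    return (p_pick_germany * p_germany_win
            + (1 - p_pick_germany) * p_germany_lose)


# (P(Paul picks Germany), P(Germany win)) scenarios from the text,
# plus an untrained octopus for comparison.
scenarios = [(0.5, 0.6), (0.8, 0.6), (0.8, 0.8)]

for p_pick, p_win in scenarios:
    p = success_probability(p_pick, p_win)
    # Chance of 5 correct predictions in a row at this per-game success rate.
    run_of_five = p**5
    print(f"pick={p_pick}, win={p_win}: p={p:.2f}, P(5 in a row)={run_of_five:.3f}")
```

Note that an untrained octopus (pick probability 0.5) stays at p=0.5 however strong Germany are; it is only the combination of a biased octopus and a strong team that pushes p above 1/2 and makes a run of 5 no longer fall under the 5% threshold.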

Of course, because Germany win most games and Paul predicts they win most games, the actual fact that Paul is going to pick Germany more often anyway gets missed.

A final couple of notes. First, in this analysis[3], I have ignored the Holland/Spain prediction, because I read somewhere that Paul used to only predict on games involving Germany. This means that the Holland/Spain game is well outside the range of data on which the predictive model is based, and we shouldn’t assume it represents the same underlying probability structure or process (or manifestation of ultimate evil). So I’ve excluded this observation from my data set.

Secondly, it’s worth bearing in mind that statisticians should never, ever use statistical tests to test theoretically implausible events[4]. Because there is a small chance of type 1 error (rejecting the null hypothesis when the null is true), as soon as you apply a statistical test to a ridiculously implausible theory, you open the risk that you will prove it to be “true” by mistake. So all that is required to prove the existence of God is for some nong to conduct a statistical test of an apparent “miracle” that is really just a carefully trained Octopus, get a spurious result, and before you know it you have people worshipping his tentacly appendages.


Two non-parametric statistical tests have produced inconclusive results as to whether or not the shambling horrors of Cthulhu walk among us, predicting our soccer matches. However, the test that rejected the null hypothesis was borderline, and consistent with the possibility that Paul has been trained to pick the German flag more often than other flags, thus ensuring increased predictive success and a high likelihood of a run of successful predictions, provided that Germany remain a strong team. This report concludes that Paul should probably not be burnt at the stake (or grilled) as a heretic, tentacled avatar of the brooding darkness; but it might be worthwhile to monitor him, his aquarium shrine, and the Cult that surround him, for further signs of the manifestations of chaos and, if witnessed, liquidate them and extirpate their teachings from the annals of history in the interests of the human race.

Update: Looking at the Wikipedia entry on our dark and tentacled oppressor, I note that actually he got 7 out of 7 results correct in this world cup, and only 4 out of 6 in the European cup. This doesn’t change the conclusion of our runs test (which simply becomes an even more powerful indication of his brooding and ultimate evil), but it makes his success rate in the European cup look even more merely mortal. Also the Wikipedia entry correctly points out that in the group games there is a chance of a draw, so what we actually have here is a sequence of multinomial events with probability 1/3 for each of three outcomes in the first 3 tests, then 1/2 for each of two outcomes in the remainder (under the null). We would need to adjust the probabilities accordingly, for both the runs and the binomial test. This actually makes the binomial test a bit fiddly, but my guess is that it reduces the p-value slightly (due to the probabilities of success being lower). I think the Wikipedia entry is slightly wrong on the odds of “at least 12 successes in 14 trials” due to the issue of correlation (as mentioned above)[5].
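The fiddly version of the binomial test can be done by brute force. The following Python sketch builds the distribution of the number of correct predictions when each game has its own success probability, taking the update's assumption at face value: 1/3 for each of the 3 group games and 1/2 for each knockout game under the null. The function names are hypothetical, and the sketch still treats the games as independent, so it does nothing about the correlation issue.

```python
from fractions import Fraction


def success_count_distribution(probs):
    """Return dist where dist[k] = P(exactly k successes).

    probs holds one success probability per independent trial; the
    distribution is built up one trial at a time.
    """
    dist = [Fraction(1)]  # before any trial: zero successes with certainty
    for p in probs:
        new = [Fraction(0)] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)   # this trial fails
            new[k + 1] += mass * p     # this trial succeeds
        dist = new
    return dist


def upper_tail(dist, k):
    """P(at least k successes) given a success-count distribution."""
    return sum(dist[k:])


group, knockout = Fraction(1, 3), Fraction(1, 2)

# World cup: 3 group games then 4 knockout games, 7 of 7 correct.
world_cup = success_count_distribution([group] * 3 + [knockout] * 4)
p_all_seven = world_cup[7]

# European cup: 3 group games then 3 knockout games, 4 of 6 correct.
euro = success_count_distribution([group] * 3 + [knockout] * 3)
p_at_least_4_of_6 = upper_tail(euro, 4)

print(f"World cup, 7 of 7: {p_all_seven} = {float(p_all_seven):.4f}")
print(f"European cup, at least 4 of 6: {p_at_least_4_of_6} = {float(p_at_least_4_of_6):.3f}")
```

Under these assumptions a perfect world cup has probability (1/3)^3 (1/2)^4 = 1/432 under the null, while the European cup result stays well above 5%, so the two tournaments still point in opposite directions.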

fn1: yet

fn2: My suspicion is that they ran a series of dummy runs with Paul before the cup, and either gave him a second mussel when he picked Germany, and/or sacrificed a virgin and offered her blood to the elder gods to enhance his magical powers; statistical testing seems to suggest the former was the case, but we can never be sure…

fn3: and I do use the term loosely

fn4: this applies to the kids at home too, obviously

fn5: also, has anyone else noticed that the wikipedia entry on the ecological fallacy confuses confounding and the ecological fallacy? At least, I thought it did last time I read it.