Two days ago I wrote a scathing review of Star Wars: The Last Jedi, and since then I have been digging around for others’ views on the matter. The Guardian has an article giving some fans’ reviews, and the below-the-line comments are suitably critical of this awful movie. Meanwhile Vox has a pathetic, self-serving article by a film critic attempting to explain why so many people have such different views to the critics. This article includes such great insights as “critics don’t really care about plot”, which is dismissed as a “nitty gritty detail” of a movie – they’re more interested in themes and emotional struggles, apparently, which suggests they’d be more at home at a My Chemical Romance gig than at a decent movie. How did they get the job?

In amongst the complaints on the Guardian’s article, and at the centre of the Vox piece, is a particularly vicious little dismissive claim: that a lot of the negative reaction to the movie arises from long-term fans[1], who cannot handle what Rian Johnson did with their cherished childhood movie, and who are unrepresentative of the broader movie-going public. In the more vernacular form of some of the BTL comments on the Guardian article, fanboys are pissed off because Rian Johnson didn’t make the movie exactly the way they wanted. This, apparently, explains the difference between the critics’ view of the movie and the views of the people reviewing it on the Rotten Tomatoes website.

I thought this sounded fishy, so I decided to collect a little bit of data from the Rotten Tomatoes website and have a look at just how far fanboys typically deviate from critics. I figured that if fanboys’ disappointment at not getting the movie exactly as they wanted it were really the driver of negative reactions to this movie, we should see the same thing in other Star Wars movies. We should also see it in other movies with a strong fanboy following, and maybe we wouldn’t see it in movies that audiences didn’t have strong preconceptions about. I collected data on critics’ and fans’ aggregated review statistics for 35 movies from the Rotten Tomatoes website. For each movie I calculated a score, which I call the Odds Ratio of Critical Acceptance (ORCA). This is calculated as follows:

1. Calculate an odds for the critics’ aggregate score, O1, which is (score)/(1-score)

2. Calculate an odds for the viewers’ aggregate score, O2, which is (score)/(1-score)

3. Calculate their ratio, ORCA=O1/O2

I use this score because it accounts for the inherent limits on the value of a critical score. The Last Jedi got a critics’ score of 0.93, which is very close to the upper limit of 1. If the viewers’ score was, for example, 0.83, it is 0.1 lower than the critics’ score. But this 0.1 is a much larger gap than, say, the difference between a critics’ score of 0.55 and a viewers’ score of 0.45. Similarly, if critics give a movie a value of 0.1 and viewers a value of 0.2, this means viewers thought it was twice as good – whereas values of 0.45 and 0.55 are much less different. We use this kind of odds ratio in epidemiology a lot because it allows us to properly account for small differences when one score is close to 1, as (inexplicably) it is for this horrible movie. Note that ORCA scores above 1 indicate that the critics gave the movie a higher score than the viewers, and scores below 1 indicate that the viewers liked the movie more than the critics.
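
If you want to play along at home, the calculation is trivial; here is a minimal R sketch of it (the scores are just the Rotten Tomatoes aggregates discussed in this post, rescaled to proportions):

# Odds Ratio of Critical Acceptance (ORCA)
orca <- function(critic_score, viewer_score) {
  o1 <- critic_score / (1 - critic_score)   # odds for the critics' aggregate score
  o2 <- viewer_score / (1 - viewer_score)   # odds for the viewers' aggregate score
  o1 / o2
}

orca(0.93, 0.55)   # The Last Jedi: about 10.9
orca(0.55, 0.45)   # the same 0.1 gap in the middle of the range: about 1.5
orca(0.10, 0.20)   # a badly-reviewed movie that viewers scored twice as high: about 0.44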

I collected scores for all the Star Wars movies, all three Lord of the Rings movies, both Ghost in the Shell movies (the Japanese original and the western remake), both Blade Runners, Alien: Covenant, two Harry Potter movies, Fifty Shades of Grey, and Gedo Senki (the (filthy) Studio Ghibli version of A Wizard of Earthsea), as examples of movies with a fanboy following. As readers of my blog are no doubt very aware, the Lord of the Rings fanboys are absolutely filthy, and if anyone is going to sink a movie over trivial shit they will. Ghost in the Shell is a remake of a movie with a very strong otaku following of the worst kind, and it also suffers from a huge controversy over whitewashing, and Gedo Senki is based on one of the world’s most popular books, by a woman who has an intense generation-spanning cadre of fans who are obsessed with her work. Harry Potter fans are also notoriously committed. I also gathered a bunch of movies that I like or that I thought would be representative of the kinds of movies that did not have a following before they were released: Mad Max: Fury Road, Brokeback Mountain, that new movie about a transgender bull[3], Ferdinand, things like that. I figured that some of these movies would not show a big divergence in ORCA if the fanboy theory is true.

Figure 1: ORCA Scores for a range of movies, none apparently as shit as The Last Jedi.

Results of my calculations are shown in Figure 1 (sorry about the fiddly size). The Last Jedi is on the far left, and is obviously a massive outlier, with an ORCA score of 10.9. This score arises because it has a critics’ score of 93%, but a score from fans of 55%[4]. Next is Mad Max: Fury Road, which was not as successful with fans as with critics but still got a rating of 0.85 from fans. It can be noted that several Star Wars movies lie to the right of the pale blue dividing line, indicating that fans liked them more than the critics did – this includes Rogue One and The Phantom Menace, showing that this phenomenon was not limited to the first-generation movies. Note that Fellowship of the Ring, the LoTR movie most likely to disappoint fanboys under the theory that fanboys want the director to make the movie in their heads, had an ORCA value of 0.53, indicating fans had roughly twice the odds of liking it that critics did. Gedo Senki also did better with fans than critics despite being a terrible movie that completely pisses all over Ursula Le Guin’s original book.

There’s no evidence at all from this data that fanboys respond badly to movies based on not getting the movie in their head, and there’s no evidence that Star Wars fanboys are particularly difficult to please. The ORCA score for The Last Jedi is at least 12 parsecs removed from the ORCA score for the next-worst movie in the series, which (despite that movie also being a pile of shit) is not that high – it’s lower than Dunkirk, in fact, which was an awesome movie with no pre-existing fanbase[5]. Based on this data it should be pretty clear that either the “toxic fandom” of Star Wars has been hiding for the past 10 years as repeated bad movies were made – or this movie is uniquely bad, and the critics were uniquely stupid to give it a good score.

I’m going with the latter conclusion, and I want the movie critics to seriously re-evaluate how they approached this movie. Star Wars clearly gets a special pass from critics because it’s so special, and Star Wars directors can lay any stinking turd on the screen and get a pass from critics for some incomprehensible reason. Up your game, idiots.

A few minor side points about critical reviews of The Last Jedi

I’ve been generally shocked by the way in which this movie is being hailed as a critical masterpiece. I really can’t see how this can be. Even if it’s not as bad as I think, I can’t understand how it can get similar scores to movies like Dunkirk, Mad Max: Fury Road, or Titanic. Those movies are infinitely better crafted than this pile of junk, with tight and carefully designed plots that clearly hold together under extensive criticism. There is nothing extraneous at all in Titanic or Dunkirk, not one moment that you could say isn’t directly relevant to the unfolding story, and the acting in all three of these movies is exemplary. Worse still, the Guardian is now claiming that Star Wars is the most triumphantly feminist movie yet. This is utter bullshit on its face: The main male character, Poe Dameron, repeatedly undermines female leaders, and their attempts to discipline him are ignored, ultimately leading to the death of probably 200 people in a completely avoidable catastrophe, and he suffers no consequences for his dishonesty and treachery. Furthermore, he takes over the main role from Finn, the black character, and Rey is sidelined into the role of supplicant to an aging white man. As a moral story for entitled white men who can’t bear to be told what to do by women it’s exemplary. But this is even more horrific when you consider that Mad Max: Fury Road is a savage eco-feminist masterpiece, and undoubtedly the most triumphantly feminist movie ever made. This is another example of the weird special pass that Star Wars movies get: they make piss poor tokenistic gestures towards diversity and the critics are claiming they’re the most woke movie ever made.

There’s a strange irony in this. Star Wars fanboys are being blamed for obstinately marking this movie down on the basis of silly stereotypes about nerds, when in fact it’s the critics themselves who are acting like Star Wars sycophants, giving one of the worst movies of the millennium sterling marks for trying. Unless of course the conspiracy theories are true, and they’re all paid by Disney.

I won’t be so cynical. They’re just stupid and wrong, and in future I recommend not listening to reviewers before going to see any movie. Trust the viewers, they have much better judgment!

UPDATE: I have swapped my shoddy figure with a figure supplied by reader frankelavsky, who apparently actually knows how to do visual stuff, so it’s now much easier to see how terribly wrong the reviewers were.


fn1: Which, inexplicably, the Vox article seems to view as Baby Boomers, which is weird since most people want to now pretend Star Wars is a kid’s movie (it’s not[2]). Many of the fans saw it as kids, it’s true, but that’s because we were Gen X, not baby boomers. More importantly, Star Wars fandom crosses three generations, and includes a lot of Generation Y. It’s just dumb to even hint that the themes in the movie pissed off the fans because baby boomers don’t like the idea of handing on the baton to a new, more diverse generation. Star Wars fans aren’t baby boomers, and why would baby boomers have a problem with this anyway?

fn2: How fucking stupid is modern pop cultural analysis of pop culture, and how far has it fallen, that people could say this?

fn3: This is a joke. See here for more details.

fn4: It was 56% yesterday. This movie is sinking by the day.

fn5: Barring UKIP, I guess


Recently I wrote a post criticizing an article at the National Bureau of Economic Research that found a legal ivory sale in 2008 led to an increase in elephant poaching. This paper used a measure of illegal poaching commonly called the PIKE, which is measured by counting carcasses discovered in searches of designated areas. The PIKE measures the proportion of deaths that are due to illegal poaching. I argued that the paper was wrong because the PIKE has a lot of zero or potentially missing values and is bounded between 0 and 1, but the authors used ordinary least squares (OLS) regression to model this bounded number, which creates huge problems. As a result, I wasn’t convinced by their findings.

Since then I have read a few reports related to the problem. I downloaded the data and decided to have a look at an alternative, simple way of estimating the PIKE in which the estimate of the PIKE emerges naturally from a standard Poisson regression model of mortality. This analysis better handles the large number of zeros in the data, and also the non-linear nature of death counts. I estimated a very simple model, but I think a stronger model can be developed using all the standard properties of a well-designed Poisson regression model, rather than trying to massage a model misspecification with the wrong distributional assumptions to try and get the correct answer. This is an outline of my methods and results from the most basic model.

Methods

I obtained the data of legal and illegal elephant carcasses from the Monitoring Illegal Killing of Elephants (MIKE) website. This data indicates the region, country code, site number, and number of natural deaths and illegally killed elephants at each site in each year from 2002 – 2015. I used the full data set, although for comparing with the NBER paper I also used the 2003-2013 data only. I didn’t exclude any very large death numbers as the NBER authors did.

The data set contains a column for total deaths and a column for illegal deaths. I transformed the data into a data set with two records for each site-year, so that one row was illegal killings and one was total deaths. I then defined a covariate, call it X, that is 0 for illegal killings and 1 for total killings. We can then define a Poisson model as follows, where Y is the count of deaths in each row of the data set (each observation):

Y~Poisson(lambda)                               (1)

ln(lambda)=alpha+beta*X                   (2)

Here I have dropped various subscripts because I’m really bad at writing maths in wordpress, but there should be an i for each observation, obviously.

This simple model can appear confusing to people who’ve been raised on only OLS regression. It has two lines and no residuals, and it also introduces the quantity lambda and the natural logarithm, which in this case we call a link function. The twiddle in the first line indicates that Y is Poisson distributed. If you check the Poisson distribution entry in Wikipedia you will see that the parameter lambda gives the average of the distribution. In this case it measures the average number of deaths in the i-th row. When we solve the model, we use a method called maximum likelihood estimation to obtain the relationship between that average number of deaths and the parameters in the second line – but taking the natural log of that average first. The Poisson distribution is able to handle values of Y (observed deaths) that are 0, even though this average parameter lambda is not 0 – and the average parameter cannot be 0 for a Poisson distribution, which means the linear equation in the second line is always defined. If you look at the example probability distributions for the Poisson distribution at Wikipedia, you will note that for small values of lambda, values of 0 and 1 are very common. The Poisson distribution is designed to handle data that gives large numbers of zeros, and carefully recalibrates the estimates of lambda to suit those zeros.

It’s not how Obama would do it, but it’s the correct method.
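
For anyone who wants to see the machinery, here is a minimal R sketch of the same setup (my actual models below were fitted in Stata/MP 14; the data and column names here are invented, purely to show the layout):

set.seed(1)
mike <- expand.grid(site = 1:10, year = 2002:2015)            # hypothetical MIKE-style site-years
mike$illegal <- rpois(nrow(mike), lambda = 2)                 # invented illegal carcass counts
mike$total   <- mike$illegal + rpois(nrow(mike), lambda = 3)  # total deaths include the illegal ones

# Two records per site-year: X = 0 for illegal deaths, X = 1 for total deaths
long <- rbind(transform(mike, X = 0, Y = illegal),
              transform(mike, X = 1, Y = total))

# Equations (1) and (2): Y ~ Poisson(lambda), ln(lambda) = alpha + beta*X
fit <- glm(Y ~ X, family = poisson, data = long)
summary(fit)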

The value X has only two values in the basic model shown above: 0 for illegal deaths, and 1 for total deaths. When X=0 equation (2) becomes

ln(lambda)=alpha              (3)

and when X=1 it becomes

ln(lambda)=alpha+beta    (4)

But when X=0 the deaths are illegal; let’s denote these by lambda_ill. When X=1 we are looking at total deaths, which we denote lambda_tot. Then for X=1 we have

ln(lambda_tot)=alpha+beta

ln(lambda_tot)-alpha=beta

ln(lambda_tot)-ln(lambda_ill)=beta     (5)

because from equation (3) we have alpha=ln(lambda_ill). Note that the average PIKE can be defined as

PIKE=lambda_ill/lambda_tot

Then, rearranging equation (5) slightly we have

ln(lambda_tot/lambda_ill)=beta

-ln(lambda_ill/lambda_tot)=beta

ln(PIKE)=-beta

So once we have used maximum likelihood estimation to solve for the value of beta, we can obtain

PIKE=exp(-beta)

as our estimate of the average PIKE, with confidence intervals.

Note that this estimate of PIKE wasn’t obtained directly by modeling it as data – it emerged organically through estimation of the average mortality counts. This method of estimating average mortality rates is absolutely 100% statistically robust, and so too is our estimate of PIKE, as it is simply an interpretation of a mathematical expression derived from a statistically robust estimation method.
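
Continuing the R sketch from above (again, just an illustration of the mechanics, not my actual Stata code), the PIKE estimate and a confidence interval drop straight out of the fitted coefficient:

beta <- coef(fit)["X"]                 # the coefficient on X from the model above
ci   <- confint.default(fit)["X", ]    # Wald 95% confidence interval for beta

exp(-beta)       # point estimate of the average PIKE
rev(exp(-ci))    # 95% CI for the PIKE (reversed because of the sign flip)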

We can expand this model to enable various additional estimations. We can add a time term to equation (2), but most importantly we can add a term for the ivory sale and its interaction with poaching type, which enables us to get an estimate of the effect of the sale on the average number of illegal deaths and an estimate of the effect of the sale on the average value of the PIKE. This is the model I fitted, including a time term. To estimate the basic values of the PIKE in each year I fitted a model with dummy variables for each year, the illegal/legal killing term (X) and no other terms. All models were fitted in Stata/MP 14.
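
In the same toy R notation, the sale model looks something like this (the cut-point for the post-sale indicator and the linear time term are purely illustrative; my real models were fitted in Stata as described above):

long$post_sale <- as.numeric(long$year >= 2009)   # indicator for the period after the 2008 sale
fit_sale <- glm(Y ~ X * post_sale + I(year - 2002), family = poisson, data = long)

exp(-coef(fit_sale)["X"])                                      # average PIKE before the sale
exp(-(coef(fit_sale)["X"] + coef(fit_sale)["X:post_sale"]))    # average PIKE after the sale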

Results

Figure 1 shows the average PIKE estimated from the mortality data with a simple model using dummy variables for each year and illegal/total killing term only.

Figure 1: Estimated Annual PIKE values

The average PIKE before the sale was 0.43, with 95% Confidence Interval (CI) 0.40 to 0.44. The average PIKE after the sale increased by 26%, to 0.54 (95% CI: 0.52 to 0.56). This isn’t a very large increase, and it should be noted that the general theory is that poaching is unsustainable when the PIKE value exceeds 0.5. The nature of this small increase can be seen in Figure 2, which plots average total and illegal carcasses over time.

Figure 2: Trend in illegal killings and total deaths

Although numbers of illegal carcasses increased over time, so did the number of legal carcasses. After adjusting for the time trend, illegal killings appear to have increased by a factor of 2.05 (95% CI 1.95 – 2.16) after the sale, and total kills by 1.58 (95% CI 1.53 – 1.64). Note that there appears to have been a big temporary spike in 2011-2012, but my results don’t change much if the data is analyzed from only 2003-2013.

Conclusion

Using a properly-specified model to estimate the PIKE as a side-effect of a basic model of average carcass counts, I found only a 26% increase in the PIKE despite very large increases in the number of illegal carcasses observed. This increase doesn’t appear to be just the effect of time, since after adjusting for time the post-2008 step is still significant, but there are a couple of possible explanations for it, including the drought effect mentioned by the authors of the NBER paper; some kind of relationship between illegal killing and overall mortality (for example because the oldest elephants are killed and this affects survival of the whole herd); or an increase in monitoring activity. The baseline CITES report identified a relationship between the number of person-hours devoted to searching and the number of carcasses found, and the PIKE is significantly higher in areas with easy monitoring access. It’s also possible that it is related to elephant density.

I didn’t use exactly the same data as the NBER team for this model, so it may not be directly comparable – I built this model as an example of how to properly estimate the PIKE, not to get precise estimates of the PIKE. This model still has many flaws, and a better model would use the same structure but with the following adjustments:

  • Random effects for site
  • A random effect for country
  • Incorporation of elephant populations, so that we are comparing rates rather than average carcasses. Population numbers are available in the baseline CITES report but I’m not sure if they’re still available
  • Adjustment for age. Apparently baby elephant carcasses are hard to find but poachers only kill adults. Analyzing within age groups or adjusting for age might help to identify where the highest kill rates are and their possible effect on population models
  • Adjustment for search effort and some other site-specific data (accessibility and size, perhaps)
  • Addition of rhino data to enable a difference-in-difference analysis comparing with a population that wasn’t subject to the sale

These adjustments to the model would allow for the degree of searching and intervention involved in each country, which probably affects the ability to identify carcasses. It would also enable adjustment for possible contemporaneous changes in, for example, search effort. I don’t know how much of this data is available, but someone should use this model to estimate the PIKE and changes in mortality rates using whatever is available.
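
To give a flavour of the first few items on that list, the random effects and the population offset would turn the toy model above into something like the following lme4 sketch (the country and population columns here are completely made up, and this is an illustration of the structure rather than a model anyone should report):

library(lme4)

long$country    <- paste0("C", (long$site - 1) %/% 2 + 1)   # invented country labels
long$population <- 500 + 20 * long$site                     # invented elephant populations

fit_re <- glmer(Y ~ X * post_sale + I(year - 2002) + offset(log(population)) +
                  (1 | country) + (1 | site),
                family = poisson, data = long)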

Without the correct models, estimates of the PIKE and the associated implications are impossible. OLS models may be good enough for Obama, but they aren’t going to help any elephants. Through a properly constructed and adjusted Poisson regression model, we can get statistically robust estimates of the PIKE and trends in the PIKE, and we can better understand the real effect of interventions to control poaching.

Today the Guardian reported on a new study that claims a large sale of legal ivory in 2008 actually led to an increase in illegal elephant poaching. Basically in 2008 China and Japan were allowed to pay for a large stockpile of legally-obtained ivory, in the hopes that this would crash the market and drive ivory traders out of business. Instead, the study claims, the sale led to a big increase in poaching – approximately a 66% increase in elephants killed, according to the study. This is interesting because it appears to put a big dent in a common libertarian idea for preserving endangered species – that allowing a regulated trade in them would lead to their preservation. It is also one of those cute findings that puts a hole in the standard just-so story of “Economics 101” that everything is driven by supply and demand. We all know that in reality there are many factors which moderate the effect of supply and demand on crucial markets, and on the surface this study appears to suggest a quite contradictory supply and demand relationship in illegal poaching markets, in which increasing supply boosts poaching. But is it true?

The Guardian report links to the original study, which is held at the National Bureau of Economic Research behind a paywall, but which I managed to get a copy of through my work. I thought I would check the statistical methods and see if the study really did support this conclusion. My judgment is that this study is quite poor, and that the data doesn’t support that conclusion at all, due primarily to three causes:

  • A poor choice of measure for illegal poaching that doesn’t clearly measure illegal poaching
  • The wrong choice of statistical method to analyze this measure
  • The wrong experimental design

I will go through each of these reasons in turn. Where equations are needed, I have used screenshots from the original paper because I’m terrible at writing equations in html. Let’s get started.

The PIKE is a terrible measure of illegal poaching

The study is based around analysis of a data set of “legal” and “illegal” carcasses observed at search sites in 40 countries. Basically a “legal” carcass is an elephant that died on its own, while an illegal one is one that was shot and looted. Apparently poachers don’t bother to clean up the corpse, they just cut off the ivory and run, so it’s easy to see when an elephant has been poached. However, because no one knows the full details of elephant populations, the authors study an outcome variable called the PIKE, which is defined as the ratio of illegal carcasses to total carcasses. In their words (screenshot):

[Screenshot of their equation: PIKE = illegal carcasses / total carcasses]

They say that this enables them to remove the unknown population from the outcome by “normalizing” it out in top and bottom of the ratio. They justify this with a little proof that I am not convinced by, since the proof assumes that probability of discovering carcasses is independent of the number of carcasses, and that legal mortality and illegal mortality are not related in any way. But even if it factors out population, this PIKE measure doesn’t tell you anything about illegal poaching. Consider the following hypothetical scenario, for example:

Imagine a population of elephants in which all the older elephants have been killed by poachers, so only the pre-adult elephants remain. Every time an elephant becomes mature enough to have decent tusks a poacher kills it and the corpse is found. Further, suppose that the population is not subject to predation or other causes of legal mortality – it is young, and the environment is in good shape so there are large stocks of easier prey animals for lions to target. This population is at high risk of collapse due to adults being killed as they mature; indeed, let’s suppose no babies are born because adults are poached as soon as they reach sexual maturity. Thus every time an elephant is killed, the population drops by one towards its inevitable crash.

In this case, at every time point the PIKE would be 1, because there are no legal carcasses. The PIKE will remain 1 until there are no elephants left to die, at which point it becomes undefined (0 divided by 0). It doesn’t tell us anything about the impending population collapse.

Consider now a situation where there are a great many more legal deaths than illegal deaths. Denoting illegal carcasses by y and legal carcasses by x, we have y/(y+x) where y<<x. In this case we can approximate the PIKE by y/x, and if e.g. the number of illegal carcasses suddenly doubles we will see an approximate doubling in the PIKE. But suppose y is approximately the same as x. Then we have that the PIKE is approximately 1/2. Now suppose that the number of illegal carcasses doubles; then the PIKE increases to 2/3, i.e. it nowhere near doubles. If the number of illegal carcasses again doubles, it increases to 4/5. But if all deaths drop to 0 the measure is suddenly undefined (0 divided by 0) … So the magnitude of the increase in PIKE is not a direct reflection of the size of the change in poaching, and in at least one case there is no meaningful value at all. That is not a well-designed measure of poaching. It is also scale free, which in this case is a bad thing because it means we cannot tell whether a value of 1 indicates a single illegal carcass or 10 illegal carcasses. Similarly we don’t know if a value of 1/2 corresponds to 1 or a million illegal carcasses; only that however many there are, they are half of the total.

The authors say that this variable is constrained between 0 and 1, but this is not the whole story; there is also a non-zero probability that it is simply undefined, whenever a site-year turns up no carcasses at all. This strange distribution of the variable has implications for model choice, which leads us to the second problem with their data.

All the models in this study were poorly chosen

The authors choose to model the PIKE using an ordinary least squares (OLS) model with fixed effects for country and a (separate) fixed effect for each year. An OLS model is only valid if the residuals of the model are normally distributed, which is a very strong assumption to make about a variable that has lots of values of 0 or 1. The authors claim their residuals are normally distributed, but only by pooling them across years – when you look at residuals within individual years you can see that many years have residuals that are much less normally distributed. They also don’t show us the crucial plot of residuals against predicted values, which is where you get a real idea of whether the residuals are well-behaved.

An additional consequence of using an OLS model is that it is possible to predict values of the PIKE that are unphysical – values bigger than 1 or less than 0 – and indeed the authors report this in 5.6% of their data points. This is indicative of another problem – the PIKE shows a non-linear response to increased illegal kills (see my example from 1/2 to 2/3 to 4/5 above), so that for a fixed number of legal kills each additional illegal kill has a diminishing effect on the value of PIKE, but a linear OLS model assumes that the PIKE changes by a uniform amount across its range. Given that the goal here is to identify increases in the PIKE over time, this runs the risk of the model over- or under-estimating the true effect of the 2008 ivory sale, because it is not properly modeling the response of the PIKE score.

The authors try to test this by fitting a new model that regresses ln(illegal carcasses+1) against a function that includes ln(legal carcasses+1) like so:

[Screenshot of their equation (8): ln(illegal carcasses + 1) regressed on φ·ln(legal carcasses + 1) plus other terms]

This introduces a new set of problems. The “+1” has been added to both variables here because there are many zero-valued observations, and ln(0) doesn’t exist. But if there are lots of zero-valued observations, adding one to them is introducing a big bias – it’s effectively saying there was an illegal carcass where previously there wasn’t one. This distorts low numbers and changes the patterns in the data. The authors claim, furthermore, that “The coefficient on legal carcasses φ will be equal to unity if the ratio of illegal carcasses to legal carcasses is fixed”, but this is both nonsensical and obscures the fact that this model is no longer testing PIKE. It’s nonsensical because that is not how we interpret φ. If φ=1, then we can rewrite their equation (8) so that the left hand side becomes the natural logarithm of (illegal carcasses+1)/(legal carcasses+1). Then we are fitting a linear model of a new variable that is not the PIKE. We are not, however, assuming the ratio of illegal carcasses to legal carcasses is fixed. If φ is not 1, we are modeling the natural logarithm of (illegal carcasses+1)/(legal carcasses+1)^φ. The ratio here is still fixed, but the denominator has been raised to the power φ. What does “fixed” even mean in such a context, and why would we want to model this particular strange construction?

The authors do, finally, propose one sensible model, which is similar to equation (8) (they say) but uses a Poisson distribution for the illegal carcasses, and still fits the same right hand side. This is better but it still distorts the relationship between illegal and legal carcasses by adding a 1 to all the legal (but not the illegal) carcasses. It also doesn’t properly account for elephant populations, which is really what the legal carcasses serve as a proxy for. There is a much better way to use the legal carcass data and this is not it.

Finally there are two other big problems with the model: It uses fixed rather than random effects for country and site, which reduces its power, and also it doesn’t include any covariates. The authors instead chose to model these covariates separately and look for similar spikes in specific possible predictors of ivory usage, such as Chinese affluence. The problem with this is that you might not see a strong spike in any single covariate, but multiple covariates could move together at the same time to cause a jump in poaching. It’s better to include them in the model and report adjusted poaching numbers.

The wrong experimental design

An expert cited in the original article noted this interesting fact:

The Cites spokesman also noted that there had never been a one-off sale of rhino horn: “However, the spike in the number of rhinos poached for horn largely mirrors what has been seen with ivory. The illegal killing of rhino for its horn in South Africa alone increased from 13 in 2007 to close to 1,200 last year.”

This suggests that there has been an upsurge in illegal poaching across Africa that is independent of the ivory sale, and could reflect changing economic conditions in Africa (though it could also reflect different markets for ivory and rhino horn). It’s possible to test this using a difference-in-difference approach, in which rhino poaching data is also modeled, but is treated as not having been exposed to an intervention. The correct model specification then enables the analyst to use the rhino data to estimate a general cross-species increase in poaching; the elephant data identifies an additional, elephant-specific increase that could be said to be due to the ivory sale. The authors chose not to do this, which means that they haven’t rigorously ruled out a common change in poaching practice across Africa. If the CITES spokesman’s point is correct, then I think it likely that we would conclude the opposite to what this study found: that compared to rhinos, elephant poaching did not increase nearly as much, and in fact the ivory sale protected them from the kind of increased poaching observed with rhinos.
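
For concreteness, the difference-in-difference specification might look something like this in R, with an invented combined elephant/rhino dataset (no such dataset appears in the paper; every column and number here is a placeholder):

set.seed(2)
dd <- expand.grid(species = c("rhino", "elephant"), site = 1:5, year = 2002:2013)
dd$species   <- relevel(factor(dd$species), ref = "rhino")
dd$post_sale <- as.numeric(dd$year >= 2009)
dd$illegal   <- rpois(nrow(dd), lambda = 3)    # invented carcass counts

# The species:post_sale interaction is the difference-in-difference term: the
# elephant-specific change in illegal kills over and above whatever happened
# to the rhino "control" series after the sale.
did <- glm(illegal ~ species * post_sale + I(year - 2002) + factor(site),
           family = poisson, data = dd)
summary(did)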

Indeed, it’s possible that there were poachers flooding into the market at around that time for other reasons (probably connected to development and increasing demand in Asia), but after the ivory sale most of them switched to killing rhinos. That would suggest the sale was successful, provided you aren’t judging that success from the standpoint of a rhino.

A better model: Bayesian population estimation followed by Poisson regression

It’s possible to build a better model using this data, by putting the legal carcass data to proper use and then using a correctly-specified Poisson regression model on the illegal carcass data. To see how different the results might then look, consider Figure 1, taken from the Appendix of the paper, which shows the actual numbers of illegal carcasses in each year.

Figure 1: Distribution of illegal elephant kills, 2002 – 2013 (year is above its corresponding histogram)

Does it look to you like the number of elephants killed has increased? It certainly doesn’t to me. Note that between 20 and 50% of observed data are 0 kills in all years except 2002 (which the authors say was the start year of the data, and exclude from their analysis). Can you strongly conclude any change from these figures? I haven’t shown the legal kill data but it is broadly similar in scale. Certainly, if there is any upward step in illegal kills in 2008, it could potentially be explained simply by changes in populations of elephants – if even a small change in elephant density leads to an extra 1 or 2 kills per site per year, it would lead to distributions like those in Figure 1. To me it seems likely that the single biggest determinant of elephant kills will be the number of elephants and the number of poachers. If we assume the number of poachers (or the pace of their activity) changed after 2008, then surely we need to consider what happened to the population of elephants overall in 2008. If it declined, then poachers might catch the same number as 2007; if it increased, they would catch more.

The best way to analyze this data is to directly adjust for the population of elephants. We can use the legal kill data to do this, assuming that it is mostly reflective of elephant population dynamics. It’s not easy, but if from published sources one can obtain some estimate of the mortality rate of wild elephants (or their life expectancy), a Bayesian model could be built to estimate total population of elephants from carcasses. This would give a credible interval for the population that could then be used as what is called an offset in a Poisson regression model that simply modeled counts of illegal kills directly against time. The advantage of this is that it uses all 0 count events, because a Poisson model allows for zeros, but it adjusts for the estimated population. I think the whole thing could be done in a single modeling process, but if not one could obtain first a distribution of the elephant population, then use this to simulate many different possible regression model coefficients for the effect of the ivory sale. In this model, the effect of the ivory sale would simply represent a direct estimate of the relative increase in mortality of elephants due to poaching.
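
As a crude, non-Bayesian sketch of the idea (a proper version would propagate the uncertainty in the population estimate instead of treating it as known), in R it might look like this, with invented data and a purely illustrative natural mortality rate of 5% a year:

set.seed(3)
el <- expand.grid(site = 1:20, year = 2002:2013)
el$legal   <- rpois(nrow(el), lambda = 4)    # invented legal carcass counts
el$illegal <- rpois(nrow(el), lambda = 2)    # invented illegal carcass counts

# Step 1: back out a rough population estimate from legal carcasses, assuming a
# 5% a year natural mortality rate (an illustrative number, not a real one).
m <- 0.05
el$pop_hat <- pmax(el$legal, 1) / m          # crude guard against zero legal counts in this toy

# Step 2: Poisson regression of illegal kills with log(population) as an offset;
# exp() of the post_sale coefficient is the relative change in the poaching
# mortality rate after the sale.
el$post_sale <- as.numeric(el$year >= 2009)
fit_off <- glm(illegal ~ post_sale + I(year - 2002) + offset(log(pop_hat)),
               family = poisson, data = el)
exp(coef(fit_off)["post_sale"])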

Then, to complete the process, one would add in the rhino data and use a difference-in-difference approach to estimate the additional effect of the ivory sale on elephant mortality compared to rhinos. In this case one would find that the sale was protective for elephants, but potentially catastrophic for rhinos.

Conclusion

Based on looking at this data and my critical review of the model, I cannot conclude that the ivory sale led to an increase in poaching. I think CITES should continue to consider ivory sales as a tool to reduce elephant poaching, though with caution and further ongoing evaluation. In addition, based on the cited unnamed CITES spokesman, evidence from rhino poaching at the time suggests the sale may even have been protective of elephants during a period of increased poaching; if so, a further big sale might actually crush the business, although there would be little benefit to this if it simply drove poachers to kill more rhinos.

With regard to the poor model design here, it shows a lot of what I have come to expect from economics research: poor definition of an outcome variable that seems intuitive but is mathematically useless (in health economics, the incremental cost effectiveness ratio shows a similar set of problems); over-reliance on OLS models when they are clearly inappropriate; poor model specification and covariate adjustment; and unwillingness to use Poisson or survival models when they are clearly most suited to the data.

I think there is lots of evidence that legal markets don’t necessarily protect animals from over-exploitation (exhibit A, the fishing industry), but it is also obviously possible that economic levers of supply and demand could be used to kill an illegal industry. I suspect that more effective, sustainable solutions to the poaching problem will involve proper enforcement of sales bans in China and Japan, development in the regions where poaching happens, and better monitoring and implementation of anti-poaching measures. If market-crushing strategies like the 2008 ivory sale are going to be deployed, development is needed to offer affected communities an opportunity to move into other industries. But I certainly don’t think on the evidence presented here that such market-crushing strategies would have the exact opposite of the intended effect, and I hope this poor quality, non-peer-reviewed article in the NBER doesn’t discourage CITES from deploying a potentially effective strategy to stop an industry that is destroying a majestic and beautiful wild animal.

Those bastards!!

I ran into more problems trying to run OpenBUGS on a Bayesian spatial regression model today. The problems are laid out pretty simply in the picture above: I ran into an illegal memory write. I know this is not the fault of my model, as it runs smoothly for 50, 500 or 1000 iterations. This illegal memory write happens somewhere between 6000 and 6100 iterations.

This model is a simple version of a larger model I’m building. It has data on about 1800 small areas collected over four time points, with age and sex covariates, a Poisson distribution for the outcome and an adjacency matrix for the 1800 areas. The data are laid out with the time points and age- and sex-values as separate records, so for one municipality I have 4*6*2=24 observations. I can’t use other packages for this because they assume one observation per municipality (i.e. a very simple relationship between data structure and adjacency matrix). So I’m using OpenBUGS or WinBUGS. The ultimate goal is to extend to 30 time points and 18 age categories, with a more complex time and age structure, but I want to get this simple model running first because I’m running into major computational issues.

Today I identified that these computational issues can’t be solved. The reason for the illegal memory write is that OpenBUGS has a 1.6Gb RAM limit. People say this but it’s hard to confirm; however, in digging through critiques of BUGS I discovered it is written in Component Pascal and compiled in BlackBox, and a bit more digging reveals BlackBox has a 1.6Gb RAM limit.

This is frankly ridiculous. It’s understandable that there would be a RAM limit in a 32-bit system, but OpenBUGS has been around for about 9 years now and no one has upgraded it to handle either multi-core computers or 64 bit architecture. Given that OpenBUGS is designed to handle complex, processor-demanding simulation-heavy Bayesian models, the decision to write it in such a restrictive framework is ridiculous. I’m a strong critic of open source systems but if it had been written for R it would at least be able to use the 64-bit architecture, even though R has been late to the multi-core party. I bought a PC with 128 Gb of RAM specifically for this big project, and I might as well have bought a laptop with the minimum RAM. For the model I ultimately aim to build I expect I will need about 100,000 iterations to get stability, which means that OpenBUGS will never get there. The only way to run this model without a workaround is to either 1) write the full adjacency matrix by hand and implement it directly [something I am considering] or 2) recompile the source code (which I think might not be available) in a different Pascal compiler. I have no idea how to even start with that.

I have considered two workarounds, though I don’t know whether either of them will work.

  1. Save and rerun: the reason that OpenBUGS hits its RAM limit is that it saves all iterations in RAM, but I think this is possibly bad design. So one option could be to run the model to 5000 iterations, then save the results, shut down OpenBUGS, reopen OpenBUGS, load the model file, and then use the update functions to run another 5000 iterations on what has already been run. I think this can be done (running additional updates on a past model) but I’m not sure. If this works I just need to run one set of runs a night for about 3 weeks, and I’ll get my model.
  2. Rerun with sequential initial values: If method 1 doesn’t work, another option that I am very sure will work is to run the model for 5000 iterations, extract all estimated values from the model, then start a new model of 5000 runs with the past estimated values as the initial values for the next model (a rough sketch of how this might look through R2OpenBUGS is below). I’m pretty sure it will pick up roughly where it left off, assuming I specify all the values correctly; there might be small jumps in the trace, but ultimately it’ll get where it needs to go. But restarting the model is going to take a lot of time unless I can find a way to loop through and extract the values automatically (I think I can’t). So probably a month to run the model.
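
Here is roughly what I imagine option 2 looking like through R2OpenBUGS, assuming the bugs object it returns carries last.values per chain the way the R2WinBUGS one does (the model file, data list and parameter names below are placeholders, not my actual model):

library(R2OpenBUGS)

# First leg: 5000 iterations (bugs_data, bugs_inits and the parameter names are placeholders)
fit1 <- bugs(data = bugs_data, inits = bugs_inits,
             parameters.to.save = c("beta", "sigma"),
             model.file = "model.txt", n.chains = 2,
             n.iter = 5000, n.burnin = 0)

# Second leg: feed the final sampled values back in as the initial values and keep going;
# small jumps in the trace at the join are possible
fit2 <- bugs(data = bugs_data, inits = fit1$last.values,
             parameters.to.save = c("beta", "sigma"),
             model.file = "model.txt", n.chains = 2,
             n.iter = 5000, n.burnin = 0)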

Ridiculous! And even if I had a supercomputer I couldn’t speed it up …

Today Stata 14 was released, and it finally has built in functionality for a wide range of Bayesian statistical tasks. I’m hoping that they’ll introduce conditional autoregression and the BYM model in version 15. Really, if you have a grant that can afford a single Stata license, it’s really worth getting. You just can’t rely on open source statistical packages. Only use them when you must!

Today I’m trying to do some simple prediction in R. I have a simple linear model with a single missing data point, and I need to get the predicted values and standard errors for 14 data points including this one missing point. Because I find R’s data handling in linear prediction really hard to use, I decided to use a brutal, simple loop approach to get the job done. And in doing so I have uncovered a really weird piece of behavior.

My linear model predicts a proportion as a function of year, with year valued from 0 to 13 and 2 missing. My outputs are a matrix with the year and the predicted value (con.pred) and a matrix with the year and standard error (con.se). My input is the data (dataValues) and the resulting model (con.lm). So my model runs like this:

con.lm <- lm(y ~ year, data = dataValues)

for (j in 1:14) {
  year <- j - 1
  # newdata here is a bare scalar rather than a data frame -- this is the version that breaks
  temp.pred <- predict(con.lm, newdata = year, se.fit = TRUE)
  con.pred[j, 2] <- temp.pred$fit
  con.se[j, 2]   <- temp.pred$se
}

Now the weird thing here is that this breaks down and delivers the following error:

Error in eval(predvars, data, env) : not that many frames on the stack

However, it didn’t stop working immediately. The first 5 elements of the output con.pred have values – the process failed at j=6. Even weirder, when I ran it two hours ago it failed at j=10. However, it works fine if I make this tiny change:

for (j in 1:14) {
  year <- j - 1
  # wrapping the value in a data frame is the only difference
  temp.pred <- predict(con.lm, newdata = data.frame(year), se.fit = TRUE)
  con.pred[j, 2] <- temp.pred$fit
  con.se[j, 2]   <- temp.pred$se
}

How can it be that the prediction process works for the value year=5, but doesn’t work when year=6? But it works for all values if I redefine the scalar year to be a data frame? Does this mean that R treats scalars with a value below 6 as data frames, but not above 6?

Also, if I type the following commands directly into the command line I get different predicted values:

temp.pred<-predict(con.lm,newdata=(year=1),se.fit=TRUE)

temp.pred<-predict(con.lm,newdata=1,se.fit=TRUE)

However, the latter command doesn’t produce any warnings and does not do the default prediction (which would return 69 fitted values); it returns one value, which appears to correspond to a year value of about 9. Furthermore, while the first of these two commands works, the following command does not, and brings up the same “not that many frames” error:

temp.pred<-predict(con.lm,newdata=(year=10),se.fit=TRUE)

So I have confirmed in the command line that the commands I have given in script break at some value of year>1.

I don’t know what’s going on here, but it’s completely counter-intuitive. The only error returned is also completely meaningless. R really needs to improve its linear prediction and data handling, because there is something deeply wrong with the way it handles input data.
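
A postscript, for what it’s worth: my best guess at the mechanism is that eval() treats a bare number in its envir argument as a call-stack frame index (this is documented in ?eval), so when a naked scalar gets passed through predict.lm to eval(predvars, data, env), the variables are looked up in whatever stack frame that number happens to point to – which would explain why the failure point moves around depending on how deep the call stack is at the time. I haven’t traced it all the way through the predict.lm source, so treat this as a guess; the snippet below just shows the contrast.

dat <- data.frame(year = 0:13, y = rnorm(14))
fit <- lm(y ~ year, data = dat)

# A bare number as newdata ends up being used as a frame index somewhere inside
# the prediction machinery, so whether it "works" depends on the calling context
try(predict(fit, newdata = 5))

# Wrapping the value in a data frame is the reliable form
predict(fit, newdata = data.frame(year = 5))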

I’ve recently been building a fairly complex series of Bayesian spatial regression models in BUGS, and thought I’d share some tips based on hard won experience with the models. The various BUGS packages have the most cryptic and incoherent error messages of any stats software I have ever worked with, and although various Bayesian boosters claim that their modeling approach is intuitive, in my opinion it is the exact opposite of intuitive, and it is extremely hard to configure data for use in the packages. Furthermore, online help is hard to find – google an error message and you will find multiple websites with people asking questions that have never been answered, which is rare in the modern world. I take this as a sign that most people don’t understand the error message, and indeed the BUGS manual includes a list of errors with “possible interpretations” that reads more like the I Ching than a software guide. But Confucius say Enlightenment is not to be found in Black Box Pascal, so here is my experience of BUGS.

The models I’m running are complex, with nested conditional autoregressive structures and the higher level having more than 1000 areas with complex neighbour relationships, and millions of observations. I originally ran them on a completely hideous Hewlett Packard laptop, with 4 cores and 8Gb of RAM. I subsequently upgraded to a Dell Workstation (joy in comparison to HP’s clunky root-kitted horror) with 8 cores and 16Gb of RAM; I’m not sure that hardware is the main barrier to performance here though …

The HP machine had a secret administrator account (arseholes!) so I couldn’t install winBUGS[1], so I started off running OpenBUGS called through R’s R2OpenBUGS package running in RStudio. I use R to set up the data and initial values, because I can’t think of any other way to load millions of observations into a text file without going stir crazy. But when I call OpenBUGS it just hangs … no error messages or any other kind of indication of what is going on. I also can’t tell if it is happening at the data loading or compiling or inits stage.

Some digging around online and I found an old post by Andrew Gelman, observing that BUGS does not work well with “large datasets, multivariate structures, and regression coefficients.”

i.e. pretty much every statistical problem worth doing. Gelman also notes that “efficiently-programmed models can get really long, ugly, and bug-prone,” which seems like a contradiction in terms.

Anyway, noting that my data was large, with multivariate structures and regression coefficients, I thought maybe I should tone it down a bit so I tried using a higher level of spatial hierarchy, which reduces the adjacency matrix by an order of magnitude. Still no dice. It was at this point that I upgraded to the bigger computer.

On the bigger computer the smaller model actually worked! But it didn’t work in the sense that anything meaningful came out of it … It worked in the sense that it reported a completely incomprehensible bug, something like a node having an invalid value. I tried multiple different values and nothing worked, but somewhere on the internet I found someone hinting that you should try running BUGS directly rather than calling through R, so I tried this … having created the data in R, I killed OpenBUGS then opened the OpenBUGS interface directly and input the model, then the data, using the text files created by R[2].

When I did this I could step through the process – the model was syntactically correct, then the model failed to compile! Given that loading inits comes after compilation, an error telling me that I had the wrong initial value seems a bit misleading… in fact I had an “index out of range” error, and when I investigated I found I had made a mistake preparing one part of the data. So where the actual error was “the model can’t compile because you have provided the wrong data”, the error reported when calling through R was “you have the wrong initial values” (even though I hadn’t actually loaded initial values yet).

WTF?! But let’s step back and look at this process for a moment, because it is seven shades of wrong. When you run R2OpenBUGS in R, it first turns the data and inits into a form that OpenBUGS can read; then it dumps these into a directory; then it opens OpenBUGS and gets OpenBUGS to access those files in a stepwise process – at least, that’s what I see R doing. If I decide to do the model directly in the OpenBUGS graphical interface, then what I do is I get R to make the data, then I use the task manager to kill OpenBUGS, then I call OpenBUGS directly, and get OpenBUGS to access the files R made in a stepwise process. i.e. I do exactly the same thing that R does, but I get completely different error messages.

There are various places on the internet where you might stumble on this advice, but I want to stress it: you get different error messages in OpenBUGS run natively than you do in OpenBUGS called through R. Those error messages are so different that you will get a completely different idea of what is wrong with your program.

Anyway, I fixed the index but then I ran into problems after I tried to load my initial values. Nothing seemed to work, and the errors were really cryptic. “Invalid initial value” is not very useful. But further digging on the internet showed me that OpenBUGS and WinBUGS have different approaches to initial values, and winBUGS is not as strict about the values that it accepts. Hmmm … so I installed winBUGS, and reran the model… and it worked! OpenBUGS apparently has some kind of condition on certain variables that they must sum to 0, while winBUGS doesn’t check that condition. A free tip for beginners: setting your initial values so they sum to 0 doesn’t help, but running the same model, unchanged, in winBUGS, works.

So either OpenBUGS is too strict, or winBUGS lets through a whole bunch of dodgy stuff. I am inclined to believe the former, because initial values shouldn’t be a major obstacle to a good model, but as others[3] have observed, BUGS is programmed in a completely opaque system so no one knows what it is doing.

So, multiple misleading errors, and a complex weirdness about calling external software through R, and I have a functioning model. Today I expanded that model back to the original order of magnitude of small areas, and it also worked, though there was an interesting weirdness here. When I tried to compile the model it took about three hours, and produced a Trap. But the weird thing is the Trap contained no warnings about BUGS at all, they were all warnings about windows (something called Windows.AddInteger or similar), and after I killed the Trap my model updated fine. So I think the compile problems I previously experienced may have had something to do with memory problems in Windows (I had no problems with badly designed adjacency matrices in the larger model), but OpenBUGS just doesn’t tell you what’s going on, so you have no idea …

I should also add, for intrepid readers who have got this far, that this dude provides an excellent catalogue of OpenBUGS errors with his plain English explanations of what they actually meant. He’s like this mystical interpreter of the I Ching for Bayesian spatial regressives. Also I want to add that I think the CAR spatial correlation model is super dodgy. I found this article (pdf) by Melanie Wall from the Journal of Statistical Planning and Inference (what a read!) that shows that the way we construct the spatial adjacency matrix is the primary determinant of the correlation structure, and that the correlation structure determined by this adjacency matrix is nothing like what we think we are getting. Today on my whiteboard and with the help of R I imagined a simple industrial process where each stage in the process is correlated with the one before and after it, and I showed very easily based on Wall’s work that the adjacency matrix required to describe this process is completely different to the one that you would naively set up under the framework described for CAR modeling. So I think most of the “spatial correlation” structures described using CAR models have no relationship to what the programmer thinks they’re entering into the model. But I have no proof of this, so I guess like everyone else I’ll just press on, using the adjacency matrix I think works …

So there you have it. Next time you see an opinion formed on the basis of a spatial regression model built in BUGS, remember the problems I had getting to the output, and ask yourself – do you trust that model? Really?

fn1: Well, I could copy winBUGS into the program files folder but I couldn’t patch it or install the immortality key, which, wtf? When I bought Time Series Modelling and Forecasting by Brockwell and Davis, ITSM came as a free disk with the book. When I buy the BUGS book I get to install software that comes with a big message telling me to get stuffed, and years later they finally provide a key that enables you to use it for free …

fn2: If you aren’t aware of how this works, basically when you call OpenBUGS in R, providing data from inside R, R first dumps the data into text files in the directory of your choosing, then OpenBUGS opens those files. So if you aren’t comfortable preparing data for BUGS yourself, use the list and structure commands in R, then kill OpenBUGS and go to OpenBUGS directly … the text files will remain in the directory you chose.

fn3: Barry Rowlingson does a pretty good job of showing how interesting and useful spatial analysis can be: see e.g. his post on mapping the Zaatari camp for Syrian refugees in Jordan.

I have a dataset of about 40 million records covering rare events over a 30-year period, with about 600,000 events in total. The data are clustered in about 50 regions, and in the latter years of the file there are only about 4000 events a year (they increase in frequency in the early years). This means that in one year there may be only 80 events in any one region. I want to conduct a multi-level Poisson regression of the events, with about four or five categorical covariates plus year and a random effect for region. I’m doing this using GLLAMM in Stata on an Apple computer with 32Gb of RAM, 12 cores and Stata/MP optimized for all 12 cores (this is not a cheap setup!). The raw dataset takes about 16Gb but I can collapse it to a dataset of counts of events and populations at risk, which reduces it to (I think, from memory) about 170,000 records. I’m using a Poisson distribution, so essentially a loglinear model, but ideally I should probably use a negative binomial model for such zero-heavy data – unfortunately GLLAMM doesn’t handle negative binomial data and I have read of problems with the xtnbreg procedure, so I just have to suck it up and use the Poisson family. I’m also not adjusting for serial dependence, because that would mean extra random effects and/or correlation structures and it would just get too nasty.

The problem here is that GLLAMM just won’t converge. If you set it running with relatively standard settings you can first run it with the NOEST option to get the coefficients, then feed them as starting values into a model with adaptive quadrature and 12 integration points, but it just won’t converge: after 3 or 4 days and 150 iterations (one estimation cycle takes about an hour) it still hasn’t converged, and the maximizer keeps reporting that the log likelihood is not concave. The maximization process seems to be proceeding in some fashion, because every 10 iterations or so the log likelihood changes slightly, but it just won’t seem to stop. Interestingly you can’t set the tolerance in the estimation process in GLLAMM, so I can’t stop it when the log likelihood is changing by less than 0.01 (which it is doing now) but more than 0.00001.

This non-convergence is not a data issue – it works fine in a standard Poisson regression, there are no warnings about combinations of covariates with all zeros or all 1s, and this also seems to be the case when the data is divided up into regions (as best I can tell). I think the problem is just that the events are rare, there are a lot of dimensions over which the likelihood needs to change, and it’s just a very tough optimization problem. I have run into this problem with non-linear multi-level models before, and it appears to be a problem of big data with rare events and large numbers of regions. So what to do?

I have noticed in the past that estimation in GLLAMM is more likely to be successful if there are more integration points, which lengthen the time to converge but improve stability. I’ve also noticed that the initial estimates can really kill the model, and that a big part of the estimation procedure is about getting standard errors, not estimates. Also, although you can’t control the tolerance, you can limit the number of iterations (a rough version of the same thing). With each iteration taking an hour, if one wants to get actual results it is helpful to run as few iterations as possible[1]. So today I tried a new method for getting convergence, which works in stages like this:

  • Start with the NOEST option to get initial coefficients for starting values
  • Feed these into a model with 1 integration point (Laplace Approximation) and 1 iteration
  • Feed the coefficients from this as starting values to a model with 2 integration points and 2 iterations
  • Keep running new models, slowly increasing integration points and iterations

After a few of these stages I found that my models started converging after just two Newton-Raphson cycles. By the fifth model (with nip(12) and iterate(8)) the coefficients had all become stable to 4 decimal places, and the variances had largely settled: all that remained was small changes in the variance of the random effect. At the fourth model some of the variances were still out of whack, but the coefficients were close to their final values. The fourth and fifth models converged on the second Newton-Raphson step. I set it running last night and it’s basically now fine-tuning, using a model with 24 integration points and a maximum of 32 iterations. That will be the final model; it’s probably going to be done by tomorrow morning.
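
For anyone stuck with the same problem but working in R rather than Stata, the same staged idea translates roughly to lme4 (this is an analogue, not my GLLAMM code; the data below are invented, and passing start values this way only applies cleanly when there is a single scalar random effect, as here):

library(lme4)

set.seed(4)
d <- expand.grid(region = factor(1:50), year = 1:30, age = factor(1:6), sex = 0:1)
d$pop    <- rpois(nrow(d), lambda = 2000)   # invented population at risk
d$events <- rpois(nrow(d), lambda = 3)      # invented event counts

# Stage 1: nAGQ = 0 is a cheap approximation that gets rough parameter values quickly
m0 <- glmer(events ~ age + sex + year + (1 | region) + offset(log(pop)),
            family = poisson, data = d, nAGQ = 0)

# Stage 2: refit with more quadrature points, starting from the previous estimates
# rather than from scratch
st <- list(theta = getME(m0, "theta"), fixef = fixef(m0))
m1 <- glmer(events ~ age + sex + year + (1 | region) + offset(log(pop)),
            family = poisson, data = d, nAGQ = 12, start = st)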

Because this modeling process starts with the final coefficients from the previous model, it seems to bypass a lot of the convergence problems that arise from starting with a model that has coefficients very far from the final values and then waiting. Instead, you get rough coefficients halfway through the process and then restart. Since much of the challenge for the estimation process seems to be in calculating variances rather than coefficients, this saves you a lot of time waiting to get values. It seems to have short-circuited a lot of convergence issues. I don’t know if it would work for every problem (GLLAMM seems to be a temperamental beast, and every time I use GLLAMM a new problem seems to rear its ugly head), but if you are having convergence problems and you are confident that they are not caused by bad data, then maybe give my method a try. And let me know if it works! (Or if it doesn’t!)

fn1: Actually this model process takes so long that there are some sensitivity analyses we just have to skip. There is one method we use in preparing the data that I would like to check through resampling and sensitivity analysis, but if every model is going to take 2 days, then you can’t even do a paltry 100-sample bootstrap and get a meaningful answer in half a year. Crazy!