Recently a major economics paper was found to contain basic excel errors, among other problems, and an amusing storm of controversy is growing around the paper. The controversy arises because the paper was apparently quite influential in promoting the most recent round of austerity politics in the western world, and the authors themselves used it actively in this way. The authors even managed to find a magic number – 90% – at which government debt (as a proportion of GDP) throttles growth, a threshold that many small government activists and “sustainable deficit” types have been seeking for years. It’s like manna from heaven!

There’s been a lot of hilarity arising from this, about how woeful the economics field is and about how vulnerable policy-makers on crucial issues like government spending can be to even quite terrible research that supports their agenda. But there has also been some criticism on statistics and academic blogs about the use of excel for advanced analysis, and what a bad idea it is. Andrew Gelman has a thread debating how terrible excel is as a computational tool, and Crooked Timber has a post (with excellent graphic!) expressing incredulity that anyone would do advanced stats in excel. While I agree in general, I feel an urgent need to jump to the defense of MS Excel.

MS Excel is great. It’s much, much more convenient than a calculator, and it can be used to do quite complex calculations (as in sums and multiplications) in multiple directions that would take ages on a calculator. On most computers now the calculator is buried or, if you’re a windows user, crap, and if you need anything more than addition it’s much more convenient to drag out excel. Sure it takes a moment to load compared to your calculator function, but it is so much easier to compare numbers, to calculate exponents and logs, and to present simple results in excel than in a calculator. As a simple case in point: if you get regression coefficients from Stata you can copy and paste them into excel and exponentiate to get relative risks, etc.; then you copy the formulas below, run a new regression model (with, e.g. random effects that weren’t in the previous one) and paste the results to enable you to compare between models quickly and easily. Similarly, if you’re checking a paper to see if they calculated odds ratios or relative risks, you can chuck those numbers into excel and do the comparisons with the contingency table right there in front of you. It offers a simple, graphically convenient way to visualize numbers. This is especially useful when the task you’re approaching is conceptually very simple (like a contingency table) but takes a bit of time to do on a hand calculator, and takes a bit of time to convert to the file formats required in Stata or R. In the time it takes me to think about how to structure the problem, input four lines of data to R, and then write the code to calculate the odds ratios, I can do the whole thing in excel, have the contingency table in front of me to check I’ve made no transcription errors from the paper, and fiddle quickly with changing numbers.

If you’re doing cost-effectiveness analysis in TreeAge (shudder) or R, excel is a really useful tool both for outputting results to something that is vaguely attractive to use, and for doing ballpark calculations to check that your models are behaving reasonably. This is especially useful if you’re doing stochastic Markov models, that can take hours or days to run in TreeAge, because you can’t trust software like that to give you the correct answer if you try to treat your stochastic model as a simple decision tree (because of the way that TreeAge faffs around with probability distributions, which is non-intuitive). Make a few simple assumptions, and you can do approximate calculations yourself in excel, and fiddle with key numbers – cohort size or a few different parameters – and see what effect they have.

Recently I was helping someone with survival analysis and she was concerned that her definition of time to drop out was affecting her results. She conducted a sensitivity analysis in Stata to see what effect it was having, and although with correct programming she could have produced all the material she needed in Stata, the time it takes to do this and debug your code can be time-consuming if you aren’t a natural. It’s much easier with modern machines to just run the regression 10 times with different values of drop-out time and plot the output hazard ratios in excel.

So, I think excel is a very useful tool for advanced modeling, precisely because of its ease of use and its natural, intuitive feel – the properties that recent excel bashers claim make it such a terrible device. While I definitely think it should not be used for advanced models themselves, I find it a hugely valuable addition to the model-building process. Reproducible code and standardized tools are essential for publishable work, but unless you are one of those people who never does any fiddling in the background to work out what’s going on in your model, excel will turn out to be your go-to tool for a wide range of support tasks.

In any case, the bigger problem with Rogoff and Reinhart’s work was not the excel error. Even if they had got the excel code right, their results would still have been wrong because their modeling method was absolutely appalling, and should never have seen the light of day, even at a conference. The main flaws in their work were twofold:

  • They binned years together, essentially giving different years different weights in the final model
  • They stripped the years out of their time series context, so crucial information contained in the time ordering of deficits and growth was lost

I think the second flaw is the most specifically terrible. By using this method they essentially guaranteed that they would be unable to show that Keynesian policies work, and they stripped the cause-effect relationship from all data collected in the Keynesian era (which lasted from the start of their data series to about 1970). In the Keynesian era, we would expect to see a sequence in which deficit increases follow negative growth, so unless the negative growth periods are very short and random, Reinhart and Rogoff’s method guarantees that this looks like an association between negative growth and higher deficits. If Keynesian policies actually work, then we would subsequently see an increase in growth and a reduction in deficits – something that by design in Reinhart and Rogoff’s model would be used to drive the conclusion that higher debt causes lower growth.

In short, no matter what package they used, and no matter how sophisticated and reproducible their methods, Reinhart and Rogoff’s study was designed[1] to show the effect it did. The correct way to analyze this data was through the presentation of time series data, probably analyzing using generalized least squares with a random effect for country, or something similar. Using annual data I think it would probably be impossible to show the relationship between debt and growth clearly, because recessions can happen within a year. But you could probably achieve better, more robust results in excel using proper time series data than you could get in R from Reinhart and Rogoff’s original method.

The problem here was the operator, not the machine – something which should always be remembered in statistics!

—-

fn1: I use the term “was designed” here without any intention to imply malfeasance on the part of the authors. It’s a passive “was designed”.

Advertisements