I’ve complained before about the reliability and quality of the open source statistics package, R. Sometimes I get pushback, with people suggesting that I just don’t understand what R is trying to do, or that there is an obvious way to do things differently that answers my complaints – that R is idiosyncratic but generally trustworthy.

Well, try this exercise, which I stumbled on today while trying to teach basic programming in R:

  • Run a logistic regression model with any reasonable data set, assign the output to an object (let’s call it logit1)
  • Extract the Akaike Information Criterion (AIC) from this object, using the command logit1$aic. What is the value of the AIC?
  • Now extract basic information from the logistic model by typing its name (logit1). What is the value of the AIC?
  • Now extract more detailed information from the logistic model by typing summary(logit1). What is the value of the AIC?
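
For anyone who wants to reproduce the exercise, here is a minimal sketch. It assumes the built-in mtcars data and a single-predictor model, neither of which is what I used in class; the helper aic_lines() is a hypothetical convenience that pulls the AIC line out of the printed output.

```r
# Fit a logistic regression. mtcars and the formula are arbitrary
# stand-ins for "any reasonable data set", not the data from this post.
logit1 <- glm(am ~ wt, data = mtcars, family = binomial)

# Step 1: the AIC as stored in the fitted object
aic_stored <- logit1$aic
print(aic_stored, digits = 10)

# Step 2: the basic display (what you get by typing logit1)
print(logit1)

# Step 3: the detailed display
print(summary(logit1))

# Hypothetical helper: keep only the printed line(s) that mention the AIC,
# so the two displays can be compared side by side.
aic_lines <- function(obj) {
  grep("AIC", capture.output(print(obj)), value = TRUE)
}

basic_line   <- aic_lines(logit1)
summary_line <- aic_lines(summary(logit1))
cat(basic_line, summary_line, sep = "\n")
```

With a small AIC the two displays may well agree; the discrepancy shows up when the AIC has more digits before the decimal point than the display keeps.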

When I did this today, logit1$aic gave 54720.95, summary(logit1) gave 54721, and typing logit1 on its own gave 54720.

That’s right, depending on how you extract the information, R rounds the value of the AIC up, or truncates it. R truncates a numerical value without telling you.

Do you trust this package to conduct a maximum likelihood estimation procedure, when its developers not only can’t adhere to standard practice in rounding, but can’t even be internally consistent in their errors? And how can you convince someone who needs reliability in their numerical algorithms that they should use R, when R can’t even round numbers consistently?

I should point out that a decision to truncate a number is not a trivial decision. That isn’t something that happens because you didn’t change the default. Someone actually consciously programmed the basic output display method in R to truncate rather than round off. At some point they faced a decision between floor() and round() for a basic, essential part of the I/O for a statistics package, and they decided floor() was the better option. And no one has changed that decision ever since. I don’t think it’s a precision error either (the default precision of the summary function is 4 digits!) because the example I stumbled across today ended with the decimal digits .95. This was a conscious programming decision that no one bothered to fix.
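
It is easy to check what each candidate operation does to this value. The sketch below uses only base R functions (floor(), round(), signif()) and makes no claim about which of them the display code actually calls. One caveat it brings out: rounding to four significant figures with signif(x, 4) gives the same 54720 as floor() for this value, so this one example cannot tell the two apart. The second value, 54729.95, is a hypothetical probe that would separate them, not a number from my model.

```r
# The AIC value reported in this post
aic <- 54720.95

candidates <- c(
  floor   = floor(aic),      # truncation toward minus infinity
  round   = round(aic),      # round to nearest integer
  signif4 = signif(aic, 4),  # round to 4 significant figures
  signif5 = signif(aic, 5)   # round to 5 significant figures
)
print(candidates)

# Hypothetical probe value: here truncation and 4-significant-figure
# rounding give different answers, so it distinguishes the two.
probe <- 54729.95
probe_results <- c(
  floor   = floor(probe),
  signif4 = signif(probe, 4)
)
print(probe_results)
```

An AIC ending in something like 29.95 would therefore show which behaviour the basic display really has: 54729 would point to truncation, 54730 to significant-figure rounding.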

The more I work with R, the more I believe that it is only good for automation, and all its output needs to be checked in a system with actual quality control. And that you should never, ever use it for any process anyone else is going to rely upon.
