I stumbled on a problem with out-of-sample prediction in R, which I think is an example of idiosyncratic programming (and possibly insoluble) and, in the spirit of Andrew Gelman’s stats blog, I thought I’d put it on the internet to see if a) I’m just being stupid and b) it really is as much of a barrier to automation of statistical functions as I think it is. Perhaps someone with more clout than me will find this blog and notice this problem – or perhaps it’s not a problem and I’m just being stupid (or unfairly demanding of my prediction modeling software).
In case you haven’t worked this out already, if you aren’t interested in stats, R, or nerdy shit that wastes everyone’s time, you should stop reading right now. Everything beneath the dotted line is going to kill your love life, and probably your pets.
First, let’s consider the basic out-of-sample prediction process. We run a very simple linear model and then we provide R with a set of new data – containing exactly the same variable and type of variable – and get it to work out what the expected value of the line is on the new predictor variables. Taking the example straight from the manual, here is the code for a trivial predictive model:
x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)
This is very simple code: we produce a set of x values and make the y values equal to these x values plus a random amount. Then we run the model and get the predicted values (just for shits and giggles, btw), and then we make a new data set that contains values ranging from -3 to 3 in steps of 0.5, and run the predictions on these new x values. In this code, the data frame called new is the out-of-sample data, and we get the expected value of y given the observed values in new, for the given model. This is stats 101, right?
In my case, the x values are
1.5313262 1.5307600 -0.5067223 -0.1366010 -0.6557527 1.4245267 -0.3917136 1.7941995 2.0511560 0.3602334 -0.8798598 -0.5755816 -1.6419118 -0.7885237 1.1478910
and the coefficients of the resulting linear model are
That is my model has an intercept of 0.367 and a slope of 1.047. This means that, e.g. for a new x value of -3 I would expect y to be 0.367+-3*1.047, or about -2.774. The correct predicted values for the new data set, based on these coefficients, produced by the 5th line of my code, are:
-2.7751002 -2.2514963 -1.7278925 -1.2042886 -0.6806847 -0.1570809 0.3665230 0.8901269 1.4137307 1.9373346 2.4609385 2.9845424 3.5081462
Now, let’s produce our first example of a problem. Consider the following very simple code:
Here, I’ve bound my previous x and y values together to make a matrix, with y in column 1 and x in column 2. Then I’ve rerun the model using the perfectly natural R method, of referring to each variable in the data set by its column reference ([,1] for the y, [,2] for the x). Then I’ve rerun the out-of-sample prediction on the same set of x values as I used before (the vector new). This is the simplest possible non-trivial linear model, and out-of-sample prediction in this case should be as simple as in the case of the previous model. So what happens? First, the following “warning”:
‘newdata’ had 13 rows but variable(s) found have 15 rows
Note this is a warning. So what values do we get? Here they are:
1.97013967 1.96954674 -0.16412052 0.22347339 -0.32018633 1.85829845 -0.04368247 2.24542264 2.51450943 0.74376222 -0.55487302 -0.23623050 -1.35289974 -0.45922514 1.56860335
These values are different. In fact, they’re the in-sample predictions. You can check it – they’re the result of applying the coefficients to the original observations of the predictor variable. This means that two important things happened:
- R failed to produce a solution to the simplest possible linear modeling problem, simply because I changed the input method for my regression model – even though the revised input methods (for the formula and the data) are entirely consistent with the R language
- The “warning” was not a warning: R didn’t just warn me about a problem, it failed to apply the command I told it to, and didn’t tell me about its decision. I could publish this shit, and not realize that I’ve been given my original values back.
Note how thoroughly horrible this would be if I were running some kind of automated series of regression processes – I once did a capture/recapture model that was heavily dependent on automation in R, and I’m wondering if I fell for this crap then. R should not be declaring a “warning” when it is actually refusing to run the command I asked it to and producing completely different output. I know I’m only a statistician, but imagine if this kind of programming were allowed in Jumbo Jets. Unhappy punters we would all be, young Jedi.
Also let’s just pause and dwell on the meaningless nature of that warning. I’ve supplied a vector of 13 observations, and the vector of 1s (for the constant) is implicit in everything I’ve done to date. So this means I’m (reasonably) assuming R will construct a 13×2 matrix with a column of 1s and a column of (out-of-sample) x values. Then, we have a 2×1 vector of coefficients. Thus, R should be able to calculate a vector of outputs from the information I’ve supplied it. The only possible explanation is that it is expecting me to supply it with a full-rank matrix, i.e. it has decided arbitrarily to stop implicitly assuming the constant. So, shall we try that?
Same prediction process, only now the out-of-sample data that I provide is full-rank, containing a column of 1s and a column of (out-of-sample) x values.
The result? The same problem: the same warning in red, and in-sample expected values. So the problem is not that I’m providing a less-than-full-rank matrix of predictors on the assumption that one column of the design matrix is implicit. The problem is deeper than that.
I can, however, fix the problem in a weird way. If I assign column names to the original data, and also assign the new data a name that matches the name of the predictor in the original data, I get the correct output:
In this case I’ve made a new data frame for the original data and the out-of-sample data, to avoid confusion, and I’ve given them all consistent names. Note now that in the prediction process the out-of-sample data, new2, is not full rank – the column of 1s is assumed in the calculation process. But it doesn’t cause any trouble: the previous warning about seeing only 13 observations doesn’t appear, and I get the correct out-of-sample predictions. I still only have 13 observations, but R doesn’t care. I get correct predictions and everything is fine.
This might seem like a trivial problem until you’re doing automation. I recently wrote a program to do a search through a large number of univariate models, and then combine the variables from the top 10 in a predictive model. The predictions were the only thing I wanted. But I spent hours trying to work out why R wouldn’t give me out-of-sample predictions: my training sample had about 50k records, and my validation sample had about 20k. I was using numeric column references for the purposes of automation (the process should be fairly obvious: you make a matrix of variable numbers with AIC values from the corresponding linear models, sort by AIC values, choose the variable numbers corresponding to the best 10, and then put them into a formula in a linear model) but it wasn’t working. Eventually I worked out this problem, made a vector of column names and used the paste() function to produce a formula.
But this work around is stupid. In both cases – using column numbers or referring to names – the design matrix is implicit, not specified by the user. The underlying R code has to construct this design matrix. So why does it screw me around like this? Why provide two ways of referencing columns and then have one of them bork your most basic modeling function? What would Tibshirani say? Dividing data into training and validation sets is an essential part of modern machine learning principles, but R can’t handle it because – omfg – one has a different number of records to the other. Even though the matrix mathematics is in no way affected by the difference in the number of records.
That, dear reader, is called crap program design. Especially crap is the issuing of a warning that isn’t a warning at all, but an indication that the program has arbitrarily decided to produce a result different to the one you asked for! Rather than declaring a “warning,” R should say clearly “Catastrophic failure! Providing default prediction using original design matrix only!” Or better still it could say “What are you doing, Dave?” and refuse to let me out of the cargo bay doors.
The manual says the following about giving out-of-sample data to the program:
If the fit is rank-deficient, some of the columns of the design matrix will have been dropped. Prediction from such a fit only makes sense if
newdata is contained in the same subspace as the original data. That cannot be checked accurately, so a warning is issued.
A warning will be given if the variables found are not of the same length as those in
newdata if it was supplied.
This is not exactly the most informative help in the world. I don’t think that what is described in the first sentences (to the extent that I understand the English) is what happens in practice, and the last sentence is certainly misleading. In R, warnings usually indicate something you should be aware of that does not affect the numerical processing of results (see e.g. Andrew Gelman’s statement in the above blog link, “I don’t care about warning messages.” Except you should, because sometimes R issues a message called a “warning” that is actually an indicator of insouciant and catastrophic failure to enact the expected mathematical process.
Open source software: it sucks. Except that R is far and away the best software for automation in stats – until you run into crap like this. Fix it, CRAN!
fn1: of course the R manual makes this part just slightly more obscure than it needs to be, because hey! R is a really pithy language, so why would you use comments to make the process easier to grasp, or separate the process of generating x from the process of generating epsilon? Anyone who knows what they’re doing can figure it out, and manuals are for people who already know what they’re doing – right?
fn2: I guess I could do this more accurately by supplying a variable containing only 1s, and specifying “no constant.” That might work, I suppose. How horrid.
Here is the full code to reproduce all these results:
x <- rnorm(15)
y <- x + rnorm(15)
pred.v1<-predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
pred.vnew<-predict(lm(y ~ x), new, se.fit = TRUE)
## now make a data set yy that contains a column for x and a column for y
## now give the data set names and retry it
# now try a full rank matrix
# try again with matrix maths
pred.frnew2<-new.fr %*% x.mat