I just got a nice new MacBook: solid-state drive, quad core (I guess), a very pretty screen, and much lighter than my last one. It’s really fast! But it has had a huge problem since I got it: the Wi-Fi is hellishly slow. My 12 Mbps connection dropped to 0.6 Mbps when I used my new MacBook on it, and I could confirm this was the MacBook’s problem: I used the old one at the same time and it didn’t slow down (the old one hasn’t been updated to Mavericks).

I guess it could be a hardware problem, but I searched on a lot of forums and found that lots of people have the same problem, and a related problem of wi-fi dropping out every couple of minutes, and it’s definitely Mavericks: many people have reported this problem after updating. It appears that this problem has been around for 3-5 months and hasn’t been fixed.

Well, this morning I stumbled by accident on a solution on my MacBook: turn off the Bluetooth wireless mouse. This happened because I was comparing wireless speeds on my old and new MacBooks, and my wireless mouse and keyboard were attached to the old one. The new one was getting 12 Mbps downloads, so I decided to switch off the old one and use only the new one. I duly disconnected the wireless mouse and keyboard from the old one and connected them to the new one, and suddenly it was running slowly. So then I ran tests, and established that the new laptop works fine with neither mouse nor keyboard attached; that it works fine with just the keyboard; and that the speed drops to 5% of previous levels with the mouse attached.

So, if you are having this problem, try this first: disconnect your Bluetooth mouse. If that doesn’t work, try disconnecting all devices. If that doesn’t work, then I guess there must be more than one cause for this problem. But for me at least it is specifically the fault of my Bluetooth mouse and nothing else. Now my new computer works like a dream, and I’m enjoying the speediest computer experience ever. Solid-state drives for the win!

It’s all Greek to you, isn’t it?

I received a very interesting hospital dataset recently, in Excel format and containing some basic variable names and values in Japanese. These included the sex of the patient, the specialty under which they were admitted to hospital, and all the variable names. It would be reasonably easy to convert these to English in Excel before import, but that would require making a pivot table and fiddling a bit (my Excel-fu is a bit rusty); I also have address data which, though unimportant at this stage, may matter in the future. So, at some point, I’m going to have to import this data in its Japanese form, and I figured I should work out how to do it.

The problem is that a straight import of the data leads to garbled characters, completely illegible, and very little information appears to be available online about how to import Japanese-labeled data into Stata. A 2010 entry on the statalist suggests it is impossible:

Unfortunately Stata does not support Unicode and does not support other multi-byte character sets, such as those necessary for Far Eastern Languages. If you are working with a data set in which all of the strings are in a language that can be represented by single byte characters (all European languages) just choose the appropriate output encoding. However, if your dataset contains strings in Far Eastern langages or multiple languages that use different character sets, you will simply not be able to properly represent all of the strings and will need to live with underscores in your data.

This is more than a little unfortunate, but it’s also not entirely correct: I know that my students with Japanese operating systems can import such data into Stata quite easily. So I figured there must be something basic going wrong with my computer that was stopping it from doing a simple import. In the spirit of sharing solutions to problems that I find with computers and stats software, here are some solutions to the problem of importing Far Eastern languages on two different operating systems (Windows and Mac OS X), with a few warnings and potential bugs or problems I haven’t yet found a solution for.

Case 1: Japanese language, Windows OS

In this case there should be no challenge importing the data. I tried it on my student’s computer: you just import the data any old how, whether it’s in .csv or Excel format. Then, in your preferences, set the font for the data viewer and the results window to any of the Japanese-language OS defaults: MS Mincho or Osaka, for example.

This doesn’t work if you’re on an English-language Windows, as far as I know, and it doesn’t work in Mac OS X (this I definitely know). In the latter case you are simply not able to choose the Japanese native fonts; Stata doesn’t use them. No matter what font you choose, the data will show up as gobbledygook. There is a solution for Mac OS X, however (see below).

Case 2: English language, Windows OS

This case is fiddly, but it has been solved, and the solution can be found online through the helpful auspices of the igo (Go), programming and economics blogger Shinobi. His or her solution only popped up when I did a search in Japanese, so I’m guessing that it isn’t readily available to the English-language Stata community. I’m also guessing that Shinobi solved the problem on an English-language OS, since it’s not relevant on a Japanese-language OS. Shinobi’s blog post has an English translation at the bottom (very helpful) and extends the solution to Chinese characters. The details are on Shinobi’s blog, but basically what you do is check your .csv file to see how it is encoded, then use a very nifty piece of software called iconv to translate the .csv file from its current encoding to one that can be read by Stata: in the example Shinobi gives (for Chinese) it is GB18030 encoding, but I think for Japanese Stata can read Shift-JIS (I found this explained somewhere online a few days ago but have lost the link).

Encoding is one of those weird things that most people who use computers (me included!) have never had to pay attention to, but it’s important in this case. Basically, there are different ways to assign underlying byte values to Far Eastern languages (this is the encoding), and although Excel and most text editors recognize many of them, Stata only recognizes one. So if you have a .csv file that is a basic export from, say, Excel, it’s likely in an encoding that Stata doesn’t recognize on an English-language OS. Just change the encoding of the file, and then Stata should recognize it.

Working out what encoding your .csv file is currently in can be fiddly, but basically if you open it in a text editor you should be able to access the preferences of the editor and find out what the encoding is; then you can use iconv to convert to a new one (see the commands for iconv in Shinobi’s blog).
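If you don’t have iconv handy, the same re-encoding step can be done in a few lines of Python. This is only a sketch: the encoding names used here (UTF-8 in, Shift-JIS out) are the likely candidates for this job, not a guarantee for your particular file, so check your file’s actual encoding first.

```python
# A Python stand-in for Shinobi's iconv step. The encodings are
# assumptions: check what your own .csv is actually encoded in.

def reencode(data: bytes, src_enc: str = "utf-8", dst_enc: str = "shift_jis") -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(src_enc).encode(dst_enc)

def reencode_file(src_path: str, dst_path: str, **kwargs) -> None:
    """File-level equivalent of: iconv -f UTF-8 -t SHIFT-JIS src > dst"""
    with open(src_path, "rb") as f:
        converted = reencode(f.read(), **kwargs)
    with open(dst_path, "wb") as f:
        f.write(converted)
```

If the decode step throws an error, you’ve guessed the source encoding wrong, which is itself useful information.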

Unfortunately this doesn’t work on Mac OS X: I know this, because I tried extensively. Mac OS X has iconv built in, so you can just open a terminal and run it. BUT, no matter how you change the encoding, Stata won’t read the resulting text file. You can easily interpret Shinobi’s solution for use on Mac but it won’t work. This may be because the native encoding of .csv files on Mac is unclear to the iconv software (there is a default “Mac” encoding that is hyper dodgy). However, given the simplicity of the solution I found for Mac (below), it seems more likely that the problem is something deep inside the way Stata and the OS interact.

Case 3: English-language, Mac OS X

This is, of course, something of a false case: there is no such thing as a single-language Mac OS X. Realizing this, and seeing that the task was trivial on Japanese-language Windows but really fiddly on English-language Windows, it occurred to me to just change the language of my OS (one of the reasons I use Apple is that I can do this). So I used the language preferences to change the OS language to Japanese, and then imported the .csv file. Result? Stata could instantly read the Japanese. Then I just switched my OS back to English when I was done with Stata. This is a tiny bit fiddly in the sense that whenever you want to work on this file you have to switch OS languages, but doing so on Apple is really trivial: maybe 3 or 4 clicks.

When you do this though, if you aren’t actually able to read Japanese, you’ll be stuffed trying to get back. So, before you do this, make sure you change your system settings so that the language options are visible on the task bar (you will see a little flag corresponding to your default locale appear next to the date and time). Then, make sure you know the sequence of clicks to get back to the regional language settings (it’s the bottom option of the language options menu in your taskbar, then the left-most tab inside that setting). That way you can change back easily. Note also that you don’t, strictly speaking, have to change the actual characters on the screen into Japanese! This is because when you select to change your default OS language, a little window pops up saying that the change will apply to the OS next time you log in but will apply to individual programs next time you open them. So you can probably change the OS, open Stata, fiddle about, close Stata, then change the OS back to English, and so long as you don’t log out/restart, you should never see a single Japanese-language menu! Weird, and kind of trivial solution!

A final weird excel problem

Having used this trick in Mac OS X, I thought to try importing the data from its original Excel format, rather than from the intermediate .csv file. To my surprise, this didn’t work! In programming terms, running insheet to import .csv files translates the Japanese perfectly, but running import to read the Excel file fails to translate properly. So either there is something inaccessible about Excel’s encoding, or the import command is broken in Stata. I don’t know which, but it does mean that if you receive a Japanese-language Excel file and you’re using Mac OS X, you will need to export to .csv before you import to Stata. This is no big deal: before Stata 12, there was no direct Excel import method at all.

A few final gripes

As a final aside, I take this as a sign that Stata needs to really improve its support for Asian languages, and also the way it handles Excel. Given Excel’s importance in the modern workplace, I think it would be a very good idea if Microsoft did more to make it fully open to other developers. It’s the default data transfer mechanism for people who are unfamiliar with databases and statistical software, and it is absolutely essential that statisticians be able to work with it, whatever their opinions of its particular foibles or of the ethics of Microsoft. It also has better advanced programming and data manipulation properties than, say, OpenOffice, which makes it all the more important that it match closely to standards that can be used across platforms. Excel has become a ubiquitous workplace tool, the numerical equivalent of a staple, and just as any company’s stapler can work with any other company’s staples if the standards match, so Excel needs to be recognized as a public good and made more open to developers at other companies. If that were the case, I don’t think Stata would be struggling with Asian-language Excel files while dealing fine with Asian-language .csv files.

And finally, I think this may also mean that both Apple and Microsoft need to drop their proprietary encoding systems and use an agreed, open standard. Also, Windows needs to grow up and offer support for multiple languages in all its versions, not just the most expensive one.

Lastly, I hope this post helps someone out there with a Japanese-language import (or offers a way to import any other language that has a more extensive encoding than English).

Recently I’ve been working on some problems in disease modeling for influenza, and one of them is to calculate the basic reproduction number for a model that includes differential disease strengths in poor and rich risk groups. This is generally done with the “Next Generation Matrix” method: one calculates two matrices of partial derivatives, inverts one and multiplies it by the other, then calculates the eigenvalues; the basic reproduction number is the largest eigenvalue of the result. Doing this for just one risk group in the model I’m fiddling with can be done analytically in about 7 pages of notes. It involves finding the inverse of a 5×5 matrix, but this is actually quite quick to do by hand because most of the matrices involved are wide open spaces of zeros. However, once one extends the model to four risk groups the calculation becomes nastier: it involves inverting a 20×20 matrix, then finding the eigenvalues of a product of 20×20 matrices. Even recognizing that most of the elements are zeros, one still ends up with a fiendish hand calculation. On top of this, the matrices themselves contain many separate values all multiplied together. I started this by hand and decided today that I want to take a shortcut: a student needs some basic values from this soon, and neither of us is going to get it done analytically before our deadline.
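To give a flavour of the numerical shortcut, here is a minimal next generation matrix calculation in Python/NumPy (my actual calculation was in Matlab). The model is a toy two-group example and every parameter value here is purely illustrative, not taken from the influenza model discussed above.

```python
import numpy as np

# Toy two-group model. beta[i, j] is transmission from group j to
# group i; gamma is the recovery rate. Values are illustrative only.
beta = np.array([[0.30, 0.10],
                 [0.05, 0.20]])
gamma = 0.25

F = beta                # new-infection terms, linearized at the disease-free equilibrium
V = gamma * np.eye(2)   # transition (recovery/removal) terms

K = F @ np.linalg.inv(V)             # the next generation matrix
R0 = max(abs(np.linalg.eigvals(K)))  # R0 = spectral radius of K
```

The same three lines at the bottom scale unchanged to the 20×20 case, which is exactly why the numerical route is seconds of work where the hand calculation is days.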

So tonight I came home and, after a nice dinner and an hour spent with my partner, I spent about an hour programming Matlab to do the calculation numerically for me. I now have the two values that my student needs, and if she needs to tweak her model it’s just a few presses of a button on my computer to get the updated reproduction number. Also, it’s a matter of a second’s work to test any other parameter in the model, and with a few loops I can produce charts of relationships between the reproduction number and any parameter. It’s nice and it was fairly trivial to program in Matlab. In this instance Matlab saved me a couple of days’ work fiddling around with some enormously tricky (though not mathematically challenging) hand calculations.

On this blog a short while back I investigated a weird probability distribution I had encountered at Grognardia. For that calculation, rather than going through the (eye-bleedingly horrible) tedium of attempting to generate a mathematical expression for the probability distributions I wanted to analyze, I simply ran a simulation in R with so many runs (about 100,000) that all random error was stripped out and I essentially got the exact shape of the theoretical underlying distribution I wanted.

In both cases, it’s pretty clear that I’m using a computer to do my thinking for me.

This is very different to using a computer to run an experiment based on a theory one developed painstakingly by hand. Rather, I’m using the brute number-crunching power of modern machines to simply get the theoretical result I’m looking for without doing the thinking. That Grognardia problem involved a badly programmed loop that executed a total of 4,500,000 dice results just to produce one chart. I did it on a computer with 32Gb of RAM and 12 chips; it took about 3 seconds, and I didn’t even have to program efficiently (I did it in R without using R’s vectorized operations, just straight looping like a 12-year-old). The resulting charts are so close to the analytical probability distribution that it makes no difference whatsoever that they’re empirical: that hour of programming and the 3 seconds of processor time short-circuited days and days of painstaking theoretical work to find the form of the probability distributions.
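For the record, the brute-force approach is about ten lines in any language. Here is a sketch in Python, with a plain 3d6 roll standing in for the Grognardia mechanic (which I’m not reproducing here): with enough runs, the empirical frequencies are indistinguishable from the exact distribution.

```python
import random

# Brute-force approximation of a dice distribution: a toy stand-in
# for the actual mechanic analyzed in the earlier post.
random.seed(1)  # fixed seed so the run is reproducible

N = 100_000
counts = {}
for _ in range(N):
    roll = sum(random.randint(1, 6) for _ in range(3))  # 3d6
    counts[roll] = counts.get(roll, 0) + 1

probs = {total: n / N for total, n in sorted(counts.items())}
```

At 100,000 runs the sampling error on any one probability is around a thousandth, well below anything visible on a chart.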

Obviously if I want to publish any of these things I need to do the hard work, so on balance I think that these numerical short cuts are a good thing – they help me to work out the feasibility of a hard task, get values to use in empirical work while I continue with the analytic problems, and give a way to check my work. But on the flip side – and much as I hate to sound like a maths grognard or something – I do sometimes wonder if the sheer power of computers has got to the point where they genuinely do offer a brutal, empirical short cut to actual mathematical thinking. Why seek an elegant mathematical solution to a problem when you can just spend 10 minutes on a computer and get all the dynamics of your solution without having to worry about the hard stuff? For people like me, with a good enough education in maths and physics to know what we need to do, but not enough concerted experience in the hard yards to be able to do the complex nitty-gritty of the work, this may be a godsend. But from the broader perspective of the discipline, will it lead to an overall, population-wide loss of the analytical skills that make maths and physics so powerful? And if so, in the future will we see students at universities losing their deep insight into the discipline as the power of the computer gives them ways to short cut the hard task of learning and applying the theory?

Maybe those 12 chips, 32Gb of RAM, 27 inch screen and 1Gb graphics card are a mixed blessing …

This week my work started the process of buying me a PowerMac: a 12-core, 24Gb RAM monster that will replace my 8-core, 16Gb RAM Windows machine. I also received a new MacBook, and my colleagues are also buying, or planning to buy, Macs in their next round of upgrades. This means that unless something catastrophic happens with funding in the next week, I will have completely abandoned Windows in my work life. I long since abandoned it at home; workplaces have tended to be more conservative about the change, but it’s finally going to happen.

I have noticed over the past few years that a lot of people working at the crunchy end of science are using Macs, not Windows machines. This includes people working in computer graphics research, nuclear physics, veterinary epidemiology, medicine and epidemiology. It appears that if you need to do numerical computation, Mac OS X is increasingly the platform you do it on. I guess this is because Mac OS has been much easier to develop for since it switched to a Unix-based system, so many of the key tools of numerical computation are now available on it: Matlab, Stata, LaTeX and R are all implemented very well on Apple machines. Indeed, in my experience these packages tend to be easier and more pleasant to use on Apple: LaTeX editors are much more pleasant and more readily available, R’s script-writing tools are much better, and Matlab and R themselves seem much the same on either platform.

I think this means the end of the dominance Windows had in this area since it forced Unix out. This would make the scientific computing market a rare (I guess) example of a company with essentially complete market dominance losing its monopoly place purely on quality-related issues. I’m not a fascist about any one computer system (except Linux – I try to avoid Linux) but I do find Mac OS much nicer to work with, and not just because the machines are pretty. Something about the way it works is just less sticky than Windows, and it seems to trouble me less with extraneous stuff. I think it’s probably something to do with the attitude to design, and also with the greater degree of sympathy between hardware and software (which comes, I guess, from Steve Jobs’s obsession with keeping everything in the one company). I think it might also have something to do with the extra money one pays for the machines.

As far as money goes, the common complaint that you can get the same performance at half the price with a Windows machine is, in my experience, sadly misguided. The very worst thing you can do with Windows is buy a cheap machine that has good stats on paper, or worse still, build your own. You’ll be paying in time and Insanity Points for the rest of the 2 years you use it before you throw it away. I don’t know what it is about computers, but their performance is like a symphony orchestra: if you scrimp on any part of the process, the whole thing leaves you feeling bad. You’re better off getting an Apple that is, on paper, inferior for the same price than assuming that the extra 0.2GHz in that PC chip is going to work as it should at half the price.

Of course, this is all classic flamewar material, but my observation is that the scientific computing world is moving away from Windows and taking Apple seriously as a computational tool. This is particularly striking given that just 10 or so years ago Mac was seen as exclusively the tool of inner-city designers with black clothes and slanty hair, or girly magazine writers who value a cute case over a decent OS. Well, perhaps those girly magazine writers were right all along …

As an aside, and to give all those reading this a common enemy, I recently downloaded the cute 2D dungeon crawler Dungeons of Dredmor. I can’t say it’s holding my attention, but it is cute. On their blog, the developers talk about the work they’re doing to port the game to iPad (an excellent idea!), and they have a few things to say about the commercial and technical problems involved in porting it to Android. In the process they also reveal some interesting issues about Linux. In answer to the question “Will the game be available for Android?” they say:

This… is an interesting question. While SDL 1.3 supports Android, at least partly, there are two reasons why we might not go ahead and do this. The first is insufficient demand – my personal experience with Android on previous products have been that Android sales are a very small fraction of iPad sales (in fact, less than the ratio of Linux sales to Windows sales.) Consequently, it’s not entirely clear that this is something that we will actually make money on – especially on the tablet market, where Android tablets are still somewhat of an unknown factor and where the iPad still occupies 75% of the market share.

The second reason why we might not support Android is because the infrastructure for Android is so, so, hideously broken. Again, it’s *worse* than the Linux situation, which is kind of amazing. In order for us to ship on Android, we have to be convinced (more specifically, as the Technical Director for the studio *I* have to be convinced) that we can actually ship an Android version of Dredmor and have it work. Given that there are a number of horror stories floating around about people who test their software on 300 Android devices and get everything working, only to release and have everything explode on Day 1… I’m just not confident that we can do this. It is possible that we might put together an Android release for a *very* limited selection of devices (Kindle Fire, Samsung Galaxy Tab, ASUS EEE Pad Transformer) where we have some hope of having things run in a fashion that we’re happy with. That said, we’re still looking into this, and the iPad port (by virtue of the market share we mentioned above) is still the top priority.

There are phrases in there that are genuine music to the ears of someone with a strong anti-Linux fetish. “Worse than Linux … which is kind of amazing.” Ha! Also note the disturbing information about the ratio of iPad to other tablet sales. I wonder if Apple is going to become the Windows of the smartphone and tablet marketplace, completely dominating all other products and stifling development on anything else? And if this market dominance is built, in the short term at least, on a higher quality product, what are the chances of a rival OS surviving? The “I hate Apple” niche market is probably going to get smaller and smaller as the virtues of Apple products become better known (as has happened in the scientific computing world).

I guess we’ll have to just learn to love our new slanty-fringed overlords …


I have now had quite a bit of experience working with large datasets in Stata, and consistent with my previous efforts on this blog to publicize problems with statistical software and solutions to computer problems, I thought I’d explain how I do it and why it’s a good idea to use Stata for large data. I first approached this problem in 2008, when I was living in London and working with National Health Service (NHS) data. At that time it was a seemingly insoluble problem and there wasn’t much information out there about how to solve it; I guess things have improved since then, but just in case the information is still thin on the ground, I thought I’d write this post.

What size is “large”?

When I researched solutions to the problem of analyzing large datasets in Stata, many of the people I contacted and the websites I looked at thought I meant data consisting of hundreds of thousands of records; this is a common size in statistical analysis of, e.g., schools data or pharmaceutical data. I was working with files of hundreds of millions of records, up to 30Gb in size, and back in 2008 very few people were working at this scale. Even now, it is still pretty uncommon in epidemiology and health services research. Four years of outpatient data from the NHS will contain about 250 million records, and the chances are that the correct analysis for such data is a multi-level model (facility and patient being two levels) with binary outcomes. With this kind of data most health researchers make compromises and use the linear probability model, or other approximations and workarounds. Most researchers also use SAS, because SAS is the only software package capable of analyzing files that don’t fit into RAM. However, it takes an enormous amount of time to do a logistic regression on 250 million records with SAS: my colleague would leave it running all day and work on a different computer while he waited for it to complete. This is not acceptable.

Why Stata?

I’m not a fascist about statistical software – I’ll use anything I need to to get the job done, and I see benefits and downsides in all of them. However, I’ve become increasingly partial to Stata since I started using it, for these reasons:

  • It is much, much faster than SAS
  • It is cheaper than SAS or SPSS
  • Its help is vastly superior to R, and the online help (on message boards, etc) is much, much politer – the R online help is a stinking pit of rude, sneering people
  • R can’t be trusted, as I’ve documented before, and R is also quite demanding on system resources
  • Much of the stuff that epidemiologists need is standardized in Stata first – for example, Stata leads the way on combining multilevel models and probability sampling
  • Stata’s programming language, while not as powerful as R, is still very flexible and is relatively standardized
  • Stata has very good graphics compared to the other packages
  • SAS is absolutely terrible to work with if you need automation or recursive programming
  • Stata/MP is designed to work with multi-core computers out of the box, whereas R has no support for modern chips, and SAS requires some kind of horrendous specialized set up that no one with a life can understand

So, while I’ll use R for automation and challenging, recursive tasks, I won’t go near it for work where I really need trustworthy results quickly, where I’m collaborating with non-statisticians, or where I need good quality output. I gave up on SAS in 2008 and won’t be going back unless I need something that only SAS can do, and I don’t think SPSS is a viable option for serious statistical analysis, though it has its uses (I could write a very glowing post on the benefits of SPSS for standardizing analysis of probability surveys across large organizations).

The big problem with Stata is that, like R, it holds its data in memory, so you need to load the entire data file into RAM in order to do any analysis on it. This means that if you want to analyze very large datasets, you need huge amounts of RAM, whereas in SPSS or SAS you can load the data piecewise and analyze it accordingly. Furthermore, until Windows 7 came along it was not possible to give more than about 700Mb of RAM to any program (unless you were using Mac OS X/Unix), so you couldn’t load even medium-sized files into RAM. Sure, you could use Windows 2000 Professional or some such nightmare mutant package (which I tried to do), but it’s hell on earth to go there. Your best option was Mac OS and a huge amount of RAM.

I’m going to now prove that it’s better to buy Stata and invest in 32 or 64 Gb of RAM, than to keep working with SAS. And I’m not going to fall back on hazy “productivity gains” to do so.

Conditions for analysis of large datasets

The core condition for analysis of large datasets is sufficient RAM to load the entire dataset – so if you expect your basic analysis file to be 12Gb in size, you’ll need a bit more than that in RAM. If the file is coming in a size larger than this, you’ll need a database package to access it – I use MS Access, but anything will do. If the file comes in text (e.g. .csv) format you can break it into chunks in a text editor or database package and import these into Stata sequentially, appending them together. Also, don’t be discouraged by larger file sizes before you import – Stata has very efficient data storage and by careful manipulation of variable types you can make your data files much smaller. Also, if you are importing sequentially you can drop variables you don’t need from each chunk of file before appending. For example, if you receive NHS data there will be a unique id derived from some encryption software that is about 32 characters long. Turn this into an integer and you save yourself about 16 bytes per record – this adds up over 250 million records. Some spatial data is also repeated in the file, so you can delete it, and there’s lots of information that can be split into separate files and merged in later if needed – in Stata it’s the work of a few seconds to merge a 16 Gb file with another 16 Gb file if you have sufficient RAM, whereas working with a single bloated 25Gb file in SAS will take you a day. It’s worth noting that SAS’s minimum sizes for a lot of variable types are bloated, and you can shave off 30-40% of the file size when you convert to Stata.

So, loop through chunks to build up files containing only what is relevant, compress them to minimum sizes, and use a judiciously constructed master file of IDs as a reference file against which to merge data sets with secondary information. Then, buy lots of RAM. You’ll then have the dual benefits of a really, really nice computer and a fast statistical analysis package. If you were working with large datasets in SAS, you’ll have cut your analysis time from hours to seconds, increased the range of analyses you can conduct, and got yourself improved graphics. But how are you going to convince anyone to buy you that computer?

Stata and a large computer is cheaper

Obviously you should do your own cost calculations, but in general you’ll find it’s cheaper to buy Stata and a beast of a computer than to persist with SAS and a cheap computer. When I was in the UK I did the calculations, and they were fairly convincing. Using my rough memory of the figures at the time: SAS was about 1600 pounds a year, and a basic computer about 2000 pounds every three years: total cost 6800 pounds every three years. Stata costs 1500 pounds, upgrades about every 2-3 years, and a computer with 32Gb of RAM and 4 processors was about 3000 pounds. So your total costs over 3 years are about 2300 pounds less. Even if you get a beast of an apple workstation, at about 5000 pounds, you’ll end up about even on the upgrade cycle. The difference in personal satisfaction and working pace is huge, however.
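For anyone who wants to rerun the comparison with their own prices, the arithmetic behind those rough figures is just:

```python
# Three-year cost comparison using the rough UK figures quoted above (GBP).
sas_licence = 1600           # per year
basic_computer = 2000        # replaced every three years
sas_total = 3 * sas_licence + basic_computer

stata_licence = 1500         # upgrades roughly every 2-3 years
big_computer = 3000          # 32Gb of RAM, 4 processors
stata_total = stata_licence + big_computer

saving = stata_total - sas_total  # negative: Stata route is cheaper
```

Swap in your own licence and hardware quotes; the annual-versus-perpetual licence structure is what drives the gap.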

Conclusion

If you work with large datasets, it’s worth your while to switch to Stata and a better computer than to persist with slow, clunky, inflexible systems like SAS or SPSS. If you need to continue to interact closely with a large SQL backend then obviously these considerations don’t apply, but if your data importation and manipulation needs are primarily flat files that you receive in batches once or twice a year, you’ll get major productivity gains and possibly cost savings even though you’ve bought yourself a better computer. There are very few tasks that Stata can’t solve in combination with Windows 7 or Mac OS X, so don’t hold back – make the case to your boss for the best workstation you can afford, and an upgrade to a stats package you can enjoy.

I stumbled on a problem with out-of-sample prediction in R, which I think is an example of idiosyncratic programming (and possibly insoluble) and, in the spirit of Andrew Gelman’s stats blog, I thought I’d put it on the internet to see if a) I’m just being stupid and b) it really is as much of a barrier to automation of statistical functions as I think it is. Perhaps someone with more clout than me will find this blog and notice this problem – or perhaps it’s not a problem and I’m just being stupid (or unfairly demanding of my prediction modeling software).

In case you haven’t worked this out already, if you aren’t interested in stats, R, or nerdy shit that wastes everyone’s time, you should stop reading right now. Everything beneath the dotted line is going to kill your love life, and probably your pets.

First, let’s consider the basic out-of-sample prediction process. We run a very simple linear model and then provide R with a set of new data – containing a variable with exactly the same name and type as the original predictor – and get it to work out the expected value of the line at the new predictor values. Taking the example straight from the manual, here is the code for a trivial predictive model:

x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)

This is very simple code: we produce a set of x values and make the y values equal to these x values plus a random amount[1]. Then we run the model and get the predicted values (just for shits and giggles, btw), and then we make a new data set that contains values ranging from -3 to 3 in steps of 0.5, and run the predictions on these new x values. In this code, the data frame called new is the out-of-sample data, and we get the expected value of y given the observed values in new, for the given model. This is stats 101, right?

In my case, the x values are

1.5313262  1.5307600 -0.5067223 -0.1366010 -0.6557527  1.4245267 -0.3917136  1.7941995  2.0511560  0.3602334 -0.8798598 -0.5755816 -1.6419118 -0.7885237  1.1478910

and the coefficients of the resulting linear model are

[0.367,1.047]

That is, my model has an intercept of 0.367 and a slope of 1.047. This means that, e.g., for a new x value of -3 I would expect y to be 0.367 + (-3 × 1.047), or about -2.774. The correct predicted values for the new data set, based on these coefficients, produced by the 5th line of my code, are:

-2.7751002 -2.2514963 -1.7278925 -1.2042886 -0.6806847 -0.1570809  0.3665230  0.8901269  1.4137307  1.9373346  2.4609385  2.9845424  3.5081462

Now, let’s produce our first example of a problem. Consider the following very simple code:

yy<-cbind(y,x)
pred.m1<-predict(lm(yy[,1]~yy[,2]))
pred.mnew<-predict(lm(yy[,1]~yy[,2]),new,se.fit=TRUE)

Here, I’ve bound my previous x and y values together to make a matrix, with y in column 1 and x in column 2. Then I’ve rerun the model using the perfectly natural R method, of referring to each variable in the data set by its column reference ([,1] for the y, [,2] for the x). Then I’ve rerun the out-of-sample prediction on the same set of x values as I used before (the vector new). This is the simplest possible non-trivial linear model, and out-of-sample prediction in this case should be as simple as in the case of the previous model. So what happens? First, the following “warning”:

Warning message:
‘newdata’ had 13 rows but variable(s) found have 15 rows

Note this is a warning. So what values do we get? Here they are:

1.97013967  1.96954674 -0.16412052  0.22347339 -0.32018633  1.85829845 -0.04368247  2.24542264  2.51450943  0.74376222 -0.55487302 -0.23623050 -1.35289974 -0.45922514  1.56860335

These values are different. In fact, they’re the in-sample predictions. You can check it – they’re the result of applying the coefficients to the original observations of the predictor variable. This means that two important things happened:

  1. R failed to produce a solution to the simplest possible linear modeling problem, simply because I changed the input method for my regression model – even though the revised input methods (for the formula and the data) are entirely consistent with the R language
  2. The “warning” was not a warning: R didn’t just warn me about a problem, it failed to apply the command I told it to, and didn’t tell me about its decision. I could publish this shit, and not realize that I’ve been given my original values back.

Note how thoroughly horrible this would be if I were running some kind of automated series of regression processes – I once did a capture/recapture model that was heavily dependent on automation in R, and I’m wondering if I fell for this crap then. R should not be declaring a “warning” when it is actually refusing to run the command I asked it to and producing completely different output. I know I’m only a statistician, but imagine if this kind of programming were allowed in Jumbo Jets. Unhappy punters we would all be, young Jedi.

Also let’s just pause and dwell on the meaningless nature of that warning. I’ve supplied a vector of 13 observations, and the vector of 1s (for the constant) is implicit in everything I’ve done to date. So this means I’m (reasonably) assuming R will construct a 13×2 matrix with a column of 1s and a column of (out-of-sample) x values. Then, we have a 2×1 vector of coefficients. Thus, R should be able to calculate a vector of outputs from the information I’ve supplied it. The only possible explanation is that it is expecting me to supply it with a full-rank matrix, i.e. it has decided arbitrarily to stop implicitly assuming the constant. So, shall we try that?

new.fr<-cbind(rep(1,13),new)
pred.frnew<-predict(lm(yy[,1]~yy[,2]),new.fr,se.fit=TRUE)

Same prediction process, only now the out-of-sample data that I provide is full-rank, containing a column of 1s and a column of (out-of-sample) x values[2].

The result? The same problem: the same warning in red, and in-sample expected values. So the problem is not that I’m providing a less-than-full-rank matrix of predictors on the assumption that one column of the design matrix is implicit. The problem is deeper than that.

I can, however, fix the problem in a weird way. If I assign column names to the original data, and also assign the new data a name that matches the name of the predictor in the original data, I get the correct output:

yy2<-data.frame(yy)
names(yy2)<-c("Outcome","Predictor")
new2<-new
names(new2)<-c("Predictor")
pred.mnamed<-predict(lm(Outcome~Predictor,data=yy2),new2,se.fit=TRUE)

In this case I’ve made a new data frame for the original data and the out-of-sample data, to avoid confusion, and I’ve given them all consistent names. Note now that in the prediction process the out-of-sample data, new2, is not full rank – the column of 1s is assumed in the calculation process. But it doesn’t cause any trouble: the previous warning about seeing only 13 observations doesn’t appear, and I get the correct out-of-sample predictions. I still only have 13 observations, but R doesn’t care. I get correct predictions and everything is fine.

This might seem like a trivial problem until you’re doing automation. I recently wrote a program to do a search through a large number of univariate models, and then combine the variables from the top 10 in a predictive model. The predictions were the only thing I wanted. But I spent hours trying to work out why R wouldn’t give me out-of-sample predictions: my training sample had about 50k records, and my validation sample had about 20k. I was using numeric column references for the purposes of automation (the process should be fairly obvious: you make a matrix of variable numbers with AIC values from the corresponding linear models, sort by AIC values, choose the variable numbers corresponding to the best 10, and then put them into a formula in a linear model) but it wasn’t working. Eventually I worked out this problem, made a vector of column names and used the paste() function to produce a formula.
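The fix I ended up with can be sketched roughly as follows. The data and variable names here are hypothetical, invented purely for illustration – a data frame `dat` with the outcome in the first column and named predictors after it, standing in for my 50k-record training sample:

```r
# Build the model formula from column NAMES rather than numeric
# column references, so that predict() can match newdata by name.
set.seed(1)
dat <- data.frame(matrix(rnorm(150), ncol = 10))
names(dat) <- c("Outcome", paste0("Var", 1:9))

# in practice these would be the best 10 predictors by AIC;
# here just pick three for the sketch
chosen <- c("Var1", "Var3", "Var7")

# paste the names into a formula string and convert it
f <- as.formula(paste("Outcome ~", paste(chosen, collapse = " + ")))

# fit on a "training" subset, then predict on a "validation"
# subset of a different size -- no warning, genuine
# out-of-sample predictions
fit <- lm(f, data = dat[1:10, ])
preds <- predict(fit, newdata = dat[11:15, ])
```

Because the formula refers to named columns of a data frame, `predict()` finds matching names in `newdata` and behaves itself, regardless of the difference in row counts.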

But this workaround is stupid. In both cases – using column numbers or referring to names – the design matrix is implicit, not specified by the user. The underlying R code has to construct this design matrix either way. So why does it screw me around like this? Why provide two ways of referencing columns and then have one of them bork your most basic modeling function? What would Tibshirani say? Dividing data into training and validation sets is an essential part of modern machine learning practice, but R can’t handle it because – omfg – one has a different number of records to the other. Even though the matrix mathematics is in no way affected by the difference in the number of records.

That, dear reader, is called crap program design. Especially crap is the issuing of a warning that isn’t a warning at all, but an indication that the program has arbitrarily decided to produce a result different to the one you asked for! Rather than declaring a “warning,” R should say clearly “Catastrophic failure! Providing default prediction using original design matrix only!” Or better still it could say “What are you doing, Dave?” and refuse to let me out of the cargo bay doors.

The manual says the following about giving out-of-sample data to the program:

If the fit is rank-deficient, some of the columns of the design matrix will have been dropped. Prediction from such a fit only makes sense if newdata is contained in the same subspace as the original data. That cannot be checked accurately, so a warning is issued.

and

A warning will be given if the variables found are not of the same length as those in newdata if it was supplied.

This is not exactly the most informative help in the world. I don’t think that what is described in the first passage (to the extent that I understand the English) is what happens in practice, and the last sentence is certainly misleading. In R, warnings usually indicate something you should be aware of that does not affect the numerical processing of results (see e.g. Andrew Gelman’s statement in the above blog link: “I don’t care about warning messages”). Except you should care, because sometimes R issues a message called a “warning” that is actually an indicator of an insouciant and catastrophic failure to enact the expected mathematical process.
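As a defensive measure for automated runs, R does at least let you promote warnings to errors, so a pipeline halts instead of quietly handing back the wrong numbers. A minimal sketch, using the same broken column-reference model from above:

```r
# warn = 2 converts every warning into an error, so an automated
# pipeline stops instead of silently returning in-sample values
old <- options(warn = 2)

x <- rnorm(15)
y <- x + rnorm(15)
yy <- cbind(y, x)
new <- data.frame(x = seq(-3, 3, 0.5))

# the "newdata had 13 rows" warning now throws an error,
# which try() captures instead of letting it pass unnoticed
result <- try(predict(lm(yy[, 1] ~ yy[, 2]), new), silent = TRUE)

options(old)  # restore the previous warning behaviour
```

It’s a blunt instrument – every warning in the run becomes fatal – but for unattended model fitting that’s far safer than the alternative.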

Open source software: it sucks. Except that R is far and away the best software for automation in stats – until you run into crap like this. Fix it, CRAN!

fn1: of course the R manual makes this part just slightly more obscure than it needs to be, because hey! R is a really pithy language, so why would you use comments to make the process easier to grasp, or separate the process of generating x from the process of generating epsilon? Anyone who knows what they’re doing can figure it out, and manuals are for people who already know what they’re doing – right?

fn2: I guess I could do this more accurately by supplying a variable containing only 1s, and specifying “no constant.” That might work, I suppose. How horrid.

Here is the full code to reproduce all these results:

## Predictions
x <- rnorm(15)
y <- x + rnorm(15)
pred.v1<-predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
pred.vnew<-predict(lm(y ~ x), new, se.fit = TRUE)

## now make a data set yy that contains a column for x and a column for y

yy<-cbind(y,x)

pred.m1<-predict(lm(yy[,1]~yy[,2]))

pred.mnew<-predict(lm(yy[,1]~yy[,2]),new,se.fit=TRUE)

## now give the data set names and retry it
yy2<-data.frame(yy)
names(yy2)<-c("Outcome","Predictor")
new2<-new
names(new2)<-c("Predictor")

pred.mnamed<-predict(lm(Outcome~Predictor,data=yy2),new2,se.fit=TRUE)

# now try a full rank matrix
new.fr<-cbind(rep(1,13),new)
pred.frnew<-predict(lm(yy[,1]~yy[,2]),new.fr,se.fit=TRUE)

x.mat<-as.numeric(lm(yy[,1]~yy[,2])$coefficients)

# try again with matrix maths
new.fr<-as.matrix(cbind(rep(1,13),new))
pred.frnew2<-new.fr %*% x.mat

The Guardian today has an article about gold farming in Chinese labour camps, which claims that prisoners in labour camps in China were (are?) forced to play computer games at night after they had spent the day at hard physical labour. The prison then sold the products of their labour to free[1] computer gamers, but of course the camp workers saw none of the profits. If they failed to produce sufficient gold, or slacked off in their virtual world, they were beaten and punished in other ways. The article presents some interesting information about the way gold farming is conducted in labour camps and also in IT sweatshops. The latter represents “voluntary” labour, and those doing it seem to think it pays better than factory work (it’s probably safer too), while the former is involuntary.

These gold farmers in labour camps are being essentially forced to go and work in another world for 7 to 10 hours a day, and beaten if they don’t produce the goods they’re sent there to get. Not only is this process surreal (literally!) but it’s a model of human trafficking, enacted virtually. Are we witnessing the development of an industry based on trafficking in virtual people?

fn1: if you can define WoW players as “free” by the standard definition of the term…
