I have now had quite a bit of experience working with large datasets in Stata. Consistent with my previous efforts on this blog to publicize problems with statistical software and solutions to computer problems, I thought I’d explain how I do it and why it’s a good idea to use Stata for large data. I first approached this problem in 2008, when I was living in London and working with National Health Service (NHS) data. At that time it seemed insoluble, and there wasn’t much information out there about how to solve it. I guess things have improved since then, but in case the information is still thin on the ground, I thought I’d write this post.
What size is “large”?
When I researched solutions to the problem of analyzing large datasets in Stata, many of the people I contacted and the websites I looked at thought I meant data consisting of hundreds of thousands of records. This is a common size in statistical analysis of, e.g., schools data or pharmaceutical data. I was working with files of hundreds of millions of records, up to 30 GB in size, and back in 2008 very few people were working with data of this size. Even now, this is still pretty uncommon in epidemiology and health services research. Four years of outpatient data from the NHS will contain about 250 million records, and the chances are that the correct analysis for such data is a multi-level model (facility and patient being two levels) with binary outcomes. With this kind of data most health researchers make compromises and use the linear probability model, or other approximations and workarounds. Most researchers also use SAS, because SAS is one of the few software packages capable of analyzing files that don’t fit into RAM. However, it takes an enormous amount of time to do a logistic regression on 250 million records with SAS: my colleague would leave it running all day, and work on a different computer while he waited for it to complete. This is not acceptable.
I’m not a fascist about statistical software – I’ll use anything I need to to get the job done, and I see benefits and downsides in all of them. However, I’ve become increasingly partial to Stata since I started using it, for these reasons:
- It is much, much faster than SAS
- It is cheaper than SAS or SPSS
- Its help is vastly superior to R, and the online help (on message boards, etc) is much, much politer – the R online help is a stinking pit of rude, sneering people
- R can’t be trusted, as I’ve documented before, and R is also quite demanding on system resources
- Much of the stuff that epidemiologists need is standardized in Stata first – for example, Stata leads the way on combining multilevel models and probability sampling
- Stata’s programming language, while not as powerful as R, is still very flexible and is relatively standardized
- Stata has very good graphics compared to the other packages
- SAS is absolutely terrible to work with if you need automation or recursive programming
- Stata/MP is designed to work with multi-core computers out of the box, whereas base R runs single-threaded unless you add parallel packages, and SAS requires some kind of horrendous specialized setup that no one with a life can understand
So, while I’ll use R for automation and challenging, recursive tasks, I won’t go near it for work where I need trustworthy results quickly, where I’m collaborating with non-statisticians, or where I need good quality output. I gave up on SAS in 2008 and won’t be going back unless I need something that only SAS can do. I don’t think SPSS is a viable option for serious statistical analysis, though it has its uses (I could write a very glowing post on the benefits of SPSS for standardizing probability survey analysis across large organizations).
The big problem with Stata is that, like R, it holds the data in memory, so you need to load the entire data file into RAM in order to do any analysis on it. This means that if you want to analyze very large datasets, you need huge amounts of RAM, whereas in SPSS or SAS you can load the data piecewise and analyze accordingly. Furthermore, in my experience, until Windows 7 came along it was not practical to give more than about 700 MB of RAM to any program on the 32-bit Windows systems most people had (unless you were using Mac OS X/Unix), so you couldn’t load even medium-sized files into RAM. Sure, you could use Windows Professional 2000 or some such nightmare mutant package (which I tried to do) but it’s hell on earth to go there. Your best option was Mac OS and a huge amount of RAM.
I’m now going to argue that it’s better to buy Stata and invest in 32 or 64 GB of RAM than to keep working with SAS. And I’m not going to fall back on hazy “productivity gains” to do so.
Conditions for analysis of large datasets
The core condition for analysis of large datasets is sufficient RAM to load the entire dataset: if you expect your basic analysis file to be 12 GB in size, you’ll need a bit more than that in RAM. If the raw file arrives larger than this, you’ll need a database package to access it. I use MS Access, but anything will do. If the file comes in text (e.g. .csv) format you can break it into chunks in a text editor or database package and import these into Stata sequentially, appending them together.
Don’t be discouraged by large file sizes before you import. Stata has very efficient data storage, and by careful manipulation of variable types you can make your data files much smaller. If you are importing sequentially you can also drop variables you don’t need from each chunk before appending. For example, if you receive NHS data there will be a unique ID derived from some encryption software that is about 32 characters long. Turn this into an integer and you save yourself about 16 bytes per record, which adds up over 250 million records.
Some spatial data is also repeated in the file, so you can delete it, and there’s lots of information that can be split into separate files and merged in later if needed. In Stata it’s the work of a few seconds to merge a 16 GB file with another 16 GB file if you have sufficient RAM, whereas working with a single bloated 25 GB file in SAS will take you a day. It’s worth noting that SAS’s minimum sizes for a lot of variable types are bloated, and you can shave 30-40% off the file size when you convert to Stata.
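To put a rough number on the ID recoding mentioned above, here is a back-of-envelope sketch in Python (not Stata). It uses only the approximate figures quoted in this post, 250 million records and about 16 bytes saved per record; the function and variable names are made up for illustration, and the real saving will depend on the storage types you end up with.

```python
# Back-of-envelope storage saving from recoding a long string ID as an integer.
# Both figures are the rough numbers quoted in the post, not measurements.

RECORDS = 250_000_000         # roughly four years of NHS outpatient records
BYTES_SAVED_PER_RECORD = 16   # the post's estimate for the ID recode


def bytes_saved(records: int, saved_per_record: int) -> int:
    """Total bytes saved across the whole file."""
    return records * saved_per_record


total = bytes_saved(RECORDS, BYTES_SAVED_PER_RECORD)

# Report in decimal gigabytes and binary gibibytes.
print(f"Saved: {total / 1e9:.1f} GB ({total / 2**30:.2f} GiB)")
# prints: Saved: 4.0 GB (3.73 GiB)
```

So a single recoded variable is worth about 4 GB of RAM on a file this size, which is a large share of the 12 GB analysis file used as an example above.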
So, loop through chunks to build up files containing only what is relevant, compress them to minimum sizes (Stata’s compress command does this for you), and use a judiciously constructed master file of IDs as a reference file against which to merge datasets with secondary information. Then, buy lots of RAM. You’ll then have the dual benefits of a really, really nice computer and a fast statistical analysis package. If you were working with large datasets in SAS, you’ll have cut your analysis time from hours to seconds, increased the range of analyses you can conduct, and got yourself improved graphics. But how are you going to convince anyone to buy you that computer?
Stata plus a large computer is cheaper
Obviously you should do your own cost calculations, but in general you’ll find it’s cheaper to buy Stata and a beast of a computer than to persist with SAS and a cheap computer. When I was in the UK I did the calculations, and they were fairly convincing. Using my rough memory of the figures at the time: SAS was about 1600 pounds a year, and a basic computer about 2000 pounds every three years, for a total cost of 6800 pounds every three years. Stata cost about 1500 pounds, with upgrades about every 2-3 years, and a computer with 32 GB of RAM and 4 processors was about 3000 pounds, for a total of about 4500 pounds. So your total costs over 3 years are about 2300 pounds less. Even if you get a beast of an Apple workstation, at about 5000 pounds, you’ll end up about even on the upgrade cycle. The difference in personal satisfaction and working pace is huge, however.
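If you want to redo this sum with your own prices, the comparison above can be sketched in a few lines of Python. The defaults are just my remembered UK figures from the paragraph above; the function names are invented for the example, and I’m assuming one Stata purchase per three-year cycle, which is a simplification.

```python
# Three-year cost comparison using the post's remembered UK prices (pounds).
# Assumption: one Stata licence purchase per three-year cycle.

YEARS = 3


def sas_setup_cost(licence_per_year=1600, computer=2000, years=YEARS):
    """Annual SAS licence for the whole cycle plus one basic computer."""
    return licence_per_year * years + computer


def stata_setup_cost(licence=1500, computer=3000):
    """One Stata licence plus one high-RAM computer."""
    return licence + computer


sas = sas_setup_cost()                         # 6800
stata = stata_setup_cost()                     # 4500
stata_apple = stata_setup_cost(computer=5000)  # 6500

print(f"SAS + basic computer:      {sas}")
print(f"Stata + 32 GB computer:    {stata} (saves {sas - stata})")
print(f"Stata + Apple workstation: {stata_apple} (saves {sas - stata_apple})")
```

Swap in current quotes for the licence and hardware and the same three lines of output give you the case to take to your boss.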
If you work with large datasets, it’s better to switch to Stata and a better computer than to persist with slow, clunky, inflexible systems like SAS or SPSS. If you need to continue to interact closely with a large SQL backend then obviously these considerations don’t apply. But if your data mostly arrives as flat files in batches once or twice a year, you’ll get major productivity gains and possibly cost savings even though you’ve bought yourself a better computer. There are very few tasks that Stata can’t solve in combination with Windows 7 or Mac OS X, so don’t hold back: make the case to your boss for the best workstation you can afford, and an upgrade to a stats package you can enjoy.