I have a dataset of about 40 million records covering rare events over a 30-year period, with about 600,000 events in total. The data are clustered in about 50 regions, and in the later years of the file there are only about 4,000 events a year (they were more frequent in the early years), which means that in one year there may be only 80 events in any one region. I want to conduct a multi-level Poisson regression of the events, with four or five categorical covariates plus year and a random effect for region. I'm doing this using GLLAMM in Stata on an Apple computer with 32GB of RAM, 12 cores and Stata/MP licensed for all 12 cores (this is not a cheap setup!). The raw dataset takes about 16GB, but I can collapse it to a dataset of counts of events and populations at risk, which reduces it to (I think, from memory) about 170,000 records. I'm using a Poisson distribution, so essentially a log-linear model, though ideally I should probably use a negative binomial model given the overdispersion that comes with so many zero counts. Unfortunately GLLAMM doesn't handle negative binomial data and I have read of problems with the xtnbreg procedure, so I just have to suck it up and use the Poisson family. I'm also not adjusting for serial dependence, because that would mean extra random effects and/or correlation structures and it would just get too nasty.
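
Here's roughly what that collapse step looks like (a sketch only, with made-up variable names: event is a 0/1 indicator, and region, year and x1–x4 are the grouping variables and categorical covariates):

    * Minimal sketch of the collapse to counts (made-up variable names).
    use "rawdata.dta", clear

    * One row per covariate pattern: event count plus number at risk.
    collapse (sum) events = event (count) pop = event, ///
        by(region year x1 x2 x3 x4)

    * Log of the population at risk, used as an offset so the Poisson
    * model estimates rates rather than raw counts.
    gen lnpop = ln(pop)
    save "counts.dta", replace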

The problem here is that GLLAMM just won't converge. If you set it running with relatively standard settings — first with the noest option to get coefficients, then feeding them as starting values into a model with adaptive quadrature and 12 integration points — it just won't converge: after 3 or 4 days and 150 iterations (one estimation cycle takes about an hour) it still hasn't converged, and Stata keeps warning that the log likelihood is not concave. The maximization process seems to be proceeding in some fashion, because every 10 iterations or so the log likelihood still shifts slightly, but it just won't seem to stop. Interestingly you can't set the convergence tolerance in GLLAMM, so I can't stop it when the change in the log likelihood per iteration is less than 0.01 (which it is now) but more than 0.00001.
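
For reference, that standard two-step run looks roughly like the sketch below. This is not exactly my code: the variable names are made up (carried over from the collapse sketch above), xi: builds dummy variables for the categorical covariates because gllamm predates factor-variable notation, and depending on your gllamm version you may need to pad the from() matrix with a starting value for the random-effect variance before it will be accepted.

    * Sketch of the standard two-step run that would not converge.
    use "counts.dta", clear

    * Step 1: noest run to get rough coefficients for starting values.
    * (If e(b) turns out to be empty after noest, a plain poisson fit
    * gives equivalent starting values.)
    xi: gllamm events i.x1 i.x2 i.x3 i.x4 year, i(region) ///
        family(poisson) link(log) offset(lnpop) noest
    matrix b0 = e(b)

    * Step 2: adaptive quadrature with 12 integration points, starting
    * from those values -- the run that grinds on for days.
    xi: gllamm events i.x1 i.x2 i.x3 i.x4 year, i(region) ///
        family(poisson) link(log) offset(lnpop) ///
        adapt nip(12) from(b0) copy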

This non-convergence is not a data issue: the model fits fine as a standard Poisson regression without the random effect, there are no warnings about combinations of covariates that are all zeros or all ones, and the same seems to hold when the data are divided up into regions (as best I can tell). I think the problem is just that the events are rare, there are a lot of dimensions over which the likelihood needs to change, and it's a very tough optimization problem. I have run into this before with non-linear multi-level models; it appears to be a problem of big data with rare events and large numbers of regions.
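
For what it's worth, those sanity checks look roughly like this (same made-up variable names as above):

    * Sanity checks: the fixed-effects Poisson model should fit
    * cleanly overall and within each region.
    xi: poisson events i.x1 i.x2 i.x3 i.x4 year, offset(lnpop)

    * Repeat within each region to look for separation or empty cells.
    levelsof region, local(regions)
    foreach r of local regions {
        quietly xi: poisson events i.x1 i.x2 i.x3 i.x4 year ///
            if region == `r', offset(lnpop)
        display "region `r': converged = " e(converged)
    }

So what to do?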

I have noticed in the past that estimation in GLLAMM is more likely to succeed with more integration points, which slow each iteration down but improve stability. I've also noticed that bad initial estimates can really kill the model, and that a big part of the estimation procedure is about getting standard errors rather than the estimates themselves. Also, although you can't control the tolerance you can limit the number of iterations (a rough version of the same thing). With each iteration taking an hour, if one wants to get actual results it helps to run as few iterations as possible[1]. So today I tried a new method for getting convergence, which works in stages like this (a Stata sketch follows the list):

  • Start with the noest option to get initial coefficients for starting values
  • Feed these into a model with 1 integration point (the Laplace approximation) and 1 iteration
  • Feed the coefficients from this as starting values into a model with 2 integration points and 2 iterations
  • Keep running new models, slowly increasing the integration points and iterations
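
Here's a rough sketch of the whole staged procedure, with the same made-up variable names as above. The intermediate nip()/iterate() values are just an illustration of "slowly increasing", not the exact schedule I used; only the first, second and fifth stages are as described here.

    * Staged estimation: ramp up integration points and iterations,
    * feeding each stage's coefficients forward as starting values.
    use "counts.dta", clear

    * Stage 0: noest run for rough starting values.
    xi: gllamm events i.x1 i.x2 i.x3 i.x4 year, i(region) ///
        family(poisson) link(log) offset(lnpop) noest
    matrix b = e(b)

    * Stages 1-5: from nip(1)/iterate(1) up to nip(12)/iterate(8).
    * (The middle stages are illustrative, not the exact schedule.)
    foreach stage in "1 1" "2 2" "4 3" "8 5" "12 8" {
        tokenize `stage'
        xi: gllamm events i.x1 i.x2 i.x3 i.x4 year, i(region) ///
            family(poisson) link(log) offset(lnpop) ///
            adapt nip(`1') iterate(`2') from(b) copy
        matrix b = e(b)
    }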

After a few of these stages, I found that my models started converging after just two Newton-Raphson cycles. By the fifth model (with nip(12) and iterate(8)) the coefficients had all become stable to 4 decimal places and the variances had largely settled: all that remained were small changes in the variance of the random effect. At the fourth model some of the variances were still out of whack, but the coefficients were close to their final values. The fourth and fifth models each converged on the second Newton-Raphson step. I set the last stage running last night and it's basically just fine-tuning now, using a model with 24 integration points and a maximum of 32 iterations. That will be the final model; it should be done by tomorrow morning.
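
Continuing the sketch above (same made-up names, with b still holding the fifth stage's coefficients), that final run is just one more stage with the full quadrature and iteration budget:

    * Final model: 24 integration points, a budget of 32 iterations,
    * starting from the fifth stage's coefficients.
    xi: gllamm events i.x1 i.x2 i.x3 i.x4 year, i(region) ///
        family(poisson) link(log) offset(lnpop) ///
        adapt nip(24) iterate(32) from(b) copy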

Because each stage starts with the final coefficients from the previous model, this process seems to bypass a lot of the convergence problems that arise from starting with coefficients very far from their final values and then waiting. Instead, you get rough coefficients halfway through the process and then restart. Since much of the challenge for the estimation process seems to be in calculating variances rather than coefficients, this saves a lot of time spent waiting for usable values; it seems to have short-circuited a lot of convergence issues. I don't know if it would work for every problem (GLLAMM seems to be a temperamental beast, and every time I use it a new problem rears its ugly head), but if you are having convergence problems and you are confident they are not caused by bad data, then maybe give my method a try. And let me know if it works! (Or if it doesn't!)

fn1: Actually this modelling process takes so long that there are some sensitivity analyses we just have to skip. There is one method we use in preparing the data that I would like to check through resampling and sensitivity analysis, but if every model takes 2 days, you can't even run a paltry 100-sample bootstrap and get a meaningful answer in half a year. Crazy!
