I’m working on a couple of complex multi-level models at the moment, using Stata, and I’ve run into some big problems getting them to work. In the spirit of helping others to solve problems I’ve spent a lot of time attacking, in this post I’m going to describe the problems I encountered and my solutions. In searching the internet for these problems I haven’t found many solutions, so hopefully other people can approach the problems more easily with the help of this post.

The two main problems I have encountered are the *initial values not feasible* error, and the sheer time it takes to run the multi-level models. Solving the first problem raises another interesting and related problem that I think is a Stata bug, and that I have found a workaround for. Solving the second one might be related to the first, but largely involves exploiting some properties of binomial or Poisson distributions. I have also encountered a third problem, in which different multi-level modeling functions give different results, but I haven’t found a solution to that one. I will mention it here for completeness.

Obviously if you aren’t a statistician, everything that goes on from here down will seem like satanism, and you should probably stop reading.

**The models**

I am currently working on three unrelated and complex multi-level models, all of which use binomial- or Poisson-distributed responses and all of which have very large data sets. I’m not going to give details about the research projects, but the three models are listed here.

- An analysis of Demographic and Health Survey (DHS) data from multiple countries and years, with a binomial outcome and about 800,000 records. This model should have individuals clustered within households clustered within clusters within years within countries, but currently I’m not using any household-level variables and I only have 1-4 years per country (mostly 1), so I’m ignoring the serial dependence in years. Even so, running a simple model in xtmelogit takes a fairly long time (perhaps 8 hours) on a 12-core PowerPC with Stata/MP 12 and 64Gb of RAM. Because I’m combining surveys across countries and years, I don’t think survey weights are valid, so I’m ignoring probability sampling effects (*thank god*!)
- An analysis of DHS data from a single country with more than a million records, with a binomial outcome. This model has individuals within households clustered within regions within states, so should be a four-level model. It also has a lot of variables (it is examining socioeconomic determinants of health outcomes). Running a single model in xtmelogit with the Laplace approximation (the fastest maximization method) on an 8-core Dell Precision platform with 32Gb of RAM takes … well, days, mostly. And currently I’m ignoring one level (region, I think) for simplicity.
- An analysis of NHS hospital admission data with about 5 million records, Poisson outcomes, and only two levels: individuals within regions. Fortunately it only has a couple of variables, though it has a massive interaction term, so it runs in a couple of hours. I should probably have crossed levels (hospital X region) but I think this would fry anything it touched (as well as my brain) and probably be unidentifiable (because of lots of regions sending single individuals to big hospitals). This is also running on the 8-core Dell.

I’ve tried running model 2 in GLLAMM, but models 1 and 3 so far have used only *xtmelogit* and *xtpoisson* respectively. Obviously I don’t have infinite time for trouble-shooting, since each model takes at least one night to run; and I can’t run tests on random sub-samples, because the problems often only occur in very large data sets, or they will appear too often in poorly-chosen random sub-samples. In all the models I also have to fiddle with categorical variables to ensure that some combinations of predictors don’t become too rare (and thus non-identifiable) in some cluster/country/time combinations. Also, as a final note, the hierarchical structure is essential. Results are completely different without it, and/or some key variables are estimated at regional level rather than individual level.

**Problem 1: Non-feasible initial values**

The first problem I have encountered is that *xtmelogit* crashes as soon as it begins initial parameter selection, and returns the following error:

initial values not feasible

This is really frustrating, and really unusual in statistical software: usually the starting values don’t matter that much. As an initial solution to this problem I tried using simplified models just to see if I could get it running, but I found I had to simplify them so much they became meaningless. I dug around on the internet and eventually found this kind of solution, which advocates using the -from()- option to load up your own initial values. The suggestion there is to run a non-hierarchical model, extract the coefficients from that, and feed them into *xtmelogit* using the from() option. If you have *k* levels, Stata will be expecting *k* additional coefficients, but apparently it can handle this automatically, so you just give it the output from *logit* and off it goes.

Unfortunately it doesn’t work. For example, if I run this code:

    xi: logit outcome i.year i.education i.wealthIndex
    mat a=e(b)
    xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:, laplace from(a)

I get the following error:

    extra parameter outcome:_Iyear_2001 found
    specify skip option if necessary

and again the model fails. I’ve never seen this error before, and it stops me from feeding the coefficients from *logit* into the *xtmelogit* command. The frustrating thing, though, is that if you display the matrices of coefficients from the two models, they are ordered the same way and have the same labels for variables, so there is no reason why Stata should find an “extra” parameter. What is happening? I searched for this online, and all I found was this useless advice, or (after much digging) a link I can’t find again, which suggested that the model might be “non-identifiable.” This problem is not arising through non-identifiability or complexity: I know this because if I can find the right starting values (see below) I can get the model to run just fine. Furthermore, this error doesn’t sound like a mathematical problem. It sounds like a programming problem.

In fact, the solution to this problem is remarkably simple: you add the “copy” suboption to the -from()- option. Without it, -from()- tells *xtmelogit* to insert values from the matrix of coefficients from *logit* into their corresponding places in the *xtmelogit* procedure based on variable labels. When it runs out of variable labels it is then supposed to make up additional starting values using defaults. This doesn’t work, because for some reason *xtmelogit* doesn’t understand the matrix output from *logit*. With the copy suboption, however, *xtmelogit* inserts the coefficients from the matrix based only on their position in the matrix. This means you need to supply the *k* extra starting values for the error terms of the random effects yourself, but otherwise you’re good to go. You can supply these by guessing them, giving zeros (I don’t know if this is a good idea!) or running an intercept-only hierarchical model and taking them from its results. The full code (with me supplying zeros in this case) is then:

    xi: logit outcome i.year i.education i.wealthIndex
    mat a=e(b)
    mat a1=(a,0,0)
    xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:, laplace from(a1, copy)

Do this and you won’t run into the extra parameter problem. But note that supplying starting values from *logit* isn’t always a good idea – they can be radically different from the true final coefficients, even differing in sign, and they can again lead to the *initial values not feasible* problem.

In this case the only solution I could find was to run a much simpler multi-level model and then extract the values from that. In fact, I found an intercept-only model was sufficient to provide functioning starting parameters for the full *xtmelogit*. So if you can’t get your *logit* starting values to work, try just running a simple intercept-only model with all the necessary levels, and supplying those starting values to *xtmelogit*.
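Putting the intercept-only trick together with the copy suboption, the whole procedure might look something like this. This is a sketch using the variable names from the earlier examples; the el() extraction assumes the two random-effect parameters sit in the last two columns of the intercept-only model’s e(b), which is where xtmelogit puts them, after the fixed effects.

```stata
* 1. fixed-effect starting values from a single-level logit
xi: logit outcome i.year i.education i.wealthIndex
mat fixed = e(b)

* 2. random-effect starting values from an intercept-only multi-level model
xtmelogit outcome || country: || cluster:, laplace
mat b0 = e(b)
mat re = (el(b0,1,colsof(b0)-1), el(b0,1,colsof(b0)))

* 3. combine, and supply positionally with the copy suboption
mat start = (fixed, re)
xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:, laplace from(start, copy)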

This bothers me for two reasons: first of all, the extra parameter error is obviously a programming error; and secondly, if supplying the results of an intercept-only model is enough to make the full model run, this suggests pretty extreme sensitivity to initial values. Is there maybe a more stable starting process for these maximization algorithms? It takes humans days to select good starting values, and if you don’t stumble on them immediately you have to do a search through a range of possibilities – isn’t it faster for the computer to do this? What starting values is it using?

**Problem 2: Optimization time is just too long**

As I mentioned in the introduction, these models can take a long time to run – between 6 hours and a couple of days depending on the model. I had hoped that finding better initial values would at least partially solve this problem, but it doesn’t help much, and the Stata manual admits that spending a long time looking for good initial values will have little impact on run time. So what to do? The number of levels is a huge determinant of the time a model takes to run (processor time depends on a factor of 2^k, I think), but if you can’t drop your levels, you’re in big trouble.

Fortunately, in some cases you can use a simple workaround. Because *xtmelogit* works on binomial data, you can shrink the dataset by calculating summary data at the lowest level: you collapse the data at this level into a data set of events and trials. Not all Stata logistic regression procedures accept the events/trials framework, but xtmelogit does. If you’re dealing with, e.g., a school with classrooms, each classroom will have only two ages and two sexes, so you may be able to reduce each classroom to just 4 records, containing the count of the number of students in each age/sex combination and the number of events. I tried this with model 1: I managed to reduce the data set to about 100,000 records and the processor time by a factor of about 8 or maybe more, and got exactly the same results.

Of course, if you have a household level above the individual, this method will be largely impossible, but if you are working with larger low-level clusters it will work a treat. Note also that it doesn’t work where you have a genuinely continuous predictor, or a lot of diversity in predictors. But it’s worth trying if you have a lot of reasonably-sized clusters, especially if you are hoping to get a more accurate estimate than the Laplace method gives.
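For model 1 the collapse step would look something like the sketch below. Variable names follow the earlier examples and are illustrative; the key point is that *xtmelogit* accepts a denominator variable through its binomial() option.

```stata
* collapse individuals to events/trials counts within each unique
* combination of predictors at the lowest level
gen one = 1
collapse (sum) events=outcome trials=one, by(country cluster year education wealthIndex)

* the same model as before, now on the much smaller collapsed data
xi: xtmelogit events i.year i.education i.wealthIndex || country: || cluster:, binomial(trials) laplace
```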

**Problem 3: Different results in GLLAMM and xtmelogit**

I’ve noticed as well that in some cases *GLLAMM* and *xtmelogit* produce remarkably different results for the same model. On about page 450 of Rabe-Hesketh’s text she mentions this problem but puts it down to choice of integration points: it appears to me that this isn’t the whole story. The Stata list seems to also think this. I have yet to work out the details of this, so will come back to it when I have a better idea. Right now I’m suspicious that *GLLAMM* and *xtmelogit* are doing … well, not quite the same thing.

**A note on software comparisons**

Note that this problem doesn’t just exist in Stata. I found PROC GLIMMIX in SAS to be horribly unstable, and apparently *glmer* in R uses the Laplace approximation as its default optimization method, and doesn’t allow any other where there are more than two levels! Multi-level modeling with non-normal responses is one of those situations where you really have to be aware of the underlying engineering of the software, and cautious about any results until you have checked every aspect of the defaults. This can be very dangerous if you’re under pressure to produce results and each model is taking a couple of days. Indeed Rabe-Hesketh recommends choosing multiple different optimization procedures (primarily choices of numbers of integration points) to select a stable one before presenting a final model. That’s really pushing your luck if one model takes a couple of days to run, and you have to go through 12 different integration point choices. I wonder if there are any methods for identifying likely optimization complexity from the data itself, without having to run extremely time-consuming guesstimates?

Probably not.

**Conclusion**

There are a couple of solutions to the initial values problem, and they don’t just involve reducing the number of variables. One of them, choosing initial values from a standard generalized linear model, produces an error based entirely on what appears to be a programming bug, but you can work around it. You can also work around the time constraints that big multi-level models put on you by collapsing data rather than removing or simplifying variables. But be careful about the choice of integration points, and the possibility that GLLAMM would give you a different result …

Good luck!

August 9, 2013 at 8:12 am

You promised it would be like satanism, which I interpreted as human sacrifice and naked sex on defiled altars! But it was like maths!

November 20, 2013 at 9:03 am

Did you end up using HLM?

November 20, 2013 at 10:38 am

Thanks for your comments. We didn’t use HLM – is it better for these kinds of problems?

November 20, 2013 at 11:11 am

I think HLM was designed exactly for this kind of data. HLM 7 can accommodate 4 levels, so why not?

November 20, 2013 at 1:33 pm

I don’t know if the number of levels is relevant. Does HLM implement the GLLAMM framework of Rabe-Hesketh? Does it have multiple different methods to estimate the same problem that produce quite different answers? If it’s specifically designed for these problems rather than just an add-on to an existing package, then maybe it is a lot faster and better programmed – I should check it out. Is it hard to learn?

June 20, 2014 at 11:50 pm

Thanks for making this blog post – it has helped me past a similar “problem one” situation in my PhD. Still slow, but thankfully my data is taking hours not days!

June 21, 2014 at 1:32 pm

Happy I could help! I have begun some work looking at convergence stability across different modeling techniques and situations, hopefully this will help me to understand the behavior of these methods in more detail.

I’ve also recently discovered that sometimes GLLAMM crashes and gives strange error messages for reasons unrelated to its optimization process. In this case, I think it gives a message something like “some variables dropped; cannot continue.” If you run the exact same model (minus the multi-level part) in the corresponding simple regression (e.g. poisson or logit) you’ll find that it runs but gives a warning – often a collinearity warning. So basically a problem that can be handled comfortably by older software with just a warning, produces a crash in GLLAMM. So quite often, the first step to solving these weird problems with GLLAMM is simply to step back and run the exact same model minus the multi-level structure, and see if you have made a silly error inputting variables or with data preparation (I also do this to get a sense of the total number of missing observations due to combinations of missing observations across variables).
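To make that debugging step concrete, the idea is just the following (variable names are illustrative):

```stata
* gllamm crashed with "some variables dropped; cannot continue"?
* run the exact same model minus the multi-level structure first:
xi: poisson outcome i.year i.education
* if Stata prints collinearity notes, or the estimation sample is much
* smaller than expected (overlapping missing values across variables),
* fix that before going back to gllamm
```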

There’s a real art to getting a final model out of GLLAMM…

September 22, 2015 at 7:10 pm

Hi Faustusnotes,

Thank you for your blog – it is by far the most useful discussion of fitting multilevel models in Stata that is on the internet!

I am using multilevel modelling to predict patient health outcomes for my undergrad thesis. As you might expect, on first pass I am struggling to find feasible starting values.

The model converges when using plain logit. When I try to generalise it to a RE model (meglm, covariance(unstructured)), it only converges when a subset of the variables are included. Using xtmelogit, I have achieved convergence with a intercept-only model, my simplest model and a model I am comparing my estimation to. However, I cannot get convergence with my full specification!

Do you know how I can identify which variables are causing the problem? What should I be looking for?

Thank you for your help,

Lucille

September 23, 2015 at 12:08 am

Hi Lucille, thanks for your positive comment. Since I wrote this post I’ve had a bit more experience trouble-shooting another couple of models, and found a few more methods for attempting to get convergence. Here are a few more tips based on this additional experience …

1) check your model specification to see if there is some problem arising from a combination of outcome variable and predictors producing scarce (or zero) results in some clusters. This won’t show up in a standard logit (because you don’t have the cluster effects in) so you may need to do some probing with tabulations and the like. I have had experience with rare events where you get a large number of zero-event combinations of cells in clusters with low populations. In this case you may need to combine levels of variables, or find some other workaround to handle the dodgy variable. (Working with early neonatal mortality in a developed nation, I discovered it’s possible to have lots of clusters with zero outcomes in some levels of some predictors and this can bork the likelihood estimation)

2) Always run a non-hierarchical version first, because it sometimes turns up obvious problems with your model that you just didn’t think about, and does so in a reasonable time period. Look for very large coefficient or standard error values, and collinearity.

3) You need to do this because sometimes the error messages from the multilevel model packages (especially gllamm; see below) can be uninformative and misleading, and you don’t realize that you have a simple misspecification problem.

4) Try using GLLAMM rather than xtmelogit or the other packages. GLLAMM seems to be more flexible and converges more reliably, and it has two good properties…

5) Using GLLAMM, set it to use adaptive quadrature (the *adapt* option) because this is slightly faster

6) Using GLLAMM you can get parameter estimates quickly without estimating covariances by specifying the NOEST option. These can then be fed directly into a new GLLAMM model with more integration points, to get convergence

7) Use lots of integration points. Contrary to my expectations, having fewer integration points does not improve your chances of getting a rough-and-ready convergence. Sometimes you need to increase the integration points to get convergence at all. I think the default is 7, so try 12; if this doesn’t work, try 24. This significantly increases the run time, but sometimes it’s better to start with a longer initial run because it will actually work

8) You can control the number of iterations in order to guarantee your model spits something out, then use these in a model with more integration points.
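A skeleton of tips 5)–8) in code. This is a hedged sketch – the variable and level names are placeholders, and the exact option names (nip(), from() with copy, iterate()) are worth checking against the gllamm help file for your version:

```stata
* first pass: adaptive quadrature, capped iterations so the model
* always spits something out
gllamm outcome year education, i(cluster country) family(binomial) link(logit) adapt nip(12) iterate(15)
mat b1 = e(b)

* second pass: more integration points, starting from the first pass
gllamm outcome year education, i(cluster country) family(binomial) link(logit) adapt nip(24) from(b1) copy
```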

In my experience GLLAMM will get you where you need to be, but it can be a trial and error process. Last year, with the early neonatal mortality set, I was doing something like this:

1) run GLLAMM with the NOEST option to get initial parameters

2) Chuck these parameters straight into a model with a lot of integration points (say 18) and stop the model at a predefined number of iterations so that it doesn’t take longer than your undergraduate degree

3) Chuck those parameter estimates into a model with the same number of integration points (if it looked like convergence was working) or more (if you were getting the “not concave” warning)

4) Once it looks like convergence has been achieved, run the model to its full convergence, then use the parameter estimates to start a model with more integration points

5) Use a stepwise process to increase the integration points and compare each new model with the previous one, and stop when the *significant* parameter estimates are changing by less than some arbitrary amount (e.g. 5%). You will find that the non-significant parameters change by lots, but that is to be expected – they’re non-significant.

From memory, once you have a good set of starting parameters and you’re using adaptive quadrature, models converge rapidly even with lots of integration points and variables (by rapidly I mean in less than a week from start to final model). I managed to get a solution to a model with 40 million records and 300,000 events, and about 8 variables, with two levels, using this method, inside about 3 days (again from memory).

In my experience the number of integration points is important, and it can be worth wasting a few days to see if a model with a lot of integration points (>24) converges or looks like it’s converging; then you can stop the model, run up some better starting parameter estimates, and start it again if necessary. You can usually tell if a model is going to converge from the log-likelihood values displayed at each iteration – they should start to bottom out fast, and though there will be a few “not concave” warnings, if they are interspersed with completed iterations you should be fine. The important point here is that the default number of integration points doesn’t always work, and almost never works when you have large models with big data and rare-ish events. So dive in and start messing around with the intpoints.

Finally, if you really, really can’t find a solution, an alternative might be to use WinBUGS to run a hierarchical model with Bayesian priors. WinBUGS solves problems through simulation (a Gibbs sampler) rather than the full maximum likelihood estimation process of a standard hierarchical model package (though I think GLLAMM includes some MCMC estimation, so I could be telling a big fib here; please correct me if I’m wrong). This means it may not be vulnerable to the challenges of maximizing a hugely complex likelihood with many zeros. In this case, a word of advice: unless you have a spatial covariance component to your model, go straight to JAGS implemented in R, because this will work. I don’t advise this because I think Bayesian hierarchical modeling is better (in fact I don’t advise it at all), but you might want to consider it if all else fails, as a way of getting around an intractable likelihood estimation problem – though in truth I don’t know if JAGS would solve this problem any better [also Stata 14 has Bayesian modules built in that might solve the problem by using MCMC estimation instead of integration, but I don’t know].

Good luck! Please tell me if any of my suggestions worked …

September 23, 2015 at 10:29 am

Is there a wiki for statistical analysis (especially in relation to the software used)?

Because based on the comments I’ve seen (i.e. sporadic posts from a guy blogging about other stuff being praised as the sole source of assistance), it’d seem this is a gap in the wiki coverage of the world.

September 23, 2015 at 11:02 am

As far as I know there isn’t, Paul! (I’ll be really pissed off if there is and I have never found it). There are message boards for all the major stats packages, but I guess it remains primarily something picked up through text books, class learning and (to a *very* large extent) mentoring. Andrew Gelman (an actually famous statistician) also includes blog posts about specific problems he has experienced and their solutions, one of which I think I linked to in this post, but we don’t have a wiki. It would be good – if my previous comment were part of a wiki one would be able to expand out from any of the key points to a full exploration of a wide range of statistical tasks, many of which I don’t know the mathematical background to (e.g. the details of adaptive quadrature). But it doesn’t exist … and there’s a reason for this that we statisticians only let on to our closest friends … lean in if you dare to hear a scary secret …

Stats includes a very large portion of experience and art, rather than rigorous technique. Especially with the kind of problems described by Lucille, you’re trying to balance computational practicality, problems with data, your approximate knowledge of how the original problem might be posed, and guesswork at the best actual model, and many of the solutions to this balancing act arise not from statistical theory but from experience and a kind of feeling for the way data works. Because computation and data size have expanded faster than our experience, even experienced statisticians are often casting around for solutions to problems they’ve never encountered before – and then asking around for help in managing the computational problems their solution caused. At the core of this is the simple fact (as old as statistics) that our models and the techniques we use to build them require some degree of art and fee-fees rather than robust technical specification. We choose which variables to put into a model and how to construct the model, and it’s primarily the modeler themselves who chooses how to test the robustness and quality of their model. There’s a lot of experience, guesswork and intuition involved.

Now that I’ve told you that I have to kill you. Sorry!

September 23, 2015 at 12:00 pm

Eh. *Shrug* The odds of you successfully killing me are difficult to work out. So we can probably just assume they’re zero.