I’m working on a couple of complex multi-level models at the moment, using Stata, and I’ve run into some big problems getting them to work. In the spirit of helping others to solve problems I’ve spent a lot of time attacking, in this post I’m going to describe the problems I encountered and my solutions. In searching the internet for these problems I haven’t found many solutions, so hopefully other people can approach the problems more easily with the help of this post.

The two main problems I have encountered occur when I encounter the initial values not feasible error, and also problems of the time it takes to run the multi-level models. Solving the first problem raises another interesting and related problem that I think is a Stata bug, and that I have found a workaround for. Solving the second one might be related to the first one, but largely involves utilizing some properties of binomial or Poisson distributions. I have also encountered a third problem in which different multi-level modeling functions give different results, but I haven’t found a solution to that one. I will mention it here for completeness.

Obviously if you aren’t a statistician, everything that goes on from here down will seem like satanism, and you should probably stop reading.

The models

I am currently working on three unrelated and complex multi-level models, all of which use binomial- or Poisson-distributed responses and all of which have very large data sets. I’m not going to give details about the research projects, but the three models are listed here.

  1. An analysis of Demographic and Health Survey (DHS) data from multiple countries and years, with a binomial outcome and about 800,000 records. This model should have individuals clustered within households clustered within clusters within years within countries, but currently I’m not using any household level variables and I only have 1-4 years per country (mostly 1) so I’m ignoring the serial dependence in years. Even so, running a simple model in xtmelogit takes a fairly long time (perhaps 8 hours) on a 12-core PowerPC with Stata/MP 12 and 64Gb of RAM. Because I’m combining surveys across countries and years, I don’t think survey weights are valid, so I’m ignoring probability sampling effects (thank god!)
  2. An analysis of DHS data from a single country with more than a million records, with a binomial outcome. This model has individuals within households clustered within regions within states, so should be a four-level model. It also has a lot of variables (it is examining socioeconomic determinants of health outcomes). Running a single model in xtmelogit with Laplace transformation (the fastest maximization method) on an 8-core Dell Precision platform with 32Gb of RAM takes … well, days, mostly. And currently I’m ignoring one level (region, I think) for simplicity.
  3. An analysis of NHS hospital admission data with about 5 million records, Poisson outcomes, and only two levels: individuals within regions. Fortunately it only has a couple of variables, though it has a massive interaction term, so it runs in a couple of hours. I should probably have crossed levels (hospital X region) but I think this would fry anything it touched (as well as my brain) and probably be unidentifiable (because of lots of regions sending single individuals to big hospitals). This is also running on the 8-core Dell.

I’ve tried running model 2 in GLLAMM, but models 1 and 3 so far have used only xtmelogit and xtPoisson respectively. Obviously I don’t have infinite time for trouble-shooting, since each model takes at least one night to run; and I can’t run tests on random sub-samples because the problems often only occur in very large data sets; or they will appear to often in poorly-chosen random sub-samples. In all the models I also have to fiddle with categorical variables to ensure that some combinations of predictors don’t become too rare (and thus non-identifiable) in some cluster/country/time combinations. Also, as a final note, the heirarchical structure is essential. Results are completely different without it, and/or some key variables are estimated at regional level rather than individual.

Problem 1: Non feasible initial values

The first problem I have encountered is that xtmelogit crashes as soon as it begins initial parameter selection, and returns the following error:

initial values not feasible

This is rarely frustrating and rarely unusual in statistical software: usually the starting values don’t matter that much. As an initial solution to this problem I tried using simplified models just to see if I could get it running, but I found I had to simplify them so much they became meaningless. I dug around on the internet and eventually found this kind of solution, which advocates using the – from() – option to load up your own initial values. The option suggested there is to run a non-heirarchical model, extract the coefficients from that, and input them into the xtmelogit using the from option. If you have k levels, Stata will be expecting k additional coefficients, but apparently it can handle this automatically, so you just give it the output from logit and off it goes.

Unfortunately it doesn’t work. For example, if I run this code:

xi: logit outcome i.year i.education i.wealthIndex
mat a=e(b)
xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:,laplace from(a)

I get the following error:

extra parameter outcome:_Iyear_2001 found
specify skip option if necessary

and again the model fails. I’ve never seen this error before. It basically stops me from adding the coefficients from logit into the xtmelogit command. The frustrating thing though is that if you display the matrices of coefficients from the two models, they are ordered the same way and have the same labels for variables, so there is no reason why Stata should find an “extra” parameter. What is happening? I searched for this online, and all I found was this useless advice, or (after much digging) a link I can’t find again, which suggested that the model might be “non-identifiable.” This problem is not arising through non-identifiability or complexity: I know this because if I can find the right starting values (see below) I can get the model to run just fine. Furthermore, this error doesn’t sound like a mathematical problem. It sounds like a programming problem.

In fact, the solution to this problem is remarkably simple: you simply add the “copy” option to the -from()- option. Without this option, the -from()- option tells xtmelogit to insert values from the matrix of coefficients from logit into their corresponding place in the xtmelogit procedure based on variable labels. When it runs out of variable labels it is then supposed to make up additional starting values using defaults. However, this doesn’t work because for some reason xtmelogit doesn’t understand the matrix output from logit. However, if you use the copy option, xtmelogit inserts the coefficients from the matrix based only on their position in the matrix. This means you need to supply the k extra starting values for the error terms of the random effects, but otherwise you’re good to go. You can supply these by guessing them, giving zeros (don’t know if this is a good idea!) or running an intercept-only heirarchical model and taking them from the results of that. The full code (with me supplying zeros in this case) is then:

xi: logit outcome i.year i.education i.wealthIndex
mat a=e(b)
mat a1=(a,0,0)
xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:,laplace from(a1, copy)

Do this and you won’t run into the extra parameter problem. But note that supplying starting values from logit isn’t always a good idea – they can be radically different to the true final coefficients, even differing in sign, and they can again lead to the initial values not feasible problem.

In this case the only solution I could find was to run a much simpler multi-level model and then extract the values from that. In fact, I found an intercept-only model was sufficient to provide functioning starting parameters for the full xtmelogit. So if you can’t get your logit starting values to work, try just running a simple intercept-only model with all the necessary levels, and supplying those starting values to xtmelogit.

This bothers me for two reasons: first of all, the extra parameter error is obviously a programming error; and secondly, if supplying the results of an intercept-only model is enough to make the full model run, this suggests pretty extreme sensitivity to initial values. Is there maybe a more stable starting process for these maximization algorithms? It takes humans days to select good starting values, and if you don’t stumble on them immediately you have to do a search through a range of possibilities – isn’t it faster for the computer to do this? What starting values is it using?

Problem 2: Optimization time is just too long

As I mentioned in the introduction, these models can take a long time to run – between 6 hours and a couple of days depending on the model. I had hoped that finding new and better initial values would solve this problem at least partially, but it doesn’t much and the Stata manual admits that spending a long time looking for good initial values will have little impact on the time it takes. So what to do? The number of levels is a huge determinant of the time it takes to run (processor time depends on a factor of 2^k, I think), but if you can’t drop your levels, you’re in big trouble. Fortunately you can use a simple workaround (in some cases) to solve this problem. Because xtmelogit works on binomial data you can reduce the dataset in size by calculating summary data at the lowest level: you collapse the data at this level into a data set of events and trials. Not all Stata logistic regression procedures accept the events/trials framework, but xtmelogit does. If you’re dealing with, e.g. a school with classrooms, each classroom will have only two ages and two sexes. So you may be able to reduce each classroom to just 4 records, containing the count of the number of students in each age/sex combination, and the number of events. I tried this with model 1, and managed to reduce the data set to about 100,000 records, and the processor time by a factor of about 8 or maybe more, and get exactly the same results. Of course, if you have a household level above the individual, this method will be largely impossible, but if you are working with larger low-level clusters it will work a treat. Note also that it doesn’t work where you have a genuinely continuous variable, or a lot of diversity in predictors. But it’s worth trying if you have a lot of reasonably-sized clusters, especially if you are hoping to get a more accurate estimate than the laplace method.

Problem 3: Different results in GLLAMM and xtmelogit

I’ve noticed as well that in some cases GLLAMM and xtmelogit produce remarkably different results for the same model. On about page 450 of Rabe-Hesketh’s text she mentions this problem but puts it down to choice of integration points: it appears to me that this isn’t the whole story. The Stata list seems to also think this. I have yet to work out the details of this, so will come back to it when I have a better idea. Right now I’m suspicious that GLLAMM and xtmelogit are doing … well, not quite the same thing.

A note on software comparisons

Note that this problem doesn’t just exist in Stata. I found PROC GLIMMIX in SAS to be horribly unstable, and apparently glmer in R uses the Laplace approximation as its default optimization method, and doesn’t allow any other where there are more than two levels! Multi-level modeling with non-normal responses is one of those situations where you really have to be aware of the underlying engineering of the software, and cautious about any results until you have checked every aspect of the defaults. This can be very dangerous if you’re under pressure to produce results and each model is taking a couple of days. Indeed Rabe-Hesketh recommends choosing multiple different optimization procedures (primarily choices of numbers of integration points) to select a stable one before presenting a final model. That’s really pushing your luck if one model takes a couple of days to run, and you have to go through 12 different integration point choices. I wonder if there are any methods for identifying likely optimization complexity from the data itself, without having to run extremely time-consuming guesstimates?

Probably not.


There are a couple of solutions to the initial value problem and they don’t just involve reducing the number of variables. One of these, choosing initial values from a standard generalized linear model, produces an error based entirely on what appears to be a programming but, but you can work around it. You can also workaround the time constraints put on you by big multi-level models, by collapsing data rather than removing or simplifying variables. But be careful about the choice of integration points, and the possibility that GLLAMM would give you a different result …

Good luck!