I was recently talking with my friend Rebecca about simulating multilevel data, and she asked me if I would show her some examples. It occurred to me that many of you might also like to see some examples, so I decided to post them to the Stata Blog.
Introduction
We simulate data all the time at StataCorp and for a variety of reasons.

One reason is that real datasets that include the features we want are often difficult to find. We prefer to use real datasets in the manual examples, but sometimes that isn't feasible, so we create simulated datasets.

We also simulate data to check the coverage probabilities of new estimators in Stata. Sometimes the formulae published in books and papers contain typographical errors. Sometimes the asymptotic properties of estimators don't hold under certain conditions. And every once in a while, we make coding errors. We run simulations during development to verify that a 95% confidence interval really is a 95% confidence interval.

Simulated data can also come in handy for presentations, teaching purposes, and calculating statistical power using simulations for complex study designs.

And, simulating data is just plain fun once you get the hang of it.

Some of you will recall Vince Wiggins's blog entry from 2011 entitled "Multilevel random effects in xtmixed and sem — the long and wide of it" in which he simulated a three-level dataset. I'm going to elaborate on how Vince simulated multilevel data, and then I'll show you some useful variations. Specifically, I'm going to talk about:
- How to simulate single-level data
- How to simulate two- and three-level data
- How to simulate three-level data with covariates
- How to simulate longitudinal data with random slopes
- How to simulate longitudinal data with structured errors
How to simulate single-level data
Let’s start by simulating a trivially easy, single-level dataset that has the shape
[y_i = 70 + e_i]
We’ll assume that e is generally distributed with imply zero and variance (sigma^2).
We’d need to simulate 500 observations, so let’s start by clearing Stata’s reminiscence and setting the variety of observations to 500.
. clear . set obs 500
Subsequent, let’s create a variable named e that accommodates pseudorandom usually distributed knowledge with imply zero and normal deviation 5:
. generate e = rnormal(0,5)
The variable e is our error term, so we can create an outcome variable y by typing
. generate y = 70 + e
. list y e in 1/5
+----------------------+
| y e |
|----------------------|
1. | 78.83927 8.83927 |
2. | 69.97774 -.0222647 |
3. | 69.80065 -.1993514 |
4. | 68.11398 -1.88602 |
5. | 63.08952 -6.910483 |
+----------------------+
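If you want to play with the same idea outside Stata, here is a rough NumPy analogue of this single-level simulation (the seed and variable names are my own, not from the post):

```python
import numpy as np

# Simulate y_i = 70 + e_i with e_i ~ N(0, 5); the seed is arbitrary, for reproducibility
rng = np.random.default_rng(12345)
e = rng.normal(0, 5, size=500)  # like Stata's rnormal(0,5): mean 0, standard deviation 5
y = 70 + e

# The sample mean and SD should land near the true parameters 70 and 5
print(round(y.mean(), 1), round(y.std(ddof=1), 1))
```

As with rnormal(), every run without a fixed seed would produce slightly different estimates around the true parameters.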
We can fit a linear regression for the variable y to determine whether our parameter estimates are reasonably close to the parameters we specified when we simulated our dataset:
. regress y
Source | SS df MS Number of obs = 500
-------------+------------------------------ F( 0, 499) = 0.00
Model | 0 0 . Prob > F = .
Residual | 12188.8118 499 24.4264766 R-squared = 0.0000
-------------+------------------------------ Adj R-squared = 0.0000
Total | 12188.8118 499 24.4264766 Root MSE = 4.9423
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 69.89768 .221027 316.24 0.000 69.46342 70.33194
------------------------------------------------------------------------------
The estimate of _cons is 69.9, which is very close to 70, and the Root MSE of 4.9 is similarly close to the error's standard deviation of 5. The parameter estimates will not be exactly equal to the underlying parameters we specified when we created the data because we introduced randomness with the rnormal() function.

This simple example is just to get us started before we work with multilevel data. For familiarity, let's fit the same model with the mixed command that we will be using later:

. mixed y, stddev
Mixed-effects ML regression Number of obs = 500
Wald chi2(0) = .
Log likelihood = -1507.8857 Prob > chi2 = .
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 69.89768 .2208059 316.56 0.000 69.46491 70.33045
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
sd(Residual) | 4.93737 .1561334 4.640645 5.253068
------------------------------------------------------------------------------
The output is organized with the parameter estimates for the fixed part in the top table and the estimated standard deviations for the random effects in the bottom table. Just as previously, the estimate of _cons is 69.9, and the estimate of the standard deviation of the residuals is 4.9.

Okay. That really was trivial, wasn't it? Simulating two- and three-level data is almost as easy.
How to simulate two- and three-level data
I posted a blog entry last year titled "Multilevel linear models in Stata, part 1: Components of variance". In that posting, I showed a diagram for a residual of a three-level model.

The equation for the variance-components model I fit had the form

[y_{ijk} = mu + u_{i..} + u_{ij.} + e_{ijk}]

This model had three residuals, whereas the one-level model we just fit above had only one.

This time, let's start with a two-level model. Let's simulate a two-level dataset, a model for children nested within classrooms. We will index classrooms by i and children by j. The model is

[y_{ij} = mu + u_{i.} + e_{ij}]

For this toy model, let's assume two classrooms with two students per classroom, meaning that we want to create a four-observation dataset, where the observations are students.

To create this four-observation dataset, we start by creating a two-observation dataset, where the observations are classrooms. Because there are two classrooms, we type

. clear
. set obs 2
. generate classroom = _n

From now on, we will refer to classroom as i. It is easier to remember what variables mean if they have meaningful names.

Next, we will create a variable that contains each classroom's random effect (u_i), which we will assume follows an N(0,3) distribution.
. generate u_i = rnormal(0,3)
. list
+----------------------+
| classr~m u_i |
|----------------------|
1. | 1 .7491351 |
2. | 2 -.0031386 |
+----------------------+
We can now expand our data to include two children per classroom by typing

. expand 2
. list
+----------------------+
| classr~m u_i |
|----------------------|
1. | 1 .7491351 |
2. | 2 -.0031386 |
3. | 1 .7491351 |
4. | 2 -.0031386 |
+----------------------+
Now, we can think of our observations as being students. We can create a child ID (we will call it child rather than j), and we can create each child's residual (e_{ij}), which we will assume follows an N(0,5) distribution:

. bysort classroom: generate child = _n
. generate e_ij = rnormal(0,5)
. list
+------------------------------------------+
| classr~m u_i child e_ij |
|------------------------------------------|
1. | 1 .7491351 1 2.832674 |
2. | 1 .7491351 2 1.487452 |
3. | 2 -.0031386 1 6.598946 |
4. | 2 -.0031386 2 -.3605778 |
+------------------------------------------+
We now have nearly all the ingredients to calculate (y_{ij}):

(y_{ij} = mu + u_{i.} + e_{ij})

We will assume mu is 70. We type
. generate y = 70 + u_i + e_ij
. list y classroom u_i child e_ij, sepby(classroom)
+-----------------------------------------------------+
| y classr~m u_i child e_ij |
|-----------------------------------------------------|
1. | 73.58181 1 .7491351 1 2.832674 |
2. | 72.23659 1 .7491351 2 1.487452 |
|-----------------------------------------------------|
3. | 76.59581 2 -.0031386 1 6.598946 |
4. | 69.63628 2 -.0031386 2 -.3605778 |
+-----------------------------------------------------+
Note that the random effect u_i is the same within each classroom, and each child has a different value for e_ij.

Our strategy was simple:

- Start with the top level of the data hierarchy.
- Create variables for the level ID and its random effect.
- Expand the data by the number of observations within that level.
- Repeat steps 2 and 3 until the bottom level is reached.
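The four-step recipe can be sketched in other languages too; here is an illustrative NumPy version of the two-classroom example, using np.repeat to mimic Stata's expand (the names and seed are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: top level -- two classrooms, each with a random effect u_i ~ N(0,3)
classroom = np.array([1, 2])
u_i = rng.normal(0, 3, size=2)

# Step 3: "expand 2" -- repeat each classroom row once per child
classroom = np.repeat(classroom, 2)
u_i = np.repeat(u_i, 2)

# Step 4: bottom level -- child ID within classroom and child-level residual e_ij ~ N(0,5)
child = np.tile([1, 2], 2)
e_ij = rng.normal(0, 5, size=4)

y = 70 + u_i + e_ij
# u_i is constant within classroom; e_ij differs across all four children
print(classroom, np.round(y, 2))
```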
Let's try this recipe for three-level data where children are nested within classrooms, which are nested within schools. This time, I will index schools with i, classrooms with j, and children with k so that my model is
[y_{ijk} = mu + u_{i..} + u_{ij.} + e_{ijk}]
where

(u_{i..}) ~ N(0,2)
(u_{ij.}) ~ N(0,3)
(e_{ijk}) ~ N(0,5)

Let's create data for

(level 3, i) 2 schools
(level 2, j) 2 classrooms in each school
(level 1, k) 2 students in most classrooms; 3 students in classroom i==2 & j==2
Begin by creating the level-three data for the two schools:
. clear
. set obs 2
. generate school = _n
. generate u_i = rnormal(0,2)
. list school u_i
+--------------------+
| school u_i |
|--------------------|
1. | 1 3.677312 |
2. | 2 -3.193004 |
+--------------------+
Next, we expand the data so that we have the two classrooms nested within each of the schools, and we create their random effect:

. expand 2
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. list school u_i classroom u_ij, sepby(school)
+-------------------------------------------+
| school u_i classr~m u_ij |
|-------------------------------------------|
1. | 1 3.677312 1 .9811059 |
2. | 1 3.677312 2 -3.482453 |
|-------------------------------------------|
3. | 2 -3.193004 1 -4.107915 |
4. | 2 -3.193004 2 -2.450383 |
+-------------------------------------------+
Finally, we expand the data so that we have three students in school 2's classroom 2, and two students in all the other classrooms. Sorry for that complication, but I wanted to show you how to create unbalanced data.

In the previous examples, we have been typing things like expand 2, meaning double the observations. In this case, we need to do something different for school 2, classroom 2, namely,

. expand 3 if school==2 & classroom==2

and then we can simply expand the rest:

. expand 2 if !(school==2 & classroom==2)

Obviously, in a real simulation, you would probably want 16 to 25 students in each classroom. You could do something like that by typing

. expand 16+int((25-16+1)*runiform())

In any case, we will type

. expand 3 if school==2 & classroom==2
. expand 2 if !(school==2 & classroom==2)
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate y = 70 + u_i + u_ij + e_ijk
. list y school u_i classroom u_ij child e_ijk, sepby(classroom)
+------------------------------------------------------------------------+
| y school u_i classr~m u_ij child e_ijk |
|------------------------------------------------------------------------|
1. | 76.72794 1 3.677312 1 .9811059 1 2.069526 |
2. | 69.81315 1 3.677312 1 .9811059 2 -4.845268 |
|------------------------------------------------------------------------|
3. | 74.09565 1 3.677312 2 -3.482453 1 3.900788 |
4. | 71.50263 1 3.677312 2 -3.482453 2 1.307775 |
|------------------------------------------------------------------------|
5. | 64.86206 2 -3.193004 1 -4.107915 1 2.162977 |
6. | 61.80236 2 -3.193004 1 -4.107915 2 -.8967164 |
|------------------------------------------------------------------------|
7. | 66.65285 2 -3.193004 2 -2.450383 1 2.296242 |
8. | 49.96139 2 -3.193004 2 -2.450383 2 -14.39522 |
9. | 64.41605 2 -3.193004 2 -2.450383 3 .0594433 |
+------------------------------------------------------------------------+
Regardless of how we generate the data, we must ensure that the school-level random effects u_i are the same within school and the classroom-level random effects u_ij are the same within classroom.

Regarding data construction, the example above was concocted to produce a dataset that would be easy to list. Let's now create a more reasonable dataset:
[y_{ijk} = mu + u_{i..} + u_{ij.} + e_{ijk}]
where

(u_{i..}) ~ N(0,2)
(u_{ij.}) ~ N(0,3)
(e_{ijk}) ~ N(0,5)

Let's create data for

(level 3, i) 6 schools
(level 2, j) 10 classrooms in each school
(level 1, k) 16-25 students in each classroom

. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate y = 70 + u_i + u_ij + e_ijk
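For comparison, here is an illustrative NumPy translation of the same three-level pipeline, including the random 16-25 class sizes (a sketch with my own seed and names, not code from the original post):

```python
import numpy as np

rng = np.random.default_rng(1)

# Level 3: 6 schools with u_i ~ N(0,2)
school = np.arange(1, 7)
u_i = rng.normal(0, 2, 6)

# Level 2: 10 classrooms per school with u_ij ~ N(0,3), mirroring "expand 10"
school = np.repeat(school, 10)
u_i = np.repeat(u_i, 10)
u_ij = rng.normal(0, 3, school.size)

# Level 1: 16-25 students per classroom, mirroring "expand 16+int(10*runiform())"
class_sizes = rng.integers(16, 26, size=school.size)  # upper bound is exclusive
school = np.repeat(school, class_sizes)
u_i = np.repeat(u_i, class_sizes)
u_ij = np.repeat(u_ij, class_sizes)

# Outcome: y = 70 + u_i + u_ij + e_ijk with e_ijk ~ N(0,5)
e_ijk = rng.normal(0, 5, school.size)
y = 70 + u_i + u_ij + e_ijk
print(y.size, round(y.mean(), 1))  # total students; the mean should be near 70
```

Repeating each level's columns before drawing the next level's effects is what keeps u_i constant within school and u_ij constant within classroom.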
We can use the mixed command to fit the model with our simulated data.

. mixed y || school: || classroom: , stddev

Mixed-effects ML regression Number of obs = 1217
-----------------------------------------------------------
| No. of Observations per Group
Group Variable | Groups Minimum Average Maximum
----------------+------------------------------------------
school | 6 197 202.8 213
classroom | 60 16 20.3 25
-----------------------------------------------------------
Wald chi2(0) = .
Log likelihood = -3710.0673 Prob > chi2 = .
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 70.25941 .9144719 76.83 0.000 68.46707 72.05174
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity |
sd(_cons) | 2.027064 .7159027 1.014487 4.050309
-----------------------------+------------------------------------------------
classroom: Identity |
sd(_cons) | 2.814152 .3107647 2.26647 3.494178
-----------------------------+------------------------------------------------
sd(Residual) | 4.828923 .1003814 4.636133 5.02973
------------------------------------------------------------------------------
LR test vs. linear regression: chi2(2) = 379.37 Prob > chi2 = 0.0000
The parameter estimates from our simulated data match the parameters used to create the data fairly well: the estimate for _cons is 70.3, which is near 70; the estimated standard deviation for the school-level random effects is 2.0, which is near 2; the estimated standard deviation for the classroom-level random effects is 2.8, which is near 3; and the estimated standard deviation for the individual-level residuals is 4.8, which is near 5.

We have just completed one reasonable simulation.

If we wanted to do a full simulation, we would need to repeat the above 100, 1,000, 10,000, or more times. We would put our code in a loop. And in that loop, we would keep track of whatever parameters interested us.
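As a sketch of what such a loop might look like, here is an illustrative Python example that checks whether a nominal 95% confidence interval for a simple mean covers the true value about 95% of the time (my own toy setup, not from the post):

```python
import numpy as np

rng = np.random.default_rng(7)
reps, n, mu, sigma = 2000, 500, 70.0, 5.0
covered = 0

for _ in range(reps):
    y = rng.normal(mu, sigma, n)
    se = y.std(ddof=1) / np.sqrt(n)
    # Normal-based 95% confidence interval for the mean
    lo, hi = y.mean() - 1.96 * se, y.mean() + 1.96 * se
    covered += lo <= mu <= hi

print(covered / reps)  # should be close to the nominal 0.95
```

In a real coverage study you would fit the full multilevel model inside the loop and track each parameter of interest, but the bookkeeping is the same.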
How to simulate three-level data with covariates
Usually, we are more interested in estimating the effects of the covariates than in estimating the variance of the random effects. Covariates are often binary (such as male/female), categorical (such as race), ordinal (such as education level), or continuous (such as age).

Let's add some covariates to our simulated data. Our model is
[y_{ijk} = mu + u_{i..} + u_{ij.} + e_{ijk}]
where

(u_{i..}) ~ N(0,2)
(u_{ij.}) ~ N(0,3)
(e_{ijk}) ~ N(0,5)

We create data for

(level 3, i) 6 schools
(level 2, j) 10 classrooms in each school
(level 1, k) 16-25 students in each classroom

Let's add to this model

(level 3, school i) whether the school is in an urban setting
(level 2, classroom j) teacher's experience (years)
(level 1, student k) student's mother's education level
We can create a binary covariate called urban at the school level that equals 1 if the school is located in an urban area and equals 0 otherwise.

. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. generate urban = runiform()<0.50
Here we assigned schools to one of the two groups with equal probability (runiform()<0.50), but we could have assigned 70% of the schools to be urban by typing

. generate urban = runiform()<0.70
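The runiform()<0.70 trick carries over directly to other languages; for instance, a hypothetical NumPy equivalent:

```python
import numpy as np

rng = np.random.default_rng(3)
# Each school is urban with probability 0.70, mirroring runiform()<0.70
urban = (rng.random(6) < 0.70).astype(int)
print(urban)  # a 0/1 vector, one entry per school
```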
At the classroom level, we could add a continuous covariate for the teacher's years of experience. We could generate this variable by using any of Stata's random-number functions (see help random_number_functions). In the example below, I have generated teacher's years of experience with a uniform distribution ranging from 5 to 20 years.

. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. bysort school: generate teach_exp = 5+int((20-5+1)*runiform())
When we summarize our data, we see that teaching experience ranges from 6 to 20 years with an average of 13 years.
. summarize teach_exp
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
teach_exp | 60 13.21667 4.075939 6 20
At the child level, we could add a categorical/ordinal covariate for the mother's highest level of education completed. After we expand the data and create the child ID and error variables, we can generate a uniformly distributed random variable, temprand, on the interval [0,1].

. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate temprand = runiform()

We can assign children to different groups by using the egen command with cutpoints. In the example below, children whose value of temprand is in the interval [0,0.5) will be assigned to mother_educ==0, children whose value of temprand is in the interval [0.5,0.9) will be assigned to mother_educ==1, and children whose value of temprand is in the interval [0.9,1) will be assigned to mother_educ==2.

. egen mother_educ = cut(temprand), at(0, 0.5, 0.9, 1) icodes
. label define mother_educ 0 "HighSchool" 1 "College" 2 ">College"
. label values mother_educ mother_educ
The resulting frequencies of each category are very close to the frequencies we specified in our egen command.
. tabulate mother_educ, generate(meduc)
mother_educ | Freq. Percent Cum.
------------+-----------------------------------
HighSchool | 602 50.17 50.17
College | 476 39.67 89.83
>College | 122 10.17 100.00
------------+-----------------------------------
Total | 1,200 100.00
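The cutpoint logic behind egen cut() can be mimicked outside Stata as well; here is an illustrative NumPy sketch using np.searchsorted with the same cutpoints (my own names and seed):

```python
import numpy as np

rng = np.random.default_rng(11)
temprand = rng.random(1200)

# Cutpoints 0.5 and 0.9: [0,0.5) -> 0 (HighSchool), [0.5,0.9) -> 1 (College),
# [0.9,1) -> 2 (>College), mirroring egen cut(), at(0,0.5,0.9,1) icodes
mother_educ = np.searchsorted([0.5, 0.9], temprand, side="right")

# Observed shares should be near the target probabilities 0.5 / 0.4 / 0.1
shares = np.bincount(mother_educ, minlength=3) / temprand.size
print(np.round(shares, 2))
```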
We used the option generate(meduc) in the tabulate command above to create indicator variables for each category of mother_educ. This will allow us to specify an effect size for each category when we create our outcome variable.
. summarize meduc*
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
meduc1 | 1200 .5016667 .5002057 0 1
meduc2 | 1200 .3966667 .4894097 0 1
meduc3 | 1200 .1016667 .3023355 0 1
Now, we can create an outcome variable called score by adding all our fixed and random effects together. We can specify an effect size (regression coefficient) for each fixed effect in our model.
. generate score = 70
+ (-2)*urban
+ 1.5*teach_exp
+ 0*meduc1
+ 2*meduc2
+ 5*meduc3
+ u_i + u_ij + e_ijk
I have specified that the grand mean is 70, urban schools will have scores 2 points lower than nonurban schools, and each year of teacher's experience will add 1.5 points to the student's score.
Mothers whose highest level of education was high school (meduc1==1) will serve as the referent category for mother_educ (mother_educ==0). The scores of children whose mother completed college (meduc2==1 and mother_educ==1) will be 2 points higher than those of the children in the referent group. And the scores of children whose mother completed more than college (meduc3==1 and mother_educ==2) will be 5 points higher than those of the children in the referent group.

Now, we can use the mixed command to fit a model to our simulated data. We used the indicator variables meduc1-meduc3 to create the data, but we will use the factor variable i.mother_educ to fit the model.
. mixed score urban teach_exp i.mother_educ || school: ||
classroom: , stddev baselevel
Mixed-effects ML regression Number of obs = 1259
-----------------------------------------------------------
| No. of Observations per Group
Group Variable | Groups Minimum Average Maximum
----------------+------------------------------------------
school | 6 200 209.8 217
classroom | 60 16 21.0 25
-----------------------------------------------------------
Wald chi2(4) = 387.64
Log likelihood = -3870.5395 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
score | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
urban | -2.606451 2.07896 -1.25 0.210 -6.681138 1.468237
teach_exp | 1.584759 .096492 16.42 0.000 1.395638 1.77388
|
mother_educ |
HighSchool | 0 (base)
College | 2.215281 .3007208 7.37 0.000 1.625879 2.804683
>College | 5.065907 .5237817 9.67 0.000 4.039314 6.0925
|
_cons | 68.95018 2.060273 33.47 0.000 64.91212 72.98824
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity |
sd(_cons) | 2.168154 .7713944 1.079559 4.354457
-----------------------------+------------------------------------------------
classroom: Identity |
sd(_cons) | 3.06871 .3320171 2.482336 3.793596
-----------------------------+------------------------------------------------
sd(Residual) | 4.947779 .1010263 4.753681 5.149802
------------------------------------------------------------------------------
LR test vs. linear regression: chi2(2) = 441.25 Prob > chi2 = 0.0000
"Close" is in the eye of the beholder, but to my eyes, the parameter estimates look remarkably close to the parameters that were used to simulate the data. The parameter estimates for the fixed part of the model are -2.6 for urban (parameter = -2), 1.6 for teach_exp (parameter = 1.5), 2.2 for the College category of mother_educ (parameter = 2), 5.1 for the >College category of mother_educ (parameter = 5), and 69.0 for the intercept (parameter = 70). The estimated standard deviations for the random effects are also very close to the simulation parameters. The estimated standard deviation is 2.2 (parameter = 2) at the school level, 3.1 (parameter = 3) at the classroom level, and 4.9 (parameter = 5) at the child level.
Some of you may disagree that the parameter estimates are close. My reply is that it doesn't matter unless you are simulating a single dataset for demonstration purposes. If you are, simply simulate more datasets until you get one that looks close enough for you. If you are simulating data to check coverage probabilities or to estimate statistical power, you will be averaging over thousands of simulated datasets, and the results of any one of those datasets won't matter.
How to simulate longitudinal data with random slopes
Longitudinal data are often conceptualized as multilevel data where the repeated observations are nested within individuals. The main difference between ordinary multilevel models and multilevel models for longitudinal data is the inclusion of a random slope. If you are not familiar with random slopes, you can learn more about them in a blog entry I wrote last year (Multilevel linear models in Stata, part 2: Longitudinal data).

Simulating longitudinal data with a random slope is much like simulating two-level data, with a couple of modifications. First, the bottom level will be observations within person. Second, there will be an interaction between time (age) and a person-level random effect. So we will generate data for the following model:

[weight_{ij} = mu + beta*age_{ij} + u_{0i} + age_{ij}*u_{1i} + e_{ij}]
where

(u_{0i}) ~ N(0,3)
(u_{1i}) ~ N(0,1)
(e_{ij}) ~ N(0,2)
Let's begin by simulating longitudinal data for 300 people.

. clear
. set obs 300
. generate person = _n
For longitudinal data, we must create two person-level random effects: the variable u_0i is analogous to the random effect we created earlier, and the variable u_1i is the random effect for the slope over time.

. generate u_0i = rnormal(0,3)
. generate u_1i = rnormal(0,1)

Let's expand the data so that there are five observations nested within each person. Rather than create an observation-level identification number, let's create a variable for age that ranges from 12 to 16 years,

. expand 5
. bysort person: generate age = _n + 11

and create an observation-level error term from an N(0,2) distribution:
. generate e_ij = rnormal(0,2)
. list person u_0i u_1i age e_ij if person==1
+-------------------------------------------------+
| person u_0i u_1i age e_ij |
|-------------------------------------------------|
1. | 1 .9338312 -.3097848 12 1.172153 |
2. | 1 .9338312 -.3097848 13 2.935366 |
3. | 1 .9338312 -.3097848 14 -2.306981 |
4. | 1 .9338312 -.3097848 15 -2.148335 |
5. | 1 .9338312 -.3097848 16 -.4276625 |
+-------------------------------------------------+
The person-level random effects u_0i and u_1i are the same at all ages, and the observation-level random effects e_ij are different at each age. Now we are ready to generate an outcome variable called weight, measured in kilograms, based on the following model:

[weight_{ij} = 3 + 3.6*age_{ij} + u_{0i} + age_{ij}*u_{1i} + e_{ij}]
. generate weight = 3 + 3.6*age + u_0i + age*u_1i + e_ij
The random effect u_1i is multiplied by age, which is why it is called a random slope. We could rewrite the model as

[weight_{ij} = 3 + age_{ij}*(3.6 + u_{1i}) + u_{0i} + e_{ij}]

Note that for each year of age, a person's weight will increase by 3.6 kilograms plus some random amount specified by u_1i. In other words, the slope for age will be slightly different for each person.
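Putting the pieces together, here is an illustrative NumPy version of this random-slope simulation (seed and names are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(42)
n_people, n_waves = 300, 5

# Person-level random intercept u_0i ~ N(0,3) and random slope u_1i ~ N(0,1),
# each repeated across a person's five observations (like Stata's expand 5)
u_0i = np.repeat(rng.normal(0, 3, n_people), n_waves)
u_1i = np.repeat(rng.normal(0, 1, n_people), n_waves)

# Ages 12-16 within each person; observation-level error e_ij ~ N(0,2)
age = np.tile(np.arange(12, 17), n_people)
e_ij = rng.normal(0, 2, n_people * n_waves)

# weight_ij = 3 + age_ij*(3.6 + u_1i) + u_0i + e_ij -- each person gets their own slope
weight = 3 + age * (3.6 + u_1i) + u_0i + e_ij

# A crude pooled check: the overall slope of weight on age should be near 3.6
slope, intercept = np.polyfit(age, weight, 1)
print(round(slope, 2))
```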
We can use the mixed command to fit a model to our data:

. mixed weight age || person: age , stddev
Mixed-effects ML regression Number of obs = 1500
Group variable: person Number of groups = 300
Obs per group: min = 5
avg = 5.0
max = 5
Wald chi2(1) = 3035.03
Log likelihood = -3966.3842 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
weight | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 3.708161 .0673096 55.09 0.000 3.576237 3.840085
_cons | 2.147311 .5272368 4.07 0.000 1.113946 3.180676
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Independent |
sd(age) | .9979648 .0444139 .9146037 1.088924
sd(_cons) | 3.38705 .8425298 2.080103 5.515161
-----------------------------+------------------------------------------------
sd(Residual) | 1.905885 .0422249 1.824897 1.990468
------------------------------------------------------------------------------
LR test vs. linear regression: chi2(2) = 4366.32 Prob > chi2 = 0.0000
The estimate for the intercept _cons = 2.1 is not very close to the original parameter value of 3, but the estimate of 3.7 for age is very close (parameter = 3.6). The standard deviations of the random effects are also very close to the parameters used to simulate the data. The estimated person-level sd(_cons) is 3.4 (parameter = 3), the person-level sd(age) is 1.0 (parameter = 1), and the observation-level residual standard deviation is 1.9 (parameter = 2).
How to simulate longitudinal data with structured errors
Longitudinal data often have an autoregressive pattern to their errors because of the sequential collection of the observations. Measurements taken closer together in time will be more related than measurements taken further apart in time. There are many patterns that can be used to describe the correlation among the errors, including autoregressive, moving average, banded, exponential, Toeplitz, and others (see help mixed##rspec).

Let's simulate a dataset where the errors have a Toeplitz structure, which I will define below.

We begin by creating a sample of 300 people with a person-level random effect having an N(0,3) distribution.

. clear
. set obs 300
. generate person = _n
. generate u_i = rnormal(0,3)
Next, we can use the drawnorm command to create error variables with a Toeplitz pattern.

A Toeplitz 1 correlation matrix has the following structure:
. matrix V = ( 1.0, 0.5, 0.0, 0.0, 0.0 \ ///
               0.5, 1.0, 0.5, 0.0, 0.0 \ ///
               0.0, 0.5, 1.0, 0.5, 0.0 \ ///
               0.0, 0.0, 0.5, 1.0, 0.5 \ ///
               0.0, 0.0, 0.0, 0.5, 1.0 )
. matrix list V
symmetric V[5,5]
c1 c2 c3 c4 c5
r1 1
r2 .5 1
r3 0 .5 1
r4 0 0 .5 1
r5 0 0 0 .5 1
The correlation matrix has 1s on the main diagonal, and each pair of contiguous observations will have a correlation of 0.5. Observations more than 1 unit of time away from each other are assumed to be uncorrelated.

We must also define a matrix of means to use the drawnorm command.
. matrix M = (0 \ 0 \ 0 \ 0 \ 0)
. matrix list M
M[5,1]
c1
r1 0
r2 0
r3 0
r4 0
r5 0
Now, we are ready to use the drawnorm command to create five error variables that have a Toeplitz 1 structure.
. drawnorm e1 e2 e3 e4 e5, means(M) cov(V)
. list in 1/2
+---------------------------------------------------------------------------+
| person u_i e1 e2 e3 e4 e5 |
|---------------------------------------------------------------------------|
1. | 1 5.303562 -1.288265 -1.201399 .353249 .0495944 -1.472762 |
2. | 2 -.0133588 .6949759 2.82179 .7195075 -1.032395 .1995016 |
+---------------------------------------------------------------------------+
Let's estimate the correlation matrix for our simulated data to verify that our simulation worked as we expected.
. correlate e1-e5
(obs=300)
| e1 e2 e3 e4 e5
-------------+---------------------------------------------
e1 | 1.0000
e2 | 0.5542 1.0000
e3 | -0.0149 0.4791 1.0000
e4 | -0.0508 -0.0364 0.5107 1.0000
e5 | 0.0022 -0.0615 0.0248 0.4857 1.0000
The correlations are 1 along the main diagonal, near 0.5 for the contiguous observations, and near 0 otherwise.
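For readers working outside Stata, the same Toeplitz-1 draw can be reproduced with a multivariate normal; here is an illustrative NumPy sketch (my own seed and names):

```python
import numpy as np

# Toeplitz(1) correlation: 1 on the diagonal, 0.5 on the first off-diagonals
V = np.eye(5) + 0.5 * (np.eye(5, k=1) + np.eye(5, k=-1))
M = np.zeros(5)

rng = np.random.default_rng(2024)
e = rng.multivariate_normal(M, V, size=300)  # like drawnorm e1-e5, means(M) cov(V)

# Empirical correlations: near 0.5 one step off the diagonal, near 0 elsewhere
C = np.corrcoef(e, rowvar=False)
print(np.round(C[0, 1], 2), np.round(C[0, 2], 2))
```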
Our data are currently in wide format, and we need them in long format to use the mixed command. We can use the reshape command to convert our data from wide to long format. If you are not familiar with the reshape command, you can learn more about it by typing help reshape.

. reshape long e, i(person) j(time)
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 300 -> 1500
Number of variables 7 -> 4
j variable (5 values) -> time
xij variables:
e1 e2 ... e5 -> e
-----------------------------------------------------------------------------
Now, we are ready to create our age variable and the outcome variable weight.

. bysort person: generate age = _n + 11
. generate weight = 3 + 3.6*age + u_i + e
. list weight person u_i time age e if person==1
+-------------------------------------------------------+
| weight person u_i time age e |
|-------------------------------------------------------|
1. | 50.2153 1 5.303562 1 12 -1.288265 |
2. | 53.90216 1 5.303562 2 13 -1.201399 |
3. | 59.05681 1 5.303562 3 14 .353249 |
4. | 62.35316 1 5.303562 4 15 .0495944 |
5. | 64.4308 1 5.303562 5 16 -1.472762 |
+-------------------------------------------------------+
We can use the mixed command to fit a model to our simulated data.

. mixed weight age || person:, residual(toeplitz 1, t(time)) stddev

Mixed-effects ML regression Number of obs = 1500
Group variable: person Number of groups = 300
Obs per group: min = 5
avg = 5.0
max = 5
Wald chi2(1) = 33797.58
Log likelihood = -2323.9389 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
weight | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 3.576738 .0194556 183.84 0.000 3.538606 3.61487
_cons | 3.119974 .3244898 9.62 0.000 2.483985 3.755962
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Identity |
sd(_cons) | 3.004718 .1268162 2.766166 3.263843
-----------------------------+------------------------------------------------
Residual: Toeplitz(1) |
rho1 | .4977523 .0078807 .4821492 .5130398
sd(e) | .9531284 .0230028 .9090933 .9992964
------------------------------------------------------------------------------
LR test vs. linear regression: chi2(2) = 3063.87 Prob > chi2 = 0.0000
Again, our parameter estimates match the parameters that were used to simulate the data very closely.

The parameter estimate is 3.6 for age (parameter = 3.6) and 3.1 for _cons (parameter = 3). The estimated standard deviation of the person-level random effect is 3.0 (parameter = 3). The estimated standard deviation for the errors is 0.95 (parameter = 1), and the estimated correlation for the Toeplitz structure is 0.5 (parameter = 0.5).
Conclusion
I hope I have convinced you that simulating multilevel/longitudinal data is easy and useful. The next time you find yourself teaching a class or giving a talk that requires multilevel examples, try simulating the data. And if you need to calculate statistical power for a multilevel or longitudinal model, consider simulations.
