This is a study note on using the \(lmerTest\) package for Linear Mixed-Effects Models/Regression (LMER).
The modeling framework for random effects is the Linear Mixed-Effects Model, which goes by the following equivalent names in different disciplines:

- Mixed Effects Model
- Random Effects Model
- Linear Mixed Model
- Hierarchical Linear Model
- Multilevel Model
- Random Coefficients Regression Model
- Growth Curve Model

(source: Xie (2010) Regression Analysis (Chinese), CH16)
Let’s quickly introduce random effects and fixed effects, then explain why we need LMM to deal with nested data.
Nested data
: levels of one random effect occur only within a single level of a higher-level factor. Nesting often arises in the design of experiments, in particular in randomized experiments and randomized controlled trials (wiki). Without a random effect for each experimental unit, a one-to-one correspondence between observations and experimental units is assumed. Including random effects in a model is one way to account for the lack of independence among observations that is expected from the design of the experiment. Here are the definitions of fixed effect and random effect:
Fixed effect
: tests for the effect of the parameters of interest.

Random effect
: controls for non-independence arising from a nested or hierarchical structure, e.g. blocking, repeated sampling from the same individuals, or a subset of the many possible groups that could have been sampled.

In probability and statistics, Simpson's paradox
refers to a trend that appears in several different groups of data but disappears or reverses when the groups are combined. It is often encountered in social-science and medical-science statistics, and is particularly problematic when frequency data are unduly given causal interpretations. When the level structure is ignored for data whose samples are not independent within levels, the within-group regression, the between-group regression, and the total regression can tell contradictory stories, as the following simulation shows:
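A minimal base-R simulation sketch of this reversal (the groups and numbers are hypothetical, constructed only to demonstrate the effect):

```r
set.seed(1)
# Five groups: within every group, x lowers y (slope -1), but groups with
# larger mean x also have larger mean y, so the pooled slope reverses sign.
sim <- do.call(rbind, lapply(1:5, function(g) {
  x <- rnorm(30, mean = 2 * g)
  data.frame(group = factor(g), x = x, y = 5 * g - x + rnorm(30))
}))

# Total regression (groups ignored): the slope comes out positive
coef(lm(y ~ x, data = sim))["x"]

# Within-group regressions: every slope is close to -1
sapply(split(sim, sim$group), function(d) coef(lm(y ~ x, data = d))["x"])
```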
When levels are improperly modeled, two kinds of statistical fallacies may occur:
Ecological fallacy
: a formal fallacy in the interpretation of statistical data that occurs when inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong. It can occur when only the cluster-level (lv.2) variables are used as independent variables.

Atomistic fallacy
: the fallacy of drawing inferences about variability across units defined at a higher level based on data collected for units at a lower level. It can occur when only individual-level (lv.1) variables are used as independent variables.

Linear Mixed-Effects Models
assume that the data being analysed are drawn from a hierarchy of different populations whose differences relate to that hierarchy.
Their group-level estimates are a compromise between each group's own data and the overall mean, which is known as partial pooling or shrinkage (see the sketch after the table below). (source: https://www.youtube.com/watch?v=QCqF-2E86r0 )

Rules of thumb for the minimum sample size at each level:

Reference | Minimum number of groups / observations per group |
---|---|
Kreft (1996) | 30/30 |
Hox (1998) | 100/10 |
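The sketch below illustrates shrinkage on hypothetical unbalanced data: raw per-group means (no pooling) versus the model-based group intercepts (partial pooling), which are pulled toward the grand mean most strongly for the smallest groups:

```r
library(lme4)

set.seed(2)
# Toy unbalanced data (hypothetical): group sizes range from 2 to 40
sizes <- c(2, 3, 5, 5, 10, 10, 20, 20, 40, 40)
d <- data.frame(g = factor(rep(1:10, times = sizes)))
d$y <- rnorm(nrow(d), mean = as.numeric(d$g), sd = 3)

m <- lmer(y ~ 1 + (1 | g), data = d)

no_pool      <- tapply(d$y, d$g, mean)      # raw per-group means
partial_pool <- coef(m)$g[["(Intercept)"]]  # model-based group intercepts

# Partially pooled estimates lie between the raw means and the grand mean,
# and are shrunk toward it hardest for the smallest groups.
cbind(size = sizes, no_pool, partial_pool)
```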
The ICC, or Intraclass Correlation Coefficient, can be very useful in many statistical situations, but especially so in Linear Mixed Models. The definition of ICC is the following:
\[ ICC = \rho = \frac{Variance_{between}}{Variance_{between} + Variance_{within}} = \frac{Variance_{between}}{Total\ Variance} \]
Why is the ICC useful? A large ICC means a substantial share of the total variance lies between groups, so observations within a group are not independent and an LMM is warranted.
(Source: https://www.theanalysisfactor.com/the-intraclass-correlation-coefficient-in-mixed-models/)
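As a minimal sketch of the definition, the ICC can be computed by hand from the variance components of a fitted null model; here the salary data simulated later in this note is assumed to be available as df:

```r
library(lmerTest)

# Null (intercept-only) model on the grouped salary data
null_model <- lmerTest::lmer(salary ~ 1 + (1 | department), data = df)

vc <- as.data.frame(VarCorr(null_model))
var_between <- vc$vcov[vc$grp == "department"]  # between-group variance
var_within  <- vc$vcov[vc$grp == "Residual"]    # residual (within-group) variance
var_between / (var_between + var_within)        # the ICC
```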
Luke (2004)’s guideline for whether to apply LMM:

Intraclass correlation | ICC range | Apply LMM? |
---|---|---|
low | ICC < 0.059 | Not necessary |
moderate | 0.059 < ICC < 0.138 | Yes |
high | ICC > 0.138 | Yes |
ICC for unconditional and conditional models
: the unconditional (null) model gives the raw ICC; a conditional model (one with predictors) gives the adjusted ICC, i.e. after accounting for the fixed effects (see the sketch below).
(source: https://www.rdocumentation.org/packages/sjstats/versions/0.17.4/topics/icc)
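A short sketch of the distinction, reusing the hypothetical salary example (note that sjstats::icc() has since been superseded by performance::icc()):

```r
# Unconditional model: raw ICC; conditional model: adjusted ICC after
# accounting for the fixed effect of experience
null_model <- lmerTest::lmer(salary ~ 1 + (1 | department), data = df)
cond_model <- lmerTest::lmer(salary ~ experience + (1 | department), data = df)

sjstats::icc(null_model)
sjstats::icc(cond_model)
```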
In R, both the \(lme4\) and \(lmerTest\) packages can fit linear mixed-effects models/regressions. We will use the \(lmerTest\) package because it additionally reports the significance of fixed effects.
m = lmerTest::lmer(data = , formula = DV ~ Fixed_Factor_1 + (Random_intercept + Random_slope | Random_Factor_1), ...)
Variables:

DV
: dependent variable.

Fixed_Factor
: Level-1 independent variables entered as fixed effects; interactions between Level-1 variables can also be included.

Random_Factor
: Level-2 categorical variable indicating the groups or clusters, entered as the random factor.

Random_Slope
: a Level-1 independent variable whose effect varies across the levels of the Level-2 variable.

Random_intercept
: the intercept at Level 2.

Symbols:
~, +
: a two-sided linear formula object describing both the fixed-effects and random-effects parts of the model, with the response on the left of the ~ operator and the terms, separated by + operators, on the right.

|
: random-effects terms are distinguished by vertical bars (|) separating expressions for design matrices from grouping factors.

||
: two vertical bars (||) can be used to specify multiple uncorrelated random effects for the same grouping variable. (Because of the way it is implemented, the ||-syntax works only for design matrices containing numeric (continuous) predictors; to fit models with independent categorical effects, see dummy or the lmer_alt function from the afex package.)

(source: https://github.com/usplos/Eye-movement-related/blob/master/Linear%20mixed%20model%20(Hierarchical%20linear%20model)%20using%20R%20and%20JAMOVI.md; https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)
Coefficient | Full formula | Simplified form |
---|---|---|
Fixed intercept + Fixed slope | lm(Y ~ 1 + X1, …) | lm(Y ~ X1, …) |
Random intercept + Fixed slope | lmer(Y ~ 1 + X1 + (1 | group), …) | lmer(Y ~ X1 + (1 | group), …) |
Fixed intercept + Random slope | lmer(Y ~ 1 + X1 + (0 + X1 | group), …) | lmer(Y ~ X1 + (0 + X1 | group), …) |
Random intercept + Random slope | lmer(Y ~ 1 + X1 + (1 + X1 | group), …) | lmer(Y ~ X1 + (X1 | group), …) |
# Checking coefficients of a linear regression
broom::tidy(m1)
# Checking Level-1 (fixed-effect) coefficients of a linear mixed model
broom.mixed::tidy(m, effects = "fixed")
# Checking Level-2 (group-level) estimates of a linear mixed model
broom.mixed::tidy(m, effects = "ran_vals")
Not enough degrees of freedom: when the number of observations is smaller than the number of random-effect parameters, the model cannot converge. Solution 1: simplify the model (i.e. remove the problematic random slopes). Solution 2: increase the number of iterations:
m = lmerTest::lmer(data, formula, control = lmerControl(optCtrl = list(maxfun = 20000)))
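If raising the iteration cap does not help, switching the optimizer is another common remedy; a sketch (whether it helps depends on the data):

```r
m = lmerTest::lmer(data, formula,
                   control = lmerControl(optimizer = "bobyqa",
                                         optCtrl = list(maxfun = 20000)))
```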
Singular fit: the data cannot support so complex a random-effect structure. Solution: remove the most complex term in the random effects (i.e. the random slope), as sketched below.
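lme4 flags such fits, and they can be checked explicitly with isSingular():

```r
# TRUE indicates a singular fit: some variance estimate is (near) zero or a
# correlation is +/-1; simplify the random effects, e.g. (X | g) -> (1 | g)
lme4::isSingular(m, tol = 1e-4)
```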
When talking about LMM, a random intercept is almost always included, so the model categories below vary mainly in whether the slope is fixed or random.
Model category | Meaning | Formula | Equivalent form |
---|---|---|---|
Fixed slope model (null model, intercept-only model, unconditional model) | Random intercept with fixed mean | lmer(Y ~ (1 | g), …) | lmer(Y ~ 1 + (1 | g), …) |
Fixed slope model | Random intercept with a priori means | lmer(Y ~ 0 + offset(o) + (1 | g), …) | lmer(Y ~ -1 + offset(o) + (1 | g), …) |
Fixed slope model (crossed random effects) | Intercept varying among g1 and g2 | lmer(Y ~ (1 | g1) + (1 | g2), …) | - |
Fixed slope model (3-level nested model) | Intercept varying among g1 (lv.3) and among g2 (lv.2) within g1 | lmer(Y ~ (1 | g1/g2), …) | lmer(Y ~ (1 | g1) + (1 | g1:g2), …) |
Random slope model | Uncorrelated random intercept and slope | lmer(Y ~ X + (X || g), …) | lmer(Y ~ X + (1 | g) + (0 + X | g), …) |
Random intercepts and slopes model (recommended if possible) | Correlated random intercept and slope | lmer(Y ~ X + (X | g), …) | - |
Reference:
In the following, I am using the example demonstrated from this Link, specifically using experience to predict salary, with and without department as the level-2 grouping variable.
library(tidyverse)
library(lme4)
library(modelr)
library(broom)
library(broom.mixed)
library(ggplot2)
# Simulate 100 employees across 5 departments; each department has its own
# base salary and its own average raise per year of experience
create_data <- function() {
  tibble(ids = 1:100,
         department = rep(c("sociology", "biology", "english", "informatics", "statistics"), 20),
         bases = rep(c(40000, 50000, 60000, 70000, 80000), 20) * runif(100, .9, 1.1),
         experience = floor(runif(100, 0, 10)),
         raises = rep(c(2000, 500, 500, 1700, 500), 20) * runif(100, .9, 1.1)) %>%
    mutate(salary = bases + experience * raises)
}
df <- create_data()
df %>% head(10) %>% knitr::kable(format = "markdown")
ids | department | bases | experience | raises | salary |
---|---|---|---|---|---|
1 | sociology | 38059.49 | 6 | 2150.1902 | 50960.63 |
2 | biology | 47062.50 | 6 | 476.4946 | 49921.47 |
3 | english | 57341.60 | 5 | 466.0628 | 59671.91 |
4 | informatics | 71526.43 | 9 | 1759.8572 | 87365.15 |
5 | statistics | 75302.47 | 5 | 483.7902 | 77721.42 |
6 | sociology | 37071.41 | 2 | 2010.3158 | 41092.04 |
7 | biology | 47754.50 | 4 | 501.3143 | 49759.76 |
8 | english | 58306.45 | 7 | 511.0082 | 61883.51 |
9 | informatics | 72928.48 | 0 | 1710.6474 | 72928.48 |
10 | statistics | 85608.22 | 1 | 462.1660 | 86070.38 |
# Fixed intercept + Fixed slope: Model without respect to grouping
m1 <- lm(salary ~ experience, data = df)
df_pred_m1 <-df %>% modelr::add_predictions(m1)
# Random intercept + Fixed slope: Model with varying intercept
m2 <- lmer(salary ~ experience + (1 | department), data = df)
df_pred_m2 <-df %>% modelr::add_predictions(m2)
# Fixed intercept + Random slope: Model with varying slope
m3 <- lmer(salary ~ experience + (0 + experience | department), data = df)
df_pred_m3 <-df %>% modelr::add_predictions(m3)
# Random intercept + Random slope: Model with varying slope and intercept
m4 <- lmer(salary ~ experience + (1 + experience | department), data = df)
df_pred_m4 <-df %>% modelr::add_predictions(m4)
p1 <- df_pred_m1 %>%
ggplot(aes(x = experience, y = salary)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("Fixed intercept + Fixed slope") +
scale_colour_discrete("Department")
p2 <- df_pred_m2 %>%
ggplot(aes(x = experience, y = salary, color = department)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("Random intercept + Fixed slope") +
scale_colour_discrete("Department") +
theme(legend.position = "none")
p3 <- df_pred_m3 %>%
ggplot(aes(x = experience, y = salary, color = department)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("Fixed intercept + Random slope") +
scale_colour_discrete("Department") +
theme(legend.position = "none")
p4 <- df_pred_m4 %>%
ggplot(aes(x = experience, y = salary, color = department)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = "Experience", y = "Predicted Salary") +
ggtitle("Random intercept + Random slope") +
scale_colour_discrete("Department") +
theme(legend.position = "none")
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
Workflow of an LMM analysis:

1. Use sjstats::icc() to calculate the Intraclass Correlation Coefficient on the null model, which determines whether the intraclass variance is large enough to justify an LMM.

# nullmodel = lmerTest::lmer(DV ~ 1 + (1|g), ...)
# sjstats::icc(nullmodel)

2. Use summary() to view the variance of the random effects, and lmerTest::ranova() to produce an ANOVA-like table for the random effects that tests the significance of the Lv.2 variables. ranova() checks whether the model becomes significantly worse if a given random effect is dropped (formally, a likelihood ratio test); if it does not, that random effect is not significant.

# summary(fit1)
# lmerTest::ranova(fit1)

3. Similar to GLM/OLS, use anova() to check whether the main effects and interaction effects are significant via F-tests.

# anova(fit1)

A worked run of these steps on the salary example is sketched below.
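A sketch of the full workflow applied to the salary example simulated earlier in this note (model names are placeholders):

```r
# Step 1: ICC from the null model
fit_null <- lmerTest::lmer(salary ~ 1 + (1 | department), data = df)
sjstats::icc(fit_null)

# Steps 2-3: random-effect variances and their significance
fit1 <- lmerTest::lmer(salary ~ experience + (experience | department), data = df)
summary(fit1)
lmerTest::ranova(fit1)

# Step 4: F-tests for the fixed effects
anova(fit1)
```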
In a multilevel analysis (i.e. with at least two levels), two types of cross-level interactions can occur between the higher and the lower level:

Compositional effects
: inter-group (or inter-context) differences in an outcome (for example, disease rates) are attributable to differences in group composition (that is, in the characteristics of the individuals of which the groups are comprised).

Contextual effects
: intra-group differences are attributable to the effects of group-level variables or properties.

(source: https://jech.bmj.com/content/56/8/588)
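A sketch of a contextual (cross-level) model on the salary example: dept_size is a hypothetical department-level (lv.2) variable invented here for illustration, and with only five departments the test is illustrative rather than well powered:

```r
# Attach a made-up level-2 covariate to each department
dept_sizes <- c(sociology = 10, biology = 40, english = 25,
                informatics = 15, statistics = 20)
df2 <- df %>% dplyr::mutate(dept_size = dept_sizes[department])

# The experience:dept_size term tests the cross-level interaction
m_cross <- lmerTest::lmer(salary ~ experience * dept_size +
                            (experience | department), data = df2)
anova(m_cross)
```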
Effect size
is the proportional reduction in residual variance from the base model to the improved model: effect size × 100% is the percentage of improvement gained by adding variables to the new model (see the sketch after the table below).

\[ ES = \frac{Variance_{base} - Variance_{new}}{Variance_{base}} \]
Effect size | Improvement |
---|---|
0.02 ~ 0.15 | weak |
0.15 ~ 0.35 | moderate |
> 0.35 | strong |
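A minimal sketch of computing this effect size for the salary example, using the level-1 residual variances of a base and an improved model:

```r
# Effect size of adding `experience` to the null model; sigma() returns
# the residual standard deviation of a fitted merMod object
base_model <- lmerTest::lmer(salary ~ 1 + (1 | department), data = df)
new_model  <- lmerTest::lmer(salary ~ experience + (1 | department), data = df)

var_base <- sigma(base_model)^2
var_new  <- sigma(new_model)^2
(var_base - var_new) / var_base  # proportional reduction in residual variance
```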
In many research areas, LMM is applied to the following common randomized experimental designs. The dependent variable, random effects, and fixed effects should be identified for each type of experiment, respectively. During model selection, it is important to consider whether:
The main advantage of this design is randomization. The post-test comparison with randomized subjects controls for the main effects of history, maturation, and pre-testing; because no pre-test is used, there can be no interaction effect of pre-test and X. Another advantage of this design is that it can be extended to include more than one treatment group if necessary.
The RCBD is the standard design for agricultural experiments where similar experimental units are grouped into blocks or replicates.
Like a randomized complete block design (RCBD), a GRBD is randomized. Within each block, treatments are randomly assigned to experimental units: this randomization is also independent between blocks. In a (classic) RCBD, however, there is no replication of treatments within blocks.
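Hedged formula sketches for these designs (all variable names are hypothetical placeholders, so the model calls are shown as comments rather than runnable code):

```r
# Post-test only control group design: subjects are randomized once and
# measured once, so a plain fixed-effects model is often enough
# lm(post_score ~ treatment, data = d)

# RCBD: one observation per treatment in each block -> random block intercept
# lmer(yield ~ treatment + (1 | block), data = d)

# GRBD: treatments replicated within blocks -> the block:treatment
# interaction can be modeled as an extra random term
# lmer(yield ~ treatment + (1 | block) + (1 | block:treatment), data = d)
```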