Model Comparison, Fit, and Estimation Methods in MLM

PSY 8XXX: Multilevel Modeling for Organizational Research — Week 7

Author

Instructor

Published

March 29, 2026

1 Introduction

How do you know if your multilevel model is good? And if you fit multiple models—null, random intercept, random slope, with predictors—how do you choose among them?

This is where model comparison comes in. Unlike single-level regression, multilevel models have multiple reasonable specifications depending on whether you want to:

  • Estimate effects while accounting for clustering (compare to naive OLS)
  • Decide if a slope should be random or fixed
  • Test whether adding a predictor improves fit
  • Choose between competing theoretical models

In this tutorial, we’ll master the practical tools: likelihood ratio tests, information criteria (AIC/BIC), pseudo-R² measures, and a real workflow for building and comparing models step by step.

2 ML vs. REML: When to Use Each

Multilevel models can be estimated two ways: Maximum Likelihood (ML) and Restricted Maximum Likelihood (REML).

2.1 The Conceptual Difference

  • ML: Estimates fixed effects and variance components simultaneously. Its variance estimates are biased downward in small samples (it ignores the degrees of freedom spent estimating the fixed effects), but ML likelihoods can be compared across models that differ in fixed effects, enabling likelihood ratio tests.
  • REML: Accounts for the degrees of freedom lost in estimating fixed effects before estimating variances. Less biased for variance components, but REML likelihoods cannot be used to compare models that differ in fixed effects.

2.2 Practical Rule

  • Use REML for final inference on variance components and random effects
  • Use ML for model comparison (comparing fixed effect structures)

Let’s demonstrate with data:

Code
set.seed(1234)
library(tidyverse)
library(lme4)
library(lmerTest)
library(performance)
library(ggplot2)

# Simulate organizational data
n_teams <- 50
n_per_team <- 10
n_total <- n_teams * n_per_team

teams <- tibble(
  team_id = 1:n_teams,
  team_size = sample(8:15, n_teams, replace = TRUE),
  team_climate = rnorm(n_teams, mean = 5, sd = 1.2),
  leader_support = rnorm(n_teams, mean = 5.5, sd = 1)
)

employees <- expand_grid(
  team_id = 1:n_teams,
  emp_within_team = 1:10
) %>%
  left_join(teams, by = "team_id") %>%
  mutate(
    employee_id = row_number(),
    autonomy = rnorm(n_total, mean = 5, sd = 1.5),
    job_complexity = rnorm(n_total, mean = 4.5, sd = 1.2),
    tenure_years = pmax(rnorm(n_total, mean = 4, sd = 3), 0.5),
    autonomy_c = autonomy - mean(autonomy),
    team_climate_c = team_climate - mean(team_climate),
    job_satisfaction = 3.5 +
      0.6 * autonomy_c +
      0.4 * team_climate_c +
      0.35 * autonomy_c * team_climate_c +
      0.15 * job_complexity +
      0.05 * tenure_years +
      rnorm(n_total, 0, 1.2),
    job_satisfaction = pmin(pmax(job_satisfaction, 1), 7)
  )

Now let’s fit the same model with ML and REML:

Code
# Fit with ML
model_ml <- lmer(
  job_satisfaction ~ autonomy_c + job_complexity + tenure_years + (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)

# Fit with REML
model_reml <- lmer(
  job_satisfaction ~ autonomy_c + job_complexity + tenure_years + (1 + autonomy_c | team_id),
  data = employees,
  REML = TRUE
)

# Compare random effects estimates
comparison <- tibble(
  Component = c("σ²_u0 (intercept)", "σ²_u1 (autonomy slope)", "σ² (residual)"),
  ML = c(
    attr(VarCorr(model_ml)$team_id, "stddev")[1]^2,
    attr(VarCorr(model_ml)$team_id, "stddev")[2]^2,
    sigma(model_ml)^2
  ),
  REML = c(
    attr(VarCorr(model_reml)$team_id, "stddev")[1]^2,
    attr(VarCorr(model_reml)$team_id, "stddev")[2]^2,
    sigma(model_reml)^2
  )
)

print(comparison)
# A tibble: 3 × 3
  Component                  ML   REML
  <chr>                   <dbl>  <dbl>
1 σ²_u0 (intercept)      0.177  0.184 
2 σ²_u1 (autonomy slope) 0.0699 0.0727
3 σ² (residual)          1.45   1.46  
Code
# Fixed effects are nearly identical across ML and REML for the same model
cat("\nFixed effect of autonomy:\n")

Fixed effect of autonomy:
Code
cat("ML: ", fixef(model_ml)["autonomy_c"], "\n")
ML:  0.5640012 
Code
cat("REML: ", fixef(model_reml)["autonomy_c"], "\n")
REML:  0.5645692 

Observation: The fixed-effect estimates are nearly identical, while the variance components differ slightly. With 500 observations the difference is small; in smaller samples, REML's variance estimates would be noticeably less biased.

Decision rule: For this tutorial, we’ll use ML when comparing models, then report final estimates from REML.

3 Deviance and Likelihood Ratio Tests

The most direct way to compare nested models is the likelihood ratio test. If Model A is nested in Model B (B has all of A’s terms plus more), you can test whether the additional terms significantly improve fit.

3.1 The Test Statistic

\[LR = -2(\log L_A - \log L_B) = -2(LL_A - LL_B)\]

Under the null (Model A is correct), LR follows a χ² distribution with df = difference in number of parameters.
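The arithmetic behind this test is simple enough to do by hand with `pchisq()`. Here is a minimal sketch; `lr_test()` is a toy helper (not a lme4 function), and the log-likelihood values are illustrative inputs taken from the M2-vs-M3 comparison fit later in this tutorial:

```r
# Hand-rolled LR test: a toy helper, not part of lme4.
# ll_small / ll_large are log-likelihoods of the nested and fuller model
# (both fit with ML); df_diff is the difference in parameter counts.
lr_test <- function(ll_small, ll_large, df_diff) {
  lr <- -2 * (ll_small - ll_large)
  p  <- pchisq(lr, df = df_diff, lower.tail = FALSE)
  c(Chisq = lr, Df = df_diff, p = p)
}

# Illustrative values: log-likelihoods -842.98 and -833.96, 2 added parameters
lr_test(-842.98, -833.96, 2)  # Chisq = 18.04, p ~ 1.2e-04
```

This reproduces (up to rounding) what `anova()` reports when given two nested ML fits.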

3.2 Workflow

Let’s build models incrementally and compare:

Code
# Model 1: Null model (random intercepts only)
m1_null <- lmer(
  job_satisfaction ~ 1 + (1 | team_id),
  data = employees,
  REML = FALSE
)

# Model 2: Add random slope for autonomy
m2_random_slope <- lmer(
  job_satisfaction ~ autonomy_c + (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)

# Model 3: Add fixed effects (L1 controls)
m3_controls <- lmer(
  job_satisfaction ~ autonomy_c + job_complexity + tenure_years +
    (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)

# Model 4: Add L2 predictor
m4_l2_pred <- lmer(
  job_satisfaction ~ autonomy_c + job_complexity + tenure_years + team_climate_c +
    (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)

# Model 5: Add cross-level interaction
m5_interaction <- lmer(
  job_satisfaction ~ autonomy_c * team_climate_c + job_complexity + tenure_years +
    (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)

# Compare models pairwise
cat("=== Model Comparisons ===\n\n")
=== Model Comparisons ===
Code
cat("M1 vs M2: Does random slope for autonomy improve fit?\n")
M1 vs M2: Does random slope for autonomy improve fit?
Code
print(anova(m1_null, m2_random_slope))
Data: employees
Models:
m1_null: job_satisfaction ~ 1 + (1 | team_id)
m2_random_slope: job_satisfaction ~ autonomy_c + (1 + autonomy_c | team_id)
                npar  AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
m1_null            3 1891 1903.6 -942.49      1885                         
m2_random_slope    6 1698 1723.2 -842.98      1686 199.03  3  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
cat("\nM2 vs M3: Do L1 controls improve fit?\n")

M2 vs M3: Do L1 controls improve fit?
Code
print(anova(m2_random_slope, m3_controls))
Data: employees
Models:
m2_random_slope: job_satisfaction ~ autonomy_c + (1 + autonomy_c | team_id)
m3_controls: job_satisfaction ~ autonomy_c + job_complexity + tenure_years + (1 + autonomy_c | team_id)
                npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
m2_random_slope    6 1698.0 1723.2 -842.98    1686.0                         
m3_controls        8 1683.9 1717.6 -833.96    1667.9 18.032  2  0.0001214 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
cat("\nM3 vs M4: Does L2 predictor (team climate) improve fit?\n")

M3 vs M4: Does L2 predictor (team climate) improve fit?
Code
print(anova(m3_controls, m4_l2_pred))
Data: employees
Models:
m3_controls: job_satisfaction ~ autonomy_c + job_complexity + tenure_years + (1 + autonomy_c | team_id)
m4_l2_pred: job_satisfaction ~ autonomy_c + job_complexity + tenure_years + team_climate_c + (1 + autonomy_c | team_id)
            npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
m3_controls    8 1683.9 1717.6 -833.96    1667.9                         
m4_l2_pred     9 1674.5 1712.4 -828.24    1656.5 11.453  1  0.0007139 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
cat("\nM4 vs M5: Does cross-level interaction improve fit?\n")

M4 vs M5: Does cross-level interaction improve fit?
Code
print(anova(m4_l2_pred, m5_interaction))
Data: employees
Models:
m4_l2_pred: job_satisfaction ~ autonomy_c + job_complexity + tenure_years + team_climate_c + (1 + autonomy_c | team_id)
m5_interaction: job_satisfaction ~ autonomy_c * team_climate_c + job_complexity + tenure_years + (1 + autonomy_c | team_id)
               npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
m4_l2_pred        9 1674.5 1712.4 -828.24    1656.5                         
m5_interaction   10 1642.6 1684.8 -811.31    1622.6 33.853  1  5.945e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of the LR test output:

  • Df = change in number of parameters
  • Chisq = the LR test statistic
  • Pr(>Chisq) = p-value

A significant p-value (typically p < .05) means the more complex model fits significantly better. One caveat: when the comparison adds variance components (as in M1 vs. M2), the null value lies on the boundary of the parameter space, so the naive χ² test is conservative—the true p-value is somewhat smaller than reported. The interaction in M5 is highly significant, as expected: we generated the data with a cross-level interaction.

4 Information Criteria: AIC and BIC

Likelihood ratio tests work for nested models, but what if you want to compare non-nested models (e.g., different predictors)? Or what if you want a measure that penalizes complexity?

Information criteria do both:

\[\text{AIC} = -2 \log L + 2k\] \[\text{BIC} = -2 \log L + k \log(n)\]

Where k = number of parameters and n = number of observations.

Interpretation: Lower AIC/BIC is better. The penalty term discourages overfitting.

  • AIC: Gentler penalty; better for prediction
  • BIC: Harsher penalty; better for model selection when sample size is large
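The formulas can be computed by hand. The helpers below are toy versions of R's built-in `AIC()` and `BIC()`, shown with M1's log-likelihood from the comparison table above as an illustrative input:

```r
# AIC/BIC from first principles; toy helpers mirroring stats::AIC / stats::BIC
aic <- function(loglik, k)    -2 * loglik + 2 * k
bic <- function(loglik, k, n) -2 * loglik + k * log(n)

# Illustrative inputs: M1's log-likelihood (-942.49), k = 3 parameters
# (intercept, intercept variance, residual variance), n = 500 observations
aic(-942.49, 3)       # ~1891.0, matching the anova() table
bic(-942.49, 3, 500)  # ~1903.6
```

Note that lme4 uses the total number of observations for n in BIC; with clustered data, some methodologists argue the number of clusters is the more appropriate effective sample size, which is worth remembering when BIC decides a close call.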

Let’s compare all five models:

Code
models <- list(
  "M1: Null" = m1_null,
  "M2: Random Slope" = m2_random_slope,
  "M3: + L1 Controls" = m3_controls,
  "M4: + L2 Predictor" = m4_l2_pred,
  "M5: + Cross-Level Interaction" = m5_interaction
)

ic_table <- tibble(
  Model = names(models),
  Parameters = sapply(models, function(x) attr(logLik(x), "df")),
  LogLik = sapply(models, logLik),
  AIC = sapply(models, AIC),
  BIC = sapply(models, BIC),
  deltaAIC = AIC - min(AIC),
  deltaBIC = BIC - min(BIC)
) %>%
  arrange(AIC)

print(ic_table)
# A tibble: 5 × 7
  Model                         Parameters LogLik   AIC   BIC deltaAIC deltaBIC
  <chr>                              <int>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 M5: + Cross-Level Interaction         10  -811. 1643. 1685.      0        0  
2 M4: + L2 Predictor                     9  -828. 1674. 1712.     31.9     27.6
3 M3: + L1 Controls                      8  -834. 1684. 1718.     41.3     32.9
4 M2: Random Slope                       6  -843. 1698. 1723.     55.3     38.5
5 M1: Null                               3  -942. 1891. 1904.    248.     219. 
Code
# Practical guidance
cat("\n=== Interpretation ===\n")

=== Interpretation ===
Code
cat("Models with ΔAIC < 2 of the best have substantial support.\n")
Models with ΔAIC < 2 of the best have substantial support.
Code
cat("Models with ΔAIC > 10 have essentially no support.\n\n")
Models with ΔAIC > 10 have essentially no support.
Code
best_aic <- ic_table %>% filter(AIC == min(AIC)) %>% pull(Model)
cat("Best model by AIC: ", best_aic, "\n")
Best model by AIC:  M5: + Cross-Level Interaction 

Key insight: AIC and BIC agree here—M5 (with interaction) is best. This makes sense; we generated the data with a cross-level interaction.

5 Building a Model Comparison Table for a Manuscript

Here’s how to format model comparisons for publication:

Code
# Extract fixed effects from all models
extract_effects <- function(model) {
  fe <- fixef(model)
  se_vals <- sqrt(diag(vcov(model)))
  tibble(
    Coefficient = names(fe),
    Estimate = as.numeric(fe),
    SE = se_vals,
    t_or_z = Estimate / SE,
    Sig = case_when(
      abs(t_or_z) > 3.291 ~ "***",  # p < .001 (two-tailed, normal approximation)
      abs(t_or_z) > 2.576 ~ "**",   # p < .01
      abs(t_or_z) > 1.960 ~ "*",    # p < .05
      TRUE ~ ""
    )
  )
}

# Because models have different fixed effects, we'll make a wide table manually
comparison_table <- tibble(
  Predictor = c(
    "Autonomy (centered)",
    "Job Complexity",
    "Tenure",
    "Team Climate (centered)",
    "Autonomy × Team Climate"
  ),
  M3 = c(
    sprintf("%.3f**", fixef(m3_controls)["autonomy_c"]),
    sprintf("%.3f**", fixef(m3_controls)["job_complexity"]),
    sprintf("%.3f", fixef(m3_controls)["tenure_years"]),
    "—",
    "—"
  ),
  M4 = c(
    sprintf("%.3f**", fixef(m4_l2_pred)["autonomy_c"]),
    sprintf("%.3f**", fixef(m4_l2_pred)["job_complexity"]),
    sprintf("%.3f", fixef(m4_l2_pred)["tenure_years"]),
    sprintf("%.3f**", fixef(m4_l2_pred)["team_climate_c"]),
    "—"
  ),
  M5 = c(
    sprintf("%.3f**", fixef(m5_interaction)["autonomy_c"]),
    sprintf("%.3f**", fixef(m5_interaction)["job_complexity"]),
    sprintf("%.3f", fixef(m5_interaction)["tenure_years"]),
    sprintf("%.3f**", fixef(m5_interaction)["team_climate_c"]),
    sprintf("%.3f**", fixef(m5_interaction)["autonomy_c:team_climate_c"])
  )
)

print(comparison_table)
# A tibble: 5 × 4
  Predictor               M3      M4      M5     
  <chr>                   <chr>   <chr>   <chr>  
1 Autonomy (centered)     0.564** 0.568** 0.540**
2 Job Complexity          0.151** 0.152** 0.148**
3 Tenure                  0.063   0.066   0.059  
4 Team Climate (centered) —       0.259** 0.267**
5 Autonomy × Team Climate —       —       0.189**
Code
# Model fit row
fit_row <- tibble(
  Predictor = c("AIC", "BIC", "LogLik", "Comparison to M3"),
  M3 = c(
    sprintf("%.1f", AIC(m3_controls)),
    sprintf("%.1f", BIC(m3_controls)),
    sprintf("%.1f", logLik(m3_controls)),
    "—"
  ),
  M4 = c(
    sprintf("%.1f", AIC(m4_l2_pred)),
    sprintf("%.1f", BIC(m4_l2_pred)),
    sprintf("%.1f", logLik(m4_l2_pred)),
    sprintf("Δχ² = %.2f**", -2 * (logLik(m3_controls) - logLik(m4_l2_pred)))
  ),
  M5 = c(
    sprintf("%.1f", AIC(m5_interaction)),
    sprintf("%.1f", BIC(m5_interaction)),
    sprintf("%.1f", logLik(m5_interaction)),
    sprintf("Δχ² = %.2f**", -2 * (logLik(m4_l2_pred) - logLik(m5_interaction)))
  )
)

cat("\n=== Model Fit Indices ===\n")

=== Model Fit Indices ===
Code
print(fit_row)
# A tibble: 4 × 4
  Predictor        M3     M4            M5           
  <chr>            <chr>  <chr>         <chr>        
1 AIC              1683.9 1674.5        1642.6       
2 BIC              1717.6 1712.4        1684.8       
3 LogLik           -834.0 -828.2        -811.3       
4 Comparison to M3 —      Δχ² = 11.45** Δχ² = 33.85**

This format clearly shows: (1) how effects change as you add predictors, (2) which effects remain significant, and (3) which model fits best.

6 Pseudo-R² Measures

Unlike single-level regression, there is no single agreed-upon R² for multilevel models. The Nakagawa & Schielzeth (2013) approach decomposes variance explained into a marginal component (fixed effects only) and a conditional component (fixed plus random effects).
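In symbols, writing \(\sigma^2_f\) for the variance of the fixed-effect predictions, \(\sigma^2_u\) for the random-effect variance, and \(\sigma^2_e\) for the residual variance:

\[R^2_{\text{marginal}} = \frac{\sigma^2_f}{\sigma^2_f + \sigma^2_u + \sigma^2_e} \qquad R^2_{\text{conditional}} = \frac{\sigma^2_f + \sigma^2_u}{\sigma^2_f + \sigma^2_u + \sigma^2_e}\]

The `r2()` function from the performance package computes both quantities.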

Code
r2_stats <- tibble(
  Model = c("M3: L1 Controls", "M4: + L2 Pred", "M5: + Interaction"),
  R2_Marginal = c(
    r2(m3_controls)$R2_marginal,
    r2(m4_l2_pred)$R2_marginal,
    r2(m5_interaction)$R2_marginal
  ),
  R2_Conditional = c(
    r2(m3_controls)$R2_conditional,
    r2(m4_l2_pred)$R2_conditional,
    r2(m5_interaction)$R2_conditional
  ),
  Variance_Explained_by_REs = c(
    r2(m3_controls)$R2_conditional - r2(m3_controls)$R2_marginal,
    r2(m4_l2_pred)$R2_conditional - r2(m4_l2_pred)$R2_marginal,
    r2(m5_interaction)$R2_conditional - r2(m5_interaction)$R2_marginal
  )
)

print(r2_stats)
# A tibble: 3 × 4
  Model             R2_Marginal R2_Conditional Variance_Explained_by_REs
  <chr>                   <dbl>          <dbl>                     <dbl>
1 M3: L1 Controls         0.317          0.447                    0.130 
2 M4: + L2 Pred           0.362          0.442                    0.0802
3 M5: + Interaction       0.419          0.446                    0.0263
Code
cat("\n=== Interpretation ===\n")

=== Interpretation ===
Code
cat("Marginal R²: How much variance is explained by fixed effects alone.\n")
Marginal R²: How much variance is explained by fixed effects alone.
Code
cat("Conditional R²: How much variance is explained by the full model (fixed + random).\n")
Conditional R²: How much variance is explained by the full model (fixed + random).
Code
cat("The gap shows how much random effects (group-level variation) contribute.\n")
The gap shows how much random effects (group-level variation) contribute.

Practical insight: Conditional R² exceeds marginal R² in every model—group-level variation explains real variance, which justifies MLM over ignoring the clustering. Notice also that conditional R² stays near .44 across models while marginal R² grows: as team climate and the interaction enter the fixed part, they absorb variance the random effects previously carried.

7 Practical Model-Building Workflow

Here’s a systematic approach to building your final model:

Code
cat("=== STEP-BY-STEP MLM MODEL BUILDING WORKFLOW ===\n\n")
=== STEP-BY-STEP MLM MODEL BUILDING WORKFLOW ===
Code
cat("STEP 1: Start with the null model (unconditional means model)\n")
STEP 1: Start with the null model (unconditional means model)
Code
m_null <- lmer(job_satisfaction ~ 1 + (1 | team_id), data = employees, REML = FALSE)
cat("Formula: Y ~ 1 + (1 | team_id)\n")
Formula: Y ~ 1 + (1 | team_id)
Code
cat("Purpose: Establish baseline ICC, variance components\n\n")
Purpose: Establish baseline ICC, variance components
Code
# Compute ICC
vc_null <- VarCorr(m_null)
tau_sq <- as.numeric(vc_null$team_id[1, 1])
sigma_sq <- sigma(m_null)^2
icc <- tau_sq / (tau_sq + sigma_sq)
cat("ICC = ", round(icc, 3), " (", round(icc * 100, 1), "% of variance is between-team)\n\n")
ICC =  0.031  ( 3.1 % of variance is between-team)
Code
cat("STEP 2: Add substantive level-1 predictors\n")
STEP 2: Add substantive level-1 predictors
Code
cat("Decide: Should slopes be fixed or random?\n")
Decide: Should slopes be fixed or random?
Code
cat("- Fit with fixed slopes first for simplicity\n")
- Fit with fixed slopes first for simplicity
Code
m_l1_fixed <- lmer(
  job_satisfaction ~ autonomy_c + job_complexity + tenure_years + (1 | team_id),
  data = employees,
  REML = FALSE
)
cat("Formula: Y ~ X1 + X2 + X3 + (1 | team_id)\n")
Formula: Y ~ X1 + X2 + X3 + (1 | team_id)
Code
cat("Comparison to null: χ² = ", round(-2 * (logLik(m_null) - logLik(m_l1_fixed)), 2), "\n\n")
Comparison to null: χ² =  191.28 
Code
cat("STEP 3: Test if predictor slopes should be random\n")
STEP 3: Test if predictor slopes should be random
Code
m_l1_random <- lmer(
  job_satisfaction ~ autonomy_c + job_complexity + tenure_years + (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)
cat("Formula: Y ~ X1 + X2 + X3 + (1 + X1 | team_id)\n")
Formula: Y ~ X1 + X2 + X3 + (1 + X1 | team_id)
Code
cat("Comparison: χ² = ", round(-2 * (logLik(m_l1_fixed) - logLik(m_l1_random)), 2), "\n")
Comparison: χ² =  25.79 
Code
cat("(If significant, keep random slope; otherwise, keep fixed)\n\n")
(If significant, keep random slope; otherwise, keep fixed)
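One caveat at this step: the null hypothesis (zero slope variance) lies on the boundary of the parameter space, so the naive χ² reference distribution is conservative. A common correction refers the statistic to a 50:50 mixture of χ² distributions. The sketch below uses the χ² = 25.79 reported above as an illustrative input; the exact mixture depends on the model, so treat this as a heuristic rather than lme4 output:

```r
# Boundary-corrected p-value for the random-slope test. Adding one slope
# variance plus its covariance with the intercept is commonly tested against
# a 50:50 mixture of chi-square(1) and chi-square(2) instead of chi-square(2).
lr <- 25.79  # LR statistic from the fixed- vs. random-slope comparison above

p_naive   <- pchisq(lr, df = 2, lower.tail = FALSE)
p_mixture <- 0.5 * pchisq(lr, df = 1, lower.tail = FALSE) +
             0.5 * pchisq(lr, df = 2, lower.tail = FALSE)

c(naive = p_naive, corrected = p_mixture)  # corrected p is always smaller
```

Here both p-values are tiny, so the conclusion is unchanged; the correction matters most for borderline tests.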
Code
cat("STEP 4: Add level-2 predictors\n")
STEP 4: Add level-2 predictors
Code
m_l2 <- lmer(
  job_satisfaction ~ autonomy_c + job_complexity + tenure_years + team_climate_c +
    (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)
cat("Comparison to L1 model: χ² = ", round(-2 * (logLik(m_l1_random) - logLik(m_l2)), 2), "\n\n")
Comparison to L1 model: χ² =  11.45 
Code
cat("STEP 5: Test cross-level interactions (if theory supports it)\n")
STEP 5: Test cross-level interactions (if theory supports it)
Code
m_final <- lmer(
  job_satisfaction ~ autonomy_c * team_climate_c + job_complexity + tenure_years +
    (1 + autonomy_c | team_id),
  data = employees,
  REML = FALSE
)
cat("Comparison to L2 model: χ² = ", round(-2 * (logLik(m_l2) - logLik(m_final)), 2), "\n\n")
Comparison to L2 model: χ² =  33.85 
Code
cat("STEP 6: Final model diagnostics and re-estimate with REML\n")
STEP 6: Final model diagnostics and re-estimate with REML
Code
cat("(See Week 8 tutorial)\n\n")
(See Week 8 tutorial)
Code
cat("STEP 7: Report results\n")
STEP 7: Report results
Code
cat("Use AIC/BIC comparison + LR tests + effect sizes\n")
Use AIC/BIC comparison + LR tests + effect sizes
Code
cat("Report marginal and conditional R²\n")
Report marginal and conditional R²
Code
cat("Present simple slopes if interactions present\n")
Present simple slopes if interactions present

8 Try It Yourself

  1. Alternative specification: Fit a model where job_complexity also has a random slope. Does this improve fit over the current model?

  2. Model selection criteria: Using only AIC and BIC (no LR tests), which model would you choose from M1–M5? Do they agree?

  3. Effect stability: Refit models M3, M4, and M5 using REML. Do the fixed effects change noticeably? (They shouldn’t, but this is good practice to check.)

  4. Competing hypotheses: Suppose a colleague proposes that leader_support (not team_climate) is the true moderator. Build that model and compare it (via AIC/BIC) to your interaction model with team_climate. Which fits better?

  5. Reporting challenge: Create a manuscript-ready table showing all five models (M1–M5) with fit indices, degrees of freedom, and key coefficients.


References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133–142.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.