# Plotting partial pooling in mixed-effects models

In this post, I demonstrate a few techniques for plotting information from a relatively simple mixed-effects model fit in R. These plots can help us develop intuitions about what these models are doing and what “partial pooling” means.

## The sleepstudy dataset

For these examples, I’m going to use the sleepstudy dataset from the lme4 package. The outcome measure is reaction time, the predictor measure is days of sleep deprivation, and these measurements are nested within participants—we have 10 observations per participant. I am also going to add two fake participants with incomplete data to illustrate partial pooling.

library(lme4)
library(dplyr)
library(tibble)

# Convert to tibble for better printing. Convert factors to strings
sleepstudy <- sleepstudy %>%
  as_tibble() %>%
  mutate(Subject = as.character(Subject))

df_sleep <- bind_rows(
  sleepstudy,
  data_frame(Reaction = c(286, 288), Days = 0:1, Subject = "374"),
  data_frame(Reaction = 245, Days = 0, Subject = "373"))

df_sleep
#> # A tibble: 183 x 3
#>    Reaction  Days Subject
#>       <dbl> <dbl>   <chr>
#>  1 249.5600     0     308
#>  2 258.7047     1     308
#>  3 250.8006     2     308
#>  4 321.4398     3     308
#>  5 356.8519     4     308
#>  6 414.6901     5     308
#>  7 382.2038     6     308
#>  8 290.1486     7     308
#>  9 430.5853     8     308
#> 10 466.3535     9     308
#> # ... with 173 more rows


We can visualize all the data in ggplot2 by using facet_wrap() to create subplots for each participant and stat_smooth() to create a regression line in each subplot.

library(ggplot2)

xlab <- "Days of sleep deprivation"
ylab <- "Average reaction time (ms)"

ggplot(df_sleep) +
  aes(x = Days, y = Reaction) +
  stat_smooth(method = "lm", se = FALSE) +
  # Put the points on top of lines
  geom_point() +
  facet_wrap("Subject") +
  labs(x = xlab, y = ylab) +
  # We also need to help the x-axis, so it doesn't
  # create gridlines/ticks on 2.5 days
  scale_x_continuous(breaks = 0:4 * 2)


By the way, ggplot2 doesn’t draw the regression lines outside of the range of the data unless we set fullrange = TRUE. That’s a helpful feature for 374!
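To see the difference, here is a quick sketch on toy data (the `toy` data frame and the `p_full` name are mine, standing in for `df_sleep`): with shared facet scales, `fullrange = TRUE` lets each panel's trend line span the whole x-axis instead of stopping at that subject's last observation.

```r
library(ggplot2)

# Toy stand-in for df_sleep: one subject with full data, one ("374")
# with only two early observations
toy <- data.frame(
  Days     = c(0:9, 0:1),
  Reaction = c(seq(250, 340, by = 10), 286, 288),
  Subject  = c(rep("308", 10), rep("374", 2))
)

# fullrange = TRUE extends each panel's trend line across the whole
# panel x-range, not just the span of that subject's data
p_full <- ggplot(toy) +
  aes(x = Days, y = Reaction) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  geom_point() +
  facet_wrap("Subject")
```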

Update: Douglas Bates did it first. Someone sent me a link to a slide deck by Douglas Bates, lead author of the lme4 package, where he has some plots just like the ones I demo in this post. He uses the sleepstudy dataset too—it’s his R package and his teaching dataset, after all—so the similarities are uncanny but accidental. Origin of this post: I was asked on twitter how to make a facet plot of a mixed effects model, wrote up a quick demo using the convenient sleepstudy dataset, and then fleshed that demo into a tutorial. By using his teaching dataset to illustrate some partial pooling concepts, I ended up recreating some of his work on accident. [Sept. 14, 2017]

## Complete pooling and no pooling models

Each one of these panels plotted above shows an independently estimated regression line. This approach to fitting a separate line for each participant is sometimes called the no pooling model because none of the information from different participants is combined or pooled together.

We fit a separate line for each cluster of data, unaware that any of the other participants exist. The lmList() function in lme4 automates this process.

df_no_pooling <- lmList(Reaction ~ Days | Subject, df_sleep) %>%
  coef() %>%
  # Subject IDs are stored as row-names. Make them an explicit column
  rownames_to_column("Subject") %>%
  rename(Intercept = `(Intercept)`, Slope_Days = Days) %>%
  add_column(Model = "No pooling") %>%
  # Remove the participant who only had one data-point
  filter(Subject != "373")

head(df_no_pooling)

#>   Subject Intercept Slope_Days      Model
#> 1     308  244.1927  21.764702 No pooling
#> 2     309  205.0549   2.261785 No pooling
#> 3     310  203.4842   6.114899 No pooling
#> 4     330  289.6851   3.008073 No pooling
#> 5     331  285.7390   5.266019 No pooling
#> 6     332  264.2516   9.566768 No pooling
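Under the hood, `lmList()` is equivalent to splitting the data by subject and fitting `lm()` to each piece. A base-R sketch on toy data (the `per_subject_coefs` helper is a name I made up for illustration):

```r
# Toy data: two subjects with different intercepts and slopes
toy <- data.frame(
  Subject  = rep(c("A", "B"), each = 4),
  Days     = rep(0:3, times = 2),
  Reaction = c(250, 260, 270, 280,   # subject A: intercept 250, slope 10
               300, 305, 310, 315)   # subject B: intercept 300, slope 5
)

# Fit a separate lm() per subject, each fit unaware the others exist
per_subject_coefs <- function(df) {
  fits <- lapply(split(df, df$Subject),
                 function(d) lm(Reaction ~ Days, data = d))
  do.call(rbind, lapply(fits, coef))
}

per_subject_coefs(toy)
#>   (Intercept) Days
#> A         250   10
#> B         300    5
```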


In contrast, we might consider a complete pooling model where all the information from the participants is combined together. We fit a single line for the combined data set, unaware that the data came from different participants.

# Fit a model on all the data pooled together
m_pooled <- lm(Reaction ~ Days, df_sleep)

# Repeat the intercept and slope terms for each participant
df_pooled <- data_frame(
  Model = "Complete pooling",
  Subject = unique(df_sleep$Subject),
  Intercept = coef(m_pooled)[1],
  Slope_Days = coef(m_pooled)[2])


To go all out, let’s also label the contours with the confidence levels. I see that the lower left area is relatively free of points, so I can place the labels there. I filter down to just the ellipse points in the bottom 25% of x and y values. That will keep points in that lower left quadrant. Then I find the (x, y) point with the farthest distance from the center as the location for my label. I make it sound so easy, but it took a lot of trial and error (including an attempt to use cosines).

# Euclidean distance
contour_dist <- function(xs, ys, center_x, center_y) {
  x_diff <- (center_x - xs) ^ 2
  y_diff <- (center_y - ys) ^ 2
  sqrt(x_diff + y_diff)
}

# Find the point to label in each ellipse.
df_label_locations <- df_ellipse %>%
  group_by(level) %>%
  filter(Intercept < quantile(Intercept, .25),
         Slope_Days < quantile(Slope_Days, .25)) %>%
  # Compute distance from center.
  mutate(dist = contour_dist(Intercept, Slope_Days,
                             fixef(m)[1], fixef(m)[2])) %>%
  # Keep smallest values.
  top_n(-1, wt = dist) %>%
  ungroup()

# Tweak the last plot one more time!
last_plot() +
  geom_text(aes(label = level, color = NULL), data = df_label_locations,
            nudge_x = .5, nudge_y = .8, size = 3.5, color = "grey40")


Are you feeling satisfied? I feel satisfied.

## Bonus: Plotting lines from a Bayesian mixed effects model

This last part is more of a code demo than a walkthrough. I call myself a Bayesian. Visualizing uncertainty is one of my things here, so I would be remiss if I didn’t also demo how to do some plots using posterior samples.

Conceptually, the classical model above estimated a single set of partially pooled regression lines. With the Bayesian model, we can sample from a posterior distribution of partially pooled regression lines. Instead of one line for each participant, there’s an entire distribution of them for each participant. This distribution lets us quantify our uncertainty about each part of our model.
First, we fit the model in RStanARM with weakly informative priors.

library(rstanarm)
#> Loading required package: Rcpp
#> rstanarm (Version 2.15.3, packaged: 2017-04-29 06:18:44 UTC)
#> - Do not expect the default priors to remain the same in future rstanarm versions.
#> Thus, R scripts should specify priors explicitly, even if they are just the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores())

b <- stan_glmer(
  Reaction ~ Days + (Days | Subject),
  family = gaussian(),
  data = df_sleep,
  prior = normal(0, 2),
  prior_intercept = normal(0, 5),
  prior_covariance = decov(regularization = 2),
  prior_aux = cauchy(0, 1))

We get a similar overview as arm::display() when we print the model.

b
#> stan_glmer
#>  family: gaussian [identity]
#>  formula: Reaction ~ Days + (Days | Subject)
#> ------
#>
#> Estimates:
#>             Median MAD_SD
#> (Intercept) 252.4    6.1
#> Days         10.4    1.7
#> sigma        25.7    1.5
#>
#> Error terms:
#>  Groups   Name        Std.Dev. Corr
#>  Subject  (Intercept) 24
#>           Days         7       0.07
#>  Residual             26
#> Num. levels: Subject 20
#>
#> Sample avg. posterior predictive
#> distribution of y (X = xbar):
#>          Median MAD_SD
#> mean_PPD 297.9    2.7
#>
#> ------
#> For info on the priors used see help('prior_summary.stanreg').

We have a posterior distribution of values now! That means instead of one “center of gravity” point, we have 4,000 plausible points for our central value. The center of our former contour plot has its own contour plot. That’s Bayes for you. We can plot that easily with stat_density_2d(). We set the coordinate limits to be the same as the last plot, just so that we don’t exaggerate the uncertainty around the central point by drawing a gigantic contour surface.
# Get a dataframe: One row per posterior sample
df_posterior <- b %>%
  as.data.frame() %>%
  as_tibble()

ggplot(df_posterior) +
  aes(x = `(Intercept)`, y = Days) +
  # Calculate the density
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  ggtitle("Where's the average intercept and slope?") +
  xlab("Estimate for average intercept") +
  ylab("Estimate for average slope") +
  # Use the same coordinate limits as last plot
  coord_cartesian(
    xlim = range(df_pulled$Intercept),
    ylim = range(df_pulled$Slope_Days),
    expand = TRUE) +
  guides(fill = "none")


For each participant, we have 4,000 partially-pooled regression lines too, so we can visualize our uncertainty for each participant’s individual regression line.

Let’s finish by drawing a sample of those lines for a faceted plot. We have to do a bunch of data wrangling to get a dataframe with one row per subject per posterior sample.

# For each sample, add the average intercept and average slope values to each
# participant's deviation from that average. These yield the intercept and
# slope parameters for each participant.
df_effects <- df_posterior %>%
  # Find all the columns with the pattern "b[(Intercept". Add the column
  # df_posterior$`(Intercept)` to each of those columns.
  mutate_at(
    .vars = vars(matches("b\\[\\(Intercept")),
    .funs = funs(. + df_posterior$`(Intercept)`)) %>%
  # Again for slope
  mutate_at(
    .vars = vars(matches("b\\[Day")),
    .funs = funs(. + df_posterior$Days))

# Convert to a long format
df_long_effects <- df_effects %>%
  select(matches("b\\[")) %>%
  rowid_to_column("draw") %>%
  tidyr::gather(Parameter, Value, -draw)

# Extract the effect type and subject number from each parameter name
df_long_effects$Type <- df_long_effects$Parameter %>%
  stringr::str_detect("Intercept") %>%
  ifelse(., "Intercept", "Slope_Day")

df_long_effects$Subject <- df_long_effects$Parameter %>%
  stringr::str_extract("\\d\\d\\d")

df_long_effects <- df_long_effects %>%
  select(draw, Subject, Effect = Type, Value)

# Finally!
df_long_effects
#> # A tibble: 160,000 x 4
#>     draw Subject    Effect    Value
#>    <int>   <chr>     <chr>    <dbl>
#>  1     1     308 Intercept 236.5760
#>  2     2     308 Intercept 263.6090
#>  3     3     308 Intercept 237.3153
#>  4     4     308 Intercept 247.9472
#>  5     5     308 Intercept 265.3939
#>  6     6     308 Intercept 241.7141
#>  7     7     308 Intercept 255.3133
#>  8     8     308 Intercept 261.2169
#>  9     9     308 Intercept 236.9531
#> 10    10     308 Intercept 264.2734
#> # ... with 159,990 more rows


Now that we have the data in the right shape, we are going to randomly choose 50 posterior samples and plot those lines alongside the observed data.

df_samples <- df_long_effects %>%
  filter(draw %in% sample(1:4000, size = 50)) %>%
  tidyr::spread(Effect, Value)

df_samples
#> # A tibble: 1,000 x 4
#>     draw Subject Intercept Slope_Day
#>  * <int>   <chr>     <dbl>     <dbl>
#>  1    15     308  266.8775 15.109710
#>  2    15     309  204.5003  4.056596
#>  3    15     310  223.4707  5.505083
#>  4    15     330  255.6470  7.120754
#>  5    15     331  279.8910  6.648727
#>  6    15     332  270.2003  8.059170
#>  7    15     333  283.5603  5.315121
#>  8    15     334  258.0755 11.155002
#>  9    15     335  243.8579  3.091841
#> 10    15     337  284.9078 19.905983
#> # ... with 990 more rows

ggplot(df_sleep) +
  aes(x = Days, y = Reaction) +
  geom_abline(aes(intercept = Intercept, slope = Slope_Day),
              data = df_samples, color = "#3366FF", alpha = .1) +
  geom_point() +
  facet_wrap("Subject") +
  scale_x_continuous(breaks = 0:4 * 2) +
  labs(x = xlab, y = ylab)


For the participants with complete data, the lines pile up and form a narrow band, indicating a low degree of uncertainty. In the final two panels, however, we only have limited data, and the sampled lines fan out and cover many different plausible trajectories.
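We could put a number on those band widths by computing the spread of the sampled slopes within each panel. A base-R sketch on fake draws (the `slopes` data frame here is simulated for illustration, not the actual posterior):

```r
set.seed(1)

# Fake posterior slope draws: a data-rich subject (tight) and a
# data-poor one (diffuse), standing in for the sampled lines
slopes <- data.frame(
  Subject   = rep(c("308", "373"), each = 1000),
  Slope_Day = c(rnorm(1000, mean = 20, sd = 1),
                rnorm(1000, mean = 10, sd = 6))
)

# Per-subject spread of the sampled lines: wider SD, wider fan
tapply(slopes$Slope_Day, slopes$Subject, sd)
```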

The uncertainty is more dramatic if we draw a contour plot for each participant—basically, drawing each participant’s most likely locations in the landscape of parameter values.

ggplot(df_long_effects %>% tidyr::spread(Effect, Value)) +
  aes(x = Intercept, y = Slope_Day) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon") +
  facet_wrap("Subject") +
  xlab("Intercept estimate") +
  ylab("Slope estimate") +
  theme(legend.position = "bottom") +
  guides(fill = "none")


For 373 and 374, the contour regions/ink-splats are very tall: a lot of slope values are plausible. The region for 374 is more off center and slightly narrower than that of 373: that extra data point matters.

Funnily enough, this post started as a quick write-up of a demo I wrote, but it kind of spiraled out of control. I hope this write-up helps students and users understand mixed-effects models at a more intuitive level.

I had formally learned about these models twice in graduate school. In psychology, we were told to use them if we wanted to make inferences about a larger population of subjects or stimulus items. In educational psychology, we were told to use them to capture the sources of variance in a nested data-set: kids nested in classrooms nested in schools, etc. It wasn’t until I taught myself Bayesian stats that I learned about a third reason to use them: they pool information across different units, providing regularized model estimates. I find this rationale most intuitive. The Gelman and Hill book and Statistical Rethinking both discuss the partial pooling description of these models. (Ooooh, as I added the Rethinking link, I just noticed that I created a ggplot2 version of the plot from the cover of that book. )
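That regularization idea can be sketched with the classic precision-weighted average: a group’s partially pooled estimate sits between its own mean and the grand mean, with the weight set by the group’s sample size. A toy base-R sketch of the intuition (the `partial_pool` helper and the variance numbers are made up for illustration, not the lmer machinery):

```r
# Normal-normal shrinkage: weight a group mean toward the grand mean.
# Assumed toy numbers: within-group variance 100, between-group variance 25.
partial_pool <- function(group_mean, n, grand_mean,
                         sigma2_within = 100, sigma2_between = 25) {
  w <- (n / sigma2_within) / (n / sigma2_within + 1 / sigma2_between)
  w * group_mean + (1 - w) * grand_mean
}

# A group with 2 observations is pulled hard toward the grand mean...
partial_pool(group_mean = 280, n = 2, grand_mean = 300)
#> [1] 293.3333

# ...but a group with 50 observations barely moves.
partial_pool(group_mean = 280, n = 50, grand_mean = 300)
#> [1] 281.4815
```

This is the “pull” visible in the plots above: the sparse-data participants get dragged toward the population estimates, while participants with complete data mostly keep their own lines.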
