Higher Order Functions

Notes on Citing R and R Packages

2024-05-03T00:00:00-05:00

Our group has started using a new knowledge base system, so I have been writing up and revisiting some of my documentation. Here I am going to share a guide I wrote about citing R packages in academic writing.

Which software to cite

Let’s make a distinction here between reporting (or summarizing) an analysis and reproducing (or carrying out) an analysis.

Our main manuscript document is for reporting. We want to report which tools and which versions of those tools we used to get our statistical results. We don’t need to include every computational detail. We will save that level of detail for a supplemental document that shows the exact modeling code and sessioninfo::session_info() for reproducing our results. Moreover, journals will sometimes limit the number of references in a manuscript and a full R analysis might draw on 15 packages, so we in general cannot cite everything that helped us get our results. So, we can think more generally about citation priorities.

For an analysis carried out in R, these items have the highest priority for citations:

R (the programming language / analysis environment).
Third party packages that carried out the analyses.
- For example, nlme, lme4, ordinal, rms, brms.
If a package calls on another language or analysis tool, cite that tool as well.
- For example, brms and rstanarm fit models using the Stan programming language, so we need to cite and version Stan as well.
Packages that performed additional computation on analysis results.
- For example, emmeans to get marginal means from a fitted model.
Packages that visualized analysis results automatically.
- For example, see or interactions.

The following items would have the lowest priority for citations:

RStudio: It’s just an interface to the language. (Ideally, an analysis could be run without touching RStudio.)
The built-in stats package.
knitr/quarto/rmarkdown: These performed R computations for us and stored the results in a document.
Siloed off parts of a main package.
- For example, the gamlss package fits GAMLSS models but the distributions for model families are stored in the package gamlss.dist. gamlss needs gamlss.dist to work, but gamlss is the main important thing to cite.
Data storage formats.

If space and the publication venue permit, we can also cite and version the key R packages that manipulated or visualized the data such as tidyverse, ggplot2, broom, tidybayes/ggdist, etc. Be generous. We do want to credit the tools we used to get our results after all!

Where to get citation information

Creators of scientific software will often tell users how to cite their software. Scientific software tools often have an associated article that announces the software and describes how to use it, and authors will ask users to cite that publication so they can obtain academic credit for their software work.

For R and R packages, the citation() function will tell users how to cite their software. lme4 is one of those packages that directs users to a publication.

citation("lme4")
#> To cite lme4 in publications use:
#> 
#>   Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015).
#>   Fitting Linear Mixed-Effects Models Using lme4. Journal of
#>   Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {Fitting Linear Mixed-Effects Models Using {lme4}},
#>     author = {Douglas Bates and Martin M{\"a}chler and Ben Bolker and Steve Walker},
#>     journal = {Journal of Statistical Software},
#>     year = {2015},
#>     volume = {67},
#>     number = {1},
#>     pages = {1--48},
#>     doi = {10.18637/jss.v067.i01},
#>   }

Notice in the BibTeX entry at the bottom how {lme4} is put in braces. These braces tell LaTeX not to change the capitalization of that word when printing the title. Some journals or formats have different preferences for how to capitalize titles, but as a general rule of thumb, software names need to be printed verbatim, exactly as the user would type them. (That is, library(Lme4) will not load the lme4 package.) When creating bibliography entries, take care to preserve the capitalization so that the software name is accurate. Take care also to differentiate between statistical methods and software names: “We fit GAMLSS models with the gamlss package”.

For CRAN packages, the output of citation() is also provided online in HTML. The CRAN package description page (e.g., lme4) includes a Citation entry which generates a formatted version of the citation information (e.g., lme4 citation info).

When the software doesn’t have a publication, R will generate a citation for you. The ordinal package is one such example.

citation("ordinal")
#> To cite 'ordinal' in publications use:
#> 
#>   Christensen R (2023). _ordinal-Regression Models for Ordinal Data_. R
#>   package version 2023.12-4,
#>   <https://CRAN.R-project.org/package=ordinal>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {ordinal---Regression Models for Ordinal Data},
#>     author = {Rune H. B. Christensen},
#>     year = {2023},
#>     note = {R package version 2023.12-4},
#>     url = {https://CRAN.R-project.org/package=ordinal},
#>   }

The underscores _ in the title indicate that the title would be italicized when the citation is viewed on CRAN.
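To get this information into a bibliography file, the citation can be converted to BibTeX lines with utils::toBibtex(). Here is a minimal sketch using R's own citation. The file location and the rstats key are placeholders I made up, and the substitution assumes the entry opens with an empty key, as in the outputs above.

```r
# Citation for R itself; citation("lme4") would work the same way
cit_r <- citation()

# Convert to BibTeX lines (a character vector, one element per line)
bib_lines <- as.character(utils::toBibtex(cit_r))

# The generated entry has an empty key ("@Manual{,"), so fill one in.
# "rstats" is a placeholder key, not something R provides.
bib_lines <- sub("^(@[A-Za-z]+\\{),", "\\1rstats,", bib_lines)

# Write the entry to a .bib file (hypothetical location)
bib_file <- file.path(tempdir(), "packages.bib")
writeLines(bib_lines, bib_file)
```

The key matters because pandoc-style citations such as @rstats look entries up by key, so an empty key cannot be cited.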

How to cite and version R and R packages

As a rule of thumb, any citation of a resource should answer these questions:

Who (authors)
What (title and sometimes format)
When (year)
Where (journal, URL, book, DOI)

Then for software, we can add the following:

Which (version)

The citation() function will answer these questions for you.
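As a rough sketch of how those questions map onto the citation object, the fields can be pulled out with $. This uses R's own citation so that it runs anywhere. Which fields are present varies by entry, so some of these can come back NULL for other packages.

```r
cit <- citation()   # R itself; pass a package name to cite a package

cit$author   # who
cit$title    # what
cit$year     # when
cit$url      # where

# which: the version is not always a separate field, so ask R directly
as.character(getRversion())
```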

There are a couple of other functions to know when it comes to package versions. utils::packageVersion() returns the package version as a package_version object, which prints like a string:

utils::packageVersion("lme4")
#> [1] '1.1.35.3'
utils::packageVersion("ordinal")
#> [1] '2023.12.4'
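One detail worth a quick sketch: the value returned by packageVersion() is a version object rather than a plain character string, which matters when comparing versions. The stats package is used here only because it ships with every R installation.

```r
v <- utils::packageVersion("stats")
class(v)
#> [1] "package_version" "numeric_version"

# Convert explicitly when a plain character string is needed
v_string <- as.character(v)

# Version objects compare component by component ...
package_version("1.10.0") > package_version("1.9.0")
#> [1] TRUE

# ... whereas plain strings compare character by character
"1.10.0" > "1.9.0"
#> [1] FALSE
```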

For the current R version, a bunch of built-in functions can tell you everything you need to know. I can never remember which of these functions I want (it’s getRversion()), so I will sometimes use utils::packageVersion("base") to get a simple version number.

R.version.string
#> [1] "R version 4.4.0 (2024-04-24 ucrt)"
R.version
#>                _                                
#> platform       x86_64-w64-mingw32               
#> arch           x86_64                           
#> os             mingw32                          
#> crt            ucrt                             
#> system         x86_64, mingw32                  
#> status                                          
#> major          4                                
#> minor          4.0                              
#> year           2024                             
#> month          04                               
#> day            24                               
#> svn rev        86474                            
#> language       R                                
#> version.string R version 4.4.0 (2024-04-24 ucrt)
#> nickname       Puppy Cup
getRversion()
#> [1] '4.4.0'

utils::packageVersion("base")
#> [1] '4.4.0'

For Stan, depending on the backend used, the software version is available via:

# rstanarm and default brms
rstan::stan_version()
#> [1] "2.32.2"

# non-default for brms
cmdstanr::cmdstan_version()
#> [1] "2.34.1"

Examples

A simple example of R, a modeling R package and a helper R package:

Analyses were carried out in the R programming language (vers. 4.2.0, R Core Team, 2021). Mixed models were estimated using the lme4 package (vers. 1.1.28, Bates et al., 2015). We estimated marginal means and contrasts using the emmeans package (vers. 1.7.2, Lenth, 2021).

Below is the actual RMarkdown content, so that version numbers and citations are inlined automatically. (We’re omitting details on creating .bib files or using pandoc’s @ citations.)

```{r}
v_lme4 <- packageVersion("lme4")
v_r <- packageVersion("base")
v_emmeans <- packageVersion("emmeans")
```

Analyses were carried out in the R programming language [vers. `r v_r`,
@rstats]. Mixed models were estimated using the lme4 package
[vers. `r v_lme4`, @lme4]. We estimated marginal means and contrasts
using the emmeans package [vers. `r v_emmeans`, @emmeans].

This aspect of the code is invisible, but I use nonbreaking spaces (HTML &nbsp;) after vers. so that the vers. and the version number stay on the same line.

Here is a more involved example involving an additional language and an R package that interfaces to that language:

We estimated the models using Stan (vers. 2.27.0, Carpenter et al., 2017) via the brms package (vers. 2.16.1, Bürkner, 2017) and tidybayes package (vers. 3.0.4, Kay, 2021) in R (vers. 4.3.0, R Core Team, 2021).

Behind the scenes, I had written the following RMarkdown:

```{r}
model <- targets::tar_read(model_random_slope)
v_stan <- model$version$cmdstan
v_brms <- model$version$brms
v_tidybayes <- packageVersion("tidybayes")
v_r <- getRversion()
```

We estimated the models using Stan [vers. `r v_stan`, @stan] via the
brms package [vers. `r v_brms`, @brms-jss] and tidybayes package
[vers. `r v_tidybayes`, @R-tidybayes] in R [vers. `r v_r`, @r-base].

Notice that I am reading in a cached model object (targets::tar_read()) and reading the software versions from that object. This arrangement avoids problems where models are fitted with one version of a package but utils::packageVersion() returns a different, more recent package version. brms stored these versions automatically for me. In general, when I cache a model like this, I store the package version in the model object.
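For models that do not record their own versions, something like the following sketch would do the job. The lm() fit on mtcars is only a stand-in for a real model, and versions is a field name I made up for illustration, not a convention of any package.

```r
# Fit a model and record the software versions at fitting time
fit <- lm(mpg ~ wt, data = mtcars)
fit$versions <- list(
  r = getRversion(),
  stats = utils::packageVersion("stats")
)

# Cache the model to disk (hypothetical location)
path <- file.path(tempdir(), "fit.rds")
saveRDS(fit, path)

# Later, possibly after upgrading packages: report the versions that
# were stored with the model, not the ones currently installed
cached <- readRDS(path)
v_r <- cached$versions$r
v_stats <- cached$versions$stats
```

Storing the versions next to the fit means the reported numbers travel with the cached results.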

A note on automatic citation helpers

A tool like the grateful package will generate a list of references and citations for us. Suppose we had the following where we fit a model and look at a summary of it.

library(tidyverse)
library(mgcv)
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.

data <- MASS::mcycle
model <- gam(accel ~ s(times, bs = "cr"), data = data)
broom::tidy(model)
#> # A tibble: 1 × 5
#>   term       edf ref.df statistic p.value
#>   <chr>    <dbl>  <dbl>     <dbl>   <dbl>
#> 1 s(times)  8.39   8.87      53.8       0

grateful detects the following packages in use:

grateful::scan_packages(pkgs = "Session")
#>         pkg version
#> 1      base   4.4.0
#> 2     knitr    1.46
#> 3      mgcv   1.9.1
#> 4      nlme 3.1.164
#> 5 tidyverse   2.0.0

Note that broom is excluded despite being used and that nlme is included despite not being loaded directly. That’s because broom is loaded by tidyverse and gets absorbed by the tidyverse citation and because mgcv loads nlme (as its start-up message says). (And knitr is loaded when I render my blog posts but not when I work within this R session interactively.)

Okay, so we should just list the packages manually and disable tidyverse from absorbing broom. grateful can create a bibliography for us:

grateful::get_pkgs_info(
  pkgs = c("mgcv", "broom"), 
  out.dir = getwd(), 
  cite.tidyverse = FALSE
)
#>     pkg version                                         citekeys
#> 1 broom   1.0.5                                            broom
#> 2  mgcv   1.9.1 mgcv2011, mgcv2016, mgcv2004, mgcv2017, mgcv2003

a <- grateful::cite_packages(
  pkgs = c("mgcv", "broom"), 
  out.dir = getwd(), 
  cite.tidyverse = FALSE, 
  output = "paragraph"
)
a |> stringr::str_wrap(width = 72) |> writeLines()
#> We used the following R packages: broom v. 1.0.5 [@broom], mgcv v. 1.9.1
#> [@mgcv2003; @mgcv2004; @mgcv2011; @mgcv2016; @mgcv2017].

Now, we have the issue where citation("mgcv") has multiple publications associated with it, not all of which are relevant for our usage.

The point of this example is that a tool like grateful—and more generally tools that produce code for us—can be useful to compile information and get the ball rolling for us. But, we still have to edit and refine the outputs to work correctly.

Last knitted on 2024-05-07. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting         value
#>  version         R version 4.4.0 (2024-04-24 ucrt)
#>  os              Windows 10 x64 (build 19045)
#>  system          x86_64, mingw32
#>  ui              RTerm
#>  language        (EN)
#>  collate         English_United States.utf8
#>  ctype           English_United States.utf8
#>  tz              America/Chicago
#>  date            2024-05-07
#>  pandoc          NA
#>  stan (rstan)    2.32.2
#>  stan (cmdstanr) 2.34.1
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package        * version  date (UTC) lib source
#>    abind            1.4-5    2016-07-21 [1] CRAN (R 4.4.0)
#>    backports        1.4.1    2021-12-13 [1] CRAN (R 4.4.0)
#>    broom            1.0.5    2023-06-09 [1] CRAN (R 4.4.0)
#>    cachem           1.0.8    2023-05-01 [1] CRAN (R 4.4.0)
#>    checkmate        2.3.1    2023-12-04 [1] CRAN (R 4.4.0)
#>    cli              3.6.2    2023-12-11 [1] CRAN (R 4.4.0)
#>    cmdstanr         0.7.1    2024-05-03 [1] local
#>    codetools        0.2-20   2024-03-31 [2] CRAN (R 4.4.0)
#>    colorspace       2.1-0    2023-01-23 [1] CRAN (R 4.4.0)
#>    curl             5.2.1    2024-03-01 [1] CRAN (R 4.4.0)
#>    distributional   0.4.0    2024-02-07 [1] CRAN (R 4.4.0)
#>    downlit          0.4.3    2023-06-29 [1] CRAN (R 4.4.0)
#>    dplyr          * 1.1.4    2023-11-17 [1] CRAN (R 4.4.0)
#>    evaluate         0.23     2023-11-01 [1] CRAN (R 4.4.0)
#>    fansi            1.0.6    2023-12-08 [1] CRAN (R 4.4.0)
#>    fastmap          1.1.1    2023-02-24 [1] CRAN (R 4.4.0)
#>    forcats        * 1.0.0    2023-01-29 [1] CRAN (R 4.4.0)
#>    generics         0.1.3    2022-07-05 [1] CRAN (R 4.4.0)
#>    ggplot2        * 3.5.1    2024-04-23 [1] CRAN (R 4.4.0)
#>    git2r            0.33.0   2023-11-26 [1] CRAN (R 4.4.0)
#>    glue             1.7.0    2024-01-09 [1] CRAN (R 4.4.0)
#>    grateful         0.2.4    2023-10-22 [1] CRAN (R 4.4.0)
#>    gridExtra        2.3      2017-09-09 [1] CRAN (R 4.4.0)
#>    gtable           0.3.5    2024-04-22 [1] CRAN (R 4.4.0)
#>    here             1.0.1    2020-12-13 [1] CRAN (R 4.4.0)
#>    hms              1.1.3    2023-03-21 [1] CRAN (R 4.4.0)
#>    inline           0.3.19   2021-05-31 [1] CRAN (R 4.4.0)
#>    jsonlite         1.8.8    2023-12-04 [1] CRAN (R 4.4.0)
#>    knitr          * 1.46     2024-04-06 [1] CRAN (R 4.4.0)
#>    lattice          0.22-6   2024-03-20 [2] CRAN (R 4.4.0)
#>    lifecycle        1.0.4    2023-11-07 [1] CRAN (R 4.4.0)
#>    loo              2.7.0    2024-02-24 [1] CRAN (R 4.4.0)
#>    lubridate      * 1.9.3    2023-09-27 [1] CRAN (R 4.4.0)
#>    magrittr         2.0.3    2022-03-30 [1] CRAN (R 4.4.0)
#>    MASS             7.3-60.2 2024-04-24 [2] local
#>    Matrix           1.7-0    2024-03-22 [2] CRAN (R 4.4.0)
#>    matrixStats      1.3.0    2024-04-11 [1] CRAN (R 4.4.0)
#>    memoise          2.0.1    2021-11-26 [1] CRAN (R 4.4.0)
#>    mgcv           * 1.9-1    2023-12-21 [2] CRAN (R 4.4.0)
#>    munsell          0.5.1    2024-04-01 [1] CRAN (R 4.4.0)
#>    nlme           * 3.1-164  2023-11-27 [2] CRAN (R 4.4.0)
#>    pillar           1.9.0    2023-03-22 [1] CRAN (R 4.4.0)
#>    pkgbuild         1.4.4    2024-03-17 [1] CRAN (R 4.4.0)
#>    pkgconfig        2.0.3    2019-09-22 [1] CRAN (R 4.4.0)
#>    posterior        1.5.0    2023-10-31 [1] CRAN (R 4.4.0)
#>    processx         3.8.4    2024-03-16 [1] CRAN (R 4.4.0)
#>    ps               1.7.6    2024-01-18 [1] CRAN (R 4.4.0)
#>    purrr          * 1.0.2    2023-08-10 [1] CRAN (R 4.4.0)
#>    QuickJSR         1.1.3    2024-01-31 [1] CRAN (R 4.4.0)
#>    R6               2.5.1    2021-08-19 [1] CRAN (R 4.4.0)
#>    ragg             1.3.0    2024-03-13 [1] CRAN (R 4.4.0)
#>    Rcpp             1.0.12   2024-01-09 [1] CRAN (R 4.4.0)
#>  D RcppParallel     5.1.7    2023-02-27 [1] CRAN (R 4.4.0)
#>    readr          * 2.1.5    2024-01-10 [1] CRAN (R 4.4.0)
#>    rlang            1.1.3    2024-01-10 [1] CRAN (R 4.4.0)
#>    rprojroot        2.0.4    2023-11-05 [1] CRAN (R 4.4.0)
#>    rstan            2.32.6   2024-03-05 [1] CRAN (R 4.4.0)
#>    rstudioapi       0.16.0   2024-03-24 [1] CRAN (R 4.4.0)
#>    scales           1.3.0    2023-11-28 [1] CRAN (R 4.4.0)
#>    sessioninfo      1.2.2    2021-12-06 [1] CRAN (R 4.4.0)
#>    StanHeaders      2.32.7   2024-04-25 [1] CRAN (R 4.4.0)
#>    stringi          1.8.3    2023-12-11 [1] CRAN (R 4.4.0)
#>    stringr        * 1.5.1    2023-11-14 [1] CRAN (R 4.4.0)
#>    systemfonts      1.0.6    2024-03-07 [1] CRAN (R 4.4.0)
#>    tensorA          0.36.2.1 2023-12-13 [1] CRAN (R 4.4.0)
#>    textshaping      0.3.7    2023-10-09 [1] CRAN (R 4.4.0)
#>    tibble         * 3.2.1    2023-03-20 [1] CRAN (R 4.4.0)
#>    tidyr          * 1.3.1    2024-01-24 [1] CRAN (R 4.4.0)
#>    tidyselect       1.2.1    2024-03-11 [1] CRAN (R 4.4.0)
#>    tidyverse      * 2.0.0    2023-02-22 [1] CRAN (R 4.4.0)
#>    timechange       0.3.0    2024-01-18 [1] CRAN (R 4.4.0)
#>    tzdb             0.4.0    2023-05-12 [1] CRAN (R 4.4.0)
#>    utf8             1.2.4    2023-10-22 [1] CRAN (R 4.4.0)
#>    V8               4.4.2    2024-02-15 [1] CRAN (R 4.4.0)
#>    vctrs            0.6.5    2023-12-01 [1] CRAN (R 4.4.0)
#>    withr            3.0.0    2024-01-16 [1] CRAN (R 4.4.0)
#>    xfun             0.43     2024-03-25 [1] CRAN (R 4.4.0)
#>    yaml             2.3.8    2023-12-11 [1] CRAN (R 4.4.0)
#> 
#>  [1] C:/Users/mahr/AppData/Local/R/win-library/4.4
#>  [2] C:/Program Files/R/R-4.4.0/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────


Ordering constraints in brms using contrast coding

2023-07-03T00:00:00-05:00

Mattan S. Ben-Shachar wrote an excellent tutorial about how to impose ordering constraints in Bayesian regression models. In that post, the data comes from archaeology (inspired by Buck, 2017 but not an exact copy). We have samples from different layers (Layer) in a site, and for each sample, we have a C14 radiocarbon date measurement and its associated measurement error.

library(tidyverse)

table1 <- tribble(
  ~Layer,  ~C14, ~error,
     "B", -5773,     30,
     "B", -5654,     30,
     "B", -5585,     30,
     "C", -5861,     30,
     "C", -5755,     30,
     "E", -5850,     50,
     "E", -5928,     50,
     "E", -5905,     50,
     "G", -6034,     30,
     "G", -6184,     30,
     "I", -6248,     50,
     "I", -6350,     50
  )
table1$Layer <- factor(table1$Layer)

Because of how the layers are ordered—new stuff piled on top of older stuff—we a priori expect deeper layers to have older dates, so these are the ordering constraints:

\[\mu_{\text{Layer I}} < \mu_{\text{Layer G}} < \mu_{\text{Layer E}} < \mu_{\text{Layer C}} < \mu_{\text{Layer B}}\]

where μ is the average C14 age of a layer.

Ben-Shachar’s post works through some ways in brms to achieve this constraint:

Fit the usual model but filter out posterior draws where the ordering constraint is violated.
Have the Stan sampler reject draws where the constraint is violated. But note that the documentation for reject has a section titled “Rejection is not for constraints”.
Use brms’s monotonic effect mo() syntax.

In this post, I am going to add another option to this list:

Use contrast coding so the model parameters represent the differences between successive levels, and use priors to enforce the ordering constraint.

Big idea of contrast coding

When our model includes categorical variables, we need some way to code those variables in our model (that is, use numbers to represent the category levels). Our choice of coding scheme will change the meaning of the model parameters, allowing us to perform different comparisons (test different statistical hypotheses) about the means of the category levels. Let’s spell that out again, because it is the big idea of contrast coding:

different contrast coding schemes <-> 
  different parameter meanings <-> 
    different comparisons / hypotheses

(Isn’t that an eye-popping graphic?)

The toolbox of contrast coding schemes is deep but also confusing. Whenever I step away from R’s default contrast coding, I usually have these pages open to help me: some tutorial on a UCLA page, Lisa DeBruine’s comparison article, and the menu of contrast schemes in emmeans. So, let’s review the basics by looking at R’s default contrast coding scheme.

The default: dummy coding

By default, R will code categorical variables in a regression model using “treatment” or “dummy” coding. In this scheme,

The intercept is the mean of one of the category levels (the reference level)
Parameters estimate the difference between each other level and the reference level

Let’s fit a simple linear model and work through the parameter meanings:

m1 <- lm(C14 ~ 1 + Layer, table1)
coef(m1)
#> (Intercept)      LayerC      LayerE      LayerG      LayerI 
#>  -5670.6667   -137.3333   -223.6667   -438.3333   -628.3333

Here, the (Intercept) is the mean of the reference level, and the reference level is the level of the categorical variable not listed in the other parameter names (LayerB). Each of the other parameters is a difference from that reference level. Layer C’s mean is (Intercept) + LayerC. The model.matrix() shows how these categorical variables are coded in the model’s design/contrast matrix:

# Matrix has 1 row per observation but we just want 1 per category level
mat_m1 <- m1 |> 
  model.matrix() |>
  unique()
mat_m1
#>    (Intercept) LayerC LayerE LayerG LayerI
#> 1            1      0      0      0      0
#> 4            1      1      0      0      0
#> 6            1      0      1      0      0
#> 9            1      0      0      1      0
#> 11           1      0      0      0      1

The (Intercept) is the model constant, so naturally, it’s switched on (equals 1) for every row. Each of the other columns is an indicator variable: LayerC turns on for the layer C rows, LayerE turns on for the layer E rows, and so on.

Matrix multiplying the contrast matrix by the model coefficients will compute the mean values of each layer.

\[\mathbf{\hat y} = \mathbf{X}\boldsymbol{\beta}\]

Think of this equation as a contract for a contrast coding scheme: Multiplying the contrast matrix by the model coefficients should give us the means of the category levels.

mat_m1 %*% coef(m1)
#>         [,1]
#> 1  -5670.667
#> 4  -5808.000
#> 6  -5894.333
#> 9  -6109.000
#> 11 -6299.000

# Means by hand
aggregate(C14 ~ Layer, table1, mean)
#>   Layer       C14
#> 1     B -5670.667
#> 2     C -5808.000
#> 3     E -5894.333
#> 4     G -6109.000
#> 5     I -6299.000

If the matrix multiplication is too quick, here it is in slow motion where the entries in each row have been weighted (multiplied) by the corresponding coefficients:

# Sums of the rows are the means
mat_m1 %*% diag(coef(m1))
#>         [,1]      [,2]      [,3]      [,4]      [,5]
#> 1  -5670.667    0.0000    0.0000    0.0000    0.0000
#> 4  -5670.667 -137.3333    0.0000    0.0000    0.0000
#> 6  -5670.667    0.0000 -223.6667    0.0000    0.0000
#> 9  -5670.667    0.0000    0.0000 -438.3333    0.0000
#> 11 -5670.667    0.0000    0.0000    0.0000 -628.3333

Successive differences coding

Now, let’s look at a different kind of coding: (reverse) successive differences coding. In this scheme:

The intercept is the mean of the level means
Parameters estimate the difference between adjacent levels
but I have to reverse how the levels are ordered in the underlying factor() so that the differences are positive, comparing each layer with the one below it. (LayerB - LayerC should be positive).

We apply this coding by creating a new factor and setting its contrasts(). R lets us set the contrasts to the name of a function that computes contrasts, so we use "contr.sdif".

contr.sdif <- MASS::contr.sdif

# Reverse the factor levels
table1$LayerAlt <- factor(table1$Layer, rev(levels(table1$Layer)))

contrasts(table1$LayerAlt) <- "contr.sdif"
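To see why the reversal gives the comparisons we want, here is a small sketch that only looks at the level order and the contrast labels. It uses a toy factor with one value per layer instead of table1, and it assumes MASS is installed (it ships with standard R distributions).

```r
layer <- factor(c("B", "C", "E", "G", "I"))

# Reverse the level order so the deepest layer comes first
layer_alt <- factor(layer, levels = rev(levels(layer)))
levels(layer_alt)
#> [1] "I" "G" "E" "C" "B"

# contr.sdif() labels each contrast as "later level - earlier level"
sdif_names <- colnames(MASS::contr.sdif(levels(layer_alt)))
sdif_names
#> [1] "G-I" "E-G" "C-E" "B-C"
```

Each label reads as a shallower layer minus the layer below it, which is the direction in which the differences should be positive.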

Then we just fit the model as usual. As intended, the model’s coefficients are different.

m2 <- lm(C14 ~ 1 + LayerAlt, table1)
coef(m2)
#> (Intercept) LayerAltG-I LayerAltE-G LayerAltC-E LayerAltB-C 
#> -5956.20000   190.00000   214.66667    86.33333   137.33333

We can compute the mean of layer means and the layer differences by hand to confirm that the model parameters are computing what we expect.

# Make a list so we can write out the diffs easily
layer_means <- table1 |> 
  split(~ Layer) |> 
  lapply(function(x) mean(x$C14))
str(layer_means)
#> List of 5
#>  $ B: num -5671
#>  $ C: num -5808
#>  $ E: num -5894
#>  $ G: num -6109
#>  $ I: num -6299

data.frame(
  model_coef = coef(m2),
  by_hand = c(
    mean(unlist(layer_means)),
    layer_means$G - layer_means$I,
    layer_means$E - layer_means$G,
    layer_means$C - layer_means$E,
    layer_means$B - layer_means$C
  )
)
#>              model_coef     by_hand
#> (Intercept) -5956.20000 -5956.20000
#> LayerAltG-I   190.00000   190.00000
#> LayerAltE-G   214.66667   214.66667
#> LayerAltC-E    86.33333    86.33333
#> LayerAltB-C   137.33333   137.33333

Back to our contrast coding contract, we see that the contrast matrix matrix-multiplied by the model coefficients gives us the level means.

mat_m2 <- unique(model.matrix(m2))

mat_m2 %*% coef(m2)
#>         [,1]
#> 1  -5670.667
#> 4  -5808.000
#> 6  -5894.333
#> 9  -6109.000
#> 11 -6299.000

# By hand
aggregate(C14 ~ Layer, table1, mean)
#>   Layer       C14
#> 1     B -5670.667
#> 2     C -5808.000
#> 3     E -5894.333
#> 4     G -6109.000
#> 5     I -6299.000

It’s so clean and simple. We still get the level means and the parameters estimate specific comparisons of interest to us. So, how are the categorical variables and their differences coded in the model’s contrast matrix?

mat_m2
#>    (Intercept) LayerAltG-I LayerAltE-G LayerAltC-E LayerAltB-C
#> 1            1         0.2         0.4         0.6         0.8
#> 4            1         0.2         0.4         0.6        -0.2
#> 6            1         0.2         0.4        -0.4        -0.2
#> 9            1         0.2        -0.6        -0.4        -0.2
#> 11           1        -0.8        -0.6        -0.4        -0.2

Wait… what? 😕

The Comparison Matrix

When I first started drafting this post, I made it to this point and noped out for a few days. My curiosity did win out eventually, and I hit the books (remembered this tweet and this handout, watched this video, read this paper, and read section 9.1.2 in Applied Regression Analysis & Generalized Linear Models). Now, for the rest of the post.

The best formal, citable source for what I describe here is Schad and colleagues (2020), but what they call a “hypothesis matrix”, I’m calling a comparison matrix. I do this for two reasons: 1) to get away from a hypothesis-testing mindset (see Figure 1) and 2) because we are using the hypothesis matrix to apply a constraint among parameter values (remember that?).

Figure 1. The sign in my yard.

In this approach, we define the model parameters β by matrix-multiplying the comparison matrix C (which activates or weights different level means) and the level means μ.

\[\mathbf{C}\boldsymbol{\mu} = \boldsymbol{\beta} \\ \begin{bmatrix} \textrm{weights for comparison 1} \\ \textrm{weights for comparison 2} \\ \textrm{weights for comparison 3} \\ \cdots \\ \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \cdots \\ \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \cdots \\ \end{bmatrix}\]

So, in the dummy-coded version of the model, we had the following comparison matrix:

\[\mathbf{C}_\text{dummy}\boldsymbol{\mu} = \boldsymbol{\beta}_\text{dummy} \\ \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 \\ -1 & 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 1 & 0 \\ -1 & 0 & 0 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} \mu_{\text{Layer B}} \\ \mu_{\text{Layer C}} \\ \mu_{\text{Layer E}} \\ \mu_{\text{Layer G}} \\ \mu_{\text{Layer I}} \\ \end{bmatrix} = \begin{bmatrix} \beta_0: \mu_{\text{Layer B}} \\ \beta_1: \mu_{\text{Layer C}} - \mu_{\text{Layer B}} \\ \beta_2: \mu_{\text{Layer E}} - \mu_{\text{Layer B}} \\ \beta_3: \mu_{\text{Layer G}} - \mu_{\text{Layer B}} \\ \beta_4: \mu_{\text{Layer I}} - \mu_{\text{Layer B}} \\ \end{bmatrix}\]

The first row in C sets the Layer B as the reference value for the dummy coding. The second row turns on both Layer B and Layer C, but Layer B is negatively weighted. Thus, the corresponding model coefficient is the difference between Layers C and B.

The comparison matrix for the reverse successive difference contrast coding is similar. The first row activates all of the layers but weights them equally, so we get a mean of means for the model intercept. Each row after the first is the difference between two layer means.

\[\mathbf{C}_\text{rev-diffs}\boldsymbol{\mu} = \boldsymbol{\beta}_\text{rev-diffs} \\ \begin{bmatrix} .2 & .2 & .2 & .2 & .2 \\ 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ \end{bmatrix} \begin{bmatrix} \mu_{\text{Layer B}} \\ \mu_{\text{Layer C}} \\ \mu_{\text{Layer E}} \\ \mu_{\text{Layer G}} \\ \mu_{\text{Layer I}} \\ \end{bmatrix} = \begin{bmatrix} \beta_0: \text{mean of } \mu \\ \beta_1: \mu_{\text{Layer G}} - \mu_{\text{Layer I}} \\ \beta_2: \mu_{\text{Layer E}} - \mu_{\text{Layer G}} \\ \beta_3: \mu_{\text{Layer C}} - \mu_{\text{Layer E}} \\ \beta_4: \mu_{\text{Layer B}} - \mu_{\text{Layer C}} \\ \end{bmatrix}\]

Now, here is the magic part 🔮. Multiplying both sides by the inverse of the comparison matrix will set up a design matrix for the linear model which follows the contract for the contrast matrices I described above:

\[\mathbf{C}\boldsymbol{\mu} = \boldsymbol{\beta} \\ \mathbf{C}^{-1}\mathbf{C}\boldsymbol{\mu} = \mathbf{C}^{-1}\boldsymbol{\beta} \\ \boldsymbol{\mu} = \mathbf{C}^{-1}\boldsymbol{\beta} \\ \mathbf{\hat y} = \mathbf{X}\boldsymbol{\beta} \\\]

So, we can invert¹ our comparison matrix to get the model’s contrast matrix:

comparisons <- c(
  .2, .2, .2, .2, .2,
   0,  0,  0,  1, -1,
   0,  0,  1, -1,  0,
   0,  1, -1,  0,  0,
   1, -1,  0,  0,  0
)

mat_comparisons <- matrix(comparisons, nrow = 5, byrow = TRUE)
solve(mat_comparisons)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1  0.2  0.4  0.6  0.8
#> [2,]    1  0.2  0.4  0.6 -0.2
#> [3,]    1  0.2  0.4 -0.4 -0.2
#> [4,]    1  0.2 -0.6 -0.4 -0.2
#> [5,]    1 -0.8 -0.6 -0.4 -0.2

mat_m2
#>    (Intercept) LayerAltG-I LayerAltE-G LayerAltC-E LayerAltB-C
#> 1            1         0.2         0.4         0.6         0.8
#> 4            1         0.2         0.4         0.6        -0.2
#> 6            1         0.2         0.4        -0.4        -0.2
#> 9            1         0.2        -0.6        -0.4        -0.2
#> 11           1        -0.8        -0.6        -0.4        -0.2

Or, perhaps more commonly, we can take the contrast matrix used by a model and recover the comparison matrix, which is a nice trick when we have R automatically set the contrast values for us:

# Dummy coding example
mat_m1
#>    (Intercept) LayerC LayerE LayerG LayerI
#> 1            1      0      0      0      0
#> 4            1      1      0      0      0
#> 6            1      0      1      0      0
#> 9            1      0      0      1      0
#> 11           1      0      0      0      1
solve(mat_m1)
#>              1 4 6 9 11
#> (Intercept)  1 0 0 0  0
#> LayerC      -1 1 0 0  0
#> LayerE      -1 0 1 0  0
#> LayerG      -1 0 0 1  0
#> LayerI      -1 0 0 0  1

# Successive differences coding example
mat_m2
#>    (Intercept) LayerAltG-I LayerAltE-G LayerAltC-E LayerAltB-C
#> 1            1         0.2         0.4         0.6         0.8
#> 4            1         0.2         0.4         0.6        -0.2
#> 6            1         0.2         0.4        -0.4        -0.2
#> 9            1         0.2        -0.6        -0.4        -0.2
#> 11           1        -0.8        -0.6        -0.4        -0.2
solve(mat_m2)
#>               1    4    6    9   11
#> (Intercept) 0.2  0.2  0.2  0.2  0.2
#> LayerAltG-I 0.0  0.0  0.0  1.0 -1.0
#> LayerAltE-G 0.0  0.0  1.0 -1.0  0.0
#> LayerAltC-E 0.0  1.0 -1.0  0.0  0.0
#> LayerAltB-C 1.0 -1.0  0.0  0.0  0.0
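The same trick should work for any full-rank coding scheme. As a sketch, here it is with base R's sum coding (`contr.sum()`), a scheme that is not used elsewhere in this post. The names `mat_sum` and `comparisons_sum` are placeholders for this example.

```r
# Contrast matrix for a 5-level factor: intercept column plus sum coding
mat_sum <- cbind(Intercept = 1, contr.sum(5))
mat_sum

# Invert it to see which comparison each parameter encodes
comparisons_sum <- solve(mat_sum)
round(comparisons_sum, 2)

# Row 1: equal weights of 0.2, so the intercept is the mean of the level means
# Row 2: 0.8 for level 1 and -0.2 elsewhere, so it is level 1 minus that mean
```

Reading the rows of the inverted matrix tells us that, under sum coding, the intercept is the grand mean and each coefficient is a level's deviation from the grand mean.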

As I said earlier, there are all kinds of contrast coding schemes which allow us to define the model parameters in terms of specific comparisons, and this post only mentions two such schemes (dummy coding and a reversed version of successive differences coding).

Finally, in Layer I of this post, the brms model

Now that we know about contrasts, and how they let us define model parameters in terms of the comparisons we want to make, we can use this technique to enforce an ordering constraint.

We set up our model as in Ben-Shachar’s post, but here we set a normal(500, 250) prior on the non-intercept coefficients with a lower bound of 0 (lb = 0) to enforce the ordering constraint.

library(brms)
priors <- 
  set_prior("normal(-5975, 1000)", class = "Intercept") + 
  set_prior("normal(500, 250)", class = "b", lb = 0) +
  set_prior("exponential(0.01)", class = "sigma")

validate_prior(
  priors,
  bf(C14 | se(error, sigma = TRUE) ~ 1 + LayerAlt),
  data = table1
)
#>                prior     class        coef group resp dpar nlpar lb ub
#>     normal(500, 250)         b                                    0   
#>     normal(500, 250)         b LayerAltBMC                        0   
#>     normal(500, 250)         b LayerAltCME                        0   
#>     normal(500, 250)         b LayerAltEMG                        0   
#>     normal(500, 250)         b LayerAltGMI                        0   
#>  normal(-5975, 1000) Intercept                                        
#>    exponential(0.01)     sigma                                    0   
#>        source
#>          user
#>  (vectorized)
#>  (vectorized)
#>  (vectorized)
#>  (vectorized)
#>          user
#>          user
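It can help to see how much the lower bound changes the prior. The following is a rough base-R sketch (not how brms or Stan implements the bound internally) that computes how much of an unbounded normal(500, 250) falls below zero and then simulates from the truncated version. The names `mass_below_zero` and `prior_draws` are made up for this example.

```r
# Share of an unbounded normal(500, 250) that the lower bound cuts off
mass_below_zero <- pnorm(0, mean = 500, sd = 250)
round(mass_below_zero, 4)

# Simulate from the truncated prior with the inverse-CDF method:
# draw uniform probabilities above the cut-off, then map them to quantiles
set.seed(1)
u <- runif(10000, min = mass_below_zero, max = 1)
prior_draws <- qnorm(u, mean = 500, sd = 250)

range(prior_draws)
mean(prior_draws)
```

Zero sits two standard deviations below the prior mean, so the bound removes only a small tail of the distribution while guaranteeing that every layer difference is positive.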

We fit the model:

m3 <- brm(
  bf(C14 | se(error, sigma = TRUE) ~ 1 + LayerAlt),
  family = gaussian("identity"),
  prior = priors,
  data = table1,
  seed = 4321,
  backend = "cmdstanr",
  cores = 4, 
  # caching
  file = "_caches/2023-07-03", 
  file_refit = "on_change"
)

We can see that the level differences are indeed positive, and their 95% intervals contain only positive values.

summary(m3)
#>  Family: gaussian 
#>   Links: mu = identity; sigma = identity 
#> Formula: C14 | se(error, sigma = TRUE) ~ 1 + LayerAlt 
#>    Data: table1 (Number of observations: 12) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Population-Level Effects: 
#>             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   -5957.60     27.91 -6011.89 -5900.71 1.00     1964     1715
#> LayerAltGMI   211.00     82.29    51.67   378.86 1.00     1693      939
#> LayerAltEMG   206.15     71.30    68.47   349.07 1.00     1937     1185
#> LayerAltCME   105.55     62.84     7.90   243.81 1.00     1377     1023
#> LayerAltBMC   145.95     65.13    23.63   279.12 1.00     1684      857
#> 
#> Family Specific Parameters: 
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma    79.03     26.95    41.05   142.49 1.00     1651     2149
#> 
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
bayesplot::mcmc_intervals(m3, regex_pars = "Layer")

Estimates of the level differences.

conditional_effects(m3)

Conditional means for each layer.

Normally, I don’t think you need contrast codes

My general advice for contrast coding is to just fit the model and then have the software compute the appropriate estimates and comparisons afterwards on the outcome scale. For example, emmeans can take a fitted model, run requested comparisons, and handle multiple comparisons and p-value adjustments for us. marginaleffects probably does this too. (I really need to play with it.) And in a Bayesian model, we can compute comparisons of interest by doing math on the posterior samples (estimating things, computing differences, and summarizing the distribution of the differences). But this particular model, where the coding was needed to impose the prior ordering constraint, ruled out the posterior post-processing approach.
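As a sketch of what "doing math on the posterior samples" looks like in general, here is a base-R example. The draws are simulated with `rnorm()` as a stand-in for real posterior draws; they do not come from `m3` or any fitted model, and the column names are invented. With a real brms model, the draws would come from something like `as_draws_df()`.

```r
set.seed(20230705)
n_draws <- 4000

# Fake posterior draws of two group means (simulated for illustration only)
draws <- data.frame(
  mean_a = rnorm(n_draws, mean = 100, sd = 5),
  mean_b = rnorm(n_draws, mean = 110, sd = 5)
)

# Compute the comparison inside each draw...
draws$diff_b_a <- draws$mean_b - draws$mean_a

# ...then summarize the distribution of the differences
c(
  estimate = mean(draws$diff_b_a),
  quantile(draws$diff_b_a, probs = c(.025, .975))
)

# Posterior probability that group b exceeds group a
mean(draws$diff_b_a > 0)
```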

Last knitted on 2023-07-05. Source code on GitHub.²

I use solve() here for the inversion, but Schad and colleagues (2020) use the generalized inverse MASS::ginv() or matlib::Ginv(). solve() only works on square matrices, but the generalized inverse works on non-square matrices.

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting         value
#>  version         R version 4.3.0 (2023-04-21 ucrt)
#>  os              Windows 11 x64 (build 22621)
#>  system          x86_64, mingw32
#>  ui              RTerm
#>  language        (EN)
#>  collate         English_United States.utf8
#>  ctype           English_United States.utf8
#>  tz              America/Chicago
#>  date            2023-07-05
#>  pandoc          NA
#>  stan (rstan)    2.26.1
#>  stan (cmdstanr) 2.32.0
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package        * version date (UTC) lib source
#>    abind            1.4-5   2016-07-21 [1] CRAN (R 4.3.0)
#>    backports        1.4.1   2021-12-13 [1] CRAN (R 4.3.0)
#>    base64enc        0.1-3   2015-07-28 [1] CRAN (R 4.3.0)
#>    bayesplot        1.10.0  2022-11-16 [1] CRAN (R 4.3.0)
#>    bridgesampling   1.1-2   2021-04-16 [1] CRAN (R 4.3.0)
#>    brms           * 2.19.0  2023-03-14 [1] CRAN (R 4.3.0)
#>    Brobdingnag      1.2-9   2022-10-19 [1] CRAN (R 4.3.0)
#>    cachem           1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
#>    callr            3.7.3   2022-11-02 [1] CRAN (R 4.3.0)
#>    checkmate        2.2.0   2023-04-27 [1] CRAN (R 4.3.0)
#>    cli              3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>    cmdstanr         0.5.3   2023-04-24 [1] local
#>    coda             0.19-4  2020-09-30 [1] CRAN (R 4.3.0)
#>    codetools        0.2-19  2023-02-01 [2] CRAN (R 4.3.0)
#>    colorspace       2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
#>    colourpicker     1.2.0   2022-10-28 [1] CRAN (R 4.3.0)
#>    crayon           1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
#>    crosstalk        1.2.0   2021-11-04 [1] CRAN (R 4.3.0)
#>    curl             5.0.1   2023-06-07 [1] CRAN (R 4.3.0)
#>    digest           0.6.32  2023-06-26 [1] CRAN (R 4.3.1)
#>    distributional   0.3.2   2023-03-22 [1] CRAN (R 4.3.0)
#>    downlit          0.4.3   2023-06-29 [1] CRAN (R 4.3.0)
#>    dplyr          * 1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
#>    DT               0.28    2023-05-18 [1] CRAN (R 4.3.0)
#>    dygraphs         1.1.1.6 2018-07-11 [1] CRAN (R 4.3.0)
#>    ellipsis         0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
#>    emmeans          1.8.7   2023-06-23 [1] CRAN (R 4.3.1)
#>    estimability     1.4.1   2022-08-05 [1] CRAN (R 4.3.0)
#>    evaluate         0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>    fansi            1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
#>    farver           2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
#>    fastmap          1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>    forcats        * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
#>    generics         0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>    ggplot2        * 3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
#>    git2r            0.32.0  2023-04-12 [1] CRAN (R 4.3.1)
#>    glue             1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>    gridExtra        2.3     2017-09-09 [1] CRAN (R 4.3.0)
#>    gtable           0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
#>    gtools           3.9.4   2022-11-27 [1] CRAN (R 4.3.0)
#>    here             1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
#>    highr            0.10    2022-12-22 [1] CRAN (R 4.3.0)
#>    hms              1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
#>    htmltools        0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
#>    htmlwidgets      1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
#>    httpuv           1.6.11  2023-05-11 [1] CRAN (R 4.3.0)
#>    igraph           1.5.0   2023-06-16 [1] CRAN (R 4.3.1)
#>    inline           0.3.19  2021-05-31 [1] CRAN (R 4.3.0)
#>    jsonlite         1.8.5   2023-06-05 [1] CRAN (R 4.3.1)
#>    knitr          * 1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>    labeling         0.4.2   2020-10-20 [1] CRAN (R 4.3.0)
#>    later            1.3.1   2023-05-02 [1] CRAN (R 4.3.0)
#>    lattice          0.21-8  2023-04-05 [2] CRAN (R 4.3.0)
#>    lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>    loo              2.6.0   2023-03-31 [1] CRAN (R 4.3.0)
#>    lubridate      * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
#>    magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>    markdown         1.7     2023-05-16 [1] CRAN (R 4.3.0)
#>    MASS           * 7.3-60  2023-05-04 [1] CRAN (R 4.3.0)
#>    Matrix           1.5-4   2023-04-04 [2] CRAN (R 4.3.0)
#>    matrixStats      1.0.0   2023-06-02 [1] CRAN (R 4.3.0)
#>    memoise          2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
#>    mime             0.12    2021-09-28 [1] CRAN (R 4.3.0)
#>    miniUI           0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
#>    munsell          0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
#>    mvtnorm          1.2-2   2023-06-08 [1] CRAN (R 4.3.1)
#>    nlme             3.1-162 2023-01-31 [2] CRAN (R 4.3.0)
#>    pillar           1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>    pkgbuild         1.4.2   2023-06-26 [1] CRAN (R 4.3.1)
#>    pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>    plyr             1.8.8   2022-11-11 [1] CRAN (R 4.3.0)
#>    posterior        1.4.1   2023-03-14 [1] CRAN (R 4.3.0)
#>    prettyunits      1.1.1   2020-01-24 [1] CRAN (R 4.3.0)
#>    processx         3.8.1   2023-04-18 [1] CRAN (R 4.3.1)
#>    promises         1.2.0.1 2021-02-11 [1] CRAN (R 4.3.0)
#>    ps               1.7.5   2023-04-18 [1] CRAN (R 4.3.0)
#>    purrr          * 1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
#>    R6               2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>    ragg             1.2.5   2023-01-12 [1] CRAN (R 4.3.0)
#>    Rcpp           * 1.0.10  2023-01-22 [1] CRAN (R 4.3.0)
#>  D RcppParallel     5.1.7   2023-02-27 [1] CRAN (R 4.3.0)
#>    readr          * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
#>    reshape2         1.4.4   2020-04-09 [1] CRAN (R 4.3.0)
#>    rlang            1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>    rprojroot        2.0.3   2022-04-02 [1] CRAN (R 4.3.0)
#>    rstan            2.26.22 2023-05-02 [1] local
#>    rstantools       2.3.1   2023-03-30 [1] CRAN (R 4.3.0)
#>    rstudioapi       0.14    2022-08-22 [1] CRAN (R 4.3.0)
#>    scales           1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
#>    sessioninfo      1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>    shiny            1.7.4   2022-12-15 [1] CRAN (R 4.3.0)
#>    shinyjs          2.1.0   2021-12-23 [1] CRAN (R 4.3.0)
#>    shinystan        2.6.0   2022-03-03 [1] CRAN (R 4.3.0)
#>    shinythemes      1.2.0   2021-01-25 [1] CRAN (R 4.3.0)
#>    StanHeaders      2.26.27 2023-06-14 [1] CRAN (R 4.3.1)
#>    stringi          1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
#>    stringr        * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
#>    systemfonts      1.0.4   2022-02-11 [1] CRAN (R 4.3.0)
#>    tensorA          0.36.2  2020-11-19 [1] CRAN (R 4.3.0)
#>    textshaping      0.3.6   2021-10-13 [1] CRAN (R 4.3.0)
#>    threejs          0.3.3   2020-01-21 [1] CRAN (R 4.3.0)
#>    tibble         * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>    tidyr          * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
#>    tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
#>    tidyverse      * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
#>    timechange       0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
#>    tzdb             0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
#>    utf8             1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
#>    V8               4.3.0   2023-04-08 [1] CRAN (R 4.3.1)
#>    vctrs            0.6.3   2023-06-14 [1] CRAN (R 4.3.1)
#>    withr            2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>    xfun             0.39    2023-04-20 [1] CRAN (R 4.3.0)
#>    xtable           1.8-4   2019-04-21 [1] CRAN (R 4.3.0)
#>    xts              0.13.1  2023-04-16 [1] CRAN (R 4.3.0)
#>    zoo              1.8-12  2023-04-13 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.0/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────


How to score Rock Paper Scissors

2022-12-06T00:00:00-06:00

Ho ho ho, it is the most wonderful time of the year: Advent of Code!

AOC is a yearly collection of programming puzzles throughout the first 25 days of December. I like it… so much so that I wrote an R package for completing my puzzles using the structure of an R package. The puzzles start out easy and get progressively more elaborate or devious in their requirements. But I am going to talk about an easy puzzle in this post, and specifically, one little trick I used in my solution.

Day 2 of 2022 requires us to score games of Rock Paper Scissors. The moves are encoded using letters, where our opponent’s moves are coded as A, B, C and ours are coded as X, Y, Z. So, an input describing three moves will look like the following:

example_input <- c(
  "A Y",
  "B X",
  "C Z"
)

Where the letters mean the following:

move_codes <- c(
  "A" = "rock",
  "B" = "paper",
  "C" = "scissors",
  "X" = "rock",
  "Y" = "paper",
  "Z" = "scissors"
)

This encoding seems like a weird bit of indirection thrown on, and it is, because the puzzle changes the meanings of the letters in Part 2. Still, it is straightforward to parse the input into a list of roshambo moves.

input <- example_input |> 
  strsplit(" ") |> 
  # Use character subsetting to convert letters to moves
  lapply(function(x) unname(move_codes[x])) 

# Our character's move is the second element in each vector
str(input)
#> List of 3
#>  $ : chr [1:2] "rock" "paper"
#>  $ : chr [1:2] "paper" "rock"
#>  $ : chr [1:2] "scissors" "scissors"

Now, for the point of this post, how do we score each game?

The naive approach is to start typing away furiously

before eventually noping the hell out of there.

What we have is a decision tree: we need to follow a branch for player one and another branch for player two. And here’s the main point of this post: nested lists are trees. (Yes, I love lists—see this post where I use them in my knitr reporting.) The top (outer) level of the list will be all of the player one options, and then the bottom (inner) level will be all the player two options. The leaves of the tree (bottom level values) are the outcomes of the games.

run_game <- function(pair) {
  # nested lists are trees
  rules <- list(
    rock = list(
      rock = "draw",
      scissors = "lose",
      paper = "win"
    ),
    scissors = list(
      scissors = "draw",
      rock = "win",
      paper = "lose"
    ),
    paper = list(
      paper = "draw",
      scissors = "win",
      rock = "lose"
    )
  )

  # Because `rules[[pair[1]]][[pair[2]]]` is unsightly:
  rules |>
    getElement(pair[1]) |>
    getElement(pair[2])
}

At this point, we could take a second to ponder how the structure of several nested if-elses—the actual shape of the code, indenting in and out in and in again—resembles the structure and the shape of the nested list, and ponder further about how the regular, orderly shape of code could be the whispers of hidden data, saying “list() me, list() me”. Or, we could run the code and see it in action.

input |> 
  lapply(run_game)
#> [[1]]
#> [1] "win"
#> 
#> [[2]]
#> [1] "lose"
#> 
#> [[3]]
#> [1] "draw"

# Or to repeat the input
input |> 
  stats::setNames(input) |> 
  lapply(run_game)
#> $`c("rock", "paper")`
#> [1] "win"
#> 
#> $`c("paper", "rock")`
#> [1] "lose"
#> 
#> $`c("scissors", "scissors")`
#> [1] "draw"

Earlier in the post, I used character subsetting to convert letters into moves. This process turned a matching/replacement problem into a data lookup problem. The Rock Paper Scissors rules are the same trick again: converting a decision tree into a data lookup problem.
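The lookup idea extends to the scoring itself. Here is a minimal sketch that turns each game into points using two more named vectors. The point values (1, 2, 3 for the move we play and 0, 3, 6 for the outcome) are my recollection of the puzzle's Part 1 rules and are not stated in this post, so treat them as an assumption. `score_game()` and the `*_points` vectors are names invented for this example.

```r
# Assumed point values from the puzzle description (not given in this post)
move_points <- c(rock = 1, paper = 2, scissors = 3)
outcome_points <- c(lose = 0, draw = 3, win = 6)

# Same nested-list tree as above: opponent's move, then our move
rules <- list(
  rock     = list(rock = "draw", scissors = "lose", paper = "win"),
  scissors = list(scissors = "draw", rock = "win", paper = "lose"),
  paper    = list(paper = "draw", scissors = "win", rock = "lose")
)

score_game <- function(pair) {
  outcome <- rules[[pair[1]]][[pair[2]]]
  # Two more lookups: points for our move plus points for the outcome
  unname(move_points[pair[2]] + outcome_points[outcome])
}

games <- list(
  c("rock", "paper"),
  c("paper", "rock"),
  c("scissors", "scissors")
)

scores <- vapply(games, score_game, numeric(1))
scores
sum(scores)
```

Every step is a lookup by name, so changing the rules or the point values means editing data rather than rewriting branches.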

Last knitted on 2022-12-06. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 22621)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-12-06
#>  pandoc   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  asciicast     2.3.0   2022-12-05 [1] CRAN (R 4.2.2)
#>  cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.1)
#>  curl          4.3.3   2022-10-06 [1] CRAN (R 4.2.1)
#>  evaluate      0.18    2022-11-07 [1] CRAN (R 4.2.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  git2r         0.30.1  2022-03-16 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  here          1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.3   2022-10-21 [1] CRAN (R 4.2.1)
#>  knitr       * 1.40    2022-08-24 [1] CRAN (R 4.2.1)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
#>  magick        2.7.3   2021-08-18 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  processx      3.8.0   2022-10-26 [1] CRAN (R 4.2.1)
#>  ps            1.7.2   2022-10-26 [1] CRAN (R 4.2.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg          1.2.4   2022-10-24 [1] CRAN (R 4.2.1)
#>  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.1)
#>  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
#>  systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  V8            4.2.2   2022-11-03 [1] CRAN (R 4.2.2)
#>  vctrs         0.5.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.34    2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/trist/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


Creating a Summoning Salt-style speedrun plot

2022-05-24T00:00:00-05:00

A videogame speedrun is a challenge to beat the game as quickly as possible. It’s time attack racing but for a videogame. There are, in my mind, two ways to make a run’s time go faster: Playing better and more smoothly (optimizations, having better luck) and playing less of the game (better routing, new glitches/skips). The history of a speedrun category then is often an exciting mix of evolutionary improvements as players level up their skills and revolutionary jumps as players find new ways to cut through the game.

Summoning Salt is a YouTube creator who creates documentaries that trace out the world record progression in a speedrun. The videos are immensely enjoyable, as Salt dishes out the history bit by bit, record by record, sometimes in a suspenseful fashion.

As a data visualization person, I’ve noticed that Summoning Salt recently started to use a new prop in the videos: A step graph of the world record times. The graph is developed throughout a video as players (represented by individual colors) lower the times with new records (points) until you get a full reveal of a timeline like the following:

Screenshot of a timeline from a Summoning Salt video.

Let’s recreate this figure in R with ggplot2.

Warp pipe: Obtaining the data

The game in question is New Super Mario Bros Wii, and the record keeper is the site speedrun.com. There is not just one speedrun category for this game, so in particular, we want the “Any%” record history (i.e., “any percent”: you don’t have to play every level, and you can skip parts of the game).

We need to get the leaderboard history data from speedrun.com. There is an official REST API for the site’s data, but it’s not straightforward how to query it to obtain the data needed for a world record progression. (Apparently, one could request the leaderboard on different dates and work backwards through time.) But that’s okay, we are not going to use the API. Instead, the statistics page for the game has a plot that is tantalizingly close to the one we want to create.

A timeline figure from speedrun.com.

This plot is interactive, and our browser is downloading the data and plotting it for us. If we snoop around the page, we can find the JSON data behind the plot. In Firefox, when I right-click on the plot and hit “Inspect”, I see the HTML code that contains the plot. Just below the plot’s div is a chunk of JavaScript.

A screenshot of the Firefox inspector showing the speedrun data in a JavaScript script tag.

The first line of it is all the speedrun data that is being plotted. We save that JSON into its own file.

Ground pound: Filtering and cleaning the data

Let’s read the data into R. JSON is short for “JavaScript Object Notation”, and it’s basically the equivalent of a list() in R. Hence, jsonlite provides a large, deeply nested list for us.

library(tidyverse)

# a helper function to download the data from github
# in case you want to play along
path_blog_data <- function(x) {
  file.path(
    "https://raw.githubusercontent.com",
    "tjmahr/tjmahr.github.io/master/_R/data",
    x
  )
}

json_runs <- path_blog_data("2022-05-23-nsmbw-runs.json") |> 
  jsonlite::read_json()
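To see the JSON-to-list correspondence on a small scale, here is a sketch that parses a hand-written JSON string with `jsonlite::parse_json()`. The two records reuse values from the first two runs in the real data, but the `link` field is dropped, and `json_text` and `runs` are names chosen for this example.

```r
library(jsonlite)

# A tiny JSON array of objects, mimicking the shape of the speedrun data
json_text <- '[
  {"x": 1306670400, "y": 1616, "players": ["RaikerZ"]},
  {"x": 1325246400, "y": 1549, "players": ["RaikerZ"]}
]'

# By default, arrays become unnamed lists and objects become named lists
runs <- parse_json(json_text)
str(runs)

# Drill down with the usual list tools
runs[[1]]$players[[1]]
runs[[2]]$y
```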

The plot on the statistics page has a dropdown menu for different kinds of records to display, so this JSON object has a sublist for each dropdown menu choice. What we want is the first sublist (full game runs) then its first sublist (with a label of "Any% - Physical") then its "data".

# Dropdown menu choices
str(json_runs, max.level = 1)
#> List of 10
#>  $ 0   :List of 7
#>  $ 6789:List of 18
#>  $ 6805:List of 18
#>  $ 6815:List of 18
#>  $ 6826:List of 19
#>  $ 6841:List of 18
#>  $ 6846:List of 20
#>  $ 6859:List of 19
#>  $ 6868:List of 22
#>  $ 6882:List of 18

# Full game run histories
str(json_runs[[1]], max.level = 2)
#> List of 7
#>  $ :List of 7
#>   ..$ label                    : chr "Any% - Physical"
#>   ..$ data                     :List of 30
#>   ..$ borderColor              : chr "#EE4444"
#>   ..$ pointBorderColor         : chr "#EE4444"
#>   ..$ pointHoverBackgroundColor: chr "#EE4444"
#>   ..$ hidden                   : logi FALSE
#>   ..$ steppedLine              : logi TRUE
#>  $ :List of 7
#>   ..$ label                    : chr "Cannonless - Physical"
#>   ..$ data                     :List of 25
#>   ..$ borderColor              : chr "#EF8241"
#>   ..$ pointBorderColor         : chr "#EF8241"
#>   ..$ pointHoverBackgroundColor: chr "#EF8241"
#>   ..$ hidden                   : logi FALSE
#>   ..$ steppedLine              : logi TRUE
#>  $ :List of 7
#>   ..$ label                    : chr "100% - Physical"
#>   ..$ data                     :List of 17
#>   ..$ borderColor              : chr "#F0C03E"
#>   ..$ pointBorderColor         : chr "#F0C03E"
#>   ..$ pointHoverBackgroundColor: chr "#F0C03E"
#>   ..$ hidden                   : logi FALSE
#>   ..$ steppedLine              : logi TRUE
#>  $ :List of 7
#>   ..$ label                    : chr "Any% No W5 - Physical"
#>   ..$ data                     :List of 22
#>   ..$ borderColor              : chr "#8AC951"
#>   ..$ pointBorderColor         : chr "#8AC951"
#>   ..$ pointHoverBackgroundColor: chr "#8AC951"
#>   ..$ hidden                   : logi TRUE
#>   ..$ steppedLine              : logi TRUE
#>  $ :List of 7
#>   ..$ label                    : chr "Low% - Physical"
#>   ..$ data                     :List of 18
#>   ..$ borderColor              : chr "#09B876"
#>   ..$ pointBorderColor         : chr "#09B876"
#>   ..$ pointHoverBackgroundColor: chr "#09B876"
#>   ..$ hidden                   : logi TRUE
#>   ..$ steppedLine              : logi TRUE
#>  $ :List of 7
#>   ..$ label                    : chr "Any% Multiplayer - Physical"
#>   ..$ data                     :List of 11
#>   ..$ borderColor              : chr "#44BBEE"
#>   ..$ pointBorderColor         : chr "#44BBEE"
#>   ..$ pointHoverBackgroundColor: chr "#44BBEE"
#>   ..$ hidden                   : logi TRUE
#>   ..$ steppedLine              : logi TRUE
#>  $ :List of 7
#>   ..$ label                    : chr "All Regular Exits - Physical"
#>   ..$ data                     :List of 7
#>   ..$ borderColor              : chr "#6666EE"
#>   ..$ pointBorderColor         : chr "#6666EE"
#>   ..$ pointHoverBackgroundColor: chr "#6666EE"
#>   ..$ hidden                   : logi TRUE
#>   ..$ steppedLine              : logi TRUE

# Just want the data field from the first one
json_any_percent <- json_runs[[1]][[1]][["data"]]

Here are the first two points’ worth of data. We have a not-so-obviously encoded date (x), the run length in seconds (y) and the player. We are going to convert each of these lists into a dataframe and bind them together.

json_any_percent |> head(2) |> str()
#> List of 2
#>  $ :List of 4
#>   ..$ x      : int 1306670400
#>   ..$ y      : int 1616
#>   ..$ players:List of 1
#>   .. ..$ : chr "RaikerZ"
#>   ..$ link   : chr "/nsmbw/run/2216987"
#>  $ :List of 4
#>   ..$ x      : int 1325246400
#>   ..$ y      : int 1549
#>   ..$ players:List of 1
#>   .. ..$ : chr "RaikerZ"
#>   ..$ link   : chr "/nsmbw/run/2216995"

data <- json_any_percent |> 
  lapply(
    # turn one list into a dataframe
    function(x) { 
      tibble(
        date = x$x, 
        run_time_s = x$y, 
        player = x$players[[1]]
      )
    }
  ) |> 
  bind_rows()

data
#> # A tibble: 30 × 3
#>          date run_time_s player       
#>         <int>      <int> <chr>        
#>  1 1306670400       1616 RaikerZ      
#>  2 1325246400       1549 RaikerZ      
#>  3 1332763200       1531 RaikerZ      
#>  4 1349870400       1527 RaikerZ      
#>  5 1457179200       1526 GreenUprooter
#>  6 1461585600       1523 Auchgard     
#>  7 1461672000       1522 Auchgard     
#>  8 1461758400       1519 Auchgard     
#>  9 1470744000       1514 Auchgard     
#> 10 1471521600       1512 Auchgard     
#> # … with 20 more rows

Lastly, we need to do something about those dates. When you see a date-time represented by a single large number, it’s probably a POSIX date representing the date-time as the number of seconds since some origin date-time (see also Unix Time). Using the default Unix origin time seems to give the correct date conversion:

data <- data |> 
  mutate(
    date_posix = as.POSIXct(date, tz = "UTC", origin = "1970-01-01")
  ) 

data
#> # A tibble: 30 × 4
#>          date run_time_s player        date_posix         
#>         <int>      <int> <chr>         <dttm>             
#>  1 1306670400       1616 RaikerZ       2011-05-29 12:00:00
#>  2 1325246400       1549 RaikerZ       2011-12-30 12:00:00
#>  3 1332763200       1531 RaikerZ       2012-03-26 12:00:00
#>  4 1349870400       1527 RaikerZ       2012-10-10 12:00:00
#>  5 1457179200       1526 GreenUprooter 2016-03-05 12:00:00
#>  6 1461585600       1523 Auchgard      2016-04-25 12:00:00
#>  7 1461672000       1522 Auchgard      2016-04-26 12:00:00
#>  8 1461758400       1519 Auchgard      2016-04-27 12:00:00
#>  9 1470744000       1514 Auchgard      2016-08-09 12:00:00
#> 10 1471521600       1512 Auchgard      2016-08-18 12:00:00
#> # … with 20 more rows
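As a quick check of the conversion outside of the dataframe, here is a small sketch using the first two timestamps from the data. The variable names `secs` and `stamps` are just for this example.

```r
# Seconds since the Unix epoch (1970-01-01 00:00:00 UTC)
secs <- c(1306670400, 1325246400)

stamps <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(stamps, "%Y-%m-%d %H:%M:%S")
#> [1] "2011-05-29 12:00:00" "2011-12-30 12:00:00"

# The conversion is reversible: a POSIXct is still a number of seconds
as.numeric(stamps)

# Drop the time of day if only the date matters
as.Date(stamps)
```

Each timestamp lands exactly on noon UTC, which suggests the site stores only the date of a run and fills in a fixed time of day.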

Triple jump: Plotting

First, let’s get the data on the panel. I could spend an endless amount of time tweaking or customizing a plot’s theme, so I do the styling last. Otherwise, styling would fill up all of the time I’ve set aside to work on the plot.

We want to draw a point for each particular record-setting event, and we want to draw a line that connects all of the points. geom_step() draws a line plot but it can move straight up/down or straight left/right—no diagonal lines—so it’s what we want. We also want the color of these geometries to change with the record holder (player).

ggplot(data) + 
  aes(x = date_posix, y = run_time_s, color = player) + 
  geom_step() +
  geom_point()

Oops! It assumed that we wanted to connect the dots separately for each color. We have to set the group aesthetic to a constant value so there is only one line drawn.

ggplot(data) + 
  aes(x = date_posix, y = run_time_s, color = player) + 
  geom_step(aes(group = 1)) +
  geom_point()

Making the Summoning Salt version is just a matter of theming at this point. We use theme_void() to completely wipe out the current theme, and we hide the color legend.

ggplot(data) + 
  aes(x = date_posix, y = run_time_s, color = player) + 
  geom_step(aes(group = 1)) +
  geom_point() + 
  theme_void() +
  guides(color = "none")

Next, we are going to use the showtext package to obtain an 8-bit font:

library(showtext)
#> Loading required package: sysfonts
#> Loading required package: showtextdb
font_add_google("Press Start 2P")
showtext_auto(TRUE)

The void theme provides nothing, so we have to specify the main colors, the axis lines, and the plotting margin. We also crank up the chroma values to have more intense colors for the black background.

ggplot(data) + 
  aes(x = date_posix, y = run_time_s, color = player) + 
  geom_step(aes(group = 1)) +
  geom_point() +
  guides(color = "none") + 
  scale_color_discrete(c = 255) +
  labs(title = "World Record Timeline") +
  theme_void(base_size = 20, base_family = "Press Start 2P") +
  theme(
    plot.title = element_text(color = "white", hjust = .5), 
    plot.background = element_rect(fill = "black"),
    axis.line = element_line(
      color = "white", 
      size = 1, 
      # more 8-bit looking lines
      lineend = "square"
    ), 
    plot.margin = margin(12, 12, 12, 12, "pt")
  ) 

To keep overlapping points from looking like blobs, we can use a filled point. For these, color is used on the border and fill is used on the inside. We will set the outline of the points to black and the fill to the player color. (If you look at more professional data visualizations, you see this trick frequently with white bordering around points.) With a new fill aesthetic in place, we have to make sure that the guide for the fill doesn’t appear and that fill and color have the same color scale.

ggplot(data) + 
  aes(x = date_posix, y = run_time_s, color = player) + 
  geom_step(aes(group = 1)) +
  geom_point() +
  geom_point(
    aes(fill = player),
    shape = 21,
    color = "black", 
    size = 2
  ) +
  # no legend for fill
  guides(color = "none", fill = "none") + 
  # fill and color get same scale
  scale_color_discrete(c = 255, aesthetics = c("color", "fill")) +
  labs(title = "World Record Timeline") +
  theme_void(base_size = 20, base_family = "Press Start 2P") +
  theme(
    plot.title = element_text(color = "white", hjust = .5), 
    plot.background = element_rect(fill = "black"),
    axis.line = element_line(
      color = "white", 
      size = 1, 
      lineend = "square"
    ), 
    plot.margin = margin(12, 12, 12, 12, "pt")
  ) 

Finally, let’s make another version of this figure. How might we make a more accessible presentation of this information (of who held a record and when), assuming that we only have a static image? A legend with players/colors is a nonstarter. We could give each player their own distinct point shape so that color/shape encode the same information, but shapes get rough once you have to use more than four of them. We could use a player’s first letter instead of a point (show an F for FadeVanity) but the letters quickly overlap.

One idea would be to label the point with an annotation whenever there is a new record holder.

showtext_auto(FALSE)
data <- data |>
  mutate(
    # Remove the country flag annotation from this player
    player2 = ifelse(
      player == "[gb/eng]FadeVanity", 
      "FadeVanity", 
      player
    ),
    # Record whenever the title holder changes as an "era"
    change = player != lag(player) | is.na(lag(player)),
    era = cumsum(change)
  ) 

# I am going to hardcode some vertical position adjustments for the labels.
offsets <- c(1, 2, 1, 5, -1, 5, 4, 3, 2, 1, -4, -3, -2, -1)

data_lab <- data |> 
  group_by(era) |> 
  # Label the first point in an era (its slowest time, since records only fall)
  filter(run_time_s == max(run_time_s)) |> 
  ungroup() |> 
  mutate(offset = offsets)

nudge_factor <- 30
ggplot(data) + 
  aes(x = date_posix, y = run_time_s, color = player) + 
  geom_text(
    aes(
      label = player2,
      y = run_time_s + nudge_factor * offset 
    ),
    hjust = 0,
    size = 4,
    data = data_lab
  ) +
  geom_segment(
    aes(
      # i.e., run the line up to .95 of the label's nudging
      yend = run_time_s + nudge_factor * offset * .95, 
      xend = date_posix
    ),     
    data = data_lab, 
    linetype = "dashed"
  ) + 
  geom_step(aes(group = 1), size = 1) +
  geom_point(size = 3) +
  # yes, I'm adding forty million seconds to the last datetime
  expand_limits(x = max(data$date_posix) + 4e7) +
  guides(color = "none") +
  scale_x_datetime(
    name = NULL,
    date_breaks = "2 years", 
    date_labels = "%Y"
  ) +
  scale_y_continuous(
    name = "World record",
    breaks = 21:27 * 60,
    # Show the minutes value with zero-padded seconds
    labels = function(x) sprintf("%d:%02.f", x %/% 60, x %% 60)
  ) + 
  theme_minimal(base_size = 14) +
  theme(plot.margin = margin(12, 12, 12, 12, "pt"))
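The era bookkeeping used for the labels can be checked on a toy vector. This is a minimal base R sketch with made-up player names; `previous` is a stand-in for `lag(player)`, and none of these values come from the speedrun data.

```r
# Made-up sequence of record holders, in chronological order
player <- c("A", "A", "B", "B", "B", "A", "C")

# Base R stand-in for dplyr::lag(): shift everything down by one row
previous <- c(NA, head(player, -1))

# TRUE on the first row and whenever the holder differs from the previous row
change <- is.na(previous) | player != previous

# A running count of the changes gives each stretch its own id
era <- cumsum(change)
era
#> [1] 1 1 2 2 2 3 4
```

Player A's two stretches get different era numbers (1 and 3), which is why grouping by era rather than by player gives one label per reign.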

Last knitted on 2022-05-27. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom         0.8.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  cachem        1.0.6   2021-08-19 [1] CRAN (R 4.2.0)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  curl          4.3.2   2021-06-23 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  downlit       0.4.0   2021-10-29 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  farver        2.1.0   2021-02-28 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  git2r         0.30.1  2022-03-16 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.2.0)
#>  here          1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.2.0)
#>  httr          1.4.3   2022-05-04 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr       * 1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  labeling      0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.2.0)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg          1.2.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  showtext    * 0.9-5   2022-02-09 [1] CRAN (R 4.2.0)
#>  showtextdb  * 3.0     2020-06-04 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  sysfonts    * 0.8.8   2022-03-13 [1] CRAN (R 4.2.0)
#>  systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble      * 3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


The cursed Morgan Stanley Covid-19 visualization

2022-03-23T00:00:00-05:00

Darren Dahly, username @statsepi, asked people on Twitter to share some of their favorite or least favorite data visualizations from the pandemic. I nominated the notorious “cubic fit” ‘forecast’ from the Council of Economic Advisers. But then there was the reply by Travis Whitfill, username @twhitfill, showing a nightmare of a figure from a report produced by Morgan Stanley:

I’d like to submit this one from Morgan Stanley 🤦🏻‍♂️ pic.twitter.com/D5CYi6zSrT

— Travis Whitfill MPH (@twhitfill) March 21, 2022

The main statistical problem here is the completely inappropriate “smoothing” line. The panel on the left is really two linear trends: a steady trend around 8,500 patients until May 6th and a decreasing trend from 11,000 patients starting on May 7th. Upon seeing data like these points, I would be inclined to ask, “What changed in the data? Was a new state added to the dataset? Did the definition of what counts as an ICU bed change?” The analysts here instead imposed a linear trend on the points.

Another problem with this plot is rhetorical: it’s tryhard counterintuitive bullshit. I think analysts will fetishize surprising or counterintuitive findings, with an attitude of “oh, you would think that such-and-such is true but the data show us that actually the opposite is true”. At the time of this plot, our belief was something like “Covid-19 protections like stay-at-home orders can help flatten the curve and reduce the spread of the disease and the number of hospitalizations.” This plot sashays into the room and tells us “well, according to the data, it’s the states without Covid-19 protections that have decreasing numbers of ICU patients, and get this: Covid lockdowns make things worse!”. Granted, I could not find the original report for this image, so I don’t know how the authors interpreted it in the report’s narrative. Yet, I can only assume the authors added these linear trend lines–overriding the default GAM or LOESS smooth used by stat_smooth()–to make this particular point.

When I first saw it, this plot made me quip: “I hate statistics now. it’s been a good run. gonna live my days out as a druid”. But it’s been a few days, and I’m still haunted by this plot. What did go wrong? Why do the ICU counts shoot upwards like that? So, I investigated it.

Attempt 1: There is no jump

I tried to find the original report, searching Google and Twitter for a report with this image from around May 12, 2020 (when @twhitfill first shared it), but nothing came up. After dredging through a bunch of Morgan Stanley report PDFs, I noticed that the reports usually had a small number of authors, so I am wondering whether (and hoping that) the original report was something more akin to a dashed-off newsletter than a research report.

Failing to find the original image, I tried to recreate it in R. The original image credits The COVID Tracking Project, and their downloads page provides a .csv file with state-level data. Here we read in just the relevant columns, filter down to the time range of the cursed image, and plot the total number of current ICU patients.

library(tidyverse)

# a helper function to download the data from github
# in case you want to play along
path_blog_data <- function(x) {
  file.path(
    "https://raw.githubusercontent.com",
    "tjmahr/tjmahr.github.io/master/_R/data",
    x
  )
}

data <- readr::read_csv(
  path_blog_data("all-states-history.csv"), 
  col_types = cols(
    date = col_date(), 
    state = col_character(), 
    inIcuCurrently = col_number(), 
    .default = col_skip()
  ), 
  progress = FALSE
)

data <- data %>% 
  filter(
    as.Date("2020-04-28") <= date,
    date <= as.Date("2020-05-11")
  )

ggplot(data) + 
  aes(x = date, y = inIcuCurrently) + 
  stat_summary(fun = "sum", geom = "point", size = 3) +
  labs(
    x = "Date", 
    y = "Current patients in ICU",
    caption = "Data from The COVID Tracking Project (March 23, 2022)"
  ) +
  theme_grey(base_size = 16)
#> Warning: Removed 454 rows containing non-finite values (stat_summary).

There is no jump in ICU patients ❌, and because the jump disappeared when we used a more recent (and presumably better) version of the dataset, the jump was probably some kind of artifact.

Out of curiosity, let’s look at the state-by-state data. Because (spoiler alert) about half the states only have NA values for this time period, we will filter out the NA points and look at the remaining points.

ggplot(data %>% filter(!is.na(inIcuCurrently))) + 
  aes(x = date, y = inIcuCurrently) + 
  geom_point() +
  facet_wrap("state") +
  labs(
    x = "Date", 
    y = "Current patients in ICU",
    caption = "Data from The COVID Tracking Project (March 23, 2022)"
  ) +   
  theme_grey(base_size = 12)

So, some states have ICU patient data added midway through this window and many states are completely missing data from this window. The whole open-versus-closed-states question was doomed from the get-go because we don’t know what happened in every state.

Attempt 2: Let’s go back in time

If we poke around the COVID Tracking Project’s GitHub repository, we find a folder of data backups with a file called states_daily_4pm_et.csv. This file provides the same result as the previously loaded data.

data <- readr::read_csv(
  path_blog_data("states_daily_4pm_et.csv"), 
  col_types = cols(
    date = col_date("%Y%m%d"),
    state = col_character(),
    inIcuCurrently = col_number(),
    .default = col_skip()
  ),
  progress = FALSE
)

data <- data %>% 
  filter(
    as.Date("2020-04-28") <= date,
    date <= as.Date("2020-05-11")
  )

ggplot(data) + 
  aes(x = date, y = inIcuCurrently) + 
  stat_summary(fun = "sum", geom = "point", size = 3) +
  labs(
    x = "Date", 
    y = "Current patients in ICU",
    caption = "Data from The COVID Tracking Project (March 23, 2022)"
  ) +
  theme_grey(base_size = 16)
#> Warning: Removed 454 rows containing non-finite values (stat_summary).

But because this file is hosted on GitHub, we can go back in time and find the version of the data from May 12, 2020 and use that file instead.
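As a rough sketch of what "going back in time" can look like: raw.githubusercontent.com URLs take a branch name or a commit hash in the third path position, so an older version of a file can be addressed directly. The helper name `raw_github_url()` and the repository, hash, and path below are placeholders for illustration, not the ones used for this post.

```r
# Hypothetical helper: build a raw-file URL for a given branch or commit
raw_github_url <- function(repo, ref, path) {
  file.path("https://raw.githubusercontent.com", repo, ref, path)
}

# Placeholder values; swap in a real repository, commit hash, and file path
old_version <- raw_github_url(
  repo = "some-owner/some-repo",
  ref = "0123abc",
  path = "data/example.csv"
)
old_version
#> [1] "https://raw.githubusercontent.com/some-owner/some-repo/0123abc/data/example.csv"
```

The code that follows reads a copy of the older file saved alongside the blog's data instead.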

data <- readr::read_csv(
  path_blog_data("2020-05-12-states_daily_4pm_et.csv"), 
  col_types = cols(
    date = col_date("%Y%m%d"),
    state = col_character(),
    inIcuCurrently = col_number(),
    .default = col_skip()
  ),
  progress = FALSE
)

data <- data %>% 
  filter(
    as.Date("2020-04-28") <= date,
    date <= as.Date("2020-05-11")
  )

ggplot(data) + 
  aes(x = date, y = inIcuCurrently) + 
  stat_summary(fun = "sum", geom = "point", size = 3) +
  labs(
    x = "Date", 
    y = "Current patients in ICU",
    caption = "Data from The COVID Tracking Project (May 12, 2020)"
  ) +
  theme_grey(base_size = 16)
#> Warning: Removed 477 rows containing non-finite values (stat_summary).

There it is: the jump in ICU patients on May 7th ✔️. Let’s look at the state-by-state data:

ggplot(data %>% filter(!is.na(inIcuCurrently))) + 
  aes(x = date, y = inIcuCurrently) + 
  geom_point() +
  facet_wrap("state") +
  labs(
    x = "Date", 
    y = "Current patients in ICU",
    caption = "Data from The COVID Tracking Project (May 12, 2020)"
  ) +
  theme_grey(base_size = 12)

Look at New York (NY)! That’s the jump in the original plot. New York had a large number of ICU patients but their data only became available on May 7th, giving the spurious increase in ICU patients.

By adding incomplete data from NY to the rest of the states, the analyst effectively treated all of the missing points in the NY panel as zeros.
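Here is a small sketch of that mechanism with made-up numbers (two fictional states, "AA" and "BB", not values from the dataset). Summing with the missing values dropped, which is what happens when `stat_summary()` removes the non-finite rows before summing, gives the same totals as if the missing days were zeros.

```r
# Made-up data: state BB only starts reporting on the third day
toy <- data.frame(
  date = rep(as.Date("2020-05-05") + 0:3, times = 2),
  state = rep(c("AA", "BB"), each = 4),
  icu = c(100, 100, 100, 100, NA, NA, 500, 500)
)

# Daily totals with missing values dropped
daily_total <- tapply(toy$icu, format(toy$date), sum, na.rm = TRUE)
as.vector(daily_total)
#> [1] 100 100 600 600

# How many states actually contributed to each total
n_reporting <- tapply(!is.na(toy$icu), format(toy$date), sum)
as.vector(n_reporting)
#> [1] 1 1 2 2
```

Nothing changed in either fictional state, yet the total jumps from 100 to 600 on the day the second state starts reporting. Counting the contributing states per day is one cheap way to notice this kind of artifact.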

What could they have done differently?

It’s fun to complain about haunted plots, but I will try to be constructive for a moment. How would a fixed version of this plot look?

Option 1: Don’t do it. Given all the missing and incomplete data, it’s just not worth it to make this plot.

Option 2: Don’t aggregate. Or we might embrace the missingness, and show all and only the data we have. Here is a sketch of this kind of approach. We will show individual state data and provide labels for the states that stand out from the pack. We will also note the number of missing lines in the caption.

data_for_plot <- data %>% 
  filter(!is.na(inIcuCurrently)) %>% 
  group_by(state) %>%
  mutate(state_icu_max = max(inIcuCurrently)) %>% 
  ungroup() 

total_regions <- data$state %>% unique() %>% length()
plotted_regions <- data_for_plot$state %>% unique() %>% length()

ggplot(data_for_plot) + 
  aes(x = date, y = inIcuCurrently) + 
  geomtextpath::geom_textline(
    aes(label = state, group = state, hjust = state),
    data = . %>% filter(state_icu_max > 250)
  ) +
  geomtextpath::scale_hjust_discrete() +
  geom_line(
    aes(group = state),
    data = . %>% filter(state_icu_max <= 250)) +
  labs(
    x = "Date", 
    y = "Current patients in ICU",
    caption = glue::glue(
      "
      Data from The COVID Tracking Project (May 12, 2020).
      No data available for {total_regions - plotted_regions} states/territories.
      "
    )
  ) + 
  theme_grey(base_size = 14)

And then we can put the linear regression “smooth” on it. 🙃

Update: Notes from the Tracking Project trenches [Mar. 24, 2022]

After releasing this post, COVID Tracking Project alum Quang Nguyen shared some behind-the-scenes details of what happened around May 7th, 2020. I will repost the Twitter thread here:

OMG @COVID19Tracking history lesson (short 🧵)!! First, shoutout to our data infrastructure folks @zachlipton @JuliaKodysh for the GitHub archive! Second, I actually dug through the slack to figure out what happened (jokes on me, I was shift lead that day). https://t.co/7U6LOm8HKE pic.twitter.com/iXdPO9EV6A
— Quang Nguyen (@quangpmnguyen) March 24, 2022

The problem was, back in May 2020, the only way you can get hospitalization data for the state of NY was to take low-res screenshots of the governor's presentation and then try to piece the information together (also shoutout to @justinhendrix for watching these press conf.). pic.twitter.com/5UnGP1RUox

— Quang Nguyen (@quangpmnguyen) March 24, 2022

Using this weird graph, we actually tried to back-calculate total hospitalization numbers, but unfortunately, it was super messy and nothing came out of it. This source also doesn't have current ICU numbers.
— Quang Nguyen (@quangpmnguyen) March 24, 2022

We actually found a new source from Twitter (!!) who apparently got these numbers from a press email list from the governor (??). May 7th was the first day where we got data directly from the email list, which was the BLIP in total ICU data that made it onto the disastrous graph.
— Quang Nguyen (@quangpmnguyen) March 24, 2022

The bottom line is: data from 2020 was a mess, and don't trust anything that came out of it. A group of volunteers taped it together using nothing but hot glue and scotch tape.
— Quang Nguyen (@quangpmnguyen) March 24, 2022

The fact they had to pull numbers from the graphs in the Governor’s Covid briefings is an important reminder that high-quality Covid-19 data was hard to come by at the start of the pandemic (especially from the Cuomo administration). We needed something like the COVID Tracking Project where volunteers would go to heroic lengths to curate data.

Last knitted on 2022-05-27. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports      1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  bit            4.0.4   2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64          4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  broom          0.8.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  cachem         1.0.6   2021-08-19 [1] CRAN (R 4.2.0)
#>  cellranger     1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli            3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon         1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  curl           4.3.2   2021-06-23 [1] CRAN (R 4.2.0)
#>  DBI            1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  dbplyr         2.1.1   2021-04-06 [1] CRAN (R 4.2.0)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  downlit        0.4.0   2021-10-29 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  farver         2.1.0   2021-02-28 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats      * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics       0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  geomtextpath   0.1.0   2022-01-24 [1] CRAN (R 4.2.0)
#>  ggplot2      * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  git2r          0.30.1  2022-03-16 [1] CRAN (R 4.2.0)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  haven          2.5.0   2022-04-15 [1] CRAN (R 4.2.0)
#>  here           1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms            1.1.1   2021-09-26 [1] CRAN (R 4.2.0)
#>  httr           1.4.3   2022-05-04 [1] CRAN (R 4.2.0)
#>  jsonlite       1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr        * 1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate      1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  memoise        2.0.1   2021-11-26 [1] CRAN (R 4.2.0)
#>  modelr         0.1.8   2020-05-19 [1] CRAN (R 4.2.0)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr        * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg           1.2.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  readr        * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl         1.4.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rprojroot      2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  rvest          1.0.2   2021-10-16 [1] CRAN (R 4.2.0)
#>  scales         1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr      * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  systemfonts    1.0.4   2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping    0.3.6   2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble       * 3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyr        * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tidyverse    * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)
#>  tzdb           0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  vroom          1.5.7   2021-11-30 [1] CRAN (R 4.2.0)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun           0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2           1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


Self-documenting plots in ggplot2

2022-03-10T00:00:00-06:00

When I am showing off a plotting technique in ggplot2, I sometimes like to include the R code that produced the plot as part of the plot. Here is an example I made to demonstrate the debug parameter in element_text():

library(ggplot2)

self_document(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 20, color = "white") +
    labs(title = "A basic histogram") +
    theme(axis.title = element_text(debug = TRUE))
)

Let’s call these “self-documenting plots”. If we’re feeling nerdy, we might also call them “qquines”, although they are not true quines.

In this post, we will build up a self_document() function from scratch. Here are the problems we need to sort out:

how to put plotting code above a title
how to capture plotting code and convert it into text

Creating the code annotation

As a first step, let’s just treat our plotting code as a string that is ready to use for annotation.

p_text <- 'ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 20, color = "white") +
  labs(title = "A basic histogram")'

p_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 20, color = "white") +
  labs(title = "A basic histogram")

In order to have a titled plot along with this annotation, we need some way to combine these two graphical objects together (the code and the plot produced by ggplot2). I like the patchwork package for this job. Here we use wrap_elements() to capture the plot into a “patch” that patchwork can annotate.

library(patchwork)
wrap_elements(p_plot) + 
  plot_annotation(title = p_text)

Let’s style this title to use a monospaced font. I use Windows and like Consolas, so I will use that font.

# Use default mono font if "Consolas" is not available
extrafont::loadfonts(device = "win", quiet = TRUE)
monofont <- ifelse(
  extrafont::choose_font("Consolas") == "", 
  "mono", 
  "Consolas"
)

title_theme <- theme(
  plot.title = element_text(
    family = monofont, hjust = 0, size = rel(.9), 
    margin = margin(0, 0, 5.5, 0, unit = "pt")
  )
)

wrap_elements(p_plot) + 
  plot_annotation(title = p_text, theme = title_theme)  

One problem with this setup is that the plotting code has to be edited in two places: the plot p_plot and the title p_text. As a result, it’s easy for these two pieces of code to fall out of sync with each other, turning our self-documenting plot into a lying liar plot.

The solution is pretty easy: Tell R that p_text is code with parse() and evaluate the code with eval():

wrap_elements(eval(parse(text = p_text))) + 
  plot_annotation(title = p_text, theme = title_theme)  

This works. It gets the job done. But we find ourselves in a clumsy workflow, either having to edit R code inside of quotes or editing the plot interactively and then having to wrap it in quotes. Let’s do better.

Capturing plotting code as a string

Time for some nonstandard evaluation. I will use the rlang package, although in principle we could use functions in base R to accomplish these goals.

First, we are going to use rlang::expr() to capture/quote/defuse the R code as an expression. We can print the code as code, print it as text, and use eval() to show the plot.

p_code <- rlang::expr(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 20, color = "white") +
    labs(title = "A basic histogram")
)

# print the expressions
p_code
#> ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 20, color = "white") + 
#>     labs(title = "A basic histogram")

# expression => text
rlang::expr_text(p_code)
#> [1] "ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 20, color = \"white\") + \n    labs(title = \"A basic histogram\")"

eval(p_code)

Then, it should be straightforward to make the self-documenting plot, right?

p_code <- rlang::expr(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 20, color = "white") +
    labs(title = "A basic histogram")
)

wrap_elements(eval(p_code)) + 
  plot_annotation(title = rlang::expr_text(p_code), theme = title_theme)  

Hey, it reformatted the title! Indeed, in the process of capturing the code, the code formatting was lost. To get something closer to the source code we provided, we have to reformat the captured code before we print it.

The styler package provides a suite of functions for reformatting code. We can define our own coding styles/formatting rules to customize how styler works. I like the styler rules used by Garrick Aden-Buie in his grkstyle package, so I will use grkstyle::grk_style_text() to reformat the code.

p_code <- rlang::expr(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 20, color = "white") +
    labs(title = "A basic histogram")
)

wrap_elements(eval(p_code)) + 
  plot_annotation(
    title = rlang::expr_text(p_code) |> 
      grkstyle::grk_style_text() |> 
      # reformatting returns a vector of lines,
      # so we have to combine them
      paste0(collapse = "\n"), 
    theme = title_theme
  ) 

Putting it all together

When we write our self_document() function, the only change we have to make is using rlang::enexpr() instead of rlang::expr(). The en-variant is used when we want to en-quote exactly what the user provided. Aside from that change, our self_document() function just bundles together all of the code we developed above:

self_document <- function(expr) {
  monofont <- ifelse(
    extrafont::choose_font("Consolas") == "", 
    "mono", 
    "Consolas"
  )
  
  p <- rlang::enexpr(expr)
  title <- rlang::expr_text(p) |> 
    grkstyle::grk_style_text() |> 
    paste0(collapse = "\n")
  
  patchwork::wrap_elements(eval(p)) + 
    patchwork::plot_annotation(
      title = title, 
      theme = theme(
        plot.title = element_text(
          family = monofont, hjust = 0, size = rel(.9), 
          margin = margin(0, 0, 5.5, 0, unit = "pt")
        )
      )
    )
}

And let’s confirm that it works.

library(ggplot2)
self_document(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 20, color = "white") +
    labs(title = "A basic histogram")
)
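To see why the en-variant matters inside a function, here is a small sketch comparing the two. The helper names `capture_with_expr()` and `capture_with_enexpr()` are made up for this illustration.

```r
library(rlang)

# expr() quotes the code written inside the function body: the symbol x
capture_with_expr <- function(x) expr(x)

# enexpr() quotes the code the caller supplied for the argument x
capture_with_enexpr <- function(x) enexpr(x)

expr_text(capture_with_expr(1 + 2))
#> [1] "x"

expr_text(capture_with_enexpr(1 + 2))
#> [1] "1 + 2"
```

With expr(), self_document() would only ever see the name of its own argument, so there would be no plotting code to print or evaluate.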

Because we developed this function on top of rlang, we can do some tricks like injecting a variable’s value when capturing the code. For example, here I use !! color to replace the color variable with the actual value.

color <- "white"
self_document(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 20, color = !! color) +
    labs(title = "A basic histogram")
)
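The injection can be seen in isolation by capturing a single call with and without `!!`. This is just a sketch: nothing here is evaluated as a plot, so ggplot2 does not need to be loaded.

```r
library(rlang)

color <- "white"

# Without !!, the captured code keeps the variable name
without_injection <- expr(geom_histogram(bins = 20, color = color))
expr_text(without_injection)
#> [1] "geom_histogram(bins = 20, color = color)"

# With !!, the variable's current value is spliced into the captured code
with_injection <- expr(geom_histogram(bins = 20, color = !!color))
expr_text(with_injection)
#> [1] "geom_histogram(bins = 20, color = \"white\")"
```

The second form is the one that shows up in the plot's title, so the printed code stays reproducible even when the reader does not have the color variable defined.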

And if you are wondering, yes, we can self_document() a self_document() plot.

self_document(
  self_document(
    ggplot(mtcars, aes(x = mpg)) +
      geom_histogram(bins = 20, color = "white") +
      labs(title = "A basic histogram")
  )
)

Alas, comments are lost

One downside of this approach is that helpful comments are lost.

self_document(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 20, color = !! color) +
    # get rid of that grey
    theme_minimal() +
    labs(title = "A basic histogram")
)

I am not sure how to include comments. One place where comments are stored and printed is in function bodies:

f <- function() {
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 20, color = !! color) +
  # get rid of that grey
  theme_minimal() +
  labs(title = "A basic histogram")
}

print(f, useSource = TRUE)
#> function() {
#> ggplot(mtcars, aes(x = mpg)) +
#>   geom_histogram(bins = 20, color = !! color) +
#>   # get rid of that grey
#>   theme_minimal() +
#>   labs(title = "A basic histogram")
#> }
#> 

I have no idea how to go about exploiting this feature for self-documenting plots, however.

Last knitted on 2022-05-27. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  cachem        1.0.6   2021-08-19 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  downlit       0.4.0   2021-10-29 [1] CRAN (R 4.2.0)
#>  dplyr         1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  extrafont     0.18    2022-04-12 [1] CRAN (R 4.2.0)
#>  extrafontdb   1.0     2012-06-11 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  farver        2.1.0   2021-02-28 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  git2r         0.30.1  2022-03-16 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  grkstyle      0.0.3   2022-05-25 [1] Github (gadenbuie/grkstyle@6a7011c)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  here          1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  knitr       * 1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  labeling      0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  patchwork   * 1.1.1   2020-12-17 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg          1.2.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  Rttf2pt1      1.3.10  2022-02-07 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


Custom syntax highlighting themes in RMarkdown (and pandoc)

2021-11-17T00:00:00-06:00

I recently developed and released an R package called solarizeddocx. It provides solarizeddocx::document(), an RMarkdown output format for solarized-highlighted Microsoft Word documents. The image below shows a comparison of the solarizeddocx format and the default docx format:

Side-by-side comparison of solarizeddocx::document() and rmarkdown::word_document().

The package provides a demo document which is essentially a vignette where I describe all the customizations used by the package and put the syntax highlighting to the test. The demo can be rendered and viewed with:

# install.packages("devtools")
devtools::install_github("tjmahr/solarizeddocx")
solarizeddocx::demo_document()

The format can be used in an RMarkdown document via YAML metadata.

output: 
  solarizeddocx::document: default

Or explicitly with rmarkdown:

rmarkdown::render(
  "README.Rmd", 
  output_format = solarizeddocx::document()
)

solarizeddocx also exports its document assets so that they can be used in other output formats, and it exports theme-building tools to create new pandoc syntax highlighting themes. I am most proud of these features, so I will demonstrate each of these in turn and create a brand new syntax highlighting theme in this post.

knitr: .Rmd to .md conversion

To give a simplified description, RMarkdown works by knitting the code in an RMarkdown (.Rmd) file with knitr to obtain a markdown (.md) file and then post-processing this knitr output with other tools. In particular, it uses pandoc which converts between all kinds of document formats. For this demonstration, we will do the knitting and pandoc steps separately without relying on RMarkdown. That said, the options we pass to pandoc can usually be used in RMarkdown (as we demonstrate at the very end of this post).

Our input file is a small .Rmd file. It’s very basic, meant to illustrate some function calls, strings, numbers, code comments and output.

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
Fit a model with `lm`():
```{r}
model <- lm(mpg ~ 1 + cyl, mtcars)
coefs <- coef(model)

# prediction for 8 cylinders
coefs["(Intercept)"] + 8 * coefs["cyl"]

predict(model, data.frame(cyl = 8L))
```

We knit() the document to run the code and store results in a markdown file. (Actually, we use knit_child() because I was getting some weird using-knit()-inside-of-knit() issues when rendering this post. But in general, we would knit().)

md_file <- tempfile(fileext = ".md")
knit_func <- if(interactive()) knitr::knit else knitr::knit_child
knit_func(
  solarizeddocx::file_code_block(), 
  output = md_file,
  quiet = TRUE
)

This is the content of the file.

Fit a model with `lm`():

```r
model <- lm(mpg ~ 1 + cyl, mtcars)
coefs <- coef(model)

# prediction for 8 cylinders
coefs["(Intercept)"] + 8 * coefs["cyl"]
#> (Intercept) 
#>    14.87826

predict(model, data.frame(cyl = 8L))
#>        1 
#> 14.87826
```

pandoc: .md to everything conversion

Everything we do with syntax highlighting occurs at this point when we have an .md file. For this demo, we will use pandoc to convert this .md file to an HTML document.

To make life easier, let’s set up a workflow for quickly converting a .md file to an HTML document and taking a screenshot of the document. run_pandoc() is a wrapper over rmarkdown::pandoc_convert() that hard-codes some output options and lets us more easily forward options to pandoc through the dots (...). page_thumbnail() is a wrapper over webshot::webshot() with some predefined output options. pd_style() and pd_syntax() are helpers we will use later for setting pandoc options.

run_pandoc <- function(input, ...) {
  output <- tempfile(fileext = ".html")
  rmarkdown::pandoc_convert(
    input, 
    to = "html5", 
    output = output,
    options = c(
      "--standalone", 
      ...
    )
  )
  output
}

page_thumbnail <- function(url, file, ...) {
  webshot::webshot(
    url = url,  
    file = file,
    vwidth = 500, 
    vheight = 350,
    zoom = 2
  )
}
pd_style <- function(x) c("--highlight-style", x)
pd_syntax <- function(x) c("--syntax-definition", x)

# Update from May 2022: Make file paths into urls
url_file <- function(x) paste0("file://localhost/", x)

These tools let us preview the default syntax highlighting in pandoc:

results <- run_pandoc(md_file, pd_style("tango"))
page_thumbnail(url_file(results), "shot1.png")

Setting pandoc options

Here is the pandoc HTML output but this time using my solarized (light) highlighting style:

theme_sl <- solarizeddocx::file_solarized_light_theme()
results <- run_pandoc(md_file, pd_style(theme_sl))
page_thumbnail(url_file(results), "shot2.png")

By convention, we see two kinds of comment lines: actual code comments (#) and R output (#>). The #> comments are helpful because I can copy a whole code block (output included) and run it in R without that output being interpreted as code. But these comments represent two different kinds of information, and I’d like them to be styled differently. The # code comments can stay unintrusive (light italic type), but the #> output comments should be legible (darker roman type).

To treat these two types of comments differently, I modified the R syntax definition used by pandoc to recognize # and #> as different entities. We can pass that syntax definition to pandoc:

syntax_sl <- solarizeddocx::file_syntax_definition()
results <- run_pandoc(
  md_file, 
  pd_style(theme_sl), 
  pd_syntax(syntax_sl)
)
page_thumbnail(url_file(results), "shot3.png")

Screenshot of html file created by pandoc. It now has solarized colors and differently styled #> comments.
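The distinction between the two kinds of comment lines can be illustrated in plain R. This is only a sketch of the rule, not how the syntax definition is written (that is a KDE-style XML file that pandoc reads), and classify_line() is a hypothetical helper that looks only at the start of each line.

```r
# Lines taken from the knitted code block shown earlier
lines <- c(
  "# prediction for 8 cylinders",
  "predict(model, data.frame(cyl = 8L))",
  "#>        1 ",
  "#> 14.87826"
)

# Hypothetical helper: check for "#>" first because those lines
# also start with "#"
classify_line <- function(x) {
  ifelse(
    startsWith(x, "#>"), "output",
    ifelse(startsWith(x, "#"), "comment", "code")
  )
}

kinds <- classify_line(lines)
data.frame(kind = kinds, line = lines)
```

The order of the checks matters: testing for # first would label every output line as an ordinary comment.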

Creating a theme from scratch

Maybe you’re thinking, that’s cool… if you like solarized. What about something fun like Fairy Floss? Okay, fine, let’s make Fairy Floss… right now… in this blog post.

First, let’s store the Fairy Floss colors in a handy list:

ff_colors <- list(
  gold = "#e6c000",
  yellow = "#ffea00",
  dark_purple = "#5a5475",
  white = "#f8f8f2",
  pink = "#ffb8d1",
  salmon = "#ff857f",
  purple = "#c5a3ff",
  teal = "#c2ffdf"
)

If we use the correct command, pandoc will provide us with a syntax highlighting theme as a JSON file. copy_base_pandoc_theme() will call this command for us. We can read that file into R and see that it is a list of global style options followed by a list of individual style definitions.

temptheme <- tempfile(fileext = ".theme") 
solarizeddocx::copy_base_pandoc_theme(temptheme)

data_theme <- jsonlite::read_json(temptheme)
str(data_theme, max.level = 2)
#> List of 5
#>  $ text-color                  : NULL
#>  $ background-color            : NULL
#>  $ line-number-color           : chr "#aaaaaa"
#>  $ line-number-background-color: NULL
#>  $ text-styles                 :List of 29
#>   ..$ Other         :List of 5
#>   ..$ Attribute     :List of 5
#>   ..$ SpecialString :List of 5
#>   ..$ Annotation    :List of 5
#>   ..$ Function      :List of 5
#>   ..$ String        :List of 5
#>   ..$ ControlFlow   :List of 5
#>   ..$ Operator      :List of 5
#>   ..$ Error         :List of 5
#>   ..$ BaseN         :List of 5
#>   ..$ Alert         :List of 5
#>   ..$ Variable      :List of 5
#>   ..$ BuiltIn       :List of 5
#>   ..$ Extension     :List of 5
#>   ..$ Preprocessor  :List of 5
#>   ..$ Information   :List of 5
#>   ..$ VerbatimString:List of 5
#>   ..$ Warning       :List of 5
#>   ..$ Documentation :List of 5
#>   ..$ Import        :List of 5
#>   ..$ Char          :List of 5
#>   ..$ DataType      :List of 5
#>   ..$ Float         :List of 5
#>   ..$ Comment       :List of 5
#>   ..$ CommentVar    :List of 5
#>   ..$ Constant      :List of 5
#>   ..$ SpecialChar   :List of 5
#>   ..$ DecVal        :List of 5
#>   ..$ Keyword       :List of 5

Each of those individual style definitions is a list of color options and font style options:

str(data_theme$`text-styles`$Comment)
#> List of 5
#>  $ text-color      : chr "#60a0b0"
#>  $ background-color: NULL
#>  $ bold            : logi FALSE
#>  $ italic          : logi TRUE
#>  $ underline       : logi FALSE

solarizeddocx provides a helper function set_theme_text_style() for setting individual style options. Let’s set up Fairy Floss’s global and comment styles. We use the fake name "global" to access the global style options, and we use style definition names like "Comment" to access those specifically.

library(magrittr)
ff_theme <- data_theme %>% 
  solarizeddocx::set_theme_text_style(
    "global", 
    background = ff_colors$dark_purple,
    text = ff_colors$white
  ) %>% 
  solarizeddocx::set_theme_text_style(
    "Comment",
    text = ff_colors$gold
  ) %>% 
  solarizeddocx::set_theme_text_style(
    "String",
    text = ff_colors$yellow 
  )

Let’s preview our partial theme:

solarizeddocx::write_pandoc_theme(ff_theme, temptheme)
results <- run_pandoc(
  md_file, 
  pd_style(temptheme), 
  pd_syntax(syntax_sl)
)
page_thumbnail(url_file(results), "shot4.png")

This is a good start, but when I first ported the solarized theme, I had to use 20 calls to set_theme_text_style(). That’s a lot. Plus, themes are data. Can’t we just describe what needs to change in a list? Yes. For this post, I made solarizeddocx::patch_theme_text_style() where we describe the changes to make as a list of patches.
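The idea of patching a theme with a list can be sketched in base R. This is not how solarizeddocx::patch_theme_text_style() is implemented: apply_toy_patches() is a hypothetical helper, the toy theme is a cut-down stand-in (the String starting color is made up), and the patches use pandoc's JSON field names directly instead of the shorter argument names used by the package.

```r
# A cut-down theme with the same nesting as the pandoc theme above
toy_theme <- list(
  "text-styles" = list(
    Comment = list("text-color" = "#60a0b0", bold = FALSE, italic = TRUE),
    String = list("text-color" = "#000000", bold = FALSE, italic = FALSE)
  )
)

# Each patch names a style and lists only the fields that should change
toy_patches <- list(
  Comment = list("text-color" = "#e6c000"),
  String = list("text-color" = "#ffea00", bold = TRUE)
)

# Hypothetical helper: merge each patch into the matching style definition
apply_toy_patches <- function(theme, patches) {
  for (style in names(patches)) {
    theme[["text-styles"]][[style]] <- utils::modifyList(
      theme[["text-styles"]][[style]],
      patches[[style]]
    )
  }
  theme
}

patched <- apply_toy_patches(toy_theme, toy_patches)
str(patched[["text-styles"]])
```

Fields that a patch does not mention (such as italic for Comment) keep their original values, which is what makes the patch list shorter than a full theme.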

Let’s write our list of patches to make to the base theme. Because some style definitions are identical, we will use tibble’s lazy list tibble::lst() to reuse patches along the way. For this application of the palette, I consulted the Fairy Floss .tmTheme file and the rsthemes implementation of Fairy Floss.

patches <- tibble::lst(
  global = list(
    text = ff_colors$white,
    background = ff_colors$dark_purple
  ),
  # # comments
  Comment = list(text = ff_colors$gold, italic = TRUE, bold = FALSE),
  # ## comments
  Documentation = Comment,
  # #> comments
  Information = list(text = ff_colors$gold, italic = FALSE, bold = TRUE),
  Keyword = list(text = ff_colors$pink),
  ControlFlow = list(text = ff_colors$pink, bold = FALSE),
  Operator = list(text = ff_colors$pink),
  Function = list(text = ff_colors$teal),
  Attribute = list(text = ff_colors$white),
  Variable = list(text = ff_colors$white),
  # this should be code outside of a code block
  VerbatimString = list(
    text = ff_colors$white, 
    background = ff_colors$dark_purple
  ),
  Other = Variable,
  Constant = list(text = ff_colors$purple),
  Error = list(text = ff_colors$salmon),
  Alert = Error,
  Warning = Error,
  Float = list(text = ff_colors$purple),
  DecVal = Float,
  BaseN = Float,
  SpecialChar = list(text = ff_colors$white),
  String = list(text = ff_colors$yellow),
  Char = String,
  SpecialString = String
)
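The reuse in the list above (for example, Alert = Error) works because tibble::lst() builds its elements in order, so a later element can refer to an earlier one. Here is a small sketch of the difference from base list(); the element names salmon_style and alert_style are made up for this illustration.

```r
# tibble::lst() evaluates its arguments one after another, so
# `alert_style` can refer to the `salmon_style` element defined before it
lazy <- tibble::lst(
  salmon_style = list(text = "#ff857f"),
  alert_style = salmon_style
)
identical(lazy$alert_style, lazy$salmon_style)

# base list() looks up `salmon_style` as an ordinary object in the calling
# environment. No such object exists, so the same code fails; we capture
# the error message instead of stopping.
eager <- tryCatch(
  list(
    salmon_style = list(text = "#ff857f"),
    alert_style = salmon_style
  ),
  error = function(e) conditionMessage(e)
)
eager
```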

Save yourself from guessing and checking. These style definition names are documented on this page. I wish I had found this page before starting to port the solarized theme. My initial approach was to use the style inspector in Microsoft Word and look at the style names applied to pieces of code. The downside of that approach is that in order to figure out what a SpecialChar was, I had to write a SpecialChar. (Escape sequences inside of strings like "hello\nthere" are SpecialChars in the R syntax definition used by pandoc.)

Now we apply our patches to the theme:

ff_theme <- solarizeddocx::patch_theme_text_style(
  data_theme,
  patches
)

solarizeddocx::write_pandoc_theme(ff_theme, temptheme)
results <- run_pandoc(
  md_file, 
  pd_style(temptheme), 
  pd_syntax(syntax_sl)
)
page_thumbnail(url_file(results), "shot5.png")

Wonderful!

Sneaking these features into RMarkdown

Update: This problem has been fixed. When I first wrote this post, it was not possible to use custom highlighting themes with RMarkdown HTML documents. The syntax highlighting for this format was overhauled in rmarkdown 2.12. [May 27, 2022]

So far, we have set these options by directly calling pandoc with the style and syntax options. ~~We can use these options in RMarkdown some of the time. For example, here we try to send the Fairy Floss theme into an html_document() and fail.~~

out <- rmarkdown::render(
  md_file, 
  output_format = rmarkdown::html_document(
    # Update, May 2022: Adding this line fixes things
    highlight = pd_style(temptheme)[2],
    pandoc_args = c(
      pd_syntax(syntax_sl)
    )
  ),
  quiet = TRUE
)
page_thumbnail(url_file(out), "shot6.png")

RMarkdown assembles and performs a giant pandoc command. The problem, as far as I can tell, is that this command includes our pd_style(temptheme) which sets the option for --highlight-style—but later on it also includes --no-highlight which blocks our style. Bummer.

If we use the simpler html_document_base() format, however, we can see Fairy Floss output.

out <- rmarkdown::render(
  md_file, 
  output_format = rmarkdown::html_document_base(
    pandoc_args = c(pd_style(temptheme), pd_syntax(syntax_sl))
  ),
  quiet = TRUE
)
page_thumbnail(url_file(out), "shot7.png")

The options also work for the pdf_document() format.

out <- rmarkdown::render(
  md_file, 
  output_format = rmarkdown::pdf_document(
    pandoc_args = c(pd_style(temptheme), pd_syntax(syntax_sl))
  ), 
  quiet = TRUE
)

# Convert to png and crop most of the empty page
png <- pdftools::pdf_convert(out, dpi = 144)
#> Converting page 1 to file343c662113f3_1.png... done!
magick::image_read(png) %>% 
  magick::image_crop(magick::geometry_area(1050, 400, 100, 100))

The options also work with word_document(). In fact, that’s how solarizeddocx::document() works.

Last knitted on 2022-05-27. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version    date (UTC) lib source
#>  askpass         1.1        2019-01-13 [1] CRAN (R 4.2.0)
#>  bslib           0.3.1      2021-10-06 [1] CRAN (R 4.2.0)
#>  cachem          1.0.6      2021-08-19 [1] CRAN (R 4.2.0)
#>  callr           3.7.0      2021-04-20 [1] CRAN (R 4.2.0)
#>  cli             3.3.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  crayon          1.5.1      2022-03-26 [1] CRAN (R 4.2.0)
#>  digest          0.6.29     2021-12-01 [1] CRAN (R 4.2.0)
#>  downlit         0.4.0      2021-10-29 [1] CRAN (R 4.2.0)
#>  ellipsis        0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi           1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  git2r           0.30.1     2022-03-16 [1] CRAN (R 4.2.0)
#>  glue            1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  here            1.0.1      2020-12-13 [1] CRAN (R 4.2.0)
#>  highr           0.9        2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.2      2021-08-25 [1] CRAN (R 4.2.0)
#>  jquerylib       0.1.4      2021-04-26 [1] CRAN (R 4.2.0)
#>  jsonlite        1.8.0      2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr         * 1.39       2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle       1.0.1      2021-09-24 [1] CRAN (R 4.2.0)
#>  magick          2.7.3      2021-08-18 [1] CRAN (R 4.2.0)
#>  magrittr      * 2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  memoise         2.0.1      2021-11-26 [1] CRAN (R 4.2.0)
#>  pdftools        3.2.0      2022-04-19 [1] CRAN (R 4.2.0)
#>  pillar          1.7.0      2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  processx        3.5.3      2022-03-25 [1] CRAN (R 4.2.0)
#>  ps              1.7.0      2022-04-23 [1] CRAN (R 4.2.0)
#>  qpdf            1.1        2019-03-07 [1] CRAN (R 4.2.0)
#>  R6              2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg            1.2.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  Rcpp            1.0.8.3    2022-03-17 [1] CRAN (R 4.2.0)
#>  rlang           1.0.2      2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.14       2022-04-25 [1] CRAN (R 4.2.0)
#>  rprojroot       2.0.3      2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.13       2020-11-12 [1] CRAN (R 4.2.0)
#>  sass            0.4.1      2022-03-23 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  solarizeddocx   0.0.1.9000 2022-05-25 [1] Github (tjmahr/solarizeddocx@8f82bf1)
#>  stringi         1.7.6      2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr         1.4.0      2019-02-10 [1] CRAN (R 4.2.0)
#>  systemfonts     1.0.4      2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping     0.3.6      2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble          3.1.7      2022-05-03 [1] CRAN (R 4.2.0)
#>  tinytex         0.39       2022-05-16 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.4.1      2022-04-13 [1] CRAN (R 4.2.0)
#>  webshot         0.5.3      2022-04-14 [1] CRAN (R 4.2.0)
#>  xfun            0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml            2.3.5      2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


A one-liner for generating random participant IDs

2021-10-12T00:00:00-05:00

On one of the Slacks I browse, someone asked how to de-identify a column of participant IDs. The original dataset was a wait list, so the ordering of the IDs was itself a sensitive feature of the data, and we needed to scramble the order of the IDs produced.

For example, suppose we have the following repeated measures dataset.

library(tidyverse)
data <- tibble::tribble(
  ~ participant, ~ timepoint, ~ score,
           "DB",           1,       7,
           "DB",           2,       8,
           "DB",           3,       8,
           "TW",           1,      NA,
           "TW",           2,       9,
           "CF",           1,       9,
           "CF",           2,       8,
           "JH",           1,      10,
           "JH",           2,      10,
           "JH",           3,      10
)

We want to map the participant identifiers onto some sort of shuffled-up random IDs. Suggestions included hashing the IDs with digest:

# This approach cryptographically compresses the input into a short
# "digest". (It is not a random ID.)
data %>% 
  mutate(
    participant = Vectorize(digest::sha1)(participant)
  )
#> # A tibble: 10 × 3
#>    participant                              timepoint score
#>    <chr>                                        <dbl> <dbl>
#>  1 ad61ec1247b2381922bec89483c3ce2fb67f98d9         1     7
#>  2 ad61ec1247b2381922bec89483c3ce2fb67f98d9         2     8
#>  3 ad61ec1247b2381922bec89483c3ce2fb67f98d9         3     8
#>  4 c080f9a87edc6d47f28185279fd8be068c566a37         1    NA
#>  5 c080f9a87edc6d47f28185279fd8be068c566a37         2     9
#>  6 1f9da22bf684761daec27326331c58b46502a25b         1     9
#>  7 1f9da22bf684761daec27326331c58b46502a25b         2     8
#>  8 627d211747438ae59690cea8f0a8d6adf666b974         1    10
#>  9 627d211747438ae59690cea8f0a8d6adf666b974         2    10
#> 10 627d211747438ae59690cea8f0a8d6adf666b974         3    10

But this approach seems like overkill, and hashing just transforms these IDs. We want to be rid of them completely.

The uuid package provides another approach:

data %>% 
  group_by(participant) %>% 
  mutate(
    id = uuid::UUIDgenerate(use.time = FALSE)
  ) %>% 
  ungroup() %>% 
  select(-participant, participant = id) %>% 
  relocate(participant)
#> # A tibble: 10 × 3
#>    participant                          timepoint score
#>    <chr>                                    <dbl> <dbl>
#>  1 03e9536d-1446-4779-ac4d-67848fa73ef4         1     7
#>  2 03e9536d-1446-4779-ac4d-67848fa73ef4         2     8
#>  3 03e9536d-1446-4779-ac4d-67848fa73ef4         3     8
#>  4 f7b73ca6-57c7-4c9a-9211-86b434912856         1    NA
#>  5 f7b73ca6-57c7-4c9a-9211-86b434912856         2     9
#>  6 81b02d88-c3bd-490b-b2dc-150077f03172         1     9
#>  7 81b02d88-c3bd-490b-b2dc-150077f03172         2     8
#>  8 60f80714-77ba-4e9f-a7d2-1943ca6724fc         1    10
#>  9 60f80714-77ba-4e9f-a7d2-1943ca6724fc         2    10
#> 10 60f80714-77ba-4e9f-a7d2-1943ca6724fc         3    10

Again, these IDs seem excessive: Imagine plotting data with one participant per facet.

When I create blogposts for this site, I use a function to create a new .Rmd file with the date and a random adjective-animal phrase for a placeholder (e.g., 2021-06-28-mild-capybara.Rmd). We could try that for fun:

data %>% 
  group_by(participant) %>% 
  mutate(
    id = ids::adjective_animal()
  ) %>% 
  ungroup() %>% 
  select(-participant, participant = id) %>% 
  relocate(participant)
#> # A tibble: 10 × 3
#>    participant              timepoint score
#>    <chr>                        <dbl> <dbl>
#>  1 chrysoprase_bushsqueaker         1     7
#>  2 chrysoprase_bushsqueaker         2     8
#>  3 chrysoprase_bushsqueaker         3     8
#>  4 hideous_cheetah                  1    NA
#>  5 hideous_cheetah                  2     9
#>  6 powdery_siamang                  1     9
#>  7 powdery_siamang                  2     8
#>  8 ducal_hornshark                  1    10
#>  9 ducal_hornshark                  2    10
#> 10 ducal_hornshark                  3    10

But that’s too whimsical (and something like hideous_cheetah seems disrespectful for human subjects).

One user suggested forcats::fct_anon():

data %>% 
  mutate(
    participant = participant %>% 
      as.factor() %>% 
      forcats::fct_anon(prefix = "p0")
    )
#> # A tibble: 10 × 3
#>    participant timepoint score
#>    <fct>           <dbl> <dbl>
#>  1 p04                 1     7
#>  2 p04                 2     8
#>  3 p04                 3     8
#>  4 p02                 1    NA
#>  5 p02                 2     9
#>  6 p03                 1     9
#>  7 p03                 2     8
#>  8 p01                 1    10
#>  9 p01                 2    10
#> 10 p01                 3    10

This approach works wonderfully. The only wrinkle is that it requires converting our IDs to a factor in order to work.

Call me the `match()`-maker

My approach is a nice combination of base R functions:

data %>% 
  mutate(
    participant = match(participant, sample(unique(participant)))
  )
#> # A tibble: 10 × 3
#>    participant timepoint score
#>          <int>     <dbl> <dbl>
#>  1           3         1     7
#>  2           3         2     8
#>  3           3         3     8
#>  4           1         1    NA
#>  5           1         2     9
#>  6           2         1     9
#>  7           2         2     8
#>  8           4         1    10
#>  9           4         2    10
#> 10           4         3    10

match(x, table) returns the first positions of the x elements in some vector table. What is the position in the alphabet of the letters L and Q and L again?

match(c("L", "Q", "L"), LETTERS)
#> [1] 12 17 12

sample() shuffles the values in the table so the original order of the elements is lost. The unique() is optional, but it keeps the new IDs in the range from 1 to the number of participants. If we just used sample(data$participant), the first position of an ID could be a number larger than 4:

shuffle <- sample(data$participant)
shuffle
#>  [1] "CF" "JH" "TW" "JH" "DB" "DB" "DB" "JH" "CF" "TW"

match(data$participant, shuffle)
#>  [1] 5 5 5 3 3 1 1 2 2 2
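To make the one-liner easier to follow, here are the same steps taken one at a time on a plain character vector. The seed is arbitrary and is only there to make the shuffle repeatable; the variable names are made up for this sketch.

```r
set.seed(20211012)  # arbitrary seed so that the shuffle is repeatable

ids <- c("DB", "DB", "DB", "TW", "TW", "CF", "CF", "JH", "JH", "JH")

# Step 1: one entry per participant, in the original (sensitive) order
originals <- unique(ids)

# Step 2: shuffle them; a participant's new ID is its position here
lookup <- sample(originals)

# Step 3: find the position of each row's ID in the shuffled lookup
new_ids <- match(ids, lookup)

# Every participant gets exactly one of the numbers 1 to 4
sort(unique(new_ids))

# The lookup vector acts as a key: indexing it with the new IDs
# recovers the original IDs
identical(lookup[new_ids], ids)
```

Because the shuffled lookup vector can reverse the mapping, it should be stored as carefully as the original IDs, or not kept at all.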

For more aesthetically pleasing names, and for names that will sort correctly, we can zero-pad the results with sprintf(). I am mostly including this step so that I have it written down somewhere for my own reference.

zero_pad <- function(xs, prefix = "", width = 0) {
  # use widest element if bigger than `width`
  width <- max(c(nchar(xs), width))
  sprintf(paste0(prefix, "%0", width, "d"), xs)    
}

data %>% 
  mutate(
    participant = match(participant, sample(unique(participant))),
    participant = zero_pad(participant, "p", 3)
  )
#> # A tibble: 10 × 3
#>    participant timepoint score
#>    <chr>           <dbl> <dbl>
#>  1 p003                1     7
#>  2 p003                2     8
#>  3 p003                3     8
#>  4 p004                1    NA
#>  5 p004                2     9
#>  6 p002                1     9
#>  7 p002                2     8
#>  8 p001                1    10
#>  9 p001                2    10
#> 10 p001                3    10
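To see why zero-padding helps names sort correctly, here is a small check that uses sprintf() directly. The IDs 1, 2 and 10 are made up for this illustration.

```r
id_numbers <- c(1L, 2L, 10L)

# Without padding, names are compared character by character,
# so "p10" lands between "p1" and "p2"
unpadded <- paste0("p", id_numbers)
sort(unpadded)

# "%03d" pads each integer with zeros to a width of 3,
# so alphabetical order matches numeric order
padded <- sprintf("p%03d", id_numbers)
sort(padded)
```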

Bonus: `match()` `%in%` disguise

What happens when match() fails to find an x in the table? By default, we get NA. But we can customize the results with the nomatch argument.

match(c("7", "A", "L"), LETTERS)
#> [1] NA  1 12
match(c("7", "A", "L"), LETTERS, nomatch = -99)
#> [1] -99   1  12
match(c("7", "A", "L"), LETTERS, nomatch = 0)
#> [1]  0  1 12

If we do something like this last example, then we can check whether an element in x has a match by checking for numbers greater than 0.

match(c("7", "A", "L"), LETTERS, nomatch = 0) > 0
#> [1] FALSE  TRUE  TRUE

And that is how the functions %in% and is.element() are implemented behind the scenes:

c("7", "A", "L") %in% LETTERS
#> [1] FALSE  TRUE  TRUE

# The 0L means it's an integer number instead of floating point number
`%in%`
#> function (x, table) 
#> match(x, table, nomatch = 0L) > 0L
#> 
#> 

is.element(c("7", "A", "L"), LETTERS)
#> [1] FALSE  TRUE  TRUE

is.element
#> function (el, set) 
#> match(as.vector(el), as.vector(set), 0L) > 0L
#> 
#> 

Last knitted on 2022-05-27. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom         0.8.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  git2r         0.30.1  2022-03-16 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.2.0)
#>  here          1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.2.0)
#>  httr          1.4.3   2022-05-04 [1] CRAN (R 4.2.0)
#>  ids           1.0.1   2017-05-31 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr       * 1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg          1.2.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble      * 3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  uuid          1.1-0   2022-04-19 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


Keep your R scripts locally sourced

2021-08-16T00:00:00-05:00

A few weeks ago, I had a bad debugging session. The code was just not doing what I expected, and I went down a lot of dead ends trying to fix or simplify things. I could not get the problem to happen in a reproducible example (reprex) or interactively (in RStudio). Eventually, the most minimal example of the problem completely broke my mental model for how the code should work.

The problem had to do with names and what they mean. select() is a function that lives in both the MASS package and the dplyr package, and I always intend for select() to point to dplyr::select(). But sometimes a statistics package will load MASS and overwrite select() to point to MASS::select(). And in this case, my attempts to use select() in a source()-ed file kept reverting to MASS::select() instead of dplyr::select(). A tweet from the session shows the minimal example and my wracked brain. (I will describe the example in more detail below.)

i'm dry heaving here wtf is going pic.twitter.com/KIeRJT6kwY

— tj mahr 🍍🍕 (@tjmahr) July 21, 2021

Here’s what happens:

I explicitly assign select to dplyr::select().
I make a function f() that prints the environment of select (where the name/function is defined), store the function in a .R text file and source() in the text file. (source() runs the code in an R script.)
I print the value of select and see that it is indeed from the dplyr environment.
I call my function, and it says that select is actually in the MASS package.
I check the value of select, and it reports the dplyr environment once again.

A similar problem using functions

This problem only happened while knitting one of my analysis notebooks (which was a clue). Right now, it’s proving difficult for me to write examples of this problem for this blogpost, so I’m going to show the source 😉 of the problem using functions.

First, let’s set up things so that select belongs to the MASS package. We are also going to use the conflicted package which normally prevents package name conflicts from happening. This part isn’t necessary or helpful; I just want to illustrate that this is not a simple name conflict problem.

library(conflicted)
library(MASS)
environment(select)
#> <environment: namespace:MASS>

We are going to make a function that does what my original code example tried to do:

set select to dplyr explicitly
source() in a file that gives the environment of select
return the environment of select, both using the source()-ed function and directly.

source_in_my_code <- function(...) {
  # set dplyr select
  select <- dplyr::select
  
  # write a script to temporary file
  temp_script <- tempfile(fileext = ".R")
  my_code <- "
    f <- function() environment(select)
  "
  writeLines(my_code, temp_script)
  
  # run the script
  source(temp_script, ...)
  
  list(
    source_select_environment = f(),
    function_select_environment = environment(select)
  )
}


default_results <- source_in_my_code()

What do you think the select environment should be? dplyr, right? That’s what select means everywhere else inside of the function. source() is just like dropping in some R code and running it, right? That’s what I thought.

default_results
#> $source_select_environment
#> <environment: namespace:MASS>
#> 
#> $function_select_environment
#> <environment: namespace:dplyr>

No, it’s the MASS environment. 😕

Local and parent environments

In order to understand what’s happening, let’s first note that R works by evaluating expressions in an environment. The environment defines the values of names. If a name is not found in an environment, R searches the parent environment for the name (or the parent’s parent, and so on). This idea is illustrated beautifully in Advanced R using diagrams.

For an analogy, you might think of environments as looking up someone in an office, a building directory, then an area directory:

I like the multi-company building analogy. If you want to call Jim, first you look in your company directory. If there isn’t a Jim there, you look in the all-building maintenance dir. If not there, you look in the city services dir. You don’t look in another company-specific dir
— Brenton Wiernik 🏳️‍🌈 (@bmwiernik) April 27, 2021

Here is a small example showing a local function environment, its parent environment, and how a name will take different values depending on the context.

where_am_i <- "outside of the function"
where_are_you <- "outside of the function too"

where_is_everyone <- function() {
  where_am_i <- "inside of the function"
  list(
    where_am_i = where_am_i,
    where_are_you = where_are_you
  )
} 

where_am_i
#> [1] "outside of the function"
where_is_everyone()
#> $where_am_i
#> [1] "inside of the function"
#> 
#> $where_are_you
#> [1] "outside of the function too"
where_am_i
#> [1] "outside of the function"

Outside of the function, where_am_i is "outside of the function", but in the body of the function, it is defined to be "inside of the function". The variable where_are_you is only defined outside of the function, so the function has to search for the variable in its parent environment.
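The same lookup rule can be made explicit by building environments by hand. This is only a minimal sketch: the names building, company and jim are made up to echo the directory analogy above and are not part of the original example.

```r
# A hypothetical "building" directory with a "company" directory nested inside it
building <- new.env()
company <- new.env(parent = building)

assign("jim", "maintenance, room 12", envir = building)

# jim is not defined in company itself ...
exists("jim", envir = company, inherits = FALSE)
#> [1] FALSE

# ... but a normal lookup falls through to the parent and finds it there
get("jim", envir = company)
#> [1] "maintenance, room 12"

# A local definition shadows the parent's value without changing it
assign("jim", "sales, room 3", envir = company)
get("jim", envir = company)
#> [1] "sales, room 3"
get("jim", envir = building)
#> [1] "maintenance, room 12"
```

The function example works the same way: the function body plays the role of company, and the place where the function was defined plays the role of building.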

"parent" environment suggests a family metaphor. if you cant find what a symbol means, ask a parent.
— tj mahr 🍍🍕 (@tjmahr) April 27, 2021

Locally sourced R code

Reading the documentation to source(), we find the solution to the original problem:

Arguments

local
TRUE, FALSE or an environment, determining where the parsed expressions are evaluated. FALSE (the default) corresponds to the user’s workspace (the global environment) and TRUE to the environment from which source is called.

By default, the code evaluated by source() runs in the global environment–that is, “outside” of the body of the function. The code breaks out of the function environment and runs at the higher environment.

My mental model for source() was completely wrong. source() is not like dropping in the R code from a file and running it. It is more like pausing everything that you’re doing in your current context, backing out to the highest level context, running that code, and then resuming what you’re doing.
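A small sketch can make the "backing out" behavior visible. The helper run_script() and the variable created_by_script are hypothetical names used only for this illustration, and the outputs assume that no variable called created_by_script already exists in the global environment.

```r
# Hypothetical helper: source a one-line script from inside a function and
# report whether the variable it creates landed in the function's environment
run_script <- function(local) {
  script <- tempfile(fileext = ".R")
  writeLines("created_by_script <- TRUE", script)
  source(script, local = local)
  exists("created_by_script", envir = environment(), inherits = FALSE)
}

# Default behavior: the variable skips the function and lands in the
# global environment
run_script(local = FALSE)
#> [1] FALSE
exists("created_by_script", envir = globalenv(), inherits = FALSE)
#> [1] TRUE

# Clean up, then source locally: the variable stays inside the function
rm("created_by_script", envir = globalenv())
run_script(local = TRUE)
#> [1] TRUE
exists("created_by_script", envir = globalenv(), inherits = FALSE)
#> [1] FALSE
```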

Fortunately, if we ask source to run locally (local = TRUE), select has the same environment inside the function and in the code run using source().

# I defined the function so it could pass arguments to source()
source_in_my_code(local = TRUE)
#> $source_select_environment
#> <environment: namespace:dplyr>
#> 
#> $function_select_environment
#> <environment: namespace:dplyr>

When we’re using source() as one of the first few lines of an R script, the default global environment for source() doesn’t really matter. But in contexts like the function example or code stored in a custom knitr/RMarkdown setup (my original problem), this difference is a problem. Therefore, in the future, I’m going to abide by the motto Keep it locally sourced. This way fits my mental model for source() as something that drops in R code and runs it in place.

And by the way, yes, even though I cited Advanced R above, I clearly did not do all of the exercises:

20.2.4 Exercises

Carefully read the documentation for source(). What environment does it use by default? What if you supply local = TRUE? How do you provide a custom environment?
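One possible answer to the last part of that exercise, as a rough sketch: the documentation quoted earlier says local may also be an environment, so we can hand source() an environment of our own. The names sandbox and helper_value are invented for this example, and the last output assumes helper_value is not already defined in the global environment.

```r
# Write a tiny script to a temporary file
script <- tempfile(fileext = ".R")
writeLines("helper_value <- 1 + 1", script)

# Evaluate the script inside a dedicated environment
sandbox <- new.env()
source(script, local = sandbox)

ls(sandbox)
#> [1] "helper_value"
sandbox$helper_value
#> [1] 2

# Nothing leaked into the global environment
exists("helper_value", envir = globalenv(), inherits = FALSE)
#> [1] FALSE
```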

Last knitted on 2022-05-27. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  cachem        1.0.6      2021-08-19 [1] CRAN (R 4.2.0)
#>  cli           3.3.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  conflicted  * 1.1.0      2021-11-26 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1      2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2      2021-12-20 [1] CRAN (R 4.2.0)
#>  dplyr         1.0.9      2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  emo           0.0.0.9000 2022-05-25 [1] Github (hadley/emo@3f03b11)
#>  evaluate      0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  generics      0.1.2      2022-01-31 [1] CRAN (R 4.2.0)
#>  git2r         0.30.1     2022-03-16 [1] CRAN (R 4.2.0)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  here          1.0.1      2020-12-13 [1] CRAN (R 4.2.0)
#>  knitr       * 1.39       2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1      2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0      2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS        * 7.3-56     2022-03-23 [2] CRAN (R 4.2.0)
#>  memoise       2.0.1      2021-11-26 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0      2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg          1.2.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2      2022-03-04 [1] CRAN (R 4.2.0)
#>  rprojroot     2.0.3      2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6      2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.2.0)
#>  systemfonts   1.0.4      2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping   0.3.6      2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7      2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1      2022-04-13 [1] CRAN (R 4.2.0)
#>  xfun          0.31       2022-05-10 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


Snecko eye lets you play more cards

2021-07-07T00:00:00-05:00

In a previous post, I used simulations to estimate how long it would take to collect the unique Unowns in Pokemon Go! The message of the post was that we can use simulations to solve problems when the analytic solution is not clear or obvious. The current post is another example of using simulations to understand a weird counting/probability problem.

Over the past strange year, I sank a lot of time into Slay the Spire, a rogue-like deck-building game. You have to escape from a 50-floor spire, fighting monsters by playing cards. The cards let you attack, defend, apply buffs and debuffs, draw cards, etc. You start each turn with a given amount of energy, and the cards cost energy (with more powerful effects costing more energy), so you need to plan out how to play your turns in order to defeat the monsters. You receive more cards from winning battles and can receive special relics that will make you stronger or change how your deck plays.

That’s the basic gist of the game. The sublime part comes when the cards and relics start synergistically empowering each other and comboing off each other. You might get the curse Pain which drains you of 1 health every time you play a card. (This is bad.) But then you find a Rupture which increases your strength every time you take damage from a card. Then you get Runic Cube which draws you an extra card every time you take damage. Finally, you find Reaper which converts damage into health. So you now have this card-drawing, strength-building, self-sustaining engine that makes you unstoppable. (This particular scenario unfolded in a recent game by the streamer Jorbs.)

Screenshot of Slay the Spire gameplay. We see a hand of 5 cards at the bottom with 1/3 energy available on the left.

The exercise today: simulate the maximum number of cards we can play per turn for 3 energy under normal circumstances and when a game-warping relic (Snecko Eye) is active.

A baseline deck

Let’s consider a setup as a baseline for comparison.

We have 3 energy to play cards per turn.
We draw 5 cards per turn.
Our deck has 17 cards: 1 card costs 0 energy, 12 cost 1 energy, 3 cost 2 energy, and 1 costs 3 energy.

(I made up this deck for this example.)

If we build our deck, we can find the average cost of our cards and simulate some draws by sample()-ing without replacement.

library(magrittr)
set.seed(20210707)

costs <- c(0, rep(1, 12), rep(2, 3), 3)
costs
#>  [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3

mean(costs)
#> [1] 1.235294

# Simulate 3 hands
sample(costs, size = 5)
#> [1] 1 2 2 1 1
sample(costs, size = 5)
#> [1] 1 3 2 1 1
sample(costs, size = 5)
#> [1] 1 1 1 1 1

Suppose that we don’t really care about what the cards do. We want to maximize the number of cards that we play per turn. We just want to know: How many cards per turn can I expect to play on average?

Let’s write a function that counts the number of playable cards in a hand given a certain energy budget. The basic logic is that we sort the card costs, compute the cumulative sum (cumulative energy spent on each card), and count how many sums (played cards) are less than or equal to the energy limit.

# A worked example
energy <- 3
hand <- sample(costs, size = 5)
hand
#> [1] 3 1 0 1 1
sort(hand)
#> [1] 0 1 1 1 3
cumsum(sort(hand))
#> [1] 0 1 2 3 6
cumsum(sort(hand)) <= energy 
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE
sum(cumsum(sort(hand)) <= energy)
#> [1] 4

count_max_playable <- function(hand, energy) {
  sum(cumsum(sort(hand)) <= energy)
}

count_max_playable(hand, energy)
#> [1] 4

Now, we can do this procedure on several thousand hands and run summary statistics on the number of playable cards in each hand.

simulated_cards_played <- replicate(
  10000,
  costs %>% 
    sample(size = 5) %>% 
    count_max_playable(energy = 3)
)

summary(simulated_cards_played)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   2.000   3.000   3.000   3.173   3.000   4.000

table(simulated_cards_played)
#> simulated_cards_played
#>    2    3    4 
#>  451 7364 2185

proportions(table(simulated_cards_played))
#> simulated_cards_played
#>      2      3      4 
#> 0.0451 0.7364 0.2185

The expected number of playable cards per hand is 3.2. Hands where we can play only 2 cards, such as the dreaded (1, 2, 2, 2, 3), appear about 4.5% of the time, but the one 0-cost card in our deck lets us play a fourth card about 21.8% of the time.
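Because the baseline deck is small, the simulated proportions can be checked against an exact count. This sketch is an addition, not part of the original analysis: it uses combn() to enumerate every possible 5-card hand, treating the 17 cards as distinct, and it redefines costs and count_max_playable() so that it runs on its own.

```r
costs <- c(0, rep(1, 12), rep(2, 3), 3)

count_max_playable <- function(hand, energy) {
  sum(cumsum(sort(hand)) <= energy)
}

# Each column is one equally likely 5-card hand
all_hands <- combn(costs, 5)
ncol(all_hands)
#> [1] 6188

# Count playable cards in every hand, then tabulate exact proportions
exact_played <- apply(all_hands, 2, count_max_playable, energy = 3)
exact_props <- proportions(table(exact_played))
round(exact_props, 4)
```

By this count, 277 of the 6188 hands allow only 2 cards (about 4.5%) and 1375 allow 4 cards (about 22.2%), so the simulated values sit close to the exact ones.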

Enter the Snecko

Snecko Eye is probably the best relic in the game.

Let’s suppose we obtain the mighty Snecko Eye relic. It says “Draw 2 additional cards each turn. Start each combat Confused.” Confused is a debuff that randomizes the costs of cards when we draw them. So now our setup is the following:

We have 3 energy to play cards per turn.
We draw 7 cards per turn.
Our deck has 17 cards: the costs are random integers between 0 and 3 energy.

The average energy cost of any given card in our deck is now mean(0:3) = 1.5. In the baseline example, the average energy cost was 1.24. (One obvious strategy with Snecko Eye is to maximize the original costs of new cards, that is, to pick up as many 2s and 3s as possible, because their new expected cost is less than their original cost. But let’s ignore that dimension of gameplay for now.)

So here’s the puzzle: how many cards per turn can I play with Snecko Eye? We can run the same simulations as above.

snecko_costs <- 0:3 
simulated_snecko_cards_played <- replicate(
  10000,
  snecko_costs %>% 
    sample(size = 7, replace = TRUE) %>% 
    count_max_playable(energy = 3)
)

summary(simulated_snecko_cards_played)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   3.000   4.000   3.826   5.000   7.000

table(simulated_snecko_cards_played)
#> simulated_snecko_cards_played
#>    1    2    3    4    5    6    7 
#>   83  979 2911 3397 1945  621   64

proportions(table(simulated_snecko_cards_played))
#> simulated_snecko_cards_played
#>      1      2      3      4      5      6      7 
#> 0.0083 0.0979 0.2911 0.3397 0.1945 0.0621 0.0064

Let us note that the dream (playing 7 cards in one turn) happened about 0.6% of the time and the nightmare (playing only 1 card, for example, after drawing only 2-cost and 3-cost cards) happened 0.8% of the time. Recall that in the baseline setup, we got to play 4 cards 21.8% of the time. With Snecko Eye, we can play 4 or more cards per turn 60.3% of the time. Snecko Eye simply lets us play more cards on average.
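The two tail cases are simple enough to count directly, which gives a sanity check on the simulation. This is a sketch under the same assumption as the simulation: each of the 7 costs is drawn independently and uniformly from 0:3. The variable names are made up for this example.

```r
# Every sequence of 7 costs drawn from 0:3 is equally likely
n_hands <- 4^7

# Dream: all 7 cards are playable, so the 7 costs sum to 3 or less.
# Counting non-negative integer solutions to c1 + ... + c7 <= 3
# (stars and bars) gives choose(7 + 3, 3)
n_dream <- choose(7 + 3, 3)
n_dream
#> [1] 120
n_dream / n_hands
#> [1] 0.007324219

# Only 2-cost and 3-cost cards: 2 of the 4 possible costs, 7 times over
p_only_2s_and_3s <- (2 / 4)^7
p_only_2s_and_3s
#> [1] 0.0078125

# Playing exactly 1 card also happens with one 1-cost card and six 3-cost
# cards (1 + 3 > 3), which adds 7 more hands
n_one_card <- 2^7 + 7
n_one_card / n_hands
#> [1] 0.008239746
```

Both values land near the simulated 0.6% and 0.8%.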

Yes, we could skip the random sampling. For this problem where there are 4^7 = 16384 combinations, a brute-force enumeration is possible. The proportions from the full enumeration are within about .005 (half a percentage point) of the proportions from simulating 10,000 hands.

# Generate all combinations
expand.grid(rep(list(0:3), 7)) %>% 
  # Count playables in each row
  apply(MARGIN = 1, count_max_playable, energy = 3) %>% 
  table() %>% 
  proportions() %>% 
  round(3)
#> .
#>     1     2     3     4     5     6     7 
#> 0.008 0.098 0.287 0.341 0.200 0.059 0.007

Where does this power come from?

Is the magic of Snecko Eye the card draw or the cost randomization? Well, let’s suppose that we are just confused and we draw only 5 cards (as in the baseline example).

simulated_confused_cards_played <- replicate(
  10000,
  snecko_costs %>% 
    sample(size = 5, replace = TRUE) %>% 
    count_max_playable(energy = 3)
)

summary(simulated_confused_cards_played)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    2.00    3.00    3.03    4.00    5.00

table(simulated_confused_cards_played)
#> simulated_confused_cards_played
#>    1    2    3    4    5 
#>  358 2496 4162 2454  530

proportions(table(simulated_confused_cards_played))
#> simulated_confused_cards_played
#>      1      2      3      4      5 
#> 0.0358 0.2496 0.4162 0.2454 0.0530

Here the average number of cards played is 3.0 and we play 4–5 cards per turn 29.8% of the time. This percentage is greater than the baseline case (21.8%), but the nightmare case is worse (1 card), occurring 3.6% of the time.

We can plot the three simulations side by side and observe the distributions. First, we package them together into a single dataframe suitable for plotting and plot a bar chart.

library(ggplot2)

sim1 <- data.frame(
  set = "Baseline",
  energy = 3,
  cards = simulated_cards_played
)

sim2 <- data.frame(
  set = "Snecko Eye",
  energy = 3,
  cards = simulated_snecko_cards_played
)

sim3 <- data.frame(
  set = "Confused",
  energy = 3,
  cards = simulated_confused_cards_played
)

sims <- rbind(sim1, sim2, sim3)

ggplot(sims) + 
  aes(x = cards) + 
  geom_bar(aes(y = after_stat(prop))) + 
  facet_wrap("set") + 
  scale_x_continuous(breaks = 1:7, minor_breaks = NULL) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    title = "Confusion increases variance. Card draw increases mean.",
    x = "Number of playable cards in hand", 
    y = "Percentage of hands",
    caption = "N = 10,000 simulations per panel"
  ) +
  theme_grey(base_size = 12)

Both the Confused and the Snecko Eye panels have increased variance. The bars are shorter and more spread out, compared to the Baseline panel. The peak (the mode) shifts from 3 to 4 cards from the Confused panel to the Snecko Eye panel.

A more statistically niche technique would be plotting the empirical cumulative distribution function. Imagine taking the bars from the previous plot and summing them along the x axis so that they are cumulative percentages. These percentages would tell you about the percentage of cases less than or equal to that given value. In the plot below, I do that procedure on a reversed x axis, so we can look at what proportion of simulations had at least 4 cards played. (I chose the reversed x axis to visually convey the advantage of Snecko Eye.)

library(dplyr)

props <- sims %>% 
  count(set, cards) %>% 
  # Fill in rows that would be n = 0
  tidyr::complete(set, cards = 1:7, fill = list(n = 0)) %>% 
  # Compute ECDF in reverse order (starting at 7 cards)
  arrange(set, desc(cards)) %>% 
  group_by(set) %>% 
  mutate(
    proportion = n / sum(n),
    ecdf = cumsum(proportion)
  ) %>% 
  ungroup() 

ggplot(props) + 
  aes(x = cards) +
  geom_step(
    aes(y = ecdf, color = set, linetype = set), 
    direction = "mid"
  ) +
  geom_label(
    aes(color = set, y = ecdf),
    label = "Snecko can play 4 or more\ncards in 60% of hands",
    data = . %>% filter(set == "Snecko Eye", cards == 4), 
    y = .65,
    nudge_x = -.25,
    hjust = 1.0,
    vjust = 0, 
    fill = scales::alpha("grey93", .6),
    label.size = 0, 
    show.legend = FALSE,
    size = 4.5
  ) +
  scale_x_reverse(breaks = 7:0, minor_breaks = NULL) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    x = "At least X playable cards in hand",
    y = "Percentage of hands",
    caption = "N = 10,000 simulations per line",
    color = NULL,
    linetype = NULL
  ) + 
  theme_grey(base_size = 12) + 
  theme(
    legend.position = "top", 
    legend.justification = "left"
  )
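As a quick check on the 60% figure in the plot label, the same reversed cumulative sum can be computed in base R. This sketch copies the Snecko Eye counts from the table() output earlier in the post rather than rerunning the simulation. The names snecko_counts and at_least are made up for this example.

```r
# Counts of playable cards, copied from the Snecko Eye table() output
snecko_counts <- c(
  "1" = 83, "2" = 979, "3" = 2911, "4" = 3397,
  "5" = 1945, "6" = 621, "7" = 64
)
proportion <- snecko_counts / sum(snecko_counts)

# Accumulate from 7 cards down to 1, then restore the original order
at_least <- rev(cumsum(rev(proportion)))

# Proportion of hands with at least 4 playable cards
at_least[["4"]]
#> [1] 0.6027
```

Reversing before and after cumsum() is what turns the usual "at most X" reading of an ECDF into an "at least X" reading.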

The advantage at higher energy

During a run through the game, we can obtain up to two relics (along with Snecko Eye) that increase our energy per turn by 1 unit. Let’s see how these new energy budgets affect the simulations.

First, we run the simulations. We put the main code into functions so that we can build the dataframes more easily.

simulate_decko <- function(n, energy, costs, size = 5) {
  replicate(
    n,
    sample(costs, size = size) %>% 
      count_max_playable(energy = energy)
  )
}

simulate_snecko <- function(n, energy, size = 7) {
  snecko_costs <- 0:3
  replicate(
    n,
    sample(snecko_costs, size = size, replace = TRUE) %>% 
      count_max_playable(energy = energy)
  )
}

additional_sims <- rbind(
  # include old results
  sims,
  data.frame(
    set = "Baseline",
    energy = 4,
    cards = simulate_decko(10000, 4, costs)
  ),
  data.frame(
    set = "Baseline",
    energy = 5,
    cards = simulate_decko(10000, 5, costs)
  ),
  data.frame(
    set = "Snecko Eye",
    energy = 4,
    cards = simulate_snecko(10000, 4)
  ),
  data.frame(
    set = "Snecko Eye",
    energy = 5,
    cards = simulate_snecko(10000, 5)
  ),
  data.frame(
    set = "Confused",
    energy = 4,
    cards = simulate_snecko(10000, 4, size = 5)
  ),
  data.frame(
    set = "Confused",
    energy = 5,
    cards = simulate_snecko(10000, 5, size = 5)
  )
)

We can make the same kind of plot as before. We see that the distribution with the highest mode (the peak that lands on the highest number of cards) in each row is Snecko Eye.

ggplot(additional_sims %>% mutate(energy = paste0(energy, " energy"))) + 
  aes(x = cards) + 
  geom_bar(aes(y = after_stat(prop))) + 
  facet_grid(energy ~ set) + 
  scale_x_continuous(breaks = 1:7, minor_breaks = NULL) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    x = "Number of playable cards in hand", 
    y = "Percentage of hands",
    caption = "N = 10,000 simulations per panel"
  ) +
  theme_grey(base_size = 12)

One limitation of the two non-Snecko sets becomes more obvious in the 5-energy row: They can never play 6 or 7 cards in a turn. They don’t draw that many cards. Their distributions are cut off at 5 cards.

If we look at numerical summaries, we get some sense that the benefit of Snecko diminishes as energy increases but we won’t explore this trend in any detail.

additional_sims %>% 
  group_by(set, energy) %>% 
  summarise(
    mean = mean(cards),
    sd = sd(cards),
    median = median(cards),
    .groups = "drop"
  ) %>% 
  tidyr::pivot_longer(cols = c(mean, sd, median)) %>% 
  tidyr::pivot_wider(
    names_from = energy, 
    values_from = value, 
    names_prefix = "Energy "
  ) %>% 
  rename(Set = set, Statistic = name) %>% 
  arrange(Statistic, Set) %>% 
  knitr::kable(digits = 2)

Set	Statistic	Energy 3	Energy 4	Energy 5
Baseline	mean	3.17	3.81	4.27
Confused	mean	3.03	3.44	3.79
Snecko Eye	mean	3.83	4.30	4.72
Baseline	median	3.00	4.00	4.00
Confused	median	3.00	3.00	4.00
Snecko Eye	median	4.00	4.00	5.00
Baseline	sd	0.48	0.56	0.53
Confused	sd	0.92	0.89	0.84
Snecko Eye	sd	1.11	1.08	1.06
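To put a rough number on that diminishing benefit, we can subtract the Baseline means from the Snecko Eye means in the table above. This sketch just copies the rounded means, so the differences are only suggestive: they come from a single run of 10,000 simulations per cell. The data frame means and its column names are made up for this example.

```r
# Mean playable cards, copied from the summary table above
means <- data.frame(
  energy = c(3, 4, 5),
  baseline = c(3.17, 3.81, 4.27),
  snecko = c(3.83, 4.30, 4.72)
)

# Extra cards played per turn with Snecko Eye at each energy level
means$advantage <- round(means$snecko - means$baseline, 2)
means$advantage
#> [1] 0.66 0.49 0.45
```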

You should probably play Snecko

Pwease no steppo. Posted by u/usernameequalspants.

We could go on and on with the simulations. Suppose you are dying and you are in desperate need of the Apparition on the top of your deck. How many cards can you play after you are forced to play that card? Or suppose that you are, rightly, playing on Ascension 20 and one of the cards is an unplayable curse. Does that change anything?

The point here is that we used simulations to visualize how randomization increased the variance of playable cards but the extra cards shifted the mode of the distribution upwards. You can play more cards with Snecko Eye because you simply have more cards you can play per turn.

Last knitted on 2022-05-27. Source code on GitHub.¹

.session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  farver        2.1.0   2021-02-28 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  git2r         0.30.1  2022-03-16 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  here          1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  knitr       * 1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  labeling      0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr    * 2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg          1.2.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  systemfonts   1.0.4   2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping   0.3.6   2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyr         1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

