Serializing rvars with qs2’s qdata

I was surprised by how large a tibble with posterior::rvar() columns became when saved with qs2::qs_save(). The culprit seems to be an environment (a pointer) stored as an attribute on the rvar objects. Dropping that pointer, either by converting to a non-rvar format or by saving with qs2::qd_save(), results in much smaller files.

Let’s build a tibble with 100 rows and 2 rvar columns, where each rvar holds 1000 draws.

set.seed(20250512)

f <- function(n) {
  replicate(n = n, posterior::rvar(rnorm(1000))) |> 
    do.call(what = c)
}

data <- tibble::tibble(
  a = 1:100,
  b = f(100),
  c = f(100)
)
data
#> # A tibble: 100 × 3
#>        a               b                c
#>    <int>      <rvar[1d]>       <rvar[1d]>
#>  1     1   0.0014 ± 1.02  -0.00354 ± 0.99
#>  2     2   0.0479 ± 1.00  -0.06457 ± 1.01
#>  3     3  -0.0602 ± 0.96   0.00022 ± 0.98
#>  4     4   0.0068 ± 0.98   0.01593 ± 1.01
#>  5     5  -0.0101 ± 1.00  -0.02121 ± 0.99
#>  6     6  -0.0049 ± 0.95  -0.04301 ± 0.97
#>  7     7   0.0078 ± 0.97   0.00162 ± 1.03
#>  8     8  -0.0044 ± 0.96   0.03011 ± 0.99
#>  9     9   0.0276 ± 1.00  -0.00378 ± 1.01
#> 10    10  -0.0421 ± 1.05   0.08015 ± 0.99
#> # ℹ 90 more rows

Let’s save it with save() and with qs2::qs_save(), which is meant to serialize the whole R object like save() does, but in a faster and smarter way.

t1 <- tempfile()
save(data, file = t1)
file.size(t1) |> scales::label_bytes()()
#> [1] "155 MB"

t2 <- tempfile()
qs2::qs_save(data, t2, nthreads = 8)
file.size(t2) |> scales::label_bytes()()
#> [1] "110 MB"
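
As a rough yardstick (my own back-of-the-envelope arithmetic, not something qs2 reports), the draws themselves are only 100 rows x 2 rvar columns x 1000 draws, stored as 8-byte doubles:

```r
# Raw size of the draws alone, before any compression:
# rows x rvar columns x draws per rvar x 8 bytes per double
n_rows      <- 100
n_rvar_cols <- 2
n_draws     <- 1000

raw_bytes <- n_rows * n_rvar_cols * n_draws * 8
raw_bytes / 1e6  # size in (decimal) megabytes
#> [1] 1.6
```

So the numbers we care about take up about 1.6 MB uncompressed, which makes files over 100 MB look suspicious.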

If we unnest the data so that there is one row per draw, the saved file is much smaller.

t3 <- tempfile()
data_long <- tidybayes::unnest_rvars(data)
data_long
#> # A tibble: 100,000 × 6
#> # Groups:   a [100]
#>        a      b      c .chain .iteration .draw
#>    <int>  <dbl>  <dbl>  <int>      <int> <int>
#>  1     1 -0.584 -0.135      1          1     1
#>  2     1 -1.78  -0.955      1          2     2
#>  3     1 -0.222 -0.534      1          3     3
#>  4     1 -0.560  1.13       1          4     4
#>  5     1 -0.754 -1.81       1          5     5
#>  6     1  0.436  1.24       1          6     6
#>  7     1 -0.241 -1.58       1          7     7
#>  8     1 -1.68  -1.84       1          8     8
#>  9     1 -1.83   0.591      1          9     9
#> 10     1 -0.470 -0.260      1         10    10
#> # ℹ 99,990 more rows

qs2::qs_save(data_long, t3, nthreads = 8)
file.size(t3) |> scales::label_bytes()()
#> [1] "1 MB"

file.size(t2) / file.size(t3)
#> [1] 78.05878

What is the bloat here? It appears to be cached data attached to the rvars. Here is the internal structure of one rvar column:

str(data$b |> unclass())
#>  list()
#>  - attr(*, "draws")= num [1:1000, 1:100] -0.584 -1.776 -0.222 -0.56 -0.754 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:1000] "1" "2" "3" "4" ...
#>   .. ..$ : NULL
#>  - attr(*, "nchains")= int 1
#>  - attr(*, "cache")=<environment: 0x00000159652bf370>

See the environment pointer?
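
Why would a pointer matter? Here is a base-R sketch of one way an environment attribute can inflate serialized output. It is only an illustration of how serialize() treats environments, not a diagnosis of what posterior's cache actually holds; make_tagged() and its contents are made up for the example.

```r
# Illustration only: when R serializes an environment, it also writes the
# environment's contents and its enclosing environments (up to the global
# or a package/namespace environment).
make_tagged <- function() {
  big <- rnorm(1e5)              # large object living in this function's frame
  x <- 1:10
  attr(x, "cache") <- new.env()  # the new environment's parent is this frame
  x
}

x <- make_tagged()
size_with_env <- length(serialize(x, NULL))     # bytes, including the frame holding `big`

attr(x, "cache") <- NULL
size_without_env <- length(serialize(x, NULL))  # bytes for the bare integer vector

c(with_env = size_with_env, without_env = size_without_env)
```

The first size is over 800,000 bytes because the 100,000 doubles in `big` tag along with the environment, even though `x` itself is just ten integers.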

Let’s zap the caches and try again.

data2 <- data
attr(data2$c, "cache") <- NULL
attr(data2$b, "cache") <- NULL

t4 <- tempfile()
qs2::qs_save(data2, t4, nthreads = 8)
file.size(t4) |> scales::label_bytes()()
#> [1] "1 MB"
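
With many rvar columns, the zapping could be wrapped in a small helper. This is only a sketch: drop_rvar_cache() is a hypothetical name, and it relies on the internal "cache" attribute shown above, which is not a documented part of posterior and could change between versions.

```r
# Hypothetical helper (not part of posterior): remove the internal "cache"
# attribute from every rvar column of a data frame or tibble.
drop_rvar_cache <- function(df) {
  for (col in names(df)) {
    if (posterior::is_rvar(df[[col]])) {
      x <- df[[col]]
      attr(x, "cache") <- NULL  # assumption: the cache lives in this attribute
      df[[col]] <- x
    }
  }
  df
}

# Small demo: 3 rows, one rvar column with 10 draws per element
small <- tibble::tibble(
  id = 1:3,
  x  = posterior::rvar(matrix(rnorm(30), nrow = 10))
)
small_clean <- drop_rvar_cache(small)
```

For the data in this post, drop_rvar_cache(data) should be equivalent to the two attr() assignments above.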

That brings the file down to about the size of the unnested version. qs2 also provides the qd_ functions (the qdata format) for a stricter kind of serialization that will replace pointers with NULL, so we don’t have to mess with the guts of the rvars:

t5 <- tempfile()
qs2::qd_save(data, t5, nthreads = 8)
#> Warning in qs2::qd_save(data, t5, nthreads = 8): Attributes of type environment
#> are not supported in qdata format
#> Warning in qs2::qd_save(data, t5, nthreads = 8): Attributes of type environment
#> are not supported in qdata format
file.size(t5) |> scales::label_bytes()()
#> [1] "2 MB"

Importantly, the data seem to come back intact:

data3 <- qs2::qd_read(t5)
data3
#> # A tibble: 100 × 3
#>        a               b                c
#>    <int>      <rvar[1d]>       <rvar[1d]>
#>  1     1   0.0014 ± 1.02  -0.00354 ± 0.99
#>  2     2   0.0479 ± 1.00  -0.06457 ± 1.01
#>  3     3  -0.0602 ± 0.96   0.00022 ± 0.98
#>  4     4   0.0068 ± 0.98   0.01593 ± 1.01
#>  5     5  -0.0101 ± 1.00  -0.02121 ± 0.99
#>  6     6  -0.0049 ± 0.95  -0.04301 ± 0.97
#>  7     7   0.0078 ± 0.97   0.00162 ± 1.03
#>  8     8  -0.0044 ± 0.96   0.03011 ± 0.99
#>  9     9   0.0276 ± 1.00  -0.00378 ± 1.01
#> 10    10  -0.0421 ± 1.05   0.08015 ± 0.99
#> # ℹ 90 more rows

data3$b == data2$b
#> rvar<1000>[100] mean ± sd:
#>   [1] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [11] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [21] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [31] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [41] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [51] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [61] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [71] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [81] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0 
#>  [91] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
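
The element-wise printout is long. A more compact check, sketched here, compares the underlying draw matrices with posterior::draws_of(). The helper name same_draws() is my own, and the demo uses freshly made rvars so that it runs on its own.

```r
# Hypothetical helper: TRUE when two rvars hold (numerically) equal draws
same_draws <- function(x, y) {
  isTRUE(all.equal(posterior::draws_of(x), posterior::draws_of(y)))
}

m <- matrix(rnorm(30), nrow = 10)  # 10 draws for each of 3 elements

same_draws(posterior::rvar(m), posterior::rvar(m))      # identical draws
#> [1] TRUE
same_draws(posterior::rvar(m), posterior::rvar(m + 1))  # shifted draws
#> [1] FALSE
```

Applied to the objects in this post, the check would be same_draws(data3$b, data$b).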
