Serializing rvars with qs2’s qdata
I was surprised by how large a tibble with posterior::rvar() columns became
when saved with qs2::qs_save(). The cause seems to be an environment pointer
stored as an attribute on each rvar column. Dropping that pointer, either by
converting to a non-rvar format or by saving with qs2::qd_save(), results in
much smaller files.
Let’s make rvars of 1000 draws each for 100 rows x 2 columns of data.
set.seed(20250512)
f <- function(n) {
  replicate(n = n, posterior::rvar(rnorm(1000))) |>
    do.call(what = c)
}
data <- tibble::tibble(
  a = 1:100,
  b = f(100),
  c = f(100)
)
data
#> # A tibble: 100 × 3
#>        a              b                 c
#>    <int>     <rvar[1d]>        <rvar[1d]>
#>  1     1   0.0014 ± 1.02  -0.00354 ± 0.99
#>  2     2   0.0479 ± 1.00  -0.06457 ± 1.01
#>  3     3  -0.0602 ± 0.96   0.00022 ± 0.98
#>  4     4   0.0068 ± 0.98   0.01593 ± 1.01
#>  5     5  -0.0101 ± 1.00  -0.02121 ± 0.99
#>  6     6  -0.0049 ± 0.95  -0.04301 ± 0.97
#>  7     7   0.0078 ± 0.97   0.00162 ± 1.03
#>  8     8  -0.0044 ± 0.96   0.03011 ± 0.99
#>  9     9   0.0276 ± 1.00  -0.00378 ± 1.01
#> 10    10  -0.0421 ± 1.05   0.08015 ± 0.99
#> # ℹ 90 more rows
Let’s save it with save() and qs2::qs_save(), which is meant to
serialize the R object’s data like save() does but in a faster and
smarter way.
t1 <- tempfile()
save(data, file = t1)
file.size(t1) |> scales::label_bytes()()
#> [1] "155 MB"

t2 <- tempfile()
qs2::qs_save(data, t2, nthreads = 8)
file.size(t2) |> scales::label_bytes()()
#> [1] "110 MB"
If we unnest the data so that there is one row per draw, we get much better sizes.
t3 <- tempfile()
data_long <- tidybayes::unnest_rvars(data)
data_long
#> # A tibble: 100,000 × 6
#> # Groups:   a [100]
#>        a      b      c .chain .iteration .draw
#>    <int>  <dbl>  <dbl>  <int>      <int> <int>
#>  1     1 -0.584 -0.135      1          1     1
#>  2     1 -1.78  -0.955      1          2     2
#>  3     1 -0.222 -0.534      1          3     3
#>  4     1 -0.560  1.13       1          4     4
#>  5     1 -0.754 -1.81       1          5     5
#>  6     1  0.436  1.24       1          6     6
#>  7     1 -0.241 -1.58       1          7     7
#>  8     1 -1.68  -1.84       1          8     8
#>  9     1 -1.83   0.591      1          9     9
#> 10     1 -0.470 -0.260      1         10    10
#> # ℹ 99,990 more rows

qs2::qs_save(data_long, t3, nthreads = 8)
file.size(t3) |> scales::label_bytes()()
#> [1] "1 MB"
file.size(t2) / file.size(t3)
#> [1] 78.05878
What is the bloat here? It is cached data about the rvars. Here is the internal structure of one rvar column:
str(data$b |> unclass())
#>  list()
#>  - attr(*, "draws")= num [1:1000, 1:100] -0.584 -1.776 -0.222 -0.56 -0.754 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:1000] "1" "2" "3" "4" ...
#>   .. ..$ : NULL
#>  - attr(*, "nchains")= int 1
#>  - attr(*, "cache")=<environment: 0x00000159652bf370>
See the environment pointer?
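To see what sits behind that pointer, one option is to list the objects stored in the cache environment along with their sizes. This is only a rough sketch: cache_contents() is a hypothetical helper, it assumes the cache is an ordinary environment stored in the "cache" attribute (as the str() output suggests), and the entries it reports will likely depend on the posterior version and on which operations have touched the rvar so far.

```r
# Hypothetical helper: report what an rvar has cached and how large each
# cached object is. Assumes the cache is an environment in attr(x, "cache").
cache_contents <- function(x) {
  cache <- attr(x, "cache")
  if (!is.environment(cache)) {
    # No cache environment to inspect (e.g., it was removed or x is not an rvar)
    return(character(0))
  }
  nms <- ls(cache, all.names = TRUE)
  vapply(
    nms,
    function(nm) {
      obj <- get(nm, envir = cache, inherits = FALSE)
      format(utils::object.size(obj), units = "auto")
    },
    character(1)
  )
}

# A small rvar: 1000 draws of a length-2 vector
x <- posterior::rvar(matrix(rnorm(2000), nrow = 1000, ncol = 2))
print(x)  # printing may populate the cache
cache_contents(x)
```

Running cache_contents(data$b) on the tibble above should show whether a few large cached objects account for the extra size.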
Let’s zap the caches and try again.
data2 <- data
attr(data2$c, "cache") <- NULL
attr(data2$b, "cache") <- NULL

t4 <- tempfile()
qs2::qs_save(data2, t4, nthreads = 8)
file.size(t4) |> scales::label_bytes()()
#> [1] "1 MB"
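With more than a couple of rvar columns, zapping each cache by hand gets tedious. As a rough sketch, the step can be wrapped in a helper that finds the rvar columns and drops the attribute from each one. strip_rvar_cache() is a hypothetical name, not part of posterior or qs2, and it assumes the "cache" attribute is safe to drop, as the manual version above does.

```r
# Hypothetical helper: remove the "cache" attribute from every rvar column
# of a data frame. Non-rvar columns are left untouched.
strip_rvar_cache <- function(df) {
  is_rvar_col <- vapply(df, posterior::is_rvar, logical(1))
  for (nm in names(df)[is_rvar_col]) {
    attr(df[[nm]], "cache") <- NULL
  }
  df
}

# Small example: 3 rows, one rvar column with 100 draws per element
example_df <- tibble::tibble(
  a = 1:3,
  b = posterior::rvar(matrix(rnorm(300), nrow = 100, ncol = 3))
)
stripped <- strip_rvar_cache(example_df)
is.null(attr(stripped$b, "cache"))
```

Something like qs2::qs_save(strip_rvar_cache(data), t4, nthreads = 8) could then stand in for the manual attribute edits.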
Zapping the caches brings the file down to roughly the size of the unnested
version. qs2 also provides the qd_ functions for a stricter kind of
serialization that will replace pointers with NULL, so we don’t have to mess
with the guts of the rvars:
t5 <- tempfile()
qs2::qd_save(data, t5, nthreads = 8)
#> Warning in qs2::qd_save(data, t5, nthreads = 8): Attributes of type environment
#> are not supported in qdata format
#> Warning in qs2::qd_save(data, t5, nthreads = 8): Attributes of type environment
#> are not supported in qdata format
file.size(t5) |> scales::label_bytes()()
#> [1] "2 MB"
Importantly, the data seems to come back fine:
data3 <- qs2::qd_read(t5)
data3
#> # A tibble: 100 × 3
#>        a              b                 c
#>    <int>     <rvar[1d]>        <rvar[1d]>
#>  1     1   0.0014 ± 1.02  -0.00354 ± 0.99
#>  2     2   0.0479 ± 1.00  -0.06457 ± 1.01
#>  3     3  -0.0602 ± 0.96   0.00022 ± 0.98
#>  4     4   0.0068 ± 0.98   0.01593 ± 1.01
#>  5     5  -0.0101 ± 1.00  -0.02121 ± 0.99
#>  6     6  -0.0049 ± 0.95  -0.04301 ± 0.97
#>  7     7   0.0078 ± 0.97   0.00162 ± 1.03
#>  8     8  -0.0044 ± 0.96   0.03011 ± 0.99
#>  9     9   0.0276 ± 1.00  -0.00378 ± 1.01
#> 10    10  -0.0421 ± 1.05   0.08015 ± 0.99
#> # ℹ 90 more rows

data3$b == data2$b
#> rvar<1000>[100] mean ± sd:
#>   [1] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [11] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [21] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [31] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [41] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [51] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [61] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [71] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [81] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
#>  [91] 1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0  1 ± 0
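The elementwise == comparison returns an rvar, which takes a moment to read. A blunter check is to compare the underlying draws arrays directly. The sketch below is self-contained and uses a small tibble rather than the data above. It relies on posterior::draws_of() to pull out the draws array, and it wraps qs2::qd_save() in suppressWarnings() on the assumption that the only warning is the environment-attribute one shown earlier.

```r
set.seed(1)
small <- tibble::tibble(
  a = 1:3,
  b = posterior::rvar(matrix(rnorm(300), nrow = 100, ncol = 3))
)

path <- tempfile()
# The environment-attribute warning is expected here, so it is silenced.
suppressWarnings(qs2::qd_save(small, path))
restored <- qs2::qd_read(path)

# Compare the raw numbers only, ignoring attributes such as dimnames.
draws_match <- isTRUE(all.equal(
  as.vector(posterior::draws_of(restored$b)),
  as.vector(posterior::draws_of(small$b))
))
draws_match
```

Comparing plain numeric vectors avoids any dependence on how rvar comparison operators treat a missing cache.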