Making targets handle large bootstrap workflows
Here is a summary of how I got targets to handle a bunch of bootstrapping:
- transient memory
- not letting errors on one branch stop the whole pipeline
- tarchetypes::tar_group_by()
- format = "fst_tbl" or format = "qs"
- tar_read(..., 1)
My bootstraps would instantly fill up my computer’s memory during a
build. So now I dump targets out of memory with memory = "transient".
(You can set this at the individual-target level too.) I also keep the
build running if one target/branch fails:
tar_option_set(
  error = "null",
  memory = "transient"
)
I also split single large targets into smaller ones. I declare a global partition count like
N_PARTITIONS <- 80
Then I use tarchetypes::tar_group_by() to tell targets to split a
giant dataframe into groups that can be dynamically branched over. Here
is some actual code from the _targets.R file. The first target
defines 2000 bootstrap replicates to be split into 80
partitions/branches, and the second fits the models on each
partition/branch.
tar_group_by(
  straps,
  data_iwpm |>
    filter(!is.na(mean_wpm)) |>
    prepare_bootstrap_data() |>
    bootstrap_by_slpg(
      times = n_boot,
      col_child_id = subject_id,
      seed = 20220621
    ) |>
    mutate(partition = assign_partition_numbers(id, N_PARTITIONS)),
  partition
),

tar_target(
  models_intelligibility,
  fit_bootstrap_models(straps, "mean_intel", "intelligibility", 3, 2),
  pattern = map(straps),
  format = "qs"
),
# n.b. these targets are built serially because fit_bootstrap_models()
# and friends use furrr to fit the models within each branch in parallel
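The helper assign_partition_numbers() isn’t shown above; a minimal sketch of what it might do (assuming it just deals the unique ids out into roughly equal partitions, so that all rows for one bootstrap replicate land in the same partition) is:

```r
# Sketch: map each unique id to a partition number from 1 to n_partitions.
# Rows sharing an id (one bootstrap replicate) stay in one partition.
assign_partition_numbers <- function(id, n_partitions) {
  ids <- unique(id)
  # recycle 1, 2, ..., n_partitions over the unique ids
  partitions <- rep_len(seq_len(n_partitions), length(ids))
  partitions[match(id, ids)]
}

assign_partition_numbers(c("a", "a", "b", "c"), 2)
#> [1] 1 1 2 1
```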
But to be honest, looking at this code now, I realize that this “make a giant thing and split it into branches” approach could instead have been “make a small table of branch IDs and grow the data within each branch”. By analogy: instead of reading a dozen files at once and splitting up the work so each file is processed separately, take a dozen filenames and handle each file from its filename separately.
I have to use tarchetypes::tar_group_by() instead of
tarchetypes::tar_group_count() because I have multiple datasets that I
am bootstrapping, and they need to be kept together within a branch.
For example, to split the following into two branches, I would want
bootstrap_number 1 and 2 in one branch and 3 in a separate branch, so
that I can compare a, b and c within each bootstrap.
data_set,bootstrap_number
a,1
b,1
c,1
a,2
b,2
c,2
a,3
b,3
c,3
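A toy version of the grouping target for this little table might look like the following (straps_toy is a made-up target name; the tibble just recreates the table above):

```r
library(targets)
library(tarchetypes)

# Branch over bootstrap_number so that data sets a, b, and c
# travel together within each branch.
tar_group_by(
  straps_toy,
  tibble::tibble(
    data_set = rep(c("a", "b", "c"), times = 3),
    bootstrap_number = rep(1:3, each = 3)
  ),
  bootstrap_number
)
```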
For serializing the data, I used format = "qs" on models or lists of
models and format = "fst_tbl" on plain, inoffensive data tables.
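For instance, a target that yields a plain tibble could be stored with "fst_tbl" like this (data_clean and clean_iwpm() are placeholder names):

```r
library(targets)

# Placeholder target: "fst_tbl" (backed by the fst package) suits plain
# data frames; models go in "qs" because fst cannot store arbitrary R objects.
tar_target(
  data_clean,
  clean_iwpm(data_iwpm),  # hypothetical cleaning step
  format = "fst_tbl"
)
```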
I also started using targets::tar_read(blah, n), where n is some
selection of branches like 1 or c(1, 2, 4), to read a subset of the
overall data. This speeds up interactive development: I can write code
against a small slice of the data without having to load gigantic
objects into R.
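With the pipeline above, that reads as follows (this assumes the pipeline has already been built, since tar_read() pulls from the targets data store):

```r
library(targets)

# Load only branches 1, 2, and 4 of the dynamic target,
# instead of combining all 80 branches into one object.
models_subset <- tar_read(models_intelligibility, branches = c(1, 2, 4))
```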