Using DuckDB with dplyr to query control files
duckdb is awesome. I used it in a project to read in all of control files (tsv files) in a folder into a database with the following code. There is a lot to unpack here, but here are the main ideas:
-
dplyr is used to generate the queries that sent to the database.
-
The functions that start with
regexp_are duckdb functions, not R functions, but that’s not a problem because the code that will be run is SQL code in the database. -
Normally,
tbl()is used to specify a table in a database, but here, the database table is constructed from the tsv files usingread_csv_auto()with wildcard characters. -
The code in
run_control_file_query()produces a SQL query. The line withcompute()is the one that actually has the database execute the query and store the results in a database table.
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(dbdir = "data-raw/data.duckdb"))
# Generates a SQL query from all the control files in the
# working directory
run_control_file_query <- function() {
tbl(con, "read_csv_auto('*Control*File-*.txt', FILENAME = TRUE)") |>
# this all gets converted to SQL statements that will be executed by the
# database, so there is a mix of R functions and duckdb functions here.
select(
control_file = filename,
repeat_no = Repeat,
sentence = Sentence,
response = Response,
s_word = SWord,
m_word = MWord,
m_word_a = MWordA,
phonetic_sentence = `Phonetic Sentence`,
phonetic_response = `Phonetic Response`,
audio_file = File,
) |>
mutate(
control_file_number = control_file |>
regexp_extract('\\d{7}\\d+'),
item = audio_file |>
toupper() |>
# fix files with a subscript letter (S7T10a)
regexp_replace("[ABCDE].WAV", ".WAV") |>
# fix files with an extra 0(WT010)
regexp_replace("0(\\d\\d).WAV", "\\1.WAV") |>
regexp_extract('(S|S\\d|W)(T)\\d\\d.WAV') |>
left(-4L),
tocs_type = ifelse(
substr(item, 1L, 1L) == "W",
"single-word",
"multiword"
)
)
}
# (Re-)Create a table from the first batch of files
withr::with_dir("../files-to-import/2022-10-27-batch-1/", {
DBI::dbRemoveTable(con, "control_files")
run_control_file_query() |>
compute(name = "control_files", temporary = FALSE)
})
Leave a comment