This vignette documents the internal parsing logic of the readtextgrid package. It is intended for developers maintaining the parser or for developers in other languages, not for end users of the package.

In this article, I describe the specification of the .TextGrid file format used in this package, note how it differs from the documented specification provided by Praat, and provide a high-level overview of R code and a C++ translation that can parse .TextGrid files.

Example .TextGrid file contents

The .TextGrid file format used by Praat is very flexible. Below are three different .TextGrid files representing the same Praat textgrid.

Long format:

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 1 
tiers? <exists> 
size = 1 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "Mary" 
        xmin = 0 
        xmax = 1 
        intervals: size = 1 
        intervals [1]:
            xmin = 0 
            xmax = 1 
            text = "" 

Short format:

File type = "ooTextFile"
Object class = "TextGrid"

0
1
<exists>
1
"IntervalTier"
"Mary"
0
1
1
0
1
""

Custom format with comments and other noise:

File type = "ooTextFile"
Object class = "TextGrid"

! info about the grid
0s 1s <exists> 1
! info about the tier
"IntervalTier" "Mary" 0s 1s 1 ! type, name, xmin, xmax, size
0s 1s "" ! interval xmin, xmax, text

readtextgrid can handle all three of these files in the same way because the Praat textgrid specification is simple—once you figure it out. I developed the readtextgrid specification by reading Praat’s description of the format, testing various edge cases in the format and testing whether Praat would open the test file. If Praat could handle the file, it had to be supported by this package’s textgrid parser.

Package design

To read in a .TextGrid file, we do the following:

  • read it in with the proper character encoding
  • tokenize the file contents from a sequence of characters into a list of Praat strings and Praat numbers
  • identify the start and end tokens of each textgrid tier
  • split those tokens up into batches of data and assemble dataframes

This document concerns the tokenization step. The remaining parsing steps follow straightforward split-apply-combine programming in R.

Documented .TextGrid file format specification

First, let’s start with Paul Boersma’s own description of the file format. He notes that the long format contains several comments to help a person read the file, and that these are ultimately ignored by Praat. Instead, there are only a few important tokens:

Praat will consider as data only the following types of information in the file:

  • free-standing numbers, such as 0 and 2.3 above, but not [1] or [3];
  • free-standing text enclosed within double quotes, such as "TextGrid" and "" above;
  • free-standing flags, such as <exists> above (this is the only flag that appears in TextGrid files […]).

In this list, “free-standing” means that the number, text or flag is preceded by the beginning of the file, the beginning of a line, or a space, and that it is followed by the end of the file, the end of a line, or a space.

He also mentions additional features about the format:

  • ! comments: “everything that follows an exclamation mark on the same line is considered a comment”.
  • "" escapement by doubling: “a double quote that appears in a text [i.e., a string] is written as a doubled double quote in the text file.”
  • ignore the <flag> tokens anyway: “The flag <exists>, which tells us that this TextGrid contains tiers (this value would be <absent> if the TextGrid contained no tiers, in which case the file would end here; however, you cannot really create TextGrid objects without tiers in Praat, so this issue can be ignored).”

These details are mostly accurate and simple enough, but they leave some cases unspecified. For example, they don’t say what to do with .1 (Praat treats it as an error).

Our specification of the .TextGrid file format

After testing, I developed the following specification for this R package.

  • There are two kinds of tokens: strings and numbers.

  • Strings start and end with a ". If a string is supposed to have a double-quote character " inside of it, double the quote characters instead. The textgrid interval text He said “hello” to me would have the string "He said ""hello"" to me". Everything inside of the " pair belongs to the string, even line breaks and comments.

  • A string is fully “free-standing”. It should be preceded and followed by a space, newline, or the start or end of a file. I said"Hello" does not contain a string because there is no space before the " character.

  • Numbers start with a digit, optionally preceded by a plus or minus sign. Decimal, hexadecimal, and scientific notation are supported. Numbers use a . for the decimal point character. .5 is not a number because it doesn’t start with a digit or with a sign followed by a digit. Praat also reads fractions and percentages (a number ending with a % is divided by 100), but this package’s parser does not support them yet.

  • A number is “left free-standing” (my terminology). It must be preceded by a space or newline. (Using the file start doesn’t make sense for a boundary). From a valid start of a number, characters are read until the sequence of characters would no longer yield a number. Any additional characters until the next space, newline, or file boundary are ignored. In 100ms and +100e1ms, for example, the final ms characters are ignored.

  • Praat does not support real numbers with a stranded exponent (1e). These kinds of numbers are an exception to the left-free-standing feature described earlier.

  • Everything else is a comment and ignored. I differentiate between two kinds of comments. This is my terminology, not Praat’s.

  • “Strong” comments start with a ! and end with a newline (\n).

  • “Weak” comments would be any token that does not start like a string or number. In the long format textgrid, size = 1 would be two ignored weak comments (size, =) and a number (1).
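The string rules above (outer quotes, doubled quotes for a literal quote) can be illustrated with a minimal sketch. The helper name praat_unquote() is hypothetical and is not part of readtextgrid. It assumes the token has already been split off at whitespace boundaries.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of readtextgrid).
// If `token` is a Praat string token (starts and ends with a double quote),
// writes its contents to `out` with doubled quotes collapsed and returns true.
// Otherwise returns false and leaves `out` empty.
bool praat_unquote(const std::string& token, std::string& out) {
  out.clear();
  // A string token needs at least an opening and a closing quote.
  if (token.size() < 2 || token.front() != '"' || token.back() != '"') {
    return false;
  }
  // Walk the characters between the outer quotes.
  for (std::size_t i = 1; i + 1 < token.size(); ++i) {
    out.push_back(token[i]);
    // A doubled quote "" stands for one literal quote: skip the second one.
    if (token[i] == '"' && i + 2 < token.size() && token[i + 1] == '"') {
      ++i;
    }
  }
  return true;
}
```

With this sketch, the token "He said ""hello"" to me" yields the text He said "hello" to me, and a weak-comment token such as size is rejected.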

The allowance for characters on the right side of numbers is the major difference between the description of the Praat format and the one used in this package.

Reference R implementation for textgrid tokenization

Given a vector of characters from a Praat .TextGrid file, we want a list of strings and numbers contained in the file. For example, here are the characters from the short textgrid file and the output of the R-based tokenization:

tg_characters <- examples[2] |> 
  strsplit("") |> 
  unlist()

tg_characters
#>   [1] "F"  "i"  "l"  "e"  " "  "t"  "y"  "p"  "e"  " "  "="  " "  "\"" "o"  "o" 
#>  [16] "T"  "e"  "x"  "t"  "F"  "i"  "l"  "e"  "\"" "\n" "O"  "b"  "j"  "e"  "c" 
#>  [31] "t"  " "  "c"  "l"  "a"  "s"  "s"  " "  "="  " "  "\"" "T"  "e"  "x"  "t" 
#>  [46] "G"  "r"  "i"  "d"  "\"" "\n" "\n" "0"  "\n" "1"  "\n" "<"  "e"  "x"  "i" 
#>  [61] "s"  "t"  "s"  ">"  "\n" "1"  "\n" "\"" "I"  "n"  "t"  "e"  "r"  "v"  "a" 
#>  [76] "l"  "T"  "i"  "e"  "r"  "\"" "\n" "\"" "M"  "a"  "r"  "y"  "\"" "\n" "0" 
#>  [91] "\n" "1"  "\n" "1"  "\n" "0"  "\n" "1"  "\n" "\"" "\"" "\n"

tg_characters |> 
  readtextgrid:::r_tokenize_textgrid_chars() |> 
  str()
#> List of 13
#>  $ : chr "ooTextFile"
#>  $ : chr "TextGrid"
#>  $ : num 0
#>  $ : num 1
#>  $ : num 1
#>  $ : chr "IntervalTier"
#>  $ : chr "Mary"
#>  $ : num 0
#>  $ : num 1
#>  $ : num 1
#>  $ : num 0
#>  $ : num 1
#>  $ : chr ""

Some comments about this function:

  • r_tokenize_textgrid_chars() is not an exported or supported function. That is why it needs to be accessed with the triple colon namespace operator :::.
  • The function was the intended implementation for the package until I converted the implementation to C++. I keep this R version around as a reference implementation for testing the current C++ implementation.
  • Don’t use it.

The big ideas in r_tokenize_textgrid_chars() are the following:

  • We have three special states: in_strong_comment, in_string, and in_escaped_quote. These determine how we interpret spaces, newlines, and " characters. When in_strong_comment is true, we skip the character iteration loop with next until we see a newline. When in_escaped_quote is true, we skip the next iteration of the loop (so that the second " of a doubled "" pair is not treated as the end of the string). When in_string is true, we keep collecting characters for the current token until we see a closing ".

  • When these states are all false and we see a space or newline, then we have reached the end of the current token. We extract the characters for the current token, combine them into a single value, check the value and keep it if it is a Praat string or Praat number. Then we reset the current token position and advance.

Everything else is book-keeping to check for a special state or initialize a new token.

The complete code is given below. It is fairly well-commented but you don’t have to read it—just knowing the high-level details is sufficient.

function(all_char) {
  # The parser rules here follow the textgrid specifications
  # <https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html> EXCEPT
  # when they contradict the behavior of Praat.exe. For example, the spec says
  # the main literals are freestanding strings and numbers, where freestanding
  # means that they have a whitespace or boundary (newline or file start/end).
  # But Praat.exe can handle numbers like "10.00!comment". So, this parser
  # gathers freestanding literals but only keeps ones that are strings or
  # start with a valid number (the non-numeric characters are lopped off.)

  in_strong_comment <- FALSE         # Comment mode: ! to new line \n
  in_string <- FALSE                 # String mode: "Quote to quote"
  in_escaped_quote <- FALSE          # Escaped quote: "" inside of a string

  token_start <- integer(0)          # Start of current token
  values <- vector(mode = "list")    # Collects completed values

  for (i in seq_along(all_char)) {
    cur_value_ready <- length(token_start) != 0
    c <- all_char[i]
    c_is_whitespace <- c %in% c(" ", "\n")
    c_starts_string <- c == "\""

    # Comments start with ! and end with \n. Skip characters in this mode.
    if (!in_string & c == "!") {
      in_strong_comment <- TRUE
      next
    }
    if (in_strong_comment) {
      if (c == "\n") in_strong_comment <- FALSE
      next
    }

    # Whitespace delimits values so collect values if we see whitespace
    if (c_is_whitespace & !in_string) {
      # Skip whitespace if no values collected so far
      if (!cur_value_ready) next

      total_value <- all_char[seq(token_start, i - 1)] |>
        paste0(collapse = "")
      is_string <- all_char[token_start] == "\"" && all_char[i - 1] == "\""

      # Collect only numbers and strings
      if (r_tg_parse_is_number(total_value)) {
        # Keep only the numeric part.
        total_value <- total_value |> r_tg_parse_extract_number()
        values <- c(values, total_value)
      } else if (is_string) {
        values <- c(values, total_value)
      }
      token_start <- integer(0)
      next
    }

    # Store character if ending an escaped quote
    if (in_escaped_quote) {
      in_escaped_quote <- !in_escaped_quote
      next
    }

    # Start or close string mode if we see "
    if (c_starts_string) {
      # Check for "" escapes
      peek_c <- all_char[i + 1]
      if (peek_c == "\"" & in_string) {
        in_escaped_quote <- TRUE
      } else {
        in_string <- !in_string
      }
    }

    if (!cur_value_ready) {
      token_start <- i
    }
  }

  values |>
    lapply(r_tg_parse_convert_value)
}

C++ implementation

Given the simple nature of the R code and its relatively slow performance compared to the legacy version of the parser, I used ChatGPT to help convert the R code into a C++ implementation built on the cpp11 package. I tried to make sure I understood every line and made my own comments to help my understanding.

The C++ code is a straightforward translation of the R version. For example, here is the part of the function that collects tokens when we see a space or newline:

    if (!in_string && is_ws(b)) {
      if (have_token) {
        size_t start = tok_start_byte;
        size_t end   = (curr_char_byte == 0 ? 0 : prev_char_byte);
        size_t len   = (end >= start) ? (end - start + 1) : 0;
        if (len > 0) {
          // do we have a string (start and end with ")
          bool q = (static_cast<unsigned char>(src[start]) == 0x22) &&
            (static_cast<unsigned char>(src[end])   == 0x22);
          tokens.push_back(src.substr(start, len));
          tokens_is_string.push_back(q);
        }
        have_token = false;
      }
      continue;
    }

Some details are different: The C++ version extracts tokens with a substring (.substr()) method, delays checking whether the token is a number until later on, and accumulates results into lists (tokens and tokens_is_string). But the underlying logic is the same as the R version.
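To make the shared logic concrete, here is a minimal, self-contained sketch of a raw-token scanner. It is not the package's cpp_tg_scan_tokens(): the function name scan_raw_tokens() is hypothetical. It also makes three assumptions of its own: tab and carriage return count as whitespace, a ! outside a string closes any pending token, and a token still open at the end of the input is kept.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch, not the package's cpp_tg_scan_tokens().
// Splits text into whitespace-delimited raw tokens, keeping quoted strings
// (with their quotes) intact and dropping "!" comments.
std::vector<std::string> scan_raw_tokens(const std::string& src) {
  std::vector<std::string> tokens;
  bool in_comment = false;  // between "!" and the next newline
  bool in_string = false;   // between an opening and a closing quote
  bool have_token = false;  // is a token currently being collected?
  std::size_t start = 0;    // index of the current token's first character

  for (std::size_t i = 0; i < src.size(); ++i) {
    const char c = src[i];

    // Inside a comment: ignore everything up to and including the newline.
    if (in_comment) {
      if (c == '\n') in_comment = false;
      continue;
    }

    // Assumption: whitespace is space, tab, carriage return, or newline.
    const bool is_ws = (c == ' ' || c == '\t' || c == '\r' || c == '\n');
    const bool starts_comment = (c == '!');

    // Outside a string, whitespace or "!" closes the pending token.
    if (!in_string && (is_ws || starts_comment)) {
      if (have_token) {
        tokens.push_back(src.substr(start, i - start));
        have_token = false;
      }
      if (starts_comment) in_comment = true;
      continue;
    }

    // Any other character starts a token if none is open.
    if (!have_token) {
      have_token = true;
      start = i;
    }

    if (c == '"') {
      if (in_string && i + 1 < src.size() && src[i + 1] == '"') {
        ++i;  // doubled quote inside a string: step over the second quote
      } else {
        in_string = !in_string;  // opening or closing quote
      }
    }
  }

  // Assumption: keep a token that is still open at the end of the input.
  if (have_token) tokens.push_back(src.substr(start));
  return tokens;
}
```

Deciding which raw tokens are Praat strings or numbers is left to a later step, as in the C++ version described above.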

The C++ function takes a single character value (one whole string) for the file contents and returns a list with four parts: the tokens in the file, whether each token is a Praat string, the number of leading characters in each token that form a number, and the value of that number:

examples[2] |> 
  readtextgrid:::cpp_tg_scan_tokens() |> 
  as.data.frame()
#>            tokens is_string num_prefix num_value
#> 1            File     FALSE          0        NA
#> 2            type     FALSE          0        NA
#> 3               =     FALSE          0        NA
#> 4    "ooTextFile"      TRUE          0        NA
#> 5          Object     FALSE          0        NA
#> 6           class     FALSE          0        NA
#> 7               =     FALSE          0        NA
#> 8      "TextGrid"      TRUE          0        NA
#> 9               0     FALSE          1         0
#> 10              1     FALSE          1         1
#> 11       <exists>     FALSE          0        NA
#> 12              1     FALSE          1         1
#> 13 "IntervalTier"      TRUE          0        NA
#> 14         "Mary"      TRUE          0        NA
#> 15              0     FALSE          1         0
#> 16              1     FALSE          1         1
#> 17              1     FALSE          1         1
#> 18              0     FALSE          1         0
#> 19              1     FALSE          1         1
#> 20             ""      TRUE          0        NA

Before I had figured out how to parse numbers with C++, I originally was going to use R code on the token column to figure out whether each token is a legal number or not. That’s why this function returns a list of vectors with information about the tokens.

Back in the R layer, the final tokens are selected using really basic vector operations:

readtextgrid:::tokenize_textgrid
#> function (tg_text) 
#> {
#>     res <- withr::with_locale(c(LC_NUMERIC = "C"), cpp_tg_scan_tokens(tg_text))
#>     toks <- res$tokens
#>     is_string <- res$is_string
#>     is_number <- (res$num_prefix != 0) & !is_string
#>     keep <- is_number | is_string
#>     toks <- toks[keep]
#>     out <- vector("list", length(toks))
#>     strings <- toks[is_string[keep]]
#>     strings <- substring(strings, 2L, nchar(strings) - 1L)
#>     strings <- gsub("\"\"", "\"", strings, fixed = TRUE)
#>     out[is_string[keep]] <- strings
#>     out[is_number[keep]] <- res$num_value[is_number]
#>     out
#> }
#> <bytecode: 0x55d68b30c6e8>
#> <environment: namespace:readtextgrid>

An important part of this function is the withr::with_locale(c(LC_NUMERIC = "C"), ... ) call. We are setting the locale for numbers to the C locale which means that . is the decimal point character, and not a comma as in some locales.
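The package sets the locale from R with withr. For readers porting the parser to another setting, the same idea can be sketched on the C++ side with a small scope guard. This is an illustration only: the class name NumericCLocale and the function parse_with_c_locale() are hypothetical and do not appear in readtextgrid.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical scope guard (not part of readtextgrid): switches LC_NUMERIC
// to "C" on construction and restores the previous setting on destruction.
class NumericCLocale {
 public:
  NumericCLocale() {
    // Query the current setting without changing it.
    const char* current = std::setlocale(LC_NUMERIC, nullptr);
    saved_ = current ? current : "C";
    std::setlocale(LC_NUMERIC, "C");
  }
  ~NumericCLocale() { std::setlocale(LC_NUMERIC, saved_.c_str()); }

  NumericCLocale(const NumericCLocale&) = delete;
  NumericCLocale& operator=(const NumericCLocale&) = delete;

 private:
  std::string saved_;  // copy of the locale name that was active before
};

// strtod() reads "." as the decimal point while the guard is alive.
double parse_with_c_locale(const char* text) {
  NumericCLocale guard;
  return std::strtod(text, nullptr);
}
```

Note that std::setlocale() changes process-wide state, so a guard like this is only appropriate in single-threaded code.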

Parsing numbers is also handled by C++. I discovered that the standard library strtod() function does exactly what we need:

Interprets a floating-point value in a byte string pointed to by str.

Function discards any whitespace characters (as determined by isspace) until first non-whitespace character is found. Then it takes as many characters as possible to form a valid floating-point representation and converts them to a floating-point value.

https://en.cppreference.com/w/c/string/byte/strtof

We include some additional logic to make sure that .4 is illegal and to output NA_real_ for missing values, but otherwise, strtod() does the work for us.

One consequence of this approach is that we can parse other kinds of numbers, such as hexadecimal numbers with exponents. It turns out that Praat can also parse these numbers in a .TextGrid file.
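The following is a minimal sketch of this strtod()-based approach, not the package's actual function. The names NumberPrefix and parse_number_prefix() are hypothetical, and NaN stands in for R's NA_real_. It assumes the numeric locale is "C".

```cpp
#include <cctype>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical result type (not part of readtextgrid).
struct NumberPrefix {
  std::size_t prefix_len;  // leading characters that form the number (0 if none)
  double value;            // parsed value; NaN when prefix_len is 0
};

// Hypothetical sketch: parse the longest numeric prefix of a token.
NumberPrefix parse_number_prefix(const std::string& token) {
  const NumberPrefix none = {0, std::nan("")};

  // Require an optional sign followed by a digit, so ".5" and "+.0" are
  // rejected even though strtod() alone would accept them.
  std::size_t pos = 0;
  if (pos < token.size() && (token[pos] == '+' || token[pos] == '-')) ++pos;
  if (pos >= token.size() ||
      !std::isdigit(static_cast<unsigned char>(token[pos]))) {
    return none;
  }

  // strtod() consumes as many characters as form a valid number and
  // reports where it stopped through `end`.
  const char* begin = token.c_str();
  char* end = nullptr;
  const double value = std::strtod(begin, &end);
  if (end == begin) return none;

  const NumberPrefix out = {static_cast<std::size_t>(end - begin), value};
  return out;
}
```

For example, this sketch gives a prefix length of 3 and a value of 0 for 000ms, and a prefix length of 0 for .5.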

The number-parsing logic has its own function, so we can test how specific tokens are parsed:

test_nums <- c("+1.0", "000ms", "-2", "0xA", ".5", "+.0") 

as.data.frame(c(
  test_nums = list(test_nums),
  readtextgrid:::cpp_parse_praat_numbers(test_nums)
))
#>   test_nums prefix_len value
#> 1      +1.0          4     1
#> 2     000ms          3     0
#> 3        -2          2    -2
#> 4       0xA          3    10
#> 5        .5          0    NA
#> 6       +.0          0    NA

There are two limitations with the number parser used in this package:

  • We do not support fractions and percentages. (Praat does.)
  • We accept stranded exponents. (Praat does not.)

test_nums <- c("1e", "1E", "20/10", "1000%") 
expected <- c(NA_real_, NA_real_, 2.0, 10.0) 

as.data.frame(c(
  test_nums = list(test_nums),
  readtextgrid:::cpp_parse_praat_numbers(test_nums),
  expected_value = list(expected)
))
#>   test_nums prefix_len value expected_value
#> 1        1e          1     1             NA
#> 2        1E          1     1             NA
#> 3     20/10          2    20              2
#> 4     1000%          4  1000             10

These are not high-priority limitations until we find a case where a software program writes out .TextGrid files that use these features.
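Should these features ever be needed, one possible way to close the gaps is sketched below. This is not what the package does, and the function name parse_praat_like_number() is hypothetical. The rules are assumptions chosen to reproduce the expected values above. A numeric prefix followed directly by e or E is rejected, a trailing % divides by 100, and a / followed by a digit starts a denominator. How Praat treats less tidy inputs has not been checked here.

```cpp
#include <cctype>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical extension (not part of readtextgrid). Returns NaN when the
// token is not accepted as a number.
double parse_praat_like_number(const std::string& token) {
  const double na = std::nan("");

  // Optional sign followed by a digit, as in the package's rule for ".5".
  std::size_t pos = 0;
  if (pos < token.size() && (token[pos] == '+' || token[pos] == '-')) ++pos;
  if (pos >= token.size() ||
      !std::isdigit(static_cast<unsigned char>(token[pos]))) {
    return na;
  }

  const char* begin = token.c_str();
  char* end = nullptr;
  const double value = std::strtod(begin, &end);
  if (end == begin) return na;

  // Assumption: an 'e' or 'E' right after the numeric prefix means the
  // exponent was stranded (strtod() would have consumed a complete one).
  if (*end == 'e' || *end == 'E') return na;

  // Assumption: a percent sign directly after the number divides by 100.
  if (*end == '%') return value / 100.0;

  // Assumption: "/" followed by a digit starts a denominator.
  if (*end == '/' && std::isdigit(static_cast<unsigned char>(end[1]))) {
    char* den_end = nullptr;
    const double denominator = std::strtod(end + 1, &den_end);
    if (den_end != end + 1) return value / denominator;
  }

  // Any other trailing characters are ignored, as in 100ms.
  return value;
}
```

Under these assumptions, the four test tokens above would produce the expected values: missing for 1e and 1E, 2 for 20/10, and 10 for 1000%.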

Notes on testing

The package’s folder tests/testthat/test-data includes a series of .TextGrid files for testing the parsing functions. One of these, hard-to-parse.TextGrid, collects as many edge cases as I can imagine.

The C++ implementation is tested against the legacy parser on easy long-format textgrid files and against the pure R implementation on other test textgrid files, including hard-to-parse.TextGrid.

The folder tests/testthat/test-data/praat-test includes some tests of whether Praat can open a file or not. Files that fail to open start with fail- and files that open start with okay-. We support only the syntactic features in the okay- files.

Notes on the Praat source code

I did not rely on the Praat source code but I tried! The Praat source code has to read in all kinds of text files so there is not an obvious read_textgrid()-like function for parsing a .TextGrid file. Still, I was able to find how numbers are read in from a text file.

The primitive data types of Praat are defined in the Melder folder. The abcio.cpp file has functions like getReal() for reading a float from text. getReal() calls the Melder_a8tof() function in melder_atof.cpp to convert strings into numbers, and this function in turn calls findEndOfNumericString(), which processes numbers character by character.