tibblify

library(tibblify)
library(purrr)
library(repurrrsive)
library(vctrs)

Introduction

With tibblify() you can rectangle deeply nested lists into a tidy tibble. These lists might come from an API in the form of JSON or from scraping XML.

Example

Let’s start with gh_users, which is a list from the {repurrrsive} package containing information about four GitHub users. We’ll select a subset of columns to keep the example relatively simple.

gh_users_small <- purrr::map(
  repurrrsive::gh_users,
  ~ .x[c(
    "followers",
    "login",
    "url",
    "name",
    "location",
    "email",
    "public_gists"
  )]
)

names(gh_users_small[[1]])
#> [1] "followers"    "login"        "url"          "name"         "location"    
#> [6] "email"        "public_gists"

We can quickly rectangle gh_users_small with tibblify().

tibblify(gh_users_small)
#> # A tibble: 6 × 7
#>   followers login       url                    name  location email public_gists
#>       <int> <chr>       <chr>                  <chr> <chr>    <chr>        <int>
#> 1       303 gaborcsardi https://api.github.co… Gábo… Chippen… csar…            6
#> 2       780 jennybc     https://api.github.co… Jenn… Vancouv… <NA>            54
#> 3      3958 jtleek      https://api.github.co… Jeff… Baltimo… <NA>            12
#> 4       115 juliasilge  https://api.github.co… Juli… Salt La… <NA>             4
#> 5       213 leeper      https://api.github.co… Thom… London,… <NA>            46
#> 6        34 masalmon    https://api.github.co… Maël… Barcelo… <NA>             0

We can now look at the specification tibblify() used for rectangling.

guess_tspec(gh_users_small)
#> tspec_df(
#>   tib_int("followers"),
#>   tib_chr("login"),
#>   tib_chr("url"),
#>   tib_chr("name"),
#>   tib_chr("location"),
#>   tib_chr("email"),
#>   tib_int("public_gists"),
#> )

If we are only interested in some of the fields we can easily adapt the specification.

spec <- tspec_df(
  login_name = tib_chr("login"),
  tib_chr("name"),
  tib_int("public_gists")
)

tibblify(gh_users_small, spec)
#> # A tibble: 6 × 3
#>   login_name  name                   public_gists
#>   <chr>       <chr>                         <int>
#> 1 gaborcsardi Gábor Csárdi                      6
#> 2 jennybc     Jennifer (Jenny) Bryan           54
#> 3 jtleek      Jeff L.                          12
#> 4 juliasilge  Julia Silge                       4
#> 5 leeper      Thomas J. Leeper                 46
#> 6 masalmon    Maëlle Salmon                     0

Objects

We refer to lists like gh_users_small as collections, and each element of that list is an object. Objects and collections are the typical input for tibblify().

Basically, an object is something that can be converted to a one-row tibble. This boils down to a condition on the names of the object:

  1. The object must have names (the names attribute must not be NULL).
  2. Every element must be named (no name can be NA or "").
  3. The names must be unique.

In other words, the names must fulfill vctrs::vec_as_names(names, repair = "check_unique"). The name-value pairs of an object are the fields.

For example list(x = 1, y = "a") is an object with the fields (x, 1) and (y, "a") but list(1, z = 3) is not an object because it is not fully named.

A collection is a list of similar objects so that the fields can become the columns in a tibble.

Specification

Providing an explicit specification has several advantages:

  1. You can ensure type and shape stability of the resulting tibble in automated scripts.
  2. You can give the columns names that are different than their names in the source collection.
  3. You can parse only the fields you need (and ignore any others).
  4. You can specify what happens if a value is missing.

To create a specification for a collection, use tspec_df(). Describe the columns of the output tibble with the tib_*() functions. The tib_*() functions describe the path to the field to extract and the output type of the field. Throughout these functions, we use the concept of ptype (prototype) as used in the {vctrs} package (see vignette("type-size", package = "vctrs")).

There are five main types of tib_*() functions:

There are also two other tib_*() functions for special cases:

For convenience there are shortcuts for tib_scalar() and tib_vector() for the most common prototypes:

Scalar Elements

Scalar elements are the most common case and result in a normal vector column.

tibblify(
  list(
    list(id = 1, name = "Peter"),
    list(id = 2, name = "Lilly")
  ),
  tspec_df(
    tib_int("id"),
    tib_chr("name")
  )
)
#> # A tibble: 2 × 2
#>      id name 
#>   <int> <chr>
#> 1     1 Peter
#> 2     2 Lilly

With tib_scalar(), you can also provide your own prototype for column types not included in our shortcuts. For example, let’s say you have a list with durations (objects with class “difftime”).

x <- list(
  list(id = 1, duration = vctrs::new_duration(100)),
  list(id = 2, duration = vctrs::new_duration(200))
)
x
#> [[1]]
#> [[1]]$id
#> [1] 1
#> 
#> [[1]]$duration
#> Time difference of 100 secs
#> 
#> 
#> [[2]]
#> [[2]]$id
#> [1] 2
#> 
#> [[2]]$duration
#> Time difference of 200 secs

Use tib_scalar() with vctrs::new_duration() to specify the duration ptype.

tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_scalar("duration", .ptype = vctrs::new_duration())
  )
)
#> # A tibble: 2 × 2
#>      id duration
#>   <int> <drtn>  
#> 1     1 100 secs
#> 2     2 200 secs

Vector Elements

If an element does not always have size one then it is a vector element. If it still always has the same type ptype then it produces a list column with elements of ptype.

x <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)

tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)
#> # A tibble: 3 × 2
#>      id    children
#>   <int> <list<chr>>
#> 1     1         [2]
#> 2     2         [1]
#> 3     3         [3]

You can use tidyr::unnest() or tidyr::unnest_longer() to flatten these columns to regular columns.

Object Elements

In gh_repos_small, the field owner is an object itself.

gh_repos_small <- purrr::map(
  repurrrsive::gh_repos[[1]], 
  ~ .x[c("id", "name", "owner")]
)
gh_repos_small <- purrr::map(
  gh_repos_small,
  function(repo) {
    repo$owner <- repo$owner[c("login", "id", "url")]
    repo
  }
)

gh_repos_small[[1]]
#> $id
#> [1] 61160198
#> 
#> $name
#> [1] "after"
#> 
#> $owner
#> $owner$login
#> [1] "gaborcsardi"
#> 
#> $owner$id
#> [1] 660288
#> 
#> $owner$url
#> [1] "https://api.github.com/users/gaborcsardi"

The specification to extract it uses tib_row().

spec <- guess_tspec(gh_repos_small)
spec
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   tib_row(
#>     "owner",
#>     tib_chr("login"),
#>     tib_int("id"),
#>     tib_chr("url"),
#>   ),
#> )

This specification results in a tibble column.

tibblify(gh_repos_small, spec)
#> # A tibble: 30 × 3
#>          id name        owner$login    $id $url                                 
#>       <int> <chr>       <chr>        <int> <chr>                                
#>  1 61160198 after       gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  2 40500181 argufy      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  3 36442442 ask         gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  4 34924886 baseimports gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  5 61620661 citest      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  6 33907457 clisymbols  gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  7 37236467 cmaker      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  8 67959624 cmark       gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  9 63152619 conditions  gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 10 24343686 crayon      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> # ℹ 20 more rows

If you don’t like the tibble column you can unpack it with tidyr::unpack(). Alternatively, if you only want to extract some of the fields in owner you can use a character vector path.

spec2 <- tspec_df(
  id = tib_int("id"),
  name = tib_chr("name"),
  owner_id = tib_int(c("owner", "id")), # "id" in "owner"
  owner_login = tib_chr(c("owner", "login")) # "login" in "owner"
)
spec2
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   owner_id = tib_int(c("owner", "id")),
#>   owner_login = tib_chr(c("owner", "login")),
#> )

tibblify(gh_repos_small, spec2)
#> # A tibble: 30 × 4
#>          id name        owner_id owner_login
#>       <int> <chr>          <int> <chr>      
#>  1 61160198 after         660288 gaborcsardi
#>  2 40500181 argufy        660288 gaborcsardi
#>  3 36442442 ask           660288 gaborcsardi
#>  4 34924886 baseimports   660288 gaborcsardi
#>  5 61620661 citest        660288 gaborcsardi
#>  6 33907457 clisymbols    660288 gaborcsardi
#>  7 37236467 cmaker        660288 gaborcsardi
#>  8 67959624 cmark         660288 gaborcsardi
#>  9 63152619 conditions    660288 gaborcsardi
#> 10 24343686 crayon        660288 gaborcsardi
#> # ℹ 20 more rows

Required and Optional Fields

Objects usually have some fields that always exist and some that are optional. By default tib_*() requires that a field exists.

x <- list(
  list(x = 1, y = "a"),
  list(x = 2)
)

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y")
)

tibblify(x, spec)
#> Error in `tibblify()`:
#> ! Field y is required but does not exist in `x[[2]]`.
#> ℹ Use `required = FALSE` if the field is optional.

You can mark a field as optional with the argument .required = FALSE.

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", .required = FALSE)
)

tibblify(x, spec)
#> # A tibble: 2 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 <NA>

You can specify the value to use with the .fill argument.

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", .required = FALSE, .fill = "missing")
)

tibblify(x, spec)
#> # A tibble: 2 × 2
#>       x y      
#>   <int> <chr>  
#> 1     1 a      
#> 2     2 missing

Converting a Single Object

To rectangle a single object you have two options: tspec_object() which produces a list, or tspec_row() which produces a tibble with one row.

It often makes more sense to convert such objects to a list. For example a typical API response might be something like this.

api_output <- list(
  status = "success",
  requested_at = "2021-10-26 09:17:12",
  data = list(
    list(x = 1),
    list(x = 2)
  )
)

Use tspec_row() to convert to a one-row tibble.

row_spec <- tspec_row(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_df <- tibblify(api_output, row_spec)
api_output_df
#> # A tibble: 1 × 2
#>   status                data
#>   <chr>   <list<tibble[,1]>>
#> 1 success            [2 × 1]

To access data one has to use api_output_df$data[[1]], which is not very nice. We can use tspec_object() to simplify this situation.

object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_list <- tibblify(api_output, object_spec)
api_output_list
#> $status
#> [1] "success"
#> 
#> $data
#> # A tibble: 2 × 1
#>       x
#>   <int>
#> 1     1
#> 2     2

Now accessing data does not require an extra subsetting step.

api_output_list$data # No [[1]] needed
#> # A tibble: 2 × 1
#>       x
#>   <int>
#> 1     1
#> 2     2