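bigANNOY v3 adds explicit index lifecycle support around persisted Annoy files. That makes it possible to build an index once, reopen it in later sessions, validate it before use, and close loaded handles explicitly.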
This vignette focuses on those operational workflows rather than on search quality or benchmark tuning.
Annoy indexes are stored on disk. In practice, that means the useful object is not just the result of a single build call, but a persisted pair:
- the .ann index file
- the .meta sidecar metadata file

The bigannoy_index object returned by bigANNOY is a session-level wrapper around those files. It remembers the key metadata and can optionally hold a live native handle for faster repeated searches within the same R session.
library(bigANNOY)
library(bigmemory)
We will create a small reference matrix, write the Annoy index into a temporary directory, and keep the returned object in lazy mode so the first search is what loads the live handle.
artifact_dir <- file.path(tempdir(), "bigannoy-lifecycle")
dir.create(artifact_dir, recursive = TRUE, showWarnings = FALSE)
# Two well-separated clusters of three points each (rows 1-3 and rows 4-6)
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2,
    0.1, 0.0, 0.1,
    0.2, 0.1, 0.0,
    1.0, 1.1, 1.2,
    1.1, 1.0, 1.1,
    1.2, 1.1, 1.0
  ),
  ncol = 3,
  byrow = TRUE
)
ref_big <- as.big.matrix(ref_dense)
index_path <- file.path(artifact_dir, "ref.ann")
metadata_path <- paste0(index_path, ".meta")
# load_mode = "lazy" defers loading the native handle until the first search
index <- annoy_build_bigmatrix(
  ref_big,
  path = index_path,
  n_trees = 25L,
  metric = "euclidean",
  seed = 123L,
  load_mode = "lazy"
)
index
#> <bigannoy_index>
#> path: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpBEyDSE/bigannoy-lifecycle/ref.ann
#> metadata: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpBEyDSE/bigannoy-lifecycle/ref.ann.meta
#> index_id: annoy-20260327203934-2a8b6582c143
#> metric: euclidean
#> trees: 25
#> items: 6
#> dimension: 3
#> build_seed: 123
#> build_threads: -1
#> build_backend: cpp
#> load_mode: lazy
#> loaded: FALSE
#> file_size: 2968
#> file_md5: 2a8b6582c143e941abc77e79789e227e
#> prefault: FALSE
The returned object points to the persisted files, but the native handle is not loaded yet.
annoy_is_loaded(index)
#> [1] FALSE
file.exists(index$path)
#> [1] TRUE
file.exists(index$metadata_path)
#> [1] TRUE
The sidecar metadata file is meant to support safe reopen and validation workflows. It records the metric, dimension, item count, build settings, and a small file signature for the persisted Annoy file.
metadata <- read.dcf(index$metadata_path)
metadata[, c(
  "index_id",
  "metric",
  "n_dim",
  "n_ref",
  "n_trees",
  "build_seed",
  "build_backend",
  "file_size",
  "file_md5"
)]
| index_id | metric | n_dim | n_ref | n_trees | build_seed | build_backend | file_size | file_md5 |
|---|---|---|---|---|---|---|---|---|
| annoy-20260327203934-2a8b6582c143 | euclidean | 3 | 6 | 25 | 123 | cpp | 2968 | 2a8b6582c143e941abc77e79789e227e |
The important point is not the exact formatting of the metadata file, but that the persisted index is now self-describing enough to be reopened and checked in later sessions.
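To make the idea of a file signature concrete, here is a minimal base R sketch. It is not bigANNOY's internal implementation: the file is a hypothetical stand-in rather than a real Annoy index, and only the field names file_size and file_md5 are borrowed from the metadata shown above.

```r
# Hypothetical stand-in for a persisted index file (not a real Annoy index).
demo_file <- tempfile(fileext = ".ann")
writeBin(as.raw(1:64), demo_file)

# Record a small signature in a DCF sidecar next to the file.
demo_meta <- paste0(demo_file, ".meta")
write.dcf(
  data.frame(
    file_size = file.size(demo_file),
    file_md5 = unname(tools::md5sum(demo_file)),
    stringsAsFactors = FALSE
  ),
  file = demo_meta
)

# Later: re-read the sidecar and compare it with the file on disk.
recorded <- read.dcf(demo_meta, all = TRUE)
signature_ok <- c(
  file_size = as.numeric(recorded$file_size) == file.size(demo_file),
  file_md5 = recorded$file_md5 == unname(tools::md5sum(demo_file))
)
signature_ok

# Any change to the file breaks the recorded signature.
cat("x", file = demo_file, append = TRUE)
size_still_matches <- file.size(demo_file) == as.numeric(recorded$file_size)
size_still_matches
```

The size comparison is cheap and catches truncation, while the MD5 comparison catches same-size modifications at the cost of reading the whole file.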
There are two lifecycle modes:

- "lazy" keeps only metadata in memory until the first search
- "eager" loads a native handle immediately when the index object is created or reopened

The index we just built is lazy.
annoy_is_loaded(index)
#> [1] FALSE
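The lazy pattern itself can be sketched in a few lines of base R. This is only an illustration of the idea, not how bigANNOY is implemented: new_lazy_index(), get_handle(), is_loaded(), and close_handle() are hypothetical names, and the "handle" is just a string.

```r
# Hypothetical lazy wrapper: the path is always available, while the
# "handle" is created on first use and cached afterwards.
new_lazy_index <- function(path, loader) {
  state <- new.env()
  state$path <- path
  state$loader <- loader
  state$handle <- NULL
  state
}

is_loaded <- function(state) !is.null(state$handle)

get_handle <- function(state) {
  # Load only when no handle is cached yet.
  if (!is_loaded(state)) {
    state$handle <- state$loader(state$path)
  }
  state$handle
}

close_handle <- function(state) {
  # Drop the cached handle; the path stays, so a later call can reload.
  state$handle <- NULL
  invisible(state)
}

demo <- new_lazy_index(
  "ref.ann",
  loader = function(path) paste("handle for", path)
)
is_loaded(demo)             # FALSE
invisible(get_handle(demo)) # first use triggers the load
is_loaded(demo)             # TRUE
close_handle(demo)
is_loaded(demo)             # FALSE
```

Because the state lives in an environment, each wrapper object tracks its own loaded flag independently of any other wrapper pointing at the same path.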
The first search loads the handle automatically.
first_result <- annoy_search_bigmatrix(index, k = 2L, search_k = 100L)
annoy_is_loaded(index)
#> [1] TRUE
first_result$index
| 2 | 3 |
| 1 | 3 |
| 2 | 1 |
| 5 | 6 |
| 4 | 6 |
| 5 | 4 |
round(first_result$distance, 3)
| 0.173 | 0.283 |
| 0.173 | 0.173 |
| 0.173 | 0.283 |
| 0.173 | 0.283 |
| 0.173 | 0.173 |
| 0.173 | 0.283 |
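On a reference set this small, the neighbours reported above can be cross-checked against an exact brute-force search in base R. The sketch below assumes, as the output suggests, that a self-search excludes each row's own entry. Where two neighbours are equidistant, the order of their indices may differ from Annoy's, so the distances are the more reliable thing to compare.

```r
# Same reference points as ref_dense above, repeated so the block runs alone.
ref <- matrix(
  c(
    0.0, 0.1, 0.2,
    0.1, 0.0, 0.1,
    0.2, 0.1, 0.0,
    1.0, 1.1, 1.2,
    1.1, 1.0, 1.1,
    1.2, 1.1, 1.0
  ),
  ncol = 3,
  byrow = TRUE
)

# Exact pairwise Euclidean distances, with self-matches excluded.
d <- as.matrix(dist(ref))
diag(d) <- Inf

# For each row, keep the two closest other rows and their distances.
k <- 2L
exact_index <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))
exact_distance <- t(apply(d, 1, function(row) sort(row)[seq_len(k)]))

exact_index
round(exact_distance, 3)
```

The rounded distances should match the table returned by the Annoy search, which is expected here because search_k = 100L is large relative to six items.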
Once the handle is loaded, repeated searches in the same session can reuse it.
second_result <- annoy_search_bigmatrix(index, k = 2L, search_k = 100L)
identical(first_result$index, second_result$index)
#> [1] TRUE
all.equal(first_result$distance, second_result$distance)
#> [1] TRUE
Validation and loading are related, but they are not the same thing. Sometimes you want to confirm that the metadata and file signature still look right without paying the cost of loading the native handle yet.
annoy_close_index(index)
annoy_is_loaded(index)
#> [1] FALSE
validation_no_load <- annoy_validate_index(
  index,
  strict = TRUE,
  load = FALSE
)
validation_no_load$valid
#> [1] TRUE
validation_no_load$checks[, c("check", "passed", "severity")]
| check | passed | severity |
|---|---|---|
| index_file | TRUE | error |
| metric | TRUE | error |
| dimensions | TRUE | error |
| items | TRUE | error |
| file_size | TRUE | error |
| file_md5 | TRUE | error |
| file_mtime | TRUE | warning |
annoy_is_loaded(index)
#> [1] FALSE
Because load = FALSE, the validation report checks the recorded metadata
against the current file without changing the loaded state of the object.
If you do want validation to also confirm that the Annoy index can be opened
successfully, set load = TRUE.
validation_with_load <- annoy_validate_index(
  index,
  strict = TRUE,
  load = TRUE
)
validation_with_load$valid
#> [1] TRUE
tail(validation_with_load$checks[, c("check", "passed", "severity")], 2L)
| check | passed | severity |
|---|---|---|
| file_mtime | TRUE | warning |
| load | TRUE | error |
annoy_is_loaded(index)
#> [1] TRUE
This is a useful pattern before long-running queries or before handing a reopened index to downstream analysis code.
Explicit close support is helpful in long R sessions, in tests, and in code that wants deterministic control over when handles are released.
annoy_close_index(index)
annoy_is_loaded(index)
#> [1] FALSE
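One common way to get that deterministic release in R is to pair the open call with on.exit(), so the handle is closed even when the work in between fails. The helper below is a generic sketch with hypothetical names (with_resource(), mock_open(), mock_close()); it uses a mock resource so that it runs without bigANNOY.

```r
# Hypothetical helper: run `fn` on a resource and always release it afterwards.
with_resource <- function(open, close, fn) {
  resource <- open()
  # on.exit() runs on normal return and on error alike.
  on.exit(close(resource), add = TRUE)
  fn(resource)
}

# Mock open/close functions that record what happened.
events <- new.env()
events$log <- character()
mock_open <- function() {
  events$log <- c(events$log, "open")
  list(id = 1L)
}
mock_close <- function(resource) {
  events$log <- c(events$log, "close")
  invisible(NULL)
}

# The simulated query fails, but the resource is still closed.
result <- try(
  with_resource(mock_open, mock_close, function(r) stop("query failed")),
  silent = TRUE
)
events$log
```

With bigANNOY, the same shape would presumably take a function that calls annoy_open_index() as open and annoy_close_index() as close; that pairing is an assumption about how you would wire it up, not something the package provides.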
The persisted .ann file is still there, so the next search can load it
again.
reload_result <- annoy_search_bigmatrix(index, k = 2L, search_k = 100L)
annoy_is_loaded(index)
#> [1] TRUE
reload_result$index
| 2 | 3 |
| 1 | 3 |
| 2 | 1 |
| 5 | 6 |
| 4 | 6 |
| 5 | 4 |
The more important persistence workflow is reopening the same files into a new
bigannoy_index object. This is what a later R session would typically do.
annoy_open_index() and annoy_load_bigmatrix() both support this pattern.
The main distinction is semantic: annoy_load_bigmatrix() is a friendlier name
when you are thinking in terms of bigmemory workflows, while
annoy_open_index() makes the persisted-index lifecycle more explicit.
reopened_lazy <- annoy_open_index(
  path = index$path,
  load_mode = "lazy"
)
reopened_eager <- annoy_load_bigmatrix(
  path = index$path,
  load_mode = "eager"
)
annoy_is_loaded(reopened_lazy)
#> [1] FALSE
annoy_is_loaded(reopened_eager)
#> [1] TRUE
The eager reopen path loads immediately. The lazy reopen path waits until first use.
reopened_result <- annoy_search_bigmatrix(
  reopened_lazy,
  k = 2L,
  search_k = 100L
)
annoy_is_loaded(reopened_lazy)
#> [1] TRUE
reopened_result$index
| 2 | 3 |
| 1 | 3 |
| 2 | 1 |
| 5 | 6 |
| 4 | 6 |
| 5 | 4 |
The persisted files are shared, but loaded-state tracking is per-object and per-session. Closing one in-memory object does not invalidate another object that already opened the same index.
annoy_close_index(reopened_lazy)
c(
  original = annoy_is_loaded(index),
  reopened_lazy = annoy_is_loaded(reopened_lazy),
  reopened_eager = annoy_is_loaded(reopened_eager)
)
#> original reopened_lazy reopened_eager
#> TRUE FALSE TRUE
This is a useful mental model:
- the .ann file is the durable asset
- the bigannoy_index object is the session-level controller

In normal workflows, annoy_validate_index(..., strict = TRUE) is the safest default because it stops immediately when critical checks fail. If you want a diagnostic report instead of an error, use strict = FALSE.
report <- annoy_validate_index(
  reopened_eager,
  strict = FALSE,
  load = FALSE
)
report$valid
#> [1] TRUE
report$checks[, c("check", "passed", "severity")]
| check | passed | severity |
|---|---|---|
| index_file | TRUE | error |
| metric | TRUE | error |
| dimensions | TRUE | error |
| items | TRUE | error |
| file_size | TRUE | error |
| file_md5 | TRUE | error |
| file_mtime | TRUE | warning |
That pattern is especially helpful when you are writing higher-level code that wants to display a validation report before deciding whether to rebuild or reload an index.
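As a sketch of what such higher-level code might look like, the hypothetical helper decide_action() below turns a report into a simple decision. It assumes the report has the shape shown above: a checks data frame with a character check column, a logical passed column, and a severity column holding "error" or "warning". The mock report is hand-built for illustration and was not produced by bigANNOY.

```r
# Hypothetical helper: map a validation report to an action.
decide_action <- function(report) {
  checks <- report$checks
  failed <- checks[!checks$passed, , drop = FALSE]
  if (any(failed$severity == "error")) {
    "rebuild"              # a critical check failed
  } else if (nrow(failed) > 0L) {
    "reuse_with_warning"   # only warning-level checks failed
  } else {
    "reuse"
  }
}

# Hand-built report mimicking the structure, with a failed mtime check.
mock_report <- list(
  checks = data.frame(
    check = c("index_file", "file_md5", "file_mtime"),
    passed = c(TRUE, TRUE, FALSE),
    severity = c("error", "error", "warning"),
    stringsAsFactors = FALSE
  )
)
decide_action(mock_report)
```

Keeping the decision separate from the validation call makes it easy to log or display the full checks table before acting on it.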
For most projects, a sensible lifecycle pattern looks like this:
- build the index once with annoy_build_bigmatrix()
- keep the .ann file and the .meta file together
- reopen with annoy_open_index() or annoy_load_bigmatrix() in later sessions
- run annoy_validate_index() before important downstream work
- call annoy_close_index() when you want explicit control over loaded handles

bigANNOY v3 turns persisted Annoy files into a more explicit lifecycle:

- check loaded state with annoy_is_loaded()
- release handles with annoy_close_index()
- check metadata and file signature with annoy_validate_index()

The next vignette to read after this one is usually File-Backed bigmemory Workflows, which focuses on descriptor files, file-backed matrices, and streamed output destinations.