| Type: | Package |
| Title: | Approximate k-Nearest Neighbour Search for 'bigmemory' Matrices with Annoy |
| Version: | 0.3.0 |
| Date: | 2026-03-27 |
| Author: | Frederic Bertrand [aut, cre] |
| Maintainer: | Frederic Bertrand <frederic.bertrand@lecnam.net> |
| Description: | Approximate Euclidean k-nearest neighbour search routines that operate on 'bigmemory::big.matrix' data through Annoy indexes created with 'RcppAnnoy'. The package builds persistent on-disk indexes plus sidecar metadata from streamed 'big.matrix' rows, supports euclidean, angular, Manhattan, and dot-product Annoy metrics, and can either return in-memory results or stream neighbour indices and distances into destination 'bigmemory' matrices. Explicit index life cycle helpers, stronger metadata validation, descriptor-aware file-backed workflows, and benchmark helpers are also included. |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| Depends: | R (≥ 3.5.0) |
| Imports: | methods, Rcpp, RcppAnnoy |
| LinkingTo: | BH, bigmemory, Rcpp, RcppAnnoy |
| Suggests: | bigmemory, knitr, litedown, testthat (≥ 3.0.0) |
| VignetteBuilder: | litedown |
| Encoding: | UTF-8 |
| NeedsCompilation: | yes |
| URL: | https://fbertran.github.io/bigANNOY/, https://github.com/fbertran/bigANNOY |
| BugReports: | https://github.com/fbertran/bigANNOY/issues |
| RoxygenNote: | 7.3.3 |
| Config/testthat/edition: | 3 |
| Packaged: | 2026-03-27 20:39:35 UTC; bertran7 |
| Repository: | CRAN |
| Date/Publication: | 2026-04-01 08:00:33 UTC |
bigANNOY: Annoy Search for bigmemory Matrices
Description
Approximate nearest-neighbour search for bigmemory::big.matrix
references through Annoy indexes built on disk, with multi-metric Annoy
support, explicit loaded-index lifecycle helpers, descriptor-aware
file-backed workflows, benchmark helpers, and in-memory or streamed
big.matrix result writes.
Details
Package options:
- bigANNOY.block_size: default number of rows processed per block while building an index or reading query matrices. Defaults to 1024L.
- bigANNOY.backend: backend used by the public API. Defaults to "cpp". Set to "r" only for debugging/comparison, or "auto" to prefer the native backend when available.
- bigANNOY.progress: logical flag controlling simple progress messages during index build and search. Defaults to FALSE.
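Since these are ordinary R options, they can presumably be set with options() and read back with getOption(). A minimal sketch; the block size of 2048L is an illustrative value, not a recommendation from the package.

```r
# Save the current settings so they can be restored afterwards.
old <- options(
  bigANNOY.block_size = 2048L,  # rows per streamed block (illustrative value)
  bigANNOY.backend    = "cpp",  # documented default backend
  bigANNOY.progress   = TRUE    # emit simple progress messages
)

# Read the options back, falling back to the documented defaults.
block_size <- getOption("bigANNOY.block_size", 1024L)
backend    <- getOption("bigANNOY.backend", "cpp")
progress   <- getOption("bigANNOY.progress", FALSE)

# Restore whatever was set before.
options(old)
```

Restoring the saved values keeps the change local to one script or test.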
Author(s)
Maintainer: Frederic Bertrand <frederic.bertrand@lecnam.net>
See Also
Useful links:
Report bugs at https://github.com/fbertran/bigANNOY/issues
Build an Annoy index from a bigmemory::big.matrix
Description
Stream the rows of a reference bigmemory::big.matrix into an on-disk
Annoy index and write a small sidecar metadata file next to it. The returned
bigannoy_index can be reopened later with annoy_open_index().
Usage
annoy_build_bigmatrix(
x,
path,
n_trees = 50L,
metric = "euclidean",
seed = NULL,
build_threads = -1L,
block_size = annoy_default_block_size(),
metadata_path = NULL,
load_mode = "lazy"
)
Arguments
x |
A reference bigmemory::big.matrix whose rows are streamed into the index. |
path |
File path where the Annoy index should be written. |
n_trees |
Number of Annoy trees to build. |
metric |
Distance metric. bigANNOY v2 supports the euclidean, angular, Manhattan, and dot-product Annoy metrics. |
seed |
Optional positive integer seed used to initialize Annoy's build RNG. |
build_threads |
Build-thread setting passed to Annoy's native backend.
Use -1L for Annoy's default. |
block_size |
Number of rows processed per streamed block while building the index. |
metadata_path |
Optional path for the sidecar metadata file. Defaults to
|
load_mode |
Whether to keep the returned index metadata-only until first
search ("lazy") or load the native index handle immediately ("eager"). |
Value
A bigannoy_index object describing the persisted Annoy index.
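A minimal build sketch, assuming bigmemory and bigANNOY are installed. The data are random, the ".ann" file extension is an illustrative choice, and the argument values are arbitrary.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  set.seed(1)
  # Small in-memory big.matrix standing in for a large reference set.
  ref <- bigmemory::as.big.matrix(matrix(rnorm(500 * 8), nrow = 500, ncol = 8))

  index_path <- tempfile(fileext = ".ann")  # illustrative file name
  index <- bigANNOY::annoy_build_bigmatrix(
    ref,
    path    = index_path,
    n_trees = 20L,
    metric  = "euclidean",
    seed    = 123L
  )

  print(index)
  print(file.exists(index_path))
}
```

Fixing seed makes repeated builds comparable. More trees generally trade build time and index size for recall.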
Close any loaded Annoy handle cached inside a bigannoy_index
Description
Close any loaded Annoy handle cached inside a bigannoy_index
Usage
annoy_close_index(index)
Arguments
index |
A bigannoy_index object. |
Value
index, invisibly.
Check whether an index currently has a loaded in-memory handle
Description
Check whether an index currently has a loaded in-memory handle
Usage
annoy_is_loaded(index)
Arguments
index |
A bigannoy_index object. |
Value
TRUE when a live native or debug-only handle is cached, otherwise
FALSE.
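The lifecycle helpers can be combined as in the sketch below, which assumes bigmemory and bigANNOY are installed. Whether a lazily built index reports FALSE before its first search, and whether the cached handle is updated in place, are assumptions here, so the code prints the flags rather than asserting them.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  set.seed(2)
  ref <- bigmemory::as.big.matrix(matrix(rnorm(300 * 6), nrow = 300, ncol = 6))
  index <- bigANNOY::annoy_build_bigmatrix(
    ref,
    path      = tempfile(fileext = ".ann"),
    n_trees   = 10L,
    load_mode = "lazy"
  )

  # State right after a lazy build.
  loaded_before <- bigANNOY::annoy_is_loaded(index)

  # A search needs the index, so a handle may be loaded on demand.
  invisible(bigANNOY::annoy_search_bigmatrix(index, k = 3L))
  loaded_after <- bigANNOY::annoy_is_loaded(index)

  # Release any cached handle explicitly; the index is returned invisibly.
  index <- bigANNOY::annoy_close_index(index)
  loaded_closed <- bigANNOY::annoy_is_loaded(index)

  print(c(before = loaded_before, after = loaded_after, closed = loaded_closed))
}
```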
Load an existing Annoy index for bigmatrix workflows
Description
Load an existing Annoy index for bigmatrix workflows
Usage
annoy_load_bigmatrix(
path,
metadata_path = NULL,
prefault = FALSE,
load_mode = "eager"
)
Arguments
path |
File path to an existing Annoy index built by
annoy_build_bigmatrix(). |
metadata_path |
Optional path to the sidecar metadata file. |
prefault |
Logical flag indicating whether searches should prefault the index when loaded by the native backend. |
load_mode |
Whether to eagerly load the native index handle on open or defer until first search. |
Value
A bigannoy_index object that can be passed to
annoy_search_bigmatrix().
Open an existing Annoy index and its sidecar metadata
Description
Open an existing Annoy index and its sidecar metadata
Usage
annoy_open_index(
path,
metadata_path = NULL,
prefault = FALSE,
load_mode = "eager"
)
Arguments
path |
File path to an existing Annoy index built by
annoy_build_bigmatrix(). |
metadata_path |
Optional path to the sidecar metadata file. |
prefault |
Logical flag indicating whether searches should prefault the index when loaded by the native backend. |
load_mode |
Whether to eagerly load the native index handle on open or defer until first search. |
Value
A bigannoy_index object that can be passed to
annoy_search_bigmatrix().
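A sketch of reopening a persisted index in a later step, assuming bigmemory and bigANNOY are installed. The sidecar metadata is left at its default location by not passing metadata_path.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  set.seed(4)
  ref <- bigmemory::as.big.matrix(matrix(rnorm(300 * 6), nrow = 300, ncol = 6))
  index_path <- tempfile(fileext = ".ann")  # illustrative file name
  built <- bigANNOY::annoy_build_bigmatrix(ref, path = index_path, n_trees = 10L)

  # Reopen from disk; "lazy" defers loading the native handle until first search.
  reopened <- bigANNOY::annoy_open_index(index_path, load_mode = "lazy")
  print(bigANNOY::annoy_is_loaded(reopened))

  # The reopened object can be searched like the freshly built one.
  res <- bigANNOY::annoy_search_bigmatrix(reopened, k = 3L)
  print(names(res))
}
```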
Search an Annoy index built from a bigmemory::big.matrix
Description
Query a persisted Annoy index created by annoy_build_bigmatrix() or
reopened with annoy_open_index(). Supply query = NULL for self-search
over the indexed reference rows, or provide a dense numeric matrix,
big.matrix, or external pointer for external-query search. Results can be
returned in memory or streamed into destination big.matrix objects.
Usage
annoy_search_bigmatrix(
index,
query = NULL,
k = 10L,
search_k = -1L,
xpIndex = NULL,
xpDistance = NULL,
prefault = NULL,
block_size = annoy_default_block_size()
)
Arguments
index |
A bigannoy_index object. |
query |
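Optional query source. Supply NULL for self-search over the indexed reference rows, or a dense numeric matrix, big.matrix, or external pointer for external-query search. |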
k |
Number of neighbours to return. |
search_k |
Annoy's runtime search budget. Use -1L for Annoy's default budget. |
xpIndex |
Optional writable big.matrix that receives the streamed neighbour indices. |
xpDistance |
Optional writable big.matrix that receives the streamed neighbour distances. |
prefault |
Optional logical override controlling whether the native backend prefaults the Annoy file while loading it for search. |
block_size |
Number of queries processed per block. |
Value
A list with components index, distance, k, metric, n_ref,
n_query, exact, and backend.
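The two query modes can be sketched as follows, assuming bigmemory and bigANNOY are installed. Only in-memory results are shown, and the layout of the index and distance components is inspected with str() rather than assumed.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  set.seed(3)
  ref <- bigmemory::as.big.matrix(matrix(rnorm(400 * 5), nrow = 400, ncol = 5))
  index <- bigANNOY::annoy_build_bigmatrix(
    ref, path = tempfile(fileext = ".ann"), n_trees = 20L, seed = 7L
  )

  # Self-search: query = NULL searches over the indexed reference rows.
  self_res <- bigANNOY::annoy_search_bigmatrix(index, query = NULL, k = 5L)

  # External queries as a dense numeric matrix with the same number of columns.
  queries <- matrix(rnorm(10 * 5), nrow = 10, ncol = 5)
  ext_res <- bigANNOY::annoy_search_bigmatrix(
    index, query = queries, k = 5L, search_k = 1000L
  )

  str(ext_res$index)
  str(ext_res$distance)
}
```

A larger search_k lets Annoy inspect more candidates per query, which usually improves recall at the cost of search time.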
Validate a persisted Annoy index and its sidecar metadata
Description
Validate a persisted Annoy index and its sidecar metadata
Usage
annoy_validate_index(index, strict = TRUE, load = TRUE, prefault = NULL)
Arguments
index |
A bigannoy_index object. |
strict |
Whether failed validation checks should raise an error. |
load |
Whether to also verify that the index can be loaded successfully. |
prefault |
Optional logical override used when load = TRUE. |
Value
A list containing valid, checks, and the normalized index.
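A non-strict validation pass can be sketched as below, assuming bigmemory and bigANNOY are installed. With strict = FALSE, failed checks are reported in the result instead of raising an error.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  set.seed(5)
  ref <- bigmemory::as.big.matrix(matrix(rnorm(200 * 4), nrow = 200, ncol = 4))
  index <- bigANNOY::annoy_build_bigmatrix(
    ref, path = tempfile(fileext = ".ann"), n_trees = 10L
  )

  # Collect the validation report without stopping on failures.
  report <- bigANNOY::annoy_validate_index(index, strict = FALSE, load = TRUE)

  print(report$valid)   # overall outcome
  print(report$checks)  # individual check results
}
```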
Benchmark a single bigANNOY build/search configuration
Description
Build or reuse a benchmark reference dataset, create an Annoy index, query
it, and optionally compare recall against the exact bigKNN Euclidean
baseline.
Usage
benchmark_annoy_bigmatrix(
x = NULL,
query = NULL,
n_ref = 2000L,
n_query = 200L,
n_dim = 20L,
k = 10L,
n_trees = 50L,
metric = "euclidean",
search_k = -1L,
seed = 42L,
build_seed = seed,
build_threads = -1L,
block_size = annoy_default_block_size(),
backend = getOption("bigANNOY.backend", "cpp"),
exact = TRUE,
filebacked = FALSE,
path_dir = tempdir(),
keep_files = FALSE,
output_path = NULL,
load_mode = "eager"
)
Arguments
x |
Optional benchmark reference input. Supply |
query |
Optional benchmark query input. Supply |
n_ref |
Number of synthetic reference rows to generate when x is NULL. |
n_query |
Number of synthetic query rows to generate when query is NULL. |
n_dim |
Number of synthetic columns to generate when x is NULL. |
k |
Number of neighbours to return. |
n_trees |
Number of Annoy trees to build. |
metric |
Annoy metric. One of |
search_k |
Annoy search budget. |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to seed. |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with bigKNN. |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ("lazy") or loaded immediately ("eager"). |
Value
A list with a one-row summary data frame plus the benchmark
parameters and generated Annoy file paths.
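A small synthetic run can be sketched as follows, assuming bigmemory and bigANNOY are installed. exact = FALSE is used so that the bigKNN baseline is not required; the sizes are illustrative and far smaller than a meaningful benchmark.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  # x = NULL and query = NULL: synthetic data are generated from the sizes below.
  res <- bigANNOY::benchmark_annoy_bigmatrix(
    n_ref   = 500L,
    n_query = 50L,
    n_dim   = 8L,
    k       = 5L,
    n_trees = 10L,
    exact   = FALSE
  )

  # Inspect the top-level components rather than assuming their names.
  str(res, max.level = 1)
}
```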
Benchmark a recall suite across multiple Annoy configurations
Description
Run a grid of n_trees and search_k settings on the same benchmark
dataset, optionally recording recall against the exact bigKNN Euclidean
baseline.
Usage
benchmark_annoy_recall_suite(
x = NULL,
query = NULL,
n_ref = 2000L,
n_query = 200L,
n_dim = 20L,
k = 10L,
n_trees = c(10L, 50L, 100L),
search_k = c(-1L, 1000L, 5000L),
metric = "euclidean",
seed = 42L,
build_seed = seed,
build_threads = -1L,
block_size = annoy_default_block_size(),
backend = getOption("bigANNOY.backend", "cpp"),
exact = TRUE,
filebacked = FALSE,
path_dir = tempdir(),
keep_files = FALSE,
output_path = NULL,
load_mode = "eager"
)
Arguments
x |
Optional benchmark reference input. Supply |
query |
Optional benchmark query input. Supply |
n_ref |
Number of synthetic reference rows to generate when x is NULL. |
n_query |
Number of synthetic query rows to generate when query is NULL. |
n_dim |
Number of synthetic columns to generate when x is NULL. |
k |
Number of neighbours to return. |
n_trees |
Integer vector of Annoy tree counts to benchmark. |
search_k |
Integer vector of Annoy search budgets to benchmark. |
metric |
Annoy metric. One of |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to seed. |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with bigKNN. |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ("lazy") or loaded immediately ("eager"). |
Value
A list with a summary data frame containing one row per
(n_trees, search_k) configuration.
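A reduced grid can be sketched as below, assuming bigmemory and bigANNOY are installed. With exact = FALSE no recall is recorded, so this version only exercises the grid itself; set exact = TRUE when bigKNN is available.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  # 2 tree counts x 2 search budgets = 4 configurations on the same dataset.
  suite <- bigANNOY::benchmark_annoy_recall_suite(
    n_ref    = 500L,
    n_query  = 50L,
    n_dim    = 8L,
    k        = 5L,
    n_trees  = c(5L, 10L),
    search_k = c(-1L, 200L),
    exact    = FALSE
  )

  # Inspect the top-level components rather than assuming their names.
  str(suite, max.level = 1)
}
```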
Benchmark scaling across data volumes for bigANNOY and direct RcppAnnoy
Description
Run benchmark_annoy_vs_rcppannoy() over a grid of synthetic data sizes to
study how build time, search time, and index size scale with data volume.
Usage
benchmark_annoy_volume_suite(
n_ref = c(2000L, 5000L, 10000L),
n_query = 200L,
n_dim = c(20L, 50L),
k = 10L,
n_trees = 50L,
metric = "euclidean",
search_k = -1L,
seed = 42L,
build_seed = seed,
build_threads = -1L,
block_size = annoy_default_block_size(),
backend = getOption("bigANNOY.backend", "cpp"),
exact = FALSE,
filebacked = FALSE,
path_dir = tempdir(),
keep_files = FALSE,
output_path = NULL,
load_mode = "eager"
)
Arguments
n_ref |
Integer vector of synthetic reference row counts. |
n_query |
Integer vector of synthetic query row counts. |
n_dim |
Integer vector of synthetic column counts. |
k |
Number of neighbours to return. |
n_trees |
Number of Annoy trees to build. |
metric |
Annoy metric. One of |
search_k |
Annoy search budget. |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to seed. |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with bigKNN. |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ("lazy") or loaded immediately ("eager"). |
Value
A list with a summary data frame containing one row per
implementation and data-volume combination.
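A scaled-down volume grid can be sketched as follows, assuming bigmemory, RcppAnnoy, and bigANNOY are installed. The sizes are illustrative and much smaller than the documented defaults.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("RcppAnnoy", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  # Two reference sizes at a single dimensionality.
  volume <- bigANNOY::benchmark_annoy_volume_suite(
    n_ref   = c(500L, 1000L),
    n_query = 50L,
    n_dim   = 8L,
    k       = 5L,
    n_trees = 10L
  )

  # Inspect the top-level components rather than assuming their names.
  str(volume, max.level = 1)
}
```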
Benchmark bigANNOY against direct RcppAnnoy
Description
Run the same Annoy build and search task through bigANNOY and through a
direct dense RcppAnnoy baseline. The comparison reports both speed metrics
and data-volume metrics such as reference bytes, query bytes, and generated
index size.
Usage
benchmark_annoy_vs_rcppannoy(
x = NULL,
query = NULL,
n_ref = 2000L,
n_query = 200L,
n_dim = 20L,
k = 10L,
n_trees = 50L,
metric = "euclidean",
search_k = -1L,
seed = 42L,
build_seed = seed,
build_threads = -1L,
block_size = annoy_default_block_size(),
backend = getOption("bigANNOY.backend", "cpp"),
exact = TRUE,
filebacked = FALSE,
path_dir = tempdir(),
keep_files = FALSE,
output_path = NULL,
load_mode = "eager"
)
Arguments
x |
Optional benchmark reference input. Supply |
query |
Optional benchmark query input. Supply |
n_ref |
Number of synthetic reference rows to generate when x is NULL. |
n_query |
Number of synthetic query rows to generate when query is NULL. |
n_dim |
Number of synthetic columns to generate when x is NULL. |
k |
Number of neighbours to return. |
n_trees |
Number of Annoy trees to build. |
metric |
Annoy metric. One of |
search_k |
Annoy search budget. |
seed |
Random seed used for synthetic data generation and, by default, for the Annoy build seed. |
build_seed |
Optional Annoy build seed. Defaults to seed. |
build_threads |
Native Annoy build-thread setting. |
block_size |
Build/search block size. |
backend |
Requested bigANNOY backend. |
exact |
Logical flag controlling whether to benchmark the exact
Euclidean baseline with bigKNN. |
filebacked |
Logical flag; if |
path_dir |
Directory where temporary Annoy and optional file-backed benchmark files should be written. |
keep_files |
Logical flag; if |
output_path |
Optional CSV path where the benchmark summary should be written. |
load_mode |
Whether the benchmarked index should be returned
metadata-only until first search ("lazy") or loaded immediately ("eager"). |
Value
A list with a two-row summary data frame, one row for bigANNOY
and one for direct RcppAnnoy, plus benchmark metadata and any validation
report produced for the bigANNOY index.
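A side-by-side run can be sketched as below, assuming bigmemory, RcppAnnoy, and bigANNOY are installed. exact = FALSE skips the bigKNN baseline, and the sizes are illustrative.

```r
has_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("RcppAnnoy", quietly = TRUE) &&
  requireNamespace("bigANNOY", quietly = TRUE)

if (has_pkgs) {
  # The same synthetic task is run through bigANNOY and direct RcppAnnoy.
  cmp <- bigANNOY::benchmark_annoy_vs_rcppannoy(
    n_ref   = 500L,
    n_query = 50L,
    n_dim   = 8L,
    k       = 5L,
    n_trees = 10L,
    exact   = FALSE
  )

  # Inspect the top-level components rather than assuming their names.
  str(cmp, max.level = 1)
}
```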
Print a bigannoy_index
Description
Print a bigannoy_index
Usage
## S3 method for class 'bigannoy_index'
print(x, ...)
Arguments
x |
A bigannoy_index object. |
... |
Unused. |
Value
x, invisibly.