R’s parallel tools make it easy to fan out work, but they leave you to manage the hard parts yourself: duplicated memory, runaway workers, invisible copy-on-write. shard handles all of that so you can focus on the computation.
The core idea is simple: share inputs once, write outputs to a buffer, let shard supervise the workers.
Suppose you have a large matrix and want to compute column means in parallel. With shard, you share the matrix, allocate an output buffer, and map over column indices:
set.seed(42)
X <- matrix(rnorm(5000), nrow = 100, ncol = 50)
# Share the matrix (zero-copy for workers)
X_shared <- share(X)
# Allocate an output buffer
out <- buffer("double", dim = ncol(X))
# Define column shards and run
blocks <- shards(ncol(X), workers = 2)
run <- shard_map(
  blocks,
  borrow = list(X = X_shared),
  out = list(out = out),
  workers = 2,
  fun = function(shard, X, out) {
    for (j in shard$idx) {
      out[j] <- mean(X[, j])
    }
  }
)
# Read results from the buffer
result <- out[]
head(result)
#> [1] 0.03251482 -0.08748371 -0.01036817 0.03293646 -0.11783051 0.00236351

No serialization of the full matrix per worker. No list of return values to reassemble. The workers wrote directly into out.
shard’s workflow revolves around three things:
| Object | Constructor | Purpose |
|---|---|---|
| Shared input | share() | Immutable, zero-copy data visible to all workers |
| Output buffer | buffer() | Writable shared memory that workers fill in |
| Shard descriptor | shards() | Index ranges that partition the work |
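The shard descriptor is what your worker function receives as its first argument; throughout this article each shard exposes an idx field holding the indices it should process. A minimal sketch of inspecting one up front (treating the value of shards() as a plain list of descriptors is an assumption here):

blocks <- shards(10, workers = 2)
# Assumed: list-like access to the descriptors; idx is the field used by
# the worker functions in the examples below
blocks[[1]]$idx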
buffer() creates typed shared memory that workers write
to using standard R indexing:
buf <- buffer("double", dim = c(10, 5))
buf[1:5, 1] <- rnorm(5)
buf[6:10, 1] <- rnorm(5)
buf[, 1]
#> [1] 0.07122244 0.97029003 0.31003525 -0.13954856 -0.32631113 -0.11880951
#> [7] 0.89355889 0.21111451 -0.48877979 -0.22037193

Buffers support "double", "integer", "logical", and "raw" types. For matrices and arrays, pass a dim vector.
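For example, a small sketch of a three-dimensional integer buffer; the indexing below simply assumes buffers follow standard R array semantics, as in the matrix example above.

arr <- buffer("integer", dim = c(4, 3, 2))
arr[, 1, 1] <- 1:4   # write the first column of the first slice
arr[, 1, 1]          # read it back with ordinary array indexing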
shard_map() is the engine. It dispatches shards to a
supervised worker pool, passes shared inputs, and collects
diagnostics:
set.seed(1)
X <- matrix(rnorm(2000), nrow = 100, ncol = 20)
X_shared <- share(X)
col_sds <- buffer("double", dim = ncol(X))
blocks <- shards(ncol(X), workers = 2)
run <- shard_map(
  blocks,
  borrow = list(X = X_shared),
  out = list(col_sds = col_sds),
  workers = 2,
  fun = function(shard, X, col_sds) {
    for (j in shard$idx) {
      col_sds[j] <- sd(X[, j])
    }
  }
)
# Results are already in the buffer
sd_values <- col_sds[]
# Verify against base R
all.equal(sd_values, apply(X, 2, sd))
#> [1] TRUE

If your function returns a value (instead of writing to a buffer), shard gathers the results:
blocks <- shards(10, workers = 2)
run <- shard_map(
  blocks,
  workers = 2,
  fun = function(shard) {
    sum(shard$idx)
  }
)
results(run)
#> $`1`
#> [1] 3
#>
#> $`2`
#> [1] 7
#>
#> $`3`
#> [1] 11
#>
#> $`4`
#> [1] 15
#>
#> $`5`
#> [1] 19

Buffers are preferred for large outputs because they avoid serializing results back to the main process. Use return values for small summaries.
For common patterns, shard provides wrappers that handle sharing, sharding, and buffering automatically.
shard_apply_matrix() applies a scalar function over each
column of a matrix:
set.seed(1)
X <- matrix(rnorm(2000), nrow = 100, ncol = 20)
y <- rnorm(100)
# Correlate each column of X with y
cors <- shard_apply_matrix(
  X,
  MARGIN = 2,
  FUN = function(v, y) cor(v, y),
  VARS = list(y = y),
  workers = 2
)
head(cors)
#> [1] 0.09122115 0.14024598 0.14152571 0.07212827 0.01476771 -0.02163338

The matrix is auto-shared, columns are dispatched as shards, and results are collected into a vector.
shard_lapply_shared() is a parallel lapply with automatic sharing of large list elements.
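A minimal sketch of how it might be called; the FUN and workers argument names are assumed by analogy with shard_apply_matrix() and may differ from the actual signature.

# Eight large numeric vectors; each is shared rather than serialized per worker
big_list <- replicate(8, rnorm(1e5), simplify = FALSE)
sums <- shard_lapply_shared(
  big_list,
  FUN = function(v) sum(v),
  workers = 2
)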
Every shard_map() call records timing, memory, and
worker statistics. Use report() to inspect them:
report(result = run)
#> shard_report (summary)
#> Generated: 2026-03-30 17:15:27
#>
#> Pool:
#> Workers: 2
#> RSS limit: 2.0 GB
#> Drift threshold: 50%
#> Stats: 32 tasks, 0 recycles, 0 deaths
#>
#> Memory:
#> Total RSS: 171.7 MB
#> Peak RSS: 86.6 MB
#> Mean RSS: 85.9 MB
#>
#> Recommendations:
#> - Per-shard work looks tiny (2.502 ms/shard). Consider increasing chunk_size, or use profile='speed' / dispatch_mode='shm_queue' for out-buffer workflows to reduce dispatch overhead.

For focused views:

- mem_report(run) – peak and baseline RSS per worker
- copy_report(run) – bytes transferred through buffers
- task_report(run) – per-chunk execution times and retry counts

By default, shard_map() creates a worker pool on first
use and reuses it. You can also manage the pool explicitly:
# Create a pool with 4 workers and a 1GB memory cap
pool_create(n = 4, rss_limit = "1GB")
# Check pool health
pool_status()
# Run multiple shard_map() calls (reuses the same pool)
run1 <- shard_map(shards(1000), workers = 4, fun = function(s) sum(s$idx))
run2 <- shard_map(shards(500), workers = 4, fun = function(s) mean(s$idx))
# Shut down workers when done
pool_stop()

Workers are supervised: if a worker’s memory usage drifts beyond the threshold, shard recycles it automatically.
Shared inputs are immutable by default (cow = "deny").
This prevents a common class of parallel bugs where a worker
accidentally modifies shared data, triggering a silent copy:
shard_map(
  shards(10),
  borrow = list(X = share(matrix(1:100, 10, 10))),
  workers = 2,
  cow = "deny",
  fun = function(shard, X) {
    X[1, 1] <- 999 # Error: mutation denied
  }
)

You can relax this with cow = "audit" (detect and report
mutations) or cow = "allow" (permit copy-on-write with
tracking). See ?shard_map for details.
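For instance, the same write under cow = "audit" might look like the sketch below; exactly how the detected mutation is surfaced afterwards (here assumed to appear via report()) is not specified above.

run <- shard_map(
  shards(10),
  borrow = list(X = share(matrix(1:100, 10, 10))),
  workers = 2,
  cow = "audit",
  fun = function(shard, X) {
    X[1, 1] <- 999  # detected and reported instead of raising an error
  }
)
report(result = run)  # assumed to include the audited mutation in its diagnostics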
- ?shard_map – full reference for the parallel engine
- ?share – sharing options and backing types
- ?buffer – buffer types and matrix/array support
- ?report – diagnostic reports and recommendations
- ?shard_apply_matrix – column-wise parallel apply
- ?pool_create – pool configuration and memory limits