
bioLeak is an R package for detecting, quantifying, and
diagnosing data leakage in biomedical machine-learning workflows. It
provides leakage-resistant resampling, guarded preprocessing, post-hoc
auditing, and inference tools for cross-validation and related
evaluation settings.

In scope:

- Preprocessing leakage: global imputation, scaling, filtering, or feature selection applied before resampling.
- Dependence leakage: repeated measures, subject-level grouping, batch/site/study effects, and near-duplicate samples.
- Resampling violations: group overlap, study holdout, time-ordered evaluation, and multi-axis split constraints.
- Diagnostic evidence: permutation-based performance gaps, batch/fold association tests, target leakage scans, duplicate detection, and mechanism-level risk summaries.

Out of scope:

- Proving the absence of leakage or guaranteeing unbiased performance.
- Production deployment tooling.
- Unsupervised learning.
- Broad, non-leakage-oriented data-quality diagnostics.

Standard cross-validation assumes independent samples and
exchangeable labels. Biomedical datasets often violate these assumptions
due to repeated measures, site effects, batch structure, and temporal
dependence. These violations can inflate performance metrics even when a
model does not generalize. bioLeak enforces leakage-aware
resampling and provides post-hoc diagnostics that estimate how much
apparent performance could be driven by leakage or confounding.
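
The dependence problem can be seen without fitting any model. The following is a minimal base-R sketch, not bioLeak's implementation; the toy design and the helper `count_leaky_folds()` are hypothetical and exist only for illustration. It assigns folds record-wise and subject-wise, then counts the folds in which a test subject also appears in the training rows.

```r
set.seed(42)

# Toy design (assumed): 20 subjects with 3 repeated measures each
subject <- rep(seq_len(20), each = 3)
n <- length(subject)
v <- 5

# Record-wise folds: each row is assigned without regard to its subject
naive_folds <- sample(rep(seq_len(v), length.out = n))

# Subject-wise folds: assign one fold per subject, then expand to rows
subject_ids <- unique(subject)
fold_of_subject <- sample(rep(seq_len(v), length.out = length(subject_ids)))
grouped_folds <- fold_of_subject[match(subject, subject_ids)]

# Count folds in which any test subject also appears in the training rows
count_leaky_folds <- function(folds, subject) {
  sum(vapply(sort(unique(folds)), function(k) {
    test_subjects  <- unique(subject[folds == k])
    train_subjects <- unique(subject[folds != k])
    length(intersect(test_subjects, train_subjects)) > 0
  }, logical(1)))
}

leaky_naive   <- count_leaky_folds(naive_folds, subject)
leaky_grouped <- count_leaky_folds(grouped_folds, subject)
```

With subject-wise assignment, `leaky_grouped` is zero by construction, because every row of a subject shares one fold. Record-wise assignment scatters a subject's rows across folds, so the model can be tested on subjects it has already seen.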
| Function | Description |
|---|---|
| `make_split_plan()` | Leakage-aware splits: subject-grouped, batch-blocked, study leave-out, time-series, and N-axis combined modes with compact storage option and constraint-aware train/test exclusion |
| `check_split_overlap()` | Explicit overlap-invariant validation across declared grouping axes |
| `as_rsample()` | Convert `LeakSplits` to an rsample `rset` for tidymodels interoperability |

| Function | Description |
|---|---|
| `fit_resample()` | Cross-validated fitting with train-only imputation, normalization, filtering, and feature selection; supports binomial, multiclass, regression, and survival tasks; parallel execution via `future.apply` |
| `tune_resample()` | Nested hyperparameter tuning via tidymodels `tune`/`dials` with leakage-aware outer splits, hyperparameter aggregation across folds, and optional threshold tuning; survival tasks are not yet supported |
| `impute_guarded()` | Standalone train-only imputation (median, knn, missForest, none) |
| `guard_to_recipe()` | Convert guarded preprocessing specifications to `recipes` pipelines |

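
The "train-only" idea behind guarded preprocessing can be sketched in a few lines of base R. This is an illustration of the principle under simple assumptions (median imputation followed by z-scoring on one numeric feature), not the package's internal code; `fit_guard()` and `apply_guard()` are hypothetical helper names.

```r
set.seed(1)

# Toy feature with missing values and a fixed train/test partition
x <- rnorm(100, mean = 10, sd = 2)
x[sample(100, 10)] <- NA
test_idx  <- 81:100
train_idx <- setdiff(seq_along(x), test_idx)

# Estimate preprocessing parameters on the training rows only
fit_guard <- function(x_train) {
  med <- median(x_train, na.rm = TRUE)
  filled <- ifelse(is.na(x_train), med, x_train)
  list(median = med, mean = mean(filled), sd = sd(filled))
}

# Apply previously estimated parameters to any rows (train or test)
apply_guard <- function(x_new, params) {
  filled <- ifelse(is.na(x_new), params$median, x_new)
  (filled - params$mean) / params$sd
}

params  <- fit_guard(x[train_idx])
x_train <- apply_guard(x[train_idx], params)
x_test  <- apply_guard(x[test_idx], params)
```

The test rows never influence `params`, so the held-out fold stays unseen until evaluation. Computing the median, mean, or standard deviation on all rows before splitting would be the global preprocessing leakage described above.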
| Function | Description |
|---|---|
| `audit_leakage()` | Permutation gap test, batch/study association tests, univariate and multivariate target leakage scans, near-duplicate detection, and mechanism-level risk summaries |
| `audit_leakage_by_learner()` | Multi-learner auditing for multi-model fits |
| `audit_report()` | Self-contained HTML summary of audit results |
| `calibration_summary()` | Probability calibration checks |
| `confounder_sensitivity()` | Sensitivity analysis for confounding effects |

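
To make the permutation gap idea concrete, here is a simplified base-R sketch. It is an assumption-laden illustration and not how `audit_leakage()` is implemented: it permutes labels against fixed held-out scores, whereas a full audit would typically respect the resampling structure. The data and the `accuracy()` helper are hypothetical.

```r
set.seed(7)

# Toy held-out predictions: scores carry real signal about the labels
n <- 200
y <- rbinom(n, 1, 0.5)
score <- y + rnorm(n, sd = 1)

# Metric: accuracy of thresholding the score at 0.5
accuracy <- function(y, score) mean((score > 0.5) == (y == 1))

observed <- accuracy(y, score)

# Null distribution: recompute the metric after shuffling the labels
B <- 200
permuted <- replicate(B, accuracy(sample(y), score))

# Gap between observed performance and the permutation-null average
gap <- observed - mean(permuted)

# One-sided p-value with the usual +1 correction
p_value <- (1 + sum(permuted >= observed)) / (B + 1)
```

A large gap says only that the metric exceeds what shuffled labels produce. It does not say whether the signal is legitimate or leaked, which is why the audit combines this test with association tests, target scans, and duplicate detection.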
| Function | Description |
|---|---|
| `delta_lsi()` | Leakage sensitivity index (ΔLSI) with Huber M-estimation, BCa confidence intervals, sign-flip inference, and blocked exchangeability for time-series designs |
| `cv_ci()` | Cross-validation confidence intervals with Nadeau-Bengio correction |

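
For orientation, the Nadeau-Bengio correction inflates the variance of the mean fold metric by a term that accounts for overlapping training sets: the usual `1/J` factor becomes `1/J + n_test/n_train`. The sketch below applies that textbook form in base R; the fold values and fold sizes are made up, `nb_ci()` is a hypothetical helper, and the exact variant used by `cv_ci()` may differ.

```r
# Per-fold metric values from a hypothetical 5-fold cross-validation
fold_metric <- c(0.81, 0.78, 0.84, 0.79, 0.83)
n_train <- 96   # training rows per fold (assumed)
n_test  <- 24   # test rows per fold (assumed)

nb_ci <- function(x, n_train, n_test, level = 0.95) {
  J <- length(x)
  # Corrected standard error: sqrt((1/J + n_test/n_train) * sample variance)
  se <- sqrt((1 / J + n_test / n_train) * var(x))
  tcrit <- qt(1 - (1 - level) / 2, df = J - 1)
  c(estimate = mean(x),
    lower = mean(x) - tcrit * se,
    upper = mean(x) + tcrit * se)
}

ci_corrected <- nb_ci(fold_metric, n_train, n_test)

# Naive interval for comparison: n_train = Inf removes the correction term
ci_naive <- nb_ci(fold_metric, n_train = Inf, n_test = n_test)
```

The corrected interval is always at least as wide as the naive one, reflecting that fold estimates are not independent when their training sets overlap.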
| Function | Description |
|---|---|
| `simulate_leakage_suite()` | Generate controlled leakage scenarios for benchmarking audit sensitivity |
| `benchmark_leakage_suite()` | Reproducible modality-by-mechanism benchmark grids with detection-rate summaries |

`plot_calibration()`, `plot_confounder_sensitivity()`, `plot_fold_balance()`, `plot_overlap_checks()`, `plot_perm_distribution()`, `plot_time_acf()`

Requires R >= 4.3.
From CRAN:

```r
install.packages("bioLeak")
```

Development version from GitHub:

```r
install.packages("remotes")
remotes::install_github("selcukorkmaz/bioLeak")
```

Non-obvious dependencies:

- `SummarizedExperiment` and `BiocGenerics` are Bioconductor packages (installed automatically by remotes, but can be installed manually with `BiocManager::install()` if needed).
- Optional packages enable specific features: `glmnet`, `ranger`, `pROC`, `PRROC`, `survival`, `future.apply`, `RANN`, `rmarkdown`, `tune`, `dials`.

```r
library(bioLeak)

set.seed(1)
n_subject <- 40
rep_per_subject <- 3
n <- n_subject * rep_per_subject

subject <- rep(seq_len(n_subject), each = rep_per_subject)
batch <- rep(seq_len(6), length.out = n)

# Subject-level latent risk creates dependence across repeated measures
subj_risk <- rnorm(n_subject, sd = 1)
x1 <- subj_risk[subject] + rnorm(n, sd = 0.5)
x2 <- rnorm(n)
x3 <- rnorm(n)
p_subj <- stats::plogis(1.5 * subj_risk)
outcome <- factor(ifelse(runif(n) < p_subj[subject], "case", "control"),
                  levels = c("control", "case"))

df <- data.frame(subject, batch, outcome, x1, x2, x3)

# Leakage-aware splits (subjects do not cross folds)
splits <- make_split_plan(
  df,
  outcome = "outcome",
  mode = "subject_grouped",
  group = "subject",
  v = 5,
  stratify = TRUE,
  seed = 1
)

# Guarded pipeline (train-only preprocessing)
spec <- parsnip::logistic_reg(mode = "classification") |>
  parsnip::set_engine("glm")

fit_guarded <- fit_resample(
  df,
  outcome = "outcome",
  splits = splits,
  learner = spec,
  metrics = "auc",
  preprocess = list(
    impute = list(method = "median"),
    normalize = list(method = "zscore"),
    filter = list(var_thresh = 0),
    fs = list(method = "none")
  ),
  refit = FALSE,
  seed = 1
)

# Leaky comparator: add a leakage feature computed on the full dataset
df_leaky <- within(df, {
  leak_subject <- ave(as.numeric(outcome == "case"), subject, FUN = mean)
})

fit_leaky <- fit_resample(
  df_leaky,
  outcome = "outcome",
  splits = splits,
  learner = spec,
  metrics = "auc",
  preprocess = list(
    impute = list(method = "none"),
    normalize = list(method = "none"),
    filter = list(var_thresh = 0),
    fs = list(method = "none")
  ),
  refit = FALSE,
  seed = 1
)

# Use unstratified permutations here to avoid warnings in small grouped data
audit_guarded <- audit_leakage(
  fit_guarded,
  metric = "auc",
  B = 30,
  perm_stratify = FALSE,
  X_ref = df[, c("x1", "x2", "x3")]
)

audit_leaky <- audit_leakage(
  fit_leaky,
  metric = "auc",
  B = 30,
  perm_stratify = FALSE,
  X_ref = df_leaky[, c("x1", "x2", "x3", "leak_subject")]
)

summary(fit_guarded)
summary(fit_leaky)
summary(audit_guarded)
summary(audit_leaky)
```

Interpretation notes:

- If the leaky comparator shows higher AUC and `leak_subject` ranks near the top of the target leakage scan, the performance gap is likely inflated by leakage.
- Similar guarded and leaky results do not prove the absence of leakage; they only reduce specific risks tested by the audit.

bioLeak integrates with the tidymodels ecosystem at
multiple levels:

- `fit_resample()` and `tune_resample()` accept rsample `rset`/`rsplit` objects as `splits`. `as_rsample()` converts `LeakSplits` to an rsample `rset`.
- `preprocess` can be a `recipes::recipe` (prepped on training folds, baked on test folds). `guard_to_recipe()` converts guarded preprocessing specifications into a recipes pipeline.
- `learner` accepts `parsnip::model_spec` or `workflows::workflow` objects.
- `metrics` accepts `yardstick::metric_set` objects.

Note: When using recipes/workflows, the built-in guarded preprocessing list is not applied; ensure your recipe is leakage-safe.

Example (rsample + recipes + yardstick):

```r
if (requireNamespace("rsample", quietly = TRUE) &&
    requireNamespace("recipes", quietly = TRUE) &&
    requireNamespace("yardstick", quietly = TRUE)) {
  rs <- rsample::vfold_cv(df, v = 5)

  rec <- recipes::recipe(outcome ~ ., data = df) |>
    recipes::step_normalize(recipes::all_numeric_predictors())

  ys <- yardstick::metric_set(yardstick::roc_auc, yardstick::accuracy)

  fit_rs <- fit_resample(
    df,
    outcome = "outcome",
    splits = rs,
    learner = spec,
    preprocess = rec,
    metrics = ys,
    refit = FALSE
  )
}
```

bioLeak supports four task types:

- Binomial classification: built-in character metrics are `auc`, `pr_auc`, and `accuracy`; additional metrics can be supplied via `yardstick::metric_set` when supported by the learner outputs.
- Multiclass classification: built-in character metrics are `accuracy`, `macro_f1`, and `log_loss`.
- Regression: built-in character metrics are `rmse` and `cindex`.
- Survival analysis: built-in character metric is `cindex` in `fit_resample()`.

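
As a reminder of what the `auc` metric measures, the sketch below computes it from ranks in base R: the probability that a randomly chosen case receives a higher score than a randomly chosen control. `auc_rank()` is a hypothetical helper written for illustration and is not part of bioLeak; the package may compute AUC differently (for example via `pROC`).

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(y, score, positive = "case") {
  is_pos <- y == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(score)  # average ranks handle tied scores
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

y <- factor(c("control", "control", "case", "case"),
            levels = c("control", "case"))
score <- c(0.1, 0.4, 0.35, 0.8)

# Three of the four case/control pairs are ordered correctly
auc_toy <- auc_rank(y, score)  # 0.75
```

Because AUC depends only on the ordering of scores, a single leaked feature that ranks cases above controls is enough to inflate it, which is why the leaky comparator in the quick-start example is evaluated with this metric.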
The learner argument accepts parsnip model specs,
workflows, or built-in learner strings ("glmnet",
"ranger"). Models such as base R glm or
xgboost can still be used through parsnip/workflows or
custom learners, but they are not built-in character learner names.
Survival outcomes are supported in fit_resample(), but
support is less complete across the package. In particular,
tune_resample() does not yet support survival tasks.

- … `X_ref` can still pass undetected.
- … (`p_value_adj`, `flag_fdr`) provide a more conservative screen.
- … `duplicate_scope = "all"` to include within-fold duplicates and review for data-quality issues.
- … (`audit@info$mechanism_summary`) provides a compact mechanism-level risk view across permutation, confounding, target-proxy, duplicate, and temporal signals.

Common misinterpretations:

- “Non-significant permutation test means no leakage”: false.
- “High AUC implies good generalization”: false if resampling is violated.
- “No flagged features means no leakage”: false; audits are limited to available metadata and `X_ref`.

Biomedical ML researchers, biostatisticians, and methodologists reviewing cross-validation and leakage risk.

```r
citation("bioLeak")
```

When reporting results, include: split mode, grouping columns, random seeds, preprocessing steps, learner specification, metrics, and audit settings (`B`, target threshold, similarity method). Include both guarded and leaky comparator results when used.

MIT