The base module provides eleven utility functions covering four areas:
| Area | Functions |
|---|---|
| Data frame utilities | df2list(), df2vect(), recode_column(), view() |
| File system utilities | file_ls(), file_info(), file_tree() |
| Gene ID conversion | gene2entrez(), gene2ensembl() |
| GMT file parsing | gmt2df(), gmt2list() |

df2list(): Split a data frame into a named list

Groups one column’s values by another column and returns a named list. Useful for building marker lists, gene set inputs, or any grouping that downstream functions expect as a list.

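
To make the grouping concrete, here is a minimal base R sketch of the same idea using split(). It is an approximation for illustration, not the package's implementation: the helper name group_to_list() is hypothetical, and keeping groups in first-appearance order is an assumption based on the df2list() example output in this section.

```r
# Hypothetical helper: approximates df2list() with base R only.
group_to_list <- function(df, group_col, value_col) {
  groups <- df[[group_col]]
  # Fix the factor levels to first-appearance order; split() would
  # otherwise sort the groups alphabetically.
  split(df[[value_col]], factor(groups, levels = unique(groups)))
}

markers <- data.frame(
  cell_type = c("T_cell", "T_cell", "B_cell"),
  marker    = c("CD3D", "CD3E", "CD79A"),
  stringsAsFactors = FALSE
)
res <- group_to_list(markers, "cell_type", "marker")
```

The result is a plain named list, so res[["T_cell"]] returns the character vector of markers for that group.
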
df <- data.frame(
  cell_type = c("T_cell", "T_cell", "B_cell", "B_cell", "B_cell"),
  marker    = c("CD3D", "CD3E", "CD79A", "MS4A1", "CD19"),
  stringsAsFactors = FALSE
)
df2list(df, group_col = "cell_type", value_col = "marker")
#> $T_cell
#> [1] "CD3D" "CD3E"
#>
#> $B_cell
#> [1] "CD79A" "MS4A1" "CD19"

df2vect(): Extract a named vector from a data frame

Extracts two columns and returns a named vector, using one column as names and the other as values. The original value type is preserved.

df <- data.frame(
  gene  = c("TP53", "BRCA1", "MYC"),
  score = c(0.91, 0.74, 0.55),
  stringsAsFactors = FALSE
)
df2vect(df, name_col = "gene", value_col = "score")
#>  TP53 BRCA1   MYC
#>  0.91  0.74  0.55

The name column must not contain NA, empty strings, or duplicates. All three are caught at input and raise an informative error.
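
The three checks can be sketched in base R as follows. This is an illustrative approximation rather than the package's code: check_names() is a hypothetical helper and the error wording is invented for the sketch.

```r
# Hypothetical validator mirroring the three documented checks.
check_names <- function(x) {
  if (anyNA(x)) stop("name column contains NA values.")
  if (any(!nzchar(x))) stop("name column contains empty strings.")
  if (anyDuplicated(x) > 0) stop("name column contains duplicate values.")
  invisible(TRUE)
}

scores <- data.frame(gene = c("TP53", "BRCA1"), score = c(0.91, 0.74),
                     stringsAsFactors = FALSE)
check_names(scores$gene)
vec <- setNames(scores$score, scores$gene)  # named numeric vector

# A duplicated name is rejected before any vector is built.
dup_msg <- tryCatch(check_names(c("a", "a")),
                    error = function(e) conditionMessage(e))
```

Validating before calling setNames() matters because base R would otherwise build a vector with duplicate names without complaint, and later lookups by name would silently return only the first match.
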
bad <- data.frame(id = c("a", "a"), val = 1:2)
df2vect(bad, "id", "val")
#> Error in `df2vect()`:
#> ! `name_col` contains duplicate values.

recode_column(): Map column values via a named vector

Replaces values in a column using a named vector passed as dict. Unmatched values receive the value of the default argument, which is NA unless set otherwise. Set the name argument to write the result to a new column instead of overwriting the source column.
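
The lookup itself is ordinary named-vector indexing. The sketch below shows that mechanism in base R; recode_sketch() is a hypothetical stand-in, and having name default to the source column is an assumption about how overwriting works, not a statement about the package's internals.

```r
# Hypothetical helper: approximates recode_column() with base R.
recode_sketch <- function(df, column, dict, name = column, default = NA) {
  # Look up each value by name; values absent from dict come back as NA.
  mapped <- unname(dict[as.character(df[[column]])])
  mapped[is.na(mapped)] <- default
  df[[name]] <- mapped
  df
}

genes <- data.frame(gene = c("TP53", "XYZ"), stringsAsFactors = FALSE)
roles <- c("TP53" = "Tumour suppressor")
out <- recode_sketch(genes, "gene", roles, name = "role", default = "Unknown")
```

One limitation of this simple version: a dict entry whose value is itself NA is indistinguishable from an unmatched key and would also receive the fallback.
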
df <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR", "XYZ"),
  stringsAsFactors = FALSE
)
dict <- c("TP53" = "Tumour suppressor", "EGFR" = "Oncogene")
# Overwrite in place
recode_column(df, column = "gene", dict = dict)
#>                gene
#> 1 Tumour suppressor
#> 2              <NA>
#> 3          Oncogene
#> 4              <NA>
# Write to a new column, keep original; use a custom fallback
recode_column(df, column = "gene", dict = dict,
              name = "role", default = "Unknown")
#>    gene              role
#> 1  TP53 Tumour suppressor
#> 2 BRCA1           Unknown
#> 3  EGFR          Oncogene
#> 4   XYZ           Unknown

view(): Interactive table viewer

Returns an interactive reactable widget with search, filtering, sorting, and pagination. In RStudio the widget renders in the Viewer pane; in other environments it renders in the default HTML output.

view() requires the reactable package. If it is not installed, the function raises a clear error rather than falling back silently.
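
A common way to get this fail-loudly behaviour is a requireNamespace() guard. The sketch below is an assumption about the pattern, not the package's source: require_pkg() and view_sketch() are hypothetical names, and the reactable() arguments shown are simply the ones that correspond to the features listed above.

```r
# Hypothetical guard: stop with a clear message when a package is missing.
require_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(sprintf("Package '%s' is required but not installed.", pkg),
         call. = FALSE)
  }
  invisible(TRUE)
}

view_sketch <- function(data) {
  require_pkg("reactable")
  # Argument choices are assumptions; the real view() may configure more.
  reactable::reactable(data, searchable = TRUE, filterable = TRUE,
                       sortable = TRUE, pagination = TRUE)
}
```

Checking up front keeps the failure at the call site with a message naming the missing package, instead of a less obvious error from deeper in the call stack.
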

file_ls(): List files with metadata

Returns a data frame with one row of metadata per file in a directory. Columns: file, size_MB, modified_time, path.

# All files in the current directory
file_ls(".")
#> file size_MB modified_time path
#> 1 DESCRIPTION 0.002 2026-03-20 14:22:01 F:/project/evanverse/DESCRIPTION
#> 2 NAMESPACE 0.002 2026-03-20 14:22:01 F:/project/evanverse/NAMESPACE
#> ...
# R source files only, searched recursively
file_ls("R", recursive = TRUE, pattern = "\\.R$")

file_info(): Metadata for specific files

Returns the same four-column data frame as file_ls(), but for an explicit vector of file paths rather than a directory scan.

file_info(c("DESCRIPTION", "NAMESPACE"))
#> file size_MB modified_time path
#> 1 DESCRIPTION 0.002 2026-03-20 14:22:01 F:/project/evanverse/DESCRIPTION
#> 2 NAMESPACE 0.002 2026-03-20 14:22:01 F:/project/evanverse/NAMESPACE

Duplicate paths in the input are silently deduplicated. Missing files raise an error listing all unresolved paths.
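
The deduplicate-then-validate behaviour can be approximated with base R's file.info(). This is a sketch under stated assumptions, not the package's implementation: file_info_sketch() is a hypothetical name, and the megabyte conversion (bytes / 1024^2, rounded to three decimals) and the error wording are guesses.

```r
# Hypothetical helper: approximates the documented file_info() behaviour.
file_info_sketch <- function(paths) {
  paths <- unique(paths)  # drop duplicate inputs
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    # Report every unresolved path at once rather than only the first.
    stop("File(s) not found: ", paste(not_found, collapse = ", "),
         call. = FALSE)
  }
  info <- file.info(paths)
  data.frame(
    file          = basename(paths),
    size_MB       = round(info$size / 1024^2, 3),
    modified_time = info$mtime,
    path          = normalizePath(paths, winslash = "/"),
    stringsAsFactors = FALSE
  )
}

tmp_file <- tempfile(fileext = ".txt")
writeLines("hello", tmp_file)
meta <- file_info_sketch(c(tmp_file, tmp_file))  # duplicate collapses to one row
```
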

file_tree(): Print a directory tree

Prints the directory structure in tree format. Returns the lines invisibly so output can be captured if needed.

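
The print-and-return-invisibly pattern can be sketched with a small recursive walker. This is an illustration only: tree_sketch() is a hypothetical function, and its connector and indentation characters are assumptions that may differ from what file_tree() prints.

```r
# Hypothetical tree printer: prints the lines, returns them invisibly.
tree_sketch <- function(path, max_depth = 2) {
  walk <- function(dir, prefix, depth) {
    if (depth > max_depth) return(character(0))
    entries <- sort(list.files(dir))
    out <- character(0)
    for (i in seq_along(entries)) {
      full <- file.path(dir, entries[i])
      out <- c(out, paste0(prefix, "+-- ", entries[i]))
      if (dir.exists(full)) {
        # Draw a vertical bar only while siblings remain below this entry.
        last <- i == length(entries)
        child_prefix <- paste0(prefix, if (last) "    " else "|   ")
        out <- c(out, walk(full, child_prefix, depth + 1))
      }
    }
    out
  }
  lines <- c(normalizePath(path, winslash = "/"), walk(path, "", 1))
  cat(lines, sep = "\n")
  invisible(lines)
}

# Build a throwaway directory and capture the printed lines.
demo_root <- tempfile("tree_demo_")
dir.create(file.path(demo_root, "R"), recursive = TRUE)
invisible(file.create(file.path(demo_root, "DESCRIPTION"),
                      file.path(demo_root, "R", "base.R")))
tree_lines <- tree_sketch(demo_root, max_depth = 2)
```

Because the return value is invisible, calling the function without assignment prints the tree once, while assigning it also keeps the lines for logging or writing to a file.
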
file_tree(".", max_depth = 2)
#> F:/project/evanverse
#> +-- DESCRIPTION
#> +-- NAMESPACE
#> +-- R
#> | +-- base.R
#> | +-- plot.R
#> | +-- utils.R
#> +-- tests
#>   +-- testthat

Both gene2entrez() and gene2ensembl() accept a character vector of gene symbols and return a three-column data frame: the original input (symbol), the case-normalised form used for matching (symbol_std), and the converted ID.

Matching is performed against a ref data frame with columns symbol, entrez_id, and ensembl_id. Two sources are available:

| Source | When to use |
|---|---|
toy_gene_ref() |
Examples, tests, offline work — 20 genes, no network |
download_gene_ref() |
Production analysis — full genome via biomaRt |

Case normalisation depends on the species:

| Species | Rule applied to both input and reference |
|---|---|
| "human" | toupper(): "tp53" and "TP53" both match TP53 |
| "mouse" | tolower(): "TRP53" and "Trp53" both match Trp53 |

Unmatched symbols are returned with NA in the ID column
rather than dropped.
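
The normalise-then-match rule, including the NA for unmatched symbols, can be sketched with base R's match(). This is a hedged approximation: match_symbols() is a hypothetical helper, and the two-row reference table is a stand-in built for the example rather than the output of toy_gene_ref().

```r
# Hypothetical matcher: approximates the documented normalise-then-match rule.
match_symbols <- function(symbols, ref, id_col, species = "human") {
  normalise <- if (species == "human") toupper else tolower
  symbol_std <- normalise(symbols)
  # match() yields NA for symbols absent from the reference, so unmatched
  # inputs stay in the result instead of being dropped.
  idx <- match(symbol_std, normalise(ref$symbol))
  out <- data.frame(symbol = symbols, symbol_std = symbol_std,
                    stringsAsFactors = FALSE)
  out[[id_col]] <- ref[[id_col]][idx]
  out
}

# Tiny stand-in reference table; real work would use toy_gene_ref() or
# download_gene_ref().
ref <- data.frame(symbol = c("TP53", "MYC"), entrez_id = c("7157", "4609"),
                  stringsAsFactors = FALSE)
res <- match_symbols(c("tp53", "NOPE"), ref, "entrez_id")
```

Keeping one output row per input symbol, in input order, is what allows the result to be attached back to the original data by position.
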
gene2entrez()

GMT (Gene Matrix Transposed) is the standard format for gene set collections such as MSigDB. Each line encodes one gene set as tab-separated fields: a term name, a description, and then the gene symbols.

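
To make the layout concrete, the snippet below splits a single record by hand in base R. The record is a shortened, made-up line for illustration, and this is not how the package parses files.

```r
# One GMT record: term, description, then gene symbols, all tab-separated.
line <- "HALLMARK_P53_PATHWAY\tGenes regulated by p53\tTP53\tBRCA1\tMYC"

fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
term        <- fields[1]
description <- fields[2]
genes       <- fields[-(1:2)]  # everything after the first two fields
```
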
toy_gmt() writes a minimal GMT file to a temp path for
offline use:
tmp <- toy_gmt(n = 3)
readLines(tmp)
#> [1] "HALLMARK_P53_PATHWAY\tGenes regulated by p53\tTP53\tBRCA1\tMYC\t..."
#> [2] "HALLMARK_MTORC1_SIGNALING\tGenes upregulated by mTORC1\tPTEN\t..."
#> [3] "HALLMARK_HYPOXIA\tGenes upregulated under hypoxia\tMTOR\tHIF1A\t..."

gmt2df(): Long-format data frame

Returns one row per gene within each gene set, making the output directly compatible with dplyr and data.table workflows.
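
The long format can be pictured as follows. The gene sets here are made up, and the column names term and gene follow the workflow example later in this document; any further columns gmt2df() may return are not shown.

```r
# Hypothetical gene sets standing in for a parsed GMT file.
sets <- list(
  SET_A = c("TP53", "MYC"),
  SET_B = c("PTEN")
)

# One row per (term, gene) pair.
long <- data.frame(
  term = rep(names(sets), lengths(sets)),
  gene = unlist(sets, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Long format makes per-term summaries a one-liner.
genes_per_term <- table(long$term)
```
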

gmt2list(): Named list of gene vectors

Returns a named list where each element is a character vector of gene symbols. This is the format expected by most gene set enrichment tools (e.g., fgsea, clusterProfiler).

gs <- gmt2list(tmp)
names(gs)
#> [1] "HALLMARK_P53_PATHWAY" "HALLMARK_MTORC1_SIGNALING"
#> [3] "HALLMARK_HYPOXIA"
gs[["HALLMARK_P53_PATHWAY"]]
#> [1] "TP53" "BRCA1" "MYC" "EGFR" "PTEN" "CDK2" "MDM2"
#> [8] "RB1" "CDKN2A" "AKT1"

Lines with fewer than 3 tab-separated fields are skipped with a warning and removed from the result. If every line is malformed, both functions return NULL rather than raising an error; this is the current behaviour. Always check for a NULL return when parsing files from untrusted sources.

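
One way to make that check hard to forget is a small wrapper that converts a NULL result into an explicit error. The sketch below is a suggestion rather than part of the package: read_gene_sets() is a hypothetical name, and a stub parser stands in for gmt2list() so the example runs without any input file.

```r
# Hypothetical wrapper: turn a NULL parse result into an explicit error.
read_gene_sets <- function(path, parser) {
  sets <- parser(path)
  if (is.null(sets)) {
    stop("No valid gene sets found in: ", path, call. = FALSE)
  }
  sets
}

# Stub parser used only to demonstrate the guard; in real use pass gmt2list.
always_null <- function(path) NULL
msg <- tryCatch(read_gene_sets("broken.gmt", always_null),
                error = function(e) conditionMessage(e))
```

Failing at the read step keeps a malformed file from surfacing later as a confusing error inside an enrichment call.
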
Gene ID conversion and GMT parsing compose naturally. The example below reads a GMT file, converts all gene symbols to Entrez IDs, and produces a named list of ID vectors ready for enrichment analysis.
library(evanverse)
# 1. Parse GMT into long format
tmp <- toy_gmt(n = 5)
df <- gmt2df(tmp)
# 2. Convert symbols to Entrez IDs
ref <- toy_gene_ref(species = "human")
id_map <- gene2entrez(df$gene, ref = ref, species = "human")
# 3. Attach IDs and drop unmatched
df$entrez_id <- id_map$entrez_id
df <- df[!is.na(df$entrez_id), ]
# 4. Rebuild named list with Entrez IDs
gs_entrez <- df2list(df, group_col = "term", value_col = "entrez_id")
gs_entrez[["HALLMARK_P53_PATHWAY"]]
#> [1] "7157" "672" "4609" "1956" "5728" "1031" "4193" "5925" "1029" "207"

?df2list, ?df2vect, ?recode_column, ?view

?file_ls, ?file_info, ?file_tree

?gene2entrez, ?gene2ensembl

?gmt2df, ?gmt2list

?toy_gene_ref, ?toy_gmt, ?download_gene_ref