Type: Package
Title: Improving Interaction Modelling and Interpretability in Random Forests
Version: 0.1.0
Date: 2026-01-26
Maintainer: Roman Hornung <hornung@ibe.med.uni-muenchen.de>
Description: Implementation of the unity forest (UFO) framework (Hornung & Hapfelmeier, 2026, <doi:10.48550/arXiv.2601.07003>). UFOs are a random forest variant designed to better account for covariates with purely interaction-based effects, including interactions for which none of the involved covariates exhibits a marginal effect. While this framework tends to improve discrimination and predictive accuracy compared to standard random forests, it also facilitates the identification and interpretation of (marginal or interactive) effects: in addition to the UFO algorithm for tree construction, the package includes the unity variable importance measure (unity VIM), which quantifies covariate effects under the conditions in which they are strongest - either marginally or within subgroups defined by interactions - as well as covariate-representative tree roots (CRTRs), which provide interpretable visualizations of these conditions. Currently, only classification is supported. This package is a fork of the R package 'ranger' (main author: Marvin N. Wright), which implements random forests using an efficient C++ backend.
SystemRequirements: C++17
Encoding: UTF-8
License: GPL-3
Imports: Rcpp (≥ 0.11.2), Matrix, ggplot2, ggrepel, dplyr, scales, rlang
LinkingTo: Rcpp, RcppEigen
Depends: R (≥ 3.5)
Suggests: patchwork
RoxygenNote: 7.3.3
NeedsCompilation: yes
Packaged: 2026-01-26 18:40:17 UTC; hornung
Author: Roman Hornung [aut, cre], Marvin N. Wright [ctb, cph]
Repository: CRAN
Date/Publication: 2026-01-30 11:00:09 UTC

Unity Forest (UFO) Framework

Description

This package implements the unity forest (UFO) framework. UFOs are a random forest variant designed to better account for covariates with purely interaction-based effects, including interactions for which none of the involved covariates exhibits a marginal effect. While this framework tends to improve discrimination and predictive accuracy compared to standard random forests, it also facilitates the identification and interpretation of (marginal or interactive) effects: in addition to the UFO algorithm for tree construction, the package includes the unity variable importance measure (unity VIM), which quantifies covariate effects under the conditions in which they are strongest - either marginally or within subgroups defined by interactions - as well as covariate-representative tree roots (CRTRs), which provide interpretable visualizations of these conditions. Currently, only classification is supported.

Details

The main functions of the package are:

unityfor: construct a unity forest prediction rule and compute the unity VIM,
predict.unityfor: predict with new data using a fitted unity forest,
reprTrees: select and visualize covariate-representative tree roots (CRTRs).

This package is a fork of the R package 'ranger', which implements random forests using an efficient C++ implementation. The documentation is partly taken from 'ranger'; some parts of it may not apply to (the current version of) the 'unityForest' package.

The code in the example sections can be used as a template for basic application scenarios.
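
For orientation, the basic workflow (mirroring the Examples sections of the individual help pages; the small num.trees value is used here only to keep the runtime short, the default being 20000) is:

library("unityForest")
set.seed(1234)
data(wine)

## Construct a unity forest and compute unity VIM values:
model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "unity", num.trees = 20)

## Inspect the unity VIM ranking:
sort(model$variable.importance, decreasing = TRUE)

## Visualize CRTRs for the variables with the largest unity VIM values:
reprTrees(model)

## Predict (here, simply for the training data):
predict(model, data = wine)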

References

Hornung, R., Hapfelmeier, A. (2026). Improving interaction modelling and interpretability in random forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.


Unity Forest prediction

Description

Prediction with new data and a saved forest from unityfor.

Usage

## S3 method for class 'unityfor'
predict(
  object,
  data = NULL,
  predict.all = FALSE,
  num.trees = object$num.trees,
  type = "response",
  num.threads = NULL,
  verbose = TRUE,
  ...
)

Arguments

object

unityfor object.

data

New test data of class data.frame.

predict.all

Return individual predictions for each tree instead of aggregated predictions for all trees. Return a matrix (sample x tree) for classification and a 3d array for probability estimation (sample x class x tree).

num.trees

Number of trees used for prediction. The first num.trees in the forest are used.

type

Type of prediction. One of 'response', 'se', 'terminalNodes', 'quantiles' with default 'response'. See below for details.

num.threads

Number of threads. Default is number of CPUs available.

verbose

Verbose output on or off.

...

further arguments passed to or from other methods.

Details

This package is a fork of the R package 'ranger', which implements random forests using an efficient C++ implementation. More precisely, 'unityForest' was written by modifying the code of 'ranger', version 0.11.0.

Value

Object of class unityfor.prediction with elements

predictions

Predicted classes/probabilities/values (only for classification and regression).

num.trees

Number of trees.

num.independent.variables

Number of independent variables.

treetype

Type of forest/tree. Classification or regression (the latter is not implemented yet).

num.samples

Number of samples.
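
A brief usage sketch (assuming a fitted unityfor object 'model' and the 'wine' training data, as in the unityfor Examples):

pred <- predict(model, data = wine)
head(pred$predictions)     # class probabilities (probability forest)
pred$num.trees             # number of trees used for prediction

## Per-tree predictions instead of aggregated ones:
pred_all <- predict(model, data = wine, predict.all = TRUE)
dim(pred_all$predictions)  # sample x class x tree for probability estimation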

Author(s)

Marvin N. Wright

References

Hornung, R., Hapfelmeier, A. (2026). Improving interaction modelling and interpretability in random forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.

Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1-17.

See Also

unityfor


Select and visualize covariate-representative tree roots (CRTRs)

Description

Implements the algorithm for selecting and visualizing covariate-representative tree roots (CRTRs) as described in Hornung & Hapfelmeier (2026).
CRTRs are tree roots extracted from a unity forest that characterize the conditions under which a given variable exhibits its strongest effect on the outcome. The function selects one representative tree root for each variable and visualizes its structure to facilitate interpretation. CRTRs are essential for analyzing the effects identified by the unity VIM (unityfor). See the 'Details' section below for further information.

Usage

reprTrees(
  object,
  vars = NULL,
  numvars = 5,
  indvars = NULL,
  num.threads = NULL,
  plotit = TRUE,
  highlight_relevant = TRUE,
  box_plots = TRUE,
  density_plots = TRUE,
  add_split_line = TRUE,
  verbose = TRUE
)

Arguments

object

Object of class unityfor.

vars

Optional character vector of variable names for which CRTRs should be obtained. See the sketch after this argument list.

numvars

The number of variables with the largest unity VIM values for which CRTRs should be obtained. Default is 5.

indvars

The indices of the variables with the largest unity VIM values for which CRTRs should be obtained. For example, if indvars = c(1, 3), the CRTRs for the variables with the largest and third-largest unity VIM values are obtained.

num.threads

Number of threads. Default is number of CPUs available.

plotit

Whether the CRTRs should be plotted or merely returned (invisibly). Default is TRUE.

highlight_relevant

Whether all nodes other than those containing the top-scoring splits for the variables of interest (and the ancestors of these nodes) should be shaded out. Default is TRUE. See the 'Details' section below for an explanation.

box_plots

Whether boxplots should be used to show the outcome class-specific distributions of the variable values in the nodes with top-scoring splits (see the 'Details' section for an explanation). For classification only. Default is TRUE.

density_plots

Whether kernel density plots should be used to show the outcome class-specific distributions of the variable values in the nodes with top-scoring splits (see 'Details' section for explanation). For classification only. Default is TRUE.

add_split_line

Whether a line should be drawn at the split point of the corresponding node in the boxplots and/or density plots. Default is TRUE.

verbose

Verbose output on or off. Default is TRUE.
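
The three selection mechanisms (vars, numvars, indvars) can be sketched as follows, assuming a fitted unityfor object 'model' as in the Examples below; the variable name "Proline" is merely a placeholder for a covariate in the data:

reprTrees(model, numvars = 3)        # top 3 variables by unity VIM
reprTrees(model, indvars = c(1, 3))  # largest and third-largest unity VIM
reprTrees(model, vars = "Proline")   # a specific variable by name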

Details

Further details on the descriptions below are provided in Hornung & Hapfelmeier (2026).

Covariate-representative tree roots (CRTRs). CRTRs (Hornung & Hapfelmeier, 2026) are tree fragments (or 'tree roots' - the first few splits in the trees) extracted from a fitted unity forest (unityfor) that characterize, for given variables, the conditions under which each variable exerts its strongest influence on the prediction.

Technically, for a given variable, the algorithm identifies tree roots in which this variable attains particularly high split scores (top-scoring splits). From these tree roots, a representative root is extracted (Laabs et al., 2024) that best reflects the conditions under which this variable has its strongest effect.

Interpretation and subgroup effects. If a variable has a strong marginal effect, the corresponding CRTR typically contains a split on this variable at the root node (first split in the tree). In contrast, if a variable has little marginal effect but interacts with another variable, the CRTR may first split on that other variable, thereby defining a subgroup in which the variable of interest exhibits a strong conditional effect.

From a substantive perspective, CRTRs enable the exploration of variable effects that are generally not detectable by conventional methods focusing on marginal associations. In particular, CRTRs can reveal variables that have weak marginal effects but act strongly within specific subgroups defined by interactions with other variables.

Relation to unity VIM. CRTRs are closely related to the unity variable importance measure (unity VIM) (unityfor). The unity VIM quantifies the strength of variable effects under the conditions in which they are strongest. Analogously, CRTRs visualize these conditions by displaying the tree structures that give rise to the respective unity VIM values.

Accordingly, the CRTR algorithm can be used to visualize and interpret the effects identified by the unity VIM. By default, CRTRs are constructed and visualized for the five variables with the largest unity VIM values.
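
For example, with a fitted unity forest 'model' (see the Examples below), the default call

reprTrees(model)

constructs and plots the CRTRs for the five variables with the largest unity VIM values.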

Scope of applicability. CRTRs should primarily be examined for variables with sufficiently large unity VIM values. Constructing CRTRs for variables with negligible importance may lead to overinterpretation, as apparent patterns may reflect random structure rather than meaningful effects.

Shaded regions in the visualization. For improved interpretability, parts of the CRTRs are shaded out by default. Specifically, only the nodes containing the top-scoring splits for the variable of interest and their ancestor nodes are shown prominently.

This design is motivated by two considerations. First, the purpose of CRTRs is to depict the conditions under which a variable exhibits its strongest effects - conditions that are defined by the ancestors of the nodes with top-scoring splits. Second, the remaining regions of the tree are of limited interpretive value. Since each CRTR is derived from tree roots selected for strong effects of a specific variable, the splitting patterns along the highlighted paths are specific to that variable. In contrast, shaded regions reflect arbitrary aspects of the overall association structure in the data and may include splits on non-informative variables, as each tree root is grown from a (small) random subset of all available variables.

Note that additional splits on the variable of interest may occur within shaded regions and can still be relevant. However, these splits do not represent the conditions under which the variable attains its strongest effects.

In-bag data for top-scoring split visualizations. The boxplots and density plots illustrating the discriminatory power of the top-scoring splits are computed exclusively based on the in-bag observations of the corresponding trees. This is consistent with the construction of the CRTRs themselves, which are derived from in-bag data only.

Value

Object of class unityfor.reprTrees with elements

rules

List. In-bag statistics on the outcome at each node in the CRTRs. For classification, this provides the class frequencies and the numbers of observations representing each class.

plots

List. Generated ggplot2 plots.

var.names

Labels of the variables for which CRTRs were selected.

var.names.all

Names of all independent variables in the dataset.

num.independent.variables

Number of independent variables in the dataset.

num.samples

Number of observations in the dataset.

treetype

Tree type.

forest

Sub-forest that contains only the CRTRs.

Author(s)

Roman Hornung

References

Hornung, R., Hapfelmeier, A. (2026). Improving interaction modelling and interpretability in random forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.

Laabs, B., Westenberger, A., König, I. R. (2024). Identification of representative trees in random forests based on a new tree-based distance measure. Advances in Data Analysis and Classification, 18, 363-380.

See Also

unityfor

Examples



## Load package:

library("unityForest")


## Set seed to make results reproducible:

set.seed(1234)


## Load wine dataset:

data(wine)


## Construct unity forest and calculate unity VIM values:

model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "unity", num.trees = 2000)

# NOTE: num.trees = 2000 (as used above) would be too small for practical
# purposes. This relatively small number of trees was used simply to keep
# the runtime of the example short.
# The default number of trees is num.trees = 20000.


## Visualize the CRTRs for the five variables with the largest unity VIM
## values:

reprTrees(model, box_plots = FALSE, density_plots = FALSE)


## Visualize the CRTRs for the variables with the largest and third-largest 
## unity VIM values:

reprTrees(model, indvars = c(1, 3), box_plots = FALSE, density_plots = FALSE)


## Visualize the CRTRs for the variables with the largest and third-largest 
## unity VIM values, where density plots are shown to visualize the 
## outcome class-specific distributions of the variable values in the 
## nodes with top-scoring splits:

reprTrees(model, indvars = c(1, 3), box_plots = FALSE, density_plots = TRUE)


## Visualize the CRTRs for the variables with the largest and third-largest 
## unity VIM values, where both density plots and boxplots are shown to 
## visualize the outcome class-specific distributions of the variable values 
## in the nodes with top-scoring splits; the split points are not indicated 
## in these plots:
ps <- reprTrees(model, indvars = c(1, 3), add_split_line = FALSE)


## Save one of the CRTRs with the corresponding density plot:

library("patchwork")
library("ggplot2")

p <- ps$plots[[1]]$tree_plot / ps$plots[[1]]$density_plot +
     patchwork::plot_layout(heights = c(2, 1))
p

# outfile <- file.path(tempdir(), "figure_xy.pdf")
# ggsave(outfile, device = cairo_pdf, plot = p, width = 18, 
#        height = 14)


# Note: The plots can be manipulated with the usual ggplot2 syntax, e.g.:

ps$plots[[1]]$density_plot + xlab("Proline") + labs(title = NULL, y = NULL) +
  theme(
    legend.position = c(0.95, 0.95),
    legend.justification = c(1, 1)
  )




Construct a unity forest prediction rule and compute the unity VIM

Description

Constructs a unity forest and computes the unity variable importance measure (VIM), as described in Hornung & Hapfelmeier (2026). Currently, only categorical outcomes are supported.
The unity forest algorithm is a tree construction approach for random forests in which the first few splits are optimized jointly in order to more effectively capture interaction effects beyond marginal effects. The unity VIM quantifies the influence of each variable under the conditions in which that influence is strongest, thereby placing a stronger emphasis on interaction effects than conventional variable importance measures.
To explore the nature of the effects identified by the unity VIM, it is essential to examine covariate-representative tree roots (CRTRs), which are implemented in reprTrees.

Usage

unityfor(
  formula = NULL,
  dependent.variable.name = NULL,
  data = NULL,
  num.trees = 20000,
  num.cand.trees = 500,
  probability = TRUE,
  importance = "none",
  prop.best.splits = NULL,
  min.node.size.root = NULL,
  min.node.size = NULL,
  max.depth.root = NULL,
  max.depth = NULL,
  prop.var.root = NULL,
  mtry.sprout = NULL,
  replace = FALSE,
  sample.fraction = ifelse(replace, 1, 0.7),
  case.weights = NULL,
  class.weights = NULL,
  inbag = NULL,
  oob.error = TRUE,
  num.threads = NULL,
  write.forest = TRUE,
  verbose = TRUE
)

Arguments

formula

Object of class formula or character describing the model to fit. Interaction terms supported only for numerical variables.

dependent.variable.name

Name of outcome variable, needed if no formula given.

data

Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL).

num.trees

Number of trees. Default is 20000.

num.cand.trees

Number of random candidate trees to generate for each tree root. Default is 500.

probability

Grow a probability forest as in Malley et al. (2012). (NOTE: Currently, only probability forests are implemented; this will be changed in the next version.)

importance

Variable importance mode, either 'unity' (unity VIM) or 'none'.

prop.best.splits

Related to the unity VIM. The default value should generally not be modified by the user. When calculating the unity VIM, only the top prop.best.splits × 100% of the splits for each variable (those with the highest split criterion values, weighted by node size) are considered. The default value is 0.01, meaning that only the top 1% of splits are used. While small values are recommended, they should not be set too low, to ensure that each variable retains a sufficient number of splits for a reliable unity VIM computation.

min.node.size.root

Minimal node size in the tree roots. Default is 10 irrespective of the outcome type.

min.node.size

Minimal node size. Default 1 for classification and 5 for probability.

max.depth.root

Maximal depth of the tree roots. Default value is 3 and should generally not be modified by the user. Larger values can be associated with worse predictive performance for some datasets.

max.depth

Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree). Must be at least as large as max.depth.root.

prop.var.root

Proportion of variables randomly sampled for constructing each tree root. Default is the square root of the number of variables divided by the number of variables. Consequently, by default, a random subset of variables is considered for each tree root, with size equal to the (rounded-up) square root of the total number of variables. An exception is made for datasets with more than 100 variables, where the default for prop.var.root is set to 0.1. See the 'Details' section below for an explanation.

mtry.sprout

Number of randomly sampled variables to possibly split at in each node of the tree sprouts (i.e., the branches of the trees beyond the tree roots). Default is the (rounded-down) square root of the number of variables.

replace

Sample with replacement. Default is FALSE.

sample.fraction

Fraction of observations to sample for each tree. Default is 1 for sampling with replacement and 0.7 for sampling without replacement.

case.weights

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.

class.weights

Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.

inbag

Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling; a sketch is given after this argument list.

oob.error

Compute OOB prediction error. Set to FALSE to save computation time.

num.threads

Number of threads. Default is number of CPUs available.

write.forest

Save unityfor.forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction is intended.

verbose

Show computation status and estimated runtime.
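
The following sketch shows one way an inbag list for class-stratified subsampling could be constructed (a hypothetical illustration using the 'wine' data with outcome 'C' from the Examples below; 70% of each outcome class is drawn without replacement, matching the default sample.fraction):

n <- nrow(wine)
n_trees <- 20
inbag <- replicate(n_trees, {
  counts <- integer(n)  # in-bag count per observation
  for (lev in levels(wine$C)) {
    idx <- which(wine$C == lev)
    counts[sample(idx, size = floor(0.7 * length(idx)))] <- 1L
  }
  counts
}, simplify = FALSE)

model <- unityfor(dependent.variable.name = "C", data = wine,
                  num.trees = n_trees, inbag = inbag)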

Details

There are two reasons why, for datasets with more than 100 variables, the default value of prop.var.root is set to 0.1 rather than to the square root of the number of variables divided by the total number of variables.

First, as the total number of variables increases, the square-root-based proportion decreases. This makes it less likely that the same pairs of variables are selected together in multiple trees. This can be problematic for the unity VIM, particularly for variables that do not have marginal effects on their own but act only through interactions with one or a few other variables. Such variables are informative in tree roots only when they are used jointly with the covariates they interact with. Setting prop.var.root = 0.1 ensures that interacting covariates are selected together sufficiently often in tree roots.

Second, this choice reflects the fact that in high-dimensional datasets, typically only a small proportion of variables are informative. Applying the square-root rule in such settings may result in too few informative variables being selected, thereby reducing the likelihood of constructing predictive tree roots.

However, note that results obtained from applications of the unity forest framework to high-dimensional datasets should be interpreted with caution. For high-dimensional data, the curse of dimensionality makes the identification of individual interaction effects challenging and increases the risk of false positives. Moreover, the split points identified in the CRTRs (reprTrees) may become less precise as the number of covariates considered per tree root increases.
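
In R terms, the default rule for prop.var.root described above can be sketched as follows (a hypothetical helper, not part of the package):

default_prop_var_root <- function(p) {
  # p: total number of variables
  if (p > 100) 0.1 else sqrt(p) / p
}

## Resulting number of variables per tree root (rounded up):
ceiling(default_prop_var_root(13) * 13)  # e.g., 4 of the 13 wine covariates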

Value

Object of class unityfor with elements

predictions

Predicted classes/values, based on out-of-bag samples.

forest

Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R.

data

Training data.

variable.importance

Variable importance for each independent variable. Only available if importance is not "none".

importance.mode

Importance mode used.

prediction.error

Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, and for regression the mean squared error.

confusion.matrix

Contingency table for classes and predictions based on out-of-bag samples (classification only).

call

Function call.

num.trees

Number of trees.

num.cand.trees

Number of candidate trees generated for each tree root.

num.independent.variables

Number of independent variables.

num.samples

Number of samples.

prop.var.root

Proportion of variables randomly sampled for each tree root.

mtry

Value of mtry used (in the tree sprouts).

max.depth.root

Maximal depth of the tree roots.

min.node.size.root

Minimal node size in the tree roots.

min.node.size

Value of minimal node size used.

splitrule

Splitting rule (used only in the tree sprouts).

replace

Sample with replacement.

treetype

Type of forest/tree. Classification or regression.
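
A short sketch of accessing some of these elements (with 'model' fitted as in the Examples below):

model$prediction.error  # OOB prediction error (Brier score for probability forests)
model$num.trees
sort(model$variable.importance, decreasing = TRUE)  # unity VIM (if importance = "unity")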

Author(s)

Roman Hornung, Marvin N. Wright

References

Hornung, R., Hapfelmeier, A. (2026). Improving interaction modelling and interpretability in random forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.

Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine, 51(1), 74-81.

See Also

predict.unityfor

Examples

## Load package:

library("unityForest")


## Set seed to make results reproducible:

set.seed(1234)


## Load wine dataset:

data(wine)


## Construct unity forest and calculate unity VIM values:

model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "unity", num.trees = 20)

# NOTE: num.trees = 20 (as used above) would be much too small for practical
# purposes. This small number of trees was used simply to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000.


## Inspect the rankings of the variables and variable pairs with respect to 
## the unity VIM:

sort(model$variable.importance, decreasing = TRUE)


## Prediction:

# Randomly split the 'wine' dataset into training
# and test data:
train.idx <- sample(nrow(wine), 2/3 * nrow(wine))
wine_train <- wine[train.idx, ]
wine_test <- wine[-train.idx, ]

# Construct unity forest on training data:
# NOTE again: num.trees = 20 is much too small for practical purposes.
model_train <- unityfor(dependent.variable.name = "C", data = wine_train, 
                        importance = "none", num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate unity VIM values (importance = "none"), which speeds up
# the computation.

# Predict class values of the test data:
pred_wine <- predict(model_train, data = wine_test)

# Compare predicted and true class values of the test data:
table(wine_test$C, levels(wine_train$C)[apply(pred_wine$predictions, 1, which.max)])


Wine Chemical Analysis Data (Binary Cultivar)

Description

The well-known wine dataset comprises the results of chemical analyses of 178 wines produced in the same region of Italy from three grape varieties (Barolo, Grignolino, and Barbera). The dataset was originally introduced by Forina et al. (1984) and later described in detail by Forina et al. (1986).

Format

A data frame with 178 observations, 13 numeric covariates and one binary target variable.

Details

For each sample, 13 continuous chemical constituents were measured, which serve as covariates for distinguishing between the grape varieties. For the analyses in this package, a version of the dataset with a binary outcome is provided that differentiates between Grignolino ("G") and the two other varieties ("Other"; Barolo and Barbera). This version is available on OpenML under data ID 973.

The variables are as follows: the binary outcome C, with levels "G" (Grignolino) and "Other" (Barolo and Barbera), and 13 numeric covariates containing the measured chemical constituents.

Source

OpenML: data.id: 973, link: https://www.openml.org/d/973/

References

Examples

data(wine)

table(wine$C)
dim(wine)

head(wine)