| Type: | Package |
| Title: | Improving Interaction Modelling and Interpretability in Random Forests |
| Version: | 0.1.0 |
| Date: | 2026-01-26 |
| Maintainer: | Roman Hornung <hornung@ibe.med.uni-muenchen.de> |
| Description: | Implementation of the unity forest (UFO) framework (Hornung & Hapfelmeier, 2026, <doi:10.48550/arXiv.2601.07003>). UFOs are a random forest variant designed to better take covariates with purely interaction-based effects into account, including interactions for which none of the involved covariates exhibits a marginal effect. While this framework tends to improve discrimination and predictive accuracy compared to standard random forests, it also facilitates the identification and interpretation of (marginal or interactive) effects: In addition to the UFO algorithm for tree construction, the package includes the unity variable importance measure (unity VIM), which quantifies covariate effects under the conditions in which they are strongest - either marginally or within subgroups defined by interactions - as well as covariate-representative tree roots (CRTRs) that provide interpretable visualizations of these conditions. Currently, only classification is supported. This package is a fork of the R package 'ranger' (main author: Marvin N. Wright), which implements random forests using an efficient C++ backend. |
| SystemRequirements: | C++17 |
| Encoding: | UTF-8 |
| License: | GPL-3 |
| Imports: | Rcpp (≥ 0.11.2), Matrix, ggplot2, ggrepel, dplyr, scales, rlang |
| LinkingTo: | Rcpp, RcppEigen |
| Depends: | R (≥ 3.5) |
| Suggests: | patchwork |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | yes |
| Packaged: | 2026-01-26 18:40:17 UTC; hornung |
| Author: | Roman Hornung [aut, cre], Marvin N. Wright [ctb, cph] |
| Repository: | CRAN |
| Date/Publication: | 2026-01-30 11:00:09 UTC |
Unity Forest (UFO) Framework
Description
This package implements the unity forest (UFO) framework. UFOs are a random forest variant designed to better take covariates with purely interaction-based effects into account, including interactions for which none of the involved covariates exhibits a marginal effect. While this framework tends to improve discrimination and predictive accuracy compared to standard random forests, it also facilitates the identification and interpretation of (marginal or interactive) effects: In addition to the UFO algorithm for tree construction, the package includes the unity variable importance measure (unity VIM), which quantifies covariate effects under the conditions in which they are strongest - either marginally or within subgroups defined by interactions - as well as covariate-representative tree roots (CRTRs) that provide interpretable visualizations of these conditions. Currently, only classification is supported.
Details
The main functions of the package are:
- unityfor: Construct a UFO and compute the unity VIM.
- predict.unityfor: Predict using a UFO fitted with unityfor.
- reprTrees: Select and visualize covariate-representative tree roots (CRTRs) based on a unityfor object.
This package is a fork of the R package 'ranger', which implements random forests with an efficient C++ backend. The documentation is partly taken from 'ranger'; some parts of it may not apply to (the current version of) the 'unityForest' package.
The code in the example sections can be used as a template for basic application scenarios.
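The following minimal workflow sketch (condensed from the Examples sections of unityfor, reprTrees and predict.unityfor) illustrates how the main functions fit together; num.trees is set far below the default of 20000 only to keep the runtime short:
## Minimal workflow sketch (small num.trees for illustration only):
library("unityForest")
data(wine)
model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "unity", num.trees = 20)
sort(model$variable.importance, decreasing = TRUE)  # unity VIM ranking
reprTrees(model)                                    # CRTRs for the top five variables
pred <- predict(model, data = wine)                 # class probabilities (probability forest)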
References
Hornung, R., Hapfelmeier, A. (2026). Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Unity Forest prediction
Description
Prediction with new data and a saved forest from unityfor.
Usage
## S3 method for class 'unityfor'
predict(
object,
data = NULL,
predict.all = FALSE,
num.trees = object$num.trees,
type = "response",
num.threads = NULL,
verbose = TRUE,
...
)
Arguments
object |
Object of class unityfor (a fitted unity forest). |
data |
New test data of class data.frame. |
predict.all |
Return individual predictions for each tree instead of aggregated predictions for all trees. Return a matrix (sample x tree) for classification and a 3d array for probability estimation (sample x class x tree). |
num.trees |
Number of trees used for prediction. The first num.trees trees in the forest are used. |
type |
Type of prediction. One of 'response', 'se', 'terminalNodes', 'quantiles' with default 'response'. See below for details. |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Verbose output on or off. |
... |
further arguments passed to or from other methods. |
Details
This package is a fork of the R package 'ranger', which implements random forests with an efficient C++ backend. More precisely, 'unityForest' was written by modifying the code of 'ranger', version 0.11.0.
Value
Object of class unityfor.prediction with elements
predictions | Predicted classes/probabilities/values (only for classification and regression) |
num.trees | Number of trees. |
num.independent.variables | Number of independent variables. |
treetype | Type of forest/tree. Classification or regression (the latter is not implemented yet). |
num.samples | Number of samples. |
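A minimal sketch of accessing these elements, adapted from the Examples of unityfor; with the default probability forest, predictions is a matrix of class probabilities (sample x class):
## Fit a deliberately small unity forest on the bundled 'wine' data and predict:
library("unityForest")
data(wine)
model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "none", num.trees = 20)
pred <- predict(model, data = wine)
head(pred$predictions)   # class probabilities per observation
pred$num.trees           # number of trees used for prediction
pred$treetype            # type of forest/tree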
Author(s)
Marvin N. Wright
References
Hornung, R., Hapfelmeier, A. (2026). Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
See Also
unityfor
Select and visualize covariate-representative tree roots (CRTRs)
Description
Implements the algorithm for selecting and visualizing covariate-representative tree roots (CRTRs) as described in Hornung & Hapfelmeier (2026).
CRTRs are tree roots extracted from a unity forest that characterize the conditions under which a given variable exhibits its strongest effect on the outcome. The function selects one representative tree root for each variable and visualizes its structure to facilitate interpretation. CRTRs are essential for analyzing the effects identified by the unity VIM (unityfor). See the 'Details' section below for more details.
Usage
reprTrees(
object,
vars = NULL,
numvars = 5,
indvars = NULL,
num.threads = NULL,
plotit = TRUE,
highlight_relevant = TRUE,
box_plots = TRUE,
density_plots = TRUE,
add_split_line = TRUE,
verbose = TRUE
)
Arguments
object |
Object of class unityfor. |
vars |
Optional character vector of variable names for which CRTRs should be obtained. |
numvars |
The number of variables with the largest unity VIM values for which CRTRs should be obtained. Default is 5. |
indvars |
The indices of the variables, ranked by decreasing unity VIM value, for which CRTRs should be obtained. |
num.threads |
Number of threads. Default is number of CPUs available. |
plotit |
Whether the CRTRs should be plotted or merely returned (invisibly). Default is TRUE. |
highlight_relevant |
Whether nodes that do not contain the top-scoring splits for the variables of interest and are not ancestors of such nodes should be shaded out. Default is TRUE. |
box_plots |
Whether boxplots should be used to show the outcome class-specific distributions of the variable values in the nodes with top-scoring splits (see 'Details' section for explanation). For classification only. Default is TRUE. |
density_plots |
Whether kernel density plots should be used to show the outcome class-specific distributions of the variable values in the nodes with top-scoring splits (see 'Details' section for explanation). For classification only. Default is TRUE. |
add_split_line |
Whether a line at the split point of the corresponding node should be drawn in the boxplots and/or density plots. Default is TRUE. |
verbose |
Verbose output on or off. Default is TRUE. |
Details
Further details on the descriptions below are provided in Hornung & Hapfelmeier (2026).
Covariate-representative tree roots (CRTRs).
Covariate-representative tree roots (CRTRs) (Hornung & Hapfelmeier, 2026) are tree fragments (or 'tree roots' - the first few splits in the trees) extracted from a fitted unity forest (unityfor) that characterize, for given variables, the conditions under which each variable exerts its strongest influence on the prediction.
Technically, for a given variable, the algorithm identifies tree roots in which this variable attains particularly high split scores (top-scoring splits). From these tree roots, a representative root is extracted (Laabs et al., 2024) that best reflects the conditions under which this variable has its strongest effect.
Interpretation and subgroup effects. If a variable has a strong marginal effect, the corresponding CRTR typically contains a split on this variable at the root node (first split in the tree). In contrast, if a variable has little marginal effect but interacts with another variable, the CRTR may first split on that other variable, thereby defining a subgroup in which the variable of interest exhibits a strong conditional effect.
From a substantive perspective, CRTRs enable the exploration of variable effects that are generally not detectable by conventional methods focusing on marginal associations. In particular, CRTRs can reveal variables that have weak marginal effects but act strongly within specific subgroups defined by interactions with other variables.
Relation to unity VIM.
CRTRs are closely related to the unity variable importance measure (unity VIM) (unityfor). The unity VIM quantifies the strength of variable effects under the conditions in which they are strongest. Analogously, CRTRs visualize these conditions by displaying the tree structures that give rise to the respective unity VIM values.
Accordingly, the CRTR algorithm can be used to visualize and interpret the effects identified by the unity VIM. By default, CRTRs are constructed and visualized for the five variables with the largest unity VIM values.
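For instance, in a sketch assuming a forest fitted with importance = "unity" (as in the Examples below), the unity VIM ranking can be inspected first and CRTRs then plotted for the top-ranked variables:
sort(model$variable.importance, decreasing = TRUE)  # unity VIM ranking
reprTrees(model, numvars = 5)                       # CRTRs for the five top-ranked variables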
Scope of applicability. CRTRs should primarily be examined for variables with sufficiently large unity VIM values. Constructing CRTRs for variables with negligible importance may lead to overinterpretation, as apparent patterns may reflect random structure rather than meaningful effects.
Shaded regions in the visualization. For improved interpretability, parts of the CRTRs are shaded out by default. Specifically, only the nodes containing the top-scoring splits for the variable of interest and their ancestor nodes are shown prominently.
This design is motivated by two considerations. First, the purpose of CRTRs is to depict the conditions under which a variable exhibits its strongest effects - conditions that are defined by the ancestors of the nodes with top-scoring splits. Second, the remaining regions of the tree are of limited interpretive value. Since each CRTR is derived from tree roots selected for strong effects of a specific variable, the splitting patterns along the highlighted paths are specific to that variable. In contrast, shaded regions reflect arbitrary aspects of the overall association structure in the data and may include splits on non-informative variables, as each tree root is grown from a (small) random subset of all available variables.
Note that additional splits on the variable of interest may occur within shaded regions and can still be relevant. However, these splits do not represent the conditions under which the variable attains its strongest effects.
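If the complete tree roots should be inspected without shading, the highlighting can be switched off (a sketch assuming a fitted unityfor object 'model'):
reprTrees(model, highlight_relevant = FALSE, box_plots = FALSE, density_plots = FALSE)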
In-bag data for top-scoring split visualizations. The boxplots and density plots illustrating the discriminatory power of the top-scoring splits are computed exclusively based on the in-bag observations of the corresponding trees. This is consistent with the construction of the CRTRs themselves, which are derived from in-bag data only.
Value
Object of class unityfor.reprTrees with elements
rules |
List. In-bag statistics on the outcome at each node in the CRTRs. For classification, this provides the class frequencies and the numbers of observations representing each class. |
plots |
List. Generated ggplot2 plots. |
var.names |
Labels of the variables for which CRTRs were selected. |
var.names.all |
Names of all independent variables in the dataset. |
num.independent.variables |
Number of independent variables in the dataset. |
num.samples |
Number of observations in the dataset. |
treetype |
Tree type. |
forest |
Sub-forest that contains only the CRTRs. |
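A short sketch of inspecting the returned object; the element names follow the table above, and the $tree_plot / $density_plot components of the plot entries are assumed to be structured as in the Examples below:
crtrs <- reprTrees(model, plotit = FALSE)   # assuming a fitted 'unityfor' object 'model'
crtrs$var.names                             # variables for which CRTRs were selected
crtrs$rules[[1]]                            # in-bag outcome statistics for the first CRTR
crtrs$plots[[1]]$tree_plot                  # ggplot2 object showing the first CRTR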
Author(s)
Roman Hornung
References
Hornung, R., Hapfelmeier, A. (2026). Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.
Laabs, B.-H., Westenberger, A., & König, I. R. (2024). Identification of representative trees in random forests based on a new tree-based distance measure. Advances in Data Analysis and Classification 18(2):363-380, <doi:10.1007/s11634-023-00537-7>.
See Also
unityfor
Examples
## Load package:
library("unityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Load wine dataset:
data(wine)
## Construct unity forest and calculate unity VIM values:
model <- unityfor(dependent.variable.name = "C", data = wine,
importance = "unity", num.trees = 2000)
# NOTE: num.trees = 2000 (in the above) would be too small for practical
# purposes. This quite small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000.
## Visualize the CRTRs for the five variables with the largest unity VIM
## values:
reprTrees(model, box_plots = FALSE, density_plots = FALSE)
## Visualize the CRTRs for the variables with the largest and third-largest
## unity VIM values:
reprTrees(model, indvars = c(2, 3), box_plots = FALSE, density_plots = FALSE)
## Visualize the CRTRs for the variables with the largest and third-largest
## unity VIM values, where density plots are shown to visualize the
## outcome class-specific distributions of the variable values in the
## nodes with top-scoring splits:
reprTrees(model, indvars = c(2, 3), box_plots = FALSE, density_plots = TRUE)
## Visualize the CRTRs for the variables with the largest and third-largest
## unity VIM values, where both density plots and boxplots are shown to
# visualize the outcome class-specific distributions of the variable values
## in the top-scoring splits; the split points are not indicated in these
## plots:
ps <- reprTrees(model, indvars = c(2, 3), add_split_line = FALSE)
## Save one of the CRTRs with the corresponding density plot:
library("patchwork")
library("ggplot2")
p <- ps$plots[[1]]$tree_plot / ps$plots[[1]]$density_plot +
patchwork::plot_layout(heights = c(2, 1))
p
# outfile <- file.path(tempdir(), "figure_xy.pdf")
# ggsave(outfile, device = cairo_pdf, plot = p, width = 18,
# height = 14)
# Note: The plots can be manipulated with the usual ggplot2 syntax, e.g.:
ps$plots[[1]]$density_plot + xlab("Proline") + labs(title = NULL, y = NULL) +
theme(
legend.position = c(0.95, 0.95),
legend.justification = c(1, 1)
)
Construct a unity forest prediction rule and compute the unity VIM.
Description
Constructs a unity forest and computes the unity variable importance measure (VIM), as described in Hornung & Hapfelmeier (2026). Currently, only categorical outcomes are supported.
The unity forest algorithm is a tree construction approach for random forests in which the first few splits are optimized jointly in order to more effectively capture interaction effects beyond marginal effects. The unity VIM quantifies the influence of each variable under the conditions in which that influence is strongest, thereby placing a stronger emphasis on interaction effects than conventional variable importance measures.
To explore the nature of the effects identified by the unity VIM, it is essential to examine covariate-representative tree roots (CRTRs), which are implemented in reprTrees.
Usage
unityfor(
formula = NULL,
dependent.variable.name = NULL,
data = NULL,
num.trees = 20000,
num.cand.trees = 500,
probability = TRUE,
importance = "none",
prop.best.splits = NULL,
min.node.size.root = NULL,
min.node.size = NULL,
max.depth.root = NULL,
max.depth = NULL,
prop.var.root = NULL,
mtry.sprout = NULL,
replace = FALSE,
sample.fraction = ifelse(replace, 1, 0.7),
case.weights = NULL,
class.weights = NULL,
inbag = NULL,
oob.error = TRUE,
num.threads = NULL,
write.forest = TRUE,
verbose = TRUE
)
Arguments
formula |
Object of class formula describing the model to fit. |
dependent.variable.name |
Name of outcome variable, needed if no formula given. |
data |
Training data of class data.frame. |
num.trees |
Number of trees. Default is 20000. |
num.cand.trees |
Number of random candidate trees to generate for each tree root. Default is 500. |
probability |
Grow a probability forest as in Malley et al. (2012). (NOTE: Currently only probability forests are implemented, will be changed in the next version) |
importance |
Variable importance mode, either 'unity' (unity VIM) or 'none'. |
prop.best.splits |
Related to the unity VIM. The default value should generally not be modified by the user. When calculating the unity VIM, only the proportion prop.best.splits of top-scoring splits is taken into account. |
min.node.size.root |
Minimal node size in the tree roots. Default is 10 irrespective of the outcome type. |
min.node.size |
Minimal node size. Default 1 for classification and 5 for probability. |
max.depth.root |
Maximal depth of the tree roots. Default value is 3 and should generally not be modified by the user. Larger values can be associated with worse predictive performance for some datasets. |
max.depth |
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree). Must be at least as large as max.depth.root. |
prop.var.root |
Proportion of variables randomly sampled for constructing each tree root. Default is the square root of the number of variables divided by the number of variables. Consequently, by default, for each tree root a random subset of variables is considered, with size equal to the (rounded up) square root of the total number of variables. An exception is made for datasets with more than 100 variables, where the default for prop.var.root is 0.1 (see 'Details' below). |
mtry.sprout |
Number of randomly sampled variables to possibly split at in each node of the tree sprouts (i.e., the branches of the trees beyond the tree roots). Default is the (rounded down) square root of the number of variables. |
replace |
Sample with replacement. Default is FALSE. |
sample.fraction |
Fraction of observations to sample for each tree. Default is 1 for sampling with replacement and 0.7 for sampling without replacement. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
class.weights |
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes. |
inbag |
Manually set observations per tree. List of size num.trees, containing in-bag counts for each observation. |
oob.error |
Compute OOB prediction error. Set to FALSE to save computation time. |
num.threads |
Number of threads. Default is number of CPUs available. |
write.forest |
Save the forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction is intended. |
verbose |
Show computation status and estimated runtime. |
Details
There are two reasons why, for datasets with more than 100 variables, the default value of prop.var.root is set to 0.1 rather than to the square root of the number of variables divided by the total number of variables.
First, as the total number of variables increases, the square-root-based proportion decreases. This makes it less likely that the same pairs of variables are selected together in multiple trees. This can be problematic for the unity VIM, particularly for variables that do not have marginal effects on their own but act only through interactions with one or a few other variables. Such variables are informative in tree roots only when they are used jointly with the covariates they interact with. Setting prop.var.root = 0.1 ensures that interacting covariates are selected together sufficiently often in tree roots.
Second, this choice reflects the fact that in high-dimensional datasets, typically only a small proportion of variables are informative. Applying the square-root rule in such settings may result in too few informative variables being selected, thereby reducing the likelihood of constructing predictive tree roots.
However, note that results obtained from applications of the unity forest framework to high-dimensional datasets should be interpreted with caution. For high-dimensional data, the curse of dimensionality makes the identification of individual interaction effects challenging and increases the risk of false positives. Moreover, the split points identified in the CRTRs (reprTrees) may become less precise as the number of covariates considered per tree root increases.
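The following sketch reproduces the default rule for prop.var.root described above (p denotes the number of covariates); the actual implementation may differ in details such as rounding:
p <- 13                                     # e.g., the 'wine' data has 13 covariates
prop_default <- if (p > 100) 0.1 else sqrt(p) / p
prop_default                                # default value of prop.var.root
ceiling(prop_default * p)                   # variables sampled per tree root (rounded up)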
Value
Object of class unityfor with elements
predictions |
Predicted classes/values, based on out-of-bag samples. |
forest |
Saved forest (if write.forest is set to TRUE). Note that the variable IDs stored in the forest object do not necessarily correspond to the column numbers in R. |
data |
Training data. |
variable.importance |
Variable importance for each independent variable. Only available if importance = "unity". |
importance.mode |
Importance mode used. |
prediction.error |
Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, and for regression the mean squared error. |
confusion.matrix |
Contingency table for classes and predictions based on out-of-bag samples (classification only). |
call |
Function call. |
num.trees |
Number of trees. |
num.cand.trees |
Number of candidate trees generated for each tree root. |
num.independent.variables |
Number of independent variables. |
num.samples |
Number of samples. |
prop.var.root |
Proportion of variables randomly sampled for each tree root. |
mtry |
Value of mtry used (in the tree sprouts). |
max.depth.root |
Maximal depth of the tree roots. |
min.node.size.root |
Minimal node size in the tree roots. |
min.node.size |
Value of minimal node size used. |
splitrule |
Splitting rule (used only in the tree sprouts). |
replace |
Sample with replacement. |
treetype |
Type of forest/tree. Classification or regression. |
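A short sketch of accessing some of these elements, assuming 'model' was fitted with importance = "unity" as in the Examples below:
model$prediction.error                      # OOB error (Brier score for a probability forest)
model$confusion.matrix                      # OOB confusion matrix (classification only; may be NULL)
head(sort(model$variable.importance, decreasing = TRUE))   # largest unity VIM values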
Author(s)
Roman Hornung, Marvin N. Wright
References
Hornung, R., Hapfelmeier, A. (2026). Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests. arXiv:2601.07003, <doi:10.48550/arXiv.2601.07003>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
See Also
predict.unityfor, reprTrees
Examples
## Load package:
library("unityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Load wine dataset:
data(wine)
## Construct unity forest and calculate unity VIM values:
model <- unityfor(dependent.variable.name = "C", data = wine,
importance = "unity", num.trees = 20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000.
## Inspect the rankings of the variables and variable pairs with respect to
## the unity VIM:
sort(model$variable.importance, decreasing = TRUE)
## Prediction:
# Separate 'wine' dataset randomly in training
# and test data:
train.idx <- sample(nrow(wine), 2/3 * nrow(wine))
wine_train <- wine[train.idx, ]
wine_test <- wine[-train.idx, ]
# Construct unity forest on training data:
# NOTE again: num.trees = 20 is specified too small for practical purposes.
model_train <- unityfor(dependent.variable.name = "C", data = wine_train,
importance = "none", num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate unity VIM values (by setting importance = "none"), because
# this speeds up calculations.
# Predict class values of the test data:
pred_wine <- predict(model_train, data = wine_test)
# Compare predicted and true class values of the test data:
table(wine_test$C, levels(wine_train$C)[apply(pred_wine$predictions, 1, which.max)])
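# A small additional sketch (reusing the objects created above): compute the
# test-set misclassification rate from the predicted class probabilities.
pred_class <- levels(wine_train$C)[apply(pred_wine$predictions, 1, which.max)]
mean(pred_class != as.character(wine_test$C))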
Wine Chemical Analysis Data (Binary Cultivar)
Description
The well-known wine dataset comprises the results of chemical analyses of 178 wines produced in the same region of Italy from three grape varieties (Barolo, Grignolino, and Barbera). The dataset was originally introduced by Forina et al. (1984) and later described in detail by Forina et al. (1986).
Format
A data frame with 178 observations, 13 numeric covariates and one binary target variable.
Details
For each sample, 13 continuous chemical constituents were measured, which serve as covariates for distinguishing between the grape varieties. For the analyses in this package, a version of the dataset with a binary outcome is provided that differentiates between Grignolino ("G") and the two other varieties ("Other"; Barolo and Barbera). This version is available on OpenML under data ID 973.
The variables are as follows:
- Alc: numeric. Alcohol.
- Mal: numeric. Malic acid.
- Ash: numeric. Ash.
- AlcAsh: numeric. Alkalinity of ash.
- Mg: numeric. Magnesium.
- TP: numeric. Total phenols.
- Fla: numeric. Flavonoids.
- NFP: numeric. Nonflavonoid phenols.
- ProAn: numeric. Proanthocyanins.
- Col: numeric. Color intensity.
- Hue: numeric. Hue.
- WAI: numeric. OD280/OD315 of diluted wines (wine absorbance index).
- Prol: numeric. Proline.
- C: factor. Cultivar. Binary target variable: "G" vs. "Other".
Source
OpenML: data.id: 973, link: https://www.openml.org/d/973/
References
Forina, M. (1984). PARVUS, TrAC Trends in Analytical Chemistry, 3(2):38–39, <doi:10.1016/0165-9936(84)87050-8>.
Forina, M., Armanino, C., Castino, M., Ubigli, M. (1986). Multivariate data analysis as a discriminating method of the origin of wines, Vitis, 25:189–201, <doi:10.5073/vitis.1986.25.189-201>.
Vanschoren, J., van Rijn, J. N., Bischl, B., Torgo, L. (2013). OpenML: networked science in machine learning. SIGKDD Explorations, 15(2):49–60, <doi:10.1145/2641190.2641198>.
Examples
data(wine)
table(wine$C)
dim(wine)
head(wine)