Umpire 2.0: Clinically Realistic Simulations

Kevin R. Coombes and Caitlin E. Coombes

Introduction

Version 2.0 of the Ultimate Microarray Prediction, Inference, and Reality Engine (Umpire) extends the functions of the Umpire 1.0 R package to allow researchers to simulate realistic, mixed-type, clinical data. Statisticians, computer scientists, and clinical informaticians who develop and improve methods to analyze clinical data from a variety of contexts (including clinical trials, population cohorts, and electronic medical record sources) recognize that it is difficult to evaluate methods on real data where “ground truth” is unknown. Frequently, they turn to simulations where they can control the underlying structure, but the resulting simulations are often too simplistic to reflect the complexity of real clinical data. Clinical measurements on patients may be treated as independent, in spite of the elaborate correlation structures that arise in networks, pathways, organ systems, and syndromes in real biology. Further, the researcher finds limited tools at her disposal to facilitate simulation of binary, categorical, or mixed data at this representative level of biological complexity.

In this vignette, we describe a workflow with the Umpire package to simulate biologically realistic, mixed-type clinical data.

As usual, we start by loading the package:

library(Umpire)

Simulating Mixed-Type Clinical Data

Since we are going to run simulations, for reproducibility purposes, we should set the seed of the random number generator.

set.seed(84503)
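
As a small base R illustration of what setting the seed buys us (this sketch does not use Umpire, and the variable names are hypothetical), resetting the seed to the same value replays the same random draws:

```r
# Illustrative only; run separately from the vignette code, since it resets the RNG.
set.seed(84503)
first <- rnorm(5)

set.seed(84503)
second <- rnorm(5)

# The two vectors are identical because the generator restarted from the same state.
same_draws <- identical(first, second)
same_draws
## [1] TRUE
```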

Model Subtypes and Survival

The simulation workflow begins by simulating complex, correlated, continuous data with known “ground truth” by instantiating a ClinicalEngine. We simulate 20 features and 4 clusters of unequal size. The ClinicalEngine generates subtypes (clusters) with known “ground truth” through an implementation of the Umpire 1.0 CancerModel and CancerEngine.

ce <- ClinicalEngine(20, 4, isWeighted = TRUE)
summary(ce)
## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw) , a CancerModel object constructed via the function call:
##  CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP, nPattern = nClusters, HIT = hitfn, prevalence = Prevalence(isWeighted, nClusters)) 
## 
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
## 
## Survival effects:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.2340 -2.7133 -1.6120 -1.4374 -0.1001  1.4111 
## 
## Outcome effects:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.52504 -0.53980 -0.23605 -0.06496  0.77952  0.82696 
## --------------
## 
## Base expression given by:
## An Engine with 10 components.
## 
## Altered expression given by:
## An Engine with 10 components.

Note that the prevalences are not equal; when you use isWeighted = TRUE, they are chosen from a Dirichlet distribution. Note also that the summary function describes the object as a CancerEngine, since the same underlying structure is used to implement a ClinicalEngine.
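
To make the Dirichlet step concrete, here is a minimal base R sketch of drawing unequal prevalences. It is not Umpire's implementation: the concentration parameters (alpha) are an assumption, since the vignette does not state which values the package uses.

```r
# Illustrative only; run separately from the vignette code, since it resets the RNG.
set.seed(1)
n_clusters <- 4
alpha <- rep(1, n_clusters)  # hypothetical concentration parameters

# A Dirichlet draw is a vector of independent gamma variates divided by their sum.
g <- rgamma(n_clusters, shape = alpha, rate = 1)
prevalence <- g / sum(g)

prevalence       # four unequal proportions
sum(prevalence)  # equals 1 up to floating-point error
```

Larger values of alpha would pull the proportions toward equality, while smaller values would make very unbalanced clusters more likely.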

Now we confirm that the model expects to produce the 20 features that we requested. It will do so using 10 “components”, where each component consists of a pair of correlated features.

nrow(ce)
## [1] 20
nComponents(ce)
## [1] 10
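
For intuition about what a "pair of correlated features" means, the following base R sketch simulates one such pair from a bivariate normal distribution. The correlation value and the use of a Cholesky factor are assumptions made for illustration; Umpire's own engine may construct its components differently.

```r
# Illustrative only; run separately from the vignette code, since it resets the RNG.
set.seed(2)
n_patients <- 300
rho <- 0.8  # hypothetical within-component correlation

sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# chol() returns an upper-triangular R with t(R) %*% R equal to sigma,
# so multiplying independent normals by R gives them covariance sigma.
z <- matrix(rnorm(2 * n_patients), ncol = 2)
pair <- t(z %*% chol(sigma))  # 2 features (rows) by 300 patients (columns)

dim(pair)
cor(pair[1, ], pair[2, ])  # sample correlation, close to rho
```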

Simulate Raw Data

The ClinicalEngine is used to simulate the raw, base dataset.

dset <- rand(ce, 300)

Data are simulated as a list with two objects: simulated data and associated clinical information, including “ground truth” subtype membership and survival data (outcome, length of followup, and occurrence of event of interest within the followup period).

class(dset)
## [1] "list"
names(dset)
## [1] "clinical" "data"
summary(dset$clinical)
##  CancerSubType  Outcome         LFU          Event        
##  Min.   :1.00   Bad :205   Min.   :12.00   Mode :logical  
##  1st Qu.:1.00   Good: 95   1st Qu.:26.00   FALSE:295      
##  Median :3.00              Median :42.50   TRUE :5        
##  Mean   :2.73              Mean   :41.47                  
##  3rd Qu.:4.00              3rd Qu.:55.00                  
##  Max.   :4.00              Max.   :71.00

The raw data are simulated as a matrix of continuous values.

class(dset$data)
## [1] "matrix"
dim(dset$data)
## [1]  20 300

Apply Clinically Realistic Noise

The user may add further additive noise to the raw data. The ClinicalNoiseModel simulates additive noise for each feature f and patient i from a normal distribution \(E_{fi} \sim N(0, \tau)\), where the standard deviation \(\tau\) of each feature is itself drawn from a gamma distribution, \(\tau \sim Gamma(shape, scale)\), whose shape and scale are hyperparameters. Thus, the ClinicalNoiseModel generates many features with low noise (such as a tightly calibrated laboratory test) and some features with high noise (such as a blood pressure measured by hand and manually entered into the medical record). The user may apply default parameters or individual parameters. Next, the ClinicalNoiseModel is applied to blur the previously simulated data. The default model below generates a low overall level of additive noise.

cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng), shape = 1.02, scale = 0.05)
summary(cnm)
## A 'NoiseModel' with:
##   additive offset = 0
##   additive scale distributed as:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003632 0.027081 0.034741 0.061475 0.076802 0.185948 
##   multiplicative scale = 0
noisy <- blur(cnm, dset$data)
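
The noise model described above can be sketched in base R as follows. This is an illustration of the stated distributions only, not the code behind ClinicalNoiseModel or blur; the stand-in raw matrix and all variable names are hypothetical.

```r
# Illustrative only; run separately from the vignette code, since it resets the RNG.
set.seed(3)
n_features <- 20
n_patients <- 300

# One standard deviation per feature: tau ~ Gamma(shape, scale).
tau <- rgamma(n_features, shape = 1.02, scale = 0.05)

# rnorm() recycles sd down each column of the column-major matrix,
# so every entry in row f is drawn with standard deviation tau[f].
noise <- matrix(rnorm(n_features * n_patients, mean = 0, sd = tau),
                nrow = n_features)

# Stand-in for raw data: features in rows, patients in columns.
raw <- matrix(rnorm(n_features * n_patients, mean = 6), nrow = n_features)
noisy_sketch <- raw + noise

summary(tau)  # most features get small tau; a few get larger values
```

With shape = 1.02 and scale = 0.05, the mean of \(\tau\) is shape times scale, about 0.051, which is in line with the low overall noise level reported by summary(cnm).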

Simulate Mixed-Type Data

Umpire 2.0 allows the simulation of binary, nominal, and ordinal data from raw, continuous data in variable, user-defined mixtures. The user defines prevalences, summing to 1, of binary, continuous, and categorical data in the desired final mixture. For categorical features, the user may tune the percent of categorical data desired to be nominal and the range of the number of categories to be simulated.

The data simulated above by the ClinicalEngine and ClinicalNoiseModel take rows (not columns) as features, following the omics convention. Thus, by default, when generating data, rows are treated as features and columns as patients. The setDataTypes method transposes its results to a data frame where the columns are features and the rows are patients. This transposition both fits better with the conventions used for clinical data and supports the ability to store different kinds of (mixed-type) data in different columns.

dt <- setDataTypes(dset$data,
                   pCont = 1/3, pBin = 1/3, pCat = 1/3,
                   pNominal = 0.5, range = 3:9,
                   inputRowsAreFeatures = TRUE)
## 1
names(dt)
## [1] "binned"    "cutpoints"

The setDataTypes function generates a list containing two objects: a data.frame of mixed-type data…

class(dt$binned)
## [1] "data.frame"
dim(dt$binned)
## [1] 300  20
summary(dt$binned)
##        V1             V2            V3               V4         V5    
##  Min.   :0.00   Min.   :0.0   Min.   :0.5845   Min.   :0.0000   R:70  
##  1st Qu.:0.00   1st Qu.:0.0   1st Qu.:2.1179   1st Qu.:0.0000   S:56  
##  Median :1.00   Median :1.0   Median :2.8589   Median :0.0000   T:50  
##  Mean   :0.69   Mean   :0.7   Mean   :2.8963   Mean   :0.4033   U:54  
##  3rd Qu.:1.00   3rd Qu.:1.0   3rd Qu.:3.7239   3rd Qu.:1.0000   V:70  
##  Max.   :1.00   Max.   :1.0   Max.   :5.0888   Max.   :1.0000         
##                                                                       
##  V6           V7        V8           V9            V10          V11       
##  R:42   Min.   :4.974   R:83   Min.   :0.00   V      :44   Min.   :3.370  
##  S:43   1st Qu.:6.378   S:53   1st Qu.:0.00   T      :41   1st Qu.:5.035  
##  T:34   Median :7.032   T:79   Median :0.00   W      :35   Median :5.565  
##  U:44   Mean   :6.983   U:85   Mean   :0.34   X      :35   Mean   :5.564  
##  V:51   3rd Qu.:7.547          3rd Qu.:1.00   S      :34   3rd Qu.:6.177  
##  W:46   Max.   :9.120          Max.   :1.00   U      :32   Max.   :7.137  
##  X:40                                         (Other):79                  
##       V12            V13            V14            V15       
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :5.191  
##  1st Qu.:0.00   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:6.600  
##  Median :0.00   Median :1.00   Median :1.00   Median :6.950  
##  Mean   :0.17   Mean   :0.79   Mean   :0.92   Mean   :6.960  
##  3rd Qu.:0.00   3rd Qu.:1.00   3rd Qu.:1.00   3rd Qu.:7.275  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :8.468  
##                                                              
##       V16             V17              V18            V19     V20    
##  Min.   :6.086   Min.   : 7.631   Min.   :0.00   E      :55   R: 61  
##  1st Qu.:7.622   1st Qu.: 8.767   1st Qu.:0.00   C      :43   S:105  
##  Median :8.013   Median : 9.146   Median :0.00   H      :43   T: 83  
##  Mean   :8.042   Mean   : 9.123   Mean   :0.25   B      :36   U: 51  
##  3rd Qu.:8.486   3rd Qu.: 9.491   3rd Qu.:0.25   G      :33          
##  Max.   :9.660   Max.   :10.488   Max.   :1.00   F      :32          
##                                                  (Other):58
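
The reason a data frame with features as columns is needed for mixed-type data can be seen in a short base R sketch. The feature names and values here are hypothetical and unrelated to the simulated data above.

```r
# A matrix holds a single type, so combining numbers and labels
# coerces every entry to character.
m <- rbind(lab_value = c(5.1, 6.3, 7.2),
           category  = c("R", "S", "R"))
class(m[1, ])
## [1] "character"

# A data frame keeps one type per column: patients in rows, features in columns.
df <- data.frame(lab_value   = c(5.1, 6.3, 7.2),
                 is_positive = c(0L, 1L, 1L),
                 category    = factor(c("R", "S", "R")))
sapply(df, class)
##   lab_value is_positive    category 
##   "numeric"   "integer"    "factor"
```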

The cutpoints contain a record, for each feature, of data type, break points, and labels. Here are two examples of the kind of information stored for a cutpoint.

dt$cutpoints[[1]]
## $breaks
## [1]     -Inf 6.541307      Inf
## 
## $labels
## [1] 1 0
## 
## $Type
## [1] "symmetric binary"
dt$cutpoints[[5]]
## $breaks
##        0% 18.63384% 41.91345% 59.88372% 76.79872%      100% 
##      -Inf  5.489846  5.916301  6.142115  6.512071       Inf 
## 
## $labels
## [1] "S" "R" "U" "T" "V"
## 
## $Type
## [1] "nominal"
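
To see how breaks and labels of this kind turn continuous values into categories, here is a minimal base R sketch using cut. Whether Umpire calls cut internally is an assumption; the input values in x are hypothetical, while the breaks and labels are copied from the two cutpoints printed above.

```r
# Cut points copied from the printed dt$cutpoints[[1]] and dt$cutpoints[[5]].
binary_cp  <- list(breaks = c(-Inf, 6.541307, Inf),
                   labels = c(1, 0))
nominal_cp <- list(breaks = c(-Inf, 5.489846, 5.916301, 6.142115, 6.512071, Inf),
                   labels = c("S", "R", "U", "T", "V"))

x <- c(5.0, 5.7, 6.0, 6.3, 7.0)  # hypothetical continuous values

# Each value receives the label of the interval (lower, upper] it falls into.
binary_x  <- cut(x, breaks = binary_cp$breaks,  labels = binary_cp$labels)
nominal_x <- cut(x, breaks = nominal_cp$breaks, labels = nominal_cp$labels)

as.character(binary_x)
## [1] "1" "1" "1" "1" "0"
as.character(nominal_x)
## [1] "S" "R" "U" "T" "V"
```

Note that the nominal labels are deliberately not in alphabetical order along the axis, so the label itself carries no information about the magnitude of the underlying value.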

And here is an overview of the number of features of each type.

cp <- dt$cutpoints
type <- sapply(cp, function(X) { X$Type })
table(type)
## type
##       continuous          nominal          ordinal symmetric binary 
##                6                5                1                8

The cutpoints should be saved for downstream use in the MixedTypeEngine.

The MixedTypeEngine

The many parameters defining a simulated data mixture can be stored as a single MixedTypeEngine for downstream use to easily generate future datasets with the same simulation parameters.

The MixedTypeEngine stores the following components for re-implementation:

  1. The ClinicalEngine, including parameters for generating the subtype pattern and survival model.
  2. The ClinicalNoiseModel.
  3. The cutpoints generated by setDataTypes.

mte <- MixedTypeEngine(ce,
                       noise = cnm,
                       cutpoints = dt$cutpoints)
summary(mte)
## A 'MixedTypeEngine' (MTE) based on:
## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw) , a CancerModel object constructed via the function call:
##  CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP, nPattern = nClusters, HIT = hitfn, prevalence = Prevalence(isWeighted, nClusters)) 
## 
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
## 
## Survival effects:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.2340 -2.7133 -1.6120 -1.4374 -0.1001  1.4111 
## 
## Outcome effects:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.52504 -0.53980 -0.23605 -0.06496  0.77952  0.82696 
## --------------
## 
## Base expression given by:
## An Engine with 10 components.
## 
## Altered expression given by:
## An Engine with 10 components.
## 
## ---------------
## The MTE uses the following noise model:
## A 'NoiseModel' with:
##   additive offset = 0
##   additive scale distributed as:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003632 0.027081 0.034741 0.061475 0.076802 0.185948 
##   multiplicative scale = 0
## ---------------
## The MTE simulates clinical data of these types:
## 
##       continuous          nominal          ordinal symmetric binary 
##                6                5                1                8

With rand, the user can easily generate new data sets with the same simulation parameters.

dset2 <- rand(mte, 20)
class(dset2)
## [1] "data.frame"
summary(dset2)
##  V1     V2           V3        V4     V5    V6          V7        V8   
##  0: 9   0:11   Min.   :1.404   0:12   R:7   R:2   Min.   :5.701   R:5  
##  1:11   1: 9   1st Qu.:2.104   1: 8   S:1   S:4   1st Qu.:6.628   S:1  
##                Median :2.866          T:4   T:0   Median :7.090   T:6  
##                Mean   :2.941          U:2   U:0   Mean   :7.142   U:8  
##                3rd Qu.:3.575          V:6   V:3   3rd Qu.:7.633        
##                Max.   :4.903                W:4   Max.   :9.281        
##                                             X:7                        
##  V9          V10         V11        V12    V13    V14         V15       
##  0:11   Z      :4   Min.   :4.143   0:19   0: 6   0: 0   Min.   :6.499  
##  1: 9   S      :3   1st Qu.:5.075   1: 1   1:14   1:20   1st Qu.:6.767  
##         T      :3   Median :5.419                        Median :7.228  
##         U      :3   Mean   :5.464                        Mean   :7.138  
##         W      :2   3rd Qu.:5.959                        3rd Qu.:7.362  
##         Y      :2   Max.   :6.590                        Max.   :8.024  
##         (Other):3                                                       
##       V16             V17        V18         V19    V20  
##  Min.   :7.014   Min.   :7.984   0:11   E      :5   R:9  
##  1st Qu.:7.607   1st Qu.:8.763   1: 9   C      :4   S:5  
##  Median :8.067   Median :9.077          D      :4   T:4  
##  Mean   :8.113   Mean   :9.048          H      :4   U:2  
##  3rd Qu.:8.636   3rd Qu.:9.316          B      :3        
##  Max.   :9.613   Max.   :9.957          A      :0        
##                                         (Other):0

By using the keepall argument of the function, you can keep the intermediate datasets produced by the rand method.

dset3 <- rand(mte, 25, keepall = TRUE)
class(dset3)
## [1] "list"
names(dset3)
## [1] "raw"      "clinical" "noisy"    "binned"

The raw and noisy elements have the rows as (future clinical) features and the columns as patients/samples.

dim(dset3$raw)
## [1] 20 25
summary(t(dset3$raw))
##        V1              V2              V3              V4       
##  Min.   :4.674   Min.   :4.895   Min.   :1.559   Min.   :3.494  
##  1st Qu.:5.829   1st Qu.:6.646   1st Qu.:2.350   1st Qu.:5.219  
##  Median :6.154   Median :7.096   Median :2.748   Median :5.881  
##  Mean   :6.070   Mean   :7.038   Mean   :2.957   Mean   :5.913  
##  3rd Qu.:6.544   3rd Qu.:7.451   3rd Qu.:3.458   3rd Qu.:6.782  
##  Max.   :7.295   Max.   :8.603   Max.   :4.961   Max.   :8.162  
##        V5              V6              V7              V8       
##  Min.   :5.148   Min.   :3.537   Min.   :4.614   Min.   :7.267  
##  1st Qu.:5.954   1st Qu.:4.939   1st Qu.:6.370   1st Qu.:8.221  
##  Median :6.455   Median :5.085   Median :7.124   Median :8.502  
##  Mean   :6.377   Mean   :5.176   Mean   :7.019   Mean   :8.499  
##  3rd Qu.:6.864   3rd Qu.:5.483   3rd Qu.:7.828   3rd Qu.:8.787  
##  Max.   :7.291   Max.   :6.089   Max.   :8.734   Max.   :9.441  
##        V9             V10             V11             V12       
##  Min.   :5.639   Min.   :5.303   Min.   :3.861   Min.   :4.381  
##  1st Qu.:6.965   1st Qu.:6.888   1st Qu.:4.863   1st Qu.:5.071  
##  Median :7.345   Median :7.268   Median :5.436   Median :5.444  
##  Mean   :7.185   Mean   :7.227   Mean   :5.457   Mean   :5.478  
##  3rd Qu.:7.621   3rd Qu.:7.668   3rd Qu.:6.082   3rd Qu.:5.893  
##  Max.   :8.020   Max.   :8.741   Max.   :6.871   Max.   :6.519  
##       V13             V14             V15             V16       
##  Min.   :4.888   Min.   :6.667   Min.   :6.107   Min.   :6.782  
##  1st Qu.:6.590   1st Qu.:7.637   1st Qu.:6.546   1st Qu.:7.711  
##  Median :7.143   Median :8.142   Median :7.030   Median :8.030  
##  Mean   :6.946   Mean   :8.130   Mean   :6.939   Mean   :8.100  
##  3rd Qu.:7.308   3rd Qu.:8.729   3rd Qu.:7.267   3rd Qu.:8.634  
##  Max.   :8.675   Max.   :9.791   Max.   :7.719   Max.   :9.095  
##       V17              V18             V19              V20       
##  Min.   : 7.804   Min.   :2.900   Min.   : 5.922   Min.   :3.486  
##  1st Qu.: 9.050   1st Qu.:3.976   1st Qu.: 6.531   1st Qu.:4.241  
##  Median : 9.272   Median :4.359   Median : 7.223   Median :4.659  
##  Mean   : 9.215   Mean   :4.362   Mean   : 7.207   Mean   :4.687  
##  3rd Qu.: 9.511   3rd Qu.:4.584   3rd Qu.: 7.438   3rd Qu.:5.053  
##  Max.   :10.069   Max.   :6.170   Max.   :10.253   Max.   :7.106
dim(t(dset3$noisy))
## [1] 25 20
summary(dset3$noisy)
##        V1               V2              V3              V4       
##  Min.   : 1.565   Min.   :1.556   Min.   :4.251   Min.   :2.712  
##  1st Qu.: 5.880   1st Qu.:5.632   1st Qu.:5.336   1st Qu.:5.596  
##  Median : 6.853   Median :6.301   Median :6.532   Median :6.823  
##  Mean   : 6.608   Mean   :6.569   Mean   :6.484   Mean   :6.700  
##  3rd Qu.: 7.602   3rd Qu.:8.128   3rd Qu.:7.532   3rd Qu.:8.155  
##  Max.   :10.320   Max.   :9.768   Max.   :9.808   Max.   :9.372  
##        V5              V6              V7              V8       
##  Min.   :1.653   Min.   :2.748   Min.   :3.304   Min.   :3.166  
##  1st Qu.:5.631   1st Qu.:5.199   1st Qu.:6.097   1st Qu.:5.839  
##  Median :6.557   Median :6.801   Median :6.550   Median :6.836  
##  Mean   :6.323   Mean   :6.523   Mean   :6.747   Mean   :6.432  
##  3rd Qu.:7.651   3rd Qu.:7.581   3rd Qu.:7.831   3rd Qu.:7.131  
##  Max.   :9.380   Max.   :9.934   Max.   :9.327   Max.   :8.953  
##        V9             V10             V11             V12       
##  Min.   :4.348   Min.   :3.870   Min.   :2.450   Min.   :3.461  
##  1st Qu.:5.116   1st Qu.:4.967   1st Qu.:5.094   1st Qu.:5.695  
##  Median :6.716   Median :6.680   Median :5.964   Median :6.766  
##  Mean   :6.553   Mean   :6.365   Mean   :6.194   Mean   :6.621  
##  3rd Qu.:7.371   3rd Qu.:7.138   3rd Qu.:7.379   3rd Qu.:7.694  
##  Max.   :8.802   Max.   :9.028   Max.   :8.566   Max.   :9.122  
##       V13             V14              V15             V16       
##  Min.   :2.351   Min.   : 3.955   Min.   :3.374   Min.   :3.520  
##  1st Qu.:5.018   1st Qu.: 5.006   1st Qu.:5.426   1st Qu.:5.245  
##  Median :6.645   Median : 6.797   Median :6.777   Median :6.506  
##  Mean   :6.380   Mean   : 6.460   Mean   :6.684   Mean   :6.515  
##  3rd Qu.:7.822   3rd Qu.: 7.402   3rd Qu.:7.570   3rd Qu.:7.415  
##  Max.   :9.529   Max.   :10.038   Max.   :9.847   Max.   :9.302  
##       V17             V18             V19             V20       
##  Min.   :2.084   Min.   :2.301   Min.   :2.732   Min.   :3.012  
##  1st Qu.:5.526   1st Qu.:4.656   1st Qu.:5.401   1st Qu.:5.263  
##  Median :6.483   Median :6.024   Median :6.597   Median :7.150  
##  Mean   :6.327   Mean   :5.983   Mean   :6.293   Mean   :6.757  
##  3rd Qu.:7.538   3rd Qu.:6.975   3rd Qu.:7.241   3rd Qu.:7.976  
##  Max.   :9.300   Max.   :9.372   Max.   :9.685   Max.   :9.451  
##       V21             V22             V23             V24       
##  Min.   :2.592   Min.   :2.238   Min.   :3.665   Min.   :2.457  
##  1st Qu.:5.681   1st Qu.:4.926   1st Qu.:5.956   1st Qu.:5.709  
##  Median :6.860   Median :6.431   Median :6.887   Median :6.935  
##  Mean   :6.532   Mean   :6.345   Mean   :6.877   Mean   :6.548  
##  3rd Qu.:7.693   3rd Qu.:7.612   3rd Qu.:7.638   3rd Qu.:7.754  
##  Max.   :9.259   Max.   :9.072   Max.   :9.852   Max.   :9.285  
##       V25       
##  Min.   :2.888  
##  1st Qu.:5.239  
##  Median :7.010  
##  Mean   :6.619  
##  3rd Qu.:7.638  
##  Max.   :9.074

Noisy data arises by adding simulated noise to the raw data.

plot(dset3$raw[5,], dset3$noisy[5,], xlab = "Raw", ylab = "Noisy", pch=16)

Raw and noisy data.

The binned element has columns as features and rows as samples. Binned data arises by applying cut points to noisy data.

dim(dset3$binned)
## [1] 25 20
summary(dset3$binned)
##  V1     V2           V3        V4     V5     V6          V7        V8   
##  0: 6   0: 7   Min.   :1.556   0:14   R: 4   R:1   Min.   :4.603   R:7  
##  1:19   1:18   1st Qu.:2.351   1:11   S: 2   S:6   1st Qu.:6.403   S:4  
##                Median :2.748          T: 6   T:6   Median :7.183   T:6  
##                Mean   :2.956          U: 2   U:2   Mean   :7.035   U:8  
##                3rd Qu.:3.461          V:11   V:7   3rd Qu.:7.771        
##                Max.   :4.963                 W:3   Max.   :8.721        
##                                              X:0                        
##  V9          V10         V11        V12    V13    V14         V15       
##  0:18   V      :5   Min.   :3.919   0:21   0: 5   0: 2   Min.   :6.071  
##  1: 7   W      :5   1st Qu.:4.934   1: 4   1:20   1:23   1st Qu.:6.493  
##         S      :3   Median :5.376                        Median :7.043  
##         T      :3   Mean   :5.449                        Mean   :6.951  
##         X      :3   3rd Qu.:5.933                        3rd Qu.:7.446  
##         Y      :2   Max.   :6.834                        Max.   :7.685  
##         (Other):4                                                       
##       V16             V17         V18         V19    V20  
##  Min.   :6.681   Min.   : 7.820   0:20   B      :6   R:8  
##  1st Qu.:7.571   1st Qu.: 9.072   1: 5   E      :6   S:7  
##  Median :8.085   Median : 9.259          F      :3   T:8  
##  Mean   :8.065   Mean   : 9.219          G      :3   U:2  
##  3rd Qu.:8.410   3rd Qu.: 9.529          H      :3        
##  Max.   :9.201   Max.   :10.038          C      :2        
##                                          (Other):2
plot(dset3$binned[,5], dset3$noisy[5,], xlab = "Binned", ylab = "Noisy")

Noisy and binned data.