Version 2.0 of the Ultimate Microarray Prediction, Inference, and Reality Engine (Umpire) extends the functions of the Umpire 1.0 R package to allow researchers to simulate realistic, mixed-type clinical data. Statisticians, computer scientists, and clinical informaticians who develop and improve methods to analyze clinical data from a variety of contexts (including clinical trials, population cohorts, and electronic medical record sources) recognize that it is difficult to evaluate methods on real data where “ground truth” is unknown. Frequently, they turn to simulations where they can control the underlying structure, but these simulations are often too simplistic to reflect the complexity of real clinical data. For example, clinical measurements on patients may be treated as independent, in spite of the elaborate correlation structures that arise in networks, pathways, organ systems, and syndromes in real biology. Further, researchers have few tools at their disposal to simulate binary, categorical, or mixed data at this level of biological complexity.
In this vignette, we describe a workflow with the Umpire package to simulate biologically realistic, mixed-type clinical data.
As usual, we start by loading the package:
library(Umpire)
Because we are going to run simulations, we set the seed of the random number generator so that the results are reproducible.
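As a minimal base-R illustration of why the seed matters, the sketch below resets the seed and reproduces the same random draws. The seed value 2024 is arbitrary and is not the one used to produce the output in this vignette.

```r
# Arbitrary seed chosen for illustration only.
set.seed(2024)
first <- rnorm(5)

# Resetting the same seed replays the same stream of random numbers.
set.seed(2024)
second <- rnorm(5)

# TRUE when both runs produced exactly the same values.
identical(first, second)
```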
The simulation workflow begins by simulating complex, correlated, continuous data with known “ground truth” by instantiating a ClinicalEngine. We simulate 20 features and 4 clusters of unequal size. The ClinicalEngine generates subtypes (clusters) with known “ground truth” through an implementation of the Umpire 1.0 CancerModel and CancerEngine.
## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw) , a CancerModel object constructed via the function call:
## CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP, nPattern = nClusters, HIT = hitfn, prevalence = Prevalence(isWeighted, nClusters))
##
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
##
## Survival effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.2340 -2.7133 -1.6120 -1.4374 -0.1001 1.4111
##
## Outcome effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.52504 -0.53980 -0.23605 -0.06496 0.77952 0.82696
## --------------
##
## Base expression given by:
## An Engine with 10 components.
##
## Altered expression given by:
## An Engine with 10 components.
Note that the prevalences are not equal; when you use isWeighted = TRUE, they are chosen from a Dirichlet distribution. Note also that the summary function describes the object as a CancerEngine, since the same underlying structure is used to implement a ClinicalEngine.
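To make the idea of Dirichlet-distributed prevalences concrete, here is a small base-R sketch that draws one set of cluster weights by normalizing independent gamma variates, which is a standard way to sample from a Dirichlet distribution. The helper name `rdirichlet_one` and the concentration vector `alpha` are hypothetical; the concentration parameters that Umpire itself uses are not stated in this vignette.

```r
# Draw one vector of cluster prevalences from a Dirichlet distribution
# by normalizing independent gamma variates.
# 'alpha' is a hypothetical concentration vector, not Umpire's own setting.
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(1)                          # arbitrary seed
prev <- rdirichlet_one(rep(10, 4))   # weights for four clusters
prev                                 # unequal, strictly positive weights
sum(prev)                            # equals 1 up to floating-point rounding
```

Larger values in `alpha` pull the weights toward equal prevalences, while smaller values produce more uneven cluster sizes.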
Now we confirm that the model expects to produce the 20 features that we requested. It will do so using 10 “components”, where each component consists of a pair of correlated features.
## [1] 20
## [1] 10
The ClinicalEngine is used to simulate the raw, base dataset.
Data are simulated as a list with two objects: simulated data and associated clinical information, including “ground truth” subtype membership and survival data (outcome, length of followup, and occurrence of event of interest within the followup period).
## [1] "list"
## [1] "clinical" "data"
## CancerSubType Outcome LFU Event
## Min. :1.00 Bad :205 Min. :12.00 Mode :logical
## 1st Qu.:1.00 Good: 95 1st Qu.:26.00 FALSE:295
## Median :3.00 Median :42.50 TRUE :5
## Mean :2.73 Mean :41.47
## 3rd Qu.:4.00 3rd Qu.:55.00
## Max. :4.00 Max. :71.00
The raw data are simulated as a matrix of continuous values.
## [1] "matrix"
## [1] 20 300
The user may add further additive noise to the raw data. The ClinicalNoiseModel simulates additive noise for each feature f and patient i from a normal distribution, \(E_{fi} \sim N(0, \tau)\), where the standard deviation \(\tau\) varies from feature to feature according to a gamma distribution, \(\tau \sim Gamma(shape, scale)\), whose shape and scale act as hyperparameters. Thus, the ClinicalNoiseModel generates many features with low noise (such as a tightly calibrated laboratory test) and some features with high noise (such as a blood pressure measured by hand and manually entered into the medical record). The user may apply default parameters or individual parameters. Next, the ClinicalNoiseModel is applied to blur the previously simulated data. The default model below generates a low overall level of additive noise.
## A 'NoiseModel' with:
## additive offset = 0
## additive scale distributed as:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003632 0.027081 0.034741 0.061475 0.076802 0.185948
## multiplicative scale = 0
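The noise model described above can be sketched in base R without the package. This is only an illustration of the stated distributions, not Umpire's implementation. The values of `shape` and `scale` are assumptions, chosen so that the mean of \(\tau\) is 0.05, which is roughly in line with the low noise level in the output shown. The variable names are hypothetical.

```r
set.seed(2)                  # arbitrary seed
n_features <- 20
n_patients <- 300
shape <- 1.02                # assumed value, for illustration
scale <- 0.05 / shape        # assumed value, so that mean(tau) is about 0.05

# One standard deviation per feature, drawn from a gamma distribution.
tau <- rgamma(n_features, shape = shape, scale = scale)

# Noise matrix with rows as features. rnorm() recycles 'sd' over the
# column-major fill of the matrix, so entry (f, i) is drawn from N(0, tau[f]).
E <- matrix(rnorm(n_features * n_patients, mean = 0, sd = tau),
            nrow = n_features, ncol = n_patients)

summary(tau)                 # most features get small tau, a few get larger values
dim(E)                       # 20 features by 300 patients
```

Adding `E` to a raw data matrix of the same shape gives a noisy version of that matrix, which is the role that blurring plays in the workflow.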
Umpire 2.0 allows the simulation of binary, nominal, and ordinal data from raw, continuous data in variable, user-defined mixtures. The user defines prevalences, summing to 1, of binary, continuous, and categorical data in the desired final mixture. For categorical features, the user may tune the percent of categorical data desired to be nominal and the range of the number of categories to be simulated.
The data simulated above by the ClinicalEngine and ClinicalNoiseModel take rows (not columns) as features, following an omics convention. Thus, by default, when generating data, rows are treated as features and columns as patients. The setDataTypes method transposes its results to a data frame where the columns are features and the rows are patients. This transposition both fits better with the conventions used for clinical data and supports the ability to store different kinds of (mixed-type) data in different columns.
dt <- setDataTypes(dset$data,
                   pCont = 1/3, pBin = 1/3, pCat = 1/3,
                   pNominal = 0.5, range = 3:9,
                   inputRowsAreFeatures = TRUE)
## 1
## [1] "binned" "cutpoints"
The setDataTypes function generates a list containing two objects: a data.frame of mixed-type data…
## [1] "data.frame"
## [1] 300 20
## V1 V2 V3 V4 V5
## Min. :0.00 Min. :0.0 Min. :0.5845 Min. :0.0000 R:70
## 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:2.1179 1st Qu.:0.0000 S:56
## Median :1.00 Median :1.0 Median :2.8589 Median :0.0000 T:50
## Mean :0.69 Mean :0.7 Mean :2.8963 Mean :0.4033 U:54
## 3rd Qu.:1.00 3rd Qu.:1.0 3rd Qu.:3.7239 3rd Qu.:1.0000 V:70
## Max. :1.00 Max. :1.0 Max. :5.0888 Max. :1.0000
##
## V6 V7 V8 V9 V10 V11
## R:42 Min. :4.974 R:83 Min. :0.00 V :44 Min. :3.370
## S:43 1st Qu.:6.378 S:53 1st Qu.:0.00 T :41 1st Qu.:5.035
## T:34 Median :7.032 T:79 Median :0.00 W :35 Median :5.565
## U:44 Mean :6.983 U:85 Mean :0.34 X :35 Mean :5.564
## V:51 3rd Qu.:7.547 3rd Qu.:1.00 S :34 3rd Qu.:6.177
## W:46 Max. :9.120 Max. :1.00 U :32 Max. :7.137
## X:40 (Other):79
## V12 V13 V14 V15
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :5.191
## 1st Qu.:0.00 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:6.600
## Median :0.00 Median :1.00 Median :1.00 Median :6.950
## Mean :0.17 Mean :0.79 Mean :0.92 Mean :6.960
## 3rd Qu.:0.00 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:7.275
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :8.468
##
## V16 V17 V18 V19 V20
## Min. :6.086 Min. : 7.631 Min. :0.00 E :55 R: 61
## 1st Qu.:7.622 1st Qu.: 8.767 1st Qu.:0.00 C :43 S:105
## Median :8.013 Median : 9.146 Median :0.00 H :43 T: 83
## Mean :8.042 Mean : 9.123 Mean :0.25 B :36 U: 51
## 3rd Qu.:8.486 3rd Qu.: 9.491 3rd Qu.:0.25 G :33
## Max. :9.660 Max. :10.488 Max. :1.00 F :32
## (Other):58
The cutpoints contain a record, for each feature, of data type, break points, and labels. Here are two examples of the kind of information stored for a cutpoint.
## $breaks
## [1] -Inf 6.541307 Inf
##
## $labels
## [1] 1 0
##
## $Type
## [1] "symmetric binary"
## $breaks
## 0% 18.63384% 41.91345% 59.88372% 76.79872% 100%
## -Inf 5.489846 5.916301 6.142115 6.512071 Inf
##
## $labels
## [1] "S" "R" "U" "T" "V"
##
## $Type
## [1] "nominal"
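To show how a cutpoint record of this kind can turn continuous values into categories, the following base-R sketch applies the two records printed above with `cut()`. The input vector `x` is hypothetical, and the use of `cut()` with its default right-closed intervals is an assumption made for illustration; the vignette does not state how Umpire applies the breaks internally.

```r
# Cutpoint records copied from the two examples printed above.
binary_cp  <- list(breaks = c(-Inf, 6.541307, Inf),
                   labels = c(1, 0))
nominal_cp <- list(breaks = c(-Inf, 5.489846, 5.916301, 6.142115, 6.512071, Inf),
                   labels = c("S", "R", "U", "T", "V"))

# Hypothetical continuous values for one feature (not from the simulation).
x <- c(5.2, 5.7, 6.0, 6.3, 6.6, 7.1)

# cut() assigns each value to the interval it falls in and returns a factor
# whose levels are the supplied labels.
binned_binary  <- cut(x, breaks = binary_cp$breaks,  labels = binary_cp$labels)
binned_nominal <- cut(x, breaks = nominal_cp$breaks, labels = nominal_cp$labels)

as.character(binned_binary)   # "1" "1" "1" "1" "0" "0"
as.character(binned_nominal)  # "S" "R" "U" "T" "V" "V"
```

Note that the nominal labels are deliberately not in alphabetical order along the breaks, so the resulting categories carry no ordering information.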
And here is an overview of the number of features of each type.
## type
## continuous nominal ordinal symmetric binary
## 6 5 1 8
The cutpoints should be saved for downstream use in the MixedTypeEngine.
The many parameters defining a simulated data mixture can be stored as a single MixedTypeEngine for downstream use to easily generate future datasets with the same simulation parameters.
The MixedTypeEngine stores the following components for re-implementation:
## A 'MixedTypeEngine' (MTE) based on:
## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw) , a CancerModel object constructed via the function call:
## CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP, nPattern = nClusters, HIT = hitfn, prevalence = Prevalence(isWeighted, nClusters))
##
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
##
## Survival effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.2340 -2.7133 -1.6120 -1.4374 -0.1001 1.4111
##
## Outcome effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.52504 -0.53980 -0.23605 -0.06496 0.77952 0.82696
## --------------
##
## Base expression given by:
## An Engine with 10 components.
##
## Altered expression given by:
## An Engine with 10 components.
##
## ---------------
## The MTE uses the following noise model:
## A 'NoiseModel' with:
## additive offset = 0
## additive scale distributed as:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003632 0.027081 0.034741 0.061475 0.076802 0.185948
## multiplicative scale = 0
## ---------------
## The MTE simulates clinical data of these types:
##
## continuous nominal ordinal symmetric binary
## 6 5 1 8
With rand, the user can easily generate new data sets with the same simulation parameters.
## [1] "data.frame"
## V1 V2 V3 V4 V5 V6 V7 V8
## 0: 9 0:11 Min. :1.404 0:12 R:7 R:2 Min. :5.701 R:5
## 1:11 1: 9 1st Qu.:2.104 1: 8 S:1 S:4 1st Qu.:6.628 S:1
## Median :2.866 T:4 T:0 Median :7.090 T:6
## Mean :2.941 U:2 U:0 Mean :7.142 U:8
## 3rd Qu.:3.575 V:6 V:3 3rd Qu.:7.633
## Max. :4.903 W:4 Max. :9.281
## X:7
## V9 V10 V11 V12 V13 V14 V15
## 0:11 Z :4 Min. :4.143 0:19 0: 6 0: 0 Min. :6.499
## 1: 9 S :3 1st Qu.:5.075 1: 1 1:14 1:20 1st Qu.:6.767
## T :3 Median :5.419 Median :7.228
## U :3 Mean :5.464 Mean :7.138
## W :2 3rd Qu.:5.959 3rd Qu.:7.362
## Y :2 Max. :6.590 Max. :8.024
## (Other):3
## V16 V17 V18 V19 V20
## Min. :7.014 Min. :7.984 0:11 E :5 R:9
## 1st Qu.:7.607 1st Qu.:8.763 1: 9 C :4 S:5
## Median :8.067 Median :9.077 D :4 T:4
## Mean :8.113 Mean :9.048 H :4 U:2
## 3rd Qu.:8.636 3rd Qu.:9.316 B :3
## Max. :9.613 Max. :9.957 A :0
## (Other):0
By using the keepall argument of the function, you can keep the intermediate datasets produced by the rand method.
## [1] "list"
## [1] "raw" "clinical" "noisy" "binned"
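For orientation, the whole workflow described in this vignette might be strung together as in the sketch below. It is a hedged reconstruction rather than the vignette's original code: the seed is arbitrary, so the numbers will not match the output shown here; the object names (`ce`, `cnm`, `mte`, `full`) are illustrative, apart from `dset` and `dt`, which appear in the text; and the argument lists for ClinicalEngine, ClinicalNoiseModel, MixedTypeEngine, and rand are assumptions based on the descriptions in the text. The code runs only if the Umpire package is installed.

```r
# Run the sketch only when the Umpire package is available.
if (requireNamespace("Umpire", quietly = TRUE)) {
  library(Umpire)
  set.seed(1234)   # arbitrary seed, not the one behind the output shown here

  # 20 features, 4 clusters, unequal (weighted) cluster sizes.
  ce   <- ClinicalEngine(20, 4, isWeighted = TRUE)
  dset <- rand(ce, 300)                        # list with raw data and clinical info

  # Default noise model, one noise level per feature (row).
  cnm  <- ClinicalNoiseModel(nrow(dset$data))

  # Convert continuous features into a mixture of data types.
  dt   <- setDataTypes(dset$data,
                       pCont = 1/3, pBin = 1/3, pCat = 1/3,
                       pNominal = 0.5, range = 3:9,
                       inputRowsAreFeatures = TRUE)

  # Bundle the engine, noise model, and cutpoints for reuse.
  mte  <- MixedTypeEngine(ce, noise = cnm, cutpoints = dt$cutpoints)

  # New data set with the same parameters, keeping the intermediate steps.
  full <- rand(mte, 25, keepall = TRUE)
  print(names(full))
}
```

Saving the engine rather than a single data set lets you draw as many fresh samples as a method comparison requires, all from the same known structure.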
The raw and noisy elements have the rows as (future clinical) features and the columns as patients/samples.
## [1] 20 25
## V1 V2 V3 V4
## Min. :4.674 Min. :4.895 Min. :1.559 Min. :3.494
## 1st Qu.:5.829 1st Qu.:6.646 1st Qu.:2.350 1st Qu.:5.219
## Median :6.154 Median :7.096 Median :2.748 Median :5.881
## Mean :6.070 Mean :7.038 Mean :2.957 Mean :5.913
## 3rd Qu.:6.544 3rd Qu.:7.451 3rd Qu.:3.458 3rd Qu.:6.782
## Max. :7.295 Max. :8.603 Max. :4.961 Max. :8.162
## V5 V6 V7 V8
## Min. :5.148 Min. :3.537 Min. :4.614 Min. :7.267
## 1st Qu.:5.954 1st Qu.:4.939 1st Qu.:6.370 1st Qu.:8.221
## Median :6.455 Median :5.085 Median :7.124 Median :8.502
## Mean :6.377 Mean :5.176 Mean :7.019 Mean :8.499
## 3rd Qu.:6.864 3rd Qu.:5.483 3rd Qu.:7.828 3rd Qu.:8.787
## Max. :7.291 Max. :6.089 Max. :8.734 Max. :9.441
## V9 V10 V11 V12
## Min. :5.639 Min. :5.303 Min. :3.861 Min. :4.381
## 1st Qu.:6.965 1st Qu.:6.888 1st Qu.:4.863 1st Qu.:5.071
## Median :7.345 Median :7.268 Median :5.436 Median :5.444
## Mean :7.185 Mean :7.227 Mean :5.457 Mean :5.478
## 3rd Qu.:7.621 3rd Qu.:7.668 3rd Qu.:6.082 3rd Qu.:5.893
## Max. :8.020 Max. :8.741 Max. :6.871 Max. :6.519
## V13 V14 V15 V16
## Min. :4.888 Min. :6.667 Min. :6.107 Min. :6.782
## 1st Qu.:6.590 1st Qu.:7.637 1st Qu.:6.546 1st Qu.:7.711
## Median :7.143 Median :8.142 Median :7.030 Median :8.030
## Mean :6.946 Mean :8.130 Mean :6.939 Mean :8.100
## 3rd Qu.:7.308 3rd Qu.:8.729 3rd Qu.:7.267 3rd Qu.:8.634
## Max. :8.675 Max. :9.791 Max. :7.719 Max. :9.095
## V17 V18 V19 V20
## Min. : 7.804 Min. :2.900 Min. : 5.922 Min. :3.486
## 1st Qu.: 9.050 1st Qu.:3.976 1st Qu.: 6.531 1st Qu.:4.241
## Median : 9.272 Median :4.359 Median : 7.223 Median :4.659
## Mean : 9.215 Mean :4.362 Mean : 7.207 Mean :4.687
## 3rd Qu.: 9.511 3rd Qu.:4.584 3rd Qu.: 7.438 3rd Qu.:5.053
## Max. :10.069 Max. :6.170 Max. :10.253 Max. :7.106
## [1] 25 20
## V1 V2 V3 V4
## Min. : 1.565 Min. :1.556 Min. :4.251 Min. :2.712
## 1st Qu.: 5.880 1st Qu.:5.632 1st Qu.:5.336 1st Qu.:5.596
## Median : 6.853 Median :6.301 Median :6.532 Median :6.823
## Mean : 6.608 Mean :6.569 Mean :6.484 Mean :6.700
## 3rd Qu.: 7.602 3rd Qu.:8.128 3rd Qu.:7.532 3rd Qu.:8.155
## Max. :10.320 Max. :9.768 Max. :9.808 Max. :9.372
## V5 V6 V7 V8
## Min. :1.653 Min. :2.748 Min. :3.304 Min. :3.166
## 1st Qu.:5.631 1st Qu.:5.199 1st Qu.:6.097 1st Qu.:5.839
## Median :6.557 Median :6.801 Median :6.550 Median :6.836
## Mean :6.323 Mean :6.523 Mean :6.747 Mean :6.432
## 3rd Qu.:7.651 3rd Qu.:7.581 3rd Qu.:7.831 3rd Qu.:7.131
## Max. :9.380 Max. :9.934 Max. :9.327 Max. :8.953
## V9 V10 V11 V12
## Min. :4.348 Min. :3.870 Min. :2.450 Min. :3.461
## 1st Qu.:5.116 1st Qu.:4.967 1st Qu.:5.094 1st Qu.:5.695
## Median :6.716 Median :6.680 Median :5.964 Median :6.766
## Mean :6.553 Mean :6.365 Mean :6.194 Mean :6.621
## 3rd Qu.:7.371 3rd Qu.:7.138 3rd Qu.:7.379 3rd Qu.:7.694
## Max. :8.802 Max. :9.028 Max. :8.566 Max. :9.122
## V13 V14 V15 V16
## Min. :2.351 Min. : 3.955 Min. :3.374 Min. :3.520
## 1st Qu.:5.018 1st Qu.: 5.006 1st Qu.:5.426 1st Qu.:5.245
## Median :6.645 Median : 6.797 Median :6.777 Median :6.506
## Mean :6.380 Mean : 6.460 Mean :6.684 Mean :6.515
## 3rd Qu.:7.822 3rd Qu.: 7.402 3rd Qu.:7.570 3rd Qu.:7.415
## Max. :9.529 Max. :10.038 Max. :9.847 Max. :9.302
## V17 V18 V19 V20
## Min. :2.084 Min. :2.301 Min. :2.732 Min. :3.012
## 1st Qu.:5.526 1st Qu.:4.656 1st Qu.:5.401 1st Qu.:5.263
## Median :6.483 Median :6.024 Median :6.597 Median :7.150
## Mean :6.327 Mean :5.983 Mean :6.293 Mean :6.757
## 3rd Qu.:7.538 3rd Qu.:6.975 3rd Qu.:7.241 3rd Qu.:7.976
## Max. :9.300 Max. :9.372 Max. :9.685 Max. :9.451
## V21 V22 V23 V24
## Min. :2.592 Min. :2.238 Min. :3.665 Min. :2.457
## 1st Qu.:5.681 1st Qu.:4.926 1st Qu.:5.956 1st Qu.:5.709
## Median :6.860 Median :6.431 Median :6.887 Median :6.935
## Mean :6.532 Mean :6.345 Mean :6.877 Mean :6.548
## 3rd Qu.:7.693 3rd Qu.:7.612 3rd Qu.:7.638 3rd Qu.:7.754
## Max. :9.259 Max. :9.072 Max. :9.852 Max. :9.285
## V25
## Min. :2.888
## 1st Qu.:5.239
## Median :7.010
## Mean :6.619
## 3rd Qu.:7.638
## Max. :9.074
Noisy data arise by adding simulated noise to the raw data.
Raw and noisy data.
The binned element has columns as features and rows as samples. Binned data arise by applying cut points to noisy data.
## [1] 25 20
## V1 V2 V3 V4 V5 V6 V7 V8
## 0: 6 0: 7 Min. :1.556 0:14 R: 4 R:1 Min. :4.603 R:7
## 1:19 1:18 1st Qu.:2.351 1:11 S: 2 S:6 1st Qu.:6.403 S:4
## Median :2.748 T: 6 T:6 Median :7.183 T:6
## Mean :2.956 U: 2 U:2 Mean :7.035 U:8
## 3rd Qu.:3.461 V:11 V:7 3rd Qu.:7.771
## Max. :4.963 W:3 Max. :8.721
## X:0
## V9 V10 V11 V12 V13 V14 V15
## 0:18 V :5 Min. :3.919 0:21 0: 5 0: 2 Min. :6.071
## 1: 7 W :5 1st Qu.:4.934 1: 4 1:20 1:23 1st Qu.:6.493
## S :3 Median :5.376 Median :7.043
## T :3 Mean :5.449 Mean :6.951
## X :3 3rd Qu.:5.933 3rd Qu.:7.446
## Y :2 Max. :6.834 Max. :7.685
## (Other):4
## V16 V17 V18 V19 V20
## Min. :6.681 Min. : 7.820 0:20 B :6 R:8
## 1st Qu.:7.571 1st Qu.: 9.072 1: 5 E :6 S:7
## Median :8.085 Median : 9.259 F :3 T:8
## Mean :8.065 Mean : 9.219 G :3 U:2
## 3rd Qu.:8.410 3rd Qu.: 9.529 H :3
## Max. :9.201 Max. :10.038 C :2
## (Other):2
Noisy and binned data.