In this vignette, we illustrate the basic functionality of the
bfbin2arm package and its core functions. The package can
be used to design a Bayesian (phase II) clinical trial with two arms and
binary endpoints (success or failure) based on Bayes factors. Our main
assumption here is that the observed data in both groups are from two
random variables \(Y_1,Y_2\) which follow binomial distributions with parameters \(n_1,p_1\) and \(n_2,p_2\), respectively: \[Y_1\sim
\mathrm{Bin}(n_1,p_1), \hspace{1cm} Y_2\sim
\mathrm{Bin}(n_2,p_2)\]
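To make the sampling model concrete, a single hypothetical trial can be simulated from it (the arm sizes and success probabilities below are illustrative assumptions, not trial values):

```r
# simulate one two-arm trial from the binomial model above
set.seed(123)                          # for reproducibility
n1 <- 40; p1 <- 0.3                    # control arm (assumed values)
n2 <- 80; p2 <- 0.6                    # treatment arm (assumed values)
y1 <- rbinom(1, size = n1, prob = p1)  # observed successes, control
y2 <- rbinom(1, size = n2, prob = p2)  # observed successes, treatment
```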
In its current form, the package implements four different hypothesis tests for the trial:
\[H_0:p_1=p_2 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:p_1\neq p_2\] Alternatively, a well-known parameterization of this test introduces a difference parameter \(\eta=p_2-p_1\) and the grand mean \(\zeta=\frac{1}{2}(p_1+p_2)\). Using this parameterization, we have \[p_1=\zeta-\frac{\eta}{2}, \hspace{1cm} p_2=\zeta+\frac{\eta}{2}\] and the hypotheses can be rewritten as: \[H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta \neq 0\] Next to this two-sided test, three directional tests are available in the package: \[H_+:p_2>p_1 \hspace{1cm} \text{ versus } \hspace{1cm} H_-:p_2\leq p_1,\] as well as the two tests of the directional hypotheses \(H_+:p_2>p_1\) and \(H_-:p_2<p_1\), each against the null hypothesis \(H_0:p_1=p_2\).
For each of the four tests, a separate Bayes factor exists and can be used. For the two-sided test, we denote the Bayes factor as \(BF_{01}\), and for the three directional tests above we denote the Bayes factors as \(BF_{+-}\), \(BF_{+0}\) and \(BF_{-0}\).
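Because each Bayes factor is a ratio of marginal likelihoods of the same data, the four Bayes factors are linked by transitivity, e.g. \(BF_{+0} = BF_{+1}/BF_{01}\) and \(BF_{+-} = BF_{+1}/BF_{-1}\). A minimal numeric sketch of these identities (the marginal-likelihood values below are made up for illustration only):

```r
# transitivity of Bayes factors: BF_AB = m_A(y) / m_B(y), so ratios chain.
# hypothetical marginal likelihoods m_H(y) for illustration:
m_plus <- 6; m_minus <- 0.5; m_0 <- 2; m_1 <- 1
BF_plus1  <- m_plus  / m_1
BF_minus1 <- m_minus / m_1
BF_01     <- m_0     / m_1
BF_plus0     <- BF_plus1 / BF_01       # equals m_plus / m_0
BF_plusminus <- BF_plus1 / BF_minus1   # equals m_plus / m_minus
```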
The \(\mathrm{Beta}(a_0,b_0)\) distribution is a conjugate prior for the binomial likelihood: when it is chosen as the prior, the posterior \(P_{p \mid Y}\) is again Beta-distributed, which makes the beta distribution a natural choice for the priors. We assume a Beta design prior under \(H_0\) as follows: \[p_1 =p_2 = p\mid H_0 \sim \mathrm{Beta}(a_0^d,b_0^d)\] Thus, under \(H_0:\eta = 0\), both probabilities are identical, \(p_1=p_2\), and take some value \(p\in [0,1]\), which has a beta design prior. Likewise, we pick independent Beta design priors under \(H_1:\eta \neq 0\): \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^d,b_1^d), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^d,b_2^d)\] For the analysis priors \(P_{p_1}^a\), \(P_{p_2}^a\) under \(H_1\), we also choose independent Beta priors, with possibly different values \(a_i^a\) and \(b_i^a\) for \(i=1,2\), where the superscript signals that the hyperparameters belong to the analysis instead of the design prior: \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^a,b_1^a), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^a,b_2^a)\] Lastly, for the analysis prior \(P_{p}^a\) under \(H_0:\eta=0\), we choose a Dirac prior which puts all probability on \(\eta=p_2-p_1=0\), conditionally on a uniform prior on \(\zeta\), that is, \[p_1=p_2=p \mid H_0 \sim 1_{\{\eta=0\}} \mid \zeta \sim U(0,1)\] for the analysis with the Bayes factor.
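As a side note on conjugacy: the marginal likelihood of one arm's data under a Beta prior is available in closed form (the beta-binomial distribution), a property Bayes factor computations for this model can exploit. A minimal sketch of the integral worked out (not code from bfbin2arm):

```r
# marginal likelihood of y successes in n trials under a Beta(a, b) prior:
# integral of choose(n, y) * p^y * (1 - p)^(n - y) * dbeta(p, a, b) over p
marg_lik_arm <- function(y, n, a, b) {
  choose(n, y) * beta(a + y, b + n - y) / beta(a, b)
}
# under H1 the Beta priors are independent, so the joint marginal factorizes
marg_lik_H1 <- function(y1, n1, y2, n2, a1, b1, a2, b2) {
  marg_lik_arm(y1, n1, a1, b1) * marg_lik_arm(y2, n2, a2, b2)
}
# sanity check: under a flat Beta(1, 1) prior every outcome 0, ..., n is
# equally likely, so the marginal likelihood is 1 / (n + 1)
marg_lik_arm(3, 10, 1, 1)  # 1/11
```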
First, we load the package after installation:
library(bfbin2arm)
Next, we illustrate the key functions of the package by re-analyzing
a phase II trial in oncology. While no Bayesian approach
was used in the original statistical analysis of the trial, the
step-by-step walkthrough below showcases what a structured approach to
designing and calibrating a Bayesian phase II trial with the
bfbin2arm package looks like. Importantly, the trial must
have two arms and binary endpoints, and we assume that one of the
four tests detailed above is carried out using Bayes factors as the test
criterion.
The ICT-107 trial (Wen et al., 2019) was a randomized phase II study in newly diagnosed glioblastoma patients (n=124, 2:1 randomization). The primary binary endpoint is progression status at 6 months (PFS6), and the secondary binary endpoint is immunologic response status. Here, we focus on the secondary endpoint for illustration purposes.
Reported results (ITT population): 12 of 43 patients responded in the placebo (control) arm, and 49 of 81 patients responded in the ICT-107 (treatment) arm.
We start by calculating the Bayes factor(s) for the ICT-107 trial data:
## -------------------------------------------------------------
## 2. ICT-107 trial (immunologic response)
## Placebo (control): 12 responders, 31 non-responders
## ICT-107 (treatment): 49 responders, 32 non-responders
## -------------------------------------------------------------
y1_ict <- 12 # control successes
n1_ict <- 12 + 31
y2_ict <- 49 # treatment successes
n2_ict <- 49 + 32
cat("\n=== ICT-107 Trial (n1 =", n1_ict, ", n2 =", n2_ict, ") ===\n")
#>
#> === ICT-107 Trial (n1 = 43 , n2 = 81 ) ===
# BF01
BF01_ict <- twoarmbinbf01(y1_ict, y2_ict, n1_ict, n2_ict,
                          a_0_a = 1, b_0_a = 1,
                          a_1_a = 1, b_1_a = 1,
                          a_2_a = 1, b_2_a = 1)
# BF+1
BFp1_ict <- BFplus1(y1_ict, y2_ict, n1_ict, n2_ict,
                    a_1_d = 1, b_1_d = 1,
                    a_2_d = 1, b_2_d = 1)
# BF-1
BFm1_ict <- BFminus1(y1_ict, y2_ict, n1_ict, n2_ict,
                     a_1_d = 1, b_1_d = 1,
                     a_2_d = 1, b_2_d = 1)
# BF+0
cat("=== ICT-107 Trial === Bayes factor BF+0 results in ", BFplus0(BFp1_ict, BF01_ict))
#> === ICT-107 Trial === Bayes factor BF+0 results in 186.6192
# BF+-
cat("=== ICT-107 Trial === Bayes factor BF+- results in ", BFplusMinus(BFp1_ict, BFm1_ict))
#> === ICT-107 Trial === Bayes factor BF+- results in 3702.659

The most relevant Bayes factor here is \(BF_{+-}\), because it is directional and
leaves open the possibility of the placebo group having a larger
response rate than the treatment group. Note that the hyperparameters of
the beta analysis priors are specified in twoarmbinbf01 via
a_0_a = 1, b_0_a = 1, et cetera.
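As a rough plausibility check on the magnitude of \(BF_{+-}\): under a common encompassing \(\mathrm{Beta}(1,1)\) prior, the prior odds of \(p_2>p_1\) equal one, and the Bayes factor of \(H_+\) against \(H_-\) then reduces to the posterior odds of \(p_2>p_1\), which can be approximated by Monte Carlo from the independent Beta posteriors (a sketch, not the package's computation):

```r
# posterior odds of p2 > p1 under flat Beta(1, 1) analysis priors
set.seed(2024)
p1_draws <- rbeta(1e6, 1 + 12, 1 + 31)  # control posterior: 12/43 successes
p2_draws <- rbeta(1e6, 1 + 49, 1 + 32)  # treatment posterior: 49/81 successes
post_prob <- mean(p2_draws > p1_draws)
post_prob / (1 - post_prob)
```

The resulting odds are of a similar order of magnitude as the \(BF_{+-}\) value reported above.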
Now, a key question is which operating characteristics can be
expected based on the actual sample sizes used in the trial. The
powertwoarmbinbf01 function can provide the answer:
ict_results <- powertwoarmbinbf01(
n1 = n1_ict, n2 = n2_ict,
k = 1/3, k_f = 3,
test = "BF+-", # H+: p2 > p1 vs H-: p2 <= p1
a_0_d = 1, b_0_d = 1, a_0_a = 1, b_0_a = 1,
a_1_d = 1, b_1_d = 1, a_2_d = 1, b_2_d = 1,
a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1,
output = "numeric",
  compute_freq_t1e = TRUE
)
print(ict_results)
#> Power Type1_Error CE_H0
#> 0.8788106 0.0214111 0.8788106
#> Frequentist_Type1_Error
#> 0.2871811
#> attr(,"hypothesis")
#> [1] "H[+]:~p[2] > p[1] ~~ vs ~~ H[-]:~p[2] <= p[1]"
#> attr(,"compute_freq_t1e")
#> [1] TRUE

We see that, based on the actual sample sizes and a moderate evidence
threshold \(k=1/3\), the Bayesian power
is sufficiently large at \(87.8\%\).
Still, the frequentist type-I-error rate is far too high at \(28.7\%\), so we increase the evidence
threshold to \(k=1/10\) (strong
evidence) and next use the ntwoarmbinbf01 function to calibrate
the design based on our requirements.
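The thresholds k and k_f act as evidence cutoffs for the trial decision. A minimal sketch of one plausible reading (the helper below is hypothetical, not the package's implementation), where bf denotes the Bayes factor in favor of the null-type hypothesis (\(H_-\) for the BF+- test):

```r
# hypothetical decision helper: small bf (<= k) means compelling evidence
# against the null-type hypothesis, large bf (>= k_f) means compelling
# evidence for it; anything in between is inconclusive
decide <- function(bf, k = 1/10, k_f = 10) {
  if (bf <= k) "compelling evidence for the alternative"
  else if (bf >= k_f) "compelling evidence for the null"
  else "inconclusive"
}
decide(1 / 3702.659)  # ICT-107 data: BF-+ = 1 / BF+-, clearly favors H+
```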
The core working function to design a Bayesian trial with the package
is the ntwoarmbinbf01 function. It provides a method to
calibrate a Bayesian design in terms of Bayesian and frequentist power, the Bayesian and frequentist type-I-error rate, and the probability of compelling evidence for the null hypothesis.
The function makes use of parallelization, and it is recommended to run it on a computer with multiple cores to speed up the computations. First, we perform a sample size search for an ICT-107-type trial (balanced arms) under flat design priors and substantial evidence thresholds, using the directional Bayes factor \(BF_{+-}\):
ntwoarmbinbf01(
k = 1/10, k_f = 10,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 75), n_step = 1,
progress = FALSE,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot" # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 75 (step = 1, 66 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.100, k_f = 10.000
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> POWER not reached: max=0.754 at n_total=75
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) not reached: max=0.754 at n_total=75
#> FREQUENTIST Type-I error TOO HIGH: max(sup)=0.120 > 0.05
#> Frequentist power >= 0.80 achieved at n_total=50 (p1=0.30, p2=0.60)
The function arguments are
- the Bayes factor thresholds k = 1/10 and k_f = 10,
- the targeted power, the type-I-error rate alpha, and the probability of compelling evidence pce_H0 under \(H_-\) (or \(H_0\) for other tests),
- the test to be used (BF+-, BF01, BF+0 or BF-0),
- the sample size range nrange over which the design calibration operates (increasing the upper sample size requires more time for the calibration to finish),
- the step size n_step (we recommend using n_step = 1 always, except for quick checks where using n_step = 5 or n_step = 10 can decrease computing times significantly),
- progress, which shows a progress bar at the console (we strongly recommend setting it to TRUE; it is only set to FALSE here to avoid cluttering the output of this vignette),
- compute_freq_t1e, which sets whether the frequentist type-I-error rate should be computed, too; if so, p1_power and p2_power must be specified, denoting the assumed success probabilities in the control and treatment arms for frequentist power calculations,
- output, which can be set to plot or numeric.
The resulting output plots the design and analysis priors at the top
row, the resulting power and type-I-error rate curves as functions of
the sample size with markers for which sample sizes the design achieves
the required calibration thresholds (middle row), and the probability of
compelling evidence for the null hypothesis (in this case, the
hypothesis \(H_-\)) in the bottom row.
Note that the oscillations happen due to the discrete nature of the
binomial distribution, and the package algorithm ensures that for the
next 10 sample sizes, the power does not drop below the required
threshold. Likewise, the package ensures that the type-I-error rate does
not rise above the required alpha level, and that the probability of
compelling evidence does not drop below its required threshold. It is
straightforward to check this visually by means of the provided output
plots, too. If no plots are required, use the option
numeric instead of plot for the output
argument.
The resulting plot shows that while the type-I-error rate is calibrated already for \(n=10\) patients in total, the Bayesian power does not reach our desired level of 80% even for \(n=75\) patients in total (across both arms). We could increase the range, or alternatively use more informative design priors under which the hypotheses under comparison are separated more clearly. Right now, we essentially assume that everything is equally likely under our design priors, although we should have a clear expectation about the probabilities in the treatment and control arms. Thus, we modify our design priors next. Note that the plot also shows that the frequentist power is calibrated for \(n=50\) patients in total when assuming \(p_1=0.3\) (control arm probability) and \(p_2=0.6\) (treatment arm probability).
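Before switching to informative design priors, a quick aside on how Beta hyperparameters encode expectations: the mean of a \(\mathrm{Beta}(a,b)\) distribution is \(a/(a+b)\). The sketch below shows the means of the \(\mathrm{Beta}(1,2)\) and \(\mathrm{Beta}(2,1)\) priors used in the next calibration:

```r
# mean of a Beta(a, b) distribution is a / (a + b); the informative design
# priors Beta(1, 2) and Beta(2, 1) center p1 at 1/3 and p2 at 2/3
beta_mean <- function(a, b) a / (a + b)
c(p1_prior_mean = beta_mean(1, 2),   # control arm,   Beta(1, 2)
  p2_prior_mean = beta_mean(2, 1))   # treatment arm, Beta(2, 1)
```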
Now, the example above used flat design priors, which might be
unrealistic in a variety of settings. Next, we perform a sample size
search for a new ICT-107-type trial (balanced arms) under informative
design priors with very strong evidence thresholds. Notice the
additionally specified parameters a_1_d = 1, b_1_d = 2 and
a_2_d = 2, b_2_d = 1, which are the hyperparameters
of the Beta design priors for \(p_1\) and \(p_2\) under \(H_+\).
ntwoarmbinbf01(
k = 1/30, k_f = 30,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 100), n_step = 1,
progress = FALSE,
a_1_d = 1, b_1_d = 2,
a_2_d = 2, b_2_d = 1,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot" # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.033, k_f = 30.000
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> Power >= 0.80 achieved at n_total=72
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) not reached: max=0.715 at n_total=100
#> Frequentist Type-I error <= 0.05 achieved (max(sup)=0.041)
#> Frequentist power >= 0.80 achieved at n_total=77 (p1=0.30, p2=0.60)
We see that now the Bayesian power is calibrated for \(n=72\) patients in total, while the
frequentist power is calibrated for \(n=77\) patients in total. Importantly,
the frequentist type-I-error rate is now only \(0.041<0.05\), as stated by the console
output of the function. Thus, the design is fully calibrated except for
the probability of compelling evidence for \(H_-\) shown in the bottom plot.
Therefore, we next perform a sample size search for a new ICT-107-type
trial (balanced arms) under informative design priors with very strong
evidence thresholds, where the design prior under \(H_-\) is modified so that the required
probability of compelling evidence \(PCE(H_0)\) is reached at smaller sample
sizes. Note that now, additionally, the hyperparameters of
the Beta design priors for \(p_1\) and
\(p_2\) under \(H_-\) are specified via
a_1_d_Hminus = 2, b_1_d_Hminus = 1 and
a_2_d_Hminus = 1, b_2_d_Hminus = 2:
ntwoarmbinbf01(
k = 1/30, k_f = 30,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 100), n_step = 1,
  progress = FALSE,
a_1_d = 1, b_1_d = 2,
a_2_d = 2, b_2_d = 1,
a_1_d_Hminus = 2, b_1_d_Hminus = 1,
a_2_d_Hminus = 1, b_2_d_Hminus = 2,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot" # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.033, k_f = 30.000
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> Power >= 0.80 achieved at n_total=72
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) >= 0.80 achieved at n_total=72
#> Frequentist Type-I error <= 0.05 achieved (max(sup)=0.041)
#> Frequentist power >= 0.80 achieved at n_total=77 (p1=0.30, p2=0.60)
The result is a fully calibrated Bayesian design which meets Bayesian
and frequentist power demands, Bayesian and frequentist type-I-error
rate requirements and our requirement on the probability of compelling
evidence for \(H_0\) (that is, \(H_-\) in this case).
The analysis with the bfbin2arm package reveals several aspects. If a
balanced design with equal randomization probabilities is desired,
then \(n=72\) patients in total suffice to meet the Bayesian power, type-I-error and compelling-evidence requirements, while \(n=77\) patients in total are required to additionally meet the frequentist power requirement under the assumed \(p_1=0.3\) and \(p_2=0.6\).
In the original ICT-107 trial, \(2/3\) of the patients were randomized into
the treatment group, while \(1/3\) of
the patients were randomized into the control group. We can use the
parameters alloc1 and alloc2 to specify
randomization probabilities for the control and treatment arms and carry
out the Bayesian sample size calculations based on these randomization
probabilities. As an example, we rerun the last calibration, but use the
randomization probabilities of the ICT-107 trial:
ntwoarmbinbf01(
k = 1/30, k_f = 30,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 100), n_step = 1,
progress = FALSE,
a_1_d = 1, b_1_d = 2,
a_2_d = 2, b_2_d = 1,
a_1_d_Hminus = 2, b_1_d_Hminus = 1,
a_2_d_Hminus = 1, b_2_d_Hminus = 2,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot", # Returns recommended n per group
alloc1 = 1/3,
alloc2 = 2/3
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.333, alloc2 = 0.667
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.033, k_f = 30.000
#> Allocation: alloc1 = 0.333, alloc2 = 0.667
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> Power >= 0.80 achieved at n_total=83
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) >= 0.80 achieved at n_total=83
#> FREQUENTIST Type-I error TOO HIGH: max(sup)=0.050 > 0.05
#> Frequentist power >= 0.80 achieved at n_total=86 (p1=0.30, p2=0.60)
Remember that the sample size shown at the x-axis in the power and
type-I-error rate plot as well as in the probability of compelling
evidence plot is the total sample size in both arms. We see that now we
need \(n=83\) patients in total to
reach Bayesian power of 80%, while \(n=86\) patients in total are required for
frequentist power calibration of 80%. The probability of compelling
evidence reaches 80% at \(n=83\)
patients in total. Note, however, that the frequentist type-I-error rate
is now exactly at the boundary, which might be too liberal for some. As
the frequentist type-I-error rate assumes fixed success probabilities in
both trial arms and is independent of the design priors, we can decrease
the evidence threshold \(k\) slightly
to bring the frequentist type-I-error rate down accordingly. Just try it
out yourself: decrease \(k\) from
\(k=1/30\) to \(k=1/40\) and rerun the last code block.
If the original 2:1 randomization of the ICT-107 trial is used and two thirds of the patients are randomized into the treatment group, then to fulfill all four requirements it suffices to enroll 32 patients in the control arm and 64 patients in the treatment arm, using the Bayes factor thresholds \(k=1/40\) and \(k_f=30\) for decision making about the hypotheses \(H_+\) and \(H_-\) under consideration.
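The per-arm numbers follow from the allocation probabilities applied to the total sample size (a sketch of the arithmetic; the package may round differently):

```r
# per-arm sample sizes implied by the total sample size and 1:2 allocation
n_total <- 96                      # total implied by 32 + 64 above
c(n1 = round(n_total * 1/3),       # control arm   (alloc1 = 1/3) -> 32
  n2 = round(n_total * 2/3))       # treatment arm (alloc2 = 2/3) -> 64
```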
If desired, we can also compare the predictive densities under the different hypotheses directly via:
pred_H0 <- predictiveDensityH0(y1_ict, y2_ict, n1_ict, n2_ict)
pred_H1 <- predictiveDensityH1(y1_ict, y2_ict, n1_ict, n2_ict)
pred_Hplus <- predictiveDensityHplus_trunc(y1_ict, y2_ict, n1_ict, n2_ict)
data.frame(
Hypothesis = c("H0: p1=p2", "H1: p1 != p2", "H+: p2>p1"),
"Pred. Density" = round(c(pred_H0, pred_H1, pred_Hplus), 4)
)
#> Hypothesis Pred..Density
#> 1 H0: p1=p2 0e+00
#> 2 H1: p1 != p2 3e-04
#> 3 H+: p2>p1 6e-04

Wen PY, et al. (2019). A Randomized Double-Blind Placebo-Controlled Phase II Trial of Dendritic Cell Vaccine ICT-107. Clinical Cancer Research. [PMID: 31320597]