In this vignette, we illustrate the basic functionality of the
bfbin2arm package and its core functions. The package can
be used to design a Bayesian (phase II) clinical trial with two arms and
binary endpoints (success or failure) based on Bayes factors. Our main
assumption here is that the observed data in both groups are from two
random variables \(Y_1,Y_2\) which follow binomial distributions with parameters \(n_1,p_1\) and \(n_2,p_2\), respectively: \[Y_1\sim
\mathrm{Bin}(n_1,p_1), \hspace{1cm} Y_2\sim
\mathrm{Bin}(n_2,p_2)\]
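To make the sampling model concrete, a single hypothetical trial can be simulated from it (the arm sizes and success probabilities below are illustrative assumptions, not trial values):

```r
# simulate one two-arm trial from the binomial model above
set.seed(123)                          # for reproducibility
n1 <- 40; p1 <- 0.3                    # control arm (assumed values)
n2 <- 80; p2 <- 0.6                    # treatment arm (assumed values)
y1 <- rbinom(1, size = n1, prob = p1)  # observed successes, control
y2 <- rbinom(1, size = n2, prob = p2)  # observed successes, treatment
```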
In its current form, the package implements four different hypothesis tests for the trial:
\[H_0:p_1=p_2 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:p_1\neq p_2\] Alternatively, a well-known parameterization of this test introduces a difference parameter \(\eta=p_2-p_1\) and the grand mean \(\zeta=\frac{1}{2}(p_1+p_2)\). Using this parameterization, we have \[p_1=\zeta-\frac{\eta}{2}, \hspace{1cm} p_2=\zeta+\frac{\eta}{2}\] and the hypotheses can be rewritten as: \[H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta \neq 0\] Next to this two-sided test, three directional tests are available in the package: \[H_+:p_2>p_1 \hspace{1cm} \text{ versus } \hspace{1cm} H_-:p_2\leq p_1,\] as well as the two tests of the directional hypotheses \(H_+:p_2>p_1\) and \(H_-:p_2<p_1\), each against the null hypothesis \(H_0:p_1=p_2\).
For each of the four tests, a separate Bayes factor exists and can be used. For the two-sided test, we denote the Bayes factor as \(BF_{01}\), and for the three directional tests above we denote the Bayes factors as \(BF_{+-}\), \(BF_{+0}\) and \(BF_{-0}\).
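Because each Bayes factor is a ratio of marginal likelihoods of the same data, the four Bayes factors are linked by transitivity, e.g. \(BF_{+0} = BF_{+1}/BF_{01}\) and \(BF_{+-} = BF_{+1}/BF_{-1}\). A minimal numeric sketch of these identities (the marginal-likelihood values below are made up for illustration only):

```r
# transitivity of Bayes factors: BF_AB = m_A(y) / m_B(y), so ratios chain.
# hypothetical marginal likelihoods m_H(y) for illustration:
m_plus <- 6; m_minus <- 0.5; m_0 <- 2; m_1 <- 1
BF_plus1  <- m_plus  / m_1
BF_minus1 <- m_minus / m_1
BF_01     <- m_0     / m_1
BF_plus0     <- BF_plus1 / BF_01       # equals m_plus / m_0
BF_plusminus <- BF_plus1 / BF_minus1   # equals m_plus / m_minus
```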
The \(\mathrm{Beta}(a_0,b_0)\) distribution is a conjugate prior for the binomial likelihood: when it is chosen as the prior, the posterior \(P_{p \mid Y}\) is again Beta-distributed, which makes the beta distribution a natural choice for the priors. We assume a Beta design prior under \(H_0\) as follows: \[p_1 =p_2 = p\mid H_0 \sim \mathrm{Beta}(a_0^d,b_0^d)\] Thus, under \(H_0:\eta = 0\), both probabilities are identical, \(p_1=p_2\), and take some value \(p\in [0,1]\), which has a beta design prior. Likewise, we pick independent Beta design priors under \(H_1:\eta \neq 0\): \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^d,b_1^d), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^d,b_2^d)\] For the analysis priors \(P_{p_1}^a\), \(P_{p_2}^a\) under \(H_1\), we also choose independent Beta priors, with possibly different values \(a_i^a\) and \(b_i^a\) for \(i=1,2\), where the superscript signals that the hyperparameters belong to the analysis instead of the design prior: \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^a,b_1^a), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^a,b_2^a)\] Lastly, for the analysis prior \(P_{p}^a\) under \(H_0:\eta=0\), we choose a Dirac prior which puts all probability on \(\eta=p_2-p_1=0\), conditionally on a uniform prior on \(\zeta\), that is, \[p_1=p_2=p \mid H_0 \sim 1_{\{\eta=0\}} \mid \zeta \sim U(0,1)\] for the analysis with the Bayes factor.
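As a side note on conjugacy: the marginal likelihood of one arm's data under a Beta prior is available in closed form (the beta-binomial distribution), a property Bayes factor computations for this model can exploit. A minimal sketch of the integral worked out (not code from bfbin2arm):

```r
# marginal likelihood of y successes in n trials under a Beta(a, b) prior:
# integral of choose(n, y) * p^y * (1 - p)^(n - y) * dbeta(p, a, b) over p
marg_lik_arm <- function(y, n, a, b) {
  choose(n, y) * beta(a + y, b + n - y) / beta(a, b)
}
# under H1 the Beta priors are independent, so the joint marginal factorizes
marg_lik_H1 <- function(y1, n1, y2, n2, a1, b1, a2, b2) {
  marg_lik_arm(y1, n1, a1, b1) * marg_lik_arm(y2, n2, a2, b2)
}
# sanity check: under a flat Beta(1, 1) prior every outcome 0, ..., n is
# equally likely, so the marginal likelihood is 1 / (n + 1)
marg_lik_arm(3, 10, 1, 1)  # 1/11
```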
First, we load the package after installation:
library(bfbin2arm)
Next, we illustrate the key functions of the package by re-analyzing
a phase II trial in oncology. While no Bayesian approach
was used in the original statistical analysis of the trial, the
step-by-step walkthrough below showcases what a structured approach to
designing and calibrating a Bayesian phase II trial with the
bfbin2arm package looks like. Importantly, the trial must
have two arms and binary endpoints, and we assume that one of the
four tests detailed above is carried out using Bayes factors as the test
criterion.
The ICT-107 trial (Wen et al., 2019) was a randomized phase II study in newly diagnosed glioblastoma patients (n=124, 2:1 randomization). The primary binary endpoint is progression status at 6 months (PFS6), and the secondary binary endpoint is immunologic response status. Here, we focus on the secondary endpoint for illustration purposes.
Reported results (ITT population): 12 of 43 patients responded in the placebo (control) arm, and 49 of 81 patients responded in the ICT-107 (treatment) arm.
We start by calculating the Bayes factor(s) for the ICT-107 trial data:
## -------------------------------------------------------------
## 2. ICT-107 trial (immunologic response)
## Placebo (control): 12 responders, 31 non-responders
## ICT-107 (treatment): 49 responders, 32 non-responders
## -------------------------------------------------------------
y1_ict <- 12 # control successes
n1_ict <- 12 + 31
y2_ict <- 49 # treatment successes
n2_ict <- 49 + 32
cat("\n=== ICT-107 Trial (n1 =", n1_ict, ", n2 =", n2_ict, ") ===\n")
#>
#> === ICT-107 Trial (n1 = 43 , n2 = 81 ) ===
# BF01
BF01_ict <- twoarmbinbf01(y1_ict, y2_ict, n1_ict, n2_ict,
                          a_0_a = 1, b_0_a = 1,
                          a_1_a = 1, b_1_a = 1,
                          a_2_a = 1, b_2_a = 1)
# BF+1
BFp1_ict <- BFplus1(y1_ict, y2_ict, n1_ict, n2_ict,
                    a_1_d = 1, b_1_d = 1,
                    a_2_d = 1, b_2_d = 1)
# BF-1
BFm1_ict <- BFminus1(y1_ict, y2_ict, n1_ict, n2_ict,
                     a_1_d = 1, b_1_d = 1,
                     a_2_d = 1, b_2_d = 1)
# BF+0
cat("=== ICT-107 Trial === Bayes factor BF+0 results in ", BFplus0(BFp1_ict, BF01_ict))
#> === ICT-107 Trial === Bayes factor BF+0 results in 186.6192
# BF+-
cat("=== ICT-107 Trial === Bayes factor BF+- results in ", BFplusMinus(BFp1_ict, BFm1_ict))
#> === ICT-107 Trial === Bayes factor BF+- results in 3702.659

The most relevant Bayes factor here is \(BF_{+-}\), because it is directional and
leaves open the possibility of the placebo group having a larger
response rate than the treatment group. Note that the hyperparameters of
the beta analysis priors are specified in twoarmbinbf01 via
a_0_a = 1, b_0_a = 1, et cetera.
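As a rough plausibility check on the magnitude of \(BF_{+-}\): under a common encompassing \(\mathrm{Beta}(1,1)\) prior, the prior odds of \(p_2>p_1\) equal one, and the Bayes factor of \(H_+\) against \(H_-\) then reduces to the posterior odds of \(p_2>p_1\), which can be approximated by Monte Carlo from the independent Beta posteriors (a sketch, not the package's computation):

```r
# posterior odds of p2 > p1 under flat Beta(1, 1) analysis priors
set.seed(2024)
p1_draws <- rbeta(1e6, 1 + 12, 1 + 31)  # control posterior: 12/43 successes
p2_draws <- rbeta(1e6, 1 + 49, 1 + 32)  # treatment posterior: 49/81 successes
post_prob <- mean(p2_draws > p1_draws)
post_prob / (1 - post_prob)
```

The resulting odds are of a similar order of magnitude as the \(BF_{+-}\) value reported above.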
Now, a key question is which operating characteristics can be
expected based on the actual sample sizes used in the trial. The
powertwoarmbinbf01 function can provide the answer:
ict_results <- powertwoarmbinbf01(
n1 = n1_ict, n2 = n2_ict,
k = 1/3, k_f = 3,
test = "BF+-", # H+: p2 > p1 vs H-: p2 <= p1
a_0_d = 1, b_0_d = 1, a_0_a = 1, b_0_a = 1,
a_1_d = 1, b_1_d = 1, a_2_d = 1, b_2_d = 1,
a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1,
output = "numeric",
  compute_freq_t1e = TRUE
)
print(ict_results)
#> Power Type1_Error CE_H0
#> 0.8788106 0.0214111 0.8788106
#> Frequentist_Type1_Error
#> 0.2871811
#> attr(,"hypothesis")
#> [1] "H[+]:~p[2] > p[1] ~~ vs ~~ H[-]:~p[2] <= p[1]"
#> attr(,"compute_freq_t1e")
#> [1] TRUE

We see that, based on the actual sample sizes and a moderate evidence
threshold \(k=1/3\), the Bayesian power
is sufficiently large at \(87.8\%\).
Still, the frequentist type-I-error rate is far too high at \(28.7\%\), so we increase the evidence
threshold to \(k=1/10\) (strong
evidence) and next use the ntwoarmbinbf01 function to calibrate
the design based on our requirements.
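The thresholds k and k_f act as evidence cutoffs for the trial decision. A minimal sketch of one plausible reading (the helper below is hypothetical, not the package's implementation), where bf denotes the Bayes factor in favor of the null-type hypothesis (\(H_-\) for the BF+- test):

```r
# hypothetical decision helper: small bf (<= k) means compelling evidence
# against the null-type hypothesis, large bf (>= k_f) means compelling
# evidence for it; anything in between is inconclusive
decide <- function(bf, k = 1/10, k_f = 10) {
  if (bf <= k) "compelling evidence for the alternative"
  else if (bf >= k_f) "compelling evidence for the null"
  else "inconclusive"
}
decide(1 / 3702.659)  # ICT-107 data: BF-+ = 1 / BF+-, clearly favors H+
```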
The core working function to design a Bayesian trial with the package
is the ntwoarmbinbf01 function. It provides a method to
calibrate a Bayesian design in terms of Bayesian and frequentist power, the Bayesian and frequentist type-I-error rate, and the probability of compelling evidence for the null hypothesis.
The function makes use of parallelization, and it is recommended to run it on a computer with multiple cores to speed up the computations. First, we perform a sample size search for an ICT-107-type trial (balanced arms) under flat design priors and substantial evidence thresholds, using the directional Bayes factor \(BF_{+-}\):
ntwoarmbinbf01(
k = 1/10, k_f = 10,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 75), n_step = 1,
progress = FALSE,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot" # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 75 (step = 1, 66 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.100, k_f = 10.000
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> POWER not reached: max=0.754 at n_total=75
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) not reached: max=0.754 at n_total=75
#> FREQUENTIST Type-I error TOO HIGH: max(sup)=0.120 > 0.05
#> Frequentist power >= 0.80 achieved at n_total=50 (p1=0.30, p2=0.60)
The function arguments are
- the Bayes factor thresholds k = 1/10 and k_f = 10,
- the targeted power, the type-I-error rate alpha, and the probability of compelling evidence pce_H0 under \(H_-\) (or \(H_0\) for other tests),
- the test to be used (BF+-, BF01, BF+0 or BF-0),
- the sample size range nrange over which the design calibration operates (increasing the upper sample size requires more time for the calibration to finish),
- the step size n_step (we recommend using n_step = 1 always, except for quick checks where using n_step = 5 or n_step = 10 can decrease computing times significantly),
- progress, which shows a progress bar at the console (we strongly recommend setting it to TRUE; it is only set to FALSE here to avoid cluttering the output of this vignette),
- compute_freq_t1e, which sets whether the frequentist type-I-error rate should be computed, too; if so, p1_power and p2_power must be specified, denoting the assumed success probabilities in the control and treatment arms for frequentist power calculations,
- output, which can be set to plot or numeric.
The resulting output plots the design and analysis priors at the top
row, the resulting power and type-I-error rate curves as functions of
the sample size with markers for which sample sizes the design achieves
the required calibration thresholds (middle row), and the probability of
compelling evidence for the null hypothesis (in this case, the
hypothesis \(H_-\)) in the bottom row.
Note that the oscillations happen due to the discrete nature of the
binomial distribution, and the package algorithm ensures that for the
next 10 sample sizes, the power does not drop below the required
threshold. Likewise, the package ensures that the type-I-error rate does
not rise above the required alpha level, and that the probability of
compelling evidence does not drop below its required threshold. It is
straightforward to check this visually by means of the provided output
plots, too. If no plots are required, use the option
numeric instead of plot for the output
argument.
The resulting plot shows that while the type-I-error rate is calibrated already for \(n=10\) patients in total, the Bayesian power does not reach our desired level of 80% even for \(n=75\) patients in total (across both arms). We could increase the range, or alternatively use more informative design priors under which the hypotheses under comparison are separated more clearly. Right now, we essentially assume that everything is equally likely under our design priors, although we should have a clear expectation about the probabilities in the treatment and control arms. Thus, we modify our design priors next. Note that the plot also shows that the frequentist power is calibrated for \(n=50\) patients in total when assuming \(p_1=0.3\) (control arm probability) and \(p_2=0.6\) (treatment arm probability).
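Before switching to informative design priors, a quick aside on how Beta hyperparameters encode expectations: the mean of a \(\mathrm{Beta}(a,b)\) distribution is \(a/(a+b)\). The sketch below shows the means of the \(\mathrm{Beta}(1,2)\) and \(\mathrm{Beta}(2,1)\) priors used in the next calibration:

```r
# mean of a Beta(a, b) distribution is a / (a + b); the informative design
# priors Beta(1, 2) and Beta(2, 1) center p1 at 1/3 and p2 at 2/3
beta_mean <- function(a, b) a / (a + b)
c(p1_prior_mean = beta_mean(1, 2),   # control arm,   Beta(1, 2)
  p2_prior_mean = beta_mean(2, 1))   # treatment arm, Beta(2, 1)
```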
Now, the example above used flat design priors, which might be
unrealistic in a variety of settings. Next, we perform a sample size
search for a new ICT-107-type trial (balanced arms) under informative
design priors with very strong evidence thresholds. Notice the
additionally specified parameters a_1_d = 1, b_1_d = 2 and
a_2_d = 2, b_2_d = 1, which are the hyperparameters
of the Beta design priors for \(p_1\) and \(p_2\) under \(H_+\).
ntwoarmbinbf01(
k = 1/30, k_f = 30,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 100), n_step = 1,
progress = FALSE,
a_1_d = 1, b_1_d = 2,
a_2_d = 2, b_2_d = 1,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot" # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.033, k_f = 30.000
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> Power >= 0.80 achieved at n_total=72
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) not reached: max=0.715 at n_total=100
#> Frequentist Type-I error <= 0.05 achieved (max(sup)=0.041)
#> Frequentist power >= 0.80 achieved at n_total=77 (p1=0.30, p2=0.60)
We see that now the Bayesian power is calibrated for \(n=72\) patients in total, while the
frequentist power is calibrated for \(n=77\) patients in total. Importantly,
the frequentist type-I-error rate is now only \(0.041<0.05\), as stated by the console
output of the function. Thus, the design is fully calibrated except for
the probability of compelling evidence for \(H_-\) shown in the bottom plot.
Therefore, we next perform a sample size search for a new ICT-107-type
trial (balanced arms) under informative design priors with very strong
evidence thresholds, where the design prior under \(H_-\) is modified so that the required
probability of compelling evidence \(PCE(H_0)\) is reached at smaller sample
sizes. Note that now, additionally, the hyperparameters of
the Beta design priors for \(p_1\) and
\(p_2\) under \(H_-\) are specified via
a_1_d_Hminus = 2, b_1_d_Hminus = 1 and
a_2_d_Hminus = 1, b_2_d_Hminus = 2:
ntwoarmbinbf01(
k = 1/30, k_f = 30,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 100), n_step = 1,
  progress = FALSE,
a_1_d = 1, b_1_d = 2,
a_2_d = 2, b_2_d = 1,
a_1_d_Hminus = 2, b_1_d_Hminus = 1,
a_2_d_Hminus = 1, b_2_d_Hminus = 2,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot" # Returns recommended n per group
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.033, k_f = 30.000
#> Allocation: alloc1 = 0.500, alloc2 = 0.500
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> Power >= 0.80 achieved at n_total=72
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) >= 0.80 achieved at n_total=72
#> Frequentist Type-I error <= 0.05 achieved (max(sup)=0.041)
#> Frequentist power >= 0.80 achieved at n_total=77 (p1=0.30, p2=0.60)
The result is a fully calibrated Bayesian design which meets Bayesian
and frequentist power demands, Bayesian and frequentist type-I-error
rate requirements and our requirement on the probability of compelling
evidence for \(H_0\) (that is, \(H_-\) in this case).
The analysis with the bfbin2arm package reveals several aspects. If a
balanced design with equal randomization probabilities is desired,
then \(n=72\) patients in total suffice to meet the Bayesian power, type-I-error and compelling-evidence requirements, while \(n=77\) patients in total are required to additionally meet the frequentist power requirement under the assumed \(p_1=0.3\) and \(p_2=0.6\).
In the original ICT-107 trial, \(2/3\) of the patients were randomized into
the treatment group, while \(1/3\) of
the patients were randomized into the control group. We can use the
parameters alloc1 and alloc2 to specify
randomization probabilities for the control and treatment arms and carry
out the Bayesian sample size calculations based on these randomization
probabilities. As an example, we rerun the last calibration, but use the
randomization probabilities of the ICT-107 trial:
ntwoarmbinbf01(
k = 1/30, k_f = 30,
power = 0.8, alpha = 0.05, pce_H0 = 0.8,
test = "BF+-",
nrange = c(10, 100), n_step = 1,
progress = FALSE,
a_1_d = 1, b_1_d = 2,
a_2_d = 2, b_2_d = 1,
a_1_d_Hminus = 2, b_1_d_Hminus = 1,
a_2_d_Hminus = 1, b_2_d_Hminus = 2,
compute_freq_t1e = TRUE,
p1_power = 0.3, p2_power = 0.6,
output = "plot", # Returns recommended n per group
alloc1 = 1/3,
alloc2 = 2/3
)
#> Frequentist power computation: p1=0.30, p2=0.60
#> Computing for total n = 10 to 100 (step = 1, 91 values)
#> Allocation: alloc1 = 0.333, alloc2 = 0.667
#> Frequentist Type-I error computation: ENABLED
#> Frequentist power computation: ENABLED
#>
#>
#> Simulation complete.
#> SUMMARY for BF+-:
#> Hypotheses: BF+- test: H+: p2 > p1 vs H-: p1 > p2
#> k = 0.033, k_f = 30.000
#> Allocation: alloc1 = 0.333, alloc2 = 0.667
#> Target power = 0.80, alpha = 0.05, P(CE|H0) = 0.80
#> Power >= 0.80 achieved at n_total=83
#> Bayesian Type-I error <= 0.05 achieved at n_total=10
#> P(CE|H0) >= 0.80 achieved at n_total=83
#> FREQUENTIST Type-I error TOO HIGH: max(sup)=0.050 > 0.05
#> Frequentist power >= 0.80 achieved at n_total=86 (p1=0.30, p2=0.60)
Remember that the sample size shown at the x-axis in the power and
type-I-error rate plot as well as in the probability of compelling
evidence plot is the total sample size in both arms. We see that now we
need \(n=83\) patients in total to
reach Bayesian power of 80%, while \(n=86\) patients in total are required for
frequentist power calibration of 80%. The probability of compelling
evidence reaches 80% at \(n=83\)
patients in total. Note, however, that the frequentist type-I-error rate
is now exactly at the boundary, which might be too liberal for some. As
the frequentist type-I-error rate assumes fixed success probabilities in
both trial arms and is independent of the design priors, we can decrease
the evidence threshold \(k\) slightly
to bring the frequentist type-I-error rate down accordingly. Just try it
out yourself: decrease \(k\) from
\(k=1/30\) to \(k=1/40\) and rerun the last code block.
If the original 2:1 randomization of the ICT-107 trial is used and two thirds of the patients are randomized into the treatment group, then to fulfill all four requirements it suffices to enroll 32 patients in the control arm and 64 patients in the treatment arm, using the Bayes factor thresholds \(k=1/40\) and \(k_f=30\) for decision making about the hypotheses \(H_+\) and \(H_-\) under consideration.
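The per-arm numbers follow from the allocation probabilities applied to the total sample size (a sketch of the arithmetic; the package may round differently):

```r
# per-arm sample sizes implied by the total sample size and 1:2 allocation
n_total <- 96                      # total implied by 32 + 64 above
c(n1 = round(n_total * 1/3),       # control arm   (alloc1 = 1/3) -> 32
  n2 = round(n_total * 2/3))       # treatment arm (alloc2 = 2/3) -> 64
```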
If desired, we can also compare the predictive densities under the different hypotheses directly via:
pred_H0 <- predictiveDensityH0(y1_ict, y2_ict, n1_ict, n2_ict)
pred_H1 <- predictiveDensityH1(y1_ict, y2_ict, n1_ict, n2_ict)
pred_Hplus <- predictiveDensityHplus_trunc(y1_ict, y2_ict, n1_ict, n2_ict)
data.frame(
Hypothesis = c("H0: p1=p2", "H1: p1 != p2", "H+: p2>p1"),
"Pred. Density" = round(c(pred_H0, pred_H1, pred_Hplus), 4)
)
#> Hypothesis Pred..Density
#> 1 H0: p1=p2 0e+00
#> 2 H1: p1 != p2 3e-04
#> 3 H+: p2>p1 6e-04

Wen PY, et al. (2019). A Randomized Double-Blind Placebo-Controlled Phase II Trial of Dendritic Cell Vaccine ICT-107. Clinical Cancer Research. [PMID: 31320597]