The CLRtools package provides a suite of functions designed to support the structured development of conditional logistic regression models, guided by the Purposeful Selection Method introduced by Hosmer, Lemeshow, and Sturdivant in Applied Logistic Regression (2013). This method outlines a systematic, seven-step process for constructing multivariable models, adapted here for use with matched case-control data.
This vignette parallels the approach presented in the Logistic Regression vignette but adapts it for use with matched data. The same modelling process is followed, using functions and methods appropriate for conditional logistic regression analysis. The dataset used is GLOW11M, derived from the original GLOW500 dataset for illustrative purposes. GLOW11M is a 1:1 matched case-control dataset, where each woman who experienced a fracture was matched with a woman of the same age who did not experience a fracture.
In the following sections, we walk through each of the seven steps of the Purposeful Selection Method, applying them to the GLOW11M dataset using functions from CLRtools.
The first step involves evaluating each predictor individually using
univariable conditional logistic regression models. The
univariable.clogmodels() function is used to fit each model
and generate a summary table. It is specifically adapted for conditional
logistic regression, with the strata argument allowing the user to
specify the variable that identifies each matched set. The output
includes optional odds ratios, confidence intervals, and p-values based
on the Wald test.
Additionally, the discordant.pairs() function in the
package calculates the number of discordant (informative) pairs for each
categorical variable. Together, the univariable model results and
discordant pair counts correspond to Table 7.1 in Chapter 7 of the book,
and are used to guide the selection of candidate variables for the
multivariable model.
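For a dichotomous covariate in a 1:1 matched design, only the discordant pairs contribute to the conditional likelihood, and the univariable odds ratio estimate reduces to the ratio of the two discordant-pair counts:

\[
\widehat{OR} = \frac{n_{10}}{n_{01}},
\]

where \(n_{10}\) is the number of pairs in which only the case is exposed and \(n_{01}\) the number in which only the control is exposed. For priorfrac, for instance, the 53 discordant pairs split as 37 to 16, which gives the odds ratio of 2.3125 reported in the table below.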
Variables with p-values below 0.25 are considered candidates for
inclusion in the initial multivariable model. According to this
criterion, the selected variables are: height,
priorfrac, premeno, momfrac,
armassist, and raterisk. Although
weight and bmi were not statistically
significant on their own, the authors retained them because they are
interrelated with height.
# Predictor variables to evaluate
var_to_test <- c('height', 'weight', 'bmi', 'priorfrac', 'premeno', 'momfrac', 'armassist', 'smoke', 'raterisk')
# OR scaling factors, one per estimated coefficient (ten in total, since raterisk contributes two design variables): ORs per 10 units for height and weight, per 3 units for bmi
OR_values <- c(10, 10, 3, 1, 1, 1, 1, 1, 1, 1)
# Generate the univariable models table
univariable.clogmodels(glow11m, yval = 'fracture', xval = var_to_test, strata = 'pair', OR = TRUE, inc.or = OR_values)
#> coeff se p.val estim.OR low.limit
#> height -0.057313512 0.023793530 0.016005701 0.5637552 0.3536386
#> weight -0.001375754 0.008380274 0.869600692 0.9863367 0.8369358
#> bmi 0.019109958 0.022938647 0.404793975 1.0590051 0.9253836
#> priorfracYes 0.838329190 0.299210673 0.005081799 2.3125000 1.2864507
#> premenoYes 0.693147178 0.462910050 0.134297259 2.0000000 0.8072355
#> momfracYes 0.510825624 0.365148372 0.161826902 1.6666667 0.8147679
#> armassistYes 0.632522558 0.300122524 0.035070125 1.8823529 1.0452888
#> smokeYes -0.336472237 0.585540044 0.565537675 0.7142857 0.2267041
#> rateriskSame 0.551913855 0.290930672 0.057819605 1.7365734 0.9818666
#> rateriskGreater 1.024804405 0.366941433 0.005224942 2.7865504 1.3574561
#> up.limit
#> height 0.8987139
#> weight 1.1624070
#> bmi 1.2119209
#> priorfracYes 4.1569072
#> premenoYes 4.9551835
#> momfracYes 3.4092874
#> armassistYes 3.3897355
#> smokeYes 2.2505287
#> rateriskSame 3.0713816
#> rateriskGreater 5.7201578
# Obtaining the discordant pairs
discordant.pairs(glow11m, outcome = 'fracture', strata = 'pair')
#> priorfrac premeno momfrac armassist
#> 53 21 32 49
#> smoke raterisk total.valid.pairs
#> 12 86 119

Next, we build a multivariable conditional logistic regression model incorporating the variables identified in Step 1. At this stage, variables that do not show statistical significance or lack clinical importance are flagged as potential candidates for removal and subjected to further evaluation.
summary(clogit(as.numeric(fracture) ~ height + weight + bmi + priorfrac + premeno + momfrac + armassist + raterisk + strata(pair), data = glow11m))
#> Call:
#> coxph(formula = Surv(rep(1, 238L), as.numeric(fracture)) ~ height +
#> weight + bmi + priorfrac + premeno + momfrac + armassist +
#> raterisk + strata(pair), data = glow11m, method = "exact")
#>
#> n= 238, number of events= 119
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> height 0.06327 1.06531 0.12205 0.518 0.6042
#> weight -0.15423 0.85707 0.13104 -1.177 0.2392
#> bmi 0.38651 1.47183 0.34170 1.131 0.2580
#> priorfracYes 0.69351 2.00072 0.35377 1.960 0.0500 *
#> premenoYes 0.21800 1.24359 0.55232 0.395 0.6931
#> momfracYes 0.72541 2.06558 0.43262 1.677 0.0936 .
#> armassistYes 0.81779 2.26549 0.38241 2.139 0.0325 *
#> rateriskSame 0.15157 1.16366 0.34117 0.444 0.6569
#> rateriskGreater 0.58880 1.80182 0.42556 1.384 0.1665
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> height 1.0653 0.9387 0.8387 1.353
#> weight 0.8571 1.1668 0.6629 1.108
#> bmi 1.4718 0.6794 0.7534 2.876
#> priorfracYes 2.0007 0.4998 1.0001 4.002
#> premenoYes 1.2436 0.8041 0.4213 3.671
#> momfracYes 2.0656 0.4841 0.8847 4.823
#> armassistYes 2.2655 0.4414 1.0707 4.794
#> rateriskSame 1.1637 0.8594 0.5962 2.271
#> rateriskGreater 1.8018 0.5550 0.7825 4.149
#>
#> Concordance= 0.697 (se = 0.06 )
#> Likelihood ratio test= 27.91 on 9 df, p=0.001
#> Wald test = 19.78 on 9 df, p=0.02
#> Score (logrank) test = 24.66 on 9 df, p=0.003

Several variables are not statistically significant at this stage; the
initial focus is on premeno and raterisk as candidates for removal and
further analysis. In the book, a partial likelihood ratio test was
performed comparing models with and without these variables, resulting
in a p-value of 0.49, indicating that their removal did not lead to a
significant change in the model.
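For readers who want to reproduce that comparison, here is a minimal sketch using survival; the model names step2.full and step2.reduced are illustrative, and anova() on the two nested clogit fits carries out the partial likelihood ratio test.

library(survival)
# Step 2 model with all candidate variables
step2.full <- clogit(as.numeric(fracture) ~ height + weight + bmi + priorfrac + premeno + momfrac + armassist + raterisk + strata(pair), data = glow11m)
# Reduced model without premeno and raterisk
step2.reduced <- clogit(as.numeric(fracture) ~ height + weight + bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m)
# Partial likelihood ratio test on 3 df (premeno plus the two raterisk design variables)
anova(step2.reduced, step2.full)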
In this step, we evaluate whether any of the variables considered for removal may act as confounders by examining their influence on the coefficients of the other variables retained in the model. For each candidate variable, we compare the coefficient estimates from a model that includes the variable to a model that excludes it. This shows whether excluding the variable substantially alters the estimated effects of the remaining predictors. The percentage change in the coefficients, denoted as delta beta hat (\(\Delta \hat{\beta}\%\)), is used as a diagnostic measure. As recommended by Hosmer et al. (2013), a change exceeding 20% in any coefficient suggests the presence of confounding, and the variable should be retained in the model for adjustment.
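Concretely, the percentage change for each retained coefficient takes the usual form

\[
\Delta \hat{\beta}\% = 100 \times \frac{\hat{\theta} - \hat{\beta}}{\hat{\beta}},
\]

where \(\hat{\beta}\) is the estimate from the model that includes the candidate variable and \(\hat{\theta}\) the estimate from the model that excludes it.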
As with the logistic regression approach, the
check_coef_change() function can be used to automate this
comparison and calculate the percentage change in coefficients for each
added variable. For conditional logistic regression, the strata argument
must also be specified to account for the matched design.
# Variables that were not significant in Step 2 to check
candidate.exclusion <- c('premeno', 'raterisk')
# Variables retained in the preliminary model after Step 2
var.preliminar <- c('height', 'weight', 'bmi', 'priorfrac', 'momfrac', 'armassist')
# Check for confounding
check_coef_change(data = glow11m, yval = 'fracture', xpre = var.preliminar, xcheck = candidate.exclusion, strata = 'pair')
#> premeno raterisk
#> height 0.2595129 -14.6186189
#> weight 1.1046352 -2.4501722
#> bmi 0.5586207 -4.7535031
#> priorfracYes 2.5485736 19.5540103
#> momfracYes -1.1016684 -0.4431701
#> armassistYes 4.9012633 4.3504119

None of the coefficient changes exceeds the 20% threshold, so neither
premeno nor raterisk needs to be retained as a confounder. For the
variables height, weight, and bmi, the authors fitted three models,
each including a different combination of two of the three variables.
They found that the model containing weight and bmi had the largest
(i.e., best) log-likelihood, and both variables were statistically
significant at the 5% level. Therefore, this model was selected for
the next steps.
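A minimal sketch of that comparison, assuming (consistent with the preceding steps) that priorfrac, momfrac, and armassist remain in each model:

# Log-likelihoods of the three two-variable combinations of height, weight, and bmi
logLik(clogit(as.numeric(fracture) ~ height + weight + priorfrac + momfrac + armassist + strata(pair), data = glow11m))
logLik(clogit(as.numeric(fracture) ~ height + bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m))
logLik(clogit(as.numeric(fracture) ~ weight + bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m))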
summary(clogit(as.numeric(fracture) ~ weight + bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m))
#> Call:
#> coxph(formula = Surv(rep(1, 238L), as.numeric(fracture)) ~ weight +
#> bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m,
#> method = "exact")
#>
#> n= 238, number of events= 119
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> weight -0.09471 0.90964 0.02994 -3.163 0.00156 **
#> bmi 0.22244 1.24912 0.08098 2.747 0.00602 **
#> priorfracYes 0.83492 2.30464 0.33965 2.458 0.01396 *
#> momfracYes 0.72659 2.06802 0.40928 1.775 0.07585 .
#> armassistYes 0.88876 2.43212 0.36663 2.424 0.01535 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> weight 0.9096 1.0993 0.8578 0.9646
#> bmi 1.2491 0.8006 1.0658 1.4640
#> priorfracYes 2.3046 0.4339 1.1844 4.4845
#> momfracYes 2.0680 0.4836 0.9272 4.6125
#> armassistYes 2.4321 0.4112 1.1855 4.9896
#>
#> Concordance= 0.672 (se = 0.061 )
#> Likelihood ratio test= 25.3 on 5 df, p=1e-04
#> Wald test = 18.78 on 5 df, p=0.002
#> Score (logrank) test = 22.67 on 5 df, p=4e-04

In this step, we evaluate the variables that were excluded in Step 1
by reintroducing them into the model one at a time. Although these
variables were not individually significant in univariable analysis,
their relationship with the outcome may change when adjusted for other
covariates. The check_coef_significant() function automates this
process by fitting each model with an added variable and reporting the
corresponding coefficient estimates, standard errors, and Wald test
statistics (z and p-values). When applying this function to conditional
logistic regression, the strata argument must be included to properly
account for the matched design.
# Variables excluded in Step 1 to check
excluded <- c('smoke')
# Preliminary model variables (retained after Steps 2 and 3)
var.preliminar <- c('weight', 'bmi', 'priorfrac', 'momfrac', 'armassist')
check_coef_significant(data = glow11m, yval = 'fracture', xpre = var.preliminar, xcheck = excluded, strata = 'pair')
#> smokeYes
#> coef -0.7581140
#> exp(coef) 0.4685493
#> se(coef) 0.6590178
#> z -1.1503694
#> Pr(>|z|) 0.2499918

The variable smoke did not become statistically significant when added
individually to the multivariable model and, therefore, was not
retained.
In this step, we need to assess whether each continuous predictor has a roughly linear association with the log-odds of the outcome, which is an important assumption in conditional logistic regression. The book outlines several strategies for this purpose, such as LOWESS smoothing, quartile-based design variables, fractional polynomials, and spline functions. Because the choice of method depends heavily on the dataset and research context, and because many established R packages already support these diagnostics, no dedicated function is included in this package. Users are encouraged to select an approach that aligns with their modelling goals and data characteristics.
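As one illustration of the quartile-based strategy, the sketch below (plain base R and survival, not a CLRtools function; weight.q and quart.model are illustrative names) replaces weight with quartile design variables. If the relationship is linear in the logit, the estimated coefficients should increase or decrease roughly linearly across the quartiles.

# Quartile design variables for weight
glow11m$weight.q <- cut(glow11m$weight, breaks = quantile(glow11m$weight, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
quart.model <- clogit(as.numeric(fracture) ~ weight.q + bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m)
# Coefficients for quartiles 2-4 relative to quartile 1
coef(quart.model)[grep('weight.q', names(coef(quart.model)), fixed = TRUE)]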
For this dataset, the continuous variables to be assessed for
linearity are weight and bmi. After evaluating
their relationships with the logit, the authors concluded that no
transformation was necessary, as both variables appeared approximately
linear. As a result, the model from the previous step remains the
same.
In this step, interaction terms between variables in the preliminary main-effects model (from Step 5) are evaluated one at a time. This helps determine whether the effect of one variable varies across levels of another, which may reveal important effect modifications. Interactions that are both statistically significant, based on likelihood ratio tests, and clinically meaningful are retained, while those lacking significance are excluded. Depending on the context, a 5% or 10% significance level may be applied.
The check_interactions() function automates this process by
fitting each interaction model and reporting the model’s log-likelihood,
the likelihood ratio test result comparing it to the main-effects model,
and the corresponding p-value. For conditional logistic regression,
users must specify the strata argument to correctly account for the
matched structure of the data.
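In the output, G is the likelihood ratio statistic comparing each interaction model to the main-effects model,

\[
G = -2\left(L_{\text{main}} - L_{\text{interaction}}\right),
\]

where \(L\) denotes the maximized log-likelihood. Since each model adds a single interaction term, G is compared with a chi-square distribution on one degree of freedom.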
var.maineffects <- c('weight', 'bmi', 'priorfrac', 'momfrac', 'armassist')
check_interactions(data = glow11m, variables = var.maineffects, yval = 'fracture', rounding = 4, strata = 'pair')
#> log.likh G p_value
#> 1 -69.8344 NA NA
#> weight*bmi -69.6172 0.4343 0.5099
#> weight*priorfrac -69.0815 1.5056 0.2198
#> weight*momfrac -69.7965 0.0757 0.7832
#> weight*armassist -69.8047 0.0594 0.8075
#> bmi*priorfrac -69.0820 1.5047 0.2200
#> bmi*momfrac -69.8017 0.0653 0.7984
#> bmi*armassist -69.8308 0.0072 0.9325
#> priorfrac*momfrac -69.7576 0.1535 0.6952
#> priorfrac*armassist -68.2461 3.1764 0.0747
#> momfrac*armassist -69.1492 1.3703 0.2418

For this dataset, no interactions were statistically significant at the 5% level. Therefore, the final model includes only the main effects.
final.model <- clogit(as.numeric(fracture) ~ weight + bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m)
summary(final.model)
#> Call:
#> coxph(formula = Surv(rep(1, 238L), as.numeric(fracture)) ~ weight +
#> bmi + priorfrac + momfrac + armassist + strata(pair), data = glow11m,
#> method = "exact")
#>
#> n= 238, number of events= 119
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> weight -0.09471 0.90964 0.02994 -3.163 0.00156 **
#> bmi 0.22244 1.24912 0.08098 2.747 0.00602 **
#> priorfracYes 0.83492 2.30464 0.33965 2.458 0.01396 *
#> momfracYes 0.72659 2.06802 0.40928 1.775 0.07585 .
#> armassistYes 0.88876 2.43212 0.36663 2.424 0.01535 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> weight 0.9096 1.0993 0.8578 0.9646
#> bmi 1.2491 0.8006 1.0658 1.4640
#> priorfracYes 2.3046 0.4339 1.1844 4.4845
#> momfracYes 2.0680 0.4836 0.9272 4.6125
#> armassistYes 2.4321 0.4112 1.1855 4.9896
#>
#> Concordance= 0.672 (se = 0.061 )
#> Likelihood ratio test= 25.3 on 5 df, p=1e-04
#> Wald test = 18.78 on 5 df, p=0.002
#> Score (logrank) test = 22.67 on 5 df, p=4e-04

To evaluate the fit of the conditional logistic regression model, the authors applied an extension of regression diagnostics specifically designed for conditional logistic regression. These diagnostics can be obtained using the residuals_clog() function from the package, which returns a dataset of residuals and leverage values and produces the diagnostic plots described in Chapter 7 of the book.
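Writing \(r\) for the Pearson residual and \(h\) for the leverage of an observation, the remaining diagnostic columns in the output below follow the usual relationships

\[
r_{s} = \frac{r}{\sqrt{1-h}}, \qquad \Delta\chi^{2} = r_{s}^{2}, \qquad \Delta\hat{\beta} = \frac{r_{s}^{2}\,h}{1-h},
\]

so the spearson, deltachi, and deltabeta columns can be recovered from the pearson and leverage columns.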
# Converting the outcome from "Yes"/"No" to binary 0/1, as required by the function
glow11m$fracture <- ifelse(glow11m$fracture == "Yes", 1, 0)
# Compute the diagnostics and preview the first rows of the results
head(residuals_clog(final.model, strata = glow11m$pair, y = glow11m$fracture, plot = TRUE)$data.results)
#> weight bmi priorfracYes momfracYes armassistYes y pair theta
#> 1 113.4 41.65289 1 0 1 1 1 0.4818595
#> 2 101.2 40.53838 0 0 1 0 1 0.5181405
#> 3 62.6 26.05619 0 0 0 1 2 0.7296054
#> 4 70.8 25.08504 0 0 0 0 2 0.2703946
#> 5 81.6 31.09282 0 0 1 1 3 0.7098407
#> 6 96.2 33.28720 0 0 1 0 3 0.2901593
#> pearson leverage spearson deltachi deltabeta
#> 1 0.7464270 0.022287916 0.7548869 0.5698542 0.0129903911
#> 2 -0.7198198 0.020727279 -0.7273977 0.5291075 0.0111990853
#> 3 0.3165585 0.005470466 0.3174280 0.1007605 0.0005542389
#> 4 -0.5199948 0.014760949 -0.5238757 0.2744457 0.0041117727
#> 5 0.3443944 0.004628063 0.3451941 0.1191589 0.0005540392
#> 6 -0.5386644 0.011322013 -0.5417399 0.2934821 0.0033608598
residuals_clog(final.model, strata = glow11m$pair, y = glow11m$fracture, plot = TRUE)$leverage
residuals_clog(final.model, strata = glow11m$pair, y = glow11m$fracture, plot = TRUE)$change.Pearsonchi

To investigate outliers in the matched dataset, the residuals output from the residuals_clog() function can be used. By examining diagnostic plots, we look for matched pairs whose residual values are distant from the rest. The code below filters such pairs and retrieves the six matched sets identified as noteworthy in Table 7.4, which will be further investigated.
# Obtain residual diagnostics for the conditional logistic regression model.
residuals.data <- residuals_clog(final.model, strata = glow11m$pair, y = glow11m$fracture, plot = TRUE)$data.results
# Summarize the delta chi-squared and delta beta for each matched pair (assumes dplyr is loaded for %>% and group_by()).
glow.match <- residuals.data %>% group_by(pair) %>% dplyr::summarise(sum_deltachi = sum(deltachi), sum_deltabeta = sum(deltabeta))
# Identify observations considered potential outliers based on thresholds read from the plots.
outliers <- subset(residuals.data, deltachi > 3 | deltabeta > 0.15)
outliers.match <- subset(glow.match, sum_deltachi > 3 | sum_deltabeta > 0.15)
# Subset all observations belonging to the identified outlier pairs.
glow.outliers <- subset(residuals.data, pair %in% outliers$pair)
# Reorder and display the final table of potential outliers and their diagnostics.
glow.outliers.t <- glow.outliers[, c(7, 1:5, 8, 12, 13, 10)]
glow.outliers.t
#> pair weight bmi priorfracYes momfracYes armassistYes theta
#> 73 37 72.6 26.03177 0 0 0 0.1898475
#> 74 37 64.4 25.79715 0 1 0 0.8101525
#> 75 38 72.6 26.66667 0 0 0 0.1756779
#> 76 38 52.2 21.17733 1 0 0 0.8243221
#> 99 50 111.6 43.59375 0 0 1 0.2401537
#> 100 50 57.2 22.34375 0 1 1 0.7598463
#> 133 67 55.3 20.06822 1 0 0 0.7991524
#> 134 67 67.6 22.85019 0 0 0 0.2008476
#> 199 100 88.5 33.72199 1 0 1 0.8041247
#> 200 100 57.2 21.79546 0 0 0 0.1958753
#> 233 117 57.2 23.80853 1 0 0 0.1809594
#> 234 117 50.8 20.60936 1 1 1 0.8190406
#> deltachi deltabeta leverage
#> 73 3.5700055 0.116451535 0.031589012
#> 74 0.8161943 0.006086877 0.007402428
#> 75 3.9737840 0.108766824 0.026641879
#> 76 0.8290293 0.004733994 0.005677863
#> 99 2.5908078 0.201145194 0.072044620
#> 100 0.7775512 0.018117460 0.022770104
#> 133 0.8036264 0.004499104 0.005567333
#> 134 3.2517798 0.073664766 0.022151855
#> 199 0.8097653 0.005680171 0.006965728
#> 200 3.3983447 0.100040981 0.028596327
#> 233 3.8625579 0.162020662 0.040257796
#> 234 0.8263910 0.007416372 0.008894587
outliers.match
#> # A tibble: 6 × 3
#> pair sum_deltachi sum_deltabeta
#> <int> <dbl> <dbl>
#> 1 37 4.39 0.123
#> 2 38 4.80 0.114
#> 3 50 3.37 0.219
#> 4 67 4.06 0.0782
#> 5 100 4.21 0.106
#> 6 117 4.69 0.169