Automated Data Auditing for Causal Studies

Before diving into causal estimation, a critical but often overlooked step is data auditing: systematically checking which variables in your dataset might introduce bias, confounding, or estimation problems.

The audit_data() function in causaldef automates this process by testing each variable against the treatment and outcome to classify its causal role.

Why Audit Your Data?

Traditional exploratory data analysis (EDA) tools check for:

- Missing values
- Distributional skew
- Outliers

But they miss causal validity issues:

- Which variables are confounders that MUST be adjusted for?
- Which variables are potential instruments?
- Which variables might serve as negative controls?
- Which variables are “leaky” and could bias your analysis?

Based on the manuscript’s negative control certification and bounding logic (thm:nc_bound), audit_data() systematically evaluates each variable’s relationship to the treatment and the outcome.
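
At its core, the audit amounts to testing each covariate’s association with the treatment and with the outcome. A simplified base-R sketch of that idea (illustrative only, not causaldef’s actual implementation; all names and the toy data below are hypothetical):

```r
# Simplified sketch of what an audit loop might do: test each covariate's
# association with the treatment and with the outcome via cor.test().
audit_sketch <- function(data, treatment, outcome, covariates) {
  t(sapply(covariates, function(v) {
    c(p_treat   = cor.test(data[[v]], data[[treatment]])$p.value,
      p_outcome = cor.test(data[[v]], data[[outcome]])$p.value)
  }))
}

# Toy data: 'severity' drives both treatment and outcome; 'noise' drives neither
set.seed(42)
n <- 300
d <- data.frame(severity = rnorm(n))
d$trt   <- rbinom(n, 1, plogis(d$severity))
d$y     <- rbinom(n, 1, plogis(d$severity - 0.3 * d$trt))
d$noise <- rnorm(n)

round(audit_sketch(d, "trt", "y", c("severity", "noise")), 4)
```

Here `severity` gets tiny p-values against both treatment and outcome (a confounder signature), while `noise` does not.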

Case Study: Right Heart Catheterization (RHC)

We’ll demonstrate data auditing using the classic RHC dataset from Connors et al. (1996). This dataset contains 5,735 critically ill patients from 5 medical centers, with the treatment being Right Heart Catheterization (RHC) and the outcome being 30-day mortality.

This is an ideal case study because:

- Medium size: n = 5,735 patients
- Many covariates: p = 63 variables
- Real confounding concerns: RHC is not randomly assigned
- Clinical importance: used extensively in the causal inference literature

library(causaldef)

# Load the RHC dataset
data(rhc)

cat("Dataset dimensions:", nrow(rhc), "patients,", ncol(rhc), "variables\n")
#> Dataset dimensions: 5735 patients, 63 variables

Understanding the Research Question

Treatment: swang1 - Whether the patient received Right Heart Catheterization (1 = Yes, 0 = No)

Outcome: death - 30-day mortality (Yes/No)

Covariates: Demographics, disease category, vital signs, lab values, comorbidities, etc.

Running the Data Audit

Let’s audit all available variables to understand which ones are most relevant for causal analysis.

# Prepare data - convert death to numeric for auditing
rhc_clean <- rhc
rhc_clean$death_num <- as.numeric(rhc_clean$death == "Yes")

# Select relevant numeric and factor columns for audit
# (excluding IDs, dates, and the outcome/treatment themselves)
exclude_cols <- c("X", "ptid", "sadmdte", "dschdte", "dthdte", "lstctdte",
                  "swang1", "death", "death_num")
audit_cols <- setdiff(names(rhc_clean), exclude_cols)

# Run the audit
report <- audit_data(
  data = rhc_clean,
  treatment = "swang1",
  outcome = "death_num",
  covariates = audit_cols[1:25],  # First 25 covariates for demonstration
  alpha = 0.01,  # Stricter significance level
  verbose = FALSE
)

print(report)
#> 
#> ==============================================================================
#>                          Data Integrity Report
#> ==============================================================================
#> 
#> Treatment: swang1 | Outcome: death_num 
#> Variables audited: 25 
#> Issues found: 22 
#> 
#> -- Issues Detected --------------------------------------------------
#> 
#>  Variable Type                 p-value  Recommendation                         
#>  cat1     Confounder           8.91e-14 Include in adjustment set              
#>  cat2     Confounder           0.000777 Include in adjustment set              
#>  cardiohx Confounder           0.001571 Include in adjustment set              
#>  chfhx    Outcome Predictor    2.23e-06 May improve precision if included      
#>  dementhx Confounder           5.48e-09 Include in adjustment set              
#>  psychhx  Confounder           0.004839 Include in adjustment set              
#>  chrpulhx Potential Instrument 4.32e-12 Consider using as instrumental variable
#>  liverhx  Outcome Predictor    0.007240 May improve precision if included      
#>  malighx  Confounder           0.000217 Include in adjustment set              
#>  immunhx  Confounder           0.002996 Include in adjustment set              
#>  transhx  Confounder           9.21e-07 Include in adjustment set              
#>  amihx    Potential Instrument 0.005233 Consider using as instrumental variable
#>  age      Outcome Predictor    < 2e-16  May improve precision if included      
#>  sex      Potential Instrument 0.000631 Consider using as instrumental variable
#>  edu      Potential Instrument 0.000776 Consider using as instrumental variable
#>  surv2md1 Confounder           2.67e-13 Include in adjustment set              
#>  das2d3pc Outcome Predictor    < 2e-16  May improve precision if included      
#>  t3d30    Confounder           2.69e-08 Include in adjustment set              
#>  dth30    Confounder           9.08e-09 Include in adjustment set              
#>  aps1     Confounder           < 2e-16  Include in adjustment set              
#>  scoma1   Confounder           6.69e-05 Include in adjustment set              
#>  meanbp1  Confounder           3.23e-15 Include in adjustment set              
#> 
#> -- Recommendations --------------------------------------------------
#> 
#> * CONFOUNDERS: Variables [cat1, cat2, cardiohx, dementhx, psychhx, malighx, immunhx, transhx, surv2md1, t3d30, dth30, aps1, scoma1, meanbp1] correlate with both treatment and outcome - must adjust for these 
#> * INSTRUMENTS: Variables [chrpulhx, amihx, sex, edu] correlate with treatment but not outcome - consider as IVs

Interpreting the Report

The audit classifies each variable into one of these categories:

Classification         Meaning                                       Action
---------------------  --------------------------------------------  ------------------------
Confounder             Correlates with BOTH treatment and outcome    MUST adjust for this
Potential Instrument   Correlates with treatment but NOT outcome     Consider for IV analysis
Outcome Predictor      Correlates with outcome but NOT treatment     Include for precision
Safe                   No significant correlations                   Can include or exclude
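
The decision rule in the table can be written out explicitly. This is a hypothetical helper that simply encodes the table (the package’s internal logic may differ in details):

```r
# Map the two significance flags to the four categories from the table above.
classify <- function(sig_treatment, sig_outcome) {
  if (sig_treatment && sig_outcome) "Confounder"
  else if (sig_treatment)           "Potential Instrument"
  else if (sig_outcome)             "Outcome Predictor"
  else                              "Safe"
}

classify(TRUE,  TRUE)   # "Confounder"
classify(TRUE,  FALSE)  # "Potential Instrument"
classify(FALSE, TRUE)   # "Outcome Predictor"
classify(FALSE, FALSE)  # "Safe"
```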

Examining Detected Issues

Let’s look more closely at the flagged confounders:

# Filter to see only confounders
confounders <- report$issues[report$issues$issue_type == "Confounder", ]
if (nrow(confounders) > 0) {
  cat("Detected Confounders (must adjust for these):\n\n")
  print(confounders[, c("variable", "r_treatment", "r_outcome", "p_value")])
}
#> Detected Confounders (must adjust for these):
#> 
#>    variable r_treatment   r_outcome      p_value
#> 1      cat1  0.10083106  0.09823951 8.913722e-14
#> 2      cat2  0.18615863 -0.09688936 7.772900e-04
#> 4  cardiohx  0.05671208  0.04173494 1.570847e-03
#> 6  dementhx -0.07691385  0.08093069 5.475391e-09
#> 7   psychhx -0.06735438 -0.03720101 4.838619e-03
#> 12  malighx -0.04881156  0.18325255 2.174017e-04
#> 13  immunhx  0.03918708  0.05980653 2.996256e-03
#> 14  transhx  0.08416616 -0.06475250 9.211552e-07
#> 19 surv2md1 -0.09632573 -0.34610143 2.668645e-13
#> 21    t3d30 -0.07334547 -0.46736750 2.686694e-08
#> 22    dth30  0.07579735  0.52131104 9.076865e-09
#> 23     aps1  0.23861164  0.19205247 8.950783e-49
#> 24   scoma1 -0.05262318  0.12494518 6.690392e-05
#> 25  meanbp1 -0.21278822 -0.10381577 3.233814e-15

These variables show significant correlation with both the treatment decision (whether a patient receives RHC) and the outcome (mortality). Failing to adjust for these would introduce confounding bias.
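
One caution: an implausibly strong outcome correlation among flagged “confounders” often signals leakage rather than confounding. In the RHC data, dth30 (death within 30 days) and t3d30 (days to death, censored at 30) are recorded as part of the 30-day mortality outcome itself, so they should be excluded, not adjusted for. A hypothetical screen (the helper name, threshold, and toy table below are assumptions, not part of the package):

```r
# Hypothetical leakage screen: flag "confounders" whose outcome correlation
# is far stronger than a pre-treatment covariate could plausibly have.
flag_leaky <- function(issues, threshold = 0.4) {
  conf <- issues[issues$issue_type == "Confounder", ]
  conf$variable[abs(conf$r_outcome) > threshold]
}

# Toy issues table mirroring three rows of the audit output above
issues <- data.frame(
  variable   = c("aps1", "t3d30", "dth30"),
  issue_type = "Confounder",
  r_outcome  = c(0.192, -0.467, 0.521)
)
flag_leaky(issues)  # "t3d30" "dth30"
```

Variables flagged this way deserve a manual check against the data dictionary before entering any adjustment set.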

Clinical Interpretation

The audit results make clinical sense:

  1. Disease severity indicators (like APACHE score aps1, vital signs) should correlate with both:
    • Sicker patients are more likely to receive RHC (treatment)
    • Sicker patients are more likely to die (outcome)
  2. Demographics (age, comorbidities) follow similar patterns

  3. Some variables correlate only with outcome (predictors of mortality but not of treatment selection)

Comparing Audit Results Across Subsets

Let’s see how the audit differs across patient subgroups:

# Audit cardiac patients only
cardiac_patients <- rhc_clean[rhc_clean$card == "Yes", ]  # card is coded Yes/No

if (nrow(cardiac_patients) > 50) {
  report_cardiac <- audit_data(
    data = cardiac_patients,
    treatment = "swang1",
    outcome = "death_num",
    covariates = audit_cols[1:15],
    alpha = 0.01,
    verbose = FALSE
  )
  
  cat("=== Cardiac Patients Subgroup ===\n")
  cat("Sample size:", nrow(cardiac_patients), "\n")
  cat("Issues found:", report_cardiac$summary_stats$n_issues, "\n")
  cat("Confounders:", report_cardiac$summary_stats$n_confounders, "\n")
}
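
Once both audits exist, a simple set comparison shows which variables are flagged only in one population. The vectors below are toy stand-ins for `report$issues$variable` and `report_cardiac$issues$variable`:

```r
# Compare flagged variables across audits (toy vectors for illustration).
full_flags    <- c("aps1", "meanbp1", "age", "cat1")
cardiac_flags <- c("aps1", "meanbp1", "surv2md1")

setdiff(cardiac_flags, full_flags)  # flagged only in the cardiac subgroup
setdiff(full_flags, cardiac_flags)  # flagged only in the full sample
```

Subgroup-specific confounders like these are a cue to revisit the adjustment set before any stratified analysis.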

Using Audit Results for Causal Analysis

Once you’ve identified confounders through the audit, use them in your causal specification:

# Get the list of detected confounders
confounder_vars <- report$issues$variable[report$issues$issue_type == "Confounder"]

# If we have confounders, build a proper causal specification
if (length(confounder_vars) > 0) {
  # Use detected confounders in causal spec
  spec <- causal_spec(
    data = rhc_clean,
    treatment = "swang1",
    outcome = "death_num",
    covariates = confounder_vars
  )
  
  print(spec)
}
#> Warning: 4535 observations dropped due to missing values
#> Warning: 8 observations have extreme propensity scores
#> ✔ Created causal specification: n=1200, 14 covariate(s)
#> 
#> -- Causal Specification --------------------------------------------------
#> 
#> * Treatment: swang1 ( binary )
#> * Outcome: death_num ( continuous )
#> * Covariates: cat1, cat2, cardiohx, dementhx, psychhx, malighx, immunhx, transhx, surv2md1, t3d30, dth30, aps1, scoma1, meanbp1 
#> * Sample size: 1200 
#> * Estimand: ATE
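
The first warning deserves attention: complete-case handling drops 4,535 of 5,735 rows. Before trusting an estimate from the remaining n = 1,200, check which covariates drive the missingness. A base-R sketch on toy data (with the real data, apply the same two calls to `rhc_clean[confounder_vars]`):

```r
# Per-column missingness check on a small toy data frame.
df <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 3, 4), c = 1:4)
colSums(is.na(df))        # a: 1, b: 2, c: 0
sum(!complete.cases(df))  # 2 rows lost to complete-case analysis
```

If a handful of covariates account for most of the loss, consider imputing them or weighing their confounding value against the sample they cost.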

Full Audit Summary

# Summary statistics from the audit
cat("\n=== Audit Summary ===\n")
#> 
#> === Audit Summary ===
cat("Variables audited:", report$summary_stats$n_vars_audited, "\n")
#> Variables audited: 25
cat("Total issues:", report$summary_stats$n_issues, "\n")
#> Total issues: 22
cat("  - Confounders:", report$summary_stats$n_confounders, "\n")
#>   - Confounders: 14
cat("  - Potential instruments:", report$summary_stats$n_instruments, "\n")
#>   - Potential instruments: 4

Best Practices for Data Auditing

  1. Run audit early: Before any causal analysis, audit your data to understand variable roles

  2. Use domain knowledge: The audit identifies statistical associations; combine with clinical/domain expertise

  3. Adjust significance level:
    • Use stricter alpha (e.g., 0.01) for larger datasets to reduce false positives
    • Use looser alpha (e.g., 0.10) for smaller samples to catch potential confounders
  4. Audit subgroups: Confounding patterns may differ across patient populations

  5. Document decisions: Record which variables you adjust for and why

  6. Iterate: After initial analysis, re-audit to check if additional variables should be included
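
The alpha advice in point 3 is easy to check empirically: count how many variables each threshold would flag. The p-values below are toy values; in practice, use `report$issues$p_value`:

```r
# How the number of flagged variables changes with alpha (toy p-values).
p_values <- c(2e-16, 7.7e-04, 4.8e-03, 3.1e-02, 2.0e-01)
sapply(c(0.01, 0.05, 0.10), function(a) sum(p_values < a))  # 3 4 4
```

If the flagged set is stable across thresholds, the audit’s conclusions are not an artifact of the alpha you happened to pick.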

Conclusion

The audit_data() function provides an automated first pass at identifying causal structure in your dataset. It answers key questions:

- Which variables are confounders that must be adjusted for?
- Which variables could serve as instrumental variables?
- Which variables predict the outcome and can improve precision if included?

This systematic approach helps ensure your causal analysis is built on a solid foundation, reducing the risk of confounding bias and improving the reliability of your conclusions.

References

Connors AF, Speroff T, Dawson NV, et al. (1996). The effectiveness of right heart catheterization in the initial care of critically ill patients. JAMA, 276(11), 889–897.