---
title: "rdhte: Empirical Illustration"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{rdhte-illustration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 6,
  fig.height = 4,
  dpi = 90,
  message = FALSE,
  warning = FALSE
)
```

## Overview

`rdhte` estimates **heterogeneous treatment effects** in sharp
Regression Discontinuity (RD) designs along one or more pretreatment
covariates. The package exposes three main commands:

- `rdhte()` -- point estimation and robust bias-corrected inference of
  group-conditional RD effects.
- `rdbwhte()` -- automatic per-cell bandwidth selection.
- `rdhte_lincom()` -- linear-combination tests over the estimated CATEs.

This vignette illustrates each feature on a real-data extract from
Granzier, Pons, and Tricaud (2023, *American Economic Journal:
Applied*), bundled with the package as `rdhte_dataset`.

### The data and the design

Granzier, Pons, and Tricaud study **coordination and bandwagon effects
in French two-round elections**, where candidates must clear a
qualifying-vote threshold in the first round to advance to the runoff.
The institutional rule creates a sharp regression-discontinuity design
on every candidate's first-round margin against that threshold:
candidates just above the cutoff advance, those just below do not, and
small differences in first-round support produce a discrete change in
runoff participation. The authors use the design to ask whether being
just-qualified causally affects subsequent voter and candidate
behavior, and document substantive heterogeneity across ideology and
candidate-strength dimensions.

The bundled extract has 39,534 candidate-race observations with the
following variables:

- `y` -- 0/1 indicator for advancing to the runoff (the outcome).
- `x` -- first-round margin against the qualifying threshold (the
  running variable; cutoff at zero).
- `cluster_var` -- district identifier, used for cluster-robust
  inference.
- `w_left` -- 0/1 indicator for left-of-center parties.
- `w_ideology` -- unordered party-ideology bucket (4 categories).
- `w_strength` -- continuous proxy for ex-ante candidate strength
  (average prior national-level vote share).
- `w_strong` -- 0/1 indicator for above-median strength.
- `w_strength_qrt` -- ordered quartile of `w_strength` (4 levels).

These covariates support every covariate-incorporation pattern
`rdhte` exposes: binary cells, multi-level (unordered and ordered)
factor cells, factor-by-factor interactions, continuous slopes, and
binary x continuous interactions.

```{r setup}
library(rdhte)
library(rdrobust)
library(broom)
data(rdhte_dataset)
attach(rdhte_dataset)
```

## Single binary variable

When `covs.hte` is binary (or a factor with two levels), `rdhte`
reports one CATE per cell.

```{r binary}
summary(rd_left <- rdhte(y = y, x = x, covs.hte = factor(w_left),
                          cluster = cluster_var))
```

How large is the advantage for left-of-center candidates? Test the
difference with `rdhte_lincom`:

```{r binary-lincom}
rdhte_lincom(model = rd_left,
             linfct = c("`factor(w_left)1` - `factor(w_left)0` = 0"))
```

Forcing a common bandwidth across cells (`bw.joint = TRUE`) can
distort the cell-specific inference -- here it makes the right-of-center
effect (incorrectly) statistically significant:

```{r binary-bwjoint}
summary(rdhte(y = y, x = x, covs.hte = w_left, cluster = cluster_var,
              bw.joint = TRUE))
```

A 0/1 numeric variable is auto-promoted to a factor; coefficient names
then carry the variable name as a prefix:

```{r binary-bare}
summary(rd_left2 <- rdhte(y = y, x = x, covs.hte = w_left,
                           cluster = cluster_var))
rdhte_lincom(model = rd_left2,
             linfct = c("`w_left1` - `w_left0` = 0"))
```

## Single categorical variable -- unordered

A multi-level factor produces one CATE per level. The reference
category is the factor's first level.

```{r categorical-unordered}
summary(rd_ideology <- rdhte(y = y, x = x,
                              covs.hte = factor(w_ideology),
                              cluster = cluster_var))
```

Joint test that the three non-reference ideology cells are all
indistinguishable from zero:

```{r categorical-lincom}
rdhte_lincom(model = rd_ideology,
             linfct = c("`factor(w_ideology)2` = 0",
                        "`factor(w_ideology)3` = 0",
                        "`factor(w_ideology)4` = 0"))
```

## Single categorical variable -- ordered

For an ordered categorical variable, wrapping in `factor()` keeps the
per-level CATE interpretation:

```{r categorical-ordered}
summary(rdhte(y = y, x = x, covs.hte = factor(w_strength_qrt),
              cluster = cluster_var))
```

The per-quartile CATE increases monotonically: candidates with higher
average strength have larger treatment effects.

A bare numeric variable (no `factor()`) is treated as continuous --
useful when the ordering is meaningful but you want the linear-in-W
parametrization:

```{r categorical-as-continuous}
summary(rdhte(y = y, x = x, covs.hte = w_strength_qrt,
              cluster = cluster_var))
```

## Two binary variables -- interaction

Interactions of two factors produce one CATE per joint cell. The
ordering of cells follows R's `interaction()` convention.

```{r factor-by-factor}
summary(rdhte(y = y, x = x,
              covs.hte = factor(w_left):factor(w_strong),
              cluster = cluster_var))
```

The same model can be expressed by pre-building the joint cell
variable:

```{r factor-by-factor-alt}
interactions <- 1*(w_left==0)*(w_strong==1) +
                2*(w_left==0)*(w_strong==2) +
                3*(w_left==1)*(w_strong==1) +
                4*(w_left==1)*(w_strong==2)
summary(rdhte(y = y, x = x, covs.hte = factor(interactions),
              cluster = cluster_var))
```

## Single continuous variable

When `covs.hte` is continuous, `rdhte` switches to a linear-in-W
parametrization of the CATE function:

\[
\tau(w) = \beta_T + \beta_{T:W}\,w.
\]

```{r continuous}
summary(rd_continuous <- rdhte(y = y, x = x, covs.hte = w_strength,
                                kernel = "uni",
                                cluster = cluster_var))
```

To aid interpretation, the slope coefficient is precisely a
partial-linear regression slope. With the **uniform kernel** and a
**fixed bandwidth**, the rdhte point estimates match plain `lm()` on
the same in-bandwidth subset:

```{r continuous-vs-lm}
trt <- (x > 0)
new.data <- data.frame(y, x, w_strength, trt)
using.lm <- coef(lm(y ~ trt * x * w_strength,
                    data = new.data[abs(new.data$x) < rd_continuous$h[1], ]))
rd_continuous$Estimate[1] - using.lm["trtTRUE"]
rd_continuous$Estimate[2] - using.lm["trtTRUE:w_strength"]
```

(Inference, however, requires the robust bias correction in `rdhte`
and cannot be obtained from the `lm()` fit alone.)

## Interaction effect: binary x continuous

A fully interacted specification gives a separate intercept and slope
for each level of the categorical covariate.

```{r binary-by-continuous}
summary(rd_interaction <- rdhte(y = y, x = x,
                                 covs.hte = "w_left*w_strength",
                                 cluster = cluster_var))
```

Wrapping the binary in `factor()` is also allowed but shifts the
baseline category and therefore the reported intercepts:

```{r binary-by-continuous-factor}
summary(rdhte(y = y, x = x, covs.hte = "factor(w_left)*w_strength",
              cluster = cluster_var))
```

Each cell-specific coefficient may look insignificant individually
while the joint test still rejects:

```{r binary-by-continuous-joint}
rdhte_lincom(model = rd_interaction,
             linfct = c("`T` = 0",
                        "`T:w_left` = 0",
                        "`T:w_strength` = 0",
                        "`T:w_left:w_strength` = 0"))
```

With the bandwidth fixed, the fully interacted model matches results
from category-specific estimation. Use `rdhte_lincom` to read off the
per-cell intercept and slope:

```{r binary-by-continuous-fixedbw}
summary(rd_interaction <- rdhte(y = y, x = x,
                                 covs.hte = "w_left*w_strength",
                                 h = 0.1, cluster = cluster_var))
summary(rdhte(y = y[w_left == 0], x = x[w_left == 0],
              covs.hte = w_strength[w_left == 0], h = 0.1,
              cluster = cluster_var[w_left == 0]))
summary(rdhte(y = y[w_left == 1], x = x[w_left == 1],
              covs.hte = w_strength[w_left == 1], h = 0.1,
              cluster = cluster_var[w_left == 1]))
rdhte_lincom(model = rd_interaction,
             linfct = c("`T` + `T:w_left` = 0",
                        "`T:w_strength` + `T:w_left:w_strength` = 0"))
```

## Standalone bandwidth selection (`rdbwhte`)

`rdbwhte` runs the same data-driven bandwidth selectors as `rdhte` but
returns the bandwidths without estimating the CATEs. Useful when you
want to inspect or fix `h` and then explore alternative variance
estimators or compare specifications at a common bandwidth.

```{r rdbwhte}
bw <- rdbwhte(y = y, x = x, covs.hte = factor(w_ideology),
              cluster = cluster_var)
bw$h
```

`bw.joint = TRUE` forces a single shared bandwidth across cells:

```{r rdbwhte-joint}
bw_joint <- rdbwhte(y = y, x = x, covs.hte = factor(w_ideology),
                    cluster = cluster_var, bw.joint = TRUE)
bw_joint$h
```

## Efficiency-improving covariates (`covs.eff`)

`covs.eff` adds covariates to the local-polynomial regression to
**shrink standard errors** without changing the identification of the
CATE. They enter additively (and as `covs.eff x W` interactions in the
heterogeneity-aware paths) but never with the treatment indicator or
the running-variable polynomial. Useful when you have strong
pretreatment predictors of the outcome.

```{r covs-eff}
m_no_eff <- rdhte(y = y, x = x, covs.hte = factor(w_ideology),
                   cluster = cluster_var)
m_eff    <- rdhte(y = y, x = x, covs.hte = factor(w_ideology),
                   covs.eff = w_strength, cluster = cluster_var)
data.frame(cell       = m_no_eff$W.lev,
           se_no_eff  = round(m_no_eff$se.rb, 4),
           se_eff     = round(m_eff$se.rb,    4),
           pct_change = round(100 * (m_eff$se.rb - m_no_eff$se.rb)
                              / m_no_eff$se.rb, 1))
```

## Plotting

`plot()` is a post-estimation method for categorical `covs.hte`: one
point per cell at the conventional point estimate, with the robust
bias-corrected CI as an error bar.

```{r plot-basic, fig.alt="rdhte forest plot, ideology buckets"}
plot(rd_ideology)
```

`sort = TRUE` reorders cells along the x-axis by their point
estimate -- often the more useful default for ranking-style narratives:

```{r plot-sort, fig.alt="sorted forest plot"}
plot(rd_ideology, sort = TRUE)
```

The returned object is a `ggplot`, so additional layers compose
naturally. For example, a custom title with horizontal orientation:

```{r plot-custom, fig.alt="custom-themed forest plot"}
p <- plot(rd_ideology, sort = TRUE,
          title = "Heterogeneity by ideology bucket",
          ylab  = "Sharp RD ITT")
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p + ggplot2::theme_minimal(base_size = 11) +
      ggplot2::coord_flip()
}
```

## Building publication-ready tables

`rdhte` exposes the per-cell results through `tidy()` and a one-row
summary through `glance()` (both broom-compatible). Combined with
`knitr::kable()` / `xtable::xtable()` the same numbers can be turned
into Markdown, HTML, or LaTeX tables for papers.

### (A) per-cell table from a single call

`tidy()` returns one row per cell with the conventional point
estimate, robust BC standard error, z / p-value, BC confidence
interval, and per-side `h` and `Nh`. Renaming or formatting columns
before `kable()` gives a clean publication panel:

```{r table-tidy}
tab_A <- tidy(rd_ideology)
tab_A
```

```{r table-A-kable}
knitr::kable(
  tab_A[, c("term", "estimate", "std.error", "conf.low", "conf.high",
            "n.eff.left", "n.eff.right", "h.left", "h.right")],
  digits = c(NA, 3, 3, 3, 3, 0, 0, 3, 3),
  col.names = c("Cell", "Estimate", "SE", "CI lo", "CI hi",
                "Nh-", "Nh+", "h-", "h+"),
  caption = "Per-cell RD effects by ideology bucket"
)
```

`glance()` complements `tidy()` with a one-row description of the fit
(N, polynomial orders, kernel, VCE, bandwidth selector):

```{r table-glance}
glance(rd_ideology)
```

### (B) cell x specification comparison

Fix the `covs.hte` argument and vary `vce` across columns to see how
the choice of variance estimator moves the per-cell standard errors.

```{r table-spec-compare}
specs <- list(HC0 = list(vce = "hc0"),
              HC2 = list(vce = "hc2"),
              HC3 = list(vce = "hc3"),
              CR1 = list(vce = "cr1", cluster = cluster_var))

fit_one <- function(args) {
  args <- c(list(y = y, x = x, covs.hte = factor(w_ideology)), args)
  do.call(rdhte, args)
}

cells  <- sort(unique(w_ideology))
mat_pt <- sapply(specs, function(s) tidy(fit_one(s))$estimate)
mat_se <- sapply(specs, function(s) tidy(fit_one(s))$std.error)
rownames(mat_pt) <- rownames(mat_se) <- as.character(cells)

mat_pt
mat_se
```

A common publication layout interleaves the point estimate and its SE
in parentheses below it:

```{r table-spec-stacked}
n_cells <- nrow(mat_pt)
out     <- data.frame(matrix(NA_character_, nrow = 2 * n_cells,
                             ncol = ncol(mat_pt)))
for (i in seq_len(n_cells)) {
  out[2 * i - 1, ] <- sprintf("%.3f", mat_pt[i, ])
  out[2 * i,     ] <- sprintf("(%.3f)", mat_se[i, ])
}
out <- cbind(Cell = rep(rownames(mat_pt), each = 2), out)
out$Cell[seq(2, nrow(out), by = 2)] <- ""
colnames(out)[-1] <- colnames(mat_pt)
knitr::kable(out, caption = "Per-cell estimates by variance option (SE in parentheses)")
```

### (C) LaTeX export

`xtable` or `kableExtra` can render the same data frame as LaTeX.
The snippet below produces a paper-ready table; uncomment the
`writeLines()` call to write to a file:

```{r table-latex, eval = requireNamespace("xtable", quietly = TRUE)}
tex <- xtable::xtable(
  tab_A[, c("term", "estimate", "std.error", "conf.low", "conf.high")],
  digits  = c(0, 0, 3, 3, 3, 3),
  caption = "Per-cell RD effects by ideology bucket",
  label   = "tab:rdhte-ideology"
)
print(tex, include.rownames = FALSE, booktabs = TRUE,
      caption.placement = "top")
# writeLines(capture.output(print(tex, include.rownames = FALSE,
#                                booktabs = TRUE,
#                                caption.placement = "top")),
#            con = "rdhte_table_A.tex")
```

## Replicating rdrobust

Average effects from `rdhte()` and `rdrobust()` will not match under
default settings, because the two packages use different defaults for
`rho` and `vce`:

```{r rdrobust-defaults}
summary(rdhte(y = y, x = x))
summary(rdrobust(y = y, x = x))
```

To replicate exactly with `rdrobust`, set `rho = 1`, choose a matching
`vce`, and enforce the same bandwidth:

```{r rdrobust-match-ate}
summary(rdhte(y = y, x = x, h = 0.1, vce = "hc3"))
summary(rdrobust(y = y, x = x, h = 0.1, rho = 1, vce = "hc3"))
```

The same trick reproduces the per-cell `rdrobust` estimates from a
single `rdhte` call with per-cell bandwidths:

```{r rdrobust-match-cells}
summary(rdhte(y = y, x = x, covs.hte = w_left,
              h = c(0.078, 0.116), vce = "hc3"))
summary(rdrobust(y = y[w_left == 1], x = x[w_left == 1],
                 h = 0.116, rho = 1, vce = "hc3"))
```

```{r cleanup, include = FALSE}
detach(rdhte_dataset)
```

## See also

- `rdhte()`, `rdbwhte()`, `rdhte_lincom()`, `plot.rdhte()`.
- The companion Stata package ships an `rdhte_plot` command that
  mirrors the R `plot()` method.

## Reference

Calonico, S., Cattaneo, M.D., Farrell, M.H., Palomba, F., and
Titiunik, R. (2025). *Treatment Effect Heterogeneity in Regression
Discontinuity Designs.*
[arXiv:2503.13696](https://arxiv.org/abs/2503.13696).

Granzier, R., Pons, V., and Tricaud, C. (2023). *Coordination and
Bandwagon Effects: How Past Rankings Shape the Behavior of Voters and
Candidates.* **American Economic Journal: Applied Economics** 15(4):
177-217.
