This vignette illustrates the basic and advanced usage of knockoff.filter. For simplicity, we will use synthetic data constructed such that the true coefficient vector \(\beta\) has very few nonzero entries.

# Problem parameters
n = 600          # number of observations
p = 200          # number of variables
k = 30           # number of variables with nonzero coefficients
amplitude = 3.5  # signal amplitude (for noise level = 1)

# Problem data
X = matrix(rnorm(n*p), nrow=n, ncol=p)
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero)
y.sample <- function() X %*% beta + rnorm(n)

First examples

To begin, we call knockoff.filter with all the default settings.

library(knockoff)

y = y.sample()
result = knockoff.filter(X, y)
print(result)
## Call:
## knockoff.filter(X = X, y = y)
## 
## Selected variables:
##  [1]   7  13  16  17  18  23  34  43  50  52  53  58  60  61  63  65  67
## [18]  76  85  86  87  88  96  97 101 116 128 131 140 152 165 170 177 179
## [35] 187 194

The false discovery proportion is

fdp <- function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(result$selected)
## [1] 0.1666667

This is below the default FDR target of 0.20.

By default, the knockoff filter uses a test statistic based on the lasso. Specifically, it uses the statistic knockoff.stat.lasso_signed_max, which computes \[ W_j = \max(Z_j, \tilde{Z}_j) \cdot \mathrm{sgn}(Z_j - \tilde{Z}_j), \] where \(Z_j\) and \(\tilde{Z}_j\) are the maximum values of the regularization parameter \(\lambda\) at which the \(j\)th variable and its knockoff, respectively, enter the lasso model.

The knockoff package includes several other test statistics, all of which have names prefixed with knockoff.stat. In the next snippet, we use a statistic based on forward selection. We also set a lower target FDR of 0.10.

result = knockoff.filter(X, y, fdr = 0.10, statistic = knockoff.stat.fs)
fdp(result$selected)
## [1] 0.1176471

User-defined test statistics

In addition to using the predefined test statistics, it is also possible to define your own test statistics. To illustrate this possibility, we implement one of the simplest test statistics from the knockoff filter paper, namely \[ W_j = \left|X_j^\top \cdot y\right| - \left|\tilde{X}_j^\top \cdot y\right|. \]

my_knockoff_stat <- function(X, X_ko, y) {
  abs(t(X) %*% y) - abs(t(X_ko) %*% y)
}
result = knockoff.filter(X, y, statistic = my_knockoff_stat)
fdp(result$selected)
## [1] 0.2307692

As another example, we show how to customize the grid of \(\lambda\)’s used to compute the lasso path in the default test statistic.

my_lasso_stat <- function(...) knockoff.stat.lasso_signed_max(..., nlambda=10*p)
result = knockoff.filter(X, y, statistic = my_lasso_stat)
fdp(result$selected)
## [1] 0.1666667

For more information about the nlambda parameter, see the documentation for knockoff.stat.lasso_signed_max.

Advanced usage

The function knockoff.filter is a wrapper around several simpler functions that

  1. Construct knockoff variables (knockoff.create)
  2. Compute the test statistic \(W\) (various functions with prefix knockoff.stat)
  3. Compute the threshold for variable selection (knockoff.threshold)

These functions may be called directly if desired. For more information, see the documentation for the individual functions.

Warning. The high-level function knockoff.filter will automatically normalize the columns of the input matrix (unless this behavior is explicitly disabled). However, all other functions in this package assume that the columns of the input matrix have unit Euclidean norm. Please be aware of these conventions.