Overview

This vignette clarifies the usage of the survtab function included in this package. survtab estimates various survival functions and cumulative incidence functions (CIFs) non-parametrically, and was developed with large datasets in mind.

Two methods (surv.method) are currently supported. The lifetable (actuarial) method uses only counts when estimating any function. The default method estimates hazards and transforms them into survival function or CIF estimates.

survtab requires the subjects' survival intervals to be split into (small) subintervals, since the requested survival time functions are estimated within each subinterval and cumulated to yield the overall estimates. More and smaller subintervals yield higher-fidelity estimates.

For relative survival estimation survtab also requires the population hazard rates at the person-specific subintervals. For both splitting and merging population hazard rates one should currently use the lexpand (convenience) function. Essentially, lexpand builds on Lexis from the Epi package, splits the follow-up time, and merges population hazards if requested.

Basic usage of lexpand and survtab

To demonstrate splitting data and merging population hazard information we use a simulated cohort of Finnish female rectal cancer patients.

library(popEpi)

sr <- copy(popEpi::sire)
head(sr)
##    sex    bi_date    dg_date    ex_date status   dg_age
## 1:   1 1952-05-27 1994-02-03 2012-12-31      0 41.68877
## 2:   1 1959-04-04 1996-09-20 2012-12-31      0 37.46378
## 3:   1 1958-06-15 1994-05-30 2012-12-31      0 35.95616
## 4:   1 1957-05-10 1997-09-04 2012-12-31      0 40.32055
## 5:   1 1957-01-20 1996-09-24 2012-12-31      0 39.67745
## 6:   1 1962-05-25 1997-05-17 2012-12-31      0 34.97808

lexpand accepts both Date (see ?as.Date) and fractional year (e.g. 2000.315) as the time variables for time of birth, diagnosis and exit. status may be e.g. a numeric or factor variable.

For splitting one can always specify the break points in any time scale as fractional years, and also as dates for splitting along calendar time (per). Note that if any time variable or the breaks along per are given in the Date format, they are converted internally by lexpand into fractional years using get.yrs(x, year.length = "actual").
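As a rough illustration of what such a conversion involves, the following base R sketch turns dates into fractional years using the actual length of each year. The helper name date_to_frac_year is hypothetical and not part of popEpi, and the arithmetic is an assumption rather than the implementation of get.yrs. It does closely reproduce the first dg_age value printed above.

```r
# Hypothetical helper (not a popEpi function): convert dates to fractional
# years, dividing the elapsed days of the year by that year's actual length.
date_to_frac_year <- function(x) {
  lt <- as.POSIXlt(as.Date(x))
  yr <- lt$year + 1900
  is_leap <- (yr %% 4 == 0 & yr %% 100 != 0) | (yr %% 400 == 0)
  # lt$yday counts days since 1 January, so 1 January maps to yr + 0
  yr + lt$yday / ifelse(is_leap, 366, 365)
}

dg <- date_to_frac_year("1994-02-03")  # diagnosis date of the first subject
bi <- date_to_frac_year("1952-05-27")  # birth date of the first subject
dg - bi                                # age at diagnosis in years
```

Under this assumption 1 January of any year maps to the whole year, so differences between two converted dates can be read directly as time in years.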

There are two ways to specify breaks in lexpand: via a list, e.g.

x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date,
             breaks = list(fot = seq(0,5, 1/12)), status = status, pophaz=popmort)

or by supplying the breaks directly through the ... argument, e.g.

x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date, 
             fot = seq(0,5, 1/12), status = status, pophaz=popmort)

The former can be simpler sometimes because you can pass a prepared list to the function:

BL <- list(fot = seq(0, 5, 1/12))
x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date, 
             breaks = BL, status = status, pophaz=popmort)

The methods are equivalent, but if both are used, only the breaks list is utilized in splitting.

fot used above specifies splitting breaks along the follow-up time scale. By default lexpand drops any observed survival time outside the time window (here 0-5 years along fot) specified by breaks along the three time scales, so one should supply meaningful floors and roofs for the time scales. This way, one can e.g. prepare the data for period analysis by specifying lowest and highest allowed calendar times experienced using per.

## up to 5 years of follow-up time in month-long intervals
x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date, 
             fot = seq(0,5, 1/12), status = status, pophaz=popmort)
## dropped 16 rows where entry == exit
## 14 rows in expanded data had age values >= 101; assumed for these the same expected hazard as for people of age 100
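To make the idea of splitting concrete, here is a minimal base R sketch of splitting a single subject's follow-up along fot and dropping time outside the window. The function split_fot is hypothetical and only mimics the principle; it is not how lexpand is implemented, and it ignores the age and per time scales.

```r
# Hypothetical helper: split follow-up from 0 to fot_exit (in years) at the
# given breaks, keeping only time inside [min(breaks), max(breaks)].
split_fot <- function(fot_exit, breaks) {
  lo <- breaks[-length(breaks)]   # interval starts
  hi <- breaks[-1]                # interval ends
  stop_time <- pmin(hi, fot_exit) # follow-up ends at exit or interval end
  keep <- lo < fot_exit           # intervals the subject actually enters
  data.frame(Tstart = lo[keep],
             Tstop  = stop_time[keep],
             pyrs   = (stop_time - lo)[keep])
}

breaks <- seq(0, 5, 1/12)
pieces <- split_fot(0.2, breaks)  # a subject followed for 0.2 years
pieces
sum(split_fot(7, breaks)$pyrs)    # time beyond 5 years is dropped
```

Each row of the result corresponds to one subinterval, and the pyrs column sums to the follow-up time inside the window.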

By default survtab estimates Ederer II relative survival by computing excess hazards for each survival interval, as specified by the breaks along the fot time scale.

st <- survtab(x)
## event.values was NULL, so chose 0 as non-event value
head(st)
##    surv.int  Tstart   Tstop   delta  pyrs n.start   d n.cens d.exp
## 1:        1 0.00000 0.08333 0.08333 672.4    8227 297     25 25.64
## 2:        2 0.08333 0.16670 0.08333 647.1    7905 215     44 23.63
## 3:        3 0.16670 0.25000 0.08333 627.6    7646 205     36 22.19
## 4:        4 0.25000 0.33330 0.08333 608.6    7405 165     42 21.19
## 5:        5 0.33330 0.41670 0.08333 592.4    7198 152     35 20.26
## 6:        6 0.41670 0.50000 0.08333 578.8    7011 114     26 19.62
##    surv.obs.lo surv.obs surv.obs.hi SE.surv.obs r.e2.lo   r.e2 r.e2.hi
## 1:      0.9596   0.9639      0.9677    0.002059  0.9626 0.9669  0.9707
## 2:      0.9321   0.9375      0.9426    0.002673  0.9379 0.9434  0.9484
## 3:      0.9060   0.9124      0.9183    0.003126  0.9143 0.9208  0.9267
## 4:      0.8850   0.8920      0.8985    0.003436  0.8958 0.9028  0.9094
## 5:      0.8657   0.8731      0.8802    0.003688  0.8787 0.8862  0.8934
## 6:      0.8511   0.8589      0.8663    0.003861  0.8663 0.8743  0.8818
##     SE.r.e2
## 1: 0.002065
## 2: 0.002690
## 3: 0.003155
## 4: 0.003477
## 5: 0.003744
## 6: 0.003930
plot(st, y = "r.e2")

Figure (chunk surv1): r.e2 plotted over follow-up time.
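The estimates in the printed table can be approximately reconstructed by hand, which may help in understanding the two methods. The sketch below copies the first six rows of the output above and assumes that the hazard in each interval is d / pyrs, that the excess hazard is (d - d.exp) / pyrs, and that the lifetable method uses the standard actuarial formula with half of the censored subjects removed from the risk set. These are assumptions for illustration and not a description of the survtab internals; the hazard-based results nevertheless agree closely with the printed surv.obs and r.e2 columns.

```r
# Values copied from the first six rows printed by head(st) above.
delta   <- rep(1/12, 6)
pyrs    <- c(672.4, 647.1, 627.6, 608.6, 592.4, 578.8)
n_start <- c(8227, 7905, 7646, 7405, 7198, 7011)
d       <- c(297, 215, 205, 165, 152, 114)
n_cens  <- c(25, 44, 36, 42, 35, 26)
d_exp   <- c(25.64, 23.63, 22.19, 21.19, 20.26, 19.62)

# Hazard-based observed survival: transform interval hazards and cumulate.
surv_haz <- cumprod(exp(-delta * d / pyrs))

# Relative survival from excess hazards (observed minus expected deaths).
rel_e2 <- cumprod(exp(-delta * (d - d_exp) / pyrs))

# Actuarial (lifetable) observed survival, using counts only.
surv_lt <- cumprod(1 - d / (n_start - n_cens / 2))

round(cbind(surv_haz, rel_e2, surv_lt), 4)
```

With month-long intervals the actuarial and hazard-based observed survival estimates are nearly identical here.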

Period analysis

Sometimes we are interested in the survival experienced during a calendar period rather than in the survival of the persons diagnosed in that period. This is referred to as period analysis. It is considered to produce more up-to-date survival estimates: e.g. for the latest 5-year period, the persons diagnosed within that period may have only short follow-up, whereas period analysis uses all follow-up experienced within the period regardless of the time of diagnosis.

In the framework of popEpi, period analysis is accommodated by splitting the data appropriately before estimation. In lexpand one simply adds additional splitting breaks along the per time scale and drops observed survival outside of it. Here we conduct a period method analysis for the years 2008-2012.

 x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date,
              status = status,
              fot=seq(0, 5, by = 1/12),
              per = c("2008-01-01", "2013-01-01"), 
              pophaz = popmort)
## dropped 3215 rows where subjects left follow-up before earliest per breaks value
## dropped 4 rows where entry == exit
## 11 rows in expanded data had age values >= 101; assumed for these the same expected hazard as for people of age 100
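What restricting follow-up to a calendar window means for individual subjects can be sketched in base R as follows. The function clip_to_period and the three example subjects are hypothetical; the sketch only shows the principle of late entry and is not the lexpand implementation (it also ignores the additional fot window).

```r
# Hypothetical helper: keep only the follow-up that falls inside the calendar
# window [per_lo, per_hi], expressed on the follow-up time scale (fot).
clip_to_period <- function(entry, exit, per_lo, per_hi) {
  start <- pmax(entry, per_lo)   # late entry if diagnosed before the window
  stop_time <- pmin(exit, per_hi)
  keep <- start < stop_time      # drop subjects with no time in the window
  data.frame(id        = which(keep),
             fot_start = (start - entry)[keep],
             fot_stop  = (stop_time - entry)[keep])
}

# Three made-up subjects, with entry and exit as fractional years.
entry <- c(2001.5, 2006.0, 2010.25)
exit  <- c(2007.0, 2011.0, 2012.75)
win <- clip_to_period(entry, exit, per_lo = 2008, per_hi = 2013)
win
```

The first subject leaves follow-up before 2008 and is dropped, the second enters the window two years after diagnosis (late entry), and the third is diagnosed inside the window and contributes from fot = 0.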

We may use the default settings for period method Ederer II estimates as well. Note that surv.method = "lifetable" is inappropriate for period analysis, as it does not handle late entry correctly.

st <- survtab(x)
## event.values was NULL, so chose 0 as non-event value
head(st)
##    surv.int  Tstart   Tstop   delta  pyrs n.start  d n.cens d.exp
## 1:        1 0.00000 0.08333 0.08333 179.5    2174 55     25 6.057
## 2:        2 0.08333 0.16670 0.08333 174.2    2128 49     46 5.657
## 3:        3 0.16670 0.25000 0.08333 170.4    2059 47     34 5.388
## 4:        4 0.25000 0.33330 0.08333 166.6    2021 36     43 5.158
## 5:        5 0.33330 0.41670 0.08333 163.0    1972 42     35 4.930
## 6:        6 0.41670 0.50000 0.08333 159.5    1923 32     26 4.741
##    surv.obs.lo surv.obs surv.obs.hi SE.surv.obs r.e2.lo   r.e2 r.e2.hi
## 1:      0.9673   0.9748      0.9806    0.003356  0.9699 0.9775  0.9833
## 2:      0.9424   0.9522      0.9604    0.004574  0.9475 0.9575  0.9656
## 3:      0.9190   0.9306      0.9405    0.005452  0.9264 0.9382  0.9481
## 4:      0.9014   0.9140      0.9250    0.006016  0.9110 0.9238  0.9349
## 5:      0.8809   0.8945      0.9067    0.006592  0.8925 0.9065  0.9187
## 6:      0.8653   0.8797      0.8927    0.006984  0.8789 0.8937  0.9067
##     SE.r.e2
## 1: 0.003366
## 2: 0.004599
## 3: 0.005496
## 4: 0.006081
## 5: 0.006679
## 6: 0.007095
plot(st, y = "r.e2")

Figure (chunk surv2): period analysis r.e2 plotted over follow-up time.

Standardization

survtab currently enables relatively easy age standardization. The exact breaks and weights can be supplied by hand. The weights can also be so-called internal weights, based on the proportions of subjects in each age group at diagnosis, or one of the three standard weighting schemes integrated into the programme; see ?ICSS for the exact standard weights.

Internal weights

To use internal weights, simply leave agegr.w.weights = NULL and define agegr.w.breaks to what you want.

st.as.int <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf))
## event.values was NULL, so chose 0 as non-event value

We may demonstrate how the internal weights are computed:

## get internal weights from the data with age group breaks c(0, 45, 65, 75, Inf)
## this means getting numbers of cases by age group at diagnosis.
## below the method using data.table syntax.

iw <- x[!duplicated(lex.id), .N, keyby = cut(dg_age, c(0,45,65,75,Inf), right=FALSE)]
iw <- iw$N/sum(iw$N)

iw
## [1] 0.04012175 0.33453237 0.27061428 0.35473160
st.as.hand <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf), 
                      agegr.w.weights = iw)
## event.values was NULL, so chose 0 as non-event value
plot(st.as.int, y = "r.e2.as", conf.int = FALSE, lwd=4)
lines(st.as.hand, y = "r.e2.as", conf.int = FALSE, 
      col = "green", lty = 2, lwd=4)

Figure (chunk surv.as2): r.e2.as with internal weights (solid) and with hand-supplied weights (green, dashed).

So we get the same result as by using agegr.w.weights = NULL.
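The principle behind the weighting can be shown with a small base R sketch. It assumes direct standardization, i.e. that the age-standardized estimate at a given time point is the weighted average of the age-group-specific estimates; whether survtab computes it in exactly this way is not claimed here. The ages and the group-specific survival values below are made up for illustration.

```r
# Hypothetical ages at diagnosis, one per subject.
dg_age <- c(38, 52, 58, 61, 66, 70, 74, 77, 81, 90)
breaks <- c(0, 45, 65, 75, Inf)

# Internal weights: share of subjects in each age group at diagnosis.
n_by_group <- table(cut(dg_age, breaks, right = FALSE))
w <- as.numeric(n_by_group) / sum(n_by_group)

# Hypothetical age-group-specific survival estimates at one time point.
surv_by_group <- c(0.95, 0.90, 0.85, 0.75)

# Directly standardized estimate: weighted average over age groups.
surv_as <- sum(w * surv_by_group)
surv_as
```

Supplying a different weight vector, such as standard population weights, changes only w; the group-specific estimates stay the same.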

The ICSS weights can be used simply by naming the appropriate weighting scheme in the agegr.w.weights argument, e.g. agegr.w.weights = "ICSS1". See ?ICSS for details on the weights. Note that the weights are only tabulated in 5-year age groups or wider, meaning that all breaks in agegr.w.breaks except the last must be divisible by 5. We recommend using Inf as the last break.

## with ICSS1 weights; see ?ICSS
st.as.icss1 <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf), 
                       agegr.w.weights = "ICSS1")
## event.values was NULL, so chose 0 as non-event value
st.as.icss2 <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf), 
                       agegr.w.weights = "ICSS2")
## event.values was NULL, so chose 0 as non-event value
plot(st.as.icss1, conf.int = FALSE, lwd = 4, col = "black")
## y was NULL; chose r.e2.as automatically
lines(st.as.icss2, conf.int = FALSE, lwd = 4, col = "blue")

Figure (chunk surv.as3): r.e2.as with ICSS1 weights (black) and ICSS2 weights (blue).
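The reason the breaks must be divisible by 5 is that weights tabulated in 5-year age groups can only be summed into wider groups along those boundaries. The following sketch illustrates this with made-up uniform weights; these are NOT the ICSS weights (see ?ICSS for those), and the object names are hypothetical.

```r
# Hypothetical weights for the 5-year age groups 0-4, 5-9, ..., 85+.
# These are NOT the ICSS values; they only illustrate the aggregation.
age_lo <- seq(0, 85, by = 5)
w5 <- rep(1 / length(age_lo), length(age_lo))

breaks <- c(0, 45, 65, 75, Inf)
# All breaks except the last must align with the 5-year tabulation.
stopifnot(all(breaks[-length(breaks)] %% 5 == 0))

# Sum the 5-year weights within each of the wider age groups.
grp <- cut(age_lo, breaks, right = FALSE)
w_grp <- tapply(w5, grp, sum)
w_grp
```

A break such as 47 would cut through the 45-49 group, whose weight could not be divided between the two neighbouring groups without further assumptions.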

Though we did not mention it, the age-adjusted estimates above were period analysis estimates. Combinations of different estimation schemes are straightforward to use with lexpand and survtab. Any estimate produced by survtab can be age-adjusted and computed within the period analysis framework. Counts-based (lifetable) estimates can be computed outside the period analysis framework, with or without age adjustment.

Other estimates of survival time functions

Cause-specific survival

Sometimes information on the cause of death is available. It can then be used to estimate cause-specific survival instead of relative survival as the estimate of net survival. The two approaches typically give similar results, as seen below.

st.as.cause <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf), 
                       surv.type="surv.cause")
## event.values was NULL, so chose 0 as non-event value
st.as.rel   <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf), 
                       surv.type="surv.rel")
## event.values was NULL, so chose 0 as non-event value
plot(st.as.rel, y = "r.e2.as", conf.int = FALSE, lwd=4)
lines(st.as.cause, y = "surv.obs1.as", conf.int = FALSE, 
      lwd=4, col="red", lty=2)

Figure (chunk surv.cause): r.e2.as (solid) and cause-specific surv.obs1.as (red, dashed).

Absolute (crude) risk / CIF

Absolute risks of dying due to cause \(k\) can be easily computed (note: still using period analysis data):

st.co <- survtab(x,  surv.type="cif.obs")
## event.values was NULL, so chose 0 as non-event value
st.cr <- survtab(x,  surv.type="cif.rel")
## event.values was NULL, so chose 0 as non-event value
## the two are very similar; here CIF.rel is NA after 3 years because
## d < d.exp; could be alleviated with larger survival intervals
plot(st.co, "CIF_1", conf.int = FALSE, lwd = 4)
lines(st.cr, "CIF.rel", conf.int = FALSE, lwd = 4, col = "red")

Figure (chunk CIF.as): CIF_1 from cif.obs and CIF.rel from cif.rel (red).

cif.rel is an indirect absolute risk estimate, where absolute risk is estimated using excess deaths instead of cause-specific deaths. This may sometimes produce NA results if the count of excess deaths is negative in even one survival interval.
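The idea of an excess-deaths-based absolute risk can be sketched with the counts from the period analysis table printed earlier (the first six rows of head(st)). The sketch assumes piecewise constant hazards and splits the probability of dying in each interval between excess and expected deaths in proportion to their hazards. This is a generic construction chosen for illustration, and it is an assumption that it resembles what survtab does internally.

```r
# Values copied from the first six rows of the period analysis output above.
delta <- 1/12
pyrs  <- c(179.5, 174.2, 170.4, 166.6, 163.0, 159.5)
d     <- c(55, 49, 47, 36, 42, 32)
d_exp <- c(6.057, 5.657, 5.388, 5.158, 4.930, 4.741)

h_all <- d / pyrs             # overall hazard in each interval
h_exc <- (d - d_exp) / pyrs   # excess hazard; negative whenever d < d_exp
h_exp <- d_exp / pyrs         # expected (population) hazard

p_int      <- exp(-h_all * delta)                # conditional survival
surv_end   <- cumprod(p_int)                     # survival at interval end
surv_start <- c(1, surv_end[-length(surv_end)])  # survival at interval start

# Probability of dying in each interval, split by type of death, cumulated.
q_int   <- surv_start * (1 - p_int)
cif_exc <- cumsum(q_int * h_exc / h_all)
cif_exp <- cumsum(q_int * h_exp / h_all)

round(cbind(surv_end, cif_exc, cif_exp), 4)
```

By construction the two risks and the survival sum to one in every interval. If d < d_exp in some interval, h_exc becomes negative there, which is the situation in which an estimate based on excess deaths breaks down.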