This vignette aims to clarify the usage of the survtab function included in this package. survtab estimates various survival functions and cumulative incidence functions (CIFs) non-parametrically, and was developed with large datasets in mind.
Two methods (surv.method) are currently supported. The lifetable (actuarial) method only makes use of counts when estimating any function. The default method estimates hazards and transforms them into survival function or CIF estimates.
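As a rough illustration of the difference between the two methods, the base R sketch below computes the survival probability over a single interval both ways. The counts are taken from the first row of the survtab output printed later in this vignette. The formulas are the textbook actuarial and piecewise constant hazard estimators; that they match what survtab computes internally is an assumption, not something confirmed here.

```r
## counts for one survival interval (first row of the output printed further below)
n.start <- 8227      # subjects at risk at the start of the interval
d       <- 297       # deaths during the interval
n.cens  <- 25        # subjects censored during the interval
pyrs    <- 672.4     # person-years accumulated in the interval
delta   <- 0.08333   # interval width in years

## actuarial (lifetable) estimate: uses counts only,
## censored subjects are assumed to be at risk for half the interval
p.lifetable <- 1 - d / (n.start - n.cens / 2)

## hazard-based estimate: a rate (deaths per person-year) is transformed
## into a survival probability assuming a constant hazard in the interval
p.hazard <- exp(-delta * d / pyrs)

round(c(lifetable = p.lifetable, hazard = p.hazard), 4)
```

Both values are close to the surv.obs value of 0.9639 printed for that interval.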
survtab requires the subjects' survival intervals to be split into (small) subintervals, since the requested survival time functions are estimated within each subinterval and cumulated to yield the overall estimates. More and smaller subintervals yield higher-fidelity estimates.
For relative survival estimation survtab also requires the population hazard rates at the person-specific subintervals. For both splitting and merging population hazard rates one should currently use the lexpand (convenience) function. Essentially, lexpand makes use of Lexis found in Epi, splits the data, and merges population hazards if requested.
lexpand and survtab
To demonstrate splitting data and merging population hazard information we use a simulated cohort of Finnish female rectal cancer patients.
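library(popEpi)
## copy() is a data.table function; the namespace prefix makes the call work even if data.table is not attached
sr <- data.table::copy(popEpi::sire)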
head(sr)
## sex bi_date dg_date ex_date status dg_age
## 1: 1 1952-05-27 1994-02-03 2012-12-31 0 41.68877
## 2: 1 1959-04-04 1996-09-20 2012-12-31 0 37.46378
## 3: 1 1958-06-15 1994-05-30 2012-12-31 0 35.95616
## 4: 1 1957-05-10 1997-09-04 2012-12-31 0 40.32055
## 5: 1 1957-01-20 1996-09-24 2012-12-31 0 39.67745
## 6: 1 1962-05-25 1997-05-17 2012-12-31 0 34.97808
lexpand accepts both Date (see ?as.Date) and fractional year (e.g. 2000.315) values as the time variables for time of birth, diagnosis and exit. status may be e.g. a numeric or factor variable.
For splitting one can always specify the break points in any time scale as fractional years, and also as dates for splitting along calendar time (per). Note that if any time variable or the breaks along per are given in the Date format, they are converted internally by lexpand into fractional years using get.yrs(x, year.length = "actual").
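To give an idea of what a fractional year is, here is a minimal base R sketch of one way to convert a Date into one using the actual length of the year in question. The helper frac_year is hypothetical and written only for this illustration; it is not popEpi's get.yrs and may differ from it in details.

```r
## hypothetical helper for illustration only; not part of popEpi
frac_year <- function(d) {
  d <- as.Date(d)
  yr <- as.integer(format(d, "%Y"))
  year.start <- as.Date(paste0(yr, "-01-01"))
  year.end   <- as.Date(paste0(yr + 1L, "-01-01"))
  ## whole year + elapsed share of that year (365 or 366 days long)
  yr + as.numeric(d - year.start) / as.numeric(year.end - year.start)
}

frac_year(c("2000-01-01", "2000-07-02", "2001-07-02"))
```

The first of January maps to the whole year, and the same calendar date maps to slightly different fractions in leap and non-leap years.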
There are two ways to specify breaks in lexpand: via a list, e.g.
x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date,
breaks = list(fot = seq(0,5, 1/12)), status = status, pophaz=popmort)
or by supplying the breaks directly via the ... argument, e.g.
x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date,
fot = seq(0,5, 1/12), status = status, pophaz=popmort)
The former can be simpler sometimes because you can pass a prepared list to the function:
BL <- list(fot = seq(0, 5, 1/12))
x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date,
breaks = BL, status = status, pophaz=popmort)
The methods are equivalent, but if both are used, only the breaks list is utilized in splitting.
fot, used above, specifies splitting breaks along the follow-up time scale. By default lexpand drops any observed survival time outside the time window (here 0-5 years along fot) specified by the breaks along the three time scales, so one should supply meaningful floors and roofs for the time scales. This way one can, for example, prepare the data for period analysis by using per to specify the lowest and highest allowed calendar times.
## up to 5 years of follow-up time in month-long intervals
x <- lexpand(sr, birth=bi_date, entry=dg_date, exit=ex_date,
fot = seq(0,5, 1/12), status = status, pophaz=popmort)
## dropped 16 rows where entry == exit
## 14 rows in expanded data had age values >= 101; assumed for these the same expected hazard as for people of age 100
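What splitting and dropping mean for a single subject can be sketched in base R. The helper split_followup below is hypothetical and handles only one time scale; it is meant to convey the idea, not to reproduce how lexpand works internally. Yearly breaks are used to keep the output short.

```r
## hypothetical helper for illustration only; not part of popEpi
split_followup <- function(entry, exit, breaks) {
  ## follow-up outside the window defined by the breaks is dropped
  lo <- max(entry, min(breaks))
  hi <- min(exit, max(breaks))
  if (hi <= lo) {
    return(data.frame(start = numeric(0), stop = numeric(0)))
  }
  ## the remaining follow-up is cut at every break strictly inside it
  inner <- breaks[breaks > lo & breaks < hi]
  pts <- c(lo, inner, hi)
  data.frame(start = head(pts, -1), stop = tail(pts, -1))
}

## followed for 7.3 years: only the first 5 years are kept
split_followup(0, 7.3, breaks = 0:5)
## observed from 0.5 to 2.25 years: three subintervals
split_followup(0.5, 2.25, breaks = 0:5)
```

A subject whose whole follow-up lies outside the window contributes no rows at all.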
By default survtab estimates EdererII relative survivals by computing excess hazards for each survival interval as specified by the breaks along the fot time scale.
st <- survtab(x)
## event.values was NULL, so chose 0 as non-event value
head(st)
## surv.int Tstart Tstop delta pyrs n.start d n.cens d.exp
## 1: 1 0.00000 0.08333 0.08333 672.4 8227 297 25 25.64
## 2: 2 0.08333 0.16670 0.08333 647.1 7905 215 44 23.63
## 3: 3 0.16670 0.25000 0.08333 627.6 7646 205 36 22.19
## 4: 4 0.25000 0.33330 0.08333 608.6 7405 165 42 21.19
## 5: 5 0.33330 0.41670 0.08333 592.4 7198 152 35 20.26
## 6: 6 0.41670 0.50000 0.08333 578.8 7011 114 26 19.62
## surv.obs.lo surv.obs surv.obs.hi SE.surv.obs r.e2.lo r.e2 r.e2.hi
## 1: 0.9596 0.9639 0.9677 0.002059 0.9626 0.9669 0.9707
## 2: 0.9321 0.9375 0.9426 0.002673 0.9379 0.9434 0.9484
## 3: 0.9060 0.9124 0.9183 0.003126 0.9143 0.9208 0.9267
## 4: 0.8850 0.8920 0.8985 0.003436 0.8958 0.9028 0.9094
## 5: 0.8657 0.8731 0.8802 0.003688 0.8787 0.8862 0.8934
## 6: 0.8511 0.8589 0.8663 0.003861 0.8663 0.8743 0.8818
## SE.r.e2
## 1: 0.002065
## 2: 0.002690
## 3: 0.003155
## 4: 0.003477
## 5: 0.003744
## 6: 0.003930
plot(st, y = "r.e2")
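The first rows of the table above can be approximately reproduced by hand, which may help in reading the columns. The sketch below assumes that interval-specific hazards are d / pyrs (observed) and (d - d.exp) / pyrs (excess), and that they are transformed into interval survival probabilities and cumulated as a product. This is an assumption about the default method based on its description in this vignette.

```r
## first three rows of the survtab output above
delta <- rep(0.08333, 3)           # interval widths in years
pyrs  <- c(672.4, 647.1, 627.6)    # person-years
d     <- c(297, 215, 205)          # observed deaths
d.exp <- c(25.64, 23.63, 22.19)    # expected deaths

## interval-specific hazards: observed, and excess (observed minus expected)
h.obs <- d / pyrs
h.exc <- (d - d.exp) / pyrs

## transform hazards into interval survival probabilities and cumulate them
surv.obs <- cumprod(exp(-delta * h.obs))
r.e2     <- cumprod(exp(-delta * h.exc))

round(cbind(surv.obs, r.e2), 4)
```

The results agree with the printed surv.obs and r.e2 columns to about four decimals.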
Sometimes we are interested in the survival experienced during a calendar period rather than in the survival of the persons diagnosed in that period. This is referred to as period analysis. It is considered to produce more up-to-date survival estimates: for example, survival in the latest 5-year period is estimated from all follow-up experienced in that period, instead of only from the persons diagnosed in it, for whom follow-up can be quite short.
In the framework of popEpi, period analysis is accommodated by splitting the data appropriately before estimation. In lexpand one simply adds splitting breaks along the per time scale, and observed survival time outside those breaks is dropped. Here we conduct a period method analysis for the years 2008-2012.
x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date,
status = status,
fot=seq(0, 5, by = 1/12),
per = c("2008-01-01", "2013-01-01"),
pophaz = popmort)
## dropped 3215 rows where subjects left follow-up before earliest per breaks value
## dropped 4 rows where entry == exit
## 11 rows in expanded data had age values >= 101; assumed for these the same expected hazard as for people of age 100
We may use the default settings for period method EdererII estimates as well. Note that surv.method = "lifetable" is inappropriate for period analysis, as it does not handle late entry correctly.
st <- survtab(x)
## event.values was NULL, so chose 0 as non-event value
head(st)
## surv.int Tstart Tstop delta pyrs n.start d n.cens d.exp
## 1: 1 0.00000 0.08333 0.08333 179.5 2174 55 25 6.057
## 2: 2 0.08333 0.16670 0.08333 174.2 2128 49 46 5.657
## 3: 3 0.16670 0.25000 0.08333 170.4 2059 47 34 5.388
## 4: 4 0.25000 0.33330 0.08333 166.6 2021 36 43 5.158
## 5: 5 0.33330 0.41670 0.08333 163.0 1972 42 35 4.930
## 6: 6 0.41670 0.50000 0.08333 159.5 1923 32 26 4.741
## surv.obs.lo surv.obs surv.obs.hi SE.surv.obs r.e2.lo r.e2 r.e2.hi
## 1: 0.9673 0.9748 0.9806 0.003356 0.9699 0.9775 0.9833
## 2: 0.9424 0.9522 0.9604 0.004574 0.9475 0.9575 0.9656
## 3: 0.9190 0.9306 0.9405 0.005452 0.9264 0.9382 0.9481
## 4: 0.9014 0.9140 0.9250 0.006016 0.9110 0.9238 0.9349
## 5: 0.8809 0.8945 0.9067 0.006592 0.8925 0.9065 0.9187
## 6: 0.8653 0.8797 0.8927 0.006984 0.8789 0.8937 0.9067
## SE.r.e2
## 1: 0.003366
## 2: 0.004599
## 3: 0.005496
## 4: 0.006081
## 5: 0.006679
## 6: 0.007095
plot(st, y = "r.e2")
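The late entry mentioned above arises because subjects diagnosed before the period window only come under observation when the window opens. A base R sketch with three made-up subjects illustrates this; the helper period_window and all values are hypothetical and times are in fractional years.

```r
## hypothetical helper: the part of follow-up that falls inside a calendar window
period_window <- function(dg, ex, per.lo, per.hi) {
  start <- pmax(dg, per.lo)    # calendar time at which observation starts
  stop  <- pmin(ex, per.hi)    # calendar time at which observation stops
  keep  <- stop > start        # FALSE if no follow-up falls inside the window
  ## express the observed part on the follow-up time scale (time since diagnosis)
  data.frame(fot.start = ifelse(keep, start - dg, NA),
             fot.stop  = ifelse(keep, stop - dg, NA))
}

## three made-up subjects: diagnosis and exit times
dg <- c(2005.5, 2009.0, 2001.0)
ex <- c(2010.0, 2012.5, 2006.0)
pw <- period_window(dg, ex, per.lo = 2008, per.hi = 2013)
pw
```

The first subject enters observation 2.5 years after diagnosis (late entry), the second is observed from diagnosis onwards, and the third has no follow-up within the window and would be dropped.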
survtab currently enables relatively easy age standardization. The exact breaks and weights can be supplied by hand. The weights can also be so-called internal weights, based on the proportions of subjects in each age group at diagnosis, or one of the three standard weighting schemes integrated into the package; see ?ICSS for the exact standard weights.
To use internal weights, simply leave agegr.w.weights = NULL and set agegr.w.breaks to the desired age group breaks.
st.as.int <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf))
## event.values was NULL, so chose 0 as non-event value
We may demonstrate how the internal weights are computed:
## get internal weights from the data with age group breaks c(0, 45, 65, 75, Inf)
## this means getting the numbers of cases by age group at diagnosis.
## below is the method using data.table syntax.
iw <- x[!duplicated(lex.id), .N, keyby = cut(dg_age, c(0,45,65,75,Inf), right=FALSE)]
iw <- iw$N/sum(iw$N)
iw
## [1] 0.04012175 0.33453237 0.27061428 0.35473160
st.as.hand <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf),
agegr.w.weights = iw)
## event.values was NULL, so chose 0 as non-event value
plot(st.as.int, y = "r.e2.as", conf.int = FALSE, lwd=4)
lines(st.as.hand, y = "r.e2.as", conf.int = FALSE,
col = "green", lty = 2, lwd=4)
So we get the same result as by using agegr.w.weights = NULL.
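The role of the weights can be sketched as direct standardization: the age-standardized estimate at a given time point is a weighted average of the age-group-specific estimates. The weights below are the internal weights printed above; the age-group-specific survival values are made up for illustration, and whether survtab combines the estimates in exactly this way is an assumption.

```r
## internal weights printed above (age groups 0-44, 45-64, 65-74, 75+)
iw <- c(0.04012175, 0.33453237, 0.27061428, 0.35473160)

## hypothetical age-group-specific relative survival estimates at one time point
r.by.age <- c(0.85, 0.75, 0.70, 0.55)

## directly standardized estimate: weighted average over the age groups
r.as <- sum(iw * r.by.age) / sum(iw)
r.as
```

With internal weights the result reflects the age distribution of the cohort itself, whereas standard weights make estimates comparable between cohorts with different age structures.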
The ICSS weights can be used simply by naming the appropriate weighting scheme in the agegr.w.weights argument, e.g. agegr.w.weights = "ICSS1". See ?ICSS for details on the weights. Note that the weights are only tabulated in age groups no narrower than 5 years, meaning that every break in agegr.w.breaks except the last must be divisible by 5. We recommend using Inf as the last break.
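The restriction on the breaks follows from how weights tabulated by 5-year age group have to be summed into wider groups. The base R sketch below shows the idea; the helper breaks_ok is hypothetical, and the equal weights are placeholders, not the ICSS weights (see ?ICSS for those).

```r
## hypothetical check: every break except the last must be divisible by 5
breaks_ok <- function(breaks) {
  all(head(breaks, -1) %% 5 == 0)
}
breaks_ok(c(0, 45, 65, 75, Inf))   # TRUE
breaks_ok(c(0, 42, 65, Inf))       # FALSE

## placeholder weights by 5-year age group (equal weights, NOT the ICSS weights)
w5 <- data.frame(age.lo = seq(0, 95, by = 5), w = rep(1 / 20, 20))

## sum the 5-year weights into the requested wider age groups
breaks <- c(0, 45, 65, 75, Inf)
grp <- cut(w5$age.lo, breaks, right = FALSE)
w.agg <- tapply(w5$w, grp, sum)
w.agg
```

A break such as 42 would fall inside a tabulated 5-year group, so that group's weight could not be assigned to either side.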
## with ICSS1 weights; see ?ICSS
st.as.icss1 <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf),
agegr.w.weights = "ICSS1")
## event.values was NULL, so chose 0 as non-event value
st.as.icss2 <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf),
agegr.w.weights = "ICSS2")
## event.values was NULL, so chose 0 as non-event value
plot(st.as.icss1, conf.int = FALSE, lwd = 4, col = "black")
## y was NULL; chose r.e2.as automatically
lines(st.as.icss2, conf.int = FALSE, lwd = 4, col = "blue")
Though we did not mention it, the age-adjusted estimates above were period analysis estimates. Combinations of different estimation schemes are straightforward to use with lexpand and survtab. Any estimate produced by survtab can be age-adjusted and computed within the period analysis framework. Counts-based (lifetable) estimates can be computed only outside the period analysis framework, with or without age adjustment.
Sometimes information on the cause of death is available. This may be utilized instead of computing estimates of net survival to yield the desired estimates. The two estimation perspectives typically converge, as seen below.
st.as.cause <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf),
surv.type="surv.cause")
## event.values was NULL, so chose 0 as non-event value
st.as.rel <- survtab(x, agegr.w.breaks = c(0,45,65,75, Inf),
surv.type="surv.rel")
## event.values was NULL, so chose 0 as non-event value
plot(st.as.rel, y = "r.e2.as", conf.int = FALSE, lwd=4)
lines(st.as.cause, y = "surv.obs1.as", conf.int = FALSE,
lwd=4, col="red", lty=2)
Absolute risks of dying due to cause \(k\) can be easily computed (note: still using period analysis data):
st.co <- survtab(x, surv.type="cif.obs")
## event.values was NULL, so chose 0 as non-event value
st.cr <- survtab(x, surv.type="cif.rel")
## event.values was NULL, so chose 0 as non-event value
## the two are very similar; here CIF.rel is NA after 3 years because
## d < d.exp; could be alleviated with larger survival intervals
plot(st.co, "CIF_1", conf.int = FALSE, lwd = 4)
lines(st.cr, "CIF.rel", conf.int = FALSE, lwd = 4, col = "red")
cif.rel is an indirect absolute risk estimate, where absolute risk is estimated using excess cases of death instead of cause-specific cases. This may sometimes produce NA results if the count of excess cases is negative in even one survival interval.
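Why wider survival intervals can help is easy to see from the counts. The base R sketch below uses made-up monthly counts of observed (d) and expected (d.exp) deaths and merges them into 3-month intervals. Summing the counts is done here only for illustration; in practice one would supply wider fot breaks when splitting with lexpand.

```r
## made-up monthly counts of observed and expected deaths
d     <- c(12, 9, 7, 3, 5, 4)
d.exp <- c(4.0, 3.9, 3.8, 3.7, 3.6, 3.5)

## excess cases by month: negative where d < d.exp
excess <- d - d.exp
which(excess < 0)

## merge the months into two 3-month intervals and recompute
grp <- rep(1:2, each = 3)
excess.wide <- tapply(d, grp, sum) - tapply(d.exp, grp, sum)
any(excess.wide < 0)
```

In this example only the fourth month has fewer observed than expected deaths, and the negative excess disappears once the months are merged.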