The R package extremeStat, available at github.com/brry, contains code to fit, plot and compare several (extreme value) distribution functions. It can also compute (truncated) distribution quantile estimates and draw a plot with return periods on a linear scale. (Vignette Rmd source)
Main focus of this document:
Quantile estimation via distribution fitting
Comparison of GPD implementations in several R packages
Note: in some disciplines, quantiles are called percentiles, but technically, percentiles are only one kind of quantile (as are deciles, quartiles, etc.).
install.packages("extremeStat")
library(extremeStat)
To install the development version from github:
install.packages(c("devtools","evd","evir","extRemes","fExtremes",
"ismev","lmomco","pbapply","Renext"))
# reiterate until all of them work (some may not install properly on the first try)
devtools::install_github("brry/berryFunctions")
devtools::install_github("brry/extremeStat")
library(extremeStat)
extremeStat has 28 dependencies, because of the GPD comparison across the packages.
Let’s use the dataset rain with 17k values. With very small values removed, as those might be considered uncertain records, about 6k values remain.
data(rain, package="ismev")
rain <- rain[rain>2]
hist(rain, breaks=80, col=4, las=1)
# Visual inspection is easier on a logarithmic scale:
berryFunctions::logHist(rain, breaks=80, col=3, las=1)
The function distLfit fits 17 of the distribution types available in the R package lmomco. (There are more, but some require considerable computation time and often cannot be fitted to this type of data anyway. Turn them on with speed=FALSE.)
The parameters are estimated via linear moments. These are analogous to the conventional statistical moments (mean, variance, skewness and kurtosis), but “robust [and] suitable for analysis of rare events of non-Normal data. […] L-moments are especially useful in the context of quantile functions” Asquith, W. (2015): lmomco package
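If you want to inspect the linear moments themselves, lmomco computes them directly; a quick sketch using lmomco::lmoms:
lmom <- lmomco::lmoms(rain)  # sample L-moments of the rain data
lmom$lambdas[1:2]            # L-location and L-scale
lmom$ratios[3:4]             # L-skewness and L-kurtosis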
distLfit ranks the distributions according to their goodness of fit (RMSE between the empirical and the theoretical cdf).
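The idea behind this ranking can be sketched with lmomco directly (an illustration of the principle, not extremeStat's internal code; "wei" is used here because it ranks best further below):
x <- sort(rain)
ec <- ecdf(x)(x)                                       # empirical cdf at the data points
para <- lmomco::lmom2par(lmomco::lmoms(x), type="wei") # Weibull parameters via L-moments
tc <- lmomco::plmomco(x, para)                         # fitted cdf at the data points
sqrt(mean((ec - tc)^2))                                # root mean square error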
To estimate the quantiles of (small) samples via a distribution function, you can use distLquantile, which internally calls distLfit, in the following manner:
dlq <- distLquantile(rain, probs=c(0.8,0.9,0.99,0.999), returnlist=TRUE, quiet=TRUE)
By default, the 5 best fitting distribution types are drawn and the quantiles for each distribution are returned. With returnlist=TRUE, an object is returned that can be examined with
distLprint(dlq)
## ----------
## Dataset 'rain' with 6362 values. min/median/max: 2.3/6.6/86.6 nNA: 0
## truncate: 0 threshold: 2.3. dat_full with 6362 values: 2.3/6.6/86.6 nNA: 0
## dlf with 17 distributions. In descending order of fit quality:
## wei, wak, kap, gpa, pe3, exp, gno, ln3, gev, glo, gam, lap, gum, ray, nor, rice, revgum
## gofProp: 1, RMSE min/median/max: 0.014/0.036/0.14 nNA: 0
## quant: 34 rows, 4 columns, 136 values, of which 8 NA.
## 5 distribution colors: #3300FFFF, #00D9FFFF, #00FF19FF, #F2FF00FF, #FF0000FF
## # More information on dlf objects in
## ?extremeStat
plotted with
distLplot(dlq, nbest=8, qlines=TRUE, qlinargs=list(lwd=2),
qheights=seq(0.04, 0.01, len=8), breaks=80)
and the resulting parametric quantiles can be obtained with
dlq$quant # distLquantile output if returnlist=FALSE (the default)
## 80% 90% 99% 99.9%
## wei 13.35648 18.81148 38.06546 58.44658
## wak 13.40619 18.87738 37.81141 57.73253
## kap 13.38974 18.82981 37.92247 58.37445
## gpa 13.20620 18.54999 38.85626 63.76409
## pe3 13.43704 18.90830 37.67313 56.83284
## exp 13.64134 18.79921 35.93328 53.06736
## gno 12.81427 18.01421 40.77776 74.40654
## ln3 12.81427 18.01421 40.77776 74.40654
## gev 12.49364 17.39559 42.07102 90.15945
## glo 12.27871 16.92716 42.72106 102.56565
## gam 13.94944 18.56683 33.03442 46.91116
## lap 11.51132 15.02821 26.71108 38.39395
## gum 14.05929 18.08737 30.70033 43.08422
## ray 14.58774 18.15385 27.16319 34.07630
## nor 14.65654 17.55771 24.44775 29.48528
## rice 13.03579 15.59222 22.05073 27.00652
## revgum 14.75911 16.68154 20.40216 22.57858
## quantileMean 13.20000 19.10000 36.74778 60.57805
## weighted1 13.24463 18.07356 36.21154 60.00399
## weighted2 13.22444 18.11376 36.55873 60.80506
## weighted3 13.32627 18.69234 38.12765 60.23631
## weightedc NaN NaN NaN NaN
## q_gpd_evir_pwm 13.13589 18.51086 39.35954 65.73245
## q_gpd_evir_ml 13.17887 18.50146 38.75922 63.66670
## q_gpd_evd 13.49848 18.84150 39.17160 64.15801
## q_gpd_extRemes_MLE 13.49802 18.84137 39.17554 64.17217
## q_gpd_extRemes_GMLE 13.48895 18.88301 39.69709 65.82174
## q_gpd_extRemes_Bayesian NA NA NA NA
## q_gpd_extRemes_Lmoments 13.45454 18.85351 39.79964 66.30399
## q_gpd_fExtremes_pwm 13.45497 18.85333 39.79276 66.28045
## q_gpd_fExtremes_mle 13.49613 18.83907 39.17433 64.17709
## q_gpd_ismev 13.49921 18.84241 39.17242 64.15725
## q_gpd_Renext_r 13.49848 18.84150 39.17160 64.15801
## q_gpd_Renext_f 14.79884 20.59891 37.67775 51.83807
quantileMean is an average of the 9 methods implemented in stats::quantile to determine empirical quantiles (order-based statistics; keyword: plotting positions).
q_gpd_* are the Generalized Pareto Distribution quantiles as estimated by a range of different R packages and methods (specified in the row names), computed by q_gpd. More on that in the next section, GPD.
weighted* are averages of the quantiles estimated from the distribution functions, weighted by their goodness of fit (RMSE between ecdf and cdf) in three default weighting schemes and a custom one:
distLgofPlot(dlq, ranks=FALSE,
             legargs=list(cex=0.8, bg="transparent"), quiet=TRUE)
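The basic idea of such RMSE-based weighting can be sketched as follows. This assumes the fit errors are stored in dlq$gof$RMSE (check with str(dlq)); the package's actual schemes weighted1 to weighted3 and weightedc differ in detail:
rmse <- dlq$gof$RMSE               # assumed location of the RMSE values in the dlf object
w <- max(rmse, na.rm=TRUE) - rmse  # invert: smaller error -> larger weight
w <- w / sum(w, na.rm=TRUE)        # normalize the weights to sum to 1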
The Generalized Pareto Distribution (‘GPD’, or ‘gpa’ in the package lmomco) is often used to obtain parametric quantile values because of the Pickands–Balkema–de Haan theorem. It states that the tails of many (empirical) distributions converge to the GPD if a peak-over-threshold (POT) approach is used, i.e. the distribution is fitted only to the largest values of a sample. The resulting quantiles can be called censored or truncated quantiles.
If, for example, Q0.99 (p = 0.99) is to be computed from the top 20 % of the full dataset, i.e. with truncation percentage t = 0.8, then Q0.95 (p2 = 0.95) of the truncated sample must be used. The probability adjustment for censored quantiles with truncation percentage t happens with the equation \[ p2 = \frac{p-t}{1-t}, \] which follows from the proportionality \[ \frac{1-p}{1-t} = \frac{1-p2}{1-0}, \] as visualized along a probability line:
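A quick numeric check of this adjustment for the example above:
p <- 0.99 ; t <- 0.8
(p2 <- (p-t)/(1-t))   # 0.95
(1-p)/(1-t)           # 0.05, which equals 1-p2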
In distLquantile, you can set the threshold manually or (better) as a truncate percentage reflecting the proportion of data discarded:
d <- distLquantile(rain, truncate=0.9, plot=TRUE, probs=0.999, quiet=TRUE, breaks=50)
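For comparison, a manual POT fit with one of the packages wrapped by q_gpd could look as follows (a sketch using ismev; the threshold matches truncate=0.9):
threshold <- quantile(rain, 0.9)       # keep only the top 10 % of the values
fit <- ismev::gpd.fit(rain, threshold) # ML fit of the GPD to the excesses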
To examine the effect of the truncation percentage, we can compute the quantiles for a range of cutoff percentages. This is quite time consuming, so the code is not run during vignette creation; the result is loaded instead.
tt <- seq(0,0.95, len=50)
if(interactive()) lapply <- pbapply::pblapply # for progress bars
qq <- lapply(tt, function(t) distLquantile(rain, truncate=t,
probs=c(0.99,0.999), quiet=TRUE))
save(tt,qq, file="qq.Rdata")
We can visualize the truncation dependency with
load("qq.Rdata")
par(mar=c(3,2.8,2.2,0.4), mgp=c(1.8,0.5,0))
plot(tt,tt, type="n", xlab="truncation proportion", ylab="Quantile estimation",
main="truncation effect for 6k values of rain", ylim=c(22,90), las=1)
dn <- c("wak","kap","wei","gpa","pe3","weighted2")
cols <- c(4,5,3,"orange",2,1) ; names(cols) <- dn
for(d in rownames(qq[[1]])) lines(tt, sapply(qq, "[", d, j=2), col=8)
for(d in dn)
{
lines(tt, sapply(qq, "[", d, j=1), col=cols[d], lwd=2)
lines(tt, sapply(qq, "[", d, j=2), col=cols[d], lwd=2)
}
abline(h=berryFunctions::quantileMean(rain, probs=c(0.99,0.999)), lty=3)
legend("topright", c(dn,"other"), col=c(cols,8), lty=1, lwd=c(rep(2,6),1), bg="white", cex=0.6)
text(0.9, 53, "Q99.9%") ; text(0.9, 34, "Q99%")
text(0.35, 62, "empirical quantile (full sample)", cex=0.7)
The 17 different distribution quantiles and 12 different GPD estimates seem to converge with increasing truncation percentage. However, at least 5 values must remain in the truncated sample to fit distributions via linear moments, so do not truncate too much. I found 0.8 to be a good truncation percentage: fitting to the top 20 % of the data gives good results while needing ‘only’ about 25 values in a sample (5 / 0.2) to infer a quantile estimate.
One motivation behind the development of this package is the finding that high empirical quantiles depend not only on the values of a sample (as they should), but also on the number of observations available. That is not surprising: given the distribution of a population, small samples are less likely to include the high (and rare) values. The nice thing about parametric quantiles is that they do not systematically underestimate the actual quantile in small samples. Here’s a quick demonstration.
set.seed(1)
ss <- c(30,50,70,100,200,300,400,500,1000)
rainsamplequantile <- function() sapply(ss, function(s) distLquantile(sample(rain,s),
        probs=0.999, plot=FALSE, truncate=0.8, quiet=TRUE, sel="wak", gpd=FALSE, weight=FALSE))
sq <- pbapply::pbreplicate(n=100, rainsamplequantile())
save(ss,sq, file="sq.Rdata")
Load the resulting R objects:
load("sq.Rdata")
par(mar=c(3,2.8,2.2,0.4), mgp=c(1.7,0.5,0))
sqs <- function(prob,row) apply(sq, 1:2, quantile, na.rm=TRUE, probs=prob)[row,]
berryFunctions::ciBand(yu=sqs(0.6,1), yl=sqs(0.4,1), ym=sqs(0.5,1), x=ss,
ylim=c(25,75), xlim=c(30,900), xlab="sample size", ylab="estimated 99.9% quantile",
main="quantile estimations of small random samples", colm="blue")
berryFunctions::ciBand(yu=sqs(0.6,2), yl=sqs(0.4,2), ym=sqs(0.5,2), x=ss, add=TRUE)
abline(h=quantile(rain,0.999))
text(250, 50, "empirical", col="forestgreen")
text(400, 62, "Wakeby", col="blue")
text(0, 61, "'True' population value", adj=0)
text(600, 40, "median and central 20% of 100 simulations")
Once you have a quantile estimator, you can easily compute extremes (= return levels) for given return periods. A value x in a time series has a certain expected frequency of occurring or being exceeded: the exceedance probability Pe. The return period (RP) of x can be computed as follows:
\[ RP = \frac{1}{Pe} = \frac{1}{1-quantile(x)} \]
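For instance, an event with non-exceedance probability 0.999 in a series of annual values has a return period of 1000 years:
p <- 0.999     # non-exceedance probability, quantile(x)
RP <- 1/(1-p)  # return period: 1000 (years, for annual data)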
Here is an example with annual block maxima of stream discharge in Austria:
data("annMax") # annual discharge maxima in the extremeStat package itself
dle <- distLextreme(annMax, log=TRUE, legargs=list(cex=0.6, bg="transparent"), nbest=17, quiet=TRUE)
dle$returnlev[1:20,]
## RP.2 RP.5 RP.10 RP.20 RP.50
## wak 62.06908 82.00224 93.37393 103.30175 114.81836
## kap 61.63990 82.43319 94.20990 103.80750 113.98584
## wei 61.84405 81.72957 93.39678 103.55521 115.46953
## pe3 61.86107 81.13112 92.97753 103.72004 116.87811
## ray 62.37416 81.92136 93.07332 102.63851 113.71311
## ln3 61.89473 80.85126 92.71419 103.69028 117.47486
## gno 61.89473 80.85126 92.71419 103.69028 117.47486
## gev 61.85979 80.83924 92.82171 103.89743 117.64984
## gum 61.24316 80.26879 92.86542 104.94841 120.58860
## gpa 61.36114 84.02187 95.30208 103.19519 110.11583
## gam 62.54834 81.39612 92.57994 102.53018 114.52039
## glo 62.16467 79.38862 91.10085 103.11624 120.24369
## lap 62.19140 77.33774 88.79551 100.25328 115.39963
## rice 64.59196 82.14649 91.37297 99.01113 107.62436
## nor 64.78000 82.13652 91.20908 98.70136 107.13390
## exp 57.63946 78.96177 95.09148 111.22119 132.54351
## revgum 68.31684 82.45728 88.46912 92.88645 97.36604
## quantileMean 61.42222 82.14694 93.28444 105.64000 112.76000
## weighted1 62.14394 81.21538 92.70405 102.97786 115.41622
## weighted2 62.04278 81.23617 92.80714 103.14857 115.65607
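Return levels for other return periods can be obtained by converting the RPs to non-exceedance probabilities with the relation above and passing those to distLquantile (a sketch; the chosen return periods are arbitrary):
RPs <- c(25, 100)    # hypothetical return periods in years
probs <- 1 - 1/RPs   # corresponding non-exceedance probabilities
distLquantile(annMax, probs=probs, sel="wak", gpd=FALSE, weight=FALSE,
              plot=FALSE, quiet=TRUE)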
Explore the other possibilities of the package by reading the function help files.
A good place to start is the package help:
?extremeStat
Any feedback on this package (or this vignette) is very welcome via github or berry-b@gmx.de!