This short document showcases how to use the gllvm package to analyse multivariate percent cover data. Namely, we show how to apply the hurdle beta GLLVM (with logistic link), as detailed in Korhonen et al. (2024), to analyse the kelp forest dataset from the Santa Barbara Coastal Long-Term Ecological Research project, available from https://doi.org/10.6073/pasta/0af1a5b0d9dde5b4e5915c0012ccf99c.
A multivariate percent cover dataset often comes in the form of a \(n \times m\) matrix \(\bf{Y}\), with \(n\) being the number of observational units/sites, and \(m\) being the number of species of plants, macroalgae, sessile invertebrates, et cetera. The response \(y_{ij}\) is then the recorded percentage of the relative area covered by species \(j\) on unit/site \(i\). Typically such datasets contain considerable proportion of zero observations, as a given obs. unit is often populated by only a small subset of the \(m\) species in total. More rarely, it might even be the case that one of the species covers some obs. unit completely.
Traditionally, beta regression has been used to model responses that take the form of percentages. If \(y^*_{ij}\) is in the (open) interval \((0,1)\), and is distributed according to beta distribution, \(y^*_{ij} \sim \text{Beta}(\mu_{ij}, \phi_j)\), then it has the following density: \[ f_{\text{beta}}(y_{ij}^*; \mu_{ij}, \phi_j) = \frac{\Gamma(\phi_j)}{\Gamma(\mu_{ij}\phi_j)\Gamma(\phi_j-\phi_j\mu_{ij})} (y_{ij}^*)^{\mu_{ij}\phi_j-1}(1-y_{ij}^*)^{\phi_j(1-\mu_{ij}-1)}.\]
The mean parameter \(\mu_{ij}\) can be connected to a set of covariates and latent variables through some link function, by the equation \(g(\mu_{ij})=\eta_{ij} = \beta_{0j} + \boldsymbol{\beta}_j^\top\bf{x_i} + \boldsymbol{\lambda}_j^\top\bf{u_i}\).
However, such a model is ill-suited for most percent cover datasets, due to the fact that the beta distribution isn’t capable of handling zero (or \(100\%\)) responses. To make use of the beta regression model in such scenario, one needs to use some transformation in order to ‘push’ the responses away from the boundaries. This procedure might provide reasonable results when the numbers of zeros or ones in the data are relatively low. On the other hand, by transforming the zeros and ones, we might lose some important information.
A more sophisticated way of tackling such issue is by considering a zero-accommodating model, couple of which have been proposed recently. One such model is the hurdle beta model, which models the two (can be extended to three if \(100\%\) covers are recorded, see Korhonen et al. (2024)) classes the response \(y_{ij}\) can take, i.e., \(\{0\}\) and \((0,1)\), by separate processes. Namely, the zeros are assumed to be generated by a Bernoulli process. Conditional on \(y_{ij}\in(0,1)\), the response is modeled using the standard beta distribution as presented above. The likelihood function for a response \(Y_{ij}\) following the hurdle beta distribution is of the form: \[ P(Y_{ij};\mu_{ij}, \mu_{ij}^0, \phi_j) = \begin{cases} 1-\mu_{ij}^0, & Y_{ij} = 0,\\ \mu_{ij}^0 \cdot f_{\text{beta}}(Y_{ij};\mu_{ij},\phi_j), & Y_{ij} \in (0,1). \\ \end{cases} \] where \(g(\mu_{ij}^0) = \eta_{ij}^0 = \beta_{0j}^0 + \bf{x_i}^\top\boldsymbol{\beta}_j^0 + \bf{u_i}^\top\boldsymbol{\lambda}_j^0\) for and \(g(\mu_{ij}) = \eta_{ij} = \beta_{0j} + \bf{x_i}^\top\boldsymbol{\beta}_j + \bf{u_i}^\top\boldsymbol{\lambda}_j\) and \(g(\cdot)\) can be either probit or logistic link function. Note, that here, the separate linear predictors share the same environmental covariates \(\bf{x}_i\) and latent variable scores \(\bf{u}_i\), while the coefficients and loadings are allowed to differ.
The gllvm package implements the hurdle beta GLLVM
with two different estimation methods available. First, accessed by the
argument method="VA" when calling uses a hybrid approach,
where the method of variational approximations, or VA, is applied to the
Bernoulli-process part of the data (only probit link allowed), while the
method of extended variational approximations, see
Korhonen et al. (2023), is applied to the
beta distributed part. By instead specifying method="EVA",
the EVA method is applied to both parts of the
likelihood.
In the following, we show how to fit the logistic hurdle beta GLLVM using EVA on the SBC LTER marine macroalgae (i.e., seaweed) percent cover dataset. The data has been collected in 2000-2020 along 44 permanent transect lines along coastal southern California. We will specify a model with two latent variables (for ordination) and will include the rockiness of the seabed and the average number of stripes of giant kelp as environmental covariates in the model.
library(gllvm)
data("kelpforest")
Yabund <- kelpforest$Y
Xenv <- kelpforest$X
SPinfo <- kelpforest$SPinfo
# Data contains both algae and sessile invertebrates
table(SPinfo$GROUP)##
## ALGAE INVERT PLANT
## 61 69 2
# Select only the macroalgae:
Yalg <- Yabund[,SPinfo$GROUP=="ALGAE"]
# To demonstrate the models, use only the data from the year 2016:
Yalg <- Yalg[Xenv$YEAR==2016,]
Xenv <- Xenv[Xenv$YEAR==2016,]
# Remove species which have no observations or just one
Yalg <- Yalg[,-which(colSums(Yalg>0)<2)]
# Number of obs. and species:
dim(Yalg)## [1] 44 42
After setting up the data, LV design and the covariates, the model is estimated by
fit <- gllvm(Yalg, X=Xenv, formula = Xformulai, family = "betaH", method="EVA",
num.lv = 2, link="logit", control=list(reltol=1e-12))To inspect e.g., the covariate effects, use
## KELP_FRONDS PERCENT_ROCKY
## AU -0.122209103 -1.283349e-03
## BF -1.360332309 5.801991e-02
## BO -0.003100851 1.431293e-03
## BR -0.068643141 1.907711e-02
## BRA 0.347549306 -8.085150e-02
## CAL -0.320344891 -8.675098e-02
## CC 0.142755687 -5.013181e-03
## CF -0.092967998 1.271395e-02
## CG 0.020884837 -3.758900e-02
## CO -0.068555886 1.041070e-02
## COF -0.423411406 5.522641e-02
## CP 0.118056173 -1.588175e-02
## CRYP -0.002858204 1.981374e-04
## CYOS 0.018435875 -3.250126e-03
## DL 0.101084534 -3.888676e-03
## DMH -0.174994103 7.444457e-05
## DP -0.058492910 7.745072e-03
## DU -0.127369273 -8.036426e-03
## EAH 6.482020011 -3.228852e-02
## EC 0.078057891 1.147692e-02
## EH 0.610369851 -1.145691e-01
## ER 0.068963779 3.079087e-02
## FB -0.457775979 8.607179e-03
## FTHR -0.455661752 -1.344829e-02
## GR 0.103849898 1.390224e-03
## GS -0.522809726 1.885821e-03
## GYSP -0.240696643 2.006943e-02
## MH 0.095837375 -2.722491e-03
## NIE -0.021605927 -5.742149e-03
## PH 0.886170873 -2.838054e-01
## PHSE -0.589173708 1.128536e-02
## PL 0.097993599 5.566366e-03
## POLA -0.819891772 3.100197e-03
## PRSP 0.086299073 -4.671810e-03
## R 0.083686724 7.659231e-03
## RAT 0.064375184 1.928793e-04
## SAFU -0.545371547 2.397104e-02
## SAHO 6.750796990 -3.701052e-02
## SAMU -0.047064169 1.316889e-03
## SCCA 0.011306430 1.871045e-02
## TALE 0.063861591 2.003799e-02
## UV -0.154942212 -1.844867e-02
## H01_AU -0.231701250 -1.961703e-02
## H01_BF -0.062660834 -3.214329e-02
## H01_BO 0.183579117 -9.410487e-03
## H01_BR -0.104798619 5.574894e-04
## H01_BRA -0.752722063 -1.833260e-03
## H01_CAL -0.184689504 2.078563e-01
## H01_CC -0.442607358 1.098526e-02
## H01_CF -0.085197179 -3.546043e-03
## H01_CG 0.032741147 2.803438e-01
## H01_CO 0.042194990 -1.066421e-02
## H01_COF -0.915370086 2.653183e-01
## H01_CP -1.069792809 1.980182e-01
## H01_CRYP -0.085198743 -1.767328e-02
## H01_CYOS 0.143152624 -1.624587e-02
## H01_DL 0.059840588 -1.453731e-02
## H01_DMH -0.141005971 -2.404699e-02
## H01_DP -0.092802713 1.609255e-02
## H01_DU -0.408002869 2.124290e-01
## H01_EAH -6.706575631 1.835916e-01
## H01_EC 0.102534384 5.339297e-02
## H01_EH -0.068284984 -3.794347e-02
## H01_ER 0.018869047 3.552697e-02
## H01_FB -0.542344508 -7.666806e-03
## H01_FTHR -0.759636069 -4.048875e-02
## H01_GR -0.239926167 3.154727e-01
## H01_GS -0.935919860 -1.516765e-01
## H01_GYSP 0.012620466 1.237633e-02
## H01_MH 2.392631048 -2.076656e-02
## H01_NIE -0.080470870 -2.567291e-02
## H01_PH -0.367997950 -6.207349e-02
## H01_PHSE -1.600730192 -3.291749e-02
## H01_PL 0.066267348 -2.442770e-03
## H01_POLA -0.509469632 -9.180662e-03
## H01_PRSP -0.090440275 -8.011631e-03
## H01_R -0.145724376 8.065022e-03
## H01_RAT -0.180950988 2.836153e-02
## H01_SAFU -3.230597944 -9.267528e-02
## H01_SAHO -3.108593916 2.004519e-01
## H01_SAMU -0.116237559 4.621002e-03
## H01_SCCA -0.032835380 2.183077e-02
## H01_TALE -0.420711801 2.199147e-01
## H01_UV -0.536467918 -1.000283e-02
In the above, the prefix indicates that the coefficient relates to the Bernoulli part of the hurdle model.
Ordination plot can then be generated as per usual:
Another solution for modeling percentage cover data in gllvm is to use ordered beta response model. It handles zeros and ones slightly differently compared to hurdle beta model. Instead of assuming that the zeros and ones comes from separate process from the percent cover, the model assumes that there is an underlying process \(z_{ij}\) where all observations comes from.
For species \(j = 1, . . . , m\), let \(z_{ij}\) denote an underlying continuous variable, and define two cutoff parameters \(\zeta_{j0} < \zeta_{j1}\) such that \(Y_{ij} = 0\) occurs when \(z_{ij} < \zeta_{j0}\), \(Y_{ij}=1\) occurs when \(z_{ij}>\zeta_{j1}\), and \(Y_{ij} \in (0,1)\) occurs when \(\zeta_{j0} < z_{ij} < \zeta_{j1}\). Conditional on \(Y_{ij} \in (0,1)\), the response variable follows a beta distribution. By assuming \(z_{ij}\) follows a logistic distribution, then marginalising over \(z_{ij}\) we obtain the following distribution for the percent cover responses that characterizes the ordered beta GLLVM,
\[\begin{align} P(Y_{ij};\eta_{ij}, \phi_j) = \begin{cases} \rho^0_{ij}, & if Y_{ij} = 0 ,\\ \left(\rho^1 - \rho^0 \right) \cdot f_{\text{beta}}(Y_{ij}; \mu_{ij}, \phi_j), & if Y_{ij} \in (0,1) ,\\ 1-\rho^1_{ij}, & if Y_{ij} = 1 ,\\ \end{cases} \end{align}\]
Lets demonstrate the ordered beta response model for the previous example. As the data has no ones:
## [1] 0
We can accommodate the model to better handle data that by fixing the
upper cutoff parameters to some large value, for example 20. With
gllvm this can be done by setting starting values for the
cutoff parameter, ´zetacutoff = c(0, 20)´ and fixing the upper cutoff
parameters with ´setMap = list(zeta = factor(rbind(1:m, rep(NA, m))))´.
We assume that the number of species/response variables in the data is
saved to object m. There are ´m´ (number of species) lower
cutoff parameters \(\zeta_{j0}\) we let
be freely estimated (that’s indexes ´1:m´ in ´setMap´) and ´m´ upper
cutoff parameters \(\zeta_{j1}\) that
we fix (that’s ´rep(NA, m)´ in ´setMap´).
Some species are observed only a few times:
## AU BF BO BR BRA CAL CC CF CG CO COF CP CRYP CYOS DL DMH
## 23 3 11 5 2 5 30 13 5 24 5 3 3 33 26 8
## DP DU EAH EC EH ER FB FTHR GR GS GYSP MH NIE PH PHSE PL
## 14 4 3 36 2 6 6 6 11 20 7 36 13 2 5 4
## POLA PRSP R RAT SAFU SAHO SAMU SCCA TALE UV
## 6 4 31 17 6 2 7 7 5 6
so there is not much information to estimate the shape parameter of the beta distribution for each species separately. Thus we can also set the shape parameter to be common across species with ´disp.formula = rep(1, m)´. This can be applied for all beta based models in gllvm.
Now we are ready to fit the model:
# save the number of species to object m
m <- ncol(Yalg)
fit_ob <- gllvm(Yalg, X=Xenv, formula = Xformulai, family = "orderedBeta",
method="EVA", num.lv = 2, link="logit",
disp.formula = rep(1, m), zetacutoff = c(0, 20),
setMap = list(zeta = factor(rbind(1:m, rep(NA, m)))) )
fit_ob## Call:
## gllvm(y = Yalg, X = Xenv, formula = Xformulai, family = "orderedBeta",
## num.lv = 2, method = "EVA", link = "logit", disp.formula = rep(1,
## m), setMap = list(zeta = factor(rbind(1:m, rep(NA, m)))),
## zetacutoff = c(0, 20))
## family:
## [1] "orderedBeta"
## method:
## [1] "EVA"
##
## log-likelihood: 79648452
## Residual degrees of freedom: 1554
## AIC: -159296316
## AICc: -159296204
## BIC: -159294692
Now if we check the cutoff values we see that the upper bounds are fixed to 20
## cutoff0 cutoff1
## AU -1.0938416 20
## BF -2.2403039 20
## BO -3.5024107 20
## BR 0.1418348 20
## BRA 0.6865917 20
## CAL -1.6336500 20
## CC -4.0876720 20
## CF -3.2309521 20
## CG -0.8384183 20
## CO -3.6302783 20
## COF -6.3210988 20
## CP -5.6135398 20
## CRYP -2.0447577 20
## CYOS -4.3729782 20
## DL -4.9061019 20
## DMH -2.3962789 20
## DP -0.5498199 20
## DU 1.3754556 20
## EAH -4.0762696 20
## EC -4.9641995 20
## EH -0.6291676 20
## ER -1.5659049 20
## FB -3.6929316 20
## FTHR -4.3242277 20
## GR -5.4747089 20
## GS -6.1731040 20
## GYSP -4.2806674 20
## MH -8.5175978 20
## NIE -1.6000196 20
## PH 0.5491855 20
## PHSE -2.7255742 20
## PL -1.2783440 20
## POLA -4.5658141 20
## PRSP -1.9830743 20
## R -5.0921771 20
## RAT -4.4333795 20
## SAFU -6.5571713 20
## SAHO -0.7429040 20
## SAMU -1.8068872 20
## SCCA 0.1864054 20
## TALE -1.2032984 20
## UV -4.2008827 20
Ordination plot: