Sherman E. Lo, Queen Mary, University of London
A reimplementation of poLCA [CRAN, GitHub] in C++. It attempts to reproduce results and be as similar as possible to the original code, while running faster, especially with multiple repetitions, by utilising multiple threads.
The package poLCAParallel reimplements the poLCA fitting, standard error calculations, goodness of fit tests and the bootstrap log-likelihood ratio test in C++. This was done using Rcpp and RcppArmadillo which allows R to run fast C++ code. Additional notes include:
std::jthread
for multi-thread programming and std::mutex
to prevent data racesstd::map
for the chi-squared calculations to improve performanceFurther reading is available on the QMUL ITS Research Blog.
poLCA is a software package for the estimation of latent class models and latent class regression models for polytomous outcome variables, implemented in the R statistical computing environment.
Latent class analysis (also known as latent structure analysis) can be used to identify clusters of similar “types” of individuals or observations from multivariate categorical data, estimating the characteristics of these latent groups, and returning the probability that each observation belongs to each group. These models are also helpful in investigating sources of confounding and nonindependence among a set of categorical variables, as well as for density estimation in cross-classification tables. Typical applications include the analysis of opinion surveys; rater agreement; lifestyle and consumer choice; and other social and behavioral phenomena.
The basic latent class model is a finite mixture model in which the component distributions are assumed to be multi-way cross-classification tables with all variables mutually independent. The model stratifies the observed data by a theoretical latent categorical variable, attempting to eliminate any spurious relationships between the observed variables. The latent class regression model makes it possible for the researcher to further estimate the effects of covariates (or “concomitant” variables) on predicting latent class membership.
poLCA uses expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the parameters of the latent class and latent class regression models.
The easiest way to install poLCAParallel is to use R with remotes.
Run the following in R to install the latest version
remotes::install_github("QMUL/poLCAParallel@package")or for a previous version, for example,
remotes::install_github("QMUL/poLCAParallel@v1.2.4")Download the .zip or .tar.gz file from the
releases. Install it in R using
remotes::install_local(<PATH TO .zip OR .tar.gz FILE>)Please consider citing the corresponding QMUL ITS Research Blog
and the publication below which this software was originally created for
model <- poLCAParallel::poLCA(), set the
parameters calc.se=FALSE and calc.chisq=FALSE
to avoid doing standard error and goodness of fit calculations
respectively. This will save time if you do not require those results.
You can always calculate them afterwards using
model <- poLCAParallel::poLCAParallel.se(model) and
model <- poLCAParallel::poLCAParallel.goodnessfit(model).poLCAParallel::poLCA(), set nrep=1 to do a
test run and gauge how long it takes. Afterwards, set nrep
to a bigger number to try different initial values in parallel.poLCAParallel::poLCA(), set
n.thread to set the number of threads to be used by the
computer. By default, it uses all detectable threads.poLCAParallel::poLCA(), set
se.smooth=TRUEpoLCAParallel::poLCAParallel.se(), set
is_smooth=TRUEset.seed() before using
poLCAParallel::poLCA() to set the seed for random number
generation. This ensures reproducibility when reporting what seed you
have used.R scripts which compare poLCAParallel with poLCA are provided in
exec/. An example use of a bootstrap likelihood ratio test
is shown in exec/3_blrt.R.
poLCAParallel::poLCA(), the following arguments have
been added:
n.thread is provided to specify the number of threads
to use.calc.chisq is provided to specify if you want to
conduct goodness of fit tests or not.se.smooth is provided if you wish to use Laplace
smoothing on the response probabilities in the standard error
calculations.$prior.probs.start are the initial probabilities
used to achieve the maximum log-likelihood from any repetition
rather than from the first repetition.eflag is set to TRUE if
any repetition has to be restarted, rather than the repetition
which achieves maximum log-likelihood.calc.se is set
to FALSE even in poLCA regression. Previously, the standard
error was calculated regardless of calc.se in poLCA
regression.stop() rather
than return a NULL.predcell.The following installation instructions are useful if you wish to develop the code and install a locally modified version of the package.
Requires the R packages for compiling and testing:
Requires the dependent R packages:
Git clone this repository
git clone https://github.com/QMUL/poLCAParallel.gitand change directory into it
cd poLCAParallelFrom there, in the repository root, run the following to generate additional code and documentation so that the package can be compiled correctly
R -e "usethis::use_namespace()"
R -e "Rcpp::compileAttributes()"
R -e "roxygen2::roxygenize()"Install the package using
R CMD INSTALL --preclean --no-multiarch .The testing of the C++ code is done using Catch2 and the R code
using testthat. All test codes
are in tests/.
The tests for the C++ code are done by compiling the test code, isolated from any R ecosystem, and running a compiled executable. It requires cmake, Catch2 and armadillo. To compile the code, from the repository root, make a new directory and use cmake inside it
mkdir build
cd build
cmake ..
cmake --build .This will compile an executable called
test_polca_parallel. Execute it to run the tests. Pass
names or tags to run specific tests, see tests/*.cc.
To test the R code, run the following at the repository root
R -e "testthat::test_local()"The package renv is used to record and manage R
dependencies, with versions pinned, for use during development,
maintenance and testing. The file renv.lock contains these
dependencies. It shall be regularly updated during maintenance. The lock
file is also used in the Apptainer definition file
poLCAParallel-dev.def below to further reproduce the
environment in a container.
From the repository root, run the following commands to set up an R environment and install the dependencies, with the specified versions, used for development and testing
R -e "renv::init(bare=TRUE)"
R -e "renv::restore()"Run R commands from the repository root to use these
dependencies.
The lock file may need to be updated during maintenance. This can be
done by starting a fresh R environment, after ensuring the
renv artifacts are deleted:
.Rprofilerenv.lockrenv/Then take a snapshot of the latest dependencies
R -e "renv::init()"
R -e "renv::snapshot(dev=TRUE)"This will overwrite the file renv.lock specifying
dependencies with the latest versions.
Apptainer definition files are provided, which can be used to install the package inside a container. These may be useful for further troubleshooting or development.
poLCAParallel.def installs R and
the package only. No version pinningpoLCAParallel-dev.def installs the
R package as well as generating documentation and running tests within
the container. Versions of dependencies are pinned to the ones used
during development or maintenanceTo build the container, use the command (or similar)
apptainer build poLCAParallel-dev.sif poLCAParallel-dev.defWithin the container, the package is located in
/usr/src/poLCAParallel. When using the definition file
poLCAParallel-dev.def, the C++ doxygen documentation is
located in /usr/src/poLCAParallel/html.
All generated documents and codes, eg from
R -e "Rcpp::compileAttributes()"and
R -e "roxygen2::roxygenize()"shall not be included in the master branch. Instead,
they shall be in the package branch so that this package
can be installed using
remotes::install_github("QMUL/poLCAParallel@package"). This
is to avoid having duplicate documentation and generated code on the
master branch. The exception to this rule is
renv.lock which is produced by
renv::snapshot(dev=TRUE).
Semantic versioning is used and tagged. Tags on the
master branch shall have v prepended and
-master appended, eg. v1.1.0-master. The
corresponding tag on the package branch shall only have
v prepended, eg. v1.1.0.
85ee419, the
multiplication starts from DOUBLE_XMAX to avoid underflows
but was reverted. Consider investigating further in future
releases.n.thread should be 1
instead of parallel::detectCores()The R code should follow the Tidyverse style guide. In particular, variables, functions and parameters should be in snake case. This will result in
poLCA. and poLCAParallel.
prefix in function and file namesna.rm should be called
na_rmThe following R functions, many of which are internal, are marked as deprecated and should be deleted
poLCA.se() and poLCA.dLL2dBeta.C() - no
longer needed because the standard error calculations are reimplemented
in poLCAParallel.se()poLCA.probHat.C - no longer needed because the goodness
of fit test is reimplemented in goodness_fit.ccpoLCA.postClass.C() and poLCA.ylik.C() -
no longer needed and reimplemented in polca_rcpp.ccpoLCA.vectorize() and poLCA.unvectorize()
- no longer needed and reimplemented in
poLCAParallel.vectorize() and
poLCAParallel.unvectorize() respectivelyAll C code in poLCA.C is deprecated because they are
reimplemented in C++.
The parameters:
results in
poLCAParallel.goodnessfit()polca in poLCAParallel.se()should be renamed to lc to be consistent with other
functions with a parameter also named lc.
Similarly, the parameters model_null and
model_alt in blrt() should be renamed to
lc_null and lc_alt respectively.
There was an attempt to use the Google C++ style guide.
The C++ code documentation can be created with Doxygen by running
doxygen
and viewed at html/index.html.
The software is under the GNU GPL 2.0 license, as with the original poLCA code, stated in their documentation.