Calculating the Proportionality Coefficients of Compositional Data

Thomas Quinn

2016-07-05

Introduction

The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, as demonstrated thoroughly in a recent publication, this approach lacks statistical validity when applied to relative data (Lovell 2015). This includes some of the most frequently studied biological data, such as gene expression data produced by microarray assays or high-throughput RNA-sequencing. As an alternative to correlation, Lovell et al. propose a proportionality metric, \(\phi\), derived from compositional data (CoDa) analysis. A subsequent publication built on this work by elaborating on another proportionality metric, \(\rho\) (Erb 2016). This package introduces a programmatic framework for the calculation of feature dependence using proportionality and other compositional data methods discussed in the cited publications.

Let \(A_i\) and \(A_j\) each represent a log-ratio transformed feature vector (e.g., the transformed expression of one of \(d\) genes measured across \(n\) conditions). We then define the metrics \(\phi\) and \(\rho\) accordingly:

\[\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}\]

\[\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}\]

Above, we use the log-ratio transformation in order to normalize the data in a manner that respects the nature of relative data. In other words, log-ratio transformation yields the same result whether applied to absolute or relative data. In this package, we consider two log-ratio transformations of the subject vector \(x\): the centered log-ratio transformation (clr) and the additive log-ratio transformation (alr). We define the transformations \(clr(x)\) and \(alr(x)\) accordingly:

\[\textrm{clr(x)} = \left[\ln\frac{x_1}{g(\textrm{x})};...;\ln\frac{x_D}{g(\textrm{x})}\right]\]

\[\textrm{alr(x)} = \left[\ln\frac{x_1}{x_D};...;\ln\frac{x_{D-1}}{x_D}\right]\]

In clr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, \(g(\textrm{x}) = \sqrt[D]{x_1...x_D}\). In alr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and a chosen reference feature. Although these transformations differ in definition, we will sometimes refer to them jointly by the acronym *lr.
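
To make these definitions concrete, the following minimal sketch implements both transformations in base R for a single composition. The helper names clr_by_hand and alr_by_hand are illustrative only and are not part of the package.

# Illustrative helpers (not part of propr): clr and alr of one composition x
# with strictly positive parts.
clr_by_hand <- function(x) log(x) - mean(log(x))      # log(x_i / geometric mean of x)
alr_by_hand <- function(x, ivar = length(x)) {
  log(x[-ivar]) - log(x[ivar])                        # log(x_i / x_D), reference dropped
}
x <- c(10, 20, 40, 80)
clr_by_hand(x)
alr_by_hand(x, ivar = 4)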

Calculating proportionality

We provide two principal functions for calculating proportionality. The first function, phit, implements the calculation of \(\phi\) described in Lovell et al (2015). This function makes use of clr-transformation exclusively. The second function, perb, implements the calculation of \(\rho\) described initially in Lovell et al (2015) and expounded by Erb and Notredame (2016). This function makes use of either clr- or alr-transformation.

The first difference between \(\phi\) and \(\rho\) is scale. The values of \(\phi\) range from \([0, \infty)\), with lower \(\phi\) values indicating more proportionality. The values of \(\rho\) range from \([-1, 1]\), with greater \(|\rho|\) values indicating more proportionality and negative \(\rho\) values indicating inverse proportionality. A second difference is that \(\phi\) lacks symmetry. However, one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). A third difference is that \(\rho\) corrects for the individual variance of each feature in the pair, rather than for just one of the features.

For now, we will focus on the implementations that use clr-transformation, saving a discussion of alr-transformation for later. Let us begin by building an arbitrary dataset of 4 features (e.g., genes) measured across 100 subjects. In this example dataset, the feature pair “a” and “b” will show proportional change, as will the feature pair “c” and “d”.

set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
                c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))

Let \(d\) represent any number of features measured across \(n\) observations undergoing a binary or continuous event \(E\). For example, \(n\) could represent subjects differing in case-control status, treatment status, treatment dose, or time. The phit and perb functions ultimately convert a “count matrix” with \(n\) rows and \(d\) columns into a proportionality matrix of \(d\) rows and \(d\) columns containing a \(\phi\) or \(\rho\) measurement for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of \(\phi\)) or a correlation matrix (in the case of \(\rho\)). Both functions return the proportionality matrix bundled within an object of the propr class. This object contains four slots: @counts (the original “count matrix” input), @logratio (the log-ratio transformed counts), @matrix (the proportionality matrix itself), and @pairs (a table of the pairwise proportionality measures).

library(propr)
phi <- phit(X, symmetrize = FALSE)
rho <- perb(X, ivar = 0)
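
As a by-hand sanity check of the definitions above (illustrative only, not the package’s own code path), we can recompute the pair “a” and “b” from clr-transformed data. Because \(\rho\) is symmetric, its value should match rho@matrix["a", "b"] up to floating-point error; for \(\phi\), the two asymmetric quotients below correspond to dividing by \(var(A_a)\) or \(var(A_b)\), one of which populates each triangle of phi@matrix when symmetrize = FALSE.

# By-hand check of the definitions (illustrative only).
clrX <- t(apply(X, 1, function(row) log(row) - mean(log(row))))  # clr per subject
A_a <- clrX[, "a"]; A_b <- clrX[, "b"]
1 - var(A_a - A_b) / (var(A_a) + var(A_b))   # should approximate rho@matrix["a", "b"]
var(A_a - A_b) / var(A_a)                    # the two asymmetric phi values for this pair
var(A_a - A_b) / var(A_b)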

Subsetting propr objects

We have provided methods for conveniently subsetting objects belonging to the propr class. By using the familiar $ and [ methods, we can subset the entire propr object by any of the annotations found in the @pairs slot. Alternatively, using the subset method, we can subset the entire propr object by a vector of feature indices or names. This latter method also provides a convenient way to re-order feature and subject vectors for downstream visualization. In the examples below, we first subset by \(\rho > .99\). Then, separately, we subset by the feature names “a” and “b”.

rho99 <- rho[rho$prop > .99, ]
rho99@pairs
##   feature1 feature2      prop
## 1        a        b 0.9999151
## 6        c        d 0.9930481
rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix
##           a         b
## a 1.0000000 0.9999151
## b 0.9999151 1.0000000
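
Because subset accepts feature names in any order, it can also re-order the features of a propr object before visualization. The call below is an illustrative example of such re-ordering.

# Re-order features by passing names in the desired order (illustrative).
rho_reordered <- subset(rho, select = c("d", "c", "b", "a"))
rho_reordered@matrix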

Visualizing pairs

The two features in a highly proportional pair should show an approximately linear relationship in their *lr-transformed expression across all subjects. The plot method provides a means by which to visually inspect whether this holds true. Since this function will plot all pairs in the @pairs slot, we strongly recommend the user first subset the propr object before plotting. “Noisy” correlation between some feature pairs could suggest that the proportionality cutoff is too lenient. We include this plot as a handy “sanity check” when working with high-dimensional datasets.

plot(rho[rho$prop > .99, ])

Computational burden

Both microarray technology and high-throughput genomic sequencing have the ability to measure tens of thousands of features for each subject. Since the calculation of proportionality requires manipulating a matrix with \(d^2\) entries, this method can consume a considerable amount of RAM when applied to real biological datasets. Below, we provide a small table of the approximate peak RAM needed as a function of the number of features studied.

Features   Peak RAM (Mb)
  1,000              101
  2,000              283
  4,000              926
  8,000            3,322
 16,000           12,554
 24,000           27,706
 32,000           48,779
 64,000          192,276
100,000          466,940
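
As a rough rule of thumb, the table scales like a handful of dense \(d \times d\) matrices of double-precision values (8 bytes per entry). The sketch below uses a factor of six such matrices, which is an empirical fit to the table above rather than a package constant.

# Rough peak-RAM estimate (illustrative): about six dense d x d double matrices,
# 8 bytes per entry. The factor of six is fit to the table above, not exact.
approx_peak_mb <- function(d, copies = 6) copies * 8 * d^2 / 2^20
approx_peak_mb(16000)   # compare with ~12,554 Mb in the table above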

An in-depth look at clr

We recognize that this package utilizes concepts largely unintuitive to many. Since the log-ratio transformation of relative data comprises a major portion of proportionality analysis, we decided to dedicate some extra space to this topic specifically. In this section, we discuss the centered log-ratio (clr) and its limitations in the context of proportionality analysis. To this end, we begin by simulating count data for 5 features (e.g., genes) labeled “a”, “b”, “c”, “d”, and “e”, as measured across 100 subjects.

N <- 100
a <- seq(from = 5, to = 15, length.out = N)
b <- a * rnorm(N, mean = 1, sd = 0.1)
c <- rnorm(N, mean = 10)
d <- rnorm(N, mean = 10)
e <- rep(10, N)
X <- data.frame(a, b, c, d, e)

Let us assume that these data \(X\) represent absolute abundance counts (i.e., not relative data). We can build a relative dataset, \(Y\), by distorting \(X\) accordingly:

Y <- X / rowSums(X) * abs(rnorm(N))

As a “sanity check”, we will confirm that these new feature vectors do in fact contain relative quantities. We do this by calculating the ratio of the second feature vector to the first for both the absolute and relative datasets.

all(round(X[, 2] / X[, 1] - Y[, 2] / Y[, 1], 5) == 0)
## [1] TRUE

The following figures compare pairwise scatterplots for the absolute count data and the corresponding relative count data. We see quickly how these relative data suggest a spurious correlation: although genes “c” and “d” do not correlate with one another absolutely, their relative quantities do.

pairs(X)
pairs(Y)

Next, we will see that when we do calculate correlation, the coefficients differ for the absolute and relative datasets. This further demonstrates the spurious correlation.

cor(X)
##             a           b           c          d  e
## a  1.00000000  0.94105585 -0.05241693 0.13970299 NA
## b  0.94105585  1.00000000 -0.03989741 0.12296945 NA
## c -0.05241693 -0.03989741  1.00000000 0.09194316 NA
## d  0.13970299  0.12296945  0.09194316 1.00000000 NA
## e          NA          NA          NA         NA  1
cor(Y)
##           a         b         c         d         e
## a 1.0000000 0.9910068 0.8754629 0.8901186 0.8854411
## b 0.9910068 1.0000000 0.8808572 0.8919059 0.8887141
## c 0.8754629 0.8808572 1.0000000 0.9875242 0.9909355
## d 0.8901186 0.8919059 0.9875242 1.0000000 0.9931569
## e 0.8854411 0.8887141 0.9909355 0.9931569 1.0000000

However, by calculating the variance of the log-ratios (vlr), defined as the variance of the logarithm of the ratio of two feature vectors, we can arrive at a single measure of dependence that (a) does not change with respect to the nature of the data (i.e., absolute or relative), and (b) does not change with respect to the number of features included in the computation. As such, the vlr, which constitutes the numerator of the \(\phi\) metric and part of the \(\rho\) metric as well, is sub-compositionally coherent. Yet, while the vlr yields valid results for compositional data, it lacks a meaningful scale, making individual values difficult to interpret on their own.

propr:::proprVLR(Y[, 1:4])
##             a           b          c          d
## a 0.000000000 0.009370437 0.11269078 0.09815095
## b 0.009370437 0.000000000 0.11090280 0.09925570
## c 0.112690778 0.110902799 0.00000000 0.01775651
## d 0.098150954 0.099255695 0.01775651 0.00000000
propr:::proprVLR(X)
##             a           b          c          d          e
## a 0.000000000 0.009370437 0.11269078 0.09815095 0.09796050
## b 0.009370437 0.000000000 0.11090280 0.09925570 0.09746355
## c 0.112690778 0.110902799 0.00000000 0.01775651 0.01003068
## d 0.098150954 0.099255695 0.01775651 0.00000000 0.00968784
## e 0.097960496 0.097463554 0.01003068 0.00968784 0.00000000
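
As a direct check of the definition above, a single vlr entry can be computed by hand as the variance of the log-ratio of two feature vectors; it should match the corresponding entry of the matrices shown above, and it is identical for the absolute and relative data.

# vlr by hand: variance of the log of the ratio of two features (illustrative).
var(log(Y[, "a"] / Y[, "b"]))   # should match the [a, b] entry shown above
var(log(X[, "a"] / X[, "b"]))   # same value for the absolute data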

Similarly, transformation of a counts matrix by clr also makes the data sub-compositionally coherent. In the calculation of proportionality coefficients, we use the variance of the clr-transformed data to normalize the variance of the log-ratios (vlr). In other words, we adjust the vlr, which has an arbitrary scale, by the variance of its individual constituents. In this way, the use of clr-transformed data shifts the vlr-matrix onto a “standardized” scale that is comparable across all feature pairs.

In the next figures, we compare pairwise scatterplots for the clr-transformed absolute count data and the corresponding clr-transformed relative count data. Although the two sets of figures look nearly equivalent, both show a relationship between “c” and “d” that should not exist based on what we know from the non-transformed absolute count data. This relationship is ultimately reflected (at least partially) in the results of phit and perb alike.

pairs(propr:::proprCLR(Y[, 1:4]))
pairs(propr:::proprCLR(X))

However, division of the vlr by the variance of the clr lacks sub-compositional coherence. As such, neither \(\phi\) nor \(\rho\), at least when calculated via clr, yields the same result for absolute and relative data. Therefore, measuring proportionality does not, per se, guard against the possibility of spurious proportionality.

phit(Y[, 1:4])@matrix
## Calculating all phi for "count matrix"...
##           a         b         c         d
## a 0.0000000 0.3464746 4.1667734 3.6291593
## b 0.3464746 0.0000000 4.1267256 3.6933335
## c 4.1667734 4.1267256 0.0000000 0.5492343
## d 3.6291593 3.6933335 0.5492343 0.0000000
phit(X)@matrix
## Calculating all phi for "count matrix"...
##          a        b         c         d         e
## a 0.000000 0.252547 3.0371808 2.6453114 2.6401782
## b 0.252547 0.000000 3.0081284 2.6922123 2.6436022
## c 3.037181 3.008128 0.0000000 0.7477885 0.4224268
## d 2.645311 2.692212 0.7477885 0.0000000 0.5253878
## e 2.640178 2.643602 0.4224268 0.5253878 0.0000000
perb(Y[, 1:4])@matrix
## Calculating all rho for "count matrix"...
##            a          b          c          d
## a  1.0000000  0.8262139 -0.8979606 -0.8579366
## b  0.8262139  1.0000000 -0.8732360 -0.8849433
## c -0.8979606 -0.8732360  1.0000000  0.6944455
## d -0.8579366 -0.8849433  0.6944455  1.0000000
perb(X)@matrix
## Calculating all rho for "count matrix"...
##            a          b          c          d          e
## a  1.0000000  0.8733236 -0.8519710 -0.7671116 -0.8275712
## b  0.8733236  1.0000000 -0.8296846 -0.7946279 -0.8263425
## c -0.8519710 -0.8296846  1.0000000  0.5790778  0.7507478
## d -0.7671116 -0.7946279  0.5790778  1.0000000  0.7227065
## e -0.8275712 -0.8263425  0.7507478  0.7227065  1.0000000

The reader should note that in this contrived example, \(\phi(X)\) can equal \(\phi(Y)\) if and only if every part (i.e., feature vector) of \(Y\) is available for the computation. Such equivalency occurs when the sum of the feature parts in the relative dataset can explain the whole of each subject. This is rarely the case when studying biological count data and alone does not imply sub-compositional coherence.
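
The sketch below checks this equivalence directly: when all five features of \(Y\) enter the computation, the per-subject scaling used to build \(Y\) cancels in the clr, so the two \(\phi\) matrices should agree up to floating-point error.

# With the full five-feature composition, phi on relative Y should equal phi on
# absolute X (up to floating-point error): the per-subject scaling cancels in the clr.
all.equal(phit(Y)@matrix, phit(X)@matrix)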

An in-depth look at alr

Unlike the centered log-ratio (clr), which adjusts each subject vector by the geometric mean of that vector, the additive log-ratio (alr) adjusts each subject vector by the value of one of its own components, chosen as a reference. If we select as a reference some feature \(D\) with an a priori known fixed absolute count across all subjects, we can effectively “back-calculate” absolute data from relative data. When initially crafting the data \(X\), we included “e” as this fixed value.
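
As a minimal by-hand check of this back-calculation (computed directly, without the package internals), the alr of \(Y\) with “e” as the reference should recover the absolute data up to a constant shift on the log scale, because the column “e” of \(X\) was fixed at 10.

# alr by hand with "e" as the reference (illustrative): since column "e" of X is
# fixed at 10, alr(Y) should equal log(X) shifted by the constant log(10).
alrY <- log(Y[, 1:4] / Y[, "e"])
all.equal(alrY, log(X[, 1:4]) - log(10), check.attributes = FALSE)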

The following figures compare pairwise scatterplots for alr-transformed relative count data (i.e., \(alr(Y)\) with “e” as the reference) and the corresponding absolute count data. We see here how alr-transformation eliminates the spurious correlation between “c” and “d”.

pairs(propr:::proprALR(Y, ivar = 5))
pairs(X[, 1:4])

Again, this gets reflected in the results of perb when we select “e” as the reference.

perb(Y, ivar = 5)@matrix
## Calculating all rho for "count matrix"...
##             a           b           c          d
## a  1.00000000  0.95205075 -0.04351842 0.08822600
## b  0.95205075  1.00000000 -0.03170931 0.07368732
## c -0.04351842 -0.03170931  1.00000000 0.09950079
## d  0.08822600  0.07368732  0.09950079 1.00000000

Now, let us assume these same data, \(X\), actually measure relative counts. In other words, \(X\) is already relative and we do not know the real quantities which correspond to \(X\) absolutely. Well, if we knew that “a” represented a known fixed quantity, we could use alr-transformation again to “back-calculate” the absolute abundances. In this case, we will see that “c”, “d”, and “e” actually do have proportional expression under these conditions. Although the measured quantities of “c”, “d”, and “e” do not change considerably across subjects, the measured quantity of the known fixed feature does change. As such, whenever “a” increases while “c”, “d”, and “e” remain the same, the latter three features have actually decreased. Since they all decreased together, they act as a highly proportional module.

pairs(propr:::proprALR(X, ivar = 1))

Again, this gets reflected in the results of perb when we select “a” as the reference.

perb(X, ivar = 1)@matrix
## Calculating all rho for "count matrix"...
##            b          c         d          e
## b 1.00000000 0.09141656 0.0768749 0.09193416
## c 0.09141656 1.00000000 0.9157828 0.95238255
## d 0.07687490 0.91578276 1.0000000 0.95060033
## e 0.09193416 0.95238255 0.9506003 1.00000000

We can visualize this module using the helpful clustering method dendrogram.

dendrogram(perb(X, ivar = 1))
## Calculating all rho for "count matrix"...

## 'dendrogram' with 2 branches and 4 members total, at height 0.9231251

Returning to our initial claim that the matrix \(X\) contains absolute count data while the matrix \(Y\) contains relative count data, we can show that alr-transformation not only corrects for spurious proportionality, but also serves as the basis for a sub-compositionally coherent measure of dependence. However, unlike the aforementioned vlr, \(\rho\) has a meaningful scale. In the example below, we calculate \(\rho\) using the alr-transformation about the reference “e” for a four-feature subcomposition of the relative count matrix, \(Y\), as well as for the absolute count matrix, \(X\). We see here that, unlike clr-transformed proportionality metrics, the alr-transformed metric \(\rho\) yields identical results regardless of the nature of the data explored. Of course, this assumes that one knows the identity of a feature fixed across all subjects.

perb(Y[, 2:5], ivar = 4)@pairs
## Calculating all rho for "count matrix"...
##   feature1 feature2        prop
## 1        c        d  0.09950079
## 2        b        d  0.07368732
## 3        b        c -0.03170931
perb(X, ivar = 5)@pairs
## Calculating all rho for "count matrix"...
##   feature1 feature2        prop
## 1        a        b  0.95205075
## 2        c        d  0.09950079
## 3        a        d  0.08822600
## 4        b        d  0.07368732
## 5        a        c -0.04351842
## 6        b        c -0.03170931
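
We can also check this coherence programmatically: computing \(\rho\) with “e” as the reference on the matching four-feature subcomposition of the absolute data should reproduce the relative-data result up to floating-point error.

# Sub-compositional coherence of alr-based rho (illustrative check): the same
# reference and the same features yield the same matrix for relative and absolute data.
all.equal(perb(Y[, 2:5], ivar = 4)@matrix, perb(X[, 2:5], ivar = 4)@matrix)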

Limitations

Although we developed this package with biological count data in mind, many of the ostensibly compositional biological datasets do not behave in a truly compositional manner. For example, in the setting of gene expression data, measuring the expression of “Gene A” as 1 and the expression of “Gene B” as 2 in one subject (i.e., the subject vector \([1, 2]\)) does not carry the same information as measuring the expression of “Gene A” as 1000 and “Gene B” as 2000 in another subject (i.e., the subject vector \([1000, 2000]\)). As such, these data do not strictly meet the criteria for compositional data.

Unfortunately, we do not yet have a model to adequately address this drawback. Therefore, we advise the investigator to proceed with caution when working with such “count compositional” data.

References

  1. Erb, I. & Notredame, C. 2016. How should we measure proportionality on relative gene expression data? Theory Biosci.

  2. Lovell, D. et al. 2015. Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Comput Biol 11.