The TARA Oceans expedition facilitated the study of plankton communities by providing ocean metagenomic data combined with environmental measures to the scientific community. This study focuses on 139 prokaryotic-enriched samples collected from 68 stations and spread across three depth layers: the surface (SRF), the deep chlorophyll maximum (DCM) layer and the mesopelagic (MES) zone. Samples were located in 8 different oceans or seas: Indian Ocean (IO), Mediterranean Sea (MS), North Atlantic Ocean (NAO), North Pacific Ocean (NPO), Red Sea (RS), South Atlantic Ocean (SAO), South Pacific Ocean (SPO) and Southern Ocean (SO).
In this vignette, we consider a subset of the original data studied in the reference article:
Mariette, J. and Villa-Vialaneix, N. (2017) Integrating TARA Oceans datasets using unsupervised multiple kernel learning. Submitted for publication.
More precisely, only 1% of the 35,650 prokaryotic OTUs and of the 39,246 bacterial genes were randomly selected. The aim of the study is to integrate prokaryotic abundances and functional processes with environmental measures.
To run this vignette, the latest version of mixOmics must be installed beforehand.
Install and load the mixOmics and mixKernel libraries (note that mixKernel will soon be included in mixOmics!).
# install.packages("mixKernel")
library(mixOmics)
library(mixKernel)
Datasets (previously normalized) are provided as matrices with matching sample names (row names):
data(TARAoceans)
# more details with: ?TARAoceans
lapply(list("phychem" = TARAoceans$phychem, "pro.phylo" = TARAoceans$pro.phylo,
            "pro.NOGs" = TARAoceans$pro.NOGs),
       dim)
## $phychem
## [1] 139 22
##
## $pro.phylo
## [1] 139 356
##
## $pro.NOGs
## [1] 139 638
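Since kernels are later compared and combined sample by sample, it can be worth checking that the row names really match across blocks before going further. The following is a minimal sketch in base R on toy matrices (not the TARA data); the helper same.samples is hypothetical and is not part of mixKernel:

```r
set.seed(10)
# Toy stand-ins for the three data blocks (hypothetical values, not TARA data).
samples <- paste0("sample", 1:5)
blocks <- list(
  phychem   = matrix(rnorm(5 * 2), nrow = 5, dimnames = list(samples, NULL)),
  pro.phylo = matrix(rpois(5 * 3, 4), nrow = 5, dimnames = list(samples, NULL)),
  pro.NOGs  = matrix(rpois(5 * 4, 4), nrow = 5, dimnames = list(samples, NULL))
)

# Hypothetical helper: TRUE when every block has exactly the same row names,
# in the same order, as the first block.
same.samples <- function(block.list) {
  reference <- rownames(block.list[[1]])
  all(vapply(block.list,
             function(b) identical(rownames(b), reference),
             logical(1)))
}

rows.match <- same.samples(blocks)
rows.match
```

The same helper could be applied to a list built from the entries of TARAoceans, as in the lapply call above.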
For each input dataset, a kernel is computed using the function compute.kernel, which offers a choice between linear, phylogenetic or abundance kernels. A user-defined function can also be provided as input (argument kernel.func, see ?compute.kernel).
Returned objects are lists with a ‘kernel’ entry that stores the kernel matrix. The resulting kernels are square symmetric matrices whose size equals the number of observations (rows) in the input datasets.
phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear")
pro.phylo.kernel <- compute.kernel(TARAoceans$pro.phylo, kernel.func = "abundance")
pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance")
# check dimensions of the kernel matrix (stored in the 'kernel' entry)
dim(phychem.kernel$kernel)
## [1] 139 139
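To make the structure of such a kernel matrix concrete, here is a minimal base R sketch of a plain linear kernel on toy data. It ignores any centering, scaling or normalization that compute.kernel may apply, so it illustrates the idea rather than reproducing the package's computation:

```r
set.seed(1)
# Toy data: 6 observations (rows) described by 3 variables (columns).
X <- matrix(rnorm(6 * 3), nrow = 6,
            dimnames = list(paste0("s", 1:6), paste0("v", 1:3)))

# Linear kernel: inner products between every pair of observations.
K <- tcrossprod(X)   # same as X %*% t(X)

dim(K)               # one row and one column per observation
isSymmetric(K)       # a kernel matrix is symmetric
```

Each entry K[i, j] is the inner product between observations i and j, which is why the matrix is square with one row and one column per sample.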
A general overview of the correlation structure between datasets is obtained as described in (Mariette and Villa-Vialaneix, 2017) and displayed using the function cim.kernel:
cim.kernel(phychem = phychem.kernel,
           pro.phylo = pro.phylo.kernel,
           pro.NOGs = pro.NOGs.kernel,
           method = "square")
The figure shows that pro.phylo and pro.NOGs are the most correlated pair of kernels. This result is expected as both kernels provide a summary of prokaryotic communities.
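One common way to quantify how similar two kernel matrices are is the cosine of their Frobenius inner product. The sketch below applies it to toy kernels in base R; that cim.kernel relies on exactly this measure is an assumption here, and kernel.cosine is a hypothetical helper, not a mixKernel function:

```r
set.seed(2)
n  <- 8
X1 <- matrix(rnorm(n * 4), nrow = n)
X2 <- X1 + matrix(rnorm(n * 4, sd = 0.1), nrow = n)  # noisy copy of X1
X3 <- matrix(rnorm(n * 4), nrow = n)                 # unrelated block

kernels <- list(A = tcrossprod(X1), B = tcrossprod(X2), C = tcrossprod(X3))

# Hypothetical helper: cosine similarity between two kernel matrices,
# based on the Frobenius inner product.
kernel.cosine <- function(K1, K2) {
  sum(K1 * K2) / sqrt(sum(K1^2) * sum(K2^2))
}

idx <- seq_along(kernels)
similarity <- outer(idx, idx,
                    Vectorize(function(i, j) kernel.cosine(kernels[[i]],
                                                           kernels[[j]])))
dimnames(similarity) <- list(names(kernels), names(kernels))
round(similarity, 2)
```

In this toy example, the kernels built from the two closely related blocks (A and B) are more similar to each other than either is to the unrelated block C.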
The function combine.kernels implements 3 different methods for combining kernels: STATIS-UMKL, sparse-UMKL and full-UMKL (see Mariette and Villa-Vialaneix, 2017). It returns a meta-kernel that can be used as an input for the function kernel.pca (kernel PCA). The three methods are complementary and must be chosen according to the research question. The STATIS-UMKL approach gives an overview of the common information between the different datasets. The full-UMKL approach computes a kernel that minimizes the distortion between all input kernels. sparse-UMKL is a sparse variant of full-UMKL that also selects the most relevant kernels.
meta.kernel <- combine.kernels(phychem = phychem.kernel,
                               pro.phylo = pro.phylo.kernel,
                               pro.NOGs = pro.NOGs.kernel,
                               method = "full-UMKL")
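The general idea of a meta-kernel, a weighted sum of the input kernels with non-negative weights, can be sketched in base R as follows. The weights below are simply set to be uniform as a stand-in: the UMKL methods estimate them from the data, and this sketch does not reproduce any of the three optimization schemes:

```r
set.seed(3)
n <- 5
# Toy linear kernels computed on three unrelated random blocks.
kernels <- list(K1 = tcrossprod(matrix(rnorm(n * 2), n)),
                K2 = tcrossprod(matrix(rnorm(n * 3), n)),
                K3 = tcrossprod(matrix(rnorm(n * 4), n)))

# Hypothetical weights: non-negative and summing to one.
# They are fixed by hand here, not learned.
weights <- rep(1 / length(kernels), length(kernels))

# Meta-kernel: weighted sum of the input kernels.
meta <- Reduce(`+`, Map(`*`, kernels, weights))

dim(meta)
isSymmetric(meta)
```

Because each input kernel is symmetric and positive semi-definite and the weights are non-negative, the combination is itself a valid kernel.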
A kernel PCA can be performed from the combined kernel with the function kernel.pca. The argument ncomp sets how many axes are extracted by KPCA.
kernel.pca.result <- kernel.pca(meta.kernel, ncomp = 10)
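For readers unfamiliar with kernel PCA, the standard computation (centering the kernel, then an eigendecomposition) can be sketched in base R on a toy linear kernel. This is a textbook version under stated assumptions, not the code of kernel.pca; with a linear kernel it should agree with ordinary PCA up to the sign of each axis:

```r
set.seed(4)
n <- 10
X <- matrix(rnorm(n * 3), nrow = n)
K <- tcrossprod(X)                       # linear kernel on toy data

# Double-center the kernel (centering in the feature space).
H  <- diag(n) - matrix(1 / n, n, n)
Kc <- H %*% K %*% H

# Eigenvectors scaled by the square root of their eigenvalue give the
# coordinates of the observations on each axis.
ncomp  <- 2
eig    <- eigen(Kc, symmetric = TRUE)
coords <- sweep(eig$vectors[, 1:ncomp, drop = FALSE], 2,
                sqrt(eig$values[1:ncomp]), "*")

# Comparison with ordinary PCA scores (equal up to the sign of each axis).
pca.scores <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:ncomp]
max.diff   <- max(abs(abs(coords) - abs(pca.scores)))
max.diff
```

The difference max.diff is expected to be zero up to numerical precision, which is a convenient sanity check when experimenting with other kernels is not possible.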
Its results are displayed using the plotIndiv function from mixOmics, which provides the representation of individuals.
all.depths <- levels(factor(TARAoceans$sample$depth))
depth.pch <- c(20, 17, 4, 3)[match(TARAoceans$sample$depth, all.depths)]
plotIndiv(kernel.pca.result,
          comp = c(1, 2),
          ind.names = FALSE,
          legend = TRUE,
          group = as.vector(TARAoceans$sample$ocean),
          col.per.group = c("#f99943", "#44a7c4", "#05b052", "#2f6395",
                            "#bb5352", "#87c242", "#07080a", "#92bbdb"),
          pch = depth.pch,
          pch.levels = TARAoceans$sample$depth,
          legend.title = "Ocean / Sea",
          title = "Projection of TARA Oceans stations",
          size.title = 10,
          legend.title.pch = "Depth")
The variance explained by each axis of KPCA is displayed with the plot function, and can help choose the number of components in KPCA.
plot(kernel.pca.result)
The first axis reproduces approximately 20% of the total variance.
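The proportion of variance carried by each axis is derived from the eigenvalues of the centered kernel. A minimal base R sketch on toy data (the numbers it produces are unrelated to the TARA results above):

```r
set.seed(5)
n  <- 10
X  <- matrix(rnorm(n * 4), nrow = n)
H  <- diag(n) - matrix(1 / n, n, n)
Kc <- H %*% tcrossprod(X) %*% H          # centered linear kernel on toy data

eigenvalues <- eigen(Kc, symmetric = TRUE, only.values = TRUE)$values
eigenvalues <- pmax(eigenvalues, 0)      # remove tiny negative rounding errors

explained  <- eigenvalues / sum(eigenvalues)
cumulative <- cumsum(explained)
round(100 * explained[1:4], 1)           # percentage of variance per axis
```

Here the toy block has 4 variables, so the first 4 axes carry all of the variance; looking at where cumulative levels off is one informal way to pick the number of components.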
The first axis is used to identify which variables are important. Variable values are randomly permuted with the function kernel.pca.permute.
In the following example, physical variables are permuted at the variable level (kernel phychem), OTU abundances from the pro.phylo kernel are permuted at the phylum level (OTU phyla are stored in the second column, named Phylum, of the taxonomy annotation provided in the TARAoceans object in the entry taxonomy) and gene abundances from pro.NOGs are permuted at the GO level (GO are provided in the entry GO of the dataset).
head(TARAoceans$taxonomy[ ,"Phylum"], 10)
## [1] Actinobacteria Proteobacteria Proteobacteria Gemmatimonadetes
## [5] Actinobacteria Actinobacteria Proteobacteria Proteobacteria
## [9] Proteobacteria Cyanobacteria
## 56 Levels: Acidobacteria Actinobacteria aquifer1 ... WCHB1-60
head(TARAoceans$GO, 10)
## [1] NA NA "K" NA NA "S" "S" "S" NA "S"
set.seed(17051753)
kernel.pca.result <- kernel.pca.permute(kernel.pca.result, ncomp = 1,
                                        phychem = colnames(TARAoceans$phychem),
                                        pro.phylo = TARAoceans$taxonomy[ ,"Phylum"],
                                        pro.NOGs = TARAoceans$GO)
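The intuition behind permutation-based importance can be sketched in base R: shuffle one variable across samples, recompute the first axis, and measure how far it moved. The sketch below uses a single toy block, a linear kernel and a simple squared-cosine criterion; these are simplifying assumptions and not the measure implemented in kernel.pca.permute, and first.axis is a hypothetical helper:

```r
set.seed(6)
n <- 40
# Toy block: v1 has a much larger spread than v2 and v3, so it drives axis 1.
X <- cbind(v1 = rnorm(n, sd = 5), v2 = rnorm(n), v3 = rnorm(n))

# Hypothetical helper: first KPCA axis (unit-norm eigenvector) of a
# centered linear kernel.
first.axis <- function(data) {
  m <- nrow(data)
  H <- diag(m) - matrix(1 / m, m, m)
  eigen(H %*% tcrossprod(data) %*% H, symmetric = TRUE)$vectors[, 1]
}

reference <- first.axis(X)

# Permute one variable at a time and measure how far axis 1 moves:
# 0 means the axis is unchanged, values near 1 mean it is nearly orthogonal.
importance <- sapply(colnames(X), function(v) {
  permuted <- X
  permuted[, v] <- sample(permuted[, v])
  1 - sum(reference * first.axis(permuted))^2
})
importance
```

In this toy setting, permuting v1 destroys the first axis while permuting v2 or v3 barely changes it, so v1 comes out as the most important variable.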
Results are displayed with the function plotVar.kernel.pca. The argument ndisplay indicates the number of variables to display for each kernel:
plotVar.kernel.pca(kernel.pca.result, ndisplay = 10, ncol = 3)
Proteobacteria is the most important variable for the pro.phylo kernel.
Then, the relative abundance of Proteobacteria is extracted in each of our 139 samples, and each sample is colored according to the value of this variable in the KPCA projection plot:
selected <- which(TARAoceans$taxonomy[ ,"Phylum"] == "Proteobacteria")
proteobacteria.per.sample <- apply(TARAoceans$pro.phylo[ ,selected], 1, sum) /
  apply(TARAoceans$pro.phylo, 1, sum)
colfunc <- colorRampPalette(c("royalblue", "red"))
col.proteo <- colfunc(length(proteobacteria.per.sample))
col.proteo <- col.proteo[rank(proteobacteria.per.sample, ties.method = "first")]
plotIndiv(kernel.pca.result,
          comp = c(1, 2),
          ind.names = FALSE,
          legend = FALSE,
          group = c(1:139),
          col = col.proteo,
          pch = depth.pch,
          pch.levels = TARAoceans$sample$depth,
          legend.title = "Ocean / Sea",
          title = "Representation of Proteobacteria abundance",
          legend.title.pch = "Depth")
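The color assignment used above relies on ranking the values and indexing a gradient with those ranks. A small base R sketch on made-up numbers shows what each step produces:

```r
# Toy values standing in for a per-sample quantity (hypothetical numbers).
values <- c(0.42, 0.10, 0.77, 0.10, 0.55)

# One color per sample, ordered from blue (low) to red (high).
colfunc      <- colorRampPalette(c("royalblue", "red"))
palette.cols <- colfunc(length(values))

# rank() gives the position of each value in the sorted vector;
# ties.method = "first" keeps the ranks distinct when values are equal.
ranks       <- rank(values, ties.method = "first")
sample.cols <- palette.cols[ranks]

data.frame(values, ranks, sample.cols)
```

Note that this maps ranks, not raw values, to colors: two samples with close values still get distinct, adjacent shades.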
Similarly, the temperature is the most important variable for the phychem kernel. The temperature values can be displayed on the kernel PCA projection as follows:
col.temp <- colfunc(length(TARAoceans$phychem[ ,4]))
col.temp <- col.temp[rank(TARAoceans$phychem[ ,4], ties.method = "first")]
plotIndiv(kernel.pca.result,
          comp = c(1, 2),
          ind.names = FALSE,
          legend = FALSE,
          group = c(1:139),
          col = col.temp,
          pch = depth.pch,
          pch.levels = TARAoceans$sample$depth,
          legend.title = "Ocean / Sea",
          title = "Representation of mean temperature",
          legend.title.pch = "Depth")
Mariette, J. and Villa-Vialaneix, N. (2017) Integrating TARA Oceans datasets using unsupervised multiple kernel learning. Submitted for publication.
Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple kernel clustering. Journal of Machine Learning Research: Workshop and Conference Proceedings, 20, 129–144.
Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics & Data Analysis, 18(1), 97–119.
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mixKernel_0.1 mixOmics_6.1.3 ggplot2_2.2.1 lattice_0.20-35
## [5] MASS_7.3-47 knitr_1.15.1
##
## loaded via a namespace (and not attached):
## [1] rgl_0.98.1 Rcpp_0.12.10 ape_4.1
## [4] phyloseq_1.20.0 tidyr_0.6.1 corpcor_1.6.9
## [7] Biostrings_2.44.0 assertthat_0.2.0 rprojroot_1.2
## [10] digest_0.6.12 psych_1.7.5 foreach_1.4.3
## [13] mime_0.5 R6_2.2.0 plyr_1.8.4
## [16] backports_1.0.5 stats4_3.4.0 ellipse_0.3-8
## [19] evaluate_0.10 zlibbioc_1.22.0 lazyeval_0.2.0
## [22] data.table_1.10.4 vegan_2.4-3 S4Vectors_0.14.0
## [25] Matrix_1.2-10 rmarkdown_1.5 labeling_0.3
## [28] splines_3.4.0 foreign_0.8-68 stringr_1.2.0
## [31] htmlwidgets_0.8 igraph_1.0.1 munsell_0.4.3
## [34] shiny_1.0.3 compiler_3.4.0 httpuv_1.3.3
## [37] BiocGenerics_0.22.0 mnormt_1.5-5 multtest_2.32.0
## [40] mgcv_1.8-17 htmltools_0.3.6 biomformat_1.4.0
## [43] tibble_1.3.0 quadprog_1.5-5 IRanges_2.10.0
## [46] codetools_0.2-15 permute_0.9-4 dplyr_0.5.0
## [49] grid_3.4.0 nlme_3.1-131 jsonlite_1.4
## [52] xtable_1.8-2 gtable_0.2.0 DBI_0.6-1
## [55] magrittr_1.5 scales_0.4.1 stringi_1.1.5
## [58] XVector_0.16.0 reshape2_1.4.2 RColorBrewer_1.1-2
## [61] iterators_1.0.8 tools_3.4.0 ade4_1.7-6
## [64] Biobase_2.36.0 LDRTools_0.2 parallel_3.4.0
## [67] survival_2.41-3 yaml_2.1.14 colorspace_1.3-2
## [70] rhdf5_2.20.0 cluster_2.0.6 corrplot_0.77