Abstract
Tutorial on how to use the RNAseqNet package to infer networks from RNA-seq expression datasets with or without an auxiliary dataset.
The R package RNAseqNet
can be used to infer networks from RNA-seq expression data. The count data are given as a \(n \times p\) matrix in which \(n\) is the number of individuals and \(p\) the number of genes. This matrix is denoted by \(\mathbf{X}\) in the sequel.
Eventually, the RNA-seq dataset is complemented with an \(n' \times d\) matrix, \(\mathbf{Y}\) which can be used to impute missing individuals in \(\mathbf{X}\) as described in [Imbert et al., 2018]1.
Two datasets are available in the package: lung
and thyroid
with \(n = 221\) rows and respectively 100 and 50 columns. The raw data were downloaded from https://gtexportal.org/. The TMM normalisation of RNA-seq expression was performed with the R package edgeR
.
Data are loaded with:
data(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
data(thyroid)
boxplot(log2(thyroid + 1), las = 3, cex.names = 0.5)
Network inference from RNA-seq data is performed with the Poisson GLM model described in [Allen and Liu, 2012]2. The inference can be performed with the function GLMnetwork
as follows:
lambdas <- 4 * 10^(seq(0, -2, length = 10))
ref_lung <- GLMnetwork(lung, lambdas = lambdas)
The entry path
of ref_lung
contains length(lambdas)
= 10 matrices with estimated coefficients. Each matrix is a square matrix with ncol(lung)
= 100 rows and columns.
The choice of the most appropriate value for \(\lambda\) can be performed with the StARS criterion of [Liu et al., 2010]3, which is implemented in the function stabilitySelection
. The argument B
is used to specify the number of re-sampling used to compute the stability criterion:
set.seed(11051608)
stability_lung <- stabilitySelection(lung, lambdas = lambdas, B = 50)
plot(stability_lung)
The entry best
of stability_lung
is the index of the chosen \(\lambda\) in lambdas
. Here, the value \(\lambda=\) lambdas[stability_lung$best]
= 0.3097055 is chosen.
The corresponding set of estimated coefficients, is in ref_lung$path[[stability_lung$best]]
and can be transformed into a network with the function GLMnetToGraph
:
lung_refnet <- GLMnetToGraph(ref_lung$path[[stability_lung$best]])
print(lung_refnet)
## IGRAPH dcc3d30 UN-- 100 454 --
## + attr: name (v/c)
## + edges from dcc3d30 (vertex names):
## [1] MT-CO1--MT-ND4 MT-CO1--SFTPA2 MT-CO1--hsa-mir-6723
## [4] MT-CO1--A2M MT-CO1--MT-CO2 MT-CO1--MT-CO3
## [7] MT-CO1--MT-RNR2 MT-CO1--MT-ATP6 MT-CO1--MT-ND1
## [10] MT-CO1--MTND2P28 MT-CO1--MT-ND6 MT-CO1--SAT1
## [13] MT-CO1--HSPB1 MT-CO1--MTND4P12 MT-CO1--HLA-DRA
## [16] MT-CO1--MTND1P23 SFTPB --SFTPA2 SFTPB --FN1
## [19] SFTPB --B2M SFTPB --NEAT1 SFTPB --SLC34A2
## [22] SFTPB --PGC SFTPB --CTSD SFTPB --TSC22D3
## + ... omitted several edges
set.seed(1243)
plot(lung_refnet, vertex.size = 5, vertex.color = "orange",
vertex.frame.color = "orange", vertex.label.cex = 0.5,
vertex.label.color = "black")
In this section, we artificially remove some of the observations in lung
to create missing individuals (as compared to those in thyroid
):
set.seed(1717)
nobs <- nrow(lung)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
lung[miss_ind, ] <- NA
lung <- na.omit(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
The method described in [Imbert et al., 2018] is thus used to infer a network for lung
expression data, imputing missing individuals from the information provided between gene expressions by the thyroid
dataset.
The first step of the method is to choose a relevant value for the donor list parameter, \(\sigma\). This is done computing \(V_{\textrm{intra}}\), the intra-variability in donor pool, for various values of \(\sigma\). An elbow rule is thus used to choose an appropriate value:
sigmalist <- 1:5
sigma_stats <- chooseSigma(lung, thyroid, sigmalist)
p <- ggplot(sigma_stats, aes(x = sigma, y = varintra)) + geom_point() +
geom_line() + theme_bw() +
ggtitle(expression("Evolution of intra-pool homogeneity versus" ~ sigma)) +
xlab(expression(sigma)) + ylab(expression(V[intra])) +
theme(title = element_text(size = 10))
print(p)
Here, \(\sigma = 2\) is chosen. Finally, hd-MI is processed with the chosen \(\sigma\), a list of regularization parameters \(\lambda\) that are selected with with the StARS criterion (from B = 10
subsamples) in m = 100
replicates of the inference, all performed on a different imputed dataset. The function imputedGLMnetwork
is the one implementing the full method. The distribution of edge frequency among the m = 100
inferred network is obtained with the function plot
applied to the result of this function.
set.seed(16051244)
lung_hdmi <- imputedGLMnetwork(lung, thyroid, sigma = 2, lambdas = lambdas,
m = 100, B = 10)
plot(lung_hdmi)
Finally, the final graph is extracted using the function GLMnetToGraph
on the result of the function ``imputedGLMnetwork``` and providing a threshold for edge frequency prediction.
lung_net <- GLMnetToGraph(lung_hdmi, threshold = 0.9)
lung_net
## IGRAPH ded1a65 UN-- 100 132 --
## + attr: name (v/c)
## + edges from ded1a65 (vertex names):
## [1] MT-CO1 --MT-ND4 MT-CO1 --MT-CO2
## [3] MT-CO1 --MT-CO3 MT-CO1 --MT-ND1
## [5] SFTPB --SLC34A2 SFTPB --PGC
## [7] SFTPB --NAPSA SFTPB --ABCA3
## [9] SFTPC --SFTPA2 SFTPC --SFTPA1
## [11] MT-ND4 --MT-ND4L SFTPA2 --SFTPA1
## [13] hsa-mir-6723--MT-RNR2 hsa-mir-6723--MTATP6P1
## [15] hsa-mir-6723--RP5-857K21.11 hsa-mir-6723--MTND1P23
## + ... omitted several edges
set.seed(1605)
plot(lung_net, vertex.size = 5, vertex.color = "orange",
vertex.frame.color = "orange", vertex.label.cex = 0.5,
vertex.label.color = "black")
Here is the output of sessionInfo()
on the system on which this document was compiled:
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RNAseqNet_0.1.1 ggplot2_2.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 compiler_3.4.3 plyr_1.8.4 iterators_1.0.8
## [5] tools_3.4.3 rpart_4.1-11 digest_0.6.12 evaluate_0.10.1
## [9] tibble_1.3.4 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1
## [13] rlang_0.1.4 Matrix_1.2-12 foreach_1.4.3 igraph_1.1.2
## [17] yaml_2.1.14 stringr_1.2.0 knitr_1.17 rprojroot_1.2
## [21] glmnet_2.0-13 grid_3.4.3 nnet_7.3-12 mice_2.46.0
## [25] PoiClaClu_1.0.2 hot.deck_1.1 survival_2.41-3 rmarkdown_1.7
## [29] magrittr_1.5 backports_1.1.1 scales_0.5.0 codetools_0.2-15
## [33] htmltools_0.3.6 MASS_7.3-48 splines_3.4.3 colorspace_1.3-2
## [37] labeling_0.3 stringi_1.1.5 lazyeval_0.2.1 munsell_0.4.3
Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Lefebvre, G., Gourraud, P.A., Viguerie, N. and Villa-Vialaneix, N. (2018) Multiple hot-deck imputation for network inference from RNA sequencing data. Bioinformatics. DOI: http://dx.doi.org/10.1093/bioinformatics/btx819↩
Allen, G. and Liu, Z. (2012) A log-linear model for inferring genetic networks from high-throughput sequencing data. In Proceedings of IEEE International Conference on Bioinformatics and Biomedecine (BIBM)↩
Liu, H., Roeber, K. and Wasserman, L. (2010) Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of Neural Information Processing Systems (NIPS 2010), 23, 1432-1440, Vancouver, Canada.↩