The NormalizeMets package is a collection of functions designed to implement, assess, and choose a suitable normalization method for a given metabolomics study. The functions in this package are also available as a graphical user interface within Microsoft Excel, a familiar program for most biological researchers.
The package includes several widely used traditional and recently developed metabolomics normalization methods, which can be used to
remove the component of unwanted variation to obtain a "normalized" data matrix that is suitable for downstream statistical analysis, or to
accommodate the component of unwanted variation in a statistical model designed to answer the research question of interest.
In addition, the package can be used for visualisation of metabolomics data using interactive graphical displays, and for obtaining statistical results for
identifying biomarkers that are associated with an exposure, adjusting for confounding variables,
clustering using hierarchical cluster analysis and principal component analysis,
classification using the support vector machine algorithm, and
correlation analysis.
The R software environment can be downloaded for free from the Comprehensive R Archive Network (CRAN) https://cran.r-project.org/, and is hosted by a large number of sites. A very detailed description of installation of R and alternate methods, FAQs, platform dependencies and the like can be found at https://cran.r-project.org/doc/manuals/R-admin.html.
The use of RStudio is also recommended. RStudio is an integrated development environment (IDE) that can be useful for handling R scripts and functions, as well as loading packages and data. For a guide on installation and usage of RStudio, please refer to the RStudio website, https://www.rstudio.com/.
Install the NormalizeMets package by using the following function:
install.packages("NormalizeMets")
To then load the package use:
library(NormalizeMets)
To cite the package use:
citation("NormalizeMets")
##
## To cite package 'NormalizeMets' in publications use:
##
## Alysha M De Livera and Gavriel Olshansky (2017). NormalizeMets:
## Analysis of Metabolomics Data. R package version 0.22.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {NormalizeMets: Analysis of Metabolomics Data},
## author = {Alysha M {De Livera} and Gavriel Olshansky},
## year = {2017},
## note = {R package version 0.22},
## }
##
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
Four different datasets are included in this package:
mixdata, as described by Redestig et al. (2009). See ?mixdata in R.
Didata, as described by Kirwan et al. (2014). See ?Didata in R.
UVdata, as described by De Livera et al. (2015). See ?UVdata in R.
alldata_eg, a subset of a cohort study dataset described by De Livera et al. (2015). See ?alldata_eg in R.
The NormalizeMets package stores three different sets of information.
featuredata is a metabolomics data matrix taking the following format, with metabolites in columns and samples in rows. Unique sample names should be provided as row names.
data("alldata_eg")
featuredata_eg<-alldata_eg$featuredata
dataview(featuredata_eg)
##
## - The data consists of 150 rows by 128 columns
## m_1 m_2 m_3 m_4 m_5 m_6 m_7
## s_1 10485.867 33220.562 1112.9795 1408.6396 455.7529 1100.4023 122.36316
## s_2 8960.469 29995.516 926.2891 529.0449 873.6094 2201.2461 173.08020
## s_3 10160.445 28559.641 1230.3330 1306.3203 1027.5078 2066.5918 269.79980
## s_4 8794.477 27593.750 901.7622 1800.0830 675.0396 1675.8496 287.99927
## s_5 8956.922 28161.766 979.1724 818.0366 904.0254 1245.9912 167.41333
## s_6 9092.258 31685.312 634.6899 478.3811 980.4233 1259.5635 97.56250
## s_7 2271.045 7692.176 143.1688 109.8399 730.6172 1387.7920 149.09668
## s_8 7850.402 26462.797 822.5991 1341.2158 583.5830 2164.1270 227.76965
## s_9 10969.062 21605.141 408.0005 1499.4795 452.2375 916.5283 234.49756
## s_10 1743.933 7647.645 218.4199 127.4294 408.2913 999.5718 74.82697
## m_8 ...
## s_1 3855.805 ...
## s_2 5090.637 ...
## s_3 4483.906 ...
## s_4 4949.262 ...
## s_5 4302.934 ...
## s_6 4406.344 ...
## s_7 1543.406 ...
## s_8 5030.688 ...
## s_9 4657.254 ...
## s_10 1510.012 ...
sampledata is a dataframe that contains sample-specific information. This information can include sample types (e.g., quality control or biological), run order of the samples, factors of interest and other sample-specific data relevant to the analysis. Unique sample names should be provided as row names. These sample names must match, and be in the same order as, the samples in featuredata.
sampledata_eg<-alldata_eg$sampledata
dataview(sampledata_eg)
##
## - The data consists of 150 rows by 4 columns
## batch gender Age bmi
## s_1 Batch 2 code_1 58.66 22.22
## s_2 Batch 2 code_0 76.70 23.74
## s_3 Batch 2 code_0 56.15 28.19
## s_4 Batch 2 code_0 77.22 26.18
## s_5 Batch 2 code_0 74.29 26.37
## s_6 Batch 2 code_1 66.82 26.39
## s_7 Batch 1 code_0 65.00 29.00
## s_8 Batch 2 code_1 66.51 26.13
## s_9 Batch 2 code_0 70.23 27.26
## s_10 Batch 1 code_1 55.00 25.00
metabolitedata contains metabolite-specific information in a separate dataframe. This information can include, but is not limited to, internal/external standards and other positive/negative control information. Metabolite names should be provided as row names, and must match, and be in the same order as, the metabolites in featuredata.
metabolitedata_eg<-alldata_eg$metabolitedata
dataview(metabolitedata_eg)
##
## - The data consists of 128 rows by 4 columns
## names IS neg_controls pos_controls_gender
## 1 m_1 0 1 0
## 2 m_2 0 1 0
## 3 m_3 0 1 0
## 4 m_4 0 0 0
## 5 m_5 0 0 0
## 6 m_6 0 0 0
## 7 m_7 0 0 0
## 8 m_8 0 1 0
## 9 m_9 0 1 0
## 10 m_10 0 1 0
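The ordering requirements described above can be checked before any analysis. The following is a minimal sketch in base R, using toy objects; check_alignment is a hypothetical helper written for illustration and is not part of NormalizeMets. It assumes metabolite names are stored as row names of metabolitedata, as the text recommends.

```r
# Toy objects standing in for featuredata, sampledata and metabolitedata
featuredata <- matrix(c(10, 12, 11, 20, 22, 21), nrow = 3,
                      dimnames = list(c("s_1", "s_2", "s_3"), c("m_1", "m_2")))
sampledata <- data.frame(batch = c("B1", "B1", "B2"),
                         row.names = c("s_1", "s_2", "s_3"))
metabolitedata <- data.frame(IS = c(0, 1), row.names = c("m_1", "m_2"))

# Hypothetical helper: TRUE only if names agree and are in the same order
check_alignment <- function(featuredata, sampledata, metabolitedata) {
  identical(rownames(featuredata), rownames(sampledata)) &&
    identical(colnames(featuredata), rownames(metabolitedata))
}

check_alignment(featuredata, sampledata, metabolitedata)

# Reordering the samples in sampledata breaks the alignment
check_alignment(featuredata, sampledata[c(2, 1, 3), , drop = FALSE],
                metabolitedata)
```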
Importing the data from Excel or text files for featuredata, sampledata and metabolitedata is relatively simple and can be done using the commands read.csv() or read.table(). The user may find it easier to combine these three datasets into a list that can be called in functions as required. For example,
alldata_eg<-list(featuredata=featuredata_eg, sampledata=sampledata_eg, metabolitedata=metabolitedata_eg)
dataview(alldata_eg$metabolitedata)
##
## - The data consists of 128 rows by 4 columns
## names IS neg_controls pos_controls_gender
## 1 m_1 0 1 0
## 2 m_2 0 1 0
## 3 m_3 0 1 0
## 4 m_4 0 0 0
## 5 m_5 0 0 0
## 6 m_6 0 0 0
## 7 m_7 0 0 0
## 8 m_8 0 1 0
## 9 m_9 0 1 0
## 10 m_10 0 1 0
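As a rough illustration of the import step, the sketch below writes a small toy file to a temporary location and reads it back with read.csv(). The file contents and object names are made up for the example; with real data only the read.csv() call would be needed.

```r
# Write a small example file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(m_1 = c(10.5, 9.8), m_2 = c(33.2, 30.0),
                  row.names = c("s_1", "s_2"))
write.csv(toy, csv_path)

# row.names = 1 uses the first column (the sample names) as row names
featuredata_in <- as.matrix(read.csv(csv_path, row.names = 1))
rownames(featuredata_in)
dim(featuredata_in)
```

Reading sampledata and metabolitedata would follow the same pattern, without the as.matrix() conversion.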
A class called alldata (see ?alldata-class) is used in the package to store the information as above.
Abundances of metabolites in a data matrix usually have a right-skewed distribution. Therefore, an appropriate transformation is needed to obtain a more symmetric distribution. The metabolomics literature has discussed various transformations such as log, cubic and square root as ways of handling this, most of which belong to the family of power transformations. However, the log transformation is usually adequate for statistical purposes. A metabolomics data matrix in the featuredata format can be transformed using the following function.
LogTransform <- function(featuredata, base=exp(1), saveoutput=FALSE,
outputname="log.results",zerotona=FALSE)
The output can be saved as a .csv file by setting saveoutput = TRUE, and giving the output a name using outputname. To log transform the example data, the following can be used.
logdata <- LogTransform(featuredata_eg,zerotona=TRUE)
#logdata
#dataview(logdata$featuredata)
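To make the role of zerotona concrete, here is a minimal base-R sketch of the same idea on a toy matrix. The helper log_transform is hypothetical and only mimics the behaviour described above; it is not the package's implementation.

```r
# Toy abundance matrix with one zero entry
abund <- matrix(c(100, 0, 1000, 10), nrow = 2,
                dimnames = list(c("s_1", "s_2"), c("m_1", "m_2")))

# Hypothetical helper mirroring the idea of zerotona = TRUE
log_transform <- function(x, base = exp(1), zerotona = FALSE) {
  if (zerotona) x[x == 0] <- NA   # log(0) would otherwise give -Inf
  log(x, base = base)
}

res <- log_transform(abund, base = 10, zerotona = TRUE)
res
```

Turning zeros into NA lets them be handled together with other missing values in the next step, rather than propagating -Inf through later calculations.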
A frequent issue in metabolomics data sets is the occurrence of missing values. It is important to reduce the number of missing values as much as possible by using an effective pre-processing procedure. For example, a secondary peak picking method can be used for LC-MS data to fill in missing peaks which are not detected and aligned. The following MissingValues() function can be used to replace missing values, depending on the nature of the missing data.
MissingValues(featuredata, sampledata, metabolitedata, feature.cutoff = 0.8,
sample.cutoff = 0.8, method = c("knn", "replace"), k = 10,
featuremax.knn = 0.8, samplemax.knn = 0.8, seed = 100,
saveoutput = FALSE, outputname = "missing.values.rep")
The user can:
Remove features with a large proportion feature.cutoff of missing values, and/or
Remove samples with a large proportion sample.cutoff of missing values, and/or
Replace missing values by using either the k-nearest neighbour algorithm or by replacing values with a small number (half the minimum of the matrix, as is commonly used).
For example, one can use:
imp <- MissingValues(logdata$featuredata,sampledata_eg,metabolitedata_eg,
feature.cutof=0.8, sample.cutoff=0.8, method="knn")
## -> Checking features...
## -> Checking samples...
#imp
#dataview(imp$featuredata)
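The filtering and "replace" options can be sketched in a few lines of base R. This is an illustration on a toy abundance matrix under the assumptions stated in the comments, not the code used by MissingValues(); the cutoff value and object names are chosen for the example only.

```r
# Toy abundance matrix with missing values (4 samples x 3 metabolites)
x <- matrix(c(5, NA, 7, 6,
              NA, NA, NA, 4,
              8, 9, 7, 8),
            nrow = 4,
            dimnames = list(paste0("s_", 1:4), paste0("m_", 1:3)))

# Drop metabolites whose proportion of missing values exceeds the cutoff
feature_cutoff <- 0.5
keep <- colMeans(is.na(x)) <= feature_cutoff
x_kept <- x[, keep, drop = FALSE]

# Replace what remains with half the minimum observed value of the matrix
x_imp <- x_kept
x_imp[is.na(x_imp)] <- min(x_kept, na.rm = TRUE) / 2
x_imp
```

Here m_2 is removed because three of its four values are missing, and the single remaining gap in m_1 is filled with 2.5, half the smallest observed value.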
The log transformed data matrix can then be visualised using various plots in order to explore variation in the data, clustering tendencies, trends and outliers.
One way of visualising the log transformed metabolomics data is the use of across group or within group relative log abundance (RLA) plots (De Livera et al. 2012; De Livera et al. 2015). In R, these can be obtained using the following function:
RlaPlots <- function(featuredata, groupdata, minoutlier = 0.5, type=c("ag", "wg"), saveplot=FALSE,
plotname = "RLAPlot", savetype= c("png","bmp","jpeg", "tiff","pdf"),
interactiveplot=TRUE, interactiveonly = TRUE,
saveinteractiveplot = FALSE,
interactivesavename = "interactiveRlaPlot",
cols=NULL,cex.axis=0.8, las=2, ylim=c(-2, 2), oma=c(7, 4, 4, 2) + 0.1, ...)
The default is an interactive plot, which can be saved by setting saveinteractiveplot to TRUE. This can also be downloaded as a png file. A non-interactive plot can be obtained by setting interactiveplot to FALSE and saved in five different formats using saveplot = TRUE, giving the plotname and specifying the savetype. To avoid overlapping labels, minoutlier can be set so that only samples with a resulting median greater than minoutlier are labeled.
An example of sample-wise RLA plots:
RlaPlots(imp$featuredata, sampledata_eg[,1], cex.axis = 0.6,saveinteractiveplot = TRUE)
The user can also explore metabolite-wise RLA plots as follows:
RlaPlots(t(imp$featuredata), groupdata=rep("group",dim(imp$featuredata)[2]),
cex.axis = 0.6,saveinteractiveplot = TRUE,xlabel="Metabolites")
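The quantity shown in an RLA plot can be computed directly. The sketch below, on simulated data, assumes the usual definition: each metabolite is centred on its median across all samples (across group) or on its median within each group (within group). It is a base-R illustration and not the RlaPlots() implementation.

```r
set.seed(1)
# Toy log-abundance matrix: 6 samples x 4 metabolites, two groups
logx <- matrix(rnorm(24, mean = 10), nrow = 6,
               dimnames = list(paste0("s_", 1:6), paste0("m_", 1:4)))
group <- rep(c("A", "B"), each = 3)

# Across-group RLA: subtract each metabolite's median over all samples
rla_ag <- sweep(logx, 2, apply(logx, 2, median))

# Within-group RLA: subtract each metabolite's median within its own group
rla_wg <- logx
for (g in unique(group)) {
  idx <- group == g
  rla_wg[idx, ] <- sweep(logx[idx, , drop = FALSE], 2,
                         apply(logx[idx, , drop = FALSE], 2, median))
}

# One box per sample; well-behaved samples are centred near zero
boxplot(t(rla_ag), las = 2, ylab = "Relative log abundance")
```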
The following function can be used to obtain multiple plots for exploration of the principal components of the featuredata matrix: a bar plot indicating the variance explained by each principal component, scores and loading plots with specified axes (interactive and non-interactive), and a pairs plot of the first n principal components. These plots are useful in identifying any outlying samples and getting a preliminary understanding of the structure of the data. As described in the section above, the outputs can be saved for publication purposes.
PcaPlots <- function(featuredata, groupdata, saveplot=FALSE,saveinteractiveplot = FALSE,
plotname="",savetype= c("png","bmp","jpeg","tiff","pdf"),
interactiveplots = TRUE, y.axis=1, x.axis=2, center=TRUE, scale=TRUE,
main=NULL, varplot=FALSE, multiplot=FALSE, n=3, cols=NULL,cex_val = 0.7, ...)
An example is given below:
PcaPlots(imp$featuredata,sampledata_eg[,1],
scale=FALSE, center=TRUE, multiplot = TRUE, varplot = TRUE)
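For readers who want to see what lies behind these plots, the following base-R sketch runs a principal component analysis with prcomp() on simulated data, using the same centring and scaling choices as the example above. It is only an approximation of what PcaPlots() displays; the object names are illustrative.

```r
set.seed(2)
# Toy log-abundance matrix: 10 samples x 6 metabolites, two groups
logx <- matrix(rnorm(60), nrow = 10,
               dimnames = list(paste0("s_", 1:10), paste0("m_", 1:6)))
group <- rep(c("A", "B"), each = 5)

# Centre but do not scale, mirroring center = TRUE, scale = FALSE above
pca <- prcomp(logx, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Scores plot of the first two components, coloured by group
plot(pca$x[, 1], pca$x[, 2], col = as.integer(factor(group)), pch = 19,
     xlab = "PC1", ylab = "PC2")
```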
The HeatMap function produces an interactive and/or a non-interactive heatmap, enabling visualization of the whole data matrix. The metabolites and/or the samples can be optionally clustered using hierarchical clustering. This function is demonstrated in section 4.32b.
Normalization methods presented in this package are divided into four categories: those which use (i) internal standards, external standards and other quality control metabolites (NormQcmets) (Sysi-Aho et al. 2007; Redestig et al. 2009; De Livera et al. 2012; De Livera et al. 2015; Gullberg et al. 2004), (ii) quality control samples (NormQcsamples) (Dunn et al. 2011), (iii) scaling methods (NormScaling) (Scholz et al. 2004; Wang et al. 2003), and (iv) combined methods (NormCombined) (Kirwan and Broadhurst 2013). Unless otherwise stated, these functions assume that featuredata has been log transformed beforehand.
The approaches in NormQcmets use internal standards, external standards and other quality control metabolites. These include the is method, which uses a single standard (Gullberg et al. 2004), the ccmn (cross contribution compensating multiple internal standard) method (Redestig et al. 2009), the nomis (normalization using optimal selection of multiple internal standards) method (Sysi-Aho et al. 2007), and the remove unwanted variation methods (Gagnon-Bartsch, Jacob, and Speed 2014) as applied to metabolomics using "ruv2" (De Livera et al. 2012), "ruvrand" and "ruvrandclust" (De Livera et al. 2015). Note that ruv2 is an application-specific method designed for identifying biomarkers using a linear model that adjusts for the unwanted variation component.
The implementation is as follows:
NormQcmets <- function(featuredata, factors = NULL, method = c("is", "nomis", "ccmn",
"ruv2", "ruvrand", "ruvrandclust"), isvec = NULL, ncomp = NULL,
k = NULL, plotk = FALSE, lambda = NULL, qcmets = NULL,
maxIter = 200, nUpdate = 100, lambdaUpdate = TRUE, p = 2,
saveoutput = FALSE, outputname = NULL,...)
Several examples are given below:
##'nomis' method
Norm_nomis <-NormQcmets(imp$featuredata, method = "nomis",
qcmets = which(metabolitedata_eg$IS ==1))
#Norm_nomis
#Norm_nomis$featuredata
##'ccmn' method
Norm_ccmn <-NormQcmets(imp$featuredata, method = "ccmn",
qcmets = which(metabolitedata_eg$IS ==1),
factors=sampledata_eg$gender)
#Norm_ccmn
#Norm_ccmn$featuredata
##'median' method
Norm_med <- NormScaling(imp$featuredata, method = "median")
#Norm_med
#Norm_med$featuredata
##'ruv2' method
factormat<-model.matrix(~gender +Age +bmi, sampledata_eg)
#head(factormat)
Norm_ruv2<-NormQcmets(imp$featuredata, factormat=factormat,method = "ruv2",
k=2, qcmets = which(metabolitedata_eg$IS ==1))
#Norm_ruv2
##'is' method
Norm_is <-NormQcmets(imp$featuredata, method = "is",
isvec = imp$featuredata[,which(metabolitedata_eg$IS ==1)[1]])
#Norm_is
#Norm_is$featuredata
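The idea behind the single internal standard approach can be illustrated on a toy matrix. The sketch assumes the data are on the log scale, so that dividing each sample by its standard becomes a subtraction. It is a base-R illustration of the concept and may differ in detail from what NormQcmets() does with method = "is".

```r
# Toy log-abundance matrix; m_3 plays the role of the internal standard
logx <- matrix(c(10, 11, 12,
                 20, 21, 22,
                  5,  6,  7),
               nrow = 3,
               dimnames = list(paste0("s_", 1:3), paste0("m_", 1:3)))
isvec <- logx[, "m_3"]

# On the log scale, dividing by the standard becomes a subtraction,
# applied row by row (one standard value per sample)
norm_is <- sweep(logx, 1, isvec, "-")
norm_is
```

In this toy example the sample-to-sample trend shared by all three metabolites is removed, so m_1 and m_2 become constant across samples.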
This function is based on the quality control sample based robust LOESS (locally estimated scatterplot smoothing) signal correction (QC-RLSC) method as described by Dunn et al. (2011) and implemented in statTarget (Luan 2017). Note that for this approach featuredata is not log transformed beforehand. By default, the function log transforms the data after normalization; this can be changed by setting lg=FALSE.
NormQcsamples<- function(featuredata, sampledata, method = c("rlsc"), span = 0,
deg = 2, lg = TRUE, saveoutput = FALSE,
outputname = "qcsample_results", ...)
An example implementation is given below. The sampledata should contain the batch number, the class and the run order, with column names 'batch', 'class' and 'order' respectively. For QC samples, 'class' should be set to 0.
data(Didata)
dataview(Didata$sampledata)
##
## - The data consists of 172 rows by 3 columns
## batch class order
## batch01_QC03 1 0 3
## batch01_QC01 1 0 1
## batch01_QC02 1 0 2
## batch01_C05 1 1 4
## batch01_S07 1 2 5
## batch01_C10 1 1 6
## batch01_QC04 1 0 7
## batch01_S01 1 2 8
## batch01_C03 1 1 9
## batch01_S05 1 2 10
#Not run here due to lengthy output
#Norm_rlsc<- NormQcsamples(sampledata=Didata$sampledata[order(Didata$sampledata$order),],
# featuredata=Didata$featuredata[order(Didata$sampledata$order),])
#Norm_rlsc
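The principle of QC-based signal correction can be sketched for a single metabolite in one batch using stats::loess(). The example below simulates a drifting signal, fits an ordinary (non-robust) LOESS curve to the QC injections only, and divides the fitted drift out of every sample. The simulated data, span, and rescaling to the QC median are assumptions made for illustration; this is not the statTarget or NormQcsamples() implementation.

```r
set.seed(3)
n <- 41
order_run <- 1:n
is_qc <- order_run %% 4 == 1            # QCs at runs 1, 5, ..., 41
drift <- 1 + 0.02 * order_run           # simulated signal drift
intensity <- 1000 * drift * exp(rnorm(n, sd = 0.02))

# Fit a LOESS curve to the QC intensities as a function of run order
qc <- data.frame(order_run = order_run[is_qc], intensity = intensity[is_qc])
fit <- loess(intensity ~ order_run, data = qc, span = 0.75, degree = 2)

# Predict the drift for every sample and divide it out,
# rescaling to the median QC intensity
fitted_all <- predict(fit, newdata = data.frame(order_run = order_run))
corrected <- intensity / fitted_all * median(qc$intensity)

# The trend with run order should be much weaker after correction
cor(intensity, order_run)
cor(corrected, order_run)
```

The first and last injections are QCs in this sketch, so every sample lies within the range of the fitted curve and no extrapolation is needed.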
The scaling normalization methods (Scholz et al. 2004; Wang et al. 2003) included in the package are normalization to a total sum, and normalization by the median or mean of each sample; these are denoted by sum, median, and mean respectively. The method ref normalizes the metabolite abundances to a specific reference vector such as the sample weight or volume.
NormScaling<-function(featuredata, method = c("median", "mean", "sum", "ref"),
refvec = NULL, saveoutput = FALSE, outputname = NULL, ...)
An example,
Norm_med <- NormScaling(imp$featuredata, method = "median")
Norm_med
##
## Feature data
##
## - The data consists of 150 rows by 128 columns
##
## m_1 m_2 m_3 m_4 m_5 m_6
## s_1 2.0775504 3.230691 -0.1654373 0.07014646 -1.0582824 -0.17680211
## s_2 1.7648718 2.973097 -0.5045196 -1.06463264 -0.5630727 0.36107289
## s_3 2.0381826 3.071675 -0.0730348 -0.01310540 -0.2531834 0.44558111
## s_4 1.7505001 2.893965 -0.5270282 0.16420897 -0.8166078 0.09269646
## s_5 1.9863439 3.131882 -0.2271303 -0.40693092 -0.3069806 0.01384862
## s_6 1.7174797 2.965910 -0.9445623 -1.22729118 -0.5097144 -0.25917839
## s_7 0.9931214 2.213085 -1.7708494 -2.03585006 -0.1409843 0.50059532
## s_8 1.6018886 2.817064 -0.6539625 -0.16509966 -0.9972448 0.31334083
## s_9 1.8053610 2.483213 -1.4862047 -0.18459978 -1.3832655 -0.67688013
## s_10 0.7098876 2.188143 -1.3675909 -1.90644751 -0.7420295 0.15331663
## m_7 m_8 ...
## s_1 -2.373240 1.0771017 ...
## s_2 -2.181951 1.1994522 ...
## s_3 -1.590395 1.2201749 ...
## s_4 -1.668421 1.1756146 ...
## s_5 -1.993372 1.2532143 ...
## s_6 -2.817206 0.9931017 ...
## s_7 -1.730279 0.6068732 ...
## s_8 -1.938097 1.1568805 ...
## s_9 -2.040028 0.9487082 ...
## s_10 -2.438832 0.5658624 ...
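A minimal base-R sketch of median scaling on log-transformed data is given below. It assumes the scaling is applied as a subtraction of each sample's median on the log scale, which is consistent with the centred values shown above but is not taken from the package source.

```r
# Toy log-abundance matrix: 3 samples x 3 metabolites
logx <- matrix(c(1, 2, 3,
                 4, 6, 8,
                 7, 10, 13),
               nrow = 3,
               dimnames = list(paste0("s_", 1:3), paste0("m_", 1:3)))

# Subtract each sample's (row's) median so every sample is centred at zero
norm_med <- sweep(logx, 1, apply(logx, 1, median), "-")
norm_med
apply(norm_med, 1, median)   # all zeros
```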
In some circumstances, researchers use a combination of the above normalizations (i.e., one method followed by another). This can be achieved using the NormCombined function. By default, the function applies the 'rlsc' approach followed by the 'median' approach.
NormCombined<-function(featuredata, methods = c("rlsc", "median"),
savefinaloutput = FALSE, finaloutputname = NULL, ...)
For instance,
#Not run due to lengthy output
#Norm_comb<- NormCombined(featuredata=Didata$featuredata[order(Didata$sampledata$order),],
# sampledata=Didata$sampledata[order(Didata$sampledata$order),],
# methods=c("rlsc","median"))
#Norm_comb
The criteria for assessing and choosing a normalization method implemented in this package have been described in detail by De Livera et al. (2012), De Livera et al. (2015) and J. A. Gagnon-Bartsch, Jacob, and Speed (2014).
The following examples fit a linear model to normalized data in order to identify biomarkers associated with factors of interest.
unadjustedFit<-LinearModelFit(featuredata=imp$featuredata,
factormat=factormat,
ruv2=FALSE)
#unadjustedFit
isFit<-LinearModelFit(featuredata=Norm_is$featuredata,
factormat=factormat,
ruv2=FALSE)
#isFit
ruv2Fit<-LinearModelFit(featuredata=imp$featuredata,
factormat=factormat,
ruv2=TRUE,k=2,
qcmets = which(metabolitedata_eg$IS ==1))
#ruv2Fit
#Exploring metabolites associated with age
lcoef_age<-list(unadjusted=unadjustedFit$coefficients[,"Age"],
is=isFit$coefficients[,"Age"],
ruv2=ruv2Fit$coefficients[,"Age"])
lpvals_age<-list(unadjusted=unadjustedFit$p.value[,"Age"],
is=isFit$p.value[,"Age"],
ruv2=ruv2Fit$p.value[,"Age"])
negcontrols<-metabolitedata_eg$names[which(metabolitedata_eg$IS==1)]
CompareVolcanoPlots(lcoef=lcoef_age,
lpvals_age,
normmeth = c(":unadjusted", ":is", ":ruv2"),
xlab="Coef",
negcontrol=negcontrols)
## Warning: Ignoring 1 observations
#Exploring metabolites associated with BMI
lcoef_bmi<-list(unadjusted=unadjustedFit$coefficients[,"bmi"],
is=isFit$coefficients[,"bmi"],
ruv2=ruv2Fit$coefficients[,"bmi"])
lpvals_bmi<-list(unadjusted=unadjustedFit$p.value[,"bmi"],
is=isFit$p.value[,"bmi"],
ruv2=ruv2Fit$p.value[,"bmi"])
CompareVolcanoPlots(lcoef=lcoef_bmi,
lpvals_bmi,
normmeth = c(":unadjusted", ":is", ":ruv2"),
xlab="Coef",
negcontrol=negcontrols)
## Warning: Ignoring 1 observations
#Exploring metabolites associated with gender
lcoef_gender<-list(unadjusted=unadjustedFit$coefficients[,"gendercode_1"],
is=isFit$coefficients[,"gendercode_1"],
ruv2=ruv2Fit$coefficients[,"gendercode_1"])
lpvals_gender<-list(unadjusted=unadjustedFit$p.value[,"gendercode_1"],
is=isFit$p.value[,"gendercode_1"],
ruv2=ruv2Fit$p.value[,"gendercode_1"])
poscontrols_gender<-metabolitedata_eg$names[which(metabolitedata_eg$pos_controls_gender==1)]
CompareVolcanoPlots(lcoef=lcoef_gender,
lpvals_gender,
normmeth = c(":unadjusted", ":is", ":ruv2"),
negcontrol=negcontrols,
poscontrol=poscontrols_gender)
## Warning: Ignoring 1 observations
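The coefficients and p-values compared above come from one linear model per metabolite. The following sketch shows that idea with ordinary least squares via lm() on simulated data. It is an illustration only: LinearModelFit() may use a different estimation procedure, and all object names here are hypothetical.

```r
set.seed(4)
n <- 40
sampledata <- data.frame(Age = rnorm(n, 60, 10),
                         bmi = rnorm(n, 26, 3))
factormat <- model.matrix(~ Age + bmi, sampledata)

# Two toy metabolites: m_1 depends on Age, m_2 is pure noise
logx <- cbind(m_1 = 0.05 * sampledata$Age + rnorm(n, sd = 0.2),
              m_2 = rnorm(n))

# Fit one ordinary least squares model per metabolite (column);
# "- 1" avoids a second intercept, since factormat already has one
fits <- lapply(colnames(logx), function(m) {
  summary(lm(logx[, m] ~ factormat - 1))$coefficients
})
names(fits) <- colnames(logx)

# Coefficient and p-value for Age, one entry per metabolite
age_coef <- sapply(fits, function(tab) tab["factormatAge", "Estimate"])
age_pval <- sapply(fits, function(tab) tab["factormatAge", "Pr(>|t|)"])
age_coef
age_pval
```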
An example:
lresiddata<-list(unadjusted=unadjustedFit$residuals,
is=isFit$residuals,
ruv2=ruv2Fit$residuals)
CompareRlaPlots(lresiddata,groupdata=sampledata_eg$batch,
yrange=c(-3,3),
normmeth = c("unadjusted:","is:","ruv2:"))
ComparePvalHist(lpvals = lpvals_age,ylim=c(0,40),
normmeth = c("unadjusted","is","ruv2"))
The example datasets did not involve multiple platforms. In what follows, for the purpose of demonstrating VennPlot, we simply compare the results from different normalisation methods.
lnames<- list(names(ruv2Fit$coef[,"Age"])[which(ruv2Fit$p.value[,"Age"]<0.05)],
names(unadjustedFit$coef[,"Age"])[which(unadjustedFit$p.value[,"Age"]<0.05)],
names(isFit$coef[,"Age"])[which(isFit$p.value[,"Age"]<0.05)])
VennPlot(lnames, group.labels=c("ruv2","unadjusted","is"))
data(UVdata)
dataview(UVdata$featuredata)
##
## - The data consists of 185 rows by 33 columns
## m_g3_1 m_g3_2 m_g3_3 m_g3_4 m_g3_5 m_g3_6 m_g3_7
## s_1 15.32284 12.49509 12.92832 12.63389 12.77157 14.38417 16.52140
## s_2 15.33973 12.49854 12.95774 12.66332 12.80278 14.39479 16.53920
## s_3 15.27855 12.91523 13.04520 13.04520 12.90187 14.49742 16.60793
## s_4 14.97955 13.06340 13.01026 12.71651 12.94753 14.48041 16.56116
## s_5 15.25609 12.84194 13.02690 12.71838 12.85834 14.40205 16.59200
## s_6 15.29793 12.88264 13.01264 12.70868 12.86208 14.43004 16.58074
## s_7 15.29637 12.71626 13.03121 12.72204 12.89868 14.38071 16.57633
## s_8 16.38772 14.76798 14.31781 14.02218 14.11455 15.53213 17.25176
## s_9 16.39216 14.87127 14.35985 14.06619 14.16954 15.49755 17.25271
## s_10 15.98707 14.41295 13.87622 13.87622 13.58081 15.29668 17.11996
## m_g3_8 ...
## s_1 14.44427 ...
## s_2 14.46314 ...
## s_3 14.54595 ...
## s_4 14.48674 ...
## s_5 14.53921 ...
## s_6 14.53075 ...
## s_7 14.51485 ...
## s_8 15.84891 ...
## s_9 15.88468 ...
## s_10 15.42501 ...
dataview(UVdata$sampledata)
##
## - The data consists of 185 rows by 3 columns
## group instrument temperature
## s_1 Group1 inst1 temp1
## s_2 Group1 inst1 temp1
## s_3 Group1 inst1 temp1
## s_4 Group1 inst1 temp1
## s_5 Group1 inst1 temp1
## s_6 Group1 inst1 temp1
## s_7 Group1 inst1 temp1
## s_8 Group2 inst1 temp1
## s_9 Group2 inst1 temp1
## s_10 Group2 inst1 temp1
dataview(UVdata$metabolitedata)
##
## - The data consists of 33 rows by 3 columns
## Names IS neg_control
## 1 m_g3_1 0 0
## 2 m_g3_2 0 0
## 3 m_g3_3 0 0
## 4 m_g3_4 0 0
## 5 m_g3_5 0 0
## 6 m_g3_6 0 0
## 7 m_g3_7 0 0
## 8 m_g3_8 0 0
## 9 m_g3_9 0 0
## 10 m_g3_10 0 0
#Not RUN due to user input; we set k=1 each and saved normalized data as uv_ruvrandclust
#uv_ruvrand_norm<-NormQcmets(featuredata=UVdata$featuredata,
# method="ruvrandclust",
# qcmets=which(UVdata$metabolitedata$neg_control==1),
# k=1)
data("uv_ruvrandclust")
dataview(uv_ruvrandclust$featuredata)
##
## - The data consists of 185 rows by 33 columns
## m_g3_1 m_g3_2 m_g3_3 m_g3_4 m_g3_5 m_g3_6
## s_1 -1.0086748 -3.452141 -0.69290532 -0.84687172 -0.44706357 -1.4747494
## s_2 -0.9824190 -3.439549 -0.65567608 -0.80971346 -0.40827974 -1.4550329
## s_3 -1.1638612 -3.140287 -0.66851990 -0.52710301 -0.40651965 -1.4691839
## s_4 -1.4178384 -2.948157 -0.66590423 -0.81863202 -0.32442133 -1.4424767
## s_5 -1.1988837 -3.225855 -0.69730346 -0.86429848 -0.46022498 -1.5767529
## s_6 -1.1541648 -3.182337 -0.70915725 -0.87161609 -0.45414934 -1.5459585
## s_7 -1.1418740 -3.335193 -0.67903072 -0.84682118 -0.40633760 -1.5818450
## s_8 -0.6461694 -1.865099 0.11077337 -0.03835783 0.32741883 -1.0088362
## s_9 -0.6780271 -1.797255 0.12254054 -0.02430341 0.35303212 -1.0786570
## s_10 -0.6833759 -1.865244 -0.02769755 0.11568067 0.08784137 -0.8913671
## m_g3_7 m_g3_8 ...
## s_1 0.6112379 -1.1593267 ...
## s_2 0.6381617 -1.1315104 ...
## s_3 0.5897389 -1.1635953 ...
## s_4 0.5868263 -1.1797910 ...
## s_5 0.5615645 -1.1823399 ...
## s_6 0.5531114 -1.1880398 ...
## s_7 0.5621986 -1.1907146 ...
## s_8 0.6573495 -0.4257514 ...
## s_9 0.6229378 -0.4246587 ...
## s_10 0.8796062 -0.5024085 ...
lfeaturedata<-list(unadj=UVdata$featuredata,ruv=uv_ruvrandclust$featuredata,
ruvuv=uv_ruvrandclust$uvdata)
CompareRlaPlots(lfeaturedata,
groupdata=interaction(UVdata$sampledata$temperature,UVdata$sampledata$instrument),
normmeth=c("Unadjusted:", "RUVrandclust normalized:",
"RUVrandclust: removed uv:"),
yrange=c(-3,3))
hca<-Dendrogram(featuredata=uv_ruvrandclust$featuredata,
groupdata=UVdata$sampledata$group,
clust=TRUE,
nclust=2)
HeatMap(uv_ruvrandclust$featuredata,
UVdata$sampledata$group,interactiveplot = TRUE,
colramp=c(75, "magenta", "green"),
distmethod = "manhattan", aggmethod = "ward.D")
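The clustering that underlies a dendrogram or a clustered heatmap can be reproduced with base R. The sketch below uses simulated data with the same distance and agglomeration choices as the HeatMap call above (Manhattan distance, "ward.D" linkage). It is not the package's Dendrogram() or HeatMap() code.

```r
set.seed(5)
# Two well-separated toy groups of five samples each
x <- rbind(matrix(rnorm(20, mean = 0), nrow = 5),
           matrix(rnorm(20, mean = 5), nrow = 5))
rownames(x) <- paste0("s_", 1:10)

# Manhattan distance and Ward linkage, matching the options passed above
d <- dist(x, method = "manhattan")
hc <- hclust(d, method = "ward.D")

# Cut the tree into two clusters and compare with the known groups
clusters <- cutree(hc, k = 2)
table(clusters, rep(c("g1", "g2"), each = 5))
plot(hc)
```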
Follow a similar approach to section 4.32a.
svm<-SvmFit(featuredata=uv_ruvrandclust$featuredata,
groupdata=UVdata$sampledata$group,
crossvalid=TRUE,
k=5,
rocplot = TRUE)
Follow a similar approach to section 4.32a.
See Freytag et al. (2015) for a detailed example of this concept.
lcor<-list(Corr(UVdata$featuredata)$results[,3],
Corr(uv_ruvrandclust$featuredata)$results[,3])
ComparePvalHist(lcor,normmeth = c("unadjusted","ruvrandclust"),
xlim=c(-1,1), xlab="Correlation coefficients",ylim=c(0,120))
lcor_p<-list(Corr(UVdata$featuredata)$results[,4],
Corr(uv_ruvrandclust$featuredata)$results[,4])
ComparePvalHist(lcor_p,normmeth = c("unadjusted","ruvrandclust"),
xlim=c(0,1),ylim=c(0,200))
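The pairwise correlations and p-values summarized in these histograms can be computed with base R as sketched below, using simulated data. The layout of the results data frame is an assumption made for the example and need not match the output of Corr().

```r
set.seed(6)
# Toy log-abundance matrix: 50 samples x 4 metabolites
x <- matrix(rnorm(200), nrow = 50,
            dimnames = list(NULL, paste0("m_", 1:4)))

# Pairwise Pearson correlations between metabolites (columns)
cmat <- cor(x)

# Keep each pair once by taking the upper triangle
pairs_idx <- which(upper.tri(cmat), arr.ind = TRUE)
results <- data.frame(met1 = colnames(x)[pairs_idx[, "row"]],
                      met2 = colnames(x)[pairs_idx[, "col"]],
                      r = cmat[upper.tri(cmat)],
                      stringsAsFactors = FALSE)

# p-value for each pair from cor.test
results$p <- mapply(function(a, b) cor.test(x[, a], x[, b])$p.value,
                    results$met1, results$met2)
hist(results$r, xlim = c(-1, 1), xlab = "Correlation coefficients", main = "")
```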
De Livera, Alysha M, M. Sysi-Aho, Laurent Jacob, J. Gagnon-Bartsch, Sandra Castillo, J.A. Simpson, and Terence P. Speed. 2015. “Statistical methods for handling unwanted variation in metabolomics data.” Analytical Chemistry 87 (7). American Chemical Society: 3606–15. doi:10.1021/ac502439y.
De Livera, Alysha M, Daniel A Dias, David De Souza, Thusitha Rupasinghe, James Pyke, Dedreia Tull, Ute Roessner, Malcolm McConville, and Terence P Speed. 2012. “Normalizing and integrating metabolomics data.” Analytical Chemistry 84 (24): 10768–76. doi:10.1021/ac302748b.
Dunn, Warwick B, David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyre, Nadine Anderson, Marie Brown, et al. 2011. “Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry.” Nature Protocols 6 (7): 1060–83.
Freytag, Saskia, Johann Gagnon-Bartsch, Terence P. Speed, and Melanie Bahlo. 2015. “Systematic noise degrades gene co-expression signals but can be corrected.” BMC Bioinformatics 16 (1). BMC Bioinformatics: 309. doi:10.1186/s12859-015-0745-3.
Gagnon-Bartsch, Johann A, Laurent Jacob, and Terence P. Speed. 2014. Removing unwanted variation from high dimensional data with negative controls. IMS Monographs. Accepted for publication.
Gullberg, Jonas, Pär Jonsson, Anders Nordström, Michael Sjöström, and Thomas Moritz. 2004. “Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry.” Analytical Biochemistry 331 (2): 283–95. doi:10.1016/j.ab.2004.04.037.
Kirwan, JA, and DI Broadhurst. 2013. “Characterising and correcting batch variation in an automated direct infusion mass spectrometry (DIMS) metabolomics workflow.” Analytical and …, 5147–57. doi:10.1007/s00216-013-6856-7.
Kirwan, JA, RJM Weber, DI Broadhurst, and MR Viant. 2014. “Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and quality control.” Scientific Data 1 (June): 1–13. doi:10.1038/sdata.2014.12.
Luan, Hemi. 2017. “statTarget: R package.”
Redestig, Henning, Atsushi Fukushima, Hans Stenlund, Thomas Moritz, Masanori Arita, Kazuki Saito, and Miyako Kusano. 2009. “Compensation for systematic cross-contribution improves normalization of mass spectrometry based metabolomics data.” Analytical Chemistry 81 (19): 7974–80.
Scholz, M, S Gatzek, A Sterling, O Fiehn, and J Selbig. 2004. “Metabolite fingerprinting: detecting biological features by independent component analysis.” Bioinformatics (Oxford, England) 20 (15): 2447–54. doi:10.1093/bioinformatics/bth270.
Sysi-Aho, Marko, Mikko Katajamaa, Yetukuri Laxman, and Matej Oresic. 2007. “Normalization method for metabolomics data using optimal selection of multiple internal standards.” BMC Bioinformatics 8 (January): 93. doi:10.1186/1471-2105-8-93.
Wang, Weixun, Haihong Zhou, Hua Lin, Sushmita Roy, Thomas A Shaler, Lander R Hill, Scott Norton, Praveen Kumar, Markus Anderle, and Christopher H Becker. 2003. “Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards.” Analytical Chemistry 75 (18): 4818–26.