NormalizeMets Vignette

Alysha M De Livera, Gavriel Olshansky

2017-10-22

1. Introduction

The NormalizeMets package is a collection of functions designed to implement, assess, and choose a suitable normalization method for a given metabolomics study. The functions in this package are also available as a graphical user interface within Microsoft Excel, a familiar program for most biological researchers.

The package includes several widely used traditional and recently developed metabolomics normalization methods, which can be used to

  1. remove the component of unwanted variation to obtain a "normalized" data matrix that is suitable for downstream statistical analysis, or to

  2. accommodate the component of unwanted variation in a statistical model designed to answer the research question of interest.

In addition, the package can be used for visualisation of metabolomics data using interactive graphical displays, and for obtaining statistical results for

  1. identifying biomarkers that are associated with an exposure, adjusting for confounding variables,

  2. clustering using hierarchical cluster analysis and principal component analysis,

  3. classification using the support vector machine algorithm, and

  4. correlation analysis.

2. Getting Started

The R software environment can be downloaded for free from the Comprehensive R Archive Network (CRAN) https://cran.r-project.org/, and is hosted by a large number of sites. A very detailed description of installation of R and alternate methods, FAQs, platform dependencies and the like can be found at https://cran.r-project.org/doc/manuals/R-admin.html.

The use of RStudio is also recommended. RStudio is an integrated development environment (IDE) that can be useful for handling R scripts and functions, as well as loading packages and data. For a guide on installation and usage of RStudio, please refer to RStudio's page, https://www.rstudio.com/.

Install the NormalizeMets package by using the following function:

install.packages("NormalizeMets")

To then load the package use:

library(NormalizeMets)

To cite the package use:

citation("NormalizeMets")
## 
## To cite package 'NormalizeMets' in publications use:
## 
##   Alysha M De Livera and Gavriel Olshansky (2017). NormalizeMets:
##   Analysis of Metabolomics Data. R package version 0.22.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {NormalizeMets: Analysis of Metabolomics Data},
##     author = {Alysha M {De Livera} and Gavriel Olshansky},
##     year = {2017},
##     note = {R package version 0.22},
##   }
## 
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

3. Reading the data

3.1 Example datasets

Four different datasets are included in this package:

  1. mixdata, as described by Redestig et al. (2009). See ?mixdata in R.

  2. Didata, as described by Kirwan et al. (2014). See ?Didata in R.

  3. UVdata, as described by De Livera et al. (2015). See ?UVdata in R.

  4. alldata_eg, a subset of a cohort study dataset described by De Livera et al. (2015). See ?alldata_eg in R.

3.2 Data format used in the package

The NormalizeMets package stores three different sets of information.

(i) featuredata

featuredata is a metabolomics data matrix taking the following format, with metabolites in columns and samples in rows. Unique sample names should be provided as row names.

data("alldata_eg")
featuredata_eg<-alldata_eg$featuredata
dataview(featuredata_eg)
## 
##  - The data consists of 150 rows by 128 columns
##            m_1       m_2       m_3       m_4       m_5       m_6       m_7
## s_1  10485.867 33220.562 1112.9795 1408.6396  455.7529 1100.4023 122.36316
## s_2   8960.469 29995.516  926.2891  529.0449  873.6094 2201.2461 173.08020
## s_3  10160.445 28559.641 1230.3330 1306.3203 1027.5078 2066.5918 269.79980
## s_4   8794.477 27593.750  901.7622 1800.0830  675.0396 1675.8496 287.99927
## s_5   8956.922 28161.766  979.1724  818.0366  904.0254 1245.9912 167.41333
## s_6   9092.258 31685.312  634.6899  478.3811  980.4233 1259.5635  97.56250
## s_7   2271.045  7692.176  143.1688  109.8399  730.6172 1387.7920 149.09668
## s_8   7850.402 26462.797  822.5991 1341.2158  583.5830 2164.1270 227.76965
## s_9  10969.062 21605.141  408.0005 1499.4795  452.2375  916.5283 234.49756
## s_10  1743.933  7647.645  218.4199  127.4294  408.2913  999.5718  74.82697
##           m_8 ...
## s_1  3855.805 ...
## s_2  5090.637 ...
## s_3  4483.906 ...
## s_4  4949.262 ...
## s_5  4302.934 ...
## s_6  4406.344 ...
## s_7  1543.406 ...
## s_8  5030.688 ...
## s_9  4657.254 ...
## s_10 1510.012 ...

(ii) sampledata

sampledata is a dataframe that contains sample specific information. This information can include sample types (e.g., quality control or biological), run order of the samples, factors of interest and other sample-specific data relevant to the analysis. Unique sample names should be provided as row names. These sample names must match, and be in the same order as, the sample names in featuredata.

sampledata_eg<-alldata_eg$sampledata
dataview(sampledata_eg)
## 
##  - The data consists of 150 rows by 4 columns
##        batch gender   Age   bmi
## s_1  Batch 2 code_1 58.66 22.22
## s_2  Batch 2 code_0 76.70 23.74
## s_3  Batch 2 code_0 56.15 28.19
## s_4  Batch 2 code_0 77.22 26.18
## s_5  Batch 2 code_0 74.29 26.37
## s_6  Batch 2 code_1 66.82 26.39
## s_7  Batch 1 code_0 65.00 29.00
## s_8  Batch 2 code_1 66.51 26.13
## s_9  Batch 2 code_0 70.23 27.26
## s_10 Batch 1 code_1 55.00 25.00

(iii) metabolitedata

metabolitedata contains metabolite specific information in a separate dataframe. This information can include, but is not limited to, internal/external standard and other positive/negative control information. Metabolite names should be provided as row names, and must match, and be in the same order as, the metabolite names in featuredata.

metabolitedata_eg<-alldata_eg$metabolitedata
dataview(metabolitedata_eg)
## 
##  - The data consists of 128 rows by 4 columns
##    names IS neg_controls pos_controls_gender
## 1    m_1  0            1                   0
## 2    m_2  0            1                   0
## 3    m_3  0            1                   0
## 4    m_4  0            0                   0
## 5    m_5  0            0                   0
## 6    m_6  0            0                   0
## 7    m_7  0            0                   0
## 8    m_8  0            1                   0
## 9    m_9  0            1                   0
## 10  m_10  0            1                   0

Importing the data from Excel or text files for featuredata, sampledata, and metabolitedata is relatively simple and can be done using the commands read.csv() or read.table(). The user may find it easier to combine these three datasets into a list that can be called in functions as required. For example,

alldata_eg<-list(featuredata=featuredata_eg, sampledata=sampledata_eg,
                 metabolitedata=metabolitedata_eg)
dataview(alldata_eg$metabolitedata)
## 
##  - The data consists of 128 rows by 4 columns
##    names IS neg_controls pos_controls_gender
## 1    m_1  0            1                   0
## 2    m_2  0            1                   0
## 3    m_3  0            1                   0
## 4    m_4  0            0                   0
## 5    m_5  0            0                   0
## 6    m_6  0            0                   0
## 7    m_7  0            0                   0
## 8    m_8  0            1                   0
## 9    m_9  0            1                   0
## 10  m_10  0            1                   0

A class called alldata (see ?alldata-class) is used in the package to store the information as above.
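
The matching and ordering requirements described above can be checked before any analysis. The following is a minimal base R sketch on made-up data; the helper check_alignment and all toy object names are hypothetical and are not part of NormalizeMets.

```r
# Toy data: 4 samples (rows) by 3 metabolites (columns); all values are made up.
featuredata_toy <- matrix(
  c(10, 20, 30, 40,     # m_1
    5, 6, 7, 8,         # m_2
    100, 90, 80, 70),   # m_3
  nrow = 4,
  dimnames = list(paste0("s_", 1:4), paste0("m_", 1:3))
)
sampledata_toy <- data.frame(
  batch = c("Batch 1", "Batch 1", "Batch 2", "Batch 2"),
  row.names = paste0("s_", 1:4)
)
metabolitedata_toy <- data.frame(
  names = paste0("m_", 1:3),
  IS = c(0, 0, 1),
  row.names = paste0("m_", 1:3)
)

# Hypothetical helper (not a package function): TRUE only when the sample and
# metabolite names agree in both content and order.
check_alignment <- function(featuredata, sampledata, metabolitedata) {
  identical(rownames(featuredata), rownames(sampledata)) &&
    identical(colnames(featuredata), rownames(metabolitedata))
}

alldata_toy <- list(featuredata = featuredata_toy,
                    sampledata = sampledata_toy,
                    metabolitedata = metabolitedata_toy)
aligned <- check_alignment(alldata_toy$featuredata,
                           alldata_toy$sampledata,
                           alldata_toy$metabolitedata)
```

Reordering the rows of only one of the three objects would make the check return FALSE, which is the situation the format rules are meant to prevent.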

4. NormalizeMets workflow

4.1 Log transforming, handling missing values, and visualization

4.11 Log transforming

Abundances of metabolites in a data matrix usually have a right skewed distribution. Therefore, an appropriate transformation is needed to obtain a more symmetric distribution. The metabolomics literature has discussed various transformations such as log, cubic and square root as ways of handling this skewness, most of which belong to the family of power transformations. However, the log transformation is usually adequate for statistical purposes. A metabolomics data matrix in the featuredata format can be transformed using the following function.

LogTransform <- function(featuredata, base=exp(1), saveoutput=FALSE,
    outputname="log.results",zerotona=FALSE)

The output can be saved as a .csv file by setting saveoutput to TRUE, and giving the output a name using outputname. To log transform the example data the following can be used.

logdata <- LogTransform(featuredata_eg,zerotona=TRUE)
#logdata
#dataview(logdata$featuredata)
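
To illustrate the idea behind the zerotona argument, here is a small base R sketch. It assumes that zerotona=TRUE treats zero abundances as missing before taking logs; the helper log_transform_sketch is hypothetical and is not the package's implementation.

```r
# Hypothetical helper: log transform a matrix, optionally turning zeros into NA
# first, because log(0) is -Inf and would distort downstream summaries.
log_transform_sketch <- function(x, base = exp(1), zerotona = FALSE) {
  x <- as.matrix(x)
  if (zerotona) {
    x[!is.na(x) & x == 0] <- NA
  }
  log(x, base = base)
}

# Toy abundances: 2 samples by 2 metabolites, with one zero value.
toy <- matrix(c(1, 10, 100, 0), nrow = 2,
              dimnames = list(c("s_1", "s_2"), c("m_1", "m_2")))

logged <- log_transform_sketch(toy, base = 10, zerotona = TRUE)
```

With zerotona = FALSE the zero entry would become -Inf instead of NA, so it could not be handled by the missing value methods in the next section.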

4.12 Missing Values

A frequent issue in metabolomics data sets is the occurrence of missing values. It is important to reduce the number of missing values as much as possible by using an effective pre-processing procedure. For example, a secondary peak picking method can be used for LC-MS data to fill in missing peaks which are not detected and aligned. The following MissingValues() function can be used to replace missing values, depending on the nature of the missing data.

MissingValues(featuredata, sampledata, metabolitedata, feature.cutoff = 0.8,
  sample.cutoff = 0.8, method = c("knn", "replace"), k = 10,
  featuremax.knn = 0.8, samplemax.knn = 0.8, seed = 100,
  saveoutput = FALSE, outputname = "missing.values.rep")

The user is able to:

  1. Remove features with a large proportion of missing values (controlled by feature.cutoff), and/or

  2. Remove samples with a large proportion of missing values (controlled by sample.cutoff), and/or

  3. Replace missing values, either by using the k-nearest neighbour algorithm or by substituting a small number (half the minimum of the matrix, as is commonly used).

For example, one can use:

imp <-  MissingValues(logdata$featuredata,sampledata_eg,metabolitedata_eg,
                      feature.cutoff=0.8, sample.cutoff=0.8, method="knn")
##  -> Checking features...
##  -> Checking samples...
#imp
#dataview(imp$featuredata)
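
The filtering and "replace" steps listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper replace_missing_sketch is hypothetical, and it assumes that a feature is dropped when its proportion of missing values exceeds the cutoff, which may differ in detail from what MissingValues() does.

```r
# Hypothetical helper: drop features with too many missing values, then replace
# the remaining NAs with half the minimum observed value of the matrix.
replace_missing_sketch <- function(x, feature.cutoff = 0.8) {
  prop_missing <- colMeans(is.na(x))               # proportion missing per feature
  x <- x[, prop_missing <= feature.cutoff, drop = FALSE]
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

# Toy matrix: 4 samples by 3 metabolites; m_2 is entirely missing.
toy <- matrix(c(4, NA, 6, 8,
                NA, NA, NA, NA,
                2, 3, NA, 5),
              nrow = 4,
              dimnames = list(paste0("s_", 1:4), paste0("m_", 1:3)))

filled <- replace_missing_sketch(toy)
```

Here m_2 is removed, and the two remaining missing entries are set to 1 (half of the smallest observed value, 2).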

4.13 Visualisation

The log transformed data matrix can then be visualised using various plots in order to explore variation in the data, clustering tendencies, trends and outliers.

4.13a RlaPlots

One way of visualising the log transformed metabolomics data is the use of across group or within group relative log abundance (RLA) plots (De Livera et al. 2012; De Livera et al. 2015). In R, these can be obtained using the following function:

RlaPlots <- function(featuredata, groupdata, minoutlier = 0.5, type=c("ag", "wg"), saveplot=FALSE,
                     plotname = "RLAPlot", savetype= c("png","bmp","jpeg", "tiff","pdf"),
                     interactiveplot=TRUE, interactiveonly = TRUE,
                     saveinteractiveplot = FALSE,
                     interactivesavename = "interactiveRlaPlot",
                     cols=NULL,cex.axis=0.8, las=2, ylim=c(-2, 2), oma=c(7, 4, 4, 2) + 0.1, ...)

The default is an interactive plot which can be saved by setting saveinteractiveplot to TRUE. This can also be downloaded as a png file. A non-interactive plot can be obtained by setting interactiveplot to FALSE and saved in 5 different formats using saveplot = TRUE, giving the plotname and specifying the savetype. To avoid label overlapping, minoutlier could be set so that only samples with resulting median greater than minoutlier will be labeled.

An example of sample-wise RLA plots:

RlaPlots(imp$featuredata, sampledata_eg[,1], cex.axis = 0.6,saveinteractiveplot = TRUE)
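
As a rough illustration of the quantity an RLA plot displays, the base R sketch below centres each metabolite's log abundance on its median, either across all samples ("ag") or within each group ("wg"). The helper rla_sketch and the toy values are hypothetical; this is not the package's plotting code.

```r
# Hypothetical helper: relative log abundances.
# x: samples in rows, metabolites in columns, already log transformed.
rla_sketch <- function(x, groups = NULL) {
  centre <- function(m) sweep(m, 2, apply(m, 2, median, na.rm = TRUE), "-")
  if (is.null(groups)) {
    return(centre(x))                      # across group: overall medians
  }
  out <- x
  for (g in unique(groups)) {              # within group: group-specific medians
    out[groups == g, ] <- centre(x[groups == g, , drop = FALSE])
  }
  out
}

toy <- matrix(c(1, 2, 3, 10,
                4, 5, 6, 13),
              nrow = 4,
              dimnames = list(paste0("s_", 1:4), c("m_1", "m_2")))
groups <- c("A", "A", "B", "B")

rla_ag <- rla_sketch(toy)                     # across group RLA values
rla_wg <- rla_sketch(toy, groups)             # within group RLA values
sample_medians <- apply(rla_ag, 1, median)    # centre of each sample's box
```

Calling boxplot(t(rla_ag)) would draw one box per sample; in this toy example s_4 sits well above zero, which is the kind of pattern the minoutlier labelling is meant to highlight.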

The user can also explore metabolite-wise RLA plots as follows:

RlaPlots(t(imp$featuredata), groupdata=rep("group",dim(imp$featuredata)[2]),
         cex.axis = 0.6,saveinteractiveplot = TRUE,xlabel="Metabolites")

4.13b PcaPlots

The following function can be used to obtain multiple plots for exploration of the principal components of the featuredata matrix: a bar plot indicating the variance explained by each principal component, scores and loading plots with specified axes (interactive and non-interactive), and a pairs plot of the first n principal components. These plots are useful in identifying any outlying samples and getting a preliminary understanding of the structure of the data. As described in the section above, the outputs can be saved for publication purposes.

PcaPlots <- function(featuredata, groupdata, saveplot=FALSE,saveinteractiveplot = FALSE, 
                     plotname="",savetype= c("png","bmp","jpeg","tiff","pdf"),
                     interactiveplots = TRUE, y.axis=1, x.axis=2, center=TRUE, scale=TRUE,
                     main=NULL, varplot=FALSE, multiplot=FALSE, n=3, cols=NULL,cex_val = 0.7, ...)
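
The quantities behind these plots can be sketched with base R's prcomp(). This is an illustration on made-up data, not the internals of PcaPlots; the object names are hypothetical.

```r
# Toy log-abundance matrix: 6 samples by 3 metabolites (values are made up).
toy <- matrix(c(1.0, 1.2, 0.9, 3.1, 3.0, 2.8,
                2.0, 2.1, 1.8, 4.2, 4.1, 3.9,
                0.5, 0.4, 0.6, 0.5, 0.6, 0.4),
              nrow = 6,
              dimnames = list(paste0("s_", 1:6), paste0("m_", 1:3)))

pca <- prcomp(toy, center = TRUE, scale. = FALSE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # heights of a variance bar plot
scores <- pca$x[, 1:2]                         # sample coordinates (scores plot)
loadings <- pca$rotation[, 1:2]                # metabolite weights (loadings plot)
```

In this toy example the first component captures almost all of the variance because m_1 and m_2 move together while m_3 barely changes.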

An example is given below:

PcaPlots(imp$featuredata,sampledata_eg[,1],
         scale=FALSE, center=TRUE, multiplot = TRUE, varplot = TRUE)

4.13c HeatMap

The HeatMap function produces an interactive and/or a non-interactive heatmap, enabling visualization of the whole data matrix. The metabolites and/or the samples can be optionally clustered using hierarchical clustering. This function is demonstrated in section 4.32b.

4.2 Normalisation

Normalization methods presented in this package are divided into four categories: those which use (i) internal standards, external standards, and other quality control metabolites (NormQcmets) (Sysi-Aho et al. 2007; Redestig et al. 2009; De Livera et al. 2012; De Livera et al. 2015; Gullberg et al. 2004), (ii) quality control samples (NormQcsamples) (Dunn et al. 2011), (iii) scaling methods (NormScaling) (Scholz et al. 2004; Wang et al. 2003), and (iv) combined methods (NormCombined) (Kirwan and Broadhurst 2013). Unless otherwise stated, these functions assume that featuredata has been log transformed beforehand.

4.21 NormQcmets Normalisation methods based on quality control metabolites

The approaches in NormQcmets use internal standards, external standards, and other quality control metabolites. These include the is method, which uses a single standard (Gullberg et al. 2004), the ccmn (cross contribution compensating multiple internal standard) method (Redestig et al. 2009), the nomis (normalization using optimal selection of multiple internal standards) method (Sysi-Aho et al. 2007), and the remove unwanted variation methods (Gagnon-Bartsch, Jacob, and Speed 2014) as applied to metabolomics using “ruv2” (De Livera et al. 2012), “ruvrand” and “ruvrandclust” (De Livera et al. 2015). Note that ruv2 is an application-specific method designed for identifying biomarkers using a linear model that adjusts for the unwanted variation component.

The implementation is as follows:

NormQcmets <- function(featuredata, factors = NULL, method = c("is", "nomis", "ccmn",
  "ruv2", "ruvrand", "ruvrandclust"), isvec = NULL, ncomp = NULL,
  k = NULL, plotk = FALSE, lambda = NULL, qcmets = NULL,
  maxIter = 200, nUpdate = 100, lambdaUpdate = TRUE, p = 2,
  saveoutput = FALSE, outputname = NULL,...)
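
To give a feel for the simplest of these methods, the sketch below shows single internal standard normalisation on log transformed data, where dividing by the standard on the raw scale becomes a subtraction on the log scale. This is an assumption about how the is method behaves, written in base R on made-up values; it is not the package's code.

```r
# Toy log abundances: 3 samples, 2 metabolites and 1 internal standard (IS_1).
toy <- matrix(c(5.0, 5.5, 6.0,    # m_1
                3.0, 3.4, 4.1,    # m_2
                2.0, 2.5, 3.0),   # IS_1
              nrow = 3,
              dimnames = list(paste0("s_", 1:3), c("m_1", "m_2", "IS_1")))

isvec <- toy[, "IS_1"]                    # one standard value per sample

# Subtract each sample's internal standard from every metabolite in that sample.
normalized <- sweep(toy, 1, isvec, "-")
```

After this step m_1 is constant across the three samples, because its sample-to-sample variation tracked the internal standard exactly in this toy example.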

Several examples are given below:

##'nomis' method
Norm_nomis <-NormQcmets(imp$featuredata, method = "nomis", 
                        qcmets = which(metabolitedata_eg$IS ==1))
#Norm_nomis
#Norm_nomis$featuredata

##'ccmn' method
Norm_ccmn <-NormQcmets(imp$featuredata, method = "ccmn", 
                       qcmets = which(metabolitedata_eg$IS ==1),
                       factors=sampledata_eg$gender)
#Norm_ccmn
#Norm_ccmn$featuredata

##'median' method (a scaling method from NormScaling; see section 4.23)
Norm_med <- NormScaling(imp$featuredata, method = "median")
#Norm_med
#Norm_med$featuredata

##'ruv2' method
factormat<-model.matrix(~gender +Age +bmi, sampledata_eg)
#head(factormat)
Norm_ruv2<-NormQcmets(imp$featuredata, factormat=factormat,method = "ruv2", 
                      k=2, qcmets = which(metabolitedata_eg$IS ==1))
#Norm_ruv2

##'is' method
Norm_is <-NormQcmets(imp$featuredata, method = "is", 
                       isvec = imp$featuredata[,which(metabolitedata_eg$IS ==1)[1]])
#Norm_is
#Norm_is$featuredata

4.22 NormQcsamples Normalisation methods based on quality control samples

This function is based on the quality control sample based robust LOESS (locally estimated scatterplot smoothing) signal correction (QC-RLSC) method as described by Dunn et al. (2011) and implemented in statTarget (Luan 2017). Note that for this approach featuredata is not log transformed beforehand. By default, the function log transforms the data after normalization; this can be changed by setting lg=FALSE.

NormQcsamples<- function(featuredata, sampledata, method = c("rlsc"), span = 0,
  deg = 2, lg = TRUE, saveoutput = FALSE,
  outputname = "qcsample_results", ...)
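
The idea behind QC-based LOESS correction can be sketched for a single metabolite with base R's loess(). Everything here is simulated and hypothetical: the drift, the QC spacing, and the fixed span of 0.75 are assumptions chosen for illustration, and the package itself may select the span and handle batches differently.

```r
set.seed(100)
run_order <- 1:31
is_qc <- run_order %% 3 == 1              # a QC injection at run order 1, 4, ..., 31
drift <- 1000 - 10 * run_order            # simulated loss of signal over the run
raw <- drift + rnorm(length(run_order), sd = 10)

# Fit a smooth trend to the QC samples only.
qc <- data.frame(run_order = run_order[is_qc], intensity = raw[is_qc])
fit <- loess(intensity ~ run_order, data = qc, span = 0.75, degree = 2)

# Predict the trend at every injection and divide it out, rescaling to the QC median.
trend <- predict(fit, newdata = data.frame(run_order = run_order))
corrected <- raw / trend * median(qc$intensity)

# Coefficient of variation of the QC samples before and after correction.
cv <- function(v) sd(v) / mean(v)
cv_before <- cv(raw[is_qc])
cv_after <- cv(corrected[is_qc])
```

The first and last injections are QC samples here so that every sample lies within the range of the fitted curve; loess() does not extrapolate by default.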

An example implementation is given below. The sampledata should contain the batch number, the class and the run order, with column names ‘batch’, ‘class’ and ‘order’ respectively. For the QC samples, ‘class’ should be allocated as 0.

 data(Didata)
 dataview(Didata$sampledata)
## 
##  - The data consists of 172 rows by 3 columns
##              batch class order
## batch01_QC03     1     0     3
## batch01_QC01     1     0     1
## batch01_QC02     1     0     2
## batch01_C05      1     1     4
## batch01_S07      1     2     5
## batch01_C10      1     1     6
## batch01_QC04     1     0     7
## batch01_S01      1     2     8
## batch01_C03      1     1     9
## batch01_S05      1     2    10
#Not run here due to lengthy output
#Norm_rlsc<- NormQcsamples(sampledata=Didata$sampledata[order(Didata$sampledata$order),],
#               featuredata=Didata$featuredata[order(Didata$sampledata$order),])
#Norm_rlsc

4.23 NormScaling Normalisation methods based on scaling

The scaling normalization methods (Scholz et al. 2004; Wang et al. 2003) included in the package are normalization to the total sum, and normalization by the median or mean of each sample, denoted by sum, median, and mean respectively. The method ref normalises the metabolite abundances to a specific reference vector such as the sample weight or volume.

NormScaling<-function(featuredata, method = c("median", "mean", "sum", "ref"),
  refvec = NULL, saveoutput = FALSE, outputname = NULL, ...)
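
As a minimal sketch of the median and ref ideas on log transformed data, the base R code below subtracts a per-sample quantity from every metabolite in that sample. It assumes that scaling on the log scale amounts to a subtraction; the toy values and the reference vector are made up, and this is not the package's implementation.

```r
# Toy log abundances: 3 samples by 3 metabolites; each sample is shifted by a
# different constant, mimicking a sample-level dilution effect.
toy <- matrix(c(2, 4, 6,      # m_1
                3, 5, 7,      # m_2
                10, 12, 14),  # m_3
              nrow = 3,
              dimnames = list(paste0("s_", 1:3), paste0("m_", 1:3)))

# 'median': subtract each sample's median log abundance.
sample_medians <- apply(toy, 1, median)
median_norm <- sweep(toy, 1, sample_medians, "-")

# 'ref': subtract a reference vector, e.g. a hypothetical log sample weight.
refvec <- c(s_1 = 0.1, s_2 = 0.2, s_3 = 0.3)
ref_norm <- sweep(toy, 1, refvec, "-")
```

After median normalisation all three toy samples have identical profiles, since they differed only by a constant shift.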

An example:

 Norm_med <- NormScaling(imp$featuredata, method = "median")
 Norm_med
## 
## Feature data
## 
##  - The data consists of 150 rows by 128 columns 
## 
##            m_1      m_2        m_3         m_4        m_5         m_6
## s_1  2.0775504 3.230691 -0.1654373  0.07014646 -1.0582824 -0.17680211
## s_2  1.7648718 2.973097 -0.5045196 -1.06463264 -0.5630727  0.36107289
## s_3  2.0381826 3.071675 -0.0730348 -0.01310540 -0.2531834  0.44558111
## s_4  1.7505001 2.893965 -0.5270282  0.16420897 -0.8166078  0.09269646
## s_5  1.9863439 3.131882 -0.2271303 -0.40693092 -0.3069806  0.01384862
## s_6  1.7174797 2.965910 -0.9445623 -1.22729118 -0.5097144 -0.25917839
## s_7  0.9931214 2.213085 -1.7708494 -2.03585006 -0.1409843  0.50059532
## s_8  1.6018886 2.817064 -0.6539625 -0.16509966 -0.9972448  0.31334083
## s_9  1.8053610 2.483213 -1.4862047 -0.18459978 -1.3832655 -0.67688013
## s_10 0.7098876 2.188143 -1.3675909 -1.90644751 -0.7420295  0.15331663
##            m_7       m_8 ...
## s_1  -2.373240 1.0771017 ...
## s_2  -2.181951 1.1994522 ...
## s_3  -1.590395 1.2201749 ...
## s_4  -1.668421 1.1756146 ...
## s_5  -1.993372 1.2532143 ...
## s_6  -2.817206 0.9931017 ...
## s_7  -1.730279 0.6068732 ...
## s_8  -1.938097 1.1568805 ...
## s_9  -2.040028 0.9487082 ...
## s_10 -2.438832 0.5658624 ...

4.24 NormCombined Normalisation methods based on a combination of methods

In some circumstances, researchers use a combination of the above normalizations (i.e., one method followed by another). This can be achieved using the NormCombined function. The function defaults to employing the ‘rlsc’ approach followed by the ‘median’ approach.

NormCombined<-function(featuredata, methods = c("rlsc", "median"),
  savefinaloutput = FALSE, finaloutputname = NULL, ...)

For instance,

#Not run due to lengthy output
#Norm_comb<- NormCombined(featuredata=Didata$featuredata[order(Didata$sampledata$order),],
#                          sampledata=Didata$sampledata[order(Didata$sampledata$order),],
#                          methods=c("rlsc","median"))
#Norm_comb

4.3 Assessing and choosing a normalization method

The criteria for assessing and choosing a normalization method implemented in this package have been described in detail by De Livera et al. (2012), De Livera et al. (2015) and Gagnon-Bartsch, Jacob, and Speed (2014).

4.31 Identifying biomarkers

4.31a Exploring the impact of the normalization methods on positive and negative control metabolites using volcano plots

The following examples fit a linear model to unadjusted and normalized data in order to identify biomarkers associated with factors of interest.

unadjustedFit<-LinearModelFit(featuredata=imp$featuredata,
                              factormat=factormat,
                              ruv2=FALSE)
#unadjustedFit
isFit<-LinearModelFit(featuredata=Norm_is$featuredata,
                       factormat=factormat,
                       ruv2=FALSE)
#isFit
ruv2Fit<-LinearModelFit(featuredata=imp$featuredata,
                        factormat=factormat,
                        ruv2=TRUE,k=2,
                        qcmets = which(metabolitedata_eg$IS ==1))
#ruv2Fit
#Exploring metabolites associated with age
lcoef_age<-list(unadjusted=unadjustedFit$coefficients[,"Age"],
                is=isFit$coefficients[,"Age"],
                ruv2=ruv2Fit$coefficients[,"Age"])
lpvals_age<-list(unadjusted=unadjustedFit$p.value[,"Age"],
                 is=isFit$p.value[,"Age"],
                 ruv2=ruv2Fit$p.value[,"Age"])

negcontrols<-metabolitedata_eg$names[which(metabolitedata_eg$IS==1)]                   

CompareVolcanoPlots(lcoef=lcoef_age, 
                    lpvals_age, 
                    normmeth = c(":unadjusted", ":is", ":ruv2"),
                    xlab="Coef",
                    negcontrol=negcontrols)
## Warning: Ignoring 1 observations
#Exploring metabolites associated with BMI
lcoef_bmi<-list(unadjusted=unadjustedFit$coefficients[,"bmi"],
                   is=isFit$coefficients[,"bmi"],
                   ruv2=ruv2Fit$coefficients[,"bmi"])

lpvals_bmi<-list(unadjusted=unadjustedFit$p.value[,"bmi"],
                    is=isFit$p.value[,"bmi"],
                    ruv2=ruv2Fit$p.value[,"bmi"])

CompareVolcanoPlots(lcoef=lcoef_bmi, 
                    lpvals_bmi, 
                    normmeth = c(":unadjusted", ":is", ":ruv2"),
                    xlab="Coef",
                    negcontrol=negcontrols)
## Warning: Ignoring 1 observations
#Exploring metabolites associated with gender
lcoef_gender<-list(unadjusted=unadjustedFit$coefficients[,"gendercode_1"],
                is=isFit$coefficients[,"gendercode_1"],
                ruv2=ruv2Fit$coefficients[,"gendercode_1"])
lpvals_gender<-list(unadjusted=unadjustedFit$p.value[,"gendercode_1"],
                 is=isFit$p.value[,"gendercode_1"],
                 ruv2=ruv2Fit$p.value[,"gendercode_1"])
poscontrols_gender<-metabolitedata_eg$names[which(metabolitedata_eg$pos_controls_gender==1)]                   
CompareVolcanoPlots(lcoef=lcoef_gender, 
                    lpvals_gender, 
                    normmeth = c(":unadjusted", ":is", ":ruv2"),
                    negcontrol=negcontrols, 
                    poscontrol=poscontrols_gender)
## Warning: Ignoring 1 observations

4.31b Examine the residuals obtained from a fitted linear model using RLA plots

An example:

lresiddata<-list(unadjusted=unadjustedFit$residuals,
                 is=isFit$residuals,
                 ruv2=ruv2Fit$residuals)
CompareRlaPlots(lresiddata,groupdata=sampledata_eg$batch,
                yrange=c(-3,3),
               normmeth = c("unadjusted:","is:","ruv2:"))

4.31c Explore the distribution of p-values using histograms

  ComparePvalHist(lpvals = lpvals_age,ylim=c(0,40),
  normmeth = c("unadjusted","is","ruv2"))
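
The p-values being compared come from one linear model per metabolite. The base R sketch below shows that idea on simulated data using ordinary least squares via lm(); LinearModelFit may compute its statistics differently, and all object names and values here are hypothetical.

```r
set.seed(1)
n <- 40
age <- seq(30, 69, length.out = n)        # hypothetical covariate

# Toy log abundances: m_1 depends on age, m_2 and m_3 do not.
Y <- cbind(
  m_1 = 0.05 * age + rnorm(n, sd = 0.2),
  m_2 = rnorm(n),
  m_3 = rnorm(n)
)

fits <- lm(Y ~ age)                       # one least squares fit per column of Y
coefs <- coef(fits)["age", ]              # estimated age effect per metabolite
pvals <- sapply(summary(fits), function(s) coef(s)["age", "Pr(>|t|)"])
names(pvals) <- colnames(Y)

# Counts per bin, as would be drawn in a p-value histogram.
h <- hist(pvals, breaks = seq(0, 1, by = 0.05), plot = FALSE)
```

With many metabolites and no true effects, the histogram would be expected to look roughly flat; an excess of small p-values suggests either real associations or unremoved unwanted variation.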

4.31d Explore the consistency between results from different platforms using Venn plots

The example datasets did not involve multiple platforms. In what follows, for the purpose of demonstrating VennPlot, we simply compare the results from different normalisation methods.

  lnames<- list(names(ruv2Fit$coefficients[,"Age"])[which(ruv2Fit$p.value[,"Age"]<0.05)],
                names(unadjustedFit$coefficients[,"Age"])[which(unadjustedFit$p.value[,"Age"]<0.05)],
                names(isFit$coefficients[,"Age"])[which(isFit$p.value[,"Age"]<0.05)])
  
  VennPlot(lnames, group.labels=c("ruv2","unadjusted","is"))
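
The overlaps that a Venn plot displays can also be listed directly with base R set operations. The metabolite names below are made up for illustration.

```r
# Hypothetical sets of metabolites found significant under each method.
sig <- list(
  ruv2       = c("m_1", "m_4", "m_7"),
  unadjusted = c("m_1", "m_2", "m_4", "m_9"),
  is         = c("m_1", "m_4", "m_9")
)

common_all <- Reduce(intersect, sig)                                   # in all three
only_unadjusted <- setdiff(sig$unadjusted, union(sig$ruv2, sig$is))    # unadjusted only
set_sizes <- lengths(sig)                                              # size of each set
```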

4.32 Clustering

4.32a Exploration of the normalized data and the removed component of unwanted variation

data(UVdata)
dataview(UVdata$featuredata)
## 
##  - The data consists of 185 rows by 33 columns
##        m_g3_1   m_g3_2   m_g3_3   m_g3_4   m_g3_5   m_g3_6   m_g3_7
## s_1  15.32284 12.49509 12.92832 12.63389 12.77157 14.38417 16.52140
## s_2  15.33973 12.49854 12.95774 12.66332 12.80278 14.39479 16.53920
## s_3  15.27855 12.91523 13.04520 13.04520 12.90187 14.49742 16.60793
## s_4  14.97955 13.06340 13.01026 12.71651 12.94753 14.48041 16.56116
## s_5  15.25609 12.84194 13.02690 12.71838 12.85834 14.40205 16.59200
## s_6  15.29793 12.88264 13.01264 12.70868 12.86208 14.43004 16.58074
## s_7  15.29637 12.71626 13.03121 12.72204 12.89868 14.38071 16.57633
## s_8  16.38772 14.76798 14.31781 14.02218 14.11455 15.53213 17.25176
## s_9  16.39216 14.87127 14.35985 14.06619 14.16954 15.49755 17.25271
## s_10 15.98707 14.41295 13.87622 13.87622 13.58081 15.29668 17.11996
##        m_g3_8 ...
## s_1  14.44427 ...
## s_2  14.46314 ...
## s_3  14.54595 ...
## s_4  14.48674 ...
## s_5  14.53921 ...
## s_6  14.53075 ...
## s_7  14.51485 ...
## s_8  15.84891 ...
## s_9  15.88468 ...
## s_10 15.42501 ...
dataview(UVdata$sampledata)
## 
##  - The data consists of 185 rows by 3 columns
##       group instrument temperature
## s_1  Group1      inst1       temp1
## s_2  Group1      inst1       temp1
## s_3  Group1      inst1       temp1
## s_4  Group1      inst1       temp1
## s_5  Group1      inst1       temp1
## s_6  Group1      inst1       temp1
## s_7  Group1      inst1       temp1
## s_8  Group2      inst1       temp1
## s_9  Group2      inst1       temp1
## s_10 Group2      inst1       temp1
dataview(UVdata$metabolitedata)
## 
##  - The data consists of 33 rows by 3 columns
##      Names IS neg_control
## 1   m_g3_1  0           0
## 2   m_g3_2  0           0
## 3   m_g3_3  0           0
## 4   m_g3_4  0           0
## 5   m_g3_5  0           0
## 6   m_g3_6  0           0
## 7   m_g3_7  0           0
## 8   m_g3_8  0           0
## 9   m_g3_9  0           0
## 10 m_g3_10  0           0
#Not RUN due to user input; we set k=1 each and saved normalized data as uv_ruvrandclust
#uv_ruvrand_norm<-NormQcmets(featuredata=UVdata$featuredata,
#                            method="ruvrandclust",
#                            qcmets=which(UVdata$metabolitedata$neg_control==1),
#                            k=1)

data("uv_ruvrandclust")
dataview(uv_ruvrandclust$featuredata)
## 
##  - The data consists of 185 rows by 33 columns
##          m_g3_1    m_g3_2      m_g3_3      m_g3_4      m_g3_5     m_g3_6
## s_1  -1.0086748 -3.452141 -0.69290532 -0.84687172 -0.44706357 -1.4747494
## s_2  -0.9824190 -3.439549 -0.65567608 -0.80971346 -0.40827974 -1.4550329
## s_3  -1.1638612 -3.140287 -0.66851990 -0.52710301 -0.40651965 -1.4691839
## s_4  -1.4178384 -2.948157 -0.66590423 -0.81863202 -0.32442133 -1.4424767
## s_5  -1.1988837 -3.225855 -0.69730346 -0.86429848 -0.46022498 -1.5767529
## s_6  -1.1541648 -3.182337 -0.70915725 -0.87161609 -0.45414934 -1.5459585
## s_7  -1.1418740 -3.335193 -0.67903072 -0.84682118 -0.40633760 -1.5818450
## s_8  -0.6461694 -1.865099  0.11077337 -0.03835783  0.32741883 -1.0088362
## s_9  -0.6780271 -1.797255  0.12254054 -0.02430341  0.35303212 -1.0786570
## s_10 -0.6833759 -1.865244 -0.02769755  0.11568067  0.08784137 -0.8913671
##         m_g3_7     m_g3_8 ...
## s_1  0.6112379 -1.1593267 ...
## s_2  0.6381617 -1.1315104 ...
## s_3  0.5897389 -1.1635953 ...
## s_4  0.5868263 -1.1797910 ...
## s_5  0.5615645 -1.1823399 ...
## s_6  0.5531114 -1.1880398 ...
## s_7  0.5621986 -1.1907146 ...
## s_8  0.6573495 -0.4257514 ...
## s_9  0.6229378 -0.4246587 ...
## s_10 0.8796062 -0.5024085 ...
 lfeaturedata<-list(unadj=UVdata$featuredata,ruv=uv_ruvrandclust$featuredata,
                    ruvuv=uv_ruvrandclust$uvdata)
 CompareRlaPlots(lfeaturedata,
                 groupdata=interaction(UVdata$sampledata$temperature,UVdata$sampledata$instrument),
                 normmeth=c("Unadjusted:", "RUVrandclust normalized:", 
                            "RUVrandclust: removed uv:"),
                 yrange=c(-3,3))

4.32b Clustering accuracy of the known samples

 hca<-Dendrogram(featuredata=uv_ruvrandclust$featuredata,
                 groupdata=UVdata$sampledata$group, 
                 clust=TRUE, 
                 nclust=2)

 HeatMap(uv_ruvrandclust$featuredata,
          UVdata$sampledata$group,interactiveplot = TRUE, 
          colramp=c(75, "magenta", "green"),
          distmethod = "manhattan", aggmethod = "ward.D")
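
Clustering accuracy against known groups can be checked with a small base R sketch. It uses the same distance and agglomeration settings as the HeatMap call above (manhattan, ward.D) on made-up data; it is not the code behind Dendrogram or HeatMap.

```r
# Toy data: two well separated groups of 4 samples, 3 metabolites each.
offsets <- c(-0.2, -0.1, 0.1, 0.2)
group_a <- outer(offsets, c(1, 2, 3), "+")
group_b <- outer(offsets, c(5, 6, 7), "+")
toy <- rbind(group_a, group_b)
rownames(toy) <- paste0("s_", 1:8)
known <- rep(c("Group1", "Group2"), each = 4)

# Hierarchical clustering of the samples, cut into two clusters.
hc <- hclust(dist(toy, method = "manhattan"), method = "ward.D")
clusters <- cutree(hc, k = 2)

# Cross-tabulate known groups against cluster membership.
agreement <- table(known, clusters)
```

A table with all counts on one diagonal indicates that the clusters recover the known groups; plot(hc) would draw the corresponding dendrogram.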

4.33 Classification

4.33a Explore the normalized data and the removed component of unwanted variation

Follow a similar approach to section 4.32a.

4.33b Classification accuracy of known samples

  svm<-SvmFit(featuredata=uv_ruvrandclust$featuredata, 
              groupdata=UVdata$sampledata$group,
              crossvalid=TRUE,
              k=5,
              rocplot = TRUE)
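
The cross-validation idea behind crossvalid=TRUE and k=5 can be sketched in base R. To stay within base R, this sketch swaps the support vector machine for a simple nearest-centroid classifier; the data, the fold assignment, and the helper predict_centroid are all hypothetical.

```r
# Toy data: two well separated groups, 10 samples each, 2 features.
offsets <- seq(-0.45, 0.45, length.out = 10)
x <- rbind(cbind(1 + offsets, 2 - offsets),
           cbind(4 + offsets, 5 - offsets))
y <- rep(c("Group1", "Group2"), each = 10)

k <- 5
folds <- rep(seq_len(k), length.out = nrow(x))   # deterministic fold labels

# Stand-in classifier (not an SVM): assign each test sample to the nearest
# group centroid computed from the training samples.
predict_centroid <- function(train_x, train_y, test_x) {
  centroids <- rowsum(train_x, train_y) / as.vector(table(train_y))
  apply(test_x, 1, function(row) {
    target <- matrix(row, nrow(centroids), length(row), byrow = TRUE)
    d <- rowSums((centroids - target)^2)
    rownames(centroids)[which.min(d)]
  })
}

# k-fold cross-validation: each fold is held out once and predicted from the rest.
correct <- logical(nrow(x))
for (f in seq_len(k)) {
  test <- folds == f
  pred <- predict_centroid(x[!test, , drop = FALSE], y[!test],
                           x[test, , drop = FALSE])
  correct[test] <- pred == y[test]
}
cv_accuracy <- mean(correct)
```

Comparing this kind of cross-validated accuracy before and after normalisation is one way to judge whether a normalisation method preserves the known group structure.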

4.34 Correlation analysis

4.34a Explore the normalized data and the removed component of unwanted variation similar to above

Follow a similar approach to section 4.32a.

4.34b Explore the distribution of correlation coefficients and the p-values

See Freytag et al. (2015) for a detailed example of this concept.

lcor<-list(Corr(UVdata$featuredata)$results[,3],
  Corr(uv_ruvrandclust$featuredata)$results[,3])
ComparePvalHist(lcor,normmeth = c("unadjusted","ruvrandclust"),
                xlim=c(-1,1), xlab="Correlation coefficients",ylim=c(0,120)) 

lcor_p<-list(Corr(UVdata$featuredata)$results[,4],
  Corr(uv_ruvrandclust$featuredata)$results[,4])
ComparePvalHist(lcor_p,normmeth = c("unadjusted","ruvrandclust"),
                xlim=c(0,1),ylim=c(0,200)) 
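
The pairwise correlation coefficients and p-values summarised above can be reproduced in spirit with base R. The sketch assumes, based on the column indices used in the code above, that the results table holds one row per metabolite pair with the coefficient and its p-value; the toy data and column names are hypothetical.

```r
# Toy log abundances: 6 samples by 3 metabolites (values are made up).
toy <- cbind(m_1 = c(1, 2, 3, 4, 5, 6),
             m_2 = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
             m_3 = c(5, 1, 4, 2, 6, 3))

# Every unordered pair of metabolites, one pair per row.
pairs <- t(combn(colnames(toy), 2))

results <- data.frame(
  met1 = pairs[, 1],
  met2 = pairs[, 2],
  cor = apply(pairs, 1, function(p) cor(toy[, p[1]], toy[, p[2]])),
  p.value = apply(pairs, 1, function(p) cor.test(toy[, p[1]], toy[, p[2]])$p.value)
)
```

A histogram of results$cor or results$p.value then gives the kind of distribution that is compared before and after normalisation.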

References

De Livera, Alysha M, M. Sysi-Aho, Laurent Jacob, J. Gagnon-Bartsch, Sandra Castillo, J.A. Simpson, and Terence P. Speed. 2015. “Statistical methods for handling unwanted variation in metabolomics data.” Analytical Chemistry 87 (7). American Chemical Society: 3606–15. doi:10.1021/ac502439y.

De Livera, Alysha M, Daniel A Dias, David De Souza, Thusitha Rupasinghe, James Pyke, Dedreia Tull, Ute Roessner, Malcolm McConville, and Terence P Speed. 2012. “Normalizing and integrating metabolomics data.” Analytical Chemistry 84 (24): 10768–76. doi:10.1021/ac302748b.

Dunn, Warwick B, David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyre, Nadine Anderson, Marie Brown, et al. 2011. “Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry.” Nature Protocols 6 (7): 1060–83.

Freytag, Saskia, Johann Gagnon-Bartsch, Terence P. Speed, and Melanie Bahlo. 2015. “Systematic noise degrades gene co-expression signals but can be corrected.” BMC Bioinformatics 16 (1). BMC Bioinformatics: 309. doi:10.1186/s12859-015-0745-3.

Gagnon-Bartsch, Johann A, Laurent Jacob, and Terence P. Speed. 2014. Removing unwanted variation from high dimensional data with negative controls. IMS Monographs. Accepted for publication.

Gullberg, Jonas, Pär Jonsson, Anders Nordström, Michael Sjöström, and Thomas Moritz. 2004. “Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry.” Analytical Biochemistry 331 (2): 283–95. doi:10.1016/j.ab.2004.04.037.

Kirwan, JA, and DI Broadhurst. 2013. “Characterising and correcting batch variation in an automated direct infusion mass spectrometry (DIMS) metabolomics workflow.” Analytical and …, 5147–57. doi:10.1007/s00216-013-6856-7.

Kirwan, JA, RJM Weber, DI Broadhurst, and MR Viant. 2014. “Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and quality control.” Scientific Data 1 (June): 1–13. doi:10.1038/sdata.2014.12.

Luan, Hemi. 2017. “statTarget: R package.”

Redestig, Henning, Atsushi Fukushima, Hans Stenlund, Thomas Moritz, Masanori Arita, Kazuki Saito, and Miyako Kusano. 2009. “Compensation for systematic cross-contribution improves normalization of mass spectrometry based metabolomics data.” Analytical Chemistry 81 (19): 7974–80.

Scholz, M, S Gatzek, A Sterling, O Fiehn, and J Selbig. 2004. “Metabolite fingerprinting: detecting biological features by independent component analysis.” Bioinformatics (Oxford, England) 20 (15): 2447–54. doi:10.1093/bioinformatics/bth270.

Sysi-Aho, Marko, Mikko Katajamaa, Yetukuri Laxman, and Matej Oresic. 2007. “Normalization method for metabolomics data using optimal selection of multiple internal standards.” BMC Bioinformatics 8 (January): 93. doi:10.1186/1471-2105-8-93.

Wang, Weixun, Haihong Zhou, Hua Lin, Sushmita Roy, Thomas A Shaler, Lander R Hill, Scott Norton, Praveen Kumar, Markus Anderle, and Christopher H Becker. 2003. “Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards.” Analytical Chemistry 75 (18): 4818–26.