1 Introduction to consensusDE

consensusDE aims to make first-pass differential expression (DE) analysis easy, reporting significance scores from multiple methods. It implements wrappers for voom, DESeq2 and edgeR and reports the differential expression results of each method separately, as well as merging them into a single table for determining consensus. The merged table is ordered by the summed ranks of the p-values from each algorithm.
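The ordering by summed ranks can be illustrated with a minimal base R sketch. The p-values below are made-up toy numbers, the column names are hypothetical, and the way consensusDE itself handles ties or missing values is not shown here; this only illustrates the idea of ranking within each method and ordering by the sum.

```r
# Toy p-values for five genes from three methods (made-up numbers)
p_values <- data.frame(
  ID    = c("gene_a", "gene_b", "gene_c", "gene_d", "gene_e"),
  voom  = c(0.001, 0.200, 0.030, 0.500, 0.010),
  edger = c(0.002, 0.150, 0.020, 0.700, 0.040),
  deseq = c(0.001, 0.300, 0.050, 0.400, 0.020)
)
method_cols <- c("voom", "edger", "deseq")

# Rank the p-values within each method (1 = most significant)
ranks <- sapply(p_values[method_cols], rank)

# Sum the ranks across methods and order the genes by that sum
p_values$rank_sum <- rowSums(ranks)
merged <- p_values[order(p_values$rank_sum), ]
print(merged)
```

Genes that rank consistently high across all three methods end up at the top of the table, even if no single method gives them the smallest p-value.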

Core functionality is simplified into two functions:

  • buildSummarized() generates a summarized experiment that counts reads mapped (from bam files) against a transcriptome
  • multi_de_pairs() performs DE analysis (all possible pairwise comparisons)

Below, we demonstrate the core functionality of consensusDE, as well as how to plot the results using the diag_plots() function.

2 consensusDE examples

Begin by installing and then loading the consensusDE library. To illustrate the functionality of consensusDE, we will utilise data from the airway and annotation libraries, which must also be installed and attached.
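A minimal sketch of this setup step follows. It assumes the three packages named in this document are obtained from Bioconductor through BiocManager. The installation commands are left as comments because they need network access, and the check only attaches the packages when all of them are already installed.

```r
pkgs <- c("consensusDE", "airway", "TxDb.Dmelanogaster.UCSC.dm3.ensGene")

# Check which packages are already installed, without attaching them
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (!all(is_installed)) {
  # One-time installation from Bioconductor (needs network access):
  #   install.packages("BiocManager")
  #   BiocManager::install(pkgs[!is_installed])
  message("Not yet installed: ", paste(pkgs[!is_installed], collapse = ", "))
} else {
  library(consensusDE)
  library(airway)
  library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
  data("airway")  # attach the airway example dataset
}
```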

3 Building a summarized experiment

A summarized experiment is an object format that stores all the relevant information for performing differential expression analysis. buildSummarized() allows users to build a summarized object by providing 1) a table of bam files (more below on format), 2) the directory in which to locate the bam files and 3) a transcript database to map the reads to (either a gtf file or a txdb). We will use bam files attached to this package (from GenomicAlignments) as an example.

The minimum information is now ready to build a summarized experiment as follows:
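The three inputs and the call can be sketched as below. This is a hedged example rather than the definitive usage: the argument names tx_db and force_build are recalled from the package documentation and should be checked against ?buildSummarized, and the group labels are derived on the assumption that the example bam file names contain "treated" or "untreated".

```r
library(consensusDE)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

# Locate the example bam files shipped with GenomicAlignments
file_list <- list.files(system.file("extdata", package = "GenomicAlignments"),
                        recursive = TRUE, pattern = "\\.bam$", full.names = TRUE)

# Sample table: one row per bam file, with columns named "file" and "group"
# (assumption: file names distinguish "treated" from "untreated" samples)
sample_table <- data.frame(
  file  = basename(file_list),
  group = ifelse(grepl("untreated", basename(file_list)), "untreat", "treat")
)

# Directory holding the files listed in sample_table (trailing slash kept)
bam_dir <- paste0(dirname(file_list[1]), "/")

# force_build = TRUE (assumed argument) because each group has one replicate here
summarized_dm3 <- buildSummarized(sample_table = sample_table,
                                  bam_dir = bam_dir,
                                  tx_db = TxDb.Dmelanogaster.UCSC.dm3.ensGene,
                                  read_format = "paired",
                                  force_build = TRUE)
summarized_dm3
```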

This will output a summarized object that has mapped the reads for the bam files listed in sample_table, located in bam_dir, against the transcript database provided: TxDb.Dmelanogaster.UCSC.dm3.ensGene. Whether the bam files are “paired” or “single” end (the type of sequencing used) must be specified using the read_format parameter. A gtf formatted transcript database can be used instead of a txdb by providing the full path to the gtf file using the gtf parameter. To save the summarized experiment externally for future use, specify the path to save it to using output_log. For details of all parameters, see ?buildSummarized.

Overview of the summarized experiment:

3.1 Filtering low count data

buildSummarized() also allows users to filter out low read counts. This can be done when building the summarized experiment, or by re-running buildSummarized() with an existing summarized experiment as input. See “Performing Differential Expression” below for a filter example.

4 Performing Differential Expression

For differential expression (DE) analysis we will use the airway data for demonstration. See ?airway for more details of this experiment. NOTE: the metadata of the summarized experiment must include the columns “group” and “file” to build the correct models. For illustration, we sample 1000 genes from this dataset.

Running multi_de_pairs() will perform DE analysis on all possible pairs of “groups” and save the results as a simple list containing “merged” (the merged results of “deseq”, “voom” and “edger”) as well as the latter three as independent objects. To access the merged results:
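A sketch of the preparation, filtering and DE steps described above. The use of the dex column of the airway metadata as the grouping variable, and the paired argument, are assumptions recalled from the package documentation; check ?multi_de_pairs before relying on them.

```r
library(consensusDE)
library(airway)
data("airway")

# multi_de_pairs() expects "group" and "file" columns in the sample metadata
colData(airway)$group <- colData(airway)$dex
colData(airway)$file  <- rownames(colData(airway))

# Filter low counts by re-running buildSummarized() on the existing object
airway_filter <- buildSummarized(summarized = airway, filter = TRUE)

# Keep a random 1000 genes so that the example runs quickly
set.seed(1234)
airway_filter <- airway_filter[sample(nrow(airway_filter), 1000), ]

# All pairwise comparisons between groups, without RUV correction
all_pairs_airway <- multi_de_pairs(summarized = airway_filter,
                                   paired = "unpaired",
                                   ruv_correct = FALSE)

names(all_pairs_airway$merged)      # comparisons performed
head(all_pairs_airway$merged[[1]])  # merged table for the first comparison
```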

4.1 Annotating DE results

It is often useful to add annotation to the output tables. This can be achieved by providing an annotation database via ensembl_annotate. The database needs to be a Genome Wide Annotation object from Bioconductor, e.g. org.Mm.eg.db for mouse or org.Hs.eg.db for human. For example, to install the database for the mouse annotation, go to http://bioconductor.org/packages/org.Mm.eg.db and follow the instructions. After installing the database package, ensure that the library is loaded using library(org.Mm.eg.db). When running, "'select()' returned 1:many mapping between keys and columns" will appear on the command line. This is the result of a transcript ID mapping to multiple annotations; only the first annotation is reported. See ?multi_de_pairs for additional documentation.

An example of annotating the above filtered airway data is provided below:

4.2 Writing tables to an output directory

multi_de_pairs() provides options to automatically write all results to output directories when a full path is provided. Which results are written depends on which directories are provided: full paths passed to the parameters output_voom, output_edger, output_deseq and output_combined will write the voom, edgeR, DESeq2 and merged results, respectively, to those directories.

5 Removing unwanted variation (RUV)

consensusDE also provides the option to remove batch effects through RUVSeq functionality. consensusDE currently implements RUVr, which fits a first-pass generalised linear model (GLM) using edgeR and obtains residuals that are incorporated into the SummarizedExperiment object for inclusion in the models for DE analysis. The following example uses RUV to identify these residuals. To view the residuals in the model, see the residuals section under the plotting functions below. Note that if ruv_correct = TRUE and a path to a plot_dir is provided, diagnostic plots before and after RUV correction will be produced. The residuals can also be accessed in the summarizedExperiment as below; they are present in the “W_1” column. At present, only one factor of variation is determined.
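A hedged sketch of running the analysis with RUV correction and inspecting the residuals. The data preparation assumes the dex column of the airway data as the grouping variable. The summarized element of the output and its phenoData slot are recalled from the package vignette and are assumptions here; if they differ in your installed version, inspect the returned list with names() and str().

```r
library(consensusDE)
library(airway)
data("airway")

# Prepare and filter the airway data (assumed grouping variable: dex)
colData(airway)$group <- colData(airway)$dex
colData(airway)$file  <- rownames(colData(airway))
airway_filter <- buildSummarized(summarized = airway, filter = TRUE)
set.seed(1234)
airway_filter <- airway_filter[sample(nrow(airway_filter), 1000), ]

# Same analysis as without correction, but with RUV switched on
all_pairs_airway_ruv <- multi_de_pairs(summarized = airway_filter,
                                       paired = "unpaired",
                                       ruv_correct = TRUE)

# Sample metadata, now expected to include the residuals in column "W_1"
# (assumed location of the updated summarized object)
pheno <- all_pairs_airway_ruv$summarized@phenoData@data
pheno
```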

6 Plotting functions

When performing DE analysis, a series of plots (currently 10) can be generated and saved as .pdf files in a plot directory provided to multi_de_pairs() with the parameter: plot_dir = "/path/to/save/pdfs/". See ?multi_de_pairs for a description.

In addition, each of the 10 plots can be produced individually using the diag_plots() function, which provides wrappers for the 10 different plots. See ?diag_plots for a description. Next, we will plot each of these using the example data.

6.1 Mapped reads

Plot the number of reads that mapped to the transcriptome for each sample. The sample numbers on the x-axis correspond to the sample row numbers in the summarizedExperiment, accessible using colData(airway). Samples are coloured by their “group”.
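A possible call for this plot is sketched below. The argument names se_in, name and mapped_reads are recalled from the package documentation and are assumptions; confirm them with ?diag_plots. The data preparation again assumes the dex column as the grouping variable.

```r
library(consensusDE)
library(airway)
data("airway")

# Prepare and filter the airway data (assumed grouping variable: dex)
colData(airway)$group <- colData(airway)$dex
colData(airway)$file  <- rownames(colData(airway))
airway_filter <- buildSummarized(summarized = airway, filter = TRUE)

# Row order of the sample metadata gives the sample numbers on the x-axis
colData(airway_filter)[, c("file", "group")]

# Plot the number of mapped reads per sample, coloured by group
diag_plots(se_in = airway_filter,
           name = "airway example data",
           mapped_reads = TRUE)
```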

6.4 RUV residuals

Residuals for the RUV model can be plotted as follows:

6.8 MA plot

This produces an MA plot given a dataset of the appropriate structure, plotting the log-fold change (M) versus the average expression level (A). To use it independently of multi_de_pairs() and plot only one comparison, construct a list containing one data.frame with columns labelled “ID”, “AveExpr” and “Adj_PVal”. The following illustrates an example using the merged data, which needs to be put into a list and labelled appropriately. Note that this is done automatically by multi_de_pairs().
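The list construction and relabelling can be sketched in base R with a toy table. The comparison name "untrt-trt", the LogFC column and the edger_adj_p column are hypothetical stand-ins for whatever the merged table of your own analysis contains. The final diag_plots() call is left as a comment because its argument names (merged_in, name, ma) are assumptions to check against ?diag_plots.

```r
# Toy stand-in for one merged comparison table (made-up values and columns)
comparison <- data.frame(
  ID          = c("gene_a", "gene_b", "gene_c"),
  AveExpr     = c(8.1, 5.4, 2.3),
  LogFC       = c(1.8, -0.2, 0.9),
  edger_adj_p = c(0.001, 0.650, 0.040)
)

# 1. A single data.frame must be wrapped in a named list
comparison_list <- list("untrt-trt" = comparison)

# 2. Relabel the chosen adjusted p-value column as "Adj_PVal"
colnames(comparison_list[["untrt-trt"]]) <-
  gsub("edger_adj_p", "Adj_PVal", colnames(comparison_list[["untrt-trt"]]))

colnames(comparison_list[["untrt-trt"]])

# 3. Plot (assumed argument names):
# diag_plots(merged_in = comparison_list, name = "untrt-trt", ma = TRUE)
```

Choosing which method's adjusted p-value to relabel determines which significance values the plot highlights.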

6.9 Volcano

This produces a volcano plot, which compares the log-fold change against the -log transformed significance score. As described in the MA plot section above, to use it independently of multi_de_pairs() and plot only one comparison, construct a list containing one data.frame with columns labelled “ID”, “AveExpr” and “Adj_PVal”.

6.10 P-value distribution

This plots the distribution of p-values for diagnostic analyses. As described in the MA plot section above, to use it independently of multi_de_pairs() and plot only one comparison, construct a list containing one data.frame with columns labelled “ID”, “AveExpr” and “Adj_PVal”.

6.10.1 General notes about plotting

The legend and labels can be turned off using legend = FALSE and label = FALSE for diag_plots(). See ?diag_plots for more details of these parameters.

7 Accessing additional data for each comparison

When performing DE analysis, data is stored in a simple list object that can be accessed. Below are the levels of data available from the output of a DE analysis. We use the all_pairs_airway results from the above analysis to demonstrate how to locate these tables.

  • all_pairs_airway$merged
    • list of the comparisons performed

In addition to the list with the combined results of DESeq2, voom and edgeR, the full results can be accessed for each method, as well as fit tables and the contrasts performed.

  • all_pairs_airway$deseq (list of the DESeq2 results)
  • all_pairs_airway$voom (list of the voom results)
  • all_pairs_airway$edger (list of the edgeR results)

Within each list the following data is accessible. Each object is a list of all the comparisons performed.

  • all_pairs_airway$deseq$short_results
    • Formatted results. To access the first table, for example, use all_pairs_airway$deseq$short_results[[1]]
  • all_pairs_airway$deseq$full_results
    • Full results that are normally output by a pairwise comparison
  • all_pairs_airway$deseq$fitted
    • Fit table to access coefficients etc.
  • all_pairs_airway$deseq$contrasts
    • Contrasts performed

8 Citing results that use consensusDE

When using this package, please cite consensusDE as follows and all methods used in your analysis.

For consensus DE:

  • When using RUVSeq (also check package reference suggestions)
    • Risso D, Ngai J, Speed TP and Dudoit S (2014). Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology 32(9), 896-902
  • When using DESeq2 (also check package reference suggestions)
    • Love MI, Huber W and Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12), 550
  • When using edgeR (also check package reference suggestions)
    • Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
    • McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297
  • When using limma/voom (also check package reference suggestions)
    • Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.