1 Introduction

RNA-Seq is a revolutionary approach for investigating and discovering the transcriptome using next-generation sequencing technologies (Wang et al.). Typically, transcriptome analysis aims to identify genes that are differentially expressed among conditions or tissues, which helps reveal the important pathways associated with those conditions (Wang et al.).

RNASeqR is a user-friendly R-based tool for running an RNA-Seq analysis pipeline, including quality assessment, read alignment and quantification, differential expression analysis, and functional analysis. The main features of this package are an automated workflow, comprehensive reports with data visualization, and an extendable file structure. With this package, the new tuxedo pipeline published in Nature Protocols in 2016 can be fully implemented in the R environment, with extra functions such as read quality assessment and functional analysis.

The main tools used in this package are: ‘HISAT2’ for read alignment (Kim et al. 2015); ‘StringTie’ for alignment assembly and transcript quantification (Pertea et al. 2015); ‘Rsamtools’ for converting SAM files to BAM files (Morgan et al. 2018); ‘Gffcompare’ for comparing the merged GTF file with the reference GTF file; the ‘systemPipeR’ package for quality assessment (Backman et al. 2016); the ‘ballgown’ (Fu et al. 2018), ‘DESeq2’ (Love et al. 2014) and ‘edgeR’ (Robinson et al. 2010; McCarthy et al. 2012) packages for finding potential differentially expressed genes; and the ‘clusterProfiler’ package (Yu et al. 2012) for Gene Ontology (GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis.

The central concept behind this package is that each step of RNA-Seq data analysis is a function call in R. First, users create an RNASeqRParam S4 object by running the RNASeqRParam() constructor function, which checks all input variables. The resulting RNASeqRParam object is then used as the input to the following analysis functions.

  1. RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet(): to set up the RNA-Seq environment.

  2. RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment(): (Optional) to run quality assessment step.

  3. RNASeqReadProcess_CMD() or RNASeqReadProcess(): to run read alignment and quantification.

  4. RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis(): to run differential analysis via different R packages.

  5. RNASeqGoKegg_CMD() or RNASeqGoKegg(): to conduct GO and KEGG analysis.

Functions with the CMD suffix create an R script and run nohup R CMD BATCH script.R in the background, while functions without the CMD suffix run in the R shell. Files generated in each step are kept in an organized directory structure. Once the workflow is completed, a comprehensive RNA-Seq analysis is done. This package is mainly designed for a two-group comparison setting, i.e. the differential expression profile between two conditions, and it makes such analyses more efficient and easier for users.

2 Sample Definition

Sample data used in this vignette can be downloaded from the RNASeqRData experiment package. The data originate from NCBI’s Sequence Read Archive entries SRR3396381, SRR3396382, SRR3396384, SRR3396385, SRR3396386, and SRR3396387. These samples are from Saccharomyces cerevisiae. A suitable reference genome and gene annotation files for this species can be downloaded from iGenomes (Ensembl, R64-1-1). To create mini data for demonstration purposes, reads aligned to the region from 0 to 100000 on chromosome XV were extracted. The analysis results for this mini data are shown in this vignette.

For RNA-Seq analysis results of real case-control data produced with this package, please see this website (https://github.com/HowardChao/RNASeqR_analysis_result).

3 System Requirements

Necessary:

  1. R version >= 3.5.0

  2. Operating System: ‘Linux’ and ‘macOS’ are supported by the RNASeqR package. ‘Windows’ is not supported, because ‘StringTie’ and ‘HISAT2’ are not available for ‘Windows’.

  3. Third-party software used in this package includes ‘HISAT2’, ‘StringTie’ and ‘Gffcompare’. The availability of these commands is checked with R system2() through the R shell at the end of the ‘Environment Setup’ step. The environment must be built successfully before running the subsequent RNA-Seq analysis. By default, binaries are installed based on the operating system of the workstation, so no additional compiling is needed. Alternatively, users can choose to skip the installation of certain software binaries. For more details, please refer to the ‘Environment Setup’ chapter.

Recommended:

  1. Python: Python2 or Python3.

  2. 2to3: If the Python version on the workstation is 3, this command will be used. Generally, 2to3 is available if Python3 is available.
    • Python and 2to3 are used for creating raw reads count for DESeq2 and edgeR.

    • Raw reads count will be created from ‘StringTie’ output under either of the following two conditions:
      1. Python2
      2. Python3 with 2to3 command available.
    • If one of these conditions is met, raw reads count will be created and DESeq2 and edgeR will be run automatically by default in the ‘Gene-level Differential Analyses’ step. If not, DESeq2 and edgeR will be skipped during the ‘Gene-level Differential Analyses’ step. Checking the Python version and the 2to3 command on the workstation beforehand is highly recommended but not necessary.

  3. ‘HISAT2’ indexes: Users are advised to provide an ‘indices/’ directory in ‘input_files/’. ‘HISAT2’ requires at least 160 GB of RAM and several hours to index the entire human genome.

4 Installation

5 “RNASeqRParam” S4 Object Creation

This is the first step of the RNA-Seq workflow in this package. Prior to conducting RNA-Seq analysis, it is necessary to run the constructor function RNASeqRParam() to create an RNASeqRParam S4 object. This object stores parameters that are pre-checked by the constructor and then used as input parameters in the following analyses.

5.1 RNASeqRParam Slots Explanation

There are 11 slots in RNASeqRParam:

  1. os.type : The operating system type. Value is linux or osx. This package only supports ‘Linux’ and ‘macOS’ (not ‘Windows’). If another operating system is detected, an ERROR will be reported.

  2. python.variable : Python-related variable. Value is a list containing whether Python is available (TRUE or FALSE) and the Python version (2 or 3).

  3. python.2to3 : Availability of 2to3 command. Value is TRUE or FALSE.

  4. path.prefix : Path prefix of the ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results/’, ‘Rscript/’ and ‘Rscript_out/’ directories. It is recommended to create a new, empty directory; all the following RNA-Seq results will be stored in it.

  5. input.path.prefix : Path prefix of the ‘input_files/’ directory. Users have to prepare an ‘input_files/’ directory that follows these rules:
    • genome.name.fa: reference genome in FASTA file format.

    • genome.name.gtf: gene annotation in GTF file format.

    • raw_fastq.gz/: directory storing FASTQ files.

      • Only paired-end read files are supported.

      • Names of paired-end FASTQ files : ‘sample.pattern_1.fastq.gz’ and ‘sample.pattern_2.fastq.gz’. sample.pattern must be distinct for each sample.

    • phenodata.csv: information about the RNA-Seq experiment design.

      • First column : Distinct ids for each sample. The value for each sample in this column must match sample.pattern of the FASTQ files in ‘raw_fastq.gz/’. The column name must be ids.

      • Second column : independent variable for the RNA-Seq experiment. The value for each sample in this column can only be the value of parameter case.group or control.group. The column name is the value of parameter independent.variable.

    • indices/ : directory storing HT2 index files for the HISAT2 alignment tool.

      • This directory is optional. HT2 index files corresponding to the target reference genome can be downloaded from the HISAT2 official website. Providing HT2 files can accelerate the subsequent steps, so it is highly advised.

      • If HT2 index files are not provided, the ‘input_files/indices/’ directory should be deleted.

  6. genome.name : Variable of genome name defined in this RNA-Seq workflow (ex. ‘genome.name.fa’, ‘genome.name.gtf’)

  7. sample.pattern : Regular expression of paired-end fastq.gz files under ‘input_files/raw_fastq.gz/’. Important: the expression must not include the trailing _[1,2].fastq.gz.

  8. independent.variable: Independent variable for the biological experiment design of two-group RNA-Seq workflow.

  9. case.group : Group name of the case group.

  10. control.group : Group name of the control group.

  11. indices.optional : Logical value indicating whether ‘input_files/indices/’ exists. Value is TRUE or FALSE.
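The ‘input_files/’ naming rules can be checked by hand before the constructor is called. The following is a minimal standalone sketch in Python and is not part of RNASeqR: the function name check_inputs and its messages are hypothetical, and the assignment of the two example samples to the case and control groups is made up for illustration. It pairs FASTQ names using sample.pattern and compares them with the ids column of phenodata.csv.

```python
import csv
import io
import re

def check_inputs(fastq_names, phenodata_text, sample_pattern,
                 independent_variable, case_group, control_group):
    """Return a list of problems; an empty list means the inputs agree."""
    problems = []
    # Files must be named '<sample>_1.fastq.gz' and '<sample>_2.fastq.gz'.
    name_re = re.compile(r"^(" + sample_pattern + r")_([12])\.fastq\.gz$")
    mates = {}
    for name in fastq_names:
        m = name_re.match(name)
        if m is None:
            problems.append("unexpected FASTQ name: " + name)
            continue
        mates.setdefault(m.group(1), set()).add(m.group(2))
    for sample, found in sorted(mates.items()):
        if found != {"1", "2"}:
            problems.append("missing mate for sample: " + sample)
    rows = list(csv.DictReader(io.StringIO(phenodata_text)))
    if not rows or "ids" not in rows[0] or independent_variable not in rows[0]:
        problems.append("phenodata.csv needs columns 'ids' and '%s'"
                        % independent_variable)
        return problems
    # Every id must correspond to one FASTQ sample name, and vice versa.
    if sorted(r["ids"] for r in rows) != sorted(mates):
        problems.append("ids in phenodata.csv do not match FASTQ sample names")
    for r in rows:
        if r[independent_variable] not in (case_group, control_group):
            problems.append("unknown group for %s: %s"
                            % (r["ids"], r[independent_variable]))
    return problems

fastq = ["SRR3396381_XV_1.fastq.gz", "SRR3396381_XV_2.fastq.gz",
         "SRR3396382_XV_1.fastq.gz", "SRR3396382_XV_2.fastq.gz"]
# Group assignment below is illustrative only.
pheno = ("ids,state\n"
         "SRR3396381_XV,60mins_ID20_amphotericin_B\n"
         "SRR3396382_XV,60mins_ID20_control\n")
problems = check_inputs(fastq, pheno, "SRR[0-9]*_XV", "state",
                        "60mins_ID20_amphotericin_B", "60mins_ID20_control")
print(problems)  # prints []
```

Returning a list of messages rather than stopping at the first problem lets all naming issues be fixed in one pass.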

5.2 RNASeqRParam Constructor Checking

  1. Create a new directory for RNA-Seq analysis. It is highly recommended to create a new, empty directory. The parameter path.prefix of the RNASeqRParam() constructor should be the absolute path of this new directory. All the RNA-Seq related files generated in the following steps will be stored inside this directory.

  2. Create a valid ‘input_files/’ directory. You should create a directory named ‘input_files/’ with the necessary files inside. It should follow the rules mentioned above.

  3. Run the constructor of the RNASeqRParam S4 object. This constructor checks the validity of the input parameters before creating the S4 object. The following are checked:
    • Operating system

    • Python version

    • 2to3 command

    • Structure, contents and rules of ‘input_files/’

    • Validity of input parameters

5.3 Example

##  [1] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.fa" 
##  [2] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.gtf"
##  [3] "input_files/phenodata.csv"                          
##  [4] "input_files/raw_fastq.gz/SRR3396381_XV_1.fastq.gz"  
##  [5] "input_files/raw_fastq.gz/SRR3396381_XV_2.fastq.gz"  
##  [6] "input_files/raw_fastq.gz/SRR3396382_XV_1.fastq.gz"  
##  [7] "input_files/raw_fastq.gz/SRR3396382_XV_2.fastq.gz"  
##  [8] "input_files/raw_fastq.gz/SRR3396384_XV_1.fastq.gz"  
##  [9] "input_files/raw_fastq.gz/SRR3396384_XV_2.fastq.gz"  
## [10] "input_files/raw_fastq.gz/SRR3396385_XV_1.fastq.gz"  
## [11] "input_files/raw_fastq.gz/SRR3396385_XV_2.fastq.gz"  
## [12] "input_files/raw_fastq.gz/SRR3396386_XV_1.fastq.gz"  
## [13] "input_files/raw_fastq.gz/SRR3396386_XV_2.fastq.gz"  
## [14] "input_files/raw_fastq.gz/SRR3396387_XV_1.fastq.gz"  
## [15] "input_files/raw_fastq.gz/SRR3396387_XV_2.fastq.gz"

Check the files in the ‘input_files/’ directory.

## RNASeqRParam S4 object
##               os.type : linux 
##       python.variable : (Availability: TRUE , Version: 2 )
##           python.2to3 : TRUE 
##           path.prefix : /tmp/RtmpeDMmjV/ 
##     input.path.prefix : /home/biocbuild/bbs-3.8-bioc/R/library/RNASeqRData/extdata/ 
##           genome.name : Saccharomyces_cerevisiae_XV_Ensembl 
##        sample.pattern : SRR[0-9]*_XV 
##  independent.variable : state 
##            case.group : 60mins_ID20_amphotericin_B 
##         control.group : 60mins_ID20_control 
##      indices.optional : FALSE 
##  independent.variable : state

In this example, the RNASeqRParam S4 object is stored in exp for the subsequent RNA-Seq analysis steps. Any ERROR that occurs during the checking steps will terminate the program.

6 Environment Setup

This is the second step of the RNA-Seq workflow in this package. To set up the environment, run RNASeqEnvironmentSet_CMD() to execute the process in the background or RNASeqEnvironmentSet() to execute the process in the R shell.

6.1 Files Setup

  1. Create Base Directories. ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results/’, ‘Rscript/’ and ‘Rscript_out/’ will be created under the path.prefix directory. These five main directories are used as follows:
    • ‘gene_data/’: Symbolic links to ‘input_files/’ and files created in each step of the RNA-Seq analysis will be stored in this directory.

    • ‘RNASeq_bin/’: The binaries of the necessary tools, HISAT2, SAMtools, StringTie and Gffcompare, are installed in this directory.

    • ‘RNASeq_results/’: The RNA-Seq results, for example alignment results, quality assessment results and differential analysis results, will be stored in this directory.

    • ‘Rscript/’: If you run an XXX_CMD() function, the corresponding R script (XXX.R) for that step will be created in this directory.

    • ‘Rscript_out/’: The corresponding output report for each R script (XXX.Rout) will be stored in this directory.

  2. Symbolic links will be created from files in ‘input_files/’ to the path.prefix directory.
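The file setup described above can be pictured with a short standalone sketch in Python. It is not RNASeqR's implementation: the function name set_up_files is hypothetical, and placing the links directly inside ‘gene_data/’ (top-level entries only) is an assumption based on the directory descriptions.

```python
import os
import tempfile

BASE_DIRS = ["gene_data", "RNASeq_bin", "RNASeq_results",
             "Rscript", "Rscript_out"]

def set_up_files(path_prefix, input_files_dir):
    """Create the five base directories, then link each input entry."""
    for name in BASE_DIRS:
        os.makedirs(os.path.join(path_prefix, name), exist_ok=True)
    gene_data = os.path.join(path_prefix, "gene_data")
    links = []
    for name in sorted(os.listdir(input_files_dir)):
        source = os.path.abspath(os.path.join(input_files_dir, name))
        link = os.path.join(gene_data, name)
        if not os.path.lexists(link):  # never overwrite an existing entry
            os.symlink(source, link)
        links.append(link)
    return links

# Demonstration in throwaway directories.
inputs = tempfile.mkdtemp()
prefix = tempfile.mkdtemp()
open(os.path.join(inputs, "phenodata.csv"), "w").close()
links = set_up_files(prefix, inputs)
```

Linking instead of copying keeps large FASTQ files in one place while still giving every later step a stable path under path.prefix.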

6.2 Necessary Tools Installation

The operating system of your workstation will be detected. If the operating system is neither Linux nor macOS, an ERROR will be reported. Users can decide whether the essential programs (HISAT2, StringTie and Gffcompare) are installed automatically.

Third-party software used in this package includes ‘HISAT2’, ‘StringTie’ and ‘Gffcompare’. Binaries are available for all three programs, and by default they are installed automatically based on the operating system of the workstation. Zipped binaries are unpacked and exported to the R environment PATH. No compilation is needed.

Specifically, both RNASeqEnvironmentSet_CMD() and RNASeqEnvironmentSet() have three parameters (install.hisat2, install.stringtie and install.gffcompare) that determine which software is installed automatically and which is skipped. These parameters default to TRUE, so all three programs are installed directly. Users can skip the installation of a program by setting the corresponding value to FALSE. Please make sure that any skipped program is available through system2() in the R shell. If any program is unavailable, the ‘Environment Setup’ step will fail.

Here is the version information of each software binary.

  • HISAT2
    • Based on your operating system, the hisat2-2.1.0-Linux_x86_64.zip or hisat2-2.1.0-OSX_x86_64.zip zipped file will be installed.
    • The installed file will be unzipped and all binary files will be copied under ‘RNASeq_bin/’.
  • StringTie
    • Based on your operating system, the stringtie-1.3.4d.Linux_x86_64.tar.gz or stringtie-1.3.4d.OSX_x86_64.tar.gz zipped file will be installed.
    • The installed file will be unzipped and all binary files will be copied under ‘RNASeq_bin/’.
  • Gffcompare
    • Based on your operating system, the gffcompare-0.10.4.Linux_x86_64.tar.gz or gffcompare-0.10.4.OSX_x86_64.tar.gz zipped file will be installed.
    • The installed file will be unzipped and all binary files will be copied under ‘RNASeq_bin/’.

6.3 Export Path

‘RNASeq_bin/’ will be added to the R environment PATH so that these binaries can be found from the R shell through system2(). In the last step of environment setting, the commands hisat2 --version, stringtie --version, gffcompare --version and samtools --version are checked to make sure the environment is correctly constructed. The environment must be set up successfully before the following analyses.
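The idea of exporting a directory to PATH and then confirming that the tools can be found is sketched below in standalone Python. This is an illustration only, not RNASeqR's code: the function names are hypothetical, the check only looks the commands up on PATH rather than running them with --version, and the demonstration uses a stand-in file instead of a real binary.

```python
import os
import shutil
import tempfile

def export_bin_dir(bin_dir):
    """Put bin_dir at the front of PATH for this process, once."""
    current = os.environ.get("PATH", "")
    parts = current.split(os.pathsep) if current else []
    if bin_dir not in parts:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + parts)

def missing_tools(tools=("hisat2", "stringtie", "gffcompare", "samtools")):
    """Return the commands that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Demonstration with a stand-in executable.
bin_dir = tempfile.mkdtemp()
stand_in = os.path.join(bin_dir, "hisat2")
with open(stand_in, "w") as handle:
    handle.write("#!/bin/sh\nexit 0\n")
os.chmod(stand_in, 0o755)
export_bin_dir(bin_dir)
still_missing = missing_tools(("hisat2", "no_such_tool_for_demo"))
```

An empty result from missing_tools() corresponds to the state in which the later steps may be run.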

6.4 Example

Run RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet().

  1. Run in background: the result will be reported in ‘Rscript_out/Environment_Set.Rout’. Make sure the environment is successfully set up before running the subsequent steps.
  2. Run in R shell: the result will be reported in the R shell. Make sure the environment is successfully set up before running the subsequent steps.

7 Quality Assessment of FASTQ sequence data

This is the third step of the RNA-Seq workflow in this package. Unlike the other, necessary steps, it is optional and can be run several times, with each result stored separately. Although this step can be skipped, it is strongly recommended before the alignment step. To evaluate the quality of raw reads in FASTQ files, run RNASeqQualityAssessment_CMD() to execute the process in the background or RNASeqQualityAssessment() to execute the process in the R shell.

7.1 “systemPipeR” Quality Assessment

In this step, the systemPipeR package is used to evaluate sequencing reads. The details are as follows:

  1. Check the number of times the user has run the quality assessment process and create the corresponding directory ‘RNASeq_results/QA_results/QA_{times}’.

  2. Set up the RNA-Seq environment. The ‘rnaseq/’ directory will be created by the systemPipeR package.

  3. Create the ‘data.list.txt’ file.

  4. Read the FASTQ files and create ‘fastqReport.pdf’ as the quality assessment report.

  5. Remove the ‘rnaseq/’ directory.

The quality assessment result is generated by the systemPipeR package and stored as a PDF.
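Because each quality assessment run is stored separately under ‘QA_{times}’, the first step amounts to counting earlier runs. A minimal standalone sketch in Python follows; the function name next_qa_dir is hypothetical, and starting the numbering at 1 is an assumption not stated in the text.

```python
import os
import re
import tempfile

def next_qa_dir(qa_root):
    """Create and return the directory for the next quality assessment run."""
    os.makedirs(qa_root, exist_ok=True)
    runs = []
    for name in os.listdir(qa_root):
        match = re.fullmatch(r"QA_(\d+)", name)
        if match:
            runs.append(int(match.group(1)))
    # Assumed convention: the first run is QA_1, then QA_2, and so on.
    path = os.path.join(qa_root, "QA_%d" % (max(runs, default=0) + 1))
    os.makedirs(path)
    return path

# Demonstration in a throwaway directory.
qa_root = os.path.join(tempfile.mkdtemp(), "RNASeq_results", "QA_results")
first = next_qa_dir(qa_root)
second = next_qa_dir(qa_root)
```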

7.2 Example

Run RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment().

  1. Run in background: the result will be reported in ‘Rscript_out/Quality_Assessment.Rout’. Make sure quality assessment is successfully done before running the subsequent steps.
  2. Run in R shell: the result will be reported in the R shell. Make sure quality assessment is successfully done before running the subsequent steps.
## [1] "Generated rnaseq directory. Next run in rnaseq directory, the R code from *.Rmd (*.Rnw) template interactively. Alternatively, workflows can be exectued with a single command as instructed in the vignette."

8 Reads Alignment and Quantification

This is the fourth step of the RNA-Seq workflow in this package. To process raw reads FASTQ files, users can either run RNASeqReadProcess_CMD() to execute the process in the background or run RNASeqReadProcess() to execute the process in the R shell. For further details about the commands and parameters executed during each step, please check the reported ‘RNASeq_results/COMMAND.txt’.

8.1 The HISAT2 Indexer

In the preparation step (the RNASeqRParam creation step), the ‘indices/’ directory is checked for existing HT2 index files. If none exist, the following commands will be executed:

  • Input: ‘genome.name.gtf’, ‘genome.name.fa’

  • Output: ‘genome.name.ss’, ‘genome.name.exon’, ‘genome.name_tran.{number}.ht2’

  1. extract_splice_sites.py, extract_exons.py execution
    • These two commands are executed to extract splice-site and exon information from the gene annotation file.
  2. hisat2-build index creation
    • This command builds the HISAT2 index files using the genome.name.ss and genome.name.exon files created in the previous step. Be aware that the index building step requires more memory and time than the other steps, and it might not be possible to run it on some personal workstations. It is highly recommended to check beforehand whether HT2 index files for your target reference genome are available at the HISAT2 official website. Downloading pre-built HT2 index files will greatly shorten the analysis time.
  3. Write ‘RNASeq_results/COMMAND.txt’
    • The shell commands run in this step will be documented in ‘RNASeq_results/COMMAND.txt’.

8.2 The HISAT2 Aligner

  • Input: ‘genome.name_tran.{number}.ht2’, ‘sample.pattern.fastq.gz’

  • Output: ‘sample.pattern.sam’

  1. The hisat2 command is executed on paired-end FASTQ files. SAM files will be created.
    • SAM files are stored in ‘gene_data/raw_sam/’.
  2. Write ‘RNASeq_results/COMMAND.txt’
    • The shell commands run in this step will be documented in ‘RNASeq_results/COMMAND.txt’.
  3. A summary data frame of aligned read counts and alignment rates is created in tabular (CSV) and picture (PNG) formats and kept in the directory ‘RNASeq_results/Alignment_Report’.

8.3 The Rsamtools SAM to BAM Converter

In this step, users can choose whether to use ‘Rsamtools’ (R package) or ‘SAMtools’ (command-line tool) for file conversion by setting the SAMtools.or.Rsamtools parameter to Rsamtools or SAMtools. By default, Rsamtools is used. However, if the RNA-Seq data are too large, ‘Rsamtools’ might not be able to finish this process due to the Rtmp file issue, in which case ‘SAMtools’ is recommended. Users have to make sure the ‘samtools’ command is available on the workstation beforehand, or an ERROR will be reported.

  • Input: ‘sample.pattern.sam’

  • Output: ‘sample.pattern.bam’

  1. The Rsamtools package provides an interface to samtools in the R environment. In this step, SAM files from HISAT2 are converted to BAM files by running the asBam() function.
    • Output BAM files are stored in ‘gene_data/raw_bam/’.
  2. Write ‘RNASeq_results/COMMAND.txt’

8.4 The StringTie Assembler

  • Input: ‘genome.name.gtf’, ‘sample.pattern.bam’

  • Output: ‘sample.pattern.gtf’

  1. The stringtie command is executed.
    • StringTie assembles the RNA-Seq alignments into potential transcripts.
    • The assembled GTF files, one per sample, are stored in ‘gene_data/raw_gtf/’.
  2. Write ‘RNASeq_results/COMMAND.txt’
    • The shell commands run in this step will be documented in ‘RNASeq_results/COMMAND.txt’.

8.5 The StringTie GTF Merger

  • Input: ‘sample.pattern.gtf’

  • Output: ‘stringtie_merged.gtf’, ‘mergelist.txt’

  1. Create mergelist.txt
    • ‘gene_data/merged/mergelist.txt’ is created for the merging step.
  2. The stringtie command is executed.
    • The transcript merger merges each sample.pattern.gtf into stringtie_merged.gtf.
    • Output files are all stored in ‘gene_data/merged/’.
  3. Write ‘RNASeq_results/COMMAND.txt’
    • The shell commands run in this step will be documented in ‘RNASeq_results/COMMAND.txt’.
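The merge step can be pictured as writing one GTF path per line into mergelist.txt and handing that file to StringTie. The standalone Python sketch below is illustrative only: the function names are hypothetical, and the command line follows the general form of a StringTie merge call (--merge, -G, -o); the exact options RNASeqR passes are the ones recorded in ‘RNASeq_results/COMMAND.txt’. The command is built but not run.

```python
import os
import tempfile

def write_mergelist(raw_gtf_dir, merged_dir):
    """List every per-sample GTF file, one path per line, in mergelist.txt."""
    os.makedirs(merged_dir, exist_ok=True)
    gtf_paths = sorted(os.path.join(raw_gtf_dir, name)
                       for name in os.listdir(raw_gtf_dir)
                       if name.endswith(".gtf"))
    mergelist = os.path.join(merged_dir, "mergelist.txt")
    with open(mergelist, "w") as handle:
        for path in gtf_paths:
            handle.write(path + "\n")
    return mergelist

def merge_command(reference_gtf, merged_dir):
    """Build (but do not run) a StringTie merge command line."""
    return ["stringtie", "--merge", "-G", reference_gtf,
            "-o", os.path.join(merged_dir, "stringtie_merged.gtf"),
            os.path.join(merged_dir, "mergelist.txt")]

# Demonstration with empty placeholder files.
gene_data = tempfile.mkdtemp()
raw_gtf = os.path.join(gene_data, "raw_gtf")
os.makedirs(raw_gtf)
for sample in ("SRR3396381_XV", "SRR3396382_XV"):
    open(os.path.join(raw_gtf, sample + ".gtf"), "w").close()
merged = os.path.join(gene_data, "merged")
mergelist = write_mergelist(raw_gtf, merged)
command = merge_command("genome.name.gtf", merged)
```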

8.6 The Gffcompare Comparer

  • Input: ‘genome.name.gtf’, ‘stringtie_merged.gtf’

  • Output: ‘merged.annotated.gtf’, ‘merged.loci’, ‘merged.stats’, ‘merged.stringtie_merged.gtf.refmap’, ‘merged.stringtie_merged.gtf.tmap’, ‘merged.tracking’

  1. gffcompare command is executed.
    • The comparison result of merged GTF file and reference annotation file is reported under ‘merged/’ directory.
  2. Write ‘RNASeq_results/COMMAND.txt’
    • Shell commands that run in this step will be documented into ‘RNASeq_results/COMMAND.txt’.

8.7 The Ballgown Input Creator

  • Input: ‘stringtie_merged.gtf’

  • Output: ‘ballgown/’, ‘gene_abundance/’

  1. stringtie command is executed.
    • StringTie will create the input directory for ballgown package for the following differential analysis. Ballgown-related files will be stored by each sample name in ‘gene_data/ballgown/’.
    • StringTie will store gene-related information in TSV file by each sample name in ‘gene_data/gene_abundance/’.
  2. Write ‘RNASeq_results/COMMAND.txt’
    • Shell commands that run in this step will be documented into ‘RNASeq_results/COMMAND.txt’.

8.8 The Reads Count Table Creator

Whether this step is executed depends on the availability of Python on your workstation.

  • Input: ‘samplelst.txt’

  • Output: ‘gene_count_matrix.csv’, ‘transcript_count_matrix.csv’

  1. The reads count table converter Python script is downloaded as prepDE.py.

  2. Python checking
    • When Python is not available, this step is skipped.

    • When Python2 is available, prepDE.py is executed.

    • When Python3 is available, the 2to3 command will be checked. (Usually, if Python3 is installed, the 2to3 command is installed too.)

    • When Python3 is available but the 2to3 command is unavailable, the raw reads count generation step will be skipped.

    • When both Python3 and the 2to3 command are available, prepDE.py is converted by 2to3 into a file that can be executed by Python3, and is then executed.

  3. Raw reads count creation
    • Raw reads count is created and the results are stored in ‘gene_data/reads_count_matrix/’.
  4. Write ‘RNASeq_results/COMMAND.txt’
    • The shell commands run in this step will be documented in ‘RNASeq_results/COMMAND.txt’.
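The Python checking rules in step 2 reduce to a small decision table, sketched here as a standalone Python function. The function name and the returned labels are hypothetical; only the branching follows the rules listed above.

```python
def raw_count_action(python_available, python_version, has_2to3):
    """Decide how the raw reads count step proceeds.

    Returns one of three illustrative labels:
    'skip', 'run prepDE.py', or 'convert with 2to3, then run'.
    """
    if not python_available:
        return "skip"
    if python_version == 2:
        return "run prepDE.py"
    if python_version == 3 and has_2to3:
        # The script is converted with 2to3 before it is executed.
        return "convert with 2to3, then run"
    # Python3 without 2to3: the raw reads count is not generated.
    return "skip"

for case in [(False, None, False), (True, 2, False),
             (True, 3, False), (True, 3, True)]:
    print(case, "->", raw_count_action(*case))
```

Whenever the result is 'skip', the DESeq2 and edgeR analyses have no count matrix to work from, which is why they are skipped in the differential analysis step.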

8.9 Example

Run RNASeqReadProcess_CMD() or RNASeqReadProcess().

  1. Run in background: the result will be reported in ‘Rscript_out/Raw_Read_Process.Rout’. Make sure raw read processing is successfully done before running the subsequent steps.
  2. Run in R shell: the result will be reported in the R shell. Make sure raw read processing is successfully done before running the subsequent steps.

9 Gene-level Differential Analyses

This is the fifth step of the RNA-Seq workflow in this package. To identify differentially expressed genes, users can either run RNASeqDifferentialAnalysis_CMD() to execute the process in the background or run RNASeqDifferentialAnalysis() to execute the process in the R shell. This package provides three normalized expression values: Fragments Per Kilobase per Million (FPKM) (Mortazavi et al. 2008), and counts normalized by Median of Ratios Normalization (MRN) or by Trimmed Mean of M-values (TMM). Appropriate statistical analyses are applied using the R packages ‘ballgown’, ‘stats’, ‘DESeq2’ and ‘edgeR’. Gene IDs from StringTie and the Ballgown R package are mapped to ‘gene_name’ in the GTF file for further functional analysis.

9.1 General Data Visualization

Here we illustrate general data visualization before and after differential expression analysis. The results of each differential analysis tool (ballgown, DESeq2, edgeR) are kept separately in the directory ‘RNASeq_results/’. The plots described below visualize the statistical results for the toy data in the RNASeqRData package, based on MRN-normalized values from DESeq2. For real data analysis results, please go to this website: https://howardchao.github.io/RNASeqR_analysis_result/.

9.1.1 Before Differential Expression Analysis

9.1.1.1 Frequency Plots

These plots visualize the frequency of expression values per sample using the ggplot2 R package. The x-axis represents the range of MRN-normalized counts or log2(MRN+1) values, and the y-axis represents the corresponding frequency.

  • ‘Frequency_Plot_normalized_count_ggplot2.png’

  • ‘Frequency_Plot_log_normalized_count_ggplot2.png’

9.1.1.2 Distribution Plots

These plots display the distribution of normalized expression values (e.g. log2(MRN+1) values) as a boxplot and a violin plot using the ggplot2 R package. Samples are colored by defined groups: blue for the case group and yellow for the control group.

  • ‘Box_Plot_ggplot2.png’

  • ‘Violin_Plot_ggplot2.png’

9.1.1.3 PCA Plots

These plots display how the biological samples compare in overall similarities and differences using principal component analysis (PCA). The principal component scores of the top five dimensions are calculated using the FactoMineR package, and the results are extracted and visualized using the factoextra package or the ggplot2 package.

  • ‘Dimension_PCA_Plot_factoextra.png’

  • ‘PCA_Plot_factoextra.png’: Samples are colored by defined groups: blue for the case group and yellow for the control group. Each small point represents a sample, while each big point represents a comparison group. Ellipses can be added to group samples.

  • ‘PCA_Plot_ggplot2.png’: Samples are colored by defined groups: blue for case group and yellow for control group.

9.1.1.4 Correlation Plots

These plots display the Pearson correlation coefficients from a pairwise correlation analysis of gene expression across all samples. The coefficients are calculated by the stats package and plotted using the ggplot2 (correlation heat plot), corrplot (correlation dot plot) and PerformanceAnalytics (correlation bar plot) packages. The colors from red to blue mark the coefficient values from the maximum to the minimum among all samples.

  • ‘Correlation_Heat_Plot_ggplot2.png’

  • ‘Correlation_Dot_Plot_corrplot.png’

  • ‘Correlation_Bar_Plot_PerformanceAnalytics.png’
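For readers who want to see what the pairwise Pearson coefficients behind these plots are, here is a minimal standalone sketch in Python. It is not the package's code, which relies on the R stats package; the function names and the toy expression values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(samples):
    """samples maps a sample name to its per-gene expression values."""
    names = sorted(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names}
            for a in names}

# Toy expression values for three hypothetical samples.
toy = {"s1": [1.0, 2.0, 3.0, 4.0],
       "s2": [2.0, 4.0, 6.0, 8.0],
       "s3": [4.0, 3.0, 2.0, 1.0]}
matrix = correlation_matrix(toy)
```

Each cell of the heat, dot and bar plots corresponds to one entry of such a matrix.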

9.1.2 After Differential Expression Analysis

9.1.2.1 Volcano Plots

  • ‘Volcano_Plot_graphics.png’: Displays the criteria for selecting DEGs and identifies these DEGs using the ggplot2 package. The x-axis represents the log2-based fold change, while the y-axis denotes the log10-based p-value. The upregulated and downregulated DEGs are highlighted in red and green, respectively.

9.1.2.2 PCA Plots

These plots display how the biological samples compare in similarities and differences based on the expression values of DEGs using principal component analysis. The FactoMineR, factoextra and ggplot2 packages are used in this step.

  • ‘Dimension_PCA_Plot_factoextra.png’

  • ‘PCA_Plot_factoextra.png’: Samples are colored by defined groups: blue for the case group and yellow for the control group. Each small point represents a sample, while each big point represents a comparison group. Ellipses can be added to group samples.

  • ‘PCA_Plot_ggplot2.png’: Samples are colored by defined groups: blue for case group and yellow for control group.

9.1.2.3 Heatmap Plots

  • ‘Heatmap_Plot_pheatmap.png’: Visualizes the expression values of DEGs, as log2-based normalized values, from all samples of the two groups using the pheatmap package. Samples are colored by defined groups: blue for the case group and yellow for the control group. Gene names and sample names are shown on the heatmap, with two exceptions: if there are more than 60 differentially expressed genes, gene names are not shown beside the rows; if there are more than 16 samples, sample names are not shown below the columns.

9.2 “ballgown” Analysis Based on FPKM Value

Ballgown is an R package designed for differential expression analysis of RNA-Seq data. This package extracts FPKM values, i.e. reads counts normalized by both library size and gene length, from StringTie output and then applies a parametric F-test comparing nested linear models as its default statistical model to identify differentially expressed genes. The basic steps are as follows:

  • Input: ‘gene_data/ballgown/’
  • Output: ‘RNASeq_results/ballgown_analysis/’
  1. Create a ballgown object that will be stored in ‘RNASeq_results/ballgown_analysis/ballgown.rda’.

  2. Filter out genes whose FPKM values sum to 0 across all samples.

  3. Calculate the log2-based fold change value in column log2FC.

  4. Split the matrix of normalized values into case and control groups based on the phenotype data (‘gene_data/phenodata.csv’) and assign the corresponding values to the columns sample.pattern.FPKM.

  5. Generate a CSV file, ‘RNASeq_results/ballgown_analysis/ballgown_normalized_result.csv’, to store normalized FPKM values, mean expression values per group and statistical results.

  6. Select DEGs based on the default criteria: pval < 0.05 and log2FC > 1 | log2FC < -1, and store the result in ‘RNASeq_results/ballgown_analysis/ballgown_normalized_DE_result.csv’.

  7. Additional Data Visualization. Aside from the general data visualization mentioned above, transcript-related plots and an MA plot are also provided.

  • ‘Distribution_Transcript_Count_per_Gene_Plot.png’: To plot the distribution of transcript count per gene.

  • ‘Distribution_Transcript_Length_Plot.png’: To plot the distribution of transcript length.

  • ‘MA_Plot_ggplot2.png’: To display the difference of expression value between two groups by transforming the data onto log2-based ratio (x-axis) and log2-based mean (y-axis) scales by ggplot2.
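The DEG selection rule can be made concrete with a short standalone Python sketch. It reads the cutoff as an absolute log2 fold change above 1 together with pval < 0.05. The function names, the toy table and the pseudocount of 1 in the fold-change calculation are assumptions for illustration; the package's own formula is not given in this text.

```python
import math

def log2_fold_change(case_values, control_values, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    case_mean = sum(case_values) / len(case_values)
    control_mean = sum(control_values) / len(control_values)
    return math.log2((case_mean + pseudocount) / (control_mean + pseudocount))

def select_degs(table, pval_cutoff=0.05, log2fc_cutoff=1.0):
    """table maps a gene name to (pval, log2FC); return the genes passing both cutoffs."""
    return [gene for gene, (pval, log2fc) in table.items()
            if pval < pval_cutoff and abs(log2fc) > log2fc_cutoff]

# Toy values for four hypothetical genes.
table = {"geneA": (0.01, 2.0),    # significant, up
         "geneB": (0.20, 3.0),    # large change, not significant
         "geneC": (0.01, -1.5),   # significant, down
         "geneD": (0.01, 0.5)}    # significant, change too small
degs = select_degs(table)
print(degs)  # prints ['geneA', 'geneC']
```

The same two cutoffs are the defaults in the DESeq2 and edgeR sections, so the selection function applies unchanged to their result tables.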

9.3 “DESeq2” Analysis Based on Reads Count

DESeq2 is an R package for count-based differential expression analysis that uses reads counts to estimate the variance-mean dependence. It takes sequencing depth and gene composition into consideration and uses the median of ratios normalization (MRN) method to normalize reads counts. The statistical model for differential expression is based on the negative binomial distribution. The basic steps are as follows:

  • Input: ‘gene_data/reads_count_matrix/gene_count_matrix.csv’
  • Output: ‘RNASeq_results/DESeq2_analysis/’
  1. Create a DESeqDataSet object based on count data from the reads count matrix and phenotype data through the DESeqDataSetFromMatrix() function.

  2. Filter out genes whose reads counts sum to 0 across all samples.

  3. Run the DESeq() function to perform the differential expression analysis.

  4. Generate a CSV file, ‘RNASeq_results/DESeq2_analysis/DESeq2_normalized_result.csv’, to store MRN-normalized counts, mean expression values per group and statistical results.

  5. Select DEGs based on the default criteria: pval < 0.05 and log2FC > 1 | log2FC < -1, and store the result in ‘RNASeq_results/DESeq2_analysis/DESeq2_normalized_DE_result.csv’.

  6. Additional Data Visualization. Aside from general data visualization mentioned above, dispersion plot and MA plot are also provided.

  • Dispersion_Plot_DESeq2.png: To display the dispersion estimates drawn by plotDispEsts() in DESeq2: the gene-wise estimates, the fitted trend and the final estimates. The x-axis denotes the mean of normalized counts while the y-axis represents the dispersion estimate.

  • MA_Plot_DESeq2.png: To display the differences in expression between the two groups using plotMA() in DESeq2, which plots the log2 fold change (y-axis) against the mean of normalized counts (x-axis).
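The DESeq2 steps above can be sketched as follows. This is a minimal sketch on simulated data, not the code that RNASeqR runs internally: the count matrix, the phenotype table, the column name group and the level names control and case are all assumptions made for illustration, and the DESeq2 package is assumed to be installed.

```r
library(DESeq2)

set.seed(1)
n_genes <- 200
# Simulated stand-in for 'gene_count_matrix.csv': 200 genes x 6 samples.
gene_means <- 2^runif(n_genes, min = 5, max = 10)
count_matrix <- matrix(
  rnbinom(n_genes * 6, mu = rep(gene_means, times = 6), size = 10),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("sample", 1:6))
)
storage.mode(count_matrix) <- "integer"
count_matrix[1, ] <- 0L  # one gene without reads, removed by the filter below

# Simulated stand-in for the phenotype data (hypothetical column 'group').
pheno <- data.frame(
  group = factor(rep(c("control", "case"), each = 3),
                 levels = c("control", "case")),
  row.names = colnames(count_matrix)
)

# Step 1: build the DESeqDataSet from the count matrix and phenotype data.
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData = pheno,
                              design = ~ group)
# Step 2: drop genes whose counts sum to 0 over all samples.
dds <- dds[rowSums(counts(dds)) > 0, ]
# Step 3: size factors, dispersion estimation and negative binomial testing.
dds <- DESeq(dds)

# Median-of-ratios normalized counts and case-versus-control statistics.
norm_counts <- counts(dds, normalized = TRUE)
res_df <- as.data.frame(results(dds, contrast = c("group", "case", "control")))

# Step 5: default thresholds, read as pval < 0.05 and |log2FC| > 1.
degs <- res_df[which(res_df$pvalue < 0.05 & abs(res_df$log2FoldChange) > 1), ]
```

Because the simulated groups are drawn from the same distribution, degs is expected to contain few or no genes; with real data the same calls apply to the actual count matrix and phenotype table.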

9.4 “edgeR” Analysis Based on Reads Count

edgeR is another R package for count-based differential expression analysis. It implements the trimmed mean of M-values (TMM) method to normalize count data between samples, and several statistical strategies based on the negative binomial distribution, such as exact tests, to detect differential expression. The basic steps are as follows:

  • Input: ‘gene_data/reads_count_matrix/gene_count_matrix.csv’
  • Output: ‘RNASeq_results/edgeR_analysis/’
  1. Create a DGEList object based on count data from the matrix of reads count and phenotype data through the DGEList() function.

  2. Normalize the DGEList object and estimate dispersions by running three functions in the following order: calcNormFactors(), estimateCommonDisp() and estimateTagwiseDisp().

  3. Conduct genewise exact tests through exactTest() function.

  4. Obtain a normalized count matrix through cpm() after TMM normalization. (cpm = counts per million)

  5. Generate a CSV file, ‘RNASeq_results/edgeR_analysis/edgeR_normalized_result.csv’, to store TMM-normalized counts, mean expression values per group and statistical results.

  6. Select DEGs based on default criteria: pval < 0.05 and log2FC > 1 | log2FC < -1, and store the result in ‘RNASeq_results/edgeR_analysis/edgeR_normalized_DE_result.csv’.

  7. Additional Data Visualization. Aside from general data visualization mentioned above, MeanVar plot, BCV plot, MDS plot and Smear plot are also provided.

  • MeanVar_Plot_edgeR.png: To visualize the mean-variance relationship before and after TMM normalization using plotMeanVar() function in edgeR.

  • BCV_Plot_edgeR.png: To display the genewise biological coefficient of variation (BCV) against gene abundance using the plotBCV() function in edgeR.

  • MDS_Plot_edgeR.png: To present expression differences between the samples using plotMDS.DGEList() function in edgeR.

  • Smear_Plot_edgeR.png: To plot log2-based fold change against the log10-based concentration using plotSmear() in edgeR.
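The edgeR steps above can be sketched in the same way. Again this is a minimal sketch on simulated data rather than the package's internal code: the count matrix, the group labels control and case and the variable names are assumptions for illustration, and the edgeR package is assumed to be installed.

```r
library(edgeR)

set.seed(1)
n_genes <- 200
# Simulated stand-in for 'gene_count_matrix.csv': 200 genes x 6 samples.
gene_means <- 2^runif(n_genes, min = 5, max = 10)
count_matrix <- matrix(
  rnbinom(n_genes * 6, mu = rep(gene_means, times = 6), size = 10),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("sample", 1:6))
)
group <- factor(rep(c("control", "case"), each = 3),
                levels = c("control", "case"))

# Step 1: build the DGEList from counts and group labels.
dge <- DGEList(counts = count_matrix, group = group)

# Step 2: TMM normalization factors, then common and tagwise dispersions.
dge <- calcNormFactors(dge)
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)

# Step 3: genewise exact test, case relative to control.
et <- exactTest(dge, pair = c("control", "case"))

# Step 4: counts per million, scaled by library size times TMM factor.
norm_cpm <- cpm(dge)

# Full result table with logFC, logCPM, PValue and FDR columns.
tab <- topTags(et, n = Inf)$table

# Step 6: default thresholds, read as pval < 0.05 and |log2FC| > 1.
degs <- tab[tab$PValue < 0.05 & abs(tab$logFC) > 1, ]
```

calcNormFactors() uses the TMM method by default, so no method argument is passed here.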

9.5 Example

Run RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis().

  1. Run in background: the result will be reported in ‘Rscriptout/Differential_Analysis.Rout’. Make sure differential expression analysis has finished successfully before running subsequent steps.
  2. Run in R shell: the result will be reported in the R shell. Make sure differential expression analysis has finished successfully before running subsequent steps.

10 Functional Analysis

This is the sixth step of the RNA-Seq workflow in this package. clusterProfiler is used for Gene Ontology (GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis based on the differentially expressed genes (DEGs) found by the three differential analyses. Users can either run RNASeqGoKegg_CMD() to execute the process in the background or run RNASeqGoKegg() to execute it in the R shell. In this step, users have to provide the gene name type, input.TYPE.ID, which must be the ID type used in StringTie and ballgown and must be supported by the OrgDb.species annotation package for the target species. In GO functional analysis and KEGG pathway analysis, the input.TYPE.ID IDs are first converted into ENTREZID IDs by the bitr() function in clusterProfiler. IDs with no corresponding ENTREZID return NA and are filtered out. Genes with an Inf or -Inf log2 fold change are filtered out too. ID conversion is done for the results of each differential analysis tool: ballgown, DESeq2 and edgeR.

In this example, the RNA-Seq analysis target species is Saccharomyces cerevisiae(yeast). The OrgDb.species is org.Sc.sgd.db; the input.TYPE.ID is GENENAME. IDs are converted from GENENAME to ENTREZID.
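The two filters described above (IDs without an ENTREZID and genes with an infinite log2 fold change) can be illustrated in base R. This is a sketch under stated assumptions: the gene names, the Entrez IDs and the mapping table id_map are made up for illustration, whereas in the pipeline the mapping is produced by bitr().

```r
# Hypothetical DEG table keyed by GENENAME (all values are made up).
deg <- data.frame(
  GENENAME = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FC   = c(1.8, -Inf, 2.2, Inf, -1.4),
  stringsAsFactors = FALSE
)

# Hypothetical GENENAME-to-ENTREZID mapping; geneC has no ENTREZID.
id_map <- data.frame(
  GENENAME = c("geneA", "geneB", "geneD", "geneE"),
  ENTREZID = c("1001", "1002", "1004", "1005"),
  stringsAsFactors = FALSE
)

# Left join: unmapped genes receive NA in the ENTREZID column.
merged <- merge(deg, id_map, by = "GENENAME", all.x = TRUE)

# Drop genes without an ENTREZID and genes with Inf or -Inf log2 fold change.
kept <- merged[!is.na(merged$ENTREZID) & is.finite(merged$log2FC), ]
print(kept)
```

Only geneA and geneE remain: geneC has no mapping, and geneB and geneD have infinite fold changes.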

10.1 GO Enrichment Analysis

Gene Ontology defines the universe of concepts relating to gene functions (GO terms) along three aspects: molecular function (MF), cellular component (CC) and biological process (BP), and how these functions are related to each other. In this step, GO classification and the GO over-representation test are conducted. To classify significant GO terms, differentially expressed genes are analyzed by the groupGO() function. Similarly, the GO over-representation test of differentially expressed genes is conducted using enrichGO(). Both results are stored in CSV files, and the top 15 GO terms are visualized by both a bar plot and a bubble plot.

In this example, the DESeq2 CC GO Classification bar plot is shown.

In this example, the DESeq2 MF GO Over-representation bar plot and the DESeq2 MF GO Over-representation dot plot are shown.
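To make the idea of an over-representation test concrete, the following base R sketch computes the hypergeometric p-value on which such tests are commonly based. It is an illustration of the statistic only, not a reimplementation of enrichGO(): the helper ora_pvalue() and all numbers are hypothetical.

```r
# k: DEGs annotated to the term, n: number of DEGs,
# M: background genes annotated to the term, N: background genes in total.
ora_pvalue <- function(k, n, M, N) {
  # P(X >= k) for X ~ Hypergeometric(M annotated, N - M not annotated, n draws)
  phyper(k - 1, M, N - M, n, lower.tail = FALSE)
}

# Made-up example: 8 of 50 DEGs fall in a term covering 100 of 6000 genes.
p_single <- ora_pvalue(k = 8, n = 50, M = 100, N = 6000)

# Several made-up terms tested against the same DEG list, followed by
# Benjamini-Hochberg adjustment for multiple testing.
k_terms <- c(8, 2, 1)
M_terms <- c(100, 150, 400)
p_terms <- ora_pvalue(k = k_terms, n = 50, M = M_terms, N = 6000)
p_adjusted <- p.adjust(p_terms, method = "BH")
```

A term is over-represented when its DEG count k is clearly larger than the expected count n * M / N, which is about 0.83 for the first term above.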

10.2 KEGG Pathway Analysis

Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource for understanding functions and utilities of the biological system from molecular-level information (KEGG website). In this step, the KEGG over-representation test of differentially expressed genes is conducted using enrichKEGG() in the clusterProfiler package. The KEGG over-representation result is stored in a CSV file. The pathway IDs found in the KEGG over-representation result are visualized with the pathview package. The KEGG pathway URLs are also stored.

In this example, due to the limited number of differentially expressed genes, no over-represented pathways are found.

10.3 Example

Run RNASeqGoKegg_CMD() or RNASeqGoKegg().

  1. Run in background: the result will be reported in ‘Rscriptout/GO_KEGG_Analysis.Rout’.
  2. Run in R shell: the result will be reported in the R shell.

11 Conclusion

RNASeqR is a user-friendly R-based tool for running a case-control (two-group) RNA-Seq analysis pipeline. The six main steps in this package are RNASeqRParam S4 Object Creation, Environment Setup, Quality Assessment, Reads Alignment and Quantification, Gene-level Differential Expression Analysis and Functional Analysis. The main features that RNASeqR provides are an automated workflow, an extendable file structure, comprehensive reports and data visualization for widely used differential analysis tools. With this R package, two-group RNA-Seq analysis becomes much easier and faster.

12 Session Information

## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] parallel  stats4    grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] DOSE_3.8.0           org.Sc.sgd.db_3.7.0  RNASeqRData_0.99.8  
##  [4] RNASeqR_1.0.0        edgeR_3.24.0         limma_3.38.0        
##  [7] pathview_1.22.0      org.Hs.eg.db_3.7.0   AnnotationDbi_1.44.0
## [10] IRanges_2.16.0       S4Vectors_0.20.0     Biobase_2.42.0      
## [13] BiocGenerics_0.28.0  ggplot2_3.1.0        png_0.1-7           
## [16] BiocStyle_2.10.0    
## 
## loaded via a namespace (and not attached):
##   [1] reticulate_1.10             tidyselect_0.2.5           
##   [3] RSQLite_2.1.1               htmlwidgets_1.3            
##   [5] FactoMineR_1.41             BiocParallel_1.16.0        
##   [7] BatchJobs_1.7               munsell_0.5.0              
##   [9] units_0.6-1                 systemPipeR_1.16.0         
##  [11] withr_2.1.2                 colorspace_1.3-2           
##  [13] GOSemSim_2.8.0              Category_2.48.0            
##  [15] knitr_1.20                  rstudioapi_0.8             
##  [17] leaps_3.0                   labeling_0.3               
##  [19] KEGGgraph_1.42.0            urltools_1.7.1             
##  [21] GenomeInfoDbData_1.2.0      hwriter_1.3.2              
##  [23] bit64_0.9-7                 farver_1.0                 
##  [25] pheatmap_1.0.10             rprojroot_1.3-2            
##  [27] xfun_0.4                    R6_2.3.0                   
##  [29] GenomeInfoDb_1.18.0         locfit_1.5-9.1             
##  [31] bitops_1.0-6                fgsea_1.8.0                
##  [33] gridGraphics_0.3-0          DelayedArray_0.8.0         
##  [35] assertthat_0.2.0            scales_1.0.0               
##  [37] ggraph_1.0.2                nnet_7.3-12                
##  [39] enrichplot_1.2.0            gtable_0.2.0               
##  [41] ballgown_2.14.0             sva_3.30.0                 
##  [43] systemPipeRdata_1.9.4       rlang_0.3.0.1              
##  [45] genefilter_1.64.0           BBmisc_1.11                
##  [47] scatterplot3d_0.3-41        splines_3.5.1              
##  [49] rtracklayer_1.42.0          lazyeval_0.2.1             
##  [51] acepack_1.4.1               europepmc_0.3              
##  [53] brew_1.0-6                  checkmate_1.8.5            
##  [55] BiocManager_1.30.3          yaml_2.2.0                 
##  [57] reshape2_1.4.3              GenomicFeatures_1.34.0     
##  [59] backports_1.1.2             rafalib_1.0.0              
##  [61] qvalue_2.14.0               Hmisc_4.1-1                
##  [63] clusterProfiler_3.10.0      RBGL_1.58.0                
##  [65] tools_3.5.1                 bookdown_0.7               
##  [67] ggplotify_0.0.3             RColorBrewer_1.1-2         
##  [69] ggridges_0.5.1              Rcpp_0.12.19               
##  [71] plyr_1.8.4                  base64enc_0.1-3            
##  [73] progress_1.2.0              zlibbioc_1.28.0            
##  [75] purrr_0.2.5                 RCurl_1.95-4.11            
##  [77] prettyunits_1.0.2           ggpubr_0.1.8               
##  [79] rpart_4.1-13                viridis_0.5.1              
##  [81] cowplot_0.9.3               zoo_1.8-4                  
##  [83] SummarizedExperiment_1.12.0 ggrepel_0.8.0              
##  [85] cluster_2.0.7-1             factoextra_1.0.5           
##  [87] magrittr_1.5                data.table_1.11.8          
##  [89] DO.db_2.9                   triebeard_0.3.0            
##  [91] matrixStats_0.54.0          hms_0.4.2                  
##  [93] evaluate_0.12               xtable_1.8-3               
##  [95] XML_3.98-1.16               gridExtra_2.3              
##  [97] compiler_3.5.1              biomaRt_2.38.0             
##  [99] tibble_1.4.2                crayon_1.3.4               
## [101] htmltools_0.3.6             GOstats_2.48.0             
## [103] mgcv_1.8-25                 Formula_1.2-3              
## [105] tidyr_0.8.2                 geneplotter_1.60.0         
## [107] sendmailR_1.2-1             DBI_1.0.0                  
## [109] tweenr_1.0.0                corrplot_0.84              
## [111] MASS_7.3-51                 PerformanceAnalytics_1.5.2 
## [113] ShortRead_1.40.0            Matrix_1.2-14              
## [115] quadprog_1.5-5              bindr_0.1.1                
## [117] igraph_1.2.2                GenomicRanges_1.34.0       
## [119] pkgconfig_2.0.2             flashClust_1.01-2          
## [121] rvcheck_0.1.1               GenomicAlignments_1.18.0   
## [123] foreign_0.8-71              xml2_1.2.0                 
## [125] annotate_1.60.0             XVector_0.22.0             
## [127] AnnotationForge_1.24.0      stringr_1.3.1              
## [129] digest_0.6.18               graph_1.60.0               
## [131] Biostrings_2.50.0           rmarkdown_1.10             
## [133] fastmatch_1.1-0             htmlTable_1.12             
## [135] GSEABase_1.44.0             Rsamtools_1.34.0           
## [137] rjson_0.2.20                nlme_3.1-137               
## [139] jsonlite_1.5                bindrcpp_0.2.2             
## [141] viridisLite_0.3.0           pillar_1.3.0               
## [143] lattice_0.20-35             KEGGREST_1.22.0            
## [145] httr_1.3.1                  survival_2.43-1            
## [147] GO.db_3.7.0                 glue_1.3.0                 
## [149] xts_0.11-1                  UpSetR_1.3.3               
## [151] bit_1.1-14                  Rgraphviz_2.26.0           
## [153] ggforce_0.1.3               stringi_1.2.4              
## [155] blob_1.1.1                  DESeq2_1.22.0              
## [157] latticeExtra_0.6-28         memoise_1.1.0              
## [159] dplyr_0.7.7

13 References

Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics, 10(1), 57.

Pang CN et al., “Transcriptome and network analyses in Saccharomyces cerevisiae reveal that amphotericin B and lactoferrin synergy disrupt metal homeostasis and stress response.”, Sci Rep, 2017 Jan 12;7:40232

Kim D, Langmead B and Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015

Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT & Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Nature Biotechnology 2015, doi:10.1038/nbt.3122

Morgan M, Pagès H, Obenchain V, Hayden N (2018). Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.32.0, http://bioconductor.org/packages/release/bioc/html/Rsamtools.html.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621.

Backman TWH, Girke T (2016). “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics, 17(1). doi: 10.1186/s12859-016-1241-0, https://doi.org/10.1186/s12859-016-1241-0.

Morgan M, Anders S, Lawrence M, Aboyoun P, Pagès H, Gentleman R (2009). “ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data.” Bioinformatics, 25, 2607-2608. doi: 10.1093/bioinformatics/btp450, http://dx.doi.org/10.1093/bioinformatics/btp450.

Fu J, Frazee AC, Collado-Torres L, Jaffe AE, Leek JT (2018). ballgown: Flexible, isoform-level differential expression analysis. R package version 2.12.0.

Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.

Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140.

McCarthy DJ, Chen Y, Smyth GK (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297.

Yu G, Wang L, Han Y, He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), 284-287. doi: 10.1089/omi.2011.0118.

Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature protocols, 11(9), 1650.

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A., & Dewey, C. N. (2009). RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4), 493-500.