The ultimate value of an RNA-Seq experiment comes from data analysis. Next generation sequencing (NGS) experiments generate a tremendous amount of data which—unlike Sanger sequencing results—can't be directly analyzed in any meaningful way. With help from modern computing, scientists aim to extract as much useful information as possible from RNA-Seq results while avoiding misinterpretation or bias. Selecting the right analytical approach along with an appropriate set of bioinformatics tools is key to this objective. Many powerful software programs are available, often for free, that can slice and dice the data in myriad ways. For newcomers, analyzing huge NGS datasets can be intimidating, and knowing where to start isn’t obvious. Here, we provide a brief introductory tutorial to RNA-Seq bioinformatics as well as resources for more in-depth exploration. We’ll focus our discussion on short-read Illumina® sequencing data, which is commonly used for RNA-Seq experiments.
Most bioinformatics tools for NGS analysis have been developed for the Linux environment, so you’ll need to know how to use the Linux command line to install software, execute programs, and navigate file trees. Some familiarity with scripting is also required to parse data, format output files, and execute some analyses. To speed up computation, we recommend using a grid cluster, if available, to analyze multiple samples at the same time.
Let’s walk through a typical bioinformatics workflow to illustrate step by step how RNA-Seq data is processed with the goal of differential gene expression analysis.
First, switch to the FASTQ directory. Use the
cd command (i.e. change directory) followed by the path where the FASTQ files are stored.
Next, you can check the FASTQ files by using the
ls command (i.e. listing), which shows the contents of the current working directory.
Data files from sequencing providers are typically compressed and have the extension “.fastq.gz”. These files contain structured information about individual NGS reads—a unique identifier, the called bases, and the associated quality scores—which can be viewed using the
zmore command. In this format, however, the raw data is not particularly useful to examine directly by eye, given the sheer number of reads—often in the millions.
Lastly, you can make an output directory using the
mkdir command (i.e. make directory). Output files can be stored here.
For the next steps in the workflow, we’ll provide general commands for the processes described. However, please check the documentation that accompanies each software program to tailor it to your data, computer system, and analytical requirements. We also recommend viewing our recorded workshop “Bioinformatics for RNA-Seq” which provides screenshots of actual analyses along with helpful commentary and insights.
Run FastQC to check the raw data quality.
The output contains graphs and statistics about the raw quality, including quality scores, GC content, adapter percentage, and more. Below are two examples of the output files.
Poor-quality regions and adapter sequences should be trimmed from the reads before further analysis. Since Trimmomatic has an executable JAR file, you’ll need to use Java to execute it rather than doing so directly in the command line.
Run FastQC again on the trimmed treads to confirm that the new quality is acceptable (see figures below).
First, index the reference genome using STAR to prepare it for alignment. Adding gene annotation information to the reference genome will facilitate alignment of RNA-Seq reads across exon-intron boundaries. This indexing step is only required once; you can then use the indexed genome repeatedly in future analysis.
Then, align the reads. The default output for the STAR aligner is a SAM file, which should be converted to a BAM file for downstream use.
Check the mapping statistics in the [sample_name]Log.final.out file to ensure the BAM file was generated properly and the reads align to the genome correctly. Uniquely mapped reads are the most useful for expression analysis, as there is high confidence in which loci they represent. In general, >60-70% for the “uniquely mapped reads %” metric is considered good; a significantly lower value warrants further investigation.
Lastly, use Samtools to sort and index the BAM files. Organizing the reads by position within the BAM file is needed for downstream analysis.
Use FeatureCounts to calculate the number of reads per gene. We suggest counting only uniquely mapped reads that fall within exons. Reads that align to introns or intergenic regions may represent genomic DNA contamination or pre-mRNA (i.e. unspliced mRNA). We recommend using gene ID as the identifier in the summary table since other identifiers, such as gene symbol, may not be available for all loci.
Next, check the stats file to validate the quality of the data, in particular the number of assigned reads. About 15 to 20 million reads per sample are typically recommended for RNA-Seq analysis, but it's possible to use fewer reads. The target number of reads depends on the objectives of the project and the species under investigation.
First, you’ll need to merge the hit counts from all samples into a summary table using R scripting or Excel™ (see example below).
|And so on||...||...||...||...||...||...|
Then, create a table that lists the sample names and their group assignments (see example below). This information will instruct DESeq2 how to compare the samples. Both tables can be outputted as tab-delimited text files.
Before working with DESeq2, install the following libraries in R.
In R, import the count data and sample grouping files.
Then execute the DESeq2 analysis, specifying that samples should be compared based on “condition”.
Below are examples of several plots that can be generated with DESeq2. A comprehensive tutorial of this software is beyond the scope of this article. We recommend reviewing the DESeq2 manual for an in-depth explanation of how to perform each type of analysis and which parameters can be adjusted.
A theme emerges from the workflow above: it’s important to perform quality control checks, where possible, before advancing to the next step. Detecting errors or issues early will save substantial time and effort down the line. With a bit of training and practice, virtually any scientist can get started with RNA-Seq data analysis. Once you feel comfortable with each step, you can create scripts to build pipelines for streamlined analysis. Although we focused on the mechanics of performing data analysis here, it’s imperative that you understand the statistical principles behind the analyses to interpret your data properly.
For a deeper dive into RNA-Seq bioinformatics, check out our on-demand educational workshop. It explores the bioinformatics pipeline, explains NGS results, addresses common challenges, and answers frequently asked questions regarding RNA-Seq analysis.