GVPPSR Kumar, AP Sahoo and A Kumar
Animal Biotechnology Division, IVRI, UP, India
The complete set of transcripts (transcribed RNA products) in a cell, in terms of both type and quantity, is called the transcriptome. Transcriptome analysis helps in understanding patterns of gene expression, addresses basic biological questions, and unravels the biological pathways and molecular mechanisms that regulate cell fate, development and disease progression. The transcriptome is analyzed through RNA‐sequencing (RNA‐seq) or microarray experiments. RNA‐seq involves sequencing the entire transcriptome using next‐generation sequencing (NGS) platforms.
The data generated through various platforms go into secondary analysis, which mainly involves alignment (if it is reference‐based) or assembly (if it is de novo). In this book, RNA‐seq data generated are analyzed through distinct pipelines to identify differentially expressed genes under different experimental conditions. The basics of RNA‐sequencing data analysis will be discussed in this chapter.
With proper depth/coverage, NGS addresses the limitations of a microarray experiment, such as: high background levels owing to cross‐hybridization; limited dynamic range of detection owing to both background and saturation of signals; and reliance upon existing knowledge about genome sequence.
The secondary analysis primarily involves mapping of the reads to a reference genome (reference‐based approach) or to a reference transcriptome (for example, one built by de novo assembly). Mapping means aligning each read sequence to the portion of the reference it originated from, which is essentially a pairwise alignment problem. The most accurate and sensitive method of pairwise alignment is dynamic programming, but its running time grows with the product of the read length and the reference length, so it is too slow for aligning millions of NGS reads to a genome. This warrants fast sequence alignment strategies.
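To make the cost of dynamic programming concrete, here is a minimal sketch of local (Smith–Waterman style) alignment scoring in Python. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not taken from any particular aligner. Every cell of a read × reference table is filled, which is why the approach does not scale to genome‐sized references.

```python
def smith_waterman_score(read, reference, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, 1-based end position in the reference).

    Scoring values are illustrative assumptions. Only two rows of the
    dynamic programming table are kept in memory at any time.
    """
    prev = [0] * (len(reference) + 1)
    best, best_end = 0, 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            step = match if read[i - 1] == reference[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + step,  # match or mismatch
                prev[j] + gap,       # gap in the reference
                curr[j - 1] + gap,   # gap in the read
            )
            if curr[j] > best:
                best, best_end = curr[j], j
        prev = curr
    return best, best_end


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

The two nested loops visit len(read) × len(reference) cells, so a 100‐base read against a reference of billions of bases would need on the order of hundreds of billions of cell updates per read.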
The faster alignment strategies that have evolved use either hash table‐based indexing (the seed‐and‐extend paradigm, often with spaced seeds) or suffix/prefix tree‐based indexing (suffix arrays, or the Burrows–Wheeler (BW) transformation combined with the FM‐index).
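The BW transformation and FM‐index idea can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and only exact matches are found. The function names are hypothetical, and production aligners use far more compact data structures.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, Occ table) for `text`."""
    text += "$"  # sentinel that sorts before A, C, G and T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = Counter(text)
    # c_table[ch]: number of characters in the text that sort before ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for other in occ:
            occ[other][i + 1] = occ[other][i] + (1 if other == ch else 0)
    return suffix_array, c_table, occ


def find_exact(pattern, suffix_array, c_table, occ):
    """Backward search: return sorted 0-based start positions of `pattern`."""
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


if __name__ == "__main__":
    index = build_fm_index("ACGTACGTTAGC")
    print(find_exact("ACGT", *index))
```

The search loop runs once per character of the read, independent of the reference length, which is the property that makes this family of indexes attractive for short read alignment.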
There are several aligners in vogue. Some of these are summarized below:
TABLE 42.1 Example and purpose of short read aligners.
Aligner | Purpose | Strategy |
Bowtie | Fast but gaps not allowed | BW transformation and FM index |
BWA | Small gaps (indels) | BW transformation and FM index |
GSNAP | Large gaps (introns) | A double lookup scheme |
Bowtie 2 | Takes care of gaps | BW transformation and FM index |
TABLE 42.2 Example and purpose of long read aligners.
Aligner | Purpose | Strategy |
BLAST | Many reference genomes | Heuristic method |
BLAT | Large gaps (introns) | Hash table‐based index of k‐mers |
BWA | Small gaps (indels) | BW transformation and FM Index |
Exonerate | Ease of use | Bounded sparse dynamic programming |
GMAP | Large gaps (introns) | A double lookup scheme |
MUMmer | Aligns two genomes | Suffix tree |
There are three major challenges in aligning the RNA‐seq reads:
Among the reads generated from an RNA‐seq experiment, some lie entirely within exons, while others span exon–exon (splice) junctions and cover parts of two exons; the latter are termed junction reads (Wang et al., 2009). The reads can be aligned to a reference genome or to a reference transcriptome. If the alignment is done to a reference transcriptome, short read aligners that do not allow gaps can be used; if the reads are aligned to a reference genome, short read aligners that allow gaps have to be used. Gaps must be allowed mainly so that junction reads can be aligned to the reference genome, which contains introns.
There are two major alignment strategies for RNA‐seq alignment – the exon‐first approach and the seed‐extend approach (Garber et al., 2011).
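The seed‐extend idea, combined with hash table‐based indexing, can be illustrated with a deliberately simplified Python sketch. It is ungapped and unspliced, uses contiguous rather than spaced seeds, and extends a seed hit simply by counting mismatches over the full read; real spliced aligners are considerably more elaborate. The function names and the `max_mismatches` parameter are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=1):
    """Return sorted (start, mismatches) pairs for ungapped placements of the read.

    Seeds are non-overlapping k-mers of the read; each seed hit proposes a
    candidate start, which is then verified against the whole read.
    """
    hits = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    reference = "GATTACAGATTTCAGGCT"  # made-up sequence
    index = build_kmer_index(reference, k=4)
    print(seed_and_extend("ACAGATGT", reference, index, k=4))
```

Only positions proposed by an exact seed hit are examined, so most of the reference is never compared with the read.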
All the aligners mentioned above generate SAM (Sequence Alignment/Map) files from RNA‐Seq data. Cufflinks or RSEM use these SAM files to estimate expression, and the estimates are then passed (depending on the experimental setup) to DE packages (DESeq2, EBSeq, edgeR) to identify differentially expressed genes. These packages normalize the read counts by using different metrics, such as RPKM/FPKM, TPM, or TMM.
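In a SAM record, the CIGAR field describes how a read aligns to the reference, and the `N` operation marks a skipped reference region, which for RNA‐seq reads typically corresponds to an intron. The following Python sketch extracts such skipped regions from a single alignment line. The function name and the example record are hypothetical, and the sketch ignores header lines, unmapped reads and optional tags.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(sam_line):
    """Return (reference name, intron start, intron end) for each N operation.

    Coordinates are 1-based and inclusive, following the SAM POS convention.
    """
    fields = sam_line.rstrip("\n").split("\t")
    ref_name, pos, cigar = fields[2], int(fields[3]), fields[5]
    junctions = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref_name, ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
    return junctions


if __name__ == "__main__":
    # Made-up record: 30 aligned bases, a 500-base skipped region, 20 aligned bases.
    record = "\t".join(
        ["read1", "0", "chr1", "100", "60", "30M500N20M", "*", "0", "0",
         "A" * 50, "I" * 50]
    )
    print(splice_junctions(record))
```

A read whose CIGAR string contains no `N` operation yields an empty list, so the same function can be used to separate junction reads from reads that lie within a single exon.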
RNA‐Seq read counts are normalized for the depth of sequencing and the length of the gene. A sample sequenced to greater depth will have more reads mapped to each gene, and longer genes will have more reads mapped to them than shorter genes expressed at the same level. FPKM and TPM are the metrics used for normalizing for the length of the gene and the depth of sequencing. A further metric, TMM, normalizes for differences in RNA composition between samples. Some parts of the following section are adapted from statquest.org:
RPKM/FPKM stands for reads/fragments per kilobase of exon per million mapped reads. FPKM is used for paired‐end reads, and RPKM for single‐end reads. “Per million reads” means that the fragment counts are normalized against the library size, which allows comparison of the same gene across samples. This value is further normalized per kilobase of exon, by dividing by the total length of all exons in the gene, which allows comparison of the expression of genes of different lengths. Cufflinks gives FPKM values, whereas RSEM gives both FPKM and TPM values.
FPKM normalization involves two steps. In step 1, reads are normalized for library sizes, while step 2 involves normalization for the length of the gene. Here we consider four genes, with a variable number of reads for three samples.
TABLE 42.3 Read counts and total reads for samples 1, 2, and 3, followed by the reads per million (RPM) values obtained by dividing the reads for each gene in each sample by the corresponding total reads (in millions).
Genes | Sample1 | Sample 2 | Sample 3 |
1 (2 kb) | 20 | 24 | 60 |
2 (4 kb) | 40 | 50 | 120 |
3 (1 kb) | 10 | 16 | 30 |
4 (10 kb) | 0 | 0 | 2 |
Total | 70 (7 M) | 90 (9 M) | 212 (21.2 M) |
Genes | Sample 1 (RPM) | Sample 2 (RPM) | Sample 3 (RPM) |
1 (2 kb) | 2.86 | 2.67 | 2.83 |
2 (4 kb) | 5.71 | 5.56 | 5.66 |
3 (1 kb) | 1.43 | 1.78 | 1.42 |
4 (10 kb) | 0 | 0 | 0.09 |
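The two normalization steps can be sketched in Python using the toy counts from the worked example. The gene and sample labels and the function names are illustrative assumptions; library sizes are given in millions, following the example's convention that tens of reads stand in for millions of reads.

```python
# Toy data from the worked example.
GENE_LENGTH_KB = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}
COUNTS = {
    "sample1": {"gene1": 20, "gene2": 40, "gene3": 10, "gene4": 0},
    "sample2": {"gene1": 24, "gene2": 50, "gene3": 16, "gene4": 0},
    "sample3": {"gene1": 60, "gene2": 120, "gene3": 30, "gene4": 2},
}
LIBRARY_SIZE_MILLIONS = {"sample1": 7.0, "sample2": 9.0, "sample3": 21.2}


def rpm(counts, library_size_millions):
    """Step 1: normalize read counts for library size (reads per million)."""
    return {gene: n / library_size_millions for gene, n in counts.items()}


def rpkm(counts, lengths_kb, library_size_millions):
    """Step 2: divide the RPM value of each gene by its length in kb."""
    per_million = rpm(counts, library_size_millions)
    return {gene: value / lengths_kb[gene] for gene, value in per_million.items()}


if __name__ == "__main__":
    for sample, counts in COUNTS.items():
        values = rpkm(counts, GENE_LENGTH_KB, LIBRARY_SIZE_MILLIONS[sample])
        print(sample, {gene: round(v, 3) for gene, v in values.items()})
```

Because the library size is applied first, the per‐sample sums of the resulting values generally differ from one sample to another.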
TABLE 42.4 Calculation of RPKM by dividing the RPM value obtained after step 1 for each gene by the gene length in kb.
Genes | Sample 1 (RPKM) | Sample 2 (RPKM) | Sample 3 (RPKM) |
1 (2 kb) | 1.43 | 1.33 | 1.42 |
2 (4 kb) | 1.43 | 1.39 | 1.42 |
3 (1 kb) | 1.43 | 1.78 | 1.42 |
4 (10 kb) | 0 | 0 | 0.009 |
Total normalized reads | 4.29 | 4.5 | 4.25 |
TPM (transcripts per million) corrects for the transcript length distribution in an RNA pool and provides better across‐sample comparability. Here, the read counts are first normalized for gene length; the sum of these length‐normalized counts in each sample is then used as the library size for the second normalization. As a result, the TPM values of every sample add up to the same total. RSEM gives TPM.
Calculation of TPM is as follows:
Genes | Sample 1 | Sample 2 | Sample 3 |
1 (2 kb) | 20 | 24 | 60 |
2 (4 kb) | 40 | 50 | 120 |
3 (1 kb) | 10 | 16 | 30 |
4 (10 kb) | 0 | 0 | 2 |
TABLE 42.6 Total reads per kb (RPK) of gene for sample 1, 2, and 3 (millions of reads equated to a scale of tens of reads).
Genes | Sample 1 (RPK) | Sample 2 (RPK) | Sample 3 (RPK) |
1 (2 kb) | 10 | 12 | 30 |
2 (4 kb) | 10 | 12.5 | 30 |
3 (1 kb) | 10 | 16 | 30 |
4 (10 kb) | 0 | 0 | 0.2 |
Total | 30 (3 M) | 40.5 (4.05 M) | 90.2 (9.02 M) |
Total reads per kb of gene for samples 1, 2 and 3 are 3 M, 4.05 M, and 9.02 M, respectively (millions of reads equated to a scale of tens of reads).
TABLE 42.7 Calculation of TPM by dividing the reads per kb (RPK) of each gene obtained in step 1 by the total reads per kb (in millions) of the corresponding sample.
Genes | Sample 1 (TPM) | Sample 2 (TPM) | Sample 3 (TPM) |
1 (2 kb) | 3.33 | 2.96 | 3.326 |
2 (4 kb) | 3.33 | 3.09 | 3.326 |
3 (1 kb) | 3.33 | 3.95 | 3.326 |
4 (10 kb) | 0 | 0 | 0.02 |
Total normalized reads | 10 | 10 | 10 |
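The TPM calculation can be sketched in Python with the same toy counts. The labels and the function name are illustrative assumptions. Real TPM values are scaled to one million; the `scale` argument is set to 10 below only to match the example's convention that tens of reads stand in for millions of reads.

```python
# Toy data from the worked example.
GENE_LENGTH_KB = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}
COUNTS = {
    "sample1": {"gene1": 20, "gene2": 40, "gene3": 10, "gene4": 0},
    "sample2": {"gene1": 24, "gene2": 50, "gene3": 16, "gene4": 0},
    "sample3": {"gene1": 60, "gene2": 120, "gene3": 30, "gene4": 2},
}


def tpm(counts, lengths_kb, scale=1_000_000):
    """Normalize for gene length first, then for the total of the RPK values."""
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total_rpk = sum(rpk.values())
    return {gene: scale * value / total_rpk for gene, value in rpk.items()}


if __name__ == "__main__":
    for sample, counts in COUNTS.items():
        values = tpm(counts, GENE_LENGTH_KB, scale=10)
        print(sample, {gene: round(v, 3) for gene, v in values.items()})
```

Since each value is a fraction of the sample's own RPK total, the values in every sample sum to `scale`, which is what makes proportions comparable across samples.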
TMM (trimmed mean of M‐values) corrects for differences in RNA composition between samples. In a sample in which certain genes are very highly expressed, those genes take up a large share of the reads, leaving less scope for the less expressed genes to be sequenced; RPKM or TPM normalization will then yield biased expression values.
With the same sequencing depth for populations 1 and 2, genes A and C will have a lower RPKM in RNA population 1, even though the expression of these genes is the same in populations 1 and 2 (Robinson and Oshlack, 2010). TMM is used by the edgeR DE package to normalize count data.
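The composition effect and the idea behind TMM can be illustrated with a simplified Python sketch. It is not the edgeR implementation: edgeR additionally trims on absolute expression and weights the remaining genes, whereas this version takes an unweighted trimmed mean of the log ratios (M‐values) only. The counts, gene names and function name are hypothetical.

```python
import math


def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of M-values, returned as a scaling factor.

    M is the log2 ratio of a gene's share of the library in `sample`
    to its share in `reference`; genes with a zero count are skipped.
    """
    n_sample, n_reference = sum(sample.values()), sum(reference.values())
    m_values = sorted(
        math.log2((sample[g] / n_sample) / (reference[g] / n_reference))
        for g in sample
        if sample[g] > 0 and reference[g] > 0
    )
    k = int(len(m_values) * trim)  # number of genes trimmed from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: nine genes expressed equally in both populations and
# one gene (H) that is very highly expressed in population 1 only.
# Both libraries have the same depth (1000 reads).
population_1 = {f"g{i}": 55 for i in range(1, 10)} | {"H": 505}
population_2 = {f"g{i}": 110 for i in range(1, 10)} | {"H": 10}

if __name__ == "__main__":
    factor = simple_tmm_factor(population_1, population_2)
    effective_size = sum(population_1.values()) * factor
    print("scaling factor:", round(factor, 3))
    print("g1, population 1:", round(population_1["g1"] / effective_size, 3))
    print("g1, population 2:", round(population_2["g1"] / sum(population_2.values()), 3))
```

Dividing by the raw library size would report the nine shared genes at half their level in population 1, simply because gene H consumes half of the reads. Scaling the library size by the trimmed‐mean factor removes that artefact in this toy case.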