CHAPTER 42
Basics of RNA‐Seq Data Analysis

GVPPSR Kumar, AP Sahoo and A Kumar

Animal Biotechnology Division, IVRI, UP, India

42.1 INTRODUCTION

The complete set of transcripts (transcribed RNA products) in a cell, in terms of both type and quantity, is called the transcriptome. Transcriptome analysis helps in understanding the pattern of gene expression to address basic biological questions, and unravels the biological pathways and molecular mechanisms that regulate cell fate, development and disease progression. The transcriptome is analyzed through RNA‐sequencing (RNA‐seq) or microarray experiments. RNA‐sequencing involves sequencing of the entire transcriptome, using next‐generation sequencing (NGS) platforms.

The data generated through various platforms go into secondary analysis, which mainly involves alignment (if it is reference‐based) or assembly (if it is de novo). In this book, RNA‐seq data generated are analyzed through distinct pipelines to identify differentially expressed genes under different experimental conditions. The basics of RNA‐sequencing data analysis will be discussed in this chapter.

42.2 AIM OF AN RNA‐SEQ EXPERIMENT

  1. To quantify RNA abundance.
  2. To annotate the transcription start sites, 5′ and 3′ ends, and splicing patterns of genes.
  3. To quantify the changing expression levels of each transcript during development and under different conditions.
  4. To identify variants in the transcripts.

With proper depth/coverage, NGS addresses the limitations of a microarray experiment, such as: high background levels owing to cross‐hybridization; limited dynamic range of detection owing to both background and saturation of signals; and reliance upon existing knowledge about genome sequence.

42.2.1 Sequence alignment

The secondary analysis primarily involves mapping of the reads to the reference genome (reference‐based assembly) or a reference transcriptome (de novo assembly). Mapping means aligning each read sequence to the portion of the reference from which it most likely originated. This mapping/alignment is essentially a pairwise alignment. The most accurate and sensitive method of pairwise alignment is dynamic programming, but it is too time‐consuming to be used for aligning the large number of NGS reads to a genome. This warrants fast sequence alignment strategies.
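To make the cost of dynamic programming concrete, the following is a minimal sketch of local pairwise alignment scoring in the style of Smith–Waterman. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not values prescribed in this chapter. The table has one cell per pair of positions, so the work grows with read length × reference length, which is why this approach alone does not scale to millions of reads against a whole genome.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Fills a (len(a)+1) x (len(b)+1) dynamic programming table.
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # A local alignment may restart anywhere, hence the floor at 0.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# Toy sequences sharing the substring "ACGT".
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

With these toy inputs, the shared substring of four bases gives the best score. A 100‐base read against a 3 × 10^9 bp genome would need about 3 × 10^11 cells per read.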

42.3 FAST SEQUENCE ALIGNMENT STRATEGIES

The faster alignment strategies that have evolved either use hash table‐based indexing (seed extend paradigm with space allowance) or suffix/prefix tree‐based indexing (suffix array or Burrows–Wheeler (BW) transformation and FM‐index).
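The two indexing families can be illustrated with a small sketch. The first part builds a hash table of k‐mers and uses it for a simple seed‐and‐extend lookup with ungapped verification; the second part computes a Burrows–Wheeler transform by sorting rotations. All names, the toy genome, the seed length and the mismatch threshold are assumptions made for illustration. Real aligners use far more compact data structures, and an FM‐index does not build the BW transform this way.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Hash table mapping every k-mer to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, genome, index, k, max_mismatches=1):
    """Look up each k-mer of the read (seed), then verify the full read
    at the implied position without gaps (extend)."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized input only)."""
    text += "$"  # end marker, sorts before the letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


GENOME = "ACGTACGTTAGCCGATAGGCTA"  # hypothetical toy reference
K = 4
INDEX = build_kmer_index(GENOME, K)

print(seed_and_extend("TAGCAGAT", GENOME, INDEX, K))  # read with one mismatch
print(bwt("banana"))
```

The hash lookup finds candidate positions in constant time per seed, so only a few positions need the slower verification step.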

There are several aligners in vogue. Some of these are summarized below:

42.3.1 Short read aligners (Table 42.1)

TABLE 42.1 Examples and purposes of short read aligners.

Aligner    Purpose                      Strategy
Bowtie     Fast, but gaps not allowed   BW transformation and FM‐index
BWA        Small gaps (indels)          BW transformation and FM‐index
GSNAP      Large gaps (introns)         A double lookup scheme
Bowtie 2   Takes care of gaps           BW transformation and FM‐index

42.3.2 Long read aligners (Table 42.2)

TABLE 42.2 Examples and purposes of long read aligners.

Aligner     Purpose                  Strategy
BLAST       Many reference genomes   Heuristic method
BLAT        Large gaps (introns)     Hash table‐based k‐mer index
BWA         Small gaps (indels)      BW transformation and FM‐index
Exonerate   Ease of use              Bounded sparse dynamic programming
GMAP        Large gaps (introns)     A double lookup scheme
MUMmer      Aligns two genomes       Suffix tree

42.3.3 Challenges in RNA‐seq alignment

There are three major challenges in aligning the RNA‐seq reads:

  1. the reads are short (35–125 nt);
  2. error rates are considerable;
  3. some reads span exon–exon junctions (Garber et al., 2011).

Among the reads generated from an RNA‐seq experiment, some fall entirely within a single exon, while others span an exon–exon junction and cover parts of two exons; the latter are termed junction reads (Wang et al., 2009). The reads can be aligned to a reference genome or to a reference transcriptome. If the alignment is done to a reference transcriptome, short read aligners that do not allow gaps can be used; if the reads are aligned to a reference genome, short read aligners that allow gaps have to be used. The provision to allow gaps is mainly needed to align junction reads to the reference genome, which contains introns.

There are two major alignment strategies for RNA‐seq alignment – the exon‐first approach and the seed‐extend approach (Garber et al., 2011).

  1. Exon‐first approach: tools that use this approach are TopHat, MapSplice and SpliceMap. This is a two‐step procedure. Reads are first mapped contiguously to the reference using unspliced read aligners (i.e., aligners that do not allow gaps). The reads that remain unmapped are then split into shorter segments, which are mapped independently to the reference. TopHat, followed by Cufflinks, is one of the most common pipelines for analyzing RNA‐Seq data (Garber et al., 2011).
  2. Seed‐extend approach: tools that use this approach are GMAP, GSNAP, and QPALMA. This involves breaking the reads into short seeds, mapping the seeds to the reference to locate candidate regions, and then examining and joining the candidate regions with a more sensitive method such as the Smith–Waterman algorithm (Smith and Waterman, 1981).
FIGURE 42.1 The two major alignment strategies for RNA‐seq alignment: the exon‐first approach (left) and the seed‐extend approach (right). Boxes represent exon 1 (filled) and exon 2 (unfilled).
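The two‐step logic of the exon‐first approach can be sketched on a toy genome with two exons and one intron. This is only an illustration under strong assumptions: it uses exact string matching, tries every split point of an unmapped read, and takes the first hit. The sequences, function names and the `min_anchor` parameter are hypothetical, and the sketch is not the algorithm implemented in TopHat, MapSplice or SpliceMap.

```python
EXON1 = "ATGGCCAAGT"
INTRON = "GTAAGTTTTTCAG"
EXON2 = "CTGACCGGAT"
GENOME = EXON1 + INTRON + EXON2  # hypothetical toy reference


def map_unspliced(read, genome):
    """Step 1: contiguous (ungapped, exact) mapping."""
    pos = genome.find(read)
    return None if pos < 0 else pos


def map_spliced(read, genome, min_anchor=4):
    """Step 2: split the read into two segments and map them independently,
    requiring the second segment to lie downstream of the first."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos < 0:
            continue
        right_pos = genome.find(right, left_pos + len(left) + 1)
        if right_pos >= 0:
            return left_pos, right_pos
    return None


def exon_first(reads, genome):
    results, unmapped = {}, []
    for read in reads:
        pos = map_unspliced(read, genome)
        if pos is None:
            unmapped.append(read)
        else:
            results[read] = ("unspliced", pos)
    for read in unmapped:
        hit = map_spliced(read, genome)
        if hit is None:
            results[read] = ("unmapped",)
        else:
            results[read] = ("spliced", hit[0], hit[1])
    return results


reads = ["GGCCAAG", EXON1[-5:] + EXON2[:5], "TTTTGGGGCCCC"]
for read, result in exon_first(reads, GENOME).items():
    print(read, result)
```

The second read is a junction read: it fails in step 1 and is recovered in step 2, with its two segments placed on either side of the intron.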

All the aligners mentioned above generate SAM (Sequence Alignment/Map) files from RNA‐Seq data. Depending on the experimental setup, Cufflinks or RSEM use these SAM files to estimate expression, and DE packages (DESeq2, EBSeq, edgeR) are used to identify differentially expressed genes. These tools normalize the read counts using different metrics, such as RPKM/FPKM, TPM, or TMM.
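As an illustration of what a SAM file contains, the sketch below reads a few hand‐written alignment records and classifies them. It relies on three features of the SAM format: header lines start with `@`, the second column is a bitwise FLAG in which 0x4 marks an unmapped read, and an `N` operation in the CIGAR string (sixth column) marks a skipped region of the reference such as an intron. The records, read names and the function name are made up for this example.

```python
from collections import Counter

# Hypothetical SAM content: two header lines and three alignment records.
SAM_TEXT = """\
@HD\tVN:1.6\tSO:coordinate
@SQ\tSN:chr1\tLN:1000
read1\t0\tchr1\t101\t60\t50M\t*\t0\t0\t*\t*
read2\t0\tchr1\t140\t60\t20M300N30M\t*\t0\t0\t*\t*
read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*
"""


def summarize_sam(text):
    """Count unmapped, spliced (junction) and unspliced reads."""
    counts = Counter()
    for line in text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        flag, cigar = int(fields[1]), fields[5]
        if flag & 0x4:
            counts["unmapped"] += 1
        elif "N" in cigar:
            counts["spliced"] += 1  # alignment skips an intron
        else:
            counts["unspliced"] += 1
    return counts


print(dict(summarize_sam(SAM_TEXT)))
```

In practice, SAM and BAM files are usually read with dedicated libraries rather than by splitting lines by hand; the manual parsing here only serves to show the columns.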

42.3.4 Why is normalization needed?

RNA‐Seq read counts are normalized for sequencing depth and gene length. A sample sequenced to a greater depth will have more reads mapped to each gene, and longer genes will have more reads mapped to them than shorter genes expressed at the same level. FPKM and TPM are the metrics used for normalizing for the length of the gene and the depth of sequencing. A further metric, TMM, normalizes for differences in RNA composition between samples. Some parts of the following section are adapted from statquest.org:

42.3.4.1 R/FPKM (Mortazavi et al., 2008)

Reads/fragment per kilobase of exon per million mappable reads. FPKM is used for paired‐end, and RPKM for single‐end reads. “Per million reads” means the counts of fragments are normalized against the library sizes, which allows comparison of the same gene across samples. This value is further normalized per kilobase of exon, by dividing by the total length of all exons in the gene. This allows comparison of the expression of genes of different lengths. Cufflinks gives FPKM values, whereas RSEM gives both the FPKM and TPM values.

FPKM normalization involves two steps. In step 1, reads are normalized for library sizes, while step 2 involves normalization for the length of the gene. Here we consider four genes, with a variable number of reads for three samples.

  1. Step 1: Divide the reads of each gene by the total reads of the sample. The total reads for samples 1, 2 and 3 are 7 M, 9 M and 21.2 M, respectively (for simplicity, millions of reads are represented here on a scale of tens of reads, so totals of 70, 90 and 212 reads give scaling factors of 7, 9 and 21.2). Dividing the reads for each gene in each sample by the corresponding scaling factor gives the reads per million (RPM) values shown in Table 42.3 (e.g., for sample 1, gene 1: 20/7 = 2.86).
  2. Step 2: Divide the values obtained after step 1 by the gene length in kb. The reads are now scaled for both library size and gene length, giving the RPKM values (e.g., for sample 1, gene 1: 2.86/2 = 1.43).
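The two steps above can be sketched in Python using the read counts and gene lengths of the worked example. The function and variable names are illustrative, and the `scale` argument is an assumption that mirrors the chapter's convention of treating tens of reads as millions; for real data it would be 1,000,000.

```python
# Read counts per gene for samples 1, 2 and 3 (worked example in this section).
COUNTS = {
    "gene1": [20, 24, 60],
    "gene2": [40, 50, 120],
    "gene3": [10, 16, 30],
    "gene4": [0, 0, 2],
}
LENGTHS_KB = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}


def rpkm_table(counts, lengths_kb, scale=10):
    """RPKM: normalize for library size first, then for gene length."""
    n_samples = len(next(iter(counts.values())))
    # Step 1: per-sample scaling factor (total reads / scale).
    factors = [sum(c[s] for c in counts.values()) / scale for s in range(n_samples)]
    # Step 2: divide the depth-normalized value by gene length in kb.
    return {
        gene: [c[s] / factors[s] / lengths_kb[gene] for s in range(n_samples)]
        for gene, c in counts.items()
    }


rpkm = rpkm_table(COUNTS, LENGTHS_KB)
for gene, values in rpkm.items():
    print(gene, [round(v, 3) for v in values])
print("column totals:", [round(sum(v[s] for v in rpkm.values()), 2) for s in range(3)])
```

The column totals differ from sample to sample, which is worth keeping in mind when RPKM values are compared across samples.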

TABLE 42.3 Reads for each gene in samples 1, 2 and 3 with the sample totals (top), and the values obtained by dividing the reads for each gene in each sample by the corresponding total reads (RPM, bottom).

Genes       Sample 1   Sample 2   Sample 3
1 (2 kb)    20         24         60
2 (4 kb)    40         50         120
3 (1 kb)    10         16         30
4 (10 kb)   0          0          2
Total       70 (7 M)   90 (9 M)   212 (21.2 M)

Genes       Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
1 (2 kb)    2.86             2.67             2.83
2 (4 kb)    5.71             5.56             5.66
3 (1 kb)    1.43             1.78             1.42
4 (10 kb)   0                0                0.09

TABLE 42.4 Calculation of RPKM by dividing the values obtained after step 1 for each gene by the gene length.

Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2 kb)                 1.43              1.33              1.42
2 (4 kb)                 1.43              1.39              1.42
3 (1 kb)                 1.43              1.78              1.42
4 (10 kb)                0                 0                 0.009
Total normalized reads   4.29              4.5               4.25

42.3.4.2 TPM (Transcripts per million) (Li et al., 2010; Wagner et al., 2012)

This metric corrects for transcript length distribution in an RNA pool and provides better across‐sample comparability. Here, the read counts are first normalized for gene length; the sum of these length‐normalized values in each sample is then used to normalize for library size. RSEM gives TPM.

Calculation of TPM is as follows:

  1. Step 1: Divide the reads of each gene (Table 42.5) by the length of the gene in kb to obtain reads per kb (RPK; Table 42.6):

    TABLE 42.5 Reads for each gene in samples 1, 2, and 3.

    Genes       Sample 1   Sample 2   Sample 3
    1 (2 kb)    20         24         60
    2 (4 kb)    40         50         120
    3 (1 kb)    10         16         30
    4 (10 kb)   0          0          2

    TABLE 42.6 Reads per kb (RPK) of gene for samples 1, 2, and 3 (millions of reads equated to a scale of tens of reads).

    Genes       Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
    1 (2 kb)    10               12               30
    2 (4 kb)    10               12.5             30
    3 (1 kb)    10               16               30
    4 (10 kb)   0                0                0.2
    Total       30 (3 M)         40.5 (4.05 M)    90.2 (9.02 M)

    Total reads per kb of gene for samples 1, 2 and 3 are 3 M, 4.05 M, and 9.02 M, respectively (millions of reads equated to a scale of tens of reads).

  2. Step 2: Divide the RPK values obtained after step 1 by the total reads per kb of gene for the sample (e.g., for sample 1, gene 1: 10/3 = 3.33):

    TABLE 42.7 Calculation of TPM by dividing the RPK values obtained in step 1 by the total reads per kb of gene for each sample.

    Genes                    Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
    1 (2 kb)                 3.33             2.96             3.326
    2 (4 kb)                 3.33             3.09             3.326
    3 (1 kb)                 3.33             3.95             3.326
    4 (10 kb)                0                0                0.02
    Total normalized reads   10               10               10

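    TABLE 42.8 RPKM values, for comparison with TPM (Table 42.9).

    Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
    1 (2 kb)                 1.43              1.33              1.42
    2 (4 kb)                 1.43              1.39              1.42
    3 (1 kb)                 1.43              1.78              1.42
    4 (10 kb)                0                 0                 0.009
    Total normalized reads   4.29              4.5               4.25

    TABLE 42.9 TPM values, for comparison with RPKM (Table 42.8).

    Genes                    Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
    1 (2 kb)                 3.33             2.96             3.326
    2 (4 kb)                 3.33             3.09             3.326
    3 (1 kb)                 3.33             3.95             3.326
    4 (10 kb)                0                0                0.02
    Total normalized reads   10               10               10

    With TPM, the normalized values of every sample add up to the same total (10 here), so each value directly reflects the share of the sample's length‐normalized reads that map to a gene. With RPKM, the totals differ between samples (4.29, 4.5 and 4.25), which makes such shares harder to compare across samples.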
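The TPM calculation differs from RPKM only in the order of the two normalizations, as the following sketch shows for the same read counts. The names are illustrative, and the `scale` argument is again an assumption that mirrors the chapter's convention of treating tens of reads as millions (1,000,000 for real data).

```python
# Read counts per gene for samples 1, 2 and 3 (Table 42.5).
COUNTS = {
    "gene1": [20, 24, 60],
    "gene2": [40, 50, 120],
    "gene3": [10, 16, 30],
    "gene4": [0, 0, 2],
}
LENGTHS_KB = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}


def tpm_table(counts, lengths_kb, scale=10):
    """TPM: normalize for gene length first, then for library size."""
    n_samples = len(next(iter(counts.values())))
    # Step 1: reads per kb (RPK) for every gene and sample.
    rpk = {g: [c / lengths_kb[g] for c in cs] for g, cs in counts.items()}
    # Step 2: divide by the per-sample RPK total (scaled).
    factors = [sum(r[s] for r in rpk.values()) / scale for s in range(n_samples)]
    return {g: [r[s] / factors[s] for s in range(n_samples)] for g, r in rpk.items()}


tpm = tpm_table(COUNTS, LENGTHS_KB)
for gene, values in tpm.items():
    print(gene, [round(v, 3) for v in values])
print("column totals:", [round(sum(v[s] for v in tpm.values()), 2) for s in range(3)])
```

Every column now sums to the same value, because the last operation divides by the column total itself.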

42.3.4.3 TMM – Trimmed mean of M values (Robinson and Oshlack, 2010)

This metric corrects for differences in RNA composition between samples. In a sample in which certain genes are very highly expressed, a large share of the reads is taken up by those genes, so there is less scope for the less expressed genes to be sequenced, and RPKM or TPM normalization will yield biased expression values.

FIGURE 42.2 Differences in RNA composition between samples: genes A, B, and C (ellipses shaded light to dark) within RNA populations 1 and 2 (filled ovals).

With the same sequencing depth for populations 1 and 2, A and C will have a lower RPKM in RNA population 1, though the expression of these genes is the same in populations 1 and 2 (Robinson and Oshlack, 2010). TMM is used by the edgeR DE package to normalize counts data.
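The composition effect, and the idea behind TMM, can be illustrated with a deliberately simplified sketch. The counts are invented: nine genes are expressed equally in both samples, while one gene is far more abundant in sample 1, and both samples are sequenced to the same depth. The function computes the log2 ratio (M value) of depth‐normalized counts for each gene, trims the most extreme 30% at each end, and converts the mean of the remaining values into a scaling factor. This is an assumption‐laden simplification: the published method also trims on absolute expression, weights genes by precision and selects a reference sample, none of which is reproduced here.

```python
import math


def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of log2 ratios (M values), as a scaling factor."""
    n_sample, n_reference = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_sample) / (r / n_reference))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # genes with zero counts have no defined ratio
    )
    cut = int(len(m_values) * trim)
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts at equal depth (1000 reads per sample).
sample1 = [50] * 9 + [550]   # last gene dominates the RNA pool
sample2 = [100] * 10

factor = simple_tmm_factor(sample1, sample2)
effective_size1 = sum(sample1) * factor

print("naive:   ", sample1[0] / sum(sample1), "vs", sample2[0] / sum(sample2))
print("adjusted:", sample1[0] / effective_size1, "vs", sample2[0] / sum(sample2))
```

Dividing by the raw library size makes the nine unchanged genes look half as abundant in sample 1; dividing by the adjusted library size removes that artefact, because the dominant gene is trimmed away before the factor is computed.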

42.4 QUESTIONS

  1. What will be the coverage if 6 Gb data are generated for a transcriptome of 6 × 10^7 bp and a genome of 3 × 10^9 bp?
  2. The total exon size of a gene is 3000 nt. Calculate the expression level for this gene in RPKM, in an RNA‐seq experiment that contained 50 million mappable reads, with 600 reads falling into exon regions of this gene.
  3. Which approach, exon‐first or seed‐extend, is more appropriate for mapping reads from polymorphic species?