GVPPSR Kumar, A Kumar and AP Sahoo
Animal Biotechnology Division, IVRI, UP, India
This chapter is discussed in three sections:
The data generated from most of the platforms are in FASTQ format (i.e., base call data). The data files for this chapter are designated as control.fastq and infected.fastq. Both the fastq files have paired end reads. These data need to be initially checked and quality trimmed for further use. The most commonly used program for quality filtering/trimming is prinseq‐lite.pl. There are several options in Prinseq‐lite for data trimming and/or filtering. First, trimming is done, followed by execution of the filtering commands. Trimming is commonly done to remove the adapter sequences present in the raw data generated. It is also used to remove the poly A tail at the end of the read.
PRINSEQ can generate summary statistics for a data set and can filter, reformat, and trim the sequence data. It can be used for all types of sequence data, and it is available both through a web interface and as a standalone program.
The command for quality filtering is given below (Figure 44.1):
perl prinseq-lite.pl -fastq control.fastq -out_format 5 -min_len 50 -min_qual_mean 25
For further help, type perl prinseq-lite.pl -h on the command line.
prinseq-lite is to be run on both the data files as given below:
For the control sample:
perl prinseq-lite.pl -fastq control_R1.fastq -out_format 5 -min_len 50 -min_qual_mean 25
For the infected sample:
perl prinseq-lite.pl -fastq infected_R1.fastq -out_format 5 -min_len 50 -min_qual_mean 25
These quality‐filtered data are further analyzed through different pipelines. Here, we initially discuss Cufflinks, and then the RSEM–DE package. The summary statistics, good and bad files generated from control_R1.fastq, are given below.
A summary of total input sequences, the number of good sequences as per the details provided in the command, the number of bad sequences, and so on, is obtained for each dataset (Figure 44.2).
With the output format 5, six files (three “good” and three “bad”) are generated in the folder from which the command is run (Figure 44.3). The next steps of the analysis pipeline will use the “Good” fastq file.
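To make the two filter settings concrete, the sketch below applies the same kind of rule (a minimum read length and a minimum mean quality) to a small in-memory FASTQ snippet. It is only an illustration and not PRINSEQ itself: the Phred+33 quality encoding, the function names, and the toy reads are assumptions.

```python
from statistics import mean

def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return mean(ord(ch) - offset for ch in quality)

def split_good_bad(fastq_text, min_len=50, min_qual_mean=25):
    """Split 4-line FASTQ records into 'good' and 'bad' lists of read IDs."""
    lines = fastq_text.strip().splitlines()
    good, bad = [], []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        keep = len(seq) >= min_len and mean_phred(qual) >= min_qual_mean
        (good if keep else bad).append(header[1:])  # drop the leading '@'
    return good, bad

# Hypothetical toy reads, not taken from the chapter's data files
toy_fastq = "\n".join([
    "@read_ok", "A" * 60, "+", "I" * 60,      # length 60, mean quality 40
    "@read_short", "A" * 20, "+", "I" * 20,   # shorter than min_len
    "@read_lowq", "A" * 60, "+", "#" * 60,    # mean quality 2
])
good, bad = split_good_bad(toy_fastq)
print("good:", good)
print("bad:", bad)
```

Only the first toy read passes both thresholds; the other two would end up in the "bad" output.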
“Cufflinks” stands for the suite of software tools as well as the program, which assembles and estimates abundances of transcripts and evaluates differential expression in samples. It accepts mapped reads and assembles them into a parsimonious set of transcripts. “Cuffdiff” then estimates FPKM or RPKM by normalizing for both the library size and gene length (Trapnell et al., 2012).
The prinseq‐lite output (the good files) of all the data files can be analyzed either by mapping to the reference genome or by de novo assembling the transcriptome. Here, we will be illustrating data analysis using a reference‐based approach, by mapping the reads using GMAP‐GSNAP.
Before proceeding with GMAP‐GSNAP, we need to initially download the GTF and the FASTA file sequence of the reference genome. Note that the reference FASTA file and the GTF file should be downloaded from the same genome browser. The most commonly used genome browsers are NCBI, UCSC, and Ensembl. Here, we download the GTF and FASTA files from the UCSC genome browser.
Go to the UCSC genome browser, and click on downloads (Figure 44.5) and then on genomics data, to select the species of your interest (Figure 44.6). Here we select the cow to open the cow genome files.
When you click on the bosTau8.fa.gz, you will be able to download a file of 866.1 MB which, on gunzipping, would give a file of 2.72 GB (Figure 44.7).
The GTF file can be downloaded from UCSC by clicking on the table browser, and then selecting the options shown below (Figure 44.8).
GMAP-GSNAP is a standalone program for mapping and aligning reads to a genome. It performs fast batch processing of large sequence sets, aligning sequences with minimal startup time and memory requirements. The program generates accurate gene structures without using probabilistic splice site models, even when many polymorphisms and sequence errors are present in the data (Wu and Watanabe, 2005). The downloaded genome sequence is first indexed, and this index is then used for mapping the filtered reads to generate the Sequence Alignment/Map (SAM) file(s).
Identification of the differentially expressed genes is explained in the steps below: indexing the genome (Step 1, Figures 44.9 and 44.10), mapping the reads to the indexed genome to generate SAM files (Step 2), converting the SAM files to BAM files using Samtools (Step 3), sorting the BAM files (Step 4), and testing for differential expression using the Cufflinks suite (Step 5, Figures 44.11 – 44.13).
Step 1: Command for indexing the genome: gmap_build -d btau8 bosTau8.fa
Here, the FASTA reference genome (bosTau8.fa) is indexed as btau8.
The index files created are as below in the folder btau8.
Step 2: Mapping the reads to the genome.
The good fastq files from the prinseq-lite.pl output for the control and infected samples are renamed control_R1.fastq and infected_R1.fastq, respectively.
Note: R1 and R2 paired end reads of the same sample are treated as replicates for further analysis.
Command for mapping:
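gsnap -d <genome> -t <nthreads> -A sam <fastq_file> > <output_file.sam>
The -A sam option requests SAM output; without it, gsnap writes its own native alignment format.
Example:
For the control sample:
gsnap -d btau8 -t 4 -A sam control_R1.fastq > control_R1.sam
For the infected sample:
gsnap -d btau8 -t 4 -A sam infected_R1.fastq > infected_R1.sam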
The end product of the GMAP‐GSNAP aligner is a SAM file, which needs to be converted into a BAM file for further analysis in cufflinks. Repeat the same for the other replicates. A total of four SAM files are generated separately for two replicates of each sample.
Step 3: Converting SAM to BAM using Samtools.
Samtools is useful for manipulating alignments in the SAM and BAM formats. It imports from and exports to the SAM format, and does sorting, merging and indexing (Li et al., 2009a, 2009b).
Command for SAM to BAM conversion: ./samtools view -bSh aln.sam > aln.bam
-b: Output in the BAM format. -S: Input in the SAM format (uppercase S; recent samtools versions detect the input format automatically). -h: Include header in the output
Example:
For the control sample:
./samtools view -bSh control_R1.sam > control_R1.bam
For the infected sample:
./samtools view -bSh infected_R1.sam > infected_R1.bam
Step 4: Sorting BAM using samtools
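Command for sorting: ./samtools sort aln.bam aln.sorted (this is the legacy samtools 0.1.x syntax, which writes aln.sorted.bam; samtools 1.3 and later use ./samtools sort -o aln.sorted.bam aln.bam)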
Example:
For the control sample:
./samtools sort control_R1.bam control_R1_sorted
For the infected sample:
./samtools sort infected_R1.bam infected_R1_sorted
The BAM files generated can be analyzed in two ways:
Step 5 (Option 1): Differential expression using cufflinks, cuffmerge, and cuffdiff.
Command for running Cufflinks on a BAM file (Figures 44.11 and 44.12):
For the control sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN control_R1_sorted.bam
For the infected sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN infected_R1_sorted.bam
The transcripts.gtf files (Figures 44.10 and 44.11) for each replicate are renamed as per the sample and replicate, and are further used in Cuffmerge to generate a merged assembly. This merged assembly is then used in Cuffdiff to generate differentially expressed genes.
Command for running Cuffmerge:
cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
assemblies.txt is the file with the list of all the GTFs (transcripts.gtf) for all the replicates of all the samples. The file assemblies.txt is a text file, which looks like the file below (Figure 44.13).
The Cuffmerge command generates a merged.gtf in the merged_asm folder. This file is used in the next Cuffdiff command.
Command for running cuffdiff:
CuffDiff computes differentially expressed genes. The design of experiment should consider at least two contrasting groups of experimental subjects (e.g., healthy vs. diseased) for identifying the differentially expressed genes. CuffDiff should always be run on replicates (i.e., infected vs. control).
cuffdiff merged.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
This command generates many files, out of which, gene_exp.diff is the file to look for the differentially expressed genes.
Step 5 (Option 2): Differential expression using CuffDiff directly from the sorted bam file.
Command:
cuffdiff -p <num-threads> -N transcripts.gtf
-p: num-threads <int>. -N
cuffdiff -p 3 -N btau8refflat.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam -o cuffdiff_out
The gene_exp.diff is the file in which to look for the differentially expressed genes. The file contains the fields as marked below (Figure 44.14).
Calculation of Log2fold change for A1BG gene (row 3 in Figure 44.14 above):
log2 fold change = log2(FPKM of infected / FPKM of control)
= log2(0.576748 / 3.92513) = -2.76673
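The same arithmetic can be checked with a few lines of code. This is a minimal sketch; the function name is hypothetical, and the two FPKM values are the ones quoted above for A1BG.

```python
import math

def log2_fold_change(fpkm_infected, fpkm_control):
    """log2 of the ratio of infected FPKM to control FPKM."""
    return math.log2(fpkm_infected / fpkm_control)

lfc = log2_fold_change(0.576748, 3.92513)
print(round(lfc, 3))
```

A negative value means lower expression in the infected sample than in the control. A gene with an FPKM of zero in the control would need separate handling, since the ratio is then undefined.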
RSEM is an RNA-Seq analysis package that provides an end-to-end solution for differential expression and simplifies the whole process (Li and Dewey, 2011). It also introduces a more robust unit of RNA-Seq measurement, transcripts per million (TPM). Expression counts are first calculated using RSEM. These counts for all the samples and their replicates are then used in differential expression (DE) packages to identify differentially expressed genes (DEGs).
Calculating expression counts using RSEM is explained in nine steps below:
Step 1: Downloading RSEM and installing.
By using the wget command, RSEM can be downloaded using the link below. After unzipping the folder, run “make” to install RSEM.
wget http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.19.tar.gz
tar -xvzf rsem-1.2.19.tar.gz
cd rsem-1.2.19
make
Step 2: Prerequisites required for running RSEM.
Perl, R, and Bowtie need to be installed. Perl and R are normally present on most computers. Bowtie 2 needs to be added to your path (explained in steps 3 and 4 below).
Step 3: Downloading Bowtie and installing
Download Bowtie from http://sourceforge.net/projects/bowtie‐bio/files/bowtie/1.1.1/
Step 4: Copy bowtie in your path or add bowtie path in bash profile.
Copying bowtie in your path:
sudo cp -R /Users/appleserver/Desktop/bowtie2 /usr/local/bin
Add bowtie path in bash profile (preferred). Open the .bash_profile (Figure 44.16), add the line below to the file, and then run source on ~/.bash_profile:
export PATH="/Users/ravikumar/Desktop/bowtie2:$PATH"
source ~/.bash_profile
To check whether the path has been added, type echo $PATH (Figure 44.17).
Step 5: Downloading the reference, gunzipping and concatenating
Download Bos taurus genome from Ensembl genome browser. An easier alternative is to use the wget command for a direct download on HPC (Figure 44.18):
wget -m ftp://ftp.ensembl.org/pub/release-81/fasta/bos_taurus/dna/ && for f in $(find . -name "*.gz")
The folder that is created after the download is ftp.ensembl.org (Figure 44.19). This folder contains FASTA files of all chromosomes (Figure 44.20). These FASTA files are further concatenated into a single file (combined.fa) containing all chromosomes.
A direct download of each chromosome from the ftp site can also be done as given below (Figure 44.21). However, this is time‐consuming. The first option, downloading using the wget command, is faster.
The files downloaded are gunzipped using:
gunzip Bos_taurus.UMD3.1.dna.chromosome.*.fa.gz
Concatenating/combining all the fasta files into a combined fasta file (reference):
cat Bos_taurus.UMD3.1.dna.chromosome.*.fa > combined.fa
Step 6: Download annotation file in gtf format.
Command for downloading the gtf:
wget -m ftp://ftp.ensembl.org/pub/release-81/gtf/bos_taurus
The gtf file downloaded needs to be modified for RSEM to extract only the exon annotations. This is done by using an “awk” command to create a filtered.gtf file.
awk command to extract the exon annotations from gtf:
awk '$3 == "exon"' Bos_taurus.UMD3.1.8.1.gtf > filtered.gtf
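For readers who want to see what the exon filter does line by line, the following sketch reproduces the same idea: keep only the records whose third column (the feature type) is "exon". The function name and the toy annotation lines are hypothetical, and the sketch splits on tabs, whereas awk splits on any whitespace by default.

```python
from typing import List

def keep_exons(gtf_text: str) -> List[str]:
    """Return only the GTF lines whose third tab-separated field is 'exon'."""
    kept = []
    for line in gtf_text.splitlines():
        fields = line.split("\t")
        # Header lines such as '#!genome-build ...' have no third field and are dropped
        if len(fields) >= 3 and fields[2] == "exon":
            kept.append(line)
    return kept

# Hypothetical toy annotation, not copied from the Ensembl file
toy_gtf = "\n".join([
    "#!genome-build UMD3.1",
    "1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id \"G1\";",
    "1\tensembl\texon\t100\t300\t.\t+\t.\tgene_id \"G1\";",
    "1\tensembl\tCDS\t150\t300\t.\t+\t0\tgene_id \"G1\";",
    "1\tensembl\texon\t600\t900\t.\t+\t.\tgene_id \"G1\";",
])
exons = keep_exons(toy_gtf)
print(len(exons), "exon lines kept")
```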
Step 7: Prepare reference using RSEM
To prepare the reference sequence, run the “rsem‐prepare‐reference” program.
Preparing the reference simply indexes the reference sequence. This creates 12 index files (Figure 44.22) with the name BT and the extension bt2.
Step 8: Calculating expression values in counts, TPM and FPKM:
To calculate expression values, the “rsem‐calculate‐expression” program is used.
rsem-calculate-expression should be run for each of the replicates (_R1 and _R2) of both the samples. Each run generates six files, as shown in Figure 44.23, of which genes.results is the most important for identifying the differentially expressed genes.
For the control sample:
./rsem-calculate-expression --bowtie2 control_R1.fastq BT ControlR1
For the infected sample:
./rsem-calculate-expression --bowtie2 infected_R1.fastq BT infectedR1
The output ControlR1.genes.results gives the expected counts, TPM and FPKM for each of the ensemblIDs (Figure 44.24).
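The relationship between the three columns can be illustrated with the textbook definitions of TPM and FPKM. This is a sketch under stated assumptions, not RSEM's internal calculation: the function name, the toy counts, and the toy effective lengths are hypothetical.

```python
def tpm_and_fpkm(counts, eff_lengths):
    """Textbook TPM and FPKM from expected counts and effective lengths (bp)."""
    total = sum(counts)                                   # total mapped fragments
    rates = [c / l for c, l in zip(counts, eff_lengths)]  # fragments per base
    rate_sum = sum(rates)
    tpm = [r / rate_sum * 1e6 for r in rates]
    fpkm = [r * 1e9 / total for r in rates]
    return tpm, fpkm

counts = [100.0, 300.0, 600.0]          # toy expected counts for three genes
eff_lengths = [1000.0, 1500.0, 6000.0]  # toy effective lengths
tpm, fpkm = tpm_and_fpkm(counts, eff_lengths)
print([round(v) for v in tpm])
print([round(v) for v in fpkm])
```

TPM values always sum to one million within a sample, which is what makes them easier to compare across samples than FPKM; the raw expected counts, however, are what the DE packages below take as input.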
Step 9: Combining RSEM genes.results of all the files. The expected counts of all the ensemblIDs for all four files (two replicates each of control and infected) are combined (Figure 44.25).
Command for combining the RSEM genes.results of all the files:
./rsem-generate-data-matrix *.genes.results > genes.results
After rounding these expected counts values to the nearest integer (Figure 44.25), they can be used in programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
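The rounding step can be sketched as follows. The gene identifiers and count values are toy data, and the function name is hypothetical; in practice the matrix would come from the combined RSEM output.

```python
def rounded_count_matrix(per_sample):
    """per_sample maps sample name -> {gene_id: expected_count}.

    Returns the sample order and a {gene_id: [integer counts]} matrix.
    """
    samples = list(per_sample)
    genes = sorted(set().union(*(d.keys() for d in per_sample.values())))
    matrix = {
        g: [int(round(per_sample[s].get(g, 0.0))) for s in samples]
        for g in genes
    }
    return samples, matrix

# Toy expected counts (hypothetical values)
toy = {
    "ControlR1": {"geneA": 883.6, "geneB": 5.2, "geneC": 0.0},
    "infectedR1": {"geneA": 614.7, "geneB": 2.6, "geneC": 0.0},
}
samples, matrix = rounded_count_matrix(toy)
print(samples)
print(matrix)
```

Note that Python's round() sends exact halves to the nearest even integer, so a value such as 2.5 becomes 2; this rarely matters for expected counts but is worth knowing when comparing with other tools.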
(using EBSeq, DESeq2 and edgeR)
EBSeq is an R package for identifying differentially expressed genes (DEGs) across biological conditions. EBSeq uses RSEM counts as input to identify differentially expressed genes. Identifying the DEGs using EBSeq is explained in six steps below.
Step 1: Installing EBSeq.
To install, type the following commands in R:
source("https://bioconductor.org/biocLite.R")
biocLite("EBSeq")
Step 2: Command for loading the package EBSeq (Figure 44.26).
>library(EBSeq)
Step 3: Command for getting the working directory.
>getwd()
Step 4: Command for setting the working directory (Figure 44.27).
> setwd()
Set the working directory to RSEM.
Step 5: Input requirement for Gene level DE analysis:
The input file formats supported by EBSeq are .csv, .xls, .xlsx, and .txt (tab delimited). In the input file, rows should be the genes, and columns should be the samples. An example of the data set in .txt format (genesresults.txt) is given in Figure 44.27.
Step 6: Commands to run EBSeq (each command is explained under “Explaining the commands”; see also the EBSeq vignette at https://www.bioconductor.org/packages/3.3/bioc/vignettes/EBSeq/inst/doc/EBSeq_Vignette.pdf):
> x=data.matrix(read.table("genesresults.txt"))
> dim(x)
[1] 24596 4
> str(x)
num [1:24596, 1:4] 615 3 0 473 1 286 832 362 103 17 …
 - attr(*, "dimnames")=List of 2
..$ : chr [1:24596] "ENSBTAG00000000005" "ENSBTAG00000000008" "ENSBTAG00000000009" "ENSBTAG00000000010" …
..$ : chr [1:4] "infectedR1.genes.results" "infectedR2.genes.results" "ControlR1.genes.results" "ControlR2.genes.results"
> Sizes=MedianNorm(x)
> EBOut=EBTest(Data=x,
+ Conditions=as.factor(rep(c("C1","C2"),each=2)),sizeFactors=Sizes,
+ maxround=5)
Removing transcripts with 75th quantile < = 10
12071 transcripts will be tested
iteration 1 done
time 0.12
iteration 2 done
time 0.13
iteration 3 done
time 0.08
iteration 4 done
> PP=GetPPMat(EBOut)
> str(PP)
num [1:12071, 1:2] 1 1 0 0 1 …
 - attr(*, "dimnames")=List of 2
..$ : chr [1:12071] "ENSBTAG00000000005" "ENSBTAG00000000010" "ENSBTAG00000000012" "ENSBTAG00000000013" …
..$ : chr [1:2] "PPEE" "PPDE"
> DEfound=rownames(PP)[which(PP[,"PPDE"]>=.95)]
> str(DEfound)
chr [1:6528] "ENSBTAG00000000012" "ENSBTAG00000000013" "ENSBTAG00000000015" "ENSBTAG00000000019" "ENSBTAG00000000021" "ENSBTAG00000000025" "ENSBTAG00000000026" "ENSBTAG00000000032" …
> write.table(DEfound,"DE.txt",sep = " ",quote = F,col.names=F)
> GeneFC=PostFC(EBOut)
> write.table(GeneFC,"FC.txt",sep = " ",quote = F,col.names=F)
Running of EBSeq in R
The output file – FC.txt
The other output file – DE.txt
Explaining the commands: (https://www.bioconductor.org/packages/3.3/bioc/vignettes/EBSeq/inst/doc/EBSeq_Vignette.pdf)
The object data should be a G × S matrix containing the expression values for each gene and each sample;
where: “G”: number of genes
“S”: number of samples.
These values should be raw counts, without normalization across samples. The dim(x) command gives the dimensions of the matrix, and the str(x) command gives the structure of the data: the num line of its output shows the first values of the matrix, and the attr lines show the row names and column names.
> x=data.matrix(read.table("genesresults.txt"))
> dim(x)
[1] 24596 4
> str(x)
num [1:24596, 1:4] 615 3 0 473 1 286 832 362 103 17 …
 - attr(*, "dimnames")=List of 2
..$ : chr [1:24596] "ENSBTAG00000000005" "ENSBTAG00000000008" "ENSBTAG00000000009" "ENSBTAG00000000010" …
..$ : chr [1:4] "infectedR1.genes.results" "infectedR2.genes.results" "ControlR1.genes.results" "ControlR2.genes.results"
In our analysis, object “x” is a simulated data matrix containing 24 596 rows of genes and four columns of samples. The genes are named “ENSBTAG00000000005”, “ENSBTAG00000000008”… (Figure 44.28).
EBSeq requires the library size factors for each of the samples. This is achieved by the function MedianNorm, which uses the median normalization approach.
> Sizes=MedianNorm(x)
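The general idea behind a median normalization can be sketched as a median-of-ratios calculation: each sample's counts are compared with a per-gene reference (the geometric mean across samples), and the median ratio becomes that sample's size factor. The code below illustrates this idea only and is not EBSeq's implementation; the function name and toy counts are assumptions.

```python
import math
from statistics import median

def median_of_ratios(matrix):
    """matrix: list of gene rows, each a list of raw counts per sample."""
    # Genes with a zero count in any sample have no usable geometric mean
    usable = [row for row in matrix if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    n_samples = len(matrix[0])
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy counts: the second sample was sequenced twice as deeply as the first
toy_counts = [[10, 20], [100, 200], [50, 100], [0, 5]]
size_factors = median_of_ratios(toy_counts)
print([round(f, 3) for f in size_factors])
```

In this toy case the second size factor is twice the first, reflecting the twofold difference in sequencing depth.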
The function EBTest is used to detect DE genes. We define the conditions and size factors.
Explaining the conditions to EBseq:
The object conditions should be a vector of length S that indicates to which condition each sample belongs. For example, if there are two conditions and two samples in each, then S = 4 and conditions may be given by as.factor(c("C1","C1","C2","C2")). Here, the first two columns of x (the infected samples) belong to condition C1 and the other two (the control samples) to condition C2, so the conditions are defined as:
Conditions=as.factor(rep(c("C1","C2"),each=2))
Normalization using sizeFactors:
Similarly, sizeFactors in the EBTest command is used to define the library size factor of each sample. It could be obtained by summing up the total number of reads per sample. We can opt for median normalization, scaling normalization, upper‐quantile normalization or some other such approach. Here, we are doing a median normalization and running the EM algorithm by setting the number of iterations to five via maxround=5, which can be seen in the output of step 6 above (Figure 44.28).
> EBOut=EBTest(Data=x,
+ Conditions=as.factor(rep(c("C1","C2"),each=2)),sizeFactors=Sizes,
+ maxround=5)
Calculating the probabilities for the DE genes:
The list of DE genes and the posterior probabilities of being DE are obtained as follows (Figure 44.29):
> PP=GetPPMat(EBOut)
> str(PP)
num [1:12071, 1:2] 1 1 0 0 1 …
 - attr(*, "dimnames")=List of 2
..$ : chr [1:12071] "ENSBTAG00000000005" "ENSBTAG00000000010" "ENSBTAG00000000012" "ENSBTAG00000000013" …
..$ : chr [1:2] "PPEE" "PPDE"
PPEE gives the posterior probability of being equally expressed, and PPDE gives the posterior probability of being differentially expressed. The matrix has 12 071 rows, one for each gene that passed the low-expression filter and was tested (Figure 44.29).
Differentially expressed genes at the 5% level of significance:
To get the DE genes at the 5% level, we run the DEfound command. DEfound is the list of genes with PPDE >= 0.95, which corresponds to a target FDR of 0.05. EBSeq found 6528 genes differentially expressed at this threshold (Figure 44.29).
> DEfound=rownames(PP)[which(PP[,"PPDE"]>=.95)]
> str(DEfound)
chr [1:6528] "ENSBTAG00000000012" "ENSBTAG00000000013" "ENSBTAG00000000015" "ENSBTAG00000000019" "ENSBTAG00000000021" "ENSBTAG00000000025" "ENSBTAG00000000026" "ENSBTAG00000000032" …
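Why a PPDE cutoff of 0.95 relates to an FDR of 0.05 can be seen from a small calculation: for every selected gene the posterior probability of being equally expressed (1 - PPDE) is at most 0.05, so the average over the selected list, which estimates its false discovery rate, cannot exceed 0.05. The sketch below uses hypothetical gene names and probabilities.

```python
def call_de(ppde, threshold=0.95):
    """ppde: {gene_id: posterior probability of being differentially expressed}."""
    found = [g for g, p in ppde.items() if p >= threshold]
    # Expected FDR of the list: mean posterior probability of being EE
    expected_fdr = sum(1 - ppde[g] for g in found) / len(found) if found else 0.0
    return found, expected_fdr

# Toy posterior probabilities (hypothetical)
toy_ppde = {"geneA": 1.0, "geneB": 0.97, "geneC": 0.60, "geneD": 0.0}
found, expected_fdr = call_de(toy_ppde)
print(found, round(expected_fdr, 3))
```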
Calculating the fold change:
“PostFC” calculates the posterior fold change for each transcript across conditions (Figure 44.29).
> GeneFC=PostFC(EBOut)
Writing the files:
write.table is used to write the fold changes and the differentially expressed genes into files. Here the fold changes are saved as FC.txt and the DEGs as DE.txt (Figures 44.30 and 44.31).
DESeq2 performs differential expression analysis based on the negative binomial distribution. It uses RSEM counts as input to identify differentially expressed genes.
To install, type the following commands in R:
>source("https://bioconductor.org/biocLite.R")
>biocLite("DESeq2")
>library(DESeq2)
Running DESeq2 in R:
Note: commands for getting the working directory and setting the working directory are the same as for step 3 and step 4 of EBSeq.
Step 5: Input requirement for Gene level DE analysis:
The input file formats supported by DESeq2 are .csv, .xls, .xlsx, and .txt (tab delimited). In the input file, rows should be the genes and the columns should be the samples.
Example of the data set in .txt format (roundedfn.txt) that is used here (Figure 44.33):
Step 6: Commands to Run DESeq2:
> counts <- read.table(file = "roundedfn.txt", header = TRUE, row.names = 1)
> class(counts)
[1] "data.frame"
> countdata=data.matrix(counts)
> class(countdata)
[1] "matrix"
> Design = data.frame(
+ row.names = colnames(counts),
+ condition = c("Control", "Control", "infected", "infected"),
+ libType = c("single-end", "single-end", "single-end", "single-end"))
> Design
condition libType
C_R1 Control single-end
C_R2 Control single-end
T_R1 infected single-end
T_R2 infected single-end
> dds <- DESeqDataSetFromMatrix(countData = countdata,
+ colData = Design,
+ design = ~ condition)
> dds
class: DESeqDataSet
dim: 24596 4
exptData(0):
assays(1): counts
rownames(24596): ENSBTAG00000000005 ENSBTAG00000000008 … ENSBTAG00000048316 ENSBTAG00000048317
rowData metadata column names(0):
colnames(4): C_R1 C_R2 T_R1 T_R2
colData names(2): condition libType
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> res <- results(dds)
> resOrdered <- res[order(res$padj),]
> head(resOrdered)
log2 fold change (MAP): condition infected vs Control
Wald test p‐value: condition infected vs Control
DataFrame with 6 rows and 6 columns
> write.table(resOrdered,"DEDESeq2.txt",sep = " ",quote = F,col.names=F)
The output file is DEDESeq2.txt.
To get the DE genes at the 5% level of significance, we select genes with a p-value or adjusted p-value (padj) < 0.05. DESeq2 found 8249 genes below this threshold (Figure 44.37).
Explaining the commands: https://www.bioconductor.org/packages/3.3/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
> counts <- read.table(file = "roundedfn.txt", header = TRUE, row.names = 1)
> countdata=data.matrix(counts)
> Design = data.frame(
+ row.names = colnames(counts),
+ condition = c("Control", "Control", "infected", "infected"),
+ libType = c("single-end", "single-end", "single-end", "single-end"))
With the countData and sample information in Design, we can construct a DESeqDataSet, which is the actual dataset used to identify differentially expressed genes:
> dds <- DESeqDataSetFromMatrix(countData = countdata,
+ colData = Design,
+ design = ~ condition)
> dds
class: DESeqDataSet
dim: 24596 4
exptData(0):
assays(1): counts
rownames(24596): ENSBTAG00000000005 ENSBTAG00000000008 … ENSBTAG00000048316 ENSBTAG00000048317
rowData metadata column names(0):
colnames(4): C_R1 C_R2 T_R1 T_R2
colData names(2): condition libType
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> res <- results(dds)
> resOrdered <- res[order(res$padj),]
> head(resOrdered)
log2 fold change (MAP): condition infected vs Control
Wald test p‐value: condition infected vs Control
DataFrame with 6 rows and 6 columns
Step 1: Installing edgeR.
To install, type the following commands in R:
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
Step 2: Command to load the library.
>library(edgeR)
Step 3: Input requirement for Gene level DE analysis.
The input file formats supported by edgeR are .csv, .xls, .xlsx, and .txt (tab delimited). In the input file, the rows should be the genes, and the columns should be the samples.
Example of the data set in .txt format (roundedfnedge.txt) that is used here:
(Note: commands for getting the working directory and setting the working directory are the same as step 3 and step 4 of EBSeq)
Step 5: Commands to run edgeR:
> raw.data <- read.table(file = "roundedfnedge.txt", header = TRUE)
> counts <- raw.data[, -c(1,ncol(raw.data))]
> head(counts)
C_R1 C_R2 T_R1 T_R2
1 884 855 615 588
2 5 5 3 2
3 0 0 0 0
4 658 647 473 466
5 0 0 1 1
6 121 116 286 275
> rownames(counts) <- raw.data[, 1]
> head(counts)
C_R1 C_R2 T_R1 T_R2
ENSBTAG00000000005 884 855 615 588
ENSBTAG00000000008 5 5 3 2
ENSBTAG00000000009 0 0 0 0
ENSBTAG00000000010 658 647 473 466
ENSBTAG00000000011 0 0 1 1
ENSBTAG00000000012 121 116 286 275
> colnames(counts) <- paste(c(rep("C_R",2), rep("T_R",2)), c(1:2,1:2), sep="")
> dim(counts)
[1] 24596 4
> colSums(counts)
C_R1 C_R2 T_R1 T_R2
9348648 9150009 10517019 10334348
> colSums(counts)/1e06
C_R1 C_R2 T_R1 T_R2
9.348648 9.150009 10.517019 10.334348
> table(rowSums(counts))[1:30]
> group <- c(rep("C", 2), rep("T", 2))
> cds <- DGEList(counts, group = group)
> names(cds)
[1] "counts" "samples"
> cds <- cds[rowSums(1e+06 * cds$counts/expandAsMatrix(cds$samples$lib.size, dim(cds)) > 1) >= 3,]
> cds <- calcNormFactors(cds)
> cds$samples
group lib.size norm.factors
C_R1 C 9348648 1.2798014
C_R2 C 9150009 1.2807311
T_R1 T 10517019 0.7805419
T_R2 T 10334348 0.7816336
> cds <- estimateCommonDisp(cds)
> names(cds)
[1] "counts" "samples" "common.dispersion" "pseudo.counts" "pseudo.lib.size" "AveLogCPM"
> cds <- estimateTagwiseDisp(cds)
> names(cds)
[1] "counts" "samples" "common.dispersion" "pseudo.counts" "pseudo.lib.size" "AveLogCPM"
[7] "prior.n" "tagwise.dispersion"
> summary(cds$tagwise.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.571e-06 1.571e-06 1.571e-06 1.972e-05 1.571e-06 6.434e-03
> de.tgw <- exactTest(cds, dispersion = "common", pair = c("C", "T"))
or
> de.tgw <- exactTest(cds, dispersion = "tagwise", pair = c("C", "T"))
> de.tgw
An object of class "DGEExact"
$table
logFC logCPM PValue
ENSBTAG00000000005 0.008546278 6.205706 8.892866e-01
ENSBTAG00000000010 0.065301786 5.816468 3.023431e-01
ENSBTAG00000000012 1.782004256 4.336904 6.581677e-63
ENSBTAG00000000013 2.043132781 5.820494 1.076701e-219
ENSBTAG00000000014 0.021064053 5.484187 7.743713e-01
10813 more rows …
$comparison
[1] "C" "T"
$genes
NULL
> options(digits = 3)
> topTags(de.tgw, n = 20, sort.by = "p.value")
Comparison of groups: T‐C
logFC logCPM PValue FDR
ENSBTAG00000009012 8.89 11.55 0 0
ENSBTAG00000033748 7.81 8.25 0 0
ENSBTAG00000007883 7.81 7.41 0 0
ENSBTAG00000008951 7.55 9.53 0 0
ENSBTAG00000014762 7.52 8.86 0 0
ENSBTAG00000037608 7.39 5.91 0 0
ENSBTAG00000009206 7.26 6.92 0 0
ENSBTAG00000007881 7.25 11.49 0 0
ENSBTAG00000014707 7.02 11.29 0 0
> Z <- topTags(de.tgw, sort.by = "p.value")
> options(digits = 5)
> Z <- topTags(de.tgw, n = 10000, sort.by = "p.value")
> write.table(Z,"DEEdgeR.txt",sep = " ",quote = F,col.names=F)
Running of edgeR in R
The output file is DEEdgeR.txt.
Differentially expressed genes at the 5% level of significance:
To get the DE genes at the 5% level of significance, we select genes with a p-value < 0.05. edgeR found 9113 genes to be significantly differentially expressed (P < 0.05).
Explaining the commands:
https://www.bioconductor.org/packages/3.3/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
> raw.data <- read.table(file = "roundedfnedge.txt", header = TRUE)
> counts <- raw.data[, -c(1,ncol(raw.data))]
> head(counts)
> rownames(counts) <- raw.data[, 1]
> head(counts)
> colnames(counts) <- paste(c(rep("C_R",2),rep("T_R",2)),c(1:2,1:2),sep="")
> dim(counts)
> colSums(counts)
> colSums(counts)/1e06
> table(rowSums(counts))[1:30]
> group <- c(rep("C", 2), rep("T", 2))
> cds <- DGEList(counts, group = group)
> names(cds)
Genes with low read counts need to be filtered out before testing for differential expression. Here, only those genes that have more than one count per million in at least three samples are kept for further analysis. After filtering, normalization factors, which correct for the different compositions of the samples, are calculated. The product of the actual library sizes and these factors gives the effective library sizes.
> cds <- cds[rowSums(1e+06 * cds$counts/expandAsMatrix(cds$samples$lib.size, dim(cds)) > 1) >= 3,]
> cds <- calcNormFactors(cds)
> cds$samples
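The filtering rule used above (more than one count per million in at least three samples) can be written out step by step. This is an illustrative sketch, not edgeR code; the function name and toy counts are hypothetical, and library sizes are taken as the column sums of the toy matrix.

```python
def cpm_filter(counts, min_cpm=1.0, min_samples=3):
    """counts: {gene_id: [raw counts per sample]}; returns the genes that are kept."""
    lib_sizes = [sum(col) for col in zip(*counts.values())]
    kept = {}
    for gene, row in counts.items():
        cpm = [1e6 * c / n for c, n in zip(row, lib_sizes)]
        # Keep the gene only if enough samples exceed the CPM cutoff
        if sum(v > min_cpm for v in cpm) >= min_samples:
            kept[gene] = row
    return kept

# Toy matrix: every library has exactly 1,000,000 reads
toy_counts = {
    "geneA": [600000, 600000, 600000, 600000],
    "geneB": [399990, 399990, 399999, 399999],
    "geneC": [10, 10, 1, 1],  # CPM of 10, 10, 1, 1: above the cutoff in only two samples
}
kept = cpm_filter(toy_counts)
print(sorted(kept))
```

With two replicates per group, requiring three samples means a gene must be expressed in both groups to be retained; a lower value for min_samples would also keep genes expressed in only one group.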
The following commands are used to estimate common dispersion and tagwise (i.e., genewise) dispersion:
> cds <- estimateCommonDisp(cds)
> names(cds)
> cds <- estimateTagwiseDisp(cds)
> names(cds)
> summary(cds$tagwise.dispersion)
The pairwise test for differential expression between two groups is performed by the function exactTest(). One element of the object returned by exactTest() is a table of results. However, this table does not contain p-values adjusted for multiple testing. These can be obtained by using the function topTags(), which takes the output from exactTest(), adjusts the raw p-values by false discovery rate (FDR) correction, and returns the top differentially expressed genes. Apart from the added column of adjusted p-values and the sorting of the rows, the output is similar to that of exactTest(). The sort.by argument sorts the table by p-value or fold change. If the n parameter is set to the total number of genes, the entire topTags() results table can be saved. write.table writes the output to a text file (Figure 44.42).
> de.tgw <- exactTest(cds, pair = c("C", "T"))
> de.tgw
> options(digits = 3)
> topTags(de.tgw, n = 20, sort.by = "p.value")
> Z <- topTags(de.tgw, sort.by = "p.value")
> options(digits = 5)
> Z <- topTags(de.tgw, n = 10000, sort.by = "p.value")
> write.table(Z,"DEEdgeR.txt",sep = " ", quote = F, col.names=F)
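The FDR correction mentioned above is commonly the Benjamini-Hochberg procedure, which is the default adjustment method of topTags(). The sketch below shows the procedure on four toy p-values; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

toy_p = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(toy_p)
print([round(v, 4) for v in fdr])
```

Each raw p-value is scaled by the number of tests divided by its rank, so adjusted values are never smaller than the raw ones; this is why the number of genes passing an FDR cutoff of 0.05 is usually lower than the number passing a raw p-value cutoff of 0.05.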