Using DESeq2 from a count matrix

In Step 1, we use the readr package's read_tsv() function to load the tab-delimited text file of counts into a dataframe called count_dataframe. From that, we extract the 'gene' column into a new variable, genes, and erase it from count_dataframe by assigning NULL. We do this so that we can convert the remaining columns into the count_matrix matrix with the base as.matrix() function and add the gene information back as rownames. Finally, we load the phenotype data we'll need from its file using the readr read_table2() function.
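As a rough illustration of the Step 1 conversion, the following sketch runs the same sequence on a tiny stand-in file. The file contents, the gene IDs, and the sample names are invented for the example, and it assumes the readr package is installed.

```r
# Toy stand-in for the real counts file; gene IDs and values are invented.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "gene\tsample_1\tsample_2\tsample_3",
  "gene_a\t10\t12\t30",
  "gene_b\t0\t3\t1",
  "gene_c\t250\t240\t90"
), tsv_path)

# One character column (gene) followed by three integer count columns.
count_dataframe <- readr::read_tsv(tsv_path, col_types = "ciii")

genes <- count_dataframe[["gene"]]         # keep the IDs to one side
count_dataframe[["gene"]] <- NULL          # drop the non-numeric column
count_matrix <- as.matrix(count_dataframe) # only numeric columns remain
rownames(count_matrix) <- genes            # restore the IDs as row names
```

Removing the character column first matters: as.matrix() on a dataframe that mixes character and numeric columns would return a character matrix rather than a numeric one.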

Step 2 is concerned with working out which columns in count_matrix we want to use. We define a variable, experiments_of_interest, that holds the column names we want, and then use the %in% operator and the which() function to find the matching columns. The %in% test produces a logical vector with one element per column: if, say, the third element is TRUE, the third name was in the experiments_of_interest variable. The which() function converts the TRUE elements into their integer positions, and the result is stored in columns_of_interest.
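The selection in Step 2 can be sketched with base R alone. The stage labels below are hypothetical placeholders, and is_of_interest is a name introduced only for this illustration.

```r
# Hypothetical phenotype labels, one per column of the count matrix.
stage <- c("Embryo", "L1Larvae", "L2Larvae", "L1Larvae", "Adult")
experiments_of_interest <- c("L1Larvae", "L2Larvae")

# %in% tests each label for membership: one TRUE/FALSE per column.
is_of_interest <- stage %in% experiments_of_interest

# which() keeps only the integer positions of the TRUE elements.
columns_of_interest <- which(is_of_interest)
```

Either the logical vector or the integer positions can be used to index columns; the integer form is shorter and easier to inspect when only a few columns match.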

Step 3 begins with loading the magrittr package to get the %>% operator, which allows piping. We then use R indexing with the columns_of_interest vector to select the names of the columns we want and send them to the forcats as_factor() function to get a factor object for our grouping variable. Sample grouping information is basically a factor that tells us which samples are replicates of the same thing, and it's important for describing the experimental design. You can see an expanded description of these grouping/factor objects in Step 3 of Recipe 1.
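A minimal sketch of the Step 3 pipe follows, assuming the magrittr and forcats packages are installed. The stage labels and the selected positions are made up for the example.

```r
library(magrittr)  # provides the %>% pipe operator

# Hypothetical labels and hypothetical selected positions.
stage <- c("Embryo", "L2Larvae", "L1Larvae", "L2Larvae", "Adult")
columns_of_interest <- c(2, 3, 4)

# Index the labels, then pipe the result into as_factor().
grouping <- stage[columns_of_interest] %>% forcats::as_factor()

levels(grouping)                        # order of first appearance
levels(factor(stage[columns_of_interest]))  # base R sorts alphabetically
```

For character input, as_factor() orders the levels by first appearance, whereas base factor() sorts them. The difference is worth checking here because DESeq2 treats the first level of a factor as the reference group.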

In Step 4, we use indexing to extract the columns of data we want to actually analyze.
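The column extraction in Step 4 might look like the sketch below, which uses a small invented matrix in place of the real counts.

```r
# Invented 3-gene x 4-sample matrix standing in for count_matrix.
count_matrix <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("gene_a", "gene_b", "gene_c"),
                  paste0("sample_", 1:4))
)
columns_of_interest <- c(2, 4)

# An empty row index keeps every gene; the column index keeps the chosen samples.
counts_of_interest <- count_matrix[, columns_of_interest]
```

If only one column could ever be selected, adding drop = FALSE to the index would keep the result a matrix instead of collapsing it to a vector.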

By Step 5, we are into the actual analysis section. First, we convert our matrix of counts into a DESeqDataSet object; this can be done with the conversion function, DESeqDataSetFromMatrix(), passing in the counts, the groups, and a design. The design is in the form of an R formula, hence the ~ stage annotation.

In Step 6, we perform the actual analysis using the DESeq() function on the dds DESeqDataSet object, and in Step 7, we get the results into the res variable using the results() function. The output has the following six columns:
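The construction in Step 5 can be sketched on simulated counts, assuming DESeq2 is installed from Bioconductor. The gene and sample names, the group labels, and the col_data dataframe are assumptions made for this illustration; wrapping the grouping factor in a dataframe with a stage column is one way to give the ~ stage formula a variable to refer to, and is not necessarily how the recipe passes the groups.

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 50 hypothetical genes x 6 hypothetical samples.
counts_of_interest <- matrix(
  rnbinom(50 * 6, mu = 100, size = 5), nrow = 50,
  dimnames = list(paste0("gene_", 1:50), paste0("sample_", 1:6))
)

# Hypothetical grouping: three replicates in each of two stages.
grouping <- factor(rep(c("L1Larvae", "L2Larvae"), each = 3))

# One row per sample, in the same order as the count matrix columns.
col_data <- data.frame(stage = grouping,
                       row.names = colnames(counts_of_interest))

dds <- DESeqDataSetFromMatrix(
  countData = counts_of_interest,
  colData   = col_data,
  design    = ~ stage   # model counts as a function of stage
)
```

The variable named in the formula must match a column name in colData, and the rows of colData must line up with the columns of the count matrix.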

baseMean log2FoldChange lfcSE stat pvalue padj

These show the mean of the normalized counts across samples, the log2 fold change between the sample groups for a gene, the standard error of the log2 fold change, the Wald statistic, and the raw and adjusted P values. The padj column of adjusted P values is the one most commonly used to draw conclusions about significance.
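To see these columns without the recipe's data files, the sketch below runs Steps 6 and 7 on the simulated dataset that DESeq2 provides through makeExampleDESeqDataSet(), whose design is ~ condition rather than ~ stage. The 0.05 cutoff and the significant variable are assumptions for illustration, not values taken from the recipe.

```r
library(DESeq2)

set.seed(1)
# Simulated dataset: 1000 genes, 6 samples, two conditions.
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

dds <- DESeq(dds)    # size factors, dispersions, model fit and Wald tests
res <- results(dds)  # one row per gene

colnames(res)        # the six columns described above

# Keep genes below an illustrative adjusted P value cutoff.
# which() drops rows where padj is NA.
significant <- res[which(res$padj < 0.05), ]
```

Using which() rather than a bare logical comparison avoids rows of NA in the subset, since padj can be NA for genes that DESeq2 filters out.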
