How it works...

We first make a vector of file paths to all of the fastq files we wish to use by passing the fq_dir variable, which contains the fastq directory, to the list.files() function. Then, we use the looping function, lapply(), to iterate over each fastq file path and run the dada2 function, plotQualityProfile(), on each file in turn. Each resulting plot object is saved into the list object, quality_plots. The cowplot function, plot_grid(), will plot all of these in a grid when the list of plots is passed to its plotlist argument.
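The steps above can be sketched as follows. This is a minimal sketch, not the recipe's exact code: the location assigned to fq_dir and the filename pattern are assumptions made for illustration, and the plotting only runs if fastq files are found and the dada2 and cowplot packages are installed.

```r
# Hypothetical location of the fastq files (an assumption, adjust as needed)
fq_dir <- file.path(getwd(), "fastq")

# Vector of full paths to every fastq file in the folder.
# The pattern is an assumption; it matches names ending in .fastq, .fq,
# .fastq.gz or .fq.gz.
read_files <- list.files(fq_dir, pattern = "\\.(fastq|fq)(\\.gz)?$",
                         full.names = TRUE)

# Only attempt plotting when there is something to plot and the
# packages are available.
if (length(read_files) > 0 &&
    requireNamespace("dada2", quietly = TRUE) &&
    requireNamespace("cowplot", quietly = TRUE)) {
  # One quality-profile plot object per fastq file
  quality_plots <- lapply(read_files, dada2::plotQualityProfile)

  # Arrange all of the plots in a single grid
  print(cowplot::plot_grid(plotlist = quality_plots))
}
```

Using lapply() keeps the plots in a list, which is the form that the plotlist argument of plot_grid() expects.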

The resulting plot shows that the fastq quality scores are poor in the first 10 or so nucleotides and again after about 260 nucleotides. These positions will be the trimming points for the next step.

To carry out trimming, we loop over the fastq files in read_files. In each iteration, we create an output fastq filename, out_fq, by pasting the text "trimmed.filtered" onto the input filename (the trimmed reads are saved to a new file rather than held in memory), then run the fastqFilter() trimming function, passing it the input filename, fq; the output filename, out_fq; and the trim parameters. At the end of this loop, we have a folder full of trimmed read files. The names of these are loaded into a vector with the list.files() function again, this time matching only files with the "trimmed.filtered" pattern. All of these files are loaded into memory and dereplicated using the derepFastq() function. We then calculate the parameters for the compositional inference step by running the dada() function on a subset of the files: we pass the first five sets of dereplicated reads by indexing the derep_reads object. By setting err to NULL and selfConsist to TRUE, we force dada() to estimate the error parameters from the data, and we save the results in the dd_model variable.
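A rough sketch of the trimming, dereplication, and parameter estimation steps is shown here. The helper names make_out_name() and trim_and_learn() are hypothetical, the "." separator in the output filename is an assumption, and the trim values of 10 and 260 are taken from the quality plot described earlier. The dada2 calls sit inside a function, so nothing is read or written until it is called with real files.

```r
# Hypothetical helper: build the output filename by pasting the
# "trimmed.filtered" tag onto the input filename (the "." is an assumption).
make_out_name <- function(fq) {
  paste0(fq, ".trimmed.filtered")
}

# Hypothetical wrapper around the steps described in the text.
# Requires the dada2 package and real fastq files when called.
trim_and_learn <- function(read_files, trim_left = 10, trunc_len = 260) {
  # Trim each file and write the result to a new file on disk
  for (fq in read_files) {
    out_fq <- make_out_name(fq)
    dada2::fastqFilter(fq, out_fq,
                       trimLeft = trim_left, truncLen = trunc_len)
  }

  # Collect only the trimmed files (assumes all reads share one folder)
  trimmed_files <- list.files(dirname(read_files[1]),
                              pattern = "trimmed.filtered",
                              full.names = TRUE)

  # Load the trimmed reads into memory and dereplicate them
  derep_reads <- dada2::derepFastq(trimmed_files)

  # Estimate error parameters from the first five samples;
  # err = NULL with selfConsist = TRUE makes dada() learn them from the data
  dd_model <- dada2::dada(derep_reads[1:5], err = NULL, selfConsist = TRUE)

  list(derep_reads = derep_reads, dd_model = dd_model)
}

# Base R part only: show the output name generated for an example input
make_out_name("sample1.fq.gz")
```

Indexing with derep_reads[1:5] assumes at least five samples are present; with fewer files, the index would need to be shortened.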

We next run the dada() function on all of the data, setting the err parameter to the error estimates calculated previously and stored in dd_model. This step calculates the final sequence composition for the whole dataset.
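One way this step might look is sketched below. It assumes that dd_model is the list of results returned by dada() for the five samples and that the estimated error rates are read from the err_out element of the first result; the helper names get_error_rates() and infer_all() are hypothetical.

```r
# Hypothetical helper: pull the estimated error rates out of the first
# result in dd_model (assumes dd_model is a list of dada() results).
get_error_rates <- function(dd_model) {
  dd_model[[1]]$err_out
}

# Hypothetical wrapper: run dada() on every dereplicated sample using the
# previously estimated error rates. Requires the dada2 package when called.
infer_all <- function(derep_reads, dd_model) {
  dada2::dada(derep_reads, err = get_error_rates(dd_model))
}
```

Reusing the estimated error rates means the more expensive self-consistency step is only run on the subset, not on every sample.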

Finally, we can make the sequence table from the results of the dada() function and use that to find OTUs with assignTaxonomy(). This function uses a naive Bayes classifier to assign sequences to taxa, based on the classifications in the training set provided in the rdp_train_set_14.fa file. The output of this function is the taxonomic classification of each sequence. A single row of the resulting table, taxonomy_tb, looks like this:

##                     Kingdom                      Phylum 
##                  "Bacteria" "Cyanobacteria/Chloroplast" 
##                       Class                       Order 
##               "Chloroplast"               "Chloroplast" 
##                      Family                       Genus 
##           "Bacillariophyta"                          NA 
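The final step could be sketched roughly as follows. The wrapper name classify_sequences() and its argument names are hypothetical, and the path to the training set is left for the caller to supply. The small example_tb matrix is a stand-in built from the row shown above, used only to illustrate how a single row is extracted; it is not output from a real run.

```r
# Hypothetical wrapper: build the sequence table and classify each sequence.
# Requires the dada2 package and a reference training set when called.
classify_sequences <- function(dada_all, ref_fasta) {
  sequence_tb <- dada2::makeSequenceTable(dada_all)
  dada2::assignTaxonomy(sequence_tb, refFasta = ref_fasta)
}

# Stand-in for taxonomy_tb with one row, copied from the example in the text
example_tb <- matrix(
  c("Bacteria", "Cyanobacteria/Chloroplast", "Chloroplast",
    "Chloroplast", "Bacillariophyta", NA),
  nrow = 1,
  dimnames = list(NULL, c("Kingdom", "Phylum", "Class",
                          "Order", "Family", "Genus"))
)

# Extract a single row as a named character vector
example_row <- example_tb[1, ]
print(example_row)
```

An NA in a rank such as Genus indicates that no assignment was made at that level for the sequence.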
