Mir Asif Iquebal, Sarika and D Kumar
CABiN, ICAR‐IASRI, New Delhi, India
Genetic variations exist among individuals of all organisms, and these variations make individuals phenotypically different. Single nucleotide polymorphisms (SNPs) are considered the simplest and most abundant type of genetic variation in the genome. SNPs are the markers of choice in most species for genome‐wide association studies (GWAS), phylogenetic analysis, marker‐assisted selection and genomic selection (Liu et al., 2013). They are preferred because of their high density and stability, and because highly automated techniques are available for detecting them (Kerstens et al., 2009).
Numerous tools are available online for mining SNPs computationally. SNP mining in NGS data has been well documented using two online open source tools: Stacks (Catchen et al., 2011; Ogden et al., 2013) and GATK (DePristo et al., 2011).
The objective is to learn about SNP mining using Stacks, the Burrows–Wheeler Aligner (BWA), the Genome Analysis Toolkit (GATK) and Samtools.
We will learn to install and run Stacks, BWA, GATK, Samtools and related tools to mine SNPs in given nucleotide sequences.
Stacks is a program for studying population genetics, and it is designed to work with any restriction‐enzyme‐based data, such as GBS (Genotyping by Sequencing), CRoPS, RAD‐Seq, and ddRAD‐Seq. Stacks can identify SNPs within or among populations. It has modules that generate summary statistics and compute population genetics parameters, such as Fis and π within populations, and Fst between populations.
The output of Stacks can be exported in VCF (Variant Call Format) and many other standard formats, which we can use in different programs like STRUCTURE and GenePop for downstream analysis. The SNPs predicted by Stacks can also be exported in Phylip format to predict phylogenetic trees by any standard phylogenetic software/tool. Stacks can be used to predict SNPs de novo, as well as by a reference‐based method. It is a Linux‐based program and is available for free download from the Stacks website (http://creskolab.uoregon.edu/stacks/) (Figure 40.1).
Now run ./bwa from the command line to check that BWA is installed correctly. Additionally, add the directories containing the software binaries to the PATH in ~/.bashrc.
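The PATH step can be sketched as follows. This is a minimal illustration that assumes the BWA binary sits in a hypothetical directory, $HOME/tools/bwa; the directory name and the add_to_path helper are placeholders introduced here, not part of BWA.

```shell
# Hypothetical install location; replace with the directory that holds the bwa binary.
BWA_DIR="$HOME/tools/bwa"

# Append a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

add_to_path "$BWA_DIR"

# Report whether the shell can now find bwa.
if command -v bwa >/dev/null 2>&1; then
    echo "bwa found at: $(command -v bwa)"
else
    echo "bwa not found; check BWA_DIR"
fi
```

To make the change permanent, a line such as export PATH="$PATH:$HOME/tools/bwa" can be appended to ~/.bashrc, again with the directory adjusted to the actual install location.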
The denovo_map.pl Perl wrapper script (which includes three components: ustacks, cstacks and sstacks) is used for de novo SNP mining in Stacks. The following commands were obtained from http://catchenlab.life.illinois.edu/stacks/comp/denovo_map.php. The commands and their explanations are reproduced verbatim from the website.
Stacks can be downloaded from the following link: http://creskolab.uoregon.edu/stacks/
Use the denovo_map.pl script to call SNPs from the example mango RAD‐Seq dataset (Figure 40.2):
denovo_map.pl -m <number> -M <number> -n <number> -T <number> -S -b <number> -o <path/to/output/result/folder> -s <path/to/input/file1> -s <path/to/input/file2> -X "populations:-b 1 -t <number> --vcf"
This script generates results and places them in the output folder specified by the -o option. The two input files (e.g., Mango_R1.fastq and Mango_R2.fastq) are specified with the -s option. The analysis is configured using a number of options, as given below:
Additionally, the command runs the populations program (-X option) to generate population genetics statistics on batch 1 (-b option), using 100 threads in parallel (-t option), and writes the results in VCF format (--vcf option).
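A filled-in version of the de novo command may make the placeholders easier to read. The sketch below is illustrative only: the values given to -m, -M, -n and -T, the output folder name and the run helper are assumptions introduced here, and should be tuned to the dataset. The helper prints each command and executes it only when DRY_RUN=0, so the block can be inspected safely before Stacks is installed.

```shell
# Dry-run helper (hypothetical): print the command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Illustrative values only; tune -m, -M and -n for the dataset at hand.
OUT_DIR="./stacks_denovo"        # hypothetical output folder

run denovo_map.pl -m 3 -M 2 -n 1 -T 15 -S -b 1 \
    -o "$OUT_DIR" \
    -s ./Mango_R1.fastq \
    -s ./Mango_R2.fastq \
    -X "populations:-b 1 -t 100 --vcf"
```

Setting DRY_RUN=0 in the environment before running the block executes the command for real.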
Reference‐based SNP mining in Stacks is done using the ref_map.pl Perl wrapper script, which includes three components: pstacks, cstacks and sstacks. This program needs reads that have already been aligned to a reference; the alignment can be produced with Bowtie or any other aligner (such as BWA), and the aligner output should be in SAM (Sequence Alignment/Map) format.
The ref_map.pl Perl wrapper script is used to run reference‐based SNP mining in Stacks (Figure 40.3):
file.fastq>
<path/to/input/sequence/file2.fastq> > <path/to/output/file.sam>
This Perl script writes results to the output folder specified by the -o option, starting from a reference‐aligned input file (e.g., MangoSeq.sam) specified by the -s option, using 15 threads (-T option) and batch ID 1 (-b option). Additionally, it can run the populations program (-X option) to generate population genetics statistics on batch 1 (-b option), using 100 threads in parallel (-t option), and write the results in VCF format (--vcf option).
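The reference‐based workflow (align first, then run ref_map.pl) can be sketched end to end. This is a hedged illustration: the reference file name, the output folder, and the run and align_reads helpers are hypothetical, and only the ref_map.pl options described above are used. Commands are printed rather than executed unless DRY_RUN=0.

```shell
# Dry-run helper (hypothetical): print the command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

REF="./reference.fa"             # hypothetical reference genome
SAM="./MangoSeq.sam"             # aligned reads, as named in the text
OUT_DIR="./stacks_ref"           # hypothetical output folder

# Align paired reads with BWA-MEM; bwa writes SAM to standard output.
align_reads() {
    bwa mem "$1" "$2" "$3" > "$4"
}

run bwa index "$REF"                                        # index the reference once
run align_reads "$REF" ./Mango_R1.fastq ./Mango_R2.fastq "$SAM"
run ref_map.pl -o "$OUT_DIR" -s "$SAM" -T 15 -b 1 \
    -X "populations:-b 1 -t 100 --vcf"
```

Wrapping the alignment in a function keeps the output redirection inside the command being run, so the dry run does not create an empty SAM file.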
GATK is a structured programming framework (DePristo et al., 2011) designed to support the development of efficient and robust analysis tools for next‐generation DNA sequencing, using the functional programming philosophy of MapReduce (Dean and Ghemawat, 2008; DePristo et al., 2011). GATK offers a variety of tools that focus primarily on SNP discovery and genotyping, with a strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high‐performance computing features make it suitable for projects of any size (https://www.broadinstitute.org/gatk/index.php). It is a reference‐based SNP mining program that runs on Linux, and it is available for free download from the GATK website (Figure 40.4).
In order to use GATK, it is first necessary to install the BWA, Samtools, and Picard tools.
To call SNPs using GATK, the input BAM files must first be pre‐processed. The steps for pre‐processing the BAM files are described below:
With these processed BAM files, we can now call SNPs with GATK using the "UnifiedGenotyper" program, as described below:
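One common pre‐processing sequence is sketched here as an assumption rather than a prescribed recipe: sort the alignments, attach read groups, mark duplicates, and index both the BAM file and the reference. All file names, the read group values, the Picard jar location and the run helper are hypothetical placeholders; commands are printed rather than executed unless DRY_RUN=0.

```shell
# Dry-run helper (hypothetical): print the command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

PICARD_JAR="./picard.jar"        # hypothetical location of the Picard jar
REF="./reference.fa"             # hypothetical reference genome
SAMPLE="Mango"                   # hypothetical sample name

# 1. Convert the SAM alignments to a coordinate-sorted BAM file.
run samtools sort -O bam -o "./$SAMPLE.sorted.bam" ./MangoSeq.sam

# 2. Attach read group information, which GATK expects in its input.
run java -jar "$PICARD_JAR" AddOrReplaceReadGroups \
    I="./$SAMPLE.sorted.bam" O="./$SAMPLE.rg.bam" \
    RGID=1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM="$SAMPLE"

# 3. Mark duplicate reads and write a metrics report.
run java -jar "$PICARD_JAR" MarkDuplicates \
    I="./$SAMPLE.rg.bam" O="./$SAMPLE.dedup.bam" M="./$SAMPLE.dup_metrics.txt"

# 4. Index the final BAM file.
run samtools index "./$SAMPLE.dedup.bam"

# 5. Build the reference index (.fai) and sequence dictionary (.dict).
run samtools faidx "$REF"
run java -jar "$PICARD_JAR" CreateSequenceDictionary R="$REF" O=./reference.dict
```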
java -jar <path/to/GenomeAnalysisTK.jar> -T UnifiedGenotyper -R <path/to/reference/sequence.fa> -I <path/to/input/file.bam> -I <path/to/input/file2.bam> -o <path/to/output/file.vcf> VALIDATION_STRINGENCY=<SILENT|LENIENT|STRICT>
The command used in the GATK pipeline calls SNPs from two input BAM files (each specified with the -I option) using the "UnifiedGenotyper" tool, with the reference sequence specified by the -R option, and writes the results in VCF format to the output file specified by the -o option. The user can select the required level of validation stringency from the Silent, Lenient and Strict options.
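Because each BAM file needs its own -I flag, the call can be wrapped so that any number of inputs is handled. The sketch below is an illustration under assumptions: the jar path, the file names, and the call_snps and run helpers are hypothetical, and only the -T, -R, -I and -o options described above are used. Commands are printed rather than executed unless DRY_RUN=0.

```shell
# Dry-run helper (hypothetical): print the command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

GATK_JAR="./GenomeAnalysisTK.jar"   # hypothetical location of the GATK jar

# Usage: call_snps <reference.fa> <output.vcf> <input1.bam> [<input2.bam> ...]
call_snps() {
    ref="$1"
    out_vcf="$2"
    shift 2
    n=$#
    # Append "-I <bam>" for every input, then drop the original bare file names.
    for bam in "$@"; do
        set -- "$@" -I "$bam"
    done
    shift "$n"
    run java -jar "$GATK_JAR" -T UnifiedGenotyper -R "$ref" "$@" -o "$out_vcf"
}

call_snps ./reference.fa ./Mango_snps.vcf ./Mango1.dedup.bam ./Mango2.dedup.bam
```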
The GATK script, running on an example dataset, is shown in Figure 40.5.
The first six columns of the Variant Call Format (VCF) describe the observed variation, and are explained as follows (see Figure 40.6):
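In a standard VCF file the first six columns are CHROM, POS, ID, REF, ALT and QUAL. A quick way to view only these columns is sketched below; the vcf_first_six helper and the small demonstration record are made up for illustration and do not come from the mango dataset.

```shell
# Print the first six tab-separated VCF columns, skipping the "##" meta lines.
# The "#CHROM ..." header line is kept so the columns stay labelled.
vcf_first_six() {
    grep -v '^##' "$1" | cut -f 1-6
}

# Build a tiny, made-up VCF file for demonstration.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf 'chr1\t1042\t.\tA\tG\t58.3\tPASS\tDP=31\n' >> "$demo_vcf"

vcf_first_six "$demo_vcf"
rm -f "$demo_vcf"
```

The same function can be pointed at the VCF output of Stacks or GATK, for example vcf_first_six path/to/output/file.vcf.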