CS Mukhopadhyay and RK Choudhary
School of Animal Biotechnology, GADVASU, Ludhiana
The Basic Local Alignment Search Tool (BLAST) is a collection of programs for searching a selected database (the “subject” sequences) for sequences homologous to a given query sequence or set of query sequences. BLAST thus finds regions of local similarity between the query and subject sequences. BLAST is a heuristic program, developed by Altschul and coworkers (Altschul et al., 1990), that yields results in a reasonable time. The term “heuristic” means that the algorithm is faster than the classical exhaustive alignment methods but is not guaranteed to find the optimal alignment. The default parameters of BLAST can be modified as needed.
BLASTn (BLAST with suffix n) is one of the BLAST programs (Table 12.1); it compares a nucleotide query sequence against a nucleotide database. Functional and evolutionary relationships between sequences can be inferred using BLAST. It is also used to identify members of gene families.
TABLE 12.1 Overview of various types of BLAST algorithms available at the National Center for Biotechnology Information (NCBI) website, with their applications.
BLAST type | Query | Database | Alignment level | Application |
BLASTn | Nucleotide | Nucleotide | Nucleotide | Oligo‐mapping, cross‐species sequence study, cDNA, EST study, screening repetitive elements, gDNA annotation. |
BLASTp | Protein | Protein | Protein | Protein homology, motif search, phylogeny study, characterize novel transcripts. |
BLASTx | Nucleotide | Protein | Protein | Explore protein coding genes in cDNA/gDNA, characterize a novel transcript. |
tBLASTn | Protein | Nucleotide | Protein | Mapping protein to genomic DNA, compare unknown proteins from multiple organisms to gDNA. |
tBLASTx | Nucleotide | Nucleotide | Protein | Search for protein coding genes whose products are not in protein database, cross‐species gene prediction at transcript level. |
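As a rough illustration, Table 12.1 can be read as a lookup from the query type, database type and alignment level to a program name. The sketch below encodes only those three columns; the dictionary and function names are hypothetical and are not part of any NCBI tool.

```python
# Lookup built from Table 12.1:
# (query type, database type, alignment level) -> BLAST program.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", "nucleotide"): "BLASTn",
    ("protein", "protein", "protein"): "BLASTp",
    ("nucleotide", "protein", "protein"): "BLASTx",
    ("protein", "nucleotide", "protein"): "tBLASTn",
    ("nucleotide", "nucleotide", "protein"): "tBLASTx",
}


def choose_blast_program(query, database, alignment_level):
    """Return the program from Table 12.1 for the given combination."""
    key = (query.lower(), database.lower(), alignment_level.lower())
    try:
        return BLAST_PROGRAMS[key]
    except KeyError:
        raise ValueError(f"No program in Table 12.1 for {key}") from None


print(choose_blast_program("Nucleotide", "Nucleotide", "Nucleotide"))
print(choose_blast_program("Nucleotide", "Nucleotide", "Protein"))
```

The alignment level is needed as a third key because BLASTn and tBLASTx both take a nucleotide query and a nucleotide database.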
To search a nucleotide database for sequence(s) homologous to a query nucleotide sequence.
The BLASTn program is run by supplying the input sequence and setting the BLASTn parameters. The general steps for setting up a BLASTn search are given below:
These steps are discussed in detail below to elucidate the operation of BLASTn.
Open the NCBI home page at http://www.ncbi.nlm.nih.gov/ and click “BLAST”. Alternatively, open the BLAST page directly at http://blast.ncbi.nlm.nih.gov/Blast.cgi. Then click “nucleotide blast” and the BLASTn window will appear.
The inputs required to specify the BLASTn parameters are broadly categorized into three sections:
The user may enter an accession number or a nucleotide sequence (raw or FASTA formatted) in the specified sequence box. If the sequence is entered in raw format (i.e., without the header line of FASTA), the output page will show “None” (instead of the header line of the FASTA sequence) under the heading “Description”.
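The difference between raw and FASTA input can be sketched in a few lines of Python. This is a minimal illustration, assuming a single-record input; the function name `parse_query` is hypothetical, and the code is not how the NCBI server parses its input. A missing header is returned as `None`, which mirrors the “None” shown under “Description” for raw sequences.

```python
def parse_query(text):
    """Split pasted input into (header, sequence).

    header is None for raw input, i.e. when there is no '>' line.
    Only a single sequence record is handled in this sketch.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")
    header = None
    if lines[0].startswith(">"):
        # FASTA: the first line is the header, the rest is sequence.
        header = lines[0][1:].strip()
        lines = lines[1:]
    sequence = "".join(lines).replace(" ", "").upper()
    return header, sequence


print(parse_query(">seq1 demo\nACGT\nacgt"))
print(parse_query("acgt\nTTGA"))
```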
TABLE 12.2 Optional BLASTn parameters. Numbered arrows refer to the serial number (SN) of discussion in Table 12.3.
SN | Terms | Explanation |
1 | Maximum target sequences | The user can choose (from the drop‐down list) the maximum number of aligned sequences to be displayed in the BLAST result. You can select a range of nucleotide searches. |
2 | Short queries | Check this box to let BLASTn automatically adjust the word size (seed length) and related parameters for short query sequences to improve results. |
3 | Expect threshold | BLAST alignment may result in chance hits to non‐homologous sequences. The threshold value (E‐value) should be set lower to minimize random matches from the database. For example, if the match score (S‐score) is 32.7 and the E‐value is 0.025, a score of 32.7 or better would be expected by chance 2.5 times in 100 searches (i.e., once in 40). An E‐value ≤ 0.005 is considered to be statistically significant. |
4 | Word size | BLAST follows a heuristic algorithm, where a seed word of specific length starts finding its match, and then gets extended in both directions. BLASTn needs an exact match for the seed word between both query and subject sequences. A drop‐down menu of the word size has been provided. Taking a larger word size may end up in fewer results, while a shorter word size may lead to more random hits. |
5 | Max matches in a query range | This is a very useful parameter with practical implications. Sometimes a particular portion of a given query sequence gets a very large number of matches, due to strong similarity, while the other portion does not get a chance to display the result. This option sets a balance by limiting the occurrence of a strong match and offers an opportunity for the portions with weak matches within the same query sequence. |
6 | Match/mismatch scores | The user can set the ratio between the reward for a match and the penalty for a mismatch. Selection can be made from the pull‐down menu, which specifies a positive value for a match and a negative value for a mismatch. A wider ratio should be used for identifying divergent sequences through BLASTn. |
7 | Gap costs | A drop‐down menu displays a given range of gap costs. Linear costs are available for MegaBLAST only. Increasing the gap costs reduces the occurrence of gaps in the aligned sequences. |
8 | Filter: low complexity region | Low complexity regions are repeat sequences which could introduce spurious results in BLASTn matching. Check this box to mask them. |
8 | Filter: species‐specific repeats for | If checked, this will mask the repeat elements for that particular selected species. |
9 | Mask: masks for lookup table only | BLASTn selects the seed word from the look‐up table and then proceeds to extension. If this box is checked, masking applies only at the seeding stage: no seed is taken from a low complexity region, but an extension may still pass through it. |
9 | Mask: mask lower case letter | Lower case characters (i.e., bases), indicating low complexity regions, are masked and not considered for BLAST. |
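The E‐value arithmetic in row 3 of Table 12.2 can be checked with a short calculation. The sketch below assumes the standard Poisson relationship used in BLAST statistics, P = 1 − exp(−E), which is not stated in the table itself; the function name is hypothetical.

```python
import math


def chance_hit_probability(e_value):
    """Probability of at least one chance hit, assuming P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


e_value = 0.025
# An E-value of 0.025 corresponds to about one chance hit per 40 searches.
print(round(1 / e_value))
# For small E-values the probability is close to the E-value itself.
print(round(chance_hit_probability(e_value), 4))
```

For small E‐values the two quantities are nearly equal, which is why an E‐value of 0.025 is often read directly as “about 2.5 chances in 100”.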
This is a very important parameter to be chosen:
The default parameters of BLAST are suitable on some occasions. Under certain conditions, however, such as when the query is short or is to be searched against divergent homologs, the user needs to optimize the parameters by expanding the algorithm parameters section (Figure 12.2). The meanings of the parameters are explained in Table 12.2.
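To see why the word size matters for short or divergent queries, the seeding step can be imitated with a toy exact word match. This is a simplified sketch of the idea only: it omits extension, scoring and masking, it is not the BLAST implementation, and the function name and example sequences are made up for illustration.

```python
def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index every word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    # Look up every word of the query in that index.
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


query = "ACGTTGCA"
subject = "TTACGTAGCATT"
for w in (3, 4, 7):
    print(w, find_seeds(query, subject, w))
```

In this toy case the number of seeds falls as the word size grows, so a large word size can leave a short or divergent query with no starting point for extension, while a small one produces more, and more random, starting points.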
TABLE 12.3 Databases against which a query can be searched in BLASTn (http://www.ncbi.nlm.nih.gov/books/NBK153387/).
SN | Databases | Description of database |
1 | Human genomics plus transcript | Genomic DNA sequences (from all assemblies and chromosomes) and RefSeq RNA sequences of human. |
2 | Mouse genomics plus transcript | Genomic DNA sequences (from all assemblies and chromosomes) and RefSeq RNA sequences of mouse. |
3 | Nucleotide collection | Non‐redundant sequences from GenBank, EMBL, DDBJ, PDB and RefSeq; however, this excludes very specific databases like EST, STS, GSS, WGS, TSA, patent sequences and HTGS (phases 0–2) sequences. |
4 | Reference RNA sequence | Reference sequences for various transcripts at NCBI db. |
5 | Reference genomic sequences | Reference sequences for various genomic sequences at NCBI db. |
6 | NCBI genomes | NCBI chromosomal DNA sequences of all species in db. |
7 | Expressed sequence tags (EST) | EST sequences from GenBank, EMBL and DDBJ db. |
8 | Genomic survey sequences (GSS) | GSS, namely single‐pass genomic data (sequences that have been read in the sequencing machine only once), exon‐trapped sequences (used to identify genes in cloned DNA by recognizing and trapping the carrier containing the exon sequence), and Alu PCR sequences. |
9 | HT genomic sequences (HTGS) | HTGS of Phases 0, 1 and 2; the unfinished HTGS. |
10 | Patent sequences | DNA sequences available at the patent division of GenBank. |
11 | Protein data bank (PDB) | Nucleotide sequence database maintained at PDB. |
12 | Human Alu repeat elements | Abundant Alu elements present in human genome. |
13 | Sequence tagged sites (STS) | STS sequences from GenBank, EMBL and DDBJ db. |
14 | Whole genome shotgun contigs (WGSC) | Database harboring the WGS contigs, except for the WGS data from Chromosome db. |
15 | Transcriptome shotgun assembly (TSA) | Database containing the computationally assembled mRNA sequences from primary data. |
16 | 16S rRNA sequences (bacteria and archaea) | 16S ribosomal RNA data belonging to bacteria and archaea. |
The BLAST results open in a new window (or tab) if the “Show results in a new window” box is checked.
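The choices made on the web form can also be expressed as a set of request parameters. The sketch below only builds a URL-encoded parameter string and does not contact NCBI. The parameter names (CMD, PROGRAM, DATABASE, QUERY, EXPECT, WORD_SIZE, HITLIST_SIZE, FILTER) follow NCBI's BLAST URL API as commonly documented and should be checked against the current documentation; the function name, the default values and the https form of the address are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint: the https form of the Blast.cgi address given in the text.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastn_request(query, database="nt", expect=0.05, word_size=11,
                         max_targets=100, low_complexity_filter=True):
    """Return a URL-encoded parameter string for a BLASTn submission.

    The defaults are illustrative values, not recommendations.
    """
    params = {
        "CMD": "Put",                 # submit a new search
        "PROGRAM": "blastn",
        "DATABASE": database,
        "QUERY": query,               # accession, raw or FASTA sequence
        "EXPECT": expect,             # expect threshold
        "WORD_SIZE": word_size,
        "HITLIST_SIZE": max_targets,  # maximum target sequences
    }
    if low_complexity_filter:
        params["FILTER"] = "L"        # low complexity filter
    return urlencode(params)


body = build_blastn_request("ACGTTGCAACGTTGCAACGTTGCA", word_size=7)
print(BLAST_URL + "?" + body)
```

Keeping the parameters in one place like this makes it easy to record which settings produced a given result.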