CHAPTER 12
Basic Local Alignment Search Tool for Nucleotide (BLASTn)

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

12.1 INTRODUCTION

The Basic Local Alignment Search Tool (BLAST) is a collection of programs for finding sequences homologous to a query sequence, or a set of query sequences, in a selected database (whose entries are called “subject sequences”). BLAST thus finds regions of local similarity between the query and subject sequences. BLAST is a heuristic program, developed by Altschul and coworkers (Altschul et al., 1990), that yields results in a reasonable time. The term “heuristic” means that the algorithm is faster than the classical exhaustive methods but is not guaranteed to find the optimal alignment. The default parameters of BLAST can be modified as needed.

BLASTn (BLAST with suffix n) is one of the BLAST programs (Table 12.1) that is used to compare a nucleotide query sequence against a nucleotide database. Functional and evolutionary relationships between sequences can be deciphered using BLAST. In addition, it is used to identify member(s) of gene families.

TABLE 12.1 Overview of various types of BLAST algorithms available at the National Center for Biotechnology Information (NCBI) website, with their applications.

| BLAST type | Query | Database | Alignment level | Application |
|---|---|---|---|---|
| BLASTn | Nucleotide | Nucleotide | Nucleotide | Oligo‐mapping, cross‐species sequence study, cDNA, EST study, screening repetitive elements, gDNA annotation. |
| BLASTp | Protein | Protein | Protein | Protein homology, motif search, phylogeny study, characterize novel transcripts. |
| BLASTx | Nucleotide | Protein | Protein | Explore protein coding genes in cDNA/gDNA, characterize a novel transcript. |
| tBLASTn | Protein | Nucleotide | Protein | Mapping protein to genomic DNA, compare unknown proteins from multiple organisms to gDNA. |
| tBLASTx | Nucleotide | Nucleotide | Protein | Search for protein coding genes whose products are not in protein database, cross‐species gene prediction at transcript level. |

12.2 OBJECTIVE

To search the nucleotide database for sequence(s) homologous to a query nucleotide sequence.

12.3 PROCEDURE

The BLASTn program is run by providing the input sequence and setting the BLASTn parameters. The general steps for setting up a BLASTn search are given below:

  1. Enter the query sequence(s) of interest.
  2. Select the specific subject database.
  3. Select the BLASTn program (MegaBLAST, Discontiguous MegaBLAST, blastn).
  4. Set the optional parameters, if required.

These steps are discussed in detail below to elucidate the operation of BLASTn.

12.3.1 Open BLASTn homepage

Open the NCBI home page by typing http://www.ncbi.nlm.nih.gov/ and click “BLAST”. Alternatively, it can also be opened by typing http://blast.ncbi.nlm.nih.gov/Blast.cgi. Then click “nucleotide blast” and the BLASTn window will appear.

The inputs required to specify the BLASTn parameters are broadly categorized into three sections:

12.3.1.1 Enter query sequence

The user may enter an accession number or a nucleotide sequence (raw or FASTA formatted) in the specified sequence box. If the sequence is entered in raw format (i.e., without the header line of FASTA), the output page will show “None” (instead of the header line of the FASTA sequence) under the heading “Description”.

  1. Multiple input sequences: More than one query sequence can be provided in FASTA format (or as NCBI accession numbers separated by Return or Enter) in the specified sequence box. Alternatively, a text file containing the query sequences (in FASTA format) can be uploaded by clicking the “Choose File” button. The results page will display a drop‐down option under the “Results for:” heading, at the top left of the page. This allows the user to select the BLAST result for any one of the multiple input sequences. Once the user selects the required option, the other parameters of the search result, including “Query ID”, “Description”, “Molecule type” and “Query Length”, change accordingly.
  2. Give a job title: It is good practice to provide a job title to identify the results of BLASTn in saved searches. Note that when a sequence is pasted in FASTA format, the descriptive line of the FASTA entry is automatically picked up as the “Job Title”.
  3. Checking the “Align two or more sequences” option: Checking this box creates another sequence box in which the user's own subject sequence(s) can be pasted. This is done to align a query sequence with a specific subject sequence. Provide the input sequences (query and subject) in FASTA format, so that BLAST can assign an identification tag to the subject sequence name. BLASTn also produces a dot plot (“Dot Matrix View”) of the pairwise local sequence alignment when a single sequence is given as the subject.
  4. Provide query sub‐range (optional): This specifies the range of a single query sequence that is to be searched against the database (it applies only when a single query sequence is given). It is especially useful when a GenBank accession number is used instead of the sequence itself.
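The input conventions described above (an optional FASTA header line, several records in one box, and an optional sub-range of a single query) can be illustrated with a minimal Python sketch. This is not how NCBI processes input; the function names `parse_fasta` and `query_subrange` and the example sequences are hypothetical and serve only to make the conventions concrete. A raw sequence without a header is returned with `None` as its description, mirroring the “None” shown on the results page.

```python
def parse_fasta(text):
    """Split FASTA (or raw) text into (header, sequence) pairs.

    A raw sequence with no '>' line gets header None.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None or chunks:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None or chunks:
        records.append((header, "".join(chunks)))
    return records


def query_subrange(sequence, start, stop):
    """Return bases start..stop (1-based, inclusive) of one query."""
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError("sub-range must lie within the sequence")
    return sequence[start - 1:stop]


# Hypothetical two-record input, as it might be pasted into the query box.
example = """>seq1 hypothetical example
ACGTACGTAC
GTTT
>seq2 second hypothetical query
GGGCCC
"""

for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

Calling `query_subrange(sequence, 3, 6)` on the first record would restrict the search to bases 3 to 6 of that query, which corresponds to filling in the “From” and “To” boxes for a single sequence.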

12.3.1.2 Choose search set

  1. Select database: Three options are available: “Human Genomic + Transcript”, “Mouse genomic + Transcript” and “Others (nr)”. For a DNA database (for BLASTn), the default is either the human or the mouse genomic plus transcript database. Other commonly used databases include the nucleotide “nr” database and the EST database. Choose the nucleotide nr database if sequences from microbes, plants or animals are to be searched. The drop‐down menu enables the user to specify the database against which the query sequence is to be searched. Selecting “Others (nr etc.)” provides the databases shown in Figure 12.1. Table 12.3 lists the databases against which a query sequence can be searched in BLASTn.
  2. Select organism (optional): specify if the database search is to be restricted to a specific organism, or if any specific organism is to be excluded from the search.
  3. Entrez query (optional): The BLAST search can be refined by restricting it to those database sequences that match an Entrez query. Some examples of Entrez queries are: 3000 : 5000 [mlwt], 100 : 450 [slen], protease NOT Bos [organism]. The operators AND, OR and NOT can be used to refine the search.
    1. 3000 : 5000 [mlwt] limits the search to protein sequences with a molecular weight of 3–5 kDa.
    2. 100 : 450 [slen] limits the search to nucleotide or protein sequences that are 100–450 bases/residues long.
    3. protease NOT Bos [organism] searches for all proteases except those of bovine origin (Bos taurus or Bos indicus).
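The search-set choices above can also be expressed as request parameters. The following Python sketch only assembles and encodes such parameters and does not contact NCBI. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY`) are given here as I understand NCBI's BLAST URL API and should be checked against the current NCBI documentation before use; the helper names `combine_entrez_terms` and `build_blastn_params` are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the BLAST service, as given in Section 12.3.1.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def combine_entrez_terms(terms, operator="AND"):
    """Join Entrez terms such as '100:450[slen]' with AND, OR or NOT."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)


def build_blastn_params(query, database="nt", entrez_query=None):
    """Assemble request parameters for a BLASTn search (names are assumptions)."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastn",
        "DATABASE": database,  # 'nt' assumed to denote the nucleotide collection
        "QUERY": query,        # accession number or sequence
    }
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query
    return params


# Restrict hits to sequences of 100-450 bases that are not from Bos.
restriction = combine_entrez_terms(["100:450[slen]", "Bos[organism]"], "NOT")
params = build_blastn_params("EF432553.2", entrez_query=restriction)
print(BLAST_URL + "?" + urlencode(params))
```

Keeping the Entrez restriction as a separate string makes it easy to test the same restriction in the Entrez search box before attaching it to a BLAST search.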

TABLE 12.2 Optional BLASTn parameters. The serial numbers (SN) correspond to the numbered arrows in Figure 12.2.

| SN | Terms | Explanation |
|---|---|---|
| 1 | Maximum target sequences | A user can choose (from the drop‐down list) the maximum number of aligned sequences to be displayed in the BLAST result. The list offers a range of values. |
| 2 | Short queries | Check this box if BLASTn should automatically adjust the word size (seed length) and related parameters for short query sequences to improve results. |
| 3 | Expect threshold | A BLAST search may return chance hits to non‐homologous sequences. This threshold value (E‐value) should be lowered to minimize random matches in the database. For example, if the match score (S‐score) is 32.7 and the E‐value is 0.025, a score of 32.7 or better would be expected by chance 2.5 times in 100 searches (i.e., once in 40). An E‐value ≤ 0.005 is considered to be statistically significant. |
| 4 | Word size | BLAST follows a heuristic algorithm, in which a seed word of a specific length first finds its match and is then extended in both directions. BLASTn needs an exact match of the seed word between the query and subject sequences. The word size is chosen from a drop‐down menu. A larger word size may give fewer results, while a shorter word size may lead to more random hits. |
| 5 | Max matches in a query range | This is a very useful parameter with practical implications. Sometimes one portion of a query sequence gets a very large number of matches, due to strong similarity, so that the matches to the other portions are not displayed. This option limits the number of strong matches to one region and gives the weaker matches to other portions of the same query sequence a chance to be reported. |
| 6 | Match/mismatch scores | The user can set the ratio between the reward for a match and the penalty for a mismatch. The selection is made from a pull‐down menu that specifies a positive value for a match and a negative value for a mismatch. A ratio closer to one (e.g., 1/−1) should be used for identifying divergent sequences through BLASTn. |
| 7 | Gap costs | A drop‐down menu displays the available gap costs. Linear costs are available for MegaBLAST only. Increasing the gap costs reduces the occurrence of gaps in the aligned sequences. |
| 8 | Filter: low complexity region | Low complexity regions are repetitive sequences that could introduce spurious results in BLASTn matching. Check this box to filter them. |
|  | Filter: species‐specific repeats for | If checked, this masks the repeat elements of the selected species. |
| 9 | Mask: masks for lookup table only | BLASTn selects the seed word from the look‐up table and then proceeds to extension. If this option is checked, no seed is taken from the low complexity regions. |
|  | Mask: mask lower case letter | Lower case characters (i.e., bases), indicating low complexity regions, are masked and not considered for BLAST. |

FIGURE 12.1 Main page for BLASTn search at NCBI. The query can be entered into the box either as an accession number or as a sequence in FASTA format. The GenInfo Identifier (i.e., the gi shown in this figure) is no longer used as a sequence identifier in the NCBI nucleotide database.

12.3.1.3 Program selection

This is a very important parameter to be chosen:

  1. MegaBLAST for highly similar sequences: This is very fast, but the target should have 95% or more identity with the query. It suits, for example, two nucleotide sequences of the same or very closely related species (B. taurus vs. B. indicus).
  2. Discontiguous MegaBLAST for more dissimilar sequences: This allows mismatches and is more suitable for cross‐species comparison, for example, two nucleotide sequences from two different species (dog vs. bovine, or more divergent species).
  3. blastn for somewhat similar sequences: This is somewhat slower than the other two options. It allows the user to search the database with a shorter word size, and so finds sequences with a lower degree of similarity to the query, for example, nucleotide sequences of distantly related organisms (e.g., searching for homologs of a yeast sequence in mice).
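The chapter describes the web interface only, but the same three programs exist as `-task` values of the `blastn` executable in the standalone NCBI BLAST+ package. Assuming a local BLAST+ installation (an assumption that goes beyond this chapter), a command line could be assembled as in the sketch below. The file name, database name and default values are hypothetical, and the command is only built here, not executed.

```python
# Web-interface program names mapped to BLAST+ 'blastn -task' values.
TASKS = {
    "megablast": "megablast",
    "discontiguous megablast": "dc-megablast",
    "blastn": "blastn",
}


def blastn_command(query_file, database, program="megablast",
                   evalue=0.05, max_target_seqs=100, word_size=None):
    """Build (but do not run) a BLAST+ blastn command as an argument list."""
    cmd = [
        "blastn",
        "-task", TASKS[program],
        "-query", query_file,                     # FASTA file with the query
        "-db", database,                          # name of a formatted database
        "-evalue", str(evalue),                   # expect threshold
        "-max_target_seqs", str(max_target_seqs),
        "-outfmt", "6",                           # tabular output
    ]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]     # override the task default
    return cmd


# Cross-species search with a shorter, explicitly chosen word size.
print(" ".join(blastn_command("query.fasta", "nt",
                              program="discontiguous megablast",
                              word_size=11)))
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and a formatted database are available.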

12.3.2 Algorithm parameters

The default parameters of BLAST are fine to use on some occasions. The user, nevertheless, would need to optimize the parameters under certain conditions, such as if the query size is short, or the query is to be searched against divergent homologs, by expanding algorithm parameters (Figure 12.2). The meanings of the parameters are explained in Table 12.2.
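The effect of the word size on a short query can be seen with a toy seed-and-extend sketch in Python. It is a deliberate simplification, not the BLAST implementation: seeds are exact word matches found through a look-up table, and the extension simply stops at the first mismatch, whereas BLAST scores the extension and stops on a score drop-off. The function names and the two sequences are hypothetical.

```python
def seed_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Look-up table: every word of the subject and where it starts.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, subject, i, j, word_size):
    """Extend a seed in both directions while bases match (simplified).

    Returns (query_start, subject_start, length), all 0-based.
    """
    left = 0
    while (i - left - 1 >= 0 and j - left - 1 >= 0
           and query[i - left - 1] == subject[j - left - 1]):
        left += 1
    right = word_size
    while (i + right < len(query) and j + right < len(subject)
           and query[i + right] == subject[j + right]):
        right += 1
    return (i - left, j - left, left + right)


query = "ACGTTGCA"      # hypothetical short query
subject = "TTACGTTGAA"  # hypothetical subject

for w in (2, 4, 7):
    print(w, len(seed_hits(query, subject, w)))
```

With these toy sequences, the number of seeds falls as the word size grows, and a word longer than the longest exact match yields no seed at all, which is why a short query may need a smaller word size.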


FIGURE 12.2 Optional BLASTn parameters. Numbered arrows refer to the serial numbers (SN) discussed in Table 12.2.

TABLE 12.3 Databases against which a query can be searched in BLASTn (http://www.ncbi.nlm.nih.gov/books/NBK153387/).

| SN | Databases | Description of database |
|---|---|---|
| 1 | Human genomics plus transcript | Genomic DNA sequences (from all assemblies and chromosomes) and RefSeq RNA sequences of human. |
| 2 | Mouse genomics plus transcript | Genomic DNA sequences (from all assemblies and chromosomes) and RefSeq RNA sequences of mouse. |
| 3 | Nucleotide collection | Non‐redundant sequences from GenBank, EMBL, DDBJ, PDB and RefSeq; however, this excludes very specific databases like EST, STS, GSS, WGS, TSA, patent sequences and HTGS (phases 0–2) sequences. |
| 4 | Reference RNA sequence | Reference sequences for various transcripts at NCBI db. |
| 5 | Reference genomic sequences | Reference sequences for various genomic sequences at NCBI db. |
| 6 | NCBI genomes | NCBI chromosomal DNA sequences of all species in db. |
| 7 | Expressed sequence tags (EST) | EST sequences from GenBank, EMBL and DDBJ db. |
| 8 | Genomic survey sequences (GSS) | GSS, namely single‐pass genomic data (a sequence that has been analyzed in sequencer machine only once), exon‐trapped sequences (that are used to identify genes in cloned DNA, by recognizing and trapping carrier containing the exon sequence), and Alu PCR sequences. |
| 9 | HT genomic sequences (HTGS) | HTGS of Phases 0, 1 and 2; the unfinished HTGS. |
| 10 | Patent sequences | DNA sequences available at the patent division of GenBank. |
| 11 | Protein data bank (PDB) | Nucleotide sequence database maintained at PDB. |
| 12 | Human Alu repeat elements | Abundant Alu elements present in human genome. |
| 13 | Sequence tagged sites (STS) | STS sequences from GenBank, EMBL and DDBJ db. |
| 14 | Whole genome shotgun contigs (WGSC) | Database harboring the WGS contigs, except for the WGS data from Chromosome db. |
| 15 | Transcriptome shotgun assembly (TSA) | Database containing the computationally assembled mRNA sequences from primary data. |
| 16 | 16S rRNA sequences (bacteria and archaea) | 16S ribosomal RNA data belonging to bacteria and archaea. |

12.3.3 Click on BLAST button

The BLAST results open in a new window (or tab) if the “Show results in a new window” box is checked.


FIGURE 12.3 The result page of BLASTn contains the color key‐based alignment display, followed by a tabular description of sequence alignments and, finally, alignments of each of the sequence pairs (query vs. database sequence).

12.3.4 Interpretation of BLAST results

  1. Query: This refers to the input sequence (or accession number of a sequence given as input) that is to be compared against the entries (i.e., subject) in a database.
  2. Raw alignment score (S): The score of an alignment, calculated on the basis of the matches, mismatches/substitutions and gaps in the alignment. For protein alignments, BLAST awards substitution scores according to PAM or BLOSUM matrices; for nucleotide alignments (BLASTn), it uses the match/mismatch scores. Gaps are penalized with a gap‐open penalty (higher value) and a gap‐extension penalty (lower value than the gap‐open penalty).
  3. Bit score (S′): The raw alignment score normalized for the scoring system used, so that alignment scores from different searches can be compared. The higher the bit score, the better the alignment.
  4. High‐scoring Segment Pair (HSP): A local alignment (without gaps) with the maximal (or near‐maximal) alignment score in a given search. A single query may give more than one HSP with a single subject sequence of the database. These are presented as ranges of the subject sequence. Two situations can arise when aligning the ‘Query’ and ‘Subject’ sequences, due to the occurrence of considerably large gaps in one of the sequences (e.g., gaps arising from introns).
    1. When the query‐sequence is a gene sequence with intron, but the subject is a coding sequence (or mRNA) without gap, the color key for alignment score will show a black pipe symbol on the colored line.
    2. When the subject sequence is a gene sequence with intron, but the query is a coding sequence (or mRNA) without gap, the color key for alignment score will show a blank space on the colored line.
    3. Note that there could also be several ranges for a single pair of query and subject alignments, which can be overlapping over the subject sequence.
  5. Max Score: The highest bit score among the HSPs (there may be more than one) for a single pair of query and subject sequences. The higher the Max Score, the lower the E‐value.
  6. Total Score: The sum of the bit scores from all HSPs obtained in an alignment between query and subject sequences.
  7. Maximum Identity: The highest percentage of matches for a set of HSPs with respect to the subject sequence.
  8. E‐value (Expectation value or Expect value): The number of alignments with a score equal to or better than S that are expected to occur by chance alone (i.e., through random matches rather than homology between the sequences). Thus, it “describes the chance of randomly achieving the same alignment in a database of a particular size”. The E‐value is not strictly a probability, but the lower the E‐value, the less likely it is that the hit arose by chance, and the better the alignment is. The E‐value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. In terms of the bit score S′, it is calculated as $E = m \times n \times 2^{-S'}$, where $m$ is the query length and $n$ is the length of the database.
  9. Query Coverage: The proportion (expressed in %) of the query sequence that has a homologous counterpart in the subject sequence (i.e., the percentage of the query sequence that has been included in the alignments over all the HSPs).
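The relationships between these quantities can be made concrete with a short Python sketch. It assumes the E-value formula given above, applied to the bit score, and the standard normalization of a raw score S to a bit score, S′ = (λS − ln K)/ln 2, where λ and K are statistical parameters of the scoring system that BLAST reports with each search. The function names, the HSP coordinates and the numbers used are hypothetical, and actual BLAST output may differ slightly because BLAST uses effective (corrected) sequence lengths.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """E = m * n * 2^(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bits)


def summarize_hsps(hsps, query_length):
    """Summarize the HSPs of one query/subject pair.

    hsps: list of (query_start, query_end, bit_score), 1-based inclusive.
    Returns (max_score, total_score, query_coverage_percent).
    """
    max_score = max(bits for _, _, bits in hsps)
    total_score = sum(bits for _, _, bits in hsps)
    # Merge overlapping query ranges so shared bases are counted once.
    covered = 0
    cur_start = cur_end = None
    for start, end, _ in sorted(hsps):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return max_score, total_score, 100.0 * covered / query_length


# Hypothetical hit: three HSPs on a 400-base query.
hsps = [(1, 100, 80.0), (90, 150, 40.0), (301, 350, 30.0)]
print(summarize_hsps(hsps, 400))
print(e_value(40.0, 200, 1_000_000))
```

Because the E-value is proportional to the database length, the same bit score gives a larger E-value against a larger database, which is one reason why E-values from searches against different databases are not directly comparable.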

12.4 QUESTIONS

  1. Download the sequence EF432553.2 (partial mRNA sequence of taurine cattle) from the NCBI GenBank and then BLAST it to find the bubaline mRNA sequence of the same gene. Give reasons for selecting the particular bubaline sequence.
  2. Given the same sequence (EF432553.2), how will you obtain the transcript variants of the bovine TSPY gene?
  3. Suppose BLASTn of a given nucleotide sequence (200 bases in length) shows an E‐value of > 0.05 for a set of sequences. Will you consider these sequences to be worth further study?
  4. Explain the following terms:
    1. E‐value
    2. HSP
    3. Bit‐score
    4. Megablast
    5. Discontiguous megablast
  5. Interpret the given BLASTn output in your own language. Explain each of the terms given in the output:
    BLASTn output displaying the sequences producing significant alignments.

    FIGURE 12.4

    BLASTn output displaying the sequence EF432553.2 from the NCBI GenBank.

    FIGURE 12.5

  6. When can you infer that you have obtained a unique sequence in the output? Does the E‐value play any role in finding the unique match?
  7. From the output obtained in 5a, can we find or reach the page indicating its cytogenetic location? If yes, how can we do so?