CHAPTER 13
Basic Local Alignment Search Tool for Amino Acid Sequences (BLASTp)

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

13.1 INTRODUCTION

BLASTp is a set of programs that searches the protein databases using an amino acid sequence as the query. There are four different algorithms, with well‐defined applications in BLASTp: blastp, psi‐blast, phi‐blast and delta‐blast.

13.2 OBJECTIVE

To search the protein database for sequences homologous to a given amino acid sequence, which is used as the query.

13.3 PROCEDURE

13.3.1 Protein‐protein BLAST (BLASTp)

The necessary steps are the same for BLASTp and BLASTn (see Chapter 12: “Basic Local Alignment Search Tool for nucleotide (BLASTn)”), regarding:

  • selecting the sequences of interest (query sequences);
  • specifying the BLAST program;
  • selecting the sequence database;
  • adjusting the optional parameters.

13.3.1.1 Open the BLASTp homepage

Open the URL: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome to get the homepage for BLASTp.
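
The homepage address is a base URL followed by three query parameters. As a minimal sketch, the same address can be assembled in Python with the standard library. The function name blastp_homepage_url is hypothetical; the base URL and the parameter names are the ones given in the text.

```python
from urllib.parse import urlencode

# Base address as printed in this chapter.
BASE_URL = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"


def blastp_homepage_url(base=BASE_URL):
    """Assemble the BLASTp homepage URL from its query parameters."""
    params = {
        "PROGRAM": "blastp",
        "PAGE_TYPE": "BlastSearch",
        "LINK_LOC": "blasthome",
    }
    return f"{base}?{urlencode(params)}"


url = blastp_homepage_url()
print(url)
```

Building the address from a dictionary avoids stray spaces around "=" and "&", which would otherwise break the link.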

13.3.1.2 Enter query sequences

  1. Enter accession number(s) or FASTA sequence(s): Paste one or more query sequence(s) in FASTA format, or the respective NCBI Protein accession number(s), into the specified sequence box. Alternatively, a text file containing the query sequences (in FASTA format) could be uploaded by clicking the “Choose File” button.
  2. Give a job title to identify the BLAST results from saved searches.
  3. Uncheck “Align two or more sequences”: When this checkbox is ticked, the page will be refreshed to provide the user with another sequence box where the subject sequence(s) is/are to be pasted. Such alignment of query and specific subject sequences is done to study sequence homology, according to the requirements of the user. However, when this option is checked, the “Database”, “Organism”, “Exclude” and “Entrez Query” parameters are not required and so do not remain available on the page.
  4. Provide Query Sub‐range (optional) to specify the range of the input sequence that is to be searched against the database. It is particularly useful when an NCBI Protein accession number is supplied instead of the sequence itself (Figure 13.1).
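
Before pasting a query, it can help to check that the text really is protein FASTA and to preview what a given sub‐range will select. The following is a minimal sketch using only the Python standard library. The function names, the set of accepted residue letters, and the example record are assumptions made for illustration; the BLAST web form does its own validation.

```python
# Assumption: standard amino acid letters plus common ambiguity codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZJUO*")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_protein(sequence):
    """True if the sequence is non-empty and uses only accepted residue letters."""
    return bool(sequence) and set(sequence) <= VALID_RESIDUES


def sub_range(sequence, start, stop):
    """Return residues start..stop (1-based, inclusive), like a query sub-range."""
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError("sub-range must lie within the sequence")
    return sequence[start - 1:stop]


# Hypothetical record, used only to exercise the functions.
example = ">demo_query hypothetical example\nMKRPMNAFIV\nWSRDQRRKMA\n"
records = parse_fasta(example)
header, sequence = records[0]
print(header, len(sequence), is_protein(sequence))
print(sub_range(sequence, 2, 11))
```

The sub‐range uses 1‐based, inclusive coordinates, so positions 2 to 11 of a 20‐residue query return ten residues.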

13.3.1.3 Choose search set

  1. Database: You need to choose one of the following databases:
    1. Non‐redundant protein sequences (nr): this contains non‐redundant translated coding sequences, together with sequences from PIR, Swiss‐Prot, PDB and PRF, excluding those in env_nr.
    2. Reference proteins (refseq_protein): contains amino acid sequences from the NCBI Reference Sequence project.
    3. UniProtKB/Swiss‐Prot (swissprot): includes the latest major release of the Swiss‐Prot database.
    4. Patented protein sequences (pat): consists of the proteins maintained by the Patent Division of NCBI GenBank.
    5. Protein Data Bank proteins (pdb): the amino acid sequences derived from the reported 3D structure records of PDB.
    6. Metagenomic proteins (env_nr): in silico translated, non‐redundant coding sequence entries from env_nt.
    7. Transcriptome Shotgun Assembly proteins (tsa_nr): the non‐redundant coding sequences are translated from the TSA archive.
  2. Organism (optional): Specify the organism, by common name, binomial name or taxonomical ID, to do the search against its protein sequences. Conversely, you can also check the small check box adjacent to the entry box to exclude any one or more organisms (click on the “+” sign to add more organisms) from your search results.
  3. Exclude Models (XM/XP) and/or Uncultured/environmental sample sequences (optional): You can check one or both of the check boxes to exclude one or both options. Models (XM/XP) stands for the “model reference sequences”. These sequences are determined and annotated by the Genome Annotation Project of NCBI and, hence, could be incomplete.
  4. Entrez Query (Optional): Same as BLASTn, and used to restrict the search to specified Entrez query. It allows Boolean operators AND, OR, NOT to define the database to be searched.
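
An Entrez query is plain text made of terms joined by the Boolean operators AND, OR and NOT. The sketch below shows one way such a string could be composed before pasting it into the “Entrez Query” box. The helper names are hypothetical, and the field tags [Organism] and [Title] and the example species are illustrative choices, not taken from this chapter.

```python
def entrez_term(value, field):
    """Format one term as value[Field], quoting values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join terms with AND or OR and wrap the group in parentheses."""
    if operator not in {"AND", "OR"}:
        raise ValueError("operator must be 'AND' or 'OR'")
    if not terms:
        raise ValueError("at least one term is required")
    return "(" + f" {operator} ".join(terms) + ")"


def exclude(base, term):
    """Remove records matching term from the set described by base."""
    return f"{base} NOT {term}"


# Illustrative restriction: two species, minus entries titled "hypothetical protein".
organisms = combine(
    "OR",
    entrez_term("Bos taurus", "Organism"),
    entrez_term("Bubalus bubalis", "Organism"),
)
query = exclude(organisms, entrez_term("hypothetical protein", "Title"))
print(query)
```

Parentheses around the OR group make the precedence explicit, so the NOT clause applies to both organisms.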

FIGURE 13.1 Setting the parameters for BLASTp search at NCBI. The sequence(s) can be entered into the box as query sequence(s), with either NCBI Protein accession number or sequence(s) in FASTA format.

13.3.1.4 Program selection

  1. Algorithm: There are four algorithms; choose any one of these, depending on your sequence and the end results you are interested in getting from the BLAST.
    1. BLASTp: searches protein database using protein query. Recently, NCBI protein BLAST has included a new method called “Quick BLAST” or faster BLASTp.
    2. Position‐Specific Iterative BLAST (PSI‐BLAST): used to find more distantly related matches. The hits from the initial search are used to build a position‐specific scoring matrix (PSSM), which records the substitutions tolerated at each position; subsequent iterations search the database with this PSSM instead of a general substitution matrix. That is how it finds the members of a protein family.
    3. Pattern Hit Initiated BLAST (PHI‐BLAST): a variation of PSI‐BLAST, used when the protein family has a known signature pattern (e.g., structural domain, active site, evolutionarily conserved sequence, etc.), with the aim of eliminating false positives. A pattern (protein domain or motif) is entered in the PHI pattern box and must be matched during the database search.
    4. Domain Enhanced Lookup Time Accelerated BLAST (DELTA‐BLAST): more sensitive than BLASTp, as it first uses a Reverse Position‐Specific BLAST (RPS‐BLAST) search against conserved domains to construct the PSSM. DELTA‐BLAST results can be used to initiate a PSI‐BLAST search for better accuracy.
  2. Click “BLAST”: Click on the button to begin the BLASTp search. Click the adjacent checkbox (before executing “BLAST” command) to open the search result in a new window.
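
PSI‐BLAST and DELTA‐BLAST both rely on a PSSM. The sketch below shows the basic idea on a toy scale: count the residues in each column of a set of aligned sequences, add a pseudocount, and convert the frequencies to log‐odds scores. It is a deliberately simplified illustration and not the procedure used by NCBI; the uniform background frequency, the fixed pseudocount, the function names and the four aligned peptides are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Count only standard residues; gap characters are ignored.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, sequence):
    """Sum the column scores for an ungapped sequence of standard residues."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the PSSM length")
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Hypothetical aligned peptides standing in for first-round hits.
aligned_hits = ["KRPMN", "KRPLN", "KKPMN", "RRPMN"]
pssm = build_pssm(aligned_hits)
print(round(score(pssm, "KRPMN"), 2), round(score(pssm, "AAAAA"), 2))
```

A residue that is conserved in a column (here, P in the third position) receives a positive score, while residues never observed there receive negative scores. This is what lets a PSSM detect distant family members that a general matrix would miss.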

13.3.2 Algorithm parameters

These are of the following subtypes. Details of each have been provided in Table 13.1.

  1. General parameters
  2. Scoring parameters
  3. Filters and masking

TABLE 13.1 Algorithm parameters of BLASTp: Numbered arrows in Figure 13.2 refer to the serial number (SN) of discussion in this table.

SN | Terms | Explanation
1 | Maximum target sequences | As in BLASTn, the user can choose (from the drop‐down list) the maximum number of aligned sequences to be displayed in the BLAST result.
2 | Short queries | Check this box if BLASTp should automatically adjust the related parameters (e.g., the “word size” or seed length) to improve results for short query sequences.
3 | Expect threshold | A BLAST alignment may arise from chance hits to non‐homologous sequences. Lowering this threshold value (E‐value) minimizes the random matches from the database. The default value of 10 means that up to ten of the reported matches could be expected to occur by chance. Reducing the Expect threshold will reduce the search output.
4 | Word size | You can opt for either 2 or 3. Since BLAST follows a heuristic algorithm, it first finds matches to a seed word of the given length and then extends them in both directions. A larger word size may give fewer results, while a shorter word size can lead to more random hits.
5 | Max matches in a query range | This is a very useful parameter with practical implications. Sometimes, one portion of a query sequence gets a very large number of matches due to high similarity, so that matches to the other portions are crowded out of the results. This option restores the balance by limiting the number of strong matches to any one region, which gives the portions with weaker matches within the same query sequence a chance to be displayed.
6 | Matrix | The user can select between Percent Accepted Mutations matrices (PAM: used to score alignments between closely related sequences) and Blocks Substitution Matrices (BLOSUM: for evolutionarily divergent sequences). A higher matrix number indicates greater evolutionary distance in PAM, and the reverse in BLOSUM. The BLOSUM62 scoring matrix is a useful all‐round matrix.
7 | Gap costs | A drop‐down menu displays the available gap costs. Increasing the gap cost will reduce the number of gaps. It is better to use the default value, unless the results contain too many irrelevant (false positive) matches.
8 | Compositional adjustments | This setting adjusts the scores to compensate for compositional differences between the sequences being compared. The adjustment thus improves the E‐value statistics of the search. “Conditional compositional score matrix adjustment” is more sophisticated than “Composition‐based statistics”. One can use the default option.
9 | Filter: low complexity regions | Low complexity regions are repetitive sequences of biased composition, which could introduce spurious results in BLASTp matching.
10 | Mask: mask for lookup table only | BLASTp selects the seed words from the look‐up table and then proceeds to extension. If this filter is checked, no seed is taken from the low complexity regions.
 | Mask: mask lower case letters | The lower‐case characters (i.e., residues) in the query, which indicate low complexity regions, are masked and are not considered for BLAST.
11 | Upload PSSM | This is a very useful, advanced, but optional tool. One can download a PSSM from a PSI‐BLAST or DELTA‐BLAST search. That PSSM can then be uploaded to search a different database for the required homology.
12 | PSI‐BLAST threshold | The statistical significance (E‐value) threshold for including a protein sequence in the PSSM used in the following iteration. The default is 0.005.
13 | Pseudo count | The default is “0”, which lets BLASTp determine the pseudo‐count value itself, based on the minimum description length principle.
Optional BLASTp parameters displaying 4 panels for general parameters, scoring parameters, filters and masking, and PSI/PHI/DELTA BLAST, with rightward arrows pointing to matrix, gap costs, masking, etc.

FIGURE 13.2 Optional BLASTp parameters. The numbered arrows refer to the serial number of discussion in Table 13.1.
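
The Expect threshold in Table 13.1 is easier to interpret with the underlying arithmetic in view. In the standard BLAST statistics, a raw score S is converted to a bit score S' = (λS − ln K) / ln 2, and the expected number of chance hits is E = m × n × 2^(−S'), where m and n are the effective lengths of the query and the database. The sketch below implements these two formulas. The values of λ, K and the two lengths are illustrative placeholders and do not come from this chapter; BLAST reports the actual values in the statistics section of each result.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bits)


def passes_threshold(bits, query_length, database_length, expect_threshold=10.0):
    """True if a hit would be reported at the given Expect threshold."""
    return expect_value(bits, query_length, database_length) <= expect_threshold


# Illustrative assumptions only: placeholder statistical parameters and lengths.
LAMBDA, K = 0.267, 0.041
QUERY_LEN, DB_LEN = 250, 50_000_000

for raw in (40, 80, 160):
    bits = bit_score(raw, LAMBDA, K)
    e = expect_value(bits, QUERY_LEN, DB_LEN)
    print(f"raw={raw:4d} bits={bits:7.1f} E={e:.3g} "
          f"reported={passes_threshold(bits, QUERY_LEN, DB_LEN)}")
```

Each additional bit halves the E‐value, and doubling the database size doubles it. This is why the same alignment can be significant against a small database such as swissprot yet fall below the threshold against nr.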

13.3.3 Interpretation of BLASTp results

13.3.3.1 Results of Protein‐Protein BLAST (BLASTp)

  1. This is similar to BLASTn results. The final result shows the query and the database (subject) sequence, separated by a middle line of characters containing the matched residue (indicating identity), a “+” symbol (indicating a positive substitution, but not an identity), or a blank space (indicating a mismatch or a gap).
  2. The “Method” indicates the compositional adjustment selected during BLASTp parameter selection.
  3. GenPept: Clicking this hyperlink will open the flat file containing the sequence with annotation in NCBI GenBank format.
  4. Graphics: Clicking on this hyperlink will open the Graphics window for the protein under consideration.
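
The middle line of a BLASTp alignment can be reproduced from the two aligned rows, which makes the meaning of identities, positives and gaps concrete. The following is a minimal sketch under stated assumptions: POSITIVE_PAIRS is only a small illustrative subset of the residue pairs that score above zero in BLOSUM62, and the function names and the two aligned strings are hypothetical.

```python
# Assumption: a small illustrative subset of BLOSUM62 pairs with positive scores.
POSITIVE_PAIRS = {frozenset(p) for p in ("IV", "IL", "LM", "KR", "DE", "ST", "FY", "EQ")}


def midline(query, subject):
    """Build the middle line: residue for identity, '+' for positive, ' ' otherwise."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")  # gap in either row
        elif q == s:
            symbols.append(q)    # identity
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            symbols.append("+")  # conservative substitution
        else:
            symbols.append(" ")  # mismatch
    return "".join(symbols)


def summarize(query, subject):
    """Count identities, positives (identities included) and gap positions."""
    mid = midline(query, subject)
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    positives = identities + mid.count("+")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {"identities": identities, "positives": positives,
            "gaps": gaps, "length": len(query)}


# Hypothetical aligned rows ('-' marks a gap).
query_row = "KRPMNAFIVW"
subject_row = "KKPLN-FIIW"
print(query_row)
print(midline(query_row, subject_row))
print(subject_row)
print(summarize(query_row, subject_row))
```

As in the BLAST report, the “Positives” count includes the identities, so it is always at least as large as the “Identities” count.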

FIGURE 13.3 Different sections of the result page of BLASTp.

“A” indicates the putative conserved domain(s) detected by BLASTp search. Clicking on this image will open the graphical summary of the conserved domain(s) of that protein.

“B” indicates the alignment and the scores in terms of color key, for each of the alignments.

“C” indicates the table of alignment detail (Description, Max score, Total score, Query coverage, E‐value, Identity, and Accession).

“D” shows the detail of the alignment residue‐wise.
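
Once the description table (section “C”) has been downloaded, the hits can be screened on the same columns. The sketch below filters on E‐value and query coverage and then sorts the survivors. The Hit class, the cut‐off values and the four records are hypothetical; the accessions are placeholders and not real database entries.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    max_score: float
    query_cover: float  # percent of the query covered by the alignment
    e_value: float
    identity: float     # percent identity


def filter_hits(hits, max_e_value=1e-5, min_query_cover=50.0):
    """Keep hits passing both cut-offs, best (lowest E-value) first."""
    kept = [h for h in hits
            if h.e_value <= max_e_value and h.query_cover >= min_query_cover]
    return sorted(kept, key=lambda h: (h.e_value, -h.max_score))


# Hypothetical rows standing in for a downloaded hit table.
hits = [
    Hit("HYPO_0002", 120.0, 85.0, 3e-30, 47.2),
    Hit("HYPO_0001", 410.0, 100.0, 2e-140, 98.5),
    Hit("HYPO_0003", 35.0, 20.0, 4e-4, 61.0),
    Hit("HYPO_0004", 28.0, 90.0, 7.5, 25.0),
]

best = filter_hits(hits)
for h in best:
    print(h.accession, h.e_value, h.query_cover, h.identity)
```

Screening on both columns matters: a hit may have high identity over a short stretch (low query coverage), or cover the whole query with an E‐value too high to suggest homology.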

13.3.3.2 Results of Position‐Specific Iterative BLAST (PSI‐BLAST)

The results window is almost the same as the previous ones, except for a new column: “Select for PSI‐blast” (indicated by “E” in Figure 13.4). This column contains a checkbox against each of the PSI‐BLAST results. All the checkboxes are ticked by default; however, individual sequences can be excluded by unchecking them. PSI‐BLAST can then be run for a second iteration by indicating the number of sequences (default is 500) and then clicking on the “Go” button.


FIGURE 13.4 Results of PSI‐BLAST. ‘E’ indicates the “Select for PSI blast” column, and “F” indicates the detailed result for each alignment.

The last column, “Used to build PSSM”, is checked to indicate the sequences which have been used in the second iteration. The rows harboring the sequences which have not been used in this iteration to build the PSSM are highlighted in yellow.
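
The bookkeeping between PSI‐BLAST iterations can be pictured as set operations: sequences at or below the inclusion threshold (default 0.005, Table 13.1) go into the next PSSM, unless the user has unchecked them, and the search has converged when an iteration adds no new sequences. The sketch below illustrates only this logic. The function names, the accessions and the E‐values are hypothetical.

```python
INCLUSION_THRESHOLD = 0.005  # default PSI-BLAST threshold listed in Table 13.1


def select_for_pssm(e_values, threshold=INCLUSION_THRESHOLD, unchecked=()):
    """Accessions whose E-value passes the threshold and that were not unchecked."""
    return {acc for acc, e in e_values.items()
            if e <= threshold and acc not in unchecked}


def newly_included(previous, current):
    """Accessions selected in this iteration but not in the previous one."""
    return current - previous


# Hypothetical E-values reported in two successive iterations.
iteration1 = {"HYPO_A": 1e-50, "HYPO_B": 2e-4, "HYPO_C": 0.8}
iteration2 = {"HYPO_A": 1e-60, "HYPO_B": 1e-10, "HYPO_C": 1e-3, "HYPO_D": 0.02}

used1 = select_for_pssm(iteration1)
used2 = select_for_pssm(iteration2)
new = newly_included(used1, used2)
converged = not new

print(sorted(used1), sorted(used2), sorted(new), converged)
```

In this toy run, the third sequence only becomes significant once the PSSM from the first iteration is used, which is the typical way PSI‐BLAST picks up distant relatives.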

13.3.3.3 Results of Pattern Hit Initiated BLAST (PHI‐BLAST)

In this example, the amino acid sequence “krpmnafivw srdqrrkmal” has been used as a PHI pattern to run PHI‐BLAST. The specified pattern is indicated by asterisk marks above the alignment.


FIGURE 13.5 Result of PHI‐BLAST. ‘G’ indicates the detailed result of each alignment. The asterisks in the second row of alignment indicate the pattern which has been given for PHI‐BLAST analysis.
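
The asterisk row can be mimicked locally by converting the PHI pattern to a regular expression and marking where it matches the query. The sketch below is a minimal illustration that handles only a subset of PROSITE‐style pattern syntax: single residues, bracketed alternatives such as [LIVM], the wildcard x, and repeats such as x(5) or x(2,4), with hyphens and spaces treated as separators. The function names and the 25‐residue test sequence are hypothetical; only the pattern “krpmnafivw srdqrrkmal” comes from the text.

```python
import re

# One pattern element: a residue, X (wildcard) or [ALTERNATIVES], with optional (n) or (n,m).
TOKEN = re.compile(r"([A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset) into a Python regular expression."""
    compact = re.sub(r"[\s-]", "", pattern).upper()  # separators carry no meaning
    parts, pos = [], 0
    while pos < len(compact):
        match = TOKEN.match(compact, pos)
        if match is None:
            raise ValueError(f"unsupported pattern syntax at position {pos}")
        core, low, high = match.groups()
        regex = "." if core == "X" else core
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
        pos = match.end()
    return "".join(parts)


def mark_pattern(sequence, pattern):
    """Return a row with '*' under the first pattern match, or None if absent."""
    match = re.search(pattern_to_regex(pattern), sequence.upper())
    if match is None:
        return None
    width = match.end() - match.start()
    return " " * match.start() + "*" * width + " " * (len(sequence) - match.end())


# Hypothetical query containing the pattern used in this section.
sequence = "MPKRPMNAFIVWSRDQRRKMALENP"
marks = mark_pattern(sequence, "krpmnafivw srdqrrkmal")
print(marks)
print(sequence)
```

If the pattern does not occur in the query, mark_pattern returns None. PHI‐BLAST likewise requires the pattern to be present in the query sequence before the search can proceed.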

13.3.3.4 Results of domain enhanced lookup time accelerated (DELTA) BLAST

The output of DELTA‐BLAST (Figure 13.6) is more sensitive and accurate than PSI‐BLAST.

Result page of DELTA‐BLAST displaying the color key‐based alignment followed by tabular description of sequence alignments and alignment of each of the sequence pairs (query versus database subject sequences).

FIGURE 13.6 The result page of DELTA‐BLAST. The components and parameters are similar to PSI‐BLAST.

13.4 QUESTIONS

  1. Enumerate in brief the principle and use of the following types of protein BLAST: BLASTp, PSI‐BLAST, PHI‐BLAST, DELTA‐BLAST.
  2. Download a set of divergent peptide sequences (from at least ten different species) using the given query: ACC61291.
  3. Interpret various parameters obtained from the BLASTp result.
  4. Use the sequence gkqesmdskl as a PHI pattern and identify the pattern in the sequence NP_005517.1.
  5. Explain the results for each of these parameters:
    Result page of BLASTp displaying the tabular description of sequence alignments featuring max score, total score, query cover, E value, ident, and accession.
    Result page of PSI‐BLAST displaying the alignment of the sequence pairs (query versus database subject sequences).
    Result page of PHI‐BLAST displaying the tabular description of sequence alignments.
    Result page of DELTA‐BLAST displaying the alignment of each of the sequence pairs (query versus database subject sequences).