This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 13: NCBI-BLAST Reference
blastpgp Parameters (PSI-BLAST and PHI-BLAST)
blastpgp is the program used to run PSI-BLAST and PHI-BLAST. These programs are
specialized protein BLAST comparisons that are more sensitive than the standard
BLASTP search. PSI-BLAST considers position-specific information when searching
for significant hits. PHI-BLAST uses a pattern, or profile, to seed an alignment,
which is then extended by the normal BLASTP algorithm.
PSI-BLAST
PSI-BLAST (position-specific iterated BLAST) uses a specialized scoring matrix that
assigns scores to each position (hence, position-specific) in the query sequence based
on alignments defined by consecutive iterations of searches (hence, iterated). The
specialized matrix is a position-specific scoring matrix (PSSM) that assigns a score for
every amino acid at each position in the query sequence (see Figure 13-1).
Figure 13-1 shows a portion of a PSSM calculated for the coelacanth HoxA11 protein
(AAG39070). The query amino acids are numbered in the left column, with the
position-specific scores for each of the 20 amino acids shown across each row. The
differing scores of the three tyrosines (Y) at positions 1, 7, and 8 highlight the
position-specific aspect of this scoring scheme compared to traditional BLAST matrices,
which would assign the same scores to Y at all three positions.
The PSSM, or checkpoint file, is created internally by PSI-BLAST, but it can also be
exported to a file using the -C option of blastpgp. Exporting it is useful because you
can reuse the checkpoint file in subsequent PSI-BLAST (blastpgp) searches or as a
database entry for the RPS-BLAST program. You can also use the PSSM in a specialized
tblastn search in blastall by using the -p psitblastn and -R <checkpoint file>
options with a nucleotide database.
Figure 13-1. PSSM for the first 10 amino acids of the coelacanth HoxA11 protein
      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 2 L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 3 P -1 -2 -2 -2 -3 -2 -1 -2 -2 -3 -3 -1 -3 -4  8 -1 -1 -4 -3 -3
 4 S  1 -1  0 -1 -1  0  0 -1 -1 -3 -3  0 -2 -3 -1  5  1 -3 -2 -2
 5 C -1 -4 -3 -4  9 -3 -4 -3 -3 -2 -2 -3 -2 -3 -3 -1 -1 -3 -3 -1
 6 T  0 -1  0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1  4  3 -3 -2 -2
 7 Y -2 -3 -3 -4 -3 -2 -3 -4  1 -1 -1 -3 -1  5 -4 -2 -2  1  7 -2
 8 Y -1 -1 -1 -1 -2  0 -1 -2  6 -2 -1 -1 -1  1 -1 -1 -1  0  5 -2
 9 V -1 -2 -2 -2 -1 -2 -2 -2 -2  1  2 -2  0 -1 -2 -2 -1 -2 -1  4
10 S -1 -1 -1 -1 -3  3  3 -2 -1 -2  1  0 -1 -2 -2  2 -1 -3 -2 -2
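The position-specific lookup that a PSSM provides can be sketched in a few lines of Python. This is only an illustration and not part of blastpgp: the names AMINO_ACIDS, PSSM_ROWS, and pssm_score are invented for the sketch, and the three rows are copied from Figure 13-1.

```python
# Column order of Figure 13-1.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows 1, 7, and 8 of Figure 13-1. The query residue is Y in all three rows.
PSSM_ROWS = {
    1: [-2, -2, -2, -3, -2, -1, -2, -3, 2, -1, -1, -2, -1, 3, -3, -2, -2, 2, 7, -1],
    7: [-2, -3, -3, -4, -3, -2, -3, -4, 1, -1, -1, -3, -1, 5, -4, -2, -2, 1, 7, -2],
    8: [-1, -1, -1, -1, -2, 0, -1, -2, 6, -2, -1, -1, -1, 1, -1, -1, -1, 0, 5, -2],
}


def pssm_score(position, residue):
    """Return the score for aligning `residue` to the given query position."""
    column = AMINO_ACIDS.index(residue.upper())
    return PSSM_ROWS[position][column]


if __name__ == "__main__":
    # The same residue (Y) receives a different score at each query position.
    for position in sorted(PSSM_ROWS):
        print(position, pssm_score(position, "Y"), pssm_score(position, "H"))
```

A traditional substitution matrix would need only the two residues as input. Here the query position is part of the lookup, which is why a histidine aligned to the Y at position 8 can score higher than a tyrosine does.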
To run PSI-BLAST, the -j parameter must be set to a value greater than 1. The
default of -j 1 means that there are no iterations, so the run is the same as a
single BLASTP search. The -j value sets the maximum number of iterations to run;
the program stops earlier if the search converges. Convergence occurs when no new
sequences are found that are better than the E-value threshold set by the -h
parameter.
Here are a few sample command lines:
blastpgp -d nr -i my_protein -s T -j 5
blastpgp -d nr -i my_protein -R my_protein.ckp -j 5 -h 0.001
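The stopping rule described above (run at most -j rounds, and stop early once a round adds no new sequences better than the -h threshold) can be modeled as a small loop. The Python below is a toy model under stated assumptions, not the blastpgp implementation: search_round is a hypothetical callable standing in for one database pass, and the PSSM rebuilding that PSI-BLAST does between rounds is left out.

```python
def iterate_until_converged(search_round, max_iterations, inclusion_threshold):
    """Toy model of the PSI-BLAST stopping rule.

    search_round(iteration, included) is a hypothetical stand-in for one
    database search; it must return a dict mapping hit id -> E value.
    Returns (included hit ids, rounds run, whether the search converged).
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = search_round(iteration, included)
        # "New" means better than the threshold and not already included.
        new_hits = {
            hit_id
            for hit_id, e_value in hits.items()
            if e_value < inclusion_threshold and hit_id not in included
        }
        if not new_hits:
            return included, iteration, True  # converged before the limit
        included |= new_hits
    return included, max_iterations, False  # stopped by the iteration limit


if __name__ == "__main__":
    # Scripted results for three rounds (made-up E values).
    scripted = [
        {"a": 1e-10, "b": 0.5},
        {"a": 1e-12, "b": 1e-4, "c": 2.0},
        {"a": 1e-12, "b": 1e-5, "c": 1.0},
    ]
    print(iterate_until_converged(lambda i, inc: scripted[i - 1], 5, 0.001))
```

With a limit of 1, the loop runs a single round and never reports convergence, which mirrors the statement that -j 1 is the same as a single BLASTP search.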
PHI-BLAST
PHI-BLAST stands for pattern-hit initiated BLAST. The program uses an input
sequence and a defined pattern to query a protein database. The pattern is defined in
PROSITE format (http://ca.expasy.org/prosite/) and is used as the seed for the
alignment, in place of the words that are usually generated for seeding
alignments in BLASTP. Here’s a sample profile:
ID  HoxA11 pattern1
PA  Y-S-[SA]-X-[LVIM]
The profile’s syntax has a line starting with ID, followed by two spaces and the name
of the pattern. The name is free text. The next line should start with PA, followed by
two spaces, and then the pattern in PROSITE format. The PROSITE format is simple:
a dash (-) separates letters, an X means any letter, and brackets ([]) specify a
choice of amino acids. You can find more information on the pattern syntax in the
README.bls file that comes with the NCBI-BLAST distribution.
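To check where a pattern occurs in a query before running a search, the three syntax rules just described can be translated into a regular expression. The sketch below is an assumption-laden illustration, not the PHI-BLAST matcher: it handles only dashes, X, and bracketed choices, the function names are invented, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate the pattern subset described above into a regular expression.

    Only three constructs are handled: '-' as a separator, X for any letter,
    and [..] for a choice of amino acids. Anything else raises ValueError.
    """
    parts = []
    for element in pattern.split("-"):
        element = element.strip().upper()
        if element == "X":
            parts.append(".")
        elif len(element) == 1 and element.isalpha():
            parts.append(element)
        elif (element.startswith("[") and element.endswith("]")
              and element[1:-1].isalpha()):
            parts.append(element)
        else:
            raise ValueError("unsupported pattern element: %r" % element)
    return "".join(parts)


def find_pattern_starts(pattern, sequence):
    """Return the 1-based start position of every occurrence, overlaps included."""
    lookahead = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [match.start() + 1 for match in lookahead.finditer(sequence.upper())]


if __name__ == "__main__":
    # Made-up sequence containing the sample pattern twice.
    print(find_pattern_starts("Y-S-[SA]-X-[LVIM]", "AAYSSQLGGYSAKVTT"))
```

The zero-width lookahead is used so that overlapping occurrences are all reported; a plain finditer would skip any occurrence that starts inside a previous match.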
Additionally, if the pattern occurs more than once in the query and you would like to
limit which occurrences are used as seeds, specify those locations by using the HI (hit
initiation) tag in the pattern file and set -p to seedp instead of patseedp (explained in
the reference section that follows). The following example specifies that only the
occurrence starting at position 143 should be used. (In this case, there’s also an
occurrence at 34, which is ignored.)
ID  HoxA11 pattern2
PA  Y-S-[SA]-X-[LVIMK]
HI 143
PHI-BLAST can also be a jumping-off point for a PSI-BLAST run. In the following
command line, the pattern in hit_file initiates the first iteration of PSI-BLAST for the
development of the PSSM, followed by normal rounds of PSI-BLAST iterations.
blastpgp -d nr -i my_protein -k hit_file -p patseedp -j 5
Here are a few sample PHI-BLAST command lines:
blastpgp -d nr -i my_protein -k hit_file -p patseedp
blastpgp -d nr -i my_protein -k multi_hit_file -p seedp
blastpgp -d HoxDB.pep -i AAG39070.pep -k hit_file.hox -p patseedp
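When these searches are launched from a script, it can help to assemble the argument list in one place. The Python helper below is a hypothetical convenience wrapper, not something shipped with NCBI-BLAST. It emits only flags that appear in this chapter (-d, -i, -k, -p, -j, -h, -C, -R), and the rule that a pattern file must be paired with patseedp or seedp is a simplification made for the sketch.

```python
import shlex


def blastpgp_command(database, query, iterations=None, pattern_file=None,
                     program=None, inclusion_threshold=None,
                     checkpoint_out=None, restart_from=None):
    """Build an argument list for blastpgp without running anything."""
    args = ["blastpgp", "-d", database, "-i", query]
    if pattern_file is not None:
        # This sketch only accepts the two seeding modes shown in the text.
        if program not in ("patseedp", "seedp"):
            raise ValueError("a pattern file needs program 'patseedp' or 'seedp'")
        args += ["-k", pattern_file, "-p", program]
    elif program is not None:
        raise ValueError("this sketch sets -p only together with a pattern file")
    if restart_from is not None:
        args += ["-R", restart_from]
    if checkpoint_out is not None:
        args += ["-C", checkpoint_out]
    if iterations is not None:
        args += ["-j", str(iterations)]
    if inclusion_threshold is not None:
        args += ["-h", str(inclusion_threshold)]
    return args


def as_command_line(args):
    """Render the argument list as a shell-quoted string for logging."""
    return shlex.join(args)


if __name__ == "__main__":
    print(as_command_line(blastpgp_command(
        "nr", "my_protein", pattern_file="hit_file",
        program="patseedp", iterations=5)))
```

Returning a list rather than a string keeps file names with spaces intact if the list is later passed to a process launcher; the string form is meant only for display.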
The following reference describes parameters used with blastpgp, which executes
PSI- and PHI-BLAST searches.
-a [integer]
Default: 1
The number of processors to use; same as blastall.
-A [integer]
Default: blastn 0, others 40
The multiple-hit window size; same as blastall.
-b [integer]
Default: 250
The number of alignments to show; same as blastall.
-B [file]
Default: Optional
Program: PSI-BLAST only
The input alignment file for a PSI-BLAST restart. It allows a PSI-BLAST run to start with a
curated multiple sequence alignment instead of allowing the program to generate it from
the first round of database alignments. For example:
blastpgp -i query -B multiple_alignment -j 5 -d nr
The alignment file must be based on the Clustal format but without the header and footer.
The file should have a row for each sequence and can be broken into blocks separated by
one or more blank lines. The query sequence (specified by -i) must be included in the
alignment (though it doesn’t need to be the first row), and all rows must be padded with
dashes (---) to make them equal in length. Also, each column must contain either all
uppercase or all lowercase letters. An uppercase letter signifies that the column should
be given a position-specific score; a lowercase letter means that the matrix (specified
by -M) score should be used. Here is a portion of the example alignment file included in
README.bls (the query is 26SPS9_Hs, in this case):
26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllc
F57B9_Ce LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymll
YDL097c_Sc ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlky
YMJ5_Ce LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymil
FUS6_ARATH KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvn
COS41.8_Ci SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetad
644879 KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvs
YPR108w_Sc IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvt
eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw---------------
T23D8.4_Ce SKAMLNGDWKKCQDYIVNDKMNQkvw---------------
YD95_Sp IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavis
KIAA0107_Hs LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyv
F49C12.8_Hs LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvit
Int-6_Mm KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklase

26SPS9_Hs kimlntpedvqalvsgklalryagrqtealkcvaqasknr
F57B9_Ce ckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk
YDL097c_Sc mllskimlnliddvknilnakytketyqsrgidamkavae
YMJ5_Ce ckimlneteqlagllaakeivayqkspriiairsmadafr
FUS6_ARATH kaeqnpetlepmvnaklrcasglahlelkkyklaarkfld
COS41.8_Ci eqlqihykvcyarvldyrrkfleaaqrynelsyksaihet
644879 kaestpeiaeqrgerdsqtqailtklkcaaglaelaarky
YPR108w_Sc glftlertdlkskvidspellslisttaalqsissltisl
eif-3p110_Hs ----------------------------------------
T23D8.4_Ce ----------------------------------------
YD95_Sp gaisldrvdvktkivdspevlavlpqnesmssleacinsl
KIAA0107_Hs smialerpdlrekvikgaeilevlhslpavrqylfslyec
F49C12.8_Hs ttfaldrpdlrtkvircnevqeqltggglngtlipvreyl
Int-6_Mm ilmqnwdaamedltrlketidnnsvssplqslqqrtwlih
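The formatting rules for the -B alignment file lend themselves to a quick pre-flight check. The Python sketch below is a rough validator under stated assumptions, not the parser blastpgp uses: it expects each non-blank line to hold a name and a sequence segment separated by whitespace, it ignores dashes when comparing case within a column, and the function names are invented.

```python
def parse_alignment(text):
    """Join the blocks of an alignment file into one string per sequence name."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate blocks
        name, segment = line.split()
        rows[name] = rows.get(name, "") + segment
    return rows


def check_alignment(rows, query_name):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if query_name not in rows:
        problems.append("query %s is not in the alignment" % query_name)
    if len({len(sequence) for sequence in rows.values()}) > 1:
        problems.append("rows differ in length")
        return problems  # column checks are meaningless on ragged rows
    for column, residues in enumerate(zip(*rows.values()), start=1):
        # Assumption: padding dashes do not count toward the case of a column.
        letters = [residue for residue in residues if residue != "-"]
        has_upper = any(residue.isupper() for residue in letters)
        has_lower = any(residue.islower() for residue in letters)
        if has_upper and has_lower:
            problems.append("column %d mixes uppercase and lowercase" % column)
    return problems


if __name__ == "__main__":
    # Made-up two-block alignment; 'q' plays the role of the query.
    example = "q ACDefg\ns1 AC-efg\n\nq hiKL\ns1 --KL\n"
    print(check_alignment(parse_alignment(example), "q"))
```

Collecting every problem in a list, rather than stopping at the first one, makes it easier to repair a hand-curated alignment in a single pass.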
-c [integer]
Default: 9
Program: PSI-BLAST only
Sets the constant used in the pseudocounts for the PSSM. It’s generally not necessary to
change this parameter.
-C [file]
Default: Optional
Program: PSI-BLAST only
Outputs a file for PSI-BLAST checkpointing. The file contains the final PSSM from a multipass
run of PSI-BLAST. The checkpoint file can then be used in a PSI-BLAST restart (see -R), in a
blastall -p psitblastn run (also see -R), or as an entry in an RPS-BLAST database. For example:
blastpgp -d nr -i my_protein -j 5 -C my_protein.ckp
-d [string]
Default: nr
The database name; same as blastall.
-e [real]
Default: 10
The expectation value; same as blastall.
-E [integer]
Default: blastn 2, others 1
The penalty to extend a gap; same as blastall.
-f [integer]
Default: 11
The threshold for extending a hit; same as blastall.
-F [string]
Default:
Filters the query sequence; same as blastall.
-g [T/F]
Default: T
Performs gapped alignment; same as blastall.
PHI-BLAST requires gapping and therefore forbids -g F.
-G [integer]
Default: blastn 5, others 11
The penalty to open a gap; same as blastall.
-h [real]
Default: 0.005
Program: PSI-BLAST only
The E-value threshold for inclusion in PSSM. All alignments better than this threshold are
used in constructing the PSSM.
-H [integer]
Default: -1
The end of the required region in query. The default of -1 indicates the actual end of the
query. This option can be used in combination with -S to specify a particular region to use
-i [file]
Default: stdin
The query file; same as blastall.
-I [T/F]
Default: F
Shows GIs in defline; same as blastall.