This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 8: 20 Tips to Improve Your BLAST Searches
Figure 8-7 demonstrates another feature of gapped alignment: alignments may
extend far beyond the end of an exon because gapped extension is generally less
specific than ungapped extension. This is especially annoying in genomes with short
introns, in which gapped alignments can extend between nonadjacent exons and
obscure intervening introns and exons. To reduce these lengthy extensions, decrease
X, increase the gap extension cost, select a more stringent scoring matrix, or use
ungapped alignment.
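The effect of X can be illustrated with a toy extension routine. The sketch below is not BLAST's gapped extension, which uses dynamic programming; it is a simplified, one-dimensional X-drop loop over a list of per-column scores. The function name xdrop_extend and the scores are hypothetical, chosen only to show that a large X lets an extension bridge a poor stretch while a small X stops it at the end of the first good region.

```perl
use strict;
use warnings;

# Toy X-drop extension: walk along per-column scores, track the best
# running total, and stop once the total falls more than $x below it.
# Returns (best_score, number_of_columns_kept).
sub xdrop_extend {
    my ($x, @scores) = @_;
    my ($sum, $best, $best_len) = (0, 0, 0);
    for my $i (0 .. $#scores) {
        $sum += $scores[$i];
        if ($sum > $best) {
            ($best, $best_len) = ($sum, $i + 1);
        }
        last if $best - $sum > $x;
    }
    return ($best, $best_len);
}

# Hypothetical column scores: a matching region (5 columns), a poor
# intron-like stretch (4 columns), then a second matching region (8 columns).
my @scores = ((1) x 5, (-1) x 4, (1) x 8);

for my $x (10, 3) {
    my ($score, $len) = xdrop_extend($x, @scores);
    print "X=$x: score $score over $len columns\n";
}
```

With X set to 10, the extension crosses the poor stretch and absorbs the second region; with X set to 3, it ends after the first five columns.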
8.15 Look for Gaps in Coverage as a Sign of Missed Exons
The seeding parameters and alignment thresholds may prevent short or highly
divergent exons from appearing in BLAST reports. Figure 8-8a shows an alignment
between a genomic query and an EST. Most alignments overlap by a few bp, except
for the two at the 5′ end (left side). Gaps and overlaps in coverage are easier to see
by using the reciprocal search shown in Figure 8-8b. To find the missing 7-bp exon in
Figure 8-8c, use bl2seq (see Chapter 13) with the following command line:
bl2seq -i est -I 21,29 -j genomic -J 76047,76744 -p blastn -W 7
The -I and -J parameters let you select a specific region of each sequence. This
command runs a BLASTN search between the missing part of the EST and the genomic
region between the flanking alignments, with the word size lowered to 7 so that the
tiny exon can seed an alignment.
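Spotting gaps and overlaps by eye becomes tedious when there are many alignments. As a minimal sketch, the routine below takes alignment intervals on one sequence and lists the uncovered gaps and the overlaps between them. The function name coverage_breaks and the coordinates are hypothetical; they are not read from Figure 8-8. A reported gap is the kind of region you might then examine with the -I and -J parameters.

```perl
use strict;
use warnings;

# Given alignment intervals [start, end] on one sequence (1-based,
# inclusive), return references to two lists: uncovered gaps and
# overlaps with the region already covered.
sub coverage_breaks {
    my @hsps = sort { $a->[0] <=> $b->[0] } @_;
    my (@gaps, @overlaps);
    return (\@gaps, \@overlaps) unless @hsps;
    my $covered_end = $hsps[0][1];
    for my $hsp (@hsps[1 .. $#hsps]) {
        my ($start, $end) = @$hsp;
        if ($start > $covered_end + 1) {
            push @gaps, [$covered_end + 1, $start - 1];
        }
        elsif ($start <= $covered_end) {
            my $ov_end = $end < $covered_end ? $end : $covered_end;
            push @overlaps, [$start, $ov_end];
        }
        $covered_end = $end if $end > $covered_end;
    }
    return (\@gaps, \@overlaps);
}

# Hypothetical EST coordinates of three alignments.
my @est_hsps = ([1, 21], [29, 120], [118, 300]);
my ($gaps, $overlaps) = coverage_breaks(@est_hsps);
for my $g (@$gaps) {
    my ($from, $to) = @$g;
    printf "gap %d..%d (%d bases)\n", $from, $to, $to - $from + 1;
}
printf "overlap %d..%d\n", @$_ for @$overlaps;
```

For these made-up intervals, the routine reports one 7-base gap (22..28) and one overlap (118..120).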
8.16 Parse BLAST Reports with Bioperl
The traditional BLAST output format is meant to be human readable, but when your
BLAST report is 1,000 pages long, it isn’t much fun to read. Sometimes all you want
is the names of all sequences that have alignments above 90 percent identity. Such
tasks require a BLAST parser that lets you select only the information you want.
Many freely available BLAST parsers can be downloaded from the Internet, but the
ones in most common use come from the Bioperl project. Bioperl is an open-source
community of bioinformatics professionals that develops and maintains code
libraries and applications written in the Perl programming language. If your daily routine
finds you running BLAST or other sequence analysis applications, learning to use the
Bioperl system can save you many hours of work and frustration.
Figure 8-7. Extension is sometimes excessive: the real exon region is boxed in this BLASTX
alignment
Let’s see how Bioperl can help solve the problem posed earlier: to report the names
of all sequences that are more than 90 percent identical to your query.
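#!/usr/bin/perl -w
use strict;
use Bio::SearchIO;

# Open the BLAST report named on the command line
my $blast = Bio::SearchIO->new(
    -format => 'blast',
    -file   => $ARGV[0]);

my %Name;                          # names of hits that pass the filter
my $result = $blast->next_result;  # first (or only) query in the report
while (my $sbjct = $result->next_hit) {
    while (my $hsp = $sbjct->next_hsp) {
        # frac_identical is a fraction between 0 and 1
        $Name{$sbjct->name} = 1 if $hsp->frac_identical >= 0.9;
    }
}
print join("\n", sort keys %Name), "\n";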
Pretty simple, huh? With BLAST and Bioperl, it's possible to create all kinds of
useful applications.
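When Bioperl is not available, a similar filter can be sketched in plain Perl against tabular output. This assumes the report was produced with NCBI's tabular option (blastall -m 8), whose 12 tab-separated columns begin with the query ID, the subject ID, and the percent identity. The function name hits_above and the sample lines are hypothetical. Note that percent identity here is on a 0 to 100 scale, unlike the fraction returned by frac_identical.

```perl
use strict;
use warnings;

# Return the sorted, unique subject names whose percent identity
# (column 3 of tabular output) is at least $min_pct.
sub hits_above {
    my ($min_pct, @lines) = @_;
    my %name;
    for my $line (@lines) {
        next if $line =~ /^#/ or $line !~ /\S/;   # skip comments and blanks
        chomp(my $copy = $line);
        my ($query, $sbjct, $pct) = split /\t/, $copy;
        $name{$sbjct} = 1 if $pct >= $min_pct;
    }
    return sort keys %name;
}

# Hypothetical tabular lines; the IDs and numbers are invented.
my @report = (
    "q1\tseqA\t97.50\t200\t5\t0\t1\t200\t11\t210\t1e-100\t380\n",
    "q1\tseqB\t82.00\t150\t27\t0\t1\t150\t5\t154\t2e-30\t120\n",
    "q1\tseqA\t91.00\t100\t9\t0\t201\t300\t300\t399\t3e-40\t150\n",
);
print join("\n", hits_above(90, @report)), "\n";
```

To filter a real report, the lines could be read from a file handle and passed in place of @report.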
Figure 8-8. Finding missed exons: (a) an alignment between a genomic query and EST, (b) the
reciprocal alignment showing a gap (d) and overlap (e) in coverage, (c) the tiny missed exon can be
found (f) by changing the word size to 7