GENSCAN is an online program based on a hidden Markov model (HMM) algorithm. It identifies complete gene structures in genomic DNA, predicting the locations of genes and their exon–intron boundaries in genomic sequences of vertebrates, Arabidopsis and maize. GENSCAN was developed by Christopher Burge of the Department of Mathematics, Stanford University (Burge and Karlin, 1997; Burge, 1998).
37.2 OBJECTIVE
To predict the putative gene sequence(s) in a given input nucleotide sequence and annotate the sequence.
37.3 PROCEDURE
Download a sequence (fewer than 1 million base pairs) from NCBI Nucleotide, and save it in FASTA format using Notepad. Here, the chromosome 1 sequence (CM000409.1) of the duck‐billed platypus (Ornithorhynchus anatinus) has been downloaded from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/CM000409.1).
The original sequence is larger than 1 megabase, so it must be trimmed from either end to approximately 1 megabase (for example, using Notepad++). The trimmed sequence should then be run through RepeatMasker to mask low‐complexity and repeat regions before it is submitted.
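The trimming step above can also be done programmatically rather than by hand in a text editor. The following is a minimal sketch, assuming a single-record FASTA file held in memory as text; the function names (`read_first_fasta_record`, `trim_sequence`, `format_fasta`) are hypothetical helpers written for this illustration and are not part of GENSCAN or NCBI tooling.

```python
def read_first_fasta_record(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; ignore the rest
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header line found")
    return header, "".join(chunks)


def trim_sequence(sequence, max_bases=1_000_000, start=0):
    """Keep at most max_bases bases, beginning at 0-based position start."""
    if start < 0 or max_bases <= 0:
        raise ValueError("start must be >= 0 and max_bases must be > 0")
    return sequence[start:start + max_bases]


def format_fasta(header, sequence, width=70):
    """Write a header and sequence back out as FASTA text with wrapped lines."""
    lines = [">" + header]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Tiny made-up record, used only to show the calls; not real platypus data.
    demo = ">demo\nACGTACGTAC\nGGCC\n"
    header, seq = read_first_fasta_record(demo)
    print(format_fasta(header + " (first 8 bases)", trim_sequence(seq, max_bases=8), width=4))
```

For a real chromosome file, the text would be read from disk, trimmed with the default `max_bases` of 1,000,000, and written to a new file that is then masked and uploaded.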
Organism: select the appropriate option (“Vertebrate”, “Arabidopsis”, or “Maize”) from the “Organism” drop‐down list. Here, we will select “Vertebrate”.
Suboptimal exon cutoff: values range from 0.01 to 1.00. This optional parameter is the probability threshold for reporting an exon of a gene and, by default, is set to 1.00. It can be lowered, but the reliability of the predicted exons falls accordingly. The cutoff should not be reduced below 0.50.
Sequence name: A text box is provided to type the name of the sequence. This is also optional, and is used to name the sequence for ease of identification.
Print options: offers two output options, “Predicted peptides only” and “Predicted CDS and peptides”. The second option gives each predicted amino acid sequence, followed by the nucleotide sequence that encodes it.
Browse button: to upload the input nucleotide sequence for gene prediction.
Upload the sequence file using the “Browse…” button.
Click “Run GENSCAN” to start the analysis (Figure 37.1).
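The effect of the suboptimal exon cutoff described in the parameter list can be illustrated with a small sketch. It assumes that a candidate exon is reported when its probability exceeds the cutoff; that comparison rule, the function names and the three candidate values are assumptions made for illustration only, not GENSCAN output or GENSCAN code.

```python
def validate_cutoff(cutoff):
    """Check that the cutoff lies in the range accepted by the form (0.01 to 1.00)."""
    if not 0.01 <= cutoff <= 1.00:
        raise ValueError("cutoff must be between 0.01 and 1.00")
    return cutoff


def reported_suboptimal_exons(candidates, cutoff=1.00):
    """Return the (name, probability) pairs whose probability exceeds the cutoff.

    Assumption for this sketch: reporting uses a strict 'greater than' test.
    """
    validate_cutoff(cutoff)
    return [(name, p) for name, p in candidates if p > cutoff]


if __name__ == "__main__":
    # Hypothetical candidate exons with made-up probabilities.
    candidates = [("exon_a", 0.95), ("exon_b", 0.62), ("exon_c", 0.30)]
    for cutoff in (1.00, 0.50, 0.10):
        names = [name for name, _ in reported_suboptimal_exons(candidates, cutoff)]
        print(cutoff, names)
```

Under these assumptions, lowering the cutoff admits progressively less reliable candidates, which is why a value below 0.50 is discouraged.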
37.4 INTERPRETATION OF GENSCAN OUTPUT
The GENSCAN output appears in a new window on the same web page. The GENSCAN version, date and time of run are shown at the top.
This is followed by the size of the input sequence and its G/C percentage. The next section lists the predicted exons in tabular form.
It also reports the suboptimal exons at the chosen probability cutoff (1.00 by default).
Finally, the predicted amino acid sequence(s) and the respective coding nucleotide sequence(s) are given.
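The G/C percentage shown near the top of the output can be checked independently for the submitted sequence. The sketch below is one simple way to compute it, assuming that only unambiguous bases (A, C, G, T) are counted, so masked or ambiguous positions such as N are excluded from the total; GENSCAN's own convention may differ, and `gc_percent` is a hypothetical helper name.

```python
def gc_percent(sequence):
    """Return the percentage of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")  # ignores N and other codes
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


if __name__ == "__main__":
    print(round(gc_percent("ACGTNNGGCC"), 2))  # made-up sequence for illustration
```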
37.4.1 Some points to remember while using GENSCAN
This tool cannot handle sequences larger than one million bases, so please limit the input sequence size to 1 Mb.
The user needs to mask the repeat sequences prior to submitting to GENSCAN.
It should not be used for prokaryotic or yeast sequences.
It can predict internal exons more accurately than the terminal exons.
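The first two points above (the size limit and repeat masking) can be checked before a sequence is uploaded. The following is a rough sketch, assuming that masked positions appear either as N (hard masking) or as lowercase letters (soft masking); a sequence that is lowercase for other reasons would be miscounted, so the result is only a heuristic. The function name `check_genscan_input` is hypothetical.

```python
def check_genscan_input(sequence, max_bases=1_000_000):
    """Summarise sequence length and masked fraction, with warnings if any."""
    warnings = []
    length = len(sequence)
    if length > max_bases:
        warnings.append(f"sequence has {length} bases; limit is {max_bases}")

    # Heuristic: count N/n and any lowercase base as masked.
    masked = sum(1 for base in sequence if base in "Nn" or base.islower())
    masked_fraction = masked / length if length else 0.0
    if masked == 0:
        warnings.append("no masked bases found; was repeat masking run?")

    return {"length": length, "masked_fraction": masked_fraction, "warnings": warnings}


if __name__ == "__main__":
    # Made-up sequence: 4 unmasked bases, 4 hard-masked, 4 soft-masked.
    print(check_genscan_input("ACGTNNNNacgt"))
```

An empty `warnings` list means only that the sequence passes these two simple checks, not that the masking itself is adequate.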
37.5 QUESTIONS
1. Discuss the output parameters obtained from GENSCAN.
2. Predict and annotate the genes of the taurine Y chromosome.
3. What are the key elements of the eukaryotic gene that are taken into account while predicting genes? What will be your strategy to predict eukaryotic genes from a given sequence, if no tools are available?
4. Download the sex chromosomes of mouse (Mus musculus) and predict the genes in both chromosomes. Which genes are in common in both chromosomes?