CHAPTER 37
Genome Annotation in Eukaryotes

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

37.1 INTRODUCTION

GENSCAN is an online program, based on a hidden Markov model (HMM), that identifies complete gene structures in genomic DNA. It predicts the locations of genes and their exon–intron boundaries in genomic sequences of vertebrates, Arabidopsis and maize. GENSCAN was developed by Christopher Burge of the Department of Mathematics, Stanford University (Burge and Karlin, 1997; Burge, 1998).

37.2 OBJECTIVE

To predict the putative gene sequence(s) in a given input nucleotide sequence and annotate the sequence.

37.3 PROCEDURE

Download a sequence (fewer than 1 million base pairs) from NCBI Nucleotide and save it as a plain-text file in FASTA format (for example, using Notepad). Here, the chromosome 1 sequence (CM000409.1) of the duck‐billed platypus (Ornithorhynchus anatinus) has been downloaded from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/CM000409.1).
The original sequence is larger than 1 megabase, so it must be trimmed from either terminus to just under 1 megabase (using Notepad++). The trimmed sequence should then be run through RepeatMasker to mask low‐complexity and repeat regions.
Open the GENSCAN web server: http://genes.mit.edu/GENSCAN.html.
Set the parameters:
1. Organism: select the appropriate option from “Vertebrate”, “Arabidopsis”, or “Maize”, available in the drop‐down options with “Organism”. Here, we will select “Vertebrate”.
2. Suboptimal exon cutoff: accepts values from 0.01 to 1.00. This optional parameter is the probability threshold for reporting an exon, and it is set to 1.00 by default. It can be lowered, but the reliability of the predicted exons then decreases; the cutoff should not be reduced below 0.50.
3. Sequence name: A text box is provided to type the name of the sequence. This is also optional, and is used to name the sequence for ease of identification.
4. Print options: offers two output options, “Predicted peptides only” and “Predicted CDS and peptides”. The second option gives each predicted amino acid sequence, followed by the nucleotide sequence that encodes it.
5. Browse button: to upload the input nucleotide sequence for gene prediction.
Browse to upload the sequence using the “Browse…” button.
Click “Run GENSCAN” to start the analysis (Figure 37.1).
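Before uploading, the chosen settings can be checked against the limits described in the steps above. The following is a minimal Python sketch, not part of GENSCAN itself: `RunSettings` and `check_settings` are hypothetical names introduced here, the default print option is an arbitrary choice, and the allowed values are taken from this chapter only.

```python
from dataclasses import dataclass

# Values as described in this chapter (not read from the GENSCAN server).
ORGANISMS = ("Vertebrate", "Arabidopsis", "Maize")
PRINT_OPTIONS = ("Predicted peptides only", "Predicted CDS and peptides")
MAX_BASES = 1_000_000


@dataclass
class RunSettings:
    """Hypothetical container mirroring the fields of the web form."""
    organism: str = "Vertebrate"
    suboptimal_exon_cutoff: float = 1.00
    sequence_name: str = ""  # optional label
    print_option: str = "Predicted peptides only"  # arbitrary default


def check_settings(settings, sequence_length):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if settings.organism not in ORGANISMS:
        problems.append(f"organism must be one of {ORGANISMS}")
    if not 0.01 <= settings.suboptimal_exon_cutoff <= 1.00:
        problems.append("suboptimal exon cutoff must lie between 0.01 and 1.00")
    elif settings.suboptimal_exon_cutoff < 0.50:
        problems.append("cutoff below 0.50 is not advised: exons become unreliable")
    if settings.print_option not in PRINT_OPTIONS:
        problems.append(f"print option must be one of {PRINT_OPTIONS}")
    if sequence_length > MAX_BASES:
        problems.append(f"sequence has {sequence_length} bases; limit is {MAX_BASES}")
    return problems


if __name__ == "__main__":
    for problem in check_settings(RunSettings(organism="Yeast"), 1_200_000):
        print("-", problem)
```

An empty list from `check_settings` only means that the settings agree with the limits listed above; it does not guarantee that the server will accept the run.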

**FIGURE 37.1** Homepage of the online GENSCAN software, with option bars for “Run GENSCAN” and “Clear Input” at the bottom left.

37.4 INTERPRETATION OF GENSCAN OUTPUT

The GENSCAN output appears in a new window on the same web page. The GENSCAN version and the date and time of the run are shown at the top.
This is followed by the size of the input sequence and its G/C percentage. The next section lists the predicted exons in tabular form under the column header:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr.. (Figure 37.2)
The output also reports the suboptimal exons for the chosen probability cutoff (1.00 by default).
Finally, the predicted amino acid sequence(s) and the respective coding nucleotide sequence(s) are given.
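The G/C percentage reported at the top of the output can be cross-checked locally. The following is a minimal Python sketch; `gc_percent` is a hypothetical helper, and excluding ambiguous bases such as N from the denominator is an assumption made here, so the value may differ slightly from the one GENSCAN prints.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases (A, C, G, T).

    Case is ignored, so soft-masked (lowercase) bases are counted too.
    Ambiguous symbols such as N are left out of the denominator (assumption).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


if __name__ == "__main__":
    print(f"{gc_percent('ACGTNNacgt'):.2f}")
```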

**FIGURE 37.2** Output page of the GENSCAN software, depicting some of the predicted genes or exons (top) and some of the predicted protein sequences (bottom).

The terms used in the output of GENSCAN are as follows (Source: http://www.biomedcentral.com/content/supplementary/1471‐2164‐11‐156‐s2/Additionalfile2/GENSCAN_output/GENSCAN%20output%20EG926217.htm):

Gn.Ex → gene number, exon number (for reference purpose).
Type → type of exon or signal, one of:
    Init → initial exon (ATG to 5′ splice site).
    Intr → internal exon (3′ splice site to 5′ splice site).
    Term → terminal exon (3′ splice site to stop codon).
    Sngl → single‐exon gene (start codon “ATG” to any one of the stop codons).
    Prom → promoter (TATA box/transcription initiation site).
    PlyA → poly‐A signal (consensus sequence: AATAAA).
S → DNA strand (+ = input strand; – = opposite strand).
Begin → beginning of exon or signal (numbered on input strand).
End → end point of exon or signal (numbered on input strand).
Len → length of exon or signal (bp).
Fr → reading frame.
Ph → net phase of exon.
I/Ac → initiation signal or 3′ splice site score.
Do/T → 5′ splice site or termination signal score.
CodRg → coding region score.
P → probability of exon (sum over all parses containing exon).
Tscr → exon score (depends on length, I/Ac, Do/T and CodRg scores).
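The columns defined above can be read into named fields for downstream annotation. The following is a minimal Python sketch that assumes whitespace-separated rows in the column order listed, and assumes that Prom and PlyA rows carry only the first six columns plus the final score. `Feature` and `parse_row` are hypothetical names, and the sample rows are invented for illustration; they are not real GENSCAN results.

```python
from typing import NamedTuple, Optional

FEATURE_TYPES = {"Init", "Intr", "Term", "Sngl", "Prom", "PlyA"}


class Feature(NamedTuple):
    gene: int
    exon: int
    type: str
    strand: str
    begin: int
    end: int
    length: int
    frame: Optional[int]     # Fr
    phase: Optional[int]     # Ph
    i_ac: Optional[float]    # I/Ac
    do_t: Optional[float]    # Do/T
    cod_rg: Optional[float]  # CodRg
    prob: Optional[float]    # P
    tscr: float              # Tscr


def parse_row(line):
    """Parse one table row; return None for headers, rulers and other text."""
    fields = line.split()
    if len(fields) not in (7, 13) or fields[1] not in FEATURE_TYPES:
        return None
    try:
        gene_s, exon_s = fields[0].split(".")
        gene, exon = int(gene_s), int(exon_s)
        begin, end, length = (int(x) for x in fields[3:6])
        tscr = float(fields[-1])
        if len(fields) == 13:  # exon row: all columns present
            frame, phase = int(fields[6]), int(fields[7])
            i_ac, do_t, cod_rg, prob = (float(x) for x in fields[8:12])
        else:                  # signal row (assumed): score only
            frame = phase = i_ac = do_t = cod_rg = prob = None
    except ValueError:
        return None
    return Feature(gene, exon, fields[1], fields[2], begin, end, length,
                   frame, phase, i_ac, do_t, cod_rg, prob, tscr)


# Invented rows, for illustration only.
SAMPLE = """\
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
 1.01 Init +   1201   1350  150  0  0   82   95   140 0.912  12.45
 1.02 Term +   2001   2120  120  1  0   60   45   110 0.850   6.30
 1.03 PlyA +   2300   2305    6                               1.05
"""

if __name__ == "__main__":
    for row in SAMPLE.splitlines():
        feature = parse_row(row)
        if feature is not None:
            print(feature.gene, feature.exon, feature.type, feature.begin, feature.end)
```

Returning `None` for unrecognised lines lets the same function be applied to every line of the saved output, so the header, ruler lines and sequence blocks are skipped without special handling.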

A detailed explanation regarding GENSCAN output is available at http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/GENSCAN.html.

37.4.1 Some points to remember while using GENSCAN

This tool cannot handle input larger than one million bases, so limit the input sequence to 1 Mb.
The user needs to mask the repeat sequences prior to submitting to GENSCAN.
It should not be used for prokaryotic or yeast sequences.
It can predict internal exons more accurately than the terminal exons.
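The size limit can be met without manual editing by cutting a window out of the downloaded FASTA file. The following is a minimal Python sketch using only the standard library; `read_fasta`, `trim_window` and `format_fasta` are hypothetical helper names, and the choice of window start is left to the user. Repeat masking is not performed here and still has to be done separately.

```python
MAX_BASES = 1_000_000  # input limit stated in this chapter


def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header line found")
    return header, "".join(chunks)


def trim_window(seq, start=0, max_bases=MAX_BASES):
    """Return at most max_bases bases beginning at 0-based offset start."""
    if not 0 <= start <= len(seq):
        raise ValueError("start lies outside the sequence")
    return seq[start:start + max_bases]


def format_fasta(header, seq, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [">" + header]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = ">demo record\nACGTACGTAC\nGTACGT\n"
    header, seq = read_fasta(demo)
    print(format_fasta(header + " (trimmed)", trim_window(seq, 0, 8), width=4), end="")
```

For a real chromosome file, the text would be read from disk (for example with `open(path).read()`), and the trimmed record written to a new file for upload.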

37.5 QUESTIONS

1. Discuss the output parameters obtained from GENSCAN.
2. Predict and annotate the genes of the taurine Y chromosome.
3. What are the key elements of the eukaryotic gene that are taken into account while predicting genes? What will be your strategy to predict eukaryotic genes from a given sequence, if no tools are available?
4. Download the sex chromosomes of mouse (Mus musculus) and predict the genes in both chromosomes. Which genes are common to both chromosomes?
