CHAPTER 34 Prediction of Translation Initiation Sites
S Jain1, S Panwar2 and A Kumar3
1 Department of Applied Sciences & Humanities, Jai Parkash MukandLal Innovative Engineering and Technology Institute, Haryana, India
2 Department of Genetics and Plant Breeding, Chaudhary Charan Singh University, Uttar Pradesh, India
3 Department of Nutrition Biology, Central University of Haryana, Haryana, India
34.1 INTRODUCTION
The correct recognition of translation initiation sites (TIS) helps us to understand gene structure and the gene product. Computational identification of TIS is a central component of gene prediction systems and is therefore of great importance in genome annotation. Many data mining methods have been employed to identify TIS in transcripts such as mRNA, EST and cDNA sequences. These methods build on the scanning model (Kozak, 1989), which states that, in eukaryotes, the first “AUG” (start codon) from the 5′ end of the mRNA transcript is usually the true TIS. However, exceptions can occur through leaky scanning, re‐initiation and internal initiation of translation, which result in another AUG being the true TIS.
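The scanning model can be illustrated with a minimal Python sketch that reports the first AUG from the 5′ end, together with every other AUG that could serve as the TIS when the exceptions above apply. The function names and the example transcript are hypothetical and are used only for illustration.

```python
def first_aug(mrna: str) -> int:
    """Return the 0-based index of the first AUG from the 5' end, or -1 if none."""
    seq = mrna.upper().replace("T", "U")  # accept cDNA (T) as well as mRNA (U)
    return seq.find("AUG")


def all_augs(mrna: str) -> list:
    """Return the 0-based indices of every AUG, including overlapping ones."""
    seq = mrna.upper().replace("T", "U")
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "AUG"]


# Hypothetical transcript with two AUG triplets
example = "GGCACCAUGGCUAUGAAAUAA"
print(first_aug(example))  # 6: the candidate favoured by the scanning model
print(all_augs(example))   # [6, 12]: the second AUG is an alternative candidate
```

The first index is only the default candidate; a prediction tool is still needed to decide whether a downstream AUG is the true TIS.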
The consensus motif GCCRCCatgG around the TIS was probably the first effort to characterize TIS in a statistically meaningful way (Salamov et al., 1998). The general approach to the TIS prediction problem is to derive numerical data from cDNA sequences and, subsequently, apply computational methods.
34.2 OBJECTIVE
To predict the translation initiation site by exploiting the NetStart and TIS Miner tools.
34.2.1 The Kozak sequence
The Kozak consensus sequence was originally defined as ACCAUGG, based on the effect of single nucleotide changes around the translation initiation codon (AUG) of the preproinsulin gene. Subsequently, it was extended to “GCCGCCACCAUGG”, based on mutation studies and a survey of 699 vertebrate transcripts. Further, expression of preproinsulin and alpha‐globin in cells showed that a purine (generally “A”) at position –3 is essential for efficient initiation of translation and that, in its absence, a “G” at position +4 is essential. Positions are numbered with the “A” of AUG as +1.
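The two key positions can be checked with a short Python sketch. It assumes the usual numbering in which the “A” of the start codon is +1 and the base immediately upstream is –1. The function name and the strong/adequate/weak labels are illustrative assumptions, not part of NetStart or TIS Miner.

```python
PURINES = {"A", "G"}


def kozak_context(seq: str, atg_pos: int) -> dict:
    """Inspect positions -3 and +4 around the ATG starting at 0-based atg_pos."""
    s = seq.upper().replace("U", "T")
    # Position -3 lies three bases upstream of the A of ATG (+1);
    # position +4 is the base immediately after the G of ATG.
    minus3 = s[atg_pos - 3] if atg_pos >= 3 else None
    plus4 = s[atg_pos + 3] if atg_pos + 3 < len(s) else None
    purine = minus3 in PURINES
    g_plus4 = plus4 == "G"
    # Illustrative labels (an assumption): both features, one feature, or none
    if purine and g_plus4:
        label = "strong"
    elif purine or g_plus4:
        label = "adequate"
    else:
        label = "weak"
    return {"minus3": minus3, "plus4": plus4, "context": label}


print(kozak_context("GCCGCCACCATGG", 9))  # A at -3 and G at +4
print(kozak_context("TTTTTTATGC", 6))     # neither feature present
```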
34.3 PROCEDURE
34.3.1 Tools used in translation initiation site prediction
Kozak (1987) proposed the first method to identify TIS; a weight matrix is suitable for modeling the conserved sequence in the vicinity of the TIS. Later, Pedersen and Nielsen (1997) introduced NetStart, the first truly automated system, which employs an artificial neural network (ANN) to identify TIS in mRNA transcripts. Salzberg (1997) used a conditional probability (CP) matrix to model TIS. This line of work was continued by Li and Jiang (2004), who developed a new Edit‐Kernel approach called TIS hunter.
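The weight matrix idea can be sketched in Python as a log-odds position weight matrix built from aligned windows around known start codons. This is a generic illustration, not a reconstruction of any published matrix: the four training windows are toy data, and the pseudocount and the uniform background frequency of 0.25 are assumptions.

```python
import math
from collections import Counter


def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned windows."""
    total = len(aligned) + 4 * pseudocount
    pwm = []
    for col in range(len(aligned[0])):
        counts = Counter(window[col] for window in aligned)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds weights for one candidate window."""
    return sum(column[base] for column, base in zip(pwm, window))


# Toy training windows (positions -3 to +4), for illustration only
training = ["ACCATGG", "GCCATGG", "ACCATGA", "AACATGG"]
pwm = build_pwm(training)
print(score(pwm, "ACCATGG") > score(pwm, "TTTATGC"))  # True
```

A window that resembles the training set receives a higher score than one that does not; the later methods listed above replace this simple additive score with more expressive models.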
34.3.1.1 NetStart 1.0
In this method, the artificial neural network predicts which AUG triplet in the mRNA sequence is the start codon. The trained network correctly classifies 88% of Arabidopsis and 85% of vertebrate “AUG” triplets in a reading frame. The steps are as follows:
Input sequences: this can be done in either of the following two ways (Figure 34.1):
Paste a nucleotide sequence or a number of sequences in FASTA format into the upper window of the main server page.
Choose a FASTA file on the hard disk.
The acceptable input alphabet is “A”, “C”, “G”, “T”, “U” and “X” (unknown). All other codes will be converted to X before being processed. “T” and “U” are treated as equivalent.
Select organism type: depending on the origin of the input sequences, click on either Vertebrate or A. thaliana. The former is the default setting.
Submit the job: click the “Submit” button. The status of the job will be displayed and constantly updated until it terminates, and the server output then appears in the browser window.
Output format: each input sequence is shown with its predicted translation start site, followed by a table listing the positions and scores of all ATG triplets in the sequence. Beneath the sequence, a predicted start codon is marked “i” (initiation), any other “ATG” is marked “N” (non‐start), and all remaining positions are shown as dots (“.”). The scores lie in the range [0.0, 1.0]; an ATG with a score higher than 0.5 is probably a translation start site. The output format is depicted in Figure 34.2.
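The input alphabet rule and the annotation line described in the steps above can be mimicked with a short Python sketch. It does not reproduce the neural network: the scores are supplied by hand as hypothetical values, and placing the marker under the “A” of each ATG is an assumption about the layout.

```python
def normalize(seq: str) -> str:
    """Map U to T and any symbol other than A, C, G or T to X."""
    out = []
    for ch in seq.upper():
        if ch == "U":
            ch = "T"  # T and U are treated as equivalent
        out.append(ch if ch in "ACGT" else "X")
    return "".join(out)


def annotate(seq: str, scores: dict, threshold: float = 0.5) -> str:
    """Return the annotation line: 'i' for predicted starts, 'N' for other ATGs."""
    marks = ["."] * len(seq)
    for pos in range(len(seq) - 2):
        if seq[pos:pos + 3] == "ATG":
            # Marker placed under the A of the ATG (layout is an assumption)
            marks[pos] = "i" if scores.get(pos, 0.0) > threshold else "N"
    return "".join(marks)


seq = normalize("ccaugRaugg")        # "R" is not in the accepted alphabet
hypothetical = {2: 0.72, 6: 0.31}    # made-up scores keyed by 0-based position
print(seq)                           # CCATGXATGG
print(annotate(seq, hypothetical))   # ..i...N...
```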
34.3.1.2 TIS Miner
This tool is used for the prediction of translation initiation site(s) in vertebrate DNA/mRNA/cDNA sequences. TIS Miner was trained on 3312 vertebrate mRNA sequences extracted from GenBank. This data set was initially analyzed by Pedersen and Nielsen (1997) and contains 3312 true TIS ATGs and 10 063 non‐TIS ATGs. The reported accuracy is 92.45%, at 80.19% sensitivity and 96.48% specificity.
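The three reported figures are mutually consistent, which can be checked by recomputing the overall accuracy from the sensitivity, the specificity and the class sizes. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
n_true = 3312      # true TIS ATGs
n_false = 10063    # non-TIS ATGs
sensitivity = 0.8019
specificity = 0.9648

true_positives = sensitivity * n_true     # true TIS correctly predicted
true_negatives = specificity * n_false    # non-TIS correctly rejected
accuracy = (true_positives + true_negatives) / (n_true + n_false)

print(f"{accuracy:.2%}")  # 92.45%
```

Because non‐TIS ATGs outnumber true TIS by about three to one, the overall accuracy is dominated by the specificity.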
The nucleotide sequence can be submitted in either raw or FASTA format. A maximum of 50 000 base pairs per sequence per submission is allowed, to avoid long waiting times for users (Figure 34.3).
The number of predictions is the number of highest‐scoring candidates of the predicted functional site to be reported. If poly(A) signals are being predicted, the hexamer poly(A) signal consensus can be selected; the choices are either ATTAAA or any variant of the NNTANA type.
Submit the query by pressing “SUBMIT”.
Output format: the columns of the TIS Miner output page are summarized below and shown in Figure 34.4.
No. of ATG(s) from the 5′ end. A value of i means that the corresponding candidate is the ith candidate ATG from the 5′ end. A sequence normally includes several candidates for the functional site.
Score. The predicted score lies in the range (0, 1) and is produced by a support vector machine (SVM). The higher the score, the more likely the candidate is a true TIS. At the threshold value of 0.6, a candidate with a score higher than 0.6 is predicted to be a true TIS.
Position (bp). This indicates the position of the corresponding candidate in the submitted nucleic acid sequence.
Identity to Kozak consensus [AG]XXATGG: a “G” residue tends to follow a true TIS, while an “A” or “G” residue is usually found three bases upstream of a true TIS. This column shows whether the candidate “ATG” fits this consensus.
Is any ATG in 100 bp upstream? This column shows whether an ATG exists within 100 bp upstream of the candidate.
Presence of an in‐frame stop codon within 100 bp downstream. This column shows whether an in‐frame stop codon exists within 100 bp downstream of the candidate.
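Apart from the SVM score, the columns listed above can be derived directly from the sequence. The following Python sketch computes them for every candidate ATG. It is not the TIS Miner implementation: the function name and dictionary keys are hypothetical, and the exact window boundaries (100 bp immediately before the “A” of the ATG, and codons starting within 100 bp after the ATG) are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_features(seq: str, window: int = 100) -> list:
    """Tabulate sequence-derived features for each ATG, in 5' to 3' order."""
    s = seq.upper().replace("U", "T")
    starts = [i for i in range(len(s) - 2) if s[i:i + 3] == "ATG"]
    rows = []
    for number, pos in enumerate(starts, start=1):
        # Kozak identity [AG]XXATGG: purine at -3 and G at +4
        kozak = (pos >= 3 and pos + 3 < len(s)
                 and s[pos - 3] in "AG" and s[pos + 3] == "G")
        # Any complete ATG within the `window` bases upstream of the candidate
        upstream_atg = "ATG" in s[max(0, pos - window):pos]
        # Any stop codon in the candidate's frame, starting within `window` bases
        stop_end = min(len(s) - 2, pos + 3 + window)
        downstream_stop = any(
            s[j:j + 3] in STOP_CODONS for j in range(pos + 3, stop_end, 3)
        )
        rows.append({
            "no_from_5prime": number,
            "position_bp": pos + 1,  # 1-based position of the A of ATG
            "kozak": kozak,
            "upstream_atg": upstream_atg,
            "inframe_stop": downstream_stop,
        })
    return rows


for row in candidate_features("GCCACCATGGCTATGAAATAA"):
    print(row)
```

In this toy sequence the first ATG (position 7) fits the Kozak consensus and has no upstream ATG, whereas the second (position 13) fails the consensus and has an ATG upstream; both are followed by an in‐frame stop codon.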
34.4 QUESTIONS
1. How will you predict the start codon in an mRNA sequence of Arabidopsis?
Hint: use NetStart 1.0.
2. What is meant by the Kozak sequence? How will you predict the translation initiation sites in vertebrate DNA?
Hint: the Kozak sequence is a sequence that occurs on eukaryotic mRNA and has the consensus (gcc)gccRccAUGG; use TIS Miner and follow the usage instructions explained in this chapter.
3. Explain in detail how to use the NetStart 1.0 system to classify TIS on a genomic scale.