CHAPTER 34
Prediction of Translation Initiation Sites

S Jain1, S Panwar2 and A Kumar3

1 Department of Applied Sciences & Humanities, Jai Parkash MukandLal Innovative Engineering and Technology Institute, Haryana, India

2 Department of Genetics and Plant Breeding, Chaudhary Charan Singh University, Uttar Pradesh, India

3 Department of Nutrition Biology, Central University of Haryana, Haryana, India

34.1 INTRODUCTION

The correct recognition of translation initiation sites (TIS) helps us to understand gene structure and the gene product. Computational identification of the TIS is a core component of any gene prediction system and is therefore of great importance in genome annotation. Many data mining methods have been employed to identify TIS in transcripts such as mRNA, EST and cDNA sequences. All these methods are based on the scanning model (Kozak, 1989), which states that, in eukaryotes, the first “AUG” (start codon) from the 5′ end of the mRNA transcript is usually the true TIS. However, exceptions can occur through leaky scanning, re‐initiation and internal initiation of translation, any of which can result in another AUG being the true TIS.
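The default rule of the scanning model can be sketched in a few lines of Python. The function name `first_aug` is illustrative and does not come from any tool discussed in this chapter, and the rule deliberately ignores leaky scanning, re‐initiation and internal initiation.

```python
from typing import Optional


def first_aug(mrna: str) -> Optional[int]:
    """Return the 1-based position of the first AUG from the 5' end, or None."""
    # Accept cDNA-style input (T instead of U) and lower-case letters.
    seq = mrna.upper().replace("T", "U")
    idx = seq.find("AUG")
    return idx + 1 if idx >= 0 else None


if __name__ == "__main__":
    # The extended Kozak consensus followed by a second, downstream AUG.
    print(first_aug("GCCGCCACCAUGGCUAUGUAA"))
```

A predictor that stopped here would miss every transcript in which a downstream AUG is the true TIS, which is why the tools described below score each candidate AUG instead.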

The consensus motif GCCRCCatgG around the TIS was probably the first effort to identify TIS with statistical meaning (Salamov et al., 1998). The general approach to the TIS prediction problem is to derive numerical data from the cDNA sequences and then apply computational methods to those data.
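One common way to turn the sequence around a candidate ATG into numerical data is one‐hot encoding, sketched below. This is an assumption made for illustration: the chapter does not say which encoding any particular tool uses, and the function name and the `flank` parameter are hypothetical.

```python
from typing import List

ALPHABET = "ACGT"


def one_hot_window(seq: str, atg_index: int, flank: int = 6) -> List[int]:
    """Encode `flank` bases on each side of an ATG, plus the ATG itself.

    `atg_index` is the 0-based index of the A of the ATG. Each position
    becomes four 0/1 values (A, C, G, T). Positions outside the sequence
    and non-ACGT symbols are encoded as four zeros.
    """
    seq = seq.upper().replace("U", "T")
    vector: List[int] = []
    for i in range(atg_index - flank, atg_index + 3 + flank):
        base = seq[i] if 0 <= i < len(seq) else "N"
        vector.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vector


if __name__ == "__main__":
    features = one_hot_window("GCCACCATGGCG", atg_index=6, flank=3)
    print(len(features), features[:4])
```

The resulting fixed‐length vector can be passed to any standard classifier.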

34.2 OBJECTIVE

To predict the translation initiation site by exploiting the NetStart and TIS Miner tools.

34.2.1 The Kozak sequence

The Kozak consensus sequence was originally defined as ACCAUGG, based on the effect of single nucleotide changes around the translation initiation codon (AUG) of the preproinsulin gene. Subsequently, it was extended to “GCCGCCACCAUGG”, based on mutation studies and a survey of 699 vertebrate transcripts. Further, expression of preproinsulin and alpha‐globin in cells showed that a purine (generally “A”) at position –3 is essential for efficient initiation of translation and that, in its absence, a “G” at position +4 is essential.
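The –3/+4 rule can be checked for any candidate ATG with a short function. This is a minimal sketch: the labels “strong”, “adequate” and “weak” are illustrative names chosen here, not categories defined in this chapter, and positions are counted with the A of the ATG as +1.

```python
def kozak_context(seq: str, atg_index: int) -> str:
    """Classify the context of the ATG whose A is at 0-based `atg_index`.

    Position -3 is three bases upstream of the A; position +4 is the base
    immediately after the G of the ATG.
    """
    seq = seq.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    minus3 = seq[atg_index - 3] if atg_index >= 3 else ""
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else ""
    purine_at_minus3 = minus3 in ("A", "G")
    g_at_plus4 = plus4 == "G"
    if purine_at_minus3 and g_at_plus4:
        return "strong"    # both key positions match
    if purine_at_minus3 or g_at_plus4:
        return "adequate"  # one of the two key positions matches
    return "weak"          # neither matches


if __name__ == "__main__":
    print(kozak_context("GCCACCATGG", 6))
```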

34.3 PROCEDURE

34.3.1 Tools used in translation initiation site prediction

Kozak (1987) proposed the first method to identify TIS, using a weight matrix to model the conserved sequence in the vicinity of the TIS. Pedersen and Nielsen (1997) then introduced NetStart, the first truly automated system, which employs an artificial neural network (ANN) to identify TIS in mRNA transcripts. Salzberg (1997) used a conditional probability (CP) matrix to model TIS. This line of work was continued by Li and Jiang (2004), who developed a new Edit‐Kernel approach called TIS hunter.
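The weight matrix idea can be illustrated with a small log‐odds matrix built from aligned windows around known start codons. The four training windows below are hypothetical and serve only to make the example run; the uniform background of 0.25 and the pseudocount of 1 are likewise assumptions, not values taken from the cited work.

```python
import math
from typing import Dict, List


def build_pwm(windows: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build column-wise log2-odds scores against a uniform 0.25 background."""
    pwm = []
    for col in range(len(windows[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for window in windows:
            counts[window[col]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return pwm


def score(pwm: List[Dict[str, float]], window: str) -> float:
    """Sum the per-position scores of a window of the same length as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))


# Hypothetical aligned windows covering positions -3..+4 around the ATG.
training = ["ACCATGG", "GCCATGG", "AACATGG", "ACCATGA"]
pwm = build_pwm(training)

if __name__ == "__main__":
    print(score(pwm, "ACCATGG") > score(pwm, "TTTATGC"))
```

A window that resembles the training set receives a higher score than one that does not, which is the basis for ranking candidate ATGs.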

34.3.1.1 NetStart 1.0

In this method, the artificial neural network predicts which AUG triplet in the mRNA sequence is the start codon. The trained network correctly classifies 88% of Arabidopsis and 85% of vertebrate “AUG” triplets in a reading frame. The steps are as follows:

  1. Open the server: go to http://www.cbs.dtu.dk/services/NetStart/
  2. Input sequences: sequences can be supplied for processing in either of two ways (Figure 34.1):
    1. Paste a nucleotide sequence or a number of sequences in FASTA format into the upper window of the main server page.
    2. Choose a FASTA file on the hard disk.
    The acceptable input alphabet is “A”, “C”, “G”, “T”, “U” and “X” (unknown). All other codes will be converted to X before being processed. “T” and “U” are treated as equivalent.
  3. Select organism type: depending on the origin of the input sequences, click on either Vertebrate or A. thaliana. The former is the default setting.
  4. Submit the job: click the “Submit” button. The status of the job is displayed and constantly updated until the job terminates and the server output appears in the browser window.
  5. Output format: each input sequence is shown with the predicted translation start site, followed by a table listing the position and score of every ATG in the sequence. Beneath the sequence, the predicted start codon is marked “i” (initiation), every other “ATG” is marked “N” (non‐start), and all remaining positions are marked with dots (“.”). The scores lie in the range [0.0, 1.0]; an ATG with a score higher than 0.5 is probably a translation start site. The output format is depicted in Figure 34.2.

FIGURE 34.1 File format of inserted nucleotide sequence in NetStart 1.0.
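The input alphabet rule described in step 2 can be reproduced locally before submission, which makes it easy to see how many symbols the server would treat as unknown. This is a sketch of the stated rule, not the server's own code; the function name is hypothetical, and writing every “U” as “T” is a choice made here because the two are treated as equivalent.

```python
def normalize_for_netstart(seq: str) -> str:
    """Apply the stated alphabet rule: keep A, C, G, T, U, X; others become X."""
    allowed = set("ACGTUX")
    cleaned = []
    for ch in seq.upper():
        if ch.isspace():
            continue  # drop line breaks and spaces from pasted text
        cleaned.append(ch if ch in allowed else "X")
    # T and U are equivalent for the server, so use T throughout.
    return "".join(cleaned).replace("U", "T")


if __name__ == "__main__":
    print(normalize_for_netstart("acgu nry\nATG"))
```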

FIGURE 34.2 Output format for translation start predictions for a vertebrate sequence.
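The marker line described in step 5 can be rebuilt from a table of ATG scores, as in the sketch below. The scores in the example are invented for illustration, the function name is hypothetical, and placing each mark under the first base of the ATG is an assumption about the layout rather than a documented detail.

```python
from typing import Dict


def annotate(seq: str, scores: Dict[int, float], threshold: float = 0.5) -> str:
    """Return a marker line: 'i' for predicted starts, 'N' for other ATGs, '.' elsewhere.

    `scores` maps the 0-based index of each ATG to its score; an ATG with no
    entry is treated as scoring 0.0.
    """
    seq = seq.upper().replace("U", "T")
    marks = ["."] * len(seq)
    start = seq.find("ATG")
    while start != -1:
        marks[start] = "i" if scores.get(start, 0.0) > threshold else "N"
        start = seq.find("ATG", start + 1)
    return "".join(marks)


if __name__ == "__main__":
    sequence = "CCATGGATGA"
    print(sequence)
    print(annotate(sequence, {2: 0.8, 6: 0.3}))  # hypothetical scores
```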

34.3.1.2 TIS Miner

TIS Miner predicts translation initiation site(s) in vertebrate DNA/mRNA/cDNA sequences. It was trained on 3312 vertebrate mRNA sequences extracted from GenBank. Pedersen and Nielsen (1997) initially analyzed these data, which contain 3312 true TIS ATGs and 10 063 non‐TIS ATGs. The reported accuracy is 92.45%, at 80.19% sensitivity and 96.48% specificity.
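The three reported figures are mutually consistent, which can be checked with a few lines of arithmetic. The check assumes that accuracy was measured over the same 3312 true TIS ATGs and 10 063 non‐TIS ATGs, which the text implies but does not state outright.

```python
n_tis, n_non_tis = 3312, 10063
sensitivity, specificity = 0.8019, 0.9648

# Sensitivity is the fraction of true TIS ATGs recovered;
# specificity is the fraction of non-TIS ATGs correctly rejected.
true_positives = sensitivity * n_tis
true_negatives = specificity * n_non_tis

accuracy = (true_positives + true_negatives) / (n_tis + n_non_tis)

if __name__ == "__main__":
    print(f"{accuracy:.2%}")  # 92.45%
```

Because non‐TIS ATGs outnumber true TIS ATGs about three to one, the overall accuracy is dominated by the specificity.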

  1. Go to http://dnafsminer.bic.nus.edu.sg/Tis.html. TIS Miner and Poly(A) Signal Miner can be launched from the left panel of the homepage.
  2. The nucleotide sequence can be submitted in either raw or FASTA format. A limit of 50 000 base pairs per sequence per submission is set to avoid long waiting times for users (Figure 34.3).
  3. The number of predictions is the number of highest‐scoring candidates of the predicted functional site to be reported. The hexamer poly(A) signal consensus can be selected when predicting poly(A) signals; the choices are either ATTAAA or any variant of the NNTANA type.
  4. Submit the query by pressing “SUBMIT”.
  5. Output format: The output page of TIS Miner is summarized below and shown in Figure 34.4.
    1. No. of ATG(s) from the 5′ end. A value of i means that the corresponding candidate is the ith candidate ATG from the 5′ end. Normally, a sequence includes several candidates for the functional site.
    2. Score. The predicted score lies in the range (0, 1) and is produced by a support vector machine (SVM). The higher the score, the greater the likelihood that the candidate is a true TIS. At the threshold value of 0.6, a candidate with a score higher than 0.6 is predicted to be a true TIS.
    3. Position (bp). This indicates the position of the corresponding candidate in the submitted nucleic acid sequence.
    4. Identity to Kozak consensus [AG]XXATGG: a “G” residue tends to follow a true TIS, while an “A” or “G” residue is usually found three bases upstream of a true TIS. This column shows whether the candidate “ATG” fits this consensus.
    5. Is any ATG in 100 bp upstream? This column shows whether an ATG exists within 100 bp upstream of the candidate.
    6. The presence of an in‐frame stop codon within 100 bp downstream: this column shows whether any in‐frame stop codon occurs within 100 bp downstream of the candidate.

FIGURE 34.3 File format of inserted nucleotide sequence in TIS Miner.
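Before submitting, raw or FASTA input can be checked locally against the 50 000 bp per‐sequence limit. The parser below is a minimal sketch with hypothetical function names; it is not part of TIS Miner and handles only plain FASTA headers that begin with “>”.

```python
from typing import Dict, List

MAX_BP = 50_000  # per-sequence limit quoted in step 2


def read_sequences(text: str) -> Dict[str, str]:
    """Parse raw or FASTA text into a {name: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = "raw_sequence"  # placeholder name for input without a header
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0] if len(line) > 1 else "unnamed"
            records.setdefault(name, "")
        else:
            records[name] = records.get(name, "") + line.upper()
    return records


def too_long(records: Dict[str, str]) -> List[str]:
    """Return the names of sequences that exceed the submission limit."""
    return [name for name, seq in records.items() if len(seq) > MAX_BP]


if __name__ == "__main__":
    demo = read_sequences(">seq1 example\nACGT\nacgt\n")
    print(demo, too_long(demo))
```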

FIGURE 34.4 Output format for TIS Miner.
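Every output column except the score can be recomputed directly from the sequence, which helps when interpreting the table. The sketch below is an illustration under stated assumptions: the function and key names are hypothetical, positions are reported as 1‐based, the consensus test is a purine three bases upstream plus a “G” immediately after the ATG, and the SVM score is omitted because the trained model is not reproduced here.

```python
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_table(seq: str, window: int = 100) -> List[Dict[str, object]]:
    """List every ATG with the sequence-derived columns of the output table."""
    seq = seq.upper().replace("U", "T")
    rows: List[Dict[str, object]] = []
    start = seq.find("ATG")
    while start != -1:
        upstream = seq[max(0, start - window):start]
        downstream = seq[start + 3:start + 3 + window]
        # Split the downstream region into complete codons in the ATG's frame.
        codons = [downstream[i:i + 3] for i in range(0, len(downstream) - 2, 3)]
        rows.append({
            "atg_no": len(rows) + 1,   # ith ATG from the 5' end
            "position": start + 1,     # 1-based position of the A
            "kozak": start >= 3
                     and seq[start - 3] in "AG"
                     and seq[start + 3:start + 4] == "G",
            "upstream_atg": "ATG" in upstream,
            "downstream_stop": any(codon in STOP_CODONS for codon in codons),
        })
        start = seq.find("ATG", start + 1)
    return rows


if __name__ == "__main__":
    for row in candidate_table("CCACCATGGCTTAAATGCCC"):
        print(row)
```

A candidate that fits the consensus, has no upstream ATG and has no nearby in‐frame stop codon is the kind of candidate one would expect to score highly.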

34.4 QUESTIONS

  1. How will you predict the start codon in an mRNA sequence of Arabidopsis?

    Hint: use NetStart 1.0.

  2. What do you mean by the Kozak sequence? How will you predict the translation initiation sites in vertebrate DNA?

    Hint: The Kozak sequence is a sequence that occurs on eukaryotic mRNA and has the consensus (gcc)gccRccAUGG; use TIS Miner and follow the usage instructions explained in this chapter.

  3. Explain in detail how to use the NetStart 1.0 system to classify TIS on a genomic scale.

    Hint: See Sections 34.3.1.1 and 34.3.1.2.

  4. Describe the complete experimental setup of the TIS Miner system for TIS prediction in an mRNA sequence extracted from GenBank.

    Hint: See Section 34.3.1.2.

  5. Briefly define every column of the output table produced for a submitted sequence in the TIS Miner system.

    Hint: Consult step 5 of Section 34.3.1.2.
