CS Mukhopadhyay and RK Choudhary
School of Animal Biotechnology, GADVASU, Ludhiana
Nucleotide and amino acid sequences are analyzed to uncover their hidden features, to discover patterns, and to determine their function, structure and evolution. Frequently used in silico analyses of molecular sequences include: sequence alignment; determination of conserved regions; identification of low‐complexity regions of nucleotides; gene prediction; nucleotide sequence assembly; exploration of the biochemical and immunogenic properties of amino acid sequences; and protein structure prediction. After completing this chapter, you will be able to apply some of these sequence analysis techniques using a free online software package called the Sequence Manipulation Suite (SMS). The original examples cited in the software suite (as “help” for explaining the programs) have been used here. In some places, the explanations may be verbatim.
This is a collection (which is why it is called a “suite”) of programs, written in JavaScript 1.5, for generating, formatting, and analyzing short DNA and protein sequences. Paul Stothard (University of Alberta, Canada) wrote the software suite (Stothard, 2000). The off‐line version of the suite can be downloaded from http://www.bioinformatics.org/sms2/mirror.html/.
Sequences are submitted to the sequence box of SMS and then analyzed according to the particular query.
To learn the use of the different programs within SMS for analyzing nucleotide and amino acid sequences.
This converts multiple FASTA sequences (either nucleotide or amino acid records) into a single sequence (Figure 6.1). The software imposes a restriction of input to 500 000 characters in total (inclusive of description line and input sequences).
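The idea of combining records can be illustrated with a minimal Python sketch. This is not the SMS code itself: the function name `combine_fasta` and the decision to simply drop description lines are assumptions made for illustration.

```python
def combine_fasta(text):
    """Concatenate the residues of every FASTA record in `text` into one string."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and description lines
        parts.append(line.replace(" ", ""))  # tolerate spaced-out residues
    return "".join(parts)


records = """>seq1
ATGC
GGA
>seq2
TTAA
"""
print(combine_fasta(records))  # ATGCGGATTAA
```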
This program extracts the salient features (according to the annotations) of one or more EMBL file(s), and returns the sequences in FASTA format in a new window. The program thus returns the whole nucleotide sequence, the mRNA and the cDNA sequence as separate FASTA‐format records. This is useful if the user wants to extract only the CDS (coding sequence) or the mRNA sequence from the whole gene sequence (which contains both exons and introns).
This program has a limit of 200 000 characters as input. There are two options for the output sequence features:
This program accepts one or more EMBL files and extracts the translated amino acid sequence in the result window (Figure 6.2). This program has a limit of 200 000 characters of input.
The input DNA sequence is filtered by eliminating the non‐DNA characters (digits and blank spaces) from the whole sequence (input limit is 500 000) (Figure 6.3).
There are some options to modify filtration:
This is similar to the “Filter DNA” program. It filters out non‐amino acid characters (digits, blank spaces, special characters) from an amino acid sequence. Some options are available on what to replace, replace with what and case conversion. The character limit of input is 500 000.
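The two filter programs described above can be approximated with a regular expression. The sketch below is only an illustration under stated assumptions: the helper `filter_sequence` is hypothetical, and the strict alphabets used here (four bases, twenty amino acids) are a simplification, since the actual programs may keep additional symbols such as ambiguity codes.

```python
import re

DNA = "ACGTacgt"                                       # assumed strict DNA alphabet
PROTEIN = "ACDEFGHIKLMNPQRSTVWYacdefghiklmnpqrstvwy"   # assumed 20 standard residues


def filter_sequence(text, allowed, replacement=""):
    """Replace every character that is not in `allowed` with `replacement`."""
    pattern = "[^" + re.escape(allowed) + "]"
    return re.sub(pattern, replacement, text)


print(filter_sequence("1 atgc cgta 61", DNA))            # atgccgta
print(filter_sequence("MK 12 TA*YI", PROTEIN).upper())   # MKTAYI
```

Passing a non-empty `replacement` (for example `"N"`) mimics the option of substituting removed characters rather than deleting them.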
Similar to the “EMBL Feature Extractor” program. The input is nucleotide sequences in GenBank format.
Similar to the “EMBL Trans Extractor”. The input is the nucleotide sequence in GenBank format.
This program converts single‐letter amino acid codes into three‐letter amino acid codes. One or more amino acid sequence(s) in FASTA format (one‐letter code) are pasted into the sequence box. The input limit is 100 000 characters.
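The conversion amounts to a table lookup, as the following sketch shows. The function name `one_to_three` and the use of "Xaa" for unrecognized symbols are assumptions for illustration, not the behavior of the SMS program.

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def one_to_three(seq):
    """Convert a one-letter protein sequence to concatenated three-letter codes."""
    # Unknown symbols are written as "Xaa" (an assumption of this sketch).
    return "".join(THREE_LETTER.get(aa, "Xaa") for aa in seq.upper())


print(one_to_three("MKT"))  # MetLysThr
```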
This returns the specific nucleotide sequence, based on the position and/or range(s) of nucleotide(s)/nucleotide sequence(s) specified in the input. The user needs to paste the DNA sequence in the sequence box. The specific position(s) (given by the position value(s) of the base(s)) and/or the ranges (two position values for the termini, separated by “…”), separated by comma(s), are then given. There are some options to output the results in FASTA format (either in upper or lower case) in one sequence, or in multiple sets of sequences (for multiple positions/ranges). The range(s) can be specified either in the original strand (“direct strand” option) or in the complementary strand to the input sequence (“complementary strand” option). The input limit is 500 000 characters.
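Position and range extraction can be sketched in a few lines of Python. The example below handles the direct strand only, and it assumes a hypothetical specification syntax in which single positions and `start..end` ranges are separated by commas, with 1-based, inclusive coordinates; the exact syntax accepted by the program should be checked in its help text.

```python
def extract_ranges(seq, spec):
    """Return the pieces of `seq` named by `spec`, e.g. "2,4..7" (1-based, inclusive)."""
    pieces = []
    for item in spec.split(","):
        item = item.strip()
        if ".." in item:
            start, end = (int(x) for x in item.split(".."))
        else:
            start = end = int(item)  # a single position
        pieces.append(seq[start - 1:end])
    return pieces


print(extract_ranges("ATGCATGCAT", "2,4..7"))  # ['T', 'CATG']
```

Joining the returned pieces gives the single-sequence output; keeping them separate corresponds to one record per position or range.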
This program is similar to “Range Extractor DNA”, except that the input sequence is amino acid (in FASTA format) (Figure 6.4). Obviously, the drop‐down options for “direct strand” and “complementary strand” are not there for amino acids in this program.
This program is used to obtain the reverse‐complement, only the reverse, or only the complement of the given nucleotide sequence(s). It can work with single or multiple DNA sequence(s) as input. It supports the full IUPAC DNA alphabet (Figure 6.5). The input limit is 100 000 characters.
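The three operations differ only in whether the sequence is complemented, reversed, or both. A minimal sketch, with function names chosen for illustration and the standard IUPAC complement pairs (R/Y, K/M, B/V, D/H; S, W and N are their own complements):

```python
IUPAC = "ACGTRYSWKMBDHVN"
COMP = "TGCAYRSWMKVHDBN"   # complement of each symbol above, position by position
TABLE = str.maketrans(IUPAC + IUPAC.lower(), COMP + COMP.lower())


def complement(seq):
    return seq.translate(TABLE)


def reverse(seq):
    return seq[::-1]


def reverse_complement(seq):
    return complement(seq)[::-1]


print(complement("ATGCRN"))          # TACGYN
print(reverse("ATGCRN"))             # NRCGTA
print(reverse_complement("ATGCRN"))  # NYGCAT
```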
This accepts single or multiple DNA sequence(s) in FASTA format, and estimates the number and frequency of usage of each of the 64 codons available in the specific genome (eukaryotic/prokaryotic, nuclear/mitochondrial, etc.). The output presents the frequency of occurrence of each codon in the given input sequence(s). The preference of a given sequence for a specific synonymous codon can be determined with this program. The input limit is 500 000 characters.
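Counting codons is straightforward once a reading frame is fixed. The sketch below is a simplified illustration: it assumes the sequence starts in frame 1, ignores any trailing partial codon, and reports frequencies per 1000 codons; the name `codon_usage` is hypothetical.

```python
from collections import Counter


def codon_usage(seq):
    """Return {codon: (count, frequency per 1000 codons)} for frame 1 of `seq`."""
    seq = seq.upper()
    # Step through the sequence three bases at a time; a trailing partial codon is dropped.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: (n, 1000.0 * n / total) for codon, n in sorted(counts.items())}


for codon, (n, per_thousand) in codon_usage("ATGGCTGCTGCCTAA").items():
    print(codon, n, per_thousand)
```

In this toy input, alanine is encoded twice by GCT and once by GCC, which is the kind of synonymous-codon preference the program is meant to reveal.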
This program estimates the Observed/Expected ratio of CpG dinucleotides and the G+C content in a 200 bp window moved along a given DNA sequence (Gardiner‐Garden and Frommer, 1987). CpG islands are like islets within a given DNA sequence (split into windows of a specific length) that are characterized by an Observed/Expected ratio of Cytosine‐Phosphate‐Guanosine (CpG) dimers greater than 0.6 and a GC content greater than 50%.
This program can also be used for identifying the 5′ regions of vertebrate genes, since these regions are often rich in CpG dimers in vertebrates. The maximum input limit is 100 000 characters.
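The two criteria can be made concrete with a short sliding-window sketch. It assumes the commonly cited form of the Gardiner‐Garden and Frommer calculation, in which the expected number of CpG dimers in a window is (number of C × number of G) / window length, and a window step of 1 bp; the function name and return format are illustrative only.

```python
def cpg_windows(seq, window=200, min_ratio=0.6, min_gc=0.5):
    """Return (start, end) of every window meeting both CpG island criteria (1-based)."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        observed = w.count("CG")      # CpG dimers seen in the window
        expected = c * g / window     # assumed expectation: (C x G) / window length
        gc_content = (c + g) / window
        if expected > 0 and observed / expected > min_ratio and gc_content > min_gc:
            hits.append((start + 1, start + window))
    return hits


print(cpg_windows("CG" * 100))  # [(1, 200)]
print(cpg_windows("AT" * 100))  # []
```

Overlapping windows that pass both tests would normally be merged into a single reported island; that step is omitted here for brevity.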
This calculates the molecular weight of double‐ or single‐stranded, linear or circular DNA sequence(s) in FASTA format (drop‐down options are provided to select the type of DNA molecule). Standard IUPAC base symbols are accepted. The character limit is 200 000. The molecular weight obtained is used for calculating molecule copy number.
This program scans one or more submitted DNA sequence(s) for a specific pattern supplied by the user. The default pattern is “ctt[ca]”, which searches for occurrences of “cttc” and “ctta”. The user can modify it. The output lists the base positions (start and end) of each match, along with the number of times the pattern has been found in the direct (original) or reverse strand. “DNA Pattern Find” is used to screen the input sequence (as a raw sequence or as FASTA‐formatted sequence(s)) and localize the pattern of interest. The input limit is 500 000 characters.
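The default pattern is an ordinary regular expression, so its behavior can be reproduced with Python's `re` module. This sketch searches the direct strand only and reports 1-based start and end positions; the function name `find_pattern` is an assumption for illustration.

```python
import re


def find_pattern(seq, pattern="ctt[ca]"):
    """Return (start, end, match) for each non-overlapping match, 1-based inclusive."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(seq)]


print(find_pattern("aacttcggcttagg"))
# [(3, 6, 'cttc'), (9, 12, 'ctta')]
```

To cover the reverse strand as well, the same search could be run on the reverse‐complement of the input.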
This is a very useful program for obtaining the number, as well as the percentage, of each base in the input sequence(s), and of groups of bases (e.g., pyrimidines, purines, A/T, etc.). The sequence(s) are submitted as a raw sequence, or as one or more FASTA‐format records. The input limit is 500 000 characters.
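The kind of tally described above can be sketched as follows. The grouping shown (individual bases, purines, pyrimidines, A+T, G+C) is a small illustrative subset chosen for this example, not the full report produced by the program.

```python
from collections import Counter

GROUPS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "purine (A+G)": "AG",
    "pyrimidine (C+T)": "CT",
    "A+T": "AT",
    "G+C": "GC",
}


def dna_stats(seq):
    """Return {group: (count, percentage)} for a raw DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    total = len(seq)
    stats = {}
    for name, bases in GROUPS.items():
        n = sum(counts[b] for b in bases)
        stats[name] = (n, 100.0 * n / total if total else 0.0)
    return stats


for name, (n, pct) in dna_stats("AACGT").items():
    print(f"{name:18s} {n:3d} {pct:6.1f}%")
```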
This program is used to explore mutable regions in a DNA sequence (provided in FASTA format) to generate a restriction site to study the effect of mutation on restriction digestion. The output file also displays the translation of the DNA (according to the reading frame indicated by the user), to determine the alterations in various reading frames (RFs) due to the proposed mutation. Thus, experiments involving polymerase chain reactions (PCR) or site‐directed mutagenesis can be studied in silico using this program. Four parameters can be set for alteration of output:
The theoretical isoelectric point (pI) is calculated for single or multiple amino acid sequence(s) (in FASTA format, with an input limit of 200 000 characters), to estimate the probable location of a protein on a 2D gel. The user can add up to five copies of one of the 21 optional epitopes and fusion protein tags listed (e.g., His6, HSV, Glu‐Glu, etc.) to see how the tag alters the pI of the submitted amino acid sequences (Figure 6.6).
This calculates the molecular weight of one or more protein sequence(s), entered in FASTA format or as a raw (unformatted) sequence (character limit is 200 000). The user can append 1–5 copies of one of the 21 listed epitopes and fusion proteins. This program is used to predict the position of a recombinant or native protein on a gel, relative to a set of protein standards.
Similar to “DNA Pattern Find”, this program is used to search for a query (i.e., any consensus amino acid sequence) within one or more input sequence(s) (entered in FASTA format, with a character limit of 500 000). The default search pattern is “X[^X]{0,5}X”, which means that the user wants to search for two residues of the amino acid “X” separated by 0–5 amino acids other than X.
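The meaning of the default pattern is easiest to see with a concrete residue substituted for “X”. The sketch below uses cysteine (C) purely as an example, and the function name `find_motif` is hypothetical.

```python
import re


def find_motif(seq, pattern):
    """Return (start, end, match) for each non-overlapping match, 1-based inclusive."""
    return [(m.start() + 1, m.end(), m.group())
            for m in re.finditer(pattern, seq.upper())]


# Two cysteines separated by 0-5 residues that are not cysteine.
print(find_motif("MACDEFCKKC", "C[^C]{0,5}C"))  # [(3, 7, 'CDEFC')]
print(find_motif("ACCA", "C[^C]{0,5}C"))        # [(2, 3, 'CC')]
```

Because matches are non-overlapping, the cysteine at position 7 in the first example is consumed by the first hit and is not paired again with the cysteine at position 10.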
Similar to the DNA Stats program, this is used to obtain data such as the number of occurrences of each residue in one or more input sequence(s) (in FASTA format or as a raw sequence; input limit 500 000 characters).
This returns the positions of the restriction sites for all the listed, commonly used restriction enzymes (REs) in one or more linear or circular DNA sequence(s) in FASTA format (100 000 base limit). This program is very useful for scanning a DNA sequence for the RE sites it may contain.
This returns the reverse translated nucleotide sequence(s), along with a consensus sequence for each amino acid, from one or more input amino acid sequence(s) (in FASTA format with a limit of 20 000 characters), based on the codon usage table entered by the user (selected from http://www.kazusa.or.jp/codon/). This program is used to design oligos that target a (not yet sequenced) coding region belonging to a related species.
This displays a “textual map” for the RE sites in the template DNA (FASTA format input; input limit is 100 000 characters) which can be exploited for exploring the RE sites for cloning a sequence. It also returns the in silico translated amino acid sequence, according to the user‐defined reading frame.
This depicts a textual map displaying in silico translations of the input DNA sequence (in FASTA format; input limit is 500 000 characters), according to the first, second, third, or all three reading frames (RFs). This program supports IUPAC codes and several different genetic codes.
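Translation in the three forward reading frames can be sketched with the standard genetic code. The example below is a minimal illustration: it builds the codon table from the conventional TCAG ordering, writes stop codons as “*” and any codon containing an ambiguity symbol as “X” (an assumption of this sketch), and does not draw the aligned textual map.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=1):
    """Translate `seq` in forward reading frame 1, 2 or 3."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for codons with ambiguity symbols
        for i in range(frame - 1, len(seq) - 2, 3)
    )


for frame in (1, 2, 3):
    print(frame, translate("ATGGCCTAA", frame))
# 1 MA*
# 2 WP
# 3 GL
```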
This introduces random mutation(s) in a coding sequence (presented in FASTA format as input sequence; input limit is 100 000 characters), which are studied to assess the effect of spontaneous mutation on the nature of the encoded peptide. The user can specify the number of mutation(s) and whether mutation is to occur in the start and stop codon of the mRNA.
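Random point mutation of a coding sequence can be sketched as below. The function name, the `protect_ends` flag (which shields the first and last three bases, i.e., the start and stop codons) and the `seed` argument are all assumptions made for this illustration; as in the program described, the same position may be hit more than once.

```python
import random


def mutate_dna(seq, n_mutations, protect_ends=True, seed=None):
    """Introduce `n_mutations` random base substitutions into `seq`."""
    rng = random.Random(seed)
    bases = list(seq.upper())
    # Optionally leave the start codon (first 3 bases) and stop codon (last 3) untouched.
    lo, hi = (3, len(bases) - 3) if protect_ends else (0, len(bases))
    if hi <= lo:
        return "".join(bases)  # nothing eligible to mutate
    for _ in range(n_mutations):
        pos = rng.randrange(lo, hi)
        # Substitute with a base different from the current one.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


original = "ATGAAACCCGGGTAA"
mutant = mutate_dna(original, 3, seed=1)
print(original)
print(mutant)
```

Translating the original and the mutant sequences and comparing the two peptides then shows whether each substitution was silent, missense or nonsense.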
Similar to the “Mutate DNA” program, this introduces random mutations into an amino acid sequence. As in the “Mutate DNA” program, multiple mutations can occur at the same amino acid position. This program is used to assess the effect of mutation on the chemical nature of the peptide, and the phenotypic effect on the trait.
This produces a random coding sequence (ORF from start to stop codon), based on the user‐specified genetic code and ORF length. Such ORFs are used to study the evolutionary perspectives and speciation.
Similar to “Random Coding Sequence”, this generates random DNA instead of a coding sequence.
> Seq1_GenBank_Acc_No_ AB002707.1
A G A T A A T A C T T G A G A C G T T C C A G T T T N T A T T A G T A C A A A A T G N C C A A T T C A T T C A A T G A A T T G A G A A A T G A C A T T C T A A G T G A G T T A G G A G C C A C G A C A A T T G T A G A A C A C A C A G T G T T T A A C A A G T A A C C A A T G A G A A T T N N T G A T C T A T C A A T C A G T T G G T A G T A T C G A G G A C T A C C A A G A T T A T A A C G G A A T A A C G A G G A A T T
> Seq2_GenBank_Acc_No_ KT779508.1
T G A G T A A A T C A G T T A T A G T T T G T T T G A T G G T A T C T A C T A C T C G G A T A A C C G T A G T A A T T C T A G A G C T A A T A C G T G C A A C A A A C C C C G A C T T C T G G A A G G G A T G C A T T T A T T A G A T A A A A G G T C G A C G C G G G C T C T G C C C G T T G C T G C G A T G A T T C A T G A T A A C T C G A C G G A T C G C A C G G C C A T C G T G C C G G C G A C G C A T C A T T C A A A T T T C T G C C C T A T C A A C T T T C G A T G G T A G G A T A G T G G C C T A C C A T G G T G G T G A C G G G T G A C G G A G A A T T A G G G T T C G A T T C C G G A G A G G G A G C C T G A G A A A C G G C T A C C A C A T C C A A G G A A G G C A G C A G G C G C G C A A A T T A C C C A A T C C T G A C A C G G G G A G G T A G T G A C A A T A A A T A A C A A T A C C G G G C T C A A T G A G T C T G G T A A T T G G A A T G A G T A C A A T C T A A A T C C C T T A A
> Seq1_GenBank_Acc_No_ XM_014823107.1
A T G A G A A G C G G C A T C A T A G C G C A G T G C G C T T T C T G T G T A A C T C G C G G C A A C G T C G C T C A G G C A A G C T T T C G A T T T C T G G C C C A G A A C T T C G G C C G C A A G A T C T G T C C G C T A G C T T G G G C A C A C T C G T C G G A T C G G T G C C G C A G C T G C T T C T G G C G C G G C C G G A T A C C A G A C A T A C C A G A G C G T G A T T A C C T G C G T G T G T G G G C G C A A G A G G A T C T C A A C G T C A T C G T C A T C G T C A T G G C A A C C C T T G G C A A G T T T G C C T T A A C G G T T A C G T T C G C C G T C T G C T A C C T G T A C A G C G G T G A G A T C T A C C C G A C T G C C A T C C G G A A T G T C G G A C T T G G A A G C A A T T C G G C T T G T G C G C G G G T C G G A G C G A T G G T G G C G C C A T A T A T C A C C C T G C T G G C C A A G G A C G T G G C G T G G C T G C C C A T G G T A C T G T T C G G C G C G C T G G C A G T G G T T G C T G C T C T G C T G G C A G C C A T G T T G C C A G A G A C G C G A A A T T G C C A T C T G C C A G A G A C G A T C G A A G A C G G A G A G A A T T T C A A C A G
> Seq1_GenBank_Acc_No_ XM_012883685.1
A T G G T A G A G G A C G A G G A C G A A G A C G A A G A T A C G T C T A A C A A C A G C A G C T C A G A T G A C A G C A G C A G C T C C G A T G A C G A T G A C G A T G A C G T C C C A G A C G A T G A C G A G T A T G A T G T T A A G A A A G T T A A G C A C C G A G A G G A G G T G C C G C G C A T T C A G A T A G T T G G A T C A A G G T C G C A A T G G T T G G A A G C A A T C C G C A G A G A C G G C A C G G C A G G T G A G T C A G C T A G G A T G A A G G C A T T C T T A G A G G T A T T T C G C G A A G C C C A A C A C C T T T A T C C T G A C C A G A G A G T T T C T G C T A C C T C C G A G G A G A C G A A G A C C C T T G A T A T C G T C G C C C T T A T T C T A A A G G A T G A A G G G A A A A T C T G T G T G C A A T A T G A T G G C A T A C T T C C G C C C C G C G A T A G G G C A G C A G C G C T A A A G A C A T T C C A G G A T G G G G C T C C A G C T A C C T T T G T C T G A