CS Mukhopadhyay and RK Choudhary
School of Animal Biotechnology, GADVASU, Ludhiana
A computer file format is a defined way of encoding data for storage in a file. Biological sequence formats are a family of such file formats, each designed to make sequence files readable by specific programs.
Note: Biological sequences are generally written in a fixed-width font such as Courier New. Because every character occupies the same width, the sequences line up uniformly from one line of text to the next.
Sequence formats are manipulated and interconverted at the lowest level as ASCII (American Standard Code for Information Interchange) text, a character encoding in which each character is stored as a numeric code: A–Z are encoded as 65–90 and a–z as 97–122. A sequence format is thus a prescribed arrangement of characters, symbols, and keywords that specify the sequence, its ID or name, comments, and so on.
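The character-to-code mapping described above can be checked directly. The following is a minimal Python sketch; the function name `ascii_codes` is chosen here for illustration only.

```python
def ascii_codes(sequence):
    """Return the ASCII code of every character in a sequence string."""
    return [ord(ch) for ch in sequence]


print(ascii_codes("ACGT"))  # [65, 67, 71, 84]
print(ascii_codes("acgt"))  # [97, 99, 103, 116]

# The boundaries quoted in the text: A-Z map to 65-90, a-z to 97-122.
print(ord("A"), ord("Z"), ord("a"), ord("z"))  # 65 90 97 122
```

Upper-case and lower-case bases therefore have different codes, which is why some programs treat `ACGT` and `acgt` differently unless the case is normalised first.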
The sequence formats are needed for two purposes:
Commonly used sequence formats.
1. IG/Stanford | 7. Fitch | 13. Plain/Raw |
2. GenBank/GB | 8. Pearson/Fasta | 14. PIR/CODATA |
3. NBRF | 9. Zuker (in‐only) | 15. MSF |
4. EMBL | 10. Olsen (in‐only) | 16. ASN.1 |
5. GCG | 11. Phylip3.2 | 17. PAUP |
6. DNAStrider | 12. Phylip | 18. Pretty (out‐only) |
To convert a given molecular sequence from one format into other sequence formats such as GenBank (NCBI), EMBL, and PIR.
The online program ReadSeq (by Don Gilbert) will be used to convert the sequence formats. ReadSeq accepts the following formats: FASTA, Abstract Syntax Notation (ASN.1), National Biomedical Research Foundation (NBRF), EMBL, Fitch (phylogenetic analysis), GenBank, GCG, DNA Strider, IntelliGenetics (IG/Stanford), Multiple Sequence Format (MSF), Protein Information Resource (PIR), and eight additional specialised formats.
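To make the idea of format conversion concrete, here is a minimal Python sketch that reads FASTA text and writes a stripped-down GenBank-style block. It is not ReadSeq's code: the function names `read_fasta` and `to_genbank_like` are hypothetical, and the output carries only a LOCUS line, an ORIGIN block with 60 bases per line (each line prefixed by the position of its first base), and the `//` terminator. Full GenBank records contain many more fields and group the bases in blocks of ten.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_genbank_like(name, sequence, width=60):
    """Write a simplified LOCUS/ORIGIN block (not a complete GenBank record)."""
    lines = [f"LOCUS       {name}  {len(sequence)} bp", "ORIGIN"]
    for start in range(0, len(sequence), width):
        # Each line starts with the 1-based position of its first base.
        lines.append(f"{start + 1:>9} {sequence[start:start + width]}")
    lines.append("//")
    return "\n".join(lines)


fasta_text = ">demo_seq 12 bp\nccatgaacgcct\n"
for header, seq in read_fasta(fasta_text):
    print(to_genbank_like(header.split()[0], seq))
```

The sequence itself is unchanged by the conversion; only the surrounding keywords, line lengths, and position numbers differ between the two formats.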
The International Union of Pure and Applied Chemistry (IUPAC) nucleic acid code has been adopted to specify a single nucleotide, or a group of alternative nucleotides, by a single letter:
A = adenine | U = uracil | M = A or C (amino) | D = G or A or T |
C = cytosine | R = G or A (purine) | S = G or C | H = A or C or T |
G = guanine | Y = T or C (pyrimidine) | W = A or T | V = G or C or A |
T = thymine | K = G or T (keto) | B = G or T or C | N = A or G or C or T (any) |
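The ambiguity codes in the table above can be used to search a sequence for a degenerate motif. The following Python sketch stores the table as a dictionary and translates an IUPAC pattern into a regular expression; the names `IUPAC_DNA` and `iupac_to_regex` are illustrative choices, not part of any standard library.

```python
import re

# Each IUPAC code mapped to the bases it can stand for (from the table above).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC motif into a regular expression over plain base letters."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]  # raises KeyError for characters outside the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


motif = iupac_to_regex("GAYW")
print(motif)                                # GA[CT][AT]
print(bool(re.search(motif, "ccGATAcc")))  # True
```

The search is case-sensitive as written; a lower-case sequence would need to be upper-cased first, or `re.IGNORECASE` passed to the search.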
IUPAC amino acid codes:
A = Alanine | G = Glycine | M = Methionine | S = Serine |
C = Cysteine | H = Histidine | N = Asparagine | T = Threonine |
D = Aspartic Acid | I = Isoleucine | P = Proline | V = Valine |
E = Glutamic Acid | K = Lysine | Q = Glutamine | W = Tryptophan |
F = Phenylalanine | L = Leucine | R = Arginine | Y = Tyrosine |
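A common first use of the one-letter codes is to check that a protein sequence contains only the 20 standard residues. A minimal Python sketch follows; `AMINO_ACIDS`, `invalid_residues`, and `spell_out` are names introduced here for illustration.

```python
# One-letter IUPAC codes for the 20 standard amino acids (from the table above).
AMINO_ACIDS = {
    "A": "Alanine", "C": "Cysteine", "D": "Aspartic Acid", "E": "Glutamic Acid",
    "F": "Phenylalanine", "G": "Glycine", "H": "Histidine", "I": "Isoleucine",
    "K": "Lysine", "L": "Leucine", "M": "Methionine", "N": "Asparagine",
    "P": "Proline", "Q": "Glutamine", "R": "Arginine", "S": "Serine",
    "T": "Threonine", "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
}


def invalid_residues(protein):
    """Return, in sorted order, the characters that are not standard one-letter codes."""
    return sorted(set(protein.upper()) - set(AMINO_ACIDS))


def spell_out(protein):
    """Expand a one-letter protein string into a list of amino acid names."""
    return [AMINO_ACIDS[residue] for residue in protein.upper()]


print(invalid_residues("MKTAYIAKQR"))  # []
print(invalid_residues("MKXB*"))       # ['*', 'B', 'X']
print(spell_out("MK"))                 # ['Methionine', 'Lysine']
```

Characters such as `X`, `B`, or the `*` terminator are reported as invalid by this sketch because they are not among the 20 codes listed in the table, even though some formats use them.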
A.
>DL;readseq‐43434_tmp_1
readseq‐43434_tmp_1 100 bases
cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagagg
cgccatcatccggggcatccccggcttctgggccaatgccattgcgaacc*
B.
LOCUS readseq‐13129_tmp_1 100 bp
ORIGIN
1 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc
61 cggggcatccccggcttctgggccaatgccattgcgaacc
//
C.
>readseq‐14738_tmp_1 100 bp
cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc
cggggcatccccggcttctgggccaatgccattgcgaacc
D.
ID readseq‐10695_tmp_1 standard; DNA; UNC; 100 BP.
SQ Sequence 100 BP;
cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc 60
cggggcatccccggcttctgggccaatgccattgcgaacc 100
E.
readseq‐946_tmp_1 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc
readseq‐946_tmp_1 cggggcatccccggcttctgggccaatgccattgcgaacc
F.
1 100
readseq‐26 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagagg
cgccatcatccggggcatccccggcttctgggccaatgccattgcgaacc
a. Clustal | b. EMBL | c. Phylip |
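Most of these formats can be told apart from the first non-blank line alone. The Python sketch below encodes that idea as a rough heuristic; `guess_format` is a hypothetical helper, not how ReadSeq detects formats, and it assumes that Clustal files begin with a `CLUSTAL` header line and that Phylip files begin with two integers (number of sequences and sequence length).

```python
def guess_format(text):
    """Guess a sequence format from the first non-blank line (heuristic only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0].strip()
    tokens = first.split()
    # NBRF headers look like ">DL;name" or ">P1;name": a two-letter type, then ";".
    if first.startswith(">") and len(first) > 3 and first[3] == ";":
        return "NBRF/PIR"
    if first.startswith(">"):
        return "FASTA"
    if tokens[0] == "LOCUS":
        return "GenBank"
    if tokens[0] == "ID":
        return "EMBL"
    # CODATA records start with an ENTRY line, optionally preceded by backslashes.
    if tokens[0] == "ENTRY" or set(first) == {"\\"}:
        return "PIR/CODATA"
    if len(tokens) == 2 and all(tok.isdigit() for tok in tokens):
        return "Phylip"
    if first.upper().startswith("CLUSTAL"):
        return "Clustal"
    return "unknown"


print(guess_format(">DL;seq1\nseq1 100 bases\nacgt*"))  # NBRF/PIR
print(guess_format("LOCUS seq1 4 bp\nORIGIN\n1 acgt\n//"))  # GenBank
print(guess_format(" 1 100\nseq1 acgt"))  # Phylip
```

A heuristic of this kind can be fooled, for example by a FASTA header whose fourth character happens to be a semicolon, so it is best treated as a first guess to be confirmed by eye.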
(A)
>readseq‐26104_tmp_1 204 bp
ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc
ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag
atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca
aatacccgggctataaatatcgac
(B)
LOCUS readseq‐11577_tmp_1 204 bp
ORIGIN
1 ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc
61 ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag
121 atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca
181 aatacccgggctataaatatcgac
(C)
ID readseq‐2117_tmp_1 standard; DNA; UNC; 204 BP.
SQ Sequence 204 BP;
ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc 60
ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag 120
atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca 180
aatacccgggctataaatatcgac 204
//
(D)
\\\
ENTRY readseq‐18456_tmp_1
TITLE readseq‐18456_tmp_1 204 bases
SEQUENCE
                5        10        15        20        25        30
      1 c c a t g a a c g c c t t c a t t g t g t g g t c t c g t g
     31 a a c g a a g a c g a a a g g t g g c t c t a g a g a a t c
     61 c c a a a a t g a a a a a c t c a g a c a t c a g c a a g c
     91 a g c t g g g a t a t g a g t g g a a a a g g c t t a c a g
    121 a t g c t g a a a a g c g c c c a t t c t t t g a g g a g g
    151 c a c a g a g a c t a c t a g c c a t a c a c c g a g a c a
    181 a a t a c c c g g g c t a t a a a t a t c g a c
///
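The EMBL-style and PIR/CODATA-style listings above differ mainly in how the residues are laid out: EMBL puts 60 bases on a line followed by the running total, while CODATA puts 30 space-separated residues on a line, preceded by the position of the first residue and topped by a ruler. The Python sketch below reproduces just these sequence blocks as they appear in this section; the function names are hypothetical, header lines other than `SQ` and `SEQUENCE` are omitted, and real EMBL files additionally group the bases in blocks of ten.

```python
def embl_sequence_block(sequence, width=60):
    """Lay out bases 60 per line with the running base count at the end of each line."""
    lines = [f"SQ   Sequence {len(sequence)} BP;"]
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]
        lines.append(f"     {chunk} {start + len(chunk)}")
    lines.append("//")
    return "\n".join(lines)


def codata_sequence_block(sequence, width=30):
    """Lay out residues 30 per line, space separated, under a ruler marked every 5."""
    # Residue i sits at text column 6 + 2*i, so each ruler number is right-aligned
    # over every fifth residue.
    ruler = " " * 7 + "".join(f"{n:>10}" for n in range(5, width + 1, 5))
    lines = ["SEQUENCE", ruler]
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]
        lines.append(f"{start + 1:>7} " + " ".join(chunk))
    lines.append("///")
    return "\n".join(lines)


demo = "ccatgaacgccttcattgtgtggtctcgtgaacgaagacg"  # first 40 bases of the example
print(embl_sequence_block(demo))
print(codata_sequence_block(demo))
```

Both functions leave the sequence characters untouched; only the grouping, numbering, and terminator lines change.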
Features: