S Jain1, S Panwar2 and A Kumar3
1 Department of Applied Sciences & Humanities, Jai Parkash Mukand Lal Innovative Engineering and Technology Institute, Haryana, India
2 Department of Genetics and Plant Breeding, Chaudhary Charan Singh University, Uttar Pradesh, India
3 Department of Nutrition Biology, Central University of Haryana, Haryana, India
The European Bioinformatics Institute (EBI) is part of the European Molecular Biology Laboratory (EMBL) and is located on the Wellcome Trust Genome Campus, Hinxton, Cambridge (UK). It provides molecular data, bioinformatics databases, software, and analysis tools free of charge, covering the life sciences broadly and supporting both basic and advanced research. The information on the databases and tools described in this chapter is drawn from the EMBL‐guide and related sites; in several instances, therefore, the text may be reproduced verbatim.
Information on each database has been collected from EMBL. The databases available via dbfetch are listed in Table 1, together with a short description of, and a link to, each database.
TABLE 1 Features and links of various EMBL databases.
S.N. | Databases | Features | Links |
1. | EDAM | EMBRACE Data and Methods (EDAM) Ontology. | http://edamontology.sourceforge.net/ |
2. | ENA Coding | European Nucleotide Archive (ENA) Coding is a database of nucleotide sequences of the CDS (coding sequence) features annotated in the ENA Sequence database. ENA Coding records contain the nucleotide sequence of the CDS, annotation from the parent nucleotide entry, and additional automatically generated annotation. | http://www.ebi.ac.uk/ena/ |
3. | ENA Geospatial | A database of nucleotide sequences of the ENA Geospatial Sequence. | http://www.ebi.ac.uk/ena/ |
4. | ENA Non‐coding | A database of nucleotide sequences of the non‐coding RNA features annotated in the ENA Sequence database. ENA Non‐coding records contain the nucleotide sequence of the RNA feature, annotation from the parent nucleotide entry, and additional automatically generated annotation. | http://www.ebi.ac.uk/ena/ |
5. | ENA Sequence | ENA Sequence (formerly known as EMBL‐Bank) is Europe’s primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects, and patent applications. | http://www.ebi.ac.uk/ena/ |
6. | ENA Sequence Constructed | The ENA Sequence Constructed database division represents complete genomes and other long sequences constructed from segment entries. Instead of containing the sequence, these entries detail how to assemble the sequence from other ENA Sequence entries. | http://www.ebi.ac.uk/ena/ |
7. | ENA Sequence Constructed Expanded | Expanded entries include the complete nucleotide sequence of the constructed entry. | http://www.ebi.ac.uk/ena/ |
8. | ENA/SVA | The ENA Sequence Version Archive (SVA) is a repository of all entries that have ever appeared in the EMBL Nucleotide Sequence Databank (EMBL‐Bank) or ENA Sequence databases. | http://www.ebi.ac.uk/cgi-bin/sva/sva.pl |
9. | Ensembl Gene | Ensembl genome databases for vertebrate species and model organisms. For other species, see Ensembl Genomes instead. | http://www.ensembl.org/ |
10. | Ensembl Genomes Gene | Genome databases for metazoa, plants, fungi, protists and bacteria. | http://www.ensemblgenomes.org/ |
11. | Ensembl Genomes Transcript | Genome databases for metazoa, plants, fungi, protists and bacteria. | http://www.ensemblgenomes.org/ |
12. | Ensembl Transcript | Ensembl genome databases for vertebrate species and model organisms. For other species, see Ensembl Genomes instead. | http://www.ensembl.org/ |
13. | European Patent Office (EPO) Proteins | Protein sequences appearing in patents from the European Patent Office (EPO). | http://www.ebi.ac.uk/patentdata/proteins/ |
14. | HGNC | HUGO Gene Nomenclature Committee (HGNC) approved gene name and symbol (short‐form abbreviation) for each human gene. | http://genenames.org/ |
15. | IMGT/HLA | The International ImMunoGeneTics (IMGT) database provides a specialist database for the sequences of the human major histocompatibility complex (HLA), including the official sequences for the WHO Nomenclature Committee For Factors of the HLA System. | http://www.ebi.ac.uk/imgt/hla/ |
16. | IMGT/LIGM‐DB | A comprehensive database of immunoglobulins and T cell receptors (LIGM) from human and other vertebrates. | http://imgt.cines.fr/cgi-bin/IMGTlect.jv |
17. | InterPro | The InterPro database (Integrated Resource of Protein Domains and Functional Sites) is an integrated documentation resource for protein families, domains, and functional sites. It was originally used to rationalize the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects, but now it also includes the SMART, TIGRFAMs, PIR SuperFamilies and most recently SUPERFAMILY databases. | http://www.ebi.ac.uk/interpro/ |
18. | IPD‐KIR | A centralized repository for human Killer‐cell Immunoglobulin‐like Receptor (KIR) sequences. | http://www.ebi.ac.uk/ipd/kir/ |
19. | IPD‐MHC | Sequences of the major histocompatibility complex (MHC) in a number of species. | http://www.ebi.ac.uk/ipd/mhc/ |
20. | IPRMC | InterPro Matches Complete (IPRMC) for UniProtKB proteins. | http://www.ebi.ac.uk/interpro/ |
21. | IPRMC UniParc | InterPro Matches Complete (IPRMC) for UniParc proteins. | http://www.ebi.ac.uk/interpro/ |
22. | JPO Proteins | Protein sequences appearing in patents from the Japanese Patent Office (JPO). | http://www.ebi.ac.uk/patentdata/proteins/ |
23. | KIPO Proteins | Protein sequences appearing in patents from the Korean Intellectual Property Office (KIPO). | http://www.ebi.ac.uk/patentdata/proteins/ |
24. | MEDLINE | Comprises citation and abstract records from more than 5000 medically related journals published in the United States and 70 other countries. The files contain over 19 million citations, dating back to the mid‐1940s, and are updated weekly. | http://www.nlm.nih.gov/pubs/factsheets/medline.html |
25. | Patent DNA NRL1 | Non‐redundant patent nucleotides level 1 (NRL‐1). Nucleotide sequences from patents clustered by 100% sequence identity over the whole length. | http://www.ebi.ac.uk/patentdata/nr/ |
26. | Patent DNA NRL2 | Non‐redundant patent nucleotides level 2 (NRL‐2). Nucleotide sequences from patents clustered by patent family, and then by 100% sequence identity over the whole length. | http://www.ebi.ac.uk/patentdata/nr/ |
27. | Patent Protein NRL1 | Non‐redundant patent proteins level 1. Protein sequences from patents clustered by 100% sequence identity over the whole length. | http://www.ebi.ac.uk/patentdata/nr/ |
28. | Patent Protein NRL2 | Non‐redundant patent proteins level 2. Protein sequences from patents clustered by patent family and then by 100% sequence identity over the whole length. | http://www.ebi.ac.uk/patentdata/nr/ |
29. | Patent Equivalents | Patent number equivalents (families) and patent classifications for patents containing sequence data. The patent equivalents are obtained from the patent numbers cited in the major sequence databases (e.g., EMBL‐Bank and Patent Proteins), which are then expanded into a set of patent equivalents forming a WIPO Simple Patent Family. | http://www.ebi.ac.uk/patentdata/ |
30. | PDB | Protein Data Bank (PDB): structure and sequence information for proteins and nucleic acids. | http://www.ebi.ac.uk/pdbe/ |
31. | Reference Sequence project (RefSeq) | Curated, non‐redundant reference sequences of naturally occurring molecules. | http://www.ncbi.nlm.nih.gov/refseq/ |
32. | RefSeq (protein) | Protein records from the RefSeq collection of curated, non‐redundant reference sequences. | http://www.ncbi.nlm.nih.gov/refseq/ |
33. | SGT | Structural Genomics Targets (SGT) is a protein target registration database, providing information on the experimental progress and status of target amino acid sequences selected for structural determination. | http://targetdb.pdb.org/ |
34. | Taxonomy | Taxonomic classification of organisms for which there are sequences in the INSDC databases (i.e., DDBJ, EMBL‐Bank, and GenBank) and many other biological databases. | http://www.ncbi.nlm.nih.gov/Taxonomy/ |
35. | Trace Archive | An archive of capillary electrophoresis trace data. | http://www.ebi.ac.uk/ena/ |
36. | UniParc | UniProt Archive (UniParc): a non‐redundant archive of protein sequences. | http://www.uniprot.org/ |
37. | UniProtKB | UniProt Knowledgebase (UniProtKB): protein sequences with functional annotation. | http://www.uniprot.org/ |
38. | The UniProt Reference Clusters UniRef100/UniRef90/UniRef50 | Clusters that combine closely related sequences into single records. UniRef100 merges identical sequences; in UniRef90 and UniRef50, no pair of representative sequences shares more than 90% or 50% mutual sequence identity, respectively. | http://www.uniprot.org/ |
39. | UniProtKB Sequence/Annotation Version Archive (UniSave) | Repository of UniProtKB/Swiss‐Prot and UniProtKB/TrEMBL entry versions. | http://www.ebi.ac.uk/uniprot/unisave/ |
40. | United States Patent and Trademark Office (USPTO) Proteins | Protein sequences appearing in patents from the USPTO. | http://www.ebi.ac.uk/patentdata/proteins/ |
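Entries from the databases in Table 1 can be retrieved by identifier through dbfetch. The following is a minimal sketch of how such a request URL might be built in Python. It assumes the dbfetch REST endpoint `https://www.ebi.ac.uk/Tools/dbfetch/dbfetch` and the query parameters `db`, `id`, `format`, and `style`, none of which are given in this chapter; the helper name `dbfetch_url` and the example accessions are purely illustrative.

```python
from urllib.parse import urlencode

# Assumed base URL of the dbfetch REST service (not stated in the chapter).
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, ids, fmt="fasta", style="raw"):
    """Build a dbfetch request URL for one or more identifiers.

    db    -- assumed dbfetch database name, e.g. "uniprotkb"
    ids   -- a single identifier or a list of identifiers
    fmt   -- assumed output format name, e.g. "fasta"
    style -- assumed result style, e.g. "raw" for plain text
    """
    if isinstance(ids, str):
        ids = [ids]
    params = {"db": db, "id": ",".join(ids), "format": fmt, "style": style}
    # safe="," keeps the comma between identifiers unescaped in the URL.
    return f"{DBFETCH_BASE}?{urlencode(params, safe=',')}"


# Illustrative accessions only.
single = dbfetch_url("uniprotkb", "P12345")
several = dbfetch_url("uniprotkb", ["P12345", "P67890"])
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`, to download the entry; the network call is left out here so that the sketch runs offline.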
EMBL‐EBI Web Services provide programmatic access to, and analysis of, numerous data resources (Li et al., 2015; Lopez et al., 2014). They are designed for integration and interoperability and are built on Representational State Transfer (REST), Simple Object Access Protocol (SOAP), and Web Services Description Language (WSDL) technologies.
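A REST analysis service of this kind is typically used in three steps: submit a job, poll its status, and fetch the result. The sketch below illustrates that pattern in Python. The base URL, the `run`/`status`/`result` path layout, the status strings, the tool name `ncbiblast`, and the result type `out` are all assumptions drawn from EBI's public documentation rather than from this chapter, and the function names are hypothetical.

```python
import time

# Assumed base URL and path layout of the EBI analysis-tool REST services;
# neither is given in the chapter.
BASE = "https://www.ebi.ac.uk/Tools/services/rest"

# Assumed status strings meaning "the job has not finished yet".
ACTIVE_STATES = {"QUEUED", "PENDING", "RUNNING"}


def job_urls(tool, job_id, result_type="out"):
    """Return the run, status, and result URLs for one job of one tool."""
    root = f"{BASE}/{tool}"
    return {
        "run": f"{root}/run",
        "status": f"{root}/status/{job_id}",
        "result": f"{root}/result/{job_id}/{result_type}",
    }


def wait_for_job(get_status, job_id, interval=0.0, max_polls=100):
    """Call get_status(job_id) until it reports a state outside ACTIVE_STATES."""
    for _ in range(max_polls):
        status = get_status(job_id)
        if status not in ACTIVE_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still active after {max_polls} polls")


# Offline demonstration with a stand-in status function (no network access).
_replies = iter(["RUNNING", "RUNNING", "FINISHED"])
final_status = wait_for_job(lambda job_id: next(_replies), "example-job-id")
```

Passing the status lookup in as a function keeps the polling logic independent of the HTTP client, so the same loop can be exercised offline, as here, or wired to a real request against the `status` URL.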
The details and description of EMBL services are given in Table 2.
TABLE 2 Description of various EMBL tools.
General Services
Data retrieval and access to various sequence and structural databases. |
||
S.N. | Service | Description |
1. | ArrayExpress | Microarray data searching with ArrayExpress. |
2. | ChEBI Web Services | Entry retrieval from the ChEBI database. |
3. | ChEMBL Web Services | Data retrieval from the ChEMBL database. |
4. | EB‐eye (SOAP)/(REST) | EBI search engine (EB‐eye). |
5. | ENA Browser | Access point for sequence retrieval. |
6. | Gene Expression Atlas API | Access to statistical data computed over a curated subset of the ArrayExpress Archive. |
7. | MartService | Searching and retrieving the data through BioMart. |
8. | PDBe (REST) | Retrieval of data from the PDB and EMDB. |
9. | PSICQUIC | Retrieval of molecular interaction data from multiple providers, including ChEMBL, Reactome, and IntAct. |
10. | Rhea | Access point for manually annotated chemical reactions information. |
11. | Universal Protein Resource UniProt.org | Protein sequence information, including annotation. |
12. | WSDbfetch (REST)/(SOAP) | Entry retrieval by identifier. |
Protein Functional Analysis (PFA)
Identifying protein‐related information, i.e., sequences, motifs, conserved regions, etc. |
||
S.N. | REST/SOAP Service | Description |
13. | FingerPRINTScan | Identifies the closest matching PRINTS fingerprint motifs in a protein sequence. |
14. | InterProScan 5 | Combines different protein signature recognition methods into a single resource. |
15. | HMMER hmmscan | Searches a protein sequence against a database of profile Hidden Markov Models (HMMs). |
16. | PfamScan | Searches a query FASTA sequence against a library of Pfam HMMs. |
17. | Phobius | Predicts transmembrane topology and signal peptides from the amino acid sequence of a protein. |
18. | Pratt | Identifying conserved patterns in unaligned protein sequences. |
19. | PROSITE Scan | Comparing a protein sequence against the signatures in PROSITE (both patterns and profiles). |
20. | RADAR | Identification and alignment of repeats in protein sequences. |
Sequence Similarity Search (SSS)
Identification of homologous sequences. |
||
S.N. | REST/SOAP Service | Description |
21. | FASTA | Fast protein or nucleotide sequence comparison using FASTA. |
22. | FASTM | Comparison of peptide fragments against a sequence database using FASTM. |
23. | NCBI BLAST | Nucleotide and protein sequence comparison using NCBI BLAST. |
24. | PSI‐BLAST | Position Specific Iterative BLAST (PSI‐BLAST), guided mode. |
25. | PSI‐Search | Iterative Smith–Waterman search using a PSI‐BLAST strategy. |
Multiple Sequence Alignment (MSA)
Alignment of a set of three or more protein or nucleotide sequences. |
||
S.N. | REST/SOAP Service | Description |
26. | Clustal Omega | Multiple sequence alignment using Clustal Omega. |
27. | ClustalW2 | Global multiple sequence alignment of DNA and protein sequences using ClustalW2. |
28. | DbClustal | Global multiple sequence alignment of DNA or protein sequences using anchor regions from BLAST results. |
29. | Kalign | Multiple sequence alignment using Kalign; capable of handling large sequences. |
30. | MAFFT | Sequence alignment using the MAFFT method. Fast, and capable of handling large sequences. |
31. | Multiple Sequence Comparison by Log‐Expectation (MUSCLE) | Multiple sequence alignment using MUSCLE. |
32. | MView | Reformat a multiple sequence alignment or create a multiple sequence alignment from a sequence similarity search result (e.g., BLAST or FASTA). |
33. | PRANK | Sequence alignment using the PRANK method. |
34. | T‐Coffee | Sequence alignment using the T‐Coffee method. |
Phylogeny
Phylogenetic analysis |
||
S.N. | REST/SOAP Service | Description |
35. | ClustalW2 Phylogeny | Generates neighbor‐joining or UPGMA phylogenetic trees. |
Pairwise Sequence Alignment (PSA)
Alignment of two sequences |
||
S.N. | REST/SOAP Service | Description |
36. | EMBOSS matcher | Waterman–Eggert local alignment using EMBOSS matcher. |
37. | EMBOSS needle | Needleman–Wunsch global alignment using EMBOSS needle. |
38. | EMBOSS stretcher | Myers and Miller global alignment using EMBOSS stretcher. |
39. | EMBOSS water | Smith–Waterman local alignment using EMBOSS water. |
40. | GeneWise | Compares a protein sequence with a genomic DNA sequence. |
41. | lalign | Huang and Miller sim local alignment using lalign. |
42. | PromoterWise | Comparison of two DNA sequences, allowing for inversions and translocations. |
43. | Wise2DBA | The Wise2 DNA Block Aligner (DBA) aligns two DNA sequences. |
RNA
RNA Analysis |
||
S.N. | REST/SOAP Service | Description |
44. | Infernal cmscan | Searches sequences against the Rfam database of covariance models (CMs). |
45. | MapMi | Mapping and analysis of miRNA sequences. |
Sequence Format Conversion
Convert between sequence formats or validate the formatting of a sequence. |
||
S.N. | REST/SOAP Service | Description |
46. | EMBOSS seqret | Reads and reformats sequence entries. |
47. | MView | Reformatting of multiple sequence alignment data. |
48. | Readseq | Convert biosequences between a selection of common biological sequence formats. |
Sequence Statistics
Analyze a sequence to determine its properties and use statistics to assign significance. |
||
S.N. | REST/SOAP Service | Description |
49. | EMBOSS cpgplot | European Molecular Biology Open Software Suite (EMBOSS) cpgplot identifies and plots CpG islands in a nucleotide sequence. |
50. | EMBOSS isochore | Plots isochores in DNA sequences. |
51. | EMBOSS pepinfo | Plots amino acid properties. |
52. | EMBOSS pepstats | Calculates statistics of protein properties. |
53. | EMBOSS pepwindow | Generates a hydropathy plot for a protein sequence. |
54. | SAPS | Statistical Analysis of Protein Sequences. |
Sequence Translation
Translate a coding nucleotide sequence into a protein sequence and vice versa. |
||
S.N. | REST/SOAP Service | Description |
55. | EMBOSS transeq | Translates nucleotide sequences into protein sequences. |
56. | EMBOSS sixpack | Displays DNA sequences with six‐frame translation and ORFs. |
57. | EMBOSS backtranseq | Back‐translates protein sequences to nucleotide sequences. |
58. | EMBOSS backtranambig | Back‐translates protein sequences to ambiguous nucleotide sequences. |
Structural Analysis
Analysis of macromolecular structures. |
||
S.N. | REST/SOAP Service | Description |
59. | DaliLite | Pairwise structure comparison. |
60. | MaxSprout | Fast database algorithm for generating protein backbone and side‐chain coordinates. |
Literature and Ontologies
Look up ontology terms and navigate ontology relationships. |
||
S.N. | Service | Description |
61. | BioModels | Access to mathematical models of biological interest. |
62. | PICR | Protein Identifier Cross‐Reference Service. |
63. | QuickGO | Gene Ontology (GO) and Gene Ontology Annotation (GOA) databases. |
64. | Europe PMC Web Service | Search access to Europe PubMed Central (Europe PMC). |
65. | WSMIRIAM | Web Services for the Minimum Information Requested In the Annotation of biochemical Models (MIRIAM). |
66. | WSOntology Lookup | Search multiple ontologies from a single location. |
67. | WSSBO | Web Services for the Systems Biology Ontology (SBO). |
68. | WSWhatizit | Supports text‐mining tasks. |
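The pairwise alignment services in Table 2 distinguish global alignment (Needleman–Wunsch, as in EMBOSS needle) from local alignment (Smith–Waterman, as in EMBOSS water). The sketch below illustrates the difference by computing only the optimal alignment score with dynamic programming. It is a simplification and not the EMBOSS implementation: it assumes fixed match and mismatch scores and a linear gap penalty, whereas the EMBOSS programs use substitution matrices and separate gap opening and extension penalties. The function name and default scores are illustrative.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score with a linear gap penalty.

    local=False gives the Needleman-Wunsch (global) score;
    local=True gives the Smith-Waterman (local) score.
    """
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0  # best cell seen so far (used only for local alignment)
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score of aligning both full sequences; local: best cell anywhere.
    return best if local else prev[-1]


global_score = align_score("GATTACA", "GATTACA")              # identical sequences
local_score = align_score("TTTACGTTT", "GGACGGG", local=True)  # shared "ACG" only
```

Only two rows of the scoring matrix are kept in memory because the score, unlike the alignment itself, does not require a traceback.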