Glossary

Term Meaning
ab initio A Latin term that means starting “from the beginning” or initiation.
Accession number The unique identifier assigned by a database (NCBI, DDBJ, EMBL) to an accepted submission (e.g., molecular sequence, genome project data, WGS, STs, etc.) to distinguish it from other submissions of a similar type. The accession number is alphanumeric, and its format differs between molecular sequence types (nucleotide and protein) as well as between databases (NCBI, Swiss‐Prot‐UniProt, etc.).
Adapter A short oligonucleotide ligated to the DNA that is to be sequenced or amplified, creating a priming site.
Algorithm A finite set of rules followed, generally by a computer, to complete an assignment or operation. The term is derived from the name of the Persian mathematician Mohammed ibn Musa al‐Khwarizmi (9th century AD).
Allosteric Protein A protein having multiple ligand binding sites, whose conformation changes upon ligand binding. An enzyme can be an allosteric protein.
Amplicon Gene‐specific nucleotide sequences amplified by PCR.
Annealing Temperature The temperature at which primers bind to the single‐stranded template during PCR amplification. It is usually set a few degrees below the melting temperature (Tm), which is the temperature at which 50% of the DNA helices are dissociated.
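The melting temperature of a short primer, from which an annealing temperature is usually chosen, can be roughly estimated with the Wallace rule, Tm = 2(A+T) + 4(G+C). The sketch below is only an approximation for short oligonucleotides. The function names and the 5-degree offset are illustrative assumptions, not part of this glossary.

```python
def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C using the Wallace rule."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def annealing_estimate(primer: str, offset: int = 5) -> int:
    """Assumed rule of thumb: anneal a few degrees below the estimated Tm."""
    return wallace_tm(primer) - offset


if __name__ == "__main__":
    primer = "ATGCATGCATGCATGCAT"
    print(wallace_tm(primer), annealing_estimate(primer))
```

In practice the annealing temperature is tuned empirically, so a value computed this way is best treated as a starting point.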
Annotation Comments on or explanation of a text or data.
Barcode Short sequences of typically six or more nucleotides that are used to identify/label individual samples when they are pooled in one sample.
Binary Tree A tree‐like data structure in which each node has at most two (binary) branches. The point from which the branches separate is called a node.
Binding site A region of protein or DNA where the ligands bind.
Bioinformatics A branch of science that interprets biological data with the help of statistics, computer science, mathematics and engineering.
Biostatistics The application of statistics in biological science.
Bit score (S’) A normalized measure of the similarity between two aligned sequences, denoted by S’. The higher the score, the better the alignment. It is calculated from the raw alignment score (S), which reflects identical residues, conserved residues and gaps, scaled by the statistical parameters of the scoring system so that scores from different searches can be compared.
BLASTn Standard Nucleotide BLAST. Here, a nucleotide query sequence is compared with other nucleotide sequences. BLAST (Basic Local Alignment Search Tool) is software, available online, that compares query sequences against a sequence database.
BLASTp The term BLASTp stands for protein BLAST. Here, two amino acid sequences are aligned and compared.
BLASTx BLASTx compares the six conceptual translations of a nucleotide query (three reading frames on each of the two strands) with a database of protein sequences.
Bridge amplification Amplification of fragments attached on a chip by the adapter at both of its ends.
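Burrows–Wheeler transform A reversible rearrangement of the characters of a text that makes the text easier to compress and index. Aligners built on it (e.g., BWA and Bowtie) can map large volumes of short‐read data to a reference efficiently.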
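A minimal sketch of the Burrows–Wheeler transform and its inverse follows. It uses the naive sorted-rotations construction, which is fine for illustration but far too slow for genome-scale data; real aligners rely on suffix arrays and compressed indexes instead. The function names and the "$" sentinel are assumptions made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed, inverse_bwt(transformed))
```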
Clustering In gene expression analysis, a microarray cluster is the grouping together of genes of similar functions. In phylogenetic tree analysis, the data points having smaller or larger distances are connected and form different clusters. The distance matrix is calculated based on some algorithm, and there are more than 100 algorithms published. Hierarchical clustering is one of the common examples of connectivity‐based clustering methods.
CpG Islands CpG stands for Cytosine‐phosphate diester‐Guanine, i.e., a cytosine followed by a guanine on the same strand. CpG islands are areas of the DNA (100–1000 bp long), found at various places, with an increased density of C and G. CpG islands are usually non‐methylated and present near the 5’‐end of genes at transcription initiation sites. In humans, there are around 45 000 CpG islands in the DNA. CpG sites are important, as they are involved in regulation of gene transcription.
Ct value The cycle number in real‐time PCR at which the fluorescent signal rises above the threshold limit and can be detected by the machine. It is also called the Cp value.
de novo Assembly Assembly of sequence reads from genetic material when a reference sequence is not available.
Deep sequencing Sequencing of the same genetic material many times over, with the depth measured in terms of coverage.
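Coverage is commonly estimated as the number of reads multiplied by the read length and divided by the genome size. The short sketch below shows that arithmetic; the function name and the example numbers are hypothetical.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases per genome base."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


if __name__ == "__main__":
    # One million 150-base reads over a 5 Mb genome.
    print(mean_coverage(1_000_000, 150, 5_000_000))
```

This is an average only; actual depth varies along the genome.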
Delta BLAST Domain Enhanced Lookup Time Accelerated (Delta) BLAST is a new algorithm to yield better homology of remote protein sequences. It searches a database of the pre‐constructed position‐specific scoring matrix (PSSM) before searching a protein sequence database. The web link for the paper describing Delta‐BLAST for the first time is: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3438057/.
Delta Ct value The difference between the threshold cycles (Ct) of two genes (say, target and control genes, or target and reference genes).
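As a small worked example, the delta Ct can be computed directly from two Ct values. The second function goes one step beyond this glossary and applies the widely used 2^(-delta delta Ct) relative-quantification method, which assumes roughly 100% amplification efficiency for both genes. All names and numbers here are illustrative.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """Ct of the target gene minus Ct of the reference gene."""
    return ct_target - ct_reference


def relative_expression(dct_sample: float, dct_control: float) -> float:
    """Fold change by the 2^(-ddCt) method (assumes ~100% PCR efficiency)."""
    return 2.0 ** (-(dct_sample - dct_control))


if __name__ == "__main__":
    treated = delta_ct(22.0, 18.0)    # 4.0
    untreated = delta_ct(24.0, 18.0)  # 6.0
    print(relative_expression(treated, untreated))
```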
Docking Docking refers to a method that predicts the orientation of one molecule when it binds to another molecule to form a complex. In bioinformatics, docking is a computational simulation of a ligand binding to its receptor.
Domain A conserved part of a protein that folds into its tertiary structure, and can function and evolve, independently of the rest of the protein.
Dynamic Programming A method of solving a complex problem by breaking it down into many sub‐problems. DP in bioinformatics has been used in the sequence alignment, DNA‐protein binding prediction, and protein structure prediction.
Edge Length In the phylogenetic tree, an edge length is a number associated with an edge which represents either time or expected genetic distance from the other branches.
Energy Functions
E‐Value (Expectation value) A way of representing the significance of an alignment. It is the number of alignments with a particular bit score (S’) or better that are expected to occur by chance in the database search. The lower the value, the more significant the alignment.
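The relationship between raw score, bit score and E-value can be sketched with the standard Karlin–Altschul formulas: S’ = (lambda × S − ln K) / ln 2 and E = m × n × 2^(−S’), where m and n are the query and database lengths. The code below is a simplified illustration; the parameter values are made up, and real BLAST searches use effective lengths and parameters specific to the scoring system.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # lam and k below are placeholder values, not real BLAST parameters.
    s_prime = bit_score(raw_score=100, lam=0.3, k=0.1)
    print(s_prime, e_value(s_prime, query_len=300, db_len=10_000_000))
```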
Expressed sequence tag (EST) A short sequence of cloned cDNA that is used to identify gene transcripts in gene discovery.
FASTA FASTA was one of the first widely used programs for searching databases for sequence similarity. The name also refers to a text‐based format for representing nucleotide or amino acid sequences. A sequence in FASTA format begins with a line starting with “>” (greater‐than sign) followed by the sequence description, with the sequence itself on the following lines.
Fastq file The result of primary analysis, representing individual reads together with a quality indicator for each base of the corresponding sequence.
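A minimal sketch of reading FASTA‐formatted text is given below. It assumes well-formed input in which every record starts with a “>” header line; the function name is hypothetical, and established libraries handle many more edge cases.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


if __name__ == "__main__":
    example = [">seq1 demo", "ACGT", "TTGA", ">seq2", "MKV"]
    print(parse_fasta(example))
```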
Functional annotation Use of the analyzed output data of genomic and transcriptomic projects to describe gene/protein functions and interactions.
Gap‐penalty During sequence alignment, gaps are introduced into the sequences to account for insertions or deletions. The introduction or extension of a gap is penalized in the scoring of an alignment of nucleotide or protein sequences, and this is called the gap penalty.
GC‐clamp The presence of Guanine (G) or Cytosine (C) nucleotides at the 3’‐end of the primer. More than three Gs or Cs in the last five bases should be avoided. GC‐clamps help in the specific binding of the primer to the DNA template.
Gene Identity Number (gi) More properly the GenInfo Identifier, and usually written as “gi”, this number is simply a series of digits assigned to each sequence by NCBI. Its use has been phased out since September 2016.
Gene ontology A bioinformatics initiative to annotate, assimilate and disseminate information on genes and gene products across all species through a common platform.
Genetic Code The set of nucleotide triplets (codons) that code for amino acids. Those triplets that do not code for any amino acid (UGA, UAG, and UAA) are called stop codons and, therefore, halt translation.
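The way codons map to amino acids, and the way stop codons halt translation, can be sketched in a few lines. The table below is built from the standard genetic code written in T, C, A, G order; the function name is hypothetical, and the sketch assumes an unambiguous sequence read in frame from its first base.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks stops.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate DNA or RNA in frame 1, stopping at the first stop codon."""
    dna = seq.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break  # UAA, UAG or UGA halts translation
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUAA"))
```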
Genomic Survey Sequences (GSS) Genome survey sequences are the short genomic DNA sequences from coding, non‐coding and repetitive portions of genomic DNA that aid in rapid characterization of the unknown genome.
Genomics Study of the whole DNA of an organism, e.g., genes, their structure, and organization, location in the chromosome, etc.
Gibbs free energy The Gibbs free energy (G) of a system at any time is defined as the enthalpy (H) of the system minus the product of the temperature (T) and the entropy (S) of the system, i.e., G = H – TS.
Global Alignment Alignment of two nucleotide or protein sequences over their entire length, typically computed with the Needleman–Wunsch dynamic programming algorithm. It is best suited to sequences that are similar in nature.
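A compact sketch of Needleman–Wunsch scoring is shown below. It returns only the optimal global alignment score, keeps just two rows of the dynamic programming matrix, and uses a simple linear gap penalty. The scoring values (match +1, mismatch −1, gap −2) are arbitrary choices for illustration.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # Row 0: aligning an empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [i * gap]  # column 0: prefix of a against empty b
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(nw_score("ACGT", "AGT"))
```

Recovering the alignment itself would require keeping the full matrix and tracing back from the last cell.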
Guide tree This is constructed during multiple sequence alignment from the pair‐wise distance scores. It is different from the phylogenetic tree that is constructed at the end of the MSA.
Hairpin loop (turn) A hairpin loop is formed by single‐stranded DNA or mRNA when a portion of the strand folds up and pairs with another section of the same strand. In designing primers (short oligonucleotides) for PCR, the formation of a hairpin loop at the 3’ end is avoided because it affects PCR efficiency.
Heuristic program A method of problem‐solving that often involves experimentation on the basis of trial and error. Likewise, a heuristic program is an algorithm that produces an acceptable solution without formal proof of its correctness.
Hidden Markov Model (HMM) HMM is used to represent probability distributions over sequences of observations. It is a Markov model with hidden (unobserved) states, where the states are not directly visible but the output is visible.
High‐throughput genomic sequences (HTGS) The division to accommodate rapidly growing unfinished genomic sequence databases of DDBJ, EMBL, and GenBank, where sequences are available for BLAST homology. When sequences are at the finished level (phase 3: finished with no gaps either with or without annotations), the data are moved from HTGS to the corresponding taxonomic division.
High‐scoring segment pair (HSP) HSP is the basic unit of BLAST algorithm output. It consists of two sequence fragments whose alignment is locally maximal, and for which the alignment score meets or exceeds a threshold or cut‐off score.
Homology Homology is the shared ancestry between a pair of genes, in the same or in different species.
InDel An abbreviation of “insertion and deletion”, a type of mutation in which nucleotides are inserted into or deleted from a sequence.
InDels One or more insertion or deletion events detected in sequences of genetic material.
Internal Node The intermediate node between root node and leaf node in a phylogenetic tree.
International Nucleotide Sequence Database Collaboration (INSDC) A long‐standing foundational collaboration between DDBJ, EMBL, and NCBI that covers raw reads, their alignments, assemblies and functional annotations, together with related information on the samples and experiments associated with the data.
Iteration The repetition of a process until the desired results are achieved.
Leaf In a phylogenetic tree, a leaf usually represents a single present‐day taxon that is typically a DNA sequence whose genetic distance is measured with other taxa.
Library This refers to a collection or pool of DNA or cDNA of an entire organism. A collection of the entire genome (exon and introns) is called a genomic DNA library, and a collection of all complementary DNA is called the cDNA library.
MegaBLAST Alignment of larger DNA sequences that differ slightly as a result of sequencing. MegaBLAST is similar to BLASTn but able to efficiently handle longer DNA sequences.
Microarray It is a set of DNA sequences representing the entire set of genes of an organism that are arranged (arrayed) in a grid pattern for use in gene expression analysis (cDNA microarray) or genetic testing (DNA microarray). A typical microarray experiment involves hybridization of mRNA molecules (called targets) to the DNA template (called probes) from which it is originated.
Mispriming When primers of PCR anneal to non‐specific sites, leading to the background or non‐specific amplification, this is called mispriming.
Monte Carlo Simulation A computerized mathematical technique that uses repeated random sampling to estimate the outcomes of an uncertain process, widely used for risk assessment in quantitative analysis. It provides a range of possible outcomes and their likelihoods, allowing scientists to make better decisions.
Motif Motifs are the structural characteristics of a protein that are associated with a particular arrangement of amino acids. When such arrangements of amino acids are associated with a function like DNA binding or catalytic activity, then it is called a domain.
Multiple alignments A computational method that lines up a set of three or more sequences in rows to identify corresponding positions with maximum accuracy and minimum mismatches and gaps.
Next‐generation sequencing High‐throughput sequencing of DNA and RNA that is much quicker and cheaper than the previously used Sanger sequencing, producing thousands or millions of sequences at once.
Node A node in a phylogeny represents the common ancestor of a set of taxa, from which different taxa are descended.
omics The word “omics” is informally related to the field of biology such as genomics, proteomics or metabolomics, where the suffix ‐omics refers to the field of study of the genome, protein or metabolites, respectively.
Paired‐end sequencing The sequence of the DNA is obtained from the 5’ ends of both strands of the insert.
Palindrome A sequence of letters (or nucleotides) that reads the same backwards and forwards. For example, the word RACECAR reads the same in both directions. In double‐stranded DNA, a palindrome is a sequence that reads the same in the 5’ to 3’ direction on both strands, such as GAATTC.
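For nucleotide sequences, a palindrome is usually tested against the reverse complement rather than the plain reversed string. The sketch below assumes an unambiguous DNA alphabet (A, C, G, T); the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """True when the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


if __name__ == "__main__":
    print(is_dna_palindrome("GAATTC"), is_dna_palindrome("GATTACA"))
```

GAATTC, the EcoRI recognition site, is a familiar example of such a palindrome.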
Phi angle The torsion angle of rotation around the bond between the backbone nitrogen (N) atom and the C‐alpha atom of an amino acid residue (N‐Ca bond). The angle ranges from –180 to +180 degrees.
Phred scale A logarithmic scale of base‐calling accuracy. The Phred quality score (Q score) expresses the probability that a base has been called incorrectly and is used for assessing the accuracy of a sequencing platform.
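The Phred quality score is related to the base-calling error probability P by Q = −10 × log10(P), so Q20 corresponds to a 1% error rate and Q30 to 0.1%. The sketch below converts in both directions and decodes a FASTQ quality string. The ASCII offset of 33 is an assumption that matches the common Sanger-style encoding; other encodings exist.

```python
import math
from typing import List


def phred_to_error(q: float) -> float:
    """Probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score for an error probability p."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (assumed offset 33)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    print(phred_to_error(30), decode_fastq_quality("I5!"))
```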
Pattern‐Hit Initiated BLAST (PHI‐BLAST) A variant of PSI‐BLAST, based on the construction of a Position‐Specific Scoring Matrix (PSSM) around a motif of a protein.
Position‐Specific Iterative BLAST (PSI‐BLAST) An iterative protein BLAST search in which the significant hits from each round are used to build a PSSM that serves as the query for the next round.
Position‐Specific Scoring Matrix (PSSM) A profile that scores the match of each amino acid at each position of a sequence, estimated by log‐odds scores.
Primary structure The primary structure of a protein or polypeptide is a linear sequence of amino acids from the N‐terminal to the C‐terminal end.
Primer Short nucleotide sequences of 18–25 bases (usually used in pairs) that are used to amplify specific genes in PCR.
Probe (microarray) In a spotted microarray, probes refer to synthesized short oligonucleotides or DNA that are complementary to mRNA.
Prosthetic Group “Prosthetic” means an external part that supports the functions of an organ. Similarly, a prosthetic group is a non‐protein part, like vitamins or metal ions, that accelerates functions of an enzyme or protein.
Protein families Like gene families, protein families are evolutionarily related proteins that share common features or functions.
Protein Isoelectric Point (pI) The pH of a solution at which a molecule such as an amino acid or protein carries no net charge and therefore does not migrate in an electric field. For example, the pI of aspartic acid is 2.77, and of arginine is 10.76.
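For a free amino acid, the pI can be approximated as the mean of the two pKa values that flank its neutral (zwitterionic) form. The sketch below reproduces the figures quoted above for aspartic acid and arginine. The pKa values are typical textbook numbers and are an assumption of this example, as are the names used; estimating the pI of a whole protein requires a more involved calculation.

```python
# (pKa alpha-carboxyl, pKa alpha-amino, pKa side chain, side-chain type)
PKA = {
    "Gly": (2.34, 9.60, None, None),
    "Asp": (1.88, 9.60, 3.65, "acidic"),
    "Arg": (2.17, 9.04, 12.48, "basic"),
}


def amino_acid_pi(name: str) -> float:
    """Approximate pI of a free amino acid from its pKa values."""
    pk_carboxyl, pk_amino, pk_side, side_type = PKA[name]
    if pk_side is None:
        return (pk_carboxyl + pk_amino) / 2
    values = sorted([pk_carboxyl, pk_amino, pk_side])
    if side_type == "acidic":
        # Neutral form lies between the two lowest pKa values.
        return (values[0] + values[1]) / 2
    # Basic side chain: neutral form lies between the two highest pKa values.
    return (values[1] + values[2]) / 2


if __name__ == "__main__":
    for residue in ("Gly", "Asp", "Arg"):
        print(residue, amino_acid_pi(residue))
```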
Proteomics The study of the entire set of proteins (the proteome) expressed by a genome of a cell/tissue/organism at a particular point in time.
Pseudo Count In probability estimation of a model, an amount added to the number of observed cases. These a priori counts, which might be subjective values, are called pseudo counts.
Psi angle The torsion angle of rotation around the bond between the C‐alpha atom and the carbonyl C‐atom of an amino acid residue (Ca‐C bond). The angle ranges from –180 to +180 degrees.
Query Coverage The percentage of the query sequence that overlaps the subject sequence.
Ramachandran Plot A diagrammatic visualization of protein structure in which the dihedral angle psi (ψ) is plotted against phi (ϕ) for each amino acid residue. It was originally developed by a team led by Ramachandran.
Raw alignment score (S) A number used to assess the biological relevance of alignments of two sequences, where a higher score corresponds to a higher similarity of two sequences.
RCSB Research Collaboratory for Structural Bioinformatics, founded in 1998 and responsible for maintaining protein data bank (PDB). PDB is the single worldwide repository maintaining the 3D structure of proteins and nucleic acids.
Real‐time PCR The real‐time quantitative polymerase chain reaction (qPCR), where the formation of amplicons can be visualized in real time on a monitor or screen. It is an advanced form of conventional PCR and commonly utilizes a double‐stranded DNA binding dye whose fluorescence increases as amplicon accumulates and is detected by the camera.
Reference genome A reference assembly is a digital nucleic acid sequence assembled as a representative example of the genome of a species. It can be retrieved using genome browsers such as those of NCBI, UCSC and Ensembl.
RefSeq “Reference Sequence” of either protein or nucleotide in a database of NCBI, derived from curation and computation of archived sequences.
Re‐sequencing Sequencing of genetic material with reference sequence available.
Restriction Enzyme Also called “molecular scissors”, used to chop DNA/plasmid sequences at specific sites in either a blunt or sticky end fashion to generate recombinant DNA.
Rn Value An abbreviation of “normalized reporter value”. The Rn value is the fluorescent signal of SYBR Green dye (DNA intercalating dye) normalized to (divided by) the signal of the passive reference dye (e.g., Rox). The delta Rn value is the Rn value of the reaction minus the Rn value of the baseline signal of the instrument.
Root The root of a tree is the node of the phylogenetic tree that represents a common ancestor.
RSCB A frequent misspelling of RCSB (see RCSB). The RCSB Protein Data Bank portal provides tools to explore the molecular structures of proteins, their genomic positions and sequence alignments. The web link to the RCSB portal is: www.rcsb.org/
SCOP Standing for Structural Classification of Proteins, this is a manual classification of protein structural domains based on their amino acid sequences and structures. The SCOP database was discontinued in the year 2009, and a newer and better prototype is available, called SCOP2.
Secondary Structure The second level of protein structure. The most common type of secondary structure in proteins is the alpha‐helix. Beta‐sheets are another type of secondary structure of protein.
Sequence format The method of writing the nucleotide bases of a sequence is called the sequence format. There are various ways to write sequences, including: plain sequence format; EMBL format; FASTA format; GCG format; GenBank format; and IG format.
Sequence Similarity Comparing sequences of either DNA, RNA or protein with each other for a degree of similarity is one of the most frequent tasks of computational biology. A high degree of similarity between two sequences often implies similar functions.
Sequence Tagged Sites (STS) A 200–500 bp long DNA sequence that occurs singly (one copy) in a genome and whose location and sequence are known. An STS may contain repetitive sequences, but these are usually flanked by unique regions (not present elsewhere in the genome). A microsatellite is a type of STS.
Short read Single‐End and Pair‐End methods of sequencing of fragments of genetic material as per the specified read length.
SNP mining The extraction of valuable information from single nucleotide polymorphism (SNP) data. SNP analysis is a fast and cost‐effective means of studying genetic variation.
Subtree A part of the original tree, representing a fraction of the taxa being studied.
Taxa The singular form of taxa is the taxon. This is a generic name for a taxonomic group, such as species. Taxon also represents genera, families, orders, phyla, and so on.
Taxonomy Taxonomy is a branch of science that deals with the systematic naming, description and classification of organisms.
tBLASTn Alignment of protein vs. translated nucleotide sequences for the identification of database sequences that encode proteins.
tBLASTx Alignment of translated nucleotide vs. translated nucleotide sequences for identification of nucleotide sequences, based on their coding potential.
Tertiary Structure The third level of protein structure, describing complex and irregular folding of peptide chains in three dimensions.
Third party annotation (TPA) An annotated database derived from GenBank primary data or DDBJ/EMBL sequence databases. A TPA database could be experimental (if annotated from wet‐lab experiment) or inferential (annotated by inference only).
Threading (protein sequence) Protein threading refers to a method of protein modeling, where proteins may not be homologous but may have the same fold as a protein of known structure.
Topology The physical layout of a gene or protein network is referred to as its topology. The three main topologies of a network are ring, bus, and star, which more likely exist as hybrid networks (combinations of ring and bus, or ring and star, or bus and star).
Torsion Angle The angle of the geometric relation of two parts of a molecule joined by a chemical bond.
Transcriptome Shotgun Assembly (TSA) An archive of computationally assembled sequences derived from ESTs and next‐generation sequencing.
Transcriptomics Study of whole RNA profile (transcripts) of cells/tissue at a particular point in time (development stage, normal or diseased stage).
Tree A phylogenetic tree, or simply “tree”, is a diagram of the evolutionary relationships among a set of organisms or groups of organisms, called taxa.
Ultrametric Tree A rooted tree in which all leaves are at an equal distance from the root, representing an equal rate of mutation in all the lineages. It is also called a “dendrogram.”
Whole Genome Shotgun Contigs The sequences of the overlapping fragments of the whole genome.
X‐Ray crystallography A tool to identify the atomic and molecular structure of a crystal by using X‐rays.