This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Glossary
1˚
The abbreviation for primary. 1˚ sequence
refers to the letters of DNA, RNA, or protein.
1˚ transcript refers to an unprocessed RNA
that still contains its introns.
2˚
The abbreviation for secondary. Most fre-
quently used for generalizing protein and
RNA structures; for example, the α-helix
and hair-pin are common 2˚ structures.
3´
The end of a nucleic acid sequence; often
used with UTR.
5´
The start of a nucleic acid (DNA or RNA)
sequence; often used in conjunction with
UTR (e.g., 5´UTR). Nucleotide sequences
are conventionally written with the 5´ end
at the left. DNA molecules are usually
double-stranded, but when written, usually
only the 5´ to 3´ strand is displayed.
The complementary strand has reversed
polarity (3´ to 5´).
aa
The abbreviation for an amino acid that is
often used when describing the length of a
protein (e.g., the average protein is about
300 aa long).
allele
A form of a gene. Typically, the most
common form is called wild-type, and
each allele is given a specific (and often
obscure) name.
amino acid
The basic building block for all proteins.
There are 20 common amino acids.
Arabidopsis thaliana
Known by its common name, thale cress,
this mustard weed is a favorite organism
for plant genetics and molecular biology.
It was the first plant with a complete
genomic sequence. For more information,
see http://www.arabidopsis.org.
bit
The contraction for binary digit. The
base-2 logarithm of a number is in units of
bits.
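As a small illustration of the units, the sketch below computes the same
logarithm in bits and in nats. The helper names bits and nats are
hypothetical and chosen only for this example; Perl's built-in log is the
natural logarithm.

```perl
use strict;
use warnings;

# Perl's log() is the natural (base-e) logarithm, so it yields nats.
sub nats {
    my ($x) = @_;
    return log($x);
}

# Dividing by ln(2) converts nats to bits (the base-2 logarithm).
sub bits {
    my ($x) = @_;
    return log($x) / log(2);
}

printf "%.3f bits, %.3f nats\n", bits(8), nats(8);
```

The same division by ln(2) converts any quantity measured in nats to bits.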
BLOSUM
The abbreviation for a blocks substitution
matrix. Matrix names are followed by a
number (e.g., BLOSUM62) that indicates
the percent identity threshold at which
sequences were clustered when the matrix
was built.
bp
The abbreviation for base pair. The length
of DNA is usually given in bp or nt. Common
measures include Kb, Mb, and Gb
for thousands, millions, and billions of bp,
respectively.
C-terminus
The end of a protein. In text form, the
C-terminus of the protein is always at the
right.
Caenorhabditis elegans
A nematode (also called a roundworm)
that is about 1 mm long and has about
1,000 cells as an adult. C. elegans was the
first animal to have its complete genome
sequenced. See http://www.wormbase.org.
CDS
The abbreviation for a coding sequence.
CDS isn’t synonymous with exon, since
exons may contain noncoding sequence.
codon
Three contiguous letters of DNA or RNA.
Each of the 64 codons specifies either an
amino acid or a translation stop.
complement
The complement of a DNA sequence is
the sequence on the other strand. For
example, the complement of ACCCGT is
TGGGCA. To complement a sequence in
Perl, use either of the following:
# 4-letter alphabet
$dna =~ tr/ACGT/TGCA/;
# 15-letter IUPAC alphabet (W, S, and N are their own complements)
$dna =~ tr[ACGTRYKMBDHV]
[TGCAYRMKVHDB];
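Because the complementary strand has reversed polarity, code that needs the
other strand as it would be read 5´ to 3´ usually reverses the sequence as
well as complementing it. The following is a minimal sketch of that
reverse-complement step; the name revcomp is a hypothetical helper, not
something defined in the text, and only the 4-letter alphabet is handled.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse-complement a DNA string (4-letter alphabet).
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;            # scalar context reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # then complement each letter
    return $rc;
}

my $out = revcomp("ACCCGT");
print "$out\n";                       # prints ACGGGT
```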
Drosophila melanogaster
The common fruit fly. This is one of the
most famous organisms for genetic
research and was one of the first animals
whose complete genomic sequence was
determined. See http://www.fruitfly.org.
dynamic programming
A common technique that reduces the
computational complexity of a problem
by finding and extending a partial optimi-
zation.
E. coli
Escherichia coli. A common bacterium normally
found in your gut and a favorite
organism for molecular biology research.
Some variants cause food poisoning.
effective length
Karlin-Altschul statistics assume
sequences of infinite length. To adjust for
edge effects in real sequences, the search
space is reduced by adjusting the true
lengths of the sequences to effective
lengths.
entropy
Randomness; disorder; unpredictability.
eukaryote
Organisms with intracellular membra-
nous organelles such as the nucleus and
mitochondria are called eukaryotes.
frame-shift mutation
A mutation that causes an insertion or
deletion of nucleotides that isn’t a multi-
ple of three, and therefore causes the read-
ing frame to change.
gene
A functional unit of the genome. When
not specifically stated, “gene” is usually
considered a “protein-coding” gene, but
many genes don’t contain the instructions
for proteins (e.g., various RNA genes).
genetic code
The mapping of codons to amino acids.
See Table 2-3.
genetic drift
The tendency of sequences to change over
time by accumulating random mutations.
genome
The complete genetic material for an
organism. For eukaryotes, the genome
refers to the nuclear genome and doesn’t
include organelles.
global alignment
An alignment algorithm that requires
every letter of each sequence to appear in
the alignment. Globally aligning
sequences of different lengths may lead to
very strange alignments.
homologous
In sequence analysis, homologous means
derived from a common ancestor.
Sequences are either homologous or they
aren’t. It is incorrect to say that sequences
are 80 percent homologous unless you
mean that there is an 80 percent chance of
common ancestry. Use percent identity to
describe the similarity of alignments.
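One way to compute percent identity from a pairwise alignment is sketched
below. It assumes the two aligned strings have equal length, that "-" marks
a gap, and that identity is measured over the full alignment length; other
denominators are also in use, so treat this convention and the helper name
percent_identity as assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: percent identity over the whole alignment length.
# Gap characters ('-') never count as matches.
sub percent_identity {
    my ($s1, $s2) = @_;
    die "aligned strings must have equal length\n"
        unless length($s1) == length($s2);
    return 0 unless length($s1);
    my $matches = 0;
    for my $i (0 .. length($s1) - 1) {
        my $x = substr($s1, $i, 1);
        my $y = substr($s2, $i, 1);
        $matches++ if $x eq $y && $x ne '-';
    }
    return 100 * $matches / length($s1);
}

printf "%.1f%% identity\n", percent_identity("ACGT-A", "ACCTTA");
```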
hydrophilic
Literally, “likes water.” Water is a polar
molecule that mixes well with other polar
molecules. The charged amino acids K, R,
D, and E are examples of hydrophilic
amino acids.
hydrophobic
Literally, “fears water.” Nonpolar mole-
cules (like those in oils) don’t mix well
with water. The amino acids L, I, V, and F
are particularly hydrophobic.
Karlin-Altschul
The standard local alignment theory is
often called Karlin-Altschul statistics after
its founding authors.
lambda, λ
The Karlin-Altschul statistical parameter
that converts a raw score to a normalized
score.
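The conversion can be sketched as follows, using the usual Karlin-Altschul
form in which the normalized score in nats is lambda times the raw score
minus ln K, and dividing by ln 2 gives bits. The second parameter K is not
defined in this glossary and is introduced here as an assumption, as are
the helper names and the placeholder parameter values.

```perl
use strict;
use warnings;

# Hypothetical helpers: normalize a raw alignment score.
# $lambda and $k are Karlin-Altschul parameters supplied by the caller.
sub nat_score {
    my ($raw, $lambda, $k) = @_;
    return $lambda * $raw - log($k);
}

sub bit_score {
    my ($raw, $lambda, $k) = @_;
    return nat_score($raw, $lambda, $k) / log(2);
}

# Placeholder parameter values, chosen only for illustration.
printf "%.1f bits\n", bit_score(100, 0.267, 0.041);
```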
local alignment
An alignment algorithm that finds the
optimal subsequence alignment. The
alignment may include all letters of each
sequence, but it isn’t required to do so.
low-complexity sequence
Regions of sequences that are highly pre-
dictable—for example, a region that is 90
percent A or T.
methionine
One of the 20 common amino acids.
Methionine is abbreviated as M or Met,
and is especially important because all
proteins begin with a methionine. There is
only one codon for this amino acid: ATG.
mutation
Any change to the sequence of a DNA
molecule.
N-terminus
The start of a protein. In text form, a pro-
tein’s N-terminus is always at the left.
nat
Contraction for natural log digits. The
base e logarithm of a number is in units of
nats.
natural selection
A theory founded by Charles Darwin that
explains how organisms change over time
to better fit their environment. It is based
on the principles of variation, heritability,
and differential reproduction.
ncRNA
The abbreviation for noncoding RNA.
Some RNAs, like tRNAs or rRNAs, don’t
contain information for protein
sequences.
Needleman-Wunsch
Global alignment is often called Needle-
man-Wunsch after the authors who first
described the algorithm.
nucleotide
The basic building block of nucleic acid
sequences (DNA and RNA). DNA is made
from A, C, G, or T, while RNA contains
A, C, G, or U.
nt
The abbreviation for nucleotide.
O(n)
The computational complexity of an algo-
rithm is often described by its asymptotic
behavior. O(n) problems grow linearly
with the size of the input. O(log₂ n) problems
grow much more slowly, and O(n²) problems grow
much more quickly.
ORF
Abbreviation for open reading frame.
Each strand of DNA has three frames. Any
subsequence that doesn’t contain stop
codons in a particular frame is an open
reading frame.
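The definition above can be sketched as a scan over the three frames of one
strand, recording every run of whole codons that contains no stop codon.
The helper name find_orfs, the 0-based coordinates, and the minimum-length
parameter are assumptions made for this example.

```perl
use strict;
use warnings;

my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

# Hypothetical helper: return [frame, start, length] for every stop-free
# run of whole codons on the given strand. Coordinates are 0-based and
# lengths are in nucleotides.
sub find_orfs {
    my ($dna, $min_codons) = @_;
    $dna = uc $dna;
    my @orfs;
    for my $frame (0 .. 2) {
        my $start = $frame;
        for (my $i = $frame; $i + 3 <= length($dna); $i += 3) {
            next unless $is_stop{ substr($dna, $i, 3) };
            my $len = $i - $start;
            push @orfs, [$frame, $start, $len] if $len >= 3 * $min_codons;
            $start = $i + 3;          # next run begins after the stop codon
        }
        # Trailing run that reaches the last whole codon without a stop.
        my $end = $frame + 3 * int((length($dna) - $frame) / 3);
        my $len = $end - $start;
        push @orfs, [$frame, $start, $len] if $len >= 3 * $min_codons;
    }
    return @orfs;
}

for my $orf (find_orfs("ATGAAATAGCCC", 2)) {
    my ($frame, $start, $len) = @$orf;
    print "frame $frame: start $start, length $len\n";
}
```

To cover the three frames of the other strand, the same function can be run
on the reverse complement of the sequence.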
ortholog
Genes that are separated by speciation
(i.e., the same gene in different species).
This is often approximated as the best
reciprocal match between two complete
genomes or proteomes.
palindrome
A palindrome in DNA is a sequence that is
read the same on the plus and minus
strands. For example, the sequence
GAATTC is a palindrome. Palindromes
and near-palindromes are often sites for
DNA-protein interaction. Proteins scan-
ning along DNA “see” a palindrome as the
same sequence regardless of which direc-
tion they are moving.
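Read the same on both strands means that a sequence equals its own reverse
complement, which suggests a simple test. The sketch below assumes the
4-letter alphabet, and the name is_palindrome is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the sequence equals its reverse complement.
sub is_palindrome {
    my ($dna) = @_;
    $dna = uc $dna;
    my $rc = reverse $dna;     # scalar context reverses the string
    $rc =~ tr/ACGT/TGCA/;      # complement each letter
    return $rc eq $dna;
}

for my $seq (qw(GAATTC ACCCGT)) {
    my $verdict = is_palindrome($seq) ? "palindrome" : "not a palindrome";
    print "$seq: $verdict\n";
}
```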
PAM
An acronym for Percent or Point Accepted
Mutation. PAM scoring matrix names are
usually followed by a number (e.g.,
PAM200), which indicates how many iter-
ations of multiplication were used starting
with the PAM1 matrix. The higher num-
ber indicates a more distant similarity.
paralogs
Genes that are duplicated within a single
genome. Duplication sometimes allows
one of the genes to take on a specialized
function.
phylogenetics
The study of evolutionary relationships
among organisms.
prokaryotes
Organisms that don’t contain intracellu-
lar organelles. All bacteria are prokary-
otes.
proteome
The complete set of all proteins produced
by a particular organism. Many proteins
undergo post-translational modifications
that add or subtract features from a pro-
tein. Therefore, a particular mRNA might
have many different protein isoforms.
pseudogene
A sequence that looks like a gene but isn’t.
Most pseudogenes are derived from
mRNAs that have been reverse-tran-
scribed back to DNA and inserted into the
genome. They have the hallmarks of RNA
processing—notably a poly-A tail and no
introns.
relative entropy
The average number of bits (or nats) per
aligned letter for a given scoring scheme.
repeat
Any class of a sequence that appears mul-
tiple times in a genome. Usually, gene
families aren’t called repeats and the term
is used for junk DNA. Some of the most
common repeats in the human genome
include the ALU and LINE families.
reverse transcriptase
A protein that creates DNA from an RNA
template.
RNA
Ribonucleic acid. RNA is chemically simi-
lar to DNA but not used strictly for stor-
age. Many RNA molecules have important
functions in the cell and may even have
enzymatic properties. Some of the most
common functional RNA molecules
include rRNAs and tRNAs.
RNA polymerase
A protein or multiprotein complex that
creates RNA from a DNA template.
ribosome
A complex macromolecule made up of
proteins and rRNAs. Ribosomes are
responsible for translating mRNAs into
proteins.
rRNA
Ribosomal RNA. The ribosome is com-
posed of many specific RNA molecules,
and these components are called rRNAs.
rRNAs are some of the most abundant
RNAs in a cell.
Smith-Waterman
Local alignment is often referred to as
Smith-Waterman, after the authors who
first described the algorithm.
start codon
ATG. Codes for the amino acid methion-
ine. Many proteins have N-terminal
post-translational modifications, and the
first amino acid of the mature protein may
therefore not be methionine.
stop codon
TAA, TGA, and TAG are the three codons
that terminate translation.
sum statistics
A method that determines the aggregate
statistical significance of multiple local
alignments.
target frequency
The expected frequencies of individual let-
ter pairings. For nucleotide scoring matri-
ces, the target frequency is often
summarized by the expected percent iden-
tity in sequences with unbiased composi-
tion.
transcriptome
The complete set of transcripts for a par-
ticular genome. This term is often used to
mean the mRNAs of protein coding genes
and their alternatively spliced variants.
tRNA
The abbreviation for transfer RNA. tRNAs
transfer individual amino acids to the
ribosome. Each tRNA molecule has an
anti-codon that matches the reverse-complement
of the codon for the amino acid it carries.
UTR
The abbreviation for an untranslated
region. The 5´ and 3´ ends of an mRNA
have untranslated regions. These regions
sometimes play regulatory roles that
change the mRNA’s stability, translatabil-
ity, or localization.