2

Providing Structure to Unstructured Data

Abstract

Most of the data collected today is unstructured. For data managers, unstructured data is any stored information that comes in different sizes (e.g., a tweet, an email message, a book, a library corpus), that contains information that may express a single concept in many different ways (e.g., September 3, 2012; Sept 3, 2012; 09/03/12; 03/09/12; 3/9/12; Labor Day), that is not neatly packaged into spreadsheet cells (e.g., an audio file, a photograph of the Declaration of Independence, a tweet), that cannot be assigned a numeric value (such as yesterday, TRUE, nil), or that does not conform in any way to a specific data standard. Examples of unstructured data would include just about everything you experience on the Internet (advertisements, news, Web pages, downloaded music, images). Unstructured data is grist for many Big Data resources. The purpose of this chapter is to describe useful algorithms that provide structure to unstructured data, including indexing, autocoding, and term extraction methods.

Keywords

Unstructured data; Parsing; Autocoding; Indexing; Machine translation; Term extraction; Concordances

Section 2.1. Nearly All Data Is Unstructured and Unusable in Its Raw Form

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again.

Oscar Wilde

In the early days of computing, data was always highly structured. All data was divided into fields, the fields had a fixed length, and the data entered into each field was constrained to a pre-determined set of allowed values. Data was entered into punch cards with pre-configured rows and columns. Depending on the intended use of the cards, various entry and read-out methods were chosen to express binary data, numeric data, fixed-size text, or programming instructions. Key-punch operators produced mountains of punch cards. For many analytic purposes, card-encoded data sets were analyzed without the assistance of a computer; all that was needed was a punch card sorter. If you wanted the data cards for all males over the age of 18 who had graduated high school and had passed their physical exams, then the sorter would need to make four passes. The sorter would pull every card listing a male, then from the male cards it would pull all the cards of people over the age of 18, and from this double-sorted sub-stack, it would pull cards that met the next criterion, and so on. As a high school student in the 1960s, I loved playing with the card sorters. Back then, all data was structured data, and it seemed to me, at the time, that a punch-card sorter was all that anyone would ever need to analyze large sets of data. [Glossary Binary data]

How wrong I was! Today, most data entered by humans is unstructured in the form of free-text. The free-text comes in email messages, tweets, and documents. Structured data has not disappeared, but it sits in the shadows cast by mountains of unstructured text. Free-text may be more interesting to read than punch cards, but the venerable punch card, in its heyday, was much easier to analyze than its free-text descendant. To get much informational value from free-text, it is necessary to impose some structure. This may involve translating the text to a preferred language; parsing the text into sentences; extracting and normalizing the conceptual terms contained in the sentences; mapping terms to a standard nomenclature; annotating the terms with codes from one or more standard nomenclatures; extracting and standardizing data values from the text; assigning data values to specific classes of data belonging to a classification system; assigning the classified data to a storage and retrieval system (e.g., a database); and indexing the data in the system. All of these activities are difficult to do on a small scale and virtually impossible to do on a large scale. Nonetheless, every Big Data project that uses unstructured data must deal with these tasks to yield the best possible results with the resources available. [Glossary Parsing, Nomenclature, Nomenclature mapping, Thesaurus, Indexes, Plain-text]

Section 2.2. Concordances

The limits of my language are the limits of my mind. All I know is what I have words for. (Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt.)

Ludwig Wittgenstein

A concordance is a list of all the different words contained in a text with the locations in the text where each word appears. Concordances have been around for a very long time, painstakingly constructed from holy scriptures thought to be of such immense value that every word deserved special attention. Creating a concordance has always been a straightforward operation. You take the first word in the text and you note its location (i.e., word 1, page 1); then onto the second word (word 2 page 1), and so on. When you come to a word that has been included in the nascent concordance, you add its location to the existing entry for the word. Continuing thusly, for a few months or so, you end up with a concordance that you can be proud of. Today a concordance for the Bible can be constructed in a small fraction of a second. [Glossary Concordance]

Without the benefit of any special analyses, skimming through a book's concordance provides a fairly good idea of the following:

  •   The topic of the text based on the words appearing in the concordance. For example, a concordance listing multiple locations for “begat” and “anointed” and “thy” is most likely to be the Old Testament.
  •   The complexity of the language. A complex or scholarly text will have a larger vocabulary than a romance novel.
  •   A rough idea of the length of the text, achieved by adding up the occurrences of each of the words in the concordance. The number of items in the concordance, multiplied by the average number of locations per item, yields an estimate of the total number of words in the text (see the sketch following this list).
  •   The care with which the text was prepared, achieved by counting the misspelled words.
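
A minimal sketch of the length estimate mentioned above, using a small, hypothetical word_dict in place of a full concordance (the concordance-building script later in this section stores its concordance in just this form: a Python dictionary that maps each word to a comma-separated string of locations):

# Sketch: estimate the total word count of a text from its concordance.
# word_dict is a hypothetical, abbreviated concordance used for illustration.
word_dict = {"four": "1", "score": "2", "and": "3,20,49", "seven": "4"}
total_words = sum(len(locations.split(",")) for locations in word_dict.values())
print("Vocabulary size:", len(word_dict))       # number of distinct words
print("Estimated text length:", total_words)    # total number of word locations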

Here is a short Python script, concord_gettysbu.py, that builds a concordance for the Gettysburg address, located in the external file “gettysbu.txt”: [Glossary Script]

import re, string
word_list = []; word_dict = {}; key_list = []
count = 0; word = ""
in_text_string = open('gettysbu.txt', "r").read().lower()
# split the lowercased text on runs of characters that are not letters,
# underscores, or hyphens, yielding the list of words
word_list = re.split(r'[^a-zA-Z_\-]+', in_text_string)
for word in word_list:
    count = count + 1
    if word in word_dict:
        word_dict[word] = word_dict[word] + ',' + str(count)
    else:
        word_dict[word] = str(count)
# print the words in alphabetic order, each with its list of locations
key_list = list(word_dict)
key_list.sort()
for key in key_list:
    print(key + " " + word_dict[key])

The first few lines of output are shown:

a 14,36,59,70,76,104,243
above 131
add 136
advanced 185
ago 6
all 26
altogether 93
and 3,20,49,95,122,248
any 45
are 28,33,56
as 75
battlefield 61
be 168,192
before 200
birth 245
brave 119
brought 9
but 102,151
by 254
can 52,153
cannot 108,111,114

The numbers that follow each item in the concordance correspond to the locations (expressed as the nth words of the Gettysburg address) of each word in the text.

At this point, building a concordance may appear to be an easy, but somewhat pointless, exercise. Does the concordance provide any functionality beyond that provided by the ubiquitous “search” box? There are several very useful properties of concordances that you might not have anticipated.

  •   You can use a concordance to rapidly search and retrieve the locations where single-word terms appear.
  •   You can always reconstruct the original text from the concordance. Hence, after you've built your concordance, you can discard the original text.
  •   You can merge concordances without forfeiting your ability to reconstruct the original texts, provided that you tag locations with some character sequence that identifies the text of origin.
  •   With a little effort a dictionary can be transformed into a universal concordance (i.e., a merged dictionary/concordance of every book in existence) by attaching the book identifier and its concordance entries to the corresponding dictionary terms.
  •   You can easily find the co-locations among words (i.e., which words often precede or follow one another).
  •   You can use the concordance to retrieve the sentences and paragraphs in which a search word or a search term appears, without having access to the original text. The concordance alone can reconstruct and retrieve the appropriate segments of text, on-the-fly, thus bypassing the need to search the original text.
  •   A concordance provides a profile of the book and can be used to compute a similarity score among different books.

There is insufficient room to explore all of the useful properties of concordances, but let us examine a script, concord_reverse.py, that reconstructs the original text, in lowercase, from the concordance. In this case, we have pasted the output from the concord_gettysbu.py script (vide supra) into the external file, “concordance.txt”.

import re, string
concordance_hash = {}; location_array = []
in_text = open('concordance.txt', "r")
# invert the concordance: map each word location back to its word
for line in in_text:
    line = line.replace("\n", "")
    location_word, separator, location_positions = line.partition(" ")
    location_array = location_positions.split(",")
    location_array = [int(x) for x in location_array]
    for location in location_array:
        concordance_hash[location] = location_word
# print the words in positional order (the address is under 300 words long)
for n in range(300):
    if n in concordance_hash:
        print(concordance_hash[n], end=" ")

Here is the familiar output:

four score and seven years ago our fathers brought forth on this continent a new nation conceived in liberty and dedicated to the proposition that all men are created equal now we are engaged in a great civil war testing whether that nation or any nation so conceived and so dedicated can long endure we are met on a great battlefield of that war we have come to dedicate a portion of that field as a final resting-place for those who here gave their lives that that nation might live it is altogether fitting and proper that we should do this but in a larger sense we cannot dedicate we cannot consecrate we cannot hallow this ground the brave men living and dead who struggled here have consecrated it far above our poor power to add or detract the world will little note nor long remember what we say here but it can never forget what they did here it is for us the living rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced it is rather for us to be here dedicated to the great task remaining before us–that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion–that we here highly resolve that these dead shall not have died in vain that this nation under god shall have a new birth of freedom and that government of the people by the people for the people shall not perish from the earth

Had we wanted to write a script that produces a merged concordance, for multiple documents, we could have simply written a loop that repeated the concordance-building process for each text. Within the loop, we would have tagged each word location with a short notation indicating the particular source book. For example, locations from the Gettysburg address could have been prepended with “G:” and locations from the Bible might have been prepended with a “B:”.
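
Here is a minimal sketch of that tagging strategy, using two short, hypothetical text fragments and the one-letter source tags suggested above:

# Sketch: build a merged concordance from two texts, prefixing each location
# with a tag ("G:" or "B:") that identifies the text of origin.
def add_to_concordance(text, tag, merged):
    for position, word in enumerate(text.lower().split(), start=1):
        merged.setdefault(word, []).append(tag + str(position))
    return merged

merged = {}
merged = add_to_concordance("four score and seven years ago", "G:", merged)
merged = add_to_concordance("in the beginning god created the heaven and the earth", "B:", merged)
for word in sorted(merged):
    print(word, ",".join(merged[word]))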

We have not finished with the topic of concordances. Later in this chapter (Section 2.8), we will show how concordances can be transformed to speed up search and retrieval operations on large bodies of text.

Section 2.3. Term Extraction

There's a big difference between knowing the name of something and knowing something.

Richard Feynman

One of my favorite movies is the parody version of “Hound of the Baskervilles,” starring Peter Cook as Sherlock Holmes and Dudley Moore as his faithful hagiographer, Dr. Watson. Sherlock, preoccupied with his own ridiculous pursuits, dispatches Watson to the Baskerville family manse, in Dartmoor, to undertake urgent sleuth-related activities. The hapless Watson, standing in the great Baskerville Hall, has no idea how to proceed with the investigation. After a moment of hesitation, he turns to the incurious maid and commands, “Take me to the clues!”

Building an index is a lot like solving a fiendish crime; you need to know how to find the clues. For informaticians, the terms in the text are the clues upon which the index is built. Terms in a text file do not jump into your index file; you need to find them. There are several available methods for finding and extracting index terms from a corpus of text [1], but no method is as simple, fast, and scalable as the “stop word” method [2]. [Glossary Term extraction algorithm, Scalable]

The “stop word” method presumes that text is composed of terms that are somehow connected into sequences known as sentences. [Glossary Sentence]

Consider the following:

The diagnosis is chronic viral hepatitis.

This sentence contains two very specific medical concepts: “diagnosis” and “chronic viral hepatitis.” These two concepts are connected to form a sentence, using grammatical bric-a-brac such as “the” and “is”, and the sentence delimiter, “.”. This grammatical bric-a-brac is liberally sprinkled through every paragraph you are likely to read.

A term can be defined as a sequence of one or more uncommon words that are demarcated (i.e., bounded on either side) by the occurrence of one or more very common words (e.g., “and”, “the”, “a”, “of”) and phrase delimiters (e.g., “.”, “,”, and “;”).

Consider the following:

An epidural hemorrhage can occur after a lucid interval.

The medical concepts “epidural hemorrhage” and “lucid interval” are composed of uncommon words. These uncommon word sequences are bounded by common words (i.e., “an”, “can”, “a”) or a sentence delimiter (i.e., “.”).

If we had a list of all the words that were considered common, we could write a program that extracts all the concepts found in any text of any length. The concept terms would consist of all sequences of uncommon words that are uninterrupted by common words. Here is an algorithm for extracting terms from a sentence:

  1.  Read the first word of the sentence. If it is a common word, delete it. If it is an uncommon word, save it.
  2.  Read the next word. If it is a common word, delete it, and place the saved word (from the prior step, if the prior step saved a word) into our list of terms found in the text. If it is an uncommon word, concatenate it with the word we saved in step one, and save the 2-word term. If it is a sentence delimiter, place any saved term into our list of terms, and stop the program.
  3.  Repeat step two (a sketch of this procedure follows the list).
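
Here is a minimal sketch of the three-step procedure, applied to the epidural hemorrhage sentence and using a small, illustrative stop-word set. Notice that a low-value term, “occur,” slips through; as discussed at the end of this section, crude term extraction always nets some terms of little informational value.

# Sketch: extract terms by walking through a sentence word by word,
# saving runs of uncommon words; the stop-word set is illustrative only.
stop_words = {"an", "a", "can", "after", "the", "is"}
sentence = "an epidural hemorrhage can occur after a lucid interval"
terms = []
saved = []                      # words of the term currently being assembled
for word in sentence.split():
    if word in stop_words:      # common word: close out any saved term
        if saved:
            terms.append(" ".join(saved))
            saved = []
    else:                       # uncommon word: extend the saved term
        saved.append(word)
if saved:                       # flush any term left at the end of the sentence
    terms.append(" ".join(saved))
print(terms)                    # ['epidural hemorrhage', 'occur', 'lucid interval']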

This simple algorithm, or something much like it, is a fast and efficient method to build a collection of index terms. The following list of common words might be useful: “about, again, all, almost, also, although, always, among, an, and, another, any, are, as, at, be, because, been, before, being, between, both, but, by, can, could, did, do, does, done, due, during, each, either, enough, especially, etc, for, found, from, further, had, has, have, having, here, how, however, i, if, in, into, is, it, its, itself, just, kg, km, made, mainly, make, may, mg, might, ml, mm, most, mostly, must, nearly, neither, no, nor, obtained, of, often, on, our, overall, perhaps, pmid, quite, rather, really, regarding, seem, seen, several, should, show, showed, shown, shows, significantly, since, so, some, such, than, that, the, their, theirs, them, then, there, therefore, these, they, this, those, through, thus, to, upon, use, used, using, various, very, was, we, were, what, when, which, while, with, within, without, would.”

Such lists of common words are sometimes referred to as “stop word” lists or “barrier word” lists, as they demarcate the beginnings and endings of extraction terms. Let us look at a short Python script (terms.py) that uses our list of stop words (contained in the file stop.txt) and extracts the terms from the sentence: “Once you have a method for extracting terms from sentences the task of creating an index associating a list of locations with each term is child's play for programmers”

import re, string
stopfile = open("stop.txt", 'r')
stop_list = stopfile.readlines()
stopfile.close()
item_list = []
line = ("Once you have a method for extracting terms from "
        "sentences the task of creating an index associating a list "
        "of locations with each term is child's play for programmers")
# replace each stop word (matched as a whole word, together with its
# flanking spaces) with a newline; the terms are what remain in between
for stopword in stop_list:
    stopword = re.sub(r'\n', '', stopword)
    line = re.sub(r' *\b' + stopword + r'\b *', '\n', line)
item_list.extend(line.split("\n"))
item_list = sorted(set(item_list))
for item in item_list:
    print(item)

Here is the output:

Once
child's play
creating
extracting terms
index associating
list
locations
method
programmers
sentences
task
term

Extracting terms is the first step in building a very crude index. Indexes built directly from term extraction algorithms always contain lots of unnecessary terms having little or no informational value. For serious indexers, the collection of terms extracted from a corpus, along with their locations in the text, is just the beginning of an intellectual process that will eventually lead to a valuable index.

Section 2.4. Indexing

Knowledge can be public, yet undiscovered, if independently created fragments are logically related but never retrieved, brought together, and interpreted.

Donald R. Swanson [3]

Individuals accustomed to electronic media tend to think of the Index as an inefficient or obsolete method for finding and retrieving information. Most currently available e-books have no index. It is far easier to pull up the “Find” dialog box and enter a word or phrase. The e-reader can find all matches quickly, providing the total number of matches, and bringing the reader to any or all of the pages containing the selection. As more and more books are published electronically, the book Index, as we have come to know it, may cease to be.

It would be a pity if indexes were to be abandoned by computer scientists. A well-designed book index is a creative, literary work that captures the content and intent of the book and transforms it into a listing wherein related concepts are collected under common terms, and keyed to their locations. It saddens me that many people ignore the book index until they want something from it. Open a favorite book and read the index, from A to Z, as if you were reading the body of the text. You will find that the index refreshes your understanding of the concepts discussed in the book. The range of page numbers after each term indicates that a concept has extended its relevance across many different chapters. When you browse the different entries related to a single term, you learn how the concept represented by the term applies itself to many different topics. You begin to understand, in ways that were not apparent when you read the book as a linear text, the versatility of the ideas contained in the book. When you have finished reading the index, you will notice that the indexer exercised great restraint when selecting terms. Most indexes are under 20 pages. The goal of the indexer is not to create a concordance (i.e., a listing of every word in a book, with its locations), but to create a keyed encapsulation of concepts, sub-concepts and term relationships.

The indexes we find in today's books are generally alphabetized terms. In prior decades and prior centuries, authors and editors put enormous effort into building indexes, sometimes producing multiple indexes for a single book. For example, a biography might contain a traditional alphabetized term index, followed by an alphabetized index of the names of the people included in the text. A zoology book might include an index specifically for animal names, with animals categorized according to their taxonomic order. A geography index might list the names of localities sub-indexed by country, with countries sub-indexed by continent. A single book might have 5 or more indexes. In nineteenth century books, it was not unusual to publish indexes as stand-alone volumes. [Glossary Taxonomy, Systematics, Taxa, Taxon]

You may be thinking that all this fuss over indexes is quaint, but it cannot apply to Big Data resources. Actually, Big Data resources that lack a proper index cannot be utilized to their full potential. Without an index, you never know what your queries are missing. Remember, in a Big Data resource, it is the relationships among data objects that are the keys to knowledge. Data by itself, even in large quantities, tells only part of a story. The most useful Big Data resources have electronic indexes that map concepts, classes, and terms to specific locations in the resource where data items are stored. An index imposes order and simplicity on the Big Data resource. Without an index, Big Data resources can easily devolve into vast collections of disorganized information. [Glossary Class]

The best indexes comply with international standards (ISO 999) and require creativity and professionalism [4]. Indexes should be accepted as another device for driving down the complexity of Big Data resources. Here are a few of the specific strengths of an index that cannot be duplicated by “find” operations on terms entered into a query box:

  •   An index can be read, like a book, to acquire a quick understanding of the contents and general organization of the data resource.
  •   Index lookups (i.e., searches and retrievals) are virtually instantaneous, even for very large indexes (see Section 2.6 of this chapter, for explanation).
  •   Indexes can be tied to a classification. This permits the analyst to know the relationships among different topics within the index, and within the text. [Glossary Classification]
  •   Many indexes are cross-indexed, providing relationships among index terms that might be extremely helpful to the data analyst.
  •   Indexes from multiple Big Data resources can be merged. When the location entries for index terms are annotated with the name of the resource, then merging indexes is trivial, and index searches will yield unambiguously identified locators in any of the Big Data resources included in the merge.
  •   Indexes can be created to satisfy a particular goal; and the process of creating a made-to-order index can be repeated again and again. For example, if you have a Big Data resource devoted to ornithology, and you have an interest in the geographic location of species, you might want to create an index specifically keyed to localities, or you might want to add a locality sub-entry for every indexed bird name in your original index. Such indexes can be constructed as add-ons, as needed. [Glossary Ngrams]
  •   Indexes can be updated. If terminology or classifications change, there is nothing stopping you from re-building the index with an updated specification. In the specific context of Big Data, you can update the index without modifying your data. [Glossary Specification]
  •   Indexes are created after the database has been created. In some cases, the data manager does not envision the full potential of the Big Data resource until after it is created. The index can be designed to facilitate the use of the resource in line with the observed practices of users.
  •   Indexes can serve as surrogates for the Big Data resource. In some cases, all the data user really needs is the index. A telephone book is an example of an index that serves its purpose without being attached to a related data source (e.g., caller logs, switching diagrams).

Section 2.5. Autocoding

The beginning of wisdom is to call things by their right names.

Chinese proverb

Coding, as used in the context of unstructured textual data, is the process of tagging terms with an identifier code that corresponds to a synonymous term listed in a standard nomenclature. For example, a medical nomenclature might contain the term renal cell carcinoma, a type of kidney cancer, and attach a unique identifier code to the term, such as “C9385000.” There are about 50 recognized synonyms for “renal cell carcinoma.” A few of these synonyms and near-synonyms are listed here to show that a single concept can be expressed many different ways, including: adenocarcinoma arising from kidney, adenocarcinoma involving kidney, cancer arising from kidney, carcinoma of kidney, Grawitz tumor, Grawitz tumour, hypernephroid tumor, hypernephroma, kidney adenocarcinoma, renal adenocarcinoma, and renal cell carcinoma. All of these terms could be assigned the same identifier code, “C9385000”. [Glossary Coding, Identifier]

The process of coding a text document involves finding all the terms that belong to a specific nomenclature, and tagging each term with the corresponding identifier code.

A nomenclature is a specialized vocabulary, usually containing terms that comprehensively cover a knowledge domain. For example, there may be a nomenclature of diseases, of celestial bodies, or of makes and models of automobiles. Some nomenclatures are ordered alphabetically. Others are ordered by synonymy, wherein all synonyms and plesionyms (near-synonyms) are collected under a canonical (i.e., best or preferred) term. Synonym indexes are always corrupted by the inclusion of polysemous terms (i.e., terms with multiple meanings). In many nomenclatures, grouped synonyms are collected under a so-called code (i.e., a unique alphanumeric string) assigned to all of the terms in the group.

Nomenclatures have many purposes: to enhance interoperability and integration, to allow synonymous terms to be retrieved regardless of which specific synonym is entered as a query, to support comprehensive analyses of textual data, to express detail, to tag information in textual documents, and to drive down the complexity of documents by uniting synonymous terms under a common code. Sets of documents held in more than one Big Data resource can be harmonized under a nomenclature by substituting or appending a nomenclature code to every nomenclature term that appears in any of the documents. [Glossary Interoperability, Data integration, Plesionymy, Polysemy, Vocabulary, Uniqueness, String]

In the case of “renal cell carcinoma,” if all of the 50+ synonymous terms, appearing anywhere in a medical text, were tagged with the code “C9385000,” then a search engine could retrieve documents containing this code, regardless of which specific synonym was queried (e.g., a query on Grawitz tumor would retrieve documents containing the term “hypernephroid tumor”). To do so, the search engine would simply translate the query term, “Grawitz tumor,” into its nomenclature code, “C9385000,” and would pull every record that had been tagged by the code.
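
A minimal sketch of this retrieval-by-code strategy, using a few hypothetical synonym/code pairs and hypothetical records:

# Sketch: retrieve records by nomenclature code rather than by literal term.
synonym_to_code = {
    "renal cell carcinoma": "C9385000",
    "grawitz tumor": "C9385000",
    "hypernephroid tumor": "C9385000",
}
records = [
    "biopsy consistent with hypernephroid tumor",
    "no evidence of malignancy",
]
# each record is pre-tagged with the codes of any nomenclature terms it contains
tagged = [(record, {code for term, code in synonym_to_code.items() if term in record})
          for record in records]
query = "grawitz tumor"
query_code = synonym_to_code[query]     # translate the query term to its code
hits = [record for record, codes in tagged if query_code in codes]
print(hits)     # the "hypernephroid tumor" record is retrieved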

Traditionally, nomenclature coding, much like language translation, has been considered a specialized and highly detailed task that is best accomplished by human beings. Just as there are highly trained translators who will prepare foreign language versions of popular texts, there are highly trained coders, intimately familiar with specific nomenclatures, who create tagged versions of documents. Tagging documents with nomenclature codes is serious business. If the coding is flawed the consequences can be dire. In 2009 the Department of Veterans Affairs sent out hundreds of letters to veterans with the devastating news that they had contracted Amyotrophic Lateral Sclerosis, also known as Lou Gehrig's disease, a fatal degenerative neurologic condition. About 600 of the recipients did not, in fact, have the disease. The VA retracted these letters, attributing the confusion to a coding error [5]. Coding text is difficult. Human coders are inconsistent, idiosyncratic, and prone to error. Coding accuracy for humans seems to fall in the range of 85%–90% [6]. [Glossary Accuracy versus precision]

When dealing with text in gigabyte and greater quantities, human coding is simply out of the question. There is not enough time or money or talent to manually code the textual data contained in Big Data resources. Computerized coding (i.e., autocoding) is the only practical solution.

Autocoding is a specialized form of machine translation, the field of computer science wherein meaning is drawn from narrative text. Not surprisingly, autocoding algorithms have been adopted directly from the field of machine translation, particularly algorithms for natural language processing. A popular approach to autocoding involves using the natural rules of language to find words or phrases found in text and matching them to nomenclature terms. Ideally the terms found in text are correctly matched to their equivalent nomenclature terms, regardless of the way that the terms were expressed in the text. For instance, the term “adenocarcinoma of lung” has much in common with alternate terms that have minor variations in word order, plurality, inclusion of articles, terms split by a word inserted for informational enrichment, and so on. Alternate forms would be “adenocarcinoma of the lung,” “adenocarcinoma of the lungs,” “lung adenocarcinoma,” and “adenocarcinoma found in the lung.” A natural language algorithm takes into account grammatical variants, allowable alternate term constructions, word roots (i.e., stemming), and syntax variation. Clever improvements on natural language methods might include string similarity scores, intended to find term equivalences in cases where grammatical methods come up short. [Glossary Algorithm, Syntax, Machine translation, Natural language processing]

A limitation of the natural language approach to autocoding is encountered when synonymous terms lack etymologic commonality. Consider the term “renal cell carcinoma.” Synonyms include terms that have no grammatical relationship with one another. For example, hypernephroma and Grawitz tumor are synonyms for renal cell carcinoma. It is impossible to compute the equivalence among these terms through the implementation of natural language rules or word similarity algorithms. The only way of obtaining adequate synonymy is through the use of a comprehensive nomenclature that lists every synonym for every canonical term in the knowledge domain.

Setting aside the inability to construct equivalents for synonymous terms that share no grammatical roots, the best natural language autocoders are pitifully slow. The reason for the slowness relates to their algorithm, which requires the following steps, at a minimum: parsing text into sentences; parsing sentences into grammatical units; re-arranging the units of the sentence into grammatically permissible combinations; expanding the combinations based on stem forms of words; allowing for singularities and pluralities of words, and matching the allowable variations against the terms listed in the nomenclature. A typical natural language autocoder parses text at about 1 kilobyte per second, which is equivalent to a terabyte of text every 30 years. Big Data resources typically contain many terabytes of data; thus, natural language autocoding software is unsuitable for translating Big Data resources. This being the case, what good are they?

Natural language autocoders have value when they are employed at the time of data entry. Humans type sentences at a rate far less than 1 kilobyte per second, and natural language autocoders can keep up with typists, inserting codes for terms, as they are typed. They can operate much the same way as auto-correct, auto-spelling, look-ahead, and other commonly available crutches intended to improve or augment the output of plodding human typists.

  •   Recoding and speed

It would seem that by applying the natural language parser at the moment when the data is being prepared, all of the inherent limitations of the algorithm can be overcome. This belief, popularized by developers of natural language software, and perpetuated by a generation of satisfied customers, ignores two of the most important properties that must be preserved in Big Data resources: longevity, and curation. [Glossary Curator]

Nomenclatures change over time. Synonymous terms and the codes will vary from year to year as new versions of old nomenclature are published and new nomenclatures are developed. In some cases, the textual material within the Big Data resource will need to be annotated using codes from nomenclatures that cover informational domains that were not anticipated when the text was originally composed.

Most of the people who work within an information-intensive society are accustomed to evanescent data; data that is forgotten when its original purpose is served. Do we really want all of our old e-mails to be preserved forever? Do we not regret our earliest blog posts, Facebook entries, and tweets? In the medical world, a code for a clinic visit or a biopsy diagnosis, or a reportable transmissible disease will be used in a matter of minutes or hours; maybe days or months. Few among us place much value on textual information preserved for years and decades. Nonetheless, it is the job of the Big Data manager to preserve resource data over years and decades. When we have data that extends back, over decades, we can find and avoid errors that would otherwise reoccur in the present, and we can analyze trends that lead us into the future.

To preserve its value, data must be constantly curated, adding codes that apply to currently available nomenclatures. There is no avoiding the chore; the entire corpus of textual data held in the Big Data resource needs to be recoded again and again, using modified versions of the original nomenclature, or using one or more new nomenclatures. This time, an autocoding application will be required to code huge quantities of textual data (possibly terabytes), quickly. Natural language algorithms, which depend heavily on regex operations (i.e., finding word patterns in text) are too slow to do the job. [Glossary RegEx]

A faster alternative is so-called lexical parsing. This involves parsing text, word by word, looking for exact matches between runs of words and entries in a nomenclature. When a match occurs, the words in the text that matched the nomenclature term are assigned the nomenclature code that corresponds to the matched term. Here is one possible algorithmic strategy for autocoding the sentence: “Margins positive malignant melanoma.” For this example, you would be using a nomenclature that lists all of the tumors that occur in humans. Let us assume that the terms “malignant melanoma,” and “melanoma” are included in the nomenclature. They are both assigned the same code, for example “Q5673013,” because the people who wrote the nomenclature considered both terms to be biologically equivalent.

Let us autocode the diagnostic sentence, “Margins positive malignant melanoma”:

  1.  Begin parsing the sentence, one word at a time. The first word is “Margins.” You check against the nomenclature, and find no match. Save the word “margins.” We will use it in step 2.
  2.  You go to the second word, “positive” and find no matches in the nomenclature. You retrieve the former word “margins” and check to see if there is a 2-word term, “margins positive.” There is not. Save “margins” and “positive” and continue.
  3.  You go to the next word, “malignant.” There is no match in the nomenclature. You check to determine whether the 2-word term “positive malignant” and the 3-word term “margins positive malignant” are in the nomenclature. They are not.
  4.  You go to the next word, “melanoma.” You check and find that melanoma is in the nomenclature. You check against the two-word term “malignant melanoma,” the three-word term “positive malignant melanoma,” and the four-word term “margins positive malignant melanoma.” There is a match for “malignant melanoma” but it yields the same code as the code for “melanoma.”
  5.  The autocoder appends the code, “Q5673013” to the sentence, and proceeds to the next sentence, where it repeats the algorithm.

The algorithm seems like a lot of work, requiring many comparisons, but it is actually much more efficient than natural language parsing. A complete nomenclature, with each nomenclature term paired with its code, can be held in a single variable, in volatile memory. Look-ups to determine whether a word or phrase is included in the nomenclature are also fast. As it happens, there are methods that will speed things along. In Section 2.7, we will see a 12-line autocoder algorithm that can parse through terabytes of text at a rate that is much faster than commercial-grade natural language autocoders [7]. [Glossary Variable]

Another approach to the problem of recoding large volumes of textual data involves abandoning the attempt to autocode the entire corpus, in favor of on-the-fly autocoding, when needed. On-the-fly autocoding involves parsing through a text of any size, and searching for all the terms that match one particular concept (i.e., the search term).

Here is a general algorithm for on-the-fly coding [8]. This algorithm starts with a query term and seeks to find every synonym for the query term, in any collection of Big Data resources, using any convenient nomenclature.

  1.  The analyst starts with a query term submitted by a data user. The analyst chooses a nomenclature that contains his query term, as well as the list of synonyms for the term. Any vocabulary is suitable, so long as the vocabulary consists of term/code pairs, where a term and its synonyms are all paired with the same code.
  2.  All of the synonyms for the query term are collected together. For instance, the 2004 version of a popular medical nomenclature, the Unified Medical Language System, had 38 equivalent entries for the code C0206708, nine of which are listed here:
    C0206708 | Cervical Intraepithelial Neoplasms
    C0206708 | Cervical Intraepithelial Neoplasm
    C0206708 | Intraepithelial Neoplasm, Cervical
    C0206708 | Intraepithelial Neoplasms, Cervical
    C0206708 | Neoplasm, Cervical Intraepithelial
    C0206708 | Neoplasms, Cervical Intraepithelial
    C0206708 | Intraepithelial Neoplasia, Cervical
    C0206708 | Neoplasia, Cervical Intraepithelial
    C0206708 | Cervical Intraepithelial Neoplasia

    If the analyst had chosen to search on “Cervical Intraepithelial Neoplasia,” his term would be attached to the 38 synonyms included in the nomenclature.
  3.  One-by-one, the equivalent terms are matched against every record in every Big Data resource available to the analyst.
  4.  Records are pulled that contain terms matching any of the synonyms for the term selected by the analyst.

In the case of the example, this would mean that all 38 synonymous terms for “Cervical Intraepithelial Neoplasms” would be matched against the entire set of data records. The benefit of this kind of search is that data records that contain any search term, or its nomenclature equivalent, can be extracted from multiple data sets in multiple Big Data resources, as they are needed, in response to any query. There is no pre-coding, and there is no need to match against nomenclature terms that have no interest to the analyst. The drawback of this method is that it multiplies the computational task by the number of synonymous terms being searched, 38-fold in this example. Luckily, there are published methods for conducting simple and fast synonym searches, using precompiled concordances [8].
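
A minimal sketch of on-the-fly coding, using a hypothetical term/code nomenclature held as a Python dictionary and a few hypothetical records:

# Sketch: collect every synonym that shares the query term's code,
# then pull any record containing any of those synonyms.
nomenclature = {
    "cervical intraepithelial neoplasia": "C0206708",
    "intraepithelial neoplasia, cervical": "C0206708",
    "cervical intraepithelial neoplasms": "C0206708",
    "renal cell carcinoma": "C9385000",
}
records = [
    "patient with cervical intraepithelial neoplasms, grade ii",
    "follow-up for renal cell carcinoma",
]
query = "cervical intraepithelial neoplasia"
code = nomenclature[query]                                          # step 1
synonyms = [term for term, c in nomenclature.items() if c == code]  # step 2
hits = [record for record in records                                # steps 3 and 4
        if any(synonym in record for synonym in synonyms)]
print(hits)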

Section 2.6. Case Study: Instantly Finding the Precise Location of Any Atom in the Universe (Some Assembly Required)

There's as many atoms in a single molecule of your DNA as there are stars in the typical galaxy. We are, each of us, a little universe.

Neil deGrasse Tyson, Cosmos

If you have sat through an introductory course in Computer Science, you are no doubt familiar with three or four sorting algorithms. Indeed, most computer science books devote a substantial portion of their texts to describing sorting algorithms. The reason for this infatuation with sorting is that all sorted lists can be searched nearly instantly, regardless of the size of the list. The so-called binary algorithm for searching a sorted list is incredibly simple. For the sake of discussion, let us consider an alphabetically sorted list of 1024 words. I want to determine if the word “kangaroo” is in the list; and, if so, its exact location in the list. Here is how a binary search would be conducted.

  1. Go to the middle entry of the list.
  2. Compare the middle entry to the word “kangaroo.” If the middle entry comes earlier in the alphabet than “kangaroo,” then repeat step 1, this time ignoring the first half of the list and using only the second half of the list (i.e., going to the middle entry of the second half of the file). Otherwise, go to step 1, this time ignoring the second half of the list and using only the first half.

These steps are repeated until you come to the location where kangaroo resides, or until you have exhausted the list without finding your kangaroo.

Each cycle of searching cuts the size of the list in half. Hence, a search through a sorted list of 1024 items would involve, at most, 10 cycles through the two-step algorithm (because 1024 = 2^10).

Every computer science student is expected to write her own binary search script. Here is a simple script, binary.py, that does six look-ups through a sorted numeric list, reporting on which items are found, and which items are not.

def Search(search_list, search_item):
    # classic binary search over a sorted list; returns True if the item is found
    first_item = 0
    last_item = len(search_list) - 1
    found = False
    while (first_item <= last_item) and not found:
        middle = (first_item + last_item) // 2
        if search_list[middle] == search_item:
            found = True
        else:
            if search_item < search_list[middle]:
                last_item = middle - 1      # discard the upper half of the range
            else:
                first_item = middle + 1     # discard the lower half of the range
    return found

sorted_list = [4, 5, 8, 15, 28, 29, 30, 45, 67, 82, 99, 101, 1002]
for item in [3, 7, 28, 31, 45, 1002]:
    print(Search(sorted_list, item))
output:
False
False
True
False
True
True

Let us say, just for fun, we wanted to search through a sorted list of every atom in the universe. First, we would take each atom in the universe and assign it a location. Then we would sort the locations based on their distances from the center of the universe, which is apparently located at the tip of my dog's left ear. We could then substitute the sorted atom list for the sorted_list in the binary.py script, shown above.

How long would it take to search all the atoms of the universe, using the binary.py script? As it happens, we could find the list location for any atom in the universe, almost instantly. The reason is that there are only about 2^260 atoms in the known universe. This means that the algorithm would require, at the very most, 260 2-step cycles. Each cycle is very fast, requiring only that we compare the search atom's distance from my dog's ear, against the middle atom of the list.

Of course, composing the list of atom locations may pose serious difficulties, and we might need another universe, much larger than our own, to hold the sorted list that we create. Nonetheless, a valid point emerges; that binary searches are fast, and the time to completion of a binary search is not significantly lengthened by any increase in the number of items in the list. Had we chosen, we could have annotated the items of sorted_list with any manner of information (e.g., locations in a file, nomenclature code, links to web addresses, definitions of the items, metadata), so that our binary searches would yield something more useful than the location of the item in the list.
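
For annotated look-ups of this kind, Python's standard bisect module provides the binary search; here is a brief sketch, with hypothetical entries and annotations:

# Sketch: binary search over a sorted list of (term, annotation) tuples.
import bisect

sorted_entries = sorted([
    ("jaguar", "word 381 of animals.txt"),
    ("kangaroo", "word 12 of animals.txt"),
    ("wallaby", "word 93 of animals.txt"),
])
keys = [term for term, annotation in sorted_entries]
position = bisect.bisect_left(keys, "kangaroo")     # binary search on the keys
if position < len(keys) and keys[position] == "kangaroo":
    print(sorted_entries[position])     # ('kangaroo', 'word 12 of animals.txt')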

Section 2.7. Case Study (Advanced): A Complete Autocoder (in 12 Lines of Python Code)

Software is a gas; it expands to fill its container.

Nathan Myhrvold

This script requires two external files:

  1.  The nomenclature file that will be converted into a Python dictionary, wherein each term is a dictionary key, and each nomenclature code is a value assigned to a term. [Glossary Dictionary]
    Here are a few sample lines from the nomenclature file (nomenclature_dict.txt, in this case):
    oropharyngeal adenoid cystic adenocarcinoma , C6241000
    peritoneal mesothelioma , C7633000
    benign tumour arising from the exocrine pancreas , C4613000
    basaloid penile squamous cell cancer , C6980000
    cns malignant soft tissue tumor , C6758000
    digestive stromal tumour of stomach , C5806000
    bone with malignancy , C4016000
    benign mixed tumor arising from skin , C4474000
  2.  The file containing a corpus of sentences that will be autocoded by the script.
    Here are a few sample lines from the corpus file (tumorabs.txt, in this case):
    local versus diffuse recurrences of meningiomas factors correlated to the extent of the recurrence
    the effect of an unplanned excision of a soft tissue sarcoma on prognosis
    obstructive jaundice associated burkitt lymphoma mimicking pancreatic carcinoma
    efficacy of zoledronate in treating persisting isolated tumor cells in bone marrow in patients with breast cancer a phase ii pilot study
    metastatic lymph node number in epithelial ovarian carcinoma does it have any clinical significance
    extended three dimensional impedance map methods for identifying ultrasonic scattering sites
    aberrant expression of connexin 26 is associated with lung metastasis of colorectal cancer

The 19-line Python script, autocode.txt, produces a sentence-by-sentence list of extracted autocoded terms:

outfile = open("autocoded.txt", "w")
literalhash = {}
with open("nomenclature_dict.txt") as f:
    for line in f:
        (key, val) = line.split(" , ")
        literalhash[key] = val          # val keeps its trailing newline
corpus_file = open("tumorabs.txt", "r")
for line in corpus_file:
    sentence = line.rstrip()
    outfile.write("\n" + sentence[0].upper() + sentence[1:] + "." + "\n")
    sentence_array = sentence.split(" ")
    length = len(sentence_array)
    for i in range(length):
        for place_length in range(len(sentence_array)):
            last_element = place_length + 1
            phrase = ' '.join(sentence_array[0:last_element])
            if phrase in literalhash:
                outfile.write(phrase + " " + literalhash[phrase])
        sentence_array.pop(0)

The first seven lines of code are housekeeping chores, in which the external nomenclature is loaded into a Python dictionary (literalhash, in this case), an external file composed of lines, with one sentence on each line, is opened and prepared for reading, and another external file, autocoded.txt, is created to accept the script's output. We will not count these first seven lines as belonging to our autocoder because, in all fairness, they are not doing any of the work of autocoding. The meat of the script is the next twelve lines, beginning with “for line in corpus_file.”

Here is a sample of the output:

Obstructive jaundice associated burkitt lymphoma mimicking pancreatic
carcinoma.
  burkitt lymphoma C7188000
  lymphoma C7065000
  pancreatic carcinoma C3850000
  carcinoma C2000000
Littoral cell angioma of the spleen.
  littoral cell angioma C8541100
  littoral cell angioma of the spleen C8541100
  angioma C3085000
  angioma of the spleen C8541000
Isolated b cell lymphoproliferative disorder at the dura mater with b cell chronic lymphocytic leukemia immunophenotype.
  lymphoproliferative disorder C4727100
  b cell chronic lymphocytic leukemia C3163000
  chronic lymphocytic leukemia C3163000
  lymphocytic leukemia C7539000
  leukemia C3161000

By observing a few samples of autocoded lines of text, we can see that the autocoder extracts every cancer term and supplies its nomenclature code, regardless of whether the term is contained within a longer term.

For example, the autocoder managed to find four terms within the sentence “Littoral cell angioma of the spleen,” these being: littoral cell angioma, littoral cell angioma of the spleen, angioma, and angioma of the spleen. The ability to extract every valid term, even those subsumed by larger terms, guarantees that a query term and all its synonyms will always be retrieved, if the query term happens to be a valid nomenclature term.

This short autocoding script comes with a few advantages that are of particular interest to Big Data professionals:

  •   Scalable to any size

All nomenclatures are small. Most of us have a working vocabulary of a few thousand words. Most dictionaries are also small, containing maybe 60,000 words. The most extreme case of verbiage about verbiage is the 20-volume Oxford English Dictionary, which contains about 170,000 entries. Even in this case, slurping the entire list of Oxford English Dictionary entries would be a simple matter for any modern computer.

Most importantly, the autocoding algorithm imposes no limits on the size of the Big Data corpus. The software proceeds line-by-line until the task is complete. Memory requirements and other issues of scalability are not a problem.

  •   Fast

On my modest desktop computer, the 12-line autocoding algorithm processes text at the rate of 1 megabyte every two seconds. A fast and powerful computer, using the same algorithm, would be expected to parse at rates of 1 gigabyte of text per second, or greater.

  •   Repeatable

Code a gigabyte of data in the morning. Do it all over again in the afternoon. Use another version of the nomenclature, or use a different nomenclature, entirely. Recoding is not a problem.

  •   Simple and adaptable, with easily maintained code

The larger the program, the more difficult it is to find bugs, or to recover from errors produced when the code is modified. It is nearly impossible to inflict irreversible damage upon a simple, 12-line script. As a general rule, tiny scripts are seldom a problem if you maintain records of where the scripts are located, how the scripts are used, and how the scripts are modified over time.

  •   Reveals the dirty little secret that every programmer knows, but few are willing to admit.

Virtually all useful algorithms can be implemented in a few lines of code; autocoders are no exception. The thousands, or millions, of lines of code in just about any commercial software application are devoted, in one way or another, to the graphic user interface.

Section 2.8. Case Study: Concordances as Transformations of Text

Interviewer: Is there anything from home that you brought over with you to set up for yourself? Creature comforts?

Hawkeye: I brought a book over.

Interviewer: What book?

Hawkeye: The dictionary. I figure it's got all the other books in it.

Interview with the character Hawkeye, played by Alan Alda, from the television show M*A*S*H

A transform is a mathematical operation that takes a function, a signal, or a set of data and changes it into something else that is easier to work with than the original data. The concept of the transform is a simple but important idea that has revolutionized many scientific fields including electrical engineering, digital signal processing, and data analysis. In the field of digital signal processing, data in the time domain (i.e., wherein the amplitude of a measurement varies over time, as in a signal) is commonly transformed into the frequency domain (i.e., wherein the original data can be assigned to amplitude values for a range of frequencies). There are dozens, possibly hundreds, of mathematical transforms that enable data analysts to move signal data between forward transforms (e.g., time domain to frequency domain) and their inverse counterparts (e.g., frequency domain to time domain). [Glossary Transform, Signal, Digital signal, Digital Signal Processing, DSP, Fourier transform, Burrows-Wheeler transform]

A concordance is a transform for text. It takes a linear text and transforms it into a list of words and the locations at which they occur; the transform can be reversed, as needed, to recover the original text. Like any good transform, we can expect to find circumstances when it is easier to perform certain types of operations on the transformed data than on the original data. [Glossary Concordance]

Here is an example, from the Python script proximate_words.py, where we use a concordance to list the words in close proximity to the concordance entries (i.e., the words contained in the text). In this script, we use the previously constructed (vide supra) concordance of the Gettysburg address.

import string
infile = open("concordance.txt", "r")
places = []
word_array = []
concordance_hash = {}
words_hash = {}
# read the concordance: words_hash maps each word to its list of locations,
# and concordance_hash maps each location back to its word
for line in infile:
    line = line.rstrip()
    line_array = line.split(" ")
    word = line_array[0]
    places = line_array[1]
    places_array = places.split(",")
    words_hash[word] = places_array
    for word_position in places_array:
        concordance_hash[word_position] = word
# for each word, print the word and the four words that follow each occurrence
for k, v in words_hash.items():
    print(k, end=" -\n")
    for items in v:
        n = 0
        while n < 5:
            nextone = str(int(items) + n)
            if nextone in concordance_hash:
                print(concordance_hash[nextone], end=" ")
            n = n + 1
        print()
    print()

The script produces a list of the words from the Gettysburg address, along with short sequences of the text that follow each occurrence of the word in the text, as shown in this sampling from the output file:

to -
to the proposition that all
to dedicate a portion of
to add or detract. The
to be dedicated here to
to the unfinished work which
to be here dedicated to
to the great task remaining
to that cause for which
dedicated -
dedicated to the proposition that
dedicated can long endure. We
dedicated here to the unfinished
dedicated to the great task

Inspecting some of the output, we see that the word “to” appears 8 times in the Gettysburg address. We used the concordance to reconstruct each occurrence of the word “to” along with the four words that follow it in the text. Likewise, we see that the word “dedicated” occurs 4 times in the text, and the concordance tells us the four words that follow at each of the locations where “dedicated” appears. We can construct these proximity phrases very quickly, because the concordance tells us the exact location of the words in the text. If we were working from the original text, instead of its transform (i.e., the concordance), then our algorithm would run much more slowly, because each word would need to be individually found and retrieved, by parsing every word in the text, sequentially.

Section 2.9. Case Study (Advanced): Burrows Wheeler Transform (BWT)

All parts should go together without forcing. You must remember that the parts you are reassembling were disassembled by you. Therefore, if you can't get them together again, there must be a reason. By all means, do not use a hammer.

IBM Manual, 1925

One of the most ingenious transforms in the field of data science is the Burrows Wheeler transform. Imagine an algorithm that takes a corpus of text and creates an output string consisting of a transformed text combined with its own word index, in a format that can be compressed to a smaller size than the compressed original file. The Burrows Wheeler Transform does all this, and more [9,10]. A clever informatician may find many ways to use the BWT transform in search and retrieval algorithms and in data merging projects [11]. Using the BWT file, you can re-compose the original file, or you can find any portion of a file preceding or following any word from the file [12]. [Glossary Data merging, Data fusion]

Excellent discussions of the algorithm are available, along with implementations in several languages [9,10,13]. The Python script, bwt.py, shown here, is a modification of a script available on Wikipedia [13]. The script executes the BWT algorithm in just three lines of code. In this example, the input string is an excerpt from Lincoln's Gettysburg address [12].

input = "four score and seven years ago our fathers brought forth upon"
input = input + " this continent a new nation conceived in liberty and"
input = input + "\0"    # append a sentinel (end-of-text) character
table = sorted(input[i:] + input[:i] for i in range(len(input)))
last_column = [row[-1:] for row in table]
print("".join(last_column))

Here is the transformed output:

dtsyesnsrtdnwaordnhn efni n snenryvcvnhbsn uatttgl tthe oioe oaai eogipccc
fr fuuuobaeoerri nhra naro ooieet

Admittedly, the output does not look like much. Let us juxtapose our input string and our BWT's transform string:

four score and seven years ago our fathers brought forth upon this continent a new nation conceived in liberty and
dtsyesnsrtdnwaordnhn efni n snenryvcvnhbsn uatttgl tthe oioe oaai eogipcccfr fuuuobaeoerri nhra naro ooieet

We see that the input string and the transformed output string both have the same length, so there doesn't seem to be any obvious advantage to the transform. If we look a bit closer, though, we see that the output string consists largely of runs of repeated individual characters, repeated substrings, and repeated spaces (e.g., “ttt” “uuu”). These frequent repeats in the transform facilitate compression algorithms that hunt for repeat patterns. BWT's facility for creating runs of repeated characters accounts for its popularity in compression software (e.g., the Bunzip compression utility).
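
A rough way to see the effect is to count the positions at which a character simply repeats its predecessor, in the original string and in its transform; here is a minimal sketch, re-computing the transform with the sentinel character appended, as in the bwt.py script above:

# Sketch: compare adjacent-character repeats in the original string and its BWT.
text = ("four score and seven years ago our fathers brought forth upon"
        " this continent a new nation conceived in liberty and" + "\0")
table = sorted(text[i:] + text[:i] for i in range(len(text)))
bwt = "".join(row[-1] for row in table)

def adjacent_repeats(s):
    # count positions where a character is identical to the character before it
    return sum(1 for a, b in zip(s, s[1:]) if a == b)

print(adjacent_repeats(text), adjacent_repeats(bwt))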

The Python script, bwt_inverse.py, computes the inverse BWT to re-construct the original input string. Notice that the inverse algorithm is implemented in just the last four lines of the Python code (the first five lines re-create the forward BWT transform) [12].

input = "four score and seven years ago our fathers brought forth upon"
input = input + " this continent a new nation conceived in liberty and"
input = input + ""
table = sorted(input[i:] + input[:i] for i in range(len(input)))
last_column = [row[-1:] for row in table]
#The first lines re-created the bwt transform
#The next four lines compute the inverse transform
table = [""] ⁎ len(last_column)
for i in range(len(last_column)):
  table = sorted(last_column[i] + table[i] for i in range(len(input)))
print([row for row in table if row.endswith("")][0])

As we would expect, the output of the bwt_inverse.py script is our original input string:

four score and seven years ago our fathers brought forth upon this continent a new nation conceived in liberty and

The charm of the BWT is demonstrated when we create an implementation that parses the input string word-by-word, not character-by-character.

Here is the Python script, bwt_trans_inv.py, that transforms an input string word-by-word, producing its transform, and then reverses the process to yield the original string as an array of words. As an extra feature, the script prints the first column of the transform table, as an array [12]. [Glossary Numpy]

import numpy as np
input = "\x00 four score and seven years ago our fathers brought forth upon"
input = input + " this continent a new nation conceived in liberty and"
word_list = input.rsplit()
#The next three lines compute the word-by-word bwt transform (and its first column)
table = sorted(word_list[i:] + word_list[:i] for i in range(len(word_list)))
last_column = [row[-1:] for row in table]
first_column = [row[:1] for row in table]
print("First column of the transform table:\n" + str(first_column) + "\n")
#The next four lines compute the inverse transform
table = [""] * len(last_column)
for i in range(len(last_column)):
    table = sorted(str(last_column[j]) + " " + str(table[j]) for j in range(len(word_list)))
original = [row for row in table][0]
print("Inverse transform, as a word array:\n" + str(original))

Here is the output of the bwt_trans_inv.py script. Notice once more that the word-by-word transform was implemented in three lines of code, and the inverse transform was implemented in four lines of code.

First column of the transform table:
[['\x00'], ['a'], ['ago'], ['and'], ['and'], ['brought'], ['conceived'], ['continent'], ['fathers'], ['forth'], ['four'], ['in'], ['liberty'], ['nation'], ['new'], ['our'], ['score'], ['seven'], ['this'], ['upon'], ['years']]
Inverse transform, as a word array:
['\x00'] ['four'] ['score'] ['and'] ['seven'] ['years'] ['ago'] ['our'] ['fathers'] ['brought'] ['forth'] ['upon'] ['this'] ['continent'] ['a'] ['new'] ['nation'] ['conceived'] ['in'] ['liberty'] ['and']

The first column of the transform, created in the forward BWT, is a list of the words in the input string, in alphabetic order. Notice that words that occurred more than once in the input text are repeated in the first column of the transform table (i.e., ['and'], ['and'] in the example sentence). Hence, the transform yields all the words from the original input, along with their frequency of occurrence in the text. As expected, the inverse of the transform yields our original input string.

Glossary

Accuracy versus precision Accuracy measures how close your data comes to being correct. Precision provides a measurement of reproducibility (i.e., whether repeated measurements of the same quantity produce the same result). Data can be accurate but imprecise. If you have a 10 pound object, and you report its weight as 7.2376 pounds, on every occasion when the object is weighed, then your precision is remarkable, but your accuracy is dismal.

Algorithm An algorithm is a logical sequence of steps that lead to a desired computational result. Algorithms serve the same function in the computer world as production processes serve in the manufacturing world and as pathways serve in the world of biology. Fundamental algorithms can be linked to one another, to create new algorithms (just as biological pathways can be linked). Algorithms are the most important intellectual capital in computer science. In the past half century, many brilliant algorithms have been developed for the kinds of computation-intensive work required for Big Data analysis [14,15].

Binary data Computer scientists say that there are 10 types of people: those who think in terms of binary numbers, and those who do not. Pause for laughter and continue. All digital information is coded as binary data. Strings of 0s and 1s are the fundamental units of electronic information. Nonetheless, some data is more binary than other data. In text files, 8-bit sequences are converted into decimals in the range of 0–255, and these decimal numbers are converted into characters, as determined by the ASCII standard. In several raster image formats (i.e., formats consisting of rows and columns of pixel data), 24-bit pixel values are chopped into red, green and blue values of 8 bits each. Files containing various types of data (e.g., sound, movies, telemetry, formatted text documents) all have some kind of low-level software that takes strings of 0s and 1s and converts them into data that has some particular meaning for a particular use. So-called plain-text files, including HTML files and XML files, are distinguished from binary data files and referred to as plain-text or ASCII files. Most computer languages have an option wherein files can be opened as “binary,” meaning that the 0s and 1s are available to the programmer, without the intervening translation into characters or stylized data.

Burrows-Wheeler transform Abbreviated as BWT, the Burrows-Wheeler transform produces a compressed version of an original file, along with a concordance to the contents of the file. Using a reverse BWT, you can reconstruct the original file, or you can find any portion of a file preceding or succeeding any location in the file. The BWT transformation is an amazing example of simplification, applied to informatics. A detailed discussion of the BWT is found in Section 2.9, “Case Study (Advanced): Burrows Wheeler Transform.”

Class A class is a group of objects that share a set of properties that define the class and that distinguish the members of the class from members of other classes. The word “class,” lowercase, is used as a general term. The word “Class,” uppercase, followed by an uppercase noun (e.g., Class Animalia), represents a specific class within a formal classification.

Classification A system in which every object in a knowledge domain is assigned to a class within a hierarchy of classes. The properties of superclasses are inherited by the subclasses. Every class has one immediate superclass (i.e., parent class), although a parent class may have more than one immediate subclass (i.e., child class). Objects do not change their class assignment in a classification, unless there was a mistake in the assignment. For example, a rabbit is always a rabbit, and does not change into a tiger. Classifications can be thought of as the simplest and most restrictive type of ontology, and serve to reduce the complexity of a knowledge domain [16].
Classifications can be easily modeled in an object-oriented programming language and are non-chaotic (i.e., calculations performed on the members and classes of a classification should yield the same output, each time the calculation is performed). A classification should be distinguished from an ontology. In an ontology a class may have more than one parent class and an object may be a member of more than one class. A classification can be considered a special type of ontology wherein each class is limited to a single parent class and each object has membership in one and only one class.

Coding The term “coding” has several very different meanings depending on which branch of science influences your thinking. For programmers, coding means writing the code that constitutes a computer program. For cryptographers, coding is synonymous with encrypting (i.e., using a cipher to encode a message). For medics, coding is calling an emergency team to handle a patient in extremis. For informaticians and library scientists, coding involves assigning an alphanumeric identifier, representing a concept listed in a nomenclature, to a term. For example, a surgical pathology report may include the diagnosis, “Adenocarcinoma of prostate.” A nomenclature may assign a code C4863000 that uniquely identifies the concept “Adenocarcinoma.” Coding the report may involve annotating every occurrence of the word “Adenocarcinoma” with the “C4863000” identifier. For a detailed explanation of coding, and its importance for searching and retrieving data, see the full discussion in Section 3.4, “Autoencoding and Indexing with Nomenclatures.”

Concordance A concordance is an index consisting of every word in the text, along with every location wherein each word can be found. It is computationally trivial to reconstruct the original text from the concordance. Before the advent of computers, concordances fell within the province of religious scholars, who painstakingly recorded the locations of all the words appearing in the Bible, ancient scrolls, and any texts whose words were considered to be divinely inspired. Today, a concordance for a Bible-length book can be constructed in about a second. Furthermore, the original text can be reconstructed from the concordance, in about the same time.
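As a minimal sketch (not drawn from the chapter's scripts), the following lines build a concordance as a Python dictionary of word positions and then rebuild the original word sequence from it; the variable names are arbitrary.
      text = "we cannot dedicate we cannot consecrate we cannot hallow this ground"
      concordance = {}
      for position, word in enumerate(text.split()):
          concordance.setdefault(word, []).append(position)
      print(concordance)
      #{'we': [0, 3, 6], 'cannot': [1, 4, 7], 'dedicate': [2], ...}
      #reconstruction: place each word back at every one of its recorded positions
      rebuilt = {}
      for word, positions in concordance.items():
          for position in positions:
              rebuilt[position] = word
      print(" ".join(rebuilt[i] for i in range(len(rebuilt))))
      #we cannot dedicate we cannot consecrate we cannot hallow this ground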

Curator The word “curator” derives from the Latin “curatus,” the same root as “curative,” indicating that curators “take care of” things. A data curator collects, annotates, indexes, updates, archives, searches, retrieves, and distributes data. Curator is another of those somewhat arcane terms (e.g., indexer, data archivist, lexicographer) that are being rejuvenated in the new millennium. It seems that if we want to enjoy the benefits of a data-centric world, we will need the assistance of curators, trained in data organization.

DSP Abbreviation for Digital Signal Processing.

Data fusion Data fusion is very closely related to data integration. The subtle difference between the two concepts lies in the end result. Data fusion creates a new and accurate set of data representing the combined data sources. Data integration is an on-the-fly usage of data pulled from different domains and, as such, does not yield a residual fused set of data.

Data integration The process of drawing data from different sources and knowledge domains in a manner that uses and preserves the identities of data objects and the relationships among the different data objects. The term “integration” should not be confused with a closely related term, “interoperability.” An easy way to remember the difference is to note that integration applies to data; interoperability applies to software.

Data merging A nonspecific term that includes data fusion, data integration, and any methods that facilitate the accrual of data derived from multiple sources.

Dictionary In general usage a dictionary is a word list accompanied by a definition for each item. In Python a dictionary is a data structure that holds an unordered list of key/value pairs. A dictionary, as used in Python, is equivalent to an associative array, as used in Perl.
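For example (an illustration only, not drawn from the chapter's scripts):
      #a Python dictionary holds key/value pairs, retrievable by key
      glossary = {"parsing": "stepping through data item by item",
                  "concordance": "an index of every word and its locations"}
      glossary["ngram"] = "an ordered subsequence of words"   #add a new key/value pair
      print(glossary["concordance"])   #an index of every word and its locations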

Digital Signal Processing Digital Signal Processing (DSP) is the field that deals with creating, transforming, sending, receiving, and analyzing digital signals. Digital signal processing began as a specialized subdiscipline of the larger field of signal processing. For most of the twentieth century, many technologic advances came from converting non-electrical signals (temperature, pressure, sound, and other physical signals) into electric signals that could be carried via electromagnetic waves, and later transformed back into physical actions. Because electromagnetic waves sit at the center of so many transform processes, even in instances when the inputs and outputs are non-electrical in nature, the fields of electrical engineering and signal processing have paramount importance in every field of engineering. In the past several decades the intermediate signals have been moved from the analog domain (i.e., waves) into the digital realm (i.e., digital signals expressed as streams of 0s and 1s). Over the years, as techniques have developed by which any kind of signal can be transformed into a digital signal, the subdiscipline of digital signal processing has subsumed virtually all of the algorithms once consigned to its parent discipline. In fact, as more and more processes have been digitized (e.g., telemetry, images, audio, sensor data, communications theory), the field of digital signal processing has come to play a central role in data science.

Digital signal A signal is a description of how one parameter varies with some other parameter. The most familiar signals involve some parameter varying over time (e.g., sound is air pressure varying over time). When the amplitude of a parameter is sampled at intervals, producing successive pairs of values, the signal is said to be digitized.

Fourier transform A transform is a mathematical operation that takes a function or a time series (e.g., values obtained at intervals of time) and transforms it into something else. An inverse transform takes the transform function and produces the original function (Fig. 2.1). Transforms are useful when there are operations that can be more easily performed on the transformed function than on the original function. Possibly the most useful transform is the Fourier transform, which can be computed with great speed on modern computers, using a modified form known as the fast Fourier Transform. Periodic functions and waveforms (periodic time series) can be transformed using this method. Operations on the transformed function can sometimes eliminate repeating artifacts or frequencies that occur below a selected threshold (e.g., noise). The transform can be used to find similarities between two signals. When the operations on the transform function are complete, the inverse of the transform can be calculated and substituted for the original set of data (Fig. 2.2).
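As a minimal sketch of the round trip described above, using NumPy's FFT routines (the signal, the 5 Hz frequency, and the amplitude threshold are arbitrary choices for illustration):
      import numpy as np
      t = np.linspace(0.0, 1.0, 256, endpoint=False)                    #one second of time, 256 samples
      signal = np.sin(2 * np.pi * 5 * t) + 0.2 * np.random.randn(256)   #a 5 Hz tone plus noise
      spectrum = np.fft.fft(signal)                                     #forward (fast) Fourier transform
      spectrum[np.abs(spectrum) < 10] = 0                               #suppress low-amplitude (noise) components
      cleaned = np.fft.ifft(spectrum).real                              #inverse transform, back to a time series
      print(np.round(cleaned[:5], 3))                                   #first few samples of the de-noised signal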

Fig. 2.1 The Fourier transform and its inverse. In this representation of the transform, x represents time in seconds and the transform variable zeta represents frequency in hertz.
Fig. 2.2 A square wave is approximated by a single sine wave, the sum of two sine waves, three sine waves, and so on. As more components are added, the representation of the original signal, or periodic set of data, is more closely approximated. From Wikimedia Commons.

Identifier A string that is associated with a particular thing (e.g., person, document, transaction, data object), and not associated with any other thing [17]. In the context of Big Data, identification usually involves permanently assigning a seemingly random sequence of numeric digits (0–9) and alphabet characters (a-z and A-Z) to a data object. The data object can be a class of objects.
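In Python, for example, the standard uuid module generates identifier strings of this kind (a two-line illustration, not a recommendation of any particular identifier scheme):
      import uuid
      print(uuid.uuid4().hex)   #e.g., 9f1c6b2e4a7d4f0bb2d3a8c1e5f67a90; a different string on every call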

Indexes Every writer must search deeply into his or her soul to find the correct plural form of “index”. Is it “indexes” or is it “indices”? Latinists insist that “indices” is the proper and exclusive plural form. Grammarians agree, reserving “indexes” for the third person singular verb form; “The student indexes his thesis.” Nonetheless, popular usage of the plural of “index,” referring to the section at the end of a book, is almost always “indexes,” the form used herein.

Interoperability It is desirable and often necessary to create software that operates with other software, regardless of differences in hardware, operating systems and programming language. Interoperability, though vital to Big Data science, remains an elusive goal.

Machine translation Ultimately, the job of machine translation is to translate text from one language into another language. The process of machine translation begins with extracting sentences from text, parsing the words of the sentence into grammatical parts, and arranging the grammatical parts into an order that imposes logical sense on the sentence. Once this is done, each of the parts can be translated by a dictionary that finds equivalent terms in a foreign language, then re-assembled as a foreign language sentence by applying grammatical positioning rules appropriate for the target language. Because these steps apply the natural rules for sentence constructions in a foreign language, the process is often referred to as natural language machine translation. It is important to note that nowhere in the process of machine translation is it necessary to find meaning in the source text, or to produce meaning in the output. Good machine translation algorithms preserve ambiguities, without attempting to impose a meaningful result.

Natural language processing A field broadly concerned with how computers interpret human language (i.e., machine translation). At its simplest level this may involve parsing through text and organizing the grammatical units of individual sentences (i.e., tokenization). For example, we might assign the following tokens to the grammatical parts of a sentence: A = adjective, D = determiner, N = noun, P = preposition, V = main verb. A determiner is a word such as “a” or “the”, which specifies the noun [18].
Consider the sentence, “The quick brown fox jumped over the lazy dog.” This sentence can be grammatically tokenized as:
       the::D
       quick::A
       brown::A
       fox::N
       jumped::V
       over::P
       the::D
       lazy::A
       dog::N
We can express the sentence as the sequence of its tokens listed in the order of occurrence in the sentence: DAANVPDAN. This does not seem like much of a breakthrough, but imagine having a large collection of such token sequences representing every sentence from a large text corpus. With such a data set, we could begin to understand the rules of sentence structure. Commonly recurring sequences, like DAANVPDAN, might be assumed to be proper sentences. Sequences that occur uniquely in a large text corpus are probably poorly constructed sentences. Before long, we might find ourselves constructing logic rules for reducing the complexity of sentences by dropping subsequences which, when removed, yield a sequence that occurs more commonly than the original sequence. For example, our table of sequences might indicate that we can convert DAANVPDAN into NVPAN (i.e., “Fox jumped over lazy dog”), without sacrificing too much of the meaning from the original sentence and preserving a grammatical sequence that occurs commonly in the text corpus.
This short example serves as an overly simplistic introduction to natural language processing. We can begin to imagine that the grammatical rules of a language can be represented by sequences of tokens that can be translated into words or phrases from a second language, and re-ordered according to grammatical rules appropriate to the target language. Many natural language processing projects involve transforming text into a new form, with desirable properties (e.g., other languages, an index, a collection of names, a new text with words and phrases replaced with canonical forms extracted from a nomenclature) [18]. When we use natural language rules to autocode text, the grammatical units are trimmed, reorganized, and matched against concept equivalents in a nomenclature.
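As a minimal sketch of the tokenization step (the tag table below is a hand-made stand-in for a real part-of-speech tagger):
      tag_table = {"the": "D", "quick": "A", "brown": "A", "lazy": "A",
                   "fox": "N", "dog": "N", "jumped": "V", "over": "P"}
      sentence = "the quick brown fox jumped over the lazy dog"
      token_sequence = "".join(tag_table[word] for word in sentence.split())
      print(token_sequence)   #DAANVPDAN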

Ngrams Ngrams are subsequences of text, of length n words. A complete collection of ngrams consists of all of the possible ordered subsequences of words in a text. Because sentences are the basic units of statements and ideas, when we speak of ngrams, we are confining ourselves to ngrams of sentences. Let us examine all the ngrams for the sentence, “Ngrams are ordered word sequences.”
Ngrams (1-gram)
are (1-gram)
ordered (1-gram)
word (1-gram)
sequences (1-gram)
Ngrams are (2-gram)
are ordered (2-gram)
ordered word (2-gram)
word sequences (2-gram)
Ngrams are ordered (3-gram)
are ordered word (3-gram)
ordered word sequences (3-gram)
Ngrams are ordered word (4-gram)
are ordered word sequences (4-gram)
Ngrams are ordered word sequences (5-gram)
Here is a short Python script, ngram.py, that will take a sentence and produce a list of all the contained ngrams.
text = "ngrams are ordered word sequences"
partslist = []
ngramlist = {}
text_list = text.split(" ")
#collect every suffix of the sentence (each pass drops the first word)
while(len(text_list) > 0):
    partslist.append(" ".join(text_list))
    del text_list[0]
#collect every prefix of every suffix; together, these are all of the ngrams
for part in partslist:
    previous = ""
    wordlist = part.split(" ")
    while(len(wordlist) > 0):
        ngramlist[(" ".join(wordlist))] = ""
        firstword = wordlist[0]
        del wordlist[0]
        ngramlist[firstword] = ""
        previous = previous + " " + firstword
        previous = previous.strip()
        ngramlist[previous] = ""
for key in sorted(ngramlist):
    print(key)
output:
are
are ordered
are ordered word
are ordered word sequences
ngrams
ngrams are
ngrams are ordered
ngrams are ordered word
ngrams are ordered word sequences
ordered
ordered word
ordered word sequences
sequences
word
word sequences
The ngram.py script can be easily modified to parse through all the sentences of any text, regardless of length, building the list of ngrams as it proceeds.
Google has collected ngrams from scanned literature dating back to 1500. The public can enter their own ngrams into Google's ngram viewer, and receive a graph of the published occurrences of the phrase, through time [18]. We can use the Ngram viewer to find trends (e.g., peaks, valleys and periodicities) in data. Consider the Google Ngram Viewer results for the two-word ngram, “yellow fever” (Fig. 2.3).
We see that the term “yellow fever” (a mosquito-transmitted hepatitis) appeared in the literature beginning about 1800, with several subsequent peaks. The dates of the peaks correspond roughly to outbreaks of yellow fever in Philadelphia (epidemic of 1793), in New Orleans (epidemic of 1853), with United States construction efforts in the Panama Canal (1904–14), and with well-documented WWII Pacific outbreaks (about 1942). Following the 1942 epidemic, an effective vaccine became available, and the incidence of yellow fever, as well as the literature occurrences of the “yellow fever” n-gram, dropped precipitously. In this case, a simple review of n-gram frequencies provides an accurate chart of historic yellow fever outbreaks [19,18].

Fig. 2.3 Google Ngram for the phrase “yellow fever,” counting occurrences of the term in a large corpus, from the years 1700–2000. Peaks roughly correspond to yellow fever epidemics. Source: Google Ngram viewer, with permission from Google.

Nomenclature A nomenclature is a listing of terms that cover all of the concepts in a knowledge domain. A nomenclature is different from a dictionary for three reasons: 1) the nomenclature terms are not annotated with definitions, 2) nomenclature terms may be multi-word, and 3) the terms in the nomenclature are limited to the scope of the selected knowledge domain. In addition, most nomenclatures group synonyms under a group code. For example, a food nomenclature might collect submarine sandwich, hoagie, po' boy, grinder, hero, and torpedo under an alphanumeric code such as “F63958.” Nomenclatures simplify textual documents by uniting synonymous terms under a common code. Documents that have been coded with the same nomenclature can be integrated with other documents that have been similarly coded, and queries conducted over such documents will yield the same results, regardless of which term is entered (i.e., a search for either hoagie or po' boy will retrieve the same information, if both terms have been annotated with the synonym code, “F63958”). Optimally, the canonical concepts listed in the nomenclature are organized into a hierarchical classification [20,21,12].
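As a minimal sketch of synonym coding (the term list and the regular-expression approach are illustrative only; they are not the book's autocoding method):
      import re
      synonym_code = {"submarine sandwich": "F63958", "hoagie": "F63958", "po' boy": "F63958",
                      "grinder": "F63958", "hero": "F63958", "torpedo": "F63958"}
      text = "The deli sells a hoagie that rivals any po' boy in town"
      for term, code in synonym_code.items():
          text = re.sub(r"\b" + re.escape(term) + r"\b", term + " (" + code + ")", text)
      print(text)
      #The deli sells a hoagie (F63958) that rivals any po' boy (F63958) in town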

Nomenclature mapping Specialized nomenclatures employ specific names for concepts that are included in other nomenclatures, under other names. For example, medical specialists often preserve their favored names for concepts that cross into different fields of medicine. The term that pathologists use for a certain benign fibrous tumor of the skin is “fibrous histiocytoma,” a term spurned by dermatologists, who prefer to use “dermatofibroma” to describe the same tumor. As another horrifying example, the names for the physiologic responses caused by a reversible cerebral vasoconstrictive event include: thunderclap headache, Call-Fleming syndrome, benign angiopathy of the central nervous system, postpartum angiopathy, migrainous vasospasm, and migraine angiitis. The choice of term will vary depending on the medical specialty of the physician (e.g., neurologist, rheumatologist, obstetrician). To mitigate the discord among specialty nomenclatures, lexicographers may undertake a harmonization project, in which nomenclatures with overlapping concepts are mapped to one another.

Numpy Numpy (Numerical Python) is an open source extension to Python that supports matrix operations, as well as a rich assortment of mathematical functions. Numpy can be easily downloaded from sourceforge.net: http://sourceforge.net/projects/numpy/. Here is a short Python script, numpy_dot.py, that creates a 3x3 matrix, inverts the matrix, and calculates the dot product of the matrix and its inverted counterpart.
      import numpy
      from numpy.linalg import inv
      a = numpy.array([[1,4,6], [9,15,55], [62,-5, 4]])
      print(a)
      print(inv(a))
      c = numpy.dot(a, inv(a))
      print(numpy.round(c))
The numpy_dot.py script employs numpy, numpy's linear algebra module, numpy's matrix inversion method, and the numpy dot product method. Here is the output of the script, displaying the original matrix, its inversion, and the dot product, which happens to be the identity matrix:
      c:\ftp\py>numpy_dot.py
      [[ 1  4  6]
        [ 9 15 55]
        [62 -5  4]]
      [[  4.19746899e-02  -5.76368876e-03  1.62886856e-02]
        [  4.22754041e-01  -4.61095101e-02  -1.25297582e-04]
        [ -1.22165142e-01  3.17002882e-02  -2.63124922e-03]]
      [[ 1.  0.  0.]
        [ 0.  1.  0.]
        [ 0.  0.  1.]]

Parsing Much of computer programming involves parsing; moving sequentially through a file or some sort of data structure and performing operations on every contained item, one item at a time. For files, this might mean going through a text file line by line, or sentence by sentence. For a data file, this might mean performing an operation on each record in the file. For in-memory data structures, this may mean performing an operation on each item in a list or a tuple or a dictionary.
The parse_directory.py script prints all the file names and subdirectory names in a directory tree.
      import os
      for root, dirs, files in os.walk(".", topdown = False):
           for filename in files:
                  print(os.path.join(root, filename))
           for dirname in dirs:
                  print(os.path.join(root, dirname))

Plain-text Plain-text refers to character strings or files that are composed of the characters accessible to a typewriter keyboard. These files typically have a “.txt” suffix to their names. Plain-text files are sometimes referred to as 7-bit ASCII files because all of the familiar keyboard characters have ASCII values under 128 (i.e., they can be designated in binary with just seven 0s and 1s). In practice, plain-text files exclude 7-bit ASCII symbols that do not code for familiar keyboard characters. To further confuse the issue, plain-text files may contain ASCII characters above 7 bits (i.e., characters from 128 to 255) that represent characters that are printable on computer monitors, such as accented letters.

Plesionymy Nearly synonymous words, or pairs of words that are sometimes synonymous, other times not. For example, the noun forms of “smell” and “odor” are synonymous. As verb forms, “smell” applies, but “odor” does not. You can smell a fish, but you cannot odor a fish. Smell and odor are plesionyms. Plesionymy is another challenge for machine translators.

Polysemy Occurs when a word has more than one distinct meaning. The intended meaning of a word can sometimes be determined by the context in which the word is used. For example, “She rose to the occasion,” and “Her favorite flower is the rose.” Sometimes polysemy cannot be resolved. For example, “Eats shoots and leaves.”

RegEx Short for Regular Expressions, RegEx is a syntax for describing patterns in text. For example, if I wanted to pull all lines from a text file that began with an uppercase “B”, contained at least one integer, and ended with a lowercase x, then I might use the regular expression: “^B.*[0-9].*x$”. This syntax for expressing patterns of strings that can be matched by pre-built methods available to a programming language is somewhat standardized. This means that a RegEx expression in Perl will match the same pattern in Python, or Ruby, or any language that employs RegEx. The relevance of RegEx to Big Data is several-fold. RegEx can be used to build or transform data from one format to another; hence creating or merging data records. It can be used to convert sets of data to a desired format; hence transforming data sets. It can be used to extract records that meet a set of characteristics specified by a user; thus filtering subsets of data or executing data queries over text-based files or text-based indexes. The big drawback to using RegEx is speed: operations that call for many RegEx operations, particularly when those operations are repeated for each parsed line or record, will reduce software performance. RegEx-heavy programs that operate just fine on megabyte files may take hours, days or months to parse through terabytes of data.
A 12-line Python script, file_search.py, prompts the user for the name of a text file to be searched, and then prompts the user to supply a RegEx pattern. The script will parse the text file, line by line, displaying those lines that contain a match to the RegEx pattern.
      import sys, re
      print("What file would you like to search?")
      filename = sys.stdin.readline()
      filename = filename.rstrip()
      print("Enter a word, phrase or regular expression to search.")
      word_to_search = (sys.stdin.readline()).rstrip()
      infile = open (filename, "r")
      regex_object = re.compile(word_to_search, re.I)
      for line in infile:
          m = regex_object.search(line)
          if m:
                print(line)

Scalable Software is scalable if it operates smoothly, whether the data is small or large. Software programs that operate by slurping all data into a RAM variable (i.e., a data holder in RAM memory) are not scalable, because such programs will eventually encounter a quantity of data that is too large to store in RAM. As a rule of thumb, programs that process text at speeds less than a megabyte per second are not scalable, as they cannot cope, in a reasonable time frame, with quantities of data in the gigabyte and higher range.

Script A script is a program that is written in plain-text, in a syntax appropriate for a particular programming language, that needs to be parsed through that language's interpreter before it can be compiled and executed. Scripts tend to run a bit slower than executable files, but they have the advantage that they can be understood by anyone who is familiar with the script's programming language.

Sentence Computers parse files line by line, not sentence by sentence. If you want a computer to perform operations on a sequence of sentences found in a corpus of text, then you need to include a subroutine in your scripts that list the sequential sentences. One of the simplest ways to find the boundaries of sentences is to look for a period followed by one or more spaces, followed by an uppercase letter. Here's a simple Python demonstration of a sentence extractor, using a few famous lines from the Lewis Carroll poem, Jabberwocky.
      import re
      all_text = ("And, hast thou slain the Jabberwock? Come "
                  "to my arms, my beamish boy! O frabjous "
                  "day! Callooh! Callay! He chortled in his "
                  "joy. Lewis Carroll, excerpted from "
                  "Jabberwocky")
      sentence_list = re.split(r'[.!?] +(?=[A-Z])', all_text)
      print("\n".join(sentence_list))
Here is the output:
      And, hast thou slain the Jabberwock
      Come to my arms, my beamish boy
      O frabjous day
      Callooh
      Callay
      He chortled in his joy
      Lewis Carroll, excerpted from Jabberwocky
The meat of the script is the following line of code, which splits lines of text at the boundaries of sentences:
      sentence_list = re.split(r'[.!?] +(?=[A-Z])', all_text)
This algorithm is hardly foolproof, as periods are used for many purposes other than as sentence terminators. But it may suffice for most purposes.

Signal In a very loose sense a signal is a way of gauging how measured quantities (e.g., force, voltage, or pressure) change in response to, or along with, other measured quantities (e.g., time). A sound signal is caused by the changes in pressure, exerted on our eardrums, over time. A visual signal is the change in the photons impinging on our retinas, over time. An image is the change in pixel values over a two-dimensional grid. Because much of the data stored in computers consists of discrete quantities of describable objects, and because these discrete quantities change their values, with respect to one another, we can appreciate that a great deal of modern data analysis is reducible to digital signal processing.

Specification A specification is a method for describing objects (physical objects such as nuts and bolts or symbolic objects such as numbers). Specifications do not require specific types of information, and do not impose any order of appearance of the data contained in the document. Specifications do not generally require certification by a standards organization. They are generally produced by special interest organizations, and their legitimacy depends on their popularity. Examples of specifications are RDF (Resource Description Framework), produced by the W3C (World Wide Web Consortium), and TCP/IP (Transmission Control Protocol/Internet Protocol), maintained by the Internet Engineering Task Force.

String A string is a sequence of characters. Words, phrases, numbers, and alphanumeric sequences (e.g., identifiers, one-way hash values, passwords) are strings. A book is a long string. The complete sequence of the human genome (3 billion characters, with each character an A,T,G, or C) is a very long string. Every subsequence of a string is another string.

Syntax Syntax is the standard form or structure of a statement. What we know as English grammar is equivalent to the syntax for the English language. If I write, “Jules hates pizza,” the statement would be syntactically valid, but factually incorrect. If I write, “Jules drives to work in his pizza,” the statement would be syntactically valid but nonsensical. For programming languages, syntax refers to the enforced structure of command lines. In the context of triplestores, syntax refers to the arrangement and notation requirements for the three elements of a statement (e.g., RDF format or N3 format). Charles Mead succinctly summarized the difference between syntax and semantics: “Syntax is structure; semantics is meaning” [22].

Systematics The term “systematics” is, by tradition, reserved for the field of biology that deals with taxonomy (i.e., the listing of the distinct types of organisms) and with classification (i.e., the classes of organisms and their relationships to one another). There is no reason why biologists should lay exclusive claim to the field of systematics. As used herein, systematics equals taxonomics plus classification, and this term applies just as strongly to stamp collecting, marketing, operations research, and object-oriented programming as it does to the field of biology.

Taxa Plural of taxon.

Taxon A taxon is a class. The common usage of “taxon” is somewhat inconsistent, as it sometimes refers to the class name, and at other times refers to the instances (i.e., members) of the class. In this book, the term “taxon” is abandoned in favor of “class,” the plesionym used by computer scientists. Hence, the term “class” is used herein in the same manner that it is used in modern object oriented programming languages.

Taxonomy When we write of “taxonomy” as an area of study, we refer to the methods and concepts related to the science of classification, derived from the ancient Greek taxis, “arrangement,” and nomia, “method.” When we write of “a taxonomy,” as a construction within a classification, we are referring to the collection of named instances (class members) in the classification. To appreciate the difference between a taxonomy and a classification, it helps to think of taxonomy as the scientific field that determines how different members of a classification are named. Classification is the scientific field that determines how related members are assigned to classes, and how the different classes are related to one another. A taxonomy is similar to a nomenclature; the difference is that in a taxonomy, every named instance must have an assigned class.

Term extraction algorithm Terms are phrases, most often noun phrases, and sometimes individual words, that have a precise meaning within a knowledge domain. For example, “software validation,” “RDF triple,” and “WorldWide Telescope” are examples of terms that might appear in the index or the glossary of this book. The most useful terms might appear up to a dozen times in the text, but when they occur on every page, their value as a searchable item is diminished; there are just too many instances of the term to be of practical value. Hence, terms are sometimes described as noun phrases that have low-frequency and high information content. Various algorithms are available to extract candidate terms from textual documents. The candidate terms can be examined by a curator who determines whether they should be included in the index created for the document from which they were extracted. The curator may also compare the extracted candidate terms against a standard nomenclature, to determine whether the candidate terms should be added to the nomenclature. For additional discussion, see Section 2.3, “Term Extraction.”
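As a crude sketch of the idea (a frequency filter over two-word phrases; real term extractors add stop-word lists, noun-phrase rules, and nomenclature lookups):
      from collections import Counter
      text = ("data object identifiers support data integration because data object "
              "identifiers never change and data object provenance follows the identifiers")
      words = text.split()
      doublets = Counter(" ".join(pair) for pair in zip(words, words[1:]))   #count two-word phrases
      #keep phrases that recur, but not so often that they lose value as index terms
      candidates = [phrase for phrase, count in doublets.items() if 2 <= count <= 5]
      print(candidates)   #['data object', 'object identifiers']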

Thesaurus A vocabulary that groups together synonymous terms. A thesaurus is very similar to a nomenclature. There are two minor differences. Nomenclatures do not always group terms by synonymy; and nomenclatures are often restricted to a well-defined topic or knowledge domain (e.g., names of stars, infectious diseases, etc.).

Transform (noun form) There are three truly great conceptual breakthroughs that have brought with them great advances to science and to civilization. The first two to be mentioned are well known to everyone: equations and algorithms. Equations permit us to relate variable quantities in a highly specific and repeatable way. Algorithms permit us to follow a series of steps that always produce the same results. The third conceptual breakthrough, less celebrated but just as important, is the transformation: a way of changing things to yield something new, with properties that provide an advantage over the original item. In the case of a reversible transformation, we can return the transformed item to its original form, and often in improved condition, when we have completed our task.
It should be noted that this definition applies only to the noun form of “transform.” The meaning of the verb form of transform is to change or modify, and a transformation is the closest noun form equivalent of the verb form, “to transform.”

Uniqueness Uniqueness is the quality of being separable from every other thing in the universe. For data scientists, uniqueness is achieved when data is bound to a unique identifier (i.e., a randomly chosen string of alphanumeric characters) that has not been, and will never be, assigned to any other data. The binding of data to a permanent and inseparable identifier constitutes the minimal set of ingredients for a data object. Uniqueness can apply to two or more indistinguishable objects, if they are assigned unique identifiers (e.g., unique product numbers stamped into identical auto parts).

Variable In algebra, a variable is a quantity, in an equation, that can change; as opposed to a constant quantity, that cannot change. In computer science, a variable can be perceived as a container that can be assigned a value. If you assign the integer 7 to a container named “x,” then “x” equals 7, until you re-assign some other value to the container (i.e., variables are mutable). In most computer languages, when you issue a command assigning a value to a new (undeclared) variable, the variable automatically comes into existence to accept the assignment. The process whereby an object comes into existence, because its existence was implied by an action (such as value assignment), is called reification.

Vocabulary A comprehensive collection of the words used in a general area of knowledge. The term “vocabulary” and the term “nomenclature” are nearly synonymous. In common usage, a vocabulary is a list of words and typically includes a wide range of terms and classes of terms. Nomenclatures typically focus on a class of terms within a vocabulary. For example, a physics vocabulary might contain the terms “quark, black hole, Geiger counter, and Albert Einstein”; a nomenclature might be devoted to the names of celestial bodies.

References

[1] Krauthammer M., Nenadic G. Term identification in the biomedical literature. J Biomed Inform. 2004;37:512–526.

[2] Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

[3] Swanson D.R. Undiscovered public knowledge. Libr Q. 1986;56:103–118.

[4] Wallis E., Lavell C. Naming the indexer: where credit is due. The Indexer. 1995;19:266–268.

[5] Hayes A. VA to apologize for mistaken Lou Gehrig's disease notices. CNN; 2009. August 26. Available from: http://www.cnn.com/2009/POLITICS/08/26/veterans.letters.disease [viewed September 4, 2012].

[6] Hall P.A., Lemoine N.R. Comparison of manual data coding errors in 2 hospitals. J Clin Pathol. 1986;39:622–626.

[7] Berman J.J. Doublet method for very fast autocoding. BMC Med Inform Decis Mak. 2004;4:16.

[8] Berman J.J. Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching. In Silico Biol. 2005;5:0029.

[9] Burrows M., Wheeler D.J. A block-sorting lossless data compression algorithm. SRC Research Report 124; May 10, 1994.

[10] Berman J.J. Perl programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2007.

[11] Healy J., Thomas E.E., Schwartz J.T., Wigler M. Annotating large genomes with exact word matches. Genome Res. 2003;13:2306–2315.

[12] Berman J.J. Data simplification: taming information with open source tools. Waltham, MA: Morgan Kaufmann; 2016.

[13] Burrows-Wheeler transform. Wikipedia. Available at: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform [viewed August 18, 2015].

[14] Cipra B.A. The best of the 20th century: editors name top 10 algorithms. SIAM News. May 2000;33(4).

[15] Wu X., Kumar V., Quinlan J.R., Ghosh J., Yang Q., Motoda H., et al. Top 10 algorithms in data mining. Knowl Inf Syst. 2008;14:1–37.

[16] Patil N., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001;294:1719–1723.

[17] Paskin N. Identifier interoperability: a report on two recent ISO activities. D-Lib Mag. 2006;12:1–23.

[18] Berman J.J. Repurposing legacy data: innovative case studies. Waltham, MA: Morgan Kaufmann; 2015.

[19] Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Waltham, MA: Morgan Kaufmann; 2013.

[20] Berman J.J. Tumor classification: molecular analysis meets Aristotle. BMC Cancer. 2004;4:10.

[21] Berman J.J. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer. 2004;4:88.

[22] Mead C.N. Data interchange standards in healthcare IT–computable semantic interoperability: now possible but still difficult, do we really need a better mousetrap? J Healthc Inf Manag. 2006;20:71–78.
