Chapter 1

Providing Structure to Unstructured Data

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again.

Oscar Wilde

Background

In the early days of computing, data was always highly structured. All data was divided into fields, the fields had a fixed length, and the data entered into each field was constrained to a predetermined set of allowed values. Data was entered onto punch cards, with preconfigured rows and columns. Depending on the intended use of the cards, various entry and read-out methods were chosen to express binary data, numeric data, fixed-size text, or programming instructions (see Glossary item, Binary data). Key-punch operators produced mountains of punch cards. For many analytic purposes, card-encoded data sets were analyzed without the assistance of a computer; all that was needed was a punch card sorter. If you wanted the data cards for all males, over the age of 18, who had graduated from high school and had passed their physical exam, then the sorter would need to make four passes. The sorter would pull every card listing a male; then, from the male cards, it would pull all the cards of people over the age of 18; from this double-sorted substack it would pull the cards that met the next criterion; and so on. As a high school student in the 1960s, I loved playing with the card sorters. Back then, all data was structured data, and it seemed to me, at the time, that a punch-card sorter was all that anyone would ever need to analyze large sets of data.

Of course, I was completely wrong. Today, most data entered by humans is unstructured, in the form of free text. The free text comes in e-mail messages, tweets, documents, and so on. Structured data has not disappeared, but it sits in the shadows cast by mountains of unstructured text. Free text may be more interesting to read than punch cards, but the venerable punch card, in its heyday, was much easier to analyze than its free-text descendant. To get much informational value from free text, it is necessary to impose some structure. This may involve translating the text to a preferred language, parsing the text into sentences, extracting and normalizing the conceptual terms contained in the sentences, mapping terms to a standard nomenclature (see Glossary items, Nomenclature, Thesaurus), annotating the terms with codes from one or more standard nomenclatures, extracting and standardizing data values from the text, assigning data values to specific classes of data belonging to a classification system, assigning the classified data to a storage and retrieval system (e.g., a database), and indexing the data in the system. All of these activities are difficult to do on a small scale and virtually impossible to do on a large scale. Nonetheless, every Big Data project that uses unstructured data must deal with these tasks to yield the best possible results with the resources available.
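The first few of these structuring steps (parsing text into sentences, extracting terms, and tagging terms with nomenclature codes) can be sketched in a few lines of Python. This is only a toy illustration: the nomenclature entries and the code "C9385000" are invented for the example, and a real pipeline would add translation, normalization, classification, and indexing on top of this skeleton.

```python
import re

# A toy nomenclature: each term maps to an invented concept code.
NOMENCLATURE = {
    "renal cell carcinoma": "C9385000",
    "hypernephroma": "C9385000",
}

def split_sentences(text):
    """Naive sentence parser: split on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def annotate(sentence):
    """Return the (term, code) pairs for nomenclature terms in the sentence."""
    lowered = sentence.lower()
    return [(term, code) for term, code in NOMENCLATURE.items()
            if term in lowered]

report = "The biopsy showed renal cell carcinoma. Margins were negative."
for sentence in split_sentences(report):
    print(sentence, annotate(sentence))
```

Even this sketch hints at why the tasks are hard at scale: every design choice (how to split sentences, how to match terms) must hold up over terabytes of heterogeneous text.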

Machine Translation

The purpose of narrative is to present us with complexity and ambiguity.

Scott Turow

The term unstructured data refers to data objects whose contents are not organized into arrays of attributes or values (see Glossary item, Data object). Spreadsheets, with data distributed in cells, marked by a row and column position, are examples of structured data. This paragraph is an example of unstructured data. You can see why data analysts prefer spreadsheets over free text. Without structure, the contents of the data cannot be sensibly collected and analyzed. Because Big Data is immense, the tasks of imposing structure on text must be automated and fast.

Machine translation is one of the better known areas in which computational methods have been applied to free text. Ultimately, the job of machine translation is to translate text from one language into another language. The process of machine translation begins with extracting sentences from text, parsing the words of the sentence into grammatic parts, and arranging the grammatic parts into an order that imposes logical sense on the sentence. Once this is done, each of the parts can be translated by a dictionary that finds equivalent terms in a foreign language to be reassembled by applying grammatic positioning rules appropriate for the target language. Because this process uses the natural rules for sentence constructions in a foreign language, the process is often referred to as natural language machine translation.

It all seems simple and straightforward. In a sense, it is—if you have the proper look-up tables. Relatively good automatic translators are now widely available. The drawback of all these applications is that there are many instances where they fail utterly. Complex sentences, as you might expect, are problematic. Beyond the complexity of the sentences are other problems, deeper problems that touch upon the dirtiest secret common to all human languages—languages do not make much sense. Computers cannot find meaning in sentences that have no meaning. If we, as humans, find meaning in the English language, it is only because we impose our own cultural prejudices onto the sentences we read, to create meaning where none exists.

It is worthwhile to spend a few moments on some of the inherent limitations of English. Our words are polymorphous; their meanings change depending on the context in which they occur. Word polymorphism can be used for comic effect (e.g., “Both the martini and the bar patron were drunk”). As humans steeped in the culture of our language, we effortlessly invent the intended meaning of each polymorphic pair in the following examples: “a bandage wound around a wound,” “farming to produce produce,” “please present the present in the present time,” “don’t object to the data object,” “teaching a sow to sow seed,” “wind the sail before the wind comes,” and countless others.

Words lack compositionality; their meaning cannot be deduced by analyzing root parts. For example, there is neither pine nor apple in pineapple, no egg in eggplant, and hamburgers are made from beef, not ham. You can assume that a lover will love, but you cannot assume that a finger will “fing.” Vegetarians will eat vegetables, but humanitarians will not eat humans. Overlook and oversee should, logically, be synonyms, but they are antonyms.

For many words, their meanings are determined by the case of the first letter of the word. For example, Nice and nice, Polish and polish, Herb and herb, August and august.

It is possible, given enough effort, that a machine translator may cope with all the aforementioned impedimenta. Nonetheless, no computer can create meaning out of ambiguous gibberish, and a sizable portion of written language has no meaning, in the informatics sense (see Glossary item, Meaning). As someone who has dabbled in writing machine translation tools, my favorite gripe relates to the common use of reification—the process whereby the subject of a sentence is inferred, without actually being named (see Glossary item, Reification). Reification is accomplished with pronouns and other subject references.

Here is an example, taken from a newspaper headline: “Husband named person of interest in slaying of mother.” First off, we must infer that it is the husband who was named as the person of interest, not that the husband suggested the name of the person of interest. As anyone who follows crime headlines knows, this sentence refers to a family consisting of a husband, wife, and at least one child. There is a wife because there is a husband. There is a child because there is a mother. The reader is expected to infer that the mother is the mother of the husband’s child, not the mother of the husband. The mother and the wife are the same person. Putting it all together, the husband and wife are father and mother, respectively, to the child. The sentence conveys the news that the husband is a suspect in the slaying of his wife, the mother of the child. The word “husband” reifies the existence of a wife (i.e., creates a wife by implication from the husband-wife relationship). The word “mother” reifies a child. Nowhere is any individual husband or mother identified; it’s all done with pointers pointing to other pointers. The sentence is all but meaningless; any meaning extracted from the sentence comes as a creation of our vivid imaginations.

Occasionally, a sentence contains a reification of a group of people, and the reification contributes absolutely nothing to the meaning of the sentence. For example, “John married aunt Sally.” Here, a familial relationship is established (“aunt”) for Sally, but the relationship does not extend to the only other person mentioned in the sentence (i.e., Sally is not John’s aunt). Instead, the word “aunt” reifies a group of individuals; specifically, the group of people who have Sally as their aunt. The reification seems to serve no purpose other than to confuse.

Here is another example, taken from a newspaper article: “After her husband disappeared on a 1944 recon mission over Southern France, Antoine de Saint-Exupery’s widow sat down and wrote this memoir of their dramatic marriage.” There are two reified persons in the sentence: “her husband” and “Antoine de Saint-Exupery’s widow.” In the first phrase, “her husband” is a relationship (i.e., “husband”) established for a pronoun (i.e., “her”) referenced to the person in the second phrase. The person in the second phrase is reified by a relationship to Saint-Exupery (i.e., “widow”), who just happens to be the reification of the person in the first phrase (i.e., “Saint-Exupery is her husband”).

We write self-referential reifying sentences every time we use a pronoun: “It was then that he did it for them.” The first “it” reifies an event, the word “then” reifies a time, the word “he” reifies a subject, the second “it” reifies some action, and the word “them” reifies a group of individuals representing the recipients of the reified action.

Strictly speaking, all of these examples are meaningless. The subjects of the sentence are not properly identified and the references to the subjects are ambiguous. Such sentences cannot be sensibly evaluated by computers.

A final example is “Do you know who I am?” There are no identifiable individuals; everyone is reified and reduced to an unspecified pronoun (“you,” “I”). Though there are just a few words in the sentence, half of them are superfluous. The words “Do,” “who,” and “am” are merely fluff, with no informational purpose. In an object-oriented world, the question would be transformed into an assertion, “You know me,” and the assertion would be sent a query message, “true?” (see Glossary item, Object-oriented programming). We are jumping ahead. Objects, assertions, and query messages will be discussed in later chapters.

Accurate machine translation is beyond being difficult. It is simply impossible. It is impossible because computers cannot understand nonsense. The best we can hope for is a translation that allows the reader to impose the same subjective interpretation of the text in the translation language as he or she would have made in the original language. The expectation that sentences can be reliably parsed into informational units is fantasy. Nonetheless, it is possible to compose meaningful sentences in any language, if you have a deep understanding of informational meaning. This topic will be addressed in Chapter 4.

Autocoding

The beginning of wisdom is to call things by their right names.

Chinese proverb

Coding, as used in the context of unstructured textual data, is the process of tagging terms with an identifier code that corresponds to a synonymous term listed in a standard nomenclature (see Glossary item, Identifier). For example, a medical nomenclature might contain the term “renal cell carcinoma,” a type of kidney cancer, and attach a unique identifier code to the term, such as “C9385000.” There are about 50 recognized synonyms for “renal cell carcinoma.” A few of these synonyms and near-synonyms are listed here to show that a single concept can be expressed many different ways, including adenocarcinoma arising from kidney, adenocarcinoma involving kidney, cancer arising from kidney, carcinoma of kidney, Grawitz tumor, Grawitz tumour, hypernephroid tumor, hypernephroma, kidney adenocarcinoma, renal adenocarcinoma, and renal cell carcinoma. All of these terms could be assigned the same identifier code, “C9385000.”

The process of coding a text document involves finding all the terms that belong to a specific nomenclature and tagging the term with the corresponding identifier code.

A nomenclature is a specialized vocabulary, usually containing terms that comprehensively cover a well-defined and circumscribed area (see Glossary item, Vocabulary). For example, there may be a nomenclature of diseases, or celestial bodies, or makes and models of automobiles. Some nomenclatures are ordered alphabetically. Others are ordered by synonymy, wherein all synonyms and plesionyms (near-synonyms, see Glossary item, Plesionymy) are collected under a canonical (i.e., best or preferred) term. Synonym indexes are always corrupted by the inclusion of polysemous terms (i.e., terms with multiple meanings; see Glossary item, Polysemy). In many nomenclatures, grouped synonyms are collected under a code (i.e., a unique alphanumeric string) assigned to all of the terms in the group (see Glossary items, Uniqueness, String). Nomenclatures have many purposes: to enhance interoperability and integration, to allow synonymous terms to be retrieved regardless of which specific synonym is entered as a query, to support comprehensive analyses of textual data, to express detail, to tag information in textual documents, and to drive down the complexity of documents by uniting synonymous terms under a common code. Sets of documents held in more than one Big Data resource can be harmonized under a nomenclature by substituting or appending a nomenclature code to every nomenclature term that appears in any of the documents.

In the case of “renal cell carcinoma,” if all of the 50+ synonymous terms, appearing anywhere in a medical text, were tagged with the code “C9385000,” then a search engine could retrieve documents containing this code, regardless of which specific synonym was queried (e.g., a query on “Grawitz tumor” would retrieve documents containing the term “hypernephroid tumor”). The search engine would simply translate the query term, “Grawitz tumor,” into its nomenclature code, “C9385000,” and would pull every record that had been tagged with the code.
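This retrieval trick can be sketched in a few lines of Python. The synonym/code pairs and the two-document corpus below are invented for the example; the point is that every synonym collapses to one code, and the code, not the literal query, is what gets matched.

```python
# Invented synonym/code pairs, keyed on the lowercased term.
TERM_TO_CODE = {
    "renal cell carcinoma": "C9385000",
    "grawitz tumor": "C9385000",
    "hypernephroid tumor": "C9385000",
    "hypernephroma": "C9385000",
}

# Documents pre-tagged with the codes of the terms they contain.
DOCUMENTS = {
    "doc1": "Biopsy consistent with hypernephroid tumor. C9385000",
    "doc2": "Chest radiograph within normal limits.",
}

def search(query):
    """Translate the query into its nomenclature code, then pull every
    document tagged with that code, whatever synonym the document used."""
    code = TERM_TO_CODE[query.lower()]
    return [doc_id for doc_id, text in DOCUMENTS.items() if code in text]

print(search("Grawitz tumor"))  # ['doc1']
```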

Traditionally, nomenclature coding, much like language translation, has been considered a specialized and highly detailed task that is best accomplished by human beings. Just as there are highly trained translators who will prepare foreign language versions of popular texts, there are highly trained coders, intimately familiar with specific nomenclatures, who create tagged versions of documents. Tagging documents with nomenclature codes is serious business. If the coding is flawed, the consequences can be dire. In 2009, the Department of Veterans Affairs sent out hundreds of letters to veterans with the devastating news that they had contracted amyotrophic lateral sclerosis, also known as Lou Gehrig’s disease, a fatal degenerative neurologic condition. About 600 of the recipients did not, in fact, have the disease. The VA retracted these letters, attributing the confusion to a coding error.12 Coding text is difficult. Human coders are inconsistent, idiosyncratic, and prone to error. Coding accuracy for humans seems to fall in the range of 85 to 90%13 (see Glossary item, Accuracy and precision).

When dealing with text in gigabyte and greater quantities, human coding is simply out of the question. There is not enough time, or money, or talent to manually code the textual data contained in Big Data resources. Computerized coding (i.e., autocoding) is the only practical solution.

Autocoding is a specialized form of machine translation, the field of computer science dealing with drawing meaning from narrative text, or translating narrative text from one language to another. Not surprisingly, autocoding algorithms have been adopted directly from the field of machine translation, particularly algorithms for natural language processing (see Glossary item, Algorithm). A popular approach to autocoding involves using the natural rules of language to find words or phrases in text and match them to nomenclature terms. Ideally, the correct text term is matched to its equivalent nomenclature term, regardless of the way that the term is expressed in the text. For instance, the term “adenocarcinoma of lung” has much in common with alternate terms that have minor variations in word order, plurality, inclusion of articles, terms split by a word inserted for informational enrichment, and so on. Alternate forms would be “adenocarcinoma of the lung,” “adenocarcinoma of the lungs,” “lung adenocarcinoma,” and “adenocarcinoma found in the lung.” A natural language algorithm takes into account grammatic variants, allowable alternate term constructions, word roots (stemming), and syntax variation (see Glossary item, Syntax). Clever improvements on natural language methods might include string similarity scores, intended to find term equivalences in cases where grammatic methods come up short.
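One crude way to capture such grammatic variants is to normalize each term to a canonical key: drop articles and filler words, strip plurals, and ignore word order. The sketch below does exactly that for the “adenocarcinoma of lung” variants; the stop-word list and the trailing-“s” stemmer are deliberate oversimplifications of what a real natural language algorithm would do.

```python
import re

STOP_WORDS = {"of", "the", "a", "an", "in", "found"}

def term_key(term):
    """Normalize a term: lowercase it, drop stop words, strip a trailing
    's' as a crude stemmer, and sort the remaining words."""
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w
             for w in re.findall(r"[a-z]+", term.lower())
             if w not in STOP_WORDS]
    return " ".join(sorted(words))

variants = [
    "adenocarcinoma of lung",
    "adenocarcinoma of the lungs",
    "lung adenocarcinoma",
    "adenocarcinoma found in the lung",
]
print({term_key(v) for v in variants})  # all four collapse to a single key
```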

A limitation of the natural language approach to autocoding is encountered when synonymous terms lack etymologic commonality. Consider the term “renal cell carcinoma.” Synonyms include terms that have no grammatic relationship with one another. For example, hypernephroma and Grawitz tumor are synonyms for renal cell carcinoma. It is impossible to compute the equivalence of these terms through the implementation of natural language rules or word similarity algorithms. The only way of obtaining adequate synonymy is through the use of a comprehensive nomenclature that lists every synonym for every canonical term in the knowledge domain.

Setting aside the inability to construct equivalents for synonymous terms that share no grammatic roots (e.g., renal cell carcinoma, Grawitz tumor, and hypernephroma), the best natural language autocoders are pitifully slow. The reason for the slowness relates to their algorithm, which requires the following steps, at a minimum: parsing text into sentences; parsing sentences into grammatic units; rearranging the units of the sentence into grammatically permissible combinations; expanding the combinations based on stem forms of words; allowing for singularities and pluralities of words; and matching the allowable variations against the terms listed in the nomenclature.

A good natural language autocoder parses text at about 1 kilobyte per second. This means that if an autocoder must parse and code a terabyte of textual material, it would require about one billion seconds, or roughly 30 years, to execute. Big Data resources typically contain many terabytes of data; thus, natural language autocoding software is unsuitable for translating Big Data resources. This being the case, what good are they?

Natural language autocoders have value when they are employed at the time of data entry. Humans type sentences at a rate far less than 1 kilobyte per second, and natural language autocoders can keep up with typists, inserting codes for terms, as they are typed. They can operate much the same way as autocorrect, autospelling, look-ahead, and other commonly available crutches intended to improve or augment the output of plodding human typists. In cases where a variant term evades capture by the natural language algorithm, an astute typist might supply the application with an equivalent (i.e., renal cell carcinoma = rcc) that can be stored by the application and applied against future inclusions of alternate forms.

It would seem that by applying the natural language parser at the moment when the data is being prepared, all of the inherent limitations of the algorithm can be overcome. This belief, popularized by developers of natural language software and perpetuated by a generation of satisfied customers, ignores two of the most important properties that must be preserved in Big Data resources: longevity and curation (see Glossary item, Curator).

Nomenclatures change over time. Synonymous terms and their codes will vary from year to year as new versions of old nomenclature are published and new nomenclatures are developed. In some cases, the textual material within the Big Data resource will need to be re-annotated using codes from nomenclatures that cover informational domains that were not anticipated when the text was originally composed.

Most of the people who work within an information-intensive society are accustomed to evanescent data: data that is forgotten once its original purpose has been served. Do we really want all of our old e-mails to be preserved forever? Do we not regret our earliest blog posts, Facebook entries, and tweets? In the medical world, a code for a clinic visit, a biopsy diagnosis, or a reportable transmissible disease will be used in a matter of minutes or hours—maybe days or months. Few among us place much value on textual information preserved for years and decades. Nonetheless, it is the job of the Big Data manager to preserve resource data over years and decades. When we have data that extends back over decades, we can find and avoid errors that would otherwise recur in the present, and we can analyze trends that lead us into the future.

To preserve its value, data must be constantly curated, adding codes that apply to currently available nomenclatures. There is no avoiding the chore—the entire corpus of textual data held in the Big Data resource needs to be recoded again and again, using modified versions of the original nomenclature or using one or more new nomenclatures. This time, an autocoding application will be required to code huge quantities of textual data (possibly terabytes), quickly. Natural language algorithms, which depend heavily on regex operations (i.e., finding word patterns in text), are too slow to do the job (see Glossary item, Regex).

A faster alternative is so-called lexical parsing. This involves parsing text, word by word, looking for exact matches between runs of words and entries in a nomenclature. When a match occurs, the words in the text that matched the nomenclature term are assigned the nomenclature code that corresponds to the matched term. Here is one possible algorithmic strategy for autocoding the sentence “Margins positive malignant melanoma.” For this example, you would be using a nomenclature that lists all of the tumors that occur in humans. Let us assume that the terms “malignant melanoma” and “melanoma” are included in the nomenclature. They are both assigned the same code, for example, “Q5673013,” because the people who wrote the nomenclature considered both terms to be biologically equivalent.

Let’s autocode the diagnostic sentence “Margins positive malignant melanoma”:

1. Begin parsing the sentence, one word at a time. The first word is “Margins.” You check against the nomenclature and find no match. Save the word “margins.” We’ll use it in step 2.

2. You go to the second word, “positive,” and find no matches in the nomenclature. You retrieve the former word “margins” and check to see if there is a two-word term, “margins positive.” There is not. Save “margins” and “positive” and continue.

3. You go to the next word, “malignant.” There is no match in the nomenclature. You check to determine whether the two-word term “positive malignant” and the three-word term “margins positive malignant” are in the nomenclature. They are not.

4. You go to the next word, “melanoma.” You check and find that melanoma is in the nomenclature. You check against the two-word term “malignant melanoma,” the three-word term “positive malignant melanoma,” and the four-word term “margins positive malignant melanoma.” There is a match for “malignant melanoma,” but it yields the same code as the code for “melanoma.”

5. The autocoder appends the code “Q5673013” to the sentence and proceeds to the next sentence, where it repeats the algorithm.
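The five steps above amount to checking, at each word, every run of words that ends at that word. A compact Python sketch of that lexical parser follows, using the invented code “Q5673013” from the example and a two-term nomenclature:

```python
# Both terms carry the same invented code, as in the example.
NOMENCLATURE = {"melanoma": "Q5673013", "malignant melanoma": "Q5673013"}
MAX_TERM_WORDS = 4  # longest phrase we bother to test

def autocode(sentence):
    """Lexical parser: for each word, test every run of up to
    MAX_TERM_WORDS words ending at that word against the nomenclature,
    and collect the codes of all matched terms."""
    words = sentence.lower().rstrip(".").split()
    codes = set()
    for end in range(1, len(words) + 1):
        for start in range(max(0, end - MAX_TERM_WORDS), end):
            candidate = " ".join(words[start:end])
            if candidate in NOMENCLATURE:
                codes.add(NOMENCLATURE[candidate])
    return codes

print(autocode("Margins positive malignant melanoma"))  # {'Q5673013'}
```

Because the nomenclature sits in a dictionary, each membership test is a fast hash look-up; no grammatic rearrangement is ever needed.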

The algorithm seems like a lot of work, requiring many comparisons, but it is actually much more efficient than natural language parsing. A complete nomenclature, with each nomenclature term paired with its code, can be held in a single variable, in volatile memory (see Glossary item, Variable). Look-ups to determine whether a word or phrase is included in the nomenclature are also fast. As it happens, there are methods that will speed things along much faster than our sample algorithm. My own previously published method can process text at a rate more than 1000-fold faster than natural language methods.14 With today’s fast desktop computers, lexical autocoding can recode all of the textual data residing in most Big Data resources within a realistic time frame.

A seemingly insurmountable obstacle arises when the analyst must integrate data from two separate Big Data resources, each annotated with a different nomenclature. One possible solution involves on-the-fly coding, using whatever nomenclature suits the purposes of the analyst.

Here is a general algorithm for on-the-fly coding.15 This algorithm starts with a query term and seeks to find every synonym for the query term, in any collection of Big Data resources, using any convenient nomenclature.

1. The analyst starts with a query term submitted by a data user. The analyst chooses a nomenclature that contains his query term, as well as the list of synonyms for the term. Any vocabulary is suitable so long as the vocabulary consists of term/code pairs, where a term and its synonyms are all paired with the same code.

2. All of the synonyms for the query term are collected together. For instance, the 2004 version of a popular medical nomenclature, the Unified Medical Language System, had 38 equivalent entries for the code C0206708, nine of which are listed here:

C0206708|Cervical Intraepithelial Neoplasms

C0206708|Cervical Intraepithelial Neoplasm

C0206708|Intraepithelial Neoplasm, Cervical

C0206708|Intraepithelial Neoplasms, Cervical

C0206708|Neoplasm, Cervical Intraepithelial

C0206708|Neoplasms, Cervical Intraepithelial

C0206708|Intraepithelial Neoplasia, Cervical

C0206708|Neoplasia, Cervical Intraepithelial

C0206708|Cervical Intraepithelial Neoplasia

If the analyst had chosen to search on “Cervical Intraepithelial Neoplasia,” his term would be attached to the 38 synonyms included in the nomenclature.

3. One by one, the equivalent terms are matched against every record in every Big Data resource available to the analyst.

4. Records are pulled that contain terms matching any of the synonyms for the term selected by the analyst.

In the case of this example, this would mean that all 38 synonymous terms for “Cervical Intraepithelial Neoplasms” would be matched against the entire set of data records. The benefit of this kind of search is that data records that contain any search term, or its nomenclature equivalent, can be extracted from multiple data sets in multiple Big Data resources, as they are needed, in response to any query. There is no pre-coding, and there is no need to match against nomenclature terms that are of no interest to the analyst. The drawback of this method is that it multiplies the computational task by the number of synonymous terms being searched, 38-fold in this example. Luckily, there are simple and fast methods for conducting these synonym searches.15
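The four steps of on-the-fly coding can be sketched directly in Python. The nomenclature fragment below reuses a few of the C0206708 entries quoted above (keyed on the lowercased term); the data records are invented for the example:

```python
# A fragment of term/code pairs, keyed on the lowercased term.
NOMENCLATURE = {
    "cervical intraepithelial neoplasia": "C0206708",
    "cervical intraepithelial neoplasms": "C0206708",
    "intraepithelial neoplasia, cervical": "C0206708",
}

def on_the_fly_search(query, records):
    """Expand the query into all of its nomenclature synonyms, then pull
    every record that contains any one of the synonymous terms."""
    code = NOMENCLATURE[query.lower()]                            # step 1
    synonyms = [t for t, c in NOMENCLATURE.items() if c == code]  # step 2
    return [r for r in records                                    # steps 3-4
            if any(s in r.lower() for s in synonyms)]

records = [
    "Biopsy: cervical intraepithelial neoplasms, grade II",
    "Chest film negative for metastatic disease",
]
print(on_the_fly_search("Cervical Intraepithelial Neoplasia", records))
```

Note that no record was pre-coded; the synonym expansion happens entirely at query time.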

Indexing

Knowledge can be public, yet undiscovered, if independently created fragments are logically related but never retrieved, brought together, and interpreted.

Donald R. Swanson16

Individuals accustomed to electronic media tend to think of the Index as an inefficient or obsolete method for finding and retrieving information. Most currently available e-books have no index. It’s far easier to pull up the “Find” dialog box and enter a word or phrase. The e-reader can find all matches quickly, provide the total number of matches, and bring the reader to any or all of the pages containing the selection. As more and more books are published electronically, the Index, as we have come to know it, will cease to be.

It would be a pity if indexes were to be abandoned by computer scientists. A well-designed book index is a creative, literary work that captures the content and intent of the book and transforms it into a listing wherein related concepts, found scattered throughout the text, are collected under common terms and keyed to their locations. It saddens me that many people ignore the book index until they want something from it. Open a favorite book and read the index, from A to Z, as if you were reading the body of the text. You will find that the index refreshes your understanding of the concepts discussed in the book. The range of page numbers after each term indicates that a concept has extended its relevance across many different chapters. When you browse the different entries related to a single term, you learn how the concept represented by the term applies to many different topics. You begin to understand, in ways that were not apparent when you read the book as a linear text, the versatility of the ideas contained in the book. When you’ve finished reading the index, you will notice that the indexer exercised great restraint when selecting terms. Most indexes are under 20 pages (see Glossary item, Indexes). The goal of the indexer is not to create a concordance (i.e., a listing of every word in a book, with its locations), but to create a keyed encapsulation of concepts, subconcepts, and term relationships.

The indexes we find in today’s books are generally alphabetized terms. In prior decades and prior centuries, authors and editors put enormous effort into building indexes, sometimes producing multiple indexes for a single book. For example, a biography might contain a traditional alphabetized term index, followed by an alphabetized index of the names of the people included in the text. A zoology book might include an index specifically for animal names, with animals categorized according to their taxonomic order (see Glossary item, Taxonomy). A geography index might list the names of localities subindexed by country, with countries subindexed by continent. A single book might have five or more indexes. In 19th century books, it was not unusual to publish indexes as stand-alone volumes.

You may be thinking that all this fuss over indexes is quaint, but it cannot apply to Big Data resources. Actually, Big Data resources that lack a proper index cannot be utilized to their full potential. Without an index, you never know what your queries are missing. Remember, in a Big Data resource, it is the relationships among data objects that are the keys to knowledge. Data by itself, even in large quantities, tells only part of a story. The most useful Big Data resource has electronic indexes that map concepts, classes, and terms to specific locations in the resource where data items are stored. An index imposes order and simplicity on the Big Data resource. Without an index, Big Data resources can easily devolve into vast collections of disorganized information.
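An electronic index of this kind is, at bottom, an inverted index: a map from each term to the locations where it occurs. A minimal sketch over a toy document set (the documents and their numeric identifiers are invented):

```python
from collections import defaultdict
import re

def build_index(documents):
    """Build an inverted index: term -> sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

documents = {
    1: "Classification of kidney tumors",
    2: "Cervical lesions and their classification",
}
index = build_index(documents)
print(index["classification"])  # [1, 2]
```

A professionally designed index would go much further, collecting related concepts under common terms rather than listing raw words, but even this skeleton shows why indexed look-ups are instantaneous: the searching was done once, when the index was built.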

The best indexes comply with international standards (ISO 999) and require creativity and professionalism.17 Indexes should be accepted as another device for driving down the complexity of Big Data resources. Here are a few of the specific strengths of an index that cannot be duplicated by “find” operations on terms entered into a query box.

1. An index can be read, like a book, to acquire a quick understanding of the contents and general organization of the data resource.

2. When you do a “find” search in a query box, your search may come up empty if there is nothing in the text that matches your query. This can be very frustrating if you know that the text covers the topic entered into the query box. Indexes avoid the problem of fruitless searches. By browsing the index you can find the term you need, without foreknowledge of its exact wording within the text.

3. Index searches are instantaneous, even when the Big Data resource is enormous. Indexes are constructed to contain the results of the search of every included term, obviating the need to repeat the computational task of searching on indexed entries.

4. Indexes can be tied to a classification. This permits the analyst to know the relationships among different topics within the index and within the text.

5. Many indexes are cross-indexed, providing relationships among index terms that might be extremely helpful to the data analyst.

6. Indexes from multiple Big Data resources can be merged. When the location entries for index terms are annotated with the name of the resource, then merging indexes is trivial, and index searches will yield unambiguously identified locators in any of the Big Data resources included in the merge.

7. Indexes can be created to satisfy a particular goal, and the process of creating a made-to-order index can be repeated again and again. For example, if you have a Big Data resource devoted to ornithology, and you have an interest in the geographic location of species, you might want to create an index specifically keyed to localities, or you might want to add a locality subentry for every indexed bird name in your original index. Such indexes can be constructed as add-ons, as needed.

8. Indexes can be updated. If terminology or classifications change, there is nothing stopping you from rebuilding the index with an updated specification. In the specific context of Big Data, you can update the index without modifying your data (see Chapter 6).

9. Indexes are created after the database has been created. In some cases, the data manager does not envision the full potential of the Big Data resource until after it is created. The index can be designed to facilitate the use of the resource, in line with the observed practices of users.

10. Indexes can serve as surrogates for the Big Data resource. In some cases, all the data user really needs is the index. A telephone book is an example of an index that serves its purpose without being attached to a related data source (e.g., caller logs, switching diagrams).
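Point 6 above, merging indexes, can be illustrated with a short sketch. When each location entry is annotated with the name of its source resource, merged entries remain unambiguous. The index contents and resource names below are invented for illustration.

```python
# Merging two indexes (point 6 above). Each location is annotated with
# the name of its source resource, so merged entries stay unambiguous.
# Index contents and resource names are invented for illustration.
index_a = {"lucid interval": [("resource_A", 104), ("resource_A", 2210)]}
index_b = {"lucid interval": [("resource_B", 77)],
           "epidural hemorrhage": [("resource_B", 78)]}

def merge_indexes(*indexes):
    merged = {}
    for index in indexes:
        for term, locations in index.items():
            # append location lists; annotations keep sources distinct
            merged.setdefault(term, []).extend(locations)
    return merged

merged = merge_indexes(index_a, index_b)
print(merged["lucid interval"])
# [('resource_A', 104), ('resource_A', 2210), ('resource_B', 77)]
```

Because the merge never needs to inspect the underlying data, two indexes can be combined even when the Big Data resources themselves cannot be.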

Term Extraction

There’s a big difference between knowing the name of something and knowing something.

Richard Feynman

One of my favorite movies is the parody version of “The Hound of the Baskervilles,” starring Peter Cook as Sherlock Holmes and Dudley Moore as his faithful hagiographer, Dr. Watson. Sherlock, preoccupied with his own ridiculous pursuits, dispatches Watson to the Baskerville family manse, in Dartmoor, to undertake urgent sleuth-related activities. The hapless Watson (Dudley Moore), standing in the great Baskerville Hall, has no idea how to proceed with the investigation. After a moment of hesitation, he turns to the incurious maid and commands, “Take me to the clues!”

Building an index is a lot like solving a fiendish crime—you need to know how to find the clues. Likewise, the terms in the text are the clues upon which the index is built. Terms in a text file do not jump into your index file—you need to find them. There are several available methods for finding and extracting index terms from a corpus of text,18 but no method is as simple, fast, and scalable as the “stop” word method19 (see Glossary items, Term extraction algorithm, Scalable).

Text is composed of words and phrases that represent specific concepts, connected together in sequence to form sentences.

Consider the following: “The diagnosis is chronic viral hepatitis.” This sentence contains two very specific medical concepts: “diagnosis” and “chronic viral hepatitis.” These two concepts are connected to form a meaningful statement with the words “the” and “is,” and the sentence delimiter, “.” Parsed into its component parts, the sentence reads: “The,” “diagnosis,” “is,” “chronic viral hepatitis,” “.”

A term can be defined as a sequence of one or more uncommon words that are demarcated (i.e., bounded on one side or another) by the occurrence of one or more common words, such as “is,” “and,” “with,” “the.”

Here is another example: “An epidural hemorrhage can occur after a lucid interval.” The medical concepts “epidural hemorrhage” and “lucid interval” are composed of uncommon words. These uncommon word sequences are bounded by sequences of common words or by sentence delimiters (i.e., a period, semicolon, question mark, or exclamation mark indicating the end of a sentence or the end of an expressed thought). Parsed into its component parts, the sentence reads: “An,” “epidural hemorrhage,” “can occur after a,” “lucid interval,” “.”

If we had a list of all the words that were considered common, we could write a program that extracts all the concepts found in any text of any length. The concept terms would consist of all sequences of uncommon words that are uninterrupted by common words. An algorithm for extracting terms from a sentence follows.

1. Read the first word of the sentence. If it is a common word, delete it. If it is an uncommon word, save it.

2. Read the next word. If it is a common word, delete it and place the saved word (from the prior step, if the prior step saved a word) into our list of terms found in the text. If it is an uncommon word, append it to the word we saved in step one and save the two-word term. If it is a sentence delimiter, place any saved term into our list of terms and stop the program.

3. Repeat step two.
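The three steps above can be sketched in a few lines of Python. The stop list here is a small illustrative subset chosen to match the example sentences, not a domain-appropriate list such as the National Library of Medicine list quoted below.

```python
# Sketch of the stop-word term extraction algorithm described above.
# The stop list is a toy subset chosen to match the example sentences;
# a real indexer would use a domain-appropriate list.
STOP_WORDS = {"the", "is", "an", "a", "can", "occur", "after", "and", "with"}

def extract_terms(sentence):
    """Collect maximal runs of uncommon words as candidate index terms."""
    terms = []
    current = []                      # uncommon words saved so far
    for word in sentence.lower().rstrip(".;?!").split():
        if word in STOP_WORDS:
            if current:               # a common word ends the term
                terms.append(" ".join(current))
                current = []
        else:
            current.append(word)      # an uncommon word extends the term
    if current:                       # the sentence delimiter ends the term
        terms.append(" ".join(current))
    return terms

print(extract_terms("An epidural hemorrhage can occur after a lucid interval."))
# ['epidural hemorrhage', 'lucid interval']
```

Run against a whole corpus, sentence by sentence, this loop yields the complete collection of candidate index terms.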

This simple algorithm, or something much like it, is a fast and efficient method to build a collection of index terms. To use the algorithm, you must prepare or find a list of common words appropriate to the information domain of your Big Data resource. To extract terms from the National Library of Medicine’s citation resource (about 20 million collected journal articles), the following list of common words is used: “about, again, all, almost, also, although, always, among, an, and, another, any, are, as, at, be, because, been, before, being, between, both, but, by, can, could, did, do, does, done, due, during, each, either, enough, especially, etc, for, found, from, further, had, has, have, having, here, how, however, i, if, in, into, is, it, its, itself, just, kg, km, made, mainly, make, may, mg, might, ml, mm, most, mostly, must, nearly, neither, no, nor, obtained, of, often, on, our, overall, perhaps, pmid, quite, rather, really, regarding, seem, seen, several, should, show, showed, shown, shows, significantly, since, so, some, such, than, that, the, their, theirs, them, then, there, therefore, these, they, this, those, through, thus, to, upon, use, used, using, various, very, was, we, were, what, when, which, while, with, within, without, would.”

Such lists of common words are sometimes referred to as “stop word lists” or “barrier word lists,” as they demarcate the beginnings and endings of extraction terms.

Notice that the algorithm parses through text sentence by sentence. This is a somewhat awkward method for a computer to follow, as most programming languages automatically cut text from a file line by line (i.e., breaking text at the newline terminator). A computer program has no way of knowing where a sentence begins or ends unless the programmer supplies a subroutine that finds the sentence boundaries.

There are many strategies for determining where one sentence stops and another begins. The easiest method looks for a sentence delimiter that immediately follows a lowercase alphabetic letter and is itself followed by one or two space characters and then an uppercase alphabetic letter.

Here is an example: “I like pizza. Pizza likes me.” Between the two sentences is the sequence “a. P,” which consists of a lowercase “a” followed by a period, followed by two spaces, followed by an uppercase “P”. This general pattern (lowercase, period, one or two spaces, uppercase) usually signifies a sentence break. The routine fails with sentences that break at the end of a line or at the last sentence of a paragraph (i.e., where there is no intervening space). It also fails to demarcate proper sentences captured within one sentence (i.e., where a semicolon ends an expressed thought, but is not followed by an uppercase letter). It might falsely demarcate a sentence in an outline, where a lowercase letter is followed by a period, indicating a new subtopic. Nonetheless, with a few tweaks providing for exceptional types of sentences, a programmer can whip up a satisfactory subroutine that divides unstructured text into a set of sentences.
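The pattern just described (lowercase letter, delimiter, one or two spaces, uppercase letter) can be written as a single regular expression. This is a minimal sketch that shares the failure modes noted above; the exceptional cases would need additional tweaks.

```python
import re

def split_sentences(text):
    # A delimiter (., ?, !) preceded by a lowercase letter and followed
    # by one or two spaces and an uppercase letter marks a sentence break.
    # The lookarounds keep the letters and the delimiter out of the split,
    # so each sentence retains its own punctuation.
    return re.split(r"(?<=[a-z][.?!])\s{1,2}(?=[A-Z])", text)

print(split_sentences("I like pizza.  Pizza likes me."))
# ['I like pizza.', 'Pizza likes me.']
```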

Once you have a method for extracting terms from sentences, the task of creating a true index, associating a list of locations with each term, is child’s play for programmers. Basically, as you collect each term (as described above), you attach the term to the location at which it was found. This is ordinarily done by building an associative array, also called a hash or a dictionary depending on the programming language used. When a term is encountered at subsequent locations in the Big Data resource, these additional locations are simply appended to the list of locations associated with the term. After the entire Big Data resource has been parsed by your indexing program, a large associative array will contain two items for each term in the index: the name of the term and the list of locations at which the term occurs within the Big Data resource. When the associative array is displayed as a file, your index is completed! No, not really.
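The associative-array approach might look like the following sketch, which is self-contained: it bundles a toy term extractor (with an invented stop list) and uses sentence numbers as stand-ins for locations, which in a real resource might be byte offsets or record identifiers.

```python
# Sketch of index construction with an associative array (a Python dict):
# each term maps to the list of locations at which it occurs. Locations
# here are sentence numbers; the stop list is a toy subset.
STOP_WORDS = {"the", "is", "an", "a", "can", "occur", "after"}

def terms_in(sentence):
    # minimal term extractor: maximal runs of uncommon words
    terms, current = [], []
    for word in sentence.lower().rstrip(".;?!").split():
        if word in STOP_WORDS:
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        terms.append(" ".join(current))
    return terms

def build_index(sentences):
    index = {}
    for location, sentence in enumerate(sentences):
        for term in terms_in(sentence):
            # append this location to the term's list of locations
            index.setdefault(term, []).append(location)
    return index

corpus = ["The diagnosis is chronic viral hepatitis.",
          "Chronic viral hepatitis is common."]
print(build_index(corpus))
# {'diagnosis': [0], 'chronic viral hepatitis': [0, 1], 'common': [1]}
```

Writing the dict out to a file produces the raw alphabetizable index; as the text goes on to explain, that raw listing is only the starting point.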

Using the described methods, an index can be created for any corpus of text. However, in most cases, the data manager and the data analyst will not be happy with the results. The index will contain a huge number of terms that are of little or no relevance to the data analyst. The terms in the index will be arranged alphabetically, but an alphabetic representation of the concepts in a Big Data resource does not associate like terms with like terms.

Find a book with a really good index. You will see that the indexer has taken pains to unite related terms under a single subtopic. In some cases, the terms in a subtopic will be divided into subtopics. Individual terms will be linked (cross-referenced) to related terms elsewhere in the index.

A good index, whether it is created by a human or by a computer, will be built to serve the needs of the data manager and of the data analyst. The programmer who creates the index must exercise a considerable degree of creativity, insight, and elegance. Here are just a few of the questions that should be considered when an index is created for unstructured textual information in a Big Data resource.

1. Should the index be devoted to a particular knowledge domain? You may want to create an index of names of persons, an index of geographic locations, or an index of types of transactions. Your choice depends on the intended uses of the Big Data resource.

2. Should the index be devoted to a particular nomenclature? A coded nomenclature might facilitate the construction of an index if synonymous index terms are attached to their shared nomenclature code.

3. Should the index be built upon a scaffold that consists of a classification? For example, an index prepared for biologists might be keyed to the classification of living organisms. Gene data has been indexed to a gene ontology and used as a research tool.20

4. In the absence of a classification, might proximity among terms be included in the index? Term associations leading to useful discoveries can sometimes be found by collecting the distances between indexed terms.21,22 Terms that are proximate to one another (i.e., co-occurring terms) tend to have a relational correspondence. For example, if “aniline dye industry” co-occurs often with the seemingly unrelated term “bladder cancer,” then you might start to ask whether aniline dyes can cause bladder cancer.

5. Should multiple indexes be created? Specialized indexes might be created for data analysts who have varied research agendas.

6. Should the index be merged into another index? It is far easier to merge indexes than to merge Big Data resources. It is worthwhile to note that the greatest value of Big Data comes from finding relationships among disparate collections of data.
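Question 4 above, term proximity, can be approximated by counting co-occurrences: terms that appear in the same sentence (or within some fixed distance) are tallied as a pair. The term lists below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Sketch of co-occurrence counting (question 4 above): every pair of
# terms appearing in the same sentence is tallied. The per-sentence
# term lists are invented for illustration.
def cooccurrence(term_lists):
    pairs = Counter()
    for terms in term_lists:
        # sort so each pair has one canonical ordering
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs

sentences_terms = [
    ["aniline dye industry", "bladder cancer"],
    ["aniline dye industry", "bladder cancer", "occupational exposure"],
]
counts = cooccurrence(sentences_terms)
print(counts[("aniline dye industry", "bladder cancer")])
# 2
```

Pairs with unexpectedly high counts, such as “aniline dye industry” and “bladder cancer” in this toy example, are the ones that prompt the analyst to look for an underlying relationship.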

References

12. Hayes A. VA to apologize for mistaken Lou Gehrig’s disease notices. CNN; August 26, 2009. Available from: http://www.cnn.com/2009/POLITICS/08/26/veterans.letters.disease [viewed September 4, 2012].

13. Hall PA, Lemoine NR. Comparison of manual data coding errors in 2 hospitals. J Clin Pathol. 1986;39:622–626.

14. Berman JJ. Doublet method for very fast autocoding. BMC Med Inform Decis Mak. 2004;4:16.

15. Berman JJ. Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching. In Silico Biol. 2005;5:0029.

16. Swanson DR. Undiscovered public knowledge. Libr Q. 1986;56:103–118.

17. Wallis E, Lavell C. Naming the indexer: where credit is due. The Indexer. 1995;19:266–268.

18. Krauthammer M, Nenadic G. Term identification in the biomedical literature. J Biomed Inform. 2004;37:512–526.

19. Berman JJ. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton, FL: Chapman and Hall; 2010.

20. Shah NH, Jonquet C, Chiang AP, Butte AJ, Chen R, Musen MA. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinform. 2009;10(Suppl. 2):S1.

21. Cohen T, Whitfield GK, Schvaneveldt RW, Mukund K, Rindflesch T. EpiphaNet: an interactive tool to support biomedical discoveries. J Biomed Discov Collab. 2010;5:21–49.

22. Swanson DR. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med. 1986;30:7–18.

