Chapter 3

Ontologies and Semantics

Order and simplification are the first steps toward the mastery of a subject.

Thomas Mann

Background

Information has limited value unless it can take its place within our general understanding of the world. When a financial analyst learns that the price of a stock has suddenly dropped, he cannot help but wonder if the drop of a single stock reflects conditions in other stocks in the same industry. If so, the analyst may check to ensure that other industries are following a downward trend. He may wonder whether the downward trend represents a shift in the national or global economies. There is a commonality to all of the questions posed by the financial analyst. In every case, the analyst is asking a variation on a single question: “How does this thing relate to that thing?”

Big Data resources are complex. When data is simply stored in a database, without any general principles of organization, it is impossible to discover the relationships among the data objects. To be useful, the information in a Big Data resource must be divided into classes of data. Each data object within a class shares a set of properties chosen to enhance our ability to relate one piece of data with another.

Ontologies are formal systems that assign data objects to classes and that relate classes to other classes. When the data within a Big Data resource is classified within an ontology, data analysts can determine whether observations on a single object will apply to other objects in the same class. Similarly, data analysts can begin to ask whether observations that hold true for a class of objects will relate to other classes of objects. Basically, ontologies help scientists fulfill one of their most important tasks: determining how things relate to other things. This chapter will describe how ontologies are constructed and how they are used for scientific discovery in Big Data resources. We will begin with the simplest form of ontology: classification.

Classifications, the Simplest of Ontologies

The human brain is constantly processing visual and other sensory information collected from the environment. When we walk down the street, we see images of concrete and asphalt and millions of blades of grass, birds, dogs, other persons, and so on. Every step we take conveys a new world of sensory input. How can we process it all? The mathematician and philosopher Karl Pearson (1857–1936) likened the human mind to a “sorting machine.”44 We take a stream of sensory information, sort it into a set of objects, and then assign the individual objects to general classes. The green stuff on the ground is classified as “grass,” and the grass is subclassified under some larger grouping, such as “plants.” A flat stretch of asphalt and concrete may be classified as a “road,” and the road might be subclassified under “man-made constructions.” If we lacked a culturally determined classification of objects for our world, we would be overwhelmed by sensory input, and we would have no way to remember what we see and no way to draw general inferences about anything. Simply put, without our ability to classify, we would not be human.45

Every culture imposes its own uniform way of perceiving the environment. In English-speaking cultures, the term “hat” denotes a universally recognized object. Hats may be composed of many different types of materials and they may vary greatly in size, weight, and shape. Nonetheless, we can almost always identify a hat when we see one, and we can distinguish a hat from all other types of objects. An object is not classified as a hat simply because it shares a few structural similarities with other hats. A hat is classified as a hat because it has a class relationship; all hats are items of clothing that fit over the head. Likewise, all biological classifications are built by relationships, not by similarities.45,46

Aristotle was one of the first experts in classification. His greatest insight came when he correctly identified a dolphin as a mammal. Through observation, he knew that a large group of animals was distinguished by three features: a gestational period in which a developing embryo is nourished by a placenta; offspring that are delivered into the world as formed but small versions of the adult animals (i.e., not as eggs or larvae); and newborn animals that feed on milk secreted from nipples overlying specialized glandular organs (mammae). Aristotle knew that these features, characteristic of mammals, were absent in all other types of animals. He also knew that dolphins had all these features; fish did not. He correctly reasoned that dolphins were a type of mammal, not a type of fish. Aristotle was ridiculed by his contemporaries for whom it was obvious that dolphins were a type of fish. Unlike Aristotle, they based their classification on similarities, not on relationships. They saw that dolphins looked like fish and dolphins swam in the ocean like fish, and this was all the proof they needed to conclude that dolphins were indeed fish. For about 2000 years following the death of Aristotle, biologists persisted in their belief that dolphins were a type of fish. For the past several hundred years, biologists have acknowledged that Aristotle was correct after all: dolphins are mammals. Aristotle discovered and taught the most important principle of classification: that classes are built on relationships among class members, not by counting similarities.45 We will see in later chapters that methods of grouping data objects by similarity can be very misleading and should not be used as the basis for constructing a classification or an ontology.

A classification is a very simple form of ontology, in which each class is limited to one parent class. To build a classification, the ontologist must do the following: (1) define classes (i.e., find the properties that define a class and extend to the subclasses of the class), (2) assign instances to classes, (3) position classes within the hierarchy, and (4) test and validate all of the above.

The constructed classification becomes a hierarchy of data objects conforming to a set of principles:

1. The classes (groups with members) of the hierarchy have a set of properties or rules that extend to every member of the class and to all of the subclasses of the class, to the exclusion of unrelated classes. A subclass is itself a type of class wherein the members have the defining class properties of the parent class plus some additional property(ies) specific to the subclass.

2. In a hierarchical classification, each subclass may have no more than one parent class. The root (top) class has no parent class. The biological classification of living organisms is a hierarchical classification.

3. At the bottom of the hierarchy is the class instance. For example, your copy of this book is an instance of the class of objects known as “books.”

4. Every instance belongs to exactly one class.

5. Instances and classes do not change their positions in the classification. As examples, a horse never transforms into a sheep and a book never transforms into a harpsichord.

6. The members of classes may be highly similar to one another, but their similarities result from their membership in the same class (i.e., conforming to class properties), and not the other way around (i.e., similarity alone cannot define class inclusion).

Classifications are always simple; the parental classes of any instance of the classification can be traced as a simple, nonbranched list, ascending through the class hierarchy. As an example, here is the lineage for the domestic horse (Equus caballus) from the classification of living organisms:

Equus caballus

Equus subg. Equus

Equus

Equidae

Perissodactyla

Laurasiatheria

Eutheria

Theria

Mammalia

Amniota

Tetrapoda

Sarcopterygii

Euteleostomi

Teleostomi

Gnathostomata

Vertebrata

Craniata

Chordata

Deuterostomia

Coelomata

Bilateria

Eumetazoa

Metazoa

Fungi/Metazoa group

Eukaryota

cellular organisms

The words in this zoologic lineage may seem strange to laypersons, but taxonomists who view this lineage instantly grasp the place of domestic horses in the classification of all living organisms.

A classification can be written as a list of every member class, along with the relationships of each class to other classes. Because each class can have only one parent class, a complete classification can be provided when we list all the classes, adding the name of the parent class for each class on the list. For example, a few lines of the classification of living organisms might be:

Craniata, subclass of Chordata

Chordata, subclass of Deuterostomia

Deuterostomia, subclass of Coelomata

Coelomata, subclass of Bilateria

Bilateria, subclass of Eumetazoa

Given the name of any class, a programmer can compute (with a few lines of code) the complete ancestral lineage for the class by iteratively finding the parent class assigned to each ascending class.19
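A minimal Python sketch of such a lineage computation follows. It assumes the classification is held as a dictionary that maps each class name to the name of its single parent class. The names `parent_of` and `lineage` are illustrative choices, not taken from any particular library, and only the few classes listed above are included.

```python
# Child-to-parent table for a small fragment of the classification of
# living organisms (hypothetical variable name; only a few classes shown).
parent_of = {
    "Craniata": "Chordata",
    "Chordata": "Deuterostomia",
    "Deuterostomia": "Coelomata",
    "Coelomata": "Bilateria",
    "Bilateria": "Eumetazoa",
}


def lineage(class_name, parent_of):
    """Return the classes from class_name up to its highest known ancestor."""
    path = [class_name]
    seen = {class_name}
    # Climb one parent at a time until a class has no recorded parent.
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:
            # A true classification never loops back; guard against bad data.
            raise ValueError(f"cycle detected at {parent!r}")
        seen.add(parent)
        path.append(parent)
    return path


print(" > ".join(lineage("Craniata", parent_of)))
```

Because each class has exactly one parent, the result is always a simple, nonbranched list; no choice among parents ever has to be made.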

A taxonomy is a classification with the instances “filled in.” This means that for each class in a taxonomy, all the known instances (i.e., member objects) are explicitly listed. For the taxonomy of living organisms, the instances are named species. Currently, there are several million named species of living organisms, and each of these several million species is listed under the name of some class included in the full classification.

Classifications drive down the complexity of their data domain because every instance in the domain is assigned to a single class and every class is related to the other classes through a simple hierarchy.

It is important to distinguish a classification system from an identification system. An identification system puts a data object into its correct slot within the classification. For example, a fingerprint-matching system may look for a set of features that puts a fingerprint into a special subclass of all fingerprints, but the primary goal of fingerprint matching is to establish the identity of an instance (i.e., to show that two sets of fingerprints belong to the same person). In the realm of medicine, when a doctor renders a diagnosis on a patient’s diseases, she is not classifying the disease—she is finding the correct slot within the preexisting classification of diseases that holds her patient’s diagnosis.

Ontologies, Classes with Multiple Parents

Ontologies are constructions that permit an object to be a direct subclass of more than one class. In an ontology, the class “horse” might be a subclass of Equus, a zoologic term, as well as a subclass of “racing animals,” “farm animals,” and “four-legged animals.” The class “book” might be a subclass of “works of literature,” as well as a subclass of “wood-pulp materials” and “inked products.” Ontologies are unrestrained classifications.

Ontologies are predicated on the belief that a single object or class of objects might have multiple different fundamental identities and that these different identities will often place one class of objects directly under more than one superclass.

Data analysts sometimes prefer ontologies over classifications because they permit the analyst to find relationships among classes of objects that would have been impossible to find under a classification. For example, a data analyst might be interested in determining the relationships among groups of flying animals, such as butterflies, birds, bats, and so on. In the classification of living organisms, these animals occupy classes that are not closely related to one another—no two of the different types of flying animals share a single parent class. Because classifications follow relationships through a lineage, they cannot connect instances of classes that fall outside the line of descent.
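The contrast can be sketched in a few lines of Python. The class names below are highly abbreviated (intermediate classes are omitted), and the variable and function names are hypothetical; the sketch only illustrates what changes when a class is allowed a second direct parent.

```python
# Each class maps to the set of its direct parent classes.
# In the classification, every set holds exactly one parent.
classification = {
    "bats": {"mammals"},
    "birds": {"vertebrates"},
    "butterflies": {"insects"},
}

# The ontology keeps the zoologic parent and adds a second, non-zoologic one.
ontology = {
    name: parents | {"flying animals"}
    for name, parents in classification.items()
}


def subclasses_of(model, superclass):
    """Return the classes that list superclass as a direct parent."""
    return sorted(name for name, parents in model.items() if superclass in parents)


print(subclasses_of(classification, "flying animals"))  # []
print(subclasses_of(ontology, "flying animals"))        # ['bats', 'birds', 'butterflies']
```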

Ontologies are not subject to the analytic limitations imposed by classifications. In an ontology, a data object can be an instance of many different kinds of classes; thus, the class does not define the essence of the object as it does in a classification. In an ontology, the assignment of an object to a class and the behavior of the members of a class are determined by rules. An object belongs to a class when it behaves like the other members of the class, according to a rule created by the ontologist. Every class, subclass, and superclass is defined by rules, and rules can be programmed into software.

Classifications were created and implemented at a time when scientists did not have powerful computers that were capable of handling the complexities of ontologies. For example, the classification of all living organisms on earth was created over a period of two millennia. Several million species have been assigned to the classification. It is currently estimated that we will need to add another 10 to 50 million species before we come close to completing the taxonomy of living organisms. Prior generations of scientists could cope with a simple classification, wherein each class of organisms falls under a single superclass; they could not cope with a complex ontology of organisms.

The advent of powerful and accessible computers has spawned a new generation of computer scientists who have developed powerful methods for building complex ontologies. It is the goal of these computer scientists to analyze data in a manner that allows us to find and understand ontologic relationships among data objects.

In simple data collections, such as spreadsheets, data is organized in a very specific manner that preserves the relationships among specific types of data. The rows of the spreadsheet are the individual data objects (i.e., people, experimental samples, class of information, etc.). The left-hand field of the row is typically the name assigned to the data object, and the cells of the row are the attributes of the data object (e.g., quantitative measurements, categorical data, and other information). Each cell of each row occurs in a specific order, and the order determines the kind of information contained in the cell. Hence, every column of the spreadsheet has a particular type of information in each spreadsheet cell.

Big Data resources are much more complex than spreadsheets. The set of features belonging to an object (i.e., the values, sometimes called variables, belonging to the object, corresponding to the cells in a spreadsheet row) will be different for different classes of objects. For example, a member of Class Automobile may have a feature such as “average miles per gallon in city driving,” whereas a member of Class Mammal would not. Every data object must be assigned membership in a class (e.g., Class Persons, Class Tissue Samples, Class Bank Accounts), and every class must be assigned a set of class properties. In Big Data resources that are based on class models, the data objects are not defined by their location in a rectangular spreadsheet—they are defined by their class membership. Classes, in turn, are defined by their properties and by their relations to other classes.

The question that should confront every Big Data manager is “Should I model my data as a classification, wherein every class has one direct parent class, or should I model the resource as an ontology, wherein classes may have multiparental inheritance?”

Choosing a Class Model

The simple and fundamental question “Can a class of objects have more than one parent class?” lies at the heart of several related fields: database management, computational informatics, object-oriented programming, semantics, and artificial intelligence (see Glossary item, Artificial intelligence). Computer scientists are choosing sides, often without acknowledging the problem or fully understanding the stakes. For example, when a programmer builds object libraries in the Python or the Perl programming languages, he is choosing to program in a permissive environment that supports multiclass object inheritance. In Python and Perl, any object can have as many parent classes as the programmer prefers. When a programmer chooses to program in the Ruby programming language, he shuts the door on multiclass inheritance. A Ruby object can have only one direct parent class. Most programmers are totally unaware of the liberties and restrictions imposed by their choice of programming language until they start to construct their own object libraries or until they begin to use class libraries prepared by another programmer.

In object-oriented programming, the programming language provides a syntax whereby a named method is “sent” to data objects, and a result is calculated. The named methods are functions and short programs contained in a library of methods created for a class. For example, a “close” method, written for file objects, typically shuts a file so that it cannot be accessed for read or write operations. In object-oriented languages, a “close” method is sent to an instance of class “File” when the programmer wants to prohibit access to the file. The programming language, upon receiving the “close” method, will look for a method named “close” somewhere in the library of methods prepared for the “File” class. If it finds the “close” method in the “File” class library, it will apply the method to the object to which the method was sent. In simplest terms, the specified file would be closed.

If the “close” method were not found among the available methods for the “File” class library, the programming language would automatically look for the “close” method in the parent class of the “File” class. In some languages, the parent class of the “File” class is the “Input/Output” class. If there were a “close” method in the “Input/Output” class, the method would be applied to the “File” object. If not, the process of looking for a “close” method would be repeated for the parent class of the “Input/Output” class. You get the idea. Object-oriented languages search for methods by moving up the lineage of ancestral classes for the object instance that receives the method.
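The upward search can be seen in a short Python sketch. The classes `InputOutput` and `File` are invented for illustration and are not Python's own I/O classes; the attribute `__mro__` is a standard Python feature that lists the classes searched, in order.

```python
class InputOutput:
    """Illustrative parent class that owns the close method."""

    def close(self):
        return "closed by InputOutput"


class File(InputOutput):
    """Illustrative child class with no close method of its own."""


f = File()

# File defines no close method, so the search moves up to InputOutput.
print(f.close())  # closed by InputOutput

# The lineage that Python searches, from the object's class upward.
print([cls.__name__ for cls in File.__mro__])  # ['File', 'InputOutput', 'object']
```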

In object-oriented programming, every data object is assigned membership to a class of related objects. Once a data object has been assigned to a class, the object has access to all of the methods available to the class in which it holds membership and to all of the methods in all the ancestral classes. This is the beauty of object-oriented programming. If the object-oriented programming language is constrained to single parental inheritance (e.g., the Ruby programming language), then the methods available to the programmer are restricted to a tight lineage. When the object-oriented language permits multiparental inheritance (e.g., Perl and Python programming languages), a data object can have many different ancestral classes spread horizontally and vertically through the class libraries.

Freedom always has its price. Imagine what happens in a multiparental object-oriented programming language when a method is sent to a data object and the data object’s class library does not contain the method. The programming language will look for the named method in the library belonging to a parent class. Which parent class library should be searched? Suppose the object has two parent classes, and each of those parent classes has a method of the same name in its class library. The functionality of the method will change depending on its class membership (i.e., a “close” method may have a different function within class “File” than it may have within class “Transactions” or class “Boxes”). Unless the programmer knows exactly how the language orders its search through the ancestral class libraries, there is no way to tell which method will be found; hence, the output of a software program written in an object-oriented language that permits multiclass inheritance can be hard to predict.
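A small Python sketch makes the hazard concrete. The class names are invented for illustration. Python itself settles the question with a fixed search order (its method resolution order), so the outcome here depends on nothing more than the order in which the parent classes happen to be listed, a detail that is easy to overlook in a large class library.

```python
class File:
    def close(self):
        return "file closed"


class Transaction:
    def close(self):
        return "transaction settled"


# Two hypothetical child classes with the same parents, listed in opposite order.
class AuditedFile(File, Transaction):
    pass


class FiledTransaction(Transaction, File):
    pass


# The same method name, sent to objects with the same two parents,
# reaches a different parent class library in each case.
print(AuditedFile().close())       # file closed
print(FiledTransaction().close())  # transaction settled
```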

The rules by which ontologies assign class relationships can become computationally difficult. When there are no restraining inheritance rules, a class within the ontology might be an ancestor of a child class that is an ancestor of its parent class (e.g., a single class might be a grandfather and a grandson to the same class). An instance of a class might be an instance of two classes, at once. The combinatorics and the recursive options can become computationally difficult or impossible.

Those who use ontologies that allow multiclass inheritance will readily acknowledge that they have created a system that is complex and unpredictable. The ontology expert justifies his complex and unpredictable model on the observation that reality itself is complex and unpredictable (see Glossary item, Modeling). A faithful model of reality cannot be created with a simple-minded classification. With time and effort, modern approaches to complex systems will isolate and eliminate computational impedimenta; these are the kinds of problems that computer scientists are trained to solve. For example, recursiveness within an ontology can be avoided if the ontology is acyclic (i.e., class relationships are not permitted to cycle back onto themselves). For every problem created by an ontology, an adept computer scientist will find a solution. Basically, ontologists believe that the task of organizing and understanding information no longer resides within the ancient realm of classification.
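Checking that an ontology is acyclic is a routine graph computation. The sketch below is one possible approach, a depth-first search over a dictionary that maps each class to its set of direct parents; the function name `has_cycle` and the sample class names are hypothetical.

```python
def has_cycle(parents):
    """Return True if any class can reach itself by following parent links."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = IN_PROGRESS
        for parent in parents.get(node, ()):
            status = state.get(parent, UNVISITED)
            if status == IN_PROGRESS:
                # We climbed back to a class still on the current path.
                return True
            if status == UNVISITED and visit(parent):
                return True
        state[node] = DONE
        return False

    return any(
        state.get(name, UNVISITED) == UNVISITED and visit(name)
        for name in parents
    )


acyclic = {"horse": {"Equus", "farm animals"}, "Equus": {"Equidae"}}
cyclic = {"A": {"B"}, "B": {"C"}, "C": {"A"}}  # A is its own great-grandparent

print(has_cycle(acyclic))  # False
print(has_cycle(cyclic))   # True
```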

For those nonprogrammers who believe in the supremacy of classifications over ontologies, their faith has nothing to do with the computational dilemmas incurred with multiclass parental inheritance. They base their faith on epistemological grounds—on the nature of objects. They hold that an object can only be one thing. You cannot pretend that one thing is really two or more things simply because you insist that it is so. One thing can only belong to one class. One class can only have one ancestor class; otherwise, it would have a dual nature. Assigning more than one parental class to an object is a sign that you have failed to grasp the essential nature of the object. The classification expert believes that ontologies (i.e., classifications that permit one class to have more than one parent class and that permit one object to hold membership in more than one class) do not accurately represent reality.

At the heart of classical classification is the notion that everything in the universe has an essence that makes it one particular thing, and nothing else. This belief is justified for many different kinds of systems. When an engineer builds a radio, he knows that he can assign names to components, and these components can be relied upon to behave in a manner that is characteristic of their type. A capacitor will behave like a capacitor, and a resistor will behave like a resistor. The engineer need not worry that the capacitor will behave like a semiconductor or an integrated circuit.

What is true for the radio engineer may not hold true for the Big Data analyst. In many complex systems, the object changes its function depending on circumstances. For example, cancer researchers discovered a protein that plays an important role in the development of cancer. This protein, p53, was considered to be the primary cellular driver for human malignancy. When p53 mutated, cellular regulation was disrupted, and cells proceeded down a slippery path leading to cancer. In the past few decades, as more information was obtained, cancer researchers have learned that p53 is just one of many proteins that play some role in carcinogenesis, but the role changes depending on the species, tissue type, cellular microenvironment, genetic background of the cell, and many other factors. Under one set of circumstances, p53 may play a role in DNA repair, whereas under another set of circumstances, p53 may cause cells to arrest the growth cycle.47,48 It is difficult to classify a protein that changes its primary function based on its biological context.

Simple classifications cannot be built for objects whose identities are contingent on other objects not contained in the classification. Compromise is needed. In the case of protein classification, bioinformaticians have developed GO, the Gene Ontology. In GO, each protein is assigned a position in three different systems: cellular component, biological process, and molecular function. The first system contains information related to the anatomic position of the protein in the cell (e.g., cell membrane). The second system contains the biological pathways in which the protein participates (e.g., tricarboxylic acid cycle), and the third system describes its various molecular functions. Each ontology is acyclic to eliminate the occurrences of class relationships that cycle back to the same class (i.e., parent class cannot be its own child class). GO allows biologists to accommodate the context-based identity of proteins by providing three different ontologies, combined into one. One protein fits into the cellular component ontology, the biological process ontology, and the molecular function ontology. The three ontologies are combined into one controlled vocabulary that can be ported into the relational model for a Big Data resource. Whew!

As someone steeped in the ancient art of classification, and as someone who has written extensively on object-oriented programming, I am impressed, but not convinced, by arguments on both sides of the ontology/classification debate. As a matter of practicality, complex ontologies are not successfully implemented in Big Data projects. The job of building and operating a Big Data resource is always difficult. Imposing a complex ontology framework onto a Big Data resource tends to transform a tough job into an impossible job. Ontologists believe that Big Data resources must match the complexity of their data domain. They would argue that the dictum “keep it simple, stupid” only applies to systems that are simple at the outset (see Glossary item, KISS). I would comment here that one of the problems with ontology builders is that they tend to build ontologies that are much more complex than reality. They do so because it is actually quite easy to add layers of abstraction to an ontology, without incurring any immediate penalty.

Without stating a preference for single-class inheritance (classifications) or multiclass inheritance (ontologies), I would suggest that when modeling a complex system, you should always strive to design a model that is as simple as possible. The wise ontologist will settle for a simplified approximation of the truth. Regardless of your personal preference, you should learn to recognize when an ontology has become too complex. Here are the danger signs of an overly complex ontology.

1. Nobody, even the designers, fully understands the ontology model.

2. You realize that the ontology makes no sense. The solutions obtained by data analysts are absurd, or they contradict observations. The ontologists perpetually tinker with the model in an effort to achieve a semblance of reality and rationality. Meanwhile, the data analysts tolerate the flawed model because they have no choice in the matter.

3. For a given problem, no two data analysts seem able to formulate the query the same way, and no two query results are ever equivalent.

4. The time spent on ontology design and improvement exceeds the time spent on collecting the data that populates the ontology.

5. The ontology lacks modularity. It is impossible to remove a set of classes within the ontology without reconstructing the entire ontology. When anything goes wrong, the entire ontology must be fixed or redesigned.

6. The ontology cannot be fitted into a higher level ontology or a lower level ontology.

7. The ontology cannot be debugged when errors are detected.

8. Errors occur without anyone knowing that the error has occurred.

Simple classifications are not flawless. Here are a few danger signs of an overly simple classification.

1. The classification is too granular to be of much value in associating observations with particular instances within a class or with particular classes within the classification.

2. The classification excludes important relationships among data objects. For example, dolphins and fish both live in water. As a consequence, dolphins and fish will both be subject to some of the same influences (e.g., ocean pollutants, water-borne infectious agents, and so on). In this case, relationships that are not based on species ancestry are simply excluded from the classification of living organisms and cannot be usefully examined.

3. The classes in the classification lack inferential competence. Competence in the ontology field is the ability to infer answers based on the rules for class membership. For example, in an ontology you can subclass wines into white wines and red wines, and you can create a rule that specifies that the two subclasses are exclusive. If you know that a wine is white, then you can infer that the wine does not belong to the subclass of red wines. Classifications are built by understanding the essential features of an object that make it what it is; they are not generally built on rules that might serve the interest of the data analyst or the computer programmer. Unless a determined effort has been made to build a rule-based classification, the ability to draw logical inferences from observations on data objects will be sharply limited.

4. The classification contains a “miscellaneous” class. A formal classification requires that every instance belongs to a class with well-defined properties. A good classification does not contain a “miscellaneous class” that includes objects that are difficult to assign. Nevertheless, desperate taxonomists will occasionally assign objects of indeterminate nature to a temporary class, waiting for further information to clarify the object’s correct placement. In the classification of living organisms, two prominent examples come to mind: the fungal deuteromycetes and the eukaryotic protists. These two groups of organisms never really qualified as classes; each was a grab-bag collection containing unrelated organisms that happened to share some biological similarities. Over the decades, these pseudo-classes have insinuated their way into standard biology textbooks. The task of repairing the classification, by creating and assigning the correct classes for the members of these unnatural groupings, has frustrated biologists through many decades and is still a source of some confusion.49

5. The classification may be unstable. Simplistic approaches may yield a classification that serves well for a limited number of tasks, but fails to be extensible to a wider range of activities or fails to integrate well with classifications created for other knowledge domains. All classifications require review and revision, but some classifications are just awful and are constantly subjected to major overhauls.
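The wine example in item 3 shows what a rule buys the analyst. As a minimal sketch, assuming the exclusivity rule is stored as a set of class pairs (the names `exclusive_pairs` and `cannot_belong_to` are hypothetical), the inference takes only a few lines of Python.

```python
# Rule set: each frozenset names two subclasses declared mutually exclusive.
exclusive_pairs = {frozenset({"white wine", "red wine"})}


def cannot_belong_to(known_class, exclusive_pairs):
    """Return the classes ruled out by membership in known_class."""
    excluded = set()
    for pair in exclusive_pairs:
        if known_class in pair:
            excluded |= pair - {known_class}
    return excluded


# Knowing only that a wine is white, we can infer that it is not red.
print(cannot_belong_to("white wine", exclusive_pairs))  # {'red wine'}
```

A classification that records only parent classes, with no such rules, gives the software nothing from which to draw this conclusion.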

It seems obvious that in the case of Big Data, a computational approach to data classification is imperative, but a computational approach that consistently leads to failure is not beneficial. It is my impression that most of the ontologies that have been created for data collected in many of the fields of science have been ignored or abandoned by their intended beneficiaries. They are simply too difficult to understand and too difficult to implement.

Introduction to Resource Description Framework Schema

Is there a practical method whereby any and all data can be intelligibly organized into classes and shared over the Internet? There seems to be a solution waiting for us. The World Wide Web Consortium (W3C), the organization that develops standards for the World Wide Web, has proposed a framework for representing Web data that encompasses a very simple and clever way to assign data to identified data objects, to represent information in meaningful statements, and to assign instances to classes of objects with defined properties. The solution is known as Resource Description Framework (RDF). Using RDF, Big Data resources can design a scaffold for their information that can be understood by humans, parsed by computers, and shared by other Big Data resources. This solution transforms every RDF-compliant Web page into an accessible database whose contents can be searched, extracted, aggregated, and integrated along with all the data contained in every existing Big Data resource.

Without jumping ahead of the game, it is appropriate to discuss in this chapter the marvelous “trick” that RDF Schema employs to solve many of the complexity problems of ontologies and many of the oversimplification issues associated with classifications. It does so by introducing the concept of the class property. The class property permits the developer to assign features that can be associated with a class and its members. A property can apply to more than one class and may apply to classes that are not directly related (i.e., neither an ancestor class nor a descendant class). The concept of the assigned class property permits developers to create simple ontologies by reducing the need to create classes to account for every feature of interest to the developer. Moreover, the assigned property gives classification developers the ability to relate instances belonging to unrelated classes through their shared property features. The RDF Schema permits developers to build class structures that preserve the best qualities of both complex ontologies and simple classifications. We will discuss RDF at greater length in Chapter 4. In this section, we will restrict our attention to one aspect of RDF: its method of defining classes of objects and bestowing properties on classes, which vastly enhances the manner in which class models can be implemented in Big Data resources.

How do the Class and Property definitions of RDF Schema work? The RDF Schema is a file that defines Classes and Properties. When an RDF Schema is prepared, it is simply posted onto the Internet, as a public Web page, with a unique Web address.

An RDF Schema contains a list of classes, their definition(s), and the names of the parent class(es). This is followed by a list of properties that apply to one or more classes in the Schema. The following is an example of an RDF Schema written in plain English, without formal RDF syntax.

Plain-English RDF Schema

Class: Fungi

Definition: Contains all fungi

Subclass of: Class Opisthokonta (described in another RDF Schema)

Class: Plantae

Definition: Includes multicellular organisms such as flowering plants, conifers, ferns, and mosses

Subclass of: Class Archaeplastida (described in another RDF Schema)

Property: Stationary existence

Definition: Adult organism does not ambulate under its own power

Range of classes: Class Fungi, Class Plantae

Property: Soil habitation

Definition: Lives in soil

Range of classes: Class Fungi, Class Plantae

Property: Chitinous cell wall

Definition: Chitin is an extracellular material often forming part of the matrix surrounding cells

Range of classes: Class Opisthokonta

Property: Cellulosic cell wall

Definition: Cellulose is an extracellular material often forming part of the matrix surrounding cells

Range of classes: Class Archaeplastida

This Schema defines two classes: Class Fungi, containing all fungal species, and Class Plantae, containing the flowering plants, conifers, ferns, and mosses. The Schema defines four properties. Two of the properties (Property Stationary existence and Property Soil habitation) apply to two different classes. Two of the properties (Property Chitinous cell wall and Property Cellulosic cell wall) apply to only one class.

By assigning properties that apply to several unrelated classes, we keep the class system small, but we permit property comparisons among unrelated classes. In this case, we defined Property Stationary existence, and we indicated that the property applied to instances of Class Fungi and Class Plantae. This schema permits databases that contain data objects assigned to Class Fungi or data objects assigned to Class Plantae to include data object values related to Property Stationary existence. Data analysts can collect data from any plant or fungus data object and examine these objects for data values related to Property Stationary existence.

Property Soil habitation applies to Class Fungi and to Class Plantae. Objects of either class may include a soil habitation data value. Data objects from two unrelated classes (Class Fungi and Class Plantae) can be analyzed by a shared property.
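As a rough illustration of how a shared property lets unrelated classes be analyzed together, the plain-English schema can be sketched as ordinary Python data structures. This is not RDF syntax. The dictionary layout, the function names, and the three sample data objects are all assumptions invented for this sketch.

```python
# Hypothetical in-memory rendering of the plain-English schema.
# Layout and names are illustrative assumptions, not RDF syntax.
SCHEMA_CLASSES = {
    "Fungi": {"definition": "Contains all fungi",
              "subclass_of": "Opisthokonta"},
    "Plantae": {"definition": "Includes multicellular organisms such as "
                              "flowering plants, conifers, ferns, and mosses",
                "subclass_of": "Archaeplastida"},
}

# Each property maps to its "range of classes".
SCHEMA_PROPERTIES = {
    "stationary_existence": {"Fungi", "Plantae"},
    "soil_habitation": {"Fungi", "Plantae"},
    "chitinous_cell_wall": {"Opisthokonta"},
    "cellulosic_cell_wall": {"Archaeplastida"},
}

# Made-up data objects, for demonstration only.
DATA_OBJECTS = [
    {"id": "obj-1", "class": "Fungi", "soil_habitation": True},
    {"id": "obj-2", "class": "Plantae", "soil_habitation": True},
    {"id": "obj-3", "class": "Plantae", "soil_habitation": False},
]


def shared_properties(class_a, class_b):
    """Return the properties whose range lists both classes."""
    return sorted(name for name, classes in SCHEMA_PROPERTIES.items()
                  if class_a in classes and class_b in classes)


def objects_with_value(prop, value):
    """Collect ids of objects, from any class in the property's range,
    that hold the given value for the property."""
    allowed = SCHEMA_PROPERTIES[prop]
    return [obj["id"] for obj in DATA_OBJECTS
            if obj["class"] in allowed and obj.get(prop) == value]


print(shared_properties("Fungi", "Plantae"))
# ['soil_habitation', 'stationary_existence']
print(objects_with_value("soil_habitation", True))
# ['obj-1', 'obj-2']
```

The second call gathers a fungus and a plant into one result set, even though neither class is an ancestor or descendant of the other.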

The schema lists two other properties: Property Chitinous cell wall and Property Cellulosic cell wall. In this case, each property is assigned to one class only. Property Chitinous cell wall applies to Class Opisthokonta. Property Cellulosic cell wall applies to Class Archaeplastida. These two properties are exclusive to their class. If a data object is described as having a cellulosic cell wall, it cannot be a member of Class Opisthokonta. If a data object is described as having a chitinous cell wall, then it cannot be a member of Class Archaeplastida.

A property assigned to a class will extend to every member of every descendant class. Class Opisthokonta includes Class Fungi, and it also includes Class Animalia, the class of all animals. This means that all animals may have the property of a chitinous cell wall. In point of fact, chitin is distributed widely throughout the animal kingdom, but is not found in mammals.
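The way a property extends to descendant classes can be sketched by walking each class's parent links. The parent table below uses only the lineages named in the text. Treating Opisthokonta and Archaeplastida as roots, along with every function and variable name, is an assumption made for this sketch.

```python
# Parent links taken from the text; the two top-level classes are treated
# as roots here purely for the sake of the sketch.
PARENT = {
    "Opisthokonta": None,
    "Archaeplastida": None,
    "Fungi": "Opisthokonta",
    "Animalia": "Opisthokonta",
    "Plantae": "Archaeplastida",
}

# Each property maps to the classes it was directly assigned to.
PROPERTY_RANGE = {
    "stationary_existence": {"Fungi", "Plantae"},
    "soil_habitation": {"Fungi", "Plantae"},
    "chitinous_cell_wall": {"Opisthokonta"},
    "cellulosic_cell_wall": {"Archaeplastida"},
}


def lineage(class_name):
    """Return the class followed by its ancestors, nearest first."""
    chain = []
    while class_name is not None:
        chain.append(class_name)
        class_name = PARENT[class_name]
    return chain


def applicable_properties(class_name):
    """Properties assigned to the class itself or to any ancestor."""
    ancestors = set(lineage(class_name))
    return sorted(prop for prop, assigned in PROPERTY_RANGE.items()
                  if assigned & ancestors)


print(applicable_properties("Animalia"))
# ['chitinous_cell_wall']
print(applicable_properties("Fungi"))
# ['chitinous_cell_wall', 'soil_habitation', 'stationary_existence']
print("chitinous_cell_wall" in applicable_properties("Plantae"))
# False
```

Class Animalia picks up the chitinous cell wall property solely through its ancestor, while Class Plantae, which descends from a different ancestor, never does.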

RDF seems like a panacea for ontologists, but it is seldom used in Big Data resources. Its poor acceptance is largely due to its strangeness. Savvy data managers who have led successful careers using standard database technologies are understandably reluctant to switch over to an entirely new paradigm of information management. Realistically, a novel and untested approach to data description, such as RDF, will take decades to catch on. Whether RDF succeeds as a data description standard is immaterial. The fundamental principles upon which RDF is built are certain to dominate the world of Big Data. Everyone who works with Big Data should be familiar with the power of RDF. In the next chapter, you will learn how data formatted using RDF syntax can be assigned to classes defined in public RDF Schema documents and how the data can be integrated with any RDF-formatted data sets.

Common Pitfalls in Ontology Development

Do ontologies serve a necessary role in the design and development of Big Data resources? Yes. Because every Big Data resource is composed of many different types of information, it becomes important to assign types of data into groups that have similar properties: images, music, movies, documents, and so forth. The data manager needs to distinguish one type of data object from another and must have a way of knowing the set of properties that apply to the members of each class. When a query comes in asking for a list of songs written by a certain composer or performed by a particular musician, the data manager will need to have a software implementation wherein the features of the query are matched to the data objects for which those features apply. The ontology that organizes the Big Data resource may be called by many other names (class systems, tables, data typing, database relationships, object model), but it will always come down to some way of organizing information into groups that share a set of properties.
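A minimal sketch of matching the features of a query to the data objects for which those features apply might look like the following. The class names, property names, and sample objects are hypothetical, and a real resource would rely on its own data model rather than in-memory dictionaries.

```python
# Hypothetical table of classes and the properties that apply to each.
CLASS_PROPERTIES = {
    "Song": {"title", "composer", "performer"},
    "Image": {"title", "photographer"},
    "Document": {"title", "author"},
}

# Made-up data objects, for demonstration only.
DATA_OBJECTS = [
    {"class": "Song", "title": "Song A",
     "composer": "Composer X", "performer": "Musician Y"},
    {"class": "Song", "title": "Song B",
     "composer": "Composer Z", "performer": "Musician Y"},
    {"class": "Document", "title": "Report C", "author": "Composer X"},
]


def answer_query(**features):
    """Return titles of objects whose class supports every queried
    feature and whose values match the query."""
    wanted = set(features)
    # Only classes whose property set covers all queried features qualify.
    eligible = {name for name, props in CLASS_PROPERTIES.items()
                if wanted <= props}
    return [obj["title"] for obj in DATA_OBJECTS
            if obj["class"] in eligible
            and all(obj.get(key) == value for key, value in features.items())]


print(answer_query(composer="Composer X"))   # ['Song A']
print(answer_query(performer="Musician Y"))  # ['Song A', 'Song B']
```

The document authored by "Composer X" is never considered for the composer query, because the composer property does not apply to its class.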

Despite the importance of ontologies to Big Data resources, the process of building an ontology is seldom undertaken wisely. There is a rich and animated literature devoted to the limitations and dangers of ontology building.50,51 Here are just a few pitfalls that you should try to avoid.

1. Don’t build transient classes. Class assignment is permanent. If you assign your pet beagle to the “dog” class, you cannot pluck him from this class and reassign him to the “feline” class. Once a dog, always a dog. This may seem like an obvious condition for an ontology, but it can be very tempting to make a class known as “puppy.” This practice is forbidden because a dog assigned to class “puppy” will grow out of his class when he becomes an adult. It is better to assign “puppy” as a property of Class Dog, with a property definition of “age less than 1 year.”

2. Don’t build miscellaneous classes. Even experienced ontologists will stoop to creating a “miscellaneous” class as an act of desperation. The temptation to build a “miscellaneous” class arises when you have an instance (of a data object) that does not seem to fall into any of the well-defined classes. You need to assign the instance to a class, but you do not know enough about the instance to define a new class for the instance. To keep the project moving forward, you invent a “miscellaneous” class to hold the object until a better class can be created. When you encounter another object that does not fit into any of the defined classes, you simply assign it to the “miscellaneous” class. Now you have two objects in the “miscellaneous” class. Their only shared property is that neither object can be readily assigned to any of the defined classes. In the classification of living organisms, Class Protoctista was invented in the mid-19th century to hold, temporarily, some of the organisms that could not be classified as animal, plant, or fungus. It has taken a century for taxonomists to rectify the oversight, and it may take another century for the larger scientific community to fully adjust to the revisions. Likewise, mycologists (fungus experts) have accumulated a large group of unclassifiable fungi. A pseudoclass of fungi, deuteromycetes (spelled with a lowercase “d”, signifying its questionable validity as a true biologic class), was created to hold these indeterminate organisms until definitive classes can be assigned. At present, there are several thousand such fungi, sitting in taxonomic limbo, until they can be placed into a definitive taxonomic class.52

Sometimes, everyone just drops the ball and miscellaneous classes become permanent.53 Successive analysts, unaware that the class is illegitimate, assume that the “miscellaneous” objects are related to one another (i.e., related through their “miscellaneousness”). Doing so leads to misleading interpretations (e.g., finding similarities among unrelated data objects and failing to see relationships that would have been obvious had the objects been assigned to their correct classes). The creation of an undefined “miscellaneous” class is an example of a general design flaw known as “ontological promiscuity.”50 When an ontology is promiscuous, the members of one class cannot always be distinguished from members of other classes.

3. Don’t invent classes and properties if they have already been invented.54 Time-pressured ontologists may not wish to search, find, and study the classes and properties created by other ontologists. It is often easier to invent classes and properties as you need them, defining them in your own Schema document. If your ambitions are limited to using your own data for your own purposes, there really is no compelling reason to hunt for external ontologies. Problems will surface when you need to integrate your data objects with the data objects held in other Big Data resources. If every resource invented its own set of classes and properties, then there could be no sensible comparisons among classes, and the relationships among the data objects from the different resources could not be explored.

Most data records, even those that are held in seemingly unrelated databases, contain information that applies to more than one type of data. A medical record, a financial record, and a music video may seem to be heterogeneous types of data, but each is associated with the name of a person, and each named person might have an address (see Glossary item, Heterogeneous data). The classes of information that deal with names and addresses can be integrated across resources if they all fit into the same ontology, and if they all have the same intended meanings in each resource.

4. Use a simple data description language. If you decide to represent your data objects as triples, you will have a choice of languages, each with its own syntax, with which to describe your data objects, roughly listed here in order of increasing complexity: Notation 3, Turtle, RDF, DAML+OIL, and OWL (see Glossary items, RDF, Triple, Notation 3). Experience suggests that syntax languages start out simple; complexity is added as users demand additional functionalities. The task of expressing objects in an ontology language has gradually become a job for highly trained specialists who work in the obscure field of description logic. As the complexity of the descriptive language increases, the number of people who can understand and operate the resource tends to diminish. In general, complex descriptive languages should only be used by well-staffed and well-funded Big Data resources capable of benefiting from the added bells and whistles.

5. Do not confuse properties with your classes. When I lecture on the topic of classifications and ontologies, I always throw out the following question: “Is a leg a subclass of the human body?” Most people answer yes. The reason they give is that the normal human body contains a leg; hence leg is a subclass of the human body. They forget that a leg is not a type of human body and is therefore not a subclass of the human body. As a part of the human body, “leg” is a property of a class. Furthermore, lots of different classes of things have legs (e.g., dogs, cows, tables). The “leg” property can be applied to many different classes and is usually asserted with a “has_a” descriptor (e.g., “Fred has_a leg”). The fundamental difference between classes and properties is one of the more difficult concepts in the field of ontology.
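Two of the pitfalls above, the transient “puppy” class and the confusion of parts with subclasses, can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the class names, the `has_a` helper, and the sample dog and table are assumptions made for the sketch, and a production ontology would express the same ideas in its own description language.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Leg:
    position: str


@dataclass
class Dog:
    name: str
    birth_date: date
    legs: list = field(default_factory=list)  # "has_a" relationship

    def age_in_years(self, today):
        """Whole years elapsed between birth_date and today."""
        had_birthday = ((today.month, today.day)
                        >= (self.birth_date.month, self.birth_date.day))
        return today.year - self.birth_date.year - (0 if had_birthday else 1)

    def is_puppy(self, today):
        """'Puppy' as a property of Class Dog: age less than 1 year."""
        return self.age_in_years(today) < 1


@dataclass
class Table:
    legs: list = field(default_factory=list)  # unrelated class, same property


def has_a(obj, part_type):
    """True if any attribute of obj holds (or lists) a part_type instance."""
    for value in vars(obj).values():
        items = value if isinstance(value, list) else [value]
        if any(isinstance(item, part_type) for item in items):
            return True
    return False


rex = Dog("Rex", date(2020, 6, 1),
          [Leg(p) for p in ("front-left", "front-right",
                            "rear-left", "rear-right")])
desk = Table([Leg("corner-1")])

print(rex.is_puppy(date(2020, 12, 1)))    # True
print(rex.is_puppy(date(2021, 6, 1)))     # False
print(isinstance(rex, Dog))               # True
print(has_a(rex, Leg), has_a(desk, Leg))  # True True
print(issubclass(Leg, Dog))               # False
```

The dog's class membership never changes while the value of the puppy property does, and the leg property attaches to both a dog and a table without `Leg` being a subclass of either.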

References

19. Berman JJ. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton, FL: Chapman and Hall; 2010.

44. Pearson K. The grammar of science. London: Adam and Black; 1900.

45. Berman JJ. Racing to share pathology data. Am J Clin Pathol. 2004;121:169–171.

46. Scamardella JM. Not plants or animals: a brief history of the origin of kingdoms Protozoa, Protista and Protoctista. Intl Microbiol. 1999;2:207–216.

47. Madar S, Goldstein I, Rotter V. Did experimental biology die? Lessons from 30 years of p53 research. Cancer Res. 2009;69:6378–6380.

48. Zilfou JT, Lowe SW. Tumor suppressive functions of p53. Cold Spring Harb Perspect Biol. 2009;a001883.

49. Berman JJ. Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms. Waltham: Academic Press; 2012.

50. Suggested Upper Merged Ontology (SUMO). The OntologyPortal. Available from: http://www.ontologyportal.org; viewed August 14, 2012.

51. de Bruijn J. Using ontologies: enabling knowledge sharing and reuse on the Semantic Web. Digital Enterprise Research Institute Technical Report DERI-2003-10-29, October 2003. Available from: http://www.deri.org/fileadmin/documents/DERI-TR-2003-10-29.pdf; viewed August 14, 2012.

52. Guarro J, Gene J, Stchigel AM. Developments in fungal taxonomy. Clin Microbiol Rev. 1999;12:454–500.

53. Nakayama R, Nemoto T, Takahashi H, et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Modern Pathol. 2007;20:749–759.

54. Cote R, Reisinger F, Martens L, Barsnes H, Vizcaino JA, Hermjakob H. The ontology lookup service: bigger and better. Nucleic Acids Res. 2010;38:W155–W160.
