5

Classifications and Ontologies

Abstract

Information has limited value unless it can take its place within our general understanding of the world. “How does this thing relate to that thing?” is often the central question of scientific efforts. Ontologies are formal systems that assign information objects to classes and relate classes of information objects to other classes, often as a hierarchical lineage (i.e., classes that have superclasses and subclasses). Scientific analyses of large information resources can be greatly enhanced if every data object in the resource is positioned somewhere within a formal ontology. Using ontologies, scientists can determine whether observations on a single object will apply to other objects in the same class. Similarly, scientists can begin to ask whether observations that hold true for a class of objects will relate to other classes of objects. Basically, ontologies help scientists complete one of their most important tasks: determining how things relate to each other. This chapter will describe how ontologies are constructed, and how they are used for scientific discovery in Big Data resources.

Keywords

Ontology; Classification; Class; Subclass; Superclass; Class hierarchy; Ontologic competence; Instances; Class object

Section 5.1. It's All About Object Relationships

Order and simplification are the first steps toward the mastery of a subject.

Thomas Mann

Information has limited value unless it can take its place within our general understanding of the world. When a financial analyst learns that the price of a stock has suddenly dropped, he cannot help but wonder if the drop of a single stock reflects conditions in other stocks in the same industry. If so, the analyst may check to ensure that other industries are following a downward trend. He may wonder whether the downward trend represents a shift in the national or global economies. There is a commonality to all of the questions posed by the financial analyst. In every case, the analyst is asking a variation on a single question: “How does this thing relate to that thing?”

Big Data resources are complex. When data is simply stored in a database, without any general principles of organization, it becomes impossible to find the relationships among the data objects. To be useful the information in a Big Data resource must be divided into classes of data. Each data object within a class shares a set of properties chosen to enhance our ability to relate one piece of data with another.

Relationships are the fundamental properties of an object that determine the class in which it is placed. Every member of a class shares these same fundamental properties. A core set of relational properties is found in all the ancestral classes of an object and in all the descendant classes of an object. Similarities are just features that one or more objects have in common, but they are not fundamental relationships upon which classes can be organized. Related objects tend to be similar to one another, but these similarities occur as the consequence of their relationships; not vice versa. For example, you may have many similarities to your father. If so, you are similar to your father because you are related to him; you are not related to him because you are similar to him.

The distinction between grouping data objects by similarity and grouping data objects by relationship is sometimes lost on computer scientists. I have had numerous conversations with intelligent scientists who refuse to accept that grouping by similarity (e.g., clustering) is fundamentally different from grouping by relationship (i.e., building a classification). [Glossary Cluster analysis]

Consider a collection of 300 objects. Each object belongs to one of two classes, marked by an asterisk or by an empty box. The three hundred objects naturally cluster into three groups. It is tempting to conclude that the graph shows three classes of objects that can be defined by their similarities, but we know from the outset that the objects fall into two classes, and we see from the graph that objects from both classes are distributed in all three clusters (Fig. 5.1).

Fig. 5.1 The spatial distribution of 300 objects represented by data points in three dimensions. Each data object falls into one of two classes, represented by an asterisk or an empty box. The data naturally segregates into three clusters. Objects of type asterisk and type box are distributed throughout each cluster.

Is this graph far-fetched? Not really. Suppose you have a collection of felines and canines. The collection of dogs might include Chihuahuas, St. Bernards, and other breeds. The collection of cats might include housecats, lions, and other species, and the data collected on each animal might include weight, age, and hair length. We do not know what to expect when we cluster the animals by similarities (i.e., weight, age, and hair length) but we can be sure that short-haired cats and short-haired Chihuahuas of the same age will probably fall into one cluster. Cheetahs and greyhounds, having similar size and build, might fall into another cluster. The similarity clusters will mix together unrelated animals and will separate related animals.
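The mixing of unrelated animals can be sketched in a few lines of Python. The animals, their weights, and the weight cutoffs below are rough illustrative values invented for this example, not measured data, and grouping on a single feature is only a stand-in for a real clustering algorithm.

```python
from collections import defaultdict

# Hypothetical animals: (name, class, approximate adult weight in kg).
# The weights are rough illustrative figures, not measurements.
ANIMALS = [
    ("Chihuahua", "canine", 2.5),
    ("housecat", "feline", 4.5),
    ("greyhound", "canine", 30.0),
    ("cheetah", "feline", 50.0),
    ("St. Bernard", "canine", 75.0),
    ("lion", "feline", 190.0),
]


def weight_group(weight_kg):
    """Assign a similarity group using arbitrary weight cutoffs."""
    if weight_kg < 10:
        return "small"
    if weight_kg < 60:
        return "medium"
    return "large"


def group_by(animals, key):
    """Collect animal names under the label returned by key(animal)."""
    groups = defaultdict(list)
    for animal in animals:
        groups[key(animal)].append(animal[0])
    return dict(groups)


# Grouping by a measured feature (similarity) mixes cats with dogs.
by_similarity = group_by(ANIMALS, lambda a: weight_group(a[2]))
# Grouping by class (relationship) keeps related animals together.
by_relationship = group_by(ANIMALS, lambda a: a[1])

print(by_similarity)
print(by_relationship)
```

Under these assumed values, every weight group contains one canine and one feline, while the grouping by class keeps each family intact.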

OK, similarities are different from relationships; but how do we know when we are dealing with a similarity and when we are dealing with a true relationship? Here are two stories that may clarify the functional differences between the two concepts:

  1.  You look up at the clouds, and you begin to see the shape of a lion. The cloud has a tail, like a lion's tail, and a fluffy head, like a lion's mane. With a little imagination, the mouth of the lion seems to roar down from the sky. You have succeeded in finding similarities between the cloud and a lion. If you look at a cloud and you imagine a teakettle producing a head of steam, then you are establishing a relationship between the physical forces that create a cloud and the physical forces that produce steam from a heated kettle, and you understand that clouds are composed of water vapor.
  2.  You look up at the stars and you see the outline of a flying horse, Pegasus, or the soup ladle, the Big Dipper. You have found similarities upon which to base the names of celestial landmarks, the constellations. The constellations help you orient yourself to the night sky, but they do not tell you much about the physical nature of the twinkling objects. If you look at the stars and you see the relationship between the twinkling stars in the night sky, and the round sun in the daylight sky, then you can begin to understand how the universe operates.

For taxonomists, the importance of grouping by relationship, not by similarity, is a lesson learned the hard way. For some two thousand years, misclassifications, erroneous biological theories, and impediments to progress in medicine and agriculture arose whenever similarities were confused with relationships. Early classifications of animals were based on similarities (e.g., beak shape, color of coat, or number of toes). These kinds of classifications led to the erroneous conclusion that the various juvenile forms of holometabolous insects (i.e., insects that undergo metamorphosis) were distinct organisms, unrelated to the adult form into which they would mature. The vast field of animal taxonomy was a useless mess until taxonomists began to think very deeply about classes of organisms and the fundamental properties that accounted for the relationships among the classes. [Glossary Classification system versus identification system, Classification versus index, Phenetics]

Geneticists have learned that sequence similarities among genes may bear no relationship to their functionalities, their inheritance from higher organisms, their physical locations, or to any biological process whatsoever. Geneticists use the term homology to describe the relationship among sequences that can be credited to descent from a common ancestral sequence. Similarity among different sequences can be non-homologous, developing randomly in non-related organisms, or developing by convergence, through selection for genes that have common functionality. Sequence similarity that is not acquired from a common ancestral sequence seldom relates to the shared fundamental cellular properties that characterize inherited relationships. Biological inferences drawn from gene analyses are more useful when they are built upon phylogenetic relationships, rather than on superficial genetic or physiologic similarities [1]. [Glossary Nonphylogenetic property]

The distinction between classification by similarity and classification by relationship is vitally important to the field of computer science and to the future of Big Data analysis. I have discussed this point with many of my colleagues, who hold the opposite view: that the distinction between similarity classification and relationship classification is purely semantic, and that there is no practical difference between the two methods. Regardless of which side you may choose, the issue is worth pondering for a few moments.

Two arguments support the opinion that classification should be based on similarity measures. The first argument is that classification by similarity is the standard method by which relational classifications are built. The second argument is that relational properties are always unknown at the time that the classification is built. The foundation of every classification must be built on measurable features, and the only comparison we have for measurable features is similarity. The first argument has no scientific merit insofar as comparisons by relationship are always feasible, though not always readily computable.

The second argument, that classification by relationship requires access to unobtainable knowledge, is a clever observation that hits on a weakness in the relational theory of classification. To build a classification, you must first know the relational properties that define classes, superclasses, and subclasses; but if you want to know the relationships among the classes, you must refer to the classification. It is another bootstrapping problem. [Glossary Bootstrapping]

Building a classification is an iterative process wherein you hope that your tentative selection of relational properties and your class assignments will be validated by the test of time. You build a classification by guessing which properties are fundamental and relational and by guessing which system of classes will make sense when all of the instances of the classes are assigned. A classification is often likened to a hypothesis that must be tested again and again as the classification grows.

Is it ever possible to build a classification using a hierarchical clustering algorithm based on measuring similarities among objects? The answer is a qualified yes, assuming that the object features that you have measured happen to be the relational properties that define the classes. A good example of this process is demonstrated by the work of Carl Woese and his coworkers in the field of the classification of terrestrial organisms [2]. Woese compared ribosomal RNA sequences among organisms. Ribosomal RNA is involved in the precise synthesis of proteins according to instructions coded in genes. According to Woese, the genes coding for ribosomal RNA mutate more slowly than other genes, because ribosomal RNA has scarcely any leeway in its functionality. Changes in the sequence of ribosomal RNA act like a chronometer for evolution. Using sequence similarities Woese developed a brilliant classification of living organisms that has revolutionized evolutionary genetics. Woese's analysis is not perfect and where there are apparent mistakes in his classification, disputations focus on the limitations of using similarity as a substitute for fundamental relational properties [3,4]. [Glossary Non-living organism]

The field of medical genetics has been embroiled in a debate, lasting well over a decade, on the place of race in science. Some would argue that when the genomes of humans from different races are compared, there is no sensible way to tell one genome from another, on the basis of assigned race. The genes of a tall man and a short man are more different than the genes of an African-American man and a white man. Judged by genetic similarity, race has no scientific meaning [5]. On the other hand, every clinician understands that various diseases, congenital and acquired, occur at different rates in the African-American population than in the white population. Furthermore, the clinical symptoms, clinical outcome, and even the treatment of these diseases will sometimes differ among ethnic or racial groups. Hence, many medical epidemiologists and physicians perceive race as a clinical reality [6]. The discord stems from a misunderstanding of the meanings of similarity and of relationship. It is quite possible to have a situation wherein similarities are absent, while relationships pertain. The lack of informative genetic similarities that distinguish one race from another does not imply that race does not exist. The basis for race is the relationship created by shared ancestry. The morphologic and clinical by-products of the ancestry relationship may occur as various physical features and epidemiologic patterns found by clinicians. [Glossary Cladistics]

Fundamentally, all analysis is devoted to finding relationships among objects or classes of objects. All we ever know about the universe, and the processes that play out in our universe, can be reduced to simple relationships. In many cases, the process of finding and establishing relationships begins with finding similarities; but it must never end there.

Section 5.2. Classifications, the Simplest of Ontologies

Consciousness is our awareness of our own awareness.

Descartes

The human brain is constantly processing visual and other sensory information collected from the environment. When we walk down the street, we see images of concrete and asphalt and millions of blades of grass, birds, dogs, and other persons. Every step we take conveys a new world of sensory input. How can we process it all? The mathematician and philosopher Karl Pearson (1857–1936) has likened the human mind to a “sorting machine” [7]. We take a stream of sensory information and sort it into objects; we then collect the individual objects into general classes. The green stuff on the ground is classified as “grass,” and the grass is subclassified under some larger grouping, such as “plants.” A flat stretch of asphalt and concrete may be classified as a “road” and the road might be subclassified under “man-made constructions.” If we lacked a culturally determined classification of objects for our world, we would be overwhelmed by sensory input, and we would have no way to remember what we see, and no way to draw general inferences about anything. Simply put, without our ability to classify, we would not be human [8].

Every culture imposes its own uniform way of perceiving the environment. In English-speaking cultures, the term “hat” denotes a universally recognized object. Hats may be composed of many different types of materials, and they may vary greatly in size, weight, and shape. Nonetheless, we can almost always identify a hat when we see one, and we can distinguish a hat from all other types of objects. An object is not classified as a hat simply because it shares a few structural similarities with other hats. A hat is classified as a hat because it has a class relationship; all hats are items of clothing that fit over the head. Likewise, all biological classifications are built by relationships, not by similarities [9,8].

Aristotle was one of the first experts in classification. His greatest insight came when he correctly identified a dolphin as a mammal. Through observation, he knew that a large group of animals was distinguished by a gestational period in which a developing embryo is nourished by a placenta, and the offspring are delivered into the world as formed, but small versions of the adult animals (i.e., not as eggs or larvae), and the newborn animals feed from milk secreted from nipples, overlying specialized glandular organs (mammae). Aristotle knew that these features, characteristic of mammals, were absent in all other types of animals. He also knew that dolphins had all these features; fish did not. He correctly reasoned that dolphins were a type of mammal, not a type of fish. Aristotle was ridiculed by his contemporaries for whom it was obvious that dolphins were a type of fish. Unlike Aristotle, they based their classification on similarities, not on relationships. They saw that dolphins looked like fish and dolphins swam in the ocean like fish, and this was all the proof they needed to conclude that dolphins were indeed fish. For about two thousand years following the death of Aristotle, biologists persisted in their belief that dolphins were a type of fish. For the past several hundred years, biologists have acknowledged that Aristotle was correct after all; dolphins are mammals. Aristotle discovered and taught the most important principle of classification: that classes are built on relationships among class members, not by counting similarities [8].

Today, the formal systems that assign data objects to classes, and that relate classes to other classes, are known as ontologies. When the data within a Big Data resource is classified within an ontology, data analysts can determine whether observations on a single object will apply to other objects in the same class.

A classification is a very simple form of ontology, in which each class is allowed to have only one parent class. To build a classification, the ontologist must do the following: (1) define classes (i.e., find the properties that define a class and extend to the subclasses of the class); (2) assign instances to classes; (3) position classes within the hierarchy; and (4) test and validate all the above. [Glossary Parent class]

The constructed classification becomes a hierarchy of data objects conforming to a set of principles:

  1.  The classes (groups with members) of the hierarchy have a set of properties or rules that extend to every member of the class and to all of the subclasses of the class, to the exclusion of unrelated classes. A subclass is itself a type of class wherein the members have the defining class properties of the parent class plus some additional property(ies) specific for the subclass.
  2.  In a hierarchical classification, each subclass may have no more than one parent class. The root (top) class has no parent class. The biological classification of living organisms is a hierarchical classification.
  3.  At the bottom of the hierarchy is the class instance. For example, your copy of this book is an instance of the class of objects known as “books.”
  4.  Every instance belongs to exactly one class.
  5.  Instances and classes do not change their positions in the classification. As examples, a horse never transforms into a sheep, and a book never transforms into a harpsichord. [Glossary Intransitive property]
  6.  The members of classes may be highly similar to one another, but their similarities result from their membership in the same class (i.e., conforming to class properties), and not the other way around (i.e., similarity alone cannot define class inclusion).

Classifications are always simple; the parental classes of any instance of the classification can be traced as a simple, non-branched list, ascending through the class hierarchy. As an example, here is the lineage for the domestic horse (Equus caballus), from the classification of living organisms:

  • Equus caballus
  • Equus subg. Equus
  • Equus
  • Equidae
  • Perissodactyla
  • Laurasiatheria
  • Eutheria
  • Theria
  • Mammalia
  • Amniota
  • Tetrapoda
  • Sarcopterygii
  • Euteleostomi
  • Teleostomi
  • Gnathostomata
  • Vertebrata
  • Craniata
  • Chordata
  • Deuterostomia
  • Coelomata
  • Bilateria
  • Eumetazoa
  • Metazoa
  • Fungi/Metazoa group
  • Eukaryota
  • cellular organisms

The words in this zoological lineage may seem strange to laypersons, but taxonomists who view this lineage instantly grasp the place of domestic horses in the classification of all living organisms.

A classification is a list of every member class along with their relationships to other classes. Because each class can have only one parent class, a complete classification can be provided when we list all the classes, adding the name of the parent class for each class on the list. For example, a few lines of the classification of living organisms might be:

Craniata, subclass of Chordata
Chordata, subclass of Deuterostomia
Deuterostomia, subclass of Coelomata
Coelomata, subclass of Bilateria
Bilateria, subclass of Eumetazoa

Given the name of any class, a programmer can compute (with a few lines of code) the complete ancestral lineage for the class by iteratively finding the parent class assigned to each ascending class [10]. [Glossary Iterator]
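As a minimal sketch of such a computation, the Python fragment below stores the five example class-parent pairs in a dictionary and walks upward until it reaches a class with no recorded parent. The names `PARENT_OF` and `lineage` are chosen for this illustration; the approach in the cited reference may differ.

```python
# Each class maps to its single parent class, as in the listing above.
PARENT_OF = {
    "Craniata": "Chordata",
    "Chordata": "Deuterostomia",
    "Deuterostomia": "Coelomata",
    "Coelomata": "Bilateria",
    "Bilateria": "Eumetazoa",
}


def lineage(class_name, parent_of):
    """Return the class followed by its ancestors, nearest parent first."""
    chain = [class_name]
    # Keep climbing while the most recent class has a recorded parent.
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in chain:
            # A well-formed classification never loops back on itself.
            raise ValueError(f"cycle detected at class {parent!r}")
        chain.append(parent)
    return chain


print(lineage("Craniata", PARENT_OF))
```

Because each class has only one parent, the result is always a simple, non-branched list; the walk stops at Eumetazoa here only because the example listing stops there.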

A taxonomy is a classification with the instances “filled in.” This means that for each class in a taxonomy, all the known instances (i.e., member objects) are explicitly listed. For the taxonomy of living organisms the instances are named species. Currently, there are several million named species of living organisms, and each of these several million species is listed under the name of some class included in the full classification.

Classifications drive down the complexity of their data domain because every instance in the domain is assigned to a single class and every class is related to the other classes through a simple hierarchy.

It is important to distinguish a classification system from an identification system. An identification system puts a data object into its correct slot within the classification. For example, a fingerprint matching system may look for a set of features that puts a fingerprint into a special subclass of all fingerprints, but the primary goal of fingerprint matching is to establish the identity of an instance (i.e., to determine whether two sets of fingerprints belong to the same person). In the realm of medicine, when a doctor renders a diagnosis on a patient's disease, she is not classifying the disease; she is finding the correct slot, within the preexisting classification of diseases, that holds her patient's diagnosis.

Section 5.3. Ontologies, Classes With Multiple Parents

...science is in reality a classification and analysis of the contents of the mind...

Karl Pearson [7]

Ontologies are constructions that permit an object to be a direct subclass of more than one class. In an ontology, the class “horse” might be a subclass of Equus, a zoological term; as well as a subclass of “racing animals,” “farm animals,” and “four-legged animals.” The class “book” might be a subclass of “works of literature,” as well as a subclass of “wood-pulp materials,” and “inked products.” Ontologies are unrestrained classifications. Hence, all classifications are ontologies, but not all ontologies are classifications. Ontologies are predicated on the belief that a single object or class of objects might have multiple different fundamental identities, and that these different identities will often place one class of objects directly under more than one superclass. [Glossary Multiclass classification, Multiclass inheritance]

Data analysts sometimes prefer ontologies to classifications because they permit the analyst to find relationships among classes of objects that would have been impossible to find under a classification. For example, a data analyst might be interested in determining the relationships among groups of flying animals, such as butterflies, birds, and bats. In the classification of living organisms, these animals occupy classes that are not closely related to one another; no two of the different types of flying animals share a single parent class. Because classifications follow relationships through a lineage, they cannot connect instances of classes that fall outside the line of descent.
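The contrast can be sketched in Python. The class names and parent assignments below are simplified for illustration (the real lineages of these animals contain many intermediate classes), and “flying animals” is a hypothetical class of the kind an ontologist might add.

```python
# Ontology: each class maps to a list of direct parent classes.
# Parent assignments are simplified, hypothetical stand-ins.
ONTOLOGY = {
    "butterflies": ["insects", "flying animals"],
    "birds": ["vertebrates", "flying animals"],
    "bats": ["mammals", "flying animals"],
    "mammals": ["vertebrates"],
    "insects": ["animals"],
    "vertebrates": ["animals"],
    "flying animals": ["animals"],
}


def superclasses(class_name, ontology):
    """Return the set of all ancestor classes reachable from class_name."""
    found = set()
    pending = list(ontology.get(class_name, []))
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(ontology.get(current, []))
    return found


def shared_parents(names, ontology):
    """Return the direct parent classes common to every named class."""
    parent_sets = [set(ontology.get(name, [])) for name in names]
    return set.intersection(*parent_sets)


flyers = ["butterflies", "birds", "bats"]
print(shared_parents(flyers, ONTOLOGY))

# Keeping only the first (zoological) parent turns the ontology
# back into a single-parent classification.
classification = {name: parents[:1] for name, parents in ONTOLOGY.items()}
print(shared_parents(flyers, classification))
```

In this sketch the ontology reports “flying animals” as a parent shared by all three groups, whereas the single-parent version reports no shared parent at all.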

Ontologies are not subject to the analytic limitations imposed by classifications. In an ontology, a data object can be an instance of many different kinds of classes; thus, the class does not define the essence of the object, as it does in a classification. In an ontology, the assignment of an object to a class, and the behavior of the members of a class, are determined by rules. An object belongs to a class when it behaves like the other members of the class, according to a rule created by the ontologist. Every class, subclass, and superclass is defined by rules; and rules can be programmed into software.

Classifications were created and implemented at a time when scientists did not have powerful computers that were capable of handling the complexities of ontologies. For example, the classification of all living organisms on earth was created over a period of two millennia. Several million species have been assigned to date to the classification. It is currently estimated that we will need to add another 10–50 million species before we come close to completing the taxonomy of living organisms. Prior generations of scientists could cope with a simple classification, wherein each class of organisms falls under a single superclass; they could not hope to cope with a complex ontology of organisms.

The advent of powerful and accessible computers has spawned a new generation of computer scientists who have developed powerful methods for building complex ontologies. It is the goal of these computer scientists to analyze data in a manner that allows us to find and understand ontologic relationships among data objects.

In simple data collections, such as spreadsheets, data is organized in a very specific manner that preserves the relationships among specific types of data. The rows of the spreadsheet are the individual data objects (e.g., people, experimental samples, or classes of information). The left-hand field of the row is typically the name assigned to the data object and the cells of the row are the attributes of the data object (e.g., quantitative measurements, categorical data, and other information). Each cell of each row occurs in a specific order and the order determines the kind of information contained in the cell. Hence, every column of the spreadsheet has a particular type of information in each spreadsheet cell. [Glossary Categorical data, Observational data]

Big Data resources are much more complex than spreadsheets. The set of features belonging to an object (i.e., the values, sometimes called variables, belonging to the object, and corresponding to the cells in a spreadsheet row) will be different for different classes of objects. For example, a member of Class Automobile may have a feature such as “average miles per gallon in city driving,” while a member of Class Mammal would not. Every data object must be assigned membership in a class (e.g., Class Persons, Class Tissue Samples, and Class Bank Accounts), and every class must be assigned a set of class properties. In Big Data resources that are based on class models, the data objects are not defined by their location in a rectangular spreadsheet; they are defined by their class membership. Classes, in turn, are defined by their properties and by their relations to other classes. [Glossary Properties versus classes]
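A small Python sketch may make the point concrete. The two classes, their property names, and the values assigned to them are hypothetical; they show only that the available features follow from class membership rather than from a column position.

```python
from dataclasses import dataclass


@dataclass
class Automobile:
    """Hypothetical class; its properties apply only to automobiles."""
    name: str
    city_miles_per_gallon: float


@dataclass
class Mammal:
    """Hypothetical class; it has no fuel-economy property."""
    name: str
    gestation_days: int


# Illustrative values only.
car = Automobile(name="sedan", city_miles_per_gallon=28.0)
horse = Mammal(name="horse", gestation_days=340)

# Each object reports its class and the properties that class defines.
for data_object in (car, horse):
    class_name = type(data_object).__name__
    properties = sorted(vars(data_object))
    print(class_name, properties)
```

Asking the horse object for its city mileage would raise an error, because that property is simply not part of its class.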

The question that should confront every Big Data manager is, “Should I model my data as a classification, wherein every class has one direct parent class; or should I model the resource as an ontology, wherein classes may have multiparental inheritance?”

Section 5.4. Choosing a Class Model

Taxonomy is the oldest profession practiced by people with their clothes on.

Quentin Wheeler, referring to the belief that Adam was assigned the task of naming all the creatures.

The simple and fundamental question, “Can a class of objects have more than one parent class?” lies at the heart of several related fields: database management, computational informatics, object oriented programming, semantics, and artificial intelligence. Computer scientists are choosing sides, often without acknowledging the problem or fully understanding the stakes. For example, when a programmer builds object libraries in the Python or the Perl programming languages, he is choosing to program in a permissive environment that supports multiclass object inheritance. In Python and Perl, any class can have as many parent classes as the programmer prefers. When a programmer chooses to program in the Ruby programming language, he shuts the door on multiclass inheritance. A Ruby class can have only one direct parent class. Many programmers are totally unaware of the liberties and restrictions imposed by their choice of programming language, until they start to construct their own object libraries, or until they begin to use class libraries prepared by another programmer. [Glossary Artificial intelligence]

In object oriented programming the programming language provides a syntax whereby a named method is “sent” to data objects and a result is calculated. The named methods are functions and short programs contained in a library of methods created for a class. For example, a “close” method, written for file objects, typically shuts a file so that it cannot be accessed for read or write operations. In object-oriented languages a “close” method is sent to an instance of class “File” when the programmer wants to prohibit access to the file. The programming language, upon receiving the “close” method, will look for a method named “close” somewhere in the library of methods prepared for the “File” class. If it finds the “close” method in the “File” class library, it will apply the method to the object to which the method was sent. In simplest terms the specified file would be closed.

If the “close” method were not found among the available methods for the “File” class library, the programming language would automatically look for the “close” method in the parent class of the “File” class. In some languages the parent class of the “File” class is the “Input/Output” class. If there were a “close” method in the “Input/Output” class, the method would be sent to the “File” Object. If not, the process of looking for a “close” method would be repeated for the parent class of the “Input/Output” class. You get the idea. Object oriented languages search for methods by moving up the lineage of ancestral classes for the object instance that receives the method.
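The upward search can be observed directly in Python. The classes below are hypothetical stand-ins named after the example in the text; they are not Python's built-in file classes, and no real file is opened.

```python
class InputOutput:
    """Hypothetical parent class that supplies a close method."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False
        return "closed by InputOutput.close"


class File(InputOutput):
    """Hypothetical child class that defines no close method of its own."""

    def __init__(self, path):
        super().__init__()
        self.path = path


report = File("report.txt")
# File has no close method, so Python finds it in the parent class.
print(report.close())
# The lineage that Python searches, from the class upward.
print([cls.__name__ for cls in File.__mro__])
```

The second line of output lists File, then InputOutput, then object, the root class from which every Python class descends.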

In object oriented programming, every data object is assigned membership to a class of related objects. Once a data object has been assigned to a class, the object has access to all of the methods available to the class in which it holds membership, and to all of the methods in all the ancestral classes. This is the beauty of object oriented programming. If the object oriented programming language is constrained to single parental inheritance, as happens in the Ruby programming language, then the methods available to the programmer are restricted to a tight lineage. When the object oriented language permits multiparental inheritance, as happens in the Perl and Python programming languages, a data object can have many different ancestral classes spread horizontally and vertically through the class libraries. [Glossary Beauty]

Freedom always has its price. Imagine what happens in a multiparental object oriented programming language when a method is sent to a data object, and the data object's class library does not contain the method. The programming language will look for the named method in the library belonging to a parent class. Which parent class library should be searched? Suppose the object has two parent classes, and each of those two parent classes has a method of the same name in its class library. The functionality of the method will change depending on its class membership (i.e., a “close” method may have a different function within class File than it may have within class Transactions or class Boxes). Unless the programmer knows exactly how the language orders its search through the ancestral class libraries, there is no way to tell which of the competing methods will be applied; hence, the output of a software program written in an object oriented language that permits multiclass inheritance can be difficult to predict.
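The name collision is easy to reproduce in Python, using hypothetical classes. Python, specifically, settles the collision with a fixed method resolution order that depends on the order in which the parent classes are listed, so the outcome is repeatable for a given class definition; but it changes silently when the parents are listed the other way around, which is easy to overlook in a large class library. Other languages may apply different rules.

```python
class File:
    def close(self):
        return "file closed for reading and writing"


class Transaction:
    def close(self):
        return "transaction settled"


class LoggedTransfer(File, Transaction):
    """Inherits close from two parents; File is listed first."""


class AuditedTransfer(Transaction, File):
    """Same parents in the opposite order."""


# Same method name, same two parents, different behavior.
print(LoggedTransfer().close())
print(AuditedTransfer().close())
# The search order Python actually uses for LoggedTransfer.
print([cls.__name__ for cls in LoggedTransfer.__mro__])
```

Two classes with identical parentage respond differently to the same “close” method, purely because of the order in which their parents were declared.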

The rules by which ontologies assign class relationships can become computationally difficult. When there are no restraining inheritance rules, a class within the ontology might be an ancestor of a child class that is an ancestor of its parent class (e.g., a single class might be a grandfather and a grandson to the same class). An instance of a class might be an instance of two classes, at once. The combinatorics and the recursive options can become impossible to compute. [Glossary Combinatorics]

Those who use ontologies that allow multiclass inheritance will readily acknowledge that they have created a system that is complex and unpredictable. The ontology expert justifies his complex and unpredictable model on the observation that reality itself is complex and unpredictable. A faithful model of reality cannot be created with a simple-minded classification. With time and effort, modern approaches to complex systems will isolate and eliminate computational impedimenta; these are the kinds of problems that computer scientists are trained to solve. For example, recursion within an ontology can be avoided if the ontology is acyclic (i.e., class relationships are not permitted to cycle back onto themselves). For every problem created by an ontology an adept computer scientist will find a solution. Basically, many modern ontologists believe that the task of organizing and understanding information cannot reside within the ancient realm of classification.
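Checking that an ontology is acyclic is one of the more tractable of these problems. The following is a minimal sketch, assuming that the class relationships have been reduced to a dictionary that maps each class name to a list of its parent class names; the function name find_cycle and the sample class names are invented for the illustration.

```python
def find_cycle(parents):
    """Return a list of class names that forms a cycle, or None if acyclic.

    `parents` maps each class name to a list of its parent class names.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in parents.get(node, []):
            status = state.get(parent, UNSEEN)
            if status == IN_PROGRESS:
                # We have climbed back to a class already on our own lineage.
                return path[path.index(parent):] + [parent]
            if status == UNSEEN:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(parents):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


acyclic = {"Beagle": ["Dog"], "Dog": ["Mammal"], "Mammal": []}
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}  # A is both ancestor and descendant of C

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```

A check of this kind can be run whenever a class is added, so that a relationship that would make a class its own ancestor is rejected before it enters the ontology.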

For those non-programmers who believe in the supremacy of classifications over ontologies, their faith may have nothing to do with the computational dilemmas incurred with multiclass parental inheritance. They base their faith on epistemological grounds; on the nature of objects. They hold that an object can only be one thing. You cannot pretend that one thing is really two or more things simply because you insist that it is so. One thing can only belong to one class. One class can only have one parent class; otherwise, it would have a dual nature. For classical taxonomists, assigning more than one parental class to an object indicates that you have failed to grasp the essential nature of the object. The classification expert believes that ontologies (i.e., classifications that permit one class to have more than one parent class and that permit one object to hold membership in more than one class) do not accurately represent reality.

At the heart of traditional classifications is the notion that everything in the universe has an essence that makes it one particular thing and nothing else. This belief is justified for many different kinds of systems. When an engineer builds a radio, he knows that he can assign names to components, and these components can be relied upon to behave in a manner that is characteristic of their type. A capacitor will behave like a capacitor, and a resistor will behave like a resistor. The engineer need not worry that the capacitor will behave like a semiconductor or an integrated circuit.

What is true for the radio engineer may not hold true for the Big Data analyst. In many complex systems the object changes its function depending on circumstances. For example, cancer researchers discovered a protein that plays an important role in the development of cancer. This protein, p53, was, at one time, considered to be the primary cellular driver for human malignancy. When p53 mutated, cellular regulation was disrupted and cells proceeded down a slippery path leading to cancer. In the past few decades, as more information was obtained, cancer researchers have learned that p53 is just one of many proteins that play some role in carcinogenesis, but the role changes depending on the species, tissue type, cellular microenvironment, genetic background of the cell, and many other factors. Under one set of circumstances, p53 may play a role in DNA repair; under another set of circumstances, p53 may cause cells to arrest the growth cycle [11,12]. It is difficult to classify a protein that changes its primary function based on its biological context.

As someone steeped in the ancient art of classification, and as someone who has written extensively on object oriented programming, I am impressed, but not convinced, by arguments on both sides of the ontology/classification debate. As a matter of practicality, complex ontologies are nearly impossible to implement in Big Data projects. The job of building and operating a Big Data resource is always difficult. Imposing a complex ontology framework onto a Big Data resource tends to transform a tough job into an impossible job. Ontologists believe that the Big Data resources must match the complexity of their data domain. They would argue that the dictum “Keep it simple, stupid!” only applies to systems that are simple at the outset. I would comment here that one of the problems with ontology builders is that they tend to build ontologies that are much more complex than our reality. They do so because it is actually quite easy to add layers of abstraction to an ontology without incurring any immediate penalty. [Glossary KISS]

Without stating a preference for single-class inheritance (classifications) or multi-class inheritance (ontologies), I would suggest that when modeling a complex system, you should always strive to design a model that is as simple as possible. The wise ontologist will settle for a simplified approximation of the truth. Regardless of your personal preference, you should learn to recognize when an ontology has become too complex for its own good.

Here are the danger signs of an overly-complex ontology:

  •   You realize that the ontology makes no sense. The solutions obtained by data analysts contradict direct observations. The ontologists perpetually tinker with the model in an effort to achieve a semblance of reality and rationality. Meanwhile, the data analysts tolerate the flawed model because they have no choice in the matter.
  •   For a given problem, no two data analysts seem able to formulate the query the same way and no two query results are ever equivalent.
  •   The time spent on ontology design and improvement exceeds the time spent on collecting the data that populates the ontology.
  •   The ontology lacks modularity. It is impossible to remove a set of classes within the ontology without reconstructing the entire ontology. When anything goes wrong the entire ontology must be fixed or redesigned.
  •   The ontology cannot be fitted into a higher level ontology or a lower level ontology.
  •   The ontology cannot be debugged when errors are detected.
  •   Errors occur without anyone knowing where the error has occurred.
  •   Nobody, even the designers, fully understands the ontology model.

Simple classifications are not flawless. Here are a few danger signs of an overly-simple classification.

  1.  The classification is too granular.

You find it difficult to associate observations with particular instances within a class or to particular classes within the classification.

  2.  The classification excludes important relationships among data objects.

For example, dolphins and fish both live in water. As a consequence, dolphins and fish will both be subject to some of the same influences (e.g., ocean pollutants and water-borne infectious agents). In this case, relationships that are not based on species ancestry are simply excluded from the classification of living organisms and cannot be usefully examined.

  3.  The classes in the classification lack inferential competence.

Competence in the ontology field is the ability to infer answers based on the rules for class membership. For example, in an ontology you can subclass wines into white wines and red wines and you can create a rule that specifies that the two subclasses are exclusive. If you know that a wine is white, then you can infer that the wine does not belong to the subclass of red wines. Classifications are built by understanding the essential features of an object that make it what it is; they are not generally built on rules that might serve the interests of the data analyst or the computer programmer. Unless a determined effort has been made to build a rule-based classification, the ability to draw logical inferences from observations on data objects will be sharply limited.
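The wine example can be written down as a rule that a computer can apply. The sketch below is an illustration under stated assumptions: the dictionary SUBCLASS_OF, the rule set DISJOINT, and the function can_belong are hypothetical names, not part of any ontology toolkit.

```python
# Hypothetical class hierarchy: each subclass names its parent class.
SUBCLASS_OF = {"WhiteWine": "Wine", "RedWine": "Wine"}

# Hypothetical rule base: each pair listed here is declared mutually exclusive.
DISJOINT = {frozenset({"WhiteWine", "RedWine"})}


def can_belong(known_class, candidate_class):
    """Return False when a disjointness rule excludes the candidate class."""
    return frozenset({known_class, candidate_class}) not in DISJOINT


def ancestors(cls):
    """Return the ancestral classes of cls, nearest first."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


# We are told only that the wine is white.
may_be_red = can_belong("WhiteWine", "RedWine")
is_a_wine = "Wine" in ancestors("WhiteWine")

print(may_be_red, is_a_wine)
```

From the single observation that a wine is white, the rules yield two inferences: the wine is not a red wine, and the wine is a member of Class Wine. A classification built without such rules offers the analyst only the second.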

  4.  The classification contains a “miscellaneous” class.

A formal classification requires that every instance belongs to a class with well-defined properties. A good classification does not contain a “miscellaneous” class that includes objects that are difficult to assign. Nevertheless, desperate taxonomists will occasionally assign objects of indeterminate nature to a temporary class, waiting for further information to clarify the object's correct placement. In the field of biological taxonomy, the task of creating and assigning the correct classes for the members of these unnatural and temporary groupings has frustrated biologists over many decades, and is still a source of some confusion [13]. [Glossary Unclassifiable objects]

  5.  The classification is unstable.

Simplistic approaches may yield a classification that serves well for a limited number of tasks, but fails to be extensible to a wider range of activities or fails to integrate well with classifications created for other knowledge domains. All classifications require review and revision, but some classifications are just awful and are constantly subjected to major overhauls.

It seems obvious that in the case of Big Data, a computational approach to data classification is imperative, but a computational approach that consistently leads to failure is not beneficial. Many of the ontologies that have been created for data collected in many of the fields of science have been ignored or abandoned by their intended beneficiaries. Ontologies, due to their multi-lineage ancestries, are simply too difficult to understand and too difficult to implement.

Section 5.5. Class Blending

It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.

Mark Twain

A blended class, also known as a noisy class, results when the taxonomist assigns unrelated objects to the same class. This almost always leads to errors in data analysis whose cause is nearly impossible to find. As an example of class blending, suppose you were testing the effectiveness of an antibiotic on a group of subjects all having a specific type of bacterial pneumonia. In this case, the accuracy of your results will be forfeit when your study population includes subjects with viral pneumonia, smoking-related lung damage, or a pneumonia produced by a bacterium other than the one known to be sensitive to the antibiotic under study. Basically, a classification has no value if its classes contain unrelated members.

Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform and representative population. Sometimes class blending occurs when an incompetent curator misplaces data objects into the wrong class. For example, you would not want to hire an astronomer who cannot distinguish a moon from a planet. More commonly, however, the problem lies within the classification itself. It is not uncommon for the formal class definition (which includes objective criteria for including or excluding objects from the class) to be ill-conceived.

One caveat. Efforts to eliminate class blending can be counterproductive if undertaken with excessive zeal. In an effort to reduce class blending, a researcher may choose groups of subjects who are uniform with respect to every known observable property. For example, suppose you want to actually compare apples with oranges. To avoid class blending, you might want to make very sure that your apples do not include any cumquats or persimmons. You should be certain that your oranges do not include any limes or grapefruits. Imagine that you go even further, choosing only apples and oranges of one variety (e.g., McIntosh apples and Navel oranges), size (e.g., 10 cm), and origin (e.g., California). How will your comparisons apply to the varieties of apples and oranges that you have excluded from your study? You may actually reach conclusions that are invalid and irreproducible for more generalized populations within each class. In this case, you have succeeded in eliminating class blending at the expense of losing representative subpopulations of the classes. Some days, the more you try, the more you lose. [Glossary Representation bias, Confounder]

Section 5.6. Common Pitfalls in Ontology Development

The hallmark of good science is that it uses models and theory but never believes them.

Martin Wilk

Do ontologies serve a necessary role in the design and development of Big Data resources? Yes. Because every Big Data resource is composed of many different types of information, it becomes important to assign types of data into groups that have similar properties: images, music, movies, documents, and so forth. The data manager needs to distinguish one type of data object from another, and must have a way of knowing the set of properties that apply to the members of each class. When a query comes in asking for a list of songs written by a certain composer, or performed by a particular musician, the data manager will need to have a software implementation wherein the features of the query are matched to the data objects for which those features apply. The ontology that organizes the Big Data resource may be called by many other names (class systems, tables, data typing, database relationships, object model), but it will always come down to some way of organizing information into groups that share a set of properties.

Despite the importance of ontologies to Big Data resources, the process of building an ontology is seldom undertaken wisely. There is a rich and animated literature devoted to the limitations and dangers of ontology-building [14,15]. Here are just a few pitfalls that you should try to avoid:

  •   Do not build transitive classes.

Class assignment is permanent. If you assign your pet beagle to the “dog” class, you cannot pluck him from this class and reassign him to the “feline” class. Once a dog, always a dog. This may seem like an obvious condition for an ontology, but it can be very tempting to make a class known as “puppy.” This practice is forbidden because a dog assigned to class “puppy” will grow out of his class when he becomes an adult. It is better to assign “puppy” as a property of Class Dog, with a property definition of “age less than one year.”
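The advice to treat “puppy” as a property, and not as a class, might be sketched in Python as follows. The class Dog and its method is_puppy are hypothetical; the one-year threshold is the property definition given in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Dog:
    """Hypothetical Class Dog; membership never changes over the dog's life."""

    name: str
    birth_date: date

    def age_in_years(self, today):
        # Subtract one if this year's birthday has not yet arrived.
        before_birthday = (today.month, today.day) < (
            self.birth_date.month,
            self.birth_date.day,
        )
        return today.year - self.birth_date.year - int(before_birthday)

    def is_puppy(self, today):
        """Property 'puppy': age less than one year."""
        return self.age_in_years(today) < 1


beagle = Dog("Rex", date(2020, 1, 15))

puppy_then = beagle.is_puppy(date(2020, 6, 1))
puppy_later = beagle.is_puppy(date(2023, 6, 1))
still_a_dog = isinstance(beagle, Dog)

print(puppy_then, puppy_later, still_a_dog)
```

The value of the property changes as the dog ages, but the object never leaves Class Dog, so the permanence of class assignment is preserved.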

  •   Do not build miscellaneous classes.

As previously mentioned, even experienced ontologists will stoop to creating a “miscellaneous” class, as an act of desperation. The temptation to build a “miscellaneous” class arises when you have an instance (of a data object) that does not seem to fall into any of the well-defined classes. You need to assign the instance to a class, but you do not know enough about the instance to define a new class for the instance. To keep the project moving forward, you invent a “miscellaneous” class to hold the object until a better class can be created. When you encounter another object that does not fit into any of the defined classes, you simply assign it to the “miscellaneous” class. Now you have two objects in the “miscellaneous” class. Their only shared property is that neither object can be readily assigned to any of the defined classes. In the classification of living organisms, Class Protoctista was invented in the mid-nineteenth century to hold, temporarily, some of the organisms that could not be classified as animal, plant, or fungus. It has taken a century for taxonomists to rectify the oversight, and it may take another century for the larger scientific community to fully adjust to the revisions. Likewise, mycologists (fungus experts) have accumulated a large group of unclassifiable fungi. A pseudoclass of fungi, deuteromycetes (spelled with a lowercase “d”, signifying its questionable validity as a true biologic class) was created to hold these indeterminate organisms until definitive classes can be assigned. At present, there are several thousand such fungi, sitting in taxonomic limbo, until they can be placed into a definitive taxonomic class [16]. [Glossary Negative classifier]

Sometimes, everyone just drops the ball and miscellaneous classes become permanent [17]. Successive analysts, unaware that the class is illegitimate, assume that the “miscellaneous” objects are related to one another (i.e., related through their “miscellaneousness”). Doing so leads to misleading interpretations (e.g., finding similarities among unrelated data objects, and failing to see relationships that would have been obvious had the objects been assigned to their correct classes). The creation of an undefined “miscellaneous” class is an example of a general design flaw known as “ontological promiscuity” [14]. When an ontology is promiscuous, the members of one class cannot always be distinguished from members of other classes.

  •   Do not confuse properties with classes.

Whenever I lecture on the topic of classifications and ontologies, I always throw out the following question: “Is a leg a subclass of the human body?” Most people answer yes. They reason that the normal human body contains a leg; hence leg is a subclass of the human body. They forget that a leg is not a type of human body, and is therefore not a subclass of the human body. As a part of the human body, “leg” is a property of a class. Furthermore, lots of different classes of things have legs (e.g., dogs, cows, tables). The “leg” property can be applied to many different classes and is usually asserted with a “has_a” descriptor (e.g., “Fred has_a leg”). The fundamental difference between classes and properties is one of the more difficult concepts in the field of ontology.
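The distinction between “is a type of” and “has_a” can be made concrete in Python, where a subclass is declared in the class statement and a property is held as an attribute. All of the class names below (Mammal, Human, Table, Leg) are invented for this sketch.

```python
class Leg:
    """A leg is a part; it is not a type of body, animal, or furniture."""


class Mammal:
    pass


class Human(Mammal):
    """Human is a subclass: every human is a type of mammal."""

    def __init__(self):
        self.legs = [Leg(), Leg()]  # has_a relationship


class Table:
    """An unrelated class that shares the same property."""

    def __init__(self):
        self.legs = [Leg() for _ in range(4)]  # has_a relationship


fred = Human()
desk = Table()

human_is_mammal = issubclass(Human, Mammal)   # class relationship
leg_is_human = issubclass(Leg, Human)         # a leg is not a type of human
both_have_legs = hasattr(fred, "legs") and hasattr(desk, "legs")

print(human_is_mammal, leg_is_human, both_have_legs)
```

Class Human and Class Table share no ancestry in this sketch, yet both can be queried for the “legs” property; this is what it means for a property to apply to many different classes.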

  •   Do not invent classes and properties that have already been invented [18].

Time-pressured ontologists may not wish to search, find, and study the classes and properties created by other ontologists. It is often easier to invent classes and properties as you need them, defining them in your own Schema document. If your ambitions are limited to using your own data for your own purposes, there really is no compelling reason to hunt for external ontologies. Problems will surface only if you need to integrate your data objects with the data objects held in other Big Data resources. If every resource invented its own set of classes and properties, then there could be no sensible comparisons among classes, and the relationships among the data objects from the different resources could not be explored.

Most data records, even those that are held in seemingly unrelated databases, contain information that applies to more than one class of data. A medical record, a financial record and a music video may seem to be heterogeneous types of data, but each is associated with the name of a person, and each named person might have an address. The classes of information that deal with names and addresses can be integrated across resources if they all fit into the same ontology, and if they all have the same intended meanings in each resource. [Glossary Heterogeneous data]

  •   Do not use a complex data description language.

If you decide to represent your data objects as triples, you will have a choice of languages, each with its own syntax, with which to describe your data objects. Examples of “triple” languages, roughly listed in order of increasing complexity, are: Notation 3, Turtle, RDF, DAML/OIL, and OWL. Experience suggests that syntax languages start out simple; complexity is added as users demand additional functionalities. The task of expressing triples in DAML/OIL or OWL has gradually become a job for highly trained specialists who work in the obscure field of description logic. As the complexity of the descriptive language increases, the number of people who can understand and operate the resource tends to diminish. In general, complex descriptive languages should only be used by well-staffed and well-funded Big Data resources capable of benefiting from the added bells and whistles. [Glossary RDF, Triple]

Section 5.7. Case Study: An Upper Level Ontology

An idea can be as flawless as can be, but its execution will always be full of mistakes.

Brent Scowcroft

Knowing that ontologies reach into higher ontologies, ontologists have endeavored to create upper level ontologies to accommodate general classes of objects, under which the lower ontologies may take their place. One such ontology is SUMO, the Suggested Upper Merged Ontology, created by a group of talented ontologists [19]. SUMO is owned by IEEE (Institute of Electrical and Electronics Engineers), and is freely available, subject to a usage license [14]. [Glossary RDF Ontology]

As an upper level ontology, SUMO contains classes of objects that other ontologies can refer to as their superclasses. SUMO permits multiple class inheritance. For example, in SUMO, the class of humans is assigned to two different parent classes: Class Hominid and Class CognitiveAgent. “HumanCorpse,” another SUMO class, is defined in SUMO as “A dead thing that was formerly a Human.” HumanCorpse is a subclass of Class OrganicObject; not of Class Human. This means that a human, once it ceases to live, transits to a class that is not directly related to the class of humans. Consequently, members of Class Human, in the SUMO ontology, will change their class and their ancestral lineage, at different moments in time, thus violating the intransitive rule of classification. [Glossary Superclass]

What went wrong?

  •   Class HumanCorpse was not created as a subclass of Class Human. This was a mistake, as all humans will eventually die. If we were to create two classes, one called Class Living Human and one called Class Deceased Human, we would certainly cover all possible human states of being, but we would be creating a situation where members of a class are forced to transition out of their class and into another (violating the intransitive rule of classification). The solution, in this case, is simple. Life and death are properties of organisms, and all organisms can and will have both properties, but never at the same time. Assign organisms the properties of life and of death, and stop there.

One last quibble. Consider these two classes from the SUMO ontology, both of which happen to be subclasses of Class Substance.

Subclass NaturalSubstance
Subclass SyntheticSubstance

It would seem that these two subclasses are mutually exclusive. However, diamonds occur naturally, and diamonds can be synthesized. Hence, diamond belongs to Subclass NaturalSubstance and to Subclass SyntheticSubstance. The ontology creates two mutually exclusive classes that nevertheless share members. This is problematic, because it violates the uniqueness rule of classifications. We cannot create sensible inference rules for objects that occupy mutually exclusive classes.

What went wrong?

  •   At first glance, the concepts “NaturalSubstance” and “SyntheticSubstance” would appear to be subclasses of “Substance.” Are they really? Would it not be better to think that being “natural” or being “synthetic” are just properties of substances, not types of substances? If we agree that diamonds are a member of class substance, we can say that any specific diamond may have occurred naturally or through synthesis. We can eliminate two subclasses (i.e., “NaturalSubstance” and “SyntheticSubstance”) and replace them with two properties of class “Substance”: synthetic and natural. By assigning properties to a class of objects, we simplify the ontology (by reducing the number of subclasses), and we eliminate problems created when a class member belongs to two mutually exclusive subclasses. We will discuss the role of properties in classifications in Section 5.9.
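The proposed repair, replacing two subclasses with a property, might look like this in Python. This is a sketch of the idea only; it is not how SUMO is written, and the names Substance and Origin are introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Hypothetical property values; a given specimen holds exactly one."""

    NATURAL = "natural"
    SYNTHETIC = "synthetic"


@dataclass(frozen=True)
class Substance:
    """One class, with origin recorded as a property of each instance."""

    name: str
    origin: Origin


mined = Substance("diamond", Origin.NATURAL)
lab_grown = Substance("diamond", Origin.SYNTHETIC)

same_class = type(mined) is type(lab_grown)
same_kind = mined.name == lab_grown.name
different_origin = mined.origin is not lab_grown.origin

print(same_class, same_kind, different_origin)
```

Both diamonds sit in a single class, and no object is asked to occupy two mutually exclusive classes; the natural or synthetic distinction survives as a value that can be queried.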

As ontologies go, SUMO is one of the best, serving a useful purpose as an upper level repository of classes that can be used freely by Big Data scientists who are trying to simplify how they classify their data objects. Nonetheless, SUMO is not perfect and we are reminded that all ontologies are works-in-progress that must be critically examined, tested, and improved, in perpetuity. [Glossary Data scientist]

Section 5.8. Case Study (Advanced): Paradoxes

Owners of dogs will have noticed that, if you provide them with food, water, shelter, and affection, they will think you are god. Whereas owners of cats are compelled to realize that, if you provide them with food, water, shelter, and affection, they draw the conclusion that they are gods.

Christopher Hitchens

The rules for constructing classifications seem obvious and simplistic. Surprisingly, the task of building a logical, self-consistent classification is extremely difficult. Most classifications are rife with logical inconsistencies and paradoxes. Let us look at a few examples.

In 1975, while touring the Bethesda, Maryland, campus of the National Institutes of Health, I was informed that their Building 10 was the largest all-brick building in the world, providing a home to over 7 million bricks. Soon thereafter, an ambitious construction project was undertaken to greatly expand the size of Building 10. When the work was finished, Building 10 was no longer the largest all-brick building in the world. What happened? The builders used material other than brick, and Building 10 lost its classification as an all-brick building, violating the immutability rule of class assignments.

Apparent paradoxes that plague any formal conceptualization of classifications are not difficult to find. Let us look at a few more examples.

Consider the geometric class of ellipses: planar curves for which the sum of the distances from any point on the curve to two focal points is constant. Class Circle is a child of Class Ellipse in which the two focal points of each instance occupy the same position, at the center, producing a radius of constant size. Imagine that Class Ellipse is provided with a class method called “stretch,” in which the foci are moved further apart, thus producing flatter objects. When the parent class “stretch” method is applied to members of the Class Circle, the circle stops being a circle and becomes an ordinary ellipse. Hence the inherited “stretch” method forces members of Class Circle to transition out of their assigned class, violating the intransitive rule of classifications. [Glossary Method]
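The paradox is easy to reproduce. In this hedged Python sketch, the ellipse is described by its semi-major and semi-minor axes (an assumption made for convenience), and “stretch” lengthens the major axis, which moves the foci further apart. The method names are invented for the illustration.

```python
import math


class Ellipse:
    def __init__(self, semi_major, semi_minor):
        self.semi_major = semi_major
        self.semi_minor = semi_minor

    def focal_distance(self):
        """Distance from the center to each focus."""
        return math.sqrt(self.semi_major ** 2 - self.semi_minor ** 2)

    def stretch(self, factor):
        """Lengthen the major axis (factor >= 1), moving the foci apart."""
        self.semi_major *= factor


class Circle(Ellipse):
    """An ellipse whose two foci coincide at the center."""

    def __init__(self, radius):
        super().__init__(radius, radius)

    def has_constant_radius(self):
        return self.semi_major == self.semi_minor


c = Circle(1.0)
round_before = c.has_constant_radius()

c.stretch(2.0)  # inherited from Ellipse
round_after = c.has_constant_radius()
still_labeled_circle = isinstance(c, Circle)

print(round_before, round_after, still_labeled_circle)
```

After the inherited method runs, the object is still labeled as a member of Class Circle, but it no longer satisfies the definition of a circle; its foci have separated.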

Let us look at the “Bag” class of objects. A “Bag” is a collection of objects and the Class Bag is included in most object oriented programming languages. A “Set” is also a collection of objects (i.e., a subclass of Bag), with the special feature that duplicate instances are not permitted. For example, if Kansas is a member of the set of United States states, then you cannot add a second state named “Kansas” to the set. If Class Bag were to have an “increment” method, that added “1” to the total count of objects in the bag, whenever an object is added to Class Bag, then the “increment” method would be inherited by all of the subclasses of Class Bag, including Class Set. But Class Set cannot increase in size when duplicate items are added. Hence, inheritance creates a paradox in the Class Set. [Glossary Inheritance]
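A sketch of the Bag and Set paradox follows. The method names increment, store, and add are hypothetical, and the design (Class Set overriding only the storage step) is one plausible way the problem could arise, not a description of any particular language's class library.

```python
class Bag:
    """A collection that permits duplicates and counts every addition."""

    def __init__(self):
        self.items = []
        self.count = 0

    def increment(self):
        self.count += 1

    def store(self, item):
        self.items.append(item)

    def add(self, item):
        self.store(item)
        self.increment()  # every addition raises the count by 1


class Set(Bag):
    """A Bag that refuses duplicates; only the storage step is overridden."""

    def store(self, item):
        if item not in self.items:
            self.items.append(item)


states = Set()
states.add("Kansas")
states.add("Kansas")  # duplicate: not stored, but still counted

reported_count = states.count
actual_size = len(states.items)

print(reported_count, actual_size)
```

The inherited “increment” behavior reports two objects in a set that holds only one. Nothing in Class Set is wrong when read alone; the contradiction comes from a parent class whose methods do not hold true for the child.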

How does a data scientist deal with class objects that disappear from their assigned class and reappear elsewhere? In the examples discussed here, we saw the following:

  1.  Building 10 at NIH was defined as the largest all-brick building in the world. Strictly speaking, Building 10 was a structure; it had a certain weight and dimensions, and it was constructed of brick. “Brick” is an attribute or property of buildings and properties cannot form the basis of a class of building, if they are not a constant feature shared by all members of the class (i.e., some buildings have bricks; others do not). Had we not conceptualized an “all-brick” class of building, we would have avoided any confusion.
  2.  Class Circle qualified as a member of Class Ellipse, because a circle can be imagined as an ellipse whose two focal points happen to occupy the same location. Had we defined Class Ellipse to specify that class members must have two separate focal points, we could have excluded circles from class Ellipse. Hence, we could have safely included the stretch method in Class Ellipse without creating a paradox.
  3.  Class Set was made a subclass of Class Bag, but the increment method of class Bag could not apply to Class Set. We created Class Set without taking into account the basic properties of Class Bag, which must apply to all its subclasses. Perhaps it would have been better if Class Set and Class Bag were created as children of Class Collection; each with its own set of properties.

Section 5.9. Case Study (Advanced): RDF Schemas and Class Properties

It's OK to figure out murder mysteries, but you shouldn't need to figure out code. You should be able to read it.

Steve McConnell

In Section 4.5, “Case Study: A Syntax for Triples,” we introduced the topic of RDF Schemas, and defined them as web-accessible documents that contain the definitions of classes. How does the RDF schema know how to describe the classes in such a way that computers can understand the class definitions and determine the properties that convey to all the members of a class, and to every member of every subclass of a class? Without moving too far beyond the scope of this book, we can discuss here the marvelous “trick” that RDF Schema employs that solves many of the complexity problems of ontologies and many of the over-simplification issues associated with classifications. It does so by introducing the new concept of class property. The class property permits the developer to assign features that can be associated with a class and its members. A property can apply to more than one class, and may apply to classes that are not directly related (i.e., neither an ancestor class nor a descendant class). The concept of the assigned class property permits developers to create simple ontologies, by reducing the need to create classes to account for every feature of interest to the developer. Moreover, the concept of the assigned property gives classification developers the ability to relate instances belonging to unrelated classes through their shared property features. The RDF Schema permits developers to build class structures that preserve the best qualities of both complex ontologies and simple classifications.

How do the Class and Property definitions of RDF Schema work? The RDF Schema is a file that defines Classes and Properties. When an RDF Schema is prepared, it is simply posted onto the Internet, as a public Web page, with a unique Web address.

An RDF Schema contains a list of classes, their definitions, and the names of the parent class(es). This is followed by a list of properties that apply to one or more classes in the Schema. The following is an example of an RDF Schema written in plain English, without formal RDF syntax.

Class: Fungi
Definition: Contains all fungi
Subclass of: Class Opisthokonta (described in another RDF Schema)

Class: Plantae
Definition: Includes multicellular organisms such as flowering plants, conifers, ferns and mosses.
Subclass of: Class Archaeplastida (described in another RDF Schema)

Property: Stationary existence
Definition: Adult organism does not ambulate under its own power.
Range of classes: Class Fungi, Class Plantae

Property: Soil-habitation
Definition: Lives in soil.
Range of classes: Class Fungi, Class Plantae

Property: Chitinous cell wall
Definition: Chitin is an extracellular material often forming part of the matrix surrounding cells.
Range of classes: Class Opisthokonta

Property: Cellulosic cell wall
Definition: Cellulose is an extracellular material often forming part of the matrix surrounding cells.
Range of classes: Class Archaeplastida

This Schema defines two classes: Class Fungi, containing all fungal species, and Class Plantae, containing the flowering plants, conifers, ferns and mosses. The Schema defines four properties. Two of the properties (Property Stationary existence and Property Soil-habitation) apply to two different classes. Two of the properties (Property Chitinous cell wall and Property Cellulosic cell wall) apply to only one class.

By assigning properties that apply to several unrelated classes, we keep the class system small, but we permit property comparisons among unrelated classes. In this case, we defined Property Stationary existence and we indicated that the property applied to instances of Class Fungi and Class Plantae. This schema permits databases that contain data objects assigned to Class Fungi or data objects assigned to Class Plantae to include data object values related to Property Stationary existence. Data analysts can collect data from any plant or fungus data object and examine these objects for data values related to Stationary existence.

Property Soil-habitation applies to Class Fungi and to Class Plantae. Objects of either class may include soil-habitation data values. Data objects from two unrelated classes (Class Fungi and Class Plantae) can be analyzed by a shared property.

The schema lists two other properties, Property Chitinous cell wall and Property Cellulosic cell wall. In this case each property is assigned to one class only. Property Chitinous cell wall applies to Class Opisthokonta. Property Cellulosic cell wall applies to Class Archaeplastida. These two properties are exclusive to their class. If a data object is described as having a cellulosic cell wall, it cannot be a member of Class Opisthokonta. If a data object is described as having a chitinous cell wall, then it cannot be a member of Class Archaeplastida.

A property assigned to a class will extend to every member of every descendant class. Class Opisthokonta includes Class Fungi and it also includes Class Animalia, the class of all animals. This means that all animals may have the property of chitinous cell wall. In point of fact, chitin is distributed widely through the animal kingdom, but is not found in mammals.
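The way a property extends from a class to its descendants can be sketched in a few lines of Python. This is only an illustration, not RDF: the dictionaries PARENT and PROPERTY_RANGE and the functions lineage and properties_of are hypothetical names, and the entries simply restate the plain-English schema above, with Class Animalia added as the second child of Class Opisthokonta.

```python
# Hypothetical in-memory stand-in for the plain-English schema above.
# Each class maps to its parent class (None marks the top of this fragment).
PARENT = {
    "Opisthokonta": None,
    "Archaeplastida": None,
    "Fungi": "Opisthokonta",
    "Animalia": "Opisthokonta",
    "Plantae": "Archaeplastida",
}

# Each property maps to the classes named in its "Range of classes" line.
PROPERTY_RANGE = {
    "Stationary existence": {"Fungi", "Plantae"},
    "Soil-habitation": {"Fungi", "Plantae"},
    "Chitinous cell wall": {"Opisthokonta"},
    "Cellulosic cell wall": {"Archaeplastida"},
}

def lineage(cls):
    """Return the class followed by all of its ancestors, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain

def properties_of(cls):
    """Collect every property assigned to the class or to any of its ancestors."""
    ancestors = set(lineage(cls))
    return sorted(p for p, rng in PROPERTY_RANGE.items() if rng & ancestors)

print(properties_of("Fungi"))
print(properties_of("Animalia"))
```

Class Fungi picks up the two properties assigned to it directly, plus the chitinous cell wall property assigned to its parent. Class Animalia receives only the property inherited from Class Opisthokonta.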

As the name implies, RDF Schemas are written in RDF syntax. In practice, many of the so-called RDF Schema documents found on the web are prepared in alternate formats. They are nominally RDF syntax because they create a namespace for classes and properties referred to by triples listed in RDF documents.

Here is a short schema, written as Turtle triples, and held in a fictitious web site, “http://www.fictitious_site.org/schemas/life#”: [Glossary Turtle]

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@base <http://www.fictitious_site.org/schemas/life#> .
:Homo instance_of rdfs:Class.
:HomoSapiens instance_of rdfs:Class;
  rdfs:subClassOf :Homo.

Turtle triples have a somewhat different syntax than N-triples or N3 triples. As you can see, the Turtle triple resembles RDF syntax in form, allowing for nested metadata/data pairs assigned to the same object. Turtle triples use less verbiage than RDF, but convey equivalent information. In this minimalist RDF Schema, we specify two classes that would normally be included in the much larger classification of living organisms: Homo and HomoSapiens.

A triple that refers to our “http://www.fictitious_site.org/schemas/life#” Schema might look something like this:

:Batman instance_of <http://www.fictitious_site.org/schemas/life#>:HomoSapiens.

The triple asserts that Batman is an instance of Homo Sapiens. The data “HomoSapiens” links us to the RDF Schema, which in turn tells us that HomoSapiens is a class and is the subclass of Class Homo.

One of the many advantages of triples is their fungibility. Once you have created your triple list, you can port them into spreadsheets, or databases, or morph them into alternate triple dialects, such as RDF or N3. Triples in any dialect can be transformed into any other dialect with simple scripts using your preferred programming language.
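As a minimal sketch of this fungibility, the following Python snippet writes a list of triples as comma-separated text, suitable for a spreadsheet, and then reads the text back into triples. The triples list and the function names triples_to_csv and csv_to_triples are hypothetical, chosen for illustration; the triples are held as simple tuples rather than in any formal RDF dialect.

```python
import csv
import io

# Hypothetical triples, each held as a (subject, predicate, object) tuple.
triples = [
    ("Homo", "instance_of", "Class"),
    ("HomoSapiens", "instance_of", "Class"),
    ("HomoSapiens", "subClassOf", "Homo"),
    ("Batman", "instance_of", "HomoSapiens"),
]

def triples_to_csv(rows):
    """Serialize triples as CSV text, one triple per row, under a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(rows)
    return buffer.getvalue()

def csv_to_triples(text):
    """Rebuild the list of triples from CSV text produced by triples_to_csv."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [tuple(row) for row in reader]

csv_text = triples_to_csv(triples)
print(csv_text)
```

Because nothing is lost in the round trip, the same few lines could be adapted to emit any other row-oriented format.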

RDF documents can be a pain to create, but they are very easy to parse. Even in instances when an RDF file is composed of an off-kilter variant of RDF, it is usually quite easy to write a short script that will parse through the file, extracting triples, and using the components of the triples to serve the programmer's goals. Such goals may include: counting occurrences of items in a class, finding properties that apply to specific subsets of items in specific classes, or merging triples extracted from various triplestore databases. [Glossary Triplestore]
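Here is a sketch of one such script, counting the instances assigned to each class. It assumes that the triples arrive one per line, in the simplified "subject predicate object." form used in this section; the sample text, the regular expression, and the function names are all illustrative assumptions. A file written in full RDF syntax would call for a proper RDF parser.

```python
import re
from collections import Counter

# Hypothetical triple list, one triple per line, ending with a period.
text = """
:Batman instance_of :HomoSapiens.
:Robin instance_of :HomoSapiens.
:Lucy instance_of :Australopithecus.
:HomoSapiens subClassOf :Homo.
"""

# Three whitespace-separated terms followed by a terminating period.
TRIPLE = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\S+?)\s*\.\s*$")

def parse_triples(source):
    """Yield (subject, predicate, object) for each line that looks like a triple."""
    for line in source.splitlines():
        match = TRIPLE.match(line)
        if match:
            yield match.groups()

def count_instances(source):
    """Count how many subjects are declared to be an instance of each class."""
    return Counter(obj for _, pred, obj in parse_triples(source)
                   if pred == "instance_of")

print(count_instances(text))
```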

RDF seems like a panacea for ontologists, but it is seldom used in Big Data resources. Its poor acceptance is largely due to its strangeness. Savvy data managers who have led successful careers using standard database technologies are understandably reluctant to switch over to an entirely new paradigm of information management. Realistically, a novel and untested approach to data description, such as RDF, will take decades to catch on. Whether RDF emerges as the data description standard for Big Data resources is immaterial. The fundamental principles upon which RDF is built are certain to dominate the world of Big Data.

Section 5.10. Case Study (Advanced): Visualizing Class Relationships

The ignoramus is a leaf who doesn't know he is part of a tree

Attributed to Michael Crichton

When working with classifications or ontologies, it is useful to have an image that represents the relationships among the classes. GraphViz is an open source software utility that produces graphic representations of object relationships.

GraphViz can be downloaded from:

http://www.graphviz.org/

GraphViz comes with a set of applications that generate graphs of various styles. Here is an example of a GraphViz dot file, number.dot, constructed in GraphViz syntax [20]. Aside from a few lines that provide instructions for line length and graph size, the dot file is a list of classes and their child classes.

digraph G {
 size ="7,7";
 Object -> Numeric;
 Numeric -> Integer;
 Numeric -> Float;
 Integer -> Fixnum;
 Integer -> Bignum;
}

After the GraphViz exe file (version graphviz-2.14.1.exe, on my computer) is installed, you can launch the various GraphViz methods as command lines from its working directory, or through a system call from within a script. [Glossary Exe file, System call]

c:\ftp\dot> dot -Tpng number.dot -o number.png

The command line tells GraphViz to use the dot method to produce a rendering of the number.dot text file, saved as an image file, with filename number.png. The output file contains a class hierarchy, beginning with the highest class and branching until it reaches the lowest descendant class.
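For readers who prefer to launch GraphViz through a system call from within a script, here is a minimal Python sketch. The function names dot_command and render are hypothetical. The sketch assumes that the GraphViz executables are on the system path, and it simply declines to run when they are not found.

```python
import shutil
import subprocess

def dot_command(method, infile, outfile, fmt="png"):
    """Build the argument list for a GraphViz layout method, such as dot or twopi."""
    return [method, "-T" + fmt, infile, "-o", outfile]

def render(method, infile, outfile):
    """Run GraphViz if its executable can be found; return True on success."""
    if shutil.which(method) is None:
        return False  # GraphViz is not installed, or is not on the system path
    completed = subprocess.run(dot_command(method, infile, outfile))
    return completed.returncode == 0

# The same command that was typed at the prompt, expressed as an argument list.
print(dot_command("dot", "number.dot", "number.png"))
```

Calling render("dot", "number.dot", "number.png") would then produce the image file without leaving the script.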

With a glance, we see that the highest class is Class Object (Fig. 5.2). Class Object has one child class, Class Numeric. Numeric has two child classes, Class Integer and Class Float. Class Integer has two child classes, Class Fixnum and Class Bignum. You might argue that a graphic representation of classes was unnecessary; the textual listing of class relationships was all that you needed. Maybe so, but when the class structure becomes complex, visualization can greatly simplify your understanding of the relationships among classes.

Fig. 5.2 A class hierarchy, described by the number.dot file and converted to a visual file, using GraphViz.

Here is a visualization of a classification of human neoplasms (Fig. 5.3). It was produced by GraphViz, from a .dot file containing a ranking of classes and their subclasses, and rendered with the “twopi” method, shown: [Glossary Object rank]

c:\ftp> twopi -Tpng neoplasms.dot -o neoplasms_classes.png

Fig. 5.3 A visualization of relationships in a classification of tumors. The image was rendered with the GraphViz utility, using the twopi method, which produced a radial classification, with the root class in the center.

We can look at the graphic version of the classification and quickly make the following observations:

  1. The root class (i.e., the ancestor to every class) is Class Neoplasm. The GraphViz utility helped us find the root class, by placing it in the center of the visualization.
  2. Every class is connected to other classes. There are no classes sitting out in space, unrelated to other classes.
  3. Every class that has a parent class has exactly one parent class.
  4. There are no recursive branches to the graph (e.g., the ancestor of a class cannot also be a descendant of the class).

If we had only the textual listing of class relationships, without benefit of a graphic visualization, it would be very difficult for a human to verify, at a glance, the internal logic of the classification.

With a few tweaks to the neoplasms.dot GraphViz file, we can create a nonsensical graphic visualization:

Notice that one cluster of classes is unconnected to the other, indicating that class Endoderm/Ectoderm has no parent classes (Fig. 5.4). Elsewhere, Class Mesoderm is both child and parent to Class Neoplasm. Class Melanocytic and Class Molar are each the child class to two different parent classes. At a glance, we have determined that the classification is highly flawed. The visualization simplified the relationships among classes, and allowed us to see where the classification went wrong. Had we only looked at the textual listing of classes and subclasses, we may have missed some or all of the logical flaws in our classification.

Fig. 5.4 A corrupted classification that might qualify as a valid ontology.
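The flaws that we spotted by eye can also be checked by a script. The following Python sketch takes a list of parent-child pairs, of the kind found in a .dot file, and reports classes with more than one parent, lineages that circle back on themselves, and the absence of a single root class. The function name check_classification and the sample lists are hypothetical; the sample classes are borrowed from the number.dot file, with two deliberately bad relationships added.

```python
from collections import defaultdict

def check_classification(edges):
    """Report structural problems in a list of (parent, child) class pairs."""
    parents = defaultdict(set)
    classes = set()
    for parent, child in edges:
        parents[child].add(parent)
        classes.update((parent, child))

    problems = []
    # A classification should have exactly one class that lacks a parent.
    roots = sorted(c for c in classes if not parents[c])
    if len(roots) != 1:
        problems.append("expected exactly one root class, found: %s" % roots)
    for cls in sorted(classes):
        if len(parents[cls]) > 1:
            problems.append("%s has more than one parent class" % cls)
    # Walk upward from each class; arriving back at the class means a circular lineage.
    for cls in sorted(classes):
        seen, frontier = set(), set(parents[cls])
        while frontier:
            current = frontier.pop()
            if current == cls:
                problems.append("%s is its own ancestor" % cls)
                break
            if current not in seen:
                seen.add(current)
                frontier |= parents[current]
    return problems

good = [("Object", "Numeric"), ("Numeric", "Integer"), ("Numeric", "Float"),
        ("Integer", "Fixnum"), ("Integer", "Bignum")]
# Two corrupting relationships: a second parent for Fixnum, and a circular lineage.
bad = good + [("Float", "Fixnum"), ("Bignum", "Object")]

print(check_classification(good))   # []
for problem in check_classification(bad):
    print(problem)
```

A script of this kind complements the visualization. The picture shows us where the trouble lies; the script confirms that nothing was overlooked.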

At this point, you may be thinking that visualizations of class relationships are nice, but who has the time and energy to create the long list of classes and subclasses, in GraphViz syntax, that are the input files for the GraphViz methods? Now comes one of the great payoffs of data specifications. You must remember that good data specifications are fungible. A modestly adept programmer can transform a specification into whatever format is necessary to do a particular job. In this case, the classification of neoplasms had been specified as an RDF Schema (vide supra). An RDF Schema includes the definitions of classes and properties, with each class provided with the name of its parent class and each property provided with its range (i.e., the classes to which the property applies). Because class relationships in an RDF Schema are specified, it is easy to transform an RDF Schema into a .dot file suitable for GraphViz.

Here is a short Python script, dot.py, that parses an RDF Schema (contained in the plain-text file, schema.txt) and produces a GraphViz .dot file, named schema.dot. [Glossary Metaprogramming]

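import re

# Read the RDF Schema and write its class relationships as a GraphViz .dot file
infile = open("schema.txt", "r")
outfile = open("schema.dot", "w")
outfile.write("digraph G {\n")
outfile.write('size ="15,15";\n')
outfile.write('ranksep ="3.00";\n')
clump = ""
for line in infile:
 # A closing </rdfs:Class> tag marks the end of one class definition
 namematch = re.match(r'</rdfs:Class>', line)
 if namematch:
  father = ""
  child = ""
  clump = re.sub(r'\n', ' ', clump)
  # The parent class name follows the "#" in the resource attribute
  fathermatch = re.search(r':resource="[a-zA-Z0-9:/_.-]*#([a-zA-Z_]+)"', clump)
  if fathermatch:
   father = fathermatch.group(1)
  # The name of the class itself is held in its rdf:ID attribute
  childmatch = re.search(r'rdf:ID="([a-zA-Z_]+)"', clump)
  if childmatch:
   child = childmatch.group(1)
  outfile.write(father + " -> " + child + ";\n")
  clump = ""
 else:
  # Accumulate the lines of the current class definition
  clump = clump + line
outfile.write("}\n")
infile.close()
outfile.close()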

The first 15 lines of output of the dot.py script:

digraph G {
size ="15,15";
ranksep ="3.00";
Class -> Tumor_classification;
Tumor_classification -> Neoplasm;
Tumor_classification -> Unclassified;
Neural_tube -> Neural_tube_parenchyma;
Mesoderm -> Sub_coelomic;
Neoplasm -> Endoderm_or_ectoderm;
Unclassified -> Syndrome;
Neoplasm -> Neural_crest;
Neoplasm -> Germ_cell;
Neoplasm -> Pluripotent_non_germ_cell;
Sub_coelomic -> Sub_coelomic_gonadal;
Trophectoderm -> Molar;

The full schema.dot file, not shown, is suitable for use as an input file for the GraphViz utility.

Glossary

Artificial intelligence Artificial intelligence is the field of computer science that seeks to create machines and computer programs that seem to have human intelligence. The field of artificial intelligence sometimes includes the related fields of machine learning and computational intelligence. Over the past few decades the term “artificial intelligence” has taken a battering from professionals inside and outside the field, for good reasons. First and foremost is that computers do not think in the way that humans think. Though powerful computers can now beat chess masters at their own game, the algorithms for doing so do not simulate human thought processes. Furthermore, most of the predicted benefits from artificial intelligence have not come to pass, despite decades of generous funding. The areas of neural networks, expert systems, and language translation have not met expectations. Detractors have suggested that artificial intelligence is not a well-defined subdiscipline within computer science as it has encroached into areas unrelated to machine intelligence, and has appropriated techniques from other fields, including statistics and numerical analysis. Some of the goals of artificial intelligence have been achieved (e.g., speech-to-text translation), and the analytic methods employed in Big Data analysis should be counted among the enduring successes of the field.

Beauty To mathematicians, beauty and simplicity are virtually synonymous, both conveying the idea that someone has managed to produce something of great meaning or value from a minimum of material. Euler's identity, relating e, i, pi, 0, and 1 in a simple equation, is held as an example of beauty in mathematics. When writing this book, I was tempted to give it the title, “The Beauty of Data,” but I feared that a reductionist flourish, equating data simplification with beauty, was just too obscure.

Bootstrapping The act of self-creation, from nothing. The term derives from the ludicrous stunt of pulling oneself up by one's own bootstraps. Its shortened form, “booting” refers to the startup process in computers in which the operating system is somehow activated via its operating system, which has not been activated. The absurd and somewhat surrealistic quality of bootstrapping protocols serves as one of the most mysterious and fascinating areas of science. As it happens, bootstrapping processes lie at the heart of some of the most powerful techniques in data simplification (e.g., classification, object oriented programming, resampling statistics, and Monte Carlo simulations).
It is worth taking a moment to explore the philosophical and the pragmatic aspects of bootstrapping. Starting from the beginning, how was the universe created? For believers, the universe was created by an all-powerful deity. If this were so, then how was the all-powerful deity created? Was the deity self-created, or did the deity simply bypass the act of creation altogether? The answers to these questions are left as an exercise for the reader, but we can all agree that there had to be some kind of bootstrapping process, if something was created from nothing. Otherwise, there would be no universe, and this book would be much shorter than it is. Getting back to our computers, how is it possible for any computer to boot its operating system, when we know that the process of managing the startup process is one of the most important functions of the fully operational operating system? Basically, at startup, the operating system is non-functional. A few primitive instructions hardwired into the computer's processors are sufficient to call forth a somewhat more complex process from memory, and this newly activated process calls forth other processes, until the operating system is eventually up and running. The cascading rebirth of active processes takes time and explains why booting your computer may seem to be a ridiculously slow process.
What is the relationship between bootstrapping and classification? The ontologist creates a classification based on a worldview in which objects hold specific relationships with other objects. Hence, the ontologist's perception of the world is based on preexisting knowledge of the classification of things; which presupposes that the classification already exists. Essentially, you cannot build a classification without first having the classification. How does an ontologist bootstrap a classification into existence? She may begin with a small assumption that seems, to the best of her knowledge, unassailable. In the case of the classification of living organisms, she may assume that the first organisms were primitive, consisting of a few self-replicating molecules and some physiologic actions, confined to a small space, capable of a self-sustaining system. Primitive viruses and prokaryotes (i.e., bacteria) may have started the ball rolling. This first assumption might lead to observations and deductions, which eventually yield the classification of living organisms that we know today. Every thoughtful ontologist will admit that a classification is, at its best, a hypothesis-generating machine; not a factual representation of reality. We use the classification to create new hypotheses about the world and about the classification itself. The process of testing hypotheses may reveal that the classification is flawed; that our early assumptions were incorrect. More often, testing hypotheses will reassure us that our assumptions were consistent with new observations, adding to our understanding of the relations between the classes and instances within the classification.

Categorical data Non-numeric data in which objects are assigned categories, with categories having no numeric order. Yes or no, male or female, heads or tails, snake-eyes or boxcars, are types of unordered categorical data. Traditional courses in mathematics and statistics stress the analysis of numeric data, but data scientists soon learn that much of their work involves the collection and analysis of non-numeric data.

Cladistics The technique of producing a hierarchy of clades, wherein each branch includes a parent species and all its descendant species, while excluding species that did not descend from the parent (Fig. 5.5). If a subclass of a parent class omits any of the descendants of the parent class, then the parent class is said to be paraphyletic. If a subclass of a parent class includes organisms that did not descend from the parent, then the parent class is polyphyletic. A class can be paraphyletic and polyphyletic, if it excludes organisms that were descendants of the parent and if it includes organisms that did not descend from the parent. The goal of cladistics is to create a hierarchical classification that consists exclusively of monophyletic classes (i.e., no paraphyly, no polyphyly). Classifications of the kinds described in this chapter are monophyletic.

Fig. 5.5 Schematic (cladogram) of all the descendant branches of a common ancestor (stem at bottom of image). The left and the right groups represent clades insofar as they contain all their descendants and exclude classes that are not descendants of the group root. The middle group is not a valid clade because it does not contain all of the descendants of its group root (i.e., it is paraphyletic). Specifically, it excludes the left-most group in the diagram. From Wikimedia Commons, author "Life of Riley".

Classification system versus identification system It is important to distinguish a classification system from an identification system. An identification system matches an individual organism with its assigned object name (or species name, in the case of the classification of living organisms). Identification is based on finding several features that, taken together, can help determine the name of an organism. For example, if you have a list of characteristic features: large, hairy, strong, African, jungle-dwelling, knuckle-walking; you might correctly identify the organism as a gorilla. These identifiers are different from the phylogenetic features that were used to classify gorillas within the hierarchy of organisms (Animalia: Chordata: Mammalia: Primates: Hominidae: Homininae: Gorillini: Gorilla). Specifically, you can identify an animal as a gorilla without knowing that a gorilla is a type of mammal. You can classify a gorilla as a member of Class Gorillini without knowing that a gorilla happens to be large. One of the most common mistakes in science is to confuse an identification system with a classification system. The former simply provides a handy way to associate an object with a name; the latter is a system of relationships among objects.

Classification versus index In practice, an index is an alphabetized listing of the important terms in a work (e.g., book), with the locations of each term within the work. Ideally, though, an index should be much more than that. An idealized index is a conceptualization of a corpus of work that enables users to locate the concepts that are discussed and created within the work. How does an idealized index differ from a classification? A classification is a way of organizing concepts in classes, wherein the relationships of the concepts are revealed. The classification can incorporate all of the information held in an index by encapsulating the concept locations together with the names of the concepts. Because the relationships among the objects in a classification can be used to draw inferences about the objects, we can think of a classification as an index that can help us think.

Cluster analysis Clustering algorithms provide a way of taking a large set of data objects that seem to have no relationship to one another, and producing a visually simple collection of clusters wherein each cluster member is similar to every other member of the same cluster. The algorithmic methods for clustering are simple. One of the most popular clustering algorithms is the k-means algorithm, which assigns any number of data objects to one of k clusters [21]. The number k of clusters is provided by the user. The algorithm is easy to describe and to understand, but the computational task of completing the algorithm can be difficult when the number of dimensions in the object (i.e., the number of attributes associated with the object), is large. There are some serious drawbacks to the algorithm: (1) The final set of clusters will sometimes depend on the initial, random choice of k data objects. This means that multiple runs of the algorithm may produce different outcomes; (2) The algorithms are not guaranteed to succeed. Sometimes, the algorithm does not converge to a final, stable set of clusters; (3) When the dimensionality is very high, the distances between data objects (i.e., the square root of the sum of squares of the measured differences between corresponding attributes of two objects) can be ridiculously large and of no practical meaning. Computations may bog down, cease altogether, or produce meaningless results. In this case, the only recourse may require eliminating some of the attributes (i.e., reducing dimensionality of the data objects); (4) The clustering algorithm may succeed, producing a set of clusters of similar objects, but the clusters may have no practical value. They may miss important relationships among the objects, or they might group together objects whose similarities are totally non-informative.
The biggest drawback associated with cluster analyses is that researchers may mistakenly believe that the groupings produced by the method constitute a valid biological classification. This is not the case because biological entities (genes, proteins, cells, organs, organisms) may share many properties and still be fundamentally different. For example, two genes may have the same length and share some sub-sequences, but both genes may have no homology with one another (i.e., no shared ancestry) and may have no common or similar expressed products. Another set of genes may be structurally dissimilar but may belong to the same family. The groupings produced by cluster analysis should never be equated with a classification. At best, cluster analysis produces groups that can be used to start piecing together a biological classification.
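For readers who would like to see the k-means steps spelled out, here is a bare-bones sketch in Python. It is not the implementation of any particular library; the function name kmeans, the seed parameter, and the six sample points are assumptions made for illustration. The seed controls the initial, random choice of k data objects, so changing the seed is the way to explore how the outcome depends on the starting point.

```python
import random

def kmeans(points, k, seed, max_iter=100):
    """Plain k-means on tuples of numbers; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initial, random choice of k data objects
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignments == assignments:
            break                       # assignments are stable: converged
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:                 # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments

# Hypothetical data: two well-separated groups of two-dimensional objects.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2, seed=0)
print(labels)
```

The cluster numbers themselves are arbitrary. A different seed may swap the labels of the two groups, even when the groupings are the same.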

Combinatorics The analysis of complex data often involves combinatorics; the evaluation, on some numeric level, of combinations of things. Often, combinatorics involves pairwise comparisons of all possible combinations of items. When the number of comparisons becomes large, as is the case with virtually all combinatoric problems involving large data sets, the computational effort becomes massive. For this reason, combinatorics research has become a subspecialty in applied mathematics and data science. There are four “hot” areas in combinatorics. The first involves building increasingly powerful computers capable of solving complex combinatoric problems. The second involves developing methods whereby combinatoric problems can be broken into smaller problems that can be distributed to many computers, to provide relatively fast solutions to problems that could not otherwise be solved in any reasonable length of time. The third area of research involves developing new algorithms for solving combinatoric problems quickly and efficiently. The fourth area, perhaps the most promising area, involves developing innovative non-combinatoric solutions for traditionally combinatoric problems, a golden opportunity for experts in the field of data simplification.

Confounder Unanticipated or ignored factor that alters the outcome of a data analysis. Confounders are particularly important in Big Data analytics, because most analyses are observational; based on collected parameters from large numbers of data records, and there is very little control over confounders. Confounders are less of a problem in controlled prospective experiments, in which a control group and a treated group are alike, to every extent feasible; only differing in their treatment. Differences between the control group and the treated group are presumed to be caused by the treatment, as all of the confounders have been eliminated. One of the greatest challenges of Big Data analytics involves developing new analytic protocols that reduce the effect of confounders in observational studies.

Data scientist Anyone who practices data science and who has some expertise in a field subsumed by data science (i.e., informatics, statistics, data analysis, programming, and computer science).

Exe file Short for executable file and also known as application file. A file containing a program, in binary code, that can be executed when the name of the file is invoked on the command line.

Heterogeneous data Sets of data that are dissimilar with regard to content, purpose, format, organization, and annotations. One of the purposes of Big Data is to discover relationships among heterogeneous data sources. For example, epidemiologic data sets may be of service to molecular biologists who have gene sequence data on diverse human populations. The epidemiologic data is likely to contain different types of data values, annotated and formatted in a manner that is completely different from the data and annotations in a gene sequence database. The two types of related data, epidemiologic and genetic, have dissimilar content; hence they are heterogeneous to one another.

Inheritance The method by which a child is endowed with features of the parent. In object oriented programming, inheritance is passed from parent class to child class, meaning that the child class has access to all of the methods and properties that are held in the parent class.
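A minimal Python sketch shows inheritance at work. The class names are borrowed from the number.dot example; the describe method and the value attribute are hypothetical additions.

```python
class Numeric:
    """Parent class: holds a method that every descendant class can use."""
    def __init__(self, value):
        self.value = value

    def describe(self):
        return "%s with value %s" % (type(self).__name__, self.value)

class Integer(Numeric):
    """Child class: defines nothing of its own, yet inherits describe()."""
    pass

print(Integer(7).describe())   # Integer with value 7
```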

Intransitive property One of the criteria for a classification is that every object (sometimes referred to as member or as instance) belongs to exactly one class. From this criterion comes the intransitive property of classifications; namely, an object cannot change its class. Otherwise an object would belong to more than one class at different times. It is easy to apply the intransitive rule under most circumstances. A cat cannot become a dog and a horse cannot become a sheep. What do we do when a caterpillar becomes a butterfly? In this case, we must recognize that caterpillar and butterfly represent phases in the development of one particular instance of a species, and do not belong to separate classes.

Iterator Iterators are simple programming shortcuts that call functions that operate on consecutive members of a data structure, such as a list, or a block of code. Typically, complex iterators can be expressed in a single line of code. Perl, Python and Ruby all have iterator methods. In Ruby, the iterator methods are each, find, collect, and inject. In Python, there are types of objects that are iterable (not to be confused with “irritable”), and these objects accept implicit or scripted iteration methods.
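The difference between implicit and scripted iteration, in Python, can be seen in a short sketch. The list of class names and the countdown function are hypothetical examples.

```python
classes = ["Object", "Numeric", "Integer", "Float"]

# Implicit iteration: the loop requests an iterator from the list behind the scenes.
lengths = [len(name) for name in classes]

# Scripted iteration: request the iterator explicitly and advance it by hand.
walker = iter(classes)
first = next(walker)
second = next(walker)

def countdown(n):
    """A generator function: each yield hands back one item to the caller."""
    while n > 0:
        yield n
        n -= 1

print(lengths, first, second, list(countdown(3)))
```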

KISS Acronym for Keep It Simple Stupid. With respect to Big Data, there are basically two schools of thought. This first is that reality is quite complex, and the advent of powerful computers and enormous data collections allows us to tackle important problems, despite their inherent size and complexity. KISS represents a second school of thought; that Big Problems are just small problems that are waiting to be simplified.

Metaprogramming A metaprogram is a program that creates or modifies other programs. Metaprogramming is a particularly powerful feature of languages that are modifiable at runtime. Perl, Python, and Ruby are all metaprogramming languages. There are several techniques that facilitate metaprogramming features, including introspection and reflection.
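As a small illustration, here is a Python sketch that builds a class while the program is running, and then uses introspection to examine what was built. The function name make_class is hypothetical; the built-in type function, called with three arguments, is the standard way of creating a class at runtime in Python.

```python
def make_class(name, parent, greeting):
    """Create a new class at runtime, with a hello() method built on the fly."""
    def hello(self):
        return greeting
    return type(name, (parent,), {"hello": hello})

# The class does not exist in the source code; it is created when this line runs.
Salutations = make_class("Salutations", object, "hello there")
greeter = Salutations()

print(greeter.hello())                 # hello there
# Introspection: ask the object about its own class and methods.
print(type(greeter).__name__, hasattr(greeter, "hello"))
```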

Method Roughly equivalent to functions, subroutines, or code blocks. In object-oriented languages, a method is a subroutine available to an object (class or instance). In Ruby and Python, instance methods are declared with a “def” declaration followed by the name of the method, in lowercase. Here is an example, in Ruby, of the “hello” method, written for the Salutations class.
class Salutations
  def hello
    puts "hello there"
  end
end

Multiclass classification A misnomer imported from the field of machine learning, and indicating the assignment of an instance to more than one class. Classifications, as defined in this book, impose one-class classification (i.e., an instance can be assigned to one and only one class). It is tempting to think that a ball should be included in class “toy” and in class “spheroids,” but multiclass assignments create unnecessary classes of inscrutable provenance, and taxonomies of enormous size, consisting largely of replicate items.

Multiclass inheritance In ontologies, multiclass inheritance occurs when a child class has more than one parent class. For example, a member of Class House may have two different parent classes: Class Shelter, and Class Property. Multiclass inheritance is generally permitted in ontologies but is forbidden in classifications that restrict inheritance to a single parent class (i.e., each class can have at most one parent class, though it may have multiple child classes). When an object-oriented programming language permits multiparental inheritance (e.g., Perl and Python programming languages), data objects may have many different ancestral classes spread horizontally and vertically through the class libraries. There are many drawbacks to multi-class inheritance in object oriented programming languages and these have been discussed at some length in the computer science literature [22]. Medical taxonomists should understand that when multi-class inheritance is permitted, a class may be an ancestor of a child class that is an ancestor of its parent class (e.g., a single class might be a grandfather and a grandson to the same class). An instance of a class might be an instance of two classes, at once. The combinatorics and the recursive options can become computationally difficult or impossible. Those who use taxonomies that permit multiclass inheritance will readily acknowledge that they have created a system that is complex. Ontology experts justify the use of multiclass inheritance on the observation that such ontologies provide accurate models of nature and that faithful models of reality cannot be created with simple, uniparental classification. Taxonomists who rely on simple, uniparental classifications base their model on epistemological grounds; on the nature of objects. They hold that an object can have only one nature and can belong to only one defining class, and can be derived from exactly one parent class.
Taxonomists who insist upon uniparental class inheritance believe that assigning more than one parental class to an object indicates that you have failed to grasp the essential nature of the object [22–24].
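As a minimal Python sketch of multiparental inheritance, the House example above might look like the following. The class bodies and the describe() method are hypothetical, invented only for illustration; the point is that the language must pick one parent's method by a fixed resolution order when both parents define it.

```python
class Shelter:
    def describe(self):
        return "protects its occupants"


class Property:
    def describe(self):
        return "can be bought and sold"


class House(Shelter, Property):
    """A child class with two parent classes (multiparental inheritance)."""
    pass


home = House()

# Python resolves the competing describe() methods through its method
# resolution order (MRO), which lists Shelter before Property here.
mro_names = [cls.__name__ for cls in House.__mro__]
print(mro_names)
print(home.describe())

# The instance belongs to both parent classes at once.
print(isinstance(home, Shelter), isinstance(home, Property))
```

Swapping the order of the parents in the class statement changes which describe() is used, a small illustration of why multiparental lineages can be hard to reason about.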

Negative classifier One of the most common mistakes committed by ontologists involves classification by negative attribute. A negative classifier is a feature whose absence is used to define a class. An example is found in the Collembola, popularly known as springtails, a ubiquitous member of Class Hexapoda, and readily found under just about any rock. These organisms look like fleas (same size, same shape) and were formerly misclassified among the class of true fleas (Class Siphonaptera). Like fleas, springtails are wingless, and it was assumed that springtails, like fleas, lost their wings somewhere in evolution's murky past. However, true fleas lost their wings when they became parasitic. Springtails never had wings, an important taxonomic distinction separating springtails from fleas. Today, springtails (Collembola) are assigned to Class Entognatha, a separate subclass of Class Hexapoda. Alternatively, taxonomists may be deceived by a feature whose absence is falsely conceived to be a fundamental property of a class of organisms. For example, all species of Class Fungi were believed to have a characteristic absence of a flagellum. Based on the absence of a flagellum, the fungi were excluded from Class Opisthokonta and were put in Class Plantae, which they superficially resembled. However, the chytrids, which have a flagellum, have been shown to be primitive members of Class Fungi. This finding places fungi among the true descendants of Class Opisthokonta (from which Class Animalia descended). This means that fungi are much more closely related to people than to plants, a shocking revelation [13]!

Non-living organism Herein, viruses and prions are referred to as non-living organisms. Viruses lack key features that distinguish life from non-life. They depend entirely on host cells for replication; they do not partake in metabolism, and do not yield energy; they cannot adjust to changes in their environment (i.e., no homeostasis), nor can they respond to stimuli. Most scientists consider viruses to be mobile genetic elements that can travel between cells (much as transposons are considered mobile genetic elements that travel within a cell). All viruses have a mechanism that permits them to infect cells and to use the host cell machinery to replicate. At minimum, viruses consist of a small RNA or DNA genome, encased by a protective protein coat, called a capsid. Class Mimiviridae, discovered in 1992, occupies a niche that seems to span the biological gulf separating living organisms from viruses. Members of Class Mimiviridae are complex, larger than some bacteria, with enormous genomes (by viral standards), exceeding a million base pairs and encoding upwards of 1000 proteins. The large size and complexity of Class Mimiviridae exemplifies the advantage of a double-stranded DNA genome. Class Megaviridae is a newly reported (October, 2011) class of viruses, related to Class Mimiviridae, but even larger [25]. Biologically, the life of a mimivirus is not very different from that of obligate intracellular bacteria (e.g., Rickettsia). The discovery of Class Mimiviridae inspires biologists to reconsider the “non-living” status relegated to viruses and compels taxonomists to examine the placement of viruses within the phylogenetic development of prokaryotic and eukaryotic organisms [13].

Nonphylogenetic property Properties that do not hold true for a class; hence, cannot be used by ontologists to create a classification. For example, we do not classify animals by height or weight because animals of greatly different heights and weights may occupy the same biological class. Similarly, animals within a class may have widely ranging geographic habitats; hence, we cannot classify animals by locality. Case in point: penguins can be found virtually anywhere in the southern hemisphere, including hot and cold climates. Hence, we cannot classify penguins as animals that live in Antarctica or that prefer a cold climate. Scientists commonly encounter properties, once thought to be class-specific, that prove to be uninformative for classification purposes. For many decades, all bacteria were assumed to be small; much smaller than animal cells. However, the bacterium Epulopiscium fishelsoni grows to about 600 microns by 80 microns, much larger than the typical animal epithelial cell (about 35 microns in diameter) [26]. Thiomargarita namibiensis, an ocean-dwelling bacterium, can reach a size of 0.75 mm, visible to the unaided eye. What do these admittedly obscure facts teach us about the art of classification? Superficial properties, such as size, seldom inform us how to classify objects. The ontologist must think very deeply to find the essential defining features of classes.

Object rank A generalization of Page rank, the indexing method employed by Google. Object ranking involves providing objects with a quantitative score that provides some clue to the relevance or the popularity of an object. For the typical object ranking project, objects take the form of a key word phrase.

Observational data Data obtained by measuring existing things or things that occurred without the help of the scientist. Observational data needs to be distinguished from experimental data. In general, experimental data can be described with a Gaussian curve, because the experimenter is trying to measure what happens when a controlled process is performed on every member of a uniform population. Such experiments typically produce Gaussian (i.e., bell-shaped or normal) curves for the control population and the test population. The statistical analysis of experiments reduces to the chore of deciding whether the resulting Gaussian curves are different from one another. In observational studies, data is collected on categories of things, and the resulting data sets often follow a Zipf distribution, wherein a few types of data objects account for the majority of observations. For this reason, many of the assumptions that apply to experimental data (i.e., the utility of parametric statistical descriptors including average, standard deviation and p-values) will not necessarily apply to observational data sets [24].

Parent class The immediate ancestor, or the next-higher class (i.e., the direct superclass) of a class. For example, in the classification of living organisms, Class Vertebrata is the parent class of Class Gnathostomata. Class Gnathostomata is the parent class of Class Teleostomi. In a classification, which imposes single class inheritance, each child class has exactly one parent class; whereas one parent class may have several different child classes. Furthermore, some classes, in particular the bottom class in the lineage, have no child classes (i.e., a class need not always be a superclass of other classes). A class can be defined by its properties, its membership (i.e., the instances that belong to the class), and by the name of its parent class. When we list all of the classes in a classification, in any order, we can always reconstruct the complete class hierarchy, with its correct lineages and branchings, if we know the name of each class's parent class [13].
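The reconstruction described above can be sketched in a few lines of Python. The child-to-parent table below uses only the three classes named in this entry, listed in arbitrary order; treating Class Vertebrata as the top class is an assumption made to keep the sketch short, and the function names are hypothetical.

```python
from collections import defaultdict

# Child -> parent table, deliberately listed out of order.
# Vertebrata is treated as the top class only for this truncated sketch.
parent_of = {
    "Teleostomi": "Gnathostomata",
    "Vertebrata": None,
    "Gnathostomata": "Vertebrata",
}


def lineage(class_name, parent_of):
    """Return the chain of classes from class_name up to the top class."""
    chain = []
    seen = set()
    while class_name is not None:
        if class_name in seen:
            raise ValueError("cycle found; not a valid classification")
        seen.add(class_name)
        chain.append(class_name)
        class_name = parent_of[class_name]
    return chain


def children_of(parent_of):
    """Invert the table to recover the branchings (parent -> child classes)."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if parent is not None:
            children[parent].append(child)
    return dict(children)


print(lineage("Teleostomi", parent_of))
print(children_of(parent_of))
```

Because each class has exactly one parent, walking upward always follows a single path; the cycle check guards against tables that violate that rule.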

Phenetics The classification of organisms by feature similarity, rather than through relationships. Starting with a set of feature data on a collection of organisms, you can write a computer program that will cluster the organisms into classes, according to their similarities. In theory, one computer program, executing over a large dataset containing measurements for every earthly organism, could create a complete biological classification. The status of a species is thereby reduced from a fundamental biological entity, to a mathematical construction.
There are a host of problems consequent to computational methods for classification. First, there are many different mathematical algorithms that cluster objects by similarity. Depending on the chosen algorithm, the assignment of organisms to one species or another would change. Secondly, mathematical algorithms do not cope well with species convergence. Convergence occurs when two species independently acquire identical or similar traits through adaptation; not through inheritance from a shared ancestor. Examples are: the wing of a bat and the wing of a bird; the opposable thumb of opossums and of primates; the beak of a platypus and the beak of a bird. Unrelated species frequently converge upon similar morphologic adaptations to common environmental conditions or shared physiological imperatives. Algorithms that cluster organisms based on similarity are likely to group divergent organisms under the same species.
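The first problem, that the chosen algorithm changes the groupings, can be illustrated with a toy Python sketch. The six measurements are hypothetical one-dimensional feature values, and the agglomerate function is a bare-bones hierarchical clustering routine written for this illustration, not a method taken from the phenetics literature. The same data are clustered twice, once with single linkage (distance between nearest members) and once with complete linkage (distance between farthest members).

```python
def agglomerate(points, k, linkage):
    """Merge the two closest clusters until only k clusters remain.

    linkage is min (single linkage) or max (complete linkage), applied
    to all pairwise distances between the members of two clusters.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical feature measurements for six organisms.
measurements = [0, 10, 22, 36, 52, 70]

single = agglomerate(measurements, 2, min)
complete = agglomerate(measurements, 2, max)
print(single)
print(complete)
```

With these values, single linkage chains five organisms together and isolates the sixth, whereas complete linkage splits the organisms four and two. Neither answer is wrong by the algorithm's own rules, which is the difficulty.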
It is often assumed that computational classification, based on morphologic feature similarities, will improve when we acquire whole-genome sequence data for many different species. Imagine an experiment wherein you take DNA samples from every organism you encounter: bacterial colonies cultured from a river, unicellular non-bacterial organisms found in a pond, small multicellular organisms found in soil, crawling creatures dwelling under rocks, and so on. You own a powerful sequencing machine, that produces the full-length sequence for each sampled organism, and you have a powerful computer that sorts and clusters every sequence. At the end, the computer prints out a huge graph, wherein all the samples are ordered and groups with the greatest sequence similarities are clustered together. You may think you have created a useful classification, but you have not really, because you do not know anything about the organisms that are clustered together. You do not know whether each cluster represents a species, or a class (a collection of related species), or whether a cluster may be contaminated by organisms that share some of the same gene sequences, but are phylogenetically unrelated (i.e., the sequence similarities result from chance or from convergence, but not by descent from a common ancestor). The sequences do not tell you very much about the biological properties of specific organisms, and you cannot infer which biological properties characterize the classes of clustered organisms. You have no certain knowledge whether the members of any given cluster of organisms can be characterized by any particular gene sequence (i.e., you do not know the characterizing gene sequences for classes of organisms). You do not know the genus or species names of the organisms included in the clusters, because you began your experiment without a presumptive taxonomy. 
Basically, you simply know what you knew before you started; that individual organisms have unique gene sequences that can be grouped by sequence similarity.
Taxonomists, who have long held that a species is a natural unit of biological life, and that the nature of a species is revealed through the intellectual process of building a consistent taxonomy [27], are opposed to the process of phenetics-based classification [27,13]. In the realm of big data, computational phenetics may create a complex web of self-perpetuating nonsense that cannot be sensibly analyzed. Over the next decade or two, the brilliance or the folly of computational phenetics will most likely be revealed.

Properties versus classes When creating classifications, the most common mistake is to assign class status to a property. When a property is inappropriately assigned as a class, then the entire classification is ruined. Hence, it is important to be very clear on the difference between these two concepts, and to understand why it is human nature to confuse one with the other. A class is a holder of related objects (e.g., items, records, categorized things). A property is a feature or trait that can be assigned to an item. When inclusion in a class requires items to have a specific property, we often name the class by its defining property. For example, the members of Class Rodentia, which include rats, mice, squirrels, and gophers, are all gnawing mammals. The word rodent derives from the Latin roots rodentem, rodens, from rodere, “to gnaw.” Although all rodents gnaw, we know that gnawing is not unique to rodents. Rabbits (Class Lagomorpha) also gnaw.
Objects from many different classes may have some of the same properties. Here's another example. Normal human anatomy includes two legs. This being the case, is “leg” a subclass of “human”? The answer is no. A leg is not a type of human. Having a leg is just one of many properties associated with normal human anatomy. You would be surprised how many people can be tricked into thinking that a leg, which is itself an object, should be assigned as a subclass of the organisms to which it is attached. Some of this confusion comes from the way that we think about relationships between objects and properties. We say “He is hungry,” using a term of equality, “is,” to describe the relationship between “He” and “hungry.” Technically, the sentence, “He is hungry” asserts that “He” and “hungry” are equivalent objects. We never bother to say “He has hunger,” but other languages are more fastidious. A German might say “Ich habe Hunger,” indicating that he has hunger, and avoiding any inference that he and hunger are equivalent terms (i.e., never “Ich bin Hunger”). It may seem like a trivial point, but mistaking properties for classes is a common error that nearly always leads to disaster.
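The distinction can be made concrete with a short Python sketch based on the gnawing example. The class layout is a deliberate simplification (both classes are attached directly to a Mammalia class, which is an assumption made for brevity), and the attribute name gnaws is hypothetical.

```python
class Mammalia:
    pass


class Rodentia(Mammalia):
    gnaws = True  # a property held by members of the class


class Lagomorpha(Mammalia):
    gnaws = True  # the same property, held by a different class


# Sharing a property does not create a class relationship.
same_property = Rodentia.gnaws == Lagomorpha.gnaws
rabbit_is_rodent = issubclass(Lagomorpha, Rodentia)
print(same_property, rabbit_is_rodent)
```

Had "gnawing animals" been declared as a class instead of a property, rabbits and rodents would have been forced into a shared lineage that the classification does not support.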

RDF Resource Description Framework (RDF) is a syntax in XML notation that formally expresses assertions as triples. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets, so long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or functionally integrated with other massive or complex data resources.

RDF Ontology A term that, in common usage, refers to the class definitions and relationships included in an RDF Schema document. The classes in an RDF Schema need not comprise a complete ontology. In fact, a complete ontology could be distributed over multiple RDF Schema documents.

Representation bias Occurs when the population sampled does not represent the population intended for study. For example, the population on which the normal range of prostate specific antigen (PSA) was based was selected from a county in the state of Minnesota. The male population under study consisted almost exclusively of white men (i.e., virtually no African-Americans, Asians, Hispanics, etc.). It may have been assumed that PSA levels would not vary with race. It was eventually determined that the normal PSA ranges varied greatly by race [28]. The Minnesota data, though plentiful, did not represent racial subpopulations. A sharp distinction must be drawn between Big-ness and Whole-ness [29].

Superclass Any of the ancestral classes of a subclass. For example, in the classification of living organisms, the class of vertebrates is a superclass of the class of mammals. The immediate superclass of a class is its parent class. In common parlance, when we speak of the superclass of a class, we are usually referring to its parent class.

System call Refers to a command, within a running script, that calls the operating system into action, momentarily bypassing the programming interpreter for the script. A system call can do essentially anything the operating system can do via a command line.
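In Python, one way to sketch a system call in this sense uses the standard subprocess module. The command chosen here is an assumption made for portability: the script asks the operating system to launch a second copy of the Python interpreter that prints a short message, and then captures that output.

```python
import subprocess
import sys

# Hand a command to the operating system from within a running script.
# sys.executable is the path of the current Python interpreter, used here
# only so that the example runs the same way on any platform.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,  # collect what the command writes to stdout/stderr
    text=True,            # decode the captured bytes as text
    check=True,           # raise an error if the command fails
)
output = result.stdout.strip()
print(output)
```

Any command that could be typed at a command line can be substituted for the list passed to subprocess.run.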

Triple In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In theory, all Big Data resources can be composed as collections of triples. When the data and metadata held in sets of triples are organized into ontologies consisting of classes of objects and associated properties (metadata), the resource can potentially provide introspection (the ability of a data object to be self-descriptive). An in-depth discussion of triples is found in Chapter 4, “Metadata, Semantics, and Triples.”

Triplestore A list or database composed entirely of triples (statements consisting of an item identifier plus the metadata describing the item plus an item of data). The triples in a triplestore need not be saved in any particular order, and any triplestore can be merged with any other triplestore; the basic semantic meaning of the contained triples is unaffected. Additional discussion of triplestores can be found in Section 6.5, “Case Study: A Visit to the TripleStore.”
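A minimal Python sketch can show why order and merging do not matter. Each triple is modeled as a tuple of (identifier, metadata, data), and each triplestore as an unordered set of such tuples. The identifiers, metadata tags, and values are hypothetical, and a production triplestore would use a dedicated database instead of an in-memory set.

```python
# Two small triplestores, each an unordered set of
# (identifier, metadata, data) tuples. All values are hypothetical.
store_a = {
    ("id_0001", "is_a", "Class House"),
    ("id_0001", "address", "12 Example Lane"),
}
store_b = {
    ("id_0002", "is_a", "Class House"),
    ("id_0001", "is_a", "Class House"),  # repeats a triple held in store_a
}

# Merging is a set union; the repeated triple is kept only once.
merged = store_a | store_b


def about(identifier, store):
    """Collect every (metadata, data) pair asserted for one identifier."""
    return sorted((meta, data) for ident, meta, data in store if ident == identifier)


print(len(merged))
print(about("id_0001", merged))
```

The merge works only because both stores use the same identifier for the same object, which is the condition on unique identifiers that triples impose.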

Turtle Another syntax for expressing triples. From RDF came a simplified syntax for triples, known as Notation 3 or N3 [30]. From N3 came Turtle, thought to fit more closely to RDF. From Turtle came an even more simplified form, known as N-Triples.

Unclassifiable objects Classifications create a class for every object and taxonomies assign each and every object to its correct class. This means that a classification is not permitted to contain unclassified objects; a condition that puts fussy taxonomists in an untenable position. Suppose you have an object, and you simply do not know enough about the object to confidently assign it to a class. Or, suppose you have an object that seems to fit more than one class, and you can't decide which class is the correct class. What do you do?
Historically, scientists have resorted to creating a “miscellaneous” class into which otherwise unclassifiable objects are given a temporary home, until more suitable accommodations can be provided. I have spoken with numerous data managers, and everyone seems to be of a mind that “miscellaneous” classes, created as a stopgap measure, serve a useful purpose. Not so. Historically, the promiscuous application of “miscellaneous” classes has proven to be a huge impediment to the advancement of science. In the case of the classification of living organisms, the class of protozoans stands as a case in point. Ernst Haeckel, a leading biological taxonomist in his time, created the Kingdom Protista (i.e., protozoans), in 1866, to accommodate a wide variety of simple organisms with superficial commonalities. Haeckel himself understood that the protists were a blended class that included unrelated organisms, but he believed that further study would resolve the confusion. In a sense, he was right, but the process took much longer than he had anticipated; occupying generations of taxonomists over the following 150 years.
Today, Kingdom Protista no longer exists. Its members have been reassigned to positions among the animals, plants, and fungi. Nonetheless, textbooks of microbiology still describe the protozoans, just as though this name continued to occupy a legitimate place among terrestrial organisms. In the meantime, therapeutic opportunities for eradicating so-called protozoal infections, using class-targeted agents, have no doubt been missed [13].
You might think that the creation of a class of living organisms, with no established scientific relation to the real world, was a rare and ancient event in the annals of biology, having little or no chance of being repeated. Not so. A special pseudoclass of fungi, deuteromycetes (spelled with a lowercase “d,” signifying its questionable validity as a true biologic class) has been created to hold fungi of indeterminate speciation. At present, there are several thousand such fungi, sitting in a taxonomic limbo, waiting to be placed into a definitive taxonomic class [16,13].

References

[1] Wu D., Hugenholtz P., Mavromatis K., Pukall R., Dalin E., Ivanova N.N., et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009;462:1056–1060.

[2] Woese C.R., Fox G.E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. PNAS. 1977;74:5088–5090.

[3] Mayr E. Two empires or three? PNAS. 1998;95:9720–9723.

[4] Woese C.R. Default taxonomy: Ernst Mayr's view of the microbial world. PNAS. 1998;95(19):11043–11046.

[5] Bamshad M.J., Olson S.E. Does race exist? Sci Am. December, 2003;78–85.

[6] Wadman M. Geneticists struggle towards consensus on place for ‘race’. Nature. 2004;431:1026.

[7] Pearson K. The grammar of science. London: Adam and Black; 1900.

[8] Berman J.J. Racing to share pathology data. Am J Clin Pathol. 2004;121:169–171.

[9] Scamardella J.M. Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista. Int Microbiol. 1999;2:207–216.

[10] Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

[11] Madar S., Goldstein I., Rotter V. Did experimental biology die? Lessons from 30 years of p53 research. Cancer Res. 2009;69:6378–6380.

[12] Zilfou J.T., Lowe S.W. Tumor suppressive functions of p53. Cold Spring Harb Perspect Biol. 2009;00:a001883.

[13] Berman J.J. Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms. Cambridge, MA: Academic Press; 2012.

[14] Suggested Upper Merged Ontology (SUMO). The Ontology Portal. Available from: http://www.ontologyportal.org [viewed August 14, 2012].

[15] de Bruijn J. Using ontologies: enabling knowledge sharing and reuse on the Semantic Web. Digital Enterprise Research Institute Technical Report DERI-2003-10-29. Available from: http://www.deri.org/fileadmin/documents/DERI-TR-2003-10-29.pdf. October 2003 [viewed August 14, 2012].

[16] Guarro J., Gene J., Stchigel A.M. Developments in fungal taxonomy. Clin Microbiol Rev. 1999;12:454–500.

[17] Nakayama R., Nemoto T., Takahashi H., Ohta T., Kawai A., Seki K., et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Mod Pathol. 2007;20:749–759.

[18] Cote R., Reisinger F., Martens L., Barsnes H., Vizcaino J.A., Hermjakob H. The ontology lookup service: bigger and better. Nucleic Acids Res. 2010;38:W155–160.

[19] Niles I., Pease A. In: Welty C., Smith B., eds. Towards a standard upper ontology. Proceedings of the 2nd international conference on formal ontology in information systems (FOIS-2001), Ogunquit, Maine, October 17-19; 2001.

[20] Gansner E., Koutsofios E. Drawing graphs with dot. January 26. Available at: http://www.graphviz.org/Documentation/dotguide.pdf. 2006 [viewed on June 29, 2015].

[21] Wu X., Kumar V., Quinlan J.R., Ghosh J., Yang Q., Motoda H., et al. Top 10 algorithms in data mining. Knowl Inf Syst. 2008;14:1–37.

[22] Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Waltham, MA: Morgan Kaufmann; 2013.

[23] Berman J.J. Repurposing legacy data: innovative case studies. Waltham, MA: Morgan Kaufmann; 2015.

[24] Berman J.J. Data simplification: taming information with open source tools. Waltham, MA: Morgan Kaufmann; 2016.

[25] Arslan D., Legendre M., Seltzer V., Abergel C., Claverie J. Distant Mimivirus relative with a larger genome highlights the fundamental features of Megaviridae. PNAS. 2011;108:17486–17491.

[26] Angert E.R., Clements K.D., Pace N.R. The largest bacterium. Nature. 1993;362:239–241.

[27] DeQueiroz K. Ernst Mayr and the modern concept of species. PNAS. 2005;102(suppl 1):6600–6607.

[28] Sawyer R., Berman J.J., Borkowski A., Moore G.W. Elevated prostate-specific antigen levels in black men and white men. Mod Pathol. 1996;9:1029–1032.

[29] Boyd D. Privacy and publicity in the context of big data. Raleigh, North Carolina: Open Government and the World Wide Web (WWW2010); 2010. April 29. Available from: http://www.danah.org/papers/talks/2010/WWW2010.html [viewed August 26, 2012].

[30] Primer: Getting into RDF & Semantic Web using N3. Available from: http://www.w3.org/2000/10/swap/Primer.html [viewed September 17, 2015].
