CHAPTER 9
Ontologies, Knowledge Graphs, and Semantic Technology

Ontologies are conceptual models built using Semantic Technology. They provide a layer of meaning on top of Linked Data and graph databases.

Our observation is, and our earlier case studies demonstrate, that it is possible to establish a Data-Centric enterprise using traditional technology. We have also found a much easier route: combining the Linked Data concepts from the previous chapter with Semantic Technology, which we will cover here.

We have found benefits in two major areas from the adoption of Semantic Technology:

  • Implementation Benefits – Adopting Semantic Technology design and infrastructure provides a number of direct implementation benefits, including reducing the complexity of data models and increasing the flexibility of delivered systems.
  • Perspective Benefits – Even without the implementation benefits, merely reframing the problem and forcing yourself to look at enterprise application problems through the lens of semantics is often enough to shift your point of view. Seeing the problem differently is frequently a benefit in its own right.

What follows is a very high level review of some of the differences introduced when you adopt Semantic Technology.

Metadata is triples as well

In most systems, data and metadata are very different things:

  • Data comprises the specific facts about individual instances, such as “the price of product 23 is $17.”
  • Metadata comprises the data that defines the structure and meaning of data, such as “the Inventory table contains columns for product IDs and prices.”

In relational technology, metadata is expressed in Data Definition Language (DDL) and data is expressed in Data Manipulation Language (DML).

In semantically based systems, data and metadata are all just triples. A single triple tells us that “Person” is a class (this is metadata).

Another triple tells us that “Dave” is a “Person,” connecting data to metadata. Note that this loose, late-binding relationship between data and metadata is one of the distinguishing features of semantically based systems.

One flexibility that we get from this is that with one more triple, “Dave” is an “Employee,” and with another, “Dave” is a “Patient.” We don’t need new identifiers; we just make a single assertion, and Dave is a member of many classes simultaneously. In relational technology, an instance is only a member of one class (a row is only in one table). There is incredible flexibility in this simple approach of allowing an instance to be in many classes at once.
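Here is a minimal sketch of those assertions using Python and the rdflib library; the ex: namespace and the class names are illustrative, not a real enterprise model:

    # A minimal sketch (Python + rdflib): metadata and data are both just triples.
    # The ex: namespace and class names are illustrative.
    from rdflib import Graph, Namespace, RDF
    from rdflib.namespace import OWL

    EX = Namespace("https://example.com/ont/")
    g = Graph()
    g.bind("ex", EX)

    # Metadata: Person (and friends) are classes -- triples like any other
    g.add((EX.Person, RDF.type, OWL.Class))
    g.add((EX.Employee, RDF.type, OWL.Class))
    g.add((EX.Patient, RDF.type, OWL.Class))

    # Data: Dave is a Person -- and, with one more triple each,
    # also an Employee and a Patient, with no new identifiers needed
    g.add((EX.dave, RDF.type, EX.Person))
    g.add((EX.dave, RDF.type, EX.Employee))
    g.add((EX.dave, RDF.type, EX.Patient))

    print(g.serialize(format="turtle"))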

The other interesting side effect of having metadata expressed in the same way as data is that we can freely mix data and metadata, which gives us a way to discover what things mean without having to know it a priori.

If we take the “follow your nose” approach from the previous section and now apply it to a triple store that has metadata loaded, we can follow our nose from an individual (instance) to its class, and so find out what type of thing it is. Let’s say we run a query that returns “:dave” as one of the URIs in a triple; we might click on :dave to get a page that tells us some of the key attributes of :dave, including his class memberships. We may find that :dave is an “:Employee,” a “:Consultant,” and a “:Parent.” Clicking on any of these will tell us what they mean.
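A sketch of that follow-your-nose step: one SPARQL query (run here with rdflib, again with illustrative names) walks from :dave to his classes and to the labels that say what those classes mean, freely mixing data and metadata:

    # Sketch: one query spans data (Dave's class memberships) and
    # metadata (the labels and comments that say what those classes mean).
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix ex:   <https://example.com/ont/> .
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

        ex:Employee   rdfs:label "Employee" ; rdfs:comment "A person on the payroll." .
        ex:Consultant rdfs:label "Consultant" .
        ex:Parent     rdfs:label "Parent" .

        ex:dave a ex:Employee, ex:Consultant, ex:Parent .
    """, format="turtle")

    q = """
        PREFIX ex:   <https://example.com/ont/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?class ?label WHERE {
            ex:dave a ?class .
            OPTIONAL { ?class rdfs:label ?label }
        }
    """
    for cls, label in g.query(q):
        print(cls, label)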

Formal definitions

A formal definition is one where a software system can infer class membership based on the models plus the data.

A partial formal definition is one where we can infer data based on known class membership. For instance, if part of the definition of being a person is that they have a birthdate, then a system can infer a person has a birthdate, even if that date is unknown.

As we said, a complete formal definition is one where there is enough information to infer membership in a class, based on the data in the database. Let’s say the formal definition of a Patient is as follows:

A Patient is a Person who has received diagnostic or therapeutic services from a healthcare provider.

Therefore, a system can infer membership in the class of Patients for any Person who has received such services. In an ontology editor such as Protégé, this is expressed as an equivalent-class axiom on Patient.

Serialized as OWL, it looks roughly like the sketch below:
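This is one plausible rendering in Turtle; the property and class names (ex:receivedServiceFrom, ex:HealthcareProvider) are illustrative stand-ins, not the book's actual ontology, and the Python wrapper just parses the axiom to confirm it is well-formed RDF:

    # A plausible sketch of the Patient definition in OWL (Turtle).
    # Property and class names are illustrative.
    from rdflib import Graph

    patient_def = """
    @prefix ex:  <https://example.com/ont/> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    ex:Patient a owl:Class ;
        owl:equivalentClass [
            a owl:Class ;
            owl:intersectionOf (
                ex:Person
                [ a owl:Restriction ;
                  owl:onProperty ex:receivedServiceFrom ;
                  owl:someValuesFrom ex:HealthcareProvider ]
            )
        ] .
    """
    g = Graph()
    g.parse(data=patient_def, format="turtle")
    print(len(g), "triples in the Patient definition")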

Imagine that we found the following assertions (and they may have been from the web, an unstructured document, or a structured database):

Joe Jones had his gunshot wound repaired at St Luke’s.

If a system can establish:

  • that Joe Jones is a Person,
  • that gunshot wound repair is a treatment (CPT code 27033, for those who are curious), and
  • that St Luke’s is a medical facility,

then

  • the system can conclude (or infer) that Joe Jones is a patient.

A complete definition is one where we don’t need to assert membership in a class; the system can do it for us. In the previous examples, Person had a partial definition, and Patient had a complete definition.
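Here is a minimal sketch of that inference using rdflib plus the owlrl reasoner (an OWL 2 RL rule engine). The property ex:receivedServiceFrom and class ex:HealthcareProvider are illustrative stand-ins for “had his gunshot wound repaired at St Luke’s,” and nothing in the data asserts that Joe Jones is a Patient:

    # Sketch: the reasoner infers Joe Jones into the Patient class from
    # the definition plus the data; Patient membership is never asserted.
    from rdflib import Graph, Namespace, RDF
    import owlrl

    EX = Namespace("https://example.com/ont/")
    g = Graph()
    g.parse(data="""
        @prefix ex:  <https://example.com/ont/> .
        @prefix owl: <http://www.w3.org/2002/07/owl#> .

        # The complete (formal) definition of Patient, as sketched above
        ex:Patient owl:equivalentClass [
            a owl:Class ;
            owl:intersectionOf (
                ex:Person
                [ a owl:Restriction ;
                  owl:onProperty ex:receivedServiceFrom ;
                  owl:someValuesFrom ex:HealthcareProvider ] ) ] .

        # The facts we found about Joe Jones
        ex:joeJones a ex:Person ;
            ex:receivedServiceFrom ex:stLukes .
        ex:stLukes a ex:HealthcareProvider .
    """, format="turtle")

    owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
    print((EX.joeJones, RDF.type, EX.Patient) in g)   # True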

At first, this doesn’t sound like such a big deal, but it has surprising implications. The first is that it removes ambiguity from class definitions. In the past, the definition of a class or table was generally in the documentation, and almost always underspecified (if it existed at all). It was assumed that everyone (users and programmers) knew what all the classes and tables meant. Most traditional systems have thousands of classes and tables, many have tens of thousands. Very few people understand even a fraction of these well.

Having more complete definitions means there is a lower cognitive load on users and developers. They only really need to understand and agree with the formal definitions and the concepts from which they were constructed. While this is still a bit of work, we routinely see order-of-magnitude reductions in complexity, and often reductions of two orders of magnitude.

Self-describing data

In a semantic system, the data and the metadata that describe the data are co-located. What this means is that when you run a query and encounter a property, attribute, or data value that you are not familiar with, you can, in a well-designed environment, click on the item and understand what it means and what else it is connected to. If you were searching for roller coasters in DBpedia (using SPARQL), you would query the following:
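The book's original query is not reproduced here, but a query in this spirit, using the SPARQLWrapper library against the public DBpedia endpoint (and assuming the dbo:RollerCoaster class), might look like this:

    # Sketch: ask DBpedia for roller coasters; every result is a URI.
    # Assumes the public endpoint and the dbo:RollerCoaster class.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?coaster WHERE { ?coaster a dbo:RollerCoaster } LIMIT 5
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["coaster"]["value"])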

The query yields a list of results, each identified by a URI (a globally unique identifier); Behemoth is among them.

Clicking on one of these (Behemoth) gives us a new page of formatted data about the Behemoth roller coaster, including, for example, the property dbp:dropFt.

If you didn’t know what “dbp:dropFt” was, you could click on it and get a definition (in principle, at least). We used to use this example in our training classes, when the property was called “verticalDrop.” Now the property is “dropFt,” but we’re pretty sure it’s the same data. In other words, it answers the question: what is the largest single drop in the course of the roller coaster?
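For the curious, a lookup in the same vein; the DBpedia resource IRI for the Behemoth coaster and the dbp:dropFt property are assumptions about the current state of DBpedia and may change as the data evolves:

    # Sketch: fetch the drop (in feet) recorded for the Behemoth coaster.
    # The resource IRI and property name are assumptions about DBpedia data.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbp: <http://dbpedia.org/property/>
        SELECT ?drop WHERE {
            <http://dbpedia.org/resource/Behemoth_(roller_coaster)> dbp:dropFt ?drop .
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["drop"]["value"])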

The point is, with a traditional system, you need to know all the structure, all the terms, and what everything means, even before you execute your first query.

In a semantic system, you can execute a very naïve query, and when you see things you weren’t expecting or don’t know, you can click on them (if the system is well designed) and find out what they mean.

Schema later

There is a huge debate now between “schema on write” and “schema on read.” “Schema on write” means you must know all of the schema before you write anything to the database. In other words, all the tables and all the structure must be present before you do any writing (or reading). All relational and object-oriented systems are “schema on write” (i.e., the schema must exist before data can be written).

“Schema on read” says we will figure out the schema as we go. Go ahead and put the data in the data lake with whatever keys and tags you’d like, and the data scientists will work it out later.

Both approaches are flawed. The former slows people down because as new data is discovered, there is a considerable lag before the data administrators have set up the structures to handle the new data. This delay is becoming less and less palatable.

The latter approach means the only people who know (sort of) what the data means are the data scientists who have plumbed the depths of the data. But there are two problems with this: 1) each data scientist may come to a different conclusion, and 2) any conclusion they come to is unlikely to be shared with anyone else.

With Semantic Technology, schema can be added as it is uncovered or designed. Imagine a genealogy triple store that had only two properties: “hasParent” and “gender.” From this, you could define the class of all “mothers” and “fathers” and infer people into them. You could define the class of “grandparents,” as well as “grandmothers” and “grandfathers.” We can derive ancestor from parent. We can define the class of “uncles” and specific uncle relationships. All of this is schema that can be added long after the data has been created, as the sketch below illustrates.
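This sketch uses rdflib and the owlrl reasoner. The data carries only hasParent and gender; the Mother definition (with illustrative names, including an isParentOf inverse property) is layered on afterwards, and the reasoner does the rest:

    # Sketch: data first (hasParent, gender only), schema later (Mother).
    from rdflib import Graph, Namespace, RDF
    import owlrl

    EX = Namespace("https://example.com/gen/")
    g = Graph()

    # Data created long before any real schema existed
    g.parse(data="""
        @prefix ex: <https://example.com/gen/> .
        ex:sue  a ex:Person ; ex:hasParent ex:mary .
        ex:mary a ex:Person ; ex:gender ex:female .
    """, format="turtle")

    # Schema added later: a Mother is a female Person who is a parent of a Person
    g.parse(data="""
        @prefix ex:  <https://example.com/gen/> .
        @prefix owl: <http://www.w3.org/2002/07/owl#> .

        ex:isParentOf owl:inverseOf ex:hasParent .
        ex:Mother owl:equivalentClass [
            a owl:Class ;
            owl:intersectionOf (
                ex:Person
                [ a owl:Restriction ; owl:onProperty ex:gender ;
                  owl:hasValue ex:female ]
                [ a owl:Restriction ; owl:onProperty ex:isParentOf ;
                  owl:someValuesFrom ex:Person ] ) ] .
    """, format="turtle")

    owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
    print((EX.mary, RDF.type, EX.Mother) in g)   # True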

Open world

As Tim Berners-Lee has so often said, on the Web and on the Semantic Web, “anybody can say anything about anything.” This doesn’t resonate in our enterprises. But before we throw the baby out with the bathwater, let’s dig a bit deeper about why the Semantic Web is the way it is and how to have the best of both worlds.

The Semantic Web is predicated on the idea that you will continually be harvesting and accreting additional information. Therefore, at any point in time, the information you have will be incomplete. There is a great deal of reasoning you can do with incomplete information, but there are limits. For instance, if my municipality’s definition of an impoundable vehicle is one with more than five parking violations, then once I discover a sixth parking violation, I can infer that the vehicle is impoundable. New information will not reverse that inference.

The converse is not true. If I find a vehicle with three parking violations, I cannot conclude that it is not impoundable. The open world, and the recognition that we are always dealing with incomplete data, conspire to tell us that if we look a bit harder, we may find some more parking violations.

This runs counter to what most corporate developers are accustomed to, but the open world assumption is one of the linchpins of the ability to evolve a database in place. The more productive thing to do is embrace this openness and treat the closed world assumption as an exception, to be applied when one needs it. The semantic query language, SPARQL, which we alluded to above, allows us to draw conclusions based on the data at hand: you can write a query that concludes that, based on what we know now, this is not an impoundable vehicle.
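Here is a sketch of such a “based on what we know now” query, run with rdflib over illustrative vehicle data; grouping and counting is inherently a closed-world operation over the data at hand:

    # Sketch: a closed-world conclusion at query time -- vehicles that,
    # based on what we know now, do NOT qualify as impoundable (> 5 violations).
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix ex: <https://example.com/city/> .
        ex:car42 a ex:Vehicle ; ex:hasViolation ex:v1, ex:v2, ex:v3 .
        ex:car99 a ex:Vehicle ; ex:hasViolation ex:v4, ex:v5, ex:v6,
                                                ex:v7, ex:v8, ex:v9 .
    """, format="turtle")

    q = """
        PREFIX ex: <https://example.com/city/>
        SELECT ?vehicle (COUNT(?violation) AS ?n) WHERE {
            ?vehicle a ex:Vehicle ; ex:hasViolation ?violation .
        }
        GROUP BY ?vehicle
        HAVING (COUNT(?violation) <= 5)
    """
    for vehicle, n in g.query(q):
        print(vehicle, "is not impoundable as far as we currently know:", n, "violations")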

Local constraints

Many developers and database designers come to Semantic Technology feeling dismayed at the seeming lack of constraints, and they sometimes leave the approach for that reason. Some adopt design approaches that mimic what they are used to but compromise the potential of the Semantic approach.

The lack of constraints is tied in with the open world. It essentially says, “I’d rather have partial information than no information.” A system with constraints essentially says, “I would rather have no data than incomplete data.” Most constraints are completeness checks: all items in this table must have values for these five columns.

Let’s say that our definition of Employee (US Employee) said semantically, “all employees have birthdates and social security numbers.” If we declared someone an Employee in a semantic system, we would know that they have a birthdate and a social security number, even if we don’t know what those values are. A traditional system with constraints would reject the assertion that someone is an employee if there wasn’t a birthdate and social security number to come along with it.

There is now a standard that allows us to add constraints to a semantic system, and that standard is called SHACL. SHACL has been designed in a way that the constraints can be applied locally. We can have two triple stores (two “repositories” in the lingo) that were built based on a shared ontology and, because of that, have a shared definition of meaning. One, the “curated” repository, could have constraints applied, such that a set of information in that repository is either complete or absent entirely. Any Person data in the Employee repository would only be there if it had a birthdate, a social security number, and perhaps a start date and a pay rate. Another repository might have Patient data and would only have a record if we had a patient ID, date of birth, gender, and at least one payment method. A third repository could have just whatever information we have on people.
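Here is a sketch of such a locally applied constraint using the pySHACL library; the shape, class, and property names are illustrative. The same open-world data that is perfectly acceptable in an uncurated repository fails validation against the curated Employee repository’s shapes:

    # Sketch: a SHACL shape for the curated Employee repository --
    # every Employee must have a birth date and a social security number.
    from rdflib import Graph
    from pyshacl import validate

    shapes = Graph().parse(data="""
        @prefix ex: <https://example.com/ont/> .
        @prefix sh: <http://www.w3.org/ns/shacl#> .

        ex:EmployeeShape a sh:NodeShape ;
            sh:targetClass ex:Employee ;
            sh:property [ sh:path ex:birthDate ; sh:minCount 1 ] ;
            sh:property [ sh:path ex:ssn ;       sh:minCount 1 ] .
    """, format="turtle")

    data = Graph().parse(data="""
        @prefix ex: <https://example.com/ont/> .
        ex:dave a ex:Employee ; ex:birthDate "1970-01-01" .   # no ssn
    """, format="turtle")

    conforms, _, report = validate(data, shacl_graph=shapes)
    print(conforms)   # False in the curated repository
    print(report)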

If we needed a curated set of data, we could interrogate either the employee repository or the patient repository. In each case, we would get incomplete information relative to what we might know about a person if we interrogated all three repositories, which would give us the most complete picture.

We think this ability to have global shared definitions, based on open world principles, coupled with the ability to have repositories with locally enforced constraints, is the best of both worlds.

Curated and uncurated data

There is a tension in the semantic web community between the wild west notions of the open world assumption and the desire of enterprises to “button things down.” This often leads to two separate sets of databases: one, mined by data scientists, is populated with data gathered from a myriad of sources and of course has low consistency and quality (though it might contain interesting insights not available to the corporate systems); the other is the highly curated and controlled enterprise data.

But these two need not be polar opposites. Both can be expressed in a shared ontology. In other words, the meaning of Person or Employee might be equivalent even though the rules of enforcement in one database are strict and another lax.

The result is that if you have a business problem that only needs or wants curated data (e.g., your internal payroll system), you should probably use only the curated dataset. But you may have an application that would benefit from the combination of curated and open data. Maybe combining your internal curated data about your employees with open data harvested from the web would lead to some interesting insight. This is made far easier if the curated and “uncurated” datasets share a single ontology (i.e., if the concepts mean the same thing, even though the rules of consistency are not necessarily enforced consistently in both).

Ontologies

An ontology is a conceptual model, built using a formal modeling notation. We will use the OWL modeling language as defined by the W3C as the canonical example, although there are other modeling languages.

For those who wish to delve deeper, one of our colleagues at Semantic Arts, Michael Uschold, has written a very approachable and yet deep treatment of OWL: Demystifying OWL for the Enterprise.43

Technically any data expressed in this modeling notation and packaged in an appropriate wrapper is an ontology. This would include a file of instance level assertions as long as it was packaged as an ontology; however, when most people use the term “ontology”, they are referring to the definition of the classes and properties that make up the model.

The triples that make up the formal definition of the meaning of the classes and properties are called “axioms.”

An ontology is different from traditional conceptual models in at least three important ways:

  1. It is computable. That is, you can execute code (reasoners) against the ontology.
  2. The use of the reasoner can detect logical inconsistencies in a way that other conceptual models cannot.
  3. It can be directly implemented. Traditional conceptual models must be transformed into logical models and then physical models before any data can be captured and stored. The ontology can be loaded as is into a triplestore, and data can also be loaded conforming to the ontology and be ready to be used.

Modularity and reuse

Ontologies have the interesting property of being modular. This modularity allows for partitioning a domain into chunks of comprehensible size. One ontology can import another, which means it brings in the declarations of the classes and properties, as well as the axioms that make up the formal definitions. The importing ontology has all the concepts from the ontology it imported and can start extending from there.

Imagine you were in the business of selling chemical substances. You might arrange your ontologies like the example shown below. The materials ontology knows about and extends the chemical ontology but need not be concerned with the price of materials. There is surprising economy in this approach. Often each module in a scheme like this may only have a few dozen to a hundred concepts. This is small enough that most consumers of the ontology can understand it.
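A sketch of what that import looks like in OWL, with illustrative ontology IRIs: the materials ontology declares owl:imports of the chemical ontology and then extends one of its classes.

    # Sketch: the materials ontology imports the chemical ontology
    # and extends one of its classes. IRIs are illustrative.
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix owl:  <http://www.w3.org/2002/07/owl#> .
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
        @prefix chem: <https://example.com/ont/chemical/> .
        @prefix mat:  <https://example.com/ont/material/> .

        <https://example.com/ont/material> a owl:Ontology ;
            owl:imports <https://example.com/ont/chemical> .

        # An extension: a packaged material is a kind of chemical substance
        mat:PackagedMaterial rdfs:subClassOf chem:ChemicalSubstance .
    """, format="turtle")
    print(g.serialize(format="turtle"))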

This chunking into modules is a great boon for reuse. The first benefit is getting each ontology to an understandable size. The second is that sometimes an ontology has axioms that conflict with those of the importing ontology; having reasonably sized modules allows an importer to select the modules they want and avoid the axioms they don’t.

Self-policing data

In a traditional system, the protection of a data set falls to the application that maintains it. But this makes the data very dependent on the application, encouraging further creation of data silos.

As we move to the Data-Centric model, we uncover the need for data to police itself (i.e., take care of its own quality management, constraints, and security).

This cannot be achieved by standards alone. The standards give us most of the building blocks we need, but some architecture is required to implement them. As mentioned earlier, this process will be more closely examined in The Data-centric Architecture, a companion book to this volume. For our purposes here, it suffices to say that in a Data-Centric architecture the data is managed independently of any application, based on rules (also expressed as data) that co-exist with the data.

Computable models

Traditional systems have data models as well. Often, they have conceptual models, logical models, and physical models. There is a difference in kind between ontologies and traditional models; we call that difference “being computable.”

It is very analogous to the difference between a paper map and an electronic map, such as Google Maps. They are both models of the real, geospatial world. The difference isn’t level of detail—a paper map can have any amount of detail. The difference is that one (Google Maps) is a computable model. You can interrogate the model, ask how far two points are from each other, how to get from here to there, and how many coffee shops are on the route.

You can ask a paper map the best route from point A to point B, but it is not going to help you. Google Maps is a computable model of geography.

The analogy holds between traditional data models and next generation data models. A traditional relational model, even though it may have been built with electronic tools, cannot be interrogated in any deep sense. There is a wide range of queries that can be executed against a next generation data model. It can work out its own class hierarchy, spot logical inconsistencies, infer instances into classes, and determine classes that are closely related. The computable model can be used to generate other models.
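As a small illustration of interrogating the model itself: the same query language used for instance data can ask the ontology for its class hierarchy, including indirect superclasses (class names are illustrative).

    # Sketch: ask the model for its own class hierarchy, including
    # indirect superclasses, via a SPARQL property path.
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix ex:   <https://example.com/ont/> .
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

        ex:Employee   rdfs:subClassOf ex:Person .
        ex:Contractor rdfs:subClassOf ex:Person .
        ex:Person     rdfs:subClassOf ex:Agent .
    """, format="turtle")

    q = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?class ?ancestor WHERE { ?class rdfs:subClassOf+ ?ancestor }
    """
    for cls, ancestor in g.query(q):
        print(f"{cls} is a kind of {ancestor}")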

Integration with relational

You won’t put all your data in a semantic graph database (at least, not in the short- to medium-term). Most of your existing data will persist in relational databases. However, we can achieve most of the benefits of a Data-Centric architecture without re-platforming the relational data.

The secret to this is a combination of a mapping technology called R2RML (the RDB to RDF Mapping Language) and the ability to federate queries.

R2RML is a W3C standard that describes how to create a map between a relational database and RDF triples. The map essentially describes how to mint URIs from the relational keys and establishes the equivalence between the column names and semantic properties. The fascinating thing about this standard is that once the map is built, it can be run in either of two modes. In “ETL” (Extract, Transform, and Load) mode, it mimics the behavior of the utilities that have been populating data warehouses for the last several decades: you process the entire database and turn it into triples, which can then be loaded into a triple store.
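Here is a sketch of what such a map looks like, using the W3C rr: vocabulary; the table, column, and ontology names are illustrative. It simply says how to mint product URIs from the key column and how to map the PRICE column to a property, and the map itself is just RDF:

    # Sketch of an R2RML triples map: mint product URIs from PRODUCT_ID
    # and map the PRICE column to ex:price. Table/column names illustrative.
    from rdflib import Graph

    r2rml_map = """
    @base <https://example.com/maps/> .
    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <https://example.com/ont/> .

    <#InventoryMap> a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "INVENTORY" ] ;
        rr:subjectMap [
            rr:template "https://example.com/product/{PRODUCT_ID}" ;
            rr:class ex:Product ] ;
        rr:predicateObjectMap [
            rr:predicate ex:price ;
            rr:objectMap [ rr:column "PRICE" ] ] .
    """
    # An R2RML processor would execute this map (in ETL or federated mode);
    # here we just parse it to show that the map is itself ordinary RDF.
    print(len(Graph().parse(data=r2rml_map, format="turtle")), "triples in the map")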

The same maps can be used in a federated query mode. In the federated query mode, a semantic query uses the R2RML map to reach into the relational system at query time and only “triplify” (or transform into triples) the datasets it needs. This is typically combined with query results either from a map to another relational database or to a triple store. This process of combining data from multiple data stores is called “federation.” The ability to combine triples stored in a triple store with those dynamically discovered at query time is a unique feature of this technology stack and is a key part of most migration strategies.

Integration with big data

In an analogous fashion, the semantic web standards provide a way to reach into “big data.” Most big data these days is expressed in JavaScript Object Notation (json) syntax. The data is in “documents” (these are not like word processing documents; they are nested sets of dictionaries and arrays). In json, a dictionary is syntactically expressed within { } and has keys followed by values (e.g., { 'hasCapital': 'Denver' }), and an array is comma-delimited within [ ] (e.g., ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']). It turns out that programmers and data scientists can deal with these structures very easily, no matter how deeply they are nested (for instance, you can, and usually do, have dictionaries of dictionaries with arrays of dictionaries).

While a programmer or a data scientist may be able to parse these structures, the problem is there is no mechanism to integrate this data with anything else you might have. This is where another standard, json for Linked Data (JSON-LD), comes in. It is essentially a map from semantics to json data structures. It is implemented as a header to a json document that maps the keys to semantic properties and describes how URIs will be minted.
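A sketch of such a context, with illustrative property IRIs: the @context block maps the json keys to semantic properties and says how identifiers become URIs, after which the document parses directly into triples (recent versions of rdflib bundle a JSON-LD parser).

    # Sketch: a JSON-LD context turns a plain json document into triples.
    # The property IRI and base are illustrative.
    import json
    from rdflib import Graph

    doc = {
        "@context": {
            "@base": "https://example.com/place/",
            "hasCapital": {"@id": "https://example.com/ont/hasCapital",
                           "@type": "@id"}
        },
        "@id": "Colorado",
        "hasCapital": "Denver"
    }

    g = Graph()
    g.parse(data=json.dumps(doc), format="json-ld")
    print(g.serialize(format="turtle"))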

Again, this allows you to either convert json data to triples and store them in a triple store or leave them where they are and interrogate them semantically at query time.

Natural language processing

For about two decades, the holy grail of integration has been the integration of structured (relational) data with unstructured (natural language documents) data. This has been a very hard problem for a long time. It is now a medium-hard problem. Many people have solved it, and the general solution is within reach.

Natural Language Processing (NLP) has been pretty good at extracting “named entities” from documents for a long time. A “named entity” is something like a specific person, organization, place, transaction, or event. What NLP has not been very good at—which is to say it required a lot of computational linguists to spend a lot of time training and tuning—is finding the relationships between the named entities in a document. We are entering an era where this is doable, and where it is possible (with a bit of training and configuration) to align the relations expressed in a document with the relations (properties) in an ontology. What this means is that it is now possible to harvest triples from unstructured documents that conform to an ontology, and therefore to harvest data from documents that can be combined with structured data.

Semantic standards stack

This is a variation of the “Semantic Web Layer Cake,”44 which was meant to help people visualize how the various standards that make up what we think of as Semantic Technology are related to each other. The most recent version (in the footnote) still contains research areas that didn’t become standards (user interface, unifying logic, trust and crypto), some that were developed but not widely used (RIF), and some that were developed differently (Trust became PROV-O). There are also many important W3C standards not in that picture.

This version includes the standards that most practitioners are following. Most of these have been mentioned above. NG stands for “Named Graph,” a standard for tagging triples in a triple store that turns out to be very handy for implementing provenance (PROV-O).

Chapter Summary

Building a Data-Centric enterprise has always been a good idea, even if it has not been a widely held good idea. But for most of the time we’ve been implementing systems, the infrastructure and support did not exist to make it economically feasible. This has changed. Most of the concepts and practices outlined in this section are now more than a decade old. There are standards, tools, and trained professionals that can apply these concepts. We have existence proofs. All that remains is creating a plan, putting the pieces together, and getting started.

As we saw in our earlier case studies, it is possible to become Data-Centric without relying on semantic and graph technology, but it demands far more discipline. Linked Data, Knowledge Graphs and Semantic Technology offer capabilities that make it easier to adopt the Data-Centric approach.

The next chapter explores what a few trailblazing companies have done to apply Semantic Technology in a way that is consistent with the theme of this book, which is to enable a Data-Centric enterprise.
