Chapter 19
In This Chapter
Architecting triples and quads
Applying standards
Managing ontologies
I want to begin this chapter by asking, “Why do you need a triple store or graph store?” Do you really need a web of interconnected data, or can you simply tag your data and infer relationships according to the records that share the same tags?
If you do have a complex set of interconnected data, then you need to decide what query functionality you need to support your application. Are you querying for data, or trying to mathematically analyze the graph itself?
In order to get the facts, or assertions, that you require, are you manually adding them, importing them from another system, or determining them through logical rules, called inferencing? By inferencing, I mean that if Luke is the son of Anakin, and Anakin’s mother is called Shmi, then you can infer that Luke’s grandmother is Shmi.
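To make this concrete, here is a minimal sketch of that inference in Turtle. The :hasFather, :hasMother, and :hasGrandmother predicates are hypothetical names invented for illustration, not part of any standard ontology:

```turtle
# Asserted facts (hypothetical predicates, for illustration only)
:Luke   :hasFather :Anakin .
:Anakin :hasMother :Shmi .

# A rule such as "?x :hasFather ?y . ?y :hasMother ?z => ?x :hasGrandmother ?z"
# lets the store infer a triple that was never explicitly asserted:
:Luke   :hasGrandmother :Shmi .
```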
The tool you select for the job — whether it’s a simple document store with metadata support, a triple store, or a graph store — will flow from your answers to the preceding questions.
In this chapter, I discuss the basic model of graph stores and triple stores, how they differ, and why you might consider using them.
I deliberately separated the terms graph store and triple store in this book. The reason is pretty simple. Although the underlying structures are the same, the analysis done on them is drastically different.
This difference means that graph and triple stores, by necessity, are architected differently. In the future, they may share a common underpinning, but not all the architectural issues of distributing graphs across multiple machines have been addressed at this point in time.
A triple store manages individual assertions. For most use cases, you can simply think of an assertion as a "fact." These assertions describe subjects’ properties and the relationships between subjects. The data model consists of many simple subject – predicate – object triples, as shown in Figure 19-1.
This subject – predicate – object triple allows complex webs of assertions, called graphs, to be built up. One triple could describe the type of the subject, another an integer property belonging to it, and another a relationship to another subject.
Figure 19-2 shows a simple graph of subjects and their relationships, but no properties for each subject. You can see that each relationship and type is described using a particular vocabulary. Each vocabulary is called an ontology.
Each ontology describes a set of types, perhaps with inheritance and “same as” relationships to other types in other ontologies. These ontologies are described using the same triple data model in documents composed of Resource Description Framework (RDF) statements.
In graph theory these subjects are called vertices, and each relationship is called an edge. In a graph, both vertices and edges can have properties describing them.
Every graph store is a triple store because both share the same concepts. However, not every triple store is a graph store because of the queries that each can process.
A triple store typically answers queries for facts. Listing 19-1 shows a simple query (based on the graph in Figure 19-2) to return all facts about the first ten subjects of type person.
Listing 19-1: Simple SPARQL Query
SELECT ?s ?p ?o WHERE {
?s rdf:type :person .
?s ?p ?o .
} LIMIT 10
In a more complex example, you may look for subjects who are related to other subjects through several relationships across the graph, as illustrated in Listing 19-2.
Listing 19-2: Complex SPARQL Query
SELECT ?s WHERE {
?s rdf:type :person .
?s :knows ?s2 .
?s2 rdf:type :person .
?s2 :likes :cheese .
} LIMIT 10
In Listing 19-2, you aren’t looking for a directly related subject but for one separated by a single hop — that is, a query on an object related to another object through an intermediate vertex in the graph. You’re asking for a list of the first ten people who know someone who likes cheese.
These example SPARQL queries have one thing in common: They return a list of triples as the result of the operation. They are queries for the data itself, not queries about the state of the relationships between subjects, the size of a graph, or the degree of separation between subjects in a graph.
A graph store provides the ability to discover information about a relationship or a network of relationships. Graph stores can respond to queries for data, too, but they are also concerned with the mathematical relationships between vertices.
Generally, you don’t find graph operations, such as calculating the size of a graph or the degree of separation between subjects, in triple stores.
These graph algorithms are mathematically much harder to satisfy than queries that simply return a set of facts, because they may traverse the graph to an unpredictable depth from the first object in the database.
Triple queries, on the other hand, are always bounded by a depth within their queries and operate on a known set of vertices as specified in the queries themselves.
Graph stores also allow their relationships, or edges, to be described using properties. This capability isn’t supported by RDF in triple stores. Instead, you create a special subject to represent the relationship itself and add the properties to this intermediate subject.
This process does lead to more complex queries, but they can be handled using the query style shown in Listing 19-2.
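As a sketch of that technique, assuming hypothetical :source, :target, and :since predicates and a hypothetical :Knows type, an "Anne knows Bob since 2009" edge might be modeled like this:

```turtle
# The edge ":Anne :knows :Bob" can't carry properties directly in RDF,
# so an intermediate subject represents the relationship itself.
:rel1 rdf:type :Knows .            # hypothetical relationship type
:rel1 :source  :Anne .             # the subject end of the edge
:rel1 :target  :Bob .              # the object end of the edge
:rel1 :since   "2009"^^xsd:gYear . # a property describing the edge
```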
The differences between the graph and triple store data models lead to great differences in architecture. Because of the number of hops possible in graph queries, a graph store typically requires all its data to be held on a single server in order to make queries fast.
A triple store, on the other hand, can distribute its data in the same manner as other NoSQL databases, with a specialized triple index to allow distributed queries to be spread among servers in a cluster.
Whether you choose a triple or graph store isn’t a question of which architecture you prefer; instead, the question is which type of queries you need to support.
Triples provide the same flexibility in modeling relationships as you get in schema-less NoSQL document models. They are basically schema-less relationships: You are free to add, remove, and edit them without informing the database beforehand of the particular types of relationship you’re going to add.
Triple stores are concerned with storing and retrieving data, not returning complex metrics or statistics about the interconnectedness of the subjects themselves.
Triple stores are also built on the open RDF set of standards and the SPARQL query language. Graph stores each have their own terminology, slightly different data models, and query operations.
If you need to query information about the graph structure, then choose a graph store. If you only need to query information about subjects within that graph, then choose a triple store.
From this point on, I use the term triple store to refer to both triple and graph stores, unless stated otherwise.
The subject – predicate – object data model is a very flexible one. It allows you to describe individual assertions.
There are situations, though, when the subject – predicate – object model is too simple, typically because your assertion makes sense only in a particular context: for example, when you’re describing a particular patient in one medical trial versus another, or the status of one person within two different social groups.
Thankfully, you can easily model context within a triple store. Triple stores have the concept of a named graph. Rather than simply add all your assertions globally, you add them to a named part of the graph.
You can use this graph name to restrict queries to a particular subset of the information. In this way, you don’t need to change the underlying ontology or the data model used in order to support the concept of context.
In the preceding example, you could have a different named graph for each medical trial or each social group. If you specify the named graph in your query, you restrict the context queried. If you don’t specify the named graph, you get a query across all your data.
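For instance, a query restricted to one trial's context might look like the following sketch. The graph IRI and the :patient and :status terms are hypothetical:

```sparql
# Only triples in the named graph for trial A are matched
SELECT ?patient ?status WHERE {
  GRAPH <http://example.org/graphs/trial-a> {
    ?patient rdf:type :patient .
    ?patient :status ?status .
  }
} LIMIT 10
```

Dropping the GRAPH clause turns this back into a query across every context in the store.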
Note that each triple can be stored in only a single named graph. This means you must carefully select what you use as your context and graph name. If you don’t, you may find yourself in a situation where you need two contexts for a single set of triples.
By putting all the triples for a particular application in one database and querying across them, you’re implicitly saying they all have value. In most situations, therefore, you don’t need a context, and you can keep thinking in terms of triples rather than quads.
If you need the context concept and you can add a property to a subject without making your queries complex, or altering an ontology, then do so, because this approach is more flexible.
If you absolutely need to use context without adding your own properties outside of an ontology, then using the graph name for context will give you the quads you need.
The first standard you need to become familiar with when dealing with triples is the Resource Description Framework (RDF). This standard describes the components of the RDF data model, which includes subjects, predicates, objects, and how they are described.
Here are a few key RDF concepts that are explained in detail in Semantic Web for the Working Ontologist:
A key difference between RDF and other specifications is that there are multiple expression formats for RDF data, not just a single language. Common languages are N-Triples, Turtle, and RDF/XML.
Which format you choose depends on the tools you’re using. A person who understands RDF in one format should be able to pick up another format easily enough without formal retraining.
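As an illustration, here is the same hypothetical assertion expressed in two of those formats:

```turtle
# N-Triples: one fully written-out triple per line, no abbreviations
<http://example.org/Luke> <http://example.org/sonOf> <http://example.org/Anakin> .

# Turtle: prefix declarations make the same triple far more readable
@prefix ex: <http://example.org/> .
ex:Luke ex:sonOf ex:Anakin .
```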
In addition to RDF, there are other related standards in the ecosystem. Here are the standards you need to become familiar with:
These specifications allow you to define not only the data in your database but also the structure within that data and how it’s organized.
You can use a triple store by utilizing a single RDF serialization like N-Triples, and SPARQL to query the information the database contains. It’s good to know about these other specifications, however, when you’re designing ontologies that you share with the outside world.
SPARQL is a recursive acronym that stands for SPARQL Protocol and RDF Query Language. SPARQL uses a variant of the Turtle language to provide a query mechanism on databases that store RDF information.
SPARQL provides several modes of operation: SELECT returns data matching a pattern, CONSTRUCT builds new triples from the matched data, ASK tests whether a pattern has any match, and DESCRIBE returns the triples describing a resource.
These operations can be restricted to portions of the database using a Where clause.
As shown in Listing 19-1, select statements can be very simple. The sample in Listing 19-1 returns all triples in the database that match the given expression.
You can construct more complex queries to find particular subjects that match a query across relationships in the graph, as shown in Listing 19-2.
SPARQL 1.1, an update to the SPARQL standard, is now widely implemented, and people looking to adopt a triple store often request support for it.
Version 1.1 provides a “group by” structuring mechanism and allows aggregation functions to be performed over triples.
Listing 19-3 shows both an aggregation function (AVG for mean average) and a GROUP BY clause. This query returns the average age of purchasers for each product ordered from a website.
Listing 19-3: Product Average Purchaser Age Query and Result
SELECT ?title (AVG(?age) AS ?averageage) WHERE {
?product :id ?id .
?product :title ?title .
?order rdf:type :order .
?order :has_item ?product .
?order :owner ?owner .
?owner :age ?age .
} GROUP BY ?title
SPARQL 1.1 also provides a HAVING keyword that acts like a FILTER clause, except that it operates over the result of an aggregation specified in the SELECT clause, rather than over a bound variable within the WHERE clause.
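For example, the following sketch extends Listing 19-3 to return only products whose purchasers’ average age exceeds 30 (the threshold and property names are illustrative):

```sparql
SELECT ?title (AVG(?age) AS ?averageage) WHERE {
  ?order rdf:type :order .
  ?order :has_item ?product .
  ?product :title ?title .
  ?order :owner ?owner .
  ?owner :age ?age .
} GROUP BY ?title
HAVING (AVG(?age) > 30)   # applied after aggregation, unlike FILTER
```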
An often-overlooked specification is the W3C SPARQL 1.1 Graph Store HTTP Protocol. This is a single web address (called an HTTP endpoint; HTTP is the Hypertext Transfer Protocol, the protocol that powers the web) that allows clients to create, modify, get, and delete named graphs within a triple store.
This is a simple specification that can be easier to work with than the more complex SPARQL 1.1 Update mechanism. The graph store protocol is easy to use because you can take any Turtle RDF file and use a simple web request to create a graph, or add new data to an existing graph.
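The exact endpoint path varies by product, but a request following the protocol might look like this sketch, which creates (or replaces) a named graph from a Turtle payload; the host, path, and graph name are invented for illustration:

```http
PUT /rdf-graph-store?graph=http%3A%2F%2Fexample.org%2Fgraphs%2Ftrial-a HTTP/1.1
Host: db.example.org
Content-Type: text/turtle

@prefix ex: <http://example.org/> .
ex:patient1 a ex:patient .
```

A POST to the same address appends triples to the graph, a GET retrieves it, and a DELETE removes it.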
SPARQL 1.1 Update is a variation of SPARQL that allows the insertion and deletion of triples within a named graph. It also provides graph deletion via the DROP operation, as well as COPY, LOAD, MOVE, and ADD operations.
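A sketch of these update operations, using hypothetical graph names and a default prefix, might look like this:

```sparql
# Add a triple to one named graph, remove a triple from it,
# and then delete an obsolete graph entirely
INSERT DATA {
  GRAPH <http://example.org/graphs/people> { :Luke :knows :Leia . }
} ;
DELETE DATA {
  GRAPH <http://example.org/graphs/people> { :Luke :knows :Han . }
} ;
DROP GRAPH <http://example.org/graphs/obsolete>
```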
Triple stores provide great flexibility by allowing different systems to use the same data model to describe things. That flexibility comes at a cost, though: people are free to describe the same things in very different ways!
RDF Schema (RDFS), OWL, and SKOS allow developers to use the familiar RDF mechanism to describe how these structures relate to one another and to existing relationships and values.
An ontology is a semantic model in place within an RDF store. A single store can contain information across many ontologies. Indeed, you can use two ontologies to describe different aspects of the same subject.
The main tool used to describe the structures in an RDF ontology is the RDF Schema Language (RDFS). Listing 19-4 illustrates a simple example of an RDF Schema.
Listing 19-4: Some Assertions
:title rdfs:domain :product .
:service rdfs:subClassOf :product .
:period rdfs:domain :service .
:foodstuff rdfs:subClassOf :product .
:expiry rdfs:domain :foodstuff .
Listing 19-5 shows how RDF Schema are used in practice.
Listing 19-5: Triples Within This RDF Schema
:SoftwareSupport rdf:type :service .
:SoftwareSupport :period "12 months" .
:SoftwareSupport :title "Software Support" .
:Camembert rdf:type :foodstuff .
:Camembert :title "Camembert Cheese" .
:Camembert :expiry "2014-12-24"^^xs:date .
The preceding schema implies the following: Because a service is a subclass of product, SoftwareSupport is both a service and a product, so it can carry a title as well as a period. Likewise, Camembert is both a foodstuff and a product, so it can carry both an expiry date and a title.
Relationships within triples are directional, thus the semantic web industry’s frequent references to directed graphs. The relationship is from one subject to one object. In many situations, a relationship has a natural opposite; for example, if Anakin is the father of Luke, then Luke is the son of Anakin.
The Web Ontology Language, OWL, provides extensions to RDF Schema that help model more complex scenarios, including that in Listing 19-6.
Listing 19-6: Simple Use of the OWL inverseOf Property
:person rdf:type owl:Class .
:fatherOf rdf:type owl:ObjectProperty;
rdfs:domain :person;
rdfs:range :person;
owl:inverseOf :sonOf .
:sonOf rdf:type owl:ObjectProperty;
rdfs:domain :person;
rdfs:range :person;
owl:inverseOf :fatherOf .
As you can see in Listing 19-6, the inverseOf predicate can be used to specify that two relationships are the opposite of each other. This enables the presence of one relationship to imply that the other relationship also exists, in the opposite direction.
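If your store doesn’t perform this inference automatically, you can materialize the inferred triples yourself with a SPARQL CONSTRUCT query, as in this sketch based on the properties in Listing 19-6:

```sparql
# Build the inverse :sonOf triple for every asserted :fatherOf triple
CONSTRUCT { ?son :sonOf ?father }
WHERE     { ?father :fatherOf ?son }
```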
Of course, many more sophisticated examples are available. You can probably think immediately of other ways to apply this concept.
A common requirement in using a triple store is to define concepts and how objects fit within those concepts. Examples include
SKOS is used to define vocabularies to describe the preceding scenarios’ data modeling needs.
A concept is the core SKOS type. Concepts can have preferred labels and alternative labels. Labels provide human readable descriptions. A concept can have a variety of other properties, too, including a note on the scope of the concept. This provides clarification to a user of the ontology as to how a concept should be used.
A concept can also have relationships to narrower or broader concepts, and it can describe relationships to other concepts as close matches or exact matches.
Listing 19-7 is an example SKOS ontology used to describe customers and the class of customer they belong to within a company.
Listing 19-7: SKOS Vocabulary to Describe Customer Relationships
amazon:primecustomer a skos:Concept ;
skos:prefLabel "Amazon Prime Customer"@en ;
skos:broader amazon:customer .
amazon:customer a skos:Concept ;
skos:prefLabel "Amazon Customer"@en ;
skos:broader :customer ;
skos:narrower amazon:primecustomer .
SKOS provides a web linkable mechanism for describing thesauri, taxonomies, folksonomies, and controlled vocabularies. This can provide a very valuable data modeling technique.
In particular, SKOS provides a great way to power drop-down lists and hierarchical navigation user-interface components. So, consider SKOS for times when you need a general-purpose, cross-platform way to define a shared vocabulary, especially if the resulting data ends up in a triple store.
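For example, one level of such a navigation tree could be populated with a query like the following sketch, based on the vocabulary in Listing 19-7:

```sparql
# Fetch the immediate children of a concept, with human-readable labels
SELECT ?concept ?label WHERE {
  ?concept skos:broader amazon:customer .
  ?concept skos:prefLabel ?label .
} ORDER BY ?label
```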
In the day-to-day use of databases you likely create, update, and delete data with abandon. In this book, you find out how to gather information from disparate sources and store all of it together, using document or semantic mechanisms to create new content or infer facts.
In larger systems, or systems used over time, you can end up with very complicated, interconnected pieces of information. The police may receive an innocuous tweet suggesting that a person is of interest, and then decide six months later, after examining lots of other data, that this person’s house should be raided.
How do you prove that the chain of information and events you received, the assessments you made, and the decisions you took were reasonable and justified this action?
Similarly, records, especially documents held in document-oriented NoSQL databases, are changed by people who are often in the same organization. This is even more the case when you’re dealing with distributed systems like a wiki. How do you describe the changes that content goes through over time, who changed it, and why? This kind of documentation is known as data provenance.
You can invent a number of ways to describe these activities. However, a wonderful standard based on RDF has emerged to describe such changes of data over time.
The W3C (yes, those people again!) PROV Ontology (PROV-O) provides a way to describe documents, versions of those documents, changes, the people responsible, and even the software or mechanism used to make the change!
PROV-O describes some core classes:
prov:Entity: The subject being created, changed, or used as input
prov:Activity: The process by which an entity is modified
prov:Agent: The person or process carrying out the activity on an entity
These three core classes can be used to describe a range of actions and changes to content. They can form the basis for systems to help prove governance is being followed within a data update or action chain.
PROV-O comprises many properties and relationships. There’s not room in this book to describe all of them, nor could my tired ole hands type them! But here is a selection I’d like to briefly mention:
wasGeneratedBy: Indicates which activity generated a particular entity
wasDerivedFrom: Shows versioning chains or where data was amalgamated
startedAtTime, endedAtTime: Provide information on when the activity was performed
actedOnBehalfOf: Allows a process agent to indicate, for example, which human agent it was running for; also used to determine when one person performs an operation at the request of another
Regardless of your requirements for tracking modifications of records or for describing actions, you can use PROV-O as a standards-compliant basis for managing the records of changes to data in your organization.
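A small sketch in Turtle shows how these classes and properties fit together; the entity, activity, and agent names are invented for illustration:

```turtle
# One revision of a report, the edit that produced it, and who made the edit
:report-v2 a prov:Entity ;
    prov:wasDerivedFrom    :report-v1 ;
    prov:wasGeneratedBy    :edit42 .
:edit42    a prov:Activity ;
    prov:startedAtTime     "2014-06-01T09:00:00Z"^^xsd:dateTime ;
    prov:wasAssociatedWith :alice .
:alice     a prov:Agent .
```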
PROV-O is a group of standards that includes validator services, which can be run against a triple store holding PROV-O data. It’s a standard well worth being familiar with if you need to store data about changes to information held in your NoSQL database.