CHAPTER 5


Tables are Not Your Friends: Graph Databases

Google sought to gauge what people were thinking, and became what people were thinking. Facebook sought to map the social graph, and became the social graph.

—George Dyson, Turing’s Cathedral

Whenever someone gives you a problem, think graphs.

—Steve Yegge, Get That Job at Google (blog post)

Proponents of key-value stores, document databases, and relational systems disagree about practically every aspect of database design, but they do agree in one respect: databases are about storing information about “things,” be those things represented by JSON, tables, or binary values. But sometimes it’s the relationships between things, rather than the things themselves, that are of primary interest. This is where graph database systems shine.

Graph structures are most familiar to us from social networks such as Facebook. In Facebook, the information about individuals is important, but it’s the network between people—the social graph—that provides the unique power of the platform. Similar graph-oriented datasets occur within network topologies, access-control systems, medical models, and elsewhere.

The relational database is completely capable of modeling such networks through the use of foreign keys and self-joins. Unfortunately, the RDBMS generally hits performance issues when working with very large graphs, and SQL lacks an expressive syntax to work with graph data.

The NoSQL solutions described so far do an even worse job with graph models than the relational database. In a key-value store or document database, a graph structure can be stored within a single document or object, but relationships between objects are not inherently supported (because there are no joins).

To efficiently process a data model in which complex networks of relationships between objects or entities are a primary focus, a different type of database is called for: the graph database.

What is a Graph?

Like the relational database, but unlike many nonrelational systems, graph databases are based on a strong theoretical foundation. Graph theory is a long-established branch of mathematics, with many practical applications in medicine, physics, and sociology, as well as in computer science.

Graph theory defines these major components of a graph:

  • Vertices, or “nodes,” represent distinct objects.
  • Edges, or “relationships” or “arcs,” connect these objects.
  • Both vertices and edges can have properties.

While vertices and edges are the terms used most frequently in mathematical theory, we’ll use “nodes” and “relationships” more often in this chapter because those terms lead to less convoluted and murky language.

Both nodes and relationships can have properties. The properties of nodes are not dissimilar to those you might find associated with a relational table or in a JSON document. Properties of relationships might include the type, strength, or history of the relationship.

Figure 5-1 illustrates a simple graph with four nodes (vertices) and three relationships (edges).


Figure 5-1. Simple graph with four vertices (nodes) and three edges (relationships)

Graph theory provides mathematical notation for adding nodes and relationships to the graph or removing them, and for performing operations to find adjacent nodes. These primitive operations can be used to perform a graph traversal—walking through the graph to explore the network. To use a familiar example, Facebook sometimes performs a graph traversal to find the friends of your friends. In the example shown in Figure 5-1, we might perform a graph traversal to find all the actors who have ever appeared as a co-star in a movie starring Keanu Reeves.

RDBMS Patterns for Graphs

It’s relatively easy to represent graph structures in the relational model. For instance, we might create the relational structure shown in Figure 5-2 to represent the graph shown in Figure 5-1.


Figure 5-2. Relational schema to represent our sample graph data
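
A schema of this kind might be created with DDL along the following lines. The table names (PEOPLE, MOVIES, and a join table ACTORS) match those referenced later in this section, while the column definitions are illustrative assumptions rather than the exact design shown in the figure:

-- Sketch of a relational schema for the graph of Figure 5-1.
-- PEOPLE and MOVIES hold the nodes; ACTORS is a join table holding the
-- ACTED_IN relationships, with the role stored as a relationship property.
CREATE TABLE people (
    person_id INTEGER PRIMARY KEY,
    name      VARCHAR(100) NOT NULL,
    born      INTEGER
);

CREATE TABLE movies (
    movie_id  INTEGER PRIMARY KEY,
    title     VARCHAR(200) NOT NULL,
    released  INTEGER
);

CREATE TABLE actors (
    person_id INTEGER NOT NULL REFERENCES people (person_id),
    movie_id  INTEGER NOT NULL REFERENCES movies (movie_id),
    role      VARCHAR(100),
    PRIMARY KEY (person_id, movie_id)
);
-- A similar join table (for example, DIRECTED) would hold director relationships.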

While the relational model can easily represent the data that is contained in a graph model, we face two significant problems in practice:

  1. SQL lacks the syntax to easily perform graph traversal, especially traversals where the depth is unknown or unbounded. For instance, using SQL to determine friends of your friends is easy enough, but it is hard to solve the “degrees of separation” problem (often illustrated using the example of the number of connections that separate you from Kevin Bacon).
  2. Performance degrades quickly as we traverse the graph. Each level of traversal adds significantly to query response time.

For instance, consider the SQL needed to find all actors who have ever worked with Keanu Reeves. It might look something like that shown in Figure 5-3.


Figure 5-3. SQL to perform first-level graph traversal

This SQL queries PEOPLE to find Keanu, then ACTORS and MOVIES to find the movies he has starred in, then ACTORS and PEOPLE again to find the other actors who appeared in those movies. A five-way join such as this is manageable, but each additional level of search depth requires three more joins. Furthermore, there is no syntax that allows us to search to an arbitrary depth; if we want to expand the whole graph, we can’t, because there is no way to know in advance how many levels we need to examine.
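
A sketch of such a query, assuming the PEOPLE, ACTORS, and MOVIES tables described above (and not necessarily the exact statement shown in Figure 5-3), follows:

-- First-level traversal: actors who have appeared in any movie with Keanu Reeves
SELECT DISTINCT co_star.name
  FROM people keanu
  JOIN actors keanu_role ON keanu_role.person_id = keanu.person_id
  JOIN movies            ON movies.movie_id      = keanu_role.movie_id
  JOIN actors co_role    ON co_role.movie_id     = movies.movie_id
  JOIN people co_star    ON co_star.person_id    = co_role.person_id
 WHERE keanu.name = 'Keanu Reeves'
   AND co_star.person_id <> keanu.person_id;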

Performance in an RDBMS is also a potential issue. Provided that appropriate indexes are created, our sample query above will execute with acceptable overhead. But each join requires an index lookup for each actor and movie. Each index lookup adds overhead, and for deep graph traversals the cumulative cost is often unacceptable. Sometimes the best solution is to load the tables in their entirety into map structures within application code and traverse the graph in memory. However, this assumes that the application has enough memory to cache all the data, which is not always the case.

Because key-value stores do not support joins, they offer even less graph traversal capability than the relational database. In a pure key-value store or document database, graph traversal logic must be implemented entirely in application code.

RDF and SPARQL

The Resource Description Framework (RDF) is a web standard developed in the late 1990s for the modeling of web resources and the relationships between them.  It represents one of the earliest standards for representing and processing graph data.

Information in RDF is expressed in triples, conceptually of the form entity : attribute : value; for instance:

TheMatrix : is : Movie
Keanu : is : Person
Keanu : starred in : TheMatrix

RDF was intended as a means of creating a formal database of resources on the web—particularly web services—together with the dependencies among those resources.

RDF graphs can be stored in a variety of formats, including XML, or even within tables inside a relational database. However, native RDF databases—known as triplestores—have also been implemented; these include AllegroGraph, Ontotext GraphDB, Stardog, and Oracle Spatial.

RDF supports a query language SPARQL (a recursive acronym: SPARQL Protocol and RDF Query Language), which is a SQL-like language for interrogating RDF data. Figure 5-4 shows a SPARQL query interrogating DBpedia (an RDF representation of selected Wikipedia data) to find all objects that have Edgar Codd as a topic.


Figure 5-4. Interrogating DBpedia using SPARQL
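
A query of this general kind, given as a sketch rather than the exact statement in the figure, might find every resource that references the DBpedia entity for Edgar Codd (assuming the resource URI dbr:Edgar_F._Codd):

# Find resources that refer to the DBpedia entity for Edgar F. Codd
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT DISTINCT ?subject ?predicate
WHERE {
  ?subject ?predicate dbr:Edgar_F._Codd .
}
LIMIT 25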

Property Graphs and Neo4j

While RDF is an important graph technology, the property graph model provides a richer way of representing complex data by associating properties with both nodes and relationships.

The property graph model is the basis for Neo4j, the most widely adopted graph database. Neo4j is a Java-based graph database that can easily be embedded in any Java application or run as a standalone server. Neo4j supports billions of nodes, ACID-compliant transactions, and multiversion consistency.

Neo4j implements a declarative graph query language, Cypher. Cypher allows graphs to be queried using a simple syntax somewhat comparable to SQL or SPARQL, but optimized in particular for graph traversals.

The following Cypher statements would create the graph structure shown in Figure 5-1 within Neo4j:

CREATE (TheMatrix:Movie {title:'The Matrix', released:1999,
        tagline:'Welcome to the Real World'})
CREATE (JohnWick:Movie {title:'John Wick', released:2014,
        tagline:'Silliest Keanu movie ever'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (AndyW:Person {name:'Andy Wachowski', born:1967})
CREATE
  (Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
  (Keanu)-[:ACTED_IN {roles:['John Wick']}]->(JohnWick),
  (AndyW)-[:DIRECTED]->(TheMatrix)

The following Cypher query retrieves information for a single node in the graph:

neo4j-sh (?)$ MATCH (keanu:Person {name:"Keanu Reeves"})
>             RETURN keanu;
+----------------------------------------+
| keanu                                  |
+----------------------------------------+
| Node[1]{name:"Keanu Reeves",born:1964} |
+----------------------------------------+

The MATCH clause is roughly equivalent to the WHERE clause in SQL and the RETURN clause to a SELECT list.
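
For comparison, a rough SQL equivalent of the query above, run against a relational schema like the one in Figure 5-2, would be:

SELECT *
  FROM people
 WHERE name = 'Keanu Reeves';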

MATCH supports a syntax for traversing relationships in the graph. For instance, the following query finds all the actors who have ever acted in the same movie as Keanu Reeves:

neo4j-sh (?)$ MATCH (keanu:Person {name:"Keanu Reeves"})
                 -[:ACTED_IN]->(movie)<-[:ACTED_IN]-(coStar)
              RETURN coStar.name;
+----------------------+
| coStar.name          |
+----------------------+
| "Jack Nicholson"     |
| "Diane Keaton"       |
| "Dina Meyer"         |
| "Ice-T"              |
| "Takeshi Kitano"     |

In the above example, we specified the relationship type (ACTED_IN) and bound the intermediate node to the variable movie. We can also use a wildcard to match any relationship or node type, and we can specify the depth of the traversal, as in the following query, which matches all nodes within two hops of “Keanu Reeves.” The results will include all the movies he has starred in, together with their co-stars and directors.

neo4j-sh (?)$ MATCH (keanu:Person {name:"Keanu Reeves"})
               -[*1..2]-(related) RETURN distinct related;
+------------------------------------------------------------------------
| related
+------------------------------------------------------------------------
| Node[0]{title:"The Matrix",released:1999,
          tagline:"Welcome to the Real World"}
| Node[7]{name:"Joel Silver",born:1952}
| Node[5]{name:"Andy Wachowski",born:1967}
| Node[6]{name:"Lana Wachowski",born:1965}

Just as SQL queries generate result sets that are structured as tables, Cypher returns results that can themselves be graphs. For instance, in Figure 5-5, we generate a graph showing all nodes related to “Keanu Reeves” out to two levels.


Figure 5-5. Graph output from a Cypher query
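
A query along the following lines (a sketch, not necessarily the exact statement behind Figure 5-5) returns whole paths rather than individual properties; graph-aware clients such as the Neo4j browser render such results as a subgraph:

MATCH p = (keanu:Person {name:"Keanu Reeves"})-[*1..2]-(related)
RETURN p;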

Gremlin

Although easy to use, Cypher was until recently specific to Neo4j. Gremlin is an alternative graph database query language that can be used with Neo4j and a variety of other graph engines, including Titan (now part of the DataStax Cassandra distribution) and OrientDB.

Compared to Cypher’s nonprocedural, SQL-like style, Gremlin is a more procedurally oriented language. Using Gremlin, the programmer specifies the exact traversal operations to be performed—possibly across multiple statements.

Figure 5-6 illustrates the Gremlin “toy” sample database. The database contains six vertices (nodes), which represent four people and two projects. Some people have created projects, and some people know other people.


Figure 5-6. Gremlin sample graph database

In Gremlin notation, V represents the vertices/nodes while E represents edges/relationships:

gremlin> myGraph.V.map
==>{name=marko, age=29}
==>{name=vadas, age=27}
==>{name=lop, lang=java}
==>{name=josh, age=32}
==>{name=ripple, lang=java}
==>{name=peter, age=35}

gremlin> myGraph.E
==>e[11][4-created->3]
==>e[12][6-created->3]
==>e[7][1-knows->2]
==>e[8][1-knows->4]
==>e[9][1-created->3]
==>e[10][4-created->5]

This command displays the people “known” by Marko:

gremlin> marko=myGraph.v(1)
==>v[1]
gremlin> marko.out('knows').name
==>vadas
==>josh

This command walks the graph to find the projects created by people whom Marko knows:

gremlin> marko.out('knows').out('created').name
==>lop
==>ripple
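
Traversals can also be filtered on properties along the way. The following sketch, using the same classic Gremlin-Groovy syntax as the examples above, restricts the people Marko knows to those over 30; given the ages shown earlier, only josh qualifies:

gremlin> marko.out('knows').filter{it.age > 30}.name
==>josh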

In late 2015, Neo Technology announced an open-source version of the Cypher language. Oracle, Databricks (the Spark company), and several other vendors have announced support for this openCypher language. Given the popularity of Neo4j and the arguably more accessible syntax offered by Cypher, it’s quite possible that the motivation for Gremlin as an alternative language will diminish.

Graph Database Internals

Graph processing can be performed on databases irrespective of their internal storage format. The logical model of a graph database outlined at the beginning of the chapter can be implemented on almost any database system; indeed, relational systems can represent the logical relationships within a graph without compromise. However, as we discussed earlier, performance on relational systems—even with optimal indexing—degrades as the depth of the graph traversal increases. Efficient real-time graph processing in practice requires a way to move through the graph structure without performing index lookups; this capability is referred to as index-free adjacency.

For example, to implement a graph in a relational model such as the one we saw in Figure 5-2, a join table is used to navigate between two rows—possibly in the same table—that represent adjacent nodes. Indexes allow the logical key value of the adjacent row to be translated to a physical address within the database, which can then be accessed quickly. In a typical implementation, three or four logical IO operations are required to traverse a B-tree index, and a further lookup is required to retrieve the row itself. These IO operations may be satisfied in cache memory, but usually some physical IO is required.

As we discussed earlier, index lookups are typically adequate for shallow traversals, but they contribute to performance degradation as the traversals become deeper. The exact implementation varies depending on the type of database, but the need to create an index to facilitate a graph traversal is common to all nongraph databases.

In a native graph database utilizing index-free adjacency, each node knows the physical location of all adjacent nodes. There is, therefore, no need to use indexes to efficiently navigate the graph, because each node “points” to all adjacent nodes.
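
The following Java fragment is a purely illustrative sketch of the idea (it is not Neo4j's actual storage implementation): each node object holds direct references to its relationships, so finding adjacent nodes is pointer chasing rather than an index lookup.

// Illustrative sketch of index-free adjacency; not Neo4j's storage format.
import java.util.ArrayList;
import java.util.List;

class Node {
    final String name;
    final List<Relationship> outgoing = new ArrayList<>();

    Node(String name) { this.name = name; }
}

class Relationship {
    final String type;
    final Node target;

    Relationship(String type, Node target) {
        this.type = type;
        this.target = target;
    }
}

public class IndexFreeAdjacency {
    public static void main(String[] args) {
        Node keanu = new Node("Keanu Reeves");
        Node theMatrix = new Node("The Matrix");
        keanu.outgoing.add(new Relationship("ACTED_IN", theMatrix));

        // Traversal follows the stored references directly; no index is consulted.
        for (Relationship rel : keanu.outgoing) {
            System.out.println(keanu.name + " -[" + rel.type + "]-> " + rel.target.name);
        }
    }
}

A relational implementation would instead look up the adjacent row's key in an index for every hop, which is where the per-level overhead described earlier comes from.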

Not surprisingly, native graph databases massively outperform alternative implementations for graph traversal operations. However, the architecture presents challenges for distributing the graph across multiple servers. While a node can, of course, point to an adjacent node located on another machine, the overhead of routing the traversal across multiple machines eliminates the advantages of the graph database model, because inter-server communication is far more time-consuming than local access. For this reason, many pure graph databases, such as Neo4j, do not currently support a distributed deployment—the graph database can run on a single server only.

Graph Compute Engines

Graph-only databases excel at graph processing, of course, but they are rarely optimal for all the workloads in a typical application. Conversely, many applications—even if their predominant workloads are satisfied by a nongraph database technology—sometimes need to be able to perform efficient graph traversals.

Furthermore, as we discussed previously, the pure graph model has difficulty coping with massive distributed datasets, because it is inefficient to perform graph-based traversals across multiple physical machines.

Graph compute engines provide a solution for databases that want to merge graph processing with other data models and for graph processing that needs to work across massive distributed datasets. A graph compute engine implements efficient graph processing algorithms and exposes graph APIs, but it does not assume or require that the data be stored in the index-free adjacency format we discussed previously; the underlying datasets may be held in a relational database, a NoSQL system, or Hadoop. Graph compute engines don’t always offer the efficient real-time graph traversal of pure graph databases; instead, they tend to excel at batch operations that process all or large parts of the graph.

Significant graph compute engines include:

  • Apache Giraph: A graph processing system designed to run over Hadoop data using MapReduce.
  • GraphX: A graph processing system that is part of the Berkeley Data Analytics Stack (BDAS), which includes Spark (see Chapter 7). It uses Spark as the foundation for graph processing just as Giraph uses MapReduce.
  • Titan: A graph database that can be layered on top of Big Data storage engines, including HBase and Cassandra. DataStax—the commercial vendor of Cassandra—acquired Aurelius, the commercial sponsor of Titan, in 2015.

Conclusion

Graph databases occupy a specific and important niche in database technology. Unlike advocates of key-value stores and document databases, the proponents of graph databases do not claim to be on a mission to replace the venerable RDBMS; rather, they correctly describe graph databases as an important alternative when the analysis of relationships between objects is of as much significance as the objects themselves.

The logical model of a graph database is conceptually quite simple. Nodes (or “vertices”) are connected to each other by relationships (or “edges”). Both nodes and relationships may have properties—sets of name : value pairs.

Languages such as SPARQL, Cypher, and Gremlin are optimized for graph traversal operations, offering a syntax that allows us to walk the graph without requiring the recursive joins necessitated by relational databases.

Pure graph databases such as Neo4j store data in a format that is optimized for real-time graph processing, but that is not always suitable for other purposes. These databases store the data in a physical format that matches the logical relationships of the graph.

Graph compute engines perform graph processing over data held in other formats, often in Hadoop but potentially in a relational format. A graph compute engine is typically optimized to perform batch processing over an entire graph, while a native graph database is optimized toward real-time graph processing.

Modifying the physical organization of data to suit a particular processing workload is not unique to graph databases. As we will see in the next chapter, columnar databases turn the traditional physical organization of database storage on its side so as to optimize analytic workloads.
