Chapter 2. Technical Introduction

If you are familiar with Docker and Containers, you can ignore the first section of this chapter. I won’t go into too much detail about Docker since there are plenty of resources out there. That said, I will cover enough of the basics so that you can get the examples up and running with minimal additional configuration on your part.

The second part of this chapter provides a high level, conceptual overview of the various types of databases that we will cover in later chapters. I start with a discussion of relational databses (often referred to as SQL databases) which use the Structured Query Language (SQL). This is the gold standard in databases and have been commonly used for decades.

This book generally assumes that you have some proficiency with SQL though there may be many of you who are coming from the healthcare side of the world and may not have a lot of experience with databases. The overview below is not intended to provide a deep dive into SQL databases. Instead, it sets the stage, providing a basic introduction to vocabulary concepts to which we can compare and contrast graph databases.

Basic Introduction to Docker and Containers

If you have ever tried to install server-side software such as a web server, application server, or database server, you know how painful it can be to make sure that your environment aligns with the documentation. Your directory structure might be slightly different; your operating system might be a different version; or you have a different version of Java installed.

One solution to this is to use virtual machines and something like VMWare or Parallels. However, these require that you still install a guest operating system and incurs signficant overhead, particularly RAM and disk space. The moderns solution to this problem is to use containers.1

Containers are a form of virtualization that focus on the application layer. In other words, they allow us to deploy software in a particular environment (called a container) where we control the version of the software and all of its dependencies, just as we would in a virtual machine. However, unlike a traditional virtual machine, we do not virtualize the operating system, saving RAM and disk space in the process.

Of course, some magic needs to happen in order to allow for disk access, networking, security, etc. This is where projects such as [Docker](https://www.docker.com/) come in. There are other container engines suck as [rkt](https://coreos.com/rkt/) and [LXC/LXD](https://linuxcontainers.org/). We focus on Docker given its widespread adoption and support for Mac OS, MS Windows, and Linux operating systems.

Additionally, containers also allow application developers, data scientists, and others at the top of the stack to create repeatable and reproducible deployments regardless of the underlying operating system. This becomes especially important where there are a mix of Mac, Windows, and Linux workstations and servers.

Docker is has many options that are beyond the scope of this book. For additional information and to learn more about Docker, checkout Docker’s [educational resources](https://docs.docker.com/get-started/resources/).

Installing and Testing Docker

If you do not have Docker installed, please follow the appropriate instructions here. The rest of this chapter assumes that you have Docker installed and running properly. You should be able to run docker run hello-world and get something similar to the following output (specifically, the “Hello from Docker!” and subsequent output):

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
0e03bdcc26d7: Pull complete
Digest: sha256:d58e752213a51785838f9eed2b7a498ffa1cb3aa7f946dda11af39286c3db9a9
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Conceptual Introduction to Databases

There are many aspects to choosing what type of database to use for a particular project. Typically, you would make your choice based on write requirements or read requirements. Write requirements revolve around the data collection and what the system needs to be able to handle so that data is not lost. On the other hand, read requirements consider the needs of the system as you query the data from the database.

ACID Compliance

A common set of properties of databases (from the perspective of writing to databases) are referred to as ACID. ACID compliance ensures that a database will perform write operations with certain guarantees:

Atomicity is the idea that writes to a database are treated as a single, atomic, operation such that all components are successful or none are applied to the database. The classic example is a bank database that supports transfers of funds from one account to another. Say Joe wants to transfer $10 to Susie — the database must debit $10 from Joe’s account and then credit $10 to Susie’s account. What happens if there is a power outage or other glitch in the system after the money is debited from Joe’s account and before the money is credited to Susie? The money would be essentially lost forever. If a database supports atomicity, both the debit and credit must succeed; otherwise, the debit is rolled back as if nothing happened. A group of writes that must execute together are grouped into a transaction.

Consistency ensures that the database is always in a valid state. In other words, as software systems interact with the database, queries will either execute successfully, or not at all. This may sound similar to atomicity but is more basic. The idea of consistency applies regardless of the number of queries within a transaction, and focuses on whether or not a write is successful. To continue the example above, what if Joe simply wanted to withdraw money from his own account? There is really only a single step to the transaction (the debit of $10 from Joe’s account) so atomicity doesn’t apply. However, if the power outage hits right in the middle of logging Joe’s debit, the database needs to ensure that Joe’s account, and the database as a whole, are not corrupted as a result. In a sense, consistency is a building block of atomicity — the database uses consistency to ensure that each step of a transaction is successful and, if all steps are successful, the transaction is completed, thus providing atomicity.

Isolation ensures that individual transactions executed against a database are isolated from one another. In other words, whether a transaction consists of a single query or multiple queries, the success or failure of a particular transaction does not affect any other transactions. Isolation is mainly a consideration of concurrent systems, systems that may run particular operations in parallel. So, if two transactions are sent to the database server at the same exact moment, the failure of one should not affect the database.

Durability ensures that, once a transaction is completed and written to the databse, the completion of the transacation will not be lost due to a system failure.

As you are probably wondering, these all sound very similar and overlapping so they the need to discuss each of them individually? From the perspective of a data engineer or data scientist, we may not care too much about the nitty gritty details. They are, however, extremely important to database engineers, adminsitrators, and system/network administrators. Durability ensures that a successful transaction will never be lost; consistency ensures that successful transactions cannot corruopt the database; isolation ensures that one transaction’s success of failure will not affect others; atomicity ensures that all parts of a transaction must succeed for the transaction to succeed. The underlying engineering to make these guarnatees becomes especially complex when considering distributed databases that live across multiple servers, data centers, or cloud providers.

OLTP Systems

ACID compliance is typically discussed in the context of OLTP systems — online transactional processing. The majority of database usage was focused around OLTP systems that were designed to facilitate business transactions. Whether it’s a bank logging money transfers between accounts or a medical center trying to send medication requests from the ICU to the pharmacy, there is a need for database systems to log the request, push it to other systems, and update the database accordingly.

OLTP systems are designed around reliablya and quickly reading and writing of small chunks of data, especially as the database grows in size. Whether a medical center has 100 patients or 1 million patients, the electronic health record (as an OLTP system) should be allow physicians to update or look up a single patient’s clinical record within seconds. The New York Stock Exchange should be able to log a single stock trade within milliseconds. There is rarely a need to quickly look up clinical records of all patients or to pull all of the stock trades for the entire stock exchange. These differing requirements allow engineers to design the system accordingly. However, with the focus on small, discrete chunks of data, OLTP systems perform quite poorly when trying to scan the entire database for specific data.

On the other hand, data scientists often want to scan the entire database for a specific set of fields. For example, say we want to run an analysis of adverse reactions to medications that result in anaphylaxis (a life-threatening, systemic allergic reaction). We want to pull known allergies, prescribed medications, and hospital admissions for all patients in our electronic health record. If you are at an organization as large as Kaiser Permanente, you would be running this query against 12.5 million people. Running such a query against an OLTP system would likely take a very long time to return the results, and could even take down the system similar to a denial-of-service (DoS) attack. Either way, your database administrators would likely not be very happy with you!

OLAP Systems

OLAP (online analytical processing) systems were introduced to handle analytics workloads similar to the previous example. These databases may not perform well for transactional use cases but excel at large scans across a subset of the data. You will likely hear discussions of OLAP systems in the context of data warehouses, data marts, and other centralized data storage systems designed around aggregating data for the purpose of analytics.

OLAP systems are built around the read requirements — those considerations on how we want to query and extract the data. Unlike with OLTP systems, data architects will consider the anticipated analytical workloads and then design the system accordingly. To continue our EHR-based example, will we want to routinely scan all patients or a subset of patients? Or, will we scan each encounter a patient has (which could be orders of magnitude higher than the number of patients)?

Large scans of a single column/field of data is a common use case and one of the reasons why column-oriented databases have become so popular. We could model the databse such that each row is a patient and each column reflects a particular field (e.g., medications, lab values, vital signs, etc.). In a columnar database, asking for “all medications across all patients” becomes a very efficient query.

The topic of OLAP systems and the underlying database options, such as SQL, column-oriented, document-oriented, or graph databases is a very complex one. This book focuses specifically on graph databases given their ability to model many of the complexities of healthcare data. However, they are not necessarily the best or only solution for healthcare data. Ultimately, the specific type of database depends on a multitude of factors, with semantic representation being only one factor.

SQL vs. NoSQL

When discussing OLAP systems, people often look to NoSQL solutions (compared to SQL systems when considering OLTP systems). However, both SQL and NoSQL databases can be used for either OLAP or OLTP systems. The underlying decision for which type of database to use is a complex topic and beyond the scope of this book. As data engineers and data scientists, we often get involved with projects well after the databases have been deployed.

The key thing to remember is that SQL databases are a fairly homogeneous group of databases. While therea are certainly differences from one vendor to the next, they are all conceptually the same. There are tables of rows and columns, and they are connected via primary-foreign key relationships.

NoSQL databses, on the other hand, are an extremely hetereogeneous group of databases. By definition, they are all databases that do not follow the relational paradigm. As a result, they are all conceptually very different. The most common are column-oriented databases (e.g., Cassandra), key-value stores (e.g., Redis), document-oriented databases (e.g., MongoDB), and graph database (e.g., Neo4j). Within each class, there are many options and vendors, each with a particular focus. Also, many database vendors are beginning to support multiple paradigms. For example, ArangoDB is a document-oriented database that also provides graph functionality; RedisGraph is a graph add-on for the Redis key-value store; Datastax also provides a graph database on top of the Cassandra column-oriented database. This is not intended to be an exhaustive list but simply to illustrate that the NoSQL landscape is wide and diverse with the only common characteristic of simply being something other than relational databases.

As a result, the query languages and underlying philosophies differ from database to database. For example, MongoDB and CouchBase are both document-oriented databases but have entirely different approaches to querying the data. MongoDB uses a Javascript-based query language2 whereas CouchBase provides a query language3 that is inspired by SQL itself.

On the graph database front, there are several different philosophical approaches and the remainder of this chapter will cover three particular approaches.

SQL Databases

This section provides a very brief overview/review of SQL databases. It is intended for those who may have learned about databases years ago and need a quick review or those who may have heard of SQL databases but may lack a little more detail. The main intent for this section is to baseline the vocabulary and basic concepts. I will refer back to these concepts as we review the different types of graph databases.

As I mentioned earlier, SQL databases are the gold-standard when it comes to database technology. While they are often referred to as SQL databases, they are also often referred to as RDBMS’s (relational database management systems) or relational databases.

It is a little confusing but relations in a relational database refer to the tables themselves, not the relationships between tables. That said, the power in a relational database is the ability to connect tables to one another and then query these connections. Figure 2-1 highlights how a patient might be connected to medications and diagnoses in a relational database.

As you can see, the Patients table is intended to track individual patients, each with a name (Patient_name) and a link to Demographic_Info (not shown). Additionally, there are two other tables, Medication and Diagnoses that each have a link back to Patients. The PK and FK that you see correspond to primary key and foreign key.

Relational Database Example
Figure 2-1. Relational Database Example

These keys are one of the fundamental building blocks of relational databases and are what allow tables to be linked together. A primary key (PK) are used to uniquely identify rows within a particular table. Sometimes, these keys represent attributes in real life (e.g., medical record number, social security number, insurance subscriber ID); other times, they are entirely artificial and do not have any meaning or semantic relevance to the underlying data. The latter are often sequentially generated integers (as in the example in Figure 2-1).

Foreign keys are are used in tables to refer to rows in another table. In the example above, the PK of the Patients table is Patient_id, presumably a patient identifier assigned by the clinic or hospital. Each patient is likely to be on zero or more medications. So, the database architect decided to create a second table to store medications. In order to link these medications to particular patients, they have setup a primary-foreign key relationship between the Patients and Medication tables. There is a column in the Medication table that will always store Patient_id. In this particular setup, a single patient can be linked to multiple entries in the Medication table since any number of rows in Medication can refer to the same patient. On the other hand, each medication entry can only refer to a single patient. This is what we call a one-to-many relationship between Patients and Medication.

When linking two tables such as Patients and Medication, you will typically see one-to-one, one-to-many, or many-to-one relationships. In one-to-one situations, a single row in one table will correspond to a single row in another table. In one-to-many situations, a single row in a table will link to multiple rows in another table. Many-to-one is the same except when looking from the perspective of the other table.

You can model many things using relationships above. But, what about more complex systems where you need many-to-many relationships? To build out the example above, what if we don’t want to track multiple entries of “Acetaminophen (Tylenol)” within our system? In other words, regardless of how many patients may have been prescribed acetaminophen, we want to track only a single instance of it. This would clean up our database and prevent thousands of duplicate rows where we might need to rename the medication or change some other attribute of it. Figure 2-2 is an example of one way to link patients to medications such that we have a many-to-many relationship. A single patient can be linked to any number of medications and a particular medication can be linked to any number of patients.

Basic Join Table Example
Figure 2-2. Basic Join Table Example

This is what we call a join table. As it stands, the join table addresses the basic requirement of having a single entry for a particular medication name so we don’t have multiple entries for “Acetaminophen (Tylenol).” However, you may have noticed that I labled the join table Medication_Order. Figure 2-3 expands on the basic join example with several more columns for the Medication_Order table such as order_date and ordered_by.

Medication Order Join Table Example
Figure 2-3. Expanded Medication Order Join Table Example

The ordered_by column is a foreign key relation and points back to another table (not shown) of physicians and nurse practitioners who can place medication orders. The column order_date is when the medication order was actually entered into the system. This highlights the power of join tables and how they can be used to model complex systems such as the ordering or adminsitration of medications.

For the latter, we may create another table called Medication_Administration that has a one-to-one relationship with Medication_Order. You may be wondering, why create something with a one-to-one relationship when we can just add additional columns? In this case, the two tables capture very different parts of the workflow so it makes sense to also separate them in the database. This could make it easier when trying to load these day into a data warehouse but also simplifies the engineering and database administration since updates to the medication ordering vs. administration processes can be tested independently.

To continue the example around patients, medications, and medication orders, let’s see how they might look from the perspective of SQL queries:

Example 2-1. Medication Orders in SQL
# Get the medication ID for acetaminophen
SELECT Medication_id FROM Medication WHERE Medication_name = "Acetaminophen (Tylenol)";

# Get all patients on acetaminophen without a join table
SELECT P.Patient_id FROM Patients P
INNER JOIN Medication M
ON P.Patient_id = M.Patient_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";

# Get all patients on acetaminophen with a join table
SELECT Patient_id FROM Medication_Order MO
INNER JOIN Medication M
ON MO.Medication_id = M.Medication_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";
Example 2-2. Medication Orders with Patient Date of Birth in SQL
# Get patient IDs and date of birth on acetaminophen without a join table
SELECT P.Patient_id, P.DOB FROM Patients P
INNER JOIN Medication M
ON P.Patient_id = M.Patient_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";

# Get patient IDs and date of birth on acetaminophen with a join table
SELECT P.Patient_id, P.DOB FROM Patient P
INNER JOIN Medication_Order MO
ON P.Patient_id = MO.Patient_id
INNER JOIN Medication M
ON MO.Medication_id = M.Medication_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";

Example 2-1 shows some basic SQL queries corresponding to the schema in Figure 2-1 where no join table exists between patients and their medications. This is the simplest schema and the INNER JOIN is the key aspect of SQL that enables this. Joins are the mechanism by which we can query relational databases and extract data that is spread across multiple tables. If you have worked with SQL in the past, this is intended as a review. If you are knew to SQL, my main intent here is to give you a high level overview and help you develop a basic understanding of how to query relational databases.

Example 2-2 expands on the SQL queries and assumes the schema that is provided in Figure 2-3. Here, you can see the query becomes a little bit more complex since we need to join across three different tables. The last query in Example 2-2 also highlights how the queries can get more complex when trying to query additional properties such as a patient’s date of birth.

These examples and the idea of join tables in particular are critical to thinking about graph-based representations of the data, and graph databases.

Labeled Property Graph (LPG) Databases

Labeled property graph databases, often referred to as just property graph databases are probably one of the most common graph databases in use today. They are certainly not the only type of graph database and not even the first. However, companies such as Neo4j4 have popularized them in recent years. At their core, property graphs are essentially a collection of nodes (sometimes referred to as vertices) and relationships (sometimes referred to as edges) as shown in Figure 2-4.

Property Graph Nodes and Relationships
Figure 2-4. Property Graph Nodes and Relationships

Relationships are directional and always point from one node to another. However, graph query languages usually allow you to query relationships while ignoring their directionality. In labeled property graphs, nodes also have labels which provide a higher level organization of all of the nodes that might be in the database. These labels are somewhat analogous to tables in a relational database. To continue the example from earlier, we would have a couple labels or types of nodes: Patient and Medication. Each node can also have properties that are analogous to columns of a table. One possible graph-based representation of the schema in Figure 2-3 is shown in Figure 2-5.

Medication Order Example as a Property Graph
Figure 2-5. Medication Order Example as a Property Graph

As you can see, we still have the ability to represent patients, medications, and orders of medications while tracking the relevant columns. When used this way, property graphs and relational databases have many similarities and generally provide the same functionality. You may be wondering, what’s the difference and why bother with graph database (especially if you already have a relational database deployed)?

In relational databases, the relationships focus on connecting tables together as part of the underlying schema of the database. The semantics of how two data points are connected get baked into the design and administration of the underlying database. If a data engineer or data scientist wants to add a new type of relationship between tables, they must work with the database administrators and engineers to update the database schema accordingly.

On the other hand, within a property graph database, all aspects of the data (labels, properties, and relationship types) are considered data in the sense that they can be updated on the fly and there is no need to redesign the database schema. As such, property graph databases fall under the umbrella of schemaless databases. This is a bit of a misnomer since the databases certainly have a schema. However, the schemas can be changed on the fly and is not set/managed by the database administrator. Perhaps a better description would be that they have dynamic schemas.

So, you may now be wondering, are these just schemaless versions of relational databases? What are the benefits and tradeoffs?

Much of the benefit of property graph databases may lie in their query languages, not in how they represent and store the data. At some level, this is true for almost anything in computer science. Most tools can be used to solve most problems — the specific decision to use one or another largely depends on a variety of tangential factors. In the context of labeled property graphs vs. relational databases, some key considerations are:

  1. Are the types of relationships between various concepts set in stone?

  2. How likely are new relationships going to be added? By whom?

  3. Relative importance of requirements around:

    1. ACID compliance

    2. Reead/write latency

    3. Ease of writing/maintaining queries

    4. Existing tools in your ecosystem (e.g., reporting tools such as Tableau) and expertise available in your organization

    5. Similarity to SQL? Developer friendly?

Common Query Languages and Frameworks

Two of the most common query languages for property graphs are Cypher5 (and its open sourced counterpart openCypher6) and Gremlin.7 That said, nearly every graph database out there provides its own query language. A few (e.g., RedisGraph 8) have adopted9 openCypher, and some have adopted Gremlin (e.g., JanusGraph10, AWS Neptune11, OrientDB12) while many have support via third-party libraries (e.g., Neo4j13). The challenge with third-party libraries, as evident with Neo4j support, is that they may not support the latest versions of a particular databse and support is likely to come and go depending on the needs of the underlying maintainer.

Query languages such as openCypher, Arango Query Language, GSQL (from TigerGraph) are structurally similar to SQL. This decreases the learning curve for those who are accustomed to working with SQL. On the other hand, approaches such as Gremlin are much more programmer oriented and less of a traditional query language.

Let’s look at some examples of how the SQL queries from Example 2-1 and Example 2-2 might look in graph query languages while assuming the graph schema provided in Figure 2-5.

Cypher

Example 2-3. Medication Orders in Cypher
# Get the medication ID for acetaminophen
MATCH (m:Medication {Medication_name: "Acetaminophen (Tylenol)"})
RETURN m.Medication_id

# Get all patients on acetaminophen via the has_order_for relationship
MATCH (p:Patient)-[:has_order_for]->(m:Medication {Medication_name: "Acetaminophen (Tylenol)"})
RETURN p

# Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
MATCH (p:Patient)-[:has_order_for]->(m:Medication {Medication_name: "Acetaminophen (Tylenol)"})
RETURN p.Patient_id, p.DOB

As you can see, the queries are very similar to one another and are not too different from their SQL counterparts. However, the syntax is much more compact. While it is certainly possible to use SQL to model this particular use case, the Cypher-based approach feels much simpler and easier. This simplicity is particularly highlighted by the additional query for the patient’s date of birth. In the SQL example, it required an additional INNER JOIN clause while the Cypher version simply required an additional p.DOB.

Gremlin

Example 2-4. Medication Orders in Gremlin
# Get the medication ID for acetaminophen
g.V()
 .has("Medication_name", "Acetaminophen (Tylenol)")
 .property("Medication_id")

# Get all patients on acetaminophen via the has_order_for relationship
g.V()
 .has("Medication_name", "Acetaminophen (Tylenol)")
 .in("has_order_for")

# Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
 g.V()
  .has("Medication_name", "Acetaminophen (Tylenol)")
  .in("has_order_for")
  .values("Patient_id", "DOB")

Again, we can see that the traversal of relationships within the graph add a minimal amount of syntax to the overall query.

ArangoDB Query Language

Now, let’s take a look at one other graph query language that we haven’t talked about much yet, the ArangoDB Query Language (AQL)14 since we will be diving a bit deeper into ArangoDB in the remainder of this book.

Example 2-5. Medication Orders in ArangoDB Query Language
# Get the medication ID for acetaminophen
FOR m IN Medication
  FILTER m.Medication_name == "Acetaminophen (Tylenol)"
  RETURN m.Medication_id

# Get all patients on acetaminophen via the has_order_for relationship
FOR m IN Medication
  FILTER m.Medication_name == "Acetaminophen (Tylenol)"
  FOR p in 1..1 INBOUND m has_order_for
    RETURN p

# Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
FOR m IN Medication
  FILTER m.Medication_name == "Acetaminophen (Tylenol)"
  FOR p in 1..1 INBOUND m has_order_for
    RETURN {id: p.Patient_id, dob: p.DOB}

AQL feels a bit like the combination of a query language as well as a traversal language. That said, we can see that the queries increase in complexity similar to Cypher and Gremlin.

Resource Description Framework (RDF) Databases

RDF graphs15 have been around for a long time and are part of the W3C (World Wide Web Consortium)16 and are part of the Semantic Web.17 RDF graphs are extremely flexible and quite powerful given the tradeoff that they have a fairly steep learning curve and, as a result, often carry the stigma that they are “too academic.”

You may often hear RDF graph databases referred to as triple stores. This comes from the idea that RDF graphs are a collection of triples of the form subject-predicate-object (referred to as an RDF triple), as shown in Figure 2-6.

RDF Triple
Figure 2-6. RDF Triple

While Figure 2-6 shows a diagram very similar to what we saw with property graphs, they are a bit more abstract in the sense that everything in an RDF graph is stored as a triple. In other words, what would be a column in a relational database or a property in a property graph database is itself an RDF triple. Figure 2-7 shows how a property graph would store and represent a particular patient with an ID of 1234 and a date of birth of 2020-01-01.

Property vs. RDF Graph Patient Example
Figure 2-7. Property vs. RDF Graph Patient Example

In the RDF represetation of a patient, the circle is the subject in three different triples — one each for the birth date, patient ID, and the type of the node. The predicate provides the context of how the object should be interpreted. In this case, the predicate is what tells us that 2020-01-01 is actually a birth date and 1234 is a patient identifier.

The power behind this approach is that nearly anything can be modeled using this basic schema. You may have also noticed that a particular notation is used throughout the example, consisting of two strings separated by a colon. This is an adaptation of XML namespacing where a namespace can be defined that corresponds to the base of a URI, to which an identifier can be appended. The key point here is that RDF relies heavily on URI’s (uniform resource identifier). Because URI’s for a particular semantic context often have the same base, the use of namespaces allows us to simplify the representation, making it more compact and easier to read.

Because RDF was created as a standard to represent data and information on the web, URI’s are baked into the underlying standard. The idea is that any particular idea or notion is backed by a clearly defined concept that is accessible at the specified URI. This allows anything represented as RDF to be self-describing. This is particularly important in the context of the semantic web because anyone should be able to publish and link information to anything else on the web. Unlike with most database systems (where information models are designed to support very specific use cases), RDF was designed to support open information models.

In Figure 2-7, the schema namespace (used in schema:birthData) is used to shorten the URI https://schema.org/birthDate. The base https://schema.org/ is shortened to schema so that we don’t need to list the entire URI. Similarly, the ex namespace is a completely fictional namespace made up for this example. It could easily refer to something like https://mydomain.com/example/ or anything else. The one requirement is that the URI be resolvable and provides specific details of the underlying resource. As you may notice, https is used in the examples above — there is a general discussion18 over the use of http vs. https though that is generally beyond the scope of this book. The key point to remember is that resources are identified by unique URI’s.

SPARQL Protocol and RDF Query Language (SPARQL)

SPARQL (pronounced “sparkle”) is the underlying query language for RDF graphs. As part of the semantic web standard, it is fully open and implemented by many different vendors. Of course, RDF triple stores implement SPARQL but many property graphs also have SPARQL implementations as well. For example there is a SPARQL-gremlin connector19 such that SPARQL could be run against any database supporting gremlin.

SPARQL looks very similar to SQL in many ways but integrates concepts such as namespaces and URI’s. One thing you will notice in Example 2-6 is the addition of PREFIX — this allows query authors to shorten their queries by taking advantage of namespacing. Otherwise, the query should be very similar to SQL. However, you will likely notice that the WHERE clause is a bit more verbose and reflect the subject-predicate-object nature of RDF. Every clause in the WHERE block is itself a triple.

Example 2-6. Medication Orders with Patient Date of Birth in SQL
# Get the medication ID for acetaminophen
PREFIX schema: <http://schema.org/>
PREFIX ex: <http://mydomain.com/some_example/>

SELECT ?medication_id
WHERE {
  ?medication ex:medication_name "Acetaminophen (Tylenol)" .
  ?medication ex:medication_id ?medication_id .
}

# Get all patients on acetaminophen via the has_order_for relationship

SELECT ?patient_id
WHERE {
  ?medication ex:medication_name "Acetaminophen (Tylenol)" .
  ?patient ex:has_order_for ?medication .
  ?patient ex:patient_id ?patient_id .
}

# Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
SELECT ?patient_id ?dob
WHERE {
  ?medication ex:medication_name "Acetaminophen (Tylenol)" .
  ?patient ex:has_order_for ?medication .
  ?patient ex:patient_id ?patient_id .
  ?patient schema:birthData ?dob .
}

This syntax may take you a bit of getting used to but is extremely flexible and powerful. The RDF approach allows for a lot of flexibility in representing different data and information models. However, this flexibility comes with a common tradeoff — performance. Because the data are stored as a series of triples, queries of large graphs can incur significant latency. Depending on your use case, this may be acceptable or the flexibility is worth the cost of servers with significant CPU and RAM resources.

Hypergraph Databases

Hypergraphs are not nearly as common when considering the landscape of graph databases. When someone mentions “graph database,” nearly everyone will assume you are talking about a property graph or RDF database. In a property graph, an edge connects exactly two vertices (or, a relationship connects two nodes). A hypergraph, on the other hand, contains hyperedges which can connect any number of vertices. In Figure 2-8, I compare a property graph representation and a hypergraph representation of the same information.

Property Graph vs. Hypergraph
Figure 2-8. Property Graph vs. Hypergraph

As you can see, node A is connected to nodes B and C via the relationship some_edge. In a property graph database, two separate relationships are stored, A->B and A->C. In the hypergraph representation, a single edge is stored but that edge connects A to both B and C. Additionally, the hypergraph database I will be covering in this book also treats hyperedges as a vertex itself. This means that an edge can itself be part of another edge.

You may be scratching your head a bit and wondering why would we want to treat an edge as a vertex and wouldn’t that confuse things? Yes, this can be a difficult concept to wrap your head around and may feel very “meta.” In the example we have been using around patients and medication orders, the hypergraph idea does not provide any benefit. After all, our relationship is only connecting two nodes anyways (a patient and a medication via the has_order_for relationship). The hypergraph representation is exactly the same. But, what about the situation for combination therapies where multiple medications are ordered for a single patient? Figure 2-9 illustrates an example where a patient may be on multiple medications, A, B, and C.

Property Graph vs. Hypergraph with Multiple Medications
Figure 2-9. Property Graph vs. Hypergraph with Multiple Medications

The additional context, however, is that medications A and B must be taken together as a combination therapy — the patient should not take one without the other. Additionally, the patient is also on medication C. In the property graph representation, it is unclear if the patient just happens to be on three separate medications at the same time or if A and B are ordered together. In the hypergraph represeation, this is made very clear since A and B are connected via the same has_order_for relationship. Of course, we can solve this by adding additional properties or nodes to the property graph representation but we are now changing the representation of the data to account for shortcomings in the underlying graph representation.

TypeDB and TypeQL

When I first started this book, the hypergraph database I intended to cover was Grakn from Grakn Labs. However, since then, they have rebranded the database to TypeDB from Vaticle. As such, they are no longer calling themselves a hypergraph database but a strongly-typed database. That said, the underlying functionality that they provide still allows us experiment with hypergraph structures so I will still use their database when we cover hypergraphs, even if they are no longer really considering themselves a graph database.

Example 2-7. Medication Orders in TypeQL
# Get the medication ID for acetaminophen
match
$medication isa Medication, has Medication_name "Acetaminophen (Tylenol)", has Medication_id $id;
get $id;

# Get all patients on acetaminophen via the has_order_for relationship
match
$medication isa Medication, has Medication_name "Acetaminophen (Tylenol)";
$patient isa Patient;
(patient: $patient, medication: $medication) isa order;
get $patient

# Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
match
$medication isa Medication, has Medication_name "Acetaminophen (Tylenol)";
$patient isa Patient, has dob $dob, has Patient_id $id;
(patient: $patient, medication: $medication) isa order;
get $dob, $id;

As you can see, TypeQL has a structure very similar to the other query languages and makes it pretty clear and easy to query across relationships or properties.

Conclusion

The intent of this chapter was not to make you an expert in the different types of graph databases. My goal was simply to give you a high level overview of each and how their query languages compare to one another. I will dive much more deeply into each of these types of databases throughout the remainder of this book, particularly in the context of analyzing electronic health record and claims data.

The choice of which database to use depends on many factors and generally beyond the scope of this book. As with any data project, the decision rests between a set of functional and non-functional requirements. Of course, the query language and underlying representation of the data are important considerations. But, as with nearly any software project, you can use any database to accomplish the task at hand — it may just require additional engineering on your part.

When looking at graph databases specifically, you may hear the term “graph-native” used quite a bit. Given the growing popularity of graph-based representations of data, many database vendors are starting to support graphs. Whether the database was originally a SQL database, document store, or even key-value store, you can find one that also supports graphs in some capacity. While any database can be used to represent a graph, graph-native refers to how the underlying data are stored. Particularly, graph-native databases provide index-free adjacency. That is, nodes that are connected by relationships are stored such that the retrieval of related nodes/edges does not require additional indexing.

This may or may not be a concern for you. After all, does it matter if the underlying database requires additional indexing to provide graph functionality? For small graphs, this may not matter. For large graphs, the indexes themselves could become very large and non-performant. This is one example of the types of additional considerations when considering graph databases.

1 Docker: Containers. https://www.docker.com/resources/what-container

2 MongDB CRUD operations. https://docs.mongodb.com/manual/crud/

3 CouchBase N1QL. https://docs.couchbase.com/server/current/n1ql/query.html

4 Neo4j. https://neo4j.com

5 Cypher. https://neo4j.com/developer/cypher/

6 openCypher. https://opencypher.org/

7 Gremlin. https://tinkerpop.apache.org/gremlin.html

8 RedisGraph. https://oss.redislabs.com/redisgraph/

9 RedisGraph Cypher Support. https://oss.redislabs.com/redisgraph/cypher_support/

10 JanusGraph. https://janusgraph.org/

11 AWS Neptune Gremlin. https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin.html

12 OrientDB Gremlin API. https://orientdb.com/docs/2.2.x/Gremlin.html

13 Neo4j Gremlin Support. https://github.com/neo4j-contrib/neo4j-tinkerpop-api-impl

14 ArangoDB Query Language. https://www.arangodb.com/docs/stable/aql/

15 RDF 1.1. https://www.w3.org/TR/rdf11-concepts/

16 W3C. https://www.w3.org/Consortium/

17 Semantic Web. https://www.w3.org/standards/semanticweb/

18 HTTPS and the Semantic Web/Linked Data. https://www.w3.org/blog/2016/05/https-and-the-semantic-weblinked-data/

19 SPARQL-gremlin. https://tinkerpop.apache.org/docs/current/reference/#sparql-gremlin

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset