If you are already familiar with Docker and containers, you can skip the first section of this chapter. I won’t go into too much detail about Docker since there are plenty of resources out there. That said, I will cover enough of the basics so that you can get the examples up and running with minimal additional configuration on your part.
The second part of this chapter provides a high-level, conceptual overview of the various types of databases that we will cover in later chapters. I start with a discussion of relational databases (often referred to as SQL databases), which use the Structured Query Language (SQL). These are the gold standard in databases and have been in common use for decades.
This book generally assumes that you have some proficiency with SQL, though many of you may be coming from the healthcare side of the world and may not have much experience with databases. The overview below is not intended to provide a deep dive into SQL databases. Instead, it sets the stage, providing a basic introduction to the vocabulary and concepts against which we can compare and contrast graph databases.
If you have ever tried to install server-side software such as a web server, application server, or database server, you know how painful it can be to make sure that your environment aligns with the documentation. Your directory structure might be slightly different, your operating system might be a different version, or you might have a different version of Java installed.
One solution to this is to use virtual machines and something like VMware or Parallels. However, these still require that you install a guest operating system and incur significant overhead, particularly in RAM and disk space. The modern solution to this problem is to use containers.1
Containers are a form of virtualization that focuses on the application layer. In other words, they allow us to deploy software in a particular environment (called a container) where we control the version of the software and all of its dependencies, just as we would in a virtual machine. However, unlike a traditional virtual machine, we do not virtualize the operating system, saving RAM and disk space in the process.
Of course, some magic needs to happen in order to allow for disk access, networking, security, etc. This is where projects such as [Docker](https://www.docker.com/) come in. There are other container engines such as [rkt](https://coreos.com/rkt/) and [LXC/LXD](https://linuxcontainers.org/). We focus on Docker given its widespread adoption and its support for macOS, Windows, and Linux.
Additionally, containers allow application developers, data scientists, and others at the top of the stack to create repeatable and reproducible deployments regardless of the underlying operating system. This becomes especially important when there is a mix of Mac, Windows, and Linux workstations and servers.
Docker has many options that are beyond the scope of this book. For additional information and to learn more about Docker, check out Docker’s [educational resources](https://docs.docker.com/get-started/resources/).
If you do not have Docker installed, please follow the appropriate installation instructions in Docker’s documentation. The rest of this chapter assumes that you have Docker installed and running properly. You should be able to run `docker run hello-world` and get something similar to the following output (specifically, the “Hello from Docker!” and subsequent output):
```
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
0e03bdcc26d7: Pull complete
Digest: sha256:d58e752213a51785838f9eed2b7a498ffa1cb3aa7f946dda11af39286c3db9a9
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
```
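Later chapters use Docker to run the databases we discuss. As a preview, the snippet below is a minimal sketch of pulling and running ArangoDB from its public Docker Hub image. The container name and password are placeholders I made up for this example, and image options can change over time, so check the current ArangoDB documentation before relying on them.

```bash
# Run ArangoDB in the background, publishing its default port (8529) on the host.
# The container name and ARANGO_ROOT_PASSWORD value are placeholders.
docker run -d --name arangodb-example \
  -e ARANGO_ROOT_PASSWORD=changeme \
  -p 8529:8529 \
  arangodb/arangodb
```

Once the container is running, `docker ps` lists it and `docker stop arangodb-example` shuts it down.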
There are many aspects to choosing what type of database to use for a particular project. Typically, you would make your choice based on write requirements or read requirements. Write requirements revolve around the data collection and what the system needs to be able to handle so that data is not lost. On the other hand, read requirements consider the needs of the system as you query the data from the database.
A common set of properties of databases (from the perspective of writing to databases) is referred to as ACID. ACID compliance ensures that a database will perform write operations with certain guarantees:
Atomicity is the idea that writes to a database are treated as a single, atomic operation such that either all components succeed or none are applied to the database. The classic example is a bank database that supports transfers of funds from one account to another. Say Joe wants to transfer $10 to Susie: the database must debit $10 from Joe’s account and then credit $10 to Susie’s account. What happens if there is a power outage or other glitch in the system after the money is debited from Joe’s account and before it is credited to Susie’s? The money would essentially be lost forever. If a database supports atomicity, both the debit and credit must succeed; otherwise, the debit is rolled back as if nothing happened. A group of writes that must execute together is grouped into a transaction.
Consistency ensures that the database is always in a valid state. In other words, as software systems interact with the database, queries will either execute successfully, or not at all. This may sound similar to atomicity but is more basic. The idea of consistency applies regardless of the number of queries within a transaction, and focuses on whether or not a write is successful. To continue the example above, what if Joe simply wanted to withdraw money from his own account? There is really only a single step to the transaction (the debit of $10 from Joe’s account) so atomicity doesn’t apply. However, if the power outage hits right in the middle of logging Joe’s debit, the database needs to ensure that Joe’s account, and the database as a whole, are not corrupted as a result. In a sense, consistency is a building block of atomicity — the database uses consistency to ensure that each step of a transaction is successful and, if all steps are successful, the transaction is completed, thus providing atomicity.
Isolation ensures that individual transactions executed against a database are isolated from one another. In other words, whether a transaction consists of a single query or multiple queries, the success or failure of a particular transaction does not affect any other transactions. Isolation is mainly a consideration of concurrent systems, systems that may run particular operations in parallel. So, if two transactions are sent to the database server at the same exact moment, the failure of one should not affect the database.
Durability ensures that, once a transaction is completed and written to the database, the completion of the transaction will not be lost due to a system failure.
As you are probably wondering, these all sound very similar and overlapping, so why the need to discuss each of them individually? From the perspective of a data engineer or data scientist, we may not care too much about the nitty-gritty details. They are, however, extremely important to database engineers, administrators, and system/network administrators. Durability ensures that a successful transaction will never be lost; consistency ensures that successful transactions cannot corrupt the database; isolation ensures that one transaction’s success or failure will not affect others; atomicity ensures that all parts of a transaction must succeed for the transaction to succeed. The underlying engineering to make these guarantees becomes especially complex when considering distributed databases that live across multiple servers, data centers, or cloud providers.
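To make the bank-transfer example from the atomicity discussion concrete, here is a minimal SQL sketch. The Accounts table and its columns are hypothetical, and the exact transaction syntax varies slightly between database engines.

```sql
-- Hypothetical Accounts table; both updates succeed together or not at all.
BEGIN;

UPDATE Accounts SET balance = balance - 10 WHERE account_holder = 'Joe';
UPDATE Accounts SET balance = balance + 10 WHERE account_holder = 'Susie';

-- If either update fails (or the connection drops), the database rolls the
-- whole transaction back; otherwise COMMIT makes both changes durable.
COMMIT;
```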
ACID compliance is typically discussed in the context of OLTP (online transactional processing) systems. Historically, the majority of database usage focused on OLTP systems designed to facilitate business transactions. Whether it is a bank logging money transfers between accounts or a medical center sending medication requests from the ICU to the pharmacy, there is a need for database systems to log the request, push it to other systems, and update the database accordingly.
OLTP systems are designed around reliably and quickly reading and writing small chunks of data, especially as the database grows in size. Whether a medical center has 100 patients or 1 million patients, the electronic health record (as an OLTP system) should allow physicians to update or look up a single patient’s clinical record within seconds. The New York Stock Exchange should be able to log a single stock trade within milliseconds. There is rarely a need to quickly look up the clinical records of all patients or to pull all of the stock trades for the entire exchange. These differing requirements allow engineers to design the system accordingly. However, with the focus on small, discrete chunks of data, OLTP systems perform quite poorly when trying to scan the entire database for specific data.
On the other hand, data scientists often want to scan the entire database for a specific set of fields. For example, say we want to run an analysis of adverse reactions to medications that result in anaphylaxis (a life-threatening, systemic allergic reaction). We want to pull known allergies, prescribed medications, and hospital admissions for all patients in our electronic health record. If you are at an organization as large as Kaiser Permanente, you would be running this query against 12.5 million people. Running such a query against an OLTP system would likely take a very long time to return the results, and could even take down the system similar to a denial-of-service (DoS) attack. Either way, your database administrators would likely not be very happy with you!
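For illustration only, such a whole-population scan might look something like the following. Every table and column name here is invented for this sketch and does not correspond to any schema used later in the chapter.

```sql
-- Hypothetical whole-population scan: every allergy, prescription, and
-- admission for every patient. On an OLTP system, scans and joins at this
-- scale are exactly the kind of workload that performs poorly.
SELECT p.patient_id, a.allergen, rx.medication_name, adm.admit_date
FROM patients p
LEFT JOIN allergies     a   ON a.patient_id   = p.patient_id
LEFT JOIN prescriptions rx  ON rx.patient_id  = p.patient_id
LEFT JOIN admissions    adm ON adm.patient_id = p.patient_id;
```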
OLAP (online analytical processing) systems were introduced to handle analytics workloads similar to the previous example. These databases may not perform well for transactional use cases but excel at large scans across a subset of the data. You will likely hear discussions of OLAP systems in the context of data warehouses, data marts, and other centralized data storage systems designed around aggregating data for the purpose of analytics.
OLAP systems are built around the read requirements — those considerations on how we want to query and extract the data. Unlike with OLTP systems, data architects will consider the anticipated analytical workloads and then design the system accordingly. To continue our EHR-based example, will we want to routinely scan all patients or a subset of patients? Or, will we scan each encounter a patient has (which could be orders of magnitude higher than the number of patients)?
Large scans of a single column/field of data are a common use case and one of the reasons why column-oriented databases have become so popular. We could model the database such that each row is a patient and each column reflects a particular field (e.g., medications, lab values, vital signs). In a columnar database, asking for “all medications across all patients” becomes a very efficient query.
The topic of OLAP systems and the underlying database options, such as SQL, column-oriented, document-oriented, or graph databases, is a complex one. This book focuses specifically on graph databases given their ability to model many of the complexities of healthcare data. However, they are not necessarily the best or only solution for healthcare data. Ultimately, the specific type of database depends on a multitude of factors, with semantic representation being only one of them.
When discussing OLAP systems, people often look to NoSQL solutions (compared to SQL systems when considering OLTP systems). However, both SQL and NoSQL databases can be used for either OLAP or OLTP systems. The underlying decision for which type of database to use is a complex topic and beyond the scope of this book. As data engineers and data scientists, we often get involved with projects well after the databases have been deployed.
The key thing to remember is that SQL databases are a fairly homogeneous group of databases. While there are certainly differences from one vendor to the next, they are all conceptually the same: there are tables of rows and columns, and tables are connected via primary-foreign key relationships.
NoSQL databases, on the other hand, are an extremely heterogeneous group of databases. By definition, they are databases that do not follow the relational paradigm. As a result, they are all conceptually very different. The most common are column-oriented databases (e.g., Cassandra), key-value stores (e.g., Redis), document-oriented databases (e.g., MongoDB), and graph databases (e.g., Neo4j). Within each class, there are many options and vendors, each with a particular focus. Also, many database vendors are beginning to support multiple paradigms. For example, ArangoDB is a document-oriented database that also provides graph functionality; RedisGraph is a graph add-on for the Redis key-value store; and DataStax provides a graph database on top of the Cassandra column-oriented database. This is not intended to be an exhaustive list but simply to illustrate that the NoSQL landscape is wide and diverse, with the only common characteristic being that these databases are something other than relational.
As a result, the query languages and underlying philosophies differ from database to database. For example, MongoDB and Couchbase are both document-oriented databases but have entirely different approaches to querying the data. MongoDB uses a JavaScript-based query language2 whereas Couchbase provides a query language3 that is inspired by SQL itself.
On the graph database front, there are several different philosophical approaches and the remainder of this chapter will cover three particular approaches.
This section provides a very brief overview/review of SQL databases. It is intended for those who may have learned about databases years ago and need a quick review, or those who have heard of SQL databases but could use a bit more detail. The main intent of this section is to establish a baseline of vocabulary and basic concepts. I will refer back to these concepts as we review the different types of graph databases.
As I mentioned earlier, SQL databases are the gold standard when it comes to database technology. While they are often referred to as SQL databases, they are also known as RDBMSs (relational database management systems) or simply relational databases.
It is a little confusing, but the relations in a relational database are the tables themselves, not the relationships between tables. That said, the power of a relational database is the ability to connect tables to one another and then query these connections. Figure 2-1 highlights how a patient might be connected to medications and diagnoses in a relational database.
As you can see, the Patients table is intended to track individual patients, each with a name (Patient_name) and a link to Demographic_Info (not shown). Additionally, there are two other tables, Medication and Diagnoses, that each have a link back to Patients. The PK and FK annotations that you see correspond to primary key and foreign key.
These keys are one of the fundamental building blocks of relational databases and are what allow tables to be linked together. A primary key (PK) is used to uniquely identify rows within a particular table. Sometimes, these keys represent attributes in real life (e.g., medical record number, social security number, insurance subscriber ID); other times, they are entirely artificial and do not have any meaning or semantic relevance to the underlying data. The latter are often sequentially generated integers (as in the example in Figure 2-1).
Foreign keys (FK) are used in tables to refer to rows in another table. In the example above, the PK of the Patients table is Patient_id, presumably a patient identifier assigned by the clinic or hospital. Each patient is likely to be on zero or more medications, so the database architect decided to create a second table to store medications. In order to link these medications to particular patients, they have set up a primary-foreign key relationship between the Patients and Medication tables: there is a column in the Medication table that always stores a Patient_id. In this particular setup, a single patient can be linked to multiple entries in the Medication table since any number of rows in Medication can refer to the same patient. On the other hand, each medication entry can only refer to a single patient. This is what we call a one-to-many relationship between Patients and Medication.
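As a rough sketch, the one-to-many design in Figure 2-1 might be declared with DDL along the following lines; the column types are assumptions made for illustration.

```sql
-- One-to-many: each Medication row points back to exactly one patient.
CREATE TABLE Patients (
    Patient_id   INTEGER PRIMARY KEY,   -- PK: uniquely identifies a patient
    Patient_name VARCHAR(255)
);

CREATE TABLE Medication (
    Medication_id   INTEGER PRIMARY KEY,  -- PK for the medication entry
    Medication_name VARCHAR(255),
    Patient_id      INTEGER REFERENCES Patients (Patient_id)  -- FK back to Patients
);
```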
When linking two tables such as Patients and Medication, you will typically see one-to-one, one-to-many, or many-to-one relationships. In one-to-one situations, a single row in one table corresponds to a single row in another table. In one-to-many situations, a single row in one table links to multiple rows in another table. Many-to-one is the same relationship viewed from the perspective of the other table.
You can model many things using the relationships above. But what about more complex systems where you need many-to-many relationships? To build on the example above, what if we don’t want to track multiple entries of “Acetaminophen (Tylenol)” within our system? In other words, regardless of how many patients may have been prescribed acetaminophen, we want to track only a single instance of it. This would clean up our database and prevent thousands of duplicate rows in cases where we might need to rename the medication or change some other attribute of it. Figure 2-2 is an example of one way to link patients to medications such that we have a many-to-many relationship: a single patient can be linked to any number of medications, and a particular medication can be linked to any number of patients.
This is what we call a join table. As it stands, the join table addresses the basic requirement of having a single entry for a particular medication name so we don’t have multiple entries for “Acetaminophen (Tylenol).” However, you may have noticed that I labeled the join table Medication_Order. Figure 2-3 expands on the basic join example with several more columns for the Medication_Order table, such as order_date and ordered_by.
The ordered_by column is a foreign key relation and points back to another table (not shown) of physicians and nurse practitioners who can place medication orders. The order_date column records when the medication order was actually entered into the system. This highlights the power of join tables and how they can be used to model complex processes such as the ordering or administration of medications.
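A rough DDL sketch of such a join table might look like the following; the Providers table referenced by ordered_by stands in for the “not shown” table mentioned above, and the column types are assumptions.

```sql
-- Many-to-many: Medication_Order links any patient to any medication.
CREATE TABLE Medication_Order (
    Order_id      INTEGER PRIMARY KEY,
    Patient_id    INTEGER REFERENCES Patients (Patient_id),       -- FK to the patient
    Medication_id INTEGER REFERENCES Medication (Medication_id),  -- FK to the single medication entry
    order_date    DATE,                                           -- when the order was entered
    ordered_by    INTEGER REFERENCES Providers (Provider_id)      -- FK to the ordering clinician (hypothetical table)
);
```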
For the latter, we might create another table called Medication_Administration that has a one-to-one relationship with Medication_Order. You may be wondering: why create something with a one-to-one relationship when we could just add additional columns? In this case, the two tables capture very different parts of the workflow, so it makes sense to also separate them in the database. This can make it easier to load these data into a data warehouse, and it also simplifies the engineering and database administration since updates to the medication ordering and administration processes can be tested independently.
To continue the example around patients, medications, and medication orders, let’s see how they might look from the perspective of SQL queries:
```sql
# Get the medication ID for acetaminophen
SELECT Medication_id
FROM Medication
WHERE Medication_name = "Acetaminophen (Tylenol)";

# Get all patients on acetaminophen without a join table
SELECT P.Patient_id
FROM Patients P
INNER JOIN Medication M ON P.Patient_id = M.Patient_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";

# Get all patients on acetaminophen with a join table
SELECT Patient_id
FROM Medication_Order MO
INNER JOIN Medication M ON MO.Medication_id = M.Medication_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";

# Get patient IDs and date of birth on acetaminophen without a join table
SELECT P.Patient_id, P.DOB
FROM Patients P
INNER JOIN Medication M ON P.Patient_id = M.Patient_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";

# Get patient IDs and date of birth on acetaminophen with a join table
SELECT P.Patient_id, P.DOB
FROM Patients P
INNER JOIN Medication_Order MO ON P.Patient_id = MO.Patient_id
INNER JOIN Medication M ON MO.Medication_id = M.Medication_id
WHERE M.Medication_name = "Acetaminophen (Tylenol)";
```
Example 2-1 shows some basic SQL queries corresponding to the schema in Figure 2-1, where no join table exists between patients and their medications. This is the simplest schema, and the INNER JOIN is the key aspect of SQL that enables querying it. Joins are the mechanism by which we query relational databases and extract data that is spread across multiple tables. If you have worked with SQL in the past, this is intended as a review. If you are new to SQL, my main intent here is to give you a high-level overview and help you develop a basic understanding of how to query relational databases.
Example 2-2 expands on the SQL queries and assumes the schema that is provided in Figure 2-3. Here, you can see the query becomes a little bit more complex since we need to join across three different tables. The last query in Example 2-2 also highlights how the queries can get more complex when trying to query additional properties such as a patient’s date of birth.
These examples and the idea of join tables in particular are critical to thinking about graph-based representations of the data, and graph databases.
Labeled property graph databases, often referred to as simply property graph databases, are probably the most common type of graph database in use today. They are certainly not the only type of graph database, and not even the first. However, companies such as Neo4j4 have popularized them in recent years. At their core, property graphs are essentially a collection of nodes (sometimes referred to as vertices) and relationships (sometimes referred to as edges), as shown in Figure 2-4.
Relationships are directional and always point from one node to another. However, graph query languages usually allow you to query relationships while ignoring their directionality. In labeled property graphs, nodes also have labels, which provide a higher-level organization of all of the nodes that might be in the database. These labels are somewhat analogous to tables in a relational database. To continue the example from earlier, we would have a couple of labels, or types of nodes: Patient and Medication. Each node can also have properties, which are analogous to the columns of a table. One possible graph-based representation of the schema in Figure 2-3 is shown in Figure 2-5.
As you can see, we still have the ability to represent patients, medications, and orders of medications while tracking the relevant columns. When used this way, property graphs and relational databases have many similarities and generally provide the same functionality. You may be wondering: what’s the difference, and why bother with a graph database (especially if you already have a relational database deployed)?
In relational databases, the relationships focus on connecting tables together as part of the underlying schema of the database. The semantics of how two data points are connected get baked into the design and administration of the underlying database. If a data engineer or data scientist wants to add a new type of relationship between tables, they must work with the database administrators and engineers to update the database schema accordingly.
On the other hand, within a property graph database, all aspects of the data (labels, properties, and relationship types) are considered data in the sense that they can be updated on the fly, and there is no need to redesign the database schema. As such, property graph databases fall under the umbrella of schemaless databases. This is a bit of a misnomer since these databases certainly have a schema. However, the schema can be changed on the fly and is not set or managed by the database administrator. Perhaps a better description would be that they have dynamic schemas.
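For example, here is a minimal Cypher sketch using the labels, properties, and relationship type from Figure 2-5. The property values, and the choice to place order_date on the relationship, are assumptions made for this sketch; the point is that nothing needs to be declared ahead of time before these nodes, properties, and relationship types are written.

```cypher
// Create a patient and a medication, then connect them with a new
// relationship type, all without a prior schema migration.
CREATE (p:Patient {Patient_id: 1234, DOB: date("2020-01-01")})
CREATE (m:Medication {Medication_id: 1, Medication_name: "Acetaminophen (Tylenol)"})
CREATE (p)-[:has_order_for {order_date: date("2021-06-01")}]->(m)
```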
So, you may now be wondering, are these just schemaless versions of relational databases? What are the benefits and tradeoffs?
Much of the benefit of property graph databases may lie in their query languages, not in how they represent and store the data. At some level, this is true for almost anything in computer science. Most tools can be used to solve most problems — the specific decision to use one or another largely depends on a variety of tangential factors. In the context of labeled property graphs vs. relational databases, some key considerations are:
Are the types of relationships between various concepts set in stone?
How likely are new relationships going to be added? By whom?
Relative importance of requirements around:
ACID compliance
Read/write latency
Ease of writing/maintaining queries
Existing tools in your ecosystem (e.g., reporting tools such as Tableau) and expertise available in your organization
Similarity to SQL? Developer friendly?
Two of the most common query languages for property graphs are Cypher5 (and its open-source counterpart openCypher6) and Gremlin.7 That said, nearly every graph database out there provides its own query language. A few (e.g., RedisGraph8) have adopted9 openCypher, and some have adopted Gremlin (e.g., JanusGraph10, AWS Neptune11, OrientDB12), while many have support via third-party libraries (e.g., Neo4j13). The challenge with third-party libraries, as evident with Neo4j support, is that they may not support the latest versions of a particular database, and support is likely to come and go depending on the needs of the underlying maintainer.
Query languages such as openCypher, the ArangoDB Query Language, and GSQL (from TigerGraph) are structurally similar to SQL. This decreases the learning curve for those who are accustomed to working with SQL. On the other hand, approaches such as Gremlin are much more programmer oriented and less like a traditional query language.
Let’s look at some examples of how the SQL queries from Example 2-1 and Example 2-2 might look in graph query languages while assuming the graph schema provided in Figure 2-5.
```cypher
// Get the medication ID for acetaminophen
MATCH (m:Medication {Medication_name: "Acetaminophen (Tylenol)"})
RETURN m.Medication_id

// Get all patients on acetaminophen via the has_order_for relationship
MATCH (p:Patient)-[:has_order_for]->(m:Medication {Medication_name: "Acetaminophen (Tylenol)"})
RETURN p

// Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
MATCH (p:Patient)-[:has_order_for]->(m:Medication {Medication_name: "Acetaminophen (Tylenol)"})
RETURN p.Patient_id, p.DOB
```
As you can see, the queries are very similar to one another and are not too different from their SQL counterparts. However, the syntax is much more compact. While it is certainly possible to use SQL to model this particular use case, the Cypher-based approach feels much simpler and easier. This simplicity is particularly highlighted by the additional query for the patient’s date of birth: the SQL example required an additional `INNER JOIN` clause, while the Cypher version simply required an additional `p.DOB`.
```groovy
// Get the medication ID for acetaminophen
g.V().has("Medication_name", "Acetaminophen (Tylenol)").values("Medication_id")

// Get all patients on acetaminophen via the has_order_for relationship
g.V().has("Medication_name", "Acetaminophen (Tylenol)").in("has_order_for")

// Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
g.V().has("Medication_name", "Acetaminophen (Tylenol)").in("has_order_for").values("Patient_id", "DOB")
```
Again, we can see that the traversal of relationships within the graph adds a minimal amount of syntax to the overall query.
Now, let’s take a look at one other graph query language that we haven’t talked about much yet: the ArangoDB Query Language (AQL),14 since we will be diving a bit deeper into ArangoDB in the remainder of this book.
```aql
// Get the medication ID for acetaminophen
FOR m IN Medication
  FILTER m.Medication_name == "Acetaminophen (Tylenol)"
  RETURN m.Medication_id

// Get all patients on acetaminophen via the has_order_for relationship
FOR m IN Medication
  FILTER m.Medication_name == "Acetaminophen (Tylenol)"
  FOR p IN 1..1 INBOUND m has_order_for
    RETURN p

// Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
FOR m IN Medication
  FILTER m.Medication_name == "Acetaminophen (Tylenol)"
  FOR p IN 1..1 INBOUND m has_order_for
    RETURN {id: p.Patient_id, dob: p.DOB}
```
AQL feels a bit like a combination of a query language and a traversal language. That said, we can see that the queries grow in complexity in much the same way as their Cypher and Gremlin counterparts.
RDF graphs15 have been around for a long time; they are a W3C (World Wide Web Consortium)16 standard and part of the Semantic Web.17 RDF graphs are extremely flexible and quite powerful, with the tradeoff that they have a fairly steep learning curve and, as a result, often carry the stigma of being “too academic.”
You may often hear RDF graph databases referred to as triple stores. This comes from the idea that RDF graphs are a collection of triples of the form subject-predicate-object (referred to as an RDF triple), as shown in Figure 2-6.
While Figure 2-6 shows a diagram very similar to what we saw with property graphs, RDF graphs are a bit more abstract in the sense that everything in an RDF graph is stored as a triple. In other words, what would be a column in a relational database or a property in a property graph database is itself an RDF triple. Figure 2-7 shows how a property graph would store and represent a particular patient with an ID of `1234` and a date of birth of `2020-01-01`.
In the RDF representation of a patient, the circle is the subject in three different triples: one each for the birth date, the patient ID, and the type of the node. The predicate provides the context for how the object should be interpreted. In this case, the predicate is what tells us that `2020-01-01` is actually a birth date and `1234` is a patient identifier.
The power behind this approach is that nearly anything can be modeled using this basic schema. You may have also noticed that a particular notation is used throughout the example, consisting of two strings separated by a colon. This is an adaptation of XML namespacing, where a namespace can be defined that corresponds to the base of a URI, to which an identifier can be appended. The key point here is that RDF relies heavily on URIs (uniform resource identifiers). Because URIs for a particular semantic context often share the same base, the use of namespaces allows us to simplify the representation, making it more compact and easier to read.

Because RDF was created as a standard to represent data and information on the web, URIs are baked into the underlying standard. The idea is that any particular idea or notion is backed by a clearly defined concept that is accessible at the specified URI. This allows anything represented as RDF to be self-describing. This is particularly important in the context of the Semantic Web because anyone should be able to publish and link information to anything else on the web. Unlike most database systems (where information models are designed to support very specific use cases), RDF was designed to support open information models.
In Figure 2-7, the `schema` namespace (used in `schema:birthDate`) is used to shorten the URI https://schema.org/birthDate. The base https://schema.org/ is shortened to `schema` so that we don’t need to list the entire URI. Similarly, the `ex` namespace is a completely fictional namespace made up for this example. It could easily refer to something like https://mydomain.com/example/ or anything else. The one requirement is that the URI be resolvable and provide specific details of the underlying resource. As you may notice, https is used in the examples above; there is a general discussion18 over the use of http vs. https, though that is generally beyond the scope of this book. The key point to remember is that resources are identified by unique URIs.
SPARQL (pronounced “sparkle”) is the underlying query language for RDF graphs. As part of the Semantic Web standards, it is fully open and implemented by many different vendors. Of course, RDF triple stores implement SPARQL, but many property graphs have SPARQL implementations as well. For example, there is a SPARQL-Gremlin connector19 that allows SPARQL to be run against any database supporting Gremlin.
SPARQL looks very similar to SQL in many ways but integrates concepts such as namespaces and URIs. One thing you will notice in Example 2-6 is the addition of `PREFIX`, which allows query authors to shorten their queries by taking advantage of namespacing. Otherwise, the query should feel very similar to SQL. However, you will likely notice that the `WHERE` clause is a bit more verbose and reflects the subject-predicate-object nature of RDF. Every clause in the `WHERE` block is itself a triple.
```sparql
# Get the medication ID for acetaminophen
PREFIX schema: <http://schema.org/>
PREFIX ex: <http://mydomain.com/some_example/>

SELECT ?medication_id
WHERE {
    ?medication ex:medication_name "Acetaminophen (Tylenol)" .
    ?medication ex:medication_id ?medication_id .
}

# Get all patients on acetaminophen via the has_order_for relationship
SELECT ?patient_id
WHERE {
    ?medication ex:medication_name "Acetaminophen (Tylenol)" .
    ?patient ex:has_order_for ?medication .
    ?patient ex:patient_id ?patient_id .
}

# Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
SELECT ?patient_id ?dob
WHERE {
    ?medication ex:medication_name "Acetaminophen (Tylenol)" .
    ?patient ex:has_order_for ?medication .
    ?patient ex:patient_id ?patient_id .
    ?patient schema:birthDate ?dob .
}
```
This syntax may take a bit of getting used to, but it is extremely flexible and powerful. The RDF approach allows for a lot of flexibility in representing different data and information models. However, this flexibility comes with a common tradeoff: performance. Because the data are stored as a series of triples, queries of large graphs can incur significant latency. Depending on your use case, this may be acceptable, or the flexibility may be worth the cost of servers with significant CPU and RAM resources.
Hypergraphs are not nearly as common when considering the landscape of graph databases. When someone mentions “graph database,” nearly everyone will assume you are talking about a property graph or RDF database. In a property graph, an edge connects exactly two vertices (or, a relationship connects two nodes). A hypergraph, on the other hand, contains hyperedges which can connect any number of vertices. In Figure 2-8, I compare a property graph representation and a hypergraph representation of the same information.
As you can see, node `A` is connected to nodes `B` and `C` via the relationship `some_edge`. In a property graph database, two separate relationships are stored, `A->B` and `A->C`. In the hypergraph representation, a single edge is stored, but that edge connects `A` to both `B` and `C`. Additionally, the hypergraph database I will be covering in this book also treats hyperedges as vertices in their own right. This means that an edge can itself be part of another edge.
You may be scratching your head a bit and wondering: why would we want to treat an edge as a vertex, and wouldn’t that confuse things? Yes, this can be a difficult concept to wrap your head around and may feel very “meta.” In the example we have been using around patients and medication orders, the hypergraph idea does not provide any benefit. After all, our relationship only connects two nodes anyway (a patient and a medication via the `has_order_for` relationship), so the hypergraph representation is exactly the same. But what about combination therapies, where multiple medications are ordered together for a single patient?
Figure 2-9 illustrates an example where a patient may be on multiple medications: A, B, and C. The additional context, however, is that medications A and B must be taken together as a combination therapy; the patient should not take one without the other. Additionally, the patient is also on medication C. In the property graph representation, it is unclear whether the patient just happens to be on three separate medications at the same time or whether A and B were ordered together. In the hypergraph representation, this is made very clear since A and B are connected via the same `has_order_for` relationship. Of course, we could solve this by adding additional properties or nodes to the property graph representation, but we would then be changing the representation of the data to account for shortcomings in the underlying graph model.
When I first started this book, the hypergraph database I intended to cover was Grakn from Grakn Labs. Since then, however, the company has rebranded: the database is now TypeDB from Vaticle, and they no longer call it a hypergraph database but a strongly typed database. That said, the underlying functionality still allows us to experiment with hypergraph structures, so I will use their database when we cover hypergraphs, even if they no longer really consider themselves a graph database.
```typeql
# Get the medication ID for acetaminophen
match
  $medication isa Medication,
    has Medication_name "Acetaminophen (Tylenol)",
    has Medication_id $id;
get $id;

# Get all patients on acetaminophen via the has_order_for relationship
match
  $medication isa Medication,
    has Medication_name "Acetaminophen (Tylenol)";
  $patient isa Patient;
  (patient: $patient, medication: $medication) isa order;
get $patient;

# Get patient IDs and date of birth on acetaminophen via the has_order_for relationship
match
  $medication isa Medication,
    has Medication_name "Acetaminophen (Tylenol)";
  $patient isa Patient,
    has dob $dob,
    has Patient_id $id;
  (patient: $patient, medication: $medication) isa order;
get $dob, $id;
```
As you can see, TypeQL has a structure very similar to the other query languages and makes it pretty clear and easy to query across relationships or properties.
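For context, a hypothetical TypeQL schema that queries like these could run against might be defined roughly as follows. The attribute value types are assumptions, and the syntax reflects TypeQL at the time of writing, so check the TypeDB documentation for the current form.

```typeql
# Hypothetical schema sketch for the Patient/Medication/order example.
define

Patient sub entity,
    owns Patient_id,
    owns dob,
    plays order:patient;

Medication sub entity,
    owns Medication_id,
    owns Medication_name,
    plays order:medication;

order sub relation,
    relates patient,
    relates medication;

Patient_id sub attribute, value long;
dob sub attribute, value datetime;
Medication_id sub attribute, value long;
Medication_name sub attribute, value string;
```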
The intent of this chapter was not to make you an expert in the different types of graph databases. My goal was simply to give you a high level overview of each and how their query languages compare to one another. I will dive much more deeply into each of these types of databases throughout the remainder of this book, particularly in the context of analyzing electronic health record and claims data.
The choice of which database to use depends on many factors and is generally beyond the scope of this book. As with any data project, the decision rests on a set of functional and non-functional requirements. Of course, the query language and underlying representation of the data are important considerations. But, as with nearly any software project, you can use almost any database to accomplish the task at hand; it may just require additional engineering on your part.
When looking at graph databases specifically, you may hear the term “graph-native” used quite a bit. Given the growing popularity of graph-based representations of data, many database vendors are starting to support graphs. Whether the database was originally a SQL database, document store, or even key-value store, you can find one that also supports graphs in some capacity. While any database can be used to represent a graph, graph-native refers to how the underlying data are stored. Particularly, graph-native databases provide index-free adjacency. That is, nodes that are connected by relationships are stored such that the retrieval of related nodes/edges does not require additional indexing.
This may or may not be a concern for you. After all, does it matter if the underlying database requires additional indexing to provide graph functionality? For small graphs, it may not. For large graphs, the indexes themselves can become very large and non-performant. This is one example of the additional factors to weigh when evaluating graph databases.
1 Docker: Containers. https://www.docker.com/resources/what-container
2 MongoDB CRUD operations. https://docs.mongodb.com/manual/crud/
3 CouchBase N1QL. https://docs.couchbase.com/server/current/n1ql/query.html
4 Neo4j. https://neo4j.com
5 Cypher. https://neo4j.com/developer/cypher/
6 openCypher. https://opencypher.org/
7 Gremlin. https://tinkerpop.apache.org/gremlin.html
8 RedisGraph. https://oss.redislabs.com/redisgraph/
9 RedisGraph Cypher Support. https://oss.redislabs.com/redisgraph/cypher_support/
10 JanusGraph. https://janusgraph.org/
11 AWS Neptune Gremlin. https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin.html
12 OrientDB Gremlin API. https://orientdb.com/docs/2.2.x/Gremlin.html
13 Neo4j Gremlin Support. https://github.com/neo4j-contrib/neo4j-tinkerpop-api-impl
14 ArangoDB Query Language. https://www.arangodb.com/docs/stable/aql/
15 RDF 1.1. https://www.w3.org/TR/rdf11-concepts/
16 W3C. https://www.w3.org/Consortium/
17 Semantic Web. https://www.w3.org/standards/semanticweb/
18 HTTPS and the Semantic Web/Linked Data. https://www.w3.org/blog/2016/05/https-and-the-semantic-weblinked-data/
19 SPARQL-gremlin. https://tinkerpop.apache.org/docs/current/reference/#sparql-gremlin