4

Querying Graph

In the previous chapter, we loaded the data into Graph. Now, it’s time to look into how we can query the graph using Cypher. We will take a look at the graph data model and how to leverage it to build effective queries and understand what works and what doesn’t work. By the end of the chapter, you will have learned how to build Cypher queries and understand what is required to build performant queries.

In this chapter, we will cover the following topics:

  • Exploring the data in Graph
  • Querying the nodes
  • Querying the paths

Before we start with queries, let’s explore the graph first.

Exploring the data in Graph

Once we load the data into the graph, as we learned about in the previous chapter, when we go to the database in the browser, it shows basic details such as the node labels in the database, the relationship types available, the total node count, and the total count of relationships, along with the property names that exist in the database.

The following screenshot shows these details:

Figure 4.1 – Graph data counts in the database

Figure 4.1 – Graph data counts in the database

We can see the node-related details on the top left-hand side of the screen, and below it, we can see the relationship details. All of these are selectable items and when you click on them, a Cypher query runs against the database to show a sample related to the name you clicked. For instance, if you click on the node label, 25 nodes of those types are shown. The same applies to the relationship types shown.

We can get the database statistics by executing the Cypher query:

CALL apoc.meta.stats()

The following screenshot shows the database statistics:

Figure 4.2 – Database statistics

Figure 4.2 – Database statistics

We can see from the screenshot the number of label types, relationship types, and property types. The fourth column shows the total number of nodes in the database. The next column shows the total number of relationships in the database. The next column shows the node counts by label. The next column shows the relationship count based on start or end label types.

For example, "()-[:HAS_ENCOUNTER]->(:wellness)": 22019 tells us that there are 22,019 HAS_ENCOUNTER relationships that have the wellness node label as the end node.

The next column shows the frequencies of different relationships in the graph. The last column is the JSON representation of all the previous column data. This provides a wealth of information on how the graph is connected.

Now that we have seen how the data is distributed, let us see how the data is connected in graph format next. We can get the graph schema by running this Cypher query:

CALL db.schema.visualization()

We can see the graph visualization in the following screenshot:

Figure 4.3 – Database graph schema visualization

Figure 4.3 – Database graph schema visualization

Our schema looks very busy here. This is because we have used a lot of secondary labels for Encounter nodes and SNOMED_CT nodes. Once we remove them, the schema will look cleaner. The following image represents the clean schema visualization:

Figure 4.4 – Database graph schema visualization – clean version

Figure 4.4 – Database graph schema visualization – clean version

Now, this looks like the data diagram we started building using Arrows App (https://arrows.app) in the previous chapter. One difference we see here is that the Encounter node is connected to the SNOMED_CT node via multiple relationship types. This is because the SNOMED_CT node can have secondary labels and these relationship types map to those secondary labels.

Now, let us look at one of the pieces of patient node data in the graph that we have loaded, to see how it matches the data diagram we built in the last chapter to assist us with data loading. The following picture shows the Neo4j-Browser visualization:

Figure 4.5 – Patient data visualization in Neo4j Browser

Figure 4.5 – Patient data visualization in Neo4j Browser

This seems somewhat closer to what we built. Since we cannot control how the browser renders the visualization, it is difficult to tell how similar it is. Let us view it in the Neo4j Bloom visualization that comes along with Neo4j Desktop:

Figure 4.6 – Patient data visualization in Neo4 jBloom

Figure 4.6 – Patient data visualization in Neo4 jBloom

This looks closer to what we built as a data diagram. We have a Patient node in the middle and the Encounter nodes around it, which are connected to Diagnosis (SNOMED-CT) nodes, Provider nodes, Organization nodes, and so on.

Now that we have explored the data in a graph, let us get into querying the graph using Cypher. We will start by querying for the nodes first.

Querying the nodes

We have the patient data set along with the encounters with the healthcare system. We will look at a question and how we can represent it in Cypher. To find the total number of patients in the system, type the following into the console:

MATCH (p:Patient) RETURN count(p)

This Cypher query returns the total number of patients in the system. This screenshot shows the response in the browser:

Figure 4.7 – Patient count query

Figure 4.7 – Patient count query

You can see that it is instantaneous. Neo4j maintains the count stores for labels, so when we ask for counts in this way, Neo4j uses the count stores to return the response. The response time is consistent when we have one node or one million nodes of this label.

The Patient node only had one label. We know the Encounter node has multiple labels. Let us write a query to find label distribution. The Cypher query looks as follows:

MATCH (e:Encounter)
RETURN labels(e) as labels, count(e) as counts

This query returns all the Encounter label combinations available, along with the count of the nodes per label combination:

Figure 4.8 – Encounter label distribution count query

Figure 4.8 – Encounter label distribution count query

We can see that basic Encounter nodes, which also contain the Stop representation, are the majority here. After that, the wellness nodes are the majority. We can also see that this query takes more time, as it cannot leverage the count stores. Now, let us find a patient by ID:

MATCH (p:Patient
     {id:'7361ce15-cf67-ae76-88e6-bcbdca19ce0b'})
RETURN p

This query should return a single patient node identified by the ID value provided.

Figure 4.9 – Finding patients by ID

Figure 4.9 – Finding patients by ID

This is straightforward and we can see the node is shown here. If we select the Table view, the data is shown as follows:

Figure 4.10 – Finding patients by ID – Table view

Figure 4.10 – Finding patients by ID – Table view

We can see the JSON representation of the node data shown here in the Table mode. Since we are leveraging the index here, the response is quick. Now, let us get patients using their first names. The Cypher query would look as follows:

MATCH (p:Patient {firstName:'Jason347'}) return p

In this query, we are trying to find all the patents whose first name is Jason347:

Figure 4.11 – Finding patients by firstName

Figure 4.11 – Finding patients by firstName

We can see that two nodes are returned. Since our dataset is smaller, the response can be quick. Let us check how the database executed the query. The quickest option is to use the EXPLAIN clause. Let us run the explain query. It would look as follows:

EXPLAIN MATCH (p:Patient {firstName:'Jason347'}) return p

Here, we are trying to see how the database will try to execute this query.

Figure 4.12 – Finding patients by firstName – EXPLAIN MATCH

Figure 4.12 – Finding patients by firstName – EXPLAIN MATCH

From the picture, we can see the database did a NodeByLabelScan. This is similar to a table scan in RDBMS terms. Since there was no index on the firstName, the database finds all the nodes identified by the label Patient and checks if it has a firstName property with specified value. To understand the true cost, we can try PROFILE clause. The Cypher query will look like this.

PROFILE MATCH (p:Patient {firstName:'Jason347'}) return p

Here, we are trying to PROFILE the query to retrieve the patient using the firstName property.

Figure 4.13 – Finding patients by firstName – PROFILE plan

Figure 4.13 – Finding patients by firstName – PROFILE plan

We can see that there are 3,544 DB hits to execute the query. Each DB hit means a unit of work done by the database. The higher the DB hits, higher the amount of work the database is doing and can have an impact on query SLAs.

Since we don’t have a point of reference yet, let us see what the number of DB hits would be if we execute a query with an index. For this, we can execute the query with patient ID. The Cypher would look like this:

PROFILE
MATCH (p:Patient
       {id:'7361ce15-cf67-ae76-88e6-bcbdca19ce0b'})
RETURN p

Here, we are trying to PROFILE the query that retrieves the patient using the id property.

Figure 4.14 – Finding patients by ID – PROFILE plan

Figure 4.14 – Finding patients by ID – PROFILE plan

From the profile, we can see that it took 14 DB hits, out of which 12 DB hits were to return all the properties of the Patient node. We can see how big a difference an index can make when we are executing queries. We will touch upon this subject more in the later chapters.

Now, let us add an index for the firstName property, to make sure we are not seeing this behavior by chance:

CREATE INDEX p_firstName IF NOT EXISTS
FOR (n:Patient)
ON n.firstName

Execute this query in the browser first.

Now, let’s profile the query again and compare the results.

Figure 4.15 – Finding patients by name – PROFILE with and without index

Figure 4.15 – Finding patients by name – PROFILE with and without index

We can see after adding an index that DB hits reduced from 3,546 to 26, which is more than 100 times better. Not all queries see this kind of improvement, but having an index can make queries orders of magnitude better.

In Cypher, we can use STARTS WITH or ENDS WITH clauses for string properties. Let us find all the patients whose ID starts with 73. The Cypher query looks as follows:

MATCH (p:Patient) WHERE p.id STARTS WITH '73' RETURN p

This query is trying to retrieve all the patient nodes whose ID property starts with 73:

Figure 4.16 – Finding patients by ID – STARTS WITH

Figure 4.16 – Finding patients by ID – STARTS WITH

We can see there are eight nodes matching that pattern. We are also using the WHERE clause here. We can use the {} query pattern only when we are looking for exact values. Also, when you have multiple values in that pattern, it acts like an AND condition. If we are looking for range or using other clauses, then we need to use a WHERE clause to build our queries. Before now, we only got nodes as a response. Say we want CSV values as our response, such as a patient’s first name and last name to be returned. The query would look as follows:

MATCH (p:Patient) WHERE p.id STARTS WITH '73'
RETURN p.firstName as firstName, p.lastName as lastName

This query is trying to retrieve the firstName and lastName values for patients whose ID starts with 73:

Figure 4.17 – Finding patients by ID – STARTS WITH – CSV response

Figure 4.17 – Finding patients by ID – STARTS WITH – CSV response

When we return individual properties, then the data is returned in CSV format. For visualizations or other processing, we can return the data as nodes and relationships; to consume the data in a tabular format, we can return the data in CSV format as shown here.

Let us do a profile on this query to see whether we are using indexes or not using this query:

PROFILE MATCH (p:Patient) WHERE p.id STARTS WITH '73'
RETURN p.firstName as firstName, p.lastName as lastName

This is telling the database to execute the query (excluding the PROFILE keyword) and tell us exactly the amount of work that the database did:

Figure 4.18 – Finding patients by ID – STARTS WITH – CSV response – PROFILE

Figure 4.18 – Finding patients by ID – STARTS WITH – CSV response – PROFILE

We can see that for string properties when we have an index for the STARTS WITH clause, the index is leveraged. It is very important to understand this aspect. The WHERE clause works pretty similarly to all the other types except the Spatial Point type. For a Point type, we need to use spatial functions in the WHERE clause. Since we already have a location attribute on the Patient node, which is of the Point type, let us see how we can build a Cypher query on it.

Say we have a point of reference, {longitude: -71.46554, latitude: 42.217916}, in mind. We want to find all the patients within 5 kilometers of this point. Please remember that the majority of the time, the distance calculated by the spatial functions is in meters.

To achieve this, the Cypher query looks as follows:

WITH point({longitude: -71.46554, latitude: 42.217916}) as refPoint
MATCH (p:Patient)
WITH p, point.distance(refPoint, p.location) as distance
WITH p, distance
WHERE distance < 5000
RETURN p.firstName as firstName, p.lastName as lastName, distance

This query is using the point capabilities of the graph database to do a geolocation search.

Figure 4.19 – Finding patients within a certain distance of a point

Figure 4.19 – Finding patients within a certain distance of a point

We can see we have 9 results, which are within 5 km of the reference point.

Now that we have worked with querying nodes, let us look at querying for the paths.

Querying the paths

A path here consists of an anchor node or starting node and traverses one or more hops in any direction from it. In the earlier section, we worked with the Patient node to showcase how to query nodes. Here, let us start from the Patient node and what it is connected to at one hop.

Our Cypher query can look as follows:

MATCH path=
   (:Patient {id:'7361ce15-cf67-ae76-88e6-bcbdca19ce0b'})-->()
RETURN path

This returns all the paths in one hop in the outgoing direction from the Patient node:

Figure 4.20 – All paths one hop from a patient

Figure 4.20 – All paths one hop from a patient

Here, we can see one Race node, one Ethnicity node, and 60 Encounter nodes. Say that we want to find out the race demographics of our patients where we can use paths to get these values. The Cypher query for it will look as follows:

MATCH (r:Race)
RETURN r.type as type, size((r)<-[:HAS_RACE]-()) as count

This query retrieves all the Race nodes and returns the type property and the number of nodes connected to it:

Figure 4.21 – Patient count by race

Figure 4.21 – Patient count by race

The approach we have taken here is to use path comprehension to find the values. This approach will also leverage the count stores to give a response quickly. We can get the same results using this query:

MATCH (r:Race)<-[:HAS_RACE]-(p)
RETURN r.type as type, count(p) as count

Let’s profile each query to see which one is performing better.

Figure 4.22 – Patient count by Race – PROFILE of different queries

Figure 4.22 – Patient count by Race – PROFILE of different queries

We can see from the profile that the path comprehension way of getting counts is way cheaper (25 DB hits) than traversing the relationship and counting the nodes (1,204 DB hits). This is the advantage of leveraging the count stores to return the counts. This was possible here because only Patient was connected to the Race node. Let us take the ZipCode node, to which both the Patient and Provider nodes are connected via the same relationship, HAS_ZIPCODE. Let us see how the query is being processed by the database when we use a similar approach. First, the Cypher queries look like this.

Here’s the first version of the query:

MATCH (z:ZipCode)<-[:HAS_ZIPCODE]-(p:Patient)
RETURN z.zip as zip, count(p) as count

Here’s the second version of the query:

MATCH (z:ZipCode)
RETURN z.zip as zip, size((z)<-[:HAS_ZIPCODE]-(:Patient)) as count

You can see one difference between the queries, but we had to give the Patient label explicitly in both of the queries.

Let us look at the profile of each of these queries:

Figure 4.23 – Patient count by ZIP code – PROFILE of different queries

Figure 4.23 – Patient count by ZIP code – PROFILE of different queries

We can see here that using path comprehension does not benefit us at all. In fact, that is a costlier query due to the way the database has to traverse and apply the label check. So, if there is a requirement to provide counts such as these more often, say a dashboard that refreshes every 30 seconds, then having distinct relationships can make a huge difference not only in terms of that query performance but also in terms of overall database resource consumption. This is where tuning the data model as our requirements grow and making changes are required. Neo4j being schema-less is great for this.

Now, let us look at some functional questions. Let us build a query for how many patients visited the providers within the same ZIP code. We know that Patient and Provider are not connected directly. They are only connected via the Encounter nodes. The Cypher query for it looks as follows:

MATCH (p:Patient)-[:HAS_ZIPCODE]->(zip)
WITH p, zip
MATCH (p)-[:HAS_ENCOUNTER]->()-[:HAS_PROVIDER]->
      (prov)-[:HAS_ZIPCODE]->(zip)
WITH DISTINCT p,prov, zip
RETURN p.firstName as patientFirst,
       p.lastName as patientLast,
       prov.name as provider,
       zip.zip as zipcode

We are starting with patients who have a ZIP code, then we are traversing via the Encounter node to Provider to the ZIP code that any given Patient is connected to. This screenshot shows the results:

Figure 4.24 – Patients who visited providers in the same zip code

Figure 4.24 – Patients who visited providers in the same zip code

Also, in the query, we are using the WITH clause after we get patient and zip. The WITH clause will separate the previous section of the query from the next section. The data from the first section of the query is streamed to the next section of the query and processed in that order. This avoids the Cartesian product.

Now, let us take a look at a slightly more complex query: once a condition is identified, what a patient’s activity from the previous 60 days has been.

The Cypher query for this can look as follows:

MATCH (c:Condition {code:'160903007'})
    <-[:HAS_CONDITION]-(encounter)
WITH encounter LIMIT 1
MATCH (encounter)<-[:HAS_ENCOUNTER]-(patient)
WITH patient, encounter
MATCH (patient)-[:HAS_ENCOUNTER]->(e)
WHERE ( encounter.date - duration('P60D') )  <= e.date < (encounter.date)
WITH e ORDER BY e.date
MATCH (e)-[:HAS_DRUG|:HAS_PROCEDURE]->(x)
RETURN labels(x)[0] AS eventType, x.description AS name,    e.date AS startDate

We can see from the query that, we find a patient with a given Condition, and within the previous 60 days, we are looking for any related DRUG or PROCEDURE delivered.

Figure 4.25 – Activity for 60 days before a condition is identified

Figure 4.25 – Activity for 60 days before a condition is identified

We can see that one procedure has been performed and one drug prescribed 60 days before the condition has been identified.

Let us try one more query. For each drug, let us see how many patients it has been prescribed to. To be able to do this, we need to start from the Drug node first and via Encounter node, we can reach the Patient node.

Cypher looks as follows for the query:

MATCH (d:Drug)<-[:HAS_DRUG]-()<-[:HAS_ENCOUNTER]-(p)
WITH DISTINCT d, p
RETURN d.description as drug, count(p) as patients

This query is trying to retrieve all the drugs in the system, along with how many patients they have been prescribed to.

Figure 4.26 – Number of patients prescribed the drug

Figure 4.26 – Number of patients prescribed the drug

We can see it is simple to traverse and collect the details as needed.

All this time, we have been using hardcoded values to execute the query. This is fine for validating and tuning queries occasionally. The negative side of hardcoding values is that there would be query planning involved each time we executed the query. This is not going to be a small amount of time. Depending on what the query is like, the query plan time may be much longer than the actual execution. When the application is executing a large number of queries against the database, using parameterized queries to reduce or eliminate the query planning time can have a huge impact on system performance.

Now, we will look at how to use parameterized queries in Neo4j Browser.

The first step is to define the parameters. For this, we will use the :Param keyword in Browser.

We need to execute this query to set a parameter in the Browser session:

:param id => '7361ce15-cf67-ae76-88e6-bcbdca19ce0b'

Please remember that this is Browser-specific usage only. For applications, you need to use the driver API.

Figure 4.27 – Defining a parameter

Figure 4.27 – Defining a parameter

The screenshot shows that we are able to successfully define a new parameter called id in a Browser query context.

Now that we have defined the parameter, let us use it in a query:

MATCH (p:Patient {id:$id}) RETURN p

We refer to the parameter using the $ symbol in the query. We can define one or many parameters and all of them can be referenced in the query using the $ symbol. The following screenshot shows the query execution in Browser and the results.

Figure 4.28 – Executing a parameterized query

Figure 4.28 – Executing a parameterized query

There are some limitations to using parameters. We cannot use parameters in place of labels or relationship types. This is because Cypher does not allow variable substitution for node labels or relationship types. Otherwise, anywhere we can have a variable, we can use parameterized values.

Let’s look at a slightly more complex parameter definition. Say we want to define a map as a single parameter. We can achieve this using this query:

:param test => { RETURN 123 as type, "test" as name}

This defines a map with name and type keys and assigns the value to a parameter named test. One difference here is that the map is wrapped into a list.

First, let’s execute the query.

Figure 4.29 – Defining a complex parameter in Browser

Figure 4.29 – Defining a complex parameter in Browser

We can see the test key is associated with a list and the first element in the list is the map with the values we have defined.

Let’s write a Cypher query to use this parameter:

WITH $test[0] as data
RETURN data.name as name, data.type as type

We can see from the query that we are taking the first element of the input parameter and returning name and type values from it.

Figure 4.30 – Using the complex parameter from the query

Figure 4.30 – Using the complex parameter from the query

From the screenshot, we can see that we can retrieve the individual values from the map the way we defined them.

Now, let us look at what we have learned so far.

Summary

In this chapter, we have taken a look at the graph data model and explored the graph using database statistics and schema visualization.

In this chapter, we learned about how to find nodes by leveraging indexes or without using indexes; compared the performance of different queries to understand the importance of using indexes for querying; used PROFILE to understand performance issues in the queries; learned about using the STARTS WITH clause, which leverages an index; worked with point properties to leverage the geospatial capabilities of Graph; traversed the graph efficiently; leveraged count stores; and built complex queries using the WITH clause.

In the next chapter, we will continue our graph exploration by performing more complex queries, which will involve filtering, sorting, and aggregation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.114.132