9

Query Tuning

Till now, we have looked at various aspects of building Cypher queries. In this chapter, we will take a look at which options are available to profile and tune queries.

We have two options available to tune queries. The first one is the EXPLAIN clause, which takes the Cypher query and provides an estimated amount of work the database might do. It does not execute the query. The other one is the PROFILE clause. This will execute the query and gives the exact amount of work the database is doing. We will review both of these options in detail to understand how to leverage them to tune queries.

We will be taking a look at these aspects:

  • Working with EXPLAIN
  • Working with PROFILE
  • Reviewing plan operators
  • Using index hints

First, let’s take a look at how to use EXPLAIN to tune queries.

Working with EXPLAIN

The EXPLAIN clause returns a query execution plan as the database sees it. It does not actually execute the query and does not return data. It does not make any changes to the database itself. It can be used to easily identify missing indexes or query plans not using existing indexes, and so on. Also, it tells us how a query is traversing relationships to do its work.

Let’s look at a basic example of EXPLAIN usage:

EXPLAIN MATCH (d {code:'313820'})
RETURN d

This query is trying to find a node with a code property matching the provided value. Notice that we did not provide a label in the query. This is intentional to showcase how a query plan can be used to identify issues.

We can see from the following screenshot that the first step we are doing is AllNodesScan. What this means is that we are looking at the whole database to find the node we want and that we either have a mistake in the query or we are missing an index:

Figure 9.1 – Basic EXPLAIN plan

Figure 9.1 – Basic EXPLAIN plan

We can also see that the query plan provides estimated rows returned in each step. These are approximate values based on statistics collected by the database and not exact values.

Now, let’s add the missing label and retry the plan again:

EXPLAIN MATCH (d:Drug {code:'313820'})
RETURN d

We added a label to the query here.

From the following screenshot, we can see that the query plan is using NodeUniqueIndexSeek to find the node now:

Figure 9.2 – EXPLAIN plan with node label present

Figure 9.2 – EXPLAIN plan with node label present

This is better—even though we don’t know exact performance benchmarks as such, this indicates the database query planner knows there is a unique index on the Drug label and the code property and tells the database to use it when it executes the query.

Let us try a slightly more complex query and see its plan:

EXPLAIN MATCH (p:Patient)-[:HAS_ZIPCODE]->(zip)
WITH p, zip
MATCH (p)-[:HAS_ENCOUNTER]->()-[:HAS_PROVIDER]->
      (prov)-[:HAS_ZIPCODE]->(zip)
WITH DISTINCT p,prov, zip
RETURN p.firstName as patientFirst,
       p.lastName as patientLast,
       prov.name as provider,
       zip.zip as zipcode

We can see this is a bit more complex query than what we tried before. The plan is big, so we will explore it in sections.

We can see that the first step here is to do something called NodeByLabelScan. Since we did not provide any property or WHERE clause to limit the results in the first line of the query, it is finding all nodes identified by the Patient label:

Figure 9.3 – Part 1 of query plan of complex query

Figure 9.3 – Part 1 of query plan of complex query

The reason it is starting from the Patient label is that we have not specified a label for the other node in the query. Due to this, the query planner starts from a known label first. If it had to start from zip, it would have to do AllNodesScan, which is a costlier operation than NodeByLabelScan. Another thing we notice here is that once it finds the patient nodes, it is ordering them in ascending order. If we notice in the query, we do not have ORDER BY anywhere.

Then, why is the query planner looking to order the nodes?

If we go to the next part of the query, we are doing this:

WITH DISTINCT p,prov, zip

Since we want the distinct values, it is trying to order by the patient nodes first so that it can eliminate duplicates easily. Sorting the dataset when it is small and exploring from there could be more cost-effective than collecting all the data and trying to sort; considering all the different combinations could be costlier. The query planner could be considering this aspect and sorts the patient nodes immediately.

Next, it traverses the HAS_ZIPCODE relationship and collects the zip codes:

MATCH (p)-[:HAS_ENCOUNTER]->()-[:HAS_PROVIDER]->
      (prov)-[:HAS_ZIPCODE]->(zip)

We can see from Figure 9.3 that the query planner is planning for the query snippet shown previously. In the query, we are starting from the patient node, but the query planner is starting from the zip code node. This is because the total number of zip code nodes is a lot less, so it is trying to start from there and going backward. It is traversing this way and collecting all the providers.

The next step that is planned is finding all the anonymous nodes connected to these provider nodes.

Let's look at the query plan image in the next screenshot:

Figure 9.4 – Part 2 of query plan of complex query

Figure 9.4 – Part 2 of query plan of complex query

The next step in the query plan is to check the patient node is connected to the anonymous nodes and only pick those nodes. Once all the conditions are satisfied, we collect the patient, provider, and zip nodes.

Let's look at the query plan image for the next part here:

Figure 9.5 – Part 3 of query plan of complex query

Figure 9.5 – Part 3 of query plan of complex query

We can see in the last step of the plan that we are now collecting all the return data and returning the result.

We will take a look at the profile of these queries to gain more insight into the amount of work the database is doing.

Working with PROFILE

The PROFILE clause executes as per the plan created by the database and provides the exact cost of the query. It will keep track of how many rows pass through the operators and the amount of work the database is doing that is measured as database hits.

Let’s look at a basic example and compare the EXPLAIN and PROFILE plans:

PROFILE MATCH (d {code:'313820'})
RETURN d

In the query, we can see that the only difference is we are using PROFILE instead of EXPLAIN:

Figure 9.6 – Basic query without EXPLAIN and PROFILE label comparison

Figure 9.6 – Basic query without EXPLAIN and PROFILE label comparison

We can see that the plan remained exactly the same, but we are seeing the db hits, which are the measurement of how much work the database is doing. You can think of 1 db hit as a unit of work the database is doing. More db hits mean a plan step is taking more work than the database is doing to complete that step in the query. Also, more db hits mean that it’s taking more time to execute the query. The total amount of db hits required to execute this query is 164968, and the time taken is 76 milliseconds.

Let’s look at the profile of the query with a label:

PROFILE MATCH (d:Drug {code:'313820'})
RETURN d

We are using PROFILE here with a Drug label in the query. We will compare the EXPLAIN and PROFILE plans for this query in the following screenshot:

Figure 9.7 – PROFILE plan of basic query with label

Figure 9.7 – PROFILE plan of basic query with label

From the preceding screenshot, we can see again the plans are exactly the same and PROFILE has the db hits. By adding the label, we can see that the first operation took only 2 db hits. If we remember, in the earlier query, this step took 157,913 db hits. Also, here we are missing one step when compared to the previous query without a label. There is a step with 7,052 db hits, which is comparing the property value of the node found to the input value. This is not needed here because the index lookup gives up the node for the matching index value. The last step is exactly the same, with 3 db hits, which returns the node we found.

If we look at the results, this query takes 5 db hits and takes less than 1 ms (reported by the browser as 0 ms).

Let’s look at the profile of the complex query:

PROFILE MATCH (p:Patient)-[:HAS_ZIPCODE]->(zip)
WITH p, zip
MATCH (p)-[:HAS_ENCOUNTER]->()-[:HAS_PROVIDER]->
      (prov)-[:HAS_ZIPCODE]->(zip)
WITH DISTINCT p,prov, zip
RETURN p.firstName as patientFirst,
       p.lastName as patientLast,
       prov.name as provider,
       zip.zip as zipcode

We will compare the EXPLAIN plan and the profile of that plan in the next steps:

Figure 9.8 – Part 1 of query plan and profile of complex query

Figure 9.8 – Part 1 of query plan and profile of complex query

We can see again the plan steps are the same. We can see here clearly how the row estimations can be wildly different from the EXPLAIN plan and the actual PROFILE plan. For the first two steps, we can see the estimated rows do match the actual rows processed. This is where the data differs, as we are traversing the relationships, and the actual number of relationships cannot be captured in the query plan as count stores do not have that granular information:

Figure 9.9 – Part 2 of query plan and profile of complex query

Figure 9.9 – Part 2 of query plan and profile of complex query

We can see again in the following screenshot how the plan thinks there are zero estimated rows and the actual execution provides actual rows:

Figure 9.10 – Part 3 of query plan and profile of complex query

Figure 9.10 – Part 3 of query plan and profile of complex query

We can observe the same pattern again here. PROFILE gives us the actual amount of work that the database is doing. When we look at the total db hits for the query, they are 1,646,694 db hits, and the actual time taken is 228 ms.

The EXPLAIN and PROFILE plans of these queries show us how we can use EXPLAIN to understand how the database is planning to execute the query and PROFILE to tell us exactly how much work it is doing. To identify basic issues such as missing indices or not using an index, and so on, EXPLAIN is useful. As the data grows, PROFILE is very important to understand exactly how much work the database is doing to execute the query.

Next, we will take a look few important plan operators.

Reviewing plan operators

We will take a look at a few important operators that we will encounter most often while tuning queries.

These operators impact query performance the most and are listed as follows:

  • AllNodesScan: This means the database is looking at all the nodes. Most often than not, this means the user forgot to provide a label. There are scenarios where this might be the intention. Most users when they are beginning with Cypher add the label name but forget to add ":" before the label. An example looks like this:
    MATCH (Drug {code: '1234'} RETURN *

We might have thought we provided the label name, but since we don’t have ":" before the label, it will be treated as a variable, and we will see all nodes are being looked at.

  • CartesianProduct: This is a scenario when we have multiple MATCH statements without the WITH clause between them. A sample of such a scenario looks like this:
    MATCH (p:Patient)
    MATCH (d:Drug)
    RETURN p,d
    LIMIT 10

In this scenario, we are creating a Cartesian product of the response. Unless we are sure we need this scenario, when CartesianProduct appears in the query plan, we want to take a look at the query and see how we can avoid it.

  • NodeUniqueIndexSeek: This means we are finding a unique node leveraging the index. As much as possible, this should be the approach to find the initial node and then traverse from there. This also means for the node label and property, there is a unique constraint that exists.
  • NodeIndexScan: This means we are trying to find all nodes that match the label and property by looking at the whole index. This means we have an index created, but it is not a unique constraint. This is going to be way faster than a label scan. If a node has a primary key, we should strive to create a unique constraint rather than a simple index.
  • NodeByLabelScan: This means we have provided the label, but either we are not providing any property or there is no index on the label and property combination. The most common mistake users make is not creating an index on the node label and property, and the query performance keeps degrading as the data keeps growing.

Now, let us look at index hints that can assist us with query performance.

Using index hints

Index hints are used to specify which indexes the planner should try to use. This can change how the query gets executed. There might be a trade-off compared to the default planning when using index hints.

We can provide index hints to the planner with a USING clause. Let’s take a sample query that uses multiple entities and review a basic query profile and index hint driver query profile:

MATCH path=
(p:Patient {id:'7361ce15-cf67-ae76-88e6-bcbdca19ce0b'})
-[:HAS_ENCOUNTER]->()
-[:HAS_DRUG]->(d:Drug {code:'1190795'})
RETURN path LIMIT 1

We will take the preceding query and get a basic query profile first:

PROFILE MATCH path=
(p:Patient {id:'7361ce15-cf67-ae76-88e6-bcbdca19ce0b'})
-[:HAS_ENCOUNTER]->()
-[:HAS_DRUG]->(d:Drug {code:'1190795'})
RETURN path LIMIT 1

The following screenshot represents a basic query profile:

Figure 9.11 – Basic query profile without hints

Figure 9.11 – Basic query profile without hints

From the preceding screenshot, we can see that the query planner is starting with the Patient node first, even though we also have an index on the Drug node, as it tries to optimize based on the statistics available. The Patient node has a smaller number of outgoing relationships when compared to the Drug node’s incoming relationships, so it is cheaper to start from the Patient node. We can also see that we retrieve the Drug node and check for the label and compare the code property values. We can see that it takes a total of 541 db hits and uses 192 bytes of total memory to execute the query. It also takes 1 ms to execute.

Now, let’s get a profile for the query with an index hint. The query would look like this:

PROFILE MATCH path=
(p:Patient {id:'7361ce15-cf67-ae76-88e6-bcbdca19ce0b'})
-[:HAS_ENCOUNTER]->()
-[:HAS_DRUG]->(d:Drug {code:'1190795'})
USING INDEX d:Drug(code)
RETURN path LIMIT 1

In the query, we can see the usage of an index hint. We are telling the query planner to use the index we have on the code property of the Drug node.

Let’s take a look at the profile of the query in the following screenshot:

Figure 9.12 – Query profile with index hint

Figure 9.12 – Query profile with index hint

From the profile, we can see that there is a difference in the execution here. We have two starting points now. The first one is the Patient node with a specified id property and the second one is the Drug node with the specified code value. We take both the starting and ending points and traverse both ways to meet in the middle. We can see that we only take 87 db hits here, but we are using 16,784 bytes of memory. Also, it takes 2 ms to execute the query.

While the second profile is using a smaller number of db hits, it is using up more memory and CPU time. This is because of the way the query is executed. If we have enough page cache for the database to be loaded into memory, then the basic query profile gives us better results as it is straightforward to execute the query. When the database is big and the page cache is not enough to load the whole database into memory, then the second query can work better as it avoids page faults, due to a smaller number of db hits.

You can read more about planner hints at https://neo4j.com/docs/cypher-manual/current/query-tuning/using/.

Now, let us summarize what we have learned in this chapter.

Summary

In this chapter, we looked at using EXPLAIN and PROFILE to understand how queries will be executed by the database by looking at a plan and actual work done by the database in terms of db hits. We looked at some query plans for some basic queries and a complex query to understand the difference between EXPLAIN and PROFILE plans.

We looked at using index hints and how these affect the query execution time, along with the amount of work the database is doing to execute the query. We also reviewed a few important plan operators we need to look out for.

In the next chapter, we will take a look at Awesome Procedures on Cypher (APOC) functions and procedures and how they help us with querying.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.143.5.201