The Cypher Query Language

Cypher is the language used to interact with Neo4j. Originally created by Neo4j for Neo4j, it has been open sourced as openCypher and is now used by other graph database engines such as RedisGraph. It is also part of the Graph Query Language (GQL) association, whose goal is to build a common graph database query language – like SQL is for relational databases. In any case, it is a good starting point to understand how to query graph databases due to its visual aspect: nodes and relationships can be identified quickly by looking at the query.

In this chapter, we will review the basics of Cypher, which will be used in this book: CRUD methods, bulk importing data with various formats, and pattern matching to extract exactly the data we need from a graph. It is a good place to introduce the Awesome Procedures On Cypher (APOC) plugin, which is an extension of Cypher introducing powerful methods for data imports, and more. We will also take a look at more advanced tools such as the Cypher query planner. It will help us to understand the different steps of a query and how we can tune it for faster execution. Finally, we'll discuss Neo4j and Cypher performance with a social (Facebook-like) graph.

The following topics will be covered in this chapter:

  • Creating nodes and relationships
  • Updating and deleting nodes and relationships
  • Using aggregation functions
  • Importing data from CSV or JSON
  • Measuring performance and tuning your query for speed

Technical requirements

The required technologies and installations for this chapter are as follows:

Creating nodes and relationships

Unlike other query languages for graph databases such as Gremlin (https://tinkerpop.apache.org/gremlin.html) or AQL (ArangoDB), Cypher was built to have a syntax similar to SQL, in order to ease the transition for developers and data scientists used to the structured query language.

Like many tools in the Neo4j universe, Cypher's name comes from the movie The Matrix released in 1999: Neo is the main character. Apoc is also a character from this movie.

Managing databases with Neo4j Desktop

It is assumed you already have experience with Neo4j Desktop. This is the easiest tool to manage your Neo4j graphs, the installed plugins, and applications. I recommend creating a new project for this book, in which we are going to create several databases. In the following screenshot, I have created a project named Hands-On-Graph-Analytics-with-Neo4j, containing two databases: Test graph and USA:

Throughout this book, we will use Neo4j Browser, which is an application installed by default in Neo4j Desktop. Within this application, you can write and execute Cypher queries, but also visualize the results in different formats: a visual graph, JSON, or tabular data.

Creating a node

The simplest instruction to create a node is the following:

CREATE ()

It creates a node, without a label or a relationship.

We can recognize the () pattern that is used to identify nodes within Cypher. Every time you want a node, you will have to use brackets.

We can check the content of the database after this statement with a simple MATCH query that will return all nodes of the graph:

MATCH (n)
RETURN n

Here, we are selecting nodes (because of the use of ()), and we are also giving these nodes a name, or alias: n. Thanks to this alias, we can refer to those nodes in later parts of the query; here, only in the RETURN statement.

The result of this query is shown here:

OK, great. But a single node with neither a label nor properties is not sufficient for most use cases. If we want to assign a label to a node when creating it, here is the syntax to use:

CREATE (:Label)

That's already better! Now, let's create properties when creating the node:

CREATE (:Label  {property1: "value", property2: 13})

We'll see in later sections how to modify an existing node: adding or removing labels, and adding, updating, or deleting properties.

Selecting nodes

We've already talked about the simple query here, which selects all nodes inside the database:

MATCH (n)
RETURN n

Be careful: if your database is large, this query is likely to make your browser or application crash. It is better to do what you would do with SQL, and add a LIMIT clause:

MATCH (n)
RETURN n
LIMIT 10
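
LIMIT pairs naturally with SKIP to page through large results, just as in SQL. A minimal sketch, assuming the nodes carry an id property to order by:

MATCH (n)
RETURN n
ORDER BY n.id
SKIP 10
LIMIT 10

This skips the first ten nodes and returns the next ten; without an ORDER BY, the paging order is not guaranteed to be stable between runs.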

Let's try to be more specific in the data we want to select (filter) and the properties we need in the RETURN statement.

Filtering

Usually, we don't want to select all nodes of the database, but only those matching some criteria. For instance, we might want to retrieve only the nodes with a given label. In that case, we'd use this:

MATCH (n:Label)
RETURN n

Or, if you want to select only nodes with a given property, use this:

MATCH (n {id: 1})
RETURN n

The WHERE statement is also useful for filtering nodes. It actually allows more complex comparisons compared to the {} notation. We can, for instance, use inequality comparison (greater than >, lower than <, greater or equal to >=, or lower or equal to <= statements), but also Boolean operations like AND and OR:

MATCH (n:Label)
WHERE n.property > 1 AND n.otherProperty <= 0.8
RETURN n

It might be surprising that when selecting nodes, the browser also displays the relationships between them, even though we have not asked it to do so. This comes from a setting in Neo4j Browser, whose default behavior is to visualize node connections. It can be disabled by un-checking the Connect result nodes setting.

Returning properties

So far, we've returned whole nodes, with all the properties associated with them. If your application is interested in only some properties of the matched nodes, you can reduce the size of the result set by specifying which properties to return:

MATCH (n)
RETURN n.property1, n.property2

With this syntax, we don't have access to the graph output in Neo4j Browser anymore, as it cannot access the node object, but we have a much simpler table output.

Creating a relationship

In order to create a relationship, we have to tell Neo4j about its start and end node, meaning the nodes need to be already in the database when creating the relationship. There are two possible solutions:

  • Create nodes and the relationship(s) between them in one pass:
CREATE (n:Label {id: 1})
CREATE (m:Label {id: 2})
CREATE (n)-[:RELATED_TO]->(m)
  • Create the nodes first:
CREATE (:Label {id: 3})
CREATE (:Label {id: 4})

And then create the relationship. In that case, since the relationship is created in another query (another namespace), we need to first MATCH the nodes of interest:

MATCH (a {id: 3})
MATCH (b {id: 4})
CREATE (a)-[:RELATED_TO]->(b)

While nodes are identified with brackets, (), relationships are characterized by square brackets, [].

If we check the content of our graph after the first query, here is the result:

Reminder: while specifying a node label when creating a node is not mandatory, relationships must have a type. The following query is invalid, CREATE (n)-[]->(m), and leads to the following Neo.ClientError.Statement.SyntaxError:

Exactly one relationship type must be specified for CREATE. Did you forget to prefix your relationship type with a : (line 3, column 11 (offset: 60))?

Selecting relationships

We would like to write queries such as the following one, similar to the one we write for nodes but with square brackets, [], instead of brackets, ():

MATCH [r]
RETURN r

But this query results in an error. Relationships cannot be retrieved in the same way as nodes. If you want to see the relationship properties in a simple way, you can use either of the following syntaxes:

// no filtering
MATCH ()-[r]-()
RETURN r

// filtering on relationship type
MATCH ()-[r:REL_TYPE]-()
RETURN r

// filtering on relationship property and returning a subset of its properties
MATCH ()-[r]-()
WHERE r.property > 10
RETURN r.property

We will see how this works in detail in the Pattern matching and data retrieval section later on.

The MERGE keyword

The Cypher documentation describes the behavior of the MERGE command very well:

MERGE either matches existing nodes and binds them, or it creates new data and binds that. It’s like a combination of MATCH and CREATE that additionally allows you to specify what happens if the data was matched or created.

Let's see an example:

MERGE (n:Label {id: 1})
ON CREATE SET n.timestamp_created = timestamp()
ON MATCH SET n.timestamp_last_update = timestamp()

Here, we are trying to access a node with Label and a single property id, with a value of 1. If such a node already exists in the graph, the subsequent operations will be performed using that node. This statement is then equivalent to a MATCH in that case. However, if the node with label Label and id=1 doesn't exist, then it will be created, hence the parallel with the CREATE statement.

The two other optional statements are also important:

  • ON CREATE SET will be executed if and only if the node was not found in the database and a creation process had to be performed.
  • ON MATCH SET will only be executed if the node already exists in the graph.

In this example, I use those two statements to remember when the node was created and when it was last seen in such a query.
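
MERGE is not limited to nodes: it also works on relationship patterns, with the same optional ON CREATE SET and ON MATCH SET statements. A minimal sketch, reusing the two Label nodes created earlier in this chapter:

MATCH (a:Label {id: 1})
MATCH (b:Label {id: 2})
MERGE (a)-[r:RELATED_TO]->(b)
ON CREATE SET r.timestamp_created = timestamp()
ON MATCH SET r.timestamp_last_update = timestamp()

If a RELATED_TO relationship between these two nodes already exists, it is reused and its timestamp_last_update property is refreshed; otherwise, the relationship is created.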

You are now able to create nodes and relationships, assigning label(s) and properties to them. The next section will be dedicated to other kinds of CRUD operations that can be performed on these objects: update and delete.

Updating and deleting nodes and relationships

Creating objects is not sufficient for a database to be useful. It also needs to be able to do the following:

  • Update existing objects with new information
  • Delete objects that are no longer relevant
  • Read data from the database

This section deals with the first two bullet points, while the last one will be covered in the following section.

Updating objects

There is no UPDATE keyword with Cypher. To update an object, node, or relationship, we'll use the SET statement only.

Updating an existing property or creating a new one

If you want to update an existing property or add a new one, it's as simple as the following:

MATCH (n {id: 1})
SET n.name = "Node 1"
RETURN n

The RETURN statement is not mandatory, but it is a way to check the query went well, for instance, checking the Table tab of the result cell:

{
"name": "Node 1",
"id": 1
}

Updating all properties of the node

If we want to update all properties of the node, there is a practical shortcut:

MATCH (n {id: 1})
SET n = {id: 1, name: "My name", age: 30, address: "Earth, Universe"}
RETURN n

This leads to the following result:

{
"name": "My name",
"address": "Earth, Universe",
"id": 1,
"age": 30
}

In some cases, it might be painful to repeat existing properties to be sure not to erase the id, for instance. In that case, the += syntax is the way to go:

MATCH (n {id: 1})
SET n += {gender: "F", name: "Another name"}
RETURN n

This again works as expected, adding the gender property and updating the value of the name field:

{
"name": "Another name",
"address": "Earth, Universe",
"id": 1,
"gender": "F",
"age": 30
}

Updating node labels

On top of adding, updating, and deleting properties, we can do the same with node labels. If you need to add a label to an existing node, you can use the following syntax:

MATCH (n {id: 1})
SET n:AnotherLabel
RETURN labels(n)

Here, again, the RETURN statement is just there to make sure everything went well. The result is as follows:

["Label", "AnotherLabel"]

Conversely, if you mistakenly added a label to a node, you can REMOVE it:

MATCH (n {id: 1})
REMOVE n:AnotherLabel
RETURN labels(n)

And we are back to the situation where the node with id:1 has a single label called Label.

Deleting a node property

We briefly talked about NULL values in the previous chapter. In Neo4j, NULL values are not saved in the properties list. An absence of a property means it's null. So, deleting a property is as simple as setting it to a NULL value:

MATCH (n {id: 1})
SET n.age = NULL
RETURN n

Here's the result:

{
"name": "Another name",
"address": "Earth, Universe",
"id": 1,
"gender": "F" }

The other solution is to use the REMOVE keyword:

MATCH (n {id: 1})
REMOVE n.address
RETURN n

The result would be as follows:

{
"gender": "F",
"name": "Another name",
"id": 1
}

If you want to remove all properties from the node, you will have to assign it an empty map like so:

MATCH (n {id: 2})
SET n = {}
RETURN n

Deleting objects

To delete an object, we will use the DELETE statement:

  • For a relationship:
MATCH ()-[r:REL_TYPE {id: 1}]-()
DELETE r
  • For a node:
MATCH (n {id: 1})
DELETE n

Deleting a node requires it to be detached from any relationship first (Neo4j cannot store a relationship with a missing endpoint).

If you try to delete a node that is still involved in a relationship, you will receive a Neo.ClientError.Schema.ConstraintValidationFailed error, with the following message:

Cannot delete node<41>, because it still has relationships. To delete this node, you must first delete its relationships.

We need to first delete the relationship, and then the node in this way:

MATCH (n {id:1})-[r:REL_TYPE]-()
DELETE r, n

But here again, Cypher provides a practical shortcut for this – DETACH DELETE – which will perform the preceding operation:

MATCH (n {id: 1})
DETACH DELETE n

You now have all the tools in hand to create, update, delete, and read simple patterns from Neo4j. In the next section, we will focus on the pattern matching technique, to read data from Neo4j in the most effective way.

Pattern matching and data retrieval

The full power of graph databases, and Neo4j in particular, lies in their ability to go from one node to another by following relationships in a super-fast way. In this section, we explain how to read data from Neo4j through pattern matching and hence take full advantage of the graph structure.

Pattern matching

Let's take the following query:

MATCH ()-[r]-()
RETURN r

When we write these kinds of queries, we are actually performing what is called pattern matching with graph databases. The following schema explains this concept:

In this scenario, we have a directed graph made of nodes with labels A or B. We are looking for the sequence A -> B. Pattern matching consists of moving a stencil along the graph and seeing which pairs of nodes and relationships are consistent with it. On the first iteration, both the node labels and the relationship direction match the search pattern. But on the second and third iterations, the node labels are not the expected ones, and these patterns are rejected. On iteration four, the labels are right, but the relationship orientation is reversed, which makes the matching fail again. Finally, on the last iteration, the pattern is respected even if not drawn in the right order: we have a node A connected with an outbound relationship to a node B.

Test data

Let's first create some test data to experiment with. We'll use the states of the United States as a playground. A node in our graph will be a state, with its two-letter code, name, and rounded population as properties. Those states are connected when they share a common border through a relationship of type SHARE_BORDER_WITH:

Here is our sample data, created from the preceding image, using only states up to two degrees of separation away from Florida (FL):

CREATE (FL:State {code: "FL", name: "Florida", population: 21500000})
CREATE (AL:State {code: "AL", name: "Alabama", population: 4900000})
CREATE (GA:State {code: "GA", name: "Georgia", population: 10600000})
CREATE (MS:State {code: "MS", name: "Mississippi", population: 3000000})
CREATE (TN:State {code: "TN", name: "Tennessee", population: 6800000})
CREATE (NC:State {code: "NC", name: "North Carolina", population: 10500000})
CREATE (SC:State {code: "SC", name: "South Carolina", population: 5100000})


CREATE (FL)-[:SHARE_BORDER_WITH]->(AL)
CREATE (FL)-[:SHARE_BORDER_WITH]->(GA)
CREATE (AL)-[:SHARE_BORDER_WITH]->(MS)
CREATE (AL)-[:SHARE_BORDER_WITH]->(TN)
CREATE (GA)-[:SHARE_BORDER_WITH]->(AL)
CREATE (GA)-[:SHARE_BORDER_WITH]->(NC)
CREATE (GA)-[:SHARE_BORDER_WITH]->(SC)
CREATE (SC)-[:SHARE_BORDER_WITH]->(NC)
CREATE (TN)-[:SHARE_BORDER_WITH]->(MS)
CREATE (NC)-[:SHARE_BORDER_WITH]->(TN)

We will now use this data to understand graph traversal.

Graph traversal

Graph traversal consists of going from one node to its neighbors by following an edge (relationship) in a given direction.

Orientation

While relationships must be oriented when creating them, pattern matching can be performed by taking this orientation into account or not. The relationship between two nodes, a and b, can be of three kinds (with respect to a):

OUTBOUND: (a) -[r]->(b)
INBOUND: (a)<-[r]- (b)
BOTH: (a) -[r]- (b)
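
To see the difference on the states graph created in the Test data section, note that both of Florida's borders happen to be stored as relationships outgoing from the FL node, so the directed variants behave quite differently:

// outgoing only: returns Alabama and Georgia here,
// since both borders were created as (FL)-[...]->(...)
MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]->(n)
RETURN n.name

// incoming only: returns no rows on this particular data
MATCH (:State {code: "FL"})<-[:SHARE_BORDER_WITH]-(n)
RETURN n.name

This is why, for a conceptually undirected relationship such as a shared border, ignoring the stored orientation is the safer choice.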

Our USA states graph is undirected, so we will only use the BOTH relationship syntax. For instance, let's find the direct neighbors of Florida and return their names:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
RETURN n.name

This leads to the following result:

╒═════════╕
│"n.name" │
╞═════════╡
│"Georgia"│
├─────────┤
│"Alabama"│
└─────────┘

It would be interesting to also see the state population, and order the result by this value, wouldn't it?

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
RETURN n.name as state_name, n.population as state_population
ORDER BY n.population DESC

Here's the corresponding result:

╒════════════╤══════════════════╕
│"state_name"│"state_population"│
╞════════════╪══════════════════╡
│"Georgia" │10600000 │
├────────────┼──────────────────┤
│"Alabama" │4900000 │
└────────────┴──────────────────┘

In this query, we are only interested in the direct neighbors of Florida, meaning only one hop from the starting node. But with Cypher, we can traverse more relationships.

The number of hops

For instance, if we also want the neighbors of the neighbors of Florida, we could use this:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(neighbor)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor)
RETURN neighbor_of_neighbor

This returns six nodes. If you check the result carefully, you will possibly be surprised to realize that it contains Alabama, for example, which is a direct neighbor of Florida. That's true, but Alabama is also a neighbor of Tennessee, which is a neighbor of Florida, so Alabama is also a neighbor of a neighbor of Florida. If we only want the neighbors of neighbors that are not direct neighbors of Florida, we have to explicitly exclude them:

MATCH (FL:State {code: "FL"})-[:SHARE_BORDER_WITH]-(neighbor)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor)
WHERE NOT (FL)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor)
RETURN neighbor_of_neighbor

This time, the query returns only four results: South Carolina, North Carolina, Tennessee, and Mississippi.

Variable-length patterns

When all the relationships along the path have the same type, as in our example, or if we do not care about the relationship type, we can use the following shortcut:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH*2]-(neighbor_of_neighbor)
RETURN neighbor_of_neighbor

This will return the same six results that we have already seen in the previous section with this query:

(FL:State {code: "FL"})-[:SHARE_BORDER_WITH]-(neighbor)-[:SHARE_BORDER_WITH]-(neighbor_of_neighbor)

You can give lower and upper values for the number of hops with this syntax:

[:SHARE_BORDER_WITH*<lower_value>..<upper_value>]

For instance, [:SHARE_BORDER_WITH*2..3] will return the neighbors with two or three degrees of separation.
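
On the states graph, such a query could look like the following; DISTINCT avoids listing the same state once per matching path:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH*2..3]-(n)
RETURN DISTINCT n.name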

It is even possible to match paths of any length using the * notation, like so:

[:SHARE_BORDER_WITH*]

This will match paths regardless of the number of relationships. However, this syntax is not recommended, since it can cause a severe performance hit.

Optional matches

Some states do not share any borders with another US state. Let's add Alaska to our test graph:

CREATE (AK:State {code: "AK", name: "Alaska", population: 700000})

In the case of Alaska, the query we wrote before to get the neighbors will actually return zero results:

MATCH (n:State {code: "AK"})-[:SHARE_BORDER_WITH]-(m)
RETURN n, m

Indeed, no pattern matches the sequence ("AK")-SHARE_BORDER_WITH-().

In some cases, we might want to see Alaska in the results anyway. For instance, knowing that Alaska has zero neighbors is information in itself. In that case, we would use an OPTIONAL MATCH:

MATCH (n:State {code: "AK"})
OPTIONAL MATCH (n)-[:SHARE_BORDER_WITH]-(m)
RETURN n.name, m.name

This query returns the following result:

╒════════╤════════╕
│"n.name"│"m.name"│
╞════════╪════════╡
│"Alaska"│null │
└────────┴────────┘

The neighbor name, m.name, is NULL because no neighbor was found, but Alaska is part of the result.

We now have a better view of the way Cypher performs pattern matching. The next section will show how to perform aggregations such as count or sum, and handle lists of objects.

Using aggregation functions

It is often very useful to compute some aggregated quantities for the entities in our database, such as the number of friends in a social graph or the total price of an order for an e-commerce website. We will discover here how to do those calculations with Cypher.

Count, sum, and average

In a similar way to SQL, you can compute aggregates with Cypher. The main difference with SQL is that there is no need to use a GROUP BY statement; all fields that are not in an aggregation function will be used to create groups:

MATCH (FL:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
RETURN FL.name as state_name, COUNT(n.code) as number_of_neighbors

The result is the following one, as expected:

╒════════════╤═════════════════════╕
│"state_name"│"number_of_neighbors"│
╞════════════╪═════════════════════╡
│"Florida" │2 │
└────────────┴─────────────────────┘

The following aggregate functions are available:

  • AVG(expr): available for numeric values and durations
  • COUNT(expr): the number of rows with non-null expr
  • MAX(expr): the maximum value of expr over the group
  • MIN(expr): the minimum value of expr over the group
  • percentileCont(expr, p): the interpolated value of the p-th percentile of expr over the group
  • percentileDisc(expr, p): the nearest existing value to the p-th percentile of expr over the group
  • stDev(expr): the standard deviation of expr over the group
  • stDevP(expr): the population standard deviation of expr over the group
  • SUM(expr): available for numeric values and durations
  • COLLECT(expr): see the next section
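
Several of these functions can be combined in a single RETURN. A quick sketch on the state populations of our test graph:

MATCH (s:State)
RETURN COUNT(s) AS nb_states,
       MIN(s.population) AS min_pop,
       MAX(s.population) AS max_pop,
       AVG(s.population) AS avg_pop,
       percentileCont(s.population, 0.5) AS median_pop

Since every field in this RETURN is inside an aggregation function, the whole set of matched states forms a single group, and the query returns a single row.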

For instance, we can compute the ratio between a state population and the sum of all people living in its neighboring states like so:

MATCH (s:State)-[:SHARE_BORDER_WITH]-(n)
WITH s.name as state, toFloat(SUM(n.population)) as neighbor_population, s.population as pop
RETURN state, pop, neighbor_population, pop / neighbor_population as f
ORDER BY f desc

The WITH keyword is used to perform intermediate operations, passing named results from one part of the query to the next.

Creating a list of objects

It is sometimes useful to aggregate several rows into a single list of objects. In that case, we will use the following:

COLLECT

For instance, if we want to create a list containing the codes of the states sharing a border with Florida:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
RETURN COLLECT(n.code)

This returns the following result:

["GA","AL"]

Unnesting objects

Unnesting consists of converting a list of objects into rows, each row containing an item of the list. It is the exact opposite of COLLECT, which groups objects together into a list.

With Cypher, we will use the following statement:

UNWIND

For instance, the following two queries are equivalent:

MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
WITH COLLECT(n.code) as codes
UNWIND codes as c
RETURN c

// is equivalent to, since COLLECT and UNWIND cancel each other:
MATCH (:State {code: "FL"})-[:SHARE_BORDER_WITH]-(n)
RETURN n.code

This returns our well-known two state codes.

The UNWIND operation will be useful for data imports, since some file formats aggregate several pieces of information on a single row, as we will see in the next section.

Importing data from CSV or JSON

Even if you start your business with Neo4j as a core database, it is very likely you will have to import some static data into your graph. We will also need to perform that kind of operation within this book. In this section, we detail several ways of bulk-feeding Neo4j with different tools and different input data formats.

Data import from Cypher

Cypher itself contains utilities to import data in CSV format from a local or remote file.

File location

Whether importing CSV, JSON, or another file format, this file can be located in the following places:

  • Online and reachable through a public URL: 'http://example.com/data.csv'
  • On your local disk: 'file:///data.csv'

Local file: the import folder

In the latter case, with the default Neo4j configuration, the file has to be in the import folder of the database installation. Finding this folder is straightforward with Neo4j Desktop:

  1. Click on the Manage button on the graph you are interested in.
  2. Identify the Open folder button at the top of the new window.
  3. Click on the arrow next to this button and select Import.

This will open your file browser inside your graph import folder.

If you prefer the command line, instead of clicking on Open Folder, you can use the Open Terminal button. In my local Ubuntu installation, it opens a session whose working directory is as follows:

~/.config/Neo4j Desktop/Application/neo4jDatabases/database-c83f9dc8-f2fe-4e5a-8243-2e9ee29e67aa/installation-3.5.14

The path on your system will be different since you will have a different database ID and maybe a different Neo4j version.

This directory structure is as follows:

$ tree -L 1
.
├── bin
├── certificates
├── conf
├── data
├── import
├── lib
├── LICENSES.txt
├── LICENSE.txt
├── logs
├── metrics
├── NOTICE.txt
├── plugins
├── README.txt
├── run
└── UPGRADE.txt

10 directories, 5 files

Here are some notes about the content of this directory:

  • data: Actually contains your data, especially data/databases/graph.db/ – the folder you can copy from one computer to another to retrieve your graph data.
  • bin: Contains some useful executables such as the import tool we'll discuss in the next section.
  • import: Put the files you want to import into your graph here.
  • plugins: If you have installed the APOC plugin, you should see apoc-<version>.jar in this folder. All plugins are downloaded here, and if you want to add a plugin not officially supported by Neo4j Desktop, it is enough to copy its jar file into this directory.

Changing the default configuration to import a file from another directory

The default import folder can be configured by changing the dbms.directories.import parameter in the conf/neo4j.conf configuration file:

# This setting constrains all `LOAD CSV` import files to be under the `import` directory. Remove or comment it out to
# allow files to be loaded from anywhere in the filesystem; this introduces possible security problems. See the
# `LOAD CSV` section of the manual for details.
dbms.directories.import=import

CSV files

CSV files are imported using the LOAD CSV Cypher statement. Depending on whether you can/want to use the headers, the syntax is slightly different.

CSV files without headers

If your file does not contain column headers, or you prefer ignoring them, you can refer to columns by indexes:

LOAD CSV FROM 'path/to/file.csv' AS row
CREATE (:Node {name: row[1]})

Column indexes start with 0.
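
As a concrete sketch, the semicolon-delimited states file used later in this section could also be loaded by index. This is a hypothetical variant: since WITH HEADERS is not used, the header row arrives as a regular row and has to be skipped manually:

LOAD CSV FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
// drop the first (header) row, then create one State node per code
WITH row SKIP 1
MERGE (:State {code: row[0]})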

CSV files with headers

However, in most cases, you will have a CSV file with named columns. In that case, it is much more convenient to use a column header as a reference instead of numbers. This is possible with Cypher by specifying the WITH HEADERS option in the LOAD CSV query:

LOAD CSV WITH HEADERS FROM '<path/to/file.csv>' AS row
CREATE (:Node {name: row.name})

Let's practice with an example. The usa_state_neighbors_edges.csv CSV file has the following structure:

code;neighbor_code
NE;SD
NE;WY
NM;TX
...

This can be explained as follows:

  • code is the two-letter state identifier (for example, CO for Colorado).
  • neighbor_code is the two-letter identifier of a state sharing a border with the current state.

Our goal is to create a graph where each state is a node, and we create a relationship between two states if they share a common border.

So, let's get started:

  • Fields in this CSV file are delimited with semi-colons, ;, so we have to use the FIELDTERMINATOR option (the default is a comma, ,).
  • The first column contains a state code; we need to create the associated node.
  • The last column also contains a state code, so we have to check whether this state already exists and create it if not.
  • Finally, we can create a relationship between the two states, arbitrarily chosen to be oriented from the first state to the second one:
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (n:State {code: row.code})
MERGE (m:State {code: row.neighbor_code})
MERGE (n)-[:SHARE_BORDER_WITH]->(m)

This results in the graph displayed here, which is a representation of the United States:

It is interesting to note the special role of New York state, which completely splits the graph into two parts: states on one side of NY are never connected to a state from the other side of NY. Chapter 6, Node Importance, will describe the algorithms able to detect such nodes.

Our current graph structure has at least one problem: it does not contain states with no common borders, such as Alaska and Hawaii. To fix this issue, we will use another data file with a different format but that also contains the states without shared borders:

code;neighbors
CA;OR,NV,AZ
NH;VT,MA,ME
OR;WA,CA,ID,NV
...
AK;""
...

As you can see, we now have one row per state that contains a list of its neighbors. If the state does not have any neighbors, it is present in the file but the neighbors column contains a null value.

In reality, to prevent adding a relationship between states A and B and a second relationship between states B and A, the neighbors column only contains the neighbors with name < state_name. That's the reason why we have the row TX;"", while we know that Texas does have neighbors.

The query to import this file can be written as follows:

LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_all.csv" AS row FIELDTERMINATOR ';'
WITH row.code as state, split(row.neighbors, ',') as neighbors
MERGE (a:State {code: state})
WITH a, neighbors
UNWIND neighbors as neighbor
WITH a, neighbor
WHERE neighbor <> ""
MERGE (b:State {code: neighbor})
CREATE (a)-[:SHARE_BORDER_WITH]->(b)

A few notes to better understand this query:

  • We use the split() function to create a list from a comma-separated list of state codes.
  • The UNWIND operator creates one row for each element in the list of neighbor codes.
  • We need to filter out the states with no neighbors from the rest of the query since Cypher cannot use a NULL value as an identifier when merging nodes. However, since the WHERE clause happens after the first MERGE, states without neighbors will still be created.

If you see an error or unexpected results when using LOAD CSV, you can debug by returning intermediate results. This can be achieved, for instance, like this:

LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_all.csv" AS row FIELDTERMINATOR ';'
WITH row LIMIT 10
RETURN row

Using LIMIT is not mandatory, but it improves performance if you are debugging against a very large file.

Eager operations

If you look closely, Neo4j Desktop is showing a small warning sign next to the query text editor. If you click on this warning, it will show an explanation about it. In our case, it says this:

The execution plan for this query contains the Eager operator, which forces all dependent data to be materialized in main memory before proceeding

This is not directly related to a data import, but this is often the first time we face this warning message, so let's try to understand it.

The Neo4j documentation defines the Eager operator in this sentence:

For isolation purposes, the operator ensures that operations affecting subsequent operations are executed fully for the whole dataset before continuing execution.

In other words, each statement of the query is executed for all rows of the file before moving on to the next statement. This is usually not a problem, since a typical Cypher statement deals with a hundred nodes or so, but when importing large data files, the overhead is noticeable and may even lead to OutOfMemory errors. This then needs to be taken into account.

In the case of a data import, the Eager operator is used because we are using MERGE statements, which force Cypher to check whether the nodes and relationships already exist across the whole data file.

To overcome this issue, several solutions are possible, depending on the input data:

  • If we are sure the data file does not contain duplicates, we can replace MERGE operations with CREATE.
  • But most of the time, we will instead need to split the import statement into two or more parts.
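A toy sketch of the difference, with plain Python lists standing in for Cypher write semantics:

```python
# Why CREATE is only safe when the file has no duplicates:
# CREATE appends blindly, while MERGE first checks for an existing node.
rows = ["CA", "OR", "CA"]  # "CA" appears twice in this made-up file

created = []          # CREATE semantics: one node per row, no checks
for code in rows:
    created.append(code)

merged = []           # MERGE semantics: reuse the node if it already exists
for code in rows:
    if code not in merged:
        merged.append(code)

print(created)  # ['CA', 'OR', 'CA']  -> a duplicate State node
print(merged)   # ['CA', 'OR']
```

The existence check is what forces the Eager materialization; the blind append does not need it.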

The solution to load the US states would be to use three consecutive queries:

// first create starting state node if it does not already exist
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (:State {code: row.code})

// then create the end state node if it does not already exist
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (:State {code: row.neighbor_code})

// then create relationships
LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MATCH (n:State {code: row.code})
MATCH (m:State {code: row.neighbor_code})
MERGE (n)-[:SHARE_BORDER_WITH]->(m)

The first two queries create the State nodes. If a state code appears several times in the file, the MERGE operation will take care not to create two distinct nodes with the same code.

Once this is done, we read the same file again to create the neighborhood relationships: we start by reading the two State nodes from the graph with MATCH operations and then create a unique relationship between them. Here again, we used the MERGE operation rather than CREATE to prevent having the same relationship twice between the same two nodes.
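The three-query logic boils down to "collect unique nodes first, then unique relationships"; sketched in plain Python on a few made-up rows:

```python
# Two-pass import: sets play the role of MERGE's uniqueness guarantee.
rows = [
    {"code": "CA", "neighbor_code": "AZ"},
    {"code": "CA", "neighbor_code": "NV"},
    {"code": "AZ", "neighbor_code": "NV"},
]

nodes = set()
for row in rows:                      # queries 1 and 2: MERGE each endpoint
    nodes.add(row["code"])
    nodes.add(row["neighbor_code"])

edges = set()
for row in rows:                      # query 3: MATCH both ends, MERGE the edge
    edges.add((row["code"], row["neighbor_code"]))

print(sorted(nodes))  # ['AZ', 'CA', 'NV']
print(len(edges))     # 3
```

Because the node pass is finished before the relationship pass starts, no single query has to check node existence and create relationships at the same time, which is what removes the Eager operator.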

We had to split the first two statements into two separate queries because they are acting on the same node label. However, a statement like the following one will not rely on the Eager operator:

LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row 
MERGE (:User {id: row.user_id})
MERGE (:Product {id: row.product_id})

Indeed, since the two MERGE statements involve two different node labels, Cypher does not have to execute all the operations on the first label to make sure there is no conflict with the second one; the operations are independent.

In the APOC utilities for imports section, we will study another representation of the US dataset, which we will be able to import without writing three different queries.

Before that, let's have a look at the built-in Neo4j import tool.

Data import from the command line

Neo4j also provides a command-line import tool. The executable is located at $NEO4J_HOME/bin/neo4j-admin (with the import subcommand). It requires several CSV files:

  • One or several CSV file(s) for nodes with the following format:
id:ID,:LABEL,code,name,population_estimate_2019:int
1,State,CA,California,40000000
2,State,OR,Oregon,4000000
3,State,AZ,Arizona,7000000
  • It is mandatory to have a unique identifier for nodes. This identifier is marked with the :ID keyword.
  • All fields are parsed as strings, unless a type is specified in the header with the :type syntax.
  • One or several CSV file(s) for relationships with the following format:
:START_ID,:END_ID,:TYPE,year:int
1,2,SHARE_BORDER_WITH,2019
1,3,SHARE_BORDER_WITH,2019
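As a sketch of the header convention (not the actual import tool parser), a few lines of Python can show how each column maps to a name/type pair, with :ID and :LABEL as reserved markers and string as the default type:

```python
def parse_header(header):
    """Split an import-tool CSV header into (column_name, type) pairs."""
    fields = []
    for col in header.split(","):
        name, _, ftype = col.partition(":")  # "pop:int" -> ("pop", "int")
        fields.append((name, ftype or "string"))  # no :type suffix -> string
    return fields

print(parse_header("id:ID,:LABEL,code,name,population_estimate_2019:int"))
# [('id', 'ID'), ('', 'LABEL'), ('code', 'string'),
#  ('name', 'string'), ('population_estimate_2019', 'int')]
```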

Once the data files are created and located in the import folder, you can run the following:

bin/neo4j-admin import --nodes=import/states.csv --relationships=import/rel.csv

If you have very large files, the import tool can be much more convenient since it can manage compressed files (.tar, .gz, or .zip) and also understands header definitions stored in separate files, which makes the data files easier to open and update.

The full documentation about the import tool can be found at https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/.

APOC utilities for imports

The APOC library is a Neo4j extension that contains several tools to ease working with this database:

  • Data import and export: from and to different formats like CSV and JSON but also HTML or a web API
  • Data structure: advanced data manipulation, including type conversion functions, maps, and collections management
  • Advanced graph querying functions: tools to enhance pattern matching, including more conditions
  • Graph projections: with virtual nodes and/or relationships

The first implementations of graph algorithms were done within that library, even if they have now been deprecated in favor of a dedicated plugin we will discover in part 2 of this book.

We will only detail in this section the tools related to data import, but I encourage you to take a look at the documentation to learn what can be achieved with this plugin: https://neo4j.com/docs/labs/apoc/current/.

When executing the code in the rest of this chapter, you may get an error saying the following:

There is no procedure with the name apoc.load.jsonParams registered for this database instance

If so, you will have to add the following line to your neo4j.conf file (the Settings tab in the Graph Management area in Neo4j Desktop):

dbms.security.procedures.whitelist=apoc.load.*

CSV files

The APOC library contains a procedure to import CSV files. Among other values, it yields a map from column name to value for each line, so the syntax looks like the following (the file name and the name and age columns are placeholders):

CALL apoc.load.csv('persons.csv') YIELD map
CREATE (:Person {name: map.name, age: map.age})

As an exercise, try and import the USA state data with this procedure.

Similar to the LOAD CSV statement, the file to be imported needs to be inside the import folder of your graph. However, you should not include the file:// prefix, which would trigger an error.

JSON files

More importantly, APOC also contains a procedure to import data from JSON, which is not possible yet with vanilla Cypher. The structure of the query is as follows:

CALL apoc.load.json('http://...') YIELD value
UNWIND value.items AS item
CREATE (:Node {name: item.name})

As an example, we will import some data from GitHub using the GitHub API: https://developer.github.com/v3/.

We can get the list of repositories owned by the organization Neo4j with this request:

curl -u "<your_github_username>" https://api.github.com/orgs/neo4j/repos

Here is a sample of the data you can get for the given repository (with chosen fields):

{
    "id": 34007506,
    "node_id": "MDEwOlJlcG9zaXRvcnkzNDAwNzUwNg==",
    "name": "neo4j-java-driver",
    "full_name": "neo4j/neo4j-java-driver",
    "private": false,
    "owner": {
        "login": "neo4j",
        "id": 201120,
        "node_id": "MDEyOk9yZ2FuaXphdGlvbjIwMTEyMA==",
        "html_url": "https://github.com/neo4j",
        "followers_url": "https://api.github.com/users/neo4j/followers",
        "following_url": "https://api.github.com/users/neo4j/following{/other_user}",
        "repos_url": "https://api.github.com/users/neo4j/repos",
        "type": "Organization"
    },
    "html_url": "https://github.com/neo4j/neo4j-java-driver",
    "description": "Neo4j Bolt driver for Java",
    "contributors_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/contributors",
    "subscribers_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/subscribers",
    "commits_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/commits{/sha}",
    "issues_url": "https://api.github.com/repos/neo4j/neo4j-java-driver/issues{/number}",
    "created_at": "2015-04-15T17:08:15Z",
    "updated_at": "2020-01-02T10:20:45Z",
    "homepage": "",
    "size": 8700,
    "stargazers_count": 199,
    "language": "Java",
    "license": {
        "key": "apache-2.0",
        "name": "Apache License 2.0",
        "spdx_id": "Apache-2.0",
        "node_id": "MDc6TGljZW5zZTI="
    },
    "default_branch": "4.0"
}

We will import this data into a new graph, using APOC. To do so, we have to enable file import with APOC by adding the following line to the Neo4j configuration file (neo4j.conf):

apoc.import.file.enabled=true 

Let's now read this data. You can see the result of the apoc.load.json procedure with the following:

CALL apoc.load.json("neo4j_repos_github.json") YIELD value AS item
RETURN item
LIMIT 1

This query produces a result similar to the preceding sample JSON. To access the fields of each JSON document, we can use the item.<field> notation. So, here is how to create a node for each repository and its owner, and a relationship between the owner and the repository:

CALL apoc.load.json("neo4j_repos_github.json") YIELD value AS item
CREATE (r:Repository {name: item.name, created_at: item.created_at, contributors_url: item.contributors_url})
MERGE (u:User {login: item.owner.login})
CREATE (u)-[:OWNS]->(r)
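The item.<field> traversal used above can be checked in plain Python against a trimmed version of the sample payload (only a few fields of the real document are kept here):

```python
import json

# Minimal stand-in for one record returned by apoc.load.json.
item = json.loads("""
{
  "name": "neo4j-java-driver",
  "created_at": "2015-04-15T17:08:15Z",
  "owner": {"login": "neo4j", "type": "Organization"}
}
""")

# item.name / item.owner.login in Cypher map to plain key lookups here.
repository = {"name": item["name"], "created_at": item["created_at"]}
owner = {"login": item["owner"]["login"]}

print(owner["login"], "OWNS", repository["name"])
# neo4j OWNS neo4j-java-driver
```

Nested objects such as owner simply become nested lookups, which is why a single CREATE/MERGE pair is enough to build both nodes from one JSON document.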

Checking the content of the graph, we can see this kind of pattern:

We can do the same to import all the contributors to the Neo4j repository:

CALL apoc.load.json("neo4j_neo4j_contributors_github.json") 
YIELD value AS item
MATCH (r:Repository {name: "neo4j"})
MERGE (u:User {login: item.login})
CREATE (u)-[:CONTRIBUTED_TO]->(r)

Importing data from a web API

You may have noticed that the JSON returned by GitHub contains a URL to extend our knowledge about repositories or users. For instance, in the neo4j_neo4j_contributors_github.json file, there is a followers URL. Let's see how to use APOC to feed the graph with the result of this API call.

Setting parameters

We can set parameters within Neo4j Browser with the following syntax:

:params {"repo_name": "neo4j"}

The parameters can then be referred to in later queries with the $repo_name notation:

MATCH (r:Repository {name: $repo_name}) RETURN r

This can be very useful when the parameter is used in multiple places in the query.

In the next section, we will perform HTTP requests to the GitHub API directly from Cypher. You'll need a GitHub token for authentication, saved as a parameter:

:params {"token": "<your_token>"}

The token is not strictly required, but the rate limits for unauthenticated requests are much lower, so it is easier to create one by following the instructions here: https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line#creating-a-token.

Calling the GitHub web API

We can use apoc.load.jsonParams to load a JSON file from a web API, setting the HTTP request headers in the second parameter of this procedure:

CALL apoc.load.json("neo4j_neo4j_contributors_github.json") YIELD value AS item
MATCH (u:User {login: item.login})
CALL apoc.load.jsonParams(item.followers_url, {Authorization: 'Token ' + $token}, null) YIELD value AS contrib
MERGE (f:User {login: contrib.login})
CREATE (f)-[:FOLLOWS]->(u)

When performing the import, I got the following results:

Added 439 labels, created 439 nodes, set 439 properties, created 601 relationships, completed after 12652 ms.

This may vary when you run this since a given user's followers evolve over time. Here is the resulting graph, where users are shown in green and repositories in blue:

You can use any of the provided URLs to enrich your graph, depending on the kind of analysis you want to perform: you can add commits, contributors, issues, and so on.

Summary of import methods

Choosing the right tool to import your data mainly depends on its format. Here are some overall recommendations:

  • If you only have JSON files, then apoc.load.json is your only option.
  • If you are using CSV files, then:
      • If your data is big, use the command-line import tool.
      • If your data is small or medium-sized, you can use APOC or Cypher's LOAD CSV.

This closes our section about data imports, where we learned how to feed a Neo4j graph with existing data, from CSV, JSON, or even via a direct call to a web API. We will use those tools all through this book to have meaningful data to run the algorithms on.

Before moving on to those algorithms, a final step is needed. Indeed, as with SQL, there are often several Cypher queries producing the same result, but not all of them have the same efficiency. The next section will show you how to measure efficiency and deal with some good practices to avoid the main caveats. 

Measuring performance and tuning your query for speed

In order to measure Cypher query performance, we have to look at the Cypher query planner, which details the operations performed under the hood. In this section, we introduce the Cypher execution plan and how to access it. We will also cover some good practices to avoid the worst-performing operations, before concluding with a well-known example.

Cypher query planner

As you would do with SQL, you can check the Cypher query planner to understand what happens under the hood and how to improve your query. Two options are possible:

  • EXPLAIN: This does not actually run the query, so it won't make any changes to your graph; it only shows the estimated execution plan.
  • PROFILE: This actually runs the query, altering your graph if it contains write operations, and measures its performance.

In the rest of this chapter, we will use a dataset released by Facebook in 2012 for a recruiting competition hosted by Kaggle. The dataset can be downloaded here: https://www.kaggle.com/c/FacebookRecruiting/data. I have only used the training sample, containing a list of connections between anonymized people. It contains 1,867,425 nodes and 9,437,519 edges.

We already talked about one of the operations that can be identified in the query planner: Eager operations, which we need to avoid as much as possible since they really hurt performance. Let's see some more operators and how to tune our queries for performance.

A simple profiled query selecting a node with a given id can be written as follows:

PROFILE
MATCH (p { id: 1000})
RETURN p

When executing this query, a new tab is available in the result cell called Plan, shown in the following screenshot:

The query profile shows the use of the AllNodesScan operator, which performs a search on ALL nodes of the graph. In this specific case, this won't have a big impact since we have only one node label, Person. But if your graph happens to have many different labels, performing a scan on all nodes can be horribly slow. For this reason, it is highly recommended to explicitly set the node labels and relationship types of interest in our queries:

PROFILE
MATCH (p:Person { id: 1000})
RETURN p

In that case, Cypher uses the NodeByLabelScan operation as can be seen in the following screenshot:

In terms of performance, this query is executed in approximately 650 ms on my laptop, in both cases. In some cases, performance can be increased even more thanks to Neo4j indexing.

Neo4j indexing

Neo4j indexes are used to easily find the start node of a pattern matching query. Let's see the impact of creating an index on the execution plan and execution time:

CREATE INDEX ON :Person(id)

And let's run our query again:

PROFILE
MATCH (p:Person { id: 1000})
RETURN p

You can see that the query is now using our index through the NodeIndexSeek operation, which reduces the execution time to 1 ms:

An index can also be dropped with the following statement:

DROP INDEX ON :Person(id)

The Neo4j indexing system also supports combined indexes and full-text indexes. Check https://neo4j.com/docs/cypher-manual/current/schema/index/ for more information.

Back to LOAD CSV

Remember the Eager operator we talked about earlier in this chapter, which appeared when importing the US states with the following LOAD CSV statement:

LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (n:State {code: row.code})
MERGE (m:State {code: row.neighbor_code})
MERGE (n)-[:SHARE_BORDER_WITH]->(m)

To better understand it and identify the root cause of this warning message, we ask Neo4j to EXPLAIN it. We would then get a complex diagram like the one displayed here:

I have highlighted three elements for you:

  • The violet part corresponds to the first MERGE statement.
  • The green part contains the same operations for the second MERGE statement.
  • The red box is the Eager operation.

From this diagram, you can see that the Eager operation is performed between step 1 (the first MERGE) and step 2 (the second MERGE). This is where your query needs to be split in order to avoid this operator.

You now know more about how to understand the operations Cypher performs when executing your query and how to identify and fix bottlenecks. It is time to actually measure query performance in terms of time. For this, we are going to use the famous friend-of-friend example in a social network.

The friend-of-friend example

The friend-of-friend example is the most famous argument in favor of Neo4j when talking about performance. Since Neo4j is known to be incredibly fast at traversing relationships, unlike many other database engines, we expect the response time of this query to be quite low.

Neo4j Browser displays the query execution time in the result cell:

It can also be measured programmatically. For instance, using the Neo4j Python driver from the neo4j package, we can measure the total execution and streaming time with the following:

from neo4j import GraphDatabase

URL = "bolt://localhost:7687"
USER = "neo4j"
PWD = "neo4j"

driver = GraphDatabase.driver(URL, auth=(USER, PWD))

query = "MATCH (a:Person {id: 203749})-[:IS_FRIEND_WITH]-(b:Person) RETURN count(b.id)"

with driver.session() as session:
    with session.begin_transaction() as tx:
        result = tx.run(query)
        summary = result.summary()
        avail = summary.result_available_after  # ms
        cons = summary.result_consumed_after  # ms
        total_time = avail + cons

With that code, we were able to measure the total time of execution for different starting nodes, with different degrees, and different depths (first-degree friends, second degree... up to the fourth degree).

The following figure shows the results. As you can see, the amount of time before the results are made available is below 1 ms for all depth-1 queries, independently of the number of first-degree neighbors of the node:

The time Neo4j needs to get the results increases with the depth of the query, as expected. However, you can see that the time difference between the number of friends for the initial node becomes really important only when having a lot of friends. When starting from the node with 100 friends at depth 4, the number of matching nodes is almost 450,000, identified within 1 minute approximately.
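The depth-N queries behind this benchmark follow Cypher's variable-length pattern (for example, MATCH (a:Person {id: ...})-[:IS_FRIEND_WITH*..4]-(b:Person)). A toy breadth-first search in Python illustrates why the matched set grows quickly with depth (the six-node graph below is made up):

```python
from collections import deque

# Toy undirected friendship graph as an adjacency dict.
friends = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4, 5},
    4: {2, 3, 6},
    5: {3},
    6: {4},
}

def reachable_within(start, depth):
    """All nodes reachable from `start` in at most `depth` hops (excluding start)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the requested depth
        for nxt in friends[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen - {start}

print([len(reachable_within(1, d)) for d in (1, 2, 3)])  # [2, 4, 5]
```

Each extra hop multiplies the frontier by the average number of friends, which is why the depth-4 queries on the Facebook graph match hundreds of thousands of nodes from a single start node.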

This benchmark was performed without any changes to the Neo4j Community Edition default configuration. Some gain is to be expected by tuning some of those parameters, such as the maximum heap size.

More information about these configurations will be given in Chapter 12, Neo4j at Scale.

Summary

In this chapter, you learned how to navigate into your Neo4j graph. You are now able to perform CRUD operations with Cypher, creating, updating, and deleting nodes, relationships, and their properties.

But the full power of Neo4j lies in relationship traversal (going from one node to its neighbors is extremely fast) and in the pattern matching you can now perform with Cypher.

You have also discovered how to measure your query performance with the Cypher query planner. This can help you to avoid some pitfalls, such as the Eager operation when loading data. It will also help in understanding Cypher internals and tuning your query for better performance in terms of speed.

We now have all the tools in hand to start really using Neo4j and study some real-life examples. In the next chapter, we will learn about knowledge graphs. For many organizations, this is the first entry point to the world of graphs. With that data structure, we will be able to implement performant recommendation engines and graph-based search for your customers.

Questions

  1. US states:
    • Find the first-degree neighbors of Colorado (code CO):
      • What are the codes of these neighbors?
      • Can you count how many there are using an aggregate function?
    • Which states have the highest number of neighbors? The lowest? Think about the ORDER BY clause.
  2. GitHub graph enhancement:
    • From neo4j_repos_github.json, can you save the project language when provided, as a new node? You can use the new Language node label.
    • Can you also save the license, when provided, using a new node? Use a Licence node label and save all of the provided information from GitHub as properties.
      (Hint: the license is provided with the following format):
"license": {
    "key": "other",
    "name": "Other",
    "spdx_id": "NOASSERTION",
    "url": null,
    "node_id": "MDc6TGljZW5zZTA="
},
    • Using the GitHub API, can you save the user locations?
      Hint: the URL to get user information is https://api.github.com/users/<login>.
    • The location often contains the city and country name, separated by either a space or a comma. Can you write a query to save only the first element of this pair, assumed to be the city?
      Hint: checkout APOC text utils.
    • Using the GitHub API, can you retrieve the repositories owned by each Neo4j contributor?
      Hint: the URL to get the repositories for a given user is https://api.github.com/users/<login>/repos.
    • Which location is the most represented among Neo4j contributors?
