Neo4j at Scale

Now that we can use Neo4j and graph algorithms to answer many questions, it is time to consider how Neo4j manages data of different sizes. Neo4j versions prior to 4.0 could already deal with huge amounts of data (billions of nodes and relationships), limited only by disk size. Neo4j 4.0 has (almost) overcome even these limitations by introducing sharding, a technique that spreads a graph across several machines. The first section of this chapter investigates the performance of the GDS on large graphs. The rest of the chapter then provides an introduction to sharding, both in terms of shard definition and the new Cypher statement that was introduced especially to query such graphs.

The following topics will be covered in this chapter:

  • Measuring GDS performance
  • Configuring Neo4j 4.0 for big data

Let's get started!

Technical requirements

While the first section of this chapter can still be completed with Neo4j 3.x, the last section uses sharding, which was only introduced in version 4 of Neo4j. Thus, the following configuration is required for that section:

  • Neo4j ≥ 4.0
  • GDS ≥ 1.2

Measuring GDS performance

The Graph Data Science library has been built for big datasets; its graph representation and algorithms are optimized for large graphs. However, for efficiency's sake, all operations are performed in the heap, which is why estimating the memory required to run a given algorithm on a given projected graph is important.

Estimating memory usage with the estimate procedures

GDS algorithms are run on an in-memory projected graph. The library provides helper procedures, which can be used to predict the memory usage required for storing a projected graph and running a given algorithm. These estimations are performed via the estimate execution mode, which can be appended to graph creation or algorithm execution procedures.

Estimating projected graph memory usage

Projected graphs are stored entirely in-memory (in the heap). In order to know how much memory is required to store a projected graph with the given nodes, relationships, and properties, we can use the following procedures:

gds.graph.create.estimate(...)
// or
gds.graph.create.cypher.estimate(...)

There are several ways to use these procedures, but in all cases, they will return, among other fields, the following parameters:

  • requiredMemory: The total required RAM
  • mapView: A detailed breakdown of how the required memory is distributed between nodes, relationships, and properties
  • nodeCount: The total number of nodes in the projected graph
  • relationshipCount: The total number of relationships in the projected graph

Let's study a few examples.

Fictitious graph

The projected graph stores the following:

  • Nodes, as Neo4j internal identifiers (id(n))
  • Relationships, as pairs of node IDs
  • Properties, if requested, as floating-point numbers (8 bytes each)

This means that the memory requirements can be estimated as soon as we know how many nodes, relationships, and properties the projected graph is about to contain by using the following query:

CALL gds.graph.create.estimate('*', '*', {
    nodeCount: 10000,
    relationshipCount: 2000
})

The mapView field looks as follows:

{
  "name": "HugeGraph",
  "components": [
    {
      "name": "this.instance",
      "memoryUsage": "72 Bytes"
    },
    {
      "name": "nodeIdMap",
      "components": [
            ...
      ],
      "memoryUsage": "175 KiB"
    },
    {
      "name": "adjacency list for 'RelationshipType{name='__ALL__'}'",
      "components": [
            ...
      ],
      "memoryUsage": "256 KiB"
    },
    {
      "name": "adjacency offsets for 'RelationshipType{name='__ALL__'}'",
      "components": [
            ...
      ],
      "memoryUsage": "80 KiB"
    }
  ],
  "memoryUsage": "511 KiB"
}

This tells you that the total memory usage of your graph will be around 511 KiB (last line). It also helps you identify which parts consume most of this memory so that, if needed, you can reduce the projected graph to the bare minimum for your needs.
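If you are only interested in the summary fields rather than the full mapView, you can YIELD them explicitly. Here is a minimal sketch using the same fictitious counts as above:

CALL gds.graph.create.estimate('*', '*', {
    nodeCount: 10000,
    relationshipCount: 2000
})
YIELD requiredMemory, nodeCount, relationshipCount
RETURN requiredMemory, nodeCount, relationshipCount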

Graph defined by native or Cypher projection

Estimating the memory requirement for a projected graph defined with Cypher projection is achieved thanks to the following syntax:

CALL gds.graph.create.cypher.estimate(
    "MATCH (n) RETURN id(n) as id",
    "MATCH (u)--(v) RETURN id(u) as source, id(v) as target"
)

Note the difference compared to real graph creation with Cypher projection, which would be as follows:

CALL gds.graph.create.cypher(
    "projGraphName",
    "MATCH (n) RETURN id(n) as id",
    "MATCH (u)--(v) RETURN id(u) as source, id(v) as target"
)

The graph name parameter has to be skipped in the estimate scenario.
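Native projections can be estimated in the same way, against labels and relationship types that exist in the store. The following is a minimal sketch; the Node label and IS_FRIEND_WITH relationship type are illustrative assumptions:

CALL gds.graph.create.estimate("Node", "IS_FRIEND_WITH")
YIELD requiredMemory, nodeCount, relationshipCount
RETURN requiredMemory, nodeCount, relationshipCount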

Knowing how much memory is required by a projected graph is one thing. But usually, we create projected graphs to run some algorithms on them. Therefore, being able to estimate the memory consumption of algorithms is also crucial to determining the size of the machine we will need to use.

Estimating algorithm memory usage

For each algorithm in the production quality tier, the GDS implements an estimate procedure similar to the one studied in the preceding paragraph for projected graphs. It can be used by appending estimate to any algorithm execution mode:

CALL gds.<ALGO>.<MODE>.estimate(graphNameOrConfig: String|Map, configuration: Map)

The algorithms in the alpha tier do not benefit from this feature.

The returned parameters are as follows:

YIELD
    requiredMemory,
    treeView,
    mapView,
    bytesMin,
    bytesMax,
    heapPercentageMin,
    heapPercentageMax,
    nodeCount,
    relationshipCount

The parameter names are self-explanatory and similar to the projected graph estimation procedure. By estimating the memory usage of both the projected graph and the algorithms you want to run on them, you can get a fairly good estimate of the heap required and use a well-sized server for your analysis.
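For instance, the following sketch estimates the memory needed to run PageRank in write mode; the projected graph name graph and the writeProperty value are illustrative assumptions:

CALL gds.pageRank.write.estimate("graph", {writeProperty: "score"})
YIELD requiredMemory, bytesMin, bytesMax
RETURN requiredMemory, bytesMin, bytesMax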

The stats running mode

Another interesting feature introduced by the GDS (compared to its predecessor, the Graph Algorithms Library) is the stats mode. It runs the algorithm without altering the graph, but returns some information about the projected graph and the algorithm's results. The list of returned parameters is essentially the same as the one from the write mode, minus the write-specific fields (such as writeMillis).
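As an illustration, here is a minimal sketch running the Weakly Connected Components algorithm in stats mode; the projected graph name graph is an assumption:

CALL gds.wcc.stats("graph")
YIELD componentCount, computeMillis
RETURN componentCount, computeMillis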

In the following section, we will study an example that combines memory estimation (estimate) and execution statistics (stats) for both a projected graph and an algorithm.

Measuring the time performance of some algorithms

When using an algorithm's write procedure, meaning the results are written back into the main Neo4j graph, the procedure returns some information about its execution time:

  • createMillis: Time to create the projected graph
  • computeMillis: Time to run the algorithm
  • writeMillis: Time to write the results back

Let's use these parameters to check the performance of the PageRank and Louvain algorithms on a slightly larger graph than the ones we have used so far. To do so, we are going to use a social network graph provided by Facebook for a recruiting Kaggle competition. The dataset can be downloaded from https://www.kaggle.com/c/FacebookRecruiting/overview. It contains 1,867,425 nodes and 9,437,519 relationships. You can import it into Neo4j using your favorite import tool (LOAD CSV, APOC, or the Neo4j command-line import tool); a LOAD CSV sketch is given after this paragraph. In the rest of this example, I'll use Node as the node label and IS_FRIEND_WITH as the relationship type.
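Here is a minimal LOAD CSV sketch. The file name (train.csv) and the column names (source_node and destination_node) are assumptions about the Kaggle file; adapt them to the file you actually download:

// Index the id property first so that the MERGE lookups stay fast
CREATE INDEX FOR (n:Node) ON (n.id);

// Assumed file layout: a header row with source_node and destination_node columns
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///train.csv' AS row
MERGE (u:Node {id: toInteger(row.source_node)})
MERGE (v:Node {id: toInteger(row.destination_node)})
MERGE (u)-[:IS_FRIEND_WITH]->(v);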

In order to use the GDS, we need to create a projected graph that contains all our nodes and undirected relationships:

CALL gds.graph.create(
    "graph",
    "*",
    {
        IS_FRIEND_WITH: {
            type: "IS_FRIEND_WITH",
            orientation: "UNDIRECTED"
        }
    }
)

Then, we can estimate the memory required to run the Louvain algorithm in write mode with the following command:

CALL gds.louvain.write.estimate("graph", {writeProperty: "pr"})

This tells us that only around 56 MB of the heap is required to run the algorithm.

Now, run the following command:

CALL gds.louvain.stats("graph")

Since the stats mode actually runs the algorithm (without writing any results back), this gives us a realistic measurement of the time required to run the Louvain algorithm, via the computeMillis field.

Because the algorithm really is executed, this can take quite a long time, so be patient.
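To measure the write timings listed at the beginning of this section, you can run the algorithm in write mode and YIELD the timing fields. Here is a sketch; the writeProperty name is arbitrary:

CALL gds.louvain.write("graph", {writeProperty: "community"})
YIELD createMillis, computeMillis, writeMillis
RETURN createMillis, computeMillis, writeMillis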

Now, you should be able to estimate how much memory is required by your graph to execute a given algorithm. You should also be able to estimate how much time it will take for an algorithm to converge on your data.

In the next section, we will come back to Neo4j and Cypher and learn about the novelties introduced in version 4.0 for very large graphs.

Configuring Neo4j 4.0 for big data

Neo4j 4.0 was announced in February 2020. Among other new features, it is the first version that supports sharding to split a graph across several servers. We will discuss this further in the following section. First, let's review some basic settings that can improve Neo4j's performance before thinking about complex solutions.

The landscape prior to Neo4j 4.0

Neo4j 3.x is already an incredibly powerful database that can manage billions of nodes and relationships. One of the first settings you learn to tune when getting the Neo4j certification is the allocated memory.

Memory settings

Memory is configured by several settings in the neo4j.conf file. The default values are quite low and it is often recommended to increase them, depending on your data and usage. Neo4j has a utility script that's used to estimate the memory required by your graph instance. The script is located in the bin directory of your $NEO4J_HOME path. When using Neo4j from Neo4j Desktop, you can open a Terminal from your graph management tab and it will automatically open in that folder. Then, you can execute memrec with the following bash command:

./bin/neo4j-admin memrec

The output of this command contains the following block:

dbms.memory.heap.initial_size=3500m
dbms.memory.heap.max_size=3500m
dbms.memory.pagecache.size=1900m

You can use these indications to update neo4j.conf (or graph settings in Neo4j Desktop) and then restart your database.

Neo4j in the cloud

It has been possible to run Neo4j in the cloud in a completely automated and managed way for a long time now. In fact, Neo4j provides images that can be used on Google Cloud or AWS.

Since 2019, Neo4j has also had its own Database-as-a-Service solution with Neo4j Aura, an all-in-one solution used to run a Neo4j database without the need to worry about infrastructure. Check the deployment guide for more information and links: https://neo4j.com/developer/guide-cloud-deployment/.

Sharding with Neo4j 4.0

Sharding is a technique that involves splitting a database across multiple servers when the amount of data to be stored is larger than a single server's capacity. This only happens with very large datasets, since the current hardware capacities for a single server can reach several terabytes of data. An application would have to generate GBs of data per day to reach these limitations. Now, let's address how to manage these big datasets using Neo4j.

Neo4j has developed a tool especially for this purpose, called fabric.

If you create a new graph using Neo4j 4.0 in Neo4j Browser, you will notice that the active database is now highlighted next to the query box.
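You can also switch the active database directly from Neo4j Browser with the :use command; for instance, to target the system database, you would run the following:

:use system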

Defining shards

The first thing to do when dealing with shards is to define how to split the data. Indeed, sharding is powerful, but it is not without its limitations. For instance, an important limitation of sharding is the inability to traverse a relationship from one shard to another. However, some nodes can be duplicated in order to avoid this situation, so it all depends on your data model and how you are querying it.

Imagine we have data from an e-commerce website that's registering orders from customers all around the world. Several dimensions can be considered for sharding:

  • Spatial-based: All orders coming from the same continent or country can be grouped together in the same shard.
  • Temporal-based: All nodes created on the same day/week/month can be saved in the same shard.
  • Product-based: Orders containing the same products can be grouped together.

The last situation assumes that there are clear partitions between ordered products; otherwise, we would have to duplicate almost all the product nodes.

Once we have a better idea of how to split our data and how many databases will be needed, we can proceed to the configuration in neo4j.conf. As an example, we will use two local databases that can be configured by adding the following lines to neo4j.conf:

fabric.database.name=fabric

fabric.graph.0.name=gr1
fabric.graph.0.uri=neo4j://localhost:7687
fabric.graph.0.database=db1

fabric.graph.1.name=gr2
fabric.graph.1.uri=neo4j://localhost:7687
fabric.graph.1.database=db2

Before we move on, we need to do one more thing: create the two databases (db1 and db2).

Creating the databases

In Neo4j Browser, from the left panel, navigate to the Database Information tab.

There, you can choose which database will be used. While the default is neo4j, we will connect to the system database in order to be able to create new ones. Here, select System from the drop-down menu. Neo4j Browser will let you know the following:

"Queries from this point on are using the database system as the target."

Here, you can create the two databases needed for this section:

CREATE DATABASE db1;
CREATE DATABASE db2;
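You can check that both databases were created and are online with the following command:

SHOW DATABASES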

Now, you can switch to the main entry point for our sharded graph: the database called fabric (again, from the left menu).

Now that we have told Neo4j that the data will be split across several databases and have created these databases, let's learn how to tell Cypher which shard to use.

Querying a sharded graph

Cypher in Neo4j 4.0 introduced a new keyword, USE, which we can use to query a sharded graph. Together with the fabric utilities, this allows us to query data from different shards using a single query.

The USE statement

The USE Cypher statement was introduced in Neo4j 4.0 to tell Cypher which shard to use. For instance, we can add some data to one of our databases with the following query:

USE fabric.gr1
CREATE (c:Country {name: "USA"})
CREATE (u1:User {name: "u1"})
CREATE (u2:User {name: "u2"})
CREATE (u1)-[:LIVES_IN]->(c)
CREATE (u2)-[:LIVES_IN]->(c)
CREATE (o1:Order {name: "o1"})
CREATE (o2:Order {name: "o2"})
CREATE (o3:Order {name: "o3"})
CREATE (u1)-[:ORDERED]->(o1)
CREATE (u2)-[:ORDERED]->(o2)
CREATE (u2)-[:ORDERED]->(o3)

We can retrieve data from this shard using the following code:

USE fabric.gr1
MATCH (order:Order)<--(:User)-->(country:Country)
RETURN "gr1" as db, country.name, count(*) AS nbOrders

The preceding query will only use nodes in the gr1 database, as defined in the fabric configuration section of neo4j.conf. You can run the same query against the other shard and get no result, since we have not created any data in gr2 yet:
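USE fabric.gr2
MATCH (order:Order)<--(:User)-->(country:Country)
RETURN "gr2" as db, country.name, count(*) AS nbOrders

If you are interested in all your data, without filtering on a specific database, this is also possible using fabric, as we will see now.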

Querying all databases

To query all known shards, we will have to use two other functions:

  • fabric.graphIds(), which returns the IDs of all the known graphs. You can test its result with the following command:

RETURN fabric.graphIds()

  • fabric.graph(graphId), which returns the graph with the given ID, directly usable in a USE statement.

This will essentially allow us to perform a loop over all our shards:

UNWIND fabric.graphIds() AS graphId
CALL {
    USE fabric.graph(graphId)
    WITH graphId
    MATCH (order:Order)<--(:User)-->(country:Country)
    // graph IDs are 0-based, while our shards are named gr1 and gr2
    RETURN "gr" + (graphId + 1) as db,
           country.name as countryName,
           count(*) AS nbOrders
}
RETURN db, countryName, sum(nbOrders)

Here is the result of the preceding query:

╒═════╤═════════════╤═══════════════╕
│"db" │"countryName"│"sum(nbOrders)"│
╞═════╪═════════════╪═══════════════╡
│"gr1"│"USA"        │3              │
└─────┴─────────────┴───────────────┘

Here, we can see the three orders from the USA that we created earlier in this section in gr1.

Now, let's say you add some data to gr2, for instance:

USE fabric.gr2
CREATE (c:Country {name: "UK"})
CREATE (u1:User {name: "u11"})
CREATE (u2:User {name: "u12"})
CREATE (u1)-[:LIVES_IN]->(c)
CREATE (u2)-[:LIVES_IN]->(c)
CREATE (o1:Order {name: "o11"})
CREATE (o2:Order {name: "o12"})
CREATE (u1)-[:ORDERED]->(o1)
CREATE (u2)-[:ORDERED]->(o2)

By doing this, the preceding MATCH query returns the following:

╒═════╤═════════════╤═══════════════╕
│"db" │"countryName"│"sum(nbOrders)"│
╞═════╪═════════════╪═══════════════╡
│"gr1"│"USA"        │3              │
├─────┼─────────────┼───────────────┤
│"gr2"│"UK"         │2              │
└─────┴─────────────┴───────────────┘

With this, you should now understand that sharding opens up vast possibilities for data storage, both in terms of volume and in terms of how the data is partitioned, with the latter depending on the queries that will be run against the database.

Summary

After reading this chapter, you should be able to deploy a Neo4j database either on a single server or using cluster and sharding techniques. You should also be aware of the best practices for using the GDS with big data and the limitations of this plugin.

With this chapter, we have closed the loop we started in Chapter 1, Graph Databases, where you first started learning about graphs, graph databases, and Neo4j. Chapter 2, The Cypher Query Language, taught you how to insert data into Neo4j and query it using the declarative query language known as Cypher. Chapter 3, Empowering Your Business with Pure Cypher, gave you the opportunity to use Cypher to build knowledge graphs and to perform some natural language processing (NLP) using a third-party library built by GraphAware.

In Chapter 4, The Graph Data Science Library and Path Finding, you learned about the basics of the Graph Data Science library. Developed and supported by Neo4j, this plugin is the starting point for data science on graphs using Neo4j. In the same chapter, you learned how to perform shortest path operations without having to extract data from Neo4j. The following chapter, Chapter 5, Spatial Data, then taught you how to use spatial data in Neo4j. Here, you discovered the built-in spatial type, Point, and learned about another plugin, neo4j-spatial. You saw that spatial data is well integrated into Neo4j and can even be visualized thanks to Neo4j Desktop applications.

In the rest of the chapters in the Graph Algorithms section, you discovered the centrality (Chapter 6, Node Importance) and community detection (Chapter 7, Community Detection and Centrality Measures) algorithms. You also learned about tools such as neovis.js, which is used to create nice graph visualizations.

With this new knowledge, you then learned how to use these graph algorithms in machine learning. First, in Chapter 8, Using Graph-based Features in Machine Learning, you built a simple classification model. By turning your data into a graph and adding graph-based features, you were able to increase the model's performance. In Chapter 9, Predicting Relationships, you got more familiar with a machine learning problem specific to graphs: predicting future links between nodes in a time-evolving graph. In Chapter 10, Graph Embedding—from Graphs to Matrices, you moved even further into the graph analysis stage by learning about the different types of algorithms available in order to automatically learn the features that are used to represent nodes in vector format.

Finally, in the last section of this book, you learned how to interface Neo4j with a web application using Python or GraphQL and React, thanks to GRANDstack (Chapter 11, Using Neo4j in Your Web Application), as well as how to tune Neo4j and the GDS for really big data.

The path we have traveled down since the beginning of this book has been impressive, but remember that there is no end of the road toward knowledge. Keep reading, learning, experimenting, and building amazing things!
