In this chapter, we take a much closer look at the real subject of this book: learning Neo4j, the world's leading graph database. We will familiarize ourselves with the database management system so that we can start using it in the following chapters with real-world models and use cases.
We will discuss the following topics in this chapter:
Let's start with the first topic straightaway.
Before we dive into the details of Neo4j, let's take a look at some of its key characteristics as a graph database management system. Hopefully, this will also highlight some of its key strengths and help you get to grips with them.
Like many open source projects and many open source NoSQL database management systems, Neo4j came into existence for very specific reasons. This is sometimes called scratching the itch: grassroots developers who want to solve a problem, and who are struggling to do so with traditional technology stacks, decide to take a radical new approach. That is what the Neo4j founders did early in the 21st century: they built something to solve a problem for a particular media company that needed to manage its media assets better.
In the early days, Neo4j was not a full-on graph database management system. It was more like a graph library that people could use in their code to deal with connected data structures in an easier way. It sat on top of traditional relational database management systems (MySQL and others) and was much more focused on creating a graph abstraction layer for developers than anything else. Clearly, this was not enough. After a while, the open source project made the radical decision to move away from the MySQL infrastructure and to build a graph store from the ground up. The key phrase here is from the ground up: the entire infrastructure, including low-level components such as the binary file layout of the graph database store files, is optimized for dealing with graph data. This matters in many ways, as it is the basis for many of the speed and other advantages that Neo4j shows over other database management systems.
We don't need to understand the details of this file structure for the purposes of this book. Suffice it to say that it is a native, graph-oriented storage format that is tuned for this particular workload. That, dear reader, makes a big difference.
Neo4j prides itself on being an ACID-compliant database. To explain this further, it's probably useful to go back to what ACID really means. Basically, the acronym is one of the oldest summaries of four goals that many database management systems strive for, and they are shown in the following figure:
The optional schema of Neo4j is really interesting: the idea being that it is actually incredibly useful to have a schema-free database when you are still at the beginning of your development cycles. As you are refining your knowledge about the domain and its requirements, your data model will just grow with you—free of any requirements to pre-impose a schema on your iterations. However, as you move closer to production, schema—and therefore consistency—can be really useful. At that point, system administrators and business owners alike will want to have more checks and balances around data quality, and the C in ACID will become more important. Neo4j fully supports both approaches, which is tremendously useful in today's agile development methodologies.
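The move from a schema-free start to an optional schema can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not Neo4j's API: the class `TinyGraph` and its methods `create_node` and `add_unique_constraint` are hypothetical names invented for this illustration.

```python
class TinyGraph:
    """Toy in-memory node store (hypothetical; NOT Neo4j's API)."""

    def __init__(self):
        self.nodes = []            # each node is a (label, properties) pair
        self.unique_keys = set()   # (label, property) pairs that must be unique

    def create_node(self, label, **props):
        # Only constraints that were explicitly added are enforced.
        for c_label, c_prop in self.unique_keys:
            if c_label != label or c_prop not in props:
                continue
            for node_label, existing in self.nodes:
                if (node_label == label and c_prop in existing
                        and existing[c_prop] == props[c_prop]):
                    raise ValueError(f"duplicate {label}.{c_prop}: {props[c_prop]!r}")
        self.nodes.append((label, props))
        return len(self.nodes) - 1  # position of the new node

    def add_unique_constraint(self, label, prop):
        # Refuse the constraint if the existing data already violates it.
        seen = set()
        for node_label, props in self.nodes:
            if node_label == label and prop in props:
                if props[prop] in seen:
                    raise ValueError(f"existing data violates {label}.{prop}")
                seen.add(props[prop])
        self.unique_keys.add((label, prop))


# Early iterations: nodes of the same label may have different shapes.
graph = TinyGraph()
graph.create_node("Person", name="Ann")
graph.create_node("Person", name="Bob", email="bob@example.com")

# Closer to production: opt in to a consistency rule.
graph.add_unique_constraint("Person", "email")
```

After the constraint is added, a second `Person` with the email `bob@example.com` is rejected, while nodes without an email are still accepted. That is the sense in which the schema is optional: rules are added when they become useful, not imposed up front.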
In summary, Neo4j has been designed from the ground up to be a true multipurpose database-style solution. It shares many of the qualities of the traditional relational database management systems that we know today; it just uses a radically different data model that is well suited for densely connected use cases.
These characteristics matter most in online systems, where the database management system has to return data while the user is waiting. The queries that you ask of the database management system need to be answered in the timespan between a web request and a web response. In other words, in milliseconds, not seconds, let alone minutes.
This characteristic is not required of every database management system. Many systems actually only need to reply to requests that are first fired off and then require an answer many hours later. In the world of relational database systems, we call these analytical systems. We refer to the difference between the two types of systems as the difference between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). There's a significant difference between the two—from a conceptual as well as from a technical perspective. So let's compare the two in the following table:
| | Online Transaction Processing (Operational System) | Online Analytical Processing (Analytical System, also known as the data warehouse) |
|---|---|---|
| Source of data | Operational data; OLTPs are the original source of the data | Consolidation data; OLAP data comes from the various OLTP databases |
| Purpose of data | To control and run fundamental business tasks | To help with planning, problem solving, and decision support |
| What the data provides | Reveals a snapshot of ongoing business processes | Multidimensional views of various kinds of business activities |
| Inserts and updates | Short and fast inserts and updates initiated by end users | Periodic long-running batch jobs refresh the data |
| Queries | Relatively standardized and simple queries returning relatively few records | Often complex queries involving aggregations |
| Processing speed | Typically very fast | Depends on the amount of data involved; batch data refreshes and complex queries may take many hours |
| Space requirements | Can be relatively small if historical data is archived | Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP |
| Database design | Highly normalized with many tables | Typically de-normalized with fewer tables; use of star and/or snowflake schemas |
| Backup and recovery | Backs up religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liability | Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method |
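To make the "Inserts and updates" and "Queries" rows of the comparison concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It uses a small relational table only because that is easy to run anywhere; it says nothing about how Neo4j itself executes queries. The table `orders` and the helper functions are assumptions made up for this illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    return conn


def place_order(conn, customer, amount):
    """OLTP-style write: one short insert, committed immediately."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )
    return cur.lastrowid


def get_order(conn, order_id):
    """OLTP-style read: fetch a single record by its key."""
    return conn.execute(
        "SELECT customer, amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()


def revenue_per_customer(conn):
    """OLAP-style read: aggregate over every row in the table."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
    return dict(rows)


conn = make_db()
first_id = place_order(conn, "ann", 10.0)
place_order(conn, "bob", 7.5)
place_order(conn, "ann", 5.0)
```

`get_order` touches one row no matter how large the table grows, which is what keeps operational queries fast. `revenue_per_customer` has to read every order, so its cost grows with the amount of data, which is why such queries are typical of the analytical column.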
At the time of writing, Neo4j is clearly on the OLTP side of the database ecosystem. That does not mean that you cannot do any analytical tasks with Neo4j. In fact, some analytical tasks in the relational world run far more efficiently on a graph database (see the sweet spot query section that follows later), but Neo4j is not optimized for them. Typical Neo4j implementation recommendations also suggest that you set aside a separate Neo4j instance for these analytical workloads so that they do not impact your production OLTP queries. In the future, Neo Technology plans to make further enhancements to Neo4j that make it even more suited for OLAP tasks.
In order to deal with the OLTP workload, Neo4j obviously needs to be able to support critical scalability, high availability, and fault-tolerance requirements. Creating clusters of database server instances that work together to achieve the goals stated before typically solves this problem. Neo4j's Enterprise Edition, therefore, features a clustering solution that has been proven to support even the most challenging workloads.
As you can see from the preceding diagram, the Neo4j clustering solution is a master-slave clustering solution. In a particular cluster, each server instance of the cluster will perform the following steps:
Neo4j's clustering solution allows you to provide the following features:
This covers 99 percent of all use cases; the references of Neo Technology speak for themselves.
One of the defining features of the Neo4j graph database product today is its wonderful query language, called Cypher. Cypher is a declarative, pattern-matching query language that makes graph database management systems understandable and workable for any database user—even the less technical ones.
The key characteristic of Cypher is, in my opinion, that it is a declarative language, as opposed to the imperative query languages that have existed for quite some time. Why is this so important? Here are your answers:
In an imperative (query) language, you would have to tell the database specifically what to do to get to the data and retrieve it.
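The contrast between the two styles can be sketched in Python. The first function spells out how to get the data; the second only states what is wanted and leaves the how to a query engine. SQL through the built-in `sqlite3` module stands in for a declarative language here, because it runs without a Neo4j server. The `PEOPLE` data, the function names, and the Cypher string (which assumes `Person` nodes with `name` and `age` properties) are illustrative assumptions; the Cypher string is shown for comparison only and is not executed.

```python
import sqlite3

PEOPLE = [("Ann", 34), ("Bob", 27), ("Cleo", 41)]


def older_than_imperative(people, min_age):
    """Imperative: we decide how to scan, filter, and sort."""
    result = []
    for name, age in people:
        if age > min_age:
            result.append(name)
    result.sort()
    return result


def older_than_declarative(people, min_age):
    """Declarative: we describe the result; the engine picks the plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO person VALUES (?, ?)", people)
    rows = conn.execute(
        "SELECT name FROM person WHERE age > ? ORDER BY name", (min_age,)
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


# For comparison only, not executed: a roughly equivalent Cypher query,
# assuming Person nodes that carry name and age properties.
CYPHER_EQUIVALENT = (
    "MATCH (p:Person) WHERE p.age > 30 RETURN p.name ORDER BY p.name"
)
```

Both functions return the same names for the same input. The difference is who owns the execution strategy: in the imperative version a change of strategy (for example, using an index) means rewriting the code, while in the declarative version the query text stays the same and the engine is free to change its plan.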
Part of the reason why I feel that Cypher is such an important part of Neo4j is that we know that declarative languages, especially in the database management systems world, are critical to mass adoption. Most application developers do not want to worry about the nitty-gritty of how best to interact with their data. They want to focus on the business logic, and the data should just be there when they want it, as they want it. This is exactly how relational database systems evolved in the seventies (refer to Chapter 2, Graph Databases – Overview). It is highly likely that we will see a similar evolution in the graph database management system space. Cypher, therefore, is in a unique position and makes it so much easier to work with the database. It is already an incredible tool today, and it will only become better.
As with many software engineering tools, Neo4j has its sweet spot use cases: specific types of uses where the tool really shines and adds a lot of value to your process. Many tools can do many things and so can Neo4j, but only a few things can be done really well by a certain tool. We have addressed some of this already in the previous chapter. However, to summarize specifically for the Neo4j software package, I believe that there are two particular types of cases, featuring two specific types of database queries, where the tool really excels.
We discussed in the previous chapter how relational database management systems suffer from significant drawbacks when they have to deal with more and more complex data models. Asking complex, join-intensive questions of a relational database requires the database engine to calculate the Cartesian product of the full indices on the tables involved in the query. That computation can take a very long time on larger datasets, or if more than two tables are involved.
Graph database management systems do not suffer from these problems. The join operations are effectively precalculated and explicitly persisted in the database based on the relationships that connect nodes together. Therefore, joining data becomes as simple as hopping from one node to another—effectively as simple as following a pointer. These complex questions that are so difficult to ask in a relational world are extremely simple, efficient, and fast in a graph structure.
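The difference between computing a join at query time and following persisted relationships can be sketched as follows. This is a deliberately naive illustration, not a description of Neo4j's storage engine or of how a real relational engine plans joins (real engines use indexes and smarter join algorithms). The `FRIEND_ROWS` data, the `Node` class, and the function names are assumptions made up for this example.

```python
# Relational style: relationships are rows that must be searched at query time.
FRIEND_ROWS = [("ann", "bob"), ("bob", "cleo"), ("bob", "dave"), ("cleo", "eve")]


def friends_of_friends_join(rows, person):
    """Naive self-join: for each matching row, scan the whole table again."""
    result = set()
    for a, b in rows:
        if a != person:
            continue
        for c, d in rows:
            if c == b and d != person:
                result.add(d)
    return result


# Graph style: each node holds direct references to its neighbours.
class Node:
    def __init__(self, name):
        self.name = name
        self.friends = []  # outgoing relationships, stored with the node


def build_graph(rows):
    """Persist the relationships once, at write time."""
    nodes = {}
    for a, b in rows:
        if a not in nodes:
            nodes[a] = Node(a)
        if b not in nodes:
            nodes[b] = Node(b)
        nodes[a].friends.append(nodes[b])
    return nodes


def friends_of_friends_hop(node):
    """Two hops along stored references; no table scan involved."""
    return {
        fof.name
        for friend in node.friends
        for fof in friend.friends
        if fof is not node
    }


graph = build_graph(FRIEND_ROWS)
```

Both approaches return the same answer for `ann`, but the work they do differs. The join version touches every row of the table for each match, so its cost grows with the total number of relationships. The hop version only touches the neighbours of the nodes it visits, so its cost depends on the size of the local neighbourhood rather than on the size of the whole dataset.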
Many users of Neo4j use the graph structure of their data to find out whether there are useful paths between different nodes on the network. Useful in this phrase is probably the operative word; they are looking for specific paths on the network to:
Both of these sweet spot use cases share a couple of important characteristics:
Let's now switch to another key element of Neo4j's success as a graph database management system: the fact that it is an open source solution.
One of the key things that we have seen happening in enterprise information technology is the true and massive adoption of open source technologies for many business-critical applications. This evolution has lasted at least a decade. It started with peripheral systems such as web servers (in the days when web servers were still considered to be serving static web pages), and gradually extended to mission-critical operating systems, content management applications, CRM systems, and databases such as Neo4j.
There are many interesting aspects to open source software, but some of the most often quoted are listed as follows:
I believe that all of this is true for Neo4j. Let's look at the different parameters that determine the license model. Three parameters are important, and they are explained in the following sections.
Neo4j offers different feature sets for different editions of the graph database management system:
Most users of Neo4j start off with the Community Edition, but then deploy into production on the Enterprise Edition.
Different support channels exist for Neo4j's different editions:
Neo Technology does sponsor a significant team of top-notch engineers to help the community users, but at the end of the day, this formula does have its limitations.
The support program for Neo4j is typically something that is most needed at the beginning of the process (as that is when the development teams have most questions about the new technology that they are using), but it is often only sought at the end of a development cycle.
Now for the slightly more complicated bit: Neo Technology has chosen very specific licensing terms for Neo4j, which may seem a tad complicated but actually support the following goals:
This is achieved in the following ways:
The AGPL license differs from the other GNU licenses in that it was built for network software. Conditions similar to those of the GPL apply; however, it is even more "viral" in the sense that it requires you to open source your code not only when you link your code on the same machine (through Neo4j's Java API), but also if you interface with Neo4j over the network (through Neo4j's REST API). So, this means that if you use Neo4j's Enterprise Edition for free, you have to open source your code.
All of the mentioned points are summarized in the following figure:
As indicated in the preceding figure, Neo Technology offers a number of different annual commercial subscription options, depending on the number of instances that you will deploy, the type of company you are (startup, mid-sized corporation, or large corporation), the legal contract requirements of the agreement, and the support contract. For more information on the specifics of these bundles (which change regularly), you can contact <[email protected]>.
With that, we have wrapped up this section and will now proceed to get our hands dirty with Neo4j on the different platforms available.