CHAPTER 1

Introduction to Graph Databases

With anything that’s worth learning, there’s always a bit of theory to go along with something more practical, and this book is no exception. Neo4j is the leading graph database (or at least that’s how they describe themselves, anyway), but what does that mean? If you come from a more traditional relational-database background, then the concept of a graph database may be a new one, but learning a bit of theory will be worth it. Graph databases have many advantages, one of which is that some queries that are close to impossible in traditional SQL-based databases become very possible in a graph database. This is because the primary function of a graph database is to relate data. If you understand graph databases already, you could skip ahead, but my teachers always used to say, “Well, it’s a good refresher for you,” so I’ll say the same, and hopefully there’s a benefit.

In this chapter, I’ll be covering everything database related to show why graph databases are a brilliant utility, and how they have a lot of potential for modern application development. There are already a number of people, from large companies such as eBay and WalMart to small research teams, taking advantage of graph databases to solve various data-based problems, and you could be too. Of course, there are many databases out there. Where do graph databases stand? This chapter will also give an overview of the various types of databases and a few details on each one.

What is a database?

Before going into detail about graph databases, relational databases, or any database for that matter, it’s probably a good idea to start at the beginning and describe what a database actually is. At its most fundamental, a database is a means of organizing information. Databases come in many forms. Most are associated with a computer system, but some are used for backups.

Since a database is a structured set of information, it doesn’t need to be limited to something electronic. A hard copy address book and an electronic address book are both structured data and are both considered databases. However, there may be a time when you want to migrate to a more reliable database system that isn’t paper based. When you do, you need to know where to start. To manage data in a traditional database and communicate with your chosen database, you’ll use a database management system (DBMS). There are many DBMSs on the market, such as MySQL, PostgreSQL, Microsoft SQL Server, CouchDB, or (of course) Neo4j. If you aren’t familiar with any of those, or your particular favorite wasn’t mentioned, don’t worry. Each DBMS has its own advantages and disadvantages, depending on your use case.

A database system allows you to interact with the data stored within it via a predetermined language, dictated by the type of database. The main job of a DBMS is to provide a way for the user to interact with the data stored in it. These interactions can be categorized into four primary sections:

  • Data definition - Any action that modifies the organization of the data within the database.
  • Update - Any action that manipulates the actual data stored within the database is classed as an update, which includes creating, updating, and deleting data. Inserting or deleting data counts as an update to the database itself, as you’re changing its contents by either adding or removing data.
  • Retrieval - Data is stored in a database in most cases to be reused. When data is selected from the database to be used in another application, that’s a retrieval.
  • Administration - The remaining actions of user management, performance analysis, security, and all of the higher-level actions are classed as administration.

Database Transactions

Depending on your knowledge of databases, the idea of transactions may be a new concept. It’s one of those things you may know about, but not know the correct words to explain it. A database transaction is essentially a group of queries that all have to be successful for any of them to be applied. If one query within a transaction fails, the whole transaction does. Database transactions have two main purposes, both involving consistency, just in different ways.

The first purpose of a database transaction is to ensure that either all of the queries within a transaction are applied or none of them are, which can be very important. Say you’re creating a user and inserting a record for it into the database. There are cases when the ID of an inserted row will be used in queries that follow it. One use is permissions or roles, where a user’s ID would traditionally be used to make the relation. If that initial creation of the user fails, maybe due to not being unique, the subsequent queries will also fail since they depend on the result of the failed query. Depending on how the application is set up, if these queries were run without a transaction, incomplete data may be added to the database (so potentially a set of permissions for a non-existent user), or the application may fail unexpectedly. To avoid this, you can run all of the queries within a transaction, so if any query fails, then any queries that have already run (within the transaction) are reverted, and the transaction ends, which means your data is untouched.

The second purpose of a database transaction concerns two actions happening at the same time: if a database is being queried simultaneously by multiple sources, then there is the potential the data integrity may be compromised. For example, if you were querying the database, but also performing an update on some of the data being queried at the same time, what would happen? To make this example more informative, let’s say we’re querying a list of users by name, but one of the users is online changing their name. Depending on the timing of the query (without transactions) there’s a chance you could get the data before or even after the change; there isn’t any guarantee. Using transactions though, the update would only be committed and then available to query after it and all other queries within said transaction were successful. So in this case, the updated name wouldn’t be available until all of the needed queries within the transaction were successful.

When you talk about a database transaction, it should be atomic, consistent, isolated, and durable, or ACID. If a database transaction is truly ACID, then it works as it’s been explained here, in an all-or-nothing fashion. The most important time for a transaction to abide by the ACID principles is when money is involved. For instance, if you had a system in place to pay bills, which transferred money from one account to another, and then made a payment, you’d want to ensure all of that happened without any errors. If the bills account was empty, the money transfer from one account to the other would fail, but if the two actions were run outside of a single transaction, you would still try to make the payment, even though no money had been transferred. This is an example of when a query is executed, then subsequent queries depend on the result of the first one, and in this case, you’d want both queries to be in one, ACID-compliant transaction.
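The bill-payment scenario can be sketched with Python’s built-in sqlite3 module. This is only an illustration under stated assumptions: the account and payment tables, the account names, and the pay_bill function are all hypothetical, and the CHECK constraint is just one way to make the transfer fail when there isn’t enough money.

```python
import sqlite3

# Hypothetical schema: two accounts and a log of payments made.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("CREATE TABLE payment (payee TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("savings", 50), ("bills", 0)]
)
conn.commit()


def pay_bill(conn, payee, amount):
    """Transfer money into the bills account and pay from it, all or nothing."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'bills'",
                (amount,),
            )
            # Fails the CHECK constraint if savings would go below zero.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'savings'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'bills'",
                (amount,),
            )
            conn.execute("INSERT INTO payment VALUES (?, ?)", (payee, amount))
        return True
    except sqlite3.IntegrityError:
        return False


too_much = pay_bill(conn, "electric", 80)  # savings only holds 50
affordable = pay_bill(conn, "electric", 30)
```

In the first call the credit to the bills account succeeds before the debit from savings fails, and the rollback undoes that credit, so no payment is recorded against money that never arrived.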

Principles used within ACID are related to the CAP Theorem, also known as Brewer’s Theorem. Eric Brewer (the theorem’s creator) stated that it is impossible for a distributed computer system (or database) to simultaneously guarantee the following three conditions:

  • Consistency (all nodes see the same data at the same time)
  • Availability (each request receives a response about whether it was successful or failed)
  • Partition tolerance (the system can still operate despite losing contact with other nodes due to network issues)

If a system of nodes (or databases) wants to be always available, and safe from failures, then it cannot always have the most up-to-date data. For example, if you have a system of three nodes, each node would be a copy of the others, so if one failed, you would still have access to the other two. If you were to make a change to one of the nodes, the other two nodes would initially be unaware of the change. To combat this problem, eventual consistency is employed, meaning that through some means, eventually, the change will be mirrored across all three nodes. In relation to ACID, until a transaction has completed, the contents of that transaction won’t be available to access within a database. Loosely speaking, CAP deals with the same kind of consistency concerns as ACID, but applied to a distributed system.

Many database vendors rave about their software being fully ACID compliant, so this was just a quick overview to show what that actually is. Although a lot of different systems support ACID, it’s not something that just happens. In most cases you’ll need to show you want to start a transaction, which can be different depending on the query language used, but the concept is the same. Once the transaction has been initialized, the queries running within it are added. Then when it needs to be committed, this is added to the query in the way the language requires it. There are cases when you simply don’t need transactions, so just remember when you want to use a transaction, you’ll probably have to indicate it in the query, unless your chosen database vendor has different rules.

Although transactions are used with the intention of rolling them back if they fail, this isn’t always the desired outcome. In some cases, such as in MySQL, you need to explicitly say if you want to roll back a failed transaction, and this can only be done before the transaction is committed. Each database vendor will have its own rules when it comes to how transactions are handled, so if you want to use them, just be sure to check the official documentation to ensure you’re using them correctly.

What is a Graph?

Trying to define what a graph actually is isn’t the easiest of tasks, as it has a variety of meanings depending on the context. In a traditional sense, graphs are used to display how two or more systems of data relate to each other. A basic example could be something as simple as the number of pies eaten over a certain time period, or pies over time. The graph seen in Figure 1-1 illustrates that very example, and shows a way of representing pies over time.

Figure 1-1. A basic graph showing how pies are eaten over time

If you were grading this graph, it would get a very low mark: there aren’t any units on the axes and the origin hasn’t been marked with a value. Although it’s not the most informative graph, it does still show (assuming the units would increase from the graph origin) that as time goes on, more pies are eaten, then after so long, the rate at which pies are eaten goes down. Graphs of this nature are normally called a bell curve, or an inverted U, depending on the context, where a graph hits a maximum point and curves down on either side, causing a bell shape.

The example used here was a graph showing pies over time, but there are of course many, many more graphs and graph types out there. Graphs can range from the serious (showing important data, such as company growth) to the not so serious (I’m sure we’ve all seen some crude ones), but no matter the subject matter, the graphs all share one common trait: a relationship. In our example the relationship is pies with time, but you could equally have something like profits and time in a graph showing company growth. This relationship is the key part of what makes a graph a graph, and applying that to a mathematics-based graph or to a graph database is the same concept.

When it comes to the mathematical graphs, you can have the different data systems relating using various different graph types, such as a line graph, bar chart, or even a pie chart. Some very literal examples of these can be seen in Figure 1-2.

In graph databases, you wouldn’t necessarily see the data shown in any of those formats, although given that the nodes in a graph database are connected by relationships, they are still graphs. Regardless of the complexity of the graph, even if it’s just a small, simple one, it can be translated to a graph database, to allow it to be queried and analyzed.

Figure 1-2. A comic from xkcd 688 showing some very literal, self-describing graphs

Graph Theory

If you were to simplify graph theory, you could say it was just that, the study of graphs, but there is a lot more to it than that. Developers have been taking the principles of graph theory and applying them to databases. Thanks to this hard work, there are graph databases that take relating data very seriously.

In a mathematical sense, graph theory is the study of structures used to model the relationship between objects. In this context, a graph is made up of nodes (or vertices) and potentially edges connecting them. If you wanted to demonstrate this visually, it can be done with an arrow to indicate that a node is connected to another node. For example, if we had two nodes, A and B, where A was connected to B with an edge, it could be expressed as A → B. The direction is shown here in that A is connected to B, but B is not connected to A. If the edges that make up a graph don’t have an associated direction (e.g., A → B is the same as B → A), then the graph is classed as undirected. If, however, the orientation of the edge has some meaning, then the graph is directed.
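The difference between a directed and an undirected graph can be sketched in a few lines of Python. This is a minimal illustration, assuming a graph is stored as a mapping from each node to the set of nodes it connects to; the build_graph helper is a hypothetical name, not part of any library.

```python
def build_graph(edges, directed):
    """Return a mapping of node -> set of nodes that node is connected to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        # Register the target as a node even if it has no outgoing edges.
        adjacency.setdefault(target, set())
        if not directed:
            # Without direction, an edge from A to B is also an edge from B to A.
            adjacency[target].add(source)
    return adjacency


directed_graph = build_graph([("A", "B")], directed=True)
undirected_graph = build_graph([("A", "B")], directed=False)

print(directed_graph)
print(undirected_graph)
```

In the directed version only A lists B as a connection, whereas in the undirected version each node lists the other.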

There are other applications for graph theory outside the world of mathematics. Since graph theory, at its lowest level, describes how data relates to each other, it can be applied to a number of different industries and scenarios where relating data is important. It can be used to map out chemical structures, create road diagrams, even to analyze data from social networks. The applications for graph theory are pretty wide.

Origins

The first known paper on graph theory, “Seven Bridges of Königsberg,” was written way back in 1736 by Leonhard Euler, a brilliant mathematician and physicist, considered to be the pre-eminent mathematician of the 18th century. He introduced a lot of the notation and terminology used within modern mathematics, and published extensively within the fields of fluid dynamics, astronomy, and even music theory. Leonhard Euler was an incredible man and helped further modern mathematics and other fields to where they are today. If you have a chance, read up on him. Right now though, we will focus on “Seven Bridges of Königsberg,” from which graph theory originated.

The city of Königsberg, Prussia (now Kaliningrad, Russia) was built on both sides of the Pregel River, and included two large islands that were connected to each other and the mainland by seven bridges. The problem was to see if it was possible to cross each of Königsberg’s seven bridges exactly once, and still be able to visit all the parts of the city. You can see an abstracted version of the problem in Figure 1-3.

Figure 1-3. The 7 bridges of Königsberg, abstracted into a graph format

After abstracting the problem into a graph, Euler noticed a pattern based on the vertices and the number of edges touching each one. In the Königsberg graph, there are 4 vertices and 7 edges. In the literal sense, Euler noticed that if you were to walk onto one of the islands and then leave for another part of the city, you would use one bridge to enter and another to exit. Essentially, to be able to pass through a vertex without crossing an edge more than once, that vertex needs an even number of edges.

Euler theorized that whether you can traverse a graph entirely, using each edge only once, depends on the degrees of its nodes. In the context of a node or vertex, degree refers to the number of edges touching the node. Euler argued that for a graph to be traversed fully (using each edge only once), it can have either 0 or 2 nodes of odd degree. This was later proven by Carl Hierholzer, and traversing a graph in this way is known as an Eulerian path or Euler walk in Euler’s honor.
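Euler’s degree rule is easy to check in code. The sketch below counts the degree of each land mass in the Königsberg graph and applies the “0 or 2 nodes of odd degree” test. The names north, south, island, and east are labels chosen here for the two river banks and the two islands, and the check assumes the graph is connected, which it is for Königsberg.

```python
from collections import Counter

# The seven bridges, each written as the pair of land masses it joins.
BRIDGES = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"),
    ("island", "east"),
]


def degrees(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def has_eulerian_path(edges):
    """Apply Euler's degree condition; assumes the graph is connected."""
    odd_nodes = sum(1 for degree in degrees(edges).values() if degree % 2 == 1)
    return odd_nodes in (0, 2)


print(dict(degrees(BRIDGES)))
print(has_eulerian_path(BRIDGES))
```

All four land masses have an odd degree (3, 3, 5, and 3), so the walk the citizens of Königsberg were looking for is not possible.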

Graph Databases

Using graph theory as a basis, graph databases store data in the form of nodes (vertices), edges, and properties. When creating a node, you would give this node properties, then any edges used could also have properties. This helps build up a graph of data that is related directly to the data, rather than in rows with join tables as you would in a relational database.
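To make the idea of nodes, edges, and properties concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j or any other graph database stores data internally; the Node and Edge classes, the connected_to helper, and the sample people and team are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    start: Node
    end: Node
    kind: str
    properties: dict = field(default_factory=dict)


def connected_to(node, edges):
    """Return the nodes reached by following one outgoing edge from `node`."""
    return [edge.end for edge in edges if edge.start is node]


alice = Node("Person", {"name": "Alice"})
bob = Node("Person", {"name": "Bob"})
team = Node("Team", {"name": "Red"})

edges = [
    Edge(alice, team, "MEMBER_OF", {"since": 2014}),  # edges carry properties too
    Edge(bob, team, "MEMBER_OF", {"since": 2015}),
    Edge(alice, bob, "KNOWS"),
]

names = [node.properties["name"] for node in connected_to(alice, edges)]
print(names)
```

Finding what Alice is connected to means following her edges directly, with no intermediate joining table to look up.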

Visually, you could interpret a graph database as a kind of web. Although you can have a graph database without any edges, more often than not, it will have them, and lots of them. A good example of a graph database in the physical world would be a crime diagram from a TV show, or of course in real life if you happen to have seen one.

With the crime diagram, suspects are related to other suspects, or the victim, and various bits of evidence are related for various reasons. This could be easily replicated in a graph database format, as it’s just a big graph. The nodes in this case would be your evidence items and suspects, and they could connect together for various reasons, which would be logged via properties. Those who know Breaking Bad may remember Figure 1-4, but for those who haven’t seen the show, or can’t remember this particular scene, it’s a crime diagram used in the show, which reminds me, SPOILER ALERT!

Figure 1-4. A scene from Breaking Bad showing an evidence board, with connected people and evidence

Another example of this is one from the TV show Heroes. The show, which first aired in 2006, is about ordinary people with extraordinary abilities (I loved that show), and it had a huge tie to the flow of time. If you haven’t seen the show, or didn’t care for it, that isn’t a problem; there’s one example of a brilliant graph that’s worth sharing either way.

In the show things start going wrong, so to help control this, one of the characters makes a physical timeline of events, featuring when events happened, who was involved, and how it was all related to any other event. This is very much like the previous example in that events would be classed as nodes, and the string connecting those nodes as edges. In Figure 1-5 you can see a portion of the graph from the show, where you can see the connections between different pictures and items on various bits of string.

Figure 1-5. A still from Heroes, showing all of the characters’ lives and interactions represented with string and other items

Depending on the graph database system you use, the language may change slightly, but it all comes down to nodes (or vertices), edges, and properties. As you’ll soon learn, Neo4j consists of Nodes, Relationships, and Properties, so here edges are relationships, and nodes are nodes. Titan DB on the other hand (another graph database) uses vertices and edges to describe its graph. Although the terminology differs between the two, the underlying meaning is the same.

Of course in this case, there’s only one graph database of interest, and that is Neo4j. Although the details of Neo4j will be explained in the next chapter, for now Neo4j uses Nodes, Properties, and Relationships (edges). As I said, different systems have different ways of wording the different elements, but it comes down to the same structure.

Relational Databases

Relational databases have been around for a while, and if you’ve ever used Drupal, WordPress, Magento, or a number of other applications, you’ll have most likely used MySQL, which is a common relational database. MySQL is an example of a SQL (Structured Query Language) database, which stores its data in the form of tables, rows, and columns. This method of storing data is based on the relational model of data, as proposed by Edgar Frank Codd in 1970.

Within a relational database, you’ll create a table for every type of data you want to store. So for example, a user table could be used to store user information, a movie table to store movies, and so on. Each row in an SQL table typically has a primary key, which is the unique identifier for the row. Often, this is an ID field that will automatically increment as rows are added. Using this system for storing data does work quite well, and has for a very long time, but when it comes to adding in relationships, that’s when the problems potentially start.

If you’ve ever had to spend time in Excel, or another spreadsheet application, then you know how relational databases work, at least on some level. You’ll set up your columns and then add rows that correspond to those columns. Although you can do a lot of other things in these applications, such as adding up all of the values in a column, the concept is the same. Excel also has multiple sheets, and in the context of the spreadsheet application, a sheet is like a table, where you’ll have one main document (the database, in this case) with many sheets, each containing many columns and rows, that may or may not be related.

Relationships

When creating a relationship in a relational database (or SQL database), you would create your two data types, such as person and team, and most likely have a joining table named something like person_team. The unique identifier used in each table will be added as a column in the joining table, and each relationship is stored as a row. For example, if a person with the ID of 1 is in the team with an ID of 2, then the row in the person_team table would be something like that shown in Table 1-1.

Table 1-1. An example joining table between a person and a team

person_id    team_id
1            1
1            2
2            3
4            3
5            2
6            3
7            1
8            1

This approach works when it comes to relating small amounts of data, but when you start having to do multiple joins for thousands of rows, it starts to become slow, and eventually, unusable. This is a huge problem for the amount of data stored these days, and how that data relates to other data. If your website gets hit with a large spike of traffic, you’ll want to be able to scale your database to keep up. In most cases this’ll be fine, but if there’s a join-intensive query, then unless it’s been heavily optimized, there are going to be problems when you compare that to how easily a graph database handles the same issue.
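To show what querying through a joining table looks like, here is a small sketch using Python’s built-in sqlite3 module and the rows from Table 1-1. The person and team tables, their name columns, and the members_of helper are assumptions made for the example; only the person_team rows come from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person_team (
        person_id INTEGER REFERENCES person(id),
        team_id INTEGER REFERENCES team(id)
    );
    """
)

# The rows of Table 1-1 as (person_id, team_id) pairs.
memberships = [(1, 1), (1, 2), (2, 3), (4, 3), (5, 2), (6, 3), (7, 1), (8, 1)]

# Placeholder names for every person and team that appears in the table.
people = sorted({person_id for person_id, _ in memberships})
teams = sorted({team_id for _, team_id in memberships})
conn.executemany(
    "INSERT INTO person VALUES (?, ?)", [(p, f"Person {p}") for p in people]
)
conn.executemany("INSERT INTO team VALUES (?, ?)", [(t, f"Team {t}") for t in teams])
conn.executemany("INSERT INTO person_team VALUES (?, ?)", memberships)
conn.commit()


def members_of(conn, team_id):
    """Return the IDs of the people in a team, found via two joins."""
    rows = conn.execute(
        """
        SELECT person.id
        FROM person
        JOIN person_team ON person_team.person_id = person.id
        JOIN team ON team.id = person_team.team_id
        WHERE team.id = ?
        ORDER BY person.id
        """,
        (team_id,),
    )
    return [row[0] for row in rows]


print(members_of(conn, 1))
```

Even this simple question needs two joins, and every extra hop (teammates of teammates, for instance) adds more.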

Origins

As I mentioned earlier, the relational model was proposed by Edgar Frank Codd in 1970 while he was working at IBM. That year, Codd published “A Relational Model of Data for Large Shared Data Banks,” which showed the potential of his data model. Despite his efforts to get IBM to implement his ideas, the company was slow to do so, and only started when competitors did.

Although IBM was slow in adopting the changes, it eventually did begin to implement them, but because Codd himself wasn’t involved with the process (and the team wasn’t familiar with his ideas), rather than using his own Alpha query language, the team created their own. The team created SEQUEL (later renamed SQL), and because it was so superior to what was already out there, it was copied.

Since the term relational database was so new, a lot of database vendors ended up adding the term to their name because of its popularity, despite said systems not actually being relational. To try to prevent this and reduce the number of incorrect uses of his model, Codd put together a list of 12 rules that would hold true for any relational database.

NoSQL

When you talk about databases at all, you need to mention NoSQL, which, just to make things confusing, can be interpreted as “Not only SQL” or “Not SQL”, depending on the context. Some NoSQL databases can handle SQL-based queries, whereas others cannot. If you’re in doubt, just check the official documentation. The name is also somewhat misleading anyway, as it should have been called something like NoREL (No relations), since it moves away from the traditional relational data model. Technically speaking, then, Neo4j and graph databases in general are a type of NoSQL database. You may notice with some NoSQL databases that the query language used is somewhat similar to SQL in how it’s written, which can help developers feel at ease with a new query language. You’ll notice this with Cypher (Neo4j’s query language) a lot if you’re from an SQL background, as there are noticeable similarities in the syntax of both.

Depending on the database used, the benefits can be slightly different. There are those that focus on being able to scale well (example) and others that aim for data consistency. When you scale up a database to meet demand, more instances (or copies) of it are created, so the load is shared between however many instances exist. An issue with this, though, is that the instances aren’t updated at the same instant, so if a change is made, it may be applied on one instance but not yet on the others, making the data inconsistent. When scaling, NoSQL databases can use the “Eventual Consistency” model to keep their data correct. This means that if a change is made, eventually, the change will be mirrored to all of the instances, but until this happens, the data retrieved may be out of date. This is also known as BASE, or Basically Available, Soft state, Eventual consistency, which essentially says that the database is available (so it scales, and data can be accessed) and will eventually be fully consistent, but that consistency can’t be guaranteed at any given moment.
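Eventual consistency can be illustrated with a toy model in Python. This is a deliberately simplified sketch: the three replicas are plain dictionaries, and the synchronize function is a hypothetical stand-in for whatever replication mechanism a real database uses.

```python
# Three copies of the same data, as if the database had been scaled out.
replicas = [{"name": "Sam"} for _ in range(3)]

# A write lands on one replica only.
replicas[0]["name"] = "Samantha"

# Reads now disagree depending on which replica answers.
before_sync = [replica["name"] for replica in replicas]


def synchronize(replicas, source_index):
    """Copy the state of one replica to all of them (toy replication step)."""
    source = dict(replicas[source_index])
    for replica in replicas:
        replica.update(source)


synchronize(replicas, source_index=0)
after_sync = [replica["name"] for replica in replicas]

print(before_sync)
print(after_sync)
```

Between the write and the synchronization step, two of the three replicas still return the old name, which is exactly the window in which stale data can be read.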

Back in 1998, Carlo Strozzi used the term NoSQL to describe an open source database he was working on, as it went away from the typical relational-database model by not exposing SQL to the user. Although this was the first time the term was used (purely because of its lack of SQL) it wasn’t like the NoSQL databases we know now. Strozzi’s database was still relational, whereas typically, NoSQL databases aren’t.

The term stuck however, and then led to a new breed of databases that decided to go against the then-standard relational model. It would be a bit broad to have every database that wasn’t relational under the same umbrella without some categorization, so the main types are key-value, column, document, and (you guessed it) graph.

Key Value

Given its name, you’d be right to assume that this is, in fact, key-based value storage. Essentially, you don’t get a table, and you don’t get columns in the sense of a relational database; instead the database is like one big table with many columns, or keys. Values are stored within the database using a key, and are retrieved using that key. This makes it a lot simpler than a traditional SQL-driven database, and for some web applications, this is more than enough.

This approach does work for a lot of cases when your data isn’t related or especially structured, but that’s not always the case. This database approach is good if you just want to store a chunk of data you don’t need to query against. You could, for example, store some JSON within a key-value store, but until it was retrieved from the database, you wouldn’t be able to query against or use the data in any way.
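The JSON example can be sketched with a plain Python dictionary standing in for a key-value database. The key user:1 and the fields in the JSON are made up for illustration.

```python
import json

store = {}  # stand-in for a key-value database

# The value is stored as an opaque string; the store knows nothing about it.
store["user:1"] = json.dumps({"name": "Alice", "team": "Red"})

# The only way in is by key.
raw = store["user:1"]

# Only after retrieving and decoding the value can the data be used.
user = json.loads(raw)
print(user["team"])
```

To find every user in the Red team, you would have to fetch and decode every value yourself, because the store cannot look inside them.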

Column

The column type of NoSQL database holds many similarities to the key-value based NoSQL database, in that it is still stored and retrieved using a key. The difference is that each column in the database consists of a key, value, and a timestamp. This is especially useful when scaling, as the timestamp can be used to work out which content is stale when the database is updated.
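The role of the timestamp can be shown with a short Python sketch. The Column class and the freshest function are hypothetical names, and picking the newest timestamp is only one possible way of resolving two copies that disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    key: str
    value: str
    timestamp: int  # e.g. seconds since some fixed starting point


def freshest(first, second):
    """Given two copies of the same column, keep the most recently written."""
    return first if first.timestamp >= second.timestamp else second


old_copy = Column("email", "sam@example.com", timestamp=100)
new_copy = Column("email", "samantha@example.com", timestamp=250)

winner = freshest(old_copy, new_copy)
print(winner.value)
```

When two instances hold different values for the same key, comparing timestamps tells the database which copy is stale.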

Document-orientated

Technically speaking, a document-orientated NoSQL database is actually a key-value based database, just a little bit more intelligent. The key-value style of the database is still respected, but in this case, the value is a structured document, rather than just text, or a single value. This means thanks to the increased structure of the information, the database will be able to perform more optimized queries, as well as making data retrieval easier in general.

Documents can be technically anything, depending on the database vendor’s preference. One popular choice is JSON, which isn’t the best for structuring data, but it allows the data to be used by both back- and front-end applications.
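Because the value is a structured document, the database can look inside it. The sketch below imitates that with Python dictionaries; the find function and the sample documents are invented for the example and do not reflect any particular database’s API.

```python
documents = {
    "user:1": {"name": "Alice", "team": "Red"},
    "user:2": {"name": "Bob", "team": "Blue"},
    "user:3": {"name": "Carol", "team": "Red"},
}


def find(documents, **criteria):
    """Return the keys of documents whose fields match every criterion."""
    return [
        key
        for key, document in documents.items()
        if all(document.get(name) == value for name, value in criteria.items())
    ]


red_team = find(documents, team="Red")
print(red_team)
```

Compare this with the key-value case, where the same question could only be answered by retrieving and decoding every value first.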

Graph

The graph style of NoSQL database is different still, and stores its content in the format of Nodes, Properties, and Edges. Throughout the course of the book, there will be a lot of talk on graph databases, as Neo4j is of course one. For now though, it’s good to note that despite being a graph database, it’s still a type of NoSQL database.

Summary

In this chapter, lots of different database information has been covered, but things will move on from here. You always need to know the theory about something before you can properly use whatever it is you’re learning, and that’s what this chapter is all about. It shows that something conceived as early as 1736 by the brilliant Leonhard Euler may still not see the light of day until the technology exists to make it happen.

When you talk about databases, you can never discount relational ones, and of course, this chapter was no exception. Relational databases have been around for some time now, mainly due to the abundance of resources to help you, and web applications that utilize them. Although they also relate data, this can come at a cost when relationships become complex, as you have more and more tables to join. Graph databases put relationships first, which means complex relationships are possible, without compromising performance.

There are many alternative databases out there, but each one has a different purpose, including Neo4j. We’ll be relating a lot of data together in this book, crafting recommendations, and much more. Although I’ve only just touched on Neo4j, there is a whole chapter on its terminology, internals, and generally how it works, so don’t worry if it’s something new for you.
