10 Performance, pitfalls, and anti-patterns

This chapter covers

  • Diagnosing and debugging common performance problems with traversals
  • Understanding, locating, and mitigating supernodes
  • Identifying common application anti-patterns

Our application is built, tested, and delivered to production. We spent a lot of effort designing a system to run in a resilient and scalable manner. However, entropy is not on our side. Everything is humming along perfectly, until one day, we receive that dreaded bug ticket, “Application is slow.” Knowing what’s likely inside, we hesitantly click on the message, and as expected, we’re presented with a vague description that says the application is slow, but otherwise gives little detail.

This chapter examines some common performance issues you are likely to encounter while developing graph applications and techniques for mitigating them. We’ll start by looking at how to diagnose common performance problems in graph traversals, including the dreaded, “Application is slow.” For this, we look at a few of the most common tools available to help diagnose and debug traversal issues. Next, we introduce supernodes, a common source of performance problems in graph applications. Here, we discuss what supernodes are, why they are a problem, and what to do to mitigate their effects. Finally, we’ll focus on some specific pitfalls and anti-patterns that can come with building graph applications, some of which are common across databases and some of which are unique to graph databases. By the end of this chapter, you’ll possess a solid understanding of the most common graph anti-patterns, how to detect these early on in the project, and how to prevent graph projects from going astray.

10.1 Slow-performing traversals

We have a user who’s experiencing performance problems with our application and who submitted the “Application is slow” ticket. Lucky for us, the user at least told us what they were trying to do when the application slowed, which gives us a place to start looking. The ticket leads us to the problem being with this request: “Find the three friends-of-friends of Dave that have the most connections.” (We know that wasn’t a use case that we listed in chapters 2 or 7, but just work with us as we illustrate by example.) Digging into our application code, we locate the problematic traversal:

g.V().has('person', 'first_name', 'Dave').
  both('friends').
  both('friends').
  groupCount().
    by('first_name').
  unfold().
  order().
    by(values, desc).
    by(keys).
  project('name', 'count').
    by(keys).
    by(values).
  limit(3)

Great, we know where the problem is, but how do we diagnose this slow-performing traversal? Graph databases, like relational databases, are no stranger to slow-performing operations. And like relational databases, graphs also have tools to aid in diagnosing problems. These tools take two forms: explaining what a traversal will do or profiling what a traversal did.

10.1.1 Explaining our traversal

We should say up front that the explain() step is rarely our first stop in troubleshooting; we usually reach for the profiling tool we discuss in the next section. We find that using the explain() step to locate issues with poor-performing traversals requires a deep knowledge of the inner workings of the database. However, the explain() step is a commonly available tool across different database implementations, and some people find it useful in debugging, so we’ll give explain() a little attention.

Let’s say we want to know how our traversal runs, but we do not want to execute it. Most graph databases perform this type of debugging via the use of an explain() step. This is similar to an estimated execution plan in a relational database, in as much as the database optimizer shows the output after it rearranges and optimizes the traversal, but before it actually runs the traversal on the data. (Gremlin does this through the use of strategies; see http://mng.bz/ggoR). The important part to focus on is the final traversal plan. This represents the optimized plan to be executed on the graph data. The output of the explain() step lists the various optimization strategies that were applied in order to reach the final internal form of the traversal, the one designed to run on the physical data.

The best way to illustrate this is to run an explain() step and then examine the output. In the following example, the Final Traversal is highlighted in bold, and the intermediate, non-optimized output is removed for brevity. Let’s run an explain() step on our slow traversal and see if we can draw any conclusions from the output:

g.V().has('person', 'first_name', 'Dave').
  both('friends').
  both('friends').
  groupCount().
    by('first_name').unfold().
  order().
    by(values, desc).
    by(keys).
  project('name', 'count').
    by(keys).
    by(values).
  limit(3).
  explain()                                                  
==>Traversal Explanation
===========================================================================
...                                                          
Final Traversal[TinkerGraphStep(vertex,[~label.eq(person), 
 first_name.eq(Dave)]), 
VertexStep(BOTH,[friends],vertex), 
VertexStep(BOTH,[friends],vertex), 
GroupCountStep(value(first_name)), 
UnfoldStep, 
OrderGlobalStep([[values, desc], [keys, asc]]), 
RangeGlobalStep(0,3), 
ProjectStep([name, count],[keys, values]), 
 ReferenceElementStep]                                     

Performs the explain command

Output removed for brevity

The final traversal is what matters most.

As we mentioned, the important part is the line starting with Final Traversal. This is the optimized plan that’s executed on the graph. In this example, the optimized code performed against our graph is

[TinkerGraphStep(vertex,[~label.eq(person), 
   first_name.eq(Dave)]),                       
VertexStep(BOTH,[friends],vertex),                
VertexStep(BOTH,[friends],vertex),                
GroupCountStep(value(first_name)),                
UnfoldStep,                                       
OrderGlobalStep([[values, desc], [keys, asc]]),   
RangeGlobalStep(0,3),                             
ProjectStep([name, count],[keys, values]), 
   ReferenceElementStep]                        

Maps to V().has('person', 'first_name', 'Dave')

Maps to the first both('friends') step

Maps to the second both('friends') step

Maps to groupCount().by('first_name')

Maps to unfold()

Maps to order().by(values, desc).by(keys)

Maps to limit(3)

Maps to project('name', 'count').by(keys).by(values)

Although it is nice to see the traversal written in optimized steps, it doesn’t explicitly point out why our traversal is slow or what we can do to improve its performance. What it does do, however, is show us how this traversal will be executed. With enough practice and knowledge of a particular database, you can understand what needs to be done to further optimize the execution plan. But we find that simply knowing how a traversal will execute tends to lack the insight we need to fix performance problems.

This lack of insight is one reason why we rarely use the explain() step. Another reason is that this optimized plan is always the same, no matter what your starting vertex is. In many scenarios, we have a traversal that works well starting in one location of the graph but performs poorly for other starting points. In these scenarios, the explain() step won’t help to diagnose performance issues.

More often than not, we want to see the actual execution of a traversal, not just the way the engine thinks it will run it. This leads us to the most commonly used performance debugging tool--profiling.

10.1.2 Profiling our traversal

Let’s say our slow traversal works perfectly fine for some users, but horribly for others. Instead of looking at the planned execution, we need to profile the actual operations. This allows us to compare good runs to the bad ones and to see the differences.

In most graph databases, this type of debugging is done via the use of the profile() step. The profile() step runs the traversal and collects statistics on the performance characteristics during its execution. These statistics include details about the execution, similar to an actual execution plan in a relational database.

As with the explain() step, the easiest way to understand a profile() step is to run one and then examine the output. Let’s profile our slow-performing traversal and investigate the output, shown in figure 10.1, looking for where the traversal spends the most time and which step uses the most traversers. In the figure, the duration (%Dur) column is highlighted in bold.
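In code, that means appending profile() to the end of the same traversal in place of explain():

g.V().has('person', 'first_name', 'Dave').
  both('friends').
  both('friends').
  groupCount().
    by('first_name').
  unfold().
  order().
    by(values, desc).
    by(keys).
  project('name', 'count').
    by(keys).
    by(values).
  limit(3).
  profile()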

Figure 10.1 The output of a profile() step showing associations back to the original traversal steps

Upon examining the figure 10.1 output, we notice a couple of things. First, we see that the lines in the output match the bytecode from the explain() step shown in the previous section. This correlation makes sense because the explain() step tells us how a traversal executes, and the profile() step informs us what happens when it runs.

Second, each line in the traversal maps to one step in our optimized traversal (shown in the output). Usually, it’s straightforward to determine which step in the traversal refers to which line in the output. Unfortunately, there’s no definitive documentation on how these map because they are specific to your vendor’s implementation. Different vendors have different implementations, each of which has a different engine and approach toward optimization strategies. Finally, for each line of output we see

  • The count of the represented traversers (Count)

  • The count of the actual traversers (Ts or Traversers)

  • The time spent on that step (Time)

  • The percentage of the traversal’s total duration spent on that step (%Dur)

The Count and Traversers (the Ts column) values won’t always match; for example, when the same element is visited multiple times. In this instance, the traversers can be merged in a process in Gremlin known as bulking, which causes the Count value to be larger than the Traversers value.
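If you want to see bulking for yourself, one way (a sketch, assuming a Gremlin Console session where TinkerPop’s toy “modern” graph is available via TinkerFactory) is to profile a traversal whose paths converge on the same vertices; the later steps typically report a larger Count than Traversers:

// Paths converge on the same vertices, so traversers get bulked
g = TinkerFactory.createModern().traversal()
g.V().both().both().count().profile()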

Note One cautionary note is that profiling traversals requires extra resources, so the represented times may not match a non-profiled traversal. However, the proportion of time spent is the same between profiled and non-profiled traversals.

The critical questions are, where is the traversal spending most of its time and what is the count of traversers on that step? Based on the previous output, we see that more than 48% of our traversal’s time is spent on the has('person', 'first_name', 'Dave') step. This leads us to one of two common fixes, the details of which are covered in the following section.

If we find that the longest duration steps have many traversers, we should add additional filtering criteria prior to that step to reduce the number of traversers required. In our example, however, this isn’t the case. Instead, we identify that the longest duration step doesn’t have many traversers associated with it; it only has one traverser. Because we know that we only have a single traverser and that our step is a filtering step, this leads us to think that we should add an index (our second common fix).

10.1.3 Indexes

Similar to relational databases, an index in a graph database provides a method to efficiently find data based on predefined criteria. Indexes work by allowing us to quickly and directly access the data that we’re seeking, instead of scanning the entire graph to find it. Avoiding a scan of the entire graph creates massive performance improvements.

Let’s say we want to search a graph to find a vertex where the first_name is Dave. Without an index, this requires us to look at every vertex to see if it has a property named first_name and, if it does, whether the value of that property is Dave. While this might not be a noticeable issue in small graphs, in graphs with thousands, millions, or billions of nodes, this creates a huge performance impact.

Let’s take a look at how this same scenario works if we add an index on the first_name property. With this option, instead of having to look at every vertex, we look at the index. The index already knows which vertices have a first_name property and can, with a single lookup, find the ones with the value Dave. It is reasonable to expect that performing a single lookup inside an index will be significantly faster than looking at all of the thousands, millions, or billions of vertices. Here are three areas where indexes can provide the most performance improvement:

  • Properties frequently used for filtering on values or ranges. Indexes quickly reduce the number of traversers required to execute a particular task, thereby reducing the work required of the database. This is especially helpful early on in a traversal where a minimal number of traversers is desired.

  • Properties requiring a full-text search, such as finding words that start with, end with, or contain a specific phrase. Many databases require a particular type of index to perform a full-text search on a property because these warrant special handling to be indexed efficiently.

  • Spatial features needing to be searched if the database supports geospatial data. Spatial properties also fall into the category of requiring special indexes to perform the appropriate queries such as, “Find all restaurants within ten miles of here.”

The increased efficiency that indexes bring comes at the cost of additional storage and additional writes behind the scenes. An index makes a redundant copy of data, or at least pointers to data, optimized for retrieval by specific criteria. For these reasons, we should be prudent when adding indexes to our graph and only use these when needed to achieve the desired performance.

Every vendor’s implementation has different indexing capabilities and characteristics. Some implementations, such as TinkerGraph, only offer global single-value indexes. Others, like Neo4j, DataStax Graph, and JanusGraph (among many others), enable a full range of single-value, composite, range-based, and even geospatial indexes. Still others, such as Azure Cosmos DB and Amazon Neptune, have no concept of user-defined indexes, preferring that the indexing details be left to the service provider to manage. We highly recommend consulting the documentation of your chosen database for its indexing capabilities, as well as the best practices for using those indexes.
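To make this concrete, here is roughly what creating an index on first_name might look like in two implementations. This is only a sketch; the index name is illustrative, and the exact API is vendor-specific, so check your database’s documentation:

// TinkerGraph: a global, single-value property index on vertices
graph = TinkerGraph.open()
graph.createIndex('first_name', Vertex.class)

// JanusGraph: a composite index built through the management API
mgmt = graph.openManagement()
firstName = mgmt.makePropertyKey('first_name').dataType(String.class).make()
mgmt.buildIndex('personByFirstName', Vertex.class).addKey(firstName).buildCompositeIndex()
mgmt.commit()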

In this section, we looked at some of the diagnostic tools we can use when a particular traversal is slow. However, traversals are only one part of the application that causes performance issues. Sometimes the issue is not with the traversal, but with the data itself. Many of these data-related performance problems can be traced back to a single source--supernodes.

10.2 Dealing with supernodes

Supernodes are one of the most common data-related performance problems in graph databases. They are also particularly difficult to deal with because a supernode can’t simply be removed; it’s part of the data, so we can only try to mitigate the problems it causes. This leads us to our first question: What is a supernode?

A supernode is a vertex in a graph with a disproportionally high number of incident edges. We find that supernodes are hard to define but easy to understand with an example, so let’s take a look at one instance using Twitter.

At the time of writing this book, the most-followed person on Twitter is Katy Perry with 107.8 million followers (https://twitter.com/katyperry). Based on research performed in 2016 by the social media marketing company KickFactory (http://mng.bz/em5J), the average Twitter user has 707 followers. This means Katy Perry has ~152,475 times as many followers as the average Twitter user. Let’s assume we stored this data using the data model in figure 10.2.

Figure 10.2 An example logical data model for Twitter

Based on this model and current research, Katy Perry has 107.8 million follows edges incident to her user vertex, and the average user has 707 follows edges. What happens when we notify all followers about a new tweet?

When an average user posts a tweet, we need to notify all their followers, meaning our traversal will have 707 traversers, one for each follows edge. In Katy Perry’s case, when she posts a tweet, our traversal will have 107.8 million traversers. With this sort of follower differential, it’s safe to assume that Katy Perry’s tweets are more computationally intensive than the average user’s post. While an extreme example, it is this sort of disparity that leads to supernodes.
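In Gremlin, the fan-out is easy to see. As a sketch, assuming the model in figure 10.2 with a follows edge pointing from the follower to the user being followed and a user_name property (both assumptions on our part), notifying followers starts with something like this:

g.V().has('user', 'user_name', 'katyperry').
  in('follows').               // one traverser per follower: ~107.8 million here
  values('user_name')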

The natural next question is, “What number is disproportionately high?” We wish we could give you a precise number where a vertex becomes a supernode, but it’s not that easy. We need to understand two main concepts when discussing a supernode: instance data and underlying data structures. Let’s look at these.

10.2.1 It’s about instance data

The first concept is that a supernode is a specific vertex of a specific label in the instance data. One common misunderstanding is that a supernode refers to a vertex label, but this isn’t the case. Instead, it refers to an instance of a vertex with a disproportionate number of edges compared to the other instances with the same label. Think back to our Twitter example where, because the average user and Katy Perry are both people, they have the same vertex label. In other words, it’s the Katy Perry instance of the user vertex that is a supernode, not the generic user label.

10.2.2 It’s about the database

The second concept to understand is that what performs like a supernode in one traversal on one database can work fine on a different database or within a different traversal in the same database. Underlying data structures and storage algorithms differ between database vendors. These differences, along with other database-specific optimizations, make it impossible to provide a generalized answer. But we do recommend that you review the documentation for your chosen database, understand the distribution of relationships in your data, and thoroughly test the chosen system based on these expected distributions.

10.2.3 What makes a supernode?

While our Twitter model is useful to demonstrate the concept of a supernode, most of us will never work with data at that scale. Instead, let’s use our DiningByFriends data model, shown in figure 10.3, as a more realistic example and see if we find any potential supernodes.

Exercise Apply what you just learned about Katy Perry and the Twitter example to our DiningByFriends model. Can you identify any potential supernodes?

Figure 10.3 The logical data model for our DiningByFriends application

When we look at our DiningByFriends data model, we see two potential opportunities for supernodes: the city and the state vertices. Why these two vertex labels? To demonstrate why city and state might potentially cause supernodes, let’s use the example of two cities in the United States: New York City, New York, and Anchorage, Alaska.

If we do a simple Google search, we find that New York City is home to around 26,000 restaurants, while Anchorage has approximately 750 restaurants. This means that the city vertex for New York City has almost 35 times as many incident within edges as the vertex for Anchorage, so any traversal from New York City requires 35 times the work of one from Anchorage. While not quite as dramatic a difference as our Twitter example, we’d like to think that a 35-times spread in values represents a disproportionately distributed data set.

This same logic applies to our state vertex as well. The state of New York has approximately 1,000 cities and towns, while Alaska has around 130. This disparity represents a nearly eight times difference between the two.

None of this disparity automatically means that we have a supernode (we’ll learn how to determine that in the next section), but these are the two most likely supernode candidates we see in our model.

10.2.4 Monitoring for supernodes

If we can’t give specific numbers of what constitutes a supernode, how do we find them? We generally employ two strategies, frequently in parallel, to detect supernodes: monitoring for growth and monitoring for outliers.

Monitoring for growth

The first strategy is to periodically monitor the degree (the number of incident edges) of all the vertices in our graph and look for the top outliers. Monitoring is important because supernodes rarely exist at the beginning; they grow over time. In other words, supernodes are rarely created during the initial loading of data; instead, they tend to emerge as more and more data is added to a graph. This is because many real-world networks are scale-free networks: networks with many low-degree vertices and only a few high-degree vertices.

Think back to our Twitter example. The majority of users have a low number of connections. There’s also a small minority with a high number of connections. The same is true if you look at other networks, such as airline companies. While most airports will likely only have a few flights, there are a small number of airports, namely the hubs, that have a larger number of flights. This type of distribution of data is known as a power-law distribution, as figure 10.4 shows.

Figure 10.4 Power-law distribution with the power law having a long tail

The long tail of the distribution is where supernodes exist in scale-free networks. If we have a scale-free network and these cause supernodes, how do we check the growth of the degree of vertices? We need to monitor our data periodically to find the vertices with the highest degree. In brief, we need a traversal to take the following steps (illustrated in figure 10.5):

  1. Find all vertices.

  2. Calculate the degree of each vertex.

  3. Order the results, descending by degree.

  4. Return only the top N results.

Figure 10.5 An example traversal to find the top 10 vertices with the highest degree
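In Gremlin, a minimal sketch of the traversal illustrated in figure 10.5 looks something like this (returning the top 10; the bare by() keeps the vertex itself in the projection):

g.V().
  project('vertex', 'degree').
    by().                          // the vertex itself
    by(bothE().count()).           // its degree: all incident edges
  order().
    by(select('degree'), desc).
  limit(10)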

Running this or similar traversals at regular intervals and tracking the results monitors the growth of potential supernodes proactively and catches these before a problem arises. While this strategy is an effective tool to find and monitor supernodes, it has a few significant drawbacks.

First, it requires us to remember to run this traversal and monitor the output on a regular basis. We’re all busy, and this sort of housekeeping task is easy to delay, meaning that supernodes can creep in without proper warning. Second, the traversal in figure 10.5 is long-running because it requires visiting every vertex in the graph once and every edge twice (once for each incident vertex). As our graph grows, this traversal will take longer and longer to run and will consume increasingly more resources. But we have another possible approach for monitoring supernodes in our toolbelt. Let’s look at that additional process next.

Monitoring for outliers

The second commonly used approach to monitoring for supernodes is to reactively monitor the performance of traversals and look for outliers. This is usually done with one of the many different application monitoring tools available on the market. When monitoring for supernodes, we look for traversals that are taking a significantly longer time to execute for one vertex than for another. Although supernodes aren’t the only cause of slow performance, these are one of the common reasons why a generally well-performing traversal exhibits performance differences on specific vertices.

Although these two methods for identifying supernodes in our graph are useful, as we said, both approaches have downsides. Because of the amount of the graph touched, each tends to result in a long-running query that places significant additional load on the graph database. There’s no magic bullet for detecting supernodes. Still, domain knowledge, proper data modeling, and continuous monitoring are the best tools available to prevent and identify supernodes.

10.2.5 What to do if you have a supernode

If you determine that you have a supernode in your graph, the first thing to do is decide whether the supernode is actually a problem. If it is a problem, the best step is to look at ways to mitigate the supernode.

Is the supernode a problem?

We need to consider how our traversals are traversing a supernode to determine if it causes a problem. For example, let’s look at a subsection of our DiningByFriends schema containing only the city, state, and restaurant vertices, as figure 10.6 illustrates.

Figure 10.6 The portion of the DiningByFriends logical data model relating to restaurants, cities, and states.

As we previously mentioned, both city and state vertices are likely candidates to become supernodes within our graph. However, because these are only likely supernodes, we need to examine the specific traversals we run on our graph to decide if these potential supernodes will become problematic.

Let’s say that the only traversal our application makes is to answer the request, “Get me the city and state for restaurant X.” In this traversal, we only ever have one city and one state vertex associated with a specific restaurant vertex. This means that there is only ever a single within edge traversed when moving from a restaurant to a city, and only a single within edge traversed when moving from a city to a state. While both an instance of a city, such as New York City, and a state, such as New York, are likely to be supernodes, we traverse these in a way that minimizes the number of edges traversed, so these vertices won’t cause the performance problems associated with supernodes. That is, our chosen access patterns will not encounter the worst possible branching factor, the number of successors of a given vertex, when traversing through either city or state vertices.

On the other hand, if we were to slightly change our request to, “Give me all the restaurants in New York City,” we would then encounter the opposite problem. To answer this traversal, we need to traverse the roughly 26,000 within edges associated with New York City to find all the restaurants. This will likely cause significant performance problems because 26,000 individual traversers are required. This traversal can inflict long wait times while our database churns through these requests.
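To see the difference, compare the two traversals side by side. This is a sketch that assumes the within edges point from restaurant to city and from city to state and that each vertex carries a name property; the restaurant name is a placeholder:

// "Get me the city and state for restaurant X": one within edge per hop
g.V().has('restaurant', 'name', 'Restaurant X').
  out('within').as('city').
  out('within').as('state').
  select('city', 'state').
    by('name')

// "Give me all the restaurants in New York City": fans out across ~26,000 within edges
g.V().has('city', 'name', 'New York City').
  in('within').
  values('name')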

As these examples demonstrate, a change in the question can flip the same vertex from performing normally in one scenario to acting as a supernode in another. This behavior is one of the exacerbating complexities of supernodes. Not only are they highly dependent on the situation, but in some cases the negative impact also depends on the specifics of the vendor’s implementation, the hardware configuration, and the indexes configured for the graph.

In addition to the direction we traverse our supernode, there are some scenarios, especially when running analytical algorithms, that require supernodes to get the correct answer. Some algorithms rely on the connectedness of a graph as all or part of the calculation; for example, when we’re in a domain such as social networking, peer-to-peer file sharing, or network asset monitoring, and we want to answer, “Who is the most connected person in my graph?” This question requires an accurate count of the degree of all the vertices in our network. With this sort of calculation, having vertices with disproportionately high edge counts is what we’re after. Here, a supernode in our graph is a meaningful construct.

However, because we’re traversing through the entire graph (or a large portion of it), these calculations should be treated as analytical instead of transactional. Transactional operations typically have a time allotment measured in seconds or milliseconds, but analytical operations can have a time allocation of minutes, hours, or longer.

Say that we look at our data and decide that we have a supernode. Then, we assess our questions and conclude that we need to traverse through these supernodes in a manner that’s likely to cause problems. What can we do to alleviate this?

Mitigating supernodes

The most common and universally applicable approach to handling supernodes is to refactor the model to remove or minimize the impact of the supernode. This means going back to our schema and investigating potential changes to our data model to remove the need to traverse through any supernodes. To do that, we’ll need to employ one or more of the data modeling strategies we have learned thus far, including

  • Duplicating vertex properties on edges

  • Making vertices into properties or properties into vertices

  • Moving property locations

  • Precalculating data

  • Adding indexes

The goal of this refactoring is to minimize the number of edges that our graphs traverse. Wait! Isn’t the point of a graph database to traverse edges? Isn’t it better at that than any other data engine? Yes, that’s true. Graph databases are optimized for this access pattern of traversing edges. However, just because graph databases are better with this operation than other engines doesn’t mean that we want graphs to perform more work than required.

In the last section, we identified that the traversal, “Give me all the restaurants in New York City” is a problematic traversal. Let’s figure out how to change our data model to answer this question, without touching all 26,000 within edges associated with New York City.

Exercise Use the traversing techniques you learned in chapters 3, 7, and 10 to see if you can change our DiningByFriends data model to reduce the number of edges that need to be traversed.

Examining our DiningByFriends data model, we realize that what we need to do is get the city and state properties of an address co-located on the restaurant vertex. If we co-locate this data, there is no need to traverse to the city vertex to retrieve that information.

We know how to apply data denormalization techniques to create new properties on the restaurant vertex for the city and state names. Because we cannot have two properties on the restaurant vertex with the same key, name, we rename these to city_name and state_name. Figure 10.7 depicts this change.

Figure 10.7 The portion of the DiningByFriends logical data model relating to restaurants, cities, and states with the city_name and state_name attributes denormalized to the restaurant vertex

Denormalizing the properties removes the need to traverse any edges to answer the request, “Find all the restaurants in New York City.” However, it introduces a new problem: we now must scan every restaurant vertex in our graph to answer that request. To reduce the impact of scanning the entire set of restaurant vertices, we add an index on these properties. Combining these two techniques, indexing and denormalizing, allows us to quickly and efficiently retrieve data for both requests, “Give me all the restaurants in New York City” and “Get me the city and state for restaurant X,” because the data is now co-located on the restaurant vertex.
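Side by side, the change looks something like this (a sketch using the denormalized city_name property from figure 10.7):

// Before: fan out across every within edge on the New York City vertex
g.V().has('city', 'name', 'New York City').
  in('within').
  values('name')

// After: filter restaurants directly on the co-located property,
// backed by an index on the label/property combination
g.V().has('restaurant', 'city_name', 'New York City').
  values('name')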

As one last bit of cleanup, we can remove the city and state vertices from our data model because they are no longer being used. Figure 10.8 depicts the result.

Figure 10.8 Our updated DiningByFriends logical data model with the city and state attributes stored on the restaurant vertex and with the city and state vertices removed

Vertex-centric and edge indexes

Because supernodes are a common problem within graph databases, some database vendors include specific index types to help address this issue. These index types are often referred to as edge indexes or vertex-centric indexes. We apply these types of indexes to specific Vertex-Edge-Vertex combinations; these help to index and sort the combinations to prevent the linear scan of edges, which should provide a faster graph traversal.

Not all databases support these types of indexes, however, and the details on how each vendor implements these vary. If the database you select supports features like this, we highly recommend investigating vertex-centric and edge indexes as potential solutions to supernodes.
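For example, JanusGraph exposes vertex-centric indexes through its management API. Here is a sketch; the edge label, sort property, and index name are illustrative:

mgmt = graph.openManagement()
friends = mgmt.makeEdgeLabel('friends').make()
since = mgmt.makePropertyKey('since').dataType(Date.class).make()
// Sort and index the friends edges on each vertex by their since property
mgmt.buildEdgeIndex(friends, 'friendsBySince', Direction.BOTH, Order.desc, since)
mgmt.commit()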

10.3 Application anti-patterns

While supernodes can become problematic as the data grows, there are other anti-patterns that you might encounter when creating graph-backed applications. In this section, we discuss

  1. Using graphs for non-graph use cases

  2. “Dirty” data

  3. Lack of adequate testing

Each of these anti-patterns commonly appears in the design, architecture, and preparation for building an application.

10.3.1 Using graphs for non-graph use cases

“I want to use a graph database, so let’s find a use case for one.”

--Undisclosed graph database client

As we learned in chapter 1, while graph databases are good at several specific types of complex problems, they aren’t a universal solution to all problems. It’s crucial to remember both the benefits and the limitations of how graphs provide insight into our data and transform businesses in the process. A graph cannot answer a question we don’t ask. Before we build a graph solution, we need a strong-enough understanding of the information to be able to model and traverse our graph.

It’s also important not to overplay the flexibility of graphs. While graphs do have an amazing agility, that doesn’t mean that any graph data model can answer any question or do so in an efficient manner. As we’ve already shown in this chapter, seemingly obvious graph implementations can hide problems, such as poor performance from a lack of indexes or unexpectedly highly connected parts of the data called supernodes. We think that the simplicity of graphs, and their seemingly natural way of expressing a design, lulls many into believing that because graphs can do anything with data, they should do everything.

The answer to this problem, using graphs for non-graph use cases, is straightforward: don’t let the excitement of getting to use a new technology overwhelm good software development fundamentals. Want to test if you completely understand a problem? Explain it to someone else. When we understand a problem well enough to explain it to someone, then we typically comprehend it well enough to begin working on it. If you aren’t able to explain it, then spend some time refining your understanding of exactly what you want to accomplish.

10.3.2 Dirty data

The second anti-pattern involves adding dirty data to our graph. The dirty data we’re referring to is data that contains errors, duplicate records, incomplete or outdated information, or missing data fields.

Dirty data, just like the other anti-patterns, isn’t a problem unique to graph databases, but it does cause some unique issues. Because graph databases rely heavily on the connectedness between data, they also require an accurate representation of the underlying entities to work effectively. The cleaner the data and the fewer the duplicates, the more accurately the connectedness is represented within our graph. Consider an example of three dirty data records in an RDBMS representing people and their associated addresses, as the following table shows.

ID   Name         Address
1    John Smith   123 Main St.
2    J Smith      123 Main Street
3    Bob Diaz     123 Main

From examining this data, we can infer that John Smith and J Smith are likely the same person and that all three people likely live at the same address. However, if we add this dirty data to a graph as is, what we get is three separate person-address pairs rather than everyone connected to the same address vertex, as we would expect. Figure 10.9 shows this result.

Figure 10.9 Our model showing three pairs of data not connected to the same address

Let’s say we want to use this graph to try and determine if anyone in our graph lives at the same address. As each of the entities is represented as its own vertex, we won’t find any matches. This is a problem because this sort of related link or connectedness is the basis for many graph use cases.

Thinking back to our social network, if we had multiple person vertices representing the same physical person, then even something as straightforward as finding a friends-of-friends link with dirty data would, at best, be hard to use. At worst, it would provide inaccurate results. What’s the solution to dirty data?

The answer is simple: clean our data before we import it using a process known as entity resolution. Entity resolution is the process of de-duplicating, linking, or grouping records that are believed to represent the same entity together into a single canonical representation. This data cleaning process is crucial for most data sets and becomes a greater challenge as the volume and velocity of data grows. Entity resolution is a complex process, but the crucial takeaway is that the data-cleaning process is an important part of any graph database application.
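Entity resolution itself is beyond the scope of this chapter, but one small piece of it shows up directly in our load traversals: a get-or-create (upsert) pattern keyed on a normalized value, so that duplicates collapse onto a single vertex. Here is a sketch, with illustrative property names:

// Reuse the existing address vertex for this normalized value, or create one
g.V().has('address', 'normalized', '123 main st').
  fold().
  coalesce(
    unfold(),
    addV('address').
      property('normalized', '123 main st').
      property('display', '123 Main St.')).
  next()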

Returning to our address example, look at figure 10.10 to see what our graph looks like when we clean the name and address data to find matches before adding these to our graph. Using this graph, we can determine that Bob Diaz and John Smith share the same address, 123 Main St.

10.3.3 Lack of adequate testing

The last application anti-pattern is a lack of adequate testing, which usually falls into one of two categories: non-representative test data or insufficient scale when testing.

Non-representative test data

Because of the connected nature of graphs and how we traverse them, ensuring that we test against representative samples of data is more important than with relational databases. A representative sample is especially vital when dealing with the branching factor (http://mng.bz/pzVP), the number of successor vertices at each step of a recursive query. Unlike other database technologies, the performance of graph traversals is less dependent on the quantity of data in the graph and more dependent on the connectedness of the data. Testing against a representative set of data, including realistic edge cases and particularly potential supernodes, is crucial.

Lack of scale

Testing at an adequate scale means using sufficiently deep and sufficiently connected data, not just sufficiently large data. Testing at scale is especially paramount with highly recursive applications or those using a distributed database. For example, if the application we build involves an unbounded recursive traversal that performs acceptably on our test graph with a depth (number of iterations) of 5, it might function horribly on a production graph with a depth of 10.
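For example, the difference between an unbounded and a bounded recursion is a single until() condition. This is a sketch using the person vertices and friends edges from our graph; the names and the cap of 10 are illustrative:

// Unbounded: the recursion depth is dictated entirely by the data
g.V().has('person', 'first_name', 'Dave').
  repeat(both('friends').simplePath()).
    until(has('first_name', 'Hank')).
  path().
  limit(1)

// Bounded: an explicit cap on loops() keeps the worst case predictable
g.V().has('person', 'first_name', 'Dave').
  repeat(both('friends').simplePath()).
    until(or(has('first_name', 'Hank'), loops().is(10))).
  path().
  limit(1)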

Figure 10.10 Address data graph containing our clean data connected to the same address vertex

It’s extremely complex to effectively simulate graph data. The scale-free nature of most graph domains represents a unique challenge. In our experience, the best approach to creating test data isn’t to create it at all but, instead, to use real data whenever possible, sometimes in combination with data-masking approaches in order to protect personally identifying information in the data. But more often than not, this isn’t possible. If you do have to simulate data, take extra care that the data’s shape matches the expected production data.

10.4 Traversal anti-patterns

In this last section, we look at some of the architectural anti-patterns when building graph-backed applications. These patterns include

  • Not using parameterized traversals

  • Using unlabeled filtering steps

Both of these represent either a security or a performance risk. Understanding how to identify and remedy these patterns is therefore essential for building secure, performant, and maintainable production applications.

10.4.1 Not using parameterized traversals

According to the latest report (2017) on the top application security risks released by the Open Web Application Security Project (OWASP), the most common vulnerability in applications is injection-type attacks (http://mng.bz/gg).

Injection flaws, such as SQL, NoSQL, OS, and LDAP injection, occur when untrusted data is sent to an interpreter as part of a command or query. The attacker’s hostile data can trick the interpreter into executing unintended commands or accessing data without proper authorization.

For those familiar with SQL, both the concept of injection attacks and the use of parameterization to defend against them should be well understood. What’s new is that as hackers become increasingly sophisticated, they’re applying the same techniques used on RDBMSs to graph and other NoSQL databases. Fortunately, what works in relational databases, parameterization, applies to graph databases as well. Let’s see what injection attacks are and then show how parameterization defends against them.

Important We’ve used best practices throughout this book and parameterized all the traversals we built. Gremlin Language Variant (GLV) traversals are automatically parameterized. For the subgraph traversals where we used the string API, we built those to be parameterized.

What is an injection attack?

Hackers use injection attacks when they find openings in an application that allow them to add their own malicious code to the application. This malicious code then can delete records, add users, or view unauthorized data. Missing validation logic on user input and then using that input directly in a database query is what allows this type of access. For example, let’s take an application that has a REST endpoint that generates a query (shown in SQL and Gremlin) to find all of a user’s friends where the userid is passed in from a URL.

URL

http://diningbyfriends.com/friends?userid=?

SQL query

"SELECT * FROM friends WHERE userid = " + userid

Gremlin

"g.V().has('friend', 'user', " + userid + ")"

In a normal scenario where no one is attacking the system, the URL and resulting query would look something like this.

URL

http://diningbyfriends.com/friends?userid=1

SQL query

"SELECT * FROM friends WHERE userid = 1"

Gremlin

"g.V().has('friend', 'user', 1)"

In the SQL query, values passed from the user are directly concatenated into the query. While this concatenation seems like a convenient and simple way to build an application, it poses one of the most severe types of security vulnerabilities. If a clever hacker wants to see information about other users’ friends, they can, using a few simple tricks, inject SQL into the query to exploit this vulnerability.

SQL URL

http://diningbyfriends.com/friends?userid=1+OR+userid%3D2+OR+userid%3D3

SQL query

SELECT * FROM friends WHERE userid=1 OR userid=2 OR userid=3

To exploit this weakness via Gremlin, a hacker would need to know Gremlin and change their input. However, it’s still possible as depicted here.

Gremlin URL

http://diningbyfriends.com/friends?userid=within(1,2,3)

Gremlin

g.V().has('friend', 'user', within(1,2,3))

With a simple change of the URL, a hacker has exposed others’ private information. Using tricks similar to this URL hack, a hacker can do more damaging acts, such as deleting records, deleting the database, or hijacking our system. While we demonstrated this type of hack with Gremlin, this same technique is applicable to other graph query languages as well. In fact, any application that accepts input unfiltered or unchecked, regardless of language, is susceptible to such an attack.

Preventing injection attacks

Each graph query language and its drivers offer some type of parameterized traversal functionality. A parameterized traversal uses tokens to represent the input values in the query. At execution time, these tokens are replaced with values passed in from the application after being sanitized and validated. Using JDBC in Java, we accomplish parameterization with a PreparedStatement:

"SELECT * FROM person WHERE first_name = ?");
"Ted");

If we use the Gremlin Language Variants (GLVs) that we’ve employed throughout this book, then we’re already using parameterized queries. However, many other databases, including some TinkerPop-enabled ones, only support string-based traversals, which opens the door to injection through string concatenation, as shown here:

public static void insecureGraphTraversal(
     Cluster cluster, String userid) {
    Client client = cluster.connect();
    String traversal =
        "g.V().has('friend', 'id', " + userid + ")";
    client.submit(traversal);
}

Creates a client connection

Creates our string with concatenated parameters, which is not good

Submits our concatenated string

This method runs the risk of malicious code being injected into our code because we are just concatenating user inputted values to our string without validation. To protect against malicious code, we use parameters for the data being passed from the user. First, we construct our string with a token, which is just a name in the string that represents the user’s input values. We then pass a map of tokens and their corresponding values:

public static void secureGraphTraversal(
     Cluster cluster, String userid) {
    Client client = cluster.connect();
    String traversal =
        "g.V().has('friend', 'id', userid)";
    Map<String, Object> map =
        Collections.singletonMap("userid", userid);
    client.submit(traversal, map);
}

Instantiates a client connection

Formulates our tokenized traversal (the right way to do it)

Creates a map of parameters

Submits the traversal and the map of parameters

When the traversal is executed by the client.submit() method, the server replaces the tokens with the values stored in the map. It does this in a way that doesn’t allow malicious code to execute, just as with a PreparedStatement in JDBC. While this example was written using Gremlin, the same basic concept applies to other graph query languages, such as Cypher.

There’s an additional benefit to using parameterized queries. Most data engines, graph or otherwise, cache the execution plan. This saves us the cost of generating a new execution plan after the first time that the traversal is called with parameters. For frequently used traversals, caching execution plans can measurably improve server performance.

10.4.2 Using unlabeled filtering steps

The anti-pattern in the previous section, injection attack, is one of the most severe from a security and data integrity perspective, even if it isn’t that common. In contrast, the anti-pattern for this section, using unlabeled filtering steps at the start of a traversal, doesn’t pose a security risk but tends to have a massive impact on traversal performance. Let’s first investigate what an unlabeled filtering step looks like in Gremlin:

g.V().has('first_name', 'Hank').next()

This seems innocent enough, but why is it an anti-pattern? If you look closely, you can see that we didn’t specify the label or labels of the vertex to search. We don’t give the database any hints or help on where to look to find vertices with a first_name of Hank. To satisfy this traversal, our database must

  1. Find all vertex labels in the graph.

  2. For each vertex label, find all vertices.

  3. Determine if the vertex has a property called first_name.

  4. If so, determine if the value of that first_name property is equal to Hank.

  5. If so, return the vertex.

Because we didn’t provide a label to filter on in the first step, all the remaining steps must run for every vertex in our graph, causing a huge performance impact. When transitioning to graph databases from a relational world, not filtering a traversal at the start is a common mistake because, in the RDBMS world, queries such as this aren’t even possible. The key difference is that in a graph database, the starting point is all vertices (g.V()), while in an SQL query, the starting point is a specific table (FROM table). The table specification provides a natural boundary for an SQL query.

There are two different ways we can enforce filtering our traversal: either add a global index on the first_name property, or specify the vertex label at the start of the traversal. Let’s look at both approaches.

The first approach, adding a global index on the first_name property, has two problems. The first is that it might not be easy to create the correct index or indexes. While TinkerGraph does allow us to add a global index on a property called first_name, global indexes themselves aren’t common among graph database vendors; most databases that support indexes require that the index be a combination of both label and property. The second problem is that even with the proper global indexes created, we still have to search across every vertex label. Each lookup is faster, but we are still touching far more of the graph than necessary.

The second, and preferred, method to enforce filtering is to add the appropriate labels to our filtering queries and to also add indexes for the vertex and property combination. After adding the proper filtering to our traversal, it now looks like this:

g.V().has('person', 'first_name', 'Hank').next()

This pattern should be familiar because we’ve done this throughout the book. Although this approach makes sense, the anti-pattern of unlabeled filtering steps is probably one of the more typical ones we run across. Even in the schemaless world of graph databases, there’s an implicit schema being applied to a graph by the domain; in other words, when we search for a first_name attribute, we know which label or labels contain the property we’re targeting.

While this anti-pattern has the greatest impact at the start of a transactional traversal, it is also helpful to think about providing labels whenever possible to commonly used filtering steps, both later in a transactional traversal as well as in analytical traversals.

Note As a general rule, the earlier and more precise the filtering criteria we provide to a traversal, the faster that traversal runs.

Summary

  • Performance issues with graph traversals can be diagnosed using one of two common methods:

    • explain()--Shows how a graph traversal is executed but does not run it.

    • profile()--Runs the traversal and collects statistics about what actually occurred. (This step is a lot more helpful.)

  • Graph databases use indexes to speed up traversals by allowing quick and direct access to data, similar to relational databases. However, there are multiple types of indexes offered. The index type available is dependent on the vendor’s specific database implementation.

  • Supernodes are vertices in a graph that have a disproportionately high number of edges. These can cause traversal performance problems, especially when running transactional traversals.

  • Supernodes can be found by periodically monitoring the degree of the vertices in the graph for growth over time and by monitoring traversal performance for outliers.

  • Supernodes can be mitigated by using data modeling tips and tricks to split up the edges across multiple, different vertices or by using features such as vertex-centric indexes available in some graph databases. These indexes are designed to help alleviate the side effects of supernodes.

  • Although supernodes aren’t desirable when running transactional queries, these can be critically important when running analytical algorithms, such as degree centrality.

  • When writing graph-backed applications, understanding the problem we’re trying to solve is critical to its success. This ensures that we use the right tools and use those in the right way.

  • The use of “dirty” data is a common anti-pattern that is particularly problematic in graph applications due to the highly connected nature of the data and questions. Dirty data can be addressed by properly de-duplicating and linking data to facilitate better application performance.

  • You should avoid testing with unrepresentative data because the connectedness of the data dramatically impacts the performance of graph-backed applications.

  • When submitting Gremlin as strings, the only approach some graph database vendors support, you should always parameterize graph traversals to prevent injection attacks from running malicious code.

  • When running transactional queries, always start a traversal with a filter traversal that specifies the vertex label (or labels) as well as the properties. Specifying a filter at the start of your traversal prevents a traversal from searching all of the vertices.
