Finding data within the graph is one skill, but returning it efficiently presents a whole new set of challenges. While it’s entirely possible to send raw, unorganized data to the client, in most cases, it’s best to do as much data processing at the database layer as possible. Client applications are quite busy handling user interactions.
In this chapter, we’ll focus on the different methods of collecting, formatting, and outputting traversal results at the database level. We’ll review the value steps introduced in chapter 3 and illustrate why these are required. Then we’ll discuss how to return values from elements that are located in the middle of a traversal, as well as crafting custom objects. Finally, we’ll wrap up this chapter by demonstrating how to sort, group, and limit results for efficient communication with client applications.
If you haven’t done so already, download the corresponding source code for this chapter: https://github.com/bechbd/graph-databases-in-action. The code relevant to this chapter is located in the chapter05 folder. All examples begin with the assumption that our social network data set is loaded. To accomplish this, run this command:
.sh -i $BASE_DIR/chapter05/scripts/5.1-complex-social-network.groovy
We start with the most common of formatting tools: the values()
and valueMap()
steps, which we introduced in chapter 3. But before we get to these, we look at the default behavior in TinkerPop. Most of the traversal examples return the ID of the elements. Looking at the following traversal, the ID value of 4
is returned as part of a toString()
construct:
g.V().has('person', 'first_name', 'Ted') ==>v[4]
In relational database terms, this is the equivalent of running this SQL query:
SELECT ROWID FROM person WHERE first_name = 'Ted';
We rarely want to return just the ID. In most cases, we aim to return all or a subset of the properties. So, if the normal use case is to retrieve the attribute values, then why not return all the attributes by default?
The reason is quite practical. If a database returns all the attributes by default, we transmit a lot of unrequired data. Generally, we want to transmit only the data we need, so most databases require that we specify the particular attributes to return. We know it is an (unfortunately) common practice in SQL to use a wildcard (*
) in the SELECT
clause as seen here:
SELECT * FROM person WHERE first_name = 'Ted';
But the preferred method is to specify the column names like this:
SELECT first_name FROM person WHERE first_name = 'Ted';
So, what’s the graph equivalent of this SQL? Let’s say that we want to return all the properties for the Hank
vertex in our graph. We already know the basic steps for this:
At this stage, we are already old pros at handling the first two steps. Back in chapter 3, we also used the values()
step and discussed the valueMap()
step, often at the end of the traversal, to retrieve the properties of a vertex. Both the valueMap()
and values()
steps return the property values of a vertex or an edge. As you’ll remember, the values()
step returns only the values of the properties. The valueMap()
step returns a map, which is a collection of key-value pairs of the specified properties. (In some programming languages, a map is known as a dictionary. This is the same concept here.)
Let’s take a look at the difference between the two values steps by returning all the attributes for Hank in our graph. First, using values()
', 'first_name', 'Hank').values() ==> Hank
and then employing valueMap()
:
', 'first_name', 'Hank').valueMap() ==>{first_name=[Hank]}
As shown, while we receive the same basic data back for each of these, it’s returned with slightly different formats. The valueMap()
step differs from the values()
step because it returns the data as a key-value pair (or map) instead of just the value. Generally, we find that having the keys for the property values makes it much easier to work with the results. Additionally, the valueMap()
step returns one row per traverser while the values()
step returns one row per property per traverser. In our work, we usually prefer the valueMap()
step.
Now that we’ve reviewed the values()
and valueMap()
steps, why are these needed at all? In SQL we don’t do anything extra to get the values, so why do we need to do that with graph databases? It’s necessary because of a crucial difference between how SQL engines process queries and how graph database engines process traversals:
In a graph database, only the values of the current vertices or edges are retrieved.
In a relational database, all the values from all the joined tables can be included in the results.
This difference arises from the disparity in how the engines process queries. Understanding this difference is critical for creating effective and efficient graph queries. To demonstrate, let’s look at how a relational database handles a query and compare that to how a graph traversal works.
Let’s use an example of an order-processing system, consisting of orders and products. It’s likely that we’re familiar with this sort of simple hierarchical relationship in a relational model. If we model a simplified version of this system for a relational database, we’d design two tables: Orders
, containing the orders, and Products
, containing the products. These two tables are connected with a linking table, ProductsInOrder
. Figure 5.1 shows this relationship.
Populating these tables with some sample data, we get something like figure 5.2.
A common question to answer with our relational database system is, “What are all the orders and the products that were ordered?” To answer this, we join the Orders
table and the Products
table with the ProductsInOrder
table as in this SQL query:
SELECT * FROM Orders JOIN ProductsInOrder ON ProductsInOrder.order_id = Orders.order_id JOIN Products ON Products.product_id = ProductsInOrder.product_id;
When we run this query, the SQL engine generates tabular output by combining the rows from the Orders
, ProductsInOrder
, and Products
tables, where both the order_id
values and the product_id
values match. The following table provides the output.
The critical point to notice is that our result set contains the data from both tables involved in the join operation. If our SQL query contained additional join clauses, then the columns from the additional tables are also included in the result set by default because of the use of the wildcard.
Now let’s look at how we represent the same order-processing system in a graph. Taking what we learned in chapter 2, we create a schema as represented in figure 5.3. It consists of two vertices, order
and product
, and one edge, contains
.
If we then populate our graph with the data used in our SQL tables, we get the graph presented in figure 5.4.
Applying what we learned about how graph traversals work, we know that our first step is to find all order
vertices in the graph. Figure 5.5 illustrates this step.
Our next step is to traverse out all the contains
edges to the adjacent product
vertices. Figure 5.6 shows this step.
In Gremlin, we’d write a traversal like the following:
g.V().hasLabel('order').out('contains')
If we compare the final location of our graph traversal to the final result set of our SQL query, we’ll notice that although the SQL results have information about both the orders and the products, the graph results only have the properties of the products
vertices. This represents a fundamental difference between querying a relational database and traversing a graph. Further, in a relational database, the output of a join operation is the combination of all of the joined tables. In a graph database, the output of any step of a traversal is the current set of vertices or edges. How do we return both the order and product information for a graph?
To return both the order
and product
vertices, we use an alias on the order
vertices. An alias in a graph database is a labeled reference to a specific output of a step, either a vertex or an edge, that can be referenced by later steps. In our order-processing graph, the steps to get a combined order/product result are as follows:
Return all the properties from the elements labeled O
as well all the properties from the elements labeled P
.
It might appear strange to also alias the product
vertices, not just the order
vertices. When returning the aliased elements, all these elements must have an alias, not just the mid-traversal elements. If we look at our order-processing graph after we complete steps 1 and 2 (shown in figure 5.7), we have a graph with all our order
vertices labeled as O
.
Now we can traverse the contains
edge and alias the adjacent product
vertices (steps 3 and 4). This provides a graph where all our order
vertices aliased as O
and all our product
vertices aliased as P.
Figure 5.8 exhibits this part of the traversal.
So far, it looks like we’re on the right track. We have references to both the order
and product
vertices (step 5). Next, we’ll select our O
and P
vertices and return their properties. To retrieve these values, we’ll refer to the alias and the properties we want to return for each desired attribute. While this makes sense conceptually, let’s dive into a concrete example from our social network and see how this all works using our social network graph.
Having covered the concept of aliases, let’s apply this to our friends-of-friends traversal we created in chapter 3. In section 3.3, we crafted the traversal:
g.V().has('person', 'first_name', 'Ted'). repeat( 'friends') ).times(2). values('first_name')
The Ted
vertex has just a couple of connections in our sample graph, so it does not have many results as is. Let’s move to the middle of the graph and search for Dave’s friends-of-friends instead of Ted’s. Also, instead of just returning the friends-of-friends name, let’s also return the friend’s name. Our results should be a list of objects, each with a friend’s name and with a friends-of-friends name. Let’s start this exercise by first reminding ourselves what our social network graph looks like, illustrated in figure 5.9.
Note In figure 5.9, the vertices are labeled using only the first_name
property to simplify the visual presentation.
Applying what we learned, we come up with the following steps for answering this friends-of-friends question for Dave:
These are Dave’s friends-of-friends, so alias them with the label 'foff'
(for friends-of-friends).
For each result, return the first_name
property of the element labeled 'f'
and the first_name
property of the element labeled 'foff'
.
Figure 5.10 shows what this looks like in our social network graph. In this figure, we recognize Dave’s friends shown by the end vertices with the solid lines (representing steps 1-3), and his friends-of-friends denoted by the end vertices with the dashed lines (representing steps 4 and 5).
Traversing our graph in this manner returns one result for each of the six solid arrows in figure 5.10 and includes a value for each result based on the solid arrow traversed. Because there are five solid arrows, some of the friends
vertices, namely Jim
and Kelly
, end up being duplicated in our results. Note also that there’s an solid arrow to Hank
but no solid arrow from Hank
. We shouldn’t expect to find Hank in the friends category, only in the friends-of-friends, owing to the solid arrow from Josh.
Because there isn’t an outgoing friends
edge from the Hank
vertex, we don’t get a friends-of-friends result for Dave’s friend Hank. Put another way, there is no vertex that satisfies the pattern: Dave
->
Hank
->
???
. This is a good example of how the edge directions can influence results in sometimes unexpected ways. Let’s write our traversal starting with the friends-of-friends traversal from section 3.3 but replacing Ted
with Dave
.
Tip It’s best to write code in small chunks and test early and often.
out().
out().
==>Denise
==>Denise
==>Paras
==>Jim
==>Josh
==>Hank
Comparing the outcome of this traversal with what we expected from our graph, we confirm that the results match. However, this graph just returns the name of the friends-of-friends. We want the combination of the friend and friends-of-friends vertices. To get the missing pieces, two new pieces are required: first, aliasing an element in the middle of a traversal, and second, using aliased elements later in the traversal to choose properties.
Aliasing elements mid-traversal using as()
The first concept, aliasing elements mid-traversal, is what enables us to retrieve the friend’s name. We use a new Gremlin step, the as()
modulator:
as()
--Assigns a label (or labels) to the output of the previous step, which can be accessed later in the same traversal.
Think of as()
in Gremlin as similar to assigning an alias to a table in SQL. For example, with SQL
SELECT alias_name.* FROM table AS alias_name;
would be represented in Gremlin as
g.V().hasLabel('table').as('alias_name')
Both approaches use the keyword as
and alias a specific portion of the data (in SQL, it’s a table; in a graph, it’s a reference to an element) to use a simple reference later. Let’s add the as()
step to alias our vertices after each of the out()
steps. Then, in figure 5.11, we isolate and describe the portions of the traversal where we alias the vertices:
out().as('f'). out().as('foff')
As figure 5.12 illustrates, each vertex is aliased with the assigned name after we traverse each friends
edge.
While it might be tempting to assign an as()
at each step in a traversal, its use comes at a cost. As each traverser moves through the graph, it carries a reference to each element that’s aliased. The more aliases we create, the more the traverser has to keep track of with each additional step. Therefore, it’s best practice to only alias steps that we plan to retrieve later in the traversal. Now that we’ve wrapped our heads around how to alias elements mid-traversal, let’s examine the second new concept: retrieving the aliased elements.
Retrieving the aliased elements in our traversal requires two different steps. First, we need to specify what aliased elements to retrieve, and second, what properties of each to return. In the case of our friends-of-friends traversal, we
To return aliased elements, let’s turn to a new Gremlin step:
select(string[])
--Selects aliased elements from earlier in the traversal. This step always looks back to previous steps in the traversal to find the aliases.
The select()
step takes an array of strings, which are the aliases to retrieve. In our example, we specify select('f',
'foff')
to use both sets of vertices in our results. To specify what properties to return, we introduce another new Gremlin step, or more accurately, another modulator, by()
. Like the from(),
to()
, and as()
modulators, the by()
modulator only works in the context of another step; in this case it will be working with the select()
step (although it can work with others, as we will see later):
by(key)
--Specifies the key
of the property to return the value from the corresponding aliased element
by(traversal)
--Specifies the traversal
to perform on the corresponding aliased element
There are two forms of by()
. The first form takes the property key and returns the corresponding property value from the labeled element. This is a bit of Gremlin syntax sugar because by(key)
is equivalent to by(values(key))
. The second form takes a traversal that allows us to perform additional steps on the labeled element, such as a valueMap()
or out().valueMap('key')
. It’s also possible to use complex traversals within a by()
modulator to format results. We demonstrate more complex uses in later chapters.
The by()
modulator specifies what to do with the corresponding aliased elements from a step like the select()
step. In our case, we apply the first form to specify that we want the first_name
property from each of the aliases referenced in our select()
step. Putting the select()
and by()
steps on our previous friends-of-friends traversal, we get
out().as('foff'). select('f', 'foff'). by('first_name'). by('first_name') ==>{f=Jim, foff=Denise} ==>{f=Jim, foff=Paras} ==>{f=Kelly, foff=Jim} ==>{f=Kelly, foff=Denise} ==>{f=Ted, foff=Josh} ==>{f=Josh, foff=Hank}
With these two concepts, we can create complex results by combining elements from different points in our traversal. In this scenario, it means including not only the name of the friends-of-friends of Dave, but also which of Dave’s friends they’re connected to as well. Our traversal yields the six results we expected with the Jim
and Kelly
vertices referenced twice as friends. Also, the Hank
vertex was not included as a one of the friends because there was no corresponding friends-of-friends returned. This is awesome, but why are there two by()
steps?
One confusing aspect of using by()
statements is that each aliased element we specify in a select()
statement should have a corresponding by()
statement to indicate the operations to perform on it. Additionally, the order of the by()
step corresponds to the order of the aliases specified.
In our example, select('f',
'foff')
, our traversal must have two by()
statements. The first by()
performs actions on the elements labeled as 'f'
; the second by()
performs actions on the elements labeled as 'foff'
. Figure 5.13 demonstrates how the by()
steps correlate in our example.
Note Strictly speaking, it is possible to have greater or fewer by()
statements than referenced elements. In these scenarios, the by()
statements are used in a round-robin fashion. This can lead to confusion, so we always match the number of by()
statements to the number of aliased elements to be clear about what should happen for each alias.
Upon examination, in figure 5.13, we notice that the first by()
statement returns the first_name
property from our vertices labeled as 'f'
, and the second by()
statement yields the first_name
property from our vertices labeled as '
foff'
.
Sometimes, instead of looking back in the traversal for earlier results, it is preferable to project results forward from the current elements. Projecting results differs from retrieving the previous results in a simple, but somewhat subtle way. When we retrieve (or select) data, we can only get information that we already traversed and aliased. When we project results, we create new results, possibly branching to items not yet traversed. Let’s start by looking at projection in contrast to selection:
Selection is the process of working with vertices, properties, or additional traversal expressions to return results from previously labeled steps. Selection always looks back to earlier parts of the traversal.
Projection is the process of working with vertices, properties, or additional traversal expressions to create results from the input to the current step. Projection always moves forward, taking the incoming data as the starting point.
Understanding the difference between these two items is crucial. Selection is generally used to combine results from elements traversed earlier in the traversal. Projection is generally used to group or aggregate data starting from the current location in the graph (for example, finding the degree
property of each member of a set of vertices, which we do later in this section).
Let’s return to our order-processing graph for an illustration. For this example, let’s answer the question, “For each of the products in order ABC123, how many times has that product been ordered?” Using what we already know from the previous section about selecting results, let’s use the following process:
Completing step 1 has our traverser located on the order
vertex. Figure 5.14 illustrates this step.
Moving along, completing step 2 of the process has two traversers, one located on each adjacent product
vertex. Figure 5.15 shows this step.
With our next to last step (step 3), the traversers are located on the contains
edges. Figure 5.16 shows this step.
With everything we’ve learned about using aliases and returning results, this is the correct way to think about the problem. However, if you count the gremlins in figure 5.16, you see there are three. Each of these returns its own results, based on what it knows. What they each know is the widget they came from and how many edges these occupy. This approach yields the following:
{name: widget 1, count: 1} {name: widget 2, count: 1} {name: widget 2, count: 1}
That is one result for each of the three gremlins. What we want instead is a result like this:
{name: widget 1, count: 1} {name: widget 2, count: 2}
Why didn’t the traversal return the expected results? Why did we end up with three gremlins and three results each with a count of 1? Previously, the return values from our traversals were generated by selecting the values of previously traversed and labeled elements. In the case of the previous order-processing traversal, we end up with three traversers. So, what do we do instead? The steps shown are mostly valid, but require a few tweaks:
Studying the difference between the two processes, we’ll notice a few specifics:
In the example traversal, we use projection to end up with the count of contains
edges for a specific product because we want to branch our logic at the product
vertex.
Whew, we know this is a lot to take in, so let’s see what this looks like in a practical example from our social network graph. Let’s say we applied the same concepts to this question: “Find the degree
property for every person
vertex in my graph.” To answer this question, we need to do the following:
We also need a new step, the project()
step:
project(string[])
--Projects the current object into a new object or objects as specified by the criteria in the by()
modulators
Let’s apply what we learned about the by()
modulators and projection. Figure 5.17 shows the traversal we should get from this.
As we did with select()
, we’ll use the by()
modulators to instruct the project()
step on how to return its results.
IMPORTANT In this traversal, instead of specifying a property name in the second by()
statement, we specify additional traversal steps. This takes the incoming element (in this case, the product
vertex) and then performs additional traversal steps from that point in the graph. This ability to specify additional traversal steps within a by()
step isn’t unique to project()
. We can specify these sub-traversals with either a select()
or other Gremlin steps. This is quite powerful--the ability to do complex operations within a traversal or steps within steps.
When we run this traversal on the complex social network graph, we get the following results:
g.V().hasLabel('person'). project('name', 'degree'). by('first_name'). by(bothE().count()) ==>{name=Dave, degree=5} ==>{name=Paras, degree=2} ==>{name=Josh, degree=3} ==>{name=Denise, degree=3} ==>{name=Ted, degree=2} ==>{name=Hank, degree=2} ==>{name=Kelly, degree=3} ==>{name=Jim, degree=4}
To review, let’s summarize the differences between the two methods: selection versus projection. Then we’ll compare these side by side in figure 5.18 before we move on:
Selection uses the select()
step to create a result set based on previously traversed elements of a graph. To use the select()
step, we alias each of the elements with the as()
step for later use.
Projection uses the project()
step to branch from the current location within the graph and creates new objects. In our present example, we had one element remain static, the person’s name, but we needed the other elements to be calculated through further traversing of the graph to return the number of friends.
Don’t worry if you don’t feel comfortable with how to manipulate results right away. Our experience tells us that, as with many powerful software concepts, it takes some practice to know which technique produces the desired result. It does become more natural with regular practice, and by regular practice, we mean trial and error until you get the desired results. Now that we know how to construct complex result structures, how do we go about returning these in a predictable way?
In this section we investigate two other mechanisms to manipulate results: ordering and grouping. Most clients want nice, clean, ordered data. However, as in most relational databases, graph databases don’t guarantee the order of the results by default. This leads us to the three most common requirements for organizing our result data:
Clients often expect the returned data to be sorted by one or more properties. For example, when displaying everyone in the graph by name, people usually expect to see the results in alphabetical order. This leaves us with a couple of options.
The first option is to return all the names and sort these client-side, in memory, within the application. While this works, it’s undesirable. For example, let’s say our application only shows the first 10 names of a possible 100. Sorting all the data client-side means that we return all 100 values, sort these, and then choose the top 10. This is inefficient and adds load not only on the client, but also on the database and the network. While there are scenarios where this might make sense, such as if we were caching all the names in the application for repeated reuse, we normally want to reduce unneeded work.
This leaves us with the second option: sorting the names first on the server-side. This is the method frequently taken in a relational database and is also common in a graph database. In SQL, we use the ORDER
BY
clause like this:
SELECT *
FROM person
ORDER BY first_name;
The syntax in a graph database is similar. In fact, to order results in Gremlin, we use the following step:
order()
--Collects all objects up to this point of the traversal into a list, which is ordered according to the accompanying by()
modulator
This new step, used in conjunction with the by()
modulator, specifies how to arrange our data and which property to use to sort the results. For example, to order the names of every person
vertex in the graph by first_name
, we use this traversal:
g.V().hasLabel('person').values('first_name'). order(). by() ==>Dave ==>Denise ==>Hank ==>Jim ==>Josh ==>Kelly ==>Paras ==>Ted
The order()
step defaults to sorting in ascending order. To sort by descending order, specify the decr
parameter in the by()
step as shown:
g.V().hasLabel('person').values('first_name'). order(). by(decr) ==>Ted ==>Paras ==>Kelly ==>Josh ==>Jim ==>Hank ==>Denise ==>Dave
While sorting in either ascending or descending order is common, there are times when we want to order data randomly, such as when sampling. For this, we use the shuffle
parameter in the by()
step:
g.V().hasLabel('person').values('first_name'). order(). by(shuffle) ==>Dave ==>Jim ==>Ted ==>Paras ==>Kelly ==>Hank ==>Denise ==>Josh
Ordering is probably the most frequent requirement for formatting data. Another typical need is to group or count the number of items in a group.
If we return to our previous friends-of-friends traversal, the client might want to return that list grouped by which of Ted’s friends they’re friends with. In this scenario, we’re left with the same choices as with ordering data: either perform the work on the client-side in the application or on the server-side in the database.
There’s a natural desire, and one which we encourage, to push as much of this work as close to the data as possible. In SQL, we accomplish this by using the GROUP BY clause:
SELECT f.person_id, count(foff.*)
FROM person
WHERE person.first_name = 'Ted'
Similar to ordering data, the syntax in a graph database for grouping is comparable when using Gremlin. To perform these grouping operations, we can use either of the following steps:
group()
--Groups the results based on the specified by()
modulator. Data is grouped by using either one or two by()
modulators. The first one specifies the keys for the grouping. The second one, if present, specifies the values. If not present, the incoming data is collected as a list of the values associated with the grouping key.
groupCount()
--Groups and counts the results based on the specified by()
modulator. It takes one by()
modulator to specify the keys. The values are always aggregated by the count()
step.
In the following, we apply these steps to group all of the friends-of-friends of Dave by his friends:
g.V().has('person', 'first_name', 'Dave'). both(). both(). group(). by('first_name') ==>{Denise=[v[19], v[19]], Ted=[v[4]], Hank=[v[6]], Paras=[v[17]], Josh=[v[2], v[2]], Dave=[v[0], v[0], v[0], v[0], v[0]], Kelly=[v[13]], Jim=[v[15]]}
We can see that our traversal returns a Map
containing arrays of vertices for each name. Because we didn’t specify a second by()
modulator, it simply collected the references to the vertexes into a list. To make this a bit easier to read, let’s use the unfold()
step:
Applying the unfold()
step to our results unwinds those into individual records for each name. For example
g.V().has('person', 'first_name', 'Dave'). both(). both(). group(). by('first_name'). unfold() ==>Denise=[v[19], v[19]] ==>Ted=[v[4]] ==>Hank=[v[6]] ==>Paras=[v[17]] ==>Josh=[v[2], v[2]] ==>Dave=[v[0], v[0], v[0], v[0], v[0]] ==>Kelly=[v[13]] ==>Jim=[v[15]]
Instead of returning the actual vertices for each name, what if we were more interested in discovering which of Dave’s friends is the most popular? To do aggregated grouping, we need the count for a group by name, so we use the groupCount()
step:
g.V().has('person', 'first_name', 'Dave'). both(). both(). groupCount(). by('first_name'). unfold() ==>Denise=2 ==>Ted=1 ==>Hank=1 ==>Paras=1 ==>Josh=2 ==>Dave=5 ==>Kelly=1 ==>Jim=1
The groupCount()
step is just a little syntax sugar for the most common use of the group()
step--aggregating counts of things. As a quick point of comparison, and a nice demonstration of how we can use a traversal in a by()
modulator, the group()
version of the groupCount()
step is
g.V().has('person', 'first_name', 'Dave'). both(). both(). group(). by('first_name'). by(count()). unfold() ==>Denise=2 ==>Ted=1 ==>Hank=1 ==>Paras=1 ==>Josh=2 ==>Dave=5 ==>Kelly=1 ==>Jim=1
Note how we were able to use a traversal of one step (the count()
step) in our second by()
modulator. The group()
step applied by()
to all of the incoming vertices that shared the same first_name
value. As these examples demonstrate, grouping and ordering results in a graph database is similar to the process used in a relational database.
The final topic in organizing results is returning a subset of the data. This is commonly used to minimize the result size or for pagination functionality. For example, let’s say that we want to return all the names for people in our graph, but our graph contains one million people. Can any application display all at the same time? Usually we want to limit the initial results and then allow the user to move through the data set in groups of records. This approach is standard in a number of types of applications.
As with grouping or ordering, the question remains: Do it on the client-side or do it on the server-side? In this case, it’s almost always better to limit data on the server before returning it to the client. That provides a drastic reduction in resources across the whole stack, from the database to the network to the application. In SQL, we use the LIMIT clause for this:
SELECT * FROM person LIMIT 10;
As before, the approach in a graph database is similar. Gremlin, however, has several steps depending on the desired outcome: first X results, last X results, or X results from within the data set:
range(startNumber,
endNumber)
--Returns the results from startNumber
(inclusive, zero-based) to endNumber
(not inclusive)
These three steps are usually paired with an ordering step because graph traversals don’t guarantee the order of the data returned. Let’s say that we want to return only the top three names from our graph, ordered by first_name
. If we extend the traversal we built in the ordering section to add the limit()
step, we get a traversal that looks like this:
g.V().hasLabel('person').values('first_name'). order(). by(). limit(3) ==>Dave ==>Denise ==>Hank
What if we want to do the reverse and return the last three names? How would we accomplish this task?
Exercise Take a minute and think about what you’ve learned about manipulating our results. What would you do to answer the question?
We see two ways to do this. The first is to use the previous traversal, but arrange names in descending order instead of ascending order; then limit our results to the top three. The second way is to use the tail()
step instead of the limit()
step like this:
g.V().hasLabel('person').values('first_name'). order(). by(). tail(3) ==>Kelly ==>Paras ==>Ted
Both methods accomplish the same goal. It’s therefore up to your discretion to decide which one you want to choose.
The last requirement to discuss is pagination of results. Let’s say we want everyone ordered by first name, but only three at a time. We use the range()
step and specify the first and last result number to return:
g.V().hasLabel('person').values('first_name'). order(). by(). range(0, 3) ==>Dave ==>Denise ==>Hank
By manipulating the startNumber
and endNumber
values, we can page through our results. For example, if we wanted to move to a second page of results, we could accomplish this by incrementing the values in our range step to range(3,
6)
.
Given all these different ways of manipulating our data, we’ll share one last example that combines these concepts together to answer the question, “What three friends-of-friends of Dave have the most connections?” We’ll start with the answer and then break it down into its component parts. First, the traversal to answer this question follows:
g.V().has('person', 'first_name', 'Dave'). both(). both(). groupCount(). by('first_name'). unfold(). order(). by(values, desc). by(keys). project('name', 'count'). by(keys). by(values). limit(3) ==>{name=Dave, count=5} ==>{name=Denise, count=2} ==>{name=Josh, count=2}
That’s a lot to take in when you are used to traversals with at most five steps. This one has nine steps plus five modulators. In chapter 8, we introduce a methodology for developing more involved traversals such as this one. In this final example for this chapter, we walk through our approach to understand the traversal that is already written.
The first step is to ascertain the traversal writer’s intent. Here, we know that the traversal is supposed to answer a specific question: “What three friends-of-friends of Dave have the most connections?” We may not always know the intent when looking at someone else’s traversals. Out of consideration for the developers who might someday need to support the traversal you write, we recommend either having descriptive method names (e.g., getTop3FriendOfFriendsByEdgeCount
) or including a helpful comment in the code. We discuss adding comments a bit later in this section.
Next, determine the traversal’s starting point. It is a single vertex or a type of vertices. We do this by looking at all of the filtering steps at the start. Remember that filtering steps such as has()
can be chained together efficiently. In this example, the traversal starts at a single vertex, the Dave
vertex, before it traverses to other parts of the graph:
g.V().has('person', 'first_name', 'Dave')
Next, we want to take each step, or collection of steps, in their order. While reading through the steps, we want to develop our mental view of our position within the graph at each point. Sometimes, we find it helpful to add comments to the code as a way of taking notes. For example, with the first few steps, we might add comments like the following:
g.V().has('person', 'first_name', 'Dave'). // single person: Dave both(). // friends both() // friends of friends
This helps us to keep track of the output at each step. The only problem with this is that not all traversal processors support inline comments like this. For example, while our IDE recognizes this as valid Groovy code, the Gremlin Console gives us an error if we attempt to run it. To be able to run the code in Gremlin Console, we need to change the comment to something like this:
// single person: Dave g.V().has('person', 'first_name', 'Dave'). // friends both(). // friends of friends both()
But that makes the traversal verbose in ways that are almost unhelpful. In these situations, we are tempted to keep two versions of the traversal in our IDE or editor, one with comments and the other for testing. Either way, we can see that the first three lines get us to the friends of Dave’s friends. Testing in the Gremlin Console gives us
g.V().has('person', 'first_name', 'Dave'). both(). both() ==>v[19] ==>v[17] ==>v[0] ...
Let’s look at the next set of steps, the groupCount()
step, it’s by()
modulator, and the unfold()
step. We can run just those steps through the Gremlin Console and review the results:
g.V().has('person', 'first_name', 'Dave'). both(). both(). groupCount(). by('first_name'). unfold() ==>Denise=2 ==>Ted=1 ==>Hank=1 ==>Paras=1 ==>Josh=2 ==>Dave=5 ==>Kelly=1 ==>Jim=1
From this, we can see that we get a series of key-value pairs, where the key is the first_name
of the friends-of-friends vertex, and the value is the number of times that it appears in the results. Referring back to our starting question, this covers the friends-of-friends of Dave and their number of connections. The groupCount()
step handled both the needed grouping (by friends-of-friends’ first names) and the count aggregation. An unfold()
step is tacked on to simplify the ordering, which is next:
g.V().has('person', 'first_name', 'Dave'). both(). both(). groupCount(). by('first_name'). unfold(). order(). by(values, desc). by(keys) ==>Dave=5 ==>Denise=2 ==>Josh=2 ==>Hank=1 ==>Jim=1 ==>Kelly=1 ==>Paras=1 ==>Ted=1
The ordering statement gets interesting. First, it orders by the values in descending order and then by the keys. The ordering by the keys is a nice little tie breaker to ensure somewhat deterministic results. We’re not sure that an alphabetical bias by first name is the best approach, but it certainly works for ensuring consistency. Now let’s reformat these results so that it’s easier to parse with a client program:
g.V().has('person', 'first_name', 'Dave'). both(). both(). groupCount(). by('first_name'). unfold(). order(). by(values, desc). by(keys). project('name', 'count'). by(keys). by(values) ==>{name=Dave, count=5} ==>{name=Denise, count=2} ==>{name=Josh, count=2} ==>{name=Hank, count=1} ==>{name=Jim, count=1} ==>{name=Kelly, count=1} ==>{name=Paras, count=1} ==>{name=Ted, count=1}
Here, we use a project()
step so that we have objects with clear labels on the properties. This probably isn’t necessary if working with data within a single traversal, but it is a great practice when returning results to another program. Now, the only operation that remains is to limit the results:
g.V().has('person', 'first_name', 'Dave'). both(). both(). groupCount(). by('first_name'). unfold(). order(). by(values, desc). by(keys). project('name', 'count'). by(keys). by(values). limit(3) ==>{name=Dave, count=5} ==>{name=Denise, count=2} ==>{name=Josh, count=2}
This example demonstrates how we brought together the concepts and constructs in this chapter to format results for client software. And, as a bonus, we walked through how to examine a traversal piece by piece to understand its operations. In the next chapter, we’ll bring the skills from the last few chapters together to build a working application.
By default, properties aren’t returned from graph elements, so we must explicitly ask for those. In Gremlin, we use steps such as values()
and valueMap()
to retrieve the values in the desired form.
Aliases in the traversal allow for referencing results from earlier steps in later steps, supporting composition of powerful traversals.
Selecting and projecting steps create complex results from multiple vertices or edges, allowing for the composition of intricate result structures.
Selection creates a result set based on previously traversed elements of a graph. To use the select()
step, we alias with the as()
step elements for use in later steps.
Projection operates from the current location within the graph and creates new objects with either static or calculated properties.
Ordering, grouping, or counting by group are common ways to transform results using the order()
, group()
, and groupCount()
steps.
The limit()
step limits the number of results, the tail()
step returns the last X records, and the range()
step allows for result pagination.
Combining different steps performs complex manipulation and transformation of traversal results in the database prior to returning the results to a client. Used appropriately, this improves performance in the database, across the network, and in the application itself.
52.14.121.242