5 Formatting results

This chapter covers

  • Retrieving values from our vertices and edges
  • Aliasing vertices and edges for later use in the traversal
  • Crafting custom result objects by combining static and computed values
  • Sorting, grouping, and limiting our results

Finding data within the graph is one skill, but returning it efficiently presents a whole new set of challenges. While it’s entirely possible to send raw, unorganized data to the client, in most cases, it’s best to do as much data processing at the database layer as possible. Client applications are quite busy handling user interactions.

In this chapter, we’ll focus on the different methods of collecting, formatting, and outputting traversal results at the database level. We’ll review the value steps introduced in chapter 3 and illustrate why these are required. Then we’ll discuss how to return values from elements that are located in the middle of a traversal, as well as crafting custom objects. Finally, we’ll wrap up this chapter by demonstrating how to sort, group, and limit results for efficient communication with client applications.

If you haven’t done so already, download the corresponding source code for this chapter: https://github.com/bechbd/graph-databases-in-action. The code relevant to this chapter is located in the chapter05 folder. All examples begin with the assumption that our social network data set is loaded. To accomplish this, run this command:

.sh -i $BASE_DIR/chapter05/scripts/5.1-complex-social-network.groovy

5.1 Review of values steps

We start with the most common of formatting tools: the values() and valueMap() steps, which we introduced in chapter 3. But before we get to these, we look at the default behavior in TinkerPop. Most of the traversal examples return the ID of the elements. Looking at the following traversal, the ID value of 4 is returned as part of a toString() construct:

g.V().has('person', 'first_name', 'Ted')
==>v[4]

In relational database terms, this is the equivalent of running this SQL query:

SELECT ROWID FROM person WHERE first_name = 'Ted';

We rarely want to return just the ID. In most cases, we aim to return all or a subset of the properties. So, if the normal use case is to retrieve the attribute values, then why not return all the attributes by default?

The reason is quite practical. If a database returns all the attributes by default, we transmit a lot of unrequired data. Generally, we want to transmit only the data we need, so most databases require that we specify the particular attributes to return. We know it is an (unfortunately) common practice in SQL to use a wildcard (*) in the SELECT clause as seen here:

SELECT * FROM person WHERE first_name = 'Ted';

But the preferred method is to specify the column names like this:

SELECT first_name FROM person WHERE first_name = 'Ted';

So, what’s the graph equivalent of this SQL? Let’s say that we want to return all the properties for the Hank vertex in our graph. We already know the basic steps for this:

  1. Given a traversal source g

  2. Find vertex of type person with a first_name of Hank.

  3. Return the properties of that vertex (in this case, there is only one, first_name).

At this stage, we are already old pros at handling the first two steps. Back in chapter 3, we also used the values() step and discussed the valueMap() step, often at the end of the traversal, to retrieve the properties of a vertex. Both the valueMap() and values() steps return the property values of a vertex or an edge. As you’ll remember, the values() step returns only the values of the properties. The valueMap() step returns a map, which is a collection of key-value pairs of the specified properties. (In some programming languages, a map is known as a dictionary. This is the same concept here.)

Why is values() plural?

Although plural, values() is most often used to return scalars; specifically, the value of a single property. Because it returns only the value portion of the property without a key or label, the requesting code must know which property is called for. It’s plural because values() is designed to work on one or more properties and to distinguish it from the rarely used value() step.

Let’s take a look at the difference between the two values steps by returning all the attributes for Hank in our graph. First, using values()

', 'first_name', 'Hank').values()
==> Hank

and then employing valueMap():

', 'first_name', 'Hank').valueMap()
==>{first_name=[Hank]}

As shown, while we receive the same basic data back for each of these, it’s returned with slightly different formats. The valueMap() step differs from the values() step because it returns the data as a key-value pair (or map) instead of just the value. Generally, we find that having the keys for the property values makes it much easier to work with the results. Additionally, the valueMap() step returns one row per traverser while the values() step returns one row per property per traverser. In our work, we usually prefer the valueMap() step.

Empty values() steps

In our sample traversals, note that we don’t specify the properties to return in our values() step. For example

Having an empty values() step is generally a bad idea. It’s the equivalent of running a SQL SELECT query with a wildcard like this:

FROM person WHERE first_name = 'Hank';

As with SQL, although this is allowed in graph traversals, it has potentially significant drawbacks:

  • The values we receive can change over time. When a new property is added to these vertices, it automatically gets included in the results.

  • There’s no guarantee of ordering the properties with graph databases.

  • We can end up getting significantly more data than required, slowing down the application.

For these reasons, as in SQL, it’s a best practice to always specify the properties to return with a comma-separated list of property keys as illustrated here:

g.V().has('person', 'first_name', 'Hank').values('first_name')

Now that we’ve reviewed the values() and valueMap() steps, why are these needed at all? In SQL we don’t do anything extra to get the values, so why do we need to do that with graph databases? It’s necessary because of a crucial difference between how SQL engines process queries and how graph database engines process traversals:

  • In a graph database, only the values of the current vertices or edges are retrieved.

  • In a relational database, all the values from all the joined tables can be included in the results.

This difference arises from the disparity in how the engines process queries. Understanding this difference is critical for creating effective and efficient graph queries. To demonstrate, let’s look at how a relational database handles a query and compare that to how a graph traversal works.

Let’s use an example of an order-processing system, consisting of orders and products. It’s likely that we’re familiar with this sort of simple hierarchical relationship in a relational model. If we model a simplified version of this system for a relational database, we’d design two tables: Orders, containing the orders, and Products, containing the products. These two tables are connected with a linking table, ProductsInOrder. Figure 5.1 shows this relationship.

Figure 5.1 A sample Entity Relationship Diagram (ERD) for our order-processing system, which contains two tables, Orders and Products, linked together with a foreign key.

Populating these tables with some sample data, we get something like figure 5.2.

Figure 5.2 Our example order-processing system, which contains sample data for a relational database model

A common question to answer with our relational database system is, “What are all the orders and the products that were ordered?” To answer this, we join the Orders table and the Products table with the ProductsInOrder table as in this SQL query:

SELECT * 
FROM Orders 
 JOIN ProductsInOrder ON ProductsInOrder.order_id = Orders.order_id
  JOIN Products ON Products.product_id = ProductsInOrder.product_id;

When we run this query, the SQL engine generates tabular output by combining the rows from the Orders, ProductsInOrder, and Products tables, where both the order_id values and the product_id values match. The following table provides the output.

order_id

order_number

order_id

product_id

qty

product_id

name

1

ABC123

1

1

5

1

widget 1

1

ABC123

1

2

10

2

widget 2

2

DEF234

2

2

4

2

widget 2

2

DEF234

2

3

6

3

widget 3

The critical point to notice is that our result set contains the data from both tables involved in the join operation. If our SQL query contained additional join clauses, then the columns from the additional tables are also included in the result set by default because of the use of the wildcard.

Now let’s look at how we represent the same order-processing system in a graph. Taking what we learned in chapter 2, we create a schema as represented in figure 5.3. It consists of two vertices, order and product, and one edge, contains.

Figure 5.3 A graph schema for our order-processing system with two vertices, order and product, and one edge, contains

If we then populate our graph with the data used in our SQL tables, we get the graph presented in figure 5.4.

Figure 5.4 Our order-processing graph populated with the same data used in our SQL example.

Applying what we learned about how graph traversals work, we know that our first step is to find all order vertices in the graph. Figure 5.5 illustrates this step.

Figure 5.5 Our traversal finds all the order vertices in our order-processing graph.

Our next step is to traverse out all the contains edges to the adjacent product vertices. Figure 5.6 shows this step.

Figure 5.6 Traversing out the contains edges from all order vertices to the product vertices

In Gremlin, we’d write a traversal like the following:

g.V().hasLabel('order').out('contains')

If we compare the final location of our graph traversal to the final result set of our SQL query, we’ll notice that although the SQL results have information about both the orders and the products, the graph results only have the properties of the products vertices. This represents a fundamental difference between querying a relational database and traversing a graph. Further, in a relational database, the output of a join operation is the combination of all of the joined tables. In a graph database, the output of any step of a traversal is the current set of vertices or edges. How do we return both the order and product information for a graph?

5.2 Constructing our result payload

To return both the order and product vertices, we use an alias on the order vertices. An alias in a graph database is a labeled reference to a specific output of a step, either a vertex or an edge, that can be referenced by later steps. In our order-processing graph, the steps to get a combined order/product result are as follows:

  1. Find all the order vertices in the graph.

  2. Give these an alias labeled O.

  3. Traverse out the contains edge to the product vertices.

  4. Give these an alias labeled P.

  5. Return all the properties from the elements labeled O as well all the properties from the elements labeled P.

It might appear strange to also alias the product vertices, not just the order vertices. When returning the aliased elements, all these elements must have an alias, not just the mid-traversal elements. If we look at our order-processing graph after we complete steps 1 and 2 (shown in figure 5.7), we have a graph with all our order vertices labeled as O.

Figure 5.7 Our order-processing graph with all the order elements labeled O

Now we can traverse the contains edge and alias the adjacent product vertices (steps 3 and 4). This provides a graph where all our order vertices aliased as O and all our product vertices aliased as P. Figure 5.8 exhibits this part of the traversal.

Figure 5.8 Our order-processing graph with the order vertices labeled O and the product vertices labeled P

So far, it looks like we’re on the right track. We have references to both the order and product vertices (step 5). Next, we’ll select our O and P vertices and return their properties. To retrieve these values, we’ll refer to the alias and the properties we want to return for each desired attribute. While this makes sense conceptually, let’s dive into a concrete example from our social network and see how this all works using our social network graph.

5.2.1 Applying aliases in Gremlin

Having covered the concept of aliases, let’s apply this to our friends-of-friends traversal we created in chapter 3. In section 3.3, we crafted the traversal:

g.V().has('person', 'first_name', 'Ted').
   repeat(
'friends')
   ).times(2).
  values('first_name')

The Ted vertex has just a couple of connections in our sample graph, so it does not have many results as is. Let’s move to the middle of the graph and search for Dave’s friends-of-friends instead of Ted’s. Also, instead of just returning the friends-of-friends name, let’s also return the friend’s name. Our results should be a list of objects, each with a friend’s name and with a friends-of-friends name. Let’s start this exercise by first reminding ourselves what our social network graph looks like, illustrated in figure 5.9.

Note In figure 5.9, the vertices are labeled using only the first_name property to simplify the visual presentation.

Figure 5.9 Our simplified social network with person vertices labeled with the person’s name

Applying what we learned, we come up with the following steps for answering this friends-of-friends question for Dave:

  1. Find the Dave vertex.

  2. Traverse out the friends edges.

  3. These are Dave’s friends, so alias them with the label 'f'.

  4. Traverse out the friends edge again.

  5. These are Dave’s friends-of-friends, so alias them with the label 'foff' (for friends-of-friends).

  6. For each result, return the first_name property of the element labeled 'f' and the first_name property of the element labeled 'foff'.

Figure 5.10 shows what this looks like in our social network graph. In this figure, we recognize Dave’s friends shown by the end vertices with the solid lines (representing steps 1-3), and his friends-of-friends denoted by the end vertices with the dashed lines (representing steps 4 and 5).

Figure 5.10 Finding Dave’s friends, shown as the endpoints with the solid lines, and their friends, depicted as the endpoints with the dashed lines, within our social network graph

Traversing our graph in this manner returns one result for each of the six solid arrows in figure 5.10 and includes a value for each result based on the solid arrow traversed. Because there are five solid arrows, some of the friends vertices, namely Jim and Kelly, end up being duplicated in our results. Note also that there’s an solid arrow to Hank but no solid arrow from Hank. We shouldn’t expect to find Hank in the friends category, only in the friends-of-friends, owing to the solid arrow from Josh.

Because there isn’t an outgoing friends edge from the Hank vertex, we don’t get a friends-of-friends result for Dave’s friend Hank. Put another way, there is no vertex that satisfies the pattern: Dave -> Hank -> ???. This is a good example of how the edge directions can influence results in sometimes unexpected ways. Let’s write our traversal starting with the friends-of-friends traversal from section 3.3 but replacing Ted with Dave.

Tip It’s best to write code in small chunks and test early and often.

  out().
  out().

==>Denise
==>Denise
==>Paras
==>Jim
==>Josh
==>Hank

Comparing the outcome of this traversal with what we expected from our graph, we confirm that the results match. However, this graph just returns the name of the friends-of-friends. We want the combination of the friend and friends-of-friends vertices. To get the missing pieces, two new pieces are required: first, aliasing an element in the middle of a traversal, and second, using aliased elements later in the traversal to choose properties.

Aliasing elements mid-traversal using as()

The first concept, aliasing elements mid-traversal, is what enables us to retrieve the friend’s name. We use a new Gremlin step, the as() modulator:

  • as()--Assigns a label (or labels) to the output of the previous step, which can be accessed later in the same traversal.

Think of as()in Gremlin as similar to assigning an alias to a table in SQL. For example, with SQL

SELECT alias_name.* FROM table AS alias_name;

would be represented in Gremlin as

g.V().hasLabel('table').as('alias_name')

Both approaches use the keyword as and alias a specific portion of the data (in SQL, it’s a table; in a graph, it’s a reference to an element) to use a simple reference later. Let’s add the as() step to alias our vertices after each of the out() steps. Then, in figure 5.11, we isolate and describe the portions of the traversal where we alias the vertices:

  out().as('f').            
  out().as('foff')

Figure 5.11 Portions of the traversal where we alias the vertices at each step away from the starting vertex

As figure 5.12 illustrates, each vertex is aliased with the assigned name after we traverse each friends edge.

Figure 5.12 Our social network graph depicting each vertex being aliased as f or foff for each step in our traversal

While it might be tempting to assign an as() at each step in a traversal, its use comes at a cost. As each traverser moves through the graph, it carries a reference to each element that’s aliased. The more aliases we create, the more the traverser has to keep track of with each additional step. Therefore, it’s best practice to only alias steps that we plan to retrieve later in the traversal. Now that we’ve wrapped our heads around how to alias elements mid-traversal, let’s examine the second new concept: retrieving the aliased elements.

Returning aliased elements

Retrieving the aliased elements in our traversal requires two different steps. First, we need to specify what aliased elements to retrieve, and second, what properties of each to return. In the case of our friends-of-friends traversal, we

  1. Return all elements labeled 'f'.

  2. Return all elements labeled 'foff'.

  3. For each of the returned elements, return the first_name property.

To return aliased elements, let’s turn to a new Gremlin step:

  • select(string[])--Selects aliased elements from earlier in the traversal. This step always looks back to previous steps in the traversal to find the aliases.

The select() step takes an array of strings, which are the aliases to retrieve. In our example, we specify select('f', 'foff') to use both sets of vertices in our results. To specify what properties to return, we introduce another new Gremlin step, or more accurately, another modulator, by(). Like the from(), to(), and as() modulators, the by() modulator only works in the context of another step; in this case it will be working with the select() step (although it can work with others, as we will see later):

  • by(key)--Specifies the key of the property to return the value from the corresponding aliased element

  • by(traversal)--Specifies the traversal to perform on the corresponding aliased element

There are two forms of by(). The first form takes the property key and returns the corresponding property value from the labeled element. This is a bit of Gremlin syntax sugar because by(key) is equivalent to by(values(key)). The second form takes a traversal that allows us to perform additional steps on the labeled element, such as a valueMap() or out().valueMap('key'). It’s also possible to use complex traversals within a by() modulator to format results. We demonstrate more complex uses in later chapters.

The by() modulator specifies what to do with the corresponding aliased elements from a step like the select() step. In our case, we apply the first form to specify that we want the first_name property from each of the aliases referenced in our select() step. Putting the select() and by() steps on our previous friends-of-friends traversal, we get

  out().as('foff').
  select('f', 'foff').
    by('first_name').
    by('first_name')
==>{f=Jim, foff=Denise}
==>{f=Jim, foff=Paras}
==>{f=Kelly, foff=Jim}
==>{f=Kelly, foff=Denise}
==>{f=Ted, foff=Josh}
==>{f=Josh, foff=Hank}

With these two concepts, we can create complex results by combining elements from different points in our traversal. In this scenario, it means including not only the name of the friends-of-friends of Dave, but also which of Dave’s friends they’re connected to as well. Our traversal yields the six results we expected with the Jim and Kelly vertices referenced twice as friends. Also, the Hank vertex was not included as a one of the friends because there was no corresponding friends-of-friends returned. This is awesome, but why are there two by() steps?

One confusing aspect of using by() statements is that each aliased element we specify in a select() statement should have a corresponding by() statement to indicate the operations to perform on it. Additionally, the order of the by() step corresponds to the order of the aliases specified.

In our example, select('f', 'foff'), our traversal must have two by() statements. The first by() performs actions on the elements labeled as 'f'; the second by() performs actions on the elements labeled as 'foff'. Figure 5.13 demonstrates how the by() steps correlate in our example.

Note Strictly speaking, it is possible to have greater or fewer by() statements than referenced elements. In these scenarios, the by() statements are used in a round-robin fashion. This can lead to confusion, so we always match the number of by() statements to the number of aliased elements to be clear about what should happen for each alias.

Upon examination, in figure 5.13, we notice that the first by() statement returns the first_name property from our vertices labeled as 'f', and the second by() statement yields the first_name property from our vertices labeled as 'foff'.

Figure 5.13 Diagram showing the portions of the traversal where we select the labeled vertices

5.2.2 Projecting results instead of aliasing

Sometimes, instead of looking back in the traversal for earlier results, it is preferable to project results forward from the current elements. Projecting results differs from retrieving the previous results in a simple, but somewhat subtle way. When we retrieve (or select) data, we can only get information that we already traversed and aliased. When we project results, we create new results, possibly branching to items not yet traversed. Let’s start by looking at projection in contrast to selection:

  • Selection is the process of working with vertices, properties, or additional traversal expressions to return results from previously labeled steps. Selection always looks back to earlier parts of the traversal.

  • Projection is the process of working with vertices, properties, or additional traversal expressions to create results from the input to the current step. Projection always moves forward, taking the incoming data as the starting point.

Understanding the difference between these two items is crucial. Selection is generally used to combine results from elements traversed earlier in the traversal. Projection is generally used to group or aggregate data starting from the current location in the graph (for example, finding the degree property of each member of a set of vertices, which we do later in this section).

Let’s return to our order-processing graph for an illustration. For this example, let’s answer the question, “For each of the products in order ABC123, how many times has that product been ordered?” Using what we already know from the previous section about selecting results, let’s use the following process:

  1. Find the order vertex ABC123.

  2. Traverse the contains edge to each of the product vertices, aliased as p.

  3. Traverse out all the contains edges, aliased as c.

  4. Return a selection of the name of p with the count of c.

Completing step 1 has our traverser located on the order vertex. Figure 5.14 illustrates this step.

Figure 5.14 Our order-processing graph after step 1 with our traversers sitting on the order vertices

Moving along, completing step 2 of the process has two traversers, one located on each adjacent product vertex. Figure 5.15 shows this step.

Figure 5.15 Our order-processing graph after step 2 with our traversers sitting on the product vertices

With our next to last step (step 3), the traversers are located on the contains edges. Figure 5.16 shows this step.

With everything we’ve learned about using aliases and returning results, this is the correct way to think about the problem. However, if you count the gremlins in figure 5.16, you see there are three. Each of these returns its own results, based on what it knows. What they each know is the widget they came from and how many edges these occupy. This approach yields the following:

{name: widget 1, count: 1}
{name: widget 2, count: 1}
{name: widget 2, count: 1}
 

Figure 5.16 Our order-processing graph after step 3 with our traversers sitting on our contains edges

That is one result for each of the three gremlins. What we want instead is a result like this:

{name: widget 1, count: 1}
{name: widget 2, count: 2}

Why didn’t the traversal return the expected results? Why did we end up with three gremlins and three results each with a count of 1? Previously, the return values from our traversals were generated by selecting the values of previously traversed and labeled elements. In the case of the previous order-processing traversal, we end up with three traversers. So, what do we do instead? The steps shown are mostly valid, but require a few tweaks:

  1. Find the order vertex ABC123.

  2. Traverse the contains edge to each of the product vertices.

  3. Traverse out all the contains edges.

  4. Return a projection of the product name with a count of the incident contains edges.

Studying the difference between the two processes, we’ll notice a few specifics:

  • We’re no longer aliasing elements as we traverse these.

  • We don’t have to traverse back out to the contains edge a second time.

In the example traversal, we use projection to end up with the count of contains edges for a specific product because we want to branch our logic at the product vertex.

Whew, we know this is a lot to take in, so let’s see what this looks like in a practical example from our social network graph. Let’s say we applied the same concepts to this question: “Find the degree property for every person vertex in my graph.” To answer this question, we need to do the following:

  1. Find all the person vertices in the graph.

  2. Create a new result object with the name and degree keys.

  3. For the name key, return the first_name of the person.

  4. For the degree key, count all the edges for the person.

We also need a new step, the project() step:

  • project(string[])--Projects the current object into a new object or objects as specified by the criteria in the by() modulators

Let’s apply what we learned about the by() modulators and projection. Figure 5.17 shows the traversal we should get from this.

Figure 5.17 Diagram showing the portions of the traversal describing the project() step

As we did with select(), we’ll use the by() modulators to instruct the project() step on how to return its results.

IMPORTANT In this traversal, instead of specifying a property name in the second by() statement, we specify additional traversal steps. This takes the incoming element (in this case, the product vertex) and then performs additional traversal steps from that point in the graph. This ability to specify additional traversal steps within a by() step isn’t unique to project(). We can specify these sub-traversals with either a select() or other Gremlin steps. This is quite powerful--the ability to do complex operations within a traversal or steps within steps.

When we run this traversal on the complex social network graph, we get the following results:

g.V().hasLabel('person').
  project('name', 'degree').
    by('first_name').
    by(bothE().count())
==>{name=Dave, degree=5}
==>{name=Paras, degree=2}
==>{name=Josh, degree=3}
==>{name=Denise, degree=3}
==>{name=Ted, degree=2}
==>{name=Hank, degree=2}
==>{name=Kelly, degree=3}
==>{name=Jim, degree=4}

To review, let’s summarize the differences between the two methods: selection versus projection. Then we’ll compare these side by side in figure 5.18 before we move on:

  • Selection uses the select() step to create a result set based on previously traversed elements of a graph. To use the select() step, we alias each of the elements with the as() step for later use.

  • Projection uses the project() step to branch from the current location within the graph and creates new objects. In our present example, we had one element remain static, the person’s name, but we needed the other elements to be calculated through further traversing of the graph to return the number of friends.

Figure 5.18 The select() step looks back to previously aliased steps, and the project() step takes the incoming data and moves forward with it.

Don’t worry if you don’t feel comfortable with how to manipulate results right away. Our experience tells us that, as with many powerful software concepts, it takes some practice to know which technique produces the desired result. It does become more natural with regular practice, and by regular practice, we mean trial and error until you get the desired results. Now that we know how to construct complex result structures, how do we go about returning these in a predictable way?

5.3 Organizing our results

In this section we investigate two other mechanisms to manipulate results: ordering and grouping. Most clients want nice, clean, ordered data. However, as in most relational databases, graph databases don’t guarantee the order of the results by default. This leads us to the three most common requirements for organizing our result data:

  • Ordering the results

  • Grouping the results

  • Limiting the size of the results

5.3.1 Ordering results returned from a graph traversal

Clients often expect the returned data to be sorted by one or more properties. For example, when displaying everyone in the graph by name, people usually expect to see the results in alphabetical order. This leaves us with a couple of options.

The first option is to return all the names and sort these client-side, in memory, within the application. While this works, it’s undesirable. For example, let’s say our application only shows the first 10 names of a possible 100. Sorting all the data client-side means that we return all 100 values, sort these, and then choose the top 10. This is inefficient and adds load not only on the client, but also on the database and the network. While there are scenarios where this might make sense, such as if we were caching all the names in the application for repeated reuse, we normally want to reduce unneeded work.

This leaves us with the second option: sorting the names first on the server-side. This is the method frequently taken in a relational database and is also common in a graph database. In SQL, we use the ORDER BY clause like this:

SELECT *
FROM person
ORDER BY first_name;

The syntax in a graph database is similar. In fact, to order results in Gremlin, we use the following step:

  • order()--Collects all objects up to this point of the traversal into a list, which is ordered according to the accompanying by() modulator

This new step, used in conjunction with the by() modulator, specifies how to arrange our data and which property to use to sort the results. For example, to order the names of every person vertex in the graph by first_name, we use this traversal:

g.V().hasLabel('person').values('first_name').
  order().
    by()
==>Dave
==>Denise
==>Hank
==>Jim
==>Josh
==>Kelly
==>Paras
==>Ted

The order() step defaults to sorting in ascending order. To sort by descending order, specify the decr parameter in the by() step as shown:

g.V().hasLabel('person').values('first_name').
  order().
    by(decr)
==>Ted
==>Paras
==>Kelly
==>Josh
==>Jim
==>Hank
==>Denise
==>Dave

While sorting in either ascending or descending order is common, there are times when we want to order data randomly, such as when sampling. For this, we use the shuffle parameter in the by() step:

g.V().hasLabel('person').values('first_name').
  order().
    by(shuffle)
==>Dave
==>Jim
==>Ted
==>Paras
==>Kelly
==>Hank
==>Denise
==>Josh

Ordering is probably the most frequent requirement for formatting data. Another typical need is to group or count the number of items in a group.

5.3.2 Grouping results returned from a graph traversal

If we return to our previous friends-of-friends traversal, the client might want to return that list grouped by which of Ted’s friends they’re friends with. In this scenario, we’re left with the same choices as with ordering data: either perform the work on the client-side in the application or on the server-side in the database.

There’s a natural desire, and one which we encourage, to push as much of this work as close to the data as possible. In SQL, we accomplish this by using the GROUP BY clause:

SELECT f.person_id, count(foff.*)
FROM person


WHERE person.first_name = 'Ted'

Similar to ordering data, the syntax in a graph database for grouping is comparable when using Gremlin. To perform these grouping operations, we can use either of the following steps:

  • group()--Groups the results based on the specified by() modulator. Data is grouped by using either one or two by() modulators. The first one specifies the keys for the grouping. The second one, if present, specifies the values. If not present, the incoming data is collected as a list of the values associated with the grouping key.

  • groupCount()--Groups and counts the results based on the specified by() modulator. It takes one by() modulator to specify the keys. The values are always aggregated by the count() step.

In the following, we apply these steps to group all of the friends-of-friends of Dave by his friends:

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  group().
    by('first_name')
 
==>{Denise=[v[19], v[19]], Ted=[v[4]], Hank=[v[6]], Paras=[v[17]], Josh=[v[2], v[2]], Dave=[v[0], v[0], v[0], v[0], v[0]], Kelly=[v[13]], Jim=[v[15]]}

We can see that our traversal returns a Map containing arrays of vertices for each name. Because we didn’t specify a second by() modulator, it simply collected the references to the vertexes into a list. To make this a bit easier to read, let’s use the unfold() step:

  • unfold()--Unrolls an interable or map into its individual components

Applying the unfold() step to our results unwinds those into individual records for each name. For example

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  group().
    by('first_name').
  unfold()
==>Denise=[v[19], v[19]]
==>Ted=[v[4]]
==>Hank=[v[6]]
==>Paras=[v[17]]
==>Josh=[v[2], v[2]]
==>Dave=[v[0], v[0], v[0], v[0], v[0]]
==>Kelly=[v[13]]
==>Jim=[v[15]]

Instead of returning the actual vertices for each name, what if we were more interested in discovering which of Dave’s friends is the most popular? To do aggregated grouping, we need the count for a group by name, so we use the groupCount() step:

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  groupCount().
    by('first_name').
  unfold()
 
==>Denise=2
==>Ted=1
==>Hank=1
==>Paras=1
==>Josh=2
==>Dave=5
==>Kelly=1
==>Jim=1

The groupCount() step is just a little syntax sugar for the most common use of the group() step--aggregating counts of things. As a quick point of comparison, and a nice demonstration of how we can use a traversal in a by() modulator, the group() version of the groupCount() step is

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
group().
    by('first_name').
    by(count()).
  unfold()
==>Denise=2
==>Ted=1
==>Hank=1
==>Paras=1
==>Josh=2
==>Dave=5
==>Kelly=1
==>Jim=1

Note how we were able to use a traversal of one step (the count() step) in our second by() modulator. The group() step applied by() to all of the incoming vertices that shared the same first_name value. As these examples demonstrate, grouping and ordering results in a graph database is similar to the process used in a relational database.

5.3.3 Limiting results

The final topic in organizing results is returning a subset of the data. This is commonly used to minimize the result size or for pagination functionality. For example, let’s say that we want to return all the names for people in our graph, but our graph contains one million people. Can any application display all at the same time? Usually we want to limit the initial results and then allow the user to move through the data set in groups of records. This approach is standard in a number of types of applications.

As with grouping or ordering, the question remains: Do it on the client-side or do it on the server-side? In this case, it’s almost always better to limit data on the server before returning it to the client. That provides a drastic reduction in resources across the whole stack, from the database to the network to the application. In SQL, we use the LIMIT clause for this:

SELECT *
FROM person
LIMIT 10;

As before, the approach in a graph database is similar. Gremlin, however, has several steps depending on the desired outcome: first X results, last X results, or X results from within the data set:

  • limit(number)--Returns the first number of results

  • tail(number)--Returns the last number of results

  • range(startNumber, endNumber)--Returns the results from startNumber (inclusive, zero-based) to endNumber (not inclusive)

These three steps are usually paired with an ordering step because graph traversals don’t guarantee the order of the data returned. Let’s say that we want to return only the top three names from our graph, ordered by first_name. If we extend the traversal we built in the ordering section to add the limit() step, we get a traversal that looks like this:

g.V().hasLabel('person').values('first_name').
  order().
    by().
  limit(3)
==>Dave
==>Denise
==>Hank

What if we want to do the reverse and return the last three names? How would we accomplish this task?

Exercise Take a minute and think about what you’ve learned about manipulating our results. What would you do to answer the question?

We see two ways to do this. The first is to use the previous traversal, but arrange names in descending order instead of ascending order; then limit our results to the top three. The second way is to use the tail() step instead of the limit() step like this:

g.V().hasLabel('person').values('first_name').
  order().
    by().
  tail(3)
==>Kelly
==>Paras
==>Ted

Both methods accomplish the same goal. It’s therefore up to your discretion to decide which one you want to choose.

The last requirement to discuss is pagination of results. Let’s say we want everyone ordered by first name, but only three at a time. We use the range() step and specify the first and last result number to return:

g.V().hasLabel('person').values('first_name').
  order().
    by().
  range(0, 3)
==>Dave
==>Denise
==>Hank

By manipulating the startNumber and endNumber values, we can page through our results. For example, if we wanted to move to a second page of results, we could accomplish this by incrementing the values in our range step to range(3, 6).

5.4 Combining steps into complex traversals

Given all these different ways of manipulating our data, we’ll share one last example that combines these concepts together to answer the question, “What three friends-of-friends of Dave have the most connections?” We’ll start with the answer and then break it down into its component parts. First, the traversal to answer this question follows:

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  groupCount().
    by('first_name').
  unfold().
  order().
    by(values, desc).
    by(keys).
  project('name', 'count').
    by(keys).
    by(values).
  limit(3)
 
==>{name=Dave, count=5}
==>{name=Denise, count=2}
==>{name=Josh, count=2}

That’s a lot to take in when you are used to traversals with at most five steps. This one has nine steps plus five modulators. In chapter 8, we introduce a methodology for developing more involved traversals such as this one. In this final example for this chapter, we walk through our approach to understand the traversal that is already written.

The first step is to ascertain the traversal writer’s intent. Here, we know that the traversal is supposed to answer a specific question: “What three friends-of-friends of Dave have the most connections?” We may not always know the intent when looking at someone else’s traversals. Out of consideration for the developers who might someday need to support the traversal you write, we recommend either having descriptive method names (e.g., getTop3FriendOfFriendsByEdgeCount) or including a helpful comment in the code. We discuss adding comments a bit later in this section.

Next, determine the traversal’s starting point. It is a single vertex or a type of vertices. We do this by looking at all of the filtering steps at the start. Remember that filtering steps such as has() can be chained together efficiently. In this example, the traversal starts at a single vertex, the Dave vertex, before it traverses to other parts of the graph:

g.V().has('person', 'first_name', 'Dave')

Next, we want to take each step, or collection of steps, in their order. While reading through the steps, we want to develop our mental view of our position within the graph at each point. Sometimes, we find it helpful to add comments to the code as a way of taking notes. For example, with the first few steps, we might add comments like the following:

g.V().has('person', 'first_name', 'Dave').  // single person: Dave
  both().                                   // friends
  both()                                    // friends of friends

This helps us to keep track of the output at each step. The only problem with this is that not all traversal processors support inline comments like this. For example, while our IDE recognizes this as valid Groovy code, the Gremlin Console gives us an error if we attempt to run it. To be able to run the code in Gremlin Console, we need to change the comment to something like this:

// single person: Dave
g.V().has('person', 'first_name', 'Dave').
  // friends
  both(). 
  // friends of friends
  both()

But that makes the traversal verbose in ways that are almost unhelpful. In these situations, we are tempted to keep two versions of the traversal in our IDE or editor, one with comments and the other for testing. Either way, we can see that the first three lines get us to the friends of Dave’s friends. Testing in the Gremlin Console gives us

g.V().has('person', 'first_name', 'Dave').
  both().
  both()
==>v[19]
==>v[17]
==>v[0]
...

Let’s look at the next set of steps, the groupCount() step, it’s by() modulator, and the unfold() step. We can run just those steps through the Gremlin Console and review the results:

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  groupCount().
    by('first_name').
  unfold()
==>Denise=2
==>Ted=1
==>Hank=1
==>Paras=1
==>Josh=2
==>Dave=5
==>Kelly=1
==>Jim=1

From this, we can see that we get a series of key-value pairs, where the key is the first_name of the friends-of-friends vertex, and the value is the number of times that it appears in the results. Referring back to our starting question, this covers the friends-of-friends of Dave and their number of connections. The groupCount() step handled both the needed grouping (by friends-of-friends’ first names) and the count aggregation. An unfold() step is tacked on to simplify the ordering, which is next:

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  groupCount().
    by('first_name').
  unfold().
  order().
    by(values, desc).
    by(keys)
==>Dave=5
==>Denise=2
==>Josh=2
==>Hank=1
==>Jim=1
==>Kelly=1
==>Paras=1
==>Ted=1

The ordering statement gets interesting. First, it orders by the values in descending order and then by the keys. The ordering by the keys is a nice little tie breaker to ensure somewhat deterministic results. We’re not sure that an alphabetical bias by first name is the best approach, but it certainly works for ensuring consistency. Now let’s reformat these results so that it’s easier to parse with a client program:

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  groupCount().
    by('first_name').
  unfold().
  order().
    by(values, desc).
    by(keys).
  project('name', 'count').
    by(keys).
    by(values)
==>{name=Dave, count=5}
==>{name=Denise, count=2}
==>{name=Josh, count=2}
==>{name=Hank, count=1}
==>{name=Jim, count=1}
==>{name=Kelly, count=1}
==>{name=Paras, count=1}
==>{name=Ted, count=1}

Here, we use a project() step so that we have objects with clear labels on the properties. This probably isn’t necessary if working with data within a single traversal, but it is a great practice when returning results to another program. Now, the only operation that remains is to limit the results:

g.V().has('person', 'first_name', 'Dave').
  both().
  both().
  groupCount().
    by('first_name').
  unfold().
  order().
    by(values, desc).
    by(keys).
  project('name', 'count').
    by(keys).
    by(values).
  limit(3)
 
==>{name=Dave, count=5}
==>{name=Denise, count=2}
==>{name=Josh, count=2}

This example demonstrates how we brought together the concepts and constructs in this chapter to format results for client software. And, as a bonus, we walked through how to examine a traversal piece by piece to understand its operations. In the next chapter, we’ll bring the skills from the last few chapters together to build a working application.

Summary

  • By default, properties aren’t returned from graph elements, so we must explicitly ask for those. In Gremlin, we use steps such as values() and valueMap() to retrieve the values in the desired form.

  • Aliases in the traversal allow for referencing results from earlier steps in later steps, supporting composition of powerful traversals.

  • Selecting and projecting steps create complex results from multiple vertices or edges, allowing for the composition of intricate result structures.

  • Selection creates a result set based on previously traversed elements of a graph. To use the select() step, we alias with the as() step elements for use in later steps.

  • Projection operates from the current location within the graph and creates new objects with either static or calculated properties.

  • Ordering, grouping, or counting by group are common ways to transform results using the order(), group(), and groupCount() steps.

  • The limit() step limits the number of results, the tail() step returns the last X records, and the range() step allows for result pagination.

  • Combining different steps performs complex manipulation and transformation of traversal results in the database prior to returning the results to a client. Used appropriately, this improves performance in the database, across the network, and in the application itself.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
52.14.121.242