8 Building traversals using known walks

This chapter covers

  • Creating known-walk traversals
  • Translating business questions into graph traversals
  • Prioritizing strategies for traversal development
  • Paginating results in a graph traversal

Denise is one of the users of DiningByFriends. She recently traveled to Cincinnati, Ohio, for work and is looking for a recommendation for an excellent restaurant for dinner. From the work we did in the last chapter, we know that our data model contains all the necessary information to answer this type of question. We also know that to answer this question, we need to develop traversals that

  • Traverse a specified set of vertices and edges

  • Traverse those elements in a set order

  • Traverse those elements a specific number of times

In the last chapter, we learned that traversing a graph with the attributes from this list is a pattern described as a known walk. Although we introduced the concept of known walks in chapter 7, this chapter dives into how we develop traversals for this pattern. To demonstrate this, we use the tangible target of our recommendation engine use case. We start by revisiting the requirements of the restaurant recommendation engine. We then identify the vertices and edges needed for our known-walk traversals. We follow that with developing the traversals for the use case. Finally, we incorporate the traversals into our application.

In previous chapters, we developed our traversals separately from the Java code and then added these to our application. This separated traversal drafting and testing and application development into two distinct steps. We did this to avoid intermingling developing traversals with that of constructing an application. The reality is that most developers do both at the same time, regardless of database engine.

In this chapter, we’ll follow a more standard developer workflow and combine the creation of traversals with adding these to our application. Along the way, we’ll provide a variety of tips and best practices to aid in our process of developing graph-backed applications. By the end of this chapter, you’ll have a simple recommendation engine for our DiningByFriends app. You’ll also learn how to develop applications using known walks to solve substantive business problems. Let’s get started.

8.1 Preparing to develop our traversals

Before we begin our development process, we need to gather two critical pieces of information. First, we need the requirements of our use case. Second, we need our graph data model. Revisiting the requirements and data model for our recommendation engine, these questions make up the requirements for our recommendation engine use case:

  • What restaurant near me with a specific cuisine is the highest rated?

  • What are the ten highest-rated restaurants near me?

  • What are the newest reviews for this restaurant?

To develop the traversals that answers these questions, we’ll use our logical data model. Figure 8.1 shows this model.

8.1.1 Identifying the required elements

With the questions and logical data model, we can begin writing our traversals by identifying the necessary vertex labels and edge labels for this use case as we did in chapters 3, 4, and 5. One addition to our process is that, because we are working with a more complex logical model, one with multiple vertices and edges, we need to perform a few steps in preparation for writing our traversals. We go through these steps for each of the three questions in our use case:

  1. Examine each requirement and break it into the components needed to answer the question.

  2. Identify the required vertex labels.

  3. Identify the required edge labels.

Figure 8.1 The DiningByFriends logical data model with the relevant portions highlighted for the recommendation engine use case

Although these steps seem logical and straightforward, the devil is in the details. In our experience, we find that this process is better understood by working through a few examples. Let’s use our restaurant recommendation engine requirements to show how the preparation is different when working with a more involved model.

Our first question in the use case is, “What restaurant near me with a specific cuisine is the highest rated?” Breaking this question into the required actions, we see three pieces are needed:

  • What restaurant near me . . . (locates the restaurants in a geographic area “near me”)

  • . . . with a specific cuisine . . . (filters by cuisine to find a specific cuisine)

  • . . . is the highest rated? (calculates the average rating for each restaurant to find the highest rated)

Looking at our data model, let’s see which vertex labels provide the information needed to satisfy these requirements. When looking for the required vertex labels, a good place to start is by looking for the nouns in the requirements. After we find the nouns, we look in the data model for the corresponding vertex labels for those nouns. While finding the corresponding labels can be a straightforward lookup, it frequently involves synonyms or additional concepts. We already did much of this work during our data modeling process, so we are able to leverage that work to quickly identify the required elements.

Applying this process to our requirements, we can identify the following required vertex labels. Note that the order of these doesn’t matter. We are in a design phase and simply want to make sure that we have what we need to answer the overall question:

  • restaurant--This is the core piece of information we need to return.

  • city--We need to find restaurants “near me,” so we use city to represent the geographical location of a restaurant as we explained in section 7.2.1.

  • cuisine--This allows us to filter by cuisine.

  • review--We need to calculate the average rating of a restaurant, and rating is a property of this entity.

What do you think? Have we listed all of the key vertices to return an answer for this traversal?

After reviewing, we find a subtle element missing. Did you notice that we need a person vertex? Because we need to find a restaurant “near me,” we need to know what city to search. This is contingent on where the person lives. From our data model, we know that the current user’s location is represented by the lives edge from the person vertex, so we need to include the person vertex.

Important Pronouns are easily overlooked when translating business questions into technical requirements and implementation. These tend to hide additional and more subtle requirements. Pay attention to pronouns when identifying required elements. In the current example, note how we called out the phrase “near me” (with an emphasis on “me” as an example of a potentially hidden element).

Now that we’ve identified the vertex labels, our next step is to look for the required edges. In our current model, this step is simplified because, for our four vertex labels (restaurant, city, cuisine, and review), we are limited in the edges we can use. We could just use the edge labels connecting those vertex labels and then proceed with a fair chance of success. But sometimes, there are multiple edge labels between vertices, or we might have missed an important part of the use case by only looking at the nouns. We should also remember to look for edge labels by examining the requirements of the use case and looking at the verbs (actions) in the questions.

Once we identify the verbs in our requirements, we can look for the corresponding edge names in our data model, much as we did with our nouns. As with that step, we are able to reuse much of the work we did while data modeling to quickly identify the required edge labels. Analyzing our requirements, we find the corresponding edges in our data model:

  • restaurant within city--Determines the location of a restaurant to use as a filter

  • restaurant serves cuisine--Categorizes the cuisine a restaurant serves to filter on it

  • review about restaurant--Provides all the reviews for a restaurant to calculate the average rating for that restaurant

  • person lives city--Provides a person’s location to find restaurants “near me.” (This is one of the more subtle or hidden requirements from our data model.)

Note Two of these edges contain a preposition (within, about) implying a verb, rather than a verb itself. For the grammarians among our readers, in these cases in our data model, we decided to omit the verb is for simplicity; within is simpler to write than is_within. As we mentioned, naming things is hard, and naming the edges is no different.

Turns out that these are the only options for connecting the four vertex labels, so these are the edges we use, given our schema. It is possible that we missed something in our modeling work in chapter 7. If so, we would discover that at this point because there would be a disconnect between our use case requirements and the schema. Alternatively, if we skip the modeling work altogether, we must do that now so that the design satisfies the requirements. But because these edges from the logical model align nicely with the use case requirements, our efforts in chapter 7 were sufficient.

In our example, we found nine total elements for the first question: five vertex labels and four edges. Let’s move on to see what’s required for the other two questions that make up the recommendation engine. For each question, we follow the same process: extract the information needed to answer the question into a series of steps, find the nouns (or pronouns) in these steps to locate the corresponding vertex labels, and then find the verbs in our actions to identify the required edges. Let’s see how this looks for our second question, “What are the ten highest-rated restaurants near me?” We start by extracting the requirements needed to answer the question:

  • Locate restaurants in a geographic area.

  • Determine a user’s location to filter on restaurants in that area.

  • Calculate the average restaurant rating in order to sort the restaurants and return only the top 10.

As with the previous example, we find the vertices by looking for the nouns or their synonyms, which provide the corresponding vertex labels in our data model:

  • restaurant--This is the core piece of information we need to return.

  • person--Locates a user, which is needed to satisfy the “near me” requirements of our question.

  • city--Defines the location of both the user and the restaurant.

  • review--Required to calculate the average rating for a restaurant because rating is a property.

Moving on to our next step, reviewing the logical data model to locate the verbs, we need the following edge labels:

  • restaurant within city--Defines the location of a restaurant “near me” to filter on

  • person lives city--Defines where the user lives

  • review about restaurant--Provides all the reviews for a restaurant needed to calculate the average rating

Following this process, we identify seven required elements for this question: four vertex labels and three edges. While this is fewer than the previous question, that’s not a problem. Because we follow the same process, we have a high degree of confidence that we did not miss anything. We can also compare the two questions. Doing so, we see that the main difference is that this question does not include a reference to cuisine. Otherwise, these questions are quite similar. This gives us the confidence that we’re on the right track.

Exercise For the third question--“What are the newest reviews for this restaurant?”--go through the process on your own and compile a list of vertices and edges that you think are required.

How many elements did you come up with? When we looked at that question, we found three. Our two vertex labels include

  • restaurant--Required to find appropriate reviews for the current restaurant

  • review--This is the core information being returned. For this question, we also assume that “newest” refers to the date the review was written, so we also need the created_date property, which allows us to find the newest reviews.

And our sole edge label is

  • about (connecting review and restaurant)--Required to associate a set of reviews to the appropriate restaurant. This is also where our created_date property lives, which allows us to find the newest reviews.

In the three questions for our restaurant recommendation engine, identifying some of the required elements, such as restaurant, was fairly straightforward. However, identifying other elements, such the elements required for “near me,” are less obvious and require us to leverage our experience when creating the logical data model to answer these. We are also aided in that our logical model has only a single edge label among vertex labels in all cases. We are now ready to get started writing our traversals, but where do we begin?

8.1.2 Selecting a starting place

Before writing our traversals, we need to make a crucial decision: Where do we begin our development work? We can’t build three traversals all at the same time, so which use case should we address first? For this, we see two reasonable approaches.

One approach is to pick what we think is the most challenging problem and start there. This approach works well when there are compelling unknowns or project risks, such as introducing new technologies or processes into the development ecosystem. This approach allows us to fail fast and is the right choice when a quick decision of some sort is needed--perhaps to make a “go/no-go” decision or to determine whether the technology is the right choice for the problem.

Another approach is to start with the most straightforward or the least complicated question and use it as a building block for the rest of our work. This path allows for the progressive development of the code base and provides an excellent way to avoid biting off more than you can chew. The idea here is to get a quick win or success with a smaller, simpler problem before tackling more complex issues.

Let’s look at the questions for our recommendation use case. Then we can decide which approach and questions look like the best place to start.

Question

Vertex Labels

Edge Labels

What restaurant near me with a specific cuisine is the highest rated?

personrestaurantcitycuisinereview

lives (connecting person → city)within (connecting restaurant → city)serves (connecting restaurant → cuisine)about (connecting review → restaurant)

What are the ten highest-rated restaurants near me?

restaurantcityreviewperson

lives (connecting person → city)within (connecting restaurant → city)about (connectingreview → restaurant)

What are the newest reviews for this restaurant?

restaurantreview

about (connecting review → restaurant)

The questions seem to become progressively simpler, with the last question having the fewest required elements. Which of the approaches would you choose?

If we were to choose the first approach, starting with the most challenging, we’d want to start with the first or second question. These questions are more complicated and involved than the last one. If we wanted to prove our graph technologies, choosing one of these questions would provide us with the most knowledge about the problem with a single effort. On the other hand, we could choose the second approach, starting with least complicated, which allows us to get a win or success with a smaller, simpler problem before tackling more complex issues.

For our present work, let’s choose the second approach: start with the simplest problem and use that as a building block for the harder questions. For this project, we don’t have a go/no-go point, and we don’t have significant budget constraints. Without these constraints, we’d rather start small and get a quick win. However, before we begin developing in earnest, there’s one more task to accomplish: setting up some test data for our work.

8.1.3 Setting up test data

Before we begin writing our traversals, the last step in our preparation is to load some test data. As with any database development, work goes much faster with data, preferably real data. Thus, a useful test data set should cover our core use cases at a minimum and, ideally, the known edge cases. As we work with the code and with the data, we expect to uncover additional edge cases. These make excellent candidates for adding to our test data set, as well as using these additional edge cases for unit and integration tests.

For this book, we included a set of test data with the code for this chapter, which may be found here: https://github.com/bechbd/graph-databases-in-action. Note that this script works a bit differently than the previous scripts. Instead of using individual commands to independently create vertices and edges, this script loads data from a JSON file. To see the details of how this works, see http://mng.bz/jVV9 to look at the io() step in Gremlin. The downside of this approach is that we need to update the reference location of the data file in the loading script, 8.1-restaurant-review-network-io.groovy. To update the script, open it in a text editor of your choice and edit the line shown here to point to the full path and file location of chapter08/scripts/restaurant-review-network.json:

full_path_and_filename = "/path/to/restaurant-review-network.json"

Once that is done, if you set up the Gremlin Console according to the instructions in appendix A, then you can start the Gremlin Console and load the data for this chapter with this single command on MacOS or Linux:

s/8.1-restaurant-review-network-io.groovy

Or use this command on Windows:

bingremlin.bat -i $BASE_DIRchapter08scripts8.1-restaurant-review-network-io.groovy

Once this script completes, our graph now contains test data we will use throughout the remainder of this chapter for testing during our traversal writing process. We can quickly verify that the data set is loaded correctly in the Gremlin Console by typing g and pressing Enter:

         ,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:185 edges:318], standard]
gremlin>

8.2 Writing our first traversal

Now that we’ve decided where to start and have loaded our test data, it’s time to begin writing the first traversal. At this point, we like to break down the question into parts and progressively build out the code in a sequential, or at a least systematic way, using these steps:

  1. Identify the vertex labels and edge labels required to answer the question.

  2. Find the starting location for the traversal.

  3. Find the end location for the traversal.

  4. Write out the steps in plain English (or in your preferred language). The first step is your input step; the last step is the output or return step.

  5. Code each step, one at a time, with Gremlin and verify the results against the test data after each step.

As you probably noticed, this sounds like the process we went through to develop the traversals for our social network. This similarity is not an accident. Although we followed this process, we did not formalize it.

In chapters 3, 4, and 5, we chose to focus on the basics of writing a traversal instead of getting bogged down in the process. These were also (by design) much simpler traversals than what we build in this chapter. Now that you’ve learned the basics and are familiar with both thinking about and traversing a graph, as well as the Gremlin syntax, it is time to formalize our process as we take on more involved traversals. For the next section, we’ll employ this process consistently. However, for section 8.2.2, we’ll streamline the instructions and focus on how the execution looks when we are in the flow of writing code.

8.2.1 Designing our traversal

Now we are ready to begin designing our traversal to answer, “What are the newest reviews for this restaurant?” We start this process by identifying the required portions of the schema, based on the work from section 8.1.1. Figure 8.2 highlights the part of the logical data model (schema) that we’ll use to answer this question.

Figure 8.2 The logical data model showing the information required to answer the question, “What are the newest reviews for this restaurant?”

We use this part of the schema to choose a starting point for our traversal. As we learned in chapter 3, choosing a starting point is about looking for the criteria that quickly narrows down the number of starting vertices. This minimizes the number of traversers required and results in faster performance overall.

Looking at the question, we see that it revolves around, “this restaurant,” so that is a great place to start our traversal. Because we are looking for a single starting vertex (“this restaurant”), we start by filtering our traversal. This leads to a dilemma: Which property do we use to identify a restaurant, restaurant_id, name, or address?

  • name--This is problematic as restaurants can belong to chains, all of which have the same name. Using this as a filter could potentially return many restaurants.

  • address--Works well for standalone buildings, but a user might be in a food court, multi-tenant building, or at a mobile food truck. In any case, the address might not be apparent or readily available.

  • restaurant_id--This property more precisely describes a particular restaurant, so we can use this unique identifier to ensure that we get the single starting vertex.

Given these factors, let’s make a simplifying assumption that our traversal takes a restaurant_id as an input. Because the restaurant_id is a unique identifier of the restaurant, using this ID ensures that we get the single starting vertex we desire. With our starting vertex and filtering criteria identified, let’s move on to the next step of our process: determining the end point. We find the end point by looking for the nouns and qualifiers that describe what the traversal needs to return; what type of thing will the answer be?

Examining the question, we see that we want to return “the reviews” for the restaurant. This means that our traversal needs to traverse from our starting point, the restaurant vertex, to the review vertex to retrieve the reviews. This is the only additional information the question asks for, so this makes the review vertex our traversal’s endpoint. However, we don’t want to return the entire vertex. Users don’t want to get vertices back. An end user wants the text of the review, so we need to return the body property. Because we want the “newest” reviews, we also need to order by when a review was made. For that, we return the created_date property as well.

Now that we know that we start with the restaurant identified by its restaurant _id and return the created_date and body properties of the review vertices, our process gives us the following steps that our traversal must perform (we sort for newest later in this process):

  • Get the restaurant based on the restaurant_id.

     . . . 

  • Return the created_date and body properties of the review vertices.

In between these two steps is one or more steps that we’ll fill in as we complete this section. We have denoted these middle steps with an ellipsis (. . .) because we don’t know exactly how many steps there will be to get from our starting point to the final response. But we do know that we have some steps in between the starting point and the return data because we are not simply returning the starting vertex.

As a point of comparison, in a relational database these starting and ending objects would be two tables, and we need to determine the join conditions necessary to return the correct rows, which may require additional tables. In our graph database, we have identified two vertex labels, and we must construct the appropriate traversal steps to craft the desired response data. Some of this involves traversing through the graph with the steps we first introduced in chapter 3. Other parts of this involve formatting the response using the steps introduced in chapter 5.

Having identified our starting point and return data, let’s examine our schema (figure 8.3). This allows us to find the appropriate vertices and edges we need to traverse from one to another.

Figure 8.3 The graph schema shows that the starting point (restaurant_id) and ending point (created_date and body properties of the review vertices) are directly connected.

Because the starting and ending points are directly connected, this keeps things simple. Now we know we need to

  • Get the restaurant based on the restaurant_id.

  • Traverse the about edges to the review vertices.

     . . .

  • Return the created_date and body properties of the review vertices.

That’s closer, but the output needs to be ordered a specific way, namely, by the most recent reviews. Our review vertex has a created_date property that is well-suited for this purpose. Because we want the newest reviews, we sort the review vertices by created _date in descending order. Using properties such as a created date or an updated date is a widespread pattern in graph databases, as with other databases. That allows us to sort items chronologically. With this, we have a total of four steps in our path:

  • Get the restaurant based on the restaurant_id.

  • Traverse the about edges to the review vertices.

  • Sort review vertices by created_date in descending order.

  • Return the created_date and body properties of the review vertices.

Upon first glance, we might think that we’re done, but there’s another factor we should consider. There could be hundreds or thousands of reviews going back years for restaurants in the system. There’s a chance that the result set could be extensive, so let’s limit the results. At the end of this section, we throw in some bonus material on how to handle pagination, but for now let’s only return the top three. This wasn’t directly specified in the use case, but as experienced developers, we know that this is a reasonable implementation approach, one that we can later validate with user testing. Bringing everything all together, we now see that our traversal needs to perform the following actions:

  • Get the restaurant based on the restaurant_id .

  • Traverse the about edges to the review vertices.

  • Sort review vertices by created_date in descending order.

  • Limit the results to the first three returned.

  • Return the created_date and body properties of the review vertices.

Comparing these steps against the requirements of our question, we see that we’ve defined the traversal in sufficient detail to get started writing. The old saying, “The proof of the pudding is in the eating of it,” applies here too. We won’t know for certain that we have covered the requirements of the use case until we code the traversal. But now we have a clear set of steps that should make short work of writing the code.

The set of steps needed for this traversal looks similar, though not exactly identical, to the recursive and path-based traversals we developed for our social network. When we developed our recursive traversals, we had a known series of vertices and edges to traverse but an unknown number of times with which to traverse these. With those path-based traversals, we were interested in how our start and end points were connected (the path). In contrast, the known-walk or known-path traversals we work with now have a defined set of vertices and edges to traverse and a known number of times we must traverse these. We use known-path traversals when we are interested in finding out if our starting and ending points are related, not in how these are related.

In our present example, we need to traverse a single type of edge label. The variable won’t be in the number of edges labels traversed (there’s only the single about label), but in the number of instances of about edges. In relational databases, this is akin to the difference between the number of joins we have to use verses the number of rows actually returned.

8.2.2 Developing the traversal code

Now that we’ve designed our traversal approach, we’re ready to start developing it using our test data. This is the process of taking the bulleted points we designed in the previous section and writing the corresponding code. Looking at the actions we need to accomplish, we notice that we’ve already performed each of these steps in previous chapters when building pathfinding traversals. Because we already know the steps required, let’s develop this traversal to learn an iterative approach to constructing complex traversals.

This process is straightforward. We begin by writing the code for the first step and progressively adding code for each additional step, one step at a time. Then we verify that we are returning the expected results after each one.

Looking at our test data, let’s use that great greasy eatery, that hopping Houston mainstay, Dave’s Big Deluxe, as our example, which has a restaurant_id = 31. We should probably mention that this restaurant, as with all of the people and other restaurants in the examples, is entirely fictitious.

To allow us to easily test against multiple restaurants, let’s create a variable named rid and assign the restaurant_id of Dave’s Big Deluxe, 31. We add the variable using the following command in the Gremlin Console:

 31

Note As previously mentioned, the Gremlin Console allows for creating variables using Groovy syntax. In this case, we create a variable simply to ease the development of the traversal, so it’s not required. But using a variable makes it easy to test our traversal on multiple restaurants by merely changing the variable value instead of the traversal.

We already know how to create the traversals to find our starting point. Let’s filter on restaurants that match the rid variable:

',rid)     
 
==>v[288]

Gets the restaurant based on the restaurant_id = rid

It returned a result! That’s great, but it’d be nice to know if it’s the correct result, so let’s check our results. While not specifically required to construct our traversal, we find it helpful to add values steps to traversals while constructing these for immediate validation. In this case, we add the valueMap() step:

g.V().has('restaurant','restaurant_id',rid).
                              
 
==> {id=288, label=restaurant, address=[490 Ivan Cape], 
 restaurant_id=[31], name=[Dave's Big Deluxe]}

Returns all properties

Perfect! We got back Dave’s Big Deluxe, so we confirmed that we have the right restaurant vertex. Let’s move on to the next step: traversing the about edges.

Examining the schema, we see it is an inbound edge, meaning we use an in() step. (Don’t worry that the body of a review looks like gibberish. The data set we loaded contained some autogenerated text values, so these are not meant to be understandable.)

g.V().has('restaurant','restaurant_id',rid).
in('about').                                                              
  valueMap(true)
 
==>{id=894, label=review, rating=[5], created_date=[Wed Sep 26 18:30:16 CDT 
 2018], body=[Soluta velit quasi explicabo ut atque ratione nisi. ...]}
...                                                                        
==>{id=666, label=review, rating=[5], created_date=[Wed May 01 07:37:44 CDT 
 2019], body=[Quo et non aut ipsam qui autem aut. Voluptatem id. ...]}

Traverses the about edges to the review vertices

Results truncated to improve readability.

Whoa! Perhaps that valueMap() step is too helpful. If you’re following along, you should have returned a wall of text for the eight review results. We need a way to see those results, but perhaps without all of the noise. Referring back to our plan for this traversal, we know we want to return the created_date and body properties, so let’s update our traversal to only return those properties:

g.V().has('restaurant','restaurant_id',rid). 
  in('about').
p('created_date', 'body')                           
 
==>{created_date=[Sun Jul 19 01:43:31 CDT 2015], 
  ...                                               
==>{created_date=[Wed Sep 26 18:30:16 CDT 2018], 
 body=[Soluta   ...                               
==>{created_date=[Wed Jul 27 07:30:46 CDT 2016], 
 body=[Officiis ...                               
...

Returns only the created_date and body properties

Results returned unsorted

Note Within Gremlin Server, dates are stored in Coordinated Universal Time (UTC) formats; but for display purposes, Gremlin Console automatically translates dates to the local time zone.

That’s better; at least we only have a pair of properties. Looking back at our plan, the next step is to add logic to sort the results by the created_date value:

g.V().has('restaurant','restaurant_id',rid).
  in('about').
order().by('created_date').                       
valueMap('created_date', 'body') 
 
==>{created_date=[Sun Jul 19 01:43:31 CDT 2015], 
 body=[Dolorem ...                              
==>{created_date=[Wed Jul 27 07:30:46 CDT 2016], 
 body=[Officiis ...                             
==>{created_date=[Thu Mar 09 03:37:52 CST 2017], 
...                                               
... 

Orders reviews by the created_date

Results ordered in ascending order by created_date

We’re almost there. The results are sorted by date, but the wrong way! Recall that the order() step defaults to ascending order, so let’s use descending order:

g.V().has('restaurant','restaurant_id',rid).
  in('about').
desc).               
  valueMap('created_date', 'body')
 
==>{created_date=[Wed May 01 07:37:44 CDT 2019], 
 body=[Quo et ...                               
==>{created_date=[Tue Mar 12 20:33:43 CDT 2019], 
 body=[Ducimus ...                              
==>{created_date=[Wed Sep 26 18:30:16 CDT 2018], 
 body=[Soluta ...                               
...

Adds desc for descending order so the ordering is in the expected direction

Results sorted in descending order by created_date

We accomplished all the actions our traversal requires except for limiting the results. Let’s add that functionality now:

g.V().has('restaurant','restaurant_id',rid).
  in('about').
  order().by('created_date', desc).
  limit(3).                                       
  valueMap('created_date', 'body')
 
==>{created_date=[Wed May 01 07:37:44 CDT 2019], 
 body=[Quo et ...                               
==>{created_date=[Tue Mar 12 20:33:43 CDT 2019], 
 body=[Ducimus ...                              
==>{created_date=[Wed Sep 26 18:30:16 CDT 2018], 
 body=[Soluta ...                               

Limits results to three

Only receive three results.

That’s great! It looks like we’re done. We have the right data, returned in the proper order, and limited to the correct result size. What’s next?

Extending our traversal with the ID

While the previous traversal handles all the requirements of our question, our application is likely going to need more than just the body of the review text. Most applications work from domain objects, so we probably want to create a review object for our application, with a unique identifier tying our domain object back to the underlying database.

As discussed in section 4.1.1, we discourage the use of a database engine’s internal ID values for any business logic; however, due to the ease of using that ID, it’s common practice, which is why we discuss it here. Using the internal ID results in a leaky abstraction that tightly couples your application logic to the underlying database implementation. The best practice is to use either a natural key from your data or an application-generated synthetic key. Now that we’ve restated our “use ID’s wisely” message, let’s update our traversal to return the ID of the review vertex, along with the created_date and body properties:

g.V().has('restaurant','restaurant_id',rid).
  in('about').
  order().by('created_date', decr).
  limit(3).
  valueMap('created_date', 'body').
    with(WithOptions.tokens)                      
 
==>{id=666, label=review, 
 created_date=[Wed May 01 07:37:44 CDT 2019],
   
==>{id=564, label=review, 
 created_date=[Tue Mar 12 20:33:43 CDT 2019],
 body=[Ducimus maxime corrupti et aut...        
==>{id=894, label=review, 
 created_date=[Wed Sep 26 18:30:16 CDT 2018],
 body=[Soluta velit quasi explicabo ut...       

Returns the ID and label metadata of the vertex

Results now include id and label properties.

valueMap() and the with() step

As of TinkerPop version 3.4, the valueMap() step takes an optional with() step for modifying the output, so it is a fairly recent addition to TinkerPop. Not all vendors may support it. Prior to TinkerPop version 3.4, the valueMap() step, including the IDs and labels, uses a Boolean parameter like valueMap(true), or, more specific to our present case, valueMap(true, 'created_date', 'body'). That form should still work, at least up to TinkerPop version 3.5.

TinkerPop is striving for more consistency in its implementation and now uses a with() step to provide configuration information for the valueMap() step. By specifying with(WithOptions.tokens), we can include both Gremlin’s internal ID value as well as the label of the vertex in our responses.

Adding this traversal to our application

Whew, it certainly took more work to develop this known-walk traversal than the social network ones, but the hard work is over. Luckily, the process of adding it to our actual application is the same as the one we used for adding our social network traversals in chapter 6. We won’t go over the details again, but you can find the full Java code in the chapter08/java folder in a new method called newestRestaurantReviews(). We share the relevant parts of the method here with some call-outs for the minor differences from the Gremlin code we drafted. The following shows the newestRestaurantReviews() method:

private static String newestRestaurantReviews(GraphTraversalSource g) {
    Scanner keyboard = new Scanner(System.in);
    System.out.println("Enter the id for the restaurant:");
    Integer restaurantId = Integer.valueOf(keyboard.nextLine());
        
    List<Map<Object, Object>> reviews = g.V().
        has("restaurant", 
 "restaurant_id", restaurantId).                  
        in("about").
        order().
          by("created_date", Order.desc).           
        limit(3).
        valueMap("rating", "created_date", "body").
          with(WithOptions.tokens).
        toList();                                   
 
    return StringUtils.join(reviews, System.lineSeparator());
}

All strings in Java must use double quotes.

Uses TinkerPop’s Order enumeration

All traversals require a terminal step; here we use toList() when not using the Gremlin Console.

Congratulations, you built your first known-walk traversal and you now know the process of iteratively constructing a traversal! You can repeat this process of starting with a specification of steps, adding each Gremlin step one at a time, and testing the traversal against the test data to make sure you get the expected results until you’ve satisfied all the identified parts.

8.3 Pagination and graph databases

This traversal is also a great way to illustrate one of the more challenging patterns with graph databases--pagination.1 As we noted, it’s possible that our traversal can return more results than the requesting software wants or can handle. We addressed that concern in the last section by limiting our results to just the newest three reviews. What if the requesting software wanted access to all the results, just not all of those at the same time?

Before we investigate how graph databases handle pagination, let’s take a quick look at how relational databases handle pagination to use that as a comparison. Wait--we can’t take a quick look because it seems that every relational database engine does it slightly differently, each with its own semantics. What is common is that most pagination implementations take two inputs:

  • offset--The number of records to skip. To start at the beginning of the data set is to have offset = 0. Offset values are multiples of the page size. With a page size of 10, possible offset values would include 0, 10, 20, 30, and so forth.

  • limit--The page size or the maximum number of items to return. We stress that the limit is the maximum number of items because result sets won’t always return the page size (e.g., the final page can contain fewer objects than the specified page size).

The general pattern for dealing with pagination in a relational database, and in a graph database it is the same:

  1. Retrieve the values of the traversal.

  2. Start with the record at the offset index.

  3. Return the limit number of values.

  4. Repeat the process with a new offset value equal to offset + limit.

To handle this in Gremlin, we use the range() step. Although we briefly introduced the range() step in chapter 5, here we expand the definition slightly:

  • range(startNumber, endNumber)--Passes through the objects, starting with and including those indexed at the startNumber, continuing up to but not including those indexed with endNumber. So startNumber is inclusive, and endNumber is exclusive.

A keen observer will notice that while startNumber is the same as the offset, endNumber isn’t the same as the limit used in relational databases. Instead, endNumber = startNumber + limit. A pagination function that take the usual inputs of offset and limit must compute the endNumber.

The startNumber and endNumber values apply to an element index returned by the traversal. This element index value is somewhat analogous to the results of a SQL ROW_NUMBER() function. This value is a zero-based element index number assigned to each of the elements based on the order those are passed to the range() step.

Importance of ordering the inputs before calling range()

For pagination to work as expected, the order of objects passed to the range() step matters. This means we need to order elements prior to paginating them. Without ordering, the results can arrive at the range() step in any sequence, and subsequent calls to the same traversal can result in a different ordering of the objects as these enter that step. Not sure what we mean by this? Well, let’s take a look.

For example, assume that a graph has five vertices: ([v[0], v[1], v[2], v[3], v[4]]). Then let’s assume that we request vertices two at a time, and that we make that request three times. This looks like the three following calls:

g.V().range(0,2)
g.V().range(2,4)

Running this, we expect to obtain the following output:

1]
v[2], v[3]
v[4]

This output is only true if g.V() returns the same order of vertices each time. What would happen if it changed the ordering each time?

In theory, if a database provides a randomized return order, then we get back a seemingly random pair of vertices for each call. Because our call returns the values based on an index, if the element in that index changes between runs, so does the value that is returned. TinkerPop guarantees that it returns elements in the order these enter a step, but it is up to the actual underlying engine to specify that order. In other words, there’s no guarantee about the order unless we specify an order.

This is no different than relational databases: the database engine determines the order of the results according to its own internal logic. This means that to provide a consistent experience to our users, we must sort things before calling the range() step.

Ordering is an expensive operation

We must note that ordering results is an expensive operation in any database, particularly with large data sets. To order the results of a traversal, the database must first return all the results and then sort them. This cost is the same in both relational and graph databases.

In TinkerPop, the order() step is categorized as a “collecting barrier step.” Unlike most TinkerPop steps, which are lazy, meaning that the data is processed opportunistically as new values enter the step, the order() step (and other collecting barrier steps) first collect all incoming values before ordering these and then send the results to the following steps. Yet again, this is no different than any relational database, because to provide a sorted set of values, we must first know all the values we need to sort.

Let’s take a look at what our traversal from the last section looks like when we update it to include pagination. To reduce the output text, we use the rating property instead of the body property. When adding pagination, we need to make the following changes:

  1. Replace the limit() step with a range() step.

  2. Define a limit variable with a value of 3.

  3. Define an offset variable and increment it by the limit value for each call.

Note Because the Gremlin Console converts timestamps to local time zones and we can update our sample data after publication, your results might not match our results exactly.

Implementing these changes, our traversal now looks like this:

limit = 3                                          
==>3
offset = 0                                         
==>0
g.V().has('restaurant','restaurant_id',rid).
  in('about').
  order().by('created_date', decr).
  range(offset, offset + limit).                   
')
==>{rating=[2],                                    
 created_date=[Sun May 26 00:53:56 AKDT 2019]}   
==>{rating=[1],                                    
 created_date=[Thu Mar 28 21:56:30 AKDT 2019]}   
==>{rating=[5],                                    
 created_date=[Fri Nov 09 20:09:49 AKST 2018]}   

Sets the number of results to return

Sets our initial offset

Replaces the limit() step with the range() step

Newest three results returned

We see from this code example that the newest three reviews are returned. Let’s see what happens if we try to get the next page of results:

 offset + limit                            
==>3
g.V().has('restaurant','restaurant_id',rid).
  in('about').
  order().by('created_date', decr). 
  range(offset, offset + limit).

==>{rating=[5],                                    
   
==>{rating=[3],                                    
 created_date=[Tue Oct 24 07:38:21 AKDT 2017]}   
==>{rating=[2],                                    
 created_date=[Tue Mar 28 18:10:00 AKDT 2017]}   

Updates our offset to get the next page of results

Returns the next three results

We retrieve three results, but these are older than the results from the previous run of this traversal. To continue paging our results, we only need to update the offset each time by calculating a new endNumber using the offset + limit calculation. Let’s update our offset again and do this one more time:

it                                                 
==>6
g.V().has('restaurant','restaurant_id',rid).
  in('about').
  order().by('created_date', decr). 
  range(offset, offset + limit).

==>{rating=[2],                                    
 created_date=[Thu Jun 09 08:58:35 AKDT 2016]}   
==>{rating=[2],                                    
 created_date=[Sun Sep 27 10:21:17 AKDT 2015]}   

Updates our offset to 6

Returns only two results instead of three

Well, that is a bit strange. Why did we only get two results back instead of three? This is because we only had eight reviews for this restaurant. This leads us to one final question to address: How do we know when to stop paging?

One approach is to keep running the traversal and incrementing the offset until it returns an empty result set. This approach is useful when we expect a large number of results and don’t need to know the total number, or when we want to avoid the cost of counting all of the results up front. A second approach is to initialize a count of the possible results and then use that as an upper bound to the offset + limit value. This latter approach is particularly useful when the application needs to know the total number of possible results it is required to display.

8.4 Recommending the highest-rated restaurants

We’ve finished writing the first known-walk traversal for our recommendation engine. Let’s move on to answering our second question, “What are the ten highest-rated restaurants near me?” For this question, we follow the same methodology as we did in the last section:

  1. Identify the vertex labels and edge labels required to answer the question.

  2. Find the starting location for the traversal.

  3. Find the end location for the traversal.

  4. Write out the steps in plain English (or in your preferred language). The first step is your input step; the last step is the output or return step.

  5. Code each step with Gremlin, one at a time, verifying the results against the test data after each step.

8.4.1 Designing our traversal

Now that we know the process for constructing a traversal run through, let’s repeat the process for this traversal. However, unlike the detailed walk-through we did in the last section, we progress through the steps quickly, only stopping to dive into the details where they differ from the previous process. In section 8.1, we specified that we need the following elements to answer this question.

Question

Vertex Labels

Edge Labels

What are the ten highest-rated restaurants near me?

restaurantcityreviewperson

lives (connecting person → city)within (connecting restaurant → city)about (connecting review → restaurant)

Let’s highlight the relevant elements of the schema. Figure 8.4 shows the data model for this use case.

Because we’ve identified the relevant schema elements for our use case, we can move on to the first step in finding our starting step. Remember that locating the starting point is all about finding the portion of the question that narrows down our traversal to the minimum number of starting vertices. Looking at the question, “What are the ten highest-rated restaurants near me?”, we see that a user wants all the restaurants located “near me.” This means that the person vertex filtered by person_id needs to narrow down our starting vertex to a single instance, “me.”

Next, we move to the second step in our process--finding the end point. Examining the question tells us that what the user wants is a list of nearby restaurants, meaning our endpoint should be the restaurant vertex. While not explicit, it seems likely that a user wants to get back several properties of the restaurant, such as the name, address, and the average rating for that restaurant. With these two pieces of information, we can start our step specification as

  • Start with the current person, identified by their person_id.

  • Return the name, address, and average rating for the restaurant vertices.

Figure 8.4 Logical data model elements required for the question, “What are the ten highest-rated restaurants near me?”

Next, we move on to figuring out the series of vertices and edges needed to traverse from our starting and ending points. As we did in the last question, we are going to assume that “near me” means the city in which the user lives. To get the city for that person, we traverse from the person to the lives edge to find the city. Now that we are on the correct city vertex, we need the restaurant vertices in the same city. Examining our schema, we see that the within edge connects the two.

As with our last traversal, we need to do some ordering and limit the results; in this case, 10 is specified before we return the restaurant vertices. This leaves us with the following set of actions that our traversal needs to accomplish:

  • Start with the current person, identified by their person_id.

  • Traverse the lives edge to get their city.

  • Traverse the within edge to get the restaurant vertices of the city.

  • Calculate and perform a descending sort by average_rating.

  • Limit to the first 10 results.

  • Return the name, address, and average rating for the restaurant vertices.

With these actions identified, let’s go on to the next step. This step is where we iteratively develop our traversal.

8.4.2 Developing the traversal code

Just as we did in the last section, we’ll develop this traversal in an iterative fashion. In this section, we’ll write the code one step at a time, testing our traversal after each step to ensure that we get the results we expect. Along the way, we’ll revisit some of the concepts we learned in previous sections and show how these can be applied to more complex, real-world scenarios.

Before we begin, as with the last example, let’s make it easy to test multiple people by using a variable to represent the person_id we use for filtering. Let’s use Denise in Cincinnati this time and set pid = 8:

d = 8

As always, we start our traversal filtering on the starting point (in this case, Denise). We learned in chapter 3 how to use Gremlin to filter the person vertex based on a person_id value, so let’s apply that here and add a valueMap() step to return the properties so we can verify our results

g.V().has('person','person_id',pid).        
       
 
==>{id=45, label=person, last_name=[Mande], 
 first_name=[Denise], person_id=[8]}      

Gets the person where person_id = pid

Returns all the properties and metadata

The test data returns the Denise vertex as expected.

We’re off to a good start! Let’s now traverse out from the Denise vertex to find the city. From the schema, we see this is an outbound edge, so we use an out() step:

',pid).
out('lives').                              

 
     

Traverses the lives edge to get the city

The test data returns Cincinnati as expected.

Because Denise lives in Cincinnati, our next action is to traverse the within edge, which is an inbound edge. With this traversal, we can find the nearby restaurants in that fair urban conclave:

g.V().has('person','person_id',pid).
  out('lives').
in('within').                            
 
 
==>{id=60, label=restaurant, address=[600 Bergnaum Locks], 
 restaurant_id=[1], name=[Rare Bull]}
==>{id=192, label=restaurant, address=[102 Kuhlman Point], 
 restaurant_id=[18], name=[Without Heat]}
...

Traverses the within edge to get the restaurants in Cincinnati

We’ve truncated the results to only a subset, but it’s clear that we’re dealing with test data. How else can we explain the complete lack of chili restaurants in Cincinnati? (Although what they call “chili” in Cincinnati would never be considered authentic chili in other parts of the world.)

Now that we’ve successfully returned the restaurants in the city of our user, Denise, we’re ready to order this list of restaurants by their average ratings. Here is where it gets a little bit trickier. In the previous examples, all the values we needed to answer our question resided on the end point vertex. For this example, this is not the case. The rating value we need to use is located on the review vertex, so we need to traverse to the review vertices for a restaurant and compute the average rating.

Recall section 5.3, where we discussed ordering and grouping? There, we stated that not only can we pass a property into the order() step’s by() modulator, but we can also pass in a traversal. In this case, we want to pass in a traversal, which traverses to the review vertex, and average the rating values for a restaurant. While we know how to traverse to the review vertex, we need a new step to calculate the average:

  • mean()--Aggregation to compute the mean or average of a set of values; used most commonly in the group().by().by() step pattern

Let’s combine this step with what we know about traversing and give this a try:

g.V().has('person','person_id',pid).
  out('lives').
  in('within').
order().
_.in('about').values('rating').mean()).    
  valueMap().with(WithOptions.tokens)
 
The provided traverser does not map to a value:      
 v[232]->[VertexStep(IN,[about],vertex),          
 PropertiesStep([rating],value), MeanGlobalStep]  
':help' or ':h' for help.                   
 Display stack trace? [yN].                      

Orders results by the average rating

Standard Gremlin Server error statement

Uh-oh. Whoops! Something certainly didn’t go as expected there. Let’s see if we can parse this unexpected response and sort out the problem.

Troubleshooting errors while developing a traversal

We’ll start by saying that we rarely look at the stack trace for an error. We find that if we are following the development process of adding one action at a time to the traverser, then we already know where to begin debugging. Combining that with the details in the error and our knowledge of how graph traversals work is usually sufficient to troubleshoot. Now, about the actual troubleshooting. Because we iteratively add steps to our traversal, we know that the issue is likely with the last step:

order().
  by(__.in('about').values('rating').mean())

Let’s take a look at the error message to see what details it might give us about the problem:

The provided traverser does not map to a value.

Just as with any debugging problem, we need to combine the information in this error message with what we know about how graph traversals work to postulate a likely reason for the error. Recall that a traverser is a process that performs a specific task. The traversal, which is the whole string of steps, is broken down into multiple traversers along the way. In this case, we can see that one of these traversers ran into a problem: it didn’t return a result. The question is, what was the traverser trying to do when it failed to get a result? Because we know that the issue was with our ordering step, we already know the problematic step, but how do we check our suspicion?

How exactly we check for this information is different for each database. Luckily, for those using Gremlin, this is easy because the next part of the error statement tells us:

v[232]->[VertexStep(IN,[about],vertex), PropertiesStep([rating],value), MeanGlobalStep]

The information you are looking at is Gremlin bytecode. Generally, as application developers, we don’t worry about the bytecode. It is an implementation detail within the Gremlin Server and in the TinkerPop drivers. However, it is usually what is displayed when there is an error and points out a problem. While there is not a one-to-one mapping of bytecode steps to Gremlin steps, it is usually easy enough to map between the two (more on this in chapter 10). In this case, we see that one vertex (v[232]) generated the error. Which traversal step caused the error can be inferred from knowing which steps were added last; in this case

order().
    by(__.in('about').values('rating').mean())

More specifically, we see that the bytecode identifies the MeanGlobalStep, so we can infer that our issue is with the in('about').values('rating').mean() part of the traversal. This is an important feature of our step-by-step approach to traversal development (or any complex software, for that matter). By testing after each incremental change, when we run into an error, we know exactly what caused it!

Because vertex v[232] didn’t like that part of the traversal, let’s investigate why. To answer these types of questions about a specific vertex, we start by looking at the data for that vertex:

g.V(232).valueMap().with(WithOptions.tokens)                
 
==>{id=232, label=restaurant, address=[212 Lorraine Court],
 restaurant_id=[23], name=[With Sauce]}                   

Traversal gets the full details about the vertex with the ID 232

Response with full details about the vertex with the ID 232

Nothing looks out of order with the details about the vertex. But then again, that wasn’t where there was a problem. The vertex made it through the traversal just fine, up to the point where we sorted these by the average ratings. Because the ratings are located on the review vertex and not the restaurant vertex, let’s look at the about edges that take us to the review vertices for this restaurant:

inE('about')     
==>              

Traversal displays the about edges for the vertex with the ID 232

Response listing a within edge and a serves edge

Well, that’s it! There are no reviews for this restaurant. By combining our iterative development process with what we know about traversing a graph, as well as the details of our error message, we quickly pinpointed the issue with our traversal.

In locating the source of the problem in this example, we found that we had made an incorrect assumption about the data. But we could just as easily have made a typo in our script, or there could have been a problem with the logic. We can fix this problem by adding a review, but the problem wasn’t that the data was wrong. The real problem is that we had an incorrect assumption about the data. We were fortunate enough to discover that in our development process, not in production, and so we’re able to write the necessary defensive code to handle a valid state of the data.

In other situations with incorrect assumptions about the data, adding to the sample data might be the best approach. Depending on your specific circumstances, you might want to investigate whether there is some validation process in place with production data that isn’t reflected in a test data set. Or perhaps the test data isn’t crafted well enough to fit the actual shape and scope of the real-world data. Maybe the solution is to add a validation process and clean the data before this traversal runs on it. There are multiple possibilities governed by your environment and the nature of the software you develop. There are, however, some approaches to troubleshooting that can be generally helpful.

In our example, we knew with certainty that the traversal was working as expected before we added the order() step. Additionally, we were able to parse the error statement and get helpful clues. We could then investigate the problem vertex, its properties, and its edges. And we were able to lay out all of that information and reason back to the true cause of the error.

Finally, when troubleshooting these types of errors, there are other resources at your disposal. Sometimes just talking about the problem with a colleague can help you get to a solution. Another possible approach is to attempt to duplicate the conditions in a more controlled way, perhaps with a smaller set of data. Last but not least, there are also online resources like the Gremlin Users email group and the Stack Overflow website, or a vendor’s support team, or even consulting services for researching, asking questions, or getting paid support. The final point is that you are not alone.

Mid-traversal filtering

Returning to our traversal, we have now pinpointed our issue as being the lack of reviews for restaurant v[232]. The question is, what are we going to do about it?

We could say that a lack of reviews is a problem with the data, and as authors, we were sorely tempted to just add one more review to the sample data and avoid the little trouble-shooting digression in the last section. But we felt that the teachable moment was too good to pass up. A lot of times, we make assumptions about the data that are later exposed in our traversal development. In our recommendation use case, we assumed that all the restaurants had ratings. However, this was a bad assumption. Having a restaurant without a rating is actually a reasonable expectation and should be handled. So how do we handle this?

In this example, we filter out restaurants that don’t have ratings, which is to say, do not have any about edges. For this filtering, we introduce a new step, the where() step:

  • where(traversal)--Filters incoming objects based on a traversal, and only passes through objects when the traversal returns a result

The has() step, however, is the primary go-to filtering step and is best for filtering logic based on properties. The where() step is generally for all other filtering use cases, where we filter based on a more complex set of logic beyond simple property matching. In SQL terms, using a where() step is similar to writing a subquery in the WHERE clause such as this:

SELECT
  FirstName,
  LastName
FROM
  Person.Person
WHERE
  BusinessEntityID =
  (
    SELECT BusinessEntityID
    FROM HumanResources.Employee
    WHERE ID_Number = 123
  );

For our traversal, we want to traverse only from restaurants that have reviews, so we filter based on the existence of an about edge. To accomplish this, let’s insert a where() step that checks the presence of an about edge before our order() step:

g.V().has('person','person_id',pid).
  out('lives').
  in('within').
where(__.inE('about')).                                      
  order().
__.in('about').values('rating').mean()).
  valueMap().with(WithOptions.tokens)
 
==>{id=224, label=restaurant, address=[3134 Keenan Stravenue],
 restaurant_id=[22], name=[With Shell]}                      
==>{id=108, label=restaurant, address=[2419 Pouros Garden],
 restaurant_id=[7], name=[Eastern Winds]}                    
...

Adds a where() step to filter out vertices without an inbound about edge

Lists the restaurants

Well, that gives us a result, but not exactly the one we are looking for. We got back a list of ordered vertices, but the ratings aren’t visible, so we aren’t sure if these were ordered correctly. To get this list with the computed averages, we need to switch our approach. Before we order the data, we need to compute the mean rating for each restaurant and associate it with the restaurant vertex. Let’s group vertices into key-value pairs with the restaurant vertex as the key and the mean of the rating values as the value.

In section 5.3.2, we showed you how to use a group().by().by() series of steps to produce a collection of key-value pairs. The first by() modulator specifies the keys; the second by() modulator specifies the values. To create key-value pairs for this use case, we use the group() step, but how do we return the restaurant vertex as the key? To return this key, we need another Gremlin step:

  • identity()--Takes the element entering the step and returns that same element unaltered

Now that we know how to calculate both the key and value parts of our key-value pair, we can add these steps with the group().by().by()like this:

g.V().has('person','person_id',pid).
  out('lives').
  in('within').
  where(__.inE('about')).
                                                        
    by(__.identity()).                                  
))          
==>{v[192]=1.5, v[324]=4.0, v[262]=3.3333333333333335,
 v[200]=3.25, v[330]=2.25, v[270]=2.0, v[208]=4.0,
 v[336]=4.0, v[146]=3.5, v[84]=1.75, v[276]=2.0, 
 v[342]=5.0, v[216]=3.6666666666666665, v[154]=3.5,
 v[282]=3.5, v[92]=2.5, v[224]=1.0,
 v[162]=3.6666666666666665, v[100]=4.0, v[294]=4.5, 
 v[108]=1.3333333333333333, v[176]=2.5, v[306]=4.0, 
 [246]=3.0, v[60]=4.333333333333333                   

Groups vertices to create our key-value pair

Assigns the current element as the key

Traverses the about edge and returns the average rating as the value

Results now contain key-value pairs.

Whoo-hoo! The group().by().by() gave us the key-value pairs we wanted, with the key being the restaurant vertex and the value being the average rating for all reviews of that restaurant. Key-value pairs are convenient to work with, as long as we get these ordered by their values. Let’s add our order() step back in and look at our results:

g.V().has('person','person_id',pid).
  out('lives'). 
  in('within').
  where(__.inE('about')). 
  group().
    by(identity()).
    by(__.in('about').values('rating').mean()).
order().                                                
by(values, desc)                                      
 
==>{v[193]=3.0, v[289]=2.75, v[163]=4.0,
 v[69]=3.3333333333333335, v[133]=1.3333333333333333, 
 v[139]=3.0, v[331]=4.0, v[109]=2.25, v[301]=4.0,
 v[177]=4.0, v[209]=2.6666666666666665,
 v[147]=2.6666666666666665, v[307]=3.0, v[117]=3.0,
 [277]=3.0, v[247]=3.0, v[185]=3.0, v[313]=5.0,
 v[155]=3.0, v[61]=4.333333333333333, v[125]=2.5}       

Adds the order step

Orders results in descending order by the values of our key-value pair

Results are not ordered as expected.

Wait a minute. If we look at the results, these don’t appear to be ordered correctly. What is going on? Looking at the results before we added our order() step, we might see the answer:

==>{v[193]=3.0, v[289]=2.75, v[163]=4.0, 
 v[69]=3.3333333333333335, v[133]=1.333333333333333,
 v[139]=3.0, v[331]=4.0, v[109]=2.25, v[301]=4.0,
 v[177]=4.0, v[209]=2.6666666666666665,
 v[147]=2.6666666666666665, v[307]=3.0, v[117]=3.0,
 v[277]=3.0, v[247]=3.0, v[185]=3.0, v[313]=5.0,
 v[155]=3.0, v[61]=4.333333333333333, v[125]=2.5}

Examining this code snippet closely, we see that what we are getting back is not a collection of key-value pairs as one might expect, but a single object. Note the starting and ending braces: { and }. This is an unexpected result, but one that we know how to remedy. Back in section 5.3.2, we dealt with this same issue with our grouped results. There, we used a step to explode an object into its individual properties using the unfold() step. Let’s add this step before our order() step and see if we get a properly ordered set of results:

g.V().has('person','person_id',pid).
  out('lives'). 
  in('within').
  where(__.inE('about')). 
  group().
    by(identity()).
    by(__.in('about').values('rating').mean()).
unfold().                
  order().
    by(values, desc)
==> v[342]=5.0             
            
...                        
==> v[224]=1.0             

Unwinds the incoming object into individual key-value pairs

Results are ordered as expected.

While the results are truncated, it is pretty clear that we are getting back the results in descending order by average rating as expected. Whew, that was a lot of work, but it looks much better. We returned our aggregation, associated it with the vertex, and got our ordering. That only leaves limiting our results:

g.V().has('person','person_id',pid). 
  out('lives'). 
  in('within').
where(__.inE('about')).
  group().
    by(identity()).
    by(__.in('about').values('rating').mean()).
  unfold().
  order().

(10)                             
 
==> v[342]=5.0                   
==> v[294]=4.5                   
...                              
==> v[162]=3.6666666666666665    

Limits results to 10

Top 10 results ordered by descending average rating

Now that we have the right number of results in the correct order, and we have all of the data we need, we’re almost at the end of our process. All that is left is to format our results to return the name, address, and average rating for each restaurant. This task is more complicated than the previous examples because it requires that we combine both the project() and the select() steps to create our object.

Projecting key-value pairs

Our traversal returns key-value pairs containing the restaurant vertex and the average rating. But what we really want to return is a map of properties containing the name, address, and average rating. To generate this new object from the current key-value pairs, we need to revisit what we learned in section 5.2.1 about formatting results.

We know that we have two options for formatting results: selection and projection. Because we want to create our result object from the our current location (instead of selecting data from earlier in our traversal), we use a project() step. We start by creating three property names for the returned object and add a by() modulator for each one:

project('name', 'address', 'rating_average')      
.                                                 
  by().                                           
  by()                                            

The three property names of our return object

The by() modulator for the name key in the project() step

The by() modulator for the address key in the project() step

The by() modulator for the rating_average key in the project() step

Before we jump into projecting our traversal, we need to take a minute and talk about how to use the project() step with key-value pair data. Using the project() step with a key-value pair instead of a graph element is more complicated. We need to pull data from either the key or the value portion of the incoming key-value pair.

Remember, at this point in the traversal, we are working a collection of key-value pairs, the first of which looks like this: v[313]=5.0. The key part is the v[313], which represents a vertex with an ID value of 313. The value part is 5.0, which is the average rating we computed in the group() step.

When working with key-value pairs, we choose the key part or value part using a special overload of the Gremlin select() step. This overload takes a token, either keys or values, to specify whether we want to choose the key (select(keys)) or the value (select(values)) portion of the key-value pair.

Important The token values is different from the values() step. The token values refers to the value portion of a key-value pair, while the values() step specifies the properties to return from an element. We wanted to call this to your attention because we use both in the same traversal. We know this is confusing, but we didn’t name this, so please don’t shoot the messenger.

Returning from that brief aside, let’s apply our knowledge to this traversal. We know that we want the name and address properties from the key, containing the restaurant vertex, and the rating_average from the value of our key-value pair. Combining this with our knowledge of how to select portions of our key-value pair, we get the following traversal:

project('name', 'address', 'rating_average')
  by(select(keys).values('name')).            
  by(select(keys).values('address')).         
  by(select(values))                          

Selects the name from the restaurant vertex in the key

Selects the address from the restaurant vertex in the key

Selects the rating_average from the value

Applying this to the end of our earlier traversal, we get

g.V().has('person','person_id',pid). 
  out('lives'). 
  in('within').
where(__.inE('about')). 
  group().
    by(identity()).
    by(__.in('about').values('rating').mean()).
  unfold().
  order().
    by(values, desc).

  unfold().
    
    by(select(keys).values('name')).               
    by(select(keys).values('address')).            
                             
 
==>{name=Lonely Grape, address=09418 Torphy Cape, 
 rating_average=5.0}                             
==>{name=Perryman's, address=644 Reta Stream, 
 rating_average=4.5}                             
...                                                

Adds project() step to return results with all needed properties

Results with all properties are ordered as expected.

Results truncated for brevity.

That’s all, folks. We’ve successfully written a traversal to answer, “What are the ten highest-rated restaurants near me?” While this wasn’t as straightforward as we might have hoped, it denotes the complexity we can encounter when writing graph traversals. It also allowed us to demonstrate some common graph concepts, such as how to deal with key-value pairs and how to construct complex result objects, not to mention our little excursion into mid-traversal development troubleshooting. Now that the hard work of finishing this use case is complete, the only thing left to do is to add it into our application.

Adding the traversal to our application

As in the last section, in this section, we’ll follow the same process as we did in chapter 6. In our example app, we’ll add a new method called highestRatedRestaurants. In Java, the traversal code looks like this:

List<Map<String, Object>> restaurants = g.V().
"person", "person_id", personId).
    out("lives").
    in("within").
    where(inE("about")).                          
    group().
        by(identity()).
        by(in("about").values("rating").mean()).  
    unfold().
    order().
        by(values, Order.desc).
    limit(10).
    project("name", "address", "rating_average").
        by(select(keys).values("name")).          
        by(select(keys).values("address")).       
        by(select(values)).                       
    toList();

An anonymous traversal is not required due to imports.

Enum values (Column.keys, Column.values) included in imports

8.5 Writing the last recommendation engine traversal

We’ve worked our way back to the first question, “What restaurant near me with a specific cuisine is the highest rated?” Let’s make this section a little more concrete. Let’s choose a random person from our test data.

Say we’re Kelly Gorman. Because we are Kelly, and because Kelly is a social catalyst, we’re out with our friends. As everyone knows, everywhere Kelly goes, there’s always a group of people eating and hanging out. The group is getting a little unruly because they’re hungry and thirsty, but they can’t decide where to go, debating between a diner or just going to a bar. Naturally, Kelly pulls up DiningByFriends and gets everyone to agree that they’ll go to the top local diner or bar as returned by the app. Your job now is to use what you have learned to build the traversal and return that restaurant. Doing so saves Kelly from her hangry friends.2

We’ll give you a couple of hints to get started, provide a quick review of the process used in the two previous sections, and then offer some space to sort out the answer on your own. Then we’ll close the chapter by revealing our traversal and walking through our thinking. To start, as Kelly Gorman, we assume that we’re logged into DiningByFriends with

 = 5
==>5

We need to enter two cuisines for our search. To do so, we create a list variable in the Gremlin Console like this:

ar']
==>diner
==>bar

Finally, the answer we expect to display from DiningByFriends is this:

{name=Without Chaser, address=01511 Casper Fall, 
rating_average=3.5, cuisine=bar} 

We invite you to go through the process we used with the first two questions:

  1. Identify the vertex labels and edge labels required to answer the question. Perhaps make a small schema drawing to help you process what to use in the graph.

  2. Find the starting location for the traversal.

  3. Find the ending location for the traversal.

  4. Write out the steps in plain English (or in your preferred language). The first step is your input step, and the last step is the output step.

  5. Code each step with Gremlin, one at a time, verifying the results against the test data after each step.

We broke down the relevant question for you to answer at the start of the chapter, but we repeat that in the following table for your convenience. To help with the first step, figure 8.5 provides a picture of the schema.

Question

Vertex Labels

Edge Labels

What restaurant near me with a specific cuisine is the highest rated?

personcityrestaurantcuisinereview

lives (connecting person → city)within (connecting restaurant → city)serves (connecting restaurant → cuisine)about (connecting review → restaurant)

Now take a few minutes and record the steps the traversal requires to answer the question. We think you’ll need eight or nine steps, including the initial one with the starting and the ending return steps. Hint: many of the steps are identical to the traversal used in the previous section. Given all of that, feel free to plan out your approach with the following bullet list:

  • (Starting Point)

  • (Listing of steps to traverse)

    . . .

  • (Ending Point)

We encourage you to refer back to the two previous examples in this chapter, as well as the last chapter’s content as you work through this exercise on your own. When you have outlined your approach with bulleted points, go ahead and code it and test it with the data. We also suggest that you look at the within() predicate to help you filter by cuisine (see http://mng.bz/wpp7). Aside from that, the other parts of our solution have been discussed previously in this book, many in this chapter. When you’re ready, take a look at our solution in the next section.

Figure 8.5 Logical data model elements required for the question, “What restaurant near me with a specific cuisine is the highest rated?”

8.5.1 Designing our traversal

Our first thought is that this traversal is similar to the traversal from the last section. Much like that one, we look for restaurants near the current user, so the best starting point here is a person vertex filtered by the person_id. The ending point is also going to be a list of restaurant vertices with the restaurant‘s name, address, average rating, and type of cuisine served. Here is the list of actions we came up with:

  1. Get the person based on the person_id input with the pid variable.

  2. Traverse the lives edge to get their city.

  3. Traverse the within edge to get the restaurant vertices for the city.

  4. Filter the restaurant vertices based on the adjacent cuisine vertex to show only those that offer one of the cuisines specified in the cuisine_list variable.

  5. Group the vertices with computed rating_average, including a filter to ensure each restaurant vertex has a review.

  6. Order by average_rating, descending.

  7. Limit to one result.

  8. Return name, address, and average rating for the restaurant vertex.

With those parts laid out, we know the steps we need to accomplish this traversal. With the exception of the number of items returned and filtering on the cuisine, we can reuse much of the traversal we constructed in section 8.3. Let’s start by copying that traversal here and limiting our results to one:

g.V().has('person','person_id',pid). 
  out('lives'). 
  in('within').
  where(inE('about')).
  group().
    by(identity()).
    by(__.in('about').values('rating').mean()).
  unfold().
  order().
    by(values, desc).
  limit(1).                                      
  project('name', 'address', 'rating_average'). 
    by(select(keys).values('name')).
    by(select(keys).values('address')).
    by(select(values))
==>{name=Dave's Big Deluxe, address=490 Ivan Cape, rating_average=4.0}

Limits results to just one element

The next missing piece of this traversal is to filter for the input cuisine(s). For this, we want to add a simple filter using has():

  has('cuisine',within(cuisine_list))

However, we can’t. That’s not how the schema is designed. In our model, cuisine is a separate vertex, so we need to filter on a traversal instead of directly on a property. That traversal needs to reach out to the cuisine vertex in order to filter by the cuisine type.

In Gremlin, this can’t be accomplished using a has() step because it filters just on properties, not on the presence of an edge. Recalling our work with the previous traversal, the where() step does what we need. This step takes a traversal as a parameter and filters any traversers that do not return a result. We do this by creating this statement:

t)))

For each restaurant, this statement traverses out to its incident cuisine vertex to check if its name is in our cuisine list. We have a filter within a filter. The outer filter is the where() step, which only passes through results if the inner traversal completes. The inner filter is at the end of the traversal and uses a has() step to filter on the name property of the cuisine vertex.

Refactoring your model

This use of cuisine is a candidate for refactoring. Do you recall what we said about escape rooms back in chapter 3? We used this analogy to show that being on a vertex in a graph is like being in a room for that vertex.

(continued)

Within our immediate access are a series of drawers that represent the properties on the vertex and a series of doors that lead to other vertices. There's little to no cost to look in a drawer for the value of a property, because we already incurred the expense of loading the vertex with its properties into memory. There’s also no additional cost to look at the edges to note their values. But checking the cuisine requires traversing to the cuisine vertex, or walking down the hall to the room representing that vertex in our analogy. We’ll discuss refactoring in chapter 10 when we cover performance issues.

Plugging this into our traversal and making a minor update to our project() step to include the cuisine, we get

g.V().has('person','person_id',pid). 
  out('lives'). 
  in('within'). 
  where(out('serves').has('name',
 within(cuisine_list))).                             
  where(inE('about')).
  group().
    by(identity()).
    by(__.in('about').values('rating').mean()).
  unfold().
  order().
    by(values, desc).
  limit(1).
  project('name', 'address', 
, 'cuisine').                                          
    by(select(keys).values('name')).                   
'address'                                              
    by(select(values)).                                
    by(select(keys).out('serves').values('name'))      
==> {name=Without Chaser, address=01511 Casper Fall,   
 rating_average=3.5, cuisine=bar}                    

Adds a where() step to filter by cuisine

Updates the project() step to include the cuisine value in the result

Returns our result for the highest rated bar or diner near Kelly Gorman

This result is precisely the response we expected. If you got the same answer, then your approach was successful as well. If you didn’t, compare your steps with ours and find where in your traversal the results start to vary from our approach.

8.5.2 Adding this traversal to our application

We finish by adding a new method called highestRatedByCuisine to our application. In Java, the traversal code is as follows:

List<Map<String, Object>> restaurants = g.V().
    has("person", "person_id", personId).
    out("lives").
    in("within").
    where(out("serves").has("name", 
 P.within(cuisineList))).                 
    where(inE("about")).
    group().
        by(identity()).
        by(in("about").values("rating").mean()).
    unfold().
    order().
        by(values, Order.desc).
    limit(1).
    project("restaurant_name", "address", 
 "rating_average", "cuisine").
        by(select(keys).values("name")).
        by(select(keys).values("address")).
        by(select(values)).
        by(select(keys).out("serves").values("name")).
    toList();

Predicates use P.within or a static import statement.

Whew, we did it! We built the traversals for our three recommendation engine use cases! We started by determining the required graph elements (vertices and edges) for each question. Identifying this information helped us to organize our thinking and prioritize our work. We decided to start with the simplest of the traversals and work our way through to the more involved ones. This order worked to our benefit because we were able to reuse much of the code from the traversal for our use case (number two) in section 8.3 to write the traversal for our use case (number one) in section 8.4.

We also practiced using some of the more practical aspects of software development with graph databases throughout this chapter, such as always having a good picture of the schema available. We drafted our traversals, first in plain English and then with Gremlin steps. And as we went through the process of implementing the Gremlin steps, we encountered a number of the usual challenges of building software: incorrect assumptions about the data, unexpected bugs, steps missing from the original plan, and new uses for familiar steps.

Throughout this process, our iterative “one-step-and-test” approach served us well in completing the work. In the next chapter, we’ll use subgraphs to allow users to personalize the recommendations they receive.

Summary

  • Start developing traversals for a use case by identifying the vertices and edges required to answer the business question.

  • Developing known-walk traversals starts with identifying the relevant portions of the schema, finding the starting and ending points for the traversal, identifying the series of vertices and edges needed to traverse from the starting to the ending points, and finally, composing the traversal by iteratively adding steps while validating the results against test data as we build each step.

  • A good starting point for a traversal minimizes the number of starting vertices, ideally to a single vertex. To do this, apply all possible filtering at the start of the traversal.

  • Prioritization of which use case questions to start on can be done by either choosing the hardest question if we want to de-risk the use of graph technology, or the simplest question if we want an early win as a building block for future success.

  • Using a systematic, step-by-step approach to building traversals makes it easier to identify the source of a problem if an error is encountered. Troubleshooting any errors can involve multiple steps, including investigating the data, changing the approach to the traversal, and consulting with other staff or online resources.

  • Paging results in a graph traversal requires an ordered result set and the use of limits and offsets to specify the desired subset of results.

  • Grouping and ordering traversals create results that are key-value pairs. Further processing key-value pairs uses special overloads of the select() step to work with the key and value portions of the pair.


1.We want to give a shout-out to Jason, an early MEAP reader that requested we address the subject of pagination. Jason, thanks for being an early reader of our work and for reminding us to discuss this everyday use case.

2.The word hangry is an English portmanteau of “hungry” and “angry.”

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.178.157