Chapter 7 : Querying Data in Neo4j with Cypher

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 7

Querying Data in Neo4j with Cypher

When you’ve spent the time building up a data structure, planning the best relationships for the different node types, and of course collecting the data, you want results. Cypher allows us to take all of the hard work in building up a good dataset, and use that to give a better experience for users, or supply accurate recommendations. There are a lot of ways to analyze your data to show trends and gain insight into those that supplied it. Whether that is from a user making orders to generate more accurate recommendations on an e-commerce website or to a blog user seeing their most popular posts by seeing which of them has the most comments. There are many uses for data, and we’re going to cover how to use Cypher to get the data you need for a multitude of situations.

To best showcase the wide range of Cypher’s capabilities, there will be two main examples used. The first, will be a Pokémon-based example, composed of user-submitted data. The second, will be a location-based example, composed of generated data based on various locations. These two approaches demonstrate how Cypher can do both recommendation-based queries and distance-based ones as well. Both of these examples demonstrate the power of Cypher, and these can be adapted to various situations, including e-commerce, which will also be covered. The e-commerce application is an obvious choice for demonstrating recommendations, so aspects of the logic applied in the Pokémon example will also be applied to e-commerce.

With that out of the way, it’s best to just dive right in to the Pokémon example, to demonstrate how Cypher be used to generate recommendations, and of course, the e-commerce application.

Recommendations, Thanks to Pokémon Data

I wanted to find an interesting way to generate data for use in the book and decided to get the community involved in supplying data. To as many people involved as possible, I decided to try and widen my audience, and make it something that would apply to anybody; anybody that liked Pokémon, that is. I’m a big fan of Pokémon, and figured I’d try to involve that, plus I know I’m not the only person who likes it. The idea of the application came from the original Pokémon slogan, Gotta catch em’ all! Why not make that the objective? With the idea in place, I had to then build a website to make it happen.

Getting the Data, the Website Used

After seeing the slogan, it seemed a good idea to make a game out of catching them all, so that’s just what happened. The idea was, you’d get one shot to randomly catch them all, but you could pick four Pokémon to reserve, so you could make sure you had your favorite. There were a couple of rules when it came to choosing these reserved Pokémon, but I’ll get to those in a moment. While they were playing the game, it made sense to try to get some information about the user, which could be used when analyzing the data. The data recorded was picked for a reason, and all kept optional (most of the time, but we’ll get to that) so it was all fair. This essentially meant that each user or Trainer (as they were labeled in Neo4j) that submitted data would get a random sample of Pokémon, ranging between 1 and 200. In addition to that sample, they would also have four Pokémon they’d reserved, and if they’d submitted it, some other personal information to go with this sample. As far as the user was concerned though, they’d just pick four Pokémon, fill in a form, then be presented with a page of Pokémon they’d caught.

The Pokémon that could actually be caught were the first 151, so there’s a chance to catch Mew, as Mew is the 151st Pokémon, and was added after the initial release, for those that are interested, that is. That means that there is an actual chance to get all 151, as you get can get over 151 chances, and you already pick 4 to begin with.

Data Being Gathered

Of course I wanted to make sure the website was fun, but I still wanted to gather data that could be analyzed and was of course above all else, useful. Although supplying the data is entirely optional, each user had the option to submit the following information, supplied in Table 7-1.

Table 7-1. The personal data recorded for those who submitted on pokemon.chrisdkemper.co.uk for use in the book and the reasoning for each

Data item	Reasoning
Nickname	This was the best means of identifying a user if they wanted to be mentioned in the book.
Email	The e-mail address here was only used to contact those who were mentioned, just to let them know their nickname would be used (or not, as it seems).
Gender	There were only four Pokémon selected by the user, but gender is still a good value to have when it comes to analyzing data.
Age	Having an age for the user allows some nice range-based queries depending on the range of the data available, so again, it is useful to have.

Each item of data recorded had its own benefit, and of course is only used for this chapter. The e-mail data stored has since been removed, as the only reason it was needed was to contact those who’ve been featured in the book, and in a couple of cases, when they weren’t, spam is always a problem, even if you use a reCAPTCHA.

Keeping the Spam Under Control

To try to keep spam users down to a minimum, I implemented a reCAPTCHA on the website, which is Google’s offering in the fight against spam. If you’ve never come across one, it’s just a checkbox you need to check if you’re human, and it may also ask you to identify some similar images. Although technically spam data would just be anonymous data, I wanted to avoid 100% of the data being spam. If I’d wanted automated data, I’d have just scripted that.

The other advantage of the reCAPTCHA application means that although there will be anonymous data, it’s still user-submitted anonymous data. This means, that the Pokémon characters that have been reserved, have all been picked by an actual person, provided the reCAPTCHA has been keeping the spam bots at bay.

Keeping the spam bots at bay was a priority, but I also didn’t want to use completely anonymous data, because as I said earlier, I could have generated that. To get around this, I put in a validation check to ensure there wasn’t more than 50% anonymous submissions already in the database.

If a user posted without a nickname, and 50% of the existing entries were anonymous, then they had to add a nickname to submit their data. They could still have anything as a username, nothing was unique, just as long as it was populated. The idea was that having another look at the form may allow for some additional data to be added, rather than the now required, nickname.

I used Cypher to pull out two figures to help with the calculation, the total submissions, and the number of anonymous submissions. The query used to achieve this is as follows.

MATCH (t:Trainer)
MATCH (a:Trainer {nickname: ’anonymous’})
RETURN count(DISTINCT t) AS total, count(DISTINCT a) AS anon;

The main thing to note in the above query is the use of DISTINCT, which ensures each node is only counted once, otherwise you run into a duplication problem. This duplication problem comes from count using both match statements to do the count, even though it shouldn’t. Thanks to the multiple matches, this results in the same result for both values. To show this, first will be Figure 7-1, which will be the results without DISTINCT, and then Figure 7-2 soon after, show the results with DISTINCT.

Figure 7-1. Showing the results of the Cypher query being used without DISTINCT

You can see in Figure 7-2 that the results are the same for both values, which is of course useless to us. To make this data actually useful, DISTINCT is added as shown in Figure 7-2.

Figure 7-2. Showing the results of the Cypher query with the use of DISTINCT

You can see here there are now unique values being returned for each, which is much more useful. The interesting thing here are the values being returned. The first value was 6076 for both `total` and `anon`, and the actual values were 98 and 82, for `total` and `anon` respectively. The two unique values multiplied together: 98 * 62 equals the first value, 6076. This gives some reasoning behind the initial value, but also shows the importance of DISTINCT, in the times it’s needed.

With the Cypher side out of the way, after the values are returned, they’re used to work out a percentage, and if it’s over 50%, the user cannot submit anonymously. If this user submitted data anonymously, with an unrounded percentage of 63.2, they’d have to add a nickname to get their shot at catching them all.

How the Data is Structured

To best understand how the queries work, you must first know the data you are querying against. In this case, all 151 Pokémon were imported into the database, with various different properties. An example of Bulbasaur, the first Pokémon (in terms of Id number, 1, that is) can be seen in Table 7-2.

Table 7-2. Bulbasaur’s properties within the database

Property	Value
Name	Bulbasaur
Index	1
Height	7
Weight	69
Type	[poison, grass]

The data for each Pokémon was obtained from the Pokeapi, available at http://pokeapi.co. I wrote a script to import the Pokémon into Neo4j, based on the Id numbers, and then picked these properties to use. The API has a lot available to use, so it’s worth looking at if you’re doing a Pokémon-related project. The type(s) I wanted to store as an array, so they could be used as a filter later. The Pokeapi was a bit too verbose with the information it had regarding the types, at least for my needs, so I had to trim it down to just the names of the types.

With the Pokémon nodes, it was then just a case of relating a Trainer to the Pokémon as needed. There are two relationship types used, RESERVED and CAUGHT. The former, are the Pokémon that are picked by the user, and the latter, are the Pokémon that have been randomly assigned. This helps do some interesting queries later, including the most reserved Pokémon.

When a user submits their data, a Trainer node is created, with any information they supply attached as properties to the node. The Trainer is then related to their RESERVED Pokémon, with a RESERVED relationship, and the website assigns the CAUGHT relationships as needed. This database is one that is relationship heavy, as one user can have up to a maximum of 204 relationships, provided they get the highest achievable random value. An example of how a Trainer’s relationships will be displayed can be seen in Figure 7-3, where a random node has been selected, and its structure can be seen.

Figure 7-3. The structure of Trainer node in the database, in this case node 919

As you can see, Figure 7-3 shows that one Trainer has many relationships to Pokémon, in this case 54, to 48 different Pokémon. The query used to achieve this was as follows:

MATCH (t:Trainer)-[r]-(p:Pokemon)
WHERE id(t) = 919
RETURN t, r, p;

The query is just a normal relationship-based query, with a WHERE clause to filter it down to the trainer with the node id of 919. This example node shows how the relationships will work within the application, so when we get to the Cypher queries, the use of Pokémon and Trainer Labels will make a little more sense.

Rules for Choosing Pokémon

To make things a little bit more interesting there was a catch when it came to selecting which Pokémon you were allowed to pick. The first rule was that you were only allowed one starter Pokémon, or any evolution of that Pokémon. If you’ve ever played the games, you’ll know that you only get a choice of one of three starter Pokémon, so this rule was to bring some of those feelings back. To give some extra choice, I allowed the inclusion of evolutions of the Pokémon, so you could pick Charmander, or Charizard for your starter, if you even wanted to pick one.

This rule means that it’s possible to see the most popular starter Pokémon, including their evolutions, which is interesting for analysis. There was a similar rule applied for Fossil-based Pokémon, and also Legendary Pokémon for their uniqueness. The rules don’t have any real impact on the process, it just allows for additional analytics points. When the actual Pokémon are picked by the system, these rules don’t apply, so it’s a 1/151 chance to get any particular Pokémon.

Querying the Data

With all the data collected (at the last point it could be before printing) it’s now possible to use that data and work out a lot of different things. Although this dataset is mostly random, a lot can be learned from it, as the concepts used here can be applied to other things where the data won’t be so random, such as e-commerce. A number of different values will be worked out, with the Cypher used to retrieve them, and of course, the results of the queries on the data. In addition to taking data for analytics, it’s also possible to make recommendations for individual `Trainers` (as they’re labeled in Neo4j in the database) for different Pokémon they could catch, based on which type they have the least of, for example.

To make things easier, the queries will be divided into two groups. The first group will be queries based on the dataset as a whole, so the more analytical aspect. The second group will be recommendation based, and will be applied to individual trainers. Let’s get right into it, starting with the analytics side of things.

Analyzing the Data as a Whole

Whether your dataset is big or small, being able to analyze it helps provide insight into what you’re actually collecting, and you can then use the results to help streamline your application. In this case we have a lot of random sets of Pokémon, based on anonymous, and non-anonymous data. Once again, the queries can be broken down into two groups, anonymous (so not using, nickname, gender, or age) and non-anonymous (using the values collected to filter data) respectively. Within each group there will be a number of queries based upon the group they’re in, and the results of those queries. Let’s just dive into the anonymous data sample, and do some queries on all of the data.

Anonymous Data Queries

These queries are ones that applied to the dataset as a whole, without taking any of the properties of the Trainer nodes (data submitted, along with their reserved Pokémon) into account. The first query will be totaling which Pokémon was the most popular overall, so let’s get to it.

Most Popular Pokémon Type

Each Pokémon has at least one type within the database; in some cases, two. These values have been stored in Neo4j as an array of string values, which means they can be used within queries, provided the values are treated the same as they are stored, so in this case, strings. The query now needs to take the types from all of the caught Pokémon, and ensure the list is ordered by the most popular type. That’s enough talk, let’s get straight into it. The Cypher for the query can be seen below.

MATCH (:Trainer)--(p:Pokemon)
WITH p.type AS types
UNWIND types AS type
RETURN type, count(type) AS total
ORDER BY total DESC
LIMIT 10

As we’re only interested in Pokémon data, the Trainer alias can be left off as it’s not used. The interesting part of the query comes with the use of UNWIND, which allows each individual type to be counted, rather than treating multiple values as one row. If the above has been run without the UNWIND, then any time a Pokémon had more than one type, this combination would be counted as one type, rather than two different types.

We already know Bulbasaur’s types from previously in the chapter, which are poison and grass. Without the use of UNWIND, the combination of poison and grass would be counted as one value. Essentially, the use of UNWIND allows each item in the array to be evaluated as an individual item.

Thanks to the combination of WITH and UNWIND, it means that when the data has to be counted, it’s already just a huge list of types, so they just need to be counted, ordered, and of course the type itself needs to be returned, so the numbers make sense. This query gives the result shown in Figure 7-8.

Figure 7-8. The results of the query to determine the most popular types within the database

This query could be modified slightly to give more specific result sets depending on the use case, so if you wanted to know the most popular type within RESERVED Pokémon, just add a relationship constraint to the query. It could be also possible to see what is the most popular type among the Pokémon, regardless of the Trainer input, which would look like so:

MATCH (p:Pokemon) WITH p.type AS types UNWIND types AS type
RETURN type, count(type) AS total
ORDER BY total DESC
LIMIT 10

This allows you to see which are the most and least popular Pokémon types, so if you wanted to catch specific types to strengthen your team, that could be done using this query. The only real change in this query is the removal of the Trainer relationship, as we can just query the Pokémon nodes directly to get the types in this case. Speaking of the results of the Trainer version of the query, let’s go through those.

Non-Anonymous Data Queries

All of the previous queries have been based on completely anonymous results, but the database isn’t all anonymous, so it’s time to look at what data we have available. As mentioned earlier, the fields recorded were `nickname`, `gender`, `email`, and `age`. Really, out of these fields, there are only two that can be of any real use, which are `gender` and `age`. With these fields, it allows the data to be categorized by age ranges, and of course, gender.

Of course, had I collected more data, such as location, there would have been more granular results, but with age and gender values, this still gives some values. We’ll be performing a number of queries using these values to help filter down the results, and give some insight into those that submitted their data to be used.

Popular Pokémon Filtered by Gender

We’ve already done a query to work out the most popular Pokémon, so with a couple of alterations, it can be modified to return gender-specific values. The resulting Cypher query can be seen below.

MATCH (t:Trainer)--(p:Pokemon)
WHERE t.gender = ’male’
RETURN p.name AS name, count(p.name) AS count
ORDER BY count DESC
LIMIT 10

In this query, we need to use the values of the Trainer node, so `t` has to be in to supply data for the filter. In terms of the stored data, the gender property will be either set to `male`, or `female`, or be null, as it only gets set if there’s a value to set. In the WHERE, it’ll remove null values by default, unless specified otherwise. Since we don’t want the null values, then the basic WHERE does the job. To get the female results it’s just a case of swapping out `male` for `female` in the query. You can see the results of this query in Figure 7-9.

Figure 7-9. The results of most popular Pokémon query, filtered by gender, in this case `male`

Switching the results to filter by `female` rather than `male` gives the results, seen in Figure 7-10.

Figure 7-10. The results of most popular Pokémon query, filtered by gender, in this case `female`

Popular Pokémon Filtered by Age

Having access to age is a brilliant way to filter data, as it allows you to filter by age ranges, as well as an individual age. If you want to filter by an age range, you can either have these preset, or decide them based on what values are available within your data. In this case, we’ll be checking the ages submitted, and using those values to build up acceptable ranges on which to base the main query. First, we need all the ages that have been submitted and the counts for these, which in Cypher terms looks like this:

MATCH (t:Trainer)
RETURN t.age AS age, count(t.age) AS count
ORDER BY count DESC

In addition to giving the different range of ages within the data, it also gives the most popular, which although it’s not particularly useful in this case, being able to identify your key age or age group is always a good thing. To get this value, the query just needs to be ordered by the `count`, rather than the `age`. The results of the query can be seen in Figure 7-11.

Figure 7-11. A query to show all of the submitted ages within the database

With a slight modification it could also be possible to use an age range with the query, which would look like so:

MATCH (t:Trainer)--(p:Pokemon)
WHERE t.age > 18 AND t.age < 21
RETURN p.name AS name, count(p.name) AS count, t.nickname
ORDER BY count DESC

Recommendation-Based Queries

After covering queries that were based on the data as a whole, it’s time to get to a more personal level and recommend different things to the Trainers. Each Trainer node has at least 4 relationships to different Pokémon within the system, which means there are still 147 different choices available. Unless you’re like Ash (the main character in Pokémon) and want to ‘Catch them all’, you may want to apply some strategy to the Pokémon you catch. This is where the recommendations come in.

Recommend Pokémon, Based on Type

As a trainer, it would be nice to know what types you have the most of, because this shows you which areas you may need to build on. Even if that’s of no use to you, knowing which areas you are weakest in could be, so let’s get to working that out, shall we? A node id used previously was 919, so we’ll use it again here. Trainer 919 needs the types of all of the Pokémon they have caught, and the counts for them.

MATCH (t:Trainer)-[]-(p:Pokemon)
WHERE id(t) = 919
UNWIND p.type AS type
RETURN type, count(type) AS total
ORDER BY total ASC

With the query in place, Trainer 919 gets the results seen in Figure 7-12. In this query, we’re selecting Trainer 919 using WHERE id(t) = 919 after getting all of the Pokémon related to said Trainer. Since the type is an array, it needs to be iterated, which is where UNWIND comes in. After that `type`, and the count of `type` are returned. Thanks to count aggregating the `type` values to count them, it also removes the duplicates from the rows, which is useful in this case.

Figure 7-12. The types and their counts for the Pokémon Trainer 919 has

The only bad thing about this list is there are no zero values, so what if Trainer 919 doesn’t have a type at all? Well, there’s an easy way to check, we just need to get all of the different types that the Pokémon have, look at the total. Using Cypher, that looks like this:

MATCH (p:Pokemon)
UNWIND p.type AS type
RETURN DISTINCT type

This query gives the results seen in Figure 7-13. This query is a slimmed down version of the one used above. In this case, because the count is being used in the return, duplicate values will be returned, so to get around that, DISTINCT is used.

Figure 7-13. The results of a query to get a unique list of all the Pokémon types

With that query, we find that there are 17 types returned, and Trainer 919 only has 16, so we know there’s one missing. There should technically be 18 types, there is one missing, which is the `dark` type. It is not included in the list because it wasn’t introduced until the second generation, and therefore none of the first 151 Pokémon are `dark`.

From Figure 7-12, we know that `steel`, `ghost`, and `fairy` are the top 3 lowest counts, at 1 each. We can use this information to recommend some Pokémon with those types. To do this, it requires the combination of the previous query, and some additional Cypher code, which looks like so:

MATCH (t:Trainer)-[]-(p:Pokemon)
WHERE id(t) = 919
UNWIND p.type AS type
WITH type, count(type) AS total, t
ORDER BY total ASC
LIMIT 3

MATCH (p:Pokemon)
WHERE type IN p.type
AND NOT (t)--(p)
RETURN p.name, p.type
LIMIT 5

The first part of the query is the exact one from before, just with a WITH clause in place of RETURN. The values passed over by WITH have already been evaluated, so type is being iterated over, and counted. With the use of WITH, each of the types from the previous query can be used in WHERE. Since the types are stored in an array, they must be filtered using IN, as you cannot compare two arrays. This essentially means to compare two arrays, you must be first iterating over the first, then use IN to check if the value is in the second.

In this case, the query is checking to see if one of the top 3 types (`steel`, `ghost`, and `fairy`) are within a Pokémon’s type array. There’s nothing worse than being recommended something that you already have, and the same goes here. To get around this, AND is used to add that the Trainer cannot be related to the Pokémon returned. Gotta catch em’ all, right? To stop too many choices being returned, the results are limited, after the `name` and `type` are returned. The results of the query can be seen in Figure 7-14.

Figure 7-14. The top 5 results for Pokémon with the least popular types that Trainer 191 had

This data can be incredibly useful for Trainer 919 to help them catch some Pokémon that round out their types a little better. If this information was fed back into the Pokémon website, it should easily recommend these Pokémon as good suggestions to catch. The ordering of the query could also be reversed to show a list of Pokémon that should be caught to add to the more popular types owned, which would help if you were trying to get all Pokémon of a certain type, for example.

What’s Left to Catch

If you’re attempting any kind of collection, it’s always nice to have an idea of how much is remaining, and a Pokémon Trailer is no different. It’s always handy to have a list of the renamed Pokémon you had to catch, and luckily Cypher can help with that using the following simple query.

MATCH (p:Pokemon), (t:Trainer)
WHERE id(t) = 919
AND NOT (t)--(p)
RETURN p.name

Once again we use Trainer 919 as an example by using the combination of WHERE and the id function. We’re also matching Pokémon (`p`), but without any matches this time. Essentially what’s being done here is any Pokémon that aren’t related to Trainer 919 are then returned, and we have our list of remaining Pokémon. The results for this query can be seen in Figure 7-15.

Figure 7-15. The remaining Pokémon Trainer 919 has to catch

Relating to e-Commerce

Some of the queries used for the Pokémon examples can also be applied to an e-commerce application for recommending products, instead of Pokémon. We’ll go through a couple of these applications to show how the same query can be used to achieve a similar result.

Table of Contents for Chapter 7 : Querying Data in Neo4j with Cypher

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 7 : Querying Data in Neo4j with Cypher