Aggregating results

In our application, we have users who rate books with a score from one to five. Now, we are going to query the database to get some aggregated information about book scores.

Counting matching rows or non-null values

Suppose that we want to know the number of users who have voted for a book. For this, we need to count the number of the vote relations between the users and that book, as shown in the following code snippet:

START b=node({id})
MATCH (b) <-[r:Vote]- (u:User)
RETURN COUNT(*) as votes

The only difference with the query patterns we already know is that here, we have used the COUNT function in the RETURN clause. With Cypher, the RETURN clause drives the aggregation of entities. In this case, as we have nothing else in the RETURN clause but the COUNT function, all the matching results are counted and the result is a single value. The result is as shown in the following output code:

+-------+
| votes |
+-------+
| 7     |
+-------+
1 row

The arguments inside the COUNT function can be the following:

  • The * keyword is a placeholder that instructs the counter to count all matching rows
  • A variable that makes the counter skip null values, missing properties, or missing matches

In the previous query, specifying the COUNT(r.score) score property will give the same result because in our database, we haven't got any null votes. However, we can have a book without any scores. Now, we are going to investigate two queries that differ only in the usage of the COUNT function. The first query is as follows:

MATCH (b:Book {title: "The Art of Prolog"})
OPTIONAL MATCH (b) <-[r:Votes]- (:User)
RETURN b, COUNT(r.score) as votes

The preceding query counts only the rows that have a non-null vote score. The following result is obtained:

+---------------------------------------------------------------+
| b                                                     | votes |
+---------------------------------------------------------------+
| Node[1037]{title:"The Art of Prolog",tags:["prolog"]} | 0     |
+---------------------------------------------------------------+

We have no votes here because no user has voted for this book. Now, look at the following query where I have changed COUNT(r.score) to COUNT(*):

MATCH (b:Book {title: "The Art of Prolog"})
OPTIONAL MATCH (b) <-[r:Votes]- (:User)
RETURN b, COUNT(*) as votes

This query gives one vote in the result, as shown in the following output code:

+---------------------------------------------------------------+
| b                                                     | votes |
+---------------------------------------------------------------+
| Node[1037]{title:"The Art of Prolog",tags:["prolog"]} | 1     |
+---------------------------------------------------------------+

In fact, we are using the COUNT(*) function to count all the rows that match the expression in the first row of the query. We have just one matching row because the presence of the votes relationship is optional. So, we have this matching row counted as one vote, which of course is wrong.

Tip

Be careful while using the COUNT(*) function versus the COUNT(variable) function, in conjunction with the OPTIONAL MATCH clause; you could get unexpected results!

Summation

If you are interested in the total score received by a book too, you can use the SUM function, as shown in the following query:

START b=node(5)
MATCH (b:Book) <-[r:Votes]- (:User)
RETURN COUNT(r.score) as votes, SUM(r.score) as total

The argument of the SUM function is the variable to summarize. The result of this query is as follows:

+---------------+
| votes | total |
+---------------+
| 5     | 14    |
+---------------+

Here, the null values will be ignored.

Average

The average score is the sum of all the scores given by the users on a book divided by the sum of the number of votes. The AVG function computes this for us, as shown in the following query:

START b=node(5)
MATCH (b:Book) <-[r:Votes]- (:User)
RETURN AVG(r.score) as avgScore

The argument of this function is the variable whose average we want to compute. Here too, the null values will be ignored.

Tip

The AVG function treats null values differently from zero. A zero is summed in the average and counted, while null values are just ignored. null values don't influence the average. For example, the average of 10 and a null value will be 10, while the average of 10 and 0 will be 5.

The result of the previous query is as follows:

+----------+
| avgScore |
+----------+
| 3.8      |
+----------+

Note that although our score data consisted of integer values, the AVG function returns a floating point value (double).

Maximum and minimum

Cypher provides the functions MAX and MIN to compute the largest and the smallest value in the property specified as argument, respectively. Consider the following query:

START b=node(5)
MATCH (b:Book) <-[r:Votes]- (:User)
RETURN MAX(r.score), MIN(r.score)

This query returns the largest and the smallest score vote given to the book. The result is as shown in the following output code:

+-----------------------------+
| MAX(r.score) | MIN(r.score) |
+-----------------------------+
| 5            | 3            |
+-----------------------------+

Again, null values are ignored.

Standard deviation

The standard deviation measures the variation from average for a set of values. In our case, it will predict how many voters agree about the average score. Consider the following query:

START b=node(5)
MATCH (b:Book) <-[r:Votes]- (:User)
RETURN AVG(r.score) as avgScore, STDEV(r.score) as stdDevScore

The preceding query returns the standard deviation with the average. The result is as follows:

+-------------------------------+
| avgScore | stdDevScore        |
+-------------------------------+
| 3.8      | 0.8366600265340756 |
+-------------------------------+

The result tells us that the average is 3.8 and that users agree with the votes.

Cypher also provides other statistical aggregation functions. You can find explanations on them in the Appendix.

Collecting values in an array

If statistical functions provided by Cypher are not enough for your needs, you can collect all the values in an array so that you can easily process them with your preferred algorithm. For example, the following query returns all the score votes received for two books:

START b=node(5,6)
MATCH (b:Book) <-[r:Votes]- (:User)
RETURN b.title, COLLECT(r.score)

As you can see, we used the START keyword to instruct Cypher to start from two nodes, those with the ID 5 or ID 6, and for each of them, we got the collected values of scores. The result is shown in the following output code:

+----------------------------------------+
| b.title             | COLLECT(r.score) |
+----------------------------------------+
| "Epic of Gilgamesh" | [5,4,3,4,1]      |
| "The Divine Comedy" | [4,3,5,3,4]      |
+----------------------------------------+

Grouping keys

To better explain how the RETURN function works with aggregation, let's try to remove the b.title column. The query would then be as follows:

START b=node(5,6)
MATCH (b:Book) <-[r:Votes]- (:User)
RETURN COLLECT(r.score)

The result is strongly different from the result of the preceding query, as shown in the following output code:

+-----------------------+
| COLLECT(r.score)      |
+-----------------------+
| [3,3,5,2,1,4,3,1,1,3] |
+-----------------------+

In fact, the purpose of the b.title column in the previous query was to set a grouping key of the rows. By removing it, we instruct Cypher to collect everything together. This is quite different from SQL, where you have to explicitly specify grouping keys and return values. Cypher is more concise and I find it more intuitive than SQL.

Conditional expressions

Consider this scenario: we want to write a query that computes the average scores of books in the following two categories:

  • Recent books: This category contains all books published in the year 2010 or later
  • Other books: This category contains all other books published before 2010 or without a date of publication

We already know how to compute the average of a value in a group based on a property value, but here we have an arbitrary grouping based on a condition. We can use the CASE WHEN expression to express this condition. The query will be as follows:

MATCH (b:Book)<-[r:Votes]-(:User)
OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)
RETURN
  
CASE WHEN p.year >= 2010 THEN 'Recent'
    ELSE 'Old' END as category,
  AVG(r.score)

In this query, we introduced a CASE WHEN expression in the RETURN clause. The expression evaluates to the Recent string if the year of publication is 2010 or later, and Old if otherwise. In the RETURN clause, we have the book category that results from the CASE WHEN expression and AVG(r.score); therefore, Cypher will group the score average by the book category. In fact, the result is as follows:

+-------------------------------+
| category | AVG(r.score)       |
+-------------------------------+
| "Recent" | 4.4333333333333333 |
| "Old"    | 4.9767123287671234 |
+-------------------------------+

There are two types of CASE WHEN expressions. The type we have just learned is the generic one. It is used when we have a set of alternative conditions to verify. If we just have a set of possible values, then we can use the simple form. Here is an example:

MATCH (b:Book)<-[r:Votes]-(:User)
OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)
RETURN
  CASE p.year % 2
    WHEN 0 THEN 'Even'
    WHEN 1 THEN 'Odd'
    ELSE 'Unknown' END as parity,
  AVG(r.score)

This query computes the score average grouped by the parity publication year. Here, the CASE WHEN statement is slightly different; we have an expression right after the CASE keyword. This expression is evaluated and compared to the values in the WHEN clauses. If one of them matches, the corresponding value expressed in the THEN clause is picked; otherwise, the ELSE value is returned. The result is as follows:

+-----------------------------+
| parity | AVG(r.score)       |
+-----------------------------+
| "Odd"  | 4.0                |
| "Even" | 3.4                |
| <null> | 4.4413793103448276 |
+-----------------------------+

You may wonder why do we have a null value instead of Unknown in the result. The reason is that if p.year is a null value, the expression in the case cannot be evaluated. To make null values fall in our condition, we will again resort to the COALESCE function. The query is as follows:

MATCH (b:Book)<-[r:Votes]-(:User)
OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)
RETURN
  CASE COALESCE(p.year % 2, -1)
    WHEN 0 THEN 'Even'
    WHEN 1 THEN 'Odd'
    ELSE 'Unknown' END as parity,
  AVG(r.score)

Separating query parts using WITH

Suppose you want the books sorted by average score, with highest score at the top; for this, we can write the following query:

MATCH (b:Book) <-[r:Votes]- (:User)
RETURN b, AVG(r.score) as avgScore
ORDER BY avgScore DESC

This query returns all books that have votes with their average score. However, that could be a huge number of books, almost the entire database. Suppose you are looking only for books with a score greater than or equal to 4. Where would you place the WHERE clause? Of course, you can't put it before the RETURN clause because the average score is computed there. We have to split the query in two: a part to compute the averages and a part for filtering and sorting.

The WITH keyword exists for this reason. The query is as follows:

MATCH (b:Book) <-[r:Votes]- (:User)
WITH b, AVG(r.score) as avgScore
WHERE avgScore >=4
RETURN b, avgScore
ORDER BY avgScore DESC

The result is as follows:

+---------------------------------------------+
| b                                | avgScore |
+---------------------------------------------+
| Node[171]{title:"Anna Karenina"} | 4.4      |
| Node[141]{title:"Tales"}         | 4.0      |
+---------------------------------------------+

This example shows us how to re-use an existing query to perform further processing. We can perform the following tasks:

  • Give an alias to computed values in the RETURN function
  • Replace the RETURN keyword with the WITH keyword
  • Append the new part to the previous using the variable declared in the WITH function

Splitting queries is useful to filter data, which is otherwise difficult. This is done by combining the WHERE function with the MATCH or OPTIONAL MATCH functions. For example, if we want the year of publication of the book with the best score, the following query can be used:

MATCH (b:Book) <-[r:Votes]- (:User)
WITH b, AVG(r.score) as avgScore
ORDER BY avgScore DESC
LIMIT 1
OPTIONAL MATCH (b) –[p:PublishedBy]-> ()
RETURN b.title, p.year

This query is understandable. In the first four lines, we are looking for the book with the best score. In the remaining part of the query, we get the year of publication.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.89.30