In our application, we have users who rate books with a score from one to five. Now, we are going to query the database to get some aggregated information about book scores.
Suppose that we want to know the number of users who have voted for a book. For this, we need to count the number of the vote
relations between the users and that book, as shown in the following code snippet:
START b=node({id}) MATCH (b) <-[r:Vote]- (u:User) RETURN COUNT(*) as votes
The only difference with the query patterns we already know is that here, we have used the COUNT
function in the RETURN
clause. With Cypher, the RETURN
clause drives the aggregation of entities. In this case, as we have nothing else in the RETURN
clause but the COUNT
function, all the matching results are counted and the result is a single value. The result is as shown in the following output code:
+-------+ | votes | +-------+ | 7 | +-------+ 1 row
The arguments inside the COUNT
function can be the following:
In the previous query, specifying the COUNT(r.score)
score property will give the same result because in our database, we haven't got any null votes. However, we can have a book without any scores. Now, we are going to investigate two queries that differ only in the usage of the COUNT
function. The first query is as follows:
MATCH (b:Book {title: "The Art of Prolog"})
OPTIONAL MATCH (b) <-[r:Votes]- (:User)
RETURN b, COUNT(r.score) as votes
The preceding query counts only the rows that have a non-null vote score. The following result is obtained:
+---------------------------------------------------------------+ | b | votes | +---------------------------------------------------------------+ | Node[1037]{title:"The Art of Prolog",tags:["prolog"]} | 0 | +---------------------------------------------------------------+
We have no votes here because no user has voted for this book. Now, look at the following query where I have changed COUNT(r.score)
to COUNT(*)
:
MATCH (b:Book {title: "The Art of Prolog"})
OPTIONAL MATCH (b) <-[r:Votes]- (:User)
RETURN b, COUNT(*) as votes
This query gives one vote in the result, as shown in the following output code:
+---------------------------------------------------------------+ | b | votes | +---------------------------------------------------------------+ | Node[1037]{title:"The Art of Prolog",tags:["prolog"]} | 1 | +---------------------------------------------------------------+
In fact, we are using the COUNT(*)
function to count all the rows that match the expression in the first row of the query. We have just one matching row because the presence of the votes relationship is optional. So, we have this matching row counted as one vote, which of course is wrong.
If you are interested in the total score received by a book too, you can use the SUM
function, as shown in the following query:
START b=node(5) MATCH (b:Book) <-[r:Votes]- (:User) RETURN COUNT(r.score) as votes, SUM(r.score) as total
The argument of the SUM
function is the variable to summarize. The result of this query is as follows:
+---------------+ | votes | total | +---------------+ | 5 | 14 | +---------------+
Here, the null
values will be ignored.
The average score is the sum of all the scores given by the users on a book divided by the sum of the number of votes. The AVG
function computes this for us, as shown in the following query:
START b=node(5) MATCH (b:Book) <-[r:Votes]- (:User) RETURN AVG(r.score) as avgScore
The argument of this function is the variable whose average we want to compute. Here too, the null
values will be ignored.
The result of the previous query is as follows:
+----------+ | avgScore | +----------+ | 3.8 | +----------+
Note that although our score data consisted of integer values, the AVG
function returns a floating point value (double).
Cypher provides the functions MAX
and MIN
to compute the largest and the smallest value in the property specified as argument, respectively. Consider the following query:
START b=node(5) MATCH (b:Book) <-[r:Votes]- (:User) RETURN MAX(r.score), MIN(r.score)
This query returns the largest and the smallest score vote given to the book. The result is as shown in the following output code:
+-----------------------------+ | MAX(r.score) | MIN(r.score) | +-----------------------------+ | 5 | 3 | +-----------------------------+
Again, null
values are ignored.
The standard deviation measures the variation from average for a set of values. In our case, it will predict how many voters agree about the average score. Consider the following query:
START b=node(5) MATCH (b:Book) <-[r:Votes]- (:User) RETURN AVG(r.score) as avgScore, STDEV(r.score) as stdDevScore
The preceding query returns the standard deviation with the average. The result is as follows:
+-------------------------------+ | avgScore | stdDevScore | +-------------------------------+ | 3.8 | 0.8366600265340756 | +-------------------------------+
The result tells us that the average is 3.8 and that users agree with the votes.
Cypher also provides other statistical aggregation functions. You can find explanations on them in the Appendix.
If statistical functions provided by Cypher are not enough for your needs, you can collect all the values in an array so that you can easily process them with your preferred algorithm. For example, the following query returns all the score votes received for two books:
START b=node(5,6) MATCH (b:Book) <-[r:Votes]- (:User) RETURN b.title, COLLECT(r.score)
As you can see, we used the START
keyword to instruct Cypher to start from two nodes, those with the ID 5 or ID 6, and for each of them, we got the collected values of scores. The result is shown in the following output code:
+----------------------------------------+ | b.title | COLLECT(r.score) | +----------------------------------------+ | "Epic of Gilgamesh" | [5,4,3,4,1] | | "The Divine Comedy" | [4,3,5,3,4] | +----------------------------------------+
To better explain how the RETURN
function works with aggregation, let's try to remove the b.title
column. The query would then be as follows:
START b=node(5,6) MATCH (b:Book) <-[r:Votes]- (:User) RETURN COLLECT(r.score)
The result is strongly different from the result of the preceding query, as shown in the following output code:
+-----------------------+ | COLLECT(r.score) | +-----------------------+ | [3,3,5,2,1,4,3,1,1,3] | +-----------------------+
In fact, the purpose of the b.title
column in the previous query was to set a grouping key of the rows. By removing it, we instruct Cypher to collect everything together. This is quite different from SQL, where you have to explicitly specify grouping keys and return values. Cypher is more concise and I find it more intuitive than SQL.
Consider this scenario: we want to write a query that computes the average scores of books in the following two categories:
We already know how to compute the average of a value in a group based on a property value, but here we have an arbitrary grouping based on a condition. We can use the CASE WHEN
expression to express this condition. The query will be as follows:
MATCH (b:Book)<-[r:Votes]-(:User) OPTIONAL MATCH (b) -[p:PublishedBy]-> (c) RETURN CASE WHEN p.year >= 2010 THEN 'Recent' ELSE 'Old' END as category, AVG(r.score)
In this query, we introduced a CASE WHEN
expression in the RETURN
clause. The expression evaluates to the Recent
string if the year of publication is 2010 or later, and Old
if otherwise. In the RETURN
clause, we have the book category that results from the CASE WHEN
expression and AVG(r.score)
; therefore, Cypher will group the score average by the book category. In fact, the result is as follows:
+-------------------------------+ | category | AVG(r.score) | +-------------------------------+ | "Recent" | 4.4333333333333333 | | "Old" | 4.9767123287671234 | +-------------------------------+
There are two types of CASE WHEN
expressions. The type we have just learned is the generic one. It is used when we have a set of alternative conditions to verify. If we just have a set of possible values, then we can use the simple form. Here is an example:
MATCH (b:Book)<-[r:Votes]-(:User) OPTIONAL MATCH (b) -[p:PublishedBy]-> (c) RETURN CASE p.year % 2 WHEN 0 THEN 'Even' WHEN 1 THEN 'Odd' ELSE 'Unknown' END as parity, AVG(r.score)
This query computes the score average grouped by the parity publication year. Here, the CASE WHEN
statement is slightly different; we have an expression right after the CASE
keyword. This expression is evaluated and compared to the values in the WHEN
clauses. If one of them matches, the corresponding value expressed in the THEN
clause is picked; otherwise, the ELSE
value is returned. The result is as follows:
+-----------------------------+ | parity | AVG(r.score) | +-----------------------------+ | "Odd" | 4.0 | | "Even" | 3.4 | | <null> | 4.4413793103448276 | +-----------------------------+
You may wonder why do we have a null
value instead of Unknown
in the result. The reason is that if p.year
is a null
value, the expression in the case cannot be evaluated. To make null
values fall in our condition, we will again resort to the COALESCE
function. The query is as follows:
MATCH (b:Book)<-[r:Votes]-(:User) OPTIONAL MATCH (b) -[p:PublishedBy]-> (c) RETURN CASE COALESCE(p.year % 2, -1) WHEN 0 THEN 'Even' WHEN 1 THEN 'Odd' ELSE 'Unknown' END as parity, AVG(r.score)
Suppose you want the books sorted by average score, with highest score at the top; for this, we can write the following query:
MATCH (b:Book) <-[r:Votes]- (:User) RETURN b, AVG(r.score) as avgScore ORDER BY avgScore DESC
This query returns all books that have votes with their average score. However, that could be a huge number of books, almost the entire database. Suppose you are looking only for books with a score greater than or equal to 4. Where would you place the WHERE
clause? Of course, you can't put it before the RETURN
clause because the average score is computed there. We have to split the query in two: a part to compute the averages and a part for filtering and sorting.
The WITH
keyword exists for this reason. The query is as follows:
MATCH (b:Book) <-[r:Votes]- (:User) WITH b, AVG(r.score) as avgScore WHERE avgScore >=4 RETURN b, avgScore ORDER BY avgScore DESC
The result is as follows:
+---------------------------------------------+ | b | avgScore | +---------------------------------------------+ | Node[171]{title:"Anna Karenina"} | 4.4 | | Node[141]{title:"Tales"} | 4.0 | +---------------------------------------------+
This example shows us how to re-use an existing query to perform further processing. We can perform the following tasks:
RETURN
functionRETURN
keyword with the WITH
keywordWITH
functionSplitting queries is useful to filter data, which is otherwise difficult. This is done by combining the WHERE
function with the MATCH
or OPTIONAL MATCH
functions. For example, if we want the year of publication of the book with the best score, the following query can be used:
MATCH (b:Book) <-[r:Votes]- (:User) WITH b, AVG(r.score) as avgScore ORDER BY avgScore DESC LIMIT 1 OPTIONAL MATCH (b) –[p:PublishedBy]-> () RETURN b.title, p.year
This query is understandable. In the first four lines, we are looking for the book with the best score. In the remaining part of the query, we get the year of publication.
3.147.89.30