Community Detection and Similarity Measures

Real-world graphs are neither regular grids nor fully random: their edge densities are not homogeneous, and this heterogeneity gives rise to interesting patterns. The fact that some nodes have more connections than others is exploited by centrality algorithms to assess node importance (see Chapter 6, Node Importance). In this chapter, we will discover a new type of algorithm whose goal is to identify groups of nodes that are highly connected to each other and form a community or cluster. Several of these community detection algorithms are already implemented in the GDS: the connected components algorithms, the Label Propagation algorithm, and the Louvain algorithm. This chapter is also our opportunity to build a graph visualization of communities with JavaScript and to discover NEuler, the Graph Algorithms Playground application developed by Neo4j. Finally, we will learn about the different metrics implemented in the GDS to measure the similarity between two nodes locally.

The following topics will be covered in this chapter:

  • Introducing community detection and its applications
  • Detecting graph components and visualizing communities
  • Running the Label Propagation algorithm
  • Understanding the Louvain algorithm
  • Going beyond Louvain for overlapping community detection
  • Measuring the similarity between nodes

Technical requirements

In this chapter, we will use the following tools:

  • The Neo4j graph database with the following extensions installed:
  • Plugins: The Graph Data Science Library
  • The Neo4j Desktop Application
  • NEuler for graph visualization
  • You will also need a browser to open the HTML and JavaScript files that we will build for graph visualization.
  • Python (recommended ≥ 3.6) to run the example implementation of some algorithms.
  • The code used in this chapter is available in the GitHub repository for this book at
    https://github.com/PacktPublishing/Hands-On-Graph-Analytics-with-Neo4j/ch7
  • If you are using Neo4j < 4.0, then the last compatible version of the GDS is 1.1.
  • If you are using Neo4j ≥ 4.0, then the first compatible version of the GDS is 1.2.

Introducing community detection and its applications

Community detection gathers techniques that have been developed to understand the structure of a graph and extract information from it. This structure can then be used in many applications, such as recommendation engines, fraud detection, property prediction, and link prediction.

Throughout this chapter, I will use the words community, cluster, or partition to refer to a group of nodes sharing common properties.

Identifying clusters of nodes

The following image shows the graph of Neo4j GitHub users we built in Chapter 2, The Cypher Query Language. Community detection algorithms were able to identify several communities:

Image generated using the Louvain algorithm and neovis.js

By the end of this chapter, you will be able to reproduce this image. Further analysis is then needed to understand the common properties of the users belonging to each community. For instance, a deeper look at this graph shows that the users in the violet community stand out from the crowd: they have mostly contributed to a single repository.

Before going into the technical details of each algorithm, we are first going to talk more about the advantages of knowing which community a node belongs to. Although we are talking about communities made of users here, this can be applied to many other fields. We will present some of them in the rest of this section.

Applications of the community detection method

There are many applications of community detection and clustering with graphs. They can be used, for instance, in the following fields:

  • Biology: Protein-protein interaction networks model the protein interactions within a cell. Proteins belonging to the same community are more likely to be involved in the same function.
  • Neuroscience: Researchers model the human brain as a graph, where each node represents a small group of cells. Understanding the community structure of this graph has proven particularly useful for understanding how different parts of the brain coordinate with each other.
  • Public health: We can use the community structure of the population to try and predict the evolution and spread of a disease.

The next sections will focus on some applications that are more likely to be directly useful to you.

Recommendation engines and targeted marketing

We have already explored recommendations using Neo4j and Cypher, in Chapter 3, Empowering Your Business with Pure Cypher. But at that time, we only used graph traversal to find products bought together or products bought by a user that the current customer is connected to via social relationships. Community detection brings a new kind of information into the game, which can be used to provide even more relevant recommendations.

Clusters of products

For instance, being able to group similar products together can help in identifying similar products that would not normally be classified in the same category. It can also be used to find identical products sold by different retailers in marketplaces such as eBay. With that information, you can avoid recommending a product that a given customer has already bought from another provider.

Clusters of users

Another way to use community detection to improve a recommendation engine is to try and create groups of customers. In that way, you can create groups of customers with similar habits, which can reflect similar interests. On an e-commerce website proposing sport-related articles, you can create a community of fishermen and a community of football players, without any prior knowledge about your users except their purchases: if the purchased products are mostly about fishing, you know this customer is likely a fisherman. This knowledge can be used to extend the list of relevant recommendations. For instance, some new football socks that have not been bought by anyone yet can probably be recommended to the users identified as being football players.

Fraud detection

Fraud detection was addressed in the preceding chapter (Chapter 6, Node Importance). We talked about the way fraudsters create crime rings and collaborate with each other to avoid detection. This kind of organization is hard to detect with traditional methods, but graphs, thanks to their native structure based on relationships, are a very helpful tool in the fight against all types of fraud. Assuming that fraudsters interact more with each other than with legitimate users (for instance, fake accounts sharing the same phone number or address), they will form a community in the graph and can be identified more easily.

Predicting properties or links

One of the underlying ideas in community detection and most of the applications outlined above is that nodes belonging to the same community share some properties. This can be used to make predictions, based on the community structure of the graph. Let's start with the subgraph illustrated in the following figure:

It contains three nodes, A, B and C, and two edges ((A, B) and (A, C)). This could be a part of a larger graph with more outgoing edges. Nodes A and B have a property whose value is 1. That could be the age category of some users, which is not always available. Users A and B have filled that field, indicating they are between 21 and 30. On top of that, some community detection algorithms have managed to cluster all three nodes into the same community. Intuitively, we can say that the probability of node C also falling into the 21-30 age category increases with this new knowledge about the graph structure.

Similarly, if we try to measure the probability of the edge between B and C existing without us knowing it or appearing in the future, it's higher for nodes in the same community.

A brief overview of community detection techniques

One of the first graph structure studies was performed by Weiss and Jacobson and published in 1955. Since then, several types of algorithms have been studied and implemented, using different types of rules.

As for the node importance problem, the first thing to think of in the case of a community detection problem is the definition of the metric or objective function that will quantify how good a graph partitioning is. The most common definition states that a community is a set of nodes with more intra-community connections (edges between nodes in the same community) than inter-community connections (edges between nodes in two different communities). But even with that definition, there are several possible ways to achieve a satisfactory partitioning.
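To make this definition concrete, here is a small illustrative Python sketch (not GDS code) that, given an undirected graph stored as an adjacency dictionary and a candidate partition, counts intra-community versus inter-community edges:

```python
# Illustrative sketch: count edges inside communities vs. edges
# crossing community boundaries, for an undirected adjacency dict.
def intra_inter_edges(G, communities):
    intra, inter = 0, 0
    for node, neighbors in G.items():
        for neighbor in neighbors:
            if node < neighbor:  # count each undirected edge only once
                if communities[node] == communities[neighbor]:
                    intra += 1
                else:
                    inter += 1
    return intra, inter

# a toy graph and a candidate partition (both hypothetical)
G = {
    'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B', 'D'],
    'D': ['C', 'E'], 'E': ['D'],
}
partition = {'A': 0, 'B': 0, 'C': 0, 'D': 1, 'E': 1}
print(intra_inter_edges(G, partition))  # (4, 1): only the C-D edge crosses
```

A good partitioning is one where the first number dominates the second.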

Many algorithms have been proposed, using different metrics. For instance, hierarchical clustering uses some rules to create a dendrogram, building a hierarchy of clusters. The Girvan-Newman algorithm is an example of hierarchical clustering using an extension, from nodes to edges, of the betweenness centrality: the edges with the highest betweenness centrality are those involved most often in the shortest paths between pairs of nodes in the graph. Other hierarchical clustering algorithms use similarity measures instead of topological ones. At the end of this chapter (in the Measuring the similarity between nodes section), we will learn how to measure the similarity between nodes.

In this book, we will focus on some of the algorithms implemented in Neo4j:

  • Connected components, which allows us to detect disconnected sub-graphs.
  • Label Propagation, as the name suggests, propagates community labels through the graph, based on a majority-vote rule, to assign each node to its community.
  • The Louvain algorithm optimizes the modularity, a metric that compares the density of intra-community connections with the density expected if edges were placed at random.

Even if not yet implemented in the GDS, we will also talk about some proposed improvements of these algorithms, especially the Leiden algorithm, which was proposed to fix some issues with the Louvain algorithm. We will also briefly address the problem of overlapping communities, which isn't covered by the aforementioned algorithms.

Let's now start our community detection journey with the connected components algorithms. In the following section, we will learn about these algorithms and discover tools to visualize the detected communities in a graph format, which is crucial for understanding the structure of medium-sized graphs.

Detecting graph components and visualizing communities

Graph components have a clear mathematical definition. In a given component, any two nodes are connected to each other through a path, but they are not connected to any node outside the component. There are two versions of connected components:

  • Strongly connected components: These make sure that the path between two nodes in the same component exists in both directions.
  • Weakly connected components: A single direction is enough for weakly connected components. 
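The weakly connected case can be sketched in a few lines of Python before we turn to the GDS. This illustrative flood-fill BFS treats every edge as undirected and assigns a component ID to each reachable node; the GDS itself uses an optimized union-find implementation:

```python
from collections import deque

# Illustrative sketch of weakly connected components: build the
# undirected view of a directed adjacency dict, then BFS from each
# unvisited node, labeling everything reached with the same id.
def weakly_connected_components(G):
    undirected = {node: set(neighbors) for node, neighbors in G.items()}
    for node, neighbors in G.items():
        for neighbor in neighbors:
            undirected.setdefault(neighbor, set()).add(node)
    component = {}
    current = 0
    for start in undirected:
        if start in component:
            continue
        queue = deque([start])
        component[start] = current
        while queue:
            node = queue.popleft()
            for neighbor in undirected[node]:
                if neighbor not in component:
                    component[neighbor] = current
                    queue.append(neighbor)
        current += 1
    return component

# toy directed graph: A->B->C is one component, Y->Z another
G = {'A': ['B'], 'B': ['C'], 'C': [], 'Y': ['Z'], 'Z': []}
print(weakly_connected_components(G))
# {'A': 0, 'B': 0, 'C': 0, 'Y': 1, 'Z': 1}
```

As in the GDS, the exact component numbers are arbitrary; only the grouping matters.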

Let's see an example of this in Neo4j. In this section, we are also going to use, for the first time in this book, the write procedure to store the algorithm results in Neo4j. We will also introduce a new JavaScript library to customize the visualization of large graphs.

Let's start with the following directed graph:

The Cypher code used to create the graph is available on GitHub (https://github.com/PacktPublishing/Hands-On-Graph-Analytics-with-Neo4j/blob/master/ch7/test_graph.cql).

Visually, we can say that there are at least two components in this graph. Nodes Y and Z are totally disconnected from any other node in the graph, so there isn't any path from these two nodes to any other node in the graph. Let's see how we can learn this information from the algorithms implemented in the GDS.

We will use the following projected graph:

CALL gds.graph.create("simple_projected_graph", "Node", "LINKED_TO")

This projected graph contains all nodes with the Node label and all relationships with the LINKED_TO type in their natural orientation (as defined when creating them with Cypher and as illustrated in the preceding figure), without any properties attached to them.

Weakly connected components

The first algorithm we are going to study here is weakly connected components or union-find.

Weakly connected components, together with the Louvain and Label Propagation algorithms we are going to cover later on in this chapter, are production-ready algorithms in the GDS as of version 1.0.

To see the results of graph partitioning using weakly connected components, use the following query:

CALL gds.wcc.stream("simple_projected_graph") 
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name as nodeName, componentId
ORDER BY componentId

Here are our results:

╒══════════╤═════════════╕
│"nodeName"│"componentId"│
╞══════════╪═════════════╡
│"A"       │0            │
├──────────┼─────────────┤
│"B"       │0            │
├──────────┼─────────────┤
│"C"       │0            │
├──────────┼─────────────┤
│"D"       │0            │
├──────────┼─────────────┤
│"E"       │0            │
├──────────┼─────────────┤
│"F"       │0            │
├──────────┼─────────────┤
│"G"       │0            │
├──────────┼─────────────┤
│"Y"       │7            │
├──────────┼─────────────┤
│"Z"       │7            │
└──────────┴─────────────┘

As you can see, the algorithm successfully identified two components:

  • The component labeled 7 in this example, containing nodes Y and Z
  • Another component labeled 0, containing all the other nodes

The exact number used to label communities is not relevant; it depends on the internal functioning of the GDS. Depending on how your graph was created, the numbers you get might be different, but the same communities should be detected.

Weakly connected components tell us that we have two disconnected partitions in our graph, considering the graph as undirected. If relationship direction is important, for instance in a road network, then we will have to use the strongly connected components algorithm.

Strongly connected components

In strongly connected components, the direction of relationships matters: each node in a component must be able to reach any other node in the same component, and vice versa. Focusing on nodes A to G, which were grouped into the same community by the weakly connected components algorithm, you can see that it is not always possible to go from one node to another in both directions. For instance, going from D to A is possible (through C), but going from A to D is not.
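For reference, strongly connected components can be sketched with Kosaraju's algorithm: one DFS pass records a finish order, and a second pass over the reversed graph peels off one component per start node. This is an illustrative sketch, not the GDS implementation, but it produces the same partitioning:

```python
# Illustrative sketch of SCC via Kosaraju's algorithm (recursive DFS,
# fine for small graphs). G is a directed adjacency dict.
def strongly_connected_components(G):
    reversed_G = {node: [] for node in G}
    for node, neighbors in G.items():
        for neighbor in neighbors:
            reversed_G[neighbor].append(node)

    visited, order = set(), []
    def dfs(node):
        visited.add(node)
        for neighbor in G[node]:
            if neighbor not in visited:
                dfs(neighbor)
        order.append(node)  # record finish order
    for node in G:
        if node not in visited:
            dfs(node)

    component = {}
    def assign(node, cid):
        component[node] = cid
        for neighbor in reversed_G[node]:
            if neighbor not in component:
                assign(neighbor, cid)
    cid = 0
    for node in reversed(order):  # process in reverse finish order
        if node not in component:
            assign(node, cid)
            cid += 1
    return component

# A -> B -> C -> A forms a cycle; D is only reachable from C
G = {'A': ['B'], 'B': ['C'], 'C': ['A', 'D'], 'D': []}
print(sorted(strongly_connected_components(G).items()))
# [('A', 0), ('B', 0), ('C', 0), ('D', 1)]
```

The cycle A, B, C ends up in one component, while D, which cannot reach back, is in its own.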

To see the components identified with this stronger rule, we can use the gds.alpha.scc procedure in the following way:

CALL gds.alpha.scc.stream("simple_projected_graph") 
YIELD nodeId, partition as componentId
RETURN gds.util.asNode(nodeId).name as nodeName, componentId
ORDER BY componentId

Here are the results of the preceding code:

╒══════════╤═════════════╕
│"nodeName"│"componentId"│
╞══════════╪═════════════╡
│"A"       │0            │
├──────────┼─────────────┤
│"B"       │0            │
├──────────┼─────────────┤
│"C"       │0            │
├──────────┼─────────────┤
│"D"       │3            │
├──────────┼─────────────┤
│"E"       │3            │
├──────────┼─────────────┤
│"F"       │3            │
├──────────┼─────────────┤
│"G"       │3            │
├──────────┼─────────────┤
│"Y"       │7            │
├──────────┼─────────────┤
│"Z"       │7            │
└──────────┴─────────────┘

This time, three components are identified: while nodes Y and Z are still in their own component, nodes A, B, and C are now disconnected from D, E, F, and G.

Connected components are very useful algorithms for understanding our graph structure. Some algorithms like PageRank (see Chapter 6, Node Importance) can lead to unexpected results on graphs with multiple components. Running connected components algorithms is good practice in the data exploration part of your graph data analysis.

Before moving on to the study of the other community detection algorithms, we will talk about some tools that allow us to visualize the communities in a better format. Some of them will require the community number (componentId) to be a node attribute. That's why we'll now address a functionality of the GDS we have not exploited so far: the possibility to write the results of an algorithm into the Neo4j graph, instead of streaming them back to the user and letting them decide what to do with them.

Writing the GDS results in the graph

In Chapter 4, The Graph Data Science Library and Path Finding, we introduced the write feature of the GDS, allowing us to store the results of an algorithm in Neo4j. This feature is available for almost all algorithms implemented in the GDS. Basically, the only algorithms that do not offer this option are those returning matrices, such as the All Pairs Shortest Path algorithm or some of the similarity algorithms we are going to cover at the end of this chapter.

Let's see an application of the write procedure with the connected components algorithms.

The syntax of the write procedure is pretty similar to the stream one. The main difference is that it accepts another configuration parameter, writeProperty, which allows us to configure the name of the property that will be added to each node.

The following query will write the result of the weakly connected component algorithm into a wcc property for each node:

CALL gds.wcc.write(
    "simple_projected_graph", {
        writeProperty: "wcc"
    }
)

Here, the returned result contains information about the algorithm running time and graph structure:

But to see the partition each node belongs to, we will have to use another Cypher query:

MATCH (n:Node) 
RETURN n.name as nodeName, n.wcc

The same thing can be achieved with strongly connected components:

CALL gds.alpha.scc.write(
    "simple_projected_graph", {
        writeProperty: "scc"
    }
)

On top of its name, each node now contains two more properties called wcc and scc, containing the ID of the component the node belongs to according to both weakly and strongly connected component algorithms. Here, you can see the content of node D:

{
    "name": "D",
    "scc": 3,
    "wcc": 0
}

Writing results to the graph is sometimes the only solution when the graph is very big (we'll discuss the case of big data more in Chapter 12, Neo4j at Scale). But it can also be useful in other cases. We'll discuss one of those now.

So far, we have only used very small graphs, and by reading the results from a table, it was easy enough to understand them. But this is not the most common use case for graph algorithms, which are mostly useful for medium- and large-scale graphs. In these situations, visualizing the results of community detection algorithms in a graph format can be much more important for understanding the structure of the graph. In the following section, we are going to discover two ways of drawing nodes with different sizes or colors depending on their attributes: the first is the neovis.js JavaScript library, used to embed the graph visualization into an HTML page, and the second is NEuler, the Graph Algorithms Playground, a Neo4j Desktop application based on this package.

Visualizing a graph with neovis.js

When it comes to communities, it is always useful to have a way to visualize both the node classification and the relationships between nodes. Graph visualization is a research field in itself, and many packages exist for producing good visualizations. For instance, one of the most complete JavaScript libraries for data visualization, d3.js, also has features to draw graphs. However, when using d3.js, you have to manage the connection to Neo4j, data retrieval, and formatting yourself. That's the reason why, in this section and the rest of this chapter, we are going to use the open source neovis.js JavaScript library. It is very easy to use because the connection to Neo4j is managed internally, so we don't need any knowledge of the Neo4j JavaScript driver to make it work. neovis.js also creates pretty nice visualizations, like the one shown in the first figure of this chapter illustrating the Neo4j GitHub community, and it has many customizable parameters.

A full working example is available in the graph_viz.html file available at https://github.com/PacktPublishing/Hands-On-Graph-Analytics-with-Neo4j/ch7/connected_components/graph_viz.html.

Import the library from the GitHub repository with the following:

    <script src="https://rawgit.com/neo4j-contrib/neovis.js/master/dist/neovis.js"></script>

The minimal HTML code is as follows:

    <body onload="draw()">
        <div id="graph"></div>
    </body>

It will call the draw function after the body has loaded, and draw the graph into the div element with id="graph". The main function is draw, which contains the configuration parameters and the rendering calls:

function draw() {
    let config = {
        // the ID of the "div" the graph will be drawn into
        container_id: "graph",
        // connection parameters to your Neo4j graph
        // default is "bolt://localhost:7687"
        server_url: "bolt://localhost:7687",
        server_user: "neo4j",
        server_password: "*****",
        // node labels to be fetched
        // and properties used to customize each node's rendering
        labels: {
            "Node": {
                "caption": "name", // label drawn next to each node
                "community": "scc", // defines the node color
            }
        },
        relationships: {
            "LINKED_TO": {
                // disable captions for relationships
                // since they carry no relevant information in our case
                "caption": false,
            }
        },
        arrows: true, // display the relationship directions
        initial_cypher: "MATCH (n:Node)-[r:LINKED_TO]->(m) RETURN *"
    };

    let viz = new NeoVis.default(config);
    viz.render();
}
You need to update the connection parameters to your graph: server_url, server_user, and server_password. server_password corresponds to the password you were asked to create when adding a new graph in Neo4j Desktop.

The file graph_viz.html can be opened with your favorite browser to see the different communities in our graph, as identified by the strongly connected component algorithm (the scc property). The resulting image should look like the one displayed in the following figure:

You can also, if needed, visualize the node importance by indicating the size parameter of the node configuration in the draw function and the property of the node containing its importance; for instance:

        labels: {
            "Node": {
                "caption": "name", // label drawn next to each node
                "community": "scc", // defines the node color
                "size": "pagerank"
            }
        },
For this code to work, you need to run the PageRank algorithm with the write procedure on your projected graph:
CALL gds.pageRank.write("simple_projected_graph", {writeProperty: "pagerank"})

But remember that PageRank may produce unexpected results in the case of disconnected graphs.

neovis.js is a powerful and interesting tool if you want to embed the graph into an HTML page. However, if you just need to inspect the results of an algorithm, a simpler solution exists: NEuler.

Using NEuler, the Graph Data Science Playground

NEuler is an open source application integrated within Neo4j Desktop, whose code is available on GitHub at https://github.com/neo4j-apps/neuler.

It is a very powerful tool allowing us to test all the algorithms implemented in the GDS, from path finding (Chapter 4, The Graph Data Science Library and Path Finding) and centrality (Chapter 6, Node Importance) to community detection and similarity. It also enables visualizing the results with a graph visualization feature based upon neovis.js.

Installation instructions are available in the GitHub repository of this book.

Usage for community detection visualization

The home screen, illustrated in the following screenshot, is where you can choose the type of algorithm you are going to run:

Once you have selected the community detection algorithms, you can choose the algorithm you want to try from the upper menu.  For this example, we are going to use the Strongly Connected Components algorithm. After clicking on its name, you can configure the projected graph and the algorithm parameters from the right bar. The configuration illustrated in the following screenshot will do the following:

  • Run the algorithm on the projected graph containing the node labels Node and relationship types LINKED_TO with orientation=UNDIRECTED.
  • Store the results in a property called scc.

Clicking on the Run button will call the proper Cypher procedure, which you can check via the Code tab:

Finally, you can visualize the results inside the Visualization tab. There, you can customize the node property that will be used to color them, as we did in the previous section with neovis.js:

This closes our section about connected components. We have learned how weakly and strongly connected components algorithms can help us identify disconnected communities in our graph. We have also discovered two different ways of visualizing the results of community detection algorithms. We will continue using them in the following sections, where we are going to discover new types of community detection algorithms: the Label Propagation and Louvain algorithms, both part of the production-quality tier of the GDS.

Running the Label Propagation algorithm

Label Propagation is another example of a community detection algorithm. Proposed in 2007, one of its strengths is the possibility to set labels for some known nodes and derive the unknown labels from them in a semi-supervised way. It can also take into account both relationship and node weights. In this section, we are going to detail the algorithm with a simple implementation in Python.

Defining Label Propagation

Several variants of Label Propagation exist. The main idea is the following:

  1. Labels are initialized such that each node lies in its own community.
  2. Labels are iteratively updated based on the majority vote rule: each node receives the labels of its neighbors, and the most common label among them is assigned to the node. Conflicts appear when the most common label is not unique. In that case, a tie-breaking rule needs to be defined, which can be random or deterministic (as in the GDS).
  3. The iterative process is repeated until all nodes have fixed labels.

The optimal solution is a partition with the minimum number of edges connecting two nodes with different labels. Let's consider the following graph:

After some iterations, the algorithm assigned nodes A and B to one community (let's call it CR), and nodes E and G to another one (CG). According to the majority vote rule, in the next iteration, node C will be assigned to the CR community since two nodes connected to C already belong to this partition, while only one node (D) belongs to another cluster. Similarly, node D will be assigned to the CG community. The resulting graph is illustrated in the following figure:

Weighted nodes and relationships

In order to take into account the weight of a node or relationship, we can update the majority vote rule so that it does not only count the number of nodes with a given label but sums their weight. The selected label will then be the one with the highest sum of weights.
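The weighted rule can be sketched in Python as follows. The node names and edge weights below are hypothetical, chosen only to mirror the preceding example:

```python
from collections import defaultdict

# Sketch of the weighted majority-vote rule: instead of counting
# neighbors per label, sum the weights of the edges carrying each label.
def weighted_vote(neighbors, labels):
    """neighbors maps neighbor name -> edge weight; labels maps node -> label."""
    scores = defaultdict(float)
    for neighbor, weight in neighbors.items():
        scores[labels[neighbor]] += weight
    # the label with the highest total weight wins;
    # ties are broken by the larger label (arbitrary but deterministic)
    return max(scores.items(), key=lambda kv: (kv[1], kv[0]))[0]

# hypothetical current labels and edge weights around node D
labels = {'B': 'CR', 'C': 'CR', 'E': 'CG', 'G': 'CG'}
neighbors_of_D = {'C': 1, 'B': 8, 'E': 1, 'G': 2}
print(weighted_vote(neighbors_of_D, labels))  # CR (total 9 beats 3)
```

With unit weights everywhere, this reduces to the plain majority-vote rule.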

Let's consider the weighted version of the preceding graph illustrated in the following figure:

This time, in order to choose the label for node D, we have to take into account the weights of each edge connected to D:

  • Weight for the CR community = 1 (C) + 8 (D) = 9
  • Weight for the CG community = 1 (E) + 2 (G) = 3

Hence, in this weighted version of the graph, node D would belong to the CR community.

Semi-supervised learning

Another interesting aspect of Label Propagation is its ability to take into account prior knowledge. If you already know the community of some of the nodes, this information can be used in the initialization phase instead of setting random initial values. This technique is known as semi-supervised learning since only some nodes are labeled with their community.
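A possible initialization sketch, assuming our prior knowledge is stored in a `seeds` dictionary (the name is ours, not a GDS convention): seeded nodes keep their known community, and every other node starts in its own fresh community.

```python
# Sketch of semi-supervised initialization for Label Propagation.
def init_labels(G, seeds):
    labels = {}
    # fresh labels start above the largest seed value to avoid collisions
    next_label = max(seeds.values(), default=-1) + 1
    for node in G:
        if node in seeds:
            labels[node] = seeds[node]  # keep the known community
        else:
            labels[node] = next_label   # unique community per node
            next_label += 1
    return labels

G = {'A': {}, 'B': {}, 'C': {}, 'D': {}}
print(init_labels(G, {'A': 0, 'D': 1}))  # {'A': 0, 'B': 2, 'C': 3, 'D': 1}
```

The propagation loop then runs unchanged; the seeds simply bias which labels spread.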

Implementing Label Propagation in Python

In this section, we are going to implement a simplified version of Label Propagation, in which seed labels can only take two values: 0 or 1.

We will use the same graph representation as in the previous chapters, based on a Python dictionary. The graph we will use is represented by the following object:

    G = {
        'A': {'B': 1, 'C': 1},
        'B': {'A': 1, 'C': 1},
        'C': {'A': 1, 'B': 1, 'D': 1},
        'D': {'C': 1, 'E': 1, 'G': 1},
        'E': {'D': 1, 'F': 1, 'G': 1},
        'F': {'E': 1, 'G': 1},
        'G': {'D': 1, 'E': 1, 'F': 1},
    }

It tells us that node A is connected to nodes B and C, both edges having a weight equal to 1.

Let's now start the algorithm. In the initialization phase, we initialize all labels with unique values. One way to find unique values is to use their index in the loop over the keys of G, which can be achieved in Python with the following:

    labels = {node: k for k, node in enumerate(G)}
The enumerate(iterable) function returns a tuple containing a count (from 0) and the value obtained by iterating over iterable. Here, iterable is our dictionary G, and iterating over G is equivalent to iterating over its keys in Python.

We then enter into the main loop. At each iteration, we perform a loop on all nodes in the graph and for each of them count the number of votes it receives:

    from collections import Counter

    for it in range(max_iterations):
        print("======= Iteration", it)
        # create a copy of the labels computed in the previous iteration
        old_labels = dict(labels)
        for node, neighbors in G.items():
            # counting the number of votes from each neighbor:
            votes = Counter([old_labels[n] for n in neighbors])

To find the majority vote, we can use the following piece of code, which will iterate over the received votes and update the value of the new label each time a value above the current max value is found:

            max_vote = -9999
            new_label = old_labels[node]
            for label, vote in votes.items():
                if vote > max_vote:
                    max_vote = vote
                    new_label = label
                elif vote == max_vote:
                    # deterministic rule to disentangle equal votes (arbitrary)
                    if label > new_label:
                        new_label = label
            labels[node] = new_label

In order to disentangle cases where two labels have the same number of votes, we use a totally arbitrary rule to select the label with the highest value.

Once all labels have been updated, we can check the algorithm's convergence by checking that no label has changed since the previous iteration:

        end = True
        for node in G:
            if old_labels[node] != labels[node]:
                # if at least one node's label has changed, go to the next iteration
                end = False
                break
        if end:
            return labels

Running this code on graph G defined at the beginning of this section, we get the following results:

{'A': 3, 'B': 3, 'C': 3, 'D': 6, 'E': 6, 'F': 6, 'G': 6}

Two communities are identified: the first one is labeled 3 and contains nodes A, B, and C, while the second one is labeled 6 and contains nodes D to G. Remember that the labels themselves are meaningless since they just come from the node position in the graph definition. Different implementations will return different values for the labels, but the community structure will be preserved.
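Since the raw label values are arbitrary, a handy trick when comparing two partitionings is to relabel communities canonically, in order of first appearance. A possible sketch:

```python
# Relabel communities as 0, 1, 2, ... in order of first appearance,
# so two runs of the algorithm can be compared directly.
def canonical(labels):
    mapping = {}
    return {node: mapping.setdefault(lab, len(mapping))
            for node, lab in labels.items()}

print(canonical({'A': 3, 'B': 3, 'C': 3, 'D': 6, 'E': 6, 'F': 6, 'G': 6}))
# {'A': 0, 'B': 0, 'C': 0, 'D': 1, 'E': 1, 'F': 1, 'G': 1}
```

After canonicalization, two partitionings are identical exactly when the community structure is the same, whatever labels each run produced.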

Convergence is not guaranteed with Label Propagation and we can end up with an oscillation where the label of a given node is unstable and oscillates between two values at each iteration, making the convergence fail.
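The oscillation problem is easy to reproduce with synchronous updates on a single edge: in this toy example, each node keeps adopting the other's previous label, so the labels swap forever and never converge.

```python
# Minimal illustration of label oscillation with synchronous updates.
G = {'A': ['B'], 'B': ['A']}
labels = {'A': 0, 'B': 1}
for it in range(4):
    old = dict(labels)
    for node in G:
        # each node's only neighbor dictates its new label
        labels[node] = old[G[node][0]]
    print(it, labels)
# 0 {'A': 1, 'B': 0}
# 1 {'A': 0, 'B': 1}
# 2 {'A': 1, 'B': 0}
# 3 {'A': 0, 'B': 1}
```

Updating nodes asynchronously (using already-updated labels within the same iteration), or in a random order, is a common way to break such cycles.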

The full code is available on GitHub at https://github.com/PacktPublishing/Hands-On-Graph-Analytics-with-Neo4j/ch7/label_propagation/label_propagation.py. You are encouraged to copy and modify it to help you understand how it works. For instance, update the code to take into account node and relationship weights. Note that the implementation is kept as simple as possible to allow you to leverage your prior knowledge of Python. If you already know about networkx and numpy, for instance, you can try to modify this code to make it work with a networkx graph or by using a matrix formulation (see the previous chapter, Chapter 6, Node Importance).

Using the Label Propagation algorithm from the GDS

The Label Propagation algorithm is part of the production-quality tier, which means it is well tested and optimized for large graphs. We are going to run the Label Propagation algorithm on the graph we studied in the Weakly connected components section. The Cypher code to create it is available at https://github.com/PacktPublishing/Hands-On-Graph-Analytics-with-Neo4j/ch7/test_graph.cql.

Let's create an undirected projected graph:

CALL gds.graph.create(
    "projected_graph",
    "Node",
    {
        LINKED_TO: {
            type: "LINKED_TO",
            orientation: "UNDIRECTED"
        }
    }
)

To execute the Label Propagation algorithm on this graph, we can use the following query:

CALL gds.labelPropagation.stream("projected_graph")
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS nodeName, communityId
ORDER BY communityId

Here are the results:

╒══════════╤═════════════╕
│"nodeName"│"communityId"│
╞══════════╪═════════════╡
│"A" │39 │
├──────────┼─────────────┤
│"B" │39 │
├──────────┼─────────────┤
│"C" │39 │
├──────────┼─────────────┤
│"D" │42 │
├──────────┼─────────────┤
│"E" │42 │
├──────────┼─────────────┤
│"F" │42 │
├──────────┼─────────────┤
│"G" │42 │
├──────────┼─────────────┤
│"Y" │46 │
├──────────┼─────────────┤
│"Z" │46 │
└──────────┴─────────────┘

Three communities have been identified: two are similar to the ones we identified in the previous section plus an extra community containing the nodes Y and Z, which were not included in our previous study.

Here again, the exact value of communityId may vary, since it is derived from internal node IDs. But all nodes in the same community will always have the same communityId.

Using seeds

In order to use seeds, we first need to add a new property to the graph that will hold our prior belief about the node community membership – knownCommunity:

MATCH (A:Node {name: "A"}) SET A.knownCommunity = 0;
MATCH (B:Node {name: "B"}) SET B.knownCommunity = 0;
MATCH (F:Node {name: "F"}) SET F.knownCommunity = 1;
MATCH (G:Node {name: "G"}) SET G.knownCommunity = 1;

To some nodes, we have added a property called knownCommunity, which stores our prior knowledge (or belief) about the community each node belongs to.
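In terms of the pure-Python implementation from earlier in this chapter, seeding only changes the initialization step. A minimal sketch (the function name and the seed format are illustrative, not part of the GDS):

```python
def init_labels(nodes, seeds=None):
    """Assign initial labels: seeded nodes keep their prior community,
    all other nodes receive a fresh, unique label."""
    seeds = seeds or {}
    # Start fresh labels above the largest seeded community ID
    next_label = max(seeds.values(), default=-1) + 1
    labels = {}
    for node in nodes:
        if node in seeds:
            labels[node] = seeds[node]
        else:
            labels[node] = next_label
            next_label += 1
    return labels

init_labels(["A", "B", "F", "G"], seeds={"A": 0, "F": 1})
# -> {'A': 0, 'B': 2, 'F': 1, 'G': 3}
```

The propagation loop itself is unchanged; only the starting labels reflect the prior knowledge.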

Then, we can create the named projected graph. This graph will contain all nodes with the Node label and all relationships with the LINKED_TO type. In order to use the algorithm in a semi-supervised way, we also need to explicitly tell the GDS to store the knownCommunity node property in the projected graph. Finally, we will consider our graph as undirected, which is achieved by specifying the orientation: "UNDIRECTED" parameter in the relationship projection:

CALL gds.graph.create("projected_graph_with_properties", {
    // node projection:
    Node: {
        label: "Node",
        properties: "knownCommunity"
    }}, {
    // relationship projection:
    LINKED_TO: {
        type: "LINKED_TO",
        orientation: "UNDIRECTED"
    }
})

Now we can run the Label Propagation algorithm on this named projected graph and stream the results with the following:

CALL gds.labelPropagation.stream("projected_graph_with_properties", {
    seedProperty: "knownCommunity"
})
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS nodeName, communityId
ORDER BY communityId

You can see from the results that the identified communities are the same; however, communityId now reflects the values given via seedProperty.

Writing results to the graph

To visualize the results in the browser with the code we wrote in the previous section, we need to store the results in the graph, which can be achieved with the gds.labelPropagation.write procedure:

CALL gds.labelPropagation.write(
    "projected_graph_with_properties", {
        seedProperty: "knownCommunity",
        writeProperty: "lp"
    }
)

Algorithms such as Label Propagation are a good example of how graph algorithms have already been used in more classical machine learning models. Indeed, Label Propagation is used both for classification and regression in machine learning, where the propagation is performed through a similarity matrix (instead of the adjacency matrix discussed in the previous chapter).

For now, we will focus on another important algorithm for community detection: the Louvain algorithm.

Understanding the Louvain algorithm

The Louvain algorithm was proposed in 2008 by researchers from the University of Louvain in Belgium, giving the algorithm its name. It relies on a measure of the density of connections within a community compared to the connections toward other nodes. This metric is called modularity, and it is the quantity we are going to understand first.

Defining modularity

Modularity is a metric quantifying the density of links within nodes in the same community, compared to links between nodes in different communities.

Mathematically speaking, its definition is as follows:

Q = 1/(2m) * Σij [ Aij - kikj / (2m)] δ(ci, cj)

Where:

  • Aij is 1 if nodes i and j are connected, 0 otherwise.
  • ki is the degree of node i.
  • m is the sum of all weights carried by the edges. We also have the relation Σi ki = 2m, since the sum over all nodes will count each edge twice.
  • ci is the community the node i is assigned to by the algorithm.
  • δ(x, y) is the Kronecker delta function, which is equal to 1 if x=y and 0 otherwise.

To understand the meaning of this equation, let's focus on each of the terms separately. A node i with ki edges has ki chances of being connected to any other node in the graph, so two nodes i and j have ki x kj chances of being connected to each other. Hence, the term kikj / (2m) corresponds to the expected number of edges between nodes i and j in a random graph with the same degree sequence.

On the other side of the equation, Aij is the real connection status between i and j. So the term under the sigma sign quantifies the difference between the real and expected number of edges between two nodes of the same community (in a random graph).

Two nodes in the same community should be connected more often than the average, hence having Aij > kikj / (2m). The Louvain algorithm is based on this property and tries to maximize the modularity Q.

This definition of modularity also works for weighted graphs. In that case, Aij is the weight of the relationship between i and j.

Before moving on to the algorithm itself, let's investigate special graph partitions and what the value of the modularity is for them. The results can be checked using the simple Python implementation of modularity found at https://github.com/PacktPublishing/Hands-On-Graph-Analytics-with-Neo4j/ch7/louvain/modularity.py.

In the rest of this section, we will use a graph similar to the one used in the previous sections, excluding nodes Y and Z in order to work on a single component, for simplicity's sake.

All nodes are in their own community

If all nodes are in their own community, then δ(ci, cj) is always 0 since ci is always different from cj, except in the cases where i=j. Since our graph does not have a self-loop (a node connected to itself), Aii is always equal to zero and only the negative term in the sum remains. So in that particular case, the modularity Q is negative. In the following, we will write it as Qdiag and its value for our graph is Qdiag=-0.148.

All nodes are in the same community

In the other extreme case, where all nodes are in the same community, δ(ci, cj) is always 1. So the modularity is as follows:

Q = 1/(2m) * Σij [ Aij - kikj / (2m)]

The sum over all pairs of nodes of the Aij term corresponds to the sum of the weights of all edges, counted twice (for ij and ji pairs). So, we have the following:

Σij Aij = 2m

Let's rewrite the second term of the equation:

Σij kikj / (2m) = Σi ki (Σj kj) / (2m) = Σi ki (2m) / (2m) = Σi ki = 2m

So, in the special case where all nodes are in the same community, then the modularity Q=0.

Optimal partition

The optimal partition for the simple graph we are using in this section is similar to the one obtained with connected components or Label Propagation algorithms: nodes A, B, and C are in their own partition while nodes D to G are in another community.

With m = 9, the modularity in that case is as follows:

Q = 1/18 (
    2  * (AAB - kAkB / 18 + AAC - kAkC / 18 + ABC - kBkC / 18)
   + 2 * (ADE - kDkE / 18 + ADG - kDkG / 18 + ADF - kDkF / 18 + AEF - kEkF / 18 + AEG - kEkG / 18 + AGF - kGkF / 18 )
) + Qdiag
= 1 / 18 * 2 * (
   1 - 2 * 2 / 18 + 1 - 2 * 3 / 18 + 1 - 2 * 3 / 18
   + 1 - 3 * 3  / 18 + 1 - 3 * 3  / 18 + 0 - 3 * 2 / 18 + 1 - 3 * 2 / 18 + 1 - 3 * 3 / 18 + 1 - 3 * 2 / 18
    ) - 0.148   

= 0.364 > 0

Since our graph is undirected, we can afford to double the term corresponding to the relationship between A and B (A - B), instead of summing the terms A -> B and B -> A.

Modularity can also be used to measure the quality of the partitioning detected by another algorithm.
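The three special cases above can be verified with a short Python sketch (a simplified stand-in for the modularity.py script mentioned earlier, not a copy of it; the adjacency dictionary mirrors this section's example graph):

```python
def modularity(adj, communities):
    """Q = 1/(2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    # Each undirected edge appears twice in adj, so this sum is 2m
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if communities[i] == communities[j]:
                q += adj[i].get(j, 0) - degree[i] * degree[j] / two_m
    return q / two_m

# The section's example graph (m = 9 edges, stored in both directions):
G = {
    'A': {'B': 1, 'C': 1},
    'B': {'A': 1, 'C': 1},
    'C': {'A': 1, 'B': 1, 'D': 1},
    'D': {'C': 1, 'E': 1, 'G': 1},
    'E': {'D': 1, 'F': 1, 'G': 1},
    'F': {'E': 1, 'G': 1},
    'G': {'D': 1, 'E': 1, 'F': 1},
}

singletons = {n: n for n in G}                     # Q ≈ -0.148 (Qdiag)
all_in_one = {n: 0 for n in G}                     # Q = 0
optimal = {n: 0 if n in 'ABC' else 1 for n in G}   # Q ≈ 0.364
```

Running `modularity(G, ...)` on these three partitions reproduces the values derived above.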

Now that we have a better understanding of modularity, we will look at how to reproduce the Louvain algorithm, before showing some examples with Neo4j and the GDS.

Steps to reproduce the Louvain algorithm

The Louvain algorithm aims at maximizing the modularity, which acts as its objective function. Starting from a partition where each node is assigned to its own community (which, as we saw, has negative modularity), the algorithm tries to move nodes to the community of one of their neighbors, and keeps this configuration only if it makes the modularity increase.

The algorithm performs two steps at each iteration. During the first one, an iteration over all nodes is performed. For each node n and each of its neighbors k, the algorithm tries to move the node n to the same community as k. The node n is moved to the community that leads to the highest increase in modularity. If it is not possible to increase the modularity with such an operation, the community of node n is left unchanged for this iteration.

In the second step, the algorithm will group together nodes belonging to the same community in order to create new nodes, and sum the weights of the inter-community edges in order to create the new weighted edge between them.

The change of modularity induced by moving a single node from one community to another can be computed without looping over the whole graph again.
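The second step (community aggregation) can be sketched in a few lines of Python. This is an illustrative helper on a hypothetical toy graph, not the GDS implementation; note how intra-community edges become self-loops on the new nodes:

```python
def aggregate(adj, communities):
    """Louvain phase 2: collapse each community into a single node and
    sum the weights of the edges between (and within) communities."""
    new_adj = {}
    for i, nbrs in adj.items():
        ci = communities[i]
        new_adj.setdefault(ci, {})
        for j, w in nbrs.items():
            cj = communities[j]
            # Intra-community edges (ci == cj) accumulate as self-loops
            new_adj[ci][cj] = new_adj[ci].get(cj, 0) + w
    return new_adj

# Toy example: a 4-node cycle split into two communities of two nodes each.
square = {
    1: {2: 1, 3: 1},
    2: {1: 1, 4: 1},
    3: {1: 1, 4: 1},
    4: {2: 1, 3: 1},
}
condensed = aggregate(square, {1: 'a', 2: 'a', 3: 'b', 4: 'b'})
# -> {'a': {'a': 2, 'b': 2}, 'b': {'b': 2, 'a': 2}}
```

The algorithm then repeats phase 1 on this smaller weighted graph, which is why Louvain is sometimes described as hierarchical.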

The Louvain algorithm in the GDS

The Louvain algorithm is part of the production quality implementation from the Graph Data Science library as of version 1.0.

Syntax

The Louvain algorithm's usage is similar to the other algorithms. To stream the results, use the gds.louvain.stream procedure with a named projected graph as a parameter:

CALL gds.louvain.stream(<projectedGraphName>)

We can also save the results in the graph with the equivalent write procedure:

CALL gds.louvain.write(<projectedGraphName>, {writeProperty: <newNodeProperty>})

Using the Louvain algorithm on our graph leads to the two usual communities, with nodes A, B, and C on one side and nodes D to G on the other side. To see the differences between these algorithms, we will have to move to slightly larger graphs. We will see an example of this with Zachary's karate club in a later section.

The aggregation method in relationship projection

If you use the write procedure instead, you'll notice it returns, among other information, the final modularity. If you run this procedure on our previously created projected graph, projected_graph, you will notice a slightly different modularity value compared to the one we obtained in the previous section. The explanation comes from our projected graph. The initial graph, stored in Neo4j, contains two relationships between D and E (one from D to E and one from E to D). When the GDS creates the projected graph, the default behavior is to store both relationships, which is equivalent to increasing the weight of this edge (the Aij term). Our projected graph is hence equivalent to the following:

G = {
    'A': {'B': 1, 'C': 1},
    'B': {'A': 1, 'C': 1},
    'C': {'A': 1, 'B': 1, 'D': 1},
    'D': {'C': 1, 'E': 2, 'G': 1},
    'E': {'D': 2, 'F': 1, 'G': 1},
    'F': {'E': 1, 'G': 1},
    'G': {'D': 1, 'E': 1, 'F': 1},
}

The edge between D and E has a weight of 2 instead of 1.

If we want the projected graph to contain only one relationship between two nodes, we have to add another property to the relationship projection: aggregation: "SINGLE", which will enforce each pair of nodes to be connected by one single relationship of a given type. The following query will create a new projected graph with that property enabled:

CALL gds.graph.create(
    "projected_graph_single",
    "Node",
    {
        LINKED_TO: {
            type: "LINKED_TO",
            orientation: "UNDIRECTED",
            aggregation: "SINGLE"
        }
    }
)

Run the Louvain algorithm on this new projected graph with the following statement:

CALL gds.louvain.write("projected_graph_single", {writeProperty: "louvain"})
YIELD nodePropertiesWritten, modularity
RETURN nodePropertiesWritten, modularity

You will again find the same value for modularity, around 0.36, as shown in the following screenshot:

If you look at the full result of the gds.louvain procedures, you will find another field called modularities. It corresponds to the modularity computed at each stage of the algorithm.

Intermediate steps

An interesting feature of the GDS is its ability to store the communities at intermediate steps of the algorithm. This option needs to be enabled by setting the configuration parameter includeIntermediateCommunities to true. For instance, the following query will stream the results of the Louvain algorithm for our projected graph, returning an extra column containing the list of communities each node was assigned to at each iteration:

CALL gds.louvain.stream(
    "projected_graph_single",
    {
        includeIntermediateCommunities: true
    }
)
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).name as name, communityId, intermediateCommunityIds
ORDER BY name

In the case of our simple graph, the intermediateCommunityIds column contains a list with a single element corresponding to the final community. This means that one single iteration was sufficient to converge, which is not surprising given the very small size of the graph. When using this algorithm on larger graphs, you will be able to see the state of the graph at each step.

If you use includeIntermediateCommunities with the write procedure, the written property will contain a list of IDs corresponding to the intermediate community IDs, instead of a single integer encoding only the final community.

A comparison between Label Propagation and Louvain on the Zachary's karate club graph

With the small test graph we've used so far, the Label Propagation and Louvain algorithms yield the same communities, but this is not the case in general. The Zachary's karate club graph is a slightly larger graph and one of the most famous ones among graph experts. It was collected by Wayne W. Zachary, a teacher at the University of Chicago. He observed the different connections between the members of the karate club of this university in the 1970s.

The following image shows the result of Label Propagation (left) and the Louvain algorithm (right) on this graph, containing 34 nodes (students). While Label Propagation only captures two communities, the Louvain algorithm is able to detect four clusters:

Let me give you more context to understand this result better. In the karate club of the University of Chicago in around 1970, a conflict started between two instructors, which led to the club being split into two entities. At that time, both parties tried to attract students, and interactions between each instructor and the students were important. Zachary modeled these interactions with a graph, where each node is a student and edges between them indicate whether they interacted during that period. So this graph has the big advantage of having ground-truth information: Zachary recorded which part of the club each member chose to go to after the split. 

Going back to the community partition detected by the two algorithms, Label Propagation gets it right, detecting two communities consistent with the real student partition. Looking at the results of the Louvain algorithm, it seems to have detected more partitions. However, if we merge the partition containing node 17 (top) and the one containing node 2 (middle), then the results are pretty similar. The only difference is that node 10 is in the top community, whereas with Label Propagation, it is in the bottom one.

Like some famous data science datasets shipped with scikit-learn, Zachary's karate club is included in the main Python package for dealing with graphs, networkx:

import networkx as nx
G = nx.karate_club_graph()
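If you want to reproduce the comparison yourself, networkx also ships implementations of both algorithms (louvain_communities requires networkx >= 2.8; both algorithms involve randomness, so the exact number of communities can vary between runs, which is why the sketch below only checks that the results are partitions):

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# Semi-synchronous label propagation and Louvain, both from networkx
lp_communities = list(community.label_propagation_communities(G))
louvain_communities = community.louvain_communities(G, seed=42)

# Both results partition the 34 students: each node belongs to
# exactly one community.
print(len(lp_communities), len(louvain_communities))
```

Plotting each partition with nx.draw and per-community node colors is a quick way to reproduce figures like the one above.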

We now know a lot about modularity and the Louvain algorithm. In the following section, we will learn about some limitations of this algorithm, and some alternatives that have been proposed to improve it, even if they are not (yet) implemented in the GDS.

Going beyond Louvain for overlapping community detection

As all algorithms do, the Louvain algorithm has its limitations. Understanding them is very important, so we'll try to do that in this section. We will also tackle possible alternatives. Finally, we are also going to talk about some algorithms that allow a node to belong to more than one community.

A caveat of the Louvain algorithm

Like any other algorithm, the Louvain algorithm has some known drawbacks. The main one is the resolution limit.

Resolution limit

Consider the following graph, consisting of strongly connected blobs of seven nodes each, weakly connected to each other with a single edge:

Running community detection on this graph, you would expect each of the blobs to form a community. While this works well for the Louvain algorithm on small graphs, it is known to fail on larger graphs. For instance, when run on a graph with a structure similar to the one depicted in the preceding figure but with 100 blobs, the Louvain algorithm will only identify 50 communities. This problem is known as the resolution limit problem: for a graph of a given size (number of nodes and edges) and density, the communities detected by the Louvain algorithm cannot be larger (or smaller) than a given number of nodes. This means that the algorithm will fail at discovering small communities in large networks (the 100 blobs case). It will also fail at detecting large communities in small networks. A special case of the latter is a small network with no community structure at all.

You can try to identify the large network issues by checking the intermediate steps that can be returned by the GDS implementation of the Louvain algorithm, as discussed in the previous section.

Another limitation is the fact that the modularity is a global measure. As a consequence, an update in one part of the graph can have consequences far away from it. For instance, if you iteratively add blobs to the ring in the preceding diagram, at some point the communities will suddenly become very different, even if only a small part of the graph has been changed.

Alternatives to Louvain

Many alternatives to the Louvain algorithm have been proposed. Among them are the following:

  • Algorithms involving a resolution parameter γ in order to tackle the resolution limit of the Louvain algorithm.
  • The Leiden algorithm is a variant of the Louvain algorithm, which is also able to split clusters instead of just merging them. By doing so, the Leiden algorithm is able to guarantee that communities will be well-connected, which is not the case with the Louvain algorithm (see the Further reading section for links to resources going deeper into this topic).

These algorithms have been proven to perform better than Louvain in the cases listed in the preceding section. However, there are two other cases that are covered by none of the algorithms addressed so far:

  • Overlapping communities: Cases where a given node can belong to several communities
  • Dynamic networks: How do communities change when the network evolves with time (new edges and/or new nodes)

Those two topics will be covered separately in the next two subsections.

Overlapping community detection

All the algorithms we have studied so far have one point in common: each node is assigned to one and only one community. But this is not always representative of reality: a friend can also be a colleague, a colleague can also be a family member. Being able to detect the nodes belonging to several communities also brings up interesting information about the graph structure and community borders, as suggested by the following illustration.

One of the most famous algorithms for overlapping community detection is the Clique Percolation Method (CPM). A clique in graph theory is a subset of nodes where each node is connected to all other nodes. A k-clique is a clique containing k nodes. The simplest example of a clique is the 3-clique, a triangle made of three fully connected nodes. The CPM algorithm builds communities out of adjacent k-cliques, where two k-cliques are considered adjacent if they share k-1 nodes; since a node can belong to k-cliques from different communities, it can end up being assigned to several communities.
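A quick way to experiment with CPM is networkx's k_clique_communities function. The toy graph below is an assumption chosen for illustration: two triangles sharing a single node, which is exactly the overlapping situation CPM can express:

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Two triangles, (0, 1, 2) and (2, 3, 4), sharing only node 2.
# With k=3, the two 3-cliques share fewer than k-1 = 2 nodes, so they
# are not adjacent: they form two separate communities, and node 2
# belongs to both of them.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)])
communities = [set(c) for c in k_clique_communities(G, 3)]
# -> [{0, 1, 2}, {2, 3, 4}] (in some order)
```

None of the single-membership algorithms seen earlier in this chapter could return node 2 in two communities at once.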

Dynamic networks

A dynamic network is a graph evolving in time, where nodes and edges can be added, modified, or even deleted. The problem of community detection becomes much more complex because communities can do the following:

  • Appear 
  • Grow
  • Be reduced
  • Fuse with another community
  • Be split
  • Disappear

A community can also stay unchanged or even temporarily vanish only to appear again sometime later. One technique to solve such problems consists of using snapshots of the graph at different times. A static algorithm such as the ones studied in this book can then be used on each of the snapshots. However, when comparing the communities discovered in two consecutive snapshots, it will be hard to decide whether the differences are due to the real community evolution or to the algorithm instability (think about the resolution limit of the Louvain algorithm). Many solutions have been proposed to solve this issue by using smoothing techniques. For instance, you can build an algorithm that will require the communities at time t to be somehow similar to the communities at time t-1. More details about this topic can be found in the references (see the Further reading section).

We have now covered community detection algorithms, understanding how they work, when to use them, and how to use them from the GDS in Neo4j. We have even gone beyond the GDS and described some techniques for overlapping community detection. Communities are about grouping together nodes with similarities: similar neighborhood (Label Propagation) and similar edge density compared to the rest of the graph (Louvain). If you want to quantify the similarity between two nodes, you can start by checking whether they belong, or not, to the same community. But more precise metrics exist, which we are going to introduce in the following section.

Measuring the similarity between nodes

There are several techniques used to quantify the similarity between nodes. They can be divided into two categories:

  • Set-based measures: Compare the content of two sets globally. For instance, sets (A, B, C) and (C, D, B) have two common elements.
  • Vector-based measures: Compare vectors element-wise, meaning that the position of each element is important. Euclidean distance is an example of such measures.

Let's go into more detail about these metrics, starting from the set-based similarities.

Set-based similarities

The GDS 1.0 implements two variants of set-based similarities we'll cover here.

Overlapping

The overlap (or overlapping) similarity is a measure of the number of common elements between two sets, relative to the size of the smaller set.

Definition

This measure's mathematical definition is as follows:

O(A, B) = | A ∩ B | / min(|A|, |B|)

A ∩ B is the intersection between sets A and B (common elements) and |A| denotes the size of set A.

The GDS contains, in its alpha tier, a function that will allow us to test and understand this similarity measure. We use it in the following way:

RETURN gds.alpha.similarity.overlap(<set1>, <set2>) AS similarity

The following statement returns the result of O([1, 2, 3], [1, 2, 3, 4]):

RETURN gds.alpha.similarity.overlap([1,2,3], [1,2,3,4]) AS similarity

The intersection between sets [1, 2, 3] and [1, 2, 3, 4] contains the elements that are in both sets: 1, 2, and 3. It contains three elements, so its size is | A ∩ B | = 3. On the denominator, we need to find the size of the smallest set. In our case, the smallest set is [1, 2, 3] containing three elements. So the expected value for the overlapping similarity is 1, which is the value returned by the GDS function. The following table contains a few more examples:

A           B             A ∩ B       |A ∩ B|   min(|A|, |B|)   O(A, B)
[1, 2, 3]   [1, 2, 3, 4]  [1, 2, 3]   3         3               1
[1, 2, 3]   [1, 2, 4]     [1, 2]      2         3               2/3 ≈ 0.67
[1, 2]      [3, 4]        []          0         2               0

Note that the overlap similarity is symmetric; exchanging A and B does not change the result.
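The definition (and its symmetry) is easy to check with a few lines of Python. This is a throwaway helper mirroring the formula, not the GDS implementation:

```python
def overlap(a, b):
    """O(A, B) = |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))

overlap([1, 2, 3], [1, 2, 3, 4])  # -> 1.0
overlap([1, 2, 3], [1, 2, 4])     # -> 0.666...
overlap([1, 2], [3, 4])           # -> 0.0
```

Since the intersection and the min are both symmetric in A and B, swapping the arguments never changes the result.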

Quantifying user similarity in the GitHub graph

We are going to use the Neo4j community GitHub graph, containing GitHub users characterized by a login, the repositories owned by Neo4j, and relationships of type CONTRIBUTED_TO between users and the repositories they contributed to. If you have not yet built the graph from previous parts of this book, data and loading instructions are available in this book's repository on GitHub.

The first step to use similarity algorithms is to build a set of data associated with each user:

MATCH (user:User)-[:CONTRIBUTED_TO]->(repo:Repository)
WITH {item: user.login, categories: collect(repo.name)} as userData
RETURN userData

userData contains the following content, for a given user with login j:

{
    "item": "j",
    "categories": [
        "cypher-shell",
        "neo4j-ogm",
        "docker-neo4j",
        "doctools"
    ]
}

Computing the similarity between two users in that case means comparing the repositories they contributed to (stored in the categories key). In order to be usable by the GDS, we just need to replace our user login and repository name with the Neo4j internal IDs:

MATCH (user:User)-[:CONTRIBUTED_TO]->(repo:Repository)
WITH {item: id(user), categories: collect(id(repo))} as userData
RETURN userData

userData can then be used to feed the gds.alpha.similarity.overlap.stream procedure:

MATCH (user:User)-[:CONTRIBUTED_TO]->(repo:Repository)
WITH {item: id(user), categories: collect(id(repo))} as userData
WITH collect(userData) as data
CALL gds.alpha.similarity.overlap.stream(
    {
        nodeProjection: '*',
        relationshipProjection: 'CONTRIBUTED_TO',
        data: data
    }
)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN *

The procedure returns the similarity, but also intermediate results such as the number of items in each category set and the size of the intersection.

item1 and item2 are node IDs. To retrieve the node object, you can use the gds.util.asNode function. For instance, to get the user login corresponding to item1, you can write gds.util.asNode(item1).login.

Here are some chosen rows from our result:

╒════════════════════╤═════════════════╤════════╤════════╤══════════════╤══════════════════╕
│"user1" │"user2" │"count1"│"count2"│"intersection"│"similarity" │
╞════════════════════╪═════════════════╪════════╪════════╪══════════════╪══════════════════╡
│"systay" │"nawroth" │4 │13 │4 │1.0 │
├────────────────────┼─────────────────┼────────┼────────┼──────────────┼──────────────────┤
│"chrisvest" │"systay" │3 │4 │3 │1.0 │
└────────────────────┴─────────────────┴────────┴────────┴──────────────┴──────────────────┘

As you can see, even for pairs of nodes with a similarity equal to 1, there are large differences between the sizes of the intersections: we would maybe expect systay to be more similar to chrisvest than to nawroth, since nawroth contributed to nine more repositories than systay, while the difference in the number of repositories is only one between systay and chrisvest. This is addressed by the Jaccard similarity we are going to look at in the following section.

Jaccard similarity

The Jaccard similarity definition is similar to the overlap similarity, except that the denominator contains the size of the union of sets A and B:

J(A, B) = | A ∩ B | / |A ∪ B|

Using the union of both sets in the denominator is useful for identifying the cases where a user would have contributed to a single repository, with many different contributors. With the overlap similarity, this user would have a similarity of 1 to all other contributors. With the Jaccard formula, the similarity will depend on the number of repositories each of the other users contributed to and will be equal to 1 only for the contributors that have contributed to that single repository as well.
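A minimal Python helper makes the contrast with the overlap similarity explicit (illustrative code, not the GDS implementation): a set fully contained in a much larger one no longer gets a perfect score.

```python
def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

jaccard([1, 2, 3], [1, 2, 3, 4])  # -> 0.75 (overlap similarity would be 1.0)
jaccard([1, 2], [3, 4])           # -> 0.0
jaccard([1, 2, 3], [1, 2, 3])     # -> 1.0
```

Only identical sets reach a Jaccard similarity of 1.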

Running the Jaccard similarity is just as simple. The gds.alpha.similarity.jaccard function takes two lists (like its overlap counterpart), so we first collect the repositories of each user:

MATCH (u:User)-[:CONTRIBUTED_TO]->(r1:Repository)
WITH u, collect(id(r1)) as reposU
MATCH (v:User)-[:CONTRIBUTED_TO]->(r2:Repository)
WITH u, reposU, v, collect(id(r2)) as reposV
RETURN u.login, v.login, gds.alpha.similarity.jaccard(reposU, reposV) as score

You can check the similarity between systay and the other users in this graph and notice that, now, the similarity with chrisvest is much higher than the one with nawroth.

These kinds of similarities are based on a comparison between the elements nodes are connected to. In the following section, we will discover how to use vector-based similarity metrics such as the Euclidean distance or cosine similarity within the GDS. Even if they are not specific to graphs, they offer useful information when it comes to data analysis.

Vector-based similarities

Vector-based similarities are similar to the ones encountered in classical machine learning pipelines. They compare two vectors containing an ordered list of numbers.

Euclidean distance

The Euclidean distance is a measure of the distance between two points. It is an extension of the distance metrics in a Cartesian plane, and is computed using the following formula:

d(u, v) = √( (u1 - v1)² + (u2 - v2)² + ... + (un - vn)² )

Instead of simply counting the number of repositories each pair of users has in common, we can try to quantify the distance between their vectors of contribution. Let's make it clearer with an example. We will add a new property to some relationships, to count how often users contributed to a given repository:

MATCH (u1:User {login: "systay"})-[ct1]->(repo)
SET ct1.nContributions = round(rand() * 10)

I'm using random numbers here for simplicity, but it means your result may differ. If we do the same thing for another user, for instance, chrisvest, we can then create the vector of contributions of each user to each of the repositories:

MATCH (u1:User {login: 'chrisvest'})-[ct1]->(repo)
MATCH (u2:User {login: 'systay'})-[ct2]->(repo)
WITH collect(ct1.nContributions) as contrib1, collect(ct2.nContributions) as contrib2
RETURN contrib1, contrib2

The collect statement creates two lists, one for each user, containing the contribution of this user to each of the repositories. To compute the Euclidean distance between these two vectors, we just need to call the gds.alpha.similarity.euclideanDistance function:

MATCH (u1:User {login: 'chrisvest'})-[ct1]->(repo)
MATCH (u2:User {login: 'systay'})-[ct2]->(repo)
WITH collect(ct1.nContributions) as contrib1, collect(ct2.nContributions) as contrib2
RETURN gds.alpha.similarity.euclideanDistance(contrib1, contrib2) AS similarity

This leads to a distance close to 38 in my case.

Similar to the overlap similarity, the GDS also contains procedures that run on a projected graph and compute the similarities between all pairs of nodes.

Remember that you can find the full name and signature of the functions and procedures available in your version of the GDS with the following:

CALL gds.list() YIELD name, signature
WHERE name =~ '.*euclidean.*'
RETURN *

Cosine similarity

Cosine similarity is well-known, especially within the NLP community, since it is widely used to measure the similarity between two texts. Instead of computing the distance between two points like in Euclidean similarity, cosine similarity is based on the angle between two vectors. Consider the following scenario:

The Euclidean distance between vectors A and C is dAC, while the angle θAC determines their cosine similarity, cos(θAC).

Similarly, the Euclidean distance between A and B is represented by the dAB line, but the angle between them is 0, and since cos(0)=1, the cosine similarity between A and B is 1 – much higher than the cosine similarity between A and C.

To replace this example in the context of the GitHub contributors, we could have the following:

  • A contributes to two repositories R1 and R2, with 5 contributions to R1 (x axis) and 10 contributions to R2 (y axis).
  • B's contributions to the same repositories are: 1 contribution to R1 and 2 contributions to R2, so that the vectors are aligned.
  • C has more contributions to R1 than A does (let's say 8), but fewer contributions to R2 (also 8).

Looking only at the total number of contributions, users A and C are more similar than A and B. However, the distribution of contributions of users A and B is more similar: each of them contributed twice as often to R2 as to R1. This last fact is encoded in the cosine similarity, which is higher between A and B than between A and C (see the following table).
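To make this concrete, here is a short plain-Python sketch of the cosine formula (dot product divided by the product of the norms), applied to the A, B, and C vectors above. This is an illustration of the metric, not the GDS implementation itself:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A, B, C = [5, 10], [1, 2], [8, 8]
print(cosine_similarity(A, B))  # ~1.0, since A and B are aligned
print(cosine_similarity(A, C))  # ~0.95
```

Because B is just A scaled down by a factor of 5, the angle between them is 0 and the similarity is 1, regardless of the total number of contributions.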

Cosine similarity usage with the GDS is identical to that of the Euclidean distance, using the following names:

  • A function, gds.alpha.similarity.cosine
  • Two procedures, gds.alpha.similarity.cosine.stream and gds.alpha.similarity.cosine.write

The following table shows a comparison of the Euclidean and cosine similarity between users A, B, and C with the values given previously:

  • A/B:
      Euclidean: RETURN gds.alpha.similarity.euclidean([5, 10], [1, 2]) ≈ 0.10
      Cosine: RETURN gds.alpha.similarity.cosine([5, 10], [1, 2]) = 1.0
  • A/C:
      Euclidean: RETURN gds.alpha.similarity.euclidean([5, 10], [8, 8]) ≈ 0.22
      Cosine: RETURN gds.alpha.similarity.cosine([5, 10], [8, 8]) ≈ 0.95
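Note that the Euclidean column in the table is a similarity, not a raw distance: the values shown are consistent with the distance being mapped into (0, 1] as 1 / (1 + d). A small sketch under that assumption:

```python
import math

def euclidean_similarity(a, b):
    """Euclidean distance mapped to a (0, 1] similarity as 1 / (1 + d).

    This matches the table values above; it is an assumption about the
    formula, not a transcription of the GDS source code."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

print(round(euclidean_similarity([5, 10], [1, 2]), 2))  # 0.1
print(round(euclidean_similarity([5, 10], [8, 8]), 2))  # 0.22
```

With this mapping, identical vectors get a similarity of 1, and the similarity decreases toward 0 as the distance grows.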

That's the end of our section dedicated to node similarity. We will come back to it in the following chapters, when we use some of the metrics discovered in this chapter and the preceding ones as features in a machine learning model.

Summary

In this chapter, we talked a lot about ways to measure the similarity between nodes, either on a global scale by grouping nodes into communities or with a more local similarity assessment, for example, using the Jaccard similarity metric. Several algorithms were studied – the weakly and strongly connected components, the Label Propagation algorithm, and the Louvain algorithm. We also used a feature offered by the GDS that allows us to write the results of an algorithm into Neo4j for future use. Finally, we used two new tools to visualize a graph and the results of the graph algorithms implemented in the GDS: neovis.js, which is used to embed a Neo4j graph visualization into an HTML page, and NEuler, the Graph Algorithms Playground, from which you can run a graph algorithm without writing code.

Our exploration of the algorithms implemented in the GDS (1.0) is now finished. In the next chapters, we will learn how to use graphs and these algorithms in a machine learning pipeline to make predictions. To start with, in the following chapter, we will build a machine learning pipeline and introduce a graph-based feature to enhance performance.

Questions

  • Connected components:
  • What do you think would happen if you ran the strongly connected component algorithm on an undirected projected graph?
  • How will the community structure of the test graph change if we add a relationship from D to C? Or if we remove the relationship from D to E?
  • Label propagation:
  • Update our algorithm implementation to take into account seeds, that is, prior knowledge about the node community.

Further reading

  • Applications of community detection techniques to brain graphs: Algorithmic considerations and implications for neural function, J. O. Garcia et al., Proceedings of the IEEE, doi: 10.1109/JPROC.2017.278671
  • A Method for the Analysis of the Structure of Complex Organizations, R. S. Weiss and E. Jacobson, American Sociological Review, Vol. 20, No. 6, 1955, pp. 661-668. doi:10.2307/2088670
  • The Louvain algorithm: original paper:
    https://arxiv.org/abs/0803.0476
  • Community detection in dynamic networks, a survey, G. Rossetti & R. Cazabet, ACM Journal (Full text available at https://hal.archives-ouvertes.fr/hal-01658399/)