A brief introduction to graph theory

To better understand graphs, let's look at Facebook and how you typically use Facebook. Every day you use your smart phone to post messages on your friend's wall or update your status. Your friends are all posting messages and photos and videos of their own.

You have friends, your friends have friends, who have friends, and so on. Facebook has settings that let you make new friends or remove friends from your friend list. Facebook also has permissions, which allow granular control on who sees what and who can communicate with who.

Now, when you consider that there are a billion Facebook users, the friends and friend's friends list for all users gets quite large and complicated. It is hard to even comprehend and manage all the different relationships or friendships.

So, if someone wants to find out if you and another person X are related at all, they can simply start by looking at all your friends and all your friends' friends, and so on, and try to get to the person X. If person X is a friend of a friend, then you and person X are indirectly connected.

Search for a celebrity or two in your Facebook account and see if someone is a friend of your friend. Maybe you can try to add them as a friend.

We need to build the storage and retrieval of such data about people and their friends so as to allow us to answer questions such as:

  • Is X a friend of Y?
  • Are X and Y connected directly or within two steps?
  • How many friends does X have?

We can start by trying out a simple data structure such as an array such that every person has an array of friends. So now, it's easy to just take the length of the array to answer 3. We can also just scan the array and quickly answer 1. Now, question 2 will need little more work, take the array of friends of X and for each such friend scan the array of friends.

We have sort of solved the problem by having a specialized data structure as shown in the following example where we create a case class Person and then add friends to build a relationship like this john | ken | mary | dan:

 

case class Person(name: String) {
val friends = scala.collection.mutable.ArrayBuffer[Person]()

def numberOfFriends() = friends.length

def isFriend(other: Person) = friends.find(_.name == other.name)

def isConnectedWithin2Steps(other: Person) = {
for {f <- friends} yield {f.name == other.name ||
f.isFriend(other).isDefined}

}.find(_ == true).isDefined
}

scala> val john = Person("John")
john: Person = Person(John)

scala> val ken = Person("Ken")
ken: Person = Person(Ken)

scala> val mary = Person("Mary")
mary: Person = Person(Mary)

scala> val dan = Person("Dan")
dan: Person = Person(Dan)

scala> john.numberOfFriends
res33: Int = 0

scala> john.friends += ken
res34: john.friends.type = ArrayBuffer(Person(Ken)) //john -> ken

scala> john.numberOfFriends
res35: Int = 1

scala> ken.friends += mary
res36: ken.friends.type = ArrayBuffer(Person(Mary)) //john -> ken -> mary

scala> ken.numberOfFriends
res37: Int = 1

scala> mary.friends += dan
res38: mary.friends.type = ArrayBuffer(Person(Dan)) //john -> ken -> mary -> dan

scala> mary.numberOfFriends
res39: Int = 1

scala> john.isFriend(ken)
res40: Option[Person] = Some(Person(Ken)) //Yes, ken is a friend of john

scala> john.isFriend(mary)
res41: Option[Person] = None //No, mary is a friend of ken not john

scala> john.isFriend(dan)
res42: Option[Person] = None //No, dan is a friend of mary not john

scala> john.isConnectedWithin2Steps(ken)
res43: Boolean = true //Yes, ken is a friend of john

scala> john.isConnectedWithin2Steps(mary)
res44: Boolean = true //Yes, mary is a friend of ken who is a friend of john

scala> john.isConnectedWithin2Steps(dan)
res45: Boolean = false //No, dan is a friend of mary who is a friend of ken who is a friend of john

If we build out the Person() instances for all Facebook users and add the friends to the arrays as the preceding code shows, then eventually, we will be able to perform lots of the queries on who is a friend and what is the relationship between any two persons.

The following diagram shows the data structures' Person() instances and how they are related to each other logically:

If you want to use the preceding graph and just find out John's friends, John's friend's friends and so on so that we can quickly find out direct friends, indirect friends (friends level 2), and level 3 (friends' friends' friends), you will see something like the following diagram:

We can easily extend the Person() class and provide more and more functionality to answer different questions. That is not the point here, what we want to look at is the preceding diagram showing Person and friends of the Person and how drawing all the friends of each Person yields in a mesh of relationships between persons.

We now introduce the graph theory, which stems from the field of Mathematics. Graph theory defines a graph as a structure made up of vertices, nodes, or points, which are connected by edges, arcs, and lines. If you consider a set of Vertices as V and a set of Edges as E, then a Graph G can be defined as an ordered pair of V and E.

Graph G = (V, E)
V - set of Vertices
E - set of Edges

In our example of the Facebook friends drawing, we can simply consider each of the persons as a vertex in the set of vertices and then each link between any two persons can be considered as an edge in the set of edges.

By this logic, we can list the Vertices and Edges as shown in the following diagram:

This depiction as a mathematical graph yields to various methodologies of traversing and querying the Graph using mathematical techniques. When the techniques are applied to computer science as a way to develop programmatical methods to perform the math necessary, the formal approach is, of course, to develop algorithms to implement the mathematical rules at a scalable efficient level.

We have already attempted to implement a simple graph-like program using the case class Person, but this is just the simplest use case, which should make it obvious that there are a lot of sophisticated extensions possible, such as the following questions to be answered:

  • What's the best way from X to Y? An example of such a question can be your car GPS telling you the best way to go to the grocery store.
  • Recognize the critical edges, which can cause partitions of the graph? An example of such a question is to determine the critical links connecting the internet services/water pipes/power lines of various cities in the state. A critical edge breaks connectivity and produces two subgraphs of well-connected cities, but there will not be any communication between the two subgraphs.

Answering the preceding questions yields to several algorithms such as minimum spanning tree, shortest path , page rank, ALS (alternating least squares), and max-cut min-flow algorithms, and so on, which are applicable to a broad set of use cases.

The other examples are LinkedIn profiles and connections, Twitter followers, Google page rank, airline scheduling, GPS in your car, and so on, where you can clearly see a graph of vertices and edges. Using graph algorithms, the graph seen earlier in the Facebook, LinkedIn, Google examples can be analyzed using various algorithms to yield different business use cases.

Shown below are illustration of some real-life use cases of graphs which show the use of graphs and graph algorithms in some real-life use cases such as:

  • help determine flight routes between airports
  • plan how to layout water pipelines to all the households in the locality
  • make your car GPS to plan the route to drive to the grocery
  • design how the internet traffic is routed from city to city, state to state and country to country

Let's now start digging deeper into how we can use Spark GraphX.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.231.245