With more and more people interacting, communicating, exchanging information, or simply sharing a common interest in different topics, most data science use cases can be addressed using graph representations. Although very large graphs were, for a long time, only used by Internet giants, governments, and national security agencies, it is becoming more commonplace to work with large graphs containing millions of vertices. Hence, the main challenge for a data scientist will not necessarily be to detect communities and find influencers on graphs, but rather to do so in a fully distributed and efficient way in order to overcome the constraint of scale. This chapter progresses through building a graph example, at scale, using the persons we identified with the NLP extraction described in Chapter 6, Scraping Link-Based External Data.
In this chapter, we will cover the following topics:
This chapter being quite technical, we expect the reader to be already familiar with graph theory, message passing, and the Pregel API. We also invite the reader to read every white paper mentioned in the chapter.
We previously used NLP entity recognition to identify persons from raw HTML text. In this chapter, we move to a lower level by trying to infer relations between these entities and detect the possible communities surrounding them.
Within the context of news articles, we first need to ask ourselves a fundamental question. What defines a relation between two entities? The most elegant answer would probably be to study words using the Stanford NLP libraries described in Chapter 6, Scraping Link-Based External Data. Given the following input sentence, which is taken from http://www.ibtimes.co.uk/david-bowie-yoko-ono-says-starmans-death-has-left-big-empty-space-1545160:
"Yoko Ono said she and late husband John Lennon shared a close relationship with David Bowie"
We could easily extract the syntactic tree, a structure that linguists use to model how sentences are grammatically built, where each element is reported with its type, such as a noun (NN), a verb (VB), or a determiner (DT), and its relative position in the sentence:
val processor = new CoreNLPProcessor()
val document = processor.annotate(text)

document.sentences foreach { sentence =>
  println(sentence.syntacticTree.get)
}

/*
(NNP Yoko) (NNP Ono) (VBD said) (PRP she) (CC and)
(JJ late) (NN husband) (NNP John) (NNP Lennon) (VBD shared)
(DT a) (JJ close) (NN relationship) (IN with)
(NNP David) (NNP Bowie)
*/
A thorough study of each element, its type, its predecessors, and successors would help build a directed graph with edges being the true definitions of the relations that exist between all these three entities. An example of a graph built out of that sentence is reported as follows:
Although it makes perfect sense (grammatically speaking), building a graph out of syntactic trees would require an excessive amount of coding, would probably deserve a whole chapter on its own, and does not bring much added value, since most of the relations we would build (in the context of news articles) would not be based on true facts taken from history books, but rather would need to be put in their context. To illustrate this point, consider the following two sentences, which are taken from http://www.digitalspy.com/music/news/a779577/paul-mccartney-pays-tribute-to-great-star-david-bowie-his-star-will-shine-in-the-sky-forever/:
"Sir Paul McCartney described [David Bowie] as a great star"
"[Sir Paul McCartney] treasure[s] the moments they had together"
Grammatical parsing would create the same link between the vertices [Paul McCartney] and [David Bowie] in both cases, while only the latter sentence implies a physical connection between them (they actually spent time together).
Instead, we use a much faster approach by grouping names based on their positions within a text. Our naive assumption is that most authors mention the names of important people first, then write about secondary characters, and lastly about less important persons. Our contact chaining is therefore a simple nested loop across all the names in a given article, names being sorted from the most to the least important using their position in the text. Because of its quadratic time complexity, O(n²), this approach is only practical for articles mentioning up to a few hundred entities and would certainly become a limiting factor for text mentioning hundreds of thousands of different entities.
def buildTuples(p: Array[String]): Array[(String, String)] = {
  (for (i <- 0 to p.length - 2; j <- i + 1 to p.length - 1) yield {
    (p(i), p(j))
  }).toArray
}
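As a quick local sanity check, the pair generation can be exercised on a small array (the names here are illustrative); for n input names it emits n(n - 1)/2 ordered pairs, most important names first:

```scala
// Standalone sketch of the pair generation, runnable without Spark.
// For n input names it yields n * (n - 1) / 2 ordered pairs.
def buildTuples(p: Array[String]): Array[(String, String)] =
  (for (i <- 0 to p.length - 2; j <- i + 1 to p.length - 1)
    yield (p(i), p(j))).toArray

val pairs = buildTuples(Array("yoko ono", "john lennon", "david bowie"))
// pairs contains ("yoko ono", "john lennon"), ("yoko ono", "david bowie")
// and ("john lennon", "david bowie")
```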
In our code repository, you will see an alternative: Combinations, which is a more generic solution that allows the specification of a variable r; this lets us specify the number of entities that need to appear in each output combination, that is, 2 for this chapter but more in other contexts. Using Combinations.buildTuples is functionally equivalent to the buildTuples code given earlier.
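To make the idea concrete, here is a minimal sketch of how such a generic Combinations helper could look; this is an illustration, not the actual repository code, and relies on Scala's built-in combinations iterator, which preserves the original (importance) ordering of the input:

```scala
// Hypothetical sketch of a generic Combinations helper: given the ordered
// list of names and a size r, emit every r-sized combination while
// preserving the original ordering of the names.
object Combinations {
  def buildTuples(names: Array[String], r: Int = 2): Array[Array[String]] =
    names.combinations(r).toArray
}

// With r = 2 this matches buildTuples; with r = 3 it would emit triples.
val triples = Combinations.buildTuples(Array("a", "b", "c", "d"), 3)
```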
Elasticsearch is a perfect tool for storing and indexing text content together with its metadata attributes, and was therefore a logical choice for our online data store using the text content we extracted in the previous chapter. As this section is more batch-process oriented, we get the data from Elasticsearch into our Spark cluster using the excellent Spark Elasticsearch API as shown in the following code:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark_2.11</artifactId>
  <version>2.4.0</version>
</dependency>
Given an index name and type, a convenient way of interacting with the Elasticsearch API is using a Spark DataFrame. While efficient enough in most use cases (a simple example is shown next), this can become a challenge when working with more complex and nested schemas:
val spark = SparkSession
  .builder()
  .config("es.nodes", "localhost")
  .config("es.port", "9200")
  .appName("communities-es-download")
  .getOrCreate()

spark
  .read
  .format("org.elasticsearch.spark.sql")
  .load("gzet/news")
  .select("title", "url")
  .show(5)

+--------------------+--------------------+
|               title|                 url|
+--------------------+--------------------+
|Sonia Meets Mehbo...|http://www.newind...|
|"A Job Well Done ...|http://daphneanso...|
|New reading progr...|http://www.mailtr...|
|Barrie fire servi...|http://www.simcoe...|
|Paris police stat...|http://www.dailym...|
+--------------------+--------------------+
In fact, the Elasticsearch API is not flexible enough to read nested structures and complex arrays. Using the latest versions of Spark, one quickly runs into errors such as "Field 'persons' is backed by an array but the associated Spark Schema does not reflect this". With some experimentation, we can see that accessing nested and complex structures from Elasticsearch is usually much easier using a set of standard JSON parsers (such as json4s in the following code):
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-native_2.11</artifactId>
  <version>3.2.11</version>
</dependency>
We query Elasticsearch using the implicit esJsonRDD function from the Spark context:
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._
import org.json4s.DefaultFormats
import org.json4s.native.JsonMethods._

def readFromES(query: String = "?q=*"): RDD[Array[String]] = {
  sc.esJsonRDD("gzet/news", query)
    .values
    .map { jsonStr =>
      implicit val format = DefaultFormats
      val json = parse(jsonStr)
      (json \ "persons").extract[Array[String]]
    }
}

readFromES("?persons='david bowie'")
  .map(_.mkString(","))
  .take(3)
  .foreach(println)

/*
david bowie,yoko ono,john lennon,paul mc cartney
duncan jones,david bowie,tony visconti
david bowie,boris johnson,david cameron
*/
Using the query parameter, we can access all the data from Elasticsearch, a sample of it, or only the records matching a specific query. We can finally build our list of tuples using the simple contact chaining method explained earlier.
val personRdd = readFromES()
val tupleRdd = personRdd flatMap buildTuples
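Note that duplicate tuples across articles carry signal: the more often two names co-occur, the stronger their relation. A minimal local sketch of that reduction is shown below; on the cluster, the same logic would be a countByValue or reduceByKey over tupleRdd, and the sample names here are illustrative:

```scala
// Local sketch: counting how often each (source, target) pair occurs,
// which gives the weight of the corresponding edge. Sample data is made up.
val tuples = Seq(
  ("david bowie", "yoko ono"),
  ("david bowie", "yoko ono"),
  ("david bowie", "john lennon")
)

val weightedEdges: Map[(String, String), Int] =
  tuples.groupBy(identity).mapValues(_.size).toMap
// weightedEdges(("david bowie", "yoko ono")) == 2
```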