With more and more people interacting, communicating, exchanging information, or simply sharing a common interest in different topics, most data science use cases can be addressed using graph representations. Although very large graphs were, for a long time, only used by Internet giants, governments, and national security agencies, it is becoming more commonplace to work with large graphs containing millions of vertices. Hence, the main challenge for a data scientist will not necessarily be to detect communities and find influencers on graphs, but rather to do so in a fully distributed and efficient way in order to overcome the constraint of scale. This chapter progresses through building a graph example, at scale, using the persons we identified with the NLP extraction described in Chapter 6, Scraping Link-Based External Data.
In this chapter, we will cover the following topics:
This chapter being quite technical, we expect the reader to be already familiar with graph theory, message passing, and the Pregel API. We also invite the reader to read every white paper mentioned in the chapter.
We previously used NLP entity recognition to identify persons from raw HTML text. In this chapter, we move to a lower level by trying to infer relations between these entities and detect the possible communities surrounding them.
Within the context of news articles, we first need to ask ourselves a fundamental question. What defines a relation between two entities? The most elegant answer would probably be to study words using the Stanford NLP libraries described in Chapter 6, Scraping Link-Based External Data. Given the following input sentence, which is taken from http://www.ibtimes.co.uk/david-bowie-yoko-ono-says-starmans-death-has-left-big-empty-space-1545160:
"Yoko Ono said she and late husband John Lennon shared a close relationship with David Bowie"
We could easily extract the syntactic tree, a structure that linguists use to model how sentences are grammatically built, where each element is reported with its type, such as a noun (NN), a verb (VB), or a determiner (DT), and its relative position in the sentence:
val processor = new CoreNLPProcessor()
val document = processor.annotate(text)

document.sentences foreach { sentence =>
  println(sentence.syntacticTree.get)
}

/*
(NNP Yoko) (NNP Ono) (VBD said) (PRP she) (CC and)
(JJ late) (NN husband) (NNP John) (NNP Lennon) (VBD shared)
(DT a) (JJ close) (NN relationship) (IN with)
(NNP David) (NNP Bowie)
*/
A thorough study of each element, its type, its predecessors, and successors would help build a directed graph with edges being the true definitions of the relations that exist between all these three entities. An example of a graph built out of that sentence is reported as follows:
Although it makes perfect sense (grammatically speaking), building a graph out of syntactic trees would require an excessive amount of coding, would probably deserve a whole chapter on its own, and does not bring much added value, since most of the relations we would build (in the context of news articles) would not be based on true facts taken from history books, but rather would need to be put in their context. To illustrate this point, consider the following two sentences, which are taken from http://www.digitalspy.com/music/news/a779577/paul-mccartney-pays-tribute-to-great-star-david-bowie-his-star-will-shine-in-the-sky-forever/:
"Sir Paul McCartney described [David Bowie] as a great star"
"[Sir Paul McCartney] treasure[s] the moments they had together"
Grammatical parsing would create the same link between the vertices [Paul McCartney] and [David Bowie] in both cases, while only the latter sentence implies a physical connection between them (they actually spent time together).
Instead, we use a much faster approach by grouping names based on their positions within a text. Our naive assumption is that most authors mention the names of important people first, then write about secondary characters, and lastly about less important persons. Our contact chaining is therefore a simple nested loop across all the names in a given article, names being sorted from the most to the least important using their position in the text. Because of its quadratic time complexity, O(n²), this approach is only practical for articles mentioning up to a few hundred entities and would certainly become a limiting factor for text mentioning hundreds of thousands of different entities.
def buildTuples(p: Array[String]): Array[(String, String)] = {
  (for (i <- 0 to p.length - 2; j <- i + 1 to p.length - 1) yield {
    (p(i), p(j))
  }).toArray
}
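As a quick local sanity check, the pair generation can be exercised on a small array (the names here are illustrative); for n input names it emits n(n - 1)/2 ordered pairs, most important names first:

```scala
// Standalone sketch of the pair generation, runnable without Spark.
// For n input names it yields n * (n - 1) / 2 ordered pairs.
def buildTuples(p: Array[String]): Array[(String, String)] =
  (for (i <- 0 to p.length - 2; j <- i + 1 to p.length - 1)
    yield (p(i), p(j))).toArray

val pairs = buildTuples(Array("yoko ono", "john lennon", "david bowie"))
// pairs contains ("yoko ono", "john lennon"), ("yoko ono", "david bowie")
// and ("john lennon", "david bowie")
```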
In our code repository, you will see an alternative: Combinations, which is a more generic solution that allows the specification of a variable r; this lets us specify the number of entities that need to appear in each output combination, that is, 2 for this chapter but more in other contexts. Using Combinations.buildTuples is functionally equivalent to the buildTuples code given earlier.
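To make the idea concrete, here is a minimal sketch of how such a generic Combinations helper could look; this is an illustration, not the actual repository code, and relies on Scala's built-in combinations iterator, which preserves the original (importance) ordering of the input:

```scala
// Hypothetical sketch of a generic Combinations helper: given the ordered
// list of names and a size r, emit every r-sized combination while
// preserving the original ordering of the names.
object Combinations {
  def buildTuples(names: Array[String], r: Int = 2): Array[Array[String]] =
    names.combinations(r).toArray
}

// With r = 2 this matches buildTuples; with r = 3 it would emit triples.
val triples = Combinations.buildTuples(Array("a", "b", "c", "d"), 3)
```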
Elasticsearch is a perfect tool for storing and indexing text content together with its metadata attributes, and was therefore a logical choice for our online data store using the text content we extracted in the previous chapter. As this section is more batch-process oriented, we get the data from Elasticsearch into our Spark cluster using the excellent Spark Elasticsearch API as shown in the following code:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark_2.11</artifactId>
  <version>2.4.0</version>
</dependency>
Given an index name and type, a convenient way of interacting with the Elasticsearch API is using a Spark DataFrame. While efficient enough in most use cases (a simple example is shown next), this can become a challenge when working with more complex and nested schemas:
val spark = SparkSession
  .builder()
  .config("es.nodes", "localhost")
  .config("es.port", "9200")
  .appName("communities-es-download")
  .getOrCreate()

spark
  .read
  .format("org.elasticsearch.spark.sql")
  .load("gzet/news")
  .select("title", "url")
  .show(5)

+--------------------+--------------------+
|               title|                 url|
+--------------------+--------------------+
|Sonia Meets Mehbo...|http://www.newind...|
|"A Job Well Done ...|http://daphneanso...|
|New reading progr...|http://www.mailtr...|
|Barrie fire servi...|http://www.simcoe...|
|Paris police stat...|http://www.dailym...|
+--------------------+--------------------+
In fact, the Elasticsearch API is not flexible enough to read nested structures and complex arrays. Using the latest versions of Spark, one quickly runs into errors such as "Field 'persons' is backed by an array but the associated Spark Schema does not reflect this". With some experimentation, we can see that accessing nested and complex structures from Elasticsearch is usually much easier using a set of standard JSON parsers (such as json4s in the following code):
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-native_2.11</artifactId>
  <version>3.2.11</version>
</dependency>
We query Elasticsearch using the implicit esJsonRDD function from the Spark context:
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._
import org.json4s.DefaultFormats
import org.json4s.native.JsonMethods._

def readFromES(query: String = "?q=*"): RDD[Array[String]] = {
  sc.esJsonRDD("gzet/news", query)
    .values
    .map { jsonStr =>
      implicit val format = DefaultFormats
      val json = parse(jsonStr)
      (json \ "persons").extract[Array[String]]
    }
}

readFromES("?persons='david bowie'")
  .map(_.mkString(","))
  .take(3)
  .foreach(println)

/*
david bowie,yoko ono,john lennon,paul mc cartney
duncan jones,david bowie,tony visconti
david bowie,boris johnson,david cameron
*/
Using the query parameter, we can access all the data from Elasticsearch, a sample of it, or only the records matching a specific query. We can finally build our list of tuples using the simple contact chaining method explained earlier.
val personRdd = readFromES()
val tupleRdd = personRdd flatMap buildTuples
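Note that duplicate tuples across articles carry signal: the more often two names co-occur, the stronger their relation. A minimal local sketch of that reduction is shown below; on the cluster, the same logic would be a countByValue or reduceByKey over tupleRdd, and the sample names here are illustrative:

```scala
// Local sketch: counting how often each (source, target) pair occurs,
// which gives the weight of the corresponding edge. Sample data is made up.
val tuples = Seq(
  ("david bowie", "yoko ono"),
  ("david bowie", "yoko ono"),
  ("david bowie", "john lennon")
)

val weightedEdges: Map[(String, String), Int] =
  tuples.groupBy(identity).mapValues(_.size).toMap
// weightedEdges(("david bowie", "yoko ono")) == 2
```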