This chapter describes a few common techniques of Natural Language Processing (NLP), specifically the ones that can benefit from Scala. A number of NLP packages exist in the open source; the most famous of them is probably NLTK (http://www.nltk.org), which is written in Python, and there is arguably an even larger number of proprietary software solutions emphasizing different aspects of NLP. It is worth mentioning Wolfe (https://github.com/wolfe-pack), FACTORIE (http://factorie.cs.umass.edu), ScalaNLP (http://www.scalanlp.org), and Skymind (http://www.skymind.io), which is partly proprietary. However, few open source projects in this area remain active for long, for one reason or another, and most are being eclipsed by the capabilities of Spark and MLlib, particularly in the scalability aspect.
Instead of giving a detailed description of each NLP project, which might also include speech-to-text, text-to-speech, and language translation, in this chapter I will cover a few basic techniques focused on leveraging Spark MLlib. The chapter comes very naturally as the last analytics chapter in this book: Scala is a computer language that looks a lot like a natural language, and this chapter will leverage the techniques I developed earlier.
NLP is arguably the core of AI. Originally, AI was created to mimic humans, and natural language parsing and understanding is an indispensable part of that. Big data techniques have started to penetrate NLP, even though traditionally NLP is very computationally intensive and is regarded as a small data problem. NLP often requires extensive deep learning techniques, but the volume of all written texts appears small compared to the volume of logs generated by all the machines today and analyzed by the big data machinery.
Even though the Library of Congress counts millions of documents, most of them could be digitized into PBs of actual digital data, a volume that any social website is able to collect, store, and analyze within a few seconds. The complete works of the most prolific authors can be stored within a few MBs of files (refer to Table 09-1). Nonetheless, social network and ad tech companies parse text from millions of users in hundreds of contexts every day.
| The complete works of | When lived | Size |
|---|---|---|
| Plato | 428/427 (or 424/423) - 348/347 BC | 2.1 MB |
| William Shakespeare | 26 April 1564 (baptized) - 23 April 1616 | 3.8 MB |
| Fyodor Dostoevsky | 11 November 1821 - 9 February 1881 | 5.9 MB |
| Leo Tolstoy | 9 September 1828 - 20 November 1910 | 6.9 MB |
| Mark Twain | November 30, 1835 - April 21, 1910 | 13 MB |
Table 09-1. Complete Works collections of some famous writers (most can be acquired on Amazon.com today for a few dollars; later authors, although readily digitized, are more expensive)
Natural language is a dynamic concept that changes over time, with technology, and across generations. We have seen the appearance of emoticons, three-letter abbreviations, and so on, and languages tend to borrow from each other; describing this dynamic ecosystem is a challenge in itself.
As in the previous chapters, I will focus on how to use Scala as a tool to orchestrate the language analysis rather than rewriting the tools in Scala. As the topic is so large, I will not claim to cover all aspects of NLP here.
In this chapter, we will cover the following topics:
Before we proceed to detailed algorithms, let's look at a generic text-processing pipeline depicted in Figure 9-1. In text analysis, the input is usually presented as a stream of characters (depending on the specific language).
Lexical analysis has to do with breaking this stream into a sequence of words (or lexemes, in linguistic analysis). It is often also called tokenization (and the words are then called tokens). ANother Tool for Language Recognition (ANTLR) (http://www.antlr.org/) and Flex (http://flex.sourceforge.net) are probably the most famous in the open source community. One of the classical examples of ambiguity is lexical ambiguity: in the phrase I saw a bat, the word bat can mean either an animal or a baseball bat. We usually need context to figure this out, which we will discuss next (a minimal tokenization sketch in plain Scala follows the pipeline stages below):
Syntactic analysis, or parsing, traditionally deals with matching the structure of the text against grammar rules. This is relatively more important for computer languages, which do not allow any ambiguity. In natural languages, this process is usually called chunking and tagging. In many cases, the meaning of a word in human language is subject to context, intonation, or even body language or facial expression. The value of such analysis, as opposed to the big data approach where the volume of data trumps complexity, is still a contentious topic; one example of the latter is the word2vec approach, which will be described later.
Semantic analysis is the process of extracting language-independent meaning from the syntactic structures. As much as possible, it also involves removing features specific to particular cultural and linguistic contexts, to the extent that such a project is possible. The sources of ambiguity at this stage are phrase attachment, conjunction, noun group structure, semantic ambiguity, anaphora, non-literal speech, and so on. Again, word2vec partially deals with these issues.
Discourse integration partially deals with the issue of context: the meaning of a sentence or an idiom can depend on the sentences or paragraphs that precede it. Syntactic analysis and cultural background play an important role here.
Finally, pragmatic analysis is yet another layer of complexity trying to reinterpret what is said in terms of what the intention was. How does this change the state of the world? Is it actionable?
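To make the lexical analysis stage a bit more concrete, here is a minimal tokenization sketch in plain Scala (no Spark); the regex and lowercase normalization are illustrative choices rather than any particular library's API:

object TokenizeExample extends App {
  // Break a character stream into lowercase tokens using a non-word-character split.
  val text = "I saw a bat."
  val tokens = text.split("\\W+").filter(_.nonEmpty).map(_.toLowerCase)
  println(tokens.mkString("[", ", ", "]"))  // prints [i, saw, a, bat]
  // Lexical analysis alone cannot resolve whether "bat" is an animal or a baseball bat;
  // that requires the later, context-aware stages of the pipeline.
}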
The most straightforward representation of a document is a bag of words. Scala and Spark provide an excellent paradigm for analyzing word distributions. First, we read the whole collection of texts and then count the words:
$ bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val leotolstoy = sc.textFile("leotolstoy").cache
leotolstoy: org.apache.spark.rdd.RDD[String] = leotolstoy MapPartitionsRDD[1] at textFile at <console>:27

scala> leotolstoy.flatMap(_.split("\\W+")).count
res1: Long = 1318234

scala> val shakespeare = sc.textFile("shakespeare").cache
shakespeare: org.apache.spark.rdd.RDD[String] = shakespeare MapPartitionsRDD[7] at textFile at <console>:27

scala> shakespeare.flatMap(_.split("\\W+")).count
res2: Long = 1051958
This gives us just an estimate of the total number of words in the repertoire of quite different authors. The simplest way to find the overlap between the two corpora is to find the common vocabulary (which will be quite different, as Leo Tolstoy wrote in Russian and French, while Shakespeare was an English-writing author):
scala> :silent

scala> val shakespeareBag = shakespeare.flatMap(_.split("\\W+")).map(_.toLowerCase).distinct

scala> val leotolstoyBag = leotolstoy.flatMap(_.split("\\W+")).map(_.toLowerCase).distinct
leotolstoyBag: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at map at <console>:29

scala> println("The bags intersection is " + leotolstoyBag.intersection(shakespeareBag).count)
The bags intersection is 11552
A few thousand word indices are manageable with current implementations. For any new story, we can determine whether it is more likely to have been written by Leo Tolstoy or William Shakespeare. Let's take a look at The King James Version of the Bible, which can also be downloaded from Project Gutenberg (https://www.gutenberg.org/files/10/10-h/10-h.htm):
$ (mkdir bible; cd bible; wget http://www.gutenberg.org/cache/epub/10/pg10.txt)

scala> val bible = sc.textFile("bible").cache

scala> val bibleBag = bible.flatMap(_.split("\\W+")).map(_.toLowerCase).distinct

scala> :silent

scala> bibleBag.intersection(shakespeareBag).count
res5: Long = 7250

scala> bibleBag.intersection(leotolstoyBag).count
res24: Long = 6611
This seems reasonable, as religious language was prevalent in Shakespeare's time. On the other hand, plays by Anton Chekhov have a larger intersection with Leo Tolstoy's vocabulary:
$ (mkdir chekhov; cd chekhov
wget http://www.gutenberg.org/cache/epub/7986/pg7986.txt
wget http://www.gutenberg.org/cache/epub/1756/pg1756.txt
wget http://www.gutenberg.org/cache/epub/1754/1754.txt
wget http://www.gutenberg.org/cache/epub/13415/pg13415.txt)

scala> val chekhov = sc.textFile("chekhov").cache
chekhov: org.apache.spark.rdd.RDD[String] = chekhov MapPartitionsRDD[61] at textFile at <console>:27

scala> val chekhovBag = chekhov.flatMap(_.split("\\W+")).map(_.toLowerCase).distinct
chekhovBag: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at distinct at <console>:29

scala> chekhovBag.intersection(leotolstoyBag).count
res8: Long = 8263

scala> chekhovBag.intersection(shakespeareBag).count
res9: Long = 6457
This very simple approach works, but there are a number of commonly known improvements we can make. First, a common technique is to stem the words. In many languages, words have a common part, often called the root, and a changeable prefix or suffix, which may depend on the context, gender, tense, and so on. Stemming is the process of improving the distinct counts and intersections by reducing each of these flexible word forms to its root, base, or stem form. The stem form does not need to be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid grammatical root. Second, we should probably account for the frequency of the words. While more elaborate methods are described in the next section, for the purpose of this exercise we'll exclude the words with a very high count, which are present in practically any document (such as articles and possessive pronouns, usually called stop words), as well as the words with a very low count. Specifically, I'll use the optimized Porter Stemmer implementation that I describe in more detail at the end of the chapter.
The http://tartarus.org/martin/PorterStemmer/ site contains Porter Stemmer implementations in Scala and other languages, including a highly optimized ANSI C version, which may be more efficient, but here I will provide another optimized Scala version that can be used immediately with Spark.
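As a quick sanity check, assuming the Stemmer class exposes a stem(word: String): String method (the same interface used in the full example below), one might try it on a few related word forms; the sample words here are illustrative:

val stemmer = new Stemmer
// A Porter-style stemmer is expected to collapse related word forms into one stem,
// for example mapping "running" and "runs" to the same base form as "run".
Seq("run", "running", "runs", "consign", "consigned", "consigning") foreach { w =>
  println(w + " -> " + stemmer.stem(w))
}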
The Stemmer example stems the words, removes the stop words, and counts the relative intersections between the document collections:
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  val stemmer = new Stemmer
  val conf = new SparkConf().
    setAppName("Stemmer").
    setMaster(args(0))
  val sc = new SparkContext(conf)
  // Stem the stop words themselves so that they match the stemmed tokens
  val stopwords = scala.collection.immutable.TreeSet(
    "", "i", "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "from", "had", "has", "he", "her", "him", "his", "in", "is",
    "it", "its", "my", "not", "of", "on", "she", "that", "the", "to",
    "was", "were", "will", "with", "you"
  ) map { stemmer.stem(_) }

  val bags = for (name <- args.slice(1, args.length)) yield {
    val rdd = sc.textFile(name).map(_.toLowerCase)
    if (name == "nytimes" || name == "nips" || name == "enron")
      // These datasets are already tokenized with underscores; drop the zzz_ markers
      rdd.filter(!_.startsWith("zzz_")).flatMap(_.split("_")).map(stemmer.stem(_)).distinct.filter(!stopwords.contains(_)).cache
    else {
      // Plain text: tokenize, stem, drop stop words, and drop very rare words
      val withCounts = rdd.flatMap(_.split("\\W+")).map(stemmer.stem(_)).filter(!stopwords.contains(_)).map((_, 1)).reduceByKey(_+_)
      val minCount = scala.math.max(1L, 0.0001 * withCounts.count.toLong)
      withCounts.filter(_._2 > minCount).map(_._1).cache
    }
  }

  val cntRoots = (0 until (args.length - 1)).map(i => Math.sqrt(bags(i).count.toDouble))

  for (l <- 0 until (args.length - 1); r <- l until (args.length - 1)) {
    val cnt = bags(l).intersection(bags(r)).count
    println("The intersect " + args(l+1) + " x " + args(r+1) + " is: " + cnt +
      " (" + (cnt.toDouble / cntRoots(l) / cntRoots(r)) + ")")
  }

  sc.stop
}
When you run the main class example from the command line, it outputs the stemmed bag sizes and intersections for the datasets specified as parameters (these are directories on the local filesystem containing the documents):
$ sbt "run-main org.akozlov.examples.Stemmer local[2] shakespeare leotolstoy chekhov nytimes nips enron bible" [info] Loading project definition from /Users/akozlov/Src/Book/ml-in-scala/chapter09/project [info] Set current project to NLP in Scala (in build file:/Users/akozlov/Src/Book/ml-in-scala/chapter09/) [info] Running org.akozlov.examples.Stemmer local[2] shakespeare leotolstoy chekhov nytimes nips enron bible The intersect shakespeare x shakespeare is: 10533 (1.0) The intersect shakespeare x leotolstoy is: 5834 (0.5293670391596142) The intersect shakespeare x chekhov is: 3295 (0.4715281914492153) The intersect shakespeare x nytimes is: 7207 (0.4163369701270161) The intersect shakespeare x nips is: 2726 (0.27457329089479504) The intersect shakespeare x enron is: 5217 (0.34431535832271265) The intersect shakespeare x bible is: 3826 (0.45171392986714726) The intersect leotolstoy x leotolstoy is: 11531 (0.9999999999999999) The intersect leotolstoy x chekhov is: 4099 (0.5606253333241973) The intersect leotolstoy x nytimes is: 8657 (0.47796976891152176) The intersect leotolstoy x nips is: 3231 (0.3110369262979765) The intersect leotolstoy x enron is: 6076 (0.38326210407266764) The intersect leotolstoy x bible is: 3455 (0.3898604013063757) The intersect chekhov x chekhov is: 4636 (1.0) The intersect chekhov x nytimes is: 3843 (0.33463022711780555) The intersect chekhov x nips is: 1889 (0.28679311682962116) The intersect chekhov x enron is: 3213 (0.31963226496874225) The intersect chekhov x bible is: 2282 (0.40610513998395287) The intersect nytimes x nytimes is: 28449 (1.0) The intersect nytimes x nips is: 4954 (0.30362042173997206) The intersect nytimes x enron is: 11273 (0.45270741164576034) The intersect nytimes x bible is: 3655 (0.2625720159205085) The intersect nips x nips is: 9358 (1.0000000000000002) The intersect nips x enron is: 4888 (0.3422561629856124) The intersect nips x bible is: 1615 (0.20229053645165143) The intersect enron x enron is: 21796 (1.0) The intersect enron x bible is: 2895 (0.23760453654690084) The intersect bible x bible is: 6811 (1.0) [success] Total time: 12 s, completed May 17, 2016 11:00:38 PM
This, in this case, just confirms the hypothesis that the Bible's vocabulary is closer to William Shakespeare's than to Leo Tolstoy's and the other sources. Interestingly, the modern vocabularies of the NY Times articles and Enron's e-mails from the previous chapters are much closer to Leo Tolstoy's, which is probably more an indication of the quality of the translation.
Another thing to notice is that this fairly complex analysis took about 40 lines of Scala code (not counting the libraries, specifically the Porter Stemmer, which is about 100 lines) and about 12 seconds. The power of Scala is that it can leverage other libraries very efficiently to write concise code.
Serialization
We already talked about serialization in Chapter 6, Working with Unstructured Data. As Spark's tasks are executed in different threads and potentially different JVMs, Spark does a lot of serialization/deserialization when passing objects around. Potentially, I could use map { val stemmer = new Stemmer; stemmer.stem(_) } instead of map { stemmer.stem(_) }, but the latter reuses the object over multiple iterations and seems linguistically more appealing. One suggested performance optimization is to use the Kryo serializer, which is less flexible than the Java serializer but more performant. However, for integration purposes, it is much easier to just make every object in the pipeline serializable and use the default Java serialization.
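For reference, switching to the Kryo serializer is a SparkConf setting; the following is a minimal sketch of how it might be enabled with the Spark 1.x API (registering the Stemmer class here is only an illustration, any class shipped to the executors can be registered):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of enabling Kryo serialization; class registration is optional,
// but avoids embedding full class names into every serialized object.
val conf = new SparkConf()
  .setAppName("StemmerWithKryo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Stemmer]))
val sc = new SparkContext(conf)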
As another example, let's compute the distribution of word frequencies, as follows:
scala> val bags = for (name <- List("shakespeare", "leotolstoy", "chekhov", "nytimes", "enron", "bible")) yield {
     |   sc textFile(name) flatMap { _.split("\\W+") } map { _.toLowerCase } map { stemmer.stem(_) } filter { ! stopwords.contains(_) } cache()
     | }
bags: List[org.apache.spark.rdd.RDD[String]] = List(MapPartitionsRDD[93] at filter at <console>:36, MapPartitionsRDD[98] at filter at <console>:36, MapPartitionsRDD[103] at filter at <console>:36, MapPartitionsRDD[108] at filter at <console>:36, MapPartitionsRDD[113] at filter at <console>:36, MapPartitionsRDD[118] at filter at <console>:36)

scala> bags reduceLeft { (a, b) => a.union(b) } map { (_, 1) } reduceByKey { _+_ } collect() sortBy(- _._2) map { x => scala.math.log(x._2) }
res18: Array[Double] = Array(10.27759958298627, 10.1152465449837, 10.058652004037477, 10.046635061754612, 9.999615579630348, 9.855399641729074, 9.834405391348684, 9.801233318497372, 9.792667717430884, 9.76347807952779, 9.742496866444002, 9.655474810542554, 9.630365631415676, 9.623244409181346, 9.593355351246755, 9.517604459155686, 9.515837804297965, 9.47231994707559, 9.45930760329985, 9.441531454869693, 9.435561763085358, 9.426257878198653, 9.378985497953893, 9.355997944398545, 9.34862295977619, 9.300820725104558, 9.25569607369698, 9.25320827220336, 9.229162126216771, 9.20391980417326, 9.19917830726999, 9.167224080902555, 9.153875834995056, 9.137877200242468, 9.129889247578555, 9.090430075303626, 9.090091799380007, 9.083075020930307, 9.077722847361343, 9.070273383079064, 9.0542711863262...
...
The distribution of relative frequencies on the log-log scale is presented in the following diagram. With the exception of the first few tokens, the dependency of frequency on rank is almost linear, which is what Zipf's law predicts:
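The rank-frequency pairs behind such a plot can be extracted with a few more lines. The following sketch continues from the bags computed above; the output file name is an arbitrary choice for this example:

// Collect per-stem counts over the union of all bags, order by descending count,
// and write (log10 rank, log10 count) pairs to a file suitable for a log-log plot.
val counts = bags.reduceLeft(_ union _).map((_, 1)).reduceByKey(_ + _).collect()
val out = new java.io.PrintWriter("rank-freq.txt")
counts.sortBy(-_._2).zipWithIndex.foreach { case ((_, cnt), rank) =>
  out.println(scala.math.log10(rank + 1.0) + "\t" + scala.math.log10(cnt.toDouble))
}
out.close()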