Scratching the surface of Big Data

Among the recent achievements and trends towards better analysis of data and services lies the Big Data movement. In particular, the Hadoop framework has established itself as a de facto standard "for the distributed processing of large datasets across clusters of computers using simple programming models". In addition to a distributed file system called HDFS, optimized for high-throughput access to data, Hadoop offers MapReduce facilities to process large datasets in parallel. As setting up and running Hadoop is not always considered a simple task, other frameworks have been built on top of it to simplify the definition of Hadoop jobs. In Java, the Cascading framework is a layer on top of Hadoop that provides a convenient API to facilitate the creation of MapReduce jobs. In Scala, the Scalding framework further enhances the Cascading API by taking advantage of Scala's concise and expressive syntax, as we can observe by taking a look at the activator-scalding Typesafe activator template. The sample code provided with this template illustrates a word-counting application, that is, the "hello world" of Hadoop MapReduce jobs.

As a reminder of how MapReduce jobs work, consider reading the original paper from Google, available at http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

We can express the job of counting words with the following two steps:

  • Splitting the lines of a file into individual words and creating a key-value pair for each word, where the key is the word (of the String type) and the value is the constant 1
  • Grouping the elements that share the same key (that is, occurrences of the same word) into a list, then reducing each list by summing its values, which gives us the count per word, as illustrated by the sketch following this list
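
Before looking at the Scalding version, here is a minimal sketch of these two steps using plain Scala collections (no Hadoop involved); the sample lines and names are made up purely for illustration:

object WordCountSketch extends App {

  // Two hard-coded sample lines standing in for the input file
  val lines = Seq("to be or not to be", "to do or not to do")

  val counts = lines
    // Step 1: split each line into words and pair each word with 1
    .flatMap(_.trim.toLowerCase.split("""\W+"""))
    .map(word => (word, 1))
    // Step 2: group the pairs by word and sum the values per group
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Print each word with its count, tab-separated
  counts.foreach { case (word, count) => println(s"$word\t$count") }
}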

If you run the > activator ui command in a terminal window, as we already did a number of times in Chapter 3, Understanding the Scala Ecosystem, and create a project from the activator-scalding template, you can see how concisely the word count is specified in Scalding. Do not forget to run the > activator eclipse command so that you can import the project into the Eclipse IDE:

class WordCount(args : Args) extends Job(args) {

  // Tokenize into words by splitting on non-word characters. This
  // is imperfect, as it splits hyphenated words, possessives
  // ("John's"), etc.
  val tokenizerRegex = """\W+"""

  // Read the file specified by the --input argument and process
  // each line by trimming the leading and trailing whitespace,
  // converting to lower case, then tokenizing into words.
  // The first argument list to flatMap specifies that we pass the
  // 'line field to the anonymous function on each call and each
  // word in the returned collection of words is given the name
  // 'word. Note that TextLine automatically associates the name
  // 'line with each line of text. It also tracks the line number
  // and names that field 'offset. Here, we're dropping the
  // offset.

  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) {
      line : String => line.trim.toLowerCase.split(tokenizerRegex) 
    }

  // At this point we have a stream of words in the pipeline. To
  // count occurrences of the same word, we need to group the
  // words together. The groupBy operation does this. The first
  // argument list to groupBy specifies the fields to group over
  // as the key. In this case, we only use the 'word field. 
  // The anonymous function is passed an object of type
  // com.twitter.scalding.GroupBuilder. All we need to do is
  // compute the size of the group and we give it an optional
  // name, 'count.
    .groupBy('word){ group => group.size('count) }

  // In many data flows, we would need to project out just the
  // 'word and 'count, since we don't care about them any more,
  // but groupBy has already eliminated everything but 'word and
  // 'count. Hence, we'll just write them out as tab-delimited
  // values.
    .write(Tsv(args("output")))
}

Most of the code is in fact comments, which means that the whole algorithm stays very close to the description one would write in pseudocode.
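
Should you want to execute the job yourself, Scalding jobs are commonly launched through the com.twitter.scalding.Tool runner, which takes the job class name followed by its arguments and accepts a --local flag to run in Cascading's local mode, that is, without a Hadoop cluster. From the activator (sbt) console, an invocation could look like the following line, where the input and output paths are placeholders to be adapted to the files shipped with the template:

> run-main com.twitter.scalding.Tool WordCount --local --input data/sample.txt --output output/wordcount.tsv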

If you are interested in Big Data, Scala definitely fills a niche, and a number of projects and frameworks handling huge streams of data and Hadoop-like jobs are already pushing the limits of what can be processed. Among them, we can mention Spark (http://spark.apache.org) as well as Twitter's open source projects Summingbird (https://github.com/twitter/summingbird) and Algebird (https://github.com/twitter/algebird).
