Let's do a slightly more detailed analysis of this shape data. We wondered if there was any correlation between the duration of a sighting and the reported shape; perhaps cigar-shaped UFOs hang around longer than the rest, or formations always appear for the exact same amount of time.
shapetimemapper.rb:

#!/usr/bin/env ruby

# Map each valid record to a "shape<TAB>duration-in-seconds" pair
pattern = Regexp.new /\d* ?((min)|(sec))/

while line = gets
  parts = line.split("\t")
  if parts.size == 6
    shape = parts[3].strip
    duration = parts[4].strip.downcase
    if !shape.empty? && !duration.empty?
      match = pattern.match(duration)
      # Discard records whose duration mentions neither min nor sec,
      # or which contain no digits at all
      if match
        time = /\d*/.match(match[0])[0]
        unless time.empty?
          unit = match[1]
          time = Integer(time)
          time = time * 60 if unit == "min"
          puts shape + "\t" + time.to_s
        end
      end
    end
  end
end
$ chmod +x shapetimemapper.rb
shapetimereducer.rb:

#!/usr/bin/env ruby

# Emit minimum, maximum, and mean duration for each shape. Input arrives
# sorted by key, so a changed key marks the end of the previous group
current = nil
min = 0
max = 0
total = 0
count = 0

while line = gets
  word, time = line.split("\t")
  time = Integer(time)
  if word == current
    count = count + 1
    total = total + time
    min = time if time < min
    max = time if time > max
  else
    puts current + "\t" + min.to_s + " " + max.to_s + " " + (total / count).to_s if current
    current = word
    count = 1
    total = time
    min = time
    max = time
  end
end
puts current + "\t" + min.to_s + " " + max.to_s + " " + (total / count).to_s if current
$ chmod +x shapetimereducer.rb
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file shapetimemapper.rb -mapper shapetimemapper.rb -file shapetimereducer.rb -reducer shapetimereducer.rb -input ufo.tsv -output shapetime
$ hadoop fs -cat shapetime/part-00000
Our mapper here is a little more involved than previous examples due to the nature of the duration field. Taking a quick look at some sample records, we found values as follows:
15 seconds
2 minutes
2 min
2minutes
5-10 seconds
In other words, there was a mixture of range and absolute values, different formatting, and inconsistent terms for the time units. Again, for simplicity, we decided on a limited interpretation of the data: we take the absolute value if one is present, and the upper end of a range if not. We assume that the strings "min" or "sec" will be present for the time units, and we convert all timings into seconds. With some regular expression magic, we unpack the duration field into these parts and do the conversion. Note again that we simply discard any record that does not parse as we expect, which may not always be appropriate.
The reducer follows the same pattern as our earlier example, starting with a default key and reading values until a new one is encountered. In this case, we want to capture the minimum, maximum, and mean duration for each shape, so we use several variables to track the needed data.
Remember that a Streaming reducer receives a series of values grouped under their associated keys and must detect when the key on a new line changes, which signals that all values for the previous key have been processed. In contrast, a Java reducer is simpler in this respect, as it deals only with the values for a single key in each invocation.
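The key-change pattern can be seen in isolation in the following sketch, which aggregates a simple per-key sum; sum_reducer is an illustrative name and not part of the job above:

```ruby
# Skeleton of a Streaming-style reducer: input arrives sorted by key, so a
# group ends when the key on the current line differs from the previous one
def sum_reducer(lines)
  results = []
  current, total = nil, 0
  lines.each do |line|
    key, value = line.split("\t")
    if key == current
      total += Integer(value)
    else
      results << [current, total] if current   # flush the finished group
      current, total = key, Integer(value)
    end
  end
  results << [current, total] if current       # flush the final group
  results
end

puts sum_reducer(["a\t1", "a\t2", "b\t5"]).inspect   # => [["a", 3], ["b", 5]]
```

The shape/time reducer above is this same skeleton with min, max, and mean tracking in place of the running sum.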
After making both files executable, we run the job and get the following results, where we removed any shape with fewer than 10 sightings and again made the output more compact for space reasons. The numbers for each shape are the minimum, maximum, and mean duration in seconds, respectively:
changing 0 5400 670
chevron 0 3600 333
cigar 0 5400 370
circle 0 7200 423
cone 0 4500 498
cross 2 3600 460
cylinder 0 5760 380
diamond 0 7800 519
disk 0 5400 449
egg 0 5400 383
fireball 0 5400 236
flash 0 7200 303
formation 0 5400 434
light 0 9000 462
other 0 5400 418
oval 0 5400 405
rectangle 0 4200 352
sphere 0 14400 396
teardrop 0 2700 335
triangle 0 18000 375
unknown 0 6000 470
It is surprising to see the relatively narrow variance in mean sighting duration across all shape types; most means fall between 350 and 430 seconds. Interestingly, the shortest mean duration belongs to fireballs and the longest to changeable objects, both of which make some intuitive sense: a fireball by definition wouldn't be a long-lasting phenomenon, and a changeable object would need a lengthy duration for its changes to be noticed.
This last example, with its more involved mapper and reducer, is a good illustration of how Streaming can help MapReduce development in another way: you can execute the scripts outside of Hadoop.
It's generally good practice during MapReduce development to have a sample of the production data against which to test your code. But when that data is on HDFS and you are writing Java map and reduce tasks, it can be difficult to debug problems or refine complex logic. With map and reduce tasks that read input from the command line, you can run them directly against some data for quick feedback on the result. If you have a development environment with Hadoop integration, or are using Hadoop in standalone mode, the problem is less acute; still, remember that Streaming gives you the ability to try the scripts outside of Hadoop entirely, and it may prove useful some day.
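One way to tighten that feedback loop further is to pull the mapper's per-record logic into a plain function that can be exercised entirely in memory, with no Hadoop and no stdin. The following is a sketch of such a refactoring; map_record and the sample field values are hypothetical, not code from the job above:

```ruby
# Hypothetical refactoring: the mapper's per-record logic as a function,
# returning "shape<TAB>seconds" for good records and nil for discarded ones
def map_record(line)
  parts = line.split("\t")
  return nil unless parts.size == 6
  shape = parts[3].strip
  duration = parts[4].strip.downcase
  return nil if shape.empty? || duration.empty?
  match = /\d* ?((min)|(sec))/.match(duration)
  return nil unless match
  time = /\d*/.match(match[0])[0]
  return nil if time.empty?
  seconds = Integer(time)
  seconds *= 60 if match[1] == "min"
  "#{shape}\t#{seconds}"
end

# Sample record: the field values other than shape and duration are made up
sample = "1999\tcity\tstate\tdisk\t5 minutes\tdescription"
puts map_record(sample)
```

The actual Streaming script then reduces to a trivial loop that reads lines and prints non-nil results, while the interesting logic gets ordinary unit tests.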
While developing these scripts, the author noticed that the last set of records in his UFO datafile was better structured than those at the start of the file. Therefore, a quick test of the mapper required nothing more than:
$ tail ufo.tsv | ./shapetimemapper.rb
The same principle can be applied to the full workflow to exercise both the map and reduce scripts.