Time for action – correlating sighting duration to UFO shape

Let's do a slightly more detailed analysis of this shape data. We wondered if there was any correlation between the duration of a sighting and the reported shape. Perhaps cigar-shaped UFOs hang around longer than the rest, or formations always appear for exactly the same amount of time.

  1. Save the following to shapetimemapper.rb:
    #!/usr/bin/env ruby
    
    # Matches a number followed by a unit beginning with "min" or "sec".
    pattern = /(\d+) ?(min|sec)/
    
    while line = gets
      parts = line.split("\t")
      next unless parts.size == 6
      shape = parts[3].strip
      duration = parts[4].strip.downcase
      next if shape.empty? || duration.empty?
      match = pattern.match(duration)
      next unless match # discard durations we cannot parse
      time = match[1].to_i
      time *= 60 if match[2] == "min"
      puts shape + "\t" + time.to_s
    end
  2. Make the file executable by executing the following command:
    $ chmod +x shapetimemapper.rb
    
  3. Save the following to shapetimereducer.rb:
    #!/usr/bin/env ruby
    
    current = nil
    min = 0
    max = 0
    total = 0
    count = 0
    
    while line = gets
      word, time = line.split("\t")
      time = Integer(time)
    
      if word == current
        # Same shape as the previous line: update the running statistics.
        count += 1
        total += time
        min = time if time < min
        max = time if time > max
      else
        # New shape: emit the summary for the previous one, then reset.
        puts current + "\t" + min.to_s + " " + max.to_s + " " + (total / count).to_s if current
        current = word
        count = 1
        total = time
        min = time
        max = time
      end
    end
    # Emit the final shape, unless there was no input at all.
    puts current + "\t" + min.to_s + " " + max.to_s + " " + (total / count).to_s if current
  4. Make the file executable by executing the following command:
    $ chmod +x shapetimereducer.rb
    
  5. Run the job:
    $ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file shapetimemapper.rb -mapper shapetimemapper.rb -file shapetimereducer.rb -reducer shapetimereducer.rb -input ufo.tsv -output shapetime
    
  6. Retrieve the results:
    $ hadoop fs -cat shapetime/part-00000
    

What just happened?

Our mapper here is a little more involved than previous examples due to the nature of the duration field. Taking a quick look at some sample records, we found values as follows:

15 seconds
2 minutes
2 min
2minutes
5-10 seconds

In other words, there was a mixture of range and absolute values, different formatting, and inconsistent terms for time units. Again, for simplicity, we decided on a limited interpretation of the data: we take the absolute value if present, and the upper part of a range if not. We assume that the string min or sec is present to identify the time unit, and we convert all timings into seconds. With some regular expression magic, we unpack the duration field into these parts and do the conversion. Note again that we simply discard any record that does not work as we expect, which may not always be appropriate.
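The interpretation described above can be sketched as a small standalone function. This is a minimal illustration in the spirit of the mapper's regular expression, not the mapper itself; the names DURATION_PATTERN and duration_to_seconds are hypothetical, and the final sample string is invented to show a record that would be discarded.

```ruby
# Hypothetical helper (not part of the job's scripts): converts a free-text
# duration into seconds, or returns nil when the text cannot be interpreted.
DURATION_PATTERN = /(\d+) ?(min|sec)/

def duration_to_seconds(text)
  match = DURATION_PATTERN.match(text.strip.downcase)
  return nil unless match          # no number followed by min/sec: discard
  seconds = match[1].to_i
  seconds *= 60 if match[2] == "min"
  seconds
end

# The sample values from the text, plus one invented unparseable value.
samples = ["15 seconds", "2 minutes", "2 min", "2minutes", "5-10 seconds", "a while"]
samples.each do |sample|
  puts "#{sample.inspect} => #{duration_to_seconds(sample).inspect}"
end
```

Because the pattern needs digits immediately before the unit, a range such as 5-10 seconds yields its upper value, 10, while a value with no recognizable number and unit yields nil.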

The reducer follows the same pattern as our earlier example, starting with a default key and reading values until a new one is encountered. In this case, we want to capture the minimum, maximum, and mean for each shape, so we use several variables to track the needed data.

Remember that a Streaming reducer receives the values for many keys in a single sorted stream, so it must detect when a line carries a new key, which signals that the last value for the previous key has been processed. In contrast, a Java reducer is simpler because each invocation deals only with the values for a single key.
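The key-change detection can be seen in isolation in the following sketch, which works on an in-memory array rather than standard input. The function name summarize and the sample lines are made up for illustration; the only assumption is that the lines arrive sorted by key, as they would from Streaming.

```ruby
# Hypothetical helper: consumes key-sorted "shape<TAB>seconds" lines and
# returns one [shape, min, max, mean] row per shape.
def summarize(lines)
  rows = []
  current = nil
  min = max = total = count = 0

  lines.each do |line|
    shape, value = line.chomp.split("\t")
    time = value.to_i
    if shape == current
      count += 1
      total += time
      min = time if time < min
      max = time if time > max
    else
      # A changed key means the previous key is complete, so emit it.
      rows << [current, min, max, total / count] if current
      current = shape
      count = 1
      min = max = total = time
    end
  end
  # The last key has no following line to trigger its output.
  rows << [current, min, max, total / count] if current
  rows
end

# Invented sample data, already sorted by shape.
sorted = ["cigar\t60", "cigar\t300", "disk\t10", "disk\t20", "disk\t120"]
summarize(sorted).each { |row| puts row.join(" ") }
```

Note the second emit after the loop: without it, the summary for the final key would be lost, because no later line exists to signal that the key has ended.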

After making both files executable, we run the job and get the following results, where we removed any shape with fewer than 10 sightings and again made the output more compact for space reasons. The numbers for each shape are the minimum value, the maximum value, and the mean, respectively:

shape      min  max    mean    shape      min  max    mean
changing   0    5400   670     chevron    0    3600   333
cigar      0    5400   370     circle     0    7200   423
cone       0    4500   498     cross      2    3600   460
cylinder   0    5760   380     diamond    0    7800   519
disk       0    5400   449     egg        0    5400   383
fireball   0    5400   236     flash      0    7200   303
formation  0    5400   434     light      0    9000   462
other      0    5400   418     oval       0    5400   405
rectangle  0    4200   352     sphere     0    14400  396
teardrop   0    2700   335     triangle   0    18000  375
unknown    0    6000   470

It is surprising to see how little the mean sighting duration varies across shape types; many shapes have a mean between 350 and 430 seconds. Interestingly, the shortest mean duration is for fireballs and the longest is for changing objects, both of which make some intuitive sense. A fireball by definition wouldn't be a long-lasting phenomenon, and a changing object would need a lengthy duration for its changes to be noticed.

Using Streaming scripts outside Hadoop

This last example, with its more involved mapper and reducer, is a good illustration of how Streaming can help MapReduce development in another way: you can execute the scripts outside of Hadoop.

It's generally good practice during MapReduce development to have a sample of the production data against which to test your code. But when this is on HDFS and you are writing Java map and reduce tasks, it can be difficult to debug problems or refine complex logic. With map and reduce tasks that read from standard input, you can run them directly against some data to get quick feedback on the result. If you have a development environment that provides Hadoop integration, or are using Hadoop in standalone mode, these problems are minimized. Just remember that Streaming gives you the ability to try the scripts outside of Hadoop; it may be useful some day.

While developing these scripts, the author noticed that the last set of records in his UFO datafile was better structured than those at the start of the file. Therefore, a quick test of the mapper required only the following:

$ tail ufo.tsv | ./shapetimemapper.rb

This principle can be applied to the full workflow to exercise both the map and reduce scripts.
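One way to exercise the full workflow locally is to chain the mapper, a sort, and the reducer, with the sort standing in for the grouping Hadoop performs between the map and reduce phases. The sketch below does this from Ruby using Open3.pipeline_r from the standard library. The helper name run_streaming_locally is hypothetical, and the sketch assumes a Unix-like system with cat and sort on the PATH and both scripts executable in the current directory.

```ruby
require "open3"

# Hypothetical helper: streams a local file through
# mapper | sort | reducer and returns the reducer's output as a string.
# The sort step brings identical keys together, which is what the
# reducer script relies on.
def run_streaming_locally(input_path, mapper_cmd, reducer_cmd)
  Open3.pipeline_r(["cat", input_path], mapper_cmd, "sort", reducer_cmd) do |out, _threads|
    out.read
  end
end

# Assumed usage with the files from this section; skipped if the
# datafile is not present in the current directory.
if File.exist?("ufo.tsv")
  puts run_streaming_locally("ufo.tsv", "./shapetimemapper.rb", "./shapetimereducer.rb")
end
```

A local run like this checks the scripts' logic only; it says nothing about how the job behaves when the input is split across many map tasks on a cluster.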
