Time for action – summarizing the UFO data

Now that we have the data, let's get an initial summary of its size and how many records may be incomplete:

  1. With the UFO tab-separated value (TSV) file on HDFS saved as ufo.tsv, save the following script as summarymapper.rb:
    #!/usr/bin/env ruby
    
    # Emit a (token, count) pair for each property of interest;
    # the counts are summed by the WordCount reducer.
    while line = gets
        puts "total\t1"
        parts = line.split("\t")
        puts "badline\t1" if parts.size != 6
        puts "sighted\t1" if parts[0] && !parts[0].empty?
        puts "recorded\t1" if parts[1] && !parts[1].empty?
        puts "location\t1" if parts[2] && !parts[2].empty?
        puts "shape\t1" if parts[3] && !parts[3].empty?
        puts "duration\t1" if parts[4] && !parts[4].empty?
        puts "description\t1" if parts[5] && !parts[5].empty?
    end
  2. Make the file executable with the following command:
    $ chmod +x summarymapper.rb
    
  3. Execute the job using Hadoop Streaming:
    $ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
    -file summarymapper.rb -mapper summarymapper.rb \
    -file wcreducer.rb -reducer wcreducer.rb \
    -input ufo.tsv -output ufosummary
    
  4. Retrieve the summary data:
    $ hadoop fs -cat ufosummary/part-00000
    

What just happened?

Remember that our UFO sightings should have six fields as described previously. They are listed as follows:

  • The date of the sighting
  • The date the sighting was reported
  • The location of the sighting
  • The shape of the object
  • The duration of the sighting
  • A free text description of the event

The mapper examines the file and counts the total number of records in addition to identifying potentially incomplete records.

We produce the overall count simply by emitting a token for every record encountered while processing the file. We identify potentially incomplete records by flagging those that either do not contain exactly six fields or have at least one empty field.

Therefore, the implementation of the mapper reads each line and does three things as it proceeds through the file:

  • It outputs a token that will be summed to give the total number of records processed
  • It splits the record on tab boundaries and records any line that does not split into exactly six field values
  • For each of the six expected fields, it reports when the value present is not an empty string, that is, when there is data in the field, though this says nothing about the quality of that data

We wrote this mapper intentionally to produce the output of the form (token, count). Doing this allowed us to use our existing WordCount reducer from our earlier implementations as the reducer for this job. There are certainly more efficient implementations, but as this job is unlikely to be frequently executed, the convenience is worth it.
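
The wcreducer.rb script named in the Streaming command is the WordCount reducer developed earlier in the book, so it is not reproduced in this section. As a reminder of the general shape such a Streaming reducer takes, a minimal sketch that sums the count for each token, assuming tab-separated (token, count) lines arriving sorted by key, might look like this:

    #!/usr/bin/env ruby
    
    # Streaming delivers mapper output sorted by key, so a change of
    # token means the previous token's total is complete.
    current = nil
    count = 0
    
    while line = gets
        token, value = line.chomp.split("\t")
        if token != current
            puts "#{current}\t#{count}" unless current.nil?
            current = token
            count = 0
        end
        count += value.to_i
    end
    puts "#{current}\t#{count}" unless current.nil?

Because Streaming simply passes text through standard input and output, both scripts can also be sanity checked locally, for example by piping a local copy of the TSV file through ./summarymapper.rb | sort | ./wcreducer.rb before submitting the job to the cluster.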

At the time of writing, the result of this job was as follows:

badline       324
description   61372
duration      58961
location      61377
recorded      61377
shape         58855
sighted       61377
total         61377

We see from these figures that we have 61,377 records. All of these provide values for the sighted date, reported date, and location fields. Around 58,000-59,000 records have values for shape and duration, and almost all have a description.

When split on tab characters, 324 lines were found to not have exactly six fields. However, since only five records had no value for description, this suggests that the bad records typically have too many tabs rather than too few. We could of course alter our mapper to gather detailed information on this fact. The extra tabs are most likely being used within the free text description, so for now we will do our analysis expecting most records to have correctly placed values for all six fields, but we will not make any assumptions regarding further tabs in each record.
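
As an illustration of how small that alteration could be, the following is a hypothetical sketch (not part of the job above) that emits the number of fields found on each line, letting the same WordCount reducer report the distribution of field counts:

    #!/usr/bin/env ruby
    
    # Hypothetical sketch: emit the field count for every line so the
    # reducer reveals how many lines have too many or too few fields.
    while line = gets
        parts = line.chomp.split("\t")
        puts "fields:#{parts.size}\t1"
    end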

Examining UFO shapes

Out of all the fields in these reports, it was shape that immediately interested us most, as it could offer some interesting ways of grouping the data depending on what sort of information we have in that field.
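
As a first sketch of that idea, a mapper along the following lines could emit the shape value as its token and reuse the WordCount reducer to count sightings per shape; the field position and the handling of empty values here are assumptions for illustration rather than the book's final code:

    #!/usr/bin/env ruby
    
    # Assumed sketch: emit the shape field (fourth column) so that the
    # WordCount reducer counts sightings per shape.
    while line = gets
        parts = line.chomp.split("\t")
        shape = parts[3]
        puts "#{shape}\t1" if shape && !shape.empty?
    end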
