Now that we have the data, let's get an initial summary of its size and of how many records may be incomplete:
With ufo.tsv on HDFS, save the following file to summarymapper.rb:

#!/usr/bin/env ruby
# Emit one "token<TAB>1" pair per observation so a summing reducer can total them.
while line = gets
  puts "total\t1"
  parts = line.split("\t")
  puts "badline\t1" if parts.size != 6
  # to_s guards against nil when a line has fewer than six fields
  puts "sighted\t1" if !parts[0].to_s.empty?
  puts "recorded\t1" if !parts[1].to_s.empty?
  puts "location\t1" if !parts[2].to_s.empty?
  puts "shape\t1" if !parts[3].to_s.empty?
  puts "duration\t1" if !parts[4].to_s.empty?
  puts "description\t1" if !parts[5].to_s.empty?
end
$ chmod +x summarymapper.rb
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file summarymapper.rb -mapper summarymapper.rb -file wcreducer.rb -reducer wcreducer.rb -input ufo.tsv -output ufosummary
$ hadoop fs -cat ufosummary/part-00000
Remember that each UFO sighting record should have six fields, as described previously: the sighting date, the recorded date, the location, the shape, the duration, and the description.
The mapper examines the file and counts the total number of records in addition to identifying potentially incomplete records.
We produce the overall count by simply counting every record encountered while processing the file. We identify potentially incomplete records by flagging those that either do not contain exactly six fields or have at least one field with an empty value.
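Therefore, the mapper reads each line and does three things as it proceeds through the file: it emits a total token for every record processed; it splits the record on tab boundaries and emits a badline token whenever the split does not yield exactly six fields; and, for each of the six expected fields, it emits a token named after the field whenever the value is something other than an empty string. The last check only tells us that data is present in the field, not anything about the quality of that data.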
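To experiment with this logic without submitting a job, the per-line behaviour can be sketched as a plain Ruby function. This is a variation rather than the mapper itself: the helper name `summary_tokens`, the `FIELD_NAMES` constant, and the sample lines are all hypothetical, and it uses `chomp` with a negative `split` limit so that trailing empty fields still count as fields.

```ruby
# Hypothetical helper: returns the tokens the summary logic emits for one line.
FIELD_NAMES = %w[sighted recorded location shape duration description].freeze

def summary_tokens(line)
  tokens = ["total"]
  # A negative limit keeps trailing empty fields so they still count as fields.
  parts = line.chomp.split("\t", -1)
  tokens << "badline" if parts.size != FIELD_NAMES.size
  FIELD_NAMES.each_with_index do |name, i|
    # to_s guards against nil when a line has fewer than six fields.
    tokens << name unless parts[i].to_s.empty?
  end
  tokens
end

# Made-up sample records, not taken from ufo.tsv.
sample = [
  "20000101\t20000102\tSpringfield, XX\t\t\tbright light",
  "20000103\t20000104\tSpringfield, XX\tdisk\t5 min\ttext with\textra tab",
  "20000105\t20000106\tSpringfield, XX"
]

# Tally the tokens in memory, which is what the reducer does on the cluster.
counts = Hash.new(0)
sample.each { |line| summary_tokens(line).each { |t| counts[t] += 1 } }
counts.sort.each { |token, n| puts "#{token}\t#{n}" }
```

The first sample line has six fields with empty shape and duration, the second has an extra tab in its description, and the third is truncated, so the three cases the mapper distinguishes are all exercised.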
We wrote this mapper intentionally to produce output of the form (token, count). Doing this allowed us to use our existing WordCount reducer from our earlier implementations as the reducer for this job. There are certainly more efficient implementations, but as this job is unlikely to be executed frequently, the convenience is worth it.
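The reducer itself is not reproduced in this section. As a reminder of what a summing reducer for (token, count) pairs looks like, here is a minimal sketch; it is not necessarily identical to wcreducer.rb, and the function name `reduce_counts` and the sample input are assumptions made for illustration. It relies on Streaming delivering mapper output sorted by key.

```ruby
require "stringio"

# Sums the counts for each token. Input must be sorted by token, so all
# lines for one token arrive together.
def reduce_counts(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    token, count = line.chomp.split("\t")
    next if token.nil? || token.empty? # skip blank lines
    if token != current
      # A new token starts: flush the finished one.
      output.puts "#{current}\t#{total}" unless current.nil?
      current = token
      total = 0
    end
    total += count.to_i
  end
  # Flush the final token.
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Made-up, pre-sorted mapper output for a local check.
sorted_mapper_output = StringIO.new(
  "badline\t1\nshape\t1\nshape\t1\ntotal\t1\ntotal\t1\ntotal\t1\n"
)
result = StringIO.new
reduce_counts(sorted_mapper_output, result)
puts result.string
```

Under Streaming, the same function would be called as `reduce_counts($stdin, $stdout)`.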
At the time of writing, the result of this job was as follows:
badline 324
description 61372
duration 58961
location 61377
recorded 61377
shape 58855
sighted 61377
total 61377
We see from these figures that we have 61,377 records. All of these provide values for the sighted date, reported date, and location fields. Around 58,900 records have values for shape and duration (58,855 and 58,961 respectively), and almost all have a description.
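To put these counts in proportion, a short script can express each one as a share of the total. The counts are copied from the job output above; the helper name `percent_of` is hypothetical.

```ruby
# Counts copied from the job output reported above.
counts = {
  "badline" => 324, "description" => 61372, "duration" => 58961,
  "location" => 61377, "recorded" => 61377, "shape" => 58855,
  "sighted" => 61377, "total" => 61377
}

# Hypothetical helper: share of all records carrying the given token,
# as a percentage rounded to two decimal places.
def percent_of(count, total)
  (100.0 * count / total).round(2)
end

total = counts.fetch("total")
counts.each do |token, count|
  next if token == "total"
  puts format("%-12s %6d  %6.2f%%", token, count, percent_of(count, total))
end
```

Shape and duration each come out at roughly 96 percent of records, and the flagged lines at well under 1 percent.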
When split on tab characters, 324 lines were found not to have exactly six fields. However, since only five records had no value for description, this suggests that the bad records typically have too many tabs rather than too few. This is likely due to tabs being used in the free-text description; we could of course alter our mapper to gather detailed information on this. For now, we will do our analysis expecting most records to have correctly placed values for all six fields, but will not make any assumptions regarding further tabs in each record.
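One way to gather that detail is a mapper variant that reports how many fields each line contains, so the summed output becomes a histogram of field counts. The following is only a sketch: the function name `emit_field_counts`, the `fields_N` token format, and the sample input are assumptions, not part of the original job.

```ruby
require "stringio"

# Hypothetical mapper variant: emits one token per line describing how many
# tab-separated fields the line has, for example "fields_7<TAB>1".
def emit_field_counts(input, output)
  input.each_line do |line|
    # A negative limit keeps trailing empty fields in the count.
    n = line.chomp.split("\t", -1).size
    output.puts "fields_#{n}\t1"
  end
end

# Made-up sample: one clean line, one with an extra tab, one truncated.
sample = StringIO.new("a\tb\tc\td\te\tf\na\tb\tc\td\te\tf\tg\na\tb\tc\n")
out = StringIO.new
emit_field_counts(sample, out)
puts out.string
```

Under Streaming, this would be invoked as `emit_field_counts($stdin, $stdout)` and paired with the same summing reducer. If the hypothesis about tabs in the description is right, most of the non-six counts should fall on `fields_7` and above.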