Time for action – summarizing the UFO data

Now that we have the data, let's get an initial summary of its size and of how many records may be incomplete:

With the UFO tab-separated value (TSV) file on HDFS saved as ufo.tsv, save the following file to summarymapper.rb:

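#!/usr/bin/env ruby

# Emit one "token<TAB>1" pair per observation so that a
# WordCount-style reducer can sum the counts per token.
while line = gets
    puts "total\t1"
    # chomp drops the trailing newline; the -1 limit keeps trailing
    # empty fields so that an empty description is still a field.
    parts = line.chomp.split("\t", -1)
    puts "badline\t1" if parts.size != 6
    # to_s guards against nil when a line has fewer than six fields.
    puts "sighted\t1" if !parts[0].to_s.empty?
    puts "recorded\t1" if !parts[1].to_s.empty?
    puts "location\t1" if !parts[2].to_s.empty?
    puts "shape\t1" if !parts[3].to_s.empty?
    puts "duration\t1" if !parts[4].to_s.empty?
    puts "description\t1" if !parts[5].to_s.empty?
end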

Make the file executable by executing the following command:
```
$ chmod +x summarymapper.rb
```

Execute the job as follows by using Streaming:

$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
    -file summarymapper.rb -mapper summarymapper.rb \
    -file wcreducer.rb -reducer wcreducer.rb \
    -input ufo.tsv -output ufosummary

Retrieve the summary data:
```
$ hadoop fs -cat ufosummary/part-00000
```

What just happened?

Remember that our UFO sightings should have six fields as described previously. They are listed as follows:

The date of the sighting
The date the sighting was reported
The location of the sighting
The shape of the object
The duration of the sighting
A free text description of the event

The mapper examines the file and counts the total number of records in addition to identifying potentially incomplete records.

We produce the overall count by emitting one token for every record encountered while processing the file. We identify potentially incomplete records in two ways: by flagging those that do not contain exactly six fields, and by counting, for each field, how many records have a non-empty value.

Therefore, the implementation of the mapper reads each line and does three things as it proceeds through the file:

It emits a token that increments the total number of records processed
It splits the record on tab boundaries and flags any line that does not yield exactly six fields
For each of the six expected fields, it emits a token when the field holds anything other than an empty string; this shows only that data is present and says nothing about its quality
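
The three steps above can be sketched as a small function that returns the tokens for a single line. This is only an illustration: `tokens_for`, `FIELD_NAMES`, and the sample records are made up for the example, and the sketch assumes that the trailing newline should be stripped and trailing empty fields kept before counting.

```ruby
# Hypothetical helper: returns the tokens a mapper of this kind
# would emit for one input line.
FIELD_NAMES = %w[sighted recorded location shape duration description].freeze

def tokens_for(line)
  tokens = ["total"]                        # step 1: count every record
  parts = line.chomp.split("\t", -1)        # step 2: split on tab boundaries
  tokens << "badline" if parts.size != FIELD_NAMES.size
  FIELD_NAMES.each_with_index do |name, i|  # step 3: one token per non-empty field
    tokens << name unless parts[i].to_s.empty?
  end
  tokens
end

# Made-up sample records, not taken from the real data set.
complete  = "20000101\t20000102\tSpringfield\tdisk\t5 min\tBright object\n"
no_shape  = "20000101\t20000102\tSpringfield\t\t5 min\tBright object\n"
extra_tab = "20000101\t20000102\tSpringfield\tdisk\t5 min\tBright\tobject\n"

[complete, no_shape, extra_tab].each do |record|
  puts tokens_for(record).join(" ")
end
```

The record with the extra tab yields a `badline` token but still reports all six fields as present, since only the first six values are inspected.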

We intentionally wrote this mapper to produce output of the form (token, count). Doing this allowed us to reuse the WordCount reducer from our earlier implementations as the reducer for this job. There are certainly more efficient implementations, but as this job is unlikely to be frequently executed, the convenience is worth it.
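
The wcreducer.rb script itself is not shown in this section. As a rough idea of what a WordCount-style reducer has to do with the mapper's output, here is a minimal sketch; the function name `sum_counts` and the sample input are assumptions made for the example, not the original script.

```ruby
# Sketch of a WordCount-style reducer: sums the counts for each token
# from lines of the form "token<TAB>count".
def sum_counts(lines)
  totals = Hash.new(0)
  lines.each do |line|
    token, count = line.chomp.split("\t", 2)
    next if token.nil? || count.nil?  # skip malformed lines
    totals[token] += count.to_i
  end
  totals
end

# Made-up mapper output for three records, one of them a bad line.
mapper_output = [
  "total\t1", "shape\t1",
  "total\t1", "badline\t1",
  "total\t1", "shape\t1"
]

sum_counts(mapper_output).sort.each do |token, count|
  puts "#{token}\t#{count}"
end
```

In a Streaming job the same function could be fed `STDIN.each_line` instead of the sample array. Because the totals are kept in a hash, this sketch does not rely on the input arriving sorted by key.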

At the time of writing, the result of this job was as follows:

badline	324
description	61372
duration	58961
location	61377
recorded	61377
shape	58855
sighted	61377
total	61377

These figures show that we have 61,377 records. All of them provide values for the sighted date, reported date, and location fields. Just under 59,000 records have values for shape and for duration, and almost all have a description.
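
The number of records missing each field follows from subtracting the field's count from the total. A short sketch of that arithmetic, using the counts from the job output above (the helper name `missing_by_field` is made up for the example):

```ruby
# Counts copied from the job output above.
counts = {
  "badline" => 324, "description" => 61372, "duration" => 58961,
  "location" => 61377, "recorded" => 61377, "shape" => 58855,
  "sighted" => 61377, "total" => 61377
}

FIELDS = %w[sighted recorded location shape duration description].freeze

# Hypothetical helper: records missing a field = total - records with that field.
def missing_by_field(counts)
  total = counts["total"]
  FIELDS.to_h { |field| [field, total - counts[field]] }
end

missing_by_field(counts).each do |field, missing|
  percent = (100.0 * missing / counts["total"]).round(2)
  puts "#{field}\t#{missing} missing (#{percent}%)"
end
```

With these counts, 2,522 records lack a shape, 2,416 lack a duration, and 5 lack a description.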

When split on tab characters, 324 lines were found not to have exactly six fields. However, since only five records had no value for description, the bad records typically have too many tabs rather than too few, most likely because tabs are used in the free text description. We could of course alter our mapper to gather detailed information on this. For now, we will do our analysis expecting most records to have correctly placed values for all six fields, but we will not make any assumptions regarding further tabs in each record.
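
One way the mapper might be altered to gather that detail is to emit a token naming the number of fields found on each line, so the reducer output shows how many lines had five, six, seven, or more fields. This is a sketch under that assumption; the `fields_N` token format and the helper name are invented for the example.

```ruby
# Hypothetical mapper variant: one token per line naming its field count.
def field_count_token(line)
  field_count = line.chomp.split("\t", -1).size
  "fields_#{field_count}"
end

# Made-up sample lines with six, seven, and two tab-separated fields.
samples = [
  "a\tb\tc\td\te\tf\n",
  "a\tb\tc\td\te\tf\tg\n",
  "a\tb\n"
]

samples.each do |line|
  puts "#{field_count_token(line)}\t1"
end
```

Because the output is still of the form (token, count), the same WordCount-style reducer could sum these tokens without modification.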

Examining UFO shapes

Out of all the fields in these reports, it was shape that immediately interested us most, as it could offer some interesting ways of grouping the data depending on what sort of information we have in that field.
