Chapter 4. Classifying UFO Sightings

In this chapter, we're going to look at a dataset of UFO sightings. Sometimes, data analysis begins with a specific question or problem. Sometimes, however, it's more nebulous and vague. We'll engage with this UFO sighting dataset, and along the way, we'll learn more about data exploration, data visualization, and topic modeling before we dive into Naïve Bayesian classification.

This dataset was collected by the National UFO Reporting Center (NUFORC), and is available at http://www.nuforc.org/. Each report includes dates, a rough location, a shape, and a description of the sighting. We'll download and pull in this dataset. We'll see how to extract more structured data from messy, free-form text. And from there, we'll see how to visualize, analyze, and gain insights into our data.

In the process, we'll discover the best time to look for UFOs. We'll also learn what their important characteristics are, and we'll learn how to tell a description of a possible hoax sighting from one that may be real. In the end, hopefully, we'll be better prepared to see one of these ourselves. After all, we'll know when to look and what to look for.

Getting the data

For this chapter, actually acquiring the data will be relatively easy. In other chapters, this step involves screen scraping, SPARQL, or other data extraction, munging, and cleaning techniques. For this dataset, we'll just download it from Infochimps (http://www.infochimps.com/). Infochimps is a company (and their website) devoted to Big Data and doing more with data analysis. They provide a collection of datasets that are online and freely available. To download this specific dataset, browse to http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada and download the data from the link there, as shown in the following screenshot:

[Screenshot: Getting the data]

The data is in a ZIP-compressed file. Unzipping it expands the files into the chimps_16154-2010-10-20_14-33-35 directory. This directory contains a file that lists metadata for the dataset as well as the data itself in several different formats. For the purposes of this chapter, we'll use the tab-separated values (TSV) file. It's similar to a comma-separated values (CSV) file, but it uses the tab character as a delimiter instead of a comma. This works nicely because tab characters rarely appear inside text fields, so it's often possible to use this format while escaping few, if any, fields.
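To make the delimiter point concrete, here is a minimal sketch in Python of reading tab-delimited rows. It is an illustration only, not the code this chapter goes on to use. The two sample rows are made up, and the `read_tsv` helper is a hypothetical name. For the real file, we would pass an open file handle instead of the in-memory sample.

```python
import csv
import io

# Made-up sample rows standing in for the real file. Fields are separated
# by tabs, so the commas inside a field need no quoting or escaping.
SAMPLE_TSV = (
    "20010101\t20010102\tSpringfield, IL\tdisk\t5 min\tA bright light, moving fast\n"
    "20010315\t20010316\tSalem, OR\t\t\tThree lights, then nothing\n"
)


def read_tsv(handle):
    """Return every row of a tab-delimited stream as a list of strings."""
    # QUOTE_NONE treats quote characters as ordinary text, which suits
    # free-form descriptions that may contain stray quotation marks.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_tsv(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(len(row), row[2])
```

Note that the second sample row has empty strings for two of its fields: a missing value shows up as two adjacent tabs, so the column positions still line up.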

If we open the 16154.yaml file, we'll see metadata and other information about the dataset. From it, we learn that the fields in the dataset are as follows:

  • sighted_at: The date (as YYYYMMDD) the sighting happened
  • reported_at: The date the sighting was reported to NUFORC
  • location: The city and state the event happened in
  • shape: The shape of the object
  • duration: The duration the event lasted
  • description: A longer description of the sighting as a raw text string
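As a rough sketch of how these fields might be given names and types, the following Python turns one row of six strings into a record. Several things here are assumptions rather than facts from the dataset's metadata: that the columns appear in the order listed above, that reported_at uses the same YYYYMMDD format as sighted_at, and the names `Sighting`, `parse_date`, and `to_sighting`, which are hypothetical.

```python
from datetime import date, datetime
from typing import NamedTuple, Optional


class Sighting(NamedTuple):
    sighted_at: Optional[date]
    reported_at: Optional[date]
    location: str
    shape: str
    duration: str
    description: str


def parse_date(text):
    """Parse a YYYYMMDD string, returning None if it is not a valid date."""
    try:
        return datetime.strptime(text.strip(), "%Y%m%d").date()
    except ValueError:
        return None


def to_sighting(fields):
    """Build a Sighting from six strings (column order is an assumption)."""
    sighted, reported, location, shape, duration, description = fields
    return Sighting(
        parse_date(sighted),
        parse_date(reported),  # assumed to share the YYYYMMDD format
        location.strip(),
        shape.strip(),
        duration.strip(),
        description.strip(),
    )


example = to_sighting(
    ["19950202", "19950203", "Denmark, WI", "cone", "75 min", "Caller ..."]
)
print(example.sighted_at, example.location)
```

Returning None for an unparseable date, rather than raising an error, lets us keep a row whose date is malformed and decide later whether to drop it.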

We can get a better feel for this data by examining a row from the downloaded file. The following table represents what the fields contain for that record:

Field        Value
sighted_at   19950202
reported_at  19950203
location     Denmark, WI
shape        Cone
duration     75 min
description  Caller, and apparently several other people, witnessed multiple strange craft streaking through the night sky in the vicinity of Denmark and Mirabel, WI. Craft were seen to streak overhead, as well as to descend vertically, as fast as a meteorite, then stop suddenly just above the ground. During the last 30 minutes of the sighting, aircraft, which appeared to be US military craft, were seen either pursuing, or chaperoning, the strange craft. The objects were cone shaped, with a red nose and a green tail (sic).

Browsing through other rows, you will observe that some important fields—shape and duration—may be missing data. The description has XML entities and abbreviations such as w/ and repts.
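One possible way to handle these two problems is sketched below in Python. It is only an illustration under stated assumptions: the helper names `blank_to_none` and `clean_description` are hypothetical, and the abbreviation table holds a single entry, w/, because what the other abbreviations expand to should be checked against the data rather than guessed. Entity decoding uses the standard library's `html.unescape`, which handles named and numeric character references.

```python
import html
import re

# Hypothetical starting table; extend it after inspecting the real data.
ABBREVIATIONS = {"w/": "with"}


def blank_to_none(value):
    """Map empty or whitespace-only fields to None so gaps are explicit."""
    value = value.strip()
    return value or None


def clean_description(text):
    """Decode entities such as &amp; and expand known abbreviations."""
    text = html.unescape(text)
    for short, full in ABBREVIATIONS.items():
        # Only match the abbreviation as a standalone token, that is,
        # with whitespace or the string boundary on both sides.
        pattern = r"(?<!\S)" + re.escape(short) + r"(?!\S)"
        text = re.sub(pattern, full, text)
    return text


cleaned = clean_description("Seen w/ friends &amp; family")
print(cleaned)
print(blank_to_none("   "), blank_to_none(" cone "))
```

Matching only standalone tokens is a deliberately conservative choice. It leaves run-together forms such as w/friends untouched, so those would need their own rule.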

Let's see what we can do with that.
