Description

While the shape field is important, the description has more information. Let's see what we can do with it.

First, let's examine a few and see what some of them look like. The following example is one that I selected randomly:

Large boomerang shaped invisible object blocked starlight while flying across sky. I have a sketch and noted the year was 1999, but did not write down the day. The sighting took place in the late evening when it was completely dark and the sky was clear and full of stars. Out of the corner of my eye, I noticed movement in the sky from the north moving to the south. When I looked closer, however, it wasn’t an object that I was seeing move, rather it was the disappearance and reappearance of stars behind an object. The object itself was black or invisible with no lights. Given the area of stars that were blocked out, I would say the object was five times larger than a jet. It was completely silent. It was shaped like a boomerang only a little more rounded in front rather than triangle and a slightly sharper points on the “wing” tips. Since the object was invisible, I can only suggest the shape based on the black area absent of stars like a silhouette as it moved across the sky. If the object was indeed five times the size of a jet and flying at about the attitude of a jet, then it was moving much faster than a jet. I blinked a couple times, looked away and looked back, and then followed the object across the remainder of the horizon until it was out of sight. In all it took about 8-10 seconds to span the sky and flew at the same altitude the whole time. Given the triangular shape, I suppose it could have been a low-flying Stealth Bomber that just appeared much larger if flying low. But is a Stealth completely silent? Also, Stealth Bombers have three triangles pointing backwards from the mid section. The object I saw did not seem to have any mid section as such.((NUFORC Note: Witness indicates that date of incident is approximate. PD))

So we can see that some examples are fairly long, and they may have characters encoded as HTML/XML entities (the curly quotation marks “ and ” in this example). And this one is relatively clean: some have two or more words jammed together with nothing but punctuation, often several periods, between them.

To work with this data, we'll need to clean it up and break it into words, or tokenize it. You can see the details in the code download; most of it just strings together a lot of string manipulation methods, but it's helpful to remind ourselves what we're working with and how we need to deal with it. I also filtered the tokens against a standard English stop-word list, which I augmented with a few words that are specific to the description fields, such as PD and NUFORC.
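As a rough illustration of that clean-up, here is a minimal sketch. It is not the code from the download: the names `stop-words`, `normalize`, `tokenize`, and `tokens` are hypothetical, the stop-word set is a tiny stand-in for a full list, and it assumes that keeping only runs of ASCII letters and digits is good enough for this corpus.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, heavily abbreviated stop-word set. A real one would be a
;; full standard English list plus corpus-specific words like these last two.
(def stop-words
  #{"a" "an" "and" "i" "in" "it" "of" "the" "to" "was" "pd" "nuforc"})

(defn normalize
  "Lower-cases the text and replaces anything that looks like an
  HTML/XML entity (&name; or &#123;) with a space."
  [text]
  (-> text
      str/lower-case
      (str/replace #"&#?\w+;" " ")))

(defn tokenize
  "Returns the runs of ASCII letters and digits in the text. Everything
  else acts as a separator, so words jammed together with periods
  (\"object...was\") come apart."
  [text]
  (re-seq #"[a-z0-9]+" (normalize text)))

(defn tokens
  "Tokenizes the text and drops stop words."
  [text]
  (remove stop-words (tokenize text)))
```

For example, `(tokens "The object...was SILENT.((NUFORC Note: PD))")` yields `("object" "silent" "note")`. One consequence of the simple pattern is that contractions such as "wasn't" are split at the apostrophe, which may or may not matter for a given analysis.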

Let's see what the most frequent words are in the description fields:

user=> (def descr-counts (a/get-descr-counts data 50))
#'user/descr-counts
user=> (take 10 descr-counts)
({:count 85428, :descr "object"}
 {:count 82526, :descr "light"}
 {:count 73182, :descr "lights"}
 {:count 72011, :descr "sky"}
 {:count 58016, :descr "like"}
 {:count 47193, :descr "one"}
 {:count 40690, :descr "bright"}
 {:count 38225, :descr "time"}
 {:count 37065, :descr "could"}
 {:count 35953, :descr "looked"})

This seems more like what we'd expect. The most frequent word is object, which seems appropriate for a corpus made up of people talking about things that they can't identify. The next two words are light and lights, which would be expected, especially since light is the most common item in the shape field.
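The counting step itself is small. The following is only a sketch of how such a table could be built from already-tokenized descriptions; `word-counts` and its `min-count` cutoff are hypothetical and are not claimed to match `a/get-descr-counts` or the meaning of its second argument.

```clojure
(defn word-counts
  "Counts tokens across all descriptions and returns maps shaped like the
  REPL output above, most frequent first. min-count is a hypothetical
  cutoff that drops tokens seen fewer than that many times."
  [token-seqs min-count]
  (->> token-seqs
       (apply concat)                         ; one flat sequence of tokens
       frequencies                            ; {token n, ...}
       (filter (fn [[_ n]] (>= n min-count)))
       (map (fn [[word n]] {:count n :descr word}))
       (sort-by :count >)))
```

Calling `(word-counts [["light" "sky" "light"] ["object" "light" "sky"]] 2)` returns `({:count 3, :descr "light"} {:count 2, :descr "sky"})`.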

Let's graph these terms too. We won't be able to see the details of the individual words' frequencies, but the graph will give us a better feel for their distribution. There are a lot of tokens, however, so we'll only look at the 75 most frequent ones in the following graph:

[Graph: frequencies of the 75 most frequent words in the description field]

The distribution of these words seems very similar to the one we saw for the shape field. In fact, it very roughly conforms to Zipf's law, which describes the power-law distribution found in many types of physical and social data, including word frequencies in language.
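In its simplest form, Zipf's law says that the frequency of the word at rank r is proportional to 1/r^s for some exponent s, so plotting log(frequency) against log(rank) should give roughly a straight line with slope -s. One hedged way to check this is to fit that slope by least squares. The function below is an illustrative sketch, not something from the code download, and the ten counts shown earlier are far too few to judge the fit on their own.

```clojure
(defn log-log-slope
  "Least-squares slope of log(count) against log(rank), with ranks
  starting at 1. A pure power law count = C / rank^s gives a slope of -s."
  [counts]
  (let [n    (count counts)
        xs   (map #(Math/log (double %)) (range 1 (inc n)))
        ys   (map #(Math/log (double %)) counts)
        mean #(/ (reduce + %) n)
        mx   (mean xs)
        my   (mean ys)
        sxy  (reduce + (map #(* (- %1 mx) (- %2 my)) xs ys))
        sxx  (reduce + (map #(let [d (- % mx)] (* d d)) xs))]
    (/ sxy sxx)))

;; The ten counts from the REPL session above, in rank order.
(def top-ten
  [85428 82526 73182 72011 58016 47193 40690 38225 37065 35953])
```

Evaluating `(log-log-slope top-ten)` gives a negative number, as expected for counts that fall with rank. A more meaningful check would pass in the `:count` values of all 75 graphed tokens, or of the whole vocabulary.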
