Dealing with messy data

The first thing we need to deal with is the qualitative data in the shape and description fields.

The shape field seems like a good place to start. Let's see how many items have usable data for it:

user=> (def data (m/read-data "data/ufo_awesome.tsv"))
user=> (count (remove (comp str/blank? :shape) data))
58870
user=> (count (filter (comp str/blank? :shape) data))
2523
user=> (count data)
61393
user=> (float 2506/61137)
0.04098991
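
The same proportion can also be computed from the records themselves rather than by typing the counts in by hand. The following is a minimal sketch, not the book's code: blank-fraction and sample-data are hypothetical names introduced here, and the small vector of maps only stands in for the records that m/read-data returns.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the parsed records. The real data comes from
;; m/read-data, which is not reproduced here.
(def sample-data
  [{:shape "light"} {:shape " "} {:shape "disk"} {:shape nil} {:shape "oval"}])

(defn blank-fraction
  "Fraction of records whose value under key k is nil, empty, or whitespace."
  [k records]
  (let [total (count records)]
    (if (zero? total)
      0.0 ; avoid dividing by zero on an empty collection
      (double (/ (count (filter (comp str/blank? k) records))
                 total)))))

(blank-fraction :shape sample-data)
;; => 0.4
```

Because str/blank? treats nil, the empty string, and whitespace-only strings alike, a single predicate covers every way the field can be missing.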

So about 4 percent of the records in the data set have no meaningful value in the shape field. Let's see what the most popular values for that field are:

user=> (def shape-freqs
           (frequencies
             (map str/trim
                  (map :shape
                       (remove (comp str/blank? :shape) data)))))
#'user/shape-freqs
user=> (pprint (take 10 (reverse (sort-by second shape-freqs))))
(["light" 12202]
 ["triangle" 6082]
 ["circle" 5271]
 ["disk" 4825]
 ["other" 4593]
 ["unknown" 4490]
 ["sphere" 3637]
 ["fireball" 3452]
 ["oval" 2869]
 ["formation" 1788])

Interesting! The most frequent value, light, isn't a shape at all. The values other and unknown also rank high: together they account for 9,083 of the 58,870 non-blank entries, or about 15 percent. We can use the shape field, but we need to keep these things in mind.
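
One way to keep this in mind is to treat other and unknown as if they were missing before counting. The following is a minimal sketch under that assumption: the set uninformative-shapes and the function informative-shape are hypothetical names introduced here, and which labels count as uninformative is a judgment call rather than something the data set defines.

```clojure
(require '[clojure.string :as str])

;; Assumption: these labels carry no shape information. The set is chosen
;; here for illustration only.
(def uninformative-shapes #{"other" "unknown"})

(defn informative-shape
  "Returns the trimmed :shape of record, or nil when the value is blank
  or one of the uninformative labels."
  [record]
  (let [shape (:shape record)]
    (when-not (str/blank? shape)
      (let [trimmed (str/trim shape)]
        (when-not (uninformative-shapes trimmed)
          trimmed)))))

;; Hypothetical stand-in for the records returned by m/read-data.
(def sample-data
  [{:shape " light"} {:shape "unknown"} {:shape "disk"}
   {:shape ""} {:shape "light"} {:shape "other "}])

;; keep drops the nil results, so only informative shapes are counted.
(frequencies (keep informative-shape sample-data))
;; => {"light" 2, "disk" 1}
```

The value light is left in place here. It is not a shape, but it still describes what the witness saw, so whether to drop it depends on the analysis.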
