Sampling from very large data sets

One way to deal with very large data sets is to sample. This can be especially useful when we're getting started and want to explore a dataset. A good sample can tell us what's in the full dataset and what we'll need to do in order to clean and process it. Samples are used in any kind of survey or election exit polling.

In this recipe, we'll see a couple of ways of creating samples.

Getting ready

We'll use a basic project.clj file:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

How to do it…

There are two ways to sample from a stream of values. If you want 10 percent of the larger population, you can just take every tenth item. If you want 1,000 out of who knows how many items, the process is a little more complicated.

Sampling by percentage

  1. Performing a rough sampling by percentage is pretty simple:
    (defn sample-percent
      [k coll]  (filter (fn [_] (<= (rand) k)) coll))
  2. Using it is also simple:
    user=> (sample-percent 0.01 (range 1000))
    (141 146 155 292 598 624 629 640 759 815 852 889)
    user=> (count *1)
    12

Sampling exactly

Sampling for an exact count is a little more complicated. We'll use Donald Knuth's algorithm from The Art of Computer Programming, Volume 2. This takes the sample off the front of the input sequence, and then from this point, each new item from the input has a chance of sample-size / size-of-collection-so-far randomly replacing one existing item in the sample. To implement this, we'll need one helper function which takes a map and a new key-value pair. It removes a random key from the map and inserts the new pair:

(defn rand-replace
  [m k v]  (assoc (dissoc m (rand-nth (keys m))) k v))

We'll also need another small utility to create an infinite range that begins at a given place:

(defn range-from [x] (map (partial + x) (range)))

Now, we use this to create the function that does the sampling:

(defn sample-amount [k coll]
  (->> coll
    (drop k)
    (map vector (range-from (inc k)))
    (filter #(<= (rand) (/ k (first %))))
    (reduce rand-replace
            (into {} (map vector (range k) (take k coll))))
    (sort-by first)
    (map second)))

Using this is as simple as using the first function though:

user=> (sample-amount 10 (range 1000))
(70 246 309 430 460 464 471 547 955 976)
user=> (count *1)
10

How it works…

Sampling by percentage just compares the percentage against a random value for each item in the collection. If the random number is less than the value, it saves the item. Notice though, that since it's random, the exact number that it pulls out doesn't necessarily match the parameter exactly. In this case, 1 percent.

Sampling by a set amount is more complicated. We keep a map of the sample, keyed by each item's position in the original sequence. Originally, we populate this map with the first items off the sequence. Afterwards, we walk through the rest of the collection. For each item, we randomly decide whether to keep it or not. If we do keep it, we randomly swap it with one item that is currently in the sample.

Let's see what this looks like in the code:

  1. Initially, we want to take the sample off the front of the collection. The processing pipeline in sample-amount will work over the rest of the collection, so we'll begin by dropping the initial sample off the front:
    (defn sample-amount [k coll]
      (->> coll
           (drop k)
  2. In order to figure out each subsequent item's probability of being chosen for the sample, we need to have its position in the collection. We can get this by associating each item with its position in a vector pair:
           (map vector (range-from (inc k)))
  3. Now, filter out all of the items whose position number, divided by the sample size, is less than a random number. This randomly replaces each item based on its position, as outlined in the algorithm:
           (filter #(<= (rand) (/ k (first %))))
  4. At this point, we start building the final sample as a hash map that maps each item's position in the original collection with the item itself. We use rand-replace to swap out an item from the sample Hashmap for each item that passed the random filter in Step 3:
           (reduce rand-replace
                   (into {}
                   (map vector (range k) (take k coll))))
  5. Once the reduce call is made, we can sort the hash-map by position:
           (sort-by first)
  6. Finally, we throw away the positions and return a sequence of a sample of the items from the original collection:
           (map second)))
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.98.208