Composing Cascalog queries

One of the best things about Cascalog queries is that they compose: one query can serve as an input to another. As with composing functions, this is a good way to build a complex process from smaller, easy-to-understand parts.

In this recipe, we'll parse the Virginia census data that we first used in the Managing program complexity with STM recipe in Chapter 3, Managing Complexity with Concurrent Programming. You can download this data from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P35.csv. We'll also use a second census data file that contains the race data. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P3.csv.

Getting ready

Since we're reading CSV, we'll need to use the dependencies and imports from the Parsing CSV files with Cascalog recipe.

We'll also use the hfs-text-delim function from that recipe and ->long from the Aggregating data with Cascalog recipe.

Finally, we'll need the data files from http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P35.csv and http://www.ericrochester.com/clj-data-analysis/data/all_160_in_51.P3.csv. We'll put them into the data directory and bind their paths to names, as follows:

(def families-file "data/all_160_in_51.P35.csv")
(def race-file "data/all_160_in_51.P3.csv")
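
The ->long helper comes from an earlier recipe and its definition is not shown here. As a rough stand-in, assuming it simply parses a string field into a long, it might look like the following sketch. The body, including the whitespace trimming, is an assumption and not necessarily the original definition.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the ->long helper from the earlier recipe.
;; It trims surrounding whitespace and parses the string as a long.
(defn ->long
  [s]
  (Long/parseLong (str/trim s)))

(->long "8191")
;; => 8191
```

An ordinary function like this can usually be called directly inside a query predicate, as in (->long ?spop100 :> ?POP100), although the exact requirements depend on the Cascalog version in use.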

How to do it…

We'll read these datasets and convert some of the fields in each to integers. Then we'll join the two together and select only a few of the fields.

  1. We'll define a query that reads the families data file and converts the numeric fields from strings to integers:
    (def family-data
      (<- [?GEOID ?SUMLEV ?STATE
           ?NAME ?POP100 ?HU100 ?P035001]
          ((hfs-text-delim families-file
                           :has-header true)
             ?GEOID ?SUMLEV ?STATE _ _ _ _ _
             ?NAME ?spop100 ?shu100 _ _ ?sp035001 _)
          (->long ?spop100  :> ?POP100)
          (->long ?shu100   :> ?HU100)
          (->long ?sp035001 :> ?P035001)))
  2. We also need to read in the race data file:
    (def race-data
      (<- [?GEOID ?SUMLEV ?STATE
           ?NAME ?POP100 ?HU100 ?P003001 ?P003002
           ?P003003 ?P003004 ?P003005 ?P003006 ?P003007
           ?P003008]
          ((hfs-text-delim race-file :has-header true)
             ?GEOID ?SUMLEV ?STATE _ _ _ _ _
             ?NAME ?spop100 ?shu100 _ _
             ?sp003001 _ ?sp003002 _ ?sp003003 _
             ?sp003004 _ ?sp003005 _ ?sp003006 _
             ?sp003007 _ ?sp003008 _)
          (->long ?spop100  :> ?POP100)
          (->long ?shu100   :> ?HU100)
          (->long ?sp003001 :> ?P003001)
          (->long ?sp003002 :> ?P003002)
          (->long ?sp003003 :> ?P003003)
          (->long ?sp003004 :> ?P003004)
          (->long ?sp003005 :> ?P003005)
          (->long ?sp003006 :> ?P003006)
          (->long ?sp003007 :> ?P003007)
          (->long ?sp003008 :> ?P003008)))
  3. We'll use the preceding queries to build a query that joins them on the ?GEOID field. It will also rename some of the fields:
    (def census-joined
      (<- [?name ?pop100 ?hu100 ?families
           ?white ?black ?indian ?asian ?hawaiian ?other
           ?multiple]
          (family-data ?geoid _ _
                       ?name ?pop100 ?hu100 ?families)
          (race-data ?geoid _ _ _ _ _ _
                     ?white ?black ?indian ?asian
                     ?hawaiian ?other ?multiple)))

Now we can run this and send the results to the standard output:

user=> (?- (stdout) census-joined)
…
RESULTS
-----------------------

Abingdon town   8191    4271    2056    7681    257     15      86      6
Accomac town    519     229     117     389     106     0       3       1
Adwolf CDP      1530    677     467     1481    17      1       4       0
Alberta town    298     163     77      177     112     4       0       0
Alexandria city 139966  72376   30978   85186   30491   589     8432    141
Altavista town  3450    1669    928     2415    891     5       20      0
Amherst town    2231    1032    550     1571    550     17      14      0
Annandale CDP   41008   14715   9790    20670   3533    212     10103   53
Appalachia town 1754    879     482     1675    52      4       2       0
Appomattox town 1733    849     441     1141    540     8       3       0
Aquia Harbour CDP       6727    2300    1914    5704    521     38      150
…
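
To see the join behavior without the census files, the same pattern can be tried on in-memory data. This is a minimal sketch, assuming Cascalog is on the classpath. The toy tuples and the names toy-families, toy-race, families-q, race-q, and toy-joined are invented for illustration. It relies on Cascalog accepting a vector of tuples as a generator, and on ??-, which returns results as Clojure sequences instead of writing them to a sink.

```clojure
(ns toy-join
  (:require [cascalog.api :refer [<- ??-]]))

;; Toy stand-ins for the two census files. The ids and counts are invented.
;; Layout: [id name families] and [id white black].
(def toy-families [["g1" "Town A" 10]
                   ["g2" "Town B" 20]])
(def toy-race [["g1" 7 3]
               ["g2" 15 5]
               ["g3" 1 1]])   ; no matching id in toy-families

;; Two small queries, each defined but not yet executed.
(def families-q
  (<- [?id ?name ?families]
      (toy-families ?id ?name ?families)))

(def race-q
  (<- [?id ?white ?black]
      (toy-race ?id ?white ?black)))

;; Using ?id in both predicates joins the two queries on that field.
(def toy-joined
  (<- [?name ?families ?white ?black]
      (families-q ?id ?name ?families)
      (race-q ?id ?white ?black)))

;; ??- executes the query and returns one result sequence per query.
(def rows (first (??- toy-joined)))
```

Because this is an inner join, the g3 row, which has no counterpart in toy-families, should not appear in rows. The order of the returned tuples is not guaranteed.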

How it works…

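In every recipe so far, we've used the ?<- macro, which combines <- and ?-. The arrow, <-, defines a query without running it, which is what lets us create and compose queries. The ?- operator executes a query and sends the results to a sink. The combined ?<- macro is convenient, but using the two separately is more powerful: here, family-data and race-data are defined with <- and then used as inputs to census-joined, and nothing runs until we pass census-joined to ?-.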
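
The relationship between the three operators can be sketched on a small in-memory dataset. This is an illustrative example, assuming Cascalog is on the classpath. The ages data and the names people-q, over-30-q, and over-30 are invented, and ??- is used only to get the results back as a Clojure sequence.

```clojure
(ns split-query
  (:require [cascalog.api :refer [<- ?- ?<- ??- stdout]]))

;; Invented sample data: [name age].
(def ages [["alice" 28] ["bob" 33] ["carol" 41]])

;; Combined form: define the query and run it in a single step.
(?<- (stdout) [?name ?age] (ages ?name ?age))

;; Separate form: <- only defines the query, ?- runs it later.
(def people-q
  (<- [?name ?age] (ages ?name ?age)))
(?- (stdout) people-q)

;; Because people-q is an ordinary value, another query can build on it.
(def over-30-q
  (<- [?name]
      (people-q ?name ?age)
      (> ?age 30)))

;; ??- executes the query and returns one result sequence per query.
(def over-30 (first (??- over-30-q)))
```

The first two forms should print the same tuples. Only the separate form leaves behind a named query that can be reused, which is what census-joined does with family-data and race-data.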
