Being able to aggregate data from many linked data sources is good, but most data isn't already formatted for the semantic Web. Fortunately, linked data's flexible and dynamic data model facilitates the integration of data from multiple sources.
For this recipe, we'll combine several previous recipes. We'll load currency data from RDF, as we did in the Reading RDF data recipe. We'll also scrape the exchange rate data from X-Rates (http://www.x-rates.com) to get information out of a table, just as we did in the Scraping data from tables in web pages recipe. Finally, we'll dump everything into a triple store and pull it back out, as we did in the last recipe.
First, make sure your Leiningen project.clj file has the right dependencies:
(defproject getting-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [enlive "1.1.5"]
                 [edu.ucdenver.ccp/kr-sesame-core "1.4.17"]
                 [org.clojure/tools.logging "0.3.0"]
                 [org.slf4j/slf4j-simple "1.7.7"]
                 [clj-time "0.7.0"]])
We need to declare that we'll use these libraries in our script or REPL:
(require '(clojure.java [io :as io]))
(require '(clojure [xml :as xml]
                   [string :as string]
                   [zip :as zip]))
(require '(net.cgrand [enlive-html :as html]))
(use 'incanter.core
     'clj-time.coerce
     '[clj-time.format :only (formatter formatters parse unparse)]
     'edu.ucdenver.ccp.kr.kb
     'edu.ucdenver.ccp.kr.rdf
     'edu.ucdenver.ccp.kr.sparql
     'edu.ucdenver.ccp.kr.sesame.kb)
(import [java.io File]
        [java.net URL URLEncoder])
Finally, make sure that you have the file data/currencies.ttl, which we've been using since the Reading RDF data recipe.
Since this is a longer recipe, we'll build it up in segments. At the end, we'll tie everything together.
To begin with, we'll create the triple store. This has become pretty standard by now; in fact, we'll use the same versions of kb-memstore and init-kb that we've been using since the Reading RDF data recipe.
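If you don't have that recipe handy, here is a minimal sketch of those two functions. The kb and register-namespaces calls are part of the kr library's API; the prefix-to-URI mappings shown are placeholders, so substitute the prefixes your data actually declares:

```clojure
;; A Sesame triple store that lives entirely in memory.
(defn kb-memstore
  "This creates a Sesame triple store in memory."
  []
  (kb :sesame-mem))

;; Register the namespace prefixes that the statements and queries
;; below use. The URIs here are illustrative placeholders, not the
;; exact ones from the earlier recipe.
(defn init-kb
  [kb-store]
  (register-namespaces
    kb-store
    '(("rdf" "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
      ("xsd" "http://www.w3.org/2001/XMLSchema#")
      ("money" "http://telegraphis.net/ontology/money/money#")
      ("currency" "http://telegraphis.net/data/currencies/")
      ("err" "http://example.com/schema/exchange-rates#"))))
```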
The first data that we'll pull into the triple store is the current exchange rates:
(defn find-time-stamp
  ([module-content]
   (second
     (map html/text
          (html/select module-content
                       [:span.ratesTimestamp])))))

(def time-stamp-format
  (formatter "MMM dd, yyyy HH:mm 'UTC'"))

(defn normalize-date
  ([date-time]
   (unparse (formatters :date-time)
            (parse time-stamp-format date-time))))
(defn find-data
  ([module-content]
   (html/select module-content
                [:table.tablesorter.ratesTable :tbody :tr])))

(defn td->code
  ([td]
   (let [code (-> td
                  (html/select [:a])
                  first
                  :attrs
                  :href
                  (string/split #"=")
                  last)]
     (symbol "currency" (str code "#" code)))))

(defn get-td-a
  ([td]
   (->> td
        :content
        (mapcat :content)
        string/join
        read-string)))

(defn get-data
  ([row]
   (let [[td-header td-to td-from]
         (filter map? (:content row))]
     {:currency (td->code td-to)
      :exchange-to (get-td-a td-to)
      :exchange-from (get-td-a td-from)})))
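To make the cell-parsing concrete, suppose a table cell's anchor links to a URL ending in to=EUR (a hypothetical href; the live page's links may differ). td->code splits the href on "=", takes the last piece, and interns it as a symbol in the currency namespace:

```clojure
;; A hand-built Enlive-style node standing in for one <td> from the
;; rates table. Both the href and the link text are made up.
(def sample-td
  {:tag :td
   :content [{:tag :a
              :attrs {:href "/graph/?from=USD&to=EUR"}
              :content ["US Dollar to Euro"]}]})

;; (td->code sample-td)
;; => currency/EUR#EUR
```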
(defn data->statements
  ([time-stamp data]
   (let [{:keys [currency exchange-to]} data]
     (list [currency 'err/exchangeRate exchange-to]
           [currency 'err/exchangeWith 'currency/USD#USD]
           [currency 'err/exchangeRateDate
            [time-stamp 'xsd/dateTime]]))))
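For instance, given a normalized timestamp and one scraped row's map, data->statements expands them into three triples. The input values here are made up for illustration:

```clojure
;; Hypothetical input: one scraped row plus a normalized timestamp.
(data->statements
  "2014-04-01T13:00:00.000Z"
  {:currency 'currency/EUR#EUR
   :exchange-to 0.726072
   :exchange-from 1.377204})
;; =>
;; ([currency/EUR#EUR err/exchangeRate 0.726072]
;;  [currency/EUR#EUR err/exchangeWith currency/USD#USD]
;;  [currency/EUR#EUR err/exchangeRateDate
;;   ["2014-04-01T13:00:00.000Z" xsd/dateTime]])
```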
(defn load-exchange-data
  "This downloads the HTML page and pulls the data out of it."
  [kb html-url]
  (let [html (html/html-resource html-url)
        div (html/select html [:div.moduleContent])
        time-stamp (normalize-date (find-time-stamp div))]
    (add-statements
      kb
      (mapcat (partial data->statements time-stamp)
              (map get-data (find-data div))))))
That's a mouthful, but now that we can get all of the data into a triple store, we just need to pull everything back out and into Incanter.
Bringing the two data sources together and exporting them to Incanter is fairly easy at this point:
(defn aggregate-data
  "This controls the process and returns the aggregated data."
  [kb data-file data-url q col-map]
  (load-rdf-file kb (File. data-file))
  (load-exchange-data kb (URL. data-url))
  (to-dataset (map (partial rekey col-map) (query kb q))))
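aggregate-data relies on rekey, a small helper from the earlier triple-store recipe that renames each query-result map's keys according to col-map. In case you don't have it handy, here's a minimal sketch built on clojure.set/rename-keys:

```clojure
(require '[clojure.set :refer [rename-keys]])

(defn rekey
  "Renames the keys of m according to k-map, keeping only the
  keys that k-map mentions."
  [k-map m]
  (rename-keys (select-keys m (keys k-map)) k-map))

;; e.g. (rekey {'?/iso :iso} {'?/iso "EUR", '?/c 'currency/EUR#EUR})
;; => {:iso "EUR"}
```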
We'll need to do a lot of the setup we've done before. Here, we'll bind the triple store, the query, and the column map to names so that we can refer to them easily:
(def t-store (init-kb (kb-memstore)))

(def q
  '((?/c rdf/type money/Currency)
    (?/c money/name ?/name)
    (?/c money/shortName ?/shortName)
    (?/c money/isoAlpha ?/iso)
    (?/c money/minorName ?/minorName)
    (?/c money/minorExponent ?/minorExponent)
    (:optional
      ((?/c err/exchangeRate ?/exchangeRate)
       (?/c err/exchangeWith ?/exchangeWith)
       (?/c err/exchangeRateDate ?/exchangeRateDate)))))

(def col-map
  {'?/name :fullname
   '?/iso :iso
   '?/shortName :name
   '?/minorName :minor-name
   '?/minorExponent :minor-exp
   '?/exchangeRate :exchange-rate
   '?/exchangeWith :exchange-with
   '?/exchangeRateDate :exchange-date})
The specific URL that we're going to scrape is http://www.x-rates.com/table/?from=USD&amount=1.00. Let's go ahead and put everything together:
user=> (def d
         (aggregate-data t-store
                         "data/currencies.ttl"
                         "http://www.x-rates.com/table/?from=USD&amount=1.00"
                         q col-map))
user=> (sel d :rows (range 3)
            :cols [:fullname :name :exchange-rate])

|                   :fullname |  :name | :exchange-rate |
|-----------------------------+--------+----------------|
| United Arab Emirates dirham | dirham |       3.672845 |
| United Arab Emirates dirham | dirham |       3.672845 |
| United Arab Emirates dirham | dirham |       3.672849 |
…
As you will see, some of the currencies from currencies.ttl don't have exchange data (the ones whose exchange columns are nil). We can look in other sources for that data, or decide that those currencies don't matter for our project.
A lot of this is just a slightly more complicated version of what we've seen before, pulled together into one recipe. The complicated part is scraping the web page, which is driven by the structure of the page itself.
After taking a look at the source for the page and playing with it in the REPL, the page's structure became clear. First, we needed to pull the timestamp off the top of the table that lists the exchange rates. Then, we walked over the table and pulled the data from each row. Both of the data tables (the short and long ones) are in a div element with a moduleContent class, so everything begins there.
Next, we drilled down from the module's content into the rows of the rates table. Inside each row, we pulled out the currency code and returned it as a symbol in the currency namespace. We also drilled down to the exchange rates and returned them as floats. Then, we put everything into a map and converted it to triple vectors, which we added to the triple store.