Querying and storing datasets

When dealing with large datasets, it's useful to be able to query the data based on some arbitrary conditions. Also, it's more reliable to store the data in a database rather than in a flat file or as an in-memory resource. The Incanter library provides us with several useful functions to perform these operations, as we will demonstrate in the code example that will follow.

Note

The Incanter library and the MongoDB driver used in the upcoming example can be added to a Leiningen project by adding the following dependency to the project.clj file:

[congomongo "0.4.1"]
[incanter "1.5.4"]

For the upcoming example, the namespace declaration should look similar to the following declaration:

(ns my-namespace
  (:use incanter.core
        [incanter.mongodb   :only [insert-dataset
                                   fetch-dataset]]
        [somnium.congomongo :only [mongo!]]
        [incanter.datasets  :only [get-dataset]]))

Also, this example requires MongoDB to be installed and running.

For this example, we will use the Iris dataset, which can be fetched using the get-dataset function from the incanter.datasets namespace. The code is as follows:

(def iris (get-dataset :iris))

As shown in the previous code, we simply bind the Iris dataset to a variable iris. We can perform various operations on this dataset using the with-data function. To view the data, we can use the view function along with the with-data function to provide a tabular representation of the dataset, as shown in the following code:

user> (with-data iris
        (view (conj-cols (range (nrow $data)) $data)))

The $data variable is a special binding that can be used to represent the entire dataset within the scope of the with-data function. In the previous code, we add an extra column to represent the row number of a record to the data using a composition of the conj-cols, nrows, and range functions. The data is then displayed in a spreadsheet-like table using the view function. The previous code produces the following table that represents the dataset:

Querying and storing datasets

We can also select columns we are interested in from the original dataset using the $ function within the scope of the with-data function, as shown in the following code:

user> (with-data iris ($ [:Species :Sepal.Length]))


|   :Species | :Sepal.Length |
|------------+---------------|
|     setosa |           5.1 |
|     setosa |           4.9 |
|     setosa |           4.7 |
  ...
|  virginica |           6.5 |
|  virginica |           6.2 |
|  virginica |           5.9 |

The $ function selects the :Species and :Sepal.Length columns from the iris dataset in the code example shown previously. We can also filter the data based on a condition using the $where function, as shown in the following code:

user> (with-data iris ($ [:Species :Sepal.Length]
                         ($where {:Sepal.Length 7.7})))

|  :Species | :Sepal.Length |
|-----------+---------------|
| virginica |           7.7 |
| virginica |           7.7 |
| virginica |           7.7 |
| virginica |           7.7 |

The previous example queries the iris dataset for records with the :Sepal.Length column equal to 7.7 using the $where function. We can also specify the lower or upper bound of the value to compare a column to using the :$gt and :$lt symbols in a map passed to $where function, as shown in the following code:

user> (with-data iris ($ [:Species :Sepal.Length]
                         ($where {:Sepal.Length {:$gt 7.0}})))

|  :Species | :Sepal.Length |
|-----------+---------------|
| virginica |           7.1 |
| virginica |           7.6 |
| virginica |           7.3 |
  ...
| virginica |           7.2 |
| virginica |           7.2 |
| virginica |           7.4 |

The previous example checks for records that have a :Sepal.Length attribute with a value greater than 7. To check whether a column's value lies within a given range, we can specify both the :$gt and :$lt keys in the map passed to the $where function, as shown in the following code:

user> (with-data iris ($ [:Species :Sepal.Length]
                         ($where {:Sepal.Length
                                  {:$gt 7.0 :$lt 7.5}})))

|  :Species  |:Sepal.Length |
|------------+--------------|
| virginica  |          7.1 |
| virginica  |          7.3 |
| virginica  |          7.2 |
| virginica  |          7.2 |
| virginica  |          7.2 |
| virginica  |          7.4 |

The previous example checks for records that have a :Sepal.Length attribute within the range of 7.0 and 7.5. We can also specify a discrete set of values using the $:in key, such as in the expression {:$in #{7.2 7.3 7.5}}. The Incanter library also provides several other functions such as $join and $group-by that can be used to express more complex queries.

The Incanter library provides functions to operate with MongoDB to persist and fetch datasets. MongoDB is a nonrelational document database that allows for storage of JSON documents with dynamic schemas. To connect to a MongoDB instance, we use the mongo! function, as shown in the following code:

user> (mongo! :db "sampledb")
true

In the previous code, the database name sampledb is specified as a keyword argument with the key :db to the mongo! function. We can also specify the hostname and port of the instance to connect to using the :host and :post keyword arguments, respectively.

We can store datasets in the connected MongoDB instance using the insert-dataset function from the incanter.mongodb namespace. Unfortunately, MongoDB does not support the use of the dot character (.) as column names, and so we must change the names of the columns in the iris dataset in order to successfully store it using the insert-dataset function. Replacing the column names can be done using the col-names function, as shown in the following code:

user> (insert-dataset
:iris (col-names iris [:SepalLength
:SepalWidth
:PetalLength
:PetalWidth
:Species]))

The previous code stores the iris dataset in the MongoDB instance after replacing the dot characters in the column names.

Note

Note that the dataset will be stored in a collection named iris in the sampledb database. Also, MongoDB will assign a hash-based ID to each record in the dataset that was stored in the database. This column can be referred to using the :_id keyword.

To fetch the dataset back from the database, we use the fetch-dataset function, as shown in the following code. The value returned by this function can be directly used by the with-data function to query and view the dataset fetched.

user> (with-data (fetch-dataset :iris) ($ [:Species :_id]
                                          ($where {:SepalLength
                                                   {:$gt 7}})))

|  :Species |                     :_id |
|-----------+--------------------------|
| virginica | 52ebcc1144ae6d6725965984 |
| virginica | 52ebcc1144ae6d6725965987 |
| virginica | 52ebcc1144ae6d6725965989 |
  ...
| virginica | 52ebcc1144ae6d67259659a0 |
| virginica | 52ebcc1144ae6d67259659a1 |
| virginica | 52ebcc1144ae6d67259659a5 |

We can also inspect the database after storing our dataset, using the mongo client, as shown in the following code. As we mentioned our database name is sampledb, we must select this database using the use command, as shown in the following terminal output:

$ mongo
MongoDB shell version: 2.4.6
connecting to: test
Server has startup warnings:
...

> use sampledb
switched to db sampledb

We can view all collections in the database using the show collections command. The queries can be executed using the find() function on the appropriate property in the variable db instance, as shown in the following code:

> show collections
iris
system.indexes
>
> db.iris.find({ SepalLength: 5})

{ "_id" : ObjectId("52ebcc1144ae6d6725965922"),
  "Species" : "setosa",
  "PetalWidth" : 0.2,
  "PetalLength" : 1.4,
  "SepalWidth" : 3.6,
  "SepalLength" : 5 }
{ "_id" : ObjectId("52ebcc1144ae6d6725965925"),
  "Species" : "setosa",
  "PetalWidth" : 0.2,
  "PetalLength" : 1.5,
  "SepalWidth" : 3.4,
  "SepalLength" : 5 }

...

To conclude, the Incanter library provides us with a sufficient set of tools for querying and storing datasets. Also, MongoDB can be easily used to store datasets via the Incanter library.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.20.231