Introducing NoSQL

Many cloud vendors offer storage services that follow the NoSQL approach, storing unstructured data in a document-oriented model. In this section, we will explore NoSQL as a means of storing financial data. A number of free, open source NoSQL databases are also available.

Getting MongoDB

MongoDB is a free and open source document database written in C++. The official site for MongoDB is http://www.mongodb.org. MongoDB is available for Linux, Windows, Mac OS X, and Solaris. Head to http://www.mongodb.org/downloads to download and install MongoDB on your local workstation. Once installation is complete, the MongoDB management tools are typically available in your operating system's runtime environment and are run from the command line. The documentation also includes instructions for starting the mongod service on your machine.

Creating the data directory and running MongoDB

Before starting MongoDB for the first time, a directory is needed where the mongod process will write data. The mongod process uses /data/db as the default directory.

Let's create the data directory in a folder of our choice. Ensure that the user account has the proper read and write privileges.

Running MongoDB from Windows

In Windows, open Command Prompt, and navigate to your working directory with the cd command. Then, create a folder with the following command:

$ md data\db

To run the mongod service, enter the following command:

$ c:\mongodb\bin\mongod.exe --dbpath c:\test\data\db

Here, it is assumed that the MongoDB installation files are located at c:\mongodb, and that our working directory is c:\test. Depending on where you installed MongoDB, your paths may differ.

Running MongoDB from Mac OS X

If you are using Mac OS X, open Terminal and navigate to your working directory with the cd command. Then, create a folder with the following command:

$ mkdir -p data/db

To run the mongod service, enter the following command:

$ mongod --dbpath data/db

Note

For detailed instructions on running MongoDB suitable to your operating system's environment, please refer to the official documentation at http://docs.mongodb.org/manual.

The last two lines of the output should be similar to the following:

Sun Dec  7 00:42:26.063 [websvr] admin web console waiting for connections on port 28017
Sun Dec  7 00:42:26.064 [initandlisten] waiting for connections on port 27017

This shows that our MongoDB instance has started successfully and is accepting connections on port 27017. The administrative web interface runs on port 28017 and can be accessed at http://localhost:28017.

Getting PyMongo

The PyMongo module contains tools to interact with the MongoDB database using Python. The official page for PyMongo is https://pypi.python.org/pypi/pymongo/. The page contains installation instructions, which are much the same as for any regular Python module on Windows, Linux, or Mac OS X. A straightforward way to install PyMongo is to download the project source, unzip it to a folder on your local drive, use Terminal to navigate to this folder, and run python setup.py install.

Alternatively, if you have pip installed, enter the following command in the Terminal:

$ pip install pymongo

Running a test connection

Let's do a simple connection check to ensure that our MongoDB service and PyMongo module are installed and running properly with the following Python script:

>>> import pymongo
>>> try:
...     client = pymongo.MongoClient("localhost", 27017)
...     print "Connected successfully!!!"
... except pymongo.errors.ConnectionFailure as e:
...     print "Could not connect to MongoDB: %s" % e
...
Connected successfully!!!

If the pymongo.MongoClient method call is successful, we should get the preceding output. Otherwise, ensure that the MongoDB service is running in the Terminal before executing this script again. Once connected, we can continue with some NoSQL operations.
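
If the connection check fails, it can help to first confirm that something is listening on the MongoDB port at all. The following is a minimal sketch that uses only the Python standard library. The port_is_open helper is our own hypothetical name and is not part of PyMongo, and it only tells us that a TCP connection can be opened, not that the listener is actually MongoDB:

```python
import socket


def port_is_open(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # Raises socket.error (OSError) if the connection is refused
        # and socket.timeout if nothing answers within `timeout` seconds.
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True
```

If port_is_open() returns False, the mongod process is most likely not running, or it is listening on a different host or port.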

Getting a database

Within a single data directory at data/db, we can create multiple independent databases. With PyMongo, all of these databases can be accessed through a single running instance of MongoDB.

For example, if we want to have a database called ticks_db (which contains an underscore character) to store some tick data, we can access this database with PyMongo using attribute-style access as follows:

>>> ticks_db = client.ticks_db

If the database name is not a valid Python attribute name, for example, ticks-db, the database can be accessed with dictionary-style access instead:

>>> ticks_db = client["ticks-db"]

Note

An assignment such as ticks_db = client.ticks-db will cause an error, because Python parses the hyphen as a minus sign!
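
When database or collection names come from configuration or user input, we may want to decide programmatically which access style is safe. The following is a small sketch under that assumption. The function names are hypothetical and not part of PyMongo, and the check only covers Python syntax:

```python
import keyword
import re

# A plain ASCII Python identifier: a letter or underscore, then
# letters, digits, or underscores until the end of the string.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def supports_attribute_access(name):
    """Return True if `name` can be written after a dot in Python source."""
    return bool(_IDENTIFIER.match(name)) and not keyword.iskeyword(name)


def access_expression(parent, name):
    """Build the source text used to reach `name` under `parent`."""
    if supports_attribute_access(name):
        return "%s.%s" % (parent, name)
    return '%s["%s"]' % (parent, name)
```

Here, access_expression("client", "ticks-db") produces client["ticks-db"]. The dictionary style works for every name, so it is the safer default. It is also the one to use when a name is syntactically valid but collides with an existing attribute or method of the client object.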

Getting a collection

A collection is a group of documents stored in a database, and is similar to a table in a relational database. A document accepts the value types described by the Binary JSON (BSON) specification, including primitives such as strings and integers, arrays in the form of Python lists, binary strings in the form of UTF-8 encoded Unicode, ObjectId values, which are assigned unique identifiers, and nested objects in the form of Python dictionaries. Be aware that each document can contain up to 16 MB of data.

In this example, we are interested in the incoming tick data from the stock ticker AAPL. We can begin to store ticks within a collection named aapl:

>>> aapl_collection = ticks_db.aapl

As with databases, if the collection name is not a valid Python attribute name, for example, aapl-collection, the collection can be accessed with dictionary-style access as follows:

>>> aapl_collection = ticks_db["aapl-collection"]

Note that collections and databases are created lazily. When objects are inserted into a collection or database that does not yet exist, it is created at that point.

Inserting a document

Assume that we have a working AAPL market tick data collection system available in Python, and we are interested in storing this data. We can structure the tick data as a Python dictionary that stores the ticker name, the time received, the open, high, low, and last prices, and the total traded volume. The following AAPL tick data is just an example of hypothetical prices received at 10 AM:

import datetime as dt
tick = {"ticker": "aapl",
        "time": dt.datetime(2014, 11, 17, 10, 0, 0),
        "open": 115.58,
        "high": 116.08,
        "low": 114.49,
        "last": 114.96,
        "vol": 1900000}
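
When many ticks arrive from a feed, building each dictionary by hand invites typos in the field names. The following sketch wraps the structure in a helper. The make_tick function and its sanity checks are our own assumptions for illustration and are not required by MongoDB:

```python
import datetime as dt


def make_tick(ticker, time, open_, high, low, last, vol):
    """Build a tick document with the same fields as the AAPL example."""
    # Hypothetical sanity checks; adjust or drop them to suit the data feed.
    if not (low <= open_ <= high and low <= last <= high):
        raise ValueError("open and last must lie between low and high")
    if vol < 0:
        raise ValueError("vol must not be negative")
    return {"ticker": ticker,
            "time": time,
            "open": open_,
            "high": high,
            "low": low,
            "last": last,
            "vol": vol}


# Produces the same dictionary as the hand-written example.
tick = make_tick("aapl", dt.datetime(2014, 11, 17, 10, 0, 0),
                 115.58, 116.08, 114.49, 114.96, 1900000)
```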

We use the collection's insert function to insert the dictionary object. The insert function generates a unique key for the document and stores it in a special field named _id:

>>> tick_id = aapl_collection.insert(tick)
>>> print tick_id
548490486d3ba7178b6c36ba

After inserting our first document, the aapl collection will be created on the server. To verify this, we can list all of the collections in our database:

>>> print ticks_db.collection_names()
[u'system.indexes', u'aapl']

Fetching a single document

The most basic query for a document in a collection is the find_one function. Called without any parameters, it simply gives us the first document in the collection, or None if there are no matches. Otherwise, it accepts a dictionary as the filter criteria. The following code contains some find_one examples that return the same result. Note that, in the last example, the string must be converted to an ObjectId in order to query the _id field generated by the server:

>>> print aapl_collection.find_one()
>>> print aapl_collection.find_one({"time": dt.datetime(2014, 11, 17, 10, 0, 0)})
>>>
>>> from bson.objectid import ObjectId
>>> print aapl_collection.find_one({"_id":
...     ObjectId("548490486d3ba7178b6c36ba")})
{u'last': 114.96, u'vol': 1900000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 0), u'_id': ObjectId('548490486d3ba7178b6c36ba'), u'ticker': u'aapl'}
{u'last': 114.96, u'vol': 1900000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 0), u'_id': ObjectId('548490486d3ba7178b6c36ba'), u'ticker': u'aapl'}
{u'last': 114.96, u'vol': 1900000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 0), u'_id': ObjectId('548490486d3ba7178b6c36ba'), u'ticker': u'aapl'}

Deleting documents

The remove function removes the documents in the collection that match the query. Called without a query, as in the following example, it removes every document in the collection:

>>> aapl_collection.remove()

Batch-inserting documents

For batch inserts, the insert function accepts a list of dictionaries. We will add our first tick again, together with two more hypothetical ticks for the next two minutes, to our collection:

aapl_collection.insert([tick,
                       {"ticker": "aapl",
                        "time": dt.datetime(2014,11,17,10,1,0),
                        "open": 115.58,
                        "high": 116.08,
                        "low": 114.49,
                        "last": 115.00,
                        "vol": 2000000},
                       {"ticker": "aapl",
                        "time": dt.datetime(2014,11,17,10,2,0),
                        "open": 115.58,
                        "high": 116.08,
                        "low": 113.49,
                        "last": 115.00,
                        "vol": 2100000}])
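
With a continuous stream of ticks, we would typically collect documents into fixed-size lists before each batch insert. The following generator is a minimal sketch of that idea. The batches name and the batch size are hypothetical choices and are not part of PyMongo:

```python
def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for document in documents:
        batch.append(document)
        if len(batch) == size:
            # The batch is full: hand it over and start a new one.
            yield batch
            batch = []
    if batch:
        # Hand over the final, partially filled batch.
        yield batch
```

Each list that the generator yields could then be passed to the collection's insert function in the same way as the list in the preceding example.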

Counting documents in the collection

The count function returns the number of documents in a collection. It can also be chained onto the find function to count the number of documents that match a query:

>>> print aapl_collection.count()
>>> print aapl_collection.find({"open": 115.58}).count()
3
3

Finding documents

The find function is similar to the find_one function, except that it returns a cursor over all the matching documents, which we can iterate. Without any parameters, the find function simply returns all the items in the collection:

>>> for aapl_tick in aapl_collection.find():
...    print aapl_tick
{u'last': 114.96, u'vol': 1900000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 0), u'_id': ObjectId('5484943f6d3ba717ca0d26ff'), u'ticker': u'aapl'}
{u'last': 115.0, u'vol': 2000000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 1), u'_id': ObjectId('5484943f6d3ba717ca0d2700'), u'ticker': u'aapl'}
{u'last': 115.0, u'vol': 2100000, u'open': 115.58, u'high': 116.08, u'low': 113.49, u'time': datetime.datetime(2014, 11, 17, 10, 2), u'_id': ObjectId('5484943f6d3ba717ca0d2701'), u'ticker': u'aapl'}

We can also filter our search on the collection of tick data. For example, suppose we are interested in finding the two ticks that arrived before 10:02 AM:

>>> cutoff_time = dt.datetime(2014, 11, 17, 10, 2, 0)
>>> for tick in aapl_collection.find(
...        {"time": {"$lt": cutoff_time}}).sort("time"):
...    print tick 
{u'last': 114.96, u'vol': 1900000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 0), u'_id': ObjectId('5484943f6d3ba717ca0d26ff'), u'ticker': u'aapl'}
{u'last': 115.0, u'vol': 2000000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 1), u'_id': ObjectId('5484943f6d3ba717ca0d2700'), u'ticker': u'aapl'}
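
To make the meaning of the $lt filter and the sort call concrete, here is a plain-Python analogue that works on an in-memory list of dictionaries. It is only a sketch of the semantics with abbreviated sample ticks, not a description of how MongoDB executes the query:

```python
import datetime as dt

# Abbreviated sample ticks, deliberately out of order.
ticks = [
    {"time": dt.datetime(2014, 11, 17, 10, 2, 0), "last": 115.00},
    {"time": dt.datetime(2014, 11, 17, 10, 0, 0), "last": 114.96},
    {"time": dt.datetime(2014, 11, 17, 10, 1, 0), "last": 115.00},
]

cutoff_time = dt.datetime(2014, 11, 17, 10, 2, 0)

# Analogue of {"time": {"$lt": cutoff_time}}: keep strictly earlier ticks.
earlier = [t for t in ticks if t["time"] < cutoff_time]

# Analogue of .sort("time"): ascending order by the time field.
earlier.sort(key=lambda t: t["time"])
```

The earlier list now holds the 10:00 and 10:01 ticks, in that order. The tick at exactly 10:02 is excluded because $lt is a strict comparison.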

Sorting documents

In the preceding find example, we used the sort function to order our search results by time in ascending order. We can also sort in descending order:

>>> sorted_ticks = aapl_collection.find().sort(
...     [("time", pymongo.DESCENDING)])
>>> for tick in sorted_ticks:
...     print tick 
{u'last': 115.0, u'vol': 2100000, u'open': 115.58, u'high': 116.08, u'low': 113.49, u'time': datetime.datetime(2014, 11, 17, 10, 2), u'_id': ObjectId('548494f16d3ba717d882b83e'), u'ticker': u'aapl'}
{u'last': 115.0, u'vol': 2000000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 1), u'_id': ObjectId('548494f16d3ba717d882b83d'), u'ticker': u'aapl'}
{u'last': 114.96, u'vol': 1900000, u'open': 115.58, u'high': 116.08, u'low': 114.49, u'time': datetime.datetime(2014, 11, 17, 10, 0), u'_id': ObjectId('548494f16d3ba717d882b83c'), u'ticker': u'aapl'}
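
The list-of-pairs form of sort also allows several keys, each with its own direction. The following plain-Python sketch illustrates the ordering this implies on an in-memory list. The sort_documents helper is hypothetical, and it assumes every document has all of the sort keys with comparable values, which MongoDB itself does not require:

```python
import datetime as dt

# Same integer values as pymongo.ASCENDING and pymongo.DESCENDING.
ASCENDING, DESCENDING = 1, -1


def sort_documents(documents, key_directions):
    """Return a new list ordered by a list of (key, direction) pairs."""
    result = list(documents)
    # Sort by the least significant key first. Python's sort is stable,
    # so each later pass keeps the earlier order among equal values.
    for key, direction in reversed(key_directions):
        result.sort(key=lambda d, k=key: d[k],
                    reverse=(direction == DESCENDING))
    return result


ticks = [
    {"time": dt.datetime(2014, 11, 17, 10, 0, 0), "last": 114.96},
    {"time": dt.datetime(2014, 11, 17, 10, 1, 0), "last": 115.00},
    {"time": dt.datetime(2014, 11, 17, 10, 2, 0), "last": 115.00},
]

newest_first = sort_documents(ticks, [("time", DESCENDING)])
by_price_then_time = sort_documents(
    ticks, [("last", DESCENDING), ("time", ASCENDING)])
```

Here, newest_first lists the ticks from 10:02 down to 10:00, while by_price_then_time puts the two ticks with a last price of 115.00 first, ordered by time, followed by the 10:00 tick.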

Conclusion

Using the PyMongo module, we learned how to insert, delete, count, find, and sort tick data in a collection using three ticks' worth of data. High write throughput, flexibility in the creation of collections, and fast key-value access are some advantages of NoSQL over SQL-based systems. As we continue our journey in data collection and analysis, we can apply these basic methods to store a continuous stream of tick data for multiple tickers across various collections and databases. NoSQL databases are not the only tick data storage solution available, and are sometimes passed over for SQL solutions, but as we have seen, they integrate nicely with Python and are easy to learn and use. The cross-platform and versatile MongoDB is one of the many NoSQL database products that store documents in a data-interchange format such as BSON. Many cloud vendors offer object storage in a JSON-like structure similar to what we studied here.
