Chapter 2

Hello NoSQL: Getting Initial Hands-on Experience

WHAT’S IN THIS CHAPTER?

  • Tasting NoSQL technology
  • Exploring MongoDB and Apache Cassandra basics
  • Accessing MongoDB and Apache Cassandra from some of the popular high-level programming languages

This chapter is a variation of the quintessential programming tutorial first step: Hello World! It introduces the initial examples. Although elementary, these examples go beyond simply printing a hello message on the console and give you a first hands-on flavor of the topic. The topic in this case is NoSQL, which is an abstraction for a class of data stores. NoSQL is a concept, a classification, and a new-generation data storage viewpoint. It includes a class of products and a set of alternative non-relational data store choices. You are already familiar with some of the essential concepts and pros and cons of NoSQL from Chapter 1. This is where you start seeing it in action.

The examples in this chapter use MongoDB and Cassandra so you may want to install and set up those products to follow along. Refer to Appendix A if you need help installing these products in your development environment.

WHY ONLY MONGODB AND APACHE CASSANDRA?

The choice of MongoDB and Cassandra to illustrate NoSQL examples is quite arbitrary. This chapter intends to provide a first flavor of the deep and wide NoSQL domain. There are numerous NoSQL products and many offer compelling features and advantages. Choosing a couple of products to start with NoSQL was not easy. For example, Couchbase server could have been chosen over MongoDB and HBase could have been used instead of Cassandra. The examples could have been based on products like Redis, Membase, Hypertable, or Riak. Many NoSQL databases are covered in this book, so read through and you will learn a lot about the various alternative options in the NoSQL space.

FIRST IMPRESSIONS — EXAMINING TWO SIMPLE EXAMPLES

Without further delay or long prologues, it’s time to dive right into your first two simple examples. The first example creates a trivial location preferences store and the second one manages a car make and model database. Both the examples focus on the data management aspects that are pertinent in the context of NoSQL.

A Simple Set of Persistent Preferences Data

Location-based services are gaining prominence as local businesses are trying to connect with users who are in the neighborhood and large companies are trying to customize their online experience and offerings based on where people are stationed. A few common occurrences of location-based preferences are visible in popular applications like Google Maps, which allows local search, and online retailers like Walmart.com that provide product availability and promotion information based on your closest Walmart store location.

Sometimes a user is asked to input location data and other times user location is inferred. Inference may be based on a user’s IP address, network access point (especially if a user accesses data from a mobile device), or any combination of these techniques. Irrespective of how the data is gathered, you will need to store it effectively and that is where the example starts.

To make things simple, the location preferences are maintained for users only in the United States, so only a user identifier and a zip code are required to find the location for a user. Let’s start with usernames as their identifiers. Data points like “John Doe, 10001,” “Lee Chang, 94129,” “Jenny Gonzalez, 33101,” and “Srinivas Shastri, 02101” will need to be maintained.

To store such data in a flexible and extendible way, this example uses a non-relational database product named MongoDB. In the next few steps you create a MongoDB database and store a few sample location data points.

Starting MongoDB and Storing Data

Assuming you have installed MongoDB successfully, start the server and connect to it.

You can start a MongoDB server by running the mongod program within the bin folder of the distribution. Distributions vary according to the underlying environment, which can be Windows, Mac OS X, or a Linux variant, but in each case the server program has the same name and it resides in a folder named bin in the distribution.

The simplest way to connect to the MongoDB server is to use the JavaScript shell available with the distribution. Simply run mongo from your command-line interface. The mongo JavaScript shell command is also found in the bin folder.

When you start the MongoDB server by running mongod, you should see output on your console that looks similar to the following:

PS C:\applications\mongodb-win32-x86_64-1.8.1> .\bin\mongod.exe
C:\applications\mongodb-win32-x86_64-1.8.1\bin\mongod.exe
--help for help and startup options
Sun May 01 21:22:56 [initandlisten] MongoDB starting : pid=3300 port=27017
  dbpath=/data/db/ 64-bit
Sun May 01 21:22:56 [initandlisten] db version v1.8.1, pdfile version 4.5
Sun May 01 21:22:56 [initandlisten] git version:
a429cd4f535b2499cc4130b06ff7c26f41c00f04
Sun May 01 21:22:56 [initandlisten] build sys info: windows (6, 1, 7600, 2, '')
  BOOST_LIB_VERSION=1_42
Sun May 01 21:22:56 [initandlisten] waiting for connections on port 27017
Sun May 01 21:22:56 [websvr] web admin interface listening on port 28017

This particular output was captured on a Windows 7 64-bit machine when mongod was run via the Windows PowerShell. Depending on your environment your output may vary.

Now that the database server is up and running, use the mongo JavaScript shell to connect to it. The initial output of the shell should be as follows:

PS C:\applications\mongodb-win32-x86_64-1.8.1> bin/mongo
MongoDB shell version: 1.8.1
connecting to: test
>

By default, the mongo shell connects to the “test” database available on localhost. From the console output of mongod (the server daemon program), you can also see that the MongoDB server waits for connections on port 27017. To explore a possible set of initial commands, just type help on the mongo interactive console. On typing help and pressing the Enter (or Return) key, you should see a list of command options like so:

> help
        db.help()                    help on db methods
        db.mycoll.help()             help on collection methods
        rs.help()                    help on replica set methods
        help connect                 connecting to a db help
        help admin                   administrative help
        help misc                    misc things to know
        help mr                      mapreduce help

        show dbs                     show database names
        show collections             show collections in current database
        show users                   show users in current database
        show profile                 show most recent system.profile entries
                                         with time >= 1ms
        use <db_name>                set current database
        db.foo.find()                list objects in collection foo
        db.foo.find( { a : 1 } )     list objects in foo where a == 1
        it                           result of the last line evaluated;
                                         use to further iterate
        DBQuery.shellBatchSize = x   set default number of items to display
                                         on shell
        exit                         quit the mongo shell
>

CUSTOMIZING THE MONGODB DATA DIRECTORY AND PORT

By default, MongoDB stores the data files in the /data/db (C:\data\db on Windows) directory and listens for requests on port 27017. You can specify an alternative data directory by passing the directory path with the dbpath option, as follows:

mongod --dbpath  /path/to/alternative/directory

Make sure the data directory is created if it doesn’t already exist. Also, ensure that mongod has permissions to write to that directory.

In addition, you can also direct MongoDB to listen for connections on an alternative port by explicitly passing the port as follows:

mongod --port 27018

To avoid conflicts, make sure the port is not in use.

To change both the data directory and the port simultaneously, simply specify both the --dbpath and --port options with the corresponding alternative values to the mongod executable.
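
As a small illustration of combining the two options, the following Python sketch assembles the mongod argument list and checks that the port is a valid TCP port (1 to 65535) before it is used. The helper name mongod_command, the example path, and port 27018 are assumptions made for illustration; only the --dbpath and --port options come from the text above.

```python
def mongod_command(dbpath, port=27017):
    """Build an argument list for launching mongod (illustrative helper)."""
    # TCP port numbers are 16-bit values, so anything outside 1..65535 is invalid.
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    return ["mongod", "--dbpath", dbpath, "--port", str(port)]


# Hypothetical values; substitute your own directory and a free port.
command = mongod_command("/path/to/alternative/directory", 27018)
print(" ".join(command))
```

The resulting list could be handed to subprocess.Popen to start the server from a script, provided mongod is on your path.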

Next, you learn how to create the preferences database within the MongoDB instance.

Creating the Preferences Database

To start out, create a preferences database called prefs. After you create it, store tuples (or pairs) of usernames and zip codes in a collection, named location, within this database. Then store the available data sets in this defined structure. In MongoDB terms it would translate to carrying out the following steps:

1. Switch to the prefs database.

2. Define the data sets that need to be stored.

3. Save the defined data sets in a collection, named location.

To carry out these steps, type the following on your mongo JavaScript console:

use prefs
w = {name: "John Doe", zip: 10001};
x = {name: "Lee Chang", zip: 94129};
y = {name: "Jenny Gonzalez", zip: 33101};
z = {name: "Srinivas Shastri", zip: 02101};
db.location.save(w);
db.location.save(x);
db.location.save(y);
db.location.save(z);

That’s it! A few simple steps and the data store is ready. Some quick notes before moving forward though: The use prefs command changed the current database to the database called prefs. However, the database itself was never explicitly created. Similarly, the data points were stored in the location collection by passing each data point to the db.location.save() method. The collection wasn’t explicitly created either. In MongoDB, a database and a collection are created only when data is first inserted into them. So, in this example, both are created when the first data point, {name: "John Doe", zip: 10001}, is saved.

You can now query the newly created database to verify the contents of the store. To get all records stored in the collection named location, run db.location.find().

Running db.location.find() on my machine reveals the following output:

> db.location.find()
{ "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe",
    "zip" : 10001 }
{ "_id" : ObjectId("4c970541be67000000003858"), "name" : "Lee Chang",
    "zip" : 94129 }
{ "_id" : ObjectId("4c970548be67000000003859"), "name" : "Jenny Gonzalez",
    "zip" : 33101 }
{ "_id" : ObjectId("4c970555be6700000000385a"), "name" : "Srinivas Shastri",
    "zip" : 1089 }

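The output on your machine should be similar, although the ObjectId values will differ. ObjectId is MongoDB’s way of uniquely identifying each record, or document in MongoDB terms. Note also that the zip code for Srinivas Shastri appears as 1089: the shell reads the literal 02101, with its leading zero, as an octal number. Enter such a value as a quoted string ("02101") if the leading zero matters.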
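
The arithmetic behind the 1089 shown for Srinivas Shastri can be checked in a few lines. The Python sketch below interprets the digits of the literal 02101 typed earlier as a base-8 number, which is how a legacy JavaScript engine treats a numeric literal with a leading zero, and then shows that text keeps the leading zero. Storing zip codes as strings is a suggestion made here, not something the original example does.

```python
# Interpret the digits 02101 as base 8, the way a legacy JavaScript
# engine treats a numeric literal that starts with a zero.
octal_value = int("02101", 8)

# The same result, expanded digit by digit: 2*8^3 + 1*8^2 + 0*8^1 + 1*8^0.
expanded = 2 * 8**3 + 1 * 8**2 + 0 * 8**1 + 1 * 8**0

print(octal_value, expanded)

# Storing the zip code as text keeps the leading zero intact.
zip_as_text = "02101"
print(zip_as_text)
```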


MongoDB uniquely identifies each document in a collection by the value of its _id attribute. While inserting a record, you can set any unique value as the _id, in which case the developer is responsible for guaranteeing its uniqueness. You can also omit the _id property while inserting a record. In such cases, MongoDB creates and inserts an appropriate unique id. Such generated ids are of the BSON (short for binary JSON) ObjectId type, which can be best summarized as follows:

  • BSON Object Id is a 12-byte value.
  • The first 4 bytes represent the creation timestamp, expressed as seconds since the epoch. This value must be stored in big endian, which means the most significant byte must be stored at the lowest storage address.
  • The next 3 bytes represent the machine id.
  • The following 2 bytes represent the process id.
  • The last 3 bytes represent the counter. This value must be stored in big endian.
  • The ObjectId format, apart from assuring uniqueness, includes the creation timestamp. ObjectId values are supported by all standard MongoDB drivers.
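
The byte layout listed above can be checked by hand. The following Python sketch splits one of the ObjectId values from the earlier output into its four fields using only the standard library. It follows the layout as described in this chapter; the function name split_object_id is an assumption for illustration and is not part of any MongoDB driver.

```python
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-character hex ObjectId into the four fields listed above."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    return {
        # Bytes 0-3: seconds since the epoch, big endian.
        "timestamp": int.from_bytes(raw[0:4], "big"),
        # Bytes 4-6: machine id, kept as hex text.
        "machine": raw[4:7].hex(),
        # Bytes 7-8: process id, kept as hex text.
        "pid": raw[7:9].hex(),
        # Bytes 9-11: counter, big endian.
        "counter": int.from_bytes(raw[9:12], "big"),
    }


fields = split_object_id("4c97053abe67000000003857")
created = datetime.fromtimestamp(fields["timestamp"], tz=timezone.utc)
print(fields)
print(created.isoformat())
```

Because the timestamp occupies the leading bytes, ids generated later compare as larger values, which is why the ids in the sample output increase from one document to the next.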

The find method, with no parameters, returns all the elements in the collection. In some cases, this may not be desirable and only a subset of the collection may be required. To understand querying possibilities, add the following additional records to the location collection:

  • Don Joe, 10001
  • John Doe, 94129

You can accomplish this, via the mongo shell, as follows:

> a = {name:"Don Joe", zip:10001};
{ "name" : "Don Joe", "zip" : 10001 }
> b = {name:"John Doe", zip:94129};
{ "name" : "John Doe", "zip" : 94129 }
> db.location.save(a);
> db.location.save(b);
>

To get a list of only those people who are in the 10001 zip code, you could query as follows:

> db.location.find({zip: 10001});
{ "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe",
    "zip" : 10001 }
{ "_id" : ObjectId("4c97a6555c760000000054d8"), "name" : "Don Joe",
    "zip" : 10001 }

To get a list of all those who have the name “John Doe,” you could query like so:

> db.location.find({name: "John Doe"});
{ "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe",
    "zip" : 10001 }
{ "_id" : ObjectId("4c97a7ef5c760000000054da"), "name" : "John Doe",
    "zip" : 94129 }

In both these queries that filter the collection, a query document is passed as a parameter to the find method. The query document specifies the pattern of keys and values that need to be matched. MongoDB supports many advanced querying mechanisms beyond simple filters, including pattern representation with the help of regular expressions.
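
To make the idea of a query document concrete, here is a minimal Python sketch of exact-match filtering over a list of dictionaries. It imitates only the simple key and value matching used in the two queries above; it is not how MongoDB is implemented, and the function names matches and find are assumptions for illustration.

```python
def matches(document, query):
    """Return True when every key in the query has an equal value in the document."""
    return all(key in document and document[key] == value
               for key, value in query.items())


def find(collection, query=None):
    """Return the documents matching the query; an empty query matches everything."""
    query = query or {}
    return [doc for doc in collection if matches(doc, query)]


# The documents stored in the location collection so far (without _id values).
location = [
    {"name": "John Doe", "zip": 10001},
    {"name": "Lee Chang", "zip": 94129},
    {"name": "Jenny Gonzalez", "zip": 33101},
    {"name": "Srinivas Shastri", "zip": 1089},
    {"name": "Don Joe", "zip": 10001},
    {"name": "John Doe", "zip": 94129},
]

print(find(location, {"zip": 10001}))
print(find(location, {"name": "John Doe"}))
```

A query with several keys, such as {"name": "John Doe", "zip": 94129}, narrows the result to documents that satisfy every pair at once.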

As a database takes in newer data sets, the structure of a collection may become a constraint and need modification. In the traditional relational database sense, you would need to alter the table schema. In relational databases, altering table schemas also means taking on a complicated data migration task to make sure data in the old and the new schema exist together. In MongoDB, modifying a collection structure is trivial. More accurately, collections, which are analogous to tables, are schema-less, so you can store disparate document types within the same collection.

Consider an example where you need to store the location preferences of another user, whose name and zip code are identical to a document already existing in your database, say, another {name: "Lee Chang", zip: 94129}. Intentionally and not realistically, of course, the assumption was that a name and zip pair would be unique!

To distinctly identify the second Lee Chang from the one in the database, an additional attribute, the street address, is added like so:

> anotherLee = {name:"Lee Chang", zip: 94129, streetAddress:"37000 Graham Street"};
{
        "name" : "Lee Chang",
        "zip" : 94129,
        "streetAddress" : "37000 Graham Street"
}
> db.location.save(anotherLee);

Now getting all documents, using find, returns the following data sets:

> db.location.find();
{ "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe",
    "zip" : 10001 }
{ "_id" : ObjectId("4c970541be67000000003858"), "name" : "Lee Chang",
   "zip" : 94129 }
{ "_id" : ObjectId("4c970548be67000000003859"), "name" : "Jenny Gonzalez",
    "zip" : 33101 }
{ "_id" : ObjectId("4c970555be6700000000385a"), "name" : "Srinivas Shastri",
    "zip" : 1089 }
{ "_id" : ObjectId("4c97a6555c760000000054d8"), "name" : "Don Joe",
    "zip" : 10001 }
{ "_id" : ObjectId("4c97a7ef5c760000000054da"), "name" : "John Doe",
    "zip" : 94129 }
{ "_id" : ObjectId("4c97add25c760000000054db"), "name" : "Lee Chang",
    "zip" : 94129, "streetAddress" : "37000 Graham Street" }
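
The mixed result set above can be modeled in Python as a list of dictionaries whose key sets differ. The sketch below, which is an illustration rather than MongoDB code, shows that the extra streetAddress key is enough to tell the two Lee Chang documents apart. The variable names are assumptions.

```python
# Two documents from the same collection with different sets of keys.
first_lee = {"name": "Lee Chang", "zip": 94129}
second_lee = {"name": "Lee Chang", "zip": 94129,
              "streetAddress": "37000 Graham Street"}
location = [first_lee, second_lee]

# No schema change was needed: each document simply carries its own keys.
for doc in location:
    print(sorted(doc.keys()))

# Filtering on the extra key singles out the second Lee Chang.
with_address = [doc for doc in location if "streetAddress" in doc]
print(with_address)
```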

You can access this data set from most mainstream programming languages, because drivers for those exist. A section titled “Working with Language Bindings” later in this chapter covers the topic. In a subsection in that section, this location preferences example is accessed from Java, PHP, Ruby, and Python.

In the next example, you see a simple data set that relates to car make and models stored in a non-relational column-family database.

Storing Car Make and Model Data

Apache Cassandra, a distributed column-family database, is used in this example. Therefore, it would be beneficial to have Cassandra installed before you delve into the example. That will allow you to follow along as I proceed. Refer to Appendix A if you need help installing and setting up Cassandra.

Apache Cassandra is a distributed database, so you would normally set up a database cluster when using this product. For this example, the complexities of setting up a cluster are avoided by running Cassandra as a single node. In a production environment you would not want such a configuration, but you are only testing the waters and getting familiar with the basics for now so the single node will suffice.

A Cassandra database can be interfaced via a simple command-line client or via the Thrift interface. The Thrift interface helps a variety of programming languages connect to Cassandra. Functionally, you could think of the Thrift interface as a generic multilanguage database driver. Thrift is discussed later in the section titled “Working with Language Bindings.”

Moving on with the car makes and models database, first start Cassandra and connect to it.

Starting Cassandra and Connecting to It

You can start the Cassandra server by invoking bin/cassandra from the folder where the Cassandra compressed (tarred and gzipped) distribution is extracted. For this example, run bin/cassandra -f. The -f option makes Cassandra run in the foreground. This starts one Cassandra node locally on your machine. When running as a cluster, multiple nodes are started and they are configured to communicate with each other. For this example, one node will suffice to illustrate the basics of storing and accessing data in Cassandra.

On starting a Cassandra node, you should see the output on your console as follows:

PS C:\applications\apache-cassandra-0.7.4> .\bin\cassandra -f
Starting Cassandra Server
 INFO 18:20:02,091 Logging initialized
 INFO 18:20:02,107 Heap size: 1070399488/1070399488
 INFO 18:20:02,107 JNA not found. Native methods will be disabled.
 INFO 18:20:02,107 Loading settings from file:/C:/applications/
    apache-cassandra-0.7.4/conf/cassandra.yaml
 INFO 18:20:02,200 DiskAccessMode 'auto' determined to be standard,
    indexAccessMode is standard
 INFO 18:20:02,294 Deleted \var\lib\cassandra\data\system\LocationInfo-f-3
 INFO 18:20:02,294 Deleted \var\lib\cassandra\data\system\LocationInfo-f-2
 INFO 18:20:02,294 Deleted \var\lib\cassandra\data\system\LocationInfo-f-1
 INFO 18:20:02,310 Deleted \var\lib\cassandra\data\system\LocationInfo-f-4
 INFO 18:20:02,341 Opening \var\lib\cassandra\data\system\LocationInfo-f-5
 INFO 18:20:02,388 Couldn't detect any schema definitions in local storage.
 INFO 18:20:02,388 Found table data in data directories. Consider using JMX to call
    org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
 INFO 18:20:02,403 Creating new commitlog segment /var/lib/cassandra/commitlog\
    CommitLog-1301793602403.log
 INFO 18:20:02,403 Replaying \var\lib\cassandra\commitlog\
    CommitLog-1301793576882.log
 INFO 18:20:02,403 Finished reading \var\lib\cassandra\commitlog\
    CommitLog-1301793576882.log
 INFO 18:20:02,419 Log replay complete
 INFO 18:20:02,434 Cassandra version: 0.7.4
 INFO 18:20:02,434 Thrift API version: 19.4.0
 INFO 18:20:02,434 Loading persisted ring state
 INFO 18:20:02,434 Starting up server gossip
 INFO 18:20:02,450 Enqueuing flush of Memtable-LocationInfo@33000296(29 bytes, 
    1 operations)
 INFO 18:20:02,450 Writing Memtable-LocationInfo@33000296(29 bytes, 1 operations)
 INFO 18:20:02,622 Completed flushing \var\lib\cassandra\data\system\
    LocationInfo-f-6-Data.db (80 bytes)
 INFO 18:20:02,653 Using saved token 63595432991552520182800882743159853717
 INFO 18:20:02,653 Enqueuing flush of Memtable-LocationInfo@22518320(53 bytes, 
    2 operations)
 INFO 18:20:02,653 Writing Memtable-LocationInfo@22518320(53 bytes, 2 operations)
 INFO 18:20:02,824 Completed flushing \var\lib\cassandra\data\system\
    LocationInfo-f-7-Data.db (163 bytes)
 INFO 18:20:02,824 Will not load MX4J, mx4j-tools.jar is not in the classpath
 INFO 18:20:02,871 Binding thrift service to localhost/127.0.0.1:9160
 INFO 18:20:02,871 Using TFastFramedTransport with a max frame size of 
    15728640 bytes.
 INFO 18:20:02,871 Listening for thrift clients...

The specific output is from my Windows 7 64-bit machine when the Cassandra executable was run from the Windows PowerShell. If you use a different operating system and a different shell, your output may be a bit different.

ESSENTIAL CONFIGURATION FOR RUNNING AN APACHE CASSANDRA NODE

Apache Cassandra storage configuration is defined in conf/cassandra.yaml. When you download and extract a Cassandra stable or development distribution that is available in a compressed tar.gz format, you get a cassandra.yaml file with some default configuration. For example, it would expect the commit logs to be in the /var/lib/cassandra/commitlog directory and the data files to be in the /var/lib/cassandra/data directory. In addition, Apache Cassandra uses log4j for logging. The Cassandra log4j can be configured via conf/log4j-server.properties. By default, Cassandra log4j expects to write log output to /var/log/cassandra/system.log. If you want to keep these defaults, make sure that these directories exist and you have appropriate permissions to access and write to them. If you want to modify this configuration, make sure to specify the new folders of your choice in the corresponding configuration files.

Commit log and data directory properties from conf/cassandra.yaml in my instance are:

# directories where Cassandra should store data on disk.
data_file_directories:
    - /var/lib/cassandra/data

# commit log
commitlog_directory: /var/lib/cassandra/commitlog

The path values in cassandra.yaml need not be specified in Windows-friendly formats. For example, you do not need to specify the commitlog path as commitlog_directory: C:\var\lib\cassandra\commitlog. The log4j appender file configuration from conf/log4j-server.properties in my instance is:

log4j.appender.R.File=/var/log/cassandra/system.log
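
Before starting the node, it can help to confirm that the configured directories exist and are writable. The Python sketch below does that with the standard library. The helper name check_directories is an assumption, and the paths are the defaults quoted above, so adjust them to match your own configuration.

```python
import os


def check_directories(paths):
    """Map each path to 'ok', 'missing', or 'not writable'."""
    report = {}
    for path in paths:
        if not os.path.isdir(path):
            report[path] = "missing"
        elif not os.access(path, os.W_OK):
            report[path] = "not writable"
        else:
            report[path] = "ok"
    return report


# Default locations from cassandra.yaml and log4j-server.properties.
defaults = [
    "/var/lib/cassandra/data",
    "/var/lib/cassandra/commitlog",
    "/var/log/cassandra",
]
for path, status in check_directories(defaults).items():
    print(f"{path}: {status}")
```

Any path reported as missing can be created up front, and any path reported as not writable needs its permissions adjusted for the account that runs Cassandra.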

The simplest way to connect to the running Cassandra node on your machine is to use the Cassandra Command-Line Interface (CLI). Starting the command line is as easy as running bin/cassandra-cli. You can pass in the host and port properties to the CLI as follows:

bin/cassandra-cli -host localhost -port 9160

The output of running cassandra-cli is as follows:

PS C:\applications\apache-cassandra-0.7.4> .\bin\cassandra-cli -host localhost
    -port 9160
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
 
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown]

To get a list of available commands type help or ? and you will see the following output:

[default@unknown] ?
List of all CLI commands:
?                                                          Display this message.
help;                                                         Display this help.
help <command>;                         Display detailed, command-specific help.
connect <hostname>/<port> (<username> '<password>')?; Connect to thrift service.
use <keyspace> [<username> 'password'];                    Switch to a keyspace.
describe keyspace (<keyspacename>)?;                          Describe keyspace.
exit;                                                                  Exit CLI.
quit;                                                                  Exit CLI.
describe cluster;                             Display information about cluster.
show cluster name;                                         Display cluster name.
show keyspaces;                                          Show list of keyspaces.
show api version;                                       Show server API version.
create keyspace <keyspace> [with <att1>=<value1> [and <att2>=<value2> ...]];
                Add a new keyspace with the specified attribute(s) and value(s).
update keyspace <keyspace> [with <att1>=<value1> [and <att2>=<value2> ...]];
                 Update a keyspace with the specified attribute(s) and value(s).
create column family <cf> [with <att1>=<value1> [and <att2>=<value2> ...]];
        Create a new column family with the specified attribute(s) and value(s).
update column family <cf> [with <att1>=<value1> [and <att2>=<value2> ...]];
            Update a column family with the specified attribute(s) and value(s).
drop keyspace <keyspace>;                                     Delete a keyspace.
drop column family <cf>;                                 Delete a column family.
get <cf>['<key>'];                                       Get a slice of columns.
get <cf>['<key>']['<super>'];                        Get a slice of sub columns.
get <cf> where <column> = <value> [and <column> > <value> and ...] [limit int];
get <cf>['<key>']['<col>'] (as <type>)*;                     Get a column value.
get <cf>['<key>']['<super>']['<col>'] (as <type>)*;      Get a sub column value.
set <cf>['<key>']['<col>'] = <value> (with ttl = <secs>)*;         Set a column.
set <cf>['<key>']['<super>']['<col>'] = <value> (with ttl = <secs>)*;
                                                               Set a sub column.
del <cf>['<key>'];                                                Delete record.
del <cf>['<key>']['<col>'];                                       Delete column.
del <cf>['<key>']['<super>']['<col>'];                        Delete sub column.
count <cf>['<key>'];                                    Count columns in record.
count <cf>['<key>']['<super>'];                 Count columns in a super column.
truncate <column_family>;                      Truncate specified column family.
assume <column_family> <attribute> as <type>;
              Assume a given column family attributes to match a specified type.
list <cf>;                                   List all rows in the column family.
list <cf>[<startKey>:];
                       List rows in the column family beginning with <startKey>.
list <cf>[<startKey>:<endKey>];
        List rows in the column family in the range from <startKey> to <endKey>.
list ... limit N;                                   Limit the list results to N.

Now that you have some familiarity with Cassandra basics, you can move on to create a storage definition for the car make and model data, insert some sample data into this new Cassandra storage scheme, and then access it.

Storing and Accessing Data with Cassandra

The first place to start is to understand the concept of a keyspace and a column-family. The closest relational database parallels of a keyspace and a column-family are a database and a table. Although these definitions are not completely accurate and sometimes misleading, they serve as a good starting point to understand the use of a keyspace and a column-family. As you get familiar with the basic usage patterns you will develop greater appreciation for and understanding of these concepts, which extend beyond their relational parallels.
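
One rough way to picture these terms is as nested maps: a keyspace holds column-families, a column-family holds rows identified by a key, and each row holds named columns. The Python sketch below models only that nesting and says nothing about how Cassandra stores or distributes data. CarDataStore and Cars are the names used for this chapter's car example; the helper functions, the row key, and the column values are hypothetical.

```python
from collections import defaultdict

# keyspace -> column-family -> row key -> {column name: value}
keyspaces = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def set_column(keyspace, column_family, key, column, value):
    """Store one column value under a row key."""
    keyspaces[keyspace][column_family][key][column] = value


def get_row(keyspace, column_family, key):
    """Return all columns stored for a row key as a plain dict."""
    return dict(keyspaces[keyspace][column_family][key])


# Hypothetical row: the key and the column values are made up for illustration.
set_column("CarDataStore", "Cars", "Prius", "make", "toyota")
set_column("CarDataStore", "Cars", "Prius", "model", "prius")
print(get_row("CarDataStore", "Cars", "Prius"))
```

Unlike rows in a relational table, rows in the same column-family of this model need not carry the same set of columns, which is one place where the table analogy starts to break down.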

For starters, list the existing keyspaces in your Cassandra server. Go to the cassandra-cli, type the show keyspaces command, and press Enter. Because you are starting out with a fresh Cassandra installation, you are likely to see output similar to this:

[default@unknown] show keyspaces;
Keyspace: system:
  Replication Strategy: org.apache.cassandra.locator.LocalStrategy
    Replication Factor: 1
  Column Families:
    ColumnFamily: HintsColumnFamily (Super)
    "hinted handoff data"
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType/
    org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period: 0.0/0
      Key cache size / save period: 0.01/14400
      Memtable thresholds: 0.15/32/1440
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Built indexes: []
    ColumnFamily: IndexInfo
    "indexes that have been completed"
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 0.0/0
      Key cache size / save period: 0.01/14400
      Memtable thresholds: 0.0375/8/1440
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Built indexes: []
    ColumnFamily: LocationInfo
    "persistent metadata for the local node"
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period: 0.0/0
      Key cache size / save period: 0.01/14400
      Memtable thresholds: 0.0375/8/1440
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Built indexes: []
    ColumnFamily: Migrations
    "individual schema mutations"
      Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
      Row cache size / save period: 0.0/0
      Key cache size / save period: 0.01/14400
      Memtable thresholds: 0.0375/8/1440
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Built indexes: []
    ColumnFamily: Schema
    "current state of the schema"
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 0.0/0
      Key cache size / save period: 0.01/14400
      Memtable thresholds: 0.0375/8/1440
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Built indexes: []

The system keyspace, as the name suggests, is like the administration database in an RDBMS. It includes a few pre-defined column-families. You will learn about column-families, via example, later in this section. Keyspaces group column-families together. Usually, one keyspace is defined per application. Data replication is defined at the keyspace level. This means the number of redundant copies of data and how these copies are stored are specified at the keyspace level. The Cassandra distribution comes with a sample keyspace creation script in a file named schema-sample.txt, which is available in the conf directory. You can run the sample keyspace creation script as follows:

  PS C:\applications\apache-cassandra-0.7.4> .\bin\cassandra-cli -host localhost
    --file .\conf\schema-sample.txt

Once again, connect via the command-line client and reissue the show keyspaces command in the interface. The output this time should be like so:

[default@unknown] show keyspaces;
Keyspace: Keyspace1:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
    Replication Factor: 1
  Column Families:
    ColumnFamily: Indexed1
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period: 0.0/0
      Key cache size / save period: 200000.0/14400
      Memtable thresholds: 0.2953125/63/1440
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: [Indexed1.birthdate_idx]
      Column Metadata:
        Column Name: birthdate (626972746864617465)
          Validation Class: org.apache.cassandra.db.marshal.LongType
          Index Name: birthdate_idx
          Index Type: KEYS
    ColumnFamily: Standard1
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period: 1000.0/0
      Key cache size / save period: 10000.0/3600
      Memtable thresholds: 0.29/255/59
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: []
    ColumnFamily: Standard2
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 0.0/0
      Key cache size / save period: 100.0/14400
      Memtable thresholds: 0.2953125/63/1440
      GC grace seconds: 0
      Compaction min/max thresholds: 5/31
      Read repair chance: 0.0010
      Built indexes: []
    ColumnFamily: StandardByUUID1
      Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
      Row cache size / save period: 0.0/0
      Key cache size / save period: 200000.0/14400
      Memtable thresholds: 0.2953125/63/1440
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: []
    ColumnFamily: Super1 (Super)
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType/
    org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period: 0.0/0
      Key cache size / save period: 200000.0/14400
      Memtable thresholds: 0.2953125/63/1440
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: []
    ColumnFamily: Super2 (Super)
    "A column family with supercolumns, whose column and subcolumn names are 
    UTF8 strings"
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType/
    org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 10000.0/0
      Key cache size / save period: 50.0/14400
      Memtable thresholds: 0.2953125/63/1440
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: []
    ColumnFamily: Super3 (Super)
    "A column family with supercolumns, whose column names are Longs (8 bytes)"
      Columns sorted by: org.apache.cassandra.db.marshal.LongType/
    org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period: 0.0/0
      Key cache size / save period: 200000.0/14400
      Memtable thresholds: 0.2953125/63/1440
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: []
Keyspace: system:
...(Information on the system keyspace is not included here as it's 
    the same as what you have seen earlier in this section)

Next, create a CarDataStore keyspace and a Cars column-family within this keyspace using the script in Listing 2-1.

LISTING 2-1: Schema script for CarDataStore keyspace
/*schema-cardatastore.txt*/

create keyspace CarDataStore
    with replication_factor = 1
    and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';

use CarDataStore;
 
create column family Cars
    with comparator = UTF8Type
    and read_repair_chance = 0.1
    and keys_cached = 100
    and gc_grace = 0
    and min_compaction_threshold = 5
    and max_compaction_threshold = 31;

schema-cardatastore.txt

You can run the script, illustrated in Listing 2-1, as follows:

PS C:\applications\apache-cassandra-0.7.4> bin/cassandra-cli -host localhost 
    --file C:\workspace\nosql\examples\schema-cardatastore.txt

You have successfully added a new keyspace! Go back to the script and briefly review how you added it. You created a keyspace called CarDataStore and, within this keyspace, a column-family named Cars. You will see column-families in action in a while, but think of them as tables for now, especially if you can’t hold your curiosity. Within the create column family statement, an attribute called comparator was also included, with its value specified as UTF8Type. The comparator value determines how column names are compared and sorted within a row. The options in the create keyspace statement specify replication. CarDataStore has a replication factor of 1, which means only one copy of the data is stored in Cassandra.
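
To get a feel for what a comparator changes, the following minimal Python sketch sorts the same set of column names in two ways. It is an illustration only: plain Python sorting stands in for Cassandra's comparator classes, and the column names are made up for the example.

```python
# Illustrative only: Python's sorted() stands in for Cassandra comparators.
# The column names below are hypothetical and not part of Listing 2-1.
column_names = ["10", "9", "2"]

# Text ordering, similar in spirit to a UTF8Type comparator.
as_text = sorted(column_names)

# Numeric ordering, similar in spirit to a LongType comparator.
as_numbers = sorted(column_names, key=int)

print(as_text)
print(as_numbers)
```

The same names come back in a different order depending on how they are compared, which is why the comparator is chosen when the column-family is created.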

Next, add some data to the CarDataStore keyspace like so:

 [default@unknown] use CarDataStore;
Authenticated to keyspace: CarDataStore
[default@CarDataStore] set Cars['Prius']['make'] = 'toyota';
Value inserted.
[default@CarDataStore] set Cars['Prius']['model'] = 'prius 3';
Value inserted.
[default@CarDataStore] set Cars['Corolla']['make'] = 'toyota';
Value inserted.
[default@CarDataStore] set Cars['Corolla']['model'] = 'le';
Value inserted.
[default@CarDataStore] set Cars['fit']['make'] = 'honda';
Value inserted.
[default@CarDataStore] set Cars['fit']['model'] = 'fit sport';
Value inserted.
[default@CarDataStore] set Cars['focus']['make'] = 'ford';
Value inserted.
[default@CarDataStore] set Cars['focus']['model'] = 'sel';
Value inserted.

The set command illustrated here is one way to add data to Cassandra. Each set command adds a name-value pair, or column, to a row, which in turn belongs to a ColumnFamily in a keyspace. For example, set Cars['Prius']['make'] = 'toyota' adds the name-value pair 'make' = 'toyota' to the row identified by the key 'Prius'. That row is part of the Cars ColumnFamily, which is defined within CarDataStore, which you know is a keyspace.
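
One way to picture this nesting is as a dictionary of dictionaries. The Python sketch below is a rough mental model under that assumption, not Cassandra's storage format or API; the get helper is hypothetical and ignores details such as column timestamps.

```python
# Hypothetical in-memory model of keyspace -> column family -> row -> columns.
car_data_store = {                      # keyspace: CarDataStore
    "Cars": {                           # column family
        "Prius": {"make": "toyota", "model": "prius 3"},
        "Corolla": {"make": "toyota", "model": "le"},
        "fit": {"make": "honda", "model": "fit sport"},
        "focus": {"make": "ford", "model": "sel"},
    }
}


def get(column_family, row_key, column=None):
    """Return a whole row, or a single column value when column is given."""
    row = car_data_store[column_family].get(row_key, {})
    if column is None:
        return row
    return row.get(column)


print(get("Cars", "Prius"))
print(get("Cars", "Prius", "make"))
```

Each row holds its own set of name-value pairs, so two rows in the same column family need not share the same column names.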

Once the data is added, you can query and retrieve it. To get the name-value pairs or column names and values for a row identified by Prius, use the following command: get Cars['Prius']. The output should be like so:

[default@CarDataStore] get Cars['Prius'];
=> (column=make, value=746f796f7461, timestamp=1301824068109000)
=> (column=model, value=70726975732033, timestamp=1301824129807000)
Returned 2 results.
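
The values in this output are shown as hexadecimal strings, presumably because Listing 2-1 declares no validation class for the column values, so the CLI treats them as raw bytes. A minimal Python sketch, assuming the bytes are UTF-8 encoded text, turns them back into readable strings. The helper name decode_hex_value is made up for this example.

```python
def decode_hex_value(hex_text):
    """Decode a hex string, as printed by the CLI, into UTF-8 text."""
    return bytes.fromhex(hex_text).decode("utf-8")


print(decode_hex_value("746f796f7461"))    # toyota
print(decode_hex_value("70726975732033"))  # prius 3
```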

Be careful while constructing your queries because the row-keys, column-family identifiers, and column keys are case sensitive. Therefore, passing in 'prius' instead of 'Prius' does not return any name-value tuples. Try running get Cars['prius'] via the CLI. You will receive a response that reads Returned 0 results. Also, before you query, remember to issue use CarDataStore to make CarDataStore the current keyspace.

To access just the 'make' name-value data for the 'Prius' row you could query like so:

[default@CarDataStore] get Cars['Prius']['make'];
=> (column=make, value=746f796f7461, timestamp=1301824068109000)

Cassandra data sets can support richer data models than those shown so far and querying capabilities are also more complex than those illustrated, but I will leave those topics for a later chapter. For now, I am convinced you have had your first taste.

After walking through two simple examples, one that involved a document store, MongoDB, and another that involved a column database, Apache Cassandra, you may be ready to start interfacing with these using a programming language of your choice.

WORKING WITH LANGUAGE BINDINGS

To include NoSQL solutions in the application stack, you need robust and flexible language bindings that allow access to and manipulation of these stores from the most popular programming languages.

This section covers two types of interfaces between NoSQL stores and programming languages. The first illustration covers the essentials of MongoDB drivers for Java, PHP, Ruby, and Python. The second illustration covers the language agnostic and, therefore, multilanguage-supported Thrift interface for Apache Cassandra. The coverage of these topics is elementary. Later chapters build on this initial introduction to show more powerful and detailed use cases.

MongoDB’s Drivers

In this section, MongoDB drivers for four different languages, Java, PHP, Ruby, and Python, are introduced in the order in which they are listed.

Mongo Java Driver

First, download the latest distribution of the MongoDB Java driver from the MongoDB github code repository at http://github.com/mongodb. All officially supported drivers are hosted in this code repository. At the time of writing, the latest version of the driver is 2.5.2, so the downloaded jar file is named mongo-2.5.2.jar.

Once again start the local MongoDB server by running bin/mongod from within the MongoDB distribution. Now use a Java program to connect to this server. Look at Listing 2-2 for a sample Java program that connects to MongoDB, lists all the collections in the prefs database, and then lists all the documents within the location collection.

LISTING 2-2: Sample Java program to connect to MongoDB
import java.net.UnknownHostException;
import java.util.Set;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;
import com.mongodb.MongoException;
 
public class ConnectToMongoDB {
    Mongo m = null;
    DB db;
    
    public void connect() {
        try {
            m = new Mongo("localhost", 27017 );
        } catch (UnknownHostException e) {
            e.printStackTrace();
        } catch (MongoException e) {
            e.printStackTrace();
        }
    }
    
    public void listAllCollections(String dbName) {
        if(m!=null){
            db = m.getDB(dbName);
            Set<String> collections = db.getCollectionNames();
 
            for (String s : collections) {
                System.out.println(s);
            }
        }        
    }
    
    public void listLocationCollectionDocuments() {
        if(m!=null){
            db = m.getDB("prefs");
            DBCollection collection = db.getCollection("location");
            
            DBCursor cur = collection.find();
 
            while(cur.hasNext()) {
                System.out.println(cur.next());
            }   
        } else {
            System.out.println("Please connect to MongoDB "
                + "and then fetch the collection");
        }
    }
 
    public static void main(String[] args) {
        ConnectToMongoDB connectToMongoDB = new ConnectToMongoDB();
        connectToMongoDB.connect();
        connectToMongoDB.listAllCollections("prefs");
        connectToMongoDB.listLocationCollectionDocuments();
    }
}

ConnectToMongoDB.java

Make sure to have the MongoDB Java driver in the classpath when you compile and run this program. On running the program, the output is as follows:

location
system.indexes
{ "_id" : { "$oid" : "4c97053abe67000000003857"} , "name" : "John Doe" , 
    "zip" : 10001.0}
{ "_id" : { "$oid" : "4c970541be67000000003858"} , "name" : "Lee Chang" , 
    "zip" : 94129.0}
{ "_id" : { "$oid" : "4c970548be67000000003859"} , "name" : "Jenny Gonzalez" , 
    "zip" : 33101.0}
{ "_id" : { "$oid" : "4c970555be6700000000385a"} , "name" : "Srinivas Shastri" , 
    "zip" : 1089.0}
{ "_id" : { "$oid" : "4c97a6555c760000000054d8"} , "name" : "Don Joe" , 
    "zip" : 10001.0}
{ "_id" : { "$oid" : "4c97a7ef5c760000000054da"} , "name" : "John Doe" , 
    "zip" : 94129.0}
{ "_id" : { "$oid" : "4c97add25c760000000054db"} , "name" : "Lee Chang" , 
    "zip" : 94129.0 , "streetAddress" : "37000 Graham Street"}

The output of the Java program tallies with what you saw with the command-line interactive JavaScript shell earlier in the chapter.
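
Each document in this output is printed in a JSON form, with the generated identifier wrapped as an "$oid" value. The short Python sketch below parses one output line with the standard json module, purely to show how the fields are laid out; it does not talk to MongoDB. Note that zip appears as 10001.0 because numbers entered through the JavaScript shell are stored as floating-point values.

```python
import json

# One line of the program output, joined back onto a single line.
line = ('{ "_id" : { "$oid" : "4c97053abe67000000003857"} , '
        '"name" : "John Doe" , "zip" : 10001.0}')

doc = json.loads(line)

object_id = doc["_id"]["$oid"]   # the identifier as a hex string
zip_code = int(doc["zip"])       # stored as a float, shown here as an int

print(object_id)
print(doc["name"], zip_code)
```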

Now see how the same example works with PHP.

MongoDB PHP Driver

First, download the PHP driver from the MongoDB github code repository and configure the driver to work with your local PHP environment. Refer to the Appendix A subsection on MongoDB installation for further details.

A sample PHP program that connects to a local MongoDB server and lists the documents in the location collection in the prefs database is as follows:

<?php
$connection = new Mongo( "localhost:27017" );
$collection = $connection->prefs->location;
$cursor = $collection->find();
foreach ($cursor as $id => $value) {
    echo "$id: ";
    var_dump( $value );
}

connect_to_mongodb.php

The program is succinct but does the job! Next, you see how Ruby handles this same task.

MongoDB Ruby Driver

MongoDB has drivers for all mainstream languages and Ruby is no exception. You can obtain the driver from the MongoDB github code repository but it may be easier to simply rely on RubyGems to manage the installation. To get ready to connect to MongoDB from Ruby, get at least the mongo and bson gems. You can install the mongo gem as follows:

gem install mongo

The bson gem will be installed automatically. In addition, installing the bson_ext gem is recommended because its C extensions improve performance.

Listing 2-3 depicts a sample Ruby program that connects to the MongoDB server and lists all the documents in the location collection in the prefs database.

LISTING 2-3: Get all documents in a MongoDB collection using Ruby
require 'rubygems'
require 'mongo'
db = Mongo::Connection.new("localhost", 27017).db("prefs")
locationCollection = db.collection("location")
# Print every document in the location collection.
locationCollection.find().each { |row| puts row.inspect }

connect_to_mongodb.rb

The next MongoDB driver discussed in this chapter is the one that helps connect Python to MongoDB.

MongoDB Python Driver

The easiest way to install the Python driver is to run easy_install pymongo. Once it is installed, you can invoke the Python program in Listing 2-4 to get a list of all documents in the location collection in the prefs database.

LISTING 2-4: Python program to interface with MongoDB
from pymongo import Connection

# Connect to the local server, then select the database and collection.
connection = Connection('localhost', 27017)
db = connection.prefs
collection = db.location
for doc in collection.find():
    print(doc)

connect_to_mongodb.py

At this stage, this example has been created and run in at least five different ways. It’s a simple and useful example that illustrates the directly relevant concepts of establishing a connection, fetching a database, a collection, and documents within that collection.

A First Look at Thrift

Thrift is a framework for cross-language services development. It consists of a software stack and a code-generation engine to connect smoothly between multiple languages. Apache Cassandra uses the Thrift interface to provide a layer of abstraction to interact with the column data store. You can learn more about Apache Thrift at http://incubator.apache.org/thrift/.

The Cassandra Thrift interface definitions are available in the Apache Cassandra distribution in a file named cassandra.thrift, which resides in the interface directory. The Thrift interface definitions vary between Cassandra versions, so make sure that you get the correct version of the interface file. Also, make sure you have a compatible version of Thrift itself.

Thrift can create language bindings for a number of languages. In the case of Cassandra, you could generate interfaces for Java, C++, C#, Python, PHP, and Perl. The simplest command to generate all Thrift interfaces is:

thrift --gen interface/cassandra.thrift

Additionally, you could specify the languages as parameters to the Thrift generator program. For example, to create only the Java Thrift interface run:

thrift --gen java interface/cassandra.thrift

Once the Thrift modules are generated, you can use them in your program. Assuming you have generated the Python Thrift interfaces and modules successfully (the Python generator is selected with --gen py), you can connect to the CarDataStore keyspace and query for data as depicted in Listing 2-5.

LISTING 2-5: Querying CarDataStore keyspace using the Thrift interface
from thrift import Thrift
from thrift.transport import TTransport
from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import *
import pprint
 
def main():
 
  socket = TSocket.TSocket("localhost", 9160)
  # Cassandra 0.7 uses framed transport by default; switch to
  # TTransport.TBufferedTransport if your server is configured unframed.
  transport = TTransport.TFramedTransport(socket)
  # The protocol wraps the transport, so the transport is created first.
  protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
  client = Cassandra.Client(protocol)
  pp = pprint.PrettyPrinter(indent=2)
  keyspace = "CarDataStore"
  key = "Prius"  # a row key inserted earlier via the CLI
  try:
      transport.open()
      # In the 0.7 interface the keyspace is set once per connection.
      client.set_keyspace(keyspace)
      # Query for all columns in the row
      column_parent = ColumnParent(column_family="Cars")
      slice_range = SliceRange(start="", finish="")
      predicate = SlicePredicate(slice_range=slice_range)
      result = client.get_slice(key,
                                column_parent,
                                predicate,
                                ConsistencyLevel.ONE)
      pp.pprint(result)
  except Thrift.TException as tx:
      print('Thrift: %s' % tx.message)
  finally:
      transport.close()
  
if __name__ == '__main__':
  main()

query_cardatastore_using_thrift.py

Although Thrift is a very useful multilanguage interface, sometimes you may just choose to go with a pre-existing language API. Some of these APIs provide much-needed reliability and stability because they are tested and actively supported, even while the products they connect to evolve rapidly. Many of them use Thrift under the hood. A number of such libraries exist for Cassandra, notably Hector for Java, Pycassa for Python, and Phpcassa for PHP.

SUMMARY

The aim of this chapter was to give a first feel of NoSQL databases by providing a hands-on walkthrough of some of the core concepts. The chapter delivered on that promise and managed to cover more than simple “Hello World” printing to console.

Introductory concepts that relate to NoSQL were explained in this chapter through small and terse examples. The examples started with the basics and built up just far enough to illustrate those concepts. In all these examples, MongoDB and Apache Cassandra, two leading NoSQL options, served as the underlying products.

The chapter was logically divided into two parts: one that dealt with the core NoSQL storage concepts and the other that helped connect NoSQL stores to a few mainstream programming languages. Therefore, the initial part involved examples run via the command-line client and the later part included examples that can be run as a standalone program.

The next chapter builds on what was introduced in this chapter. More examples on interfacing with NoSQL databases and querying the available data set are explored in that chapter. Newer and different NoSQL products are also introduced there.
