Chapter 6

Querying NoSQL Stores

WHAT’S IN THIS CHAPTER?

  • Illustrating a few NoSQL query mechanisms using sample data sets
  • Querying use cases in the context of MongoDB, HBase, and Redis
  • Creating advanced and complex queries in NoSQL
  • Using alternatives for rich querying capabilities without SQL

SQL is possibly the simplest yet most powerful domain-specific language created so far. It is easy to learn because it has a limited vocabulary, unambiguous grammar, and a simple syntax. It is terse and limited in scope but it does precisely what it’s meant to do. It enables you to manipulate structured data sets like a ninja. You can easily filter, sort, dice, and slice the data sets. Based on relations, you can join data sets and create intersections and unions. You can summarize data sets and manipulate them to group by a specific attribute or filter them on the basis of their grouping criteria. There is one limitation, though: SQL is based on relational algebra, which works well with relational databases only. As is evident from its name, there is no SQL with NoSQL.

The absence of SQL does not mean that you need to stop querying data sets. After all, any data is stored to be possibly retrieved and manipulated later. NoSQL stores have their own ways of accessing and manipulating data and you have already seen some of that.

NoSQL should really have been NonRel, implying non-relational. Although the creators and proponents of the so-called NoSQL databases moved away from relational databases because of the structural constraints and the ACID transactions they imposed, especially as these became impediments to scaling and to dealing with large data sets, they weren’t necessarily opposed to SQL. In fact, some still crave SQL in the world of NoSQL and as a result have created query languages that resemble the same old SQL in syntax and style. Old habits die hard, and why shouldn’t they if they are the virtuous ones!

In this chapter, you learn many tips and tricks for querying NoSQL stores. As in the previous chapters, you learn the tips and tricks in the context of multiple products and varying technologies, all grouped under the large umbrella of NoSQL. The lesson starts with querying data sets stored in MongoDB and then moves on to cover HBase and Redis.

SIMILARITIES BETWEEN SQL AND MONGODB QUERY FEATURES

Although MongoDB is a document database and has little resemblance to a relational database, the MongoDB query language feels a lot like SQL. You have already seen some initial examples, so I presume I don’t need to convince you about its SQL-like query features.

To understand the MongoDB query language capabilities and see how it performs, start by loading a data set into a MongoDB database. So far, the data sets used in this book have been small and limited because the focus has been more on introducing MongoDB’s core features and less on its applicability to real-life situations. For this chapter, though, I introduce a data set that is slightly more substantial than those used so far: the MovieLens data set of a million movie-rating records.

MOVIELENS

The GroupLens research lab in the Department of Computer Science and Engineering at the University of Minnesota conducts research in a number of disciplines:

  • Recommender systems
  • Online communities
  • Mobile and ubiquitous technologies
  • Digital libraries
  • Local geographic information systems

The MovieLens data set is a part of the available GroupLens data sets. The MovieLens data set contains user ratings for movies. It is a structured data set and is available in three different download bundles, containing 100,000, 1 million, and 10 million records, respectively. You can download the MovieLens data set from grouplens.org/node/73.

First, go to grouplens.org/node/73 and download the data set that has 1 million movie-rating records. Download bundles are available in tar.gz (tarred and zipped) and .zip archive formats. Download the format that is best for your platform. After you get the bundle, extract the contents of the archive file to a folder in your filesystem. On extracting the 1 million ratings data set, you should have the following three files:

  • movies.dat
  • ratings.dat
  • users.dat

The movies.dat data file contains data on the movies themselves. This file contains 3,952 records, and each line in that file contains one record. The record is saved in the following format:

<MovieID>::<Title>::<Genres>

The MovieID is a simple integer sequence. The movie title is a string that includes the year the movie was released, specified in parentheses appended to the name. The movie titles are the same as those in IMDB (www.imdb.com). Each movie may be classified under multiple genres, which are specified in a pipe-delimited format. A sample line from the file looks like so:

1::Toy Story (1995)::Animation|Children's|Comedy

The ratings.dat file contains the ratings of the 3,952 movies by more than 6,000 users. The ratings file has more than 1 million records. Each line is a different record that contains data in the following format:

UserID::MovieID::Rating::Timestamp

UserID and MovieID identify and establish a relationship with the user and the movie, respectively. The rating is a measure on a 5-point (5-star) scale. Timestamp captures the time when the ratings were recorded.

The users.dat file contains data on the users who rated the movies. The information on more than 6,000 users is recorded in the following format:

UserID::Gender::Age::Occupation::Zip-code

Loading the MovieLens Data

For simplicity, upload the data into three MongoDB collections: movies, ratings, and users, each mapping to a .dat data file. The mongoimport utility (www.mongodb.org/display/DOCS/Import+Export+Tools) would ordinarily be suited to extracting the data from the .dat files and loading it into the MongoDB document store, but that’s not an option here: the MovieLens data is delimited by the double-colon (::) character, and mongoimport recognizes only JSON, comma-separated, and tab-delimited formats.
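If you would rather stick with mongoimport, one workaround is to rewrite the delimiter first. The following Python sketch (hypothetical; the file names and the latin-1 encoding of the MovieLens 1M files are assumptions) converts movies.dat into a tab-separated file, though the genres would then land as a single pipe-delimited string rather than an array, which is part of why a custom loader is preferable here:

import io

# convert the '::'-delimited movies.dat into a tab-separated movies.tsv
# (the MovieLens 1M files are assumed to be latin-1 encoded; adjust if yours differ)
with io.open("movies.dat", "r", encoding="latin-1") as src:
    with io.open("movies.tsv", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.rstrip(u"\n").replace(u"::", u"\t") + u"\n")

# then, for example (verify the exact flags against your mongoimport version):
#   mongoimport -d mydb -c movies --type tsv --fields _id,title,genres movies.tsv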

You could work around the limitation by preprocessing the files, as sketched above, but instead I fall back on a programming language and an associated MongoDB driver to parse the text files and load the data set into MongoDB collections. For brevity, I chose Ruby. Alternatively, you could use Python (also brief and elegant), Java, PHP, C, or any of the other supported languages.

A small bit of code, as shown in Listing 6-1, easily extracts and loads the data from the users, movies, and ratings data files to respective MongoDB collections. This code uses simple file-reading and string-splitting features, along with the MongoDB driver to carry out the task. It’s not the most elegant code. It doesn’t take care of exceptions or work fast with extremely large files, but it works for the purpose at hand.

LISTING 6-1: movielens_dataloader.rb
require 'rubygems' #can skip this line in Ruby 1.9
require 'mongo'

field_map = {
    "users" => %w(_id gender age occupation zip_code),
    "movies" => %w(_id title genres),
    "ratings" => %w(user_id movie_id rating timestamp)
}

db = Mongo::Connection.new.db("mydb")
collection_map = {
    "users" => db.collection("users"),
    "movies" => db.collection("movies"),
    "ratings" => db.collection("ratings")
}

unless ARGV.length == 1
    puts "Usage: movielens_dataloader data_filename"
    exit(0)
end

class Array
  def to_h(key_definition)
    result_hash = Hash.new

    key_definition.each_with_index do |definition, index|
      value = self[index]
      if value.nil?
        # Insert the key definition with an empty value,
        # because we probably still want the hash to contain the key.
        result_hash[definition] = ""
      elsif value.is_a?(Array) || value.is_a?(Integer)
        # arrays (the genres) and integers are stored as-is
        result_hash[definition] = value
      else
        # strings get stray whitespace and trailing newlines stripped
        result_hash[definition] = value.strip
      end
    end

    result_hash
  end
end

if File.exists?(ARGV[0])
    file = File.open(ARGV[0], 'r')
    data_set = ARGV[0].chomp.split(".")[0]
    file.each { |line|
        field_names = field_map[data_set]
        # store values that look like integers as integers
        field_values = line.split("::").map { |item|
            if item.to_i.to_s == item
                item = item.to_i
            else
                item
            end
        }
        puts "field_values: #{field_values}"
        last_field_value = field_values.last
        puts "last_field_value: #{last_field_value}"
        # a pipe-delimited last field (the genres) becomes an array
        if last_field_value.split("|").length > 1
            field_values.pop
            field_values.push(last_field_value.split.join(' ').split("|"))
        end
        field_values_doc = field_values.to_h(field_names)
        collection_map[data_set].insert(field_values_doc)
    }
    puts "inserted #{collection_map[data_set].count()} records into the " +
         "#{collection_map[data_set].to_s} collection"
end

movielens_dataloader.rb

When the data is loaded, you are ready to run a few queries to slice and dice it. Queries can be run from the JavaScript shell or from any of the supported languages. For this example, I run most queries using the JavaScript shell and a select few using a couple of different programming languages and their respective drivers. The only purpose of including programming language examples is to demonstrate that most, if not all, of what’s possible via the JavaScript shell is available through the different language drivers.

To commence with querying the MongoDB collections, start up the MongoDB server and connect to it using the Mongo shell. The necessary programs are accessible from the bin folder of your MongoDB installation. Having started MongoDB a few times in the past chapters, you are hopefully well versed in starting and stopping these programs by now.

On your Mongo JavaScript shell, first get a count of all the values in the ratings collection as follows:

db.ratings.count();

In response, you should see 1000209. A million plus ratings were uploaded so this looks right.

Next, get a sample set of the ratings data with the help of the following command:

db.ratings.find();

On the shell you don’t need an explicit cursor to print out the values from a collection. The shell restricts the number of rows to a maximum of 20 at a time. To iterate over more data, simply type it (short for iterate) on your shell. In response to the it command, you should see 20 more records and a label saying “has more,” if more records exist beyond the ones you have already browsed on your shell.
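Since this chapter promises driver parity, here is a minimal counterpart in Python using the pymongo driver (assumptions on my part: pymongo is installed and the server runs on the default port; count_documents requires pymongo 3.7 or newer):

from pymongo import MongoClient

db = MongoClient()["mydb"]                # connects to localhost:27017 by default
print(db.ratings.count_documents({}))     # 1000209, as on the shell
for doc in db.ratings.find().limit(5):    # the driver hands you an explicit cursor
    print(doc)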

The ratings data, for example, { "_id" : ObjectId("4cdcf1ea5a918708b0000001"), "user_id" : 1, "movie_id" : 1193, "rating" : 5, "timestamp" : "978300760" }, reveals little about the movie it relates to, because it carries the movie id and not the movie’s name. You can get around this problem by answering the following questions:

  • How can I get all the ratings data for a given movie?
  • How do I get the movie information for a given rating?
  • How do I put together a list of all the movies with the ratings data grouped by the movies they relate to?

In relational databases, these types of relationships are traversed using joins. MongoDB has no server-side join; related data is correlated explicitly in application code, outside the server, as the sketch that follows shows. MongoDB does define the concept of a DBRef to establish a relationship between fields of two separate collections, but that feature has a few limitations and doesn’t provide the same power as explicit id-based linking does. I won’t cover DBRef in this section, but I included a few examples of DBRef in previous chapters and will revisit it in later chapters as well.
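To make the client-side correlation concrete, here is a minimal sketch of an id-based join performed in application code (Python with pymongo, reusing the db handle from the earlier sketch):

# resolve a rating's movie_id against the movies collection: two round trips,
# both performed by the client; the server itself never joins
rating = db.ratings.find_one({"movie_id": 1721})
movie = db.movies.find_one({"_id": rating["movie_id"]})
print(movie["title"], rating["rating"])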

To get all the ratings data for a given movie, you filter the data set using the movie id as the criteria. For example, to view all ratings for the famous Academy Award-winning movie Titanic, you need to first find its id and then use that to filter the ratings collection. If you aren’t sure what the exact title string for “Titanic” is like but you are confident the word titanic appears in it, you can try an approximate, and not an exact, match with the title strings in the movies collection. In an RDBMS, to find the movie id under such circumstances, you are likely to rely on the like expression in a SQL where clause to get a list of all possible candidates. In MongoDB, there is no like expression but there is a more powerful feature available, which is the ability to define a pattern using regular expressions. So to get a list of all records in the movies collection that have Titanic or titanic in their title, you can query like so:

db.movies.find({ title: /titanic/i});

This query returns the following set of documents:

{ "_id" : 1721, "title" : "Titanic (1997)", "genres" : [ "Drama", "Romance" ] }
{ "_id" : 2157, "title" : "Chambermaid on the Titanic, The (1998)", "genres" :
"Romance" }
{ "_id" : 3403, "title" : "Raise the Titanic (1980)", "genres" : [ "Drama", 
"Thriller" ] }
{ "_id" : 3404, "title" : "Titanic (1953)", "genres" : [ "Action", "Drama" ] }

The title field in the MovieLens data set includes the year the movie was released. Within the title field, the release year is included in parentheses. So, if you remembered or happen to know that Titanic was released in the year 1997, you can write a more tuned query expression as follows:

db.movies.find({ title: /titanic.*\(1997\).*/i});

This returns just one document:

{ "_id" : 1721, "title" : "Titanic (1997)", "genres" : [ "Drama", "Romance" ] }

The expression essentially matches all title strings that contain titanic in any combination of case: Titanic, titanic, TitaniC, TiTAnic, and so on. In short, it ignores case. In addition, it looks for the literal string (1997); the parentheses are escaped with backslashes because unescaped parentheses denote a grouping construct in a regular expression. The expression also allows zero or more characters between titanic and (1997) and after (1997). Support for regular expressions is a powerful feature, and it is always worthwhile to gain mastery over them.
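The same pattern carries over to the drivers. In pymongo, for instance, you can pass either a compiled Python regular expression or the $regex operator (a sketch, under the same assumptions as before):

import re

# two equivalent ways to express the case-insensitive title match
titanic_1997 = db.movies.find(
    {"title": re.compile(r"titanic.*\(1997\)", re.IGNORECASE)})
titanic_1997 = db.movies.find(
    {"title": {"$regex": r"titanic.*\(1997\)", "$options": "i"}})
print(list(titanic_1997))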

The range of values for the movie_id field of the ratings collection is defined by the _id of the movies collection. So to get all ratings for the movie Titanic, which has an id of 1721, you could query like so:

db.ratings.find({ movie_id: 1721 });

To find out the number of available ratings for Titanic, you can count them as follows:

db.ratings.find({ movie_id: 1721 }).count();

The response to the count is 1546. The ratings are on a 5-point scale. To get a list and count of only the 5-star ratings for the movie Titanic you can further filter the record set like so:

db.ratings.find({ movie_id: 1721, rating: 5 });
db.ratings.find({ movie_id: 1721, rating: 5 }).count();

DATA-TYPE SENSITIVITY IN QUERY DOCUMENTS

MongoDB query documents are data-type sensitive. That is, { movie_id: "1721" } and { movie_id: 1721 } are not the same: the first matches a string, whereas the second matches a number. When specifying query documents, be sure to use the correct data type. To illustrate further, movie_id is stored as a number (an integer) in the ratings and movies collections, so querying for a string match doesn’t return correct results: db.ratings.find({ movie_id: 1721 }); returns 1,546 documents, but db.ratings.find({ movie_id: "1721" }); returns none.

If you browse Listing 6-1 carefully, you will notice the following line:

field_values = line.split("::").map { |item|
            if item.to_i.to_s == item
                item = item.to_i
            else
                item
            end
        }

This bit of code checks to see if the split string holds an integer value and saves it as an integer, if that’s the case. Making this little extra effort to save numerical values as numbers has its benefits. Indexing and querying on numerical records is usually faster and more efficient than on character-based (string) records.
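The same coercion is just as easy in the other driver languages; a rough Python equivalent of the Ruby fragment might look like this (a sketch; line stands for one raw line of a .dat file):

def coerce(item):
    # keep values that round-trip through int() as integers; leave the rest as strings
    stripped = item.strip()
    return int(stripped) if stripped.lstrip("-").isdigit() else stripped

field_values = [coerce(item) for item in line.split("::")]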

Next, you may want to get some statistics of all the ratings for Titanic. To find out the distinct set of ratings by users (from the possible set of integers between 1 and 5, both inclusive), you could query as follows:

db.runCommand({ distinct: 'ratings', key: 'rating', query: { movie_id: 1721} });

Ratings for Titanic include all possible cases between 1 and 5 (both inclusive) so the response is like so:

{ "values" : [ 1, 2, 3, 4, 5 ], "ok" : 1 }

runCommand takes the following arguments:

  • Collection name for the field labeled distinct
  • Field name for key, whose distinct values would be listed
  • Query to optionally filter the collection

runCommand follows a slightly different pattern from the query style you have seen so far because the collection is filtered before the distinct values are searched for. Distinct values for all ratings across the entire collection can be listed in the familiar way, as follows:

db.ratings.distinct("rating");
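The drivers expose distinct as well; in pymongo the collection helper takes an optional filter document, which mirrors the runCommand form above:

# distinct ratings for Titanic, filtered before the distinct values are collected
print(db.ratings.distinct("rating", {"movie_id": 1721}))   # [1, 2, 3, 4, 5]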

You know from the distinct values that Titanic has all possible ratings from 1 to 5. To see how these ratings break down by each rating value on the 5-point scale, you could group the counts like so:

db.ratings.group(
... { key: { rating:true },
...   initial: { count:0 },
...   cond: { movie_id:1721 },
...   reduce: function(obj, prev) { prev.count++; }           
... }
... );

The output of this grouping query is an array as follows:

[
    {
        "rating" : 4,
        "count" : 500
    },
    {
        "rating" : 1,
        "count" : 100
    },
    {
        "rating" : 5,
        "count" : 389
    },
    {
        "rating" : 3,
        "count" : 381
    },
    {
        "rating" : 2,
        "count" : 176
    }
]

This group by function is quite handy for single MongoDB instances but doesn’t work in sharded deployments. Use MongoDB’s MapReduce facility to run grouping functions in a sharded MongoDB setup. A MapReduce version of the grouping function is included right after the group operation is explained.

The group operation takes an object as an input. This group operation object includes the following fields:

  • key — The document field to group by. The preceding example has only one field: rating. Additional group by fields can be included in a comma-separated list and assigned as the value of the key field. A possible configuration could be key: { fieldA: true, fieldB: true }.
  • initial — Initial value of the aggregation statistic. In the previous example the initial count is set to 0.
  • cond — The query document to filter the collection.
  • reduce — The aggregation function.
  • keyf (optional) — An alternative derived key if the desired key is not an existing document field.
  • finalize (optional) — A function that can run on every item that the reduce function iterates through. This could be used to modify existing items.

Theoretically, the example could easily be morphed into a case where ratings for each movie are grouped by the rating points by simply using the following group operation:

db.ratings.group(
... { key: { movie_id:true, rating:true }, 
...   initial: { count:0 },
...   reduce: function(obj, prev) { prev.count++; }
... }
... );

In real cases, though, this wouldn’t work for the ratings collection of 1 million items. You would be greeted instead with the following error message:

Fri Nov 12 14:27:03 uncaught exception: group command failed: {
    "errmsg" : "exception: group() can't handle more than 10000 unique keys",
    "code" : 10043,
    "ok" : 0
}

The result is returned as a single BSON object, so the group operation cannot handle a collection with more than 10,000 unique keys. This limitation, too, can be overcome with the MapReduce facility.
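For what it’s worth, MongoDB versions 2.2 and later also offer the aggregation framework, which sidesteps both the sharding restriction and the 10,000-key cap for queries like this one. The rating breakdown for Titanic could be expressed as the following pipeline (shown with pymongo; the shell form is analogous):

pipeline = [
    {"$match": {"movie_id": 1721}},                        # plays the role of cond
    {"$group": {"_id": "$rating", "count": {"$sum": 1}}},  # plays the role of key/reduce
]
for doc in db.ratings.aggregate(pipeline):
    print(doc)   # e.g. {'_id': 5, 'count': 389}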

In the following section you explore MongoDB’s MapReduce facility and run a few aggregation functions on the entire ratings data set.

MapReduce in MongoDB

MapReduce is a patented software framework from Google that supports distributed computing on a large distributed cluster of computers. You can read about Google’s MapReduce in a research paper titled “MapReduce: Simplified Data Processing on Large Clusters” available online at http://labs.google.com/papers/mapreduce.html.

Google’s MapReduce framework has inspired many clones and distributed computing frameworks in the open-source community. MongoDB’s is one of those. Google’s and MongoDB’s MapReduce features are also inspired by similar constructs in the world of functional programming. In functional programming, a map function is one that applies to each member of a collection and a reduce function or a fold function is one that runs an aggregation function across the collection.


MongoDB’s MapReduce features are not a clone of Google’s MapReduce infrastructure. Hadoop’s MapReduce, in contrast, is an open-source implementation of Google’s distributed computing ideas and includes infrastructure for both column databases (HBase) and MapReduce-based computing.

Understanding MapReduce can sometimes be intimidating, but once you understand its structure and flow, it’s a powerful tool that helps you carry out large computations across distributed collections of data. So, starting out with a few simple examples and then graduating to more complex ones is a good way to smooth the learning curve and achieve mastery of the topic.

The simplest aggregation example could be a count of each type of an item in a collection. To use MapReduce, you need to define a map function and a reduce function and then run the map and reduce functions against a collection. A map function applies a function to every member of the collection and emits a key/value pair for each member as an outcome of this process. The key/value output of a map function is consumed by the reduce function. The reduce function runs an aggregation function across all key/value pairs and generates an output in turn.

The map function to count the number of female (F) and male (M) respondents in the users collection is as follows:

> var map = function() {
... emit({ gender:this.gender }, { count:1 });
... };

movielens_queries.txt

This map function emits a key/value pair for each item in the collection that has a gender property. It counts 1 for each such occurrence.

The reduce function for counting the number of total occurrences of male and female types among all users is as follows:

> var reduce = function(key, values) {
... var count = 0;
... values.forEach(function(v) {
... count += v['count'];
... });
...
... return { count:count };
... };

movielens_queries.txt

A reduce function takes a key/value pair emitted by the map function. In this particular reduce function, each value in the key/value pair is passed through a function that counts the number of occurrences of a particular type. The line count += v['count'] could also be written as count += v.count, because JavaScript lets you access an object’s properties either with hash-style bracket notation or with dot notation.

Finally, running this map and reduce function pair against the users collection leads to an output of the total count of female and male members in the users collection. The mapReduce run and result extraction commands are as follows:

> var ratings_respondents_by_gender = db.users.mapReduce(map, reduce);
> ratings_respondents_by_gender                                       
{
    "result" : "tmp.mr.mapreduce_1290399924_2",
    "timeMillis" : 538,
    "counts" : {
        "input" : 6040,
        "emit" : 6040,
        "output" : 2
    },
    "ok" : 1,
}
> db[ratings_respondents_by_gender.result].find();                    
{ "_id" : { "gender" : "F" }, "value" : { "count" : 1709 } }
{ "_id" : { "gender" : "M" }, "value" : { "count" : 4331 } }

movielens_queries.txt

To verify the output, filter the users collection for gender values "F" and "M" and count the number of documents in each filtered sub-collection. The commands for filtering and counting the users collection for gender values "F" and "M" are as follows:

> db.users.find({ "gender":"F" }).count();
1709
> db.users.find({ "gender":"M" }).count();
4331

movielens_queries.txt

Next, you can modify the map function slightly and run the map and reduce functions against the ratings collection to count the number of each type of rating (1, 2, 3, 4, or 5) for each movie. In other words, you are counting the collection grouped by rating value for each movie. Here are the complete map and reduce function definitions run against the ratings collection:

> var map = function() {
... emit({ movie_id:this.movie_id, rating:this.rating }, { count:1 });
... };
> var reduce = function(key, values) {
... var count = 0;
... values.forEach(function(v) {
... count += v['count'];
... });
...
... return { count: count };
... };
> var group_by_movies_by_rating = db.ratings.mapReduce(map, reduce);
> db[group_by_movies_by_rating.result].find();

movielens_queries.txt

To get a count of each type of rating for the movie Titanic, identified by movie_id 1721, you simply filter the MapReduce output using the nested property access method, like so:

> db[group_by_movies_by_rating.result].find({ "_id.movie_id":1721 }); 
{ "_id" : { "movie_id" : 1721, "rating" : 1 }, "value" : { "count" : 100 } }
{ "_id" : { "movie_id" : 1721, "rating" : 2 }, "value" : { "count" : 176 } }
{ "_id" : { "movie_id" : 1721, "rating" : 3 }, "value" : { "count" : 381 } }
{ "_id" : { "movie_id" : 1721, "rating" : 4 }, "value" : { "count" : 500 } }
{ "_id" : { "movie_id" : 1721, "rating" : 5 }, "value" : { "count" : 389 } }

movielens_queries.txt

In the two examples of MapReduce so far, the reduce function is identical but the map function is different. In each case a count of 1 is established for a different emitted key/value pair. In one a key/value pair is emitted for each document that has a gender property, whereas in the other a key/value pair is emitted for each document identified by the combination of a movie id and a rating id.

Next, you could calculate the average rating for each movie in the ratings collection. One caveat applies: MongoDB may invoke the reduce function repeatedly on partial results, so a reduce function must return a value of the same shape as the values emitted by map and must never assume it sees all values for a key at once. For that reason the running sums are carried through reduce, and the average itself is computed in a finalize function:

> var map = function() {
... emit({ movie_id:this.movie_id }, { rating:this.rating, count:1 });
... };

> var reduce = function(key, values) {
... var sum = 0;
... var count = 0;
... values.forEach(function(v) {
... sum += v['rating'];
... count += v['count'];
... });
...
... return { rating:sum, count:count };
... };
> var finalize = function(key, value) {
... value.average = value.rating / value.count;
... return value;
... };
> var average_rating_per_movie = db.ratings.mapReduce(map, reduce,
... { finalize:finalize });
> db[average_rating_per_movie.result].find();

movielens_queries.txt
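Incidentally, on MongoDB 2.2 or later the same average is a one-line aggregation pipeline, which is often simpler than MapReduce for aggregates of this kind (a pymongo sketch; given the per-rating counts seen earlier, Titanic’s average works out to about 3.58):

pipeline = [{"$group": {"_id": "$movie_id", "average": {"$avg": "$rating"}}}]
for doc in db.ratings.aggregate(pipeline):
    print(doc)   # e.g. {'_id': 1721, 'average': 3.58...}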

MapReduce allows you to write many types of sophisticated aggregation algorithms, some of which were presented in this section. A few others are introduced later in the book.

By now you have had a chance to understand many ways of querying MongoDB collections. Next, you get a chance to familiarize yourself with querying tabular databases. HBase is used to illustrate the querying mechanism.

ACCESSING DATA FROM COLUMN-ORIENTED DATABASES LIKE HBASE

Before you get into querying an HBase data store, you need to store some data in it. As with MongoDB, you have already had a first taste of storing and accessing data in HBase and its underlying filesystem, which often defaults to the Hadoop Distributed File System (HDFS). You are also aware of HBase and Hadoop basics. This section builds on that basic familiarity. As a working example, historical daily stock market data from the NYSE, from the 1970s through February 2010, is loaded into an HBase instance. This loaded data set is then accessed using an HBase-style querying mechanism. The historical market data is collated from original sources by Infochimps and can be accessed at www.infochimps.com/datasets/nyse-daily-1970-2010-open-close-high-low-and-volume.

The Historical Daily Market Data

The zipped-up download of the entire data set is substantial at 199 MB but very small by HDFS and HBase standards. The HBase and Hadoop infrastructures are capable of, and often used for, dealing with petabytes of data that span multiple physical machines. I chose an easily manageable data set for the example because I want to avoid, for now, the distraction of preparing and loading a very large data set. This chapter is about the query methods in NoSQL stores, and the focus in this section is on column-oriented databases. Understanding data access in smaller data sets is more manageable, and the concepts apply equally well to larger amounts of data.

The data fields are partitioned logically into three different types:

  • The combination of exchange, stock symbol, and date serves as the unique id
  • The open, high, low, close, and adjusted close values are measures of price
  • The daily volume

The row-key can be created using a combination of the exchange, stock symbol, and date. So NYSE,AA,2008-02-27 could be structured as NYSEAA20080227 to be a row-key for the data. All price-related information can be stored in a column-family named price and volume data can be stored in a column-family named volume.

The table itself is named historical_daily_stock_price. To get the row data for NYSE, AA, 2008-02-27, you can query as follows:

get 'historical_daily_stock_price', 'NYSEAA20080227'

You can get the open price as follows:

get 'historical_daily_stock_price', 'NYSEAA20080227', 'price:open'

You could also use a programming language to query for the data. A sample Java program to get the open and high price data could be as follows:

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.RowResult;

import java.util.HashMap;
import java.util.Map;
import java.io.IOException;

public class HBaseConnector {

    public static Map<String, String> retrievePriceData(String rowKey) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "historical_daily_stock_price");
        Map<String, String> stockData = new HashMap<String, String>();

        RowResult result = table.getRow(rowKey);

        // each cell comes back as raw bytes; convert column names and values to strings
        for (byte[] column : result.keySet()) {
            stockData.put(new String(column), new String(result.get(column).getValue()));
        }

        return stockData;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> stockData = HBaseConnector.retrievePriceData("NYSEAA20080227");
        System.out.println(stockData.get("price:open"));
        System.out.println(stockData.get("price:high"));
    }
}

HBaseConnector.java
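If you prefer Python, the same lookups can be made through HBase’s Thrift gateway. The sketch below uses the third-party happybase library (an assumption; it is not used elsewhere in this book and requires a running Thrift server, by default on port 9090). It also shows why the exchange-symbol-date row-key design pays off: because row keys sort lexicographically, a prefix scan returns all of a symbol’s rows for a period as one contiguous range.

import happybase

connection = happybase.Connection("localhost")   # Thrift gateway assumed on :9090
table = connection.table("historical_daily_stock_price")

# single-row lookup, equivalent to the shell get commands above
row = table.row(b"NYSEAA20080227")
print(row[b"price:open"], row[b"price:high"])

# prefix scan: every 2008 trading day for AA on NYSE, in key order
for key, data in table.scan(row_prefix=b"NYSEAA2008"):
    print(key, data[b"price:close"])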

HBase offers few advanced querying features beyond those illustrated here, but its capability to index and query can be extended with the help of Lucene and Hive. Details of using Hive with HBase are covered in Chapter 12.

QUERYING REDIS DATA STORES

As with MongoDB and HBase, you have already had a chance to interact with Redis in earlier chapters, where you learned the essentials of data storage and access. In this section, the subject of querying data is explored a bit further. In line with the other illustrations in this chapter, a sample data set is first loaded into a Redis instance.

For the purpose of demonstration, the NYC Data Mine public raw data on parking facilities, available online at www.nyc.gov/data, is used. The data download is available in a comma-separated text file format. The download file is named parking_facilities.csv. See Listing 6-2 for a simple Python program that parses this CSV data set and loads it into a local Redis store. Remember to start your local Redis server before you run the Python script that loads the data. Running the redis-server program, available in the Redis installation directory, starts a Redis server instance, which by default listens for client connections on port 6379.

LISTING 6-2: Python program to extract NYC parking facilities data
import csv
import redis

f = open("parking_facilities.csv", "r")
parking_facilities = csv.DictReader(f, delimiter=',')
r = redis.Redis(host='localhost', port=6379, db=0)

def add_parking_facility(license_number,
        facility_type,
        entity_name,
        camis_trade_name,
        address_bldg,
        address_street_name,
        address_location,
        address_city,
        address_state,
        address_zip_code,
        telephone_number,
        number_of_spaces):
    # sadd returns 1 only if the license number wasn't already in the set,
    # so each facility's hash is written exactly once
    if r.sadd("parking_facilities_set", license_number):
        key = "parking_facility:%s" % license_number
        fields = {
            "facility_type": facility_type,
            "entity_name": entity_name,
            "camis_trade_name": camis_trade_name,
            "address_bldg": address_bldg,
            "address_street_name": address_street_name,
            "address_location": address_location,
            "address_city": address_city,
            "address_state": address_state,
            "address_zip_code": address_zip_code,
            "telephone_number": telephone_number,
            "number_of_spaces": number_of_spaces,
        }
        for field, value in fields.items():
            r.hset(key, field, value)
        return True
    else:
        return False
        

if __name__ == "__main__":
    for parking_facility_hash in parking_facilities:
        add_parking_facility(parking_facility_hash['License Number'],
            parking_facility_hash['Facility Type'],
            parking_facility_hash['Entity Name'],
            parking_facility_hash['Camis Trade Name'],
            parking_facility_hash['Address Bldg'],
            parking_facility_hash['Address Street Name'],
            parking_facility_hash['Address Location'],
            parking_facility_hash['Address City'],
            parking_facility_hash['Address State'],
            parking_facility_hash['Address Zip Code'],
            parking_facility_hash['Telephone Number'],
            parking_facility_hash['Number of Spaces'])
        print "added parking_facility with %s" % \
            parking_facility_hash['License Number']

nyc_parking_data_loader.py

The Python program loops through a list of extracted hash records and saves the values to a Redis instance. Each hash record is keyed using the license number. All license numbers themselves are saved in a set named parking_facilities_set.

To get a list of all license numbers in the set named parking_facilities_set, connect via another program or simply use the command-line client with the following command:

SMEMBERS parking_facilities_set

All 1,912 license numbers in the set are printed out. You can run wc -l parking_facilities.csv to verify this number. Each line in the CSV, apart from the header, corresponds to a parking facility, so the two numbers should reconcile.

For each parking facility the attributes are stored in a hash, which is identified by the key of the form parking_facility:<license_number>. Thus, to see all keys in the hash associated with license number 1105006 you can use the following command:

HKEYS parking_facility:1105006

The response is as follows:

1. "facility_type"
2. "entity_name"
3. "camis_trade_name"
4. "address_bldg"
5. "address_street_name"
6. "address_location"
7. "address_city"
8. "address_state"
9. "address_zip_code"
10. "telephone_number"
11. "number_of_spaces"

The license number 1105006 was first on the list returned by the SMEMBERS parking_facilities_set command. However, sets are not ordered, so rerunning this command may not return the same license number on top. If you need the members to appear in a certain order, use a sorted set instead of a set. A sorted set requires a score for each member to sort by, so you could replace the line if r.sadd("parking_facilities_set", license_number): with something like the following, using the numeric license number itself as the score (the exact zadd signature varies across redis-py versions; recent versions take a member-to-score mapping):

if r.zadd("parking_facilities_set", {license_number: float(license_number)}):

Now, you can query for specific values in the hash, say facility type, as follows:

HGET parking_facility:1105006 facility_type

The response is "Parking Lot". You can also print out all values using the HVALS command as follows:

HVALS parking_facility:1105006

The response is:

1. "Parking Lot"
2. "CENTRAL PARKING SYSTEM OF NEW YORK, INC"
3. ""
4. "41-61"
5. "KISSENA BOULEVARD"
6. ""
7. "QUEENS"
8. "NY"
9. "11355"
10. "2126296602"
11. "808"

Of course, it would be much nicer if you could print out all the keys and the corresponding values in a hash. You can do that using the HGETALL command as follows:

HGETALL parking_facility:1105006

The response is as follows:

1. "facility_type"
2. "Parking Lot"
3. "entity_name"
4. "CENTRAL PARKING SYSTEM OF NEW YORK, INC"
5. "camis_trade_name"
6. ""
7. "address_bldg"
8. "41-61"
9. "address_street_name"
10. "KISSENA BOULEVARD"
11. "address_location"
12. ""
13. "address_city"
14. "QUEENS"
15. "address_state"
16. "NY"
17. "address_zip_code"
18. "11355"
19. "telephone_number"
20. "2126296602"
21. "number_of_spaces"
22. "808"

Sometimes, you may not need all the key/value pairs but just want to print out the values for a specific set of fields. For example, you may want to print out only the address_city and the address_zip_code as follows:

HMGET parking_facility:1105006 address_city address_zip_code

The response is:

1. "QUEENS"
2. "11355"

You could similarly set values for a set of fields using the HMSET command. To get a count of the number of keys, you can use the HLEN command as follows:

HLEN parking_facility:1105006

The response is 11. If you want to check whether address_city is one of these keys, use the HEXISTS command, which verifies whether a given field exists in a hash. The command is used as follows:

HEXISTS parking_facility:1105006 address_city

The response is 1 if the field exists and 0 if it doesn’t.

Going back to the set parking_facilities_set, you may want to just count the members instead of listing them all. Use the SCARD command as follows:

SCARD parking_facilities_set

As expected, the response is 1912. You can verify whether a specific member exists in the set using the SISMEMBER command. To check whether 1105006 is a member of the set, use the following command:

SISMEMBER parking_facilities_set 1105006

The response is 1 if the member exists in the set and 0 if it doesn’t.
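All of these commands map one-to-one onto driver calls. The following redis-py sketch replays the same queries from Python against the instance loaded by Listing 6-2 (note that, under Python 3, redis-py returns bytes unless the client is created with decode_responses=True):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

print(r.scard("parking_facilities_set"))                      # 1912
print(r.sismember("parking_facilities_set", "1105006"))       # True
print(r.hget("parking_facility:1105006", "facility_type"))    # Parking Lot
print(r.hmget("parking_facility:1105006",
              "address_city", "address_zip_code"))            # QUEENS, 11355
print(r.hgetall("parking_facility:1105006"))                  # the full field/value map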

SUMMARY

This chapter illustrated a few query mechanisms that are more advanced than those you have seen so far. Querying was explained using examples: MongoDB querying details were explained using a sample movie-ratings data set, the HBase example was illustrated using historical stock market data, and the Redis querying capabilities were demonstrated using public NYC government data.

The coverage of the querying capabilities is not exhaustive and doesn’t cover all types of use cases. The use cases illustrated in this chapter are just some of the innumerable possibilities. However, walking through the examples should get you familiar with the style and mechanics of querying NoSQL data stores.
