CHAPTER 11


Languages and Programming Interfaces

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.

—Martin Fowler

As far as the customer is concerned, the interface is the product.

—Jef Raskin

Crucial to the dominance of the relational database was the almost universal adoption of the SQL language as the mechanism for querying and modifying data. SQL is not a perfect language, but it has demonstrated sufficient flexibility to meet the needs of both non-programming database users and professional database programmers. Programmers embed SQL in programming languages, while non-programmers use SQL either explicitly within a query tool or implicitly when a BI tool uses SQL under the hood to talk to the database. Prior to the introduction of SQL, most IT departments labored with a backlog of requests for reports; SQL allowed the business user to “self-serve” these requests.

SQL remains the most significant database language. Not only does it remain the universal language for RDBMS, but it is also widely adopted in next-generation database systems of the NewSQL variety.

However, even though NoSQL has been retrospectively amended to mean “Not Only SQL” rather than “Hell No SQL!,” SQL is not usually available for next-generation databases of the NoSQL variety. In this chapter, we look at how we can interact with these databases in the absence of SQL, and see how SQL is increasingly finding its way back into the world of nonrelational databases.

Note  The code examples in this chapter are intended only to provide a flavor of database programming languages1—this chapter is not intended to serve as a tutorial or reference for any of the languages concerned.

SQL

SQL remains the most significant database programming language, even within the scope of next-generation database systems. Hundreds of books have been written on the topic of SQL (indeed, I’ve written a couple), and it would be superfluous to provide a full review of the language here. However, it is probably worth recapping the variations that exist within SQL implementations today, as well as key ingredients of the SQL language.

The SQL language consists of these major categories of statements:

  • Data query language (DQL), represented by the ubiquitous SELECT statement.
  • Data manipulation language (DML), which includes statements that modify data in the database, such as UPDATE, DELETE, INSERT, and MERGE, together with transactional control statements (COMMIT, ROLLBACK, BEGIN TRANSACTION) and—for the purposes of this discussion—data control language (DCL) statements such as GRANT.
  • Data definition language (DDL), which includes statements that create or alter tables and other structures (indexes, materialized views, etc.). DDL also allows for the specification of stored procedures and triggers. These statements are usually highly vendor specific, since they incorporate support for proprietary programming languages (Oracle PL/SQL, for instance) or for storage clauses that are unique to the database in question.

The SQL language is the subject of several ANSI and ISO standard specifications. Some of these standards are:

  • SQL-89: The first major standard to be widely adopted. This standard describes the core elements of the SQL language as we know it today.
  • SQL-92 (SQL2): Added the modern join syntax, in which join conditions are fully qualified within the FROM clause, and added a standard metadata interface that provides descriptions of objects in the database and which is implemented by at least some vendors. SQL-92 introduced the concept of an “entry-level” subset of the specification (which was very similar to SQL-89).
  • SQL:1999 (SQL3): An explosion on the moon propels the moon into interstellar space. Whoops, sorry, that was Space: 1999. SQL:1999 was somewhat less interesting, introducing object-oriented relational database features that almost nobody uses and a few minor language changes.
  • SQL:2003: Introduced analytic “window” functions—an important innovation for analytics—and SQL/XML. It also finalized a specification for stored procedures.
  • SQL:2006, SQL:2008, SQL:2011: Miscellaneous refinements, such as INSTEAD OF triggers, the MERGE statement, and temporal queries.

In my opinion, the various SQL standards have become increasingly disconnected from real-world SQL language implementations. It’s common to hear vendors describe their SQL support as “entry-level SQL-92,” effectively claiming credit for adhering to the minimum level of a 14-year-old specification.

In practice, you can expect an RDBMS to implement everything in SQL-92 at least with respect to the SELECT statement and DML. SQL:2003 windowing functions, which allow a calculation within a row to have visibility into a “window” of adjacent rows, are widely implemented and are probably the most significant single innovation introduced in the last 15 years of the SQL language standards.

DDL statements and stored procedure code will generally not be compatible across databases. DDL statements such as CREATE TABLE share a common syntax, but usually contain vendor-specific constructs such as custom data types or proprietary storage clauses. And while the ANSI stored procedure syntax is implemented by DB2 and MySQL, Oracle and Microsoft SQL Server each implement their own, completely incompatible, stored program languages.

NoSQL APIs

Databases that are described as NoSQL clearly have to provide a mechanism for inserting, modifying, and retrieving data. Since most of these databases were developed “by programmers for programmers,” they typically provide low-level APIs supported in a language such as Java.

Riak

Riak is an open-source implementation of the Amazon Dynamo model. It implements a pure key-value system: objects in Riak are located through the object’s key and the object retrieved by the key is a binary object whose contents are opaque to the database engine.

Given the relatively simple interaction just described, we expect a fairly straightforward API, and that is what we get. Let’s look at some Java code that inserts a value into a Riak bucket:

1. RiakClient myClient = RiakClient.newClient(myServer);
2. // Create the key, value and set the bucket
3. String myKey = Long.toString(System.currentTimeMillis());
4. String myValue = myKey + ":" + Thread.getAllStackTraces().toString();
5. Location myLocation = new Location(new Namespace("MyBucket"), myKey);
6. StoreValue sv = new StoreValue.Builder(myValue).withLocation(myLocation).build();
7. StoreValue.Response svResponse = myClient.execute(sv);
8. System.out.println("response="+svResponse);

In lines 3 to 5, we define a key (set to the current time in milliseconds), a value (a string representation of the current stack trace), and the bucket (“MyBucket”) that will receive the value. Line 6 prepares a StoreValue object—this object contains the key-value pair and the bucket name that it will be associated with. We execute the StoreValue in line 7, effectively adding the data to the database.

The StoreValue object takes options that control behaviors such as quorums. In the example that follows, we specify that the write can complete as long as at least one node completes the write IO (see Chapter 9 for a discussion of quorums):

sv =  new StoreValue.Builder(Thread.getAllStackTraces()).
      withLocation(myLocation).
      withOption(StoreValue.Option.W,Quorum.oneQuorum()).build();

This example also utilizes one of the cool features of the Riak API: if we pass a Java object as our value, Riak will automatically convert it to a JSON document.

Here, we retrieve the data we just inserted. Note that we use the same Location object (myLocation) that we used for the insert:

FetchValue fv = new FetchValue.Builder(myLocation).build();
FetchValue.Response fvResp = myClient.execute(fv);
String myFetchedData = fvResp.getValue(String.class);
System.out.println("value=" + myFetchedData);

The value returned is a string containing a JSON representation of the stack trace object we inserted earlier:

value=
{"Thread[nioEventLoopGroup-2-8,10,main]":
   [{"methodName":"poll0","fileName":null,
     "lineNumber":-2,
     "className":"sun.nio.ch.WindowsSelectorImpl$SubSelector" ...

For some applications, this put/get programming model may be sufficient. However, there are significant nuances to Riak programming that arise from the Dynamo consistency model. You may remember from Chapters 8 and 9 that conflicting updates may result in siblings that need to be resolved by the application. Riak provides interfaces you can implement to automate the resolution of such conflicts.

Next is a simplistic conflict resolver class.  This resolver will be invoked when siblings of type String are detected. The class is passed a list of strings and returns the “resolved” result. Any application code could be implemented here, but in my example I’ve just returned the first string in the list.

public class MyResolver implements ConflictResolver<String> {
      public String resolve(List<String> siblings) {
            return(siblings.get(0) );
      }
}

The conflict resolver needs to be registered in order to become active:

ConflictResolverFactory factory = ConflictResolverFactory.getInstance();
factory.registerConflictResolver(String.class, new MyResolver());

Now, if we fetch a string value that includes siblings, the conflict resolver will be invoked and will resolve them. In our case, we used almost the simplest object type—string—and used trivial resolution logic: in a production application, the object types would likely be an application-defined complex class, and the resolution might involve highly complex logic to merge the two objects.

As well as resolving siblings, the Riak API allows modifications to complex data types to be applied incrementally. Application code need not provide a complete copy of the new object to be inserted; instead, it can specify only a change vector to be applied to an existing object. For instance, if an object were a list of friends, we could add a new friend simply by specifying the new friend’s name, without having to retrieve and reinsert all the existing friends.

By extending the UpdateValue.Update class, we can define how an update is applied to a Riak object. In the example that follows, the apply method specifies that the update will simply be appended (separated by a space) to the existing value of the Riak object.

public class MyRiakUpdater extends UpdateValue.Update<String> {
      private final String updatedString;
      public MyRiakUpdater(String updatedString) {
            this.updatedString = updatedString;
      }
      public String apply(String original) {
            return original + " " + updatedString;
      }
}

To invoke the updater, we create a new object instance from the updater class we created earlier and apply it to the original version using the withUpdate method.

MyRiakUpdater myUpdatedData = new MyRiakUpdater(newData);
UpdateValue myUpdate = new UpdateValue.Builder(myLocation)
      .withUpdate(myUpdatedData).build();
UpdateValue.Response updateResponse = myClient.execute(myUpdate);

This code will apply the update to the Riak object after first resolving siblings (providing we have registered an appropriate conflict resolver).

HBase

The HBase API bears some resemblance to the put/get pattern we saw in the Riak API. However, because HBase tables have a wide column structure, some additional complexity is introduced.

HBase supports a simple shell that allows access to basic data definition and manipulation commands. Here, we launch the HBase shell and issue commands to create a table ourfriends with two column families: info and friends:

$ hbase shell
hbase(main):003:0* create 'ourfriends', {NAME => 'info'}, {NAME => 'friends'}
0 row(s) in 1.3400 seconds

Each put command populates one cell within a row, so these four commands populate columns in the row identified by rowkey ‘guy’:

hbase(main):005:0* put 'ourfriends', 'guy','info:email','guy@gmail.com'
0 row(s) in 0.0900 seconds
hbase(main):006:0> put 'ourfriends', 'guy','info:userid','9990'
0 row(s) in 0.0070 seconds
hbase(main):007:0> put 'ourfriends', 'guy','friends:Jo','jo@gmail.com'
0 row(s) in 0.0050 seconds
hbase(main):008:0> put 'ourfriends', 'guy','friends:John','john@gmail.com'
0 row(s) in 0.0040 seconds

The get command pulls the values for the specific rowkey, allowing us to see the cell values we just input:

hbase(main):018:0* get 'ourfriends','guy'
COLUMN                        CELL
 friends:Jo                   timestamp=1444123707299, value=jo@gmail.com
 friends:John                 timestamp=1444123707324, value=john@gmail.com
 info:email                   timestamp=1444123707214, value=guy@gmail.com
 info:userid                  timestamp=1444123707274, value=9990
4 row(s) in 0.0390 seconds

Note that this is a wide column structure, where the column names in the column family friends represent the names of friends. This is the data structure that was illustrated in Chapter 10, Figure 10-5.

The shell is suitable for simple experimentation and data validation, but most real work in HBase is done within Java programs. Here, we see simple code to connect to an HBase server from a Java program:

Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", myServer);
config.set("hbase.zookeeper.property.clientport", "2181");
HBaseAdmin.checkHBaseAvailable(config);
Connection connection = ConnectionFactory.createConnection(config);

For column families with a fixed column structure, such as info in our example, we can retrieve the data fairly simply:

1. Table myTable = connection.getTable(TableName.valueOf("ourfriends"));
2. byte[] myRowKey = Bytes.toBytes("guy");
3. byte[] myColFamily=Bytes.toBytes("info");
4. byte[] myColName=Bytes.toBytes("email");

5. Get myGet =  new Get(myRowKey);
6. Result myResult = myTable.get(myGet);
7. byte[] myBytes = myResult.getValue(myColFamily, myColName);
8. String email = Bytes.toString(myBytes);
9. System.out.println("Email address="+email);

Lines 1 through 4 specify the table, rowkey, column family, and column name to be retrieved.  Lines 5 and 6 retrieve the row for the specified rowkey, and line 7 extracts a cell value for the specified column family and column name. Line 8 converts the value from a byte array to a string. This is one of the less endearing features of the HBase API: you are constantly required to convert data to and from HBase native byte arrays.

For dynamic column names, we need to retrieve the names of the columns for each row before we can retrieve the data. The getFamilyMap method returns a map structure from a result object identifying the column names within a specific row:

NavigableMap<byte[], byte[]>
      myFamilyMap = myResult.getFamilyMap(myColFamily);

We can then iterate through the column names and use the standard getValue call to retrieve them:

for (byte[] colNameBytes : myFamilyMap.keySet()) {
      // Get the name of the column in the column family
      String colName = Bytes.toString(colNameBytes);
      byte[] colValueBytes = myResult.getValue(myColFamily, colNameBytes);
      System.out.println("Column " + colName + "=" + Bytes.toString(colValueBytes));
}

The HBase put method is similar in form to the get method and allows us to place a new column value into a cell. Remember, as discussed in Chapter 10, HBase cells may contain multiple versions of data, identified by timestamp. So the put method really adds a new value rather than overwriting the existing value.

Here, we add a new friend:

myColFamily=Bytes.toBytes("friends");
byte [] myNewColName=Bytes.toBytes("paul");
byte [] myNewColValue=Bytes.toBytes("paul@gmail.com");
Put myPut=new Put(myRowKey);
myPut.addColumn(myColFamily,myNewColName,myNewColValue);
myTable.put(myPut);

The HBase API does not need to include methods for sibling resolution as does Riak, since HBase uses locking and strict consistency to prevent such siblings from ever being created.

The HBase API includes additional methods for scanning rowkey ranges, which can be useful when the rowkey includes some kind of natural partitioning value, such as a customer ID, or where the rowkey indicates some kind of temporal order. Also, you may recall from our discussion in Chapter 10 that HBase supports a coprocessor architecture allowing code to be invoked when certain data manipulation operations occur. This architecture provides a database trigger-like capability that allows us to perform validation on input values, maintain secondary indexes, or maintain derived values.
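
To give a flavor of the scan interface, here is a minimal sketch using the HBase 1.x client API; it reuses the myTable object and the Bytes utility class from the earlier examples, and the start and stop rowkeys shown are purely hypothetical:

Scan myScan = new Scan();
myScan.setStartRow(Bytes.toBytes("guy"));   // first rowkey to include (hypothetical)
myScan.setStopRow(Bytes.toBytes("guz"));    // scan stops before this rowkey
myScan.addFamily(Bytes.toBytes("info"));    // restrict the scan to one column family

ResultScanner myScanner = myTable.getScanner(myScan);
try {
        for (Result myRow : myScanner) {
                String rowkey = Bytes.toString(myRow.getRow());
                String email = Bytes.toString(myRow.getValue(Bytes.toBytes("info"),
                                Bytes.toBytes("email")));
                System.out.println(rowkey + " " + email);
        }
} finally {
        myScanner.close();
}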

MongoDB

The APIs we’ve seen in Riak and HBase are not overly complex when compared to popular development frameworks like Spring or J2EE. However, for the casual user—especially one not proficient in programming—they are virtually inaccessible. These APIs require that you explicitly navigate to the desired values and do not provide any direct support for complex queries.

MongoDB goes some of the way toward addressing these restrictions by providing a rich query language implemented in JavaScript, and exposed through the MongoDB shell.  Collection objects within MongoDB include a find() method that allows fairly complex queries to be constructed. Figure 11-1 provides a comparison of a MongoDB query with an equivalent SQL statement.


Figure 11-1. Comparison of MongoDB JavaScript query and SQL

Note  The example data for MongoDB and Couchbase is based on a port of the MySQL “Sakila” sample schema, which describes a database for a DVD rental business. DVD rental businesses are largely gone now, but the sample database provides a schema familiar to MySQL users. The converted database is available at http://bit.ly/1LwY9xl.

Quite complicated queries can be constructed using the find() method, but for queries that want to aggregate across documents, we need to use the aggregate() method.  The aggregate() method implements a logical equivalent of the SQL GROUP BY clause. Figure 11-2 compares a SQL GROUP BY statement with its MongoDB equivalent.


Figure 11-2. MongoDB aggregation framework compared with SQL

The JavaScript-based MongoDB shell language is not directly embeddable within other programming languages. Historically, MongoDB drivers for various languages implemented interfaces inspired by the JavaScript interface, but with significant divergence across languages. More recently, a cross-driver specification has been developed that improves consistency among the driver implementations for various programming languages.

Using Java as an example, here is some code that connects to a server and selects the “test” collection within the NGDBDemo database:

MongoClient mongoClient = new MongoClient(mongoServer);
MongoDatabase database = mongoClient.getDatabase("NGDBDemo");
MongoCollection<Document> collection = database.getCollection("test");

The following code creates and inserts a new document:

1.  Document people =  new Document();              // A document for a person
2.  people.put("Name", "Guy");
3.  people.put("Email", "[email protected]");
4.  BasicDBList friendList = new BasicDBList();    // List of friends

5.  BasicDBObject friendDoc = new BasicDBObject(); // A single friend

6.  friendDoc.put("Name", "Jo");
7.  friendDoc.put("Email", "[email protected]");
8.  friendList.add(friendDoc);                     // Add the friend
9.  friendDoc = new BasicDBObject();            // A second friend
10.  friendDoc.put("Name", "John");
11.  friendDoc.put("Email", "john@gmail.com");
12.  friendList.add(friendDoc);                    // Add another friend
13.  people.put("Friends", friendList);            // Attach the list of friends
14.  collection.insertOne(people);

Line 1 creates an empty document. Lines 2 and 3 add some values to the document. In lines 4 through 8, we create a List structure that represents an array of subdocuments and add the first friend document to that array. Lines 9 through 12 create a second friend document and add it to the list, and line 13 attaches the completed list of friends to the parent document. Line 14 inserts the new document (which includes two embedded friend documents) into MongoDB.

This programming pattern reflects the underlying structure of JSON documents as described in Chapter 10: JSON documents are composed of arbitrarily nested objects, values, and arrays. The MongoDB interface requires that we build these programmatically, although there are several utility classes available independently that allow Java objects to be converted to and from JSON documents.

Collection objects in the Java driver support a find() method that, although not syntactically identical with the JavaScript version, allows us to execute the same operations that we can perform in the JavaScript shell:

Document myDoc = collection.find(eq("Name", "Guy")).first();
System.out.println(myDoc.toJson());

The API provides scrollable cursors that allow us to navigate through a result set or an entire collection. This example iterates through all the documents in a collection:

MongoCursor<Document> cursor = collection.find().iterator();
try {
        while (cursor.hasNext()) {
                System.out.println(cursor.next().toJson());
        }
}
finally {
        cursor.close();
}

A more compact alternative fetch loop could be framed like this:

for (Document cur : collection.find()) {
        System.out.println(cur.toJson());
}
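
The aggregation framework illustrated in Figure 11-2 is also available through the Java driver. The following is a minimal sketch using the MongoDB 3.x Java API that counts the films in each category; it assumes the films collection from the Sakila-derived sample database, and that the Aggregates and Accumulators helper classes (from com.mongodb.client.model) and java.util.Arrays have been imported:

MongoCollection<Document> films = database.getCollection("films");

// Group the film documents by Category and count the documents in each group
AggregateIterable<Document> categoryCounts = films.aggregate(Arrays.asList(
        Aggregates.group("$Category", Accumulators.sum("count", 1))));

for (Document categoryCount : categoryCounts) {
        System.out.println(categoryCount.toJson());
}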

Cassandra Query Language (CQL)

Cassandra’s underlying data structures are based on Google’s BigTable model, which would lead us to expect an API syntactically similar to that of HBase. Indeed, the early Thrift-based Cassandra APIs were easily as complex as the HBase programming API—arguably even more so, since Cassandra had implemented a “SuperColumn” structure that extended the range of possible column family configurations, but which was hard to conceptualize and program against.

In version 0.8, the Cassandra team made a decisive shift from an API-centric interface to a language-based interface, inspired by SQL: the Cassandra Query Language (CQL). CQL uses familiar SQL idioms for data definition, manipulation, and query tasks, and is now the preferred method for interacting with Cassandra databases from query tools or within programs.

CQL provides Cassandra with an interactive ad hoc query capability through the cqlsh program. It also simplifies programming tasks by allowing for more succinct and comprehensible data manipulation code, which looks familiar to those who have coded in SQL-based interfaces such as JDBC.

But perhaps most significantly and most controversially, Cassandra CQL abstracts the underlying wide column BigTable-style data model in favor of a more relational-like tabular scheme. We discussed this in detail in Chapter 10 and won’t repeat that discussion here: see, in particular, Figure 10-7, which compares the Cassandra CQL representation of data with the underlying wide column structure.

The best—and maybe also the worst—thing about this CQL abstraction is that users can interact with Cassandra without understanding the nuances of the wide column data model. However, although you can write functional Cassandra CQL without understanding the underlying Cassandra data model, the best results will be attained if you do understand how the two relate.

Wide column structures in CQL are defined by using composite primary keys, where the first part of the key defines the partitioning (that is, the rowkey) and the second part of the key defines the clustering columns. Clustering column values become the dynamic column names in the wide column family.

The CQL statements that follow—executed in the cqlsh shell—define and populate a Cassandra table roughly equivalent to the HBase table we created earlier in this chapter:

cqlsh:guy> CREATE TABLE friends
       ...  (name text,
       ...   friend_name text,
       ...   friend_email text,
       ...   PRIMARY KEY (name,friend_name));

cqlsh:guy>   INSERT INTO friends (name,friend_name,friend_email)
                    VALUES('Guy','Jo','jo@gmail.com');
cqlsh:guy>   INSERT INTO friends (name,friend_name,friend_email)
                    VALUES('Guy','Chris','chris@gmail.com');
cqlsh:guy>   INSERT INTO friends (name,friend_name,friend_email)
                    VALUES('Guy','John','john@gmail.com');

Familiar SQL-like constructs allow us to perform updates and deletes, create indexes, or issue queries.

However, the CQL SELECT statement has limited capabilities when compared to standard SQL: in particular, joins and aggregate (GROUP BY) operations are not supported. Furthermore, WHERE clauses and ORDER BY clauses are severely restricted.  Ordering and range queries are limited to clustering columns within a specific partition key.

These limitations seem confusing if you think of CQL tables as relational structures. But if you remember that the first part of the key is actually a rowkey that is consistently hashed across the cluster, then the limitation seems more reasonable. Cassandra is unable to effectively perform a range scan across rowkey values that are hashed across the entire cluster. Nor is it possible to access the partition columns without accessing a specific row, since every row could have entirely distinct column values.

So this ORDER BY clause cannot be supported:

cqlsh:guy> SELECT * FROM friends ORDER BY name;
InvalidRequest: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."

But this is legal:

cqlsh:guy> SELECT * FROM friends WHERE name = 'Guy'
           ORDER BY friend_name;

 name | friend_name | friend_email
------+-------------+-----------------
  Guy |       Chris | chris@gmail.com
  Guy |          Jo |    jo@gmail.com
  Guy |        John |  john@gmail.com

A similar restriction prevents range operations on the partition key:

cqlsh:guy> SELECT * FROM friends WHERE name > 'Guy'  ;
InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"

But a range query on the clustering key is allowed, provided the partition key is also specified:

cqlsh:guy> SELECT * FROM friends
            WHERE name='Guy' AND friend_name > 'Guy';

 name | friend_name | friend_email
------+-------------+----------------
  Guy |          Jo |   jo@gmail.com
  Guy |        John | john@gmail.com

CQL is used within Java programs or other languages using a driver syntax that is similar to JDBC: CQL statements are passed as strings to methods that submit the CQL to the server and return result sets or return codes.

The following Java code connects to a Cassandra server and specifies a keyspace (lines 1-3), submits a CQL query (lines 5-6) and iterates through the results (lines 8-11):

 1.  String myServer=args[0];
 2.  Cluster cluster = Cluster.builder().addContactPoint(myServer).build();
 3.  Session myKeySpace = cluster.connect("guy");
 4.
 5.  String cqlString = "SELECT * FROM friends where name='Guy'";
 6.  ResultSet myResults = myKeySpace.execute(cqlString);
 7.
 8.  for (Row row : myResults.all()) {
 9.        System.out.println(row.getString(0) +" "+
10.  row.getString(1) + " " + row.getString(2));
11.  }

If we don’t know the structure of the result set in advance, then there is a metadata interface that allows us to extract column names and data types:

List<Definition> colDefs = myResults.getColumnDefinitions().asList();
System.out.println("Column count=" + colDefs.size());
System.out.println("Column Names:");
for (Definition colDef : colDefs) {
        System.out.println(colDef.getName());
}

MapReduce

The put and get methods provided by early NoSQL systems support only record-at-a-time processing and place a heavy programming burden on an application that needs to perform even simple analytics on the data. Google’s MapReduce algorithm—first published in 2004—provided a solution for parallelizing computation across a distributed system, and it has been widely adopted not just by systems inspired by the Google stack, such as Hadoop and HBase, but also by many early NoSQL systems, such as CouchDB and MongoDB.

The canonical example of MapReduce is provided by the WordCount (https://wiki.apache.org/hadoop/WordCount) program, which represents almost the simplest possible MapReduce example. We showed a diagrammatic representation of WordCount way back in Figure 2-4.

In the WordCount example, the map phase uses a tokenizer to break up the input into words, then assigns a value of 1 to each word:

public static class Map
                extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        context.write(word, one);
                }
        }
}

The reducer class takes these name:value pairs (where the value is always 1) and calculates the sum of counts for each word:

public static class Reduce
                extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
                        Context context)
                        throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                        sum += val.get();
                }
                context.write(key, new IntWritable(sum));
        }
}

The MapReduce job is invoked by mainline code that defines input and output types and files, specifies the map and reducer classes, and invokes the job:

Job job = Job.getInstance(conf, "wordcount");

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);

MapReduce coding in Java is somewhat cumbersome and involves a lot of boilerplate coding. Many alternative implementations are much simpler. For instance, here is the WordCount algorithm implemented in MongoDB’s JavaScript MapReduce framework:

db.films.mapReduce(
  /* Map    */ function() {emit (this.Category,1);},
  /* Reduce */ function(key,values) {return Array.sum(values)} ,

  { out: "MovieRatings"  }
)

This JavaScript reads from the films collection and counts the number of films in each category; the results are output to the collection “MovieRatings”:

> db.MovieRatings.find();
{ "_id" : "Action", "value" : 64 }
{ "_id" : "Animation", "value" : 66 }
{ "_id" : "Children", "value" : 60 }
{ "_id" : "Classics", "value" : 57 }
{ "_id" : "Comedy", "value" : 58 }

MapReduce is a flexible programming paradigm capable of being adapted to a wide range of data processing algorithms. However, it is rarely the most efficient algorithm for a given problem and is usually not the most programmer-efficient approach. Consequently, there have been many frameworks that provide alternative programming and processing paradigms.

Pig

It was realized early on that the full potential of Hadoop could not be unlocked if commonplace operations required highly skilled Java programmers with experience in complex MapReduce programming. As we will see later in this chapter, at Facebook the development of Hive—the original SQL on Hadoop—was an important step toward solving this problem. At Yahoo!, the Hadoop team felt that the SQL paradigm could not address a sufficiently broad category of MapReduce programming tasks. Yahoo! therefore set out to create a language that maximized productivity but still allowed for complex procedural data flows. The result was Pig.

Pig superficially resembles scripting languages such as Perl or Python in that it offers flexible syntax and dynamically typed variables. But Pig actually implements a fairly unique programming paradigm; it is best described as a data flow language. Pig statements typically represent data operations roughly analogous to individual operators in SQL—load, sort, join, group, aggregate, and so on. Typically, each Pig statement accepts one or more datasets as inputs and returns a single dataset as an output. For instance, a Pig statement might accept two datasets as inputs and return the joined set as an output.  Users can add their own operations through a Java-based user-defined function (UDF) facility.

For those familiar with SQL, programming in Pig turns the programming model upside down. SQL is a nonprocedural language: you specify the data you want rather than the sequence of operations to be executed. In contrast, Pig is explicitly procedural: the exact sequence of data operations is specified within your Pig code. For SQL gurus, Pig resembles the execution plan of a SQL statement more than the SQL statement itself.

SQL compilers and Hive’s HQL compiler include optimizers that attempt to determine the most efficient way to resolve a SQL request. Pig is not heavily reliant on such an optimizer, since the execution plan is explicit. As the Pig gurus are fond of saying “Pig uses the optimizer between your ears.”

Here is the ubiquitous word count implemented in Pig:

file = load 'some file';
b = foreach file generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into 'pig_wordcount';

Pig can be used to perform complex workflows and provide an ad hoc query capability similar to SQL. For instance, the example shown in Figure 11-3 performs joins, filters, and aggregations to provide a summary of customers in the Asian region. Figure 11-3 also includes a comparable SQL statement.


Figure 11-3. Pig compared to SQL

Although Pig is capable of expressing virtually any data query that can be expressed in SQL syntax, it is also capable of performing more complex data flows that would require multiple SQL statements chained together with procedural code in an RDBMS.

Nevertheless, while Pig is more flexible than SQL, it is not a Turing-complete programming language: it lacks the control structures required for a complete general-purpose programming solution. However, Pig can be embedded in Python and other languages.

Directed Acyclic Graphs

Hadoop 1.0 was based on the MapReduce pattern. Complex programs could link multiple MapReduce steps to achieve their end result.

It’s long been acknowledged that while MapReduce is a broadly applicable model that can support a wide range of job types, it is not the best model for all workloads. In particular, MapReduce has a very large startup cost, which means that even the simplest “Hello World” MapReduce job typically takes minutes rather than seconds; this alone makes MapReduce a poor choice for interactive workloads and low-latency operations.

Hadoop 2.0 introduced the YARN framework, which allows Hadoop to run workloads based on other processing patterns—of which MapReduce is just one.

The Apache Tez project (Tez is Hindi for “speed”) is one of a number of YARN-based initiatives that provide Hadoop with a processing framework supporting both the massive batch processing that characterizes traditional Hadoop workloads and low-latency operations that allow Hadoop to support a wider variety of solutions.

Tez is based on a flexible processing paradigm known as directed acyclic graph (DAG). This intimidating term actually describes a familiar processing model. Anyone who has examined a SQL execution plan will have encountered a DAG. These graphs describe how a complex request is decomposed into multiple operations that are executed in a specific order and that can arbitrarily feed into each other. MapReduce itself is a DAG, but the MapReduce paradigm severely limits the types of graphs that can be constructed. Furthermore, MapReduce requires that each step in the graph be executed by a distinct set of processes, while Tez allows multiple steps in the graph to be executed by a single process, potentially on a specific node of the Hadoop cluster.

Cascading

Cascading is a popular open-source Java framework that abstracts Hadoop MapReduce or YARN-based processing primitives. In some respects, it resembles Pig in that it works with high-level data flows and transformations and spares the programmer the labor involved in constructing low-level parallelization classes. However, unlike Pig, Cascading is integrated within the Java language and is capable of creating solutions that are more modular and sophisticated.

The Cascading programming model is based on sources, sinks, and pipes assembled in data flows. The programmer assembles these pipes to construct programs that are more sophisticated than MapReduce. These workflows represent the DAG discussed in the previous section.
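
To give a feel for the model, here is a rough sketch of the classic Cascading word-count flow using the Cascading 2.x Java API; the input and output paths are hypothetical, and imports and error handling are omitted:

// Source tap reads lines of text from HDFS; sink tap writes tab-delimited word counts
Tap docTap = new Hfs(new TextLine(new Fields("line")), "input/docs");
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), "output/wordcount", SinkMode.REPLACE);

// Pipe assembly: split each line into words, group by word, count each group
Pipe docPipe = new Each("wordcount", new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
Pipe wcPipe = new GroupBy(docPipe, new Fields("word"));
wcPipe = new Every(wcPipe, new Count());

// Bind sources, sinks, and pipes into a flow and run it on the Hadoop cluster
FlowDef flowDef = FlowDef.flowDef()
        .setName("wordcount")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);
new HadoopFlowConnector().connect(flowDef).complete();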

Spark

We looked at the origins and architecture of the Spark project in Chapter 7. Spark can be thought of as “memory-based Hadoop,” but in truth it offers more than just an in-memory speed boost. The Spark API operates at a higher level of abstraction than native YARN or MapReduce code, offering improvements in programmer productivity and execution speed.

Spark supports APIs in Python, Java, and other languages, but it is native to the Scala language, so our examples here will use the Scala API.

Here, we load text files (in this case, from HDFS) into Spark Resilient Distributed Datasets (RDDs):

val countries=sc.textFile("COUNTRIES")
val customers=sc.textFile("CUSTOMERS")

Spark RDDs are immutable: we can’t alter the contents of an RDD; rather, we perform operations that create new RDDs.

The HDFS inputs are CSV files, so in our initial RDDs each line in the input file is represented as a single string. Here’s the first element in the countries RDD:

scala> countries.first()
res9: String = "52790,United States of America,Americas"

In the next example, we use a map function to extract the key value from each of the CSV strings and create key-value pair RDDs.

val countryRegions=countries.map(x=>(x.split(",")(0),x.split(",")(2)))
val AsianCountries=countryRegions.filter(x=> x._2.contains("Asia") )
// Country codes and country names
val countryNames=countries.map(x=>(x.split(",")(0),x.split(",")(1)))

The first RDD, countryRegions, contains all country codes and their associated regions. The second (AsianCountries) uses the filter() method to create an RDD containing only Asian countries. The third (countryNames) creates an RDD with country names keyed by country ID. Here’s the first element in the countryNames RDD:

scala> countryNames.first()
res12: (String, String) = (52790,United States of America)

Aggregations can be created by performing map and reduce operations. The first line that follows uses map to emit a country ID and the numeral 1 for each customer. The second line invokes a reducer that emits an RDD containing the counts of customers in each country:

val custByCountry=customers.map(x=>(x.split(",")(3),1))
val custByCountryCount=custByCountry.reduceByKey((x,y)=> x+y)

In the next statement, we join the RDD containing customer counts by country ID with the list of Asian country IDs that we created earlier. Because we are using the default inner join method, this returns only customer counts for Asian regions.

val AsiaCustCount=AsianCountries.join(custByCountryCount)

The next join operation joins our RDD containing country names keyed by country code to the previous RDD. We now have an RDD that contains counts of customers by country name in the Asian region:

val AsiaCustCountryNames=AsiaCustCount.join(countryNames)

This Spark workflow is roughly equivalent to the Pig workflow shown in Figure 11-3. However, for massive datasets, the Spark job could be expected to complete in a fraction of the time, since every operation following the initial load from HDFS would complete in memory.

The Return of SQL

I don’t know about you, but as I created the examples in this chapter I was struck by the “Tower of Babel” impression created by the new generation of database languages. Say what you will about SQL, but for more than two decades database programmers have all been speaking the same language. As a result of the explosion of nonrelational systems, we’ve all been forced to speak different languages. And while some of these languages offer definite advantages for working with unstructured data or for parallelization of programming tasks, in many cases the level of abstraction has been reduced and the work for the programmer increased.

So it’s not surprising that within just a few years after the initial enthusiasm for NoSQL, we’ve seen SQL return to almost every new database niche. SQL just has too many advantages: it’s a high-level abstraction that simplifies data access and manipulation, it’s a language in which literally millions of database users are conversant, and there are hundreds of popular business intelligence and analytic tools that use it under the hood as the means for getting at data.

Hive

Hive is the original SQL on Hadoop. We discussed the origins and architecture of Hive in Chapter 2. From the very early days of Hadoop, Hive represented the most accessible face of Hadoop for many users.

Hive Query Language (HQL) is a SQL-based language that comes close to SQL-92 entry-level compliance, particularly within its SELECT statement. DML statements—such as INSERT, DELETE, and UPDATE—are supported in recent versions, though the real purpose of Hive is to provide query access to Hadoop data usually ingested via other means. Some SQL:2003 analytic window functions are also supported.

As discussed in Chapter 2, HQL is compiled to MapReduce or—in later releases—more sophisticated YARN-based DAG algorithms.

The following is a simple Hive query that performs the same analysis as our earlier Pig and Spark examples:

0: jdbc:Hive2://>   SELECT country_name, COUNT (cust_id)
0: jdbc:Hive2://>     FROM countries co JOIN customers cu
0: jdbc:Hive2://>      ON(cu.country_id=co.country_id)
0: jdbc:Hive2://>    WHERE region = 'Asia'
0: jdbc:Hive2://> GROUP BY country_name
0: jdbc:Hive2://>   HAVING COUNT (cust_id) > 500;

2015-10-10 11:38:55     Starting to launch local task to process map join;      maximum memory = 932184064
<<Bunch of Hadoop JobTracker output deleted>>
2015-10-10 11:39:05,928 Stage-2 map = 0%,  reduce = 0%
2015-10-10 11:39:12,246 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.28 sec
2015-10-10 11:39:20,582 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.4 sec
+---------------+------+--+
| country_name  | _c1  |
+---------------+------+--+
| China         | 712  |
| Japan         | 624  |
| Singapore     | 597  |
+---------------+------+--+
3 rows selected (29.014 seconds)

HQL statements look and operate like SQL statements. There are a few notable differences between HQL and commonly used standard SQL, however:

  • HQL supports a number of table-generating functions, which can be used to return multiple rows from an embedded field that may contain an array of values or a map of name:value pairs. The explode() function returns one row for each element in an array or map, while json_tuple() explodes an embedded JSON document.
  • Hive provides a SORT BY clause that requests output be sorted only within each reducer within the MapReduce pipeline.  Compared to ORDER BY, this avoids a large sort in the final reducer stage, but may not return results in sorted order.
  • DISTRIBUTE BY controls how mappers distribute output to reducers. Rather than distributing values to reducers based on hashing of key values, we can insist that each reducer receive contiguous ranges of a specific column. DISTRIBUTE BY can be used in conjunction with SORT BY to achieve an overall ordering of results without requiring an expensive final sort operation. CLUSTER BY combines the semantics of DISTRIBUTE BY and SORT BY operations that specify the same column list. A short example of DISTRIBUTE BY and SORT BY follows this list.
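
Here is a minimal sketch of a DISTRIBUTE BY/SORT BY query submitted from Java through the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver); the connection URL and credentials are assumptions, and the usual java.sql imports are required:

Connection hiveConn = DriverManager.getConnection(
        "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");
Statement hiveStmt = hiveConn.createStatement();

// All rows for a given country_id go to the same reducer (DISTRIBUTE BY),
// and each reducer sorts its own output (SORT BY)
ResultSet rs = hiveStmt.executeQuery(
        "SELECT country_id, cust_id FROM customers " +
        "DISTRIBUTE BY country_id SORT BY country_id, cust_id");

while (rs.next()) {
        System.out.println(rs.getString(1) + " " + rs.getString(2));
}
rs.close();
hiveStmt.close();
hiveConn.close();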

Hive can query data in HBase tables and data held in HDFS. Support for Spark is available, though still under active development.

Impala

It’s difficult to overstate the significance of Hive to the early adoption of Hadoop. Hive allowed non-Java programmers a familiar mechanism for accessing data held in Hadoop, and allowed third-party analytic solutions to more easily integrate with Hadoop.

However, the similarity between Hive and RDBMS SQL led inevitably to unrealistic expectations. RDBMS users had become used to SQL as a real-time query tool, whereas even the simplest Hive queries would typically take minutes to complete, since every query had to pay the overhead of initiating a MapReduce job. Furthermore, caching of data in memory typically reduces overall SQL response time by 90 percent or more, while Hive queries were entirely I/O bound. Initial releases of Hive also employed fairly primitive query optimization.

Disappointment with Hive performance led to a fairly intensive effort to improve Hive, and modern versions of Hive outperform the initial releases by several orders of magnitude.

Cloudera’s Impala project aims to provide low-latency ANSI-compliant SQL on Hadoop. The key difference between Impala’s approach and that of Hive is that while Hive programs are translated to native Hadoop processing—initially MapReduce, but today including Tez—Impala includes its own processing architecture that reads directly from native HDFS or HBase storage. Although Impala bypasses the Hadoop processing layer, it can access Hive and Pig metadata through the HCatalog interface, which means it knows the structure of Hive and Pig tables.

Impala architecture is heavily influenced by traditional massively parallel processing (MPP) data warehouse database architectures, which we discussed in Chapter 8. Impala deploys daemon processes—typically on each Hadoop data node—and distributes work to these daemons using algorithms similar to those used by RDBMS systems, such as Teradata or Oracle.  These daemon processes are always ready for action, so there is no initial latency involved in job creation as there is in Hive.

There’s no doubt that the Impala architecture offers a better performance experience for short-duration queries than Hive.  However, given improvements in recent releases of Hive, there are those who claim that Hive still offers superior performance at greater scale. Some of this debate is driven by marketing groups within commercial Hadoop companies, so careful evaluation of competing claims is warranted.

Spark SQL

As noted earlier, there is some support for Spark within recent releases of Hive. However, Spark includes its own SQL dialect, called—not surprisingly—Spark SQL.

You may recall from Chapter 7 that Spark SQL works with data frames rather than RDDs. Data frames can be thought of as a more tabular and schematized relative of the RDD. A data frame can be created from an RDD, or from a Hive table.

SQL-92 compliance is a work in progress within Spark SQL.
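
To give a flavor, here is a minimal sketch using the Spark 1.x Java API (the Scala equivalent is almost identical); it assumes an existing JavaSparkContext (sc), analogous to the context used in the earlier Scala examples, and a hypothetical customers.json input file:

// Wrap the existing Spark context in a SQLContext
SQLContext sqlContext = new SQLContext(sc);

// Create a data frame from a JSON file; the schema is inferred automatically
DataFrame customerDF = sqlContext.read().json("customers.json");

// Register the data frame as a temporary table so it can be queried in SQL
customerDF.registerTempTable("customers");

DataFrame customerCounts = sqlContext.sql(
        "SELECT country_id, COUNT(*) AS cust_count " +
        "  FROM customers GROUP BY country_id");
customerCounts.show();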

Couchbase N1QL

So far we’ve seen SQL used within the context of analytic systems such as Hadoop and Spark. Operational databases of the NoSQL variety have not been so quick to implement SQL; however, in 2015, Couchbase announced Non-first Normal Form Query Language (N1QL), pronounced “Nickel,” a virtually complete SQL language implementation for use with document databases, implemented within Couchbase Server 4.0.

For example, consider the sample data shown in Figure 11-4 (this is the same sample data we used for MongoDB earlier).


Figure 11-4. Couchbase sample document

N1QL allows us to perform basic queries to retrieve selected documents or attributes of selected documents:

cbq> SELECT `Title` FROM films WHERE _id=200;
{
    "requestID": "0d5cff15-f5e7-434d-9dc4-d950ef5e21f8",
    "signature": {
        "Title": "json"
    },
    "results": [
        {
            "Title": "CURTAIN VIDEOTAPE"
        }
    ],
    "status": "success",

N1QL allows us to access nested documents within the JSON structure using array notation. So, for instance, in the example that follows, Actors[0] refers to the first nested document within the actors array:

cbq>  SELECT Actors[0].`First name` , Actors[0].`Last name`
   >   FROM films where _id=200;
{
    "requestID": "5aa27ec1-ce4d-4452-a137-2239b88e47fe",
    "results": [
        {
            "First name": "JOE",
            "Last name": "SWANK"
        }
    ],
    "status": "success",

We can query for subdocuments that match search criteria using the WHERE ANY syntax:

cbq> SELECT `Title` FROM films
   >  WHERE ANY Actor IN films.Actors SATISFIES
   >   ( Actor.`First name`="JOE" AND Actor.`Last name`="SWANK" )END;
{
    "requestID": "f3d6dd05-912d-437b-984f-214770f87076",
     "results": [
        {
            "Title": "CHOCOLAT HARRY"
        },
        {
            "Title": "CHOCOLATE DUCK"
        },
... ...

The UNNEST command allows embedded documents to be “joined” back up to the parent document. So here we get one result for each actor who starred in film 200, with the film title included in the results:

cbq> SELECT f.`Title` ,a.`First name` ,a.`Last name`
   >   FROM films f
   >  UNNEST f.Actors a
   >  WHERE f._id=200;
{
    "requestID": "f8227647-3506-4bfd-a538-3f8a0d038198",

    "results": [
        {
            "First name": "JOE",
            "Last name": "SWANK",
            "Title": "CURTAIN VIDEOTAPE"
        },
        {
            "First name": "WALTER",
            "Last name": "TORN",
            "Title": "CURTAIN VIDEOTAPE"
        },
... ...
   ],
    "status": "success",
}

The UNNEST command allows us to perform the equivalent of joins between parent and child documents when the child documents are nested within the parent. N1QL also allows us to join between independent documents, providing that one of the documents contains a reference to the primary key in the other.

So, for instance, if we had a bucket of documents that contained the primary keys of “overdue” films in our imaginary (and by now definitely struggling) DVD store, then we can join that to the films collection to return just those films, using the ON KEYS join syntax:

cbq> SELECT  f.`Title` FROM overdues
   >   JOIN films f ON KEYS overdues.filmId ;
{
    "requestID": "6f0f505e-72f6-404d-9e20-953850dc9524",
    "results": [
        {
            "Title": "CURTAIN VIDEOTAPE"
        },
        {
            "Title": "HARPER DYING"
        }
    ],
    "status": "success",

N1QL also includes DML statements allowing us to manipulate the contents of documents and DDL statements allowing creation and modification of indexes.

N1QL is an ambitious attempt to bring SQL into the world of document databases. It’s interesting to consider that at the same time as companies like Couchbase are introducing SQL support into their databases, companies like Oracle are introducing strong JSON support into their SQL-based databases. It would seem that the two worlds are coming together.

Apache Drill

So far we have looked at SQL variants that are tightly coupled with their underlying technology. It’s true that technologies such as Hive can access data in HBase and Spark, as well as HDFS, but this speaks more to the integration of HDFS, HBase, and Spark than it does to some inherent heterogeneity of the Hive system.

The Apache Drill framework aims to provide a SQL engine that can operate across multiple distributed data stores such as HDFS or Amazon S3, as well as NoSQL systems such as MongoDB and HBase. Drill’s architecture is based on Google’s Dremel system, which provides the foundation for the Google BigQuery product.

Drill incorporates a distributed heterogeneous cost-based optimizer that can intelligently distribute data-access algorithms across multiple, disparate systems. This allows a SQL query to span Hadoop, MongoDB, Oracle, or other databases and—at least in theory—to do so in an efficient and optimal manner.

Currently, Drill can query data from relational systems that have a JDBC or ODBC connector, from systems that are supported by Hive, from a variety of cloud-based distributed file systems (Amazon S3, Google Cloud Storage), and from MongoDB.

Let’s look at the MongoDB support, since it allows us to see how Drill deals with nontabular data. Here, we use Drill to query our sample MongoDB collections.

Simple queries are, of course, simple:

0: jdbc:drill:zk=local> SELECT Title FROM films WHERE Rating='G' LIMIT 5;
+--------------------+
|       Title        |
+--------------------+
| ACE GOLDFINGER     |
| AFFAIR PREJUDICE   |
| AFRICAN EGG        |
| ALAMO VIDEOTAPE    |
| AMISTAD MIDSUMMER  |
+--------------------+
5 rows selected (1.365 seconds)
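
As the shell prompt suggests, Drill also provides a JDBC driver (org.apache.drill.jdbc.Driver), so the same query can be submitted from a Java program. Here is a minimal sketch; it assumes the same local ZooKeeper connection and default workspace as the shell session above, and the usual java.sql imports:

Connection drillConn = DriverManager.getConnection("jdbc:drill:zk=local");
Statement drillStmt = drillConn.createStatement();

ResultSet rs = drillStmt.executeQuery(
        "SELECT Title FROM films WHERE Rating = 'G' LIMIT 5");

while (rs.next()) {
        System.out.println(rs.getString("Title"));
}
rs.close();
drillStmt.close();
drillConn.close();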

We can drill into subdocuments using a notation that is similar to the N1QL array notation. So, here we retrieve data from the third document in the embedded Actors array for the film West Lion, using the array notation Actors[2]:

0: jdbc:drill:zk=local> SELECT Actors[2].`First name`, Actors[2].`Last name`
. . . . . . . . . . . >   FROM films WHERE Title='WEST LION';
+---------+-----------+
| EXPR$0  |  EXPR$1   |
+---------+-----------+
| SEAN    | WILLIAMS  |
+---------+-----------+

The FLATTEN function returns one row for every document in an embedded array. It’s somewhat similar to the Hive EXPLODE function or the N1QL UNNEST clause. Note that each document is returned in JSON format; there doesn’t seem to be a way currently to schematize these results:

0: jdbc:drill:zk=local> SELECT Title, FLATTEN(Actors)
. . . . . . . . . . . >   FROM films WHERE Rating='G' LIMIT 5;
+-----------------+---------------------------------------------------------
|       Title     |                            EXPR$1
+-----------------+---------------------------------------------------------
| ACE GOLDFINGER  | {"First name":"BOB","Last name":"FAWCETT","actorId":19}
| ACE GOLDFINGER  | {"First name":"MINNIE","Last name":"ZELLWEGER",
| ACE GOLDFINGER  | {"First name":"SEAN","Last name":"GUINESS", "actorId":9
| ACE GOLDFINGER  | {"First name":"CHRIS","Last name":"DEPP","actorId":160}
| AFFAIR PREJUDICE| {"First name":"JODIE","Last name":"DEGENERES",
+-----------------+---------------------------------------------------------
5 rows selected (0.589 seconds)

We can see that Drill has a basic ability to navigate complex JSON documents, and we can expect this capability to improve over time.

Drill can also navigate wide column store structures in HBase. Let’s look at the data that we inserted into HBase earlier in this chapter, this time using Drill:

0: jdbc:drill:zk=local>  SELECT * FROM friends;
+---------+---------+------+
| row_key | friends | info |
+---------+---------+------+
| [B@6fe2da0c | {"Jo":"am9AZ21haWwuY29t","John":"am9obkBnbWFpbC5jb20="} | {"email":"Z3V5QGdtYWlsLmNvbQ==","userid":"OTk5MA=="} |
| [B@43be5d62 | {"John":"am9obkBnbWFpbC5jb20=","Guy":"Z3V5QGdtYWlsLmNvbQ==","Paul":"cGF1bEBnbWFpbC5jb20=","Ringo":"cmluZ29AZ21haWwuY29t"} | {"email":"am9AZ21haWwuY29t","userid":"OTk5MQ=="} |
+---------+---------+------+
2 rows selected (1.532 seconds)

Not very friendly output! Initially, Drill returns HBase data without decoding the internal byte array structure and without flattening any of the maps that define our wide column family.

However, we can use the KVGEN function to convert the map in our wide column family “friends” into a list of key-value pairs, the FLATTEN function to extract one row for each entry in that list, and the CONVERT_FROM function to cast the byte arrays into Unicode strings:

0: jdbc:drill:zk=local>
WITH friend_details AS
    (SELECT info, FLATTEN(KVGEN(friends)) AS friend_info FROM friends)
  SELECT CONVERT_FROM(friend_details.info.email,'UTF8') AS email,
         CONVERT_FROM(friend_details.friend_info.`value`,'UTF8')
              AS friend_email
    FROM friend_details;

+----------------+------------------+
|     email      |   friend_email   |
+----------------+------------------+
| guy@gmail.com  | jo@gmail.com     |
| guy@gmail.com  | john@gmail.com   |
| jo@gmail.com   | john@gmail.com   |
| jo@gmail.com   | guy@gmail.com    |
| jo@gmail.com   | paul@gmail.com   |
| jo@gmail.com   | ringo@gmail.com  |
+----------------+------------------+

Drill shows an enormous amount of promise. A single SQL framework capable of navigating the variety of data structures presented by relational and nonrelational systems is just what we need to resolve the Tower of Babel problem presented by the vast array of languages and interfaces characterizing the next generation databases of today and the future.

Other SQL on NoSQL

There are a number of other notable SQL-on-NoSQL systems:

  • Presto is an open-source SQL engine similar in many respects to Drill that can query data in JDBC, Cassandra, Hive, Kafka, and other systems.
  • Many relational database vendors provide connectors that allow their SQL language to retrieve data from Hadoop or other systems. Examples include Oracle Big Data SQL, IBM Big SQL, and Teradata QueryGrid.
  • Apache Phoenix provides a SQL layer for HBase.
  • Dell Toad Data Point provides SQL access to a variety of nonrelational systems, including MongoDB, Cassandra, HBase, and DynamoDB. (Disclaimer: I lead the team at Dell that develops Toad.)

Conclusion

This chapter tried to give you a feel for the languages and APIs provided by next-generation database systems. In many cases, low-level programming APIs are all that is provided. However, the trend for the future seems clear: SQL is reasserting itself as the lingua franca of the database world. Virtually all new databases are becoming accessible by SQL, and many systems are adopting SQL-like interfaces even for low-level programming.

It’s unlikely that SQL will again become the sole interface to databases of the future: the unique requirements of wide column systems and document databases suggest that non-SQL idioms will be required, although Cassandra CQL and Couchbase N1QL do show how a SQL-like language remains a useful abstraction for dealing with data that might not be in relational format. Nevertheless, it seems increasingly likely that most next-generation databases will eventually support some form of SQL access, even if only through an independent layer such as Drill.

Note

  1. Code examples can be found at https://github.com/gharriso/NextGenDBSamples.