Chapter 6. Connecting Drill to Data Sources

In previous chapters you learned how to query individual files, but Apache Drill’s real power is unleashed when you connect Drill to multiple data sources. You have already seen how Drill can natively query data in a file-based system, but it also can natively query the following data sources:

  • Cloud storage (Amazon S3, Microsoft Azure, and Google Cloud Storage)

  • Hadoop

  • HBase

  • Hive

  • Kafka

  • Kudu

  • MapR

  • MongoDB

  • Open Time Series Database (OpenTSDB)

Additionally, Drill can query any system that provides a JDBC driver. In this chapter, you’ll learn how to configure Drill to access and query all these different data sources. Drill accesses storage via a system of extensions known as storage plug-ins that require activation and configuration in order to query an external data source. This chapter assumes that you have a basic familiarity with the various data sources mentioned.

Querying Multiple Data Sources

Up to this point, you have seen only queries that use the dfs, or distributed filesystem, storage plug-in.1 To query a different data source, you must configure a storage plug-in for that data source and then include it in the query. If you recall from Chapter 3, the FROM clause in a Drill query is structured as follows:

FROM storage_plugin.workspace.table

As an example, let’s say you want to query a Hive cluster and you have created and configured the storage plug-in and given it the name hive. You could query that data by using the following (note that not all storage plug-ins have workspaces):

SELECT field
FROM hive.`table`

Additionally, you can join this data with other data sources simply by including it in a JOIN statement. As an example, if you had a Drill storage plug-in called hdfs connected to a Hadoop cluster,  and a JDBC storage plug-in called mysql connected to a MySQL database, you could combine these data sources as demonstrated in the following query:

SELECT *
FROM hdfs.`data1` AS d1
INNER JOIN mysql.`database`.`table` AS m1
    ON m1.id = d1.id

As you can see, it is quite simple to query multiple data sources.

Configuring a New Storage Plug-in

To configure or add a new storage plug-in, in the Web Console, click the Storage tab at the top of the screen. From there, you will see the storage plug-ins that are available. By default, Drill ships with the cp and dfs storage plug-ins enabled and several others disabled. You can enable a storage plug-in simply by clicking the Enable button adjacent to the plug-in name, as shown in Figure 6-1. You can also update the storage plug-in’s configuration by clicking the Update button (also shown in Figure 6-1). Finally, if you would like to add a new storage plug-in, you can do so in the New Storage Plugin section at the bottom of the page.

Figure 6-1. The storage configuration page

Connecting Drill to a Relational Database

Let’s begin with the easiest data source to connect: a JDBC-compliant relational database management system (RDBMS). It might seem silly to connect Drill to a relational database; however, doing so allows you to join this data with other systems or files through standard SQL.

Before you set up the storage plug-in, you will need to download the database’s JDBC driver and copy it to your Drill installation’s jars/3rdparty directory, found at $DRILL_HOME/jars/3rdparty. If you are running Drill in distributed mode, you will need to copy the driver to the same path on every Drill node and then restart Drill.

In general, Drill will work with most conventional SQL databases that have a JDBC driver. The general format for the configuration is shown in the following snippet:

{
  "type": "jdbc",
  "driver": "JDBC Driver Artifact ID",
  "url": "jdbc connection string",
  "username": "username",
  "password": "password",
  "enabled": true
}

Be Cautious of Shared Credentials

At the time of writing, any credentials placed in storage plug-in configurations are visible to Drill administrators and should be considered shared passwords.

Drill has been tested with Microsoft SQL Server, MySQL/MariaDB, Oracle, PostgreSQL, and SQLite. The drivers for these databases are available from the respective vendors’ websites.

Configuring Drill to query an RDBMS

As indicated in the previous section, to connect Drill to a MySQL database, you first need to copy the JDBC driver to the jars/3rdparty folder, or for a production Drill, your site directory (see Chapter 9).

Next, start Drill and navigate to the Storage tab in the user interface. At the bottom of the page, go to the section labeled New Storage Plugin. Type a name for the storage plug-in, and then click the Create button. This name is how the storage plug-in will be accessed in queries. It should go without saying that the RDBMS must be up and running to successfully query it from Drill.

In the configuration panel, enter one of the configurations listed in the following subsections.

Microsoft SQL Server

To query a Microsoft SQL Server, use the following configuration settings:

{
  "type": "jdbc",
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url": "jdbc:sqlserver://host:1433;databaseName=database_name",
  "username": "username",
  "password": "password",
  "enabled": true
}

MySQL

To query a MySQL database through Drill, use the following:

{
  "type": "jdbc",
  "driver": "com.mysql.jdbc.Driver",
  "url": "jdbc:mysql://localhost:3306",
  "username": "username",
  "password": "password",
  "enabled": true
} 

Oracle

To query an Oracle database through Drill, you need to specify the database name in the configuration, as shown here:

{
  "type": "jdbc",
  "driver": "oracle.jdbc.OracleDriver",
  "url": "jdbc:oracle:thin:username/password@host:1521/database",
  "enabled": true
}

PostgreSQL

Here’s how to specify the database name to query a PostgreSQL database:

{
  "type": "jdbc",
  "driver": "org.postgresql.Driver",
  "url": "jdbc:postgresql://host/database",
  "username": "username",
  "password": "password",
  "enabled": true
}

SQLite

SQLite databases require you to enter the table name after the database alias when you query data:

{
  "type": "jdbc",
  "driver": "org.sqlite.JDBC",
  "url": "jdbc:sqlite:path_to_database_file", 
  "enabled": true
}

Querying an RDBMS from Drill

After you have provided the configuration string, click Enable; if all went well, you should get a success message, and you can now use the database in a Drill query. Structure the FROM clause as shown here:

FROM storage_plugin.database.table

Thus, a completed query would look like this:

SELECT *
FROM sqlite.`genres`;

This query returns all of the results from a table called genres in the Chinook SQLite database configured in the sqlite storage plug-in.

Other uses of the Drill JDBC storage plug-in

In addition to being able to query traditional databases that support JDBC, you can also access other data sources that have JDBC drivers. One example of this is data.world, an online platform that hosts open datasets for public use. As it turns out, a JDBC driver is available at the data.world GitHub repository.

After you have downloaded and installed the data.world JDBC driver, you can query data stored there as if it were hosted locally. To configure Drill to do this, open Drill’s web interface and, at the top of the screen, click the Storage tab. In the New Storage Plugin section, click Create. In the blank window that opens, enter the following configuration:

{
 "type": "jdbc",
 "driver": "world.data.jdbc.Driver",
 "url": "jdbc:data:world:sql:username:dataset",
 "username": "username",
 "password": "API Key",
 "enabled": true
}

You will need to get an API key from data.world, which is available after you have an account. Save this plug-in as dw. After you’ve set up the storage plug-in, all you really need to do to query a data.world dataset is modify the FROM clause of your query.

The FROM clause for data.world is a little different from a regular Drill FROM clause in that it has four parts, whereas a traditional Drill data source has only three:

FROM dw.username.dataset.filename

We uploaded a spreadsheet to data.world that contained a listing of MAC addresses and the manufacturers of the associated devices. To query this data, you would enter a query like the following:

SELECT *
FROM dw.cgivre.`mac-address-manufacturers`.
`20170426mac_address.csv/20170426mac_address`

Querying Data in Hadoop from Drill

One of the primary use cases for Apache Drill is querying data in a Hadoop cluster, and it is quite simple to configure Drill to query data stored in a distributed filesystem. You can query HDFS data from your laptop if the data is small, but for large datasets you typically run a Drillbit on each HDFS node, as described in Chapter 9. Or, in specialized cases, you can run Drillbits on the same racks as your HDFS nodes, though this is an advanced configuration.

With that said, it is quite simple to connect Drill to query data stored in a Hadoop cluster. Simply copy the configuration settings from the dfs plug-in and change the connection as shown here:

"connection":"hdfs://name_node_host_name:port"

Authentication from Drill

When you connect to a distributed filesystem, Drill uses user impersonation, as explained in “Security”. Therefore, the account that you are using to query Drill must have permission in HDFS to access the data you are trying to query.

Connecting to and Querying HBase from Drill

Apache HBase is a popular open source, nonrelational database modeled after Google’s Bigtable. HBase has a few spinoff projects, such as OpenTSDB and Apache Accumulo.

Drill includes a default HBase storage plug-in configuration. You will need to specify the hosts of your ZooKeeper quorum as a comma-separated list in the hbase.zookeeper.quorum property. After you’ve done that, enable the plug-in:

{
   "type": "hbase",
   "config": {
      "hbase.zookeeper.quorum": "hosts",
      "hbase.zookeeper.property.clientPort": "2181"
    },
    "size.calculator.enabled": false,
    "enabled": true
 }

When you’ve done that, you are ready to query data in HBase.

Querying data from HBase

Querying data in HBase is significantly more complicated than querying other data sources because you need to understand how data is stored in HBase. HBase stores data as bytes or byte arrays, and to query this data efficiently you need to understand the underlying data structure. From Drill, you can access individual columns by using the following pattern:

table_name.column_family_name.column_name

HBase stores all columns as byte arrays, which Drill represents as VARBINARY columns. To work with an HBase column, you first convert the VARBINARY column into another data type, using either the CAST() function for numeric fields or CONVERT_FROM() for all other fields. The following example queries a fictitious table in HBase and converts the results into readable values:

SELECT 
    CONVERT_FROM(tbl.sales.customerid, 'UTF8') AS customerID,
    CAST(tbl.sales.order_amount AS FLOAT) AS order_amount
FROM hbase.table

Little Endian Versus Big Endian

At the byte level, data is represented in two ways, big endian or little endian, which refers to the order in which the bytes of a number are stored in memory. As an example, the number 123 would be stored as “1-2-3” in big-endian format, and “3-2-1” in little-endian format. Normally, you don’t really need to know this, but HBase stores information in both big- and little-endian formats and you can take advantage of performance enhancements if you are aware of the format of the data.

In production systems, HBase will use timestamps or dates as row keys that are encoded as big-endian. By default, however, Drill assumes that the data is unsorted little-endian. When you use the CONVERT_TO() and CONVERT_FROM() functions on big-endian data, Drill will optimize the scanning process. Drill also provides several data types for big-endian conversions that all end in _BE, which are intended to be used with HBase. You can find a complete list in the Drill documentation.

The BYTE_SUBSTR() function is also very useful when querying HBase; it extracts the components of a composite HBase row key. The following example shows a performant query in HBase:

SELECT 
    CONVERT_FROM(BYTE_SUBSTR(row_key, 1, 8), 'BIGINT_BE') AS row_key
FROM hbase.table

When the data is encoded in sorted byte arrays, Drill can take advantage of this to improve performance, but doing so requires you to convert your data using an _OB data type. The following example demonstrates how to improve the performance of an HBase query by limiting the data scanned to only those rows with keys between 0 and 100:

SELECT 
    CONVERT_FROM(row_key, 'INT_OB') AS row_key
FROM hbase.table
WHERE CONVERT_FROM(row_key, 'INT_OB') >= CAST(0 AS INT) 
    AND CONVERT_FROM(row_key ,'INT_OB') < 100

Querying Hive Data from Drill

Apache Hive, originally developed at Facebook, was one of the first SQL-on-Hadoop packages; it gives the user a SQL-like interface known as HiveQL to data stored in HDFS or Amazon S3. Hive is a schema-on-read system: you define a table schema in advance, and Hive applies it to the underlying files when the data is read. The schema information is stored in a database known as a metastore, which can be either on a remote system or embedded.

The Hive Metastore

One important thing to note about Hive is that it maintains a metastore—usually using Apache Derby or MySQL—which stores metadata about all the tables that users have queried. Unlike Drill queries, HiveQL queries generate MapReduce jobs; thus, if you have data that is accessed using Hive, Drill can take advantage of Hive’s metastore and query the data much faster than Hive can.

For more information about Hive, we recommend Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen (O’Reilly).

Connecting Drill to Hive

To connect Drill to Hive, you must first determine whether the Hive cluster is using the embedded or a remote metastore. Normally, in a production environment, Hive uses a remote metastore; if it is not, you might want to talk with your systems administrator.

If you are using Hive with an embedded metastore, on the storage configuration page of the user interface, click the Update button adjacent to the Hive storage plug-in configuration. The default configuration for Hive is as follows:

  {
    "type": "hive",
    "configProps": {
      "hive.metastore.uris": "",
      "javax.jdo.option.ConnectionURL": 
          "jdbc:database://host:port/metastore database",
      "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
      "fs.default.name": "file:///",
      "hive.metastore.sasl.enabled": "false"
    },
    "enabled": true,
  }

Make Sure the Metastore Is Running

Before attempting to enable the Hive storage plug-in, make sure your Hive metastore is up and running by executing the following command:

hive --service metastore

For Drill to connect to Hive with an embedded metastore, you must first set the fs.default.name to the default location of the files that you want to query. This can be either the local filesystem (file:///), an HDFS root path on the default HDFS namenode (hdfs://), or the path to a specific HDFS namenode (hdfs://host:port).

The configuration option javax.jdo.option.ConnectionURL requires a JDBC connection string to the metastore database. As with an RDBMS, you need to copy the driver to the $DRILL_HOME/jars/3rdparty folder for this to work.
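For example, with a MySQL-backed metastore, the relevant configProps entries might look like the following sketch; the host, database name, and credentials are placeholders for your environment:

```json
"javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore_host:3306/metastore",
"javax.jdo.option.ConnectionUserName": "hive_user",
"javax.jdo.option.ConnectionPassword": "hive_password"
```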

After you have done all that, click Enable, and then you are ready to query Hive.

Connecting to Hive with a remote metastore

If you are working with a production Hive installation, you are more likely to have a remote metastore, and if you would like to use Drill to query this data you will need to configure it to connect to the remote metastore. As with the embedded metastore, you must first verify that the metastore is running before you attempt to connect Drill. You can verify that the metastore is running by using the following command:

hive --service metastore

Next, navigate to the Drill storage configuration and click Update next to the Hive storage plug-in; you will see a configuration like the following:

{
    "type": "hive",
    "configProps": {
       "hive.metastore.uris": "thrift://host:port",
       "hive.metastore.sasl.enabled": "false",
       "fs.default.name": "hdfs://hdfs_host/",
    },
    "enabled": true
  }

In the configProps section, set the hive.metastore.uris property to the thrift URI and port for your metastore. Next, set the fs.default.name to the default location of the files that you want to query. Note that if you include a hostname and port, the URI must point to a main control node of a Hadoop cluster.

If your installation uses the HBaseStorageHandler, you will need to add the following two additional properties to the configuration to set the location of the ZooKeeper quorum hosts and ports:

"hbase.zookeeper.quorum": "zkhost1,zkhost2",
"hbase.zookeeper.property.clientPort":"2181"

At this point, you should be able to enable the Hive storage plug-in and query Hive directly from Drill.

Improving Performance of Parquet-Backed Tables

If you are querying Parquet files from Drill via Hive, you should set the store.hive.optimize_scan_with_native_readers option to true to enable Drill to use its native readers rather than those shipped with Hive. This option will be removed in Drill 1.15 and replaced with the following variable:

store.hive.parquet.optimize_scan_with_native_reader
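You can set either option from a Drill session with standard option syntax; for example, on versions prior to 1.15:

```sql
ALTER SYSTEM SET `store.hive.optimize_scan_with_native_readers` = true;
```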

Connecting to and Querying Streaming Data with Drill and Kafka

Apache Kafka is a popular, scalable real-time data and stream processor. According to Wikipedia, Kafka “aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a massively scalable pub/sub message queue architected as a distributed transaction log, making it highly valuable for enterprise infrastructures to process streaming data.” If you are interested in learning more about Apache Kafka, we recommend Kafka: The Definitive Guide by Neha Narkhede, Gwen Shapira, and Todd Palino (O’Reilly).

To query Kafka from Drill, you must be running at least Kafka version 0.10 and Drill version 1.12 or greater. At the time of writing, the Kafka storage plug-in supports only JSON messages from Kafka.

JSON Configuration for the Kafka Plug-in

When querying data from Kafka, you might encounter some of the same issues as when querying JSON data in the filesystem. To set the JSON configurations for Kafka, you can use the following variables:

  • store.kafka.all_text_mode

  • store.kafka.read_numbers_as_double
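
You can set these options at the system or session level with standard Drill option syntax; for example:

```sql
ALTER SESSION SET `store.kafka.all_text_mode` = true;
ALTER SESSION SET `store.kafka.read_numbers_as_double` = false;
```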

As with the other storage plug-ins, to enable the Kafka storage plug-in, go to the Storage tab. There, you will see a disabled storage plug-in with the configuration shown here:

{ 
   "type": "kafka", 
   "kafkaConsumerProps": 
   { 
      "key.deserializer": 
          "org.apache.kafka.common.serialization.ByteArrayDeserializer",
      "auto.offset.reset": "earliest", 
      "bootstrap.servers": "kafka_host:9092", 
      "group.id": "drill-query-consumer-1", 
      "enable.auto.commit": "true", 
      "value.deserializer": 
          "org.apache.kafka.common.serialization.ByteArrayDeserializer",
      "session.timeout.ms": "30000" 
    }, 
   "enabled": true 
} 

How Drill Interprets Data from Kafka

When Drill queries Kafka, the storage plug-in maps Kafka topics to SQL tables. When you connect Drill to Kafka, it discovers all the available topics and maps them to tables with the same name. The rows of a topic are known in Kafka terminology as messages, and each message is mapped to a SQL row. Each message contains fields that are mapped to columns in Drill. Additionally, Drill generates the following metadata columns:

kafkaTopic

The topic name from Kafka

kafkaPartitionId

The partition ID from Kafka

kafkaMsgOffset

The message offset from Kafka

kafkaMsgTimestamp

The timestamp when Kafka received the message

kafkaMsgKey (if not null)

The message key or ID if not null

 

Querying streaming data

After you have registered the Kafka storage plug-in, you can execute queries against data in Kafka as if it were any other data source. It is important to remember that Kafka contains streaming data, and when you query it from Drill, Drill will scan all the Kafka messages from the earliest (oldest) offset to the latest (most recent).

Because data is constantly flowing into Kafka, every time you query (or consume) data, Drill will reset the offset pointer. You can change this option in the configuration for the Kafka storage plug-in.

Avoid Endless Queries

When querying streaming data, it should go without saying that you should avoid SELECT * queries without limits. In the event of an endless query, Kafka will time out, but it is best to limit your queries as tightly as possible.
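A safer pattern is to bound every exploratory query with a LIMIT clause. In this sketch, the topic name clickstream is a hypothetical example:

```sql
SELECT kafkaTopic, kafkaMsgOffset, kafkaMsgTimestamp
FROM kafka.`clickstream`
LIMIT 100
```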

Improving the performance of Kafka queries

In Drill 1.14, there is an enhancement to the Kafka plug-in that significantly reduces query time. Prior to this version, Drill would scan all the data in a topic before filtering; however, as of version 1.14, Drill can filter on the Kafka metadata, which will greatly reduce the volume of data that Drill is scanning. At present, there are three fields that you can include in the WHERE clause of your queries that will enhance performance:

kafkaPartitionId

Can be used with =, >, >=, <, <=.

kafkaMsgOffset

Can be used with =, >, >=, <, <=.

kafkaMsgTimestamp

Maps to the timestamp for each Kafka message. Drill can use this field only when you use the operators =, >, >=.

We recommend that you include these fields in predicates in the WHERE clause of your queries. For example:

WHERE kafkaMsgTimestamp > 'some date'
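The following sketch restricts a scan to a single partition and to messages after a given offset; the topic name and the literal values here are hypothetical:

```sql
SELECT *
FROM kafka.`clickstream`
WHERE kafkaPartitionId = 0
  AND kafkaMsgOffset > 5000
LIMIT 100
```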

Connecting to and Querying Kudu

Apache Kudu is a low-latency storage system for tabular data. At the time of writing, Kudu is incubating with the Apache Software Foundation and the Drill integration is still experimental. Kudu is a true columnar store that supports key-indexed record lookup similar to HBase, but Kudu differs from HBase in that Kudu’s data model more closely resembles a relational database.

To enable Drill to query Kudu, simply create a storage plug-in and enter the configuration that follows, replacing master address with the IP address or hostname of your Kudu master. After entering the IP address, enable the plug-in, and you should be able to query your data in Kudu!

{
  "type": "kudu",
  "masterAddresses": "master address",
  "enabled": true
}
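Once the plug-in is enabled, you query Kudu tables like any other Drill table. This sketch assumes a Kudu table named metrics:

```sql
SELECT *
FROM kudu.`metrics`
LIMIT 10
```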

Connecting to and Querying MongoDB from Drill

MongoDB is a popular NoSQL document-oriented database that can scale to enormous datasets. Installing and configuring MongoDB is beyond the scope of this book, but if you would like to read more about it, we recommend MongoDB: The Definitive Guide by Shannon Bradshaw and Kristina Chodorow (O’Reilly).

It is quite simple to connect Drill to MongoDB and to query the data in it. Make sure that MongoDB is running before you attempt to configure Drill. You’ll notice on the Storage tab of the web console that Drill ships with a storage plug-in for MongoDB preconfigured. However, in the event that the configuration is missing, you can use the configuration shown here:

{
    "type": "mongo",
    "connection": "mongodb://mongodb_host:27017/",
    "enabled": true
}

After you have saved this configuration, you can verify that the connection was successful by executing a SHOW DATABASES query in Drill. You should then see the databases from your MongoDB instance. You can now query your MongoDB instance using standard SQL from Drill!
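For example, assuming a MongoDB database named store with a collection named orders, a query might look like this:

```sql
SELECT *
FROM mongo.store.`orders`
LIMIT 10
```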

Connecting Drill to Cloud Storage

In many instances, organizations store large amounts of data in various cloud storage services such as Amazon S3, Microsoft Azure, or Google Cloud Storage (GCS). As with other filesystems, you can use Drill to connect to these systems and directly query this data. The process for connecting to one of these services is essentially the same. First, you must download and install the appropriate driver for the cloud service you are using. Next, set your access credentials in the core-site.xml file, and then finally configure the storage plug-in (see “Working with Amazon S3”).

Querying data on Amazon S3

To query data on Amazon S3, you need an access key and a secret key.

Getting access credentials for S3

To obtain access credentials, first log in to the AWS console; then, in the upper-right corner, click your name to open your personal settings menu, as shown in Figure 6-2. Click the My Security Credentials option.

Figure 6-2. AWS account menu

Next, click the “Access keys” link and generate the access and secret keys (see Figure 6-3).

Warning

Be sure to save the keys in a safe place; you cannot retrieve them again after you have generated them.

Figure 6-3. Generating security credentials

After you have obtained your access and secret keys, the next step is to configure Drill with your AWS credentials. To do this, add the following code to the file $DRILL_HOME/conf/core-site.xml, replacing the placeholder values with your credentials:

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESSKEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRETKEY</value>
  </property>
  <property>
     <name>fs.s3a.connection.maximum</name>
     <value>100</value>
  </property>
  <property>
     <name>fs.s3a.endpoint</name>
     <value>s3.REGION.amazonaws.com</value>
  </property>
</configuration>

When you have added these options to your configuration file, the next step is to install the driver and configure the storage plug-in.

Fundamentally, the configuration to query Amazon S3 is the same as that for querying the local filesystem (dfs); the only difference is that the connection string uses an s3a:// address instead of file:///:

"connection":"s3a://your_S3_bucket"

Just as with the dfs storage plug-in, you can set up workspaces and file types; thus, a complete s3a configuration might look like this:

{
  "type": "file",
  "enabled": true,
  "connection": "s3a://your_s3_bucket",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    }
  }
}
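With a configuration like this saved under a plug-in name such as s3, querying a file in the bucket is the same as querying a local file; the plug-in name and filename here are assumptions:

```sql
SELECT *
FROM s3.root.`customers.csv`
```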

Querying Minio datastores from Drill

Minio is a high-performance distributed object storage server that you can easily set up and query with Drill. After you have Minio up and running at an accessible IP address, you need to modify the core-site.xml file as shown here:

 <configuration>
     <property>
         <name>fs.s3a.access.key</name>
         <value>YOUR_ACCESS_KEY</value>
     </property>
     <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
     </property>
     <property>
       <name>fs.s3a.endpoint</name>
       <value>http://minio_host:9000</value>
     </property>
     <property>
       <name>fs.s3a.connection.ssl.enabled</name>
       <value>false</value>
     </property>
     <property>
       <name>fs.s3a.path.style.access</name>
       <value>true</value>
     </property>
 </configuration>

When this is complete, follow the instructions for connecting to an S3 bucket and you will be able to query your data stored using Minio!

Connecting to other cloud storage services

In addition to being able to directly query data in an Amazon S3 bucket, Drill can query data in GCS or Microsoft Azure. The basic procedure is similar: download and install the driver, set up the credentials in the core-site.xml file, and then create a storage plug-in similar to the dfs plug-in, with the connection string for the cloud storage provider.

The following steps download and unzip the Azure driver to the Drill jars/3rdparty folder (don’t forget to restart your Drillbits after adding drivers):

$ cd $DRILL_HOME/jars/3rdparty/
$ sudo wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.1/hadoop-azure-2.7.1.jar
$ sudo wget http://central.maven.org/maven2/com/microsoft/azure/azure-storage/2.0.0/azure-storage-2.0.0.jar

Next, add the following property to the core-site.xml file:

<property>
    <name>fs.azure.account.key.your_data_files.blob.core.windows.net</name>
    <value><your_azure_account_key></value>
</property>

Finally, here’s the connection string for Azure:

wasb://your_container@your_datafiles.blob.core.windows.net
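As with S3, you can duplicate the dfs configuration and substitute the wasb:// connection string; a minimal sketch, with the container and account names as placeholders:

```json
{
  "type": "file",
  "enabled": true,
  "connection": "wasb://your_container@your_datafiles.blob.core.windows.net",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  }
}
```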

To connect Drill to GCS, download the drivers from the GCS website and copy them to the $DRILL_HOME/jars/3rdparty folder.

As with Azure, duplicate the settings for dfs and change the connection string to the one shown in the following snippet, replacing the placeholder with your bucket name:

"connection": "gs://your_bucket_name"

Querying Time Series Data from Drill and OpenTSDB

OpenTSDB is an open source, scalable time series database that is built on top of HBase. As of version 1.13, Drill can natively query time series data in OpenTSDB. It is quite simple to connect Drill to OpenTSDB, but querying time series data in Drill is a little more complicated than regular data.

As with the other storage plug-ins, we begin with the default configuration. To enable OpenTSDB, make sure it is running, navigate to the Drill storage configuration page, and then enable the storage plug-in, changing the connection property if necessary to point to your instance:

{ 
   "type": "openTSDB", 
   "connection": "http://opentsdb_host:4242", 
   "enabled": true 
}

After you have enabled the storage plug-in, you are ready to query the time series data in OpenTSDB.

Special considerations for time series data

When you query time series data with Drill, you must specify certain configuration parameters in the FROM clause in your query. Table 6-1 lists the three mandatory parameters.

Table 6-1. OpenTSDB required query parameters

metric

The name of the metric or field that you are querying.

start

The start time of your query, specified as a relative or absolute timestamp; for example, start=5y-ago. A complete list of time formats for OpenTSDB is available in the OpenTSDB 2.3 documentation.

aggregator

The aggregation function for your data. Some aggregation functions include count, sum, avg, std, and none; a complete list is available in the OpenTSDB documentation.

Additionally, there are two optional parameters, explained in Table 6-2.

Table 6-2. OpenTSDB optional query parameters

downsample

Allows you to reduce the amount of data returned via a downsampling function; a complete list of available downsampling functions is in the OpenTSDB documentation.

end

The desired end time for the query. If left blank, OpenTSDB uses the current system time as the end time.

The FROM clause of a Drill query must therefore look like the following example:

SELECT fields
FROM openTSDB.`(metric=metric_name, 
   start=start_time, 
   aggregator=aggregation_function)`

As a final example, the query that follows returns the aggregated and grouped values from the clicks.speed.test table, limited to the last year:

SELECT `timestamp`, SUM(`aggregated value`) AS avg_sum
FROM openTSDB.`(metric=clicks.speed.test, 
        aggregator=avg, 
        start=1y-ago)`
GROUP BY `timestamp`

Conclusion

One of Drill’s most powerful use cases is the ability to query and join data from a wide variety of different data sources, including Hadoop, traditional relational databases, MongoDB, Kafka, and many others, all using standard SQL. In this chapter you have learned how to configure Drill to query all of these different data sources as well as the intricacies of querying particular data sources. 

In the next section of the book, you will learn how to incorporate Drill into your data pipeline. You’ll see how to connect to Drill, how to handle complex data formats with Drill, and how to deploy Drill in production.

1 As indicated in Chapter 4, the plug-in name “dfs” is completely arbitrary.
