Chapter 9. Deploying Drill in Production

The two most common usage patterns for Drill are a single (often embedded) instance used to learn Drill, and a fully distributed, multinode setup used in production. This chapter explains the issues to consider when moving from personal usage to a production cluster. This chapter does not explain the basics of setting up a distributed Hadoop or Amazon Web Services (AWS) cluster; we assume that you already have that knowledge.

Installing Drill

This section walks you through the steps required to get the Drill software on your nodes. Later sections explain how to configure Drill.

You have several options for how to install Drill on your servers:

Vendor-provided installer

If you are a MapR customer, the easiest solution is to use the MapR installer. Note that the MapR configuration does not use the site directory (see “Creating a Site Directory”); if you upgrade Drill, be sure to follow the manual steps in the MapR documentation to save your configuration and JAR files before upgrading.

Casual installation

If you are trying Drill on a server but are not quite ready for production use, you can install Drill much as you would on your laptop: just download Drill into your home directory and then follow the steps listed in Chapter 2.

Production installation

This is a more structured way to install Drill onto your system; see “Production Installation”.

Regardless of how you do the installation, there are three common deployment patterns, as explained in earlier chapters:

Single-node

This is handy for learning Drill, developing plug-ins and user-defined functions (UDFs), and simple tasks. Data generally resides on your local disk.

HDFS-style cluster

Drill runs on each data node to ensure that data is read from the local HDFS. This is called data locality.

Cloud-style cluster

Drill runs on a set of, say, AWS instances, but the data itself resides in the cloud vendor’s object store (such as Amazon S3). This is called separation of compute and storage.

Drill runs fine in all of these configurations. After you become familiar with Drill, you can also experiment with hybrid configurations (run on a subset of your HDFS nodes, say, or use a caching layer such as Alluxio with cloud storage). Because Drill makes minimal assumptions about storage, it runs well in many configurations.

Prerequisites

Drill has only two prerequisites:

  • Java 8

  • ZooKeeper

Install the version of Java 8 appropriate for your operating system. Drill works with both the Oracle and OpenJDK versions. Install the JDK (development) package, not the JRE (runtime) package; Drill uses the Java compiler, which is available only in the JDK. Be sure to install Java on all your nodes.

Drill uses Apache ZooKeeper to coordinate the Drill cluster. Follow the standard ZooKeeper installation procedures to set up a ZooKeeper quorum (typically three or five nodes). See the ZooKeeper documentation for installation instructions.
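For reference, a minimal zoo.cfg for a three-node quorum might look like the following (the host names are placeholders; see the ZooKeeper documentation for the full set of options):

```
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=zkHost1:2888:3888
server.2=zkHost2:2888:3888
server.3=zkHost3:2888:3888
```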

Production Installation

In this section we describe one way to set up Drill on a production server, using a site directory to separate your configuration from Drill files to ease upgrades.

First, use wget to download Drill. To find a mirror, and the latest version, visit the Drill download page and click the “Find an Apache Mirror” button. Pass the resulting link into a wget command:

wget <download-link>

Then, expand the archive:

tar xzf apache-drill-1.XX.0.tar.gz

Move the resulting directory to /opt/drill:

mv apache-drill-1.XX.0 /opt/drill

Creating a Site Directory

By default, Drill expects its configuration files to appear in $DRILL_HOME/conf and its plug-ins to appear in $DRILL_HOME/jars. Many users, when first deploying Drill, continue to follow this model. Although convenient, the model becomes a nuisance when it is time to upgrade. You must copy your own files from the old distribution to the new one. This requires that you remember to save the old distribution before overwriting it with the new one. Then, you must remember which files you edited (or compare the file lists between the new and old distributions and sort out which are new in Drill and which are those that you added). Clearly, this is not a reliable way to maintain your system.

The alternative is to completely separate your own files from the Drill distribution files by using a site directory, which is simply a directory, located anywhere on your system outside the Drill directories, in which you place your configuration. In the following example, $DRILL_SITE points to the site directory:

$DRILL_SITE
|- drill-env.sh
|- drill-override.conf
|- logback.xml
|- jars/
   |- Your custom JARs (UDFs, plug-ins, etc.)
|- lib/
   |- Your custom native libraries (PAM modules, etc.)

Then, you simply instruct Drill to use your site directory instead of the default location:

drillbit.sh --site $DRILL_SITE start

If you set the $DRILL_SITE variable, Drill will use that location without you specifying the --site option.

When it is time to upgrade, replace the Drill distribution and restart Drill. There’s no need to copy or merge files, and so on.

Let’s put the site directory adjacent to /opt/drill:

mkdir /opt/drill-site

Drill requires at least the drill-override.conf file. Create it, along with drill-env.sh and logback.xml, from the examples provided in Drill:

cp /opt/drill/conf/drill-override.conf /opt/drill-site
cp /opt/drill/conf/drill-env.sh /opt/drill-site
cp /opt/drill/conf/logback.xml /opt/drill-site

Next, create a custom launch command by creating the file /opt/drill-site/drill.sh:

#! /bin/bash
export DRILL_SITE=`dirname "${BASH_SOURCE-$0}"`
export DRILL_HOME=/opt/drill
$DRILL_HOME/bin/drillbit.sh --site $DRILL_SITE "$@"

You’ll use this to launch Drill instead of the usual drillbit.sh command so you can set the site directory. Finally, make the script executable:

chmod +x /opt/drill-site/drill.sh

Configuring ZooKeeper

The next step is to configure Drill. We’ll assume that you are using the production installation.

In the most basic Drill configuration, ZooKeeper is all you need to configure in /opt/drill-site/drill-override.conf:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "zkHost1:2181;zkHost2:2181;zkHost3:2181"
}

In this example, zkHost1 is the name of your first ZooKeeper host, and so on. Replace these names with your actual host names. Port 2181 is the default ZooKeeper port; replace this with your custom port if you changed the port number.

Advanced ZooKeeper configuration

The cluster-id is used as part of the path to Drill’s znodes. The default path is as follows:

drill/drillbits1

Normally, you just use the defaults. However, if you ever have the need to run two separate Drill clusters on the same set of nodes, you have two options:

  • The Drill clusters share the same configuration (storage configurations, system options, etc.).

  • The Drill clusters are completely independent.

To create clusters that share state, change just the cluster-id. To create distinct shared-nothing Drill clusters, change the zk.root; for example:

drill.exec: {
  cluster-id: "drillbits2",
  zk.root: "mySecondDrill",
  zk.connect: "zkHost1:2181;zkHost2:2181;zkHost3:2181"
} 

Configuring Memory

Drill uses three types of memory:

  • Heap

  • Direct memory (off-heap)

  • Code Cache

Drill is an in-memory query engine and benefits from ample direct memory. When your query contains an operation such as a sort or hash join, Drill must accumulate rows in memory to do the operation. The amount of memory depends on the total number of rows your query selects, the width of each row, and the number of Drillbits you run in your cluster. The more rows, or the wider the row, the more memory is needed. Adding Drillbits spreads that memory need across multiple nodes.

For small, trial queries, the default memory is probably sufficient. But as you go into production, with large datasets, you must consider how much memory Drill needs to process your queries.

Memory tuning is a fine art. By default, Drill uses 8 GB of direct memory, 4 GB of heap, and 1 GB for the code cache. You will likely not need to change the code cache. The other two memory settings depend entirely on your workload and the available memory on your nodes.

When running Drill in distributed mode (multiple Drillbits), each Drillbit must be configured with the same memory settings because Drill assumes that the Drillbits are symmetric.

Recent versions of Drill have added spill-to-disk capabilities for most operators so that if Drill encounters memory pressure, it will write temporary results to disk. Although this avoids out-of-memory errors, it does slow query performance considerably. (To gain maximum benefit, you will want to enable query queueing, as described in a moment.)

The best rule of thumb is to give Drill as much memory as you can. Ideally, Drill will run on a dedicated node. If you run on a cloud provider such as AWS, reading data from the cloud provider’s storage (such as Amazon S3), the simplest solution is to run Drill on its own instance, with Drill using all available memory on that instance. As your queries grow larger and you need more memory, simply upgrade to a larger instance, adjusting the Drill memory settings accordingly.

If you run Drill on-premises on an HDFS cluster, you will want to run Drill colocated with your HDFS data nodes (more on this shortly). In this case, try to give Drill a large share of memory.

Next, run sample queries, especially those that include sorts and joins. Examine the query profiles to determine the amount of memory actually used. Then, determine the expected number of concurrent queries and adjust the direct memory accordingly.

Drill tends to need much less heap memory than direct, and the relationship between the two is not linear. You can adjust heap via trial and error (increase it if you hit Java out-of-memory errors), or you can use a Java monitoring tool to watch the heap. Frequent garbage collection events mean that Java has insufficient heap memory.

You set memory in the drill-env.sh file. The file starts with a large number of commented-out options along with descriptions. Suppose that you want to set direct memory to 20 GB and heap to 5 GB. Find the lines in the snippet that follows. First remove the leading comment characters, and then set the desired values:

export DRILL_HEAP=${DRILL_HEAP:-"5G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}

The funny syntax allows you to override these values using environment variables: if the environment variables are already set, they are used; if not, your values are used. (This feature is used in the Drill-on-YARN integration, described later.)
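You can see the behavior of this expansion in any bash shell:

```shell
# The ${VAR:-default} idiom: use the environment value if set, else the default.
unset DRILL_HEAP
HEAP=${DRILL_HEAP:-"5G"}   # DRILL_HEAP unset, so the default applies
echo "heap=$HEAP"          # prints: heap=5G

export DRILL_HEAP=8G
HEAP=${DRILL_HEAP:-"5G"}   # DRILL_HEAP set, so the environment value wins
echo "heap=$HEAP"          # prints: heap=8G
```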

Configuring Logging

Like most servers, Drill creates log files to inform you as to what is happening. Drill places logs in $DRILL_HOME/log by default. Although this is fine for experimenting with Drill, it is not good practice in production where logs can grow quite large. Let’s put the logs in /var/log/drill, instead. Edit drill-env.sh as described in the previous section:

export DRILL_LOG_DIR=${DRILL_LOG_DIR:-/var/log/drill}

Then create the log directory:

sudo mkdir /var/log/drill

Drill uses Logback for logging, which is configured by the logback.xml file. By default, Drill uses rotating log files with each file growing to 100 MB, and retaining up to 10 old log files. You can change this by editing the logback.xml file in your site directory. Suppose that you want your files to be no more than 50 MB and you want to keep up to 20 files:

<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>${log.path}</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
    <fileNamePattern>${log.path}.%i</fileNamePattern>
    <minIndex>1</minIndex>
    <maxIndex>20</maxIndex>
  </rollingPolicy>
  <triggeringPolicy 
      class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
    <maxFileSize>50MB</maxFileSize>
  </triggeringPolicy>
</appender>

Drill sets the log level to info by default. This level is really needed only if you are trying to track down problems. Otherwise, you can reduce log output by setting the level to error, as shown here:

<logger name="org.apache.drill" additivity="false">
  <level value="error" />
  <appender-ref ref="FILE" />
</logger>

Testing the Installation

Drill should now be ready to run on your single node. Let’s try it using the script created earlier:

/opt/drill-site/drill.sh start

Verify that the Drillbit started:

/opt/drill-site/drill.sh status
<node-name>: drillbit is running.

Verify that Drill has indeed started by pointing your browser to the following URL:

http://<node-ip>:8047

If the Drill Web Console appears, click the Query tab and run the suggested sample query:

SELECT * FROM cp.`employee.json` LIMIT 20

If the query works, congratulations! You have a working one-node Drill cluster. If the connection times out, Drill has failed. Double-check the preceding instructions and examine the log files, which should be in /var/log/drill. Common mistakes include the following:

  • Misconfiguring ZooKeeper

  • ZooKeeper not running

  • No Java installed

  • Installing the Java JRE instead of the JDK

  • Making a typo somewhere along the way

Stop Drill to prepare for the next step:

/opt/drill-site/drill.sh stop

Distributing Drill Binaries and Configuration

Every system administrator must choose a way to distribute software on a cluster. If you run Hadoop, you very likely already have a solution: one provided by your Hadoop vendor or one preferred by your shop.

If you are learning, a very simple solution is clush (CLUster SHell), which is easy to install on any Linux system. The big advantage of clush is that it makes no assumptions about your system and is completely standalone. Of course, that is its primary weakness as well, and why it is useful only in simple development or test environments. Still, we use it here to avoid depending on more advanced production tools such as Ambari, Puppet, Chef, or others.

Installing clush

On CentOS and the like, type the following:

yum --enablerepo=extras install epel-release
yum install clustershell

Check the clush documentation for installation instructions for other Linux flavors.

clush depends on passwordless Secure Shell (SSH) between nodes. Many online articles exist to explain how to do the setup.

clush provides shortcut syntax to reference your nodes if your nodes follow a simple numbering convention. For our examples, let’s assume that we have three nodes: drill-1, drill-2, and drill-3. We further assume that we’ve done the setup thus far on drill-1 and so must push changes to drill-2 and drill-3.

Distributing Drill files

We created three directories in our installation that we must now re-create on each node: the Drill distribution, the site directory, and the log directory.

The log directory is created only once at install time (Drill will fail if the directory does not exist):

clush -w drill-[2-3] mkdir -p /var/log/drill

You must copy the Drill distribution only on first install, and then again when you upgrade to a new version:

clush -w drill-[2-3] --copy /opt/drill --dest /opt/drill

However, you must copy the site directory each time you make a configuration change:

clush -w drill-[2-3] --copy /opt/drill-site --dest /opt/drill-site

It is essential that all nodes see the same configuration; obscure errors will result otherwise. To help ensure consistency, you should make it a habit to change configuration only on a single node (drill-1 here) and to immediately push the configuration out to your other nodes.
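To make that habit easy to keep, you could wrap the two steps in a small shell function (a sketch; it assumes the clush setup and directory layout described in this section):

```shell
# Hypothetical helper: push the site directory to the other nodes,
# then restart every Drillbit so the new configuration takes effect.
push_config() {
  clush -w drill-[2-3] --copy /opt/drill-site --dest /opt/drill-site &&
  clush -w drill-[1-3] /opt/drill-site/drill.sh restart
}
```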

Starting the Drill Cluster

You are now ready to start your full Drill cluster:

clush -w drill-[1-3] /opt/drill-site/drill.sh start

Next, verify that the Drillbits started:

clush -w drill-[1-3] /opt/drill-site/drill.sh status

You should see:

drill-1: drillbit is running.
drill-2: drillbit is running.
drill-3: drillbit is running.

Open the Drill web console as before, pointing to any of the nodes:

http://drill-1:8047

Now you should see all your nodes listed on the main web page.

Again run a test query. If it succeeds, congratulations; you have a working Drill cluster. The next step will be to configure access to your distributed filesystem.

If, however, you get error messages from clush, or you do not see all the nodes in the Drill web console, it is time to do a bit of debugging. Identify which nodes work and which failed. Look in the logs (in /var/log/drill) on those nodes for errors. Common mistakes include the following:

  • Wrong host name for ZooKeeper (use the internal host name or IP address as known by your nodes, not the public one; in the cloud, such as on AWS, that means the internal IP address)

  • Forgetting to copy some of the files (particularly any changes to configuration files)

  • Forgetting to install Java on all nodes

  • Forgetting to create the /var/log/drill directory on all nodes

If all else fails, post to the Drill user mailing list for help; signup information is available on the Drill website.

Configuring Storage

Drill is a distributed query engine. As soon as you move past a single node, Drill requires a distributed filesystem to hold the data. On-premises, the standard solution is HDFS or an HDFS-compatible filesystem such as MapR-FS or Alluxio. In the cloud, Drill supports data stored in Amazon S3. Here, we discuss generic Apache Hadoop HDFS and Amazon S3. If you are a MapR customer, the MapR installer will set up the MapR filesystem for you.

Working with Apache Hadoop HDFS

Drill can work with data stored in Hadoop HDFS as well as many other formats. Drill is designed to be independent of Hadoop, but you can integrate it with Hadoop when needed, such as to read data from HDFS.

There are two ways to do the integration: the simple way and with full integration.

Simple HDFS integration

If your HDFS configuration does not enable security and uses mostly default settings, you can probably access HDFS with the simple configuration described in Chapter 6:

  1. Open the Drill web console.

  2. Click the Storage tab.

  3. Find the dfs storage configuration, and then click Update.

  4. Add the HDFS URL to the connection field, as follows:

     "connection": "hdfs://name-node:port",
  5. Click Update.

Because the configuration is stored in ZooKeeper, it is immediately available for Drill to use. Try a simple query to verify the setup.
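For reference, the relevant parts of the updated dfs configuration would look something like this (the name node host and port are placeholders; 8020 is a common HDFS name node port):

```json
{
  "type": "file",
  "connection": "hdfs://name-node:8020",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  }
}
```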

Full HDFS integration

If you are familiar with HDFS, you know that you configure it using a set of configuration files within /etc/hadoop or a similar location.

By default, however, Drill ignores these files. There is often an ambiguity in Hadoop configuration files: are options meant for the server or client? What if tool A needs one option, but Drill needs a different option? Drill works around this by storing Hadoop configuration in the storage configuration rather than getting it from Hadoop configuration. If you need extra configuration settings, you can provide them in the "config" property, as shown in the next section for Amazon S3.

In a secure system, however, configuration becomes complex quickly, and you do not want to duplicate settings between Drill and Hadoop. In this case, you can use tighter HDFS integration by adding Hadoop’s configuration directory to the Drill class path. You do this by editing /opt/drill-site/drill-env.sh as follows:

export EXTN_CLASSPATH=/etc/hadoop

Use the actual path to your Hadoop configuration files, which must exist on all nodes on which Drill runs.

If you’re using MapR-FS, configuration is even simpler because it is done automatically by the MapR storage system driver.

Crash Error in Drill 1.13

In Drill 1.13, if you set the value of the fs.defaultFS property (the most important one!), Drill will fail to start. The workaround is to create a “hadoop” directory in your /opt/drill-site directory, copy the configurations there, remove fs.defaultFS, and use clush to push your updated configuration to all your nodes, as shown earlier. See DRILL-6520.

Must Duplicate the HDFS Name Node URL

Ideally, if given an HDFS configuration in the core-site.xml file, you would not need to specify the same information in Drill’s storage plug-in configuration. But, Drill will fail to save your configuration if you omit the connection. So, you must duplicate the HDFS name node URL from your HDFS configuration into the plug-in configuration. See DRILL-6521.

Working with Amazon S3

Chapter 6 and the Drill documentation provide a good overview of working with Amazon S3. Here is one way to set up Amazon S3 access on a distributed Drill cluster. We assume that you’ve created an Amazon S3 bucket and have obtained the required access keys. We also assume you are familiar with Amazon S3.

Drill uses the Hadoop HDFS s3a connector for Amazon S3. As of Drill 1.13, Drill ships with the 2.x version of the Hadoop libraries. The s3a library in that version does not support Amazon’s newer temporary access keys; you must use the regular access keys.

The example Amazon S3 storage configuration that ships with Drill suggests you put your keys in the storage plug-in configuration. However, the configuration is stored in ZooKeeper. Unless you have secured ZooKeeper, keys configured that way will be visible to prying eyes. Here, we’ll store the keys in files in a way that depends on whether you’ve already configured Hadoop for Amazon S3.

Access keys with Hadoop

If you already use Hadoop, you have likely configured it to access Amazon S3 as described in the Hadoop documentation. You only need to point Drill at that configuration by adding a line to /opt/drill-site/drill-env.sh:

export EXTN_CLASSPATH=/etc/hadoop

Standalone Drill

If Drill runs on a server with no other components, you can create just the one Hadoop configuration file: core-site.xml.

Create the core-site.xml file:

cp /opt/drill/conf/core-site-example.xml /opt/drill-site/core-site.xml

Edit the file to include your keys and add your endpoint:

<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>ACCESS-KEY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>SECRET-KEY</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.REGION.amazonaws.com</value>
    </property>
</configuration>

In the preceding example, replace ACCESS-KEY, SECRET-KEY, and REGION with your values.

Distributing the configuration

Each time you change any of your configuration files, you must push the updates to all of your nodes using the same command we used earlier:

clush -w drill-[2-3] --copy /opt/drill-site --dest /opt/drill-site

If Drill was running, be sure to restart it using the stop and start commands. Or, use the handy restart command to do the restart in a single step:

clush -w drill-[1-3] /opt/drill-site/drill.sh restart

Defining the Amazon S3 storage configuration

After you’ve configured Drill to work with Amazon S3, you must create a storage configuration to bind a Drill namespace to your Amazon S3 bucket. If you have multiple buckets, you can create multiple storage configurations. This setup is a bit different than the one in Chapter 6 because we’re using an external core-site.xml file.

  1. Start Drill if it is not yet running.

  2. Navigate to the web console.

  3. Click the Storage tab.

  4. In the Disabled Storage Plugins section, locate the s3 configuration, and then click Update.

  5. Remove the config section (if present), because you want to take the configuration from the core-site.xml file:

      "config": {},
  6. Revise the connection to the proper format:

      "connection": "s3a://your-bucket/",

    Use your actual bucket name in place of your-bucket.

  7. Click Update.

  8. Click Enable.

The storage configuration is stored in ZooKeeper and so is immediately available to all your running Drillbits.

Troubleshooting

Try a sample query to verify that the configuration works. If so, you’re good to go. Otherwise, here are some things to try:

  • If you’re using the Hadoop integration, be sure that the EXTN_CLASSPATH environment variable in drill-env.sh points to the Hadoop configuration directory: the one that contains core-site.xml.

  • If you’re using a Drill-specific file, be sure that the file is called core-site.xml and is in your site directory.

  • Double-check the credentials and the endpoint.

  • On the Storage tab in the Drill web console, verify that your s3 configuration is enabled.

  • Be sure that you removed the default config section. If the key properties are left in the storage configuration, they will override the credentials in core-site.xml.

For hints about the problem, look at the Error tab in the query profile for the query that failed.

You can also look in the Drill logs: /var/log/drill/drillbit.log. (Looking in the Drill logs is generally a good practice anytime something goes awry.)

Admission Control

Drill is a shared, distributed query engine intended to run many queries concurrently. It can be quite difficult to understand how to size a query engine for your load or how much load a given configuration can support. Further, because queries differ in resource needs, it can be surprising why a given number of queries works fine sometimes but causes Drill to run out of memory at other times.

If you use Drill for “casual” usage, simply sizing memory as described earlier is often sufficient. But if you subject Drill to heavy load and occasionally hit out-of-memory issues, you might want to consider enabling Drill’s admission control features.

At the time of this writing (Drill 1.13), admission control is functional but basic; if you use a newer version, check the documentation to see how the feature has evolved. Here’s how to turn on admission control:

ALTER SYSTEM SET `exec.queue.enable` = true

Perhaps the easiest way to set and review the options is to use the Options tab at the top of the Drill web console page.

Admission control uses two queues: one for “small” queries and another for “large” queries. The idea is that you might want to run a number of small interactive queries concurrently but allow only one large, slow query. The dividing line is a query cost number as computed by the Drill query planner and shown in the query profile. To set this number, sample a few of your smaller queries and a few larger queries. Because query plan numbers tend to grow exponentially, it is often not too difficult to pick a dividing line. Specify your selected number in the exec.queue.threshold property.

Then, determine how much bigger the larger queries are than the small ones by consulting your query profiles. You might find, say, that large queries generally want 5 or 10 times the memory of a small query. Use this to set exec.queue.memory_ratio to the ratio between query memory sizes.

Next, specify the maximum number of concurrent small queries using exec.queue.small and then do the same for large queries with exec.queue.large. This then lets you determine the amount of memory given to each query:

memory units = exec.queue.small +
               exec.queue.large * exec.queue.memory_ratio

Suppose that Drill is given M bytes of memory. Then, each small query gets:

small query memory = M / memory units

And each large query gets:

large query memory = M * exec.queue.memory_ratio / memory units

You can use these formulas to choose values for queue size if you know the available Drill memory and the desired memory for each query. Or, you can work out how much memory to give Drill in order to run some number of queries with a certain amount of memory each.
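As a worked example (all of the numbers here are invented for illustration): give Drill 120 GB of direct memory with 10 small-query slots, 2 large-query slots, and a memory ratio of 10:

```shell
# Illustrative admission-control sizing; all values are hypothetical.
SMALL=10                 # exec.queue.small
LARGE=2                  # exec.queue.large
RATIO=10                 # exec.queue.memory_ratio
M=$((120 * 1024))        # total direct memory, in MB (120 GB)

UNITS=$((SMALL + LARGE * RATIO))   # 10 + 2 * 10 = 30 memory units
SMALL_MB=$((M / UNITS))            # 122880 / 30 = 4096 MB per small query
LARGE_MB=$((M * RATIO / UNITS))    # 40960 MB per large query
echo "small=${SMALL_MB}MB large=${LARGE_MB}MB"
```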

After you’ve enabled admission control, Drill automatically enforces the memory limits by limiting the data stored in memory by buffering operators such as sort, hash aggregate, and hash join. As of Drill 1.11, both the sort and hash aggregate operators will spill to disk to stay under the memory limit; Drill 1.14 adds support for the hash join operator.

If you hit a peak load where more queries arrive than the maximum you have set, Drill will automatically queue those queries to wait for one of the running queries to complete. Queries wait in the queue for up to exec.queue.timeout_millis milliseconds, after which they fail due to a timeout.
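Pulling the options together, an admission control setup might look like this (the values are illustrative; pick your own based on the sizing exercise above):

```sql
ALTER SYSTEM SET `exec.queue.enable` = true;
ALTER SYSTEM SET `exec.queue.threshold` = 30000000;
ALTER SYSTEM SET `exec.queue.small` = 10;
ALTER SYSTEM SET `exec.queue.large` = 2;
ALTER SYSTEM SET `exec.queue.memory_ratio` = 10;
ALTER SYSTEM SET `exec.queue.timeout_millis` = 300000;
```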

Admission control works best when the system is tuned to handle your peak workload over the timeout period, with the queue smoothing out spikes that exceed the limits by using otherwise spare capacity during valleys in activity.

See the Drill documentation for additional details.

Additional Configuration

You have now covered the basics of configuring Drill. Depending on your needs, you might want to configure one or more optional features.

User-Defined Functions and Custom Plug-ins

Later chapters explain how to write user-defined functions and custom format plug-ins. To use these with a production Drill installation, the code (and sources, for UDFs) must be available on all Drill nodes. The Drill documentation suggests putting the files into the Drill product directory, but we’ve already explained that doing so is an awkward choice for a production cluster. The simplest solution is simply to copy your JAR files into the /opt/drill-site/jars directory and then use clush to push the files to all nodes and restart Drill.

You can also use the Dynamic UDF feature to distribute new UDFs without restarting Drill. Even here, after your UDF is solid and ready to go into production, you should move it into the site directory to minimize runtime overhead.

Security

When using Drill for development and evaluation, it is easiest to run it as some specific user (such as your login user on a Mac) or as root (such as on AWS). As you go into production, you’ll want to define a user specifically for Drill, such as drill, that is given only the permissions needed to do the work required.

When multiple users start using Drill, you must configure Drill to “impersonate” each user using various forms of authentication and impersonation.

Drill provides multiple ways to implement security in a secure cluster. Drill security is a complex, specialized, and evolving topic that very much depends on the type of system to which you want to integrate. For example, MapR security is different from Kerberos security, which differs from Pluggable Authentication Module (PAM) security. Drill provides multiple kinds of security, such as (but not limited to) the following:

  • Kerberos

  • Basic security using Linux users (PAM integration)

  • Encryption for each of the communication paths

  • User impersonation (the Drillbit can access distributed filesystem files as the session user instead of as the user running Drill)

  • Restricting administration access (such as the ability to alter storage plug-in configurations)

We recommend that you consult the Drill documentation and the user mailing list for up-to-date information.

Logging Levels

Drill uses Logback for logging, which you configure via the logback.xml file in your site directory. You copied over the logback.xml file when you created the site directory earlier.

Logging in Drill, as in most software, is used primarily to diagnose problems such as these:

  • Startup failures

  • Failed queries

  • Unusually slow performance

  • Data source errors

The logging level is controlled by the logger definition in your logback.xml file; the default configuration uses the info level:

  <logger name="org.apache.drill" additivity="false">
    <level value="info" />
    <appender-ref ref="FILE" />
  </logger>

As with all Java loggers, you can turn on detailed logging just for one part of Drill. However, it can be difficult to know which class or package name to use unless you are familiar with Drill internals. If you work with other Drill users to resolve an issue, they might be able to suggest a limited set of additional logging to enable.

Logging at the info level provides a wealth of detailed information about Drill internals, added by Drill developers to help tune each bit of code. The information does require a deep knowledge of the code. If you want to better understand how Drill works, info-level logging, along with reading the code, will give you a solid understanding of how operators work, the flow of record batches up the operator directed acyclic graph (DAG), and so on.

If you run Drill in production, it will produce a large volume of log messages. Drill uses Logback’s rolling log feature to cap each log file at a certain size. The default size is 50 MB, as shown here:

<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
  <maxFileSize>50MB</maxFileSize>
</triggeringPolicy>

Each time the size limit is reached, Logback creates a new file. It also begins to delete old files when the number of those files exceeds a configured limit. The default maximum is 20 files, as demonstrated here:

<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
  <fileNamePattern>${log.path}.%i</fileNamePattern>
  <minIndex>1</minIndex>
  <maxIndex>20</maxIndex>
</rollingPolicy>

Controlling CPU Usage

Drill is designed as a massively parallel query engine and aggressively consumes CPU resources to do its work. Drill converts your SQL query into an operator DAG that you see in the Drill web console. Operators are grouped into major fragments: each major fragment contains a string of operators up to a network exchange (shuffle).

Note

This section includes a discussion on fragments. For a more detailed look at how Drill creates and uses fragments, refer to Chapter 3.

Drill parallelizes each major fragment into a set of minor fragments, each of which runs as a thread on your Drillbits. Multiple minor fragments run on each node. In fact, the number of minor fragments per node defaults to 70% of the number of cores on your node, so if your node has 24 cores, Drill will create 16 minor fragments (threads) per node.

If you run multiple concurrent queries, each query is also parallelized. If you run 10 queries on your 24-core node, Drill will run 160 concurrent threads, or about 7 threads per core.
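The arithmetic behind these defaults can be sketched as follows (a worked example of the numbers above, not a Drill API):

```python
# Worked example of Drill's default per-node parallelism: minor fragments
# per node default to 70% of the core count, and each concurrent query is
# parallelized independently.
import math

cores = 24
fragments_per_query = math.floor(0.70 * cores)            # 16 threads per query
concurrent_queries = 10
total_threads = concurrent_queries * fragments_per_query  # 160 threads

print(fragments_per_query)           # 16
print(total_threads)                 # 160
print(round(total_threads / cores))  # about 7 threads per core
```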

For most queries, each thread spends much of its time waiting for I/O (either from the data source or for network shuffles). Thus, running more threads than you have cores helps Drill achieve better parallelism across queries.

If, however, your queries tend to be compute-bound (as can occur with aggregations, joins, and so on), you might want to reduce the number of threads to avoid excessive context switches.

You can do so by adjusting the planner.cpu_load_average session option. Setting it to, say, 0.25 means that Drill will create one minor fragment per major fragment for every four cores on the system.
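For example (the option name comes from the text above; Drill requires backticks around option names):

```sql
-- Reduce per-node parallelism for compute-bound workloads:
-- one minor fragment per major fragment for every four cores.
ALTER SESSION SET `planner.cpu_load_average` = 0.25;
```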

As of Drill 1.13, there are few automated ways to limit the overall number of slices per query; thus, highly complex queries can, by themselves, create a large number of fragments, even if you limit the per-fragment parallelization factor.

Drill was originally designed to dominate the node on which it runs. Drill assumes that it will be given all of the memory on the node (so it can efficiently do in-memory processing) and that it is free to use all CPU resources on that node, hence, the aggressive CPU scheduling.

If you run Drill in the cloud, the simplest and most effective design is simply to choose an instance size that fits your workload and then run Drill as the only process on that instance.

There are times when you might want to limit the amount of CPU that Drill can use. This is especially useful if Drill is placed on the same node as your data storage and must share that node with other components, such as YARN.

Drill 1.14 added support to use Linux cgroups to limit the amount of CPU that Drill will consume. Because this is a new and evolving feature, you should check the Drill documentation for details.
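As a rough sketch of what such a limit looks like with cgroups v1 (the cgroup name, filesystem paths, and values here are illustrative assumptions; follow the Drill documentation for the supported procedure):

```shell
# Compute a CPU quota equal to 4 cores' worth of time per scheduler period.
# The cgroup name, paths, and values below are illustrative only.
CORES=4
PERIOD_US=100000                    # 100 ms scheduler period
QUOTA_US=$((CORES * PERIOD_US))     # 400000 us of CPU time per period
echo "quota=${QUOTA_US}us per ${PERIOD_US}us period"

# Then (as root) something like:
#   cgcreate -g cpu:/drillbit
#   echo $PERIOD_US > /sys/fs/cgroup/cpu/drillbit/cpu.cfs_period_us
#   echo $QUOTA_US  > /sys/fs/cgroup/cpu/drillbit/cpu.cfs_quota_us
#   echo <drillbit-pid> > /sys/fs/cgroup/cpu/drillbit/cgroup.procs
```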

Note that cgroups does not limit the number of threads that Drill runs; it limits only the amount of CPU time that those threads can consume. If you put Drill under heavy load, limiting CPU will cause queries to run slower than without a limit and can lead to increased thread context switches. Monitor your system if you suspect that you are hitting these limitations. We discuss how to do this next.

Monitoring

Monitoring of Drill is divided into three distinct tasks:

  • Monitoring the Drill process

  • Monitoring Drill JMX metrics

  • Monitoring queries

Monitoring the Drill Process

The two most important metrics for Drill are CPU and memory. CPU is the easiest to monitor: you can get an instantaneous view using the Linux top command or you can use your favorite Linux tools to get a time series view. You should expect to see Drill using a large percentage of CPU while queries are active (see the previous section for details), with usage dropping to near zero between queries. Enabling admission control will help to smooth out peaks by shifting some queries to run when Drill would otherwise be idle.

A related metric is the number of context switches. As the number of concurrent queries increases, the number of threads per Drillbit will increase. If context switches become excessive, you will want to limit parallelism or use admission control to throttle queries, both of which were described earlier.

The other key metric is memory. Here, it is important to understand how memory works in Drill and Java. As Drill runs, it will request more memory from the operating system, but it will never release it, so you will see a chart that stair-steps upward. Some people interpret such a chart as a memory leak, but this is simply how Java manages memory. The memory is not leaked; it is held in Java’s heap and Drill’s direct memory free pools, waiting to be used by the next query. When the memory allocated to Drill reaches the configured maximum, subsequent requests are satisfied only from those free pools.

One other metric of interest is I/O; however, this is much more difficult to measure and understand. If Drill runs on a MapR cluster, the MapR tools will tell you the performance of Drill’s reads against the MapR filesystem. If your data is stored on Amazon S3, all reads are network reads and you must measure network performance, but you must tease apart traffic to the data hosts from other network traffic.

When Drill spills to disk, it uses local disk but only during spill reads and writes. You can see this activity by monitoring local I/O.

When Drill does an exchange (shuffle), it serializes potentially large amounts of data across the network between Drillbits.

Monitoring JMX Metrics

Process monitoring provides information about the Drill process in a generic way. To get more detail, you can use JMX monitoring via JConsole or a similar JMX tool. The standard metrics (CPU, threads, memory) are useful.

Drill publishes a number of metrics available as MBeans with the prefix drill. For example, drill.fragments.running will show you the number of fragments currently running, which is a good way to judge Drillbit load. See the documentation for details.
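The Drill web console also exposes these metrics as a Dropwizard-style JSON snapshot. Here is a small sketch of filtering such a snapshot down to Drill's own gauges (the JSON shape shown is the standard Dropwizard layout; verify it against your Drill version):

```python
# Sketch: filter a Dropwizard-style metrics snapshot down to the drill.*
# gauges. The sample payload below is illustrative, using the
# drill.fragments.running gauge named in the text.
import json

def drill_gauges(metrics_json: str) -> dict:
    """Return only the gauges whose names carry the drill. prefix."""
    metrics = json.loads(metrics_json)
    return {name: gauge["value"]
            for name, gauge in metrics.get("gauges", {}).items()
            if name.startswith("drill.")}

sample = json.dumps({
    "gauges": {
        "drill.fragments.running": {"value": 3},
        "heap.used": {"value": 1073741824},
    }
})
print(drill_gauges(sample))  # {'drill.fragments.running': 3}
```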

Monitoring Queries

The aforementioned tools allow you to monitor the Drill process as a whole. The tools indicate Drill’s overall load level, but not why that load occurs. To understand the specifics of load, there is no better tool than the Drill query profiles, which are available in the web console. The profile provides a large amount of information, including:

  • Runtimes, memory, and record counts for each operator

  • The number of operators, major fragments (groups of operators), and minor fragments

  • Operator-specific metrics such as the amount of data spilled to disk

If a query is slow, you can determine whether one of the following is the cause:

  • Reading a large amount of data from disk

  • An ineffective filter that accepts more rows than you intended

  • Excessive data exchanges

  • Insufficient file splits (which can occur when reading nonsplittable files such as JSON)

  • Excessive disk spilling because the sort, hash aggregate, or hash join operators had too little memory to hold query data

Other Deployment Options

Thus far, this chapter has primarily discussed deploying Drill using the most basic of tools: clush. Experienced admins will use this information to automate the processes with their favorite tools such as Chef or Puppet. In this section, we discuss three other techniques.

MapR Installer

MapR, which developed Drill and contributed it to the Apache Software Foundation, includes Drill in its MapR Ecosystem Package (MEP) releases. If you are a MapR customer, the MapR installer is the easiest way to deploy Drill, but you still need to configure Drill as described earlier in this chapter. As of this writing, MapR does not, however, support the site directory concept; the various configuration files mentioned earlier must go into the Drill distribution directory, and you must manually copy those files across from your old to new installations each time you upgrade. (Check the MapR documentation to see if this changes in some future release.)

Drill-on-YARN

Drill 1.13 added the ability to deploy Drill on your cluster using YARN as a cluster manager. You launch a YARN application master (AM) using the tool provided and then interact with the AM to add or remove nodes. YARN takes care of copying the Drill software and configuration files from your master copy onto the target nodes.

With Drill-on-YARN, you can create multiple Drill clusters, which is sometimes helpful when different organizations want to separate their resources (or need different configurations). You simply give each Drill cluster a unique ZooKeeper “root,” as described earlier, and then configure each cluster with distinct port numbers. Clients will pick up this configuration from ZooKeeper.
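In drill-override.conf terms, separating two Drill clusters might look like this (the cluster name, ZooKeeper root, and port value here are illustrative; drill.exec.cluster-id and drill.exec.zk.root are the standard option names):

```
drill.exec: {
  cluster-id: "marketing-drill",    # unique name for this Drill cluster
  zk: {
    connect: "zk1:2181,zk2:2181,zk3:2181",
    root: "drill/marketing"         # distinct ZooKeeper root per cluster
  },
  http.port: 8047                   # give each cluster distinct ports as well
}
```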

See the Drill-on-YARN documentation for details.

Docker

Drill primarily runs on Hadoop clusters, which have traditionally used “bare-metal” or YARN deployment models. It can, however, sometimes be useful to deploy Drill on Docker, especially for testing and development, or on Amazon S3, where there is less need to run in the classic Hadoop formats.

Drill 1.14 provides a Dockerfile to build the Docker container for Drill running in embedded mode. Future releases promise to add support for running in server mode (a Drillbit) in a Docker container. Check the Drill website and the dev list for the latest information.

Conclusion

Drill is a very flexible tool: it works just as well on your laptop as in a large distributed Hadoop cluster. This section discussed many of the most common tasks involved in scaling Drill to run on a production cluster. The details will vary depending on whether you run on a MapR cluster, a classic Hadoop cluster, in AWS, or other variations. The way you perform the specific tasks will vary depending on whether you do the tasks manually (as described here) or take advantage of the MapR installer or your favorite administration tool.

This chapter could not hope to cover every issue that might arise, so we concentrated on the most common factors. The Drill user mailing list is a great place to discuss advanced deployment needs and to learn how others have deployed Drill.

After you’ve mastered the deployment steps, you’ll find that Drill makes a solid addition to your big data toolkit, complementing Hive and Spark and allowing your users to query big data using their favorite BI tools.
