Chapter 2. Installing and Running Drill

Drill is a layered Java application. At the core is a set of Java JAR files that you can include in another Java application. Drill calls this the embedded mode. The sqlline tool we’ll discuss is one example; the JDBC driver is another. Embedded mode is handy for learning Drill or for working with datasets stored on your local machine.

More typically, however, you run Drill as a server. Drill calls this the server mode. The Drill server, called a Drillbit, is just a wrapper around the core Drill library, so you get much of the same functionality either way. You can run a single Drillbit, or you can run multiple Drillbits on a cluster. Drill calls this distributed mode. The key advantage is that distributed mode provides much easier access to distributed filesystems, such as HDFS or S3.

You can run a single Drillbit on your laptop, which lets you try out the server features that you’ll use in production, including working with distributed filesystems. Note that you can do a lot with Drill without having a Hadoop cluster at your disposal. When Drill is installed on a cluster, you work with it in exactly the same way as on your laptop, but Drill now distributes its load across multiple machines. When you run Drill as a server, you also need ZooKeeper running.

This chapter explains how to install Drill on your laptop, run it in both embedded and server modes, and configure your system to work with Drill. Chapter 9 explains how to run a distributed set of Drillbits on a production cluster.

After you install Drill, there are several ways of interacting with it. In addition to the command-line interface, you can interact with Drill via most BI tools through Drill’s ODBC/JDBC interface. Drill also comes with a web interface, which we cover in Chapter 4.

If you just want to experiment with Drill, it is available preinstalled in a few publicly available virtual machines such as the Griffon Distribution for Data Science,1 the MapR Sandbox, and others.

If you are interested in trying out Drill, it is sufficient to install Drill on your laptop and run it in embedded mode.

Preparing Your Machine for Drill

Before you install Drill on any machine, you need to have the Oracle Java SE Development Kit 8 (JDK 8) installed, which is available from Oracle’s website.

If you already have the JDK installed, you can verify the version as shown here:

$ java -version
java version "1.8.0_65" 
Java(TM) SE Runtime Environment (build 1.8.0_65-b17) 
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode) 

As long as your JDK is version 1.8 or higher, you can run Drill. If you don’t have the JDK on your machine, simply download the executable installer and run it.
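If you prefer to script the version check, the following is a minimal sketch rather than anything shipped with Drill. The function names parse_java_version and java_major_version are hypothetical. It assumes the version string appears in double quotes on the first line that java -version prints, as in the output shown earlier, and that releases before Java 9 report themselves as 1.x. Whether Java releases newer than 8 work depends on your Drill release, so treat the threshold as an assumption.

```shell
#!/usr/bin/env bash

# Read `java -version` output on stdin and print the quoted version string,
# for example 1.8.0_65. (Hypothetical helper, not part of Drill.)
parse_java_version() {
  awk -F'"' '/version/ { print $2; exit }'
}

# Print the major Java version for a version string.
# "1.8.0_65" -> 8 (old 1.x scheme), "11.0.2" -> 11 (new scheme).
java_major_version() {
  local version="$1"
  local first="${version%%.*}"        # text before the first dot
  if [ "$first" = "1" ]; then
    local rest="${version#1.}"        # drop the leading "1."
    echo "${rest%%.*}"
  else
    echo "${first%%[!0-9]*}"          # strip suffixes such as "-ea"
  fi
}

# Usage (requires java on the PATH; java -version writes to stderr):
#   v=$(java -version 2>&1 | parse_java_version)
#   if [ "$(java_major_version "$v")" -ge 8 ]; then
#     echo "JDK $v should be new enough for Drill"
#   fi
```

The parsing is kept separate from the call to java so that you can check a version string without a JDK on the machine.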

Special Configuration Instructions for Windows Installations

On Windows machines, you will also need to have two environment variables set:

  • A JAVA_HOME environment variable pointing to the JDK installation

  • A PATH environment variable that includes the path to your JDK installation

To set these variables in Windows XP/Vista/7/8/10, perform the following steps:

  1. Open the Control Panel, click System, and then click Advanced System Settings.

  2. Switch to the Advanced tab and then click Environment Variables.

  3. In the Environment Variables window, in the System Variables pane (see Figure 2-1), scroll down to PATH and then click Edit.

  4. Add C:\Program Files\Java\jdk1.x.x_xx\bin; in front of all existing entries, replacing x.x_xx with the exact version number you downloaded. Do not delete any existing entries, because this can cause other applications to not run.

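  5. Repeat this process for the JAVA_HOME variable, setting its value to the JDK installation directory itself (for example, C:\Program Files\Java\jdk1.x.x_xx, without the trailing \bin). If the variable does not exist yet, click New to create it.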

Figure 2-1. Configuring the environment variables

After you have installed Java and set the environment variables, you can confirm that your installation is working correctly by typing java -version at a command prompt. You should receive the result shown in Figure 2-2.

Figure 2-2. Confirmation that your installation is working correctly

Lastly, for Windows you will also need a program that can decompress TAR files, such as 7-Zip.

Installing Drill on Windows

After you have Java correctly configured, installing Drill on a Windows machine is essentially the same process as for macOS or Linux:

  1. Download the latest version of Drill.

  2. Move the file to the directory where you want to install Drill.

  3. Using 7-Zip, or whichever extractor you have, unzip the file and the underlying TAR file into the directory of your choosing.

Starting Drill on a Windows Machine

Open a command prompt and navigate to the directory where you unzipped Drill. When you are there, navigate to the bin directory by typing cd bin. Next, at the command prompt, type the command sqlline.bat -u "jdbc:drill:zk=local". If all went well, you should see the Drill prompt.

From here, you can enter SQL queries at the prompt. As part of the default installation, Drill includes a number of demonstration datasets, and among these is a file called employee.json that contains nominal data about an organization’s employees. To verify that your installation is working properly, enter the following query at the Drill prompt:

SELECT education_level, COUNT( * ) AS person_count
FROM cp.`employee.json`
GROUP BY education_level
ORDER BY person_count DESC;

If all is working properly, you should see an ASCII table of the data, as demonstrated here:

0: jdbc:drill:zk=local> SELECT education_level, COUNT( * ) AS person_count
. . . . . . . . . . . > FROM cp.`employee.json`
. . . . . . . . . . . > GROUP BY education_level
. . . . . . . . . . . > ORDER BY person_count DESC;
+----------------------+---------------+
|  education_level     |  person_count |
+----------------------+---------------+
| Partial College      | 288           |
| Bachelors Degree     | 287           |
| High School Degree   | 281           |
| Graduate Degree      | 170           |
| Partial High School  | 129           |
+----------------------+---------------+
5 rows selected (0.195 seconds)

Drill does not support server mode on Windows machines, so at this point your Drill installation is working and you can skip to the next chapter about connecting Drill to data sources.

To quit the Drill shell, type !quit.

Installing Drill in Embedded Mode on macOS or Linux

Installing Drill on a macOS or Linux machine is as simple as downloading and decompressing the TAR file. Depending on your system, either one of the following commands will download the Drill TAR file:

wget -O apache-drill-version.tar.gz \
  http://mirror.olnevhost.net/pub/apache/drill/drill-version/apache-drill-1.13.0.tar.gz

curl -o apache-drill-version.tar.gz \
  http://mirror.olnevhost.net/pub/apache/drill/drill-version/apache-drill-1.13.0.tar.gz

After you have moved Drill to the desired location, simply decompress it:

tar -xvzf apache-drill-version.tar.gz

Adding Drill to the PATH

It is often convenient to define a variable to point to Drill’s location. In your .bashrc or .bash_profile file, add the following line:

export DRILL_HOME=/path/to/drill

Then, all your commands can be of the form:

$DRILL_HOME/bin/drill-embedded

You might want to add the following line to your .bash_profile file as a convenience:

alias startDrill='$DRILL_HOME/bin/drill-embedded'

This will allow you to start Drill in embedded mode by typing startDrill at the command line. At this point, you are ready to run Drill.
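Because a mistyped DRILL_HOME only shows up when a command fails, you might add a small sanity check to the same startup file. The following is a sketch under the assumption that an extracted Drill distribution always contains an executable bin/drill-embedded script, as used in this section. The function name is_drill_home is hypothetical.

```shell
#!/usr/bin/env bash

# Return 0 if the given directory looks like an extracted Drill distribution,
# that is, it contains an executable bin/drill-embedded script.
# (Hypothetical helper, not part of Drill.)
is_drill_home() {
  local dir="$1"
  [ -n "$dir" ] && [ -x "$dir/bin/drill-embedded" ]
}

# Usage in .bashrc or .bash_profile, after the export line:
#   if is_drill_home "$DRILL_HOME"; then
#     alias startDrill='$DRILL_HOME/bin/drill-embedded'
#   else
#     echo "DRILL_HOME does not point to a Drill installation" >&2
#   fi
```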

Starting Drill on macOS or Linux in Embedded Mode

To start Drill after you’ve installed it, navigate to the path where you extracted the Drill files and execute the following command:

./bin/drill-embedded

You should see the following prompt:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; 
  support was removed in 8.0
Jan 04, 2016 11:09:55 AM org.glassfish.jersey.server.ApplicationHandler 
  initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 
  01:25:26...
apache drill 1.14.0
"a drill is a terrible thing to waste"
0: jdbc:drill:zk=local>

From here, you can enter SQL queries at the prompt. As part of the default installation, Drill includes a number of demonstration datasets, and among these is a file called employee.json that contains nominal data about an organization’s employees. To verify that your installation is working properly, enter the following query at the Drill prompt:

SELECT education_level, COUNT( * ) AS person_count
FROM cp.`employee.json`
GROUP BY education_level
ORDER BY person_count DESC;

If all is working properly, you should see an ASCII table of the data such as that shown here:

0: jdbc:drill:zk=local> SELECT education_level, COUNT( * ) AS person_count
. . . . . . . . . . . > FROM cp.`employee.json`
. . . . . . . . . . . > GROUP BY education_level
. . . . . . . . . . . > ORDER BY person_count DESC;
+----------------------+---------------+
|  education_level     |  person_count |
+----------------------+---------------+
| Partial College      | 288           |
| Bachelors Degree     | 287           |
| High School Degree   | 281           |
| Graduate Degree      | 170           |
| Partial High School  | 129           |
+----------------------+---------------+
5 rows selected (0.195 seconds)

At this point, your Drill installation is working and you can skip to the next chapter about connecting Drill to data sources.

To quit the Drill shell, type !quit.

Installing Drill in Distributed Mode on macOS or Linux

Although Drill can significantly improve data exploration and analysis on one machine, its real power comes from running it in distributed mode, in which you can query multiple datasets in situ. This section assumes that you have a working cluster and covers how to set it up to work with Drill. We will not cover how to set up a Hadoop or MongoDB cluster. We discuss how to set up a production Drill cluster in Chapter 9.

Preparing Your Cluster for Drill

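Before you install Drill on a cluster, every node must have JDK 8 installed, as described earlier in this chapter. The cluster also needs ZooKeeper and, if you run more than one Drillbit, a distributed filesystem, for the reasons that follow.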

When running in embedded mode, or as a server on a single machine, Drill can read from either your local filesystem or a distributed filesystem. But when you run in distributed mode (with two or more Drill servers), you must use a distributed filesystem such as HDFS or Amazon S3. 

Regardless of whether you run in embedded or server mode, Drill allows you to query any number of data sources in a single query in order to do joins, unions, and other SQL operations.

Drillbits form a Drill cluster using ZooKeeper, as described shortly. Even if you run a “single-node cluster” (a single Drillbit) on your laptop, you must also run ZooKeeper. For details on installing and configuring a ZooKeeper cluster, consult the ZooKeeper “Getting Started Guide.”

Depending on your system, either one of these commands will download the file:

wget http://drill.apache.org/download/apache-drill-version.tar.gz

curl -o apache-drill-version.tar.gz \
  http://drill.apache.org/download/apache-drill-version.tar.gz

After you have moved Drill to the desired location, decompress it in the directory of your choice, such as /opt:

tar -xvzf apache-drill-version.tar.gz

Finally, you may need to configure the node to communicate with ZooKeeper. The default file is fine for a single Drillbit with ZooKeeper running on the same host with default options. You must configure Drill for all other cases. To do this, you will need to modify the drill-override.conf file, which you can find at path_to_drill/conf/drill-override.conf. You will need to configure each node with the ZooKeeper host names and port numbers, as demonstrated in the following:

drill.exec: {
   cluster-id: "drillbits1",
   zk.connect: "zkhostname1:2181,zkhostname2:2181,zkhostname3:2181"
}

Configuration must be the same for all nodes in your Drill cluster. The default ZooKeeper port is 2181.
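Because every node needs an identical file, it can help to generate the configuration rather than edit it by hand on each machine. The following sketch prints a drill-override.conf body in the form shown above. The function names zk_connect_string and drill_override_conf are hypothetical, and the sketch assumes that all ZooKeeper hosts listen on the same port.

```shell
#!/usr/bin/env bash

# Join host names into a ZooKeeper connect string: host1:port,host2:port,...
# (Hypothetical helper, not part of Drill.)
zk_connect_string() {
  local port="$1"; shift
  local out="" host
  for host in "$@"; do
    out="${out:+$out,}${host}:${port}"   # add a comma only between entries
  done
  echo "$out"
}

# Print a drill-override.conf body for a cluster ID, a port, and a host list.
drill_override_conf() {
  local cluster_id="$1"; shift
  printf 'drill.exec: {\n  cluster-id: "%s",\n  zk.connect: "%s"\n}\n' \
    "$cluster_id" "$(zk_connect_string "$@")"
}

# Usage: write the file once, then copy the same file to every node.
#   drill_override_conf drillbits1 2181 zkhostname1 zkhostname2 zkhostname3 \
#     > "$DRILL_HOME/conf/drill-override.conf"
```

Note that the connect string is emitted on a single line, with no whitespace between the host entries.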

Starting Drill in Distributed Mode

To use Drill in distributed mode, each node in your cluster must have a Drillbit running on it. After you configure each node (as described in the previous section), you now have to start the Drillbit:

$DRILL_HOME/bin/drillbit.sh start

You can start, restart, stop, check the status of the daemon, or set it to autorestart with the same script, as follows:

$DRILL_HOME/bin/drillbit.sh [start|stop|restart|status|autorestart]

After you’ve started the Drill daemon on every node in your cluster, you are ready to connect to the cluster.
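On a cluster with more than a few nodes, you will probably want to run the script on all of them from one place. The following is a sketch of one way to do that, not a tool provided by Drill. It assumes passwordless SSH to each node and that DRILL_HOME is defined for non-interactive shells on the remote machines. The function name drillbit_commands and the node names are hypothetical. The function only prints the commands, so you can review them before running anything.

```shell
#!/usr/bin/env bash

# Print (dry run) the ssh command that would run a drillbit.sh action on
# each node. Only the actions listed in this section are accepted.
# (Hypothetical helper, not part of Drill.)
drillbit_commands() {
  local action="$1"; shift
  case "$action" in
    start|stop|restart|status|autorestart) ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
  local node
  for node in "$@"; do
    # Single quotes keep $DRILL_HOME unexpanded so the remote shell resolves it.
    printf "ssh %s '\$DRILL_HOME/bin/drillbit.sh %s'\n" "$node" "$action"
  done
}

# Usage: review the commands, then pipe them to a shell to execute them.
#   drillbit_commands start node1 node2 node3
#   drillbit_commands start node1 node2 node3 | sh
```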

Connecting to the Cluster

The final step is starting the Drill shell. Execute the following command:

$DRILL_HOME/bin/sqlline -u jdbc:drill:zk=zkhost1,zkhost2,zkhost3:2181

If all goes well, the Drill shell will open. Execute the following query to verify the connection to the Drillbit:

SELECT * FROM sys.drillbits;
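If you connect to several clusters, it can be convenient to build the connection URL from a list of ZooKeeper hosts. The following sketch writes the port after every host, which is a more explicit variant of the URL used above. The function name drill_jdbc_url is hypothetical, and the sketch assumes that all ZooKeeper hosts share one port.

```shell
#!/usr/bin/env bash

# Build a Drill JDBC URL from a ZooKeeper port and a list of host names,
# for example jdbc:drill:zk=zkhost1:2181,zkhost2:2181.
# (Hypothetical helper, not part of Drill.)
drill_jdbc_url() {
  local port="$1"; shift
  local hosts="" host
  for host in "$@"; do
    hosts="${hosts:+$hosts,}${host}:${port}"   # comma-separate the entries
  done
  echo "jdbc:drill:zk=${hosts}"
}

# Usage:
#   "$DRILL_HOME/bin/sqlline" -u "$(drill_jdbc_url 2181 zkhost1 zkhost2 zkhost3)"
```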

Drill also provides the drill-localhost script, which you can use to connect to Drill when ZooKeeper is installed on a local machine (typically when running a single Drillbit on your laptop).

Conclusion

If you’ve been following along on your computer, you should have a functioning installation of Apache Drill and are ready to query data on it. In the next few chapters, you will learn how to query various types of data using Drill.

1 Griffon is a version of Linux specially adapted for data science.
