Chapter 7. Connecting to Drill

In the past few chapters you have learned how to use Drill to query and explore data, and although this is extremely useful, you can get even more value out of Drill by using it as part of a larger data pipeline. Fortunately, Drill has many interfaces through which you can connect to it and query the results. In principle, the three methods are Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), and Drill’s RESTful interface. In this chapter, you will learn how to use these interfaces and connect to Drill using a wide variety of scripting languages and tools, including Python, R, Java, PHP, and Node.js, as well as some business intelligence (BI) tools such as Apache Superset, Zeppelin, and Tableau.

Understanding Drill’s Interfaces

Before we get into the various language-specific libraries that enable you to easily connect to Drill, it is a good idea to understand the ways in which these libraries connect to Drill. Fundamentally, there are three ways to connect to it:1

  • JDBC

  • ODBC

  • REST interface

In theory, you can use any program that implements one of these standards to programmatically connect to Drill. For instance, if you have some database explorer or visualization tool that uses ODBC, you can use Drill’s ODBC interface to connect the tool to Drill.

JDBC and Drill

Drill is implemented in Java, and as such it makes sense that Drill should implement the JDBC standard to enable other Java-based programs to connect to it. To use JDBC, you first need a copy of the Drill JDBC driver on your machine. The JDBC driver is included with Drill and is located at $DRILL_HOME/jars/jdbc-driver/drill-jdbc-all-version.jar. The driver class’s name is org.apache.drill.jdbc.Driver.2 Copy it to the machine that has the application you want to connect to Drill, then add it to that application’s class path. Consult the documentation of your application for details.

Regardless of the application, to connect to Drill, you need to craft a JDBC connection URL. If you are using Drill in embedded mode on your local machine, that URL will be as follows:

jdbc:drill:drillbit=localhost:31010

You can of course, replace localhost with your hostname or IP address.

If you are connecting to Drill in distributed mode, you will need to know the name of your ZooKeeper cluster and you will use this form of URL: jdbc:drill:zk=zkhost1:2818,zkhost2:2181,.... Additionally, there are various optional configuration variables that you can pass in your JDBC connection URL, including the following:

cluster ID

The cluster name as set in drill-override.conf. The default is drillbits1. The default is typically fine unless you’ve customized this property in Drill’s drill-override.conf file.

directory

The ZooKeeper root znode for the Drill cluster. The default is root. Again, the default is typically fine unless you’ve customized this property in Drill’s drill-override.conf file.

port

The connection port. The port to connect directly to a Drillbit is 31010 (unless you changed this value in drill-override.conf.) The ZooKeeper default is 2181, and for a MapR cluster it is 5181.

schema

The default storage plug-in for the connection. This parameter is optional.

tries

The number of connection attempts. The default is 5.

These values must be the same as those specified in the drill-override.conf file for the target Drill cluster.

In the following code snippet, you see the first of two examples of JDBC connection strings. This one demonstrates how to connect to Drill via a single ZooKeeper instance, as is used for development:

jdbc:drill:zk=zkhost:2181

The second example demonstrates how to connect to Drill via a multinode ZooKeeper cluster:

jdbc:drill:zk=10.20.100.20:2181,
              10.20.100.21:2181,
              10.20.100.22:2181;schema=mongodb

When there is more than one ZooKeeper node, Drill randomly connects to any of them and then randomly picks a Drillbit. If the ZooKeeper node happens to be down, Drill will try another. The complete documentation about connecting to Drill with JDBC is available in the documentation on the Drill website.

ODBC and Drill

In addition to connecting to Drill using JDBC, Drill supports using ODBC as a connection method. Configuring ODBC requires installing the ODBC driver on your system and configuring some files on your system.

Installing the ODBC driver

Installing the ODBC driver can be complicated. Complete instructions downloading it are available in the Drill documentation. After you’ve installed the ODBC driver, you need to configure it. Complete instructions to do so are also available in the Drill documentation.

Configuring ODBC on Linux or macOS

If you are installing the ODBC driver on a Mac or Unix machine, you must also install an ODBC driver manager, such as iODBC.

After downloading and installing the MapR ODBC driver, there are three configuration files that you need to install and configure. These files will be copied to /opt/mapr/drill/Setup/ after you’ve installed the ODBC driver, and you’ll need to copy them to your home directory and rename them as hidden files.

.odbc.ini

Contains the definition of your ODBC data sources

.mapr.drillodbc.ini

Contains driver configuration variables

.odbcinst.ini (optional)

Defines the ODBC driver

Then you need to set the following environment variables to point to these files:

export ODBCINI=~/.odbc.ini
export MAPRDRILLINI=~/.mapr.drillodbc.ini
export LD_LIBRARY_PATH=/usr/local/lib

Configuring ODBC on Windows

After downloading and installing the MapR ODBC driver, navigate to the ODBC Data Source Administrator and click the Settings tab. On the System DSN tab, select MapR Drill, and then click Configure. To connect to Drill in embedded mode or as a single server on your local machine, define the following variables:

  • HOST: localhost

  • PORT: 31010

  • Catalog: DRILL

If you are connecting to a cluster in distributed mode, you’ll need to define the following variables:

ZKQuorum

A comma-separated list of ZooKeeper nodes in the format host or IP:port,host or IP:port

ZKClusterID

The name of the Drillbit cluster that can be found in the drill-override.conf file. Defaults to drillbits1.

Now that you’ve learned how to set up ODBC and JDBC to be used with Drill, in the next section, you’ll learn how to write scripts to connect to Drill’s various interfaces.

Drill’s REST Interface

In addition to JDBC and ODBC, Drill also has a RESTful interface that you can use to execute queries as well as modify configuration settings. Drill’s REST API uses the web console URL:

http://hostname:8047/function

The REST API has the following functions:

Cluster (cluster.json)

Gets Drillbit information.

Options (options.json)

Gets the name, default, and data type of the system and session options.

Query (query.json)

Executes a query. You must use a POST request to execute a query.

Profiles (profiles.json)

Gets the profiles of running queries.

Status (status.json)

Gets the Drill status.

Storage (storage.json)

Gets storage plug-in configuration and allows you to create, modify, or delete storage plug-in configurations. You must use a POST request to change the storage plug-in configuration.

Limitations of the RESTful Interface

Although the RESTful interface is the easiest to work with, it has two key limitations:

  • The RESTful interface returns the entire result set in a single REST response, making it unsuitable for extremely large queries.

  • The RESTful interface is stateless, so you cannot use the USE or ALTER SESSION commands. 

The complete documentation for Drill’s RESTful interface is available on the Apache Drill website.

Connecting to Drill with Python

Python is certainly one of the most popular languages for data science and data analysis. Furthermore, Python has an ever-expanding ecosystem of data analysis, visualization, and machine learning libraries that makes it one of the best tools for data science.

At the core of this ecosystem is the Pandas library, which provides a collection of vectorized data structures. Most relevant to Drill is the DataFrame, which is a two-dimensional data structure. It is probably the most important component of the data science ecosystem in Python and is used as the basic container of data for visualization and machine learning. Therefore, in this section, we demonstrate how to quickly and easily get data directly from a Drill query into a Pandas DataFrame.

There are two Python modules, drillpy and pydrill, that you can use to connect to Drill. Both are wrappers for Drill’s RESTful interface, so both have the limitations associated with that, but both work reasonably well, and as the Drill community improves the RESTful interface, so too will the drivers improve. drillpy is notably different in that it implements Python’s DBAPI standard and thus you can use it with other modules that require a database connection.

Using drillpy to Query Drill

The basic steps for connecting to Drill with drillpy are as follows:

  1. Create the connection object.

  2. Create a cursor object.

  3. Execute the query using the execute() method.

  4. Retrieve the results by calling one of the fetchone(), fetchmany(<n>), or fetchall() methods. fetchall() and fetchmany() both return Pandas DataFrames, and fetchone() returns a Pandas Series object.

If necessary, first install the module by running:

pip install drillpy

The snippet that follows demonstrates how to create a connection object in Python using drillpy. If you are using Drill in embedded mode, you can use localhost for the hostname. The default port is 8047 for the RESTful interface:

from drillpy import connect
con = connect(host="host", 
   db="database_name", 
   port=port_number)

After creating the connection object, call the cursor() function to create the cursor object, as shown in the following snippet:

cur = con.cursor()

Now you are ready to execute a query and get the results. The following snippet demonstrates how to execute a query and load the results directly into a DataFrame:

queryResult = cur.execute("query")
df = query.fetchall()

If you are executing multiple queries with variable parameters, you can use parameter substitution in a similar manner as with Python’s SQLLite libraries, where the substitution is handled by a ? in the query string. For example:

query = cur.execute("SELECT * FROM mytable WHERE param1 = ? AND param2 = ?", 
  (var1,var2))

Connecting to Drill Using pydrill

pydrill is another wrapper for Drill’s RESTful interface, but it is a little more robust and includes access to many other aspects of Drill that are exposed in the RESTful interface besides just executing a query. pydrill is not DBAPI-compliant, and therefore if you want to use the database connection object in another module, it likely will not work. However pydrill can transfer query results directly into a Pandas DataFrame.

The process for connecting to Drill using pydrill is basically the same as with drillpy:

  1. Create a connection object. pydrill has an .isActive() function to verify the connection was in fact successful.

  2. Execute the query.

  3. Iterate through the results or transfer them to a DataFrame.

Like drillpy, you can install pydrill by using pip, as shown here:

pip install pydrill

The following code snippet demonstrates how to query Drill using pydrill:

from pydrill.client import pydrill

#Connect to Drill
drill = pydrill(host='hostname', port=8047)

#Check to see if the connection succeeded
if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')

#Execute the query
queryResult = drill.query('query')

#Export the results to a Pandas DataFrame
df = queryResult.to_dataframe()

Other functionality of pydrill

In addition to executing simple queries, pydrill has a lot of other functionality, including:

  • Activating or deactivating a Drill storage plug-in as well as getting details about the available storage plug-ins

  • Accessing usage statistics

  • Accessing information about the query plans and much more

You can find the complete documentation for pydrill on the project’s website.

Other Ways of Connecting to Drill from Python

Because Drill is ODBC- and JDBC-compliant, it is possible to query Drill by using the pyODBC or JayDeBeApi modules. Although using these modules does get you access to additional functionality, they are also more difficult to configure and debug. JDBC has several key advantages over the RESTful interface; specifically, it can handle large result sets, security and authentication are easier to enforce using JDBC, and REST isn’t usable in a large-scale cluster because there is no REST API to connect to ZooKeeper to determine whether a Drillbit is still alive and part of the cluster. You will find that it is easier to use the Python modules that are based on the RESTful interface such as PyDrill and DrillPy because you do not have to configure your system in order to use them.

The following code snippet demonstrates how to query Drill via JDBC using JayDeBeApi, Python’s JDBC module. After importing the module, the most difficult step is creating the connection object using the jaydebeapi.connect() function. There are four arguments to this function, as shown here:

conn = jaydebeapi.connect("org.apache.drill.jdbc.Driver", 
                          "jdbc connection string",
                          ["username", "password"],
                          "path_to_Drill_JDBC_driver",)

The following complete snippet demonstrates how to query Drill using JayDeBeApi and import the results into a Pandas DataFrame. The driver version will be different for each version of Drill, so make sure you are using the appropriate version:

import jaydebeapi
import pandas as pd

#Create the connection object
conn = jaydebeapi.connect("org.apache.drill.jdbc.Driver", 
                     "jdbc:drill:drillbit=localhost:31010",
                       ["admin", "password"],
                       "drill_path/jars/jdbc-driver/drill-jdbc-all-version.jar")

#Create the cursor object
curs = conn.cursor()

#Execute the query
curs.execute("SELECT * FROM cp.`employee.json` LIMIT 20")

#Get the results
curs.fetchall()

#Read query results into a Pandas DataFrame
df = pd.read_sql("SELECT * FROM cp.`employee.json` LIMIT 20", conn)

The following snippet demonstrates how to query Drill in embedded mode in Python using Drill’s ODBC interface with pyODBC. Using ODBC is a little more complex than JDBC because in order for this code to work properly, you must make sure the ODBC driver is correctly installed on your system. Also, crafting the DSN is more complex than the JDBC connection string:

import pyodbc
import re
import pandas as pd

#This section from https://github.com/cjmatta/drill_ipython_notebook
MY_DSN = """
Driver                = /opt/mapr/drillodbc/lib/universal/libmaprdrillodbc.dylib
ConnectionType        = Direct
ZKQuorum              = 
ZKClusterID           = 
Catalog               = DRILL
AuthenticationType    = No Authentication
AdvancedProperties    = CastAnyToVarchar=true
HandshakeTimeout      = 5
QueryTimeout          = 180
TimestampTZDisplayTimezone = utc
ExcludedSchemas       = sys,INFORMATION_SCHEMA
NumberOfPrefetchBuffers    = 5
HOST                  = localhost
PORT                  = 31010
"""

#Build DSN
MY_DSN = ";".join(
    [re.sub(r'(	+|s+)=s+', '=', i) for i in MY_DSN.split('
') if i != '']
)

#Create the connection object and cursor
conn = pyodbc.connect(MY_DSN, UID='admin', PWD='password', autocommit=True)
cursor = conn.cursor()

#Set character encoding
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
conn.setencoding(encoding='utf-8')

#Execute the query and get the results
sql = 'SELECT * FROM cp.`employee.json` LIMIT 20'
cursor.execute(sql)
sql_result = cursor.fetchall()

#Read the data into a DataFrame
df = pd.read_sql('SELECT * FROM cp.`employee.json` LIMIT 20', conn)
df.sample(5)

Connecting to Drill Using R

If you are an R user, fear not, as there is a module called sergeant3 that enables you to query Drill directly from R and import the results into a DataFrame. sergeant allows you to connect to Drill in three ways: the DBI, RJDBC, and dplyr interfaces. sergeant also wraps all the functionality found in the Drill RESTful API, so you can activate, deactivate, or delete storage plug-ins. You can find the complete documentation at the sergeant website.

Querying Drill from R Using sergeant

To use sergeant, of course you must install it, which you can accomplish by using the following command:

devtools::install_github("hrbrmstr/sergeant")

After you’ve installed sergeant, the process for querying Drill is much like that for using one of the Python modules:

  1. Create the connection.

  2. Optionally check to see if the connection succeeded.

  3. Execute the query.

The following code snippet demonstrates how to execute a query in Drill through R:

#Import sergeant
library(sergeant)

#Create the connection to Drill
conn <- drill_connection("hostname") 

#Check to see if the connection was successful
drill_active(conn)

#Execute the query
drill_query(conn, "query")

As you can see, it really isn’t too difficult to use R to query Drill.

Accessing other functionality in R

In addition to executing queries in Drill, sergeant can access other functionality in the Drill RESTful interface, much like pydrill. Here are some selected functions:

drill_active()

Tests whether the Drill HTTP REST API server is up

drill_metrics()

Gets the current memory metrics

drill_options()

Lists the name, default, and data type of the system and session options

drill_profile()

Gets the profile of the query that has the given query ID

drill_set()

Sets Drill SYSTEM or SESSION options

drill_show_files()

Shows files in a filesystem schema

drill_show_schemas()

Returns a list of available schemas

drill_stats()

Gets Drillbit information such as port numbers

drill_status()

Gets the status of Drill

drill_storage()

Gets the list of storage plug-in names and configurations

drill_uplift()

Turns columnar query results into a type-converted table

drill_use()

Changes to a particular schema

Connecting to Drill Using Java

Although there isn’t a dedicated module for Drill, it is relatively straightforward to query Drill using JDBC. For this to work, you must install the JDBC JAR file in the classpath. The driver class name is org.apache.drill.jdbc.Driver.

To connect using JDBC and Java, the basic process is as follows:

  1. Set the Drill driver.

  2. Create a Connection object using a JDBC connection string and by calling the DriverManager.getConnection() function.

  3. Create a Statement object by calling the connection.createStatement() function from your Connection object.

  4. Execute the query by calling the executeQuery() function from your Statement object.

  5. Iterate through the results by calling the appropriate get() function.

There are get() functions for every data type that exists in SQL databases, and a complete list of get() functions is available in the Java documentation.

This code snippet demonstrates how to query Drill in Java using JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class testDrillInterface {
  public static void main(String[] args) {
    try{
      //Choose the driver
      Class.forName("org.apache.drill.jdbc.Driver");
      
      //Create the Connection object
      Connection connection=DriverManager 
                   .getConnection("jdbc:drill:drillbit=localhost:31010");
      
      //Create the Statement object
      Statement st = connection.createStatement();
      
      //Execute the query 
      ResultSet rs = st.executeQuery("SELECT * from cp.`employees.json`");
      
      //Iterate through the results
      while(rs.next()){
        System.out.println(rs.getString(1));
      }
    } catch(Exception e){
      throw new RuntimeException(e);
    }
  }
}

Querying Drill with PHP

If PHP is your language or you are looking to write a web-based application using Drill as a backend, there is a PHP library to facilitate that. It is loosely modeled after PHP’s MySQL interface, so if you are familiar with that, you already pretty much know how to use the Drill connector.

Installing the Connector

The connector is on Packagist and can be installed by using composer, as follows:

composer require thedataist/drill-connector:dev-master

Using the Connector

The first step is to connect to Drill. The module uses Drill’s RESTful interface, so it doesn’t really make a “connection” in the same sense as with MySQL:

$drill = new DrillConnection('localhost', 8047);

As mentioned earlier, this creates the object, but doesn’t actually send anything to Drill. You can use the is_active() method to verify that your connection is active:

 if($drill->is_active()) {
    print("Connection Active");
  } else {
    print("Connection Inactive");
  } 

Querying Drill from PHP

Now that you’ve connected to Drill, you can query it in a similar way as MySQL, by calling the query() method. After you’ve called the query() method, you can use one of the fetch() methods to retrieve the results. Currently the Drill connector has the following methods:

fetch_all()

Returns all query results in an associative array

fetch_assoc()

Returns a single query row as an associative array

fetch_object()

Returns a single row as a PHP object

You might also find these functions useful:

data_seek($n)

Returns the row at index $n and sets the current row to $n

num_rows()

Returns the number of rows returned by the query

field_count()

Returns the number of columns returned by the query

Thus, if you want to execute a query in Drill, you can do so as follows:

$query_result = $drill->query("SELECT * FROM cp.`employee.json` LIMIT 20");
while($row = $query_result->fetch_assoc()) {
  print("Field 1: {$row['field1']}
");
  print("Field 2: {$row['field2']}
");
}

Interacting with Drill from PHP

You can also use the connector to activate and deactivate Drill’s storage as well as get information about Drill’s plug-ins, with the following methods:

disable_plugin($plugin)

Disables the given plug-in. Returns true if successful, false if not.

enable_plugin($plugin)

Enables the given plug-in. Returns true if successful, false if not.

get_all_storage_plugins()

Returns an array of all storage plug-ins.

get_disabled_storage_plugins()

Returns an array of all disabled plug-ins.

get_enabled_storage_plugins()

Returns an array of all enabled plug-ins.

get_storage_plugins()

Returns an associative array of plug-ins and associated configuration options for all plug-ins.

get_storage_plugin_info($plugin)

Returns an associative array of configuration options for a given plug-in.

Querying Drill Using Node.js

Lastly, in much the same manner as querying Drill in other languages, you can query Drill using Node.js with the drill-client module. You can install it with this command:

npm install drill-client

After you have installed the Drill client, the procedure is the same as in all other languages. Here are the basic steps:

  1. Create the connection object by calling Drill.Client() with your connection parameters.

  2. Call the query() function.

Because Node.js is asynchronous it is necessary to put the query processing logic in the callback function, as demonstrated here:

var client = new Drill.Client({hostname:'localhost', port:8047});

client.query('SELECT * FROM cp.`employee.json` LIMIT 10', 
    function(error, data, columns){
          console.log({data: data['rows'], 
          columns: data['columns'], 
          error: error});
    }
);

Using Drill as a Data Source in BI Tools

In the previous sections, you saw how to connect to and query Drill in various programming languages. In this section, you will learn how to connect to Drill and use it as a data source in various popular BI and analytic tools to build interactive dashboards. This book cannot cover every possible tool, but you should be able to connect to most BI tools using either the JDBC or the ODBC interface.

Exploring Data with Apache Zeppelin and Drill

Apache Zeppelin is a notebooking tool that is becoming quite popular due to its tight integration with a variety of big data tools, such as Apache Spark. Zeppelin’s simplicity and visualization capabilities, demonstrated in Figure 7-1, make it an ideal companion tool for working with Apache Drill.

Figure 7-1. Zeppelin rendering Drill sample data

The easiest way to use Zeppelin with Apache Drill is to use the JDBC interpreter that ships with Zeppelin. If you didn’t install the JDBC interpreter with Zeppelin, you’ll need to install it using the following command:

$ZEPPELIN_HOME/bin/install-interpreter.sh --name jdbc

After you’ve installed the Zeppelin JDBC interpreter, you need to configure it within the Zeppelin interface.

Configuring Zeppelin to query Drill

The first step is to add Drill as a data source. To do that, in the upper-right corner of the screen, click the user settings menu, and then navigate to the Interpreter screen, as shown in Figure 7-2. Click +Create to create a new interpreter.

Figure 7-2. Zeppelin settings menu

On the Create screen, choose JDBC as the interface type, and then set the following configuration variables, as shown in Figure 7-3:

default.driver

This should be set to org.apache.drill.jdbc.Driver.

default.password

This is your account password.

default.url

This is your JDBC connection string to Drill. To use Drill with Zeppelin in embedded mode or when a single Drillbit is running on your local machine, the connection string is jdbc:drill:drillbit=localhost:31010.

At the bottom of the screen, in the Dependencies section, add the path to the Drill JDBC driver as an artifact.

Figure 7-3. Configuring the Zeppelin interpreter

At this point, you should be able to query Drill in a Zeppelin notebook by simply setting the notebook’s default interpreter to Drill.

Querying Drill from a Zeppelin notebook

If you’ve successfully set up Drill as an interpreter in Zeppelin, all you need to do to query Drill is simply type a SQL query into a box in a Zeppelin notebook, as illustrated here:

SELECT * 
FROM cp.`employee.json`

This query is the default query for Drill. If you enter it in Zeppelin, you will see the query results depicted in Figure 7-4.

Figure 7-4. Query results in Zeppelin

By itself this doesn’t really add much value beyond Drill’s user interface; however, you can perform aggregations and other operations directly in the Zeppelin notebook. As an example, let’s say that you wanted to find the average salary by education level in the sample data file included with Drill. You could write the following query:

SELECT education_level, AVG(CAST(salary AS FLOAT)) AS avg_salary
FROM cp.`employee.json`
GROUP BY education_level

However, in Zeppelin, you can accomplish the same task starting from the original query, which outputs all the results. In Figure 7-5, you can see how to take the original results, perform the aggregation, and visualize the results.

Figure 7-5. Query results in Zeppelin

Adding interactivity in Zeppelin

One of the features that makes Zeppelin powerful is the ability to create interactive reports directly from a SQL query. In the WHERE clause of a query, you can specify variables that you can change and cause Zeppelin to dynamically redraw the visualization.

The following query has a WHERE clause of gender='${gender=M,M|F}', which creates a drop-down menu with two options for the gender:

SELECT * 
FROM cp.`employee.json`
WHERE `gender`='${gender=M,M|F}'

Changing the drop-down menu selection causes the visualization to be redrawn, as shown in Figure 7-6.

Figure 7-6. Visualization with dynamic variable

As you can see, Zeppelin is a good companion tool for Drill and requires little special configuration to quickly visualize and explore your query results.

Exploring Data with Apache Superset

Another open source BI tool that works well with Drill is a relatively new one, which as of this writing is in incubation with the Apache Software Foundation. This tool is called Superset, and it was developed by Airbnb and is implemented in Python using SQLAlchemy as a means of interacting with SQL-based data sources. Superset offers more sophisticated visualizations than Zeppelin, but at the time of writing, the project is still being developed and not all features work well with Drill.

According to the documentation, Superset’s features include:

  • A rich set of data visualizations

  • An easy-to-use interface for exploring and visualizing data

  • The ability to create and share dashboards

  • Enterprise-ready authentication with integration with major authentication providers (database, OpenID, LDAP, OAuth, and REMOTE_USER through Flask AppBuilder)

  • An extensible, high-granularity security/permission model allowing intricate rules on who can access individual features and the dataset

  • A simple semantic layer, allowing users to control how data sources are displayed in the UI by defining which fields should show up in which drop-down and which aggregation and function metrics are made available to the user

Windows does not support Superset, but it works well on Linux and macOS machines. You can install Superset via the following commands:

pip install superset
fabmanager create-admin --app superset
superset db upgrade
superset load_examples
superset init

In addition to the standard SQLAlchemy, you need to install the Drill dialect for SQLAlchemy before running Superset. To accomplish this step, either clone the sqlalchemy-drill GitHub repository or download and unzip the files directly from GitHub. When you’ve done that, navigate to the folder you just cloned or extracted and execute the following command:

python setup.py install

At this point, you are ready to configure Superset to work with Drill.

Configuring Superset to work with Drill

After you install Superset and the additional drivers, you need to add Drill as a data source. Click the Sources menu and select the Databases option, as shown in Figure 7-7.

Figure 7-7. Adding Drill as a data source

When you are in the databases section, in the upper-right corner of the List Databases pane, click the + button to add a new data source, as shown in Figure 7-8.

Figure 7-8. Adding Drill as a data source to Superset

On the Add Database screen, set up the configuration to allow Superset to communicate with Drill. You need to write a SQLAlchemy URI in the format shown here:

drill+sadrill://username:password@host:port/storage_plugin/workspace
?use_ssl=True

If you are connecting to Drill in embedded mode, you can use this connection string:

drill+sadrill://localhost:8047/dfs/drillclass?use_ssl=False

Enter your connection URI in the appropriate field, and you should be able to successfully test the connection, as demonstrated in Figure 7-9.

Figure 7-9. Drill configuration in Superset

You will want to select the “Expose in SQL Lab” and “Allow Run Sync” checkboxes. If you are querying large datasets or have queries that will take a long time to run, it would also be a good idea to set up Celery4 to allow asynchronous connections.

Building a demonstration visualization using Drill and Superset

Now that you have Superset working with Drill, this section walks you through building a visualization. In the GitHub repository for this book, you will find a data file called baltimore_salaries_2016.csvh, which contains data about the salaries of all of the employees of the City of Baltimore.5

For this example, we will calculate the average salary per department and visualize the results. The following query will return those results:

SELECT Agency, TO_NUMBER(`AnnualSalary`, '¤') AS AnnualSalary
FROM dfs.demo.`baltimore_salaries_2016.csvh`

To visualize this data, the first step is to open SQL Lab and enter the query. Next, click the Run Query button; the results should look like those shown in Figure 7-10.

Figure 7-10. Query results in SQL Lab

When you have the query results, click the Visualize button, set the options as depicted in Figure 7-11, and then click Visualize.

Figure 7-11. Superset chart configuration

At this point, after some minor configuration changes, you should have a visualization similar to the one shown in Figure 7-12.

Figure 7-12. Visualization in Superset

When you save the visualization, you can add it to a pre-existing dashboard or create a new dashboard.

Using Drill with Other Visualization Tools

You can use Drill with many other visualization tools, including Tableau, Qlix, JReport, MicroStrategy Analytics, and others that have ODBC or JDBC support.

Conclusion

In this chapter, you learned how to programmatically connect to Drill using a variety of languages via Drill’s interfaces. This capability enables you to use Drill to automate data flows as well as integrate Drill as a data source for various BI tools. Now that you know how to use Drill to query your data, the next step is to explore the details of Drill’s schema-on-read approach.

1 There actually are two other ways of connecting to Drill: a native Drill client and a C clone of the Drill client. These are both beyond the scope of the book.

2 There is an additional JDBC driver developed by MapR that has additional functionality beyond what is described here.

3 Get it?...Drill sergeant.

4 Celery is an open source distributed queue with a focus on real-time processing.

5 This dataset is available at the Baltimore City website.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.221.165.126