In the past few chapters you have learned how to use Drill to query and explore data, and although this is extremely useful, you can get even more value out of Drill by using it as part of a larger data pipeline. Fortunately, Drill has many interfaces through which you can connect to it and query the results. In principle, the three methods are Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), and Drill’s RESTful interface. In this chapter, you will learn how to use these interfaces and connect to Drill using a wide variety of scripting languages and tools, including Python, R, Java, PHP, and Node.js, as well as some business intelligence (BI) tools such as Apache Superset, Zeppelin, and Tableau.
Before we get into the various language-specific libraries that enable you to easily connect to Drill, it is a good idea to understand the ways in which these libraries connect to Drill. Fundamentally, there are three ways to connect to it:1
JDBC
ODBC
REST interface
In theory, you can use any program that implements one of these standards to programmatically connect to Drill. For instance, if you have some database explorer or visualization tool that uses ODBC, you can use Drill’s ODBC interface to connect the tool to Drill.
Drill is implemented in Java, and as such it makes sense that Drill should implement the JDBC standard to enable other Java-based programs to connect to it. To use JDBC, you first need a copy of the Drill JDBC driver on your machine. The JDBC driver is included with Drill and is located at $DRILL_HOME/jars/jdbc-driver/drill-jdbc-all-version.jar. The driver class's name is org.apache.drill.jdbc.Driver.2 Copy the driver to the machine running the application you want to connect to Drill, and then add it to that application's classpath. Consult your application's documentation for details.
Regardless of the application, to connect to Drill, you need to craft a JDBC connection URL. If you are using Drill in embedded mode on your local machine, that URL will be as follows:
jdbc:drill:drillbit=localhost:31010
You can, of course, replace localhost with your hostname or IP address.
If you are connecting to Drill in distributed mode, you will need to know the name of your ZooKeeper cluster, and you will use this form of URL: jdbc:drill:zk=zkhost1:2181,zkhost2:2181,.... Additionally, there are various optional configuration variables that you can pass in your JDBC connection URL, including the following:
cluster ID
The cluster name as set in drill-override.conf. The default, drillbits1, is typically fine unless you've customized this property in Drill's drill-override.conf file.
directory
The ZooKeeper root znode for the Drill cluster. The default, root, is typically fine unless you've customized this property in Drill's drill-override.conf file.
port
The connection port. The port to connect directly to a Drillbit is 31010 (unless you changed this value in drill-override.conf.) The ZooKeeper default is 2181, and for a MapR cluster it is 5181.
schema
The default storage plug-in for the connection. This parameter is optional.
tries
The number of connection attempts. The default is 5.
These values must be the same as those specified in the drill-override.conf file for the target Drill cluster.
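Putting these options together, a distributed-mode connection URL takes the following general shape. Per the Drill JDBC documentation, the ZooKeeper directory and cluster ID are appended to the ZooKeeper list as path segments; the hostnames, directory, cluster ID, and schema below are illustrative placeholders:

```
jdbc:drill:zk=zkhost1:2181,zkhost2:2181,zkhost3:2181/drill/drillbits1;schema=dfs
```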
In the following code snippet, you see the first of two examples of JDBC connection strings. This one demonstrates how to connect to Drill via a single ZooKeeper instance, as is used for development:
jdbc:drill:zk=zkhost:2181
The second example demonstrates how to connect to Drill via a multinode ZooKeeper cluster:
jdbc:drill:zk=10.20.100.20:2181,10.20.100.21:2181,10.20.100.22:2181;schema=mongodb
When there is more than one ZooKeeper node, Drill randomly connects to any of them and then randomly picks a Drillbit. If the ZooKeeper node happens to be down, Drill will try another. The complete documentation about connecting to Drill with JDBC is available in the documentation on the Drill website.
In addition to connecting to Drill using JDBC, Drill supports ODBC as a connection method. Using ODBC requires installing the ODBC driver and editing a few configuration files on your system.
Installing the ODBC driver can be complicated. Complete instructions for downloading and installing it are available in the Drill documentation. After you've installed the ODBC driver, you need to configure it; complete instructions for doing so are also available in the Drill documentation.
If you are installing the ODBC driver on a Mac or Unix machine, you must also install an ODBC driver manager, such as iODBC.
After downloading and installing the MapR ODBC driver, there are three configuration files that you need to install and configure. These files will be copied to /opt/mapr/drill/Setup/ after you’ve installed the ODBC driver, and you’ll need to copy them to your home directory and rename them as hidden files.
.odbc.ini
Contains the definition of your ODBC data sources
.mapr.drillodbc.ini
Contains driver configuration variables
.odbcinst.ini
Defines the ODBC driver
Then you need to set the following environment variables to point to these files:
export ODBCINI=~/.odbc.ini
export MAPRDRILLINI=~/.mapr.drillodbc.ini
export LD_LIBRARY_PATH=/usr/local/lib
After downloading and installing the MapR ODBC driver, navigate to the ODBC Data Source Administrator and click the Settings tab. On the System DSN tab, select MapR Drill, and then click Configure. To connect to Drill in embedded mode or as a single server on your local machine, define the following variables:
HOST: localhost
PORT: 31010
Catalog: DRILL
If you are connecting to a cluster in distributed mode, you’ll need to define the following variables:
ZKQuorum
A comma-separated list of ZooKeeper nodes in the format host_or_IP:port,host_or_IP:port
ZKClusterID
The name of the Drillbit cluster, which can be found in the drill-override.conf file. Defaults to drillbits1.
Now that you’ve learned how to set up ODBC and JDBC to be used with Drill, in the next section, you’ll learn how to write scripts to connect to Drill’s various interfaces.
In addition to JDBC and ODBC, Drill also has a RESTful interface that you can use to execute queries as well as modify configuration settings. Drill’s REST API uses the web console URL:
http://hostname:8047/function
The REST API has the following functions:
/cluster.json
Gets Drillbit information.
/options.json
Gets the name, default, and data type of the system and session options.
/query.json
Executes a query. You must use a POST request to execute a query.
/profiles.json
Gets the profiles of running queries.
/status
Gets the Drill status.
/storage.json
Gets storage plug-in configuration and allows you to create, modify, or delete storage plug-in configurations. You must use a POST request to change the storage plug-in configuration.
Although the RESTful interface is the easiest to work with, it has two key limitations: it returns the entire result set at once, making it a poor fit for very large results, and it has no way to consult ZooKeeper, so it cannot determine which Drillbits in a cluster are alive.
The complete documentation for Drill’s RESTful interface is available on the Apache Drill website.
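To make the REST interface concrete, the following sketch builds the JSON payload that Drill's query endpoint expects and shows, in comments, how it would be POSTed. The payload format ({"queryType": "SQL", "query": ...}) and the /query.json endpoint come from the Drill REST documentation; the hostname and the use of the requests library are assumptions:

```python
import json

def drill_query_payload(sql):
    """Build the JSON body for a POST to Drill's /query.json endpoint."""
    return {"queryType": "SQL", "query": sql}

payload = drill_query_payload("SELECT * FROM cp.`employee.json` LIMIT 5")
print(json.dumps(payload))

# To execute against a running Drillbit (assumed at localhost:8047):
#   import requests
#   r = requests.post("http://localhost:8047/query.json", json=payload)
#   rows = r.json()["rows"]   # each row is a dict keyed by column name
```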
Python is certainly one of the most popular languages for data science and data analysis. Furthermore, Python has an ever-expanding ecosystem of data analysis, visualization, and machine learning libraries that makes it one of the best tools for data science.
At the core of this ecosystem is the Pandas library, which provides a collection of vectorized data structures. Most relevant to Drill is the DataFrame, a two-dimensional data structure. It is probably the most important component of the data science ecosystem in Python and is used as the basic container of data for visualization and machine learning. Therefore, in this section, we demonstrate how to quickly and easily get data directly from a Drill query into a Pandas DataFrame.
There are two Python modules, drillpy and pydrill, that you can use to connect to Drill. Both are wrappers for Drill's RESTful interface, so both share its limitations, but both work reasonably well, and as the Drill community improves the RESTful interface, the drivers will improve as well. drillpy is notably different in that it implements Python's DBAPI standard, so you can use it with other modules that require a database connection.
The basic steps for connecting to Drill with drillpy are as follows:
Create the connection object.
Create a cursor object.
Execute the query using the execute() method.
Retrieve the results by calling one of the fetchone(), fetchmany(<n>), or fetchall() methods. fetchall() and fetchmany() both return Pandas DataFrames, and fetchone() returns a Pandas Series object.
If necessary, first install the module by running:
pip install drillpy
The snippet that follows demonstrates how to create a connection object in Python using drillpy. If you are using Drill in embedded mode, you can use localhost for the hostname. The default port is 8047 for the RESTful interface:
from drillpy import connect

con = connect(host="host", db="database_name", port=port_number)
After creating the connection object, call the cursor() function to create the cursor object, as shown in the following snippet:
cur = con.cursor()
Now you are ready to execute a query and get the results. The following snippet demonstrates how to execute a query and load the results directly into a DataFrame:
queryResult = cur.execute("query")
df = queryResult.fetchall()
If you are executing multiple queries with variable parameters, you can use parameter substitution in a similar manner as with Python's SQLite library, where the substitution is handled by a ? in the query string. For example:
query = cur.execute(
    "SELECT * FROM mytable WHERE param1 = ? AND param2 = ?",
    (var1, var2))
pydrill is another wrapper for Drill's RESTful interface, but it is a little more robust and includes access to many other aspects of Drill that are exposed in the RESTful interface besides just executing a query. pydrill is not DBAPI-compliant, and therefore if you want to use the database connection object in another module, it likely will not work. However, pydrill can transfer query results directly into a Pandas DataFrame.
The process for connecting to Drill using pydrill is basically the same as with drillpy:
Create a connection object. pydrill has an is_active() function to verify that the connection was successful.
Execute the query.
Iterate through the results or transfer them to a DataFrame.
Like drillpy, you can install pydrill by using pip, as shown here:
pip install pydrill
The following code snippet demonstrates how to query Drill using pydrill:
from pydrill.client import PyDrill

#Connect to Drill
drill = PyDrill(host='hostname', port=8047)

#Check to see if the connection succeeded
if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')

#Execute the query
queryResult = drill.query('query')

#Export the results to a Pandas DataFrame
df = queryResult.to_dataframe()
In addition to executing simple queries, pydrill has a lot of other functionality, including:
Activating or deactivating a Drill storage plug-in as well as getting details about the available storage plug-ins
Accessing usage statistics
Accessing information about the query plans and much more
You can find the complete documentation for pydrill on the project's website.
Because Drill is ODBC- and JDBC-compliant, it is also possible to query Drill by using the pyODBC or JayDeBeApi modules. Although these modules give you access to additional functionality, they are also more difficult to configure and debug. JDBC has several key advantages over the RESTful interface: it can handle large result sets, security and authentication are easier to enforce, and the RESTful interface isn't usable in a large-scale cluster because it has no way to consult ZooKeeper to determine whether a Drillbit is still alive and part of the cluster. That said, you will find the Python modules based on the RESTful interface, such as pydrill and drillpy, easier to use because they require no system configuration.
The following code snippet demonstrates how to query Drill via JDBC using JayDeBeApi, Python's JDBC module. After importing the module, the most difficult step is creating the connection object using the jaydebeapi.connect() function. There are four arguments to this function, as shown here:
conn = jaydebeapi.connect(
    "org.apache.drill.jdbc.Driver",
    "jdbc connection string",
    ["username", "password"],
    "path_to_Drill_JDBC_driver")
The following complete snippet demonstrates how to query Drill using JayDeBeApi and import the results into a Pandas DataFrame. The driver version will be different for each version of Drill, so make sure you are using the appropriate version:
import jaydebeapi
import pandas as pd

#Create the connection object
conn = jaydebeapi.connect(
    "org.apache.drill.jdbc.Driver",
    "jdbc:drill:drillbit=localhost:31010",
    ["admin", "password"],
    "drill_path/jars/jdbc-driver/drill-jdbc-all-version.jar")

#Create the cursor object
curs = conn.cursor()

#Execute the query
curs.execute("SELECT * FROM cp.`employee.json` LIMIT 20")

#Get the results
curs.fetchall()

#Read query results into a Pandas DataFrame
df = pd.read_sql("SELECT * FROM cp.`employee.json` LIMIT 20", conn)
The following snippet demonstrates how to query Drill in embedded mode in Python using Drill's ODBC interface with pyODBC. Using ODBC is a little more complex than JDBC because, for this code to work properly, you must make sure the ODBC driver is correctly installed on your system. Also, crafting the DSN is more complex than the JDBC connection string:
import pyodbc
import re
import pandas as pd

#This section from https://github.com/cjmatta/drill_ipython_notebook
MY_DSN = """
Driver = /opt/mapr/drillodbc/lib/universal/libmaprdrillodbc.dylib
ConnectionType = Direct
ZKQuorum =
ZKClusterID =
Catalog = DRILL
AuthenticationType = No Authentication
AdvancedProperties = CastAnyToVarchar=true
HandshakeTimeout = 5
QueryTimeout = 180
TimestampTZDisplayTimezone = utc
ExcludedSchemas = sys,INFORMATION_SCHEMA
NumberOfPrefetchBuffers = 5
HOST = localhost
PORT = 31010
"""

#Build the DSN
MY_DSN = ";".join(
    [re.sub(r'( +|\s+)=\s+', '=', i) for i in MY_DSN.split('\n') if i != ''])

#Create the connection object and cursor
conn = pyodbc.connect(MY_DSN, UID='admin', PWD='password', autocommit=True)
cursor = conn.cursor()

#Set character encoding
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
conn.setencoding(encoding='utf-8')

#Execute the query and get the results
sql = 'SELECT * FROM cp.`employee.json` LIMIT 20'
cursor.execute(sql)
sql_result = cursor.fetchall()

#Read the data into a DataFrame
df = pd.read_sql('SELECT * FROM cp.`employee.json` LIMIT 20', conn)
df.sample(5)
If you are an R user, fear not, as there is a module called sergeant3 that enables you to query Drill directly from R and import the results into a DataFrame. sergeant allows you to connect to Drill in three ways: through the DBI, RJDBC, and dplyr interfaces. sergeant also wraps all the functionality found in the Drill RESTful API, so you can activate, deactivate, or delete storage plug-ins. You can find the complete documentation at the sergeant website.
To use sergeant, of course, you must first install it, which you can accomplish by using the following command:
devtools::install_github("hrbrmstr/sergeant")
After you’ve installed sergeant
, the process for querying Drill is much like that for using one of the Python modules:
Create the connection.
Optionally check to see if the connection succeeded.
Execute the query.
The following code snippet demonstrates how to execute a query in Drill through R:
#Import sergeant
library
(
sergeant
)
#Create the connection to Drill
conn
<-
drill_connection
(
"
hostname
"
)
#Check to see if the connection was successful
drill_active
(
conn
)
#Execute the query
drill_query
(
conn
,
"
query
"
)
As you can see, it really isn’t too difficult to use R to query Drill.
In addition to executing queries in Drill, sergeant can access other functionality in the Drill RESTful interface, much like pydrill. Here are some selected functions:
drill_active()
Tests whether the Drill HTTP REST API server is up
drill_metrics()
Gets the current memory metrics
drill_options()
Lists the name, default, and data type of the system and session options
drill_profile()
Gets the profile of the query that has the given query ID
drill_set()
Sets Drill SYSTEM or SESSION options
drill_show_files()
Shows files in a filesystem schema
drill_show_schemas()
Returns a list of available schemas
drill_stats()
Gets Drillbit information such as port numbers
drill_status()
Gets the status of Drill
drill_storage()
Gets the list of storage plug-in names and configurations
drill_uplift()
Turns columnar query results into a type-converted table
drill_use()
Changes to a particular schema
Although there isn’t a dedicated module for Drill, it is relatively straightforward to query Drill using JDBC. For this to work, you must install the JDBC JAR file in the classpath. The driver class name is org.apache.drill.jdbc.Driver
.
To connect using JDBC and Java, the basic process is as follows:
Set the Drill driver.
Create a Connection object using a JDBC connection string by calling the DriverManager.getConnection() function.
Create a Statement object by calling the createStatement() function on your Connection object.
Execute the query by calling the executeQuery() function on your Statement object.
Iterate through the results by calling the appropriate get() function.
There are get() functions for every data type that exists in SQL databases, and a complete list of get() functions is available in the Java documentation.
This code snippet demonstrates how to query Drill in Java using JDBC:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class testDrillInterface {
    public static void main(String[] args) {
        try {
            //Choose the driver
            Class.forName("org.apache.drill.jdbc.Driver");

            //Create the Connection object
            Connection connection = DriverManager.getConnection(
                "jdbc:drill:drillbit=localhost:31010");

            //Create the Statement object
            Statement st = connection.createStatement();

            //Execute the query
            ResultSet rs = st.executeQuery("SELECT * FROM cp.`employee.json`");

            //Iterate through the results
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
If PHP is your language or you are looking to write a web-based application using Drill as a backend, there is a PHP library to facilitate that. It is loosely modeled after PHP’s MySQL interface, so if you are familiar with that, you already pretty much know how to use the Drill connector.
The connector is on Packagist and can be installed by using composer, as follows:
composer require thedataist/drill-connector:dev-master
The first step is to connect to Drill. The module uses Drill’s RESTful interface, so it doesn’t really make a “connection” in the same sense as with MySQL:
$drill = new DrillConnection('localhost', 8047);
As mentioned earlier, this creates the object but doesn't actually send anything to Drill. You can use the is_active() method to verify that your connection is active:
if ($drill->is_active()) {
    echo("Connection Active");
} else {
    echo("Connection Inactive");
}
Now that you’ve connected to Drill, you can query it in a similar way as MySQL, by calling the query()
method. After you’ve called the query()
method, you can use one of the fetch()
methods to retrieve the results. Currently the Drill connector has the following methods:
fetch_all()
Returns all query results in an associative array
fetch_assoc()
Returns a single query row as an associative array
fetch_object()
Returns a single row as a PHP object
You might also find these functions useful:
data_seek($n)
Returns the row at index $n and sets the current row to $n
num_rows()
Returns the number of rows returned by the query
field_count()
Returns the number of columns returned by the query
Thus, if you want to execute a query in Drill, you can do so as follows:
$query_result = $drill->query("SELECT * FROM cp.`employee.json` LIMIT 20");

while ($row = $query_result->fetch_assoc()) {
    echo("Field 1: {$row['field1']}");
    echo("Field 2: {$row['field2']}");
}
You can also use the connector to activate and deactivate Drill’s storage as well as get information about Drill’s plug-ins, with the following methods:
disable_plugin($plugin)
Disables the given plug-in. Returns true if successful, false if not.
enable_plugin($plugin)
Enables the given plug-in. Returns true if successful, false if not.
get_all_storage_plugins()
Returns an array of all storage plug-ins.
get_disabled_storage_plugins()
Returns an array of all disabled plug-ins.
get_enabled_storage_plugins()
Returns an array of all enabled plug-ins.
get_storage_plugins()
Returns an associative array of plug-ins and associated configuration options for all plug-ins.
get_storage_plugin_info($plugin)
Returns an associative array of configuration options for a given plug-in.
Lastly, in much the same manner as querying Drill in other languages, you can query Drill using Node.js with the drill-client module. You can install it with this command:
npm install drill-client
After you have installed the Drill client, the procedure is the same as in all other languages. Here are the basic steps:
Create the connection object by calling Drill.Client() with your connection parameters.
Call the query() function.
Because Node.js is asynchronous, it is necessary to put the query-processing logic in the callback function, as demonstrated here:
var Drill = require('drill-client');

var client = new Drill.Client({hostname: 'localhost', port: 8047});

client.query(
    'SELECT * FROM cp.`employee.json` LIMIT 10',
    function(error, data, columns) {
        console.log({
            data: data['rows'],
            columns: data['columns'],
            error: error
        });
    }
);
In the previous sections, you saw how to connect to and query Drill in various programming languages. In this section, you will learn how to connect to Drill and use it as a data source in various popular BI and analytic tools to build interactive dashboards. This book cannot cover every possible tool, but you should be able to connect to most BI tools using either the JDBC or the ODBC interface.
Apache Zeppelin is a notebooking tool that is becoming quite popular due to its tight integration with a variety of big data tools, such as Apache Spark. Zeppelin’s simplicity and visualization capabilities, demonstrated in Figure 7-1, make it an ideal companion tool for working with Apache Drill.
The easiest way to use Zeppelin with Apache Drill is to use the JDBC interpreter that ships with Zeppelin. If you didn’t install the JDBC interpreter with Zeppelin, you’ll need to install it using the following command:
$ZEPPELIN_HOME/bin/install-interpreter.sh --name jdbc
After you’ve installed the Zeppelin JDBC interpreter, you need to configure it within the Zeppelin interface.
The first step is to add Drill as a data source. To do that, in the upper-right corner of the screen, click the user settings menu, and then navigate to the Interpreter screen, as shown in Figure 7-2. Click +Create to create a new interpreter.
On the Create screen, choose JDBC as the interface type, and then set the following configuration variables, as shown in Figure 7-3:
default.driver
This should be set to org.apache.drill.jdbc.Driver.
default.password
This is your account password.
default.url
This is your JDBC connection string to Drill. To use Drill with Zeppelin in embedded mode or when a single Drillbit is running on your local machine, the connection string is jdbc:drill:drillbit=localhost:31010.
At the bottom of the screen, in the Dependencies section, add the path to the Drill JDBC driver as an artifact.
At this point, you should be able to query Drill in a Zeppelin notebook by simply setting the notebook’s default interpreter to Drill.
If you’ve successfully set up Drill as an interpreter in Zeppelin, all you need to do to query Drill is simply type a SQL query into a box in a Zeppelin notebook, as illustrated here:
SELECT * FROM cp.`employee.json`
This query is the default query for Drill. If you enter it in Zeppelin, you will see the query results depicted in Figure 7-4.
By itself this doesn’t really add much value beyond Drill’s user interface; however, you can perform aggregations and other operations directly in the Zeppelin notebook. As an example, let’s say that you wanted to find the average salary by education level in the sample data file included with Drill. You could write the following query:
SELECT education_level,
    AVG(CAST(salary AS FLOAT)) AS avg_salary
FROM cp.`employee.json`
GROUP BY education_level
However, in Zeppelin, you can accomplish the same task starting from the original query, which outputs all the results. In Figure 7-5, you can see how to take the original results, perform the aggregation, and visualize the results.
One of the features that makes Zeppelin powerful is the ability to create interactive reports directly from a SQL query. In the WHERE clause of a query, you can specify variables that you can change, causing Zeppelin to dynamically redraw the visualization.
The following query has a WHERE clause of gender='${gender=M,M|F}', which creates a drop-down menu with two options for the gender:
SELECT * FROM cp.`employee.json`
WHERE `gender` = '${gender=M,M|F}'
Changing the drop-down menu selection causes the visualization to be redrawn, as shown in Figure 7-6.
As you can see, Zeppelin is a good companion tool for Drill and requires little special configuration to quickly visualize and explore your query results.
Another open source BI tool that works well with Drill is a relatively new one, which as of this writing is in incubation with the Apache Software Foundation. This tool is called Superset, and it was developed by Airbnb and is implemented in Python using SQLAlchemy as a means of interacting with SQL-based data sources. Superset offers more sophisticated visualizations than Zeppelin, but at the time of writing, the project is still being developed and not all features work well with Drill.
According to the documentation, Superset’s features include:
A rich set of data visualizations
An easy-to-use interface for exploring and visualizing data
The ability to create and share dashboards
Enterprise-ready authentication with integration with major authentication providers (database, OpenID, LDAP, OAuth, and REMOTE_USER through Flask AppBuilder)
An extensible, high-granularity security/permission model allowing intricate rules on who can access individual features and the dataset
A simple semantic layer, allowing users to control how data sources are displayed in the UI by defining which fields should show up in which drop-down and which aggregation and function metrics are made available to the user
Superset does not run on Windows, but it works well on Linux and macOS machines. You can install Superset via the following commands:
pip install superset
fabmanager create-admin --app superset
superset db upgrade
superset load_examples
superset init
In addition to the standard SQLAlchemy, you need to install the Drill dialect for SQLAlchemy before running Superset. To accomplish this step, either clone the sqlalchemy-drill
GitHub repository or download and unzip the files directly from GitHub. When you’ve done that, navigate to the folder you just cloned or extracted and execute the following command:
python setup.py install
At this point, you are ready to configure Superset to work with Drill.
After you install Superset and the additional drivers, you need to add Drill as a data source. Click the Sources menu and select the Databases option, as shown in Figure 7-7.
When you are in the databases section, in the upper-right corner of the List Databases pane, click the + button to add a new data source, as shown in Figure 7-8.
On the Add Database screen, set up the configuration to allow Superset to communicate with Drill. You need to write a SQLAlchemy URI in the format shown here:
drill+sadrill://username:password@host:port/storage_plugin/workspace?use_ssl=True
If you are connecting to Drill in embedded mode, you can use this connection string:
drill+sadrill://localhost:8047/dfs/drillclass?use_ssl=False
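As a quick sanity check on the URI format, the following small helper (hypothetical, not part of sqlalchemy-drill) assembles the connection string; the default storage plug-in and workspace values are placeholders:

```python
def drill_sqlalchemy_uri(host, port=8047, storage_plugin="dfs",
                         workspace="drillclass", use_ssl=False):
    """Assemble a SQLAlchemy URI for the drill+sadrill dialect."""
    return (f"drill+sadrill://{host}:{port}/"
            f"{storage_plugin}/{workspace}?use_ssl={use_ssl}")

print(drill_sqlalchemy_uri("localhost"))
# drill+sadrill://localhost:8047/dfs/drillclass?use_ssl=False
```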
Enter your connection URI in the appropriate field, and you should be able to successfully test the connection, as demonstrated in Figure 7-9.
You will want to select the “Expose in SQL Lab” and “Allow Run Sync” checkboxes. If you are querying large datasets or have queries that will take a long time to run, it would also be a good idea to set up Celery4 to allow asynchronous connections.
Now that you have Superset working with Drill, this section walks you through building a visualization. In the GitHub repository for this book, you will find a data file called baltimore_salaries_2016.csvh, which contains data about the salaries of all of the employees of the City of Baltimore.5
For this example, we will calculate the average salary per department and visualize the results. The following query will return those results:
SELECT Agency,
    TO_NUMBER(`AnnualSalary`, '¤') AS AnnualSalary
FROM dfs.demo.`baltimore_salaries_2016.csvh`
To visualize this data, the first step is to open SQL Lab and enter the query. Next, click the Run Query button; the results should look like those shown in Figure 7-10.
When you have the query results, click the Visualize button, set the options as depicted in Figure 7-11, and then click Visualize.
At this point, after some minor configuration changes, you should have a visualization similar to the one shown in Figure 7-12.
When you save the visualization, you can add it to a pre-existing dashboard or create a new dashboard.
In this chapter, you learned how to programmatically connect to Drill using a variety of languages via Drill’s interfaces. This capability enables you to use Drill to automate data flows as well as integrate Drill as a data source for various BI tools. Now that you know how to use Drill to query your data, the next step is to explore the details of Drill’s schema-on-read approach.
1 There actually are two other ways of connecting to Drill: a native Drill client and a C clone of the Drill client. These are both beyond the scope of the book.
2 There is an additional JDBC driver developed by MapR that has additional functionality beyond what is described here.
3 Get it?...Drill sergeant.
4 Celery is an open source distributed queue with a focus on real-time processing.
5 This dataset is available at the Baltimore City website.