Spark SQL as a distributed SQL engine

Spark SQL is generally used in two ways. The first is as a library: you write queries in SQL, Hive QL, or the DSL from applications written in languages such as Java, Scala, Python, or R. The second is as a distributed SQL engine, in which clients connect to a Thrift server and submit SQL or Hive QL queries over JDBC or ODBC interfaces. This is especially useful for data warehouse users who want to write and execute queries interactively from Business Intelligence (BI) tools. So, Spark SQL can be used both for data warehousing solutions and as a distributed SQL query engine.

Spark SQL's Thrift server for JDBC/ODBC access

Spark SQL's Thrift server provides JDBC and ODBC access to Spark SQL.

The Thrift JDBC server corresponds to HiveServer2 in Hive. You can test it with the beeline client or any other SQL client. From Spark 1.6 onward, the Thrift server runs in multi-session mode by default, which means each JDBC/ODBC connection owns its own copy of the SQL configuration and temporary function registry.
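If you need the old single-session behavior, the Spark documentation provides a switch: set spark.sql.hive.thriftServer.singleSession to true, either in spark-defaults.conf or when starting the server, for example:

./start-thriftserver.sh --conf spark.sql.hive.thriftServer.singleSession=true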

For a complete list of options for starting the Thrift server, use the following command. Note that the script accepts the same options as spark-submit, spark-shell, or the pyspark shell, plus Thrift-server-specific options such as --hiveconf:

./start-thriftserver.sh --help
Usage: ./sbin/start-thriftserver [options] [thrift server options]

To start the JDBC/ODBC server, run the following command in Spark's sbin directory:

./start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.bind.host=localhost --hiveconf hive.server2.thrift.port=10001

Check the Thrift server log in the logs directory; it shows the server connecting to the Hive metastore and then listening on port 10001. The JDBC/ODBC Server tab in Spark's UI shows the number of connected users, their session details, and SQL statistics. The easiest way to open this UI is to go to the YARN ResourceManager UI and click on the application's ApplicationMaster tracking UI link.
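For example, you can follow the server log as it starts from the Spark home directory; the log file name includes your user name and host name, so the wildcards below are illustrative:

tail -f logs/spark-*-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-*.out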

Apache Spark needs to be built with Hive support by adding the -Phive and -Phive-thriftserver profiles to the build options.
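For example, a Maven build with YARN and Hive support might look like the following, based on the standard Spark build instructions (choose the profiles that match your environment):

./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package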

Querying data using beeline client

After starting the Thrift server, it can be tested using beeline. With the default binary transport mode, connect using:

bin/beeline -u "jdbc:hive2://localhost:10001/default"

If the server was started in HTTP transport mode (hive.server2.transport.mode=http), include the HTTP parameters in the connection URL instead:

bin/beeline -u "jdbc:hive2://localhost:10001/default;hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice"

beeline will ask you for a username and password. In non-secure mode, providing a blank user ID and password will work.
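beeline is only one JDBC client; any JDBC application can connect to the Thrift server in the same way. The following is a minimal sketch in Java, assuming the Hive JDBC driver (org.apache.hive:hive-jdbc) and its dependencies are on the classpath; the class name is illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerClient {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver; JDBC 4.0+ drivers register
        // themselves automatically, so this line is just a safeguard.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Blank username and password work in non-secure mode.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10001/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("show tables")) {
            // Print each table name returned by Spark SQL.
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}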

Once you are connected to the beeline client, issue Hive QL or SQL commands:

0: jdbc:hive2://localhost:10001/default> show tables;
+------------+--------------+--+
| tableName  | isTemporary  |
+------------+--------------+--+
| customers  | false        |
| log        | false        |
| sample_07  | false        |
| sample_08  | false        |
| web_logs   | false        |
+------------+--------------+--+


Create a temporary table on top of a JSON data source, and then list the tables again:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "/user/cloudera/people.json"
);

0: jdbc:hive2://localhost:10001/default> show tables;
+------------+--------------+--+
| tableName  | isTemporary  |
+------------+--------------+--+
| customers  | false        |
| log        | false        |
| sample_07  | false        |
| sample_08  | false        |
| web_logs   | false        |
| jsontable  | true         |
+------------+--------------+--+

0: jdbc:hive2://localhost:10001/default> select * from jsontable;
+-------+----------+--+
|  age  |   name   |
+-------+----------+--+
| NULL  | Michael  |
| 30    | Andy     |
| 19    | Justin   |
+-------+----------+--+
3 rows selected (1.13 seconds)
0: jdbc:hive2://localhost:10001/default> select * from jsontable where age > 19;
+------+-------+--+
| age  | name  |
+------+-------+--+
| 30   | Andy  |
+------+-------+--+
1 row selected (0.624 seconds)
If the underlying people.json file changes, refresh the table so that subsequent queries pick up the new data:

0: jdbc:hive2://localhost:10001/default> REFRESH TABLE jsonTable;

Temporary tables can also be created over relational databases using the JDBC data source. The connection URL and table name below are placeholders, and the corresponding JDBC driver (here, PostgreSQL's) must be available on the server's classpath:

CREATE TEMPORARY TABLE jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql:dbserver",
  dbtable "schema.tablename"
);
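Once registered, jdbcTable can be queried like any other table, for example (assuming the placeholder connection details above point to a real database):

0: jdbc:hive2://localhost:10001/default> select * from jdbcTable;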

The SQL tab in Spark's UI shows all the finished jobs along with their logical plans, physical plans, and a visual DAG for each query. This is useful for debugging and performance tuning.

Querying data from Hive using spark-sql CLI

The Spark SQL CLI is a command-line tool that runs the Hive metastore service in local mode and executes queries entered at the command line. Hive is configured by placing your hive-site.xml file in Spark's conf directory. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

To start the Spark SQL CLI, use the following command:

./bin/spark-sql
spark-sql> show tables;
spark-sql> select count(*) from tab1;
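Queries can also be run non-interactively: spark-sql accepts the standard Hive CLI options, such as -e for an inline query and -f for a script file (the file path below is a placeholder):

./bin/spark-sql -e "select count(*) from tab1"
./bin/spark-sql -f /path/to/queries.sql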

You may run ./bin/spark-sql --help for a complete list of all available options.

Integration with BI tools

To integrate BI tools such as Tableau or QlikView, install the ODBC driver for Spark from http://databricks.com/spark-odbc-driver-download.

Configure and start the Thrift server as described in the previous section. Start Tableau and select the option to connect to Spark SQL. Configure the hostname, the port number (10001, as configured earlier), and any necessary authentication details, and then click OK. This enables you to access tables from the Hive metastore and execute queries on the Spark execution engine.

Note that you can only query tables that are registered with the Hive metastore.
