REST interface

Databricks provides a REST interface for manipulating Spark clusters remotely. It supports cluster management, library management, command execution, and execution context management. To access the REST API, port 34563 must be reachable on your instance in the AWS EC2-based Databricks cloud. The following Telnet command shows a successful attempt to access port 34563 of my Databricks cloud instance:

[hadoop@hc2nn ~]$ telnet dbc-bff687af-08b7.cloud.databricks.com 34563
Trying 52.6.229.109...
Connected to dbc-bff687af-08b7.cloud.databricks.com.
Escape character is '^]'.

If you cannot establish a Telnet session, then contact Databricks support. The next sections provide examples of REST interface access to your instance on the Databricks cloud.

Configuration

In order to use the interface, I needed to whitelist the IP address that I use to access my Databricks cluster instance. This is the IP address of the machine from which I will run the REST API commands. By whitelisting IP addresses, Databricks ensures that only an approved list of users can access each Databricks cloud instance.

I contacted Databricks support to arrange this, but there is also a Whitelist IP guide, found in the Workspace menu of your cloud instance:

Workspace | databricks_guide | DevOps Utilities | Whitelist IP.

REST API calls can now be submitted to my Databricks cloud instance from the Linux command line, using the Linux curl command. The general form of the curl command is shown next, using my Databricks cloud instance username, password, cloud instance URL, REST API path, and parameters:

curl -u '<user>:<passwd>' '<dbc url>' -d "<parameters>"

The Databricks forum and Databricks support can be used to gain further information. The following sections provide some worked REST API examples:

Cluster management

You will still need to create Databricks Spark clusters from your cloud instance user interface; the REST API can then be used to list them. The list command path is as follows:

/api/1.0/clusters/list

It needs no parameters. This command returns a list of your clusters with their status, IP addresses, names, and the port numbers that they run on. The following output shows that the cluster semclust1 is in a Pending state while it is being created:

curl -u 'xxxx:yyyyy' 'https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/clusters/list'

 [{"id":"0611-014057-waist9","name":"semclust1","status":"Pending","driverIp":"","jdbcPort":10000,"numWorkers":0}]

The same REST API command, run when the cluster is available, shows that the cluster called semclust1 is running and has one worker:

[{"id":"0611-014057-waist9","name":"semclust1","status":"Running","driverIp":"10.0.196.161","jdbcPort":10000,"numWorkers":1}]

Terminating this cluster and creating a new one called semclust changes the results of the REST API call, as shown here:

curl -u 'xxxx:yyyy' 'https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/clusters/list'

[{"id":"0611-023105-moms10","name":"semclust", "status":"Pending","driverIp":"","jdbcPort":10000,"numWorkers":0},
 {"id":"0611-014057-waist9","name":"semclust1","status":"Terminated","driverIp":"10.0.196.161","jdbcPort":10000,"numWorkers":1}]

The execution context

With these API calls, you can create, show the status of, or delete an execution context. The REST API calls are as follows:

  • /api/1.0/contexts/create
  • /api/1.0/contexts/status
  • /api/1.0/contexts/destroy

In the following REST API call example, submitted via curl, a Scala context has been created for the cluster semclust, identified by its cluster ID.

curl -u 'xxxx:yyyy' https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/contexts/create -d "language=scala&clusterId=0611-023105-moms10"

The result returned is either an error or a context ID. The following three example return values show an error caused by an invalid cluster name, followed by two successful calls returning context IDs:

{"error":"ClusterNotFoundException: Cluster not found: semclust1"}
{"id":"8689178710930730361"}
{"id":"2876384417314129043"}

Command execution

These commands allow you to run a command, check a command's status, cancel a command, or obtain a command's results. The REST API call paths are as follows:

  • /api/1.0/commands/execute
  • /api/1.0/commands/cancel
  • /api/1.0/commands/status

The following example runs an SQL statement against an existing table called cmap. The context must already exist, and must be of the SQL type. The parameters, passed via the curl -d option, are the language, the cluster ID, the context ID, and the SQL command. The command ID is returned as follows:

curl -u 'xxxx:yyyy' https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/commands/execute -d
"language=sql&clusterId=0611-023105-moms10&contextId=7690632266172649068&command=select count(*) from cmap"

{"id":"d8ec4989557d4a4ea271d991a603a3af"}

Libraries

The REST API also allows libraries to be uploaded to a cluster, and their statuses to be checked. The call paths are as follows:

  • /api/1.0/libraries/upload
  • /api/1.0/libraries/list

An example is given next of a library upload to the cluster instance called semclust. The parameters, passed via the curl -d option, are the language, the cluster ID, the library name, and its URI. A successful call returns the library's name and URI, as follows:

curl -u 'xxxx:yyyy' https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/libraries/upload
 -d "language=scala&clusterId=0611-023105-moms10&name=lib1&uri=file:///home/hadoop/spark/ann/target/scala-2.10/a-n-n_2.10-1.0.jar"

{"name":"lib1","uri":"file:///home/hadoop/spark/ann/target/scala-2.10/a-n-n_2.10-1.0.jar"}

Note that this REST API may change in content and version over time, so check the Databricks forum and confirm the API details with Databricks support. I do think, though, that these simple example calls make it clear that this REST API can be used to integrate Databricks with external systems and ETL chains. In the next section, I will provide an overview of data movement within the Databricks cloud.
