Databricks provides a REST interface for manipulating Spark clusters. It allows for cluster management, library management, command execution, and the management of execution contexts. To access the REST API, port 34563 must be accessible for your instance in the AWS EC2-based Databricks cloud. The following Telnet command shows an attempt to access port 34563 of my Databricks cloud instance. Note that the Telnet attempt has been successful:
[hadoop@hc2nn ~]$ telnet dbc-bff687af-08b7.cloud.databricks.com 34563
Trying 52.6.229.109...
Connected to dbc-bff687af-08b7.cloud.databricks.com.
Escape character is '^]'.
If you do not receive a Telnet session, then contact Databricks via <[email protected]>
. The next sections provide examples of REST interface access to your instance on the Databricks cloud.
In order to use the interface, I needed to whitelist the IP address of the machine from which I would run the REST API commands against my Databricks cluster instance. By whitelisting IP addresses, Databricks ensures that only a known set of machines can access each Databricks cloud instance.
I contacted Databricks support via the previous help email address, but there is also a Whitelist IP Guide, found in the Workspace menu in your cloud instance:
Workspace | databricks_guide | DevOps Utilities | Whitelist IP.
REST API calls can now be submitted to my Databricks cloud instance from the Linux command line, using the Linux curl command. The general form of the curl command is shown next, using my Databricks cloud instance username, password, cloud instance URL, REST API path, and parameters. The Databricks forum, and the help email address given previously, can be used to gain further information. The following sections provide some worked REST API examples:
curl -u '<user>:<passwd>' <dbc url> -d "<parameters>"
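The general form above can be wrapped in a small shell function so that each call only has to supply the API path and parameter string. This is a minimal sketch, not part of the Databricks tooling: the `DB_USER`, `DB_PASS`, and `DB_URL` values are placeholders for your own credentials and instance URL, and the function prints the curl invocation rather than running it, so the form can be inspected before touching a live instance.

```shell
#!/bin/sh
# Placeholder credentials and instance URL -- substitute your own.
DB_USER='xxxx'
DB_PASS='yyyy'
DB_URL='https://dbc-bff687af-08b7.cloud.databricks.com:34563'

# Print (rather than run) the curl invocation for a given API path and
# optional parameter string, following the general form shown above.
dbc_cmd () {
  path="$1"
  params="$2"
  if [ -n "$params" ]; then
    printf "curl -u '%s:%s' '%s%s' -d \"%s\"\n" \
      "$DB_USER" "$DB_PASS" "$DB_URL" "$path" "$params"
  else
    printf "curl -u '%s:%s' '%s%s'\n" \
      "$DB_USER" "$DB_PASS" "$DB_URL" "$path"
  fi
}

dbc_cmd /api/1.0/clusters/list
```

Removing the printf wrappers and calling curl directly turns the same function into a live client.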
You will still need to create Databricks Spark clusters from your cloud instance user interface. The list REST API command is as follows:
/api/1.0/clusters/list
It needs no parameters. This command provides a list of your clusters, with their status, IP addresses, names, and the port numbers that they run on. The following output shows that the cluster semclust1 is in a Pending state while it is being created:
curl -u 'xxxx:yyyyy' 'https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/clusters/list'
[{"id":"0611-014057-waist9","name":"semclust1","status":"Pending","driverIp":"","jdbcPort":10000,"numWorkers":0}]
The same REST API command, run when the cluster is available, shows that the cluster called semclust1 is running and has one worker:
[{"id":"0611-014057-waist9","name":"semclust1","status":"Running","driverIp":"10.0.196.161","jdbcPort":10000,"numWorkers":1}]
Terminating this cluster, and creating a new one called semclust
changes the results of the REST API call as shown:
curl -u 'xxxx:yyyy' 'https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/clusters/list'
[{"id":"0611-023105-moms10","name":"semclust","status":"Pending","driverIp":"","jdbcPort":10000,"numWorkers":0},
 {"id":"0611-014057-waist9","name":"semclust1","status":"Terminated","driverIp":"10.0.196.161","jdbcPort":10000,"numWorkers":1}]
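The JSON returned by the clusters/list call can be reduced to a name/status summary with standard shell tools, which is convenient when polling for a cluster to leave the Pending state. The sketch below parses the sample response shown above; in practice the `response` variable would be captured from the live curl output.

```shell
#!/bin/sh
# Sketch: summarise the /api/1.0/clusters/list reply as "name status"
# pairs. The sample response is the one shown above; in practice it
# would be captured from the curl call, e.g. response=$(curl -u ... ).
response='[{"id":"0611-023105-moms10","name":"semclust","status":"Pending","driverIp":"","jdbcPort":10000,"numWorkers":0},{"id":"0611-014057-waist9","name":"semclust1","status":"Terminated","driverIp":"10.0.196.161","jdbcPort":10000,"numWorkers":1}]'

# Split the JSON array into one cluster object per line, then extract
# the name and status fields from each line.
summary=$(echo "$response" \
  | tr '}' '\n' \
  | sed -n 's/.*"name":"\([^"]*\)".*"status":"\([^"]*\)".*/\1 \2/p')
echo "$summary"
```

For anything beyond a quick check, a real JSON parser (for example, the `jq` tool) would be more robust than sed against field reordering.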
With these API calls, you can create, show the status of, or delete an execution context. The REST API calls are as follows:
/api/1.0/contexts/create
/api/1.0/contexts/status
/api/1.0/contexts/destroy
In the following REST API call example, submitted via curl, a Scala context has been created for the cluster semclust, identified by its cluster ID.
curl -u 'xxxx:yyyy' https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/contexts/create -d "language=scala&clusterId=0611-023105-moms10"
The result returned is either an error, or a context ID. The following three example return values show an error caused by an invalid URL, and two successful calls returning context IDs:
{"error":"ClusterNotFoundException: Cluster not found: semclust1"}
{"id":"8689178710930730361"}
{"id":"2876384417314129043"}
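A returned context ID must be parsed out of the JSON reply before it can be used in later calls. The sketch below extracts the ID from one of the sample create replies shown above; the commented status and destroy calls are assumptions that follow the parameter pattern of the create call, and should be checked against your own instance.

```shell
#!/bin/sh
# Sketch: pull the context ID out of a /api/1.0/contexts/create reply.
# The cluster ID and sample reply are taken from the examples above.
cluster_id='0611-023105-moms10'

# In practice the reply would come from the curl call, for example:
#   create_response=$(curl -u 'xxxx:yyyy' \
#     https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/contexts/create \
#     -d "language=scala&clusterId=$cluster_id")
create_response='{"id":"8689178710930730361"}'

# Extract the value of the "id" field from the JSON reply.
context_id=$(echo "$create_response" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')
echo "context: $context_id"

# The same clusterId/contextId pair would then drive the status and
# destroy calls (parameter names assumed to mirror the create call):
#   ... /api/1.0/contexts/status  with "clusterId=$cluster_id&contextId=$context_id"
#   ... /api/1.0/contexts/destroy with "clusterId=$cluster_id&contextId=$context_id"
```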
These commands allow you to run a command, check a command's status, cancel a command, or show the results of a command. The REST API calls are as follows:
/api/1.0/commands/execute
/api/1.0/commands/status
/api/1.0/commands/cancel
The following example shows an SQL statement being run against an existing table called cmap. The context must exist, and must be of the SQL type. The parameters have been passed to the HTTP POST call via a -d option. The parameters are the language, the cluster ID, the context ID, and the SQL command. The command ID is returned as follows:
curl -u 'admin:FirmWare1$34' https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/commands/execute -d "language=sql&clusterId=0611-023105-moms10&contextId=7690632266172649068&command=select count(*) from cmap"
{"id":"d8ec4989557d4a4ea271d991a603a3af"}
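The -d parameter string quickly becomes unwieldy, so it helps to assemble it from its parts and to parse the returned command ID for any follow-up status or cancel calls. This sketch reuses the values from the worked example above; the commented curl line is the call a live script would make.

```shell
#!/bin/sh
# Sketch: assemble the parameter string for /api/1.0/commands/execute
# from its parts, and extract the command ID from the sample reply
# shown above.
language='sql'
cluster_id='0611-023105-moms10'
context_id='7690632266172649068'
sql='select count(*) from cmap'

params="language=${language}&clusterId=${cluster_id}&contextId=${context_id}&command=${sql}"

# In practice:
#   response=$(curl -u '<user>:<passwd>' \
#     https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/commands/execute \
#     -d "$params")
response='{"id":"d8ec4989557d4a4ea271d991a603a3af"}'

# Extract the command ID for use in later status/cancel calls.
command_id=$(echo "$response" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')
echo "command: $command_id"
```

Note that the SQL text is passed through unencoded here, as in the example above; commands containing characters such as `&` or `=` would need URL encoding before being placed in the parameter string.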
The REST API also allows for libraries to be uploaded to a cluster and their statuses checked. The REST API call paths are as follows:
/api/1.0/libraries/upload
/api/1.0/libraries/list
An example is given next of a library upload to the cluster instance called semclust. The parameters passed to the HTTP POST call via a -d option are the language, the cluster ID, the library name, and its URI. A successful call returns the name and URI of the library, as follows:
curl -u 'xxxx:yyyy' https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/libraries/upload -d "language=scala&clusterId=0611-023105-moms10&name=lib1&uri=file:///home/hadoop/spark/ann/target/scala-2.10/a-n-n_2.10-1.0.jar"
{"name":"lib1","uri":"file:///home/hadoop/spark/ann/target/scala-2.10/a-n-n_2.10-1.0.jar"}
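Since the upload URI points at a local jar file, it is worth checking that the file exists before issuing the call. This is a small sketch under that assumption, with the jar path and cluster ID taken from the example above; the actual upload call is left commented.

```shell
#!/bin/sh
# Sketch: build the upload parameters for /api/1.0/libraries/upload and
# warn if the jar does not exist locally. The path and cluster ID are
# the ones from the example above -- substitute your own.
jar='/home/hadoop/spark/ann/target/scala-2.10/a-n-n_2.10-1.0.jar'
cluster_id='0611-023105-moms10'

# A file: URI needs three slashes for an absolute local path.
params="language=scala&clusterId=${cluster_id}&name=lib1&uri=file://${jar}"
echo "$params"

# Only a warning here; the real upload would follow with:
#   curl -u 'xxxx:yyyy' \
#     https://dbc-bff687af-08b7.cloud.databricks.com:34563/api/1.0/libraries/upload \
#     -d "$params"
[ -f "$jar" ] || echo "warning: jar not found locally: $jar" >&2
```

A follow-up call to /api/1.0/libraries/list can then confirm that the library has attached to the cluster.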
Note that this REST API may change in content and version over time, so check the Databricks forum, and use the help email address given previously to confirm the API details with Databricks support. I do think, though, that these simple example calls make it clear that this REST API can be used to integrate Databricks with external systems and ETL chains. In the next section, I will provide an overview of data movement within the Databricks cloud.