Jobs and libraries

Within Databricks, it is possible to import JAR libraries and run the classes they contain on your clusters. To demonstrate this, I will create a simple piece of Scala code that prints the first terms of the Fibonacci series as BigInt values, working locally on my CentOS Linux server. I will compile the class into a JAR file using SBT, run it locally to check the result, and then run it on my Databricks cluster and compare the results. The code looks as follows:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object db_ex1 extends App
{
  // Note that the Spark master URL is deliberately not set here:
  // Databricks (or spark-submit) supplies it when the job is launched.
  val appName = "Databricks example 1"
  val conf = new SparkConf()

  conf.setAppName(appName)

  val sparkCxt = new SparkContext(conf)

  // Build up the Fibonacci series as BigInt values: the two seed
  // values, followed by a further `limit` generated terms.
  var seed1: BigInt = 1
  var seed2: BigInt = 1
  val limit = 100
  var resultStr = seed1 + " " + seed2 + " "

  for ( i <- 1 to limit ) {

    val fib: BigInt = seed1 + seed2
    resultStr += fib.toString + " "

    seed1 = seed2
    seed2 = fib
  }

  println()
  println( "Result : " + resultStr )
  println()

  sparkCxt.stop()

} // end application
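To compile this class into a JAR file, a small SBT build definition is needed. The following is only a minimal sketch: the Spark version shown is an assumption, and the name, version, and Scala version settings have been chosen to match the data-bricks_2.10-1.0.jar artifact name used later, not copied from an actual build file.

```
// build.sbt -- a minimal sketch; the Spark version here is an assumption
name := "data-bricks"

version := "1.0"

scalaVersion := "2.10.6"

// "provided" keeps the Spark classes out of the JAR, since the
// Databricks cluster supplies Spark at run time
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"
```

Running `sbt package` against a build file like this would produce target/scala-2.10/data-bricks_2.10-1.0.jar.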

The Fibonacci code is not the most elegant, nor the best way to generate the series, but I just wanted a simple JAR and class to use with Databricks. When run locally, I get the terms of the series, which look as follows (I've clipped this output to save space):

Result : 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 4807526976 7778742049 12586269025 20365011074 32951280099 53316291173

4660046610375530309 7540113804746346429 12200160415121876738 19740274219868223167 31940434634990099905 51680708854858323072 83621143489848422977 135301852344706746049 218922995834555169026 354224848179261915075 573147844013817084101 927372692193078999176
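As an aside, the same series can be generated without the mutable seed variables. The following is a sketch of a more idiomatic, lazily evaluated alternative; it is not the code packaged in the JAR, just an illustration of the same computation:

```scala
object FibStream {
  // Lazily evaluated Fibonacci series as BigInt values: two seed ones,
  // followed by the element-wise sum of the series and its own tail.
  lazy val fibs: Stream[BigInt] =
    BigInt(1) #:: BigInt(1) #:: fibs.zip(fibs.tail).map { case (a, b) => a + b }

  def main(args: Array[String]): Unit =
    println("Result : " + fibs.take(102).mkString(" "))
}
```

Taking 102 elements reproduces the output above: the two seeds plus the 100 terms generated by the loop.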

The compiled library is called data-bricks_2.10-1.0.jar. From my folder menu, I can create a new library using the drop-down menu option. This allows me to specify the library source as a JAR file, name the new library, and upload the JAR file from my local server. The following screenshot shows an example of this process:


Once the library has been created, it can be attached to my Databricks cluster, called semclust1, using the Attach option. The following screenshot shows the new library in the process of being attached:


In the following example, a job called job2 has been created by selecting the jar option for the Task item. The same JAR file has been uploaded for the job, and the class db_ex1 within the library has been set as the class to run. The cluster has been specified as on-demand, meaning that a cluster will be created automatically to run the job. The following screenshot shows the job running in the Active runs section:


Once it has run, the job moves to the Completed runs section of the display. The following screenshot for the same job shows that it took 47 seconds to run, that it was launched manually, and that it succeeded.


By selecting the run named Run 1 in the previous screenshot, it is possible to view its output. The following screenshot shows the same result as the execution on my local server. I have clipped the output text to keep it presentable and readable on this page, but you can see that the output is the same.


So, even from this very simple example, it is clear that applications can be developed remotely and loaded onto a Databricks cluster as JAR files for execution. However, each time a Databricks cluster is created on AWS EC2, the Spark master URL changes, so the application must not hard-code details such as the Spark master URL; Databricks sets the URL automatically.
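One portable pattern is to resolve the master URL at run time instead of baking it into the code. The helper below is a hypothetical sketch, not a Databricks API: the precedence order and the MASTER environment variable name are assumptions, with a local-mode fallback only for desktop testing.

```scala
object MasterResolver {
  // Resolve the Spark master URL at run time rather than hard-coding it.
  // Precedence (an assumption for this sketch): the spark.master system
  // property set by the launcher, then a MASTER environment variable,
  // then a local-mode fallback for desktop testing.
  def resolve(props: Map[String, String], env: Map[String, String]): String =
    props.get("spark.master")
      .orElse(env.get("MASTER"))
      .getOrElse("local[*]")

  def main(args: Array[String]): Unit =
    println("Using master : " + resolve(sys.props.toMap, sys.env))
}
```

On Databricks the first branch would win, because the launcher injects the correct value, while the same JAR still runs unchanged on a desktop machine.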

When running JAR file classes in this way, it is also possible to define parameters for the class. Jobs may be scheduled to run at a given time or periodically, and job timeouts and alert email addresses may also be specified.
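Such parameters reach the class as the ordinary args array of its main method. The sketch below is hypothetical (the meaning of the parameter is an assumption): it reads an optional first argument as the number of terms to generate, falling back to the 100 used earlier.

```scala
object JobParams {
  // Parse an optional first job parameter as the number of Fibonacci
  // terms to generate, falling back to 100 when absent or malformed.
  def termCount(args: Array[String]): Int =
    args.headOption.flatMap(a => scala.util.Try(a.toInt).toOption).getOrElse(100)

  def main(args: Array[String]): Unit =
    println("Generating " + termCount(args) + " terms")
}
```

Defaulting on a malformed value keeps a scheduled job from failing outright when a parameter is mistyped; whether that is preferable to failing fast is a design choice.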
