The Livy REST job server and Hue Notebooks

Hadoop User Experience (Hue) introduced Spark-based notebooks, which are inspired by IPython Notebooks. The Spark Notebook uses the Livy REST job server as its backend to provide Spark as a service to end users.

Hue Notebooks are similar to IPython Notebooks or Zeppelin Notebooks, and they integrate well with Hadoop ecosystem components. Notebooks support the Scala, Python, and R languages, as well as Hive and Impala queries. They can be shared using the import/export and sharing features, and results can be visualized with a variety of charts and graphs. Hue supports a multiuser environment with impersonation.

Livy is a Spark REST job server that lets you submit and interact with Spark jobs from anywhere. It is inspired by the popular IPython/Jupyter notebooks and the Spark REST job server, but is implemented to integrate better with the Hadoop ecosystem and multiple users. With the Livy server, Spark can be offered as a service to users in the following ways:

  • Instead of every user creating their own shell, Livy creates shells on the cluster, and end users access them at their convenience through a REST API
  • Batch applications can also be submitted remotely using the REST API
  • Livy creates SparkContexts and RDDs that can be shared by multiple users
  • With YARN impersonation, jobs are executed with the actual permissions of the users submitting them (see the sketch after this list)
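
Concretely, impersonation is driven by the session-creation request itself. The following is a minimal sketch using Python's requests library against a Livy server on localhost:8998 once it is running (installation is covered below); the proxyUser field is accepted by recent Livy releases (verify against your version's REST API documentation), and the user name cloudera is only an illustration:

import json

import requests

# Ask Livy to start a PySpark session that runs as the user "cloudera".
# With livy.impersonation.enabled=true and the hadoop.proxyuser.livy.*
# settings shown later in this section, the resulting Spark job carries
# that user's permissions on the cluster.
payload = {"kind": "pyspark", "proxyUser": "cloudera"}
response = requests.post("http://localhost:8998/sessions",
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
print(response.json())   # for example: {"id": 0, "state": "starting", ...}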

Livy supports browser-based notebooks from Hue, Jupyter, or any REST client. The output of jobs is returned in a tabular format so that it can be visualized in charts in Hue. Figure 6.5 shows you the architecture of the Livy REST job server and Hue, which has the following main components:

  • The Livy web server that exposes a REST API
  • The Session Manager that creates and manages sessions for users on YARN
  • Pluggable Interpreters, such as Scala, PySpark, and R, to execute user programs
  • Hue web-based notebooks for interactive sessions
  • Users sending REST calls from any shell or program in batch or interactive mode

SparkMagic (https://github.com/jupyter-incubator/sparkmagic) provides a set of tools for connecting to remote Spark clusters from Jupyter notebooks through the Livy server.


Figure 6.5: The Livy server is between clients and the Spark cluster

Installing and configuring the Livy server and Hue

The latest Livy version must be installed before using Hue Notebooks. Follow these instructions to install and configure the Livy job server on a CDH QuickStart VM or any CDH cluster. Note that the Livy server and Hue Notebooks are in alpha release mode as of this writing.

You can download and unzip prebuilt binaries as follows:

wget http://archive.cloudera.com/beta/livy/livy-server-0.2.0.zip
unzip livy-server-0.2.0.zip
cd livy-server-0.2.0

You can also download and compile the source, as shown in the following:

git clone https://github.com/cloudera/livy.git
cd livy
mvn -DskipTests clean package

To build Livy against a specific version of Spark, run the following instead:

mvn -Dspark.version=1.6.1 package

Once you are done with one of the two preceding methods, follow these steps:

cd conf/

Add the following configuration properties to livy.conf to enable YARN impersonation of other users:

vi livy.conf

livy.impersonation.enabled = true
livy.server.session.factory = yarn

Create a script with the following environment variables and then run the livy-server command:

cd ../bin
export SPARK_HOME=/usr/lib/spark/
export CLASSPATH=`hadoop classpath`
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export LIVY_SERVER_JAVA_OPTS="-Dlivy.impersonation.enabled=true"
./livy-server

The REST server starts on the default port, 8998. Change the port in the configuration if necessary.
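
Before moving on, you can confirm from any machine that can reach the server that the REST API is responding. Here is a minimal sketch using Python's requests library (an assumption; curl, used throughout the next section, works just as well):

import requests

# GET /sessions returns HTTP 200 and the current session list;
# on a freshly started server the list is empty.
resp = requests.get("http://localhost:8998/sessions")
print(resp.status_code)   # 200 when the server is up
print(resp.json())        # {"from": 0, "sessions": [], "total": 0}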

Using the Livy server

There are multiple ways to use Livy's REST API. You can submit jobs interactively in a shell or in batch mode. A SparkContext, and the RDDs created within it, can be shared by multiple users. Let's understand these features with simple examples.

An interactive session

An interactive session is similar to the Spark shell or PySpark shell, where we interactively enter commands and check the results. However, instead of users creating the SparkContext themselves, they interact with Livy through a REST API, and the Livy server creates and manages the sessions for them. To check the existing sessions, run the following command from any terminal. It returns zero sessions if none are running:

curl localhost:8998/sessions | python -m json.tool

{"from": 0,"sessions": [],"total": 0}

Create a PySpark session with the following command, which returns the session ID number. In this example, the ID is 0 with a status of starting. If you submit another similar command, a new session is created with the next session ID:

curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","log":[]}

Let's poll the status again with the following command. When the status becomes idle, the session is ready to receive interactive commands. This command also displays the log output, so if the state is error, check the log in the JSON output:

curl localhost:8998/sessions/0 | python -m json.tool

"state": "idle"

We are now ready to submit interactive commands. Let's submit a simple PySpark code and check the output:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)"}'

{"id":2,"state":"running","output":null}

In the preceding result, id is the statement number. Run the following command to view the output:

curl localhost:8998/sessions/0/statements/2 | python -m json.tool

{
    "id": 2,
    "output": {
        "data": {
            "text/plain": "[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]"
        },
        "execution_count": 2,
        "status": "ok"
    },
    "state": "available"
}

The computation result is [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]. You can submit any Spark code snippet in the code field. Once you are done with the session, delete it as follows:

curl localhost:8998/sessions/0 -X DELETE

{"msg":"deleted"}

A batch session

A batch session is similar to submitting applications using spark-submit. As the first step, copy the JARs or Python files to be executed to HDFS:

hadoop fs -put /usr/lib/spark/lib/spark-examples.jar
sudo tar xzvf /usr/lib/spark/examples/lib/python.tar.gz
hadoop fs -put /usr/lib/spark/examples/lib/pi.py

Then submit the SparkPi example as a batch job with the following REST call:

curl -X POST --data '{"file": "/user/cloudera/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi", "args": ["100"]}' -H "Content-Type: application/json" localhost:8998/batches

{"id":1,"state":"running","log":[]}

Check the status of the job at the resource manager UI, http://quickstart.cloudera:8088/cluster, or check with the following REST call:

curl localhost:8998/batches/1 | python -m json.tool

To delete a running job, use the following command:

curl -X DELETE localhost:8998/batches/1
{"msg":"deleted"}

To run a PySpark batch job, use the following command:

curl -X POST --data '{"file": "/user/cloudera/pi.py"}' -H "Content-Type: application/json" localhost:8998/batches

{"id":2,"state":"starting","log":[]}

Command-line arguments can be passed with args, as shown in the first example. The complete batch API can be found at https://github.com/cloudera/hue/tree/master/apps/spark/java#post-batches.
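
Batch submissions can be scripted in the same way. Here is a minimal sketch using Python's requests library (an assumption; any HTTP client works), reusing the SparkPi JAR uploaded to HDFS earlier and polling the batch until it reaches a terminal state:

import json
import time

import requests

LIVY_URL = "http://localhost:8998"
HEADERS = {"Content-Type": "application/json"}

# Submit the SparkPi example as a batch job; adjust the HDFS path,
# class name, and arguments for your own application.
batch_spec = {
    "file": "/user/cloudera/spark-examples.jar",
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["100"],
}
batch = requests.post(LIVY_URL + "/batches",
                      data=json.dumps(batch_spec), headers=HEADERS).json()
batch_url = "{0}/batches/{1}".format(LIVY_URL, batch["id"])

# Poll until the batch finishes.
while True:
    state = requests.get(batch_url).json()["state"]
    print("batch state:", state)
    if state in ("success", "error", "dead"):
        break
    time.sleep(10)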

Sharing SparkContexts and RDDs

Usually, multiple users create their own sessions and don't share them with each other, which wastes the resources of the Spark cluster. If all users talk to the same session, they interact with the same SparkContext. This context itself manages several RDDs, which can be shared by multiple users. Users simply need to use the same session ID, for example, 0, and issue commands against it.

They can be accessed from any shell, program, or notebook. Here is an example:

User1 creates an RDD with the name sharedRDD in session 0. Then, both User1 and User2 access the same RDD using the same session ID, as shown here:

User1 uses the following command:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sharedRDD.collect()"}'

User2 uses the following command:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sharedRDD.take(5)"}'
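
For completeness, here is how the shared RDD might be set up and used end to end, as a minimal sketch using Python's requests library (an assumption; the curl calls above are equivalent). The definition of sharedRDD is hypothetical, since the original example does not show it; any RDD bound to a name in the shared session behaves the same way:

import json

import requests

STATEMENTS_URL = "http://localhost:8998/sessions/0/statements"
HEADERS = {"Content-Type": "application/json"}

def submit(code):
    """Submit a code snippet as a statement in shared session 0."""
    return requests.post(STATEMENTS_URL,
                         data=json.dumps({"code": code}),
                         headers=HEADERS).json()

# User1 defines the RDD once in the shared session (hypothetical definition).
submit("sharedRDD = sc.parallelize(range(100))")

# Both users can now reference sharedRDD by name through session 0.
submit("sharedRDD.collect()")   # User1
submit("sharedRDD.take(5)")     # User2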

Using Livy with Hue Notebook

As the Spark Notebooks in Hue are in beta, they are not enabled by default. To enable Spark Notebooks in Hue, make the following changes in Cloudera Manager:

  1. Navigate to Hue service | Configuration | Advanced | Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini and enter the following lines:
    [desktop]
    app_blacklist=
    [notebook]
    show_notebooks=true
  2. In the Hue Server Advanced Configuration Snippet (Safety Valve) for hue_safety_valve_server.ini, enter the following lines:
    [spark]
    server_url=http://quickstart.cloudera:8998/
    livy_server_host=quickstart.cloudera
    livy_server_port=8998
    livy_server_session_kind=yarn
    livy_impersonation_enabled=false
    livy_server_session_timeout=3600000
  3. In the HDFS Service | Configuration | Service-Wide-Advanced | Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, enter the following key-value pairs:
    hadoop.proxyuser.livy.groups *
    hadoop.proxyuser.livy.hosts *
  4. Save the configuration changes and restart HDFS and Hue.

Before using the Hue Notebook, stop the Livy server by pressing Ctrl + C in the terminal where it is running, and then restart it with impersonation disabled to avoid errors in Hue:

CLASSPATH=`hadoop classpath` SPARK_HOME=/usr/lib/spark/ HADOOP_CONF_DIR=/etc/hadoop/conf/ LIVY_SERVER_JAVA_OPTS="-Dlivy.impersonation.enabled=false" ./livy-server

Log in to Hue using the cloudera/cloudera credentials at http://quickstart.cloudera:8888/about/. Hue ships with sample tables; go to Hue's main screen, click on Step 2: Examples, and then import all the examples. This creates the web_logs table, which can be used with Impala queries.

Navigate to the Notebooks menu and open a new or an existing notebook.

The notebook's look and feel, execution, and general operations are similar to IPython or Zeppelin Notebooks.

Figure 6.6 shows you the snippets available in the Spark Notebook application: PySpark, R, Hive, Impala, and Scala. For more snippet types, click on the plus symbol, which shows additional snippets such as Spark Submit Jar and Spark Submit Python:


Figure 6.6: The Hue Notebook

For a new notebook, click on PySpark, enter a name for the notebook, and enter the following code. Note that the first execution takes more time because the SparkContext has not started yet; the next set of commands will execute quickly:

sc.parallelize(range(1000)).map(lambda x: x * x).take(10)

Figure 6.7 shows you a typical execution of PySpark code in the Hue Notebook. Notice that a job is running in the job browser with the name livy-session-0, the application type SPARK, and the user cloudera. Click on the job ID and then on the URL of the job, which takes you to the Spark UI. Jobs submitted from Hue appear in this Spark UI:


Figure 6.7: The Hue PySpark notebook

Now click on the drop-down menu in front of the PySpark icon and select Impala. If you have not already done so, install the Impala examples from Step 2 of the Hue quick start wizard. Run the following query, select the gradient map chart from the drop-down list, and choose country_code3 as REGION and count(*) as VALUE:

invalidate metadata;

select country_code3, count(*) from web_logs group by country_code3;

The results are displayed on a world map, as shown in Figure 6.8:


Figure 6.8: HUE visualizations

Using Livy with Zeppelin

Apache Zeppelin provides an interpreter for the Livy server, so Apache Spark jobs can be executed interactively from a Zeppelin notebook. Configure livy.spark.master and zeppelin.livy.url in the Livy interpreter settings before using it. External libraries can be configured with the livy.spark.jars.packages property. The interpreter bindings to use are %livy.spark, %livy.pyspark, and %livy.sparkr. For more information, refer to https://zeppelin.apache.org/docs/0.6.0/interpreter/livy.html.
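
For instance, once the Livy interpreter is configured, a Zeppelin paragraph bound to %livy.pyspark can run the same kind of PySpark snippet used earlier in this chapter. This is a minimal sketch; the binding names come from the Zeppelin Livy interpreter documentation linked above:

%livy.pyspark
sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)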
