Introducing Apache Zeppelin

Apache Zeppelin is a web-based notebook that enables data-driven, interactive analytics with built-in visualizations. It supports multiple languages through an interpreter framework; current interpreters include Spark, Markdown, Shell, Hive, Phoenix, Tajo, Flink, Ignite, Lens, HBase, Cassandra, Elasticsearch, Geode, PostgreSQL, and Hawq. It can be used for data ingestion, discovery, analytics, and visualization using notebooks similar to IPython Notebooks. Zeppelin recognizes output from any interpreter and visualizes it with the same built-in tools.

The Zeppelin project started as an incubator project in the Apache Software Foundation in December 2014 and became a top-level project in May 2016. Zeppelin has four main components, as shown in the architecture in Figure 6.3:


Figure 6.3: The Zeppelin architecture

The components in the Zeppelin architecture are described as follows:

  • Frontend: This provides the UI and shell for user interaction, and a display system to show data in tabular or graphical form and to export it as an iframe.
  • Zeppelin Server: This provides WebSockets and a REST API to interact with the UI and to access the service remotely. There are two types of API calls: a Notebook REST API and an Interpreter REST API. The Notebook REST API interacts with notebooks: creating paragraphs, submitting paragraph jobs in batch, adding cron jobs, and so on; a minimal call is sketched after this list. The Interpreter REST API changes configuration properties and restarts interpreters. For more information about the REST API, visit https://github.com/apache/incubator-zeppelin/tree/master/docs/rest-api.
  • Pluggable Interpreter System: This is to interact with different interpreters such as Spark, Shell, Markdown, AngularJS, Hive, Ignite, Flink, and others.
  • Interpreters: Each interpreter runs in a separate JVM to provide the functionality needed by the user.
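
As a sketch of the Notebook REST API, the following Python snippet creates an empty note and submits it as a batch job. The host, port, and note name are assumptions; check the REST API documentation linked above for the endpoints available in your version:

import requests

# Hypothetical host and port; adjust to your Zeppelin server.
base = 'http://localhost:9999/api/notebook'

# Create an empty note; the response body carries the new note's ID.
resp = requests.post(base, json={'name': 'rest-api-demo'})
note_id = resp.json()['body']

# Submit all paragraphs of the note as a batch job.
requests.post('%s/job/%s' % (base, note_id))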

Jupyter versus Zeppelin

Each product has its own strengths and weaknesses. We need to understand the differences in order to use the right tool for the right use case. The following table shows you the differences between Jupyter and Zeppelin:

| Feature | Jupyter | Zeppelin |
| --- | --- | --- |
| Evolution | A long history, large community support, and stable | Relatively young |
| Type of software | Open source | Open source with Apache releases |
| Visualization of results | Uses external tools such as matplotlib | Built-in tools in the notebook for graphs and charts |
| Customization of forms | No dynamic forms | Dynamic forms with user-provided inputs |
| Tab completion | Jupyter provides tab completion | Zeppelin does not provide tab completion yet |
| Languages/components supported | Over 40 languages, including Python, Julia, and R | Interpreters such as Scala and Python with Apache Spark, Spark SQL, Hive, Markdown, Shell, HBase, Flink, Cassandra, Elasticsearch, Tajo, HDFS, Ignite, Lens, PostgreSQL, Hawq, Scalding, and Geode |
| Mixing multiple languages | Not easy to mix multiple languages in the same notebook | Quite easy to mix multiple languages in the same notebook |
| Implementation | Python-based | JVM-based |
| Environments | Jupyter is a generic tool that can be used in any environment | Zeppelin is more suitable for Hadoop and Spark installations |
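
To illustrate the dynamic forms row, a ${name=default} template in a Zeppelin paragraph renders as a text input box with a default value. A minimal sketch in a %sql paragraph, assuming the words table built later in this chapter (the form variable word is just an illustration):

%sql
select word, max(count) from words where word = "${word=spark}" group by word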

Installing Apache Zeppelin

The latest Hortonworks Sandbox provides a preconfigured Zeppelin service that can be used to try it out quickly. If you want to install Zeppelin on a cluster, there are two ways to do so: the Hortonworks Ambari service or the manual installation method. The Ambari service can be used for Hortonworks-based installations, while the manual method works for Hortonworks, Cloudera, and MapR distributions.

Ambari service

Use the following instructions to install, configure, and start the Zeppelin service on Ambari:

# Detect the HDP version and register the Zeppelin service definition
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
sudo ambari-server restart

Go to ipaddressofsandbox:8080 and log in with the admin/admin credentials. The Apache Zeppelin service is now included in the stack and can be added as a service. At the bottom left of the Ambari page, click on Actions, then on Add Service, select the Zeppelin service, configure it, and deploy it.

During the configuration step, change the following parameters as necessary:

  • spark.home: Use the standard /usr/hdp/current/spark or any custom Spark version installed.
  • zeppelin.server.port: The port on which the Zeppelin server listens. Use any unused port.
  • zeppelin.setup.prebuilt: Set this to false to build from the latest code base.

The manual method

Use the following commands to install and configure the Apache Zeppelin service manually:

wget http://mirror.metrocast.net/apache/zeppelin/zeppelin-0.6.1/zeppelin-0.6.1-bin-all.tgz
tar xzvf zeppelin-0.6.1-bin-all.tgz
cd zeppelin-0.6.1-bin-all/conf

To access the Hive metastore, copy hive-site.xml to the conf directory of Zeppelin:

cp /etc/hive/conf/hive-site.xml .

Copy the configuration template files as follows:

cp zeppelin-env.sh.template zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml

Add the following lines to the zeppelin-env.sh file:

export JAVA_HOME=/usr/lib/jvm/java
export MASTER=yarn-client
export HADOOP_CONF_DIR=/etc/hadoop/conf

Add the following lines to zeppelin-site.xml:

<property>
  <name>zeppelin.server.addr</name>
  <value>sandbox.hortonworks.com</value>
  <description>Server address</description>
</property>

<property>
  <name>zeppelin.server.port</name>
  <value>9999</value>
  <description>Server port.</description>
</property>

Finally, start the Zeppelin service from the bin directory with the following command:

cd ../bin/
./zeppelin-daemon.sh start

Now you can access your notebook at http://host.ip.address:9999.
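
To verify that the service is up before opening the UI, a simple HTTP request is enough. A minimal sketch in Python, assuming the host and port set in zeppelin-site.xml above:

import requests

# Availability check for the Zeppelin web server; host and port
# match the zeppelin-site.xml settings shown above.
resp = requests.get('http://sandbox.hortonworks.com:9999/')
print(resp.status_code)  # 200 means Zeppelin is serving the UI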

Analytics with Zeppelin

Zeppelin supports multiple interpreters in the same notebook, so you can mix Scala, Python, SQL, and other languages in a single note.

Click on the Notebook menu option at the top of the screen, then click on Create new note and provide a meaningful name for the notebook. The newly created notebook can be opened from the main screen or from the Notebook menu. Click on the newly created notebook and then on the interpreter binding button in the upper right corner. Click on the interpreters to bind or unbind them. You can change the order of the interpreters by dragging and dropping them; the first one on the list becomes the notebook's default interpreter. Finally, click on the Save option at the bottom to save the changes.

Now, click on the Interpreter menu option at the top and then on edit to change Spark properties such as master, spark.cores.max, spark.executor.memory, and args, as needed by the application. Click on Save to update and restart the interpreter with the new settings. You can also restart any specific interpreter by clicking on its restart button.

You are now ready to code. As %spark is first in the interpreter binding list, you don't need to type %spark to write Scala code in a paragraph. However, to write any other code in a paragraph, say PySpark, you need to start it with %pyspark. Provide Markdown text in the first paragraph to describe the notebook, write code in the next set of paragraphs, and finally, to visualize the result, write %sql or %table in a separate paragraph, as sketched below.
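
For example, a note mixing Markdown, PySpark, and SQL might contain paragraphs such as the following. This is a minimal sketch; the words table is the one registered in the log-analysis example that follows:

%md
## Ambari agent log analysis
This note counts words in the agent log and charts the most frequent ones.

%pyspark
# Python for Spark; the default %spark interpreter would take Scala here
print(sc.version)

%sql
-- Rendered with Zeppelin's built-in charting
select word, max(count) from words group by word limit 10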

Write code from previous chapters or use the Zeppelin Tutorial notebook that comes along with Zeppelin for a quick start. You can use the following code to analyze Ambari agent logs:

%pyspark
# Word count over the Ambari agent log, registered as a table for %sql
words = sc.textFile('file:///var/log/ambari-agent/ambari-agent.log') \
    .flatMap(lambda x: x.lower().split(' ')) \
    .filter(lambda x: x.isalpha()) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b)
sqlContext.registerDataFrameAsTable(
    sqlContext.createDataFrame(words, ['word', 'count']), 'words')

%sql select word, max(count) from words group by word

The output of the preceding code looks similar to Figure 6.4:


Figure 6.4: Zeppelin visualizations

If you get any errors, check the logs in the logs directory of Zeppelin.

Hortonworks Gallery has prebuilt notebooks at https://github.com/hortonworks-gallery/zeppelin-notebooks to play with Spark, PySpark, Spark SQL, Spark Streaming, Hive, and so on.

Any existing notebook can be viewed in the ZeppelinHub Viewer at https://www.zeppelinhub.com/viewer.

There are multiple ways to share a notebook with others. Other users on the same cluster can access and run a notebook via its URL. You can also share the notebook in report mode by clicking on the drop-down list in the upper right corner and then choosing report.

Tip

If you get an error such as interpreter not found, click on the interpreter binding icon in the top right corner of the notebook and then click on Save to resolve the issue.
