Introducing Jupyter

The Jupyter Notebook supports over 40 languages and integrates with Spark and Hadoop, so you can query data interactively and visualize the results with libraries such as ggplot2 and matplotlib.

The Jupyter project evolved from the IPython project. Over time, IPython gained support for many languages beyond Python, so the IPython name no longer reflected the scope of the project; it was renamed Jupyter, a name inspired by the Julia, Python, and R languages. IPython continues to exist as the Python kernel for Jupyter. In simple terms, IPython supports the Python language, while Jupyter is language-agnostic. Jupyter provides the following features:

  • An interactive shell for OS commands
  • A Qt console for interactive shell-based analytics
  • A browser-based notebook for interactive analytics on a web browser
  • Kernels for different languages such as Python, Ruby, Julia, R, and so on
  • The nbconvert tool to convert .ipynb notebooks to other formats such as HTML, PDF, Markdown, and others (see the example after this list)
  • The nbviewer tool (http://nbviewer.ipython.org/) to view notebooks online, with GitHub integration for sharing notebooks publicly
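
For example, a hypothetical notebook named mynotebook.ipynb can be converted from the command line as follows:

jupyter nbconvert --to html mynotebook.ipynb
jupyter nbconvert --to markdown mynotebook.ipynb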

The Jupyter web-based notebook automatically detects installed kernels such as Python, Scala, R, and Julia. Notebook users can select the programming language of their choice for each individual notebook from a drop-down menu. The UI logic, such as syntax highlighting, logos, and help menus, is automatically updated when the programming language of a notebook is changed.
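
Once Jupyter is installed (covered next), you can check which kernels the notebook server has detected with the following command (the output depends on what is installed on your machine):

jupyter kernelspec list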

Installing Jupyter

You need Python 3.3 or above, or Python 2.7, to install Jupyter. Once this requirement is met, installing Jupyter is quite easy with Anaconda or pip using one of the following commands:

conda install jupyter
pip3 install jupyter
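
To verify the installation, you can print the version and start the notebook server (the reported version numbers will differ per installation):

jupyter --version
jupyter notebook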

The IPython kernel is automatically installed with the Jupyter installation. If you want to install other kernels, go to https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages, click on the kernel, and follow the procedure. This page also provides a list of all the languages available for Jupyter.
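
As an illustration, some kernels can be installed with pip alone; for example, the community-maintained bash kernel (check the kernel's own page for the current procedure, as it may change):

pip install bash_kernel
python -m bash_kernel.install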

Let's follow this procedure for the installation of the Jupyter Notebook (if you are not using conda) on the Hortonworks Sandbox virtual machine. These instructions will work on other distributions (Cloudera and MapR) as well:

  1. First of all, configure the dependencies with the following commands and then install and enable Python 2.7:
    yum install nano centos-release-SCL zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libpng-devel libjpeg-devel atlas-devel
    
    yum groupinstall "Development tools"
    
    yum install python27
    
    source /opt/rh/python27/enable
    
  2. Now, let's install pip and then install the Jupyter Notebook:
    sudo yum -y install python-pip 
    
    sudo pip install --upgrade pip
    
    pip install numpy scipy pandas scikit-learn tornado pyzmq pygments matplotlib jinja2 jsonschema
    
    pip install jinja2 --upgrade
    
    pip install jupyter
    
  3. Next, create a script to start the notebook with PySpark. Create the file and enter the following content:
    vi ~/start_ipython_notebook.sh
    
    #!/bin/bash
    source /opt/rh/python27/enable
    IPYTHON_OPTS="notebook --port 8889 
    --notebook-dir='/usr/hdp/current/spark-client/' 
    --ip='*' --no-browser" pyspark
    
    chmod +x ~/start_ipython_notebook.sh
    
  4. Finally, run the following command:
    ./start_ipython_notebook.sh
    
  5. Go to the browser and open a page with the following address as an example (change the IP address to the IP address of your sandbox). You should see a page similar to Figure 6.1. A quick sanity check for the Spark context follows these steps:

    http://192.168.139.165:8889/tree#


    Figure 6.1: The Jupyter Notebook
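
Once the notebook opens, the pyspark launcher has already created a SparkContext named sc. As a quick sanity check (the reported values depend on your Spark installation), run the following in the first cell:

sc.version   # the Spark version of the running context
sc.master    # the cluster manager, for example, local[*] in local mode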

Note that, by default, the Spark application starts in local mode. If you want to start with the YARN cluster manager, change your start command as follows:

[root@sandbox ~]# cat start_ipython_notebook.sh
#!/bin/bash
source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir='/usr/hdp/current/spark-client/' --ip='*' --no-browser" pyspark --master yarn
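
To confirm that the notebook's Spark application has been registered with YARN, you can list the running applications from a shell on the sandbox (assuming the YARN client is configured):

yarn application -list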

Hortonworks provides an unsupported Ambari service for Jupyter, which makes the installation and management of Jupyter easier. Perform the following steps to install and start the Jupyter service within Ambari:

git clone https://github.com/randerzander/jupyter-service
sudo cp -r jupyter-service /var/lib/ambari-server/resources/stacks/HDP/2.4/services/
sudo ambari-server restart

Go to http://<ip-address-of-sandbox>:8080 and log in with the admin/admin credentials. The Jupyter service is now included in the stack and can be added as a service. Click on Actions | Add Service | Select Jupyter | Customize service and deploy. Start the service; the notebook can then be viewed on port number 9999 in the browser. You can also add a port forwarding rule for port 9999 so that the notebook can be accessed at hostname:9999.

Change the port number in the configuration if it is already bound to another service.
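
If your sandbox runs in VirtualBox, the port forwarding rule can be added from the host with VBoxManage; a minimal sketch, assuming the VM is named "Hortonworks Sandbox" (use your actual VM name) and is powered off:

VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "jupyter,tcp,,9999,,9999"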

Analytics with Jupyter

Before we get started with the analytics of Spark, let's learn some of the important features of the Jupyter Notebook.

Click on New in the upper-right corner and select the Python 2 kernel to start a new notebook. Notebooks provide cells and output areas. You write code in a cell and then click on the run button or press Shift + Enter. You can run regular operating system commands such as ls, mkdir, and cp directly, or prefix any shell command with the ! symbol. Note that you get tab completion while typing commands. IPython also provides magic commands that start with the % symbol; a list of magic commands is available with the %lsmagic command.
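
For example, the following cell contents illustrate magic and shell commands (the outputs will vary):

%lsmagic                  # list all available magic commands
%timeit sum(range(1000))  # time a short statement
!hostname                 # run any OS command with the ! prefix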

You can mark a cell as Code, Markdown, Raw NBConvert, or Heading with the drop-down list located on the toolbar. In Markdown cells, you can add rich text, links, mathematical formulas, code, and images to document your work within the notebook. Some example Markdown syntax is available at https://guides.github.com/features/mastering-markdown/. When you create a notebook, it is created as Untitled.ipynb, but you can rename it by clicking on the title at the top of the page.
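
As a small illustration, the following content in a Markdown cell renders as a heading, formatted text, a link, and a formula:

# Word count example
This notebook computes **word counts** with [Apache Spark](https://spark.apache.org/).
Inline math: $e^{i\pi} + 1 = 0$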

Now, let's get started with analytics using Spark. You can execute any exercise from Chapter 3, Deep Dive into Apache Spark, to Chapter 5, Real-Time Analytics with Spark Streaming and Structured Streaming. Commands can be executed one by one, or you can put all the code in one cell and execute it at once. You can open multiple notebooks and run them on the same SparkContext. Let's run a simple word count and plot the output with matplotlib:

from operator import add
words = sc.parallelize(["hadoop spark hadoop spark mapreduce spark jupyter ipython notebook interactive analytics"])
counts = words.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add) \
              .sortBy(lambda x: x[1])

%matplotlib inline
import matplotlib.pyplot as plt
def plot(counts):
    # Split the (word, count) pairs into separate label and value lists
    labels = map(lambda x: x[0], counts)
    values = map(lambda y: y[1], counts)
    # Draw a horizontal bar chart with one bar per word
    plt.barh(range(len(values)), values, color='green')
    plt.yticks(range(len(values)), labels)
    plt.show()

plot(counts.collect())

You will see the result as shown in Figure 6.2:


Figure 6.2: Visualizations in the Jupyter Notebook
