© Ervin Varga 2019
E. Varga, Practical Data Science with Python 3, https://doi.org/10.1007/978-1-4842-4859-1_4

4. Documenting Your Work

Ervin Varga
Kikinda, Serbia

Data science and scientific computing are human-centered, collaborative endeavors that gather teams of experts covering multiple domains. You will rarely perform any serious data analysis task alone. Therefore, efficient intra-team communication that ensures proper information exchange within a team is required. There is also a need to convey all details of an analysis to relevant external parties. Your team is part of a larger scientific community, so others must be able to easily validate and verify your team’s findings. Reproducibility of an analysis is as important as the result itself. Achieving this requirement—to effectively deliver data, programs, and associated narrative as an interactive bundle—is not a trivial task. You cannot assume that everybody who wants to peek into your analysis is an experienced software engineer. On the other hand, all stakeholders aspire to make decisions based on available data. Fortunately, there is a powerful open-source solution for reconciling differences in individuals’ skill sets. This chapter introduces the project Jupyter (see https://jupyter.org ), the most popular ecosystem for documenting and sharing data science work.

The key to understanding the Jupyter project is to grasp the Jupyter Notebook architecture (see references [1–3]). A notebook in Jupyter is an interactive, executable narrative, which is buttressed by a web application running inside a browser. This web platform provides a convenient environment for doing all sorts of data science work, such as data preprocessing, data cleaning, exploratory analysis, reporting, etc. It may also serve as a full-fledged development environment and supports a multitude of programming languages, including Julia, Python, R, and Scala.1 As is emphasized on the Jupyter home page, the project “exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.” The project provides the building blocks for crafting all kinds of custom interactive computing environments. Here are the three principal components of the Jupyter architecture:
  • Notebook document format: A JSON document format for storing all types of content (code, images, videos, HTML, Markdown, LaTeX equations, etc.). The format has the extension .ipynb. See https://github.com/jupyter/nbformat .

  • Messaging protocol: A network messaging protocol based on JSON payloads for web clients (such as Jupyter Notebook) to talk to programming language kernels. A kernel is an engine for executing live code inside notebooks. The protocol specifies ZeroMQ or WebSockets as a transport mechanism. See https://github.com/jupyter/jupyter_client .

  • Notebook server: A server that exposes WebSocket and HTTP Resource APIs for clients to remotely access a file system, terminal, and kernels. See https://github.com/jupyter/jupyter_server .
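To make the first bullet concrete, here is a minimal sketch of an .ipynb document built by hand. The field names follow the nbformat version 4 schema; treat this as an illustration of the idea rather than a complete schema reference.

```python
import json

# A hand-built notebook document with one Markdown cell and one code cell.
# Field names follow the nbformat 4 schema (jupyter/nbformat); metadata is
# kept minimal for illustration.
minimal_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A heading\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hello')\n"]},
    ],
}

# The on-disk format is plain JSON, so it round-trips through dumps/loads.
serialized = json.dumps(minimal_notebook, indent=1)
print(json.loads(serialized)["cells"][0]["cell_type"])  # markdown
```

Because the format is ordinary JSON, any tool that can parse JSON can inspect or generate notebooks, which is exactly what nbconvert and nbviewer rely on.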

These elements are visualized in Figure 4-1. The web application may be the original Jupyter Notebook or some other compatible web user interface provider. The server process communicates with many different kernels. Every notebook instance is associated with a matching kernel process. A notebook is composed of many cells, each of which holds either text or live code. The code is executed by the proper language-dependent kernel. Code in a notebook may refer to previously defined code blocks. Usually, live code produces some read-only content, which becomes an integral part of the notebook. Such autogenerated content may later be browsed without starting any computation.
Figure 4-1

The major components of the Jupyter architecture as well as the structural decomposition of a notebook. Each notebook is associated with its dedicated kernel at any given point in time.

The various tools encompassed in the Jupyter ecosystem are as follows:
  • Jupyter Notebook: The first client/server stack of the architecture, which is still widely used at the time of writing.

  • JupyterLab: The next-generation web client application for working with notebooks.2 We will use it in this chapter. A notebook created with JupyterLab is fully compatible with Jupyter Notebook and vice versa.

  • JupyterHub: A cloud-based hosting solution for working with notebooks. It is especially important for enabling large organizations to scale their Jupyter deployments in a secure manner. Superb examples are UC Berkeley’s “Foundations of Data Science” and UC San Diego’s “Data Science MicroMasters” MOOC programs on the edX platform.

  • IPython: The Python kernel that enables users to use all extensions of the IPython console in their notebooks. This includes invoking shell commands prefixed with !. There are also magic commands for performing shell operations (such as %cd for changing the current working directory). You may want to explore the many IPython-related tutorials at https://github.com/ipython/ipython-in-depth .

  • Jupyter widgets and notebook extensions: All sorts of web widgets to bolster interactivity of notebooks as well as extensions to boost your notebooks. We will demonstrate some of them in this chapter. For a good collection of extensions, visit https://github.com/ipython-contrib/jupyter_contrib_nbextensions .

  • nbconvert: Converts a notebook into another rich content format (e.g., Markdown, HTML, LaTeX, PDF, etc.). See https://github.com/jupyter/nbconvert .

  • nbviewer: A system for sharing notebooks. You can provide a URL that points to your notebook, and the tool will produce a static HTML web page with a stable link. You may later share this link with your peers. See https://github.com/jupyter/nbviewer .

JupyterLab in Action

Start Anaconda Navigator as appropriate for your operating system. Chapter 1 provides installation instructions (you can also instantly try various Jupyter applications by visiting https://jupyter.org/try ), and Figure 1-4 shows Anaconda Navigator’s main screen with JupyterLab in the upper-left corner. Click JupyterLab’s Launch button to start the tool. JupyterLab will present its main page (dashboard) inside your web browser. The screen is divided into three areas:
  • The top menu bar includes commands to create, load, save, and close notebooks, create, delete, and alter cells, run cells, control the kernel, change views, read and update settings, and reach various help documents.

  • The left pane has tabs for browsing the file system, supervising the running notebooks, accessing all available commands, setting the properties of cells, and seeing which tabs are open in the right pane. The file browser’s tab is selected by default and allows you to work with directories and files on the notebook server (this is the logical root). If you run everything locally (your server’s URL will be something like http://localhost:88xx/lab), then this will be your home folder.

  • The right pane has tabs for active components. The Launcher (present by default) contains buttons to fire up a new notebook, open an IPython console, open a Terminal window, and open a text editor. Newly opened components will be tiled in this area.

Experimenting with Code Execution

In the spirit of data science, let’s first do some experiments with code execution. The goal is to get a sense of what happens when things go wrong, since being able to quickly debug issues increases productivity. Inside the file browser, click the toolbar button to create a new folder (the standard folder button with a plus sign on it). Right-click the newly created folder and rename it to hanoi. Double-click it to switch into that directory. Now, click the button in the Launcher in the right pane to open a notebook. Right-click inside the file browser on the new notebook file and rename it to Solver1.ipynb. If you have done everything properly, then you should see something similar to the screen shown in Figure 4-2.

Tip

If you have made an error, don’t worry. You can always delete and move items by using the context menu and/or drag-and-drop actions. Furthermore, everything you do from JupyterLab is visible in your favorite file handler, so you can fix things from there, too. I advise you to always create a designated directory for your project. This avoids clutter and aids in organizing your artifacts. You will need to reference many other items (e.g., images, videos, data files, etc.) from your notebook. Keeping these together is the best way to proceed.

Enter the erroneous function to solve the Hanoi Tower problem, as shown in Listing 4-1, into the code cell. Can you spot the problem without executing it? Execute the cell by clicking the Run button (large green arrow) on the toolbar (or simply press Shift+Enter). Observe that by running your cell in this fashion, JupyterLab automatically creates a new cell below it. This is useful when you continually add content to your document.
def solve_tower(num_disks, start, end, extra):
    if (num_disks > 0):
        solve_tower(num_disks - 1, start, extra, end)
        print('Move top disk from', start, 'to", end)
        solve_tower(num_disks - 1, extra, end, start)
solve_tower(3, 'a', 'c', 'b')
Listing 4-1

Hanoi Tower Solver with a Syntax Error

The output will be the following error message:
  File "<ipython-input-2-7eeda1002555>", line 4
    print('Move top disk from', start, 'to", end)
                                                 ^
SyntaxError: EOL while scanning string literal
Figure 4-2

The newly created notebook opened in the right pane

The first cell is empty, and you may start typing in your code. Notice that its type is Code, which is shown in the drop-down box. Every cell is demarcated with a rectangle, and the currently active one has a thick bar on its left side. The field surrounded by square brackets is the placeholder for the cell’s number. A cell receives a new identifier each time it is run. There are two types of numbered cells (see Figure 4-1): the editable code cell (its content is preserved inside the In collection) and the auto-generated cell (its content is preserved inside the Out collection). For example, you can refer to a cell’s output by typing Out[X], where X is the cell’s identifier. The most recent output can be referenced via _, and _X is shorthand for Out[X]. Finally, the history is searchable, and you may press the Up and Down keys to find the desired statement.

The error message in the output from Listing 4-1 is correct about encountering a syntax error. Nonetheless, the explanation is not that helpful, and is even misleading. Observe the bold characters in the error report. The caret symbol is Python’s guess about the location of the error, while the real error is earlier. It is caused by an imbalanced string marker. You may use either " or ' to delineate a string in Python, but do not mix them for the same string.
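For contrast, either quote character is fine on its own; a tiny sketch:

```python
# Either delimiter works, as long as the opening and closing quotes match.
single = 'Move top disk from'
double = "Move top disk from"
print(single == double)  # True

# Mixing them within one literal is what triggers the SyntaxError shown
# in Listing 4-1:
# print('Move top disk from', start, 'to", end)   # <- unbalanced quotes
```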

To make this notebook a good reminder and worthy educational material, put the following Markdown content into the cell below the error output (don’t forget to change the type of the cell to Markdown in the drop-down box):
This error message is an example that Python sometimes wrongly guesses the location of the error. **Different string markers should not be mixed for the same string**.
If you execute this cell (this time choose Run ➤ Run Selected Cells and Don’t Advance to eschew creating a new cell), you will get nicely rendered HTML content. Your screen should look similar to Figure 4-3.
Figure 4-3

Your first completed notebook showing an edge case of Python’s error reporting. Notice that anyone can see all the details without running the code.

Now save your notebook by clicking the disk icon button in the toolbar. Afterward, close the notebook’s window and shut it down (choose the tab with a symbol of a runner in the left pane and select SHUTDOWN next to your notebook). Simply closing the UI page does not terminate the dedicated background process; without an explicit shutdown, your notebook would keep running.

Repeat the earlier steps to create a new notebook and name it Solver2.ipynb. Enter the Hanoi Tower solver version shown in Listing 4-2 into a code cell and run it.
def solve_tower(num_disks, start, end, extra):
    if (num_disks > 0):
        solve_tower(num_disks - 1, start, extra, end)
        print('Move top disk from', start, 'to', end)
        solve_tower(num_disks, extra, end, start)
solve_tower(3, 'a', 'c', 'b')
Listing 4-2

Hanoi Tower Solver with Infinite Recursion

You will notice a very strange behavior. It will print an endless list of messages to move disks around, and finally report an error about reaching the maximum recursion depth. Instead of waiting for your code to blow up the stack, you may want to stop it. Such abrupt stoppage is the only option if your code enters an infinite loop. You can interrupt the code’s execution by invoking Kernel ➤ Interrupt Kernel (note the many other kernel-related commands available in the Kernel menu). The effect of this action is visible in the following abridged output:
...
Move top disk from b to c
Move top disk from a to c
Move top disk from b to c
Move top disk from a to c
Move top disk from b to
---------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-2-a4a17e313c43> in <module>()
      5         solve_tower(num_disks, extra, end, start)
      6
----> 7 solve_tower(3, 'a', 'c', 'b')
<ipython-input-2-a4a17e313c43> in solve_tower(num_disks, start, end, extra)
      1 def solve_tower(num_disks, start, end, extra):
      2     if (num_disks > 0):
----> 3         solve_tower(num_disks - 1, start, extra, end)
      4         print('Move top disk from', start, 'to', end)
      5         solve_tower(num_disks, extra, end, start)
...
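Outside the notebook, the same runaway recursion can be reproduced (and bounded) by lowering Python’s recursion limit; the sketch below mirrors the bug in Listing 4-2 but drops the print call so the failure is quiet. The limit value of 200 is an arbitrary choice for the demonstration.

```python
import sys

def solve_tower_buggy(num_disks, start, end, extra):
    if num_disks > 0:
        solve_tower_buggy(num_disks - 1, start, extra, end)
        # Bug from Listing 4-2: num_disks is not decremented here,
        # so this branch recurses forever.
        solve_tower_buggy(num_disks, extra, end, start)

old_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(200)  # lower the limit so the failure happens quickly
    solve_tower_buggy(3, 'a', 'c', 'b')
except RecursionError as err:
    print('caught:', type(err).__name__)  # caught: RecursionError
finally:
    sys.setrecursionlimit(old_limit)
```

In JupyterLab you rarely need this trick, because Kernel ➤ Interrupt Kernel achieves the same effect interactively, but it is handy when the faulty code runs in an automated script.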
Now, enter the final variant of the program to solve the Tower of Hanoi puzzle, as shown in Listing 4-3. Of course, you should first repeat the previous steps and create a new notebook, Solver3.ipynb. This revision also contains embedded documentation explaining the purpose of the routine. Even though you may describe your code in the narrative, it is always beneficial to document functions/routines separately. You can easily decide to move them from your notebook into a common place. Furthermore, input arguments are rarely described inside text cells, and it is easy to forget the details. All of this perfectly aligns with the following observation:

After all, the critical programming concerns of software engineering and artificial intelligence tend to coalesce as the systems under investigation become larger.

—Alan J. Perlis, Foreword to Structure and Interpretation of Computer Programs, Second Edition (MIT Press, 1996)

def solve_tower(num_disks:int, start:str, end:str, extra:str) -> None:
    """
    Solves the Tower of Hanoi puzzle.
    Args:
    num_disks: the number of disks to move.
    start: the name of the start pole.
    end: the name of the target pole.
    extra: the name of the temporary pole.
    Examples:
    >>> solve_tower(3, 'a', 'c', 'b')
    Move top disk from a to c
    Move top disk from a to b
    Move top disk from c to b
    Move top disk from a to c
    Move top disk from b to a
    Move top disk from b to c
    Move top disk from a to c
    >>> solve_tower(-1, 'a', 'c', 'b')
    """
    if (num_disks > 0):
        solve_tower(num_disks - 1, start, extra, end)
        print('Move top disk from', start, 'to', end)
        solve_tower(num_disks - 1, extra, end, start)
Listing 4-3

Correct Hanoi Tower Solver with Embedded Documentation and Type Annotations

The docstring also incorporates doctest tests. These can be executed by running the following code cell (it is handy to put such a cell at the end of your notebook):
import doctest
doctest.testmod(verbose=True)
The output reflects the number and outcome of your tests:
Trying:
    solve_tower(3, 'a', 'c', 'b')
Expecting:
    Move top disk from a to c
    Move top disk from a to b
    Move top disk from c to b
    Move top disk from a to c
    Move top disk from b to a
    Move top disk from b to c
    Move top disk from a to c
ok
Trying:
    solve_tower(-1, 'a', 'c', 'b')
Expecting nothing
ok
1 items had no tests:
    __main__
1 items passed all tests:
   2 tests in __main__.solve_tower
2 tests in 2 items.
2 passed and 0 failed.
Test passed.
TestResults(failed=0, attempted=2)

Finally, such documentation can be easily retrieved by executing solve_tower? (if you include two question marks, then you can dump the source code, too).
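The ? and ?? suffixes are IPython conveniences. In plain Python the same documentation is reachable programmatically, which the following sketch shows with a trimmed-down version of the function:

```python
import inspect

def solve_tower(num_disks: int, start: str, end: str, extra: str) -> None:
    """Solves the Tower of Hanoi puzzle by printing the required moves."""
    if num_disks > 0:
        solve_tower(num_disks - 1, start, extra, end)
        print('Move top disk from', start, 'to', end)
        solve_tower(num_disks - 1, extra, end, start)

print(solve_tower.__doc__)             # the raw docstring
print(inspect.getdoc(solve_tower))     # cleaned-up form, as help() renders it
print(inspect.signature(solve_tower))  # the annotated signature
```

This is what IPython’s ? help is built on top of, so anything you write in a docstring pays off in both environments.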

Managing the Kernel

The kernel is an engine that runs your code and sends back the results of execution. In Figure 4-2 (in the previous section), you can see the kernel’s status; look at the circle in the upper-right corner, next to the type of the kernel (in our case, Python 3). A white circle means that the kernel is idle and is ready to execute code. A filled-in circle denotes a busy kernel (this same visual cue also designates a dead kernel). If you notice that your program is halted, or the interaction with your notebook becomes tedious (slow and unresponsive behavior), then you may want to consider the following commands (all of them are grouped under the Kernel menu, and each menu item with an ellipsis will open a dialog box for you to confirm your intention):
  • Interrupt Kernel: Interrupts your current kernel. This is useful when all system components are healthy except your currently running code (we have already seen this command in action).

  • Restart Kernel...: Restarts the engine itself. You should try this if you notice sluggish performance of your notebook.

  • Restart Kernel and Clear...: Restarts the kernel and clears all autogenerated output.

  • Restart Kernel and Run All...: Restarts the kernel and runs all code cells from the beginning of your notebook.

  • Shutdown Kernel: Shuts down your current kernel.

  • Shutdown All Kernels...: Shuts down all active kernels. This applies when your notebook contains code written in different supported programming languages.

  • Change Kernel...: Changes your current kernel. You can choose a new kernel from the drop-down list. One of them is the dummy kernel called No Kernel (with this, all attempts to execute a code cell will simply wipe out its previously generated output). You can also choose a kernel from your previous session.

Note

Whenever you restart the kernel, you must execute your code cells again in proper order. Forgetting this is the most probable cause for an error like NameError: name <XXX> is not defined.
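This failure mode is easy to reproduce in plain Python; here result stands in for any variable whose defining cell was not rerun after a restart (the name is made up for the demonstration):

```python
# 'result' plays the role of a variable defined in a cell that was
# never rerun after a kernel restart.
try:
    print(result)
except NameError as err:
    print(err)  # name 'result' is not defined
```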

Connecting to a Notebook’s Kernel

If you would like to experiment with various variants of your code without disturbing the main flow of your notebook, you can attach another front end (like a terminal console or the Qt Console application) to the currently running kernel. To find out the necessary connection information, run the %connect_info magic command inside a code cell.3 The output will also give you some hints about how to make a connection. The nice thing about this is that you will have access to all artifacts from your notebook.

Caution

Make sure to always treat your notebook as a source of truth. You can easily introduce a new variable via your console, and it will appear as defined in your notebook, too. Don’t forget to put that definition back where it belongs, if you deem it to be useful.

Descending Ball Project

We will now develop a small but complete project to showcase other powerful features of JupyterLab (many of them are delivered by IPython, as this is our kernel). The idea is to make the example straightforward so that you can focus only on JupyterLab. Start by creating a new folder for this project (name it ball_descend). Inside it create a new notebook, Simulation.ipynb. Revisit the “Experimenting with Code Execution” section for instructions on how to accomplish these steps.

Problem Specification

It is useful to start your notebook with a clear title and a description of the problem. These details highlight the essence of your work from the very beginning. Don’t let others waste time trying to figure out whether your work is valuable to them. Select the Markdown cell type from the drop-down box, and enter the following text:
# Simulation of a Bal's Descent in a Terrain
This project simulates where a ball will land in a terrain.
## Input
The terrain's configuration is given as a matrix of integers representing elevation at each spot. For simplicity, assume that the terrain is surrounded by a rectangular wall that prevents the ball from escaping. The inner dimensions of the terrain are NxM, where N and M are integers between 3 and 1000.
The ball's initial position is given as a pair of integers (a, b).
## Output
The result is a list of coordinates denoting the ball's path in the terrain. The first element of the list is the starting position, and the last one is the ending position. They could be the same, if the terrain is flat or the ball is dropped at the lowest point (a local minimum).
## Rules
The ball moves according to the next two simple rules:
- The ball rolls from the current position into the lowest neighboring one.
- If the ball is surrounded by higher points, then it stops.

You should utilize the rich formatting options provided by the Markdown format. Here, we define headers to create some structure in our document. We will talk more about structuring in the next section. Once you execute this cell, it will be rendered as HTML.

It turns out that the title contains a small typo (the issue is more apparent in the rendered HTML); it says Bal's instead of Ball's. Double-click the text cell, correct the problem, and rerun the cell.

You could also have entered the preceding text using multiple consecutive cells. It is possible to split, merge, and move around cells by using the commands from the Edit menu or by using drag-and-drop techniques. The result of running a text cell is its formatted output. Text cells may be run in any order. This is not the case with code cells. They usually have dependencies on each other. Forgetting to run a cell on which your code depends may cause all sorts of errors. Obviously, reducing coupling between code cells is vital. A graph of dependencies between code cells may reveal a lot about complexity. This is one reason why you should minimize dependencies on global variables, as these intertwine your cells.

Model Definition

The problem description in the previous section adequately suggests the data model. One intuitive choice is to represent the terrain as a matrix of integers. NumPy already has such a data structure. Of course, for this miniature example, we could have used only standard Python stuff. Nonetheless, I want to give some hints about setup code. In the next code cell enter the content of Listing 4-4 and execute the cell (this will also create a new code cell below it).
# Usual bootstrapping code; just run this cell.
import numpy as np
Listing 4-4

Global Imports for the Notebook with a Comment for the User to Just Run It

Your notebook should be carefully organized with well-defined sections. Usually, bootstrapping code (such as shown in Listing 4-4) should be kept inside a single code cell at the very beginning of your notebook. Such code is not inherently related to your work, and thus should not be spread out all over your notebook. Moreover, most other code cells depend on this cell to be executed first. If you alter this section, then you should rerun all dependent cells. A handy way to accomplish this is to invoke Run ➤ Run Selected Cell and All Below.

Type into the code cell the content of Listing 4-5 and run it.
terrain = np.matrix([
    [-2, 3, 2, 1],
    [-2, 4, 3, 0],
    [-3, 3, 1, -3],
    [-4, 2, -1, 1],
    [-5, -7, 3, 0]
])
terrain
Listing 4-5

Definition of Our Data Model

A cell may hold multiple lines and expressions. Such a composite cell executes by sequentially running the contained expressions (in the order in which they appear). Here, we have an assignment and a value expression. The latter is useful to see the effect of the assignment (assignments are silently executed). Remember that a cell’s output value is always the value of its last expression (an assignment has no value). If you want to dump multiple messages, you can use Python’s print function (these messages will not count as output values) with or without a last expression. Typically, an output value will be nicely rendered into HTML, which isn’t the case with printed output. This will be evident when we output a Pandas data frame as a value in Chapter 5.

It is also possible to prevent outputting the last expression’s value by ending it with a semicolon. This can be handy in situations where you just want to see the effects of calling some function (most often related to visualization).

When your cursor is inside a multiline cell, you can use the Up and Down arrows on your keyboard to move among those lines. To use your keys to move between cells, you must first escape the block by pressing the Esc key. It is also possible to select a whole cell by clicking inside an area between the left edge of a cell and the matching thick vertical bar. Clicking the vertical bar will shrink or expand the cell (a squashed cell is represented with three dots).

Tip

It is possible to control the rendering mechanism for output values by toggling pretty printing on and off. This can be achieved by running the %pprint magic command. For a list of available magic commands, execute %lsmagic.

Note

Never put a slow expression that always yields the same value (like reading from a file) in the same cell as an idempotent parameterized expression that you will want to tweak (like showing the first couple of elements of a data frame). Otherwise, the latter cannot be executed independently, i.e., without continuously reloading the input file.

The matrix function is just one of many from the NumPy package. JupyterLab’s context-sensitive completion facility can help you a lot. Just press the Tab key after np., and you will get a list of available functions. Further typing (for example, pressing the M key) will narrow down the list of choices (for example, to names starting with m). Moreover, issuing np.matrix? in a code cell provides you with help information about this function (you must execute the cell to see the message). Executing np? gives you help about the whole framework.

Path Finder’s Implementation

We are now ready to tackle the essential piece of our project, the function to calculate the ball’s path. The input arguments will be the terrain’s configuration and the starting position of the ball. The output will be the list of coordinates that the ball would follow in the terrain. Figure 4-4 depicts the top-down decomposition of the initial problem into subproblems. Each subproblem is implemented as a separate function. We will start with the wall function (see Listing 4-6).
def wall(terrain:np.matrix, position:Tuple[int,int]) -> bool:
    """
    Checks whether the provided position is hitting the wall.
    Args:
    terrain: the terrain's configuration comprised from integer elevation levels.
    position: the pair of integers representing the ball's potential position.
    Output:
    True if the position is hitting the wall, or False otherwise.
    Examples:
    >>> wall(np.matrix([[-2, 3, 2, 1]]), (0, 1))
    False
    >>> wall(np.matrix([[-2, 3, 2, 1]]), (-1, 0))
    True
    """
    x, y = position
    length, width = terrain.shape
    return (x < 0) or (y < 0) or (x >= length) or (y >= width)
Listing 4-6

Definition of the wall Function to Detect Borders

Figure 4-4

Top-down decomposition of our problem; we will implement the functions via the bottom-up method

The logic is really simple. The wall function’s signature uses Python’s optional type annotations. The Tuple type must be imported by adding the following line to our bootstrapping cell (which must be rerun after the modification):
from typing import Tuple
I suggest that you be pragmatic with these annotations. For example, it is enough to state that the terrain is an np.matrix, instead of embarking on custom type definitions to describe it in more detail. The next two functions in Listing 4-7 should be added inside the same cell as the wall function. It makes sense to keep them together because they are interrelated. Moreover, they can be hidden in an all-or-nothing fashion by clicking the cell’s vertical bar.
def next_neighbor(terrain:np.matrix, position:Tuple[int,int]) -> Tuple[int,int]:
    """
    Returns the position of the lowest neighbor.
    Args:
    terrain: the terrain's configuration comprised from integer elevation levels.
    position: the pair of integers representing the ball's current position.
    Output:
    The position (pair of coordinates) of the lowest neighbor.
    Example:
    >>> next_neighbor(np.matrix([[-2, 3, 2, 1]]), (0, 1))
    (0, 0)
    """
    x, y = position
    allowed_neighbors = []
    for delta_x in range(-1, 2):
        for delta_y in range(-1, 2):
            new_position = (x + delta_x, y + delta_y)
            if (not wall(terrain, new_position)):
                allowed_neighbors.append((terrain.item(new_position), new_position))
    return min(allowed_neighbors)[1]
def find_path(terrain:np.matrix, position:Tuple[int,int]) -> List[Tuple[int,int]]:
    """
    Find the path that the ball would follow while descending in the terrain.
    Args:
    terrain: the terrain's configuration comprised from integer elevation levels.
    position: the pair of integers representing the ball's current position.
    Output:
    The list of coordinates of the path.
    Example:
    >>> find_path(np.matrix([[-2, 3, 2, 1]]), (0, 1))
    [(0, 1), (0, 0)]
    """
    next_position = next_neighbor(terrain, position)
    if (position == next_position):
        return [position]
    else:
        return [position] + find_path(terrain, next_position)
Listing 4-7

Implementation of the Other Two Functions As Shown in Figure 4-4

The find_path function is a very simple recursive function. The exit condition is the guaranteed local minimum (unless you model a terrain from Escher’s world), since we monotonically descend toward the lowest neighbor.

We must augment our list of imports in the setup section to include List, too. We must also add the following code cell to test our functions:
# Just run this cell to invoke tests embedded inside function descriptors.
import doctest
doctest.testmod(verbose=True)

After executing all cells, we should receive a test result with no errors. Notice that all functions are self-contained and independent of the environment. This is very important from the viewpoint of maintenance and evolution. Interestingly, all functions would execute perfectly even if you were to remove the terrain argument from their signatures. Nonetheless, dependence on global variables is as bad a practice in notebooks as it is anywhere else. It is easy to introduce unwanted side effects and pesky bugs. Nobody has the time, nor the incentive, to debug your document to validate your results!
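To see the three functions cooperate outside JupyterLab, here is a self-contained sketch of the whole pipeline. It swaps np.matrix for plain nested lists so that it runs without NumPy; apart from that adaptation, the logic mirrors Listings 4-5 through 4-7.

```python
from typing import List, Tuple

def wall(terrain: List[List[int]], position: Tuple[int, int]) -> bool:
    """True if the position falls outside the terrain."""
    x, y = position
    return x < 0 or y < 0 or x >= len(terrain) or y >= len(terrain[0])

def next_neighbor(terrain: List[List[int]],
                  position: Tuple[int, int]) -> Tuple[int, int]:
    """Position of the lowest cell in the 3x3 block around position."""
    x, y = position
    candidates = []
    for dx in range(-1, 2):
        for dy in range(-1, 2):
            nx, ny = x + dx, y + dy
            if not wall(terrain, (nx, ny)):
                candidates.append((terrain[nx][ny], (nx, ny)))
    return min(candidates)[1]

def find_path(terrain: List[List[int]],
              position: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Coordinates the ball visits while rolling downhill."""
    nxt = next_neighbor(terrain, position)
    return [position] if nxt == position else [position] + find_path(terrain, nxt)

# The terrain from Listing 4-5, as nested lists.
terrain = [[-2, 3, 2, 1],
           [-2, 4, 3, 0],
           [-3, 3, 1, -3],
           [-4, 2, -1, 1],
           [-5, -7, 3, 0]]
print(find_path(terrain, (1, 1)))  # [(1, 1), (2, 0), (3, 0), (4, 1)]
```

Starting from (1, 1), the ball rolls to ever-lower neighbors until it reaches the pit at (4, 1), whose elevation of -7 is lower than all of its neighbors.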

Interaction with the Simulator

It is time to wrap up the project by offering users a comfortable way to interact with our simulator. A classical approach would be to just create a code cell of the following form (each time, a user would need to change the code and rerun the cell):
start_position = (1, 1)
find_path(terrain, start_position)
There is a much better way by exploiting Jupyter Widgets. First, you need to augment the bootstrapping cell with the following import:
from ipywidgets import interact, widgets
JupyterLab doesn’t allow you to directly embed JavaScript-generated content into your document. You must install IPyWidgets as a JupyterLab extension. Save your notebook, shut down the kernel, and open a Terminal window. From a command line, execute the following command:
jupyter labextension install @jupyter-widgets/jupyterlab-manager
You will get a warning if you don’t have Node.js installed. You can easily install it by running
conda install nodejs
As explained in Chapter 1, these additions are best handled by first creating a custom environment for your project. Now start JupyterLab and load your notebook. Enter the following content into a new code cell:
interact(lambda start_x, start_y: find_path(terrain, (start_x, start_y)),
         start_x = widgets.IntSlider(value=1, max=terrain.shape[0]-1, description='Start X'),
         start_y = widgets.IntSlider(value=1, max=terrain.shape[1]-1, description='Start Y'));

After you execute this cell, you will be presented with two named sliders for setting the ball’s initial position (X represents the row and Y the column). Each time you move a slider, the system will output a new path. There is no way to provide an invalid starting position, as the sliders are configured to match the terrain’s shape. The notebook included in this book’s source code bundle also contains some narrative for presenting the result inside a separate section.

Test Automation

In this modern DevOps era, we cannot afford to perform tasks manually all the time. It is easy to open a notebook and select the menu item to run all cells. However, doing this repeatedly and frequently isn’t feasible. We must be able to automate the whole process. Part of the build automation is to test whether all cells in a notebook execute properly (in this manner, we can indirectly run doctest tests, too). Such a statement would be part of a build script. Open a Terminal window; this time do it from JupyterLab by clicking the corresponding button in the Launcher (if the Launcher tab isn’t visible, choose File ➤ New Launcher). Ensure that you are inside the source code folder of this chapter. Execute the following statement:
python -m pytest --nbval-lax ball_descend/Simulation.ipynb
You should see the following output:
=============== test session starts ==========================
platform darwin -- Python 3.6.5, pytest-3.6.0, py-1.5.3, pluggy-0.6.0
rootdir: /Users/evarga/Projects/pdsp_book/src/ch4, inifile:
plugins: remotedata-0.3.0, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2, nbval-0.9.1
collected 6 items
ball_descend/Simulation.ipynb ......                                                  [100%]
============ 6 passed in 2.15 seconds ========================
On the other hand, try to execute the following statement:
python -m pytest --nbval-lax hanoi/Solver1.ipynb
The tool will report an error, which is expected, as the notebook contains a code cell with a syntax error:
=============== test session starts ==========================
platform darwin -- Python 3.6.5, pytest-3.6.0, py-1.5.3, pluggy-0.6.0
rootdir: /Users/evarga/Projects/pdsp_book/src/ch4, inifile:
plugins: remotedata-0.3.0, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2, nbval-0.9.1
collected 1 item
hanoi/Solver1.ipynb F                                   [100%]
==================== FAILURES ================================
___________ hanoi/Solver1.ipynb::Cell 0 ______________________
Notebook cell execution failed
Cell 0: Cell execution caused an exception
Input:
def solve_tower(num_disks, start, end, extra):
    if (num_disks > 0):
        solve_tower(num_disks - 1, start, extra, end)
        print('Move top disk from', start, 'to", end)
        solve_tower(num_disks - 1, extra, end, start)
solve_tower(3, 'a', 'c', 'b')
Traceback:
  File "<ipython-input-1-7eeda1002555>", line 4
    print('Move top disk from', start, 'to", end)
                                                 ^
SyntaxError: EOL while scanning string literal
============ 1 failed in 2.16 seconds ========================

Refactoring the Simulator’s Notebook

You should always seek to improve the clarity and structure of your artifacts. A notebook isn’t an exception. The current one contains lots of Python code, which may distract the user from the main points of the work. We will make the following improvements:
  1. Move out the wall, next_neighbor, and find_path functions into a separate package called pathfinder.

  2. Move the doctest call into our new package.

  3. Import the new package into our notebook (we need to access the find_path function).

  4. Add more explanation about what we are doing, together with a nicely formatted formula.
Create a new folder named pathfinder in the current project folder. Create a new file in pathfinder and name it pathfinder.py. Copy the wall, next_neighbor, and find_path functions into this file. Remove the matching code cell from the notebook. Also copy over the imports of numpy and typing (located at the beginning of the notebook), and delete the typing import from the notebook. Create a file named __init__.py in pathfinder (see reference [4]) and insert the following line:
from pathfinder.pathfinder import find_path
Move over the content of the code cell from the notebook that invokes doctest and put it in the following if statement at the end of pathfinder.py:
if __name__ == "__main__":
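Putting it together, the tail of pathfinder.py might look like the following sketch. Guarding the doctest invocation this way means importing the package stays silent, while running the module directly still executes the embedded tests:

```python
# Tail of pathfinder/pathfinder.py (sketch). The doctest call moved out
# of the notebook now sits behind the __main__ guard, so it only fires
# when the module is executed directly, e.g.:
#     python pathfinder/pathfinder.py
if __name__ == "__main__":
    import doctest
    doctest.testmod(verbose=True)
```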
At the beginning of the notebook, insert the following import statement:
from pathfinder import find_path
At this point, your notebook should function the same as before. Notice its tidiness. Finally, add the following text to be the second and third sentences in your notebook:
It simulates the influence of Newton's law of universal gravitation on the movement of a ball, given by the formula $F=G\frac{m_1m_2}{r^2}$. Here, F is the resulting gravitational pull between the matching objects, $m_1$ and $m_2$ are their masses, r is the distance between the centers of their masses, and G is the gravitational constant.
The bold parts are LaTeX expressions (they must be demarcated by $). After you execute the text cell, the formula will be nicely rendered, as shown in Figure 4-5. More complex LaTeX content can be put inside the Raw cell type and rendered via the nbconvert command-line utility.
Figure 4-5

The LaTeX content inside ordinary text is beautifully rendered into HTML

Document Structure

In our previous project, we have already tackled the topic of content structuring, although in a really lightweight fashion. We will devote more attention to it here. Whatever technology you plan to use for your documentation task, you need to have a firm idea of how to structure your document. A structure brings order and consistency to your report. There are no hard rules about this structure, but there are many heuristics (a.k.a. best practices). The document should be divided into well-defined sections arranged into some logical order. Each section should have a clear scope and volume; for example, it makes no sense to devote more space to the introduction than to the key findings in your analysis. Remember that a notebook is also a kind of document and it must be properly laid out. Sure, the data science process already advises how and in what order to perform the major steps (see Chapter 1), and this aspect should be reflected in your notebook. Nonetheless, there are other structuring rules that should be superimposed on top of the data science life cycle model.

One plausible high-level document template may look as follows (I assume that a sound title/subtitle is mandatory in all scenarios):
  • Abstract: This section should be a brief summary of your work. It must illuminate what you have done, in what way, and enumerate key results.

  • Motivation: This section should explain why your work is important and how it may impact the target audience.

  • Dataset: This section should describe the dataset and its source(s). You should give unambiguous instructions that explain how to retrieve the dataset for reproducibility purposes.

  • Data Science Life Cycle Phases: The next couple of sections should follow the data science life cycle model (data preprocessing, research questions, methods, data analysis, reporting, etc.) and succinctly explain each phase. These details are frequently present in data analysis notebooks.

  • Drawbacks: This section should honestly mention all limitations of your methodology. Not knowing about constraints is very dangerous in decision making.

  • Conclusion: This section should elaborate about major achievements.

  • Future Work: This section should give some hints about what you are planning to do in the future. Other scientists are dealing with similar issues, and this opens up an opportunity for collaboration.

  • References: This section should list all pertinent references that you have used during your research. Don’t bloat this section as an attempt to make your work more “convincing.”

The users of your work may be segregated into three major categories: the general public, decision makers, and technically savvy users. The general public is only interested in what you are trying to solve. Users in this category likely will read only the title and abstract. The decision makers are business people and are likely to read the major findings as well as the drawbacks and conclusion. They mostly seek well-formulated actionable insights. The technical people (including CTOs, data scientists, etc.) would also like to reproduce your findings and extend your research. Therefore, they will look into all aspects of your report, including implementation details. If you fail to recognize and/or address the needs of these various classes of users, then you will reduce the potential to spread your results as broadly as possible.

Wikipedia Edits Project

As an illustration of the template outlined in the previous section, I will fill out some of the sections based upon my analysis of Wikipedia edits. The complete Jupyter notebook is publicly available at Kaggle (see https://www.kaggle.com/evarga/analysis-of-wikipedia-edits ). It does contain details about major data science life cycle phases. The goal of this project is to spark discussion, as there are many points open for debate. The following sections from the template should be enough for you to grasp the essence of this analysis (without even opening the previously mentioned notebook). Don’t worry if the Kaggle notebook seems complicated at this moment.

Abstract

This study uses the Wikipedia Edits dataset from Kaggle. It tries to inform the user whether Wikipedia’s content is stable and accountable. The report also identifies which topics are most frequently edited, based on words in edited titles. The work relies on various visualizations (like scatter plots, stacked bar graphs, and word clouds) to draw conclusions. It also leverages NLTK to process the titles. We may conclude that Wikipedia is good enough for informal usage with proper accountability, and that themes like movies, sports, and music are the most frequently updated.

Motivation

Wikipedia often is the first web site that people visit when they are looking for information. Obviously, high quality (accuracy, reliability, timeliness, etc.) of its content is imperative for a broad community. This work tries to peek under the hood of Wikipedia by analyzing the edits made by users. Wikipedia can be edited by anyone (including bots), and this may raise concerns about its trustworthiness. Therefore, by getting more insight about the changes, we can judge whether Wikipedia can be treated as a reliable source of information. As a side note, scientific papers cannot rely on it. There are also some book publishers who forbid referencing Wikipedia. All in all, this report tries to shed light on whether Wikipedia is good enough for informal usage.

Drawbacks

  • The data reflects an activity of users over a very short period of time. Such a small dataset cannot provide a complete story. Moreover, due to time zone differences, it cannot represent all parts of the world.

  • There is no description on Kaggle about the data acquisition process for the downloaded dataset. Consequently, the recorded facts should be taken with a pinch of salt. The edit’s size field is especially troublesome.

  • The data has inherent limitations, too. I had no access to the user profiles, so I assumed all users are equally qualified to make edits. Had I had this access, I could have weighted the impact of their modifications.

Conclusion

  • Wikipedia is good enough for informal usage. The changes are mostly about fixing smaller issues. Larger changes in size are related mostly to addition of new content and are performed by humans. These updates are treated as major.

  • Larger edits are done by registered users, while smaller fixes are also performed by anonymous users.

  • Specialized content (scientific, technical/technology related, etc.) doesn’t change as frequently as topics about movies, sport, music, etc.

Exercise 4-1. External Load of Data

Manually entering huge amounts of data doesn’t scale. In our case study, the terrain’s configuration fits into a 5×4 matrix. Using this approach to insert elevations for a 200×300 terrain would be impossible. A better tactic is to store data externally and load it on demand. Modify the terrain’s initialization code cell to read data from a text file. Luckily, you don’t need to wrestle with this improvement too much. NumPy’s Matrix class already supports data as a string. Here is an excerpt from its documentation: “If data is a string, it is interpreted as a matrix with commas or spaces separating columns, and semicolons separating rows.”

You will want to first produce a text file with the same content as we have used here. In this way, you can easily test whether everything else works as expected. You should upload the configuration file into the same folder where your notebook is situated (to be able to use only the file name as a reference). To upload files, click the Upload Files toolbar button in JupyterLab’s file browser. Refer to Chapter 2 for guidance on how to open/read a text file in Python.
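Under the quoted matrix-as-string behavior, the loader could be sketched like this (terrain.txt is a hypothetical file name; rows are written one per line and joined with semicolons, the row separator np.matrix expects):

```python
import numpy as np

def load_terrain(file_name):
    """Read a whitespace-separated grid of elevations, one row per
    line, and build a matrix from it. Newlines are converted into
    semicolons, which np.matrix uses to separate rows in string form."""
    with open(file_name) as f:
        # Skip blank lines so trailing newlines don't produce empty rows.
        rows = [line.strip() for line in f if line.strip()]
    return np.matrix(';'.join(rows))
```

For example, a file containing the two lines "-2 3 2 1" and "-2 4 3 0" loads as a 2×4 matrix, so the rest of the notebook can keep treating the terrain exactly as before.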

Exercise 4-2. Fixing Specification Ambiguities

Thanks to the accessibility of your JupyterLab notebook and the repeatability of your data analysis, one astute data scientist has noticed a flaw in your solution. He reported this problem with a complete executable example (he shared with you a modified notebook file). Hence, you can figure out exactly what he would like you to fix. For a starting position (0, 1), it is not clear in advance whether the ball should land in (0, 0) or (1, 0), since both are at the locally lowest altitude (in this case -2). It is also not clear where the ball should go if it happens to land on a plateau (an area of the terrain at the same elevation). In the current solution, it will stop on one of the spots, depending on the search order of neighbors. Surely, this doesn’t quite satisfy the rule of stopping at the local minimum.

The questions are thus: Should you consider inertia? How do you document what point will be the final point? Think about these questions and expand the text and/or modify the path-finding algorithm.

Exercise 4-3. Extending the Project

Another data scientist has requested an extension of the problem. She would like to ascribe elasticity to the ball. If it drops more than X units, then it could bounce up Y units. Change the path-finding algorithm to take this flexibility into account. Assume that the ball will still select the lowest neighbor, although the set of candidates will increase. Will the recursion in find_path always terminate? What conditions dictate such guaranteed termination? Test your solution thoroughly.

Exercise 4-4. Notebook Presentation

In the “Document Structure” section, you can find the proposed document template. Creating a separate artifact, external to your main notebook, isn’t a good choice, since it will eventually drift away from it (like passive design documents in software development).

A JupyterLab notebook can have a dual purpose: as presentation material and as an executable narrative. Extend the descending ball project’s notebook with parts from the document template (add Abstract and Conclusion sections, for a start). Set the slide type of these textual cells to Slide. Mark other cells as Skip. Open a Terminal window and type:
jupyter nbconvert <YourSlide>.ipynb --to slides --post serve

A new browser window will open, presenting, one after another, the cells marked as Slide. Look up the meaning of the other slide types: Sub-slide, Fragment, and Notes. Consult JupyterLab’s documentation for more information about presentation mode. For a really professional presentation, you should use Reveal.js (see https://revealjs.com ).

Summary

This chapter covered some of the benefits of packaging documentation as a self-contained, executable, and shareable asset:
  • Freeform text is bundled together with executable code; this eliminates the need to maintain separate documents, which usually get out of sync with code.

  • The output of code execution may be saved in the document and become an integral part of it.

  • The setup of an executable environment (to bring in dependencies for running your code) may be done inside the document. This solves many deployment problems and eliminates a steep learning curve for those who would like to see your findings in action.

You have witnessed the power behind computational notebooks and how the Jupyter toolset accomplishes most requirements regarding documentation. By supporting disparate programming languages, JupyterLab fosters polyglot programming, which is important in the realm of data science. In the same way as multiple data sources are invaluable, many differently optimized development/executable environments are indispensable in crafting good solutions.

JupyterLab has many more useful features not demonstrated here. For example, it has a web-first code editor that eliminates the need for a full-blown integrated development environment (such as Spyder) for smaller edits. You can edit Python code in the browser even when you are away from your machine. JupyterLab also allows you to handle data without writing Python code. If you open a Leaflet GeoJSON file (see https://leafletjs.com ), it will be immediately visualized and ready for interaction; a classical approach would entail running a Python code cell.

All in all, this chapter has provided the foundation upon which further chapters will build. We will continually showcase new elements of JupyterLab, as this will be our default executable environment.

References

  1. Brian Granger, “Project Jupyter: From Computational Notebooks to Large Scale Data Science with Sensitive Data,” ACM Learning Seminar, September 2018.

  2. Matt Cone, The Markdown Guide, https://www.markdownguide.org .

  3. Jake VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data, O’Reilly Media, Inc., 2016.

  4. Mike Grouchy, “Be Pythonic: __init__.py,” https://mikegrouchy.com/blog/be-pythonic-__init__py , May 17, 2012.