Creating scatter plots with Bokeh-Scala

While Zeppelin is powerful enough to quickly execute our Spark SQL queries and visualize the results, it is still an evolving platform. In this section, we'll take a brief look at Bokeh, a popular visualization framework for Python, and use its (also fast-evolving) Scala bindings. Breeze also has a visualization API called breeze-viz, which is built on JFreeChart. Unfortunately, at the time of writing this book, that API is not actively maintained, so we won't be discussing it here.

The power of Zeppelin lies in the ability to share and view graphics in the browser, which it owes to the D3.js JavaScript visualization library. Bokeh is likewise backed by a JavaScript visualization library, called BokehJS. The Scala bindings library (bokeh-scala) not only gives us an easier way to construct glyphs (lines, circles, and so on) out of Scala objects, but also translates those glyphs into a format that the BokehJS JavaScript components understand.

A word of caution: the Bokeh-Scala bindings are still evolving and operate at a lower level, which sometimes makes them more cumbersome than their Python counterpart. That said, I am sure we can all appreciate the graphs that we can create right out of Scala.

How to do it...

In this recipe, we will be creating a scatter plot using iris data (https://archive.ics.uci.edu/ml/datasets/Iris), which has the length and width attributes of flowers belonging to three different species of the same plant. Drawing a scatter plot on this dataset involves a series of interesting substeps.

For the purpose of representing the iris data in a Breeze matrix, I have naïvely transformed the species categories into numbers:

  • Iris setosa: 0
  • Iris versicolor: 1
  • Iris virginica: 2

This is available in irisNumeric.csv. Later, we'll see how we can load the original iris data (iris.data) into a Spark DataFrame and use that as a source for plotting.
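As a rough illustration of that transformation, here is a minimal sketch that turns one line of iris.data into the numeric form. It assumes each line has four measurements followed by a species label such as Iris-setosa; the IrisNumeric object and the toNumericLine helper are hypothetical names used only for this example, not part of the recipe's code:

```scala
object IrisNumeric {
  // Species labels (assumed to match iris.data) mapped to the codes used in this recipe.
  val speciesCode: Map[String, Int] = Map(
    "Iris-setosa" -> 0,
    "Iris-versicolor" -> 1,
    "Iris-virginica" -> 2
  )

  // Convert one CSV line; returns None for blank, malformed, or unknown-species lines.
  def toNumericLine(line: String): Option[String] = {
    val fields = line.trim.split(",")
    if (fields.length != 5) None
    else speciesCode.get(fields(4)).map { code =>
      (fields.take(4) :+ code.toString).mkString(",")
    }
  }
}
```

Returning an Option lets us skip blank or unexpected lines instead of failing halfway through the file.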

For the sake of clarity, let's define what the various terms in Bokeh actually mean:

  • Glyph: All geometric shapes that we can think of—circles, squares, lines, and so on—are glyphs. This is just the UI representation and doesn't hold any data. All the properties related to this object just help us modify the UI properties: color, x, y, width, and so on.
  • Plot: A plot is like a canvas on which we arrange various objects relevant to the visualization, such as the legend, x and y axes, grid, tools, and obviously, the core of the graph—the data itself. We construct various accessory objects and finally add them to the list of renderers in the plot object.
  • Document: The document is the component that does the actual rendering. It accepts the plot as an argument, and when we call the save method in the document, it uses all the child renderers in the plot object and constructs a JSON from the wrapped elements. This JSON is eventually read by the BokehJS widgets to render the data in a visually pleasing manner. More than one plot can be rendered in the document by adding it to a grid plot (we'll look at how this is done in the next recipe, Creating a time series MultiPlot with Bokeh-Scala).

A plot is a composition of multiple widgets/glyphs.

Building the scatter plot consists of the following steps:

  1. Preparing our data.
  2. Creating the Plot and Document objects.
  3. Creating a point (marker object) and a renderer for it.
  4. Setting the x and y axes' data range for the plot.
  5. Drawing the x and the y axes.
  6. Viewing the marker objects with varying colors.
  7. Adding grid lines.
  8. Adding a legend to the plot.

Preparing our data

Bokeh plots require the data to be in a format that Bokeh understands, and getting it there is easy. All we need to do is create a new source object that inherits from ColumnDataSource. The other options are AjaxDataSource and RemoteDataSource.

So, let's overlay our Breeze data source on ColumnDataSource:

import java.io.File

import breeze.linalg._
import io.continuum.bokeh._

object IrisSource extends ColumnDataSource {

  private val colormap = Map[Int, Color](0 -> Color.Red, 1 -> Color.Green, 2 -> Color.Blue)

  private val iris = csvread(file = new File("irisNumeric.csv"), separator = ',')

  val sepalLength = column(iris(::, 0))
  val sepalWidth = column(iris(::, 1))
  val petalLength = column(iris(::, 2))
  val petalWidth = column(iris(::, 3))
  val species = column(iris(::, 4))
}

The csvread call reads irisNumeric.csv into a Breeze matrix. The color map is something that we'll use later while plotting; its purpose is to translate each species of flower into a different color. The final piece converts the Breeze matrix into the columns of the ColumnDataSource: we select specific columns of the matrix and map each one to a named column of the source.
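To make the idea of selecting a column concrete, here is a small sketch that uses only the Scala standard library. It is not how Breeze or Bokeh-Scala implement anything; ColumnSketch, parse, and columnAt are hypothetical names, and the input is assumed to be purely numeric, comma-separated rows like those in irisNumeric.csv:

```scala
object ColumnSketch {
  // Parse comma-separated numeric lines into a row-major table.
  def parse(lines: Seq[String]): Vector[Vector[Double]] =
    lines.map(_.split(",").map(_.trim.toDouble).toVector).toVector

  // Pull out one column, similar in spirit to iris(::, index) on a Breeze matrix.
  def columnAt(rows: Vector[Vector[Double]], index: Int): Vector[Double] =
    rows.map(row => row(index))
}
```

With this layout, index 2 would give the petal lengths and index 4 the numeric species codes, which mirrors the column order used in IrisSource.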

Creating Plot and Document objects

Let's set the plot's title to Iris Petal Length vs Width and create a document object so that we can save the final HTML under the name IrisBokehBreeze.html. Since we haven't specified the full path of the target file in the save method, the file will be saved in the same directory as the project itself:

val plot = new Plot().title("Iris Petal Length vs Width")

val document = new Document(plot)
val file = document.save("IrisBokehBreeze.html")

println(s"Saved the chart as ${file.url}")

Creating a marker object

Our plot has neither data nor any glyphs. Let's first create a marker object that marks the data point. There are a variety of marker objects to choose from: Asterisk, Circle, CircleCross, CircleX, Cross, Diamond, DiamondCross, InvertedTriangle, PlainX, Square, SquareCross, SquareX, and Triangle.

Let's choose Diamond for our purpose:

// Bring petalLength, petalWidth, and the other columns into scope
import IrisSource._

val diamond = new Diamond()
    .x(petalLength)
    .y(petalWidth)
    .fill_color(Color.Blue)
    .fill_alpha(0.5)
    .size(5)

val dataPointRenderer = new GlyphRenderer().data_source(IrisSource).glyph(diamond)

While constructing the marker object, other than the UI attributes, we also say what the x and the y coordinates for it are. Note that we have also mentioned that the color of this marker is blue. We'll change that in a while using the color map.

Setting the X and Y axes' data range for the plot

The plot needs to know what the x and y data ranges of the plot are before rendering. Let's do that by creating two DataRange objects and setting them to the plot:

val xRange = new DataRange1d().sources(petalLength :: Nil)
val yRange = new DataRange1d().sources(petalWidth :: Nil)

plot.x_range(xRange).y_range(yRange)
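Notice that we point each range at a column instead of giving it explicit limits. Conceptually, the limits are then derived from the values in that column. The following standard-library sketch shows only that idea; it is an assumption about the general behavior, not the actual DataRange1d logic, and RangeSketch and bounds are hypothetical names:

```scala
object RangeSketch {
  // Smallest and largest value in a column; None when the column is empty.
  def bounds(values: Seq[Double]): Option[(Double, Double)] =
    if (values.isEmpty) None
    else Some((values.min, values.max))
}
```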

Let's try and run the first cut of this program.

The following is the output:

[Figure: Setting the X and Y axes' data range for the plot]

We see that this needs a lot of work to be done. Let's do it bit by bit.

Drawing the x and the y axes

Let's now draw the axes, set their bounds, and add them to the plot's renderers. We also need to let the plot know which location each axis belongs to:

// X and Y axes
val xAxis = new LinearAxis().plot(plot).axis_label("Petal Length").bounds((1.0, 7.0))
val yAxis = new LinearAxis().plot(plot).axis_label("Petal Width").bounds((0.0, 2.5))
plot.below <<= (listRenderer => (xAxis :: listRenderer))
plot.left <<= (listRenderer => (yAxis :: listRenderer))

// Add the renderers to the plot
plot.renderers := List(xAxis, yAxis, dataPointRenderer)

Here is the output:

[Figure: Drawing the x and the y axes]
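The axis code uses two different operators: := replaces a property's value outright, while <<= takes a function and derives the new value from the current one. The toy model below illustrates that difference using only the standard library. It is a sketch of the semantics as they appear in this recipe, not the real Bokeh-Scala implementation; FieldDemo and Field are hypothetical names, and plain strings stand in for renderer objects:

```scala
object FieldDemo {
  // Toy stand-in for a mutable plot property; NOT the real Bokeh-Scala class.
  final class Field[T](initial: T) {
    private var current: T = initial

    def value: T = current

    // Replace the stored value outright.
    def :=(next: T): Unit = { current = next }

    // Compute the new value from the old one.
    def <<=(update: T => T): Unit = { current = update(current) }
  }

  // Prepending with <<= keeps whatever was already in the list.
  val below = new Field[List[String]](Nil)
  below <<= (rs => "xAxis" :: rs)
  below <<= (rs => "secondAxis" :: rs)

  // Assigning with := discards the previous contents.
  val renderers = new Field[List[String]](List("stale"))
  renderers := List("xAxis", "yAxis", "dataPointRenderer")
}
```

This is why we prepend to plot.below and plot.left with <<= but set plot.renderers with := each time we want a new, complete list.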

Viewing flower species with varying colors

All the data points are marked with blue as of now, but we would really like to differentiate the species visually. This is a simple two-step process:

  1. Add new derived data (speciesColor) into our ColumnDataSource to hold colors that represent the species:
    object IrisSource extends ColumnDataSource {

      private val colormap = Map[Int, Color](0 -> Color.Red, 1 -> Color.Green, 2 -> Color.Blue)

      private val iris = csvread(file = new File("irisNumeric.csv"), separator = ',')

      val sepalLength = column(iris(::, 0))
      val sepalWidth = column(iris(::, 1))
      val petalLength = column(iris(::, 2))
      val petalWidth = column(iris(::, 3))
      val species = column(iris(::, 4))
      // Derived column: one color per row, looked up from the numeric species code
      val speciesColor = column(species.value.map(v => colormap(v.round.toInt)))
    }

    So, we assign red to Iris setosa, green to Iris versicolor and blue to Iris virginica.

  2. Modify the diamond marker to take this as input instead of accepting a static blue:
    val diamond = new Diamond()
        .x(petalLength)
        .y(petalWidth)
        .fill_color(speciesColor)
        .fill_alpha(0.5)
        .size(10)
    

The output is as follows:

[Figure: Viewing flower species with varying colors]
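The species codes come out of the Breeze matrix as doubles (0.0, 1.0, and 2.0), which is why the derived column rounds each value to an Int before looking it up in the color map. Here is a minimal standard-library sketch of that lookup; ColorLookup and colorsFor are hypothetical names, and plain strings stand in for Bokeh's Color values:

```scala
object ColorLookup {
  // Strings stand in for Bokeh's Color values in this sketch.
  val colormap: Map[Int, String] = Map(0 -> "red", 1 -> "green", 2 -> "blue")

  // Round each numeric species code to an Int and look up its color.
  def colorsFor(species: Seq[Double]): Seq[String] =
    species.map(v => colormap(v.round.toInt))
}
```

Note that a code outside 0 to 2 would throw a NoSuchElementException here, so this sketch relies on the data being clean.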

It looks fairly okay now. Let's add some tools to the image. Bokeh has some nice tools that can be attached to the image: BoxSelectTool, BoxZoomTool, CrosshairTool, HoverTool, LassoSelectTool, PanTool, PolySelectTool, PreviewSaveTool, ResetTool, ResizeTool, SelectTool, TapTool, TransientSelectTool, and WheelZoomTool.

Let's add a few of them to see how they work:

val panTool = new PanTool().plot(plot)
val wheelZoomTool = new WheelZoomTool().plot(plot)
val previewSaveTool = new PreviewSaveTool().plot(plot)
val resetTool = new ResetTool().plot(plot)
val resizeTool = new ResizeTool().plot(plot)
val crosshairTool = new CrosshairTool().plot(plot)

plot.tools := List(panTool, wheelZoomTool, previewSaveTool, resetTool, resizeTool, crosshairTool)

[Figure: Viewing flower species with varying colors, with tools attached]

Adding grid lines

While we have the crosshair tool, which helps us locate the exact x and y values of a particular data point, it would be nice to have a data grid too. Let's add two data grids, one for the x axis and one for the y axis:

val xAxis = new LinearAxis().plot(plot).axis_label("Petal Length").bounds((1.0, 7.0))
val yAxis = new LinearAxis().plot(plot).axis_label("Petal Width").bounds((0.0, 2.5))
val xgrid = new Grid().plot(plot).axis(xAxis).dimension(0)
val ygrid = new Grid().plot(plot).axis(yAxis).dimension(1)

Next, let's add the grids to the plot renderer list too:

plot.renderers := List(xAxis, yAxis, dataPointRenderer, xgrid, ygrid)

[Figure: Adding grid lines]

Adding a legend to the plot

This step is a bit tricky in the Scala bindings of Bokeh due to the lack of high-level graphing objects, such as scatter. For now, let's cook up our own legend. The legends property of the Legend object accepts a list of tuples, each pairing a label with a list of GlyphRenderer objects. Let's explicitly create three GlyphRenderer objects wrapping diamonds of three colors, which represent the species. We then add them to the plot:

val setosa = new Diamond().fill_color(Color.Red).size(10).fill_alpha(0.5)
val setosaGlyphRnd = new GlyphRenderer().glyph(setosa)
val versicolor = new Diamond().fill_color(Color.Green).size(10).fill_alpha(0.5)
val versicolorGlyphRnd = new GlyphRenderer().glyph(versicolor)
val virginica = new Diamond().fill_color(Color.Blue).size(10).fill_alpha(0.5)
val virginicaGlyphRnd = new GlyphRenderer().glyph(virginica)

val legends = List("setosa" -> List(setosaGlyphRnd),
                   "versicolor" -> List(versicolorGlyphRnd),
                   "virginica" -> List(virginicaGlyphRnd))

val legend = new Legend().orientation(LegendOrientation.TopLeft).plot(plot).legends(legends)

plot.renderers := List(xAxis, yAxis, dataPointRenderer, xgrid, ygrid, legend, setosaGlyphRnd, virginicaGlyphRnd, versicolorGlyphRnd)

[Figure: Adding a legend to the plot]