Chapter 11. Data Science and R

Data Science is a relatively new discipline which first came to the attention of many with this article by O’Reilly’s Mike Loukides. While there are many definitions in the field, Loukides notes:

A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.

One of the main open-source ecosystems for data science software is at Apache, and includes Hadoop (which includes the HDFS distributed filesystem, Hadoop Map/Reduce [1], Ozone object store, YARN scheduler, and more), Cassandra distributed database, Spark compute engine, and more. Read the Modules and Related Tools section of the Hadoop page for a current list.

What’s interesting here is that a great deal of this infrastructure, which is taken for granted by data scientists, is written in Java and Scala (a JVM language). Much of the rest is written in Python, a language that complements Java.

Data science problems may involve a lot of setup, so we’ll only give one example from traditional DS, using the Spark framework. Spark is written in Scala so it can be used directly by Java code.

In the rest of the chapter I’ll focus on a language called R which is widely used both in statistics and in data science (well, also in many other sciences; many of the graphs you see in refereed journal articles are prepared with R). R is widely used and is useful to know. Its primary implementation was not written in Java, but in a mixture of C, Fortran and R itself. But R can be used within Java, and Java can be used within R. We’ll talk about several implementations of R and how to select one, then show techniques for using Java from R and R from Java, as well as using R in a web application.

11.1 Machine Learning with Java

Problem

You want to use Java for machine learning and data science, but everyone tells you to use Python.

Solution

Use one of the many powerful Java toolkits available for free download.

Discussion

It’s sometimes said that machine learning (ML) and deep learning have to be done in C++ for efficiency or in Python for the wide availability of software. While these languages have their advantages and their advocates, it is certainly possible to use Java for these purposes. However, setting up these packages and presenting even a short demo tends to take more space than our typical “Recipe” format allows.

With industry giant Amazon having released its Java-based Deep Java Library (DJL) as this book was going to press, and with many other good libraries available (see Table 11-1), quite a few of which support CUDA for faster GPU-based processing, there is no reason to avoid using Java for ML. With the exception of DJL, which is new, I’ve tried to list packages that are still being maintained and have a decent reputation among users.

Table 11-1. Some Java Machine Learning Packages
Library name | Description | Info URL | Source URL
ADAMS | Workflow engine for building/maintaining data-driven, reactive workflows; integration with business processes | https://adams.cms.waikato.ac.nz/ | https://github.com/waikato-datamining/adams-base
Deep Java Library | Amazon’s ML library | https://djl.ai | https://github.com/awslabs/djl
Deeplearning4j | DL4J, Eclipse’s distributed deep-learning library; integrates with Hadoop and Apache Spark | https://deeplearning4j.org/ | https://github.com/eclipse/deeplearning4j
ELKI | Data mining toolkit | https://elki-project.github.io/ | https://github.com/elki-project/elki
Mallet | ML for text processing | mallet.cs.umass.edu | https://github.com/mimno/Mallet.git
Weka | ML algorithms for data mining; tools for data preparation, classification, regression, clustering, association rules mining, and visualization | https://www.cs.waikato.ac.nz/ml/weka/index.html | https://svn.cms.waikato.ac.nz/svn/weka/trunk/weka

See Also

The book Data Mining: Practical Machine Learning, 4th Edition, was written by the team behind Weka.

See also Eugen Paraschiv’s list of Java AI software packages.

11.2 Using Data In Apache Spark

Problem

You want to process data using Spark.

Solution

Create a SparkSession, use its read() method to read a Dataset, apply operations, and summarize the results.

Discussion

Spark is a massive subject! Entire books have been written on using it. Quoting Databricks, home of much of the original Spark team (Databricks offers several free e-books on Spark from its website, as well as commercial Spark add-ons):

Apache Spark™ has seen immense growth over the past several years, becoming the de-facto data processing and AI engine in enterprises today due to its speed, ease of use, and sophisticated analytics. Spark unifies data and AI by simplifying data preparation at massive scale across various sources, providing a consistent set of APIs for both data engineering and data science workloads, as well as seamless integration with popular AI frameworks and libraries such as TensorFlow, PyTorch, R and SciKit-Learn.

I cannot convey the whole subject matter in this book. However, one thing Spark is good for is dealing with lots of data. In this example, we read an Apache-format logfile, and find (and count) the lines with 200, 404 and 500 responses.

Example 11-1. spark/src/main/java/sparkdemo/LogReader.java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

/**
 * Read an Apache Logfile and summarize it.
 */
public class LogReader {

    public static void main(String[] args) {

        final String logFile = "/var/wildfly/standalone/log/access_log.log"; // (1)
        SparkSession spark =
            SparkSession.builder().appName("Log Analyzer").getOrCreate();   // (2)
        Dataset<String> logData = spark.read().textFile(logFile).cache();   // (3)

        long good = logData.filter(new FilterFunction<>() {                 // (4)
                public boolean call(String s) {
                    return s.contains("200");
                }
            }).count();

        long bad = logData.filter(new FilterFunction<>() {
                public boolean call(String s) {
                    return s.contains("404");
                }
            }).count();

        long ugly = logData.filter(new FilterFunction<>() {
                public boolean call(String s) {
                    return s.contains("500");
                }
            }).count();

        System.out.printf(                                                  // (5)
            "Successful transfers %d, 404 tries %d, 500 errors %d\n",
            good, bad, ugly);

        spark.stop();
    }
}
(1) Set up the filename for the logfile. This should probably come from args.

(2) Start up the SparkSession object, which is the Spark runtime.

(3) Tell Spark to read the logfile and keep it in memory (cache).

(4) Define the filters for 200, 404, and 500 responses. It should be possible to use lambdas to make the code shorter, but a bare lambda is ambiguous between the Java and Scala overloads of filter().

(5) Print the results.
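The ambiguity mentioned for the filter step arises because Dataset.filter() is overloaded for both Spark's Java FilterFunction and Scala's Function1, and an untyped lambda fits either one. A commonly used workaround is to cast the lambda to the intended interface. The sketch below reproduces the situation with stand-in types so that it compiles without Spark on the classpath; JavaFilter, ScalaFunction1, and MiniDataset are hypothetical names invented for this illustration and are not Spark classes. With Spark itself, the analogous call would be logData.filter((FilterFunction<String>) s -> s.contains("200")).

```java
import java.util.ArrayList;
import java.util.List;

final class LambdaCastDemo {

    /** Hypothetical stand-in for Spark's Java-friendly FilterFunction. */
    @FunctionalInterface
    interface JavaFilter<T> {
        boolean call(T value) throws Exception;
    }

    /** Hypothetical stand-in for Scala's Function1 as seen from Java. */
    @FunctionalInterface
    interface ScalaFunction1<T, R> {
        R apply(T value);
    }

    /** Tiny stand-in for a Dataset with a similarly overloaded filter(). */
    static final class MiniDataset<T> {
        private final List<T> rows;

        MiniDataset(List<T> rows) {
            this.rows = rows;
        }

        MiniDataset<T> filter(JavaFilter<T> f) {
            List<T> kept = new ArrayList<>();
            for (T row : rows) {
                try {
                    if (f.call(row)) {
                        kept.add(row);
                    }
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            }
            return new MiniDataset<>(kept);
        }

        MiniDataset<T> filter(ScalaFunction1<T, Object> f) {
            List<T> kept = new ArrayList<>();
            for (T row : rows) {
                if (Boolean.TRUE.equals(f.apply(row))) {
                    kept.add(row);
                }
            }
            return new MiniDataset<>(kept);
        }

        long count() {
            return rows.size();
        }
    }

    static long countContaining(MiniDataset<String> data, String needle) {
        // data.filter(s -> s.contains(needle)) is rejected as ambiguous;
        // the cast tells the compiler which overload is meant.
        return data.filter((JavaFilter<String>) s -> s.contains(needle)).count();
    }

    public static void main(String[] args) {
        MiniDataset<String> log = new MiniDataset<>(List.of(
            "GET /index.html 200", "GET /missing 404", "GET /about 200"));
        System.out.println(countContaining(log, "200")); // prints 2
    }
}
```

The cast adds a little noise, but each filter shrinks from five lines to one, which is the saving the anonymous inner classes were giving up.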

To make Example 11-1 compile, you need to add the following dependency to a Maven POM file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>2.4.4</version>
    <scope>provided</scope>
</dependency>

Then you should be able to run mvn package to build a JAR file.

The provided scope is used because we will also download the Apache Spark runtime package, from the Spark Download page, to run the application. Unpack the distribution and set the SPARK_HOME environment variable to its root, e.g.,

SPARK_HOME=~/spark-3.0.0-bin-hadoop3.2/

Then you can use the run script which I’ve provided in the source download (javasrc/spark).

Spark is designed for larger-scale computing than this simple example, so its voluminous output simply dwarfs the output from my simple sample program. Nonetheless, for an approximately 42,000-line file, I did get this result, buried among the logging:

Successful transfers 32555, 404 tries 6539, 500 errors 183
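Substring matching such as s.contains("200") also counts lines where those digits appear somewhere other than the status field, for instance in a byte count or a URL, so the totals above are approximate. A minimal sketch of a stricter check follows, assuming the log uses the Apache Common Log Format, in which the status code follows the quoted request line. The class name, the regular expression, and the sample lines are assumptions made for illustration and are not part of the book's source.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StatusCodeParser {

    // Assumes Common Log Format: ... "METHOD /path PROTOCOL" STATUS BYTES
    private static final Pattern STATUS =
        Pattern.compile("\"[^\"]*\"\\s+(\\d{3})\\b");

    /** Returns the three-digit status after the quoted request, if present. */
    static OptionalInt statusOf(String logLine) {
        Matcher m = STATUS.matcher(logLine);
        return m.find()
            ? OptionalInt.of(Integer.parseInt(m.group(1)))
            : OptionalInt.empty();
    }

    /** Counts the lines whose parsed status equals the given code. */
    static long countStatus(List<String> lines, int status) {
        return lines.stream()
            .map(StatusCodeParser::statusOf)
            .filter(code -> code.isPresent() && code.getAsInt() == status)
            .count();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "127.0.0.1 - - [10/Oct/2019:13:55:36 -0400] "
                + "\"GET /index.html HTTP/1.1\" 200 2326",
            // A 404 whose byte count happens to be 200
            "127.0.0.1 - - [10/Oct/2019:13:55:40 -0400] "
                + "\"GET /nope HTTP/1.1\" 404 200");
        System.out.println(countStatus(sample, 200)); // prints 1
    }
}
```

In the Spark program, a method like statusOf() could be called from inside each filter in place of contains().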

As mentioned, Spark is a massive subject, but a necessary tool for most data scientists. You can program Spark in Java (obviously); in Scala, a JVM language that promotes functional programming (see this Scala tutorial for Java devs); in Python (which we won’t mention further here); and probably in other languages. You can learn more at https://spark.apache.org/ or from the many books, videos, and tutorials online.

11.3 Using R Interactively

Problem

You don’t know the first thing about R, and want to.

Solution

R has been around for ages, and its predecessor S for a decade before that. There are many books and online resources devoted to this language. The official home page is at https://www.r-project.org/. There are many online tutorials; the R Project hosts one at https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf. R itself is available in most systems’ package managers or it can be downloaded from the official download site. The name CRAN in these URLs stands for Comprehensive R Archive Network, named in a similar fashion to TeX’s CTAN and the Perl language’s CPAN.

In this example we’ll write some data from a Java program and then analyze and graph it using R interactively.

Discussion

This is merely a brief intro to using R interactively. Suffice to say that R is a valuable interactive environment for exploring data. Here are some simple calculations to show the flavor of the language: a chatty startup (so long I had to cut part of it), simple arithmetic, automatic printing of results if not saved, half-decent errors when you make a mistake, arithmetic on vectors, and so on. You may see some similarities to Java’s JShell (see Recipe 1.4); both are REPL (read-evaluate-print loop) interfaces. R adds the ability to save your interactive session (workspace) when exiting the program, so all your data and function definitions are restored next time you start R.

$ R

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

...

> 2 + 2
[1] 4
> x = 2 + 2
> x
[1] 4
> r = 10 20 30 40 50
Error: unexpected numeric constant in "r = 10 20"
> r = c(10,20,30,45,55,67)
> r
[1] 10 20 30 45 55 67
> r+3
[1] 13 23 33 48 58 70
> r / 3
[1]  3.333333  6.666667 10.000000 15.000000 18.333333 22.333333
> quit()
Save workspace image? [y/n/c]: n
$

R purists will usually use the assignment arrow <- in lieu of the = sign when assigning. If you like that, go for it.

This short session barely scratches the surface: R offers hundreds of built-in functions, sample datasets, over a thousand add-on packages, built-in help, and much more. For interactive exploration of data, R is really the one to beat.

Some people prefer a GUI front-end to R; the most widely used one is RStudio.

Now we want to write some data from Java and process it in R (we’ll use Java and R together in later recipes in this chapter). In Recipe 5.9 we discussed the java.util.Random class and its nextDouble() vs nextGaussian() methods. The nextDouble() and related methods try to give a “flat” distribution between 0 and 1.0, in which each value has an equal chance of being selected. A Gaussian or normal distribution is a bell-curve of values from negative infinity to positive infinity, with the majority of the values clustered around zero (0.0). We’ll use R’s histogramming and graphics functions to examine visually how well they do so.

Random r = new Random();
for (int i = 0; i < 10_000; i++) {
    System.out.println("A normal random double is " + r.nextDouble());
    System.out.println("A gaussian random double is " + r.nextGaussian());
}

To illustrate the different distributions, I generated 10,000 numbers each using nextDouble() and nextGaussian(). The code for this is in Random4.java (not shown here) and is a combination of the sample code above with code to print just the numbers into two files. I then plotted histograms using R; the R script used to generate the graph is in javasrc (under src/main/resources), but its core is shown in Example 11-2. The results are shown in Figure 11-1.
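Since Random4.java is not shown, here is a minimal sketch of what such a program might look like; it is not the author's actual code. It writes one number per line to normal.txt and gaussian.txt, the file names that the R script in Example 11-2 reads. The class name, the method, and the choice of output directory are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

final class RandomFilesSketch {

    /** Writes count flat and count Gaussian values, one per line. */
    static void writeSamples(Path dir, int count) throws IOException {
        Random r = new Random();
        try (PrintWriter flat = new PrintWriter(
                 Files.newBufferedWriter(dir.resolve("normal.txt")));
             PrintWriter bell = new PrintWriter(
                 Files.newBufferedWriter(dir.resolve("gaussian.txt")))) {
            for (int i = 0; i < count; i++) {
                flat.println(r.nextDouble());    // flat, in [0.0, 1.0)
                bell.println(r.nextGaussian());  // bell curve around 0.0
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Write into the current directory, where R will look by default
        writeSamples(Path.of("."), 10_000);
    }
}
```

Writing bare numbers, with no label text, is what lets read.table() load each file as a single numeric column.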

Example 11-2. R Commands to Generate Histograms
png("randomness.png")
us <- read.table("normal.txt")[[1]]
ns <- read.table("gaussian.txt")[[1]]

layout(t(c(1,2)), respect=TRUE)

hist(us, main = "Using nextRandom()", nclass = 10,
       xlab = NULL, col = "lightgray", las = 1, font.lab = 3)

hist(ns, main = "Using nextGaussian()", nclass = 16,
       xlab = NULL, col = "lightgray", las = 1, font.lab = 3)
dev.off()

The png() call tells R which graphics device to use; others include X11(), postscript(), and more. read.table() reads data from a text file into a table; the [[1]] gives us just the data column, ignoring some metadata. The layout() call says we want two graphics objects displayed side by side. Each hist() call draws one of the two histograms. And dev.off() closes the output and flushes any buffered writes to the PNG file.

Figure 11-1. Flat (left) and Gaussian (right) distributions
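As a rough cross-check that does not need R, the same comparison can be made numerically in Java. This sketch (the class and method names are illustrative assumptions) counts how many nextDouble() values land in each of a number of equal-width bins, and what fraction of nextGaussian() values fall within one standard deviation of zero. For a flat distribution each of ten bins should hold roughly a tenth of the samples, and for a normal distribution the fraction should be roughly 0.68.

```java
import java.util.Arrays;
import java.util.Random;

final class DistributionCheck {

    /** Counts nextDouble() samples falling into equal-width bins over [0, 1). */
    static int[] binCounts(Random r, int samples, int bins) {
        int[] counts = new int[bins];
        for (int i = 0; i < samples; i++) {
            int index = (int) (r.nextDouble() * bins);
            // Guard against floating-point rounding up to the last edge
            counts[Math.min(index, bins - 1)]++;
        }
        return counts;
    }

    /** Fraction of nextGaussian() samples with absolute value at most 1.0. */
    static double fractionWithinOneSigma(Random r, int samples) {
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            if (Math.abs(r.nextGaussian()) <= 1.0) {
                inside++;
            }
        }
        return (double) inside / samples;
    }

    public static void main(String[] args) {
        Random r = new Random();
        System.out.println("Flat bin counts: "
            + Arrays.toString(binCounts(r, 10_000, 10)));
        System.out.println("Gaussian fraction within one sigma: "
            + fractionWithinOneSigma(r, 10_000));
    }
}
```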

11.4 Comparing/Choosing an R Implementation

Problem

You’re not sure which implementation of R to use.

Solution

Look at “original R”, Renjin, and FastR.

Discussion

R’s predecessor was S, an environment for interactive programming developed by John Chambers et al. at AT&T Bell Labs starting in 1976. I ran into S when supporting the University of Toronto Statistics Department, and again when reviewing a commercial implementation of it, S-PLUS, for a long-ago glossy magazine called Sun Expert. AT&T was only making S source available to universities and to commercial licensees who could not further distribute the source. Two developers at the University of Auckland, Ross Ihaka and Robert Gentleman, developed a clone of S, starting in 1995. They named it R after their own first initials and as a play on the name S. (There is precedent for this: the awk language popular on Unix/Linux was named for the initials of its designers Aho, Weinberger, and Kernighan.) R grew quickly because it was very largely compatible with S and was more readily available. This implementation, “original R”, is actively managed by the R Foundation for Statistical Computing, which also manages the Comprehensive R Archive Network.

Renjin is a fairly complete implementation of R in Java. This project provides built Jar files via their own Maven repository.

FastR is another implementation in Java, designed to run in the faster GraalVM and supporting direct invocation of JVM code from almost any other programming language. FastR is also described at https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb.

Besides these implementations, R’s popularity has led to development of many “access libraries” for invoking R from many popular programming languages. Rserve is a TCP/IP networked access mode for R, for which Java wrappers exist.

11.5 Using R from within a Java app: Renjin

Problem

You want to access R from within a Java application using Renjin.

Solution

Add Renjin to your Maven or Gradle build, and call it via the Script Engines mechanism described in Recipe 18.3.

Discussion

Renjin is a pure-Java, open-source re-implementation of R, and provides a script engines interface. Add the following dependency to your build tool:

org.renjin:renjin-script-engine:3.5-beta76

Of course there is probably a later version of renjin than the one shown above by the time you read this; use the latest unless there’s a reason not to.

Note: You will also need a <repository> entry since the maintainers put their artifacts in the repo at nexus.bedatadriven.com instead of the usual Maven Central. Here’s what I used (obtained from https://www.renjin.org/downloads.html):

<repositories>
    <repository>
        <id>bedatadriven</id>
        <name>bedatadriven public repo</name>
        <url>https://nexus.bedatadriven.com/content/groups/public/</url>
    </repository>
</repositories>

Once that’s done, you should be able to access Renjin via the Script Engines framework, as in Example 11-3.

Example 11-3. main/src/main/java/otherlang/RenjinScripting.java
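import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class RenjinScripting {
    /**
     * Demonstrate interacting with the "R" implementation called "Renjin"
     */
    public static void main(String[] args) throws ScriptException {
        ScriptEngineManager manager = new ScriptEngineManager();
        ScriptEngine engine = manager.getEngineByName("Renjin");
        engine.put("a", 42);                         // make a visible to R
        Object ret = engine.eval("b <- 2; a*b");     // value of last expression
        System.out.println(ret);
    }
}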

Because R treats all numbers as floating point, like many interpreters, the value printed is 84.0.
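One thing Example 11-3 glosses over is that ScriptEngineManager.getEngineByName() returns null, rather than throwing, when no matching engine is found, for instance if the Renjin JAR is missing from the classpath. A small defensive sketch follows; the class and method names are illustrative assumptions. It lists the engines the JVM can see and fails with a readable message when the requested one is absent.

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

final class EngineLookup {

    /** Collects every name under which an installed engine is registered. */
    static List<String> availableEngineNames() {
        List<String> names = new ArrayList<>();
        for (ScriptEngineFactory factory
                : new ScriptEngineManager().getEngineFactories()) {
            names.addAll(factory.getNames());
        }
        return names;
    }

    /** Returns the named engine, or throws with a list of what is available. */
    static ScriptEngine requireEngine(String name) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(name);
        if (engine == null) {
            throw new IllegalStateException("No script engine named " + name
                + "; available: " + availableEngineNames());
        }
        return engine;
    }

    public static void main(String[] args) {
        try {
            ScriptEngine renjin = requireEngine("Renjin");
            System.out.println("Found " + renjin.getFactory().getEngineName());
        } catch (IllegalStateException e) {
            // Reached when Renjin is not on the classpath
            System.out.println(e.getMessage());
        }
    }
}
```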

One can also get Renjin to invoke a script file; Example 11-4 invokes the same script used in Recipe 11.3 to generate and plot a batch of pseudo-random numbers.

Example 11-4. Renjin with a Script File
    private static final String R_SCRIPT_FILE = "/randomnesshistograms.r";
    private static final int N = 10000;

    public static void main(String[] argv) throws Exception {
        // java.util.Random methods are non-static, so we need an instance
        Random r = new Random();
        double[] us = new double[N], ns = new double[N];
        for (int i = 0; i < N; i++) {
            us[i] = r.nextDouble();
            ns[i] = r.nextGaussian();
        }
        try (InputStream is =
            Random5.class.getResourceAsStream(R_SCRIPT_FILE)) {
            if (is == null) {
                throw new IllegalStateException("Can't open R file ");
            }
            ScriptEngineManager manager = new ScriptEngineManager();
            ScriptEngine engine = manager.getEngineByName("Renjin");
            engine.put("us", us);
            engine.put("ns", ns);
            engine.eval(FileIO.readerToString(new InputStreamReader(is)));
        }
    }
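FileIO.readerToString() is not a JDK method; it appears to be a helper from the book's own code. If it is not on your classpath, the same step can be sketched with only the JDK (Java 10 or later, for Reader.transferTo()). The class and method names below are illustrative assumptions. Alternatively, the ScriptEngine interface has an eval(Reader) overload that accepts the reader directly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

final class ReaderText {

    /** Reads everything the reader has left and returns it as one String. */
    static String readAll(Reader reader) throws IOException {
        StringWriter out = new StringWriter();
        reader.transferTo(out); // available since Java 10
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        try (Reader in = new StringReader("b <- 2; a*b")) {
            System.out.println(readAll(in)); // prints b <- 2; a*b
        }
    }
}
```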

Renjin can also be used as a standalone R implementation if you download an all-dependencies JAR file from https://renjin.org/downloads.html.

11.6 Using Java from within an R session

Problem

You are part-way through a computation in R and realize that there’s a Java library to do the next step. Or for any other reason, you need to call Java code from within an R session.

Solution

Install rJava, call .jinit(), and use J() to load classes or invoke methods.

Discussion

For example:

> install.packages('rJava')                            # (1)
trying URL 'http://.../rJava_0.9-11.tgz'
Content type 'application/x-gzip' length 745354 bytes (727 KB)
==================================================
downloaded 727 KB


The downloaded binary packages are in
    /tmp//Rtmp6XYZ9t/downloaded_packages
> library('rJava')                                     # (2)
> .jinit()
> J('java.time.LocalDate', 'now')                      # (3)
[1] "Java-Object{2019-11-22}"
> d=J('java.time.LocalDate', 'now')$toString()         # (4)
> d
[1] "2019-11-22"
(1) Install the rJava package; this only needs to be done once.

(2) Load rJava, and initialize it with .jinit(); both are needed in every R session.

(3) The J function takes a fully qualified class name as its first argument. If only that argument is given, a class descriptor (like a java.lang.Class object) is returned. If more than one argument is given, the second is a static method name, and any subsequent arguments are passed to that method.

(4) Returned objects can have Java methods invoked with the standard R $ notation; here the toString() method is invoked to return just a character string instead of a LocalDate object.

The .jcall function gives you more control over the method being called and its return type.

> d=J('java.time.LocalDate', 'now')                         # (1)
> .jcall(d, "I", 'getYear')                                 # (2)
[1] 2019
>
> .jcall("java/lang/System","S","getProperty","user.dir")   # (3)
[1] "/home/ian"
> c=J('java/lang/System')                                   # (4)
> .jcall(c, "S", 'getProperty', 'user.dir')
[1] "/home/ian"
>
(1) Invoke the Java LocalDate.now() method and save the result in the R variable d.

(2) Invoke the Java getYear() method on the LocalDate object; the “I” tells .jcall to expect an int result.

(3) Call System.getProperty("user.dir") and print the result; the “S” tells .jcall to expect a String return.

(4) If you will be using a class several times, save the Class object and pass it as the first argument of .jcall().
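For readers mapping these R calls back to Java, the sketch below shows plain Java equivalents of the static call, the instance call, and the property lookup used in the sessions above. The class and method names are illustrative assumptions. The "I" passed to .jcall corresponds to the int returned by getYear(), and the "S" to the String returned by getProperty().

```java
import java.time.LocalDate;

final class RJavaEquivalents {

    /** Like J('java.time.LocalDate', 'now')$toString() in the R session. */
    static String todayAsText() {
        return LocalDate.now().toString();
    }

    /** Like .jcall(d, "I", 'getYear'): an instance method returning int. */
    static int currentYear() {
        LocalDate d = LocalDate.now();
        return d.getYear();
    }

    /** Like .jcall("java/lang/System", "S", "getProperty", "user.dir"). */
    static String workingDir() {
        return System.getProperty("user.dir");
    }

    public static void main(String[] args) {
        System.out.println(todayAsText());
        System.out.println(currentYear());
        System.out.println(workingDir());
    }
}
```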

There are more capabilities here; consult the documentation at https://cran.r-project.org/web/packages/rJava/ and an article at https://www.developer.com/java/ent/getting-started-with-r-using-java.html.

11.7 Using FastR, the GraalVM implementation of R

Problem

You use the R language, but feel a need for speed.

Solution

Use FastR, Oracle’s GraalVM reimplementation of the R language.

Discussion

Assuming you have installed GraalVM as described in Recipe 1.2, you can just type the following command:

$ gu install R
Downloading: Component catalog from www.graalvm.org
Processing component archive: FastR
Downloading: Component R: FastR  from github.com
Installing new component: FastR (org.graalvm.R, version 19.2.0.1)
NOTES:
---------------
The user specific library directory was not created automatically.
You can either create the directory manually or edit file
/Library/Java/JavaVirtualMachines/graalvm-ce-19.2.0.1/Contents/Home/jre/languages/R/etc/Renviron
to change it to any desired location. Without user specific library
directory, users will need write permission for the GraalVM home
directory in order to install R packages.
...
[more install notes]

If you have set your PATH to have GraalVM before other directories, the command R will now give you the GraalVM version of R. To access the standard R, you will have to either change your PATH or give a full path to the R installation. On most Unix and Unix-like systems, the command which -a R will reveal all R commands on your PATH:

$ which -a R
/Library/Java/JavaVirtualMachines/graalvm-ce-19.2.0.1/Contents/Home/bin/R
/usr/local/bin/R

Let’s just run it:

$ R
R version 3.5.1 (FastR)
Copyright (c) 2013-19, Oracle and/or its affiliates
Copyright (c) 1995-2018, The R Core Team
Copyright (c) 2018 The R Foundation for Statistical Computing
Copyright (c) 2012-4 Purdue University
Copyright (c) 1997-2002, Makoto Matsumoto and Takuji Nishimura
All rights reserved.

FastR is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information.

Type 'q()' to quit R.
[Previously saved workspace restored]

> 2 + 2
[1] 4
> ^D
Save workspace image? [y/n/c]: n
$

From that point on, you should be able to do practically anything that you would do in standard R, since this R’s source code is largely derived from the R Foundation’s source.

11.8 Using R in a web app

Problem

You want to display R’s data and graphics in a web page on a web server.

Solution

There are several approaches that would achieve this effect:

  • Prepare the data, and generate graphics as we did in Recipe 11.3, then incorporate both into a static web page;

  • Use one of several R add-on web frameworks, such as Shiny or Rook;

  • Invoke a JVM implementation of R from within a Servlet, JSF or Spring Bean, or other web-tier component.

Discussion

The first is trivial, and doesn’t need discussion here.

For the second, I’ll actually use timevis, which in turn uses shiny. This isn’t built in to the R library, so we first have to install it, using R’s install.packages():

$ R
> install.packages('timevis')
> quit()
$

This may take a while as it downloads and builds multiple dependencies.

For this demo I have a small dataset with some basic information on medieval literature, which I load and display using timevis:

# Draw the timeline for the epics.

epics = read.table("epics.txt", header=TRUE, fill=TRUE)

# epics

library("timevis")

timevis(epics)

When run, this creates a temporary file containing HTML and JavaScript to allow interactive exploration of the data. The library also opens this in a browser, shown in Figure 11-2. One can explore the data by expanding or contracting the timeline and scrolling sideways.

Figure 11-2. TimeVis (Shiny) In Action

Where there are two boxes (Cid, Sagas), the first is when the life or stories took place, the second, when they were written down.

To expose this on the public web, copy the file (whose full path is revealed in the browser title bar) and the lib folder in that same directory into a directory served by the web server. Or just use File→Save As→Complete Web Page within the browser. Either way, you must do this while the R session is running, as the temporary files are deleted when the session ends. Or, if you are familiar with the shiny framework, you can insert the timevis visualization into a shiny application.
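The copy step can also be scripted. This is a minimal sketch, assuming the generated page is a single HTML file with a sibling lib directory as described above. The class name, the method, and the idea of passing both paths on the command line are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PublishWidget {

    /** Copies the HTML file and its sibling lib directory into webRoot. */
    static void publish(Path htmlFile, Path webRoot) throws IOException {
        Files.createDirectories(webRoot);
        Files.copy(htmlFile,
                   webRoot.resolve(htmlFile.getFileName().toString()),
                   StandardCopyOption.REPLACE_EXISTING);

        Path lib = htmlFile.toAbsolutePath().resolveSibling("lib");
        if (!Files.isDirectory(lib)) {
            return; // nothing more to copy
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(lib)) {
            // walk() lists each directory before its contents
            entries = walk.collect(Collectors.toList());
        }
        Path targetLib = webRoot.resolve("lib");
        for (Path src : entries) {
            Path dest = targetLib.resolve(lib.relativize(src).toString());
            if (Files.isDirectory(src)) {
                Files.createDirectories(dest);
            } else {
                Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: PublishWidget htmlFile webRootDir");
            return;
        }
        publish(Path.of(args[0]), Path.of(args[1]));
    }
}
```

As with the manual copy, this has to run while the R session is still alive, before the temporary files disappear.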

[1] Map/Reduce is a famous algorithm pioneered by Google to handle large data problems. An unspecified number of generator processes map data (such as words on a web page to the page’s URL), and a single (usually) reduce process reduces the maps to a manageable form, such as a list of all the pages that contain the given words. Early on, Data Science went overboard on trying to do everything with Map/Reduce; now the pendulum has swung back to using compute engines like Spark.
