Data Science is a relatively new discipline which first came to the attention of many with this article by O’Reilly’s Mike Loukides. While there are many definitions in the field, Loukides notes:
A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.
One of the main open-source ecosystems for data science software is at Apache, and includes Hadoop (comprising the HDFS distributed filesystem, Hadoop Map/Reduce[1], the Ozone object store, the Yarn scheduler, and more), the Cassandra distributed database, the Spark compute engine, and more. Read the Modules and Related Tools section of the Hadoop page for a current list.
What’s interesting here is that a great deal of this infrastructure, which is taken for granted by data scientists, is written in Java and Scala (a JVM language). Much of the rest is written in Python, a language that complements Java.
Data science problems may involve a lot of setup, so we’ll only give one example from traditional DS, using the Spark framework. Spark is written in Scala, a JVM language, so it can be used directly from Java code.
In the rest of the chapter I’ll focus on a language called R which is widely used both in statistics and in data science (well, also in many other sciences; many of the graphs you see in refereed journal articles are prepared with R). R is widely used and is useful to know. Its primary implementation was not written in Java, but in a mixture of C, Fortran and R itself. But R can be used within Java, and Java can be used within R. We’ll talk about several implementations of R and how to select one, then show techniques for using Java from R and R from Java, as well as using R in a web application.
You want to use Java for machine learning and data science, but everyone tells you to use Python.
Use one of the many powerful Java toolkits available for free download.
It’s sometimes said that machine learning (ML) and deep learning have to be done in C++ for efficiency or in Python for the wide availability of software. While these languages have their advantages and their advocates, it is certainly possible to use Java for these purposes. However, setting up these packages and presenting even a short demo would take more space than fits in our typical “Recipe” format.
With industry giant Amazon having released its Java-based Deep Java Library (DJL) as this book was going to press, and many other good libraries available (quite a few supporting CUDA for faster GPU-based processing; see Table 11-1), there is no reason to avoid using Java for ML. With the exception of DJL, which is new as this book goes to press, I’ve tried to list packages that are still being maintained and have a decent reputation among users.
Library name | Description | Info URL | Source URL |
---|---|---|---|
ADAMS | Workflow engine for building/maintaining data-driven, reactive workflows; integration with business processes | | |
Deep Java Library | Amazon’s ML library | | |
Deeplearning4j | DL4J, Eclipse’s distributed deep-learning library; integrates with Hadoop and Apache Spark | | |
ELKI | Data mining toolkit | | |
Mallet | ML for text processing | mallet.cs.umass.edu | |
Weka | ML algorithms for data mining; tools for data preparation, classification, regression, clustering, association rules mining, and visualization | | |

The book Data Mining: Practical Machine Learning, 4th Edition, was written by the team behind Weka.

See also Eugen Paraschiv’s list of Java AI software packages.
You want to process data using Spark.
Create a SparkSession, use its read() function to read a Dataset, apply operations, and summarize the results.
Spark is a massive subject! Entire books have been written on using it. Quoting Databricks, home of much of the original Spark team (Databricks offers several free e-books on Spark from their website; they also offer commercial Spark add-ons):
Apache Spark™ has seen immense growth over the past several years, becoming the de-facto data processing and AI engine in enterprises today due to its speed, ease of use, and sophisticated analytics. Spark unifies data and AI by simplifying data preparation at massive scale across various sources, providing a consistent set of APIs for both data engineering and data science workloads, as well as seamless integration with popular AI frameworks and libraries such as TensorFlow, PyTorch, R and SciKit-Learn.
I cannot convey the whole subject matter in this book. However, one thing Spark is good for is dealing with lots of data. In this example, we read an Apache-format logfile, and find (and count) the lines with 200, 404 and 500 responses.
```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.api.java.function.FilterFunction;

// tag::main[]
/** Read an Apache Logfile and summarize it. */
public class LogReader {
    public static void main(String[] args) {
        final String logFile = "/var/wildfly/standalone/log/access_log.log";
        SparkSession spark = SparkSession.builder().appName("Log Analyzer").getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();

        long good = logData.filter(new FilterFunction<>() {
            public boolean call(String s) {
                return s.contains("200");
            }
        }).count();

        long bad = logData.filter(new FilterFunction<>() {
            public boolean call(String s) {
                return s.contains("404");
            }
        }).count();

        long ugly = logData.filter(new FilterFunction<>() {
            public boolean call(String s) {
                return s.contains("500");
            }
        }).count();

        System.out.printf("Successful transfers %d, 404 tries %d, 500 errors %d ",
            good, bad, ugly);

        spark.stop();
    }
}
// end::main[]
```
- Set up the filename for the logfile. In a real application it should probably come from args.
- Start up the Spark runtime by building a SparkSession object.
- Tell Spark to read the logfile, and keep it in memory (cache()).
- Define the filters for the 200, 404 and 500 responses. Lambdas would make the code shorter, but there’s an ambiguity between the overloads of filter() that take the Java FilterFunction and the Scala function type.
- Print the results.
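For comparison, when the log fits comfortably on one machine, the same three counts can be computed with plain java.util.stream and no Spark at all. This is just a sketch using the same logfile path and the same naive contains() matching as the Spark example above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class PlainLogReader {
    public static void main(String[] args) throws IOException {
        // Same logfile path as the Spark example
        Path logFile = Path.of("/var/wildfly/standalone/log/access_log.log");
        long[] counts = new long[3]; // 200, 404, 500 respectively
        try (Stream<String> lines = Files.lines(logFile)) {
            lines.forEach(s -> {
                // Same naive substring matching as the Spark version
                if (s.contains("200")) counts[0]++;
                if (s.contains("404")) counts[1]++;
                if (s.contains("500")) counts[2]++;
            });
        }
        System.out.printf("Successful transfers %d, 404 tries %d, 500 errors %d%n",
            counts[0], counts[1], counts[2]);
    }
}
```

Of course this runs on a single JVM; Spark’s value is that the same logical pipeline scales across a cluster.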
To make this compile, you need to add the following to a Maven POM file:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>2.4.4</version>
  <scope>provided</scope>
</dependency>
```
Then you should be able to run mvn package to get a JAR file packaged. The provided scope is used because, to run the application, we will also download the Apache Spark runtime package from the Spark Download page.
Unpack the distribution and set the SPARK_HOME environment variable to its root, e.g.,
SPARK_HOME=~/spark-3.0.0-bin-hadoop3.2/
Then you can use the run
script which I’ve provided in the source download (javasrc/spark).
Spark is designed for larger-scale computing than this simple example, so its voluminous output simply dwarfs the output from my simple sample program. Nonetheless, for an approximately 42,000-line file, I did get this result, buried among the logging:
Successful transfers 32555, 404 tries 6539, 500 errors 183
As mentioned, Spark is a massive subject, but a necessary tool for most data scientists. You can program Spark in Java (obviously), in Scala (a JVM language that promotes functional programming; see this Scala tutorial for Java devs), in Python (which we won’t mention further here), and probably other languages. You can learn more at https://spark.apache.org/ or from the many books, videos, and tutorials online.
You don’t know the first thing about R, and want to.
R has been around for ages, and its predecessor S for a decade before that. There are many books and online resources devoted to this language. The official home page is at https://www.r-project.org/. There are many online tutorials; the R Project hosts one at https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf. R itself is available in most systems’ package managers or it can be downloaded from the official download site. The name CRAN in these URLs stands for Comprehensive R Archive Network, named in a similar fashion to TeX’s CTAN and the Perl language’s CPAN.
In this example we’ll write some data from a Java program and then analyze and graph it using R interactively.
This is merely a brief intro to using R interactively. Suffice it to say that R is a valuable interactive environment for exploring data. Here are some simple calculations to show the flavor of the language: a chatty startup (so long I had to cut part of it), simple arithmetic, automatic printing of results if not saved, half-decent errors when you make a mistake, arithmetic on vectors, and so on. You may see some similarities to Java’s JShell (see Recipe 1.4); both are REPL (read-evaluate-print loop) interfaces. R adds the ability to save your interactive session (workspace) when exiting the program, so all your data and function definitions are restored next time you start R.
```
$ R

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
...
> 2 + 2
[1] 4
> x = 2 + 2
> x
[1] 4
> r = 10 20 30 40 50
Error: unexpected numeric constant in "r = 10 20"
> r = c(10,20,30,45,55,67)
> r
[1] 10 20 30 45 55 67
> r+3
[1] 13 23 33 48 58 70
> r / 3
[1]  3.333333  6.666667 10.000000 15.000000 18.333333 22.333333
> quit()
Save workspace image? [y/n/c]: n
$
```
R purists will usually use the assignment arrow <- in lieu of the = sign when assigning. If you like that, go for it.
This short session barely scratches the surface: R offers hundreds of built-in functions, sample datasets, over a thousand add-on packages, built-in help, and much more. For interactive exploration of data, R is really the one to beat.
Some people prefer a GUI front-end to R; the most widely-used GUI front-end is R Studio.
Now we want to write some data from Java and process it in R (we’ll use Java and R together in later recipes in this chapter). In Recipe 5.9 we discussed the java.util.Random class and its nextDouble() vs. nextGaussian() methods. nextDouble() and related methods try to give a “flat” distribution between 0 and 1.0, in which each value has an equal chance of being selected. A Gaussian or normal distribution is a bell curve of values from negative infinity to positive infinity, with the majority of the values clustered around zero (0.0). We’ll use R’s histogramming and graphics functions to examine visually how well they do so.
```java
Random r = new Random();
for (int i = 0; i < 10_000; i++) {
    System.out.println("A normal random double is " + r.nextDouble());
    System.out.println("A gaussian random double is " + r.nextGaussian());
}
```
To illustrate the different distributions, I generated 10,000 numbers each using nextDouble() and nextGaussian(). The code for this is in Random4.java (not shown here) and is a combination of the sample code above with code to print just the numbers into two files. I then plotted histograms using R; the R script used to generate the graph is in javasrc under src/main/resources, but its core is shown in Example 11-2. The results are shown in Figure 11-1.
```r
png("randomness.png")
us <- read.table("normal.txt")[[1]]
ns <- read.table("gaussian.txt")[[1]]
layout(t(c(1, 2)), respect = TRUE)
hist(us, main = "Using nextRandom()", nclass = 10, xlab = NULL,
    col = "lightgray", las = 1, font.lab = 3)
hist(ns, main = "Using nextGaussian()", nclass = 16, xlab = NULL,
    col = "lightgray", las = 1, font.lab = 3)
dev.off()
```
- The png() call tells R which graphics device to use; others include X11(), postscript(), and more.
- read.table() reads data from a text file into a table; the [[1]] gives us just the data column, ignoring some metadata.
- The layout() call says we want two graphics objects displayed side by side.
- Each hist() call draws one of the two histograms.
- And dev.off() closes the output and flushes any buffered writes to the PNG file.
The result is shown in Figure 11-1.
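As a quick numeric cross-check that needs no R at all, you can compare sample statistics of the two generators in plain Java: the mean of nextDouble() samples should come out near 0.5, and the mean of nextGaussian() samples near 0.0. This sketch is my addition, not part of the book’s Random4.java:

```java
import java.util.Random;

public class DistributionCheck {
    public static void main(String[] args) {
        Random r = new Random();
        final int n = 100_000;
        double uniformSum = 0, gaussianSum = 0;
        for (int i = 0; i < n; i++) {
            uniformSum += r.nextDouble();    // flat on [0, 1): mean near 0.5
            gaussianSum += r.nextGaussian(); // bell curve: mean near 0.0
        }
        System.out.printf("uniform mean %.3f, gaussian mean %.3f%n",
            uniformSum / n, gaussianSum / n);
    }
}
```

The histograms in Figure 11-1 tell the same story visually: a roughly flat bar chart for nextDouble() and a bell curve for nextGaussian().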
You’re not sure which implementation of R to use.
Look at “original R”, Renjin, and FastR.
The original for R was S, an environment for interactive programming developed by John Chambers et al. at AT&T Bell Labs starting in 1976. I ran into S when supporting the University of Toronto Statistics Department, and again when reviewing a commercial implementation of it, SPlus, for a long-ago glossy magazine called Sun Expert. AT&T made S source available only to universities and to commercial licensees who could not further distribute it. Two developers at the University of Auckland, Ross Ihaka and Robert Gentleman, developed a clone of S starting in 1995. They named it R after their own first initials and as a play on the name S. (There is precedent for this: the awk language popular on Unix/Linux was named for the initials of its designers Aho, Weinberger and Kernighan.) R grew quickly because it was very largely compatible with S and was more readily available. This implementation, “original R”, is actively managed by the R Foundation for Statistical Computing, which also manages the Comprehensive R Archive Network.
Renjin is a fairly complete implementation of R in Java. This project provides built Jar files via their own Maven repository.
FastR is another implementation in Java, designed to run on the faster GraalVM and supporting direct invocation of JVM code from almost any other programming language. FastR is also described at https://medium.com/graalvm/faster-r-with-fastr-4b8db0e0dceb.
Besides these implementations, R’s popularity has led to development of many “access libraries” for invoking R from many popular programming languages. Rserve is a TCP/IP networked access mode for R, for which Java wrappers exist.
You want to access R from within a Java application using Renjin.
Add Renjin to your Maven or Gradle build, and call it via the Script Engines mechanism described in Recipe 18.3.

Renjin is a pure-Java, open-source reimplementation of R that provides a Script Engines interface. Add the following dependency to your build tool:

org.renjin:renjin-script-engine:3.5-beta76

Of course there will probably be a later version of Renjin than the one shown above by the time you read this; use the latest unless there’s a reason not to.
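In Maven POM form, those same coordinates look like this:

```xml
<dependency>
  <groupId>org.renjin</groupId>
  <artifactId>renjin-script-engine</artifactId>
  <version>3.5-beta76</version>
</dependency>
```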
Note: You will also need a <repository> entry, since the maintainers put their artifacts in the repo at nexus.bedatadriven.com instead of the usual Maven Central. Here’s what I used (obtained from https://www.renjin.org/downloads.html):
```xml
<repositories>
  <repository>
    <id>bedatadriven</id>
    <name>bedatadriven public repo</name>
    <url>https://nexus.bedatadriven.com/content/groups/public/</url>
  </repository>
</repositories>
```
Once that’s done, you should be able to access Renjin via the Script Engines framework, as in Example 11-3.
```java
/**
 * Demonstrate interacting with the "R" implementation called "Renjin"
 */
public static void main(String[] args) throws ScriptException {
    ScriptEngineManager manager = new ScriptEngineManager();
    ScriptEngine engine = manager.getEngineByName("Renjin");
    engine.put("a", 42);
    Object ret = engine.eval("b <- 2; a*b");
    System.out.println(ret);
}
```
Because R treats all numbers as floating point, like many interpreters, the value printed is 84.0.
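If getEngineByName("Renjin") returns null, the Renjin JAR isn’t on the classpath; this JDK-only diagnostic (my addition, not from the Renjin documentation) lists whichever script engines are actually registered:

```java
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

public class ListEngines {
    public static void main(String[] args) {
        ScriptEngineManager manager = new ScriptEngineManager();
        // Each factory describes one installed scripting engine and its lookup names
        for (ScriptEngineFactory f : manager.getEngineFactories()) {
            System.out.println(f.getEngineName() + " " + f.getNames());
        }
    }
}
```

With the Renjin dependency on the classpath, "Renjin" should appear in one factory’s name list.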
One can also get Renjin to invoke a script file; Example 11-4 invokes the same script used in Recipe 11.3 to generate and plot a batch of pseudo-random numbers.
```java
private static final String R_SCRIPT_FILE = "/randomnesshistograms.r";
private static final int N = 10000;

public static void main(String[] argv) throws Exception {
    // java.util.Random methods are non-static, do need to construct
    Random r = new Random();
    double[] us = new double[N], ns = new double[N];
    for (int i = 0; i < N; i++) {
        us[i] = r.nextDouble();
        ns[i] = r.nextGaussian();
    }
    try (InputStream is = Random5.class.getResourceAsStream(R_SCRIPT_FILE)) {
        if (is == null) {
            throw new IllegalStateException("Can't open R file " + R_SCRIPT_FILE);
        }
        ScriptEngineManager manager = new ScriptEngineManager();
        ScriptEngine engine = manager.getEngineByName("Renjin");
        engine.put("us", us);
        engine.put("ns", ns);
        engine.eval(FileIO.readerToString(new InputStreamReader(is)));
    }
}
```
Renjin can also be used as a standalone R implementation if you download an all-dependencies JAR file from https://renjin.org/downloads.html.
You are part-way through a computation in R and realize that there’s a Java library to do the next step. Or, for any other reason, you need to call Java code from within an R session.
Install rJava, call .jinit(), and use J() to load classes or invoke methods.
For example:
```
> install.packages('rJava')
trying URL 'http://.../rJava_0.9-11.tgz'
Content type 'application/x-gzip' length 745354 bytes (727 KB)
==================================================
downloaded 727 KB

The downloaded binary packages are in
        /tmp//Rtmp6XYZ9t/downloaded_packages
> library('rJava')
> .jinit()
> J('java.time.LocalDate', 'now')
[1] "Java-Object{2019-11-22}"
> d=J('java.time.LocalDate', 'now')$toString()
> d
[1] "2019-11-22"
```
- Install the rJava package; this only needs to be done once.
- Load rJava and initialize it with .jinit(); both steps are needed in every R session.
- The J function takes one argument of a full class name. If only that argument is given, a class descriptor (like a java.lang.Class object) is returned. If more than one argument is given, the second is a static method name, and any subsequent arguments are passed to that method.
- Returned objects can have Java methods invoked with the standard R $ notation; here the toString() method is invoked to return just a character string instead of a LocalDate object.
The .jcall function gives you more control over calling method and return types.
```
> d = J('java.time.LocalDate', 'now')
> .jcall(d, "I", 'getYear')
[1] 2019
>
> .jcall("java/lang/System", "S", "getProperty", "user.dir")
[1] "/home/ian"
> c = J('java/lang/System')
> .jcall(c, "S", 'getProperty', 'user.dir')
[1] "/home/ian"
>
```
- Invoke the Java LocalDate.now() method and save the result in R variable d.
- Invoke the Java getYear() method on the LocalDate object; the “I” tells .jcall to expect an integer result.
- Call System.getProperty("user.dir") and print the result; the “S” tells .jcall to expect a String return.
- If you will be using a class several times, save the class object and pass it as the first argument of .jcall().
There are more capabilities here; consult the documentation at https://cran.r-project.org/web/packages/rJava/ and an article at https://www.developer.com/java/ent/getting-started-with-r-using-java.html.
You use the R language, but feel a need for speed.
Use FastR, Oracle’s GraalVM reimplementation of the R language.
Assuming you have installed GraalVM as described in Recipe 1.2, you can just type the following command:
```
$ gu install R
Downloading: Component catalog from www.graalvm.org
Processing component archive: FastR
Downloading: Component R: FastR from github.com
Installing new component: FastR (org.graalvm.R, version 19.2.0.1)
NOTES:
---------------
The user specific library directory was not created automatically. You can
either create the directory manually or edit file
/Library/Java/JavaVirtualMachines/graalvm-ce-19.2.0.1/Contents/Home/jre/languages/R/etc/Renviron
to change it to any desired location. Without user specific library directory,
users will need write permission for the GraalVM home directory in order to
install R packages.
... [more install notes]
```
If you have set your PATH to have GraalVM before other directories, the command R will now give you
the GraalVM version of R. To access the standard R, you will have to either set your PATH
or give a full path to the R installation. On all Unix and Unix-like systems, the command which R
will reveal all R commands on your PATH:
```
$ which R
/Library/Java/JavaVirtualMachines/graalvm-ce-19.2.0.1/Contents/Home/bin/R
/usr/local/bin/R
```
Let’s just run it:
```
$ R
R version 3.5.1 (FastR)
Copyright (c) 2013-19, Oracle and/or its affiliates
Copyright (c) 1995-2018, The R Core Team
Copyright (c) 2018 The R Foundation for Statistical Computing
Copyright (c) 2012-4 Purdue University
Copyright (c) 1997-2002, Makoto Matsumoto and Takuji Nishimura
All rights reserved.

FastR is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information.

Type 'q()' to quit R.
[Previously saved workspace restored]
> 2 + 2
[1] 4
> ^D
Save workspace image? [y/n/c]: n
$
```
From that point on, you should be able to do practically anything that you would do in standard R, since this R’s source code is largely derived from the R Foundation’s source.
You want to display R’s data and graphics in a web page on a web server.
There are several approaches that would achieve this effect:

- Prepare the data, and generate graphics as we did in Recipe 11.3, then incorporate both into a static web page.
- Use one of several R add-on web frameworks, such as Shiny or Rook.
- Invoke a JVM implementation of R from within a Servlet, JSF or Spring Bean, or other web-tier component.
The first is trivial, and doesn’t need discussion here. For the second, I’ll actually use timevis, which in turn uses shiny. This isn’t built into the R library, so we first have to install it using R’s install.packages():
```
$ R
> install.packages('timevis')
> quit()
$
```
This may take a while as it downloads and builds multiple dependencies.
For this demo I have a small dataset with some basic information on medieval literature, which I load and display using shiny:
```r
# Draw the timeline for the epics.
epics = read.table("epics.txt", header = TRUE, fill = TRUE)
# epics
library("timevis")
timevis(epics)
```
When run, this creates a temporary file containing HTML and JavaScript to allow interactive exploration of the data. The library also opens this in a browser, shown in Figure 11-2. One can explore the data by expanding or contracting the timeline and scrolling sideways.
Where there are two boxes (Cid, Sagas), the first is when the life or stories took place, the second, when they were written down.
To expose this on the public web, copy the file (whose full path is revealed in the browser title bar) and the lib folder in that same directory into a directory served by the web server. Or just use File→Save As→Complete Web Page within the browser. Either way, you must do this while the R session is running, as the temporary files are deleted when the session ends.
Or, if you are familiar with the shiny framework, you can insert the timevis visualization into a shiny application.
[1] Map/Reduce is a famous algorithm pioneered by Google to handle large data problems. An unspecified number of generator processes map data (such as words on a web page to the page’s URL) and a single (usually) reduce process reduces the maps to a manageable form, such as a list of all the pages that contain the given words. Early on, Data Science went overboard on trying to do everything with Map/Reduce; now the pendulum has swung back to using compute engines like Spark.
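As a toy illustration of the idea (my sketch with made-up data, not Google’s implementation), the “map” step can emit one (word, URL) pair per word, and the “reduce” step can group those pairs by word into an inverted index, here using java.util.stream:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TinyMapReduce {
    public static void main(String[] args) {
        // A toy "web": page URL -> page text (hypothetical data)
        Map<String, String> pages = Map.of(
            "a.html", "spark hadoop data",
            "b.html", "data science");
        // Map step: emit one (word, url) pair per word on each page.
        // Reduce step: group the URLs by word into an inverted index.
        Map<String, Set<String>> index = pages.entrySet().stream()
            .flatMap(p -> Arrays.stream(p.getValue().split("\\s+"))
                .map(w -> Map.entry(w, p.getKey())))
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toSet())));
        System.out.println(index.get("data")); // the pages containing "data"
    }
}
```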