In the last few chapters, we have been exploring how to use H2O AutoML in production. We saw how H2O models can be exported as POJOs and MOJOs – portable objects that can make predictions outside the H2O cluster. However, in actual production environments, you will often be using multiple technologies to meet various technical requirements. How well these technologies work together plays a big role in the smooth functioning of your system.
Thus, it is important to know how to use H2O models alongside other commonly used technologies in the ML domain. In this chapter, we shall implement H2O with some of these technologies and see how to build systems whose components work together seamlessly.
First, we will investigate how to host an H2O prediction service as a web service using a Spring Boot application. Then, we will explore how to perform real-time predictions using H2O with Apache Storm.
In this chapter, we will cover the following topics:
By the end of this chapter, you should have a better understanding of how you can use models trained using H2O AutoML with different technologies to make predictions in different scenarios.
For this chapter, you will require the following:
All code examples for this chapter can be found on GitHub at https://github.com/PacktPublishing/Practical-Automated-Machine-Learning-on-H2O/tree/main/Chapter%2013.
Let’s jump right into the first section, where we’ll learn how to host models trained using H2O AutoML on a web application created using Spring Boot.
Nowadays, most software services are hosted on the internet, where they can be made accessible to all internet users. This is done using web applications hosted on web servers. Even prediction services that use ML can be made available to the public by hosting them on web applications.
The Spring Framework is one of the most commonly used open source frameworks for creating websites and web applications. It is based on the Java platform and, as such, can run on any system with a JVM. Spring Boot is an extension of the Spring Framework that provides a preconfigured setup for your web application out of the box. This helps you quickly set up your web application without having to implement the underlying plumbing needed to configure and host your web service.
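To give you a sense of how little setup Spring Boot needs, here is a minimal, self-contained sketch of a Spring Boot application exposing a single REST endpoint. All names here are hypothetical and unrelated to this chapter's project:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical example: @SpringBootApplication enables the preconfigured setup,
// and SpringApplication.run() starts the embedded web server
@SpringBootApplication
@RestController
public class HelloApplication {

    public static void main(String[] args) {
        SpringApplication.run(HelloApplication.class, args);
    }

    // A GET request to /hello returns a plain-text response
    @GetMapping("/hello")
    public String hello() {
        return "Hello from Spring Boot!";
    }
}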
So, let’s dive into the implementation by understanding the problem statement.
Let’s assume you are working for a wine manufacturing company. The officials want to automate the process of assessing the quality and color of their wine. The service should be available as a web service where a quality assurance executive can provide some information about the wine’s attributes, and the service will use these details and underlying ML models to predict the quality of the wine as well as its color.
So, technically, we will need two models to make the full prediction. One will be a regression model that predicts the quality of the wine, while the other will be a classification model that predicts the color of the wine.
We can use a combination of the Red Wine Quality and White Wine Quality datasets and run H2O AutoML on it to train the models. You can find the datasets at https://archive.ics.uci.edu/ml/datasets/Wine+Quality. The combined dataset is already present at https://github.com/PacktPublishing/Practical-Automated-Machine-Learning-on-H2O/tree/main/Chapter%2013/h2o_spring_boot/h2o_spring_boot.
The following screenshot shows a sample of the dataset:
Figure 13.1 – Wine quality and color dataset
This dataset consists of the following features:
Now that we understand the problem statement and the dataset that we will be working with, let’s design the architecture to show how this web service will work.
Before we dive deep into the implementation of the service, let’s look at the overall architecture of how all of the technologies should work together. The following is the architecture diagram of the wine quality and color prediction web service:
Figure 13.2 – Architecture of the wine quality and color prediction web service
Let’s understand the various components of this architecture:
Now that we have a good understanding of how we are going to create our wine quality and color prediction web service, let’s move on to its implementation.
This service has already been built and is available on GitHub. The code base can be found at https://github.com/PacktPublishing/Practical-Automated-Machine-Learning-on-H2O/tree/main/Chapter%2013/h2o_spring_boot.
Before we dive into the code, make sure your system meets the following minimum requirements:
First, we will clone the GitHub repository, open it in our preferred IDE, and go through the files to understand the whole process. The following steps have been performed on Ubuntu 22.04 LTS, using IntelliJ IDEA version 2022.1.4 as the IDE. Feel free to use any IDE of your choice that supports Maven and the Spring Framework.
So, clone the GitHub repository and navigate to Chapter 13/h2o_spring_boot/. Then, start your IDE and open the project. Once you have opened the project, you should get a directory structure similar to the following:
Figure 13.3 – Directory structure of h2o_wine_predictor
The directory structure consists of the following important files:
You may have noticed that we don’t have the model POJO files anywhere in the directory. So, let’s build those. Refer to the script.py Python file; let’s go through what it does step by step.
The code for script.py is as follows:
import h2o
import shutil
from h2o.automl import H2OAutoML

# Start the local H2O server
h2o.init()

# Import the combined red and white wine dataset
wine_quality_dataframe = h2o.import_file(path = "src/main/resources/winequality_combined.csv")

# Treat the color column as a categorical feature
wine_quality_dataframe["color"] = wine_quality_dataframe["color"].asfactor()

# Split the data into training and validation frames
train, valid = wine_quality_dataframe.split_frame(ratios=[.7])

label = "color"
features = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"]

aml_for_color_predictor = H2OAutoML(max_models=10, seed=123, exclude_algos=["StackedEnsemble"], max_runtime_secs=300)
aml_for_color_predictor.train(x = features, y = label, training_frame=train, validation_frame = valid)
When initializing the H2OAutoML object, we set the exclude_algos parameter to ["StackedEnsemble"]. This is done because stacked ensemble models are not supported as POJOs, as we learned in Chapter 10, Working with Plain Old Java Objects (POJOs).
The train() call starts the AutoML model training process. Some print statements will help you observe the progress and results of the model training process.
model = aml_for_color_predictor.leader
model.model_id = "WineColorPredictor"
print(model)
model.download_pojo(path="tmp")
label="quality"
aml_for_quality_predictor = H2OAutoML(max_models=10, seed=123, exclude_algos=["StackedEnsemble"], max_runtime_secs=300)
aml_for_quality_predictor.train(x = features, y = label, training_frame=train, validation_frame = valid)
model = aml_for_quality_predictor.leader
model.model_id = "WineQualityPredictor"
print(model)
model.download_pojo(path="tmp")
with open("tmp/WineColorPredictor.java", "r") as raw_model_POJO:
with open("src/main/java/com.h2o_wine_predictor.demo/model/ WineColorPredictor.java", "w") as model_POJO:
model_POJO.write(f'package com.h2o_wine_predictor.demo; ' + raw_model_POJO.read())
with open("tmp/WineQualityPredictor.java", "r") as raw_model_POJO:
with open("src/main/java/com.h2o_wine_predictor.demo/model/ WineQualityPredictor.java", "w") as model_POJO:
model_POJO.write(f'package com.h2o_wine_predictor.demo; ' + raw_model_POJO.read())
shutil.rmtree("tmp")
So, let’s run this script and generate our models. You can do so by executing the following command in your Terminal:
python3 script.py
This should generate the respective model POJO files in the src/main/java/com.h2o_wine_predictor.demo/model/ directory.
Now, let’s observe the PredictionService file in the src/main/java/com.h2o_wine_predictor.demo/service directory.
The PredictionService class inside the PredictionService file has the following attributes:
Now that we understand the attributes of this class, let’s check out its methods, which are as follows:
Now that we have had a chance to go through the important parts of the whole project, let’s go ahead and run the application so that we can have the web service running locally on our machines. Then, we will run a simple cURL command with the wine quality feature values and see if we get the predictions as a response.
To start the application, you can do the following:
mvn spring-boot:run -e
If everything is working fine, then you should get an output similar to the following:
Figure 13.4 – Successful Spring Boot application run output
Now that the Spring Boot application is running, the only thing remaining is to test this out by making a POST request call to the web service running on localhost:8082.
Open another Terminal and execute the following curl command to make a prediction request:
curl -X POST localhost:8082/api/v1/predict -H "Content-Type: application/json" -d '{"fixed acidity":6.8,"volatile acidity":0.18,"citric acid":0.37,"residual sugar":1.6,"chlorides":0.055,"free sulfur dioxide":47,"total sulfur dioxide":154,"density":0.9934,"pH":3.08,"sulphates":0.45,"alcohol":9.1}'
The request should go to the web application, where the application will extract the feature values, convert them into the RowData object type, pass the RowData to the prediction function, get the prediction results, convert the results into an appropriate JSON, and send the JSON back as a response. This should look as follows:
Figure 13.5 – Prediction result from the Spring Boot web application
From the JSON response, you can see that the predicted color of the wine is white and that its quality is 5.32.
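To make this flow concrete, the following is a minimal sketch of how a prediction method can pair the two generated POJOs using the h2o-genmodel easy-prediction API. The class layout here is an illustrative assumption, not the exact code from the repository:

import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;
import hex.genmodel.easy.prediction.RegressionModelPrediction;

// Illustrative sketch (hypothetical class name): wrapping the two POJOs
public class WinePredictionSketch {
    private final EasyPredictModelWrapper colorModel =
        new EasyPredictModelWrapper(new WineColorPredictor());
    private final EasyPredictModelWrapper qualityModel =
        new EasyPredictModelWrapper(new WineQualityPredictor());

    public String predict(RowData row) throws Exception {
        // Classification: predict the wine's color (red or white)
        BinomialModelPrediction color = colorModel.predictBinomial(row);
        // Regression: predict the wine's quality score
        RegressionModelPrediction quality = qualityModel.predictRegression(row);
        return String.format("{\"color\":\"%s\",\"quality\":%.2f}", color.label, quality.value);
    }
}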
Congratulations! You have just implemented an ML prediction service on a Spring Boot web application. You can further expand this service by adding a frontend that takes the feature values as input and a button that, upon being clicked, creates a POST body of all those values and sends the API request to the backend. Feel free to experiment with this project as there is plenty of scope for how you can use H2O model POJOs on a web service.
In the next section, we’ll learn how to make real-time predictions using H2O AutoML, along with another interesting technology called Apache Storm.
Apache Storm is an open source data analysis and computation tool for processing large amounts of stream data in real time. In the real world, you will often have plenty of systems that continuously generate large amounts of data. You may need to make some computations or run some processes on this data to extract useful information as it is generated in real time.
Let’s take the example of a log system in a very heavily used web service. Assuming that this web service receives millions of requests per second, it is going to generate tons of logs. And you already have a system in place that stores these logs in your database. Now, this log data will eventually pile up and you will have petabytes of log data stored in your database. Querying all this historical data to process it in one go is going to be very slow and time-consuming.
What you can do is process the data as it is generated. This is where Apache Storm comes into play. You can configure your Apache Storm application to perform the needed processing and direct your log data to flow through it and then store it in your database. This will streamline the processing, making it real-time.
Apache Storm can be used for multiple use cases, such as real-time analytics, Extract-Transform-Load (ETL) operations in data pipelines, and even ML. What makes Apache Storm the go-to solution for real-time processing is its speed. A benchmarking test performed by the Apache Foundation found that Apache Storm can process around a million tuples per second per node. Apache Storm is also very scalable and fault-tolerant, which guarantees that it will process all the incoming real-time data.
So, let’s dive deep into the architecture of Apache Storm to understand how it works.
Apache Storm uses cluster computing, similar to how Hadoop and even H2O work. Consider the following architectural diagram of Apache Storm:
Figure 13.6 – Architecture of Apache Storm
Apache Storm divides the nodes in its cluster into two categories – master nodes and worker nodes. The features of these nodes are as follows:
The communication between the master node and the worker nodes, via their respective daemons, is done through the Zookeeper cluster. In short, the Zookeeper cluster is a centralized service that provides configuration management and synchronization services for distributed systems. In this scenario, the master node and the worker nodes are stateless, fail-fast services, and all the state details are stored in the Zookeeper cluster. Keeping the nodes stateless helps with fault tolerance, since failed nodes can be restarted and will resume working as if nothing had happened.
Tip
If you are interested in understanding the various concepts and technicalities of Zookeeper, then feel free to explore it in detail at https://zookeeper.apache.org/.
Before we move on to the implementation part of Apache Storm, we need to be aware of certain concepts that are important to understand how Apache Storm works. The different concepts are as follows:
Tip
You can learn more about Apache Thrift by going to https://thrift.apache.org/docs/types.
Now that you have a better understanding of what Apache Storm is and the various concepts involved in its implementation, we can move on to the implementation part of this section, starting with installing Apache Storm.
Tip
Apache Storm is a very powerful and sophisticated system. It has plenty of applicability outside of just machine learning and also has plenty of features and support. If you want to learn more about Apache Storm, go to https://storm.apache.org/.
Let’s start by noting down the basic requirements for installing Apache Storm. They are as follows:
So, make sure these basic requirements are already installed on your system. Now, let’s start by downloading the Apache Storm repo. You can find the repo at https://github.com/apache/storm.
So, execute the following command to clone the repository to your system:
git clone https://github.com/apache/storm.git
Once the download is finished, you can open the storm folder to get a glimpse of its contents. You will notice that there are tons of files, so it can be overwhelming when you’re trying to figure out where to start. Don’t worry – we’ll work on very simple examples that should be enough to give you a basic idea of how Apache Storm works. Then, you can branch out from there to get a better understanding of what Apache Storm has to offer.
Now, open your Terminal and navigate to the cloned repo. You will need to build Apache Storm locally before you can implement any of its features, as the local build generates important JAR files that get installed in your $HOME/.m2/repository folder. This is the folder from which Maven will pick up the JAR dependencies when you build your Apache Storm application.
So, locally build Apache Storm by executing the following command at the root of the repository:
mvn clean install -DskipTests=true
The build might take some time, considering that Maven will be building several JAR files that are important dependencies to your application. So, while that is happening, let’s understand the problem statement that we will be working on.
Let’s assume you are working for a medical company. The medical officials want a system that predicts whether a patient who has survived a heart failure is likely to suffer from any complications or is safe to be discharged. The catch is that this prediction service will be used by all the hospitals in the country, and they need immediate prediction results so that the doctors can decide whether to keep the patient admitted for a few days to monitor their health or to discharge them.
So, the machine learning problem is that there will be streams of data on which our system will need to make immediate predictions. We can set up an Apache Storm application that streams all the data into the prediction service and deploys model POJOs trained using H2O AutoML to make the predictions.
We can train the models on the Heart Failure Clinical dataset, which can be found at https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records.
The following screenshot shows some sample content from the dataset:
Figure 13.7 – Heart Failure Clinical dataset
This dataset consists of the following features:
Now that we understand the problem statement and the dataset that we will be working with, let’s design the architecture of how we can use Apache Storm and H2O AutoML to solve this problem.
Let’s look at the overall architecture of how all the technologies should work together. Refer to the following architecture diagram of the heart failure complication prediction service:
Figure 13.8 – Architecture diagram of using H2O AutoML with Apache Storm
Let’s understand the various components of the architecture:
Now that we have designed a simple and good solution, let’s move on to its implementation.
This service is already available on GitHub. The code base can be found at https://github.com/PacktPublishing/Practical-Automated-Machine-Learning-on-H2O/tree/main/Chapter%2013/h2o_apache_storm/h2o_storm.
So, download the repo and navigate to /Chapter 13/h2o_apache_storm/h2o_storm/.
You will see that we have two folders. One is the storm-starter directory, while the other is the storm-streaming directory. Let’s focus on the storm-streaming directory first. Start your IDE and open the storm-streaming project. Once you open the project, you should see a directory structure similar to the following:
Figure 13.9 – storm_streaming directory structure
This directory structure consists of the following important files:
Unlike the previous experiments, where we made changes in a separate application repository, for this experiment, we shall make changes in Apache Storm’s repository.
The following steps have been performed on Ubuntu 22.04 LTS; IntelliJ IDEA version 2022.1.4 has been used as the IDE. Feel free to use any IDE of your choice that supports the Maven framework.
Let’s start by understanding the model training script, script.py. The code for script.py is as follows:
import h2o
from h2o.automl import H2OAutoML

# Start the local H2O server
h2o.init()

# Import the heart failure clinical records dataset
heart_failure_dataframe = h2o.import_file(path = "training_data.csv")

label = "complications"
features = ["age", "anemia", "creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "platelets", "serum_creatinine", "serum_sodium", "sex", "smoking", "time"]

aml_for_complications = H2OAutoML(max_models=10, seed=123, exclude_algos=["StackedEnsemble"], max_runtime_secs=300)
aml_for_complications.train(x = features, y = label, training_frame = heart_failure_dataframe)
Since POJOs are not supported for stacked ensemble models, we set the exclude_algos parameter with the StackedEnsemble value.
This starts the AutoML model training process. The script also contains print statements that will help you observe the progress and results of the model training process.
model = aml_for_complications.leader
model.model_id = "HeartFailureComplications"
print(model)
model.download_pojo(path="tmp")
So, let’s run this script and generate our model. You can do so by executing the following command in your Terminal:
python3 script.py
This should generate the model POJO file in the tmp directory.
Now, let’s investigate the next file in the repository: H2ODataSpout.java. The H2ODataSpout class in this Java file has a few attributes and functions that are important for building Apache Storm applications. We won’t focus on all of them, but let’s have a look at the functions that play a bigger role in the application’s business logic. They are as follows:
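// Excerpt from the spout's nextTuple() method (assumed context): read and emit one observation per call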
Utils.sleep(1000); // wait one second between emissions to simulate a steady stream
File file = new File("live_data.csv");
String[] observation = null;
try {
  String line = "";
  BufferedReader br = new BufferedReader(new FileReader(file));
  int i = 0; // assumed local counter; stream through the file to the next unread line
  while (i++ <= _cnt.get()) line = br.readLine();
  observation = line.split(",");
} catch (Exception e) {
  e.printStackTrace();
  _cnt.set(0);
}
_cnt.getAndIncrement();
if (_cnt.get() == 1000) _cnt.set(0); // wrap around after 1,000 rows to keep the stream going
_collector.emit(new Values(observation));
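// Excerpt from the spout's declareOutputFields() method (assumed context): declare the emitted tuple's field names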
LinkedList<String> fields_list = new LinkedList<String>(Arrays.asList(HeartFailureComplications.NAMES));
fields_list.add(0, "complication"); // prepend the label field to the model's column names
String[] fields = fields_list.toArray(new String[fields_list.size()]);
declarer.declare(new Fields(fields));
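For orientation, the following is a minimal skeleton showing where these two excerpts live inside a Storm spout class. This is a sketch based on Storm's standard BaseRichSpout API, not the full H2ODataSpout implementation:

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;

public class DataSpoutSkeleton extends BaseRichSpout {
    private SpoutOutputCollector _collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector; // keep the collector so nextTuple() can emit
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; this is where the live_data.csv reading
        // and _collector.emit(...) logic from the first excerpt goes
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This is where the declarer.declare(new Fields(...)) call from the
        // second excerpt goes
    }
}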
Moving on, let’s investigate the H2OStormStarter.java file. This file contains both of the bolts needed for performing the predictions and classification, as well as the h2o_storm() function, which builds the Apache Storm topology and passes it on to the Apache Storm cluster. Let’s dive deep into the individual attributes:
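// Excerpt from PredictionBolt's execute() method (assumed context): convert the tuple into model input and score it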
HeartFailureComplications h2oModel = new HeartFailureComplications();
ArrayList<String> stringData = new ArrayList<String>();
for (Object tuple_value : tuple.getValues()) stringData.add((String) tuple_value);
String[] rawData = stringData.toArray(new String[stringData.size()]);
// Convert the raw strings into the doubles the model expects, mapping
// categorical columns through the model's domain values
double data[] = new double[rawData.length-1];
String[] columnName = tuple.getFields().toList().toArray(new String[tuple.size()]);
for (int i = 1; i < rawData.length; ++i) {
  data[i-1] = h2oModel.getDomainValues(columnName[i]) == null
    ? Double.valueOf(rawData[i])
    : h2oModel.mapEnum(h2oModel.getColIdx(columnName[i]), rawData[i]);
}
// Score the observation and emit the complication probability alongside the row ID
double[] predictions = new double[h2oModel.nclasses()+1];
h2oModel.score0(data, predictions);
_collector.emit(tuple, new Values(rawData[0], predictions[1]));
_collector.ack(tuple);
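// Excerpt from ClassifierBolt's execute() method (assumed context): apply a probability threshold to the score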
_collector.emit(tuple, new Values(expected, complicationProb <= _threshold ? "No Complication" : "Possible Complication"));
_collector.ack(tuple);
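// Excerpt from the h2o_storm() function: wire the spout and bolts into a topology and run it on a local cluster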
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("inputDataRow", new H2ODataSpout(), 10);
builder.setBolt("scoreProbabilities", new PredictionBolt(), 3).shuffleGrouping("inputDataRow");
builder.setBolt("classifyResults", new ClassifierBolt(), 3).shuffleGrouping("scoreProbabilities");
Config conf = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("HeartComplicationPredictor", conf, builder.createTopology());
Utils.sleep(1000 * 60 * 60);
cluster.killTopology("HeartComplicationPredictor");
cluster.shutdown();
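Note that LocalCluster runs the entire topology inside a single JVM, which is convenient for development and testing. On a real multi-node Storm cluster, you would package the topology as a JAR and submit it with StormSubmitter instead; a minimal sketch, assuming the same conf and builder, is as follows:

// Submit the topology to a remote Storm cluster instead of running it locally
StormSubmitter.submitTopology("HeartComplicationPredictor", conf, builder.createTopology());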
Now that we have a better understanding of the contents of the files and how we are going to be running our service, let’s proceed and look at the contents of the storm-starter project. The directory structure will be as follows:
Figure 13.10 – storm-starter directory structure
The src directory contains several different types of Apache Storm topology samples that you can choose to experiment with. I highly recommend that you do so as that will help you get a better understanding of how versatile Apache Storm is when it comes to configuring your streaming service for different needs.
However, we shall perform this experiment in the test directory to keep our files isolated from the ones in the src directory. So, let’s see how we can run this experiment.
Follow these steps to build and run the experiment:
python3 script.py
<dependency>
<groupId>ai.h2o</groupId>
<artifactId>h2o-genmodel</artifactId>
<version>3.36.1.3</version>
</dependency>
Your directory should now look as follows:
Figure 13.11 – storm-starter directory structure after file transfers
Figure 13.12 – Heart complication prediction output in Apache Storm
If you observe the results closely, you should see executors mentioned in the logs; all Apache Storm spouts and bolts run as executor processes on the cluster. You will also see the prediction probabilities beside each tuple. This should look as follows:
Figure 13.13 – Heart complication prediction result
Congratulations – we have just covered another design pattern that shows us how we can use models trained using H2O AutoML to make real-time predictions on streaming data using Apache Storm. This concludes the last experiment of this chapter.
In this chapter, we focused on how we can implement models that have been trained using H2O AutoML in different scenarios using different technologies to make predictions on different kinds of data.
We started by implementing an AutoML leader model in a scenario where we tried to make predictions on data over a web service. We created a simple web service that was hosted on localhost using Spring Boot and the Apache Tomcat web server. We trained the model on data using AutoML, extracted the leader model as a POJO, and loaded that POJO as a class in the web application. By doing this, the application was able to use the model to make predictions on the data that it received as a POST request, responding with the prediction results.
Then, we looked into another design pattern where we aimed to make predictions on real-time data. We had to implement a system that could simulate the real-time flow of data. We did this with Apache Storm. First, we dived deep into understanding what Apache Storm is, its architecture, and how it works using spouts and bolts. Using this knowledge, we built a real-time data streaming application. We deployed our AutoML-trained model in a prediction bolt, where the Apache Storm application was able to use it to make predictions on the real-time streaming data.
This concludes the final chapter of this book. There are still innumerable features, concepts, and design patterns that we can work with while using H2O AutoML. The more you experiment with this technology, the better you will get at implementing it. Thus, it is highly recommended that you keep experimenting with this technology and discover new ways of solving ML problems while automating your ML workflows.