Model training is one of the core components of a Machine Learning (ML) pipeline. It is the step in which the system reads the dataset and learns the patterns within it. This learning outputs a mathematical representation of the relationship between the features in the dataset and the target value. How the system reads and analyzes data depends on the ML algorithm being used and its intricacies, and this is where the primary complexity of ML lies. Every ML algorithm has its own way of interpreting data and deriving information from it, and every algorithm aims to optimize certain metrics while trading off bias and variance. The automation performed by H2O AutoML adds yet another layer on top of this, and trying to understand how it all works can be overwhelming for many engineers.
Don’t be discouraged by this complexity. All sophisticated systems can be broken down into simple components, and understanding those components and how they interact is what helps us understand the system as a whole. In this chapter, we will open up the black box that is H2O’s AutoML service and see what kind of magic makes the automation of ML possible. We shall first look at the architecture of H2O, break it down into simple components, and examine how those components interact. Later, we will come to understand how H2O AutoML trains so many models and is able to optimize their hyperparameters to get the best possible model.
In this chapter, we are going to cover the following topics:
So, let’s begin by first understanding the architecture of H2O.
To dive deep into H2O technology, we first need to understand its high-level architecture. This will help us understand not only the different software components that make up the H2O AI stack but also how those components interact with each other and what their dependencies are.
With this in mind, let’s have a look at the H2O AI high-level architecture, as shown in the following diagram:
Figure 4.1 – H2O AI high-level architecture
The H2O AI architecture is conceptually divided into two parts, each serving a different purpose in the software stack. The parts are as follows:
The client and the JVM component layers are separated by the network layer. The network layer is simply the network, typically the internet, over which requests are sent.
Let’s dive deep into every layer to better understand their functionalities, starting with the first layer, the client layer.
The client layer comprises all the client code that you install in your system. You use this software program to send requests to the H2O server to perform your ML activities. The following diagram shows you the client layer from the H2O high-level architecture:
Figure 4.2 – The client layer of H2O high-level architecture
Every supported language will have its own H2O client code that is installed and used in the respective language’s script. All client code internally communicates with the H2O server via a REST API over a socket connection.
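To make this concrete, the following sketch builds the kind of REST request a client might construct for a dataset import. The endpoint path and parameter follow H2O’s v3 REST convention, and port 54321 is H2O’s default, but treat the exact URL shape as illustrative; the request is built, not sent.

```python
# Sketch: how a client could wrap a user call in a REST request to the
# H2O server. The request is constructed but never sent.
import urllib.parse
import urllib.request

H2O_SERVER = "http://localhost:54321"  # H2O's default port

def build_import_request(dataset_path):
    """Build (but do not send) the REST request for a dataset import."""
    params = urllib.parse.urlencode({"path": dataset_path})
    url = f"{H2O_SERVER}/3/ImportFiles?{params}"
    return urllib.request.Request(url, method="GET")

req = build_import_request("Dataset/iris.data")
print(req.full_url)
```

Whatever language the client is written in, it ultimately reduces the user’s call to a request like this one.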
The following H2O clients exist for the respective languages:
The following diagram shows you the interactions of various H2O clients with the same H2O server:
Figure 4.3 – Different clients communicating with the same H2O server
As you can see in the diagram, all the different clients can communicate with the same instance of the H2O server. This enables a single H2O server to service different software products written in different languages.
This covers the contents of the client layer; let’s move down to the next layer in the H2O’s high-level architecture, that is, the JVM component layer.
The JVM is a runtime engine that runs Java programs in your system. The H2O cloud server runs on multiple JVM processes, also called JVM nodes. Each JVM node runs specific components of the H2O software stack.
The following diagram shows you the various JVM components that make up the H2O server:
Figure 4.4 – H2O JVM component layer
As seen in the preceding diagram, the JVM nodes are further split into three different layers, which are as follows:
Some of the JVM processes in this layer are as follows:
The entire JVM component layer lies on top of Spark and Hadoop data processing systems. The components in the JVM layer leverage these data processing cluster management engines to support cluster computing.
This sums up the entire high-level architecture of H2O’s software technology. With this background in mind, let’s move to the next section, where we shall understand the flow of interaction between the client and H2O and how the client-server interaction helps us perform ML activities.
In Chapter 1, Understanding H2O AutoML Basics, and Chapter 2, Working with H2O Flow (H2O’s Web UI), we saw how we can send a command to H2O to import a dataset or train a model. Let’s try to understand what happens behind the scenes when you send a request to the H2O server, beginning with data ingestion.
The process of a system ingesting data resembles how we read a book: we open the book and read one line at a time. Similarly, when you want your program to read a dataset stored on your system, you first tell the program the location of the dataset. The program then opens the file, reads the bytes of data line by line, and stores them in RAM. The issue with this kind of sequential reading is that datasets in ML tend to be huge. Such data is often termed big data and can span from gigabytes to terabytes in volume. Reading such volumes sequentially, no matter how fast the system, takes a significant amount of time. This is time that ML pipelines do not have, as the aim of an ML pipeline is to make predictions, and those predictions have no value if the moment to act on them has already passed. For example, if you design an ML system installed in a car that automatically stops the car when it detects a possible collision, the system would be useless if it spent all its time reading data and made its collision predictions too late.
This is where parallel computing, or cluster computing, comes in. A cluster is nothing but multiple processes connected over a network that perform like a single entity. The main aim of cluster computing is to parallelize long-running sequential tasks across these processes so that they finish quickly. It is for this reason that cluster computing plays a very important role in ML pipelines, and H2O rightly uses clusters to ingest data.
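The speed-up from splitting one sequential task across workers can be sketched in a few lines. Here threads stand in for the JVM nodes of an H2O cluster; the task (summing a large range) is a toy stand-in for reading a large dataset.

```python
# Minimal sketch of cluster-style parallelism: one sequential task is
# split into chunks, each processed by a separate worker, and the
# partial results are combined at the end.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))

def partial_sum(chunk):
    return sum(chunk)

# Split the work into 4 chunks, one per "node".
n_workers = 4
chunk_size = len(data) // n_workers
chunks = [data[i * chunk_size:(i + 1) * chunk_size] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # same result as sum(data), computed in parallel
```

The combined result is identical to the sequential one; only the wall-clock time changes as work is spread across workers.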
Let’s observe how a data ingestion interaction request flows from the H2O client to the H2O server and how H2O ingests data.
Refer to the following diagram to understand the flow of data ingestion interaction:
Figure 4.5 – H2O data ingestion request interaction flow
The following sequence of steps describes how a client request to the H2O cluster server to ingest data is serviced by H2O using the Hadoop Distributed File System (HDFS):
h2o.import_file("Dataset/iris.data")
The H2O client will extract the dataset location from the function call and internally create a REST API request (see Step 2 in Figure 4.5). The client will then send the request over the network to the IP address where the H2O server is hosted.
Each node will read a section of the dataset and store it in its cluster memory.
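The idea that each node reads only its own section of the dataset can be sketched as a row-range assignment. The node count and the use of the 150-row Iris dataset are illustrative; H2O’s actual chunking scheme is more sophisticated.

```python
# Toy sketch of distributed ingestion: each "node" is assigned a
# contiguous slice of row indices, so no single process reads the
# whole file.
def assign_rows(n_rows, n_nodes):
    """Split row indices [0, n_rows) into one contiguous range per node."""
    base, extra = divmod(n_rows, n_nodes)
    ranges, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

print(assign_rows(150, 4))  # e.g. the 150-row Iris dataset over 4 nodes
```

Each node then reads and holds only its assigned range in memory, which is what lets the cluster ingest data far larger than any single node’s RAM.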
Refer to the following diagram to understand the flow of interaction once data is ingested and H2O returns a response:
Figure 4.6 – H2O data ingestion response interaction flow
Once the client receives the response, it creates a DataFrame object that contains this pointer, which the user can later use to run any further operations on the ingested dataset (see Step 4 in Figure 4.6). In this way, with the use of pointers and the distributed key-value store, H2O can support DataFrame manipulation and usage without needing to transfer the huge volume of ingested data between the server and the client.
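The pointer-over-payload idea can be sketched with a tiny key-value store. The class and key format here are illustrative, not H2O’s internal types; the point is that only a small key, never the data, crosses the network.

```python
# Sketch: the server keeps frames in a key-value store and hands the
# client only a key (pointer). The client-side DataFrame wraps that key.
class FrameStore:
    def __init__(self):
        self._store = {}
        self._counter = 0

    def put(self, frame):
        self._counter += 1
        key = f"frame_{self._counter}"
        self._store[key] = frame      # data stays on the "server"
        return key                    # only the key crosses the network

    def get(self, key):
        return self._store[key]

store = FrameStore()
key = store.put([[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]])
print(key)  # the client-side "DataFrame" holds just this handle
```

Every later operation the client issues simply references this key, and the server resolves it against its own store.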
Now that we understand how H2O ingests data, let us now look into how it handles model training requests.
During model training, plenty of interactions take place, right from the user making the model training request to the user receiving the trained ML model. The various components of H2O perform the model training activity using a series of coordinated messages and scheduled jobs.
To better understand what happens internally when a model training request is sent to the H2O server, we need to dive deep into the sequence of interactions that occur during model training.
We shall understand the sequences of interactions by categorizing them as follows:
So, let’s begin first by understanding what happens when the client starts a model training job.
The model training job starts when the client first sends a model training request to H2O.
The following sequence diagram shows you the sequence of interactions that take place inside H2O when a client sends a model training request:
Figure 4.7 – Sequence of interactions in the model training request
The following set of sequences takes place during a model training request:
This sums up the sequence of events that take place inside the H2O server when it receives a model training request.
Now that we understand what happens to the training request, let’s understand what the events that take place are when the training job created in step 6 is training the model.
In H2O, the training of a model is carried out by an internal model training job that acts independently from the user’s API request. The user’s API request just initiates the job; the job manager does the actual execution of the job.
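The split between the API request and the job execution can be sketched as follows. The class and field names are illustrative, not H2O’s internal classes; the thread is joined immediately only so the example is deterministic, whereas a real job manager would let it run in the background.

```python
# Toy sketch of the request/job split: the API handler only registers a
# job and returns its ID; the job manager runs the work separately.
import threading
import uuid

class JobManager:
    def __init__(self):
        self.jobs = {}

    def submit(self, work):
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"status": "RUNNING", "result": None}

        def run():
            self.jobs[job_id]["result"] = work()
            self.jobs[job_id]["status"] = "DONE"

        t = threading.Thread(target=run)
        t.start()
        t.join()  # joined here only to keep the example deterministic
        return job_id

manager = JobManager()
job_id = manager.submit(lambda: "trained-model")
print(manager.jobs[job_id]["status"])
```

The caller gets back only a job ID, which is exactly why the client must later poll for the job’s status rather than wait on the original request.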
The following sequence diagram shows you the sequence of interactions that take place when a model training job is training a model:
Figure 4.8 – Sequence of interactions in the model training job execution
The following set of sequences takes place during model training:
Now that we understand what goes on behind the scenes when a model training job is training a model, let’s move on to understand what happens when a client polls for the model training status.
As mentioned previously, the actual training of the model proceeds independently of the client’s training request. Once a training request has been sent, the client is in fact unaware of the progress of the model, so it needs to poll for the status of the model training job. This can be done either by manually making an HTTP request or via client software features, such as progress trackers, that poll the H2O server for the training status at regular intervals.
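A polling loop of this kind can be sketched in a few lines. The job-status function below is simulated; a real client would query the server’s jobs endpoint and typically sleep between requests.

```python
# Minimal polling sketch: the client repeatedly asks the "server" for
# the job's progress until it reports completion.
def fake_job_status(progress_iter):
    """Simulated server response: returns progress in [0.0, 1.0]."""
    return next(progress_iter)

progress_iter = iter([0.25, 0.5, 0.75, 1.0])
polls = 0
while True:
    polls += 1
    progress = fake_job_status(progress_iter)
    if progress >= 1.0:
        break

print(polls)  # number of polls until the job reported completion
```

This is essentially what a client-side progress bar does: each tick corresponds to one status poll against the server.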
The following sequence diagram shows you the sequence of interactions that takes place when a client polls for the model training job completion:
Figure 4.9 – User polling for the model status sequence of interactions
The following set of sequences takes place when the client polls for the model training job completion:
This sums up the various interactions that take place when a client polls for the status of model training. With this in mind, let’s now see what happens when a client requests for the model info once it is informed that the model training job has finished training the model.
Once a model is trained successfully, the user will most likely want to analyze its details. An ML model has plenty of metadata associated with its performance and quality, and this metadata is very useful even before the model is used for predictions. As we saw in the previous section, the model training process was independent of the user’s request, and H2O did not return a model object once training was complete. However, the H2O server does provide an API with which you can retrieve information about a model already stored on the server.
The following sequence diagram shows you the sequence of interactions that take place when a client requests information about a trained model:
Figure 4.10 – User querying for model information
The following set of sequences takes place when the client requests information about a trained model:
A model, once trained, is stored directly in the H2O server itself for quick access whenever there are any prediction requests. You can download the H2O model as well; however, any model not imported into the H2O server cannot be used for predictions.
This sums up the entire sequence of interactions that takes place in various parts of the H2O client-server communication. Now that we understand how H2O trains models internally using jobs and the job manager, let’s dive deeper and try to understand what happens when H2O AutoML trains and optimizes hyperparameters, eventually selecting the best model.
Throughout the course of this book, we have marveled at how the AutoML process automates the sophisticated task of training and selecting the best model without us needing to lift a finger. Behind every automation, however, there is a series of simple steps that is executed in a sequential manner.
Now that we have a good understanding of H2O’s architecture and how to use H2O AutoML to train models, we are now ready to finally open the black box, that is, H2O AutoML. In this section, we shall understand what H2O AutoML does behind the scenes so that it automates the entire process of training and selecting the best ML models.
The short answer is that H2O AutoML automates the ML process using grid search hyperparameter optimization.
Grid search hyperparameter optimization sounds very intimidating to a lot of non-experts, but the concept in itself is actually very easy to understand, provided that you know some of the basic concepts in model training, especially the importance of hyperparameters.
So, before we dive into grid search hyperparameter optimization, let’s first come to understand what hyperparameters are.
Most software engineers are aware of what parameters are: variables containing user input data, or system-calculated data, that are fed to another function or process. In ML, however, this concept is slightly complicated by the introduction of hyperparameters. In the field of ML, there are two types of parameters: one type we call model parameters, or just parameters, and the other is hyperparameters. Even though they have similar names, there are important differences between them that all software engineers should keep in mind when working in the ML space.
So, let’s understand them by simple definition:
The aim of training an optimal model is simple:
Sounds simple enough. However, there is a catch: hyperparameters are not intuitive in nature. One cannot simply observe the data and decide that a value of x for a hyperparameter will yield the best model. Finding the right hyperparameters is a trial-and-error process, where the aim is to find a combination that minimizes errors.
Now, the next question that arises is how you find the best hyperparameters for training a model. This is where hyperparameter optimization comes into the picture, which we will cover next.
Hyperparameter optimization, also known as hyperparameter tuning, is the process of choosing the best set of hyperparameters for a given ML algorithm to train the most optimal model. The best combination of these values minimizes a predefined loss function of an ML algorithm. A loss function in simple terms is a function that measures some unit of error. The loss function is different for different ML algorithms. A model with the lowest possible amount of errors among a potential combination of hyperparameter values is said to have optimal hyperparameters.
There are many approaches to implementing hyperparameter optimization. Some of the most common ones are grid search, random grid search, Bayesian optimization, and gradient-based optimization. Each is a very broad topic to cover; however, for this chapter, we shall focus on only two approaches: grid search and random grid search.
Tip
If you want to explore more about the Bayesian optimization technique for hyperparameter tuning, then feel free to do so. You can get additional information on the topic at this link: https://arxiv.org/abs/1807.02811. Similarly, you can get more details on gradient-based optimization at this link: https://arxiv.org/abs/1502.03492.
It is actually the random grid search approach that is used by H2O’s AutoML for hyperparameter optimization, but you need to have an understanding of the original grid search approach to optimization in order to understand random grid search.
So, let’s begin with grid search hyperparameter optimization.
Let’s take the example of the Iris Flower Dataset that we used in Chapter 1, Understanding H2O AutoML Basics. In this dataset, we are training a model that is learning from the sepal width, sepal length, petal width, and petal length to predict the classification type of the flower.
Now, the first question you are faced with is: which ML algorithm should be used to train a model? Assuming you do come up with an answer to that and choose an algorithm, the next question you will have is: which combination of hyperparameters will get me the optimal model?
Traditionally, ML practitioners would train multiple models for a given ML algorithm with different combinations of hyperparameter values. They would then compare the performance of these models and find out which hyperparameter combination trained the model with the lowest possible error rate.
The following diagram shows you how different combinations of hyperparameters train different models with varying performance:
Figure 4.11 – Manual hyperparameter tuning
Let’s take an example where you are training a tree-based model, such as a gradient boosting machine, whose hyperparameters include the number of trees, ntrees, and the maximum tree depth, max_depth. If you are performing a manual search for hyperparameter optimization, you might initially start with values such as 50, 100, 150, and 200 for ntrees and 5, 10, and 50 for max_depth, train the models, and measure their performance. When you find which combination of those values gives the best results, you take those values as a baseline and tweak them with smaller increments or decrements, retrain the models with the new hyperparameter values, and compare the performance again. You keep doing this until you find the set of hyperparameter values that gives you the optimum performance.
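The manual loop just described can be sketched directly. The validation-error function here is synthetic, standing in for a real model’s measured validation error, and the "best" settings it encodes are made up for illustration.

```python
# Toy version of the manual search: try a coarse grid of ntrees and
# max_depth, score each combination, and keep the best one.
def validation_error(ntrees, max_depth):
    # Synthetic loss: pretend the best settings are ntrees=150, max_depth=10.
    return abs(ntrees - 150) / 100 + abs(max_depth - 10) / 10

best = None
for ntrees in [50, 100, 150, 200]:
    for max_depth in [5, 10, 50]:
        err = validation_error(ntrees, max_depth)
        if best is None or err < best[0]:
            best = (err, ntrees, max_depth)

print(best)  # (0.0, 150, 10): the combination with the lowest error
```

In practice, each call to the error function means training and evaluating a whole model, which is exactly why this process becomes tedious as the grid grows.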
This method, however, has a few drawbacks. Firstly, the range of values you can try out initially is limited since you can only train so many models manually. So, if you have a hyperparameter whose value can range between 1 and 10,000, then you need to make sure that you cover enough ground to not miss the ideal value by a huge margin. If you do, then you will end up constantly tweaking the value with smaller increments or decrements, spending lots of time optimizing. Secondly, as the number of hyperparameters increases and the number of possible values and combinations of values you want to use increases, it becomes tedious for the ML practitioner to manage and run optimization processes.
To manage and partially automate this process of training multiple models with different hyperparameters, grid search was invented. Grid search is also known as Cartesian Hyperparameter Search or exhaustive search.
Grid search basically maps all the values for given hyperparameters over a Cartesian grid and exhaustively searches combinations in the grid to train models. Refer to the following diagram, which shows you how a hyperparameter grid search translates to multiple models being trained:
Figure 4.12 – Cartesian grid search hyperparameter tuning
In the diagram, we can see a two-dimensional grid that maps the two hyperparameters. Using this Cartesian grid, we can expand the search to 10 candidate values per hyperparameter. The grid search approach exhaustively evaluates every combination of the two hyperparameters, so it will cover 100 different combinations and train 100 different models in total, all without needing much manual intervention.
H2O does have grid search capabilities that users can use to test out their own manually implemented grid search approach for hyperparameter optimization. When training models using grid search, H2O will map all models that it trains to the respective hyperparameter value combinations of the grid. H2O also allows you to sort all these models based on any supported model performance metrics. This sorting helps you quickly find the best-performing model based on the metric values. We shall explore more about performance metrics in Chapter 6, Understanding H2O AutoML Leaderboard and Other Performance Metrics.
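A Cartesian grid specification of this kind might look as follows. The hyper_params dict mirrors the shape H2O’s Python client accepts for grid search; the commented training calls assume a running H2O cluster and hypothetical frame and column names, so only the combination count is computed here.

```python
# Sketch: specifying a Cartesian hyperparameter grid and counting how
# many models an exhaustive search over it would train.
import itertools

hyper_params = {
    "ntrees": [50, 100, 150, 200],
    "max_depth": [5, 10, 50],
}

# With a running cluster, this grid could be handed to H2OGridSearch:
#   from h2o.grid.grid_search import H2OGridSearch
#   from h2o.estimators import H2OGradientBoostingEstimator
#   grid = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params)
#   grid.train(x=features, y=target, training_frame=train)  # hypothetical names
#   grid.get_grid(sort_by="logloss")  # sort the trained models by a metric

n_models = len(list(itertools.product(*hyper_params.values())))
print(n_models)  # 4 * 3 = 12 models for an exhaustive Cartesian search
```

Sorting the resulting grid by a metric is what makes it quick to spot the best-performing combination.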
However, despite automating and introducing a quality-of-life improvement to manual searching, there are still some drawbacks to this approach. Grid search hyperparameter optimization suffers from what is called the curse of dimensionality.
The curse of dimensionality was a term coined by Richard E. Bellman when considering problems in dynamic programming. From the point of view of ML, this concept states that as the number of hyperparameter combinations increases, the number of evaluations that the grid search will perform increases exponentially.
For example, let’s say you have a hyperparameter x and you want to try out integer values 1-20. In this case, you will end up doing 20 evaluations, in other words, training 20 models. Now suppose that there is another hyperparameter y and you want to try out the values 1-20 in combination with the values for x. Your combinations will be as follows:
(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), …, (20, 20), where each pair is (x, y)
Now, there are 20 × 20 = 400 combinations in total in your grid, for which your grid search optimization will end up training 400 models. Add another hyperparameter, z, with 20 values of its own, and the number of combinations skyrockets to 8,000. The more hyperparameters you add, and the more candidate values you want to try for each, the greater the combinatorial explosion.
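The combinatorial explosion above can be computed directly: each added hyperparameter multiplies the number of grid evaluations.

```python
# Counting grid evaluations: every extra hyperparameter multiplies the
# size of the Cartesian product.
import itertools

x_values = range(1, 21)  # 20 candidate values for x
y_values = range(1, 21)  # 20 candidate values for y
z_values = range(1, 21)  # 20 candidate values for z

xy = len(list(itertools.product(x_values, y_values)))
xyz = len(list(itertools.product(x_values, y_values, z_values)))
print(xy, xyz)  # 400 models for (x, y); 8000 once z is added
```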
Given the time and resource constraints of ML, an exhaustive search is counterproductive to finding the best model. The real world has limitations, and a random selection of hyperparameter values has often been shown to provide better results than an exhaustive grid search.
This brings us to our next approach in hyperparameter optimization, random grid search.
Random grid search replaces the previous exhaustive grid search by choosing random values from the hyperparameter search space, rather than sequentially exhausting all of them.
For example, refer to the following diagram, which shows you an example of random grid search optimization:
Figure 4.13 – Random grid search hyperparameter tuning
The preceding diagram is a hyperparameter space of 100 combinations of two hyperparameters, X and Y. Random grid search optimization will only choose a few at random and perform evaluations using those hyperparameter values.
The drawback of random grid search optimization is that it is a best-effort approach: with a limited number of evaluations, it may or may not find the best combination of hyperparameter values to train the optimal model. However, given a large enough sample of the search space, it can find a near-optimal combination that trains a model of good-enough quality.
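The random sampling idea can be sketched on the same toy problem used earlier: instead of evaluating all combinations, sample a fixed budget of them without repeats. The loss function and grid values are made up for illustration.

```python
# Toy random grid search: evaluate only a random sample of the grid
# instead of exhausting all of it.
import itertools
import random

def validation_error(ntrees, max_depth):
    # Synthetic loss standing in for a real validation error.
    return abs(ntrees - 150) / 100 + abs(max_depth - 10) / 10

grid = list(itertools.product(range(10, 201, 10), range(1, 21)))  # 400 combos
random.seed(42)
budget = 30                       # evaluate only 30 of the 400
sampled = random.sample(grid, budget)  # random, without repeats

best = min(sampled, key=lambda c: validation_error(*c))
print(len(grid), budget)  # full grid size vs. evaluations actually run
```

The best combination found is only guaranteed to be the best among the sampled 30, which is precisely the best-effort trade-off described above.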
H2O’s library functions support random grid search optimization. They let users define their own hyperparameter search grid and set a search criteria parameter to control the type and extent of the search. The search criteria can include, for example, a maximum runtime, a maximum number of models to train, or a metric-based stopping rule. H2O will choose different hyperparameter combinations from the grid at random, without repeats, and will keep searching and evaluating until the search criteria are met.
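A search-criteria configuration of this kind might look as follows. The dict keys shown (strategy, max_models, max_runtime_secs, seed) follow the options H2O’s Python grid search accepts; the commented call assumes a running cluster and hypothetical frame and column names, so only the configuration itself is built here.

```python
# Sketch: configuring a random grid search in the style of H2O's
# Python client. Only the configuration dicts are constructed.
hyper_params = {
    "ntrees": [50, 100, 150, 200],
    "max_depth": [5, 10, 15, 20, 50],
}
search_criteria = {
    "strategy": "RandomDiscrete",   # pick combinations at random
    "max_models": 10,               # stop after 10 models...
    "max_runtime_secs": 600,        # ...or after 10 minutes
    "seed": 42,                     # for reproducible sampling
}

# With a running cluster, these would be passed together:
#   grid = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params,
#                        search_criteria=search_criteria)
#   grid.train(x=features, y=target, training_frame=train)  # hypothetical names

print(search_criteria["strategy"])
```

Whichever stopping rule triggers first ends the search, which is how the user bounds the cost of the optimization.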
H2O AutoML works slightly differently from plain random grid search optimization. Instead of waiting for the user to supply a hyperparameter search grid, H2O automates this part as well: it ships with predefined grids of hyperparameters and candidate values for specific algorithms, which it uses as defaults. H2O AutoML also has provisions for the user to include non-default values in the hyperparameter search list. We shall explore these predetermined values in the next chapter, along with understanding how the different algorithms work.
In this chapter, we have come to understand the high-level architecture of H2O and what the different layers that comprise the overall architecture are. We then dived deep into the client and JVM layer of the architecture, where we understood the different components that make up the H2O software stack. Next, keeping the architecture of H2O in mind, we came to understand the flow of interactions that take place between the client and server, where we understood how exactly we command the H2O server to perform various ML activities. We also came to understand how the interactions flow down the architecture stack during model training.
Building on this knowledge, we have investigated the sequence of interactions that take place inside the H2O server during model training. We also looked into how H2O trains models using the job manager to coordinate training jobs and how H2O communicates the status of model training with the user. And, finally, we unboxed H2O AutoML and came to understand how it trains the best model automatically. We have understood the concept of hyperparameter optimization and its various approaches and how H2O automates these approaches and mitigates their drawbacks to automatically train the best model.
Now that we know the internal details of H2O AutoML and how it trains models, we are now ready to understand the various ML algorithms that H2O AutoML trains and how they manage to make predictions. In the next chapter, we shall explore these algorithms and have a better understanding of models, which will help us to justify which model would work best for a given ML problem.