A Machine Learning (ML) model is the output we get once data is fitted to an ML algorithm. It represents the underlying relationship between the various features and how that relationship affects the target variable, and it depends entirely on the contents of the dataset. What makes every ML model unique, even when the same ML algorithm is used, is the dataset used to train it. Data can be collected from various sources and can have different schemas and structures; these need not be structurally compatible with one another, yet the data in them may still be related. That relationship can be very valuable and can be the differentiator between a good model and a bad one. Thus, it is important to transform the data to meet the requirements of the ML algorithm so that we eventually train a good model.
Data processing, data preparation, and data preprocessing are all steps in the ML pipeline that focus on best exposing the underlying relationship between the features by transforming the structure of the data. Data processing may be the most challenging step in the ML pipeline, as there are no set steps to the transformation process. Data processing depends entirely on the problem you wish to solve; however, there are some similarities among all datasets that can help us define certain processes that we can perform to optimize our ML pipeline.
In this chapter, we will learn about some of the common functionalities that are often used in data processing and the in-built H2O operations that help us perform them easily. We will look at H2O operations that reframe the structure of a dataframe, how to handle missing values, and the importance of imputing values. We will then investigate how to manipulate the various feature columns in a dataframe, as well as how to slice the dataframe for different needs. Finally, we will look at what encoding is and the different types of encoding.
In this chapter, we are going to cover the following main topics:
All code examples in this chapter are run on Jupyter Notebook for an easy understanding of what each line in the code block does. You can run the whole block of code via a Python or R script executor and observe the output results, or you can follow along by installing Jupyter Notebook and observing the execution results of every line in the code blocks.
To install Jupyter Notebook, make sure you have the latest version of Python and pip installed on your system and execute the following command:
pip install jupyterlab
Once JupyterLab has been installed successfully, you can start Jupyter Notebook locally by executing the following command in your terminal:
jupyter notebook
This will open the Jupyter Notebook page in your default browser. You can then select the language you want to use and start executing the code blocks line by line.
All code examples for this chapter can be found on GitHub at https://github.com/PacktPublishing/Practical-Automated-Machine-Learning-on-H2O/tree/main/Chapter%203.
Now, let’s begin processing our data by first creating a dataframe and reframing it so that it meets our model training requirement.
Data collected from various sources is often termed raw data. It is called raw in the sense that there might be a lot of unnecessary or stale data, which might not necessarily benefit our model training. The structure of the data collected also might not be consistent among all the sources. Hence, it becomes very important to first reframe the data from various sources into a consistent format.
You may have noticed that once we import the dataset into H2O, H2O converts the dataset into a .hex file, also called a dataframe. You have the option to import multiple datasets as well. Assuming you are importing multiple datasets from various sources, each with its own format and structure, then you will need a certain functionality that helps you reframe the contents of the dataset and merge them to form a single dataframe that you can feed to your ML pipeline.
H2O provides several functionalities that you can use to perform the required manipulations.
Here are some of the dataframe manipulation functionalities that help you reframe your dataframe:
Let’s see how we can combine columns from different dataframes in H2O.
One of the most common dataframe manipulation functionalities is combining different columns from different dataframes. Sometimes, the columns of one dataframe may be related to those of another. This could prove beneficial during model training. Thus, it is quite useful to have a functionality that can help us manipulate these columns and combine them together to form a single dataframe for model training.
H2O has a function called cbind() that combines the columns from one dataset into another.
Let’s try this function out in our Jupyter Notebook using Python. Execute the following steps in sequence:
import h2o
import numpy as np
h2o.init()
np.random.seed(123)
important_dataframe_1 = h2o.H2OFrame.from_python(np.random.randn(15, 5).tolist(), column_names=["important_column_1", "important_column_2", "important_column_3", "important_column_4", "important_column_5"])
important_dataframe_1.describe()
The following screenshot shows you the contents of the dataset:
Figure 3.1 – important_dataframe_1 data content
important_dataframe_2 = h2o.H2OFrame.from_python(np.random.randn(15, 2).tolist(), column_names=["important_column_6", "important_column_7"])
Figure 3.2 – important_dataframe_2 data content
final_dataframe = important_dataframe_1.cbind(important_dataframe_2)
final_dataframe.describe()
You should see the contents of final_dataframe as follows:
Figure 3.3 – final_dataframe data content after cbind()
Here, you will notice that we have successfully combined the columns from important_dataframe_2 with the columns of important_dataframe_1.
This is how you can use the cbind() function to combine the columns of two different datasets into a single dataframe. The only thing to bear in mind while using cbind() is that both datasets must have the same number of rows. Also, if the two dataframes share a column name, then H2O will append a 0 to the name of the column coming from the second dataframe so that the column names remain unique.
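To build intuition for what column-binding does, here is a minimal sketch in plain Python (a conceptual illustration, not H2O's implementation) that treats a dataframe as a dict mapping column names to lists of values. It enforces the equal-row-count rule and renames a colliding column by appending a 0:

```python
def cbind(left, right):
    """Column-bind two tables stored as dicts of column name -> list of values.

    Mirrors the two rules described above: both tables must have the same
    number of rows, and a colliding column name coming from the right-hand
    table is renamed by appending a 0. A conceptual sketch, not H2O's code.
    """
    n_left = len(next(iter(left.values())))
    n_right = len(next(iter(right.values())))
    if n_left != n_right:
        raise ValueError("both tables must have the same number of rows")
    combined = dict(left)
    for name, values in right.items():
        while name in combined:  # resolve a column-name collision
            name = name + "0"
        combined[name] = values
    return combined

table_a = {"c1": [1, 2, 3], "c2": [4, 5, 6]}
table_b = {"c2": [7, 8, 9], "c3": [10, 11, 12]}
result = cbind(table_a, table_b)
print(sorted(result.keys()))  # ['c1', 'c2', 'c20', 'c3']
```

Here, the colliding `c2` column from the second table survives under the renamed key `c20`, while the original `c2` values are untouched.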
Now that we know how to combine the columns of different dataframes, let’s see how we can combine the column values of multiple dataframes with the same column structure.
The majority of big corporations often handle tremendous amounts of data. This data is often partitioned into multiple chunks to make storing and reading it faster and more efficient. However, during model training, we will often need to access all these partitioned datasets. These datasets have the same structure but the data contents are distributed. In other words, the dataframes have the same columns; however, the data values or rows are split among them. We will often need a function that combines all these dataframes together so that we have all the data values available for model training.
H2O has a function called rbind() that combines the rows from one dataset into another.
Let’s try this function out in the following example:
import h2o
import numpy as np
h2o.init()
np.random.seed(123)
important_dataframe_1 = h2o.H2OFrame.from_python(np.random.randn(15, 5).tolist(), column_names=["important_column_1", "important_column_2", "important_column_3", "important_column_4", "important_column_5"])
important_dataframe_1.nrows
important_dataframe_2 = h2o.H2OFrame.from_python(np.random.randn(10, 5).tolist(), column_names=["important_column_1", "important_column_2", "important_column_3", "important_column_4", "important_column_5"])
important_dataframe_2.nrows
final_dataframe = important_dataframe_1.rbind(important_dataframe_2)
final_dataframe.describe()
You should see the contents of final_dataframe as follows:
Figure 3.4 – final_dataframe data contents after rbind()
final_dataframe.nrows
The output of the last operation should show you the value of the number of rows in the final dataset. You will see that the value is 25 and the contents of the dataframe are the combined row values of both the previous datasets.
Now that we have understood how to combine the rows of two dataframes in H2O using the rbind() function, let’s see how we can fully combine two datasets.
You can directly merge two dataframes, combining their rows and columns into a single dataframe. H2O provides a merge() function that combines two datasets that share one or more common columns. During merging, the columns that the two datasets have in common are used as the merge key. If they have only one column in common, then that column forms the single primary key for the merge. If there are multiple common columns, then H2O forms a composite key out of all of them, based on their data values, and uses that as the merge key. If there are multiple common columns but you only wish to merge on a specific subset of them, then you will need to rename the other common columns so that they are no longer treated as part of the merge key.
Let’s try this function out in the following example in Python:
import h2o
import numpy as np
h2o.init()
dataframe_1 = h2o.H2OFrame.from_python({'words':['Hello', 'World', 'Welcome', 'To', 'Machine', 'Learning'], 'numerical_representation': [0,1,2,3,4,5],'letters':['a','b','c','d']})
dataframe_1.describe()
Figure 3.5 – dataframe_1 data content
dataframe_2 = h2o.H2OFrame.from_python({'other_words':['How', 'Are', 'You', 'Doing', 'Today', 'My', 'Friend', 'Learning', 'H2O', 'Artificial', 'Intelligence'], 'numerical_representation': [0,1,2,3,4,5,6,7,8,9],'letters':['a','b','c','d','e']})
dataframe_2.head(11)
On executing the code, you should see the following output in your notebook:
Figure 3.6 – dataframe_2 data contents
final_dataframe = dataframe_2.merge(dataframe_1)
final_dataframe.describe()
Figure 3.7 – final_dataframe contents after merge()
You will notice that H2O used the combination of the numerical_representation column and the letters column as the merge key. This is why we have values ranging from 0 to 5 in the numerical_representation column, with the appropriate values in the other columns.
Now, you may be wondering why there is no row for 4. That is because while merging, we have two common columns: numerical_representation and letters. So, H2O used a complex merging key that uses both these columns: (0, a), (1, b), (2, c), and so on.
Now, the next question you might have is What about the row with the value 5? It has no value in the letters column. That is because even an empty value is treated as a distinct value in ML. Thus, during merging, the generated composite key treated (5, ) as a valid merge key, since both dataframes contain that exact combination.
H2O drops all the remaining values since dataframe_1 does not have any more numerical representation values.
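The behaviour above can be reproduced in a few lines of plain Python. The sketch below (an illustration of the idea, not H2O's algorithm) mimics the two example dataframes, builds the composite key (numerical_representation, letters) with None standing in for a missing value, and keeps only the rows of the second dataframe whose key also appears in the first, which is why the key (4, 'e') is dropped while (5, None) survives:

```python
# Rows mimic the two example dataframes above, as
# (numerical_representation, letters, word) tuples; None marks a missing value.
left_rows = [(0, "a", "Hello"), (1, "b", "World"), (2, "c", "Welcome"),
             (3, "d", "To"), (4, None, "Machine"), (5, None, "Learning")]
right_rows = [(0, "a", "How"), (1, "b", "Are"), (2, "c", "You"),
              (3, "d", "Doing"), (4, "e", "Today"), (5, None, "My"),
              (6, None, "Friend"), (7, None, "Learning"), (8, None, "H2O"),
              (9, None, "Artificial")]

# The composite merge key is (numerical_representation, letters); a missing
# value is still a legitimate component of the key.
left_keys = {(num, letter): word for num, letter, word in left_rows}

# Keep only the right-hand rows whose composite key also exists on the left.
merged = [(num, letter, word, left_keys[(num, letter)])
          for num, letter, word in right_rows
          if (num, letter) in left_keys]

print([row[0] for row in merged])  # [0, 1, 2, 3, 5] -- (4, 'e') has no match
```

Row 4 disappears because its key on the right is (4, 'e') while the left dataframe only contains (4, None), whereas (5, None) appears on both sides and therefore merges.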
final_dataframe = dataframe_2.merge(dataframe_1, all_x = True)
Figure 3.8 – final_dataframe data content after enforcing merge()
You will notice that we now have all the values from both dataframes merged into a single dataframe. We have all the numerical representations from 0 to 9 and all letters from a to e from dataframe_2 that were missing in the previous step, along with the correct values from the other_words column and the words column.
To recap, we learned how to combine dataframe columns and rows. We also learned how to combine entire dataframes together using the merge() function. However, we noticed that if we enforced the merging of dataframes despite them not having common data values in their key columns, we ended up with missing values in the dataframe.
Now, let’s look at the different methods we can use to handle missing values using H2O.
Missing values in datasets are one of the most common issues in the real world. It is expected that huge datasets collected from various sources will have at least a few instances of missing data. Data can be missing for several reasons, ranging from data not being generated at the source all the way to downtime in the data collectors. Handling missing data is very important for model training, as many ML algorithms don't support missing data, and those that do may end up giving more importance to finding patterns in the missing data rather than in the data that is actually present, which distracts the machine from learning.
Missing data is often referred to as Not Available (NA) or nan. Before we can send a dataframe for model training, we need to handle these values first. You can either drop every row that contains a missing value, or you can fill the missing values with a value that is either a default or common for that data column. How you handle missing values depends entirely on which data is missing and how important it is for the overall model training.
H2O provides some functionalities that you can use to handle missing values in a dataframe. These are some of them:
Next, let’s see how we can fill missing values in a dataframe using H2O.
fillna() is a function in H2O that you can use to fill missing data values in a sequential manner. This is especially handy if you have certain data values in a column that are sequential in nature, for example, time series or any metric that increases or decreases sequentially and can be sorted. The smaller the difference between the values in the sequence, the more applicable this function becomes.
The fillna() function has the following parameters:
Let’s see an example in Python of how we can use this function to fill missing values:
import h2o
import numpy as np
h2o.init()
dataframe = h2o.create_frame(rows=1000, cols=3, integer_fraction=1.0, integer_range=100, missing_fraction=0.2, seed=123)
dataframe.describe()
You should see the contents of the dataframe as follows:
Figure 3.9 – Dataframe contents
filled_dataframe = dataframe.fillna(method="forward", axis=0, maxlen=1)
filled_dataframe.describe()
Figure 3.10 – filled_dataframe contents
The fillna() function has filled most of the NA values in the dataframe sequentially.
However, you will notice that we still have some nan values in the first row of the dataframe. This is because we filled the missing values in the forward direction along each column: H2O records the last observed value in a column and copies it into the subsequent row if that row's value is NA. Since the very first row has no previous value on record to copy from, H2O skips over it.
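This forward-fill behaviour is easy to sketch in plain Python. The function below (a conceptual illustration, not H2O's implementation) fills a single column, using None for a missing value, and honours a maxlen limit on how many consecutive missing values may be filled:

```python
def forward_fill(column, maxlen=1):
    """Forward-fill missing values (None) in a single column.

    Mirrors fillna(method="forward", axis=0, maxlen=1): each missing value
    is copied from the nearest earlier non-missing value, but at most
    `maxlen` consecutive missing values are filled. A missing value at the
    very start has no earlier value to copy, so it stays missing.
    A conceptual sketch, not H2O's implementation.
    """
    filled = []
    last_value = None   # nothing observed yet
    run = 0             # length of the current run of missing values
    for value in column:
        if value is None:
            run += 1
            filled.append(last_value if run <= maxlen else None)
        else:
            last_value = value
            run = 0
            filled.append(value)
    return filled

print(forward_fill([None, 4, None, None, 7, None]))
# [None, 4, 4, None, 7, 7] -- the leading None cannot be filled, and only
# the first of the two consecutive gaps is filled because maxlen=1
```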
Now that we understand how we can sequentially fill data in a dataframe using the fillna() function in H2O, let’s see how we can replace certain values in the dataframe.
Another common functionality often needed for data processing is replacing certain values in the dataframe. There can be plenty of reasons why you might want to do this. This is especially common for numerical data where some of the most common transformations include rounding off values, normalizing numerical ranges, or just correcting a data value. In this section, we will explore some of the functions that we can use in H2O to replace values in the dataframe.
Let’s first create a dataframe that we can use to test out such functions. Execute the following code so that we have a dataframe ready for manipulation:
import h2o
h2o.init()
dataframe = h2o.create_frame(rows=10, cols=3, real_range=100, integer_fraction=1, missing_fraction=0.1, seed=5)
dataframe.describe()
The dataframe should look as follows:
Figure 3.11 – Dataframe data contents
So, we have a dataframe with three columns: C1, C2, and C3. Each column has a few negative numbers and some nan values. Let’s see how we can play around with this dataframe.
Let’s start with something simple. Let’s update the value of a single data value, also called a datum, in the dataframe. Let’s update the fourth row of the C2 column to 99. You can update the value of a single data value based on its position in the dataframe as follows:
dataframe[3,1] = 99
Note that the row and column indices in the dataframe start at 0. Hence, we set the value at row index 3 and column index 1 to 99. You should see the result by executing dataframe.describe() as follows:
dataframe.describe()
The dataframe should look as follows:
Figure 3.12 – Dataframe contents after the datum update
As you can see in the dataframe, we replaced the nan value that was previously in the fourth row of the C2 column with 99.
This is a manipulation of just one data value. Let’s see how we can replace the values of an entire column. Let’s increase the data values in the C3 column to three times their original value. You can do so by executing the following code:
dataframe[2] = 3*dataframe[2]
You should see the results in the dataframe by executing dataframe.describe as follows:
dataframe.describe()
The dataframe should look as follows:
Figure 3.13 – Dataframe contents after column value updates
We can see in the output that the values in the C3 column have now been increased to three times the original values in the column.
All these replacements we performed till now are straightforward. Let’s try some conditional updates on the dataframe. Let’s round off all the negative numbers in the dataframe to 0. So, the condition is that we only update the negative numbers to 0 and don’t change any of the positive numbers. You can do conditional updates as follows:
dataframe[dataframe['C1'] < 0, "C1"] = 0
dataframe[dataframe['C2'] < 0, "C2"] = 0
dataframe[dataframe['C3'] < 0, "C3"] = 0
You should see the results in the dataframe by executing dataframe.describe as follows:
dataframe.describe()
The dataframe should look as follows:
Figure 3.14 – Dataframe contents after conditional updates
As you can see in the dataframe, all the negative values have been replaced with 0.
Now, what if, instead of rounding the negative numbers up to 0, we wished to simply flip their sign? We could do so by combining conditional updates with arithmetic updates. Refer to the following example:
dataframe["C1"] = (dataframe["C1"] < 0).ifelse(-1*dataframe["C1"], dataframe["C1"])
dataframe["C2"] = (dataframe["C2"] < 0).ifelse(-1*dataframe["C2"], dataframe["C2"])
dataframe["C3"] = (dataframe["C3"] < 0).ifelse(-1*dataframe["C3"], dataframe["C3"])
Now, let’s try to see whether we can replace the remaining nan values with something valid. We already read about the fillna() function, but what if the nan values are nothing but some missing values that don’t exactly fall into any incremental or decremental pattern, and we just want to set it to 0? Let’s do that now. Run the following code:
dataframe[dataframe["C1"].isna(), "C1"] = 0
dataframe[dataframe["C2"].isna(), "C2"] = 0
dataframe[dataframe["C3"].isna(), "C3"] = 0
You should see the results in the dataframe by executing dataframe.describe as follows:
dataframe.describe()
The dataframe should look as follows:
Figure 3.15 – Dataframe contents after replacing nan values with 0
The isna() function checks whether a value is nan and returns either True or False. We use this condition to select the nan values and replace them in the dataframe.
Tip
There are plenty of ways to manipulate and replace the values in a dataframe and H2O provides plenty of functionality to make implementation easy. Feel free to explore and experiment more with manipulating the values in the dataframe. You can find more details here: https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html.
Now that we have learned various methods to replace values in the dataframe, let’s look into a more advanced approach to doing so that data scientists and engineers often take.
Previously, we saw how we can replace nan values in a dataset using fillna(), which fills the nan data in the dataframe sequentially. However, data need not always be sequential in nature. For example, consider a dataset of people buying gaming laptops. The dataset will mostly contain data about people in the age demographic of 13 to 28, with a few outliers. In such a scenario, if there are any nan values in the age column of the dataframe, then we cannot use the fillna() function to fill them, as any nan value that follows an outlier would be filled with that outlier's value, introducing bias into the dataframe. We need to replace the nan value with a value that is common within the standard distribution of the age group for that product, something between 13 and 28, rather than, say, 59, which is far less likely.
Imputation is the process of replacing certain values in the dataframe with an appropriate substitute that does not introduce any bias or outliers that may impact model training. The method or formulas used to calculate the substitute value are termed the imputation strategy. Imputation is one of the most important methods of data processing, which handles missing and nan values and tries to replace them with a value that will potentially introduce the least bias into the model training process.
H2O has a function called impute() that specifically provides this functionality. It has the following parameters:
Let’s see an example in Python of how we can use this function to fill missing values.
For this, we shall use the high school student sprint dataset, which consists of recordings of high school students' ages, weights, maximum recorded speeds, and their times in a 100-meter sprint. The dataset is used to predict how age, weight, and sprint speed affect a student's performance in a 100-meter sprint race.
The dataset looks as follows:
Figure 3.16 – A high school student sprint dataset
The features of the dataset are as follows:
As you can see, there are plenty of missing values in the 100_meter_time column.
We cannot simply use the fillna() function, as that will introduce bias into the data if the missing values happen to be right after the fastest or slowest time. We can’t simply replace the values with a constant number either.
What would actually make sense is to replace these missing values with whatever is normal for an average teenager doing a 100-meter dash. We already have values for the majority of students, so we can use their results to calculate a general average 100-meter sprint time and use that as a baseline to replace all the missing values without introducing any bias.
This is exactly what imputation is used for. Let’s use the imputation function to fill in these missing values:
import h2o
h2o.init()
dataframe = h2o.import_file("Dataset/high_school_student_sprint.csv")
dataframe.impute("100_meter_time", method="mean")
dataframe.describe()
You will see the output of the imputed dataframe as follows:
Figure 3.17 – 100_meter_time column imputed by its mean
Similarly, instead of mean, you can use the median as well. However, note that if a column has categorical values, then the method must be mode. The decision is yours to make, depending on which substitute value is most useful for the dataset when replacing the missing values:
dataframe.impute("100_meter_time", method="median")
dataframe.impute("100_meter_time", method="mode")
dataframe = h2o.import_file("Dataset/high_school_student_sprint.csv")
dataframe.impute("100_meter_time", method="mean", by=["age"])
dataframe.describe()
You will see the output as follows:
Figure 3.18 – 100_meter_sprint imputed by its mean and grouped by age
You will notice that H2O has now calculated the mean values per age and replaced the missing values in the 100_meter_time column with the mean for the respective age. Observe the first row in the dataset. It belongs to a student aged 13 and had a missing value in its 100_meter_time column. It was replaced with the mean of all the 100_meter_time values of the other 13-year-olds. The same was done for the other age groups. This is how you can use the by parameter in the impute() function to flexibly impute the correct values.
The impute() function is extremely powerful for imputing the correct values in a dataframe. The additional parameters for grouping by columns, as well as by frames, make it very flexible for handling all sorts of missing values.
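The mean-by-group strategy is simple to sketch in plain Python. The function below (a conceptual illustration, not H2O's implementation, using a tiny made-up sample rather than the real dataset) computes the mean of the non-missing values per group and substitutes it for each missing value in that group:

```python
def impute_mean_by_group(rows, group_col, value_col):
    """Replace missing values (None) in value_col with the mean of the
    non-missing values observed for the same group_col value.

    A plain-Python illustration of impute(..., method="mean", by=[...]);
    not H2O's implementation. Each row is a dict.
    """
    sums, counts = {}, {}
    for row in rows:
        if row[value_col] is not None:
            key = row[group_col]
            sums[key] = sums.get(key, 0.0) + row[value_col]
            counts[key] = counts.get(key, 0) + 1
    means = {key: sums[key] / counts[key] for key in sums}
    for row in rows:
        if row[value_col] is None:
            row[value_col] = means[row[group_col]]
    return rows

# A tiny made-up sample: two age groups, each with one missing sprint time.
students = [
    {"age": 13, "100_meter_time": None},
    {"age": 13, "100_meter_time": 16.0},
    {"age": 13, "100_meter_time": 18.0},
    {"age": 14, "100_meter_time": 15.0},
    {"age": 14, "100_meter_time": None},
]
impute_mean_by_group(students, "age", "100_meter_time")
print(students[0]["100_meter_time"], students[4]["100_meter_time"])  # 17.0 15.0
```

Each missing time is filled from its own age group's mean, so a 13-year-old's gap never picks up a 14-year-old's time.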
Feel free to use and explore all these functions on different datasets. At the end of the day, all these functions are just tools used by data scientists and engineers to improve the quality of the data; the real skill is understanding when and how to use these tools to get the most out of your data, and that requires experimentation and practice.
Now that we have learned about the different ways in which we can handle missing data, let’s move on to the next part of data processing, which is how to manipulate the feature columns of the dataframe.
Most of the time, your data processing activities will involve manipulating the columns of your dataframes. Most importantly, the type of the values in a column and the ordering of those values play a major role in model training.
H2O provides some functionalities that help you do so. The following are some of the functionalities that help you manipulate the columns of your dataframe:
Let’s first understand how we can sort a column using H2O.
Ideally, you want the data in a dataframe to be shuffled before passing it off to model training. However, there may be certain scenarios where you might want to re-order the dataframe based on the values in a column.
H2O has a functionality called sort() to sort dataframes based on the values in a column. It has the following parameters:
The way H2O will sort the dataframe depends on whether one column name is passed to the sort() function or multiple column names. If only a single column name is passed, then H2O will return a frame that is sorted by that column.
However, if multiple columns are passed, then H2O will return a dataframe that is sorted as follows:
Let’s see an example in Python of how we can use this function to sort columns:
import h2o
h2o.init()
dataframe = h2o.H2OFrame.from_python({'C1': [3,3,3,0,12,13,1,8,8,14,15,2,3,8,8],'C2':[1,5,3,6,8,6,8,7,6,5,1,2,3,6,6],'C3':[15,14,13,12,11,10,9,8,7,6,5,4,3,2,1]})
dataframe.describe()
The contents of the dataset should be as follows:
Figure 3.19 – dataframe_1 data contents
sorted_dataframe_1 = dataframe.sort(0)
sorted_dataframe_1.describe()
You should get an output of the code as follows:
Figure 3.20 – dataframe_1 sorted by the C1 column
You will see that the dataframe is now sorted in ascending order by the C1 column.
sorted_dataframe_2 = dataframe.sort(['C1','C2'])
sorted_dataframe_2.describe()
You should get an output as follows:
Figure 3.21 – dataframe_1 sorted by columns C1 and C2
As you can see, H2O first sorted the rows by the C1 column. Then, for the rows that had the same value in C1, it sorted them by the C2 column. H2O will sequentially sort the dataframe in this way for all the columns you pass to the sort() function.
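The same multi-column ordering can be sketched in plain Python (an illustration only, not H2O's implementation). With rows as (C1, C2) tuples, sorting by a key tuple gives the primary/secondary ordering, and negating the second component flips only its direction, mirroring ascending=[True, False]:

```python
# Rows as (C1, C2) tuples, loosely mimicking the dataframe above.
rows = [(3, 1), (3, 5), (0, 6), (3, 3), (1, 8), (0, 2)]

# Sort by C1 ascending, then C2 descending for ties in C1. Negating the
# second key component flips only its direction, which mirrors
# sort(by=['C1', 'C2'], ascending=[True, False]).
sorted_rows = sorted(rows, key=lambda row: (row[0], -row[1]))
print(sorted_rows)  # [(0, 6), (0, 2), (1, 8), (3, 5), (3, 3), (3, 1)]
```

Ties in the first key (the three rows with C1 = 3, and the two with C1 = 0) are broken by the second key in descending order.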
sorted_dataframe_3 = dataframe.sort(by=['C1','C2'], ascending=[True,False])
sorted_dataframe_3.describe()
You should see an output as follows:
Figure 3.22 – dataframe_1 sorted by the C1 column in ascending order and the C2 column in descending order
In this case, H2O first sorted the rows by the C1 column in ascending order. Then, for the rows that had the same value in C1, it sorted them by the C2 column; however, this time, it sorted those values in descending order.
Now that you’ve learned how to sort the dataframe by a single column as well as by multiple columns, let’s move on to another column manipulation function that changes the type of the column.
As we saw in Chapter 2, Working with H2O Flow (H2O's Web UI), we changed the type of the Heart Disease column from numerical to enum. The reason we did this is that the type of a column plays a major role in model training: it decides whether the ML problem is a classification problem or a regression problem. Even though the data in both cases is numerical in nature, how an ML algorithm treats the column depends entirely on its type. Thus, it becomes very important to correct the types of columns, which might not be set correctly during the initial stages of data collection.
H2O has several functions that not only help you change the type of the columns but also run initial checks on the column types.
Some of the functions are as follows:
Let’s see an example in Python of how we can use these functions to change the column types:
import h2o
h2o.init()
dataframe = h2o.H2OFrame.from_python({'C1': [3,3,3,0,12,13,1,8,8,14,15,2,3,8,8],'C2':[1,5,3,6,8,6,8,7,6,5,1,2,3,6,6],'C3':[15,14,13,12,11,10,9,8,7,6,5,4,3,2,1]})
dataframe.describe()
The contents of the dataset should be as follows:
Figure 3.23 – Dataframe data contents
dataframe['C1'].isnumeric()
You should get an output of True.
dataframe['C1'].isfactor()
You should get an output of False.
dataframe['C1'] = dataframe['C1'].asfactor()
dataframe['C1'].isfactor()
You should now get an output of True.
dataframe['C1'] = dataframe['C1'].asnumeric()
dataframe['C1'].isnumeric()
You should now get an output of True.
Now that you have learned how to sort the columns of a dataframe and change column types, let’s move on to another important topic in data processing, which is tokenization and encoding.
Not all Machine Learning Algorithms (MLAs) are focused on mathematical problem-solving. Natural Language Processing (NLP) is a branch of ML that specializes in extracting meaning from textual data; it tries to derive the meaning and understand the contents of a document, or any text for that matter. Training an NLP model can be very tricky, as every language has its own grammatical rules, and the interpretation of certain words depends heavily on context. Nevertheless, an NLP algorithm tries its best to train a model that can predict the meaning and sentiment of a textual document.
The way to train an NLP algorithm is to first break down the chunk of textual data into smaller units called tokens. Tokens can be words, characters, or even letters. It depends on what the requirements of the MLA are and how it uses these tokens to train a model.
H2O has a function called tokenize() that helps break down string data in a dataframe into tokens and creates a separate column containing all the tokens for further processing.
It has the following parameter:
split: We pass a regular expression in this parameter, which the function uses to split the text data into tokens.
Let’s see an example of how we can use this function to tokenize string data in a dataframe:
import h2o
h2o.init()
dataframe1 = h2o.H2OFrame.from_python({'C1':['Today we learn AI', 'Tomorrow AI learns us', 'Today and Tomorrow are same', 'Us and AI are same']})
dataframe1 = dataframe1.ascharacter()
dataframe1.describe()
The dataset should look as follows:
Figure 3.24 – Dataframe data contents
This type of textual data is usually collected in systems that generate a lot of log text or conversational data. To solve such NLP tasks, we need to break down the sentences into individual tokens so that we can eventually build the context and meaning of these texts that will help the ML algorithm to make semantic predictions. However, before diving into the complexities of NLP, data scientists and engineers will process this data by tokenizing it first.
tokenized_dataframe = dataframe1.tokenize(" ")
tokenized_dataframe
You should see the dataframe as follows:
Figure 3.25 – Tokenized dataframe data contents
You will notice that the tokenize() function splits the text data into tokens and appends the tokens as rows into a single column. You will also notice that all tokenized sentences are separated by empty rows. You can cross-check this by comparing the number of words in all the sentences in the dataframe, plus the empty spaces between the sentences against the number of rows in the tokenized dataset, using nrows.
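To make the mechanics concrete, here is a plain-Python sketch of this behaviour (an illustration, not H2O's implementation): each sentence is split on a regular expression, all tokens are stacked into one list standing in for the single output column, and an empty row is inserted between consecutive sentences:

```python
import re

def tokenize(sentences, split_pattern=" "):
    """Split each sentence on a regular expression and stack all tokens
    into one list, which stands in for the single output column; an empty
    row separates consecutive sentences. A plain-Python sketch of the
    behaviour described above, not H2O's implementation.
    """
    tokens = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            tokens.append("")  # empty row between tokenized sentences
        tokens.extend(re.split(split_pattern, sentence))
    return tokens

column = tokenize(["Today we learn AI", "Tomorrow AI learns us"])
print(column)
# ['Today', 'we', 'learn', 'AI', '', 'Tomorrow', 'AI', 'learns', 'us']
```

The two four-word sentences plus the one separating empty row give nine rows in total, which is the cross-check described above using nrows.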
These are some of the most used data processing methods that are used to process your data before you feed it to your ML pipeline for training. There are still plenty of methods and techniques that you can use to further clean and polish your dataframes. So much so that you could dedicate an entire book to discussing them. Data processing happens to be the most difficult part of the entire ML life cycle. The quality of the data used for training depends on the context of the problem statement. It also depends on the creativity and ingenuity of the data scientists and engineers in processing that data. The end goal of data processing is to extract as much information as we can from the dataset and remove noise and bias from the data to allow for a more efficient analysis of data during training.
As we know, machines are only capable of understanding numbers. However, plenty of real-world ML problems revolve around objects and information that are not necessarily numerical in nature. Things such as states, names, and classes, in general, are represented as categories rather than numbers. This kind of data is called categorical data. Categorical data will often play a big part in analysis and prediction. Hence, there is a need to convert these categorical values to a numerical format so that machines can understand them. The conversion should also be done in such a way that we neither lose the inherent meaning of those categories nor introduce information that was not there before, such as an ordering implied by the incremental nature of numbers.
This is where encoding is used. Encoding is a process where categorical values are transformed, in other words, encoded, into numerical values. There are plenty of encoding methods that can perform this transformation. One of the most commonly used ones is target encoding.
Target encoding is an encoding process that transforms categorical values into numerical values by replacing each category with the mean of the target variable computed over the rows belonging to that category; for a binary target, this is the probability of the target being 1 for that category. H2O also provides methods that help users implement target encoding on their data.
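As a minimal plain-Python sketch of the idea (toy data, not H2O's implementation), the encoder maps each category to the mean of the target within that category:

```python
from collections import defaultdict

# Toy (category, binary target) pairs, loosely in the spirit of the
# mythical creatures example below.
rows = [("dragon", 1), ("dragon", 1), ("dragon", 0),
        ("unicorn", 0), ("unicorn", 1)]

sums = defaultdict(float)
counts = defaultdict(int)
for category, target in rows:
    sums[category] += target
    counts[category] += 1

# Each category is replaced by its mean target value, which for a
# binary target is the probability of the target being 1.
encoding = {c: sums[c] / counts[c] for c in counts}
print(encoding)  # dragon -> 2/3, unicorn -> 1/2
```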
To better understand this method, consider the following sample Mythical creatures dataset:
Figure 3.26 – Our mythical creatures dataset
Now, let’s encode the Animals categorical column using target encoding. Target encoding will perform the following steps: first, it counts how often the target value occurs for each category; next, it computes the probability of the target being 1 for each category; finally, it replaces each category with that probability.
Figure 3.27 – The mythical creatures dataset with a target count
Figure 3.28 – The mythical creatures dataset with a Probability of Target 1 Occurring column
Figure 3.29 – A target-encoded mythical creatures dataset
In the encoded dataset, the Animals feature has been target encoded, leaving us with a dataset that is entirely numerical in nature. Such a dataset is easy for an ML algorithm to interpret and learn from, which helps produce high-quality models.
Let us now see how we can perform target encoding using H2O. The dataset we will use for this example is the Automobile price prediction dataset. You can find the details of this dataset at https://archive.ics.uci.edu/ml/datasets/Automobile (Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).
The dataset is fairly straightforward. It contains various details about cars, such as the make of the car, engine size, fuel system, compression ratio, and price. The aim of the ML algorithm is to predict the price of a car based on these features.
For our experiment, we shall encode the categorical columns make, fuel type, and body style using target encoding where the price column is the target.
Let’s perform target encoding by following this example:
import h2o
from h2o.estimators import H2OTargetEncoderEstimator
h2o.init()
automobile_dataframe = h2o.import_file("DatasetAutomobile_data.csv")
automobile_dataframe
Let’s observe the contents of the dataframe; it should look as follows:
Figure 3.30 – An automobile price prediction dataframe
As you can see in the preceding figure, the dataframe consists of a large number of columns containing the details of cars. For the sake of understanding target encoding, let’s filter out the columns that we want to experiment with while dropping the rest. Since we plan on encoding the make column, the fuel-type column, and the body-style column, let’s use only those columns along with the price response column. Execute the following code:
automobile_dataframe = automobile_dataframe[:, ["make", "fuel-type", "body-style", "price"]]
automobile_dataframe
The filtered dataframe will look as follows:
Figure 3.31 – The automobile price prediction dataframe with filtered columns
automobile_dataframe_for_training, automobile_dataframe_for_test = automobile_dataframe.split_frame(ratios = [.8], seed = 123)
automobile_te = H2OTargetEncoderEstimator()
automobile_te.train(x=["make", "fuel-type", "body-style"], y="price", training_frame=automobile_dataframe_for_training)
Once the target encoder has finished its training, you will see the following output:
Figure 3.32 – The result of target encoder training
From the preceding screenshot, you can see that the H2O target encoder will generate the target-encoded values for the make column, the fuel-type column, and the body-style column and store them in different columns named make_te, fuel-type_te, and body-style_te, respectively. These new columns will contain the encoded values.
te_automobile_dataframe_for_training = automobile_te.transform(frame=automobile_dataframe_for_training, as_training=True)
te_automobile_dataframe_for_training
The encoded training frame should look as follows:
Figure 3.33 – An encoded automobile price prediction training dataframe
As you can see from the figure, our training frame now has three additional columns, make_te, fuel-type_te, and body-style_te, with numerical values. These are the target-encoded columns for the make column, the fuel-type column, and the body-style column.
te_automobile_dataframe_for_test = automobile_te.transform(frame=automobile_dataframe_for_test, noise=0)
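The noise argument seen here controls a small random perturbation that target encoders typically add to encoded training values to reduce target leakage; passing noise=0 for the test frame keeps the encodings exact. A rough sketch of the idea (not H2O's actual implementation, and the price value is hypothetical):

```python
import random

rng = random.Random(123)

def encode_with_noise(category_mean, noise_level):
    # Jitter the encoded value slightly during training to curb
    # overfitting; with noise_level = 0 (as used for the test frame)
    # the exact per-category mean is returned.
    return category_mean + rng.uniform(-noise_level, noise_level)

train_value = encode_with_noise(13450.0, noise_level=0.01)  # perturbed
test_value = encode_with_noise(13450.0, noise_level=0.0)    # exact
```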
te_automobile_dataframe_for_test
The encoded test frame should look as follows:
Figure 3.34 – An encoded automobile price prediction test dataframe
As you can see from the figure, our test frame also has three additional columns, which are the encoded columns. You can now use these dataframes to train your ML models.
Depending on your next actions, you can use the encoded dataframes however you see fit. If you want to use the dataframe to train ML models, then you can drop the categorical columns from the dataframe and use the respective encoded columns as training features to train your models. If you wish to perform any further analytics on the dataset, then you can keep both types of columns and perform any comparative study.
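The first option can be illustrated schematically in plain Python (a toy dict of columns standing in for the H2O frame; the values are hypothetical, the column names follow the example above):

```python
# Toy stand-in for the encoded frame: column name -> column values.
encoded_frame = {
    "make": ["audi", "bmw"],
    "fuel-type": ["gas", "gas"],
    "body-style": ["sedan", "sedan"],
    "make_te": [15498.3, 26118.7],       # hypothetical encoded values
    "fuel-type_te": [12999.8, 12999.8],
    "body-style_te": [14459.8, 14459.8],
    "price": [13950, 16430],
}

# Drop the original categorical columns, keeping the encoded
# columns as training features alongside the price target.
categorical_columns = ["make", "fuel-type", "body-style"]
training_columns = {name: values
                    for name, values in encoded_frame.items()
                    if name not in categorical_columns}
print(sorted(training_columns))
```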
Tip
H2O’s target encoder has several parameters that you can set to tweak the encoding process. Selecting the correct settings for target encoding your dataset can get very complex, depending on the type of data with which you are working. So, feel free to experiment with this function, as the better you understand this feature and target encoding in general, the better you can encode your dataframe and further improve your model training. You can find more details about H2O’s target encoder here: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/target-encoding.html.
Congratulations! You have just understood how you can encode categorical values using H2O’s target encoder.
In this chapter, we first explored the various techniques and some of the common functions we use to preprocess our dataframe before it is sent to model training. We looked into how we can reframe our raw dataframe into a suitable, consistent format that meets the requirements of model training. We learned how to manipulate the columns of dataframes by combining them with different columns of different dataframes. We learned how to combine rows from partitioned dataframes, as well as how to directly merge dataframes into a single dataframe.
Once we knew how to reframe our dataframes, we learned how to handle the missing values that are often present in freshly collected data. We learned how to fill NA values, replace certain incorrect values, as well as how to use different imputation strategies to avoid adding noise and bias when filling missing values.
We then investigated how we can manipulate the feature columns by sorting the dataframes by column, as well as changing the types of columns. We also learned how to tokenize strings to handle textual data, as well as how to encode categorical values using H2O’s target encoder.
In the next chapter, we will open the black box of AutoML and explore what happens internally during the AutoML training process. This will help us better understand how H2O does its magic and efficiently automates model training.