The template method pattern

The template method pattern is used to create a well-defined process that can use different kinds of algorithms or operations. As a template, it can be customized with whatever algorithm or functions the client requires.

Here, we will explore how the template method pattern can be utilized in a machine learning (ML) pipeline use case. For those who are unfamiliar with ML pipelines, here is a simplified version of what a data scientist might do:

A dataset is first split into two separate datasets for training and testing purposes. The training dataset is fed into a process that fits a statistical model to the data. Then, the validate function uses the model to predict the response (also called the target) variable in the test set. Finally, it compares the predicted values against the actual values and determines how accurate the model is.

Let's say we have the pipeline already set up as follows:

function run(data::DataFrame, response::Symbol, predictors::Vector{Symbol})
    train, test = split_data(data, 0.7)
    model = fit(train, response, predictors)
    validate(test, model, response)
end
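For reference, here is one possible sketch of those three helpers. This is a simplified stand-in under stated assumptions (a random row shuffle for splitting, ordinary linear regression via GLM.jl for fitting, and RMSE for validation), not necessarily the book's exact implementation:

```julia
using DataFrames, GLM, Random, Statistics

# Hypothetical sketch: split rows randomly into training and test sets
function split_data(df::DataFrame, keep::Float64)
    idx = shuffle(1:nrow(df))          # randomize row order
    k = round(Int, nrow(df) * keep)    # size of the training set
    return df[idx[1:k], :], df[idx[k+1:end], :]
end

# Hypothetical sketch: fit a linear regression model
function fit(df::DataFrame, response::Symbol, predictors::Vector{Symbol})
    # Build a formula such as MedV ~ Rm + Tax + Crim programmatically
    formula = Term(response) ~ +(Term.(predictors)...)
    return lm(formula, df)
end

# Hypothetical sketch: compute root mean squared error on the test set
function validate(df::DataFrame, model, response::Symbol)
    pred = predict(model, df)
    return sqrt(mean((pred .- df[!, response]) .^ 2))
end
```

Note how the formula is constructed programmatically from symbols rather than written literally with the @formula macro, so the same code works for any response and predictor names.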

For the sake of brevity, the specific functions, split_data, fit, and validate, are not shown here; you can look them up on this book's GitHub site if you wish. However, the preceding logic demonstrates the pipeline concept. Let's take it for a quick spin by predicting Boston house prices.

In this example, the response variable is :MedV, and we will build a statistical model based on :Rm, :Tax, and :Crim.

The Boston housing dataset contains data collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It is used extensively in educational literature on statistical analysis. The variables that we use in this example are:

MedV: Median value of owner-occupied homes in $1,000's
Rm: Average number of rooms per dwelling
Tax: Full-value property tax rate per $10,000
Crim: Per capita crime rate by town

The accuracy of the model is captured in the rmse variable (meaning the root mean squared error). The default implementation uses linear regression as the fitting function.
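Putting it together, a hypothetical invocation might look like the following. The variable name boston and the use of RDatasets.jl to load the data are assumptions for illustration; the book's GitHub site may load the data differently:

```julia
using DataFrames, RDatasets

# Load the Boston housing data as a DataFrame (506 rows, 14 columns)
boston = dataset("MASS", "Boston")

# Run the pipeline; returns the model's root mean squared error
rmse = run(boston, :MedV, [:Rm, :Tax, :Crim])
```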

To implement the template method pattern, we should allow the client to plug in any part of the process. For that reason, we can modify the function with keyword arguments:

function run2(data::DataFrame, response::Symbol, predictors::Vector{Symbol};
              fit = fit, split_data = split_data, validate = validate)
    train, test = split_data(data, 0.7)
    model = fit(train, response, predictors)
    validate(test, model, response)
end

Here, we have added three keyword arguments: fit, split_data, and validate. (The function is named run2 to avoid confusion.) The client can now customize any part of the process by passing in a custom function. To illustrate how this works, let's create a new fit function that uses a generalized linear model (GLM):

using GLM

function fit_glm(df::DataFrame, response::Symbol, predictors::Vector{Symbol})
    formula = Term(response) ~ +(Term.(predictors)...)
    return glm(formula, df, Normal(), IdentityLink())
end

Now that we have a customized fitting function, we can rerun the program by passing it via the fit keyword argument.
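For instance, assuming the Boston data is loaded in a DataFrame named boston as before, the customized pipeline could be invoked as follows (a sketch, not output reproduced from the book):

```julia
# Swap in the GLM-based fitting step; split_data and validate keep their defaults
rmse = run2(boston, :MedV, [:Rm, :Tax, :Crim]; fit = fit_glm)
```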

As you can see, the client can customize the pipeline easily by just passing in functions. This is possible because Julia supports first-class functions.
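The same idea can be seen in miniature with nothing but Base Julia. The names process and transform below are hypothetical, used only for this illustration:

```julia
# A tiny "template" with one pluggable step, defaulting to identity
process(xs; transform = identity) = map(transform, xs)

process([1, 2, 3])                        # → [1, 2, 3]
process([1, 2, 3]; transform = x -> x^2)  # → [1, 4, 9]
```

Because functions are ordinary values in Julia, the default steps and the client's replacements are interchangeable without any special machinery such as abstract base classes.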

In the next section, we will review a few other traditional behavioral patterns.
