The first step is to define a trait and a method that describe the transformation of data by the computation units of a workflow. The data transformation is the foundation of any workflow for processing and classifying a dataset, training and validating a model, and displaying results.
There are two symbolic models for defining a data transformation:
The simplest form of data transformation is a morphism between two types, U and V. The data transformation enforces a contract for validating the input and returning either a value or an error. From now on, we will use the following convention:
- A PartialFunction is returned by the data transformation. A MatchError is thrown in case the input value does not meet the required condition (contract).
- The output is of type Try[V], for which an exception is returned in case of an error.

Partial functions enable developers to implement methods that address the most common (primary) use case, for which the input values have been tested. All other non-trivial use cases (or input values) generate a MatchError exception. At a later stage in the development cycle, the developer may implement the code to handle the less common use cases.
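As a minimal illustration of this behavior, consider a partial function defined only for its primary use case (the name inverse is hypothetical, not part of the library):

```scala
import scala.util.{Try, Success}

object PartialFunctionDemo extends App {
  // Partial function defined only for strictly positive input:
  // the "primary" use case. Any other input triggers a MatchError.
  val inverse: PartialFunction[Double, Try[Double]] = {
    case x if x > 0.0 => Try(1.0 / x)
  }

  println(inverse(4.0))          // Success(0.25)
  println(Try(inverse(-2.0)))    // Failure(MatchError): input outside the domain
}
```

MatchError is a RuntimeException, so wrapping the call in Try captures it as a Failure rather than crashing the program.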
Runtime validation of a partial function
It is good practice to validate whether a partial function is defined for a specific value of the argument:
if (pfn.isDefinedAt(input))
  for { value <- pfn(input) } yield { … }
This preemptive approach allows the developer to select an alternative method or a full function [2:3]. It is an efficient alternative to catching a MatchError exception.
The validation of partial functions is omitted throughout the book for the sake of clarity.
Therefore, the signature of a data transformation is defined as follows:
def |> : PartialFunction[T, Try[A]]
F# language reference
The notation |>
used as the signature of the transform is borrowed from the F# language [2:2].
The objective is to define a symbolic representation of the transformation of different types of data without exposing the internal state of the algorithm implementing the data transformation.
Supervised learning models are extracted from a training set. Transformations such as classification or regression use the implicit models to process input data, as illustrated in the following diagram:
trait ITransform[T, A] { //1
  self =>
  def |> : PartialFunction[T, Try[A]] //2

  def map[B](f: A => B): ITransform[T, B]
  def flatMap[B](f: A => ITransform[T, B]): ITransform[T, B]
  def andThen[B](tr: ITransform[A, B]): ITransform[T, B]
}
An implicit transformation has the type ITransform with two parameter types (line 1):
- T: the type of the elements of the input collection
- A: the type of the elements of the output collection
For instance, the moving average on a time series of a single variable is computed by an ITransform[Double, Double]. The input collection is the time series, and the output is a smoothed time series.
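To make the idea concrete, here is a self-contained sketch of the trait's |> method with a hypothetical transform that computes the mean of a non-empty series, a simplified stand-in for a moving average (the class name Mean is an assumption, not from the book):

```scala
import scala.util.Try

object ITransformSketch extends App {
  // Minimal, self-contained stand-in for the ITransform trait (|> only)
  trait ITransform[T, A] {
    def |> : PartialFunction[T, Try[A]]
  }

  // Hypothetical transform: mean of a non-empty time series.
  // The contract (non-empty input) is enforced by the partial function.
  class Mean extends ITransform[Vector[Double], Double] {
    override def |> : PartialFunction[Vector[Double], Try[Double]] = {
      case xs if xs.nonEmpty => Try(xs.sum / xs.size)
    }
  }

  println((new Mean).|>(Vector(1.0, 2.0, 3.0)))   // Success(2.0)
}
```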
Apache Spark ML transformers
The concept behind ITransform
is somewhat similar to the Apache Spark MLlib transformers on data frames described in the ML Reusable Pipelines section of Chapter 17, Apache Spark MLlib.
The method |>
declares the transformation that is defined by implementing the trait ITransform
(line 2). Let's look at the monadic operators.
The map
method applies a function to each element of the output of the transformation |>
. It generates a new ITransform
by overriding the |>
method (line 3).
A new implementation of the data transformation |>, returning an instance of PartialFunction[T, Try[B]] (line 4), is created by overriding the methods isDefinedAt (line 5) and apply (line 6):
def map[B](f: A => B): ITransform[T, B] =
  new ITransform[T, B] {
    override def |> : PartialFunction[T, Try[B]] = //3
      new PartialFunction[T, Try[B]] { //4
        override def isDefinedAt(t: T): Boolean = //5
          self.|>.isDefinedAt(t)
        override def apply(t: T): Try[B] = self.|>(t).map(f) //6
      }
  }
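The effect of map can be demonstrated with a short, self-contained sketch (the parse transform and its names are hypothetical, introduced only for illustration):

```scala
import scala.util.Try

object MapDemo extends App {
  trait ITransform[T, A] { self =>
    def |> : PartialFunction[T, Try[A]]

    // map reproduced here so the sketch is runnable on its own
    def map[B](f: A => B): ITransform[T, B] = new ITransform[T, B] {
      override def |> : PartialFunction[T, Try[B]] =
        new PartialFunction[T, Try[B]] {
          override def isDefinedAt(t: T): Boolean = self.|>.isDefinedAt(t)
          override def apply(t: T): Try[B] = self.|>(t).map(f)
        }
    }
  }

  // Hypothetical transform: parse a non-empty string into an Int
  val parse = new ITransform[String, Int] {
    override def |> : PartialFunction[String, Try[Int]] = {
      case s if s.nonEmpty => Try(s.toInt)
    }
  }

  // map applies a function to the output of |> without re-implementing it
  val doubled = parse.map(_ * 2)
  println(doubled.|>("21"))   // Success(42)
}
```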
The overridden methods for the instantiation of ITransform
in flatMap
follow the same design pattern as the map method. The argument f
converts each output element into an implicit transformation of type ITransform[T, B]
(line 7), and outputs a new instance of ITransform
after flattening (line 8).
As with the map
method, it overrides the implementation of the data transformation |>
returning a new partial function (line 9) after overriding the isDefinedAt
and apply
methods:
def flatMap[B](
    f: A => ITransform[T, B] //7
): ITransform[T, B] = new ITransform[T, B] { //8
  override def |> : PartialFunction[T, Try[B]] =
    new PartialFunction[T, Try[B]] { //9
      override def isDefinedAt(t: T): Boolean =
        self.|>.isDefinedAt(t)
      override def apply(t: T): Try[B] =
        self.|>(t).flatMap(f(_).|>(t))
    }
}
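A self-contained sketch of flatMap in action, using a hypothetical pair of transforms (identity followed by a scaling transform chosen from the output value; all names here are assumptions for illustration):

```scala
import scala.util.Try

object FlatMapDemo extends App {
  trait ITransform[T, A] { self =>
    def |> : PartialFunction[T, Try[A]]

    // flatMap reproduced here so the sketch is runnable on its own
    def flatMap[B](f: A => ITransform[T, B]): ITransform[T, B] =
      new ITransform[T, B] {
        override def |> : PartialFunction[T, Try[B]] =
          new PartialFunction[T, Try[B]] {
            override def isDefinedAt(t: T): Boolean = self.|>.isDefinedAt(t)
            override def apply(t: T): Try[B] = self.|>(t).flatMap(f(_).|>(t))
          }
      }
  }

  // Identity transform: passes its input through unchanged
  val identity = new ITransform[Double, Double] {
    override def |> : PartialFunction[Double, Try[Double]] = { case x => Try(x) }
  }

  // Select a scaling transform based on the sign of the previous output
  def scaler(a: Double): ITransform[Double, Double] =
    new ITransform[Double, Double] {
      override def |> : PartialFunction[Double, Try[Double]] = {
        case x => Try(if (a >= 0) x * 2.0 else x * 0.5)
      }
    }

  val combined = identity.flatMap(scaler)
  println(combined.|>(3.0))   // Success(6.0)
}
```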
The method andThen
is not a proper element of a monad. Its meaning is similar to the Scala method Function1.andThen
that chains a function with another one. It is indeed useful to create chains of implicit transformations. The method applies the transformation tr
(line 10) to the output of this transformation. The output type of the first transformation is the input type of the next transformation, tr
.
The implementation of the method andThen
follows a pattern similar to the implementation of map
and flatMap
:
def andThen[B](
    tr: ITransform[A, B] //10
): ITransform[T, B] = new ITransform[T, B] {
  override def |> : PartialFunction[T, Try[B]] =
    new PartialFunction[T, Try[B]] {
      override def isDefinedAt(t: T): Boolean =
        self.|>.isDefinedAt(t) &&
          tr.|>.isDefinedAt(self.|>(t).get)
      override def apply(t: T): Try[B] = tr.|>(self.|>(t).get)
    }
}
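Chaining with andThen can be sketched as follows; the two stages (string parsing and square root) are hypothetical examples, not transforms from the book:

```scala
import scala.util.Try

object AndThenDemo extends App {
  trait ITransform[T, A] { self =>
    def |> : PartialFunction[T, Try[A]]

    // andThen reproduced here so the sketch is runnable on its own
    def andThen[B](tr: ITransform[A, B]): ITransform[T, B] =
      new ITransform[T, B] {
        override def |> : PartialFunction[T, Try[B]] =
          new PartialFunction[T, Try[B]] {
            override def isDefinedAt(t: T): Boolean =
              self.|>.isDefinedAt(t) && tr.|>.isDefinedAt(self.|>(t).get)
            override def apply(t: T): Try[B] = tr.|>(self.|>(t).get)
          }
      }
  }

  // Stage 1: parse a non-empty string into a Double
  val parse = new ITransform[String, Double] {
    override def |> : PartialFunction[String, Try[Double]] = {
      case s if s.nonEmpty => Try(s.toDouble)
    }
  }
  // Stage 2: square root of a non-negative value
  val sqrt = new ITransform[Double, Double] {
    override def |> : PartialFunction[Double, Try[Double]] = {
      case x if x >= 0.0 => Try(math.sqrt(x))
    }
  }

  // The output type of parse (Double) is the input type of sqrt
  val pipeline = parse andThen sqrt
  println(pipeline.|>("16.0"))   // Success(4.0)
}
```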
The transformation on a dataset is performed using a model or configuration fully defined by the user, as illustrated in the following diagram:
The execution of a data transformation may depend on some context or external configuration. Such transformations are defined by the type ETransform, parameterized by the type T of the elements of the input collection and the type A of the elements of the output collection (line 11). The context or configuration is defined by the trait Config (line 12).
An explicit transformation is a transformation with the extra capability to use a set of external configuration parameters to generate an output. Therefore, ETransform
inherits from ITransform
(line 13):
abstract class ETransform[T, A]( //11
    config: Config //12
) extends ITransform[T, A] { //13
  self =>

  override def map[B](f: A => B): ETransform[T, B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] =
        self.|>.andThen(_.map(f))
    }

  override def flatMap[B](f: A => ETransform[T, B]): ETransform[T, B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] =
        new PartialFunction[T, Try[B]] {
          override def isDefinedAt(t: T): Boolean = self.|>.isDefinedAt(t)
          override def apply(t: T): Try[B] = self.|>(t).flatMap(f(_).|>(t))
        }
    }

  override def andThen[B](tr: ETransform[A, B]): ETransform[T, B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] =
        new PartialFunction[T, Try[B]] {
          override def isDefinedAt(t: T): Boolean =
            self.|>.isDefinedAt(t) && tr.|>.isDefinedAt(self.|>(t).get)
          override def apply(t: T): Try[B] = tr.|>(self.|>(t).get)
        }
    }
}
The client code is responsible for specifying the type and value of the configuration used by a given explicit transformation. Here are a few examples of configuration classes:
trait Config
case class ConfigInt(iParam: Int) extends Config
case class ConfigDouble(fParam: Double) extends Config
case class ConfigArrayDouble(fParams: Array[Double]) extends Config
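A minimal, self-contained sketch of an explicit transformation driven by one of these configuration classes (the Scale transform is a hypothetical example; only the |> method of ETransform is reproduced here):

```scala
import scala.util.Try

object ETransformDemo extends App {
  trait Config
  case class ConfigDouble(fParam: Double) extends Config

  // Stand-in for ETransform: the configuration is frozen in the
  // constructor, so the transformation is immutable.
  abstract class ETransform[T, A](val config: Config) {
    def |> : PartialFunction[T, Try[A]]
  }

  // Hypothetical explicit transform: scale each input by a configured factor
  class Scale(cfg: ConfigDouble) extends ETransform[Double, Double](cfg) {
    override def |> : PartialFunction[Double, Try[Double]] = {
      case x => Try(x * cfg.fParam)
    }
  }

  val scale = new Scale(ConfigDouble(2.5))
  println(scale.|>(4.0))   // Success(10.0)
}
```

The configuration is supplied entirely by the client code; the transform itself never mutates it.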
Memory cleaning
Instances of ITransform and ETransform do not release the memory allocated for the input data. The client code is responsible for the memory management of the input and output data. However, the method |> is expected to release any memory associated with the temporary data structure(s) used for the transformation.
The supervised learning models described in future chapters, such as logistic regression, support vector machines, Naïve Bayes, or the multilayer perceptron, are defined as implicit transformations and implement the ITransform trait. Filtering and data processing algorithms, such as data extractors, moving averages, or Kalman filters, inherit from the ETransform abstract class.
Immutable transformations
The model for a data transformation (or processing unit or classifier) class should be immutable: any modification would alter the integrity of the model or the parameters used to process data. In order to ensure that the same model is used to process input data for the entire lifetime of a transformation:

- The model of an ETransform is defined as an argument of its constructor.
- The model of an ITransform is generated from a given training set. The model has to be rebuilt from the training set (not altered) if it starts to provide incorrect outcomes or predictions.

Models are created by the constructor of classifiers or data transformation classes to ensure their immutability. The design of immutable transformations is described in the Design template for classifiers subsection in the Scala programming section of the Appendix.