Monadic data transformation

The first step is to define a trait and a method that describe the transformation of data by the computation units of a workflow. The data transformation is the foundation of any workflow for processing and classifying a dataset, training and validating a model, and displaying results.

There are two symbolic models for defining a data processing or transformation task:

  • Explicit model: The developer creates a model explicitly from a set of configuration parameters. Most deterministic algorithms and unsupervised learning techniques use an explicit model.
  • Implicit model: The developer provides a training set that is a set of labeled observations (observations with expected outcome). A classifier extracts a model through the training set. Supervised learning techniques rely on a model implicitly generated from labeled data.

Error handling

The simplest form of data transformation is a morphism between two types, T and A. The data transformation enforces a contract for validating the input and returning either a value or an error. From now on, we will use the following convention:

  • Input value: The validation is implemented through a partial function, of type PartialFunction, that is returned by the data transformation. A MatchError exception is thrown in case the input value does not meet the required condition (contract).
  • Output value: The return type is Try[A], which wraps an exception in case of an error.
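This convention can be sketched with a hypothetical square-root transformation (the name sqrtPfn is an illustration, not from the original): the partial function validates the input, and the result is wrapped in Try:

```scala
import scala.util.{Try, Success}

// Hypothetical transformation: defined only for non-negative input.
// An input outside the domain triggers a MatchError; a failure inside
// the body would surface as a Failure.
val sqrtPfn: PartialFunction[Double, Try[Double]] = {
  case x if x >= 0.0 => Try(math.sqrt(x))
}

sqrtPfn(4.0)   // Success(2.0)
// sqrtPfn(-1.0) throws a MatchError: the input violates the contract
```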

Tip

Partial function reusability

Reusability is another benefit of partial functions, as illustrated in the following code snippet:

class F {
  def f: PartialFunction[Int, Try[Double]] = { ... }
}
val pfn = (new F).f
pfn(4)
pfn(10)

Partial functions enable developers to implement methods that address the most common (primary) use case for which input values have been tested. All other non-trivial use cases (or input values) generate a MatchError exception. At a later stage in the development cycle, the developer may implement the code to handle the less common use cases.

Note

Runtime validation of a partial function

It is good practice to validate whether a partial function is defined for a specific value of the argument:

if (pfn.isDefinedAt(input))
   pfn(input).map { value => … }

This preemptive approach allows the developer to select an alternative method or a full function [2:3]. It is an efficient alternative to catching a MatchError exception.

The validation of partial functions is omitted throughout the book for the sake of clarity.
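As an aside, the standard lift method offers another alternative to the explicit isDefinedAt test: it converts a partial function into a total function returning an Option. A minimal sketch (pfn here is a hypothetical transformation):

```scala
import scala.util.{Try, Success}

val pfn: PartialFunction[Int, Try[Double]] = {
  case n if n > 0 => Try(1.0 / n)
}

// lift turns the partial function into a total function: no MatchError,
// inputs outside the domain simply yield None.
val lifted: Int => Option[Try[Double]] = pfn.lift
lifted(4)    // Some(Success(0.25))
lifted(-1)   // None
```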

Therefore, the signature of a data transformation is defined as follows:

def |> : PartialFunction[T, Try[A]]

F# language reference

The notation |> used as the signature of the transform is borrowed from the F# language [2:2].

Monads to the rescue

The objective is to define a symbolic representation of the transformation of different types of data without exposing the internal state of the algorithm implementing the data transformation.

Note

This section illustrates the concept of monadic data transformation, which is not essential to the understanding of machine learning algorithms as described throughout the book. You can safely skip to the Workflow computational models section.

Implicit models

Supervised learning models are extracted from a training set. Transformations such as classification or regression use the implicit models to process input data, as illustrated in the following diagram:

Implicit models

Visualization of implicit models

trait ITransform[T, A] {  //1
self => 
   def |> : PartialFunction[T, Try[A]]   //2
   def map[B](f: A => B): ITransform[T, B] 
   def flatMap[B](f: A => ITransform[T, B]): ITransform[T, B] 
   def andThen[B](tr: ITransform[A, B]): ITransform[T, B] 
}

An implicit transformation has the type ITransform, with two parameter types (line 1):

  • T: Type of feature or element of the input collection
  • A: Type of element of the output collection

For instance, the moving average on a time series of a single variable is computed by an ITransform[Double, Double]. The input collection is the time series and the output is a smoothed time series.
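A minimal standalone sketch, assuming a simplified version of the trait with only the |> method, and a hypothetical Scaler in place of an actual moving average, shows how a per-element implicit transformation is applied to a collection with collect:

```scala
import scala.util.{Try, Success}

// Simplified version of the trait, reduced to the |> method
trait ITransform[T, A] {
  def |> : PartialFunction[T, Try[A]]
}

// Hypothetical per-element transformation; a stand-in for a transform
// whose parameter would be extracted from a training set.
class Scaler(factor: Double) extends ITransform[Double, Double] {
  override def |> : PartialFunction[Double, Try[Double]] = {
    case x if !x.isNaN => Try(x * factor)
  }
}

val scaler = new Scaler(0.5)
// collect applies the partial function only where it is defined
val out = List(1.0, 2.0, 4.0).collect(scaler.|>)
// out: List(Success(0.5), Success(1.0), Success(2.0))
```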

Note

Apache Spark ML transformers

The concept behind ITransform is somewhat similar to the Apache Spark MLlib transformers on data frames described in the ML Reusable Pipelines section of Chapter 17, Apache Spark MLlib.

The method |> declares the transformation that is defined by implementing the trait ITransform (line 2). Let's look at the monadic operators.

The map method applies a function to each element of the output of the transformation |>. It generates a new ITransform by overriding the |> method (line 3).

A new implementation of the data transformation |>, returning an instance of PartialFunction[T, Try[B]] (line 4), is created by overriding the methods isDefinedAt (line 5) and apply (line 6):

def map[B](f: A => B): ITransform[T,B] = new ITransform[T,B] {
  override def |> : PartialFunction[T, Try[B]] =  //3
    new PartialFunction[T, Try[B]] {  //4
      override def isDefinedAt(t: T): Boolean =  //5
        self.|>.isDefinedAt(t)
      override def apply(t: T): Try[B] = self.|>(t).map(f)  //6
    }
}
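The map semantics can be checked with a standalone sketch using plain partial functions (base and the scaling function here are hypothetical stand-ins for the trait's |> and the argument f):

```scala
import scala.util.{Try, Success}

// base plays the role of the original transformation's |>
val base: PartialFunction[Int, Try[Double]] = {
  case n if n != 0 => Try(1.0 / n)
}
// mapped keeps the same domain and applies f to the Success value
val mapped: PartialFunction[Int, Try[Double]] =
  new PartialFunction[Int, Try[Double]] {
    override def isDefinedAt(n: Int): Boolean = base.isDefinedAt(n)
    override def apply(n: Int): Try[Double] = base(n).map(_ * 100)
  }

mapped(4)  // Success(25.0)
```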

The overridden methods for the instantiation of ITransform in flatMap follow the same design pattern as the map method. The argument f converts each output element into an implicit transformation of type ITransform[T, B] (line 7), and outputs a new instance of ITransform after flattening (line 8).

As with the map method, it overrides the implementation of the data transformation |> returning a new partial function (line 9) after overriding the isDefinedAt and apply methods:

def flatMap[B](
  f: A => ITransform[T, B]  //7
): ITransform[T, B] = new ITransform[T, B] {  //8

  override def |> : PartialFunction[T, Try[B]] =
    new PartialFunction[T, Try[B]] {  //9
      override def isDefinedAt(t: T): Boolean =
        self.|>.isDefinedAt(t)
      override def apply(t: T): Try[B] =
        self.|>(t).flatMap(f(_).|>(t))
    }
}
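Similarly, the flatMap semantics can be sketched standalone: each output value a is turned by f into a new (simplified) transformation, which is then applied to the same input (base and f are hypothetical stand-ins):

```scala
import scala.util.{Try, Success}

// base plays the role of the original transformation's |>
val base: PartialFunction[Int, Try[Int]] = {
  case n if n > 0 => Try(n + 1)
}
// f maps an output value to a new partial function (a simplified transform)
def f(a: Int): PartialFunction[Int, Try[Int]] = {
  case t => Try(a * t)
}

val flatMapped: PartialFunction[Int, Try[Int]] =
  new PartialFunction[Int, Try[Int]] {
    override def isDefinedAt(t: Int): Boolean = base.isDefinedAt(t)
    override def apply(t: Int): Try[Int] =
      base(t).flatMap(a => f(a)(t))
  }

flatMapped(3)  // base yields 4, then 4 * 3 gives Success(12)
```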

The method andThen is not a proper element of a monad. Its meaning is similar to the Scala method Function1.andThen that chains a function with another one. It is indeed useful to create chains of implicit transformations. The method applies the transformation tr (line 10) to the output of this transformation. The output type of the first transformation is the input type of the next transformation, tr.

The implementation of the method andThen follows a pattern similar to the implementation of map and flatMap:

def andThen[B](
  tr: ITransform[A, B]  //10
): ITransform[T, B] = new ITransform[T, B] {

  override def |> : PartialFunction[T, Try[B]] =
    new PartialFunction[T, Try[B]] {
      override def isDefinedAt(t: T): Boolean =
        self.|>.isDefinedAt(t) &&
          tr.|>.isDefinedAt(self.|>(t).get)
      override def apply(t: T): Try[B] = tr.|>(self.|>(t).get)
    }
}
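The chaining semantics of andThen can be sketched standalone with plain partial functions: the output type of the first stage (Double) is the input type of the second (parse and sqrt are hypothetical stages, not from the original):

```scala
import scala.util.{Try, Success}

// First stage: parse a string into a Double
val parse: PartialFunction[String, Try[Double]] = {
  case s if s.nonEmpty => Try(s.toDouble)
}
// Second stage: square root, defined for non-negative values
val sqrt: PartialFunction[Double, Try[Double]] = {
  case x if x >= 0.0 => Try(math.sqrt(x))
}

// Equivalent of (parseTransform andThen sqrtTransform).|>
val chained: PartialFunction[String, Try[Double]] = {
  case s if parse.isDefinedAt(s) &&
            parse(s).map(sqrt.isDefinedAt).getOrElse(false) =>
    parse(s).flatMap(sqrt)
}

chained("16")  // Success(4.0)
```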

Note

andThen and compose

The reader is invited to implement a compose method, which executes the |> methods in the reverse order of andThen.

Explicit models

The transformation on a dataset is performed using a model or configuration fully defined by the user, as illustrated in the following diagram:

Explicit models

Visualization of explicit models

The execution of a data transformation may depend on some context or external configuration. Such transformations are defined with the type ETransform, parameterized by the type T of the elements of the input collection and the type A of the elements of the output collection (line 11). The context or configuration is defined by the trait Config (line 12).

An explicit transformation is a transformation with the extra capability to use a set of external configuration parameters to generate an output. Therefore, ETransform inherits from ITransform (line 13):

abstract class ETransform[T,A](  //11
  config: Config  //12
) extends ITransform[T,A] {  //13
self =>
  def map[B](f: A => B): ETransform[T,B] =
    new ETransform[T,B](config) {
      override def |> : PartialFunction[T, Try[B]] = super.|>
    }

  def flatMap[B](f: A => ETransform[T,B]): ETransform[T,B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] = super.|>
    }

  def andThen[B](tr: ETransform[A,B]): ETransform[T,B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] = super.|>
    }
}

The client code is responsible for specifying the type and value of the configuration used by a given explicit transformation. Here are a few examples of configuration classes:

trait Config
case class ConfigInt(iParam: Int) extends Config
case class ConfigDouble(fParam: Double) extends Config
case class ConfigArrayDouble(fParams: Array[Double])
   extends Config
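A minimal standalone sketch of an explicit transformation driven by such a configuration (the Multiplier class is a hypothetical example, and ETransform is reduced to its |> method):

```scala
import scala.util.{Try, Success}

trait Config
case class ConfigInt(iParam: Int) extends Config

// Simplified ETransform: the model is fully defined by the constructor argument
abstract class ETransform[T, A](val config: Config) {
  def |> : PartialFunction[T, Try[A]]
}

// Hypothetical explicit transformation parameterized by a ConfigInt
class Multiplier(cfg: ConfigInt) extends ETransform[Int, Int](cfg) {
  override def |> : PartialFunction[Int, Try[Int]] = {
    case n => Try(n * cfg.iParam)
  }
}

val doubler = new Multiplier(ConfigInt(2))
doubler.|>(21)  // Success(42)
```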

Tip

Memory cleaning

Instances of ITransform and ETransform do not release the memory allocated for the input data. The client code is responsible for the memory management of the input and output data. However, the method |> is expected to release any memory associated with temporary data structures used during the transformation.

The supervised learning models described in future chapters, such as logistic regression, support vector machines, Naïve Bayes, or the multilayer perceptron, are defined as implicit transformations and implement the ITransform trait. Filtering and data processing algorithms, such as data extractors, moving averages, or Kalman filters, inherit from the ETransform abstract class.

Note

Immutable transformations

The model for a data transformation (or processing unit or classifier) class should be immutable: any modification would alter the integrity of the model or parameters used to process data. In order to ensure that the same model is used in processing input data for the entire lifetime of a transformation:

  • A model for an ETransform is defined as an argument of its constructor.
  • The constructor of an ITransform generates the model from a given training set. The model has to be rebuilt from the training set (not altered), if it starts to provide an incorrect outcome or prediction.

Models are created by the constructor of classifiers or data transformation classes to ensure their immutability. The design of immutable transformation is described in the Design template for classifiers subsection in the Scala programming section of the Appendix.
