Workflow computational model

Monads are very useful for manipulating and chaining data transformations using implicit configurations or explicit models. However, they are restricted to a single morphism type T => U. More complex and flexible workflows require weaving together transformations of different types, using a generic factory pattern.

Traditional factory patterns rely on a combination of composition and inheritance and do not provide developers with the same level of flexibility as stackable traits.

In this section, we introduce the concept of modeling using mixins and a variant of the cake pattern to provide a workflow with three degrees of configurability.

Supporting mathematical abstractions

Stackable traits enable developers to follow a strict mathematical formalism while implementing a model in Scala. Scientists use a universally accepted template to solve mathematical problems:

  1. Declare the variables relevant to the problem.
  2. Define a model (equation, algorithm, formulas…) as the solution to the problem.
  3. Instantiate the variables and execute the model to solve the problem.

Let's consider the example of kernel functions (see the Kernel functions section of Chapter 12, Kernel Models and Support Vector Machines), a model that consists of the composition of two mathematical functions, and its potential implementation in Scala.

Step 1 – variable declaration

The implementation consists of wrapping (scoping) the two functions into traits and defining these functions as abstract values.

The mathematical formalism is as follows:

f: R^n → R^n
g: R^n → R

The Scala implementation is represented here:

type V = Vector[Double]
trait F { val f: V => V }
trait G { val g: V => Double }

Step 2 – model definition

The model is defined as the composition of the two functions. The stack of traits G, F describes the type of compatible functions that can be composed, using the self-type constraint self: G with F:

Formalism: h = g o f, that is, h(v) = g(f(v))

The implementation is as follows:

class H { self: G with F =>
  def apply(v: V): Double = g(f(v))
}

Step 3 – instantiation

The model is executed once the variables f and g are instantiated.

The formalism is as follows:

f(v) = (exp(v1), …, exp(vn))
g(v) = v1 + … + vn
h(v) = g(f(v)) = exp(v1) + … + exp(vn)

The implementation is as follows:

import scala.math.exp

val h = new H with G with F {
  val f: V => V = (v: V) => v.map(exp(_))
  val g: V => Double = (v: V) => v.sum
}

Tip

Lazy value trigger

In the preceding example, the value of h(v) = g(f(v)) can be automatically computed as soon as g and f are initialized, by declaring h a lazy value.
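
A minimal sketch of this tip, reusing the H, F, and G types defined above (the member name value is only illustrative):

val hLazy = new H with G with F {
  val f: V => V = (v: V) => v.map(exp(_))
  val g: V => Double = (v: V) => v.sum
  // evaluated only on first access, once f and g have been initialized
  lazy val value: Double = apply(Vector(0.0, 1.0))
}
// hLazy.value == exp(0.0) + exp(1.0)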

Clearly, Scala preserves the formalism of mathematical models, making it easier for scientists and developers to migrate their existing projects written in scientifically oriented languages such as R.

Note

Emulation of R

Most data scientists use the language R to create models and apply learning strategies. They may consider Scala as an alternative to R in some cases, as Scala preserves the mathematical formalism used in models implemented in R.

Let's extend the concept of preserving mathematical formalism to the dynamic creation of workflows using traits. The design pattern described in the next section is sometimes referred to as the cake pattern.

Composing mixins to build workflow

This section presents the key constructs behind the cake pattern. A workflow composed of configurable data transformations requires a dynamic modularization (substitution) of the different stages of the workflow.

Note

Traits and mixins

Mixins are traits that are stacked against a class. The composition of mixins and the cake pattern described in this section are important for defining sequences of data transformation. However, the topic is not directly related to machine learning and the reader can skip this section.

The cake pattern is an advanced class composition pattern that uses mixin traits to meet the demands of a configurable computation workflow. It is also known as stackable modification traits [2:4].

This section is not an in-depth analysis of stackable trait injection and self-references in Scala. There are a few interesting articles on dependency injection that are worth a look [2:5].

Java relies on packages, which are tightly coupled with the directory structure and the package prefix, to modularize the code base. Scala provides developers with a flexible and reusable approach to creating and organizing modules: traits. Traits can be nested, mixed in with classes, stacked, and inherited.

Understanding the problem

Dependency injection is a fancy name for a reverse look-up and binding of dependencies. Let's consider a simple application that requires data preprocessing, classification, and validation.

A simple implementation using traits looks like this:

val app = new Classification with Validation with PreProcessing { 
  val filter = ???
}

If, at a later stage, you need to use an unsupervised clustering algorithm instead of a classifier, then the application has to be re-wired:

val app = new Clustering with Validation with PreProcessing { 
  val filter = ???
}

This approach results in code duplication and a lack of flexibility. Moreover, the class member filter needs to be redefined for each new class in the composition of the application. The problem arises when there is a dependency between the traits used in the composition. Let's consider the case in which filter depends on the validation methodology.

Tip

Mixins linearization [2:6]

The linearization or invocation of methods between mixins follows a right-to-left and base-to-subtype pattern, as illustrated in the sketch after this list:

  • Trait B extends A
  • Trait C extends A
  • Class M extends N with C with B
  • The Scala compiler implements the linearization as follows: M => B => C => A => N
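
The following sketch illustrates the invocation order with super calls; the traits are hypothetical, and A is assumed to extend the class N so that each level can invoke super:

class N { def id: String = "N" }
trait A extends N { override def id: String = "A->" + super.id }
trait B extends A { override def id: String = "B->" + super.id }
trait C extends A { override def id: String = "C->" + super.id }
class M extends N with C with B { override def id: String = "M->" + super.id }

// Each super call moves one step to the right in the linearization:
// new M().id == "M->B->C->A->N"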

Although you can define filter as an abstract value, it still has to be redefined each time a new validation type is introduced. The solution is to use a self-type in the definition of the new composed trait PreProcessingWithValidation:

trait PreProcessingWithValidation extends PreProcessing {
  self: Validation =>
  val filter = ???
}

The application is built by stacking the PreProcessingWithValidation mixin against the class Classification:

val app = new Classification with PreProcessingWithValidation 
    with Validation {
  // define Validation's abstract members here, if any
}
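
Putting the pieces together, here is a minimal, self-contained sketch of the pattern; the trait members isValid, filter, and classify are hypothetical placeholders:

trait Validation { def isValid(x: Double): Boolean }
trait PreProcessing { val filter: Double => Boolean }
trait Classification { def classify(x: Double): Int = if (x > 0.5) 1 else 0 }

trait PreProcessingWithValidation extends PreProcessing {
  self: Validation =>
  // filter is wired to whatever validation methodology is mixed in
  val filter: Double => Boolean = x => isValid(x)
}

val sampleApp = new Classification with PreProcessingWithValidation 
    with Validation {
  def isValid(x: Double): Boolean = !x.isNaN
}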

Tip

Overriding def with val

It is advantageous to override the declaration of a method with the declaration of a value with the same signature. Contrary to a value, which is assigned once and for all during instantiation, a method may return a different value at each invocation.

A def is a method that can be overridden by a def, a val, or a lazy val. The reverse is not true: you cannot override a value declaration with a method of the same signature:

trait Validator { val g = (n: Int) => ??? }
trait MyValidator extends Validator { def g(n: Int) = ??? } //WRONG
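
Conversely, a parameterless method can legitimately be overridden by a value, as in this minimal sketch (the member threshold is hypothetical):

trait Threshold { def threshold: Double }                       // declared as a def...
trait FixedThreshold extends Threshold { val threshold = 0.95 } // ...implemented by a val: compiles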

Let's adapt and generalize this pattern to construct a boilerplate template in order to create dynamic computational workflows.

Defining modules

The first step is to generate different modules to encapsulate different types of data transformation.

Tip

Use case for describing the cake pattern

It is difficult to build an example of a real-world workflow using the classes and algorithms introduced later in the book.

The following simple example is realistic enough to illustrate the different components of the cake pattern.

Let's define a sequence of three parameterized modules that each define a specific data transformation using an explicit configuration of type ETransform:

  • Sampling to extract a sample from raw data
  • Normalization to normalize the sampled data over [0, 1]
  • Aggregation to aggregate or reduce the data:
    trait Sampling[T, A] { val sampler: ETransform[T, A] }
    trait Normalization[T, A] { val normalizer: ETransform[T, A] }
    trait Aggregation[T, A] { val aggregator: ETransform[T, A] }

Each module contains a single abstract value. One characteristic of the cake pattern is that it enforces strict modularity by initializing the abstract values with the type encapsulated in the module. One of the objectives in building the framework is to allow developers to create data transformations (inherited from ETransform) independently of any workflow.
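
For reference, here is a simplified sketch of the explicit transformation and its configuration classes; the actual ETransform, Config, ConfigInt, and ConfigDouble types are introduced in Chapter 1, Getting Started, so this outline is only an approximation:

import scala.util.Try

// Simplified outline, for illustration only
sealed trait Config
case class ConfigInt(n: Int) extends Config
case class ConfigDouble(value: Double) extends Config

// An explicit transformation from T to A, parameterized by its configuration
abstract class ETransform[T, A](val config: Config) {
  def |> : PartialFunction[T, Try[A]]
}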

Tip

Scala traits and Java packages

There is a major difference between Scala and Java in terms of modularity. Java packages constrain developers into following a strict syntax requiring, for instance, that the source file has the same name as the class it contains. Scala modules based on stackable traits are far more flexible.

Instantiating the workflow

The next step is to wire the different modules into a workflow. This is achieved by using a self reference to the stack of the three traits defined in the previous paragraph.

Here's the code:

class Workflow[T, U, V, W] {
  self: Sampling[T, U] with Normalization[U, V] with Aggregation[V, W] =>

  def |> (t: T): Try[W] = for {
    u <- sampler |> t
    v <- normalizer |> u
    w <- aggregator |> v
  } yield w
}

A picture is worth a thousand words; the following UML class diagram illustrates the workflow factory (or cake) design pattern:


UML class diagram of the workflow factory

Finally, the workflow is instantiated by dynamically initializing the abstract values sampler, normalizer, and aggregator, as long as the signature (input and output types) of each transformation matches the parameterized types defined in its module (line 1):

type DblF = Double => Double
type DblVec = Vector[Double]
val samples = 100; val normRatio = 10; val splits = 4

val workflow = new Workflow[DblF, DblVec, DblVec, Int]
    with Sampling[DblF, DblVec]
    with Normalization[DblVec, DblVec]
    with Aggregation[DblVec, Int] {
  val sampler = ???  //1
  val normalizer = ???
  val aggregator = ???
}

Let's implement the data transformation function for each of the three modules/traits by assigning a transformation to the abstract values.

The first transformation, sampler, samples a function f with frequency 1/samples over the interval [0, 1]. The second transformation, normalizer, normalizes the data over the range [0, 1] using the Stats class introduced in the next chapter.

The last transformation, aggregator, extracts the index of the largest sample (of value 1.0):

val sampler = new ETransform[DblF, DblVec](ConfigInt(samples)) { //2
  override def |> : PartialFunction[DblF, Try[DblVec]] = {
    case f: DblF =>
      Try(Vector.tabulate(samples)(n => f(1.0*n/samples))) //5
  }
}

The transformation sampler uses a single model or configuration parameter, samples (line 2). The input type, DblF, is defined as Double => Double, and the output type is a vector of floating-point values, DblVec. In this particular case, the transformation consists of applying the input function f to a vector of increasing, normalized values (line 5).

The normalizer and aggregator transforms follow the same design pattern as the sampler:

val normalizer = new ETransform[DblVec, DblVec](
    ConfigDouble(normRatio)
) {
  override def |> : PartialFunction[DblVec, Try[DblVec]] = {
    case x: DblVec if x.nonEmpty =>
      Try(Stats[Double](x).normalize)
  }
}

val aggregator = new ETransform[DblVec, Int](ConfigInt(splits)) {
  override def |> : PartialFunction[DblVec, Try[Int]] = {
    case x: DblVec if x.nonEmpty =>
      Try((0 until x.size).find(x(_) == 1.0).getOrElse(-1))
  }
}

The instantiation of the transformation function follows the template described in the Monadic data transformation section of Chapter 1, Getting Started.

The workflow is now ready to process any function as input:

import scala.util.Random.nextDouble

val g = (x: Double) => Math.log(x + 1.0) + nextDouble
(workflow |> g)  //6

The workflow is executed by providing the input function g to the first mixin, sampler (line 6).
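
Since the |> operator returns a Try[Int], the result can be extracted by pattern matching, as in this minimal sketch:

import scala.util.{Failure, Success}

(workflow |> g) match {
  case Success(index) => println(s"Index of the sample of value 1.0: $index")
  case Failure(e) => println(s"Workflow failed: ${e.getMessage}")
}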

Scala's strong type checking catches inconsistent data types at compilation time. This shortens the development cycle, because runtime errors are more difficult to track down.

Note

Mixin composition for ITransform

We arbitrarily selected a data transformation using an explicit configuration, ETransform, to illustrate the concept of mixin composition. The same pattern applies to implicit data transformation, ITransform.

Modularizing

The last step is the modularization of the workflow. For complex scientific computations, you need to be able to do the following:

  • Select the appropriate workflow as a sequence of modules or tasks, according to the objective of the execution (regression, classification, clustering…)
  • Select the appropriate algorithm to fulfill a task, according to the quality of the data (noisy, incomplete…)
  • Select the appropriate implementation of the algorithm, according to the environment (distributed with a high-latency network, single host…):

    Illustration of the dynamic creation of workflow from modules/traits

Let's consider a simple preprocessing task, defined in the module PreprocessingModule. The module (or task) is declared as a trait to hide its internal workings from other modules. The pre-processing task is executed by a preprocessor of type Preprocessor. We have arbitrarily listed two algorithms, the exponential moving average of type ExpMovingAverage and the discrete Fourier transform low-pass filter of type DFTFilter, as potential pre-processors:

trait PreprocessingModule[T] {
  trait Preprocessor[T] {  //7
    def execute(x: Vector[T]): Try[DblVec]
  }
  val preprocessor: Preprocessor[T]  //8

  class ExpMovingAverage[T: ToDouble](p: Int)  //9
      (implicit num: Numeric[T]) extends Preprocessor[T] {

    val expMovingAvg = filtering.ExpMovingAverage[T](p)  //10
    val pfn = expMovingAvg |>  //11
    override def execute(x: Vector[T]): Try[DblVec] = pfn(x)
  }

  class DFTFilter[T: ToDouble](
    fc: Double,
    g: (Double, Double) => Double
  ) extends Preprocessor[T] {  //12

    val filter = filtering.DFTFir[T](g, fc, 1e-5)
    val pfn = filter |>
    override def execute(x: Vector[T]): Try[DblVec] = pfn(x)
  }
}

The generic pre-processor trait, Preprocessor, declares a single method, execute, whose purpose is to filter an input vector x of elements of type T for noise (line 7). The pre-processor instance is declared as an abstract value to be initialized with one of the filtering algorithms (line 8).

The first filtering algorithm of type ExpMovingAverage implements the Preprocessor trait and overrides the execute method (line 9). The class declares the algorithm but delegates its implementation to a class with an identical signature, org.scalaml.filtering.ExpMovingAverage (line 10). Data of generic type T is automatically converted into a vector of Double using a context bound with the syntax T: ToDouble. The context bound is implemented by the following trait:

trait ToDouble[T] { def apply(t: T): Double }
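
For instance, a context-bound instance for Double could be declared as follows (a minimal sketch, not necessarily the library's actual definition):

implicit val doubleToDouble: ToDouble[Double] = new ToDouble[Double] {
  def apply(t: Double): Double = t
}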

The partial function returned from the |> method is instantiated as a value, pfn, so it can be applied multiple times (line 11). The same design pattern is used for the discrete Fourier transform filter (line 12).

The filtering algorithm (ExpMovingAverage or DFTFir) is selected according to the profile or characteristics of the input data. Its implementation in the org.scalaml.filtering package depends on the environment (single host, Akka cluster, Apache Spark…).
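
As a hedged sketch, selecting the exponential moving average for data of type Double, with an arbitrary period of 3, could look as follows (assuming an implicit ToDouble[Double] instance, such as the one shown above, is in scope):

val module = new PreprocessingModule[Double] {
  // Select the pre-processing algorithm appropriate for the input data profile
  val preprocessor: Preprocessor[Double] = new ExpMovingAverage[Double](3)
}
val smoothed: Try[DblVec] = module.preprocessor.execute(Vector(1.0, 2.0, 3.0, 4.0))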

Note

Filtering algorithms

The filtering algorithms used to illustrate the concept of modularization in the context of the cake pattern are described in detail in Chapter 3, Data Pre-processing.
