Chapter 3. Data Preprocessing

Real-world observations are usually noisy and inconsistent, with missing data. No classification, regression, or clustering model can extract reliable information from data that has not been cleansed, filtered, or analyzed.

Data preprocessing consists of cleaning, filtering, transforming, and normalizing raw observations using statistics in order to correlate features or groups of features, identify trends, model data, and filter out noise. The purpose of cleansing raw data is twofold:

  • Identify flaws in raw input data
  • Provide unsupervised or supervised learning with a clean and reliable dataset

You should not underestimate the power of traditional statistical analysis methods to infer and classify information from textual or unstructured data.

In this chapter, you will learn how to do the following:

  • Apply commonly used moving average techniques to detect long-term trends in a time series
  • Identify market and sector cycles using the discrete Fourier series
  • Leverage the discrete Kalman filter to extract the state of a linear dynamic system from incomplete and noisy observations

Time series in Scala

The majority of examples used to illustrate the different machine learning algorithms in this book deal with time series, or sequential, time-ordered sets of observations.

Context bounds

The algorithms presented in this chapter are applied to time series with a single variable of type Double. Therefore, we need a mechanism to implicitly convert a given type T to Double. Scala provides developers with such a design: context bounds [3:1]:

  trait ToDouble[T] { def apply(t: T): Double }

  // Implicit conversion of a String to a Double
  implicit val str2Double = new ToDouble[String] {
     def apply(s: String): Double = s.toDouble
  }
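A generic function can then rely on the context bound T: ToDouble to accept any type convertible to Double. The mean function below is a minimal sketch (not part of the library) that assumes the str2Double implicit above is in scope:

  // Minimal sketch: a generic mean using the ToDouble context bound
  def mean[T: ToDouble](v: Vector[T]): Double = {
    val toDouble = implicitly[ToDouble[T]]
    v.map(toDouble(_)).sum / v.size
  }

  mean(Vector("1.0", "2.0", "3.0"))  // 2.0, via str2Double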

Types and operations

The Defining primitive types section under Source code in Chapter 1, Getting Started introduced the types for time series of a single variable, Vector[T], and of multiple variables, Vector[Array[T]].

A time series of observations is a vector (type Vector) of observation elements:

  • Of type T in the case of a single-variable/feature observation
  • Of type Array[T] for observations with more than one variable/feature

A time series of labels or expected values is a single-variable vector for which elements may have a primitive type of Int for classification and Double for regression.

A time series of labeled observations is a pair of a vector of observations and a vector of labels:

[Figure: Visualization of the single-feature and multi-feature observations]

The two generic types for time series, Vector[T] and Vector[Array[T]], will be used as the two primary types for input data from now on.

Note

Labeled observations structure:

Throughout the book, labeled observations are defined either as a pair of a vector of observations and a vector of labels/expected values, or as a vector of {observation, label/expected value} pairs.
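For instance, the two equivalent representations can be sketched as follows (the values are purely illustrative):

val observations: Vector[Array[Double]] =
  Vector(Array(0.5, 1.1), Array(0.7, 0.9), Array(0.2, 1.4))
val expected: Vector[Double] = Vector(1.0, 0.0, 1.0)

// Pair of a vector of observations and a vector of labels
val labeled1: (Vector[Array[Double]], Vector[Double]) =
  (observations, expected)
// Vector of {observation, label} pairs
val labeled2: Vector[(Array[Double], Double)] = observations.zip(expected)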

The class Stats introduced in the Profiling data section of Chapter 2, Data Pipelines, implements some basic statistics and normalization for single-variable observations.

Let us create a singleton, TSeries, to compute statistics and normalize multi-dimensional observations:

import scala.util.Try

type DblVec = Vector[Double]

object TSeries { 
  def zipWithShift[T](v: Vector[T], n: Int): Vector[(T, T)] = 
     v.drop(n).zip(v.view.dropRight(n))  //1

  def statistics[T: ToDouble](vs: Vector[Array[T]]): 
      Vector[Stats[T]] = vs.transpose.map( Stats[T]( _ ))  //2
  
  def normalize[T: ToDouble](  //3
      vt: Vector[T], low: Double, high: Double
  )(implicit ordering: Ordering[T]): Try[DblVec] = 
    Try( Stats[T](vt).normalize(low, high) )
   ...
}

The first method of the singleton, TSeries, generates a vector of pairs of elements by zipping the last size − n elements of a time series with its first size − n elements (line 1). The methods statistics (line 2) and normalize (line 3), through their overloaded variants, operate on both single- and multi-variable observations. These three methods are a subset of the functionality implemented in TSeries.
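For example, zipping a short series with a shift of 1 pairs each element with its predecessor:

val ts = Vector(1.0, 2.0, 4.0, 7.0)
TSeries.zipWithShift(ts, 1)
// Vector((2.0,1.0), (4.0,2.0), (7.0,4.0))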

Here is a partial list of other commonly used operators:

  • Create a time series of type Vector[(T, T)] by zipping two vectors, x and y:
    def zipToSeries[T](x: Vector[T], y: Vector[T]): 
         Vector[(T, T)]
  • Split a single- or multi-dimensional time series, xv, into two time series at index n:
    def splitAt[T](xv: Vector[T], n: Int): 
        (Vector[T], Vector[T])
  • Apply a zScore transform to a single-dimension time series:
    def zScore[T: ToDouble](xt: Vector[T]): Try[DblVec]
  • Transform a single-dimension time series x into a new time series whose elements are x(n) - x(n-1) (see the sketch after this list):
    def delta(x: DblVec): DblVec
  • Compute the sum of squared errors between two arrays x and z (also sketched below):
    def sse[T: ToDouble](x: Array[T], z: Array[T]): Double
  • Compute the statistics for each feature of a multi-dimensional time series:
    def statistics[T: ToDouble](xt: Vector[Array[T]]):          
        Vector[Stats[T]]
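As an illustration, here is a minimal sketch of how delta and sse might be implemented. It is a plausible version that assumes TSeries.zipWithShift and the ToDouble trait are in scope, not necessarily the book's exact source code:

def delta(x: DblVec): DblVec = 
  zipWithShift(x, 1).map { case (next, prev) => next - prev }

def sse[T: ToDouble](x: Array[T], z: Array[T]): Double = {
  val toDouble = implicitly[ToDouble[T]]
  x.zip(z).map { case (a, b) => 
    val d = toDouble(a) - toDouble(b)
    d * d   // squared error for one pair of elements
  }.sum
}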

Tip

Magnet pattern:

Some operations on time series may have a large variety of input and output types. Scala and Java support method overloading, which has the following limitations:

  • It does not prevent type collision caused by type erasure in the JVM
  • It does not allow a lifting to a single, generic function
  • It does not completely eliminate code redundancy

The magnet pattern used in the implementation of the Transpose and Differential operators remedies these limitations.

Transpose operator

Let's consider the transpose operator for any kind of multi-dimensional time series. The transpose operator can be objectified as the trait Transpose:

sealed trait Transpose {
  type Result   //4
  def apply(): Result  //5
}

The trait has an abstract type, Result (line 4), and an abstract method apply() (line 5), which allows us to create a generic transpose method for any combination of input and output types. The type conversion for the input and output of the transpose method is defined as an implicit:

implicit def vSeries2Matrix[T: ClassTag](from: Vector[Array[T]]) = 
  new Transpose { 
    type Result = Array[Array[T]]  //6
    def apply(): Result = from.toArray.transpose
  }

The first implicit vSeries2Matrix transposes a time series of type Vector[Array[T]] into a matrix with elements of type T (line 6). The generic transpose method is written as follows:

def transpose(tpose: Transpose): tpose.Result = tpose()
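For example, with the implicit conversion above in scope (illustrative values):

val series: Vector[Array[Int]] = 
  Vector(Array(1, 2), Array(3, 4), Array(5, 6))
val matrix: Array[Array[Int]] = transpose(series)
// Array(Array(1, 3, 5), Array(2, 4, 6))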

Differential operator

The second candidate for the magnet pattern is the computation of the differential of a time series. The purpose is to generate the time series of differences between consecutive observations from the original time series:

sealed trait Difference[T] {
  type Result
  def apply(f: (Double, Double) => T): Result
}

The trait Difference allows us to compute the differential of time series with arbitrary element types. For instance, the differential on a one-dimensional vector of type Double is defined by the following implicit conversion:

implicit def vector2Double[T](x: DblVec) = new Difference[T] {
  type Result = Vector[T]
  def apply(f: (Double, Double) => T): Result =  //7
    zipWithShift(x, 1).collect { case (next, prev) => f(prev, next) }
}

The method apply() takes one argument: the user-defined function f that computes the difference between two consecutive elements of the time series (line 7). The generic difference method is as follows:

def difference[T](
   diff: Difference[T], 
   f: (Double, Double) => T): diff.Result = diff(f)

Here are some of the predefined differential operators on time series for which the output of the operator has the type Double (line 8), Int (line 9), and Boolean (line 10):

val diffDouble = (x: Double, y: Double) => y - x  //8
val diffInt = (x: Double, y: Double) => if(y > x) 1 else 0  //9
val diffBoolean = (x: Double, y: Double) => y > x  //10
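Applying these operators to a short series of values illustrates the pattern (the values are purely illustrative):

val prices: DblVec = Vector(10.0, 10.5, 10.25, 11.0)
difference(prices, diffDouble)   // Vector(0.5, -0.25, 0.75)
difference(prices, diffInt)      // Vector(1, 0, 1)
difference(prices, diffBoolean)  // Vector(true, false, true)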

Lazy views

A view in Scala is a proxy that represents a collection, but implements data transformations and higher-order methods lazily. The elements of a view are defined as lazy values, which are instantiated on demand.

One important advantage of views over a strict (or fully allocated) collection is the reduced memory consumption.
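The following sketch illustrates the difference: the strict map allocates a full intermediate vector, while the view transforms only the elements that find actually inspects:

val v: Vector[Double] = Vector.tabulate(1000000)(_ * 0.5)

// Strict: allocates an intermediate vector of 1,000,000 elements
v.map(_ * 2.0).find(_ > 10.0)

// Lazy: elements are computed on demand; find stops after a dozen elements
v.view.map(_ * 2.0).find(_ > 10.0)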

Let's look at the data transformation aggregator, introduced in the Instantiating the workflow section under Workflow computational model in Chapter 2, Data Pipelines. There is no need to allocate all x.size elements: the higher-order method find may exit after only a few elements have been read (line 11):

val aggregator = new ETransform[DblVec, Int](ConfigInt(splits)) { 
  override def |> : PartialFunction[DblVec, Try[Int]] = { 
    case x: DblVec if x.nonEmpty => 
      Try( Range(0, x.size).view.find(x(_) == 1.0).get )  //11
  }
}

Tip

Views, iterators, and streams:

Views, iterators, and streams share the same objective of constructing elements on demand. There are, however, some major differences:

  • Iterators do not persist elements of the collection (read once)
  • Streams allow operations to be performed on collections with undefined size