Basic components of a data-driven system

In short, a data-driven architecture contains the following components, or at least can be reduced to them; every system I have seen has some version of each:

  • Data ingest: We need to collect data from systems and devices. Most systems have logs, or at least an option to write files to a local filesystem. Some can also report information to network-based interfaces such as syslog, but the absence of a persistence layer usually means potential data loss, if not a complete lack of audit information.
  • Data transformation layer: This was historically called extract, transform, and load (ETL). Today the data transformation layer can also be used for real-time processing, where aggregates are computed on the most recent data. The data transformation layer is also traditionally used to reformat and index the data so that it can be accessed efficiently by the UI component or by algorithms further down the pipeline.
  • Data analytics and machine learning engine: One reason this is not part of the standard data transformation layer is that it requires quite different skills. The mindset of people who build sound statistical models is usually different from that of people who make terabytes of data move fast, even though occasionally I find people with both skills. These unicorns are usually called data scientists, although their skills in any specific area tend to be weaker than those of someone who specializes in that field; we need more of both. Another reason is that machine learning, and to a certain extent data analysis, requires multiple aggregations and passes over the same data, which, as opposed to more stream-like ETL transformations, calls for a different engine.
  • UI component: Yes, UI stands for user interface, which is most often a set of components that let you communicate with the system via a browser (it used to be a native GUI, but these days web-based JavaScript or Scala-based frameworks are much more powerful and portable). From the data pipeline and modeling perspective, this component offers an API to access the internal representation of the data and models.
  • Actions engine: This is usually a configurable rules engine that optimizes the provided metrics based on the insights. The actions may either be real-time, as in online advertising, in which case the engine should be able to supply real-time scoring information, or recommendations for user actions, which can take the form of e-mail alerts.
  • Correlation engine: This is an emerging component that may analyze the output of the data analytics and machine learning engine to infer additional insights into data or model behavior. Actions might also be triggered by the output of this layer.
  • Monitoring: A complex system is incomplete without logging, monitoring, and some way to change system parameters. The purpose of monitoring is to have a nested decision-making system concerned with the optimal health of the overall system, which either mitigates problems automatically or alerts the system administrators about them.

Let's discuss each of the components in detail in the following sections.

Data ingest

With the proliferation of smart devices, information gathering has become less of a problem and more of a necessity for any business that does more than produce type-written text. For the purposes of this chapter, I will assume that the device or devices are connected to the Internet, or have some other way of passing the information on, whether by phoning home or over a direct network connection.

The major purpose of this component is to collect all the information that may be relevant for further data-driven decision making. The following describes the most common data ingest frameworks, with notes on when each is used:

Syslog

When used: Syslog is one of the most common standards for passing messages between machines on Unix. Syslog usually listens on port 514, and the transport protocol can be configured as either UDP (unreliable) or TCP. The latest enhanced implementation on CentOS and Red Hat Linux is rsyslog, which includes many advanced options such as regex-based filtering, useful for system-performance tuning and debugging. Apart from a slightly inefficient raw message representation (plain text, which can be wasteful for long messages with repeated strings), the syslog system can support tens of thousands of messages per second.

Comments: Syslog is one of the oldest protocols, developed in the 1980s by Eric Allman as part of Sendmail. While it does not guarantee delivery or durability, particularly for distributed systems, it remains one of the most widespread protocols for message passing. Some of the later frameworks, such as Flume and Kafka, provide syslog interfaces as well.

Rsync

When used: Rsync is a younger tool, developed in the 1990s. If the data is written to flat files on a local filesystem, rsync might be an option. While rsync is more traditionally used to synchronize two directories, it can also be run periodically to transfer log data in batches. Rsync uses a recursive algorithm, invented by the Australian computer programmer Andrew Tridgell, for efficiently detecting differences and transmitting a structure (such as a file) across a communication link when the receiving computer already has a similar, but not identical, version of the same structure. While it incurs extra communication, it is better from the point of view of durability, as the original copy can always be retrieved. It is particularly appropriate if the log data arrives in batches in the first place (such as uploads or downloads).

Comments: Rsync has been known to be hampered by network bottlenecks, as it ultimately passes more information over the network when comparing directory structures. However, the transferred files may be compressed when passed over the network, and the network bandwidth can be limited via command-line flags.

Flume

When used: Flume is one of the youngest frameworks, developed by Cloudera in 2009-2011 and open sourced. Flume (we refer to the more popular flume-ng implementation as Flume, as opposed to the older original Flume) consists of sources, channels, and sinks that may be configured on multiple nodes for high availability and redundancy. Flume was designed to err on the side of reliability at the expense of possible duplication of data. Flume passes messages in the Avro format, which is also open sourced; both the transfer protocol and the messages themselves can be encoded and compressed.

Comments: While Flume was originally developed just to ship records from a file or a set of files, it can also be configured to listen on a port, or even pull records from a database. Flume has multiple adapters, including one for the syslog protocol described above.

Kafka

When used: Kafka is the latest addition to the log-processing frameworks, developed at LinkedIn and open sourced. Compared to the previous frameworks, Kafka is more like a distributed, reliable message queue. Kafka keeps a partitioned buffer, potentially spread across multiple distributed machines, and consumers can subscribe to or unsubscribe from messages on a particular topic. Kafka was built with strong reliability guarantees in mind, which are achieved through replication and a consensus protocol.

Comments: Kafka might not be appropriate for small systems (fewer than five nodes), as the benefits of a fully distributed system become evident only at larger scales. Kafka is commercially supported by Confluent.

The transfer of information usually occurs in batches, or micro-batches if the requirements are close to real time. Usually the information first ends up in a file, traditionally called a log, in the device's local filesystem, and is then transferred to a central location. The recently developed Kafka and Flume are often used to manage these transfers, together with the more traditional syslog, rsync, or netcat. Finally, the data can be placed into local or distributed storage such as HDFS, Cassandra, or Amazon S3.
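As a minimal sketch of the Kafka route (the broker address, topic name, and log file path are hypothetical placeholders), a producer written against the standard Kafka Java client might look like this in Scala:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogShipper {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Broker address and serializers; "localhost:9092" is a placeholder
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Ship every line of a local log file to the hypothetical "logs" topic
    val source = scala.io.Source.fromFile("/var/log/app.log")
    try {
      source.getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("logs", line))
      }
    } finally {
      source.close()
      producer.close()
    }
  }
}
```

A consumer on the other end, or a Kafka-to-HDFS connector, would then land the records in the storage layer described above.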

Data transformation layer

After the data ends up in HDFS or other storage, it needs to be made available for processing. Traditionally, the data is processed on a schedule and ends up partitioned into time-based buckets. The processing can happen daily, hourly, or even on a sub-minute basis with the newer Scala streaming frameworks, depending on the latency requirements. The processing may involve some preliminary feature construction or vectorization, even though that is traditionally considered a machine-learning task. The following summarizes some of the available scheduling frameworks; a minimal Spark batch-partitioning sketch follows the list:

Oozie

When used: This is one of the oldest open source workflow frameworks, developed by Yahoo. It has good integration with the big data Hadoop tools, but only a limited UI that lists the job history.

Comments: The whole workflow is put into one big XML file, which might be considered a disadvantage from the modularity point of view.

Azkaban

When used: This is an alternative open source workflow-scheduling framework, developed by LinkedIn. Compared to Oozie, it arguably has a better UI. The disadvantage is that all high-level tasks are executed locally, which might present a scalability problem.

Comments: The idea behind Azkaban is to create a fully modularized drop-in architecture where new jobs/tasks can be added with as few modifications as possible.

StreamSets

When used: StreamSets is the latest addition, built by former Informatica and Cloudera developers. It has a very polished UI and supports a much richer set of input sources and output destinations.

Comments: This is a fully UI-driven tool with an emphasis on data curation, that is, constantly monitoring the data stream for problems and abnormalities.
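The schedulers above orchestrate the jobs; the jobs themselves can be written, for instance, with Spark, which is covered in Chapter 3, Working with Spark and MLlib. As a minimal sketch (the HDFS paths and the timestamp field are hypothetical), daily bucketing of raw JSON logs might look like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object DailyEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("daily-etl").getOrCreate()

    // Raw records landed by the ingest layer, assumed to carry a "timestamp" field
    val raw = spark.read.json("hdfs:///ingest/raw/")

    // Bucket by calendar day and write a partitioned, columnar copy for downstream consumers
    raw.withColumn("day", to_date(col("timestamp")))
      .write
      .partitionBy("day")
      .mode("overwrite")
      .parquet("hdfs:///warehouse/logs/")

    spark.stop()
  }
}
```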

Separate attention should be given to stream-processing frameworks, where the latency requirements are reduced to one or a few records at a time. First, stream processing usually requires many more resources dedicated to processing, as it is more expensive to process records individually than in batches, even batches of only tens or hundreds of records. So, the architect needs to justify the additional cost by the value of a more recent result, which is not always warranted. Second, stream processing requires a few adjustments to the architecture, as handling the most recent data becomes a priority; for example, a delta architecture, where the most recent data is handled by a separate substream or set of nodes, has recently become very popular with systems such as Druid (http://druid.io).
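As a sketch of the streaming side, here is a one-minute windowed count using Spark's Structured Streaming, one of several stream-processing options; the built-in rate source stands in for a real Kafka or socket feed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object StreamingCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("streaming-counts").getOrCreate()

    // The built-in "rate" source emits (timestamp, value) rows; a real pipeline
    // would read from Kafka, Flume, or a socket instead
    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // One-minute tumbling-window counts, updated as new records arrive
    val counts = events.groupBy(window(events("timestamp"), "1 minute")).count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```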

Data analytics and machine learning

For the purpose of this chapter, Machine Learning (ML) is any algorithm that can compute aggregates or summaries that are actionable. We will cover more complex algorithms from Chapter 3, Working with Spark and MLlib to Chapter 6, Working with Unstructured Data, but in some cases, a simple sliding-window average and the deviation from that average may be a sufficient signal for taking an action. In the past few years, "it just works in A/B testing" has somehow become a convincing argument for model building and deployment. I am not speculating on whether solid scientific principles apply or not, but many fundamental assumptions, such as i.i.d. observations, balanced designs, and thin tails, simply fail to hold in many big data situations. Simpler models tend to be faster and to have better performance and stability.

For example, in online advertising, one might just track the average performance of a set of ads over a set of similar properties over time to decide whether to keep displaying an ad. Anomalies, or deviations from previous behavior, may signal a new unknown unknown, indicating that the old data no longer applies, in which case the system has no choice but to start a new exploration cycle.
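As a minimal illustration of the sliding-window idea, the plain-Scala snippet below flags the latest value of a hypothetical hourly click-through-rate series when it deviates from the recent average by more than an arbitrary three standard deviations:

```scala
object WindowAnomaly extends App {
  // Hypothetical hourly click-through rates for one ad; the last value drops sharply
  val ctr = Vector(0.031, 0.029, 0.033, 0.030, 0.032, 0.011)

  val windowSize = 5
  val history = ctr.dropRight(1).takeRight(windowSize)
  val latest  = ctr.last

  val mean   = history.sum / history.size
  val stdDev = math.sqrt(history.map(x => math.pow(x - mean, 2)).sum / history.size)

  // The 3-sigma threshold is an arbitrary choice for this sketch
  val isAnomaly = math.abs(latest - mean) > 3 * stdDev
  println(f"mean=$mean%.4f stddev=$stdDev%.4f latest=$latest%.4f anomaly=$isAnomaly")
}
```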

I will talk about more complex unstructured-data, graph, and pattern mining later, in Chapter 6, Working with Unstructured Data; Chapter 8, Integrating Scala with R and Python; and Chapter 9, NLP in Scala.

UI component

Well, UI is for wimps! Just joking...maybe that is too harsh, but in reality, UI is usually syntactic sugar that is necessary to convince the audience beyond the data scientists. A good analyst should probably be able to figure out t-test probabilities by just looking at a table of numbers.

However, one should probably apply the same methodologies we used at the beginning of the chapter, assessing the usefulness of the different components and the number of cycles put into them. The presence of a good UI is often justified, but it depends on the target audience.

First, there are a number of existing UIs and reporting frameworks. Unfortunately, most of them are not aligned with functional programming methodologies. Also, the presence of complex or semi-structured data, which I will describe in more detail in Chapter 6, Working with Unstructured Data, presents a new twist that many frameworks are not ready to deal with without implementing some kind of DSL. Here are a few frameworks for building the UI in a Scala project that I find particularly worthwhile:

Scala Swing

When used: If you have used Swing components in Java and are proficient with them, Scala Swing is a good choice for you. Swing is arguably the least portable component of Java, so your mileage may vary on different platforms.

Comments: The scala.swing package uses the standard Java Swing library under the hood, but it has some nice additions. Most notably, as it is made for Scala, it can be used in a much more concise way than standard Swing; a minimal example follows.
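As a quick illustration of that conciseness, a complete window with a single label takes only a few lines (the title and label text are placeholders):

```scala
import scala.swing._

// A complete, minimal Scala Swing application: one window with one label
object StatusWindow extends SimpleSwingApplication {
  def top: MainFrame = new MainFrame {
    title = "Data-driven system status"
    contents = new Label("All pipelines healthy")
  }
}
```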

Lift

When used: Lift is a secure, developer-centric, scalable, and interactive framework written in Scala. Lift is open sourced under the Apache 2.0 license.

Comments: The open source Lift framework was launched in 2007 by David Pollak, who was dissatisfied with certain aspects of the Ruby on Rails framework. Any existing Java library and web container can be used in running Lift applications. Lift web applications are thus packaged as WAR files and deployed on any servlet 2.4 engine (for example, Tomcat 5.5.xx, Jetty 6.0, and so on). Lift programmers may use the standard Scala/Java development toolchain, including IDEs such as Eclipse, NetBeans, and IDEA. Dynamic web content is authored via templates using standard HTML5 or XHTML editors. Lift applications also benefit from native support for advanced web development techniques such as Comet and Ajax.

Play

When used: Play is arguably better aligned with Scala as a functional language than any other platform; it is officially supported by Typesafe, the commercial company behind Scala. The Play Framework 2.0 builds on Scala, Akka, and sbt to deliver superior asynchronous request handling, fast and reliable type-safe templates, and a powerful build system with flexible deployment options. Play is open sourced under the Apache 2.0 license.

Comments: The open source Play framework was created in 2007 by Guillaume Bort, who sought to bring a fresh web development experience, inspired by modern web frameworks such as Ruby on Rails, to the long-suffering Java web development community. Play follows the familiar stateless model-view-controller architectural pattern, with a philosophy of convention over configuration and an emphasis on developer productivity. Unlike traditional Java web frameworks, with their tedious compile-package-deploy-restart cycles, updates to Play applications are instantly visible with a simple browser refresh. A minimal controller sketch follows.
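As a hedged sketch in the older object-based Play 2.x style (newer Play versions use dependency-injected controller classes instead), a minimal controller with a single endpoint might look like this; the controller name and route are hypothetical:

```scala
package controllers

import play.api.mvc._

// Responds to GET /status, wired up in conf/routes as:
//   GET  /status   controllers.Status.index
object Status extends Controller {
  def index = Action {
    Ok("model service is up")
  }
}
```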

Dropwizard

When used: The Dropwizard (www.dropwizard.io) project is an attempt to build a generic RESTful framework in both Java and Scala, even though one might end up using more Java than Scala. What is nice about this framework is that it is flexible enough to be used with arbitrarily complex data (including semi-structured data). It is licensed under the Apache License 2.0.

Comments: A RESTful API assumes state, while functional languages shy away from using state. Unless you are flexible enough to deviate from a pure functional approach, this framework is probably not a good fit for you.

Slick

When used: While Slick is not a UI component, it is Typesafe's modern database query and access library for Scala, which can serve as a backend for a UI. It allows you to work with stored data almost as if you were using Scala collections, while at the same time giving you full control over when database access occurs and what data is transferred. You can also use SQL directly. Use it if all of your data is purely relational. It is open sourced under a BSD-style license.

Comments: Slick was started in 2012 by Stefan Zeiger and is maintained mainly by Typesafe. It is useful mostly for relational data; a minimal query sketch follows.
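A minimal sketch of that collection-like style, assuming Slick 3 with an in-memory H2 database and a hypothetical table of model scores (on Slick 3.0/3.1 the import would be slick.driver.H2Driver.api._ instead):

```scala
import scala.concurrent.Await
import scala.concurrent.duration._
import slick.jdbc.H2Profile.api._

// Hypothetical table of model scores keyed by user id
class Scores(tag: Tag) extends Table[(Long, Double)](tag, "SCORES") {
  def userId = column[Long]("USER_ID", O.PrimaryKey)
  def score  = column[Double]("SCORE")
  def *      = (userId, score)
}

object SlickExample extends App {
  val scores = TableQuery[Scores]
  val db = Database.forURL("jdbc:h2:mem:test1", driver = "org.h2.Driver")

  val program = for {
    _    <- scores.schema.create
    _    <- scores += ((42L, 0.87))
    high <- scores.filter(_.score > 0.5).result  // reads like a Scala collection filter
  } yield high

  println(Await.result(db.run(program), 10.seconds))
}
```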

NodeJS

When used: Node.js is a JavaScript runtime built on Chrome's V8 JavaScript engine. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient. Node.js's package ecosystem, npm, is the largest ecosystem of open source libraries in the world. It is open sourced under the MIT License.

Comments: Node.js was first introduced in 2009 by Ryan Dahl and other developers working at Joyent. Originally Node.js supported only Linux, but it now runs on OS X and Windows.

AngularJS

When used: AngularJS (https://angularjs.org) is a frontend development framework built to simplify the development of single-page web applications. It is open sourced under the MIT License.

Comments: AngularJS was originally developed in 2009 by Misko Hevery at Brat Tech LLC. AngularJS is mainly maintained by Google and by a community of individual developers and corporations (support for IE8 was dropped in versions 1.3 and later).

Actions engine

While this is the heart of the data-driven pipeline, it is also arguably the easiest component. Once the system of metrics and values is known, the system decides, based on the known equations and the information provided, whether to take a certain set of actions. While threshold-based triggers are the most common implementation, probabilistic approaches that present the user with a set of possibilities and their associated probabilities are gaining significance, as is simply presenting the user with the top N relevant choices, as a search engine does.

The management of the rules might become pretty involved. It used to be that managing the rules with a rule engine, such as Drools (http://www.drools.org), was sufficient. However, managing complex rules becomes an issue that often requires the development of a DSL (Domain-Specific Languages, Martin Fowler, Addison-Wesley, 2010). Scala is a particularly fitting language for the development of such an actions engine; a minimal threshold-based sketch follows.
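As a minimal, hypothetical sketch of a threshold-based actions engine in plain Scala (the metric names, thresholds, and actions are invented for illustration), a rule set like the following could serve as the seed of such a DSL:

```scala
object ActionsEngine extends App {
  // Hypothetical metrics reported by the analytics layer
  case class Metrics(clickThroughRate: Double, errorRate: Double)

  sealed trait Action
  case object SendEmailAlert extends Action
  case object PauseCampaign  extends Action

  // A rule is a named predicate over the metrics plus the action it triggers
  case class Rule(name: String, condition: Metrics => Boolean, action: Action)

  val rules = Seq(
    Rule("low CTR",     m => m.clickThroughRate < 0.01, PauseCampaign),
    Rule("error spike", m => m.errorRate > 0.05,        SendEmailAlert)
  )

  def evaluate(m: Metrics): Seq[Action] =
    rules.collect { case r if r.condition(m) => r.action }

  // Only the CTR rule fires for this input
  println(evaluate(Metrics(clickThroughRate = 0.004, errorRate = 0.02)))
}
```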

Correlation engine

The more complex a decision-making system is, the more it requires a secondary decision-making system to optimize its management. DevOps is turning into DataOps (Getting Data Right by Michael Stonebraker et al., Tamr, 2015). Data collected about the performance of a data-driven system is used to detect anomalies and to drive semi-automated maintenance.

Models are often subject to time drift, where the performance might deteriorate either due to changes in the data collection layers or behavioral changes in the population (I will cover model drift in Chapter 10, Advanced Model Monitoring). Another aspect of model management is tracking model performance and, in some cases, using the "collective" intelligence of multiple models via various consensus schemes.
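As a toy sketch of such a secondary check, the snippet below compares a recent window of a hypothetical daily AUC metric against a baseline window and flags drift when the drop exceeds an arbitrary tolerance:

```scala
object DriftCheck extends App {
  // Hypothetical daily AUC values for a deployed model
  val baseline = Vector(0.81, 0.82, 0.80, 0.83, 0.81)  // reference window
  val recent   = Vector(0.76, 0.74, 0.75)              // most recent days

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // The tolerance is an arbitrary choice for this sketch
  val tolerance = 0.03
  val drifted   = mean(baseline) - mean(recent) > tolerance

  println(f"baseline=${mean(baseline)}%.3f recent=${mean(recent)}%.3f drift=$drifted")
}
```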

Monitoring

Monitoring a system involves collecting information about system performance for audit, diagnostic, or performance-tuning purposes. While it is related to the issues raised in the previous sections, a monitoring solution often incorporates diagnostics, historical storage, and persistence of critical data, like the black box on an airplane. In the Java, and thus Scala, world, a popular tool of choice is Java management beans (MBeans), which can be inspected with JConsole. While Java natively supports MBeans for exposing JVM information over JMX, Kamon (http://kamon.io) is an open source library that uses this mechanism to expose Scala- and Akka-specific metrics.
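As a minimal sketch of the MBean route (the class, attribute, and JMX domain names are hypothetical), a counter exposed over JMX and visible in JConsole might look like this; Kamon wraps the same mechanism with far less boilerplate:

```scala
import java.lang.management.ManagementFactory
import javax.management.ObjectName

// Standard MBean convention: an interface named <ImplementationClass>MBean
trait PipelineStatsMBean {
  def getRecordsProcessed: Long
}

class PipelineStats extends PipelineStatsMBean {
  @volatile var recordsProcessed: Long = 0L
  override def getRecordsProcessed: Long = recordsProcessed
}

object MonitoringExample extends App {
  val stats = new PipelineStats
  ManagementFactory.getPlatformMBeanServer
    .registerMBean(stats, new ObjectName("com.example.pipeline:type=PipelineStats"))

  // Simulate work so the attribute changes while you watch it in JConsole
  while (true) {
    stats.recordsProcessed += 1
    Thread.sleep(1000)
  }
}
```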

Some other popular open source monitoring solutions are Ganglia (http://ganglia.sourceforge.net/) and Graphite (http://graphite.wikidot.com).

I will stop here, as I will address system and model monitoring in more detail in Chapter 10, Advanced Model Monitoring.
