In short, a data-driven architecture contains the following components (at least, all the systems I've seen either have them or can be reduced to them):
Let's discuss each of the components in detail in the following sections.
With the proliferation of smart devices, information gathering has become less of a problem and more of a necessity for any business that does more than produce typewritten text. For the purposes of this chapter, I will assume that the device or devices are connected to the Internet, or have some other way of passing the information on, either by phoning home or over a direct network connection.
The major purpose of this component is to collect all the information that may be relevant for further data-driven decision making. The following table provides details on the most common implementations of the data ingest:
The transfer of information usually occurs in batches, or in micro-batches if the requirements are close to real time. The information usually ends up first in a file, traditionally called a log, in the device's local filesystem, and is then transferred to a central location. The more recently developed Kafka and Flume are often used to manage these transfers, together with the more traditional syslog, rsync, or netcat. Finally, the data can be placed into local or distributed storage such as HDFS, Cassandra, or Amazon S3.
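For illustration, here is a minimal sketch of the shipping step using the Kafka producer API; the broker address, topic name, and log file path are hypothetical and would differ in a real deployment:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogShipper {
  def main(args: Array[String]): Unit = {
    // Hypothetical broker address and serializers; adjust to your environment
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Read a local log file line by line and ship each line as an event
    scala.io.Source.fromFile("/var/log/device.log").getLines().foreach { line =>
      producer.send(new ProducerRecord[String, String]("device-logs", line))
    }
    producer.close()
  }
}
```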
After the data ends up in HDFS or other storage, it needs to be made available for processing. Traditionally, the data is processed on a schedule and ends up partitioned into time-based buckets. The processing can happen daily, hourly, or even on a sub-minute basis with the Spark Streaming framework, depending on the latency requirements. The processing may involve some preliminary feature construction or vectorization, even though this is traditionally considered a machine-learning task. The following table summarizes some of the available frameworks:
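As an illustration of a time-bucketed batch job, here is a minimal Spark sketch that reads one hypothetical hourly partition and computes per-device event counts; the path layout, field separator, and aggregation are assumptions made purely for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HourlyAggregates {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hourly-aggregates"))
    // Read one time-based partition of the ingested logs (hypothetical layout)
    val lines = sc.textFile("hdfs:///data/logs/2016/03/01/*")
    // Toy vectorization step: take a key (say, the device id in the first field)
    // and count the events per device for this bucket
    val counts = lines
      .map(_.split("\t"))
      .filter(_.nonEmpty)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///data/aggregates/2016/03/01")
    sc.stop()
  }
}
```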
Separate attention should be given to stream-processing frameworks, where the latency requirements are reduced to one or a few records at a time. First, stream processing usually requires many more resources dedicated to processing, as it is more expensive to process individual records than batches, even batches of only tens or hundreds of records. So, the architect needs to justify the additional cost by the value of a more recent result, which is not always warranted. Second, stream processing requires a few adjustments to the architecture, as handling the most recent data becomes a priority; for example, a lambda-style architecture, where the most recent data is handled by a separate substream or set of nodes, has recently become very popular with systems such as Druid (http://druid.io).
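A minimal micro-batch sketch with Spark Streaming is shown below; the 10-second batch interval and the socket source are assumptions for illustration, and a real pipeline would more likely consume from Kafka or Flume:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-counts").setMaster("local[2]")
    // Micro-batches of 10 seconds: the latency/cost trade-off discussed above
    val ssc = new StreamingContext(conf, Seconds(10))
    // Hypothetical socket source; in practice this would be a Kafka or Flume stream
    val lines = ssc.socketTextStream("localhost", 9999)
    // Per-batch counts keyed by the first tab-separated field
    val counts = lines.map(line => (line.split("\t")(0), 1L)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```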
For the purpose of this chapter, Machine Learning (ML) is any algorithm that can compute aggregates or summaries that are actionable. We will cover more complex algorithms from Chapter 3, Working with Spark and MLlib, to Chapter 6, Working with Unstructured Data, but in some cases a simple sliding-window average and the deviation from that average may be a sufficient signal for taking an action. In the past few years, "it just works in A/B testing" has somehow become a convincing argument for model building and deployment. I am not speculating on whether solid scientific principles apply, but many fundamental assumptions, such as i.i.d. observations, balanced designs, and thin tails, simply fail to hold in many big data situations. Simpler models tend to be faster and to have better performance and stability.
For example, in online advertising, one might just track the average performance of a set of ads over a set of similar properties over time to decide whether to keep displaying an ad. Information about anomalies, or deviations from previous behavior, may signal a new unknown unknown, indicating that the old data no longer applies; in this case, the system has no choice but to start a new exploration cycle.
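As a sketch of how little machinery such a signal requires, the following plain Scala snippet computes a sliding-window mean and standard deviation and flags values that deviate by more than k standard deviations; the window size, threshold, and sample CTR values are made up for illustration:

```scala
object SlidingWindowAnomaly {
  // Compute mean and standard deviation over a trailing window and flag
  // values that deviate from the window mean by more than k standard deviations
  def anomalies(values: Seq[Double], window: Int, k: Double = 3.0): Seq[(Double, Boolean)] =
    values.zipWithIndex.map { case (v, i) =>
      val slice = values.slice(math.max(0, i - window), i)
      if (slice.size < 2) (v, false)
      else {
        val mean = slice.sum / slice.size
        val variance = slice.map(x => (x - mean) * (x - mean)).sum / (slice.size - 1)
        (v, math.abs(v - mean) > k * math.sqrt(variance))
      }
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical hourly click-through rates; 0.045 should be flagged
    val ctr = Seq(0.011, 0.012, 0.010, 0.013, 0.011, 0.012, 0.045, 0.011)
    anomalies(ctr, window = 5).foreach { case (v, isAnomaly) =>
      println(f"$v%.3f anomaly=$isAnomaly")
    }
  }
}
```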
I will talk about more complex unstructured-data, graph, and pattern mining later in Chapter 6, Working with Unstructured Data, Chapter 8, Integrating Scala with R and Python, and Chapter 9, NLP in Scala.
Well, UI is for wimps! Just joking... maybe that's too harsh, but in reality, a UI is usually syntactic sugar that is necessary to convince the audience beyond the data scientists. A good analyst should probably be able to figure out t-test probabilities by just looking at a table of numbers.
However, one should probably apply the same methodology we used at the beginning of the chapter, assessing the usefulness of the different components and the number of cycles put into them. The presence of a good UI is often justified, but it depends on the target audience.
First, there are a number of existing UI and reporting frameworks. Unfortunately, most of them are not aligned with functional programming methodologies. Also, the presence of complex or semi-structured data, which I will describe in more detail in Chapter 6, Working with Unstructured Data, presents a new twist that many frameworks are not ready to deal with without implementing some kind of DSL. Here are a few frameworks for building the UI in a Scala project that I find particularly worthwhile:
| Framework | When used | Comments |
|---|---|---|
| Scala Swing | If you used Swing components in Java and are proficient with them, Scala Swing is a good choice for you. Swing is arguably the least portable component of Java, so your mileage may vary on different platforms. | |
| Lift | Lift is a secure, developer-centric, scalable, and interactive framework written in Scala. Lift is open sourced under the Apache 2.0 license. | The open source Lift framework was launched in 2007 by David Pollak, who was dissatisfied with certain aspects of the Ruby on Rails framework. Any existing Java library and web container can be used in running Lift applications. Lift web applications are thus packaged as WAR files and deployed on any servlet 2.4 engine (for example, Tomcat 5.5.xx, Jetty 6.0, and so on). Lift programmers may use the standard Scala/Java development toolchain, including IDEs such as Eclipse, NetBeans, and IDEA. Dynamic web content is authored via templates using standard HTML5 or XHTML editors. Lift applications also benefit from native support for advanced web development techniques, such as Comet and Ajax. |
| Play | Play is arguably better aligned with Scala as a functional language than any other framework; it is officially supported by Typesafe, the commercial company behind Scala. The Play Framework 2.0 builds on Scala, Akka, and sbt to deliver superior asynchronous request handling, fast and reliable type-safe templates, and a powerful build system with flexible deployment options. Play is open sourced under the Apache 2.0 license. | The open source Play Framework was created in 2007 by Guillaume Bort, who sought to bring a fresh web development experience, inspired by modern web frameworks such as Ruby on Rails, to the long-suffering Java web development community. Play follows the familiar stateless model-view-controller architectural pattern, with a philosophy of convention over configuration and an emphasis on developer productivity. Unlike traditional Java web frameworks with their tedious compile-package-deploy-restart cycles, updates to Play applications are instantly visible with a simple browser refresh. |
| Dropwizard | The Dropwizard (www.dropwizard.io) project is an attempt to build a generic RESTful framework in both Java and Scala, even though one might end up using more Java than Scala. What is nice about this framework is that it is flexible enough to be used with arbitrarily complex data (including semi-structured data). It is licensed under the Apache License 2.0. | A RESTful API assumes state, while functional languages shy away from using state. Unless you are flexible enough to deviate from a purely functional approach, this framework is probably not the right fit for you. |
| Slick | While Slick is not a UI component, it is Typesafe's modern database query and access library for Scala, which can serve as a UI backend. It allows you to work with stored data almost as if you were using Scala collections, while at the same time giving you full control over when a database access occurs and which data is transferred. You can also use SQL directly. Use it if all of your data is purely relational. It is open sourced under a BSD-style license. | Slick was started in 2012 by Stefan Zeiger and is maintained mainly by Typesafe. It is useful mostly for relational data. |
| Node.js | Node.js is a JavaScript runtime built on Chrome's V8 JavaScript engine. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient. Node.js' package ecosystem, npm, is the largest ecosystem of open source libraries in the world. It is open sourced under the MIT License. | Node.js was first introduced in 2009 by Ryan Dahl and other developers working at Joyent. Originally, Node.js supported only Linux, but it now runs on OS X and Windows as well. |
| AngularJS | AngularJS (https://angularjs.org) is a frontend development framework built to simplify the development of single-page web applications. It is open sourced under the MIT License. | AngularJS was originally developed in 2009 by Misko Hevery at Brat Tech LLC. AngularJS is mainly maintained by Google and by a community of individual developers and corporations (support for IE8 is dropped in versions 1.3 and later). |
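As a taste of the simplest option in the table, here is a minimal Scala Swing sketch (assuming the scala-swing module is on the classpath) that displays a couple of hypothetical metric values; any of the web frameworks above would serve the same purpose for a wider audience:

```scala
import scala.swing._

// A minimal dashboard window showing hypothetical metric values
object MetricsDashboard extends SimpleSwingApplication {
  def top: MainFrame = new MainFrame {
    title = "Model metrics"
    contents = new BoxPanel(Orientation.Vertical) {
      contents += new Label("Average CTR (last hour): 0.0123")
      contents += new Label("Anomalies detected: 2")
    }
  }
}
```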
While this is the heart of the data-oriented system pipeline, it is also arguably the easiest component. Once the system of metrics and values is known, the system decides, based on known equations, whether or not to take a certain set of actions. While threshold-based triggers are the most common implementation, probabilistic approaches that present the user with a set of possibilities and their associated probabilities are gaining significance, as is simply presenting the user with the top N relevant choices, as a search engine does.
The management of the rules might become pretty involved. It used to be that managing the rules with a rules engine, such as Drools (http://www.drools.org), was sufficient. However, managing complex rules often becomes an issue that requires the development of a DSL (Domain-Specific Languages, Martin Fowler, Addison-Wesley, 2010). Scala is a particularly fitting language for the development of such an actions engine.
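As a sketch of what such an actions engine might look like as a tiny internal Scala DSL, consider the following; the rule names, metrics, thresholds, and actions are invented purely for illustration:

```scala
// A toy internal DSL for threshold-based action rules (names are illustrative)
case class Rule(name: String, condition: Map[String, Double] => Boolean, action: () => Unit)

class RulesEngine(rules: Seq[Rule]) {
  // Evaluate every rule against the current metrics and fire the matching actions
  def evaluate(metrics: Map[String, Double]): Unit =
    rules.filter(_.condition(metrics)).foreach(_.action())
}

object ActionsEngineExample {
  def main(args: Array[String]): Unit = {
    val rules = Seq(
      Rule("pause-ad",
        metrics => metrics.getOrElse("ctr", 0.0) < 0.002,
        () => println("Action: pause the under-performing ad")),
      Rule("alert-anomaly",
        metrics => metrics.getOrElse("anomalyScore", 0.0) > 3.0,
        () => println("Action: notify the on-call analyst"))
    )
    new RulesEngine(rules).evaluate(Map("ctr" -> 0.0015, "anomalyScore" -> 1.2))
  }
}
```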
The more complex the decision-making system is, the more it requires a secondary decision-making system to optimize its management. DevOps is turning into DataOps (Getting Data Right, Michael Stonebraker et al., Tamr, 2015). The data collected about the performance of a data-driven system is used to detect anomalies and to drive semi-automated maintenance.
Models are often subject to time drift, where performance might deteriorate either due to changes in the data collection layers or due to behavioral changes in the population (I will cover model drift in Chapter 10, Advanced Model Monitoring). Another aspect of model management is tracking model performance and, in some cases, using the "collective" intelligence of multiple models through various consensus schemes.
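As a minimal illustration of a consensus scheme, the following snippet takes a simple majority vote over the binary predictions of several hypothetical models:

```scala
object ModelConsensus {
  // Majority vote over binary predictions from several models; ties resolve to false
  def majorityVote(predictions: Seq[Boolean]): Boolean =
    predictions.count(identity) * 2 > predictions.size

  def main(args: Array[String]): Unit = {
    // Hypothetical per-model predictions for a single instance
    val votes = Seq(true, false, true)
    println(s"Consensus decision: ${majorityVote(votes)}")
  }
}
```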
Monitoring a system involves collecting information about system performance for audit, diagnostic, or performance-tuning purposes. While it is related to the issues raised in the previous sections, a monitoring solution often incorporates diagnostics, historical storage, and the persistence of critical data, much like a black box on an airplane. In the Java, and thus Scala, world, a popular tool of choice is Java performance beans, which can be monitored in the Java Console. While Java natively supports MBeans for exposing JVM information over JMX, Kamon (http://kamon.io) is an open source library that uses this mechanism to expose Scala- and Akka-specific metrics.
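For illustration, the following sketch exposes a custom application metric over JMX using only the standard JDK MBean API, so it becomes visible in the Java Console; the bean name and the metric itself are hypothetical:

```scala
import java.lang.management.ManagementFactory
import javax.management.ObjectName

// A standard MBean interface: by JMX convention its name must be the class name plus "MBean"
trait ModelMetricsMBean {
  def getPredictionsPerSecond: Double
}

class ModelMetrics extends ModelMetricsMBean {
  @volatile var predictionsPerSecond: Double = 0.0
  override def getPredictionsPerSecond: Double = predictionsPerSecond
}

object MonitoringExample {
  def main(args: Array[String]): Unit = {
    val metrics = new ModelMetrics
    // Register the bean so it shows up in jconsole under the given (hypothetical) name
    ManagementFactory.getPlatformMBeanServer
      .registerMBean(metrics, new ObjectName("com.example:type=ModelMetrics"))
    metrics.predictionsPerSecond = 42.0
    Thread.sleep(60000) // keep the JVM alive so the metric can be inspected
  }
}
```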
Some other popular monitoring open source solutions are Ganglia (http://ganglia.sourceforge.net/) and Graphite (http://graphite.wikidot.com).
I will stop here, as I will address system and model monitoring in more detail in Chapter 10, Advanced Model Monitoring.