Building a real-time application differs from batch processing in terms of the architecture and the components involved. While the latter can easily be built bottom-up, where programmers add functionality and components as needed, the former usually needs to be built top-down, with a solid architecture in place from the start. In fact, due to the constraints of volume and velocity (and, in a streaming context, veracity), an inadequate architecture will prevent programmers from adding new functionality. One always needs a clear understanding of how streams of data are interconnected, and how and where they are processed, cached, and retrieved.
In terms of stream processing using Apache Spark, there are two emerging architectures that should be considered: Lambda architecture and Kappa architecture. Before we delve into the details of the two architectures, let's discuss the problems they are trying to solve, what they have in common, and in what context you would use each.
For years, engineers working on highly distributed systems have been concerned with handling network outages. The following scenario is of particular interest:
In normal operation, users of a typical distributed system perform actions while the system uses techniques such as replication, caching, and indexing to ensure correctness and timely responses. But what happens when something goes wrong?
Here, a network outage has effectively prevented users from performing their actions safely. A simple network failure causes a complication that affects not only function and performance, as you might expect, but also the correctness of the system.
In fact, the system now suffers from what is known as split brain syndrome. In this situation, the two parts of the system are no longer able to talk to each other, so any modifications performed by users on one side are not visible on the opposite side. It's almost like there are two separate systems, each maintaining its own internal state, which would become quite different over time. Crucially, a user may receive different answers when running the same query on either side.
This is but one example of the general class of failures within a distributed system, and although much time has been devoted to solving these problems, there are still only three practical approaches:
The preceding conjecture is stated more formally as the CAP theorem (http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html). It reasons that, in an environment where failures are a fact of life and you cannot sacrifice functionality (1), you must choose between having consistent answers (2) or full capability (3). You cannot have both; it's a trade-off.
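To make the trade-off concrete, here is a toy sketch of a single replica that, once partitioned from its peers, must either refuse requests (consistency) or answer from possibly-stale local state (availability). The `Replica` and `Mode` types are invented for illustration; this is not a real replication protocol.

```scala
// A toy replica illustrating the CAP trade-off: during a network partition
// it must either refuse requests (consistency) or answer from possibly
// stale local state (availability).
object CapDemo {

  sealed trait Mode
  case object PreferConsistency extends Mode
  case object PreferAvailability extends Mode

  class Replica(mode: Mode) {
    private var store = Map.empty[String, String]
    var partitioned = false // true when this node cannot reach its peers

    def read(key: String): Either[String, Option[String]] =
      if (partitioned && mode == PreferConsistency)
        Left("unavailable: cannot confirm value with peers") // sacrifice availability
      else
        Right(store.get(key)) // may be stale during a partition: sacrifice consistency

    def write(key: String, value: String): Either[String, Unit] =
      if (partitioned && mode == PreferConsistency)
        Left("unavailable: cannot replicate write")
      else {
        store += key -> value // accepted locally; peers diverge until the partition heals
        Right(())
      }
  }

  def main(args: Array[String]): Unit = {
    val replica = new Replica(PreferAvailability)
    replica.write("user42", "logged-in")
    replica.partitioned = true
    println(replica.read("user42")) // Right(Some(logged-in)): available, possibly stale
  }
}
```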
Needless to say, this categorization is a simplification, but nonetheless most data processing systems fit into one of these broad categories in the event of a failure. Furthermore, it turns out that most traditional database systems favor consistency, achieving it using well-understood computer science methods such as transactions, write-ahead logs, and pessimistic locking.
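As an illustration of this consistency-first style, here is a minimal sketch of a funds transfer performed inside a transaction with pessimistic locking over plain JDBC. The connection string, credentials, and the `accounts` table are assumptions for the example, and the `FOR UPDATE` clause assumes a PostgreSQL-style dialect.

```scala
// A minimal sketch of consistency-first techniques: an explicit transaction
// with pessimistic row locking via JDBC.
import java.sql.DriverManager

object PessimisticTransfer {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/bank", "user", "pass")
    conn.setAutoCommit(false) // start an explicit transaction
    try {
      // SELECT ... FOR UPDATE takes a row lock, blocking concurrent writers
      // until we commit or roll back: classic pessimistic locking.
      val rs = conn.createStatement()
        .executeQuery("SELECT balance FROM accounts WHERE id = 1 FOR UPDATE")
      rs.next()
      val balance = rs.getBigDecimal("balance")

      val update = conn.prepareStatement("UPDATE accounts SET balance = ? WHERE id = 1")
      update.setBigDecimal(1, balance.subtract(new java.math.BigDecimal(100)))
      update.executeUpdate()

      conn.commit() // the database's write-ahead log makes the change durable
    } catch {
      case e: Exception =>
        conn.rollback() // on failure, the transaction leaves no partial state behind
        throw e
    } finally conn.close()
  }
}
```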
However, in today's online world, where users expect 24/7 access to services, many of which are revenue-generating, and where the Internet of Things and real-time decision making are increasingly common, a scalable, fault-tolerant approach is required. Consequently, there has been a surge in efforts to produce alternatives that ensure availability in the event of failure (indeed, the Internet itself was born from this very need).
It turns out that building highly available systems that also provide an acceptable level of consistency is a challenge. In order to manage the necessary trade-offs, such approaches tend to adopt weaker definitions of consistency, that is, eventual consistency, where stale data is usually tolerated for a short while and the correct data is agreed upon over time. Yet even with this compromise, they still require far more complicated techniques, and hence they are more difficult to build and maintain.
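One common (and deliberately simplistic) way to agree on the correct data over time is a last-write-wins register. The following toy sketch, with an invented `Versioned` type, shows how two diverged replicas converge once they can exchange state again:

```scala
// A toy last-write-wins register: a simple route to eventual consistency.
// Once partitioned replicas exchange state again, both converge on the
// value with the highest timestamp.
object LastWriteWins {

  case class Versioned(value: String, timestamp: Long)

  // merge is commutative, associative, and idempotent, so replicas can
  // exchange state in any order and still agree eventually
  def merge(a: Versioned, b: Versioned): Versioned =
    if (a.timestamp >= b.timestamp) a else b

  def main(args: Array[String]): Unit = {
    val replicaA = Versioned("status=shipped", timestamp = 1001L) // written during the partition
    val replicaB = Versioned("status=pending", timestamp = 1000L) // stale value on the other side

    // after the partition heals, both sides apply the same merge
    val converged = merge(replicaA, replicaB)
    println(converged) // Versioned(status=shipped,1001): stale data tolerated, then corrected
  }
}
```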
Both the Lambda and Kappa architectures provide simpler solutions to the previously described problems. They advocate the use of modern big data technologies, such as Apache Spark and Apache Kafka, as the basis for consistent, available processing systems, where logic can be developed without the need to reason about failure. They are applicable in situations with the following characteristics:
Where you have these conditions, you can consider either architecture as a general candidate. Each adheres to the following core principles that help simplify issues around data consistency, concurrent access, and prevention of data corruption:
It's these principles that form the basis of their eventually consistent solutions without the need to worry about complexities such as read repairs or vector clocks; they're definitely more developer-friendly architectures!
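As a small illustration of the immutable, append-only log of raw data that both architectures build on, here is a minimal sketch using the Kafka producer API. The broker address, the `user-events` topic, and the event payload are assumptions for the example.

```scala
// A minimal sketch of appending immutable events to a Kafka log.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventLogWriter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Events are appended, never updated in place: downstream views are
      // derived by replaying the log, so corrupt state can always be rebuilt.
      val event = new ProducerRecord[String, String](
        "user-events", "user42", """{"action":"click","ts":1700000000}""")
      producer.send(event).get() // block until the broker acknowledges the append
    } finally producer.close()
  }
}
```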
So, let's discuss some of the reasons to choose one over the other. Let's first consider the Lambda architecture.
The Lambda architecture, as first proposed by Nathan Marz, typically looks something like this:
In essence, data is dual-routed into two layers:
The Serving layer is then used to merge these two views of the data, producing a single, up-to-date version of the truth.
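As a rough illustration, here is a minimal sketch of such a serving-layer merge in Spark. The view paths and the page/count schema are assumptions, and a real implementation would typically serve the merged view from a low-latency store rather than recompute it per query.

```scala
// A minimal sketch of the Lambda serving layer: merge a precomputed batch
// view with a recent speed view into one up-to-date result.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ServingLayer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lambda-serving").getOrCreate()

    // Batch layer output: accurate counts recomputed over all historical data
    val batchView = spark.read.parquet("/views/batch/pageviews") // columns: page, count

    // Speed layer output: counts covering only events since the last batch run
    val speedView = spark.read.parquet("/views/speed/pageviews") // columns: page, count

    // Serving layer: combine the two views into a single version of the truth
    val merged = batchView.union(speedView)
      .groupBy("page")
      .agg(sum("count").as("count"))

    merged.show()
    spark.stop()
  }
}
```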
In addition to the previously described general characteristics, Lambda architecture is most suitable when you have either of the following specific conditions:
Where you have either one of these conditions, you should consider using the Lambda architecture. However, before going ahead, be aware that it brings with it the following qualities that may present challenges:
Despite these challenges, the Lambda architecture is a robust and useful approach that has been implemented successfully in many institutions and organizations, including Yahoo!, Netflix, and Twitter.
The Kappa architecture takes simplification one step further by putting the concept of a distributed log at its center. This allows the removal of the batch layer altogether and consequently creates a vastly simpler design. There are many different implementations of Kappa, but generally it looks like this:
In this architecture, the distributed log essentially provides the characteristics of data immutability and replayability. By introducing a mutable state store in the processing layer, it unifies the computational model, treating all processing as stream processing; even batch is regarded as just a special case of streaming. The Kappa architecture is most suitable when you have either of the following specific conditions:
If either one of these options is viable, then Kappa architecture should provide a modern, scalable approach to meet your batch and streaming requirements. However, it's worth considering the constraints and challenges of the technologies chosen for any implementation you may decide on. The potential limitations include:
In the next chapter, we will discuss incremental iterative algorithms, how data skew or server failures affect consistency, and how the back-pressure features in Spark Streaming can help reduce failures. With regard to what has been explained in this section, we will build our classification system following a Kappa architecture.
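To close this section, here is a minimal sketch of what such a Kappa pipeline can look like using Kafka and Spark Structured Streaming. The topic name, broker address, checkpoint path, and the simple running count are assumptions for illustration, not the classification system itself.

```scala
// A minimal sketch of the Kappa pattern: all processing, batch included,
// is expressed as one streaming job that reads a replayable Kafka log and
// keeps its state in the processing layer.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KappaPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kappa-pipeline").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "user-events")
      .option("startingOffsets", "earliest") // replayability: reprocess history from the beginning
      .load()

    // Running counts per key; the state store in the processing layer keeps them current
    val counts = events
      .select(col("key").cast("string"))
      .groupBy("key")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", "/tmp/kappa-checkpoint") // state survives restarts
      .start()

    query.awaitTermination()
  }
}
```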