Designing a Spark Streaming application

Building a real-time application differs from batch processing in terms of architecture and the components involved. While the latter can easily be built bottom-up, with programmers adding functionality and components as needed, the former usually needs to be built top-down, with a solid architecture in place from the start. In fact, due to the constraints of volume and, especially in a streaming context, velocity, an inadequate architecture will prevent programmers from adding new functionality. One always needs a clear understanding of how streams of data are interconnected, and how and where they are processed, cached, and retrieved.

A tale of two architectures

In terms of stream processing using Apache Spark, there are two emerging architectures that should be considered: Lambda architecture and Kappa architecture. Before we delve into the details of the two architectures, let's discuss the problems they are trying to solve, what they have in common, and in what context you would use each.

The CAP theorem

For years, engineers working on highly-distributed systems have been concerned with handling network outages. Consider the following scenario of particular interest:

Figure 4: Distributed system outage

In normal operation, a typical distributed system lets users perform actions while it uses techniques such as replication, caching, and indexing to ensure correctness and a timely response. But what happens when something goes wrong?

Figure 5: Distributed system split brain syndrome

Here, a network outage has effectively prevented users from performing their actions safely. A simple network failure causes a complication that affects not only the function and performance of the system, as you might expect, but also its correctness.

In fact, the system now suffers from what is known as split brain syndrome. In this situation, the two parts of the system are no longer able to talk to each other, so any modifications performed by users on one side are not visible on the other. It's almost as if there are two separate systems, each maintaining its own internal state, which would become quite different over time. Crucially, a user may receive different answers when running the same query on either side.

This is but one example in the general case of failure within a distributed system, and although much time has been devoted to solving these problems, there are still only three practical approaches:

  1. Prevent users from making any updates until the underlying problem is resolved and in the meantime preserve the current state of the system (last known state before failure) as correct (that is, sacrifice partition tolerance).
  2. Allow users to continue doing updates as before, but accept that answers may be different and will have to converge at some point when the underlying problem is corrected (that is, sacrifice consistency).
  3. Shift all users onto one part of the system and allow them to continue doing updates as before. The other part of the system is treated as failed and a partial reduction of processing power is accepted until the problem is resolved - the system may become less responsive as a result (that is, sacrifice availability).

The preceding conjecture is more formally stated as the CAP theorem (http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html). It reasons that, in an environment where failures are a fact of life and you cannot sacrifice functionality (1), you must choose between having consistent answers (2) or full capability (3). You cannot have both; it's a trade-off.

Tip

In fact, it's more correct here to describe "failures" by the more general term "partition tolerance", as this type of failure can refer to any division of the system - a network outage, a server reboot, a full disk, and so on - it is not necessarily a network problem specifically.

Needless to say, this is a simplification, but nonetheless most data processing systems will fit into one of these broad categories in the event of a failure. Furthermore, it turns out that most traditional database systems favor consistency, achieving this using well-understood computer science methods such as transactions, write-ahead logs, and pessimistic locking.

However, in today's online world - where users expect 24/7 access to services, many of which are revenue-generating or support the Internet of Things and real-time decision making - a scalable, fault-tolerant approach is required. Consequently, there has been a surge in efforts to produce alternatives that ensure availability in the event of failure (indeed, the Internet itself was born from this very need).

It turns out that striking the right balance between building highly-available systems and providing an acceptable level of consistency is a challenge. In order to manage the necessary trade-offs, such approaches tend to provide weaker definitions of consistency, that is, eventual consistency, where stale data is usually tolerated for a short while and the correct data is agreed upon over time. Yet even with this compromise, they still require far more complicated techniques, and hence they are more difficult to build and maintain.

Tip

More onerous implementations involve techniques such as vector clocks and read-repair in order to handle concurrency and prevent data corruption.
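
To give a flavour of why such techniques add complexity, the following is a minimal, illustrative Scala sketch of a vector clock; the replica names and counts are purely hypothetical, and real stores wrap considerably more machinery around this idea.

object VectorClocks {
  // A vector clock maps each replica id to the number of updates that replica has applied
  type VectorClock = Map[String, Long]

  // Merge two clocks by taking the per-replica maximum (as done during read-repair)
  def merge(a: VectorClock, b: VectorClock): VectorClock =
    (a.keySet ++ b.keySet)
      .map(k => k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L)))
      .toMap

  // a happened before b if every entry of a is <= b's and at least one is strictly smaller
  def happenedBefore(a: VectorClock, b: VectorClock): Boolean = {
    val keys = a.keySet ++ b.keySet
    keys.forall(k => a.getOrElse(k, 0L) <= b.getOrElse(k, 0L)) &&
      keys.exists(k => a.getOrElse(k, 0L) < b.getOrElse(k, 0L))
  }

  // Two updates are concurrent - a potential conflict - if neither happened before the other
  def concurrent(a: VectorClock, b: VectorClock): Boolean =
    !happenedBefore(a, b) && !happenedBefore(b, a)

  def main(args: Array[String]): Unit = {
    val updateSeenByReplicaA = Map("A" -> 2L, "B" -> 1L)
    val updateSeenByReplicaB = Map("A" -> 1L, "B" -> 2L)
    println(concurrent(updateSeenByReplicaA, updateSeenByReplicaB)) // true: a conflict to resolve
    println(merge(updateSeenByReplicaA, updateSeenByReplicaB))      // Map(A -> 2, B -> 2)
  }
}

Even this toy version pushes conflict detection and resolution into application code, which is exactly the burden the architectures described next aim to avoid.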

The Greeks are here to help

Both Lambda and Kappa architectures provide simpler solutions to the previously described problems. They advocate the use of modern big data technologies, such as Apache Spark and Apache Kafka, as the basis for consistent, available processing systems, where logic can be developed without the need to reason about failure. They are applicable in situations with the following characteristics:

  • An unbounded, inbound stream of information, potentially from multiple sources
  • Analytical processing over a very large, cumulative dataset
  • User queries with time-based guarantees on data consistency
  • Zero-tolerance for degradation of performance or downtime

Where you have these conditions, you can consider either architecture as a general candidate. Each adheres to the following core principles that help simplify issues around data consistency, concurrent access, and prevention of data corruption:

  • Data immutability: Data is only ever created or read. It is never updated or deleted. Treating data this way greatly simplifies the model required to keep your data consistent.
  • Human fault tolerance: When fixing or upgrading software during the normal course of the software development lifecycle, it is often necessary to deploy new versions of analytics and replay historical data through the system in order to produce revised answers. Indeed, when managing systems that deal directly with data, this capability is often critical. The batch layer provides a durable store of historical data and hence allows any mistakes to be recovered from. A minimal sketch of these two principles follows this list.
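
The following Scala sketch, using purely illustrative event types and amounts, shows the idea: events are immutable facts, a correction is simply another event, and the current state is always re-derived by replaying the log, so redeploying fixed logic over the historical data yields revised answers.

// Events are immutable facts; a correction is a new event, never an in-place update
sealed trait AccountEvent { def accountId: String }
case class Deposited(accountId: String, amount: BigDecimal) extends AccountEvent
case class Withdrawn(accountId: String, amount: BigDecimal) extends AccountEvent

object ImmutableLedger {
  // Current state is derived by folding over the full event history, so replaying
  // the log through new (fixed) logic produces revised answers - human fault tolerance
  def balance(accountId: String, log: Seq[AccountEvent]): BigDecimal =
    log.filter(_.accountId == accountId).foldLeft(BigDecimal(0)) {
      case (acc, Deposited(_, amount)) => acc + amount
      case (acc, Withdrawn(_, amount)) => acc - amount
    }

  def main(args: Array[String]): Unit = {
    val log = Seq(
      Deposited("acc-1", BigDecimal(100)),
      Withdrawn("acc-1", BigDecimal(30)),
      Deposited("acc-2", BigDecimal(50)))
    println(balance("acc-1", log)) // 70
  }
}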

It's these principles that form the basis of their eventually-consistent solutions without the need to worry about complexities such as read-repairs or vector-clocks; they're definitely more developer-friendly architectures!

So, let's discuss some of the reasons to choose one over the other. Let's first consider the Lambda architecture.

Importance of the Lambda architecture

The Lambda architecture, as first proposed by Nathan Marz, typically looks something like this:

Figure 6: Lambda architecture

In essence, data is dual-routed into two layers:

  • A Batch layer capable of computing a snapshot at a given point in time
  • A Real-Time layer capable of processing incremental changes since the last snapshot

The Serving layer is then used to merge these two views of the data together, producing a single, up-to-date version of the truth.
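
To make that merge concrete, here is a minimal Spark sketch assuming a simple counting use case; the keys, totals, and view contents are purely illustrative, and in a real deployment the batch view would be loaded from durable storage and the real-time view from the speed layer's state.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LambdaServingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lambda-serving-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Batch view: pre-computed totals per key as of the last snapshot (illustrative data)
    val batchView = Seq(("user-1", 100L), ("user-2", 42L)).toDF("key", "total")

    // Real-time view: incremental totals observed since that snapshot (illustrative data)
    val realtimeView = Seq(("user-1", 3L), ("user-3", 7L)).toDF("key", "total")

    // Serving layer: merge the two views into a single, up-to-date answer
    val merged = batchView.union(realtimeView)
      .groupBy("key")
      .agg(sum("total").as("total"))

    merged.show()
    spark.stop()
  }
}

The merge is trivial here because the aggregation is additive; anything more involved is where the serving layer complexity discussed below starts to appear.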

In addition to the previously described general characteristics, Lambda architecture is most suitable when you have either of the following specific conditions:

  • Complex or time-consuming bulk or batch algorithms that have no equivalent or alternative incremental iterative algorithm (and approximations are not acceptable), so you need a batch layer.
  • Guarantees on data consistency cannot be met by the batch layer alone, regardless of parallelism of the system, so you need a real-time layer. For example, you have:
    • Low latency write-reads
    • Arbitrarily wide ranges of data, that is, years
    • Heavy data skew

Where you have either one of these conditions, you should consider using the Lambda architecture. However, before going ahead, be aware that it brings with it the following qualities that may present challenges:

  • Two data pipelines: There are separate workflows for batch and stream processing and, although where possible you can attempt to reuse core logic and libraries, the flows themselves must be managed individually at runtime.
  • Complex code maintenance: For all but simple aggregations, the algorithms in the batch and real time layers will need to be different. This is particularly true for machine learning algorithms, where there is an entire field devoted to this study called online machine learning (https://en.wikipedia.org/wiki/Online_machine_learning), which can involve implementing incremental iterative algorithms, or approximation algorithms, outside of existing frameworks.
  • Increased complexity in the serving layer: Aggregations, unions, and joins are necessary in the serving layer in order to merge deltas with aggregations. Engineers should be careful that this complexity does not spill over into consuming systems.

Despite these challenges, the Lambda architecture is a robust and useful approach that has been implemented successfully in many institutions and organizations, including Yahoo!, Netflix, and Twitter.

Importance of the Kappa architecture

The Kappa architecture takes simplification one step further by putting the concept of a distributed log at its center. This allows the removal of the batch layer altogether and consequently creates a vastly simpler design. There are many different implementations of Kappa, but generally it looks like this:

Figure 7: Kappa architecture

In this architecture, the distributed log essentially provides the characteristics of data immutability and re-playability. By introducing the concept of a mutable state store in the processing layer, it unifies the computation model by treating all processing as stream processing; even batch is considered just a special case of a stream. Kappa architecture is most suitable when you have either of the following specific conditions:

  • Guarantees on data consistency can be met using the existing batch algorithm by increasing parallelism of the system to reduce latency
  • Guarantees on data consistency can be met by implementing incremental iterative algorithms

If either one of these options is viable, then Kappa architecture should provide a modern, scalable approach to meet your batch and streaming requirements. However, it's worth considering the constraints and challenges of the technologies chosen for any implementation you may decide on. The potential limitations include:

  • Exactly-once semantics: Many popular distributed messaging systems, such as Apache Kafka, don't currently support exactly-once message delivery semantics. This means that, for now, consuming systems have to deal with receiving data duplicates themselves. This is typically done by using checkpoints, unique keys, idempotent writes, or other such de-duplication techniques, but it does increase complexity and hence makes the solution more difficult to build and maintain.
  • Out of order event handling: Many streaming implementations, such as Apache Spark, do not currently support updates ordered by the event time; instead, they use the processing time, that is, the time the event was first observed by the system. Consequently, updates could be received out of order and the system needs to be able to handle this. Again, this increases code complexity and makes the solution more difficult to build and maintain.
  • No strong consistency, that is, linearizability: As all updates are applied asynchronously, there are no guarantees that a write will take effect immediately (although the system will be eventually consistent). This means that, in some circumstances, you would not immediately be able to "read your writes".
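
As a rough illustration of the processing layer described above, here is a minimal sketch using the Spark Streaming DStream API with Kafka as the distributed log, maintaining a running count per key as mutable state. The broker address, topic name, group id, and the assumption that the record key carries an entity identifier are all hypothetical, and, in line with the limitations just listed, de-duplication and event-time handling are left to the application.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KappaProcessingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kappa-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/kappa-checkpoint") // durable state and progress, required for stateful ops

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",        // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kappa-sketch",
      "auto.offset.reset" -> "earliest",               // replay the log from the beginning
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // The distributed log: a hypothetical Kafka topic named "events"
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("events"), kafkaParams))

    // Mutable state in the processing layer: a running count per record key.
    // Duplicates are not filtered here; an idempotent or de-duplicating sink would handle them.
    val counts = stream
      .map(record => (record.key, 1L))
      .updateStateByKey[Long]((newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

In this model, a batch recomputation is simply the same job replayed from the earliest offset of the log.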

In the next chapter, we will discuss incremental iterative algorithms, how data skew or server failures affect consistency, and how the back-pressure features in Spark Streaming can help reduce failures. With regard to what has been explained in this section, we will build our classification system following the Kappa architecture.
