Chapter 5. Real-World Systems

Fast data architectures raise the bar for the “ilities” of distributed data processing. Whereas batch jobs seldom run for more than a few hours, a streaming pipeline is designed to run for weeks, months, even years. If you wait long enough, even the most obscure problem will eventually occur.

The umbrella term reactive systems embodies the qualities that real-world systems must meet. These systems must be:

Responsive

The system always responds in a timely manner, even if that response is a notice that full service isn’t available due to some failure.

Resilient

The system is resilient against failure of any one component, such as server crashes, hard drive failures, or network partitions. Replication prevents data loss and enables a service to keep going using the remaining instances. Isolation prevents cascading failures.

Elastic

You can expect the load to vary considerably over the lifetime of a service. Dynamic, automatic scalability, both up and down, allows you to handle heavy loads while avoiding underutilized resources in less busy times.

Message driven

While fast data architectures are obviously focused on data, here we mean that all services respond to directed commands and queries. Furthermore, they use messages to send commands and queries to other services as well (a minimal sketch follows).
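To make the message-driven quality concrete, here is a minimal sketch in Scala. The Message hierarchy and ReadingService are hypothetical names invented for illustration; a real system would deliver these messages through an actor or messaging library rather than a direct method call:

    // Commands and queries modeled as explicit, immutable messages
    // (hypothetical names, for illustration only).
    sealed trait Message
    final case class StoreReading(deviceId: String, value: Double) extends Message
    final case class GetLatest(deviceId: String) extends Message

    // A service that reacts only to the messages it receives; callers never
    // touch its internals directly, which keeps services decoupled.
    class ReadingService {
      private var latest = Map.empty[String, Double]

      def receive(msg: Message): Option[Double] = msg match {
        case StoreReading(id, v) =>
          latest += id -> v // record the latest reading for the device
          None
        case GetLatest(id) =>
          latest.get(id) // answer the query from local state
      }
    }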

Classic big data systems, focused on batch and offline interactive workloads, have had less need to meet these qualities. Fast data architectures are just like other online systems where these qualities are necessary to avoid costly downtime and data loss. If you come from a big data engineering background, you are suddenly forced to learn new skills for distributed systems programming and operations.

Some Specific Recommendations

Most of the components we’ve discussed previously support the reactive qualities to one degree or another. Of course, you should follow all of the usual recommendations about good management and monitoring tools, disaster recovery plans, and so on, which I won’t repeat here. That being said, here are some specific recommendations:

  • Ingest all inbound data into Kafka first, then consume it with the stream processors and microservices (see the producer sketch after this list). You get durable, scalable, resilient storage. You get support for multiple, decoupled consumers, replay capabilities, and the simplicity and power of event log semantics and topic organization as the backplane of your architecture.

  • For the same reasons, write processed data back to Kafka for consumption by downstream services. Avoid direct connections between services, which are less resilient, unless latency requirements demand them.

  • When you must use direct connections between microservices, use libraries that implement the Reactive Streams standard, for the resiliency provided by back pressure as a flow-control mechanism (see the Akka Streams sketch after this list).

  • Deploy to Kubernetes, Mesos, YARN, or a similar resource management infrastructure with proven scalability, resiliency, and flexibility. I don’t recommend Spark’s standalone-mode deployments, except for relatively simple installations that aren’t mission critical, because Spark provides only limited support for these features.

  • Choose your databases and other persistence stores wisely. Are they easy to manage? Do they provide distributed scalability? How resilient against data loss and service disruption are they when components fail? Understand the CAP trade-offs you need and how well they are supported by your databases. Should you really be using a relational database? I’m surprised how many people jump through hoops to implement transactions themselves because their NoSQL database doesn’t provide them.

  • Seek professional production support for your environment, even when using open source solutions. It’s cheap insurance and it saves you time (which equals money).
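To illustrate the first recommendation, here is a minimal sketch of ingesting inbound events into Kafka using the standard kafka-clients producer API from Scala. The broker address, topic name, key, and payload are assumptions made for the example:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    object IngestToKafka extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092") // assumed broker address
      props.put("key.serializer", classOf[StringSerializer].getName)
      props.put("value.serializer", classOf[StringSerializer].getName)
      // acks=all waits for all in-sync replicas: a durability/latency trade-off.
      props.put("acks", "all")

      val producer = new KafkaProducer[String, String](props)
      // Every inbound event lands in the log first; stream processors and
      // microservices then consume the topic independently and can replay it.
      producer.send(new ProducerRecord("inbound-events", "device-42", """{"temp":21.5}"""))
      producer.close()
    }

The same producer pattern serves the second recommendation as well: a service writes its results to another topic instead of calling its downstream consumers directly.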
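To illustrate the Reactive Streams recommendation, here is a minimal back-pressure sketch using Akka Streams, one library that implements the standard. The stream shape and the rate chosen are assumptions for the example:

    import scala.concurrent.duration._
    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    object BackPressureDemo extends App {
      implicit val system: ActorSystem = ActorSystem("backpressure-demo")

      // A fast upstream source of events.
      Source(1 to 1000)
        // A deliberately slow stage: at most 10 elements per second.
        // Back pressure propagates this rate upstream, so the source emits
        // only as fast as the consumer can absorb, with no unbounded buffering.
        .throttle(10, 1.second)
        .runWith(Sink.foreach(n => println(s"processed $n")))
    }

If the downstream stage slows further or stops, the demand signal changes immediately. Contrast this with a blind push model, where a slow consumer forces the producer to buffer without bound, drop data, or crash.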
