Chapter 23. Cloud Processing Is Not About Speed

Rustem Feyzkhanov

Data and machine learning processing pipelines used to be about speed. Now we live in a world of public cloud technologies, where any company can procure additional resources in a matter of seconds. This fact changes our perspective on the way processing pipelines should be constructed.

In practice, we pay the same for 10 servers running for 1 minute as for 1 server running for 10 minutes. Therefore, the focus shifts from optimizing execution time to optimizing scalability and parallelization.

Let’s imagine a perfect data processing pipeline: 1,000 jobs get processed in parallel on 1,000 nodes, and then the results are gathered together. This would mean that at any scale, the speed of processing doesn’t depend on the number of jobs and is always equal to the processing time for one job.
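This fan-out/fan-in pattern can be sketched locally, using threads to stand in for cloud nodes (the job function and its timing are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_job(job_id):
    # Stand-in for one unit of work; here each job takes ~0.1 s.
    time.sleep(0.1)
    return job_id * 2

jobs = range(100)
start = time.perf_counter()
# Fan out: one worker per job, as if each ran on its own node.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(process_job, jobs))
elapsed = time.perf_counter() - start

# Fan in: gather the results. Wall-clock time stays close to one
# job's duration (~0.1 s), not 100 jobs' worth (~10 s).
print(f"{len(results)} jobs in {elapsed:.2f}s")
```

With enough workers, total wall-clock time is governed by the slowest single job, which is exactly the property the perfect pipeline above describes.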

This doesn’t sound so impossible. Serverless infrastructure, which is becoming more and more popular, provides a way to launch thousands of processing nodes in parallel. In addition, more and more vendors now have pure container-as-a-service offerings: once you define, for example, a Docker image, it will be executed in parallel, and you will pay only for processing time. Moreover, when serverless infrastructure and containers as a service are coupled with native message buses or orchestrators, they can handle large numbers of incoming messages by independently mapping each message to these scalable compute services. Together, these services enable a lot of opportunities; by utilizing them, we can minimize idle time and scale infrastructure to match load perfectly.
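As a sketch of how a message bus maps messages onto such a service, here is a minimal AWS Lambda-style handler, assuming an SQS event source (the job fields and the processing logic are hypothetical):

```python
import json

def handler(event, context):
    # Lambda-style entry point. With an SQS event source, the platform
    # delivers queued messages in event["Records"] and fans out as many
    # parallel invocations as the incoming load requires.
    results = []
    for record in event["Records"]:
        payload = json.loads(record["body"])
        # Each invocation processes its messages independently, so no
        # coordination between nodes is needed.
        results.append({"job_id": payload["job_id"], "status": "done"})
    return {"processed": len(results)}
```

Because every message is handled independently, the same handler works unchanged whether one message arrives or ten thousand do.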

But once we achieve perfect horizontal scalability, should we still focus on execution time? Yes, but for a different reason. In a world of perfect horizontal scalability, execution time barely influences how fast a batch is processed, but it significantly influences the cost: halving execution time halves the bill. That is the new motivation for optimization.
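The arithmetic behind this can be made concrete with pay-per-use pricing (the rate and job counts below are illustrative, not any vendor's real numbers):

```python
def batch_cost(jobs, seconds_per_job, price_per_node_second):
    # Pay-per-use cost is node-seconds times the per-second price:
    # with one node per job, that is jobs * seconds * price.
    return jobs * seconds_per_job * price_per_node_second

baseline = batch_cost(1_000, 60, 0.0001)   # 60 s per job
optimized = batch_cost(1_000, 30, 0.0001)  # same job, twice as fast

# With perfect horizontal scaling, wall-clock time barely changes,
# but halving execution time halves the bill.
print(baseline, optimized)  # 6.0 3.0
```

Note that the number of jobs multiplies the per-job cost directly, which is why per-job inefficiency is paid for at every node.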

Furthermore, designing a fully scalable data pipeline without any attention to algorithmic optimization can make the pipeline extremely expensive to run. That is one of the disadvantages of a system that doesn’t have an economy of scale: every inefficiency is paid for at every node.

One additional opportunity lies in separating applications into modular parts and executing each part on a separate scalable service. This approach provides a way to find the best fit for each part of your application and minimize idle CPU (or GPU) and RAM. Once we find this kind of fit, we not only minimize the cost of processing but also make sure that we process each part fast enough (for example, we might preprocess data not on the GPU VM, but in parallel on multiple CPU VMs). And finally, we can start using CPU VMs from one vendor, GPU VMs from another vendor, and serverless computing resources from a third to find the best balance among speed, cost, and scalability.
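The preprocessing example can be quantified with a quick cost comparison (all prices below are hypothetical placeholders, not real vendor rates):

```python
# Hypothetical per-second prices for two VM types.
GPU_VM_PER_SEC = 0.0010
CPU_VM_PER_SEC = 0.0001

def cost(seconds, price_per_sec, nodes=1):
    # Pay-per-use cost is node-seconds times the per-second price.
    return seconds * price_per_sec * nodes

# Preprocessing 1,000 items at 1 s each, serially on the GPU VM:
on_gpu = cost(1_000, GPU_VM_PER_SEC)

# The same preprocessing fanned out across 10 CPU VMs (100 s each):
on_cpus = cost(100, CPU_VM_PER_SEC, nodes=10)

print(on_gpu, on_cpus)  # 1.0 0.1
```

Under these assumed prices, moving preprocessing off the expensive GPU VM and onto parallel CPU VMs is 10x cheaper while also keeping the GPU free for the work that actually needs it.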

The emerging opportunity is to design data pipelines that optimize unit costs and build in scalability from the initial phases, which enables transparent communication between data engineers and other stakeholders, such as project managers and data scientists.
