Chapter 14. Scalable Algorithms

In this chapter, we discuss the challenges associated with writing efficient and scalable analytics running on Spark. We will start by introducing the reader to the general concepts of distributed parallelization and scalability and how they relate to Spark. We will recap Spark's distributed architecture, giving the reader an understanding of its underlying principles and how it supports the parallel processing paradigm. We will learn about the characteristics of scalable analytics and the elements of Spark that underpin them (for example, RDDs, combineByKey, and GraphX).

Next, we will learn why even basic algorithms that work at small scale will often fail at big data scale. We'll see how to avoid such issues when writing Spark jobs that run over massive datasets, including an example using mean and variance. The reader will learn about the structure of algorithms and how to write custom data science analytics that scale over petabytes of data.

Later, we will move on to discuss some of the limitations of Spark's in-memory model, such as excessive memory usage, the pitfalls of traditional data models (including the object-oriented approach [OOP] and third normal form [3NF]), the benefits of a denormalized data representation, and the dangers of fixed-precision number representations, and how these relate to writing efficient Spark jobs.

The chapter concludes by describing the main performance-related features and patterns that facilitate efficient runtime processing in Spark and shows when to take advantage of them. We will introduce features such as parallelization strategies, caching, shuffle strategies, garbage collection optimization, and probabilistic models, and explain how they help you get the most out of Spark.

This chapter also emphasizes the importance of having a good overall approach to the development process when authoring analytics. It reveals the tips and tricks of the professionals that will ensure that your algorithm writing experience is a success.

General principles

Throughout this book we have demonstrated many data science techniques that, by using the power of Spark, will scale across petabytes of data. Hopefully, you have found these techniques sufficiently useful that you want to start using them in your own analytics and, indeed, have been inspired to create data science pipelines of your own.

Writing your own analytics is definitely a challenge! It can be huge fun at times and it's great when they work well. But there are times when getting them to run at scale and efficiently (or even at all) can seem like a daunting task.

Sometimes, with scarce feedback, you can get stuck in a seemingly endless loop waiting for task after task to complete, not even knowing whether your job will fail at the very last hurdle. And let's face it, seeing a dreaded OutOfMemoryError at the end of a 20-hour job is no fun for anyone! Surely there must be a better way to develop analytics that run well on Spark and don't lead to wasted time and poorly performing code?

One of the main characteristics of a well written Spark job is the concept of scalability. Scalability is a distinct concept from performance. While performance is a measure of the speed of the response of a computation, scalability is a measure of how well a computation performs when you increase demand (or in the case of Spark, increase the amount of data).

The holy grail of scalability is known as linear scalability. This refers to the ideal condition where there are no performance constraints imposed on scalability when additional resources are added to the cluster. In this case, doubling the number of machines in a cluster will lead to double the performance, or similarly doubling the volume of data would yield the same performance on a cluster twice the size.

This chapter will serve as an introduction to writing analytics that seek to take advantage of this linear scalability. It will demonstrate best practice for maximizing scalability and explain the barriers to achieving it. And while it will not provide an exhaustive description of optimization techniques, it will get you started by giving you a feel for how to write efficient Spark jobs.

Before we dive into the details, let's establish some basic principles that will assist and guide us throughout:

  1. Preserve data locality where possible: Moving data around is expensive. It's usually much quicker to process data in place by moving the processing to the data. In Spark, this is known as data locality, and the framework is designed to take full advantage of it. You might therefore assume that you don't need to worry about it much. While this is partly true, it's prudent to test this assumption at every stage to ensure that the job is behaving as expected, and, if not, to use the levers that Spark provides to prevent moving data around unnecessarily (a minimal configuration sketch follows this list). In fact, the principle of data locality is so important that it should be considered throughout the entire development process: at each stage, ask whether it's really necessary to move data at all. In some cases, the problem can be decomposed differently to minimize or avoid movement altogether; if so, the approach that transfers less data is always worth considering.
  2. Ensure even distribution of data: When running a Spark job, the ideal situation is to have all of your executors equally utilized. Having executors sit idle while a select few do all of the work is indicative of a poorly performing job. By distributing data evenly across executors, you ensure maximum utilization of cluster resources (a sketch for detecting and correcting skew follows this list).
  3. Favor faster stores: Not all methods of randomly accessing data are the same. The following list shows the approximate time taken to reference data in various storage tiers:

    L1 cache               0.5 ns
    L2 cache               7 ns
    Main memory            100 ns
    Disk (random seek)     2,000,000 ns

    Fortunately, Spark provides in-memory processing capabilities, including many optimizations, introduced as part of Project Tungsten (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html), that take advantage of the fast CPU caches (L1/L2/L3) and so avoid unnecessarily reading from main memory or spilling to disk. It's important that your analytics take full advantage of these efficiencies (a caching sketch follows this list).

  4. Only optimize after observation: There's a famous saying by Donald Knuth, the legendary computer scientist and author, that premature optimization is the root of all evil. While this sounds extreme, what he means is that all performance-related tweaks or optimizations should be based on empirical evidence rather than preemptive intuition, as such predictions very often fail to correctly identify performance problems and instead give rise to poor design choices that are later regretted. But contrary to what you might think, the suggestion here is not that you just forget about performance until the end; in fact, quite the reverse. In an environment where the size of the data, and hence the length of time any operation takes, dictates everything, it's fundamental to begin optimization early in the analytic design process. But isn't this a contradiction of Knuth's law? Well, no. In terms of performance, simplicity is often the key. The approach should be evidence-based: start simple, carefully observe the performance of your analytic at runtime (through the use of analytic tuning and code profiling, see the next section), perform targeted optimizations that correct the problems identified, and repeat. Over-engineering is usually as much to blame for poorly performing analytics as choosing slow algorithms, but it can be much harder to fix down the line.
  5. Start small and scale up: Start with small data samples. While an analytic may eventually be required to run over a petabyte of data, starting with a small dataset is definitely advisable. Sometimes only a handful of rows are required to determine whether an analytic is working as expected, and more rows can be added to prove out the various test and edge cases. It's more about breadth of coverage here rather than volume. The analytic design process is extremely iterative, and judicious use of data sampling will pay dividends during this phase; even a small dataset will allow you to measure the impact on performance as you incrementally increase the size of the data (a sampling sketch follows this list).

    The bottom line is that writing analytics, particularly over data you are unfamiliar with, can take time and there are no shortcuts.
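To make principle 1 concrete, here is a minimal sketch of the locality levers in Scala. The spark.locality.wait key is a standard Spark configuration setting; the wait value and the HDFS path are purely illustrative assumptions, not recommendations.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("locality-example")
      // How long the scheduler waits for a node-local slot before falling
      // back to rack-local (and then any) placement. Raising it favours
      // locality at the cost of potentially idle executors.
      .config("spark.locality.wait", "10s")
      .getOrCreate()

    // Reading directly from HDFS lets Spark schedule tasks on the nodes
    // holding the underlying blocks (the path is hypothetical).
    val logs = spark.read.textFile("hdfs:///data/events/*.log")
    println(logs.filter(_.contains("ERROR")).count())

The Locality Level column in the stage view of the Spark UI then shows whether tasks actually ran as NODE_LOCAL, RACK_LOCAL, or ANY, which is an easy way to test the assumption at runtime.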
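For principle 2, a quick way to see whether data is evenly spread is to count the rows in each partition, and a common remedy for a handful of hot keys is to add a random salt before repartitioning. The column names, path, and salt factor below are assumptions for illustration; treat this as a sketch rather than a prescription.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().getOrCreate()
    val events = spark.read.parquet("hdfs:///data/events")  // hypothetical path

    // Report how many rows each partition holds to reveal skew.
    events.rdd
      .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
      .collect()
      .foreach { case (i, n) => println(s"partition $i -> $n rows") }

    // Spread hot keys over several partitions by appending a random salt
    // before an expensive aggregation ("userId" is a hypothetical key column).
    val salted = events
      .withColumn("salt", (rand() * 16).cast("int"))
      .repartition(col("userId"), col("salt"))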
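For principle 3, keeping a dataset that is reused several times in memory avoids repeatedly re-reading and re-deserializing it from disk. The MEMORY_AND_DISK storage level below spills to local disk only when the cached partitions do not fit; the dataset, path, and filter are again illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().getOrCreate()

    // Cache in memory, spilling to local disk only if the partitions do not fit.
    val features = spark.read.parquet("hdfs:///data/features")
      .persist(StorageLevel.MEMORY_AND_DISK)

    val total = features.count()  // the first action materializes the cache
    val highScoring = features.filter(col("score") > 0.5).count()  // reuses cached partitions

    features.unpersist()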
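Finally, for principle 5, the sample transformation makes it cheap to develop against a fraction of the data before scaling up. The fraction, seed, and path below are arbitrary choices for this sketch.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val full = spark.read.parquet("hdfs:///data/transactions")  // hypothetical path

    // Develop and test against roughly 0.1% of the rows...
    val dev = full.sample(withReplacement = false, fraction = 0.001, seed = 42L)
    dev.cache()
    println(dev.count())

    // ...then raise the fraction (or drop the sampling step entirely) once
    // the logic is proven, measuring runtime at each increment.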

Now that we have some guidelines, let's focus on how they apply to Spark.
