In this chapter, we discuss the challenges associated with writing efficient, scalable analytics on Spark. We will start by introducing the general concepts of distributed parallelization and scalability and how they relate to Spark. We will then recap Spark's distributed architecture, giving the reader an understanding of its underlying principles and how these support the parallel processing paradigm. We will learn about the characteristics of scalable analytics and the elements of Spark that underpin them (for example, RDD, combineByKey, and GraphX).
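To give a feel for what combineByKey does, here is a minimal, pure-Python sketch of its semantics (createCombiner, mergeValue, mergeCombiners). The partitioning and function names are illustrative assumptions; in real Spark, these three functions run distributed across executors, with only the combined partials crossing the shuffle:

```python
# A pure-Python sketch of the semantics behind Spark's combineByKey.
# Illustration only -- real Spark runs these functions distributed
# across partitions, shuffling only the combined partials.

def combine_by_key(pairs, create_combiner, merge_value, merge_combiners,
                   num_partitions=2):
    # Simulate splitting the (key, value) pairs across partitions.
    partitions = [pairs[i::num_partitions] for i in range(num_partitions)]

    # Per-partition combine (what each executor does locally).
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        partials.append(acc)

    # Merge partition results (what happens after the shuffle).
    result = {}
    for acc in partials:
        for k, c in acc.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

# Example: per-key mean via (sum, count) combiners.
pairs = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0), ("a", 5.0)]
combined = combine_by_key(
    pairs,
    create_combiner=lambda v: (v, 1),
    merge_value=lambda c, v: (c[0] + v, c[1] + 1),
    merge_combiners=lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),
)
means = {k: s / n for k, (s, n) in combined.items()}  # {"a": 3.0, "b": 3.0}
```

The key design point is that the combiner functions must be associative, so partial results computed on separate partitions can be merged in any order.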
Next, we will learn why even basic algorithms that work at small scale often fail on big data. We'll see how to avoid issues when writing Spark jobs that run over massive datasets, including an example using mean/variance. The reader will learn about the structure of algorithms and how to write custom data science analytics that scale over petabytes of data.
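As a taste of the mean/variance issue, here is a sketch in plain Python (the specific algorithm choice is our illustration, not necessarily the one the chapter uses). The textbook formula E[x²] − (E[x])² breaks down on real-world data with a large offset because of catastrophic cancellation, while Welford's one-pass online algorithm stays accurate and, being a mergeable per-partition statistic, parallelizes well:

```python
# One-pass mean/variance using Welford's online algorithm -- the kind of
# single-pass, mergeable statistic that scales, unlike the textbook
# E[x^2] - (E[x])^2 formula, which suffers catastrophic cancellation.

def welford(xs):
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # note: uses the updated mean
    return mean, m2 / n            # population variance

# A large offset exposes the naive formula's precision loss.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]   # true variance: 22.5
mean, var = welford(data)
naive_var = (sum(x * x for x in data) / len(data)
             - (sum(data) / len(data)) ** 2)        # wildly wrong here
```

Running this, `var` comes out at the true value of 22.5, while `naive_var` is dominated by floating-point rounding in the two enormous, nearly equal terms being subtracted.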
Later, we will move on to discuss some of the limitations of Spark's in-memory model (such as excessive memory usage, the pitfalls of traditional data models including the object-oriented approach [OOP] and 3rd normal form [3NF], the benefits of a denormalized data representation, and the dangers of fixed precision number representations) and how these relate to writing efficient Spark jobs.
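On the dangers of fixed precision number representations, a tiny plain-Python sketch (our illustration, using Python's standard decimal module) shows how binary floating-point error accumulates in exactly the kind of large aggregation a Spark job performs:

```python
from decimal import Decimal

# Sketch: accumulating many small binary-float values drifts away from the
# exact answer, while a decimal representation stays exact -- one reason
# fixed binary precision can bite in large aggregations.
float_total = sum(0.1 for _ in range(1000))           # not exactly 100.0
decimal_total = sum(Decimal("0.1") for _ in range(1000))  # exactly 100.0
```

The drift per addition is tiny, but over billions of rows it compounds; choosing an exact (decimal or scaled-integer) representation for money-like values is the usual remedy.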
The chapter concludes by describing the main performance-related features and patterns that facilitate efficient runtime processing in Spark, and shows when to take advantage of them. We will introduce features such as parallelization strategies, caching, shuffle strategies, garbage collection optimization, and probabilistic models, and explain how these help you to get the most out of Spark.
This chapter also emphasizes the importance of a good overall approach to the development process when authoring analytics. It shares the tips and tricks of professionals that will help make your algorithm-writing experience a success.
Throughout this book we have demonstrated many data science techniques that, by using the power of Spark, will scale across petabytes of data. Hopefully, you have found these techniques sufficiently useful that you want to start using them in your own analytics and, indeed, have been inspired to create data science pipelines of your own.
Writing your own analytics is definitely a challenge! It can be huge fun at times and it's great when they work well. But there are times when getting them to run at scale and efficiently (or even at all) can seem like a daunting task.
Sometimes, with scarce feedback, you can get stuck in a seemingly endless loop, waiting for task after task to complete without even knowing whether your job will fail at the very last hurdle. And let's face it, seeing a dreaded OutOfMemoryError at the end of a 20-hour job is no fun for anyone! Surely there must be a better way to develop analytics that run well on Spark and don't lead to wasted time and poorly performing code?
One of the main characteristics of a well-written Spark job is scalability. Scalability is a distinct concept from performance. While performance measures how quickly a computation responds, scalability measures how well a computation performs as demand increases (or, in the case of Spark, as the amount of data increases).
The holy grail of scalability is known as linear scalability. This refers to the ideal condition where there are no performance constraints imposed on scalability when additional resources are added to the cluster. In this case, doubling the number of machines in a cluster will lead to double the performance, or similarly doubling the volume of data would yield the same performance on a cluster twice the size.
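A standard way to quantify why linear scalability is so hard to achieve is Amdahl's law (our addition here, not a formula from this chapter): any serial fraction of the work caps the speedup you can buy with more machines. A short sketch:

```python
# Amdahl's law: speedup S on N machines with serial fraction s is
#   S = 1 / (s + (1 - s) / N)
# A sketch of why even a small serial fraction breaks linear scalability.

def speedup(serial_fraction, machines):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / machines)

perfectly_parallel = speedup(0.0, 2)   # doubling machines doubles speed: 2.0
with_serial_part = speedup(0.05, 2)    # 5% serial work: noticeably below 2x
```

With no serial work, two machines give exactly a 2x speedup (linear scalability); with just 5% serial work, the same two machines deliver only about 1.9x, and the gap widens as the cluster grows.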
This chapter will serve as an introduction to writing analytics that seek to take advantage of this linear scalability. It will demonstrate best practice for maximizing scalability and explain the barriers to achieving it. And while it will not provide an exhaustive description of optimization techniques, it will get you started by giving you a feel for how to write efficient Spark jobs.
Before we dive into the details, let's establish some basic principles that will assist and guide us throughout. A useful starting point is the relative latency of each level of the memory hierarchy:
L1 cache: 0.5 ns
L2 cache: 7 ns
Main memory: 100 ns
Disk (random seek): 2,000,000 ns
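To put these figures in perspective, a little arithmetic on the numbers above (the dictionary below just restates the table):

```python
# Relative cost of each level of the hierarchy versus an L1 cache hit,
# using the latency figures quoted above.
latencies_ns = {
    "L1 cache": 0.5,
    "L2 cache": 7,
    "Main memory": 100,
    "Disk (random seek)": 2_000_000,
}
relative_to_l1 = {k: v / latencies_ns["L1 cache"] for k, v in latencies_ns.items()}

# A random disk seek costs 20,000x a main-memory reference.
disk_vs_memory = latencies_ns["Disk (random seek)"] / latencies_ns["Main memory"]
```

In other words, a single random disk seek costs as much as roughly four million L1 cache hits, which is why spilling to disk is so damaging to a job's runtime.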
Fortunately, Spark provides in-memory processing capabilities, including many optimizations that take advantage of the fast CPU caches (L1/L2/L3). It can therefore avoid unnecessarily reading from main memory or spilling to disk, and it's important that your analytics take full advantage of these efficiencies. Much of this was introduced as part of Project Tungsten: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.
The bottom line is that writing analytics, particularly over data you are unfamiliar with, can take time and there are no shortcuts.
Now that we have some guidelines, let's focus on how they apply to Spark.