Avoiding Shuffle and Reducing Operational Expenses

In this chapter, we will learn how to avoid shuffle and reduce the operational expense of our jobs. We will start by detecting a shuffle in a process. We will then test the operations that cause a shuffle in Apache Spark to find out when we should be very careful and which operations we should avoid. Next, we will learn how to change the design of jobs with wide dependencies. After that, we will use keyBy() operations to reduce shuffle and, in the last section of this chapter, we'll see how we can use a custom partitioner to reduce the shuffle of our data.

In this chapter, we will cover the following topics:

  • Detecting a shuffle in a process
  • Testing operations that cause a shuffle in Apache Spark
  • Changing the design of jobs with wide dependencies
  • Using keyBy() operations to reduce shuffle
  • Using a custom partitioner to reduce shuffle