Performance

Before moving on to the remaining chapters, which cover the functional areas of Apache Spark and its extensions, I wanted to examine the area of performance. What issues and areas need to be considered? What might impact Spark application performance, starting at the cluster level and finishing with the actual Scala code? I don't want to simply repeat what the Spark website says, so take a look at the following URL: http://spark.apache.org/docs/<version>/tuning.html.

Here, <version> relates to the version of Spark that you are using: either latest, or a specific release such as 1.3.1. So, having looked at that page, I will briefly mention some of the topic areas. I am going to list some general points in this section without implying an order of importance.

The cluster structure

The size and structure of your big data cluster is going to affect performance. If you have a cloud-based cluster, your IO and latency will suffer in comparison to a dedicated, unshared hardware cluster. You will be sharing the underlying hardware with multiple customers, and the cluster hardware may be remote.

Also, the positioning of cluster components on servers may cause resource contention. For instance, if possible, think carefully about where you locate Hadoop NameNodes, Spark servers, ZooKeeper, Flume, and Kafka servers in large clusters. With high workloads, you might consider segregating these services onto individual servers. You might also consider using a resource manager such as Apache Mesos in order to share resources.

Also, consider potential parallelism. The greater the number of workers in your Spark cluster, the greater the opportunity for parallelism when processing large data sets.
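As a minimal Scala sketch of this point (the application name, the HDFS path, and the figure of 48 partitions are purely illustrative assumptions), parallelism can be influenced both through the spark.default.parallelism setting and by repartitioning an RDD explicitly:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismExample {
      def main(args: Array[String]): Unit = {
        // Default partition count for shuffle operations such as reduceByKey;
        // a common rule of thumb is two to three tasks per CPU core.
        val conf = new SparkConf()
          .setAppName("Parallelism example")
          .set("spark.default.parallelism", "48")

        val sc = new SparkContext(conf)

        // The HDFS path is a placeholder for your own data set.
        val lines = sc.textFile("hdfs:///data/input.txt")

        // Explicitly raise the partition count so more tasks can run at once.
        val repartitioned = lines.repartition(48)
        println("Partitions: " + repartitioned.partitions.length)

        sc.stop()
      }
    }

Having somewhat more partitions than total worker cores gives the scheduler room to keep every core busy.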

The Hadoop file system

You might consider using an alternative to HDFS, depending upon your cluster requirements. For instance, MapR offers MapR-FS, an NFS-based file system with full read-write capability, whereas HDFS is designed as a write once, read many file system. MapR-FS offers an improvement in performance over HDFS, and it also integrates with Hadoop and the Spark cluster tools. Bruce Penn, an architect at MapR, has written an interesting article describing its features at https://www.mapr.com/blog/author/bruce-penn.

Just look for the blog post entitled Comparing MapR-FS and HDFS NFS and Snapshots. The links in the article describe the MapR architecture and the possible performance gains.

Data locality

Data locality, that is, the location of the data being processed, is going to affect latency and Spark processing. Is the data sourced from AWS S3, HDFS, the local file system/network, or a remote source?

As the previous tuning link mentions, if the data and the code that processes it are remote from each other, then they must be brought together for processing. Spark will try to use the best data locality level possible when scheduling tasks.
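How long Spark waits for a data-local executor before accepting a less local one is configurable. The following is a minimal sketch, assuming the Spark 1.x convention of millisecond values (the figures themselves are only illustrative):

    import org.apache.spark.SparkConf

    // How long the scheduler waits for an executor at each locality level
    // (process-local, node-local, rack-local) before falling back to a
    // less local level. Values are illustrative; the default is 3000 ms.
    val conf = new SparkConf()
      .setAppName("Locality example")
      .set("spark.locality.wait", "3000")
      .set("spark.locality.wait.node", "6000")

Raising these values favours locality at the cost of scheduling delay; lowering them does the reverse.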

Memory

In order to avoid OOM (Out of Memory) messages for the tasks on your Apache Spark cluster, you can consider a number of areas:

  • Consider the level of physical memory available on your Spark worker nodes. Can it be increased?
  • Consider data partitioning. Can you increase the number of partitions in the data used by your Spark application code?
  • Can you increase the storage fraction, the memory used by the JVM for storage and caching of RDDs?
  • Consider tuning the data structures that you use in order to reduce memory consumption.
  • Consider serializing your RDD storage to reduce memory usage; the storage fraction and serialized storage are both shown in the sketch after this list.
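The following minimal Scala sketch ties the last three points together under the Spark 1.x memory model; the input path, the partition count of 96, and the 0.7 fraction are illustrative assumptions only:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("Memory tuning example")
      // Fraction of the executor heap used for RDD storage and caching
      // (a Spark 1.x property); the default is 0.6, 0.7 is illustrative.
      .set("spark.storage.memoryFraction", "0.7")
      // Kryo is typically more compact and faster than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    // Placeholder path; more partitions mean smaller, safer tasks.
    val data = sc.textFile("hdfs:///data/input.txt", 96)

    // Cache the RDD as serialized bytes: less memory used, at the cost
    // of extra CPU to deserialize the data on each access.
    data.persist(StorageLevel.MEMORY_ONLY_SER)

Serialized storage essentially trades CPU for memory, which is usually a good exchange when tasks are failing with OOM errors.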

Coding

Try to tune your code to improve Spark application performance. For instance, filter your data early in the ETL cycle. Tune your degree of parallelism, try to find the resource-expensive parts of your code, and find alternatives.
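As a brief sketch of early filtering (the CSV layout, the field positions, and the UK predicate are hypothetical, and a SparkContext sc is assumed to exist as before):

    // Hypothetical CSV records in the form: id,country,value
    val records = sc.textFile("hdfs:///data/sales.csv")
      .map(line => line.split(","))

    // Filter early: discard unwanted rows before the shuffle, so the
    // expensive reduceByKey step moves far less data over the network.
    val ukOnly = records.filter(fields => fields(1) == "UK")

    val totals = ukOnly
      .map(fields => (fields(0), fields(2).toDouble))
      .reduceByKey(_ + _)   // the resource-expensive shuffle step

    totals.take(10).foreach(println)

Had the filter been applied after the reduceByKey call, every row would have been shuffled first; placing it early keeps the expensive step small.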
