Challenges

Now that we have gained an understanding of the Spark architecture, let's prepare for writing scalable analytics by introducing some of the challenges, or gotchas, that you might face if you're not careful. Without knowledge of these up front, you could lose time trying to figure them out on your own!

Algorithmic complexity

As well as the obvious effect of the size of your data, the performance of an analytic is highly dependent on the nature of the problem you're trying to solve. Even some seemingly simple problems, such as a depth-first search of a graph, do not have well-defined algorithms that perform efficiently in distributed environments. This being the case, great care should be taken when designing analytics to ensure that they exploit patterns of processing that are readily parallelized. Taking the time to understand the nature of your problem in terms of complexity before you start can pay off in the long term. In the next section, we'll show you how to do this.

Note

Generally speaking, problems in the complexity class NC are efficiently parallelizable, whereas P-complete problems most likely are not: https://en.wikipedia.org/wiki/NC_(complexity).

Another thing to note is that distributed algorithms will often be much slower than single-threaded applications when run on small data. It's worth bearing in mind that in scenarios where all of your data fits onto a single machine, the overhead of Spark (spawning processes, transferring data, and the latency introduced by interprocess communication) will rarely pay off. The investment only really starts to pay dividends once your datasets are large enough that they no longer fit comfortably into memory; at that point you will notice gains in throughput, the amount of data you can process in unit time, as a result of using Spark.

Numerical anomalies

When processing large amounts of data, you might notice some strange effects with numbers. These oddities relate to the number representations used by modern machines, and specifically to the concept of precision.

To demonstrate the effect, consider the following:

scala> val i = Integer.MAX_VALUE
i: Int = 2147483647
 
scala> i + 1
res1: Int = -2147483648

Notice how a positive number is turned into a negative number simply by adding one. This phenomenon is known as numeric overflow, and it occurs when a calculation produces a number that is too large for its type. In this case, an Int has a fixed width of 32 bits, so when we attempt to store a 33-bit number we get an overflow, resulting in a negative value. This type of behavior can be demonstrated for any fixed-width numeric type and can result from any arithmetic operation.

Note

This is due to the signed, fixed-width, two's complement number representations adopted by most modern processor manufacturers (and hence Java and Scala).
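
To see the same behavior with another type, and one simple way to detect it, consider the following quick illustration (the exception output is abbreviated):

scala> Long.MaxValue + 1L
res0: Long = -9223372036854775808

scala> Math.addExact(Int.MaxValue, 1)
java.lang.ArithmeticException: integer overflow
  ...

Math.addExact is a standard JDK method (Java 8 onwards) that throws rather than silently wrapping around, which can be a useful safety net where an overflow would otherwise go unnoticed.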

Although overflows occur in the course of normal programming, they become much more apparent when dealing with large datasets. They can occur even when performing relatively simple calculations, such as summations or means. Let's consider the most basic example:

scala> val distanceBetweenStars = Seq(2147483647, 2147483647)
distanceBetweenStars: Seq[Int] = List(2147483647, 2147483647)

scala> val rdd = spark.sparkContext.parallelize(distanceBetweenStars)
rdd: org.apache.spark.rdd.RDD[Int] =  ...

scala> rdd.reduce(_+_)
res1: Int = -2

Datasets are not immune:

scala> distanceBetweenStars.toDS.reduce(_+_)
res2: Int = -2

Of course, there are strategies for handling this; for example, by using alternative algorithms, different data types, or changing the unit of measurement. Whichever you choose, a plan for tackling these types of issue should always be factored into your design.
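
For instance, the "different data types" strategy can be as simple as widening the values before reducing; continuing the session above, a minimal sketch:

scala> rdd.map(_.toLong).reduce(_ + _)
res3: Long = 4294967294

scala> distanceBetweenStars.map(BigInt(_)).sum
res4: BigInt = 4294967294

Note that a Long only defers the problem (it still overflows, just at a much larger value), whereas BigInt offers arbitrary precision at the cost of extra memory and slower arithmetic.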

A similar effect is the loss of significance caused by rounding errors in calculations that are limited by their precision. For illustrative purposes, consider this really basic (and not very sophisticated!) example:

scala> val bigNumber = Float.MaxValue
bigNumber: Float = 3.4028235E38

scala> val verySmall = Int.MaxValue / bigNumber
verySmall: Float = 6.310888E-30

scala> val almostAsBig = bigNumber - verySmall
almostAsBig: Float = 3.4028235E38

scala> bigNumber - almostAsBig
res2: Float = 0.0

Here, we were expecting the answer 6.310887552645619145394993304824655E-30, but instead we got zero. This is a clear loss of precision and significance, demonstrating another type of behavior that you need to be aware of when designing analytics.

To cope with these issues, Welford and Chan devised online algorithms for calculating the mean and variance that are designed to avoid such problems with precision. Spark implements this approach under the covers; an example can be seen in PySpark's StatCounter:

def merge(self, value):
    # difference between the new value and the current running mean
    delta = value - self.mu
    self.n += 1
    # update the running mean incrementally
    self.mu += delta / self.n
    # accumulate the sum of squared differences from the mean (used for the variance)
    self.m2 += delta * (value - self.mu)
    self.maxValue = maximum(self.maxValue, value)
    self.minValue = minimum(self.minValue, value)

Let's take a deeper look into how it's calculating the mean and variance:

  • delta: The delta is the difference between mu (the current running mean) and the new value under consideration. It measures the change introduced by each new data point and is therefore typically small. Crucially, it allows the mean to be updated incrementally, so the calculation never involves summing all of the values, which could otherwise lead to an overflow.
  • mu: The mu represents the current running mean. At any given time it equals the total of the values seen so far divided by their count, but it is calculated incrementally by repeatedly applying the delta rather than by ever forming that total.
  • m2: The m2 is the running sum of squared differences from the current mean, from which the variance is derived. Accumulating deviations from the mean, rather than squares of the raw values, helps the algorithm avoid loss of significance and reduces the amount of information lost through rounding errors.

As it happens, this particular online algorithm is specific to computing statistics, but the online approach can be adopted in the design of any analytic.
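
To illustrate the pattern outside of Spark, here is a minimal Scala sketch of the same Welford-style online update. OnlineStats is a hypothetical class written for this example, not Spark's own implementation:

// A minimal sketch of a Welford-style online mean and variance accumulator
class OnlineStats {
  private var n  = 0L    // count of values merged so far
  private var mu = 0.0   // running mean
  private var m2 = 0.0   // running sum of squared differences from the mean

  def merge(value: Double): this.type = {
    val delta = value - mu        // change introduced by the new value
    n += 1
    mu += delta / n               // incremental update of the mean
    m2 += delta * (value - mu)    // accumulate squared deviation from the updated mean
    this
  }

  def mean: Double = mu
  def variance: Double = if (n > 0) m2 / n else Double.NaN
}

// Usage: fold the values through the accumulator one at a time
val stats = Seq(1.0, 2.0, 3.0, 4.0).foldLeft(new OnlineStats)(_ merge _)
// stats.mean == 2.5, stats.variance == 1.25

Because each step only ever works with the difference between a value and the current mean, the intermediate quantities stay small and the summation of all raw values is avoided entirely.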

Shuffle

As we identified earlier in the section on principles, moving data around is expensive, and this means that one of the main challenges when writing any scalable analytic is minimizing the transfer of data. Managing and handling data transfer remains, at this moment in time, a very costly operation. We'll discuss how to tackle this later in the chapter, but for now let's build awareness of the challenges around data locality: knowing which operations are fine to use and which should be avoided, while also understanding the alternatives. Some of the key offenders are:

  • cartesian()
  • reduce()
  • PairRDDFunctions.groupByKey()

Be aware, though, that with a little forethought the use of these can often be avoided altogether.
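
For example, a per-key sum written with groupByKey ships every individual value across the network before aggregating, whereas reduceByKey combines values within each partition first and shuffles only the partial sums. Here's a minimal sketch using a toy RDD of (word, count) pairs:

val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey: every value for a key is shuffled before being summed
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: values are pre-aggregated map-side, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)

viaReduce.collect()   // Array((a,2), (b,1)); ordering may vary

Similarly, reduce() can often be swapped for treeReduce() to aggregate in stages rather than pulling every partial result back to the driver at once, and a cartesian() can frequently be replaced by broadcasting the smaller dataset and combining records locally.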

Data schemas

Choosing a schema for your data is critical to your analytic design. Often, of course, you have no choice about the format of your data: either a schema is imposed on you, or your data may have no schema at all. Either way, with techniques such as "temporary tables" and schema-on-read (see Chapter 3, Input Formats and Schema, for details), you still have control over how data is presented to your analytic, and you should take advantage of this. There are an enormous number of options here, and selecting the right one is part of the challenge. Let's discuss some common approaches, starting with some that are not so good:

  • OOP: Object-oriented programming (OOP) is the general concept of programming by decomposing problems into classes that model real-world concepts. Typically, definitions will group both data and behavior, making them a popular way to ensure that code is compact and understandable. In the context of Spark, however, creating complex object structures, particularly ones that include rich behavior, is unlikely to benefit your analytic in terms of readability or maintenance. Instead, it is likely to vastly increase the number of objects requiring garbage collection and limit the scope for code reuse. Spark is designed using a functional approach, and while you should be careful about abandoning objects altogether, you should strive to keep them simple and reuse object references where it is safe to do so.
  • 3NF: For decades, databases have been optimized for certain types of schema: relational, star, snowflake, and so on. Techniques such as 3rd Normal Form (3NF) work well to ensure the correctness of traditional data models. However, within the context of Spark, forcing dynamic table joins and/or joining facts with dimensions results in shuffles, potentially many of them, which is ultimately bad for performance.
  • Denormalization: Denormalization is a practical way to ensure that your analytic has all the data it needs without having to resort to a shuffle. Data can be arranged so that records processed together are also stored together. This comes at the cost of storing duplicates of much of the data, but it's often a trade-off that pays off, particularly as there are techniques and technologies, such as column-oriented storage and column pruning, that help offset the cost of duplication. More on this later; a rough sketch of the trade-off follows this list.
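
To make the denormalization trade-off concrete, the hypothetical case classes below contrast a normalized model, where every query must join facts to dimensions (and therefore shuffle), with a denormalized one in which each record already carries the dimension attributes it is processed with:

// Normalized: purchases reference customers by key, so analytics must join (and shuffle)
case class Customer(id: Long, region: String)
case class Purchase(customerId: Long, amount: Double)

// Denormalized: the region is duplicated onto every purchase,
// so a per-region aggregation needs no join at all
case class PurchaseByRegion(customerId: Long, region: String, amount: Double)

// Illustrative queries over Datasets of the above:
//   normalized  : purchases.join(customers, purchases("customerId") === customers("id"))
//                          .groupBy("region").sum("amount")
//   denormalized: purchasesByRegion.groupBy("region").sum("amount")

The denormalized form stores the region many times over, but the per-region aggregation no longer needs to move customer records around the cluster.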

Now that we understand some of the difficulties that you might encounter when designing analytics, let's get into the detail of how to apply patterns that address these and ensure that your analytics run well.
