Spark's Python API is more limited than its Java and Scala APIs, but it supports most of the core functionality.
The hallmarks of a MapReduce system are its two commands: map and reduce. You've seen the map function used in the past chapters. The map function works by taking in a function that operates on each individual element in the input RDD and produces a new output element. For example, to produce a new RDD where one has been added to every number, you would use rdd.map(lambda x: x+1). It's important to understand that the map function and the other Spark functions do not modify the existing elements; rather, they return a new RDD with the new elements. The reduce function takes a function that operates on pairs to combine all the data, and the result is returned to the calling program. If you were to sum all the elements, you would use rdd.reduce(lambda x, y: x+y).
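For example, here is a minimal sketch of both calls, assuming a SparkContext is already available as sc:

rdd = sc.parallelize([1, 2, 3, 4])

# map returns a brand new RDD; the input RDD is left unchanged
incremented = rdd.map(lambda x: x + 1)
print(incremented.collect())    # [2, 3, 4, 5]

# reduce combines the elements pairwise and returns the result to the driver
total = rdd.reduce(lambda x, y: x + y)
print(total)                    # 10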
The flatMap function is a useful utility that lets you write a function returning an Iterable of the type you want and then flattens the results. A simple example of this is a case where you want to parse all the data, but some of it may fail to parse. The flatMap function can output an empty list when parsing fails, or a list containing the parsed value when it succeeds. In addition to reduce, there is a corresponding reduceByKey function that works on RDDs of key-value pairs and produces another RDD.
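As a rough sketch of that parsing pattern, assuming sc is an existing SparkContext and that each record is a hypothetical "word,count" string:

def safe_parse(line):
    # Return a one-element list on success and an empty list on failure,
    # so that flatMap silently drops records that cannot be parsed
    try:
        word, count = line.split(",")
        return [(word, int(count))]
    except ValueError:
        return []

lines = sc.parallelize(["apple,2", "banana,3", "not parseable", "apple,1"])
pairs = lines.flatMap(safe_parse)
totals = pairs.reduceByKey(lambda x, y: x + y)
print(totals.collect())    # [('apple', 3), ('banana', 3)], in some order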
Many of the mapping operations are also defined with a partition variant. In this case, the function you provide takes and returns an Iterator object that represents all the data on that partition. This can be quite useful if the operation you need to perform has to do extensive work on each partition, for example, establishing a connection to a backend server.
Often, your data can be expressed as key-value mappings. As such, many of the functions defined on the Python RDD class only work if your data is in a key-value mapping. The mapValues function is used when you only want to update the value of each key-value pair, leaving the key unchanged.
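A small sketch of mapValues, again assuming an existing SparkContext named sc:

pairs = sc.parallelize([("apple", 2), ("banana", 3)])

# Only the values are transformed; the keys (and the partitioning) are preserved
doubled = pairs.mapValues(lambda v: v * 2)
print(doubled.collect())    # [('apple', 4), ('banana', 6)]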
Another variant on the traditional map function is mapPartitions, which works on a per-partition level. The primary reason for using mapPartitions is to perform setup for your map function that can't be serialized across the network. A good example of this is creating an expensive connection to a backend service or parsing some expensive side input:
def f(iterator):
    # Expensive setup work goes here
    for i in iterator:
        yield per_element_function(i)
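Assuming f is the function above and sc is an existing SparkContext, and using a trivial stand-in for per_element_function, it could be applied like this:

def per_element_function(x):
    # Stand-in for your own per-record logic
    return x * 2

rdd = sc.parallelize(range(10))
print(rdd.mapPartitions(f).collect())    # [0, 2, 4, ..., 18]

The expensive setup then runs once per partition rather than once per element.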
In addition to simple operations on the data, Spark provides support for broadcast and accumulator values. Broadcast values can be used to send a read-only value to all the partitions, which saves having to reserialize a given value multiple times. Accumulators allow all the partitions to add to the accumulator, and the result can then be read back on the master. You can create an accumulator with counter = sc.accumulator(initialValue). If you want to add custom behavior, you can also provide an AccumulatorParam as an argument to the accumulator function. The counter can then be incremented as counter += x on any of the workers, and the resulting value can be read with counter.value. The broadcast value is created with bc = sc.broadcast(value) and then accessed as bc.value by any worker. The accumulator value can only be read on the master, while the broadcast value can be read on all the partitions.
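A short sketch of both features together, assuming an existing SparkContext named sc:

lookup = {"a": 1, "b": 2}
bc = sc.broadcast(lookup)      # read-only value shipped once to each worker
counter = sc.accumulator(0)    # starts at the given initial value

def process(x):
    counter.add(1)             # workers can only add to the accumulator
    return bc.value.get(x, 0)  # workers read, but never change, the broadcast value

print(sc.parallelize(["a", "b", "c"]).map(process).collect())    # [1, 2, 0]
print(counter.value)           # read back on the master: 3

The counter.add(1) call is used here instead of counter += 1 only because augmented assignment inside a function would rebind the local name; both forms add to the same accumulator.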
The following table explains some of the functions that are available on all RDDs in Python:
The following table explains some functions that are only available on RDDs of key-value pairs: