Pivot tabling with key-value paired data points

Pivot tables are simple and easy to use. What we are going to do is take a big dataset, such as the KDD Cup dataset, and group certain values by certain keys.

For example, say we have a dataset of people and their favorite fruits. If we want to know how many people have apple as their favorite fruit, we group the number of people, which is the value, against a key, which is the fruit. This is the simple concept behind a pivot table.
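
As a minimal sketch of that idea, assuming a live SparkContext named sc (the fruit data here is invented purely for illustration):

favorites = sc.parallelize([("apple", "alice"), ("apple", "bob"), ("banana", "carol")])
# Map each (fruit, person) pair to (fruit, 1) and sum the ones per fruit
counts = favorites.map(lambda pair: (pair[0], 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # [('apple', 2), ('banana', 1)] (order may vary)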

We can use the map function to move the KDD dataset into the key-value pair paradigm. Using a lambda function, we map each data point so that feature 41 serves as the key and the whole data point is attached as the value, storing the result in an RDD called kv, as follows:

kv = csv.map(lambda x: (x[41], x))
kv.take(1)

We use feature 41 as the key, and the value is the data point, which is x. We can use the take function to take one of these transformed rows to see how it looks.
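
The csv RDD used here is assumed to have been created earlier in the chapter; if you are starting from this section, a minimal sketch for building it might look as follows (the file path and variable names are assumptions):

from pyspark import SparkContext

sc = SparkContext("local[*]", "pivot-example")    # assumed local setup
raw_data = sc.textFile("./kddcup.data.gz")        # assumed path to the KDD Cup data
csv = raw_data.map(lambda line: line.split(","))  # each line becomes a list of features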

Let's now try something similar to the previous example. To figure out the total duration for each type of value present in feature 41, we can use the map function again, taking feature 41 as our key and the float of feature 0, the duration, as our value. We will then use the reduceByKey function to sum the durations for each key.

So, instead of reducing all of the data points regardless of which key they belong to, reduceByKey reduces the duration numbers according to the key they are associated with. You can view the documentation at https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=map#pyspark.RDD.reduceByKey. reduceByKey merges the values for each key using an associative and commutative reduce function. It performs the merging locally on each mapper before sending the results to the reducer, which is similar to a combiner in MapReduce.
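
To see this per-key behavior in isolation before applying it to the KDD data, here is a toy sketch (reusing the sc context assumed above):

toy = sc.parallelize([("a", 1.0), ("b", 2.0), ("a", 3.0)])
# The two "a" values are merged with the reduce function; "b" stands alone
print(toy.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4.0), ('b', 2.0)] (order may vary)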

In its simplest form, the reduceByKey function takes a single argument: the reduce function. Here we use a lambda function that takes two durations and adds them together, and PySpark is smart enough to apply this reduction separately for each key, as follows:

kv_duration = csv.map(lambda x: (x[41], float(x[0]))).reduceByKey(lambda x, y: x+y)
kv_duration.collect()

Collecting the key-value duration data shows the total duration accumulated for each distinct value that appears in feature 41.
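
The full collected list can be long; one way to spot the dominant keys (a sketch, not part of the original walkthrough) is to sort the pairs by total duration:

# Show the three feature 41 values with the largest total duration
for label, total in kv_duration.sortBy(lambda pair: pair[1], ascending=False).take(3):
    print(label, total)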

If what we want is a count per key rather than a sum, much like the count option in an Excel pivot table, there is a convenient countByKey function, demonstrated as follows:

kv.countByKey()

This returns a dictionary mapping each feature 41 value to the number of records that carry it.

You can see that calling kv.countByKey() gives the same result as mapping each record to (key, 1) and then calling reduceByKey with addition; it is the counting counterpart of the duration sum we computed earlier.
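
To make that equivalence concrete, here is a short sketch (reusing the kv RDD from above) that reproduces the countByKey result with map and reduceByKey:

# Map every record to (key, 1), then sum per key; this matches countByKey
counts_via_reduce = kv.map(lambda pair: (pair[0], 1)).reduceByKey(lambda a, b: a + b)
print(dict(counts_via_reduce.collect()) == dict(kv.countByKey()))  # True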
