Sampling/filtering RDDs to pick out relevant data points

In this section, we will look at sampling and filtering RDDs to pick out relevant data points. Sampling is a powerful technique because it lets us work around the cost of processing an entire large dataset and run our calculations on a representative subset instead.

Let's now check how sampling not only speeds up our calculations, but also gives us a good approximation of the statistic that we are trying to calculate. To do this, we first import the time function from the time module as follows:

from time import time

The next thing we want to do is look at the lines, or data points, in the KDD dataset that contain the word normal:

raw_data = sc.textFile("./kdd.data.gz")
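
Before we sample, it can help to peek at what a raw record looks like; each line comes back as a single comma-separated string, which is why the transformations below start with split(","). The exact fields you see depend on your copy of the dataset:

raw_data.take(1)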

We need to create a sample of raw_data. We will store the sample in the sampled variable, and we're sampling from raw_data without replacement. We're sampling 10% of the data, and we're providing 42 as our random seed:

sampled = raw_data.sample(False, 0.1, 42)
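
Note that the 0.1 we pass to sample() is an expected fraction rather than an exact cut, so the sampled RDD will contain roughly, but not exactly, 10% of the rows. A quick sanity check, reusing the raw_data and sampled RDDs from above (the printed ratio will depend on your copy of the dataset), looks like this:

total_count = raw_data.count()
sample_count = sampled.count()
print(sample_count / total_count)  # should come out close to 0.1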

The next thing to do is to chain some map and filter functions, just as we normally would if we were dealing with the unsampled dataset:

contains_normal_sample = sampled.map(lambda x: x.split(",")).filter(lambda x: "normal" in x)

Next, we need to time how long it would take for us to count the number of rows in the sample:

t0 = time()
num_sampled = contains_normal_sample.count()
duration = time() - t0

We issue the count statement here. As you know from the previous section, this triggers all of the calculations defined in contains_normal_sample. We record the time just before the count starts and again just after it finishes, so we can see how long it takes when we're working with a sample. Once this is done, let's take a look at the duration in the following code snippet:

duration

The output will be as follows:

23.724565505981445

It took about 23.7 seconds to run this operation over 10% of the data. Now, let's look at what happens if we run the same transformation over all of the data:

contains_normal = raw_data.map(lambda x: x.split(",")).filter(lambda x: "normal" in x)
t0 = time()
num_normal = contains_normal.count()  # count over the full dataset, kept separate from num_sampled
duration = time() - t0

Let's take a look at the duration again:

duration 

This will provide the following output:

36.51565098762512

There is a small difference here, as we are comparing 36.5 seconds to 23.7 seconds, but the gap grows considerably as your dataset becomes larger and the transformations you run become more complex. The great thing about sampling is that, when you work with big data, verifying whether your answers make sense on a small sample of the data can help you catch bugs much earlier on.
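
Because we sampled 10% of the rows, the count from the sample also gives us an approximation of the statistic on the full dataset, which is the other benefit mentioned at the start of this section. A minimal sketch, reusing num_sampled from the sampled run and num_normal from the full run (how close the estimate lands depends on the data):

estimated_normal = num_sampled / 0.1  # scale the sample count back up by the sampling fraction
print(estimated_normal, num_normal)   # the estimate should land reasonably close to the exact count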

The last thing to look at is how we can use takeSample. All we need to do is use the following code:

data_in_memory = raw_data.takeSample(False, 10, 42)

Here we call takeSample, which samples 10 items without replacement using a random seed of 42 and brings them back into the driver's memory. Now that this data is in memory, we can apply the same map and filter logic using native Python methods as follows:

contains_normal_py = [line.split(",") for line in data_in_memory if "normal" in line]
len(contains_normal_py)

The output will be as follows:

1
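
Since takeSample returns a plain Python list rather than an RDD, the list comprehension above ran as ordinary local Python. A trivial check with the variables we already have makes that explicit:

type(data_in_memory)     # list: the ten sampled rows now live in driver memory
len(contains_normal_py)  # 1: only one of those ten rows contained the word normal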

We have now finished the same calculation locally, using the data_in_memory rows that takeSample brought back to the driver. This is a great illustration of the power of PySpark.

We originally took a sample of 10,000 data points, and it crashed the machine, so here we take just ten data points and check whether they contain the word normal.

We can see from the preceding code block that the calculation completes, but doing this kind of work locally takes longer and uses more memory than letting PySpark handle it. That's why we use Spark: it allows us to parallelize work across big datasets and operate on them in a distributed fashion, which means we can do more with less memory and in less time. In the next section, we're going to talk about splitting datasets and creating new combinations with set operations.
