Splitting datasets and creating some new combinations

In this section, we are going to look at splitting datasets and creating new combinations with set operations, focusing on two functions in particular: subtract and cartesian.

Let's go back to the Chapter 3 Jupyter Notebook, where we have been looking at the lines in the dataset that contain the word normal. Now let's try to get all the lines that don't contain the word normal. One way is to use the filter function to keep only the lines that don't have normal in them. But we can use something different in PySpark: a function called subtract, which takes the entire dataset and subtracts the data points that contain the word normal. Let's start by isolating the normal lines with the following snippet:

normal_sample = sampled.filter(lambda line: "normal." in line)

We can then obtain interactions or data points that don't contain the word normal by subtracting the normal ones from the entire sample as follows:

non_normal_sample = sampled.subtract(normal_sample)

Here, we take the normal sample and subtract it from the entire sample, which is 10% of the full dataset.
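For comparison, the filter-based approach mentioned earlier would look something like the following (a sketch; the name non_normal_by_filter is chosen here purely for illustration):

non_normal_by_filter = sampled.filter(lambda line: "normal." not in line)

Both approaches give us the lines without the word normal; subtract simply lets us express it as a set operation between two RDDs. Let's issue some counts as follows: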

sampled.count()

This will give us the following output:

490705

As you can see, 10% of the dataset gives us 490,705 data points, and within it we have a number of data points containing the word normal. To find out how many, run the following code:

normal_sample.count()

This will give us the following output:

97404

So, here we have 97,404 data points. If we now count the non-normal sample, the count should be just below 400,000 data points, because we're simply subtracting one sample from the other: roughly 490,000 minus 97,000 data points, which should give us something in the region of 393,000. Let's see what happens using the following code snippet:

non_normal_sample.count()

This will give us the following output:

393301

As expected, it returned a value of 393301, which validates our assumption that subtracting the data points containing normal gives us all the non-normal data points.
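As a quick sanity check, the normal and non-normal subsets should add back up to the whole sample, and in this case the counts match exactly (97,404 + 393,301 = 490,705). A one-line check, assuming the same RDDs as above:

# The two subsets together should account for the entire 10% sample
assert normal_sample.count() + non_normal_sample.count() == sampled.count()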

Let's now discuss the other function, called cartesian. It gives us all the possible combinations between the distinct values of two different features.
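To see what cartesian does in isolation, here is a tiny illustration on two small RDDs (the values are made up, and sc is assumed to be the SparkContext already available in the notebook):

letters = sc.parallelize(["a", "b"])
numbers = sc.parallelize([1, 2, 3])
letters.cartesian(numbers).collect()
# [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)]  (order may vary)

Every element of the first RDD is paired with every element of the second, so 2 values times 3 values gives 6 pairs. Let's see how this works on our dataset in the following code snippet: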

feature_1 = sampled.map(lambda line: line.split(",")).map(lambda features: features[1]).distinct()

Here, we split each line on the , character, so the comma-separated values become a list of features. From that list we take the feature at index 1, and we find all the distinct values of that column. We can repeat this for the feature at index 2 as follows:

feature_2 = sampled.map(lambda line: line.split(",")).map(lambda features: features[2]).distinct()
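If the chained calls are hard to follow, here is the feature_1 pipeline again, broken into separate steps (equivalent to the one-liner above, just more verbose):

# Each line becomes a list of comma-separated fields
split_lines = sampled.map(lambda line: line.split(","))
# Keep only the field at index 1 of each record
field_1 = split_lines.map(lambda fields: fields[1])
# Reduce that column to its distinct values
feature_1 = field_1.distinct()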

And so, we now have two features. We can look at the actual items in feature_1 and feature_2 as follows, by issuing the collect() call that we saw earlier:

f1 = feature_1.collect()
f2 = feature_2.collect()

Let's look at each one as follows:

f1

This will provide the following outcome:

['tcp', 'udp', 'icmp']

So, f1 has three values; let's check f2 as follows:

f2

This will print out all of the distinct values of the second feature.

f2 has a lot more values, and we can use the cartesian function to collect all the combinations between f1 and f2 as follows:

len(feature_1.cartesian(feature_2).collect())

This will give us the following output:

198
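The 198 combinations are just every value of f1 paired with every value of f2, so the product count factors as 3 × 66. If we want, we can confirm the individual sizes directly using the same RDDs:

feature_1.count()                       # 3 distinct protocol values
feature_2.count()                       # 66 distinct values, since 3 * 66 = 198
feature_1.cartesian(feature_2).count()  # 198 combinations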

This is how we use the cartesian function to find the Cartesian product between two features. In this chapter, we looked at Spark Notebooks; sampling, filtering, and splitting datasets; and creating new combinations with set operations. 
