Computing summary statistics with MLlib

In this section, we will answer the following questions:

  • What are summary statistics?
  • How do we use MLlib to create summary statistics?

MLlib is the machine learning library that comes with Spark. It lets us pipe Spark's data-processing capabilities directly into machine learning capabilities that are native to Spark. This means that we can use Spark not only to ingest, collect, and transform data, but also to analyze it and build machine learning models on the PySpark platform, which gives us a more seamlessly deployable solution.

Summary statistics are a very simple concept. We are already familiar with the average, the standard deviation, and the variance of a particular variable. These are summary statistics of a dataset. They are called summary statistics because each one summarizes the dataset via a single statistic. For example, when we talk about the average of a dataset, we are summarizing one characteristic of that dataset, and that characteristic is the average.
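
As a quick, self-contained illustration (using only Python's standard library and made-up numbers), here is how those statistics look for a tiny dataset:

from statistics import mean, variance, stdev

# A tiny, made-up dataset, used purely for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean(values)      # the average of the dataset
variance(values)  # the sample variance
stdev(values)     # the sample standard deviation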

Let's check how to compute summary statistics in Spark. The key function here is colStats, which computes column-wise summary statistics for an RDD input. colStats accepts a single parameter, the RDD, and lets us compute several different summary statistics using Spark.
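
As a minimal sketch of the end-to-end call (the numbers below are made up, and an active SparkContext named sc is assumed), colStats can be applied to a small RDD in which each element is a list of numeric features:

from pyspark.mllib.stat import Statistics

# Toy RDD: each element is one row of numeric features (made-up values)
toy_rdd = sc.parallelize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_summary = Statistics.colStats(toy_rdd)
toy_summary.mean()      # column-wise means
toy_summary.variance()  # column-wise variances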

Let's look at the code in the Jupyter Notebook for this chapter, Chapter5.ipynb (available at https://github.com/PacktPublishing/Hands-On-Big-Data-Analytics-with-PySpark/tree/master/Chapter05). We will first load the data from the kddcup.data.gz text file into the raw_data variable, as follows:

raw_data = sc.textFile("./kddcup.data.gz")

The kddcup.data file is a comma-separated values (CSV) file. We have to split this data on the , character and put the result in the csv variable, as follows:

csv = raw_data.map(lambda x: x.split(","))
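
If you want to sanity-check the split, you can peek at the first parsed record; it comes back as a list of string fields (the exact values will depend on your copy of the file):

csv.first()  # returns the first record as a list of string fields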

Let's take the first feature, x[0], of each record; this feature represents the duration. We will convert it to an integer here, and also wrap it in a list, as follows:

duration = csv.map(lambda x: [int(x[0])])

Wrapping the value in a list helps us compute summary statistics over multiple variables, and not just one of them. To use the colStats function, we need to import the Statistics package, as shown in the following snippet:

from pyspark.mllib.stat import Statistics

Statistics lives inside the pyspark.mllib.stat package. Now, we need to call the colStats function from Statistics and feed it some data; here, that data is the duration data from the dataset, and we store the resulting summary statistics in the summary variable:

summary = Statistics.colStats(duration)

To access different summary statistics, such as the mean, standard deviation, and so on, we can call the corresponding methods on the summary object. For example, we can access the mean; since we have only one feature in our duration dataset, we index it with the 0 index and get the mean of the dataset, as follows:

summary.mean()[0]

This will give us the following output:

47.97930249928637

Similarly, if we import the sqrt function from the Python standard library, we can compute the standard deviation of the durations seen in the dataset by taking the square root of the variance, as demonstrated in the following code snippet:

from math import sqrt
sqrt(summary.variance()[0])

This will give us the following output:

707.746472305374

If we don't index the summary statistics with [0], we can see that summary.max() and summary.min() give us back an array, of which the first element is the summary statistic that we want, as shown in the following code snippet:

summary.max()
array([58329.]) #output
summary.min()
array([0.]) #output
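
The summary object also exposes a few more column-wise statistics beyond the mean, variance, maximum, and minimum, and the same colStats call works just as well when we select several columns per record. The following sketch illustrates both points; the column indices 4 and 5 are used purely as an example of additional numeric fields:

summary.count()        # number of rows the statistics were computed over
summary.numNonzeros()  # number of non-zero values per column
summary.normL1()       # L1 norm of each column
summary.normL2()       # L2 norm of each column

# Selecting several columns per record (indices chosen only for illustration)
metrics = csv.map(lambda x: [float(x[0]), float(x[4]), float(x[5])])
metrics_summary = Statistics.colStats(metrics)
metrics_summary.mean()  # one mean per selected column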