Computing Pearson and Spearman correlations

To understand this, let's take the first three numeric variables from our dataset. We want to access the csv variable that we defined previously, where we simply split raw_data on a comma (,). We will consider only the first three numeric columns; we will not take any feature that contains words, since we are only interested in features that are purely numeric. In kddcup.data, the first feature is indexed at 0, and features 5 and 6 are indexed at 4 and 5, respectively; these are the numeric variables we want. We use a lambda function to collect all three of them into a list and assign it to the metrics variable:

from pyspark.mllib.stat import Statistics

metrics = csv.map(lambda x: [x[0], x[4], x[5]])
Statistics.corr(metrics, method="spearman")

This will give us the following output:

array([[ 1.        ,  0.01419628,  0.29918926],
       [ 0.01419628,  1.        , -0.16793059],
       [ 0.29918926, -0.16793059,  1.        ]])
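If you are starting from a fresh session and csv is not already defined, the following is a minimal sketch of the whole pipeline. The file path and the explicit float casts are assumptions for illustration; adjust them to match how you loaded kddcup.data earlier:

from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext.getOrCreate()

# Assumption: kddcup.data is in the current working directory;
# change the path to wherever you stored the dataset.
raw_data = sc.textFile("./kddcup.data")

# Split each line on the comma, as we did previously.
csv = raw_data.map(lambda x: x.split(","))

# Keep the three numeric columns (indices 0, 4, and 5) and cast them
# to float; split() returns strings, so the explicit cast makes the
# numeric intent clear.
metrics = csv.map(lambda x: [float(x[0]), float(x[4]), float(x[5])])

print(Statistics.corr(metrics, method="spearman"))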

In the Computing summary statistics with MLlib section, we took only the first feature into a list, so each list had a length of one. Here, we take one value from each of the three variables into the same list, so each list now has a length of three.

To compute the correlations, we call the corr method on the metrics variable and specify the method as "spearman". PySpark gives us back a simple matrix of the correlations between the variables. In our example, the third variable in metrics is more strongly correlated with the first variable than the second variable is.
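Because the matrix that corr returns behaves like a NumPy array, we can also read off individual coefficients by index. The corr_matrix name below is just an illustrative variable for holding the result:

corr_matrix = Statistics.corr(metrics, method="spearman")

# Row i, column j holds the correlation between variable i and variable j.
print(corr_matrix[0, 2])  # first vs. third variable: roughly 0.299
print(corr_matrix[0, 1])  # first vs. second variable: roughly 0.014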

If we run corr on metrics again, but specify the method as "pearson", it gives us the Pearson correlations instead, as shown in the snippet below. So, why do we need to be qualified data scientists or machine learning researchers to call these two simple functions and merely change the value of the second parameter? A lot of machine learning and data science revolves around our understanding of statistics, of how data behaves, of how machine learning models are grounded, and of what gives them their predictive power.
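For reference, the only change needed to switch methods is the second parameter:

Statistics.corr(metrics, method="pearson")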

So, as machine learning practitioners or data scientists, we simply use PySpark as a big calculator. When we use a calculator, we never complain that it is simple to use; in fact, it helps us reach our goal in a more straightforward way. It is the same with PySpark: once we move from the data engineering side to the MLlib side, we will notice that the code gets incrementally easier. MLlib tries to hide the complexity of the mathematics underneath it, but we still need to understand the difference between these correlations, and we need to know how and when to use each of them.
