Testing our hypotheses on large datasets

In this section, we will look at hypothesis testing and learn how to run hypothesis tests using PySpark. We will focus on one particular test that is implemented in PySpark: Pearson's chi-square test. A chi-square test measures how likely it is that the differences between two distributions arose by chance.

For example, if we had a retail store with very little footfall, and suddenly the footfall increased, how likely is it that this is random? In other words, is there a statistically significant difference in the number of visitors we are getting now compared to before? The reason this is called the chi-square test is that the test itself references the chi-square distribution. You can refer to online documentation to learn more about chi-square distributions.

There are three variations of Pearson's chi-square test. Here, we will use the goodness-of-fit variation, which checks whether an observed dataset is distributed differently from a theoretical (expected) distribution.

Let's see how we can implement this. We start by importing the Vectors package from pyspark.mllib.linalg. Using Vectors, we will create a dense vector of the visitor frequencies by day in our store.

Let's imagine that the frequencies go from 0.13 on Monday, through 0.61, 0.8, and 0.5 midweek, finally ending on Friday at 0.3. So, we put these visitor frequencies into the visitors_freq variable. Since we're using PySpark, it is very simple for us to run a chi-square test from the Statistics package, which we import from pyspark.mllib.stat, as follows:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

# Daily visitor frequencies, Monday through Friday
visitors_freq = Vectors.dense(0.13, 0.61, 0.8, 0.5, 0.3)

# Goodness-of-fit test against a uniform expected distribution
print(Statistics.chiSqTest(visitors_freq))

Running the chi-square test on the visitors_freq variable prints a test summary containing several useful pieces of information.

The output shows the chi-square test summary. The method used is pearson, and with five categories there are 4 degrees of freedom. The test statistic is 0.585, which gives a pValue of 0.964. Because the p-value is so high, there is no presumption against the null hypothesis: the observed data follows the same distribution as expected, which means our visitor numbers are not significantly different. This gives us a good introduction to hypothesis testing.
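To see where these numbers come from, here is a plain-Python sketch of the same goodness-of-fit computation, with no Spark required. It assumes a uniform expected distribution (which chiSqTest uses when no expected vector is supplied), and it uses the closed-form survival function of the chi-square distribution, which applies because our degrees of freedom (4) are even. The function name is our own, not part of any library:

```python
import math

def chi_square_goodness_of_fit(observed):
    """Pearson's chi-square goodness-of-fit test against a
    uniform expected distribution (a hand-rolled sketch)."""
    expected = sum(observed) / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1  # degrees of freedom: categories minus one

    # For even df, the chi-square survival function has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    half = statistic / 2.0
    p_value = math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )
    return statistic, df, p_value

stat, df, p = chi_square_goodness_of_fit([0.13, 0.61, 0.8, 0.5, 0.3])
print(f"statistic = {stat:.4f}, df = {df}, pValue = {p:.4f}")
```

The statistic comes out to approximately 0.585 with a p-value of approximately 0.96, matching the summary PySpark reports.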
