z score standardization

This technique subtracts the column mean from each value in the column, and then divides the result by the column's standard deviation. The formula to achieve this is the following:

z = (x - μ) / σ

The result of standardization is that the features are rescaled to have the mean and standard deviation of a standard normal distribution, as follows:

  • μ=0
  • σ=1

Here, μ is the mean and σ is the standard deviation.

In summary, the z score (also called the standard score) is the number of standard deviations by which an observed value differs from the mean of what is observed or measured. Values greater than the mean have positive z scores, while values less than the mean have negative z scores. The z score is a dimensionless quantity, obtained by subtracting the population mean from a single raw score and then dividing the difference by the population standard deviation.
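As a minimal sketch of this definition, the following Python snippet computes the z score of a few values. The readings are made up for illustration, and the function name z_score is hypothetical; the population standard deviation (statistics.pstdev) is used to match the description above.

```python
import statistics


def z_score(value, mean, std):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std


# Made-up readings, for illustration only.
readings = [10.0, 20.0, 30.0, 40.0, 50.0]
mu = statistics.fmean(readings)      # mean of the readings
sigma = statistics.pstdev(readings)  # population standard deviation

above = z_score(50.0, mu, sigma)    # value above the mean: positive z score
below = z_score(10.0, mu, sigma)    # value below the mean: negative z score
at_mean = z_score(30.0, mu, sigma)  # value equal to the mean: z score of zero
```

Because the result is a ratio of two quantities in the same unit, it carries no unit of its own.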

Once again, to standardize the data, we will use the same procedure used for min-max normalization. This time, the two functions change as follows:

  • AVERAGE: Calculates the average for each column
  • STDEV: Calculates the standard deviation for each column

To perform a z score standardization, we use the same dataset used for min-max normalization: the dataset called Airquality.csv, which contains daily readings of air quality values from May 1, 1973 (a Tuesday) to September 30, 1973.

To apply z score standardization to a dataset column, perform the following steps:

  1. In the Recipe panel, click on the NEW STEP button. Again, to open the Recipe panel, just click on the icon (Recipe) at the top left of the Run Job button.
  2. From the Transformation drop-down menu, select Apply formula (this item sets the values of one or more columns to the result of a formula).
  3. From the column box, select the Ozone column (the same procedure can be applied to all the dataset variables).
  4. In the Formula box, enter the following formula:
     (Ozone- AVERAGE(Ozone))/STDEV(Ozone)
  5. Click on the ADD button.

The following statement is added to the Recipe panel:

Set Ozone to (Ozone- AVERAGE(Ozone))/STDEV(Ozone)

We follow the same procedure for the other three variables: Solar_R, Wind, and Temp. At the end, we will have added the following lines to the Recipe panel:

Set Solar_R to (Solar_R - AVERAGE(Solar_R))/STDEV(Solar_R)
Set Wind to (Wind - AVERAGE(Wind))/STDEV(Wind)
Set Temp to (Temp - AVERAGE(Temp))/STDEV(Temp)

The transformer page has been changed, as follows:

The effect of the z score standardization is evident: the data ranges are now quite similar across all columns of the dataset, that is, for every variable. The differences in scale due to the different units of measurement have therefore been removed.
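For readers who want to reproduce the four recipe statements outside the tool, here is a rough Python sketch. It is an illustration under stated assumptions, not the tool's implementation: the values are made up (they are not taken from Airquality.csv), the helper standardize_column is hypothetical, missing readings are modeled as None and skipped, and the population standard deviation is assumed because the behavior of STDEV is not stated here.

```python
import statistics


def standardize_column(values):
    """Z-score a list of numbers; None (missing) entries stay None."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    std = statistics.pstdev(present)  # assumption: population form
    if std == 0:
        raise ValueError("constant column: standard deviation is zero")
    return [None if v is None else (v - mean) / std for v in values]


# Made-up rows for illustration; not values from Airquality.csv.
table = {
    "Ozone": [30.0, 45.0, None, 60.0],
    "Solar_R": [100.0, 200.0, 300.0, 150.0],
    "Wind": [5.0, 10.0, 15.0, 10.0],
    "Temp": [60.0, 70.0, 80.0, 70.0],
}

# Same formula applied to every column, as in the four recipe steps.
standardized = {name: standardize_column(col) for name, col in table.items()}
```

After this step, every column is expressed in units of its own standard deviation, which is why the ranges become comparable.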

By construction, every standardized variable should have an average of 0 and a standard deviation of 1. Let's verify that. To do so, just use the statistical information available in the Column Details panel, already used in the previous sections. To open the Column Details panel, select Column Details from the drop-down menu of a column; the following panel is opened:

So, we have verified that the Ozone variable has an average of zero and a standard deviation of one. The same check can be executed for the other variables.
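The same check can be sketched in Python. The snippet below standardizes a made-up column (the values and the standardize helper are hypothetical) and confirms that the result has a mean of zero and a standard deviation of one, up to floating-point rounding.

```python
import math
import statistics


def standardize(values):
    """Return the z scores of `values`, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up readings, for illustration only.
ozone = [12.0, 25.0, 31.0, 47.0, 80.0]
z = standardize(ozone)

# Compare with a tolerance, since floating-point results are rarely exactly 0 or 1.
mean_ok = math.isclose(statistics.fmean(z), 0.0, abs_tol=1e-9)
std_ok = math.isclose(statistics.pstdev(z), 1.0, rel_tol=1e-9)
print(mean_ok, std_ok)
```

Note that the check must use the same form of standard deviation as the standardization itself: if the data is standardized with the population form but checked with the sample form (statistics.stdev), the result will be slightly different from one.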
