Removing outliers

So far we have seen different techniques for identifying possible outliers. What should we do once we have found them? We need to determine whether these values are valid or invalid for the dataset.

If the values are invalid because of an error made while the dataset was being populated, we must correct them. The correction may consist of replacing the value with a presumably valid one or removing the entire row. In the latter case, we must consider how much weight the removal has on the dataset as a whole.

To replace the value 100, which seems invalid in all respects (perhaps it was 10 and an extra zero was added), we can use the following formula:

IF($col == 100, 10, $col)
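For readers who want to try the same conditional replacement outside the tool, here is a minimal Python sketch. The sample list `ages` is hypothetical; only the value 100 and its replacement 10 come from the text.

```python
# Hypothetical sample of the age column; only the 100 entry comes from the text.
ages = [15, 22, 100, 32, 18]

# Same logic as the formula: where the value is 100 use 10, otherwise keep it.
cleaned = [10 if age == 100 else age for age in ages]

print(cleaned)  # [15, 22, 10, 32, 18]
```

The comparison is an exact match on 100, so every other value passes through unchanged.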

Alternatively, if removing this record has no statistically significant effect on the dataset, we can simply delete it, as shown here:

DELETE row: age == 100
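The same row deletion can be sketched in plain Python, together with a quick check of how much of the dataset the deletion costs. The records and the `name` field are hypothetical and serve only as an illustration.

```python
# Hypothetical records; only the age value 100 comes from the text.
rows = [
    {"name": "A", "age": 15},
    {"name": "B", "age": 22},
    {"name": "C", "age": 100},
    {"name": "D", "age": 32},
    {"name": "E", "age": 18},
]

# Keep every row whose age is not 100 (the rows the DELETE step would drop).
kept = [row for row in rows if row["age"] != 100]

# Share of the dataset lost to the deletion, to judge its weight.
removed_share = 1 - len(kept) / len(rows)

print(len(kept), f"{removed_share:.0%}")
```

On a sample this small, a single row is a fifth of the data. On a large dataset the same deletion would be negligible, which is the weight the text asks us to consider.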

If the data seems valid (a 100-year-old person is, after all, possible), we can leave it as it is. Alternatively, we can convert it into a value that is statistically more representative. For example, we can replace it with the average of the entire column, which at least preserves the information that this observation carries in the other columns. To replace every value above 80 (in our column, the value 100) with the column average, use the following formula:

if($col > 80, average($col), $col)
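As an illustration, the mean replacement might look like this in Python. The sample values are hypothetical, and the sketch assumes that the average is taken over the whole column, outlier included.

```python
from statistics import mean

# Hypothetical sample of the age column; only the 100 entry comes from the text.
ages = [15, 22, 100, 32, 18]

# Assumption: the average covers the entire column, including the outlier.
column_mean = mean(ages)

# Replace every value above the threshold of 80 with the column mean.
imputed = [column_mean if age > 80 else age for age in ages]

print(column_mean, imputed)
```

Because the outlier takes part in the average, the replacement value is pulled upward. Computing the mean over only the values at or below the threshold is a common alternative when that bias matters.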

Ultimately, we choose the first option, so 100 will be replaced with 10 in the age column. To do this, we insert (as always) the first formula, the one that replaces 100 with 10, in a new step, as shown in the following screenshot:

Click the Add button to add the new step to the Recipe panel and see the effect of the change. In the screenshot, we can also verify that the range of values is now significantly narrower: 10 to 32 instead of 15 to 100. In addition, the histogram no longer has isolated bars at its far end.
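The effect on the range can also be checked with a few lines of Python. The sample values are hypothetical, chosen only so that the minimum and maximum match the ranges quoted above.

```python
# Hypothetical sample whose range matches the one quoted in the text (15 to 100).
ages = [15, 22, 100, 32, 18]
cleaned = [10 if age == 100 else age for age in ages]

# Compare the range of the column before and after the replacement.
before = (min(ages), max(ages))
after = (min(cleaned), max(cleaned))

print(before, after)  # (15, 100) (10, 32)
```

The new minimum of 10 is the replaced value itself, which is why the lower bound moves as well as the upper one.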
