Finding outliers in the data

Outliers are values that are particularly extreme compared with the others: observations that lie clearly distant from the rest of the distribution, either extremely high or extremely low, and so represent isolated cases. Outliers are a problem because they tend to distort the results of data analysis, in particular descriptive statistics and correlations. Ideally, they are identified during the data cleaning phase; however, they can also be dealt with in the next step of the data analysis. An outlier can be univariate, when it has an extreme value for a single variable, or multivariate, when it has an unusual combination of values across several variables.

There are several methods for detecting outliers. Google Cloud Dataprep uses Tukey's method, which is based on the interquartile range (IQR). This method does not depend on the distribution of the data, and it ignores the mean and the standard deviation, both of which are influenced by outliers.

To determine the outlier values, refer to the IQR, which is the difference between the 75th percentile and the 25th percentile, that is, the width of the range that contains the middle 50 percent of the observations in the ordered series of data. An outlier is a value whose positive deviation from the 75th percentile is greater than two times the IQR or, symmetrically, a value whose negative deviation from the 25th percentile is (in absolute value) greater than two times the IQR.

In practice, a value is an outlier if it meets either of these two conditions:

< (25th percentile) - (2 * IQR)
> (75th percentile) + (2 * IQR)
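
The two conditions above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not Dataprep's own implementation: the function names `outlier_bounds` and `find_outliers` and the sample data are hypothetical, and the percentiles are computed with the standard library's "inclusive" interpolation, which may differ from the quantile estimation Dataprep uses internally. The multiplier `k` defaults to 2 to match the text; the classical form of Tukey's rule uses 1.5.

```python
import statistics


def outlier_bounds(values, k=2.0):
    """Return the (lower, upper) fences of the IQR rule.

    k is the IQR multiplier: 2 follows the rule described in the text,
    while 1.5 is the value used in the classical Tukey rule.
    """
    # quantiles(..., n=4) returns the three quartile cut points;
    # the "inclusive" method is an assumption made for this sketch.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=2.0):
    """Return the values that fall strictly outside the fences."""
    lower, upper = outlier_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: 25th percentile = 12, 75th percentile = 16, IQR = 4
sample = [10, 12, 12, 13, 14, 15, 16, 18, 60]
print(outlier_bounds(sample))  # fences at 12 - 2*4 and 16 + 2*4
print(find_outliers(sample))   # only 60 lies outside the fences
```

Because the fences are built from quartiles, the extreme value 60 barely moves them, whereas it would pull the mean and inflate the standard deviation.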

To identify outliers in individual columns, Google Cloud Dataprep provides both visual tools and statistical information.
