Outlier detection and filtering

Outliers are data points that diverge markedly from the other observations in a dataset, and they can arise for several reasons. During the EDA phase, one of our common tasks is to detect and filter these outliers, because their presence can seriously distort statistical analysis. In this section, we are going to perform simple outlier detection and filtering. Let's get started:

  1. Load the dataset that is available from the GitHub link as follows:
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%204/sales.csv')
df.head(10)

The dataset is synthetic: we generated it with a script. If you are interested in how we created the dataset, the script can be found inside the folder named Chapter 4 in the GitHub repository shared with this book.

The output of the preceding df.head(10) command is shown in the following screenshot:

  2. Now, suppose we want to calculate the total price based on the quantity sold and the unit price. We can simply add a new column, as shown here:
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df

This should add a new column called TotalPrice, as shown in the following screenshot:

Now, let's answer some questions based on the preceding table.

Let's find the transactions whose total price exceeded 3,000,000:

TotalTransaction = df["TotalPrice"]
TotalTransaction[np.abs(TotalTransaction) > 3000000]

The output of the preceding code is as follows:

2 3711433
7 3965328
13 4758900
15 5189372
17 3989325
...
9977 3475824
9984 5251134
9987 5670420
9991 5735513
9996 3018490
Name: TotalPrice, Length: 2094, dtype: int64

Note that, in the preceding example, we have assumed that any total price greater than 3,000,000 is an outlier.
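The 3,000,000 cutoff is an assumption chosen by hand. As a hedged alternative, the following sketch derives an upper cutoff from the data using the common 1.5 × IQR rule. This is not the method used in this section, and the `totals` values are made-up toy numbers rather than rows from sales.csv:

```python
import pandas as pd

# Hypothetical toy totals (not taken from sales.csv); one value is clearly extreme.
totals = pd.Series([1200, 1500, 1100, 1300, 1400, 25000], name="TotalPrice")

# Interquartile range: spread of the middle 50% of the values.
q1 = totals.quantile(0.25)
q3 = totals.quantile(0.75)
iqr = q3 - q1

# Common rule of thumb: values above Q3 + 1.5 * IQR are treated as outliers.
upper = q3 + 1.5 * iqr

# Same boolean-mask filtering as in the fixed-threshold example.
outliers = totals[totals > upper]
print(outliers)
```

The filtering step is unchanged; only the way the threshold is chosen differs, so the same pattern applies to `df["TotalPrice"]` once a suitable rule has been picked for the real data.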

Display the complete rows (all columns) from the preceding table for which TotalPrice is greater than 6741112, as follows:

df[np.abs(TotalTransaction) > 6741112]

The output of the preceding code is the following:

Note that in the output, all the TotalPrice values are greater than 6741112. We can use any kind of condition, either row-wise or column-wise, to detect and filter outliers.
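To illustrate the difference between a column-wise and a row-wise condition, here is a minimal sketch on a small, hypothetical DataFrame. The `df_toy` values and the per-column `limits` are assumptions made up for illustration; only the column names mirror the sales dataset:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the sales dataset.
df_toy = pd.DataFrame({
    "Quantity": [10, 25, 400, 30],
    "UnitPrice": [100, 120, 110, 9000],
})
df_toy["TotalPrice"] = df_toy["Quantity"] * df_toy["UnitPrice"]

# Column-wise condition: test a single column against one threshold.
high_total = df_toy[df_toy["TotalPrice"] > 40000]

# Row-wise condition: flag a row if ANY checked column exceeds its own limit.
limits = pd.Series({"Quantity": 100, "UnitPrice": 1000})  # assumed limits
mask = df_toy[["Quantity", "UnitPrice"]].gt(limits, axis="columns").any(axis=1)
flagged = df_toy[mask]

# Filtering: keep only the rows that were not flagged.
filtered = df_toy[~mask]
print(filtered)
```

Detection produces a boolean mask, and filtering is simply indexing with the negated mask, so the two steps can be kept separate and inspected independently.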
