Calculating outliers in a data frame

We can flag outliers with a standard rule: a value is an outlier if the absolute value of its difference from the mean is greater than 1.96 times the standard deviation. (This assumes the data follows a normal, or Gaussian, distribution, in which about 95% of values fall within 1.96 standard deviations of the mean.)
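As a minimal sketch of this rule, the following applies it to a handful of made-up ages. The values and the names ages and is_outlier are invented for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# hypothetical ages, invented for illustration only
ages = pd.Series([22, 25, 27, 28, 30, 31, 33, 35, 36, 90])

mean = ages.mean()
std = ages.std()  # sample standard deviation (ddof=1), the pandas default

# flag values whose absolute deviation from the mean exceeds 1.96 standard deviations
is_outlier = (ages - mean).abs() > 1.96 * std

print(ages[is_outlier].tolist())  # prints [90]
```

Only the value 90 is flagged here, because it is the only one more than 1.96 standard deviations away from the mean of 35.7.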

For example, using the same Titanic dataset loaded previously, we can determine which passengers were outliers based on age.

The Python script is as follows:

import pandas as pd

df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')

# absolute difference between each age and the mean age
df['x-Mean'] = abs(df['age'] - df['age'].mean())

# 1.96 times standard deviation for age
df['1.96*std'] = 1.96*df['age'].std()

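# an age is an outlier if its absolute difference from the mean > 1.96 times std dev
# (rows with a missing age compare as False, so they are never flagged)
df['Outlier'] = abs(df['age'] - df['age'].mean()) > 1.96*df['age'].std()

# print the results
# df.shape gives (rows, columns); df.count is a method and was never called
print("Dataset dimensions", df.shape)
# summing a boolean column counts the True values, even when there are none
print("Number of age outliers", df['Outlier'].sum())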

Under Jupyter, the outlier count displays as:

Number of age outliers 65

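So, given that there were about 1,300 passengers, roughly 5% are flagged as outliers. That is about the share a normal distribution places beyond 1.96 standard deviations from the mean, so the ages may be approximately normally distributed, although this check alone does not establish it.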
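The 5% figure can be sanity-checked with a short sketch that uses only the Python standard library. It compares the share of a normal distribution lying beyond 1.96 standard deviations with the share observed above. The passenger count of 1,300 is the rounded figure quoted in the text, not an exact count, and the variable names are illustrative.

```python
from statistics import NormalDist

# share of a standard normal distribution beyond 1.96 standard deviations (both tails)
expected = 2 * (1 - NormalDist().cdf(1.96))

# observed share, from the numbers quoted in the text
outliers = 65
passengers = 1300  # approximate count from the text (an assumption, not exact)
observed = outliers / passengers

print(f"expected {expected:.3f}, observed {observed:.3f}")
```

The two shares being close is consistent with a roughly normal distribution of ages, but it is not a formal test of normality.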
