Calculating outliers in a data frame

We can flag outliers with a standard rule: a value is an outlier if the absolute value of its difference from the mean is greater than 1.96 times the standard deviation. (This assumes the data follow a normal, or Gaussian, distribution.)
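The origin of the 1.96 factor can be checked with a short sketch using Python's standard library (`statistics.NormalDist`, available from Python 3.8). The variable names here are illustrative and are not part of the original script.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist()

# z-value that leaves 2.5% in the upper tail (and, by symmetry, 2.5% in the lower tail)
z = standard_normal.inv_cdf(0.975)

# Share of a normal population lying more than 1.96 standard deviations from the mean
two_sided_tail = 2 * (1 - standard_normal.cdf(1.96))

print("z for a central 95% interval:", round(z, 2))
print("share beyond 1.96 std devs:", round(two_sided_tail, 3))
```

In other words, if the data really are normally distributed, this rule should flag roughly 5% of the values.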

For example, using the same Titanic dataset loaded previously, we can determine which passengers were outliers based on age.

The Python script is as follows:

import pandas as pd

df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')

# absolute difference between each age and the mean age
df['x-Mean'] = abs(df['age'] - df['age'].mean())

# 1.96 times standard deviation for age
df['1.96*std'] = 1.96*df['age'].std()

# this age is an outlier if abs difference > 1.96 times std dev
df['Outlier'] = abs(df['age'] - df['age'].mean()) > 1.96*df['age'].std()

# print the results
print("Dataset dimensions", df.shape)
print("Number of age outliers", df['Outlier'].sum())

Under Jupyter, the results show as:

Number of age outliers 65  

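So, given there were about 1,300 passengers, we have about 5% outliers. That is roughly the share a normal distribution would produce, since about 5% of normally distributed values lie more than 1.96 standard deviations from the mean. The result is therefore consistent with normally distributed ages, although it does not establish normality by itself. Note that any row with a missing age is never flagged, because a comparison against a missing value evaluates to False, so the share among passengers with a recorded age may be somewhat higher.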
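As a rough check on this reasoning, the sketch below applies the same rule to synthetic, normally distributed values with some entries removed, and measures the flagged share against the non-missing values only. The helper `outlier_fraction` and the generated data are assumptions made for illustration; they are not part of the original script or the Titanic file.

```python
import numpy as np
import pandas as pd


def outlier_fraction(values: pd.Series, z: float = 1.96) -> float:
    """Share of non-missing values more than z standard deviations from the mean."""
    deviation = (values - values.mean()).abs()
    flagged = deviation > z * values.std()  # missing values compare as False
    return flagged.sum() / values.notna().sum()


# Synthetic, normally distributed "ages" (made-up data, fixed seed for repeatability)
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=30, scale=14, size=10_000))
ages.iloc[::10] = np.nan  # blank out every tenth value to mimic missing ages

fraction = outlier_fraction(ages)
print(f"Fraction flagged: {fraction:.3f}")
```

For the Titanic data, the equivalent denominator would be `df['age'].notna().sum()` rather than the total number of rows.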
