A common way to flag outliers is to check whether the absolute difference between a value and the mean is greater than 1.96 times the standard deviation. (This assumes the data follow a normal, or Gaussian, distribution, in which case about 5% of values fall outside that range.)
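To see where the 1.96 factor comes from, the following minimal sketch computes the probability that a standard normal value lies more than 1.96 standard deviations from the mean. It uses only the Python standard library; the names `z` and `tail_probability` are illustrative and not part of the original script.

```python
import math

# Threshold in units of standard deviations (the factor used in the text).
z = 1.96

# For a standard normal variable Z, P(|Z| <= z) = erf(z / sqrt(2)),
# so the two-sided tail probability is its complement.
tail_probability = 1.0 - math.erf(z / math.sqrt(2.0))

print(f"P(|Z| > {z}) = {tail_probability:.4f}")
```

The result is very close to 0.05, which is why roughly 5% of normally distributed values are expected to be flagged by this rule.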
For example, using the same Titanic dataset loaded previously, we can determine which passengers were outliers based on age.
The Python script is as follows:
import pandas as pd

df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')

# absolute difference between each age and the mean age
df['x-Mean'] = abs(df['age'] - df['age'].mean())

# 1.96 times standard deviation for age
df['1.96*std'] = 1.96*df['age'].std()

# this age is an outlier if abs difference > 1.96 times std dev
df['Outlier'] = df['x-Mean'] > df['1.96*std']

# print results (df.shape gives the number of rows and columns)
print("Dataset dimensions", df.shape)
print("Number of age outliers", df.Outlier.value_counts()[True])
Under Jupyter, the relevant line of the output is:
Number of age outliers 65
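So, given there were about 1,300 passengers, about 5% of them are flagged as outliers. That is roughly the share a normal distribution would place more than 1.96 standard deviations from the mean, so the ages may be approximately normally distributed, although this check alone does not establish it. Note that a passenger with no recorded age is never flagged, because a comparison with a missing value evaluates to False.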
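Because rows with a missing value can never be flagged, the outlier share depends on which denominator is used. The sketch below is one hedged way to report both shares; the helper `outlier_summary` and the small made-up series are hypothetical illustrations, not part of the original script or the Titanic data.

```python
import numpy as np
import pandas as pd


def outlier_summary(values: pd.Series, z: float = 1.96) -> dict:
    """Count values more than z standard deviations from the mean.

    Hypothetical helper: missing values are dropped before the mean and
    standard deviation are computed, and both shares are reported.
    """
    observed = values.dropna()
    deviation = (observed - observed.mean()).abs()
    is_outlier = deviation > z * observed.std()
    n_outliers = int(is_outlier.sum())
    return {
        "n_outliers": n_outliers,
        "share_of_all_rows": n_outliers / len(values),
        "share_of_observed": n_outliers / len(observed),
    }


# Made-up example data: nine values of 10, one of 100, and two missing.
ages = pd.Series([10] * 9 + [100] + [np.nan] * 2, dtype=float)
summary = outlier_summary(ages)
print(summary)
```

Applied to the Titanic data, the same call would be `outlier_summary(df['age'])`; comparing the two shares shows how much the missing ages affect the percentage.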