Showing distribution with the KDE plot

Like a histogram, a kernel density estimate (KDE) plot visualizes the shape of a data distribution. Instead of counting observations in discrete bins, it uses kernel smoothing to produce a continuous curve, and it is often overlaid on a histogram. It is useful in exploratory data analysis.
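To make the idea of kernel smoothing concrete, here is a minimal sketch of a one-dimensional Gaussian KDE written with the standard library only. The function name `gaussian_kde_1d`, the sample ages, and the bandwidth of 5 are assumptions chosen for illustration; plotting libraries pick the bandwidth automatically and may differ in detail.

```python
import math

def gaussian_kde_1d(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE built from samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centered on that sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical ages, for illustration only
ages = [22, 25, 31, 38, 40, 47]
density = gaussian_kde_1d(ages, bandwidth=5.0)

# Evaluate the smooth curve on a grid from -20 to 120 in steps of 0.5
step = 0.5
grid = [i * step for i in range(-40, 241)]
area = sum(density(x) for x in grid) * step

print(f"density at age 30: {density(30):.4f}")
print(f"density at age 0:  {density(0):.6f}")
print(f"area under curve:  {area:.3f}")
```

Note that the estimated density at age 0 is small but not zero, even though no sample is anywhere near it: each Gaussian bump has tails that reach beyond the observed data.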

In the following example, we will compare income across age groups in different countries, using survey data that were binned with different age groupings.

Here is the code for data curation:

import pandas as pd
import matplotlib.pyplot as plt

# Prepare the data
# Weekly earnings of U.S. wage workers in 2016, by age
# Downloaded from Statista.com
# Source URL: https://www.statista.com/statistics/184672/median-weekly-earnings-of-full-time-wage-and-salary-workers/
us_agegroups = [22,29.5,39.5,49.5]
# Convert to a rough estimate of monthly earnings by multiplying by 4
us_incomes = [x*4 for x in [513,751,934,955]]

# Monthly salary in the Netherlands in 2016 per age group excluding overtime (Euro)
# Downloaded from Statista.com
# Source URL: https://www.statista.com/statistics/538025/average-monthly-wage-in-the-netherlands-by-age/
# Take the center of each age group
nl_agegroups = [22.5, 27.5, 32.5, 37.5, 42.5, 47.5, 52.5]
# Convert EUR to USD at a rate of 1.113 USD per EUR
nl_incomes = [x*1.113 for x in [1027, 1948, 2472, 2795, 2996, 3069, 3070]]

# Median monthly wage analyzed by sex, age group, educational attainment, occupational group and industry section
# May-June 2016 (HKD)
# Downloaded from the website of the Census and Statistics Department of the HKSAR government
# Source URL: https://www.censtatd.gov.hk/fd.jsp?file=D5250017E2016QQ02E.xls&product_id=D5250017&lang=1
hk_agegroups = [19.5, 29.5, 39.5, 49.5]
# Convert HKD to USD at a rate of 7.770 HKD per USD
hk_incomes = [x/7.770 for x in [11900,16800,19000,16600]]
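The three sources report earnings in different units, so each list is scaled to an approximate monthly figure in US dollars before plotting. As a sketch, the same conversions can be gathered into one helper; the function name `to_monthly_usd` and its parameters are hypothetical, while the exchange rates and the factor of 4 weeks per month are the ones hard-coded in the listing above.

```python
def to_monthly_usd(amounts, usd_per_unit=1.0, periods_per_month=1):
    """Scale amounts to approximate monthly figures in USD."""
    return [a * periods_per_month * usd_per_unit for a in amounts]

# Weekly USD -> monthly USD (4 weeks per month is a rough approximation)
us_monthly = to_monthly_usd([513, 751, 934, 955], periods_per_month=4)
# Monthly EUR -> monthly USD at the 1.113 rate used in the listing
nl_monthly = to_monthly_usd(
    [1027, 1948, 2472, 2795, 2996, 3069, 3070], usd_per_unit=1.113
)
# Monthly HKD -> monthly USD at 7.770 HKD per USD
hk_monthly = to_monthly_usd([11900, 16800, 19000, 16600], usd_per_unit=1 / 7.770)

print(us_monthly)
print([round(v) for v in nl_monthly])
print([round(v) for v in hk_monthly])
```

Keeping the rate and the period as named arguments makes it easier to see, and later to update, the assumptions behind each converted series.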

Let's now draw the KDE plots for comparison. We have prepared a reusable function so that the three datasets can be plotted with less repetition in the code:

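import seaborn as sns

def kdeplot_income_vs_age(agegroups, incomes):
    """Draw a bivariate KDE plot of monthly income against age."""
    plt.figure()
    # seaborn 0.11 and later take the two variables as x= and y= keywords;
    # newer releases no longer accept them as two positional arrays
    sns.kdeplot(x=agegroups, y=incomes)
    plt.xlim(0, 65)
    plt.ylim(0, 6000)
    plt.xlabel('Age')
    plt.ylabel('Monthly salary (USD)')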

kdeplot_income_vs_age(us_agegroups,us_incomes)
kdeplot_income_vs_age(nl_agegroups,nl_incomes)
kdeplot_income_vs_age(hk_agegroups,hk_incomes)

The resulting plots are shown below, from top to bottom, for the US, the Netherlands, and Hong Kong:

  

Of course, the figure does not reflect the original data very accurately: without any tweaking, the kernel smoothing extrapolates beyond the observed points (for instance, we have no child labor data here, yet the contours extend even to ages below 10). Still, we can observe general differences in the income structure between ages 20 and 50 across the three economies, and see to what extent the downloaded public data are comparable. From here, we may be able to suggest surveys with more useful age groupings, and perhaps obtain more raw data points to suit our analyses.
