How to do it...

  1. Read in the employee dataset, and create a DatetimeIndex with the HIRE_DATE column:
>>> employee = pd.read_csv('data/employee.csv', 
parse_dates=['JOB_DATE', 'HIRE_DATE'],
index_col='HIRE_DATE')
>>> employee.head()
  1. Let's first do a simple grouping by just gender, and find the average salary for each:
>>> employee.groupby('GENDER')['BASE_SALARY'].mean().round(-2)
GENDER Female 52200.0 Male 57400.0 Name: BASE_SALARY, dtype: float64
  1. Let's find the average salary based on hire date, and group everyone into 10-year buckets:
>>> employee.resample('10AS')['BASE_SALARY'].mean().round(-2)
HIRE_DATE 1958-01-01 81200.0 1968-01-01 106500.0 1978-01-01 69600.0 1988-01-01 62300.0 1998-01-01 58200.0 2008-01-01 47200.0 Freq: 10AS-JAN, Name: BASE_SALARY, dtype: float64
  1. If we wanted to group by both gender and a five-year time span, we can call resample directly after calling groupby:
>>> employee.groupby('GENDER').resample('10AS')['BASE_SALARY'] 
.mean().round(-2)
GENDER HIRE_DATE Female 1975-01-01 51600.0 1985-01-01 57600.0 1995-01-01 55500.0 2005-01-01 51700.0 2015-01-01 38600.0 Male 1958-01-01 81200.0 1968-01-01 106500.0 1978-01-01 72300.0 1988-01-01 64600.0 1998-01-01 59700.0 2008-01-01 47200.0 Name: BASE_SALARY, dtype: float64
  1. Now, this does what we set out to do, but we run into a slight issue whenever we want to compare female to male salaries. Let's unstack the gender level and see what happens:
>>> sal_avg.unstack('GENDER')
  1. The 10-year periods for males and females do not begin on the same date. This happened because the data was first grouped by gender and then, within each gender, more groups were formed based on hire dates. Let's verify that the first hired male was in 1958 and the first hired female was in 1975:
>>> employee[employee['GENDER'] == 'Male'].index.min()
Timestamp('1958-12-29 00:00:00')

>>> employee[employee['GENDER'] == 'Female'].index.min()
Timestamp('1975-06-09 00:00:00')
  1. To resolve this issue, we must group the date together with the gender, and this is only possible with the groupby method:
>>> sal_avg2 = employee.groupby(['GENDER', 
pd.Grouper(freq='10AS')])
['BASE_SALARY'].mean().round(-2)
>>> sal_avg2
GENDER HIRE_DATE Female 1968-01-01 NaN 1978-01-01 57100.0 1988-01-01 57100.0 1998-01-01 54700.0 2008-01-01 47300.0 Male 1958-01-01 81200.0 1968-01-01 106500.0 1978-01-01 72300.0 1988-01-01 64600.0 1998-01-01 59700.0 2008-01-01 47200.0 Name: BASE_SALARY, dtype: float64
  1. Now we can unstack the gender and get our rows aligned perfectly:
>>> sal_final = sal_avg2.unstack('GENDER')
>>> sal_final
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.143.239.44