How to do it...

Read in the employee dataset, and create a DatetimeIndex with the HIRE_DATE column:

>>> employee = pd.read_csv('data/employee.csv', 
                           parse_dates=['JOB_DATE', 'HIRE_DATE'], 
                           index_col='HIRE_DATE')
>>> employee.head()

Let's first do a simple grouping by just gender, and find the average salary for each:

>>> employee.groupby('GENDER')['BASE_SALARY'].mean().round(-2)
GENDER
Female    52200.0
Male      57400.0
Name: BASE_SALARY, dtype: float64

Let's find the average salary based on hire date, and group everyone into 10-year buckets:

>>> employee.resample('10AS')['BASE_SALARY'].mean().round(-2)
HIRE_DATE
1958-01-01     81200.0
1968-01-01    106500.0
1978-01-01     69600.0
1988-01-01     62300.0
1998-01-01     58200.0
2008-01-01     47200.0
Freq: 10AS-JAN, Name: BASE_SALARY, dtype: float64

If we wanted to group by both gender and a five-year time span, we can call resample directly after calling groupby:

>>> employee.groupby('GENDER').resample('10AS')['BASE_SALARY'] 
            .mean().round(-2)
GENDER  HIRE_DATE 
Female  1975-01-01     51600.0
        1985-01-01     57600.0
        1995-01-01     55500.0
        2005-01-01     51700.0
        2015-01-01     38600.0
Male    1958-01-01     81200.0
        1968-01-01    106500.0
        1978-01-01     72300.0
        1988-01-01     64600.0
        1998-01-01     59700.0
        2008-01-01     47200.0
Name: BASE_SALARY, dtype: float64

Now, this does what we set out to do, but we run into a slight issue whenever we want to compare female to male salaries. Let's unstack the gender level and see what happens:

>>> sal_avg.unstack('GENDER')

The 10-year periods for males and females do not begin on the same date. This happened because the data was first grouped by gender and then, within each gender, more groups were formed based on hire dates. Let's verify that the first hired male was in 1958 and the first hired female was in 1975:

>>> employee[employee['GENDER'] == 'Male'].index.min()
Timestamp('1958-12-29 00:00:00')

>>> employee[employee['GENDER'] == 'Female'].index.min()
Timestamp('1975-06-09 00:00:00')

To resolve this issue, we must group the date together with the gender, and this is only possible with the groupby method:

>>> sal_avg2 = employee.groupby(['GENDER', 
                                 pd.Grouper(freq='10AS')]) 
                        ['BASE_SALARY'].mean().round(-2)
>>> sal_avg2
GENDER  HIRE_DATE 
Female  1968-01-01         NaN
        1978-01-01     57100.0
        1988-01-01     57100.0
        1998-01-01     54700.0
        2008-01-01     47300.0
Male    1958-01-01     81200.0
        1968-01-01    106500.0
        1978-01-01     72300.0
        1988-01-01     64600.0
        1998-01-01     59700.0
        2008-01-01     47200.0
Name: BASE_SALARY, dtype: float64

Now we can unstack the gender and get our rows aligned perfectly:

>>> sal_final = sal_avg2.unstack('GENDER')
>>> sal_final

Table of Contents for How to do it...

Create new playlist

Sign In

Sign Up

Table of Contents for
How to do it...