- Read in the Denver crime hdf5 dataset leaving the REPORTED_DATE as a column:
>>> crime = pd.read_hdf('data/crime.h5', 'crime')
>>> crime.head()
- All Timestamp columns have a special attribute called the dt accessor, which gives access to a variety of extra attributes and methods specifically designed for them. Let's find the weekday name of each REPORTED_DATE and then count these values:
>>> wd_counts = crime['REPORTED_DATE'].dt.weekday_name
.value_counts()
>>> wd_counts
Monday 70024
Friday 69621
Wednesday 69538
Thursday 69287
Tuesday 68394
Saturday 58834
Sunday 55213
Name: REPORTED_DATE, dtype: int64
- The weekends appear to have substantially less crime and traffic accidents. Let's put this data in correct weekday order and make a horizontal bar plot:
>>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday']
>>> title = 'Denver Crimes and Traffic Accidents per Weekday'
>>> wd_counts.reindex(days).plot(kind='barh', title=title)
- We can do a very similar procedure to plot the count by year:
>>> title = 'Denver Crimes and Traffic Accidents per Year'
>>> crime['REPORTED_DATE'].dt.year.value_counts()
.sort_index()
.plot(kind='barh', title=title)
- We need to group by both weekday and year. One way of doing this is saving the weekday and year Series to separate variables and then using these variables with the groupby method:
>>> weekday = crime['REPORTED_DATE'].dt.weekday_name
>>> year = crime['REPORTED_DATE'].dt.year
>>> crime_wd_y = crime.groupby([year, weekday]).size()
>>> crime_wd_y.head(10)
REPORTED_DATE REPORTED_DATE
2012 Friday 8549
Monday 8786
Saturday 7442
Sunday 7189
Thursday 8440
Tuesday 8191
Wednesday 8440
2013 Friday 10380
Monday 10627
Saturday 8875
dtype: int64
- We have aggregated the data correctly but the structure isn't exactly conducive to make comparisons easily. Let's first rename those meaningless index level names and then unstack the weekday level to get us a more readable table:
>>> crime_table = crime_wd_y.rename_axis(['Year', 'Weekday'])
.unstack('Weekday')
>>> crime_table
- We now have a nicer representation that is easier to read but noticeably, the 2017 numbers are incomplete. To help make a fairer comparison, we can make a simple linear extrapolation to estimate the final number of crimes. Let's first find the last day that we have data for in 2017:
>>> criteria = crime['REPORTED_DATE'].dt.year == 2017
>>> crime.loc[criteria, 'REPORTED_DATE'].dt.dayofyear.max()
272
- A naive estimate would be to assume a constant rate of crime throughout the year and simply multiply all values in the 2017 table by 365/272. However, we can do a little better and look at our historical data and calculate the average percentage of crimes that have taken place through the first 272 days of the year:
>>> round(272 / 365, 3)
.745
>>> crime_pct = crime['REPORTED_DATE'].dt.dayofyear.le(272)
.groupby(year)
.mean()
.round(3)
>>> crime_pct
REPORTED_DATE
2012 0.748
2013 0.725
2014 0.751
2015 0.748
2016 0.752
2017 1.000
Name: REPORTED_DATE, dtype: float64
>>> crime_pct.loc[2012:2016].median()
.748
- It turns out, perhaps very coincidentally, that the percentage of crimes that happen during the first 272 days of the year is almost exactly proportional to the percentage of days passed in the year. Let's now update the row for 2017 and change the column order to match the weekday order:
>>> crime_table.loc[2017] = crime_table.loc[2017].div(.748)
.astype('int')
>>> crime_table = crime_table.reindex(columns=days)
>>> crime_table
- We could make a bar or line plot but this is also a good situation for a heatmap, which is available with the seaborn library:
>>> import seaborn as sns
>>> sns.heatmap(crime_table, cmap='Greys')
- Crime seems to be rising every year but this data does not account for rising population. Let's read in a table for the Denver population for each year that we have data:
>>> denver_pop = pd.read_csv('data/denver_pop.csv',
index_col='Year')
>>> denver_pop
- Many crime metrics are reported as rates per 100,000 residents. Let's divide the population by 100,000 and then divide the raw crime counts by this number to get the crime rate per 100,000 residents:
>>> den_100k = denver_pop.div(100000).squeeze()
>>> crime_table2 = crime_table.div(den_100k, axis='index')
.astype('int')
>>> crime_table2
- Once again, we can make a heatmap that, even after adjusting for population increase, looks nearly identical to the first one:
>>> sns.heatmap(crime_table2, cmap='Greys')