How to do it...

  1. Read in the Denver crime HDF5 dataset, leaving REPORTED_DATE as a column:
>>> crime = pd.read_hdf('data/crime.h5', 'crime')
>>> crime.head()
  2. All Timestamp columns have a special attribute called the dt accessor, which gives access to a variety of extra attributes and methods specifically designed for them. Let's find the weekday name of each REPORTED_DATE and then count these values:
>>> wd_counts = crime['REPORTED_DATE'].dt.weekday_name.value_counts()
>>> wd_counts
Monday       70024
Friday       69621
Wednesday    69538
Thursday     69287
Tuesday      68394
Saturday     58834
Sunday       55213
Name: REPORTED_DATE, dtype: int64
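Note that in more recent pandas versions (0.23 and later), the weekday_name attribute was deprecated in favor of the day_name method. A minimal sketch of the replacement, using a small hypothetical Series of dates:

```python
import pandas as pd

# Toy example: dt.day_name() is the modern replacement for the
# deprecated dt.weekday_name attribute used in this recipe.
dates = pd.Series(pd.to_datetime(['2017-01-02', '2017-01-03', '2017-01-07']))
print(dates.dt.day_name().tolist())  # ['Monday', 'Tuesday', 'Saturday']
```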
  3. The weekends appear to have substantially fewer crimes and traffic accidents. Let's put this data in correct weekday order and make a horizontal bar plot:
>>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 
'Friday', 'Saturday', 'Sunday']
>>> title = 'Denver Crimes and Traffic Accidents per Weekday'
>>> wd_counts.reindex(days).plot(kind='barh', title=title)
  4. We can follow a very similar procedure to plot the count by year:
>>> title = 'Denver Crimes and Traffic Accidents per Year'
>>> crime['REPORTED_DATE'].dt.year.value_counts() \
                                  .sort_index() \
                                  .plot(kind='barh', title=title)
  5. We need to group by both weekday and year. One way of doing this is to save the weekday and year Series to separate variables and then use these variables with the groupby method:
>>> weekday = crime['REPORTED_DATE'].dt.weekday_name
>>> year = crime['REPORTED_DATE'].dt.year

>>> crime_wd_y = crime.groupby([year, weekday]).size()
>>> crime_wd_y.head(10)
REPORTED_DATE  REPORTED_DATE
2012           Friday            8549
               Monday            8786
               Saturday          7442
               Sunday            7189
               Thursday          8440
               Tuesday           8191
               Wednesday         8440
2013           Friday           10380
               Monday           10627
               Saturday          8875
dtype: int64
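The grouping works because pandas aligns external Series passed to groupby with the DataFrame's index. A minimal sketch of this alignment on a toy DataFrame (names hypothetical):

```python
import pandas as pd

# Toy example: groupby accepts a Series that is not a column of the
# DataFrame; pandas aligns it by index, so derived datetime components
# such as dt.year can serve directly as grouping keys.
df = pd.DataFrame({'d': pd.to_datetime(['2012-01-06', '2012-01-09',
                                        '2013-01-04'])})
year = df['d'].dt.year
counts = df.groupby(year).size()
print(counts.to_dict())  # {2012: 2, 2013: 1}
```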
  6. We have aggregated the data correctly, but the structure isn't exactly conducive to easy comparisons. Let's first rename those meaningless index level names and then unstack the weekday level to get a more readable table:
>>> crime_table = crime_wd_y.rename_axis(['Year', 'Weekday']) \
                            .unstack('Weekday')
>>> crime_table
  7. We now have a nicer representation that is easier to read, but the 2017 numbers are noticeably incomplete. To help make a fairer comparison, we can make a simple linear extrapolation to estimate the final number of crimes. Let's first find the last day that we have data for in 2017:
>>> criteria = crime['REPORTED_DATE'].dt.year == 2017
>>> crime.loc[criteria, 'REPORTED_DATE'].dt.dayofyear.max()
272
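The naive scaling factor mentioned in the next step can be computed directly; a quick sketch of the arithmetic:

```python
# Naive extrapolation factor: assuming a constant rate of crime
# through the year, scale the 272 days of 2017 data to a full year.
factor = 365 / 272
print(round(factor, 3))  # 1.342
```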
  8. A naive estimate would be to assume a constant rate of crime throughout the year and simply multiply all values in the 2017 row by 365/272. However, we can do a little better: look at our historical data and calculate the average percentage of crimes that took place through the first 272 days of the year:
>>> round(272 / 365, 3)
0.745

>>> crime_pct = crime['REPORTED_DATE'].dt.dayofyear.le(272) \
                                      .groupby(year) \
                                      .mean() \
                                      .round(3)
>>> crime_pct
REPORTED_DATE
2012    0.748
2013    0.725
2014    0.751
2015    0.748
2016    0.752
2017    1.000
Name: REPORTED_DATE, dtype: float64

>>> crime_pct.loc[2012:2016].median()
0.748
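The .le(272) comparison produces a boolean Series, and the mean of booleans is the fraction of True values, which is why the grouped mean yields a per-year proportion. A minimal sketch of that property:

```python
import pandas as pd

# Toy example: averaging a boolean Series gives the fraction of True
# values, so .le(272).groupby(year).mean() returns, for each year,
# the proportion of crimes reported in the first 272 days.
s = pd.Series([True, True, False, True])
print(s.mean())  # 0.75
```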
  9. It turns out, perhaps very coincidentally, that the percentage of crimes that happen during the first 272 days of the year is almost exactly proportional to the percentage of days passed in the year. Let's now update the row for 2017 and change the column order to match the weekday order:
>>> crime_table.loc[2017] = crime_table.loc[2017].div(.748) \
                                                 .astype('int')
>>> crime_table = crime_table.reindex(columns=days)
>>> crime_table
  10. We could make a bar or line plot, but this is also a good situation for a heatmap, which is available in the seaborn library:
>>> import seaborn as sns
>>> sns.heatmap(crime_table, cmap='Greys')
  11. Crime seems to be rising every year, but this data does not account for rising population. Let's read in a table of the Denver population for each year that we have data:
>>> denver_pop = pd.read_csv('data/denver_pop.csv', index_col='Year')
>>> denver_pop
  12. Many crime metrics are reported as rates per 100,000 residents. Let's divide the population by 100,000 and then divide the raw crime counts by this number to get the crime rate per 100,000 residents:
>>> den_100k = denver_pop.div(100000).squeeze()
>>> crime_table2 = crime_table.div(den_100k, axis='index') \
                              .astype('int')
>>> crime_table2
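The axis='index' argument tells div to align the Series with the row labels rather than the columns. A minimal sketch of this row-wise broadcasting on a toy table (names and values hypothetical):

```python
import pandas as pd

# Toy example: div(..., axis='index') divides each row of the table
# by the matching entry of a Series aligned on the row labels,
# mirroring how each year's counts are divided by that year's population.
table = pd.DataFrame({'A': [100, 200], 'B': [300, 400]},
                     index=[2012, 2013])
pop = pd.Series([2.0, 4.0], index=[2012, 2013])
rate = table.div(pop, axis='index')
print(rate)
```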
  13. Once again, we can make a heatmap that, even after adjusting for population increase, looks nearly identical to the first one:
>>> sns.heatmap(crime_table2, cmap='Greys')