How to do it...

Read in the Denver crime hdf5 dataset leaving the REPORTED_DATE as a column:

>>> crime = pd.read_hdf('data/crime.h5', 'crime')
>>> crime.head()

All Timestamp columns have a special attribute called the dt accessor, which gives access to a variety of extra attributes and methods specifically designed for them. Let's find the weekday name of each REPORTED_DATE and then count these values:

>>> wd_counts = crime['REPORTED_DATE'].dt.weekday_name 
                                         .value_counts()
>>> wd_counts
Monday       70024
Friday       69621
Wednesday    69538
Thursday     69287
Tuesday      68394
Saturday     58834
Sunday       55213
Name: REPORTED_DATE, dtype: int64

The weekends appear to have substantially less crime and traffic accidents. Let's put this data in correct weekday order and make a horizontal bar plot:

>>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 
            'Friday', 'Saturday', 'Sunday']
>>> title = 'Denver Crimes and Traffic Accidents per Weekday'
>>> wd_counts.reindex(days).plot(kind='barh', title=title)

We can do a very similar procedure to plot the count by year:

>>> title = 'Denver Crimes and Traffic Accidents per Year' 
>>> crime['REPORTED_DATE'].dt.year.value_counts() 
                             .sort_index() 
                             .plot(kind='barh', title=title)

We need to group by both weekday and year. One way of doing this is saving the weekday and year Series to separate variables and then using these variables with the groupby method:

>>> weekday = crime['REPORTED_DATE'].dt.weekday_name
>>> year = crime['REPORTED_DATE'].dt.year

>>> crime_wd_y = crime.groupby([year, weekday]).size()
>>> crime_wd_y.head(10)
REPORTED_DATE  REPORTED_DATE
2012           Friday            8549
               Monday            8786
               Saturday          7442
               Sunday            7189
               Thursday          8440
               Tuesday           8191
               Wednesday         8440
2013           Friday           10380
               Monday           10627
               Saturday          8875
dtype: int64

We have aggregated the data correctly but the structure isn't exactly conducive to make comparisons easily. Let's first rename those meaningless index level names and then unstack the weekday level to get us a more readable table:

>>> crime_table = crime_wd_y.rename_axis(['Year', 'Weekday']) 
                            .unstack('Weekday')
>>> crime_table

We now have a nicer representation that is easier to read but noticeably, the 2017 numbers are incomplete. To help make a fairer comparison, we can make a simple linear extrapolation to estimate the final number of crimes. Let's first find the last day that we have data for in 2017:

>>> criteria = crime['REPORTED_DATE'].dt.year == 2017
>>> crime.loc[criteria, 'REPORTED_DATE'].dt.dayofyear.max()
272

A naive estimate would be to assume a constant rate of crime throughout the year and simply multiply all values in the 2017 table by 365/272. However, we can do a little better and look at our historical data and calculate the average percentage of crimes that have taken place through the first 272 days of the year:

>>> round(272 / 365, 3)
.745

>>> crime_pct = crime['REPORTED_DATE'].dt.dayofyear.le(272) 
                                      .groupby(year) 
                                      .mean() 
                                      .round(3)
>>> crime_pct
REPORTED_DATE
2012    0.748
2013    0.725
2014    0.751
2015    0.748
2016    0.752
2017    1.000
Name: REPORTED_DATE, dtype: float64

>>> crime_pct.loc[2012:2016].median()
.748

It turns out, perhaps very coincidentally, that the percentage of crimes that happen during the first 272 days of the year is almost exactly proportional to the percentage of days passed in the year. Let's now update the row for 2017 and change the column order to match the weekday order:

>>> crime_table.loc[2017] = crime_table.loc[2017].div(.748) 
                                        .astype('int')
>>> crime_table = crime_table.reindex(columns=days)
>>> crime_table

We could make a bar or line plot but this is also a good situation for a heatmap, which is available with the seaborn library:

>>> import seaborn as sns
>>> sns.heatmap(crime_table, cmap='Greys')

Crime seems to be rising every year but this data does not account for rising population. Let's read in a table for the Denver population for each year that we have data:

>>> denver_pop = pd.read_csv('data/denver_pop.csv',
                             index_col='Year')
>>> denver_pop

Many crime metrics are reported as rates per 100,000 residents. Let's divide the population by 100,000 and then divide the raw crime counts by this number to get the crime rate per 100,000 residents:

>>> den_100k = denver_pop.div(100000).squeeze()
>>> crime_table2 = crime_table.div(den_100k, axis='index') 
                              .astype('int')
>>> crime_table2

Once again, we can make a heatmap that, even after adjusting for population increase, looks nearly identical to the first one:

>>> sns.heatmap(crime_table2, cmap='Greys')

Table of Contents for How to do it...

Create new playlist

Sign In

Sign Up

Table of Contents for
How to do it...