Sunday, September 15, 2013

Predictive Modeling of S.F. Crime: Introduction

My last few posts have approached S.F. reported crime data from a descriptive perspective. My goal was to make unfamiliar data familiar by describing it and presenting it in simple and intuitive ways. Methods in keeping with this theme include:
  • natural frequencies, for example, 1 in 5 reported crimes in the Tenderloin were drug-related. 
  • visualizations, such as barplots and wordclouds to compare the relative magnitudes of those frequencies, and maps of San Francisco with the density of various crimes plotted over them (a heatmap, of sorts).
  • brief summaries of stand-out traits in each district. The Tenderloin, for example, has the highest density of crime by area; violent and drug-related crime are far more frequent there than in other districts, and as a consequence, its arrest rate of 70% is double to triple that of the other districts.
Over the last few weeks I have switched focus from descriptive analysis to predictive analysis. Instead of summarizing what has happened, I want to use relevant historical data to generate forecasts for reported criminal activity.

I'm keeping it simple to start. I organized the data into daily cuts. From there, crimes per day were calculated for each high-volume crime individually (categories occurring more than 2,500 times a year, such as theft, robbery, burglary, assault, missing persons, non-criminal, other offenses, drugs, warrants, and vehicle theft) and for the low-volume crimes combined.
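
To make that concrete, here is a rough sketch of the aggregation in Python with pandas. The file name and the "Date" and "Category" columns are assumptions about how the raw data is laid out, and the 2,500-a-year cutoff is scaled from the 223 observed days:

    import pandas as pd

    # Load the reported-incident data; the file and column names ("Date",
    # "Category") are assumed here, not taken from the actual export.
    incidents = pd.read_csv("sf_incidents.csv", parse_dates=["Date"])

    # Count incidents per calendar day for each crime category.
    daily_counts = (incidents
                    .groupby([incidents["Date"].dt.date, "Category"])
                    .size()
                    .unstack(fill_value=0))

    # Split categories into high volume (more than ~2,500 reports a year,
    # scaled from the observed window) and one combined low-volume column.
    yearly_rate = daily_counts.sum() * (365.0 / len(daily_counts))
    high_volume = daily_counts.loc[:, yearly_rate > 2500]
    low_volume = daily_counts.loc[:, yearly_rate <= 2500].sum(axis=1)
    daily_counts = high_volume.assign(**{"LOW VOLUME": low_volume})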

The data set covers 223 days, which allows a histogram of 223 daily values for each crime. For the high-volume crimes (those averaging more than ten reports a day), the distributions look approximately normal on a first pass. One such example:

[Figure: histogram of daily counts for one high-volume crime category]
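
A quick way to eyeball that normal shape is to plot the daily counts for a single category. This sketch assumes the daily_counts table from the snippet above; "LARCENY/THEFT" is just an assumed stand-in for whichever theft label the data actually uses:

    import matplotlib.pyplot as plt

    # Daily counts for one high-volume category; "LARCENY/THEFT" is an
    # assumed column label, any high-volume category column would do.
    counts = daily_counts["LARCENY/THEFT"]
    plt.hist(counts, bins=20)
    plt.xlabel("Reported incidents per day")
    plt.ylabel("Number of days")
    plt.title("Daily counts over the 223-day window")
    plt.show()
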
Approximating each high-volume crime and the combined low-volume crimes with a normal distribution gives a mean and standard deviation for each. A daily crime volume forecast can then be produced by pulling randomly from each distribution, giving a count of each crime per day; summing these counts produces a total daily volume. Repeating this process 365 times yields the following barplot. The red line is the annual mean and the blue lines show one standard deviation on either side of the mean.

[Figure: barplot of simulated daily crime volume over 365 days, with the annual mean in red and one standard deviation on either side in blue]
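
For reference, a minimal sketch of that sampling procedure, again assuming the daily_counts table built earlier:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)

    # One normal distribution per column of daily_counts (each high-volume
    # crime plus the combined low-volume column).
    means = daily_counts.mean()
    stds = daily_counts.std()

    # For each simulated day, draw one value per category and sum them to a
    # total daily volume; repeat for a 365-day year.
    simulated = np.array([rng.normal(means, stds).sum() for _ in range(365)])

    plt.bar(range(1, 366), simulated)
    plt.axhline(simulated.mean(), color="red")                     # annual mean
    plt.axhline(simulated.mean() + simulated.std(), color="blue")  # +1 std dev
    plt.axhline(simulated.mean() - simulated.std(), color="blue")  # -1 std dev
    plt.xlabel("Day of simulated year")
    plt.ylabel("Total reported crimes")
    plt.show()
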
This is a first run and has little utility for decision making. There is no accounting for daily, monthly, or annual trends, such as higher crime volume on Fridays. It does not distribute crime by police district, which would be relevant to staffing. More importantly, each day is an independent random draw from a normal distribution, so another simulated year would likely resemble this one in nothing but its mean and variance.

Where can this be improved?
The short answer is everywhere. The most important next step is to place the daily forecasts in the context of Monte Carlo simulation. With many simulated years, I can replace a single-number estimate with percentiles. As an example, my day-one crime volume was 339. If I ran the annual simulation 1,000 times, I could instead offer an estimate through percentiles, such as "95% of simulated outcomes fell below 400 crimes/day".
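
A rough sketch of what that could look like, building on the earlier snippets (the 1,000 runs and the 95th percentile are just the illustrative numbers from the paragraph above):

    import numpy as np

    rng = np.random.default_rng(0)

    # Repeat the one-year simulation many times; means/stds come from the
    # previous sketch.
    n_runs, n_days = 1000, 365
    runs = np.array([[rng.normal(means, stds).sum() for _ in range(n_days)]
                     for _ in range(n_runs)])

    # Summarize simulated daily volume with a percentile instead of a
    # single draw.
    p95 = np.percentile(runs, 95)
    print(f"95% of simulated days fell below {p95:.0f} crimes/day")
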
That sharpens the presentation of results more than it improves the model itself. To make the model useful, I need to include how these crimes are distributed across hourly blocks and police districts. Further, pulling from a stationary mean and variance makes this a time-independent model; while that might be a reasonable approximation, I am not yet well enough informed to make that assumption. Finally, long-term or daily trends can be added after further study of the data.
