recorded meandering in analytics: September 2013

Sunday, September 29, 2013

Predictive Modeling of S.F.P.D. Crime: Falling Short

Turning the crank in Python and R, here's my first long-term forecast incorporating the tailoring detailed in the last few posts. I didn't spend much time on the plot, hence the funky axis units. Left of the red line are the actual daily volumes from 2003-2013; right of the red line are the forecast volumes.

So far, a long-term decreasing trend and a day-of-the-week modifier have been included. By inspection, the long-term decrease seems to be doing what it needs to: the forecast shows a clear decreasing trend. What about the day-of-the-week modifier?

Percentile results show that the forecast volumes are lower and more tightly distributed than historical volumes. For each day, the historical row is listed first and the forecast row second.

Day        Series      Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
Wednesday  Historical  202.0  339.0    372.0   377.6  414.0    666.0
Wednesday  Forecast    277.0  326.2    344.0   344.7  362.0    422.0
Thursday   Historical  197.0  352.0    379.0   382.9  413.0    576.0
Thursday   Forecast    249    311      331     331    349      406
Friday     Historical  6.0    350.0    388.0   388.8  429.0    552.0
Friday     Forecast    292.0  337.0    353.0   356.0  371.8    427.0
Saturday   Historical  149.0  325.0    363.0   361.6  393.0    566.0
Saturday   Forecast    272.0  316.0    334.0   335.4  355.0    394.0
Sunday     Historical  2.0    313.0    340.0   341.3  369.0    588.0
Sunday     Forecast    237.0  291.8    308.0   309.6  326.2    376.0
Monday     Historical  190.0  329.0    362.0   362.6  395.0    522.0
Monday     Forecast    256.0  298.0    317.5   318.5  338.0    395.0
Tuesday    Historical  210.0  347.0    376.0   379.4  409.0    553.0
Tuesday    Forecast    274.0  311.8    328.0   333.9  352.0    421.0

In both the historical and the forecast results, Friday is the peak day and Sunday is the minimum. At first glance, the modifier looks like it has successfully shifted the daily volumes to better fit historical patterns.

Takeaways:

The forecast volumes hug the mean more closely than historical volumes.
More technically, switching from a normal distribution to a fatter-tailed distribution is probably warranted. A fatter tail in the probability distribution will produce "extreme" values more often, so forecast volumes will spread further from the mean. When I incorporate the Monte Carlo aspect, this should lead to more accurate percentile results.
The means of the forecast volumes are lower than the historical means.
I have added a decreasing trend, so volumes would be expected to be lower. Additionally, the historical means include the last ten years of crime and so reflect the higher volumes early on in the sample. A fairer comparison is with a more recent set, like the three-year period from 2009-2012. Turning the crank on that, the means increase, so something seems amiss: that is not consistent with the decreasing trend visible by inspection.
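
To make the fat-tail point concrete, here is a minimal sketch comparing normal draws with fatter-tailed draws of the same standard deviation. The Student-t distribution is only one possible choice and is an assumption here, as are the mean, spread and degrees of freedom; none of these come from the S.F.P.D. data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

mean, sd = 345.0, 27.0  # hypothetical daily-volume mean and spread
df = 4                  # hypothetical degrees of freedom; smaller means fatter tails
n = 100_000

normal_draws = rng.normal(mean, sd, size=n)

# Student-t draws, rescaled so their standard deviation also equals sd.
# A standard t variable with df > 2 has variance df / (df - 2).
t_scale = sd / np.sqrt(df / (df - 2))
t_draws = mean + t_scale * rng.standard_t(df, size=n)

def tail_share(draws, k=3.0):
    """Fraction of draws lying more than k standard deviations from the mean."""
    return float(np.mean(np.abs(draws - mean) > k * sd))

print("normal   :", tail_share(normal_draws))
print("student-t:", tail_share(t_draws))
```

With matched standard deviations, the t draws land beyond three standard deviations noticeably more often than the normal draws, which is the wider spread of extremes described above.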

Thursday, September 26, 2013

Predictive Modeling of S.F.P.D. Crime: Daily and Long-Term Tailoring

My last post established that a linear model of crime volume from 2003 to 2013 has a slope of -25 (incorrectly labeled 23 in the image from the last post), and that there is variation in crime volume associated with the day of the week. This post details how I intend to incorporate both into the model.

My approach is to shift the means of the normal distributions that I am pulling from by applying a number of modifiers, which will incorporate the specific trends I have mentioned.

As a demonstration, assume that the daily decrease in crime is ~1%, Wednesday has 5% less crime than a normal day and Tuesday has 5% more crime than a normal day. Tailoring the model to include these tendencies requires shifting the means of these normal distributions. As a concrete example, let's consider thefts. The mean is ~89 per day.

Through the week, the mean of the distribution selected will be changed like this:

Monday	Tuesday	Wednesday
Mean=89	Mean=89[(1-0.01)^1+(0.05)]	Mean=89[(1-0.01)^2+(-0.05)]

The term inside the brackets is the percentage modifier. The first term accounts for the 1% decrease, compounding daily, hence the geometric term. The second term accounts for the day of the week. I'll explain the origin of each below.
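
The table above can be reproduced with a few lines of Python. This is only a sketch of the demonstration numbers (1% daily decrease, +5% Tuesday, -5% Wednesday), not the fitted values; the function name and the zero shift for Monday are assumptions made for illustration.

```python
base_mean = 89.0       # thefts per day, from the example above
daily_decrease = 0.01  # demonstration value: ~1% decrease per day

# Demonstration shifts; Monday is taken as the unshifted starting day.
day_shift = {"Monday": 0.0, "Tuesday": 0.05, "Wednesday": -0.05}

def shifted_mean(days_elapsed, day_name):
    """Mean = base * [(1 - decrease)^days_elapsed + day-of-week shift]."""
    modifier = (1 - daily_decrease) ** days_elapsed + day_shift[day_name]
    return base_mean * modifier

for days_elapsed, day_name in enumerate(["Monday", "Tuesday", "Wednesday"]):
    print(day_name, round(shifted_mean(days_elapsed, day_name), 2))
```

Under these demonstration values, Tuesday's mean is 89 * 1.04 = 92.56 and Wednesday's is 89 * 0.9301, about 82.78.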

Long-Term Decrease:
Sequentially taking monthly percentage changes in crime volume from 2003 to 2013 produces a series of 120 "returns". The geometric mean comes to -0.212%. Bringing this to our example of an ~89 mean, day one will have 89 crimes. Day two will have 89*(1-0.00212) = 88.81. Day three starts from day two's mean: 88.81*(1-0.00212) = 88.62. This simplifies to 89*(1+GeoMean)^(Day), where "Day" is the number of days elapsed since the start (0 on day one). Over a one-year period, Day runs up to about 365.
The long-term decrease will be accounted for by a term of (0.997877)^(Day of Forecast).
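
A minimal sketch of both steps: computing a geometric mean from a series of volumes, then compounding the 0.997877 factor onto the ~89 mean. The helper names and the three-value toy series are made up for illustration; only the 89 and 0.997877 come from the text.

```python
import numpy as np

def geometric_mean_return(volumes):
    """Geometric mean of the period-over-period percentage changes."""
    volumes = np.asarray(volumes, dtype=float)
    growth = volumes[1:] / volumes[:-1]  # 1 + change, one value per step
    return float(np.exp(np.mean(np.log(growth))) - 1)

# Toy series (not real data): two consecutive 10% drops.
print(geometric_mean_return([100, 90, 81]))

base_mean = 89.0   # thefts per day, from the example above
decay = 0.997877   # 1 + geometric mean, from the text

def trended_mean(day):
    """Mean after `day` days of compounding (day = 0 on day one)."""
    return base_mean * decay ** day

for day in (0, 1, 2):
    print(day, round(trended_mean(day), 2))
```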

Day of Week Modifier:
This is a lesson in documenting code. I don't remember how I got to these results, but here's what I have noted. Listed in order from Monday to Sunday, -3.3%, 0.4%, 3.5%, 0.3%, 6.6%, 0.4%, -7.8%. The numbers work, in the sense that they sum pretty close to 0 so that they will shift the distributions without affecting the totals. This is important because historical volume informs my model, and I want to stay true to that while tailoring to make forecasts account for the day of the week.
Edit:
Alright, I went back. Here's how I got those numbers.
Each year, take the number of crimes occurring on each day of the week and find the mean. For example, say Monday through Friday have 100 crimes each day and Saturday and Sunday have 200 each. The mean is about 129 (900/7). Daily volume can be represented as a percentage above/below the mean. Saturday and Sunday are 56% above it while Monday through Friday are 22% below it. Summing these percentages gives zero (up to rounding), meaning that the predicted means over an annual period should be consistent with historical annual means.
Daily Modifier Shift:
Monday, -0.034
Tuesday, 0.002
Wednesday, 0.033
Thursday, 0.001
Friday, 0.067
Saturday, 0.006
Sunday, -0.074
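
The calculation described in the edit above can be sketched as follows, using the toy example of 100 crimes on weekdays and 200 on weekend days rather than the real counts. The variable names are assumptions for illustration.

```python
# Toy counts from the worked example, not the S.F.P.D. data.
weekday_counts = {
    "Monday": 100, "Tuesday": 100, "Wednesday": 100, "Thursday": 100,
    "Friday": 100, "Saturday": 200, "Sunday": 200,
}

mean_count = sum(weekday_counts.values()) / len(weekday_counts)

# Each modifier is the fractional distance of that day from the weekly mean.
modifiers = {day: count / mean_count - 1 for day, count in weekday_counts.items()}

for day, shift in modifiers.items():
    print(f"{day}, {shift:+.3f}")

# The modifiers sum to zero, so they move volume between days
# without changing the weekly total.
print("sum:", round(sum(modifiers.values()), 9))
```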

Tuesday, September 17, 2013

Predictive Modeling of S.F. Crime: Trends

So far, my model generates crime predictions through statistical applications of historical data. Calling it predictive is a stretch because it assumes a stationary, homogeneous environment for crime. For example, daily forecasts do not account for day-of-the-week volume differences or long-term crime trends. I explored both a little in the data today and produced some charts to that end.

Long-Term Trends:

Monthly volume was compiled into a list. The result is the plot below, with a best-fit trendline showing that crime over the last 11 years has been decreasing at a rate of approximately 23 crimes per month. Significant variance exists on a monthly basis, but there is definitely a long-term decreasing trend which could be included in a model forecasting many years out.
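
For reference, a best-fit line of this kind can be obtained with a least-squares fit. The sketch below uses synthetic monthly totals, since the real list is not reproduced here; the starting level, noise size and the built-in slope of -23 are all made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

months = np.arange(120)  # month index, 0 = first month in the sample

# Synthetic stand-in for monthly crime totals: a made-up starting level,
# a built-in downward slope, and random month-to-month noise.
monthly_volume = 12_000 - 23 * months + rng.normal(0, 400, size=months.size)

# Degree-1 polynomial fit returns (slope, intercept) of the trendline.
slope, intercept = np.polyfit(months, monthly_volume, deg=1)
print(f"fitted slope: {slope:.1f} crimes per month")
```

The fitted slope lands near the built-in value but not exactly on it, which mirrors the point above: monthly variance is large even when the long-term trend is clear.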

Day of Week Trends:

Below is a barplot that takes crime volume by day, by year from 2003 to 2013. That is, the first seven bars correspond to the number of crimes on Friday, Monday, Saturday, Sunday, Thursday, Tuesday and Wednesday occurring in 2003.

More demonstrative is to take the mean number of crimes for the days of the week in each year and calculate the distance from that mean. Let's say Friday has 1000 crimes, Monday-Thursday have 500 crimes, Saturday and Sunday have 700 crimes each. The total # is 4400 crimes in the week. The mean is 628. Put another way, 628 crimes occurring each day of the week would reach that [cough, rounding] same number (7*628 ~ 4400). The barplot below takes the distance by day from the weekly mean for each year. We see that Friday and Wednesday frequently lie above the mean, while Sunday is uniformly below the mean.

Takeaways:
1. A long-term decreasing trend would be a useful inclusion to multi-year forecasts. This trend is not strictly decreasing and deviates significantly on a monthly basis.
2. Daily differences matter. Friday has more crime and Sunday has less crime. Including this to differentiate a Friday forecast from a Sunday forecast can solve the ambiguous and interchangeable format of current forecasts.

Sunday, September 15, 2013

Predictive Modeling of S.F. Crime: Introduction

My last few posts have approached S.F. reported crime data from a descriptive perspective. My goal was to make the unfamiliar familiar by using data descriptively and presenting it in simple, intuitive ways. Methods keeping to this theme include:

natural frequencies, for example, 1 in 5 reported crimes in the Tenderloin were drug related.
visualizations, such as barplots and wordclouds to compare relative magnitudes of natural frequencies and maps of San Francisco with the density of various crimes plotted over it (heatmap, of sorts).
brief summaries of stand-out traits in each district. The Tenderloin has the highest density of crime by area. Violent and drug-related crime are far more frequent than in other districts, and as a consequence, the arrest rate of 70% is double or triple that of the other districts.

The last few weeks I have switched focus from descriptive analysis to predictive analysis. Instead of summarizing what has happened, I want to use relevant historical data to generate forecasts for reported criminal activity.

I'm keeping it simple to start. I organized the data into daily cuts. From there, the crime/day was calculated for high volume crimes individually (crimes occurring more than 2500 times a year such as theft, robbery, burglary, assault, missing persons, non-criminal, other offenses, drugs, warrants, vehicle thefts) and for low volume crimes combined.

The data set of 223 days allows for a histogram of 223 individual values. For the high volume crimes (mean of more than ten a day), their distribution looks normal for a first run approximation. One such example:

Approximating each high volume crime and the grouped together low volume crimes with a normal distribution gives a mean and standard deviation. A daily crime volume forecast can be produced by pulling randomly from each distribution, giving a # of each crime/day. Summing these, a daily volume is produced. Repeating this process 365 times yields the following barplot. The red line is the annual mean and the blue lines show a standard deviation on either side of the mean.
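
A minimal sketch of that procedure, assuming NumPy. The category list is shortened and every mean and standard deviation below is a placeholder rather than a value fitted to the data.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Placeholder (mean, standard deviation) per category, in crimes per day.
categories = ["theft", "assault", "drugs", "low_volume_combined"]
means = np.array([89.0, 30.0, 25.0, 60.0])
sds = np.array([12.0, 6.0, 6.0, 9.0])

n_days = 365

# One normal draw per category per day; loc and scale broadcast across rows.
draws = rng.normal(means, sds, size=(n_days, len(categories)))

# Counts are whole numbers and cannot be negative.
draws = np.clip(np.rint(draws), 0, None)

# Summing across categories gives one total volume per day.
daily_volume = draws.sum(axis=1)

annual_mean = daily_volume.mean()  # the red line in the barplot
annual_sd = daily_volume.std()     # blue lines sit at mean +/- this value
print(round(annual_mean, 1), round(annual_sd, 1))
```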

This is a first run and has little utility for decision making. There is no accounting for daily, monthly or annual trends, such as higher crime volume on Friday. It does not distribute crime by police district, which would be relevant to staffing. More importantly, each day is a random selection from a normal distribution. Another annual generation would likely not resemble this one in anything but the mean and variance.

Where can this be improved?

The short answer is everywhere. The main inclusion needed is realizing daily forecasts in the context of Monte Carlo simulations. With many daily forecasts, I can replace a one number estimate with percentiles. As an example, my day one crime volume was 339. If I ran my annual simulation 1000 times, I could instead offer an estimate through percentiles such as, "95% of outcomes fell below 400 crimes/day".
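
As a rough illustration of the percentile idea, the sketch below runs 1000 simulated years and summarizes day one across runs. To keep it short, total daily volume is drawn from a single normal distribution instead of one per crime category, and the mean and standard deviation are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed so the sketch is repeatable

daily_mean, daily_sd = 340.0, 30.0  # hypothetical total daily volume
n_runs, n_days = 1000, 365

# Each row is one simulated year; each column is one day of that year.
sims = rng.normal(daily_mean, daily_sd, size=(n_runs, n_days))

# Percentiles of day one across all runs replace the single-number estimate.
day_one = sims[:, 0]
p5, p50, p95 = np.percentile(day_one, [5, 50, 95])
print(f"day one: 5th={p5:.0f}, median={p50:.0f}, 95th={p95:.0f}")
```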

This still does not improve the model, except in improving the clarity of results. To make this model useful, I need to include how these crimes are distributed by hourly blocks and police districts. Further, pulling from a stationary mean and variance makes this a time-independent model. While that might be the best approximation, I am not well enough informed to make that assumption. Finally, long-term or daily trends can be included on further studying the data.

Tuesday, September 3, 2013

SFPD Project, Part 2 by name and not proportion

Here's a wordcloud that pulls from the descriptions of the 75k+ crimes reported in San Francisco so far. The larger the word, the more frequently it is mentioned.

The quick conclusion is that the largest problem, by frequency, has to do with theft from autos or property. Of the 20k larceny/theft reported, only 1.2k had any sort of resolution. Reinforcing that notion is the incidence per 1000 crimes of each category. Larceny/theft is always the largest proportion of crime,

with the exception of the Tenderloin,

where drugs are a contender for most frequently occurring crime. An incredible 1 of 5 reported crimes involves drugs in the Tenderloin. For context, the average is 1 of 20 in the other 9 districts. Parsing for drug descriptions across the city showed that the occurrence of each type of drug is fairly equal (don't quote me because that is off memory of something I ran a few weeks ago), but in the Tenderloin it leaned significantly towards "harder" drugs (non-marijuana, that is). Not surprisingly, the frequency of assaults and other non-theft crimes increases in the Tenderloin. Come to think of it, jacking people in the ghetto doesn't make a ton of sense.

The other exception is Bayview, where nobody drives with a license plate (ie, Other Offenses). Also not a wonderful area, in terms of % of violent crime.

Here's the last chart I liked. Pretty self-explanatory. The Tenderloin has an abnormally high percentage of arrests. That comes with the drugs. An anecdote from the incoming PD officer this research was intended for went along the lines that the officers from the TLoin station walk out into the street, and before they even get into the cars they have to arrest someone.

If exciting to you means putting the bad guys away, assaults (40% arrested), drugs (90% arrested) and warrants (93% arrested) are the ticket. The Loin and Mission are the heavy hitters in that department, followed by Bayview, Northern and Southern. It looks like Richmond, Park, Taraval and to a lesser extent, Central, are going to involve driving around and listening to people bitch about what was stolen from them and vague descriptions of who might have done it when you know you will never actually find the person who did it. But there won't be as many crackheads, so they have that going for them.

Also funny is how the hood has a disproportionately low number of burglaries compared to the rest of the crime occurring. Park, Richmond and Taraval lead the way at about 7-8%. Can't blame the burglars for stupidity, anyway. Worth reinforcing that since this is a frequency of occurrence rather than raw burglary numbers, those districts haven't had the most burglaries (Northern and Southern are ahead, but not by as much as might be guessed without the numbers) because more crimes are committed elsewhere. Anyway, burglars know where to shop.
