recorded meandering in analytics: Predictive Modeling of S.F. Crime: Adding Monthly Tendencies

I have been developing a model for forecasting crime volume in San Francisco, informed by ten years of publicly available historical data. Volumes are generated as a sum of draws from normal distributions, one per crime category, with parameters estimated from the past data. By shifting the means of these distributions, I have accounted for day-of-week trends (Friday being a higher-crime day than Tuesday) and a long-term linear decrease.
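As a rough illustration of that setup, here is a minimal Python sketch of a daily volume built from one normal draw per category. The category names, means, and standard deviations are hypothetical placeholders rather than values from the S.F. data, and `simulate_daily_volume` is a name invented for this sketch.

```python
import random

# Hypothetical (mean, standard deviation) of daily counts per category.
# These are placeholders, not estimates from the S.F. data.
CATEGORY_PARAMS = {
    "larceny": (70.0, 9.0),
    "assault": (35.0, 6.0),
    "vehicle theft": (20.0, 5.0),
}


def simulate_daily_volume(params, mean_multiplier=1.0, rng=random):
    """Sum one normal draw per category, scaling each mean by a multiplier."""
    total = 0.0
    for mean, sd in params.values():
        total += rng.gauss(mean * mean_multiplier, sd)
    return total


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(round(simulate_daily_volume(CATEGORY_PARAMS, 1.0, rng), 1))
```

The `mean_multiplier` argument is where shifts such as the day-of-week and long-term adjustments would enter, since they act on the means rather than on the draws themselves.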

The last tweak is to account for monthly shifts. Here's a plot, severely lacking in elegance, to give a quick idea of the magnitudes by year. For each year, I took the mean of the twelve monthly totals and subtracted it from each month's actual count.

The clearest trend is that February and December sit below the mean. To capture this numerically, for each month I've averaged, across years, its percentage deviation from that year's mean. A quick example will make this clearer.

In '03, 13482 crimes were reported in January. The monthly mean for that year was 12741, so January came in 13482 - 12741 = 741 above it, or 741/12741 ~ 5.8% of the mean. Doing this for each January, year by year, and averaging those percentages yields 3.5%. Repeating the process for every month gives:
January ~ 3.5% ~ 0.00115
February ~ -6.2% ~ -0.00222
March ~ 3.6% ~ 0.001185
April ~ -0.6% ~ -0.000218
May ~ 1.6% ~ 0.000538
June ~ -3.8% ~ -0.001285
July ~ 0.2% ~ 0.0000711
August ~ 3.5% ~ 0.001142
September ~ 1.6% ~ 0.000533
October ~ 5.2% ~ 0.00167
November ~ -3.0% ~ -0.001011
December ~ -5.6% ~ -0.00183
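The averaging step behind this list can be sketched in Python as follows. This is only an outline under assumptions: the function names and the `counts_by_year` layout (a dict mapping each year to its twelve monthly counts) are invented here, and the only real figures used are the January '03 numbers quoted above.

```python
def deviation_from_annual_mean(monthly_counts):
    """For one year, return each month's fractional deviation from that
    year's mean monthly count."""
    annual_mean = sum(monthly_counts) / len(monthly_counts)
    return [(count - annual_mean) / annual_mean for count in monthly_counts]


def average_monthly_deviation(counts_by_year):
    """Average the per-year deviations month by month across all years.

    counts_by_year: dict of year -> list of 12 monthly counts (assumed layout).
    """
    per_year = [deviation_from_annual_mean(c) for c in counts_by_year.values()]
    n_years = len(per_year)
    return [sum(year[m] for year in per_year) / n_years for m in range(12)]


# The single worked figure from the post: January '03 against that year's mean.
print(round((13482 - 12741) / 12741 * 100, 1))  # 5.8
```

Because each year's deviations are measured against that year's own mean, they sum to zero within every year, and so do their averages across years.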

As a sanity check, the sum of these should be close to zero. Summing the full, unrounded values in R yields a number on the order of 10^-17, so effectively 0. To include these percentages as a modifier to a daily volume forecast, I need to translate them into a daily shift. Unlike the long-term decrease, this shift does not need to compound geometrically, because the mean is effectively reset at each run. On an annual basis, the monthly tweak should redistribute when the crime volume occurs but not change the total volume.

The second number on each line of the list, after the percentage, is the corresponding daily shift: the monthly percentage, expressed as a fraction, divided by the number of days in that month (ignoring leap years).
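The conversion and the zero-sum check can be reproduced roughly in Python. This sketch starts from the rounded percentages in the list, so the daily shifts it produces differ slightly from the listed ones, which came from unrounded values. July is taken as 0.2%, the value implied by its listed daily shift (0.0000711 x 31 days). The names `MONTHLY_PCT`, `DAYS_IN_MONTH`, and `daily_shifts` are made up for this example.

```python
# Rounded average deviations from the list, in percent.
# July is taken as 0.2%, the value implied by its listed daily shift.
MONTHLY_PCT = {
    "January": 3.5, "February": -6.2, "March": 3.6, "April": -0.6,
    "May": 1.6, "June": -3.8, "July": 0.2, "August": 3.5,
    "September": 1.6, "October": 5.2, "November": -3.0, "December": -5.6,
}

# Days per month, ignoring leap years.
DAYS_IN_MONTH = {
    "January": 31, "February": 28, "March": 31, "April": 30,
    "May": 31, "June": 30, "July": 31, "August": 31,
    "September": 30, "October": 31, "November": 30, "December": 31,
}


def daily_shifts(monthly_pct, days_in_month):
    """Convert each monthly percentage to a fraction and spread it over the
    days of that month."""
    return {m: (pct / 100.0) / days_in_month[m] for m, pct in monthly_pct.items()}


shifts = daily_shifts(MONTHLY_PCT, DAYS_IN_MONTH)

# Sanity check: the monthly percentages should sum to roughly zero.
print(f"sum of monthly percentages: {sum(MONTHLY_PCT.values()):.6f}")
print(f"February daily shift: {shifts['February']:.6f}")
```

Multiplying each daily shift back by its month's length recovers the monthly fractions, so the check on the percentages is also a check that the daily shifts, taken over a full non-leap year, net out to about zero.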

My model is a simple conditional tweak to the means of normal distributions informed by past crimes. To add the new shift, the previous mean calculation,
Mean = Historical Mean * [(Geometric Average Decrease)^(Number of Forecast Days) + (Day of Week Shift)]
gains a new component, so that
Mean = Historical Mean * [(Geometric Average Decrease)^(Number of Forecast Days) + (Day of Week Shift) + (Month of Year Shift)]
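Written as code, the updated mean might look like the Python sketch below. The function and parameter names are invented for illustration, and the sample inputs (historical mean, decay factor, day-of-week shift) are hypothetical. Only the February daily shift is taken from the list above.

```python
def forecast_mean(hist_mean, geo_avg_dec, num_days, dow_mod, month_mod):
    """Shifted mean for one forecast day.

    hist_mean:   historical mean daily count
    geo_avg_dec: geometric average daily decrease factor (slightly below 1)
    num_days:    number of days into the forecast
    dow_mod:     additive day-of-week shift, as a fraction
    month_mod:   additive daily month-of-year shift, as a fraction
    """
    return hist_mean * (geo_avg_dec ** num_days + dow_mod + month_mod)


# Hypothetical inputs, except month_mod, which is February's daily shift.
example = forecast_mean(
    hist_mean=400.0,     # placeholder historical mean
    geo_avg_dec=0.9999,  # placeholder decay factor
    num_days=30,
    dow_mod=0.03,        # placeholder Friday-style shift
    month_mod=-0.00222,
)
print(round(example, 2))
```

Setting `month_mod` to zero recovers the earlier version of the calculation, which makes it easy to compare forecasts with and without the monthly modifier.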

The result will be a redistribution of forecast volume which will stay consistent with annual expectations.

Above is the new forecast including the monthly modifier, and below is the previous forecast using only a day-of-week shift and a long-term shift. By inspection, the model still does not forecast extreme events, but including the new modifier has spread the distribution out rather than clustering it so heavily around the downward linear trend.

The next step is to account for fat-tail events, keeping in mind the longer-term goal of putting forecasts in the context of Monte Carlo simulations.

Monday, October 7, 2013