Wednesday, October 23, 2013

Predictive Modeling of S.F. Crime: First Backtest

I've incorporated multiple simulations per day to produce distributions for crime volume, turning

"X crimes will happen tomorrow according to my model" into
"between X and Y crimes will happen tomorrow 50% of the time according to my model".

Here's a short backtest using the 150 days beginning on Jan. 1st, 2013 to illustrate what's going on. The percentile lines correspond to the 95th, 80th, 20th and 5th percentiles for each day in the simulation. The points are the realized volumes.
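
Here's roughly how those per-day simulations and percentile bands could be produced; a minimal sketch in R with made-up category means and spreads, not the model's fitted values:

```r
# Minimal sketch: simulate one day's crime volume many times and summarize the
# simulated distribution with the percentile bands used in the plot. The category
# means and spreads below are made up, not the model's fitted values.
set.seed(42)
cat_means <- c(35, 20, 15, 12, 10, 8, 7, 6, 5, 4, 3, 2)  # twelve hypothetical categories
cat_sds   <- sqrt(cat_means)

simulate_day <- function() sum(rnorm(length(cat_means), mean = cat_means, sd = cat_sds))

draws <- replicate(1000, simulate_day())

# The 5th/20th/80th/95th percentile lines shown in the backtest plot
quantile(draws, probs = c(0.05, 0.20, 0.80, 0.95))
```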




I've recalibrated the means used by the model to the 2003 data and run the model forward 5 years to see how well it accounts for historical data.
A "good" performance:
1. ~60% of volumes should lie between the 20th and 80th percentile.
2. ~15% of volumes should lie in each of the 5th-20th and 80th-95th ranges (~30% combined).
3. ~10% combined should lie above the 95th or below the 5th (~5% each).
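
Tallying the realized volumes against those bands is a simple coverage check. A sketch, assuming a hypothetical matrix `bands` of per-day simulated percentiles (columns named the way R's quantile() names them) and a vector `realized` of actual daily volumes:

```r
# Coverage check sketch: what fraction of realized daily volumes fall in each band?
# `bands` is assumed to have columns "5%", "20%", "80%", "95%" (one row per day).
coverage <- function(realized, bands) {
  c(p20_80  = mean(realized >= bands[, "20%"] & realized <= bands[, "80%"]),
    p5_20   = mean(realized >= bands[, "5%"]  & realized <  bands[, "20%"]),
    p80_95  = mean(realized >  bands[, "80%"] & realized <= bands[, "95%"]),
    outside = mean(realized <  bands[, "5%"]  | realized >  bands[, "95%"]))
}
# A well-calibrated model should return roughly 0.60, 0.15, 0.15, 0.10.
```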

What happened?
1. 50% of historical volumes were within the model-produced 20th and 80th percentiles. Low.
2. 19% were within the model-produced 5th and 20th percentiles. High.
3. 13% were within the model-produced 80th and 95th percentiles. Low.
4. ~17% were above the forecast 95th or below the forecast 5th. High.

What does that mean?
1. As suspected, my model underestimates the frequency of abnormally high- or low-volume days (i.e., fat-tailed events).

2. Judging from the 5th-20th and 80th-95th results, the forecasts skew high: realized volumes land in the model's lower band more often than its upper band, so the model produces high-volume days more often than past observations support.

Now what?
The underlying issue is that my model produces a daily volume by summing independent random draws from twelve probability distributions. Going back to Probability 101, generating a 95th-percentile day requires several of those twelve draws to come in high at the same time, which is rare. The benefit of my approach is that I can tinker with the mean of each category independently (customize) without much problem. The downside is that combining independent draws compresses the tails and makes my forecasts less consistent with historical observation, as the sketch below illustrates.
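
To see the compression numerically, here's a small illustration with equal, made-up categories: the sum of twelve independent draws has a far narrower spread, and therefore a far tamer 95th percentile, than twelve perfectly co-moving draws would.

```r
# Why independent draws thin the tails: summing 12 independent N(10, 5) draws gives
# sd sqrt(12)*5 ~ 17, while 12 perfectly co-moving draws give sd 12*5 = 60.
set.seed(1)
k <- 12; mu <- 10; sigma <- 5
indep  <- replicate(10000, sum(rnorm(k, mean = mu, sd = sigma)))
comove <- replicate(10000, k * rnorm(1, mean = mu, sd = sigma))  # one shared draw

sd(indep); sd(comove)
quantile(indep, 0.95); quantile(comove, 0.95)  # the independent 95th sits much closer to the mean
```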

A way around this could be to make one initial draw that informs the draws from every other category. Rather than combining a number of independent, individual draws to produce a volume, a single percentile could be chosen for the day, passed to each category's distribution, and used to inform that category's draw. That would reduce the problem of combined probabilities and better account for high-volume events, as sketched below.
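
A minimal sketch of that idea, with the same placeholder category parameters as before: one uniform draw sets the day's percentile, and qnorm maps it through every category's distribution before summing.

```r
# Shared-percentile draw: one uniform value per simulated day drives every category.
set.seed(2)
cat_means <- c(35, 20, 15, 12, 10, 8, 7, 6, 5, 4, 3, 2)  # hypothetical categories
cat_sds   <- sqrt(cat_means)

shared_day <- function() {
  u <- runif(1)                                   # the day's percentile
  sum(qnorm(u, mean = cat_means, sd = cat_sds))   # same percentile in every category
}
independent_day <- function() sum(rnorm(length(cat_means), cat_means, cat_sds))

shared <- replicate(5000, shared_day())
indep  <- replicate(5000, independent_day())
quantile(shared, c(0.05, 0.95))  # wider band: extreme days come up more easily
quantile(indep,  c(0.05, 0.95))
```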
But I'm not being paid and this model is jumping into overfit territory quickly, so I won't.

Monday, October 7, 2013

Predictive Modeling of S.F. Crime: Adding Monthly Tendencies

I have been developing a model for forecasting crime volume in San Francisco, informed by publicly available historical data going back ten years. A daily volume is generated by summing draws from normal distributions, one per crime category, fitted to that past data. By shifting the means of these distributions, I account for day-of-the-week trends (Friday being a higher-crime day than Tuesday) and a long-term linear decrease.
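
As a rough sketch of that generator (the parameters below are made up, not fitted values, and the mean adjustment follows the formula given later in this post):

```r
# Sketch of the daily volume generator: per-category normal draws whose means carry
# a long-term geometric decrease and a day-of-the-week shift. All numbers are
# hypothetical placeholders.
set.seed(3)
hist_means <- c(35, 20, 15, 12, 10, 8, 7, 6, 5, 4, 3, 2)  # one mean per crime category
hist_sds   <- sqrt(hist_means)
geo_dec    <- 0.9999                                       # hypothetical daily decrease factor
dow_shift  <- c(Mon = 0, Tue = -0.02, Wed = 0, Thu = 0.01, Fri = 0.04, Sat = 0.03, Sun = 0.01)

forecast_volume <- function(days_ahead, weekday) {
  mu <- hist_means * (geo_dec^days_ahead + dow_shift[[weekday]])
  sum(rnorm(length(mu), mean = mu, sd = hist_sds))
}

forecast_volume(30, "Fri")  # one simulated volume 30 days out, on a Friday
```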

The last tweak is to account for monthly shifts. Here's a plot, severely lacking in elegance, to give a quick idea of the magnitudes by year. For each year, I took the mean of the twelve monthly totals and differenced each month's actual count from that mean.





The clearest trend is that February and December sit below the mean. To include this numerically, I've averaged each month's percentage deviation from the mean across years. A quick example will make this clearer.

In '03, 13482 crimes were reported in January. The monthly mean for that year was 12741. 13482 - 12741 = 741, and as a percentage of the mean, 741/12741 ~ 5.8%. Doing this for each January, year by year, and averaging those percentages yields 3.5%. Doing this for each month, the list looks like
January    ~  3.5%   ~  0.00115
February   ~ -6.2%   ~ -0.00222
March      ~  3.6%   ~  0.001185
April      ~ -0.6%   ~ -0.000218
May        ~  1.6%   ~  0.000538
June       ~ -3.8%   ~ -0.001285
July       ~  0.2%   ~  0.0000711
August     ~  3.5%   ~  0.001142
September  ~  1.6%   ~  0.000533
October    ~  5.2%   ~  0.00167
November   ~ -3.0%   ~ -0.001011
December   ~ -5.6%   ~ -0.00183

As a sanity check, the sum of these percentages should be close to zero. Taking the full, "unapproximated" values in R yields a number on the order of 10^-17, so effectively zero. To include these percentages as a modifier to a daily volume forecast, I need to translate them into a daily shift. This shift does not need to be geometric like the long-term decrease because the mean is effectively reset at each run. On an annual basis, the monthly tweak should redistribute when the crime volume occurs, but not change the total volume.

The second number, following the percentage, is the corresponding daily shift, obtained by dividing the percentage by the number of days in that month (ignoring leap years). A sketch of the calculation is below.
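
Here's how that calculation could look in R, using a made-up `monthly_counts` matrix as a stand-in for the real table of reported crimes (rows are years, columns are months):

```r
# Monthly modifier sketch. monthly_counts is fabricated for illustration only.
set.seed(4)
monthly_counts <- matrix(round(rnorm(10 * 12, mean = 12000, sd = 600)), nrow = 10,
                         dimnames = list(as.character(2003:2012), month.name))

# Each month's percentage deviation from that year's mean, averaged across years
pct_dev_by_year <- t(apply(monthly_counts, 1, function(y) (y - mean(y)) / mean(y)))
pct_dev <- colMeans(pct_dev_by_year)          # e.g. January ~ 0.035 in the post

# Spread the monthly percentage over the days in that month (leap years ignored)
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
daily_shift   <- pct_dev / days_in_month      # e.g. January ~ 0.00115 in the post

sum(pct_dev)  # sanity check: ~0, on the order of 1e-17 from floating point
```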

My model is a simple conditional tweak to the means of normal distributions informed by past crimes. To incorporate the new shift, the previous mean calculation,
Mean = Historical Mean * [ (Geometric Average Decrease)^(Number of Forecast Days) + (Day of Week Shift) ]
gains a new component:
Mean = Historical Mean * [ (GeoAvgDec)^(NumOfDays) + (DoW Mod) + (Month of Year Mod) ]
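
In code the tweak is a one-liner; the arguments below are hypothetical placeholders, apart from the January daily shift of ~0.00115 taken from the list above:

```r
# Adjusted mean for one category on one forecast day, following the formula above.
adjusted_mean <- function(hist_mean, geo_dec, days_ahead, dow_mod, month_mod) {
  hist_mean * (geo_dec^days_ahead + dow_mod + month_mod)
}

# e.g. a Friday in January, 30 days out (made-up numbers except month_mod)
adjusted_mean(hist_mean = 35, geo_dec = 0.9999, days_ahead = 30,
              dow_mod = 0.04, month_mod = 0.00115)
```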

The result is a redistribution of forecast volume that stays consistent with annual expectations.





Above is the new method including the monthly modifier, and below is the previous forecast using only a day-of-the-week shift and a long-term shift. By inspection, the model still does not forecast extreme events, but the new modifier spreads the distribution out rather than clustering so heavily around the downward linear trend.





The next step is to account for fat-tail events, keeping in mind the longer-term goal of putting forecasts in the context of Monte Carlo simulations.