I've incorporated multiple simulations per day to produce distributions for crime volume, turning
"X crimes will happen tomorrow according to my model" into
"X-Y crimes happen 50% of the time according to my model".
Here's a short backtest using the 150 days beginning on Jan. 1st, 2013 to illustrate what's going on. The percentile lines correspond to the 95th, 80th, 20th and 5th percentiles for each day in the simulation. The points are the realized volumes.
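As a rough illustration of how per-day percentile lines like these can be derived from repeated simulations, here is a minimal sketch. The `simulate_day` function, the twelve equal category means and the normal draws are all illustrative assumptions, not the model's actual internals.

```python
import random
import statistics

def simulate_day(rng, category_means):
    # Hypothetical stand-in for one simulated day: one independent draw per
    # category (normal with variance equal to the mean, an assumption), summed.
    return sum(max(0.0, rng.gauss(m, m ** 0.5)) for m in category_means)

def percentile_bands(sims, levels=(5, 20, 80, 95)):
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(sims, n=100)
    return {p: cuts[p - 1] for p in levels}

rng = random.Random(0)             # fixed seed so the sketch is repeatable
category_means = [40.0] * 12       # made-up means for twelve categories
sims = [simulate_day(rng, category_means) for _ in range(2000)]
bands = percentile_bands(sims)     # one day's 5th/20th/80th/95th percentiles
print(bands)
```

Repeating this for each day in the backtest window would give the four lines; the realized volume for the day is then compared against them.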
I recalibrated the means used by the model to the 2003 data and ran the model forward 5 years to see how well it accounts for historical data.
A "good" performance:
1. ~60% of volumes should lie between the 20th and 80th percentile.
2. ~15% of volumes should lie in each of the 5th to 20th and 80th to 95th ranges (~30% combined).
3. ~5% should lie above the 95th and ~5% below the 5th (~10% combined).
What happened?
1. 50% of historical volumes were within the model-produced 20th and 80th percentiles. Low.
2. 19% were within the model-produced 5th and 20th percentiles. High.
3. 13% were within the model-produced 80th and 95th percentiles. Low.
4. ~17% were above the forecast 95th or below the forecast 5th percentile. High.
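The tally above can be reproduced with a small helper along these lines. This is a sketch only: `band_coverage` and the dictionary layout are hypothetical names chosen for illustration, and the toy numbers are invented to exercise the function, not taken from the backtest.

```python
def band_coverage(actuals, bands):
    """Share of days whose realized volume falls in each forecast band.

    actuals: realized daily volumes.
    bands:   one dict per day mapping 5, 20, 80, 95 to that day's percentiles.
    """
    counts = {"below_5": 0, "5_20": 0, "20_80": 0, "80_95": 0, "above_95": 0}
    for actual, b in zip(actuals, bands):
        if actual < b[5]:
            counts["below_5"] += 1
        elif actual < b[20]:
            counts["5_20"] += 1
        elif actual <= b[80]:
            counts["20_80"] += 1
        elif actual <= b[95]:
            counts["80_95"] += 1
        else:
            counts["above_95"] += 1
    n = len(actuals)
    return {name: count / n for name, count in counts.items()}

# Toy data, invented purely to exercise the function.
toy_bands = [{5: 10, 20: 20, 80: 80, 95: 90}] * 10
toy_actuals = [5, 15, 50, 50, 50, 50, 50, 50, 85, 95]
coverage = band_coverage(toy_actuals, toy_bands)
print(coverage)
```

A well-calibrated model would put roughly 0.60 in the middle band, 0.15 in each adjacent band and 0.05 in each tail.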
What does that mean?
1. As suspected, my model underestimates the frequency of abnormally high/low volume days (i.e., fat-tailed events): ~17% of days fell outside the 5th-95th band versus the ~10% expected.
2. Judging from the 5th-20th and 80th-95th results, the forecasts skew high: realized volumes landed in the lower band (19%) more often than in the upper band (13%), so the model produces higher-volume days more often than past observations support.
Now what?
The underlying issue is that my model produces volumes by combining independent random draws from probability distributions. Going back to Probability 101, generating a 95th percentile event requires high draws from several of the twelve distributions at once, which is unlikely when the draws are independent. The benefit of my approach is that I can tinker with the means by category (customize) independently and without much trouble. The downside is that combining draws makes my forecasts less consistent with historical observations.
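To put rough numbers on that effect, here is a small sketch comparing the 95th percentile of a sum of twelve independent draws with the sum of the twelve individual 95th percentiles. The normal distributions and the mean of 40 per category are assumptions made for illustration; the model's real distributions are not specified here.

```python
from functools import reduce
from operator import add
from statistics import NormalDist

# Assumption: twelve categories, each normal with mean 40 and variance 40.
categories = [NormalDist(mu=40.0, sigma=40.0 ** 0.5) for _ in range(12)]

# Adding NormalDist objects gives the distribution of the sum of
# independent normal variables (means add, variances add).
total = reduce(add, categories)

# 95th percentile of the combined daily volume under independence.
total_p95 = total.inv_cdf(0.95)

# What the total would be if every category hit its own 95th percentile.
sum_of_p95 = sum(d.inv_cdf(0.95) for d in categories)

print(round(total_p95, 1), round(sum_of_p95, 1))
```

Under these assumptions the independent total's 95th percentile sits well below the all-categories-high total, so if categories tend to move together in reality, the simulated bands will come out too narrow.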
A way around this could be to start with a single draw that informs the draws for every category. Rather than combining a number of independent, individual draws to produce a volume, a percentile could be chosen and passed to each categorical distribution to inform its draw. That would reduce the problem of combined probabilities and better account for high-volume events.
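A minimal sketch of that idea, taking the extreme case where every category uses exactly the same percentile, could look like the following. The normal category distributions, their parameters and the function names are assumptions made for illustration; this is not the model's code.

```python
import random
from statistics import NormalDist, quantiles

# Assumption: twelve categories, each normal with mean 40 and variance 40.
CATEGORIES = [NormalDist(mu=40.0, sigma=40.0 ** 0.5) for _ in range(12)]

def draw_percentile(rng):
    # Keep u strictly inside (0, 1); inv_cdf rejects 0 and 1.
    return min(max(rng.random(), 1e-12), 1 - 1e-12)

def independent_total(rng):
    # Current approach: a separate percentile for every category.
    return sum(d.inv_cdf(draw_percentile(rng)) for d in CATEGORIES)

def shared_total(rng):
    # Proposed approach: one percentile passed to every category.
    u = draw_percentile(rng)
    return sum(d.inv_cdf(u) for d in CATEGORIES)

def band_width(sims):
    # Distance between the simulated 5th and 95th percentiles.
    cuts = quantiles(sims, n=100)
    return cuts[94] - cuts[4]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
indep = [independent_total(rng) for _ in range(5000)]
shared = [shared_total(rng) for _ in range(5000)]
print(band_width(indep), band_width(shared))
```

Sharing the percentile completely widens the bands a lot, probably too much. A blend of the shared percentile and a per-category draw would land somewhere in between, at the cost of one more parameter to tune.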
But I'm not being paid and this model is jumping into overfit territory quickly, so I won't.