Monday, December 9, 2013

The Benefits of Aggressive Driving: First Result

I've added the ability to change lanes. The aggressive car checks how much room it has in front of it in the current lane. If an obstruction is present within a set distance, it checks the next lane; if that lane is clear, it changes lanes, hopefully improving its situation and its elapsed course time.
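
Roughly, the decision looks like this (a simplified sketch of the logic, not the actual simulation code; gap_ahead and SAFE_GAP are made-up names for illustration):

# Simplified sketch of the aggressive car's lane-change decision.
# gap_ahead(lane) would measure the free distance in front of the car
# in a given lane; SAFE_GAP is an assumed threshold, not from the sim.

SAFE_GAP = 60  # pixels of clear road considered "enough room"

def decide_lane(car, gap_ahead):
    """Return the lane the aggressive car should occupy this frame."""
    current, other = car.lane, 1 - car.lane   # two-lane road
    if gap_ahead(current) >= SAFE_GAP:
        return current                 # no obstruction, stay put
    if gap_ahead(other) >= SAFE_GAP:
        return other                   # other lane is clear, change lanes
    return current                     # boxed in, stay and slow down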

I ran this program 500 times with the following conditions:
1. the top speed of the aggressive car is 33% faster than the other cars.
2. the acceleration of the aggressive car is twice that of the other cars.
3. the aggressive car will change lanes to improve its situation.
4. the number of cars in the simulation is 22, which makes the traffic density look something like this.




The differences in elapsed time between the aggressive car and a test car starting at the same speed and position are in the histogram below.

According to this simulation, an aggressive driver usually benefits from his strategy but the degree of the benefit varies substantially. By percentage, the aggressive car finishes the course on average 22% faster than the test car.

Next step is to vary the number of obstacle cars and see how that shifts the distribution.

Thursday, December 5, 2013

The Benefits of Aggressive Driving: Simulation Employing Python OOP

Two weeks ago on the way home from work, an excessively aggressive driver was dodging through traffic behind me. It was night, and like those of all obnoxious drivers, his headlights were of the luminous, distracting blue-white ilk. Jumping lanes, aggressive acceleration, higher top speed. At the next light, we were lined up with three or so cars in front of both of us, on a two-lane road.

On the green, I accelerated gently and kept my pace at the speed of traffic. The blue-white headlight car jumped on the bumper of the car ahead, accelerating aggressively and quickly jumping lanes (to no advantage). At the next red light, he was only one car ahead despite his strategy.

Got me thinking. Does an aggressive driving strategy pay off on surface streets?




Using the pygame module (the Python equivalent of Java's Processing), I've modeled a surface street as six stoplight objects spread at random distances apart. The function that turns the crank here is screen.get_at, which returns the RGB color at a specified (x, y) location on the screen. Each car object (the white rectangles) looks at its current location and ahead of it to find potential obstacles and modifies its speed to avoid crashes.
"Stoplights" are implemented through red zones, which slow the car down; if the car reaches the end of the red zone, it stops. The green or red lines to the left of the lane indicate the light status.

The cars with the small blue rectangles are the "racing" cars. They start at the same location and speed, but have different initialization values for acceleration and max speed. At the end of the course, the time elapsed from start to finish is recorded.

The results of five simulations:
(slow left lane car, fast right lane car)
[279, 282], [257, 290], [203, 316], [208, 298], [239, 291]
how much faster?
[1% faster, 12% faster, 56% faster, 43% faster, 22% faster]

I still need to include the ability of the car to change lanes. Comparing a stupid driver with a stupid and aggressive driver isn't very interesting. 

The end questions I want to answer:
1. How much do traffic volume, the speed differential between fast and slow cars, and the length of red lights affect the course time?
2. Can I establish a metric that weighs the value of quickly completing the course against moderate acceleration and top speed, and find a strategy that optimizes it?



Here is the code. I'll clean it up later.

Tuesday, December 3, 2013

Postmortem on General Moly Speculation





I keep an eye on the molybdenum market. The newsfeed contained both "December" and "General Moly", which got me thinking about an earlier post on GMO futures expiring in December 2013.

Relevant summary:
  • General Moly (GMO) is a development stage mining company. They want to dig up molybdenum. They have the land to do so and had the money, until...
  • The financing required for mine construction fell through due to the unfortunately timed detention of a Chinese bank chairman.
  • The stock price fell accordingly.
My assumption was that if GMO was able to secure financing elsewhere, the stock price would rebound. Futures for GMO expiring in September and December amounted to a bet on whether GMO would be able to secure financing. To date, GMO has not secured financing, and the futures expired worthless.

Mt. Hope, the focal point of GMO, has proven and probable reserves of ~1.5 billion pounds of molybdenum according to GMO's website (and a healthy dose of copper). At current prices of $10/lb, GMO is sitting on $15B in molybdenum that isn't going anywhere, for better and for worse.

Thursday, November 21, 2013

Aggregating Aggregation: "Exception Handling or: How I learned to Stop Worrying and Let R Handle Imperfections"

The code for parsing Indeed.com by keyword and area in the previous post has some primitive features. It looks for exact patterns and, based on the usual syntax of Indeed's HTML, subtracts a fixed number of characters to make a usable link which I can re-paste.
I think the error I've experienced comes from this fixed subtraction taking away a relevant character or leaving an irrelevant one. It works for most links, but not all. Ideally, I would just skip the exceptions and let the loop continue.

R does this through tryCatch(), whose documentation is confusing. Reading through this posting from the Working With Data blog helped me write a function that loads as many pages of Indeed as specified.





tryCatch takes your code and runs it. If an error or warning occurs, it redirects to the warning/error "handler" supplied to tryCatch. In redirecting, it doesn't stop the main loop with an error, so my problem seems avoided, although I don't entirely understand what happens. I ran through 5 pages of Indeed postings for analyst, and it spit out 50 instances of the expected output (as shown in the previous post).
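
For reference, the same skip-and-continue pattern sketched with Python's try/except (the real function uses R's tryCatch; scrape_link here is a placeholder for the actual scraping code):

# Sketch of the skip-on-error idea; the post's real code is R, using
# tryCatch with warning/error handlers instead of except blocks.

def scrape_all(links, scrape_link):
    """Try each job link; record and skip any that raise instead of
    stopping the whole loop."""
    results, skipped = [], []
    for url in links:
        try:
            results.append(scrape_link(url))    # may raise on a malformed link
        except Exception as err:
            skipped.append((url, err))          # note it and move on
    print(f"scraped {len(results)} links, skipped {len(skipped)}")
    return results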

Now I need to figure out what kind of output I want this to give, how I can speed up the execution, and how to automate a list of user inputs so I can run multiple search terms, ie, "analyst", "physics", ... rather than the single "analyst".

but the legs are operational!

Wednesday, November 20, 2013

Aggregating Aggregation: Finding Relevant Jobs

The job hunting process is like climbing a mountain. Each time you crest a ridge, you assume it's the top, only to be disappointed when another ascent appears. C'est la vie. To that end, I'm trying to build some robotic legs to do the grunt work up to the next ascent.

This is a function that takes arguments specifying a search on Indeed.com, reads the results page, finds the job links, and scrapes the third-party redirect postings for relevant keywords. The non-function implementation works well, but this version bugs out quickly.




Example inputs: "entry level analyst" == search terms, "sacramento ca" == geographic area, 100 == radius, 1 == the number of pages to search. The bug appears on the fifth third-party scrape, with values == defs. Debugging the function takes me to the line that scrapes the HTML, but I'm still not sure where the error comes from. Obviously a work in progress, but if I can automate looking through 60 pages of Indeed.com a day, I can save myself a nice chunk of time.
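
For reference, here's roughly the equivalent request in Python; the actual function is in R, and the query parameter names (q, l, radius, start) are my guess at Indeed's URL scheme:

import urllib.parse
import urllib.request

def fetch_indeed_page(terms, area, radius, page):
    """Build an Indeed search URL and return the raw HTML for one page.
    The parameter names and 10-results-per-page pagination are assumptions."""
    params = {"q": terms, "l": area, "radius": radius, "start": 10 * (page - 1)}
    url = "http://www.indeed.com/jobs?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="ignore")

# Example call mirroring the inputs above:
# html = fetch_indeed_page("entry level analyst", "sacramento ca", 100, 1)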






Wednesday, October 23, 2013

Predictive Modeling of S.F. Crime: First Backtest

I've incorporated multiple simulations per day to produce distributions for crime volume, turning this

"X crimes will happen tomorrow according to my model" to
"X-Y crimes happen 50% of the time according to my model".

Here's a short backtest using the 150 days beginning on Jan. 1st, 2013 to illustrate what's going on. The percentile lines correspond to the 95th, 80th, 20th and 5th percentiles for each day in the simulation. The points are the realized volumes.
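
The bands come from something like this (a numpy sketch, with fake simulation output standing in for the model's):

import numpy as np

# sims: one row per simulation run, one column per forecast day.
# Fake data here, standing in for the model's output.
rng = np.random.default_rng(0)
sims = rng.normal(loc=350, scale=40, size=(1000, 150))

# Percentile lines plotted for each day of the backtest.
bands = {p: np.percentile(sims, p, axis=0) for p in (5, 20, 80, 95)}

# Each realized daily volume is then compared against bands[5][day], etc.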




I recalibrated the means used by the model to the 2003 data and ran the model forward 5 years to see how it accounts for historical data.
A "good" performance:
1. ~60% of volumes should lie between the 20th and 80th percentile.
2. ~15% of volumes should lie in both the 5th to 20th and 80th to 95th range (~30% combined).
3. 10% should lie above and below the 95th and 5th, respectively.

What happened?
1. 50% of historical volumes were within the model-produced 20th and 80th percentiles. Low.
2. 19% were within the model-produced 5th and 20th percentile.  High.
3. 13% were within the model-produced 80th and 95th percentile. Low.
4. ~17% were above the forecast 95th or below the 5th. High.

What does that mean?
1. As suspected, my model underestimates the frequency of abnormally high/low volume days (ie, fat-tailed events).

2. Judging from the 5th-20th and 80th-95th results, the forecasts produce higher volume days more often than lower volume days, which is inconsistent with past observations.

Now what?
The underlying issue is that my model produces volumes by combining random draws from probability distributions. Going back to Probability 101, generating a 95th percentile day requires multiple high draws across the twelve distributions. The benefit of my approach is that I can tinker with the category means independently without much trouble. The downside is that combining draws decreases the consistency of my forecasts with historical observation.

A way around this could be to make one draw that informs the draws from every category. Rather than combining a number of independent, individual draws to produce a volume, a single percentile could be chosen and passed to each category's distribution to inform its draw. That would reduce the combined-probability problem and better account for high-volume events.
But I'm not being paid and this model is jumping into overfit territory quickly, so I won't.
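
Still, for the record, the difference between the two approaches looks roughly like this (a sketch; the per-category means and standard deviations are placeholders, not the model's calibrated values):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Placeholder per-category means and standard deviations (twelve categories).
means = np.array([89, 30, 25, 40, 20, 15, 35, 28, 22, 18, 30, 12], dtype=float)
sds = 0.2 * means

def independent_day():
    """Current approach: every category is drawn on its own."""
    return rng.normal(means, sds).sum()

def common_draw_day():
    """Proposed approach: one percentile drives every category,
    so a high day is high across the board."""
    u = rng.uniform()                      # single shared percentile
    return norm.ppf(u, loc=means, scale=sds).sum()

indep = np.array([independent_day() for _ in range(10000)])
common = np.array([common_draw_day() for _ in range(10000)])
print(indep.std(), common.std())           # the common draw spreads the totals out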

Monday, October 7, 2013

Predictive Modeling of S.F. Crime: Adding Monthly Tendencies

I have been developing a model for forecasting crime volume in San Francisco, informed by publicly available historical data going back ten years. Volumes are generated as a sum of draws from normal distributions by crime category, fit to the past data. By shifting the means of these distributions, I have accounted for day-of-the-week trends (Friday being a higher crime day than Tuesday) and a long-term linear decrease.

The last tweak is to account for monthly shifts. Here's a plot severely lacking in elegance to give a quick idea of the magnitudes by year. For each year, I took the mean of the twelve monthly totals and differenced each month's actual count from that mean.





The clearest trend is that February and December sit below the mean. To include this numerically, I've averaged each month's percentage deviation from the annual mean across years. A quick example will make this clearer.

In '03, 13482 crimes were reported in January. The mean monthly total that year was 12741. 13482 - 12741 = 741. As a percentage of the mean, 741/12741 ≈ 5.8%. Doing this for each January, year by year, and averaging those percentages yields 3.5%. Doing this for each month, the list looks like:
January ~ 3.5%        ~ 0.00115
February ~ -6.2%    ~ -0.00222
March ~ 3.6%         ~ 0.001185
April ~ -0.6%          ~ -0.000218
May ~ 1.6%             ~ 0.0000538
June ~ -3.8%            ~ -0.001285
July ~ 0.02%           ~ 0.0000711
August ~ 3.5%          ~ 0.001142
September~ 1.6%     ~ 0.000533
October ~ 5.2%        ~ 0.00167
November ~ -3.0%   ~ -0.001011
December ~ -5.6%   ~ -0.00183

As a sanity check, the sum of these percentages should be close to zero. Taking the full, "unapproximated" values in R yields a number on the order of 10^-17, so effectively zero. To include these percentages as a modifier to a daily volume forecast, I need to translate them into a daily shift. This percentage does not need to compound geometrically like the long-term decrease because the mean is effectively reset at each run. On an annual basis, this monthly tweak should redistribute when crime volume occurs, but not change the total volume.

The second number, trailing the percentage, is the daily shift that accounts for the monthly effect, found by dividing the percentage by the number of days in that month (ignoring leap years).

My model is a simple conditional tweak to the means of normal distributions informed by past crimes. With this new shift, the previous mean calculation,
Mean = Historical Mean * [(Geometric Average Decrease)^(Number of Forecast Days) + (Day of Week Shift)]
gains a new component, so that
Mean = Hist Mean * [(GeoAvgDec)^(NumOfDays) + (DoW Mod) + (Month of Year Mod)]
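
In code, the adjusted mean looks something like this (a sketch using the monthly daily shifts listed above and the day-of-week shifts from the earlier post; the theft mean of ~89 is the running example):

# Sketch of the adjusted mean for one category on one forecast day.
GEO_FACTOR = 0.997877                      # long-term daily decrease factor
DOW_SHIFT = {"Mon": -0.034, "Tue": 0.002, "Wed": 0.033, "Thu": 0.001,
             "Fri": 0.067, "Sat": 0.006, "Sun": -0.074}
MONTH_SHIFT = {"Jan": 0.00115, "Feb": -0.00222, "Mar": 0.001185,
               "Apr": -0.000218, "Oct": 0.00167, "Dec": -0.00183}  # etc.

def adjusted_mean(hist_mean, forecast_day, weekday, month):
    return hist_mean * (GEO_FACTOR ** forecast_day
                        + DOW_SHIFT[weekday]
                        + MONTH_SHIFT[month])

# e.g. a Friday in October, 30 days into the forecast, for thefts (mean ~89):
print(adjusted_mean(89, 30, "Fri", "Oct"))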

The result will be a redistribution of forecast volume which will stay consistent with annual expectations.





Above is the new method including the monthly modifier, and below is the previous forecast using only the day-of-week and long-term shifts. By inspection, this model still does not forecast extreme events, but the new modifier has spread the distribution out rather than clustering it so heavily around the downward linear trend.





The next step is to account for fat-tail events, keeping in mind the longer-term goal of putting forecasts in the context of Monte Carlo simulations.

Sunday, September 29, 2013

Predictive Modeling of S.F.P.D. Crime: Falling Short

Turning the crank in Python and R, here's my first long-term forecast incorporating the tailoring detailed in the last few posts. I didn't spend much time on the plot, hence the funky axis units. To the left of the red line are the actual daily volumes from 2003-2013; to the right of it are the forecast volumes.


So far, a long-term decreasing trend and a day-of-the-week modifier have been included. By inspection, the long-term decrease seems to be doing what it needs to, because there is a clear decreasing trend. What about the day-of-the-week modifier?

Percentile results show that the forecast volumes are lower and more tightly distributed than historical volumes; the historical and forecast summaries for each day are labeled below.

Wednesday
Historical:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  202.0   339.0   372.0   377.6   414.0   666.0
Forecast:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  277.0   326.2   344.0   344.7   362.0   422.0
Thursday
Historical:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  197.0   352.0   379.0   382.9   413.0   576.0
Forecast:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    249     311     331     331     349     406
Friday
Historical:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    6.0   350.0   388.0   388.8   429.0   552.0
Forecast:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  292.0   337.0   353.0   356.0   371.8   427.0
Saturday
Historical:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  149.0   325.0   363.0   361.6   393.0   566.0
Forecast:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  272.0   316.0   334.0   335.4   355.0   394.0
Sunday
Historical:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0   313.0   340.0   341.3   369.0   588.0
Forecast:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  237.0   291.8   308.0   309.6   326.2   376.0
Monday
Historical:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  190.0   329.0   362.0   362.6   395.0   522.0
Forecast:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  256.0   298.0   317.5   318.5   338.0   395.0
Tuesday
Historical:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  210.0   347.0   376.0   379.4   409.0   553.0
Forecast:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  274.0   311.8   328.0   333.9   352.0   421.0



In both cases, Friday is the peak day for historical and forecast crime while Sunday is the minimum. At first glance, the modifier looks like it has successfully shifted the daily volumes to better fit historical patterns. 

Takeaways:
  • The forecast volumes hug the mean more closely than historical volumes.
    More technically, switching from a normal distribution to a more fat tailed distribution is probably warranted. A fatter tail in the probability distribution will pull more "extreme" values more often, and so forecast volumes will spread away from the mean. When I incorporate the Monte Carlo aspect, this should lead to more accurate percentile results.
  • The means of the forecast volumes are lower than the historical means.
    I have added a decreasing trend, so volumes would be expected to be lower. Additionally, the historical means cover the last ten years of crime and so reflect the higher volumes early in the sample. Comparing instead with a more recent set, like the three-year period from 2009-2012: turning the crank on that, the means increase, so something seems amiss, as that is not consistent with what inspection suggests.

Thursday, September 26, 2013

Predictive Modeling of S.F.P.D. Crime: Daily and Long-Term Tailoring

My last post established that a linear model of crime volume from 2003 to 2013 has a slope of -25 (incorrectly labeled 23 in the image from the last post), and that there is variation in crime volume associated with the day of the week. This post details how I intend to include this into the model.

My approach is to shift the means of the normal distributions that I am pulling from by applying a number of modifiers, which will incorporate the specific trends I have mentioned.

As a demonstration, assume that the daily decrease in crime is ~1%, Wednesday has 5% less crime than a normal day, and Tuesday has 5% more crime than a normal day. Tailoring the model to include these tendencies requires shifting the means of the normal distributions. As a concrete example, let's consider thefts, where the mean is ~89 per day.




Through the week, the mean of the distribution selected will be changed like this:
Monday:       Mean = 89
Tuesday:      Mean = 89 [ (1 - 0.01)^1 + 0.05 ]
Wednesday:    Mean = 89 [ (1 - 0.01)^2 + (-0.05) ]

The term inside the brackets is the percentage modifier. The first term accounts for the 1% decrease, compounding daily, hence the geometric form. The second term accounts for the day of the week. I'll explain the origin of each below.


Long-Term Decrease:
Sequentially taking monthly percentage changes in crime volume from 2003 to 2013 produces a series of 120 "returns". The geometric mean comes to -0.212%. Bringing this to our example of an ~89 mean, day one will have 89 crimes. Day two will have 89*(1 + GeoMean) = 89*(1 - 0.00212) = 88.81. Day three will have day two's mean times the same factor: 88.81*(1 - 0.00212) = 88.62. This simplifies to 89*(1 + GeoMean)^(Day), where "Day" is the number of days from the start; over a year, the last value of Day would be 365.
The long-term decrease will be accounted for by a term of (0.997877)^(Day of Forecast).
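
A quick sketch of that calculation (monthly_returns stands in for the 120 actual month-over-month changes):

import numpy as np

# monthly_returns would be the 120 month-over-month percentage changes
# computed from the 2003-2013 volumes; fake values here.
monthly_returns = np.random.default_rng(2).normal(-0.002, 0.03, 120)

geo_mean = np.prod(1 + monthly_returns) ** (1 / len(monthly_returns)) - 1
factor = 1 + geo_mean

day = np.arange(1, 366)
decayed_mean = 89 * factor ** day   # with the real data, factor ≈ 0.997877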

Day of Week Modifier:
This is a lesson in documenting code. I don't remember how I got to these results, but here's what I have noted, listed in order from Monday to Sunday: -3.3%, 0.4%, 3.5%, 0.3%, 6.6%, 0.4%, -7.8%. The numbers work, in the sense that they sum pretty close to 0, so they shift the distributions without affecting the totals. This is important because historical volume informs my model, and I want to stay true to that while tailoring forecasts to account for the day of the week.
Edit:
Alright, I went back. Here's how I got those numbers.
Each year, take the number of crimes occurring on each day of the week and find the mean. For example, if Monday through Friday have 100 crimes each and Saturday and Sunday have 200 each, the mean is ~128. Daily volume can then be represented as a percentage above or below that mean: Saturday and Sunday are 56% above it, while Monday through Friday are about 22% below. Summing these percentages gives zero, meaning that the predicted means over an annual period should be consistent with historical annual means.
Daily Modifier Shift:
Monday, -0.034
Tuesday, 0.002
Wednesday, 0.033
Thursday, 0.001
Friday, 0.067
Saturday, 0.006
Sunday, -0.074
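
Roughly, that calculation per year (a pandas sketch; the column names are assumptions about how the data is laid out):

import pandas as pd

def weekday_modifiers(df):
    """df has one row per reported crime with a 'date' column.
    For each year, find each weekday's share of crime relative to the
    weekday mean, then average those deviations across years."""
    dates = pd.to_datetime(df["date"])
    counts = (df.assign(year=dates.dt.year, weekday=dates.dt.day_name())
                .groupby(["year", "weekday"]).size())
    pct_from_mean = counts.groupby(level="year").transform(
        lambda x: x / x.mean() - 1)
    return pct_from_mean.groupby(level="weekday").mean()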







Tuesday, September 17, 2013

Predictive Modeling of S.F. Crime: Trends

So far, my model generates crime predictions through statistical applications of historical data. Calling it predictive is a stretch because it assumes a stationary, homogeneous environment for crime. For example, daily forecasts do not account for day-of-the-week volume differences or long-term crime trends. I explored the data a little today and produced some charts to that end.

Long-Term Trends:

Monthly volume was compiled into a list. The result is the plot below, with a best-fit trendline showing that crime over the last 11 years has been decreasing at a rate of roughly 23 crimes per month. There is significant variance on a monthly basis, but there is definitely a long-term decreasing trend which could be included in a model forecasting many years out.
















Day of Week Trends:

Below is a barplot of crime volume by day of the week, by year, from 2003 to 2013. That is, the first seven bars correspond to the number of crimes on Friday, Monday, Saturday, Sunday, Thursday, Tuesday and Wednesday occurring in 2003.






















More demonstrative is to take the mean number of crimes for the days of the week in each year and calculate each day's distance from that mean. Say Friday has 1000 crimes, Monday through Thursday have 500 each, and Saturday and Sunday have 700 each. The total is 4400 crimes in the week, and the mean is ~628. Put another way, 628 crimes occurring each day of the week would reach that [cough, rounding] same total (7*628 ~ 4400). The barplot below takes the distance by day from the weekly mean for each year. We see that Friday and Wednesday frequently lie above the mean, while Sunday is uniformly below it.














Takeaways:
1. A long-term decreasing trend would be a useful inclusion in multi-year forecasts. The trend is not strictly decreasing and deviates significantly on a monthly basis.
2. Daily differences matter: Friday has more crime and Sunday has less. Including this to differentiate a Friday forecast from a Sunday forecast fixes the current forecasts' ambiguous, interchangeable treatment of days.



Sunday, September 15, 2013

Predictive Modeling of S.F. Crime: Introduction

My last few posts have approached S.F. reported crime data from a descriptive perspective. My goal was to familiarize the unknown by employing data descriptively and presenting it in simple and intuitive ways. Methods keeping to this theme include:
  • natural frequencies, for example, 1 in 5 reported crimes in the Tenderloin were drug related. 
  • visualizations, such as barplots and wordclouds to compare relative magnitudes of natural frequencies and maps of San Francisco with the density of various crimes plotted over it (heatmap, of sorts).
  • brief summaries of stand-out traits in each district. The Tenderloin has the highest density of crime by area. Violent and drug-related crime are far more frequent than in other districts, and as a consequence, the arrest rate of 70% is double or triple that of the other districts.
The last few weeks I have switched focus from descriptive analysis to predictive analysis. Instead of summarizing what has happened, I want to use relevant historical data to generate forecasts for reported criminal activity.

I'm keeping it simple to start. I organized the data into daily cuts. From there, the crime/day was calculated for high volume crimes individually (crimes occurring more than 2500 times a year such as theft, robbery, burglary, assault, missing persons, non-criminal, other offenses, drugs, warrants, vehicle thefts) and for low volume crimes combined.

The data set of 223 days allows for a histogram of 223 individual values. For the high volume crimes (mean of more than ten a day), the distributions look normal as a first approximation. One such example:























Approximating each high-volume crime and the grouped low-volume crimes with a normal distribution gives a mean and standard deviation for each. A daily crime volume forecast can be produced by pulling randomly from each distribution, giving a count for each crime per day; summing these produces a daily volume. Repeating this process 365 times yields the following barplot. The red line is the annual mean and the blue lines show one standard deviation on either side of it.
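
A sketch of that generation step (the per-category means and standard deviations below are placeholders, not the fitted values):

import numpy as np

rng = np.random.default_rng(3)

# One (mean, sd) pair per high-volume category plus one for the grouped
# low-volume crimes; placeholder numbers, not the fitted values.
means = np.array([89, 40, 30, 35, 25, 20, 45, 28, 22, 18, 30], dtype=float)
sds = 0.2 * means

def daily_volume():
    """Draw a count for each category and sum them into a daily total."""
    return rng.normal(means, sds).sum()

year = np.array([daily_volume() for _ in range(365)])
print(year.mean(), year.std())   # the annual mean (red line) and the sd behind the blue lines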



This is a first run and has little utility for decision making. There is no accounting for daily, monthly or annual trends, such as higher crime volume on Friday. It does not distribute crime by police district, which would be relevant to staffing. More importantly, each day is a random selection from a normal distribution. Another annual generation would likely not resemble this one in anything but the mean and variance.

Where can this be improved?
The short answer is everywhere. The main inclusion needed is realizing daily forecasts in the context of Monte Carlo simulations. With many daily forecasts, I can replace a one number estimate with percentiles. As an example, my day one crime volume was 339. If I ran my annual simulation 1000 times, I could instead offer an estimate through percentiles such as, "95% of outcomes fell below 400 crimes/day". 
This still does not improve the model, except in the clarity of its results. To make the model useful, I need to include how these crimes are distributed across hourly blocks and police districts. Further, pulling from a stationary mean and variance makes this a time-independent model; while that might be the best approximation, I am not well enough informed to make that assumption. Finally, long-term and daily trends can be included after further study of the data.

Tuesday, September 3, 2013

SFPD Project, Part 2 by name and not proportion

Here's a wordcloud that pulls from the descriptions of the 75k+ crimes reported in San Francisco so far. The larger the word, the more frequently it is mentioned; theft dominates.

The quick conclusion is that the largest problem, by frequency, has to do with theft from autos or property. Of the 20k larceny/theft incidents reported, only 1.2k had any sort of resolution. Reinforcing that notion is the incidence per 1000 crimes of each category: larceny/theft is always the largest proportion of crime,


















with the exception of the Tenderloin,















where drugs are a contender for the most frequently occurring crime. An incredible 1 in 5 reported crimes in the Tenderloin involves drugs. For context, the average is 1 in 20 in the other 9 districts. Parsing drug descriptions across the city showed that the occurrence of each type of drug is fairly equal (don't quote me, that is from memory of something I ran a few weeks ago), but in the Tenderloin it leaned significantly towards "harder" drugs (non-marijuana, that is). Not surprisingly, the frequency of assaults and other non-theft crimes increases in the Tenderloin. Come to think of it, jacking people in the ghetto doesn't make a ton of sense.
The other exception is Bayview, where nobody drives with a license plate (ie, Other Offenses). Also not a wonderful area, in terms of % of violent crime.
















Here's the last chart I liked. Pretty self-explanatory. The Tenderloin has an abnormally high percentage of arrests. That comes with the drugs. An anecdote from the incoming PD officer this research was intended for went along the lines of: officers from the TLoin station walk out into the street and have to arrest someone before they even get into their cars.



If exciting to you means putting the bad guys away, assaults (40% arrested), drugs (90% arrested) and warrants (93% arrested) are the ticket. The Loin and Mission are the heavy hitters in that department, followed by Bayview, Northern and Southern. It looks like Richmond, Park, Taraval and, to a lesser extent, Central are going to involve driving around and listening to people bitch about what was stolen from them and vague descriptions of who might have done it, when you know you will never actually find the person who did it. But there won't be as many crackheads, so they have that going for them.

Also funny is how the hood has a disproportionately low share of burglaries compared to the rest of the crime occurring there. Park, Richmond and Taraval lead the way at about 7-8%. Can't blame the burglars for stupidity, anyway. Worth reinforcing that since this is a frequency of occurrence rather than raw burglary numbers, those districts haven't had the most burglaries (Northern and Southern are ahead, but not by as much as might be guessed without the numbers) because more crimes are committed elsewhere. Anyway, burglars know where to shop.



Monday, August 26, 2013

SFPD Project

SFPD has publicly available data on all reported crime, including descriptions/location/time/etc. I've spent the last few weeks parsing it and trying to package it in ways useful to an incoming PD officer. Here are a few plots.




Monday, August 12, 2013

American Rhetoric Wordcloud

I found a list of 100 great speeches in American history and made a wordcloud using the R package of the same name, plus some Python to parse the texts with a code snippet I already had. After taking out some of the expected words like "the", "if", "when", ..., this is the result.
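
The Python parsing side amounts to something like this (a sketch; the stopword list and input handling are simplified stand-ins for what I actually ran):

import re
from collections import Counter

STOPWORDS = {"the", "if", "when", "and", "of", "to", "a", "in", "that", "we"}

def word_counts(text):
    """Lowercase the speech text, strip punctuation, drop stopwords,
    and count what's left."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

# Counts from all 100 speeches can then be summed and handed to the
# wordcloud package in R (or plotted directly).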


For context, "people" and "progress" were mentioned 554 and 52 times respectively. "Political" shows up 5 or 6 times. That could be a quirk with the wordcloud package, or, more likely, something wrong with how I put together the data in R.
Takeaways?

  • There's a focus on who is involved or affected rather than what affects them: "people", "human", "man", "children", "nations", "public", "country", "american", "world", "men" are all big players. "What" words like "political", "economic", "security", "policy", "independence", "forces", "freedom", "progress" are all listed, but with less frequency. An alternative explanation is that more synonyms exist for the "what" words than for the "who" words.
  • Interestingly enough, "war" and "peace" occupy similar areas. "War" appears 310 times with "peace" behind at around 240.
  • A few of the speeches belong to Martin Luther King and so "black" made the list of 50+ occurrences, at 62. Unfortunately, two words were left out of the plot due to sizing issues. "Black" was one of them. I thought we were past this, R.
  • I wouldn't have imagined temporal diction to play so heavily. "Years", "today", "before", "time", "now", "present", "history" are some examples. A call-to-action is common advice for public speaking, but I imagine that just as important in these speeches would be putting historical weight behind claims. "For years, we blah blah blah" makes a journey out of social change or societal innovations, while "We blah blah blah" does not sound as legitimate and important.


I want to compare American rhetoric against the historically evil contenders, the Hitlers and Stalins of the world, just to see what and how much overlaps, but that is enjoyable enough to warrant its own post.

Saturday, July 27, 2013

US Paper Towel Waste: How Many Do We Use?

In the previous post, I justified a range of 13-15 billion pounds of paper towels used in 2012. Smith's talk aims to make people conscious of proper paper towel technique, which should make one towel sufficient for drying your hands; more than one is waste. Joe Smith's TED talk claims that using one less towel per day per person would save over 571,230,000 pounds of paper per year. That was in March 2012.

The U.S. population in March 2012 was 313,152,958. One paper towel for each person in the U.S. each day for a year is 114,300,829,670 towels. That implies the weight per towel used by Smith is 0.005 pounds, or about 2 grams (2 paperclips). RISI estimates that 2/3 of tissue consumption is at home while 1/3 is commercial. I assume overuse is most prominent in the commercial context. 1/3 of the 13-15 billion pounds is 4.3-5 billion pounds of paper towels used commercially.
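
That chain of arithmetic, for the record:

pounds_saved = 571_230_000            # Smith's claimed annual savings
population = 313_152_958              # U.S. population, March 2012
towels = population * 365             # one towel per person per day
print(towels)                         # 114300829670 towels
print(pounds_saved / towels)          # ~0.005 lb per towel, about 2.3 g (454 g/lb)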

I'd guess that the number of towels pulled from a dispenser follows something like a log-normal distribution in a commercial context. In my experience, I don't often see a person use one paper towel or one napkin at the table, so I'd put the peak at 2 per use. I'll split the probabilities up arbitrarily, on my own guess, like this:
20% - 1 towel
45% - 2 towels
30% - 3 towels
10% - 4 towels
5% - 5 towels

That would put around a billion pounds of commercial paper towels (or 200 billion individual towels) as used properly, and the remaining 3.3-4 billion pounds (or 660-800 billion individual towels) as waste. Estimating the savings from one towel per day per person, as Smith did, is wise because it deals in mostly concrete values rather than in my own rough and unfounded guesses, but pushing the context out to orders of magnitude starts to illuminate how much could be saved.


US Paper Towel Waste: Where do Joe Smith's numbers come from?

In the previous post, I started coming up with figures on US paper towel usage with the end goal of finding where the numbers used in Joe Smith's TED talk came from. His claim is that 13 billion pounds of paper towels are used in the U.S. each year. My best Googling gave me a number around 7 billion, which is notably far off and needs straightening out. I'll do that here.

The RISI estimate for 2012 tissue usage in North America (a blanket term including the different forms of towel and tissue) was around 18 billion pounds, so Joe Smith's number covers about 70% of total tissue usage in 2012. Working from a 2008 report by RISI,
The North American tissue market is comprised of toilet tissue (45% share of North American consumption), toweling (36%), napkins (12%), facial tissue (6%) and other uses, including sanitary (1%).
it seems reasonable that he is assuming the lion's share of the toweling and toilet tissue market, ie, 45 + 36 = 81%, which gives ~15 billion pounds. Smith's talk was given in 2012, so I'd assume he is using 2011 numbers rather than 2012 numbers. Organic growth could account for some of the discrepancy, but going from 13 billion to 15 billion year-over-year is 15% growth, which is too much to attribute to growth alone. My guess is that the discrepancy lies in the percentage breakdown of tissue use. My numbers are from 2008, while his could be from more recent years. For example, if the 45% toilet tissue and 36% toweling shares (totaling 81%) changed to 43% and 34% (totaling 77%), 15 billion pounds drops to about 14 billion pounds.

I'll assume that a combination of organic growth within the tissue market and shifts in end-use percentages closes that gap, and call it 13-15 billion pounds of "paper towels" used in 2012.

Friday, July 26, 2013

U.S. Paper Towel Waste: Testing the Bounds of Google Search

There's a TED talk by Joe Smith about using paper towels properly. He claims some numbers on paper towel usage and waste. I'm going to see if I can get something close with Google and estimation. It'll go something like this:

1. How much do we consume per year (for comparison against his number)?
2. How much of that is unnecessary use?

Now to get some estimates for these values. Digging around the internet, you'll find things like this, dug up from netdryers.com:
According to Tissue World magazine’s April/May 2012 edition,
 the US Tissue Market is a $16B market of which 37% is AFH (Away From Home).

Tissue World magazine. 'Nuff said.

I'm taking a crack at (1) first. Fortunately, consultants exist. RISI, an information provider for the forestry industry, produced this chart:
RISI, through this chart, claims that the world uses 31.8 million tonnes (~70 billion pounds) of tissue per year, with North America accounting for 26.5% of it, or 18.58 billion pounds. Tissue is a broad category that includes bathroom tissue, hand tissues, paper towels, sanitary towels and so on. A report from RISI exists free of charge on the internet (probably because it's from 2008, and we all know the paper towel industry is cutthroat; if you don't keep current with supply/demand statistics, you'll get wiped up by the competition), and it claims that in 2008, 36% of the tissue market consisted of "toweling", which I assume is the category I'm looking for. Perhaps the most obscure Googling I've ever done followed: historical toweling consumption,





















which gives results about what you'd expect. I have work in a few hours, so I'm going to be lazy and assume that there isn't much shift in the end-use percentages of tissue in the US from 2008 to 2013. I would worry about this if the developing world were involved, because I'd assume that as per capita wealth in the developing world, namely China, increases, the percentage of paper towels used would increase; paper towels feel like the most discretionary form of tissue that I use.

So, we have total North American tissue consumption in 2012 and an estimate for how much of that consumption is toweling: 36% of 18.58 billion pounds is 6.689 billion pounds of paper towels used per year.
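
Spelled out:

world_tonnes = 31.8e6                     # RISI: global tissue, metric tonnes
world_pounds = world_tonnes * 2204.6      # ~70 billion pounds
na_pounds = 0.265 * world_pounds          # North America's 26.5% share, ~18.58B lb
toweling_pounds = 0.36 * na_pounds        # 36% "toweling" share, ~6.7B lb
print(na_pounds / 1e9, toweling_pounds / 1e9)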

I'm wiped out and you've absorbed all you can about paper towels today, even if I've quilted the readily ply-able information elegantly (which I haven't).

Next step is to haggle over the severity of all you people's waste.

Monday, July 8, 2013

Annual Moly and Copper Prices, 1900-2011 and the '79 Spike

Courtesy USGS,

The focus here is '79. The cause of the price spike was an oil shortage brought on by the Iranian Revolution. The response was to expand oil production elsewhere by building new wells, which need molybdenum, so molybdenum demand spiked. On the supply side, the Endako mine went on strike. As with the 2000s spike, the inability to quickly raise production to meet demand was the catalyst for sharp price increases.
Not the most elegant plot, but the production increase around '79 is clearly comparable to or smaller than that of many other years from 1960-1990. Higher-resolution data would be nice, but annual is all that is available; I doubt monthly molybdenum price data was a key concern in 1920.

Saturday, July 6, 2013

Metals Basket: Value of $1 through time

Playing around a bit more with the daily metals data from Quandl. Does molybdenum move with the crowd?

Investing in any of these metals, at best, gives you your buck back. Molybdenum looks like it plays by its own rules and has had a period of lower volatility than the others in the metals basket, although the graph is a bit noisy. Heteroskedastic in parts, anyway.

Next, try to find long-term data (back to the early 2000's, or even the 90's) for other metals to match against my set of monthly molybdenum returns from '93 to 2013, incorporating both volatile ('94ish, 2004-10) and stationary periods.

Tuesday, July 2, 2013

Metals Basket, 2012-2013 Mids of Bid/Ask Spread

Quandl has a variety of metal bid/asks. I'm not sure how to massage data with missing elements into the same plot, so I've done them separately. It would be worth fixing the plots so that the scales match, because the apparent volatility from plot to plot is not comparable and could be deceptive on casual examination.






Some variation, but largely following the same trends. Would be useful to make a rolling average for each plot and put them on the same plot to cut the noise down. Now here's molybdenum:


Point being, molybdenum plays by its own rules. It's unfortunate that (free) molybdenum data only starts at the beginning of 2012, although Quandl adds onto the series daily.

 Next, stack rolling averages and look for divergences within the group.
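
One way to do that, sketched in pandas (the window length and the normalization to a starting dollar are my placeholder choices):

import pandas as pd
import matplotlib.pyplot as plt

# prices: one column per metal, indexed by date, with NaN where a series
# has no quote; column names and the 20-day window are placeholders.
def plot_rolling(prices: pd.DataFrame, window: int = 20):
    smoothed = prices.rolling(window, min_periods=1).mean()
    # Scale each series by its first available smoothed value so the
    # different price levels are comparable ("value of $1", as in the
    # earlier post); missing stretches simply leave gaps in the lines.
    normalized = smoothed.apply(lambda s: s / s.dropna().iloc[0])
    normalized.plot()
    plt.ylabel("value of $1")
    plt.show()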