“Interesting Noise” and the Perils of Mining for Data

Bizarre statistics typically grab headlines. For example, everybody at Saratoga yesterday was smitten with the “A.P. Indy Curse,” the data-mined finding that no offspring of the great, now-pensioned A.P. Indy had ever won a sprint of six furlongs or shorter at the premier summer meets of Saratoga or Del Mar. Believing this stat to be predictive of the race at hand (a six-furlong, NY-bred maiden allowance for 3-year-olds and up), the public shied away from the speedy A.P. Indy daughter Girlaboutown and, instead, made a first-time starter the favorite. Girlaboutown, just a tad shy of 5-2, dominated the field on the front end and never looked in trouble, ending the “curse.”

But there was absolutely no reason to believe that this statistic was particularly predictive for these lower-level sprints at Saratoga or Del Mar. The “A.P. Indy Curse” was the quintessential bad stat. It was simply the result of clever data-mining, a technique that often looks for patterns before theory. That is great for finding fun stories, but not all that useful for predicting the future. Finding useful predictive stats in horse racing requires more than combing through data for patterns: once you find an interesting stat, it is generally smart to ask yourself why, independently of the data, it might in fact be true. Stats must follow theory, not the other way around. This is often best done by letting theory, not data, drive the research questions.
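
To see how easily pure chance can manufacture a “curse,” here is a minimal simulation sketch. Everything in it is a hypothetical assumption chosen for illustration, not real racing data: 200 sire/track/distance subsets, 30 starts in each, and the same 10% win chance for every start. Even with no real effect anywhere, simple arithmetic says roughly eight of those subsets should show zero wins, and a data-miner who scans all 200 will find them.

```python
import random


def count_chance_curses(n_subsets=200, starts_per_subset=30, win_prob=0.10, seed=0):
    """Count subsets with zero wins when every start has the same win chance.

    All defaults are hypothetical illustration values, not real racing data.
    """
    rng = random.Random(seed)
    curses = 0
    for _ in range(n_subsets):
        # Each start wins independently with probability win_prob.
        wins = sum(rng.random() < win_prob for _ in range(starts_per_subset))
        if wins == 0:
            curses += 1
    return curses


def expected_curses(n_subsets=200, starts_per_subset=30, win_prob=0.10):
    """Expected number of zero-win subsets under the same assumptions."""
    return n_subsets * (1 - win_prob) ** starts_per_subset


if __name__ == "__main__":
    print("simulated zero-win subsets:", count_chance_curses())
    print("expected zero-win subsets: ", round(expected_curses(), 2))
```

The more subsets you slice the data into, the more of these empty cells you should expect, which is why a zero-win stat found by scanning proves little unless there is an independent reason to expect it.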

That said, given what we know about sires, speed, and A.P. Indy’s overall performance as a sire, was there any independent, theoretical reason to believe this stat to be anything other than “clever noise”? It has been fairly well established that A.P. Indy’s progeny prefer routing (and there is ample theoretical basis for this belief), but there is no reason to believe that sprinting at Saratoga and Del Mar was ever a particular problem when compared to other tracks. The curse in general is merely a restatement of A.P. Indy’s weakness as a sprint sire, but it is misleading with regard to any increased severity at the marquee summer meets. There is no particular reason to believe him worse across the board, independent of class level, at Saratoga and Del Mar. A good sprinter by A.P. Indy, while rare, is not unheard of; for example, Girolamo was a G1 winner, and his offspring still win the occasional lower-level sprint around the United States. In sum, the Saratoga/Del Mar version of the “A.P. Indy Curse” provided a great example of seeing meaning in noise, and a great opportunity to exploit how the public misuses and misapplies information.

Image: Microsoft Sweden, “Excel 2.0. Box Shot 1987,” Copyright 2011. Creative Commons 2.0.