“Interesting Noise” and the Perils of Mining for Data

Bizarre statistics typically grab headlines. For example, everybody at Saratoga yesterday was smitten with the “A.P. Indy Curse,” the data-mined finding that no offspring of the great, now-pensioned A.P. Indy had ever won a sprint of six furlongs or shorter at the premier summer meets of Saratoga or Del Mar. Believing this stat to be predictive of that day’s race (a six-furlong, NY-bred maiden allowance for 3-year-olds and up), the public shied away from the speedy A.P. Indy daughter Girlaboutown and, instead, made a first-time starter the favorite. Girlaboutown, just a tad shy of 5-2, dominated the field on the front end and never looked in trouble, ending the “curse.”

But there was absolutely no reason to believe that this statistic was particularly predictive for these lower-level sprints at Saratoga or Del Mar. The “A.P. Indy Curse” was the quintessential bad stat. It was simply the result of clever data-mining, a technique that often looks for patterns first and theory second. That is great for finding fun stories, but not all that useful for predicting the future. Finding useful predictive stats in horse racing requires more than combing through data until a pattern turns up. Once you have found an interesting stat, it is generally smart to ask yourself why that stat might hold on its own merits, apart from the data that produced it. Stats must follow theory, not the other way around. The surest way to get there is to let theory, not data, drive the research questions.
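To see how easily a “curse” turns up when you go looking for one, here is a minimal sketch in Python. Every number in it is a made-up assumption for illustration (50 sires, 10 meets, 25 sprint starters per sire per meet, a flat 10% win chance for every starter), and the function names are hypothetical; none of it is real racing data. By construction, no sire in this toy world is any better or worse at any meet.

```python
import random


def count_chance_curses(n_sires=50, n_meets=10, starts=25, win_prob=0.10, seed=0):
    """Count (sire, meet) pairs with zero wins when every starter has the
    same win chance. All defaults are illustrative assumptions, not data."""
    rng = random.Random(seed)
    curses = 0
    for _ in range(n_sires):
        for _ in range(n_meets):
            # Each starter wins independently with probability win_prob.
            wins = sum(rng.random() < win_prob for _ in range(starts))
            if wins == 0:
                curses += 1  # a winless record that arose purely by chance
    return curses


def expected_curses(n_sires=50, n_meets=10, starts=25, win_prob=0.10):
    """Expected number of winless (sire, meet) pairs under pure chance."""
    return n_sires * n_meets * (1 - win_prob) ** starts


if __name__ == "__main__":
    print("simulated winless pairs:", count_chance_curses())
    print("expected winless pairs:", round(expected_curses(), 1))
```

Under these assumptions the arithmetic alone says to expect roughly 36 winless sire-and-meet pairs out of 500 (500 × 0.9^25), each one a headline-ready “curse” with nothing behind it. A data miner who scans all 500 pairs and reports the most striking one has found noise, which is why the theory has to come first.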


That said, given what we know about sires, speed, and A.P. Indy’s overall performance as a sire, was there any independent, theoretical reason to believe this stat to be anything other than “clever noise”? It has been fairly well established that A.P. Indy’s progeny prefer routing (and there is ample theoretical basis for this belief), but there is no reason to believe that sprinting at Saratoga and Del Mar was ever a particular problem when compared to other tracks. The curse in general is merely a restatement of A.P. Indy’s weakness as a sprint sire, but it is misleading in suggesting that the weakness is more severe at the marquee summer meets. There is no particular reason to believe him worse across the board, independent of class level, at Saratoga and Del Mar. A good sprinter by A.P. Indy, while rare, is not unheard of; for example, Girolamo was a G1 winner, and A.P. Indy still sires occasional winners in lower-level sprints around the United States. In sum, the Saratoga/Del Mar version of the “A.P. Indy Curse” provided a great example of seeing meaning in noise, and a great opportunity to exploit the way the public misuses and misapplies information.

Image: Microsoft Sweden, “Excel 2.0. Box Shot 1987,” Copyright 2011. Creative Commons 2.0.

Top 5 Desert Island Races: #2 Secretariat

So many wonderful places to go with Secretariat. I love his Canadian International win, where he looked, impossibly, like an even better turf horse. But when it comes down to one, it has to be the Belmont Stakes. Think about it: he runs the equivalent of two 6f sprints without tiring, in a style that should demand a collapse in the stretch. I don’t know if there is a more amazing performance in sport.

#5: A.P. Indy in the Belmont Stakes, 1992

#4: Frankel in the Juddmonte International Stakes, 2012

#3: Cesario in the American Oaks, 2005

#2: Secretariat in the Belmont Stakes, 1973

#1: Released July 8, 2014 (hint: It’s from 2004)

Top 5 Desert Island Races: #5 A.P. Indy

If I were trapped on a desert island, I wouldn’t want a book or movie with me. Instead, I’d want a collection of my favorite YouTube races. I’m going to count them down over the holiday weekend, starting today.

#5: A.P. Indy’s Triumph in the 1992 Belmont Stakes. This race is a testament to his pure stamina, which he showed both here and in winning the Breeders’ Cup Classic. A sire of sires, he passed his routing prowess on to his sons and daughters, including the 2007 winner of this race, Rags to Riches.