Using data from 1871 to 2008, BYU statistics professor Gil Fellingham developed a home-run prediction model for Major League Baseball players, one that controls for the player’s era of birth, season of play and home ballpark to get what the researchers call a “pure performance” curve.
“If you know where something’s going, you’ve got a leg up to get there and beat the competition,” said Fellingham, whose research was recently published in The American Statistician. “And in sports, these players make a lot of money, so if you’re able to predict their performance better, general managers are better able to calibrate a contract.”
Using their 100-plus years of data, Fellingham and Jared Fisher, a BYU alum now in a Ph.D. program at the University of Texas, created a Bayesian semiparametric model, which Fellingham called a “very rich and important methodology for prediction.” They tested their model on 22 players using data from 2009–16, and in 11 cases, their semiparametric Bayesian model was superior to a simpler hierarchical Bayesian model. In nine cases the two models performed equivalently, and in only two cases did the simpler model win out.
“We’re trying to predict out of the data range, and that’s very risky,” Fellingham said. “But in this case we’ve got other people in the data set that we think are performing like those we’re trying to predict.”
To enhance their predictive ability, the researchers controlled for factors that had a significant impact on home-run frequency. For example, lowering the mound and outlawing the spitball increased home-run frequency, while World War II decreased it. The impact of Babe Ruth’s play created a kind of golden age for home-run frequency for players born in the 1920s. The Colorado Rockies’ home field, by virtue of its elevation, lends itself to more home runs. Accounting for those kinds of influences allowed for a generally more accurate performance curve.
Though making a prediction from a single data point is not what this particular study is about, Fellingham pointed to the pure performance curve of the Angels’ Albert Pujols when asked to predict how many homers Yankee Aaron Judge will hit this season. Judge had a record-breaking rookie year in 2017, hitting 52 home runs over the course of the season.
Assuming 1) Judge’s home-run output in his rookie year truly reflects his ability, 2) his number of at-bats will be roughly similar to what it was in his rookie year and 3) his pure performance curve will match up closely with that of Pujols (whom Fellingham calls “a reasonably productive HR hitter”), Fellingham expects Judge to hit home runs at about the same rate he did last year. So for 550 at-bats, Fellingham predicted 53 homers, with a range of roughly plus or minus six.
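The arithmetic behind a same-rate projection can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not the researchers' Bayesian model. The 52 home runs, the 550 at-bats and the plus-or-minus-six range come from the figures above, while the 542 rookie-year at-bats is an assumed figure that does not appear in this article, and the function name is hypothetical.

```python
def project_home_runs(prior_hr, prior_ab, projected_ab):
    """Scale a previous season's home-run rate to a projected number of at-bats."""
    if prior_ab <= 0:
        raise ValueError("prior_ab must be positive")
    return prior_hr / prior_ab * projected_ab


ROOKIE_HR = 52      # from the article
ROOKIE_AB = 542     # ASSUMPTION: rookie-year at-bats, not stated in the article
PROJECTED_AB = 550  # from the article
MARGIN = 6          # "roughly plus or minus six," from the article

estimate = project_home_runs(ROOKIE_HR, ROOKIE_AB, PROJECTED_AB)
point = round(estimate)
low, high = point - MARGIN, point + MARGIN

print(f"Projected home runs: about {point} (roughly {low} to {high})")
```

Under that assumed at-bat total, holding the rate constant lands on the same figure Fellingham gave, which is consistent with his expectation that Judge will hit home runs at about the same pace as last year.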
For Fellingham, the research is an extension of his passion for both sports and statistics. He played water polo and swam in college, earned advanced degrees in physical education, statistics and biostatistics, and has coached water polo, volleyball and track at both the high school and collegiate levels.
But the applications of this project, he said, extend far beyond sports, into such areas as stock-market and health-care-cost prediction.
“If you have a methodology that does a little better at predicting, then that’s worth pursuing,” said Fellingham, who is currently working on predictive health-care-cost research. “What we’re trying to do is anticipate what is going to happen years down the road, and by being able to anticipate it, be better prepared.”