While baseball has many awards that reward greatness, none is more highly regarded than election to the Baseball Hall of Fame. Election is not only one of the most prestigious honors in baseball, it is also one of the most selective: only 74 pitchers (317 members overall) are currently in the Hall of Fame. In recent years, predicting who gets elected has become a guessing game because of uncertainty about steroid use by some players. The goal of our model is to take into account multiple features of a pitcher, such as ERA and steroid use, to predict whether the player will be elected to the Hall of Fame.
We found all of the statistics that we used on seanlahman.com, a website whose maintainer compiles the baseball statistics from the previous year and publishes them in CSV format. A major set of data that we added to the traditional baseball statistics was awards won, such as World Series MVP and the Gold Glove. We believe that winning these awards is highly correlated with being inducted into the Hall of Fame. However, many of these awards were not created until the 1950s or 1960s, whereas our dataset goes back to the 1870s. To handle this, for any player who played only before an award was created, we set the value for that award to NaN and let XGBoost, which handles missing values natively, decide how to treat it so as to best help the model.
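As a minimal sketch of the NaN rule described above, the function below marks an award as missing when a player's career ended before the award existed, and otherwise records the count (zero if never won). The function name, the dictionary layout, and the use of a player's last season as the cutoff are assumptions made for illustration, not our actual preprocessing code. The first-award years (World Series MVP in 1955, Gold Glove in 1957) are the only awards included because they are the ones named in the text.

```python
import math

# First season in which each award was given.
AWARD_FIRST_YEAR = {
    "World Series MVP": 1955,
    "Gold Glove": 1957,
}


def award_features(award_counts, last_season):
    """Build award features for one player (hypothetical helper).

    award_counts: dict mapping award name -> number of times won.
    last_season:  final season the player appeared in.

    An award that did not exist during the player's career becomes NaN
    ("unknown"), which is different from 0 ("eligible but never won").
    """
    features = {}
    for award, first_year in AWARD_FIRST_YEAR.items():
        if last_season < first_year:
            features[award] = math.nan
        else:
            features[award] = float(award_counts.get(award, 0))
    return features


if __name__ == "__main__":
    print(award_features({}, 1920))                 # both awards NaN
    print(award_features({"Gold Glove": 3}, 1990))  # 0.0 and 3.0
```

Keeping NaN distinct from zero matters here: a zero would tell the model that a 1920s pitcher failed to win an award he never had the chance to win.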
To further differentiate our model from a standard binary classification on the given data, we decided to create a steroid usage score as an additional predictor of whether someone will be inducted into the Hall of Fame. Known steroid users are usually passed over for induction. As there is no easily accessible data on whether a player used steroids, we created our own for the players in our dataset. Our team wrote a Python script to query each baseball player's Wikipedia page; players without a page were assigned a "None" value. For each page, the script counted the occurrences of various keywords that we believed were indicators of steroid use. Once we had this count, we tried various methods of normalizing it across players. We needed a weighted count because some players have much shorter Wikipedia pages than others, so a single keyword occurrence on a short page should carry a similar value to numerous occurrences on a long page. We also gave more weight to a mention of steroids in the player summary than in the rest of the article.
While this seemed like a good idea in the beginning, we began encountering problems with some of the lesser-known players that would skew our final results. Because of the simplicity of the Wikipedia API, if no page was found directly under the name of the baseball player, the script would ask the API to suggest the most relevant search term. Occasionally, Wikipedia would match a lesser-known player who may not have a Wikipedia page to a completely different subject, or even to a different person who plays a different sport and is linked to P.E.D. usage. Unfortunately, there was no automated way to verify whether the suggested Wikipedia page was the appropriate one. This was very difficult to combat, as we had to rely on the Wikipedia scraper returning the correct player and could not verify each one manually.
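The length-normalized, summary-weighted keyword count described above could be sketched roughly as follows. This is only one of several possible normalizations, not the exact formula we settled on: the keyword list, the summary weight of 3, and the "weighted hits per 1,000 words" scaling are all illustrative assumptions, and the function takes page text as plain strings rather than calling the Wikipedia API.

```python
import re

# Hypothetical keyword list; the real list of indicators is not reproduced here.
KEYWORDS = ("steroid", "performance-enhancing", "doping")


def count_hits(text, keywords=KEYWORDS):
    """Count case-insensitive occurrences of any keyword in the text."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return len(pattern.findall(text))


def steroid_score(summary, body, keywords=KEYWORDS, summary_weight=3.0):
    """Weighted keyword hits per 1,000 words of page text.

    Returns None when the player has no page at all, mirroring the
    "None" value used for players without a Wikipedia article.
    """
    if summary is None and body is None:
        return None
    summary = summary or ""
    body = body or ""
    total_words = len(summary.split()) + len(body.split())
    if total_words == 0:
        return 0.0
    weighted_hits = (summary_weight * count_hits(summary, keywords)
                     + count_hits(body, keywords))
    return 1000.0 * weighted_hits / total_words


if __name__ == "__main__":
    print(steroid_score("He used steroids.", "Clean text here now today ok yes"))
    print(steroid_score(None, None))
```

Dividing by page length is what lets a single mention on a short page count about as much as many mentions on a long one. It does nothing, however, to protect against the wrong page being scored in the first place.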
We needed a more reliable method for determining steroid usage.
We began exploring other data sets that could link a player to steroid usage and came across two: the Mitchell Report and a list of players who have tested positive for performance-enhancing drugs. The Mitchell Report is the result of a comprehensive investigation, conducted by George Mitchell, a former U.S. senator and federal prosecutor, into the use of performance-enhancing drugs in Major League Baseball. The report focused on higher-profile players and ultimately named 89 former and current MLB players. Combined with the list of positive tests, this gives us a solid database for examining the effect of steroid use on Hall of Fame candidacy.
For our model we began with an XGBoost classifier, since we are doing binary classification on the dataset. We used five-fold cross-validation while training our XGBoost model. We then used bagging, with the first 37 Ramanujan primes as the seeds for 37 XGBoost models, and combined the 37 predictions into one. This method of prediction gave us an AUC score of 0.9738, which is very high.
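The seed-bagging step can be sketched as below. The first part computes the first 37 Ramanujan primes from their definition (the n-th Ramanujan prime is the smallest R such that there are at least n primes in (x/2, x] for every x >= R). The second part averages per-player predictions across seeds. Averaging is an assumption, since the text above says only that the predictions were combined, and `predict_with_seed` is a hypothetical callable standing in for "train one XGBoost model with this seed and return its predicted probabilities".

```python
def prime_sieve(limit):
    """Return a bytearray where sieve[x] == 1 iff x is prime (0 <= x <= limit)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return sieve


def first_ramanujan_primes(count, limit=1000):
    """First `count` Ramanujan primes, searching integers up to `limit`."""
    sieve = prime_sieve(limit)
    pi = [0] * (limit + 1)  # pi[x] = number of primes <= x
    for x in range(1, limit + 1):
        pi[x] = pi[x - 1] + sieve[x]
    # Known bound: R_n <= p_(3n), so the range must contain at least 3n primes.
    if pi[limit] < 3 * count:
        raise ValueError("limit too small for the requested count")
    result = []
    for n in range(1, count + 1):
        # R_n is one more than the last x with fewer than n primes in (x/2, x].
        last_short = max(x for x in range(1, limit + 1)
                         if pi[x] - pi[x // 2] < n)
        result.append(last_short + 1)
    return result


def bagged_prediction(predict_with_seed, seeds):
    """Average per-player predictions over one model run per seed."""
    runs = [predict_with_seed(seed) for seed in seeds]
    return [sum(column) / len(runs) for column in zip(*runs)]


if __name__ == "__main__":
    seeds = first_ramanujan_primes(37)
    print(seeds[:5], seeds[-1])
```

In practice, `predict_with_seed` would fit a classifier such as `xgboost.XGBClassifier(random_state=seed)` and return `predict_proba(X)[:, 1]` for the players being scored. The particular choice of seeds has no statistical significance; any 37 distinct seeds would serve the same purpose.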
The top three most important features are Earned Run Average (ERA), wins, and All-Star selections. ERA is the average number of earned runs a pitcher gives up per nine innings pitched. The lower the ERA the better, and some of the best pitchers have a career ERA below 3.00. Wins is a largely self-explanatory statistic, but to qualify for a win a starting pitcher needs to pitch at least five innings, and his team needs to have the lead when he leaves the game and never give it up. In the middle of every season, the season's best players are selected, through voting by fans, players, and managers, to participate in an All-Star game. This game is a fan favorite because fans get to see the best compete against each other.
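For reference, the ERA definition above works out to nine times earned runs divided by innings pitched. The small helper below is an illustrative sketch, not part of our pipeline. It assumes innings are supplied as outs recorded (three outs per inning), as in the `IPouts` column of the Lahman pitching table, which avoids the ambiguity of fractional-inning notation.

```python
def earned_run_average(earned_runs, outs_recorded):
    """ERA = 9 * earned runs / innings pitched, with innings given as outs."""
    if outs_recorded <= 0:
        raise ValueError("ERA is undefined without at least one out recorded")
    innings_pitched = outs_recorded / 3
    return 9 * earned_runs / innings_pitched


if __name__ == "__main__":
    # 70 earned runs over 210 innings (630 outs) is an ERA of exactly 3.00.
    print(earned_run_average(70, 630))
```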
Both of our features related to steroid use, Mitchell-Report and Positive-Test, had no impact on our model. We believe this is because so few players fall into these categories, and the vast majority of those who do are still active or not yet eligible for the Hall of Fame, so nothing meaningful about their effect on Hall of Fame election can be determined from data analysis alone.
The Perfect Player
So what does it take to make it to the Hall of Fame? The three best things a pitcher can do over his career are to win a lot of games, maintain a low ERA, and be selected to a lot of All-Star games. The average Hall of Fame pitcher won 246 games in his career, played in 7 All-Star games, and had an ERA of 2.97. We also created a function to scale a player's stats to be the same as those of an average Hall of Fame pitcher. Before this scaling, only one pitcher was given a better than 50% chance of making it into the Hall of Fame, but after scaling the model made stronger predictions for several other pitchers to be inducted.