Use R and the Baseball DataBank database in MySQL to analyze statistically the 10 most underpaid outfielders for the years 1999–2003.
The most common conversation between any two baseball fans (after arguing about whose favorite team is better, of course) is what player is the best in the game. Baseball, more than any other sport, it seems, is obsessed with ranking players. Certainly, the fact that every player gets a turn at bat plays into the perception that such a thing should be a “simple” matter. The ensuing debate after failing to establish the proper all-time pecking order is how high all players’ salaries have become. Of course, both parties agree on that and then they’re friends again. But the real truth is that ranking players is highly subjective, and the distribution of player salaries is highly skewed. For every A-Rod making $22 million, there are 100 younger players making a “scant” $300,000.
So, I won’t even try to claim who the “best” players are; rather, I will simply use R to help identify what players have “similar” attributes to one another and then attempt to predict what their salary can be expected to be, based on those attributes. Players who were paid much lower than expected given their similarity to other players are considered “undervalued.” Likewise, players paid much more than expected are considered “overvalued.” And to help simplify things, I look only at outfielders for the years 1999–2003.
(This analysis isn’t completely fair. Major league players aren’t free agents until they’ve played for six years, so their salaries aren’t set by the market. For the first three years, their salaries are set by their original contracts and are often set at the major league minimum. For the next three years, players are eligible for arbitration, and their salaries are set through negotiations with their teams. But even if the analysis isn’t totally fair, it’s still interesting to look for bargains.)
So how do you perform such an analysis using R? It’s actually pretty simple, as I will show you in this hack. The difficult part is making sure everything makes sense and is being interpreted properly. It’s really more art than science, but here are the basic steps:
Import raw data from MySQL into R.
Compress input variables to a handful of common attributes (such as speed, power, etc.).
Assign scores to all players based on these attributes.
Group players on these attributes based on their similarity to one another.
Predict salaries based on group membership and a few other input variables.
Look at actual minus predicted salaries.
Let’s get started!
It is assumed that you have the Baseball DataBank database loaded in MySQL [Hack #20] and that you can access this data from R [Hack #33].
The first thing you need is a SQL query to define what raw data to get from the database. It makes sense to let R pull the data using a query because multiple tables will be joined together and we can set our selection criteria there. Moreover, I prefer to keep this SQL query in a separate text file because it's easier to maintain. So, use Notepad to save the following SQL query as a simple text file (in C:\baseball_salaries\get_data.sql):
SELECT
    CONCAT(M.playerID, '|', F.yearID) AS ID,
    CONCAT(M.nameLast, ', ', M.nameFirst) AS playerName,
    F.yearID,
    (F.yearID - M.birthYear) AS age,
    F.lgID, F.teamID, F.POS, F.G AS POS_G,
    B.G AS G, B.AB,
    (B.H - B.`2B` - B.`3B` - B.HR) AS X1B,
    B.`2B` AS X2B, B.`3B` AS X3B,
    B.HR, B.SB, B.BB, B.SO, B.GIDP,
    F.PO, F.A, F.E, F.DP,
    S.salary
FROM master AS M
INNER JOIN fielding AS F
    ON (M.playerID = F.playerID)
INNER JOIN batting AS B
    ON (F.yearID = B.yearID) AND (F.playerID = B.playerID)
       AND (F.teamID = B.teamID)
INNER JOIN salaries AS S
    ON (F.yearID = S.yearID) AND (F.playerID = S.playerID)
       AND (F.teamID = S.teamID)
WHERE (F.yearID BETWEEN 1999 AND 2003)
    AND F.POS IN ('LF','CF','RF')
    AND F.G >= 81
ORDER BY playerName, F.yearID

(Note the backticks around the 2B and 3B columns; identifiers that begin with a digit can trip up MySQL's parser in qualified names, so quoting them is the safe choice.)
A lot is going on in this query. Most notable is the fact that we'll be restricting our attention to players who played defensively in at least 81 games as an LF, CF, or RF for any season between 1999 and 2003. Because 81 games is exactly half of a full 162-game season, you'll be forgiven if you assume I chose this cutoff to limit the analysis to part- to full-time players. However, the real reason I chose this cutoff is to significantly reduce the likelihood that the same player will have two records for the same season on different teams or at different positions. Let's say the cutoff was set to at least 40 games. We would then have two records for Moises Alou for 2000: 59 games at LF and 64 games at RF. So what's wrong with that? Nothing, except the batting statistics and salary data are at the player/year/team level. That is, we don't have the data to know what Moises batted in the games in which he played LF versus RF; we know only what his final stats were for the whole season. The same is true for salary. Plus, I'm only interested in comparing players who mostly play the same position as one another. Somebody who used to play infield might claim outfield is outfield, but having played it myself a little growing up, I know there's a big difference between LF, CF, and RF. So, this analysis is constrained to outfielders who played at least 81 games in a season at any one of the three outfield positions. I understand this systematically eliminates utility players and very versatile players, but I had to do it.
Another thing you’ll notice in the query is that we’re calculating singles (H-2B-3B-HR) now, rather than after we import the data into R, and that an updated ID is being constructed that takes the season into account. This will be useful for labeling and merging purposes later on. The combination of playerID and yearID should be unique (that is, we don’t want more than one record for any one player for a given season), so I simply combine these into a new variable called ID.
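Incidentally, the same derived columns are easy to build in R after the import if you'd rather keep the SQL lean. Here's a small self-contained sketch; the two hand-entered 2001 batting lines are for illustration only:

```r
# Deriving singles (X1B) and a unique player/season ID in R rather than SQL.
# The tiny data frame below is hand-built for illustration only.
raw <- data.frame(playerID = c("bondsba01", "sosasa01"),
                  yearID   = c(2001, 2001),
                  H        = c(156, 189),
                  X2B      = c(32, 34),
                  X3B      = c(2, 5),
                  HR       = c(73, 64))

raw$X1B <- raw$H - raw$X2B - raw$X3B - raw$HR          # singles = H - 2B - 3B - HR
raw$ID  <- paste(raw$playerID, raw$yearID, sep = "|")  # mirrors CONCAT(...,'|',...)

raw[, c("ID", "X1B")]
```

Doing it in SQL simply saves a round trip, but either way works.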
Now that we have defined a SQL query for getting the data, we need to import it:
> # LOAD THE RODBC LIBRARY
> library(RODBC)
>
> # READ IN SQL QUERY FROM EXTERNAL FILE AND REMOVE ALL EXTRA SPACES
> queryFile <- "C:/baseball_salaries/get_data.sql"
> sql <- paste(readLines(queryFile), collapse=" ")
> sql <- gsub("[[:space:]]{2,}", " ", sql)
>
> # OPEN CONNECTION TO MYSQL, RUN QUERY, AND IMPORT INTO DATA FRAME
> channel <- odbcConnect(dsn="bbdatabank")
> bbdata <- sqlQuery(channel, sql)
> close(channel)
If all went well, the results of the query will exist as a data frame named bbdata. To verify what was imported, we count the records and summarize the data set:
> length(bbdata$ID)
[1] 357
> summary(bbdata)
This code yields the results in Figure 6-11. This is a very quick overview of everything we will be analyzing. Notice that no ID value appears more than once; this is good! The distribution of CF, LF, and RF is pretty equal, though there are slightly more CFs (perhaps center fielders are less likely to platoon with other players or to play other positions). Also notice that slightly more NL players qualify than AL players, and notice the salary distribution. Half of all of these outfielders earned at least $3,416,667 per season, and half earned less. This breakpoint is a little higher than for players in general, which is to be expected because they're mostly full-time players, but it doesn't really matter because, remember, we'll be comparing them to one another.
The next step in the process is to take the basic input variables we've imported and are quite familiar with—such as HR, RBI, and SB—and attempt to explain what accounts for them. A common statistical method for doing this sort of thing is called factor analysis (see http://www2.chass.ncsu.edu/garson/pa765/factor.htm), and R makes it very easy to perform this procedure. In a nutshell, when variables are highly correlated with each other, it's logical to assume that some underlying construct might be responsible for the correlations that you are observing. For example, a high correlation between the number of triples and the number of stolen bases is likely due to an attribute called speed. Of course, just because two variables have a strong correlation doesn't mean one causes the other, or even that they're meaningfully related. (Have you ever heard of the "Super Bowl versus stock market" correlation? See http://www.sciencenews.org/articles/20000701/mathtrek.asp for more information.)
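To make the idea of an underlying construct concrete, here is a tiny simulation (all numbers invented): a single latent "speed" value drives both triples and stolen bases, so the two observed variables come out correlated even though neither causes the other:

```r
# A latent "speed" attribute generates both observed variables, inducing
# a correlation between them. Purely illustrative simulated data.
set.seed(1)
speed   <- rnorm(500)                              # latent trait, never observed
triples <- 2  + 0.8 * speed + rnorm(500, sd = 0.6)
sb      <- 10 + 5.0 * speed + rnorm(500, sd = 4.0)

cor(triples, sb)   # clearly positive, driven entirely by the shared factor
```

Factor analysis works this logic in reverse: it starts from the observed correlations and tries to recover the latent attributes.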
Let’s look at the correlations of the input variables we’re interested in:
> X <- subset(bbdata[,11:22])
> round(cor(X), digits=2)
       X1B   X2B   X3B    HR    SB    BB    SO  GIDP   PO    A    E   DP
X1B   1.00  0.50  0.31  0.06  0.42  0.04 -0.03  0.33 0.49 0.21 0.06 0.10
X2B   0.50  1.00  0.05  0.45 -0.01  0.29  0.25  0.45 0.34 0.30 0.13 0.12
X3B   0.31  0.05  1.00 -0.15  0.47 -0.05 -0.04 -0.19 0.31 0.18 0.05 0.14
HR    0.06  0.45 -0.15  1.00 -0.26  0.59  0.46  0.34 0.23 0.25 0.27 0.12
SB    0.42 -0.01  0.47 -0.26  1.00 -0.01 -0.02 -0.22 0.31 0.01 0.03 0.07
BB    0.04  0.29 -0.05  0.59 -0.01  1.00  0.32  0.08 0.13 0.13 0.18 0.07
SO   -0.03  0.25 -0.04  0.46 -0.02  0.32  1.00  0.11 0.34 0.21 0.22 0.12
GIDP  0.33  0.45 -0.19  0.34 -0.22  0.08  0.11  1.00 0.19 0.20 0.18 0.06
PO    0.49  0.34  0.31  0.23  0.31  0.13  0.34  0.19 1.00 0.33 0.14 0.29
A     0.21  0.30  0.18  0.25  0.01  0.13  0.21  0.20 0.33 1.00 0.30 0.58
E     0.06  0.13  0.05  0.27  0.03  0.18  0.22  0.18 0.14 0.30 1.00 0.13
DP    0.10  0.12  0.14  0.12  0.07  0.07  0.12  0.06 0.29 0.58 0.13 1.00
As expected, 3B and SB are pretty well correlated with each other (r = 0.47), and SB and GIDP are negatively correlated (r = -0.22); that is, players with high SB totals tend to have hit into fewer double plays, and vice versa.
R’s function for performing a factor analysis, factanal( ), takes a matrix of numeric values you feed it, computes their correlations, and then attempts to optimize the maximum likelihood that those observed correlations came from the specified number of factors that you say they did. And herein lies the art. Knowing how many factors to suggest and whether they make sense is very much an iterative and subjective process (for more background on this fascinating statistical technique, search the Web for “factor analysis”). Applying this technique to our data yields the following:
> far <- factanal(~X1B+X2B+X3B+HR+SB+BB+SO+GIDP+PO+A+E+DP,
+                 data=bbdata, factors=5, scores="regression")
> far$n.obs
[1] 357
> print(far)

Call:
factanal(x = ~X1B + X2B + X3B + HR + SB + BB + SO + GIDP + PO +
    A + E + DP, factors = 5, data = bbdata, scores = "regression")

Uniquenesses:
  X1B   X2B   X3B    HR    SB    BB    SO  GIDP    PO     A     E    DP
0.094 0.465 0.623 0.193 0.377 0.488 0.636 0.508 0.005 0.005 0.858 0.631

Loadings:
     Factor1 Factor2 Factor3 Factor4 Factor5
X1B           0.784   0.526           0.101
X2B   0.336   0.629           0.144
X3B                   0.581   0.160   0.105
HR    0.807   0.272  -0.252   0.105
SB                    0.784
BB    0.709
SO    0.506                   0.147   0.288
GIDP  0.102   0.610  -0.301   0.101
PO    0.177   0.285   0.349   0.224   0.843
A     0.132   0.182           0.970
E     0.268                   0.253
DP                            0.581   0.136

               Factor1 Factor2 Factor3 Factor4 Factor5
SS loadings      1.666   1.583   1.521   1.484   0.864
Proportion Var   0.139   0.132   0.127   0.124   0.072
Cumulative Var   0.139   0.271   0.397   0.521   0.593

Test of the hypothesis that 5 factors are sufficient.
The chi square statistic is 30.04 on 16 degrees of freedom.
The p-value is 0.0178
You're looking to see that the variables that "load" onto the different factors make sense and belong together; the higher the absolute value of the loading (which ranges from −1 to +1), the more prominent a role that variable plays in the factor. So, it's fairly obvious that Factor1 equals "power," Factor2 equals "ability to make contact," and Factor3 equals "speed." Factor4 is a little harder to interpret. Is it "range" or "arm strength" that accounts for a high number of assists and double plays? Since PO and E play a small part, it might be "range." But the very high A and DP imply "arm." For now, let's simply call it "defensive prowess." And what about Factor5? Hmmm. I'm hesitant to call it "glove ability" despite the very high PO loading because there's no negative loading for E, as you would expect. Plus, some outfielders simply see more fly balls, depending on their team's pitchers. So, let's call it "fly ball opportunities."

Notice that the "cumulative variance" of these five factors together accounts for about 59.3% of the total variability seen in the original 12 variables, so we are losing some information. You could ask for more factors to preserve more information (and the p-value < 0.05 on the chi-square test suggests we probably should), but it's a balancing act: there are diminishing returns as factors are added, and the whole point is to reduce the dimensionality of the input variables. We'll go with these five for now and see what happens.

Another thing to look at is the "uniqueness" of each variable (which ranges from 0 to 1). The higher the value, the more the variable "stands on its own," not correlating strongly enough with any other variables to suggest a common factor. For example, E's uniqueness is quite high (0.858), which makes sense.
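One handy identity: a variable's uniqueness is 1 minus its communality, the sum of its squared loadings across the factors. You can check this on any factanal() fit; the sketch below does so on simulated data with invented variable names (a and b share one latent factor, c and d another, and e is mostly noise, so its uniqueness comes out high):

```r
# Verifying that uniqueness = 1 - communality for a factanal() fit.
# Simulated data: two latent factors plus one near-noise variable.
set.seed(42)
f1 <- rnorm(300)
f2 <- rnorm(300)
dat <- data.frame(a = f1 + rnorm(300, sd = 0.4),
                  b = f1 + rnorm(300, sd = 0.4),
                  c = f2 + rnorm(300, sd = 0.4),
                  d = f2 + rnorm(300, sd = 0.4),
                  e = rnorm(300))                  # little shared variance

fa <- factanal(dat, factors = 2)
communality <- rowSums(fa$loadings^2)              # sum of squared loadings
round(cbind(uniqueness = fa$uniquenesses, one.minus.comm = 1 - communality), 3)
```

The two columns should agree to within numerical tolerance, and the noise variable's uniqueness should be far higher than the others'.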
In our call to factanal( ), we asked it to compute factor scores for the five factors. This basically multiplies the factor loadings with the players’ values for each of the original variables and sums them up for each factor. Then R internally standardizes each factor based on the mean and standard deviation for that factor. The end result is five standardized factor scores for each player, where 0 suggests completely average, +2 suggests two standard deviations above the average, and −2 suggests two standard deviations below the average. So, let’s label these factors and then append the scores onto our data set:
> colnames(far$scores) <- c("power","contact","speed","defense","flyball") > bbdata <- data.frame(bbdata, far$scores)
Now, let’s test this out. Which players have had seasons at least two standard deviations above average on “power”?
> subset(bbdata[,c('playerName','yearID','POS','AB','HR','power')],
+        power >= 2)
        playerName yearID POS  AB HR    power
40    Bonds, Barry   2000  LF 480 49 2.238885
41    Bonds, Barry   2001  LF 476 73 4.420423
42    Bonds, Barry   2002  LF 403 46 3.033495
43    Bonds, Barry   2003  LF 390 45 2.190994
131 Gonzalez, Luis   2001  LF 609 57 2.121822
305    Sosa, Sammy   1999  RF 625 63 2.644626
306    Sosa, Sammy   2000  RF 604 50 2.261168
307    Sosa, Sammy   2001  RF 577 64 3.205626
No surprises here! What about the players who scored lowest on “defense”?
> subset(bbdata[,c('playerName','yearID','POS','POS_G','PO',
+                  'A','E','DP','defense')], defense <= -1.5)
           playerName yearID POS POS_G  PO A  E DP   defense
13    Anderson, Brady   2000  CF    88 230 1  1  1 -1.604664
149  Grissom, Marquis   1999  CF   149 374 1  5  2 -1.768149
162 Henderson, Rickey   1999  LF   116 167 0  2  0 -1.869917
216      Lawton, Matt   2000  RF    83 163 2  3  0 -1.508250
258     O'Neill, Paul   2001  RF   130 210 1  4  0 -1.615759
289     Sanchez, Alex   2002  CF    86 234 0  5  0 -1.648747
306       Sosa, Sammy   2000  RF   156 316 3 10  1 -1.547480
333     Wells, Vernon   2003  CF   161 383 3  4  0 -1.519429
342  Williams, Bernie   2000  CF   137 353 2  0  1 -1.523878
344  Williams, Bernie   2002  CF   147 350 2  5  1 -1.769403
No great surprises here either. These players had seasons with very low assist and double-play totals relative to putouts, and some fairly high error counts.
We could go ahead and predict player salaries based on their factor scores right now, but this isn't quite fair because not all outfielders are of the same ilk. Some are power hitters who play shaky defense, and others are fleet-footed and cover a lot of ground but are not going to hit towering 475-foot home runs. So, we shall attempt to "classify" our players into groups based on the similarity of their factor scores. One fairly simple method for doing this is the k-means clustering procedure (see http://www.statsoft.com/textbook/stcluan.html#k). This procedure creates as many groups as you choose and then attempts to assign all of the observations to these groups, seeking minimal variability within clusters and maximal variability between clusters. We'll try four groups and see what happens. Note that I first "seed" the algorithm with initial cluster centers. Because I have no real way of knowing where they'll end up, I'll simply choose the first four observations arbitrarily. Then, I'll allow the algorithm up to 10,000 iterations for moving observations around to find the best fit (think Tower of Hanoi game).
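To see the mechanics of kmeans() in isolation before running it on the factor scores, here is a minimal sketch on simulated two-dimensional data, seeded with explicit starting centers (one row of the data per requested cluster) in the same style as passing x[1:4,]:

```r
# Minimal kmeans() demonstration on two well-separated simulated clouds,
# seeded with one starting center drawn from each cloud.
set.seed(7)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # 50 points near (0, 0)
             matrix(rnorm(100, mean = 5), ncol = 2))   # 50 points near (5, 5)

km <- kmeans(toy, centers = toy[c(1, 51), ], iter.max = 100)

km$size      # roughly 50 observations per cluster
km$centers   # near (0, 0) and (5, 5)
```

With well-separated clouds the starting centers barely matter; with real, overlapping data like ours, different seeds can produce different final clusterings, which is why the seeding choice is worth stating explicitly.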
> x <- subset(bbdata[,c('power','contact','speed','defense','flyball')])
> k <- kmeans(x, centers=x[1:4,], iter.max=10^4)
> k$size
[1]  59  98  67 133
> k$centers
       power     contact      speed    defense    flyball
1 -0.4996645  0.51392727  1.2983112 -0.2806153  0.4546846
2  0.1314219  0.57052991 -0.2257822  0.9370217 -0.3982199
3  1.0818315 -0.03571007 -0.1355412 -0.2216914  1.0032891
4 -0.4201643 -0.63038395 -0.3412965 -0.4542744 -0.4136933
This clusters our 357 players into four groups based on the factors we found earlier. It suggests that 59 players belong to Group 1, 98 players belong to Group 2, and so on. R did all of the work. All we need to do now is attach meaningful labels to these groups and see if they make sense. Looking at the mean factor scores (a.k.a. cluster centers) for each group, it looks as though Group 1 consists of players with below-average power but great speed and who make good contact. They also play below-average defense, yet they record a high number of putouts. Perhaps they are leadoff-type hitters. Group 2 looks to be composed of contact hitters with average pop, slightly below-average speed, and the ability to play decent defense, despite not seeing a lot of fly balls. Group 3 comprises your typical power hitters with slightly below-average contact, speed, and defense. But they do catch a lot of fly balls, which is interesting; we'll see more about these players later. Finally, Group 4 consists of players who are below average all around. It also happens to be the largest group, so perhaps these are journeymen or platoon players.

It's also worth noting that we've taken some statistical liberties in this analysis, because an extremely good player (like Barry Bonds) can account for up to five different seasons and can really raise the bar for other player-seasons. If you really wanted to be rigorous, you'd probably want to average each player's seasons so that each player appears only once. But what would you do about the fact that a player might play LF one season and RF another? Or that his salary or team might change?
Now we want to append group membership to our data set. This is easy enough:
> bbdata$cluster <- k$cluster
Let’s take a quick look at the breakdown of clustered group versus position to see if it fits with these conjectures:
> table(bbdata$cluster, bbdata$POS)

     CF LF RF
  1  40 10  9
  2  12 42 44
  3  40 12 15
  4  44 42 47
Recall that Group 1 comprised the fleet-footed contact hitters (leadoff types) who recorded a fair number of putouts but scored slightly below average on "defense." Generally, center fielders are considered the best defensive outfielders, so there's probably too much thrown into that one "defense" factor (more proof that we didn't ask for enough factors). Group 2 consisted of contact hitters with a decent defense score despite fewer putouts. Group 3 was composed of the power hitters with average contact, speed, and defense, but a lot of putouts. I'm really surprised to see that this group is full of CF players and has few LF and RF players. Finally, Group 4 consisted of the below-average all-around players, so it makes sense that it's pretty equally distributed across all three outfield positions. Why should there be any more or fewer all-around LF than RF players, for example? It's good to see this isn't the case.
Before we can accurately predict salaries for our players, we must keep in mind that salaries are not normally distributed. Specifically, taking a quick look at the distribution of salaries yields a very skewed picture, as shown in Figure 6-12.
> hist(bbdata$salary)
Let’s transform the salaries into something that looks much less skewed. Let’s take the natural log of salary and use it. The results are shown in Figure 6-13.
> bbdata$salary.log <- log(bbdata$salary) > hist(bbdata$salary.log)
The time has come to predict players' salaries! We will run a linear regression model for each of our four groups of players, using our factor scores, along with player age, to estimate salaries. Because we're actually modeling log(salary), we'll then back-transform the predicted values by applying exp() to the fitted values, putting the predictions back on the original dollar scale. Finally, we'll stack all the subsets back into one big data set.
> # BREAK UP MAIN DATASET INTO SUBSETS
> bbdata1 <- subset(bbdata, cluster==1)
> bbdata2 <- subset(bbdata, cluster==2)
> bbdata3 <- subset(bbdata, cluster==3)
> bbdata4 <- subset(bbdata, cluster==4)
> # RUN REGRESSION MODELS
> lm1 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata1)
> lm2 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata2)
> lm3 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata3)
> lm4 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata4)
> # UNDO TRANSFORMATION OF PREDICTED LOG(SALARIES)
> bbdata1$predicted <- exp(lm1$fitted.values)
> bbdata2$predicted <- exp(lm2$fitted.values)
> bbdata3$predicted <- exp(lm3$fitted.values)
> bbdata4$predicted <- exp(lm4$fitted.values)
> # STACK SUBSETS BACK TOGETHER TO FORM REAL DATASET
> bbdata <- data.frame(rbind(bbdata1,bbdata2,bbdata3,bbdata4))
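For what it's worth, the subset/model/recombine pattern can also be written more compactly with split() and lapply(). The sketch below shows the idiom on simulated data whose column names mimic bbdata (the values are invented, and only two predictors are used to keep it short):

```r
# One regression per cluster via split()/lapply() instead of four explicit
# subsets. Simulated stand-in data; only age and power as predictors here.
set.seed(3)
d <- data.frame(cluster = rep(1:4, each = 25),
                age     = sample(22:38, 100, replace = TRUE),
                power   = rnorm(100))
d$salary.log <- 13 + 0.05 * d$age + 0.4 * d$power + rnorm(100, sd = 0.3)

fit_one <- function(sub) {
  m <- lm(salary.log ~ age + power, data = sub)
  sub$predicted <- exp(fitted(m))   # undo the log transform
  sub
}
d <- do.call(rbind, lapply(split(d, d$cluster), fit_one))

nrow(d)   # still 100 rows, now carrying a predicted column
```

The one-model-per-group logic lives in a single function, so changing the formula means editing one line instead of four.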
Glossing over all the statistical details, such as investigating how valid these regression models actually appear to be, discussing what statistical assumptions have been violated, etc., we’ll simply plot the predicted salaries against the actual salaries to see how well we did, and we’ll use the players’ group or cluster membership as the plotting points:
> plot(salary~predicted, data=bbdata, col=cluster, pch=as.character(cluster))
> abline(lm(salary~predicted, data=bbdata))
As you can see in Figure 6-14, the points above the line represent players who have an actual salary higher than predicted, whereas the points below the line represent players who have an actual salary lower than predicted.
Ideally, all the points would lie on that regression line. What is the correlation between predicted and actual? We would hope it would be “high.”
> r <- cor(bbdata$predicted, bbdata$salary)
> r; r^2
[1] 0.5861738
[1] 0.3435997
Again, it's OK but not great. An R2 value of .344 means that 34.4% of the observed variability in actual salaries can be accounted for by our input variables of age, group membership, and factor scores. Not surprisingly, many more variables than the ones we've chosen also influence players' actual salaries.
Let’s compute the difference between actual and predicted salaries and save the calculation for each player. We’ll then create a box plot, by group/cluster number, to see how it looks (see Figure 6-15).
> bbdata$diff <- bbdata$salary - bbdata$predicted
> for(i in 1:4) {
+     print(summary(subset(bbdata[,'diff'], bbdata$cluster==i)))
+ }
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 -8215000   -559900    -16900    477900   2058000   5845000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-15080000   -697800    181300    544600   2486000  12040000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-17840000  -1099000    339000    553900   2257000   9574000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 -9021000   -598800    109700    873400   2391000  11110000
> boxplot(diff~cluster, data=bbdata)
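As an aside, an explicit loop over clusters like this is a natural fit for tapply(), which applies a function to the pieces of a vector defined by a grouping factor. A self-contained sketch on simulated salary differences:

```r
# summary() of the differences within each cluster in one tapply() call,
# replacing an explicit for loop. Simulated data for illustration.
set.seed(9)
d <- data.frame(cluster = sample(1:4, 200, replace = TRUE),
                diff    = rnorm(200, mean = 5e5, sd = 2e6))

by_cluster <- tapply(d$diff, d$cluster, summary)
by_cluster[["1"]]   # five-number summary plus mean for cluster 1
```

The result is a list keyed by cluster number, which is convenient for further programmatic use.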
On average, we’ve undershot actual salary by about $500,000. But this could be largely due to a few extremely highly paid players “underperforming,” which throws the average off for everyone. Looking at the median values, it’s not quite so dire. We’re only about $200,000 under for 50% of the players. And for Group 1, we’ve actually overestimated salary slightly for 50% of its players.
It's easy enough to find which players have the largest discrepancies between actual and predicted salaries, but let's first standardize the differences so that everybody is on equal footing, which also removes some of the overall actual-minus-predicted bias we just saw:
> bbdata$stdiff = (bbdata$diff - mean(bbdata$diff)) / sd(bbdata$diff)
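Incidentally, this hand-rolled standardization is exactly what R's built-in scale() computes (center, then divide by the standard deviation); a quick check on simulated values:

```r
# (x - mean(x)) / sd(x) matches scale() with its default arguments.
set.seed(5)
diffs   <- rnorm(50, mean = 4e5, sd = 1.5e6)
by_hand <- (diffs - mean(diffs)) / sd(diffs)
scaled  <- as.numeric(scale(diffs))   # drop the matrix wrapper

all.equal(by_hand, scaled)   # TRUE
```

Writing it out by hand, as above, just makes the arithmetic explicit.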
Let’s find the most relatively underpaid players; that is, those at least two standard deviations below the average (actual minus predicted) difference:
> underpaid <- subset(bbdata[,c('playerName','yearID','age','POS','G',
+                               'cluster','salary','predicted','stdiff')],
+                     stdiff <= -2)
> underpaid[sort.list(underpaid$stdiff),]
           playerName yearID age POS   G cluster   salary predicted    stdiff
41       Bonds, Barry   2001  37  LF 153       3 10300000  28142677 -5.235698
111     Finley, Steve   2003  38  CF 147       2  4250000  19334449 -4.455110
133    Gonzalez, Luis   2003  36  LF 156       2  4000000  16800508 -3.808747
36    Bichette, Dante   1999  36  LF 151       2  7250000  19185633 -3.563984
163 Henderson, Rickey   2001  43  LF 123       4   300000   9321189 -2.739187
131    Gonzalez, Luis   2001  34  LF 162       3  4833333  13448594 -2.624308
152  Grissom, Marquis   2003  36  CF 149       1  1875000  10090267 -2.511108
328     Walker, Larry   2001  35  RF 142       2 12166667  19884406 -2.370306
162 Henderson, Rickey   1999  41  LF 121       4  1900000   9341312 -2.292076
316     Surhoff, B.J.   1999  35  LF 162       2  3705516  10697211 -2.164833
It’s nice to see that this list contains players from each of the four identified clusters. It’s interesting that all of them are older players who mostly play LF. Of course, Barry Bonds’ 2001 campaign in which he broke the single-season HR and SLG records is the biggest bargain, even if he was paid $10 million, given his all-around performance relative to other players in the “slugger” cluster. It’s interesting that Luis Gonzalez makes it twice—very underappreciated.
At the other end of the spectrum, we have players who were actually paid more than expected, given their relative performance:
> overpaid <- subset(bbdata[,c('playerName','yearID','age','POS','G',
+                              'cluster','salary','predicted','stdiff')],
+                    stdiff >= 2)
> overpaid[sort.list(overpaid$stdiff, decreasing=TRUE),]
            playerName yearID age POS   G cluster   salary predicted   stdiff
279     Ramirez, Manny   2003  31  LF 154       2 20000000   7963658 3.220166
140       Green, Shawn   2003  31  RF 160       2 15666667   3818053 3.167039
244      Mondesi, Raul   2003  32  RF  98       4 13000000   1890954 2.957739
157 Guerrero, Vladimir   2003  27  RF 112       2 11500000   1250975 2.714350
148       Griffey, Ken   2001  32  CF 111       4 12500000   2454218 2.656831
191      Jones, Andruw   2003  26  CF 156       3 12000000   2425791 2.523375
173   Higginson, Bobby   2003  33  RF 130       4 11850000   2852073 2.360285
309        Sosa, Sammy   2003  35  RF 137       3 16000000   7435686 2.237571
138       Green, Shawn   2001  29  RF 161       3 12166667   3607110 2.236225
242      Mondesi, Raul   2000  29  RF  96       4 10000000   1521948 2.213158
190      Jones, Andruw   2002  25  CF 154       3 10000000   2231977 2.012218
There are a lot of big salaries here, and they just didn’t stand out enough from other players on the various factors.
I covered a lot in this hack, and yet, I intentionally skimmed over a lot of the underlying statistical theory to help keep the hack reasonably short. But I hope it piqued your interest to perform similar analyses using R and to learn more about multivariate statistical techniques, especially as they might be applied to baseball stats.
You can get "tighter" predictions if you ask for more than five factors (remember, using five retained only 59% of the variability), though you would need to reinterpret all of them because there's no guarantee the five existing ones would stay the same. Varying the number of clusters used to group players could help fine-tune the predictions as well.
This hack could certainly be applicable to other positions, as well. Obviously, what is valued in an outfielder is different from, say, a shortstop or a catcher. But it gets a little thorny if you consider some players are versatile and play a variety of positions over the course of a season (and certainly over their careers). So perhaps identify utility players in a class by themselves.
It would be interesting to see what mix of skills was most valued in 1985 versus 1995. Or, if you can find the salary data, you might want to look back even further.
The variables included in this hack only scratch the surface for drawing a “complete” picture of a player’s worth. It would be much more interesting to include many more variables, such as what side of the plate he bats on; his experience, height, and weight; how good his team is; the home park he plays in; and his ground-ball-to-fly-ball tendency, just to name a few.
—Tom Dierickx