Top 10 Bargain Outfielders

Use R and the Baseball DataBank database in MySQL to statistically identify the 10 most underpaid outfielders for the years 1999–2003.

The most common conversation between any two baseball fans (after arguing about whose favorite team is better, of course) is which player is the best in the game. Baseball, more than any other sport, it seems, is obsessed with ranking players. Certainly, the fact that every player gets a turn at bat plays into the perception that such a thing should be a “simple” matter. The debate that ensues after the proper all-time pecking order fails to materialize is over how high all players’ salaries have become. Of course, both parties agree on that and then they’re friends again. But the real truth is that ranking players is highly subjective, and the distribution of player salaries is highly skewed. For every A-Rod making $22 million, there are 100 younger players making a “scant” $300,000.

So, I won’t even try to claim who the “best” players are; rather, I will simply use R to help identify which players have “similar” attributes to one another and then attempt to predict what their salaries should be, based on those attributes. Players paid much less than expected given their similarity to other players are considered “undervalued.” Likewise, players paid much more than expected are considered “overvalued.” To help simplify things, I look only at outfielders for the years 1999–2003.

(This analysis isn’t completely fair. Major league players aren’t free agents until they’ve played for six years, so their salaries aren’t set by the market. For the first three years, their salaries are set by their original contracts and are often set at the major league minimum. For the next three years, players are eligible for arbitration, and their salaries are set through negotiations with their teams. But even if the analysis isn’t totally fair, it’s still interesting to look for bargains.)

So how do you perform such an analysis using R? It’s actually pretty simple, as I will show you in this hack. The difficult part is making sure everything makes sense and is being interpreted properly. It’s really more art than science, but here are the basic steps:

  1. Import raw data from MySQL into R.

  2. Compress input variables to a handful of common attributes (such as speed, power, etc.).

  3. Assign scores to all players based on these attributes.

  4. Group players on these attributes based on their similarity to one another.

  5. Predict salaries based on group membership and a few other input variables.

  6. Look at actual minus predicted salaries.

Let’s get started!

The Code

Prerequisites.

It is assumed that you have the Baseball DataBank database loaded in MySQL [Hack #20] and that you can access this data from R [Hack #33] .

Define SQL query for getting raw data.

The first thing you need is a SQL query to define what raw data to get from the database. It makes sense to let R pull the data using a query because multiple tables will be joined together and we can set our criteria there. Moreover, I prefer to keep this SQL query in a separate text file because it’s easier to maintain. So, use Notepad to save the following SQL query as a simple text file (in C:\baseball_salaries\get_data.sql):

	SELECT
	 CONCAT(M.playerID,'|',F.yearID) AS ID,
	 CONCAT(M.nameLast, ', ', M.nameFirst) AS playerName,
	 F.yearID,
	 (F.yearID - M.birthYear) AS age,
	 F.lgID,
	 F.teamID,
	 F.POS,
	 F.G AS POS_G,
	 B.G AS G,
	 B.AB,
	 (B.H - B.`2B` - B.`3B` - B.HR) AS X1B,
	 B.`2B` AS X2B,
	 B.`3B` AS X3B,
	 B.HR,
	 B.SB,
	 B.BB,
	 B.SO,
	 B.GIDP,
	 F.PO,
	 F.A,
	 F.E,
	 F.DP,
	 S.salary

	FROM
	 master AS M

	INNER JOIN
	 fielding AS F
	 ON
	  (M.playerID = F.playerID)

	INNER JOIN
	 batting AS B
	 ON
	  (F.yearID = B.yearID) AND
	  (F.playerID = B.playerID) AND
	  (F.teamID = B.teamID)

	INNER JOIN
	 salaries AS S
	 ON
	  (F.yearID = S.yearID) AND
	  (F.playerID = S.playerID) AND
	  (F.teamID = S.teamID)

	WHERE
	 (F.yearID BETWEEN 1999 AND 2003) AND
	 F.POS IN ('LF','CF','RF') AND
	 F.G >= 81

	ORDER BY
	  playerName,
	  F.yearID

A lot is going on in this query. Most notable is the fact that we’ll be restricting our attention to players who played defensively in at least 81 games as an LF, CF, or RF for any season between 1999 and 2003. Because 81 games is exactly half of a full 162-game season, you’ll be forgiven if you assume I chose this cutoff to limit the analysis to part- to full-time players. However, the real reason I chose this cutoff is to significantly reduce the likelihood that the same player will have two records for the same season on different teams or at different positions. Say the cutoff were set to at least 40 games. We would then have two records for Moises Alou for 2000: 59 games at LF and 64 games at RF. So what’s wrong with that? Nothing, except that the batting statistics and salary data are at the player/year/team level. That is, we don’t have the data to know what Moises batted in the games in which he played LF versus RF; we know only what his final stats were for the whole season. The same is true for salary.

Plus, I’m only interested in comparing players who mostly play the same position as one another. Somebody who used to play infield might claim outfield is outfield, but having played it myself a little growing up, I know there’s a big difference between LF, CF, and RF. So, this analysis is constrained to outfielders who played at least 81 games in a season at any one of the three outfield positions. I understand this systematically eliminates utility players and very versatile players, but I had to do it.

Another thing you’ll notice is that we’re calculating singles (H − 2B − 3B − HR) in the query itself, rather than after we import the data into R, and that an ID is being constructed that takes the season into account. This will be useful for labeling and merging purposes later on. The combination of playerID and yearID should be unique (that is, we don’t want more than one record for any one player in a given season), so I simply combine these into a new variable called ID.

Import data into R.

Now that we have defined a SQL query for getting the data, we need to import it:

	> # LOAD THE RODBC LIBRARY
	> library(RODBC)
	>
	> # READ IN SQL QUERY FROM EXTERNAL FILE AND REMOVE ALL EXTRA SPACES
	> queryFile <- "C:/baseball_salaries/get_data.sql"
	> sql <- paste(readLines(queryFile), collapse=" ")
	> sql <- gsub("[[:space:]]{2,}", " ", sql)
	>
	> # OPEN CONNECTION TO MYSQL, RUN QUERY, AND IMPORT INTO DATA FRAME
	> channel <- odbcConnect(dsn="bbdatabank")
	> bbdata <- sqlQuery(channel, sql)
	> close(channel)

If all went well, the results of the query will exist as a data frame named bbdata. To verify what was imported, we count the records and summarize the data set:

	> length(bbdata$ID)
	[1] 357
	> summary(bbdata)

This code yields the results in Figure 6-11. This is a very quick overview of everything we will be analyzing. Notice that no ID value appears more than once; this is good! The distribution of CF, LF, and RF is pretty even, though there are slightly more CF records (perhaps center fielders are less likely to platoon with other players or to play other positions). Also notice that slightly more NL players qualify than AL players, and notice the salary distribution. Half of these outfielders earned at least $3,416,667 per season, and half earned less. This breakpoint is a little higher than for players in general, which is to be expected because they’re mostly full-time players, but it doesn’t really matter because, remember, we’ll be comparing them to one another.

Figure 6-11. Summary of data set
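
If you’d rather verify the uniqueness claim programmatically, here is a quick sanity check of my own (not part of the original recipe):

	> # COUNT DUPLICATE IDS (SHOULD BE ZERO) AND CROSS-TAB
	> # POSITION BY LEAGUE
	> sum(duplicated(bbdata$ID))
	> table(bbdata$POS, bbdata$lgID)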

Running the Hack

Identify common attributes.

The next step in the process is to take the basic input variables we’ve imported and are quite familiar with—such as HR, SB, and BB—and attempt to explain what accounts for them. A common statistical method for doing this sort of thing is called factor analysis (see http://www2.chass.ncsu.edu/garson/pa765/factor.htm), and R makes it very easy to perform this procedure. In a nutshell, when variables are highly correlated with each other, it’s logical to assume that some underlying construct might be responsible for the correlations you are observing. For example, a high correlation between the number of triples and the number of stolen bases is likely due to an attribute called speed. Of course, just because two variables have a strong correlation doesn’t mean they’re meaningfully related. (Have you ever heard of the “Super Bowl versus stock market” correlation? See http://www.sciencenews.org/articles/20000701/mathtrek.asp for more information.)

Look at correlations.

Let’s look at the correlations of the input variables we’re interested in:

	> X <- subset(bbdata[,11:22])
	> round(cor(X),digits=2)
	       X1B   X2B   X3B    HR    SB    BB     SO  GIDP    PO    A      E    DP
	X1B   1.00  0.50  0.31  0.06  0.42  0.04  -0.03  0.33  0.49  0.21  0.06  0.10
	X2B   0.50  1.00  0.05  0.45 -0.01  0.29   0.25  0.45  0.34  0.30  0.13  0.12
	X3B   0.31  0.05  1.00 -0.15  0.47 -0.05  -0.04 -0.19  0.31  0.18  0.05  0.14
	HR    0.06  0.45 -0.15  1.00 -0.26  0.59   0.46  0.34  0.23  0.25  0.27  0.12
	SB    0.42 -0.01  0.47 -0.26  1.00 -0.01  -0.02 -0.22  0.31  0.01  0.03  0.07
	BB    0.04  0.29 -0.05  0.59 -0.01  1.00   0.32  0.08  0.13  0.13  0.18  0.07
	SO   -0.03  0.25 -0.04  0.46 -0.02  0.32   1.00  0.11  0.34  0.21  0.22  0.12
	GIDP  0.33  0.45 -0.19  0.34 -0.22  0.08   0.11  1.00  0.19  0.20  0.18  0.06
	PO    0.49  0.34  0.31  0.23  0.31  0.13   0.34  0.19  1.00  0.33  0.14  0.29
	A     0.21  0.30  0.18  0.25  0.01  0.13   0.21  0.20  0.33  1.00  0.30  0.58
	E     0.06  0.13  0.05  0.27  0.03  0.18   0.22  0.18  0.14  0.30  1.00  0.13
	DP    0.10  0.12  0.14  0.12  0.07  0.07   0.12  0.06  0.29  0.58  0.13  1.00

As expected, 3B and SB are pretty well correlated with each other (r = 0.47), and SB and GIDP are negatively correlated (r = -0.22); that is, players with high SB totals tend to have hit into fewer double plays, and vice versa.
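
If you want to go a step further and check whether an individual correlation is statistically distinguishable from zero (my addition; the hack itself doesn’t require it), cor.test( ) will report a p-value and confidence interval:

	> # TEST THE TRIPLES VS. STOLEN BASES CORRELATION
	> cor.test(X$X3B, X$SB)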

Identify possible explanations for correlations.

R’s function for performing a factor analysis, factanal( ), takes the matrix of numeric values you feed it, computes their correlations, and then uses maximum likelihood to fit a model in which those observed correlations arise from the number of factors you specify. And herein lies the art. Knowing how many factors to suggest and whether they make sense is very much an iterative and subjective process (for more background on this fascinating statistical technique, search the Web for “factor analysis”). Applying this technique to our data yields the following:

	> far <- factanal(~X1B+X2B+X3B+HR+SB+BB+SO+GIDP+PO+A+E+DP,
	+ data=bbdata, factors=5, scores="regression")
	> far$n.obs
	[1] 357
	> print(far)
	
	Call:
	factanal(x = ~X1B + X2B + X3B + HR + SB + BB + SO + GIDP + PO + A + E + DP, factors
	= 5, data = bbdata, scores = "regression")

	Uniquenesses:
	  X1B   X2B   X3B    HR    SB    BB    SO  GIDP     PO     A     E    DP
	0.094 0.465 0.623 0.193 0.377 0.488 0.636 0.508  0.005 0.005 0.858 0.631

	Loadings:
	     Factor1 Factor2 Factor3 Factor4 Factor5
	X1B          0.784   0.526   0.101
	X2B  0.336   0.629           0.144
	X3B          0.581   0.160   0.105
	HR   0.807   0.272  -0.252   0.105
	SB           0.784
	BB   0.709
	SO   0.506           0.147   0.288
	GIDP 0.102   0.610  -0.301           0.101
	PO   0.177   0.285   0.349   0.224   0.843
	A    0.132   0.182           0.970
	E    0.268           0.253
	DP                   0.581   0.136

	                Factor1 Factor2 Factor3 Factor4 Factor5
	SS loadings     1.666   1.583   1.521   1.484   0.864
	Proportion Var  0.139   0.132   0.127   0.124   0.072
	Cumulative Var  0.139   0.271   0.397   0.521   0.593

	Test of the hypothesis that 5 factors are sufficient.
	The chi square statistic is 30.04 on 16 degrees of freedom.
	The p-value is 0.0178

You’re looking to see whether the variables that “load” onto the different factors make sense and belong together; the higher the absolute value of the loading (which ranges from −1 to +1), the more prominent a role that variable plays in the factor. So, it’s fairly obvious that Factor1 equals “power,” Factor2 equals “ability to make contact,” and Factor3 equals “speed.” Factor4 is a little harder to interpret. Is it “range” or “arm strength” that accounts for a high number of assists and double plays? Since PO and E play a small part, it might be “range.” But the very high A and DP imply “arm.” For now, let’s simply call it “defensive prowess.” And what about Factor5? Hmmm. I’m hesitant to call it “glove ability” despite the very high PO loading because there’s no negative loading for E, as you would expect. Plus, some outfielders simply see more fly balls, depending on their team’s pitchers. So, let’s call it “fly ball opportunities.” Of course, in practice, you would look at more than five factors here.

Notice that the “cumulative variance” of these five factors together accounts for about 59.3% of the total variability seen in the original 12 variables, so we are losing some information. You could try using more factors to preserve more information, but it’s a balancing act: there are diminishing returns the more factors you include, and recall that the whole point is to reduce the dimensionality of the input variables (though the p-value < 0.05 suggests we should probably look at more factors). We’ll go with these five for now and see what happens. Another thing to look at is the “uniqueness” of each variable (which ranges from 0 to 1). The higher the value, the more the variable “stands on its own” and doesn’t correlate strongly enough with any other variables to suggest a common factor. For example, E is kind of high, which makes sense.
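
If you want a feel for how the fit changes with the number of factors, here is a rough sketch (my addition) that refits the model for several factor counts and prints each p-value; treat the p-values only as a guide, and re-examine the loadings whenever you change the count:

	> # REFIT WITH 3 TO 7 FACTORS AND COMPARE P-VALUES
	> for (nf in 3:7) {
	+  f <- factanal(~X1B+X2B+X3B+HR+SB+BB+SO+GIDP+PO+A+E+DP,
	+                data=bbdata, factors=nf)
	+  cat(nf, "factors: p-value =", round(f$PVAL, 4), "\n")
	+ }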

Assign attribute scores.

In our call to factanal( ), we asked it to compute factor scores for the five factors. This basically multiplies the factor loadings by the players’ values for each of the original variables and sums them up for each factor. R then standardizes each factor based on its mean and standard deviation. The end result is five standardized factor scores for each player, where 0 suggests completely average, +2 suggests two standard deviations above average, and −2 suggests two standard deviations below average. So, let’s label these factors and then append the scores onto our data set:

	> colnames(far$scores) <- c("power","contact","speed","defense","flyball")
	> bbdata <- data.frame(bbdata, far$scores)
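
As a quick sanity check (mine, not part of the original hack), the score columns should have means near 0; scores computed by the regression method typically have standard deviations a bit below 1:

	> # FACTOR SCORE MEANS (NEAR 0) AND STANDARD DEVIATIONS
	> round(colMeans(far$scores), 3)
	> round(apply(far$scores, 2, sd), 3)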

Now, let’s test this out. Which players have had seasons at least two standard deviations above average on “power”?

	> subset(bbdata[,c('playerName','yearID','POS','AB','HR','power')],
	+ power >= 2)
	    playerName    yearID  POS  AB HR  power
	40  Bonds, Barry    2000  LF  480 49  2.238885
	41  Bonds, Barry    2001  LF  476 73  4.420423
	42  Bonds, Barry    2002  LF  403 46  3.033495
	43  Bonds, Barry    2003  LF  390 45  2.190994
	131 Gonzalez, Luis  2001  LF  609 57  2.121822
	305 Sosa, Sammy     1999  RF  625 63  2.644626
	306 Sosa, Sammy     2000  RF  604 50  2.261168
	307 Sosa, Sammy     2001  RF  577 64  3.205626

No surprises here! What about the players who scored lowest on “defense”?

	> subset(bbdata[,c('playerName','yearID','POS','POS_G','PO',
	+ 'A','E','DP','defense')], defense <= -1.5)
	    playerName         yearID POS POS_G  PO   A   E   DP  defense
	13  Anderson, Brady    2000 CF    88     230   1   1    1  -1.604664
	149 Grissom, Marquis   1999 CF    149    374   1   5    2  -1.768149
	162 Henderson, Rickey  1999 LF    116    167   0   2    0  -1.869917
	216 Lawton, Matt       2000 RF    83     163   2   3    0  -1.508250
	258 O'Neill, Paul      2001 RF    130    210   1   4    0  -1.615759
	289 Sanchez, Alex      2002 CF    86     234   0   5    0  -1.648747
	306 Sosa, Sammy        2000 RF    156    316   3  10    1  -1.547480
	333 Wells, Vernon      2003 CF    161    383   3   4    0  -1.519429
	342 Williams, Bernie   2000 CF    137    353   2   0    1  -1.523878
	344 Williams, Bernie   2002 CF    147    350   2   5    1  -1.769403

No great surprises here either. These players had seasons with very low assist and double-play totals relative to their putouts, along with some fairly high error counts.

Group players based on similarity.

We could go ahead and predict player salaries based on their factor scores right now, but this wouldn’t be quite fair because not all outfielders are of the same ilk. Some are power hitters who play shaky defense, and others are fleet-footed and cover a lot of ground but are not going to hit towering 475-foot home runs. So, we shall attempt to “classify” our players into groups based on the similarity of their factor scores. One fairly simple method for doing this is the k-means clustering procedure (see http://www.statsoft.com/textbook/stcluan.html#k). This procedure creates as many groups as you choose and then attempts to cluster all of the observations into these groups, seeking minimal variability within clusters and maximal variability between clusters. We’ll try four groups and see what happens. Note that I first “seed” the algorithm with initial cluster centers. Because I have no real way of knowing how they’ll end up, I simply and arbitrarily choose the first four observations. Then, I allow the algorithm up to 10,000 iterations for moving observations around to find the best fit (think Tower of Hanoi game).

	> x <- subset(bbdata[,c('power','contact','speed','defense','flyball')])
	> k <- kmeans(x, centers=x[1:4,], iter.max = 10^4)
	> k$size
	[1] 59 98 67 133
	> k$centers
	       power     contact      speed    defense    flyball
	1 -0.4996645  0.51392727  1.2983112 -0.2806153  0.4546846
	2  0.1314219  0.57052991 -0.2257822  0.9370217 -0.3982199
	3  1.0818315 -0.03571007 -0.1355412 -0.2216914  1.0032891
	4 -0.4201643 -0.63038395 -0.3412965 -0.4542744 -0.4136933

This clusters our 357 players into four groups based on the factors we found earlier. It suggests that 59 players belong to Group 1, 98 players belong to Group 2, etc. R did all of the work. All we need to do now is attach meaningful labels to these groups and see if they make sense. Looking at the mean factor scores (a.k.a. cluster centers) for each group, it looks as though Group 1 consists of players with below-average power but great speed who make good contact. They also play below-average defense, yet they record a high number of putouts. Perhaps they are leadoff-type hitters. Group 2 looks to be composed of contact hitters with average pop, slightly below-average speed, and the ability to play decent defense, despite not seeing a lot of fly balls. Group 3 comprises your typical power hitters with slightly below-average contact, speed, and defense. But they do catch a lot of fly balls—interesting. We’ll see more about these players later. Finally, Group 4 consists of players who are below average all around. It also happens to be the largest group. So, perhaps these are journeymen or platoon players.

It’s also worth noting that we’ve taken some statistical liberties in this analysis, because an extremely good player (like Barry Bonds) can count for up to five different seasons and can really raise the bar for other player-seasons. If you really wanted to be rigorous, you’d probably want to average players’ seasons so that each player appears only once. But what would you do about the fact that a player might play LF one season and RF another? Or that his salary or team might change?
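
For what it’s worth, here is one hedged sketch of that one-row-per-player variant, under the simplifying assumption that you just average each player’s factor scores across his qualifying seasons (it dodges the position, team, and salary questions entirely; playerID and avgScores are names I made up):

	> # HYPOTHETICAL: COLLAPSE TO ONE ROW PER PLAYER BY AVERAGING
	> # FACTOR SCORES ACROSS SEASONS (IGNORES POSITION/TEAM CHANGES)
	> playerID <- sub("\\|.*$", "", bbdata$ID)
	> avgScores <- aggregate(bbdata[,c('power','contact','speed',
	+              'defense','flyball')],
	+              by=list(playerID=playerID), FUN=mean)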

Attach group membership to data set.

Now we want to append group membership to our data set. This is easy enough:

	> bbdata$cluster <- k$cluster

Let’s take a quick look at the breakdown of clustered group versus position to see if it fits with these conjectures:

	> table(bbdata$cluster, bbdata$POS)

	  CF LF RF
	1 40 10  9
	2 12 42 44
	3 40 12 15
	4 44 42 47

Recall that Group 1 comprised the fleet-footed contact hitters (leadoff types) who recorded a fair number of putouts but are slightly below average defensively. Generally, center fielders are considered the best defensive outfielders, so there’s probably too much thrown into that one “defense” factor (more proof that we didn’t ask for enough factors). Group 2 consisted of contact hitters with a decent defense score despite fewer putouts. Group 3 was composed of the power hitters with average contact, speed, and defense, but a lot of putouts. I’m really surprised to see that this group is full of CF players and has few LF and RF players. Finally, Group 4 consisted of the below-average all-around players, so it makes sense that it’s pretty equally distributed across all three outfield positions. Why should there be any more or fewer below-average all-around LF players than RF players, for example? It’s good to see this isn’t the case.
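
The same table is sometimes easier to read as row proportions (a small addition of mine):

	> # SHARE OF EACH CLUSTER PLAYING EACH POSITION (ROWS SUM TO 1)
	> round(prop.table(table(bbdata$cluster, bbdata$POS), 1), 2)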

Transform salary variable.

Before we can accurately predict salaries for our players, we must keep in mind that salaries are not normally distributed. Specifically, taking a quick look at the distribution of salaries yields a very skewed picture, as shown in Figure 6-12.

Figure 6-12. Distribution of salaries

	> hist(bbdata$salary)

Let’s transform the salaries into something that looks much less skewed. Let’s take the natural log of salary and use it. The results are shown in Figure 6-13.

	> bbdata$salary.log <- log(bbdata$salary)
	> hist(bbdata$salary.log)
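
If you want a second opinion on how close to normal the logged salaries are (my addition), a normal Q-Q plot is a quick check:

	> # POINTS HUGGING THE LINE SUGGEST APPROXIMATE NORMALITY
	> qqnorm(bbdata$salary.log)
	> qqline(bbdata$salary.log)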

Create linear regression model.

The time has come to predict players’ salaries! We will run a linear regression model for each of our four groups of players, using our factor scores, along with player age, to estimate salaries. We’ll then “undo” the predicted values because, remember, we actually are predicting log(salary), so we must undo this transformation by applying exp(salary.log). We’ll then stack all the subsets back into one great big data set.

	> # BREAK UP MAIN DATASET INTO SUBSETS
	> bbdata1 <- subset(bbdata, cluster==1)
	> bbdata2 <- subset(bbdata, cluster==2)
	> bbdata3 <- subset(bbdata, cluster==3)
	> bbdata4 <- subset(bbdata, cluster==4)

	> # RUN REGRESSION MODELS
	> lm1 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata1)
	> lm2 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata2)
	> lm3 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata3)
	> lm4 <- lm(salary.log~age+power+contact+speed+defense+flyball, data=bbdata4)

	> # UNDO TRANSFORMATION OF PREDICTED LOG(SALARIES)
	> bbdata1$predicted <- exp(lm1$fitted.values)
	> bbdata2$predicted <- exp(lm2$fitted.values)
	> bbdata3$predicted <- exp(lm3$fitted.values)
	> bbdata4$predicted <- exp(lm4$fitted.values)

	> # STACK SUBSETS BACK TOGETHER TO FORM REAL DATASET
	> bbdata <-data.frame(rbind(bbdata1,bbdata2,bbdata3,bbdata4))

Figure 6-13. Distribution of log(salary)

Glossing over the statistical details, such as how valid these regression models actually are and which statistical assumptions have been violated, we’ll simply plot the predicted salaries against the actual salaries to see how well we did, and we’ll use the players’ group (cluster) membership as the plotting points:

	> plot(salary~predicted, data=bbdata, col=cluster, pch=paste(cluster,''))
	> abline(lm(salary~predicted, data=bbdata))

As you can see in Figure 6-14, the points above the line represent players who have an actual salary higher than predicted, whereas the points below the line represent players who have an actual salary lower than predicted.

Figure 6-14. Actual versus predicted salaries

Ideally, all the points would lie on that regression line. What is the correlation between predicted and actual? We would hope it would be “high.”

	> r <- cor(bbdata$predicted,bbdata$salary)
	> r; r^2
	[1] 0.5861738
	[1] 0.3435997

Again, it’s OK but not great. An R² value of 0.344 means that 34.4% of the observed variability in actual salaries can be accounted for by our input variables of age, group membership, and factor scores. This isn’t surprising; plenty of variables beyond the ones we’ve chosen also influence players’ actual salaries.
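
You can also peek at each group’s fit separately (my addition; note these R² values are on the log(salary) scale, so they won’t match the 0.344 figure above):

	> # R-SQUARED FOR EACH CLUSTER'S REGRESSION, ON THE LOG SCALE
	> sapply(list(lm1, lm2, lm3, lm4),
	+        function(m) summary(m)$r.squared)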

Compare predicted versus actual salaries.

Let’s compute the difference between actual and predicted salaries and save the calculation for each player. We’ll then create a box plot, by group/cluster number, to see how it looks (see Figure 6-15).

Figure 6-15. Box plot of (actual minus predicted) salary versus group number

	> bbdata$diff = bbdata$salary - bbdata$predicted
	> for(i in 1:4) {
	+ print(summary(subset(bbdata[,'diff'], bbdata$cluster==i)));
	+ }
	Min.       1st Qu.  Median  Mean   3rd Qu.  Max.
	-8215000   -559900  -16900  477900 2058000  5845000
	Min.       1st Qu.  Median  Mean   3rd Qu.  Max.
	-15080000  -697800  181300  544600 2486000  12040000
	Min.        1st Qu. Median  Mean   3rd Qu.  Max.
	-17840000  -1099000 339000  553900 2257000  9574000
	Min.        1st Qu. Median  Mean   3rd Qu.  Max.
	-9021000   -598800  109700  873400 2391000  11110000
	> boxplot(diff~cluster, data=bbdata)

On average, we’ve undershot actual salary by about $500,000. But this could be largely due to a few extremely highly paid players “underperforming,” which throws the average off for everyone. Looking at the median values, it’s not quite so dire. We’re only about $200,000 under for 50% of the players. And for Group 1, we’ve actually overestimated salary slightly for 50% of its players.
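
If you just want the group medians without wading through four summary( ) printouts, tapply( ) condenses them to one line (a convenience of my own):

	> # MEDIAN (ACTUAL - PREDICTED) SALARY DIFFERENCE BY CLUSTER
	> tapply(bbdata$diff, bbdata$cluster, median)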

Identify most-underpaid players.

It’s easy enough to find which players have the very largest discrepancies between actual and predicted salaries, but let’s first standardize the differences so that everybody is on equal footing, and to help eliminate some of this actual minus predicted bias that exists:

	> bbdata$stdiff = (bbdata$diff - mean(bbdata$diff)) / sd(bbdata$diff)

Let’s find the most relatively underpaid players; that is, those at least two standard deviations below the average (actual minus predicted) difference:

	> underpaid <- subset(bbdata[,c('playerName','yearID',
	+                   'age','POS','G',
	+                   'cluster','salary','predicted',
	+                   'stdiff')], stdiff <= -2)
	> underpaid[sort.list(underpaid$stdiff),]
	      playerName        yearID age POS G   cluster  salary   predicted
	41    Bonds, Barry      2001   37  LF  153 3        10300000 28142677
	111   Finley, Steve     2003   38  CF  147 2         4250000 19334449
	133   Gonzalez, Luis    2003   36  LF  156 2         4000000 16800508
	36    Bichette, Dante   1999   36  LF  151 2         7250000 19185633
	163   Henderson, Rickey 2001   43  LF  123 4          300000 9321189
	131   Gonzalez, Luis    2001   34  LF  162 3         4833333 13448594
	152   Grissom, Marquis  2003   36  CF  149 1         1875000 10090267
	328   Walker, Larry     2001   35  RF  142 2        12166667 19884406
	162   Henderson, Rickey 1999   41  LF  121 4         1900000 9341312
	316   Surhoff, B.J.     1999   35  LF  162 2         3705516 10697211
	      stdiff
	41  -5.235698
	111 -4.455110
	133 -3.808747
	36  -3.563984
	163 -2.739187
	131 -2.624308
	152 -2.511108
	328 -2.370306
	162 -2.292076
	316 -2.164833

It’s nice to see that this list contains players from each of the four identified clusters. It’s interesting that all of them are older players who mostly play LF. Of course, Barry Bonds’ 2001 campaign in which he broke the single-season HR and SLG records is the biggest bargain, even if he was paid $10 million, given his all-around performance relative to other players in the “slugger” cluster. It’s interesting that Luis Gonzalez makes it twice—very underappreciated.

Identify most-overpaid players.

At the other end of the spectrum, we have players who were actually paid more than expected, given their relative performance:

	> overpaid <- subset(bbdata[,c('playerName','yearID',
	+                         'age','POS','G',
	+                         'cluster','salary','predicted',
	+                         'stdiff')], stdiff >= 2)
	> overpaid[sort.list(overpaid$stdiff, decreasing=TRUE),]
	    playerName         yearID age POS G   cluster salary   predicted
	279 Ramirez, Manny     2003    31 LF  154 2       20000000 7963658
	140 Green, Shawn       2003    31 RF  160 2       15666667 3818053
	244 Mondesi, Raul      2003    32 RF   98 4       13000000 1890954
	157 Guerrero, Vladimir 2003    27 RF  112 2       11500000 1250975
	148 Griffey, Ken       2001    32 CF  111 4       12500000 2454218
	191 Jones, Andruw      2003    26 CF  156 3       12000000 2425791
	173 Higginson, Bobby   2003    33 RF  130 4       11850000 2852073
	309 Sosa, Sammy        2003    35 RF  137 3       16000000 7435686
	138 Green, Shawn       2001    29 RF  161 3       12166667 3607110
	242 Mondesi, Raul      2000    29 RF   96 4       10000000 1521948
	190 Jones, Andruw      2002    25 CF  154 3       10000000 2231977
	   stdiff
	279 3.220166
	140 3.167039
	244 2.957739
	157 2.714350
	148 2.656831
	191 2.523375
	173 2.360285
	309 2.237571
	138 2.236225
	242 2.213158
	190 2.012218

There are a lot of big salaries here; the players earning them just didn’t stand out enough from other players on the various factors.

I covered a lot in this hack, and yet I intentionally skimmed over much of the underlying statistical theory to keep the hack reasonably short. But I hope it piqued your interest in performing similar analyses using R and in learning more about multivariate statistical techniques, especially as they might be applied to baseball stats.

Hacking the Hack

Vary the numbers of factors and/or number of clusters.

You can get “tighter” predictions if you ask for more than five factors (remember, using five retained only 59% of the variability), though you would need to reinterpret all of them because there’s no guarantee the existing five would stay the same. Varying the number of clusters used to group players would help fine-tune the predictions as well.
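
One hedged way to choose the cluster count is the classic within-cluster sum-of-squares “elbow” plot, sketched below (it assumes the factor-score matrix x from earlier is still around; nstart is my addition to smooth out kmeans’s sensitivity to starting centers):

	> # TOTAL WITHIN-CLUSTER SS FOR k = 2..8; LOOK FOR THE "ELBOW"
	> wss <- sapply(2:8, function(k)
	+        sum(kmeans(x, centers=k, nstart=25, iter.max=10^4)$withinss))
	> plot(2:8, wss, type="b", xlab="clusters", ylab="within-cluster SS")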

Look at different positions.

This hack could certainly apply to other positions as well. Obviously, what is valued in an outfielder differs from, say, a shortstop or a catcher. But it gets a little thorny when you consider that some players are versatile and play a variety of positions over the course of a season (and certainly over their careers). So perhaps put utility players in a class by themselves.

Look at different years.

It would be interesting to see what mix of skills was most valued in 1985 versus 1995. Or, if you can find the salary data, you might want to look back even further.

Include more variables.

The variables included in this hack only scratch the surface for drawing a “complete” picture of a player’s worth. It would be much more interesting to include many more variables, such as what side of the plate he bats on; his experience, height, and weight; how good his team is; the home park he plays in; and his ground-ball-to-fly-ball tendency, just to name a few.

Tom Dierickx
