OBP, SLG, and Scoring Runs

Calculate how much OBP and SLG each contribute to an estimate of runs scored.

In his 2003 best-selling book, Moneyball (W. W. Norton & Company), Michael Lewis communicated that in the evaluation of players, the Oakland Athletics organization placed a weight on on-base percentage that was worth roughly three times the weight of slugging percentage:

…OPS was the simple addition of on-base and slugging percentages. Crude as it was, it was a much better indicator than any other offensive statistic of the number of runs a team would score. Simply adding the two statistics together, however, implied that they were of equal importance. If the goal was to raise a team’s OPS, an extra percentage point of on-base was as good as an extra percentage point of slugging.
Before his thought experiment Paul [DePodesta] had felt uneasy with this crude assumption; now he saw that the assumption was absurd. An extra point of on-base percentage was clearly more valuable than an extra point of slugging percentage—but by how much? … In his model an extra point of on-base percentage was worth three times an extra point of slugging percentage.

In this hack, we’ll show how to calculate how much more important OBP is than SLG, using a simple linear regression analysis. And we’ll demonstrate that this number is closer to two than it is to three.

The Data and the Code

In this hack, we take a very simple approach to calculating the relative values of OBP and SLG, as they pertain to scoring runs. With the help of the Baseball Archive database (as described in “Get a MySQL Database of Player and Team Statistics” [Hack #10]), we will look at actual team outcomes over a time period to determine the best linear fit.

For each team and each season, we will consider the three values of OBP, SLG, and RS. For example, from the 2004 season, the two World Series teams would give us the following (OBP, SLG, RS) data:

	STL .344  .460  855 
	BOS .360  .472  949

We can express the problem we are trying to solve as follows: which values of A, B, and C will provide the best fit between actual team data and the following linear model?

	A * OBP + B * SLG + C = RS

The ratio ρ = A / B will give us the importance of OBP, relative to SLG, as it pertains to scoring runs. We will use the statistical analysis tool R and this simple linear model to calculate the best fit, but first we need to get our hands on the relevant data. We will use team totals for all 30 major league teams over the past five seasons (2000–2004) as the starting point of our investigation.

“Learn Perl” [Hack #13] shows how you can obtain the data from the Baseball Archive database. For the time being, we will assume you have downloaded the data and that it resides on your local hard drive, in your current folder. Using the read.csv( ) function in R, we can read the data directly from a CSV file and load it into a data frame that we will call Teams:

	Teams <- read.csv("Teams.csv", header=TRUE)

Since the Baseball Archive database does not contain the explicit values for OBP and SLG, we will calculate these values and store them in columns within the data frame. We will also explicitly calculate and store singles and total bases along the way for simplicity.

First, we will calculate the number of singles per team by adding another column that we will call X1B (R doesn’t let column names start with a digit):

	Teams <- cbind(Teams, X1B = (Teams$H - (Teams$X2B + Teams$X3B + Teams$HR)))

Next, we will calculate the total bases values:

	Teams <- cbind(Teams, TB = (Teams$X1B + 2 * Teams$X2B + 3* Teams$X3B +
	4*Teams$HR))

We will calculate and store the slugging percentages with the following:

	Teams <- cbind(Teams, SLG = (Teams$TB / Teams$AB))

Finally, we will calculate OBP from hits and walks only, ignoring hit-by-pitch (HBP) and the sacrifice flies that appear in the denominator of the official formula. We exclude HBP because the data is not always available; the omission is a minor one that will not materially affect the results.

	Teams <- cbind(Teams, OBP = ((Teams$H + Teams$BB) / (Teams$AB + Teams$BB)))
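The four derived columns can be sanity-checked on a single team line before running them against the full table. The following is a minimal sketch, assuming a hypothetical data frame named toy whose totals are made up for illustration and not taken from the database. It uses direct column assignment instead of cbind( ), which gives the same result:

```r
# One hypothetical team-season (made-up totals, NOT taken from the database).
# Column names follow the X2B/X3B convention that read.csv() produces.
toy <- data.frame(H = 1500, X2B = 300, X3B = 30, HR = 200, AB = 5500, BB = 550)

# Singles are hits that were not doubles, triples, or home runs
toy$X1B <- toy$H - (toy$X2B + toy$X3B + toy$HR)

# Total bases weight each hit by the number of bases gained
toy$TB <- toy$X1B + 2 * toy$X2B + 3 * toy$X3B + 4 * toy$HR

# Slugging percentage: total bases per at bat
toy$SLG <- toy$TB / toy$AB

# Simplified OBP, as in the text: hits and walks only (no HBP, no SF)
toy$OBP <- (toy$H + toy$BB) / (toy$AB + toy$BB)

print(toy[, c("X1B", "TB", "SLG", "OBP")])
```

With these made-up totals, the team has 970 singles and 2,460 total bases, for a SLG of about .447 and a simplified OBP of about .339.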

From this full collection of data, we will extract only those years and teams of interest by using the subset( ) function as follows:

	RecentTeams <- subset(Teams, yearID>=2000)

Now, because we have set up this data frame nicely, we can take advantage of R’s linear fit tool to fit runs to a linear combination of SLG and OBP by issuing the following R command:

	fm <- lm(R ~ OBP + SLG, RecentTeams)

The Results

We can get a quick summary of the results by just typing the name of the result object as follows:

	fm

	Call:
	lm(formula = R ~ OBP + SLG, data = RecentTeams)

	Coefficients: 
	(Intercept)      OBP       SLG
	     -928.3   2863.7    1783.9

The simple way to interpret this output is that R is telling us that the best linear fit between runs, OBP, and slugging comes in the form of the following linear equation:

	Runs = 2863.7 * OBP + 1783.9 * SLG - 928.3
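To see what this equation means in practice, we can plug in the two 2004 World Series teams quoted earlier. This is only a sketch: the helper name predict_runs and the data frame ws2004 are our own, and the coefficients are copied by hand from the output above.

```r
# Coefficients copied from the fitted equation in the text
predict_runs <- function(obp, slg) {
  2863.7 * obp + 1783.9 * slg - 928.3
}

# 2004 World Series teams, using the (OBP, SLG, RS) values quoted earlier
ws2004 <- data.frame(
  team = c("STL", "BOS"),
  OBP  = c(0.344, 0.360),
  SLG  = c(0.460, 0.472),
  RS   = c(855, 949)
)

# Predicted runs and the difference from actual runs scored
ws2004$predicted <- predict_runs(ws2004$OBP, ws2004$SLG)
ws2004$error     <- ws2004$predicted - ws2004$RS

print(ws2004)
```

The equation predicts about 877 runs for St. Louis (roughly 22 too many) and about 945 for Boston (roughly 4 too few).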

Because our original question was concerned with the relative importance of OBP and SLG in determining runs, we can divide the coefficients to find the following:

	ρ=2863.7 / 1783.9 = 1.61

In other words, the best (linear) explanation of how much more significant OBP was than SLG during the years 2000–2004 is that OBP was about 1.61 times as “important.”
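Rather than dividing the printed coefficients by hand, you can pull them out of the fitted object by name. Here is a minimal sketch on synthetic data: the toy data frame is randomly generated from coefficients we made up, so the printed ratio illustrates the mechanics only and says nothing about real teams.

```r
# Synthetic stand-in for team data (randomly generated, NOT real statistics)
set.seed(1)
n <- 150
toy <- data.frame(
  OBP = runif(n, 0.30, 0.37),
  SLG = runif(n, 0.37, 0.49)
)
# Runs are built from assumed coefficients plus random noise
toy$R <- 2800 * toy$OBP + 1800 * toy$SLG - 900 + rnorm(n, sd = 25)

# Same model formula as in the text
fit <- lm(R ~ OBP + SLG, data = toy)

# Look coefficients up by name so the result does not depend on term order
ratio <- unname(coef(fit)["OBP"] / coef(fit)["SLG"])
print(ratio)
```

Looking the coefficients up by name is a little safer than indexing by position, because the order of the coefficients follows the order of the terms in the formula.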

But just how well did the data fit the model? We can obtain more detailed output of the results by looking at the result object’s summary:

	summary(fm)
	
	Call:
	lm(formula = R ~ SLG + OBP, data = RecentTeams)

	Residuals:
	    Min      1Q  Median      3Q     Max
	-75.207 -14.753  -1.445  14.676  57.881

	Coefficients:
	            Estimate Std. Error t value Pr(>|t|)
	(Intercept)  -933.07      48.46  -19.26   <2e-16 ***
	SLG          1779.89     131.56   13.53   <2e-16 ***
	OBP          2882.99     239.44   12.04   <2e-16 ***
	---
	Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

	Residual standard error: 24.9 on 147 degrees of freedom
	Multiple R-Squared: 0.9164, Adjusted R-squared: 0.9153
	F-statistic: 805.6 on 2 and 147 DF, p-value: < 2.2e-16

Most importantly, we see that the R^2 value is roughly 0.92, which is a good indication that the quantities correlate well. For the purposes of this hack, we aren’t going to dwell on how well the data fits the model, but instead we will choose to investigate different year ranges to see if we can gain any insight into how this quantity changes from one year to the next.
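If you want these fit statistics as numbers rather than printed text, the object returned by summary( ) exposes them directly. The sketch below again uses randomly generated data (the toy data frame and its coefficients are assumptions of ours), so the values it prints will not match the output above.

```r
# Synthetic stand-in for team data (randomly generated, NOT real statistics)
set.seed(1)
n <- 150
toy <- data.frame(
  OBP = runif(n, 0.30, 0.37),
  SLG = runif(n, 0.37, 0.49)
)
toy$R <- 2800 * toy$OBP + 1800 * toy$SLG - 900 + rnorm(n, sd = 25)

fit <- lm(R ~ OBP + SLG, data = toy)
s   <- summary(fit)

# R-squared and adjusted R-squared as plain numbers
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

# Standard errors of the estimates, taken from the coefficient table
se <- s$coefficients[, "Std. Error"]

# 95% confidence intervals for the intercept and both slopes
ci <- confint(fit)

print(c(r.squared = r2, adj.r.squared = adj_r2))
print(se)
print(ci)
```

The confidence intervals are a useful reminder that both slopes are estimates, so the ratio between them carries some uncertainty of its own.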

To make it easy to calculate the coefficient for many different groups of seasons, we next create a function called rho that takes the start year and end year of a range of seasons for which we want to calculate the ratio of the OBP coefficient to the SLG coefficient:

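	rho <- function (year1, year2) {
	  # Keep only the seasons from year1 through year2
	  TheTeams <- subset(Teams, yearID >= year1 & yearID <= year2)
	  # Fit runs to OBP and SLG for those seasons
	  fm1 <- lm(R ~ OBP + SLG, TheTeams)
	  # Coefficient 2 is OBP and coefficient 3 is SLG
	  coef(fm1)[2] / coef(fm1)[3]
	}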

We can use this function to recalculate the previous quantity by simply typing the following:

	> rho(2000,2004)
	   OBP
	1.605250

With this function, one can easily generate the following matrix by entering the year in the lefthand column as the start year, with the year in the top row as the end year:

Table 6-12.

	Start/End  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004
	1995        2.2   1.0   1.4   1.5   1.6   1.7   1.7   1.6   1.5   1.4
	1996              1.0   1.7   2.2   2.0   2.2   2.3   2.1   1.9   1.9
	1997                    2.6   2.8   2.3   2.5   2.4   2.1   2.0   1.9
	1998                          3.3   2.2   2.5   2.5   2.1   1.9   1.8
	1999                                1.7   2.2   2.2   1.9   1.7   1.6
	2000                                      3.9   3.0   2.0   1.7   1.6
	2001                                            3.2   1.5   1.4   1.3
	2002                                                  0.8   0.9   1.0
	2003                                                        1.1   1.2
	2004                                                              1.2
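A matrix like the one above can be filled in by looping over every pair of start and end years. The following is a minimal sketch, not the authors' code: the helper rho_matrix is hypothetical, rho is restated so the block stands alone, and Teams is replaced by randomly generated data, so the printed values will not match the table.

```r
# Synthetic stand-in for the Teams data frame (random, NOT real statistics):
# 30 teams per season for 1995 through 2004
set.seed(42)
years <- 1995:2004
Teams <- data.frame(yearID = rep(years, each = 30))
Teams$OBP <- runif(nrow(Teams), 0.30, 0.37)
Teams$SLG <- runif(nrow(Teams), 0.37, 0.49)
Teams$R   <- 2800 * Teams$OBP + 1800 * Teams$SLG - 900 +
             rnorm(nrow(Teams), sd = 25)

# Ratio of the OBP coefficient to the SLG coefficient for a range of seasons
rho <- function(year1, year2) {
  TheTeams <- subset(Teams, yearID >= year1 & yearID <= year2)
  fm1 <- lm(R ~ OBP + SLG, TheTeams)
  unname(coef(fm1)["OBP"] / coef(fm1)["SLG"])
}

# Hypothetical helper: rows are start years, columns are end years.
# Cells where the end year precedes the start year stay NA.
rho_matrix <- function(years) {
  labels <- as.character(years)
  m <- matrix(NA_real_, nrow = length(years), ncol = length(years),
              dimnames = list(start = labels, end = labels))
  for (i in seq_along(years)) {
    for (j in i:length(years)) {
      m[i, j] <- rho(years[i], years[j])
    }
  }
  m
}

ratios <- rho_matrix(years)
print(round(ratios, 1))
```

To reproduce the real table, drop the synthetic block at the top and use the Teams data frame built earlier in this hack.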

You can make several very interesting observations from this matrix alone:

  • This number is not particularly stable from one year to the next. It could well be that an outlier or two in a given season dramatically altered the best fit. Excluding such outliers could lead to more consistent ratios from year to year.

  • If you ran this exercise over just a season or two ending in 2001, the resulting ratio would have appeared to be around 3.0 (3.0 for 2000–2001, and 3.2 for 2001 alone).

  • However, looking at more-recent years, especially with more and more history, the results tend to be closer to 2.0.
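One way to test the outlier idea from the first observation is to refit a season after dropping its most influential teams. The sketch below uses Cook's distance with the common 4/n rule of thumb. Both the technique and the cutoff are our assumption, not something the original analysis used, and the data is randomly generated with one outlier injected on purpose.

```r
# One synthetic 30-team season (random, NOT real statistics)
set.seed(7)
n <- 30
season <- data.frame(
  OBP = runif(n, 0.30, 0.37),
  SLG = runif(n, 0.37, 0.49)
)
season$R <- 2800 * season$OBP + 1800 * season$SLG - 900 + rnorm(n, sd = 25)

# Inject one artificial outlier so there is something to find
season$R[1] <- season$R[1] + 300

# Ratio of the OBP coefficient to the SLG coefficient for a fitted model
ratio <- function(fit) unname(coef(fit)["OBP"] / coef(fit)["SLG"])

# Fit on every team, then measure each team's influence on the fit
fit_all <- lm(R ~ OBP + SLG, data = season)
d <- cooks.distance(fit_all)

# Rule of thumb (an assumption): keep teams with Cook's distance <= 4/n
keep <- d <= 4 / n
fit_trim <- lm(R ~ OBP + SLG, data = season[keep, ])

print(c(all = ratio(fit_all), trimmed = ratio(fit_trim), dropped = sum(!keep)))
```

If the ratio moves a lot between the two fits, a handful of teams is driving that season's result.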

What does all of this tell us? The most important result, in our opinion, is that the inconsistency from one season to the next could indicate that the relationship between OBP, SLG, and RS is not in fact linear. This should not come as news, because OBP and SLG double-count certain statistics (a hit, for example, raises both). A more sensible metric, such as linear weights, gives a much better understanding of the relationship between baseball’s discrete events and the outcomes of those events.

Mark E. Johnson and Matthew S. Johnson
