Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Measure Batting with Batting Average

Use R to calculate the most familiar statistic in baseball: batting average.

Batting average (AVG) is, by far, the most popular baseball statistic. It’s printed on baseball cards, flashed on scoreboards, and recounted by announcers during games. For more than 50 years, the best minds in baseball have talked about the problems with batting average, from Branch Rickey and Allan Roth’s work with the Brooklyn Dodgers in the 1950s, through Bill James and Pete Palmer’s work in the 1980s, to Michael Lewis’s book, Moneyball (W. W. Norton & Company).

I agree that batting average isn’t the best formula for measuring batting ability today, but I think it was once much more revealing. Batting average is one of the oldest statistics, dating back to the first time player statistics were published in the 19th century. In the early days of baseball, the pitcher’s role was not the same as it is today. The pitcher wasn’t supposed to trick batters with balls that looked like strikes, strikes that looked like balls, fastballs that flew past the batter, and changeups that slowly crossed the plate after the batter swung through the pitch. The early rules actually stated that a pitcher had to keep his arm stiff at the elbow and pitch underhand. The pitcher’s role was to throw the ball over the plate, allowing the batter the chance to put the ball in play. So, naturally, batting average excludes walks and the number of times a player is hit by a pitch because these are the fault of a bad pitcher and have nothing to do with the skills of a good batter.

I’m including this hack for three reasons. First, this is the most familiar of all baseball formulas, and it’s one of the simplest. I want to show you a few statistician tricks for quickly calculating, reading, and understanding numbers; with batting average, I know the formula itself won’t confuse you. Second, batting average is one of the key statistics used to score teams in fantasy baseball. Regardless of whether batting average helps real teams win real games, it will help fantasy teams win fantasy leagues. Finally, we all grew up knowing batting averages, believing that .360 was MVP quality, .300 was a really good average, and .240 wasn’t so hot. I know that batting average is a flawed statistic, but I have to admit that I still look at it.

Sample Code

Batting average formula.

Batting average measures how often a player reaches base by his own efforts. It includes all balls in play (hits and outs) and all strikeouts. It excludes walks because no ball was put into play on a walk. It also excludes bases reached on error.

Each time a player safely reaches base by putting the ball in play, he is credited with a hit. (H is the abbreviation for the number of hits received.) Each time a player has an opportunity to reach base by putting the ball into play, and either reaches base or strikes out, he is credited with an at bat. (AB usually represents the number of at bats.) Batting average is the number of base hits divided by the total number of at bats. The formula is:

	AVG = H / AB

There are a couple of special cases. If a batter is hit by a pitch, it doesn’t count as an at bat. When a player makes an out on a fly ball to the outfield, but the runner on third base scores, this is a sacrifice fly (SF) and does not count as an at bat. When a player deliberately makes contact with the ball, advances the base runners, and is put out at first base, this is a sacrifice hit (SH) and does not count as an at bat either.

Running the Hack

Let’s start simply, by just calculating AVG in R. As a starting point, let’s load a database with data that you can get from several web sites, as explained in “Get Historical Play-by-Play Data” [Hack #14] . (If you don’t want to install MySQL, you can load text files directly into R.)

To use any of the formulas in R, you will first have to load and attach this data set. If you have already loaded the data into R and saved it as an .RData file (as described in “Access Databases Directly from Excel or R” [Hack #33] ), you should include an R statement like this to load and attach the file:

	> load("bbdatabank.RData")
	> attach(batting);

If you have MySQL installed and you have a copy of the Baseball DataBank database, you can access this in R through the following commands (as explained in “Access Databases Directly from Excel or R” [Hack #33] ):

	# this example uses RMySQL
	library(RMySQL)
	drv <- dbDriver("MySQL")
	# change the username field as appropriate
	bbdb.con <- dbConnect(drv, username="jadler",
	  dbname="bbdatabank",host="localhost")
	batting.query <- dbSendQuery(bbdb.con, "SELECT * FROM batting")
	batting <- fetch(batting.query, n=-1)

In R, you can calculate batting average like this:

	batting$AVG <- batting$H / batting$AB

I like to be able to refer to variables by their short names (like AVG) without specifying the data frame (in this case, batting), so I use the following command to attach the data frame:

	attach(batting)

Summary statistics.

The first thing that I like to do when trying to understand a formula is to calculate summary statistics in R. The idea is to calculate the mean and median values (so that you know where average players are), the maximum and minimum (so that you know where the worst and best players are), and the 25th and 75th percentiles. You calculate the values at the quartiles so that you can see where the middle is: half of all players are between the first and third quartiles.

Let’s look at some summary statistics to understand how batting average has changed over time. Here are the summary statistics across the entire database (1871–2004):

	> summary(AVG, digits=3)
	   Min.  1st Qu.  Median     Mean    3rd Qu.    Max.      NA's
	  0.000    0.151   0.232     0.211     0.275   1.000 10680.000

Notice the large number of NA values. Many players never had a single qualifying at bat. Also, the experience of many players in the major leagues was so short (or their number of at bats in a season was so small) that they never managed to get a hit, or never made an out.

To make this distribution more useful, let’s cut this down to show only those players with a number of at bats that’s large enough to qualify for a batting title. To qualify, a player needs an average of at least 3.1 plate appearances in every game played by his team. For a more detailed discussion of this rule, see “Significant Number of At Bats” [Hack #63] .

Here’s a straightforward way to do this in R alone:

	> t <- subset(teams, select=c(teamID,yearID,G))
	> names(t) <- c("teamID", "yearID", "teamG")
	> b_and_t <- merge(batting, t, by=c("yearID", "teamID"))
	> b_and_t$AVG <- b_and_t$H / b_and_t$AB
	> b_and_t$qualify <- (b_and_t$AB + ifelse(is.na(b_and_t$BB),0, b_and_t$BB) +
	+ ifelse(is.na(b_and_t$HBP),0,b_and_t$HBP) +
	+ ifelse(is.na(b_and_t$SF),0,b_and_t$SF)) > 3.1 * b_and_t$teamG
	> attach(b_and_t)

However, this code does not correctly deal with players who batted on more than one team in one year. You can use R to fix this problem (using the aggregate function), but the query takes a long time (more than 10 minutes) to run. I gave up on it and decided to get the data out of a MySQL database. This query takes less than 30 seconds:

	> library(RMySQL)
	> drv <-dbDriver("MySQL")
	> con <- dbConnect(drv, username="jadler", dbname="bbdatabank", host="localhost")
	> bt.query <- dbSendQuery(con, statement = paste(
	+    "select playerID, yearID, ",
	+    "GROUP_CONCAT(teamID SEPARATOR ',') as teamIDs, ",
	+    "sum(G / teamG) * sum(teamG) / count(teamID) as qualifyingG, ",
	+    "sum(teamG) / count(teamID) as teamG, ",
	+    "sum(G) as G, sum(AB) as AB, sum(R) as R, ",
	+    "sum(H) as H, sum(2B) as X2B, sum(3B) as X3B, ",
	+    "sum(HR) as HR, sum(RBI) as RBI, sum(SB) as SB, ",
	+    "sum(CS) as CS, sum(BB) as BB, sum(SO) as SO, ",
	+    "sum(IBB) as IBB, sum(HBP) as HBP, sum(SH) as SH, ",
	+    "sum(SF) as SF, sum(GIDP) as GIDP ",
	+    "from  (select b.*, t.G as teamG ",
	+        "    from batting b inner join teams t ",
	+        "    on b.teamID=t.teamID and b.yearID=t.yearID) i ",
	+    "group by playerID, yearID"))> b_and_t <- fetch(bt.query, n=-1)
	> b_and_t <- fetch(bt.query, n=-1)
	> b_and_t$AVG <- b_and_t$H / b_and_t$AB
	> b_and_t$qualify <- (b_and_t$AB + ifelse(is.na(b_and_t$BB),0, b_and_t$BB) +
	+ ifelse(is.na(b_and_t$HBP),0,b_and_t$HBP) +
	+ ifelse(is.na(b_and_t$SF),0,b_and_t$SF)) > 3.1 * b_and_t$teamG
	> attach(b_and_t)

Here are the summary statistics across the entire database (1871–2004):

	> summary(subset(AVG,qualify),digits=3)
	  Min.  1st Qu. Median  Mean    3rd Qu.    Max.
	 0.0741 0.2580  0.2790  0.2800  0.3020   0.4920

Here are the summary statistics over the past decade (1995–2004):

	> summary(subset(AVG,qualify & yearID > 1994),digits=3)
	  Min.  1st Qu. Median  Mean    3rd Qu.    Max.
	 0.203    0.265  0.282  0.284     0.302   0.379

Top 10.

To get an even better feel for the statistics, let’s look at a few extreme values. Let’s look at the top 10 batting averages of all time. Here is the SQL code that I used to calculate these results. I used the same MLB rule used to qualify for awards: an average of 3.1 plate appearances per game is required to qualify.

(If you want, you can skip the create index statements. A database index is an object that tells the database how to find values in a table quickly. It’s a lot like an index in a book. Instead of flipping through a book to find fielding runs, you can just look up the page in an index. A database index works the same way. However, if you skip these statements, this query is likely to take substantially longer to run, maybe hours rather than seconds.)

RMySQL and MySQL 5.0

I had some difficulty with RMySQL and MySQL 5.0. MySQL 5.0 includes slightly different information about column types than earlier versions of MySQL, confusing applications like R. In particular, I had problems with floating-point numbers. Here is a short example of the error message that I received:

	There were 19 warnings (use warnings() to see them)
	> warnings( )
	Warning messages:
	1: RS-DBI driver warning: (unrecognized MySQL field type 246 in column 3)
	2: RS-DBI driver warning: (unrecognized MySQL field type 246 in column 4)
	3: RS-DBI driver warning: (unrecognized MySQL field type 246 in column 5)

	…

You need to fix the datatype of these fields in R (or else R considers them to be “character” types, and not numeric). To do this, I use the as.double and as.integer functions for decimal and integer values, respectively:

	b_and_t$qualifyingG <- as.double(b_and_t$qualifyingG)
	b_and_t$teamG <- as.integer(b_and_t$teamG)
	b_and_t$G <- as.integer(b_and_t$G)

	…

You can find the full version of the code I used at http://www.oreilly.com/catalog/baseballhks.

	create index batting_pidx on batting(idxLahman);
	create index batting_tidx on batting(idxTeams);
	create index master_pidx on master(idxLahman);
	create index teams_idx on teams(idxTeams);
	create index teams_fidx on teams(idxTeamsFranchises);
	create index teamsFranchises_idx on teamsFranchises(idxTeamsFranchises);

	create table b_and_t as
	select idxLahman, yearID,
	   GROUP_CONCAT(franchID SEPARATOR ",") as teamIDs,
	   sum(G / teamG) * sum(teamG) / count(franchID) as qualifyingG,
	   sum(teamG) / count(franchID) as teamG,
	   sum(G) as G, sum(AB) as AB, sum(R) as R,
	   sum(H) as H, sum(2B) as X2B, sum(3B) as X3B,
	   sum(HR) as HR, sum(RBI) as RBI, sum(SB) as SB,
	   sum(CS) as CS, sum(BB) as BB, sum(SO) as SO,
	   sum(IBB) as IBB, sum(HBP) as HBP, sum(SH) as SH,
	   sum(SF) as SF, sum(GIDP) as GIDP
	from (select b.*, t.yearID, t.G as teamG, f.franchID
	    from batting b inner join teams t
	    inner join TeamsFranchises f
	    where b.idxTeams=t.idxTeams and
	    t.idxTeamsFranchises=f.idxTeamsFranchises) i
	group by idxLahman, yearID;

	select f.franchName AS "First_Team",
	   nameFirst, nameLast, b.yearID,
	   b.AB, round(b.H/b.AB,3) as AVG
	from b_and_t b inner join master m inner join teamsFranchises f
	where b.idxLahman=m.idxLahman
	 and substr(b.teamIDs,1,3)=f.franchID
	 and b.AB + ifnull(b.BB, 0) + ifnull(b.HBP, 0) + ifnull(b.SF, 0)
	 > 3.1 * b.teamG
	 -- and b.yearID > 1994
	order by AVG DESC limit 10;

Here are the top 10 batting averages of all time (note that the Athletics played only 28 games in 1871):

	+------------------------+-----------+----------+--------+------+-------+
	| First_Team             | nameFirst | nameLast | yearID | AB   | AVG   |
	+------------------------+-----------+----------+--------+------+-------+
	| Philadelphia Athletics | Levi      | Meyerle  |   1871 |  130 | 0.492 |
	| Atlanta Braves         | Hugh      | Duffy    |   1894 |  539 | 0.440 |
	| St. Louis Cardinals    | Tip       | O'Neill  |   1887 |  517 | 0.435 |
	| Boston Red Stockings   | Ross      | Barnes   |   1872 |  229 | 0.432 |
	| Boston Red Stockings   | Cal       | McVey    |   1871 |  153 | 0.431 |
	| Chicago Cubs           | Ross      | Barnes   |   1876 |  322 | 0.429 |
	| Oakland Athletics      | Nap       | Lajoie   |   1901 |  544 | 0.426 |
	| Boston Red Stockings   | Ross      | Barnes   |   1873 |  322 | 0.425 |
	| St. Louis Cardinals    | Rogers    | Hornsby  |   1924 |  536 | 0.424 |
	| Baltimore Orioles      | Willie    | Keeler   |   1897 |  564 | 0.424 |
	+------------------------+-----------+----------+--------+------+-------+

Here are the top 10 seasons over the past decade (1995–2004):

	+----------------------+-----------+-------------+--------+------+-------+
	| First_Team           | nameFirst | nameLast    | yearID | AB   | AVG   |
	+----------------------+-----------+-------------+--------+------+-------+
	| Colorado Rockies     | Larry     | Walker      |   1999 |  438 | 0.379 |
	| Boston Red Sox       | Nomar     | Garciaparra |   2000 |  529 | 0.372 |
	| San Diego Padres     | Tony      | Gwynn       |   1997 |  592 | 0.372 |
	| Colorado Rockies     | Todd      | Helton      |   2000 |  580 | 0.372 |
	| Seattle Mariners     | Ichiro    | Suzuki      |   2004 |  704 | 0.372 |
	| San Francisco Giants | Barry     | Bonds       |   2002 |  403 | 0.370 |
	| San Diego Padres     | Tony      | Gwynn       |   1995 |  535 | 0.368 |
	| Colorado Rockies     | Larry     | Walker      |   1997 |  568 | 0.366 |
	| Colorado Rockies     | Larry     | Walker      |   1998 |  454 | 0.363 |
	| San Francisco Giants | Barry     | Bonds       |   2004 |  373 | 0.362 |
	+----------------------+-----------+-------------+--------+------+-------+

Distribution.

The third thing I like to do to understand a formula is plot a histogram of the distribution. The idea of a histogram is to create a set of evenly spaced bins across the range of values. Next, you count the number of items in each bin. (For batting average, the bins varied between about .20 and .40, and there were 40 bins of size .005. So, the first bin contained averages between .200 and .205, the second between .205 and .210, and so on.)

A simple command in R, called the hist() command, lets us plot this in one step:

	> hist(subset(AVG,(qualify & yearID>1994)), breaks=40)

The left part of Figure 5-1 shows the plot that this statement generates. Here’s one example of how to read this diagram. The little box whose left edge lines up with .25 contained players with batting averages of between .250 and .255, and it looks like there were about 75 players in that box.

Figure 5-1. Batting average histogram and box plot

The Shape of the Distribution

Here’s a little more to think about, if you’re mathematically inclined. You might notice that the distribution resembles a bell curve and is called a Gaussian or normal distribution. Because of something called the Central Limit Theorem, bell curves occur naturally everywhere when plotting the distribution of sample means. A lot of baseball formulas are averages, so it is not surprising that their distributions look like bell curves. If you want to know more about how or why this occurs, check out a statistics book. I talk more about probability distributions and batting average in “Significant Number of At Bats” [Hack #63] .

The thing to focus on here is that the general shape of the data resembles a bell curve. Notice also how common it is to be in the middle of the group, and how rare it is to be much better or much worse than average.

This isn’t a perfect normal distribution; it’s a little skewed toward low values. (Statistical terms like skew and kurtosis are used to explain how much a distribution deviates from a normal distribution.)

Box plot.

As a final step, let’s look at how batting averages have changed over time. I’ll use a special type of plot to show not only how the median values have changed, but also how the middle quartile, maximum, and minimum values have changed. The plot is called a box plot, and you can see an example of it in Figure 5-1. These plots show five values for each decade. The lines at the top and bottom (the whiskers) represent the maximum and minimum values. The box in the middle of each plot shows the middle range of values. The bottom of the box represents the 25th percentile, the line in the middle represents the median, and the top of the box represents the 75th percentile.

I use a trick to group by decade. I divide the year by 10 (for example, changing 2004 to 200.4), then I use a floor function to truncate the decimal (turning 200.4 to 200), and then I multiply the result by 10 (turning 200 to 2,000). This is a quick way to group values together by decade. I then plot the results using the boxplot() command in R:

	> b_and_t$decade <- as.factor(floor(b_and_t$yearID/10) * 10)
	> boxplot(AVG~decade,data=b_and_t,subset=qualify,
	+   pars=c(xlab="decade", ylab="batting average"),
	+   range=0)

The right part of Figure 5-1 shows the box plot that this code generated. As you can see, the range of batting averages narrowed over the first 100 years but has been stable for the past couple of decades. A big reason for this has nothing to do with the players’ skills: in the 1870s, teams played about 50 games a year; in the 1880s, most teams played only about 100 games; and by the 1890s, the average had increased to 140 games per year, which is close to today’s 162.

Notice the tight range of values: most major league players have an average of between .260 and .300.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Measure Batting with Batting Average

Create new playlist

Sign In

Sign Up