Measure Fielding with Range Factor

Use range factor to measure how much a defensive player works.

Range factor (RF) represents the number of balls a player fields per nine innings, that is, per full game in the field. It’s a simple metric that you can calculate from readily available statistics (putouts, assists, and innings), and it does a reasonably good job of ranking players by defensive ability. Bill James created the statistic and included it in his 1977 Baseball Abstract.

As described in the previous hack, fielding percentage (FP) does a poor job of differentiating between players. RF is a much better statistic. RF simply measures the average number of putouts plus assists (successful defensive plays) per nine innings. The idea is that all players will have about the same number of fielding opportunities, so we should measure players by the number of successfully fielded balls. It excludes errors because they just don’t tell you very much. Here’s the formula, where InnOuts is the number of outs for which the player was in the field (27 outs make one nine-inning game):

	RF = (PO + A) / (InnOuts / 27)

Sample Code

Let’s calculate range factor in SQL first. We’ll start by building a table with a summary of fielding statistics by player (in all positions played by that player):

	-- assumes the indexes from
	-- "Measure Batting with Batting Average" [Hack #40]
	-- are there
	create index fielding_idx on fielding(idxLahman);
	create index fielding_tids on fielding(idxTeams);
	create table f_and_t as
	select idxLahman, GROUP_CONCAT(franchID SEPARATOR ",") as teamIDs,
	   pos, sum(i.G) as G, sum(i.GS) as GS, sum(i.InnOuts) as InnOuts,
	   sum(i.PO) as PO, sum(i.A) as A, sum(i.E) as E,
	   sum(i.DP) as DP, sum(i.PB) as PB, teamG, yearID
	from (select f.*, t.G as teamG, tf.franchID, t.yearID
	    from fielding f inner join teams t inner join teamsFranchises tf
	    where f.idxTeams=t.idxTeams
	     and t.idxTeamsFranchises=tf.idxTeamsFranchises) i
	group by idxLahman, yearID, pos;

We want to use the number of outs played to calculate range factor, but this information is not available for all players during all seasons. When this is not available, we’ll approximate range factor by using the number of games played:

	create table rf_t as
	select *, (PO + A) / (InnOuts / 27) as exactRF,
	  (PO + A) / G as approxRF,
	  (PO + A) /
	    (CASE WHEN InnOuts is not null
	      THEN InnOuts / 27
	      ELSE G END) as RF
	from f_and_t;
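
The fallback in the query above is easy to check by hand. Here is a minimal sketch of the same arithmetic in Python rather than SQL; the function name range_factor and the sample stat line are made up for illustration and are not part of the hack's database:

```python
from typing import Optional

def range_factor(po: int, a: int, g: int, inn_outs: Optional[int]) -> Optional[float]:
    """Plays made per nine innings, falling back to plays per game."""
    plays = po + a
    if inn_outs is None:
        # No outs-played data for this season: approximate with games.
        return plays / g if g else None
    # 27 outs make one nine-inning game.
    return plays / (inn_outs / 27) if inn_outs else None

# Hypothetical stat line, not taken from the database.
exact = range_factor(po=250, a=450, g=150, inn_outs=3900)
approx = range_factor(po=250, a=450, g=150, inn_outs=None)
print(round(exact, 3), round(approx, 3))  # prints 4.846 4.667
```

The two values differ because this made-up player averaged fewer than nine innings per game, which is why the exact version is preferred whenever InnOuts is available.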

As a standard for range factor, we’ll consider only players who averaged more than six innings in the field for every game played by their team in a year.

Summary statistics.

Here is the code for calculating summary statistics in R. Range factor depends greatly on position, so we’ll calculate it separately for each position in our database:

	f_and_t.query <- dbSendQuery(bbdb.con, "SELECT * FROM f_and_t")
	f_and_t <- fetch(f_and_t.query, n=-1)
	f_and_t$G <- as.integer(f_and_t$G)
	f_and_t$GS <- as.integer(f_and_t$GS)
	f_and_t$InnOuts <- as.integer(f_and_t$InnOuts)
	f_and_t$PO <- as.integer(f_and_t$PO)
	f_and_t$A <- as.integer(f_and_t$A)
	f_and_t$E <- as.integer(f_and_t$E)
	f_and_t$DP <- as.integer(f_and_t$DP)
	f_and_t$PB <- as.integer(f_and_t$PB)
	f_and_t$yearID <- as.integer(f_and_t$yearID)
	# InnOuts counts outs, so six innings per team game is 18 outs
	f_and_t$qualify <- f_and_t$InnOuts > 18 * f_and_t$teamG
	attach(f_and_t)
	RF <- (PO + A) / ifelse(is.na(InnOuts), G, InnOuts / 27)
	f_and_t$RF <- RF

To show a summary of range factor by position, we can use R’s tapply() function. This function lets us calculate summary statistics separately for each fielding position. (See the R help files for more information.)

	>tapply(RF,INDEX=pos,FUN=summary)
	$"1B"
	 Min.  1st Qu.   Median    Mean   3rd Qu.     Max.      NA's
	0.000    5.667    8.333   7.482    9.727    27.000     3.000

	$"2B"
	 Min.  1st Qu.  Median    Mean    3rd Qu.      Max.     NA's
	0.000    3.000   4.385   3.971     5.196    18.000     1.000

	$"3B"
	 Min.  1st Qu.  Median    Mean   3rd Qu.    Max.      NA's
	0.000    1.500   2.403   2.231     3.000   9.000     2.000

	$C
	 Min. 1st Qu.  Median     Mean  3rd Qu.    Max.
	0.000   3.732   4.867    4.727    5.909  27.000

	$CF
	 Min.  1st Qu.  Median    Mean  3rd Qu.     Max.
	0.000    1.667   2.308   2.281    2.750   18.000

	$DH
	 Min.  1st Qu.  Median    Mean  3rd Qu.    Max.    NA's
	                                                   6154

	$LF
	 Min.  1st Qu.  Median    Mean   3rd Qu.    Max.    NA's
	0.000    1.000   1.684   1.651     2.114  18.000   1.000

	$OF
	 Min.  1st Qu.  Median    Mean   3rd Qu.    Max.
	0.000    1.209   1.762   1.657     2.150   8.000

	$P
	 Min.  1st Qu.  Median    Mean   3rd Qu.    Max.    NA's
	0.000   1.432    1.990   2.108     2.613  54.000  76.000

	$RF
	 Min. 1st Qu.  Median    Mean    3rd Qu.    Max.    NA's
	0.000   1.031   1.791   1.729      2.213  18.000   1.000

	$SS
	 Min. 1st Qu.  Median    Mean   3rd Qu.    Max.
	0.000   2.500   4.000   3.623     4.833  18.000
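
If you want to see what tapply() is doing without R, the same split-and-summarize step can be sketched in plain Python. This is only an illustration: the rows are invented, the function name summarize_by_position is my own, and it reports fewer statistics than R's summary():

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (position, range factor) pairs; None marks a missing value.
rows = [("SS", 4.0), ("SS", 5.0), ("SS", 3.0),
        ("1B", 8.0), ("1B", 10.0), ("DH", None)]

def summarize_by_position(rows):
    """Group range factors by position and summarize each group."""
    groups = defaultdict(list)
    for pos, rf in rows:
        groups[pos].append(rf)
    summary = {}
    for pos, values in groups.items():
        present = [v for v in values if v is not None]
        summary[pos] = {
            "min": min(present) if present else None,
            "median": median(present) if present else None,
            "mean": mean(present) if present else None,
            "max": max(present) if present else None,
            "missing": len(values) - len(present),
        }
    return summary

for pos, stats in sorted(summarize_by_position(rows).items()):
    print(pos, stats)
```

As in the R output, a position with no usable values (DH here) shows up with nothing but a count of missing entries.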

Let’s look at the values taken by range factor. The players with the highest range factors are not necessarily the best fielders in baseball, because the statistic depends heavily on position. Infielders get more fielding opportunities than outfielders do, and first basemen get the most of all. These guys get credit for a lot of putouts, which probably means that the other infielders are doing a great job throwing them the ball.

Also, notice the high putout rates for catchers. One reason for this is that catchers are credited with putouts on strikeouts. This isn’t to demean catching; catching is a very tough position, probably the hardest defensive position in baseball. But you can’t compare RF for catchers to RF for outfielders.

Clearly, the only way you can meaningfully compare players with range factor is on a position-by-position basis. Even then, you need to be careful. This statistic can be interesting and informative, but it’s not just a competency measurement.

Top 10.

Here are the top 10 player seasons by range factor for shortstops, second basemen, third basemen, and outfielders. I chose to include only players who played in more than half of their team’s games. (This isn’t a perfect restriction. I’d rather include only players who played in at least half of the innings played, or in more than half of the opponents’ plate appearances. But we don’t have the information to do that.) I also include only seasons from the past 50 years (1955 and later).

	mysql> -- top ten shortstops by range factor
	mysql> select substr(teamIDs,1,3) as teamID, pos,
	 -> m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
	 -> from rf_t f inner join master m
	 -> on f.idxLahman = m.idxLahman
	 -> where pos="SS" and yearID > 1954
	 -> and G * 2 > teamG
	 -> order by RF DESC LIMIT 10;
	+--------+-----+-----------+-----------+--------+--------+------+
	| teamID | pos | nameLast  | nameFirst | yearID | RF     | G    |
	+--------+-----+-----------+-----------+--------+--------+------+
	| STL    | SS  | Templeton | Garry     |   1980 | 5.8609 |  115 |
	| STL    | SS  | Smith     | Ozzie     |   1982 | 5.8561 |  139 |
	| SDP    | SS  | Smith     | Ozzie     |   1981 | 5.8364 |  110 |
	| SDP    | SS  | Smith     | Ozzie     |   1980 | 5.7532 |  158 |
	| MIL    | SS  | Yount     | Robin     |   1981 | 5.7097 |   93 |
	| STL    | SS  | Templeton | Garry     |   1981 | 5.6842 |   76 |
	| TBD    | SS  | Martinez  | Felix     |   2000 | 5.6778 |  106 |
	| CHW    | SS  | Aparicio  | Luis      |   1960 | 5.5948 |  153 |
	| MIL    | SS  | Yount     | Robin     |   1978 | 5.5920 |  125 |
	| LAD    | SS  | Zimmer    | Don       |   1958 | 5.5877 |  114 |
	+--------+-----+-----------+-----------+--------+--------+------+
	10 rows in set (0.40 sec)

	mysql> -- top ten second basemen by range factor
	mysql> select substr(teamIDs,1,3) as teamID, pos,
	 -> m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
	 -> from rf_t f inner join master m
	 -> on f.idxLahman = m.idxLahman
	 -> where pos="2B" and yearID > 1954
	 -> and G * 2 > teamG
	 -> order by RF DESC LIMIT 10;
	+--------+-----+------------+-----------+--------+--------+------+
	| teamID | pos | nameLast   | nameFirst | yearID | RF     | G    |
	+--------+-----+------------+-----------+--------+--------+------+
	| ATL    | 2B  | Hubbard    | Glenn     |   1985 | 6.2714 |  140 |
	| PIT    | 2B  | Cash       | Dave      |   1972 | 6.2062 |   97 |
	| PIT    | 2B  | Mazeroski  | Bill      |   1963 | 6.1304 |  138 |
	| BAL    | 2B  | Grich      | Bobby     |   1975 | 6.0467 |  150 |
	| PIT    | 2B  | Mazeroski  | Bill      |   1961 | 6.0197 |  152 |
	| STL    | 2B  | Blasingame | Don       |   1956 | 5.9490 |   98 |
	| PIT    | 2B  | Stennett   | Rennie    |   1974 | 5.9481 |  154 |
	| PIT    | 2B  | Stennett   | Rennie    |   1976 | 5.9363 |  157 |
	| PHI    | 2B  | Trillo     | Manny     |   1980 | 5.9071 |  140 |
	| SDP    | 2B  | Fuentes    | Tito      |   1975 | 5.8944 |  142 |
	+--------+-----+------------+-----------+--------+--------+------+
	10 rows in set (0.44 sec)

	mysql> -- top ten third basemen by range factor
	mysql> select substr(teamIDs,1,3) as teamID, pos,
	 -> m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
	 -> from rf_t f inner join master m
	 -> on f.idxLahman = m.idxLahman
	 -> where pos="3B" and yearID > 1954
	 -> and G * 2 > teamG
	 -> order by RF DESC LIMIT 10;
	+--------+-----+----------+-----------+--------+--------+------+
	| teamID | pos | nameLast | nameFirst | yearID | RF     | G    |
	+--------+-----+----------+-----------+--------+--------+------+
	| NYY    | 3B  | Boyer    | Clete     |   1962 | 3.7134 |  157 |
	| NYY    | 3B  | Boyer    | Clete     |   1966 | 3.6824 |   85 |
	| TEX    | 3B  | Bell     | Buddy     |   1982 | 3.6345 |  145 |
	| OAK    | 3B  | Lopez    | Hector    |   1955 | 3.6237 |   93 |
	| TEX    | 3B  | Bell     | Buddy     |   1981 | 3.6146 |   96 |
	| CLE    | 3B  | Nettles  | Graig     |   1971 | 3.6139 |  158 |
	| CHC    | 3B  | Santo    | Ron       |   1967 | 3.6025 |  161 |
	| NYY    | 3B  | Boyer    | Clete     |   1961 | 3.5745 |  141 |
	| CHC    | 3B  | Santo    | Ron       |   1966 | 3.5592 |  152 |
	| STL    | 3B  | Boyer    | Ken       |   1958 | 3.5139 |  144 |
	+--------+-----+----------+-----------+--------+--------+------+
	10 rows in set (0.41 sec)

	mysql> -- top ten outfielders by range factor
	mysql> select substr(teamIDs,1,3) as teamID, pos,
	 -> m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
	 -> from rf_t f inner join master m
	 -> on f.idxLahman = m.idxLahman
	 -> where pos IN ("LF", "CF", "RF", "OF") and yearID > 1954
	 -> and G * 2 > teamG
	 -> order by RF DESC LIMIT 10;
	+--------+-----+----------+-----------+--------+--------+------+
	| teamID | pos | nameLast | nameFirst | yearID | RF     | G    |
	+--------+-----+----------+-----------+--------+--------+------+
	| MIN    | OF  | Puckett  | Kirby     |   1984 | 3.5469 |  128 |
	| CHW    | OF  | Lemon    | Chet      |   1977 | 3.5168 |  149 |
	| SEA    | CF  | Cameron  | Mike      |   2003 | 3.4206 |  147 |
	| ANA    | CF  | Erstad   | Darin     |   2002 | 3.3942 |  143 |
	| PHI    | OF  | Ashburn  | Richie    |   1956 | 3.3377 |  154 |
	| PHI    | OF  | Ashburn  | Richie    |   1957 | 3.3333 |  156 |
	| PHI    | OF  | Ashburn  | Richie    |   1958 | 3.3092 |  152 |
	| MIN    | CF  | Hunter   | Torii     |   2001 | 3.2934 |  147 |
	| OAK    | OF  | Murphy   | Dwayne    |   1980 | 3.2911 |  158 |
	| WSN    | OF  | Dawson   | Andre     |   1981 | 3.2718 |  103 |
	+--------+-----+----------+-----------+--------+--------+------+
	10 rows in set (0.44 sec)
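
All four queries above apply the same three steps: filter by position and year, keep players who appeared in more than half of their team's games, and sort by RF. A rough Python restatement follows, assuming rows shaped like rf_t; the player names and numbers are invented, and top_by_range_factor is a hypothetical helper rather than anything in the database:

```python
# Hypothetical player-seasons; the numbers are made up for illustration.
seasons = [
    {"name": "A", "pos": "SS", "yearID": 1980, "G": 115, "teamG": 162, "RF": 5.86},
    {"name": "B", "pos": "SS", "yearID": 1950, "G": 150, "teamG": 154, "RF": 6.10},
    {"name": "C", "pos": "SS", "yearID": 1990, "G": 40, "teamG": 162, "RF": 6.50},
    {"name": "D", "pos": "SS", "yearID": 2000, "G": 106, "teamG": 161, "RF": 5.68},
    {"name": "E", "pos": "2B", "yearID": 1985, "G": 140, "teamG": 162, "RF": 6.27},
]

def top_by_range_factor(seasons, positions, first_year=1955, limit=10):
    """Rank qualifying player-seasons at the given positions by RF."""
    qualified = [
        s for s in seasons
        if s["pos"] in positions
        and s["yearID"] >= first_year
        and s["G"] * 2 > s["teamG"]  # played in more than half of team games
        and s["RF"] is not None
    ]
    return sorted(qualified, key=lambda s: s["RF"], reverse=True)[:limit]

print([s["name"] for s in top_by_range_factor(seasons, {"SS"})])  # prints ['A', 'D']
```

Player B drops out on the year filter and player C on the games filter, even though both have higher range factors than the two who remain. That is the whole point of the qualifying rule: small samples and old seasons would otherwise crowd the top of the list.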

Histogram.

There’s a twist on the way I’m presenting statistics in this section. In most sections, I’ve shown just one histogram for the past 10 years. However, range factors vary greatly by position. So, instead, I’m showing histograms of range factors for all positions, using lattice charts as a trick. For more on using these plots, see “Compare Teams and Players with Lattices” [Hack #35].

Using the same qualifying rule (an average of six innings per game) and restricting the data to the last 25 years, let’s plot the distribution of range factor by defensive position:

	>library(lattice)
	>trellis.device(color=FALSE)
	>histogram(~RF|pos,xlab="Range Factor", data=subset(f_and_t,
	 f_and_t$yearID > 1980),nint=30)

In Figure 5-14, notice the much greater range factors for infielders than for outfielders.

Figure 5-14. Range factor distribution by position

Also notice that first basemen and catchers tend to have a wider distribution of range factors.

Box plot.

For lack of space, I’m omitting box plots (I’d want to show them separately for all positions). See “Measure Fielding with Linear Weights” [Hack #55] for a discussion of how fielding has changed over time.

Hacking the Hack

STATS, Inc. publishes a statistic called zone rating (ZR) that is related to RF. STATS divides the playing field into a set of “zones,” each of which is the primary responsibility of a defensive player. STATS scorekeepers carefully record where each ball lands and use this information to measure the quality of defensive players.

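Zone rating is defined, in essence, as the share of balls hit into a fielder’s zone that he converts into outs:

	ZR = Outs made on balls in fielder’s zone / Balls hit into fielder’s zone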

ZR seems as though it should be a better measurement than RF. RF doesn’t take into account how many balls are hit near a fielder. Many things distort RF, including ballpark shapes (for example, the Green Monster in left field at Fenway Park), pitcher types (for example, sinkerball pitchers who get mostly ground ball outs), and luck. ZR counts actual fielding opportunities, so it should be more accurate.

Unfortunately, the Baseball Archive and Baseball DataBank databases don’t include the raw data to calculate ZR. (They do include ZR from some recent years.) You can, however, calculate an approximation from the Retrosheet event files.

I decided to try to calculate a quick-and-dirty version of ZR, which I’m calling Zone Rating Jr. (ZRjr), from Retrosheet event files. I started with the database described in “Make a Historical Play-by-Play Database” [Hack #22]. First, I decided to determine roughly where each ball landed and track whether the ball was turned into an out. The event files sometimes include extra information on where balls were hit. The BEVENT program will write this information to the hit_location field. (You can find a chart to interpret the hit_location field at http://www.retrosheet.org/location.htm.)

In cases where the hit_location field was not blank and a ball did not go between two defensive players or go out of the park, I assigned responsibility to the player whose position matched the location. In cases where the hit_location field was blank, I assigned responsibility to the player who fielded the ball.

I created a temporary table with this information and called it zrjr_inner:

	create table zrjr_inner
	as select outs_on_play,
	   if (mid(hit_location,2,1) IN (1,2,3,4,5,6,7,8,9), 0, 1)
	      as opportunities,
	   hit_location,  event_text, fielded_by,
	   mid(game_id,4,4) as yearID,
	   if(length(hit_location) > 0 and mid(hit_location,2,1)
	      NOT IN (1,2,3,4,5,6,7,8,9),
	      left(hit_location,1), fielded_by) as pos,
	   case when batting_team=0
	    then left(game_id,3)
	    else visiting_team end as teamID,
	   case
	    if(outs_on_play=0 and batted_ball_type='G'
	        and fielded_by IN (7, 8, 9) and sf_flag='F'
	        and length(hit_location) > 0,
	      left(hit_location,1), fielded_by)
	    when '1' then pitcher
	    when '2' then catcher
	    when '3' then first_base
	    when '4' then second_base
	    when '5' then third_base
	    when '6' then shortstop
	    when '7' then left_field
	    when '8' then center_field
	    when '9' then right_field
	   end as player
	from pbp.pbp2k
	where fielded_by > 0 -- fielded_by=0 for HR, SO
	;
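
The opportunities and pos expressions in that query are terse, so here is how I read them, restated as a small Python sketch. The function names are my own, and the sample codes ("56", "6MS", "78XD") are only illustrative strings in the style of the Retrosheet location chart:

```python
BETWEEN_DIGITS = set("123456789")

def second_char_is_fielder(hit_location):
    """True when the location code names two fielders, e.g. '56' or '78XD'."""
    return (bool(hit_location) and len(hit_location) > 1
            and hit_location[1] in BETWEEN_DIGITS)

def opportunities(hit_location):
    """0 for a ball hit between two fielders, 1 otherwise (including blank)."""
    return 0 if second_char_is_fielder(hit_location) else 1

def responsible_pos(hit_location, fielded_by):
    """Position charged with the ball, as a one-character string."""
    if hit_location and not second_char_is_fielder(hit_location):
        return hit_location[0]   # zone owner from the location code
    return str(fielded_by)       # blank or between fielders: whoever fielded it

for loc, fielder in [("56", 6), ("6MS", 8), ("78XD", 8), ("", 4)]:
    print(repr(loc), opportunities(loc), responsible_pos(loc, fielder))
```

One consequence worth noticing: a ball hit between two fielders is assigned to whoever fielded it but does not count as an opportunity for him.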

Next, I created an index on this table for fast summaries, and then I calculated ZRjr values for each player during each season. I decided to use the following formula:

	ZRjr = Outs made on plays fielded by fielder / Balls fielded by fielder

Here is the code that I used to calculate these values:

	create index zrjr_inner_idx on zrjr_inner(yearID,teamID,player);

	create table zrjr as
	select yearID, teamID, player, pos,
	    sum(outs_on_play) as outs,
	    sum(opportunities) as opportunities,
	    if(sum(opportunities) > 0,
	      sum(outs_on_play) / sum(opportunities),
	      null) as ZRjr
	from zrjr_inner
	where player is not null
	group by yearID, teamID, player, pos
	;
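
To make the aggregation concrete, here is a minimal Python sketch of the same group-and-divide step, assuming each play is a row shaped like zrjr_inner. The player ID, team code, and plays are invented, and zrjr_table is a hypothetical helper:

```python
from collections import defaultdict

def zrjr_table(plays):
    """Sum outs and opportunities per (yearID, teamID, player, pos), then divide."""
    totals = defaultdict(lambda: [0, 0])
    for p in plays:
        if p["player"] is None:
            continue  # mirrors "where player is not null"
        key = (p["yearID"], p["teamID"], p["player"], p["pos"])
        totals[key][0] += p["outs_on_play"]
        totals[key][1] += p["opportunities"]
    return {
        key: {"outs": outs, "opportunities": opps,
              "ZRjr": outs / opps if opps > 0 else None}
        for key, (outs, opps) in totals.items()
    }

# Hypothetical plays for one made-up shortstop.
base = {"yearID": "2000", "teamID": "AAA", "player": "xxxxx001", "pos": "6"}
plays = [
    dict(base, outs_on_play=1, opportunities=1),  # routine out
    dict(base, outs_on_play=2, opportunities=1),  # double play
    dict(base, outs_on_play=0, opportunities=1),  # ball in his zone, no out
    dict(base, outs_on_play=1, opportunities=0),  # out on a ball between fielders
]
print(zrjr_table(plays))
```

Here the shortstop ends up with four outs on three opportunities, a ZRjr of about 1.33.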

To show the top fielders with this ranking (for example, the top shortstops), you can use a query like this:

	mysql> select r.lastName, r.firstName, yearID, teamID, ZRjr
	  -> from zrjr l inner join rosters r
	  -> on l.player=r.retroID and l.teamID=r.team and l.yearID=r.year
	  -> where opportunities > 100 and pos=6
	  -> order by ZRjr desc limit 10;
	+-----------+-----------+--------+--------+--------+
	| lastName  | firstName | yearID | teamID | ZRjr   |
	+-----------+-----------+--------+--------+--------+
	| Sanchez   | Rey       | 2001   | KCA    | 1.1150 |
	| Bordick   | Mike      | 2002   | BAL    | 1.1008 |
	| Rodriguez | Alex      | 2000   | SEA    | 1.0985 |
	| Tejada    | Miguel    | 2000   | OAK    | 1.0981 |
	| Vizquel   | Omar      | 2000   | CLE    | 1.0948 |
	| Clayton   | Royce     | 2002   | CHA    | 1.0935 |
	| Lopez     | Felipe    | 2002   | TOR    | 1.0917 |
	| Cruz      | Deivi     | 2000   | DET    | 1.0911 |
	| Valentin  | Jose      | 2002   | CHA    | 1.0909 |
	| Sanchez   | Rey       | 2000   | KCA    | 1.0907 |
	+-----------+-----------+--------+--------+--------+
	10 rows in set (0.03 sec)

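This is a very coarse calculation, but it might be a little better than range factor in some cases. Note that ZRjr can exceed 1, as every value in the table above does: outs_on_play counts every out recorded on a play (two on a double play), and outs made on balls hit between two fielders are added to the numerator without adding an opportunity to the denominator.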
