Use range factor to measure how much a defensive player works.
Range factor (RF) represents the number of balls a player fields per nine innings. It’s a simple metric that you can calculate from readily available statistics (putouts, assists, and innings), and it does a reasonably good job of ranking players by defensive ability. Bill James created the statistic and included it in his 1977 Baseball Abstract.
As described in the previous hack, FP does a poor job of differentiating between players. RF is a much better statistic. RF simply measures the average number of putouts plus assists (successful defensive plays) per nine innings. The idea is that all players will have about the same number of fielding opportunities, so we should measure players by the number of successfully fielded balls. It excludes errors because they just don’t tell you very much. Here’s the formula:
RF = 9 × (PO + A) / (InnOuts / 3)
InnOuts is the number of outs for which the player was in the field, so InnOuts / 3 is innings played, and the formula is equivalent to (PO + A) / (InnOuts / 27).
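To see what the formula produces, here is a minimal sketch in R using made-up season totals (the numbers are illustrative, not taken from the database). It assumes that InnOuts counts outs played in the field, so InnOuts / 3 is innings played. Dividing by InnOuts / 27, as the SQL in this hack does, gives the same rate scaled to nine innings.

```r
# Hypothetical season totals for one shortstop (illustrative numbers only)
PO      <- 250    # putouts
A       <- 450    # assists
InnOuts <- 3900   # outs played in the field (3 per inning)

innings    <- InnOuts / 3                  # 1300 innings
per_inning <- (PO + A) / innings           # plays made per inning
per_nine   <- (PO + A) / (InnOuts / 27)    # plays made per nine innings

# The two rates differ only by a factor of nine
stopifnot(isTRUE(all.equal(per_nine, 9 * per_inning)))
round(c(per_inning = per_inning, per_nine = per_nine), 3)
```

The per-nine figure (about 4.85 here) is on the same scale as the shortstop values reported later in this hack.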
Let’s calculate range factor in SQL first. We’ll start by building a table with a summary of fielding statistics by player (in all positions played by that player):
-- assumes indexes from "Measure Batting with Batting Average" [Hack #40] are there
create index fielding_idx on fielding(idxLahman);
create index fielding_tids on fielding(idxTeams);
create table f_and_t as
select idxLahman,
       GROUP_CONCAT(franchID SEPARATOR ",") as teamIDs,
       pos,
       sum(i.G) as G, sum(i.GS) as GS, sum(i.InnOuts) as InnOuts,
       sum(i.PO) as PO, sum(i.A) as A, sum(i.E) as E,
       sum(i.DP) as DP, sum(i.PB) as PB,
       teamG, yearID
from (select f.*, t.G as teamG, tf.franchID, t.yearID
      from fielding f inner join teams t inner join teamsFranchises tf
      where f.idxTeams=t.idxTeams
        and t.idxTeamsFranchises=tf.idxTeamsFranchises) i
group by idxLahman, yearID, pos;
We want to use the number of outs played to calculate range factor, but this information is not available for all players during all seasons. When this is not available, we’ll approximate range factor by using the number of games played:
create table rf_t as
select *,
       (PO + A) / (InnOuts / 27) as exactRF,
       (PO + A) / G as approxRF,
       (PO + A) / (CASE WHEN InnOuts is not null
                        THEN InnOuts / 27
                        ELSE G END) as RF
from f_and_t;
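The effect of the fallback is easier to see on a few rows. This sketch mirrors the three columns of rf_t in R on a toy data frame; the players and numbers are invented for illustration, and NA stands in for a season with no InnOuts recorded.

```r
# Toy fielding lines (made-up numbers); NA marks a season without InnOuts
toy <- data.frame(
  player  = c("regular", "late_sub", "old_timer"),
  G       = c(150, 60, 140),
  InnOuts = c(3900, 360, NA),
  PO      = c(250, 20, 300),
  A       = c(450, 30, 400)
)

toy$exactRF  <- (toy$PO + toy$A) / (toy$InnOuts / 27)  # per nine innings
toy$approxRF <- (toy$PO + toy$A) / toy$G               # per game played
# Prefer the innings-based value; fall back to games when InnOuts is missing
toy$RF       <- ifelse(is.na(toy$InnOuts), toy$approxRF, toy$exactRF)
toy
```

The late-inning substitute shows why the games-based figure is only an approximation: he appears in 60 games but plays few innings, so approxRF (about 0.83) badly understates his innings-based rate (3.75).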
As a standard for range factor, we’ll consider only players who averaged more than six innings in the field for every game played by their team in a year.
Here is the code for calculating summary statistics in R. Range factor depends greatly on position, so we’ll calculate it separately for each position in our database:
f_and_t.query <- dbSendQuery(bbdb.con, "SELECT * FROM f_and_t")
f_and_t <- fetch(f_and_t.query, n=-1)
# make sure the summed columns are integers
f_and_t$G <- as.integer(f_and_t$G)
f_and_t$GS <- as.integer(f_and_t$GS)
f_and_t$InnOuts <- as.integer(f_and_t$InnOuts)
f_and_t$PO <- as.integer(f_and_t$PO)
f_and_t$A <- as.integer(f_and_t$A)
f_and_t$E <- as.integer(f_and_t$E)
f_and_t$DP <- as.integer(f_and_t$DP)
f_and_t$PB <- as.integer(f_and_t$PB)
f_and_t$yearID <- as.integer(f_and_t$yearID)
# InnOuts counts outs, so convert to innings before comparing
# with six innings per team game
f_and_t$qualify <- f_and_t$InnOuts / 3 > 6 * f_and_t$teamG
attach(f_and_t)
# innings-based RF when InnOuts is known, games-based otherwise
RF <- (PO + A) / ifelse(is.na(InnOuts), G, InnOuts / 27)
f_and_t$RF <- RF
To show a summary of range factor by position, we can use R’s tapply() function. This function lets us calculate summary statistics separately for each fielding position. (See the R help files for more information.)
> tapply(RF, INDEX=pos, FUN=summary)
$"1B"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   5.667   8.333   7.482   9.727  27.000   3.000
$"2B"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   3.000   4.385   3.971   5.196  18.000   1.000
$"3B"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   1.500   2.403   2.231   3.000   9.000   2.000
$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.732   4.867   4.727   5.909  27.000
$CF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.667   2.308   2.281   2.750  18.000
$DH
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
                                                   6154
$LF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   1.000   1.684   1.651   2.114  18.000   1.000
$OF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.209   1.762   1.657   2.150   8.000
$P
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   1.432   1.990   2.108   2.613  54.000  76.000
$RF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   1.031   1.791   1.729   2.213  18.000   1.000
$SS
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.500   4.000   3.623   4.833  18.000
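If tapply() is unfamiliar, this toy example shows the grouping it performs: the values in the first argument are split by the labels in INDEX, and FUN is applied to each group. The values are made up, and median stands in for summary to keep the result short.

```r
# Made-up range factors for a handful of player-seasons
rf  <- c(8.1, 9.4, 4.2, 5.0, 4.6, 2.1)
pos <- c("1B", "1B", "SS", "SS", "SS", "CF")

# One value per position, named by position
by_pos <- tapply(rf, INDEX = pos, FUN = median)
by_pos
```

The result is a named vector with one entry per distinct position, which is why the real output above is organized by $"1B", $"2B", and so on.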
Let’s look at the values taken by range factor. The players who rank at the top in terms of range factor are not the best fielders in baseball. Infielders get more fielding opportunities than outfielders do. First basemen (especially) get a lot of fielding opportunities. These guys get credit for a lot of putouts, which probably means that the other infielders are doing a great job throwing them the ball.
Also, notice the high putout rates for catchers. One reason for this is that catchers are credited with putouts on strikeouts. This isn’t to demean catching; catching is a very tough position, probably the hardest defensive position in baseball. But you can’t compare RF for catchers to RF for outfielders.
Clearly, the only way you can meaningfully compare players with range factor is on a position-by-position basis. Even then, you need to be careful. This statistic can be interesting and informative, but it’s not just a competency measurement.
Here are the top 10 players by range factor for shortstop, third base, second base, and outfield positions. I chose to include only players who played in more than half of their team’s games. (This isn’t a perfect restriction. I’d rather include only players who played in at least half of the innings played, or in more than half of the opponents’ plate appearances. But we don’t have the information to do that.) I also include only players during the past 50 years.
mysql> -- top ten shortstops by range factor
mysql> select substr(teamIDs,1,3) as teamID, pos,
    ->   m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
    -> from rf_t f inner join master m
    ->   on f.idxLahman = m.idxLahman
    -> where pos="SS" and yearID > 1954
    ->   and G * 2 > teamG
    -> order by RF DESC LIMIT 10;
+--------+-----+-----------+-----------+--------+--------+------+
| teamID | pos | nameLast  | nameFirst | yearID | RF     | G    |
+--------+-----+-----------+-----------+--------+--------+------+
| STL    | SS  | Templeton | Garry     |   1980 | 5.8609 |  115 |
| STL    | SS  | Smith     | Ozzie     |   1982 | 5.8561 |  139 |
| SDP    | SS  | Smith     | Ozzie     |   1981 | 5.8364 |  110 |
| SDP    | SS  | Smith     | Ozzie     |   1980 | 5.7532 |  158 |
| MIL    | SS  | Yount     | Robin     |   1981 | 5.7097 |   93 |
| STL    | SS  | Templeton | Garry     |   1981 | 5.6842 |   76 |
| TBD    | SS  | Martinez  | Felix     |   2000 | 5.6778 |  106 |
| CHW    | SS  | Aparicio  | Luis      |   1960 | 5.5948 |  153 |
| MIL    | SS  | Yount     | Robin     |   1978 | 5.5920 |  125 |
| LAD    | SS  | Zimmer    | Don       |   1958 | 5.5877 |  114 |
+--------+-----+-----------+-----------+--------+--------+------+
10 rows in set (0.40 sec)

mysql> -- top ten second basemen by range factor
mysql> select substr(teamIDs,1,3) as teamID, pos,
    ->   m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
    -> from rf_t f inner join master m
    ->   on f.idxLahman = m.idxLahman
    -> where pos="2B" and yearID > 1954
    ->   and G * 2 > teamG
    -> order by RF DESC LIMIT 10;
+--------+-----+------------+-----------+--------+--------+------+
| teamID | pos | nameLast   | nameFirst | yearID | RF     | G    |
+--------+-----+------------+-----------+--------+--------+------+
| ATL    | 2B  | Hubbard    | Glenn     |   1985 | 6.2714 |  140 |
| PIT    | 2B  | Cash       | Dave      |   1972 | 6.2062 |   97 |
| PIT    | 2B  | Mazeroski  | Bill      |   1963 | 6.1304 |  138 |
| BAL    | 2B  | Grich      | Bobby     |   1975 | 6.0467 |  150 |
| PIT    | 2B  | Mazeroski  | Bill      |   1961 | 6.0197 |  152 |
| STL    | 2B  | Blasingame | Don       |   1956 | 5.9490 |   98 |
| PIT    | 2B  | Stennett   | Rennie    |   1974 | 5.9481 |  154 |
| PIT    | 2B  | Stennett   | Rennie    |   1976 | 5.9363 |  157 |
| PHI    | 2B  | Trillo     | Manny     |   1980 | 5.9071 |  140 |
| SDP    | 2B  | Fuentes    | Tito      |   1975 | 5.8944 |  142 |
+--------+-----+------------+-----------+--------+--------+------+
10 rows in set (0.44 sec)

mysql> -- top ten third basemen by range factor
mysql> select substr(teamIDs,1,3) as teamID, pos,
    ->   m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
    -> from rf_t f inner join master m
    ->   on f.idxLahman = m.idxLahman
    -> where pos="3B" and yearID > 1954
    ->   and G * 2 > teamG
    -> order by RF DESC LIMIT 10;
+--------+-----+----------+-----------+--------+--------+------+
| teamID | pos | nameLast | nameFirst | yearID | RF     | G    |
+--------+-----+----------+-----------+--------+--------+------+
| NYY    | 3B  | Boyer    | Clete     |   1962 | 3.7134 |  157 |
| NYY    | 3B  | Boyer    | Clete     |   1966 | 3.6824 |   85 |
| TEX    | 3B  | Bell     | Buddy     |   1982 | 3.6345 |  145 |
| OAK    | 3B  | Lopez    | Hector    |   1955 | 3.6237 |   93 |
| TEX    | 3B  | Bell     | Buddy     |   1981 | 3.6146 |   96 |
| CLE    | 3B  | Nettles  | Graig     |   1971 | 3.6139 |  158 |
| CHC    | 3B  | Santo    | Ron       |   1967 | 3.6025 |  161 |
| NYY    | 3B  | Boyer    | Clete     |   1961 | 3.5745 |  141 |
| CHC    | 3B  | Santo    | Ron       |   1966 | 3.5592 |  152 |
| STL    | 3B  | Boyer    | Ken       |   1958 | 3.5139 |  144 |
+--------+-----+----------+-----------+--------+--------+------+
10 rows in set (0.41 sec)

mysql> -- top ten outfielders by range factor
mysql> select substr(teamIDs,1,3) as teamID, pos,
    ->   m.nameLast, m.nameFirst, f.yearID, f.RF, f.G
    -> from rf_t f inner join master m
    ->   on f.idxLahman = m.idxLahman
    -> where pos IN ("LF", "CF", "RF", "OF") and yearID > 1954
    ->   and G * 2 > teamG
    -> order by RF DESC LIMIT 10;
+--------+-----+----------+-----------+--------+--------+------+
| teamID | pos | nameLast | nameFirst | yearID | RF     | G    |
+--------+-----+----------+-----------+--------+--------+------+
| MIN    | OF  | Puckett  | Kirby     |   1984 | 3.5469 |  128 |
| CHW    | OF  | Lemon    | Chet      |   1977 | 3.5168 |  149 |
| SEA    | CF  | Cameron  | Mike      |   2003 | 3.4206 |  147 |
| ANA    | CF  | Erstad   | Darin     |   2002 | 3.3942 |  143 |
| PHI    | OF  | Ashburn  | Richie    |   1956 | 3.3377 |  154 |
| PHI    | OF  | Ashburn  | Richie    |   1957 | 3.3333 |  156 |
| PHI    | OF  | Ashburn  | Richie    |   1958 | 3.3092 |  152 |
| MIN    | CF  | Hunter   | Torii     |   2001 | 3.2934 |  147 |
| OAK    | OF  | Murphy   | Dwayne    |   1980 | 3.2911 |  158 |
| WSN    | OF  | Dawson   | Andre     |   1981 | 3.2718 |  103 |
+--------+-----+----------+-----------+--------+--------+------+
10 rows in set (0.44 sec)
There’s a twist on the way I’m presenting statistics in this section. In most sections, I’ve shown just one histogram for the past 10 years. However, range factors vary greatly by position. So, instead, I’m showing histograms of range factors for all positions, using lattice charts as a trick. For more on using these plots, see “Compare Teams and Players with Lattices” [Hack #35].
Using the same qualifying rule (an average of six innings per game), plus including range factors for players over the last 25 years, let’s plot the distribution of range factor by defensive position:
> library(lattice)
> trellis.device(color=FALSE)
> histogram(~RF|pos, xlab="Range Factor",
+   data=subset(f_and_t, yearID > 1980 & qualify), nint=30)
In Figure 5-14, notice the much greater range factors for infielders than for outfielders.
Also notice that first basemen and catchers tend to have a wider distribution of range factors.
For lack of space, I’m omitting box plots (I’d want to show them separately for all positions). See “Measure Fielding with Linear Weights” [Hack #55] for a discussion of how fielding has changed over time.
STATS, Inc. publishes a statistic called zone rating (ZR) that is related to RF. STATS divides the playing field into a set of “zones,” each of which is the primary responsibility of a defensive player. STATS scorekeepers carefully record where each ball lands and use this information to measure the quality of defensive players.
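Zone rating is defined, in essence, as the share of balls hit into a fielder’s zone that he turns into outs:
ZR = Outs made on balls hit into fielder’s zone / Balls hit into fielder’s zone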
ZR seems as though it should be a better measurement than RF. RF doesn’t take into account how many balls are hit near a fielder. Many things distort RF, including ballpark shapes (for example, the Green Monster in left field at Fenway Park), pitcher types (for example, sinkerball pitchers who get mostly ground ball outs), and luck. ZR counts actual fielding opportunities, so it should be more accurate.
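A small numeric sketch makes the distortion concrete. It compares two hypothetical shortstops with identical skill who play behind different pitching staffs. All numbers are invented, outs made stand in for putouts plus assists, and zone rating is treated simply as outs made divided by balls hit into the zone, which is a simplification of the STATS definition.

```r
# Two hypothetical shortstops with identical skill: each converts 85% of the
# balls hit into his zone, over the same 1300 innings (made-up numbers)
balls_in_zone <- c(groundball_staff = 600, flyball_staff = 450)
outs_made     <- 0.85 * balls_in_zone
innings       <- 1300

rf <- 9 * outs_made / innings     # plays per nine innings, as RF measures
zr <- outs_made / balls_in_zone   # share of opportunities converted

round(rbind(rf, zr), 3)
```

The RF values differ (about 3.53 versus 2.65) even though both players convert the same share of their chances, while the opportunity-based rate is 0.85 for both.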
Unfortunately, the Baseball Archive and Baseball DataBank databases don’t include the raw data to calculate ZR. (They do include ZR from some recent years.) You can calculate an approximate zone rating, maybe a ZRjr, from the Retrosheet event files.
I decided to try to calculate a quick-and-dirty version of ZR, which I’m calling Zone Rating Jr. (ZRjr), from Retrosheet event files. I started with the database described in “Make a Historical Play-by-Play Database” [Hack #22]. First, I decided to determine roughly where each ball landed and track whether the ball was turned into an out. The event files sometimes include extra information on where balls were hit. The BEVENT program will write this information to the hit_location field. (You can find a chart to interpret the hit_location field at http://www.retrosheet.org/location.htm.) In cases where the hit_location field was not blank and a ball did not go between two defensive players or go out of the park, I assigned responsibility to the player whose position matched the location. In cases where the hit_location field was blank, I assigned responsibility to the player who fielded the ball.
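The assignment rule can be sketched in a few lines of R. The helper assign_pos() is hypothetical (it is not part of any package or of the database), and the location codes in the example are illustrative. The sketch treats a code whose second character is a digit, such as "78", as a ball hit between two fielders.

```r
# Assign a responsible position from a Retrosheet-style location code.
# assign_pos() is a hypothetical helper written for this illustration.
assign_pos <- function(hit_location, fielded_by) {
  second  <- substr(hit_location, 2, 2)
  between <- second %in% as.character(1:9)    # e.g. "78" is between LF and CF
  use_loc <- nchar(hit_location) > 0 & !between
  # Use the location's position when it is unambiguous; otherwise fall back
  # to whoever actually fielded the ball
  ifelse(use_loc, substr(hit_location, 1, 1), as.character(fielded_by))
}

# A ball at short, a ball in the gap, a blank location, and a deep fly to center
assign_pos(c("6", "78", "", "8D"), c(6, 8, 4, 8))
```

For these four plays the responsible positions come out as 6, 8, 4, and 8: the gap ball and the blank location both fall back to the fielder of record.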
I created a temporary table with this information and called it zrjr_inner:
create table zrjr_inner as
select outs_on_play,
       if(mid(hit_location,2,1) IN (1,2,3,4,5,6,7,8,9), 0, 1) as opportunities,
       hit_location,
       event_text,
       fielded_by,
       mid(game_id,4,4) as yearID,
       if(length(hit_location) > 0
          and mid(hit_location,2,1) NOT IN (1,2,3,4,5,6,7,8,9),
          left(hit_location,1), fielded_by) as pos,
       case when batting_team=0 then left(game_id,3)
            else visiting_team end as teamID,
       case if(outs_on_play=0 and batted_ball_type='G'
               and fielded_by IN (7, 8, 9) and sf_flag='F'
               and length(hit_location) > 0,
               left(hit_location,1), fielded_by)
         when '1' then pitcher
         when '2' then catcher
         when '3' then first_base
         when '4' then second_base
         when '5' then third_base
         when '6' then shortstop
         when '7' then left_field
         when '8' then center_field
         when '9' then right_field
       end as player
from pbp.pbp2k
where fielded_by > 0 -- fielded_by=0 for HR, SO
;
Next, I created an index on this table for fast summaries, and then I calculated ZRjr values for each player during each season. I decided to use the following formula:
ZRjr = Outs made on plays fielded by fielder / Balls fielded by fielder
Here is the code that I used to calculate these values:
create index zrjr_inner_idx on zrjr_inner(yearID,teamID,player);
create table zrjr as
select yearID, teamID, player, pos,
       sum(outs_on_play) as outs,
       sum(opportunities) as opportunities,
       if(sum(opportunities) > 0,
          sum(outs_on_play) / sum(opportunities), null) as ZRjr
from zrjr_inner
where player is not null
group by yearID, teamID, player, pos
;
To show the top fielders with this ranking (for example, the top shortstops), you can use a query like this:
mysql> select r.lastName, r.firstName, yearID, teamID, ZRjr
    -> FROM zrjr l inner join rosters r
    ->   on l.player=r.retroID and l.teamID=r.team and l.yearID=r.year
    -> where opportunities > 100 AND pos=6
    -> ORDER BY ZRjr desc limit 10;
+-----------+-----------+--------+--------+--------+
| lastName  | firstName | yearID | teamID | ZRjr   |
+-----------+-----------+--------+--------+--------+
| Sanchez   | Rey       | 2001   | KCA    | 1.1150 |
| Bordick   | Mike      | 2002   | BAL    | 1.1008 |
| Rodriguez | Alex      | 2000   | SEA    | 1.0985 |
| Tejada    | Miguel    | 2000   | OAK    | 1.0981 |
| Vizquel   | Omar      | 2000   | CLE    | 1.0948 |
| Clayton   | Royce     | 2002   | CHA    | 1.0935 |
| Lopez     | Felipe    | 2002   | TOR    | 1.0917 |
| Cruz      | Deivi     | 2000   | DET    | 1.0911 |
| Valentin  | Jose      | 2002   | CHA    | 1.0909 |
| Sanchez   | Rey       | 2000   | KCA    | 1.0907 |
+-----------+-----------+--------+--------+--------+
10 rows in set (0.03 sec)
This is a very coarse calculation, but it might be a little better than range factor is in some cases.
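One thing the shortstop list makes visible is that ZRjr can exceed 1.0. With the tables as defined in this hack, a double play adds two outs for a single opportunity, and a ball hit between two fielders adds outs without adding an opportunity. The following R sketch reproduces that arithmetic on a few invented play rows shaped like zrjr_inner; it illustrates the aggregation and is not the query used to build the table.

```r
# Made-up play-level rows in the shape of zrjr_inner
plays <- data.frame(
  player        = c("a", "a", "a", "a", "b", "b", "b"),
  outs_on_play  = c(1, 2, 0, 1, 1, 0, 1),   # the 2 is a double play
  opportunities = c(1, 1, 1, 0, 1, 1, 1)    # the 0 is a ball between fielders
)

# Sum outs and opportunities per player, then take the ratio
totals <- aggregate(cbind(outs_on_play, opportunities) ~ player,
                    data = plays, FUN = sum)
totals$ZRjr <- ifelse(totals$opportunities > 0,
                      totals$outs_on_play / totals$opportunities, NA)
totals
```

Player a ends up with four outs on three opportunities, for a ZRjr above 1, while player b gets two outs on three opportunities.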