Calculate the probability of a hit, walk, or out based on the ball–strike count.
The ball–strike count is one of the most important ways pitchers and hitters determine their strategy. Most batters will be ready to swing on an 0,2 (no balls, two strikes) count, but many batters won’t swing on a 3,0 count (three balls, no strikes). Of course, if the pitcher has generally good control, he might be able to throw the ball into the strike zone at this point, so it might be a good idea for the batter to swing. But then again, if the batter is planning to swing and the pitcher knows the batter is expecting a ball in the strike zone, the pitcher might decide to throw something unhittable and outside the strike zone…you get the idea. The count is at the core of baseball strategy. It drives the mental battle between the hitter and the pitcher.
This hack examines what is likely to happen on different counts. Clearly, a walk is more likely if the pitcher has thrown more balls, and a strikeout is more likely if the pitcher has thrown more strikes. But are extra base hits more likely in some situations? This hack shows you how to count the number of situations in which each of these things happened.
The key to understanding this code is to understand the possible set of counts and how a batter can move between them. Figure 6-4 illustrates how this works (the first number is the number of balls and the second is the number of strikes).
A batter can put a ball into play at any time. That means that from an 0,0 count, a batter can go to a 1,0 count or an 0,1 count, can stay in the same place (if there is a balk, a pitcher throws to first base, a base is stolen, or something else), can get a hit, can get on base on an error, can get on base on a fielder’s choice, or can be put out.
As you can tell, there is more than one path to certain states. As a simple example, consider a 1,1 count. A batter can get there in two ways: 0,0 to 1,0 to 1,1, or 0,0 to 0,1 to 1,1. But notice something else: many destinations cross through the same intermediate states. For example, a batter who reaches a 3,1 count or a 2,2 count may have passed through a 2,1 count on the way.
When we count the number of plays in which a batter had a certain count, we have to count all previous counts. For example, if a pitch sequence was “ball, ball, called strike, ball, swinging strike, and ball put in play as a single,” that means the counts would have been 0,0, 1,0, 2,0, 2,1, 3,1, and 3,2.
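The bookkeeping in that example can be sketched in a few lines of Python. This is only an illustration, not part of the original hack: the function name `counts_seen` is hypothetical, and the three character sets are assumptions copied from the character classes in the SQL query (`[BIPV]` for balls, `[CFKLMOQRST]` for strikes, `[.>123N]` for codes that are skipped). Any other code, such as `X` for a ball put in play, is assumed to end the plate appearance.

```python
BALLS = set("BIPV")          # pitch codes the query treats as balls
STRIKES = set("CFKLMOQRST")  # pitch codes the query treats as strikes
IGNORED = set(".>123N")      # codes the query skips over

def counts_seen(pitch_sequence):
    """Return every (balls, strikes) count the batter faced, in order."""
    balls, strikes = 0, 0
    seen = [(0, 0)]
    for code in pitch_sequence:
        if code in IGNORED:
            continue
        if code in BALLS:
            balls += 1
        elif code in STRIKES:
            if strikes == 2 and code == "F":
                continue  # a foul with two strikes leaves the count unchanged
            strikes += 1
        else:
            break  # assumed: any other code (e.g. X, ball in play) ends the plate appearance
        if balls == 4 or strikes == 3:
            break  # walk or strikeout, so there is no new count to record
        seen.append((balls, strikes))
    return seen

# ball, ball, called strike, ball, swinging strike, ball put in play
print(counts_seen("BBCBSX"))
# [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1), (3, 2)]
```

Each plate appearance therefore contributes to several count totals at once, which is why the totals for the different counts add up to more than the number of plate appearances.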
So, here is how we will count the number of times a batter was in each count. We’ll use a play-by-play database in MySQL, and we’ll use the REGEXP operator to search for pitch strings that start with each pattern we want. We’ll code a flag for each situation.
To make the results a little easier to read, I summed the indicators (yielding a count of the number of times batters were in each situation). Here is the code:
    select sum(CNT_10) AS c10, sum(CNT_20) AS c20, sum(CNT_30) AS c30,
           sum(CNT_11) AS c11, sum(CNT_12) AS c12, sum(CNT_21) AS c21,
           sum(CNT_31) AS c31, sum(CNT_22) AS c22, sum(CNT_32) AS c32,
           sum(CNT_01) AS c01, sum(CNT_02) AS c02, count(*) AS c00,
           event_type
    FROM (
      select
        -- [BIPV] matches a ball, [CFKLMOQRST] matches a strike,
        -- and [.>123N] skips codes that do not change the count
        IF(pitch_sequence REGEXP '^[.>123N]*[BIPV]',1,0) AS CNT_10,
        IF(pitch_sequence REGEXP '^[.>123N]*[BIPV][.>123N]*[BIPV]',1,0) AS CNT_20,
        IF(pitch_sequence REGEXP '^[.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[BIPV]',1,0) AS CNT_30,
        IF(pitch_sequence REGEXP '^[.>123N]*[CFKLMOQRST]',1,0) AS CNT_01,
        IF(pitch_sequence REGEXP '^[.>123N]*[CFKLMOQRST][.>123N]*[CFKLMOQRST]',1,0) AS CNT_02,
        IF(pitch_sequence REGEXP '^[.>123N]*[CFKLMOQRST][.>123N]*[BIPV]'
           or pitch_sequence REGEXP '^[.>123N]*[BIPV][.>123N]*[CFKLMOQRST]',1,0) AS CNT_11,
        IF(pitch_sequence REGEXP '^[.>123N]*[CFKLMOQRST][.>123N]*[BIPV][.>123N]*[BIPV]'
           or pitch_sequence REGEXP '^[.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123N]*[BIPV]'
           or pitch_sequence REGEXP '^[.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[CFKLMOQRST]',1,0) AS CNT_21,
        IF(pitch_sequence REGEXP
             -- SBBB
             '^[.>123N]*[CFKLMOQRST][.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[BIPV]'
           or pitch_sequence REGEXP
             -- BSBB
             '^[.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123N]*[BIPV][.>123N]*[BIPV]'
           or pitch_sequence REGEXP
             -- BBSB
             '^[.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123N]*[BIPV]'
           or pitch_sequence REGEXP
             -- BBBS
             '^[.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[CFKLMOQRST]'
           ,1,0) AS CNT_31,
        -- after two strikes, a foul (F) no longer changes the count, so it is skipped
        IF(pitch_sequence REGEXP
             -- SSB
             '^[.>123N]*[CFKLMOQRST][.>123N]*[CFKLMOQRST][.>123NF]*[BIPV]'
           or pitch_sequence REGEXP
             -- BSS
             '^[.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123N]*[CFKLMOQRST]'
           or pitch_sequence REGEXP
             -- SBS
             '^[.>123N]*[CFKLMOQRST][.>123N]*[BIPV][.>123N]*[CFKLMOQRST]',1,0) AS CNT_12,
        IF(pitch_sequence REGEXP
             -- SSBB
             '^[.>123N]*[CFKLMOQRST][.>123N]*[CFKLMOQRST][.>123NF]*[BIPV][.>123NF]*[BIPV]'
           or pitch_sequence REGEXP
             -- BBSS
             '^[.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123N]*[CFKLMOQRST]'
           or pitch_sequence REGEXP
             -- BSBS
             '^[.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123N]*[BIPV][.>123N]*[CFKLMOQRST]'
           or pitch_sequence REGEXP
             -- BSSB
             '^[.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123N]*[CFKLMOQRST][.>123NF]*[BIPV]'
           or pitch_sequence REGEXP
             -- SBSB
             '^[.>123N]*[CFKLMOQRST][.>123N]*[BIPV][.>123N]*[CFKLMOQRST][.>123NF]*[BIPV]'
           or pitch_sequence REGEXP
             -- SBBS
             '^[.>123N]*[CFKLMOQRST][.>123N]*[BIPV][.>123N]*[BIPV][.>123N]*[CFKLMOQRST]'
           ,1,0) AS CNT_22,
        -- 3,2: at least three balls (skipping strikes) and at least two strikes (skipping balls).
        -- As printed, these classes do not include '>' (and the second also lacks 'N'),
        -- so a sequence with one of those codes before its third ball or second strike
        -- is not flagged as 3,2.
        IF(pitch_sequence REGEXP
             '^[.123NCFKLMOQRST]*[BIPV][.123NCFKLMOQRST]*[BIPV][.123NCFKLMOQRST]*[BIPV]'
           and pitch_sequence REGEXP
             '^[.123BIPV]*[CFKLMOQRST][.123BIPV]*[CFKLMOQRST]',1,0) AS CNT_32,
        event_type
      FROM pbp.pbp2k
      where substring(game_id,4,4)="2004"
    ) pc
    group by event_type;
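The alternations in the query are written out by hand, one pattern per ordering of balls and strikes. As a rough sketch of the same idea, the Python below generates those orderings instead of listing them. It is not the author's code: `ordering_pattern`, `count_regex`, and `reached` are hypothetical names, and it uses Python's `re` module rather than MySQL's REGEXP. The character classes are the ones from the query, including the rule that a foul (`F`) is skipped once there are two strikes. Unlike the query's 3,2 flag, which combines two looser patterns, this sketch treats 3,2 the same way as every other count.

```python
import re
from itertools import permutations

BALL = "[BIPV]"
STRIKE = "[CFKLMOQRST]"

def ordering_pattern(order):
    """Regex fragment for one ordering of pitches, e.g. ('B', 'S', 'B')."""
    parts = []
    strikes = 0
    for pitch in order:
        # Once there are two strikes, further fouls (F) do not change the count.
        skip = "[.>123NF]*" if strikes == 2 else "[.>123N]*"
        parts.append(skip + (BALL if pitch == "B" else STRIKE))
        if pitch == "S":
            strikes += 1
    return "".join(parts)

def count_regex(balls, strikes):
    """Compile a regex matching any pitch sequence that passes through the count."""
    orderings = sorted(set(permutations("B" * balls + "S" * strikes)))
    alternatives = "|".join(ordering_pattern(order) for order in orderings)
    return re.compile("^(?:" + alternatives + ")")

def reached(balls, strikes, pitch_sequence):
    """True if the batter faced this count at some point in the sequence."""
    return count_regex(balls, strikes).match(pitch_sequence) is not None

print(reached(2, 1, "BBCBSX"))  # True: ball, ball, called strike
print(reached(3, 0, "BBCBSX"))  # False: the third pitch was a strike
```

A 2,2 count has six orderings and a 3,2 count has ten, which is why the hand-written version grows so quickly.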
Running this code produces the results shown in Table 6-4.
Table 6-4. Calculating an expected hits matrix
c10 | c20 | c30 | c11 | c12 | c21 | c31 | c22 | c32 | c01 | c02 | c00 | Event type |
---|---|---|---|---|---|---|---|---|---|---|---|---|
36761 | 10803 | 1864 | 33638 | 18549 | 16602 | 5116 | 14844 | 6811 | 41081 | 12465 | 93196 | 2 |
10639 | 2710 | 477 | 15956 | 18432 | 6704 | 1607 | 13034 | 5030 | 21664 | 13559 | 32326 | 3 |
1263 | 384 | 75 | 943 | 483 | 475 | 150 | 414 | 3 | 1016 | 301 | 2297 | 4 |
128 | 30 | 2 | 64 | 29 | 25 | 7 | 14 | 0 | 123 | 30 | 253 | 5 |
374 | 118 | 19 | 354 | 136 | 174 | 65 | 107 | 2 | 352 | 80 | 731 | 6 |
195 | 60 | 10 | 156 | 76 | 80 | 21 | 53 | 20 | 196 | 54 | 500 | 8 |
745 | 201 | 36 | 535 | 427 | 232 | 47 | 321 | 119 | 612 | 262 | 1361 | 9 |
193 | 59 | 13 | 128 | 61 | 65 | 20 | 54 | 21 | 130 | 42 | 323 | 10 |
51 | 11 | 2 | 40 | 23 | 14 | 3 | 16 | 7 | 62 | 20 | 159 | 11 |
31 | 11 | 0 | 19 | 19 | 11 | 2 | 15 | 6 | 26 | 13 | 59 | 12 |
27 | 10 | 2 | 30 | 15 | 17 | 3 | 16 | 9 | 38 | 9 | 66 | 13 |
10987 | 7674 | 5247 | 6457 | 2627 | 7158 | 7155 | 4846 | 6796 | 4094 | 934 | 15084 | 14 |
1361 | 1324 | 1317 | 76 | 4 | 79 | 79 | 4 | 2 | 42 | 2 | 1403 | 15 |
562 | 134 | 16 | 613 | 412 | 225 | 44 | 237 | 71 | 939 | 319 | 1886 | 16 |
5 | 1 | 0 | 7 | 9 | 2 | 1 | 7 | 4 | 15 | 7 | 20 | 17 |
712 | 202 | 45 | 667 | 385 | 308 | 102 | 295 | 133 | 781 | 249 | 1811 | 18 |
202 | 66 | 12 | 159 | 99 | 84 | 35 | 84 | 39 | 196 | 65 | 477 | 19 |
11566 | 3398 | 583 | 10894 | 5992 | 5381 | 1660 | 4779 | 2183 | 13195 | 3931 | 29674 | 20 |
3763 | 1178 | 195 | 3207 | 1648 | 1733 | 578 | 1430 | 731 | 3801 | 1137 | 9046 | 21 |
403 | 131 | 19 | 306 | 176 | 183 | 61 | 155 | 76 | 359 | 124 | 909 | 22 |
2475 | 840 | 182 | 2006 | 965 | 1165 | 436 | 916 | 505 | 2094 | 564 | 5554 | 23 |
It’s a little hard to understand the data when it’s presented in this way, so I imported the data into Excel so that I could easily reformat it. The event_type field is generated by the Retrosheet BEVENT program, or by Chadwick. This is what these values mean:
Table 6-5.
Code | Meaning | Classification |
---|---|---|
0 | Unknown event | Other |
1 | No event | Other |
2 | Generic out | Out |
3 | Strikeout | Out |
4 | Stolen base | Other |
5 | Defensive indifference | Other |
6 | Caught stealing | Other |
7 | Pickoff error | Other |
8 | Pickoff | Other |
9 | Wild pitch | Other |
10 | Passed ball | Other |
11 | Balk | Other |
12 | Other advance | Other |
13 | Foul error | Other |
14 | Walk | Walk |
15 | Intentional walk | Walk |
16 | Hit by pitch | Walk |
17 | Interference | Other |
18 | Error | Other |
19 | Fielder’s choice | Out |
20 | Single | Hit |
21 | Double | Hit |
22 | Triple | Hit |
23 | Home run | Hit |
In Excel, I assigned one of the classification values in the preceding table to each event type and used a pivot table to summarize the results. As you can see, I skipped interference calls and errors. Errors are subjective, and they are often caused by someone other than the pitcher or batter, so I think it’s best to ignore them. Here is what I found about getting on base, based on the 2004 data:
Table 6-6.
On base | No strikes | One strike | Two strikes |
---|---|---|---|
No balls | 33.5% | 28.0% | 21.2% |
One ball | 39.5% | 32.1% | 24.2% |
Two balls | 51.9% | 40.5% | 30.7% |
Three balls | 76.3% | 59.7% | 46.6% |
And here is what I found about getting a hit:
In general, as the number of strikes increases, the odds of getting on base decrease. Three-ball counts stand out. In particular, look at what happens on 3,0 counts: the odds of getting on base are enormous (76.3%), but nearly always from walking (only 9.9% of those plate appearances ended in a hit). Additionally, as the number of balls increases, the chances of getting on base increase, but the chances of getting a hit tend to decrease. This is pretty intuitive: if a pitcher is having trouble throwing strikes, he’s likely to walk batters, and pitches outside the strike zone are hard to hit.
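The pivot-table step can also be reproduced outside Excel. The sketch below is an illustration rather than the author's workflow: the names `TABLE`, `CLASSIFICATION`, `totals`, and `rates` are hypothetical. It copies only two of the twelve count columns from Table 6-4 (`c00` and `c30`) and keeps only the event types classified as Out, Walk, or Hit, so the "Other" rows, including errors and interference, are left out as described above.

```python
# Counts copied from Table 6-4 for two of the twelve columns (c00 and c30).
# Keys are event types; only outs, walks, and hits are kept.
TABLE = {
    2:  {"c00": 93196, "c30": 1864},   # generic out
    3:  {"c00": 32326, "c30": 477},    # strikeout
    19: {"c00": 477,   "c30": 12},     # fielder's choice
    14: {"c00": 15084, "c30": 5247},   # walk
    15: {"c00": 1403,  "c30": 1317},   # intentional walk
    16: {"c00": 1886,  "c30": 16},     # hit by pitch
    20: {"c00": 29674, "c30": 583},    # single
    21: {"c00": 9046,  "c30": 195},    # double
    22: {"c00": 909,   "c30": 19},     # triple
    23: {"c00": 5554,  "c30": 182},    # home run
}
CLASSIFICATION = {
    2: "Out", 3: "Out", 19: "Out",
    14: "Walk", 15: "Walk", 16: "Walk",
    20: "Hit", 21: "Hit", 22: "Hit", 23: "Hit",
}

def totals(column):
    """Sum one count column by classification (the pivot-table step)."""
    result = {"Out": 0, "Walk": 0, "Hit": 0}
    for event_type, counts in TABLE.items():
        result[CLASSIFICATION[event_type]] += counts[column]
    return result

def rates(column):
    """On-base and hit rates as shares of all hits, walks, and outs."""
    t = totals(column)
    total = sum(t.values())
    return {"on_base": (t["Hit"] + t["Walk"]) / total, "hit": t["Hit"] / total}

for column in ("c00", "c30"):
    r = rates(column)
    print(column, f"on base {r['on_base']:.1%}, hit {r['hit']:.1%}")
# c00 on base 33.5%, hit 23.8%
# c30 on base 76.3%, hit 9.9%
```

The 3,0 figures match the 76.3% and 9.9% quoted above, which suggests the percentages use hits plus walks plus outs as the denominator.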
Without even running another query, we can use this data to answer a handful of other questions about what happens on different pitch counts.
Let’s start with a simple question: what percentage of plate appearances end in a strikeout? We can calculate this by dividing the number of strikeouts by the total number of hits, walks, and outs:
Table 6-8.
Strikeouts | No strikes | One strike | Two strikes |
---|---|---|---|
No balls | 17.1% | 24.8% | 41.0% |
One ball | 13.5% | 21.8% | 37.7% |
Two balls | 9.6% | 17.1% | 32.3% |
Three balls | 4.8% | 9.6% | 22.6% |
Not surprisingly, a pitcher is most likely to get a strikeout on an 0,2 count and is least likely on a 3,0 count. The odds of a strikeout go way up with each strike, no matter what.
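As a worked example of that division, here is a small sketch that recomputes the two extremes, 0,2 and 3,0, from the corresponding columns of Table 6-4. The variable and function names are hypothetical. The lists hold only the event types classified as outs, walks, or hits.

```python
# Rows copied from Table 6-4, columns c02 and c30 only, in this order:
# generic out, strikeout, fielder's choice, walk, intentional walk,
# hit by pitch, single, double, triple, home run.
EVENT_TYPES = [2, 3, 19, 14, 15, 16, 20, 21, 22, 23]
C02 = [12465, 13559, 65, 934, 2, 319, 3931, 1137, 124, 564]
C30 = [1864, 477, 12, 5247, 1317, 16, 583, 195, 19, 182]
STRIKEOUT = 3  # event type for a strikeout

def strikeout_rate(column):
    """Strikeouts divided by the total number of hits, walks, and outs."""
    by_event = dict(zip(EVENT_TYPES, column))
    return by_event[STRIKEOUT] / sum(column)

print(f"0,2 count: {strikeout_rate(C02):.1%}")  # 0,2 count: 41.0%
print(f"3,0 count: {strikeout_rate(C30):.1%}")  # 3,0 count: 4.8%
```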
Announcers often say things during games such as “It’s a 3,0 count, the batter knows a fastball is coming, he’s going to hit this ball hard!” If pitchers were more likely to throw hittable balls on certain counts, and batters could control where the ball went, we should be able to see it in the data. In particular, we should see a different percentage of extra base hits on different counts. Using exactly the same data as before, let’s look at what actually happened in 2004. Here are extra base hits as a percentage of all hits:
Table 6-9.
Extra base hits | No strikes | One strike | Two strikes |
---|---|---|---|
No balls | 34.3% | 32.2% | 31.7% |
One ball | 36.5% | 33.6% | 31.8% |
Two balls | 38.7% | 36.4% | 34.4% |
Three balls | 40.4% | 39.3% | 37.5% |
And here are home runs as a percentage of all hits:
Table 6-10.
Home runs | No strikes | One strike | Two strikes |
---|---|---|---|
No balls | 12.3% | 10.8% | 9.8% |
One ball | 13.6% | 12.2% | 11.0% |
Two balls | 15.1% | 13.8% | 12.6% |
Three balls | 18.6% | 15.9% | 14.4% |
There appears to be a subtle effect here. As the number of balls increases, it’s slightly more likely that a player is going to be able to hit the ball hard. Additionally, a player seems slightly more likely to hit the ball hard on no strikes than on one or two strikes. This probably means that pitchers are more likely to throw fastballs than off-speed pitches in those situations.
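The percentages in the two tables above are shares of all hits, not of all plate appearances. A minimal sketch, with hypothetical names, that recomputes the 0,0 and 3,0 cells from the hit rows of Table 6-4 (event types 20 through 23):

```python
# Hit counts copied from Table 6-4 (event types 20-23), columns c00 and c30.
HITS = {
    "c00": {"single": 29674, "double": 9046, "triple": 909, "home_run": 5554},
    "c30": {"single": 583, "double": 195, "triple": 19, "home_run": 182},
}

def hit_shares(column):
    """Return (extra base hit share, home run share) of all hits."""
    hits = HITS[column]
    total = sum(hits.values())
    extra_base = total - hits["single"]
    return extra_base / total, hits["home_run"] / total

for column in ("c00", "c30"):
    extra_base, home_runs = hit_shares(column)
    print(column, f"extra base hits {extra_base:.1%}, home runs {home_runs:.1%}")
# c00 extra base hits 34.3%, home runs 12.3%
# c30 extra base hits 40.4%, home runs 18.6%
```

Note how small the 3,0 sample is: 979 hits, against 45,183 on all plate appearances. That is worth keeping in mind when reading differences of a few percentage points.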
Another interesting question: what happens on balls in play, based on the count? Is a ball in play less likely to end up in an out on certain counts? I looked at the number of hits on balls in play (which were defined as generic outs, fielder’s choices, singles, doubles, triples, and home runs). Here’s what I found:
Table 6-11.
Hits/balls in play | No strikes | One strike | Two strikes |
---|---|---|---|
No balls | 32.5% | 32.0% | 31.5% |
One ball | 33.0% | 32.7% | 32.0% |
Two balls | 33.8% | 33.6% | 32.8% |
Three balls | 34.3% | 34.7% | 33.8% |
It looks as though there is a slight increase in the share of balls in play that become hits when there are more balls. On zero, one, or two balls, it appears that the play is more likely to result in a hit on fewer strikes (but not on three balls). Either way, this is a very subtle effect.
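For reference, the hits-per-ball-in-play figure can be recomputed from Table 6-4 using the definition given earlier: balls in play are generic outs, fielder’s choices, singles, doubles, triples, and home runs. The sketch below uses hypothetical names and copies only the 0,0 and 0,2 columns.

```python
# 2004 counts copied from Table 6-4 for the 0,0 (c00) and 0,2 (c02) columns.
IN_PLAY = {
    "c00": {"generic_out": 93196, "fielders_choice": 477,
            "single": 29674, "double": 9046, "triple": 909, "home_run": 5554},
    "c02": {"generic_out": 12465, "fielders_choice": 65,
            "single": 3931, "double": 1137, "triple": 124, "home_run": 564},
}
HIT_EVENTS = ("single", "double", "triple", "home_run")

def hits_per_ball_in_play(column):
    """Hits divided by all balls in play (hits, generic outs, fielder's choices)."""
    events = IN_PLAY[column]
    hits = sum(events[name] for name in HIT_EVENTS)
    return hits / sum(events.values())

print(f"0,0 count: {hits_per_ball_in_play('c00'):.1%}")  # 0,0 count: 32.5%
print(f"0,2 count: {hits_per_ball_in_play('c02'):.1%}")  # 0,2 count: 31.5%
```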