Use the regular expression tools in MySQL to find events that match a pattern.
Sometimes you want to parse every event and calculate every statistic; other times, it’s only important to find the events that match a pattern. This hack shows you how to use regular expressions for pattern matching. You can use this technique to create smaller files that you can open in a spreadsheet.
If you’ve imported play-by-play data into a MySQL database like the one we built in “Make a Historical Play-by-Play Database” [Hack #22] , you can select lines using regular expressions.
Let’s start with an easy example. Suppose you want to find all plays where the batter struck out. We know the play code for a strikeout always starts with K (see “Keep Score, Project Scoresheet–Style” [Hack #3] for information on play codes). To do this, we use the REGEXP operator in MySQL. This operator returns true if the field matches the regular expression, and false otherwise.
First, let’s select the right database in MySQL:
~/Desktop % mysql pbp Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or g. Your MySQL connection id is 3 to server version: 5.0.11-beta-max Type 'help;' or 'h' for help. Type 'c' to clear the buffer.
Now, let’s select a few items:
mysql> select game_id, visiting_team as vis, -> concat(case when batting_team=1 THEN "B" ELSE "T" END, -> inning) as inning, -> outs, event_num as num,batter, -> event_text -> from pbp2k -> where event_text REGEXP '^K' limit 5; +--------------+------+--------+------+------+----------+------------+ | game_id | vis | inning | outs | num | batter | event_text | +--------------+------+--------+------+------+----------+------------+ | ANA200004030 | NYA | B1 | 2 | 8 | vaugm001 | K | | ANA200004030 | NYA | B3 | 2 | 25 | vaugm001 | K | | ANA200004030 | NYA | T4 | 2 | 29 | martt002 | K | | ANA200004030 | NYA | T5 | 1 | 34 | posaj001 | K/C | | ANA200004030 | NYA | B5 | 2 | 43 | vaugm001 | K | +--------------+------+--------+------+------+----------+------------+ 5 rows in set (0.00 sec)
The interesting part here is the fourth line, where event_text REGEXP '^K’. (The limit 5 part just means “return no more than five lines.” Without this part, I wouldn’t be able to fit the response into this book.) The fourth line means “select only records where the event_text field starts with the letter K.” Specifically, the ^ means “match the beginning of the column.”
Here’s an overview of regular expressions. (I’m leaving out a lot of details because this is just an introduction.) In this book, I stick to some simple patterns, most of which I can explain on this page. Here are a few rules for writing them, and some examples:
To look for the word strike, you can use the expression /strike/.
Suppose you want to look for / characters. You can use the expression ///to match all strings with / characters. Without the , Perl would not be able to distinguish between the end of the regular expression (/) and the character for which you are searching (/).
If you want to search for all errors, you can look for /Ed/. This expression matches all strings where a number comes after the letter E.
If you want to search for sequences of numbers, look for /d+/. This matches all strings with sequences of one or more numbers.
Here is a short list of special characters and wildcards:
Table 2-5.
Special character |
Meaning |
---|---|
/ |
A forward slash (/) character |
. | |
d |
A wildcard matching any digit |
s |
A wildcard matching any whitespace |
w |
A wildcard matching any word character (numbers, letters, and _) |
* |
An asterisk (*) character |
|
A tab character |
|
A newline character |
+ |
A plus (+) character |
\ |
A backslash () character |
Here is a short list of ways to indicate repeated characters:
Table 2-6.
Special character |
Meaning |
---|---|
* |
Repeated any number of times, even zero |
+ |
Repeated at least once |
? |
Repeated zero or one time |
Here are a couple of other special characters that indicate placement:
Table 2-7.
Special character |
Meaning |
---|---|
^ |
Match the beginning of the string |
$ |
Match the end of the string |
You can search for any type of play that you want using regular expressions. Here are a few ideas:
All plays that start with three straight balls
All ground balls
All grand slam home runs
We show you how to read play codes and pitch codes in “Keep Score, Project Scoresheet–Style” [Hack #3] .
If you want to know more about regular expressions, check out the help files or manpages for Perl and SQL. (How do you get the manpages for Perl? If you’re using a PC and you installed Perl with Cygwin, you’ll need to install the man application as well. Then, open a command-line window and type man perl to see the list of manpage options for Perl. For example, type man perlrequick to get a short introduction to regular expressions in Perl.)
Even better, you might want to invest in a book on Perl or MySQL. I like the O’Reilly books (and I’ll bet my editors do, too), especially Programming Perl and MySQL Cookbook.
18.189.188.238