Use Regular Expressions to Identify Events

Use the regular expression tools in MySQL to find events that match a pattern.

Sometimes you want to parse every event and calculate every statistic; other times, it’s only important to find the events that match a pattern. This hack shows you how to use regular expressions for pattern matching. You can use this technique to create smaller files that you can open in a spreadsheet.

If you’ve imported play-by-play data into a MySQL database like the one we built in “Make a Historical Play-by-Play Database” [Hack #22] , you can select lines using regular expressions.

Let’s start with an easy example. Suppose you want to find all plays where the batter struck out. We know the play code for a strikeout always starts with K (see “Keep Score, Project Scoresheet–Style” [Hack #3] for information on play codes). To do this, we use the REGEXP operator in MySQL. This operator returns true if the field matches the regular expression, and false otherwise.

First, let’s select the right database in MySQL:

	~/Desktop % mysql pbp
	Reading table information for completion of table and column names
	You can turn off this feature to get a quicker startup with -A

	Welcome to the MySQL monitor.  Commands end with ; or g.
	Your MySQL connection id is 3 to server version: 5.0.11-beta-max

	Type 'help;' or 'h' for help. Type 'c' to clear the buffer.

Now, let’s select a few items:

	mysql> select game_id, visiting_team as vis,
	  ->     concat(case when batting_team=1 THEN "B" ELSE "T" END,
	  ->        inning) as inning,
	  ->     outs, event_num as num,batter,
	  ->     event_text
	  -> from pbp2k
	  -> where event_text REGEXP '^K' limit 5;
	+--------------+------+--------+------+------+----------+------------+
	| game_id      | vis  | inning | outs | num  | batter   | event_text |
	+--------------+------+--------+------+------+----------+------------+
	| ANA200004030 | NYA  | B1     |    2 |    8 | vaugm001 | K          |
	| ANA200004030 | NYA  | B3     |    2 |   25 | vaugm001 | K          |
	| ANA200004030 | NYA  | T4     |    2 |   29 | martt002 | K          |
	| ANA200004030 | NYA  | T5     |    1 |   34 | posaj001 | K/C        |
	| ANA200004030 | NYA  | B5     |    2 |   43 | vaugm001 | K          |
	+--------------+------+--------+------+------+----------+------------+
	5 rows in set (0.00 sec)

The interesting part here is the fourth line, where event_text REGEXP '^K’. (The limit 5 part just means “return no more than five lines.” Without this part, I wouldn’t be able to fit the response into this book.) The fourth line means “select only records where the event_text field starts with the letter K.” Specifically, the ^ means “match the beginning of the column.”

Here’s an overview of regular expressions. (I’m leaving out a lot of details because this is just an introduction.) In this book, I stick to some simple patterns, most of which I can explain on this page. Here are a few rules for writing them, and some examples:

Alphabetical characters and numbers represent themselves

To look for the word strike, you can use the expression /strike/.

You can’t write all characters directly

Suppose you want to look for / characters. You can use the expression ///to match all strings with / characters. Without the , Perl would not be able to distinguish between the end of the regular expression (/) and the character for which you are searching (/).

Wildcards can represent many different characters

If you want to search for all errors, you can look for /Ed/. This expression matches all strings where a number comes after the letter E.

Special characters indicate repeats

If you want to search for sequences of numbers, look for /d+/. This matches all strings with sequences of one or more numbers.

Here is a short list of special characters and wildcards:

Table 2-5. 

Special character

Meaning

/

A forward slash (/) character

.

A wildcard matching any character

d

A wildcard matching any digit

s

A wildcard matching any whitespace

w

A wildcard matching any word character (numbers, letters, and _)

*

An asterisk (*) character

A tab character

A newline character

+

A plus (+) character

\

A backslash () character

Here is a short list of ways to indicate repeated characters:

Table 2-6. 

Special character

Meaning

*

Repeated any number of times, even zero

+

Repeated at least once

?

Repeated zero or one time

Here are a couple of other special characters that indicate placement:

Table 2-7. 

Special character

Meaning

^

Match the beginning of the string

$

Match the end of the string

Hacking the Hack

You can search for any type of play that you want using regular expressions. Here are a few ideas:

pitch_sequence REGEXP '^BBB’

All plays that start with three straight balls

event_text REGEXP '/G’

All ground balls

event_text REGEXP '^HR.*3-H.*2-H.*1-H.*'

All grand slam home runs

See Also

We show you how to read play codes and pitch codes in “Keep Score, Project Scoresheet–Style” [Hack #3] .

If you want to know more about regular expressions, check out the help files or manpages for Perl and SQL. (How do you get the manpages for Perl? If you’re using a PC and you installed Perl with Cygwin, you’ll need to install the man application as well. Then, open a command-line window and type man perl to see the list of manpage options for Perl. For example, type man perlrequick to get a short introduction to regular expressions in Perl.)

Even better, you might want to invest in a book on Perl or MySQL. I like the O’Reilly books (and I’ll bet my editors do, too), especially Programming Perl and MySQL Cookbook.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.189.188.238