Keep Your Stats Database Up-to-Date

Create your own database of player statistics and keep it up-to-date as the season progresses.

If you want to know how a player is doing this season, the easiest way to learn is by looking it up at a popular baseball web site [Hack #5] . Most of the time, this will be fine: for example, if you just want to know David Ortiz’s slugging percentage for the year. However, suppose you want to ask a complicated question (say, how did David Ortiz do in the first two weeks of May?). Or suppose you want to use the data for something else, like managing your fantasy team (or league). In these cases, you might want to keep your own up-to-date database.

This hack shows a method for doing this.

The Code

Here’s the code to fetch data from the Web and save it directly to a database. There are three parts to this code: a script to create a database, a script to fill the database initially, and a script to update the database.

Create the box score database.

Let’s start off by creating a simple database in MySQL. If you want to follow along, you can create a file called create_boxes_db.sql and enter all the code in this section into the file:

	-- create a new database schema for the box score information,
	-- and a user to access the database
	GRANT ALL ON boxes.* to 'boxer'@'localhost'
	    IDENTIFIED BY 'boxers password';
	CREATE DATABASE boxes;
	USE boxes;

	-- Create three tables containing information about players:
	-- batting, fielding, and pitching. Fields are sized to
	-- minimize storage space for this database

	CREATE TABLE batting (
	   eliasID  INT(6),
	   teamID   CHAR(3),
	   gameID   VARCHAR(32),
	   gameDate DATE,
	   h        SMALLINT(2), -- hits
	   hr       SMALLINT(2), -- home runs
	   bb       SMALLINT(2), -- walks
	   so       SMALLINT(2), -- strikeouts
	   rbi      SMALLINT(2), -- runs batted in
	   ab       SMALLINT(2), -- at bats
	   runs     SMALLINT(2), -- runs scored
	   t        SMALLINT(2), -- triples
	   d        SMALLINT(2), -- doubles
	   lob      SMALLINT(2), -- left on base
	   sb       SMALLINT(2), -- stolen bases
	   cs       SMALLINT(2)  -- caught stealing
	);

	CREATE TABLE fielding (
	   eliasID   INT(6),
	   teamID    CHAR(3),
	   gameID    VARCHAR(32),
	   gameDate  DATE,
	   pos       CHAR(2),     -- position
	   po        SMALLINT(2), -- putouts
	   da        SMALLINT(2), -- assists
	   e         SMALLINT(2), -- errors
	   pb        SMALLINT(2)  -- passed balls
	);

	CREATE TABLE pitching (
	   eliasID   INT(6),
	   teamID    VARCHAR(3),
	   gameID    VARCHAR(32),
	   gameDate  DATE,
	   outs      SMALLINT(2),
	   bf        SMALLINT(2), -- batters faced
	   hr        SMALLINT(2), -- home runs
	   bb        SMALLINT(2), -- walks
	   so        SMALLINT(2), -- strikeouts
	   er        SMALLINT(2), -- earned runs
	   runs      SMALLINT(2), -- runs
	   hits      SMALLINT(2), -- hits
	   wins      SMALLINT(2), -- wins
	   losses    SMALLINT(2), -- losses
	   holds     SMALLINT(2), -- holds
	   hopps     SMALLINT(2), -- hold opportunities
	   saves     SMALLINT(2), -- saves
	   sopps     SMALLINT(2)  -- save opportunities
	);
	-- Add a table to keep track of a little information
	-- about each game, such as the opponents and date.

	CREATE TABLE games (
	   gameID    VARCHAR(32),
	   gameDate  DATE,
	   homeTeam  CHAR(3),
	   awayTeam  CHAR(3)
	);

	-- Keep track of the complete daily rosters. This
	-- table will include detailed information about
	-- who played in each game.

	CREATE TABLE rosters (
	   gameID    VARCHAR(32),
	   gameDate  DATE,
	   teamID    CHAR(3),
	   eliasID   INT(6),
	   nameFirst VARCHAR(32),
	   nameLast  VARCHAR(32),
	   nickName  VARCHAR(32),
	   jersey    SMALLINT(2),
	   throws    CHAR(1)
	);

	-- Show what we've created.
	show tables;

The update script.

Now we’ll tackle the Perl code for populating the database. We want to update the tables in the previous code snippet daily, and we want to be careful that each game is finished before we update the database. To be sure that each game is finished, I decided that after 12:00 p.m., GMT (Greenwich Mean Time), all of the games from the previous day would be finished. (Note the use of 36 hours; date values have a time of midnight by default.)

The most important decision to make is where to get the data. You can get this information anywhere box scores are available, but in this section, I’m going to use a really simple one. MLB.com’s Gameday servers include a copy of every box score in XML format. XML is a standard for sharing data in a computer-readable format. It looks a lot like HTML, the language used to make web pages; to learn more about XML, see http://www.w3.org/XML. This is great for us because some neat tools are available for reading XML files into Perl.

I used the same LWP library as before to download the files and a module called XML::Simple to make the downloaded file easy to read in Perl. (For more information on Perl modules, see “Get Perl” [Hack #12] .)

To make this script readable, I placed the code for reading box scores and saving the contents into a Perl module, called save_to_db. I used four different subroutines to do this: save_batting_and_fielding, save_pitching, save_game, and save_roster. You can guess what each one does. Enter the code in this section into a file called update_db.pl:

	#!/usr/bin/perl

	# updates database with latest information from box scores
	# To keep DB up to date, run with scheduled tasks or chron daily

	# we'll use some of the same subroutines for boootstrapping and updating
	use save_to_db;
	use Mysql;
	my $dbh = Mysql->connect('localhost', 'boxes', 'boxer', 'boxers password'),
	$dbh->{'dbh'}->{'PrintError'} = 1;
	use XML::Simple;
	my $xs = new XML::Simple(ForceArray => 1, KeepRoot => 1, KeyAttr => 'boxscore'),
	use LWP;
	my $browser = LWP::UserAgent->new;

	# to prevent partial loading of games in progress, only load a day's games
	# after 12:00 PM GMT

	my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)
	   = gmtime(time() - 60*60*36);
	my $fmon = length($mon + 1) == 1 ? '0' . ($mon + 1) : ($mon + 1);
	my $fday = length($mday) == 1 ?  '0' . $mon : $mday;

	# build base url for fetching box scores and start fetching
	my $baseurl = 'http://gd2.mlb.com/components/game/year_' . (1900 + $year) .
	 '/month_' . $fmon .
	 '/day_' . $fday . '/';

	# fetch names of all games played
	my $response = $browser->get($baseurl);
	die "Couldn't get $baseurl: ", $response->status_line, "
"
	  unless $response->is_success;
	my $dirhtml = $response->content;

	# now, load the box score for each game played that day
	while($dirhtml =~ m/<a href="(gid_.+)/"/g ) {
	  my $game = $1;
	  print "fetching box score for game $game
";
	  my $boxurl = $baseurl . $game . "/boxscore.xml";
	  my $response = $browser->get($boxurl);
	  die "Couldn't get $boxurl: ", $response->status_line, "
"
	    unless $response->is_success;
	  my $box = $xs->XMLin($response->content);
	  save_batting_and_fielding($dbh, $box);
	  save_pitching($dbh, $box);
	  save_game($dbh, $box);

	  my $playersurl = $baseurl . $game . "/players.txt";
	  my $response = $browser->get($playersurl);
	  die "Couldn't get $playersurl: ", $response->status_line, "
"
	    unless $response->is_success;
	  save_roster($dbh, $box, $response->content);
	}

The bootstrapping script.

If you start building this database in the middle of the season (as I did), you’ll probably want to preload the database with information on everything that has happened so far. Here’s a simple script to do this. By design, the bootstrapping script is very similar to the update script. (I omitted redundant parts that are the same as the update script. Actually, I kept all the tricky code in the save_to_db module.) Create a file called load_db.pl and add the following code into the file as listed here:

	#!/usr/bin/perl

	# loads a database with this season's box scores up through yesterday's games

	use save_to_db;

	use Mysql;
	my $dbh = Mysql->connect('localhost', 'boxes', 'boxer', 'boxers password'),
	$dbh->{'dbh'}->{'PrintError'} = 1;
	use XML::Simple;
	my $xs = new XML::Simple(ForceArray => 1, KeepRoot => 1, KeyAttr => 'boxscore'),
	use LWP;
	my $browser = LWP::UserAgent->new;

	# to prevent partial loading of games in progress, only load a day's games
	# after 12:00 PM GMT
	use Time::Local;
	my $mintime = timegm(0,0,0,3,3,105);
	my $maxtime = time() - 60*60*36;
	my $mintimestr = gmtime(timegm(0,0,0,3,3,105));
	my $maxtimestr = gmtime(time() - 60*60*36);

	print "iterating from $mintimestr to $maxtimestr
";
	for ($i = $mintime; $i <= $maxtime; $i += 86400) {
	   ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime($i);
	   $fmon = length($mon + 1) == 1 ? '0' . ($mon + 1) : ($mon + 1);
	   $fday = length($mday) == 1 ?  '0' . $mday : $mday;

	   # build base url for fetching box scores and start fetching
	   $baseurl = 'http://gd2.mlb.com/components/game/mlb/year_' . (1900 +
	     $year) .
	     '/month_' . $fmon .
	     '/day_' . $fday . '/';

	    my $response = $browser->get($baseurl);
	    die "Couldn't get $baseurl: ", $response->status_line, "
"
	       unless $response->is_success;
	    my $dirhtml = $response->content;
	  while($dirhtml =~ m/<a href="(gid_.+)/"/g ) {
	    my $game = $1;
	    print "fetching box score for game $game
";
	    my $boxurl = $baseurl . $game . "/boxscore.xml";
	    my $response = $browser->get($boxurl);
	    # die "Couldn't get $boxurl: ", $response->status_line, "
"
	    unless ($response->is_success) {
	      print "Couldn't get $boxurl: ", $response->status_line, "
";
	      next;
	    }
	    my $box = $xs->XMLin($response->content);
	    save_batting_and_fielding($dbh, $box);
	    save_pitching($dbh, $box);
	    save_game($dbh, $box);

	    my $playersurl = $baseurl . $game . "/players.txt";
	    my $response = $browser->get($playersurl);
	    unless ($response->is_success) {
	      print "Couldn't get $playersurl: ", $response->status_line,
	      "  n";
	      next;
	    }
	    save_roster($dbh, $box, $response->content);
	  }
	  # be a good spider and don't take up too much bandwidth
	  sleep(1);
	}

The helping code.

Create a file called save_to_db.pm for the helping code; make sure you save it in the same directory as the other Perl scripts for this hack. Then add all the following code snippets into the file.

First, we need some header information for the package, to give it a name and define what subroutines it exports:

	package save_to_db;
	require Exporter;
	@ISA = qw(Exporter);
	@EXPORT = qw (save_batting_and_fielding save_pitching save_game save_roster);

Next, we’ll define a couple of internal subroutines to simplify the code. The extract_date subroutine pulls a date out of a game identifier and formats it properly for MySQL. The extract_info subroutine pulls a few values out of the box score and assigns them to variables. These two subroutines contain code that repeats in each exported method.

	sub extract_date($) {
	  my($in) = @_;
	  my $gmyr = substr($in,0,4);
	  my $gmmn = substr($in,5,2);
	  my $gmdy = substr($in,8,2);
	  my $gamedate = ''' . $gmyr . '-' . $gmmn . '-' . $gmdy . ''';
	  return $gamedate;
	}

	sub extract_info($) {
	  my ($box) = @_;
	  my $home = $box->{boxscore}->[0]->{home_team_code};
	  my $away = $box->{boxscore}->[0]->{away_team_code};
	  my $gameid = "'" . $box->{boxscore}->[0]->{game_id} . "'";
	  my $gamedate = extract_date($box->{boxscore}->[0]->{game_id});
	  return ($home, $away, $gameid, $gamedate);
	}

Let’s start with pitching. The following subroutine iterates over each pitcher on each team, generates a database record for each performance, and writes this query to the database:

	sub save_pitching($$) {
	  my ($dbh, $box) = @_;
	  my ($home, $away, $gameid, $gamedate) = extract_info($box);
	  foreach $team (@{$box->{boxscore}->[0]->{pitching}}) {
	  foreach $pitcher (@{$team->{pitcher}}) {
	     $win=0; $loss=0; $hold=0; $holdopp=0; $save=0; $saveopp=0;
	     if    ($pitcher->{note} =~ /(W/) { $win = 1; }
	     elsif ($pitcher->{note} =~ /(L/)  {$loss = 1; }
	     elsif ($pitcher->{note} =~ /(S/)  {$save = 1; $saveopp = 1;}
	     elsif ($pitcher->{note} =~ /(BS/) {$saveopp = 1;}
	     elsif ($pitcher->{note} =~ /(H/)  {$hold = 1; $holdopp = 1;}
	     elsif ($pitcher->{note} =~ /(BH/) {$holdopp = 1;}
	     $ptchr_query = 'INSERT INTO pitching VALUES ('
	     . join(',', (
	          $pitcher->{id},
	          "'" . ($team->{team_flag}=="home" ?
	            $home : $away) . "'",
	           $gameid,
	           $gamedate,
	           $pitcher->{out},
	           $pitcher->{bf},
	           $pitcher->{hr},
	           $pitcher->{bb},
	           $pitcher->{so},
	           $pitcher->{er},
	           $pitcher->{r},
	           $pitcher->{h},
	           $win,
	           $loss,
	           $hold,
	           $holdopp,
	           $save,
	           $saveopp
	           )) . ")";
	       $sth = $dbh->query($ptchr_query);
	       die ("MySQL Error: $dbh->$errmsg
") unless defined($sth);
	     }
	   }
	}

The subroutine to extract batting and fielding information is similar, but I defined an extra helper routine to make them work better. The XML box scores, unfortunately, don’t include line items for every important measurement. However, this information is included in HTML-formatted notes. To extract the missing stuff (like stolen bases and errors), I used the following subroutine. This procedure pulls out the information for each field and returns an anonymous hash with information about what each player did. For example, here is how to use this procedure to extract information on steals:

	my %steals    = %{extractvars('SB', $team->{text_data}->[0])};

The process for pulling out the data is a little complicated and repetitive, so I defined it once in a subroutine to keep the code clean. Here is the subroutine itself:

	sub extractvars($$) {
	  # a procedure for extracting information from
	  # the text data field (like stolen bases, errors, etc.)
	  my ($type, $text) = @_;
	  my $stuff = {};

	  if ($text =~ m{ <b> $type </b>:s* (.*) .<br/> }x) {
	  my @players = split /),/, $1;
	    foreach $player (@players) {
	       # important: player names may include commas, spaces,
	       #    apostrophes, and periods, but no numbers
	       #    or parentheses
	    $player =~ /([ws,.']+)s(d?)s?(.*/;
	    $name = $1;
	    if ($2) {$num = $2;} else {$num = 1;}
	    $stuff->{$name} = $num;
	  }
	  }
	  return $stuff;
	}

Putting this together, here is the code for saving batting and fielding information:

	sub save_batting_and_fielding($$) {
	  my ($dbh, $box) = @_;
	  my ($home, $away, $gameid, $gamedate) = extract_info($box);
	  foreach $team (@{$box->{boxscore}->[0]->{batting}}) {
	  my %steals     = %{extractvars('SB', $team->{text_data}->[0])};
	  my %caughtstealing = %{extractvars('CS', $team->{text_data}->[0])};
	  my %errors     = %{extractvars('E',  $team->{text_data}->[0])};
	  my %passedballs  = %{extractvars('PB', $team->{text_data}->[0])};

	  foreach $batter (@{$team->{batter}}) {
	    my $sb = $steals{$batter->{name}};
	    $sb = 0 unless defined($sb);
	    my $cs = $caughtstealing{$batter->{name}};
	    $cs = 0 unless defined($cs);
	    my $e = $errors{$batter->{name}};
	    $e = 0 unless defined($e);
	    my $pb = $passedballs{$batter->{name}};
	    $pb = 0 unless defined($pb);

	    $batr_query = 'INSERT INTO batting VALUES ('
	    . join(',', (
	         $batter->{id},
	         "'" . ($team->{team_flag}=="home" ?
	            $home : $away) . "'",
	          $gameid,
	          $gamedate,
	          $batter->{h},
	          $batter->{hr},
	          $batter->{bb},
	          $batter->{so},
	          $batter->{rbi},
	          $batter->{ab},
	          $batter->{r},
	          $batter->{t},
	          $batter->{d},
	          $batter->{lob},
	          $sb,
	          $cs
	          )) . ")";
	    $sth = $dbh->query($batr_query);
	    die ("MySQL Error: $dbh->$errmsg
") unless defined($sth);

	    $fldr_query = 'INSERT INTO fielding VALUES ('
	    . join(',', (
	         $batter->{id},
	         "'" . ($team->{team_flag}=="home" ?
	           $home : $away) ."'",
	          $gameid,
	          $gamedate,
	          "'" . $batter->{pos} . "'",
	          $batter->{po},
	          $batter->{da},
	          $e,
	          $pb
	          )) . ")";
	    $sth = $dbh->query($fldr_query);
	    die ("MySQL Error: $dbh->$errmsg
") unless defined($sth);
	  }
	  }
	}

And, finally, here are the subroutines for writing game and roster information to the database:

	sub save_game($$) {
	  my ($dbh, $box) = @_;
	  my ($home, $away, $gameid, $gamedate) = extract_info($box);
	  foreach $team (@{$box->{boxscore}->[0]->{pitching}}) {
	  $game_query = 'INSERT INTO games VALUES ('
	    . join(',', (
	         $gameid,
	         $gamedate,
	         "'" . $home . "'",
	         "'" . $away . "'"
	      )) . ")";
	  $sth = $dbh->query($game_query);
	  die ("MySQL Error: $dbh->$errmsg
") unless defined($sth);
	  }
	}

	sub save_roster($$$) {
	  my ($dbh, $box, $file) = @_;
	  my ($home, $away, $gameid, $gamedate) = extract_info($box);
	  @players = split /&/, $file;
	  foreach $player (@players) {
	  ($playeridstr, $first, $last, $jersey, $nickname, $throws) =
	    split /|/, $player;
	  if ($playeridstr =~ /([ha])dd?=(d+)/) {
	     $homeaway = $1;
	     $playerid = $2;
	     if ($jersey == "--") {$jersey = "null";}
	     $roster_query = 'INSERT INTO rosters VALUE ('
	     . join(',', (
	          $gameid,
	          $gamedate,
	          "'" . ($homeaway=="h" ?
	            $home : $away) . "'",
	          $playerid,
	          '"' . $first . '"',
	          '"' . $last . '"',
	          '"' . $nickname . '"',
	          $jersey,
	          "'" . $throws . "'"
	          )) . ")";
	      $sth = $dbh->query($roster_query);
	      die ("MySQL Error: $dbh->$errmsg
") unless defined($sth);
	  }
	  }
	}
	# make sure to return a true value for the module
	1;

Running the Hack

You can run the create_boxes_db.sql script with the following command:

	% mysql -u root < create_boxes_db.sql

Barring any errors, this will show a list of the tables created:

	mysql> use boxes;
	Database changed
	mysql> show tables;
	+-----------------+
	| Tables_in_boxes |
	+-----------------+
	| batting         |
	| fielding        |
	| games           |
	| pitching        |
	| rosters         |
	+-----------------+

Warning

You need to execute create_boxes_db.sql only once to create the initial data structures!

You can run the bootstrap code manually with the following command:

	% perl load_db.pl

This will display the name of each box score as it loads and show you alerts if there are any problems:

	iterating from Sun Apr  3 00:00:00 2005 to Sun May 29 07:27:41 2005
	fetching box score for game gid_2005_04_03_anamlb_lanmlb_1
	fetching box score for game gid_2005_04_03_balmlb_phimlb_1
	fetching box score for game gid_2005_04_03_bosmlb_nyamlb_1
	…

You should run this script exactly once per day to add new game data into your database. I recommend using a utility like cron (on Unix-based platforms) or Scheduled Tasks (on Windows platforms) to execute this script exactly once per day.

Hacking the Hack

I showed how to extract information for MLB games and save it to a database. Here are a few tips on where to go from here:

Use a different data source

Although the MLB.com XML files are the easiest place to start, any electronically available box score will work. For example, you can use the HTML::TableContentParser module to extract data from web pages.

Fetch data from other leagues

International and minor league game data is available online, and you can build a database of this data (helpful for sophisticated fantasy players). For example, MLB.com Gameday has AAA league information at http://gd2.mlb.com/components/game/aaa and AA information at http://gd2.mlb.com/components/game/aax.

Save more data

I skipped a lot of fields in this hack (like game temperatures and double plays), but you can easily modify this hack to track this stuff.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.38.24