Chapter 5: Bizarro Ball Sample Data

5.1 Introduction

5.2 Sample Data Descriptions

5.2.1 AtBats

5.2.2 Games

5.2.3 Leagues

5.2.4 Pitches

5.2.5 Player_Candidates

5.2.6 Runs

5.2.7 Teams

5.3 Summary

5.1 Introduction

As mentioned previously, we have been approached by the headquarters office for Bizarro Ball about their interest in using analytics to improve the quality of the game. Specifically, the business users are interested in using the detailed data about all the events in a game to answer what we would call business intelligence questions like:

   How well is each batter performing overall?

   How well is each pitcher performing?

   Are there differences due to external factors like when and where the games were played?

   And so on.

The results data they want to use describe what happened for:

   Each pitch.

   Each plate appearance (which is often referred to as an at bat).

   Each runner who gets on base.

The business users have a number of different metrics they are interested in calculating using this data and have asked for a Proof of Concept (PoC) for how to analyze such data to calculate those metrics as well as some new metrics they believe might be useful. Given that there is no source of data for the PoC, we agreed to write sample SAS programs to generate data to approximate the kind of data they would need to collect on an ongoing basis.

The rest of this chapter describes the data sets we generated. That data will be used as the source data for the examples in the rest of the book. You can access the programs, the sample data, and the blog entries that are used in this book. Note that the blog entries include additional examples (including the programs that created the sample data). These are available from the author page at http://support.sas.com/authors. Select either “Paul Dorfman” or “Don Henderson.” Then look for the cover thumbnail of this book, and select “Blog Entries.” The programs that created the sample data also make extensive use of the SAS hash object; however, because that sample code is not directly related to the point of the Proof of Concept, it is not described here.

5.2 Sample Data Descriptions

We created a number of data sets to simulate the Bizarro Ball data. The data sets are not normalized; for example, the AtBats data set contains all the available fields about a player, such as their name and team. The data corresponds to what a person watching the game would record or write down. The data sets contain:

   2,880 games (the Games data set).

   Approximately 875,000 pitches (the Pitches data set).

   Almost 290,000 at bats/plate appearances (the AtBats data set).

   Roughly 78,000 runs scored (the Runs data set).

   And more.

The following subsections describe seven data sets that are the primary source data for the examples in the rest of the book. These data sets reflect the basic Bizarro Ball data that is needed for our Proof of Concept. Should the business users decide to move forward and formally collect all the relevant data from Bizarro Ball games, that data would likely contain many more fields; for the purposes of our Proof of Concept, we focused on creating the minimal set of fields needed to highlight how the SAS hash object could be used to answer questions about the data.

Each of the following subsections includes:

   A brief description of the data set.

   A description of how some of the fields in each data set can be used.

   A listing of the fields in each data set.

   A sample of the data (the first five rows).

Also please note the following about the sample data:

   A number of field names have a suffix of _SK, which stands for Surrogate Key. That is a standard convention in many data tables. For example, each row in the Teams data set could be uniquely identified by the team name; however, this can create data access issues if a team changes its name. A standard approach is to generate an arbitrary value and use that as the key to uniquely identify a data row.

   The Game_SK field is constructed from a number of source fields. Its display value is 32 characters that have no meaning, and so the listings below show only the first four characters followed by an ellipsis (. . .) to indicate that only part of the value is displayed.
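
The two kinds of keys described above can be sketched as follows. The sketch is written in Python rather than SAS (the language of the book's own programs) purely for illustration. The team names, the source fields fed into the game key, and the use of an MD5 digest are all assumptions; the text states only that Game_SK is built from several source fields and displays as 32 characters with no meaning.

```python
import hashlib

# Hypothetical team names; the real Teams data set has 32 rows.
team_names = ["Aces", "Bombers", "Comets"]

# Sequential surrogate key: an arbitrary integer that stays stable
# even if a team is later renamed.
teams = [{"Team_SK": sk, "Team_Name": name}
         for sk, name in enumerate(team_names, start=1)]


def make_game_sk(game_date, away_team_sk, home_team_sk, game_number):
    """Build a 32-character key from several source fields.

    Assumption: both the choice of source fields and the MD5 digest are
    illustrative; the text does not say how Game_SK is actually derived.
    """
    source = f"{game_date}|{away_team_sk}|{home_team_sk}|{game_number}"
    return hashlib.md5(source.encode("utf-8")).hexdigest().upper()


game_sk = make_game_sk("2017-03-20", 1, 2, 1)

# Listings show only the first four characters followed by an ellipsis.
print(game_sk[:4] + "...")
```

Renaming a team in this sketch changes only Team_Name; every row that refers to the team through Team_SK is unaffected, which is the point of a surrogate key.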

5.2.1 AtBats

A plate appearance (PA) in Bizarro Ball, like baseball, corresponds to each time a batter faces an opposing pitcher in a game. There is one observation in the AtBats data set for each such occurrence. Whether a PA counts as an at bat (AB) depends on the result of the batter facing the pitcher. For example, if the batter draws a walk (4 pitches out of the strike zone), that PA does not count as an AB. In response to a request from the business users, we agreed to call this data set AtBats, as that is the term most commonly used to describe these events. The distinction between an AB and a PA is used in calculating some of the metrics we will generate later in this book.

The AB_Number field is a sequential counter for each team within a game that uniquely identifies each plate appearance and its result. It provides a one-to-many link to the Pitches data set: every observation in the AtBats data set has at least one row in the Pitches data set. It also provides a one-to-many (possibly zero) link to the Runs data set: there is an observation in the Runs data set for each run scored as the result of a plate appearance. If no runs are scored as the result of the plate appearance, then there are no rows in the Runs data set for that AB_Number.
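
The links described above can be sketched with a few hypothetical rows. The sketch uses Python rather than SAS for illustration only. Because AB_Number restarts for each team within a game, it assumes that the full link key is (Game_SK, Team_SK, AB_Number); the presence of Team_SK in these three data sets, the Pitch_Number field, and all data values are assumptions.

```python
from collections import defaultdict

# Hypothetical rows; field names other than Game_SK, AB_Number, Result,
# Batter_ID, and Runner_ID are assumptions made for this sketch.
at_bats = [
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 1, "Result": "Walk"},
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 2, "Result": "Home Run"},
]
pitches = [
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 1, "Pitch_Number": 1, "Result": "Ball"},
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 1, "Pitch_Number": 2, "Result": "Ball"},
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 1, "Pitch_Number": 3, "Result": "Ball"},
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 1, "Pitch_Number": 4, "Result": "Walk"},
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 2, "Pitch_Number": 1, "Result": "Strike"},
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 2, "Pitch_Number": 2, "Result": "Home Run"},
]
runs = [
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 2, "Batter_ID": 102, "Runner_ID": 101},
    {"Game_SK": "A1B2", "Team_SK": 1, "AB_Number": 2, "Batter_ID": 102, "Runner_ID": 102},
]


def link_key(row):
    """Composite key assumed to identify one plate appearance."""
    return (row["Game_SK"], row["Team_SK"], row["AB_Number"])


# Index the "many" side of each relationship by the link key.
pitches_by_ab = defaultdict(list)
for pitch in pitches:
    pitches_by_ab[link_key(pitch)].append(pitch)

runs_by_ab = defaultdict(list)
for run in runs:
    runs_by_ab[link_key(run)].append(run)

# One summary row per plate appearance: at least one pitch, zero or more runs.
summary = []
for ab in at_bats:
    key = link_key(ab)
    summary.append({
        "AB_Number": ab["AB_Number"],
        "Result": ab["Result"],
        "Pitch_Count": len(pitches_by_ab.get(key, [])),
        "Run_Count": len(runs_by_ab.get(key, [])),
    })

for row in summary:
    print(row)
```

Indexing each detail data set by the link key and then looking up one key per plate appearance is a plain dictionary lookup; it is shown here only to make the one-to-many relationships concrete.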

Many of the fields in this data set will be used as class variables in later chapters. Other fields are aggregated to calculate the various metrics the business users are interested in. For example, the fields whose names begin with Is_A_ or Is_An_ are Boolean (missing/0 or 1) fields that specify how each plate appearance should be counted when calculating a number of the metrics:

   Is_An_AB: Designates whether the plate appearance contributes a count of 1 to the number of at bats. Aggregating this field, for example, will provide the denominator for the calculation of a batter’s batting average (BA).

   Is_An_Out: The batter made an out. This field can be used to aggregate how many innings a pitcher pitched (3 outs is an inning).

   Is_A_Hit: The batter got a hit (a single, double, triple, or home run). Aggregating this field, for example, will provide the numerator for the calculation of the batter’s batting average (BA).

   Is_An_OnBase: The batter reached base, regardless of how (e.g., a hit or a walk). Aggregating this field, for example, will provide the numerator for the on base percentage metric (OBP).

Since each observation in the data set contributes a count of 1 to the number of plate appearances, we did not create a specific field for that calculation; the number of rows can be counted/calculated.
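
The aggregations described above can be sketched as follows, in Python rather than SAS and for illustration only. The four rows and the Batter_ID value are hypothetical; the flag names follow the list above, with missing values represented as None.

```python
# Hypothetical AtBats rows for one batter; each flag is 1 or missing (None).
at_bats = [
    # a single
    {"Batter_ID": 101, "Is_An_AB": 1, "Is_A_Hit": 1, "Is_An_Out": None, "Is_An_OnBase": 1},
    # an out
    {"Batter_ID": 101, "Is_An_AB": 1, "Is_A_Hit": None, "Is_An_Out": 1, "Is_An_OnBase": None},
    # a walk: a plate appearance that is not an at bat
    {"Batter_ID": 101, "Is_An_AB": None, "Is_A_Hit": None, "Is_An_Out": None, "Is_An_OnBase": 1},
    # another out
    {"Batter_ID": 101, "Is_An_AB": 1, "Is_A_Hit": None, "Is_An_Out": 1, "Is_An_OnBase": None},
]


def total(rows, field):
    """Sum a Boolean flag, treating missing (None) as 0."""
    return sum(row[field] or 0 for row in rows)


plate_appearances = len(at_bats)            # every row is one PA
at_bat_count = total(at_bats, "Is_An_AB")   # denominator of BA
hits = total(at_bats, "Is_A_Hit")           # numerator of BA
times_on_base = total(at_bats, "Is_An_OnBase")  # numerator of OBP
outs = total(at_bats, "Is_An_Out")

# Guard against division by zero for a batter with no at bats.
batting_average = hits / at_bat_count if at_bat_count else None

print(f"PA={plate_appearances} AB={at_bat_count} H={hits} BA={batting_average:.3f}")
```

The same outs total, grouped by pitcher instead of batter and divided by 3, would give innings pitched. The denominator of OBP is not defined in this section, so the sketch stops at its numerator.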

Output 5.1 AtBats Contents

Output 5.2 First 5 Observations in AtBats

5.2.2 Games

The Games data set is the schedule of games and contains information on which teams play each other on what date.

The fields in this data set are used as class variables in many of the aggregates.

Output 5.3 Games Contents

Output 5.4 First 5 Observations in Games

5.2.3 Leagues

The Leagues data set contains just two rows – one for each of the two Bizarro Ball leagues. Its primary use case is to provide a label for aggregate metrics produced at or rolled up to the league level. The business users have indicated they are interested in comparing results between the two leagues.

Output 5.5 Leagues Contents

Output 5.6 All Observations in Leagues

5.2.4 Pitches

The Pitches data set contains one row for every pitch in every game. For each plate appearance, the value of the Result field on its last pitch is the same as the value of the Result field on the corresponding row in the AtBats data set. The AB_Number field links the Pitches and AtBats data and can be used when generating metrics that require data from both data sets (e.g., summarizing the performance of pitchers vs. specific batters and vice versa).

As with the AtBats data set, many of the fields here will be used as class variables. The fields Is_A_Ball and Is_A_Strike are also Booleans (missing/0 or 1) and can be aggregated to evaluate pitch distribution. The fields Balls (how many pitches that are balls have been thrown in the current plate appearance) and Strikes (how many pitches that are strikes have been thrown in the current plate appearance) can be used as filter criteria as well as class variables. The variable Strikes is used as a filter variable in the sample case study in Chapter 13.
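
The following Python sketch (illustrative only; the book's programs are in SAS) shows the Boolean fields being aggregated and the Strikes field being used as a filter. The five rows are hypothetical, and the sketch assumes that Balls and Strikes hold the count before each pitch is thrown, which the description above does not settle.

```python
# Hypothetical pitches for one plate appearance that ends in a strikeout.
# Assumption: Balls and Strikes are the counts BEFORE the pitch is thrown.
pitches = [
    {"AB_Number": 1, "Balls": 0, "Strikes": 0, "Is_A_Ball": None, "Is_A_Strike": 1},
    {"AB_Number": 1, "Balls": 0, "Strikes": 1, "Is_A_Ball": 1, "Is_A_Strike": None},
    {"AB_Number": 1, "Balls": 1, "Strikes": 1, "Is_A_Ball": None, "Is_A_Strike": 1},
    {"AB_Number": 1, "Balls": 1, "Strikes": 2, "Is_A_Ball": 1, "Is_A_Strike": None},
    {"AB_Number": 1, "Balls": 2, "Strikes": 2, "Is_A_Ball": None, "Is_A_Strike": 1},
]


def total(rows, field):
    """Sum a Boolean flag, treating missing (None) as 0."""
    return sum(row[field] or 0 for row in rows)


# Overall pitch distribution.
ball_count = total(pitches, "Is_A_Ball")
strike_count = total(pitches, "Is_A_Strike")
strike_share = strike_count / len(pitches)

# Strikes used as a filter: only pitches thrown with two strikes.
two_strike_pitches = [p for p in pitches if p["Strikes"] == 2]
two_strike_share = total(two_strike_pitches, "Is_A_Strike") / len(two_strike_pitches)

print(ball_count, strike_count, strike_share, two_strike_share)
```

Replacing the filter with a grouping on (Balls, Strikes) would use the same two fields as class variables instead.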

Output 5.7 Pitches Contents

Output 5.8 First 5 Observations in Pitches

5.2.5 Player_Candidates

The Player_Candidates data set contains 10,000 rows, which resulted from a Cartesian product of 100 unique first names and 100 unique last names. This data set is the pool of available players for the 32 Bizarro Ball teams (25 players per team). A total of 800 (32 × 25) observations in this data set were randomly assigned to specific positions (e.g., the Position_Code field designates whether the player is a starting pitcher, a first baseman, and so on) and teams (the Team_SK field). The fields in this data set are descriptive in nature and will primarily be used as descriptive labels or classification fields.
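
The construction described above can be sketched as follows. This is a Python illustration under stated assumptions, not the SAS program that generated the sample data: the placeholder names, the Player_ID, First_Name, and Last_Name fields, the fixed random seed, and the use of roster slot numbers as Position_Code values are all hypothetical.

```python
import itertools
import random

# Placeholder names; the real data set uses 100 actual first and last names.
first_names = [f"First{i:03d}" for i in range(100)]
last_names = [f"Last{i:03d}" for i in range(100)]

# Cartesian product: every first name paired with every last name.
candidates = [
    {"Player_ID": player_id, "First_Name": first, "Last_Name": last,
     "Team_SK": None, "Position_Code": None}
    for player_id, (first, last) in enumerate(
        itertools.product(first_names, last_names), start=1)
]

N_TEAMS = 32
ROSTER_SIZE = 25

# Fixed seed (an assumption) so the sketch is reproducible.
rng = random.Random(2024)

# Draw 32 * 25 = 800 distinct candidates and fill the rosters in order.
chosen = rng.sample(candidates, N_TEAMS * ROSTER_SIZE)
for i, player in enumerate(chosen):
    player["Team_SK"] = i // ROSTER_SIZE + 1
    # Roster slot 1..25 stands in for the real position codes.
    player["Position_Code"] = i % ROSTER_SIZE + 1

print(len(candidates), len(chosen))
```

Candidates that are not drawn keep missing values for Team_SK and Position_Code, so the full pool of 10,000 rows remains available.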

Output 5.9 Player_Candidates Contents

Output 5.10 First 5 Observations in Player_Candidates

5.2.6 Runs

The Runs data set contains information on each scored run, including both the ID of the batter and the ID of the runner. The AB_Number field can be used to retrieve data values needed for various metrics from the AtBats and Pitches data sets. For example, we might want to rank the kinds of hits by how many runs they produce, or to examine the distribution of runs scored by the number of outs. Aggregating the count of rows (i.e., runs scored) using Batter_ID as a classification variable calculates the Runs Batted In (aka RBIs) metric; using Runner_ID calculates the Runs_Scored metric for the player.
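
The two row counts described above can be sketched in Python as follows (for illustration only; the book's programs are in SAS). The three rows and the player IDs are hypothetical.

```python
from collections import Counter

# Hypothetical Runs rows: one row per run scored.
runs = [
    {"AB_Number": 2, "Batter_ID": 101, "Runner_ID": 205},
    {"AB_Number": 2, "Batter_ID": 101, "Runner_ID": 101},
    {"AB_Number": 7, "Batter_ID": 205, "Runner_ID": 101},
]

# Counting rows by Batter_ID gives runs batted in (RBIs).
rbis = Counter(run["Batter_ID"] for run in runs)

# Counting rows by Runner_ID gives runs scored.
runs_scored = Counter(run["Runner_ID"] for run in runs)

for player_id in sorted(set(rbis) | set(runs_scored)):
    print(player_id, rbis[player_id], runs_scored[player_id])
```

The same rows feed both metrics; only the classification variable changes.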

Output 5.11 Runs Contents

Output 5.12 First 5 Observations in Runs

5.2.7 Teams

The Teams data set contains 32 observations with the league and team name for each team. In real life, it would contain many more fields such as Owner, City, Manager, and so on. This data set is used to provide classification variables and labels.

Output 5.13 Teams Contents

Output 5.14 First 5 Observations in Teams

5.3 Summary

These Bizarro Ball data sets will be used for the examples in the next few chapters. As mentioned in the Introduction, they were designed to be transactional in nature (i.e., what would be collected during games). In Chapter 7 this data will be used to create a rudimentary data warehouse/mart that can be more easily used for business intelligence type questions.
