Glossary

air yards

The distance traveled by the pass from the line of scrimmage to the intended receiver, whether or not the pass was complete.

average depth of target (aDOT)

The average air yards traveled on targeted passes for quarterbacks and targets for receivers.
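
As a minimal sketch of the calculation, aDOT is the mean of air yards over every targeted pass, complete or not. The numbers and the function name `average_depth_of_target` are made up for illustration.

```python
from statistics import mean

# Hypothetical air yards for one quarterback's targeted passes.
# Completed and incomplete passes both count toward aDOT.
air_yards = [5, 12, -2, 30, 8, 15]


def average_depth_of_target(air_yards_per_target):
    """Return the mean air yards across all targeted passes."""
    return mean(air_yards_per_target)


adot = average_depth_of_target(air_yards)
print(f"aDOT: {adot:.2f} yards")
```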

binomial

A type of statistical distribution with a binary (0/1-type) response. Examples include wins/losses for a game or sacks/no-sacks from a play.

bins

The discrete categories of numbers used to summarize data in a histogram.

book

Short for sportsbook in football gambling. This is the person, group, casino, or other similar enterprise that takes wagers on sporting (and other) events.

bounded

Describes a number whose value is constrained by other values. For example, a percentage is bounded by 0% and 100%.

boxplots

A data visualization that uses a “box” to show the middle 50% of data and stems for the upper and lower quartiles. Commonly, outliers are plotted as dots. The plots are also known as box-and-whisker plots.

Cheeseheads

Fans of the Green Bay Packers. For example, Richard is a Cheesehead because he likes the Packers. Eric is not a fan of the Packers and is therefore not a Cheesehead.

closing line

The final price offered by a sportsbook before a game starts. In theory, this contains all the opinions, expressed through wagers, of all bettors who have enough influence to move the line into place.

clustering

A type of statistical method for dividing data points into similar groups (clusters) based on a set of features.

coefficient

The predictor estimates from a regression-type analysis. Slopes and intercepts are special cases of coefficients.

completion percentage over expected (CPOE)

The rate at which a quarterback has successful (completed) passes, compared to what would be predicted (expected) given a situation based on an expected completion percentage model.
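
A rough sketch of the arithmetic, assuming an expected completion probability for each pass is already available from some model. The pass data and the function name `cpoe` are hypothetical.

```python
from statistics import mean

# Each tuple: (1 if completed else 0, expected completion probability).
# The probabilities are invented here; in practice a model supplies them.
passes = [(1, 0.80), (0, 0.55), (1, 0.40), (1, 0.70), (0, 0.65)]


def cpoe(pass_results):
    """Actual completion rate minus expected rate, in percentage points."""
    actual = mean(completed for completed, _ in pass_results)
    expected = mean(prob for _, prob in pass_results)
    return 100 * (actual - expected)


qb_cpoe = cpoe(passes)
print(f"CPOE: {qb_cpoe:+.1f} percentage points")
```

A negative value means the quarterback completed fewer passes than the model expected for those situations.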

confidence interval (CI)

A measure of uncertainty around a point estimate such as a mean or regression coefficient. For example, a 95% CI around a mean will contain the true mean 95% of the time if you repeat the observation process many, many, many times. But, you will never know which 5% of times you are wrong.
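
One common way to approximate a 95% CI around a mean is the mean plus or minus 1.96 standard errors. The sketch below assumes that normal approximation (a t-distribution would give a wider interval for a sample this small) and uses made-up yards-per-carry values.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical yards gained on eight carries.
yards_per_carry = [4, 7, 2, 9, 5, 6, 3, 8]


def mean_confidence_interval(sample, z=1.96):
    """Approximate CI for the mean using a normal approximation."""
    center = mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = stdev(sample) / sqrt(len(sample))
    return center - z * standard_error, center + z * standard_error


ci = mean_confidence_interval(yards_per_carry)
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```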

context

What is going on around a situation; the factors involved in a play, such as the down, yards to go, and field position.

controlled for

Including one or more extra variables in a regression or regression-like model. For example, pass completion might be controlled for yards to go. See also corrected for and normalized.

corrected for

A synonym for controlled for.

data dictionary

Data about data. Also a synonym for metadata.

data pipeline

The flow of data from one location to another, with the data undergoing changes such as formatting along the way. See also pipe.

data wrangling

The process of getting data into the format you need to solve your problems. Synonyms include data cleaning, data formatting, data tidying, data transformation, data manipulation, data munging, and data mutating.

degrees of freedom

The “extra” number of data points left over from fitting a model.

dimensionality reduction

A statistical approach for reducing the number of features by creating new, independent features. Principal component analysis (PCA) is an example of one type of dimensionality reduction.

dimensions (of data)

The number of variables needed to describe data. Graphically, this is the number of axes needed to describe the data. Tabularly, this is the number of columns needed to describe the data. Algebraically, this is the number of independent variables needed to describe the data.

distance

The number of yards remaining to either obtain a new first down or score a touchdown.

down

A finite number of plays to advance the football a certain distance (measured in yards) and either score or obtain a new set of plays before a team loses possession of the ball.

draft approximate value (DrAV)

The approximate value generated by a player picked for his drafting team. This is a metric developed by Pro Football Reference.

draft capital

The resources a team uses during the NFL Draft, including the number of picks, pick rounds, and pick numbers.

edge

An advantage over the betting markets for predicting outcomes, usually expressed as a percentage.

expected point

The estimated, or expected, value for the number of points one would expect a team to score given the current game situation on that drive.

expected points added (EPA)

The difference between a team’s expected points from one play to the next, measuring the success of the play.
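
In code, the definition reduces to a subtraction. This sketch assumes the expected points before and after the play are already known; the values are invented.

```python
def expected_points_added(ep_before, ep_after):
    """EPA for a play: expected points after minus expected points before."""
    return ep_after - ep_before


# Hypothetical play: the offense moves from a 1.2 EP situation to 2.5 EP.
epa = expected_points_added(ep_before=1.2, ep_after=2.5)
print(f"EPA: {epa:+.2f}")
```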

exploratory data analysis (EDA)

A subset of statistical analysis that analyzes data by describing or summarizing its main characteristics. Usually, this involves both graphical summaries such as plots and numerical summaries such as means and standard deviations.

feature

A predictor variable in a model. This term is used more commonly by data scientists, whereas statisticians tend to use predictor or independent variable.

for loop

A computer programming tool that repeats (or loops) over a function for a predefined number of iterations.

generalized linear models (GLMs)

An extension of linear models (such as simple linear regression and multiple regression) to include a link function and a non-normal response variable. Examples include logistic regression with binary data and Poisson regression with count data.

gridiron football

A synonym for American football.

group by

A concept from SQL-type languages that describes taking data and creating subgroups (or grouping) based on (or by) a variable. For example, you might take Aaron Rodgers’s passing yards and group by season to calculate his average passing yards per season.
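
The idea can be sketched in plain Python: collect values under each group key, then summarize each group. The per-game passing yards are made-up numbers, not real statistics, and `mean_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (season, passing yards in one game) rows.
games = [(2020, 364), (2020, 240), (2021, 133), (2021, 255), (2021, 261)]


def mean_by_group(rows):
    """Group values by key, then return the mean of each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


season_means = mean_by_group(games)
for season, avg in season_means.items():
    print(f"{season}: {avg:.1f} passing yards per game")
```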

handle

The total amount of money placed by bettors across all markets.

high-leverage situations

Important plays that determine the outcome of games. Examples include converting on third down or on fourth and goal. These plays, while of great importance, are generally not predictive game to game or season to season.

histogram

A type of plot that summarizes counts of data into discrete bins.
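
To make the binning step concrete, here is a small sketch that counts values into fixed-width bins and prints a text histogram. The data and the 10-yard bin width are arbitrary choices for illustration.

```python
from collections import Counter

# Hypothetical yards gained on eight pass plays.
passing_yards = [3, 7, 12, 18, 21, 25, 29, 41]
bin_width = 10


def bin_counts(values, width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


histogram = bin_counts(passing_yards, bin_width)
for left_edge, count in histogram.items():
    print(f"{left_edge:>3}-{left_edge + bin_width - 1:<3} {'#' * count}")
```

Empty bins (here, 30 to 39 yards) are simply absent from the dictionary; a plotting library would usually draw them with a height of zero.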

hit

This term has two uses in this book. A football player colliding with another player is a hit. Additionally, a computer script trying to download from a web page hits the page.

interaction

In a regression-type model, sometimes two predictors or features change together. A relation (or interaction) between these two terms allows this to be included within the model.

intercept

The point where a simple linear regression line crosses the y-axis, that is, the predicted value of the response when the predictor is 0. Also, sometimes used to refer to multiple regression coefficients with a discrete predictor variable.

internal

For sports bettors, the price they would offer the game if they were a sportsbook. The discrepancy between this value and the actual price of the game determines the edge.

interquartile range

The difference between the first and third quartiles. See also quartile.
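
A short sketch using Python's `statistics.quantiles` (available in Python 3.8 and later). Several conventions exist for computing quartiles, so other tools may return slightly different values for the same data. The rushing yards are invented.

```python
from statistics import quantiles

# Hypothetical yards gained on ten rushing plays.
rushing_yards = [1, 2, 3, 3, 4, 5, 6, 8, 12, 25]


def interquartile_range(values):
    """Return the third quartile minus the first quartile."""
    # quantiles(..., n=4) returns the three cut points between quartiles.
    first_quartile, _median, third_quartile = quantiles(values, n=4)
    return third_quartile - first_quartile


iqr = interquartile_range(rushing_yards)
print(f"IQR: {iqr:.2f} yards")
```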

lag

A delay or offset. For example, comparing the number of passes per quarterback per game in one season (such as 2022) to the previous season (such as 2021) would have a lag of 1 season.
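
A lag of one season can be sketched by pairing each season's value with the value from the season before. The passes-per-game numbers and the helper name `lag_pairs` are hypothetical.

```python
# Hypothetical passes per game for one quarterback, keyed by season.
passes_per_game = {2020: 32.9, 2021: 33.2, 2022: 31.9}


def lag_pairs(by_season, lag=1):
    """Map each season to (current value, value from `lag` seasons earlier)."""
    return {
        season: (by_season[season], by_season[season - lag])
        for season in sorted(by_season)
        if season - lag in by_season  # the earliest seasons have no lagged value
    }


lagged = lag_pairs(passes_per_game)
for season, (current, previous) in lagged.items():
    print(f"{season}: {current} (previous season: {previous})")
```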

link function

The function that maps between the observed scale and model scale in a generalized linear model. For example, a logit function can link between the observed probability of an outcome occurring on the bounded 0–1 probability scale to the log-odds scale that ranges from negative to positive infinity.
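
As an illustration of the logit example, the following sketch maps a probability to the log-odds scale and back again. The function names are our own.

```python
from math import exp, log


def logit(probability):
    """Map a probability in (0, 1) to the log-odds scale."""
    return log(probability / (1 - probability))


def inverse_logit(log_odds):
    """Map a log-odds value back to a probability in (0, 1)."""
    return 1 / (1 + exp(-log_odds))


for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: log odds = {logit(p):+.3f}")
```

A probability of 0.5 corresponds to log odds of 0, and probabilities near 0 or 1 map to large negative or positive values.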

log odds

Odds on the log scale.

long pass

A pass typically longer than 20 yards, although the actual threshold may vary.

metadata

The data describing the data. For example, metadata might indicate whether a time column displays minutes, seconds, or decimal minutes. This can often be thought of as a synonym for data dictionary.

mixed-effect model

A model with both fixed effects and random effects. See also random-effect model. Synonyms include hierarchical model, multilevel model, and repeated-measure or repeated-observation model.

moneyline

In American football, a bet on a team winning straight up.

multiple regression

A type of regression with more than one predictor variable. Simple linear regression is a special type of multiple regression.

normalize

This term has multiple definitions. In the book, we use it to refer to accounting for other variables in regression analysis. See also corrected for and controlled for. Normalization may also be used to define a transformation of data. Specifically, data are rescaled (or normalized) to have a mean of 0 and a standard deviation of 1, which matches the scale of a standard normal distribution.
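
The second meaning, rescaling to a mean of 0 and a standard deviation of 1, can be sketched as follows. The data are invented, and using the population standard deviation (`pstdev`) rather than the sample version is an arbitrary choice for this example.

```python
from statistics import mean, pstdev

# Hypothetical passing yards per game.
passing_yards = [180, 220, 250, 310, 290]


def standardize(values):
    """Rescale values to have mean 0 and (population) standard deviation 1."""
    center = mean(values)
    spread = pstdev(values)
    return [(value - center) / spread for value in values]


z_scores = standardize(passing_yards)
print([round(z, 2) for z in z_scores])
```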

North American football

A synonym for American football.

odds

In betting and logistic regression, the number of times an event occurs in relation to the number of times the event does not occur. For example, if Kansas City has 4-to-1 odds against winning this week, we would expect them to win one game for every four games they lose under a similar situation. Odds can either emerge empirically through events occurring and models estimating the odds or through betting as odds emerge through the “wisdom” of the crowds.

odds-ratio

Odds in ratio format. For example, 3-to-2 odds can be written as 3:2 or 1.5 odds-ratios.
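
The 3-to-2 example can be worked through in a few lines, along with the conversion between odds and probability. The function names are hypothetical.

```python
def odds_from_counts(occurs, does_not_occur):
    """Odds as a single number: times occurring / times not occurring."""
    return occurs / does_not_occur


def odds_to_probability(odds):
    """Convert odds in favor of an event to a probability."""
    return odds / (1 + odds)


def probability_to_odds(probability):
    """Convert a probability to odds in favor of the event."""
    return probability / (1 - probability)


three_to_two = odds_from_counts(3, 2)
print(f"3:2 as a ratio: {three_to_two}")
print(f"as a probability: {odds_to_probability(three_to_two):.2f}")
```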

open source

Describes software where code must be freely accessible (anybody can look at the code) and freely available (does not cost money).

origination

The process by which an oddsmaker or a bettor sets the line.

outliers

Data points that are far away from the other data points.

overfit

Describes a model for which too many parameters have been estimated compared to the amount of data, or a model that fits one situation too well and does not apply to other situations.

p-values

The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis of no difference is true. These values are increasingly falling out of favor with professional statisticians because of their common misuse.

Pearson’s correlation coefficient

A value from –1 to 1. A value of 1 means two groups are perfectly positively correlated, and as one increases, the other increases. A value of –1 means two groups are perfectly negatively correlated, and as one increases, the other decreases. A value of 0 means no correlation, and the values for one group do not have any relation to the values from another group.
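
For readers who want to see the calculation, here is a minimal sketch of Pearson's correlation coefficient written out by hand. In practice a library function would normally be used; the function name `pearson_r` and the data are our own.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mean_x, mean_y = mean(x), mean(y)
    # Sum of cross-products of deviations from each mean.
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly positive relation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly negative relation
```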

pipe

To pass the outputs from one function directly to another function. See also data pipeline.

play-by-play (data)

The recorded results for each play of a football game. Often this data is “row poor,” in that there are far more features (columns) than plays (rows).

principal component analysis (PCA)

A statistical tool for creating fewer independent features from a set of features.

probability

A number between 0 and 1 that describes the chance of an event occurring. Multiple definitions of probability exist, including the frequentist definition, which is the long-term average under similar conditions (such as flipping a coin), and Bayesian, which is the belief in an event occurring (such as betting markets).

probability distributions

A mathematical function that assigns a value (specifically, a probability) between 0 and 1 to an event occurring.

proposition (bets)

A type of bet focusing on a specific outcome occurring, such as who will score the first touchdown. Also called prop for short.

push

A game for which the outcome lands on the spread and the bettor is refunded their money.

Pythonistas

People who use the Python language.

quartile

A quarter of the data based on numerical ranking. By definition, data can be divided into four quartiles.

random-effect model

A model with coefficients that are assumed to come from a shared distribution.

regression

A type of statistical model that describes the relationship between one response variable and either one predictor variable (a simple linear regression) or more than one predictor variable (a multiple regression). Also a special type of linear model.

regression candidate

With regression, observations are expected to regress to the mean (average) value through time. For example, a player who has a good year this year would be reasonably expected to have a year closer to average next year, especially if the source of their good year is a relatively unstable, or noisy, statistic.

relative risk

A method for understanding the results from a Poisson regression, similar to the outputs from a logistic regression with odds-ratios.

residual

The difference between a model’s predicted value for an observation and the actual value for an observation.
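
A small sketch, assuming the common sign convention of actual minus predicted (so a positive residual means the observation beat the model). The rushing yards and model predictions are invented.

```python
# Hypothetical rushing plays: yards actually gained and yards a model predicted.
actual_yards = [3, 7, 1, 12]
predicted_yards = [4.0, 4.5, 2.0, 6.0]

# Residual for each play: actual value minus the model's predicted value.
residuals = [
    actual - predicted
    for actual, predicted in zip(actual_yards, predicted_yards)
]

print(residuals)
```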

run yards over expected (RYOE)

The number of running yards a player obtains compared to the value expected (or average) from a model given the play’s situation.

sabermetrics

Quantitative analysis of baseball, named after the Society for American Baseball Research (SABR).

scatterplot

A type of plot that plots points on both axes.

scrape

To use computer programs to download data from websites (as in web scraping).

set the line

The process of the oddsmaker(s) creating the odds.

Simpson’s paradox

A statistical phenomenon whereby relationships between variables change based on different groupings using other variables.

slope

The change in a trend through time. The term is often used to describe regression coefficients with continuous predictor variables.

spread (bet)

A betting market in American football that is the most popular and easy to understand. The spread is the point value meant to split outcomes in half over a large sample of games. This doesn’t necessarily mean the sportsbook wants half of the bets on either side of the spread though.

stability

Within the context of this book, stability of an evaluation metric is the metric’s ability to predict itself over a predetermined time frame. Also, see stability analysis and sticky stats.

stability analysis

The measurement of how well a metric or model output holds up through time. For example, with football, we would care about the stability of making predictions across seasons.

standard deviation

A measure of the spread, or dispersion, in a distribution.

standard error

A measure of the uncertainty around an estimate, such as a mean, given the variability in the data and the sample size.

sticky stats

A term commonly used in fantasy football for numbers that are stable through time.

short pass

A pass typically less than 20 yards, although the actual threshold may vary (e.g., the first-down marker).

short-yardage back

A running back who tends to play when only a few (or “short”) number of yards are required to obtain a first down or a touchdown.

supervised learning

A type of statistical and machine learning algorithm where people know the groups ahead of time and the algorithm may be trained on data.

three true outcomes

Baseball’s first, and arguably most important, outcomes that can be modeled are walks, strikeouts, and home runs. These outcomes also do not depend on the defense, other than rare exceptions.

total (bet)

A simple bet on whether the sum of the two teams’ points goes over or under a specified amount.

total (for a game) (bet)

The number of points expected by the betting market for a game.

unsupervised learning

A type of statistical and machine learning algorithm where people do not know the groups ahead of time.

useR

People who use the R language.

variable

Depending on context, two definitions are used in the book. First, observations can be variable. For example, pass yards might be highly variable for a quarterback, meaning the quarterback lacks consistency. Second, a model can include variables. For example, air yards might be a predictor variable for the response variable completion in a regression model.

vig

See vigorish.

vigorish

The house (casino, bookie, or other similar institution that takes bets) advantage that ensures the house almost always makes money over the long term.
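
One way to see the house advantage is to convert the prices on both sides of a bet to implied probabilities and notice that they sum to more than 1. The sketch below assumes prices quoted as American odds (for example, -110 on each side of a spread), which is a convention this glossary does not define; the function names are hypothetical.

```python
def implied_probability(american_odds):
    """Break-even win probability implied by an American odds price."""
    if american_odds < 0:
        # Favorite: risk |odds| to win 100.
        return -american_odds / (-american_odds + 100)
    # Underdog: risk 100 to win `odds`.
    return 100 / (american_odds + 100)


def overround(prices):
    """Amount by which the implied probabilities exceed 1."""
    return sum(implied_probability(price) for price in prices) - 1


# Both sides of a bet priced at -110 (an assumed, typical example).
hold = overround([-110, -110])
print(f"Implied probabilities sum to {1 + hold:.4f}")
```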

win probability (WP)

A model to predict the probability that a team wins the game at a given point during the game.

wins above replacement

A framework for estimating the number of wins a player is worth during the course of a season, set of seasons, or a career. First created in baseball.

yards per attempt (YPA)

Also known as yards per passing attempt, YPA is the average number of yards a quarterback gains per pass attempt during a defined time period, such as a game or season.
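
The calculation is total passing yards divided by pass attempts over the chosen period. The game totals below are invented.

```python
def yards_per_attempt(passing_yards, attempts):
    """Average passing yards gained per pass attempt."""
    if attempts == 0:
        raise ValueError("YPA is undefined with zero pass attempts")
    return passing_yards / attempts


# Hypothetical game: 280 passing yards on 35 attempts.
ypa = yards_per_attempt(passing_yards=280, attempts=35)
print(f"YPA: {ypa:.1f}")
```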

yards per carry (YPC)

The average number of yards a player gains per rushing attempt (carry) during a defined time period, such as a game or season.

yards to go

The number of yards necessary to either obtain a first down or score during a play.
