9.1 Probabilistic Models

An important consideration in taking a drug is how it may affect one’s perception or general awareness. Suppose you want to model the length of time it takes to respond to a stimulus (a measure of awareness) as a function of the percentage of a certain drug in the bloodstream. The first question to be answered is this: “Do you think that an exact relationship exists between these two variables?” That is, do you think that it is possible to state the exact length of time it takes an individual (subject) to respond if the amount of the drug in the bloodstream is known?

We think that you will agree with us that this is not possible, for several reasons: The reaction time depends on many variables other than the percentage of the drug in the bloodstream (for example, the time of day, the amount of sleep the subject had the night before, the subject’s visual acuity, the subject’s general reaction time without the drug, and the subject’s age). Even if many variables are included in a model, it is still unlikely that we would be able to predict the subject’s reaction time exactly. There will almost certainly be some variation in response times due strictly to random phenomena that cannot be modeled or explained.

If we were to construct a model that hypothesized an exact relationship between variables, it would be called a deterministic model. For example, if we believe that y, the reaction time (in seconds), will be exactly one-and-one-half times x, the amount of drug in the blood, we write

y=1.5x

This represents a deterministic relationship between the variables y and x. It implies that y can always be determined exactly when the value of x is known. There is no allowance for error in this prediction.

If, however, we believe that there will be unexplained variation in reaction times (perhaps caused by important variables that were left out of the model, or by random phenomena), we discard the deterministic model and use a model that accounts for this random error. Our probabilistic model will include both a deterministic component and a random-error component. For example, if we hypothesize that the response time y is related to the percentage x of drug by

y=1.5x+Random error

we are hypothesizing a probabilistic relationship between y and x. Note that the deterministic component of this probabilistic model is 1.5x.

Figure 9.1a shows the possible responses for five different values of x, the percentage of drug in the blood, when the model is deterministic. All the responses must fall exactly on the line because a deterministic model leaves no room for error.

Figure 9.1b shows a possible set of responses for the same values of x when we are using a probabilistic model. Note that the deterministic part of the model (the straight line itself) is the same. Now, however, the inclusion of a random-error component allows the response times to vary from this line. Since we know that the response time does vary randomly for a given value of x, the probabilistic model for y is more realistic than the deterministic model.

Figure 9.1

Possible reaction times y for five different drug percentages x
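The contrast between the two panels of Figure 9.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five x values, the error standard deviation `sigma`, and the use of normally distributed errors are all chosen for the example and do not come from the text, which does not specify an error distribution at this point.

```python
import random


def deterministic(x):
    """Deterministic model from the text: y = 1.5x, with no error term."""
    return 1.5 * x


def probabilistic(x, rng, sigma=0.3):
    """Probabilistic model: y = 1.5x + random error.

    Assumption for this sketch: the error is normal with mean 0 and
    standard deviation sigma (a hypothetical value).
    """
    return deterministic(x) + rng.gauss(0.0, sigma)


rng = random.Random(1)      # fixed seed so the run is repeatable
xs = [1, 2, 3, 4, 5]        # hypothetical drug percentages

det_ys = [deterministic(x) for x in xs]
prob_ys = [probabilistic(x, rng) for x in xs]

for x, d, p in zip(xs, det_ys, prob_ys):
    # The deviation p - d is the random-error component for this response.
    print(f"x={x}: on the line y={d:.2f}, observed y={p:.2f}, deviation={p - d:+.2f}")
```

Every deterministic response lies exactly on the line, while each simulated probabilistic response deviates from it by its own random error.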

General Form of Probabilistic Models

y=Deterministic component + Random error

where y is the variable of interest. We always assume that the mean value of the random error equals 0. This is equivalent to assuming that the mean value of y, E(y), equals the deterministic component of the model; that is,

E(y)=Deterministic component
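The equivalence follows from the linearity of expected value, because the deterministic component is a fixed (nonrandom) quantity once x is given. A short derivation, writing D for the deterministic component and ε for the random error (D is shorthand introduced here for the illustration):

```latex
\begin{aligned}
E(y) &= E(D + \varepsilon) \\
     &= D + E(\varepsilon) \\
     &= D + 0 \\
     &= D
\end{aligned}
```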

Biography Francis Galton (1822–1911)

The Law of Universal Regression

Francis Galton was the youngest of seven children born to a middle-class English family of Quaker faith. A cousin of Charles Darwin, Galton attended Trinity College (Cambridge, England) to study medicine. Due to the death of his father, Galton was unable to obtain his degree. His competence in both medicine and mathematics, however, led Galton to pursue a career as a scientist. He made major contributions to the fields of genetics, psychology, meteorology, and anthropology. Some consider Galton to be the first social scientist for his applications of the novel statistical concepts of the time, in particular regression and correlation. While studying natural inheritance in 1886, Galton collected data on heights of parents and adult children. He noticed that tall (or short) parents tended to have tall (or short) children, but that the children were not, on average, as tall (or short) as their parents. Galton called this phenomenon the “law of universal regression,” for the average heights of adult children tended to “regress” to the mean of the population. With the help of his friend and disciple, Karl Pearson, Galton applied the straight-line model to the height data, and the term regression model was coined.

In this chapter, we present the simplest of probabilistic models—the straight-line model—which gets its name from the fact that the deterministic portion of the model graphs as a straight line. Fitting this model to a set of data is an example of regression analysis, or regression modeling. The elements of the straight-line model are summarized in the following box:

A First-Order (Straight-Line) Probabilistic Model

y=β0+β1x+ε

where

  • y=Dependent or response variable (quantitative variable to be modeled or predicted)

  • x=Independent or predictor variable (quantitative variable used as a predictor of y)*

    β0+β1x=E(y)=Deterministic component

  • ε(epsilon)=Random error component

  • β0(beta zero)=y-intercept of the line—that is, the point at which the line intersects, or cuts through, the y-axis (see Figure 9.2)

  • β1(beta one)= Slope of the line—that is, the change (amount of increase or decrease) in the deterministic component of y for every one-unit increase in x.

[Note: A positive slope implies that E(y) increases by the amount β1 for every one-unit increase in x. (See Figure 9.2.) A negative slope implies that E(y) decreases by the amount |β1| for every one-unit increase in x.]

Figure 9.2

The straight-line model

In the probabilistic model, the deterministic component is referred to as the line of means because the mean of y, E(y), is equal to the straight-line component of the model. That is,

E(y)=β0+β1x

Note that the Greek symbols β0 and β1 respectively represent the y-intercept and slope of the model. They are population parameters that will be known only if we have access to the entire population of (x, y) measurements. Together with a specific value of the independent variable x, they determine the mean value of y, which is just a specific point on the line of means (Figure 9.2).
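The idea that the line of means gives the average of y at each x, not any single observed y, can be checked by simulation. The sketch below is illustrative only: the values of `BETA0`, `BETA1`, and `sigma`, and the use of normal errors, are hypothetical choices, since in practice the parameters are unknown.

```python
import random
import statistics

# Hypothetical parameter values, chosen only for illustration.
BETA0 = 1.0   # y-intercept
BETA1 = 1.5   # slope


def line_of_means(x):
    """E(y) = beta0 + beta1 * x, the deterministic component."""
    return BETA0 + BETA1 * x


def simulate_y(x, rng, sigma=0.5):
    """One observation y = beta0 + beta1*x + epsilon (normal errors assumed)."""
    return line_of_means(x) + rng.gauss(0.0, sigma)


rng = random.Random(42)
x = 2.0
sample = [simulate_y(x, rng) for _ in range(20_000)]
sample_mean = statistics.fmean(sample)

print(f"E(y) at x={x}: {line_of_means(x):.3f}")
print(f"mean of {len(sample)} simulated y values: {sample_mean:.3f}")

# A one-unit increase in x changes E(y) by the slope beta1.
print(f"E(y) at x+1 minus E(y) at x: {line_of_means(x + 1) - line_of_means(x):.3f}")
```

Individual simulated values scatter around the line, but their average settles close to the point on the line of means for that x.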

Example 9.1 Modeling the Reaction Time to a Drug

Problem

  1. Consider an experiment designed to estimate the linear relationship between the percentage of a certain drug in the bloodstream of a subject and the length of time it takes the subject to react to a stimulus. In particular, the researchers want to predict reaction time based on the amount of drug in the bloodstream.

    1. In this study, identify the dependent and independent variables.

    2. Explain why a probabilistic model is more appropriate than a deterministic model.

    3. Write the equation of the straight-line, probabilistic model.

Solution

  1. Since the researchers want to predict (or model) reaction time, y=reaction time is the dependent variable. And since the researchers want to use the amount of drug in the bloodstream to make the prediction, x=amount of drug is the independent variable.

  2. It would be unrealistic to expect all subjects with the same amount of drug in the bloodstream (x) to have the same reaction time (y). That is, because of subject-to-subject variation, we don’t expect the amount of drug to determine a subject’s reaction time exactly. Consequently, a probabilistic model is more appropriate than a deterministic model.

  3. The probabilistic model takes the form y=β0+β1x+ε, where β0 is the y-intercept of the line, β1 is the slope of the line, and ε represents random error.

Look Ahead

In the next section, we show how to estimate the y-intercept and slope of the line from the sample data.

Now Work Exercise 9.12

The values of β0 and β1 will be unknown in almost all practical applications of regression analysis. The process of developing a model, estimating the unknown parameters, and using the model can be viewed as the five-step procedure shown in the following box:

Conducting a Simple Linear Regression:

  Step 1: Hypothesize the deterministic component of the model that relates the mean E(y) to the independent variable x (Section 9.2).

  Step 2: Use the sample data to estimate unknown parameters in the model (Section 9.2).

  Step 3: Specify the probability distribution of the random-error term and estimate the standard deviation of this distribution (Section 9.3).

  Step 4: Statistically evaluate the usefulness of the model (Sections 9.4 and 9.5).

  Step 5: When satisfied that the model is useful, use it for prediction, estimation, and other purposes (Section 9.6).
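As a rough preview, the five steps can be sketched in Python. Everything specific here is an assumption made for illustration: the data are synthetic (generated from a hypothetical line plus normal noise, not taken from any study), least squares is assumed as the estimation method in Step 2, and the coefficient of determination stands in for the fuller evaluation of Step 4. The later sections develop the actual methods.

```python
import math
import random

# Synthetic data for illustration only: x plays the role of drug percentage,
# y the reaction time, generated from a hypothetical line plus normal noise.
rng = random.Random(7)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.5 + 1.5 * x + rng.gauss(0.0, 0.4) for x in xs]
n = len(xs)

# Step 1: hypothesize the deterministic component E(y) = beta0 + beta1 * x.

# Step 2: estimate beta0 and beta1 from the sample (least squares assumed).
x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Step 3: estimate the standard deviation of the random error
# from the residuals, using n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Step 4: one simple usefulness measure, the coefficient of determination.
ss_yy = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - sse / ss_yy

# Step 5: use the fitted line to predict y at a new value of x.
x_new = 4.5
y_hat = b0 + b1 * x_new

print(f"estimated line: y = {b0:.3f} + {b1:.3f}x")
print(f"estimated error standard deviation s = {s:.3f}")
print(f"r^2 = {r_squared:.3f}")
print(f"predicted y at x = {x_new}: {y_hat:.3f}")
```

The estimates `b0` and `b1` are sample quantities; they approximate, but do not equal, the unknown population parameters β0 and β1.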

Exercises 9.1–9.14

Understanding the Principles

  1. 9.1 Why do we generally prefer a probabilistic model to a deterministic model? Give examples for which the two types of models might be appropriate.

  2. 9.2 What is the difference between a dependent variable and an independent variable in a probabilistic model?

  3. 9.3 What is the line of means?

  4. 9.4 If a straight-line probabilistic relationship relates the mean E(y) to an independent variable x, does it imply that every value of the variable y will always fall exactly on the line of means? Why or why not?

Learning the Mechanics

  1. 9.5 In each case, graph the line that passes through the given points.

    1. (1, 1) and (5, 5)

    2. (0, 3) and (3, 0)

    3. (1, 1) and (4, 2)

    4. (6, 3) and (2, 6)

  2. 9.6 Give the slope and y-intercept for each of the lines graphed in Exercise 9.5.

  3. 9.7 The equation (deterministic) for a straight line is

    y=β0+β1x

    If the line passes through the point (2, 4), then x=2, y=4 must satisfy the equation; that is,

    4=β0+β1(2)

    Similarly, if the line passes through the point (4, 6), then x=4, y=6 must satisfy the equation; that is,

    6=β0+β1(4)

    Use these two equations to solve for β0 and β1; then find the equation of the line that passes through the points (2, 4) and (4, 6).

  4. 9.8 Refer to Exercise 9.7. Find the equations of the lines that pass through the points listed in Exercise 9.5.

  5. 9.9 Plot the following lines:

    1. y=4+x

    2. y=52x

    3. y=4+3x

    4. y=2x

    5. y=x

    6. y=.50+1.5x

  6. 9.10 Give the slope and y-intercept for each of the lines defined in Exercise 9.9.

Applying the Concepts—Basic

  1. 9.11 Measuring the moon’s orbit. A handheld digital camera was used to photograph the moon’s orbit, and the pictures were used to measure the angular size (in pixels) of the moon at various distances (heights) above the horizon (American Journal of Physics, April 2014). The photographer wanted to predict the moon’s size based on the moon’s distance above the horizon using simple linear regression.

    1. In this study, identify the dependent and independent variables.

    2. Explain why a probabilistic model is more appropriate than a deterministic model.

    3. Write the equation of the straight-line, probabilistic model.

  2. 9.12 Game performance of water polo players. The journal Biology of Sport (Vol. 31, 2014) published a study of the physiological performance of top-level water polo players. Eight Olympic male water polo players participated in the study. Two variables were measured for each player during competition: mean heart rate over the four quarters of the game (expressed as a percentage of maximum heart rate) and maximal oxygen uptake (a measure of fitness). One objective of the study was to predict a player’s heart rate based on the player’s maximal oxygen uptake.

    1. In this study, identify the dependent and independent variables.

    2. Explain why a probabilistic model is more appropriate than a deterministic model.

    3. Write the equation of the straight-line, probabilistic model.

  3. 9.13 Estimating repair and replacement costs of water pipes. A team of civil engineers used simple linear regression analysis to model the ratio of repair to replacement cost of commercial pipe as a function of the pipe’s diameter (IHS Journal of Hydraulic Engineering, Sept. 2012).

    1. In this study, identify the dependent and independent variables.

    2. Explain why a probabilistic model is more appropriate than a deterministic model.

    3. Write the equation of the straight-line, probabilistic model.

  4. 9.14 Forecasting movie revenues with Twitter. A study presented at the 2010 IEEE International Conference on Web Intelligence and Intelligent Agent Technology investigated whether the volume of chatter on Twitter.com could be used to forecast the box office revenues of movies. For each movie in a sample of 24 recent movies, opening weekend box office revenue (in millions of dollars) was measured, as well as the movie’s tweet rate (the average number of tweets referring to the movie one week prior to the movie’s release).

    1. In this study, identify the dependent and independent variables.

    2. Explain why a probabilistic model is more appropriate than a deterministic model.

    3. Write the equation of the straight-line, probabilistic model.
