Now that we have discussed
the broad aims behind linear regression, let us take a look at how
this is achieved through a pictorial explanation, using the following
simple linear example.
Say you are an HR researcher. Continuing with the sales example,
you might wish to investigate the predictors of sales
figures among a sales force. Sales is therefore
the dependent variable (DV), which you operationalize as average dollars
sold per annum. You choose a range of possible predictors or independent
variables (IVs), which could be anything relevant, such as salesperson
age, education, attitudes, behaviors, economic variables, and the
like. To start, let us look at the age of the salesperson as
the only independent variable.
In Figure 13.2 (Starting to plot data values in simple regression), each
observation (in this case each salesperson) is represented by a dot,
positioned at that person's recorded values on the two axes
(the dependent variable sales on
the vertical Y-axis, the independent (predictor) variable age on the
horizontal X-axis). In the top half of
Figure 13.2, I have simply picked out two salespeople, whose data points
are located by their positions on the axes.
This is simple regression
– one predictor trying to explain a single dependent variable.
As discussed later, it does get slightly — although not a lot
— more complex when we add more predictor variables together.
Once we have a cloud
of data like that seen in the bottom half of
Figure 13.2, regression checks whether a straight line
drawn through the middle of the cloud can adequately represent the
shape of the data. For this to be the case, the data itself needs
to lie in the rough shape of a straight line – in other
words, the data needs to form something of a tube shape. If it does, then
we infer that a straight line adequately represents
our relationship, and we can proceed to have a look at what that line
tells us.
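As a minimal sketch of what drawing a line "through the middle of the cloud" can mean in practice, the Python fragment below fits a straight line by ordinary least squares. This is the usual criterion, although this section has not named it, so treat it as an assumption. The ages and sales figures are invented for illustration, and the function name `fit_line` is hypothetical.

```python
# Hypothetical data: age in years and average annual sales in
# thousands of dollars for six salespeople (invented for illustration).
ages = [25, 30, 35, 40, 45, 50]
sales = [52, 58, 61, 70, 74, 81]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * age")
```

With these invented figures the slope comes out positive, which is the numerical counterpart of an upward-sloping line drawn through the cloud.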
This is obviously a
specific version of Step 2 from Chapter 11 – looking for a
pattern in data by fitting an exact mathematical shape into the data
(in this case, a straight line) and seeing how well the data corresponds
to the line.
-
The data do seem to have a linear
trend, in that they form a tube shape.
-
In addition, when we examine the
line that we have drawn through the data, we see that on average older
salespeople seem to sell more and younger, less.
-
Fitting the best possible line
through the data seems to give us an upward-sloping line. This would
then possibly tell us something about a link between age and sales.
Once we get to three
or more predictors, plotting the complete data in graphs becomes all
but impossible. However, the same principles apply.
Summarizing the above,
then, we look at the data relationships to see if they seem to form
a straight line. If the data does seem to form a straight line, we
proceed to decide exactly what that line tells us. Therefore, there
are two questions you need to answer in multiple regression:
-
Is
my data a straight-line sort of shape? That is,
does a straight line fit the data?
-
What
does the line mean about the independent-dependent variable relationships? If
the regression relationship is a straight line, is that line sloped
sufficiently upwards (positive association) or downwards (negative)
to mean anything substantial?
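The two questions above can be put into rough numerical form. The sketch below is only an illustration under stated assumptions: it uses the typical distance of the points from the line (the root mean squared residual) as a stand-in for how tightly the data form a tube, and the fitted slope as the answer to the second question. The data and the names `fit_line` and `typical_miss` are hypothetical.

```python
import math

# Hypothetical data: age in years, annual sales in thousands of dollars.
ages = [25, 30, 35, 40, 45, 50]
sales = [52, 58, 61, 70, 74, 81]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(ages, sales)

# Question 1: how closely do the points hug the line?
# Residual = actual sales minus the sales the line predicts.
residuals = [y - (intercept + slope * x) for x, y in zip(ages, sales)]
typical_miss = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Question 2: how steeply does the line rise or fall?
print(f"typical distance from the line: {typical_miss:.2f}")
print(f"slope (change in sales per year of age): {slope:.2f}")
```

A small typical distance relative to the scale of sales suggests a tube shape. The slope then says how much sales change, on average, for each additional year of age.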
Inexperienced researchers
often make the mistake of assuming that because the data fits a straight
line well, a meaningful relationship exists. However,
you
have a strong regression only if the line is sloped up or down for
an independent variable. In the case of a flat
line with almost no slope, as seen in
Figure 13.6 (Good fit to a straight line but no apparent age effect on sales), the best indicator of a given salesperson’s sales
is really only the average sales across all salespeople.
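To make the flat-line case concrete, the following sketch fits a line to invented data in which sales barely change with age. It is an illustration only, with hypothetical numbers and a hypothetical `fit_line` helper. It shows that when the slope is close to zero, the line's prediction for any salesperson is nearly identical to the overall average.

```python
# Hypothetical data: sales hover around the same level at every age.
ages = [25, 30, 35, 40, 45, 50]
flat_sales = [66, 65, 67, 66, 65, 67]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(ages, flat_sales)
mean_sales = sum(flat_sales) / len(flat_sales)

# Predicted sales at each age, compared with the plain average.
predictions = [intercept + slope * age for age in ages]
largest_gap = max(abs(p - mean_sales) for p in predictions)

print(f"slope: {slope:.3f}")
print(f"average sales: {mean_sales:.1f}")
print(f"largest gap between a prediction and the average: {largest_gap:.2f}")
```

Here the points lie close to the line, yet knowing a salesperson's age adds almost nothing to simply quoting the average.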
Therefore the slopes
of the imaginary straight line through the data are the key thing
in regression, specifically how steeply upwards or downwards the slope
lies. Fit is merely the confirmation that the slopes of the straight
line are in fact meaningful for the dataset, in other words that a
straight line is a good way to summarize the data. Having said this,
fit remains the crucial condition for using regression in the first
place, so we will explore it further in the following section.