CHAPTER  9

Using Basic Statistical Procedures

9.1    Examining the Distribution of Data with PROC UNIVARIATE

9.2    Creating Statistical Graphics with PROC UNIVARIATE

9.3    Producing Statistics with PROC MEANS

9.4    Testing Means with PROC TTEST

9.5    Creating Statistical Graphics with PROC TTEST

9.6    Testing Categorical Data with PROC FREQ

9.7    Creating Statistical Graphics with PROC FREQ

9.8    Examining Correlations with PROC CORR

9.9    Creating Statistical Graphics with PROC CORR

9.10  Using PROC REG for Simple Regression Analysis

9.11  Creating Statistical Graphics with PROC REG

9.12  Using PROC ANOVA for One-Way Analysis of Variance

9.13  Reading the Output of PROC ANOVA

 

9.1     Examining the Distribution of Data with PROC UNIVARIATE

When you are doing statistical analysis, you usually have a goal in mind, a question that you are trying to answer, or a hypothesis you want to test. But before you jump into statistical tests, it is a good idea to pause and do a little exploration. A good procedure to use at this point is PROC UNIVARIATE.

PROC UNIVARIATE, which is part of Base SAS, produces statistics and graphs describing the distribution of a single variable. The statistics include the mean, median, mode, standard deviation, skewness, and kurtosis.

Using PROC UNIVARIATE is fairly simple. After the PROC statement, you specify one or more numeric variables in a VAR statement:

PROC UNIVARIATE;

    VAR variable-list;

Without a VAR statement, SAS will calculate statistics for all numeric variables in your data set. You can specify other options in the PROC statement, if you want, such as NORMAL, which produces tests of normality:

PROC UNIVARIATE NORMAL;

Example  The following data consist of test scores from a statistics class. Each line contains scores for 10 students.

56 78 84 73 90 44 76 87 92 75

85 67 90 84 74 64 73 78 69 56

87 73 100 54 81 78 69 64 73 65

This program reads the data from a file called Scores.dat and then runs PROC UNIVARIATE:

LIBNAME f2020 'c:MySASLib';

DATA f2020.statclass;

   INFILE 'c:MyRawDataScores.dat';

   INPUT Score @@;

RUN;

*Summarize data using PROC UNIVARIATE;

PROC UNIVARIATE DATA = f2020.statclass;

   VAR Score;

   TITLE;

RUN;

The output appears on the next page. The output starts with basic information about your distribution: number of observations (N), mean, and standard deviation. Skewness indicates how asymmetrical the distribution is (whether it is more spread out on one side), while kurtosis indicates how flat or peaked the distribution is. The normal distribution has values of 0 for both skewness and kurtosis. Other sections of the output contain three measures of central tendency: mean, median, and mode; tests of the hypothesis that the population mean is 0; quantiles; and extreme observations (in case you have outliers).

 

The UNIVARIATE Procedure Variable:  Score

Moments

N

30

Sum Weights

30

Mean

74.6333333

Sum Observations

2239

Std Deviation

12.5848385

Variance

158.378161

Skewness

-0.3495061

Kurtosis

0.10385765

Uncorrected SS

171697

Corrected SS

4592.96667

Coeff Variation

16.8622222

Std Error Mean

2.29766665

 

Basic Statistical Measures

Location

Variability

Mean

74.63333

Std Deviation

12.58484

Median

74.50000

Variance

158.37816

Mode

73.00000

Range

56.00000

 

 

Interquartile Range

17.00000

 

Tests for Location: Mu0=0

Test

Statistic

p Value

Student's t

t

32.48223

Pr > |t|

<.0001

Sign

M

15

Pr >= |M|

<.0001

Signed Rank

S

232.5

Pr >= |S|

<.0001

 

Quantiles (Definition 5)

Quantile

Estimate

100% Max

100.0

99%

100.0

95%

92.0

90%

90.0

75% Q3

84.0

50% Median

74.5

25% Q1

67.0

10%

56.0

5%

54.0

1%

44.0

0% Min

44.0

 

Extreme Observations

Lowest

Highest

Value

Obs

Value

Obs

44

6

87

21

54

24

90

5

56

20

90

13

56

1

92

9

64

28

100

23

9.2     Creating Statistical Graphics with PROC UNIVARIATE

The UNIVARIATE procedure can produce several graphs that are useful for data exploration. For example, histograms are a good way to visualize the distribution of a variable, while probability and quantile-quantile plots can show how the data compare to theoretical distributions. To produce these graphs, include the desired plot request statement. Here is the general form of PROC UNIVARIATE with plot requests:

PROC UNIVARIATE;

   VAR variable-list;

   Plot-request variable-list / options;

Plot requests  The following graphs can be requested with PROC UNIVARIATE:

CDFPLOT

cumulative distribution function plot

HISTOGRAM

histogram

PPPLOT

probability-probability plot

PROBPLOT

probability plot

QQPLOT

quantile-quantile plot

If no variable list is specified, then plots will be produced for all variables in the VAR statement. If there is no VAR statement or variable list, then plots will be produced for all numeric variables.

Plot options  The CDFPLOT and HISTOGRAM plots show the distribution of the specified variable. To overlay a curve showing a standard distribution, specify the desired distribution with a plot option. Available distribution options include: BETA, EXPONENTIAL, GAMMA, LOGNORMAL, NORMAL, and WEIBULL. The PPPLOT, PROBPLOT, and QQPLOT statements use the normal distribution as the default. If you would like to use a different distribution, specify it with a plot option. For example, to create a probability plot of the variable Score using the exponential distribution, use the following:

PROBPLOT Score / EXPONENTIAL;

Example  This example uses the data from the previous section, which consist of test scores from 30 students in a statistics class. The following program creates a histogram of the Score variable with the normal distribution overlaid, and a probability plot using the normal distribution:

LIBNAME f2020 'c:MySASLib';

*Create histogram and probability plot for variable Score;

PROC UNIVARIATE DATA = f2020.statclass;

   VAR Score;

    HISTOGRAM Score / NORMAL;

    PROBPLOT Score;

   TITLE;

RUN;

 

Here are the two plots: a histogram and a probability plot. (The NORMAL option in the HISTOGRAM statement produces additional tabular results that are not shown.) The relatively linear pattern formed by the points in the probability plot indicate that the data are closely matched to the normal distribution.

image

image

9.3     Producing Statistics with PROC MEANS

Most of the descriptive statistics that you produce with PROC UNIVARIATE you can also produce with PROC MEANS. PROC UNIVARIATE is useful when you want an in-depth statistical analysis of the data distribution. But if you know you want only a few statistics, then PROC MEANS is a better way to go. With PROC MEANS you can ask for just the statistics you want. The MEANS procedure does not produce any ODS Graphics.

The MEANS procedure requires only one statement:

PROC MEANS statistic-keywords;

If you do not include any statistic keywords, then MEANS will produce the mean, the number of nonmissing values, the standard deviation, the minimum value, and the maximum value for each numeric variable. The following table shows statistics you can request. (Some statistics have two names; the alternate name is shown in parentheses.) If you add any statistic keywords in the PROC MEANS statement, then MEANS will no longer produce the default statistics—you must request all the statistics you want.

CLM

two-sided confidence limits

RANGE

range

CSS

corrected sum of squares

SKEWNESS

skewness

CV

coefficient of variation

STDDEV

standard deviation

KURTOSIS

kurtosis

STDERR

standard error of the mean

LCLM

lower confidence limit

SUM

sum

MAX

maximum value

SUMWGT

sum of weighted variables

MEAN

mean

UCLM

upper confidence limit

MIN

minimum value

USS

uncorrected sum of squares variance

MODE

mode

VAR

variance

N

number of nonmissing values

PROBT

probability for Student’s t

NMISS

number of missing values

T

Student’s t

MEDIAN(P50)

median

Q3(P75)

75% quantile

Q1(P25)

25% quantile

P5

5% quantile

P1

1% quantile

P90

90% quantile

P10

10% quantile

P99

99% quantile

P95

95% quantile

 

 

 

Confidence limits  The default confidence level for the confidence limits is .05 or 95%. If you want a different confidence level, then request it with the ALPHA= option in the PROC MEANS statement. For example, if you want 90% confidence limits, then specify ALPHA=.10 along with the CLM option. Then the PROC MEANS statement would look like this:

PROC MEANS ALPHA = .10 CLM;

 

VAR statement  By default, PROC MEANS will produce statistics for all numeric variables in your data set. If you do not want all the variables, then specify the ones you want in the VAR statement. Here is the general form of the MEANS procedure with the VAR statement:

PROC MEANS options;

   VAR variable-list;

Example  Your friend is an aspiring author of children’s books. To increase her chances of getting her books published, she wants to know how many pages her books should have. At the local library, she counts the number of pages in a random selection of children’s picture books. Here are the data:

34 30 29 32 52 25 24 27 31 29

24 26 30 30 30 29 21 30 25 28

28 28 29 38 28 29 24 24 29 31

30 27 45 30 22 16 29 14 16 29

32 20 20 15 28 28 29 31 29 36

To determine the average number of pages in children’s picture books, use the MEANS procedure. PROC MEANS can also produce the median number of pages, as well as the 90% confidence limits. Here is the program that will read the data and produce the desired statistics:

DATA booklengths;

   INFILE 'c:MyRawDataPicbooks.dat';

   INPUT NumberOfPages @@;

RUN;

*Produce summary statistics;

PROC MEANS DATA = booklengths N MEAN MEDIAN CLM ALPHA = .10;

   TITLE 'Summary of Picture Book Lengths';

RUN;

Here are the results of the MEANS procedure:

Summary of Picture Book Lengths

The MEANS Procedure

Analysis Variable : NumberOfPages

N

Mean

Median

Lower 90%
CL for Mean

Upper 90%
CL for Mean

50

28.0000000

29.0000000

26.4419136

29.5580864

The average number of pages in the children’s books sampled was 28. The median value of 29 says that half the books sampled had 29 pages or fewer. The confidence limits tell us that we are 90% certain that the true population mean (all children’s picture books) falls between 26.44 and 29.56 pages. From this analysis your friend concludes that she should make her books between 26 and 30 pages long to maximize her chances of getting published (of course subject matter and writing style might also help).

9.4     Testing Means with PROC TTEST

As you would expect from its name, the TTEST procedure, which is part of SAS/STAT software, computes t tests. You use t tests when you want to compare means. Suppose, for example, that a statistics instructor selected a random sample of students to have extra tutoring. She could test whether the true mean of their scores was above a specific level (called a one-sample test), she could compare the scores of students who had tutoring to those who did not (a two independent sample test), and she could compare the scores of students before and after tutoring (a paired test). PROC TTEST performs all these types of tests.

One sample comparisons  To compute a t test for a single mean, you list that variable in a VAR statement. SAS will test whether the mean is significantly different from H0, a specified null value. The default value of H0 is zero. You can specify a different value using the H0= option.

PROC TTEST H0 = n options;

   VAR variable;

Two independent sample comparisons  To compare two independent groups, you use a CLASS and a VAR statement. In the CLASS statement, you list the variable that distinguishes the two groups. In the VAR statement, you list the response variable.

PROC TTEST options;

   CLASS variable;

   VAR variable;

Paired comparisons  When the variables that you are comparing are paired, you use a PAIRED statement. The simplest form of this statement lists the two variables to be compared, separated by an asterisk.

PROC TTEST options;

   PAIRED variable1 * variable2;

Options  Here are a few of the options available:

ALPHA = n           specifies the level for the confidence limits. The value of n must be between 0 (100% confidence) and 1 (0% confidence). The default is  0.05 (95% confidence limits).

CI = type           specifies the type of confidence interval for the standard deviation. If you don't specify this option, then by default the value of type is   EQUAL, which produces an equal-tailed confidence interval. Other possible values are UMPU for an interval based on the uniformly most powerful unbiased test, and NONE to request no confidence interval for the standard deviation.

H0 = n                  requests a test of the hypothesis H0 = n. The default value is 0.

NOBYVAR                moves the names of the variables from the title to the output table.

SIDES = type    specifies whether the p-value and confidence interval are one or two-tailed. Possible values of type are 2 (the default) for two-tailed, L (for a lower one-sided test) or U (for an upper one-sided test).

 

Example  The following data give the finishing times for semifinal and final races of the women’s 50-meter freestyle swim. Each swimmer’s initials are followed by their final time and semifinal time in seconds. Each line of data contains times for four swimmers.

RK 24.05 24.07 AH 24.28 24.45 MV 24.39 24.50 BS 24.46 24.57

FH 24.47 24.63 TA 24.61 24.71 JH 24.62 24.68 AV 24.69 24.64

The following program reads the raw data and uses a paired t test to test the mean difference between the semifinal and final times:

LIBNAME results 'c:MySASLib';

DATA results.swim;

   INFILE 'c:MyRawDataOlympic50mSwim.dat';

   INPUT Swimmer $ FinalTime SemiFinalTime @@;

RUN;

*Perform paired t test for semifinal and final times;

PROC TTEST DATA = results.swim;

  TITLE '50m Freestyle Semifinal vs. Final Results';

  PAIRED SemiFinalTime * FinalTime;

RUN;

Here are the tabular results produced by PROC TTEST. Graphical results are shown in the next section.

50m Freestyle Semifinal vs. Final Results

 

The TTEST Procedure

 

Difference:   SemiFinalTime - FinalTime

 

N

Mean

Std Dev

Std Err

Minimum

Maximum

8

0.0850

0.0731

0.0258

-0.0500

0.1700

 

 

Mean

95% CL Mean

Std Dev

95% CL Std Dev

0.0850

0.0239

0.1461

0.0731

0.0483

0.1488

 

 

DF

t Value

Pr > |t|

7

3.29

0.0133

 

In this example, the mean difference between each swimmer’s semifinal time and their final time is 0.0850 seconds. The t test shows significant evidence (DF=7, t= 3.29, p = 0.0133) of a difference between the mean semifinal and final times.

9.5     Creating Statistical Graphics with PROC TTEST

The TTEST procedure uses ODS Graphics to produce several plots that help you visualize your data including histograms, box plots, and Q-Q plots. Many plots are generated by default, but you can control which plots are created using the PLOTS option in the PROC TTEST statement. Here is the general form of the PROC TTEST statement with plot options:

PROC TTEST PLOTS = (plot-request-list);

Plot requests  The plots available to you depend on the type of comparison you request. Here are plots you can request for one sample, two sample, and paired t tests:

ALL                                     all appropriate plots

BOXPLOT                            box plots

HISTOGRAM                       histograms overlaid with normal and kernel density curves

INTERVALPLOT                plots of confidence interval of means

NONE                                   no plots

QQPLOT                              normal quantile-quantile (Q-Q) plot

SUMMARYPLOT                  one plot that includes both histograms and box plots

The following plots are also available for paired t tests:

AGREEMENTPLOT              agreement plots

PROFILESPLOT                profiles plot

Excluding automatic plots  By default the QQPLOT and SUMMARYPLOT plots are generated automatically for one sample, two sample and paired t tests. For paired t tests, the AGREEMENTPLOT and PROFILESPLOT are also generated by default. If you choose specific plots in the plot-list, the default plots will still be created unless you add the ONLY global option:

PROC TTEST PLOTS(ONLY) = (plot-request-list);

Example  This example uses the data from the previous section, which gives the finishing times for semifinal (SemiFinalTime) and final races (FinalTime) of the women’s 50-meter freestyle swim. The following program uses a paired t test to test the mean difference between the semifinal and final times, and requests just the Summary and Q-Q plots.

LIBNAME results 'c:MySASLib';

*Produce just the Summary and QQ plots;

PROC TTEST DATA = results.swim PLOTS(ONLY) = (SUMMARYPLOT QQPLOT);

  TITLE '50m Freestyle Semifinal vs. Final Results';

  PAIRED SemiFinalTime * FinalTime;

RUN;

 

Here are the results for the Q-Q and Summary plots. The tabular results for the paired t test were shown in the preceding section.

image

image

 

 

9.6     Testing Categorical Data with PROC FREQ

PROC FREQ, which is part of Base SAS, produces many statistics for categorical data. The best known of these is chi-square. One of the most common uses of PROC FREQ is to test the hypothesis of no association between two variables. Another use is to compute measures of association, which indicate the strength of the relationship between the variables. The general form of PROC FREQ is:

PROC FREQ;

   TABLES variable-combinations / options;

Options  Here are a few of  the statistical options that you can request:

AGREE

tests and measures of classification agreement, including McNemar’s test, Bowker’s test, Cochran’s Q test, and kappa statistics

CHISQ

chi-square tests of independence and measures of association

CL

confidence limits for measures of association

CMH

Cochran-Mantel-Haenszel statistics, typically for stratified two-way tables

EXACT

Fisher’s exact test for tables larger than 2X2

MEASURES

measures of association, including Pearson and Spearman correlation coefficients, gamma, Kendall’s tau-b, Stuart’s tau-c, Somer’s D, lambda, odds ratios, risk ratios, and confidence intervals

RELRISK

relative risk measures for 2X2 tables

TREND

Cochran-Armitage test for trend

Example  One day your neighbor, who rides the bus to work, complains that the regular bus is usually late. He says the express bus is usually on time. Realizing that this is categorical data, you decide to test whether there really is a relationship between the type of bus and arriving on time. You collect data for type of bus (E for express or R for regular) and promptness (L for late or O for on time). Each line of data contains several observations.

E O E L E L R O E O E O E O R L R O R L R O E O R L E O R L R O E O

E O R L E L E O R L E O R L E O R L E O R O E L E O E O E O E O E L

E O E O R L R L R O R L E L E O R L R O E O E O E O E L R O R L

The following program reads the raw data and runs PROC FREQ with the CHISQ option:

LIBNAME trans 'c:MySASLib';

DATA trans.bus;

   INFILE 'c:MyRawDataBus.dat';

   INPUT BusType $ OnTimeOrLate $ @@;

RUN;

*Perform chi-square analysis on bus data;

PROC FREQ DATA = trans.bus;

   TABLES BusType * OnTimeOrLate / CHISQ;

   TITLE;

RUN;

 

Here are the results showing that the regular bus is late 61.90% of the time, while the express bus is late only 24.14% of the time. Assuming that bus type and arrival time are independent, the probability of obtaining a chi-square this large or larger by chance alone is 0.0071. So the data do support the idea that there is an association between type of bus and arrival time. The Fisher’s exact test provides the same conclusion with a p-value of 0.0097.

The FREQ Procedure

Table of BusType by OnTimeOrLate

BusType

OnTimeOrLate

Frequency
Percent
Row Pct
Col Pct

L

O

Total

E

7
14.00
24.14
35.00

22
44.00
75.86
73.33

29
58.00


R

13
26.00
61.90
65.00

8
16.00
38.10
26.67

21
42.00


Total

20
40.00

30
60.00

50
100.00

 

Statistics for Table of BusType by OnTimeOrLate

 

Statistic

DF

Value

Prob

 

 

Chi-Square

1

7.2386

0.0071

 

 

Likelihood Ratio Chi-Square

1

7.3364

0.0068

 

 

Continuity Adj. Chi-Square

1

5.7505

0.0165

 

 

Mantel-Haenszel Chi-Square

1

7.0939

0.0077

 

 

Phi Coefficient

 

-0.3805

 

 

 

Contingency Coefficient

 

0.3556

 

 

 

Cramer's V

 

-0.3805

 

 

 

Fisher's Exact Test

Cell (1,1) Frequency (F)

7

Left-sided Pr <= F

0.0081

Right-sided Pr >= F

0.9987

 

 

Table Probability (P)

0.0067

Two-sided Pr <= P

0.0097

 

Sample Size = 50

 

9.7     Creating Statistical Graphics with PROC FREQ

The FREQ procedure uses ODS Graphics to produce several plots that help you visualize your data including frequency plots, odds ratio plots, agreement plots, deviation plots, and two types of plots with Kappa statistics and confidence limits. Here is the general form of PROC FREQ with plot options:

PROC FREQ;

   TABLES variable-combinations / options PLOTS = (plot-list);

Plot requests  The plots available to you depend on the type of table you request. For example, the DEVIATIONPLOT is available only for one-way tables when using the CHISQ option in the TABLES statement. Here are the plots that you can request along with the required option, if any, and type of table request:

Plot Name

Table type

Required option in TABLES statement

AGREEPLOT

two-way

AGREE

CUMFREQPLOT

one-way

 

DEVIATIONPLOT

one-way

CHISQ

FREQPLOT

any request

 

KAPPAPLOT

three-way

AGREE

ODDSRATIOPLOT

hx2x2

MEASURES or RELRISK

RELREISKPLOT

hx2x2

MEASURES or RELRISK

RISKDIFFPLOT

hx2x2

RISKDIFF

WTKAPPAPLOT

hxrxr (r>2)

AGREE

To produce a CUMFREQPLOT or a FREQPLOT, you must specify it in the PLOTS= option in the TABLES statement. Otherwise, if you do not specify any plots in the TABLE statement, then all plots associated with the table type you request will be produced by default.

Plot options  Many options are available that control the look of the plots generated. For a complete list of options, see the SAS Documentation. For example, the FREQPLOT has options for controlling the layout of the plots for two-way tables. By default, the bars are grouped vertically. To group the bars horizontally, use the following:

TABLES variable1 * variable2 / PLOTS = FREQPLOT(TWOWAY = GROUPHORIZONTAL);

To stack the bars, use the TWOWAY=STACKED option.

Example  This example uses the same data as the previous section about promptness of buses. The data are for type of bus (E for express or R for regular) and promptness (L for late or O for on time). The following program uses PROC FREQ to request a two-way frequency table. The PLOTS=FREQPLOT option in the TABLES statement produces a frequency plot. Adding the TWOWAY=GROUPHORIZONTAL option to FREQPLOT produces bars that are grouped horizontally instead of vertically. The FORMAT procedure creates formats that are applied to the BusType and OnTimeOrLate variables using a FORMAT statement in the FREQ procedure. This gives more descriptive labels to the plot.

 

LIBNAME trans 'c:MySASLib';

PROC FORMAT;

  VALUE $type 'R'='Regular'

              'E'='Express';

  VALUE $late 'O'='On Time'

              'L'='Late';

RUN;

*Produce frequency plot grouped horizontally;

PROC FREQ DATA = trans.bus;

   TABLES BusType * OnTimeOrLate / PLOTS=FREQPLOT(TWOWAY=GROUPHORIZONTAL);

   FORMAT BusType $Type. OnTimeOrLate $Late.;

RUN;

Here is the plot. The tabular portion of the output, the frequency table, was shown in the previous section.

image

 

9.8     Examining Correlations with PROC CORR

The CORR procedure, which is included with Base SAS, computes correlations. A correlation coefficient measures the strength of the linear relationship between two variables. If two variables are completely unrelated, they will have a correlation of 0. If two variables have a perfect linear relationship, they will have a correlation of 1.0 or –1.0. In real life, correlations fall somewhere between these numbers. The basic syntax for PROC CORR is rather simple:

PROC CORR;

RUN;

By default, SAS will compute correlations between all possible pairs of the numeric variables. You can add the VAR and WITH statements to specify variables:

VAR variable-list;

WITH variable-list;

Variables listed in the VAR statement appear across the top of the table of correlations, while variables listed in the WITH statement appear down the side of the table. If you use a VAR statement but no WITH statement, then the variables will appear both across the top and down the side.

By default, PROC CORR computes Pearson product-moment correlation coefficients. You can add options to the PROC statement to request nonparametric correlation coefficients. The SPEARMAN option in the statement below tells SAS to compute Spearman’s rank correlations instead of Pearson’s correlations:

PROC CORR SPEARMAN;

Other options include HOEFFDING (for Hoeffding’s D statistic) and KENDALL (for Kendall’s tau-b coefficient).

Example  Each student in a statistics class recorded three values: test score, the number of hours spent watching videos in the week prior to the test, and the number of hours spent exercising during the same week. Here are the raw data:

56 6 2   78 7 4   84 5 5   73 4 0   90 3 4

44 9 0   76 5 1   87 3 3   92 2 7   75 8 3

85 1 6   67 4 2   90 5 5   84 6 5   74 5 2

64 4 1   73 0 5   78 5 2   69 6 1   56 7 1

87 8 4   73 8 3  100 0 6   54 8 0   81 5 4

78 5 2   69 4 1   64 7 1   73 7 3   65 6 2

Notice that each line contains data for five students. The following program reads the raw data from a file called Exercise.dat, and then uses PROC CORR to compute the correlations:

 

LIBNAME mylib 'c:MySASLib';

DATA mylib.class;

   INFILE 'c:MyRawDataExercise.dat';

   INPUT Score Videos Exercise @@;

RUN;

*Perform correlation analysis;

PROC CORR DATA = mylib.class;

    VAR Videos Exercise;

   WITH Score;

   TITLE 'Correlations for Test Scores';

   TITLE2 'With Hours of Videos and Exercise';

RUN;

Here is the report from PROC CORR:

Correlations for Test Scores

With Hours of Videos and Exercise

The CORR Procedure

 

1 With Variables:

Score

2      Variables:

Videos Exercise

 

Simple Statistics

Variable

N

Mean

Std Dev

Sum

Minimum

Maximum

Score

30

74.63333

12.58484

2239

44.00000

100.00000

Videos

30

5.10000

2.33932

153.00000

0

9.00000

Exercise

30

2.83333

1.94906

85.00000

0

7.00000

 

Pearson Correlation Coefficients, N = 30 Prob > |r| under H0: Rho=0

 

Videos

Exercise

Score

-0.55390
0.0015

0.79733
<.0001

This report starts with descriptive statistics for each variable, and then lists the correlation matrix, which contains: correlation coefficients (in this case, Pearson), and the probability of getting a larger absolute value for each correlation assuming that the population correlation is zero.

In this example, both hours of videos and hours of exercise are correlated with test score, but exercise is positively correlated while videos is negatively correlated. This means students who watched more videos tended to have lower scores, while the students who spent more time exercising tended to have higher scores.

9.9     Creating Statistical Graphics with PROC CORR

The CORR procedure evaluates the strength of the linear relationship between pairs of variables. The tabular output gives the correlation coefficients and other simple statistics, and using ODS Graphics you can also visualize the relationship. Plots are not generated by default, so you need to specify the desired plots using the PLOTS= option. Here is the general form of PROC CORR with the PLOTS option:

PROC CORR PLOTS = (plot-list);

   VAR variable-list;

   WITH variable-list;

Plot requests  The CORR procedure can create two types of plots:

SCATTER

scatter plots for pairs of variables overlaid with prediction or confidence ellipses

MATRIX

matrix of scatter plots for all variables

Plot options  By default, the scatter plots include prediction ellipses for new observations. If you want confidence ellipses for means, then specify the ELLIPSE=CONFIDENCE option on the scatter plot:

PROC CORR PLOTS = SCATTER(ELLIPSE = CONFIDENCE);

If you do not want any ellipses on your scatter plots, then use the ELLIPSE=NONE option.

If you do not have a WITH statement, then matrix plots will show a symmetrical plot with all variable combinations appearing twice. By default the diagonal cells in the matrix will be empty. If you use the HISTOGRAM option for the matrix plot, then a histogram will be produced for each variable and displayed along the diagonal.

PROC CORR PLOTS = MATRIX(HISTOGRAM);

Example  This example uses the data from the previous section about students in a statistics class. For each student, we have the test score (Score), the number of hours spent watching videos in the week prior to the test (Videos), and the number of hours spent exercising during the same week (Exercise). The following program uses the same PROC CORR statements shown in the previous section except both the scatter and matrix plots are requested:

LIBNAME mylib 'c:MySASLib';

*Generate scatter and matrix plots for correlation analysis;

PROC CORR DATA = mylib.class PLOTS = (SCATTER MATRIX);

   VAR Videos Exercise;

   WITH Score;

   TITLE 'Correlations for Test Scores';

   TITLE2 'With Hours of Videos and Exercise';

RUN;

 

The program produces three plots: a scatter plot for Score by Videos, a scatter plot for Score by Exercise (not shown), and the Scatter Plot Matrix.

image

image

 

9.10   Using PROC REG for Simple Regression Analysis

The REG procedure fits linear regression models by least squares and is one of many SAS procedures that perform regression analysis. PROC REG is part of SAS/STAT, which is licensed separately from Base SAS. We will show an example of a simple regression analysis using continuous numeric variables with only one regressor variable. However, PROC REG is capable of analyzing models with many regressor variables using a variety of model-selection methods, including stepwise regression, forward selection, and backward elimination. Other procedures in SAS/STAT will perform non-linear and logistic regression. In SAS/ETS, you will find procedures for time series analysis. If you are unsure about what type of analysis you need, or are unfamiliar with basic statistical principles, we recommend that you seek advice from a trained statistician, or consult a good statistical textbook.

The REG procedure has only two required statements. It must start with the PROC REG statement and have a MODEL statement specifying the analysis model. Here is the general form:

PROC REG;

   MODEL dependent = independent;

In the MODEL statement, the dependent variable is listed on the left side of the equal sign and the independent, or regressor, variable is listed on the right.

Example  At your young neighbor’s T-ball game (that’s where the players hit the ball from the top of a tee instead of having the ball pitched to them), he said to you, “You can tell how far they’ll hit the ball by how tall they are.” To give him a little practical lesson in statistics, you decide to test his hypothesis. You gather data from 30 players, measuring their height in inches and their longest of three hits in feet. The following are the data. Notice that data for several players are listed on one line:

50 110  49 135  48 129  53 150  48 124 50 143  51 126  45 107  

53 146  50 154  47 136  52 144  47 124 50 133  50 128  50 118

48 135  47 129  45 126  48 118  45 121 53 142  46 122  47 119

51 134  49 130  46 132  51 144  50 132 50 131

The following program reads the data and performs the regression analysis. In the MODEL statement, Distance is the dependent variable, and Height is the independent variable.

LIBNAME tball 'c:MySASLib';

DATA tball.hits;

   INFILE 'c:MyRawDataBaseball.dat';

   INPUT Height Distance @@;

RUN;

* Perform regression analysis;

PROC REG DATA = tball.hits;

   MODEL Distance = Height;

   TITLE 'Results of Regression Analysis';

RUN;

The REG procedure produces tabular results and several graphs by default. Only the tabular results are shown here; see the next section for an example showing the graphical results. The first section of the tabular output is the analysis of variance section, which gives information about how well the model fits the data:

Results of Regression Analysis

The REG Procedure
Model: MODEL1
Dependent Variable: Distance

 

Number of Observations Read

30

 

 

Number of Observations Used

30

 

 

Analysis of Variance

Source

DF

Sum of Squares

Mean Square

F Value

Pr > F

Model

1

1365.50831

1365.50831

16.86

0.0003

Error

28

2268.35836

81.01280

 

 

Corrected Total

29

3633.86667

 

 

 

 

Root MSE

9.00071

R-Square

0.3758

 Dependent Mean

130.73333

Adj R-Sq

0.3535

Coeff Var

6.88479

 

 

 

DF

degrees of freedom associated with the source

Mean Square

mean square (sum of squares divided by the degrees of freedom)

F value

F value for testing the null hypothesis (all parameters are zero except intercept)

Pr > F

significance probability or p-value

Root MSE  

root mean square error

Coeff Var

the coefficient of variation

Adj R-sq

the R-square value adjusted for degrees of freedom

The parameter estimates follow the analysis of variance section and give the parameters for each term in the model, including the intercept:

Parameter Estimates

Variable

DF

Parameter Estimate

Standard Error

t Value

Pr > |t|

Intercept

1

-11.00859

34.56363

-0.32

0.7525

Height

1

2.89466

0.70506

4.11

0.0003

 

DF

degrees of freedom for the variable

t Value

t test for the parameter equal to zero

Pr > |t|

two-tailed significance probability

From the parameter estimates you can construct the regression equation:

Distance = -11.00859 + (2.89466 * Height)

In this example, the distance the ball was hit did increase with the player’s height. The slope of the model is significant (p= 0.0003) but the relationship is not very strong (R-square = 0.3758). Perhaps age or years of experience are better predictors of how far the ball will go.

9.11   Creating Statistical Graphics with PROC REG

There are many plots that are useful for visualizing the results of regression analysis and for assessing how well the model fits the data. PROC REG uses ODS Graphics to produce many such plots, including a diagnostic panel that contains up to nine plots in one figure. Some plots are produced automatically while others need to be specified. Here is the general form of PROC REG with the PLOTS option:

PROC REG PLOTS(options) = (plot-request-list);

   MODEL dependent = independent;

Plot requests  Here are some plots that you can request for simple linear regression:

FITPLOT

scatter plot with regression line and confidence and prediction bands

RESIDUALS

residuals plotted against independent variable

DIAGNOSTICS

diagnostics panel including all of the following plots

COOKSD

Cook’s D statistic by observation number

OBSERVEDBYPREDICTED

dependent variable by predicted values

QQPLOT

normal quantile plot of residuals

RESIDUALBYPREDICTED

residuals by predicted values

RESIDUALHISTOGRAM

histogram of residuals

RFPLOT

residual fit plot

RSTUDENTBYLEVERAGE

studentized residuals by leverage

RSTUDENTBYPREDICTED

studentized residuals by predicted values

Excluding automatic plots  By default, the RESIDUALS and DIAGNOSTICS plots are generated automatically. Additional plots may also be produced by default depending on the type of model. For example, a FITPLOT is automatically generated when there is one regressor variable. If you choose specific plots in the plot-request-list, the default plots will still be created unless you use the ONLY global option:

PROC REG PLOTS(ONLY) = (plot-request-list);

Example  The following example uses the same data as the previous section about T-ball players. The data are the player’s height in inches (Height) and their longest of three hits in feet (Distance). The following program performs the regression analysis as in the previous section. However, in this program only the FITPLOT and DIAGNOSTICS plots are requested:

LIBNAME tball 'c:MySASLib';

*Limit plots generated to diagnostics panel and fit plot;

PROC REG DATA = tball.hits PLOTS(ONLY) = (DIAGNOSTICS FITPLOT);

   MODEL Distance = Height;

   TITLE 'Results of Regression Analysis';

RUN;

 

Here are the results for the Fit Diagnostics panel and the Fit Plot. The tabular results are not shown here, but are the same as in the previous section.

image

image

 

9.12   Using PROC ANOVA for One-Way Analysis of Variance

The ANOVA procedure is one of many in SAS that perform analysis of variance. PROC ANOVA is part of SAS/STAT, which is licensed separately from Base SAS. PROC ANOVA is specifically designed for balanced data—data where there are equal numbers of observations in each combination of the classification factors. An exception is for one-way analysis of variance where the data do not need to be balanced. If you are not doing one-way analysis of variance and your data are not balanced, then you should use the GLM procedure, whose statements are almost identical to those of PROC ANOVA. Although we are discussing only simple one-way analysis of variance in this section, PROC ANOVA can handle multiple classification variables and models that include nested and crossed effects as well as repeated measures. If you are unsure of the appropriate analysis for your data, or are unfamiliar with basic statistical principles, we recommend that you seek advice from a trained statistician or consult a good statistical textbook.

The ANOVA procedure has two required statements: the CLASS and MODEL statements. Here is the general form of the ANOVA procedure:

PROC ANOVA;

   CLASS variable-list;

   MODEL dependent = effects;

The CLASS statement must come before the MODEL statement and defines the classification variables. For one-way analysis of variance, only one variable is listed. The MODEL statement defines the dependent variable and the effects. For one-way analysis of variance, the effect is the classification variable.

As you might expect, there are many optional statements for PROC ANOVA. One of the most useful is the MEANS statement, which calculates means of the dependent variable for any of the main effects in the MODEL statement. In addition, the MEANS statement can perform several types of multiple-comparison tests, including Bonferroni t tests (BON), Duncan’s multiple-range test (DUNCAN), Scheffe’s multiple-comparison procedure (SCHEFFE), pairwise t tests (T), and Tukey’s studentized range test (TUKEY). The MEANS statement has the following general form:

MEANS effects / options;

The effects can be any effect in the MODEL statement, and options include the name of the desired multiple-comparison test (DUNCAN for example).

Example  Your friend’s daughter plays basketball on a team that travels throughout the state. She complains that it seems like the girls from the other regions in the state are all taller than the girls from her region. You decide to test her hypothesis by getting the heights for a sample of girls from the four regions and performing one-way analysis of variance to see if there are any differences. Each data line includes region and height for eight girls:

West 65 West 58 West 63 West 57 West 61 West 53 West 56 West 66

West 55 West 56 West 65 West 54 West 55 West 62 West 55 West 58

East 65 East 55 East 57 East 66 East 59 East 63 East 58 East 57

East 58 East 63 East 61 East 62 East 58 East 57 East 65 East 57

South 63 South 63 South 68 South 56 South 60 South 65 South 64 South 62

South 59 South 67 South 59 South 65 South 66 South 67 South 64 South 68

North 63 North 65 North 58 North 55 North 57 North 66 North 59 North 61

North 65 North 56 North 57 North 63 North 61 North 60 North 64 North 62

You want to know which, if any, regions have taller girls than the rest, so you use the MEANS statement in your program and choose Scheffe’s multiple-comparison method to compare the means. Here is the program to read the data and perform the analysis of variance:

DATA heights;

   INFILE 'c:MyRawDataGirlHeights.dat';

   INPUT Region $ Height @@;

RUN;

* Use ANOVA to run one-way analysis of variance;

PROC ANOVA DATA = heights;

   CLASS Region;

   MODEL Height = Region;

   MEANS Region / SCHEFFE;

   TITLE "Girls' Heights from Four Regions";

RUN;

In this case, Region is the classification variable and also the effect in the MODEL statement. Height is the dependent variable. The MEANS statement will produce means of the girls’ heights for each region, and the SCHEFFE option will test which regions are different from the others.

Here is the box plot that is created automatically. The small p-value (p=0.0051) indicates that at least two of the four regions differ in mean height. The tabular output is shown and discussed in the next section.

image

9.13   Reading the Output of PROC ANOVA

The tabular output from PROC ANOVA has at least two parts. First, ANOVA produces a table giving information about the classification variables: number of levels, values, and number of observations. Next, it produces an analysis of variance table. If you use optional statements like MEANS, then their output will follow.

The example from the previous section used the following PROC ANOVA statements:

PROC ANOVA DATA = heights;

   CLASS Region;

   MODEL Height = Region;

   MEANS Region / SCHEFFE;

   TITLE "Girls' Heights from Four Regions";

RUN;

The graph produced by the ANOVA procedure is shown in the previous section. The first page of the tabular output (shown below) gives information about the classification variable Region. It has four levels with values East, North, South, and West; and there are 64 observations.

Girls' Heights from Four Regions

The ANOVA Procedure

 

Class Level Information

Class

Levels

Values

Region

4

East North South West

 

 

Number of Observations Read

64

Number of Observations Used

64

The second part of the output is the analysis of variance table:

Girls' Heights from Four Regions

The ANOVA Procedure

 

Dependent Variable: Height

 

Source

DF

Sum of Squares

Mean Square

F Value

Pr > F

Model

3

196.625000

65.541667

4.72

0.0051

Error

60

833.375000

13.889583

 

 

Corrected Total

63

1030.000000

 

 

 

 

R-Square

Coeff Var

Root MSE

Height Mean

0.190898

6.134771

3.726873

60.75000

 

Source

DF

Anova SS

Mean Square

F Value

Pr > F

Region

3

196.6250000

65.5416667

4.72

0.0051

Highlights of the output are:

   Source

source of variation

    DF

degrees of freedom for the model, error, and total

   Sum of Squares

sum of squares for the portion attributed to the model, error, and total

   Mean Square

mean square (sum of squares divided by the degrees of freedom)

   F Value

F value (mean square for model divided by the mean square for error)

   Pr > F

significance probability associated with the F statistic

    R-Square

R-square

Coeff Var

coefficient of variation

Root MSE

root mean square error

Height Mean

mean of the dependent variable

Because the effect of Region is significant (p = .0051), we conclude that there are differences in the mean heights of girls from the four regions. The SCHEFFE option in the MEANS statement compares the mean heights between the regions. The following results show that your friend’s daughter is partially correctone region (South) has taller girls than her region (West) but no other two regions differ significantly in mean height. Older versions of SAS may display a table instead of a graph. If you prefer a table, then add the LINESTABLE option to the MEANS statement.

Girls' Heights from Four Regions

The ANOVA Procedure

Scheffe's Test for Height

 

Note: This test controls the Type I experimentwise error rate.

 

Alpha

0.05

Error Degrees of Freedom

60

Error Mean Square

13.88958

Critical Value of F

2.75808

Minimum Significant Difference

3.7902

 

image

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.9.179