Computing Descriptive Statistics with PROC MEANS

You can use PROC MEANS to analyze quantitative (numeric) variables. For each variable analyzed, it provides the following information:

  • the number of observations on which calculations were performed (abbreviated “N” in the output);

  • the mean;

  • the standard deviation;

  • the minimum (smallest) value observed;

  • the maximum (largest) value observed.

These statistics are produced by default, and some additional statistics (to be described later) can also be requested as options.

Here is the general form for PROC MEANS:

PROC MEANS DATA=dataset-name
           option-list
           statistic-keyword-list ;
   VAR variable-list  ;
RUN;

The PROC MEANS Statement

The PROC MEANS statement begins with “PROC MEANS” and ends with a semicolon. It is recommended (and on some platforms required) that the statement also name the dataset to be analyzed by using the DATA= option.

The “option-list” appearing in the preceding program indicates that you can request a number of options with PROC MEANS. A complete list of these options appears in the Base SAS Procedures Guide. Some options especially useful for social science research are:

MAXDEC=N

Specifies the maximum number of decimal places (digits to the right of the decimal point) to be used when printing results; the possible range is 0 to 8.

VARDEF=divisor

Specifies the divisor to be used when calculating variances and covariances. Following are two possible divisors:

VARDEF=DF    Divisor is the degrees of freedom for the analysis: (n–1). This is the default.
VARDEF=N     Divisor is the number of observations, n.
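
To see what the choice of divisor does in practice, the following sketch runs PROC MEANS twice on the same data. The dataset DEMO and the variable SCORE are hypothetical and were made up for this illustration; they are not part of the study described in this chapter.

```sas
/* Hypothetical five-value dataset, created only for this illustration */
DATA DEMO;
   INPUT SCORE;
DATALINES;
2
4
4
6
9
;
RUN;

/* Default divisor (n - 1): the usual sample variance */
PROC MEANS DATA=DEMO VARDEF=DF MAXDEC=2 N MEAN VAR STD;
   VAR SCORE;
RUN;

/* Divisor n: variance computed with the number of observations */
PROC MEANS DATA=DEMO VARDEF=N MAXDEC=2 N MEAN VAR STD;
   VAR SCORE;
RUN;
```

For these five scores the mean is 5 and the corrected sum of squares is 28, so the first step should report a variance of 7.00 (28 divided by 4) and the second a variance of 5.60 (28 divided by 5).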

The “statistic-keyword-list” appearing in the program indicates that you can request a number of statistics to replace the default output. See the Base SAS Procedures Guide for a complete listing. Some statistics that can be of particular value in social science research include the following:

NMISS      The number of observations in the sample that displayed missing data for this variable.
RANGE      The range of values displayed in the sample.
SUM        The sum.
CSS        The corrected sum of squares.
USS        The uncorrected sum of squares.
VAR        The variance.
STDERR     The standard error of the mean.
SKEWNESS   The skewness displayed by the sample. Skewness refers to the extent to which the sample distribution departs from the normal curve because of a long “tail” on either side of the distribution. If the long tail appears on the right side of the sample distribution (where the higher values appear), it is described as being positively skewed. If the long tail appears on the left side of the distribution (where the lower values appear), it is described as being negatively skewed.
KURTOSIS   The kurtosis displayed by the sample. Kurtosis refers to the extent to which the sample distribution departs from the normal curve because it is either peaked or flat. If the sample distribution is relatively peaked (tall and skinny), it is described as being leptokurtic. If the distribution is relatively flat, it is described as being platykurtic.
T          The obtained value of Student’s t statistic for testing the null hypothesis that the population mean is 0.
PRT        The two-tailed p value for the preceding t test; that is, the probability of obtaining a t value this large or larger in absolute value if the population mean were 0.

To illustrate the use of these options and statistic keywords, assume that you want to use the MAXDEC option to limit the printing of results to two decimal places, use the VAR keyword to request that the variances of all quantitative variables be printed, and use the KURTOSIS keyword to request that the kurtosis of all quantitative variables be printed. You could do this with the following PROC MEANS statement:

PROC MEANS   DATA=D1   MAXDEC=2   VAR   KURTOSIS ;
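
Because statistic keywords replace the default output rather than add to it, the statement above prints only the variance and kurtosis. The sketch below shows the statement as part of a complete step, followed by a variation that also names the default statistics so that they are kept. It assumes that the dataset D1 has already been created; because there is no VAR statement, every numeric variable in D1 is analyzed.

```sas
/* Variance and kurtosis only, printed to two decimal places */
PROC MEANS DATA=D1 MAXDEC=2 VAR KURTOSIS;
RUN;

/* Same request, but with the five default statistics listed  */
/* explicitly so that they appear alongside VAR and KURTOSIS  */
PROC MEANS DATA=D1 MAXDEC=2 N MEAN STD MIN MAX VAR KURTOSIS;
RUN;
```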

The VAR Statement

Here again is the general form of the statements requesting the MEANS procedure, including the VAR statement:

PROC MEANS  DATA=dataset-name
            option-list
            statistic-keyword-list ;
   VAR  variable-list ;
RUN;

In place of “variable-list” in the preceding VAR statement, list the quantitative variables to be analyzed, separating the variable names with at least one blank space. If no VAR statement is used, SAS performs PROC MEANS on all of the quantitative variables in the dataset. This is true for many other SAS procedures as well, as explained in the following note:

What happens if I do not include a VAR statement?

For many SAS procedures, omitting the VAR statement causes the system to perform the requested analyses on all variables in the dataset (for PROC MEANS, all numeric variables). For datasets with a large number of variables, leaving off the VAR statement can unintentionally produce a very long output file.
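
The difference can be seen in a small sketch. The dataset SURVEY and its variables (NAME, AGE, ITEM1, ITEM2) are hypothetical and serve only to contrast a step without a VAR statement with one that has it.

```sas
/* Hypothetical dataset: one character variable, three numeric variables */
DATA SURVEY;
   INPUT NAME $ AGE ITEM1 ITEM2;
DATALINES;
ANN 21 4 5
BOB 19 2 3
CAL 23 5 4
;
RUN;

/* No VAR statement: AGE, ITEM1, and ITEM2 are all analyzed. */
/* NAME is skipped because it is a character variable.       */
PROC MEANS DATA=SURVEY;
RUN;

/* VAR statement: only the listed variables are analyzed */
PROC MEANS DATA=SURVEY;
   VAR ITEM1 ITEM2;
RUN;
```

The first step produces one row of output for each of the three numeric variables; the second produces rows for ITEM1 and ITEM2 only.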


The program used to analyze your dataset includes the following statements. RESNEEDY and CLASS are specified in the VAR statement so that descriptive statistics are calculated for both variables:

PROC MEANS   DATA=D1;
   VAR RESNEEDY CLASS;
RUN;

Output 5.1. Results of the MEANS Procedure
Variable   N          Mean       Std Dev       Minimum        Maximum
---------------------------------------------------------------------
RESNEEDY  13     3.6923077     1.3155870     1.0000000      5.0000000
CLASS     12     2.2500000     1.4222262     1.0000000      5.0000000
---------------------------------------------------------------------

Reviewing the Output

Output 5.1 contains the results created by the preceding program. Before doing any more sophisticated analyses, you should always perform PROC MEANS on each quantitative variable and carefully review the output to ensure that everything looks right. Under the heading “Variable” is the name of each variable being analyzed. The descriptive statistics for that variable appear to the right of the variable name. Below the heading “N” is the number of valid cases, or observations, on which calculations were performed. Notice that in this instance, calculations were performed on only 13 observations for RESNEEDY. This might come as a surprise, because the dataset actually contains 14 cases. Recall, however, that one participant did not respond to this question (question 1 on the survey). It is for this reason that N is equal to 13 rather than 14 for RESNEEDY in these analyses.
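
One way to confirm the count of missing responses directly, rather than inferring it from N, is to request the N and NMISS keywords together. This sketch assumes the dataset D1 used above.

```sas
/* Report valid and missing counts side by side for each variable */
PROC MEANS DATA=D1 N NMISS;
   VAR RESNEEDY CLASS;
RUN;
```

If D1 contains 14 observations as described, NMISS for RESNEEDY should be 1, and N plus NMISS should equal 14 for each variable.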

You should next examine the mean for the variable, to verify that it is a reasonable number. Remember that, with question 1, responses could range from 1 “Disagree Strongly” to 5 “Agree Strongly.” Therefore, the mean response should be somewhere between 1.00 and 5.00 for the RESNEEDY variable. If it is outside of this range, you know that an error has been made. In the present case, the mean for RESNEEDY is 3.69, which is within the predetermined range, so everything appears correct so far.

Using the same reasoning, it is prudent to next check the column headed “Minimum.” Here, you will find the lowest value on RESNEEDY that appeared in the dataset. If this is less than 1.00, you will again know that an error was made, because 1 was the lowest value that could have been assigned to a participant. On the printout, the minimum value is 1.00, which indicates no problems. Under “Maximum,” the largest value observed for that variable is reported. This should not exceed 5.00, because 5 was the largest score a participant could obtain on item 1. The reported maximum value is 5.00, so again it appears that there were no obvious errors in entering the data or writing the program.

Once you have reviewed the results for RESNEEDY, you should also inspect the results for CLASS. If any of the observed values are out of range, you should carefully review the program for programming errors and the dataset for data entry errors. In some cases, you might want to use PROC PRINT to print out the raw dataset because this makes the review easier. PROC PRINT is described later in this chapter.
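
As a minimal sketch of that kind of review, the following step lists only the observations whose RESNEEDY value falls outside the valid range of 1 to 5. It assumes the dataset D1 used above; the WHERE statement is not part of the program discussed in this chapter and is added here for illustration.

```sas
/* List observations with an out-of-range value on RESNEEDY.      */
/* SAS treats a missing numeric value as smaller than any number, */
/* so missing responses are excluded explicitly to avoid flagging */
/* them as values below 1.                                        */
PROC PRINT DATA=D1;
   WHERE RESNEEDY IS NOT MISSING AND (RESNEEDY < 1 OR RESNEEDY > 5);
   VAR RESNEEDY CLASS;
RUN;
```

For the data summarized in Output 5.1, this step should select no observations, because the reported minimum and maximum for RESNEEDY are both within range.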
