Example 5.4 Subsetting a Table Based on the Calculated Average of a Group

Goal

Compute the average of a column for each group in a table. Select observations from each group based on the group's average.

Example Features

Featured StepPROC SQL
Featured Step Options and StatementsAVG aggregate function GROUP BY clause HAVING clause
Related TechniquePROC MEANS, CLASS, OUTPUT, and TYPES statements DATA step, match-merging by value

Input Table

Table SOYBEANS contains the soybean yield and acres of soybeans planted for 30 farms in four regions.

                 SOYBEANS


Obs    region    farmid    yield    acres
  1     NW        456        33       95
  2     NW        269        38      346
  3     NW        295        31       59
  4     NW        689        22      373
  5     NW        080        30      289
  6     NW        319        28       83
  7     NW        703        29      114
  8     SW        700        36      122
  9     SW        178        28      214
 10     SW        358        31      817
 11     SW        045        32        9
 12     SW        741        24       66
 13     SW        262        29       39
 14     SE        983        44      370
 15     SE        628        40      165
 16     SE        042        43      576
 17     SE        996        46      142
 18     SE        257        40      168
 19     SE        749        41       63
 20     SE        869        43      965
 21     SE        042        36      159
 22     MID       894        42      151
 23     MID       806        41       49
 24     MID       848        44      114
 25     MID       479        35       11
 26     MID       959        33      831
 27     MID       493        37       25
 28     MID       939        43      691
 29     MID       752        35      229
 30     MID       077        48      316


Resulting Table

Output 5.4 AVGPLUS Table

  Example 5.4 AVGPLUS Table Created with PROC SQL

                                             region_
Obs    region    farmid    yield    acres      avg
  1     MID       077        48      316       39.8
  2     MID       848        44      114       39.8
  3     MID       939        43      691       39.8
  4     MID       894        42      151       39.8
  5     MID       806        41       49       39.8
  6     NW        269        38      346       30.1
  7     NW        456        33       95       30.1
  8     NW        295        31       59       30.1
  9     SE        996        46      142       41.6
 10     SE        983        44      370       41.6
 11     SE        042        43      576       41.6
 12     SE        869        43      965       41.6
 13     SW        700        36      122       30.0
 14     SW        045        32        9       30.0
 15     SW        358        31      817       30.0


Example Overview

The PROC SQL step in this example computes a statistic for the groups that are defined by a column in a table, and it selects rows from the table based on the computed statistic. The SELECT statement adds a column that is the computed statistic. The GROUP BY clause defines the groups in which to compute the statistic. The HAVING clause returns rows from each group that meet a specific criterion based on each group's statistic.

The goal of this PROC SQL step is to find the farms in each region whose soybean yields are greater than the average for the region. The rows are grouped by REGION, which has four values. The mean value of YIELD is computed for each value of REGION and saved in column REGION AVG. The HAVING clause returns rows with a value of YIELD greater than the average YIELD for the row's region.

Program

Create table AVGPLUS. Select all columns from SOYBEANS. Calculate the average of YIELD and save it in REGION_AVG.

Group the data by the values of REGION, which causes the AVG function to compute the mean soybean yield for each region. Return only the rows for farms with yields greater than the yield. Include the CALCULATED keyword because REGION_AVG is not a column, but instead the alias for the computation that the query performs. Order the rows in AVGPLUS by REGION and within each value of REGION, order the rows in descending order of the values of YIELD.

proc sql;
  create table avgplus as
    select *,
           avg(yield) format=5.1 as region_avg

      from soybeans
      group by region


      having yield < calculated region_avg
                                      average


      order by region, yield descending;


quit;

Related Technique

The following program requires a PROC MEANS step, a PROC SORT step, and a DATA step to produce a data set equivalent to the table that was produced by PROC SQL earlier.

The program starts with a PROC MEANS step that computes the mean soybean yield per region and saves the means in output data set YIELDAVG. Next, the program sorts the input data set SOYBEANS by REGION and within REGION by descending values of YIELD. A DATA step follows that match-merges data sets SOYBEANS and YIELDAVG by REGION. It also selects observations where the value of YIELD is greater than the average yield for the observation's region.

Compute statistics on data set SOYBEANS. Do not print a report. Specify the variable whose values define the groups for which to compute the statistics. Produce statistics only for the groups that are defined by variable REGION. Specify the analysis variable. Save the mean of YIELD for each value of REGION in data set YIELDAVG. Name this new variable REGION_AVG. Do not save the variables _TYPE_ and _FREQ_ that PROC MEANS automatically creates.

Sort the input data set by REGION so that it can be match-merged to YIELDAVG by REGION. Sort within REGION by descending values of YIELD. Create data set AVGPLUS. Merge the input data set and the data set with the soybean yield averages by the values of REGION.

Output the observations where the value of YIELD is greater than the average yield of the region.

proc means data=soybeans noprint;

  class region;

  types region;

  var yield;
  output out=yieldavg(drop=_type_ _freq_)
             mean=region_avg;



run;
proc sort data=soybeans;
  by region descending yield;
run;


data avgplus;
  merge soybeans yieldavg;
    by region;

  format region_avg 5.1;
  if yield gt region_avg;


run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.135.212.29