Compute the average of a column for each group in a table. Select observations from each group based on the group's average.
Featured Step | PROC SQL |
Featured Step Options and Statements | AVG aggregate function GROUP BY clause HAVING clause |
Related Technique | PROC MEANS, CLASS, OUTPUT, and TYPES statements DATA step, match-merging by value |
Table SOYBEANS contains the soybean yield and acres of soybeans planted for 30 farms in four regions.
SOYBEANS Obs region farmid yield acres 1 NW 456 33 95 2 NW 269 38 346 3 NW 295 31 59 4 NW 689 22 373 5 NW 080 30 289 6 NW 319 28 83 7 NW 703 29 114 8 SW 700 36 122 9 SW 178 28 214 10 SW 358 31 817 11 SW 045 32 9 12 SW 741 24 66 13 SW 262 29 39 14 SE 983 44 370 15 SE 628 40 165 16 SE 042 43 576 17 SE 996 46 142 18 SE 257 40 168 19 SE 749 41 63 20 SE 869 43 965 21 SE 042 36 159 22 MID 894 42 151 23 MID 806 41 49 24 MID 848 44 114 25 MID 479 35 11 26 MID 959 33 831 27 MID 493 37 25 28 MID 939 43 691 29 MID 752 35 229 30 MID 077 48 316 |
Output 5.4 AVGPLUS TableExample 5.4 AVGPLUS Table Created with PROC SQL region_ Obs region farmid yield acres avg 1 MID 077 48 316 39.8 2 MID 848 44 114 39.8 3 MID 939 43 691 39.8 4 MID 894 42 151 39.8 5 MID 806 41 49 39.8 6 NW 269 38 346 30.1 7 NW 456 33 95 30.1 8 NW 295 31 59 30.1 9 SE 996 46 142 41.6 10 SE 983 44 370 41.6 11 SE 042 43 576 41.6 12 SE 869 43 965 41.6 13 SW 700 36 122 30.0 14 SW 045 32 9 30.0 15 SW 358 31 817 30.0 |
The PROC SQL step in this example computes a statistic for the groups that are defined by a column in a table, and it selects rows from the table based on the computed statistic. The SELECT statement adds a column that is the computed statistic. The GROUP BY clause defines the groups in which to compute the statistic. The HAVING clause returns rows from each group that meet a specific criterion based on each group's statistic.
The goal of this PROC SQL step is to find the farms in each region whose soybean yields are greater than the average for the region. The rows are grouped by REGION, which has four values. The mean value of YIELD is computed for each value of REGION and saved in column REGION AVG. The HAVING clause returns rows with a value of YIELD greater than the average YIELD for the row's region.
Create table AVGPLUS. Select all columns from SOYBEANS. Calculate the average of YIELD and save it in REGION_AVG.
Group the data by the values of REGION, which causes the AVG function to compute the mean soybean yield for each region. Return only the rows for farms with yields greater than the yield. Include the CALCULATED keyword because REGION_AVG is not a column, but instead the alias for the computation that the query performs. Order the rows in AVGPLUS by REGION and within each value of REGION, order the rows in descending order of the values of YIELD.
proc sql; create table avgplus as select *, avg(yield) format=5.1 as region_avg from soybeans group by region having yield < calculated region_avg average order by region, yield descending; quit;
The following program requires a PROC MEANS step, a PROC SORT step, and a DATA step to produce a data set equivalent to the table that was produced by PROC SQL earlier.
The program starts with a PROC MEANS step that computes the mean soybean yield per region and saves the means in output data set YIELDAVG. Next, the program sorts the input data set SOYBEANS by REGION and within REGION by descending values of YIELD. A DATA step follows that match-merges data sets SOYBEANS and YIELDAVG by REGION. It also selects observations where the value of YIELD is greater than the average yield for the observation's region.
Compute statistics on data set SOYBEANS. Do not print a report. Specify the variable whose values define the groups for which to compute the statistics. Produce statistics only for the groups that are defined by variable REGION. Specify the analysis variable. Save the mean of YIELD for each value of REGION in data set YIELDAVG. Name this new variable REGION_AVG. Do not save the variables _TYPE_ and _FREQ_ that PROC MEANS automatically creates.
Sort the input data set by REGION so that it can be match-merged to YIELDAVG by REGION. Sort within REGION by descending values of YIELD. Create data set AVGPLUS. Merge the input data set and the data set with the soybean yield averages by the values of REGION.
Output the observations where the value of YIELD is greater than the average yield of the region.
proc means data=soybeans noprint; class region; types region; var yield; output out=yieldavg(drop=_type_ _freq_) mean=region_avg; run; proc sort data=soybeans; by region descending yield; run; data avgplus; merge soybeans yieldavg; by region; format region_avg 5.1; if yield gt region_avg; run;
3.145.70.38