Calculate percentages that one row contributes to a BY group. Calculate statistics for the BY group, some of which are based on how the row contributes to the statistic for the BY group. Create columns to classify and contain the results of the calculations.
Featured Step | PROC SQL |
Featured Step Options and Statements | Aggregate functions
CASE expression GROUP BY clause ORDER BY clause |
Related Technique | PROC MEANS, CLASS, OUTPUT, and TYPES statements DATA step, match-merging by a value |
Table DISTRICTLEVY contains levy voting results for four towns and their precincts. Counts of voters voting for the levy are in the "YES" column. Counts of voters voting against the levy are in the "NO" column.
DISTRICTLEVY Obs town precinct yes no 1 Blue Hills 1 335 264 2 Blue Hills 2 631 497 3 Blue Hills 3 529 375 4 Collegetown 1 909 1300 5 Collegetown 2 677 573 6 Collegetown 3 1314 1117 7 Collegetown 4 1135 1054 8 Collegetown 5 1153 1135 9 Fairview 1 721 906 10 Fairview 2 428 611 11 Fairview 3 788 962 12 Fairview 4 335 471 13 West Ridge 1 818 738 14 West Ridge 2 768 851 15 West Ridge 3 692 610 16 West Ridge 4 892 931 17 West Ridge 5 838 965 18 West Ridge 6 1533 1453
Output 5.6 LEVYRESULTS TableExample 5.6 LEVYRESULTS Table Created with PROC SQL total_ voters_ pct_precinct_ pct_precinct_ most_ Obs town precinct yes no nprecincts town town yes yes_pct 1 Blue Hills 1 335 264 3 2631 22.8 55.9 No 2 Blue Hills 2 631 497 3 2631 42.9 55.9 No 3 Blue Hills 3 529 375 3 2631 34.4 58.5 Yes 4 Collegetown 1 909 1300 5 10367 21.3 41.1 No 5 Collegetown 2 677 573 5 10367 12.1 54.2 Yes 6 Collegetown 3 1314 1117 5 10367 23.4 54.1 No 7 Collegetown 4 1135 1054 5 10367 21.1 51.9 No 8 Collegetown 5 1153 1135 5 10367 22.1 50.4 No 9 Fairview 1 721 906 4 5222 31.2 44.3 No 10 Fairview 2 428 611 4 5222 19.9 41.2 No 11 Fairview 3 788 962 4 5222 33.5 45.0 Yes 12 Fairview 4 335 471 4 5222 15.4 41.6 No 13 West Ridge 1 818 738 6 11089 14.0 52.6 No 14 West Ridge 2 768 851 6 11089 14.6 47.4 No 15 West Ridge 3 692 610 6 11089 11.7 53.1 Yes 16 West Ridge 4 892 931 6 11089 16.4 48.9 No 17 West Ridge 5 838 965 6 11089 16.3 46.5 No 18 West Ridge 6 1533 1453 6 11089 26.9 51.3 No |
This example summarizes data by groups and computes new columns for each row in the input table by using the summary statistics. The PROC SQL step uses aggregate functions and the GROUP BY clause to compute percentages and statistics for groups of rows. The step creates new columns to hold the percentages and statistics and to classify some of the statistics. After computation, the summary statistics are merged back to each row in the input table. Therefore, the columns that hold the summary statistics are identical for all rows within a group. The output table has the same number of rows as the input table.
Aggregate functions summarize data in a table. Several functions are available that aggregate data. Not all of the statistics functions that are found in other procedures such as PROC MEANS and PROC TABULATE are available as aggregate functions. For a complete list of PROC SQL aggregate functions, see SAS documentation on PROC SQL.
Each row in table DISTRICTLEVY contains a precinct's YES or NO votes on a levy. The goal of the PROC SQL step is to analyze voting results per town. The number of precincts per town varies from three to six. The step computes five new columns:
NPRECINCTS, which is computed by using the COUNT function to tally the number of precincts per town. The value of NPRECINCTS is the same for all rows within a BY group.
TOTAL_VOTERS_TOWN, which is computed by using the SUM function to sum the number of YES votes and NO votes per town. The value of TOTAL_VOTERS_TOWN is the same for all rows within each BY value of TOWN.
PCT_PRECINCT_TOWN, which is the percentage of voters that the precinct contributes to the town.
PCT_PRECINCT_YES, which is the percentage of YES votes in the precinct.
MOST_YES_PCT, which is a character column with possible values of 'Yes' or 'No'. It has a value of 'Yes' when the percentage of YES votes for the precinct is the highest precinct percentage within the town. Precincts where the percentage of YES votes is less than the maximum percentage of YES votes within the town have a value of 'No'. Computation of MOST_YES_PCT uses the MAX aggregate function and a CASE expression.
The step includes an ORDER BY clause that arranges the rows in LEVYRESULTS by TOWN and PRECINCT. Without the ORDER BY clause, the rows might not be output in the order in which they were input from LEVYRESULTS. This example merges the statistics back to the original table. The results of the remerge might not arrange the rows in the expected order so the ORDER BY clause is needed.
The Related Technique in Example 5.5 also uses the GROUP BY clause to compute statistics for groups. It is different from this example in that it produces a summary table with one row per group. This example maintains the same number of rows in the output table as was found in the input table.
Create table LEVYRESULTS. Select all the columns from DISTRICTLEVY. Compute the number of precincts per value of TOWN. Compute the total number of voters per value of TOWN. Compute the percentage of voters that the precinct contributes to the town. Include the calculated keyword prior to the reference to new column PCT_VOTERS_PRECINCT since this value is computed within the query. Assign an alias and attribute to the column. Compute the percentage of YES votes in the precinct. Do not use the SUM aggregate function because the percentage should be computed for each row and not for all rows in the group. Compare the calculated column PCT_PRECINCT_YES to the maximum value of PCT_PRECINCT_YES for the town. Assign a value of "Yes" or "No" to column MOST_YES_PCT depending on whether the current the calculated column PCT_PRECINCT_YES is the maximum of all precincts in the town. Read the rows from DISTRICTLEVY. Group the rows in DISTRICTLEVY by the values of TOWN. When computing aggregate statistics, compute them for each value of TOWN. Order the rows in LEVYRESULTS by TOWN and PRECINCT.
proc sql; create table levyresults as select *, count(precinct) as nprecincts, sum(yes+no) as total_precinct, 100*(yes+no)/calculated total_voters_town as pct_voters_precinct format=5.1, 100*yes/(yes+no) as pct_precinct_yes format=5.1, case when max(calculated pct_precinct_yes) = calculated pct_precinct_yes then 'Yes' else 'No' end as most_yes_pct from districtlevy group by town order by town, precinct;
The following program uses two PROC MEANS steps and two DATA steps to create a data set equivalent to table LEVYRESULTS that was produced by the PROC SQL step in the main example. The PROC MEANS steps compute the SUM, N, and MAX statistics, and the DATA steps merge the statistics with the observations in DISTRICTLEVY in order to compute and classify the BY group statistics.
The program in the main example is simpler because all computations are performed in one step. If you need to compute statistics that are not PROC SQL aggregate functions, or if you need to compute percentages of different combinations of the data, it might be more efficient to adapt the following program. With the CLASS and TYPES statements in PROC MEANS, you can compute statistics for different combinations of your classification variables in one PROC MEANS step.
So that the match-merges by TOWN complete successfully and the observations in LEVYRESULTS are arranged by TOWN and PRECINCT, assume data set DISTRICTLEVY is sorted by TOWN and PRECINCT prior to the program.
Compute statistics on the observations in DISTRICTLEVY. Suppress the default listing of the results. Specify the variable whose values define the groups for which to compute statistics. Produce statistics for the groups that are defined by the values of TOWN. Compute the statistics on the YES and NO votes. Save SUM and N statistics in data set SUMLEVY. Do not keep automatic variables _TYPE_ and _FREQ_ that PROC MEANS generates. Name the statistics.
Create an intermediate data set that merges input data set DISTRICTLEVY by TOWN to the statistics that are saved in SUMLEVY.
Compute the total number of voters in the town by summing the SUM statistics for YES and NO votes that are produced by PROC MEANS. Compute the percentage of voters that the precinct contributes to the town. Use as the denominator the variable that was defined earlier. Compute the percentage of YES votes in the precinct.
Compute statistics on the intermediate data set so that the new variable PCT_PRECINCT_YES can be calculated. Repeat the CLASS and TYPES statement as in the first PROC MEANS step. Compute statistics on new variable PCT_PRECINCT_YES. Save the MAX statistic in data set MAXLEVY.
Create data set LEVYRESULTS by merging intermediate data set LEVYSTEP1 to PCTLEVY by TOWN. Define new variable MOST_YES_PCT.
Assign a value of "Yes" to variable MOST_YES_PCT when the current PCT_PRECINCT_YES is the maximum for all precincts in the town.
proc means data=districtlevy noprint; class town; types town; var yes no; output out=sumlevy(drop=_type_ _freq_) sum=town_yes town_no n=nprecincts; run; data levystep1; merge districtlevy sumlevy; by town; drop town_yes town_no; total_voters_town=sum(town_yes,town_no); pct_precinct_town=100*(yes+no)/total_voters_town; pct_precinct_yes=100*yes/(yes+no); run; proc means data=levystep1 noprint; class town; types town; var pct_precinct_yes; output out=maxlevy(drop=_type_ _freq_) max=max_pct_yes; run; data levyresults; merge levystep1 pctlevy; by town; length most_yes_pct $ 3; format pct_voters_precinct pct_precinct_yes 5.1; drop max_pct_yes; if pct_precinct_yes=max_pct_yes then most_yes_pct='Yes'; else most_yes_pct='No'; run;
18.191.234.150