Example 5.6 Calculating Percentages and Statistics That One Row Contributes to a BY Group

Goal

Calculate percentages that one row contributes to a BY group. Calculate statistics for the BY group, some of which are based on how the row contributes to the statistic for the BY group. Create columns to classify and contain the results of the calculations.

Example Features

Featured StepPROC SQL
Featured Step Options and StatementsAggregate functions

CASE expression

GROUP BY clause

ORDER BY clause
Related TechniquePROC MEANS, CLASS, OUTPUT, and TYPES statements DATA step, match-merging by a value

Input Table

Table DISTRICTLEVY contains levy voting results for four towns and their precincts. Counts of voters voting for the levy are in the "YES" column. Counts of voters voting against the levy are in the "NO" column.

                 DISTRICTLEVY

Obs       town        precinct     yes     no
  1    Blue Hills         1        335     264
  2    Blue Hills         2        631     497
  3    Blue Hills         3        529     375
  4    Collegetown        1        909    1300
  5    Collegetown        2        677     573
  6    Collegetown        3       1314    1117
  7    Collegetown        4       1135    1054
  8    Collegetown        5       1153    1135
  9    Fairview           1        721     906
 10    Fairview           2        428     611
 11    Fairview           3        788     962
 12    Fairview           4        335     471
 13    West Ridge         1        818     738
 14    West Ridge         2        768     851
 15    West Ridge         3        692     610
 16    West Ridge         4        892     931
 17    West Ridge         5        838     965
 18    West Ridge         6       1533    1453

Resulting Table

Output 5.6 LEVYRESULTS Table

                   Example 5.6 LEVYRESULTS Table Created with PROC SQL

                                               total_
                                              voters_ pct_precinct_ pct_precinct_  most_
Obs    town     precinct  yes  no  nprecincts   town      town           yes      yes_pct

  1 Blue Hills      1     335  264      3       2631       22.8          55.9       No
  2 Blue Hills      2     631  497      3       2631       42.9          55.9       No
  3 Blue Hills      3     529  375      3       2631       34.4          58.5       Yes
  4 Collegetown     1     909 1300      5      10367       21.3          41.1       No
  5 Collegetown     2     677  573      5      10367       12.1          54.2       Yes
  6 Collegetown     3    1314 1117      5      10367       23.4          54.1       No
  7 Collegetown     4    1135 1054      5      10367       21.1          51.9       No
  8 Collegetown     5    1153 1135      5      10367       22.1          50.4       No
  9 Fairview        1     721  906      4       5222       31.2          44.3       No
 10 Fairview        2     428  611      4       5222       19.9          41.2       No
 11 Fairview        3     788  962      4       5222       33.5          45.0       Yes
 12 Fairview        4     335  471      4       5222       15.4          41.6       No
 13 West Ridge      1     818  738      6      11089       14.0          52.6       No
 14 West Ridge      2     768  851      6      11089       14.6          47.4       No
 15 West Ridge      3     692  610      6      11089       11.7          53.1       Yes
 16 West Ridge      4     892  931      6      11089       16.4          48.9       No
 17 West Ridge      5     838  965      6      11089       16.3          46.5       No
 18 West Ridge      6    1533 1453      6      11089       26.9          51.3       No


Example Overview

This example summarizes data by groups and computes new columns for each row in the input table by using the summary statistics. The PROC SQL step uses aggregate functions and the GROUP BY clause to compute percentages and statistics for groups of rows. The step creates new columns to hold the percentages and statistics and to classify some of the statistics. After computation, the summary statistics are merged back to each row in the input table. Therefore, the columns that hold the summary statistics are identical for all rows within a group. The output table has the same number of rows as the input table.

Aggregate functions summarize data in a table. Several functions are available that aggregate data. Not all of the statistics functions that are found in other procedures such as PROC MEANS and PROC TABULATE are available as aggregate functions. For a complete list of PROC SQL aggregate functions, see SAS documentation on PROC SQL.

Each row in table DISTRICTLEVY contains a precinct's YES or NO votes on a levy. The goal of the PROC SQL step is to analyze voting results per town. The number of precincts per town varies from three to six. The step computes five new columns:

  • NPRECINCTS, which is computed by using the COUNT function to tally the number of precincts per town. The value of NPRECINCTS is the same for all rows within a BY group.

  • TOTAL_VOTERS_TOWN, which is computed by using the SUM function to sum the number of YES votes and NO votes per town. The value of TOTAL_VOTERS_TOWN is the same for all rows within each BY value of TOWN.

  • PCT_PRECINCT_TOWN, which is the percentage of voters that the precinct contributes to the town.

  • PCT_PRECINCT_YES, which is the percentage of YES votes in the precinct.

  • MOST_YES_PCT, which is a character column with possible values of 'Yes' or 'No'. It has a value of 'Yes' when the percentage of YES votes for the precinct is the highest precinct percentage within the town. Precincts where the percentage of YES votes is less than the maximum percentage of YES votes within the town have a value of 'No'. Computation of MOST_YES_PCT uses the MAX aggregate function and a CASE expression.

The step includes an ORDER BY clause that arranges the rows in LEVYRESULTS by TOWN and PRECINCT. Without the ORDER BY clause, the rows might not be output in the order in which they were input from LEVYRESULTS. This example merges the statistics back to the original table. The results of the remerge might not arrange the rows in the expected order so the ORDER BY clause is needed.

The Related Technique in Example 5.5 also uses the GROUP BY clause to compute statistics for groups. It is different from this example in that it produces a summary table with one row per group. This example maintains the same number of rows in the output table as was found in the input table.

Program

Create table LEVYRESULTS. Select all the columns from DISTRICTLEVY. Compute the number of precincts per value of TOWN. Compute the total number of voters per value of TOWN. Compute the percentage of voters that the precinct contributes to the town. Include the calculated keyword prior to the reference to new column PCT_VOTERS_PRECINCT since this value is computed within the query. Assign an alias and attribute to the column. Compute the percentage of YES votes in the precinct. Do not use the SUM aggregate function because the percentage should be computed for each row and not for all rows in the group. Compare the calculated column PCT_PRECINCT_YES to the maximum value of PCT_PRECINCT_YES for the town. Assign a value of "Yes" or "No" to column MOST_YES_PCT depending on whether the current the calculated column PCT_PRECINCT_YES is the maximum of all precincts in the town. Read the rows from DISTRICTLEVY. Group the rows in DISTRICTLEVY by the values of TOWN. When computing aggregate statistics, compute them for each value of TOWN. Order the rows in LEVYRESULTS by TOWN and PRECINCT.

proc sql;
  create table levyresults as
    select *,

              count(precinct) as nprecincts,

              sum(yes+no) as total_precinct,

           100*(yes+no)/calculated total_voters_town
                      as pct_voters_precinct format=5.1,





           100*yes/(yes+no) as pct_precinct_yes
                            format=5.1,



           case


             when max(calculated pct_precinct_yes)
                       = calculated pct_precinct_yes
                              then 'Yes'
             else 'No'
            end as most_yes_pct

      from districtlevy
      group by town


      order by town, precinct;

Related Technique

The following program uses two PROC MEANS steps and two DATA steps to create a data set equivalent to table LEVYRESULTS that was produced by the PROC SQL step in the main example. The PROC MEANS steps compute the SUM, N, and MAX statistics, and the DATA steps merge the statistics with the observations in DISTRICTLEVY in order to compute and classify the BY group statistics.

The program in the main example is simpler because all computations are performed in one step. If you need to compute statistics that are not PROC SQL aggregate functions, or if you need to compute percentages of different combinations of the data, it might be more efficient to adapt the following program. With the CLASS and TYPES statements in PROC MEANS, you can compute statistics for different combinations of your classification variables in one PROC MEANS step.

So that the match-merges by TOWN complete successfully and the observations in LEVYRESULTS are arranged by TOWN and PRECINCT, assume data set DISTRICTLEVY is sorted by TOWN and PRECINCT prior to the program.

Compute statistics on the observations in DISTRICTLEVY. Suppress the default listing of the results. Specify the variable whose values define the groups for which to compute statistics. Produce statistics for the groups that are defined by the values of TOWN. Compute the statistics on the YES and NO votes. Save SUM and N statistics in data set SUMLEVY. Do not keep automatic variables _TYPE_ and _FREQ_ that PROC MEANS generates. Name the statistics.

Create an intermediate data set that merges input data set DISTRICTLEVY by TOWN to the statistics that are saved in SUMLEVY.

Compute the total number of voters in the town by summing the SUM statistics for YES and NO votes that are produced by PROC MEANS. Compute the percentage of voters that the precinct contributes to the town. Use as the denominator the variable that was defined earlier. Compute the percentage of YES votes in the precinct.

Compute statistics on the intermediate data set so that the new variable PCT_PRECINCT_YES can be calculated. Repeat the CLASS and TYPES statement as in the first PROC MEANS step. Compute statistics on new variable PCT_PRECINCT_YES. Save the MAX statistic in data set MAXLEVY.

Create data set LEVYRESULTS by merging intermediate data set LEVYSTEP1 to PCTLEVY by TOWN. Define new variable MOST_YES_PCT.

Assign a value of "Yes" to variable MOST_YES_PCT when the current PCT_PRECINCT_YES is the maximum for all precincts in the town.

proc means data=districtlevy noprint;

  class town;


  types town;

  var yes no;

  output out=sumlevy(drop=_type_ _freq_)
               sum=town_yes town_no n=nprecincts;



run;
data levystep1;
  merge districtlevy sumlevy;
  by town;

  drop town_yes town_no;
  total_voters_town=sum(town_yes,town_no);


  pct_precinct_town=100*(yes+no)/total_voters_town;


  pct_precinct_yes=100*yes/(yes+no);

run;
proc means data=levystep1 noprint;


  class town;
  types town;

  var pct_precinct_yes;

  output out=maxlevy(drop=_type_ _freq_)
               max=max_pct_yes;
run;
data levyresults;
  merge levystep1 pctlevy;
  by town;
  length most_yes_pct $ 3;
  format pct_voters_precinct pct_precinct_yes 5.1;
  drop max_pct_yes;
  if pct_precinct_yes=max_pct_yes then
              most_yes_pct='Yes';
  else most_yes_pct='No';

run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.141.25.140