Example 5.1 Adding Values to All Observations in a Data Set

Goal

Combine summary statistics that are saved in a single observation in one data set with all observations in another data set. Compute new variables based on the values of the summary and detail information.

Example Features

Featured StepDATA step
Featured Step Options and StatementsIF _N_=1 statement
Related TechniquePROC SQL, Cartesian product
A Closer LookUnderstanding How the DATA Step Adds Values from an Observation in One Data Set to All Observations in Another

Input Data Sets

The first 10 of 200 observations in STUDENTS are listed.

PROC MEANS created data set OVERALL by analyzing variable SCORE in data set STUDENTS. The 10th percentile and 25th percentile statistics for SCORE are saved in OVERALL.

STUDENTS (first 10 observations)

          student_
   Obs    id         score
     1    101          74
     2    102          83
     3    103          76
     4    104          74
     5    105          84
     6    106          76
     7    107          79
     8    108          67
     9    109          70
    10    110          79
    . . .

         OVERALL Data Set

Obs    _TYPE_    _FREQ_    p10     p25
 1        0        200      70      74

Resulting Data Set

Output 5.1 First 10 Observations of PCSTUDENTS Data Set

Example 5.1 PCSTUDENTS Data Set Created with DATA Step

             student_
    Obs       id       score    top10pc    top25pc

      1       101        74        Y          Y
      2       102        83        Y          Y
      3       103        76        Y          Y
      4       104        74        Y          Y
      5       105        84        Y          Y
      6       106        76        Y          Y
      7       107        79        Y          Y
      8       108        67        N          N
      9       109        70        Y          N
     10       110        79        Y          Y
                   ...


Example Overview

This example shows how to add one observation from one data set to all observations in a detail data set. The single observation contains summary statistics for the detail data set. By adding the summary statistics to each observation in the detail data set, you can then compare each detail observation to statistics for all observations.

The following DATA step reads the single observation in summary data set OVERALL only during the DATA step's first iteration. The values for P10 and P25 from OVERALL remain in the Program Data Vector (PDV) while the DATA step reads each observation in STUDENTS. Data set OVERALL does not have to be read more than once.

Data set STUDENTS contains the individual test scores for each student. The DATA step adds to each observation that is read from STUDENTS the same values for two variables P10 and P25, which are stored in data set OVERALL. The PROC MEANS step that precedes the DATA step analyzes variable SCORE and saves in data set OVERALL the 10th percentile and first quartile in variables P10 and P25.

On the DATA step's first iteration, which is when automatic variable _N_ equals 1, the DATA step places in the PDV the values for P10 and P25 from the single observation in OVERALL. The values remain in the PDV for the duration of the DATA step because the SET statement for OVERALL never executes again. Values that are read with a SET statement are automatically retained until another observation is read from the data set named in the SET statement.

Each iteration reads one observation from STUDENTS. The DATA step computes two new variables, TOP10PC and TOP25PC, based on variables P10 and P25 that were added from OVERALL. The DATA step assigns the value 'Y' to variable TOP10PC if the student's score is in the top 10%; otherwise, it assigns the value 'N'. A similar rule is applied when determining the values for the top quartile variable, TOP25PC.

Each iteration writes out an observation that contains all the data for each student, including the two ranking variables TOP10PC and TOP25PC.

Program

Save in data set OVERALL the 10th percentile and first quartile statistics for variable SCORE.

Create PCSTUDENTS. Read an observation from OVERALL only on the first iteration and keep only the P10 and P25 variables. Read each observation from STUDENTS. Compute two new variables based on the values read from OVERALL that were retained in the PDV for P10 and P25.

proc means data=students noprint;
  var score;
  output out=overall p10=p10 p25=p25;
run;
data pcstudents(drop=p10 p25);
  if _n_=1 then set overall(keep=p10 p25);


  set students;

  top10pc='N';
  top25pc='N';
  if score ge p10 then top10pc='Y';
  if score ge p25 then top25pc='Y';

run;

Related Technique

The following PROC SQL step joins the two tables, OVERALL and STUDENTS. The program submits the same PROC MEANS step to create OVERALL that was shown in the main example. The join in the PROC SQL step produces a Cartesian product, which is a combination of each row from the first table with every row from the second table.

Generally, when joining larger data sets, you would want to avoid coding Cartesian product joins because of the large number of rows that would form the product. However, in this case, the size of the Cartesian product is manageable. With only one row in one of the tables (OVERALL) and 200 rows in the other table (STUDENTS), the number of rows in the product is 1 X 200=200.

When performing a Cartesian join, SAS generates a note in the SAS log about optimization of the join.

Save in data set OVERALL the 10th percentile and first quartile statistics for variable SCORE.

Create table PCSTUDENTS. Select specific columns from each table. Compute two new columns.

Specify a Cartesian product join.

proc means data=students noprint;
  var score;
  output out=overall p10=p10 p25=p25;
run;

proc sql;
  create table pcstudents as
    select student_id, score,
           case when score ge p10 then 'Y'
                else 'N'
           end as top10pc length=1,
           case when score ge p25 then 'Y'
                else 'N'
           end as top25pc length=1
    from overall, students;
quit;

A Closer Look

Understanding How the DATA Step Adds Values from an Observation in One Data Set to All Observations in Another

The program in the main example takes the one observation in OVERALL that was created by the PROC MEANS step and adds the values of the two variables, P10 and P25, to each observation in data set STUDENTS. The DATA step uses two SET statements. The following figures show how SAS processes the DATA step and adds P10 and P25 to each observation. The figures show only the first 10 of the 200 observations in STUDENTS.

Throughout execution of the DATA step, the two values that are contributed from OVERALL, P10 and P25, remain in the Program Data Vector (PDV). These values are also written to the data set PCSTUDENTS.

During the compilation phase, SAS reads the descriptor portions of the input data sets and creates the Program Data Vector (PDV).



The PDV contains the automatic variable, _N_, which keeps track of how many times the DATA step has iterated. (It does this by counting the number of times the DATA statement executes.) _N_ is a temporary variable and is not included in data set PCSTUDENTS.



The PDV also includes the two variables that were created by the DATA step, TOP10PC and TOP25PC.



Execution begins. On the first iteration of the DATA step, _N_ has a value of 1. The IF statement evaluates as true so the first SET statement executes. SAS reads the values of P10 and P25 from OVERALL and records the values in the PDV.



The second SET statement executes and reads the first observation in STUDENTS. The DATA step stores the values of ID and SCORE for the first observation in the PDV.



The DATA step calculates values for TOP10PC and TOP25PC and records the values in the PDV.



The first iteration of the DATA step concludes and variables ID, SCORE, TOP10PC, and TOP25PC are written to data set PCSTUDENTS. _N_ is a temporary variable. The DROP option on the DATA step prevents P10 and P25 from being saved in PCSTUDENTS.



SAS sets the value of _N_ to 2 at the beginning of the second iteration of the DATA step. The IF statement is false and so the first SET statement does not execute. However, the values of P10 and P25 remain in the PDV.



The second SET statement executes. Values for ID and STUDENT from the second observation in STUDENTS are recorded in the PDV.



The DATA steps computes values for TOP10PC and TOP25PC for the second observation and records them in the PDV. Lastly, it writes the contents of the PDV (except for _N_) to PCSTUDENTS.



The DATA step executes until all observations in STUDENTS are processed. The values for P10 and P25 remain in the PDV. Only the first 10 of the 200 observations in PCSTUDENTS are shown in this figure.



..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.93.210