Combine summary statistics that are saved in a single observation in one data set with all observations in another data set. Compute new variables based on the values of the summary and detail information.
Featured Step | DATA step |
Featured Step Options and Statements | IF _N_=1 statement |
Related Technique | PROC SQL, Cartesian product |
A Closer Look | Understanding How the DATA Step Adds Values from an Observation in One Data Set to All Observations in Another |
The first 10 of 200 observations in STUDENTS are listed.
PROC MEANS created data set OVERALL by analyzing variable SCORE in data set STUDENTS. The 10th percentile and 25th percentile statistics for SCORE are saved in OVERALL.
STUDENTS (first 10 observations) student_ Obs id score 1 101 74 2 102 83 3 103 76 4 104 74 5 105 84 6 106 76 7 107 79 8 108 67 9 109 70 10 110 79 . . . OVERALL Data Set Obs _TYPE_ _FREQ_ p10 p25 1 0 200 70 74
Output 5.1 First 10 Observations of PCSTUDENTS Data SetExample 5.1 PCSTUDENTS Data Set Created with DATA Step student_ Obs id score top10pc top25pc 1 101 74 Y Y 2 102 83 Y Y 3 103 76 Y Y 4 104 74 Y Y 5 105 84 Y Y 6 106 76 Y Y 7 107 79 Y Y 8 108 67 N N 9 109 70 Y N 10 110 79 Y Y ... |
This example shows how to add one observation from one data set to all observations in a detail data set. The single observation contains summary statistics for the detail data set. By adding the summary statistics to each observation in the detail data set, you can then compare each detail observation to statistics for all observations.
The following DATA step reads the single observation in summary data set OVERALL only during the DATA step's first iteration. The values for P10 and P25 from OVERALL remain in the Program Data Vector (PDV) while the DATA step reads each observation in STUDENTS. Data set OVERALL does not have to be read more than once.
Data set STUDENTS contains the individual test scores for each student. The DATA step adds to each observation that is read from STUDENTS the same values for two variables P10 and P25, which are stored in data set OVERALL. The PROC MEANS step that precedes the DATA step analyzes variable SCORE and saves in data set OVERALL the 10th percentile and first quartile in variables P10 and P25.
On the DATA step's first iteration, which is when automatic variable _N_ equals 1, the DATA step places in the PDV the values for P10 and P25 from the single observation in OVERALL. The values remain in the PDV for the duration of the DATA step because the SET statement for OVERALL never executes again. Values that are read with a SET statement are automatically retained until another observation is read from the data set named in the SET statement.
Each iteration reads one observation from STUDENTS. The DATA step computes two new variables, TOP10PC and TOP25PC, based on variables P10 and P25 that were added from OVERALL. The DATA step assigns the value 'Y' to variable TOP10PC if the student's score is in the top 10%; otherwise, it assigns the value 'N'. A similar rule is applied when determining the values for the top quartile variable, TOP25PC.
Each iteration writes out an observation that contains all the data for each student, including the two ranking variables TOP10PC and TOP25PC.
Save in data set OVERALL the 10th percentile and first quartile statistics for variable SCORE.
Create PCSTUDENTS. Read an observation from OVERALL only on the first iteration and keep only the P10 and P25 variables. Read each observation from STUDENTS. Compute two new variables based on the values read from OVERALL that were retained in the PDV for P10 and P25.
proc means data=students noprint; var score; output out=overall p10=p10 p25=p25; run; data pcstudents(drop=p10 p25); if _n_=1 then set overall(keep=p10 p25); set students; top10pc='N'; top25pc='N'; if score ge p10 then top10pc='Y'; if score ge p25 then top25pc='Y'; run;
The following PROC SQL step joins the two tables, OVERALL and STUDENTS. The program submits the same PROC MEANS step to create OVERALL that was shown in the main example. The join in the PROC SQL step produces a Cartesian product, which is a combination of each row from the first table with every row from the second table.
Generally, when joining larger data sets, you would want to avoid coding Cartesian product joins because of the large number of rows that would form the product. However, in this case, the size of the Cartesian product is manageable. With only one row in one of the tables (OVERALL) and 200 rows in the other table (STUDENTS), the number of rows in the product is 1 X 200=200.
When performing a Cartesian join, SAS generates a note in the SAS log about optimization of the join.
Save in data set OVERALL the 10th percentile and first quartile statistics for variable SCORE.
Create table PCSTUDENTS. Select specific columns from each table. Compute two new columns.
Specify a Cartesian product join.
proc means data=students noprint; var score; output out=overall p10=p10 p25=p25; run; proc sql; create table pcstudents as select student_id, score, case when score ge p10 then 'Y' else 'N' end as top10pc length=1, case when score ge p25 then 'Y' else 'N' end as top25pc length=1 from overall, students; quit;
The program in the main example takes the one observation in OVERALL that was created by the PROC MEANS step and adds the values of the two variables, P10 and P25, to each observation in data set STUDENTS. The DATA step uses two SET statements. The following figures show how SAS processes the DATA step and adds P10 and P25 to each observation. The figures show only the first 10 of the 200 observations in STUDENTS.
Throughout execution of the DATA step, the two values that are contributed from OVERALL, P10 and P25, remain in the Program Data Vector (PDV). These values are also written to the data set PCSTUDENTS.
During the compilation phase, SAS reads the descriptor portions of the input data sets and creates the Program Data Vector (PDV).
The PDV contains the automatic variable, _N_, which keeps track of how many times the DATA step has iterated. (It does this by counting the number of times the DATA statement executes.) _N_ is a temporary variable and is not included in data set PCSTUDENTS.
The PDV also includes the two variables that were created by the DATA step, TOP10PC and TOP25PC.
Execution begins. On the first iteration of the DATA step, _N_ has a value of 1. The IF statement evaluates as true so the first SET statement executes. SAS reads the values of P10 and P25 from OVERALL and records the values in the PDV.
The second SET statement executes and reads the first observation in STUDENTS. The DATA step stores the values of ID and SCORE for the first observation in the PDV.
The DATA step calculates values for TOP10PC and TOP25PC and records the values in the PDV.
The first iteration of the DATA step concludes and variables ID, SCORE, TOP10PC, and TOP25PC are written to data set PCSTUDENTS. _N_ is a temporary variable. The DROP option on the DATA step prevents P10 and P25 from being saved in PCSTUDENTS.
SAS sets the value of _N_ to 2 at the beginning of the second iteration of the DATA step. The IF statement is false and so the first SET statement does not execute. However, the values of P10 and P25 remain in the PDV.
The second SET statement executes. Values for ID and STUDENT from the second observation in STUDENTS are recorded in the PDV.
The DATA steps computes values for TOP10PC and TOP25PC for the second observation and records them in the PDV. Lastly, it writes the contents of the PDV (except for _N_) to PCSTUDENTS.
The DATA step executes until all observations in STUDENTS are processed. The values for P10 and P25 remain in the PDV. Only the first 10 of the 200 observations in PCSTUDENTS are shown in this figure.
3.18.104.213