Create variables that have the values of a variable from previous observations. Process the data set in BY groups so that the process of obtaining lagged values starts over with each new BY group.
This example uses the same input data set as Example 8.14.
Featured Step | DATA step |
Featured Step Options and Statements | LAGn function |
Related Technique | DATA step, BY and RETAIN statements, FIRST. and LAST. automatic variables |
A Closer Look | More on the DATA Step in This Example |
Data set PRODUCTION records the number of units of an item made on several days at four plant locations.
PRODUCTION units_ Obs plant proddate made 1 Aux Plant 03/02/2009 76 2 Main Plant 02/27/2009 393 3 Main Plant 03/03/2009 501 4 Main Plant 03/04/2009 492 5 Main Plant 03/05/2009 719 6 Main Plant 03/06/2009 111 7 Main Plant 03/09/2009 268 8 Main Plant 03/10/2009 350 9 Port Park 02/20/2009 791 10 Port Park 02/27/2009 658 11 Port Park 03/10/2009 981 12 Port Park 03/11/2009 612 13 Port Park 03/13/2009 664 14 West Side 02/23/2009 629 15 West Side 02/24/2009 543
Output 8.13 PRODLAST3 Data SetExample 8.13 PRODLAST3 Data Set Created with DATA Step units_ too_few_ run_avg_ units_ units_ units_ Obs plant proddate made days units made1 made2 made3 1 Aux Plant 03/02/2009 76 *** 76 . . . 2 Main Plant 03/05/2009 719 526 492 501 393 3 Main Plant 03/06/2009 111 456 719 492 501 4 Main Plant 03/09/2009 268 398 111 719 492 5 Main Plant 03/10/2009 350 362 268 111 719 6 Port Park 03/11/2009 612 761 981 658 791 7 Port Park 03/13/2009 664 729 612 981 658 8 West Side 02/24/2009 543 *** 586 629 . . |
This DATA step uses the LAGn function in conjunction with BY-group processing to obtain lagged values for a variable within each BY group. The goal of the program is for each observation read, find variable values in the three observations previous to it within the BY group. The DATA step includes code to handle the situation when a BY group does not have sufficient observations to look back three observations.
Data set PRODUCTION records the number of units of an item made on several days at four plant locations. One plant has only one production day, one has two production days, one has four production days, and the fourth has seven production days.
The goal of the DATA step is to find the three most recent production counts in the plant for the observation that is currently being processed. Variable UNITS_MADE stores the counts. Prior to the DATA step, PRODUCTION is sorted by PLANT, and within each value of PLANT, observations are sorted by PRODDATE. This sort arranges the observations in chronological order within each value of PLANT so that the lagged values are obtained in reverse chronological order: UNITS_MADE1 is the lagged value of UNITS_MADE for the most recent previous observation; UNITS_MADE2 is the lagged value of UNITS_MADE for the second most recent previous observation; and UNITS_MADE3 is the lagged value of UNITS_MADE for the third most recent previous observation.
Because it is required that the program must retrieve previous values within the BY group, the DATA step includes code to reset the variables that retain the lagged values at the beginning of each BY group. If you did not reset the lag variables at the beginning of the BY group, you would end up obtaining lagged values from a different BY group.
The program writes out observations under one of two conditions:
For BY groups with four or more production dates, all observations starting with the fourth production date. These observations will have values for all three lagged value variables (UNITS_MADE1, UNITS_MADE2, and UNITS_MADE3).
For BY groups with fewer than four production dates, one observation with as many lagged values as available. This corresponds to the last observation in the BY group.
The DATA step also computes an average of the current value of UNITS_MADE and the three lagged values. When a BY group has fewer than four production dates, the average is computed and variable TOO_FEW_DAYS is assigned a value of "***" to flag observations that do not have the full four days of production values to evaluate.
For a description of the processing of this DATA step, see the following "A Closer Look" section and Table 8.2.
Create data set PRODLAST3. Read the observations from data set PRODUCTION. Process the data in BY groups.
Define new variable TOO_FEW_DAYS to flag observations with fewer than four production values to evaluate.
Define an array of the three lagged value variables. Obtain the three lagged values of UNITS_MADE for the current observation. When the first observation in each BY group is processed, reset COUNT to 1. Accumulator variable COUNT is used by the following DO loop to set appropriate elements of array UNITS to missing. On each iteration of the DO loop, set to missing the array elements that have not yet received a lagged value for the current BY group. This action prevents the observation from having values from previous BY groups. Execute this DO group when processing at least the fourth observation in a BY group of at least four observations, or when processing the last observation in a BY group of fewer than four observations. Compute the running average of the current value of UNITS_MADE and the three lagged values of UNITS_MADE. Flag observations from BY groups that have fewer than four observations in the BY group. Output the observations that are processed by the IF-THEN-DO group.
Increase COUNT by 1 so that on the next iteration of the DATA step, if it is still the same BY group, the appropriate elements of UNITS will be set to missing when the DO loop executes.
data prodlast3; set production; by plant; drop count i; length too_few_days $ 3; format run_avg_units 5.; array units{3} units_made1-units_made3; units_made1=lag1(units_made); units_made2=lag2(units_made); units_made3=lag3(units_made); if first.plant then count=1; do i=count to dim(units); units{i}=.; end; if count ge 4 or (last.plant and count lt 4) then do; run_avg_units=mean(units_made, units_made1, units_made2, units_made3); if n(units_made, units_made, units_made2, units_made3) lt 4 then too_few_days='***'; output; end; count+1; run;
The following DATA step creates a data set equivalent to the one that was created by the DATA step in the main example. Instead of using the LAGn function, this DATA step simulates the action of the LAGn function by retaining the values of the three lagged variables—UNITS_MADE1, UNITS_MADE2, and UNITS_MADE3—across iterations of the DATA step. The variables' names are listed in a RETAIN statement.
Prior to the DATA step, PRODUCTION is sorted by PLANT, and within each value of PLANT, observations are sorted by PRODDATE. This sort arranges the observations in chronological order within each value of PLANT so that the lagged values are obtained in reverse chronological order as they are in the DATA step in the main example.
An IF-THEN-DO block executes when processing the first observation in a BY group. The statement in the DO loop inside the block initializes the three lagged variables to missing at the beginning of each BY group. Another statement in the block initializes the variable COUNT to 1. The use of variable COUNT in this DATA step is similar to its use in the DATA step in the main example.
A DO loop near the end of the DATA step shifts the values of UNITS_MADE and the three lagged variables down one element in the UNITS array.
As in the main example, this DATA step computes a running average of the UNITS_MADE series of variables, and it assigns a value to TOO_FEW_DAYS when there are fewer than four observations in a BY group. This DATA step also writes out observations under the same two conditions as the main DATA step.
Neither of the two DATA steps that are presented in this example has a distinct advantage over the other.
Create data set PRODLAST3. Read the observations from data set PRODUCTION.
Process the data in BY groups that are defined by the values of PLANT.
Define an array of UNITS_MADE and the three retained variables. Specify the three variables whose values should be retained across iterations of the DATA step. At the beginning of each BY group, initialize the variables whose values contain the lagged values of UNITS_MADE. Initialize COUNT to indicate the first observation in the BY group. Execute this DO group when processing the first three observations in a BY group and one of them is the last observation in the BY group. Flag observations from BY groups that have fewer than four observations in the BY group. Compute the running average of UNITS_MADE and the three lagged values of UNITS_MADE. Output the observations that are processed by the IF-THEN-DO block.
Execute this block when processing the fourth or greater observation in a BY group.
Output the observations that are processed by the ELSE-THEN-DO block.
Increase COUNT by 1 to track the position of the next observation in the BY group.
data prodlast3; set production; by plant; drop count i; length too_few_days $ 3; format run_avg_units 5.; array units{4} units_made units_made1-units_made3; retain units_made1-units_made3; if first.plant then do; do i=2 to 4; units{i}=.; end; count=1; end; if (1 le count lt 4) and last.plant then do; too_few_days='***'; run_avg_units=mean(units_made, units_made1, units_made2, units_made3); output; end; else if count ge 4 then do; run_avg_units=mean(units_made, units_made1, units_made2, units_made3); output; end; count+1; run;
Each call to the LAGn function stores a value in a queue and returns a value that was stored previously in that queue. Each occurrence of a LAGn function in a DATA step generates its own queue of values. The DATA step in the main example has three assignment statements that each invoke a different LAGn function: LAG1, LAG2, and LAG3. Therefore, this DATA step creates three queues of values that are independent of each other.
The DATA step invokes the LAGn functions on each iteration of the DATA step, even if the LAGn function retrieves values from a different BY group. Conditionally invoking the LAGn functions based on BY values would not produce the required data set. This is because storing values at the bottom of the LAGn queue and returning values from the top of the queue occurs only when the LAGn function executes. Conditional calls to LAGn would store and return values only from the observations for which the condition is satisfied.
Table 8.2 shows the values of selected variables at two places in the DATA step during execution. The "Top" position is immediately after the first IF statement executes. This is the IF statement that assigns a value to COUNT at the beginning of a BY group. The "Bottom" position is at the end of the DATA step following the COUNT+1 statement. The eight highlighted rows are the observations that the DATA step writes to data set PRODLAST3.
DATA Step Position | _N_ | LAST.PLANT | PLANT | PRODDATE | UNITS_MADE | UNITS_MADE1 | UNITS_MADE2 | UNITS_MADE3 | COUNT |
---|---|---|---|---|---|---|---|---|---|
Top | 1 | 1 | Aux Plant | 03/02/2009 | 76 | . | . | . | 1 |
Bottom | 1 | 1 | Aux Plant | 03/02/2009 | 76 | . | . | . | 1 |
Top | 2 | 0 | Main Plant | 02/27/2009 | 393 | 76 | . | . | 1 |
Bottom | 2 | 0 | Main Plant | 02/27/2009 | 393 | . | . | . | 2 |
Top | 3 | 0 | Main Plant | 03/03/2009 | 501 | 393 | . | . | 2 |
Bottom | 3 | 0 | Main Plant | 03/03/2009 | 501 | . | . | . | 3 |
Top | 4 | 0 | Main Plant | 03/04/2009 | 492 | 501 | 393 | . | 3 |
Bottom | 4 | 0 | Main Plant | 03/04/2009 | 492 | 501 | 393 | . | 4 |
Top | 5 | 0 | Main Plant | 03/05/2009 | 719 | 492 | 501 | 393 | 4 |
Bottom | 5 | 0 | Main Plant | 03/05/2009 | 719 | 492 | 501 | 393 | 5 |
Top | 6 | 0 | Main Plant | 03/06/2009 | 111 | 719 | 492 | 501 | 5 |
Bottom | 6 | 0 | Main Plant | 03/06/2009 | 111 | 719 | 492 | 501 | 6 |
Top | 7 | 0 | Main Plant | 03/09/2009 | 268 | 111 | 719 | 492 | 6 |
Bottom | 7 | 0 | Main Plant | 03/09/2009 | 268 | 111 | 719 | 492 | 7 |
Top | 8 | 1 | Main Plant | 03/10/2009 | 350 | 268 | 111 | 719 | 7 |
Bottom | 8 | 1 | Main Plant | 03/10/2009 | 350 | 268 | 111 | 719 | 8 |
Top | 9 | 0 | Port Park | 02/20/2009 | 791 | 350 | 268 | 111 | 1 |
Bottom | 9 | 0 | Port Park | 02/20/2009 | 791 | . | . | . | 2 |
Top | 10 | 0 | Port Park | 02/27/2009 | 658 | 791 | 350 | 268 | 2 |
Bottom | 10 | 0 | Port Park | 02/27/2009 | 658 | 791 | . | . | 3 |
Top | 11 | 0 | Port Park | 03/10/2009 | 981 | 658 | 791 | 350 | 3 |
Bottom | 11 | 0 | Port Park | 03/10/2009 | 981 | 658 | 791 | . | 4 |
Top | 12 | 0 | Port Park | 03/11/2009 | 612 | 981 | 658 | 791 | 4 |
Bottom | 12 | 0 | Port Park | 03/11/2009 | 612 | 981 | 658 | 791 | 5 |
Top | 13 | 1 | Port Park | 03/13/2009 | 664 | 612 | 981 | 658 | 5 |
Bottom | 13 | 1 | Port Park | 03/13/2009 | 664 | 612 | 981 | 658 | 6 |
Top | 14 | 0 | West Side | 02/23/2009 | 629 | 664 | 612 | 981 | 1 |
Bottom | 14 | 0 | West Side | 02/23/2009 | 629 | . | . | . | 2 |
Top | 15 | 1 | West Side | 02/24/2009 | 543 | 629 | . | . | 2 |
Bottom | 15 | 1 | West Side | 02/24/2009 | 543 | 629 | . | . | 3 |
3.145.45.5