Example 8.13 Obtaining the Previous Value of a Variable within a BY Group

Goal

Create variables that have the values of a variable from previous observations. Process the data set in BY groups so that the process of obtaining lagged values starts over with each new BY group.

This example uses the same input data set as Example 8.14.

Example Features

Featured StepDATA step
Featured Step Options and StatementsLAGn function
Related TechniqueDATA step, BY and RETAIN statements, FIRST. and LAST. automatic variables
A Closer LookMore on the DATA Step in This Example

Input Data Set

Data set PRODUCTION records the number of units of an item made on several days at four plant locations.

                  PRODUCTION
                                      units_
  Obs      plant         proddate      made
    1    Aux Plant     03/02/2009        76
    2    Main Plant    02/27/2009       393
    3    Main Plant    03/03/2009       501
    4    Main Plant    03/04/2009       492
    5    Main Plant    03/05/2009       719
    6    Main Plant    03/06/2009       111
    7    Main Plant    03/09/2009       268
    8    Main Plant    03/10/2009       350
    9    Port Park     02/20/2009       791
   10    Port Park     02/27/2009       658
   11    Port Park     03/10/2009       981
   12    Port Park     03/11/2009       612
   13    Port Park     03/13/2009       664
   14    West Side     02/23/2009       629
   15    West Side     02/24/2009       543

Resulting Data Set

Output 8.13 PRODLAST3 Data Set

            Example 8.13 PRODLAST3 Data Set Created with DATA Step

                             units_  too_few_  run_avg_  units_  units_  units_
Obs    plant       proddate   made     days     units     made1   made2   made3

 1   Aux Plant   03/02/2009     76     ***         76        .       .       .
 2   Main Plant  03/05/2009    719                526      492     501     393
 3   Main Plant  03/06/2009    111                456      719     492     501
 4   Main Plant  03/09/2009    268                398      111     719     492
 5   Main Plant  03/10/2009    350                362      268     111     719
 6   Port Park   03/11/2009    612                761      981     658     791
 7   Port Park   03/13/2009    664                729      612     981     658
 8   West Side   02/24/2009    543     ***        586      629       .       .


Example Overview

This DATA step uses the LAGn function in conjunction with BY-group processing to obtain lagged values for a variable within each BY group. The goal of the program is for each observation read, find variable values in the three observations previous to it within the BY group. The DATA step includes code to handle the situation when a BY group does not have sufficient observations to look back three observations.

Data set PRODUCTION records the number of units of an item made on several days at four plant locations. One plant has only one production day, one has two production days, one has four production days, and the fourth has seven production days.

The goal of the DATA step is to find the three most recent production counts in the plant for the observation that is currently being processed. Variable UNITS_MADE stores the counts. Prior to the DATA step, PRODUCTION is sorted by PLANT, and within each value of PLANT, observations are sorted by PRODDATE. This sort arranges the observations in chronological order within each value of PLANT so that the lagged values are obtained in reverse chronological order: UNITS_MADE1 is the lagged value of UNITS_MADE for the most recent previous observation; UNITS_MADE2 is the lagged value of UNITS_MADE for the second most recent previous observation; and UNITS_MADE3 is the lagged value of UNITS_MADE for the third most recent previous observation.

Because it is required that the program must retrieve previous values within the BY group, the DATA step includes code to reset the variables that retain the lagged values at the beginning of each BY group. If you did not reset the lag variables at the beginning of the BY group, you would end up obtaining lagged values from a different BY group.

The program writes out observations under one of two conditions:

  • For BY groups with four or more production dates, all observations starting with the fourth production date. These observations will have values for all three lagged value variables (UNITS_MADE1, UNITS_MADE2, and UNITS_MADE3).

  • For BY groups with fewer than four production dates, one observation with as many lagged values as available. This corresponds to the last observation in the BY group.

The DATA step also computes an average of the current value of UNITS_MADE and the three lagged values. When a BY group has fewer than four production dates, the average is computed and variable TOO_FEW_DAYS is assigned a value of "***" to flag observations that do not have the full four days of production values to evaluate.

For a description of the processing of this DATA step, see the following "A Closer Look" section and Table 8.2.

Program

Create data set PRODLAST3. Read the observations from data set PRODUCTION. Process the data in BY groups.

Define new variable TOO_FEW_DAYS to flag observations with fewer than four production values to evaluate.

Define an array of the three lagged value variables. Obtain the three lagged values of UNITS_MADE for the current observation. When the first observation in each BY group is processed, reset COUNT to 1. Accumulator variable COUNT is used by the following DO loop to set appropriate elements of array UNITS to missing. On each iteration of the DO loop, set to missing the array elements that have not yet received a lagged value for the current BY group. This action prevents the observation from having values from previous BY groups. Execute this DO group when processing at least the fourth observation in a BY group of at least four observations, or when processing the last observation in a BY group of fewer than four observations. Compute the running average of the current value of UNITS_MADE and the three lagged values of UNITS_MADE. Flag observations from BY groups that have fewer than four observations in the BY group. Output the observations that are processed by the IF-THEN-DO group.

Increase COUNT by 1 so that on the next iteration of the DATA step, if it is still the same BY group, the appropriate elements of UNITS will be set to missing when the DO loop executes.

data prodlast3;
  set production;

  by plant;
  drop count i;
  length too_few_days $ 3;


  format run_avg_units 5.;
  array units{3} units_made1-units_made3;

  units_made1=lag1(units_made);
  units_made2=lag2(units_made);
  units_made3=lag3(units_made);

  if first.plant then count=1;

  do i=count to dim(units);
    units{i}=.;
  end;



  if count ge 4 or (last.plant and count lt 4) then do;




    run_avg_units=mean(units_made, units_made1,
                       units_made2, units_made3);


    if n(units_made, units_made, units_made2,
         units_made3) lt 4 then too_few_days='***';

    output;

  end;
  count+1;




run;

Related Technique

The following DATA step creates a data set equivalent to the one that was created by the DATA step in the main example. Instead of using the LAGn function, this DATA step simulates the action of the LAGn function by retaining the values of the three lagged variables—UNITS_MADE1, UNITS_MADE2, and UNITS_MADE3—across iterations of the DATA step. The variables' names are listed in a RETAIN statement.

Prior to the DATA step, PRODUCTION is sorted by PLANT, and within each value of PLANT, observations are sorted by PRODDATE. This sort arranges the observations in chronological order within each value of PLANT so that the lagged values are obtained in reverse chronological order as they are in the DATA step in the main example.

An IF-THEN-DO block executes when processing the first observation in a BY group. The statement in the DO loop inside the block initializes the three lagged variables to missing at the beginning of each BY group. Another statement in the block initializes the variable COUNT to 1. The use of variable COUNT in this DATA step is similar to its use in the DATA step in the main example.

A DO loop near the end of the DATA step shifts the values of UNITS_MADE and the three lagged variables down one element in the UNITS array.

As in the main example, this DATA step computes a running average of the UNITS_MADE series of variables, and it assigns a value to TOO_FEW_DAYS when there are fewer than four observations in a BY group. This DATA step also writes out observations under the same two conditions as the main DATA step.

Neither of the two DATA steps that are presented in this example has a distinct advantage over the other.

Create data set PRODLAST3. Read the observations from data set PRODUCTION.

Process the data in BY groups that are defined by the values of PLANT.

Define an array of UNITS_MADE and the three retained variables. Specify the three variables whose values should be retained across iterations of the DATA step. At the beginning of each BY group, initialize the variables whose values contain the lagged values of UNITS_MADE. Initialize COUNT to indicate the first observation in the BY group. Execute this DO group when processing the first three observations in a BY group and one of them is the last observation in the BY group. Flag observations from BY groups that have fewer than four observations in the BY group. Compute the running average of UNITS_MADE and the three lagged values of UNITS_MADE. Output the observations that are processed by the IF-THEN-DO block.

Execute this block when processing the fourth or greater observation in a BY group.

Output the observations that are processed by the ELSE-THEN-DO block.

Increase COUNT by 1 to track the position of the next observation in the BY group.

data prodlast3;
  set production;

  by plant;

  drop count i;
  length too_few_days $ 3;
  format run_avg_units 5.;
  array units{4} units_made units_made1-units_made3;

  retain units_made1-units_made3;


  if first.plant then do;
    do i=2 to 4;
      units{i}=.;
    end;
    count=1;
  end;

  if (1 le count lt 4) and last.plant then do;


    too_few_days='***';

    run_avg_units=mean(units_made, units_made1,
                       units_made2, units_made3);

    output;

  end;
  else if count ge 4 then do;


    run_avg_units=mean(units_made, units_made1,
                       units_made2, units_made3);
    output;

  end;
  count+1;

run;

A Closer Look

More on the DATA Step in This Example

Each call to the LAGn function stores a value in a queue and returns a value that was stored previously in that queue. Each occurrence of a LAGn function in a DATA step generates its own queue of values. The DATA step in the main example has three assignment statements that each invoke a different LAGn function: LAG1, LAG2, and LAG3. Therefore, this DATA step creates three queues of values that are independent of each other.

The DATA step invokes the LAGn functions on each iteration of the DATA step, even if the LAGn function retrieves values from a different BY group. Conditionally invoking the LAGn functions based on BY values would not produce the required data set. This is because storing values at the bottom of the LAGn queue and returning values from the top of the queue occurs only when the LAGn function executes. Conditional calls to LAGn would store and return values only from the observations for which the condition is satisfied.

Table 8.2 shows the values of selected variables at two places in the DATA step during execution. The "Top" position is immediately after the first IF statement executes. This is the IF statement that assigns a value to COUNT at the beginning of a BY group. The "Bottom" position is at the end of the DATA step following the COUNT+1 statement. The eight highlighted rows are the observations that the DATA step writes to data set PRODLAST3.

Table 8.2. Variable Values at Different Places during the Processing of the DATA Step in This Example
DATA Step Position_N_LAST.PLANTPLANTPRODDATEUNITS_MADEUNITS_MADE1UNITS_MADE2UNITS_MADE3COUNT
Top11Aux Plant03/02/200976...1
Bottom11Aux Plant03/02/200976...1
Top20Main Plant02/27/200939376..1
Bottom20Main Plant02/27/2009393...2
Top30Main Plant03/03/2009501393..2
Bottom30Main Plant03/03/2009501...3
Top40Main Plant03/04/2009492501393.3
Bottom40Main Plant03/04/2009492501393.4
Top50Main Plant03/05/20097194925013934
Bottom50Main Plant03/05/20097194925013935
Top60Main Plant03/06/20091117194925015
Bottom60Main Plant03/06/20091117194925016
Top70Main Plant03/09/20092681117194926
Bottom70Main Plant03/09/20092681117194927
Top81Main Plant03/10/20093502681117197
Bottom81Main Plant03/10/20093502681117198
Top90Port Park02/20/20097913502681111
Bottom90Port Park02/20/2009791...2
Top100Port Park02/27/20096587913502682
Bottom100Port Park02/27/2009658791..3
Top110Port Park03/10/20099816587913503
Bottom110Port Park03/10/2009981658791.4
Top120Port Park03/11/20096129816587914
Bottom120Port Park03/11/20096129816587915
Top131Port Park03/13/20096646129816585
Bottom131Port Park03/13/20096646129816586
Top140West Side02/23/20096296646129811
Bottom140West Side02/23/2009629...2
Top151West Side02/24/2009543629..2
Bottom151West Side02/24/2009543629..3

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.97.47