Example 2.1 Concatenating Data Sets

Goal

Concatenate three data sets. Define a variable that identifies the data set from which the observation originated.

Example Features

Featured StepDATA step
Featured Step Options and StatementsSET statement, IN= option
Related Technique 1PROC SQL, OUTER UNION set operator and CORR keyword
Related Technique 2PROC APPEND

Input Data Sets

Data sets SPRINGGRADS, SUMMERGRADS, and FALLGRADS contain lists of biology students who graduated in spring, summer, and fall. The Student' final GPAs are included.

               SPRINGGRADS

Obs      student       concentration    gpa
 1   Johnson, Allison  Biochemistry    3.86
 2   Davis, Jeffrey    General Biology 3.91
 3   Hall, Daniel      Genetics        3.43
 4   Hill, Courtney    Ecology         4.00
 5   Scott,  Tiffany   Plant  Biology  3.58
 6   Martinez, Teresa  Zoology         3.21

The three data sets are also used in Example 2.2.

                SUMMERGRADS

Obs     student        concentration    gpa
 1   Robinson, Adam    Ecology          3.87
 2   Cox, William      Zoology          3.61

                FALLGRADS

Obs     student        concentration      gpa
 1   Mitchell, Frank   Biochemistry       3.25
 2   Rogers, Melissa   Microbiology       4.00
 3   Brown, Beverly    Molecular Biology  3.42

Resulting Data Set

Output 2.1 ALLGRADS Data Set

      Example 2.1 ALLGRADS Data Set Created by DATA Step

Obs       student        concentration        gpa   graduated

  1   Johnson, Allison   Biochemistry        3.86    Spring
  2   Davis, Jeffrey     General Biology     3.91    Spring
  3   Hall, Daniel       Genetics            3.43    Spring
  4   Hill, Courtney     Ecology             4.00    Spring
  5   Scott, Tiffany     Plant Biology       3.58    Spring
  6   Martinez, Teresa   Zoology             3.21    Spring
  7   Robinson, Adam     Ecology             3.87    Summer
  8   Cox, William       Zoology             3.61    Summer
  9   Mitchell, Frank    Biochemistry        3.25    Fall
 10   Rogers, Melissa    Microbiology        4.00    Fall
 11   Brown, Beverly     Molecular Biology   3.42    Fall


Example Overview

The DATA step in this program uses a SET statement to concatenate three data sets: SPRINGGRADS, SUMMERGRADS, and FALLGRADS. The names of the three data sets are listed in the SET statement in the order in which they should be concatenated.

The IN= data set option is added to each input data set to define temporary variables INSPRING, INSUMMER, and INFALL. IF-ELSE statements test the values of these variables to determine the data set source for the observation that is currently being processed and to assign values to new variable GRADUATED. For example, INSPRING=1 when the observation is from data set SPRINGGRADS. When the observation is from SUMMERGRADS or FALLGRADS, the value of INSPRING is 0.

Each of the three data sets contains the same three variables. Therefore, ALLGRADS contains these three variables plus variable GRADUATED, which is created in the DATA step.

If your data sets contain different variables, all variables from all the data sets are included in the output data set unless you specify the variables to keep in the output data set. When concatenating data sets with a SET statement, SAS assigns missing values to a variable for the observations that are contributed from a data set where the variable is not present, but where the variable is present in another data set.

Program

Create data set ALLGRADS. Concatenate three data sets. Add the IN= data set option to each input data set so that the data set origin of each observation can be determined. Define a new variable. Assign values to GRADUATED depending on the origin of the observation.

data allgrads;
  set springgrads(in=inspring)
      summergrads(in=insummer)
      fallgrads(in=infall);

  attrib graduated length=$6 label='Semester Graduated';
  if inspring then graduated='Spring';
  else if insummer then graduated='Summer';
  else if infall then graduated='Fall';
run;

Related Technique 1

The following PROC SQL step creates a table equivalent to the data set that was created by the DATA step in the main example. It concatenates three queries by using the OUTER UNION set operator and the CORR keyword. (CORR is the short version of CORRESPONDING.) Each query creates the same new column, GRADUATED, whose values reflect the origin of the row.

The OUTER UNION set operator concatenates tables. The CORR keyword matches and aligns the same-named columns that were extracted from the two input tables.

Create table ALLGRADS. Specify a SELECT clause for each table that you want to concatenate and specify them in the order you want the tables concatenated. Select all columns from table SPRINGGRADS. Define new column, GRADUATED.

Concatenate the rows from the next query to the rows that were selected by the first query. Include the CORR keyword to match and align same-named columns from the two queries. Select all columns from table SUMMERGRADS. Define new column, GRADUATED.

Concatenate the rows from the next query to the rows that were selected from joining the first two queries. Select all columns from table FALLGRADS. Define new column, GRADUATED.

proc sql;
  create table allgrads as



    select *,
           'Spring' as graduated length=6
               label='Semester Graduated'
      from springgrads
      outer union corr
    select *,
             'Summer' as graduated length=6
               label='Semester Graduated'
      from summergrads
      outer union corr


    select *,
           'Fall' as graduated length=6
               label='Semester Graduated'
       from fallgrads;
quit;

Related Technique 2

Using two PROC APPEND steps, the following program concatenates the three data sets SPRINGGRADS, SUMMERGRADS, and FALLGRADS. This procedure adds the observations from one data set to the end of another data set.

When concatenating data sets with PROC APPEND, you cannot add a variable the way you did in the DATA step and PROC SQL step in Related Technique 1.

Other limitations of PROC APPEND to be aware of include how the procedure concatenates data sets that have different variable attributes or different variables. For example, by default, PROC APPEND does not append a data set to another under two conditions:

  • The data set being appended to the base data set has one or more variables that are not present in the base data set.

  • A variable common to both data sets is character type in one data set and numeric type in the other.

For more information about these limitations and the uses of PROC APPEND, see Example 2.3 and SAS documentation.

The most practical use of PROC APPEND might be when you need to add a relatively small number of observations to the end of a much larger data set. Because PROC APPEND does not read the BASE= data set, you can save processing time compared to concatenating data sets with the DATA step or PROC SQL where all of the data set or table is read.

Delete temporary data set ALLGRADS so that the following steps do not add observations to an existing data set.

Add the observations from data set SPRINGGRADS to ALLGRADS. Create data set ALLGRADS (it does not exist because of execution of the preceding PROC DATASETS step). Add the observations from SUMMERGRADS to data set ALLGRADS. Add the observations from FALLGRADS to ALLGRADS to complete the concatenation of the three data sets.

proc datasets library=work nolist;
  delete allgrads;
run;
quit;
proc append base=allgrads data=springgrads;
run;


proc append base=allgrads data=summergrads;
run;

proc append base=allgrads data=fallgrads;
run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.219.164.149