Example 8.16 Obtaining Hierarchical Data from a Table and Matching Them to the Rows in the Same Table

Goal

Identify hierarchically related rows in a table. For each group of hierarchically related rows, extract data from the row that is at the higher level of the hierarchy and join these values to all rows in the group.

Example Features

Featured StepPROC SQL
Featured Step Options and StatementsSelf-join, WHERE clause
Related TechniqueDATA step, match-merging a data set with itself

Input Table

Table PERSONS records survey information about persons in households. It has data for four households. The respondent to the survey has a value of "reference" for column RELATIONSHIP. These rows are considered at the higher level of the hierarchy within each household's set of rows.

                                  PERSONS
                                                                       work_
Obs  household  person  relationship    gender         dob  education  status
  1   1001         1    reference         F     09/13/1972     AA        PT
  2   1001         2    spouse            M     10/25/1970     MS        FT
  3   1001         3    child             F     02/12/2002     00        NA
  4   1001         4    child             F     02/12/2002     00        NA
  5   1001         5    child             M     06/21/2005     00        NA
  6   1011         1    reference         F     07/12/1985     BA        FT
  7   1012         1    reference         F     01/30/1946     HS        RT
  8   1012         2    spouse            M     02/04/1940     HS        RT
  9   1012         3    other relative    F     12/22/1921     BA        RT
 10   1015         1    reference         M     06/28/1990     HS        SC
 11   1015         2    other             M     05/25/1990     HS        SC
 12   1015         3    other             M     07/16/1989     HS        SC
 13   1015         4    other             M     11/02/1988     BS        SC

Resulting Table

Output 8.16 FAMILY_SURVEY Table

                              Example 8.16 FAMILY_SURVEY Table Created with PROC SQL

                                                                 work_      reference_ reference_ reference_ reference_
Obs household  person relationship   gender        dob education status age    educ       work      gender       age

  1  1001         1   reference        F    09/13/1972    AA       PT    36     AA         PT         F          36
  2  1001         2   spouse           M    10/25/1970    MS       FT    38     AA         PT         F          36
  3  1001         3   child            F    02/12/2002    00       NA     7     AA         PT         F          36
  4  1001         4   child            F    02/12/2002    00       NA     7     AA         PT         F          36
  5  1001         5   child            M    06/21/2005    00       NA     3     AA         PT         F          36
  6  1011         1   reference        F    07/12/1985    BA       FT    23     BA         FT         F          23
  7  1012         1   reference        F    01/30/1946    HS       RT    63     HS         RT         F          63
  8  1012         2   spouse           M    02/04/1940    HS       RT    69     HS         RT         F          63
  9  1012         3   other relative   F    12/22/1921    BA       RT    87     HS         RT         F          63
 10  1015         1   reference        M    06/28/1990    HS       SC    18     HS         SC         M          18
 11  1015         2   other            M    05/25/1990    HS       SC    19     HS         SC         M          18
 12  1015         3   other            M    07/16/1989    HS       SC    19     HS         SC         M          18
 13  1015         4   other            M    11/02/1988    BS       SC    20     HS         SC         M          18


Example Overview

This example shows how to extract hierarchically related data from a table and join them back to rows in the same table by a matching column.

The table in this example has two levels of hierarchy. Table PERSONS records survey information about the members of a household with one person in the household serving as the reference person. This person's information is considered to be at the higher level of the hierarchy. Column RELATIONSHIP identifies the roles of the members of the household.

The goal of the program is to extract the reference person's information and join it to each row in the reference person's household.

The PROC SQL step performs a self-join, also called a reflexive join. It lists table PERSONS twice on the FROM clause. A different alias, P or REF, is assigned to each reference to PERSONS.

The SELECT clause specifies the columns to retrieve from the join. It assigns new column names to the columns that are retrieved from the second reference to PERSONS so that PROC SQL does not overwrite the person's unique values. It also computes the age of each person in the household and the age of the reference person. The age of the reference person is computed using the reference person's value of DOB. The new column's value is the same for all rows within a household.

A WHERE clause joins table PERSONS to itself by values of column FAMILY, and it selects rows only from the second reference to PERSONS where the value of RELATIONSHIP is "reference".

Note that the code does not check if a household is missing a row for the reference person. When adapting the code for your use, you might need to add code for situations of missing data.

Program

Create table HOUSEHOLD_SURVEY. Select all columns from the first reference to PERSONS, as specified by alias P. Compute the age of each member in the household. Use the formula that is discussed in Example 9.13. Retrieve values for three columns from the second reference to PERSONS, as specified by alias REF. Specify new names for the columns so that their values do not replace the same-named columns that were retrieved from the first reference to PERSONS. Using the date of birth for the reference person, which is retrieved from the second reference to PERSONS, compute the age of the reference person so that this constant value within the household is added to each row. Specify a self-join of table PERSONS. Assign an alias to each reference to table PERSONS. Join the rows by the values of HOUSEHOLD. Select from the second reference to PERSONS only the rows for the household member at the higher level in the hierarchy. Order the rows in table HOUSEHOLD_SURVEY.

proc sql;
  create table household_survey as
    select p.*,


           floor((intck('month',p.dob,'15jun2010'd) -
           (day('15jun2010'd) < day(p.dob)))/12) as age,

           ref.education as reference_educ,
           ref.work_status as reference_work,
           ref.gender as reference_gender,




           floor((intck('month',ref.dob,'15jun2010'd) -
             (day('15jun2010'd) < day(ref.dob)))/12) as
                              reference_age



      from persons p, persons ref

      where p.household=ref.household and
            ref.relationship='reference'


      order by p.household,p.person;

quit;

Related Technique

The DATA step that follows creates a data set equivalent to the table that was produced by PROC SQL in the main example. It merges data set PERSONS with itself by the values of HOUSEHOLD. The second reference to PERSONS in the MERGE statement specifies that only rows where the value of RELATIONSHIP is "reference" be merged to the observations that were retrieved by the first reference to PERSONS. It also renames several variables so that the values from the second reference do not overwrite the unique values for each person in the household that were retrieved by the first reference to PERSONS.

There can be multiple observations per value of HOUSEHOLD in the data set. At the beginning of each BY group, the second reference to PERSONS in the MERGE statement causes the DATA step to load into the Program Data Vector (PDV) the single observation in PERSONS that corresponds to the reference person. The values of the variables that are contributed from the second reference to PERSONS remain in the PDV throughout the remainder of processing for the BY group. These values for the person higher in the hierarchy are output with each observation in the BY group.

For a description of how SAS retains a variable value across iterations of a DATA step, see "Understanding How the DATA Step Adds Values from an Observation in One Data Set to All Observations in Another" in Example 5.1 in the "A Closer Look" section.

Assume prior to the DATA step, data set PERSONS is sorted by HOUSEHOLD, and within the values of HOUSEHOLD, by the values of PERSON. It is necessary to sort or index PERSONS by HOUSEHOLD so that the match-merging can be performed. Additionally, sorting by PERSON causes the observations to be in the same order as the rows in the table that was created by PROC SQL in the main example.

Create data set HOUSEHOLD_SURVEY. Merge PERSONS with itself. From the second reference to PERSONS, select only the rows for the reference person. Rename variables that were extracted from the second reference to PERSONS so that they do not overwrite the unique values for each person as extracted from the first reference to PERSONS. Drop this unneeded variable so that its values do not replace the values that were extracted from the first reference to PERSONS. Match data set PERSONS to itself by the values of HOUSEHOLD.

Compute the age of each member in the household. Using the date of birth for the reference person, which is retrieved from the second reference to PERSONS, compute the age of the reference person so that this constant value within the household is added to each observation.

data household_survey;

  merge persons
        persons(where=(relationship='reference')

                rename=(education=reference_educ
                        work_status=reference_work
                        dob=reference_dob
                        gender=reference_gender)


                        drop=person);


    by household;

  drop reference_dob;
  age=floor((intck('month',dob,'15jun2010'd) -
           (day('15jun2010'd) < day(dob)))/12);
  reference_age= floor(
            (intck('month',reference_dob,'15jun2010'd) -
         (day('15jun2010'd) <  day(reference_dob)))/12);



run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
52.15.137.91