Identify hierarchically related rows in a table. For each group of hierarchically related rows, extract data from the row that is at the higher level of the hierarchy and join these values to all rows in the group.
Featured Step | PROC SQL |
Featured Step Options and Statements | Self-join, WHERE clause |
Related Technique | DATA step, match-merging a data set with itself |
Table PERSONS records survey information about persons in households. It has data for four households. The respondent to the survey has a value of "reference" for column RELATIONSHIP. These rows are considered at the higher level of the hierarchy within each household's set of rows.
PERSONS work_ Obs household person relationship gender dob education status 1 1001 1 reference F 09/13/1972 AA PT 2 1001 2 spouse M 10/25/1970 MS FT 3 1001 3 child F 02/12/2002 00 NA 4 1001 4 child F 02/12/2002 00 NA 5 1001 5 child M 06/21/2005 00 NA 6 1011 1 reference F 07/12/1985 BA FT 7 1012 1 reference F 01/30/1946 HS RT 8 1012 2 spouse M 02/04/1940 HS RT 9 1012 3 other relative F 12/22/1921 BA RT 10 1015 1 reference M 06/28/1990 HS SC 11 1015 2 other M 05/25/1990 HS SC 12 1015 3 other M 07/16/1989 HS SC 13 1015 4 other M 11/02/1988 BS SC
Output 8.16 FAMILY_SURVEY TableExample 8.16 FAMILY_SURVEY Table Created with PROC SQL work_ reference_ reference_ reference_ reference_ Obs household person relationship gender dob education status age educ work gender age 1 1001 1 reference F 09/13/1972 AA PT 36 AA PT F 36 2 1001 2 spouse M 10/25/1970 MS FT 38 AA PT F 36 3 1001 3 child F 02/12/2002 00 NA 7 AA PT F 36 4 1001 4 child F 02/12/2002 00 NA 7 AA PT F 36 5 1001 5 child M 06/21/2005 00 NA 3 AA PT F 36 6 1011 1 reference F 07/12/1985 BA FT 23 BA FT F 23 7 1012 1 reference F 01/30/1946 HS RT 63 HS RT F 63 8 1012 2 spouse M 02/04/1940 HS RT 69 HS RT F 63 9 1012 3 other relative F 12/22/1921 BA RT 87 HS RT F 63 10 1015 1 reference M 06/28/1990 HS SC 18 HS SC M 18 11 1015 2 other M 05/25/1990 HS SC 19 HS SC M 18 12 1015 3 other M 07/16/1989 HS SC 19 HS SC M 18 13 1015 4 other M 11/02/1988 BS SC 20 HS SC M 18 |
This example shows how to extract hierarchically related data from a table and join them back to rows in the same table by a matching column.
The table in this example has two levels of hierarchy. Table PERSONS records survey information about the members of a household with one person in the household serving as the reference person. This person's information is considered to be at the higher level of the hierarchy. Column RELATIONSHIP identifies the roles of the members of the household.
The goal of the program is to extract the reference person's information and join it to each row in the reference person's household.
The PROC SQL step performs a self-join, also called a reflexive join. It lists table PERSONS twice on the FROM clause. A different alias, P or REF, is assigned to each reference to PERSONS.
The SELECT clause specifies the columns to retrieve from the join. It assigns new column names to the columns that are retrieved from the second reference to PERSONS so that PROC SQL does not overwrite the person's unique values. It also computes the age of each person in the household and the age of the reference person. The age of the reference person is computed using the reference person's value of DOB. The new column's value is the same for all rows within a household.
A WHERE clause joins table PERSONS to itself by values of column FAMILY, and it selects rows only from the second reference to PERSONS where the value of RELATIONSHIP is "reference".
Note that the code does not check if a household is missing a row for the reference person. When adapting the code for your use, you might need to add code for situations of missing data.
Create table HOUSEHOLD_SURVEY. Select all columns from the first reference to PERSONS, as specified by alias P. Compute the age of each member in the household. Use the formula that is discussed in Example 9.13. Retrieve values for three columns from the second reference to PERSONS, as specified by alias REF. Specify new names for the columns so that their values do not replace the same-named columns that were retrieved from the first reference to PERSONS. Using the date of birth for the reference person, which is retrieved from the second reference to PERSONS, compute the age of the reference person so that this constant value within the household is added to each row. Specify a self-join of table PERSONS. Assign an alias to each reference to table PERSONS. Join the rows by the values of HOUSEHOLD. Select from the second reference to PERSONS only the rows for the household member at the higher level in the hierarchy. Order the rows in table HOUSEHOLD_SURVEY.
proc sql; create table household_survey as select p.*, floor((intck('month',p.dob,'15jun2010'd) - (day('15jun2010'd) < day(p.dob)))/12) as age, ref.education as reference_educ, ref.work_status as reference_work, ref.gender as reference_gender, floor((intck('month',ref.dob,'15jun2010'd) - (day('15jun2010'd) < day(ref.dob)))/12) as reference_age from persons p, persons ref where p.household=ref.household and ref.relationship='reference' order by p.household,p.person; quit;
The DATA step that follows creates a data set equivalent to the table that was produced by PROC SQL in the main example. It merges data set PERSONS with itself by the values of HOUSEHOLD. The second reference to PERSONS in the MERGE statement specifies that only rows where the value of RELATIONSHIP is "reference" be merged to the observations that were retrieved by the first reference to PERSONS. It also renames several variables so that the values from the second reference do not overwrite the unique values for each person in the household that were retrieved by the first reference to PERSONS.
There can be multiple observations per value of HOUSEHOLD in the data set. At the beginning of each BY group, the second reference to PERSONS in the MERGE statement causes the DATA step to load into the Program Data Vector (PDV) the single observation in PERSONS that corresponds to the reference person. The values of the variables that are contributed from the second reference to PERSONS remain in the PDV throughout the remainder of processing for the BY group. These values for the person higher in the hierarchy are output with each observation in the BY group.
For a description of how SAS retains a variable value across iterations of a DATA step, see "Understanding How the DATA Step Adds Values from an Observation in One Data Set to All Observations in Another" in Example 5.1 in the "A Closer Look" section.
Assume prior to the DATA step, data set PERSONS is sorted by HOUSEHOLD, and within the values of HOUSEHOLD, by the values of PERSON. It is necessary to sort or index PERSONS by HOUSEHOLD so that the match-merging can be performed. Additionally, sorting by PERSON causes the observations to be in the same order as the rows in the table that was created by PROC SQL in the main example.
Create data set HOUSEHOLD_SURVEY. Merge PERSONS with itself. From the second reference to PERSONS, select only the rows for the reference person. Rename variables that were extracted from the second reference to PERSONS so that they do not overwrite the unique values for each person as extracted from the first reference to PERSONS. Drop this unneeded variable so that its values do not replace the values that were extracted from the first reference to PERSONS. Match data set PERSONS to itself by the values of HOUSEHOLD.
Compute the age of each member in the household. Using the date of birth for the reference person, which is retrieved from the second reference to PERSONS, compute the age of the reference person so that this constant value within the household is added to each observation.
data household_survey; merge persons persons(where=(relationship='reference') rename=(education=reference_educ work_status=reference_work dob=reference_dob gender=reference_gender) drop=person); by household; drop reference_dob; age=floor((intck('month',dob,'15jun2010'd) - (day('15jun2010'd) < day(dob)))/12); reference_age= floor( (intck('month',reference_dob,'15jun2010'd) - (day('15jun2010'd) < day(reference_dob)))/12); run;
52.15.137.91