Example 3.1 Merging Data Sets by a Common Variable 33
Example 3.2 Merging Observations from Multiple Data Sets by a Common Variable 36
Example 3.3 Combining Observations When Variable Values Do Not Match Exactly 40
Example 3.4 Combining Observations by the Formatted Value of a Variable 42
Example 3.5 Combining Multiple Tables When the Matching Column Has Different Attributes 45
Example 3.6 Combining Rows When There Is No Common Column 49
Example 3.7 Matching Observations Randomly 52
Example 3.8 Combining Multiple Data Sets without a Variable Common to All the Data Sets 55
Example 3.9 Generating Every Combination of Rows (Cartesian Product) between Tables 58
Example 3.10 Generating Every Combination of Rows between Tables Based on a Common Column 61
Example 3.12 Combining and Collapsing Observations Based on a Common Variable 72
This chapter presents ways to combine your data sets horizontally by aligning observations side-by-side or by overlaying observations. The examples match observations from the input data sets either by the values of variables in common or by programmatically aligning observations when the data sets do not have variables in common.
DATA steps, PROC SQL, and hash objects in DATA steps can match data sets and tables, and they can combine two or more data sets at a time.
Typically when combining data sets with the DATA step, these data sets are either sorted or indexed by the values of the matching variables. You do not have to sort or index your tables before combining them with PROC SQL, but you might want to so that the step executes more efficiently.
When developing your code to combine data sets horizontally, you need to understand the relationships among your input data sets. Relationships among multiple sources of input data exist when each source contains common data, either at the physical or logical level. For example, employee data and department data could be related through an employee ID variable that shares common values. Another data set could contain numeric sequence numbers whose values logically relate it to a separate data set by observation number.
Once data relationships exist, they fall into one of four categories:
one-to-one
one-to-many
many-to-one
many-to-many
All related data fall into one of these categories. You must be able to identify the existing relationships in your data because this knowledge is essential to understanding how to process your input data to produce desired results.
For an illustration of these relationships, see Chapter 1.
You can also sometimes choose to match your data sets by processing them either sequentially or directly. Programming statements in a DATA step enable you to specify an access method. PROC SQL decides for you the most efficient method to use on your code and data sets. This chapter includes examples of both access methods.
Sequential access means that SAS accesses the observations in your data set in the order in which they appear in the physical file. This is achieved with the SET, MERGE, MODIFY, and UPDATE statements in the DATA step.
Direct access means that SAS goes straight to an observation in a SAS data set without processing each observation that precedes it. This is achieved with the POINT= option in either the SET or MODIFY statement.
Computer resources such as CPU time, I/O, and disk storage can be conserved based on the access method that you choose.
For more information about sequential access and direct access, see Chapter 1.
3.21.46.92