Chapter 3. Combining Data Sets Horizontally: Match-Merging Data Sets by Value

Example 3.1 Merging Data Sets by a Common Variable 33

Example 3.2 Merging Observations from Multiple Data Sets by a Common Variable 36

Example 3.3 Combining Observations When Variable Values Do Not Match Exactly 40

Example 3.4 Combining Observations by the Formatted Value of a Variable 42

Example 3.5 Combining Multiple Tables When the Matching Column Has Different Attributes 45

Example 3.6 Combining Rows When There Is No Common Column 49

Example 3.7 Matching Observations Randomly 52

Example 3.8 Combining Multiple Data Sets without a Variable Common to All the Data Sets 55

Example 3.9 Generating Every Combination of Rows (Cartesian Product) between Tables 58

Example 3.10 Generating Every Combination of Rows between Tables Based on a Common Column 61

Example 3.11 Generating Every Combination of Observations between Data Sets Based on a Common Variable When an Index Is Available 65

Example 3.12 Combining and Collapsing Observations Based on a Common Variable 72

Example 3.13 Combining and Collapsing Observations Based on a Common Variable When the Transaction Data Set Is Indexed 76

This chapter presents ways to combine your data sets horizontally by aligning observations side-by-side or by overlaying observations. The examples match observations from the input data sets either by the values of variables in common or by programmatically aligning observations when the data sets do not have variables in common.

DATA steps, PROC SQL, and hash objects in DATA steps can match data sets and tables, and they can combine two or more data sets at a time.

Typically when combining data sets with the DATA step, these data sets are either sorted or indexed by the values of the matching variables. You do not have to sort or index your tables before combining them with PROC SQL, but you might want to so that the step executes more efficiently.

When developing your code to combine data sets horizontally, you need to understand the relationships among your input data sets. Relationships among multiple sources of input data exist when each source contains common data, either at the physical or logical level. For example, employee data and department data could be related through an employee ID variable that shares common values. Another data set could contain numeric sequence numbers whose values logically relate it to a separate data set by observation number.

Once data relationships exist, they fall into one of four categories:

  • one-to-one

  • one-to-many

  • many-to-one

  • many-to-many

All related data fall into one of these categories. You must be able to identify the existing relationships in your data because this knowledge is essential to understanding how to process your input data to produce desired results.

For an illustration of these relationships, see Chapter 1.

You can also sometimes choose to match your data sets by processing them either sequentially or directly. Programming statements in a DATA step enable you to specify an access method. PROC SQL decides for you the most efficient method to use on your code and data sets. This chapter includes examples of both access methods.

  • Sequential access means that SAS accesses the observations in your data set in the order in which they appear in the physical file. This is achieved with the SET, MERGE, MODIFY, and UPDATE statements in the DATA step.

  • Direct access means that SAS goes straight to an observation in a SAS data set without processing each observation that precedes it. This is achieved with the POINT= option in either the SET or MODIFY statement.

Computer resources such as CPU time, I/O, and disk storage can be conserved based on the access method that you choose.

For more information about sequential access and direct access, see Chapter 1.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.220.82.93