Chapter 2: Table-Level Operations

In this chapter, we will discuss the hash table operations pertaining to the table as a whole entity. It means that these operations are concerned not with particular items in the table but with its existence, properties, and all its items at once.

2.2 CREATE Operation

The Create operation creates an operable initialized instance of a hash object (and the associated hash table). It involves the following compile time and run-time stages:

● Compile time:

1. Declare a hash object with a given name.

2. Define a PDV variable of type hash with the same name.

● Run time:

1. Create a hash object instance.

2. Generate a unique value of type hash to identify the instance.

3. Assign this value to the PDV variable defined above.

4. Define the key portion hash variables.

5. Define the data portion hash variables.

6. Validate the syntax used at stages 1, 4, 5.

7. Check that for each defined hash variable, a host variable with the same name exists in the PDV (i.e., check for parameter type matching).

8. Initialize the hash object instance.

The following code snippet is a simple example of implementing this plan:

Program 2.1 Chapter 2 Create Operation Template.sas

data _null_ ;

declare hash H ;

H = _new_ hash() ;

H.defineKey ("K") ;

H.defineData ("D") ;

H.defineDone () ;

stop ;

K = . ;

D = "" ;

run ;

Using this step as a base template, let us take a look at the different stages of the Create operation.

2.2.1 Declaring a Hash Object

The hash tool for declaring a hash object is the DECLARE statement. Despite its outward simplicity, it would benefit us to dwell on a number of points:

● The DECLARE statement can be abbreviated as DCL.

● DECLARE|DCL and HASH are keywords, and so the DATA step compiler checks them for syntactic validity.

● The keyword HASH must be followed by a name given to the object. In the example above, the object is named H. However, from the standpoint of syntax, it can be any valid SAS variable name within the constraints of the VALIDVARNAME system option currently in effect.

● In the form shown above (i.e., with no parentheses after the object name - we will dwell more on it later), the DECLARE|DCL statement is a compile-time only directive. It means that, at run time, the statement is ignored. (In this sense, it is similar, for example, to the ARRAY statement.)

● The name assigned to the object - in the step above, H - is the name of a PDV variable the compiler creates when it parses the DECLARE statement. Its future purpose is to hold a non-scalar value of type hash identifying a concrete hash object instance. Correspondingly, variable H is a non-scalar variable of type hash.

● Since variable H is non-scalar, i.e., not numeric or character, it cannot coexist in the same DATA step with a numeric or character variable with the same name.

● Likewise, it cannot coexist with a SAS array with the same name. Although from the standpoint of the compiler, the array name is also a reference to a non-scalar variable, its data type is different from type hash.

● Hence, if a numeric or character variable or array with this name is already present in the PDV, it cannot be used to name a hash object. For the same reason, this variable cannot be assigned to a scalar variable, nor can a scalar variable be assigned to it. Failure to observe these safety rules will create a data type conflict and generate a compilation error. In short, an attempt to use such a variable as if it were numeric or character may result in an error message that a scalar cannot be converted to an object of type hash (or vice versa).

● The best practice in this regard is to ensure that the compiler does not see the variable name used to name the object anywhere in the step other than as a valid reference to the hash object it denotes.

● In particular, it means that the name of any scalar (numeric or character) variable or of an array cannot be the same as a hash object name, and vice versa.

● With respect to the scalar variables, this rule encompasses not only the variables overtly referenced in the DATA step by name, but any variable present in the PDV. Thus, it includes the variable coming from any data set or view referenced in the SET, MERGE, UPDATE, or MODIFY statements.

It must be also mentioned that a given DATA step can contain only a single DECLARE statement with a given hash object name. Repeating the DECLARE statement with the same object name, such as:

data _null_ ;

dcl hash H1 ;

dcl hash H2 ; *No conflict here;

dcl hash H1 ; *Compile-time error: H1 is already defined;

run ;

results in a compile-time error and prevents the step from being executed. An error message is written to the log indicating that variable H1 has been already defined, as a non-scalar variable cannot be declared more than once. (Not coincidentally, the compiler reacts the same way and with the same error message to an attempt of defining an array with the same name twice.)

2.2.2 Creating a Hash Object Instance

Before hash table variables can be defined, it is not enough to merely declare the hash object. In addition, an instance of the declared object must be created. In the step above, it is done by using the _NEW_ operator:

H = _new_ hash() ;

First, let us make a few observations about the syntax of this statement:

● _NEW_ and HASH are keywords, and so their syntax ought to be strictly observed.

● The receiving variable named on the left side of the assignment (in this case, H) must have been already compiled as type hash. In other words, this statement must be preceded by a valid DECLARE statement defining H as type hash.

● It cannot be done in reverse. If the compiler sees the statement with the _NEW_ operator first, it will, by default, set variable H as numeric (i.e., scalar), and the DECLARE statement trying to define H as type hash will create a data type conflict.

● The blank space between the parentheses following the HASH keyword is intended for argument tags. They may have a profound impact on the hash object operations and are discussed elsewhere in the book. When they are left out, as above, they are assigned default values.

Now let us see what actions this deceptively simple statement implies:

● Create a new instance of hash object H.

● Generate a distinct non-scalar value of type hash to identify the newly created hash object instance. One way to think of it is of a pointer to the location in memory where the object instance resides.

● Assign this pointer value to PDV variable H of type hash.

The PDV value of H is what the program uses to identify the concrete hash object instance when the object is referenced by name - in this case, H. Any hash tool, such as a method or attribute, referenced by H, works on the instance pointed at by the current PDV value of H. In other words, the hash object instance whose identifying value is currently stored in variable H is active. Yet another way to express it is to say that the current PDV value of H surfaces the hash object instance it identifies.

Understanding this concept is quite important because, as we will see later in the book, for a given object name (such as H) defined in the DECLARE statement, more than one hash object instance can be created and used. In this case, we need to know how to tell the program to use a concrete instance at a given execution point according to program logic; and this is done by making the instance we want the program to work on active (or, which is the same, by surfacing it).

Another takeaway from this section is that the program is intended to use only a single instance of any given hash object, and measures must be taken to prevent the statement that creates a new instance from being executed more than once. We will discuss these measures later in this chapter.

2.2.3 Combining Declaration and Instantiation

The two statements that declare a hash object and then create a new instance of it:

declare hash H ;

H = _new_ hash() ;

can be combined into a single statement:

declare hash H() ;

Note that the only syntactic addition to the DECLARE statement used before is the pair of parentheses after the hash object name. In a single line of code, this statement:

At compile time:

1. Declares a hash object named H.

2. Creates PDV variable H of type hash to hold object instance identifying values.

At run time:

1. Creates a new instance of hash object H.

2. Generates a distinct type hash value of H to identify it.

3. Assigns this value to variable H, thereby making the instance active.

Note that since the combined DECLARE statement is not an overt assignment statement, it may seem that it does not assign anything to variable H. However, this is not true: It does assign, behind-the-scenes, the newly created value identifying the instant to variable H, just like the overt assignment statement with the _NEW_ operator.

Also note that the space between the parentheses following the object name can be filled with argument tags and their arguments in exactly the same manner as when the _NEW_ operator is used in a separate statement.

The compound DECLARE statement, for most intents and purposes, is equivalent to the two separate statements it replaces. However, combining the compile time part with the run-time part in a single statement results in a certain loss of flexibility. In particular, if there is a need to create and use two instances of hash object H at different points in the DATA step program, it is perfectly okay to code:

dcl hash H ;

H = _new_ hash() ; *Create instance #1;

...

H = _new_ hash() ; *Create instance #2;

However, the same purpose cannot be achieved by coding:

dcl hash H() ;

...

dcl hash H() ; *Compile time error;

because the second statement will result in a compile time error. Seeing the first statement, the compiler interprets its compile time declarative part as a directive to define variable H of type hash and creates it. But when the compiler sees the same declarative part in the second statement, it stops compilation with an error, since it cannot define a non-scalar variable more than once, and generates an error message stating that variable H is already defined.

2.2.4 Defining Hash Table Variables

Now an instance of hash object H has been created, and the PDV value of variable H has been assigned a unique value that identifies the instance as active (i.e., surfaced). However, at this point, no hash table associated with the instance is yet defined. To do so, at the next stage of the Create operation we need to provide the object constructor with the names of hash variables for the key and data portions of the table.

This is done by calling the DEFINEKEY and DEFINEDATA methods, respectively. In the sample DATA step above, the key portion is defined with a single variable K, and the data portion - with a single variable D - as follows:

H.defineKey ("K") ;

H.defineData ("D") ;

Any method call generates a numeric return code indicating if the call has succeeded (if the return code is zero) or failed (if it is not zero). In this book, the method call style shown above is termed unassigned since its return code is not assigned to a separate variable. The other style, termed assigned, captures the return code by assigning it to a numeric variable, so that it can be examined later. For example, in the following assigned calls (equivalent to the unassigned calls above) the return codes are captured in variable RC:

rc = H.defineKey ("K") ;

rc = H.defineData ("D") ;

As a side note, in this case calling the methods assigned or unassigned is merely a matter of style preference because these methods always succeed (i.e., return a zero code) if their syntax is formally correct. If there is anything wrong with them otherwise (e.g., if a variable has an invalid name, a conflict with another variable, etc.), the error will be caught later on at the next stage of the Create operation.

Let us now look at some details associated with the DEFINEKEY and DEFINEDATA methods:

● The calls do not have to follow the order shown above, i.e., either the DEFINEKEY method or the DEFINEDATA method can be called first or vice versa.

● In these method calls, the hash variable names are defined using character literal constants - in other words, quoted strings with a fixed value. As a side note, single and double quotes are equally valid, and their choice is a matter of the style one prefers - unless, of course, the content between them is a macro variable reference (in which case double quotes must be used to resolve it).

● Due to the dynamic nature of the hash object, the method argument defining a hash variable does not have to be just a character literal. In fact, it can be any DATA step character expression (unquoted), as long as it resolves to the character value representing the name of the variable we want to define (such as, in this case, "K" or "D". This is a valuable feature further expounded upon in this chapter and also used in other chapters to create dynamic code.

● The hash variables being defined inherit all their attributes, such as the length, data type, format, etc., from the like-named host variables present in the PDV at the time of call. In our example, they are variables K and D placed into the PDV during the DATA step compilation phase as a result of parsing the assignment statements following the STOP statement.

2.2.5 Omitting the DEFINEDATA Method

It is possible to omit the DEFINEDATA method and call the DEFINEKEY method only. At times, it can be useful to do so; however, the user ought to be mindful of the following:

● It does not mean that the table will have only the key portion and the data portion will be absent. As already noted, a SAS hash table cannot exist without both.

● So, if the table definition includes a DEFINEKEY call only, all the variables defined by it to the key portion will be automatically included in the data portion as well.

● It should be kept in mind, especially when the key portion is defined with numerous and/or long variables, that including them automatically in the data portion by omitting a DEFINEDATA method call automatically doubles the overall hash entry length and increases the hash table memory footprint.

● Though it is possible to call DEFINEKEY without calling DEFINEDATA, the reverse is not true! At least one valid DEFINEKEY call must be always included in the definition. Failure to do so will be detected by the DEFINEDONE method call and result in a run-time error with the log message that the keys are "uninitialized".

For example, suppose that in our sample step above we omitted the DEFINEDATA call and included only

H.defineKey ("K") ;

Then it would be exactly equivalent to calling both methods with variable K as an argument:

H.defineKey ("K") ;

H.defineData("K") ;

It means that now both the key portion and the data portion contain variable K. And if there were more than one variable defined by a stand-alone DEFINEKEY call, all of them, in the same order, would end up in the data portion as well.

2.2.6 Wrapping Up the Create Operation

The last stage of the Create operation is executed by calling the DEFINEDONE method, either unassigned or assigned:

H.defineDone() ; /* unassigned call */

rc = H.defineDone() ; /* assigned call */

The DEFINEDONE method is responsible for the following actions:

● Validate the internal syntax of the DEFINEDATA and DEFINEKEY method calls.

● Make sure that host variables with the same exact names exist in the PDV.

● If either condition above is not satisfied, the DEFINEDONE call will fail, return a non-zero code, generate an error message in the log, and stop the DATA step.

● Otherwise, initialize the hash object instance.

2.2.7 PDV Host Variables and Parameter Type Matching

As it has been repeatedly stated, for every variable name defined in the hash table entry, a like-named variable must exist in the PDV. A SAS programmer new to hash object programming might at first get the impression that when the DATA step compiler parses hash variable definition statements, such as:

H.definekey ("K") ;

H.defineData ("D") ;

it will infer from them the type of K and D as numeric and place both into the PDV. This is what the compiler does, for example, when it parses code unrelated to the hash object and encounters a new variable name as part of a SAS expression.

Yet, with respect to the hash object this is not the case at all. While parsing code related to the hash object, the compiler performs only two actions:

1. Validates the syntax. For example, if in the DEFINEKEY call above, the period were missing, or the method name were mistyped, or the parentheses and quotes were unbalanced, etc., a compilation error would occur.

2. Validates the reference variable. The compiler checks if the variable by which the object is referenced - such as variable H above - has been already properly defined as a variable of type hash. Thus, if variable H of type hash were not defined in a valid DECLARE statement before H is referenced, it would also lead to a compilation error.

As long as these two items check out, the compiler's job is done as far as hash object code is concerned. It does not see the hash variable names passed to the DEFINEDONE and DEFINEDATA methods, does not check if they are valid, and therefore does not create the corresponding variables in the PDV.

This is why a means must be provided for the compiler to create variable named K and D (in this case) in the PDV, replete with their attributes, at compile time. This is the purpose of the two assignment statements coded last in the sample DATA step after the STOP statement:

K = . ;

D = "" ;

To wit, their goal is to place host variables named K and D, corresponding to the hash variables also named K and D, into the PDV and make them available for the hash object at run time later. The procedure can be described as follows:

● During the compile phase, the DATA step compiler parses the assignment statements.

● It infers from the literals assigned to K and D that a numeric variable K and a character variable D with length 1 be placed into the PDV.

● This way, by the time the DEFINEDONE method is called during the execution phase, variables K and D are already in the PDV, along with their respective attributes.

● Seeing that hash variable K has been defined in the key portion, the DEFINEDONE method searches the PDV for a host variable with the name K. Since it is there, hash variable K passes the check.

● The same actions are then performed with respect to the hash variable D and host variable D.

● If both K and D pass muster, they are initialized for use in the hash table. This wraps up the Create operation.

Pre-defining host variables in the PDV at compile time, so that the hash object can use them later at run time, is also termed parameter type matching. Note that placing parameter type matching statements (in this case, the assignments) after the STOP statement is optional; i.e., they can appear anywhere in the DATA step. Above, it is done primarily to highlight the temporal separation between the compilation phase (during which the host variables are created in the PDV) and the execution phase (during which they are relied upon by the hash object operations).

From the standpoint of the Create operation, the location of parameter type matching statements in the step is irrelevant. If parameter type matching is their only purpose - that is, they are not intended to be executed at run time - the program has to be structured accordingly. Above, this is done by placing them after the STOP statement. Another way is to place them in a block of code preceded by an IF condition that is always false, such as:

IF 0 then do ;

<parameter type matching code>

end ;

2.2.8 Other Ways of Hard-Coded Parameter Type Matching

Needless to say, it does not necessarily have to be done via assignment statements. Any valid block of code letting the compiler populate the PDV with variables with the same names as the defined hash variable names will also work. For example, instead of using the assignment statements above, a LENGTH statement could be used to achieve exactly the same parameter type matching effect. The only purpose of the MISSING call routine below is to avoid the pesky "uninitialized" warning in the log if one of the variables is not valued.

length K 8 D $ 1 ;

call missing (K, D) ;

Or, alternatively, the RETAIN statement could be used as well, with the same result:

retain K . D "" ;

2.2.9 Dynamic Parameter Type Matching via File Reference

The parameter type matching techniques shown above suffer from the same basic flaw: They are essentially hard-coded. It is okay if the DATA step in question is where the variables defined by these techniques are created in the first place. However, more often than not, the values with which a hash table is eventually populated come from reading a SAS data file. In this case, hard-coding presents a problem, and here is why.

Suppose that we have a data set containing variables K and D. For example:

Program 2.2 Chapter 2 HashValues Sample Data set.sas

data hashValues ;

input K D:$1. ;

cards ;

1 A

2 B

3 C

run ;

Suppose further that we want to use K as the hash table key and D as its data portion variable - for example, suppose we want to insert the (K,D) value pairs from file hashValues into the table later on. If we decided to use hard-coding for parameter type matching, we would first need to ensure that the data types and lengths of the hard-coded variables match those in the file. In turn, it means that we would need to find out what those attributes are by doing, for example, any of the following:

● Locate the original code used to create the file (it may not be even available).

● Query dictionary.columns or sashelp.vcolumn.

● Run the CONTENTS procedure and look at the output.

● Take a look at the file properties via the SAS viewer or another interface.

Doing any of those things runs counter to the principles of robust automated programming. Worse still, after finding the attributes of K and D in the file, they would need to be hard-coded in the program correctly, so as to avoid conflicts with the like-named variables in the file. Such practice is quite problematic for two reasons:

1. Errare humanum est ("to err is human").

2. The more variables are involved in the process, the more laborious it gets and the more ominous the truth encapsulated by the adage above becomes.

Therefore, it is much less labor-intensive and much less error-prone just to let the compiler itself read the descriptor of the data set in question (in this case, hashValues) and place the needed variables along with their attributes into the PDV. Moreover, it is easy to do because the compiler performs this action anytime the name of the data set in question is referenced by a file-reading statement, such as SET, MERGE, UPDATE, or MODIFY. Given that, there are three distinct cases:

1. One of these statements referencing the file in question is already present somewhere in the DATA step, and the requisite variables are kept, i.e., not eliminated by the KEEP= or DROP= data set option.

2. Same as above, but the requisite variables are not kept.

3. None of these statements referencing the file in question is present anywhere in the step.

In case #1, parameter type matching occurs automatically, and so no other action to achieve it is required. In case #2, all that is required to achieve parameter type matching is to recode the DROP= or KEEP= variable list (or omit it altogether) in order to ensure that the requisite variables are kept. Once it is done, this case becomes no different from case #1. We will delve more into these two cases later after concentrating on case #3.

2.2.10 Parameter Type Matching by Forced File Reference

In case #3, the simplest way to attain the goal is to include a non-executable SET statement referencing the file in question (and keeping the requisite variables) anywhere in the DATA step. Making it non-executable ensures that the file is seen only at compile time, and no data is read from it at run time (thus preventing the statement from possibly compromising the rest of the program).

If the DATA step contains the unconditional STOP statement (as in the step above), any statement following it is non-executable. Hence, in this case the parameter type matching SET statement can be simply included after STOP, e.g.:

stop ;

...

SET hashValues (keep = K D) ;

...

run ;

This parameter type matching technique operates as follows:

● Since SET is coded after the STOP statement, it reads no actual data from data set hashValues at run time.

● However, at compile time the compiler reads the descriptor of the data set and places variables K and D and their attributes into the PDV.

● It initializes PDV variables K and D to the missing values of the appropriate data types, thus avoiding "uninitialized" warnings in the SAS log.

If the step does not already contain an unconditional STOP statement, the same parameter matching effect can be achieved by coding, somewhere in the step, the SET statement preceded by an obviously false condition to make it non-executable. For example:

IF 0 then SET hashValues (keep = K D) ;

Because the condition above is always false, it prevents the SET statement from being executed at run time, yet still exposes it to the compiler at compile time. Its actions are exactly the same as of the SET statement placed after STOP.

In the ensuing chapters, this robust parameter type matching method will be used widely in both variations. Note that the technique allows for a number of modifications depending on the situation and need. For example, if variables K and D were not in the same file but in two different files - say, hashValuesK and hashValuesD, respectively - the issue can be addressed simply by recoding the SET statement as their concatenation, i.e.:

IF 0 then SET hashValuesK(keep = K) hashValuesD(keep = D) ;

Incidentally, the MERGE statement can be used instead of SET to the same effect, regardless of whether it references one file or more.

2.2.11 Parameter Type Matching by Default File Reference

Under a number of realistic scenarios, no special measures to ensure parameter type matching are needed at all. This happens in two cases.

1. When the same file, from which we want the compiler to obtain the host variables, is already referenced explicitly in order to read the actual data from it. For example, consider this variation of our sample DATA step:

data _null_ ;

dcl hash H()

H.defineKey ("K") ;

H.defineData ("D") ;

H.defineDone () ;

do until (lr) ;

SET hashValues (keep = K D) end = lr ;

<e.g.: code to insert(K,D) values into table H>

end ;

<more code>

stop ;

run ;

Since the compiler sees the SET statement referencing hashValues, it is unnecessary to reference it again elsewhere in the step for the purpose of parameter type matching. This is because from the standpoint of the compiler reading the data set descriptor, it does not matter whether the SET statement is run-time executable or not.

2. When the host variables with the same names as the intended hash variables occur naturally as part of the DATA step program, and so the compiler places them and their attributes into the PDV during the compilation phase. Consider, for instance, the following snippet:

data _null_ ;

dcl hash H() ;

H.defineKey ("K") ;

H.defineData ("D") ;

H.defineDone () ; ❷

do K = 1 to length ("ABCDEF") ; ❶

D = char ("ABCDEF", K) ; ❷

<code to insert current (K,D) pair into table H>

end ;

<more code>

stop ;

run ;

❶ The compiler places variable K into the PDV as numeric as an effect of parsing the DO statement where K is used as the loop index.

❷ Then it parses the next statement and creates host variable D as character 1 because this is the type and length the function CHAR returns.

❸ Thus, by the time program control hits the DEFINEDONE call at run time, the host variables for hash variables K and D are already in the PDV with the attributes required, and so no extra measures are needed to make it happen.

2.2.12 Defining Multiple Hash Variables

So far, we have dealt with the case of a single hash variable in the key and data portion. However, in most real-life situations, the key portion or data portion or both comprise more than one variable. Therefore, we need a way to tell the DEFINEKEY and DEFINEDATA methods how to include them all. For example, consider the variables in data set Bizarro.Player_candidates with the following attributes:

Figure 2.1 Player_candidates Data Set Metadata Sample

Now suppose that we need to create a hash table H with composite key (Player_ID,Team_SK) and the data portion containing the rest of the variables - for example, intending to use table H as a lookup table downstream. The simplest (but not necessarily the best) way of doing it is to pass comma-separated lists of the respective variable names as character literals to the DEFINEKEY and DEFINEDATA methods as arguments:

Program 2.3 Chapter 2 Define Multiple Hash Variables.sas

data _null_ ;

dcl hash H() ;

H.defineKey ("Player_ID","Team_SK") ;

H.defineData("First_name","Last_name","Position_code") ;

H.defineDone() ;

stop ;

set bizarro.Player_candidates ;

run ;

The SET statement facilitates parameter type matching by letting the compiler examine the descriptor of Bizarro.Player_candidates and place all its variables in the PDV. To reiterate, the order of the DEFINEKEY and DEFINEDATA calls is irrelevant, and they can be swapped.

To date, this technique of defining multiple hash variables as hard-coded variable lists has been used predominantly - in particular because the SAS documentation neither offers or suggests any other way. However, it can also be done differently. Namely, each variable can be defined using its own method call. For instance, the two calls above can be replaced, without changing the final result whatsoever, with the following series of individual calls, each comprising a single variable name:

H.defineKey ("Player_ID") ;

H.defineKey ("Team_SK") ;

H.defineData("First_name") ;

H.defineData("Last_name") ;

H.defineData("Position_code") ;

Moreover, delimited-list calls and individual calls can be combined without contradicting each other. For example, to define the data portion, we can list First_name and Last_name in one call and leave Position_code for another:

H.defineData("First_name", "Last_name") ;

H.defineData("Position_code") ;

The calls, either individual or combined, can be issued in any order: It will only alter the sequence in which the variables are placed into the corresponding portions of the hash entry.

2.2.13 Defining Hash Variables as Non-Literal Expressions

The ability to define hash variables one at a time shown above raises the question: Why would it make sense to define them one at a time in separate method calls if it can be done by listing them in a single call? The answer is that it makes no sense as long as the variable names are hard-coded as character literal constants, i.e., fixed quoted values, such as "Player_ID", "Team_SK", etc.

However, it starts making sense as soon as we realize that, generally speaking, any argument to the DEFINEKEY or DEFINEDATA method represents a generic SAS character expression. A character literal constant is merely the most basic character expression (and it is also the most static since it represents a fixed value).

Suppose that at the time of a DEFINEKEY or DEFINEDATA call, we have a PDV character variable _hVarName of length $32 valued with the name of a hash variable we need to define. For example, imagine that we want to call DEFINEKEY to define hash variable Player_ID; and somewhere in the step before the method call we have the statements:

retain _kVarList "Player_ID Team_SK" ;

length _hVarName $ 32 ;

_hVarName = scan(_kVarList,1) ;

It means that _hVarName is populated with the value "Player_ID". But _hVarName, being a character variable, is a character expression, too. Therefore, in this case, we can pass it to the method, unquoted, instead of hard-coding a literal constant. That is, instead of coding:

H.defineKey ("Player_ID") ;

we can code:

H.defineKey (_hVarName) ;

Note that though in the above snippet _hVarName is populated with "Player_ID" via the SCAN function, the concrete way by which it receives the value is irrelevant. For example, as we will see later on, a variable similar to _hVarName can come from a data set populated with the names of the hash variables to be defined.

Developing the idea of using non-literal expressions further, let us observe that the SCAN function expression is a character expression in its own right. Hence, instead of creating an intermediate variable (such as _hVarName), the entire expression can be passed to the DEFINEKEY method call directly:

H.defineKey (scan(_kVarList,1)) ;

In sum, any valid character expression can be passed to the DEFINEKEY and DEFINEDATA methods as arguments as long as it resolves to the value representing the name of the hash variable we need to define. Needless to say, the value must be a valid SAS variable name and have a like-named counterpart host variable in the PDV.

2.2.14 Defining Hash Variables Dynamically One at a Time

Now it should be easy to understand why using non-literal expressions to define hash variables one at a time can actually shorten a program and make it tidier. Suppose that we have a list of numeric D1-D100 to be defined in the data portion of table H. Passing the variable names as character literals to the DEFINEDATA method, we would have to code:

h.defineData ("D1","D2",...,"D99","D100") ;

Coding this kind of argument list is tedious, messy, and error-prone - it's easy to accidentally mistype a name or miss a quote or a comma. A more astutely lazy programmer could write a macro or a separate preliminary DATA step to assemble the requisite list with all the requisite quotes and commas and pass it to the method as a macro variable reference. For example:

data _null_ ;

length arg $ 32767 ;

do x = 1 to 100 ;

arg = catx (",", arg, quote (cats ("D", x))) ;

end ;

call symputx ("arg", arg) ;

run ;

And then downstream in the DATA step where DEFINEDATA is called:

H.defineData (&arg) ;

However, neither jumping through these sorts of hoops nor hard-coding is necessary if we take into account the dynamic character expression nature of the DEFINEDATA arguments. Instead, we can simply call the method repeatedly in a DO loop for each hash variable one at a time in the same DATA step where DEFINEDATA is called:

array DD D1-D100 ;

do over DD ;

H.defineData(put(vname(DD),$32.)) ;

end ;

Above, in each iteration of the loop, the character expression passed to DEFINEDATA automatically resolves to the name of the individual hash variable inferred from the corresponding array element and passes it to the method call. The final result of adding one hash data variable to the data portion one at a time in this manner will be exactly identical to hard-coding (if done correctly) or resolving the macro variable reference. Of course, the same is true if we should need to add a long list of hash variables to the key portion of H by calling the DEFINEKEY method.

2.2.15 Defining Hash Variables Using Metadata

As noted above, most of the time the key and data values loaded into a hash table come from variables in a SAS data file. In such cases, programming logic almost always dictates that the hash variables be defined with the same names as the names of the data set variables the key and data values come from. The names of these variables are already stored in the dictionary table Dictionary.Columns or in the view Sashelp.Vcolumn in the $32 character variable Name. Since variable Name is just a case of a character expression, using its values to define the names of our hash variables we need to do only the following:

1. Read the dictionary table or view and filter it to suit our needs.

2. For each row read from the filtered table or view, call the DEFINEKEY and/or DEFINEDATA method (depending on whether the respective value of Name is designated for the key or data portion) and pass Name to the method call.

To illustrate the concept, let us suppose that we intend, down the line, to load data from data set Bizarro.Player_candidates into hash table H. Correspondingly, we want to define its hash variables as named after the variables in the data set. More specifically, we want to:

● Define the hash variables in the key portion using the data set variable names ending in "_ID" and "_SK" (for example, we may know that together, the variables with such suffixes form a unique composite key).

● Define the hash variables in the data portion using the names of the rest of the variables in the data set.

A primitive, static, and error-prone way of doing this is to eyeball the metadata related to the data set (for instance, in the SAS viewer) and then hard-code the arguments to the DEFINEKEY and DEFINEDATA method calls based on the findings. A more advanced, dynamic, and robust approach is to exploit the system dictionary tables as outlined above. The dictionary view Sashelp.Vcolumn makes it possible to define the requisite hash variables dynamically right within the DATA step where the rest of the Create operation is performed:

Program 2.4 Chapter 2 Define Hash Variables Selectively Metadata.sas

data _null_ ;

dcl hash H() ;

do until (lr) ; ❶

set sashelp.vcolumn (keep=memname libname Name) end=lr ;

where libname="BIZARRO" and memname="PLAYER_CANDIDATES" ; ❷

isKey = scan (upcase (Name), -1, "_") in ("ID","SK") ; ❸

if isKey then H.defineKey(Name) ; ❹

else H.defineData(Name) ;

end ;

H.defineDone () ; ❺

stop ;

set bizarro.Player_candidates ; ❻

run ;

The hash variable definition plan executed above (after the hash table is declared) is as follows:

❶ Use an explicit DO loop to read a subset of sashelp.vcolumn view one record at a time.

❷ Subset sashelp.vcolumn view to the rows related to data set Bizarro.Player_candidates only. Column Name read from it contains, as its values, the names to be defined to either the key portion or data portion.

❸ If Name ends in "_ID" or "_SK", set Boolean variable isKey to 1; else set it to 0.

❹ If isKey=1, use expression Name (a variable is an expression) as the argument to the DEFINEKEY call. Otherwise, use it as the argument to the DEFINEDATA call.

❺ At this point, all the rows from sashelp.vcolumn subset have been read, and for each value of Name coming from them, either DEFINEKEY or DEFINEDONE has been called. Call DEFINEDONE to wrap up the Create operation.

❻ Make sure, at compile time, that all the host variables corresponding to the hash variables defined to table H reside in the PDV.

Incidentally, the IF-THEN-ELSE block ➍ above can be replaced, with the same effect, by a single statement:

RC = ifN (isKey, H.defineKey(Name), H.defineData(Name)) ;

Due to the way the IFN functions works, if isKey = 1 (i.e., evaluates as true), the DEFINEKEY method is called; otherwise if isKey=0 (i.e., evaluates false), the DEFINEDATA method is called. Note that the assignment statement here is a dummy statement, i.e., it is used merely as a vehicle to execute the IFN function. Respectively, RC is used merely as a dummy variable to make the assignment statement valid. (As explained earlier, capturing the return code from these method calls is unnecessary. That said, by way of coding, RC variable in this case will indeed receive the return code from whichever method is called - which is why it is named RC in the first place.)

Alternatively, instead of using sashelp.vcolumn view directly in the DATA step, the system table Dictionary.Columns could be used in a preliminary SQL step to create a subset related to Bizarro.Player_Candidates and containing only the fields Name and isKey. Though doing so requires an extra step, it also offers certain advantages, such as a cleaner log, better performance, and the ability to use the LIKE operator to create variable isKey.

2.2.16 Multiple Instances Issue

The following statement is a run time directive (unlike the DECLARE statement, which is a compile time directive):

H = _new_ hash() ;

It means that it is executed every time program control passes through it. Hence, if it is placed inside a loop, it will create a new instance of hash object H at every iteration. It will occur regardless of whether the loop is an explicit DO or the implied DATA step loop. Therefore, in the following step, the statement is executed twice:

data _null_ ;

dcl hash H ;

do i = 1 to 2 ;

H = _new_ hash() ;

end ;

*...rest of program;

run ;

At run time, the DCL statement is ignored, but the assignment statement is executed twice and thus creates two separate instances of hash object H. The same happens if program control passes through the assignment statement in the implied DATA step loop. For example:

data _null_ ;

set Bizarro.Player_candidates (obs=2) ;

dcl hash H ;

H = _new_ hash() ;

*...rest of program...;

run ;

It is no different if the compound DECLARE statement is used:

data _null_ ;

set Bizarro.Player_candidates (obs=2) ;

dcl hash H() ;

*...rest of program...;

run ;

In this case, the declarative part of the statement is ignored at run time; yet, the part that creates a new hash object instance is executed twice, and so two separate instances of H are created.

This behavior can be useful if the program intends to create and use multiple instances of the same hash object. However, most programs use only a single instant of every named hash object. In this case, this default behavior results in a number of undesirable side effects:

● More instances of the same hash object than needed are created.

● If an instance is not used, it needlessly consumes memory and other computer resources.

● Worse still, this overhead can be compounded if the unintended instances are numerous. For example, if in the step above, the input were not limited by the OBS=2, a separate instance of H would be created for every observation read in.

Moreover, if the loop contained the entire block of code representing the Create operation, every one of its run-time statements and method calls would be re-executed for each input observation and would thus add to the overhead:

data _null_ ;

dcl hash H() ;

H.definekey("Player_ID") ;

H.definedata("Position_code") ;

H.definedone() ;

set bizarro.Player_candidates ;

*...rest of program;

run ;

In this case, not only would another instance of H be needlessly created for each input observation, but the Create operation methods would be needlessly called just as many times.

Therefore, if we need only a single instance per hash object, measures must be taken to ensure that no more than one instance is created and acted upon. If multiple instances are needed, our DATA step program must make sure that each instance can be referenced. Section 2.2.18 suggests several such alternatives. Chapter 9 Hash of Hashes – Looping Thru SAS Hash Objects presents a use case for creating multiple instances with the same name.

2.2.17 Ensuring Single Instance Usage

The obvious way to ensure that only a single hash object instance is created and initialized is to ensure that program control passes through the Create operation statements only once. Generally speaking, there are two techniques to achieve it:

1. Execute them only on the condition of _N_=1.

2. Use the DO loop to take explicit control of reading the input.

These two approaches are exemplified in the exhibit below, where:

● Input file Player_candidates is a WORK library copy of file Bizarro.Player_candidates.

● Variable LR used with the END=LR option is initially automatically set to LR=0. The SET statement sets it to LR=1 when it reads the last input record.

● Note that the name "LR" is an abbreviation denoting the "last record". It is used in this context below and throughout the book.

Table 2.1 Ensuring a Single Hash Object Instance

For the coding style shown on the left:

● The condition _N_=1 prevents the program from executing the Create operation code block more than once by rendering it operable only in the first iteration of the implied DATA step loop.

● The step is stopped when, in the last iteration of the implied loop, the SET statement attempts to read from the empty buffer after the last record has been read.

● If more code is needed after the last input record has been processed, the condition IF LR=1 ensures that it is executed only once. Though it seems logical to code this block last (just before RUN), it is instead placed before SET. Doing so ensures that it is executed even if a conditional DELETE or subsetting IF statement coded after the SET statement should evaluate true on the last record.

For the coding style on the right:

● The Create operation code block is executed unconditionally.

● The file is processed by reading its records explicitly in a DO UNTIL loop terminated after the SET statement reading the last record makes it LR=1.

● If more code is needed after that, it is placed, unconditionally, after the DO loop.

● The STOP statement terminates the step. This way, all code is executed only during the first iteration of the implied loop since it never iterates again.

Both techniques have their preferred uses depending on the program logic and, to some extent, preferred programming style. In this book, both styles are exemplified, the choice depending on the circumstances.

The style shown on the left is suggested in the SAS documentation. However, the style on the right is more logically straightforward, especially if file post-processing is needed, and, to a degree, more efficient. Note that this style is a version of a technique commonly known in the SAS programming community as the DoW loop. Above, it is used to take explicit looping control over the entire input file. Another variant of it, also exemplified in this book, is used to take control over each BY group, one at a time, read from a sorted or grouped file.

2.2.18 Handling Multiple Instances

In the previous section, we dealt with the ways to ensure that only a single instance of a given named hash object is created and used when this is what the program needs. However, under different circumstances using multiple instances of the same object is not only desirable but advantageous in terms of flexibility and dynamic code. That raises a question: If more than one instance of the same hash object is created, how do we tell the program which one to use? To answer it, suppose that we have created two instances of hash object H, as in the following schematic DATA step:

data _null_ ;

dcl hash H ;

H = _new_ hash() ; *Create instance of H #1;

*...code block #1...;

H = _new_ hash() ; *Create instance of H #2;

*...code block #2...;

*...rest of program...;

run ;

Each time the same statement is executed, it creates a new instance. Hence, when it is called twice, as above, the following happens:

1. When it is executed for the first time, it creates a new instance (#1) and makes it active by storing its identifying value in PDV variable H. Thus, any reference to H in the code block #1 will cause the program to work on instance #1.

2. When it is executed for the second time, it creates another new instance (#2) and makes it active by overwriting the PDV value of H with the pointer value identifying instance #2. Now any reference to H in the code block #2 and the rest of the program will cause it to work on instance #2.

Now let us suppose that we need the rest of the program, instead of working on instance #2, to resume working on instance #1 again. With the program as shown above, it presents a problem. Namely, the pointer value identifying instance #1 (originally stored in H) is no longer available since it is overwritten in H and not stored anywhere else. So, even though the instance exists, it can no longer be identified by the program.

The way around the problem is to create another variable of type hash and use it to save the value identifying instance #1. Then, later on, the saved value can be reassigned back to H and thus direct the program to resume working on instance #1 again:

data _null_ ;

dcl hash SAVE ;

dcl hash H ;

H = _new_ hash() ; *Create H instance #1;

SAVE = H ; *Save current PDV value of H;

*...block #1...;

H = _new_ hash() ; *Create H instance #2;

*...block #2...;

H = SAVE ;

*...rest of program...;

run ;

The reason we need another DECLARE|DCL statement is that the value of type hash identifying instance #1 cannot be saved in a scalar variable. Instead, we need another variable of type hash (in this case, SAVE), and the DECLARE statement is the only vehicle to create it.

After the value saved in variable SAVE is reassigned to H, instance #1 is reactivated since now the PDV value of H is again related to this instance. That is, any reference to object H in the rest of the program will cause it to work on instance #1.

If the program needs to use and intermittently activate more than two hash object instances, more type hash variables can be created, each in a separate DECLARE statement, to store their identifying values for later use. However, it is easy to perceive that as the number of such instances grows (and especially if it is not known beforehand), this technique can quickly become unwieldy.

Fortunately, the pointers to hash object instances can be stored in a separate hash table and retrieved from it into the PDV. This much more suitable way of activating an instance will be discussed and exemplified later in the book (especially in Chapters 6 and 9). However, regardless of the technique, the capability to surface an individual instance at will makes programs using the hash object highly flexible and dynamic.

2.2.19 Create Operation Hash Tools

● Statements: DECLARE (DCL).

● Operators: _NEW_.

● Methods: DEFINEKEY, DEFINEDATA, DEFINEDONE.

2.3 DELETE (Table) Operation

This operation serves to delete a hash object instance altogether, including its table and hence all of its table's content. It is useful when all the data processing the programs needs to do with the table is finished and it is no longer needed. By deleting the instance, we free up the memory occupied by both the items stored in its table and its underlying structure - in contrast to the Clear operation after which the underlying structure (and the memory it occupies) is preserved.

2.3.1 The DELETE Method

The only way to delete a hash object instance is to call the DELETE method. Suppose we have declared a hash object named H (and thus created a PDV variable H of type hash) and created one or more of its instances. The following call deletes the active instance of H - that is, the instance identified by the current PDV value of H:

rc = H.delete() ;

If the instance pointed at by the current PDV value of variable H exists, it will be deleted successfully, and the method will return RC=0. It will fail in two cases:

● No instances of hash object H have been created.

● The instance identified by the current PDV value of H no longer exists because it has been deleted previously.

In both cases, the DATA step will be aborted with an error message stating that object H is uninitialized.

The method can always be called unassigned, i.e.:

H.delete() ;

The reason for it is that capturing its return code in a separate variable offers no utility. If the DELETE method fails, the step is instantly aborted; and so no further programming action is possible, based on the return code.

2.3.2 DELETE Operation Details

A few points regarding the DELETE method deserve to be emphasized:

● It does not delete the PDV variable, such as H above, associated with the object. Once defined, this variable persists for the duration of the step. In this sense, it is no different from any other variable defined in the PDV.

● The method does not delete all instances of the hash object referenced in the call.

● It deletes only the active instance. Hence, if other instances need to be deleted, each of them must be made active first and then deleted using a separate call.

As a side note, in contrast to the CLEAR method described below, the DELETE method can successfully delete a hash object instance even if its hash table is locked by a hash iterator.

2.3.3 Delete (Table) Operation Hash Tools

● Methods: DELETE.

2.4 CLEAR Operation

This is categorized as a table-level operation because it deletes all the hash table items at once and releases the memory formerly occupied by them. While the operation eliminates the items from the table, it preserves its entry. In other words, it leaves the table empty, yet keeps the table itself and its defined structure.

2.4.1 The CLEAR Method

The Clear operation is performed by calling the CLEAR method. If the hash object whose table we need to clear is named H, the only piece of code needed to trigger the Clear operation is:

rc = H.CLEAR() ;

The CLEAR method can always be called unassigned, i.e., without capturing its return code in a separate variable:

H.CLEAR() ;

This is because for this method (as well as a number of others), capturing its return code is useless. There are only two reasons why this method can fail:

1. The hash object instance referenced by H does not exist. In this case, the step will be immediately aborted.

2. The table is locked by a hash iterator (discussed in detail later). In this case, the step will be immediately stopped as well.

In either case, if the method call should fail, no further statements would be executed. Thus, the return code, even if captured, could not be examined, and so there is no reason to capture it in the first place.

2.4.2 Clear Operation vs Delete (Table) Operation

The Clear operation is extremely valuable in the situations when a hash table is used to process one block of data after another. Most often (though not always) it happens, for example, when the table is populated during the processing of one BY group and then needs to be reinitialized in preparation for the processing of the next one. By emptying the table before every BY group, the Clear operation ensures that the table uses only as much memory as it needs to load the largest BY group - as opposed to the amount of memory required to load the whole file. It can also be used if a single DATA step is generated by macro language logic, which needs to clear and reload the table using, for example, a WHERE clause.

In principle, the same can be done using the Delete (table) operation. However, since it also deletes the table itself, it requires redoing the entire Create operation before each consecutive block of data being dealt with. Compared to merely purging the table of its items while keeping the table itself, recreating the table can be quite expensive. In real-life situations, where the data blocks to be processed may number in millions, the accumulated cost of multiple Delete (table) operations can quickly become prohibitive.

Having said that, the Delete (table) operation has one advantage: It can delete a table locked by a hash iterator. However, this advantage is moot because there exist simple ways (discussed later on) to unlock the table.

2.4.3 CLEAR Operation Hash Tools

● Methods: CLEAR.

2.5 OUTPUT Operation

The Output operation is designed to unload (i.e., write or copy) the data currently stored in a hash table to a SAS data set file indicated by the program. We classify it as table-level because by default (i.e., unless specifically filtered), it writes every hash item as an output data set observation using a single statement. Before discussing the operation in earnest, let us first note some high-level details:

● Only the data stored in the data portion hash variables is written out; the key portion variables are ignored. Hence, if the key-values are needed in the output file, they have to be defined both in the data portion and key portion.

● Every data portion hash variable becomes an output data set variable with the name and all other attributes inherited from the corresponding PDV host variable.

● The operation is executed at run time completely independently from the DATA step facilities writing data to the data sets specified in the DATA statement.

The hash object tools supporting the Output operation are the OUTPUT method and the DATASET argument tag used to specify the output file. Let us take a look at them first.

2.5.1 The OUTPUT Method

Suppose that we need to write the data stored in the data portion variables of hash table H to a SAS data set Work.fromHash. This is done by calling the OUTPUT method, where the name of the output data set is specified using the DATASET argument tag:

rc = H.OUTPUT (dataset:"work.fromHash") ;

Above, the method is called assigned, as its return code is assigned to variable RC, resulting in RC=0 if the call is successful and RC≠0 otherwise. However, no useful programming action can be taken based on its value because, if the call fails, the DATA step will be stopped there and then with a run-time error. Therefore, common practice is to call the OUTPUT method unassigned:

H.OUTPUT (dataset:"work.fromHash") ;

As usual, if the output data set is written to the WORK (or USER) library, the library specification can be omitted, i.e.:

H.OUTPUT (dataset:"fromHash") ;

The calls shown above represent the most basic syntactic form of calling the OUTPUT method for two reasons:

● They result in unloading the hash table data content to the output file as is. That is:

◦ All hash table items are output as observations in the logical order they are stored in the hash table.

◦ All data portion variables of numeric and character (scalar) type end up in the output as the data set variables with exactly the same names and attributes as the corresponding PDV host variables (and in the order the latter are defined by the DEFINEDATA method). Note that non-scalar variables (such as of type hash), if present in the data portion, are ignored because they cannot be stored in a SAS data set (and a warning to that effect is written to the log).

● The argument to the DATASET argument tag is hard-coded as a character literal constant.

However, neither has to be the case: The functionality of the OUTPUT method is broader, and, correspondingly, its syntax is more flexible. We will discuss some of its richer features later in this section, and many examples of applying them to practical situations will be given in the parts and chapters that follow.

2.5.2 Open-Write-Close Cycle

The Output operation is performed strictly during the DATA step run time. It consists of three phases:

1. Open the data set specified with the DATASET argument tag for output access with member-level control, i.e., for writing.

2. Write the data portion variables to the data set, one observation per (unfiltered) item.

3. Close the data set when finished.

2.5.3 Open-Write-Close Cycle Encapsulation

The open-write-close cycle described above is encapsulated by the Output operation at run time. More specifically, it means that the operation is:

● Handled exclusively by the hash object, with no other DATA step I/O facilities involved. In particular, its actions are independent of the DATA step OUTPUT statement (implicit or explicit) and/or its timing.

● Finalized before program control moves to the executable statement following the statement containing the OUTPUT method call, and the file it has just written to is closed.

Therefore, after the operation has been successfully executed, its output data set is no longer locked for output access with member-level control. As such, the data set at this point can be:

● Reopened, read, and modified while the DATA step keeps running. For example, it can be viewed in the SAS viewer or used by another batch or interactive process.

● Reopened, read, and loaded into another hash table later on in the same DATA step (using the implicit Insert operation described in Chapter 3).

Because the Output operation cycle is run-time encapsulated, it can be performed in the same step as many times as needed to open, write (or rewrite), and close as many output data sets as program logic may dictate.

For the same reason, if the Output operation is successful, the output data set written by it is preserved as written in its destination library even if, later on in the DATA step, it is stopped or aborted due to a run-time error. This behavior stands in contrast with the behavior of the data sets listed in the DATA statement because they are not closed until the DATA step ceases execution.

2.5.4 Avoiding Open File Conflicts

The need to open the data set specified in the Output operation for writing has its own implications. Because a currently opened data set cannot be re-opened for writing, the OUTPUT method call will fail if its target data set already exists and is opened. In this event, the step will be stopped, and an error message to this effect will be written to the log. It can occur in two distinct cases:

1. The target data set already exists in the library and has been opened by another program (e.g., is being viewed in the SAS viewer or read by another program).

2. The name of the target data set is listed in the DATA statement of the same DATA step where the OUTPUT method is called. This is because all data sets listed in the DATA statement are automatically opened for writing when the step begins its execution.

The reason the open file conflict occurs in the situation #2 is that any data set listed in the DATA statement is automatically opened before the step begins its execution and locked for member-level control output access; and so the OUTPUT method called at run time cannot open it.

Therefore, the same output data set cannot be listed in the DATA statement and specified as the target for the OUTPUT method anywhere in the step. In other words, a step similar to the step schematically shown below will result in a run-time error at the time of the OUTPUT method call:

data ONE TWO ;

...

h.output (dataset:"ONE") ; *OPEN-FILE CONFLICT;

...

run ;

However, no conflict will occur in the above step if the method call is coded, for instance, as follows:

h.output (dataset:"THREE") ;

because the Output operation target data set is not listed in the DATA statement. Likewise, no open file conflict of this kind is possible if the DATA statement list is _NULL_:

data _NULL_ ;

...

h.output (dataset:"ONE") ;

...

run ;

The ability to write data to an output data set dynamically with the DATA statement data set specified as _NULL_ looked like an unusual and impressive new SAS feature at the time when the hash object was first offered.

2.5.5 Output Data Set Member Types

With the OUTPUT method, the data set specification supplied to the DATASET argument tag cannot point to a view, i.e., to a SAS data set of member type VIEW. It can be only a SAS data file, i.e., a SAS data set of member type DATA.

First, the method call cannot create a view. A call such as the following is invalid:

H.OUTPUT (dataset:"vHash/view=vHash") ; *INCORRECT!;

It will result in an error and corresponding error message, and the step will be stopped.

Second, the method cannot overwrite an existing view:

● If a view with the same name as specified to the DATASET argument tag already exists, the method will fail and the step will be also stopped with an error message.

● This behavior is consistent with the fact that a data set of member type VIEW cannot be overwritten with a data set of member type DATA and vice versa.

Bearing that in mind, a program may include a provision to check, via the dictionary tables or SAS I/O functions, whether the output data set exists and what member type it has before the name of the target data set passed to the DATASET argument tag is constructed.

2.5.6 Creating and Overwriting Output Data Set

The hash object handles the output data set specified in the DATASET argument tag differently depending on whether a data set with the same name already exists or not:

1. If it does not exist, a new data set is created. In this situation:

◦ Its variable names and other attributes are inherited from the PDV host variables corresponding to the hash variables in the data portion of the table.

◦ The variables appear in the order defined by the DEFINEDATA method, which may be different from the order the host variables are stored in the PDV.

2. If it does exist, there are two situations:

◦ Most commonly, it is an ordinary data set that is not part of a generation group. In this case, it is overwritten. It means that from the usage standpoint (regardless of behind-the-scenes details), the existing data set is erased and a new data set with the same name is created in its stead exactly as described in #1 above.

◦ It is part of a generation group. In this case, a new generation data set with the next generation number is created. Because it is a physically new file, it is treated as a data set that does not exist as described in #1.

Therefore, if in the same DATA step the OUTPUT method is called more than once with the same output data set name, each subsequent call will overwrite the data set written by the call preceding it. It can be illustrated schematically as:

data ... ;

...

h.output (dataset:"OUT") ;

...

h.output (dataset:"OUT") ; *Overwrites OUT written by call #1;

...

h.output (dataset:"OUT") ; *Overwrites OUT written by call #2;

...

run ;

In this step, the first call creates a new data set Work.Out (if it does not yet exist) or overwrites it (if it already exists). The second call overwrites the like-named data set written by the first call, and the third call overwrites the data set written by the second call. Though the data in hash table H may change between the calls, the state of the data set written last reflects the most recent data the table contains.

Hence, if there is a need to save the data written by each call, there are two options:

1. Name the output data sets for the different calls differently - say, OUT1, OUT2, and OUT3.

2. On the first call, use the data set option GENMAX= to create a generation group with the value greater than the number of calls. For example:

data ... ;

...

h.output (dataset:"OUT(genmax=3)") ;

...

h.output (dataset:"OUT") ;

...

h.output (dataset:"OUT") ;

...

run ;

This way, each call will write its own data to its own generation data set without overwriting the data set written by the prior call. Note that the data set option GENMAX= used in the first call is not an exception as far as using output data set options is concerned.

2.5.7 Using Output Data Set Options

Output data set options, such as KEEP=, DROP=, RENAME=, INDEX=, WHERE=, etc., can be used with the output data set in parentheses following its name. The GENMAX= option shown in the prior section is just one example.

Of particular interest is the WHERE= option because can be used to filter the data written to the output data set. For example, if the hash table had a data portion variable Runs (such as variable Runs in data set Bizarro.AtBats), the following method call would output only the items where Runs is greater than zero:

H.OUTPUT (dataset:"work.fromHash(WHERE=(Runs>0))") ;

Or, if we wanted to drop the variable from the output, we could code:

H.OUTPUT (dataset:"work.fromHash(DROP=Runs)") ;

Other output data set options can be used in the same vein and/or combined. The rules of coding them described in the SAS documentation are the same as for any data set specified as output in the DATA statement or in a SAS procedure.

2.5.8 DATASET Argument as Non-Literal Expression

Heretofore, in all examples of using the OUTPUT method, the arguments to the DATASET argument tag have been given as character literal constants, i.e., a quoted fixed string value. Just as with the DEFINEKEY and DEFINEDATA methods discussed above, the documentation describing the OUTPUT method may give an impression that using a character literal is the only option. However, this is not the case.

In actuality, the argument of the DATASET argument tag can be any valid character expression, as long as it resolves to the required data set name - if need be, together with the necessary data set options. Taking advantage of this fact can make a program using the OUTPUT method much more dynamic than using character literals alone.

In the simplest case, let us say that we want to unload a hash table H into a data set named fromHash in a library whose libname is HashOut, and we want to use the WHERE clause to filter the data on the condition Run>0. Using the DATASET argument as a character literal, we could code, as already shown above:

H.OUTPUT (dataset:"work.fromHash(WHERE=(Runs>0))") ;

Now suppose that in the program we already have a PDV character variable named arg valued as follows:

arg = "work.fromHash(WHERE=(Runs>0))" ;

before the OUTPUT method call. In this case, instead of hard-coding the DATASET argument, we can code instead:

data ... ;

...

arg = "work.fromHash(WHERE=(Runs>0))" ;

...

H.OUTPUT (dataset:arg) ;

...

run ;

The reason it can be done this way is that variable arg by itself is a character expression. The fact that, above, it is valued via an assignment statement is unimportant: It can be valued by another mechanism (such as the INPUT or RETAIN statement) or come, already properly valued, from a SAS data set.

As a more involved case, suppose that before the OUTPUT method is called, we have a number of variables representing different parts of the argument value we want to create. For example, the data set specification and the WHERE clause:

dsname = "work.fromHash" ;

where = "Runs>0" ;

In this case, the variables can be combined into a single expression to be passed to the DATASET argument tag:

data ... ;

...

dsname = "work.fromHash" ;

where = "Runs>0" ;

...

H.OUTPUT (dataset: cats(dsname, "(where=(", where, "))")) ;

...

run ;

In most use cases, the components of the expression passed to the DATASET argument tag come from some kind of parameter file. This way, the output data set destination, name, filtering, etc., can be controlled dynamically based on the pre-stored control information and program logic. We will see many examples of applying this concept later in the book.

2.5.9 Output Data Order

As hash object users, we are oblivious to the order and manner in which the hash items are physically stored in a hash table internally. In fact, it does not matter. What really matters is the order in which the items are accessed during hash table operations, for this is how we use them and logically perceive their order in the table. This is similar to how many database systems manage their data tables.

From this operational standpoint, we can simply - and correctly - assume that the logical order in which the items are stored in the table is exactly the order in which they are written out by the Output operation to a data file, such as work.fromHash above. Not surprisingly, this is also precisely the order in which the hash items are accessed by the Enumerate by Key (Keynumerate) and Enumerate All operations, discussed later in this part of the book.

Incidentally, this is why in the realm of hash object programming the Output operation is a great diagnostic tool. While we cannot eyeball the hash table itself, we can always write its data content to a file, view and analyze the latter, and amend our code based on the findings.

2.5.10 Output Operation Hash Tools

● Methods: OUTPUT.

● Argument tags: DATASET.

2.6 DESCRIBE Operation

The Describe operation allows us to retrieve the properties of a hash table as a whole. This is done by using the tools called hash object attributes. Currently, two attributes are supported:

1. The NUM_ITEMS attribute. It returns the number of items currently stored in the hash table of the active instance referenced when it is called.

2. The ITEM_SIZE attribute. It returns the number of bytes the hash table entry occupies in computer memory.

Like the methods, the attributes are called by using the object-dot notation to reference the hash object in question. Also, just as with any method reference to the hash object name, an attribute object reference returns the information related to the table of the active hash object instance. Let us discuss the two attributes one at a time.

2.6.1 The NUM_ITEMS Attribute

To get the number of items stored in the table of a hash object instance referenced as H into variable N_items, we can code:

N_items = H.num_items ;

Note that in order to be used in a program, the value of the attribute does not have to be necessarily assigned to a separate variable such as N_items above. This is because H.num_items is a numeric expression and, as such, can be used in any other numeric SAS expression directly. For example, to make a DO loop iterate half as many times as there are items in table H, we can code:

do x = 1 to divide(H.num_items,2) ;

* code inside the loop ;

end ;

Or, to execute some action only if the hash table is empty (i.e., has no items):

if H.num_items = 0 then do ;

* action ;

end ;

The most valuable utility of the NUM_ITEMS attribute lies in the fact that it returns the current number of items in a hash table, automatically adjusted as it grows or shrinks when items are added to or removed from it. Therefore, it can be used to:

● Determine the upper index limit of an iterative DO loop used to iterate through the hash table sequentially (i.e., enumerate it).

● Help calculate hash table statistics that depend on the number of items in the table (for example, percentiles).

● Implement dynamic data structures, such as stacks and queues.

These uses of the NUM_ITEMS attribute will be discussed in detail and exemplified in the book later on.

2.6.2 The ITEM_SIZE Attribute

The ITEM_SIZE attribute is a Describe operation hash tool that returns the length of the hash table entry expressed in bytes. If we create an analogy between a hash table and a SAS data set, this metric roughly corresponds to the "row length" property of the SAS data set. To call the attribute and return its value into a numeric variable Entry_length, we can code:

Entry_length = h.item_size ;

Though it is difficult to think of its utility from the standpoint of dynamic programming, the attribute can be a great help in assessing the memory footprint of a future hash table. Thus, it is particularly useful in the applications where the hash object memory may be taken to the system limits, and so it is paramount to evaluate, during the program design stage, how much memory it may occupy when filled with items.

There is no hard-and-fast rule of determining the hash entry length a priori based on the lengths of the variables in the key and data portions, all the more true in that it varies based on the platform. However, the ITEM_SIZE attribute returns the exact, actual hash entry length. Thus, it can be used to form a fairly accurate idea of how much memory is needed to accommodate a hash table with a given number of items (or how many items the available memory can accommodate) on the system where it is supposed to be used.

A good use case is to determine how many items can be loaded for a given amount of memory into a table defined in a specific way. For example, imagine that we want to load data set Bizarro.At_Bats into hash table H defined with specific hash variables and, during the design phase, we need to evaluate how many items will fit in the table given 1 GB of memory. To achieve that, we would:

1. Code a DATA step with statements needed to execute the Create operation with the hash variables defined as we want them.

2. Obtain the value returned by the ITEM_SIZE attribute.

3. Divide 1 GB (1024**3=2**30 bytes) by the value and print the result in the log.

For example:

Program 2.5 Chapter 2 Number of Hash Items in Given Memory.sas

data _null_ ;

dcl hash H () ;

H.defineKey ("Player_ID", "Team_SK") ;

H.defineData ("First_name", "Last_name", "Position_code") ;

H.definedone () ;

Entry_length = H.item_size ;

N_items_in_1GB = round (2**30 / Entry_length) ;

put (Entry_length N_items_in_1GB) (=comma16./) ;

stop ;

set bizarro.Player_candidates ;

run ;

Running the step (in this case, on the X64_7PRO platform) results in the following information printed in the SAS log:

Entry_length=80

N_items_in_1GB=13,421,773

Hence, erring on the safer side, we can reasonably expect that the table as created and defined above can accommodate about 13 million items in 1 GB of memory.

Note that the total system length of the hash variables in the key and data portions above is 43. So, if we used it to estimate hash memory usage instead of using the exact number 80 returned by the ITEM_SIZE attribute, it would be underestimated almost by the factor of 2.

2.6.3 Describe Operation Hash Tools

● Attributes: NUM_ITEMS, ITEM_SIZE.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 2: Table-Level Operations

Create new playlist

Sign In

Sign Up