Chapter 1: Hash Object Essentials

1.1 Introduction

1.2 Hash Object in a Nutshell

1.3 Hash Table

1.4 Hash Table Properties

1.4.1 Residence and Volatility

1.4.2 Hash Variables Role Enforcement

1.4.3 Key Variables

1.4.4 Program Data Vector (PDV) Host Variables

1.5 Hash Table Lookup Organization

1.5.1 Hash Table Versus Indexed SAS Data File

1.6 Table Operations and Hash Object Tools

1.6.1 Tasks, Operations, Environment, and Tools Hierarchy

1.6.2 General Data Table Operations

1.6.3 Hash Object Tools Classification

1.6.4 Hash Object Syntax

1.6.5 Hash Object Nomenclature

1.7 Peek Under the Hood

1.7.1 Table Organization and Unindexed Key Search

1.7.2 Internal Hash Table Structure

1.7.3 Hashing Scheme

1.7.4 Hash Function

1.7.5 Hash Table Structure and Algorithm in Tandem

1.7.6 The HASHEXP Effect

1.7.7 What Is in the Name?

1.1 Introduction

The goal of this chapter is to discuss the organization and data structure of the SAS hash object, in particular:

   Hash object and table structure and components.

   Hash table properties.

   Hash table lookup organization.

   Hash operations and tools classification.

   Basics of the behind-the-scenes hash table structure and search algorithm.

On the whole, the chapter should provide a conceptual background related to the hash object and hash tables and serve as a stepping stone to understanding hash table operations.

Since two distinct sets of users are involved in this Proof of Concept, this chapter will likely be of much more interest to the IT users, as they are more likely than the business users to appreciate the details and the nuances discussed here. We did suggest, however, that it would be worthwhile for the business users to skim this chapter, as it should give them a good overview of the power and flexibility of the SAS hash object and its table.

1.2 Hash Object in a Nutshell

The first production release of the hash object appeared in SAS 9.1. Perhaps the original motive for its development was to offer the DATA step programmer a table look-up facility either much faster or more convenient - or both - than the numerous other methods already available in the SAS arsenal. That goal was certainly achieved right off the bat. But what is more, the capabilities built into the newfangled hash object were much more scalable and functionally flexible than those of a mere lookup table. In fact, it became the first in-memory data structure accessible from the DATA step that could emerge, disappear, grow, shrink, and get updated dynamically at run time. The scalability of the hash object has made it possible to vastly expand its original functionality in subsequent versions and releases, and its functional flexibility has enabled SAS programmers to invent and implement new uses for it, perhaps even unforeseen by its developers.

So, what is the hash object? In a nutshell, it is a dynamic data structure controlled during execution time from the DATA step (or the DS2 procedure) environment. It consists of the following principal elements (a minimal code sketch follows the list):

   A hash table for data storage and retrieval specifically organized to perform table operations based on searching the table quickly and efficiently via its key.

   An underlying, behind-the-scenes hashing algorithm which, in tandem with the specific table organization, facilitates the search.

   A set of tools to control the very existence of the table - that is, to create and delete it.

   A set of tools to activate the table operations and thus enable information exchange between the DATA step environment and the table.

   Optional: a hash iterator object instance linked to the hash table with the purpose of accessing the table entries sequentially.
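To make these elements concrete, here is a minimal sketch of a DATA step that creates a hash table, links an iterator to it, and exercises a couple of its tools. The variable names K and D and the object names H and HI are ours, chosen purely for illustration:

data _null_ ;
  dcl hash H () ;       /* tool creating a hash object instance (and its table)   */
  H.definekey  ("K") ;  /* define the key portion                                 */
  H.definedata ("D") ;  /* define the data portion                                */
  H.definedone () ;     /* the table now exists, empty, in memory                 */
  dcl hiter HI ("H") ;  /* optional: a hash iterator linked to table H            */
  K = 1 ; D = 2 ;       /* PDV host variables supply the values (made-up data)    */
  rc = H.add () ;       /* insert an item into the table                          */
  rc = HI.first () ;    /* access the table items sequentially via the iterator   */
  stop ;
run ;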

The terms "hash object" and "hash table" are most likely derived from the hashing algorithm underlying their functionality. Let us now discuss the hash table and its specific features and usage prerequisites.

1.3 Hash Table

From the standpoint of a user, the hash object’s table is a table with rows and columns - just like any other table, such as a SAS data file. Picture the image of a SAS data set, and you have pretty much pictured what a hash table may look like. For example, let us suppose that it contains a small subset of data from data set Bizarro.Player_candidates:

Table 1.1 Hash Object Table Layout

Reminds us of an indexed SAS data set, does it not? Indeed, it looks like a relational table with rows and columns. Furthermore, we have a composite key (Team_SK, Player_ID) and the rest of the variables associated with the key, also termed the satellite data. The analogy between an indexed SAS data set and a hash table is actually pretty deep, especially in terms of the common table operations both can perform. However, there are a number of significant distinctions dictated by the intrinsic hash table properties. Let us examine them and make notes of the specific hash table nomenclature along the way.

1.4 Hash Table Properties

To make the hash table’s properties stand out more clearly, it may be useful to compare them with the properties of the indexed SAS data set from a number of perspectives.

1.4.1 Residence and Volatility

   The hash table resides completely in memory. This is one of the factors that makes its operations very fast. On the flip side, it also limits the total amount of data it can contain, which consists of the actual data and some underlying overhead needed to make the hash table operations work.

   The hash table is temporary. Even if the table is not deleted explicitly, it exists only for the duration of the DATA step. Therefore, the hash table cannot persist across SAS program steps. However, its content can be saved in a SAS data set (or its logical equivalent, such as an RDBMS table) before the DATA step completes execution and then reloaded into a hash table in DATA steps that follow.

1.4.2 Hash Variables Role Enforcement

   The hash variables are specifically defined as belonging to two distinct domains: the key portion and the data portion. Their combination in a row forms what is termed a hash table entry.

   Both the key and the data portions are strictly mandatory. That is, at least one hash variable must be defined for the key portion and at least one for the data portion. (Note that this is different from an indexed SAS table used for pure look-up where no data portion is necessary.)

   The two portions communicate with the DATA step program data vector (PDV) differently. Namely, only the values of the data portion variables can be used to update their PDV host variables.

   Likewise, only the data portion content can be “dumped” into a SAS data file.

   In the same vein, in the opposite data traffic direction, only the data portion hash variables can be updated from the DATA step PDV variables or other expressions.

   However, if a hash variable is defined in the key portion, a hash variable with the same name can also be defined in the data portion. Note that because the data portion variable can be updated while the key portion variable with the same name cannot, their values can differ within one and the same hash item (see the sketch after this list).
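To illustrate these rules, here is a short sketch (the variable names, values, and the data set name work.dump are ours, made up for illustration): the same name can appear in both portions, and only the data portion is written out by the OUTPUT method:

data _null_ ;
  dcl hash H () ;
  H.definekey  ("Player_ID") ;             /* key portion                            */
  H.definedata ("Player_ID", "Position") ; /* data portion: same-named variable      */
                                           /* plus satellite data                    */
  H.definedone () ;
  Player_ID = 111 ; Position = "Pitcher" ; /* made-up values                         */
  rc = H.add () ;
  rc = H.output (dataset: "work.dump") ;   /* only the data portion gets dumped      */
  stop ;
run ;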

1.4.3 Key Variables

   Together, the key portion variables form the hash table key used to access the table.

   The table key is simple if the key portion contains one variable, or compound if there is more than one. For example, in the sample table above, we have a two-term compound key consisting of variables (Team_SK, Player_ID).

   A compound key is processed as a whole, i.e., as if its components were concatenated.

   Hence, unlike an indexed SAS table, the hash table can be searched based only on the entire key, rather than also on one or more of its consecutive leading components (see the sketch after this list).
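A hedged sketch of the whole-key rule, with made-up key values: both terms of the compound key must be supplied to search the table; there is no way to look up by the leading component Team_SK alone:

data _null_ ;
  dcl hash H () ;
  H.definekey  ("Team_SK", "Player_ID") ;  /* two-term compound key                  */
  H.definedata ("Position") ;
  H.definedone () ;
  Team_SK = 13 ; Player_ID = 11111 ; Position = "Catcher" ; /* made-up values        */
  rc = H.add () ;
  rc = H.find (key: 13, key: 11111) ;      /* RC=0: the full key pair is found       */
  stop ;
run ;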

1.4.4 Program Data Vector (PDV) Host Variables

Defining the hash table with at least one key and one data variable is not the only requirement to make it operable. In addition, in order to communicate with the DATA step, the hash variables must have corresponding variables predefined in the PDV before the table can become usable. In other words, at the time when the hash object tools are invoked to define hash variables, variables with the same exact names must already exist in the PDV. Let us make several salient points about them:

   In this book, from now on, we call the PDV variables corresponding to the variables in the hash table the PDV host variables.

   This is because they are the PDV locations from which the hash data variables get their values and into which they are retrieved.

   When a hash variable is defined in a hash table, it is from the existing host variable with the same name that it inherits all attributes, i.e., the data type, length, format, informat, and label.

   Therefore, if, as mentioned above, the key portion and the data portion each contain a hash variable with the same name, that variable has exactly the same attributes in both portions, as inherited from one, and only one, PDV host variable with that name.

   The job of creating the PDV host variables, as any other PDV variables, belongs to the DATA step compiler. It is complete when the entire DATA step has been scanned by the compiler, i.e., before any hash object action - invoked at run time - can occur.

   Providing the compiler with the ability to create the PDV host variables is sometimes called parameter type matching. We will see later that it can be done in a variety of ways, which differ in their degree of automation, robustness, and proneness to error (a simple example follows this list).
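One common idiom, shown here as a hedged sketch, is to let a non-executing SET statement do the parameter type matching (we assume that data set Bizarro.Player_candidates contains the variables named below):

data _null_ ;
  if 0 then set bizarro.Player_candidates ; /* never executes, but at compile time   */
                                            /* creates the PDV host variables with   */
                                            /* all their attributes                  */
  dcl hash H (dataset: "bizarro.Player_candidates") ;
  H.definekey  ("Team_SK", "Player_ID") ;
  H.definedata ("First_name", "Last_name") ;
  H.definedone () ;
  stop ;
run ;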

In order to use the hash object properly, you must understand the concept of the PDV host variables and their interaction with the hash variables. This is as important to understand as the rules of Bizarro Ball if you want to play the game.

1.5 Hash Table Lookup Organization

   The table is internally organized to facilitate the hash search algorithm.

   Reciprocally, the algorithm is designed to make use of this internal structure.

   This tandem of the table structure and the algorithm is sufficient and necessary to facilitate an extremely fast mechanism of direct-addressing table look-up based on the table key.

   Hence, there is no need for the overhead of a separate index structure, such as the index file in the case of an indexed SAS table. (In fact, as we will see later, the hash table itself can be used as a very efficient memory-resident search index.)

For the purposes of this book, it is rather unimportant how the underlying hash table structure and the hashing algorithm work – by the same token as a car driver can operate the vehicle perfectly well while knowing next to nothing about what is going on under the hood. As far as this subtopic is concerned, hash object users need only be aware that its key-based operations work fast – in fact, faster than or on a par with other look-up techniques available in SAS. In particular:

   The hash object performs its key-based operations in constant time. A more technical way of saying it is that the run time for the key-based operations scales as O(1).

   The meaning of O(1) notation is simple: The speed of hash search does not depend on the number of items in the table. If N is the number of unique keys in the table, the time needed to either find a key in it or discover that it is not there does not depend on N. For example, the same hash table is searched equally fast for, say, N=1,000 and N=1,000,000.

   It still does not hurt to know how such a feat is achieved behind the scenes. For the benefit of those who agree, a brief overview is given in the last, optional, section of this chapter, “Peek Under the Hood”.

1.5.1 Hash Table Versus Indexed SAS Data File

To look at the hash table properties still more systematically, it may be instructive to compile a table of the differences between a hash table and an indexed SAS file:

Table 1.2 Hash Table vs Indexed SAS File

1.6 Table Operations and Hash Object Tools

To recap, the hash object is a table in memory internally organized around the hashing algorithm and tools to store and retrieve data efficiently. In order for any data table to be useful, the programming language used to access it must have tools to facilitate a set of fundamental standard operations. In turn, the operations can be used to solve programming or data processing tasks. Let us take a brief look at the hierarchy comprising the tasks, operations, and tools.

1.6.1 Tasks, Operations, Environment, and Tools Hierarchy

Whenever a data processing task is to be accomplished, we do not start by thinking of tools needed to achieve the result. Rather, we think about accomplishing the task in terms of the operations we use as the building blocks of the solution. Suppose that we have a file and need to replace the value of variable VAR with 1 in every record where VAR=0. At a high level, the line of thought is likely to be:

1.   Read.

2.   Search for records where VAR=0.

3.   Update the value of VAR with 1.

4.   Write.

Thus, we first think of the operations (read, search, update, write) to be performed. Once the plan of operations has been settled on, we then look for an environment and tools capable of performing them. For example, we can decide whether to use the DATA step or SQL environment. Each environment has its own set of tools, and so, depending on the choice of environment, we can then decide which tools (such as statements, clauses, etc.) to use to perform the operations. The logical hierarchy of solving the problem is thus sequenced as Tasks->Operations->Environment->Tools.
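For instance, if the DATA step environment is chosen, the four operations map onto its tools as in this sketch (the data set names work.have and work.want and the variable VAR are hypothetical):

data work.want ;            /* Write: the output table                 */
  set work.have ;           /* Read: bring each record into the PDV    */
  if VAR = 0 then VAR = 1 ; /* Search for VAR=0 and Update it with 1   */
run ;

In the SQL environment, the same plan could instead be carried out in place with an UPDATE statement and a WHERE clause.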

In this book, our focus is on the SAS hash object environment. Therefore, it is structured as follows:

   Classify and discuss the hash table operations.

   Exemplify the hash object tools needed to perform each operation.

   Demonstrate how various data processing tasks can be accomplished using the hash table operations.

1.6.2 General Data Table Operations

Let us consider a general data table - not necessarily a hash table, at this point. In order for the table to serve as a programmable data storage and retrieval medium, the software must facilitate a number of standard table operations generally known as CRUD - an abbreviation for Create, Retrieve, Update, Delete. (Since the last three operations cannot be performed without the Search operation, its availability is tacitly assumed, even though it is not listed.) For instance, an indexed SAS data set is a typical case of a data table on which all these operations are supported via the DATA step, SQL, and other procedures. A SAS array is another example of a data table (albeit in this case, the SAS tools supporting its operations are different). And of course, all these operations are available for the tables of any commercial database.

In this respect, a SAS hash table is no different: The hash object facilitates all the basic operations on it and, in addition, supports a number of useful operations dictated by its specific nature. These operations can be subdivided into two levels: one related to the table as a whole, and the other to the individual table items. Below, the operations are classified based on these two levels:

Table 1.3 Hash Table Operations Classification

1.6.3 Hash Object Tools Classification

The hash table operations are implemented by the hash object tools. These tools, however, have their own classification, syntactic form, and nomenclature, different from other SAS tools with similar functions. Let us consider these three distinct aspects.

The hash object tools fall into a number of distinct categories listed below:

Table 1.4 Hash Object Tools Classification

Generally speaking, there exists no one-to-one correspondence between a particular hash tool and a particular standard operation. Some operations, such as search and retrieval, can be performed by more than one tool. Yet some others, such as enumeration, require using a combination of two or more tools.

1.6.4 Hash Object Syntax

Unlike the SAS tools predating its advent, most hash tools are invoked using the object-dot syntax. Even though it may appear unusual at first to those who have not used it, it is not complicated and is easy to learn from code examples (abundant in this book), from the documentation, and from online forums, such as communities.sas.com. Since the beginning of SAS 9, the DATA step compiler has been endowed with the facility to parse and recognize this syntax as valid. In fact, this is the only way the compiler looks at code that uses the hash object tools: Syntax checking is the only thing done with the hash object at compile time. Everything else is done by the object itself at execution (run) time.
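To illustrate the object-dot form, here is a minimal sketch (the names H and K are ours):

data _null_ ;
  dcl hash H () ;      /* the DECLARE (DCL) statement itself is not dot-syntax   */
  H.definekey ("K") ;  /* object-dot syntax: <object name>.<method> (<arguments>)*/
  H.definedata ("K") ;
  H.definedone () ;
  K = 1 ;
  rc = H.add () ;      /* most methods return a numeric code: 0 means success    */
  rc = H.find () ;
  stop ;
run ;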

1.6.5 Hash Object Nomenclature

The key words used to call the hash object tools to action have their own naming conventions, more or less reflecting the nature of actions they perform and/or operations they support. However, their nomenclature is conventional in the sense that it adheres to using the standard SAS names.

This is also true of the name given to the hash object itself when it is declared. Since the DATA step compiler views such a name as a variable (albeit of a non-scalar data type different from that of numeric or character), it must abide by all the SAS variables' naming conventions - including the rules imposed by the value of the VALIDVARNAME system option currently in effect. Thus, for example, if VALIDVARNAME=ANY, the hash object can be named # by referencing it in code as "#"n; however, then all subsequent references to it must follow this exact form.

1.7 Peek Under the Hood

As noted earlier, it is not really necessary to be aware of the hash object’s underlying table structure and its hashing look-up mechanism in order to use it productively. However, a degree of such awareness is akin to that of inquisitive drivers who know, at a high level, how their cars work. Likewise, a more inquisitive hash object user is better equipped than one who is oblivious to what is going on beneath the surface – particularly in certain situations (some of which are presented later in this book) when the object is utilized on the verge of its capacity.

We provided this peek at the request of one of the IT users who was most interested in understanding the ins and outs of the SAS hash object. So, as just stated, the details that follow are targeted at more advanced IT users and programmers.

1.7.1 Table Organization and Unindexed Key Search

When there is a need to rapidly search a large table, creating an index on the table key may be the first impulse. After all, this is how we look for a keyword in a book. The idea that the table organization itself coupled with a suitable look-up strategy (and not relying on a separate index) can be used for searching just as successfully may not spring to mind just as naturally.

And yet, it should be no surprise. For instance, let us think how we look for a book page given its number. Using the fact that the pages are ordered, we open the book somewhere in the middle and decide in which half our page must be located. To wit, if the page number being sought is greater than the page number the book is opened to, we continue to search only the pages above the latter; and if the opposite is true, we continue to search below it. Then we proceed in the same manner with the remaining part of the book and repeat the process until the page we need is finally located. In this process of binary search, the book is nothing more than a table keyed by the page number. Effectively, taking advantage of the page order as the table organization, we use a divide-and-conquer algorithm to find the page we need. In doing so, we need no index to find it, relying solely on the table structure (i.e., the fact that it is ordered) itself.

In the same vein, an ordered table with N unique arbitrary keys (not necessarily serial natural numbers like book pages) can be searched in O(log2(N)) time using the binary search. In this case, the key order is the table’s organization, and the binary search is the algorithm. The O(log2(N)) is an example of the so-called “big O” notation. O(log2(N)) means that every time N doubles, the maximum number of the comparisons between the keys in the table and the search key needed to find or reject it (let us denote this number as Q) increases merely by 1. By contrast, with the linear (“brute-force”) search, Q is proportional to N, which is why in "big O" notation its search behavior is described as O(N).

Thus, the binary search (though more complex and computationally heavy) scales much better with the table size. For example, at N=4, Q=2 for the linear search and Q=3 for the binary search. But already at N=16, the respective numbers are Q=8 and Q=5, while at N=1024, they diverge as far from each other as Q=512 and Q=11, respectively.
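To make the algorithm concrete, here is a hedged sketch of the binary search against a small sorted table held in a temporary array (the key values are made up):

data _null_ ;
  array Key [8] _temporary_ (2 3 5 7 11 13 17 19) ;   /* the ordered table      */
  Search_key = 11 ;
  Lo = lbound (Key) ; Hi = hbound (Key) ; Found = 0 ;
  do until (Lo > Hi or Found) ;
    Mid = floor ((Lo + Hi) / 2) ;                     /* probe the middle       */
    if      Search_key > Key[Mid] then Lo = Mid + 1 ; /* search the upper half  */
    else if Search_key < Key[Mid] then Hi = Mid - 1 ; /* search the lower half  */
    else Found = 1 ;                                  /* key located            */
  end ;
  put Found= Mid= ;                                   /* prints Found=1 Mid=5   */
run ;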

However, while the binary search is obviously very good (and consequently widely used), it has a couple of shortcomings. First, it requires the table to have been sorted by the key. Second, Q still grows as N grows, albeit more slowly. The tandem of the hash table organization coupled with the hashing algorithm rectifies both flaws: (a) it does not require presorting, and (b) it facilitates searching with O(1) running time. Let us look more closely how it works in the hash object table.

1.7.2 Internal Hash Table Structure

The hash table contains a number of AVL (Adelson-Velsky and Landis) search trees, which can also be thought of as “buckets”. An AVL tree is a balanced binary tree designed in such a way and populated by such a balancing mechanism that its search run time is always O(log2(N)) – i.e., the same as the binary search – regardless of the uniformity or skewness of the input key values. (Balancing the tree while it is being populated requires a bit of overhead; however, in the underlying SAS software it is coded extremely tightly and efficiently.) Visually, the structure (where N is the number of unique keys and H is the number of trees) can be viewed schematically as follows:

Table 1.5 Hash Object Table Organization Scheme

The number of the AVL trees in the hash object table (denoted as H above) is controlled by the value assigned to the argument tag HASHEXP, namely, H=2**HASHEXP. So, the number of buckets can only be a power of 2: 1, 2, 4, and so on up to the maximum of 2**20=1,048,576. If the HASHEXP argument tag is not specified, HASHEXP:8 (i.e., H=256) is assumed by default. Any HASHEXP value over 20 is auto-truncated to 20. We will return to the question of picking the right number of trees after discussing the central principle of the hashing scheme.
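As a point of syntax, the number of trees is requested via the HASHEXP argument tag when the object is declared; a minimal sketch (the names are ours):

data _null_ ;
  K = . ;                    /* host variable, created for parameter type matching */
  dcl hash H (hashexp: 10) ; /* H = 2**10 = 1,024 AVL trees (buckets)              */
  H.definekey  ("K") ;
  H.definedata ("K") ;
  H.definedone () ;
  stop ;
run ;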

1.7.3 Hashing Scheme

The central idea behind hashing is to insert each key loaded into the table into its own tree in such a clever way that each tree receives about the same number of keys regardless of their values and without any prior knowledge of these values' actual distribution.

If that were possible, we would have N/H keys in each tree. Suppose we have 2**20 (about 1 million) keys to load. If we loaded them in a single tree (HASHEXP:0, H=1), using the binary tree search to find or reject a search key would require 21 comparisons between the search key and some keys in the tree.

However, if the keys were uniformly loaded into H=1024 trees (HASHEXP:10), each would contain only about 1024 keys. Given a search key, the hash function would tell us the number of the only bucket where it can be located. Searching among 1024 keys in that tree via the binary search would then require only 11 key comparisons to find or reject the search key; i.e., the look-up speed would be practically doubled.

1.7.4 Hash Function

This divide-and-conquer strategy is not difficult to envision. But how do we make sure that each bucket receives an approximately equal number of input keys if we know nothing about their values a priori? If our input keys were, say, natural numbers from KEY=1 to KEY=1024 and we had 8 trees to fill, we would just divide the keys into 8 equal ranges, with exactly 128 keys per tree. But what to do if we do not know anything about the input key values? And what if the keys are not merely natural numbers but arbitrary character keys or, even worse, mixed-type compound keys?

Luckily, there exist certain transformations called hash functions. A good hash function has four fundamental properties:

1.   It can accept N arbitrary keys with arbitrary values as arguments and return a natural number TN (tree number) from 1 to H: TN = hash_function (KEY).

2.   Each TN number from 1 to H will be assigned to approximately N/H keys, with very little or no variation between different TN numbers.

3.   For any given key value, it returns one and only one value of TN. In other words, there will be no situation where the same key is assigned to two different trees.

4.   It is reasonably fast to compute. Indeed, if it were very slow to calculate and hence took an inordinately long time to determine to which tree a key belongs, it would defeat the very purpose of rapid and efficient searching in the first place.

To see how such a feat can be pulled off, let us form a composite key from the pair of variables (Team_SK, Player_ID) from data set Bizarro.Player_candidates. Now let us plug all of its 10,000 distinct values into the cocktail of nested functions used as a hash function below to distribute the resulting TN values into the numbers from 1 to 8. The key-indexed array Freq is used to count how many TN values fall into each bucket:

Program 1.1 Hash Function Bucket Distribution

data _null_ ;
  set bizarro.Player_candidates end = LR ;
  /* Hash the composite key into a tree (bucket) number TN from 1 to 8 */
  TN = 1 + mod (rank (MD5 (cats (Team_SK, Player_ID))), 8) ;
  array Freq [8] (8*0) ;   /* key-indexed frequency counters, one per bucket */
  Freq[TN] + 1 ;           /* count this key in its bucket                   */
  if LR then put Freq[*] ; /* print the bucket counts after the last record  */
run ;

From the values of Freq[TN] printed in the SAS log, we get the following picture:

Table 1.6 Hash Function Bucket Distribution

The reason the keys are distributed so evenly is that the MD5 function supplied by SAS is a hash function itself. Above, it consumes the composite key (Team_SK, Player_ID) converted into a character string by the CATS function and generates a 16-byte character string (a so-called signature). The RANK function returns the position of the signature's first byte in the collating sequence. Finally, the MOD function with the divisor of 8 maps that position (ranging from 0 to 255) to the values 0 through 7, and adding 1 yields the TN values from 1 to 8.

While there is about 4 percent variability between the most and least filled buckets, for all practical intents and purposes using the binary tree search within any of these buckets would be equally fast. As opposed to binary-searching all 10,000 keys, it would save about 3 key comparisons per binary search in the fullest bucket. Since comparing keys is usually the most computationally taxing part of searching algorithms, distributing the keys among the trees may well justify the overhead of computing TN by using the hash function above.

The expression given above is merely one example of a decent hash function used just to illustrate the idea. It works well because MD5 is itself a hash function. Though the internal hash function working for the hash object behind the scenes is different, conceptually it serves the same goal of distributing the keys evenly across the allocated number of the AVL trees.

1.7.5 Hash Table Structure and Algorithm in Tandem

Now that we know that having a good hash function is possible, we can spell out how the hash table's internal structure and the hashing algorithm work in tandem. Given a key to search for:

   The key is plugged into the hash function.

   The hash function returns a tree number TN from 1 to H. The tree with that specific value of TN is the only tree where the key can possibly be found.

   If the tree is empty, the key is not in the table.

   Otherwise, the tree is binary-searched, and the key is either found or rejected.

Further actions depend on the concrete task. For instance, all we may want to know is whether the key is in the table, so no further action is necessary. Alternatively, if the key is not found, we may want to load it, in which case it is inserted into the tree whose number is TN. Or, if the key is found, we may want to remove its item from the table.
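A hedged sketch of how these outcomes map onto concrete hash tools (the names and values are ours; each method call triggers the hash-function-plus-tree-search sequence just described):

data _null_ ;
  dcl hash H () ;
  H.definekey  ("K") ;
  H.definedata ("D") ;
  H.definedone () ;
  K = 1 ; D = 42 ;                        /* made-up key and data values        */
  if H.check () ne 0 then rc = H.add () ; /* not found: insert the key and data */
  rc = H.remove () ;                      /* found: delete the item with key K  */
  stop ;
run ;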

1.7.6 The HASHEXP Effect

To recap, the value of the argument tag HASHEXP determines the number of AVL trees H the hash object creates for its hash table. It follows from the nature of the algorithm that the fewer keys there are in each tree, the speedier the search. But let us look at it more closely:

   Suppose that we have N=2**24 (about 16.7 million) keys in the table. With HASHEXP:20 and hence H=2**20 (about 1 million), there will be 2**4=16 keys hashing, on average, to one bucket. Searching for a key among the 16 keys within any AVL tree requires about log2(16)+1=5 key comparisons.

   Now if we had HASHEXP:8 (the default), i.e., H=2**8=256, there would be 2**16=65,536 keys, on the average, hashing to one bucket tree. That would require 17 comparisons to find or reject a search key in a given tree. And indeed, a simple test can show that HASHEXP:20 results in searching about twice as fast as HASHEXP:8.

   The penalty of increasing HASHEXP and H comes in the form of the additional base memory required to allocate 2**20 trees versus 2**8. Base memory is not the memory needed to hold the actual keys and data but rather the memory required to support the table infrastructure. In other words, it is the memory occupied by the empty table. It is static; i.e., once allocated, it does not change regardless of the amount of actual data loaded into the table. And the penalty is hardly severe: For example, on the Windows 64-bit platform, a table with two numeric variables (one per portion) needs about 17 MB with HASHEXP:20 and about 1 MB with HASHEXP:8. Despite the 17 times difference, in the modern world of gigantic cheap memories, the 16 MB static difference is of little consequence.

   Thus, HASHEXP:20 can be coded safely under any circumstances to trade faster execution for a few megabytes of extra memory.

   Even so, it still hardly makes sense to allocate more trees H than the number of unique keys N and waste memory on empty trees. Yet setting HASHEXP to a value less than 8 (the default) does not make much sense either, because the base hash memory difference between HASHEXP:8 (256 trees) and HASHEXP:0 (1 tree) is practically negligible.

The real limitation on the hash object memory footprint comes from the length of its entry multiplied by the number of hash items. This is crucially important in a number of data processing situations where using the hash object appears to be the best or only solution but requires large data volumes to be stored in a hash table. In this book, a separate chapter is dedicated to the ways and means of reducing hash memory usage.
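As a hedged sketch, the entry length (and hence a rough estimate of the dynamic footprint, ignoring the static base memory) can be gauged at run time from the ITEM_SIZE and NUM_ITEMS attributes (the names K and D are ours):

data _null_ ;
  dcl hash H () ;
  H.definekey  ("K") ;
  H.definedata ("D") ;
  H.definedone () ;
  do K = 1 to 1000 ;                          /* load some made-up items        */
    D = K * 2 ;
    rc = H.add () ;
  end ;
  Entry_length = H.item_size ;                /* bytes per hash table entry     */
  Memory_est  = Entry_length * H.num_items ;  /* rough dynamic footprint, bytes */
  put Entry_length= Memory_est= ;
  stop ;
run ;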

1.7.7 What Is in the Name?

It is an easy guess that the SAS hash object is called the hash object because its built-in hash function and hashing algorithm underlie its functionality and performance capabilities. The name has been in use for almost two decades and irrevocably internalized by the SAS community.

Those interested in SAS lore may like to know that the first name for this wonderful addition to the SAS arsenal was "associative array". Perhaps the idea was to take the attention off the mechanism underlying the object and refocus it on what it does from the perspective of usage.

An associative array is an abstract data type resembling an ordinary array, with the distinction that (a) it is dynamic (i.e., it grows and shrinks as items are inserted and deleted) and (b) it can be subscripted using not only an integer key but any key, including a composite key with character components. It is easy to perceive that the hash object possesses both properties. So, calling the hash object an associative array initially made sense, for most SAS programmers are well familiar with arrays and could relate to the newfangled capability from this angle.
