A
ADD method 53, 57–58, 61, 67, 147, 151, 165, 170, 173, 189, 199, 200, 212, 236, 243, 274, 276, 281, 365
Adelson-Volsky and Landis (AVL) 11
algorithm, hash table 13
APPEND procedure 161
argument tag mode 53–54
ARRAY statement 239, 308
arrays
about 279
associative 14, 271
sorting 279–282
assigned CHECK calls 50
associative array 14, 271
AtBats (Bizarro Ball) 118–121
auto-formatting 343–345
AVL (Adelson-Volsky and Landis) 11
AVL binary tree 271
B
balanced binary tree 11
base memory 14
"big O" notation 11
binary search 10
Bizarro Ball sample data
about 117–118, 125
AtBats 118–121
descriptions 118–125
Games 121
Leagues 122
Pitches 122–123
Player_Candidates 123–124
Runs 124–125
Teams 125
Bizarro Ball star scheme data warehouse
creating 182–190
defining fact and dimension hash tables 183–185
defining tables 182–183
BY statement 153, 215, 236–237, 269, 270–271, 291
C
CALL MISSING function 195, 200, 353
Carpenter's Complete Guide to the SAS REPORT Procedure (Carpenter) 368
case studies
Researching Alternative Pitching 349–359
What If the Count Is 0-2 After the First Two Pitches 361–369
CATS function 226, 238, 341
CATX function 147, 309, 329, 336, 338, 340–341, 342–343, 344–345, 345–346
CHAR function 26
character expression 20–21, 53–54
CHECK calls
assigned 50
unassigned 50–51
CHECK method 49, 51, 65, 80, 84, 111, 131, 163, 170, 293
CLASS statement 219
CLEAR method 36–37, 114, 147, 149, 205, 321, 330
CLEAR operation
about 36
CLEAR method 36–37
DELETE operation vs. 37
hash tools 37
joining data 298
sorted input 145–146
collisions (MD5 function) 339–340
compile phase 22
compile time statement 95
compile-time only directive 17
composite keys 5, 72, 279
compound key 6
concatenation
delimiters and endpoints 341–343
length sizing 340–341
order and consistency 345–346
CONTENTS procedure 24
CONTINUE statement 364
counts, adding 132–134, 252–255
Create, Retrieve, Update, Delete (CRUD) 8
CREATE operation
about 16–17
combining declaration and instantiation 19–20
creating hash object instances 18–19
declaring hash objects 17–18
defining hash table variables 20–21
defining hash variables as non-literal expressions 28–29
defining hash variables dynamically one at a time 29
defining hash variables using metadata 29–31
defining multiple hash variables 27–28
dynamic parameter type matching 23–25
ensuring single instance usage 33–34
finishing 21–22
handling multiple instances 34–35
hash tools 35
multiple instances issue 31–32
omitting DEFINEDATA method 21
parameter type matching 22–26
PDV host variables 22–23
unsorted input 151
CRUD (Create, Retrieve, Update, Delete) 8
D
data
combining 134–144
grouping 153–156
joining 294–299, 308–312
ordering 153–156
splitting 144–152
subsetting 128–134
testing for grouping 269–271
data aggregates
about 191–193
calculating mean 209–213
calculating medians 201–205, 209–213
calculating mode 201–202, 207–213
calculating percentiles 201–202, 205–207, 209–213
calculating unique counts 198–201
creating 191–228
creating multi-way 219–228
defining using parameter files 223–228
determining distribution of consecutive events 213–219
getting variables from tables 194–198
key order 289–292
MD5 hash key reduction 302–305
merging joins with 137–138
uniform MD5 split on the fly 328–330
via key byte split 321–325
via partial key split 316–321
data portion 5
data portion offload
about 310
joining data 310–312
selective unduplication 312–315
DATA statement 39–40, 145, 146, 357
data tables
getting variables from 194–198
operations 8–9
data tasks 127–157
data types, of expressions 50
data unduplication
key order 292–294
MD5 hash key reduction 305–308
DATASET argument 42–43, 44
DATASETS procedure 186
DCL statement 32, 219
declaration, combining with instantiation 19–20
DECLARE HASH statement 233
DECLARE statement 17–20, 22, 31, 35, 69, 96–97, 114, 214
DEFINEDATA method 20–22, 27–31, 35, 41–43, 136, 184, 185, 205, 215, 219, 225, 286, 315
DEFINEDONE method 21–23, 26, 35, 136, 170, 185, 225
DEFINEKEY method 20–22, 27–31, 35, 42–43, 184, 185, 205, 225, 286, 296
DELETE ALL operation
about 61–62
hash tools 63
hash-PDV interaction 63
implementing 62–63
item locking and 63
DELETE method 35–36, 89–90, 149, 205
DELETE operation
about 35, 88–89
CLEAR operation vs. 37
DELETE method 35–36
details for 36
hash tools 36
implementing stacks 275
multiple items 90–92
UPDATE operation vs. 94
DELETE statement 33
delimiters, concatenation 341–343
dequeue 273
DESCRIBE operation
about 44
hash tools 46
ITEM_SIZE attribute 45–46
NUM_ITEMS attribute 44–45, 46
DIM function 242
Direct Retrieve 64–65
direct-addressing table look-up 7
disk, memory vs. 284–288
DISTINCT option 226
DO loop 29, 30, 31, 33–34, 45, 69, 80, 81, 91, 103–104, 133, 140, 278, 281, 293, 296, 311
DO statement 147
DO UNTIL loop 33
DO WHILE loop 86, 136, 137–138, 152, 170, 337, 364
DO_OVER method 81–82, 84, 87–88, 133–134, 136, 139–140, 337
DoW loop 246, 270, 278, 291, 298, 337
DROP= option 24, 42
Duplicate Insert 56–57, 58–59
dynamic data structures, hash objects as
See hash object
dynamic parameter type matching 23–25
E
END= option 33
end points
concatenation of 341–343
enumerating from 103–105
END statement 147
ENUMERATE ALL operation
about 95
combining data 134
creating HoH tables 233
creating iterator object 97–98
DELETE ALL method 63
direct iterator access 99–102
enumerating from end points 103–105
enumerating sequentially from key-items 108–109
FIRST method 105–106
harvesting same-key items from key-items 109–110
hash iterator 110–111
hash iterator object 95–97
hash iterator pointer 98–99
hash tools 114
hash-PDV interaction 114
implementing stacks 274–275
item locking 110–111
iterator priming 105
keeping iterator in table 106–108
LAST method 105–106
linking iterator object 97–98
locking 111–112
locking hash tables 113–114
locking same-key item groups 113
NEXT method 105–106
PREV method 105–106
sequential access 103
sorting arrays 281
unlocking 111–112
ENUMERATE operation 59, 64, 69, 134
enumeration
See also ENUMERATE ALL operation; KEYNUMERATE operation
about 76
of hash tables 76–77
as a process 76
environment 8
events, consecutive 245–247, 277–279
execution phase 22
explicit formatting 343–345
Explicit Insert
about 54–55
Implicit Insert vs. 56
rules for 55–56
explicit search 49
expressions, data types of 50
F
fact tables
creating and updating 161–163
defining via metadata 183–185
updating 186–190
FIFO (first-in-first-out) 272
file reference
dynamic parameter type matching via 23–24
dynamic parameter type matching via forced 25
parameter type matching by default 26
FIND + FIND_NEXT method 139–140
FIND method 65–66, 82, 84, 86, 87, 94, 113, 134, 143, 151, 167, 174, 176, 181, 193, 195, 197, 219, 242, 251, 254, 259, 314, 352, 364, 365
FIND_NEXT method 80, 81, 83–84, 87–88, 90, 92, 94, 113, 170, 173, 180, 365
FIND_PREV method 82–84
FIRST method 99–101, 103, 105, 111, 113, 114, 202, 212
NEXT/PREV methods vs. 105–106
first-in-first-out (FIFO) 272
FORMAT procedure 192, 319
formatting
auto-formatting 343–345
explicit 343–345
FREQ procedure 213–214
frontal attack 292, 294, 299, 305, 308, 313
FULLSTIMER option 291, 300
functions
See specific functions
G
Games (Bizarro Ball) 121
general data table 8
GENMAX= option 41–42
grouped data, non-grouped data vs. 269
H
hash function 12–13
hash index 310
See also data portion offload
hash iterator 63, 110–111
hash iterator object 4, 95–97
hash iterator pointer 98–99
hash memory
See hash objects
hash object attributes 44
hash objects
about 1, 3–4, 261–262, 284
combining memory-saving techniques 335–338
creating instances 18–19
creating method calls using macros 184–185
data portion offload 310–315
declaring 17–18
MD5 argument concatenation 338–346
MD5 hash key reduction 299–309
memory vs. disk 284–288
nomenclature 10
stable unduplication 262–268
syntax 9–10
testing data for grouping 269–271
tool classifications 9
tools 8–10
uniform input split 315–327
uniform MD5 split on the fly 327–332
uniform split using SAS index 332–335
using existing key order 288–299
Hash of Hashes (HoH)
about 150
consecutive events 245–247
multiple splits 247–260
Hash of Hashes (HoH) table
about 231–232
calculating medians 236–244
calculating mode 239–244
calculating percentiles 234–236, 239–244
creating 232–234
hash search algorithm 6
hash table entry 5
hash tables
about 4–5, 127–128, 157
algorithm 13
checking for grouping using 269–271
combining data 134–144
data tasks using 127–157
defined 4
defining variables 20–21
enumerating 76–77
grouping data 153–156
indexed SAS data files vs. 7
internal structure 11
keys 6
locking 113–114
lookup organization 6–7
operations 8–10
order principle 72
ordering data 153–156
organization 10–11
as other data structures 271–279
properties of 5–6
sorting arrays using 279–282
splitting data 144–152
structure 13
subsetting data 128–134
hash tools
CLEAR operation 36–37, 145–146, 298
CREATE operation 16–35, 151
DELETE ALL operation 61–63
DELETE operation 35–37, 88–92, 275
DESCRIBE operation 44–46
ENUMERATE ALL operation 63, 77–78, 95–114, 134, 233, 274–275, 281
INSERT operation 51–61, 151, 161, 165, 233, 281
KEYNUMERATE operation 68, 79–94, 134, 138–141, 298, 309, 311
ORDER operation 68–73
OUTPUT operation 37–44, 72
RETRIEVE operation 64–66, 134, 138, 140, 143, 167, 176, 193, 194, 232, 310
UPDATE ALL operation 66–68
hash variables
defining as non-literal expressions 28–29
defining dynamically one at a time 29
defining multiple 27–28
defining using metadata 29–31
role enforcement of 5–6
HASHEXP 13–14
hashing algorithm 4
hashing scheme 11–12
HAS_NEXT method 83–84, 91, 92
HAS_PREV method 83–84
HoH
See Hash of Hashes (HoH) table
host variables 5, 6
I
IF statement 141
IFN function 31
IF-THEN-ELSE statement 163
IMPLICIT INSERT operation 56, 185
implicit search 48–49
INDEX= option 42, 180
indexed SAS data files, hash tables vs. 7
inner joins 139
INPUT statement 43
INSERT operation
about 51
ADD method 53
argument tag mode 53–54
creating and updating fact tables 161
creating hash of hash (HOH) tables 233
Duplicate Insert 56–57, 58–59
dynamic memory acquisition 51–52
handling Type 0 dimension tables 165
hash data split 151
hash tools 61
hash-PDV interaction 61
Implicit Insert 52–54
insertion order 59–61
METHOD call mode 52–53
REF method 53
REPLACE method 53
sorting arrays 281
Unique Insert 56–58
instances
handling multiple 34–35
multiple 31–32
single usage of 33–34
instantiation, combining with declaration 19–20
interleaving files 294, 297
intrinsic data grouping 155–156
item locking
DELETE ALL operation and 63
hash iterator and 110–111
item-level operations
about 48, 76
DELETE ALL operation 61–63
ENUMERATE ALL operation 95–114
enumeration 76–78
INSERT operation 51–61
KEYNUMERATE operation 79–94
ORDER operation 68–73
RETRIEVE operation 64–66
SEARCH operation 48–51
UPDATE ALL operation 66–68
ITEM_SIZE attribute 45–46
iterator object
creating 97–98
direct access 99–102
keeping in tables 106–108
linking 97–98
priming 105
sequential access 103
J
joins
inner 139
left 135–137
merging with aggregates 137–138
right 135–137
uniform MD5 split on the fly 330–332
unique-key 140–144
via key byte split 325–327
K
KEEP= option 24, 42
KEY argument tag 49
key order
about 288–289
data aggregation 289–292
data unduplication 292–294
joining data 294–299
key portion 5
key variables 6
key-items
enumerating sequentially from 108–109
harvesting same-key items from 109–110
KEYNUMERATE operation
about 79
backward 82–83
coding styles 81
combining data 134
DO_OVER method 81–82
DO_OVER vs, FIND + FIND_NEXT 139
FIND_NEXT 80
harvesting hash items 85–88
hash tools 94
hash-PDV interaction 94
HAS_NEXT 84
HAS_PREV 84
joining data 298, 309, 311
mechanics of 79–80
merging joins 138
selective DELETE operations 88–92, 94
selective UPDATE operation 68
selective UPDATE operations 92–94
staying within items lists 83–84
unique-key joins 140–141
L
LAG function 163
LAST method 101, 105–106, 114, 275
last-in-first-out (LIFO) 272
Leagues (Bizarro Ball) 122
left joins 135–137
LENGTH statement 23
%LET statement 281, 356
LIFO (last-in-first-out) 272
LINK statement 189, 190, 197, 203, 205, 222, 226, 227, 231, 365
locking
about 111–112
hash tables 113–114
same-key item groups 113
M
macros, creating hash object method calls using 184–185
MD5 function
about 288
argument concatenation 338–346
auto-formatting 343–345
collisions 339–340
concatenation delimiters and endpoints 341–343
concatenation length sizing 340–341
concatenation order and consistency 345–346
explicit formatting 343–345
SHA256 function 339–340
MD5 hash key reduction
about 299–300
data aggregation 302–305
data unduplication 305–308
general concept of 300–301
joining data 308–309
in sample data 301–302
mean, calculating 209–213
MEANS procedure 198, 219
medians, calculating 201–205, 209–213, 236–244
member types, output data set 40
memory, disk vs. 284–288
MERGE statement 17, 24, 165
merging joins with aggregates 137–138
metadata
defining fact and dimension hash tables via 183–185
defining hash variables using 29–31
METHOD call mode 52–53
methods
See specific methods
metrics
adding 353–354
creating 191–228
MOD function 318, 323
mode, calculating 201–202, 207–213, 239–244
MODIFY statement 17, 24, 165
MPRINT option 329
N
name 14
NEXT method 103–106, 110, 111, 112, 114, 207, 208, 211–212, 236
nomenclature
hash object 10
for hash object tools 9
non-grouped data, grouped data vs. 269
non-literal expressions
DATASET argument as 42–43
defining hash variables as 28–29
NOTSORTED option 72, 269–271, 278, 291
numeric expression 50
NUM_ITEMS attribute 44–45, 46, 61, 106–108
O
O(1) notation 7, 11
object-dot syntax 9
one-to-one correspondence 9
open file conflicts, avoiding 39–40
Open-Write-Close cycle 39
operations
data table 8–9
hash tables 8–10
ORDER operation
about 68–69, 153
hash items vs. hash item groups 70–71
hash table order principle 72
hash tools 73
hash-PDV interaction 73
invoking 69
ORDERED argument tag plasticity 69–70
ordering by composite keys 72
OUTPUT operation effects 72
setting SORTEDBY= option 73
ORDERED argument tag 70–71
organization, of hash tables 10–11
output data, order of 43–44
output data set
creating and overwriting 41–42
options for 42
OUTPUT method 38–40, 42–44, 73, 78, 85, 147, 152, 154, 156, 165, 166, 174, 176, 177, 189, 192, 212, 227, 243, 251, 313, 314, 357, 365
OUTPUT operation
about 37–38
avoiding open file conflicts 39–40
creating output data set 41–42
data set member types 40
DATASET argument as non-literal expression 42–43
effects of 72
hash tools 44
Open-Write-Close cycle 39
output data order 43–44
output data set options 42
OUTPUT method 38
overwriting output data set 41–42
OUTPUT statement 107–108, 136, 139, 140
outputs, reordering split 153–155
P
parameter files, defining aggregates using 223–228
parameter type matching 6, 22–26
PDV (program data vector) 5
PDV host variables 22–23
percentiles, calculating 201–202, 205–207, 209–213, 234–236, 239–244
Pitches (Bizarro Ball) 122–123
Player_Candidates (Bizarro Ball) 123–124
PoC (Proof of Concept) 117, 202
PREV method 105–106, 112, 114, 276
procedures
See specific procedures
program data vector (PDV) 5, 6
Proof of Concept (PoC) 117, 202
properties
of hash tables 5–6
key 316
PUT function 240
PUT statement 111, 233–234, 325, 330
Q
queues
about 272–273
implementing 275–276
R
RANK function 321–322
record order 305
REF method 53, 58, 61, 63, 80, 84, 215, 219
REMOVE method 62–63, 65, 111, 163, 189, 278
REMOVEDUP method 89–92, 93, 113
RENAME statement 42, 165, 174, 190
REPLACE method 53, 58, 59, 61, 63, 67, 68, 88, 167, 171, 173, 174, 176, 190, 193, 208, 211–212, 219, 281, 313, 315
REPLACEDUP method 89, 92, 93–94, 171, 178, 180
Researching Alternative Pitching case study
about 349, 359
adding metrics 353–354
output data set 354–358
sample program 350–353
residence, of hash tables 5
RETAIN statement 23, 43, 315
RETRIEVE operation
about 64
combining data 134
creating aggregates 193, 194
creating HoH tables 232
data portion offload 310
Direct Retrieve 64–65
DO_OVER vs. FIND = FIND_NEXT 140
hash tools 66
hash-PDV interaction 66
implicit vs. explicit FIND calls 65–66
including facts into SCD tables 176
merging joins 138
performing table lookups 167
unique-key joins 143
RETURN statement 189, 205
right joins 135–137
run time directive 31
Runs (Bizarro Ball) 124–125
run-time assignment statement 95
S
SAS index, uniform split using 332–335
satellite data 5
SCAN function 28, 296, 299
SCD tables
See slowly changing dimension (SCD) tables
SEARCH operation
about 48, 134
argument tag type match 50
assigned CHECK calls 50
explicit search 49
hash tools 51
hash-PDV interaction 51
implicit search 48–49
unassigned CHECK calls 50–51
SELECT statement 145
selective unduplication, data portion offload 312–315
SELECTIVE UPDATE operation 67, 68, 170–171, 178
semicolon (;) 226
SET statement 17, 24, 25, 27–28, 33, 143, 147, 165, 166, 170, 185, 207, 212, 234, 285, 290
SETCUR method 102, 108–109, 112, 114
SHA256 function 339–340
simple key 6
slashline hash table 194
slowly changing dimension (SCD) tables
creating and updating 163–182
defining via metadata 183–185
Type 0 165–167
Type 1 167–168
Type 2 168–173
Type 3 173–176
Type 4 177
Type 6 177–182
updating 186–190
SORT procedure 193, 279–282, 306, 312
SORTC procedure 279–282
SORTEDBY= option 73, 154
SORTN procedure 279–282
splits, multiple 247–260
splitting data
about 144–145
sorted input 145–149
unsorted input 150–152
SQL procedure 161, 163, 193, 224–225, 239, 240, 306
stacks
about 272–273
implementing 273–275
using to find consecutive events 277–279
stand-alone call 51
standard table operations 8
star schema data warehouse
about 159–160
creating Bizarro Ball 182–190
creating fact tables 161–163
creating initial data structure for 185–186
creating slowly changing dimension (SCD) tables 163–182
updating fact tables 161–163
updating slowly changing dimension (SCD) tables 163–182
statements
See specific statements
STOP statement 25, 166, 207, 212, 234, 270
structure, of hash tables 13
subsetting, of data
about 128
adding simple counts 132–134
methods of 128–129
reasons for using hash tables 132
via hash tables 129–132
SUM method 215, 217, 219, 337
SUMINC function 216
SUMMARY procedure 198, 219
syntactic form, for hash object tools 9
syntax, hash object 9–10
%SYSFUNC function 299
T
table lookups, performing 166–167, 171–173, 176, 177, 180–182
table-level operations
about 16
CLEAR operations 36–37
CREATE operation 16–35
DELETE operation 35–36
DESCRIBE operation 44–46
OUTPUT operation 37–44
tables
See data tables
tasks 8
Teams (Bizarro Ball) 125
temporal separation 23
tools
classifications 9
hash object 8–10
TRANWRD function 299
Type 0 slowly changing dimension (SCD) tables 165–167
Type 1 slowly changing dimension (SCD) tables 167–168
Type 2 slowly changing dimension (SCD) tables 168–173
Type 3 slowly changing dimension (SCD) tables 173–176
Type 4 slowly changing dimension (SCD) tables 177
Type 6 slowly changing dimension (SCD) tables 177–182
TYPES statement 219
U
unassigned CHECK calls 50–51
underscore (_) 222
uniform input split
about 315
aggregation via key byte split 321–325
aggregation via partial key split 316–321
joining via key byte split 325–327
using key properties 316
using SAS index 332–335
uniform MD5 split on the fly
about 327–328
aggregation 328–330
joins 330–332
unindexed key search 10–11
Unique Insert 56–57, 57–58
unique-key joins 140–144
unique-key mode 59–60
UNIVARIATE procedure 201
unlocking 111–112
UPDATE ALL operation
about 66–67
assigned vs. unassigned REPLACE calls 68
hash tools 68
hash-PDV interaction 68
implementing 67
implicit vs. explicit REPLACE calls 68
selective 68
UPDATE operation
about 88–89, 92–94
DELETE operation vs. 94
UPDATE statement 17, 24, 165
UUIDGEN function 177
V
variables
defining hash table 20–21
getting from tables 194–198
PDV host 22–23
VNAME function 242
volatility, of hash tables 5
VVALUE function 243
W
WAYS statement 219
What If the Count Is 0-2 After the First Two Pitches case study
about 361, 369
sample program 361–368
WHERE= option 42
18.227.190.211