Index

Note: Page numbers followed by “b”, “f” and “t” indicate boxes, figures and tables respectively.

A
Accuracy loss, causes of, 148–149
Accuracy measurement, 42
Alias comparators, 217–218
Alignment Comparator for Multi-valued Attributes (ACMA), 215–217, 216f
Ambiguous representation, 24, 24f
American National Standards Institute (ANSI), 191
Application programming interface (API), 93
families, 95–96
GetIdentifier(), 94f
GetIdentifierList(), 96f
GetKeywords(), 95f
identity resolution, 94
Approximate string match (ASM), 47
algorithms, 47
comparators, 209
initial match, 210
Jaro String Comparator, 212
Jaro-Winkler Comparator, 212–213
Levenshtein edit comparator, 210–211
Maximum q-Gram, 211
qTR algorithm, 211–212
transpose, 210
Asserted resolution, 71
confirmation assertions, 74–77
correction assertions, 71–74
Assertion management, 78
assertion cart, 80
grouping identifiers, 80
initial login screen, 79f
IVS, 79
home page, 79f
operating modes, 80
login identifier, 78
Attribute-based projection, 124–125, 124t
One-Pass algorithm using, 134b–140b
R-Swoosh algorithm using, 140b–145b
Attribute-based resolution
identity capture and update for, 188–190
iterative update process for ER system, 189f
Attribute-level matching, 46
See also Match key
character strings, 47
comparator, 46
ER and MDM comparators, 47
Soundex algorithm, 47
variation in string values, 47
Attribute(s), 19–20
entropy, 36
level weights, 110–111
uniqueness, 35
weight, 35–37
Automated update process, 66, 67f
clerical review indicators, 67
analysis of cases, 68–69
entity resolution and record linking, 67–68
ER assessment, 68
ER outcome analysis and root cause analysis, 68
quality assurance validation processes, 68
cluster-level review indicators, 69–70
IKB, 67
new entity references, 66
pair-level review indicators, 69
B
Batch identity resolution, 89–90, 90f
client system, 90
managed entity identifiers, 91–92
unmanaged entity identifiers, 91–92
Benchmarking, 38–39
Best record version, 55, 55f
Big Data, 13, 193
challenges, 15
MDM and, 15–16
value-added proposition, 14
Big entities, 188
problems, 188
Blocking, 147
causes of accuracy loss, 148–149
dynamic vs. preresolution, 153–155
ER system, 147
match key, 150
and match rule alignment, 151–152
problem of similarity functions, 152–153
for scoring rules, 158–160
precision, 155–156
as prematching, 149–150
recall, 155–156
Boolean rules, 47–48, 48f, 69, 107, 120–121
match key blocking for, 157–158
Bootstrap phase, 168–170
Bring-Your-Own-Identifier (BYOI), 53–54
“Brute force” method, 126
C
Capture, Store, Resolution, Update, Dispose model (CSRUD model), 28, 161
attribute-based resolution, 188–190
capture phase and IKB, 179–180
distributed resolution, 165–167
large component, 185
big entity problems, 188
incremental transitive closure, 187–188, 187f
postresolution transitive closure, 186–187, 186f
large-scale ER
for MDM, 161–163
with single match key blocking, 161–163
multiple-index resolution, 165–167
persistent entity identifiers, 181–182
capture based on match keys transitive closure, 183f
Prior EIS, 185
simple update scenario, 182f, 184f
transitive closure of references, 183
record-based resolution, 165–167
single index generator, 162f
transitive closure problem, 163–165
update problem identification, 180–181
Capture phase, 31, 31f
attribute
entropy, 36
uniqueness, 35
weight, 36–37
benchmarking, 38–39
building foundation, 32–33
data matching strategies, 46–50
data preparation, 33–34
ER results assessment, 37–46
identity attributes selection, 34–37
IKB, 31–32
input references, 32
intersection matrix, 39, 40t, 42
equivalent pairs, 41
equivalent references, 41
fundamental law of ER, 41
linked pairs, 42
partition classes, 40–41
partition of set, 39
references with sets of links, 40t
true and false positives and negatives, 41
True Link, 40
problem sets, 39
proposed measures, 44–45
Cluster Comparison method, 45–46
pairwise method, 45
review indicators, 32
truth sets, 38
TWi, 43–44
characteristics, 44
True link and ER link, 44, 45t
truth set evaluation, 44
utility, 44
understanding data, 33
unique identifier, 31
Capture phase, 179–180
Capture process implementation, 50
Central registry, 58–59
“Certified records”, See “Golden records”
Chief data officer (CDO), 9, 116
Chief information officer (CIO), 116
Churn rate, 6–7
Clerical review indicators, 67
analysis of cases, 68–69
entity resolution and record linking, 67–68
ER
assessment, 68
outcome analysis and root cause analysis, 68
quality assurance validation processes, 68
Closed universe models, 99–100
Cluster Comparison method, 45–46
Cluster-level matching, 50
Cluster-level review indicators, 69–70
Cluster-to-cluster classification, 122, 126
attribute-based projection, 124–125, 124t
record-based projection, 123
reference-to-cluster
classification, 124–125
match scenario, 123f
transitive closure, 125–126
unique reference assumption, 125–126
Comma-separated values (CSV), 163, 197–198
Common Object Request Broker Architecture (CORBA), 94
Comparator, 46
Compressed Document Set Architecture (CoDoSA), 163
Confidence scores, 96
depth and degree of match, 97–99
match context, 99–100
model, 100–102
Confirmation assertions, 74
reference-to-reference assertion, 76, 77f
reference-to-structure assertion, 77, 77f
true negative assertion, 75–76, 76f
true positive assertion, 74–75, 75f
Conformance to data specifications, 199–200
ISO 8000 standard, 202
message and supporting references, 201
message referencing data specification, 201f
multiple-record schema, 200f
single-record message structure, 200f
XML elements, 202
Correction assertions, 71
reference-transfer assertion, 74, 74f
structure-split assertion, 72, 73f
levels of grouping, 73
synchronization of identifiers, 73
transactions, 73
structure-to-structure assertion, 71, 72f
EIS, 72
set of assertion transactions, 72
Critical data elements (CDEs), 34
CRUD model, 27
CSRUD Life Cycle, 119
automated update configuration, 180–181
update problem identification, 180–181
Customer data integration (CDI), 8, 55
Customer recognition, 89
Customer relationship management (CRM), 6–7, 55
Customer satisfaction, 6–8
D
Data
preparation, 33–34
quality, 191–193
science, 14
scientists, 15
Data governance program (DG program), 9–10
adoption, 10
control, 10
data stewardship model, 10
DBA, 9–10
Data matching strategies, 46
attribute-level matching, 46
character strings, 47
comparator, 46
ER and MDM comparators, 47
Soundex algorithm, 47
variation in string values, 47
Boolean rules, 47–48, 48f
capture process implementation, 50
cluster-level matching, 50
hybrid rules, 49–50
MDM, 46
reference-level matching, 47
scoring rule, 48–49, 49f
Data stewardship, 65
asserted resolution, 71–77
automated update process, 66–70
CSRUD life cycle, 65
EIS visualization tools, 77–83
entity identifiers management, 84–87
manual update process, 66, 70–71
model, 10
rate of change, 66
root cause of information quality issues, 65
Data warehousing (DW), 6–7
Database administrator (DBA), 9–10
Dedicated MDM systems, 55–58
Deduplication phase, 169, 171–177
Depth and degree of match, 97–99
Deterministic matching, 119–121
DG program, See Data governance program
Distributed resolution, 165
references and match keys as graph, 166–167
transitive closure as graph problem, 165–166
Dynamic blocking, 153–155
E
E-R database model, See Entity-relation database model
Electronic Commerce Code Management Association (ECCMA), 191
Entity identifiers management, 84
models for, 85
pull model, 85–87
push model, 87
problem of association information latency, 84–85
Entity identity information management (EIIM), 3–4, 10–11, 21–22, 27, 53, 115
configurations, 119
EIS, 4–6
ER and data structures, 4
false negative error, 22
false positive error, 22
and Fellegi-Sunter, 115–116
goal of, 22
identity information, 4
life cycle management models, 27
CSRUD model, 28
Loshin model, 27–28
POSMAD model, 27
“matching” records, 6
“merge-purge” operation, 5
OYSTER open source ER system, 6
SERF, 116
strategies, 53–54
time aspect, 5
Entity identity integrity, 22–23, 23f
ambiguous representation, 24, 24f
culture and expectation, 25
discovery, 26
false negative, 25
incomplete state, 25, 26f
master data table, 22–23
MDM
registry entries, 25–26
system, 24
meaningless state, 25, 25f
primary key value, 23
proper representation, 23–24, 23f
surjective function, 24
Entity identity structure (EIS), 4–6, 21–22, 31, 53, 116
attribute-based, 56, 56f
duplicate record filter, 57
exemplar record, 56
BYOI, 53–54
dedicated MDM systems, 55–58
EIIM strategies, 53–54
ER algorithms and, 58
IKB, 58–60
O&D MDM, 54
record-based, 56, 57f, 58
with duplicate record filter, 57f
with exemplar record, 58f
issue with, 57
with record filter and exemplar record, 58f
storing vs. sharing, 59–60
survivor record strategy, 55
best record version, 55, 55f
exemplar record, 55f, 56
rules, 56
versions, 55
visualization tools, 77–78
assertion management, 78–80
negative resolution review mode, 81–82, 83f
positive resolution review mode, 83, 85f
search mode, 80–81, 81f
Entity resolution (ER), 3–4, 18, 53, 119, 165
appropriate algorithm selection, 126–145
checklist, 119
deterministic, 119–121
weights calculation, 121–122
cluster-to-cluster classification, 122–126
comparators
alias comparators, 217–218
ASM comparators, 209–213
multivalued comparators, 213–217
phonetic comparators, 218
token comparators, 213–217
consistency, 115
with consistent classification, 5f
de-duplication applications, 3–4
exact match and standardization, 207
overcoming variation in string values, 208–209
scanning comparators, 209
standardizing, 207–208
fundamental law, 19
information quality, 4
key data cleansing process, 3
using Null Rule, 177–179
One-Pass algorithm, 128–145
outcomes measurements, 42
accuracy measurement, 42
F-Measure, 43
false negative rate, 43
false positive rate, 43
R-Swoosh algorithm, 137b–142b
results assessment, 37–46
set of references, 114–115
Entity-relation database model (E-R database model), 11
Entity/entities, 17–18
of entities, 12
entity-based data integration, 6–8
reference, 18
resolution problem, 19
Exemplar record, 55f, 56
eXtensible Business Reporting Language (XBRL), 197
Extensible markup language (XML), 191
External reference architecture, 60–61, 61f
F
F-Measure, 43
False negatives (FN), 43
errors, 22, 148
rate, 43
False positives (FP), 43
errors, 22, 148
rate, 43
Fellegi-Sunter Theory of Record Linking, 67–68, 105
context and constraints of record linkage, 105–106
EIIM and, 115–116
fundamental Fellegi-Sunter theorem, 108–110
matching rule, 106–107
scoring rule, 110–111
attribute level weights and, 110–111
frequency-based weights and, 112
Format variation, 208
Frequency-based weights, 112
“Fuzzy” match, 46, 49
G
Garbage-in-garbage-out rule (GIGO rule), 92
Global Justice XML Data Model (GJXML), 197
“Golden records”, 1, 203–204
Google™, 14
H
Hadoop File System (HDFS), 91, 161, 179
Hadoop implementation, 175–177
Hadoop Map/Reduce framework, 161–162
Hash keys, 151
Hashing algorithms, 151
Hierarchical MDM, 12
Hybrid rules, 49–50
I
IAIDQ Domains of Information Quality, 192
Identification Guide (IG), 203
Identity, internal vs. external view, 19–20
issues, 20
merge-purge process, 21
occupancy history, 20, 20f
occupancy records, 21
Identity attributes, 17, 19–20
internal view of identity, 20
selection, 34
measures, 35
primary identity attributes, 34–35
supporting identity attributes, 35
Identity knowledge base (IKB), 31, 58–60, 66, 179–180
Identity resolution, 89
access modes, 89
batch identity resolution, 89–92, 90f
interactive identity resolution, 92–93, 93f
API, 94–96
confidence scores, 96–102
Identity Visualization System (IVS), 78, 79f
Incomplete state, 25, 26f
Incremental transitive closure, 187–188, 187f
Information quality, 191–193
Information Quality Certified Professional (IQCP), 4, 192
Information retrieval (IR), 155
Informed linking, See Asserted resolution
Interactive identity resolution, 92–93, 93f
International Association for Information and Data Quality (IAIDQ), 192
International Organization for Standardization (ISO), 191
data quality vs. information quality, 191–193
relevance to MDM, 193
Intersection matrix, 39, 40t, 42
equivalent pairs, 41
equivalent references, 41
fundamental law of ER, 41
linked pairs, 42
partition classes, 40–41
partition of set, 39
references with sets of links, 40t
true and false positives and negatives, 41
True Link, 40
Inverted indexing, 150
ISO 8000-110 standard, 191
adding new parts, 203
accuracy, 204
completeness, 204–205
provenance, 204
components, 196
conformance to data specifications, 199–202
general requirements, 196
message referencing a data specification, 201f
multiple-record schema, 200f
semantic encoding, 198–199
single-record message structure, 200f
syntax of message, 197–198
goals, 193
ISO 22745 standard industrial systems and integration, 203
motivational example, 194–196
scope, 193–194
simple and strong compliance with, 202–203
unambiguous and portable data, 193
Iteration phase, 169–171
J
Jaccard coefficient, 213–214
Jaro String Comparator, 212
Jaro-Winkler Comparator, 212–213
K
Key-value pairs, decoding, 163
Knowledge-based linking, See Asserted resolution
L
“Large entity” problem, 150
Large-scale ER
for MDM, 161–163
with single match key blocking, 161
decoding key-value pairs, 163
Hadoop Map/Reduce framework, 162
single index generator, 162f
Latent semantic analysis, 218
Left-to-right (LR), 158
Levenshtein edit comparator, 210–211
Levenshtein Edit Distance comparator, 47
Link append process, 91
Loshin model, 27–28
M
Managed entity identifiers, 91–92
Manual update process, 66, 70–71
Master data, 1
Master data management (MDM), 1–4
architectures, 60
external reference architecture, 60–61, 61f
reconciliation engine, 63
registry architecture, 61–63
transaction hub architecture, 63–64
business case for, 6
better security, 10–11
better service, 8
cost reduction of poor data quality, 9
customer satisfaction and entity-based data integration, 6–8
success measurement, 11
components, 3f
DG program, 9–10
adoption, 10
control, 10
data stewardship model, 10
DBA, 9–10
dimensions, 11
hierarchical MDM, 12
multi-channel MDM, 13
multi-cultural MDM, 13
multi-domain MDM, 11–12
policies, 2
relevance to, 193
system using background and foreground operations, 59f
Match context, 99
closed universe models, 99–100
confidence score model, 100–102
open universe models, 99–100
Match key, 151
blocking, 150
for Boolean rules, 157–158
and match rule alignment, 151–152
preresolution blocking with multiple, 154–155
problem of similarity functions, 152–153
for scoring rules, 158–160
generators, 151
indexing, 150
Match threshold, 111
Matching rule, 106–107
“Matching” records, 6
Maximum q-Gram, 211
Meaningless state, 25, 25f
Merge-purge
operation, 5
process, 21, 26
Metadata, 2
Multi-channel MDM, 13
Multi-cultural MDM, 13
Multiple-index resolution, 165
references and match keys as graph, 166–167
transitive closure as graph problem, 165–166
Multivalued comparators, 213–217
N
n-Gram algorithms, 211
N-squared problem, 15–16
Natural language processing (NLP), 14
Negative resolution review mode, 81–82, 83f
North Atlantic Treaty Organization (NATO), 193, 203
Null Rule, ER using, 177–178
O
Occupancy history, 20, 20f
Once-and-Done MDM (O&D MDM), 54
One-Pass algorithm, 128
using attribute-based projection, 134b–136b
input reordered, 137b–140b
using record-based projection, 128b–131b
input reordered, 131b–133b
Open Technical Dictionary (OTD), 203
Open universe models, 99–100
OYSTER open source ER system, 6, 7f
P
Pair-level review indicators, 69
Pairwise method, 45
Party domain, 11
Pattern ratio, 108
Period entities, 11–12
Persistent identifiers, 26–27, 84
Phonetic comparators, 218
Phonetic encoding algorithms, 151
Phonetic variation, 208
Place domain, 11–12
Point-of-sale (POS), 92–93
Positive resolution review mode, 83, 85f
POSMAD model, 27
Postresolution transitive closure, 186–187, 186f
Precision, 43, 127
Prematching, blocking as, 149–150
Preprocess standardization, 207–208
Preresolution blocking, 153–155
Primary identity attributes, 34–35
Probabilistic matching, 37, 119–121
Problem sets, 39
Product domain, 11–12
Proper representation, 23–24, 23f
Pull model, 85–87
Push model, 87
Q
q-Gram algorithms, 211
q-Gram Tetrahedral Ratio algorithm (qTR algorithm), 211–212
R
R-Swoosh algorithm, 115, 137b–140b
using attribute-based projection, 140b–142b
input reordered, 142b–145b
Radio frequency tag identification (RFID), 54
Recall, 43, 126
Reconciliation engine, 63
Record linking, 105–106
Record-based projection, 123, 165
One-Pass algorithm using, 128b–133b
references and match keys as graph, 166–167
transitive closure as graph problem, 165–166
Reference
codes, 2
data, 2
Reference data management (RDM), 1
Reference-level matching, 47
Reference-to-cluster classification, 124–125
Reference-to-reference assertion, 76, 77f
Reference-to-structure assertion, 77, 77f
Reference-transfer assertion, 74, 74f
Registry architecture, 61
hub organization, 62–63
IKB and systems, 62
reference, 61–62
schema, 61f
semantic encoding, 62
trusted broker architecture, 62
Representational State Transfer (REST), 94
RESTful APIs, 94
Return-on-investment (ROI), 11
Review indicators, 32
Review threshold, 111
Root mean square (RMS), 216
S
Scanning comparators, 209
Scoring rules, 48–49, 49f, 69, 110–111, 122
attribute level weights and, 110–111
frequency-based weights and, 112
match key blocking for, 158–160
Search mode, 80–81, 81f
Semantic encoding, 62, 193, 198–199
Service level agreement (SLA), 89–90, 196
Shannon’s Schematic for Communication, 18
Social security number (SSN), 34–35, 158
Soft rules, 67–68
Software-as-a-service (SaaS), 10
Soundex algorithm, 47, 218
Soundex comparator, 218
Standard blocking, 150
Stanford Entity Resolution Framework (SERF), 112–113, 116, 137b–140b
abstraction of match, 113–114
consistent ER, 115
merge operations, 113–114
R-Swoosh algorithm, 115
set of references ER, 114–115
Structured query language (SQL), 179
Structure-split assertion, 72, 73f
levels of grouping, 73
synchronization of identifiers, 73
transactions, 73
Structure-to-structure assertion, 71, 72f
EIS, 72
set of assertion transactions, 72
Supporting identity attributes, 35
Surjective function, 24
Surrogate identity, 18
Survivor record strategy, 55
best record version, 55, 55f
exemplar record, 55f, 56
rules, 56
versions, 55
Syntax of message, 197–198
System hub, See Central registry
Systems of record (SOR), 1
T
Taguchi’s Loss Function, 9
Talburt-Wang Index (TWi), 43–44
characteristics, 44
True link and ER link, 44, 45t
truth set evaluation, 44
utility, 44
Technical Committee (TC), 191
Term frequency-inverse document frequency (tf-idf), 214
cosine similarity, 214–215
Theoretical foundations
EIIM, 115–116
Fellegi-Sunter Theory of Record Linkage, 105–112
SERF, 112–115
Token comparators, 213–217
Transaction hub architecture, 63–64
Transitive closure, 125–126
as graph problem, 165–166
incremental, 187–188, 187f
iterative, nonrecursive algorithm for, 167–168
bootstrap phase, 168–170, 173t
deduplication phase, 169, 171–177, 174t
distributed processing, 168
Hadoop implementation example, 175–177
iteration phase, 169–171
key-value pairs, 168–169
postresolution, 186–187, 186f
problem, 163
ER process, 165
match key generators, 164
match key values, 164t
True Link, 40
True negative assertion, 75–76, 76f
True positive assertion, 74–75, 75f
Trusted broker architecture, 62
Truth sets, 38
U
U.S. Technical Advisory Group (TAG), 191
Uniform resource identifiers (URI), 198
Unique reference assumption, 18, 125–126
Universal Product Code (UPC), 19–20
Unmanaged entity identifiers, 91–92
V
Variation in string values, 208–209
Very large database system (VLDBS), 59–60
W
Weak rules, 67–69
X