Index

Note: Page numbers followed by “b”, “f” and “t” indicate boxes, figures and tables respectively.

Accuracy loss, causes of, 148–149

Accuracy measurement, 42

Alias comparators, 217–218

Alignment Comparator for Multi-valued Attributes (ACMA), 215–217, 216f

Ambiguous representation, 24, 24f

American National Standards Institute (ANSI), 191

Application programming interface (API), 93

families, 95–96

GetIdentifier(), 94f

GetIdentifierList(), 96f

GetKeywords(), 95f

identity resolution, 94

Approximate string match (ASM), 47

algorithms, 47

comparators, 209

initial match, 210

Jaro String Comparator, 212

Jaro-Winkler Comparator, 212–213

Levenshtein edit comparator, 210–211

Maximum q-Gram, 211

qTR algorithm, 211–212

transpose, 210

Asserted resolution, 71

confirmation assertions, 74–77

correction assertions, 71–74

Assertion management, 78

See also Structure-split assertion

assertion cart, 80

grouping identifiers, 80

initial login screen, 79f

IVS, 79

home page, 79f

operating modes, 80

Attribute-based projection, 124–125, 124t

One-Pass algorithm using, 134b–140b

R-Swoosh algorithm using, 140b–145b

Attribute-based resolution

See also Batch identity resolution

identity capture and update for, 188–190

iterative update process for ER system, 189f

Attribute-level matching, 46

See also Match key

character strings, 47

comparator, 46

ER and MDM comparators, 47

Soundex algorithm, 47

variation in string values, 47

Attribute(s), 19–20

See also Identity attributes

entropy, 36

level weights, 110–111

uniqueness, 35

weight, 35–37

Automated update process, 66, 67f

See also Manual update process

clerical review indicators, 67

analysis of cases, 68–69

entity resolution and record linking, 67–68

ER assessment, 68

ER outcome analysis and root cause analysis, 68

quality assurance validation processes, 68

cluster-level review indicators, 69–70

IKB, 67

new entity references, 66

pair-level review indicators, 69

Batch identity resolution, 89–90, 90f

See also Attribute-based resolution

client system, 90

managed entity identifiers, 91–92

unmanaged entity identifiers, 91–92

Benchmarking, 38–39

Best record version, 55, 55f

Big Data, 13, 193

challenges, 15

MDM and, 15–16

value-added proposition, 14

Big entities, 188

problems, 188

Blocking, 147

causes of accuracy loss, 148–149

dynamic vs. preresolution, 153–155

ER system, 147

match key, 150

and match rule alignment, 151–152

problem of similarity functions, 152–153

for scoring rules, 158–160

precision, 155–156

as prematching, 149–150

recall, 155–156

Boolean rules, 47–48, 48f, 69, 107, 120–121

See also Hybrid rules; Scoring rules

match key blocking for, 157–158

Bootstrap phase, 168–170

Bring-Your-Own-Identifier (BYOI), 53–54

“Brute force” method, 126

Capture, Store, Resolution, Update, Dispose model (CSRUD model), 28, 161

See also Big Data; CSRUD Life Cycle

attribute-based resolution, 188–190

capture phase and IKB, 179–180

distributed resolution, 165–167

large component, 185

big entity problems, 188

incremental transitive closure, 187–188, 187f

postresolution transitive closure, 186–187, 186f

large-scale ER

for MDM, 161–163

with single match key blocking, 161–163

multiple-index resolution, 165–167

persistent entity identifiers, 181–182

capture based on match keys transitive closure, 183f

Prior EIS, 185

simple update scenario, 182f, 184f

transitive closure of references, 183

record-based resolution, 165–167

single index generator, 162f

transitive closure problem, 163–165

update problem identification, 180–181

Capture phase, 31, 31f

attribute

entropy, 36

uniqueness, 35

weight, 36–37

benchmarking, 38–39

building foundation, 32–33

data matching strategies, 46–50

data preparation, 33–34

ER results assessment, 37–46

identity attributes selection, 34–37

IKB, 31–32

input references, 32

intersection matrix, 39, 40t, 42

equivalent pairs, 41

equivalent references, 41

fundamental law of ER, 41

linked pairs, 42

partition classes, 40–41

partition of set, 39

references with sets of links, 40t

true and false positives and negatives, 41

True Link, 40

problem sets, 39

proposed measures, 44–45

Cluster Comparison method, 45–46

pairwise method, 45

review indicators, 32

truth sets, 38

TWi, 43–44

characteristics, 44

True link and ER link, 44, 45t

truth set evaluation, 44

utility, 44

understanding data, 33

unique identifier, 31

Capture phase, 179–180

Capture process implementation, 50

CDEs, See Critical data elements

CDI, See Customer data integration

CDO, See Chief data officer

Central registry, 58–59

“Certified records”, See “Golden records”

Chief data officer (CDO), 9, 116

Chief information officer (CIO), 116

Churn rate, 6–7

CIO, See Chief information officer

Clerical review indicators, 67

analysis of cases, 68–69

entity resolution and record linking, 67–68

assessment, 68

outcome analysis and root cause analysis, 68

quality assurance validation processes, 68

Closed universe models, 99–100

Cluster Comparison method, 45–46

Cluster-level matching, 50

Cluster-level review indicators, 69–70

Cluster-to-cluster classification, 122, 126

attribute-based projection, 124–125, 124t

record-based projection, 123

reference-to-cluster

classification, 124–125

match scenario, 123f

transitive closure, 125–126

unique reference assumption, 125–126

CoDoSA, See Compressed Document Set Architecture

Comma-separated values (CSV), 163, 197–198

Common Object Request Broker Architecture (CORBA), 94

Comparator, 46

Compressed Document Set Architecture (CoDoSA), 163

Confidence scores, 96

depth and degree of match, 97–99

match context, 99–100

model, 100–102

Confirmation assertions, 74

reference-to-reference assertion, 76, 77f

reference-to-structure assertion, 77, 77f

true negative assertion, 75–76, 76f

true positive assertion, 74–75, 75f

Conformance to data specifications, 199–200

ISO 8000 standard, 202

message and supporting references, 201

message referencing data specification, 201f

multiple-record schema, 200f

single-record message structure, 200f

XML elements, 202

CORBA, See Common Object Request Broker Architecture

Correction assertions, 71

reference-transfer assertion, 74, 74f

structure-split assertion, 72, 73f

levels of grouping, 73

synchronization of identifiers, 73

transactions, 73

structure-to-structure assertion, 71, 72f

EIS, 72

set of assertion transactions, 72

Critical data elements (CDEs), 34

CRM, See Customer relationship management

CRUD model, 27

CSRUD Life Cycle, 119

See also Automated update process

automated update configuration, 180–181

update problem identification, 180–181

CSRUD model, See Capture, Store, Resolution, Update, Dispose model

CSV, See Comma-separated values

Customer data integration (CDI), 8, 55

Customer recognition, 89

Customer relationship management (CRM), 6–7, 55

Customer satisfaction, 6–8

Data

preparation, 33–34

quality, 191–193

science, 14

scientists, 15

Data governance program (DG program), 9–10

adoption, 10

control, 10

data stewardship model, 10

DBA, 9–10

Data matching strategies, 46

attribute-level matching, 46

character strings, 47

comparator, 46

ER and MDM comparators, 47

Soundex algorithm, 47

variation in string values, 47

Boolean rules, 47–48, 48f

capture process implementation, 50

cluster-level matching, 50

hybrid rules, 49–50

MDM, 46

reference-level matching, 47

scoring rule, 48–49, 49f

Data stewardship, 65

asserted resolution, 71–77

automated update process, 66–70

CSRUD life cycle, 65

EIS visualization tools, 77–83

entity identifiers management, 84–87

manual update process, 66, 70–71

model, 10

rate of change, 66

root cause of information quality issues, 65

Data warehousing (DW), 6–7

Database administrator (DBA), 9–10

Dedicated MDM systems, 55–58

Deduplication phase, 169, 171–177

Depth and degree of match, 97–99

Deterministic matching, 119–121

DG program, See Data governance program

Distributed resolution, 165

references and match keys as graph, 166–167

transitive closure as graph problem, 165–166

DW, See Data warehousing

Dynamic blocking, 153–155

E-R database model, See Entity-relation database model

ECCMA, See Electronic Commerce Code Management Association

EIIM, See Entity identity information management

EIS, See Entity identity structure

Electronic Commerce Code Management Association (ECCMA), 191

Entity identifiers management, 84

models for, 85

pull model, 85–87

push model, 87

problem of association information latency, 84–85

Entity identity information management (EIIM), 3–4, 10–11, 21–22, 27, 53, 115

configurations, 119

EIS, 4–6

ER and data structures, 4

false negative error, 22

false positive error, 22

and Fellegi-Sunter, 115–116

goal of, 22

identity information, 4

life cycle management models, 27

CSRUD model, 28

Loshin model, 27–28

POSMAD model, 27

“matching” records, 6

“merge-purge” operation, 5

OYSTER open source ER system, 6

SERF, 116

strategies, 53–54

time aspect, 5

Entity identity integrity, 22–23, 23f

ambiguous representation, 24, 24f

culture and expectation, 25

discovery, 26

false negative, 25

incomplete state, 25, 26f

master data table, 22–23

MDM

registry entries, 25–26

system, 24

meaningless state, 25, 25f

primary key value, 23

proper representation, 23–24, 23f

surjective function, 24

Entity identity structure (EIS), 4–6, 21–22, 31, 53, 116

attribute-based, 56, 56f

duplicate record filter, 57

exemplar record, 56

BYOI, 53–54

dedicated MDM systems, 55–58

EIIM strategies, 53–54

ER algorithms and, 58

IKB, 58–60

O&D MDM, 54

record-based, 56, 57f, 58

with duplicate record filter, 57f

with exemplar record, 58f

issue with, 57

with record filter and exemplar record, 58f

storing vs. sharing, 59–60

survivor record strategy, 55

best record version, 55, 55f

exemplar record, 55f, 56

rules, 56

versions, 55

visualization tools, 77–78

assertion management, 78–80

negative resolution review mode, 81–82, 83f

positive resolution review mode, 83, 85f

search mode, 80–81, 81f

Entity resolution (ER), 3–4, 18, 53, 119, 165

appropriate algorithm selection, 126–145

checklist, 119

deterministic, 119–121

weights calculation, 121–122

cluster-to-cluster classification, 122–126

comparators

alias comparators, 217–218

ASM comparators, 209–213

multivalued comparators, 213–217

phonetic comparators, 218

token comparators, 213–217

consistency, 115

with consistent classification, 5f

de-duplication applications, 3–4

exact match and standardization, 207

overcoming variation in string values, 208–209

scanning comparators, 209

standardizing, 207–208

fundamental law, 19

information quality, 4

key data cleansing process, 3

using Null Rule, 177–179

One-Pass algorithm, 128–145

outcomes measurements, 42

accuracy measurement, 42

F-Measure, 43

false negative rate, 43

false positive rate, 43

R-Swoosh algorithm, 137b–142b

results assessment, 37–46

set of references, 114–115

Entity-relation database model (E-R database model), 11

Entity/entities, 17–18

of entities, 12

entity-based data integration, 6–8

reference, 18

resolution problem, 19

ER, See Entity resolution

Exemplar record, 55f, 56

eXtensible Business Reporting Language (XBRL), 197

Extensible markup language (XML), 191

External reference architecture, 60–61, 61f

F-Measure, 43

False negatives (FN), 43

errors, 22, 148

rate, 43

False positives (FP), 43

errors, 22, 148

rate, 43

Fellegi-Sunter Theory of Record Linking, 67–68, 105

context and constraints of record linkage, 105–106

EIIM and, 115–116

fundamental Fellegi-Sunter theorem, 108–110

matching rule, 106–107

scoring rule, 110–111

attribute level weights and, 110–111

frequency-based weights and, 112

FN, See False negatives

Format variation, 208

FP, See False positives

Frequency-based weights, 112

“Fuzzy” match, 46, 49

Garbage-in-garbage-out rule (GIGO rule), 92

Global Justice XML Data Model (GJXML), 197

“Golden records”, 1, 203–204

Google™, 14

Hadoop File System (HDFS), 91, 161, 179

Hadoop implementation, 175–177

Hadoop Map/Reduce framework, 161–162

Hash keys, 151

Hashing algorithms, 151

Hierarchical MDM, 12

Hybrid rules, 49–50

See also Boolean rules; Scoring rules

IAIDQ, See International Association for Information and Data Quality

IAIDQ Domains of Information Quality, 192

Identification Guide (IG), 203

Identity, internal vs. external view, 19–20

issues, 20

merge-purge process, 21

occupancy history, 20, 20f

occupancy records, 21

Identity attributes, 17, 19–20

internal view of identity, 20

selection, 34

measures, 35

primary identity attributes, 34–35

supporting identity attributes, 35

Identity knowledge base (IKB), 31, 58–60, 66, 179–180

Identity resolution, 89

access modes, 89

batch identity resolution, 89–92, 90f

interactive identity resolution, 92–93, 93f

API, 94–96

confidence scores, 96–102

Identity Visualization System (IVS), 78, 79f

IG, See Identification Guide

IKB, See Identity knowledge base

Incomplete state, 25, 26f

Incremental transitive closure, 187–188, 187f

Information quality, 191–193

Information Quality Certified Professional (IQCP), 4, 192

Information retrieval (IR), 155

Informed linking, See Asserted resolution

Interactive identity resolution, 92–93, 93f

See also Batch identity resolution

International Association for Information and Data Quality (IAIDQ), 192

International Organization for Standardization (ISO), 191

See also ISO 8000-110 standard

data quality vs. information quality, 191–193

relevance to MDM, 193

Intersection matrix, 39, 40t, 42

equivalent pairs, 41

equivalent references, 41

fundamental law of ER, 41

linked pairs, 42

partition classes, 40–41

partition of set, 39

references with sets of links, 40t

true and false positives and negatives, 41

True Link, 40

Inverted indexing, 150

IQCP, See Information Quality Certified Professional

IR, See Information retrieval

ISO, See International Organization for Standardization

ISO 8000-110 standard, 191

adding new parts, 203

accuracy, 204

completeness, 204–205

provenance, 204

components, 196

conformance to data specifications, 199–202

general requirements, 196

message referencing a data specification, 201f

multiple-record schema, 200f

semantic encoding, 198–199

single-record message structure, 200f

syntax of message, 197–198

goals, 193

ISO 22745 standard industrial systems and integration, 203

motivational example, 194–196

scope, 193–194

simple and strong compliance with, 202–203

unambiguous and portable data, 193

Iteration phase, 169–171

IVS, See Identity Visualization System

Jaccard coefficient, 213–214

Jaro String Comparator, 212

Jaro-Winkler Comparator, 212–213

Key-value pairs, decoding, 163

Knowledge-based linking, See Asserted resolution

“Large entity” problem, 150

Large-scale ER

for MDM, 161–163

with single match key blocking, 161

decoding key-value pairs, 163

Hadoop Map/Reduce framework, 162

single index generator, 162f

Latent semantic analysis, 218

Left-to-right (LR), 158

Levenshtein edit comparator, 210–211

Levenshtein Edit Distance comparator, 47

Link append process, 91

Loshin model, 27–28

LR, See Left-to-right

Managed entity identifiers, 91–92

Manual update process, 66, 70–71

See also Automated update process

Master data, 1

Master data management (MDM), 1–4

architectures, 60

external reference architecture, 60–61, 61f

reconciliation engine, 63

registry architecture, 61–63

transaction hub architecture, 63–64

business case for, 6

better security, 10–11

better service, 8

cost reduction of poor data quality, 9

customer satisfaction and entity-based data integration, 6–8

success measurement, 11

components, 3f

DG program, 9–10

adoption, 10

control, 10

data stewardship model, 10

DBA, 9–10

dimensions, 11

hierarchical MDM, 12

multi-channel MDM, 13

multi-cultural MDM, 13

multi-domain MDM, 11–12

policies, 2

relevance to, 193

system using background and foreground operations, 59f

Match context, 99

closed universe models, 99–100

confidence score model, 100–102

open universe models, 99–100

Match key, 151

See also Attribute-level matching

blocking, 150

for Boolean rules, 157–158

and match rule alignment, 151–152

preresolution blocking with multiple, 154–155

problem of similarity functions, 152–153

for scoring rules, 158–160

generators, 151

indexing, 150

Match threshold, 111

Matching rule, 106–107

“Matching” records, 6

Maximum q-Gram, 211

MDM, See Master data management

Meaningless state, 25, 25f

Merge-purge

operation, 5

process, 21, 26

Metadata, 2

Multi-channel MDM, 13

Multi-cultural MDM, 13

Multiple-index resolution, 165

references and match keys as graph, 166–167

transitive closure as graph problem, 165–166

Multivalued comparators, 213–217

n-Gram algorithms, 211

N-squared problem, 15–16

Natural language processing (NLP), 14

Negative resolution review mode, 81–82, 83f

North Atlantic Treaty Organization (NATO), 193, 203

Null Rule, ER using, 177–178

Occupancy history, 20, 20f

Once-and-Done MDM (O&D MDM), 54

One-Pass algorithm, 128

using attribute-based projection, 134b–136b

input reordered, 137b–140b

using record-based projection, 128b–131b

input reordered, 131b–133b

Open Technical Dictionary (OTD), 203

Open universe models, 99–100

OYSTER open source ER system, 6, 7f

Pair-level review indicators, 69

Pairwise method, 45

Party domain, 11

Pattern ratio, 108

Period entities, 11–12

Persistent identifiers, 26–27, 84

Phonetic comparators, 218

Phonetic encoding algorithms, 151

Phonetic variation, 208

Place domain, 11–12

Point-of-sale (POS), 92–93

Positive resolution review mode, 83, 85f

POSMAD model, 27

Postresolution transitive closure, 186–187, 186f

Precision, 43, 127

Prematching, blocking as, 149–150

Preprocess standardization, 207–208

Preresolution blocking, 153–155

Primary identity attributes, 34–35

Probabilistic matching, 37, 119–121

Problem sets, 39

Product domain, 11–12

Proper representation, 23–24, 23f

Pull model, 85–87

Push model, 87

q-Gram algorithms, 211

q-Gram Tetrahedral Ratio algorithm (qTR algorithm), 211–212

R-Swoosh algorithm, 115, 137b–140b

using attribute-based projection, 140b–142b

input reordered, 142b–145b

Radio frequency tag identification (RFID), 54

RDM, See Reference data management

Recall, 43, 126

Reconciliation engine, 63

Record linking, 105–106

Record-based projection, 123, 165

One-Pass algorithm using, 125b–133b

references and match keys as graph, 166–167

transitive closure as graph problem, 165–166

Reference

codes, 2

data, 2

Reference data management (RDM), 1

Reference-level matching, 47

Reference-to-cluster classification, 124–125

Reference-to-reference assertion, 76, 77f

Reference-to-structure assertion, 77, 77f

Reference-transfer assertion, 74, 74f

Registry architecture, 61

hub organization, 62–63

IKB and systems, 62

reference, 61–62

schema, 61f

semantic encoding, 62

trusted broker architecture, 62

Representational State Transfer (REST), 94

RESTful APIs, 94

Return-on-investment (ROI), 11

Review indicators, 32

Review threshold, 111

RFID, See Radio frequency tag identification

ROI, See Return-on-investment

Root mean square (RMS), 216

SaaS, See Software-as-a-service

Scanning comparators, 209

Scoring rules, 48–49, 49f, 69, 110–111, 122

See also Boolean rules; Hybrid rules

attribute level weights and, 110–111

frequency-based weights and, 112

match key blocking for, 158–160

Search mode, 80–81, 81f

Semantic encoding, 62, 193, 198–199

SERF, See Stanford Entity Resolution Framework

Service level agreement (SLA), 89–90, 196

Shannon’s Schematic for Communication, 18

SLA, See Service level agreement

Social security number (SSN), 34–35, 158

Soft rules, 67–68

Software-as-a-service (SaaS), 10

SOR, See Systems of record

Soundex algorithm, 47, 218

Soundex comparator, 218

SQL, See Structured query language

SSN, See Social security number

Standard blocking, 150

Stanford Entity Resolution Framework (SERF), 112–113, 116, 137b–140b

abstraction of match, 113–114

consistent ER, 115

merge operations, 113–114

R-Swoosh algorithm, 115

set of references ER, 114–115

Structured query language (SQL), 179

Structure-split assertion, 72, 73f

See also Assertion management

levels of grouping, 73

synchronization of identifiers, 73

transactions, 73

Structure-to-structure assertion, 71, 72f

EIS, 72

set of assertion transactions, 72

Supporting identity attributes, 35

Surjective function, 24

Surrogate identity, 18

Survivor record strategy, 55

best record version, 55, 55f

exemplar record, 55f, 56

rules, 56

versions, 55

Syntax of message, 197–198

System hub, See Central registry

Systems of record (SOR), 1

TAG, See U.S. Technical Advisory Group

Taguchi’s Loss Function, 9

Talburt-Wang Index (TWi), 43–44

characteristics, 44

True link and ER link, 44, 45t

truth set evaluation, 44

utility, 44

Technical Committee (TC), 191

Term frequency-inverse document frequency (tf-idf), 214

cosine similarity, 214–215

Theoretical foundations

EIIM, 115–116

Fellegi-Sunter Theory of Record Linking, 105–112

SERF, 112–115

Token comparators, 213–217

Transaction hub architecture, 63–64

Transitive closure, 125–126

as graph problem, 165–166

incremental, 187–188, 187f

iterative, nonrecursive algorithm for, 167–168

bootstrap phase, 168–170, 173t

deduplication phase, 169, 171–177, 174t

distributed processing, 168

Hadoop implementation example, 175–177

iteration phase, 169–171

key-value pairs, 168–169

postresolution, 186–187, 186f

problem, 163

ER process, 165

match key generators, 164

match key values, 164t

True Link, 40

True negative assertion, 75–76, 76f

True positive assertion, 74–75, 75f

Trusted broker architecture, 62

Truth sets, 38

TWi, See Talburt-Wang Index

U.S. Technical Advisory Group (TAG), 191

Uniform resource identifiers (URI), 198

Unique reference assumption, 18, 125–126

Universal Product Code (UPC), 19–20

Unmanaged entity identifiers, 91–92

Variation in string values, 208–209

Very large database system (VLDBS), 59–60

Weak rules, 67–69

XBRL, See eXtensible Business Reporting Language

XML, See Extensible markup language
