Identity matching

In this section, we will cover one important data preparation topic: identity matching and its solutions. We will discuss some of Spark's special features for solving identity issues, as well as some data matching solutions made easy with Spark.

After this section, we will be able to handle common data identity problems with Apache Spark.

Identity issues

For data preparation, we often need to deal with data elements that belong to the same person or unit but do not look alike. For example, we may have purchase data for a customer named Larry Z. and web activity data for L. Zhang. Is Larry Z. the same person as L. Zhang? How many identity variations like this exist in the data?

Matching entities is a big challenge in machine learning data preparation, as these entity variations are very common and can arise for many reasons, such as duplication, errors, name variants, and intentional aliasing. Sometimes it can be very difficult to complete the matching, or even just to find the links, and this work is definitely time consuming. However, it is necessary and extremely important: mismatching produces errors, while failing to match at all introduces bias. At the same time, correct matching has additional value as an aid to group detection, such as uncovering terror cells and drug cartels.

Newer methods, such as fuzzy matching, have been developed to address this issue. In this section, however, we will focus on the commonly used approaches, which include:

  • Manual search with SQL queries.

    This is labor intensive, with few discoveries but good accuracy; a Spark SQL sketch of this approach is shown below.

  • Automated data cleansing.

    This type of approach often adopts a few rules that use the most informative attributes.

  • Lexical similarity.

    This approach is intuitive and useful but can generate many false alarms.

  • Feature and relationship statistics.

    This approach is a good one but does not capture nonlinear effects.

The accuracy of any of the preceding methods often depends on the sparseness and size of the data, and also on whether the task is to resolve duplicates, errors, variants, or aliases.
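To make the first approach concrete, the following is a minimal Spark SQL sketch of a manual identity search. It assumes a customers table has been registered with a SQLContext; the table and column names are hypothetical.

// A manual search for possible identity variants with Spark SQL. The table
// name (customers) and its columns are hypothetical; adjust them to your schema.
// This pulls every customer whose surname starts with Z and whose first name
// starts with L, so an analyst can eyeball candidates such as Larry Z. / L. Zhang.
val candidates = sqlContext.sql("""
  SELECT customer_id, first_name, last_name
  FROM customers
  WHERE last_name LIKE 'Z%' AND first_name LIKE 'L%'""")
candidates.show()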

Identity matching on Spark

As in the previous section, we will review some methods that utilize SampleClean to deal with entity matching issues, even though the most commonly used tools for this task are Spark SQL and R.

Entity resolution

SampleClean offers an easy-to-use interface for basic entity matching tasks through its EntityResolution class, which wraps some common deduplication programming patterns.

A basic EntityResolution workflow involves the following steps:

  1. Identifying a column of inconsistent categorical attributes.
  2. Linking together similar attributes.
  3. Selecting a single canonical representation of the linked attributes.
  4. Applying changes to the data.

Short string comparison

Here, we have a column of short strings that are inconsistently represented (for example, United States and USA). The EntityResolution.shortAttributeCanonicalize function takes as input the current context, the name of the working set to clean, the column to fix, and a threshold in [0, 1], where 0 merges everything and 1 merges only exact matches. It uses EditDistance as its default similarity metric. The following is a coding example:

val algorithm = EntityResolution.shortAttributeCanonicalize(scc, workingSetName, columnName, threshold)
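To put this in context, the following is a minimal end-to-end sketch. It assumes a SampleCleanContext (scc) has already been created and a working set loaded; the working-set name, column name, and threshold are illustrative, and exec() is assumed to launch the algorithm in the same way as SampleClean's other algorithms.

// A sketch, assuming scc (a SampleCleanContext) exists and the working set has
// been loaded; the names and threshold below are illustrative.
val workingSetName = "restaurants_sample"
val columnName = "city"
val threshold = 0.9  // merge only near-exact matches

val algorithm = EntityResolution.shortAttributeCanonicalize(scc, workingSetName, columnName, threshold)
algorithm.exec()  // assumed to run the canonicalization over the working set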

Long string comparison

Here, we have a column of long strings, such as addresses, that are close but not exact. The basic strategy is to tokenize these strings and compare the sets of words rather than the whole strings. It uses the WeightedJaccard similarity metric by default. The following is a coding example:

val algorithm = EntityResolution.longAttributeCanonicalize(scc, workingSetName, columnName, threshold)
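To see why tokenizing helps, here is a small self-contained sketch of plain (unweighted) Jaccard similarity between two tokenized addresses. SampleClean's default WeightedJaccard metric additionally gives rarer tokens more weight.

// Plain (unweighted) Jaccard similarity between token sets, for intuition only;
// SampleClean's WeightedJaccard also weights rarer, more informative tokens higher.
def jaccard(a: String, b: String): Double = {
  val ta = a.toLowerCase.split("\\s+").toSet
  val tb = b.toLowerCase.split("\\s+").toSet
  ta.intersect(tb).size.toDouble / ta.union(tb).size
}

// Near-duplicate addresses share most tokens even though the raw strings differ:
jaccard("24 Main Street Berkeley CA", "24 Main St. Berkeley CA")  // ~0.67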

Record deduplication

A more advanced deduplication task arises when entire records, rather than individual columns, are inconsistent; that is, multiple records refer to the same real-world entity. RecordDeduplication uses the long-attribute similarity metrics by default. The following is a coding example:

RecordDeduplication.deduplication(scc, workingSetName, columnProjection, threshold)
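As a usage sketch, columnProjection is assumed to be the list of columns compared between records; the working-set name and columns below are illustrative.

// Assumes scc and a loaded working set, as above. The working-set name and the
// projected columns are illustrative; columnProjection is assumed to list the
// columns compared between records.
val columnProjection = List("name", "address", "city")
val algorithm = RecordDeduplication.deduplication(scc, "restaurants_sample", columnProjection, 0.9)
algorithm.exec()  // assumed to run deduplication over the working set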

Note

For more information, see the SampleClean guide at http://sampleclean.org/guide/.

Identity matching made better

As with data cleaning, using SampleClean and Spark together lets us write less code and utilize less data, as demonstrated in the previous section. As discussed, automated cleaning is easy and fast, but its accuracy may not be good. A common way to improve it is to bring more people in for labor-intensive verification based on crowdsourcing.

Here, SampleClean combines Algorithms, Machines, and People in its crowdsourced deduplication.

Crowdsourced deduplication

As crowdsourcing scales poorly to very large datasets, the SampleClean system asks the crowd to deduplicate only a sample of the data and then trains predictive models to generalize the crowd's work to the entire dataset. In particular, SampleClean applies Active Learning to sample the points that lead to a good model quickly.

Configuring the crowd

To clean data using crowd workers, SampleClean relies on the open source AMPCrowd service, which supports multiple crowd platforms and provides automated quality control. Users must therefore have a running installation of AMPCrowd. In addition, crowd operators must be configured to point to the AMPCrowd server by passing them CrowdConfiguration objects.
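For example, a configuration pointing to a local AMPCrowd server might look like the following sketch. The parameter names are assumptions, so verify them against the SampleClean guide.

// A sketch of a crowd configuration; the field names below are assumptions
// based on an AMPCrowd server running locally on its default port.
val crowdConfig = CrowdConfiguration(
  crowdName = "internal",        // AMPCrowd's built-in local worker crowd
  crowdServerHost = "127.0.0.1", // host of the running AMPCrowd installation
  crowdServerPort = 8000)        // AMPCrowd's default port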

Using the crowd

SampleClean currently provides one main crowd operator: ActiveLearningMatcher. This is an add-on step to an existing EntityResolution algorithm that trains a crowd-supervised model to predict duplicates. Take a look at the following code:

createCrowdMatcher(scc: SampleCleanContext, attribute: String, workingSetName: String)
val crowdMatcher = EntityResolution.createCrowdMatcher(scc, attribute, workingSetName)

Make sure to configure the matcher with the crowd configuration, as follows:

crowdMatcher.alstrategy.setCrowdParameters(crowdConfig)

To add this matcher to existing algorithms, use the following function:

addMatcher(matcher: Matcher)
algorithm.components.addMatcher(crowdMatcher)
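Putting these pieces together, a crowd-assisted entity resolution run might look like the following sketch, with all names carried over from the earlier illustrative examples.

// Putting it together: canonicalize a column, then add a crowd-supervised
// matcher before executing. All names follow the earlier sketches.
val algorithm = EntityResolution.shortAttributeCanonicalize(scc, workingSetName, columnName, threshold)
val crowdMatcher = EntityResolution.createCrowdMatcher(scc, columnName, workingSetName)
crowdMatcher.alstrategy.setCrowdParameters(crowdConfig)
algorithm.components.addMatcher(crowdMatcher)
algorithm.exec()  // assumed to run the full pipeline, including crowd tasks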