Chapter 3. Entity Matching

In my set of outdoor tools, I have a large hand axe that I have always called a mattock. But my friend from the western United States calls it a Pulaski. When he asks me to hand him the Pulaski, it always gives me a moment of pause. Sometimes we know a thing by more than one name, or two things share the same name, and either case can lead to confusion. This happens with people all the time. Have you ever been mistaken for someone else who shares your first and last name? Have you ever used a nickname or an alias? On a children's playground, 10 women might turn around when they hear a child call out Mom! A man who always goes by the name Bob would be immediately suspicious when an unfamiliar telephone caller asks to speak with Robert. A pharmacy technician gives John T. Smith the medicine intended for John M. Smith, leading to disastrous results.

In this chapter, we are concerned with the accurate identification of entities, or things, and the correct assignment of matching entities. Are these two things really the same, or are they different? Can we determine that two similar-looking things really are the same, based on their other characteristics? To extend the market basket example from Chapter 2, Association Rule Mining, suppose we are a grocery store, and we have our own database full of information about our shoppers and items they have purchased from our store. Now, imagine our store merges with another grocery store chain, and we are now the caretakers of all of their shopping data as well. We would like to see if any of our shoppers also shopped at this other store. But how do we connect the first dataset to the second? Typically, we will look for some unique identifier that is in common between the two datasets. But what if that identifier does not exist? Or what if there is an attribute that both datasets have in common, but its values are not unique?

How can we solve this challenge of entity matching where there is no clear, unambiguous connection between two datasets? In this chapter we will learn:

  • Common strategies for entity matching, including attribute matching, disjoint sets, contextual matching, and profiling a typical match
  • How to evaluate efficacy of the chosen methods, including calculating the precision, recall, and F-measure for a result set
  • How to apply entity matching methods to a real-world problem using data from two separate collections of data about free, libre, and open source software (FLOSS) projects

Before we get started, you may be wondering where in the data mining workflow entity matching fits. Is it data cleaning, or data integration, or data analysis, or what? Typically, entity matching would be considered primarily a data cleaning and data integration step, since its main purpose is to produce a dataset that is as accurate as possible, and one that is ready to be mined for other patterns. However, because we will use pattern-finding techniques in order to accomplish the matching, you will probably notice that it does have some of the flavor of a data analysis step as well.

What is entity matching?

Finding matching items is one of the oldest tasks in database processing, and as databases grow larger and more distributed, this task becomes more and more important. Each time two datasets are merged, questions arise about how to identify duplicates and how to connect items in the first dataset to similar items in the second dataset. When we find ourselves asking Are these two things different even though they have the same name? or Are these other two things the same, even though they have different names? we can apply entity matching techniques to find the answer.

In light of all this concern with the names for an item, it is perhaps appropriate that this task itself has many names: entity matching, entity disambiguation, object consolidation, duplicate identification, merge/purge, and record linkage, to name a few. We will use the term entity matching in this chapter to generically describe this class of activities.

Consider the following examples where entity matching might be helpful. How many different people are likely to be represented in this dataset? Which of these are the same person?

Name
John Smith
John R. Smith
John R. Smith, Jr.
Jon Smith
JON R SMITH, JR

It is very difficult to tell whether these are five different individuals or all variations of the same person, or some number in between. There could be missing middle initials, missing suffixes, spelling problems, and mistakes. We have no idea of the data quality, and no strong clues about how to match these individuals. Suppose we add an Address attribute to the data. Now which of these are the same person?

Name               | Address
John Smith         | 123 Main St.
John R. Smith      | 46 Pine Way
John R. Smith, Jr. | 46 Pine Way
Jon Smith          | 12587 34th St.
JON R SMITH, JR    | 46 Pine Way

With the addition of the Address field, it now appears that there might be three distinct individuals in this dataset: one that lives on Main St, one that lives on Pine Way, and one that lives on 34th St. Of course, we also still have the possibility of an error; for example, John Smith at 123 Main St. might be showing an old address, or the Name value may be spelled wrong.

What if we add an Age attribute? Now which of these are the same person?

Name               | Address        | Age
John Smith         | 123 Main St.   | 55
John R. Smith      | 46 Pine Way    | 48
John R. Smith, Jr. | 46 Pine Way    | 16
Jon Smith          | 12587 34th St. | 34
JON R SMITH, JR    | 46 Pine Way    | 16

By adding additional details, such as Address and Age, we gain information that helps us determine that there are likely four distinct people represented here. Two of the 46 Pine Way rows show a much younger age than the third. Both of those younger entries have a Jr. suffix, perhaps indicating that a father and son both live at that address.

To keep the rows separate, we can now add unique identifiers, as shown with the Id attribute:

Name               | Address        | Age | Id
John Smith         | 123 Main St.   | 55  | 1
John R. Smith      | 46 Pine Way    | 48  | 2
John R. Smith, Jr. | 46 Pine Way    | 16  | 3
Jon Smith          | 12587 34th St. | 34  | 4
JON R SMITH, JR    | 46 Pine Way    | 16  | 3

In an ideal world, we would start with a unique identifier for every entity in our database. Unfortunately, this is not the reality of many data projects. Not only will unique identifiers often be missing, but even clear-cut attributes like Age and Address will be missing, dirty, or unreliable. We might have attribute A for half the data and attribute B for the other half. Consider this dataset:

Name               | Address        | Age | State issuing driver's license
John Smith         | 123 Main St.   | 55  | NC
John R. Smith      | 46 Pine Way    | 48  | VA
John R. Smith, Jr. | 46 Pine Way    | 16  |
Jon Smith          | 12587 34th St. | 34  |
JON R SMITH, JR    | 46 Pine Way    |     | FL

In this example, one of the ages is missing, and two of the driver's license issuing states are missing. Are these two Jr. people really the same in this dataset? Why do they have the same street address as the father but a different driver's license state? Did the son just get his Florida driver's license when he turned 16 but the father forgot to renew his license when they moved from Virginia to Florida? There is a lot of domain knowledge required here to make sense of the potential matches. We must make many complicated assumptions in order to properly assign a match, and yet this is just a simple example with five rows and four columns! Imagine the logic we would have to employ if we really were merging the grocery store shopping data, or data from two banks, or data from two different university record-keeping systems.

Merging data

In the previous examples, we were looking at five records stored in the same simple dataset. Unfortunately, with many data mining projects, we do not begin with entities that are already stored in the same dataset. In such a case, we must merge the data from multiple datasets together into one. Merging data means that we will need to combine multiple datasets, either physically or logically. Physically moving data from one dataset into another may mean moving it into the same tables or files, for example. Creating logical connections between two or more datasets may mean creating views or queries that join multiple disparate tables using a column in common. Either way, we must keep a few data quality principles in mind before merging data:

  • Generally, we want to keep as much data as possible. Throwing away data is anathema and we want to avoid it if at all possible. Create a new column, create a new table, or otherwise move the data rather than deleting it. You never know when you will need that column again.
  • Clean data is better. Merged data should be as uniform as possible, with standardized character sets, formatting, and so forth. I admit I am a bit fussy about having clean data (I did write a book on it, after all), but as you can see from the previous example with John Smith, our job of matching entities will be hard enough without introducing additional cleaning issues on top of everything else!
  • Data should be atomic. This means it needs to be broken down into the smallest parts possible, without losing meaning. Separate address, city, state, and postal code fields will probably be more useful than one giant field with all parts of the address in it.

Merging datasets vertically

Merging two datasets vertically is sometimes called a data append. A vertical merge works best when we have two datasets with the same columns, in the same order, and we simply stack one dataset on top of the other. The resulting dataset may require duplicate identification and purging of the duplicate data. Traditionally, for example in a relational database environment, vertical merging requires that both datasets have the same columns, or else there will be null values. Here is an example of a vertical merge between Dataset 1 and Dataset 2.

Dataset 1:

Name          | Address      | Age | State issuing driver's license
Frank Edwards | 123 Main St. | 55  | NY
Kathryn White | 460 Pine Cr. | 54  | NC
Laura Hartley | 460 Pine Cr. | 20  | NC

Dataset 2:

Name             | Address             | Age | State issuing driver's license
Kathleen Richard | 990 Michigan Ave.   | 23  | FL
Susie Murphy     | 22 Butterfield Cir. | 60  | MO
Laura Hartley    | 460 Pine Cr.        | 18  | NC

The vertically merged dataset is shown here:

Name             | Address             | Age | State issuing driver's license
Frank Edwards    | 123 Main St.        | 55  | NY
Kathryn White    | 460 Pine Cr.        | 54  | NC
Laura Hartley    | 460 Pine Cr.        | 20  | NC
Kathleen Richard | 990 Michigan Ave.   | 23  | FL
Susie Murphy     | 22 Butterfield Cir. | 60  | MO
Laura Hartley    | 460 Pine Cr.        | 18  | NC

The task following a vertical merge is to identify whether Laura Hartley on row 3 is the same Laura Hartley on row 6. If these two people are the same, we may wish to remove one of the rows so that we do not have a duplicate. However, it may not be obvious at first which row is the correct one to keep, since these two entities show different ages, but the same street address. Is Laura 18 or is she 20? Depending on our goals, we may want to merge the data first and remove duplicates second, or we may want to identify duplicates before creating the merged dataset. Or, we may wish to leave both records in, and somehow identify or tag them as being possible duplicates. Different strategies will work for different datasets and different problem domains.
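Sketched in plain Python (with hypothetical, shortened field names standing in for the column headers above), a vertical merge with duplicate tagging might look like this:

```python
from collections import Counter

# Hypothetical records mirroring Dataset 1 and Dataset 2
dataset1 = [
    {"name": "Frank Edwards", "address": "123 Main St.", "age": 55, "dl_state": "NY"},
    {"name": "Kathryn White", "address": "460 Pine Cr.", "age": 54, "dl_state": "NC"},
    {"name": "Laura Hartley", "address": "460 Pine Cr.", "age": 20, "dl_state": "NC"},
]
dataset2 = [
    {"name": "Kathleen Richard", "address": "990 Michigan Ave.", "age": 23, "dl_state": "FL"},
    {"name": "Susie Murphy", "address": "22 Butterfield Cir.", "age": 60, "dl_state": "MO"},
    {"name": "Laura Hartley", "address": "460 Pine Cr.", "age": 18, "dl_state": "NC"},
]

# Vertical merge: simply stack one dataset on top of the other
merged = dataset1 + dataset2

# Rather than deleting anything, tag rows that share a name and address
# as possible duplicates for later review
key_counts = Counter((row["name"], row["address"]) for row in merged)
for row in merged:
    row["possible_dup"] = key_counts[(row["name"], row["address"])] > 1
```

This keeps both Laura Hartley rows in the merged result, marked for a later decision rather than being purged automatically.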

Merging datasets horizontally

A horizontal merge is used when we have datasets that represent the same type of entity, but we have different attributes that describe each one. In this case, a vertical merge will not work because there are too many attributes in one dataset that are not in the other one. Here is an example. Notice that there are some attributes in Dataset 1 that do not appear in Dataset 2.

Dataset 1:

Name             | Address             | Age | State issuing driver's license
Kathleen Richard | 990 Michigan Ave.   | 23  | FL
Susie Murphy     | 22 Butterfield Cir. | 60  | MO
Laura Hartley    | 460 Pine Cr.        | 18  | NC

Dataset 2:

Name             | Year of first subscription | Favorite color | Home state
Kathleen Richard | 1999                       | blue           | FL
Susie Murphy     | 1998                       | purple         | MO
Laura Hartley    | 2012                       | green          | NC

This problem definitely requires entity matching and consolidation. Ideally, we will end up with a dataset that has the attributes from both datasets, and one row for each entity, like the following horizontally merged dataset:

Name             | Year of first subscription | Favorite color | Home state | Address             | Age | State issuing driver's license
Kathleen Richard | 1999                       | blue           | FL         | 990 Michigan Ave.   | 23  | FL
Susie Murphy     | 1998                       | purple         | MO         | 22 Butterfield Cir. | 60  | MO
Laura Hartley    | 2012                       | green          | NC         | 460 Pine Cr.        | 18  | NC

Horizontal merging does not require that both datasets have all the same columns, but there should be something to match on, in order to ensure that we have only one row per entity. In this example case, the Name column is unique and serves as the column in common between the two sets. However, in a real entity matching scenario, we may not be so lucky. Columns may have to be split, transformed, combined, adjusted, or otherwise tweaked in order to find a match.
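A horizontal merge on the shared Name column might be sketched like this in plain Python (the shortened field names are hypothetical stand-ins for the column headers above):

```python
# Hypothetical records mirroring the two datasets above
customers = [
    {"name": "Kathleen Richard", "address": "990 Michigan Ave.", "age": 23, "dl_state": "FL"},
    {"name": "Susie Murphy", "address": "22 Butterfield Cir.", "age": 60, "dl_state": "MO"},
    {"name": "Laura Hartley", "address": "460 Pine Cr.", "age": 18, "dl_state": "NC"},
]
subscriptions = [
    {"name": "Kathleen Richard", "first_subscription": 1999, "favorite_color": "blue", "home_state": "FL"},
    {"name": "Susie Murphy", "first_subscription": 1998, "favorite_color": "purple", "home_state": "MO"},
    {"name": "Laura Hartley", "first_subscription": 2012, "favorite_color": "green", "home_state": "NC"},
]

# Index the second dataset by the shared Name column, then join its
# attributes onto each matching row of the first dataset
by_name = {row["name"]: row for row in subscriptions}
merged = []
for row in customers:
    combined = dict(row)                           # attributes from dataset 1
    combined.update(by_name.get(row["name"], {}))  # attributes from dataset 2
    merged.append(combined)
```

The result is one row per entity with the attributes of both datasets, exactly as in the merged table above.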

Techniques for matching

No matter whether we are working with horizontally or vertically merged data, we will need some way of matching entities. If we are vertically merging, we will use entity matching to find duplicates, and if we are horizontally merging, we will use entity matching to identify a minimum set of rows. What are the common techniques for finding entity matches? What kind of technique works best in various situations, and what are the pros and cons of each?

It turns out that there is no one single best entity matching algorithm, just as there is no one best string similarity algorithm. But we will outline a few of the options in the following sections so that we can choose from among them later in this chapter when we are working on a real project.

Attribute-based similarity matching

Similarity matching on attributes is one of the oldest techniques for matching entities. To do this, we set up a similarity function on the different attributes of each entity, and score each pairwise combination between 0 and 1 as to how similar the entities are on that attribute. For example, given the datasets of people that we presented earlier, we could use the name and street address columns, along with a string matching function. Each pair of records would be scored 0 to 1 on their similarity to each other. This method is not limited to string values; categorical and numeric values can also be mapped to an agree/disagree result on a scale of 0 to 1.
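As a minimal sketch of this idea, we can borrow the standard library's SequenceMatcher as the 0-to-1 similarity function (any string similarity measure could be substituted) and average the attribute scores for each pair:

```python
from difflib import SequenceMatcher
from itertools import combinations

def attr_similarity(a, b):
    """Score the similarity of two attribute values between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical records drawn from the earlier examples
records = [
    {"name": "John R. Smith", "address": "46 Pine Way"},
    {"name": "JON R SMITH, JR", "address": "46 Pine Way"},
    {"name": "Jon Smith", "address": "12587 34th St."},
]

# Score every pairwise combination on each attribute, then average
scores = {}
for r1, r2 in combinations(records, 2):
    name_score = attr_similarity(r1["name"], r2["name"])
    addr_score = attr_similarity(r1["address"], r2["address"])
    scores[(r1["name"], r2["name"])] = (name_score + addr_score) / 2
```

Pairs with identical addresses score noticeably higher than pairs that differ on both attributes, which is exactly the signal we want.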

Be careful of pairwise comparisons

This method has some strengths in that it is quite flexible and can be used on many different types of data. However, as we learned in Chapter 2, Association Rule Mining, any time we need to perform pairwise comparisons, we are looking at a very large number of possible combinations. This means we need to take pains to reduce the set of possible matches, and we may need to use persistent storage (as we did in Chapter 2, Association Rule Mining) to remember the comparisons we have already made in the past.

Leverage rare values

Another interesting feature of attribute-based similarity matching is that very rare values can be leveraged to increase the accuracy of a match. In other words, it is easier to match two entities that share a very rare value than it is to match entities with common values. We can use this knowledge to adjust the probability of a match based on the rarity or commonness of the attribute values. A common illustration of this idea, given by geneticist Howard Newcombe in a 1959 Science paper (Automatic Linkage of Vital Records) and expanded in later highly-cited papers (such as Winkler's The State of Record Linkage and Current Research Problems), is to compare the ease of matching entities that share a rare first and last name with the difficulty of matching a common name. For example, it will be easy to match two entities with the name Finklestein McGlockenspiel, a very rare name in English-speaking countries, but relatively difficult to match people who have a common name, for example John Smith.

Methods for matching attributes

Attribute matching strategies require that the analyst define what qualifies as a match. Equality of values seems like an obvious matching goal: Smith equals Smith, 43 equals 43. However, in the absence of precisely equal values, we may need to be a little more creative about declaring a match. What values are close enough to count as a match?

Range-based or distance from target

When working with numeric values or dates, close enough can be indicated by setting a range or a distance from the target. For example, in a domain dealing with age, we may declare that any value in the low 40s is close enough to match the age 43. In another domain, we may decide that the eye colors green and hazel are close enough to be counted as matching. The thresholds governing this type of proximity matching will often depend on the domain and the goals of the problem.
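A sketch of such rules in Python, with the tolerance value and the green/hazel pairing as hypothetical domain choices:

```python
def numeric_match(value, target, tolerance):
    """Match when a numeric value falls within a tolerance of the target."""
    return abs(value - target) <= tolerance

# Hypothetical domain rule: green and hazel count as matching eye colors
CLOSE_COLORS = {frozenset({"green", "hazel"})}

def color_match(a, b):
    """Match equal colors, or pairs declared close enough for this domain."""
    return a == b or frozenset({a, b}) in CLOSE_COLORS
```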

String edit distance

Measuring the approximate equality of two strings is a very interesting exercise. How close are the names Jones and James?

There are a number of basic string metrics that can quantify the amount of edit distance between two strings. The edit distance is the number of edits that would transform one string into another. Two commonly-used methods for measuring the edit distance between two strings are the Hamming distance and the Levenshtein distance.

Hamming distance

For strings of equal length, the Hamming distance is a measure of the number of character substitutions needed to turn one string into the other. The Hamming distance between blue and glue is 1, since there is only a single edit needed to change the b character to a g. The Hamming distance is not designed for strings of different lengths, so we can pad the shorter word with dummy characters to account for this problem. To get the Hamming distance between cat and catch, we can turn cat into cat**, which has a Hamming distance of 2 from catch.
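A sketch of the padded Hamming distance in Python, using * as the padding character as described:

```python
def hamming(s1, s2):
    """Count character substitutions, padding the shorter string with '*'."""
    width = max(len(s1), len(s2))
    s1, s2 = s1.ljust(width, "*"), s2.ljust(width, "*")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
```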

Levenshtein distance

For any two strings, the Levenshtein distance is a measure of the minimum number of substitutions, deletions, or insertions required to change one string into another. The Levenshtein distance between blue and glue is also 1. And the Levenshtein distance between cat and catch is 2, since there are two insertions, zero deletions, and zero substitutions. The Levenshtein distance between catch and cap is 3, since there are two deletions and one substitution. The Hamming distance between catch and cap is also 3, using the padding method described previously.

Do Hamming and Levenshtein always return the same value? No. Since the Hamming distance does not allow insertions or deletions, it sometimes requires more edits than Levenshtein. Consider the strings scat and cats. The Hamming distance is 4, since the transformation requires the following substitutions: s changes to c, c changes to a, a changes to t, and t changes to s. The Levenshtein distance between scat and cats is only 2: delete the s from the first position, and insert an s at the final position. So while the Hamming distance can be greater than the Levenshtein distance, it is less complicated and faster to compute.
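A sketch of the Levenshtein distance using the classic dynamic programming approach:

```python
def levenshtein(s1, s2):
    """Minimum substitutions, deletions, and insertions, via dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        previous = current
    return previous[-1]
```

Running it on the examples from this section confirms that scat and cats are only 2 edits apart, even though their Hamming distance is 4.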

There are many other variations on edit distance calculation, and different techniques will work in different scenarios. Does your data only have strings of the same length? Do you need to handle fuzzy (approximate) string matches, or are you more interested in exact string matches? No matter what technique we choose, we must remember that for this chapter, our overarching goal is matching entities, and the purpose of the string metrics is to determine how similar the string attributes are between some pair of entities. Knowing how similar these string attributes are is just one additional clue as to how similar the entities are. Are there other ways of measuring the similarity of strings?

Soundex

Invented in 1918, Soundex is another rudimentary technique for measuring how similar two strings are to each other. But, just as its name suggests, Soundex is a measure of how phonetically similar the strings are, and it only works with English pronunciations. With Soundex, each word is encoded into a character sequence; for example, the traditional four-character Soundex algorithm encodes the name Peters as P362. The encoding rules are fairly simple. The first letter of a word is retained as the first letter of the new code. Subsequent vowels are ignored, along with the letters W, Y, and H. Consonants are assigned numbers in groups, as follows:

Letter                 | Becomes
B, F, P, V             | 1
C, G, J, K, Q, S, X, Z | 2
D, T                   | 3
L                      | 4
M, N                   | 5
R                      | 6

Smith thus becomes S530. The first letter S is retained, the M becomes a 5, the I is dropped, the T becomes a 3, and the H is dropped. A 0 character is added to pad the code to four characters. Smythe also becomes S530. The first letter S is retained, the M becomes a 5, the Y is dropped, the T becomes a 3, the H is dropped, and the E is dropped.

James becomes J520. The first letter J is retained, the M becomes a 5, the S becomes a 2, and the vowels A and E are dropped. Since M and N are both in the same group, the codes for James and Jones are exactly the same.

Consider a longer word, such as Ellington. Its code will be E452. Here we retain the E as the starting letter, substitute 4 for the letter L, drop the second 4 representing the second letter L, drop the I, substitute 5 for N, substitute 2 for G, substitute 3 for T, drop the O, and substitute 5 for the final N. Drop any digits that follow the first four characters. The result for Ellington is E452.
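The encoding rules described above can be sketched in Python; this simplified version also follows the traditional convention that H and W do not separate repeated codes:

```python
def soundex(name):
    """A sketch of the traditional four-character Soundex encoding."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
              "L": "4", "MN": "5", "R": "6"}

    def digit(ch):
        for letters, d in groups.items():
            if ch in letters:
                return d
        return ""  # vowels and H, W, Y carry no digit

    letters = [ch for ch in name.upper() if ch.isalpha()]
    code = letters[0]                # the first letter is kept as-is
    prev = digit(letters[0])
    for ch in letters[1:]:
        d = digit(ch)
        if d and d != prev:          # skip repeats, like the double L in Ellington
            code += d
        if ch not in "HW":           # H and W do not separate repeated codes
            prev = d
    return (code + "000")[:4]        # pad with zeros, keep four characters
```

This reproduces the worked examples: Peters encodes to P362, Smith and Smythe both encode to S530, James and Jones both encode to J520, and Ellington encodes to E452.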

For our purposes, calculating the Soundex of strings is just another way to prepare data so that we can compare values. We are seeking entity matches, and, depending on the data we have, a Soundex comparison between two strings may be a very useful determinant of similarity. Another good application for Soundex is to use it to narrow down which strings should be compared. Rather than performing pairwise comparisons on every pair of strings in a database, we may find efficiency gains by only comparing pairs of strings that have similar Soundex codes.

Both edit distances and Soundex can be used in an entity matching scenario where we are trying to use the attributes of each entity as the matching criteria. But what happens when we do not have enough attributes, or where the attributes are different across our different sets of entities?

Leveraging disjoint sets

An alternate formulation of the attribute matching similarity solution can be used when we have disjoint attribute sets. Disjoint sets are those where the attributes in one dataset overlap with, but are not identical to, the attributes in another set. Consider the example earlier in this chapter where we attempted to horizontally merge two sets {name, address, age, drivers_license_state} and {name, year_of_first_subscription, favorite_color, home_state}. These two sets overlap on the name attribute, but nothing else.

We could proceed with attribute-based similarity matching using the common attribute, name in this case, but to ensure accuracy in those matched entities, we could also add a sanity check by way of a profile. We do this by setting up a concept of a typical profile that a matched record will have. Each profile is set up based on the disjoint attributes of the resulting record. For example, we can mandate that a typical user will have a drivers_license_state attribute that matches the home_state attribute. Additionally, we may mandate that if drivers_license_state is not null, then the age will be greater than 16 (or whatever age is considered a typical driving age in this domain). By doing this, we ensure that even though we only have a single matching attribute, which is name, we are still able to use the values present in the disjoint attributes to assist the overall matching process.
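A sketch of such a profile check in Python, with the field names and the driving age as hypothetical domain assumptions:

```python
DRIVING_AGE = 16  # assumed typical driving age for this domain

def fits_profile(record):
    """Sanity-check a merged record against the typical matched profile."""
    dl_state = record.get("dl_state")
    home_state = record.get("home_state")
    age = record.get("age")
    # Rule 1: the license state should match the home state, when both exist
    if dl_state and home_state and dl_state != home_state:
        return False
    # Rule 2: anyone holding a license should be at least driving age
    if dl_state and age is not None and age < DRIVING_AGE:
        return False
    return True
```

A candidate match that fails the profile check is not necessarily wrong, but it deserves closer scrutiny before we accept it.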

Disjoint set attribute matching is useful, but still requires that we have entities with some attributes in common, and that the attributes we have are trustworthy enough to serve as the match criteria. What happens if we do not have good attributes, or if the values are unreliable in some way?

Context-based similarity matching

An alternative to attribute-based matching is to take into account the way each entity relates to others within some sort of contextual space. To do this, imagine setting up a hierarchy or a graph structure for each dataset, and match entities based on their similar positions in the graph. For example, suppose we have one dataset that includes a roster of family members listed by name and connected by relationships:

  • Richard is the father of Margaret
  • Margaret is the mother of Steven

Next, we have a separate list of nicknames and relationships, such as:

  • Grandpa is the father of Mom
  • Mom is the mother of Bob
  • Peg is the mother of Bob

Our task is to match the nicknames to the real names and find the matching identities. Because we have the same hierarchy between the entities in both datasets, we can more easily match the nicknames to real names, learning that Richard is known as Grandpa, Margaret is known as Mom and Peg, and Steven is known as Bob. Making the inference that Peg and Margaret are the same person assumes that in this particular family context, we have decided that each entity can have only one mother.

Context-based similarity matching works especially well with entities that have hierarchical or membership relationships to one another. Depending on how many entities and relationships there are in the sets, and how complicated the relationships are, it is also possible to put a likelihood score on a pairwise match. For example, if there are entities that appear in one list but not in the other list, we may have to assign a score of zero, whereas if the relationships are not exactly equivalent between two entities, we can assign a score to the match that is positive, but closer to zero. This kind of matching can work well when there are limited attributes, or unreliable or sparse attribute values.
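A minimal sketch of this idea in Python: represent each dataset as relationship triples and match entities whose structural positions (the relations they participate in, by role) are identical. Note that resolving Peg to Margaret still requires the extra one-mother domain rule, which this sketch deliberately leaves out:

```python
# Relationship triples: (source, relation, target)
real_edges = [("Richard", "father", "Margaret"), ("Margaret", "mother", "Steven")]
nick_edges = [("Grandpa", "father", "Mom"), ("Mom", "mother", "Bob"),
              ("Peg", "mother", "Bob")]

def signature(entity, edges):
    """An entity's structural position: the relations it appears in, by role."""
    return (frozenset(rel for s, rel, t in edges if s == entity),
            frozenset(rel for s, rel, t in edges if t == entity))

real_entities = {e for s, _, t in real_edges for e in (s, t)}
nick_entities = {e for s, _, t in nick_edges for e in (s, t)}

matches = {}
for nick in nick_entities:
    candidates = [r for r in real_entities
                  if signature(r, real_edges) == signature(nick, nick_edges)]
    if len(candidates) == 1:  # only accept unambiguous structural matches
        matches[nick] = candidates[0]
```

Structure alone matches Grandpa to Richard, Mom to Margaret, and Bob to Steven; Peg remains unmatched until we add the domain rule that each entity can have only one mother.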

Machine learning-based entity matching

If you have read any of the current literature on machine learning and data mining, you may be thinking that it would be clever to use something like a Bayesian classifier, support vector machine, or a decision tree to perform entity matching. After all, entity matching could be considered a binary classification problem (match/no match), so a trained classifier might work here. In such a case, the analyst would design training sets of accurate and inaccurate pairwise matches. Then the machine learning algorithm can use these training sets to learn which attributes contribute to true and false matches. Many of these systems have been developed in academic literature, using both theoretical and real-world data.

At this early stage, researchers are finding that the effectiveness of training-based entity matchers is good, but that their efficiency scales poorly with the size of the dataset. A large dataset simply takes too long to train on, especially with a wide set of attributes. The construction of the training set takes time, above and beyond the time needed for the actual discovery of the matches. If we attempt to reduce the dataset size by way of additional training activity, we find that this too is difficult and time-consuming.

A number of solutions have been proposed for these issues, including using parallel processing. Building machine learning-based matchers is an open research area with a lot of exciting developments and experimentation going on. Interested readers should look into the current academic comparisons of training-based, non-training-based, and hybrid entity matching solutions, for example the 2010 Köpcke et al. article (Evaluation of entity resolution approaches on real-world match problems) or the 2011 Kolb et al. paper (Learning-based Entity Resolution with MapReduce).

There are many more variations on entity matching presented in the academic literature, but in this chapter we have focused on gaining a broad understanding of the most common matching strategies in use today. Improvements to the basic techniques will include greater concern for two things: speed and accuracy. In the next section, we will learn how to evaluate the speed and accuracy of whatever entity matching strategy we choose.

Evaluation of entity matching techniques

No matter what strategy we choose for entity matching, we will likely be concerned with how efficiently we can generate the matches, and how accurate those matches are. We mentioned previously that any time we can reduce the number of pairwise matches, we can improve the speed of the matching activity. Also, we mentioned that we can improve the accuracy of the matches by taking into account different types of data, and by leveraging other aspects of domain-specific knowledge such as hierarchies, relationships between attributes, and profiles of a typical result. In this section we will learn how to communicate in more detail about the different measures of success for any entity matching strategy we choose.

Efficiency – how long does it take to do the matching?

How long it takes to execute the entity matching exercise, called its execution time or runtime, is a critically important factor in choosing and evaluating an entity matching strategy. Without taking specific precautions to reduce the size of the sets being compared, the bulk of the execution time will likely be spent doing pairwise comparisons. This is a similar situation to where we found ourselves in Chapter 2, Association Rule Mining when finding frequent itemsets. Just like with the frequent itemset generation, we need to put some concerted effort into finding ways to limit the number of comparisons we have to make. Herman Wells is not likely the same person as Edward White, so we should not waste time comparing them, but Herman Wells and Harmon Walls might be the same person after all.

The good news is that we have many strategies for reducing the size of the lists of pairs that have to be compared. Numeric ranges, date ranges, string edit distances, and Soundex were mentioned earlier as very simple ways of reducing the number of comparisons. In the research literature, these techniques, and many more, are called blocking methods. A good blocking method will reduce the number of possible comparisons enough to improve efficiency, but not so much that it becomes impossible to find a match.

Aside from blocking to reduce the number of entities to be compared, we also want to reduce the number of attributes that need to be compared, as well as the number and complexity of the comparisons in the procedure overall. A matching task that relies on a single, simple numeric test will be more efficient than a matcher that relies on a half-dozen blockers and complicated string matches.

Effectiveness – how accurate are the matches that we generate?

Assuming that our procedure manages to propose some candidate matches, how do we know they are correct? We would like entities that are the same to be correctly matched (true positives), entities that are not the same to remain unmatched (true negatives), and we would like to minimize both types of incorrectly matched entities (false positives and false negatives).

We can measure the effectiveness of our procedure with classical information retrieval measures: accuracy, precision, recall, and specificity. Precision is the number of correct guesses (true positives) divided by the total number of matches we proposed. Our proposals comprise both true positives and false positives.

Recall is the number of correct guesses (true positives) divided by the total number of actual matches. The total number of actual matches consists of both true positives and false negatives.

Accuracy is simply the number of correct guesses (true positives and true negatives) divided by the total number of guesses. Specificity is the number of true negatives divided by the total number of actual negatives, which consists of true negatives and false positives.

To clarify these important concepts, an example is probably in order. Consider the following table showing various proposed entity matches and the result given by our (fictitious) matching procedure. The right-most column shows whether the proposed match was a true positive, true negative, false positive, or false negative:

Proposed entity match                    Our Guess   Correct Answer   Verdict
Richard Lewis matches Rick Lewis         YES         YES              TP
Richard Lewis matches Rich L. Lewis      NO          YES              FN
Ricky Lewis matches Rich Lewis           NO          YES              FN
lupin84 matches rlupin84                 YES         YES              TP
R. J. Louis matches R. J. Lewis          YES         YES              TP
RLU matches RLIU                         YES         NO               FP
Richard Liu matches Richard Lu           NO          NO               TN
Richard Lou matches Richard Lu           YES         YES              TP
R. J. Lupin matches R. J. Lewpin         NO          NO               TN
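The verdict column can be derived mechanically from the other two columns: a YES guess is a true or false positive depending on the correct answer, and likewise for a NO guess. A minimal sketch that tallies the table's rows:

```python
from collections import Counter

# Each tuple is (our_guess, correct_answer) for one proposed match,
# taken row by row from the table above.
results = [("YES", "YES"), ("NO", "YES"), ("NO", "YES"),
           ("YES", "YES"), ("YES", "YES"), ("YES", "NO"),
           ("NO", "NO"), ("YES", "YES"), ("NO", "NO")]

def verdict(guess, answer):
    """Classify one proposed match as TP, FP, TN, or FN."""
    if guess == "YES":
        return "TP" if answer == "YES" else "FP"
    return "TN" if answer == "NO" else "FN"

counts = Counter(verdict(g, a) for g, a in results)
for v in ("TP", "FP", "FN", "TN"):
    print(v, counts[v])  # TP 4, FP 1, FN 2, TN 2
```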

In this example, we got six guesses correct out of nine, for an overall accuracy of 6/9, or about .67. The specificity of our method is calculated as the two true negatives divided by the three actual negatives (two true negatives plus one false positive), which yields 2/3, or about .67.

We can calculate precision as follows:

precision = tp/(tp+fp)
precision = 4/(4+1)
precision = 4/5 = .80

A perfect precision score of 1.0 means that there were no false positives, which is good, but precision does not take into account false negatives.

Recall, on the other hand, does take into account false negatives. We can calculate recall as follows:

recall = tp/(tp+fn)
recall = 4/(4+2)
recall = 4/6 = 2/3 = .67

A perfect recall score of 1.0 means that there were no false negatives, which is also good, although it does not take into account false positives like precision did.

The F-measure, abbreviated as F1, takes both precision and recall into account, producing the harmonic mean of the two:

f1 = 2((precision * recall) / (precision + recall))
f1 = 2((.8 * .67) / (.8 + .67))
f1 = .73
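All five measures can be computed in a few lines directly from the four counts in the table (tp=4, fp=1, fn=2, tn=2); a minimal sketch:

```python
# Counts taken from the worked example above.
tp, fp, fn, tn = 4, 1, 2, 2

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 2))     # 0.67
print(round(precision, 2))    # 0.8
print(round(recall, 2))       # 0.67
print(round(specificity, 2))  # 0.67
print(round(f1, 2))           # 0.73
```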

These measures, accuracy, precision, recall, specificity, and F-measure, can be used to compare two classification procedures, such as entity match procedures, to each other. With these, we are able to determine which procedure produces results with the highest values.

It is worth remembering at this point that in order to calculate these values, we have to know whether the class we chose (is a match, is not a match) is actually correct. This means one of two things will need to happen: if we have a small enough set of matched entities, we can check them by hand to see whether they are correct; if we have a large set of matches, we will need to evaluate against a test set in which we have pre-coded the correct results.

Usefulness – how practical is the matching procedure to use?

Ideally, a matching procedure will be efficient, effective, and practical. What do we mean by practical? A practical procedure will be both generalizable and composed of minimal manual steps. Generalizable means that the entity matching procedure is broadly applicable to many situations. These situations could include different domains, different times, or different systems. For example, if we design a procedure that matches person entities based on names, but we use a dataset composed of characters from Victorian fiction, it might work well for a particular academic project, but probably will not be of much use to a global social media platform in 2016. Similarly, a procedure designed for a niche test scenario may not be easily applied to real-world data. It is not always required that a procedure be generalizable to every possible situation, but we need to at least acknowledge in advance the likely boundaries of the thing we are designing.

Additionally, if the procedure we design has a lot of manual steps, it may not be practical or sustainable to use into the future, as it will be very time-consuming and tedious. While manual procedures may promise higher accuracy, the time it takes to do the manual work threatens the overall efficiency of a matching procedure. Even in a machine learning scenario, we should take into account whether the training sets are being constructed manually, and how long this takes. To increase efficiency, it is tempting to automate manual procedures. However, automating a manual process can introduce errors, which will defeat any gains made in accuracy, and correcting those errors then reduces efficiency further. As with many things in life, it seems like there is always a tradeoff!

At this point, we have considered different strategies for finding matching entities, and we have learned how to evaluate the matches we found. It is time to put our learning into practice with a real-world application.
