Entity matching project

As with the application example in Chapter 2, Association Rule Mining, where we found frequently occurring sets of tags from Freecode projects, this project will also use data from the free, libre, and open source software (FLOSS) realm. Our task here is to find software projects that are being hosted on different code repositories, but actually represent the same entity. Specifically, we are interested in finding projects that were formerly hosted on the now defunct RubyForge.org site, but have subsequently migrated to its successor, the https://rubygems.org/ site. RubyForge and RubyGems are both code repositories for software written in the Ruby language, but they are slightly different in what they offer. RubyForge was a hosting site for software projects, and it included file downloads, source code control, mailing lists, discussion forums, and so on. On RubyForge, each project could be comprised of many files, including libraries, documentation, and the like. RubyGems.org does away with many of those project hosting features and simply hosts the gems, or library files, individually.

However, there is some thread of continuity between the two sites. RubyGems was created as a successor to RubyForge and many of the projects migrated to it. If we can connect projects as they move between these two sites, we will be able to create a longer history for each project. For those of us who study software evolution, having a longer record of project activity is a worthy goal. We can easily imagine the same goal being applied to other domains. For example, it may be important to match customers after an acquisition of one company by another, or matching graduates of a university as they move from the academic world into the working world.

Difficulties with matching software projects

Matching projects as they move around between hosting facilities is complicated by problems such as:

  • Different projects on different sites using the same project name
  • The same project using different names on different sites
  • The same projects using slightly different variations for their URLs on different sites
  • Projects gaining or losing team members across sites
  • Different metadata attributes on one site versus the other
  • No guarantee of whether a project on one site ever was actually hosted on the other site

Two examples

First, consider the project called Rmagick on RubyForge. Is this the same thing as the project called Rmagick on RubyGems? Let's compare the values for similar attributes for each project:

RubyForge Attribute

Value

System name

rmagick

URL

http://rmagick.rubyforge.org

Topic

Editors

Topic

Graphic Conversion

Topic

Viewers

Description

RMagick is moving to GitHub! See our new repository at: http://github.com/rmagick/rmagick. RMagick is an interface to the ImageMagick and GraphicsMagick image processing libraries. See http://rmagick.rubyforge.org/ for prereqs, install FAQ, and more.

Developer username

mmaiza

Developer real name

Moncef Maiza

RubyGems Attribute

Value

System name

rmagick

URL

https://github.com/rmagick/rmagick

Description

RMagick is an interface between Ruby and ImageMagick.

Owner username

baror

Owner username

bentomas

Owner username

bf4

Owner username

vasilesky

Owner username

mmaiza

Author name

Tim Hunter, Omer Bar-or, Benjamin Thomas, Moncef Maiza (this is listed as one long string)

Next, consider the project called Vapor on both RubyForge and RubyGems.

RubyForge Attribute

Value

System name

Vapor

URL

http://vapor.rubyforge.org

Topic

Database

Description

A persistent Object-Repository for Ruby, providing transparent persistence of interrelated Ruby application objects to a PostgreSQL database.

Developer username

oliver

Developer real name

Oliver M. Bolzer

RubyGems Attribute

Value

System name

Vapor

URL

http://helabs.com.br/opensource

Description

Retrieve user information from Steam.

Owner username

guilleiguaran

Owner username

lunks

Owner username

maurogeorge

Author name

Pedro Nascimento

Are these Rmagick projects the same? Are the Vapor projects the same? Which attributes are most helpful in making this determination? Let's investigate the different possibilities.

Matching on project names

Matching on the name of project is problematic, especially for projects that have dictionary words (such as vapor) as the name. Just as we discussed earlier, it is easier to confirm a match between people who have unusual names than it is to confirm or reject a match between people with common names. The same holds true for software projects. Rmagick is a much less common word than is vapor. In fact, it turns out that currently there are four projects on RubyGems with the word vapor somewhere in their names.

Matching on people names

Matching on developer or owner names is also problematic, but not so much because of the risk of false positives with overly common names. After all, with such a small developer pool, the risk of having the same name as another developer or project owner is small, especially when similar project names are taken into account. However, there are real risks of false negatives with matching people names, due to variations in name spelling, including nicknames and usernames. In the RMagick example, the developer username mmaiza does appear in both lists, which is a strong indication that these projects are the same.

Matching on URLs

Matching projects using their home page or URL is easy in some ways and problematic in others. First, the risk of false positives is low. Since URLs are unique, we can be fairly confident that if a RubyForge project and a RubyGems project both state that they are using the same URL, that they are the same project. There are two exceptions to this rule that will help us find false positives. First, a rare exception might be if one project was using the expired or lapsed domain of another project, but this is so unlikely as to be impractical to plan around. Second, a more common exception would be if two projects listed the same generic URL, such as http://rubyforge.org, perhaps as a placeholder. This might unintentionally connect projects.

There are also small differences in URLs that might yield false negatives if we are not careful. Common differences include a trailing / at the end of a URL, the presence or absence of www at the front of a domain, http versus https in the protocol, and so on. String manipulation and edit distance metrics may be able to help with some of these differences. As with unusual names, a homepage URL is useful for confirming a positive match, but less useful for ruling out a match.

Matching on topics and description keywords

Matching on the topic and keywords found in the textual description of a project is an interesting option. In the Vapor example, it is obvious that these projects are quite different when we read their descriptions: one is a database storage software package, the other is related to gaming. However, in the Rmagick example, even though the RubyGems description of the software package is very generic, and the RubyForge topic seems to be serving as a notification that the project has moved to GitHhub, we do find words in common between the two. The keyword ImageMagick appears in both, as does the word interface.

So given these attributes and what we know about minimizing risks of both false positives and false negatives, how do we move forward with a matching strategy?

  1. Consider matching by URLs. We will start by identifying matches using the technique for which the risk of false positives is lowest. We stated that it would be highly unlikely for two unrelated projects to use the same URL, therefore a very low (ideally, zero) Levenshtein edit distance between URLs is a good indicator that these projects are a match. In addition, as a blocking strategy, any 100% matches we find at this stage can be removed from the candidate sets, thus reducing the number of comparisons in later rounds.
  2. Consider matching by project names. Matching project names is the next-easiest step, but it has a low risk of false positives for rare names, and a moderate risk of false positives for projects with common names or dictionary word names. As a blocking step, we will only consider RF projects at this stage if they have not already been found in Step 1.

Recall, however, the case of the two Vapor projects that turned out to be completely different, despite sharing the same name. Are there additional clues we can look for that will indicate whether the matching pair we have found is actually a match? Here are some things we can look for.

Is the RubyForge project name found anywhere in the RubyGems URL? If so, this match candidate might really be a match. An example of this is the Abstractstack project:

RF_project_name: abstractstack
RF_url: http://github.com/kachick/abstractstack

RG_project_name: abstractstack
RG_url: http://kachick.github.com/abstractstack

Are any of the RubyForge project developer names found in the RubyGems project list of owners or authors? If so, this might indicate that these projects really do match. An example of this is the aerial project:

RF_project_name: aerial
RF_url: http://aerial.rubyforge.org
RF_dev_username: mattsears
RF_dev_realname: Matt Sears

RG_project_name: aerial
RG_url: http://github.com/mattsears/aerial
RG_person_name: mattsears
RG_person_name: Matt Sears

The dataset

To get started on finding these matches, first we will need a database of projects from both RubyForge and RubyGems. The FLOSSmole project (http://flossmole.org) has been collecting data from both of these sites for several years. For this book, we have excerpted the relevant attributes from the FLOSSmole data and created the following five database tables (primary key columns are shown with (PK)):

  • Rf_entities: The attributes are project_name (PK), long_name, url, description
  • Rf_entity_people: The attributes are project_name (PK), dev_username (PK), dev_realname
  • Rf_entity_topics: The attributes are project_name (PK), topic (PK)
  • Rg_entities: The attributes are project_name (PK), url, description
  • Rg_entity_people: The attributes are project_name (PK), person_name (PK)

These tables are available for loading into a MySQL database by downloading the gzipped file containing the CREATE and INSERT statements from the GitHub site for this book: https://github.com/megansquire/masteringDM/blob/master/ch3/RFRGdata.sql.gz.

Once the file has been downloaded, gunzip it and load the data into your MySQL environment. A fast way to do this is by using the command line MySQL client and the command line gunzip program. Note that in this example we have already created a database called rfrg to hold this data:

$ gunzip RFRGdata.sql.gz
$ mysql –h localhost –u username –p password
mysql> use rfrg;
mysql> source RFRGdata.sql;

The source command will load the CREATEs and INSERTs needed to populate the five data tables. When you are done, you can use show tables to see five tables loaded, as follows:

mysql> show tables;
+-------------------------+
| Tables_in_test          |
+-------------------------+
| book_rf_entities        |
| book_rf_entity_people   |
| book_rf_entity_topics   |
| book_rg_entities        |
| book_rg_entity_people   |
+-------------------------+

Now we need to create a table to hold the candidate matches. I have called this table book_entity_matches and we can CREATE it as follows:

CREATE TABLE IF NOT EXISTS book_entity_matches (
  rf_project_name varchar(100) NOT NULL,
  rg_project_name varchar(100) NOT NULL,
  url_levenshtein int(11) DEFAULT NULL,
  rf_name_soundex varchar(5) DEFAULT NULL,
  rg_name_soundex varchar(5) DEFAULT NULL,
  name_levenshtein int(11) DEFAULT NULL,
  rf_name_in_rg_name tinyint(1) DEFAULT NULL,
  rf_name_in_rg_url tinyint(1) DEFAULT NULL,
  rf_dev_in_rg_dev tinyint(1) DEFAULT NULL,
  PRIMARY KEY (rf_project_name,rg_project_name)
) ENGINE=MyISAM;

The table is created and empty, and is ready to hold our candidate entity matches. The next section will show the code needed to find candidates and populate this book_entity_matches table.

The code

To implement our proposed strategy for finding candidate matched pairs, we are going to write a short procedure that does the following:

  1. Run a simple SQL select to find project pairs with matching URLs. Add these to the candidate list.
  2. Run a simple SQL select to find project pairs with matching names. Add these to the candidate list.
  3. For each candidate pair:
    • Calculate the Levenshtein distance on URLs
    • Calculate the Levenshtein distance on names
    • Set Boolean: is the RubyForge name found in the RubyGems name?
    • Set Boolean: is the RubyForge name found in the RubyGems URL?
    • Set Boolean: is any RubyForge developer found on the list of RubyGems developers?

Once we have done these things, we can evaluate how good our matches are and decide how to refine the procedure or how to change it. The following code for this procedure is divided into chunks for easier understanding. As with previous chapters, the entire code base can be found on the GitHub repository for this book at https://github.com/megansquire/masteringDM.

Let's start off with some import statements for the three libraries we will need. These should already be in your Anaconda installation if you followed along in the previous chapters:

import pymysql
import sys
from nltk.metrics import *

Next we will set up our database connection. You should fill in the connection details with your own. We also need two database cursors. Our first two SQL statements will simply find projects with matching URLs and populate those as candidates, and then find projects with matching names, and populate those as candidates:

db = pymysql.connect(host='localhost',
                     db='rfrg',
                     user='',
                     passwd='',
                     port=3306,
                     charset='utf8mb4')
cursor = db.cursor()

# get all projects with matching URLs
cursor.execute("INSERT INTO book_entity_matches ( 
                rf_project_name, 
                rg_project_name) 
                SELECT rf.project_name, rg.project_name 
                FROM book_rf_entities rf 
                INNER JOIN book_rg_entities rg 
                ON rf.url = rg.url")

# get projects that have matching project names
cursor.execute("INSERT INTO book_entity_matches(rf_project_name, 
                                                rg_project_name) 
                SELECT rf.project_name, rg.project_name 
                FROM book_rf_entities rf 
                INNER JOIN book_rg_entities rg 
                ON rf.project_name = rg.project_name 
                WHERE rf.project_name NOT IN ( 
                    SELECT bem.rf_project_name 
                    FROM book_entity_matches bem)")

Our next step is to calculate the string metrics for each pair. First we will select out these pairs, then we will set up a loop to process each pair individually:

cursor.execute("SELECT bem.rf_project_name, 
                       bem.rg_project_name, 
                       rfe.url, 
                       rge.url 
                FROM book_entity_matches bem 
                INNER JOIN book_rg_entities rge 
                  ON bem.rg_project_name = rge.project_name 
                INNER JOIN book_rf_entities rfe 
                  ON bem.rf_project_name = rfe.project_name 
                ORDER BY bem.rf_project_name")
projectPairs = cursor.fetchall()

The for loop operates on lowercase versions of the names and URLs because the two systems, RubyForge and RubyGems, are not consistent in their requirement for casing of names and URLs. RubyGems allows case-sensitive project names, but RubyForge does not, and both sites allow capital letters in their URLs:

for(projectPair) in projectPairs:
    RFname = projectPair[0]
    RGname = projectPair[1]
    RFurl  = projectPair[2]
    RGurl  = projectPair[3]
    
    # lowercase everything
    RFnameLC = RFname.lower()
    RGnameLC = RGname.lower()
    RFurlLC  = RFurl.lower()
    RGurlLC  = RGurl.lower()

The next section relies on two string metric functions, one of which, edit_distance(), is loaded with the nltk package, and one of which is a public domain function called soundex() written by Gregory Jorgensen and reproduced here. The code and explanation for soundex() is shown at the end of this section:

    levNames = edit_distance(RFnameLC, RGnameLC)
    levURLs  = edit_distance(RFurlLC, RGurlLC)
    soundexRFname = soundex(RFnameLC)
    soundexRGname = soundex(RGnameLC)

In the next few lines, we test whether the RubyForge name is found inside the RubyGems project name, and whether the RubyForge name is found inside the RubyGems URL string. If so, this is a good indication that the projects are a match, even if their names or URLs do not match exactly. We use Boolean variables rf_in_rg and rf_in_rgurl to hold the answers:

    # is the RF project name inside the RG project name?
    if RFnameLC in RGnameLC:
        rf_in_rg = 1
    else:
        rf_in_rg = 0

    # is the RF project name inside the RG project URL?
    if RFnameLC in RGurl:
        rf_in_rgurl = 1
    else:
        rf_in_rgurl = 0 

Next we test whether the developers listed on the RubyForge list are found anywhere in the RubyGems list. We only need one developer match in common, so we let the SQL do the heavy lifting here for creating the matches. We fetchone() and set our Boolean rfdev_in_rgdev:

    # do RF devs match the RG devs?
    cursor.execute("SELECT rf.dev_username, rf.dev_realname 
                    FROM book_rf_entity_people rf 
                    WHERE rf.project_name =  %s 
                    AND (rf.dev_username IN ( 
                        SELECT rg.person_name 
                        FROM book_rg_entity_people rg 
                        WHERE rg.project_name =  %s) 
                        OR 
                        rf.dev_realname IN ( 
                        SELECT rg.person_name 
                        FROM book_rg_entity_people rg 
                        WHERE rg.project_name = %s))",
                        (RFname, RGname, RGname))
    result = cursor.fetchone()
    if result is not None:
        rfdev_in_rgdev = 1
    else:
        rfdev_in_rgdev = 0

Now we have a bunch of string metrics and Booleans and we are ready to write all these to the database table. For each match candidate pair, we will UPDATE the row indicating what these values are:

    cursor.execute("UPDATE book_entity_matches 
                        SET rf_name_soundex    = %s, 
                            rg_name_soundex    = %s, 
                            url_levenshtein    = %s, 
                            name_levenshtein   = %s, 
                            rf_name_in_rg_name = %s, 
                            rf_name_in_rg_url  = %s, 
                            rf_dev_in_rg_dev   = %s 
                        WHERE rf_project_name = %s 
                        AND rg_project_name = %s", 
                        (soundexRFname,
                         soundexRGname,
                         levURLs,
                         levNames, 
                         rf_in_rg, 
                         rf_in_rgurl, 
                         rfdev_in_rgdev, 
                         RFname, 
                         RGname))
db.close()

We close the database connection and we are all set. Before we discuss evaluating the results, here is the soundex() function written by Gregory Jorgenson and described earlier. This function takes two parameters. The first is the word for which you are generating the Soundex code. The second parameter is the length of the Soundex code you want to generate. The initial Soundex algorithm called for a 4-character code, which is also the default in this program. However, this length is configurable. If you find that you are dealing in a domain with very long and consonant-heavy words, you should increase the length. Remember that Soundex is designed for English pronunciations, so proceed with caution in non-English settings:

def soundex(name, len=4):
    """ 
    soundex module conforming to Knuth's algorithm
    implementation 2000-12-24 by Gregory Jorgensen
    public domain
    available at: http://code.activestate.com/recipes/52213-soundex-algorithm/
    """
    # digits holds the soundex values for the alphabet
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''

    # translate alpha chars in name to soundex digits
    for c in name.upper():
        if c.isalpha():
            if not fc: fc = c   # remember first letter
            d = digits[ord(c)-ord('A')]
            # duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d

    # replace first digit with first alpha character
    sndx = fc + sndx[1:]

    # remove all 0s from the soundex code
    sndx = sndx.replace('0','')

    # return soundex code padded to len characters
    return (sndx + (len * '0'))[:len]

At this point we have a fully populated match table, and it is time to look at our results.

The results

A typical result in our result set now looks like the following:

rf_project_name: aafc
rg_project_name: acts_as_flux_capacitor
url_levenshtein: 0
rf_name_soundex: A120
rg_name_soundex: A232
name_levenshtein: 18
rf_name_in_rg_name: 0
rf_name_in_rg_url: 1
rf_dev_in_rg_dev: 1

What this means is that the RubyForge project aafc has been matched to the RubyGems project acts_as_flux_capacitor based on their URL, and that they also share a relationship between the RubyForge developer lists. Their names are quite different, as demonstrated by the different Soundex scores and the high name Levenshtein value. Still, just with the URL matches and the shared developer list, we have pretty good evidence that these are the same project.

At this point, it is time to evaluate our overall candidate match list and decide whether our procedure works well enough or not. Overall, did we find good matches? Here are some ways to consider this question.

How many entity matches did we find?

Our procedure produced around 5,800 matches. Is this a good number? Consider that we started with just over 9,600 RubyForge projects, and our ideal goal would be to find a match on RubyGems for as many of those projects as we can. The total 5,800 might seem like a really high number, but remember that RubyForge is a project hosting facility, whereas RubyGems is a gem, or library, hosting facility. On RubyForge, one project could be made up of many files and libraries, but they all lived under a single name. On RubyGems, each gem is listed separately. Therefore, many gems could share the same URL. As evidence of this, consider that there are 72 gems that listed their URL as some form of http://rubyonrails.org/. Rails is a huge project with many, many gems in it. On RubyForge those lived under one umbrella project, but on RubyGems they are all listed separately.

So, a single RubyForge project might be listed alongside several RubyGems projects. Therefore, we need a query that will tell us how many actual RubyForge projects we were able to find matches for in RubyGems:

SELECT count(DISTINCT rf_project_name)
FROM book_entity_matches;
+---------------------------------+
| count(DISTINCT rf_project_name) |
+---------------------------------+
|                            4252 |
+---------------------------------+
1 row in set (0.01 sec)

That query yields 4,252 distinct RubyForge projects that were matched to RubyGems projects. That means that our simple URL and name matching procedure was able to find at least one match for 44% of the projects in our set. While we might want these numbers to be higher, let's take a moment for a reality check.

First, remember why we are looking for matches to begin with. We stated that we wanted to find matches so we could build complete project histories for software projects. A total of 4,252 project histories is a lot of material to work with, and that is certainly more matches than we would have been able to build in a reasonable time by hand.

Second, RubyForge was begun in 2003, and finally shut down in 2014. RubyGems started in 2009 and was named the official gem host in 2010. It is a distinct possibility that many of the RubyForge projects simply did not want to move to RubyGems. Maybe the projects became defunct or had lost their team leaders. Or, if they did move they may have changed URLs, names, and project teams so significantly that we will not be able to find them without a lot of additional manual procedures and domain knowledge about the project anyway.

When we think about it this way, finding possible matches for 4,252 projects is not bad!

How good are the pairs we found?

This is also a good question. Let's consider it in terms of false positives first. To attempt to ferret out false positives, we would look for two main categories of incorrect matches.

First, we want to identify pairs that had matching URLs but nothing else was good, for example, no shared developers, no RubyForge word stems in the RubyGems name or URL, high distance Levenshtein values on names, or nonmatching Soundex on names. We will call these Type 1 False Positives, or FP1 for short. To find these FP1 errors, we can run the following query against our match table:

SELECT rf.url, bme.* 
FROM book_entity_matches bme
INNER JOIN book_rf_entities rf
 ON bme.rf_project_name = rf.project_name
WHERE url_levenshtein = 0
AND rf_name_soundex <> rg_name_soundex
AND name_levenshtein > 0
AND rf_name_in_rg_name = 0
AND rf_name_in_rg_url = 0
AND rf_dev_in_rg_dev = 0;

This query yields 123 projects, 42 of which have to do with the rails project alone. We can confirm this by adding the clause and rf.url like %rubyonrails% to the end of the query. Another handful are RubyGems libraries that are part of either the will-paginate or muravey projects, both of which had dozens of subprojects, but which all shared a common project name on RubyForge, and the same common parent URL. They were flagged as false positives because the name segment was either punctuated differently or had changed names over time, or because a patched library on RubyGems was given the prefix of a developer, but it still used the generic URL to the main project.

Another set of potential false positives are the pairs that share a common name but the URLs are different; they do not share a common developer, and the RubyForge project name is not found inside the URL either. We can call these Type 2 False Positives (or FP2). Here is a query we can use to identify these possible FP2 errors:

SELECT rf.url, rg.url, bme . * 
FROM book_entity_matches bme
INNER JOIN book_rf_entities rf 
 ON bme.rf_project_name = rf.project_name
INNER JOIN book_rg_entities rg 
 ON bme.rg_project_name = rg.project_name
WHERE name_levenshtein =0
AND url_levenshtein > 0
AND rf_name_in_rg_url =0
AND rf_dev_in_rg_dev =0;

This query yields 121 results. These are potentially more problematic than the FP1 errors we found previously because in order to determine whether they are truly false positives, we will need to read the textual description of the projects, or if that does not yield any results, we will have to go manually to each project page and determine whether these are the same. It seems likely that projects with very unusual names are probably the same (two projects named hatenagraphup for example) but what about the projects named helloworld and index?

True positives (TP) are an easier story: 100% matches are easily found with this query, which looks for projects that have a matching name, matching URL, and three matching Booleans, including at least one developer in common:

SELECT rf.url, rg.url, bme . * 
FROM book_entity_matches bme
INNER JOIN book_rf_entities rf 
 ON bme.rf_project_name = rf.project_name
INNER JOIN book_rg_entities rg 
 ON bme.rg_project_name = rg.project_name
WHERE name_levenshtein =0
AND url_levenshtein =0
AND rf_name_in_rg_url =1
AND rf_dev_in_rg_dev =1
AND rf_name_in_rg_name =1;

This yields 1,082 matches, dwarfing the FP1 and FP2 errors combined. The remaining matches are those somewhere between suspicious and certain. These are ones that are probably matches, but one or more of the Booleans was set to zero. To summarize we have:

  • 5,796 total matches
  • 4,252 distinct RubyForge projects that have been matched
  • 1,082 positive matches
  • 4,470 probably positive matches
  • 123 possibly false FP1 matches
  • 121 possibly false FP2 matches

Even if we declare the FP1 and FP2 categories to be entirely false, this still yields a precision value of 96%:

precision = 5552/(5796)
96%

But let's not get too excited with a 96% precision value just yet. There are plenty of reasons to be less than thrilled with our results. First, remember that we still have 3,807 RubyForge projects that we did not find any matches for at all. Second, and more important, we have no counts for negatives, either false or true. The reason for this is that since the beginning our matching procedure was loaded with only projects that either matched on URL or name, and no other projects were included. We also have no master list or pre-labeled training set for showing true negatives. Not having negatives means we cannot easily calculate recall or F-measure.

Still, if our goal is to find matching projects for further study of software evolution over time, I think we have succeeded in this. We now have a list of between 1,082 and 5,552 projects that we can connect, with some confidence. We can track these projects as they jumped from one hosting facility to another. Finding matches by URL was fairly successful, but by adding in some more complex matching by names, URLs, and developers allowed us to increase the number of matches we found, and differentiate between them based on how accurate the matches were.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.142.171.90