As with the application example in Chapter 2, Association Rule Mining, where we found frequently occurring sets of tags from Freecode projects, this project will also use data from the free, libre, and open source software (FLOSS) realm. Our task here is to find software projects that are being hosted on different code repositories, but actually represent the same entity. Specifically, we are interested in finding projects that were formerly hosted on the now defunct RubyForge.org site, but have subsequently migrated to its successor, the https://rubygems.org/ site. RubyForge and RubyGems are both code repositories for software written in the Ruby language, but they are slightly different in what they offer. RubyForge was a hosting site for software projects, and it included file downloads, source code control, mailing lists, discussion forums, and so on. On RubyForge, each project could be comprised of many files, including libraries, documentation, and the like. RubyGems.org does away with many of those project hosting features and simply hosts the gems, or library files, individually.
However, there is some thread of continuity between the two sites. RubyGems was created as a successor to RubyForge and many of the projects migrated to it. If we can connect projects as they move between these two sites, we will be able to create a longer history for each project. For those of us who study software evolution, having a longer record of project activity is a worthy goal. We can easily imagine the same goal being applied to other domains. For example, it may be important to match customers after an acquisition of one company by another, or matching graduates of a university as they move from the academic world into the working world.
Matching projects as they move around between hosting facilities is complicated by problems such as:
First, consider the project called Rmagick on RubyForge. Is this the same thing as the project called Rmagick on RubyGems? Let's compare the values for similar attributes for each project:
RubyForge Attribute |
Value |
---|---|
System name |
rmagick |
URL | |
Topic |
Editors |
Topic |
Graphic Conversion |
Topic |
Viewers |
Description |
RMagick is moving to GitHub! See our new repository at: http://github.com/rmagick/rmagick. RMagick is an interface to the ImageMagick and GraphicsMagick image processing libraries. See http://rmagick.rubyforge.org/ for prereqs, install FAQ, and more. |
Developer username |
mmaiza |
Developer real name |
Moncef Maiza |
RubyGems Attribute |
Value |
---|---|
System name |
rmagick |
URL | |
Description | |
Owner username |
baror |
Owner username |
bentomas |
Owner username |
bf4 |
Owner username |
vasilesky |
Owner username |
mmaiza |
Author name |
Tim Hunter, Omer Bar-or, Benjamin Thomas, Moncef Maiza (this is listed as one long string) |
Next, consider the project called Vapor on both RubyForge and RubyGems.
RubyGems Attribute |
Value |
---|---|
System name |
Vapor |
URL | |
Description | |
Owner username |
guilleiguaran |
Owner username |
lunks |
Owner username |
maurogeorge |
Author name |
Pedro Nascimento |
Are these Rmagick projects the same? Are the Vapor projects the same? Which attributes are most helpful in making this determination? Let's investigate the different possibilities.
Matching on the name of project is problematic, especially for projects that have dictionary words (such as vapor) as the name. Just as we discussed earlier, it is easier to confirm a match between people who have unusual names than it is to confirm or reject a match between people with common names. The same holds true for software projects. Rmagick is a much less common word than is vapor. In fact, it turns out that currently there are four projects on RubyGems with the word vapor somewhere in their names.
Matching on developer or owner names is also problematic, but not so much because of the risk of false positives with overly common names. After all, with such a small developer pool, the risk of having the same name as another developer or project owner is small, especially when similar project names are taken into account. However, there are real risks of false negatives with matching people names, due to variations in name spelling, including nicknames and usernames. In the RMagick example, the developer username mmaiza does appear in both lists, which is a strong indication that these projects are the same.
Matching projects using their home page or URL is easy in some ways and problematic in others. First, the risk of false positives is low. Since URLs are unique, we can be fairly confident that if a RubyForge project and a RubyGems project both state that they are using the same URL, that they are the same project. There are two exceptions to this rule that will help us find false positives. First, a rare exception might be if one project was using the expired or lapsed domain of another project, but this is so unlikely as to be impractical to plan around. Second, a more common exception would be if two projects listed the same generic URL, such as http://rubyforge.org, perhaps as a placeholder. This might unintentionally connect projects.
There are also small differences in URLs that might yield false negatives if we are not careful. Common differences include a trailing / at the end of a URL, the presence or absence of www at the front of a domain, http versus https in the protocol, and so on. String manipulation and edit distance metrics may be able to help with some of these differences. As with unusual names, a homepage URL is useful for confirming a positive match, but less useful for ruling out a match.
Matching on the topic and keywords found in the textual description of a project is an interesting option. In the Vapor example, it is obvious that these projects are quite different when we read their descriptions: one is a database storage software package, the other is related to gaming. However, in the Rmagick example, even though the RubyGems description of the software package is very generic, and the RubyForge topic seems to be serving as a notification that the project has moved to GitHhub, we do find words in common between the two. The keyword ImageMagick appears in both, as does the word interface.
So given these attributes and what we know about minimizing risks of both false positives and false negatives, how do we move forward with a matching strategy?
Recall, however, the case of the two Vapor projects that turned out to be completely different, despite sharing the same name. Are there additional clues we can look for that will indicate whether the matching pair we have found is actually a match? Here are some things we can look for.
Is the RubyForge project name found anywhere in the RubyGems URL? If so, this match candidate might really be a match. An example of this is the Abstractstack
project:
RF_project_name: abstractstack RF_url: http://github.com/kachick/abstractstack RG_project_name: abstractstack RG_url: http://kachick.github.com/abstractstack
Are any of the RubyForge project developer names found in the RubyGems project list of owners or authors? If so, this might indicate that these projects really do match. An example of this is the aerial project:
RF_project_name: aerial RF_url: http://aerial.rubyforge.org RF_dev_username: mattsears RF_dev_realname: Matt Sears RG_project_name: aerial RG_url: http://github.com/mattsears/aerial RG_person_name: mattsears RG_person_name: Matt Sears
To get started on finding these matches, first we will need a database of projects from both RubyForge and RubyGems. The FLOSSmole project (http://flossmole.org) has been collecting data from both of these sites for several years. For this book, we have excerpted the relevant attributes from the FLOSSmole data and created the following five database tables (primary key columns are shown with (PK)):
Rf_entities
: The attributes are project_name
(PK), long_name
, url
, descriptionRf_entity_people
: The attributes are project_name
(PK), dev_username
(PK), dev_realname
Rf_entity_topics
: The attributes are project_name
(PK), topic
(PK)Rg_entities
: The attributes are project_name
(PK), url
, description
Rg_entity_people
: The attributes are project_name
(PK), person_name
(PK)These tables are available for loading into a MySQL database by downloading the gzipped
file containing the CREATE
and INSERT
statements from the GitHub site for this book: https://github.com/megansquire/masteringDM/blob/master/ch3/RFRGdata.sql.gz.
Once the file has been downloaded, gunzip
it and load the data into your MySQL environment. A fast way to do this is by using the command line MySQL client and the command line gunzip
program. Note that in this example we have already created a database called rfrg to hold this data:
$ gunzip RFRGdata.sql.gz $ mysql –h localhost –u username –p password mysql> use rfrg; mysql> source RFRGdata.sql;
The source command will load the CREATEs
and INSERTs
needed to populate the five data tables. When you are done, you can use show tables
to see five tables loaded, as follows:
mysql> show tables; +-------------------------+ | Tables_in_test | +-------------------------+ | book_rf_entities | | book_rf_entity_people | | book_rf_entity_topics | | book_rg_entities | | book_rg_entity_people | +-------------------------+
Now we need to create a table to hold the candidate matches. I have called this table book_entity_matches
and we can CREATE
it as follows:
CREATE TABLE IF NOT EXISTS book_entity_matches ( rf_project_name varchar(100) NOT NULL, rg_project_name varchar(100) NOT NULL, url_levenshtein int(11) DEFAULT NULL, rf_name_soundex varchar(5) DEFAULT NULL, rg_name_soundex varchar(5) DEFAULT NULL, name_levenshtein int(11) DEFAULT NULL, rf_name_in_rg_name tinyint(1) DEFAULT NULL, rf_name_in_rg_url tinyint(1) DEFAULT NULL, rf_dev_in_rg_dev tinyint(1) DEFAULT NULL, PRIMARY KEY (rf_project_name,rg_project_name) ) ENGINE=MyISAM;
The table is created and empty, and is ready to hold our candidate entity matches. The next section will show the code needed to find candidates and populate this book_entity_matches
table.
To implement our proposed strategy for finding candidate matched pairs, we are going to write a short procedure that does the following:
Once we have done these things, we can evaluate how good our matches are and decide how to refine the procedure or how to change it. The following code for this procedure is divided into chunks for easier understanding. As with previous chapters, the entire code base can be found on the GitHub repository for this book at https://github.com/megansquire/masteringDM.
Let's start off with some import statements for the three libraries we will need. These should already be in your Anaconda installation if you followed along in the previous chapters:
import pymysql import sys from nltk.metrics import *
Next we will set up our database connection. You should fill in the connection details with your own. We also need two database cursors. Our first two SQL statements will simply find projects with matching URLs and populate those as candidates, and then find projects with matching names, and populate those as candidates:
db = pymysql.connect(host='localhost', db='rfrg', user='', passwd='', port=3306, charset='utf8mb4') cursor = db.cursor() # get all projects with matching URLs cursor.execute("INSERT INTO book_entity_matches ( rf_project_name, rg_project_name) SELECT rf.project_name, rg.project_name FROM book_rf_entities rf INNER JOIN book_rg_entities rg ON rf.url = rg.url") # get projects that have matching project names cursor.execute("INSERT INTO book_entity_matches(rf_project_name, rg_project_name) SELECT rf.project_name, rg.project_name FROM book_rf_entities rf INNER JOIN book_rg_entities rg ON rf.project_name = rg.project_name WHERE rf.project_name NOT IN ( SELECT bem.rf_project_name FROM book_entity_matches bem)")
Our next step is to calculate the string metrics for each pair. First we will select out these pairs, then we will set up a loop to process each pair individually:
cursor.execute("SELECT bem.rf_project_name, bem.rg_project_name, rfe.url, rge.url FROM book_entity_matches bem INNER JOIN book_rg_entities rge ON bem.rg_project_name = rge.project_name INNER JOIN book_rf_entities rfe ON bem.rf_project_name = rfe.project_name ORDER BY bem.rf_project_name") projectPairs = cursor.fetchall()
The for
loop operates on lowercase versions of the names and URLs because the two systems, RubyForge and RubyGems, are not consistent in their requirement for casing of names and URLs. RubyGems allows case-sensitive project names, but RubyForge does not, and both sites allow capital letters in their URLs:
for(projectPair) in projectPairs: RFname = projectPair[0] RGname = projectPair[1] RFurl = projectPair[2] RGurl = projectPair[3] # lowercase everything RFnameLC = RFname.lower() RGnameLC = RGname.lower() RFurlLC = RFurl.lower() RGurlLC = RGurl.lower()
The next section relies on two string metric functions, one of which, edit_distance(),
is loaded with the nltk
package, and one of which is a public domain function called soundex()
written by Gregory Jorgensen and reproduced here. The code and explanation for soundex()
is shown at the end of this section:
levNames = edit_distance(RFnameLC, RGnameLC) levURLs = edit_distance(RFurlLC, RGurlLC) soundexRFname = soundex(RFnameLC) soundexRGname = soundex(RGnameLC)
In the next few lines, we test whether the RubyForge name is found inside the RubyGems project name, and whether the RubyForge name is found inside the RubyGems URL string. If so, this is a good indication that the projects are a match, even if their names or URLs do not match exactly. We use Boolean variables rf_in_rg
and rf_in_rgurl
to hold the answers:
# is the RF project name inside the RG project name? if RFnameLC in RGnameLC: rf_in_rg = 1 else: rf_in_rg = 0 # is the RF project name inside the RG project URL? if RFnameLC in RGurl: rf_in_rgurl = 1 else: rf_in_rgurl = 0
Next we test whether the developers listed on the RubyForge list are found anywhere in the RubyGems list. We only need one developer match in common, so we let the SQL do the heavy lifting here for creating the matches. We fetchone()
and set our Boolean rfdev_in_rgdev
:
# do RF devs match the RG devs? cursor.execute("SELECT rf.dev_username, rf.dev_realname FROM book_rf_entity_people rf WHERE rf.project_name = %s AND (rf.dev_username IN ( SELECT rg.person_name FROM book_rg_entity_people rg WHERE rg.project_name = %s) OR rf.dev_realname IN ( SELECT rg.person_name FROM book_rg_entity_people rg WHERE rg.project_name = %s))", (RFname, RGname, RGname)) result = cursor.fetchone() if result is not None: rfdev_in_rgdev = 1 else: rfdev_in_rgdev = 0
Now we have a bunch of string metrics and Booleans and we are ready to write all these to the database table. For each match candidate pair, we will UPDATE
the row indicating what these values are:
cursor.execute("UPDATE book_entity_matches SET rf_name_soundex = %s, rg_name_soundex = %s, url_levenshtein = %s, name_levenshtein = %s, rf_name_in_rg_name = %s, rf_name_in_rg_url = %s, rf_dev_in_rg_dev = %s WHERE rf_project_name = %s AND rg_project_name = %s", (soundexRFname, soundexRGname, levURLs, levNames, rf_in_rg, rf_in_rgurl, rfdev_in_rgdev, RFname, RGname)) db.close()
We close the database connection and we are all set. Before we discuss evaluating the results, here is the soundex()
function written by Gregory Jorgenson and described earlier. This function takes two parameters. The first is the word for which you are generating the Soundex code. The second parameter is the length of the Soundex code you want to generate. The initial Soundex algorithm called for a 4-character code, which is also the default in this program. However, this length is configurable. If you find that you are dealing in a domain with very long and consonant-heavy words, you should increase the length. Remember that Soundex is designed for English pronunciations, so proceed with caution in non-English settings:
def soundex(name, len=4): """ soundex module conforming to Knuth's algorithm implementation 2000-12-24 by Gregory Jorgensen public domain available at: http://code.activestate.com/recipes/52213-soundex-algorithm/ """ # digits holds the soundex values for the alphabet digits = '01230120022455012623010202' sndx = '' fc = '' # translate alpha chars in name to soundex digits for c in name.upper(): if c.isalpha(): if not fc: fc = c # remember first letter d = digits[ord(c)-ord('A')] # duplicate consecutive soundex digits are skipped if not sndx or (d != sndx[-1]): sndx += d # replace first digit with first alpha character sndx = fc + sndx[1:] # remove all 0s from the soundex code sndx = sndx.replace('0','') # return soundex code padded to len characters return (sndx + (len * '0'))[:len]
At this point we have a fully populated match table, and it is time to look at our results.
A typical result in our result set now looks like the following:
rf_project_name: aafc rg_project_name: acts_as_flux_capacitor url_levenshtein: 0 rf_name_soundex: A120 rg_name_soundex: A232 name_levenshtein: 18 rf_name_in_rg_name: 0 rf_name_in_rg_url: 1 rf_dev_in_rg_dev: 1
What this means is that the RubyForge project aafc has been matched to the RubyGems project acts_as_flux_capacitor based on their URL, and that they also share a relationship between the RubyForge developer lists. Their names are quite different, as demonstrated by the different Soundex scores and the high name Levenshtein value. Still, just with the URL matches and the shared developer list, we have pretty good evidence that these are the same project.
At this point, it is time to evaluate our overall candidate match list and decide whether our procedure works well enough or not. Overall, did we find good matches? Here are some ways to consider this question.
Our procedure produced around 5,800 matches. Is this a good number? Consider that we started with just over 9,600 RubyForge projects, and our ideal goal would be to find a match on RubyGems for as many of those projects as we can. The total 5,800 might seem like a really high number, but remember that RubyForge is a project hosting facility, whereas RubyGems is a gem, or library, hosting facility. On RubyForge, one project could be made up of many files and libraries, but they all lived under a single name. On RubyGems, each gem is listed separately. Therefore, many gems could share the same URL. As evidence of this, consider that there are 72 gems that listed their URL as some form of http://rubyonrails.org/. Rails is a huge project with many, many gems in it. On RubyForge those lived under one umbrella project, but on RubyGems they are all listed separately.
So, a single RubyForge project might be listed alongside several RubyGems projects. Therefore, we need a query that will tell us how many actual RubyForge projects we were able to find matches for in RubyGems:
SELECT count(DISTINCT rf_project_name) FROM book_entity_matches; +---------------------------------+ | count(DISTINCT rf_project_name) | +---------------------------------+ | 4252 | +---------------------------------+ 1 row in set (0.01 sec)
That query yields 4,252 distinct RubyForge projects that were matched to RubyGems projects. That means that our simple URL and name matching procedure was able to find at least one match for 44% of the projects in our set. While we might want these numbers to be higher, let's take a moment for a reality check.
First, remember why we are looking for matches to begin with. We stated that we wanted to find matches so we could build complete project histories for software projects. A total of 4,252 project histories is a lot of material to work with, and that is certainly more matches than we would have been able to build in a reasonable time by hand.
Second, RubyForge was begun in 2003, and finally shut down in 2014. RubyGems started in 2009 and was named the official gem host in 2010. It is a distinct possibility that many of the RubyForge projects simply did not want to move to RubyGems. Maybe the projects became defunct or had lost their team leaders. Or, if they did move they may have changed URLs, names, and project teams so significantly that we will not be able to find them without a lot of additional manual procedures and domain knowledge about the project anyway.
When we think about it this way, finding possible matches for 4,252 projects is not bad!
This is also a good question. Let's consider it in terms of false positives first. To attempt to ferret out false positives, we would look for two main categories of incorrect matches.
First, we want to identify pairs that had matching URLs but nothing else was good, for example, no shared developers, no RubyForge word stems in the RubyGems name or URL, high distance Levenshtein values on names, or nonmatching Soundex on names. We will call these Type 1 False Positives, or FP1 for short. To find these FP1 errors, we can run the following query against our match table:
SELECT rf.url, bme.* FROM book_entity_matches bme INNER JOIN book_rf_entities rf ON bme.rf_project_name = rf.project_name WHERE url_levenshtein = 0 AND rf_name_soundex <> rg_name_soundex AND name_levenshtein > 0 AND rf_name_in_rg_name = 0 AND rf_name_in_rg_url = 0 AND rf_dev_in_rg_dev = 0;
This query yields 123 projects, 42 of which have to do with the rails
project alone. We can confirm this by adding the clause and rf.url
like %rubyonrails%
to the end of the query. Another handful are RubyGems libraries that are part of either the will-paginate
or muravey
projects, both of which had dozens of subprojects, but which all shared a common project name on RubyForge, and the same common parent URL. They were flagged as false positives because the name segment was either punctuated differently or had changed names over time, or because a patched library on RubyGems was given the prefix of a developer, but it still used the generic URL to the main project.
Another set of potential false positives are the pairs that share a common name but the URLs are different; they do not share a common developer, and the RubyForge project name is not found inside the URL either. We can call these Type 2 False Positives (or FP2). Here is a query we can use to identify these possible FP2 errors:
SELECT rf.url, rg.url, bme . * FROM book_entity_matches bme INNER JOIN book_rf_entities rf ON bme.rf_project_name = rf.project_name INNER JOIN book_rg_entities rg ON bme.rg_project_name = rg.project_name WHERE name_levenshtein =0 AND url_levenshtein > 0 AND rf_name_in_rg_url =0 AND rf_dev_in_rg_dev =0;
This query yields 121 results. These are potentially more problematic than the FP1 errors we found previously because in order to determine whether they are truly false positives, we will need to read the textual description of the projects, or if that does not yield any results, we will have to go manually to each project page and determine whether these are the same. It seems likely that projects with very unusual names are probably the same (two projects named hatenagraphup for example) but what about the projects named helloworld and index?
True positives (TP) are an easier story: 100% matches are easily found with this query, which looks for projects that have a matching name, matching URL, and three matching Booleans, including at least one developer in common:
SELECT rf.url, rg.url, bme . * FROM book_entity_matches bme INNER JOIN book_rf_entities rf ON bme.rf_project_name = rf.project_name INNER JOIN book_rg_entities rg ON bme.rg_project_name = rg.project_name WHERE name_levenshtein =0 AND url_levenshtein =0 AND rf_name_in_rg_url =1 AND rf_dev_in_rg_dev =1 AND rf_name_in_rg_name =1;
This yields 1,082 matches, dwarfing the FP1 and FP2 errors combined. The remaining matches are those somewhere between suspicious and certain. These are ones that are probably matches, but one or more of the Booleans was set to zero. To summarize we have:
Even if we declare the FP1 and FP2 categories to be entirely false, this still yields a precision value of 96%:
precision = 5552/(5796) 96%
But let's not get too excited with a 96% precision value just yet. There are plenty of reasons to be less than thrilled with our results. First, remember that we still have 3,807 RubyForge projects that we did not find any matches for at all. Second, and more important, we have no counts for negatives, either false or true. The reason for this is that since the beginning our matching procedure was loaded with only projects that either matched on URL or name, and no other projects were included. We also have no master list or pre-labeled training set for showing true negatives. Not having negatives means we cannot easily calculate recall or F-measure.
Still, if our goal is to find matching projects for further study of software evolution over time, I think we have succeeded in this. We now have a list of between 1,082 and 5,552 projects that we can connect, with some confidence. We can track these projects as they jumped from one hosting facility to another. Finding matches by URL was fairly successful, but by adding in some more complex matching by names, URLs, and developers allowed us to increase the number of matches we found, and differentiate between them based on how accurate the matches were.
3.142.171.90