Summary

In this chapter, we learned how to connect entities even when there is no common identifier for them. This task, called entity matching, is broadly applicable to many domains, and is one of the oldest tasks in data processing. Once we have matched entities, we are able to perform data mining on sets that were previously unconnected.

To do so, we tackled three common strategies for entity matching: attribute-based, disjoint sets, and context-based. We learned several techniques for estimating whether two strings are similar, including edit distances such as Hamming and Levenshtein and phonetic encodings such as Soundex, and we learned how to use blocking techniques to reduce or eliminate pairwise testing. Since it is important to evaluate the effectiveness of our entity matching methods, we learned how to calculate false positive and false negative rates. Finally, we tested our knowledge by designing an entity matching procedure for a real-world problem, using two separate collections of data about free, libre, and open source software (FLOSS) projects. Using the attributes the two collections had in common and some simple string metrics, we were able to construct a list of several thousand projects that had moved from one hosting site to the other.
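As a quick refresher, the two edit distances and the error rates mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than the exact code used in the chapter: the function names, the project-pair tuples in the usage note, and the choice to define the false positive rate as FP / (FP + TN) and the false negative rate as FN / (FN + TP) are all assumptions made for the example.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rates(predicted, actual, all_pairs):
    """Return (false positive rate, false negative rate).

    predicted: pairs our procedure declared a match
    actual:    pairs known to be true matches
    all_pairs: every candidate pair that was considered
    Assumed definitions: FPR = FP / (FP + TN), FNR = FN / (FN + TP).
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(set(all_pairs)) - tp - fp - fn
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

For example, `levenshtein("kitten", "sitting")` returns 3 and `hamming("karolin", "kathrin")` returns 3. Given hypothetical candidate pairs `[(1, 1), (2, 2), (2, 3), (3, 3)]`, predicted matches `[(1, 1), (2, 3)]`, and true matches `[(1, 1), (2, 2)]`, `error_rates` returns `(0.5, 0.5)`.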

In the next chapter, we will continue to work with the RubyForge project data, dipping our toes into the deep water of social network analysis. Rather than focusing on the projects as entities, we will turn our attention to the software developers themselves. How did they self-organize to work on the different projects, and how did the developer network change over time?
