Detecting Clones in Software

The problem of detecting code clones is an interesting one technically; if an existing code fragment is duplicated and then changed to fit a new purpose, how can you recognize this with an automated tool? If all code clones were verbatim copies that were never subsequently altered—and some clones do fit this description—then code clone detection would be pretty easy. However, usually clones are adapted for new uses: existing lines are changed or removed, and new code may be added. Thus, it’s probably a good idea to pause at this point and ask the question: just what is a software clone?

Well, first we should note that almost all clone detection techniques actually measure similarity of code chunks. That is, we typically don’t have access to logs that record actual copy-paste edit events; we merely infer that they happened when a similarity measure exceeds a given threshold.

Second, there is no consensus on what similarity means concretely or what thresholds are reasonable. (Ira Baxter, a researcher in the community, likes to say, “Software clones are segments of code that are similar…according to some definition of similarity.”) Detection tools use a wide variety of techniques. Some tools treat programs as a sequence of character strings, and perform textual comparisons. Other tools compare token streams, abstract syntax trees (ASTs), and program dependence graphs (PDGs). Some compute metrics or compare lightweight semantic models of program components. And more and more tools use hybrid approaches, combining computationally cheap techniques applied widely with more expensive techniques applied narrowly.

Although having a wide range of techniques to attack this problem makes for an interesting research field, in practice the working definition of what exactly a clone is depends strongly on the tool that is used. For example, simple string comparisons are unable to account for a change in identifier names. Consequently, two code segments that are considered to be clones by one tool may not be by a different tool. And even when experts sit together in the same room and examine code manually, there’s often no clear consensus as to how similar code segments must be to be considered as clones [Kapser et al. 2007]. Needless to say, this lack of agreement hinders progress because there’s no “gold standard” or accepted benchmark, and so it is hard to know just how good your results are or to evaluate tools against each other.

That said, there has been some progress in clarifying the problem domain. Bellon et al. have created a categorization scheme for software clones that has been widely embraced by the community [Bellon et al. 2007]:

Type 1

Two chunks of code have identical text (possibly ignoring differences in comments and white space formatting).

Type 2

Two chunks of code have identical text, except that identifiers and explicit constant values may differ.

Type 3

Two chunks of code have segments that are Type 2 clones, but may also have “gaps” where they do not agree.

As mentioned earlier, detecting Type 1 clones is fairly easy. All that is needed is simple lexical processing to normalize the input, after which any number of algorithms can be used to perform sequence analysis on the remaining lines of text.
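To make this concrete, here is a minimal sketch of Type 1 detection in Python. The normalization rules (dropping `//` comments, collapsing whitespace) and the minimum run length are illustrative choices of ours, not those of any particular tool; the sequence analysis is delegated to the standard library’s `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher

def normalize(source: str) -> list[str]:
    """Strip // line comments and collapse whitespace, keeping non-empty lines."""
    lines = []
    for line in source.splitlines():
        line = re.sub(r"//.*", "", line)           # drop line comments
        line = re.sub(r"\s+", " ", line).strip()   # collapse runs of whitespace
        if line:
            lines.append(line)
    return lines

def type1_clones(a: str, b: str, min_lines: int = 3):
    """Yield (start_a, start_b, length) for identical runs of normalized lines."""
    la, lb = normalize(a), normalize(b)
    sm = SequenceMatcher(a=la, b=lb, autojunk=False)
    for block in sm.get_matching_blocks():
        if block.size >= min_lines:
            yield block.a, block.b, block.size
```

Because the two fragments must be textually identical after normalization, the comparison itself is cheap; all the design effort goes into deciding what the normalization should forgive.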

Detecting Type 2 clones is not much harder. A lexical analyzer can perform simple rewriting (for example, mapping all identifiers to “id” and all string constants to “foo”), after which the resulting token stream can be analyzed as lines of plain text or using programming-language-aware analyses.
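The rewriting step above can be sketched with a crude regex-based tokenizer. The token grammar and the keyword list here are a toy subset we invented for illustration; a real tool would use the language’s actual lexer.

```python
import re

# A toy tokenizer: strings, numbers, identifiers/keywords, then any single char.
TOKEN = re.compile(r"""
    \s+                      # whitespace (discarded)
  | "(?:[^"\\]|\\.)*"        # string literal
  | \d+(?:\.\d+)?            # numeric literal
  | [A-Za-z_]\w*             # identifier or keyword
  | .                        # any other single character
""", re.VERBOSE)

# Illustrative subset only; a real tool would use the full keyword list.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float"}

def abstract_tokens(source: str) -> list[str]:
    """Map identifiers to 'id' and literals to placeholders; keep keywords."""
    out = []
    for m in TOKEN.finditer(source):
        tok = m.group(0)
        if tok.isspace():
            continue
        if tok.startswith('"'):
            out.append('"foo"')              # all strings become "foo"
        elif tok[0].isdigit():
            out.append("0")                  # all numbers become 0
        elif re.match(r"[A-Za-z_]", tok):
            out.append(tok if tok in KEYWORDS else "id")
        else:
            out.append(tok)                  # punctuation passes through
    return out
```

After this abstraction, `int total = count + 1;` and `int sum = n + 42;` produce identical token streams, so any Type 1 technique applied to the streams now finds Type 2 clones.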

Type 3 clones—the interesting ones—are harder because they require sensible threshold values: how long must the matching contiguous chunks be, and how long can the gaps between them be? Given suitably loose thresholds, any two chunks of code can be considered to be Type 3 clones!
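One way to see how threshold-laden Type 3 detection is: the following heuristic sketch (our own, not from any published tool) declares two token streams clone candidates when matched runs are long enough, the gaps between them are short enough, and the matched portion covers enough of both fragments. Every one of those three parameters is a judgment call.

```python
from difflib import SequenceMatcher

def type3_candidate(toks_a, toks_b, min_block=3, max_gap=2, min_coverage=0.6):
    """Heuristic Type 3 test: matched runs of >= min_block tokens,
    separated by gaps of <= max_gap tokens, covering >= min_coverage
    of the larger fragment. All three thresholds are arbitrary."""
    sm = SequenceMatcher(a=toks_a, b=toks_b, autojunk=False)
    blocks = [b for b in sm.get_matching_blocks() if b.size >= min_block]
    if not blocks:
        return False
    # Reject if any gap between consecutive matched runs is too wide.
    for prev, cur in zip(blocks, blocks[1:]):
        gap_a = cur.a - (prev.a + prev.size)
        gap_b = cur.b - (prev.b + prev.size)
        if gap_a > max_gap or gap_b > max_gap:
            return False
    matched = sum(b.size for b in blocks)
    return matched / max(len(toks_a), len(toks_b)) >= min_coverage
```

Set `min_block=1`, `max_gap` large, and `min_coverage` near zero, and the function happily pairs almost anything with almost anything else, which is exactly the degenerate behavior the thresholds exist to prevent.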

And not all clone detection tools map neatly to Bellon et al.’s definition. Metrics-based detection, for example, does not explicitly model program structure, but rather blends structural and semantic information of program fragments into sets of numbers and compares the numbers.
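A minimal sketch of the metrics-based idea: reduce each fragment to a small vector of counts and compare vectors rather than structure. The particular metrics and the distance function here are illustrative inventions, far cruder than what real metrics-based detectors compute.

```python
import math
import re

def metrics(source: str) -> list[float]:
    """Tiny illustrative metric vector for a code fragment.
    (Crude: keywords are counted among the 'identifiers'.)"""
    tokens = re.findall(r"[A-Za-z_]\w*|\S", source)
    idents = {t for t in tokens if re.match(r"[A-Za-z_]", t)}
    return [
        float(len(tokens)),                                  # size
        float(tokens.count("if")),                           # branching
        float(tokens.count("for") + tokens.count("while")),  # looping
        float(len(idents)),                                  # vocabulary
    ]

def metric_distance(a: str, b: str) -> float:
    """Euclidean distance between metric vectors; small distance
    suggests (but does not prove) a clone pair."""
    return math.dist(metrics(a), metrics(b))
```

Note that consistently renaming identifiers leaves the vector unchanged, so such pairs land at distance zero; conversely, structurally unrelated fragments can also collide in metric space, which is why metrics fit awkwardly into the Type 1/2/3 scheme.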

So let’s take for granted that, at least for the moment, there’s no authoritative answer to the question, “Just how much cloning is there in my software system?” This doesn’t mean we have to pack up and go home. Our previous experience made us pretty confident that we were finding a good percentage of the clones that were out there. So we decided to use a “best effort” approach with state-of-the-art detection techniques, and see where that led us.
