Investigating the Practice of Code Cloning

The unspoken and largely unchallenged presumption among software engineers is that code cloning is a bad thing. Always. OK, maybe you can cheat a little in the short term, but in the long term, it’s a bad idea. In fact, Kent Beck says precisely this in his chapter on “code smells” in Fowler’s Refactoring:

Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.

Our experience and intuition said this was too simplistic a view. For example, our colleague Jim Cordy reminded us that the engineering view of using existing solutions was pretty different: in languages such as FORTRAN and COBOL, where the syntax is awkward and the ability to form high-level abstractions is limited, existing solutions are often treated as tools to be reused and adapted for new situations. (This sounds like what a library is for, but often these kinds of solutions can’t be packaged up so neatly to make a library.)

So we decided to go in a different direction: what are the characteristics of code cloning in industrial software systems? What patterns exist? Can we identify them using static analysis? Can we make judgment calls about when cloning might be a reasonable, and even advantageous, design decision? Armed with knowledge and empirical studies, can we use code duplication as a principled engineering tool?

We set out to create a catalog of cloning patterns that would be informed by the work of our colleagues as well as our own previous explorations in code clone detection and analysis. We devised a template listing for the patterns that included:

  • The name of the pattern

  • A description of the perceived motivation of the programmers to create the clone in the first place

  • A list of the advantages and disadvantages

  • A description of the management and long-term maintenance issues that might need to be addressed subsequently by developers

  • A description of the structural manifestations of the clone pattern within a codebase (e.g., how it might be recognized by a tool)

  • A set of known examples of the pattern

We ended up identifying three broad but distinct categories of cloning: forking, templating, and customizing. Within each, we further identified several cloning patterns. We’ll now give you an overview of these categories and patterns, but for more details, you might want to read our paper on the subject [Kapser and Godfrey 2008].

Forking

Forking occurs when developers want to springboard the development of similar solutions within different contexts. The original code is typically copied to new source files, and then the variants are maintained independently, although probably with some coordination between the development teams. Forking helps to protect the stability of the system by forcing variation points to the fringes, away from the core components. In this way, hardware or platform variants interact with the rest of the system through a kind of virtualization layer, and experimental features can be tested with relatively little risk to the main system. Forking is a good strategy when variants need to evolve independently, especially if the long-term evolution of the variants is unclear.

We identified three cloning patterns that fit into the category of forking:

Hardware variation

This can occur when a new piece of hardware is released that bears a strong semantic resemblance to a previous version, yet is different enough to require special handling. For example, a single device driver in the Linux operating system may combine support for several different hardware versions of a SCSI card. When a new card is released that is similar enough to be able to take advantage of some of the functionality in an existing driver, a new driver may be created by cloning some of the existing code. Since the drivers are maintained independently, new code is isolated from older, working code, minimizing risk to backward compatibility for users of the older cards and drivers. Additionally, drivers are usually managed and installed as single units, so it is difficult to refactor common sub-pieces into components that can be shared.

Platform variation

This is closely related to hardware variation, and occurs when there is a natural abstraction layer for certain kinds of tasks that may be implemented differently for different target platforms or operating systems (OSes). The Apache Portable Runtime (APR) is a good example: programs use a universal API provided by the APR to access low-level services such as memory allocation and file I/O. The universal API is implemented differently (but often very similarly at the code level) for the various supported OSes. Much of the APR implementation is divided into OS-specific directories, and there is a lot of high-level cloning between the various implementation files. This allows the variants to be maintained separately, which is advantageous because subsequent changes are often OS-specific. However, there is also clearly a good deal of communication between the APR developers working on the different platforms, which minimizes the risks of design drift and inconsistent maintenance.
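As a rough illustration of platform variation (this is not APR code), the sketch below implements one hypothetical portable function, port_path_is_absolute, once per platform. In a layout like the APR's, each variant would sit in its own OS-specific directory; here both are shown in one file behind a preprocessor switch. The function name and the simplified path rules are assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

#ifdef _WIN32
/* Variant that would live in a Windows-specific directory.
 * Simplification: a leading separator or "X:\" / "X:/" counts as absolute. */
int port_path_is_absolute(const char *path)
{
    if (path == NULL || path[0] == '\0')
        return 0;
    if (path[0] == '\\' || path[0] == '/')
        return 1;
    return isalpha((unsigned char)path[0]) && path[1] == ':'
        && (path[2] == '\\' || path[2] == '/');
}
#else
/* Variant that would live in a Unix-specific directory.
 * Same shape as the Windows clone; only the path rule differs. */
int port_path_is_absolute(const char *path)
{
    if (path == NULL || path[0] == '\0')
        return 0;
    return path[0] == '/';
}
#endif
```

Callers see a single signature on every platform. The two bodies share their argument checks and differ only in the platform rule, which is the kind of high-level similarity that shows up as cloning between OS-specific implementation files.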

Experimental variation

This is often employed when a system has a stable branch with many users, but developers wish to try out new features that might negatively impact either the performance or the reliability of the system. So the developers create a safe sandbox for experimenting by cloning the existing system and playing around with the new ideas there. Features that seem like good ideas can then in time be backported to the main codebase, and the clone itself may eventually be abandoned or even adopted as the new baseline. The Apache project, the Linux kernel, and many other well-known open source systems have used experimental variation extensively.

Templating

Templating is a way to directly copy behavior of existing code in the absence of appropriate abstraction mechanisms such as inheritance or generics. Templating is used when clones share a similar set of requirements, such as behavioral requirements or the use of a particular library. When these requirements change, all clones must be maintained together. For example, consider the following code from the Gnumeric version 1.6.3 codebase:

/* Convert an octal value (base 8) to binary (base 2). */
static GnmValue *
gnumeric_oct2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
{
        return val_to_base (ei, argv[0], argv[1],
                8, 2,
                0, GNM_const(7777777777.0),
                V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
}

/* Convert a hexadecimal value (base 16) to binary (base 2). */
static GnmValue *
gnumeric_hex2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
{
        return val_to_base (ei, argv[0], argv[1],
                16, 2,
                0, GNM_const(9999999999.0),
                V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
}

The differences between these cloned functions are in the function names and the explicit numeric constants used. Experienced C programmers will recognize this as an idiom for performing simple conversions from one data format to another. In an object-oriented programming language, inheritance and genericity can often greatly simplify the implementation of tasks like these; on the other hand, procedural languages such as C generally require the creation of a large set of these routines. This phenomenon often results in a lot of small clones: if there are m input formats and n output formats, then m*n virtually identical functions must be created and maintained. Although this approach may seem mathematically inelegant and a bit of a pain to deal with, the inexpressiveness of the language may simply demand it. In such cases, the clones are an unfortunate necessity. We identified four templating patterns:

Parameterized code

This pattern is characterized by a simple and precise mapping between variants, such as in the previous code. Given a more powerful and expressive programming language, the clones could be easily merged into a single function using features such as inheritance and generics. Sometimes parameterized code is generated automatically by special-purpose tools to avoid accidental errors; this technique was used in the early days of Java’s JDK, before generics had been added to the language.
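The mapping between such variants can be made explicit even in C. The following is a minimal sketch, not the Gnumeric implementation: a single worker function takes the source and target bases as parameters, and a macro stamps out the thin named wrappers. All names here (convert_base, DEFINE_CONVERTER, oct2bin, hex2bin, hex2oct) are hypothetical, and input handling is deliberately simple: the worker accepts whatever strtoull accepts.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Shared worker: parse `text` in base `from` and write it in base `to`
 * (both assumed to be between 2 and 16).
 * Returns 0 on success, -1 on bad input or a too-small output buffer. */
static int convert_base(const char *text, int from, int to,
                        char *out, size_t out_size)
{
    static const char digits[] = "0123456789ABCDEF";
    char reversed[sizeof(unsigned long long) * CHAR_BIT]; /* worst case: base 2 */
    size_t len = 0, i;
    char *end;
    unsigned long long value;

    if (text == NULL || out == NULL)
        return -1;
    value = strtoull(text, &end, from);
    if (end == text || *end != '\0')
        return -1;                      /* no digits, or trailing junk */

    do {                                /* emit digits, least significant first */
        reversed[len++] = digits[value % (unsigned long long)to];
        value /= (unsigned long long)to;
    } while (value != 0);

    if (len + 1 > out_size)
        return -1;
    for (i = 0; i < len; i++)
        out[i] = reversed[len - 1 - i];
    out[len] = '\0';
    return 0;
}

/* Each variant differs only in its name and two constants. */
#define DEFINE_CONVERTER(name, from, to)                        \
    int name(const char *text, char *out, size_t out_size)      \
    {                                                           \
        return convert_base(text, (from), (to), out, out_size); \
    }

DEFINE_CONVERTER(oct2bin, 8, 2)
DEFINE_CONVERTER(hex2bin, 16, 2)
DEFINE_CONVERTER(hex2oct, 16, 8)
```

The macro does not remove the clones; it generates them from one definition, which is one way to keep the m*n variants consistent when the language offers no better abstraction.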

Boilerplating

This pattern is related to parameterized code but is a little more general, relaxing the requirement for a precise mapping between the variants. It is particularly common in systems written in older procedural languages such as COBOL and FORTRAN. In such cases, a programmatic solution may be seen as an intellectual asset worthy of study, reuse, and emulation in other contexts. The lack of language support for user-defined abstractions often means that a solution to a relatively straightforward problem may be lengthy, complicated, and awkward. When a developer has to solve a similar problem later on, he may choose to copy and then tweak a known working solution rather than develop a new one from scratch, and in so doing avoid the mistakes of those who have gone before. Boilerplating differs from parameterized code in that the differences between clones usually cannot be easily mapped to a single solution.

API protocols

This pattern is a variation on boilerplating, where programmers invoke a library or framework with a recommended protocol for use. For example, when creating a button in a GUI using the Java Swing API, a common order of activities is to create the button, add it to a container, and assign the action listeners. Similarly, setting up a network socket in a C program within Unix requires an established set of function calls to a particular library. Documentation for libraries, and especially for frameworks, often takes the form of “cookbooks,” giving code exemplars that demonstrate how various common tasks can be implemented. Users are encouraged to copy these exemplars into their code and adapt them for their own needs.
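The Unix socket case can be sketched as follows. The calls are the standard POSIX sockets functions (socket, bind, listen, close); the wrapper name open_loopback_listener, the choice of the loopback address, and the backlog of 16 are assumptions made for the example. Code with this shape tends to be copied from one program to the next with only the address and port changed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns a listening TCP socket bound to 127.0.0.1:port, or -1 on error.
 * Passing port 0 lets the kernel choose a free port. */
int open_loopback_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int fd;

    /* Step 1: create the socket. */
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Step 2: fill in the address and bind the socket to it. */
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }

    /* Step 3: mark the socket as accepting connections. */
    if (listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The order of the three steps is fixed by the API, so every program that opens a listening socket ends up with a near-copy of this sequence.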

Programming idioms

This pattern employs language-specific programming idioms systematically throughout a codebase to perform certain kinds of low-level tasks. For example, within the Apache codebase, there is an explicit idiom for how a pointer to a platform-specific data structure should be set in the memory pool. First, the code checks whether the data structure containing the pointer exists in the memory pool. If not, space is allocated for it and then the platform-specific pointer is assigned. This idiom exists because the APR library uses similarly defined data structures to point to platform-specific constructs such as pthreads. These data structures also store platform-specific data that is relevant to the concept, such as the exit status of the thread. In a slight variation on this idiom, code often checks whether a memory pool exists, and returns an error if it does not. We found at least 15 occurrences of this particular idiom within the APR subsystem.
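The shape of this idiom can be sketched as follows. This is not APR code: the pool, the wrapper structure, the status codes, and the function os_thread_put are simplified, hypothetical stand-ins, and the toy pool merely tracks a few calloc'd blocks so they can be released together.

```c
#include <stddef.h>
#include <stdlib.h>

#define STATUS_OK     0
#define STATUS_NOPOOL 1
#define STATUS_NOMEM  2

/* Toy memory pool: remembers up to 8 blocks and frees them all at once. */
typedef struct pool {
    void  *blocks[8];
    size_t count;
} pool_t;

static void *pool_calloc(pool_t *p, size_t size)
{
    void *mem;

    if (p->count == 8)
        return NULL;
    mem = calloc(1, size);
    if (mem != NULL)
        p->blocks[p->count++] = mem;
    return mem;
}

static void pool_clear(pool_t *p)
{
    while (p->count > 0)
        free(p->blocks[--p->count]);
}

/* Wrapper that points at a platform-specific construct and also stores
 * related data such as the thread's exit status. */
typedef struct os_thread_wrapper {
    pool_t *pool;
    void   *native;        /* platform-specific handle */
    int     exit_status;
} os_thread_wrapper_t;

int os_thread_put(os_thread_wrapper_t **thd, void *native, pool_t *pool)
{
    if (pool == NULL)                   /* variation: no pool is an error */
        return STATUS_NOPOOL;
    if (*thd == NULL) {                 /* wrapper not allocated yet */
        *thd = pool_calloc(pool, sizeof **thd);
        if (*thd == NULL)
            return STATUS_NOMEM;
        (*thd)->pool = pool;
    }
    (*thd)->native = native;            /* assign the platform-specific pointer */
    return STATUS_OK;
}
```

A second call with an already allocated wrapper only replaces the native pointer. A codebase that wraps many platform constructs this way ends up repeating the same check, allocate, and assign sequence once per wrapper type.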

Customizing

Customizing arises when existing code is found that solves a problem that’s pretty similar to the one under consideration, but not similar enough to use the same code. Sometimes customization clones arise from pressures that are more managerial than directly technical, such as code ownership or the desire to isolate fresh code from well-tested code. In such cases, the existing code can’t be modified “in place” to achieve the desired new functionality, so it is copied and adapted elsewhere within the system.

Customizing differs from forking and templating in several ways. In customization, cloning is often a starting point for a new design idea that diverges from the original over time; there may be little or no coordination between the clones in the long term. In forking, although variants tend to evolve more or less independently in the details, there is usually communication between the development groups when common external requirements change. In templating, the relationship between clones is even stronger, as it is mostly the inexpressiveness of the language or design of the system that prevents the clones from being combined into a single abstraction. While templating and forking typically have the goal of maintaining the original behavior to a high degree, customization is a reuse of behavior, often without the requirement that the behaviors must remain strongly similar over time. The uncoordinated evolution that occurs in customization clones sets them apart from other clones in important ways: their differences can be harder to spot, the effects of the changes on behavior may be harder to understand, and the code clones may be harder to detect.

We identified two customization patterns:

Bug workarounds

These are used when a developer finds a bug in code that she doesn’t have permission to change, perhaps in a third-party library or framework. Since she can’t change the original code, the developer may choose to copy and paste the offending function into a module or class that she can modify, and apply the fix there, perhaps with a guard to ensure that it is being applied correctly. In object-oriented programs, this has a particularly neat solution: it may be possible to create an inheritance descendant of the class and override the buggy method.

Replicate and specialize

This is used when a developer has found a piece of functionality that he wants to adapt to a new purpose elsewhere in the system. The duplication might be small in size and narrow in context or large and wide-ranging. LaToza et al. noted the latter case as a practice within Microsoft, which they call “clone and own” [LaToza et al. 2006]: when a product group wants specialized functionality added to code that is owned by another development group, they may decide simply to create a copy of the original for use within their system and then adapt it for their own purposes. However, the implicit understanding is that they will also be responsible for its future maintenance and evolution; that is, they now own this clone (see also [Al-Ekram 2005] and [German 2009] for more on this topic).
