10

Cohesion

Cohesion (in computer science) is defined as “the degree to which the elements inside a module belong together.”1

Modularity and Cohesion: Fundamentals of Design

My favorite way to describe good software design is based on this Kent Beck quote:

Pull the things that are unrelated further apart, and put the things that are related closer together.

This simple, slightly jokey phrase has some real truth in it. Good design in software is really about the way in which we organize the code in the systems that we create. All my recommended principles to help us manage complexity are really about compartmentalizing our systems. We need to be able to build our systems out of smaller, more easily understandable, more easily testable, discrete pieces. To achieve this, we certainly need techniques that will allow us to “Pull the unrelated things further apart,” but we also need to take seriously the need to “put the related things closer together.” That is where cohesion comes in.

Cohesion is one of the more slippery concepts here. I can do something naive, like support the idea of modules in my programming language and claim, as a result, that my code is modular. This is wrong; simply throwing a collection of unrelated stuff into a file does not make the code modular in any but the most trivial sense.

1. Source: Wikipedia https://en.wikipedia.org/wiki/Cohesion_(computer_science)

When I speak of modularity, I really mean components of the system that genuinely hide information from other components (modules). If the code within the module is not cohesive, then this doesn’t work.

The trouble is that this is open to overly simplistic interpretations. This is probably the point at which the art, skill, and experience of the practitioner really come into play. This balance point between genuinely modular systems and cohesion often seems to confuse people.

A Basic Reduction in Cohesion

How often have you seen a piece of code that will retrieve some data, parse it, and then store it somewhere else? Surely the “store” step is related to the “parse” step? Isn’t that good cohesion? They are all the steps that we need together, aren’t they?

Well, not really—let’s look at an example. First my caveats: it is going to be hard to tease apart several ideas here. This code is inevitably going to demonstrate a bit of each of the ideas in this section, so I rely on you to focus on where it touches on cohesion and smile knowingly when I also touch on separation of concerns, modularity, and so on.

Listing 10.1 shows rather unpleasant code as a demonstration. However, it serves my purpose to give us something concrete to explore. This code reads a small file containing a list of words, sorts them alphabetically, and then writes a new file with the resulting sorted list—load, process, and store!

This is a fairly common pattern for lots of different problems: read some data, process it, and then store the results somewhere else.

Listing 10.1 Really Bad Code, Naively Cohesive

public class ReallyBadCohesion
{
    public boolean loadProcessAndStore() throws IOException
    {
        String[] words;
        List<String> sorted;
        try (FileReader reader =
                    new FileReader("./resources/words.txt"))
        {
            char[] chars = new char[1024];
            reader.read(chars);
            words = new String(chars).split(" |");
        }

        sorted = Arrays.asList(words);
        sorted.sort(null);

        try (FileWriter writer =
                    new FileWriter("./resources/test/sorted.txt"))
        {
            for (String word : sorted)
            {
                writer.write(word);
                writer.write(" ");
            }
            return true;
        }
    }
}

I find this code extremely unpleasant, and I had to force myself to write it like this. This code is screaming “poor separation of concerns,” “poor modularity,” “tight coupling,” and almost “zero abstraction,” but what about cohesion?

Here we have everything that it does in a single function. I see lots of production code that looks like this, only usually much longer and much more complex, so even worse!

A naive view of cohesion is that everything is together and so easy to see. So ignoring the other techniques of managing complexity for a moment, is this easier to understand? How long would it take you to understand what this code does? How long if I hadn’t helped with a descriptive method name?

Now look at Listing 10.2, which is a slight improvement.

Listing 10.2 Bad Code, Mildly Better Cohesion

public class BadCohesion
{
    public boolean loadProcessAndStore() throws IOException
    {
        String[] words = readWords();
        List<String> sorted = sortWords(words);
        return storeWords(sorted);
    }
    private String[] readWords() throws IOException
    {
        try (FileReader reader =
                    new FileReader("./resources/words.txt"))
        {
            char[] chars = new char[1024];
            reader.read(chars);
            return new String(chars).split(" |");
        }
    }

    private List<String> sortWords(String[] words)
    {
        List<String> sorted = Arrays.asList(words);
        sorted.sort(null);
        return sorted;
    }

    private boolean storeWords(List<String> sorted) throws IOException
    {
        try (FileWriter writer =
                    new FileWriter("./resources/test/sorted.txt"))
        {
            for (String word : sorted)
            {
                writer.write(word);
                writer.write(" ");
            }
            return true;
        }
    }
}

Listing 10.2 is still not good, but it is more cohesive; the parts of the code that are closely related are more clearly delineated and literally closer together. Simplistically, everything that you need to know about readWords is named and contained in a single method. The overall flow of the method loadProcessAndStore is plain to see now, even if I had chosen a less descriptive name. The information in this version is more cohesive than the information in Listing 10.1. It is now significantly clearer which parts of the code are more closely related to one another, even though the code is functionally identical. All of this makes this version significantly easier to read and, as a result, makes it much easier to modify.

Note that there are more lines of code in Listing 10.2. This example is written in Java, which is a rather verbose language, and the boilerplate costs are quite high, but even without that there is a small overhead to improving the readability. This is not necessarily a bad thing!

There is a common desire among programmers to reduce the amount of typing that they do. Clearly, concision is valuable. If we can express ideas simply, then that is of significant value, but you don’t measure simplicity in terms of the fewest characters typed. ICanWriteASentenceOmittingSpaces is shorter, but it is also much less pleasant to read!

It is a mistake to optimize code to reduce typing. We are optimizing for the wrong things. Code is a communication tool; we should use it to communicate. Sure, it needs to be machine-readable and executable too, but that is not really its primary goal. If it was, then we would still be programming systems by flipping switches on the front of our computers or writing machine code.

The primary goal of code is to communicate ideas to humans. We write code to express ideas as clearly and simply as we can—at least that is how it should work. We should never choose brevity at the cost of obscurity. Making our code readable is, to my mind, both a professional duty of care and one of the most important guiding principles in managing complexity. So I prefer to optimize to reduce thinking rather than to reduce typing.

Back to the code: this second example is clearly more readable. It is much easier to see its intent, it is still pretty horrible, it is not modular, there is not much separation of concerns, it is inflexible with hard-coded strings for filenames, and it is not testable other than running the whole thing and dealing with the file system. But we have improved the cohesion. Each part of the code is now focused on one part of the task. Each part has access only to what it needs to accomplish that task. We will return to this example in later chapters to see how we can improve on it further.

Context Matters

I asked a friend, whose code I admire, if he had any recommendations to demonstrate the importance of cohesion, and he recommended the Sesame Street YouTube video,2 “One of these things is not like the other.”

So that is a bit of a joke, but it also raises a key point. Cohesion, more than the other tools to manage complexity, is contextual. Depending on context, “All of these things may not be like the other.”

We have to make choices, and these choices are intimately entangled with the other tools. I can’t clearly separate cohesion from modularity or separation of concerns because those techniques help to define what cohesion means in the context of my design.

One effective tool to drive this kind of decision-making is domain-driven design.3 Allowing our thinking, and our designs, to be guided by the problem domain helps us to identify paths that are more likely to be profitable in the long run.

2. A Sesame Street song called “One of these things is not like the other”: https://youtu.be/rsRjQDrDnY8

3. Domain Driven Design is the title of a book written by Eric Evans and an approach to the design of software systems. See https://amzn.to/3cQpNaL.

Domain-Driven Design

Domain-driven design is an approach to design where we aim to capture the core behaviors of our code in essence as simulations of the problem domain. The design of our system aims to accurately model the problem.

This approach includes a number of important, valuable ideas.

It allows us to reduce the chance of misunderstanding. We aim to create a “ubiquitous language” to express ideas in the problem domain. This is an agreed, accurate way of describing ideas in the problem domain, using words consistently, and with agreed meanings. We then apply this language to the way that we talk about the design of our systems too.

So if I am talking about my software and I say that this “Limit-order matched,” then that makes sense in terms of the code, where the concepts of “limit orders” and “matching” are clearly represented, and named LimitOrder and Match. These are precisely the same words that we use when describing the scenario in business terms with nontechnical people.
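
To make that concrete, here is a small sketch of what this can look like in code. The LimitOrder and Match names come from the example above; the fields and the structure are my own invention, purely for illustration, and not a real matching engine.

// A minimal sketch of the ubiquitous language showing up in code.
// The names come from the discussion above; everything else is illustrative.
final class LimitOrder
{
    final String instrument;
    final long quantity;
    final long limitPrice;

    LimitOrder(String instrument, long quantity, long limitPrice)
    {
        this.instrument = instrument;
        this.quantity = quantity;
        this.limitPrice = limitPrice;
    }
}

final class Match
{
    // A "match" pairs a buy order with a sell order, described in exactly the
    // same words that the business uses in conversation.
    final LimitOrder buy;
    final LimitOrder sell;

    Match(LimitOrder buy, LimitOrder sell)
    {
        this.buy = buy;
        this.sell = sell;
    }
}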

This ubiquitous language is effectively developed and refined through capturing requirements and the kind of high-level test cases that can act as “executable specifications for the behavior of the system” that can drive the development process.

DDD also introduced the concept of the “bounded context.” This is a part of a system that shares common concepts. For example, an order-management system probably has a different concept of “order” from a billing system, so these are two, distinct bounded contexts.

This is an extremely useful concept for helping to identify sensible modules or subsystems when designing our systems. The big advantage of using bounded contexts in this way is that they are naturally more loosely coupled in the real problem domain, so they are likely to guide us to create more loosely coupled systems.
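
As a sketch of the order-management versus billing example, the fragment below shows each bounded context owning its own notion of an “order.” The package and field names are illustrative assumptions only; the point is that only the identity crosses the boundary, so the two contexts are free to evolve independently.

// Sketch only: package and field names are assumptions, not a prescription.

// file: ordermanagement/Order.java — the fulfilment view of an order
package ordermanagement;

import java.util.List;

public class Order
{
    public String orderId;
    public List<String> lineItems;     // what was ordered
    public String deliveryAddress;     // where it should be sent
}

// file: billing/Order.java — the billing view of the "same" order
package billing;

public class Order
{
    public String orderId;             // only the identity crosses the boundary
    public long amountOwedCents;       // billing cares about money, not picking lists
    public String vatNumber;
}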

We can use ideas like ubiquitous language and bounded context to guide the design of our systems. If we follow their lead, we tend to build better systems, and they help us to more clearly see the core, essential complexity of our system and differentiate that from the accidental complexity that often, otherwise, can obscure what our code is really attempting to do.

If we design our system so that it is a simulation of the problem domain, as far as we understand it, then an idea that is viewed as a small change from the perspective of the problem domain will also be a small step in the code. This is a nice property to have.

Domain-driven design is a powerful tool in creating better designs and provides a suite of organizing principles that can help guide our design efforts and encourages us to improve the modularity, cohesion, and separation of concerns in our code. At the same time, it leads us toward a coarse-grained organization of our code that is naturally more loosely coupled.

Another important tool that helps us create better systems is separation of concerns, which we will talk about in considerably more detail in the next chapter, but for now it is perhaps the closest thing that I have to a rule to guide my own programming. “One class, one thing; one method/function, one thing.”

I strongly dislike both of the code examples presented in this chapter so far and feel slightly embarrassed to show them to you, because my design instincts are screaming at me that the separation of concerns is so terrible in both cases. Listing 10.2 is better; at least each method now does one thing, but the class is still terrible. If you don’t already see it, we will look at why that matters in the next chapter.

Finally, in my box of tools, there is testability. I started writing these bad code examples as I always start when writing code: by writing a test. I had to stop almost immediately, though, because there was no way that I could practice TDD and write code this bad! I had to dump the test and start again, and I confess that I felt like I had stepped back in time. I did write tests for my examples to check to see if they did what I expected, but this code is not properly testable.

Testability strongly encourages modularity, separation of concerns, and the other attributes that we value in high-quality code. That, in turn, helps us make an initial approximation of the contexts and abstractions that we like the look of in our design and where to make our code more cohesive.

Note, there are no guarantees here, and that is the ultimate point of this book. There are no simple, cookie-cutter answers. This book provides mental tools that help us structure our thinking when we don’t have the answers.

The techniques in this book are not meant to deliver the answers to you; that is still up to you. They are rather to provide you with a collection of ideas and techniques that will allow you to safely make progress even when you don’t yet know the answer. When you are creating a system of any real complexity, that is always the case; we never know the answers until we are finished!

You can think of this as a fairly defensive approach, and it is, but the aim is to keep our freedom of choice open. That is one of the significant benefits of working to manage complexity. As we learn more, we can change our code on an ongoing basis to reflect that learning. I think a better adjective than “defensive” is “incremental.”

We make progress incrementally through a series of experiments, and we use the techniques of managing complexity to protect ourselves from making mistakes that are too damaging.

This is how science and engineering work. We control the variables, take a small step, and evaluate where we are. If our evaluation suggests that we took a misstep, then we take a step back and decide what to try next. If it looks okay, we control the variables, take another small step, and so on.

Another way to think of this is that software development is a kind of evolutionary process. Our job as programmers is to guide our learning and our designs through an incremental process of directed evolution toward desirable outcomes.

High-Performance Software

One of the common excuses for unpleasant code, like that shown in Listing 10.1, is that you have to write more complex code if you want high performance. I spent the latter part of my career working on systems at the cutting edge of high performance, and I can assure you that this is not the case. High-performance systems demand simple, well-designed code.

Think for a moment what high performance means in software terms. To achieve “high performance,” we need to do the maximum amount of work for the smallest number of instructions.

The more complex our code, the more likely that the paths through our code are not optimal, because the “simplest possible route” through our code is obscured by the complexity of the code itself. This is a surprising idea to many programmers, but the route to fast code is to write simple, easy-to-understand code.

This is even more true as you start taking a broader system view.

Let’s revisit our trivial example again. I have heard programmers make the argument that the code in Listing 10.1 is going to be faster than the code in Listing 10.2 because of the “overhead” of the method calls that Listing 10.2 adds. I am afraid that for most modern languages this is nonsense. Most modern compilers will look at the code in Listing 10.2 and inline the methods. Most modern optimizing compilers will do more than that. Modern compilers do a fantastic job of optimizing code to run efficiently on modern hardware. They excel when the code is simple and predictable, so the more complex your code is, the less help you will gain from your compiler’s optimizer. Most optimizers in compilers simply give up trying once the cyclomatic complexity4 of a block of code exceeds some threshold.

I ran a series of benchmarks against both versions of this code. They were not very good, because this code is bad. We are not sufficiently controlling the variables to really see clearly what is happening, but what was obvious was that there was no real measurable difference at this level of test.

The differences were too tiny to be distinguished from everything else that is going on here. On one run, the BadCohesion version was best; on another the ReallyBadCohesion was best. On a series of benchmark runs, for each of 50,000 iterations of the loadProcessAndStore method, the difference was no more than 300 milliseconds overall, so on average, that is roughly a difference of 6 nanoseconds per call and was actually slightly more often in favor of the version with the additional method calls.

This is a poor test, because the thing that we are interested in, the cost of the method calls, is dwarfed by the cost of the I/O. Testability—in this case performance testability—once again can help guide us toward a better outcome. We will discuss this in more detail in the next chapter.

There is so much going on “under the hood” that it is hard, even for experts, to predict the outcome. What is the answer? If you are really interested in the performance of your code, don’t guess about what will be fast and what will be slow; measure it!
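
If you want a first impression of a claim like this one, even a crude harness along the following lines will do, though it is only a sketch; for anything serious, use a purpose-built micro-benchmarking tool, such as JMH, and take care to control the variables.

import java.io.IOException;

// A deliberately crude timing harness, shown only as a sketch of "measure it".
// Warm-up, file I/O, and garbage collection all distort numbers like these,
// so treat the output as a rough indication, nothing more.
public class CrudeBenchmark
{
    public static void main(String[] args) throws IOException
    {
        final int iterations = 50_000;

        ReallyBadCohesion worse = new ReallyBadCohesion();
        BadCohesion better = new BadCohesion();

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
        {
            worse.loadProcessAndStore();
        }
        long worseTime = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
        {
            better.loadProcessAndStore();
        }
        long betterTime = System.nanoTime() - start;

        System.out.printf("ReallyBadCohesion: %d ms%n", worseTime / 1_000_000);
        System.out.printf("BadCohesion:       %d ms%n", betterTime / 1_000_000);
    }
}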

4. A software metric used to indicate the complexity of a program.

Link to Coupling

If we want to retain our freedom to explore and to sometimes make mistakes, we need to worry about the costs of coupling.

Coupling: Given two lines of code, A and B, they are coupled when B must change behavior only because A changed.

Cohesion: They are cohesive when a change to A allows B to change so that both add new value.5

Coupling is really too generic a term. There are different kinds of coupling that need to be considered (an idea that we will explore in more detail in Chapter 13).

It is ridiculous to imagine a system that has no coupling. If we want two pieces of our system to communicate, they must be coupled to some degree. So like cohesion, coupling is a matter of degree rather than any kind of absolute measure. The cost, though, of inappropriate levels of coupling is extremely high, so it is important to take its influence into account in our designs.

Coupling is in some ways the cost of cohesion. In the areas of your system that are cohesive, they are likely to also be more tightly coupled.
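
To make those definitions a little more concrete, here is a small, invented illustration. In the first pair of classes, PriceParser must change whenever PriceReport changes its output format, purely to keep working; that is coupling. In the Cart class, the prices and the total that depends on them live together, so a change such as adding discounts would change both parts and add new value; that is cohesion.

import java.util.ArrayList;
import java.util.List;

// Invented names, for illustration only.

// Coupled: PriceParser must change whenever PriceReport changes its format,
// just to keep working — neither class gains anything from the change.
class PriceReport
{
    String asLine(String symbol, double price)
    {
        return symbol + ":" + price;      // change this separator...
    }
}

class PriceParser
{
    double priceFrom(String line)
    {
        return Double.parseDouble(line.split(":")[1]);  // ...and this must change too
    }
}

// Cohesive: the prices and the total that depends on them live together, so a
// change such as adding discounts would touch both so that both add new value.
class Cart
{
    private final List<Double> prices = new ArrayList<>();

    void add(double price)
    {
        prices.add(price);
    }

    double total()
    {
        double sum = 0;
        for (double price : prices)
        {
            sum += price;
        }
        return sum;
    }
}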

Driving High Cohesion with TDD

Yet again using automated tests, and specifically TDD, to drive our design gives us a lot of benefits. Striving to achieve a testable design and nicely abstracted, behaviorally focused tests for our system will apply a pressure on our design to make our code cohesive.

We create a test case before we write the code that describes the behavior that we aim to observe in the system. This allows us to focus on the design of the external API/Interface to our code, whatever that might be. Now we work to write an implementation that will fulfill the small, executable specification that we have created. If we write too much code, more than is needed to meet the specification, we are cheating our development process and reducing the cohesion of the implementation. If we write too little, then the behavioral intent won’t be met. The discipline of TDD encourages us to hit the sweet spot for cohesion.
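
As a small, hypothetical example of that sweet spot, here is what such an executable specification might look like, followed by the “just enough” code that makes it pass. The WordSorter name and its interface are invented for this sketch.

// file: WordSorterTest.java — written first, describing only observable behavior
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

public class WordSorterTest
{
    @Test
    public void shouldSortWordsAlphabetically()
    {
        WordSorter sorter = new WordSorter();

        List<String> sorted = sorter.sort(List.of("pear", "apple", "orange"));

        assertEquals(List.of("apple", "orange", "pear"), sorted);
    }
}

// file: WordSorter.java — just enough code to meet the specification, no more
import java.util.ArrayList;
import java.util.List;

public class WordSorter
{
    public List<String> sort(List<String> words)
    {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(null);
        return sorted;
    }
}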

As ever, there are no guarantees. This is not a mechanical process, and it still relies upon the experience and skill of the programmer, but the approach applies a pressure toward a better outcome that wasn’t there before and amplifies those skills and that experience.

5. Coupling and cohesion are described on the famous C2 wiki, https://wiki.c2.com/?CouplingAndCohesion.

How to Achieve Cohesive Software

The key measure of cohesion is the extent, or cost, of change. If you have to wander around your codebase changing it in many places to make a change, that is not a very cohesive system. Cohesion is a measure of functional relatedness. It is a measurement of relatedness of purpose. This is slippery stuff!

Let’s look at a simple example.

If I create a class with two methods, each associated with a member variable (see Listing 10.3), this is poor cohesion, because the variables are really unrelated. They are specific to different methods but stored together at the level of the class even though they are unrelated.

Listing 10.3 More Poor Cohesion

class PoorCohesion:
    def __init__(self):
        self.a = 0
        self.b = 0

    def process_a(self, x):
        self.a = self.a + x

    def process_b(self, x):
        self.b = self.b * x

Listing 10.4 shows a much nicer, more cohesive solution to this. Note that as well as being more cohesive, this version is also more modular and has a better separation of concerns. We can’t duck the relatedness of these ideas.

Listing 10.4 Better Cohesion

class BetterCohesionA:
    def __init__(self):
        self.a = 0

    def process_a(self, x):
        self.a = self.a + x


class BetterCohesionB:
    def __init__(self):
        self.b = 0

    def process_b(self, x):
        self.b = self.b * x

In combination with the rest of our principles for managing complexity, our desire to achieve a testable design helps us to improve the cohesiveness of our solutions. A good example of this is the impact of taking separation of concerns seriously, particularly when thinking about separating accidental complexity6 from essential complexity.7

Listing 10.5 shows three simple examples of improving the cohesiveness of our code by consciously focusing on separating “essential” and “accidental” complexity. In each example, we are adding an item to a shopping cart, storing it in a database, and calculating the value of the cart.

Listing 10.5 Three Cohesion Examples

def add_to_cart1(self, item):
    self.cart.add(item)

    conn = sqlite3.connect('my_db.sqlite')
    cur = conn.cursor()
    cur.execute('INSERT INTO cart (name, price) values (item.name, item.price)')
    conn.commit()
    conn.close()

    return self.calculate_cart_total()


def add_to_cart2(self, item):
    self.cart.add(item)
    self.store.store_item(item)
    return self.calculate_cart_total()


def add_to_cart3(self, item, listener):
    self.cart.add(item)
    listener.on_item_added(self, item)

The first function is clearly not cohesive code. There are lots of concepts and variables jumbled together here and a complete mix of essential and accidental complexity. I would say that this is very poor code, even at this essentially trivial scale. I would avoid writing code like this because it makes thinking about what is going on harder, even at this extremely simple scale.

6. The accidental complexity of a system is the complexity imposed on the system because we are running on a computer. It is the stuff that is a side effect of solving the real problem that we are interested in, e.g., the problems of persisting information, of dealing with concurrency or complex APIs, etc.

7. The essential complexity of a system is the complexity that is inherent to solving the problem, e.g., the calculation of an interest rate or the addition of an item to a shopping cart.

The second example is a little better. This is more coherent. The concepts in this function are related and represent a more consistent level of abstraction in that they are mostly related to the essential complexity of the problem. The “store” instruction is probably debatable, but at least we have hidden the details of the accidental complexity at this point.

The last one is interesting. I would argue that it is certainly cohesive. To get useful work done, we need to both add the item to the cart and inform other, potentially interested parties that the addition has been made. We have entirely separated the concerns of storage and the need to calculate a total for the cart. These things may happen, in response to being notified of the addition, or they may not if those parts of the code didn’t register interest in this “item added” event.

This code either is more cohesive, where the essential complexity of the problem is all here and the other behaviors are side effects, or is less cohesive if you consider “store” and “total” to be parts of this problem. Ultimately, this is contextual and a design choice based on the context of the problems that you are solving.
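
To show one possible shape for that third version, here is a hypothetical sketch, in Java to match the earlier listings, of listeners that register interest in the “item added” event: one stores the item, another keeps a running total, and the cart knows nothing about either concern. All of the names are invented for illustration.

import java.util.ArrayList;
import java.util.List;

// One possible shape for the listener-style design hinted at by add_to_cart3;
// every name here is invented, and this is a sketch, not a prescription.
record Item(String name, double price) { }

interface ItemStore
{
    void storeItem(Item item);
}

interface CartListener
{
    void onItemAdded(ShoppingCart cart, Item item);
}

class ShoppingCart
{
    private final List<Item> items = new ArrayList<>();
    private final List<CartListener> listeners = new ArrayList<>();

    void register(CartListener listener)
    {
        listeners.add(listener);
    }

    void add(Item item)
    {
        items.add(item);
        for (CartListener listener : listeners)
        {
            listener.onItemAdded(this, item);   // the add_to_cart3 idea
        }
    }
}

// One interested party persists the item, keeping storage out of the cart...
class StoringListener implements CartListener
{
    private final ItemStore store;

    StoringListener(ItemStore store)
    {
        this.store = store;
    }

    @Override
    public void onItemAdded(ShoppingCart cart, Item item)
    {
        store.storeItem(item);
    }
}

// ...and another keeps a running total, again without the cart knowing.
class TotallingListener implements CartListener
{
    private double total = 0;

    @Override
    public void onItemAdded(ShoppingCart cart, Item item)
    {
        total += item.price();
    }

    double total()
    {
        return total;
    }
}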

Costs of Poor Cohesion

Cohesion is perhaps the least directly quantifiable aspect of my “tools for managing complexity,” but it is important. The problem is that when cohesion is poor, our code and our systems are less flexible, more difficult to test, and more difficult to work on.

In the simple example in Listing 10.5, the impact of cohesive code is clear. If the code confuses different responsibilities, it lacks clarity and readability as add_to_cart1 demonstrates. If responsibilities are more widely spread, it may be more difficult to see what is happening, as in add_to_cart3. By keeping related ideas close together, we maximize the readability as in add_to_cart2.

In reality, there are some advantages to the style of design hinted at in add_to_cart3, and this code is certainly a nicer place to work than version 1.

My point here, though, is that there is a sweet spot for cohesion. If you jumble too many concepts together, you lose cohesion at a fairly detailed level. In example 1, you could argue that all the work is done inside a single method, but this is only naively cohesive.

In reality, the concepts associated with adding an item to a shopping cart, the business of the function, are mixed in with other duties that obscure the picture. Even in this simple example, it is less clear what this code is doing until we dig in. We have to know a lot more stuff to properly understand this code.

The other alternative, add_to_cart3, while more flexible as a design, still lacks clarity. At this extreme it is easy for responsibilities to be so diffuse, so widely dispersed, that it is impossible to understand the picture without reading and understanding a lot of code. This could be a good thing, but my point is that there is a cost in clarity to coupling this loose, as well as some benefits.

Both of these failings are extremely common in production systems. In fact, they’re so common that they may even be the norm for large complex systems.

This is a failure of design and comes at a significant cost. This is a cost that you will be familiar with if you have ever worked on “legacy code.”8

There is a simple, subjective way to spot poor cohesion. If you have ever read a piece of code and thought “I don’t know what this code does,” it is probably because the cohesion is poor.

Cohesion in Human Systems

As with many of the other ideas in this book, problems with cohesion aren’t limited only to the code that we write and to the systems that we build. Cohesion is an idea that works at the level of information, so it is just as important in getting the organizations in which we work structured sensibly. The most obvious example of this is in team organization. The findings from the “State of DevOps” report say that one of the leading predictors of high performance, measured in terms of throughput and stability, is the ability of teams to make their own decisions without the need to ask permission of anyone outside the team. Another way to think of that is that the information and skills of the team are cohesive, in that the team has all that it needs within its bounds to make decisions and to make progress.

Summary

Cohesion is probably the slipperiest idea in my list of tools for managing complexity. Software developers can, and do, sometimes argue that simply having all the code in one place, one file, and even one function, is at least cohesive, but this is too simplistic.

Code that randomly combines ideas in this way is not cohesive; it is just unstructured. It’s bad. It prevents us from seeing clearly what the code does and how to change it safely.

Cohesion is about putting related concepts, concepts that change together, together in the code. If they are only “together” by accident because everything is “together,” we have not really gained much traction.

Cohesion is the counterpart to modularity and primarily makes sense when considered in combination with modularity. One of the most effective tools to help us strike a good working balance between cohesion and modularity is separation of concerns.

8. Legacy code or legacy systems are systems that have been around for a while. They probably still deliver important value to the organizations that operate them, but they have often devolved into poorly designed tangled messes of code. Michael Feathers defines legacy system as a “system without tests.”
