Epilogue: What now?

Phew! We’ve covered a lot of ground in this book and for most of our audience all of these ideas are new. With that in mind, we can’t hope to make you experts in these techniques. All we can really do is show you the broad-brush ideas, and just enough code for you to go ahead and write something from scratch.

The code we’ve shown in this book isn’t battle-hardened production code: it’s a set of lego blocks that you can play with to make your first house, spaceship, and skyscraper.

That leaves us with two big tasks for our final chapter. We want to talk about how to start applying these ideas for real in an existing system, and we need to warn you about some of the things we had to skip. We’ve given you a whole new arsenal of ways to shoot yourself in the foot, so we should discuss some basic firearms safety.

How do I get there from here?

Chances are that a lot of you are thinking something like this:

“OK Bob and Harry, that’s all well and good, and if I ever get hired to work on a greenfield service, I know what to do. But in the meantime, I’m here with my big ball of Django mud, and I don’t see any way to get to your nice, clean, perfect, untainted, simplistic model. Not from here.”

We hear you. Once you’ve already built a big ball of mud, it’s hard to know how to start improving things. Really, we need to tackle things step by step.

First things first: what problem are you trying to solve? Is the software too hard to change? Is the performance unacceptable? Have you got weird inexplicable bugs?

Having a clear goal in mind will help you to prioritise the work that needs to be done and, importantly, communicate the reasons for doing it to the rest of the team. Businesses tend to have very pragmatic approaches to technical debt and refactoring, so long as engineers can make a reasoned argument for fixing things.

Tip

Making complex changes to a system is often an easier sell if you link it to feature work. Perhaps you’re launching a new product or opening your service to new markets? This is the right time to spend engineering resources on fixing the foundations. With a six-month project to deliver, it’s easier to make the argument for three weeks of clean-up work. Bob refers to this as “architecture tax”.

Separating entangled responsibilities

At the beginning of the book, we said that the main characteristic of a big ball of mud is homogeneity: every part of the system looks the same, because we haven’t been clear about the responsibilities of each component. To fix that, we’ll need to start separating responsibilities out and introducing some clear boundaries. One of the first things we can do is to start building a service layer.

The system where Bob first learned how to break apart a ball of mud was a doozy. There was logic everywhere: in the web pages, in “manager” objects, in “helpers”, in fat “service” classes that we’d written to abstract the managers and helpers, and in hairy “command” objects that we’d written to break apart the services.

If you’re working in a system that’s reached this point, it can feel hopeless, but it’s never too late to start weeding an overgrown garden. Eventually we hired an architect who knew what he was doing, and he helped us get things back under control.

Start by working out the use cases of your system. If you have a user interface, what actions does it perform? If you’ve got a back-end processing component, then maybe each cron job or Celery task is a single use case. Each of your use cases needs an imperative name: “Apply Billing Charges”, “Clean Abandoned Accounts”, or “Raise Purchase Order”.

In our case, most of our use-cases were part of the manager classes and had names like “Create Workspace” or “Delete Document Version”. Each use-case was invoked from a web front end.

Our goal is to create a single function or class for each of these supported operations that deals with orchestrating the work to be done. Each use case should:

  • Start its own database transaction if needed

  • Fetch any required data

  • Check any preconditions (see the ensure pattern in the validation appendix)

  • Update the domain model

  • Persist any changes

Each use case should succeed or fail as an atomic unit. You might need to call one use case from another; that’s okay, just make a note of it, and try to avoid long-running database transactions.
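Here’s a minimal sketch of what one of these use-case functions might look like, assuming a unit-of-work object along the lines described earlier in the book; the repository and method names are invented for illustration:

def clean_abandoned_accounts(uow, cutoff):
    """Archive accounts that haven't been seen since `cutoff`."""
    with uow:  # start our own database transaction
        accounts = uow.accounts.list_inactive_since(cutoff)  # fetch required data
        for account in accounts:
            if account.has_active_subscription():  # check preconditions
                continue
            account.archive()  # update the domain model
        uow.commit()  # persist changes; the operation succeeds or fails as one unit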

Note

One of the biggest problems we had was that manager methods called other manager methods, and data access could happen from the model objects themselves. It was hard to understand what each operation did without going on a treasure hunt across the codebase. Pulling all the logic into a single method, and using a unit of work to control our transactions, made the system easier to reason about.

This is a good opportunity to pull any data-access or orchestration code out of the domain model and up into the use cases. We should also try to pull IO concerns (e.g., sending emails, writing files) out of the domain model and up into the use-case functions. We apply the techniques from Chapter 3 on abstractions to keep our handlers unit testable even when they’re performing IO.
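As a sketch of what that hoisting looks like, here’s a hypothetical use case that keeps the domain logic pure and pushes the email-sending out to an injected dependency; the notifications object and all the method names are our invention:

def apply_billing_charges(uow, notifications, account_id):
    with uow:
        account = uow.accounts.get(account_id)
        invoice = account.bill_for_current_period()  # pure domain logic, no IO
        owner_email = account.owner_email
        uow.commit()
    # IO happens at the edge, where unit tests can swap in a fake notifications object
    notifications.send(owner_email, f"Invoice raised: {invoice.total}")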

Tip

It’s fine if you’ve got duplication in the use-case functions. We’re not trying to write perfect code, we’re just trying to extract some meaningful layers. It’s better to duplicate some code in a few places than to have use-case functions calling one another in a long chain.

These use-case functions will mostly be about logging, data access, and error handling. Once you’ve done this step, you’ll have a grasp of what your program actually does, and a way to make sure each operation has a clearly defined start and finish. We’ll have taken a step toward building a pure domain model.

Read Working Effectively with Legacy Code by Michael Feathers for guidance on how to get legacy code under test, and how to start separating responsibilities: https://www.oreilly.com/library/view/working-effectively-with/0131177052/

Identifying Aggregates and Bounded Contexts

Part of the problem with the codebase in our case study was that the object graph was highly connected. Each account had many workspaces, each workspace had many members, all of whom had their own accounts. Each workspace contained many documents, which had many versions.

You can’t express the full horror of the thing in an entity-relationship diagram. For one thing, there wasn’t really a single account related to a user. Instead there was some bizarre rule where you had to enumerate all of the accounts associated with the user via the workspaces and take the one with the earliest creation date.

Every object in the system was part of an inheritance hierarchy that included SecureObject and Version, and this inheritance hierarchy was mirrored directly in the database schema, so that every query had to join across ten different tables and look at a discriminator column just to tell what kind of objects you were working with.

The codebase made it easy to “dot” your way through these objects like so:

user.account.workspaces[0].documents.versions[1].owner.account.workspaces[0].settings

It’s easy to build a system this way with the Django ORM or SQLAlchemy, but it’s best avoided. While it’s convenient, it makes it very hard to reason about performance, because each property access might trigger a lookup to the database.

Tip

Aggregates are a consistency boundary. In general, each use-case should update a single aggregate at a time. One handler fetches one aggregate from a repository, modifies its state, and raises any events that happen as a result. If you need data from another part of the system, it’s totally fine to use a read model, but avoid updating multiple aggregates in a single transaction. When we choose to separate code into different aggregates, we’re explicitly choosing to make them eventually consistent with one another. Chapter 6 has much more on this topic.

There were a bunch of operations that required us to loop over objects this way, for example:

# Lock a user's workspaces for non-payment
def lock_account(user):
    for workspace in user.account.workspaces:
        workspace.archive()

Or even recurse over collections of folders and documents:

def lock_documents_in_folder(folder):

    for doc in folder.documents:
        doc.archive()

    for child in folder.children:
        lock_documents_in_folder(child)

These operations killed performance but fixing them meant giving up our single object graph. Instead we began to identify aggregates and to break the direct links between objects.

Note

We talked about the infamous “select N+1 problem” in Chapter 11, and how we might choose to use different techniques when reading data for queries vs reading data for commands.

Mostly we did this by replacing direct references with identifiers:

Before:

class Workspace:
    folders: List[Folder]


class Folder:
    permissions: PermissionSet
    documents: List[Document]
    parent: Folder
    children: List[Folder]


class Document:
    parent: Folder
    workspace: Workspace
    versions: List[DocumentVersion]

After:

class Document:
    id: int
    workspace_id: int
    parent_id: int

    # Note that our Document aggregate continued to hold all its versions,
    # so that we could treat the whole document as a single unit.
    versions: List[DocumentVersion]


class Folder:
    id: int
    permissions: PermissionSet
    workspace_id: int


class Workspace:
    id: int

Tip

Bi-directional links are often a sign that your aggregates aren’t right. In our original code, a Document knew about its containing Folder, and the Folder had a collection of Documents. This makes it easy to traverse the object graph but stops us from thinking properly about the consistency boundaries we need. We break apart aggregates by using references instead. In the new model, a Document had a folder_id but no way to directly access the Folder.
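In practice, code that used to dot its way through the object graph now fetches the related aggregate explicitly. A sketch, with invented repository names:

def move_document(uow, document_id, target_folder_id):
    with uow:
        document = uow.documents.get(document_id)
        target = uow.folders.get(target_folder_id)  # explicit fetch, no lazy traversal
        document.parent_id = target.id  # store a reference, not an object
        uow.commit()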

If we needed to read data, we avoided writing complex loops and transforms and tried to replace them with straight SQL. For example, one of our screens was a tree view of folders and documents.

This screen was incredibly heavy on the database, because it relied on nested for loops that each triggered lazy loads through the ORM.

Tip

We use this same technique in the book in Chapter 11 where we replace a nested loop over ORM objects with a simple SQL query. It’s the first step in a CQRS approach.

After a lot of head-scratching, we replaced the ORM code with a big, ugly stored procedure. The code looked horrible, but it was much faster and it helped us to break the links between Folder and Document.
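The real thing was a stored procedure, but a simplified sketch of the shape of the query, with invented table and column names, looks like this: one round-trip instead of nested lazy loads.

def folder_tree_rows(session, workspace_id):
    return session.execute(
        """
        SELECT f.id AS folder_id, f.parent_id, d.id AS document_id, d.name
        FROM folders AS f
        LEFT JOIN documents AS d ON d.folder_id = f.id
        WHERE f.workspace_id = :workspace_id
        ORDER BY f.parent_id, f.id
        """,
        dict(workspace_id=workspace_id),
    ).fetchall()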

When we needed to write data, we changed a single aggregate at a time, and we introduced a message bus to handle events. For example, in the new model, when we locked an account, we could first query for all the affected workspaces:

SELECT id FROM workspace WHERE account_id = ?

We could then raise a new command for each workspace:

for workspace_id in workspaces:
    bus.handle(LockWorkspace(workspace_id))
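The matching handler then touches exactly one aggregate per transaction, so each workspace locks, or fails, independently. A sketch with our invented names:

def lock_workspace(cmd, uow):
    with uow:
        workspace = uow.workspaces.get(cmd.workspace_id)  # one aggregate per use case
        workspace.lock()
        uow.commit()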

An Event-Driven Approach to Go to Microservices via Strangler Pattern

When building the availability service, we used a technique called event interception to move functionality from one place to another. This is a three-step process:

  1. Raise events to represent the changes happening in a system you want to replace.

  2. Build a second system that consumes those events and uses them to build its own domain model.

  3. Replace the older system with the new.
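Step 1 might look like the following sketch: we bolt event publication onto the legacy code path without changing what it does. The bus and the event class are our invention, shown here over the Django ORM:

def legacy_create_workspace(name, owner):
    # existing Django code, unchanged
    workspace = Workspace.objects.create(name=name, owner=owner)
    # new: announce what happened, for the strangler system to consume
    bus.publish(WorkspaceCreated(workspace_id=workspace.id, name=name))
    return workspace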

Convincing your stakeholders to try something new

If you’re thinking about carving a new system out of a big ball of mud, you’re probably suffering problems with reliability, performance, or maintainability, or all three simultaneously. Deep, intractable problems call for drastic measures!

We recommend domain modelling as a first step. In many overgrown systems, the engineers, product owners, and customers no longer speak the same language. Business stakeholders speak about the system in abstract, process-focused terms, while developers are forced to speak about the system as it physically exists in its wild and chaotic state.

Figuring out how to model your domain is a complex task, and the subject of many decent books in its own right. We like to use interactive techniques like Event Storming and CRC modelling, because humans are good at collaborating through play.

Event Modelling is another technique that brings engineers and product owners together to understand a system in terms of commands, queries, and events.

The goal is to be able to talk about the system using the same ubiquitous language, so that you can agree on where the complexity lies.

We’ve found a lot of value in treating domain problems as a TDD kata. For example, the first code we wrote for the availability service was the batch and order line model. You can treat this as a lunchtime workshop, or as a spike at the beginning of a project. Once you can demonstrate the value of modelling, it’s easier to make the argument for structuring the project to make modelling easy.

Getting started

Once you’ve managed to convince your team to try a new way of building software, it’s easy to get overwhelmed by the sheer number of new things to think about. What message broker should we use? How should we:

  • Decide on a piece of the old system to carve out

  • Get the old system to produce events, both as its main outputs and as inputs to your new system

  • Consume those events in your new service, which now has its own database and bounded context

  • Have the new system produce either the same events the old one did (so you can switch those old parts off) or new ones (so you can switch over the downstream consumers progressively)

More required reading

  • Monolith to Microservices by Sam Newman, and his original book, Building Microservices. The Strangler Fig pattern is covered as a favorite, along with many others. These are good if you’re thinking of moving to microservices, and also on integration patterns and the considerations of async messaging-based integration.

  • Clean Architectures in Python by Leonardo Giordani, which came out in 2019, is one of the few previous books on application architecture in Python.

Questions our Tech Reviewers Asked That We Couldn’t Work Into Prose

Do I need to do all of this at once? Can I just do a bit at a time?

No, you can absolutely adopt these techniques bit by bit. If you have an existing system, we recommend building a service layer to try to keep orchestration in one place. Once you have that, it’s much easier to push logic into the model, and to push edge concerns, like validation or error handling, to the entry points.
It’s worth having a service layer even if you still have a big, messy Django ORM, because it’s a way to start understanding the boundaries of operations.
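Even a thin service layer over a messy Django ORM pays for itself. A sketch, with an invented model:

from django.db import transaction

def archive_workspace(workspace_id):
    # one clearly bounded operation: its own transaction, and one place to look
    with transaction.atomic():
        workspace = Workspace.objects.select_for_update().get(id=workspace_id)
        workspace.status = "ARCHIVED"
        workspace.save()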

Extracting use-cases will break a lot of my existing code; it’s too tangled

Just copy and paste. It’s okay to cause more duplication in the short term. Think of this as a multi-step process. Your code is in a bad state now, so copy and paste it to a new place, and then make that new code clean and tidy.
Once you’ve done that, you can replace uses of the old code with calls to your new code and finally delete the mess. Fixing large code bases is a messy and painful process. Don’t expect things to get instantly better, and don’t worry if some bits of your application stay messy.

Do I need to do CQRS? That sounds weird. Can’t I just use repositories for my reads?

Of course you can! The techniques we’re presenting in this book are intended to make your life easier, they’re not some kind of ascetic discipline with which to punish yourself.
In our first case-study system, we had a lot of “View Builder” objects that used repositories to fetch data and then performed some transformations to return dumb read models. The advantage is that when you hit a performance problem, it’s easy to rewrite a view builder to use custom queries or raw SQL.

How should use cases interact across a larger system? Is it a problem for one to call another?

This might be an interim step. Again, in the first case-study, we had handlers that would need to invoke other handlers. This gets really messy, though, and it’s much better to move to using a message bus to separate these concerns.
Generally, your system will have a single message bus implementation, and a bunch of different subdomains that center on a particular aggregate or set of aggregates. When your use case has finished, it can raise an event, and a handler elsewhere can run.

Is it a smell for a use case to use multiple repositories, and why?

Yes! A repository is a pattern that we use for reading aggregates from our persistent store. By definition, we should only ever be updating one aggregate at a time. If you need to read data to figure out what to do, then consider a read-model that returns just the data you need, even if you cheat and build it with a repo and domain objects under the hood.
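Such a “cheating” read model might look like this sketch; the caller sees plain data, and we’re free to swap the repository for raw SQL later (all names invented):

def get_workspace_summary(uow, workspace_id):
    with uow:
        workspace = uow.workspaces.get(workspace_id)  # repo under the hood, for now
        return dict(id=workspace.id, name=workspace.name, locked=workspace.is_locked)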

What if I have a read-only but business-logic heavy system?

View models can have complex logic in them. In this book we’ve encouraged you to separate your read and write models because they have different consistency and throughput requirements. Mostly, we can use simpler logic for reads, but that’s not always true. In particular, permissions and authorization models can add a lot of complexity to our read-side.
We’ve written systems where the view models needed extensive unit tests. In those systems, we split a view builder from a view fetcher.

This makes it easy to test the view builder by giving it mocked data, e.g., a list of dicts.
“Fancy CQRS” with event handlers is really a way of running our complex view logic whenever we write, so that we can avoid running it when we read.
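A sketch of the split, with invented names: the fetcher does the IO, and the builder is a pure function over plain data, so unit tests can feed it a hand-written list of dicts:

def fetch_document_rows(session, workspace_id):
    rows = session.execute(
        "SELECT id, name, owner_id FROM documents WHERE workspace_id = :w",
        dict(w=workspace_id),
    )
    return [dict(r) for r in rows]


def build_document_view(rows, current_user_id):
    # all the fiddly permission logic lives here, with no database in sight
    return [r for r in rows if r["owner_id"] == current_user_id]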

Do I need to build microservices to do this stuff?

Egads, no! These techniques pre-date microservices by a decade or so. Aggregates, domain events, and dependency inversion are ways to control complexity in large systems. It so happens that when you’ve built a set of use cases and a model for a business process, it’s relatively easy to move it to its own service, but that’s not a requirement.

I’m using Django, can I still do this?

We have an entire appendix just for you: see Appendix D.

How can I convince other engineers to try this stuff?

tbc

Footguns

OK, so we’ve given you a whole bunch of new toys to play with; here’s the fine print. Harry and Bob do not recommend that you copy-paste our code into a production system and rebuild your automated trading platform on Redis pub-sub. For reasons of brevity and simplicity, we’ve hand-waved a lot of tricky subjects. Here’s a list of things we think you should know before trying this for real.

Reliable messaging is hard

Redis pub/sub is not reliable and shouldn’t be used as a general-purpose messaging tool. We picked it because it’s familiar and easy to run. At MADE we run EventStore as our messaging tool, but we’ve had experience with RabbitMQ and AWS EventBridge.

Tyler Treat has some excellent blog posts on his site bravenewgeek.com; you should read at least:

  • https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

  • https://bravenewgeek.com/what-you-want-is-what-you-dont-understanding-trade-offs-in-distributed-messaging/

We explicitly choose small, focused transactions that can fail independently

In Chapter 8 we update our process so that deallocating an order line and reallocating the line happen in two separate units of work. You will need monitoring to know when these transactions fail, and tooling to replay events. Some of this is made easier by using a transaction log as your message broker (e.g., Kafka, EventStore). You might also look at the Outbox pattern: https://microservices.io/patterns/data/transactional-outbox.html

We don’t discuss idempotency

We haven’t given any real thought to what happens when handlers are retried. In practice you will want to make handlers idempotent so that calling them repeatedly with the same message will not make repeated changes to state. This is a key technique for building reliability, because it enables us to safely retry events when they fail.
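One common approach, sketched here with an invented processed-message store: record each message ID in the same transaction as the state change, so that a redelivery becomes a no-op.

def handle_once(message, handler, uow):
    with uow:
        if uow.processed_messages.contains(message.id):
            return  # duplicate delivery: safely ignore
        handler(message, uow)
        uow.processed_messages.add(message.id)
        uow.commit()  # state change and dedup record are committed atomically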

There’s a lot of good material on idempotent message handling; try starting here:

  • https://blog.sapiensworks.com/post/2015/08/26/How-To-Ensure-Idempotency

  • https://lostechies.com/jimmybogard/2013/06/03/un-reliability-in-messaging-idempotency-and-de-duplication/

Your events will need to change their schema over time

You’ll need to find some way of documenting your events and sharing schemas with consumers. We like using JSON Schema and Markdown because they’re simple, but there is other prior art.
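For example, a minimal JSON Schema for a hypothetical WorkspaceLocked event, expressed as a Python dict for brevity:

WORKSPACE_LOCKED_V1 = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "WorkspaceLocked",
    "type": "object",
    "properties": {
        "workspace_id": {"type": "integer"},
        "locked_at": {"type": "string", "format": "date-time"},
    },
    "required": ["workspace_id", "locked_at"],
    "additionalProperties": False,
}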

Greg Young wrote an entire book on managing event-driven systems over time: https://www.goodreads.com/book/show/34327067-versioning-in-an-event-sourced-system

Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf is a pretty good start for messaging patterns.
