Chapter 4. Implementing Microservice Communication

There is a bewildering array of options out there for how one microservice can talk to another. But which is the right one: SOAP? XML-RPC? REST? GRPC?

Well, as we discussed in the previous chapter, your choice of technology should be driven in large part by the style of communication you want. Deciding between blocking synchronous or non-blocking asynchronous calls, and between request-response or event-driven collaboration, will help you whittle down what might otherwise be a very long list of technologies.

In this chapter, we’re now going to look at some of the common technology used for microservice communication. But new options are always coming up, so before we discuss specific technology, let’s think about what we want out of whatever technology we pick.

Make Backwards Compatibility Easy

When making changes to our microservices, we need to make sure we don’t break compatibility with any consuming microservices. As such, we want to ensure that whatever technology we pick makes it easy to make backwards compatible changes. Simple operations like adding new fields shouldn’t break clients. We also ideally want the ability to validate that the changes we have made are backwards compatible - and have a way to get that feedback before we deploy our microservice into production.

Make Your Interface Explicit

It is important that the interface that a microservice exposes to the outside world is explicit. This means that it is clear to a consumer of a microservice as to what functionality that microservice exposes. But it also means that it is clear to a developer working on a microservice as to what functionality needs to remain intact for external parties - we want to avoid a situation where a change to a microservice causes an accidental breakage in compatibility.

Schemas can go a long way to helping ensure that the interface a microservice exposes is explicit. Some of the technology we can look at requires the use of a schema, for others the use of a schema is optional. Either way, I strongly encourage the use of a schema, as well as enough supporting documentation to be clear about what functionality a consumer can expect a microservice to provide.

Keep Your APIs Technology-Agnostic

If you have been in the IT industry for more than 15 minutes, you don’t need me to tell you that we work in a space that is changing rapidly. The one certainty is change. New tools, frameworks, and languages are coming out all the time, implementing new ideas that can help us work faster and more effectively. Right now, you might be a .NET shop. But what about in a year from now, or five years from now? What if you want to experiment with an alternative technology stack that might make you more productive?

I am a big fan of keeping my options open, which is why I am such a fan of microservices. It is also why I think it is very important to ensure that you keep the APIs used for communication between microservices technology-agnostic. This means avoiding integration technology that dictates what technology stacks we can use to implement our microservices.

Make Your Service Simple for Consumers

We want to make it easy for consumers to use our microservice. Having a beautifully factored microservice doesn’t count for much if the cost of using it as a consumer is sky high! So let’s think about what makes it easy for consumers to use our wonderful new service. Ideally, we’d like to allow our clients full freedom in their technology choice, but on the other hand, providing a client library can ease adoption. Often, however, such libraries are incompatible with other things we want to achieve. For example, we might use client libraries to make it easy for consumers, but this can come at the cost of increased coupling.

Hide Internal Implementation Detail

We don’t want our consumers to be bound to our internal implementation, as this leads to increased coupling - it means that if we want to change something inside our microservice, we can break our consumers by requiring them to also change. That increases the cost of change - exactly what we are trying to avoid. It also means we are less likely to want to make a change for fear of having to upgrade our consumers, which can lead to increased technical debt within the service. So any technology that pushes us to expose internal representation detail should be avoided.

There is a whole host of technology we could look at, but rather than looking broadly at a long list of options in this space, I will highlight some of the most popular and interesting choices. Here are the options we’ll be looking at:

Remote Procedure Calls (RPC)

Frameworks that allow for local method calls to be invoked on a remote process. Common options include SOAP and GRPC.

REST

An architectural style where you expose resources (Customer, Order etc) that can be accessed using a common set of verbs (GET, POST). There is a bit more to REST than that, but we’ll get to that shortly.

GraphQL

A relatively new query language that allows consumers to define custom queries that can fetch information from multiple downstream microservices, filtering the results to return only what is needed.

Message Brokers

Middleware that allows for asynchronous communication either via queues or topics.


Remote Procedure Calls

Remote procedure call refers to the technique of making a local call and having it execute on a remote service somewhere. There are a number of different types of RPC technology out there. Most of the technology in this space requires an explicit schema, such as SOAP or GRPC. The use of a separate schema makes it easier to generate client and server stubs for different technology stacks, so, for example, I could have a Java server exposing a SOAP interface, and a .NET client generated from the Web Services Description Language (WSDL) definition of the interface. Other technology, like Java RMI, calls for a tighter coupling between the client and server, requiring that both use the same underlying technology but avoiding the need for a shared interface definition. All these technologies, however, have the same core characteristic: they make a remote call look like a local call.

Typically, using an RPC technology means you are buying into a serialization protocol. The RPC framework defines how data is serialized and deserialized. GRPC for example uses the protocol buffer serialization format for this purpose. Some implementations are tied to a specific networking protocol (like SOAP, which makes nominal use of HTTP), whereas others might allow you to use different types of networking protocols, which themselves can provide additional features. For example, TCP offers guarantees about delivery, whereas UDP doesn’t but has a much lower overhead. This can allow you to use different networking technology for different use cases.

RPC frameworks that have an explicit schema make it very easy to generate client code. This can avoid the need for client libraries, as any client can just generate their own code against the service specification. For client-side code generation to work, though, the client needs some way to get the schema out of band - in other words, the consumer needs access to the schema before it plans to make calls. AVRO RPC is an interesting outlier here, as it has the option to send the full schema along with the payload, allowing clients to interpret the schema dynamically.

This ease of client-side code generation speaks to one of the main selling points of RPC: its ease of use. The fact that I can just make a normal method call and theoretically ignore the rest is a huge boon.
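
To give a feel for this, here’s a rough sketch of what calling a generated client stub looks like with GRPC in Java. It assumes a hypothetical customer.proto schema defining a CustomerService with a FindCustomer operation - the CustomerServiceGrpc, FindCustomerRequest, and CustomerResponse types shown here would be generated from that schema by the protoc compiler, and the address is just a placeholder.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class CustomerClient {
  public static void main(String[] args) {
    // Open a channel to the (hypothetical) customer microservice
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("customer-service", 50051)
        .usePlaintext()
        .build();

    // The stub and message types are generated from the customer.proto schema
    CustomerServiceGrpc.CustomerServiceBlockingStub stub =
        CustomerServiceGrpc.newBlockingStub(channel);

    // To the caller this looks like an ordinary method call, even though a
    // network request is made and the payload is serialized as protocol buffers
    CustomerResponse customer = stub.findCustomer(
        FindCustomerRequest.newBuilder().setId("1234").build());

    System.out.println(customer.getEmailAddress());
    channel.shutdown();
  }
}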

Challenges

As we’ve seen, RPC offers some great advantages, but it’s not without its downsides - and some RPC implementations can be more problematic than others. Many of these issues can be dealt with, but they deserve further exploration.

Technology Coupling

Some RPC mechanisms, like Java RMI, are heavily tied to a specific platform, which can limit which technology can be used in the client and server. Thrift and protocol buffers have an impressive amount of support for alternative languages, which can reduce this downside somewhat, but be aware that sometimes RPC technology comes with restrictions on interoperability.

In a way, this technology coupling can be a form of exposing internal technical implementation details. For example, the use of RMI ties not only the client to the JVM, but the server too.

To be fair, there are a number of RPC implementations that don’t have this restriction - GRPC, SOAP and Thrift are all examples that allow for interoperability between different technology stacks.

Local Calls Are Not Like Remote Calls

The core idea of RPC is to hide the complexity of a remote call. This can, though, lead to hiding too much. The drive in some forms of RPC to make remote method calls look like local method calls hides the fact that these two things are very different. I can make large numbers of local, in-process calls without worrying overly about performance. With RPC, though, the cost of marshalling and unmarshalling payloads can be significant, not to mention the time taken to send things over the network. This means you need to think differently about API design for remote interfaces versus local interfaces. Just taking a local API and trying to make it a service boundary without any more thought is likely to get you in trouble. In some of the worst examples, developers may be using remote calls without even knowing it, if the abstraction is overly opaque.

You need to think about the network itself. Famously, the first of the fallacies of distributed computing is “The network is reliable”. Networks aren’t reliable. They can and will fail, even if your client and the server you are speaking to are fine. They can fail fast, they can fail slow, and they can even malform your packets. You should assume that your networks are plagued with malevolent entities ready to unleash their ire on a whim. Therefore, the failure modes you can expect are different. A failure could be caused by the remote server returning an error, or by you making a bad call. Can you tell the difference, and if so, can you do anything about it? And what do you do when the remote server just starts responding slowly? We’ll cover this topic when we talk about resiliency in [Link to Come].
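
To make this a little more tangible, here’s a sketch of how the GRPC client from the earlier example might guard against a slow or unreachable server by setting a deadline, and distinguish a bad call from a struggling network. The status codes are part of GRPC itself; the service and the handling policy are purely illustrative.

import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.TimeUnit;

public class CustomerLookup {
  // Reuses the stub and generated types from the earlier sketch
  CustomerResponse findCustomerSafely(
      CustomerServiceGrpc.CustomerServiceBlockingStub stub, String id) {
    try {
      // Give up if the remote server hasn't responded within 500ms, rather
      // than blocking the caller indefinitely
      return stub.withDeadlineAfter(500, TimeUnit.MILLISECONDS)
          .findCustomer(FindCustomerRequest.newBuilder().setId(id).build());
    } catch (StatusRuntimeException e) {
      Status.Code code = e.getStatus().getCode();
      if (code == Status.Code.DEADLINE_EXCEEDED || code == Status.Code.UNAVAILABLE) {
        // The network or the remote server is struggling - blindly retrying
        // may make things worse; we cover resiliency in [Link to Come]
      } else if (code == Status.Code.INVALID_ARGUMENT) {
        // We made a bad call - retrying the same request won't help
      }
      return null;
    }
  }
}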

Brittleness

Some of the most popular implementations of RPC can lead to some nasty forms of brittleness, Java’s RMI being a very good example. Let’s consider a very simple Java interface that we have decided to make a remote API for our customer service. Example 4-1 declares the methods we are going to expose remotely. Java RMI then generates the client and server stubs for these methods.

Example 4-1. Defining a service endpoint using Java RMI
import java.rmi.Remote;
import java.rmi.RemoteException;

public interface CustomerRemote extends Remote {
  public Customer findCustomer(String id) throws RemoteException;

  public Customer createCustomer(String firstname, String surname, String emailAddress)
      throws RemoteException;
}

In this interface, createCustomer takes the first name, surname, and email address. What happens if we decide to allow the Customer object to also be created with just an email address? We could add a new method at this point pretty easily, like so:

...
public Customer createCustomer(String emailAddress) throws RemoteException;
...

The problem is that now we need to regenerate the client stubs too. Clients that want to consume the new method need the new stubs, and depending on the nature of the changes to the specification, consumers that don’t need the new method may also need to have their stubs upgraded. This is manageable, of course, but only to a point. The reality is that changes like this are fairly common. RPC endpoints often end up having a large number of methods for different ways of creating or interacting with objects. This is due in part to the fact that we are still thinking of these remote calls as local ones.

There is another sort of brittleness, though. Let’s take a look at what our Customer object looks like:

public class Customer implements Serializable {
  private String firstName;
  private String surname;
  private String emailAddress;
  private String age;
}

Now, what if it turns out that although we expose the age field in our Customer objects, none of our consumers ever use it? We decide we want to remove this field. But if the server implementation removes age from its definition of this type, and we don’t do the same to all the consumers, then even though they never used the field, the code associated with deserializing the Customer object on the consumer side will break. To roll out this change, I would have to deploy both a new server and clients at the same time. This is a key challenge with any RPC mechanism that promotes the use of binary stub generation: you don’t get to separate client and server deployments. If you use this technology, lock-step releases may be in your future.

Similar problems occur if I want to restructure the Customer object even if I didn’t remove fields—for example, if I wanted to encapsulate firstName and surname into a new naming type to make it easier to manage. I could, of course, fix this by passing around dictionary types as the parameters of my calls, but at that point, I lose many of the benefits of the generated stubs because I’ll still have to manually match and extract the fields I want.

In practice, objects used as part of binary serialization across the wire can be thought of as expand-only types. This brittleness results in the types being exposed over the wire and becoming a mass of fields, some of which are no longer used but can’t be safely removed.

Where To Use It

Despite its shortcomings, I actually quite like RPC, and the more modern implementations, such as GRPC, are excellent, whereas other implementations have significant issues which would cause me to give them a wide berth. Java RMI for example has a number of issues regarding brittleness and limited technology choices, and SOAP is pretty heavyweight from a developer perspective, especially when compared with more modern choices.

Just be aware of some of the potential pitfalls associated with RPC if you’re going to pick this model. Don’t abstract your remote calls to the point where the network is completely hidden, and ensure that you can evolve the server interface without having to insist on lock-step upgrades for clients. Finding the right balance for your client code is important - make sure your clients aren’t oblivious to the fact that a network call is going to be made. Client libraries are often used in the context of RPC, and if not structured right they can be problematic. We’ll talk more about them shortly.

If I was looking at options in this space, GRPC would be top of my list. Built to take advantage of HTTP/2, it has some impressive performance characteristics and good general ease of use. I also appreciate the ecosystem around GRPC, including tools like Protolock1, something we’ll discuss later in this chapter when we discuss schemas.

GRPC fits a synchronous request-response model well, but can also work in conjunction with reactive extensions. It’s high on my list whenever I’m in situations where I have a good deal of control over both the client and server ends of the communication. If you’re having to support a wide variety of other applications that might need to talk to your microservices, the need to compile client-side code against a server-side schema can be problematic. In that case, some form of REST over HTTP API would likely be a better fit.

REST

Representational State Transfer (REST) is an architectural style inspired by the Web. There are many principles and constraints behind the REST style, but we are going to focus on those that really help us when we face integration challenges in a microservices world, and when we’re looking for an alternative style to RPC for our service interfaces.

Most important when thinking about REST is the concept of resources. You can think of a resource as a thing that the service itself knows about, like a Customer. The server creates different representations of this Customer on request. How a resource is shown externally is completely decoupled from how it is stored internally. A client might ask for a JSON representation of a Customer, for example, even if it is stored in a completely different format. Once a client has a representation of this Customer, it can then make requests to change it, and the server may or may not comply with them.

There are many different styles of REST, and I touch only briefly on them here. I strongly recommend you take a look at the Richardson Maturity Model, where the different styles of REST are compared.

REST itself doesn’t really talk about underlying protocols, although it is most commonly used over HTTP. I have seen implementations of REST using very different protocols before, such as serial or USB, although this can require a lot of work. Some of the features that HTTP gives us as part of the specification, such as verbs, make implementing REST over HTTP easier, whereas with other protocols you’ll have to handle these features yourself.

REST and HTTP

HTTP itself defines some useful capabilities that play very well with the REST style. For example, the HTTP verbs (e.g., GET, POST, and PUT) already have well-understood meanings in the HTTP specification as to how they should work with resources. The REST architectural style actually tells us that methods should behave the same way on all resources, and the HTTP specification happens to define a bunch of methods we can use. GET retrieves a resource in an idempotent way, and POST creates a new resource. This means we can avoid lots of different createCustomer or editCustomer methods. Instead, we can simply POST a customer representation to request that the server create a new resource, and initiate a GET request to retrieve a representation of a resource. Conceptually, there is one endpoint in the form of a Customer resource in these cases, and the operations we can carry out upon it are baked into the HTTP protocol.
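
As an illustrative sketch of what this looks like from a consumer’s point of view, the following uses the JDK’s built-in HTTP client to create a customer and then fetch it back. The URIs and the JSON payload are placeholders rather than a real MusicCorp API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CustomerApiClient {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // POST a representation of a customer, asking the server to create the resource
    HttpRequest create = HttpRequest.newBuilder()
        .uri(URI.create("http://customer-service/customers"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"firstName\":\"Sam\",\"surname\":\"Newman\"}"))
        .build();
    HttpResponse<String> created =
        client.send(create, HttpResponse.BodyHandlers.ofString());

    // The Location header tells us where the newly created resource lives...
    String location = created.headers().firstValue("Location").orElseThrow();

    // ...and GET retrieves a representation of it - a safe, cacheable operation
    HttpRequest fetch = HttpRequest.newBuilder().uri(URI.create(location)).GET().build();
    HttpResponse<String> customer =
        client.send(fetch, HttpResponse.BodyHandlers.ofString());
    System.out.println(customer.body());
  }
}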

HTTP also brings a large ecosystem of supporting tools and technology. We get to use HTTP caching proxies like Varnish and load balancers like mod_proxy, and many monitoring tools already have lots of support for HTTP out of the box. These building blocks allow us to handle large volumes of HTTP traffic and route them smartly, in a fairly transparent way. We also get to use all the available security controls with HTTP to secure our communications. From basic auth to client certs, the HTTP ecosystem gives us lots of tools to make the security process easier, and we’ll explore that topic more in [Link to Come]. That said, to get these benefits, you have to use HTTP well. Use it badly, and it can be as insecure and hard to scale as any other technology out there. Use it right, though, and you get a lot of help.

Note that HTTP can be used to implement RPC too. SOAP, for example, gets routed over HTTP, but unfortunately uses very little of the specification. Verbs are ignored, as are simple things like HTTP error codes. GRPC on the other hand has been designed to take advantage of the capabilities of HTTP/2 such as the ability to send multiple request-response streams over a single connection.

Hypermedia As the Engine of Application State

Another principle introduced in REST that can help us avoid the coupling between client and server is the concept of hypermedia as the engine of application state (often abbreviated as HATEOAS, and boy, did it need an abbreviation). This is fairly dense wording and a fairly interesting concept, so let’s break it down a bit.

Hypermedia is a concept whereby a piece of content contains links to various other pieces of content in a variety of formats (e.g., text, images, sounds). This should be pretty familiar to you, as it’s what the average web page does: you follow links, which are a form of hypermedia controls, to see related content. The idea behind HATEOAS is that clients should perform interactions (potentially leading to state transitions) with the server via these links to other resources. A client doesn’t need to know exactly where customers live on the server, or which URI to hit; instead, it looks for and navigates links to find what it needs.

This is a bit of an odd concept, so let’s first step back and consider how people interact with a web page, which we’ve already established is rich with hypermedia controls.

Think of the Amazon.com shopping site. The location of the shopping cart has changed over time. The graphic has changed. The link has changed. But as humans we are smart enough to still see a shopping cart, know what it is, and interact with it. We have an understanding of what a shopping cart means, even if the exact form and underlying control used to represent it has changed. We know that if we want to view the cart, this is the control we want to interact with. This is how web pages can change incrementally over time. As long as these implicit contracts between the customer and the website are still met, changes don’t need to be breaking changes.

With hypermedia controls, we are trying to achieve the same level of smarts for our electronic consumers. Let’s look at a hypermedia control that we might have for MusicCorp. We’ve accessed a resource representing a catalog entry for a given album in Example 4-2. Along with information about the album, we see a number of hypermedia controls.

Example 4-2. Hypermedia controls used on an album listing
<album>
  <name>Give Blood</name>
  <link rel="/artist" href="/artist/theBrakes" /> 1
  <description>
    Awesome, short, brutish, funny and loud. Must buy!
  </description>
  <link rel="/instantpurchase" href="/instantPurchase/1234" /> 2
</album>
1. This hypermedia control shows us where to find information about the artist.
2. And if we want to purchase the album, we now know where to go.

In this document, we have two hypermedia controls. The client reading such a document needs to know that a control with a relation of artist is where it needs to navigate to get information about the artist, and that instantpurchase is part of the protocol used to purchase the album. The client has to understand the semantics of the API in much the same way as a human being needs to understand that on a shopping website the cart is where the items to be purchased will be.

As a client, I don’t need to know which URI scheme to access to buy the album, I just need to access the resource, find the buy control, and navigate to that. The buy control could change location, the URI could change, or the site could even send me to another service altogether, and as a client I wouldn’t care. This gives us a huge amount of decoupling between the client and server.

We are greatly abstracted from the underlying detail here. We could completely change the implementation of how the control is presented as long as the client can still find a control that matches its understanding of the protocol, in the same way that a shopping cart control might go from being a simple link to a more complex JavaScript control. We are also free to add new controls to the document, perhaps representing new state transitions that we can perform on the resource in question. We would end up breaking our consumers only if we fundamentally changed the semantics of one of the controls so it behaved very differently, or if we removed a control altogether.
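
To make this slightly less abstract, here’s a rough sketch of a client that looks for the instantpurchase control in the album representation from Example 4-2 and uses whatever URI it finds there, rather than hard-coding the location of the purchase endpoint.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AlbumClient {
  // Returns the href of the hypermedia control with the given relation, or null
  static String findControl(String albumXml, String relation) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(albumXml.getBytes(StandardCharsets.UTF_8)));

    NodeList links = doc.getElementsByTagName("link");
    for (int i = 0; i < links.getLength(); i++) {
      Element link = (Element) links.item(i);
      if (relation.equals(link.getAttribute("rel"))) {
        return link.getAttribute("href");
      }
    }
    return null;
  }

  public static void main(String[] args) throws Exception {
    String albumXml = """
        <album>
          <name>Give Blood</name>
          <link rel="/artist" href="/artist/theBrakes" />
          <link rel="/instantpurchase" href="/instantPurchase/1234" />
        </album>
        """;

    // The client only understands the semantics of the control; the server is
    // free to move the purchase endpoint without breaking us
    String purchaseUri = findControl(albumXml, "/instantpurchase");
    System.out.println("Purchase via: " + purchaseUri);
  }
}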

The theory is that by using these controls to decouple the client and server we gain significant benefits over time that hopefully offset the increase in the time it takes to get these protocols up and running. Unfortunately, although these ideas all seem sensible in theory, I’ve found that this form of REST is rarely practiced, for reasons I’ve not entirely got to grips with. This makes HATEOAS specifically a much harder concept for me to promote for those already committed to the use of REST. Fundamentally, many of the ideas in REST are predicated on creating distributed hypermedia systems, and this isn’t what most people end up building.

Challenges

In terms of ease of consumption, historically you wouldn’t be able to generate client-side code for your REST over HTTP application protocol like you can with RPC implementations. This has often led to the creators of REST APIs providing client libraries for consumers to make use of. These client libraries give you a binding to the API to make client integration easier. The problem is that client libraries can cause some challenges with regards to coupling between the client and server, something we’ll discuss in “DRY and the Perils of Code Reuse in a Microservice World”.

In recent years this problem has been somewhat alleviated. The OpenAPI specification2, which grew out of the Swagger documentation format, now provides you with the ability to define enough information on a REST endpoint to allow for the generation of client-side code in a variety of languages. In my experience, I haven’t seen many teams actually making use of this functionality, even when they were already using Swagger for documentation. I have a suspicion that this may be due to the difficulties of retrofitting its use into current APIs. I also have concerns about a specification that was previously just used for documentation now being used to define a more explicit contract. This can lead to a much more complex specification - comparing an OpenAPI schema with a protocol buffer schema, for example, is quite a stark contrast. Despite my reservations, though, it’s good that this option now exists.

Performance may also be an issue. REST over HTTP payloads can actually be more compact than SOAP because REST supports alternative formats like JSON or even binary, but it will still be nowhere near as lean as a binary protocol like Thrift. The overhead of HTTP for each request may also be a concern for low-latency requirements. All mainstream HTTP protocols in current use require the use of the Transmission Control Protocol (TCP) under the hood, which has inefficiencies compared with alternative networking protocols, although some RPC implementations allow you to use alternative networking protocols to TCP, such as the User Datagram Protocol (UDP).

The limitations placed on HTTP due to the requirement to use TCP are being addressed. HTTP/3, which is currently in the process of being finalized, is looking to shift over to using the newer QUIC protocol. QUIC provides the same sorts of capabilities as TCP (such as improved guarantees over UDP) but has some significant improvements over TCP, which have been shown to deliver improvements in latency and reductions in bandwidth. It’s likely that HTTP/3 will take several years before it has a widespread impact on the public internet, but it seems reasonable to assume that organizations can benefit earlier than this within their own networks.

With respect to HATEOAS specifically, you can encounter additional performance issues. As clients need to navigate multiple controls to find the right endpoints for a given operation, this can lead to very chatty protocols - multiple round trips may be required for each operation. Ultimately, this is a trade-off. If you decide to adopt a HATEOAS style of REST, I would suggest you start by having your clients navigate these controls first, then optimize later if necessary. Remember that we get a large amount of help out of the box by using HTTP, which we discussed earlier. The evils of premature optimization have been well documented before, so I don’t need to expand upon them here. Also note that a lot of these approaches were developed to create distributed hypertext systems, and not all of them fit! Sometimes you’ll find yourself just wanting good old-fashioned RPC.

Despite these disadvantages, REST over HTTP is a sensible default choice for service-to-service interactions. If you want to know more, I recommend REST in Practice (O’Reilly)3, which covers the topic of REST over HTTP in depth.

Where To Use It

Due to its widespread use in the industry, a REST over HTTP based API is an obvious choice for a synchronous request-response interface if you are looking to allow access from as wide a variety of clients as possible. It might sound dismissive to describe a REST over HTTP based API as just a “good enough for most things” choice, but there is something to that. It’s a widely understood style of interface that most people are familiar with, and it guarantees interoperability across a huge variety of technology stacks.

Due in large part to the capabilities of HTTP, and the extent to which REST builds upon these capabilities (rather than hiding them), these APIs excel in situations where you want large scale and effective caching of requests. It’s for this reason that they are the obvious choice for exposing APIs to external parties or client interfaces. They may well suffer when compared to more efficient communication protocols, and although you can construct asynchronous interaction protocols over the top of REST-based APIs, that’s not really a great fit compared to the alternatives for general microservice-to-microservice communication.

Despite intellectually appreciating the goals behind HATEOAS, I haven’t in my experience seen the additional work to implement this style of REST deliver worthwhile benefits in the long run, nor can I recall in the last few years talking to any teams implementing a microservice architecture who can speak to the value of using HATEOAS. My own experiences are obviously only one set of data points, and I don’t doubt that for some people it may have worked well. But this concept does not seem to have caught on as much as I thought it would. It could be that the concepts behind HATEOAS are too alien for us to grasp, or it could be the lack of tools or standards in this space, or perhaps the model just doesn’t work for the sorts of systems we have ended up building.

So for use at the perimeter, it works fantastically well, and for synchronous request-response based communication between microservices, it’s great.

GraphQL

In recent years, GraphQL4 has gained more popularity, due in large part to the fact that it excels in one specific area. Namely, it makes it possible for a client-side device to define queries that can avoid the need to make multiple requests to retrieve the same information. This can offer significant improvements in terms of the performance of constrained client-side devices, and also avoid the need to implement bespoke server-side aggregation.

To take a simple example, imagine a mobile device that wants to display a page showing an overview of a customer’s latest orders. The page needs to contain some information about the customer, along with information about the five most recent orders the customer placed. The screen needs only a few fields from the customer record, and only the date, value, and shipped status of each order. The mobile device could issue calls to two downstream microservices to retrieve the required information, but this would involve making multiple calls, including pulling back information that isn’t actually required. Especially with mobile devices, this can be wasteful - it uses up more of a mobile device’s data plan than is needed, and can take longer.

GraphQL allows the mobile device to issue a single query that can pull back all the required information. For this to work, you need a microservice that exposes a GraphQL endpoint to the client device. This GraphQL endpoint is the entry point for all client queries, and it exposes a schema for the client devices to use. This schema exposes the types available to the client, and a nice graphical query builder is also available to make creating these queries easier. By reducing the number of calls and the amount of data retrieved by the client device, you can deal neatly with some of the challenges that occur when building user interfaces with microservice architectures.
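
As a sketch only - the field names below assume a hypothetical schema exposed by such an endpoint, and the URI is a placeholder - the mobile client might issue something like the following single request for our customer summary screen.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CustomerSummaryQuery {
  public static void main(String[] args) throws Exception {
    // One query asking for just the customer fields the screen needs, plus the
    // date, value and shipped status of the five most recent orders
    String graphql =
        "{ customer(id: \"1234\") { name orders(last: 5) { date value shipped } } }";
    String body = "{ \"query\": \"" + graphql.replace("\"", "\\\"") + "\" }";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://graphql-gateway/graphql"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    // A single round trip returns exactly the data asked for, rather than
    // several calls to the customer and order microservices with full payloads
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}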

Challenges

Early on, one challenge was lack of language support for the GraphQL specification, with JavaScript being your only choice initially. This has improved greatly, with all major technologies now having support for the specification. In fact across the board there have been significant improvements in GraphQL and the various implementations, making it a much less risky prospect than it might have been a few years ago. That said, a few challenges do remain with the technology which you might want to be aware of.

As the client device can issue dynamically changing queries, this can potentially cause an issue with server-side load. I’ve heard of teams who have had issues with GraphQL queries causing significant load on the server side as a result of this. To compare GraphQL with something like SQL, we have the same issue there. An expensive SQL statement can cause significant problems for a database, potentially having a large impact on the wider system. The same problem applies with GraphQL. The difference is that at least with SQL we have tools like query planners for our databases, which can help us diagnose problematic queries, whereas a similar problem with GraphQL can be harder to track down. Server-side throttling of requests is one potential solution, but as the execution of the call may be spread across multiple microservices, this is far from straightforward.

Compared with normal REST-based HTTP APIs, caching is also more complex. With a REST-based API, I can set one of many response headers to help client-side devices, or intermediate caches like content delivery networks, cache responses so they don’t need to be requested again. This isn’t possible in the same way with GraphQL. The advice I’ve seen around this issue seems to revolve around associating an ID with every returned resource (and remember, a GraphQL query could contain multiple resources), and then having the client device cache the request against that ID. As far as I can tell, this makes the use of Content Delivery Networks (CDNs) or caching reverse proxies incredibly difficult without additional work or additional tooling.

Although I’ve seen some implementation-specific solutions to this problem (such as those found in the JavaScript Apollo implementation), caching feels like it was either consciously or unconsciously ignored as part of the initial development of GraphQL. If the queries you are issuing are highly specific in nature to a particular user, then this lack of request-level caching may not be a deal breaker of course, as your cache-hit ratio is likely to be low. I do wonder though if this limitation means that you’ll still end up with a hybrid solution for client devices, with some (more generic) requests going over normal REST-based HTTP APIs, with other requests going over GraphQL.

Another issue is that while GraphQL can theoretically handle writes, it doesn’t seem to suit them as well as reads. This does lead to situations where teams use GraphQL for reads but REST for writes.

The last issue is something which may be entirely subjective, but I still think it’s worth raising. GraphQL makes it feel like you are just working with data, which can reinforce the idea that the microservices you are talking to are in fact just wrappers over databases. I’ve seen multiple people in fact compare GraphQL with OData, a technology which is designed as a generic API for accessing data from databases. As we’ve already discussed at length, the idea of just treating microservices as wrappers over databases can be very problematic. Microservices expose functionality over networked interfaces. Some of that functionality might require or result in data being exposed, but they should still have their own internal logic and behavior. Just because you are using GraphQL, don’t slip into thinking of your microservices as little more than an API on a database - it’s essential that your GraphQL API isn’t coupled to the underlying datastores of your microservices.

Where To Use It

GraphQL’s sweet spot is for use at the perimeter of the system, exposing functionality to external clients. These clients are typically GUIs, and it’s an obvious fit for mobile devices given the constraints of mobile networks and the limited amount of data that can be surfaced to the end user. GraphQL has also seen use for external APIs, though, with GitHub being an early adopter. If you have an external API that often requires external clients to make multiple calls to get the information they need, then GraphQL can help make these APIs much more efficient and friendly.

Fundamentally, GraphQL is a call aggregation mechanism, so in the context of a microservice architecture it would be used to aggregate calls over multiple downstream microservices, as we saw in <<>>. As such, it’s not something that would replace general microservice-to-microservice communication.

An alternative to the use of GraphQL would be to consider an alternative pattern like the Backend For Frontend (BFF) pattern - we’ll look at that and compare with GraphQL and other aggregation techniques further in [Link to Come].

Message Brokers

Message brokers are intermediaries, often called middleware, that sit between processes to manage communication between them. They are a popular choice to help implement asynchronous communication between microservices as they offer a variety of powerful capabilities.

As we discussed earlier, a message is a generic concept that refers to whatever a message broker sends. A message could contain a request, a response, or an event. Rather than communicating directly with another microservice, a microservice instead gives the message to the message broker, along with information about how the message should be sent.

Topics and Queues

Brokers tend to provide either queues, topics, or both. Queues are typically point-to-point. A sender puts a message on a queue, and a consumer reads from that queue. With a topic-based system, multiple consumers are able to subscribe to a topic, and each subscribed consumer will receive a copy of any message sent to it.

The consumer of a queue could actually be multiple microservice instances - typically modelled as a consumer group. This is useful when you have multiple instances of a microservice and you want any one of them to be able to receive a given message. In Figure 4-1, we see an example where the Order Processor has three deployed instances, all part of the same consumer group. When a message is put into the queue, only one member of the consumer group will receive it; this means the queue works as a load distribution mechanism - an example of the Competing Consumers pattern we touched on briefly in Chapter 3.

Figure 4-1. A queue allows for one consumer group

With topics, you can have multiple consumer groups. In Figure 4-2, an event representing an order being paid for is put onto the Order Status topic. A copy of that event is received by both the Warehouse microservice, and the Notifications microservice, both of which are in separate consumer groups. Only one instance of each consumer group will see that event.

Figure 4-2. Topics allow for multiple subscribers to receive the same messages, useful for event broadcast

At first glance, a queue just looks like a topic with a single consumer group. A large part of the distinction between the two is that when a message is sent over a queue, the sender knows what the message is being sent to. With a topic, this information is hidden from the sender of the message - the sender is unaware of who (if anyone) will end up receiving the message.

Topics are a good fit for event-based collaboration, whereas queues are more appropriate for request-response communication. This should be considered general guidance, though, rather than a strict rule.
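
As a concrete illustration, here’s roughly what a topic subscriber looks like using the Java client for Kafka, a broker we’ll look at later in this chapter. Every instance of the Warehouse microservice would share the same group.id, so each event on the topic is handled by just one of them, while the Notifications microservice would use a different group and get its own copy. The topic name, group name, and broker address are placeholders.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class WarehouseConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker:9092");
    // All instances of the Warehouse microservice share this consumer group,
    // so each message on the topic is delivered to only one of them
    props.put("group.id", "warehouse");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("order-status"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          // React to the Order Paid event, e.g. by reserving stock
          System.out.println("Received: " + record.value());
        }
      }
    }
  }
}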

Guaranteed Delivery

So why use a broker? Fundamentally, they provide some capabilities that can be very useful for asynchronous communication. The properties they provide vary, but the most interesting feature is that of guaranteed delivery, something which all widely used brokers support in some way. Guaranteed delivery describes a commitment by the broker to ensure that the message is delivered.

From the point of view of the microservice sending the message, this can be very useful. If the downstream destination is unavailable, then this isn’t a problem - the broker will hold on to the message until it can be delivered. This can reduce the number of things an upstream microservice needs to worry about. Compare this to a synchronous direct call - an HTTP request, for example - where, if the downstream destination isn’t reachable, the upstream microservice will need to work out what to do with the request: should it retry the call, or give up?

For guaranteed delivery to work, a broker will need to ensure that any messages not yet delivered are held in a durable fashion until they can be delivered. To deliver on this promise, a broker will normally run as some sort of cluster-based system, ensuring that the loss of a single machine doesn’t cause a message to be lost. There is typically a lot involved in running a broker correctly, partly due to the challenges of managing cluster-based software. Often, the promise of guaranteed delivery can be undermined if the broker isn’t set up correctly. As an example, RabbitMQ requires instances in a cluster to communicate over relatively low-latency networks; otherwise the instances can start to get confused about the current state of messages being handled, resulting in data loss. I’m not highlighting this particular limitation as a way of saying that RabbitMQ is in any way bad - all brokers have restrictions as to how they need to be run to deliver the promise of guaranteed delivery. If you plan to run your own broker, make sure you read the documentation carefully.

It’s also worth noting that what any given broker means by guaranteed delivery can vary. Again, reading the documentation is a great start.

Trust

One of the big draws of a broker is the property of guaranteed delivery. But for this to work, you need to trust not only the people who created the broker, but also the way that broker is operated. If you’ve built a system on the assumption that delivery is guaranteed, and that turns out not to be the case due to an issue with the underlying broker, it can cause significant problems. The hope, of course, is that you are offloading that work to software created by people who can do that job better than you can. Ultimately, you have to decide how much you want to trust the broker you are making use of.

Other Characteristics

Aside from guaranteed delivery, there are other characteristics that brokers can provide that you may find to be useful.

Most brokers can guarantee the order in which messages will be delivered, but this isn’t universal, and even then the scope of this guarantee can be limited. With Kafka for example, ordering is only guaranteed within a single partition. If you are unable to be certain that messages will be received in order, your consumer may need to compensate for this, perhaps by deferring processing of messages that are received out of order, until the missing messages are received.
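
One simple way of doing that deferral is sketched below. It assumes each message carries a monotonically increasing sequence number - an assumption; your producer or broker would need to supply one - and buffers messages that arrive early until the gap is filled.

import java.util.Map;
import java.util.TreeMap;

public class InOrderProcessor {
  private long nextExpected = 1;
  private final Map<Long, String> buffered = new TreeMap<>();

  public void onMessage(long sequenceNumber, String payload) {
    if (sequenceNumber < nextExpected) {
      return; // already processed - most likely a redelivery
    }
    buffered.put(sequenceNumber, payload);
    // Drain the buffer for as long as the next message in sequence is available
    while (buffered.containsKey(nextExpected)) {
      process(buffered.remove(nextExpected));
      nextExpected++;
    }
  }

  private void process(String payload) {
    System.out.println("Processing " + payload);
  }
}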

Some brokers provide transactions on write - Kafka, for example, allows you to write to multiple topics in a single transaction. Some brokers can also provide read transactionality, something I’ve made use of when using a number of brokers via the Java Message Service (JMS) APIs. This can be useful if you want to ensure the message can be processed by the consumer before removing it from the broker.

Another, somewhat controversial feature promised by some brokers is that of exactly once delivery. One of the easier ways to provide guaranteed delivery is allowing the message to be resent. This can result in a consumer seeing the same message more than once (even if this is a rare situation). Most brokers will do what they can to reduce the chance of this, or hide this fact from the consumer, but some brokers have gone further and guarantee exactly once delivery. This is a complex topic, as I’ve spoken to some experts who state that guaranteeing this in all cases is impossible, while other experts say you basically can do this with a few simple workarounds. Either way, if your broker of choice claims to implement this, then pay really careful attention to how this is implemented. Even better, build your consumers in such a way that they are prepared for the fact that they might receive a message more than once, and can handle this situation. A very simple example would be for each message to have an ID which a consumer can check when the message is received. If a message with that ID has already been processed, the message can be ignored.
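
A minimal sketch of that idea follows. The message ID is assumed to be supplied by the producer, and the in-memory set is purely illustrative - in practice the record of processed IDs would need to be stored somewhere durable so that it survives restarts.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentConsumer {
  // IDs of messages we have already handled
  private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

  public void handle(String messageId, String payload) {
    // add() returns false if the ID was already present, so a redelivered
    // message is ignored rather than being processed a second time
    if (!processedIds.add(messageId)) {
      return;
    }
    process(payload);
  }

  private void process(String payload) {
    System.out.println("Processing " + payload);
  }
}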

Choices

A variety of message brokers exist. Popular examples include RabbitMQ, ActiveMQ, and Kafka (which we’ll explore further shortly). The main public cloud vendors also provide a variety of products that play this role, from managed versions of those brokers you could install on your own infrastructure, to bespoke implementations that are specific to a given platform. AWS for example has the Simple Queue Service (SQS), Simple Notification Service (SNS), and Kinesis, all of which provide different flavours of fully managed brokers. SQS was in fact the first ever product released by AWS, launched back in 2006.

Kafka

Kafka is worth highlighting as a specific broker, due in large part to its popularity in recent years. Part of this popularity is due to its use in helping move large volumes of data around as part of implementing stream processing pipelines. This can help organizations move from batch-oriented processing to more real-time processing.

There are a few characteristics of Kafka which are worth highlighting. Firstly, it is designed for very large scale - it was built at LinkedIn to replace multiple existing message clusters with a single platform. Kafka is built to allow for multiple consumers and producers - I’ve spoken to one expert at a large technology company who had over 50K producers and consumers working on the same cluster. To be fair, very few organizations have problems at that level of scale, but for some organizations, the ability to scale Kafka easily (relatively speaking) can be very useful.

Another fairly unique feature of Kafka is message permanence. With a normal message broker, once the last consumer has received a message, the broker no longer needs to hold on to that message. With Kafka, messages can be stored for a configurable period - which could even mean they are stored forever. This allows consumers to re-ingest messages that they had already processed, or allows newly deployed consumers to process messages that were sent previously.

Finally, Kafka has been rolling out built-in support for stream processing. Rather than using Kafka to send messages to a dedicated stream processing tool like Apache Flink, some tasks can instead be done inside Kafka itself. Using KSQL, you can define SQL-like statements that process one or more topics on the fly. This can give you something akin to a dynamically updating materialized database view, with the source of data being Kafka topics rather than a database. These capabilities open up some very interesting possibilities for how data is managed in distributed systems. If you’d like to explore these ideas in more detail, I can recommend “Designing Event-Driven Systems” by Ben Stopford5 (I have to recommend Ben’s book, as I wrote the foreword for it!). For a deeper dive on Kafka in general, I’d suggest “Kafka: The Definitive Guide”6.

Serialization Formats

Some of the technology choices we’ve looked at - specifically some of the RPC implementations - make choices for you regarding how data is serialized and deserialized. When picking GRPC, for example, any data sent will be converted into protocol buffer format. Many of the technology options, though, give us a lot of freedom in terms of how we convert data for network calls. Pick Kafka as your broker of choice, and you can send messages in a variety of formats. So which format should you choose?

Textual Formats

The use of standard textual formats gives clients a lot of flexibility as to how they consume resources. REST APIs most typically use a textual format for the request and response bodies, even though you can quite happily send binary data over HTTP. In fact, this is how GRPC works - using HTTP underneath, but sending binary protocol buffers.

JSON has usurped XML as the text serialization format of choice. You can point to a number of reasons why this occurred, but the main reason is that one of the main consumers of APIs is often a browser, where JSON is a great fit. JSON became popular partly as a result of the backlash against XML, and proponents cite its relative compactness and simplicity when compared to XML as another winning factor. The reality is that the size of a JSON vs XML payload is rarely a massive differential, especially as these payloads are typically compressed. It’s also worth pointing out that some of the simplicity of JSON comes at a cost - in our rush to adopt simpler protocols, schemas went out of the window (more on that later).

AVRO is an interesting serialization format. It takes JSON as an underlying structure and uses it to define a schema-based format. AVRO has found a lot of popularity as a format for message payloads, partly due to the ability to send the schema as part of the payload, which can make supporting multiple different messaging formats much easier.

Personally, though, I am still a fan of XML. Some of the tool support is better. For example, if I want to extract only certain parts of the payload (a technique we’ll discuss more in “Handling Change Between Microservices”) I can use XPATH, which is a well-understood standard with lots of tool support, or even CSS selectors, which many find even easier. With JSON, I have JSONPATH, but this is not as widely supported. I find it odd that people pick JSON because it is nice and lightweight, then try and push concepts into it like hypermedia controls that already exist in XML. I accept, though, that I am probably in the minority here and that JSON is the format of choice for most people!
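
For example, here’s a small sketch using the JDK’s built-in XPath support to pull just one field out of a larger customer payload - the payload shape is made up for illustration.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class PayloadExtraction {
  public static void main(String[] args) throws Exception {
    String payload = """
        <customer>
          <firstName>Sam</firstName>
          <surname>Newman</surname>
          <emailAddress>sam@example.com</emailAddress>
        </customer>
        """;

    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(payload.getBytes(StandardCharsets.UTF_8)));

    // Extract only the field we care about; other parts of the payload can
    // change or disappear without this code needing to know
    XPath xpath = XPathFactory.newInstance().newXPath();
    String email = xpath.evaluate("/customer/emailAddress", doc);
    System.out.println(email);
  }
}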

Binary Formats

Whereas textual formats have benefits such as being easy for humans to read and providing a lot of interoperability with different tools and technologies, the world of binary serialization protocols is where you want to be if you start getting worried about payload size, or the efficiency of writing and reading payloads. Protocol buffers have been around for a while and are often used outside the scope of GRPC - they probably represent the most popular binary serialization format for microservice-based communication.

This space, though, is large, and there are a number of other formats out there that have been developed with a variety of requirements in mind. Simple Binary Encoding7, Cap’n Proto8, and FlatBuffers9 all come to mind. Although benchmarks abound for each of these formats, highlighting their relative benefits compared to protocol buffers, JSON, or other formats, benchmarks suffer from a fundamental problem: they may not necessarily represent how you are going to use them. If you’re looking to eke the last few bytes out of your serialization format, or to shave microseconds off the time taken to read or write these payloads, I strongly suggest you carry out your own comparison of these various formats. In my experience, though, the vast majority of systems rarely have to worry about such optimizations, as they can often achieve the improvements they are looking for by sending less data, or by not making the call at all. If you are building an ultra-low-latency distributed system, though, make sure you’re prepared to dive head first into the world of binary serialization formats.

Schemas

One discussion that comes up time and again is whether we should use schemas to define what our endpoints expose and what they accept. Schemas can come in lots of different types, and picking a serialization format will typically define which schema technology you can use. If you’re working with raw XML, you’d use XML Schema Definition (XSD); with raw JSON, you’d use JSON Schema. Some of the technology choices we’ve touched on (specifically a sizable subset of the RPC options) require the use of explicit schemas, so if you picked those technologies you’d have to make use of schemas. SOAP works through the use of a schema specification called the Web Services Description Language (WSDL), while GRPC requires the use of a protocol buffer specification. Other technology choices we’ve explored make the use of schemas optional, and this is where things get more interesting.

Personally speaking, I am in favour of having explicit schemas for microservice endpoints, for two key reasons. Firstly, a schema goes a long way toward being an explicit representation of what a microservice endpoint exposes and what it can accept. This makes life easier both for developers working on the microservice and for their consumers. Schemas may not replace the need for good documentation, but they certainly can help reduce the amount of documentation required.

The other reason I like explicit schemas, though, is how they help catch accidental breakages of microservice endpoints. We’ll explore how to handle changes between microservices in a moment, but it’s first worth exploring the different types of breakages and the role schemas can play.

Structural vs Semantic Contract Breakages

Broadly speaking, we can break contract breakages down into two categories - structural breakages and semantic breakages. Structural breakages refer to situations where the structure of the endpoint changes in such a way that a consumer is now incompatible - this could mean fields or methods being removed, or new required fields being added. Semantic breakages refer to situations where the structure of the microservice’s endpoint remains the same, but the behavior changes in such a way as to break consumers’ expectations.

Let’s take a simple example. You have a highly complex Hard Calculations microservice that exposes a calculate method on its endpoint. This calculate method takes two integers, both of which are required fields. If you changed Hard Calculations such that the calculate method now takes only one integer, then consumers would break - they’d be sending requests with two integers which the Hard Calculations microservice would reject. This is an example of a structural change, and in general these changes can be easier to spot.

Semantic changes are more problematic. This is where the structure of the endpoint doesn’t change, but the behavior of the endpoint does. Coming back to our calculate method, imagine that in the first version, the two provided integers are added together and the result returned. So far so good. Now we change Hard Calculations so that the calculate method multiplies the integers together and returns the result. The semantics of the calculate method have changed in a way that could break the expectations of consumers.
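
To make the distinction concrete, here’s that change sketched in code. The two versions are shown as separate classes purely for illustration - in reality it would be the same endpoint redeployed. Structurally nothing has changed, so a schema comparison would pass, but consumers’ expectations are broken.

// Version 1 of the (hypothetical) Hard Calculations endpoint
class HardCalculationsV1 {
  public int calculate(int a, int b) {
    return a + b; // consumers have come to rely on addition
  }
}

// Version 2 - the method name, parameters and return type are identical,
// but the behavior has changed
class HardCalculationsV2 {
  public int calculate(int a, int b) {
    return a * b;
  }
}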

Should You Use Schemas?

By using schemas, and comparing different versions of schemas, we can catch structural breakages. Catching semantic breakages requires the use of testing. If you don’t have schemas, or have schemas but decide to not compare schema changes for compatibility, then the burden of catching structural breakages before you get to production also falls on testing. Arguably, the situation is somewhat analogous with static vs dynamic typing in programming languages. With a statically typed language, the types are fixed at compile time - if your code does something with an instance of a type that isn’t allowed (like calling a method that doesn’t exist), then the compiler can catch that mistake. This can leave you to focus testing efforts on other sorts of problems. With a dynamically typed language though, some of your testing will need to catch mistakes that a compiler picks up for statically typed languages.

Now, I’m pretty relaxed about static vs dynamically typed languages, and I’ve found myself to be very productive (relatively speaking) in both. Certainly, dynamically typed languages give you some significant benefits which for many people justify giving up on compile time safety. Personally speaking though, if we bring the discussion back to microservice interactions, I haven’t found that a similar balanced tradeoff exists when it comes to schema vs schemaless communication. Put simply, I think that having an explicit schema more than offsets any perceived benefit of having schema-less based communication.

The main argument for schemaless endpoints seems to be that schemas need more work and don’t give enough value. This, in my opinion, is partly a failure of imagination, and partly a failure of good tooling to help schemas provide more value, especially when it comes to using them to catch structural breakages.

Really, the question isn’t whether you have a schema or not - it’s whether or not that schema is explicit. If you are consuming data from a schemaless API, you still have expectations as to what data should be in there, and how that data should be structured. The code that handles the data will be written with a set of assumptions in mind as to how that data is structured. In such a case the schema is arguably totally implicit, rather than explicit10. A lot of my desire for an explicit schema is driven by the fact that I think it’s important to be as explicit as possible about what a microservice does (or doesn’t) expose.
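As a small illustration of such an implicit schema, consider the following consumer code - the payload and field names are invented for this sketch. Nothing here is declared anywhere, yet the code clearly assumes a particular shape for the data; change that shape, and the consumer breaks just as surely as if an explicit schema had changed:

import json

raw = '{"firstname": "Sam", "lastname": "Newman", "email": "sam@example.com"}'
customer = json.loads(raw)

# These two lines are the "schema": firstname and email must exist, at the top
# level, as strings. Nest them under a "naming" block and this code fails,
# even though no explicit schema was ever published or compared.
greeting = "Hello " + customer["firstname"]
recipient = customer["email"]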

Ultimately, a lot of what schemas provide is an explicit representation of part of the structural contract between a client and server. They help make things explicit, and can greatly aid communication between teams as well as work as a safety net. In situations where the cost of change is reduced, for example where both client and server are owned by the same team, I am more relaxed about you not having schemas.

Handling Change Between Microservices

Probably the most common question I get about microservices, after “how big should they be?”, is “how do you handle versioning?” When this question gets asked, it’s rarely a query about what sort of numbering scheme you should use; it’s more about how you handle changes in the contracts between microservices.

How you handle change really breaks down into two topics. In a moment, we’ll look at what happens if you need to make a breaking change. But before that, we’ll look at what you can do to avoid making a breaking change in the first place.

Avoiding Breaking Changes

If you want to avoid making breaking changes, there are a few key ideas which are worth exploring - many of which we’ve already touched on at the start of the chapter. If you can put these ideas into practice, you’ll find it much easier to allow for microservices to be changed independently from one another.

Expansion Changes

Add new things to a microservice interface; don’t remove old things.

Tolerant Reader

When consuming a microservice interface, be flexible in what you expect.

Right Technology

Pick technology that makes it easier to make backwards compatible changes to the interface.

Explicit Interface

Be explicit about what a microservice exposes. This makes things easier for the client, and easier for the maintainers of the microservice to understand what can be changed freely.

Catch Accidental Breaking Changes Early

Have mechanisms in place to catch interface changes that will break consumers in production, before those changes are deployed.

These ideas do reinforce each other, and many build upon that key concept of information hiding that we’ve discussed frequently so far. Let’s look at each in turn.

Expansion Changes

Probably the easiest place to start is by only adding new things to a microservice contract and not removing anything. Consider the example of adding a new field to a payload - assuming the client is in some way tolerant of such changes, this shouldn’t have a material impact. Adding a new dob field to a customer record, for example, should be fine.

Tolerant Reader

How the consumer of a microservice is implemented can have a big impact on how easy it is to make backwards compatible changes. Specifically, we want to avoid client code binding too tightly to the interface of a microservice. Let’s consider an email microservice, whose job is to send out emails to our customers from time to time. It gets asked to send an order shipped email to the customer with ID 1234. It goes off and retrieves the customer with that ID, and gets back something like the response shown in Example 4-3.

Example 4-3. Sample response from the customer service
<customer>
  <firstname>Sam</firstname>
  <lastname>Newman</lastname>
  <email>[email protected]</email>
  <telephoneNumber>555-1234-5678</telephoneNumber>
</customer>

Now to send the email, the email microservice needs only the firstname, lastname, and email fields. We don’t need to know the telephoneNumber. We want to simply pull out the fields we care about, and ignore the rest. Some binding technology, especially that used by strongly typed languages, can attempt to bind all fields whether the consumer wants them or not. What happens if we realize that no one needs the telephoneNumber and we decide to remove it? Consumers that had needlessly bound to that field could then break, despite never actually using the data.

Likewise, what if we wanted to restructure our Customer object to support more details, perhaps adding some further structure as in Example 4-4? The data our email service wants is still there, and still with the same name, but if our code makes very explicit assumptions as to where the firstname and lastname fields will be stored, then it could break again. In this instance, we could instead use XPath to pull out the fields we care about, allowing us to be ambivalent about where the fields are, as long as we can find them. This pattern—of implementing a reader able to ignore changes we don’t care about—is what Martin Fowler calls a Tolerant Reader.

Example 4-4. A restructured Customer resource: the data is all still there, but can our consumers find it?
<customer>
  <naming>
    <firstname>Sam</firstname>
    <lastname>Newman</lastname>
    <nickname>Magpiebrain</nickname>
    <fullname>Sam "Magpiebrain" Newman</fullname>
  </naming>
  <email>[email protected]</email>
</customer>
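A tolerant reader for this data might look something like the following sketch, which uses the limited XPath support in Python’s standard library to find the fields wherever they happen to live. It handles both the flat structure in Example 4-3 and the restructured one in Example 4-4, and simply ignores fields like telephoneNumber or nickname that it doesn’t care about:

import xml.etree.ElementTree as ET

def extract_email_details(customer_xml: str) -> dict:
    """Pull out only the fields the email microservice needs, searching
    anywhere in the document rather than binding to a fixed structure."""
    root = ET.fromstring(customer_xml)
    # ".//tag" matches the element at any depth, so it finds <firstname>
    # whether it sits directly under <customer> or inside <naming>.
    return {
        "firstname": root.findtext(".//firstname"),
        "lastname": root.findtext(".//lastname"),
        "email": root.findtext(".//email"),
    }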

The example of a client trying to be as flexible as possible in consuming a service demonstrates Postel’s Law (otherwise known as the robustness principle), which states: “Be conservative in what you do, be liberal in what you accept from others.” The original context for this piece of wisdom was the interaction of devices over networks, where you should expect all sorts of odd things to happen. In the context of microservice-based interactions, it leads us to try and structure our client code to be tolerant of changes to payloads.

Right Technology

As we’ve already explored, some technology can be more brittle when it comes to allowing us to change interfaces - I’ve already highlighted my own personal frustrations with Java RMI. On the other hand, some integration implementations go out of their way to make it as easy as possible for changes to be made without breaking clients. At the simple end of the spectrum, protocol buffers, the serialization format used as part of GRPC, has the concept of field number. Each entry in a protocol buffer has to define a field number, which client code expects to find. If new fields are added, the client doesn’t care. AVRO allows for the schema to be sent along with the payload, allowing clients to potentially interpret a payload much like a dynamic type.

At the more extreme end of the spectrum, the REST concept of HATEOAS is largely all about enabling clients to make use of REST endpoints even as they change, by making use of the previously discussed hypermedia links. This does call for you to buy into the entire HATEOAS mindset, of course.

Explicit Interface

I am a big fan of a microservice exposing an explicit schema denoting what its endpoints do. Having an explicit schema makes it clear to consumers what they can expect, but it also makes it much clearer to a developer working on a microservice what should remain untouched to ensure consumers don’t break. Put another way, an explicit schema goes a long way to making the boundaries of information hiding more explicit - what’s exposed in the schema is by definition not hidden.

Having an explicit schema for RPC is long established, and is in fact a requirement for many RPC implementations. REST on the other hand has typically viewed the concept of a schema as optional, to the point where I find explicit schemas for REST endpoints to be vanishingly rare. This is changing, with things like the aforementioned OpenAPI specification gaining traction, and the JSON Schema specification also gaining in maturity.

Asynchronous messaging protocols have struggled more in this space. You can have a schema for the payload of a message easily enough, and in fact this is an area where AVRO is frequently used. However, having an explicit interface needs to go further than this: if we consider a microservice that fires events, which events does it expose? There are a few attempts at making explicit schemas for event-based endpoints underway. One is AsyncAPI11, which has picked up a number of big-name users, but the one gaining the most traction seems to be the CloudEvents specification12, which is backed by the Cloud Native Computing Foundation. Azure’s Event Grid product supports the CloudEvents format - a sign of different vendors supporting this format, which should help with interoperability. This is still a fairly new space, so it will be interesting to see how things shake out over the next few years.
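To give a feel for the CloudEvents format, here’s a hedged sketch of an event using the CloudEvents 1.0 attribute names; the event type, source, and payload are invented for illustration, shown as a Python dictionary that would be serialized to JSON:

# Standard CloudEvents attributes (specversion, id, source, type, and so on)
# wrap a domain-specific payload; the surrounding attribute names are what
# tooling and brokers can rely on across vendors.
order_shipped_event = {
    "specversion": "1.0",
    "id": "A234-1234-1234",                  # unique per event
    "source": "/order-service",              # illustrative producer identifier
    "type": "com.example.order.shipped",     # illustrative event type
    "time": "2023-04-05T17:31:00Z",
    "datacontenttype": "application/json",
    "data": {"orderId": "1234", "customerId": "5678"},
}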

Catch Accidental Breaking Changes Early

It’s crucial to make sure we pick up changes that will break consumers as soon as possible, because even if we choose the best possible technology, an innocent change to a microservice could still cause consumers to break. As we’ve already touched on, using schemas can help us pick up structural changes, assuming we use some sort of tooling to help compare schema versions. There is a wide range of tooling out there to do this for different schema types: we have ProtoLock13 for protocol buffers, json-schema-diff-validator for JSON Schema14, and openapi-diff for the OpenAPI specification15. More tools seem to be cropping up all the time in this space. What you’re looking for, though, is something that doesn’t just report on the differences between two schemas, but that will pass or fail based on compatibility - that would allow you to fail a CI build if incompatible schemas are found, ensuring that your microservice won’t get deployed.
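The core of such a check is simple enough to sketch. The toy function below - not one of the tools named above - compares two JSON Schema documents and reports the structural breakages discussed earlier, removed properties and newly required fields, so a CI step could fail the build whenever it finds anything (real tools handle nesting, type changes, and far more):

import sys

def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """Toy backwards-compatibility check between two JSON Schema documents."""
    problems = []
    old_props = set(old_schema.get("properties", {}))
    new_props = set(new_schema.get("properties", {}))
    for name in sorted(old_props - new_props):
        problems.append(f"property removed: {name}")
    newly_required = set(new_schema.get("required", [])) - set(old_schema.get("required", []))
    for name in sorted(newly_required):
        problems.append(f"new required field: {name}")
    return problems

if __name__ == "__main__":
    old = {"properties": {"firstname": {}, "telephoneNumber": {}}, "required": []}
    new = {"properties": {"firstname": {}, "dateOfBirth": {}}, "required": ["dateOfBirth"]}
    issues = breaking_changes(old, new)
    if issues:
        print("\n".join(issues))
        sys.exit(1)   # fail the CI build on incompatible changes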

The open source Confluent Schema Registry16 supports JSON Schema, AVRO, and protocol buffers, and is capable of comparing newly uploaded versions for backwards compatibility. Although it was built to help as part of an ecosystem where Kafka is being used, the registry isn’t tied to Kafka in any way, and you could make use of it in other situations to ensure backwards compatibility based on schema comparison.

Schema comparison tools can help us catch structural breakages, but what about semantic breakages? Or what if you aren’t making use of schemas in the first place? Then we’re looking at testing. This is a topic we’ll explore in more detail in [Link to Come], but I wanted to highlight consumer-driven contract testing which explicitly helps in this area. Just remember, if you don’t have schemas, expect your testing to have to do more work to catch breaking changes.
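Tooling aside, the idea behind a consumer-driven contract is straightforward: the consuming team writes down exactly what it relies on, and the producing team runs those expectations as tests. A minimal, tool-free sketch might look like the following - the field expectations and function names are assumptions for illustration, and in practice you’d likely reach for a dedicated contract-testing tool rather than rolling your own:

# The email microservice's team publishes its expectations of the customer
# payload; the customer microservice's team runs this in their build,
# catching structural breakages and any semantic ones the checks encode.

EMAIL_SERVICE_EXPECTATIONS = {
    "firstname": str,
    "lastname": str,
    "email": str,
}

def check_contract(customer_payload: dict) -> list:
    problems = []
    for field, expected_type in EMAIL_SERVICE_EXPECTATIONS.items():
        if field not in customer_payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(customer_payload[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems

def test_customer_response_meets_email_service_contract():
    # Stand-in for a response fetched from the real customer microservice
    # when this contract is run in the producer's build.
    response = {"firstname": "Sam", "lastname": "Newman", "email": "sam@example.com"}
    assert check_contract(response) == []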

If you’re supporting multiple different client libraries, running tests using each library you support against the latest service is another technique that can help. Once you realize you are going to break a consumer, you have the choice to either try to avoid the break altogether or else embrace it and start having the right conversations with the people looking after the consuming services.

Managing Breaking Changes

So you’ve gone as far as you can to ensure that the changes you’re making to a microservice’s interface are backwards compatible, but you’ve realized that you just have to make a change that will constitute a breaking change. What can you do in such a situation? You’ve got three main options:

Lock-step Deployment

Require that the microservice exposing the interface and all consumers of that interface are changed at the same time

Coexist Incompatible Microservice Versions

Run old and new versions of the microservice side by side

Emulate The Old Interface

Have your microservice expose the new interface, and also emulate the old interface

Lock-Step Deployment

Of course, lock-step deployment flies in the face of independent deployability. If we want to be able to deploy a new version of our microservice with a breaking change to its interface, but still do this in an independent fashion, we need to give our consumers time to upgrade to the new interface. That leads us on to the next two options I’d consider.

Coexist Incompatible Microservice Versions

Another versioning solution often cited is to have different versions of the service live at once, with older consumers routing their traffic to the older version and newer consumers seeing the new version, as shown in Figure 4-3. This is the approach used sparingly by Netflix in situations where the cost of changing older consumers is too high, especially in rare cases where legacy devices are still tied to older versions of the API. Personally, I am not a fan of this idea, and I understand why Netflix uses it rarely. First, if I need to fix an internal bug in my service, I now have to fix and deploy two different sets of services. This would probably mean I have to branch the codebase for my service, and this is always problematic. Second, it means I need smarts to handle directing consumers to the right microservice. This behavior inevitably ends up sitting in middleware somewhere or in a bunch of nginx scripts, making it harder to reason about the behavior of the system. Finally, consider any persistent state our service might manage. Customers created by either version of the service need to be stored and made visible to all services, no matter which version was used to create the data in the first place. This can be an additional source of complexity.

Figure 4-3. Running multiple versions of the same service to support old endpoints

Coexisting concurrent service versions for a short period of time can make perfect sense, especially when you’re doing things like blue/green deployments or canary releases (we’ll be discussing these patterns more in [Link to Come]). In these situations, we may be coexisting versions only for a few minutes or perhaps hours, and normally will have only two different versions of the service present at the same time. The longer it takes for you to get consumers upgraded to the newer version and released, the more you should look to coexist different endpoints in the same microservice rather than coexist entirely different versions. I remain unconvinced that this work is worthwhile for the average project.

Emulate The Old Interface

If we’ve done all we can to avoid introducing a breaking interface change, our next job is to limit the impact. The thing we want to avoid is forcing consumers to upgrade in lock-step with us, as we always want to maintain the ability to release microservices independently of each other. One approach I have used successfully to handle this is to coexist both the old and new interfaces in the same running service. So if we want to release a breaking change, we deploy a new version of the service that exposes both the old and new versions of the endpoint.

This allows us to get the new microservice out as soon as possible, along with the new interface, but give time for consumers to move over. Once all of the consumers are no longer using the old endpoint, you can remove it along with any associated code, as shown in Figure 4-4.

Figure 4-4. One microservice emulating the old endpoint and exposing the new backwards incompatible endpoint

When I last used this approach, we had gotten ourselves into a bit of a mess with the number of consumers we had and the number of breaking changes we had made. This meant that we were actually coexisting three different versions of the endpoint. This is not something I’d recommend! Keeping all the code around and the associated testing required to ensure they all worked was absolutely an additional burden. To make this more manageable, we internally transformed all requests to the V1 endpoint to a V2 request, and then V2 requests to the V3 endpoint. This meant we could clearly delineate what code was going to be retired when the old endpoint(s) died.

This is in effect an example of the expand and contract pattern, which allows us to phase breaking changes in. We expand the capabilities we offer, supporting both the old and the new ways of doing something. Once all consumers have moved over to the new way of doing things, we contract our API, removing the old functionality.
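As a hypothetical sketch of that internal forwarding (all the names here are my own invention), each old endpoint version does nothing but translate its request into the next version, so only the newest handler holds real business logic; in the three-version mess described above there would simply be one more translation hop, and retiring V1 later means deleting one small function:

def handle_v1(request: dict) -> dict:
    # The V1 endpoint just translates forward and delegates.
    return handle_v2(translate_v1_to_v2(request))

def handle_v2(request: dict) -> dict:
    # The only place current business logic lives in this sketch.
    return {"greeting": f"Hello {request['firstname']} {request['lastname']}"}

def translate_v1_to_v2(request: dict) -> dict:
    # Illustrative only: pretend V2 split V1's single "name" field in two.
    first, _, last = request.get("name", "").partition(" ")
    return {"firstname": first, "lastname": last}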

If you are going to coexist endpoints, you need a way for callers to route their requests accordingly. For systems making use of HTTP, I have seen this done with both version numbers in request headers and also in the URI itself—for example, /v1/customer/ or /v2/customer/. I’m torn as to which approach makes the most sense. On the one hand, I like URIs being opaque to discourage clients from hard-coding URI templates, but on the other hand, this approach does make things very obvious and can simplify request routing.
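For the URI-based approach, the routing itself can be very thin - a sketch, assuming handlers like those above are passed in; a header-based scheme would switch on an agreed version header instead of the path prefix:

def route(path: str, request: dict, handlers: dict) -> dict:
    """Dispatch /v1/customer/... and /v2/customer/... to the matching handler.

    handlers might be {"v1": handle_v1, "v2": handle_v2} from the sketch above.
    """
    version = path.strip("/").split("/")[0]   # "v1" from "/v1/customer/123"
    handler = handlers.get(version)
    if handler is None:
        raise ValueError(f"unsupported API version: {version}")
    return handler(request)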

For RPC, things can be a little trickier. I have handled this with protocol buffers by putting my methods in different namespaces—for example, v1.createCustomer and v2.createCustomer—but when you are trying to support different versions of the same types being sent over the network, this can become really painful.

Which Approach Do I Prefer?

For situations where the same team manages both the microservice and all of its consumers, I am somewhat relaxed about the occasional lock-step release. Assuming it really is a one-off, doing this where the impact is limited to a single team can be justifiable. I am very cautious about this though, as there is the danger that a one-off activity becomes business as usual, and there goes independent deployability. Use lock-step deployments too often, and you’ll end up with a distributed monolith before long.

Coexisting different versions of the same microservice can be problematic, as we discussed. I’d consider doing this only in situations where we plan to run the microservice versions side by side for a short period of time. The reality is that when you need to give consumers time to upgrade, you could be looking at weeks or more. In other situations where you might coexist microservice versions, perhaps as part of a blue/green deployment or canary release, the durations involved are much shorter, offsetting the downsides of this approach.

My general preference is, where possible, to use emulation of old endpoints. The challenges of implementing emulation are, in my opinion, much easier to deal with than those of coexisting microservice versions.

The Social Contract

Which approach you pick will be driven in large part by the expectations consumers have of how these changes will be made. Keeping the old interface lying around has a cost, and ideally you’d like to turn it off and remove associated code and infrastructure as soon as possible. On the other hand, you want to give consumers as much time as possible to make the change. And remember, in many cases the backwards-incompatible changes you are making are things that have been asked for by the consumers, or will actually end up benefiting them. There is a balancing act, of course, between the needs of the microservice maintainers and those of the consumers - and it needs to be discussed.

I’ve found that in many situations, how these changes will be handled has never been discussed, leading to all sorts of challenges. As with schemas, having some degree of explicitness in how backwards-incompatible changes will be made can greatly simplify things.

You don’t necessarily need reams of paper and huge meetings to agree on these things, but both the owner and the consumers of a microservice need a shared understanding of how changes will play out. Assuming you aren’t going down the route of lock-step releases, I’d suggest being clear on the following:

  • How will you raise the issue that the interface needs to change?

  • How will the consumers and microservice teams collaborate to agree on what the change will look like?

  • Who is expected to do the work to update the consumers?

  • When the change is agreed, how long will consumers have to shift over to the new interface before it is removed?

Remember, one of the secrets to an effective microservice architecture is to embrace a consumer-first approach. Your microservices exist to be called by other consumers. Their needs are paramount, and if you are making changes to a microservice that are going to cause upstream consumers problems, this needs to be taken into account.

In some situations, of course, it might not be possible to change the consumers. I’ve heard from Netflix that they had issues (at least historically) with old set-top boxes using older versions of the Netflix APIs. These set-top boxes cannot be upgraded easily, so the old endpoints have to remain available unless and until the number of older set-top boxes drops to a level where their support can be disabled. Decisions about when to stop old consumers from being able to access your endpoints can sometimes come down to finances: how much it costs you to support the old interface, balanced against how much money you make from those consumers.

Tracking Usage

Even if you do agree on a time by which consumers should stop using the old interface, would you know if they had actually stopped using it? Making sure you have logging in place for each endpoint your microservice exposes can help, as can ensuring that you have some sort of client identifier so you can chat to the team in question if you need to work with them to migrate away from your old interface. This could be something as simple as asking consumers to put their identifier in the User-Agent header when making HTTP requests, or you could require that all calls go via some sort of API gateway where clients need keys to identify themselves.
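In practice this can be as simple as a logging hook on each endpoint. Here’s a sketch, assuming consumers have agreed to identify themselves in the User-Agent header - the names and conventions are illustrative:

import logging

logger = logging.getLogger("endpoint-usage")

def record_usage(endpoint: str, headers: dict) -> None:
    """Log which client hit which endpoint version, e.g. a consumer sending
    'User-Agent: email-service/1.4'. Aggregating these log lines over time
    shows who is still calling the old interface - and who you need to chat to."""
    client = headers.get("User-Agent", "unknown")
    logger.info("endpoint=%s client=%s", endpoint, client)

# record_usage("/v1/customer", {"User-Agent": "email-service/1.4"})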

Extreme Measures

So, assuming you know a consumer is still using an old interface that you want to remove, and they are dragging their heels about moving to the new version, what can you do about it? Well, the first thing to do is talk to them. Perhaps you can lend them a hand to make the changes happen. If all else fails, and they still don’t upgrade even after agreeing to do so, then there are some extreme techniques I’ve seen used.

At one large tech company, I discussed how they handled this issue. Internally, they had a very generous period of one year before old interfaces would be retired. I asked how they knew if consumers were still using the old interfaces, and they replied that they didn’t really bother tracking that information; after one year they just turned the old interface off. It was recognized internally that if this caused a consumer to break, it was the fault of the consuming microservice’s team - they’d had a year to make the change, and hadn’t done it. Of course, this approach won’t work for many (I said it was extreme!). It also leads to a large degree of inefficiency: by not knowing whether the old interface was still used, they denied themselves the opportunity to remove it before the year had passed. Personally, even if I were to suggest just turning the endpoint off after a certain period of time, I’d still definitely want tracking of who was going to be impacted.

Another extreme measure I saw was actually in the context of deprecating libraries, but it could also theoretically be used for microservice endpoints. The example given was of an old library that people were trying to retire from use inside the organization, in favour of a newer, better one. Despite lots of work, other teams were still dragging their heels. The solution was to insert a sleep in the old library, so that it responded more slowly to calls (with logging to show what was happening). Over time, the team just kept increasing the duration of the sleep, until eventually the other teams got the message. You obviously have to be extremely sure that you’ve exhausted other reasonable efforts to get consumers to upgrade before considering something like this!

DRY and the Perils of Code Reuse in a Microservice World

One of the acronyms we developers hear a lot is DRY: don’t repeat yourself. Though its definition is sometimes simplified as trying to avoid duplicating code, DRY more accurately means that we want to avoid duplicating our system behavior and knowledge. This is very sensible advice in general. Having lots of lines of code that do the same thing makes your codebase larger than needed, and therefore harder to reason about. When you want to change behavior, and that behavior is duplicated in many parts of your system, it is easy to forget everywhere you need to make a change, which can lead to bugs. So using DRY as a mantra, in general, makes sense.

DRY is what leads us to create code that can be reused. We pull duplicated code into abstractions that we can then call from multiple places. Perhaps we go as far as making a shared library that we can use everywhere! It turns out though that sharing code in a microservice environment is a bit more involved than that. As always, we have more than one option to consider.

Sharing Code Via Libraries

One of the things we want to avoid at all costs is overly coupling a microservice and consumers such that any small change to the microservice itself can cause unnecessary changes to the consumer. Sometimes, however, the use of shared code can create this very coupling. For example, at one client we had a library of common domain objects that represented the core entities in use in our system. This library was used by all the services we had. But when a change was made to one of them, all services had to be updated. Our system communicated via message queues, which also had to be drained of their now invalid contents, and woe betide you if you forgot.

If your use of shared code ever leaks outside your service boundary, you have introduced a potential form of coupling. Using common code like logging libraries is fine, as they are internal concepts that are invisible to the outside world. RealEstate.com.au makes use of a tailored service template to help bootstrap new service creation. Rather than make this code shared, the company copies it for every new service to ensure that coupling doesn’t leak in.

The really important point about sharing code via libraries is that you cannot update all uses of the library at once. Although multiple microservices might all use the same library, they do so typically by packaging that library into the microservice deployment. To upgrade the version of the library being used, you’d therefore need to redeploy the microservice. If you want to update the same library everywhere at exactly the same time, this could lead to a widespread deployment of multiple different microservices all at the same time, with all the associated headaches.

So, if you are using libraries for code reuse across microservice boundaries, you have to accept that multiple different versions of the same library might be out there at the same time. You can of course look to update all of these to the latest version over time, and as long as you are OK with this fact, then by all means reuse code via libraries. If you really do need to update that code for all of its users at exactly the same time, then you’ll actually want to look at reusing code via a dedicated microservice instead.

There is one specific use case associated with reuse through libraries which is worth exploring further, though.

Client Libraries

I’ve spoken to more than one team that has insisted that creating client libraries for your services is an essential part of creating services in the first place. The argument is that this makes it easy to use your service, and avoids the duplication of code required to consume the service itself.

The problem, of course, is that if the same people create both the server API and the client API, there is the danger that logic that should exist on the server starts leaking into the client. I should know: I’ve done this myself. The more logic that creeps into the client library, the more cohesion starts to break down, and you find yourself having to change multiple clients to roll out fixes to your server. You also limit technology choices, especially if you mandate that the client library has to be used.

A model for client libraries I like is the one for Amazon Web Services (AWS). The underlying SOAP or REST web service calls can be made directly, but everyone ends up using just one of the various software development kits (SDKs) that exist, which provide abstractions over the underlying API. These SDKs, though, are written by the community or AWS people other than those who work on the API itself. This degree of separation seems to work, and avoids some of the pitfalls of client libraries. Part of the reason this works so well is that the client is in charge of when the upgrade happens. If you go down the path of client libraries yourself, make sure this is the case.

Netflix in particular places special emphasis on the client library, but I worry that people view that purely through the lens of avoiding code duplication. In fact, the client libraries used by Netflix are as much (if not more) about ensuring reliability and scalability of their systems. The Netflix client libraries handle service discovery, failure modes, logging, and other aspects that aren’t actually about the nature of the service itself. Without these shared clients, it would be hard to ensure that each piece of client/server communications behaved well at the massive scale at which Netflix operates. Their use at Netflix has certainly made it easy to get up and running and increase productivity while also ensuring the system behaves well. However, according to at least one person at Netflix, over time this has led to a degree of coupling between client and server that has been problematic.

If the client library approach is something you’re thinking about, it can be important to separate out client code to handle the underlying transport protocol, which can deal with things like service discovery and failure, from things related to the destination service itself. Decide whether or not you are going to insist on the client library being used, or if you’ll allow people using different technology stacks to make calls to the underlying API. And finally, make sure that the clients are in charge of when to upgrade their client libraries: we need to ensure we maintain the ability to release our services independently of each other!
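One way to picture that separation - a hypothetical sketch, not a prescription - is to keep discovery, timeouts, and logging in a generic transport layer, with a deliberately thin service-specific client on top, and with the consumer deciding when to pull in a newer version of either:

import urllib.request

class Transport:
    """Generic concerns - service discovery, timeouts, logging - with no
    knowledge of any particular destination service."""

    def __init__(self, resolve):
        self.resolve = resolve   # e.g. a service-discovery lookup: name -> base URL

    def get(self, service_name: str, path: str) -> bytes:
        base_url = self.resolve(service_name)
        with urllib.request.urlopen(base_url + path, timeout=5) as response:
            return response.read()

class CustomerClient:
    """Thin, service-specific layer: knows the customer endpoints and nothing
    about how requests are delivered, retried, or routed."""

    def __init__(self, transport: Transport):
        self.transport = transport

    def fetch_customer(self, customer_id: str) -> bytes:
        return self.transport.get("customer-service", f"/customers/{customer_id}")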

Service Meshes and API Gateways

Service meshes and API gateways do offer a potential way to share code between microservices without requiring the creation of new client libraries or new microservices. Put (very) simply, service meshes and API gateways can work as proxies between microservices. This means they can be used to implement some microservice-agnostic behaviour which might otherwise have to be done in code, such as service discovery or logging.

If you are using either an API gateway or a service mesh to implement shared, common behaviour for your microservices, it’s essential that this behaviour is totally generic - in other words, that the behaviour in the proxy bears no relation to any specific behaviour of an individual microservice.

API Gateways and Service Meshes are topics which need to be explored more fully - we’ll come back to them both in [Link to Come].

Summary

So, we’ve covered a lot of ground here - let’s break down some of what we’ve covered.

  • Firstly, ensure that the problem you are trying to solve guides your technology choice. Use your context and your preferred style of communication to select the technology that is most appropriate to you - don’t fall into the trap of picking the technology first. The model shared again in Figure 4-5 can help guide your decision making, but just following this model isn’t a replacement for sitting down and thinking about your own situation.

Figure 4-5. Different styles of inter-microservice communication along with example implementing technologies
  • Whatever choice you make, consider the use of schemas, both to help make your contracts more explicit and to help catch accidental breaking changes.

  • Where possible, strive to make changes that are backwards compatible, to ensure that independent deployability remains a possibility.

  • If you do have to make backwards incompatible changes, find a way to allow consumers time to upgrade to avoid lock-step deployments.

Next, we need to address the fact that most people don’t start with a microservice architecture, and look at how you can take an existing monolithic system and migrate it to a microservice architecture.

1 https://protolock.dev/

2 https://github.com/OAI/OpenAPI-Specification/

3 Robinson, Ian, Jim Webber, and Savas Parastatidis. REST in Practice: Hypermedia and Systems Architecture. O’Reilly, 2010.

4 https://graphql.org/

5 Stopford, Ben. Designing Event-Driven Systems. O’Reilly 2017.

6 Narkhede, Neha, Gwen Shapira and Todd Palino. Kafka: The Definitive Guide. O’Reilly 2017

7 https://github.com/real-logic/simple-binary-encoding

8 https://capnproto.org/

9 https://google.github.io/flatbuffers/

10 Martin Fowler explores this in more detail in the context of schemaless datastorage: https://martinfowler.com/articles/schemaless/

11 https://www.asyncapi.com/

12 https://cloudevents.io/

13 https://github.com/nilslice/protolock

14 https://www.npmjs.com/package/json-schema-diff-validator

15 Note that there are actually three different tools in this space with the same name! https://github.com/Azure/openapi-diff seems to get closest to a tool that actually passes or fails compatibility

16 https://github.com/confluentinc/schema-registry#documentation
