
1. The Ticket Sales Problem


When I first started out in IT, the industry was still quite niche. Although many people used computers in their day job, access to the Internet or even a personal computer at home was still a way off for most people. This was around the time that the industry was gearing up to fix the millennium bug: an issue that was prevalent in many systems because, 20 or 30 years earlier, when these systems were written, the programmers had assumed the systems would be replaced before the year 2000.

At this time, if you wanted to go and see a band play, you typically had two choices: you physically visited a record shop or ticket sales venue, for which the queues often stretched out of the door, or you used the telephone. These days, you wouldn’t have to go far to find someone young enough to not remember such times, but it’s worth bearing in mind how far we have come in 20 years!

The queues stretching out of the door of the ticket booths have now been replaced by millions of people accessing a website at the same time to try and buy tickets. In this chapter, we’ll be discussing one of the most prevalent issues in the IT industry today: how to cope with massive spikes in traffic.

Background

Our new client, 123 Tickets, has asked us to replace their existing system. The existing system that they run works fine for most of the year, but the company makes most of their revenue from just three dates, when they are contracted as the main reseller for premier music festivals. Their existing system simply can’t cope with the huge spike in sales and frequently crashes just minutes after the tickets go on sale.

The venues are unhappy with 123 Tickets’ ability to cope with the sales, and the contracts are in danger of not being renewed.

Let’s have a look at last year’s usage. Figure 1-1 shows a graph with the usage and error statistics from last year.
Figure 1-1

System usage graph

In Figure 1-1, we can see some very useful information. Firstly, almost all of the business that this company conducts takes place during just three months of the year; secondly, when the system is busy, the system errors spike; and finally, the demand for tickets far exceeds the supply.

Let’s consider exactly what the requirements from 123 Tickets are.

Requirements

Whenever a system, of any type, is designed, a target should be established. For example, if you’re designing a car, your target is a vehicle that transports one or more people between places and is roadworthy (whatever that may mean in your locality). Whether your car has three wheels, or two doors, or is painted blue is optional; that is, the car is still a car if it has two, four, or five doors; it’s still a car if it is blue or red; however, it is not a car if it has no wheels, because it would be unable to fulfill its requirement of transporting people.

When designing a system, it’s always worth considering this: what does the system need to do in order to fulfill its basic function? For example, our ticket ordering system presumably needs to allow people to purchase tickets; if it did not, we could not sensibly call it a “ticket ordering system.” But does it need to allow people to purchase ice cream when they arrive at the venue? Probably not, as without that, it’s still a “ticket ordering system.”

We should, therefore, discuss with the client the list of things that the system needs to do in order to be the system; and all the time, we should challenge whether that thing is necessary. To clarify, I’m not saying that anyone should sit in front of a client and argue them into submission about features that they are requesting and willing to pay for; however, we may decide that what is being described is not a single system, but two, or three. Why this is useful is something we’ll revisit later in this chapter.

I’m very purposely staying away from any reference to software at this stage, and the reason will become clearer later on.

Let’s lay out exactly what we need the system for 123 Tickets to do. This list is a high-level list of features that the current system provides and which the client has identified we would need to provide:
  • Maintain a list of registered users.

  • Provide a list of upcoming events for which there are tickets available.

  • Allow a user to purchase up to ten tickets for any single event.

  • Maintain a count of available tickets.

  • Allow users to pick a seat where applicable; not all events are seated (and none of the big festivals are seated).

Now that we’ve identified what’s required, we can discuss the options for providing that.

Options

All too often, software developers and architects reach for the tools that they know best. I’m no exception; any code samples that you’ll see in this book are written in .Net. However, exploring other possibilities is not only a useful exercise but also solidifies the requirements in our minds. In each chapter, I’ll make the case for solving the problem without using technology.

It may seem like a strange thing to suggest in a book on software architecture; however, all over the world, people are solving problems without technology, and in some cases, that’s the best solution. Software design and development costs money; in some cases, it costs a considerable amount of money, and it is not without risk. According to a 2017 report from the Project Management Institute, between 6 and 24% of projects end in failure. These are not only software projects; however, if we accept that as a rough guide, it means that we can reasonably expect around one in ten software projects to fail (source: www.pmi.org/learning/thought-leadership/pulse/pulse-of-the-profession-2017).

In our case, 123 Tickets has an existing system, but let’s imagine that our advice to the client is to remove that system and replace it with a manual process. What would that look like?

Manual Process

First of all, we would need to maintain a list of valid users for the system; we could keep this in an address book. Each time someone wished to be added to the system, we would write their name and address into our address book; the maintenance of this book would represent all or part of somebody’s job.

Secondly, we would keep a list of events; presumably, we’d use something like a yearly diary to do so; each event would be marked in on the day it was to happen. Somebody would then go through every event for the following two or three months and write on a sheet of paper what, and when, these events were.

Our next step would be to order the tickets from the supplier – when they arrive, our ticket count would be maintained by somebody counting the remaining tickets for each event.

When a customer phoned up, the operator would go through the following process:
  1. Ask for a name, and look them up in the address book; if they are not already in there, add them.

  2. Check the event that they wish to book tickets for and ensure that there are sufficient tickets remaining.

  3. If the venue is seated, talk through the options for seating with the customer, and establish which tickets would be best.

  4. Put the tickets in an envelope (so that they cannot be sold to another person) and take payment details.

  5. If the payment fails, or the customer changes their mind before payment is made, return the tickets to the pile for that event; otherwise, post them to the customer.
In fact, when we consider this, we realize that the manual process is actually quite neat; maybe this is the right approach. Of course, there’s a minor snag; even during the smallest festival, over 500,000 attempts were made to purchase tickets; however, before we abandon our manual approach, let’s just continue this thought experiment for another few paragraphs.

Let’s say that we did need to implement this manually and we had a single operator. What would happen if 500,000 or more people tried to phone in to buy tickets at the same time? Well, the way most basic phone systems work is that the first person would be connected, and until that sale had finished, everyone would get an engaged tone.

So how could we structure this so that, given enough time, we could deal with all these requests? One possible solution may be to divert the calls to an answering machine service (for the purpose of this, we’ll assume that the answering machine can take multiple calls at any one time without the caller getting an engaged tone), asking the customer to leave details of the venue and ticket requirements; the operator could then phone each person back as they became free.

Our manual system does have an issue; let’s say that our operator is very efficient and can process each call in five minutes; if that were the case, it would take this person around 42,000 hours to process 500,000 tickets. That’s around 22 years (based on a 35-hour week)!

How would we solve that? In fact, you’re probably thinking that the solution is very obvious: employ more people. If one person would take 22 years, it follows that 2 would take 11. If we had 1000 people, we’d clear the backlog in just over a week. While this may seem obvious, it’s easy to forget this knowledge when we start looking at automated solutions.
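If you’d like to sanity-check that arithmetic, the following throwaway C# snippet reproduces the numbers (give or take rounding), using the figures above: 500,000 calls, five minutes per call, and a 35-hour working week.

using System;

int calls = 500_000;
double minutesPerCall = 5;
double totalHours = calls * minutesPerCall / 60;  // ~41,667 hours
double weeksForOne = totalHours / 35;             // ~1,190 thirty-five-hour weeks

Console.WriteLine($"One operator: {totalHours:N0} hours, ~{weeksForOne / 52:N1} years");
foreach (int operators in new[] { 2, 1000 })
{
    Console.WriteLine($"{operators} operators: ~{weeksForOne / operators:N1} weeks");
}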

This is not, however, a new system; the client already has a system in place and running, so let’s investigate what their existing system looks like.

Existing System

123 Tickets has an existing system. Figure 1-2 shows the architecture for their existing system.
Figure 1-2

Existing architecture

There’s a lot to like with this current system design; it’s modular – that is, each distinct business domain is separated into its own module – and the bulk of the logic is in a separate API. However, let’s go back to our manual process and see where we could rethink some of the modules. Our manual process seems to fall naturally into the following four areas:
  • Maintain a list of valid users.

  • Maintain a list of events.

  • Maintain a ticket inventory.

  • Ticket ordering process.

This seems close; however, the existing system has a module that manages seat allocation, which we’ve identified should probably fit under the ordering process. Further, it appears that we need a ticket inventory; it’s very likely that the existing system does maintain one – perhaps inside the Booking Module.

Existing System Considerations

When working with an existing system, there’s a balance to be struck; often, if a system is already in place and working correctly, this is a much better solution than a system that doesn’t exist yet.

Let me clarify what I’m saying here and what I’m not saying. Each system that’s in place has a value; that value comprises factors such as the cost of the system to develop, the value that the system provides while it’s running, the cost of supporting and maintaining the system, and the cost of replacing the system. The latter is a very important figure and is not measured in just pounds and pence.

Replacing a system involves disruption; there’s the disruption caused by actually replacing the system, but also, you’ll need business domain knowledge, which means you’ll be removing people from their day-to-day jobs; the new system is unlikely to look and feel exactly like the old, and so there will be a period when people get used to the new system. There’s a risk associated with both creating the new system and implementing it; the risk will vary and can be mitigated, but it will never be zero.

This does not mean that if we have a system that works poorly, we should design around it for fear of breaking anything; however, after we establish a goal, a target architecture, it may influence how, or even whether, we choose to reach that target completely.

Something else that we should come back to at this stage is what constitutes our core system.

Minimum Viable Product

Earlier, we asked which features of the ticket ordering system make it a ticket ordering system; in fact, seat allocation may be one feature that is not required in this system. This is the concept of a Minimum Viable Product, or MVP, and it applies equally to replacing or upgrading a system as to creating a new one; without defining the scope of the system, you risk any project spiralling out of control, both in terms of cost and complexity.

We should be careful how we go about defining an MVP. Clients and developers are not the only people capable of causing scope creep. Let’s assume that your team has a product manager, and they decide, after speaking to the client, that the system needs an ice cream ordering module; without a documented MVP, it’s quite easy for a new user story to be created, and everyone on the project would then assume that this functionality formed part of the MVP.

Alternatively, imagine that a system tester finds an issue with the system, along the lines of “The system’s ice cream ordering facility is missing or nonfunctional.” Again, this may inadvertently result in scope creep.

These are, clearly, contrived examples; and I’m not singling out the roles mentioned – pretty much anyone on, or off, the project could inadvertently instigate scope creep. I’m simply making the point that having determined what our MVP is, we should maintain that document and that everyone in the team should bear it in mind when considering changes, defects, enhancements, or requests.

That said, let’s move on to discussing what our target architecture might be.

Target Architecture

A target architecture is just that – something to aim for. You can establish a target architecture, with full knowledge that you’re unlikely to ever actually reach that architecture. However, it gives you a guide for how to change your system and where to add new functionality.

Before we get into what our target architecture might look like, we should address the elephant in the room: the inability of our system to deal with the spikes in usage.

When designing a new system, any architectural decisions should be based on the expected usage of the system, plus a sensible margin. For example, in Chapter 5, we’ll be discussing an application that is responsible for administering a system; maybe two or three people will be using that at any one time (and that’s on a busy day); even if we gave that system 300% leeway for growth, we’re still only dealing with under ten users.

I’d like to clarify that if you don’t consider performance or concurrency in your system at all, even ten users can overload it. However, in this chapter, we’ll discuss techniques for dealing with huge spikes of traffic; here, we have potentially millions of concurrent users, and the same techniques would be overkill for most systems.

The takeaway here is that in software, as in life, everything has a price: a literal price (the software costs time, and therefore money, to create and maintain) and a technical price. Ironically, most of the time, the price for dealing with huge spikes in traffic is a slight hit on speed.

How to Deal with High Throughput

Imagine that we have a funnel and we’re trying to pour water into the funnel (Figure 1-3). Instinctively, you’ll realize that there is a limit to the rate at which we can pour water into that funnel, above which, the funnel will fill up and eventually, the water will spill out of the top.
Figure 1-3

Pouring water into a funnel

If we want to cope with the additional flow, we only have the following choices:
  • Widen the funnel

  • Use multiple funnels

If we go back to our little thought experiment around how we would do this manually, we see that we have similar choices in our manual system: we can either ask our telephonist to work faster, or we can increase our staff.

Widening the Funnel

Widening the funnel in the case of a web server means essentially one of two things: increase the physical capacity of the server, or increase the capacity of the service.

Server

Increasing the capacity of the server may be very straightforward. If you’re hosting this on a cloud provider, you may simply need to press a few buttons, and the server is suddenly a higher specification. If you’re hosting this yourself, then you might need to upgrade the physical machine. This approach has some definite advantages: you don’t need to change the software – which in turn means that the change is very low risk (in fact, the only risk is that the hardware upgrade may fail for some reason – these days, a very unusual event).

This approach is not to be dismissed. If you have a service that has consistently high traffic, or where spikes are predictable, this can often be a cheap way to keep an existing system working for an extra six months.

However, it does have its downsides. Firstly, you still have a limited capacity; that is, you’ve widened the funnel, but it’s still a funnel, so there is an amount of traffic that will still overload your system. Secondly, you are paying the price for the additional hardware, even if (as in our case) you don’t come close to using it for 362 days of the year.

Service

Increasing the capacity of the service can simply (and often) mean optimizing your code. If the service is not processing the required traffic, that may be because you have inefficient code processing the traffic, or it may relate to the speed at which you can insert data into the database.

Most systems amount to a database with some kind of façade in front of it that allows the user to insert data and retrieve it. What this means is that the bottleneck (or the thin end of the funnel, if you like) is often the database.

Let’s spend some time exploring what this might mean and how we could leverage that knowledge to increase our throughput. For the purposes of this illustration, I’ll talk about the specific challenges of a relational database; NoSQL databases have their own challenges, which in some respects are the exact opposite of those outlined here.

In a relational database, there is always a trade-off between reading data and writing it; databases can write data very quickly, where there are no indexes on the table; however, in this case, it would be almost impossible to read the data, as you would need to execute a full table scan for any query.

As we describe this, a somewhat obvious solution seems to present itself. If you can optimize a database to read but it slows down writes and you can optimize a database for writes but pay the price on reads, what if you could simply separate the two activities?

In fact, this concept has a name: CQRS (Command Query Responsibility Segregation). I first became aware of it after reading a blog post by Martin Fowler; however, he credits Greg Young as the first person to describe it (https://martinfowler.com/bliki/CQRS.html).

This does help with throughput; you can, essentially, have an offline process to update your read database from the write database; however, you are still writing into a database, which means that you are still limited by the speed at which the inserts can run. Further, you have fractured your system; you now have two data stores to worry about, and the data between them may not be consistent. In some cases, this approach may be an excellent choice; in fact, it may be an excellent choice in this case; however, on its own, it will not solve the issue.
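To make that idea concrete, here is a minimal sketch of what the separation might look like in code. This is not part of the 123 Tickets system – the type and interface names are invented purely for illustration:

using System;
using System.Threading.Tasks;

// The write side: optimized for fast inserts (for example, few indexes).
public record ReserveTicketCommand(Guid EventId, int Quantity);

public interface ITicketCommandHandler
{
    Task HandleAsync(ReserveTicketCommand command);
}

// The read side: optimized for queries (for example, heavily indexed or
// denormalized). An offline process keeps it up to date from the write
// store, so the data it returns may be slightly stale.
public record TicketAvailability(Guid EventId, int TicketsRemaining);

public interface ITicketQueryHandler
{
    Task<TicketAvailability> GetAvailabilityAsync(Guid eventId);
}

Because the two sides are separate, each can be backed by a data store tuned for its job – which is exactly the trade-off we’ve just described.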

Multiple Funnels

If, instead of improving our single funnel, we can increase the funnel count (Figure 1-4), we can potentially accept twice, three times, or twenty times the input data. Of course, we’re not proposing to have 20 databases, so what we’re really doing is moving the problem. Again, this approach is not free, but before we discuss the cost, let’s discuss what this might look like using our funnel example.
Figure 1-4

Multiple funnels

To some extent, the analogy breaks down a little at this point; however, this does serve to illustrate broadly what we’re doing. Essentially the trough stores the water as it comes in and releases it at a pace that the funnels can cope with, meaning that each funnel is no longer overloaded as the water comes in.

I think we’ve taken this analogy as far as it is sensible, so let’s think back again to our manual process. We thought that we could increase our throughput by increasing the number of people working through the bookings. We had, in fact, devised a system very similar to this; the answering machine was essentially like the trough above – as calls come in, they are stored on the answering machine. Our telephonists then process the calls at a rate that they can cope with. Here, we have a queue of messages, and that’s the approach that we’re going to use for our architecture – a message queue.

Message Queues

Let’s quickly review exactly what a message queue is, in a way that involves the technology itself without an answering machine or a funnel in sight.

The principle behind this is simple: a message is sent to the queue, and one or more clients can retrieve and process that message. In practice, there is more to it than that; however, you could (and many people do) implement a message queue as, for example, a table in a database, or a series of files on a disk.

In one of my first jobs, I worked on an EDI (Electronic Data Interchange) system. This system worked in the following way:
  1. A sales order would be raised. The system that raised it would add a file to a location on a shared network drive. The file would simply be a comma-separated file in a pre-agreed format.

  2. The EDI process would pick up this file (or the oldest file in the directory), read the contents, parse them, and then add the data into a second system (in our case, a CRM system).
At the time, we didn’t think of this as a message queue, but that is essentially what it was. With a very small tweak, we could have introduced multiple consumers (by simply moving the file to a second directory before processing).
Figure 1-5

EDI

The system worked reasonably well; most of the time, the orders came through without issue; however, on occasion, the message (or the file) would either be corrupted in some way or would contain information that we didn’t expect. For example, let’s imagine that the file looked like this:
PROD-CODE-1,3,20.56,2002-02-09
PROD-CODE-2,1,10.00,2002-02-09
PROD-CODE-3,15,0.23,2002-02-10

Let’s say that this information represents:

Product Code, Order Quantity, Unit Price, Sales Date

Now, let’s imagine that we get a file through that looks like this:
PROD-CODE-1,,3,20.56,2002-02-09
PROD-CODE-2,,1,10.00,2002-02-09
PROD-CODE-3,,15,0.23,2002-02-10

The EDI system is expecting four fields, but seeing five. In this case, the EDI system itself would typically crash, and every subsequent message would start to queue up; even though the subsequent messages may be fine, our system is down.

I imagine you’re thinking of several ways that this issue could be alleviated: the file could simply be ignored or skipped if there was an error, or some kind of error handling process could move the file out of the way.
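As a rough sketch – the directory names here are invented, and the four-field format is the one we agreed above – the hardened consumer might have looked something like this:

using System;
using System.IO;
using System.Linq;

const string inbox = @"\\shared\edi\inbox";
const string processing = @"\\shared\edi\processing"; // claims the file for this consumer
const string errors = @"\\shared\edi\errors";         // a crude dead-letter location

// Pick up the oldest file, as the original process did.
FileInfo oldest = new DirectoryInfo(inbox).GetFiles("*.csv")
    .OrderBy(f => f.CreationTimeUtc)
    .FirstOrDefault();
if (oldest is null) return;

// Moving the file before processing means that a second consumer
// cannot pick up the same order.
string claimed = Path.Combine(processing, oldest.Name);
oldest.MoveTo(claimed);

foreach (string line in File.ReadAllLines(claimed))
{
    string[] fields = line.Split(',');
    if (fields.Length != 4) // Product Code, Order Quantity, Unit Price, Sales Date
    {
        // Instead of crashing (and blocking every subsequent order),
        // move the bad file out of the way and carry on.
        File.Move(claimed, Path.Combine(errors, oldest.Name));
        break;
    }
    // ... parse the fields and add the order to the CRM system ...
}

Notice, though, that we are now hand-rolling file claiming, error handling, and a dead-letter location.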

What we needed here was a message broker.

Message Brokers

A message broker is a piece of software that is agnostic of the message itself; the message broker has a number of protocols that it can communicate over, but its real value is that it acts as an intermediary between the sender and the consumer.

In our earlier EDI case, the message broker would essentially replace the file system; however, message brokers are pieces of software in their own right. Rather than just communicating via an agreed location on a file share, you communicate via the message broker: send messages to the broker, and take messages from it. In the earlier example, where the message cannot be read, message brokers provide functionality to move that message out of the way while others are processed.

Typically, message brokers provide a lot of additional functionality, most of which is beyond the scope of this book. However, since we are talking about using queues, it’s worth understanding a little about the infrastructure and setup that may be involved in using a queue.
Figure 1-6

Sales order system using a message broker

As we said earlier, there’s no free lunch here. The message broker provides a great deal of functionality for us; it maintains state, handles failures, takes care of the routing between endpoints (so no one needs to write to a disk share but rather sends a message to an endpoint), and handles transactions; however, as a result, you’re no longer simply writing files to a disk on a share, so you sacrifice some performance. In reality, with modern message brokers, using clever caching techniques, the hit is probably negligible (in fact, I imagine if you compared our EDI system back then to any modern message broker, the message broker would be far more performant) – but it’s worth bearing in mind that you are introducing processing and complexity into the system.

I’d also like to clarify a point about message brokers; there are dozens of them out there: RabbitMQ, ActiveMQ, ZeroMQ, Apache Kafka, GCP Pub/Sub, Amazon SQS, Microsoft Azure Service Bus, and the list goes on and on. They all have implementation quirks, and they all have advantages and disadvantages. It’s probably not even fair to call them all “message brokers”; but my point here is that I’m talking about generics in a world of specifics. I will be covering a specific implementation later in this chapter; but it’s very likely that you can pick any message broker and substitute it for the one that we use here.

Now that we’ve discussed what queues are, and why they can help in this specific scenario, and spoken about such things as message brokers, let’s briefly discuss other advantages that using a queue can provide.

Separation of Concerns

The reason that most people choose to use a queue initially is the same as ours: they have a large volume of data, and they want their system to be able to cope with sudden fluctuations in traffic. However, using a message bus provides other benefits. One such benefit is separation of concerns; consider Figure 1-7.
Figure 1-7

Queue

Process A can perform some task and put a message onto a bus. Process B can read the message from that bus and perform a task. However, Process A has absolutely no dependency or even knowledge that Process B exists and vice versa. In fact, we could replace Process A with another process, or several processes, without affecting Process B in the slightest.

The second advantage is resilience. Let’s imagine that Process B fails: if Process A were calling Process B, then that would cause both processes to fail; however, since all Process A is doing is writing to a queue, the messages will simply remain in the queue until such time as they are picked up.

Let’s take a real-world and very common use of a queue: email. Process A in this case is a website that processes a sales order, and Process B sends confirmation emails out to the customer. If the website were to directly call a process to send emails and that failed, then either your entire website would be down or the email confirmations would be lost. However, in the aforementioned situation, the emails are simply left in the queue.
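Assuming the QueueHelper class that we’ll meet in Listing 1-3 (the message shape below is invented for illustration), the website’s side of that arrangement could be as simple as this sketch:

using System;
using System.Text.Json;
using System.Threading.Tasks;
using TicketSales.ServiceBusHelper; // assumed namespace for the QueueHelper in Listing 1-3

public class OrderConfirmationNotifier
{
    private readonly QueueHelper _queueHelper;

    public OrderConfirmationNotifier(QueueHelper queueHelper) =>
        _queueHelper = queueHelper;

    public async Task NotifyAsync(string customerEmail, Guid orderId)
    {
        // Enqueue a request for a confirmation email rather than calling
        // an email service directly; if the email sender is down, the
        // message simply waits in the queue until it recovers.
        string messageBody = JsonSerializer.Serialize(
            new { Email = customerEmail, OrderId = orderId });
        await _queueHelper.AddNewMessage(messageBody);
    }
}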

There are many other advantages to using message queues and using a message bus in general (there are messaging patterns other than queues), and I would encourage you to investigate this paradigm further; however, an exhaustive explanation is outside the scope of this book.

Now that we’ve covered our use of a message bus, let’s see our proposed target architecture.

Target Architecture Diagram

Let’s move on to a proposed architecture and discuss the advantages and disadvantages.
Figure 1-8

Target architecture

What we can see in Figure 1-8 is that while the client can access most of the system directly, in order to book a ticket, it enters into a queue. This acts as a buffer between user demand and the capacity of the system. Further, once the ticket request has been added to the queue, we can have one or multiple processes that pick up the request and process it. In our case, this is our new service (entitled “Ticket Ordering Process” in the diagram). If we think back to our manual process, this is our army of staff, listening to answering machine messages one at a time.

You’ll notice that there is a second queue going back to the client; this is to allow a response from the process. That is, the service can reply to the client to indicate the result of the operation.

Finally, we’ve introduced a proxy between the rest of the system and the external API. Let’s spend a little time discussing why we might wish to do this.

Proxy

The Proxy Pattern was one of the patterns introduced in the Gang of Four book Design Patterns: Elements of Reusable Object-Oriented Software. It is very often used in this type of scenario in order to insulate your system against change: since you have no control over the third-party API, you introduce a buffer between your system and the external systems. The advantage here is that should the owner of the third-party system decide to change their API, you have only to change your proxy and not your main system.

You may already practice this kind of architectural decision without realizing it; if you’ve ever used an interface in place of a concrete class, you’re probably doing it so that your system can be insulated against change in that class; the change that you’re insulating against might be that in your unit test, you mock the class out, but that is a change in the implementation of that class (it’s changed from providing functionality to providing no functionality).

I often like to think of real-world examples for such things; in this case, a car is an excellent example of a real-world proxy. Depending on where you live in the world, you’re likely to need to undergo some form of test in order to legally drive a car; in that test, you’ll be assessed on your ability to move the car – for example, you’ll need to be familiar with what the pedals do and how the steering wheel functions. However, you wouldn’t expect any of this to change if you got out of a Ford and into a Volkswagen. Essentially, you’ve learned to interact with a proxy; the steering wheel may technically function differently in these two makes of car, but the public interface (turn the wheel left or right) remains consistent.

Note

As a quick disclaimer, I’m not a mechanic; the preceding illustration is just that: in fact, I have no idea whether the steering mechanics change between different makes of car; my point is merely that they could.

Now that we’ve explored the theory, let’s see what this might look like in a concrete example.

Before we get into the example, let’s talk a little about our choice of message broker. I’m going to use a cloud message queue service: Azure Service Bus. The choice of using a cloud service is the real architectural decision here, not the specific vendor; all of the cloud providers have relatively comparable services, and which one you choose will depend on a number of factors – few of which would fall into the category of an architectural choice.

A Note on Cloud Vendors

Certain vendors have variations in services, some of which may better suit your needs; for example, GCP (at the time of writing) doesn’t provide a FIFO message bus service – so if that were important, you may be forced to choose another vendor.

My suggestion would be this: unless you have a very specific need for a particular application, choose a provider based on the team’s familiarity with that technology. If everyone in your team knows AWS, then pick that.

The caveat here is market availability for people familiar with that technology; a lot of people are familiar with Azure or AWS, and recruiting someone to work with those technologies is relatively easy; however, if you pick some more obscure system, you may find it difficult to find people to work on it. This should not be underestimated; however good a technology may be, you need people to work on it, and you need people to want to work on it. If you architect a system that uses a very old or very specific technology, it doesn’t matter how good the system design is because either you won’t be able to get people to work on it or you’ll have to pay a huge premium for them to do so.

I would also advise against an all-in or all-out approach. What I mean by that is that you should try to keep any interaction with your cloud provider abstracted to a sensible level so that if you were to want to move from one provider to another, that would be possible without much work. There are times when this will be difficult, but remember that if your business depends on, say, GCP functions and Google decides tomorrow (or in ten years) that they are moving away from them, you are unlikely to have a say in that decision.
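To give a sense of the level of abstraction I mean, here is a minimal sketch; the names are invented, but the TicketSales.ServiceBusHelper project in this chapter’s example code plays a similar role:

using System;
using System.Threading.Tasks;

// The rest of the system depends only on this interface; the
// provider-specific code lives behind it, in one place.
public interface IMessageQueue
{
    Task<string> AddNewMessage(string messageBody, string correlationId = "");
    Task<string> GetMessageByCorrelationId(string correlationId);
}

// Moving from Azure Service Bus to, say, Amazon SQS means writing a new
// implementation of IMessageQueue, not rewriting the callers.
public class AzureServiceBusMessageQueue : IMessageQueue
{
    public Task<string> AddNewMessage(string messageBody, string correlationId = "") =>
        throw new NotImplementedException(); // would wrap the Azure Service Bus SDK

    public Task<string> GetMessageByCorrelationId(string correlationId) =>
        throw new NotImplementedException(); // would wrap the Azure Service Bus SDK
}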

This may sound a lot like I’m heavily advocating the use of containers, and I’m not necessarily. If you have the experience and time to manage a Kubernetes cluster, then maybe that is the right approach, but if you don’t, then be very careful about taking on an infrastructure overhead.

Finally, and somewhat at odds with what I’ve previously stated, it’s probably a good idea not to distribute your system across too many cloud providers. Although they are all broadly the same, using several means that you have multiple places to go and investigate when something goes wrong.

In summary, my advice is to follow the same principles that you would, were the cloud provider a database engine. Good practice is to abstract interaction with the database where feasible in case you needed to change the database engine; but you would be unlikely to interact with both SQL Server and Oracle because the maintenance overhead would outweigh any benefits they might provide.

Obviously, this is an opinion, and it’s subject to change; as Kubernetes matures, or other technologies emerge, this may become a solved problem.

Why Cloud?

One of the driving requirements in this case is the ability to cope with a traffic spike. Don’t misunderstand what I’m saying here – you absolutely can create a system to handle spikes in traffic that works on-premises.

Note

Just to elaborate on what I mean by “on-premises”: by this, I simply mean that you own the infrastructure directly; that is, if the server blew up tomorrow, it would be your responsibility to replace it.

However, what the cloud (any cloud) offers is a cost benefit. You are not responsible for maintaining the infrastructure, and you only pay for what you use. If you were to build this up locally, you would have to provision a server that could cope with the highest spike in traffic, even though for most of the time, you could probably run the system fine on a far less powerful machine.

If you had multiple jobs that all had different spikes in traffic and you had a massive scale, it might make sense to consider a private cloud; this is the last I’ll say in this book about private cloud – while it does have its place, its use cases are very specific; and certainly none of the chapters in this book lend themselves to it.

Now that we’ve briefly discussed our choices, let’s delve into some actual code. We can talk about some other decisions and principles as we go through.

Examples

Clearly, in this book, we won’t be writing a full enterprise ticket purchasing system and website. This book is about architectural principles and what they look like in a solid code example; as a result, a large portion of the system will need to be mocked out.

External APIs

The first thing we’ll need is something to simulate the third-party systems that we’ll be interfacing with. The code for this can be found in the GitHub repo associated with this book:

https://github.com/Apress/software-architecture-by-example

You can simply clone this repo and run the API; however, if you wish to follow along and create this API yourself, instructions can be found in Appendix A – Chapter 1.

Assuming that you have either cloned the repo or have followed the instructions, then the endpoints for our dummy third-party system will be as follows:

https://localhost:5001/externalticketbooking/gettickets

https://localhost:5001/externalticketbooking/reserveticket

https://localhost:5001/externalticketbooking/purchaseticket

Now that we’ve covered the third-party API, let’s discuss the two main slices of our system: reading the ticket availability and ordering a ticket.

Getting Ticket Availability

The call to get ticket availability would probably not change between the existing and the new system; however, we have decided to introduce a proxy between our API and the third-party one.

The idea behind an API is to act as a public interface to your system, so if it’s designed well (i.e., generically), then you should be able to change the functionality and structure behind the API, without affecting the interface.

Dealing with external APIs

APIs, as you can see from the third-party endpoints before, are about as loosely coupled as you can make one system from another, until such time as you tightly couple them. I like to think of APIs in the same way I think about my mobile phone. I expect my phone to change in the next version, and I half expect it to change so significantly that my current headphones will no longer work with the new phone. If you treat external APIs the same, you’ll be in a better place – assume that the API can change and can even change underneath you; be careful about using techniques such as serialization or anything that makes assumptions about the shape of the data that comes back.

If we follow Figure 1-8, we’ll see that the interface between the external APIs and our system, in our revised version, is dealt with by a proxy and no longer by the API itself.

Let’s drill into a few specific points on retrieving ticket information. In this instance, we will be calling the API directly, rather than via a service. Let’s have a look at our implementation for this method in the API.

As you can see from Listing 1-1, this method doesn’t call anything directly; it simply calls an in-process proxy.
        [HttpGet]
        public async Task<IEnumerable<TicketInformation>> GetTickets()
        {
            var result = await _ticketService.GetTickets();
            if (result.IsSuccess)
            {
                return result.Data;
            }
            else
            {
                // Log Error
                return null;
            }
        }
Listing 1-1

Code – TicketSales.Api/Controllers/TicketInventoryController.cs

In-process vs. out-of-process proxy

In our example, we’re using an in-process proxy here; essentially, we’re simply using functionality abstracted by an interface; for example, the API calls an interface method called GetTickets. This method then provides functionality to call the third-party API. This type of proxy is often referred to as an SDK. A second type of proxy is where we would, essentially, create another API that we would call, and that API would relay calls to the third party. There’s no reason why both of these types of proxy cannot be used in conjunction; however, they provide slightly different benefits. The in-process proxy does abstract the call to the API; however, changing it does mean that you need to (at least) recompile your software – the out-of-process version avoids that. The price that you pay for the out-of-process proxy is the increased complexity of having another API and, essentially, another point of potential failure.

Why haven’t we used the queue here as well as in the call to order a ticket? Well, you certainly could do that; however, it does add an overhead to the call. It’s also worth considering what benefits that would give. Let’s imagine that the system is hugely overloaded, and we call the API to get ticket availability; because the system is so busy, the call times out. Now let’s imagine the same scenario, but this time we’ve added a message to a queue; the call would immediately return, and when the system was successfully able to return the data, it would. However, it’s possible that since the call was made, the data that’s returned is now out of date.

For the sake of completeness, let’s see what the call inside the proxy looks like:
        public async Task<DataResult<IEnumerable<TicketInformation>>> GetTickets()
        {
            var client = _httpClientFactory.CreateClient();
            HttpResponseMessage response = await client.GetAsync(
                $"{_ticketServiceConfiguration.Endpoint}/GetTickets");
            if (response.IsSuccessStatusCode)
            {
                string result = await response.Content.ReadAsStringAsync();
                var options = new JsonSerializerOptions()
                {
                    PropertyNamingPolicy = JsonNamingPolicy.CamelCase
                };
                var data = JsonSerializer.Deserialize<IEnumerable<TicketInformation>>(result, options);
                return DataResult<IEnumerable<TicketInformation>>.Success(data);
            }
            else
            {
                // Log error
                return DataResult<IEnumerable<TicketInformation>>.Failure($"Error: {response.ReasonPhrase}");
            }
        }
Listing 1-2

Code – TicketSales.ThirdPartyProxy/TicketService.cs

We’ll talk more about the proxy and some of the decisions made in this code later. The one thing that I would like to say here is that we are deserializing the JSON data that is returned. I’ve done this to make the code simpler; however, if you are dealing with an API that you expect to change, then consider manually parsing the JSON. This can be done using a path-based strategy (akin to XPath for XML) – you don’t need to rewrite JSON.NET.

While manually parsing the data does represent more work, it should make your system more stable; deserialization is very dependent on the shape of the data being returned. Obviously, if the third-party API was returning a field called Price and they change it to a field called Cost, there’s pretty much nothing you can do – hence the use of a proxy.
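As a sketch of what that manual parsing might look like, the deserialization step in Listing 1-2 could be replaced with something along these lines; note that the property names and the TicketInformation members here are assumptions for illustration, not the actual shape of the third-party response (result is the response string from Listing 1-2):

using System.Collections.Generic;
using System.Text.Json;

// Walk the document and extract only the fields we rely on; extra or
// reordered fields are simply ignored, and a malformed entry is skipped
// rather than failing the entire call.
var tickets = new List<TicketInformation>();
using (JsonDocument document = JsonDocument.Parse(result))
{
    foreach (JsonElement element in document.RootElement.EnumerateArray())
    {
        if (!element.TryGetProperty("eventName", out JsonElement eventName) ||
            !element.TryGetProperty("price", out JsonElement price))
        {
            continue; // log and skip the malformed entry
        }
        tickets.Add(new TicketInformation
        {
            EventName = eventName.GetString(),
            Price = price.GetDecimal()
        });
    }
}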

Ordering a Ticket

Ordering a ticket is the big change to the system. The previous system was overwhelmed by the quantity of traffic, and so here, we’ve introduced a buffer (in the form of a queue) between the request and the API. We’ve already spoken about what a queue is and even how a queue can help with this specific scenario, so let’s see what using the queue looks like in practice.

Adding a Message to a Queue

This is the easiest part of the process, especially using something like Azure Service Bus, as there’s an SDK that ships with it. In the sample project, the code to actually interface with the Service Bus is segregated into its own project. Listing 1-3 shows what this process looks like.
public async Task<string> AddNewMessage(string messageBody, string correlationId = "")
{
    var message = new Message(Encoding.UTF8.GetBytes(messageBody))
    {
        CorrelationId = string.IsNullOrWhiteSpace(correlationId) ? Guid.NewGuid().ToString() : correlationId,
    };
    await _sendQueueClient.SendAsync(message);
    return message.CorrelationId;
}
Listing 1-3

TicketSales.ServiceBusHelper/QueueHelper.cs

There’s very little to look at in this listing, except the Correlation ID; however, we’ll come back to that when we discuss the response message. In fact, we have a second method that sends a message and waits for a reply, shown in Listing 1-4.
public async Task<string> SendMessageAwaitReply(string messageBody)
{
    var correlationId = await AddNewMessage(messageBody);
    var result = await GetMessageByCorrelationId(correlationId);
    return result;
}
Listing 1-4

TicketSales.ServiceBusHelper/QueueHelper.cs

In the next section, we’ll talk about the Correlation ID and discuss how we can use it to get a specific message from the queue.

Getting a Response from the Queue

Typically, using a message queue is one-way communication; the fire-and-forget pattern. However, in our case, we (or rather the person buying the ticket) needs to know whether their purchase was successful. There are a number of ways to deal with this; for example, you could simply send the customer an email, telling them that their purchase was successful.

In fact, this offline processing is the pattern you’ll see most often in recent times, justified either by the fact that we already have your data and so can guarantee that the operation will eventually complete or by the fact that the transaction is very likely to succeed. For example, if I post an update on a chat forum or a social media site, it doesn’t really matter if the update doesn’t appear for a few seconds, or even a few minutes. Another situation is similar to ours, with one key difference: if a retailer is certain that it can immediately supply, or at least source, a product, then it can take the order and process a refund in the rare case where it can’t obtain the product.

Our case is a little different: if we can’t get the ticket, then we need to let the user know. This could be over email; alternatively, we can use a second queue in order to communicate back to the client. Obviously, we need to match these messages; otherwise, I may be confirming some other person’s order.

In fact, if you ever go out to eat in a restaurant, you’ll see this exact system in action: the person that takes your order does so on a notepad, which has your table number written on it. They hand this to the chef, and at some point in the future, the chef hands back the slip of paper along with the food; the serving staff then check that the food matches what’s on the paper and deliver it to the table number written on the slip.

In order to achieve the same result, all we need to do is attach our table number to our request; in queue terms, that’s typically a Correlation ID, although it doesn’t have to be. For example, if I have a system-defined piece of information (say, this was a digital food ordering service and I had an actual table number), I can easily use that. The Correlation ID is simply a convenience to prevent you from having to manually add this to the message each time.

Listing 1-5 shows the code that receives the message and filters for the Correlation ID.
public async Task<string> GetMessageByCorrelationId(string correlationId)
{
    var tcs = new TaskCompletionSource<Message>();
    string returnMessageBody = string.Empty;
    var messageHandlerOptions = new MessageHandlerOptions(ExceptionReceivedHandler)
    {
        AutoComplete = false
    };
    _responseQueueClient.RegisterMessageHandler(async (message, cancellationToken) =>
    {
        if (message.CorrelationId == correlationId)
        {
            returnMessageBody = Encoding.UTF8.GetString(message.Body, 0, message.Body.Length);
            await _responseQueueClient.CompleteAsync(message.SystemProperties.LockToken);
            tcs.TrySetResult(message);
        }
        else
        {
            await _responseQueueClient.AbandonAsync(message.SystemProperties.LockToken);
        }
    }, messageHandlerOptions);
    await tcs.Task;
    return returnMessageBody;
}
Listing 1-5

TicketSales.ServiceBusHelper/QueueHelper.cs

I would ask that you don’t pay too much attention to the TaskCompletionSource; suffice it to say that it’s a mechanism in .Net to turn an event into an awaitable task. It’s only necessary because of a quirk of the .Net Core SDK.

The important aspects of this code start with AutoComplete = false. This prevents the message from being immediately acknowledged on receipt. When reading a message from a queue, there are essentially two possibilities: you can read the message and immediately remove it from the queue, or you can read the message and then remove it from the queue at some point in the future when you have determined that you’ve completed your processing. If we take this back to our restaurant example, AutoComplete = true essentially means that service staff would pick up the prepared food from the kitchen, read the table number, and then burn the ticket; if this was not the food that they were expecting, they would be forced to throw it in the bin, as no other serving staff would know where it was supposed to go.

AutoComplete = false has another effect of locking the message; this is to prevent the message from being read, and dealt with, multiple times (our restaurant analogy breaks down here, as more than one person cannot physically pick up the same plate of food).

RegisterMessageHandler then sets up a listener for the events in the system. Once an event is received, we check the CorrelationId. If it’s the one that we’re expecting, then we process it and complete the message (CompleteAsync); otherwise, we abandon it (AbandonAsync returns the message to the queue).

Summary

Our client’s main requirement for this chapter was that we create a system that was able to cope with the spike in traffic that they expect during one of their large ticket sales events.

In this chapter, we established a mechanism for analyzing a system design by comparing the manual process to the proposed automated one. We will conduct the same exercise for all subsequent chapters. If you follow this process and decide that, in fact, a manual process is faster, cheaper, safer, or simpler, then you should think very carefully about whether an automated system is the best approach.

We’ve investigated the principles behind a message broker, along with other strategies for coping with spikes in traffic. After discussing the value that an existing system holds, we analyzed ways to quantify that value, along with establishing a requirement for a minimum viable product.

By leveraging a cloud message broker offering, we’ve managed to not only alleviate the issues with the spikes in traffic but also ensure that for most of the year, we’re not paying for infrastructure that we simply don’t need.

We’ve covered some of the peripheral benefits of using a message broker for intersystem communication and the benefits and caveats behind using a cloud provider, along with some considerations when choosing one.

Finally, we’ve spoken about how you can insulate your system design against external (and internal) change to make the system more maintainable and extensible.
