Chapter 12: Cross-Cutting Concerns

Throughout the previous chapters, we have explored many different aspects of Java application development. Starting from the beginning of the development life cycle (including requirements collection and architecture design), we've focused on many different technological aspects, including frameworks and middleware.

At this point, several cross-cutting concerns need to be examined, regardless of the kind of application we are building and the architectural style we choose. In this chapter, we are going to look at a few of these aspects, as follows:

  • Identity management
  • Security
  • Resiliency

The cross-cutting concerns discussed in this chapter provide some very useful information about topics that are crucial for a project's success. Indeed, implementing identity management, security, and resiliency in the right way can be beneficial to the success of our application, both from an architectural point of view (by providing elegant, scalable, and reusable solutions) and a functional point of view (by avoiding reinventing the wheel and approaching these issues in a standardized way).

With that said, let's get started with a classic issue in application development: identity management.

Identity management

Identity management is a broad concept that deals with many different aspects and involves interaction with many different systems.

This concept is indeed related to identifying a user (that is, who is asking for a particular resource or functionality) and checking the associated permissions (whether they are allowed to perform a given action or not). So, it's easy to see how this is a core concept, common to many applications and to many components inside an application. If we have different functionalities provided by different components (as in a microservices application), then obviously each of them will need to perform the same kind of checks, to be sure about the user's identity and act accordingly.

However, having an ad hoc identity management infrastructure for each application can be considered an antipattern, especially in a complex enterprise environment, since each application (or component) has the same goal of identifying the user and their permissions.

For this reason, a common approach is to define a company-wide identity management strategy and adopt it in all of the applications, including off-premises deployments and microservices architectures.

Now, to come back to the fundamentals, identity management is basically about two main concepts:

  • Authentication: This is a way of ensuring, with the maximum possible degree of certainty, that the person asking for access to a resource (or to perform an action) is the person that they claim to be. Here is a diagram of the username and password authentication method:
Figure 12.1 – Authentication

  • Authorization: This is a way of declaring who can access each resource and perform a specific action, as shown in the following diagram. This may involve authenticated and non-authenticated entities (sometimes referred to as anonymous access).
Figure 12.2 – Authorization

Both authentication and authorization include two main scenarios:

  • Machine to machine: This is when the entity requesting access is an application, for example, in batch calculations or other processes that do not directly involve the interaction of a human user. This is also called server to server.
  • Interactive use: This is the other scenario, with a human operator interacting directly with the resource, hence requesting authentication and authorization.

Now that we have grasped some basic concepts, let's learn a bit more about authentication and authorization.

Authentication

As stated, authentication is about verifying that the entity performing a request (be it a human or a machine) is who they claim to be. There are many different ways to perform this verification. The main differentiator is what the user presents (and what needs to be checked), which falls into one of the following three categories:

  • Something that the user knows: This refers to secrets, such as passwords, pins, or similar things, like the sequence to unlock a mobile phone.
  • Something that the user has: This refers to physical devices (such as badges or hardware tokens) or software artifacts (such as certificates and software tokens).
  • Something that the user is: In this case, authentication is linked to biometric factors (such as a fingerprint or face identification), or similar things like a signature.

There are several things to consider here, as follows:

  • The first is that a piece of public information, such as a username, may be associated with the authentication factor. In this case, multiple users can share the same factor (such as a password or a badge) and we can tell them apart by using the username. The unintentional occurrence of this pattern (such as two users choosing the same password by accident) may be harmless, whereas intentional implementations (multiple users using the same badge) can be a security issue.
  • You also have to consider that combining more than one authentication factor is considered a best practice and is encouraged for stronger security implementations. This is called multi-factor authentication (MFA). Moreover, in some specific environments (such as banking), this may be mandated by specific regulations. Strong authentication is one such requirement: it refers to an authentication process leveraging at least two different factors belonging to different groups (for example, something that the user knows plus something that the user has).
  • Some authentication factors may be subject to policies. The most common examples are password rules (length, complexity) or expiration policies (forcing a user to change a factor after a certain time where possible).

Of course, an immediate concern that comes to mind is how and where to store the information relevant for implementing authentication – in other words, where to save our usernames and passwords (and/or the other kinds of secrets used for authentication).

The most common technology used for this goal is LDAP, which is short for Lightweight Directory Access Protocol. LDAP is a protocol for accessing and maintaining directory information, such as user data. An LDAP server can be seen as a standard way to store information about users, including things such as usernames, emails, phone numbers, and, of course, passwords. Being quite an old standard, around since the 1990s, it's widely adopted and compatible with a lot of other technologies.

Without going into too much detail, we can look at it as just another datastore, which we can connect to using a connection URL. Then, we can query the datastore by passing specific attributes to search for specific entries.

The authentication operation against an LDAP server is called Bind. An LDAP server can typically store passwords hashed or encrypted in various ways. One very famous implementation of an LDAP server (technically, an extension of it, providing more services than just the standard) is Microsoft Active Directory.
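
To make the Bind operation concrete, here is a minimal sketch of LDAP authentication in Java using the standard JNDI API. The server URL and the user's distinguished name are hypothetical placeholders; a real deployment would read them from configuration:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class LdapBindExample {

    // Returns true if the LDAP server accepts the credentials (a successful Bind).
    public static boolean authenticate(String userDn, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.com:389"); // hypothetical host
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, userDn); // e.g. "uid=jdoe,ou=people,dc=example,dc=com"
        env.put(Context.SECURITY_CREDENTIALS, password);
        try {
            DirContext ctx = new InitialDirContext(env); // the Bind happens here
            ctx.close();
            return true;
        } catch (NamingException e) {
            return false; // Bind rejected (wrong credentials) or connectivity problem
        }
    }
}
```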

LDAP is not the only way to store user information (including passwords) but is likely the only widely adopted standard. Indeed, it is common to store user information in relational databases, but this is almost exclusively done in a custom way, meaning that there is no standard naming or format for the tables and columns storing usernames, passwords, and so on.

One other way to store user information is to use files, but this approach is neither scalable nor efficient. It works mostly for a small set of users or for testing purposes. A common file format used to store user information is .htpasswd, which is simply a flat file storing a username and password per line, in a format originally used by the Apache httpd server for authentication purposes.
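
An .htpasswd file is just one username:hashed-password pair per line. The following illustrative snippet uses made-up hash values (Apache supports several hash formats, including MD5-based $apr1$ entries and {SHA} entries):

```
alice:$apr1$0WqT3bQx$IllustrativeMd5BasedHashValue1
bob:{SHA}IllustrativeBase64ShaHashValue=
```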

It is a commonly accepted best practice to store passwords in an encrypted form whenever possible. This is a crucial point. Whatever the user store technology (such as LDAP or a database), it is crucial that the passwords are not stored in cleartext. The reason is simple and quite obvious: if our server gets compromised in some way, the attacker should not be able to access the stored passwords.

I have used the word encryption generically. One solution, indeed, is to encrypt the passwords with a symmetrical algorithm, such as AES. Symmetrical encryption means that, by using a specific secret key, I can make the password unreadable; then, I can decrypt the password again using the same key.

This approach is useful, but we will still need to store the key securely since an attacker with the encrypted password and the key can access the original password as cleartext. Hence, a more secure way is to store the hashed password.

By hashing a password, you transform it into a fixed-length string by way of a one-way function. The great thing, compared to the previous approach, is that hashing is designed not to be reversible: there is no way (if we are using a proper algorithm) to recover the original password from the hashed string in a reasonable amount of time. In this way, we can store the hashed passwords without requiring any key. To validate the password provided by a client, we simply apply the same hashing algorithm used for saving it initially and compare the results. Even if an attacker gains access to our user information store, the stolen password hashes will be more or less useless.

Important Note

It's certainly better to hash a password than to store it in cleartext, but even hashed passwords are not 100% secure. Indeed, even if it is computationally infeasible to reconstruct the original password from a hashed value, some techniques attempt to do so. In particular, it is possible to run a brute-force attack, which basically tries a lot of passwords (from a dictionary, or simply random strings), hashes them, and compares the output with a known actual hash. A more efficient alternative is to use rainbow tables, which are basically tables of passwords and their pre-computed hashes. Defenses against these kinds of techniques are possible, however, by using longer and more complex passwords and by salting, which is a way to add some more randomness to hashed passwords.
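
Putting hashing and salting together, here is a minimal sketch in Java using the standard javax.crypto PBKDF2 implementation. The iteration count and key length shown are illustrative choices, not prescriptions:

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordHasher {

    private static final int ITERATIONS = 210_000; // cost factor: slows down brute-force attempts
    private static final int KEY_LENGTH = 256;     // bits

    public static String hash(char[] password) throws GeneralSecurityException {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt); // a random per-password salt defeats rainbow tables
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_LENGTH);
        byte[] hash = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        // Store salt and hash together: both are needed to verify a login attempt later.
        return Base64.getEncoder().encodeToString(salt) + ":"
                + Base64.getEncoder().encodeToString(hash);
    }
}
```

Verification simply re-runs the same derivation with the stored salt and compares the results.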

Authorization

User authorization is complementary to authentication. Once we are sure that a user is who they claim to be (using authentication), we have to understand what they are allowed to do, meaning which resources they can access and which operations they are permitted to perform.

The most basic form of authorization is no authorization. In simple systems, you can allow an authenticated user to do everything.

A better approach, in real-world applications, is to grant granular permissions, differentiated for different kinds of users. This is basically the concept of roles.

A role can be considered a link between a set of users and a set of permissions. It is usually mapped to a job function or a department and is defined by a list of permissions, in terms of resources that can be accessed and functionalities that can be used. Each user can be associated with a role, and with this, they inherit the permissions associated with that role.

This kind of authorization methodology is called Role-Based Access Control (RBAC). Based on the kind of RBAC implementation, each user can be assigned to more than one role, with different kinds of compositions. Normally, policies are additive, meaning that a user belonging to more than one role gets all the permissions from all of them. However, this may vary between implementations, especially if the permissions conflict, up to the point that some implementations deny the possibility of having more than one role associated with each user.

Another aspect of RBAC implementations concerns role inheritance. Some RBAC implementations employ the concept of a hierarchy of roles, meaning that a role can inherit the set of permissions associated with its parent role. This allows for a modular system. In the Java Enterprise world, JAAS (short for Java Authentication and Authorization Service) is the implementation standard for authentication and authorization. It can be regarded as a reference implementation of an RBAC-based security system.
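
To illustrate the core of the RBAC model (not the JAAS API itself, just the underlying idea), here is a minimal sketch in plain Java with a hardcoded role-to-permission mapping and an additive policy:

```java
import java.util.Map;
import java.util.Set;

public class RbacCheck {

    // Role -> permissions; in a real system this mapping would come from the IAM or a database.
    private static final Map<String, Set<String>> ROLE_PERMISSIONS = Map.of(
            "clerk",   Set.of("order:read"),
            "manager", Set.of("order:read", "order:approve"));

    // Additive policy: the user gets the union of the permissions of all their roles.
    public static boolean isAllowed(Set<String> userRoles, String permission) {
        return userRoles.stream()
                .map(role -> ROLE_PERMISSIONS.getOrDefault(role, Set.of()))
                .anyMatch(perms -> perms.contains(permission));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(Set.of("clerk"), "order:approve"));            // false
        System.out.println(isAllowed(Set.of("clerk", "manager"), "order:approve")); // true
    }
}
```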

An alternative to RBAC is Policy-Based Access Control (PBAC). In this approach, the permission is calculated against a set of attributes, using Boolean logic, in the form of an if-then statement, where more than one attribute can be combined with AND, OR, and other logic operators. The attributes can be simply related to the user (such as checking whether a user belongs to a particular group) or to other conditions (such as the time of day, the source IP, and the geographical location).

SELinux, which is a security module underlying some Linux OS variants (including Android) is a common implementation of PBAC.
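
PBAC policies are, at their core, Boolean predicates over request attributes. The following sketch (plain Java, with made-up attribute names and thresholds) shows how such policies compose with AND and OR:

```java
import java.time.LocalTime;
import java.util.function.Predicate;

public class PbacCheck {

    // The attributes a policy can reason about (names are illustrative).
    record Request(String group, String sourceIp, LocalTime time) {}

    public static void main(String[] args) {
        Predicate<Request> isFinanceGroup = r -> r.group().equals("finance");
        Predicate<Request> isOfficeHours  = r -> !r.time().isBefore(LocalTime.of(9, 0))
                && r.time().isBefore(LocalTime.of(18, 0));
        Predicate<Request> isInternalIp   = r -> r.sourceIp().startsWith("10.");

        // "If the user is in finance AND (it is office hours OR the request is internal), allow."
        Predicate<Request> policy = isFinanceGroup.and(isOfficeHours.or(isInternalIp));

        Request afterHoursInternal = new Request("finance", "10.0.0.7", LocalTime.of(20, 30));
        System.out.println(policy.test(afterHoursInternal)); // true: the internal IP satisfies the OR
    }
}
```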

Identity and Access Management

Identity and Access Management (IAM) is a term usually associated with systems that provide authentication, authorization, and other identity security services to client applications. The function of an IAM system is to implement such features in a unified way, so each application can directly use it and benefit from an adequate level of security. Other than what we have seen here in terms of authentication and authorization, an IAM system also provides the following:

  • Decoupling the user store: This means that usernames, passwords, and other information can be stored in the technology of choice (such as LDAP or a database), and the client application does not need to know the implementation details. An IAM can also usually unify multiple storage systems into a single view. And of course, if the user storage system needs to change (such as being moved from LDAP to a database), or we have to add a new one, we don't need to make any changes to the client applications.
  • Federating other authentication systems (such as more IAM systems): This can be particularly useful in shared systems where access is required from more than one organization. Most of us have experienced something like this when accessing a service through a third-party login using Google or Facebook.
  • Single sign-on (SSO): This means that we only need to log in (and log out) once, and then we can directly access the set of applications configured in the IAM.

There are many different ways (and standards) to implement such features, depending on each specific IAM product used. Such standards often boil down to some key concepts:

  • Provisioning and connecting each application managed by the IAM: This usually means configuring each application to point to the IAM. In the Java world, a common way to achieve this is to configure a servlet filter to intercept all requests. Other alternatives are agent software or reverse proxies that implement the same functionality of intercepting all the requests coming to our application.
  • Checking each request coming to each application: In case a request needs to be authenticated (because it is trying to access a protected resource or perform a limited action), check whether the client is already authenticated. If not, redirect to an authentication system (such as a login form); a minimal sketch of this interception logic follows this list.
  • Identifying the user: Once the client provides a valid authentication credential (such as a username and password), it must be provided with a unique identifier, which is regarded as the ID card of the user, used to recognize it across different requests (and potentially log in to other applications in an SSO scenario). To do so, the client is often provided with a session token, which may then be stored by the client application (as in a cookie) and usually has a limited lifespan.
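
Here is that sketch: a servlet filter intercepting every request and checking for a session token. The session attribute name (iamToken) and the /public and /login paths are hypothetical; the exact integration (and whether you import javax.servlet or the newer jakarta.servlet) depends on the IAM product and container in use:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

@WebFilter("/*") // intercept every request, as an IAM integration point typically does
public class AuthFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        // Look for the session token previously issued by the IAM.
        HttpSession session = request.getSession(false);
        boolean authenticated = session != null && session.getAttribute("iamToken") != null;

        if (authenticated || request.getRequestURI().startsWith("/public")) {
            chain.doFilter(req, res); // let the request through
        } else {
            response.sendRedirect("/login"); // hand over to the authentication system
        }
    }
}
```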

A standard way to implement this kind of scenario is the OAuth protocol.

However, IAM is not the only security aspect that we need to take care of in a cloud-native architecture. Indeed, the topic of security in an application (especially in a cloud-native one) includes many more considerations. We are going to discuss some of them in the next section.

Security

Security is a very complex aspect, as well as a foundational and crucial one. Unless security is your main focus (which is unlikely if you are in charge of defining the whole architecture of a cloud-native application), chances are that you will have some experts to work with. Nevertheless, it's important to take care of some simple security implications right from the outset of software implementation (including requirement collection, design, and development), to avoid going through a security check after you have completed architecture and development, only to realize that you have to make a lot of changes to implement security (thereby incurring costs and delays).

This approach is often referred to as shift-left security, and it's a common practice in DevOps teams.

Intrinsic software security

The first aspect to take care of is intrinsic software security. Indeed, software code can be subject to security vulnerabilities, often due to bugs or poor software testing.

The main scenario is software behaving unexpectedly as a result of a malformed or maliciously crafted input. Some common security issues of this kind are the following:

  • SQL injection: A malicious parameter is passed to the application and is concatenated into a SQL string. The application then executes a SQL statement that is different from the intended one, which can give the attacker access to unauthorized data (or even allow them to damage existing data); see the sketch after this list.
  • Unsafe memory handling: A purposely malformed parameter is passed to the application and is copied to a special portion of memory, which the server interprets as executable code. Hence, unauthorized instructions can be executed. A well-known instance of this kind of bug is the buffer overflow.
  • Cross-site scripting: This is a specific security issue in web applications where an attacker can inject client-side code (such as JavaScript) that is then executed in other users' browsers, allowing the attacker to steal data or perform unauthorized operations.
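
To make the SQL injection case concrete, here is a minimal JDBC sketch contrasting a vulnerable query built by string concatenation with the parameterized version (the table and column names are illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UserDao {

    // Vulnerable: the parameter is concatenated into the SQL string.
    // A value such as  x' OR '1'='1  turns the query into one returning every row.
    ResultSet findUserUnsafe(Connection conn, String name) throws SQLException {
        return conn.createStatement()
                .executeQuery("SELECT * FROM users WHERE name = '" + name + "'");
    }

    // Safe: the value travels as a bound parameter and is never parsed as SQL text.
    ResultSet findUser(Connection conn, String name) throws SQLException {
        PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE name = ?");
        ps.setString(1, name);
        return ps.executeQuery();
    }
}
```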

There are several techniques for avoiding or mitigating these issues:

  • Input sanitizing: Every input should be checked for special characters and anything unnecessary. Checking the format and the length is also important.
  • Running as a user with limited permissions on the local machine (the fewer permissions, the better): If there's an unexpected security exception, the impact may be limited.
  • Sandboxing: In this case, the application will run within a limited and constrained environment. It is kind of an extension of the previous approach. There are various techniques for doing this, depending on the specific application technology. The JVM itself is kind of a sandbox. Containers are another way to implement sandboxing.

The preceding topics are a quick list of common issues (and advice to mitigate them) with regard to software development. However, these approaches, while crucial, are not exhaustive, and it's important to take a look at the overall security of our applications and systems, which will involve some other considerations.

Overall application security

Good overall security starts with the way we write our application but doesn't end there. There are several other security techniques that may involve different IT departments, such as network administrators. Let's look at some of them here:

  • Network firewalls: They are an integral piece of the enterprise security strategy and are very often completely transparent to developers and architects (at least until you find that some of the connections you want to make are failing due to a missing network rule). The primary duty of firewalls is to block all the network connections unless they are explicitly allowed. This includes rules on ports, protocols, IP addresses, and so on.

Nowadays, however, firewalls are way more sophisticated than they used to be. They are now capable of inspecting the application-level protocols and are often not only deployed at the forefront of the infrastructure but also between each component, to monitor and limit unauthorized accesses.

For the same reason, some orchestrator tools (such as Kubernetes, but also the public cloud providers) offer the possibility to implement the so-called network policies, which are essentially Access Control Lists (ACLs) acting as a network firewall, hence not allowing (or dropping) unwanted network connections. Firewalls can be hardware appliances (with major vendors including Cisco and Check Point, among others), or even software distributions (such as PFSense and Zeroshell).

  • Intrusion Prevention Systems (IPSes) (similar to Intrusion Detection Systems, with a slight difference in the implementation): These are an extension to firewalls. An IPS, like a firewall, is capable of inspecting network connections. But instead of just identifying authorized and unauthorized routes, an IPS is also capable of inspecting the packets to identify signatures (recurrent patterns) of well-known attacks (such as SQL injections or similar behaviors).

Moreover, an IPS can inspect other aspects of an application beyond just its network connections. Typically, an IPS can access application logs or even inspect the application behavior at runtime, with the same goal of identifying and blocking malevolent behavior. In this context, IPSes are similar to antivirus software running on workstations. Two common IPS implementations are Snort and Suricata.

  • Source code inspection: This is focused on analyzing the code for well-known bugs. While this is a general-purpose technique, it can be focused on security issues. In most cases, this kind of analysis is integrated into the software delivery cycle as a standard step for each release. This kind of test is also known as static software analysis because it refers to inspecting the software when it is not being executed (hence, looking at the source code).

A technique similar to the previous point is checking the versions of dependencies in an application. This may refer to libraries, such as Maven dependencies. Such modules are checked against databases for known vulnerabilities linked to the specific version. This is part of following the general recommendation of keeping the software constantly patched and upgraded.

All of the aspects seen so far are relevant best practices that can be partially or completely adopted in your project. However, there are contexts where security checks and considerations must be applied in a standardized and well-defined way, which we will see next.

Security standards and regulations

Security is a core concept in applications, especially in some specific industries, such as financial services, defense, healthcare, and the public sector. But it's really a cross-concept that cannot be ignored in any context. For this reason, there are sets of regulations, sometimes mandated by law or industry standards (for example, banking associations), that mandate and standardize some security practices. These include the following:

  • Payment Card Industry Data Security Standard (PCI DSS): This is a very widespread standard for implementing and maintaining IT systems that provide credit card payments. The goal is to reduce fraud and establish the maximum level of trust and safety for credit card users. PCI DSS mandates a set of rules not only on the system itself (such as access control and network security) but also in the way IT staff should handle such systems (by defining roles and responsibilities).
  • Common Criteria (CC): This is an international standard (under the denomination ISO/IEC 15408) that certifies a set of tests for checking the security of an IT system. Such certification is conducted by authorized entities, and the certified systems are registered on an official list.
  • Open Web Application Security Project (OWASP): This approach is a bit different from what we have seen so far. Instead of being a centralized testing institution providing a certification, OWASP is an open and distributed initiative that provides a set of tools, tests, and best practices for application security (especially focused on web application security). OWASP also shares and maintains a list of well-known security issues. The association distributes the Dependency-Check tool (https://owasp.org/www-project-dependency-check), which helps in identifying vulnerable software dependencies, and the Dependency-Track tool, which monitors and checks dependency usage.

As we explained, security is a crucial topic that must be considered important in all project phases (from design to implementation to testing) and across all different teams (from developers to testers to sysadmins). This is the reason why we decided to consider it a cross-cutting concern (and why we discussed it in this chapter). To establish and maintain security in our applications, best practices must be taken into account at every step of a development project, including coding. But to maintain a safe system, we should also consider other potential sources of disruption and data loss, and ways to avoid or mitigate them, which we will look at in the next section.

Resiliency

Security is about preventing fraudulent activities, the theft of data, and other improper behavior that could lead to service disruptions. However, our application can go down or provide degraded service for several other reasons. This could be due to a traffic spike causing an overload, a software bug, or a hardware failure.

The core concept (sometimes underestimated) behind the resiliency of a system is the Service Level Agreement (SLA).

An SLA is an attempt to quantify (and usually enforce with a contract) some core metrics that our service should respect.

Uptime

The most widely used SLA is uptime, measuring the availability of the system. It is a basic metric, and it's commonly very meaningful for services providing essential components, such as connectivity or access to storage. However, if we consider more complex systems (such as an entire application, or a set of different applications, as in microservices architectures), it becomes more complex to define. Indeed, our application may still be available, but responding with the wrong content, or simply showing static pages (such as a so-called courtesy page, explaining that the system is currently unavailable).

So, the uptime should be defined carefully in complex systems, by restricting it to specific features and defining their expected behaviors (such as the data that these features should provide).

Uptime is usually measured as a percentage over a defined period, such as 99.9% per year.
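
To give a sense of scale, a 99.9% yearly uptime SLA allows for roughly 8.8 hours of downtime per year (0.1% of 8,760 hours), while 99.99% shrinks that budget to about 53 minutes.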

When considering the uptime, it's useful to define the two possible types of outages:

  • Planned downtime: This refers to service disruption occurring due to maintenance or other predictable operations, such as deployments. To reduce planned downtime, one technique is to reduce the number of releases. However, this kind of technique may be impractical for modern systems because it will reduce agility and increase time to market. So, an alternative approach is to implement rolling releases or similar techniques to continue to provide services (possibly in a degraded mode) while performing releases or other maintenance activities.
  • Unplanned downtime: This is, of course, linked to unpredictable events, such as system crashes or hardware failures. As we will see in this section, there are several techniques available for increasing uptime, especially in cloud-native architectures.

With regard to unplanned downtime, there are several further metrics (I would say sub-metrics) that measure certain specific aspects that are useful for further monitoring of the service levels of a system:

  • Mean time between failures: This measures the average time between two service outages (as said before, an outage can be defined in many ways, ranging from being completely down to services answering incorrectly). A system with a short mean time between failures, even if it still respects the overall uptime SLA, should be considered unstable and should probably be fixed or strengthened.
  • Mean time to recovery: This measures the average time to restore a system to operation following a period of downtime. This includes any kind of workaround or manual fix to resolve an issue. These kinds of fixes are considered temporary. A system with a high mean time to recovery might need some supporting tools or better training for the team operating it.
  • Mean time to repair: This is similar to the previous metric but measures the time needed to resolve an issue completely and definitively. The difference between this metric and the previous one is subtle and subjective. A high mean time to repair can signify a poorly designed system or the lack of good troubleshooting tools.
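
These metrics are related to uptime: over a long observation window, availability is commonly approximated as MTBF / (MTBF + MTTR), which makes explicit that service levels can be improved either by failing less often (raising the mean time between failures) or by recovering faster (lowering the mean time to recovery).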

Uptime is not the only SLA to consider when monitoring a system with regard to resiliency. Several other metrics can be measured, such as the response time of an API (which can be expressed as something like 90% of the calls should respond in under 1 millisecond), the error rate (the percentage of failed calls per day), or other related metrics.

But as we said, there are several techniques to achieve these SLAs and increase system reliability.

Increasing system resiliency

The most commonly used (sometimes overused) technique for increasing system resiliency is clustering. A cluster is a set of components working concurrently in a mirrored way. In a cluster, there is a way to constantly share the system status between two or more instances. In this way, we can keep the services running in case downtime (planned or unplanned) occurs in one of the systems belonging to the cluster itself.

Moreover, a cluster may involve a redundant implementation of every subsystem (including network, storage, and so on) to further improve resiliency in the event of the failure of supporting infrastructure.

Clusters are usually complex to set up and there is a price to pay for the increase in reliability, usually a performance impact due to the replication of the state. Moreover, depending on the specific application, there are several restrictions for implementing a cluster, such as network latency and the number of servers (nodes).

We discussed a related topic in Chapter 11, Dealing with Data, when talking about NoSQL repositories and the CAP theorem. Since the data inside a cluster can be considered distributed storage, it must obey the CAP theorem.

A cluster often relies on a networking device, such as a network load balancer, that points to every node of the cluster and re-establishes full system operation in case of a failure, by redirecting all the requests to a node that is still alive.

To communicate with each other, the cluster nodes use specific protocols, commonly known as a heartbeat protocol, which usually involves the exchange of special messages over the network or filesystem. A widely used library for implementing heartbeat and leader election in Java is JGroups.

One common issue with clusters is the split-brain problem. In a split-brain situation, the cluster is divided into two or more subsets, which are unable to reach each other via the heartbeat. This usually occurs because of a network interruption between the two system subsets, caused by a physical disconnection, a misconfiguration, or something else. In this situation, one part of the cluster is unaware if the other part is still up and running (but cannot be seen using the heartbeat) or is down. To maintain data consistency (as seen in the CAP theorem in Chapter 11, Dealing with Data) in the case of split-brain, the cluster may decide to stop operating (or at least stop writing functionalities) to avoid processing conflicting operations in two parts of the cluster that are not communicating with each other.

To address these scenarios, clusters may invoke the concept of a quorum. A quorum is the number of nodes in a cluster required for the cluster to operate properly. A quorum is commonly fixed to half of the cluster nodes + 1.

While the details may vary with the specific cluster implementation, a quorum is usually necessary to elect a cluster leader. The leader may be the only member of the cluster serving requests or, more commonly, may have additional duties related to cluster coordination (such as declaring the cluster fully functional or not). To properly handle split-brain situations, a cluster is usually composed of an odd number of nodes, so if there's a split between two subsets, one of the two will hold the majority and continue to operate, while the other will be in the minority and will shut down (or at least deny write operations).
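
A minimal sketch of the quorum arithmetic in Java (the node counts in main are just an example):

```java
public class QuorumCheck {

    // A quorum is commonly fixed at half of the cluster nodes plus one.
    static int quorum(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // After a split, a partition may keep operating only if it still holds a quorum.
    static boolean canOperate(int reachableNodes, int clusterSize) {
        return reachableNodes >= quorum(clusterSize);
    }

    public static void main(String[] args) {
        // A 5-node cluster split 3/2: only the 3-node side keeps serving (writes included).
        System.out.println(canOperate(3, 5)); // true  (quorum is 3)
        System.out.println(canOperate(2, 5)); // false (the minority side must step down)
    }
}
```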

An alternative to the use of a quorum is the technique of witnesses. A cluster may be implemented with an even number of nodes, and then have a special node (usually located in the cloud or at a remote site) that acts as a witness.

If there's a cluster split, the witness node can reach every subset of the cluster and decide which one should continue to operate.

As we have said, clustering can be expensive and has lots of requirements. Moreover, in modern architectures (such as microservices), there are alternative approaches for operating in the case of a failure in distributed setups. One common consideration concerns eventual consistency, discussed in the previous chapter, under the Exploring NoSQL repositories section.

For all these reasons, there are other approaches to improving system availability, which can be used as an alternative or a complement to clustering.

Further techniques for improving reliability

An alternative approach to clustering used to improve reliability is High Availability (HA). An HA system is similar to a clustered system. The main difference is that in normal conditions, not all nodes are serving requests. Conversely, in this kind of setup, there is usually one (or a limited number of) primary nodes running and serving requests, and one or more failover nodes, which are ready to take over in the case of a failure of the primary node.

The time for restoring the system may vary depending on the implementation of the systems and the amount of data to restore. The system taking over may already be up and running (and more or less aligned with the primary); in this scenario, it's called a Hot Standby. The alternative scenario is when the failover servers are normally shut down and hold no recent data; in this case, the system is called a Cold Standby and may take some time to become fully operational.

An extreme case of Cold Standby is Disaster Recovery (DR). This kind of system, often mandated by law, is located in a remote geographical location and aligned periodically. How remote it should be and how often it is aligned are parameters that will vary depending on how critical the system is and how much budget is available. A DR system, as the name implies, is useful when recovering from the complete disruption of a data center (due to things such as a fire, an earthquake, or flooding). Those events, even if unlikely, are crucial to consider because being unprepared means losing a lot of money or being unable to re-establish the system at all.

DR is also linked to the concept of Backup and Restore. Constantly backing up data (and configurations) is crucial to re-establishing system operation in the case of a disaster or unforeseen data loss (think about a human error or a bug). Backed-up data should also be periodically tested for restore to check the completeness of data, especially if (as is advised) the data is encrypted.

Whether you are planning to use clustering, HA, or DR, two special metrics are commonly used to measure the effectiveness and the goals of this kind of configuration:

  • Recovery time objective (RTO): This is the time needed for a failover node to take over after the primary node fails. This time can be 0 (or very limited) in the case of clustering, as every node is already up and running, or can be very high in the case of DR (which may be acceptable as the occurrence of a disaster is usually very unlikely).
  • Recovery point objective (RPO): This is the amount of data that it is acceptable to lose. This may be measured in terms of time (such as the number of minutes, hours, or days since the last sync), the number of records, or something similar. An RPO can be 0 (or very limited) in a clustered system, while it can be reasonably high (such as 24 hours) in the case of DR.

A last important topic is the physical location of the application. Indeed, all of the approaches that we have seen so far (clustering, HA, and DR, with all the related metrics and measurements, such as RPO and RTO) can be implemented in various physical setups, greatly varying the final effect (and the implementation costs).

One core concept is the data center. Indeed, each node (or portion) of a cluster (or of an HA or DR setup) can be running on a physically different data center in a specific geographical location.

Running servers in different data centers usually provides the maximum level of resiliency, especially if the data centers are far away from each other. On the other hand, the connection between applications running in different data centers can be expensive and subject to high latency. Cloud providers often call the different data centers availability zones, further grouping them by geographical area, to provide information about the distance between them and the users.

However, even if an application is running in just one data center, there are techniques to improve resilience to failures and disasters. A good option is running copies of the application in different rooms of the data center. The rooms of a data center, even if belonging to the same building, can be largely independent of each other, and data centers may apply specific techniques to enforce such independence (such as dedicated power lines, different networking equipment, and specific air conditioning systems). Of course, major disasters such as earthquakes, floods, and fires will be disruptive for all the rooms in the same way. However, hosting in separate rooms is usually cheaper than in separate data centers, and rooms have quite good connectivity with each other.

A lower degree of isolation can be obtained by running different copies of our application (different nodes of a cluster) on different racks. A rack is a grouping of servers, often all running in the same room (or close to each other, at least). In this sense, two applications running on different racks may be unaffected by minor issues impacting just one rack, such as a local network hardware failure or power adapter disruption, as these physical devices are commonly specific to each rack.

Of course, a blackout or a defect in the air conditioning system will almost certainly impact all the instances of our cluster, even if running on different racks. For all of these reasons, different racks are cheaper than the other implementations seen so far, but can be pretty poor in offering resilience to major disasters, and are completely unsuitable for implementing proper DR.

One step further down from different racks is running our application in the same rack but on different machines. Here, the only redundancy available is against local hardware failures, such as a disk, memory, or CPU malfunctioning. Every other physical issue, including minor ones (such as a power adapter failing), will almost certainly impact the cluster availability.

Last but not least, it is possible to have the instances of a cluster running on the same physical machine thanks to containers or server virtualization.

Needless to say, this technique, while very easy and cheap to implement, will not provide any protection against hardware failures. The only available reliability improvement is against software crashes, as the different nodes will be isolated to some degree.

All of the approaches that we have seen so far offer ways to improve the overall application availability and were available long before cloud-native applications. However, some modern evolutions of such techniques (such as the saga pattern, seen in Chapter 9, Designing Cloud-Native Architectures) happen to better suit modern applications (such as microservices-based ones).

A final topic worth highlighting is application-level reliability. In the past, reliability was treated exclusively at the infrastructure level, with highly available hardware and redundant network connections. However, nowadays, it is more common to design application architectures that are aware of where they run, or of how many instances are running concurrently. In other words, applications that take reliability and high availability into consideration have become common. In this way, it is possible to implement mixed approaches that provide degraded functionalities if a failure is detected in other nodes of the cluster. So, our application will still be partially available (for example, in read-only mode), thereby reducing the total outage.

Another technique is to apply tiering to functionalities (for example, to different microservices). To do so, it's possible to label each specific functionality according to the severity and the SLA needed. Hence, some functionalities can be deployed on highly resilient, expensive, and geographically distributed systems, while other functionalities can be considered disposable (or less important) and then deployed on cheaper infrastructure, taking into account that they will be impacted by outages in some situations (but this is accepted, as the functionalities provided are not considered core).

All of these last considerations are to say that even if you will never have the job of completely designing the availability options of a full data center (or of more than one data center) in your role as a software architect, you will still benefit from knowing the basics of application availability so that you can design applications properly (especially the cloud-native, microservices-based ones), thereby greatly improving the overall availability of the system.

With this section, we have completed our overview of cross-cutting concerns in software architectures.

Summary

In this chapter, we have seen an overview of the different cross-cutting concerns that affect software architecture. This also included some solutions and supporting systems and tools.

We have learned the different ways of managing identity inside our application (especially when it involves several different components, such as in a microservice architecture).

We had an overview of the security considerations to be made when designing and implementing an application (such as intrinsic software security and overall application security), which are crucial in a shift-left approach, which is the way security is managed in DevOps scenarios.

Last but not least, we had a complete overview of application resiliency, discussing what a cluster is, what the implications of using clustering are, and what other alternatives (such as HA and DR) can be implemented.

In the next chapter, we are going to explore the tooling supporting the software life cycle, with a particular focus on continuous integration and continuous delivery.

Further reading

  • Neil Daswani, Christoph Kern, and Anita Kesavan: Foundations of Security: What Every Programmer Needs to Know
  • The Keycloak community: The Keycloak OpenSource IDM (https://www.keycloak.org)
  • Evan Marcus and Hal Stern: Blueprints for High Availability: Timely, Practical, Reliable