Chapter 9. Building Secure and Usable Systems

What does it mean to build a secure system?

Building a secure system means designing and implementing a system that deals with the risks that you are taking, and leaves only acceptable risks behind.

The level of security that you apply will depend on your organization, on your industry, on your customers, and on the type of system. If you are building an online consumer marketplace, or a mobile game, you’ll have a very different set of security concerns than someone building an encryption device designed to be used in the field by Marines in the advance guard.

However, some patterns are common to most security situations.

Design to Resist Compromise

A secure system must be built to resist compromise, whether that means resisting remote SQL injection attacks, withstanding differential power analysis attacks, or not leaking information through electromagnetic emissions. The point is that you understand the operating environment and can build in protection against compromise.

You do this by changing your design assumptions.

We know from years of cryptographic research that you should assume that all user input is potentially hostile, and that attackers can repeatedly and reliably supply whatever input they want. We should also assume that attackers can reliably read all output from your system.

You can even go so far as to assume that the attackers know everything about your system and how it works.1 So where does that leave you?

Resisting compromise is about carefully checking all input data, and providing as little information on output as you can get away with. It’s about anticipating failures and making sure that you handle them safely, and detecting and recording errors and attacks. And making life as difficult as you can for an attacker, while still making sure that the system is usable and understandable.
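That combination of strict input checking, minimal output, and recorded errors can be sketched in a few lines. This is an illustrative sketch only; the field name, allowlist pattern, and `handle_lookup` function are hypothetical, not something from a particular system:

```python
import logging
import re

log = logging.getLogger("security")

# Hypothetical allowlist for a username field: strict beats clever.
USERNAME_RE = re.compile(r"[a-z0-9_]{3,32}")

def handle_lookup(raw_username):
    """Validate input strictly, record the details, reveal almost nothing."""
    if not isinstance(raw_username, str) or not USERNAME_RE.fullmatch(raw_username):
        # Record the full details internally for investigation...
        log.warning("rejected lookup input: %r", raw_username)
        # ...but return only a generic error to the caller.
        return {"error": "invalid request"}
    return {"ok": True, "username": raw_username}
```

Note that the rejected caller learns nothing about why the input failed, while the internal log keeps the evidence you need to detect and investigate an attack.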

For almost all of us, we can assume that our attackers have bounded resources and motivation. That is, if we can resist long enough, they will either run out of time and money, or get bored and move elsewhere. (If you aren’t in that camp, then you need specialist help and probably shouldn’t be reading this book!) If a nation state decides to attack you, they are probably going to succeed anyway—but you still want to make it as difficult for them as possible, and try to catch them in the act.

Security Versus Usability

People often talk about security and usability as opposite sides of the same coin: security forces us to build unusable systems, and usability experts want to remove the security features that we build in.

This is increasingly understood to be wrong. It reflects a tunnel-vision view of security as a technical mission rather than a holistic one.

Systems that are not usable because of over-engineered security controls tend to push users toward unauthorized workarounds that compromise the security of the system: from passwords written on Post-it notes, to teams sharing accounts because it’s easier, to systems left with default passwords so that administrators can sort things out when they break.

We, as security professionals, need to think holistically about the systems that we build, and ensure that the user is encouraged and able to use the service in the most secure manner possible. Take whatever steps you need to protect the system, but do this in ways that don’t get in the way of what the system is designed to do or how it is supposed to be used.

Technical Controls

When thinking about securing a system, most people think first of technical controls. These are solutions that prevent an attacker (or user) from accessing unauthorized or unintended functionality or system control functions.

Technical controls are often bolted on to the design of a system to solve security problems. They suffer from the dual problem of being an attractive vendor space and being highly limited in scope.

It’s common for vendors to get excited about how amazing their technical black box is and how much it will do to make your system safe. But ask yourself this: do you really think that all the organizations that suffered intrusions last year were not running next-generation firewalls, intrusion detection systems, advanced endpoint protection, or other magic black boxes?

Some natural skepticism is healthy when choosing technical controls and products. Although they are an important part of most security architectures, in most cases they only address low-hanging fruit (common vulnerabilities) or specific edge cases. Much as locking the door to your home won’t protect it from all thieves, technical controls and products won’t protect your system from all threats. They must be considered as part of a wider, more holistic approach.

We divide security controls in systems into a few different categories. Like all categorical systems, they overlap and aren’t perfect, but it helps us start to talk about such things.

Deterrent Controls

Deterrent controls discourage attacks by making clear to people what will happen if they attack your system. These are the technical equivalent of a signpost on your gate warning visitors to “Beware of the dog” or that “This building is under surveillance.”

In a physical sense, this can include other highly visible controls, so CCTV cameras or electrified or barbed-wire fences are just as much deterrents as they are highly effective resistive controls (see below).

In a digital system, this can include things like modifying service headers to include warnings, and warning messages or customized error handling. The visible presence of such things makes it clear that you know what you are doing, and makes it clear to attackers when they are crossing a legal line.
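As a minimal sketch of that idea, a warning header can be attached to every response by a small piece of WSGI middleware. The header name and banner text here are hypothetical, chosen only for illustration:

```python
WARNING = "Authorized use only. Activity is monitored and logged."

def with_legal_banner(app):
    """WSGI middleware that adds a visible warning header to every response."""
    def wrapped(environ, start_response):
        def start_with_banner(status, headers, exc_info=None):
            # Pass the original headers through, plus the deterrent banner.
            return start_response(status, headers + [("X-Warning", WARNING)], exc_info)
        return app(environ, start_with_banner)
    return wrapped
```

Anyone probing the service now sees, in every response, that they are being watched, which is exactly the point of a deterrent control.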

Resistive Controls

Resistive controls are designed to slow down an attacker, not stop them. Limiting the number of sign-in attempts from a single address, or rate-limiting uses of the system are resistive. A clever attacker can get around these controls, but they slow down attackers, increasing the chance that they will get caught or making them want to look elsewhere.
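A sliding-window limiter on sign-in attempts is one of the simplest resistive controls. The sketch below is illustrative; the class name and the default limits (5 attempts per 60 seconds) are assumptions, not recommendations:

```python
import time
from collections import defaultdict, deque

class SignInRateLimiter:
    """Allow at most `max_attempts` per `window` seconds from each address."""

    def __init__(self, max_attempts=5, window=60.0):
        self.max_attempts = max_attempts
        self.window = window
        self._attempts = defaultdict(deque)  # address -> timestamps of attempts

    def allow(self, address, now=None):
        now = time.monotonic() if now is None else now
        attempts = self._attempts[address]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] > self.window:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False  # slow the attacker down; a real system would also log this
        attempts.append(now)
        return True
```

A determined attacker can rotate addresses, which is why this is resistive rather than protective: it buys time and raises the attacker’s cost, nothing more.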

As these controls aim to slow down and frustrate an attacker, we can also include simple steps like code obfuscation, generic error messages, and responsive session management.

Code obfuscation has been an area of much debate as a resistive control and allows us to consider the cost of such choices. Running your code through an obfuscation system renders it difficult to read and often makes it difficult to review and understand. While this will slow down less skilled attackers, remember there is a cost to this control. Obfuscated code is also difficult to debug and can cause frustration for support teams.

Resistive controls are useful but should be applied with care. Slowing down and frustrating attackers is rarely acceptable if it means frustrating and causing confusion or friction for genuine system users.

Protective Controls

Protective controls actually prevent an attack from occurring. Most technical security controls are this kind of control, so firewalls, access control lists (ACLs), IP restrictions, and so forth are all designed to prevent misuse of the system.
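An IP restriction of the kind mentioned above can be sketched with the standard `ipaddress` module. The network ranges here are hypothetical placeholders (one of them is the reserved documentation range), not a recommended allowlist:

```python
import ipaddress

# Hypothetical allowlist: the office LAN and a VPN range (example addresses).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.0.2.0/24"),
]

def is_source_allowed(source_ip):
    """Protective control: refuse connections from outside known networks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Unlike a resistive control, this check simply blocks the request outright, which is also why it carries the usability cost discussed next.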

Some protective controls, such as making systems only available for set time periods or from a finite set of physical locations, may have a high impact on usability. While there may be strong business cases for these controls, we must always consider the edge cases in our environments.

For example, controls that prevent users from logging in from outside office hours or location can prevent staff from checking their emails when traveling to a conference.

People (including attackers) are like water when it comes to protective controls that get in their way. They will work around them and come up with pragmatic solutions to get themselves moving again.

Detective Controls

Some controls are designed not to prevent or even slow the attackers, but merely to detect an intrusion. These can vary from simple log auditing or security event tools to more advanced tools like honeypots, traffic flow graphs, and even monitoring of CPU load for abnormalities.

Detective controls are widely applied and familiar to operations and infrastructure teams alike. The core challenges with these controls are ensuring that they are tuned to detect the right things for your environment and that someone is watching the logs and ready to respond if something is detected.
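The tuning problem can be illustrated with a toy detector that compares event counts from a log against an expected baseline. Everything here, the function name, the baseline figures, and the multiplier, is an assumption for the sake of the sketch:

```python
from collections import Counter

def flag_anomalies(events, baseline, factor=3):
    """Flag event types whose count exceeds `factor` times the expected baseline.

    `events` is an iterable of event-type strings (e.g., parsed from logs);
    `baseline` maps each event type to the count considered normal for the period.
    """
    counts = Counter(events)
    return {
        kind: count
        for kind, count in counts.items()
        if count > factor * baseline.get(kind, 0)
    }
```

Set the baseline or multiplier wrong and you either drown the responders in noise or miss the intrusion entirely, which is why tuning, and someone actually watching, matter more than the tool itself.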

Compensating Controls

Finally, there are times when the control you want might not be available. So if you want to prevent staff from accessing the login system from outside the country, but a significant proportion of your staff travels for work, you might be stuck.

What you often do is apply other controls, ones that you might not otherwise require, to compensate for the lack of the control that you actually want. For example, in that situation, you can instead require a physical second-factor token at authentication, and monitor any changes of location against staff travel records to detect misuse.
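The location-monitoring half of that compensating control can be sketched as a simple consistency check. The function name, the country codes, and the shape of the travel records are all hypothetical:

```python
def location_consistent(user, login_country, travel_records, home_country="NZ"):
    """Compensating detective check: does a sign-in location match where we
    expect the user to be, given their home base and approved travel records?

    `travel_records` maps each user to the set of countries with approved travel.
    """
    if login_country == home_country:
        return True
    return login_country in travel_records.get(user, set())
```

A sign-in that fails this check would not be blocked outright (that was the control we couldn’t have); instead it would trigger an alert for investigation, compensating for the missing geographic restriction.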

We take a look at several other examples of compensating controls in Chapter 13, Operations and OpSec, where we cover controls such as WAFs, RASP, and runtime defense.

Security Architecture

So what does all this mean? How can we build a secure system?

We can use the controls, combined with some principles to build a system that resists and actively defends against attack.

We need to be aware from the beginning what the system does and who is going to attack it. Hopefully, by following the guidance in Chapter 7 on threat assessments, you will understand who is likely to attack the system, what they are after, and the most likely ways that they will attack you.

Next, you need to think about your system itself. We often talk about systems as being a single, monolithic black box. That’s an artifact of traditional systems engineering thinking. Back in the old days, building a system required buying sufficient computers to run the system, and the physical hardware might be a major proportion of the cost.

These days, with containerization, microservices, virtual machines, and cloud-based compute being cheap, and looking to get much cheaper, we can instead start to see more separation between the components of the system.

From a security perspective, this appears to be much better. Each component can be reasoned about much more easily and can be secured in its normal operation. Components are often smaller, more singular in their purpose, and able to be deployed, changed, and managed independently.

However, this creates a new set of security risks and challenges.

The traditional model of security is to think of a system as being similar to an M&M or a Smartie (depending on American or British upbringing, I guess), or in New Zealand, an armadillo (because apparently New Zealanders have more wildlife than candy, or think that wildlife is candy). All of these things have a hard outer shell, but a soft gooey interior.

Perimeterless Security

Historically, when we defined the architecture of an M&M (or armadillo) system, the system would only have a few perimeters or trust boundaries. Anything outside of a trust boundary (such as third-party systems or users) would be treated as untrusted, while all systems and entities within the trust boundary or hardened perimeter, behind the DMZ and firewalls, would be considered to be safe.

As we saw in the previous chapter on threats and attacks, understanding trust and trust boundaries is relatively simple in monolithic, on-premises systems. In this model, there are only a few entities or components inside the perimeter, and no reason to consider them to be a risk.

But too much depends on only a few perimeters. Once an attacker breaches a perimeter, everything is open to them.

As we separate and uncouple our architectures, we have to change this way of thinking. Because each entity is managed separately and uncoupled, we can no longer simply trust that other components in the system are not compromised. This also means that we cannot allow admin staff on internal networks to have unlimited access across the system (absolutely no “God mode” admins or support back doors).

Instead, systems should be built so that they do not assume that other points anywhere on the physical or virtual network are trustworthy. In other words, a low-trust network, or what is now being called a zero trust network.

Note

To learn more about how to design and operate a zero trust network, you should read up on what Google is doing with its BeyondCorp initiative.

And read Zero Trust Networks: Building Trusted Systems in Untrusted Networks (O’Reilly) by Evan Gilman and Doug Barth.

In this environment, everything on the network needs to be protected against malicious insiders or attackers who have breached your perimeter or other defenses:

  • Reassess and audit identity at every point. You must always know who you are dealing with, whether through time-sensitive tokens issued by an authentication server, cryptographically safe keys, or similar techniques to prove identity.

  • Consider using TLS for secure network communications between services, at the very least when talking to edge services, authentication services, and other critical services.

  • Revalidate and check inputs, even from core services and platform services. This means validating all headers, and every field of every request.

  • Enforce access control rules on data and on API functions at each point. Make these rules as simple as you can get away with, but ensure that they are consistently enforced, and that they follow the principle of least privilege.

  • Treat all sensitive data (any data that somebody might want to steal) as toxic.2 Always know where it is and who owns it, handle it safely, be careful with who you share it with and how you store it (if you must store it), and keep it encrypted whenever possible.

  • Log traffic at each point—not just at the perimeter firewall—so that you can identify when and where a compromise actually occurred. Log records should be forwarded to a secure, central logging service to make it possible to trace requests across services and to protect the logs if a service or node is compromised.

  • Harden all runtimes (OS, VMs, containers, databases) just as you would if these boxes were in your DMZ.

  • Use circuit breakers and bulkheads to contain runtime failures and to minimize the “blast radius” of a security breach. These are stability patterns from Michael Nygard’s book Release It! (Pragmatic Bookshelf), which explains how to design, build, and operate resilient online systems.

    Bulkheads can be built around connections, thread pools, processes, and data. Circuit breakers protect callers from downstream malfunctions, automatically detecting and recovering from timeouts and hangs when calling other services. Netflix’s Hystrix is a good example of how to implement a circuit breaker.

  • Take advantage of containers to help manage and protect services. Although containers don’t provide the same level of runtime isolation as a VM, they are much lighter weight, and they can be—and should be—packaged and set up with only the minimal set of dependencies and capabilities required for a specific service, reducing the overall attack surface of the network.

  • Be very careful with handling private keys and other secrets. Consider using a secure key management service like AWS KMS, or a general purpose secrets manager like Hashicorp’s Vault, as we explain in Chapter 13, Operations and OpSec.
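The circuit breaker pattern from the list above can be sketched in a few dozen lines. This is a minimal sketch, not production code: the thresholds are arbitrary, there is no thread safety, and a real implementation (Hystrix, for example) adds metrics and fallbacks:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures the
    circuit opens and calls fail fast; after `reset_after` seconds one trial
    call is allowed through (the "half-open" state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Failing fast protects the caller from hanging on a broken or compromised downstream service, and the open/half-open transitions are visible events worth logging for the detective controls discussed earlier.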

Assume Compromised

One of the key new mandates is to assume that all other external-facing services are compromised. That is to say that if you have a layered architecture, then you should assume that the layer above is compromised at all times.

You could treat services in the layer below as compromised; but in most cases, if the layer below you is compromised, then it’s game over. It’s incredibly difficult to defend from an attack from lower down the call stack. You can and should try to protect your service (and the services above you) from errors and runtime failures in layers below.

Any service or user that can call your service is at a higher level and is therefore dangerous. So it doesn’t matter whether the requests are coming from the internet, from a staff terminal, another service, or from the administrator of the system; they should be treated as if the thing sending the requests has been compromised.

What do we do if we think the layer above is compromised? We question the identity of the caller. We check all input carefully for anything attempting to come down to our layer. We are careful about what information we return back up: only the information needed, and nothing more. And we audit what happened at each point: what we got, when we got it, and what we did with it.

Try to be paranoid, but practical. Good defensive design and coding, being careful about what data your services need to share, and thinking about how to contain runtime failures and breaches will make your system more secure and more resilient.

Complexity and Security

Complexity is the enemy of security. As systems become bigger and more complex, they become harder to understand and harder to secure.

You can’t secure what you don’t understand.

Bruce Schneier, A Plea for Simplicity

Agile and Lean development help to reduce complexity by trying to keep the design of the system as simple as possible.

Incremental design starting with the simplest model that works; Lean’s insistence on delivering a working Minimum Viable Product (MVP) to users as quickly as possible; and YAGNI (“you aren’t gonna need it”), which reminds the team to focus only on what’s needed now when implementing a feature instead of trying to anticipate what might be needed in the future: all of these are forces against complexity and “Big Design Up Front.”

Keeping the feature set to a minimum and ensuring that each feature is as simple as possible help to reduce security risks by making the attack surface of the application small.

Complexity will still creep in over time. But iterative and continuous refactoring help ensure that the design is cleaned up and stays simple as you go forward.

However there’s an important distinction between irreducibly simple and dangerously naïve.

A clean architecture with well-defined interfaces and a minimal feature set is not the same as a simplistic and incomplete design that focuses only on implementing features quickly, without dealing with data safety and confidentiality, or providing defense against runtime failures and attacks.

There’s also an important distinction between essential complexity and accidental complexity.

Some design problems, especially in security, are hard to solve properly: cryptography and distributed identity management are good examples. This is essential complexity that you can manage by offloading the work and the risk, using proven, trusted libraries or services instead of trying to figure out how to do this on your own.

But as we’ll see in Chapter 10, Code Review for Security, there are many cases of unnecessary complexity that introduce unnecessary risks. Code that is difficult to understand and code that cannot be thoroughly tested is code that you cannot trust to be secure or safe. Systems that you cannot build repeatably and cannot deploy with confidence are not secure or safe.

Again, many of the Agile practices that we’ve looked at in this book can help to drive down unnecessary complexity and reduce risk:

  • Test-driven development and behavior-driven design

  • Shared code ownership following common code guidelines

  • Automated code scanning to catch bad coding practices and code smells

  • Pair programming and peer reviews

  • Disciplined refactoring

  • Continuous integration

These are all forces for making code simpler and safer. Automating build chains, automated deployment, and continuous delivery reduces complexity and risk in delivering and implementing changes by standardizing steps and making them testable, and by making it safer and cheaper to roll out small, incremental improvements and fixes instead of big-bang upgrades.

Breaking large systems down into smaller parts with clear separation of concerns helps to reduce complexity, at least at first. Small, single-purpose services are trivially easy to understand and test in isolation and safe to deploy on their own. However, at some point as you continue to create more small services, the total complexity in the system increases significantly and so does the security risk.3

There are no simple answers on how to deal with this kind of complexity. You will need to enforce consistent patterns and design policies and controls across teams, provide deep visibility into the system and how it works, and regularly reassess your architecture for gaps and weaknesses. And, as we’ve explained, make sure that each component is designed to work in a hostile, untrusting, and unpredictable environment.

Key Takeaways

Building a secure system requires that you change your design assumptions in the following ways:

  • Design the system to resist compromise. Assume that attackers know everything about your system and how it works. Build protection in against failures, mistakes, and attacks.

  • Technical security controls can be bolted on to the architecture of the system to help deter, resist, detect, or protect against attacks, or to compensate for weaknesses in your system. Or security controls can be threaded through the architecture and design of the system as part of your code.

  • Always assume that the system is compromised, and that you cannot rely on perimeter defenses or black boxes.

  • While security adds necessary complexity to design, unnecessary complexity is the enemy of security. Take advantage of Agile principles and practices to reduce complexity in design and code. This will make the system easier to change and safer to run.

1 This idea goes back over 100 years, and is commonly known as Shannon’s Maxim: “the enemy knows the system.”

2 See Bruce Schneier’s blog entry, “Data Is a Toxic Asset”, March 4, 2016.

3 To understand more about these issues and how to deal with them, watch Laura’s presentation on “Practical Microservice Security”, presented at NDC Sydney 2016.
