11 Emerging practices

This chapter covers

  • Using multiple AWS accounts
  • Using temporary stacks
  • Avoiding keeping sensitive data in plain text in environment variables
  • Using EventBridge in event-driven architectures

The term serverless came about after AWS released the Lambda service back in 2014. In that sense, the serverless paradigm (building applications using managed services, including for all your compute needs) is something of a new kid on the block.

New paradigms give us new ways to look at problems and solve them differently, perhaps more efficiently. This should be obvious by now, as we have discussed several serverless architectures in this book, and you must admit they look very different from the equivalent serverful architectures: they are more event-driven, and they often involve many different services working together.

New paradigms also require us to think and work differently. For example, instead of thinking about cost as a function of the size of a fleet of virtual machines and how long you need them for, we need to think about cost in terms of request count and execution duration. The code we write and the way we deploy and monitor our applications also need to change to take full advantage of this new paradigm and mitigate some of its limitations.

The following emerging practices are used by teams that have successfully adopted serverless technologies in their organization. Many are useful outside of the context of serverless, such as using multiple AWS accounts and using EventBridge in an event-driven architecture. Although none of them are silver bullets (is anything?), they are useful in the right contexts and are ideas worth considering.

11.1 Using multiple AWS accounts

Every AWS employee you speak to nowadays will tell you that you should have multiple AWS accounts and manage them with AWS Organizations (https://aws.amazon.com/organizations). At a minimum, you should have one AWS account per environment. For larger organizations, you should go further and have at least one AWS account per team per environment. There are many reasons why this is considered a best practice—regardless of whether you’re working with serverless technologies—including those discussed in the following sections.

11.1.1 Isolate security breaches

Imagine the nightmare scenario where an attacker has gained access to your AWS environment and is then able to access and steal your users’ data. This nightmare can play out in many ways; here are three that jump to mind right away:

  • An EC2 instance is exposed publicly, and the attacker brute-forces their way into the instance over SSH. Once inside, they can use the instance’s IAM role to access other AWS resources.

  • A misconfigured web application firewall (WAF) allows the attacker to execute a server-side request forgery (SSRF) attack and trick the WAF into relaying requests to the EC2 metadata service. This lets the attacker obtain the temporary AWS credentials used by the WAF server and, from there, access other AWS resources in the account. This is what happened in the Capital One data breach in 2019.

  • An employee accidentally includes their AWS credentials in a Git commit to a public GitHub repo. The attacker scans public GitHub repos for AWS credentials, finds the commit, and is then able to access all the AWS resources that the employee had access to. AWS also scans public GitHub repos for active AWS credentials and warns its customers when it finds them, but the damage is often already done by the time the customer realizes it.

Using multiple accounts doesn’t stop these attack vectors, but it limits the blast radius of a security breach to a single account (and hopefully not your production account!).

11.1.2 Eliminate contention for shared service limits

Throughout this book, we have talked about AWS service limits several times. As your organization and your system grow, more engineers need to work on the system, and you will likely create more and more services that take care of specific domains within the larger system (think microservices). As this happens, you will run into those pesky service limits more frequently, because there is more contention for the limits that everything in the account shares.

It gets worse from here. Because service limits apply per region and affect all the resources in that region of the account, one team or one service can exhaust all the available throughput (for example, Lambda concurrent executions) in the region and throttle everything else.

What’s more, if all the environments run in the same AWS account, then something happening in a non-production environment can also impact users in production. For example, a load test in staging can consume so many Lambda concurrent executions that users are unable to access your APIs in production, because those API functions are throttled.

Having separate accounts for each team and each environment eliminates this contention altogether. If a team makes a mistake or experiences a sudden traffic spike in its services, the extra throughput it consumes will not impact other teams’ services. Any service limit-related throttling is contained within that account, which limits the blast radius of these incidents. Equally, you can safely run load tests in non-production environments, knowing that they won’t affect your users in production.

What if, within a team, the same contention exists between different services? Maybe one of the team’s services handles much more traffic than the rest and occasionally causes other services to be throttled. Well, then you want to move that service into its own set of dev, test, staging, and production accounts. This technique of using AWS accounts as bulkheads to isolate and contain the blast radius can go as far as you need; you don’t have to stop at one account per team per environment. Make the technique work for you, not the other way around.

11.1.3 Better cost monitoring

If everything runs from the same AWS account, then you will have a hard time attributing your AWS costs to individual environments, teams, or services. Having a separate account for each makes it easy to see what every environment, team, or service costs.

11.1.4 Better autonomy for your teams

From a security and access control point of view, if each team has its own set of AWS accounts, then you can afford to give them more autonomy and control over those accounts. If everyone shares the same AWS account and that account hosts both non-production and production environments, then the stakes are high: mistakes have a large blast radius, and teams can accidentally delete or update other teams’ resources, or even delete users’ data in production. That is why access has to be managed so carefully, and that creates a lot of complexity and stress for whoever must manage it (typically the security team or a cloud platform team).

In my experience, the high stakes and complexity invite gatekeeping and create friction between the various disciplines. Feature teams often suffer delays as they wait for an overworked platform team to grant them the access they need. Resentment builds, harmony erodes, and soon it becomes an “us versus them” situation.

Giving every team their own AWS accounts limits the blast radius of any issues and lowers the stakes. You can then afford to give your teams more autonomy within their own accounts. The platform team/security team can instead focus on setting up guardrails and governance infrastructure so they can identify problems quickly. And they should work with the feature teams to ensure they follow organizational best practices and meet your security requirements.

11.1.5 Infrastructure-as-code for AWS Organizations

Having multiple AWS accounts means you need to have some way to manage them, especially as you scale your organization. The number of AWS accounts can grow, and as more engineers join the organization, it becomes more important to have strong governance and oversight of your AWS environment.

One of the shortcomings of AWS Organizations is that you can’t manage the configuration of your organization using infrastructure as code (IaC) out of the box. CloudFormation, for example, is a regional service and is limited to provisioning resources within a single account and region. At the time of writing, the only tool that allows you to apply IaC to AWS Organizations is org-formation (https://github.com/org-formation/org-formation-cli). It’s an open source tool that lets you capture the configuration of your AWS accounts and the entire AWS organization as IaC. I have used it on several projects and can’t recommend it highly enough!

A topic related to using multiple AWS accounts is the use of temporary CloudFormation stacks for temporary environments, such as those for feature branches or to carry out end-to-end (e2e) tests. We discuss temporary stacks next.

11.2 Using temporary stacks

One of the benefits of serverless technologies is that you pay for them only when people use your application; when your code is not running, you aren’t charged. Combine this with how easy tools such as the Serverless Framework make it to deploy a serverless application, and there is little standing in the way of creating new environments on demand. Because it’s so easy to create new environments and there is no uptime cost to keeping them around, many teams create temporary environments when they work on feature branches or run their e2e tests.

11.2.1 Common AWS account structure

It’s common for teams to have multiple AWS accounts, one for each environment. Though there doesn’t seem to be a consensus on how to use these environments, we tend to follow these conventions:

  • The dev environment is shared by the team. This is where the latest development changes are deployed and tested end to end. This environment is unstable by nature and shouldn’t be used by other teams.

  • The test environment is where other teams can integrate with your team’s work. This environment should be stable so it doesn’t slow down other teams.

  • The staging environment should closely resemble the production environment and may often contain dumps of production data. This is where you can stress test your release candidate in a production-like environment.

  • And then there’s the production environment.

As discussed earlier in this chapter, it’s best practice to have multiple AWS accounts—at least one account per team per environment. In the dev account, you can also have more than one environment—one for each developer or each feature branch.

11.2.2 Use temporary stacks for feature branches

When we start work on a new feature, we are still feeling our way toward the best solution for the problem. The codebase is unstable, and many bugs haven’t been ironed out yet. Deploying our half-baked changes to the dev environment can be quite disruptive:

  • It risks destabilizing the team’s shared environment.

  • It overwrites other features the team is working on.

  • Team members may fight over who gets to deploy their feature branch to the shared environment.

Instead, we can deploy the feature branch to a temporary environment. With the Serverless Framework, this is as easy as running the command sls deploy -s my-feature, where my-feature is the name of the environment (the stage) and becomes part of the name of the CloudFormation stack. This deploys all the Lambda functions, API Gateway, and any other related resources, such as DynamoDB tables, in their own CloudFormation stack. We are able to test our work-in-progress feature in an AWS account without affecting other team members’ work.

Having these temporary CloudFormation stacks for each feature branch carries negligible cost overhead. When the developer is done with the feature, the temporary stack can be removed by running the command sls remove -s my-feature. However, because these temporary stacks are an extension of your feature branch, they exhibit the same problems as long-lived feature branches. Namely, they get out of sync with the other systems they need to integrate with. This applies to the incoming events that trigger your Lambda functions (such as the payloads from SQS/SNS/Kinesis) as well as to the data your functions depend on (such as the data schema in DynamoDB tables). We find that teams using serverless technologies tend to move faster, which makes the problems with long-lived feature branches even more noticeable.

As a rule of thumb, don’t leave feature branches hanging around for more than a week. If the work is large and takes longer to implement, break it up into smaller features. While you’re working on a feature branch, you should also integrate changes from the main development branch regularly—at least once per day.

11.2.3 Use temporary stacks for e2e tests

Another common use of temporary CloudFormation stacks is running e2e tests. One of the common problems with these tests is that you need to insert test data into a shared AWS environment. Over time, this leaves a lot of junk data in those environments and can make life difficult for other team members. For example, testers often have to run manual tests against the mobile or web app, and all the test data left behind by your automated tests can create confusion and make their job more difficult than it needs to be. As a rule of thumb, we always do the following:

  • Insert the data a test case needs before the test.

  • Delete the data after the test finishes.

Using the Jest (https://jestjs.io) JavaScript testing framework, you can capture these before and after steps as part of your test suite. They help keep the tests robust and self-contained, because the tests no longer implicitly rely on data that happens to exist in the environment. They also help reduce the amount of junk data in the shared dev environment.
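
For example, here’s a minimal sketch of what these before and after steps could look like in Jest, assuming a hypothetical DynamoDB table (whose name is passed in through a TABLE_NAME environment variable) and the AWS SDK for JavaScript v3; adapt the data shape to your own service:

  const { DynamoDBClient } = require('@aws-sdk/client-dynamodb')
  const { DynamoDBDocumentClient, PutCommand, DeleteCommand } = require('@aws-sdk/lib-dynamodb')

  const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}))

  // hypothetical test data; use whatever shape your service expects
  const testUser = { id: 'test-user-e2e-001', name: 'E2E Test User' }

  beforeAll(async () => {
    // insert the data this test case needs before the tests run
    await docClient.send(new PutCommand({
      TableName: process.env.TABLE_NAME,
      Item: testUser
    }))
  })

  afterAll(async () => {
    // delete the data after the tests finish so no junk is left behind
    await docClient.send(new DeleteCommand({
      TableName: process.env.TABLE_NAME,
      Key: { id: testUser.id }
    }))
  })

  it('returns the profile of an existing user', async () => {
    // call the deployed API with testUser.id and assert on the response
  })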

But despite our best intentions, mistakes happen, and sometimes we deliberately cut corners to gain agility in the short term. Over time, these shared environments still end up with tons of test data. As a countermeasure, many teams employ cron jobs to wipe these environments from time to time.

An emerging practice to combat these challenges is to create a temporary CloudFormation stack in the CI/CD pipeline. The temporary stack is used to run the e2e tests and is destroyed afterwards. This way, there is no need to clean up test data, either as part of your test fixture or with cron jobs. The drawbacks include the following:

  • The CI/CD pipeline takes longer to run.

  • You still leave test data in external systems, so it’s not a complete solution.

You should weigh the benefits of this approach against the delay it adds to your CI/CD pipeline. Personally, we think it’s a great approach, and we see more teams starting to adopt it. To make CI/CD pipelines go faster, some teams keep a number of these temporary stacks around and reuse them in a round-robin fashion. This way, you still enjoy the benefit of being able to run e2e tests against a temporary environment but shorten the time it takes to deploy the temporary environment (updating an existing CloudFormation stack is significantly faster than creating a new stack).

11.3 Avoid sensitive data in plain text in environment variables

One common mistake we have seen in both serverful and serverless applications is leaving sensitive data (such as API keys and credentials) in plain text in environment variables. Serverless applications are generally more secure because AWS takes care of the security of the operational environment of our application. This includes securing the virtual machines our code runs on and their network configuration, as well as the operating system itself.

Our Lambda functions run on bare-metal EC2 instances that AWS manages, and the EC2 instances reside in AWS-managed VPCs. There’s no easy way for an attacker to find out information about the virtual machine itself, and there’s no way for attackers to SSH into these virtual machines.

The operating systems are constantly updated and patched with the latest security patches, sometimes before the patch is even available to the general public. Such was the case during the Meltdown and Spectre debacle when all EC2 instances behind Lambda and Fargate were quickly patched against the vulnerabilities long before the rest of us were able to patch our container and EC2 images. Having AWS manage the operational environment of our code removes a huge class of attack vectors from our plate, but we are still responsible for the security of our application and its data.

11.3.1 Attackers can still get in

Even though the operational environment of our code is secured by AWS, it’s still possible for attackers to get inside the execution environment of our functions via other means, including the following:

  • An attacker successfully executes a code injection attack. For example, if your application or any of its dependencies calls JavaScript’s eval() function on user input, then you’re vulnerable to these attacks.

  • An attacker compromises one of your dependencies and publishes a malicious version that steals information from your application at runtime. Remember the time a security researcher gained publish access to 14% of NPM packages (http://mng.bz/N4PN)? Or the time an attacker compromised the NPM account of one of ESLint’s maintainers and published malicious versions of eslint-scope and eslint-config-eslint (http://mng.bz/DKPn)?

  • An attacker publishes a malicious NPM package with a name similar to that of a popular package and steals information from your application during initialization. One example is the malicious crossenv package, which used the popular cross-env package as bait (http://mng.bz/l9d6).

Once inside, attackers often steal information from common, easily accessible places such as environment variables. This is why it’s so important that we avoid putting sensitive data in plain text in environment variables.

11.3.2 Handle sensitive data securely

Sensitive data should be encrypted both in transit and at rest. This means it should be stored in encrypted form; within AWS, you can use both SSM Parameter Store and Secrets Manager for this. Both services support encryption at rest, integrate directly with AWS Key Management Service (KMS), and allow you to use customer managed keys (CMKs). The same encryption-at-rest principle should also apply to how sensitive data is handled inside your application. There are multiple ways to achieve this; for example:

  • Store the sensitive data in encrypted form in environment variables, and decrypt it using KMS during cold start (see the first sketch after this list).

  • Keep the sensitive data in SSM Parameter Store or Secrets Manager, and fetch and decrypt it during the Lambda function’s cold start.
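
To illustrate the first approach, here’s a minimal sketch, assuming a hypothetical DB_PASSWORD environment variable that holds a base64-encoded ciphertext (encrypted with a KMS key) and the AWS SDK for JavaScript v3:

  const { KMSClient, DecryptCommand } = require('@aws-sdk/client-kms')

  const kms = new KMSClient({})

  // decrypted once per container and kept in a closure variable,
  // never written back to process.env
  let dbPassword

  const decryptSecrets = async () => {
    if (!dbPassword) {
      const resp = await kms.send(new DecryptCommand({
        CiphertextBlob: Buffer.from(process.env.DB_PASSWORD, 'base64')
      }))
      dbPassword = Buffer.from(resp.Plaintext).toString('utf-8')
    }
  }

  module.exports.handler = async (event) => {
    await decryptSecrets()
    // ... use dbPassword to talk to the database ...
  }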

Once decrypted, the data can be kept in an application variable or closure where your code can easily access it. The important thing is that sensitive data is never placed back into the environment variables in unencrypted form. Our personal preference is to fetch sensitive data from SSM Parameter Store/Secrets Manager during cold start. We would use middy’s SSM middleware (https://github.com/middyjs/middy/tree/main/packages/ssm) to inject the decrypted data into the context object and cache it for some time.

This way, we can rotate these secrets at the source without having to redeploy the application; once the cache expires, the middleware fetches the new values on the next Lambda invocation. It also makes it easier to manage shared secrets, where multiple services need to access the same secret. Finally, this approach allows more granular control of permissions, because the Lambda function needs explicit IAM permissions to read the secrets from SSM Parameter Store/Secrets Manager.
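
For example, here’s a rough sketch of what that could look like with middy’s SSM middleware, assuming middy v3 (the option names differ slightly between middy versions, and the parameter path here is made up):

  const middy = require('@middy/core')
  const ssm = require('@middy/ssm')

  const lambdaHandler = async (event, context) => {
    // context.stripeApiKey holds the decrypted value; it never goes back
    // into the environment variables
    // ...
  }

  module.exports.handler = middy(lambdaHandler)
    .use(ssm({
      fetchData: {
        stripeApiKey: '/my-app/dev/stripe-api-key' // hypothetical SSM parameter
      },
      setToContext: true,         // put the decrypted values on the context object
      cacheExpiry: 5 * 60 * 1000  // re-fetch after 5 minutes so rotated secrets are picked up
    }))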

There are other variants of these two approaches. For example, instead of storing encrypted secrets in environment variables, you can store them in an encrypted file that is deployed as part of the application. During the Lambda cold start, this file is decrypted with KMS, and the secrets it contains are extracted into application variables, away from the environment variables.

11.4 Use EventBridge in event-driven architectures

Amazon SNS and SQS have long been the go-to options for AWS developers when it comes to service integration. However, since its rebranding, Amazon EventBridge (formerly Amazon CloudWatch Events) has become a popular alternative, and I would argue that it’s actually a much better option as the event bus in an event-driven architecture.

11.4.1 Content-based filtering

SNS lets you filter messages via filter policies, but you can’t filter messages by their content: you can only filter on message attributes, and you can have no more than 10 attributes per message. If you require content-based filtering, it has to be done in code. EventBridge, on the other hand, supports content-based filtering and lets you pattern match against an event’s content. In addition, it supports advanced filtering rules such as these:

  • Numeric comparison

  • Prefix matching

  • IP address matching

  • Existence matching

  • Anything-but matching

NOTE Check out the blog post at http://mng.bz/B1w0 on EventBridge’s content-based filtering for more details on these advanced rules.
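
For example, here’s a rough sketch of a rule that combines some of these operators, assuming the AWS SDK for JavaScript v3 and a hypothetical orders event bus that receives order_placed events from a checkout-service publisher:

  const { EventBridgeClient, PutRuleCommand } = require('@aws-sdk/client-eventbridge')

  const eventBridge = new EventBridgeClient({})

  const createBigInternationalOrdersRule = async () => {
    // match order_placed events whose total exceeds 100 and whose country
    // is anything but 'US', by pattern matching on the event's content
    await eventBridge.send(new PutRuleCommand({
      Name: 'big-international-orders',
      EventBusName: 'orders', // hypothetical centralized event bus
      EventPattern: JSON.stringify({
        source: ['checkout-service'],
        'detail-type': ['order_placed'],
        detail: {
          total: [{ numeric: ['>', 100] }],     // numeric comparison
          country: [{ 'anything-but': ['US'] }] // anything-but matching
        }
      })
    }))
  }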

In an event-driven architecture, it’s often desirable to have a centralized event bus. It makes it easy for subsystems to subscribe to events triggered by any other subsystem and for you to create an archive that captures everything happening in the whole application (for both audit and replay purposes).

With content-based filtering, it’s possible to have a centralized event bus in EventBridge: subscribers can subscribe to exactly the events they want without having to negotiate with the event publishers about which attributes to include. This is usually not feasible with SNS, where you have to resort to multiple topics instead.

11.4.2 Schema discovery

A common challenge with event-driven architectures is identifying and versioning event schemas. EventBridge addresses this challenge with its schema registry and a built-in mechanism for schema discovery.

EventBridge captures a wide range of events from AWS services (such as when an EC2 instance’s state changes) on the default event bus, and it provides the schemas for these AWS events in the default schema registry. You can also enable schema discovery on any event bus; EventBridge then samples the ingested events and generates and versions schema definitions for them.

If you’re programmatically generating schema definitions for your application events already, then you can also create a custom schema registry and publish your schema definitions there as part of your CI/CD pipeline. That way, your developers always have an up-to-date list of the events in circulation and what information they can find on these events.
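
For instance, a CI/CD step could publish a schema definition along these lines. This is only a rough sketch: it assumes the @aws-sdk/client-schemas package, and the registry name, schema name, and file path are all made up; check the Schemas API reference for the exact parameters your SDK version expects:

  const { readFileSync } = require('fs')
  const { SchemasClient, CreateSchemaCommand } = require('@aws-sdk/client-schemas')

  const schemas = new SchemasClient({})

  const publishOrderPlacedSchema = async () => {
    // publish the JSON Schema for the order_placed event to a custom registry;
    // subsequent versions can be published with UpdateSchemaCommand
    await schemas.send(new CreateSchemaCommand({
      RegistryName: 'my-app',                      // hypothetical custom registry
      SchemaName: 'checkout-service@order_placed', // hypothetical schema name
      Type: 'JSONSchemaDraft4',                    // or 'OpenApi3'
      Content: readFileSync('./schemas/order_placed.json', 'utf-8')
    }))
  }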

Open-source tools such as the evb-cli (https://www.npmjs.com/package/@mhlabs/evb-cli) even let you generate EventBridge patterns using the schema definitions in a schema registry. This is handy, especially if you’re new to EventBridge’s pattern language!

11.4.3 Archive and replay events

Another common requirement for event-driven architectures is to be able to archive the ingested events and replay them later. The archive requirement is often part of a larger set of audit or compliance requirements and is therefore a must-have in many systems. Luckily, EventBridge offers archive and replay capabilities out of the box. When you create an archive, you can configure the retention period, which can be set to indefinite. You can optionally configure a filter so that only matching events are included in the archive.

When you need to replay events from the archive, you choose a start and an end time so that only the events captured in that time range are replayed. One thing to keep in mind about event replays is that EventBridge does not preserve the original order of the events as they were received. Instead, EventBridge tries to replay the events as quickly as possible, which means you can expect a lot of concurrency and that most events will be replayed out of sequence.
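
For example, here’s a rough sketch of creating an archive and replaying an hour’s worth of events from it, assuming the AWS SDK for JavaScript v3 (the names are made up, and the bus, archive, and rule ARNs are placeholders you would supply):

  const { EventBridgeClient, CreateArchiveCommand, StartReplayCommand } = require('@aws-sdk/client-eventbridge')

  const eventBridge = new EventBridgeClient({})

  const createArchive = async (busArn) => {
    await eventBridge.send(new CreateArchiveCommand({
      ArchiveName: 'orders-archive', // hypothetical name
      EventSourceArn: busArn,        // the event bus to archive events from
      RetentionDays: 0               // 0 retains the events indefinitely
    }))
  }

  const replayOneHour = async (archiveArn, busArn, ruleArn) => {
    await eventBridge.send(new StartReplayCommand({
      ReplayName: 'orders-replay-2021-01-01',
      EventSourceArn: archiveArn, // the archive to replay events from
      EventStartTime: new Date('2021-01-01T00:00:00Z'),
      EventEndTime: new Date('2021-01-01T01:00:00Z'),
      Destination: {
        Arn: busArn,           // replay onto this event bus
        FilterArns: [ruleArn]  // optionally limit the replay to specific rules
      }
    }))
  }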

If ordering is important to you when replaying events, then you should check out the evb-cli project mentioned earlier. Its evb replay command supports paced replays, which retain the ordering of events and let you control how quickly events are replayed. For example, a replay speed of 100 replays the events in real time, so replaying an hour’s worth of events takes an hour.

11.4.4 More targets

Whereas SNS supports only a handful of targets (such as HTTP, email, SQS, Lambda, and SMS), EventBridge supports more than 15 AWS services as targets (including SNS, SQS, Kinesis, and Lambda), and you can forward events to an EventBridge bus in another account.

This extensive reach helps remove a lot of unnecessary glue code. For example, to start a Step Functions state machine from SNS, you would have needed a Lambda function between the topic and Step Functions. With EventBridge, you can connect a rule to the state machine directly.
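
For example, here’s a rough sketch of attaching a state machine directly as a target, assuming the AWS SDK for JavaScript v3 and the hypothetical rule and bus names from the earlier sketch; the role you pass in must allow EventBridge to call states:StartExecution on the state machine:

  const { EventBridgeClient, PutTargetsCommand } = require('@aws-sdk/client-eventbridge')

  const eventBridge = new EventBridgeClient({})

  const addStateMachineTarget = async (stateMachineArn, roleArn) => {
    await eventBridge.send(new PutTargetsCommand({
      Rule: 'big-international-orders', // hypothetical rule from the earlier sketch
      EventBusName: 'orders',
      Targets: [{
        Id: 'order-processing',  // an identifier that is unique within the rule
        Arn: stateMachineArn,    // the state machine to start for each matching event
        RoleArn: roleArn         // IAM role that EventBridge assumes to start the execution
      }]
    }))
  }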

11.4.5 Topology

There are different ways to arrange event buses in EventBridge. For example, you can have a single centralized event bus, every service can publish events to its own event bus, or you can have a few domain-specific event buses shared by related services. There is no clear consensus on which approach is best, because everyone’s context is different and each approach has its pros and cons. However, we personally favor the centralized event bus approach because it has some great advantages, including the following:

  • You can implement an archive and a schema registry in one place.

  • You can manage access and permissions in one place.

  • All the events you need are available in one event bus.

  • There are fewer resources to manage.

But it also has some shortcomings that you need to consider:

  • There is a single point of failure. Having said that, EventBridge is already highly available, and the infrastructure that ingests, filters, and forwards events to configured targets is distributed across multiple availability zones.

  • Service teams have less autonomy as they all depend on the centralized event bus.

There is also the question of AWS account topology. That is, which account do you deploy the event bus to if a given environment consists of multiple AWS accounts (such as when you have one account per team)? Should you deploy the centralized event bus into its own account, or into whichever account makes the most sense for your organization? That is a wider topic that is outside the scope of this chapter, but I recommend you check out this re:Invent 2020 session by Stephen Liedig: https://www.youtube.com/watch?v=Wk0FoXTUEjo. It goes into detail about the different configurations and the pros and cons of each.

Summary

And that’s it for our list of emerging practices that you should seriously consider adopting in your projects. We call these emerging practices because they are not yet adopted ubiquitously but are gaining traction in the AWS community. As the AWS ecosystem and serverless technologies develop and mature, more practices will emerge and take root. It’s worth remembering that no practice should be considered best in its own right; you must always consider the context and environment in which a practice is applied.

As technology and your organization change, your context changes too. Many things that you might once have considered best practice can easily become anti-patterns. For example, monorepos work great when you are a small team, but by the time you grow to hundreds or perhaps thousands of engineers, monorepos present many challenges that require complex solutions to address.

The same goes for how we build, test, deploy, and operate software. What worked great in private data centers and server farms might not translate well to the cloud. And practices that serve us well when we have to manage both the infrastructure our code runs on and the code itself might work against us when we build applications with serverless technologies.

Best practices and design patterns should be the start of the conversation, not the end. After all, these so-called best practices and design patterns are the collective documentation of things that others have done that worked for them, to some degree, at some point in time. There’s no guarantee that they’ll work for you today. And it’s easy to see parallels in other industries. For example, did you know that lobotomies were part of mainstream mental healthcare from the 1930s to the 1950s, before they fell out of use in the 1970s and came to be considered outright barbaric by today’s standards?
