
11. Monitoring, Auditing, and Alerting

Miguel A. Calles, La Habra, CA, USA

In this chapter, we will discuss monitoring, auditing, and alerting. We will consider monitoring to be the process and tools we use to assess our application, auditing to be the process of looking for deviations from desired settings, and alerting to be the notification process when there are monitoring and auditing findings. We will review cloud provider services we can use to implement monitoring, auditing, and alerting.

The Importance of Monitoring, Auditing, and Alerting

The OWASP Top Ten and the Interpretation for Serverless include insufficient logging and monitoring as a security risk. Logging allows the application to provide a record of its health by creating log records of different log levels. Debug logs provide us with more information and data than we would typically want during standard application logging. Information logs provide us with a record of the execution. We use warning logs to let us know when a process did not run normally but did not fail. Error logs inform us when a process fails to execute correctly and could not recover. Some logging tools have additional levels that provide more or less information than the levels we noted. Each log level is useful in monitoring.

The log levels are enabled depending on the development stage, and we can monitor them accordingly. We most likely enable debug logs in development stages and not in production stages. We can use debug logs to troubleshoot an application error because they provide amplifying information (e.g., variable values). Monitoring debug logs allows us to find potential issues before we release our code into the production stage. We can use information logs to assess how a system behaves under normal and optimal conditions. Monitoring information logs allows us to see patterns when a system operates correctly, and we can identify a degraded mode when the patterns start deviating. We can use warning logs to capture information about system degradation. Monitoring these logs allows us to determine when a software change introduced a defect or when a third-party service (e.g., a credit card merchant API) is having issues. We can use error logs to capture information about failures that prevent our application from functioning. Monitoring error logs allows us to respond to failures in our application by alerting the appropriate team. Our application in the production stage will likely record information, warning, and error logs.
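As a minimal sketch of how these log levels might be wired up, the following Python snippet uses the standard logging module and a hypothetical STAGE environment variable to enable debug logs outside production; the charge function and its merchant logic are placeholders for illustration only.

import logging
import os

# Hypothetical STAGE variable set by our deployment tooling (an assumption).
STAGE = os.environ.get("STAGE", "dev")

# Verbose debug logging outside production; INFO and above in production.
logging.basicConfig(level=logging.DEBUG if STAGE != "prod" else logging.INFO)
logger = logging.getLogger("payments")


def charge_card(card_token: str, amount: float) -> bool:
    """Illustrative charge flow that emits each log level."""
    logger.debug("charge_card inputs: token=%s amount=%.2f", card_token, amount)
    logger.info("processing charge for %.2f", amount)
    if amount <= 0:
        logger.warning("non-positive amount %.2f; skipping charge", amount)
        return False
    try:
        # Placeholder for a call to a credit card merchant API.
        if card_token == "invalid":
            raise ValueError("card declined")
        return True
    except ValueError:
        logger.error("charge failed for token %s", card_token, exc_info=True)
        return False


if __name__ == "__main__":
    charge_card("tok_123", 25.00)
    charge_card("invalid", 25.00)

Running this in a "dev" stage emits the debug records as well; in a "prod" stage only the information, warning, and error records would appear.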

Failure to produce any logs will prevent us from detecting failures and malicious activity. Let’s suppose a malicious actor is trying numerous credit card numbers bought on the dark web. We might log these credit card processing failures as warnings or errors. Without capturing these logs, our monitoring solution is unable to notice an increased number of credit card failures and which parties are involved. Without alerting our support team, we allow the malicious actor to keep making purchases with stolen credit card numbers. We potentially expose the business to financial risk and a loss of credibility if the response teams fail to catch and stop this malicious activity promptly.

We want to audit our system to identify malicious activity outside the application. You can think of auditing as a type of monitoring that determines whether the application and infrastructure maintain the expected configuration. Failure to audit might allow a malicious actor to weaken the security posture or take over the application. Let’s suppose we capture customer-uploaded images in a private storage bucket, but a malicious actor manages to get sufficient privileges to modify the bucket settings. That actor can change the storage bucket to have public access, and now all those private customer images are viewable by anyone on the Internet. Had auditing been in place, the change could have been detected before too many customer images were exposed on the Internet. Without regular or continuous auditing, we would be unable to detect the undesired change and revert the bucket settings to private and minimize the exposure.

Monitoring

Monitoring allows us to assess our serverless application. We can learn how it functions when all is working well. We can see patterns in function execution times, logging, storage consumption, API gateway status codes, database reads and writes, costs, and more. Monitoring our serverless application allows us to determine when it is experiencing degradation and failures. We will review the general principles and provider-specific capabilities that can assist us in monitoring our application.

General Principles

How we monitor our application will vary depending on its design, the business and security requirements, and the budget. Yet there are some general principles we should consider implementing in our monitoring solution.

Billing

We will discuss billing as the first item to monitor because financial health can make or break an organization and can determine whether that organization chooses to continue funding the development and operation of our serverless application. Therefore, we want to ensure we properly monitor billing in our cloud provider account.

Serverless applications are growing in popularity due to the potential for reduced costs when compared to typical server-based applications. The cloud providers typically bill for usage and storage and might include a free tier where we could potentially run our application at no cost. As our application grows in popularity and more users are using it, the costs may increase. We may also see cost increases for reasons other than a growing user base: an inefficient process running more often, a change in the cloud provider’s pricing methodology, a circular execution bug, or many other scenarios. We should aim to keep our serverless costs low and question increases in costs.

As an example, our application might be vulnerable to a Distributed Denial of Service (DDoS) attack in which an attacker sends thousands of requests to a serverless function that takes a long time to execute. The serverless function takes the request inputs and makes a GraphQL query. The GraphQL configuration allows including additional queries within a query (which we will refer to as “nesting” queries). The attacker sends an input that produces a GraphQL query with several levels of nesting. It might take several seconds (even up to the maximum allowed execution time) for the GraphQL response to arrive and for the function to return a response. This one function execution is costly. If the attacker sends numerous requests at once, it could cause our application to become unresponsive and increase our application costs. Monitoring costs allows us to find anomalies and to find ways to optimize our application in areas with high usage costs.

HTTP Status Codes

We might be using an API gateway to trigger our serverless functions using HTTP events or object storage to host a serverless website. In either case, we should monitor any errors. We should monitor HTTP error codes in the 400s and 500s. Monitoring these status codes gives us insight into potential issues.

Each HTTP status code corresponds to a specific error, and seeing these codes in increasing quantities helps us determine what the issue might be. For example, an increase or significant spike in “401 Unauthorized” codes could indicate a malicious actor trying to brute force a protected endpoint, an error in our authentication method, or someone trying to discover new endpoints. Having numerous “500 Internal Server Error” codes might suggest the latest version of the software has a new bug, someone could be submitting invalid function inputs, or the function was not correctly configured or deployed. We might notice numerous “404 Not Found” codes, which might suggest someone is probing for hidden or unlinked pages. Monitoring the API gateway and object storage status codes could help us find and address issues before they become problematic and severe.
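As one hedged example of turning status codes into a monitorable metric on AWS, the boto3 sketch below creates a CloudWatch Logs metric filter that counts 401 responses; the log group name, metric namespace, and the assumption that the access logs are JSON records with a status field are all illustrative.

import boto3

logs = boto3.client("logs")

# Hypothetical log group that receives API access logs as JSON records with a
# "status" field; the group and namespace names are assumptions.
logs.put_metric_filter(
    logGroupName="/aws/api-gateway/orders-api-access-logs",
    filterName="unauthorized-requests",
    filterPattern="{ $.status = 401 }",
    metricTransformations=[
        {
            "metricName": "UnauthorizedRequests",
            "metricNamespace": "OrdersApi",
            "metricValue": "1",
            "defaultValue": 0.0,
        }
    ],
)

Once the metric exists, we can graph it on a dashboard or attach an alarm to it so a spike in unauthorized requests notifies the response team.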

Log Aggregation

The application might store logs in various locations. Functions, object storage buckets, API gateways, and databases will store logs in their configured locations. The log storage might segregate the logs on a per-resource basis. Or, one action might generate logs in multiple locations, which makes seeing all the related logs somewhat troublesome. Using a log aggregation service allows us to see all the logs in one convenient view.

The cloud providers have some log aggregation capabilities, but we might want to consider using a third-party service too. Third-party services might have proprietary searching, filtering, and analysis capabilities. They might also support additional customization. Furthermore, they might integrate with alerting capabilities. We should select a log aggregation tool that gives us the log visibility and features we need to assess the application health, perform troubleshooting, and address business, security, and privacy requirements.

Outages

We should monitor planned outages. Planned outages are events the cloud provider schedules in advance to perform routine maintenance, make upgrades, and address security issues. Knowing when planned outages occur allows us to inform our application users and make any necessary plans to ensure the application is ready for a disruption of service. Failure to know the time frame of a planned outage might result in the response team receiving urgent notifications about an “unplanned” outage in our application.

We should also monitor for unplanned outages. As good and reliable as the cloud providers are, they do experience unexpected failures and degradation of service. The cloud provider will typically post an update on their status website and by other means. Accurately monitoring the provider status allows us to notify our application users about a disruption.

As speedily as the provider might send a service disruption status, it might be a while after the outage occurred before that status appears. We should perform our own monitoring to assess the health of our application. Our monitoring should notice any change in behavior (e.g., numerous HTTP status errors, increased latencies, a growing number of function timeouts, etc.). It might detect a potential service outage earlier than the provider does, or it might catch a self-induced issue (i.e., one not caused by the provider).

Utilization and Metrics

We should monitor the utilization and metrics to have a picture of how our application is performing. We can monitor our serverless functions, object storage, logging, API gateways, databases, and the other cloud provider services by monitoring their utilization and metrics.

We might typically associate utilization with the central processing unit (CPU) and memory, and these metrics still apply to our serverless functions. The functions run in a container, which is a virtualized computer running a lightweight operating system. We have the option to select the memory allocated to each function, and the cloud provider determines the CPU allocation; typically, assigning more memory results in a higher CPU allocation. By monitoring the utilization patterns for each function, we can decide if we have allocated too little or too much memory. We can determine whether to optimize a function or increase the memory allocation if we find the CPU utilization is high. We might discover certain functions with high CPU utilization are vulnerable to exploits (e.g., Regular Expression Denial of Service attacks) designed to consume as much CPU, memory, or time as possible. We may use the monitoring data to determine the optimal CPU-memory allocation for each function, which ultimately helps reduce our costs (higher allocations have higher prices). Monitoring the serverless function utilization allows us to tune our functions, observe any potential performance issues, and detect Denial of Service attacks.

We can monitor the storage usage. Databases, object storage, and logging consume storage, which might have an impact on performance and cost. Queries and lookups might take longer when database tables have too much data. Keeping unused or old data in databases and object storage increases our costs, especially when we exceed any free-tier limits. We might move data into less expensive storage for archiving purposes if we need to satisfy any data retention requirements. We should monitor any related metrics to ensure we keep our storage sizes and costs as small as possible.

We monitor performance metrics to assess how the services our application uses are functioning. We can observe throttling and errors in our database performance, which might indicate the application is exceeding its read-write capacities. We might see the throttling of stream records, which might indicate an unusual volume of activity. In addition to monitoring HTTP status codes, we can assess whether our API gateways are experiencing issues with caching and latencies. Monitoring the metrics that give insights into the performance of the application allows us to respond when there are performance issues.

Third-Party Solutions

We might want to consider using a third-party solution instead of solely relying on those provided by our cloud provider. When our cloud provider is experiencing degradation, our monitoring solution might experience it as well. Having a third-party solution built independently from our cloud provider gives us increased reliability in our monitoring because any issues affecting our serverless application and infrastructure have no impact on it. Additionally, third-party solutions may have features and integrations that our cloud provider might not provide.

Third-party solutions might be a service or a piece of software. We can use a service to avoid having to configure and maintain additional resources. A service might provide continuous or real-time monitoring features to ensure we see the most recent data. In contrast, a piece of software will require us to execute it manually or install and configure it to provide continuous or real-time features. We can adopt either approach or both depending on our business and technical requirements.

We might need to give third-party services access to our cloud provider environment. We should be careful to grant only least-privileged IAM policies and roles to the third-party entity. A data or cybersecurity breach at the third party might give a malicious actor access to our cloud provider account. The malicious user might make undesired changes to our application and infrastructure if we granted permissive IAM policies and roles that allow modifying and deleting resources.

We might need to give third-party software access as well. We run a risk similar to using a third-party service if someone gets access to the software or the credentials it uses. We should consider enabling and disabling the credentials before and after manually executing the software, respectively. If we develop a continuous monitoring solution, we should consider updating the credentials periodically; this applies to services as well. We should weigh the benefits and risks of using third-party software.

Amazon Web Services (AWS)

We will discuss how we can leverage the different AWS features and capabilities to monitor our application and infrastructure.

Account Settings

We should update our alternate contacts to have billing, operations, and security contacts. We should use an email distribution list for each. Doing so allows the appropriate contacts to receive relevant notifications. We should update our billing preferences to receive free-tier usage alerts. This will enable us to respond if we are aiming to stay within the free tiers and keep costs low.

AWS Billing and Cost Management

The AWS Billing and Cost Management service has different capabilities to help us monitor our usage and costs. The service provides a dashboard, AWS Budgets, AWS Cost Explorer, AWS Cost and Usage Report, and AWS Cost Categories.

The dashboard provides a quick overview of the current incurred costs. We can quickly identify our most expensive services and assess whether we remain within the AWS Free Tier or exceed it.

AWS Budgets allows us to budget based on cost, usage, reservations, and savings plans. In a serverless environment, the cost and usage budgets are the most applicable. We might consider a reservation budget if we decide to use the Amazon Elasticsearch Service (a managed service we can use for log aggregation) to search the application logs.

We should configure notifications when the costs (actual or forecasted) and usage exceed the budgets. We can send the notifications to an email address and an SNS topic. An email distribution list or a ticketing system, which contains the appropriate billing contacts, should receive the notification.
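A minimal sketch of creating such a budget with boto3 follows; the budget name, dollar amount, threshold, and email address are assumptions for illustration.

import boto3

budgets = boto3.client("budgets")
sts = boto3.client("sts")
account_id = sts.get_caller_identity()["Account"]

# Illustrative monthly cost budget with an email alert at 80 percent of the limit.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "serverless-app-monthly",
        "BudgetLimit": {"Amount": "50", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,          # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "billing@example.com"}
            ],
        }
    ],
)

Swapping the notification type to FORECASTED creates the forecast-based alert mentioned above, and an SNS subscriber can be added alongside the email address.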

AWS Cost Explorer allows us to group and filter usage costs. We can see patterns in usage costs and identify which services need cost optimization.

AWS Cost and Usage Report allows us to create CSV exports of costs on an hourly or daily basis. We can use these CSV files to tailor our reports to conform to our business requirements. We should configure the report to provide additional details to identify trends in specific resources, even though it will increase the report size.

Amazon CloudWatch

CloudWatch provides multiple monitoring capabilities. CloudWatch Dashboards allows us to create customized dashboards with charts, counts, and query results. CloudWatch metrics allows us to monitor billing, usage, and metrics for services (e.g., DynamoDB, S3). CloudWatch Alarms allows us to send a notification via SNS when a CloudWatch metric exceeds a specified threshold and when it returns to within range. We can store our logs in CloudWatch Log Groups and query or forward those logs to a log aggregator. We can also query the log groups with CloudWatch Logs Insights. CloudWatch Events allows us to configure an event source that triggers a Lambda function, SQS queue, SNS topic, Kinesis stream, and other targets that we can use to monitor the sources. CloudWatch ServiceLens takes advantage of AWS X-Ray to organize logs, metrics, X-Ray traces, and deployment canaries into a comprehensive monitoring solution for microservices. We can consider each Serverless configuration file as defining its own microservice, for all intents and purposes, so enabling ServiceLens might be beneficial. CloudWatch Synthetics allows us to monitor our application. We can have it check the availability of our API endpoints and perform a workflow to ensure the application is behaving as it should. We can get notified when an endpoint stops responding or a workflow fails to reach completion.
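As a small example of the CloudWatch Alarms capability, the following boto3 sketch raises an alarm when a Lambda function records errors and publishes the alarm to an SNS topic; the function name, threshold, and topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the (placeholder) orders-handler function reports 5 or more errors
# in a 5-minute window; the SNS topic ARN is also a placeholder.
cloudwatch.put_metric_alarm(
    AlarmName="orders-handler-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "orders-handler"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)

The SNS topic can then fan the alarm out to the response team, as discussed in the alerting section later in this chapter.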

AWS X-Ray

AWS X-Ray allows us to link events and executions in our serverless environment. Our application has numerous individual parts, making it difficult to trace how they interconnect. Each Lambda function operates independently of the others, and multiple events can trigger them. Each DynamoDB table, API Gateway, SQS queue, SNS topic, and so on operates independently as well. X-Ray allows us to follow the execution path by tracing the event. For example, a user request to a URL hosted by API Gateway will forward the request to a Lambda function. The function sends an event to an SNS topic. The topic forwards the message to an SQS queue, and the queue triggers another Lambda function. X-Ray allows us to trace that path. Furthermore, X-Ray enables us to get a picture of the execution path via the service map. The tracing data and service map allow X-Ray to show latencies, error response codes, timestamps, and additional information.
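Tracing only captures data for functions that have it enabled; a minimal boto3 sketch for turning on active X-Ray tracing for an existing function (the function name is a placeholder) might look like the following.

import boto3

lambda_client = boto3.client("lambda")

# Enable active X-Ray tracing on an existing (placeholder) function.
lambda_client.update_function_configuration(
    FunctionName="orders-handler",
    TracingConfig={"Mode": "Active"},
)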

AWS Personal Health Dashboard

The Personal Health Dashboard allows us to see when service degradations and outages affect the services our application uses. We can see a history of service issues and whether they affected our services.

Amazon Simple Storage Service (S3)

We can monitor our S3 buckets in multiple ways. We can enable server access logs to capture the requests made to a bucket and use these logs to recognize patterns and anomalies. We can use the S3 management features: we can enable the relevant CloudWatch metrics to monitor and visualize storage, requests, data transfer, and replication, and we can enable analytics that automatically assess access patterns and provide suggestions on the data lifecycle. We can use S3 to capture application logs in addition to CloudWatch Logs. S3 storage pricing is lower than CloudWatch Logs pricing, so using S3 to store logs allows us to keep large logs for extended periods, typically at lower costs.
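A hedged boto3 sketch for enabling server access logs on a bucket follows; both bucket names are placeholders, and the target bucket is assumed to already allow the S3 log delivery service to write to it.

import boto3

s3 = boto3.client("s3")

# Send access logs for the (placeholder) uploads bucket to a central logging
# bucket; the target bucket must already grant log delivery write permission.
s3.put_bucket_logging(
    Bucket="customer-uploads-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "central-access-logs-bucket",
            "TargetPrefix": "customer-uploads/",
        }
    },
)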

Amazon Elasticsearch Service

Elasticsearch Service allows us to aggregate our application logs, CloudWatch metrics, and other relevant data. With relevant application data in one location, we can perform searches, queries, and monitoring. For example, we can correlate increased application error logs with error metrics to monitor for degraded application performance. We can use open source programs such as Kibana to visualize the data and Logstash to integrate with other tools.

AWS Lambda

We can monitor AWS Lambda functions using the Lambda dashboard, CloudWatch metrics, and AWS X-Ray.

The Lambda dashboard has prebuilt graphs we can view to monitor the overall performance of our Lambdas. These graphs can help us determine whether there is a performance issue (e.g., approaching the maximum allowed function invocations) for that AWS account. We cannot see issues with individual functions.

We can leverage CloudWatch metrics to monitor metrics for individual functions. We can use the function metrics to identify potential performance issues for each function. For example, we can monitor a Lambda function’s duration and update our Serverless configuration to set a timeout that coincides with the typical duration. We can also set up CloudWatch Alarms to notify us when metrics exceed specified thresholds.
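As an illustration of pulling per-function metrics, the sketch below retrieves a week of duration statistics for one function so we can compare them against the configured timeout; the function name is an assumption.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Pull the last 7 days of duration statistics for one (placeholder) function.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": "orders-handler"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,                  # one datapoint per day
    Statistics=["Average", "Maximum"],
    Unit="Milliseconds",
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"]), round(point["Maximum"]))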

We can leverage AWS X-Ray to trace the events that trigger a Lambda, the Lambda execution, and its following events, Lambda executions, or returns. Knowing the execution path allows us to pinpoint errors.

Amazon DynamoDB

We can monitor DynamoDB tables using the DynamoDB dashboard and CloudWatch metrics. The dashboard provides us with the status of CloudWatch Alarms and the utilization of the DynamoDB capacity limits. We can use CloudWatch metrics to monitor DynamoDB metrics. We can monitor these metrics with alarms to check when we exceed our capacities or see numerous errors, and we can adjust our provisioning appropriately.

Azure

We will discuss how we can leverage the different Azure features and capabilities to monitor our application and infrastructure.

Account Settings

We should go to the Microsoft Account security settings and specify an alternate email address to receive security alerts. We should use an email distribution list for this address. Doing so allows the appropriate contacts to receive relevant security notifications.

Azure Cost Management + Billing

Cost Management + Billing allows us to create a budget, perform a cost analysis, and optimize costs with recommendations from Azure Advisor. We can create a budget against a scope and configure an email alert when the costs exceed a percentage or amount threshold of the budget. The cost analysis graphs the costs per service, location, and subscription and compares them against the budgets. Advisor will provide us with recommendations, such as removing unused resources and enabling Blob storage lifecycle management to delete or archive old, unused data.

Azure Monitor

Azure Monitor allows us to centralize our logs and performance metrics in one location and monitor them. It collects Microsoft Active Directory (AD) audit logs, activity logs, resource logs, platform metrics, and any data reported via the Azure Data Collector API. These different data sources are grouped into metrics or logs. We can use the metrics to assess our application performance. The log data contains information about AD sign-in activity, service health, resource configuration changes, and resource operations logs. We can use the data sources in Azure Log Analytics to query the log data to find unusual behaviors, as described in the “General Principles” section. We can use them in Azure Application Insights to help us get an understanding of our application usage, performance, and any exceptions. We can use them in Azure Metrics Explorer to query our metrics. We can use them in Azure Dashboards to visualize our data. Finally, we can use them in Azure Workbooks to create interactive reports.

Azure Service Health

Azure Service Health allows us to see when service degradations and outages affect the services our application uses. Service Health provides us with a dashboard showing the service health and details of any disruption affecting our Azure resources.

Azure Sentinel

Azure Sentinel is a cloud-based security information and event management (SIEM) service that uses artificial intelligence to analyze data from multiple Microsoft sources (e.g., Microsoft 365, Azure AD, and Microsoft Cloud App Security). We can use the REST API and built-in data connectors to ingest data from other sources. Azure Sentinel will generate alerts when it detects anomalies and potential security threats. We can create playbooks to automate our workflows in response to security alerts.

Google Cloud

We will discuss how we can leverage the different Google Cloud features and capabilities to monitor our application and infrastructure.

Account Settings

We should consider using a security email distribution list to get security-related notifications, and we should be cautious about who is on the distribution. If our Google account is personal, someone could potentially use that email address to recover a password. G Suite and Google Cloud Identity organizations provide additional features and help reduce that risk.

Billing and Cost Management

We can use Billing to review our costs. Billing provides a dashboard, reports, exports, cost breakdowns, a price list, and the ability to create budgets. We can review our costs and identify which services and resources are the most costly and might need investigation and optimization. Budgets allow us to specify the desired monthly costs for the selected projects and services and to create notifications when the costs exceed certain thresholds. We can define a payments profile that uses an email distribution list containing the appropriate billing contacts. Billing also provides us with recommendations for optimizing our costs.

IAM & Admin

We can update our “Privacy & Security” settings to specify email distribution lists for the European Union (EU) Representative Contact and Data Protection Officer in compliance with the EU General Data Protection Regulation legislation.

Google Cloud’s Operations Suite

Google Cloud’s operations suite provides services to help us monitor our application and infrastructure. We can send our logs to the Cloud Logging API, which passes them to the Logs Router. We specify rules in the Logs Router to determine which logs get accepted or discarded and which to include in exports. We can store the logs in Cloud Logging storage, Cloud Storage, or BigQuery, or publish them to Pub/Sub, which can send them to external systems. We can use Error Reporting to analyze our logs and notify us about errors. We can use Cloud Trace and Cloud Profiler to get latency statistics and resource consumption profiles, respectively, to help us find performance issues. Cloud Debugger allows us to investigate production issues without affecting the application. Cloud Monitoring enables us to run health checks against our application endpoints to verify that our application and its APIs are up and running. Cloud Monitoring also collects metadata, events, and metrics to visualize the data in a dashboard and send alerts. We can use Service Monitoring to detect issues in our App Engine serverless applications. We do this by defining service-level objectives to monitor the appropriate App Engine operational and performance data, identify issues, generate service graphs, and create relevant alerts. We can take advantage of the different services to monitor everything from low-level matters up to the entire application.

Auditing

Auditing allows us to check whether we correctly configured our serverless application and infrastructure. Over time, as the application evolves, we will notice changing settings and resources. Ideally, all the settings and resources will match what we expect. Realistically though, we may find there are unused resources still running, deprecated settings still enabled, and undesired modifications to security settings. With auditing, we can discover unusual behavior and potential weaknesses that might affect our security posture. We will review the general principles and provider-specific capabilities that can assist us in auditing our application and infrastructure.

General Principles

How we audit our application will vary depending on its design, the business and security requirements, and the budget. Yet, there are some general principles we should consider implementing in our auditing solution.

Authorization Attempts

We should consider monitoring failed authorization attempts in our serverless application and infrastructure. Our application should protect its users by looking for suspicious activity and locking the affected user accounts. Likewise, we should watch for failed authorization attempts against our infrastructure because someone might be trying to take over our cloud provider account, which would put our business and application at risk.

Configuration Settings

Our application and infrastructure are configured based on specific settings. Some of these settings might potentially be benign if changed. But others could have serious implications. For example, application parameters that limit consecutive failed login attempts and require a minimum retry time between login attempts might have reduced security effectiveness if modified. Increasing the limit to an extremely large number and reducing the retry time to zero would allow someone to attempt a brute-force attack against login credentials. We should consider monitoring whether configuration settings match an approved baseline and when they are modified to ensure those changes pose no known or plausible security threat.

Infrastructure Changes

Although we are building and supporting a serverless application, the cloud provider does not limit us to having only serverless resources in our account. There may be unused resources enabled in our infrastructure. For example, a developer might have created a test resource and forgotten to remove it, or a malicious user created a virtual machine to mine digital currency. It is also possible for resources to get modified or deleted. For example, someone might change our serverless database provisioning to a high value, resulting in a larger monthly bill, or a malicious user could delete user data records. We should consider monitoring when infrastructure is added, modified, or deleted. Additionally, we should check for the existence of new infrastructure and ensure its configuration matches a known baseline.

Privileges

In Chapter 6, we discussed Identity and Access Management (IAM) principles and how we might potentially implement them. The IAM policies and settings might start changing after we initially implement them. As our application evolves, team members change, and development, business, and security needs change, IAM privileges might need updating, or individuals might retain access to resources they no longer need. We should consider auditing our IAM privileges to ensure we maintain the least privileges, assess what resources and permissions are in use, and remove access from users as appropriate.

Users and Credentials

We should consider auditing users and credentials that provide access to our cloud provider accounts. These entry points provide read, write, and delete privileges and might have detrimental effects if accessed by malicious users. We should check user accounts when a person joins or leaves the team and make sure we create and delete user accounts as appropriate. If a user has been inactive for a set period of time (e.g., 30 days), we should consider deactivating that user account. Users might have created credentials (e.g., access keys) that give another entry point to access the cloud provider account. We should review the last time they were used and their age. We should consider deactivating credentials that have not been used for a long period (e.g., two weeks) or were created more than a specified period ago (e.g., 90 days). This also applies to the credentials that our application uses. We should check the last time users changed their passwords and require a password change when it is older than a specified period (e.g., 90 days). We should check whether users have multi-factor authentication (MFA) enabled and, if not, require they enable it to add a layer of protection. Verifying that our user accounts are up to date and that credentials are short-lived helps reduce the risk of a malicious actor using either to access our cloud provider account.
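A minimal boto3 sketch of such a credential audit might loop over IAM users and flag missing MFA devices and aging access keys; the 90-day threshold mirrors the example above, and the script only prints findings rather than deactivating anything.

from datetime import datetime, timedelta, timezone

import boto3

iam = boto3.client("iam")
MAX_KEY_AGE = timedelta(days=90)   # illustrative threshold from the text

now = datetime.now(timezone.utc)
for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        name = user["UserName"]
        # Flag users without a multi-factor authentication device.
        if not iam.list_mfa_devices(UserName=name)["MFADevices"]:
            print(f"{name}: no MFA device enabled")
        # Flag active access keys older than the allowed age.
        for key in iam.list_access_keys(UserName=name)["AccessKeyMetadata"]:
            if key["Status"] == "Active" and now - key["CreateDate"] > MAX_KEY_AGE:
                print(f"{name}: access key {key['AccessKeyId']} is older than 90 days")

AWS also offers a consolidated credential report (generated via the IAM API) that covers much of the same data, which we could feed into the same kind of check.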

Unusual Activity and Behaviors

Our application, users, and integrations follow typical behaviors. We would not expect a team colocated in the same region and working during a day shift to do massive workloads during the middle of the night from different locations. We should be suspicious of access from a country or region where our team members do not reside. We should be concerned when seeing API calls to disable security, monitoring, auditing, and alerting settings. We should, therefore, audit for unusual activity to identify activity that might have negative impacts on our business and application.

This topic might logically fit in the monitoring section because we would have to monitor this activity to respond to it promptly. However, this topic is in the auditing section because auditing is a general practice for checking compliance, whereas monitoring is responding to events. Security monitoring allows us to respond to security threats, but it can be argued we cannot identify a threat without some audit checks. Furthermore, the monitoring section focuses mostly on operational monitoring, whereas the auditing section focuses mainly on security compliance. Discussing a security monitoring topic seemed most appropriate in this section.

Known Vulnerabilities

We should audit our application code and resources for known vulnerabilities. As time progresses and our application evolves, the security communities might discover vulnerabilities that affect our application and infrastructure. Fortunately, using serverless resources offloads most of the vulnerability monitoring to the cloud provider. However, this still leaves us responsible for addressing any vulnerabilities in our configuration. Configuration settings that were once deemed secure might now be considered less secure. Our application code might use packages where security researchers have reported a known vulnerability. Failure to mitigate known vulnerabilities might leave our application and infrastructure vulnerable to the attack vectors mentioned in security findings. Therefore, auditing for security vulnerabilities will help us ensure we maintain a current security posture.

Third-Party Solutions

We might want to consider using a third-party solution that provides additional auditing capabilities to our cloud provider solution. Third-party solutions might support multiple auditing frameworks, requirement definitions, and industry best practices. They might provide other reporting formats and capabilities. They might integrate with our services that support our business and development processes. The third-party solutions and their additional features might be worth considering.

Third-party solutions might be services or pieces of software. The benefits of a service are the features it provides and the reduced maintenance, but we risk storing security findings outside our cloud provider account. A data or cybersecurity breach of the third-party service might expose our weaknesses to an outside party, so we should address security findings as quickly as possible. Using a piece of software adds to our burden, but where we store the results is within our control. We can save the data using our internal processes or immediately purge the results after drafting a mitigation plan. We can adopt either approach or both depending on our business and technical requirements.

Our discussion about IAM privileges for our monitoring solution also applies to our auditing solution.

AWS

We will discuss how we can leverage the different AWS features and capabilities to audit our application and infrastructure.

AWS Config

We can use AWS Config to audit and monitor changes to our AWS resources. Config will identify our AWS resources and capture their current configuration. Whenever we add, modify, or delete a resource, Config will record the change in addition to who made it. We can create rules that define the desired configurations for our resources. Config uses those rules to notify us when a resource fails to comply with its related Config rules. Config also supports performing remediation to resolve findings. We can integrate Config with AWS CloudTrail to capture the API calls made to change a resource configuration. We should consider enabling Config to continuously audit and monitor our resources to ensure they maintain their desired configurations and comply with any business requirements and legal regulations.
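As a hedged example, the boto3 sketch below registers one AWS managed Config rule that flags S3 buckets allowing public read access; it assumes the Config recorder and delivery channel are already set up, and the rule name is illustrative.

import boto3

config = boto3.client("config")

# Assumes the AWS Config recorder and delivery channel are already enabled.
# This managed rule flags S3 buckets that allow public read access.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-bucket-public-read-prohibited",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED",
        },
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
    }
)

A rule like this would have caught the exposed customer-images bucket described earlier in the chapter.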

AWS CloudTrail

CloudTrail allows us to audit and monitor data and management operations. It will capture API calls made to our serverless resources (e.g., S3, Lambda, and KMS) and record detailed information about the requests (e.g., IAM role, time, and IP address). It also captures actions for creating, modifying, and deleting resources and records detailed information about each action. We should consider enabling CloudTrail in all AWS regions to identify any unusual activity and storing the audit records in S3 and CloudWatch Logs to comply with business policies and legal regulations.
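A minimal sketch of creating a multi-region trail with boto3 follows; the trail and bucket names are placeholders, and the bucket is assumed to already have a policy permitting CloudTrail to write to it.

import boto3

cloudtrail = boto3.client("cloudtrail")

# Trail and bucket names are placeholders; the bucket needs a policy that
# allows CloudTrail to deliver log files to it.
cloudtrail.create_trail(
    Name="org-audit-trail",
    S3BucketName="org-audit-trail-logs",
    IsMultiRegionTrail=True,
    EnableLogFileValidation=True,
)
cloudtrail.start_logging(Name="org-audit-trail")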

Amazon Macie

Amazon Macie is a service that analyzes, classifies, and protects our S3 data. Macie will use machine learning to analyze the S3 buckets to which we give it access. After analyzing the S3 objects, it will classify the data according to its sensitivity. It can monitor for suspicious user activity, such as a user downloading large quantities of sensitive data. We should consider using Macie when our application stores sensitive data in S3 and we need to protect that data per legal regulations or business requirements.

AWS IAM Access Analyzer

IAM Access Analyzer is a feature of AWS IAM. When enabled, it provides continuous monitoring of our IAM permissions. Whenever we create or modify a policy, Access Analyzer evaluates the IAM permissions for IAM roles, S3 buckets, KMS keys, SQS queues, and Lambda functions. It reports findings for permissions and policies that might be a security concern and shows the last usage time for services. We can use these findings to adjust any policy changes that weaken our security posture and remove access to resources that our users are no longer using. We should consider using IAM Access Analyzer to help us achieve the least-privileged IAM policies and roles.

Amazon GuardDuty

GuardDuty provides us with continuous security monitoring by analyzing user activity and API request data from CloudTrail, Amazon VPC Flow Logs, and Domain Name System (DNS) logs. It identifies unusual and suspicious behavior and reports findings of possible threats. We can integrate GuardDuty with CloudWatch Events and Lambda to perform automated remediation of the security threat findings. We should consider using GuardDuty to respond to security threats quickly.

AWS Security Hub

Security Hub allows us to aggregate our security findings into one location. It works with Amazon GuardDuty, Amazon Inspector, Amazon Macie, and several AWS partner solutions. It reports results in a standardized format, grouped and prioritized for ease of review, analysis, and response. We can use CloudWatch Events, Lambda, and AWS Step Functions for automated remediation. We should consider using Security Hub to consolidate our security findings in one location so we can review and respond to security findings more efficiently.

Azure

We will discuss how we can leverage the different Azure features and capabilities to audit our application and infrastructure.

Azure Policy

Azure Policy helps us consolidate our policies and their compliance data. We can configure policies to audit findings (i.e., record compliance or noncompliance) or enforce them (i.e., allow or prevent configuration settings). We can manually review noncompliance findings or use bulk remediation to set the resources to be compliant automatically.

Azure Security Center

Azure Security Center audits your Azure resources. It checks the resources against best practices, common misconfigurations, security policies, and regulatory requirements. It consolidates the findings into a score and provides a list of recommendations to remediate the findings. Security Center also uses analytics and machine learning to detect potential threats and give you recommendations to address them. We can send the findings to Azure Monitor for improved management.

Azure Advisor

Azure Advisor analyzes your Azure resources and usage data and provides recommendations to improve availability, security, performance, and cost. It consolidates the list of recommendations in one interface.

Google Cloud

We will discuss how we can leverage the different Google Cloud features and capabilities to audit our application and infrastructure.

Cloud Audit Logs

We can use Cloud Audit Logs, found in the IAM & Admin section, to detect unusual and suspicious activity. We can specify an audit configuration to capture audit logs when users perform data read, data write, and Google super admin operations on our resources.

IAM & Admin

IAM allows us to enable Access Transparency when we use an organization and have a role-based support package. Whereas Audit Logs capture the actions our team members perform, Access Transparency captures logs if and when Google personnel access our content.

Cloud Asset Inventory

Cloud Asset Inventory catalogs our resources and IAM policies by capturing metadata about them. We use the Cloud SDK CLI tool to work with that metadata. We can perform queries to search for our resources and IAM policies, export their metadata and history, and monitor changes. We can audit the current state of our resources and policies to make sure they are as we expect and that changes were made as intended. The monitoring allows us to get notified when someone modifies a resource or an IAM policy. We can also analyze our IAM policies to audit who has access to which resources.

Data Catalog

Data Catalog allows us to discover our data and organize it based on metadata. Data Catalog ingests data from BigQuery, Pub/Sub, and Cloud Storage and data outside Google Cloud via the Data Catalog API. We can search the metadata database to determine what types of data we have in our application. We might potentially find we have redundant data in multiple locations or have misplaced data in a location we did not expect.

Cloud Data Loss Prevention (DLP)

Cloud DLP allows us to scan our data and classify it based on classification rules. We can run scans on Cloud Storage, BigQuery, and Datastore to identify whether they contain any sensitive data. We can send the results to BigQuery, Security Command Center, and Data Catalog for review and analysis. We might find we have sensitive data stored in locations we do not expect. We can also send notifications via email or Pub/Sub to alert us of the results.

Security Command Center

Security Command Center centralizes our auditing functions by integrating with Cloud DLP and Cloud Audit Logs and generating compliance reports. It identifies security misconfigurations and vulnerabilities by identifying publicly exposed resources, insecure IAM configurations, firewalls with improper configurations, and compliance requirement findings. Security Command Center detects threats by monitoring Cloud Logging logs for unusual and suspicious activity and performing security scans for vulnerabilities in our web applications. The dashboard allows us to see all the findings in one location and apply the recommended remediations specified by the findings.

Alerting

Alerting allows us to receive notifications from our monitoring and auditing solutions. I chose to discuss alerting separately from monitoring and auditing to highlight that monitoring and auditing themselves may not notify us about the results. Monitoring and auditing may be performed either manually or automatically. A manual solution may not require alerting, whereas an automatic solution should require alerting because a person may not review the results within a reasonable time frame. Therefore, we should consider alerting as a separate pillar and configure alerts to provide timely and relevant notifications to the response team. We will review the general principles and provider-specific capabilities that can assist us in implementing alerting.

General Principles

How we configure alerting will vary depending on the application’s design, the business and security requirements, and the budget. Yet, there are some general principles we can consider implementing in our alerting solution.

Security-Related Notifications

Your organization should have a team that can respond to security-related events. Whether you have dedicated security teams or only developers, having individuals assigned to respond to security events allows your organization to respond more effectively. Your organization should have an email distribution list that includes individuals responsible for responding to security events.

After establishing the security email distribution list, we should set up our provider account and any third-party monitoring and alerting tools to send security-related notifications to that email list. This allows the security team to respond appropriately. The types of notifications this list might receive include upcoming deprecations, known vulnerabilities, and unusual and suspicious activities.

Operational Notifications

In addition to having a security email distribution list, we should have an email distribution list to catch all notifications and alerts. Whether an error happens in the system or we were notified of an upcoming outage, we should have a dedicated team to respond to these notifications. Without having this team, our application might stop functioning due to numerous errors or a planned outage, for example.

In addition to using an email distribution list, we can use provider and third-party tools. The providers have notification capabilities that can alert email addresses and mobile phones or trigger our serverless functions that perform custom notifications. Third parties allow us to add additional notification options (e.g., SMS messages, phone calls, mobile app push notifications, instant message notifications) that might better support the processes, workflows, and preferences our organizations might have. Whether we use the cloud provider tools or a third-party system, we should configure the notifications to alert the correct individuals promptly and with the appropriate level of information.

Verbosity

The notification verbosity is sometimes an art to configure. The level of verbosity defines how many notifications to send, the types of notifications, how often to send them, and how much information to include. Having too much verbosity may result in recipients ignoring alerts. Conversely, too little could mean not sending a notification when one is much needed.

The response team will initially respond to notifications, but over time they may become less responsive if they receive too many notifications or if the severity (or importance) of the notifications seems irrelevant or unimportant. The story of “The Boy Who Cried Wolf” provides a relatable lesson in notification verbosity. We need to ensure we send notifications when we want our response team to take action, and those notifications should include the proper level of urgency. This way, we avoid waking up a team member in the middle of the night for a lower urgency notification.

Let’s not confuse notification verbosity with verbosity in our logs and audit records. We should have more verbosity in logs and audit records to make sure we have ample information that can ultimately result in notifications. We cannot send error notifications if our application is not logging errors or logs insufficient error information. We cannot detect malicious activity if our auditing is not capturing different types of records. We cannot send notifications when our application behavior is deviating from standard patterns if we are not creating logs at different log levels. The more verbose our logs and audit records, the higher the probability of sending relevant notifications when the patterns start to deviate, but we must be careful not to log sensitive information, as discussed in the previous chapter.

Notification Destinations

We should evaluate where we want our alert notifications to go. We can configure sending our alerts to message queues, individual email addresses, email distribution lists, third-party services, or a combination of any of these. We should configure a destination that allows the response team to receive timely notifications and acknowledge receipt. Failure to receive timely notifications might result in a service outage or a security incident. Failure to acknowledge receipt may result in the recipient forgetting to address the notification or thinking another team member will look into it. Depending on the severity of the alert, it might be worthwhile to consider manually sending follow-up notifications to the user base and stakeholders when the issue or finding is still unresolved after a reasonable time. Wherever we decide to send the alert notifications, we should ensure they are being reviewed and addressed.

Third-Party Solutions

Similar to our discussion about third-party monitoring and auditing solutions, third-party alerting solutions might provide additional features. These solutions might integrate with phone and text messaging systems, ticketing and support systems, mobile apps, instant messaging platforms, and others. We might be able to avoid the IAM privileges issue by using webhooks, APIs, and other integrations that only need a way to trigger the alert and do not need access to our cloud provider account. These systems might have reporting capabilities where we can assess how quickly our response team addresses notifications. These systems might even support automatically closing a notification when our application and infrastructure report a restoration. The case for having a third-party solution (whether a service or software) might be more compelling than using a cloud provider solution because we potentially get increased reliability in our notifications when our cloud provider is experiencing service degradation.

AWS

We will discuss how we can leverage the different AWS features and capabilities to alert us about our monitoring events and audit findings. The AWS monitoring and auditing services integrate with the following services in one way or another for sending alert notifications.

Amazon Simple Notification Service (SNS)

SNS allows us to receive notifications from DynamoDB, CloudWatch Alarms, CloudWatch Events, CloudTrail, Config, Elasticsearch, and so on. We can forward these notifications to a Lambda function, SQS queue, or a webhook to send the notification to the appropriate response team. We can use a Lambda function that sends an email to an email distribution list using Amazon Simple Email Service. We can use the SQS queue to store the SNS notification message and process it programmatically. We can use a webhook to send the notification to other applications or third-party services. Multiple Lambda functions, SQS queues, and webhooks can subscribe to the SNS topic. This allows us to have numerous ways to respond to alerts.
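As a small illustration, the boto3 sketch below creates a topic and subscribes both a Lambda function and an HTTPS webhook to it; the topic name, function ARN, and URL are placeholders, and the Lambda function would still need a resource policy allowing SNS to invoke it.

import boto3

sns = boto3.client("sns")

# Topic name, Lambda ARN, and webhook URL are placeholders for illustration.
topic_arn = sns.create_topic(Name="security-alerts")["TopicArn"]

# Fan the same alert out to a Lambda function and an HTTPS webhook.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="lambda",
    Endpoint="arn:aws:lambda:us-east-1:123456789012:function:notify-response-team",
)
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="https",
    Endpoint="https://alerts.example.com/hooks/security",
)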

Amazon CloudWatch Alarms, CloudWatch Events, and EventBridge

Some services do not support sending alerts and notifications. CloudWatch Alarms, CloudWatch Events, and EventBridge help us bridge the gap. CloudWatch Alarms allows us to publish a message to an SNS topic when a CloudWatch metric exceeds a threshold and when it returns to within range. This allows us to monitor our serverless resources continuously. CloudWatch Events and EventBridge enable us to trigger a Lambda function or Step Functions state machine; publish a message to an SNS topic, SQS queue, or Kinesis stream; run a Systems Manager operation; and so on from a CloudTrail API call. This allows us to respond to unusual activity that CloudTrail identifies. We can use CloudWatch Alarms, CloudWatch Events, and EventBridge to help us achieve continuous monitoring and auditing.
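A hedged sketch of this pattern with boto3 follows: it creates a rule that matches CloudTrail-recorded S3 permission changes and targets an SNS topic. The rule name and topic ARN are placeholders, a trail must already be enabled for the "AWS API Call via CloudTrail" events to flow, and the topic needs a policy that lets EventBridge publish to it.

import json

import boto3

events = boto3.client("events")

# Alert whenever someone changes a bucket policy or ACL; names and ARNs are
# placeholders for illustration.
events.put_rule(
    Name="s3-permission-changes",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutBucketPolicy", "PutBucketAcl"],
        },
    }),
)
events.put_targets(
    Rule="s3-permission-changes",
    Targets=[
        {"Id": "ops-alerts", "Arn": "arn:aws:sns:us-east-1:123456789012:ops-alerts"}
    ],
)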

Azure

We will discuss how we can leverage the different Azure features and capabilities to alert us about our monitoring events and audit findings.

Azure Monitor

We can configure Azure Monitor to send alerts when our metrics exceed our metric thresholds. We will be able to see all the alerts in a centralized location where we can review, acknowledge, and query alerts. It will display the type of alert (metric or log), alert information, and the severity (information, warning, etc.). We also can query our alerts from Azure Resource Graph and an external system by using the Azure Alert Management REST API.

In addition to creating alerts, we can configure automated actions when we get an alert. The automated actions can call a webhook, launch an Azure Workbook, trigger an Azure Function, or start an Azure Logic App. Alerts enable us to improve our workflow.

We can integrate Azure Monitor with external monitoring and alerting systems (e.g., a security information and event management [SIEM] system). We can integrate with external systems by using Azure Event Hubs, Azure Partner integrations, or the Azure REST APIs.

Azure Service Health

Azure Service Health can automatically send us alerts via email, SMS, and push notification so we can respond to Azure service outages.

Google Cloud

We will discuss how we can leverage the different Google Cloud features and capabilities to alert us about our monitoring events and audit findings.

Google Cloud’s Operations Suite

The services in Google Cloud’s operations suite support sending alerts via email, the Google Cloud Console Mobile App, SMS, Pub/Sub, and webhooks. We can use webhooks and Pub/Sub topics to send alerts to third-party services.

Security Command Center

Security Command Center sends alerts via Gmail, SMSs, Pub/Sub, and Cloud Functions.

Cloud Pub/Sub

Many of the services we might use for monitoring and auditing support sending alerts to Pub/Sub topics. Some of these services include budget alerts, Cloud Monitoring, Security Command Center, and the operations suite. Pub/Sub will receive the alert and can distribute it to multiple targets (e.g., Cloud Functions, App Engine, and Firebase) where we can create user notifications and integrate into our workflows.

Cloud Functions

Whether our monitoring and auditing services support sending alerts to Cloud Functions or we trigger a Cloud Function from Pub/Sub, we can use them to send alerts. The Cloud Function can use an SDK, API, or any custom logic to send an alert to many types of notification capabilities and third-party solutions.

Key Takeaways

We established what monitoring, auditing, and alerting are and how they relate to each other. Monitoring allows us to assess how well our application is doing. By monitoring, we can detect performance issues and errors. Auditing allows us to identify potential security issues and noncompliance with legal regulations and business requirements. We discussed security monitoring in the auditing section because security issues are typically a result of a failure to comply with a security policy and best practices. We might liken security monitoring to continuous auditing. Alerting allows us to receive notifications when our monitoring and auditing detect performance issues, suspicious activity, and noncompliance. We can configure alerts to send notifications via email, SMS, push notification, and third-party solutions (e.g., instant message and ticketing systems). The monitoring and auditing systems may or may not provide alerting capabilities, or they may allow us to forward findings to services that will start the alerting process. Monitoring, auditing, and alerting are related and essential in understanding how our application functions and in finding and addressing performance and security issues.

We discussed the general principles for each topic. These general principles provided items and practices to consider when implementing our monitoring, auditing, and alerting solutions. There are many additional factors to consider (like business and technical requirements, cost and schedule, and workflows), but the general principles are a starting point. We followed the general principles by reviewing the different AWS, Azure, and Google Cloud services we can utilize to implement our solution. Some services provide basic features, and others provide advanced capabilities (e.g., machine learning and artificial intelligence). Whether to choose a basic or an advanced service will depend on many factors. Ultimately, we want to make sure we detect when our application is experiencing issues and when our security posture is weakening so we can send a notification to the appropriate people.
