Monitoring Services

The last piece of the puzzle is monitoring the services themselves. Having good monitoring in place is important for keeping any application healthy and available, and in a microservices architecture it is even more important. We not only need to be able to monitor what is going on inside a service, but also all the interactions between services and the operations that span them. When an anomaly occurs, this inter-service information is essential for understanding causality and finding the root cause. To achieve this, it’s important to embrace the following design principles:

Log aggregation and analytics

Use activity or correlation IDs

Consider an agent as an operations adapter

Use a common log format

In addition to these principles, standard monitoring tools and techniques, such as endpoint monitoring and synthetic user monitoring, should be used where appropriate.

Log Aggregation

A request into the system will often span multiple services, and it is important that we are able to easily view metrics and events for a request across all of them. There is always the question of how much to log to avoid over- or under-logging. A good starting point is to log at least the following (a short logging sketch follows this list):

Requestor name/ID: If a user initiates the request, it should be the user name. If a service initiates the request, it should be the service name.

Correlation ID: For more information, see the paragraph on Correlation ID.

Service flow: Log entry and exit points of a service for a given request.

Metrics: Log runtime performance of the service and its methods.
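
To make the service flow and metrics points concrete, here is a minimal Python sketch that logs the entry and exit of a service operation along with its elapsed time and correlation ID. The function and field names (handle_order, activityId, elapsedMs) are illustrative assumptions, not part of any particular framework.

import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("order-service")

def log_flow(func):
    """Log entry and exit of a service operation, plus its elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, activity_id=None, **kwargs):
        log.info(json.dumps({"event": func.__name__ + ".start", "activityId": activity_id}))
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - started) * 1000
            log.info(json.dumps({"event": func.__name__ + ".end",
                                 "activityId": activity_id,
                                 "elapsedMs": round(elapsed_ms, 2)}))
    return wrapper

@log_flow
def handle_order(order_id):
    # Placeholder for the real service logic.
    return {"orderId": order_id, "status": "accepted"}

handle_order("42", activity_id="1")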

With all this data, can you imagine finding something in the logs of one service and then having to go to another system to hunt for the related logs? Even for just a handful of services, this would be painful. This is not something we should spend our time doing, so it should be easy to query and view logs across all the systems.

There are tools that collect and aggregate logs across all the VMs and transfer them to a centralized store. For example, Logstash, an OSS tool by Elastic.co, or the Microsoft Azure diagnostics agent (which we will discuss later in the chapter) can be used to collect logs from all the nodes running our services and put them in a centralized store. There are also many good tools that help us visualize and analyze the data. One of the more popular end-to-end solutions is the ELK stack, which uses Elasticsearch as the data store, Logstash to transfer the logs, and Kibana to view them.

Correlation ID

In addition to collecting the logs, we need to be able to correlate them; that is, we need to be able to find the log entries associated with a given request. When a request lands on the application, we can generate an activity or correlation ID that uniquely identifies that request. This ID is then passed to all downstream service calls, and each service includes it in its logs. This makes it easy to find logs across all the services and systems used to process the request. If an operation fails, we can trace it back through the systems and services to help identify the source. In Figure 7.2, we can see how a correlation ID can be used to build a waterfall chart of requests to visualize the end-to-end processing time of an operation. This can be used to optimize transactions or identify bottlenecks.

FIGURE 7.2: Correlation ID used in an operation
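
As a rough illustration of this propagation, the Python sketch below reads a correlation ID from an incoming request’s headers, or generates one if the request arrived at the edge of the system, and attaches it to every downstream call. The header name X-Correlation-ID and the helper names are assumptions made for this example, not a requirement of any particular framework.

import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name for this example

def get_or_create_correlation_id(incoming_headers):
    """Reuse the caller's correlation ID, or mint a new one at the edge."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def outgoing_headers(correlation_id, extra=None):
    """Headers to attach to every downstream service call."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers

# A request arrives with no correlation ID, so one is generated here,
# written into every log line, and forwarded on each downstream call.
cid = get_or_create_correlation_id({})
print("activityId=" + cid + " entering checkout service")
downstream = outgoing_headers(cid, {"Accept": "application/json"})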

Operational Consistency

Freedom to use the technology of choice for each service has its benefits, but can present some operational challenges as well. The operations tools and teams now need to deal with a growing number of stacks and data stores, which can make it difficult to provide a common view of system health and monitoring. Every technology tends to deal with configuration, logging, telemetry, and other operational data a bit differently. Consider providing some consistency and standard operational interfaces for things like service registration, logging, and configuration management.

Netflix, for example, uses a project called Prana and a sidecar pattern, also sometimes referred to as a sidekick pattern, to ensure that type of consistency. The Prana service is an operations agent that is deployed to each virtual machine. The operations agent can manage things like configuration in a consistent manner across all the various services. Then the teams implementing the services can integrate with the agent through an adapter and still use whatever technology they want.
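
The following Python sketch illustrates only the adapter idea, not Prana’s actual interface: the service asks a hypothetical operations agent listening on localhost for its configuration over plain HTTP, so the service code stays independent of how the agent stores or refreshes that configuration. The URL, port, and response shape are assumptions for illustration.

import json
import urllib.request

# Hypothetical local agent endpoint; real agents such as Prana expose their
# own interfaces, so treat this URL and the response shape as assumptions.
AGENT_CONFIG_URL = "http://localhost:8078/config/checkout-service"

def load_config_from_agent(url=AGENT_CONFIG_URL, timeout=2):
    """Fetch service configuration from the co-located operations agent."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# The service only deals with a plain dictionary of settings; the agent
# handles registration, refresh, and any platform-specific details.
# config = load_config_from_agent()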


Note

The sidecar pattern refers to an application that is deployed alongside a microservice. Generally, the sidecar application is attached to its microservice just as a sidecar is attached to a motorcycle, hence the name “sidecar pattern.” For more information on the sidecar pattern and Netflix Prana, visit http://techblog.netflix.com/2014/11/prana-sidecar-for-your-netflix-paas.html.


Common Log Format

Collecting the logs and putting a correlation ID in each log entry is not enough; we need some consistency in our logs to properly correlate them. Imagine that we write a correlation ID like the following:

[ERROR] [2345] another message about the event – cid=1

Another microservices team implements their logging as follows:

{type:"info",thread:"2345",activityId:"1",message:"my message"}

There are a few things wrong with this. The differing formats of the logs are one problem, and the keys also vary between them. When we query the logs for cid=1, we will miss the logs from the second microservice because, although it logs the correlation ID, it calls the field something else. The same is true for the event timestamp. If one service logs it as “timestamp”, another as “eventdate”, and yet another as “@timestamp”, it becomes difficult to correlate these events, and time is a common property to correlate events on. Thus, for critical events like these, we need to make sure that every team uses the same key names, or at least consider normalizing the events with something like Logstash.
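
To illustrate the kind of normalization a pipeline such as Logstash can perform, the following Python sketch maps both of the formats shown above onto a single set of keys before the events reach the central store. The target key names (level, activityId, message) are just one possible choice.

import json
import re

# Matches the first, text-style format: "[ERROR] [2345] message – cid=1"
TEXT_PATTERN = re.compile(
    r"\[(?P<level>\w+)\] \[(?P<thread>\d+)\] (?P<message>.*?) [–-] cid=(?P<cid>\S+)")

def normalize(raw_line):
    """Map either log format onto one common set of keys."""
    raw_line = raw_line.strip()
    if raw_line.startswith("{"):
        event = json.loads(raw_line)
        return {"level": event.get("type"),
                "activityId": event.get("activityId"),
                "message": event.get("message")}
    match = TEXT_PATTERN.match(raw_line)
    if match:
        return {"level": match.group("level").lower(),
                "activityId": match.group("cid"),
                "message": match.group("message")}
    return {"level": "unknown", "activityId": None, "message": raw_line}

print(normalize('[ERROR] [2345] another message about the event – cid=1'))
# Keys are quoted below so the second sample parses as valid JSON.
print(normalize('{"type": "info", "thread": "2345", "activityId": "1", "message": "my message"}'))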


Note

Logstash is a very popular data pipeline for processing logs and other event data from a variety of systems, such as Windows and Linux. Logstash offers many plug-ins for other systems, like Elasticsearch, which make the log data easily searchable and consumable and make Logstash a great data pipeline for many scenarios.


In addition to the key names, the format and meaning of the event message need to be consistent for analysis. For example, a timestamp can refer to the time the event was written or to the time it was raised. Because timestamps are commonly used to correlate events during analysis, inconsistency here can skew results quite a bit.

Further, we need to determine which event properties should be consistent across the entire organization and use them in all the services in the company. We should think of this as a schema for our logs that has multiple parts defined at different scopes, to facilitate analysis of log events across the various organizational, application, and microservice boundaries.

For example, we might include something like the following in all our log events (a sample event is sketched after the list).

Timestamp with a key of ‘timestamp,’ the value in ISO 8601 format and as close to the time the event happened as possible

Correlation Identifier with a key of ‘activityId’ and a unique string

Activity start and end times, like ‘activity.start’ and ‘activity.end’

Severity/Level (warning, error) with a key of ‘level’

Event Identifier with a key of ‘eventid’

Process or service identifier, enabling us to track the event across services

Service name to identify the service that logged that event

Host Identifier with a key of ‘nodeId’ and a unique string value of the machine name
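
The following sketch shows how these fields might come together in a single event. The key names follow the list above where the list specifies them; the ‘service’ and ‘message’ keys and the helper function are assumptions for illustration.

import json
import socket
from datetime import datetime, timezone

def make_log_event(level, eventid, message, activity_id, service):
    """Build a log event using the shared key names agreed on above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, close to event time
        "activityId": activity_id,   # correlation ID from the incoming request
        "level": level,
        "eventid": eventid,
        "service": service,          # assumed key name for the service name
        "nodeId": socket.gethostname(),
        "message": message,
    }

event = make_log_event(
    level="error",
    eventid="checkout.payment.failed",
    message="payment provider timed out",
    activity_id="example-correlation-id",
    service="checkout-service",
)
print(json.dumps(event))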

As services will be developed by multiple teams across an organization, having these agreed upon, documented, shared, and evolved together is important.

A good example of the importance of enforcing a common log format is Azure itself. Some Azure services depend on each other; for example, Azure Virtual Machines rely on Azure Storage. If there is an issue with Azure Storage, it can affect Azure Virtual Machines, so it is important for Azure Virtual Machine engineers to be able to trace an issue back to its root cause. Azure Storage components log their data to the overall Azure diagnostics system following a common format and correlation rules, so an Azure Virtual Machine engineer can easily trace an issue back to Azure Storage just by searching for the correlation ID.

Additional Logging Considerations

In addition to the preceding recommendations, we should also consider the following points in our logging strategy:

Log what is working: This can be a simple message to say the service launched successfully, or a regular heartbeat to indicate the service is alive (a minimal heartbeat sketch follows this list). This might not seem necessary, but it is actually critical for monitoring.

Log what is not working and error details: In addition to the exception message and stack trace, semantic information such as the user request, the transaction, and so on greatly helps initial triage and further root-cause analysis.

Log performance data on the critical path: Performance metrics directly reflect how well the services are running. By aggregating percentiles of these metrics, you can easily spot system-wide long-tail performance issues.

Have a dedicated store for logging data: If your application uses the same store as your log data, issues with one can affect the other.

Log “healthy” state data: This will enable you to create a baseline of how things look during normal operations.
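
As a minimal sketch of the “log what is working” point, the loop below emits a periodic heartbeat event. The interval, logger name, and message shape are assumptions chosen for illustration.

import json
import logging
import threading
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-service")

def start_heartbeat(interval_seconds=30):
    """Emit a periodic heartbeat so monitoring can baseline healthy behavior."""
    def beat():
        while True:
            log.info(json.dumps({"eventid": "service.heartbeat",
                                 "service": "checkout-service",
                                 "status": "alive"}))
            time.sleep(interval_seconds)
    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return thread

# start_heartbeat()  # call once at service startup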

How to Log Application Data from Within Containers

Now that we have an idea of how to structure our logs and what to log, it’s time to look at how we can get the log data out of the containers.

For Docker containers, STDOUT and STDERR are the two channels that pump the application logs out of containers. This obviously requires the application to write logs to either STDOUT or STDERR.
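
Because Docker only sees what the process writes to STDOUT and STDERR, the service’s logger should target those streams rather than a file inside the container. A minimal Python configuration might look like the following; the JSON-like format string is a simplification for illustration.

import logging
import sys

# Send all application logs to STDOUT so Docker's logging machinery can pick them up.
# The JSON-like format below is simplistic; messages containing quotes would break it.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format='{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}',
)

logging.getLogger("checkout-service").info("service started successfully")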

Docker v1.6 added a logging driver feature that can send the STDOUT and STDERR messages from the container directly to other channels specified by the driver. These include the host’s syslog, Fluentd, journald, Graylog, and more.


Note

Find more information on supported logging drivers at http://docs.docker.com/reference/logging/overview.

