Logging and monitoring
The logging and monitoring tools used in IBM Cloud Private are at the core of how users interact with their application log data and metrics. The Elasticsearch, Logstash and Kibana (also called ELK) stack is a suite of open source tools that provides extensive log capture, retention, visualization and query support, and is the primary way for users to interact with their application log data. Alertmanager, Prometheus and Grafana form another suite of open source tools that gives the user powerful capabilities to query metrics for their application containers and raise alerts when something isn’t quite right.
This chapter explores each of these components in depth, describing their function in an IBM Cloud Private cluster and how the logging and monitoring systems can be leveraged to cover a range of common use cases when used with IBM Cloud Private.
This chapter has the following sections:
5.1 Introduction
This section will provide an overview and describe the main functions of each of the components within the logging and monitoring tools used in IBM Cloud Private. It will discuss the importance of each role and how each technology plays a key part in providing the user with all the tools necessary to store, view, query and analyze log data and performance metrics for their deployed application containers.
5.1.1 Elasticsearch, Logstash and Kibana
Elasticsearch, Logstash and Kibana are the three components that make up the ELK stack. Each component has a different role, but the three are tightly integrated to enable application log analysis, visualization and RESTful access to the data generated by the whole IBM Cloud Private platform. The ELK stack is coupled with a Filebeat component that collects the raw log data from each node in the cluster.
Elasticsearch
Elasticsearch is a NoSQL database that is based on the Lucene search engine. Elasticsearch in IBM Cloud Private has three main services that process, store and retrieve data: the client, master and data nodes.
The client (also known as a ‘smart load-balancer’) is responsible for handling all requests to Elasticsearch. This is a separation of duty from the master node; using a separate client improves stability by reducing the workload on the master.
The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster and deciding which shards to allocate to which nodes.
Data nodes hold the shards that contain the documents you have indexed. Data nodes handle data related operations like CRUD, search and aggregations. These operations are I/O and memory intensive. It is important to monitor these resources and to add more data nodes if they are overloaded. The main benefit of having dedicated data nodes is the separation of the master and data roles to help stabilise the cluster when under load.
Logstash
Logstash is a log pipeline tool that accepts inputs from various sources, executes different transformations, and exports the data to various targets. In IBM Cloud Private, it acts as a central input point for different log collectors, such as Filebeat, to rapidly buffer and process data before sending it to Elasticsearch. Logstash can be configured to output data not just to Elasticsearch, but to a whole suite of other products to suit most external log analysis software.
Kibana
Kibana is an open source analytics and visualization layer on top of Elasticsearch that allows end users to perform advanced data analysis and visualize their data in a variety of charts, tables and maps. The Lucene search syntax allows users to construct complex search queries for advanced analysis, and the results can feed into the visualization engine to create dynamic dashboards for a real-time view of log data.
Filebeat
Filebeat is a lightweight shipper for forwarding and centralizing log data. Filebeat monitors the specified log locations to collect log events and data from containers running on a host and forwards them to Logstash.
5.2 IBM Cloud Private Logging
The ELK stack plays a key role in an IBM Cloud Private cluster, as it acts as a central repository for all logging data generated by the platform and the only method to access log data without accessing the Kubernetes API server directly. This section will explore how the whole logging system works and how it can be used effectively to satisfy several common use cases seen by Cluster Administrators when faced with configuring or customizing IBM Cloud Private to suit their requirements for viewing and storing application log data.
5.2.1 ELK architecture
Figure 5-1 shows the architecture overview for the IBM Cloud Private platform ELK stack and the logical flow between the components.
Figure 5-1 ELK high level overview
In IBM Cloud Private, the platform logging components are hosted on the management nodes, with the exception of Filebeat, which runs on all nodes, collecting log data generated by Docker. Depending on the cluster configuration, there may be multiple instances of each component. For example, in a High Availability (HA) configuration with multiple management nodes, multiple instances of the Elasticsearch components are spread out across these nodes. The overview in Figure 5-2 shows how the Elasticsearch pods are spread across management nodes.
 
Figure 5-2 Logging on IBM Cloud Private with multiple management nodes
Each of the components are configured to run only on management nodes and, where possible, spread evenly across them to ensure that the logging service remains available in the event a management node goes offline.
5.2.2 How Elasticsearch works
This section will explore how raw log data is collected by Filebeat and transformed into an Elasticsearch document ready for analysis and querying. Figure 5-3 shows an overview of the process from Filebeat collecting the data, to Elasticsearch storing it in an IBM Cloud Private environment with multiple management nodes.
 
Figure 5-3 Elasticsearch data flow overview with multiple management nodes
Collecting the data
On every IBM Cloud Private cluster node, an instance of Filebeat is running. The Filebeat pods are controlled by a DaemonSet that actively keeps at least one Filebeat pod running on each cluster node to ensure all nodes are collecting log data across the whole cluster. By default, ELK is configured to be cluster-wide, monitoring all namespaces on all nodes.
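To confirm that the collector is present on every node, the Filebeat DaemonSet and its pods can be listed with standard kubectl commands (the grep filter below is illustrative; the exact object names depend on the release):

kubectl -n kube-system get daemonsets | grep filebeat
kubectl -n kube-system get pods -o wide | grep filebeat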
All containers running on a host write out data to stdout and stderr, which is captured by Docker and stored on the host filesystem. In IBM Cloud Private, Docker is configured to use the json-file logging driver, which means that Docker captures the standard output (and standard error) of all the containers and writes them to the filesystem in files using the JSON format. The JSON format annotates each line with its origin (stdout or stderr) and its timestamp and each log file contains information about only one container. For each container, Docker stores the JSON file in a unique directory using the container ID. A typical format is /var/lib/docker/containers/<container-id>/<container-id>-json.log.
Each of these files is also referenced by a symlink at /var/log/pods/<uid>/<container-name>/<number>.log.
Filebeat continuously monitors the JSON log data for every container, but it knows nothing about container IDs and does not query Kubernetes for them. Instead, the kubelet service creates a further series of symlinks pointing to the correct location (useful for centralized log collection in Kubernetes), and Filebeat retrieves the container logs from the host filesystem at /var/log/containers/<pod-name>_<namespace>_<container-name>-<container-id>.log. In the Filebeat configuration, this filepath is used to retrieve log data from all namespaces using the wildcard filepath /var/log/containers/*.log, but it’s also possible to configure more precise filepaths for specific namespaces.
As a reference, it’s worth noting that this filepath is not the same as the one generated by Docker. Using the ls -l command shows that /var/log/containers/<pod-name>_<namespace>_<container-name>-<container-id>.log is a symlink to /var/log/pods/<uid>/<container-name>/<index>.log. Following the trail, the <index>.log file is itself another symlink, finally arriving at /var/lib/docker/containers/<container-id>/<container-id>-json.log. Example 5-1 shows the symlinks for the icp-mongodb-0 container in a real environment.
Example 5-1 Container log symlink trail
[root@icp-boot ~]# ls -l /var/log/containers/
...
lrwxrwxrwx 1 root root 66 Feb 11 09:44 icp-mongodb-0_kube-system_bootstrap-c39c7b572db78c957d027f809ff095666678146f8d04dc102617003f465085f2.log -> /var/log/pods/96b4abef-2e24-11e9-9f38-00163e01ef7a/bootstrap/0.log
...
[root@icp-boot ~]# ls -l /var/log/pods/96b4abef-2e24-11e9-9f38-00163e01ef7a/bootstrap/
lrwxrwxrwx 1 root root 165 Feb 11 09:44 0.log -> /var/lib/docker/containers/c39c7b572db78c957d027f809ff095666678146f8d04dc102617003f465085f2/c39c7b572db78c957d027f809ff095666678146f8d04dc102617003f465085f2-json.log
Filebeat consists of two components: inputs and harvesters. These components work together to tail files and send event data to a specific output. A harvester is responsible for reading the content of a single file. It reads each file, line by line, and sends the content to the output. An input is responsible for managing the harvesters and finding all sources to read from. If the input type is log, the input finds all files on the drive that match the defined glob paths and starts a harvester for each file.
Filebeat keeps the state of each file and, if the output (such as Logstash) is not reachable, keeps track of the last lines sent so it can continue reading the files as soon as the output becomes available again, which improves the overall reliability of the system. In IBM Cloud Private, Logstash is pre-configured as an output in the Filebeat configuration, so Filebeat is actively collecting logs from all cluster nodes and sending the data to Logstash.
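As an illustration of this pattern only (not the platform’s actual ConfigMap), a minimal Filebeat configuration with a log input watching the kubelet symlink path and a Logstash output might look like the following sketch; the namespace-scoped path and the Logstash host and port are assumptions for demonstration purposes:

filebeat.prospectors:
- input_type: log
  paths:
  - /var/log/containers/*.log                      # all namespaces (default)
  # - /var/log/containers/*_mynamespace_*.log      # example: restrict to one namespace
  symlinks: true                                   # follow the kubelet-created symlinks

output.logstash:
  hosts: ["logstash:5044"]                         # illustrative service name and port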
Filtering and sending the data
Logstash has three main stages: inputs, filters, and outputs. The input stage is the means by which Logstash receives data. It can be configured to receive data from a number of sources, such as the file system, Redis, Kafka, or Filebeat. The filter stage is where the inbound data from Filebeat is transformed to extract certain attributes, such as the pod name and namespace, remove sensitive data such as the host, and drop empty lines from the log file to avoid unnecessary processing in Elasticsearch. Logstash has many different outputs and available plug-ins. A full list of outputs can be found at https://www.elastic.co/guide/en/logstash/5.5/output-plugins.html. In IBM Cloud Private, the default output is Elasticsearch. Logstash sends the data, along with the index name (which defaults to logstash-<year.month.day>), to Elasticsearch to process and store.
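As an illustration only (not the platform’s shipped pipeline), a Logstash configuration following this input, filter, and output pattern could look like the sketch below; the filter logic and the Elasticsearch host are assumptions:

input {
  beats {
    port => 5044                                # receive events from Filebeat
  }
}
filter {
  if [message] =~ /^\s*$/ {
    drop { }                                    # drop empty lines
  }
  mutate {
    remove_field => [ "host" ]                  # strip sensitive fields
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]             # illustrative host
    index => "logstash-%{+YYYY.MM.dd}"          # daily index, matching the default naming
  }
}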
Indexing and storing the data
When Elasticsearch receives a new request, it gets to work processing and storing the data, ready for searching at a later stage. When discussing how it stores and searches the data it holds, it’s important to understand a few key terms, such as indices, shards and documents.
First and foremost, Elasticsearch is built on top of Lucene. Lucene is Java-based information retrieval software primarily designed for searching text-based files. Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an ‘index’. In this context, an index is a record of all the instances in which a keyword exists. To explain the theory, a typical example is to think about searching for a single word in a book where all the attributes of the word are ‘indexed’ at the back of the book, so you know on which page, line and letter number that word exists, for all instances of the word. This act of ‘reverse search’ is called an inverted index. As the name suggests, it is the inverse of a forward index, where a word would be searched starting from the front of the book and sequentially working through all the pages. So where a forward index resolves all instances of a word by searching pages to find words (pages > words), the inverted index uses a data-centric approach and searches words to find pages (words > pages). This is how data is retrieved from text-based files more quickly than with a traditional database, where a single query sequentially searches the database record by record until it finds a match.
In Elasticsearch, an index is a single item that defines a collection of shards, and each shard is an instance of a Lucene index. A shard is the basic scaling unit for an index, designed to sub-divide the whole index into smaller pieces that can be spread across data nodes to prevent a single index exceeding the limits of a single host. A Lucene index consists of one or more segments, which are also fully functioning inverted indexes. The data itself is stored in a document, which is the top-level serialized JSON object (with key-value pairs) stored in the index; each document is indexed and routed to a segment for searching. Lucene searches each of these segments and merges the results, which are returned to Elasticsearch.
Figure 5-4 Anatomy of an Elasticsearch index
Elasticsearch also provides the capability to replicate shards, so ‘primary’ and ‘replica’ shards are spread across the data nodes in the cluster. If a data node hosting a primary shard goes down, the replica is promoted to primary, thus still able to serve search queries.
Figure 5-5 ELK primary-replica spread
IBM Cloud Private, by default, sets 5 shards per index and 1 replica per shard. This means that in a cluster holding 300 indices the system will, at any one time, host 3000 shards (1500 primary + 1500 replicas). To verify this, Example 5-2 uses an index to check the number of shards, replicas, and the index settings.
Example 5-2 Verifying number of shards per index
#Get index settings
GET /logstash-2019.03.03/_settings
 
{
  "logstash-2019.03.03": {
    "settings": {
      "index": {
        "creation_date": "1551791346123",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "xneG-iaWRiuNWHNH6osL8w",
        "version": {
          "created": "5050199"
        },
        "provided_name": "logstash-2019.03.03"
      }
    }
  }
}
 
#Get the index
GET _cat/indices/logstash-2019.03.03
 
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open logstash-2019.03.03 xneG-iaWRiuNWHNH6osL8w 5 1 1332 0 2.8mb 1.4mb
 
#Get the shards for the index
GET _cat/shards/logstash-2019.03.03
 
index shard prirep state docs store ip node
logstash-2019.03.03 3 p STARTED 251 363.3kb 10.1.25.226 logging-elk-data-1
logstash-2019.03.03 3 r STARTED 251 319.3kb 10.1.92.13 logging-elk-data-0
logstash-2019.03.03 2 p STARTED 219 294.8kb 10.1.25.226 logging-elk-data-1
logstash-2019.03.03 2 r STARTED 219 225.1kb 10.1.92.13 logging-elk-data-0
logstash-2019.03.03 4 p STARTED 263 309.4kb 10.1.25.226 logging-elk-data-1
logstash-2019.03.03 4 r STARTED 263 271.2kb 10.1.92.13 logging-elk-data-0
logstash-2019.03.03 1 p STARTED 230 599.1kb 10.1.25.226 logging-elk-data-1
logstash-2019.03.03 1 r STARTED 230 324.9kb 10.1.92.13 logging-elk-data-0
logstash-2019.03.03 0 p STARTED 216 283.7kb 10.1.25.226 logging-elk-data-1
logstash-2019.03.03 0 r STARTED 216 340.8kb 10.1.92.13 logging-elk-data-0
Each time Logstash sends a request to Elasticsearch, a new index is created if it does not already exist; otherwise, the existing index is updated with additional documents. The Elasticsearch data pod is responsible for indexing and storing data, so during this time the CPU utilization, memory consumption and disk I/O will increase.
5.2.3 Default logging configuration
 
Attention: Throughout this chapter, there are references to using the ibm-icplogging-2.2.0 Helm chart for helm upgrade or helm install commands. This chart can be used in a variety of ways, but the examples in this chapter use a locally stored copy of the Helm chart. You can retrieve this by using the following methods:
1. Use wget --no-check-certificate https://mycluster.icp:8443/mgmt-repo/requiredAssets/ibm-icplogging-2.2.0.tgz to download the file locally, replacing mycluster.icp with your cluster name.
2. Add the mgmt-charts repository to your local Helm repositories by using helm repo add icp-mgmt https://mycluster.icp:8443/mgmt-repo/charts --ca-file ~/.helm/ca.pem --key-file ~/.helm/key.pem --cert-file ~/.helm/cert.pem. Replace mycluster.icp with your cluster name. The chart can then be referenced using icp-mgmt/ibm-icplogging --version 2.2.0 in place of the ibm-icplogging-2.2.0.tgz file. Helm should be configured to access the cluster.
The default logging configuration is designed to be a baseline and it provides the minimum resources required to effectively run a small IBM Cloud Private cluster. The default resource limits are not the ‘production ready’ values and therefore the Cluster Administrator should thoroughly test and adjust these settings to find the optimal resource limits for the workloads that will be running on the production environment. IBM Cloud Private Version 3.1.2 logging installs with the following resource limits by default (See Table 5-1).
Table 5-1 Default ELK resource limits
Name        CPU    Memory
client      -      1.5GB (1GB Xmx/Xms)
master      -      1.5GB (1GB Xmx/Xms)
data        -      3GB (1.5GB Xmx/Xms)
logstash    -      1GB (512MB Xmx/Xms)
filebeat    -      -
 
Tip: Some users experience high CPU utilization by the Java processes on the host. This is because no CPU limit is specified on the containers, allowing them to consume all the available host CPU if necessary. This is intentional, and setting limits may impact the ELK stack stability. It is worth noting that high CPU utilization may be an indication of memory pressure due to garbage collection and should be investigated.
100 GB of storage via a LocalVolume PersistentVolume (PV) is allocated to the Elasticsearch data pods, which resides at /var/lib/icp/logging on the management node filesystem. Each PV has affinity in place so that the PV is bound to one management node only, to ensure consistency across data nodes. Each Elasticsearch data node also has affinity rules in place so that only one data pod runs on one management node at any one time.
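The logging PVs and their bindings can be inspected with standard kubectl commands; the grep filters are illustrative:

kubectl get pv | grep logging
kubectl -n kube-system get pvc | grep logging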
The size of the Elasticsearch cluster deployed in an environment entirely depends on the number of management nodes in the cluster. The default number of Elasticsearch master and data pods is calculated based on the available management nodes, using the following rules:
One Elasticsearch data pod per IBM Cloud Private management node
Number of Elasticsearch master pods is equal to number of management nodes
One Logstash pod per IBM Cloud Private management node
One Elasticsearch client pod per IBM Cloud Private management node
Elasticsearch client and Logstash replicas can be temporarily scaled as required using the default Kubernetes scaling methods. If any scaling is permanent, it’s recommended to use the Helm commands to update the number of replicas.
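For example, a temporary scale-out of the client and Logstash deployments (using the deployment names referenced later in this chapter) could be performed as follows:

kubectl -n kube-system scale deployment logging-elk-client --replicas=2
kubectl -n kube-system scale deployment logging-elk-logstash --replicas=2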
Data retention
The default retention period for logs stored in the platform ELK stack is 24 hours. A curator is deployed as a CronJob that removes logstash indices older than the retention period from Elasticsearch every day at 23:30 UTC. If Vulnerability Advisor is enabled, another CronJob runs at 23:59 UTC to remove the indices related to Vulnerability Advisor that are older than 7 days.
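The curator schedules can be confirmed by listing the CronJobs in the kube-system namespace:

kubectl -n kube-system get cronjobs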
Modifying the default retention period without proper capacity planning may be destructive to the ELK stack. Increasing the retention period will increase the resources required to search and store the data in Elasticsearch, so ensure the cluster has the required resources to be able to do so. For more information about resource allocation, see “Capacity planning” on page 164.
Configuring data retention during installation
It’s possible to specify the default data retention period before installing IBM Cloud Private by adding the curator configuration to the config.yaml. Use Example 5-3 to set a default index retention of 14 days.
Example 5-3 Curator config in config.yaml
logging:
  curator:
    schedule: "30 23 * * *"
    app:
      unit: days
      count: 14
Configuring data retention after installation
The recommended way to permanently increase the default retention period is to use a helm upgrade command, passing the new value as a parameter. This ensures that any future chart or IBM Cloud Private upgrades do not overwrite the value with the default value used during chart installation. To update the retention period to 14 days using helm upgrade from the command line, use helm upgrade logging ibm-icplogging-2.2.0.tgz --reuse-values --recreate-pods --set curator.app.count=14 --force --no-hooks --tls
For testing purposes, the default retention period is easily customized by modifying the logging-elk-elasticsearch-curator-config ConfigMap. To modify the retention period from 1 day to 14 days, edit the logging-elk-elasticsearch-curator-config ConfigMap and modify the unit_count in the first action, named delete_indices. The result should look similar to Example 5-4.
Example 5-4 Curator modified configuration
actions:
  1:
    action: delete_indices
    description: "Delete user log indices that are older than 1 days. Cron schedule: 30 23 * * *"
    options:
      timeout_override:
      continue_if_exception: True
      ignore_empty_list: True
      disable_action: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 14
After saving and closing the file, the new curator configuration will be automatically reloaded and indices will be retained for 14 days.
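For reference, the ConfigMap can be opened for editing with a standard kubectl command:

kubectl -n kube-system edit configmap logging-elk-elasticsearch-curator-config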
Also within this ConfigMap, cleanup actions are provided for the Vulnerability Advisor indices.
5.2.4 ELK security
The platform ELK is deployed with mutual TLS security enabled by default, using Search Guard to provide PKI. It’s possible to disable this during IBM Cloud Private installation by adding the following to the config.yaml:
logging:
  security:
    enabled: false
If you already have a valid X-Pack license, you can use the X-Pack security features instead by adding the following to the config.yaml at installation time:
logging:
  security:
    provider: xpack
The Certificate Authority (CA) is created during installation, but the IBM Cloud Private installer offers the capability to supply your own CA that will be used for all other certificates in the cluster. For more information, see https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.1.2/installing/create_cert.html
5.2.5 Capacity planning
Capacity planning for Elasticsearch is one of the most overlooked topics and one of the most common causes of a failing Elasticsearch cluster. Lack of resources is usually the root cause of failures at cluster installation, when audit logging is enabled, or during random surges of log traffic. Allocating sufficient resources towards the capture, storage and management of logging and metrics is crucial, especially under stressful conditions. No universal, cost-effective recommendation for the capture, storage and management of logs and metrics is available, but this section provides some insights based on observations of workload behavior in IBM Cloud Private.
The default configuration should not be relied upon to meet the needs of every deployment. It is designed to provide a baseline that the Cluster Administrator should use as a starting point to determine the resources required for their environment. There are a number of factors that affect the logging performance, mostly centered around CPU and memory consumption. Elasticsearch is based on Lucene, which runs in a JVM, and therefore requires sufficient memory to be allocated to the JVM heap. Elasticsearch will consume the available heap for its operations, which can lead to out-of-memory errors if the heap size is not sufficient for all the indexing, searching and storing that the Lucene engine is trying to do. By default, there are no CPU limits on the Elasticsearch containers, as the requirements vary depending on workload. On average, the CPU load is generally low, but it rises significantly during active periods where heavy indexing or searches are taking place and can consume all the available CPU from the host. Restricting the CPU usage to only a few cores creates too much of a backlog of logs to process, increasing the memory usage and ultimately resulting in an unusable Elasticsearch cluster during this time. Therefore, it is almost impossible to predict the required resources for every use case, and careful analysis should be made in a pre-production environment to determine the required configuration for the workload that will be running, plus additional margins for spikes in traffic.
It is not uncommon for the logging pods to be unresponsive immediately after installation of an IBM Cloud Private cluster, especially when left at the default configuration. In a basic installation, whilst all pods are starting up, they are producing hundreds of logs per second that all need to be catered for by the logging pods, which are also trying to start up at the same time. During cluster installation, an observed average across several installations by the development team was around 1500 messages per second and it takes around 30 minutes for the logging platform to stabilise with the normal rate of 50-100 messages per second in a lightly active cluster. When Vulnerability Advisor is enabled, the rate during installation can rise to observed rate of around 2300 messages per second, taking Elasticsearch up to 90 minutes to fully stabilise. When audit logging is enabled, the default resources are not sufficient and if audit logging will be enabled during cluster installation, it’s recommended that increased resource limits for logging are applied at the same time, using the config.yaml.
Estimating the required CPU and memory
Estimating the CPU and memory required for any environment really comes down to one thing: experience. Without knowing what applications will be running, how much data the application produces, the rate at which data is produced, and so on, it’s difficult to know what resources are required. This information plays a valuable role in setting the values for a production cluster, and these values must be determined by testing the workload and analyzing the data in a pre-production cluster. There are several tools at hand that enable analysis of the current resource consumption. Prior to analyzing how much storage is being consumed by an index, it’s important to ensure that a realistic workload is deployed, preferably with the ability to test failover or other tests that can simulate a rise in log traffic, to allow visibility of the additional resource margins required.
X-Pack monitoring
IBM Cloud Private logging comes with a trial license for X-Pack by default, but the trial functionality is not enabled during deployment. The trial is aimed at users who need more advanced capabilities and may eventually purchase the full X-Pack license. Information about X-Pack can be found at https://www.elastic.co/guide/en/x-pack/current/xpack-introduction.html, as it is not covered in this chapter. However, for the purpose of estimating the logging requirements, the X-Pack monitoring can be enabled.
To enable the X-Pack monitoring, if it is not already enabled at installation time, use the helm upgrade command.
The logging Helm chart is located in a Helm repository called mgmt-charts. Instead of adding the mgmt-charts repository to the local machine, the URL of the chart can be used directly.
Run the helm upgrade command and set the xpack.monitoring value to true. You’ll need to pass the default installation values in this command too, found in the cluster installation directory.
helm upgrade logging ibm-icplogging-2.2.0.tgz --reuse-values --recreate-pods --set xpack.monitoring=true --force --tls
 
Attention: Using helm upgrade can result in some down time for the logging service while it is replaced by new instances, depending on what resources have changed. Any changes made to the current logging configuration that was not changed through Helm will be lost. In some cases, the cluster may be unresponsive and requires initialization by sgadmin. This is a known issue, so if it occurs follow the IBM Knowledge Center steps at https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.1.2/getting_started/known_issues.html.
Restriction: The helm upgrade command may not run without first deleting the logging-elk-kibana-init job. The use of --force in the above command typically removes the need for this, but if you receive an error related to the logging-elk-kibana-init job, delete this prior to running helm upgrade using kubectl -n kube-system delete job logging-elk-kibana-init.
Once the upgrade is complete, the monitoring data is available in Kibana. The monitoring dashboards present a wealth of information about the Elasticsearch cluster, including valuable information about how much log data is running through it.
Figure 5-6 Kibana monitoring
Navigate to the Monitoring → Elasticsearch → Nodes section. All the nodes that make up the current Elasticsearch cluster are displayed, along with extremely useful data about JVM memory consumption and CPU usage per node. Assuming that the workload in the current cluster is realistic, it’s easier to see whether the current resource allocation for the management node CPU cores, or the JVM heap, is enough to handle the current workload plus additional spikes.
Figure 5-7 Monitoring dashboard for Elasticsearch nodes
Currently, in the above cluster, JVM memory usage for the data nodes is around 60%, which is suitable for the current environment, leaving a 40% overhead to handle temporary excess log data. If this value was consistently around 80% or 90%, it would be beneficial to double the allocated resources in the logging-elk-data StatefulSet. The current data nodes’ CPU usage is around 30% and the maximum value the nodes have hit is around 60%, so for the current workload, the CPU cores assigned to the management nodes are sufficient. In all cases, these values should be monitored for changes and adjusted as required to keep the logging services in a stable state. High CPU usage can be an indicator of memory pressure in the data nodes due to garbage collection processes.
Navigate to the Monitoring → Elasticsearch → Indices section. Here, all the indices currently stored in the system are displayed, giving a good overview of the storage taken up by each index.
Figure 5-8 Monitoring indices list
Note that this may only show the current and previous day’s indices, because the curator cleans up indices older than the default 24 hours (unless configured differently). Selecting the previous day’s index provides information about how much storage the index is using, the rate at which data is indexed, growth rate over time and other metrics. Assuming that these values are realistic representations of the actual log data the production workloads would produce, it becomes easier to determine whether or not the current storage is sufficient for the length of time the data should be retained. Figure 5-8 shows that 6 days’ worth of data retention has yielded around 30GB of storage, so the default value of 100GB is sufficient, for now.
The advanced dashboard provides further analytics and insights into some key metrics that enable better fine tuning of the cluster. Tuning Elasticsearch for better performance during indexing and querying is a big subject and is not covered in this book. More information about tuning Elasticsearch can be found in the Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/5.5/general-recommendations.html.
Prometheus monitoring
Elasticsearch exposes metrics to Prometheus, which can be viewed using a default Grafana dashboard (Figure 5-9 on page 168), or queried using the Prometheus User Interface (UI). An Elasticsearch dashboard is provided by the development team out-of-the-box with IBM Cloud Private Version 3.1.2 to easily view various metrics about the Elasticsearch cluster, similar to those provided by the X-Pack monitoring.
 
Figure 5-9 Grafana Elasticsearch dashboard
The Prometheus UI is also available, but it is used to submit Prometheus search queries and return the raw results, with no additional layers on top. Access the Prometheus UI at https://<icp-cluster-ip>:8443/prometheus; this displays the graph page, allowing you to input queries and view the results as values or as a time-series graph.
There are numerous metrics available to use, from both the standard Prometheus container metrics scraped from Kubernetes containers and those provided by the Elasticsearch application. As an example, start typing ‘elasticsearch_’ into the search box and a list of available Elasticsearch-specific metrics will show in the drop-down list. It’s worth noting that Grafana also uses these metrics to construct the Elasticsearch dashboard mentioned previously, so all the same data is available here too.
As a starting point, use the query in Example 5-5 to view the current memory usage as a percentage of all master, data, client and logstash pods. Copy the query to the search box, select Execute and then Graph, to see the time-series data graphically.
Example 5-5 Prometheus query for Elasticsearch pod memory consumption %
sum(round((container_memory_working_set_bytes{pod_name=~"logging-elk-.+",container_name=~"es-.+|logstash"} / container_spec_memory_limit_bytes{pod_name=~"logging-elk-.+",container_name=~"es-.+|logstash"})*100)) by (pod_name)
In the current environment, this produces the output in Figure 5-10 on page 169.
 
Figure 5-10 Prometheus graph
Here we can see the overall memory consumption for the whole container, not just the JVM heap usage. The query in Example 5-6 displays the absolute memory usage of the same containers (the result is returned in bytes).
Example 5-6 Prometheus query for Elasticsearch pod memory consumption (bytes)
sum(container_memory_working_set_bytes{pod_name=~"logging-elk-.+",container_name=~"es-.+|logstash"}) by (pod_name)
Example 5-7 shows a query for viewing the average rate of the current CPU usage of the same Elasticsearch containers, also providing valuable information about how accurate the CPU resource allocation is in this cluster.
Example 5-7 Prometheus query for Elasticsearch pod CPU consumption
sum(rate(container_cpu_usage_seconds_total{pod_name=~"logging-elk-.+",container_name=~"es-.+|logstash"}[2m])) by (pod_name)
Figure 5-11 on page 170 shows the output of this query for this cluster.
 
Figure 5-11 CPU usage of the Elasticsearch containers
The container_cpu_usage_seconds_total metric used here is the cumulative amount of CPU seconds consumed by each container, across all cores. The rate output represents the number of cores used by the container, so a value of 2.5 in this example would mean 2500 millicores in terms of Kubernetes resource requests and limits.
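If it is easier to reason in Kubernetes resource units, the same query can be multiplied by 1000 to report millicores directly; this is purely a unit conversion of Example 5-7:

sum(rate(container_cpu_usage_seconds_total{pod_name=~"logging-elk-.+",container_name=~"es-.+|logstash"}[2m])) by (pod_name) * 1000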
Estimating the required storage capacity
The default storage value is 100GB, which is typically sufficient for normal cluster operations; however, this should scale with the amount of data that applications are expected to generate, plus an additional margin for spikes, multiplied by the number of days the data is retained.
When planning the amount of storage required, use the following rule:
(GB per day) x (retention period in days) = (total storage required)
In this equation, there is an unknown: the GB per day, which is the sum of the storage consumed each day by each index. Without this information, it’s difficult to estimate how much storage to allocate to each data node. However, using the same monitoring methods from the previous sections, it’s possible to see the current storage usage for each index and therefore measure the amount of storage capacity used on a daily basis. In the default Elasticsearch deployment in IBM Cloud Private 3.1.2, there is only one index configured to store logging data, but if audit logging and Vulnerability Advisor are enabled there are additional indices to cater for, each varying in size.
As an example, based on the information gathered by the X-Pack monitoring dashboard, the logstash-2019.03.05 index takes up ~18GB of storage space. This includes the replicas, so the primary shards consume 9GB and the replica shards also consume 9GB. Depending on the configuration (for example, HA) this will be spread across management nodes. Typically, if there is only one management node, use the 9GB value, as replica shards will remain unassigned. Therefore, retaining 14 days’ worth of logging data (using this as an average) requires a minimum of ~126GB. This figure, however, does not represent a real production value, only that of a minimally used lab environment. In some cases, a single day’s worth of logging data may be around 100GB for a single index. In that case, in a typical cluster with 2 management nodes, 50GB per management node per day of retention is required. In a use case where 14 days of retention is required, each management node would need at least 700GB simply to retain the log data.
Similar data can be gathered by using the Kibana UI to execute API commands directly against Elasticsearch and retrieve the storage used per index. For example, the storage size on disk can be retrieved using GET _cat/indices/logstash-2019.03.05/, producing the output in Example 5-8.
Example 5-8 API response
health status index uuid pri rep store.size pri.store.size
green open logstash-2019.03.05 qgH-QK35QLiBSmGHe0eRWw 5 1 18gb 9gb
There does appear to be some correlation between the storage capacity and the data node resource consumption during idle indexing. A simple advisory rule is to set the data node memory to around 15% of the storage capacity required on disk. For example, storing 100GB of log data would mean setting the data node memory limits to around 15GB. In practice, this would be 8GB JVM Xmx/Xms and 16GB container memory for each data node.
Based on the examples in this section, the daily index requires around 18GB of storage per node, per day, and the observed memory usage of the data nodes is around 2.5GB, which is roughly 14%.
In another cluster, similar behavior was observed where the index storage size was 61GB for the current day and the observed memory consumption for the data node (configured with 6GB JVM Xmx/Xms and 12GB container memory) was 8.1GB, so the memory usage was around 13%.
This is just a baseline to help start with a simple estimate, and the ‘15%’ value is likely to increase if users perform a lot of queries on large sets of data.
JVM heap sizing
A common question when deciding on the memory requirements for an Elasticsearch data node is how much JVM heap to allocate in proportion to the amount of memory given to the whole container. Whilst it may seem like something to overlook, the JVM heap size plays an extremely important role. Elasticsearch is built on the Java programming language, and when the node starts up, the heap is allocated to the Java process. Java uses this heap to store data in-memory, enabling faster data processing. Lucene, on the other hand, is designed to use the host Operating System (OS) for caching data, so allocating too much heap leaves the OS and Lucene without enough memory to function correctly, which can cause out-of-memory (OOM) errors, resulting in an OOMKilled signal from Kubernetes to restart the container.
When setting the heap size, the following rules should be followed:
JVM heap should be no more than 50% of the total container memory
JVM heap should not exceed 32GB (which means the maximum size for an Elasticsearch node is 64GB memory)
More information about JVM heap sizing can be found at https://www.elastic.co/guide/en/elasticsearch/guide/master/heap-sizing.html.
These rules also apply to the master, client and Logstash containers; however, there is no Lucene engine running on these nodes, so around 60% of the total memory can be allocated to the JVM heap.
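As a worked example of these rules, a data node given an 8GB container limit would be allocated a 4GB heap. Using the same Helm parameters shown elsewhere in this chapter, that sizing could be applied as follows:

helm upgrade logging ibm-icplogging-2.2.0.tgz --reuse-values --set elasticsearch.data.heapSize="4096m" --set elasticsearch.data.memoryLimit="8192Mi" --force --no-hooks --tls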
Garbage collection
Java is a garbage-collected language, which means the JVM is in charge of allocating memory and releasing it. The garbage collection process has some ‘stop-the-world’ phases where the JVM halts execution of Elasticsearch so it can trace and collect dead objects in memory to release them. During this time, nothing happens within Elasticsearch, so no requests are processed and shards are not relocated. A cluster experiencing long garbage collection processes will be under heavy load, and Elasticsearch nodes will appear to go offline for brief periods of time. This instability causes shards to relocate frequently as Elasticsearch tries to keep the cluster balanced with enough replicas available, as well as increased network traffic and disk I/O. Therefore, long garbage collection processes are most obvious when Elasticsearch is slow to respond to requests and experiences high CPU utilization.
Garbage collection is configured to start when the JVM heap usage exceeds 75%. If the Elasticsearch nodes are constantly above 75%, the nodes are experiencing memory pressure, which means not enough memory is allocated to the heap. If nodes constantly exceed 85-95%, the cluster is at high risk of becoming unstable, with frequent response delays and even out-of-memory exceptions. Elasticsearch provides useful information about the garbage collector usage on each node through the _nodes/stats API. In particular, the heap_used_percent metric in the jvm section is worth looking at if you’re experiencing issues, to ensure it is generally below 75% during normal operation. See Example 5-9.
Example 5-9 Partial API response for the _nodes/stats endpoint
...
"jvm": {
"timestamp": 1552999822233,
"uptime_in_millis": 100723941,
"mem": {
"heap_used_in_bytes": 967083512,
"heap_used_percent": 62,
"heap_committed_in_bytes": 1556938752,
"heap_max_in_bytes": 1556938752,
"non_heap_used_in_bytes": 118415128,
"non_heap_committed_in_bytes": 127152128,
"pools": {
"young": {
"used_in_bytes": 244530624,
"max_in_bytes": 429522944,
"peak_used_in_bytes": 429522944,
"peak_max_in_bytes": 429522944
},
"survivor": {
"used_in_bytes": 5855696,
"max_in_bytes": 53673984,
"peak_used_in_bytes": 53673984,
"peak_max_in_bytes": 53673984
},
"old": {
"used_in_bytes": 716697192,
"max_in_bytes": 1073741824,
"peak_used_in_bytes": 844266000,
"peak_max_in_bytes": 1073741824
}
}
}
...
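The jvm section shown in Example 5-9 can also be retrieved on its own from the Kibana Dev Tools console by restricting the node stats request to the JVM metric:

GET _nodes/stats/jvm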
More information on garbage collection monitoring can be found in the Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/guide/current/_monitoring_individual_nodes.html.
Example sizing of an Elasticsearch data node
This section provides a base sizing for the Elasticsearch master and data nodes in a production environment based on the observations in the previous sections. Whilst the values can’t really be carried over to other environments, the methodology is the same. This section assumes that workloads representing a realistic production cluster have been used to test resource consumption, as this analysis is based on that information. The following tests use 6 WebSphere Liberty deployments, each one executing a loop to log data to stdout indefinitely, to simulate a constant log stream. At random periods, a burst of log traffic is generated by temporarily scaling 2 of the deployments to 2 replicas to simulate an event and the replacement of pods. Whilst this test is performed over only one day, it is assumed that this is the normal operational behavior and therefore provides a baseline for the Elasticsearch cluster sizing.
3 hours into load testing, the index size was monitored and logs were, on average, generated at a rate of about 0.9GB per hour. See Figure 5-12.
Figure 5-12 Index rate increase
Over 24 hours, this gives 21.6GB of data per day. Retaining this for 14 days gives about 302GB, and as this cluster is installed with two management nodes, each management node requires at least 151GB of available storage. Based on this information, the memory for data nodes can also be estimated using the ‘15%’ rule mentioned in the previous section. 15% of 21.6GB is about 3.24GB, so in this example, setting the JVM Xmx/Xms to 2GB and the container limit to 4GB is a good estimate.
Configuring the resources for Elasticsearch containers
The resources for the ELK stack can be provided during installation, via the config.yaml, or post-installation using kubectl patch commands. It’s recommended to set the values during installation, especially if audit logging is enabled, but this typically requires existing knowledge of the resources needed for the current cluster configuration. Depending on the resources available on the management nodes, setting the resources for logging too high may hinder the successful installation of the other management services that may be enabled, so be sure to have planned the sizing of the management nodes correctly.
Configuring resources during installation
This is the recommended approach, especially if audit logging is enabled, as the default configuration is too low to cater for the platform plus audit logs. To configure the logging resources in the config.yaml, use Example 5-10 as a template to set the resources for each component. The installer will then use these values during installation of the logging chart.
Example 5-10 config.yaml values for logging
logging:
  logstash:
    heapSize: "512m"
    memoryLimit: "1024Mi"
  elasticsearch:
    client:
      heapSize: "1024m"
      memoryLimit: "1536Mi"
    data:
      heapSize: "1536m"
      memoryLimit: "3072Mi"
      storage:
        size: "100Gi"
    master:
      heapSize: "1024m"
      memoryLimit: "1536Mi"

elasticsearch_storage_size: "100Gi"
Note that the additional top-level parameter elasticsearch_storage_size is required in addition to the logging storage value when setting the Elasticsearch data node volume sizes. Additional information and values for other components such as Filebeat and the curator can be found at https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.1.2/manage_metrics/logging_elk.html.
Configuring resources after installation
Using the kubectl patch command, it’s simple to increase the resources for each component after they are deployed in IBM Cloud Private. Each component has a rolling update strategy, so the logging services should still be available while Kubernetes is redeploying new pods with the additional resources.
Use the following examples as templates to set the memory limits for a given component, replacing values where necessary.
Example 5-11 kubectl patch for Elasticsearch client node
kubectl -n kube-system patch deployment logging-elk-client --patch '{ "spec": { "template": { "spec": { "containers": [ { "env": [ { "name": "ES_JAVA_OPTS", "value": "-Xms2048m -Xmx2048m" } ], "name": "es-client", "resources": { "limits": { "memory": "3072M" } } } ] } } } }'
Example 5-12 kubectl patch for Elasticsearch Logstash node
kubectl -n kube-system patch deployment logging-elk-logstash --patch '{ "spec": { "template": { "spec": { "containers": [ { "env": [ { "name": "ES_JAVA_OPTS", "value": "-Xms1024m -Xmx1024m" } ], "name": "logstash", "resources": { "limits": { "memory": "2048M" } } } ] } } } }'
Example 5-13 kubectl patch for Elasticsearch data node
kubectl -n kube-system patch statefulset logging-elk-data --patch '{ "spec": { "template": { "spec": { "containers": [ { "env": [ { "name": "ES_JAVA_OPTS", "value": "-Xms3072m -Xmx3072m" } ], "name": "es-data", "resources": { "limits": { "memory": "6144M" } } } ] } } } }'
Example 5-14 kubectl patch for Elasticsearch master node
kubectl -n kube-system patch deployment logging-elk-master --patch '{ "spec": { "template": { "spec": { "containers": [ { "env": [ { "name": "ES_JAVA_OPTS", "value": "-Xms2048m -Xmx2048m" } ], "name": "es-master", "resources": { "limits": { "memory": "4096M" } } } ] } } } }'
These commands are sufficient to scale the containers vertically, but not horizontally. Because the cluster settings are integrated into the Elasticsearch configuration, you cannot simply add and remove master or data nodes using standard Kubernetes capabilities such as the HorizontalPodAutoscaler resource.
kubectl patch should be used only for testing, to find a suitable value. Updating the logging resources in this way will not persist through chart or IBM Cloud Private upgrades. Permanent configuration changes can only be made using helm upgrade. To update the JVM and container memory for the data nodes, for example, run the upgrade:
helm upgrade logging ibm-icplogging-2.2.0.tgz --reuse-values --set elasticsearch.data.heapSize="3072m" --set elasticsearch.data.memoryLimit="6144Mi" --force --no-hooks --tls
Scaling an Elasticsearch cluster
Scaling an Elasticsearch node in IBM Cloud Private entirely depends on the type of node, the volume of logs, and the performance of the Elasticsearch cluster against the logs generated. Sending thousands of smaller, simple log messages per second may have a different impact than a smaller number of larger log messages, so the data nodes may perform differently when configured as several smaller nodes or a few larger nodes. What works best in an environment generally comes down to experience gained by tuning the Elasticsearch cluster against your workloads to find which configuration provides the best performance for the type of log data the deployed applications are producing. Consider the following scenarios if a banking application is sending thousands of transactions and transaction logs to Elasticsearch every second:
Logstash may be the bottleneck (assuming the host network can handle this volume of traffic) and will require either more memory or additional replicas deployed to meet the high demand.
If the transaction data generated is a very large data set with lots of fields, then the data node may be the bottleneck as it has to index such a large volume of data, so additional memory or even additional data nodes may be required.
Logstash may also struggle with low resources and large volumes of JSON formatted data, as it will parse the JSON and bring fields to the root level.
As the master node is responsible for tracking which data nodes the documents are stored on, thousands of logs per second mean the master node has to track thousands of documents, which can be difficult to achieve without enough memory and CPU.
When querying the stored data, the client node has to parse potentially complex or time-based Lucene queries, and a lack of memory can result in request timeouts.
With the above in mind, it’s important to be able to scale the components as and when required.
Logstash and client pods - increase the number of replicas by using Helm:
helm upgrade logging ibm-icplogging-2.2.0.tgz --reuse-values --set elasticsearch.client.replicas=3 --set logstash.replicas=3 --force --no-hooks --tls
Master pods - run the helm upgrade command, passing the desired number of replicas (ideally this should equal the number of management nodes in the cluster, although other values will still work).
helm upgrade logging ibm-icplogging-2.2.0.tgz --reuse-values --set elasticsearch.master.replicas=3 --recreate-pods --force --tls
Data pods - ensure there are sufficient management nodes available. Data nodes have an affinity to management nodes and a hard anti-affinity to other data pods, so no more than one data pod will run on a single management node at any one time.
helm upgrade logging ibm-icplogging-2.2.0.tgz --reuse-values --set elasticsearch.data.replicas=3 --recreate-pods --force --no-hooks --tls
 
Impacts of audit logging
Planning for audit logging is crucial to allow the logging services to start successfully and have enough resources to handle the large volume of data being sent to the Elasticsearch cluster. This section shows how enabling audit logging can affect a cluster. The tests were performed in a lab environment, so your results may differ. The tests, however, can provide a helpful insight into the methodology used for analyzing the cluster and deriving the resources required when enabling audit logging, by using the trial X-Pack monitoring and Prometheus tools already available in IBM Cloud Private. The audit logging in this example was configured after installation of IBM Cloud Private, following the IBM Knowledge Center article at https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.1.2/user_management/auditing_icp_serv.html.
Whilst audit logging was being enabled, the workloads from the previous sections were left running to determine the effect that enabling audit logging has on a running cluster with an already active Elasticsearch.
Figure 5-13 on page 177 shows that the cluster became unresponsive for a short period of time whilst the main pods that generate audit logs (plus Kubernetes API server audit logging) were restarted and began generating data in the newly added audit index.
Figure 5-13 Audit log enabled
Figure 5-14 shows one of the data nodes becoming unresponsive after being overloaded due to high CPU utilization.
Figure 5-14 Data node becoming unresponsive
The graph of the JVM heap size on a data node shows that the value of 3GB Xmx/Xms and 6GB container memory was sufficient, as it maintained a steady maximum of around 2.3GB (Figure 5-15).
Figure 5-15 JVM heap size
Checking the total memory consumed during this time period in Prometheus also shows similar results, with the container memory peaking at around 4.5GB. See Figure 5-16.
Figure 5-16 Container memory usage with audit logging
As the audit log generation uses Fluentd, data is sent directly to Elasticsearch, so there is no need to consider Logstash.
The client and master nodes generally stayed stable throughout the test. The sudden hike for one of the client nodes in Figure 5-17 on page 179 is due to a new replica starting in advance of enabling audit logging.
Figure 5-17 New replica starting
Based on the above results, Table 5-2 shows the minimum advised resources for a small High Availability cluster using 3 master nodes, 2 proxy nodes, 2 management nodes, 1 Vulnerability Advisor node and 3 worker nodes.
Table 5-2 Advised ELK resource limits for audit logging with high availability
Name        Number of Nodes    Memory
client      2                  3GB (2GB Xmx/Xms)
master      2                  1.5GB (1GB Xmx/Xms)
data        2                  6GB (3GB Xmx/Xms)
logstash    2                  1GB (512MB Xmx/Xms)
In smaller clusters, resources can be reduced. Audit logging was tested on a smaller, 6 node cluster (1 master, 1 proxy, 1 management and 3 workers) and the minimum resource limits in Table 5-3 are sufficient for audit logging.
Table 5-3 Minimum ELK resource limits for audit logging in small clusters
Name        Number of Nodes    Memory
client      1                  3GB (2GB Xmx/Xms)
master      1                  1.5GB (1GB Xmx/Xms)
data        1                  4GB (2GB Xmx/Xms)
logstash    1                  1GB (512MB Xmx/Xms)
In larger, more active clusters, or during installation of IBM Cloud Private with audit logging and Vulnerability Advisor enabled, these values will almost certainly increase, so plan accordingly and be patient when finding the correct resource limits. During times of heavy use, it may appear as though Elasticsearch is failing, but it usually does a good job of catching up after it has had time to process, provided it has the appropriate resources. An easy mistake users often make is thinking that it’s taking too long and pods are stuck, so they are forcefully restarted and the process then takes even longer to recover. During busy periods of heavy indexing or data retrieval, it’s common to observe high CPU utilization on the management nodes; however, this is usually an indicator that there is not enough memory allocated to the JVM heap, and it should be monitored carefully to ensure that the CPU returns to a normal idle value.
Careful monitoring of the container memory is advised, and container memory should be increased if there are several sudden dips or gaps in the Prometheus graphs while the container is under heavy load, as this is typically a sign that the container is restarting. Evidence of this is shown in Figure 5-18 on page 180, where a data pod had reached its 3GB container limit (monitored using Prometheus) and suddenly dropped, indicating an out-of-memory failure.
Figure 5-18 Out-of-memory failure
As a reference, Example 5-15 provides the base YAML to add to the config.yaml to set the logging resources to the above recommendations.
Example 5-15 Recommended initial logging values in the config.yaml
logging:
  logstash:
    heapSize: "512m"
    memoryLimit: "1024Mi"
  elasticsearch:
    client:
      heapSize: "2048m"
      memoryLimit: "3072Mi"
    data:
      heapSize: "3072m"
      memoryLimit: "6144Mi"
    master:
      heapSize: "1024m"
      memoryLimit: "1536Mi"
Other factors to consider
As mentioned throughout this chapter, tuning the resources for the platform ELK can take considerable effort if capacity planning has not been done for this configuration before. Beyond audit logging, a range of other factors can also play a part in deciding the resources required for both the ELK stack and the management nodes hosting the components:
Data-at-rest encryption - IBM Cloud Private provides the capability to encrypt data stored on disk using dm-crypt. This, for many Cluster Administrators, is a must-have feature to be security compliant. This does, however, mean that there may be a performance penalty in terms of CPU utilization on the management nodes hosting ELK, so plan for additional resources on the management nodes to cater for encryption. More information can be found at https://www.ibm.com/support/knowledgecenter/SSBS6K_3.1.2/installing/fips_encrypt_volumes.html.
Data-in-transit encryption - IBM Cloud Private provides the capability to encrypt data traffic across the host network using IPSec. Similarly to data-at-rest encryption, this may have a performance impact on the CPU utilization. More information can be found at https://www.ibm.com/support/knowledgecenter/SSBS6K_3.1.2/installing/fips_encrypt_volumes.html.
Number of nodes - the number of nodes in a cluster largely affects the performance, mainly due to the number of additional Filebeat containers all streaming data to Logstash. Consider scaling Logstash to more replicas or increasing CPU when additional worker nodes with workloads are added to the cluster.
Storage - Elasticsearch performs better with high performance storage. It’s recommended to use the highest performing storage available for more efficient indexing, storage and retrieval of data.
5.2.6 Role based access control
The platform ELK has role based access control (RBAC) built in to the deployments that will filter API responses to the Elasticsearch client pods to only return results relevant to the requesting users accessible namespaces. This means that a user that only has access to namespace1 should not see logs related to namespace2.
The RBAC module consists of Nginx containers bundled with the client and Kibana pods to provide authentication and authorization for the ELK stack. These Nginx containers apply the following rules:
1. A user with the role ClusterAdministrator can access any resource, whether audit or application log.
2. A user with the role Auditor is only granted access to audit logs in the namespaces for which that user is authorized.
3. A user with any other role can access application logs only in the namespaces for which that user is authorized.
4. Any attempt by an auditor to access application logs, or a non-auditor to access audit logs, is rejected.
The RBAC rules provide basic data retrieval control for users that access Kibana. The rules do not prevent seeing metadata such as log field names or saved Kibana dashboards. User-saved artifacts in Kibana are all saved in Elasticsearch in the same default index of /.kibana. This means that all users using the same instance of Kibana can access each other's saved searches and dashboards and view any other custom fields added to the data. Without an X-Pack or Search Guard Enterprise license, no other native multi-tenant features are available to address this situation in a single Kibana instance. For information about deploying multiple Kibana instances, see 5.2.8, “Management” on page 188.
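For example, the saved objects that are shared in this way can be listed directly from the .kibana index. The following request is a sketch that can be run from the Dev Tools console described in 5.2.7, “Using Kibana”; the size parameter is arbitrary and simply limits the number of documents returned.
GET .kibana/_search?size=50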
5.2.7 Using Kibana
This section will cover some basic tools available in the Kibana user interface (UI) that allow a user to view logs, create graphical representations of the data and build custom dashboards.
The Kibana UI is split into six sections:
Discover
Visualize
Dashboards
Timelion
Dev Tools
Management
If X-Pack functions were enabled during deployment, they will also appear in the UI.
Discover
This is the first page shown automatically when logging in to Kibana. By default, this page displays the 500 most recently received logs from the last 15 minutes, from the namespaces you are authorized to access. Here, you can filter through and find specific log messages based on search queries.
The Kibana Discover UI contains the following elements:
Search Bar: Use this to search specific fields and/or entire messages
Time Filter: Top-right (clock icon). Use this to filter logs based on various relative and absolute time ranges
Field Selector: Left, under the search bar. Select fields to modify which ones are displayed in the Log View
Log Histogram: Bar graph under the search bar. By default, this shows the count of all logs, versus time (x-axis), matched by the search and time filter. You can click on bars, or click-and-drag, to narrow the time filter
The search bar is the most convenient way to search for a string of text in log data. It uses a fairly simple language structure, called the Lucene query syntax. The query string is parsed into a series of terms and operators. A term can be a single word, or a phrase surrounded by double quotes, which searches for all the words in the phrase in the same order. Figure 5-19 on page 183 shows searching Kibana for log data containing the phrase “websphere-liberty”.
 
Figure 5-19 Searching logs in Kibana
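As an illustration, the following queries are examples of the Lucene query syntax using field names shown later in this section (kubernetes.namespace, kubernetes.container_name and log); treat them as sketches and substitute field names and values that exist in your own data.
kubernetes.namespace:default AND log:error
kubernetes.container_name:websphere* AND NOT log:"audit"
log:"Internal Server Error" AND kubernetes.namespace:(production OR pre-production)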
Searches can be further refined, for example by searching only for references to the default namespace. Using filters enables users to restrict the results shown only to contain the relevant field filters, as shown in Figure 5-20.
Figure 5-20 Kibana search filter
The results are then restricted to this filter only, as seen in Figure 5-21.
Figure 5-21 Kibana filtered search
The use of fields also allows more fine grained control over what is displayed in the search results, as seen in Figure 5-22 where the fields kubernetes.namespace, kubernetes.container_name and log are selected.
Figure 5-22 Kibana filtered search with selected fields
More detailed information is available in the Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-query-string-query.html#query-string-syntax.
Visualize
The Visualize page lets you create graphical representations of search queries in a variety of formats, such as bar charts, heat maps, geographical maps and gauges.
This example will show how a simple pie chart is created, configured to show the top 10 containers that produce the highest number of logs per namespace from a Websphere Liberty Deployment that is running in different namespaces.
To create a visualization, select the + icon, or Create Visualization, if none exist, then perform the following steps:
1. Select the indices that this visualization will apply to, such as logstash. You may only have one if the default configuration has not been modified or audit logging is not enabled.
2. Select Pie Chart from the available chart selection.
3. In the search box, enter the query that will generate the data and apply the relevant filters. This is the same as writing a query and filters in the Discover section. In this example the term ‘websphere’ is used to search for instances of ‘websphere’ in the log data.
4. In the left side pane, select the data tab, select metrics, then set slice size to Count.
5. In the buckets section, select Split Slices and set the following values:
a. Set Aggregation to Terms.
b. Set Field to kubernetes.namespace.keyword.
c. Set Order By to metric: Count.
d. Set Order to Descending and Size to 10.
6. Select the Play icon at the top of this pane and the pie chart should display with the data from the search query. In this case, it will show the top 10 namespaces that contain the highest number of logs generated by websphere containers. You can use the filters to exclude certain namespaces, such as kube-system.
The result should look similar to Figure 5-23. Save the visualization using the Save link at the top of the page, as this will be used later to add to a dashboard.
Figure 5-23 Kibana visualization pie chart
Explore the other visualizations available, ideally creating a few more as these can be used to create a more meaningful Kibana Dashboard.
Dashboards
The Kibana Dashboard page is where you can create, modify and view your own custom dashboards. Multiple visualizations can be combined on a single page and filtered by providing a search query or by selecting filters. Dashboards are useful when you want to get an overview of your logs and make correlations among various visualizations and logs.
To create a new dashboard, select the + icon, or Create Dashboard, if none exist. Select Add Visualization, then select the visualizations created earlier that you want to display in this dashboard. From here, you can further filter the data shown in the individual visualizations by entering a search query, changing the time filter, or clicking on the elements within the visualization. The search and time filters work just like they do in the Discover page, except they are only applied to the data subsets that are presented in the dashboard. You can save this dashboard by selecting Save at the top of the page. Figure 5-24 shows how this dashboard looks with two visualizations added.
Figure 5-24 Kibana dashboard
Timelion
Timelion is a time series data visualizer that enables you to combine totally independent data sources within a single visualization. It’s driven by a simple expression language you use to retrieve time series data, perform calculations to tease out the answers to complex questions and visualize the results.
The following example uses Timelion to compare time-series data about the number of logs generated in the past hour, compared to the number of logs generated this hour. From this, you can compare trends and patterns in data at different periods of time. For IBM Cloud Private, this can be useful to see trends in the logs generated on a per-pod basis. For example, the chart in Figure 5-25 compares the total number of logs in the current hour compared with the previous hour and the chart in Figure 5-26 on page 187 compares the number of logs generated by the image-manager pods in the last hour.
Figure 5-25 Timelion chart comparing total logs in the current hour compared to the previous hour
 
Figure 5-26 Timelion chart comparing image-manager log count in the last hour
These charts can also be saved and added to a Dashboard to provide even more analysis on the log data stored in Elasticsearch.
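As an indication of what the underlying expressions look like, the following Timelion examples are sketches that use the offset parameter to overlay the previous hour on the current data; the index pattern and query values are assumptions that should be adjusted to your own indices.
.es(index=logstash-*), .es(index=logstash-*, offset=-1h)
.es(index=logstash-*, q='kubernetes.container_name:image-manager*'), .es(index=logstash-*, q='kubernetes.container_name:image-manager*', offset=-1h)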
More information about Timelion can be found in the Elasticsearch documentation at https://www.elastic.co/guide/en/kibana/5.5/timelion.html.
Dev Tools
The Dev Tools page is primarily used to query the Elasticsearch API directly and can be used in a multitude of ways to retrieve information about the Elasticsearch cluster. This provides an easy way to interact with the Elasticsearch API without having to execute commands from within the Elasticsearch client container. For example, Figure 5-27 shows executing the _cat/indices API command to retrieve a list of indices.
Figure 5-27 API command to retrieve the list of indices
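For reference, the equivalent requests typed into the Dev Tools console look like the following; the ?v parameter simply adds column headers to the _cat output.
GET _cat/indices?v
GET _cat/nodes?v
GET _cluster/health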
Important: One thing to note about this tool is that all users with access to Kibana currently have access to the Elasticsearch API, which at the time of writing is not RBAC filtered, so all users can run API commands against Elasticsearch. It is possible to disable the Dev Tools page by adding console.enabled: false to the kibana.yml content in the logging-elk-kibana-config ConfigMap in the kube-system namespace and restarting the Kibana pod.
Elasticsearch extensively covers the use of the API in its documentation at https://www.elastic.co/guide/en/elasticsearch/reference/5.5/cat.html, so it is not covered in this chapter.
5.2.8 Management
The Management page allows for modifying the configuration of several aspects of Kibana. Most of the settings are not configurable, as they are controlled by the use of a configuration file within the Kibana container itself, but the Management page lets you modify settings related to the stored user data, such as searches, visualizations, dashboards and indices.
Deploying ELK in IBM Cloud Private
Deploying ELK stacks to an IBM Cloud Private cluster is a common practice when there is a need to segregate the platform logging from the application logging. Elasticsearch is excellent at handling many user deployments at scale, but it makes sense to use separate logging systems, especially when a specific application has a high demand for logging and could potentially disrupt the platform logging services by overloading it in the event of a disaster.
Adding additional ELK stacks does not come without a price. As described throughout this chapter, logging takes a toll on the resources available within a cluster, requiring a lot of memory to function correctly. When designing a cluster, it's important to take into consideration whether multiple ELK stacks are required and, if so, the resources required, using the capacity planning methods discussed in “Capacity planning” on page 164. The same care should be taken when designing clusters for production that include multiple ELK stacks for applications, and the configuration should be thoroughly tested in the same way the platform ELK is tested for resiliency. Failure to do so can result in the loss of logging services, and potentially loss of data, if Kubernetes has to restart components because they do not have enough memory to fulfill the logging requirements for an application.
Planning for multiple ELK stacks should, in theory, be a little easier than figuring out how much the platform ELK should be scaled to meet the needs of both the platform and the applications it is hosting. This is because developers typically know how much log data their application (or group of applications) produces based on experience or native monitoring tools. In this situation, you can solely focus on what the application needs, as opposed to catering for the unknowns that the platform brings.
Architectural considerations
Before deploying additional ELK stacks to the cluster, think about how they will affect the resources available for the namespace. Resource requirements for ELK can be quite high, so if you are deploying ELK to an individual namespace where users operate, take this into consideration when designing Resource Quotas for that namespace. Best practice is to deploy ELK to its own namespace to isolate it from user workloads, and from the users themselves if necessary.
The ELK stack requires elevated privileges in order to function correctly. In particular, it requires the IPC_LOCK capability, which is not included in the default Pod Security Policy. If ELK is not being deployed to the kube-system namespace, the Service Account that ELK will use (typically the default Service Account for the hosting namespace) should be configured to use a Pod Security Policy that permits the use of IPC_LOCK. This can be done by creating a new Pod Security Policy for ELK and by creating a new ClusterRole and RoleBinding. There are three options to consider:
1. Deploying the ELK stack to the kube-system namespace
2. Deploying the ELK stack to the user namespace
3. Deploying the ELK stack to another dedicated namespace
Deploying the ELK stack to the user namespace means that the users that have access to the namespace also have access to view the resources within it. This means that users will be able to perform operations on the ELK pods (depending on the user roles), including viewing the CA certificates used to deploy the ELK stack security configuration (if ELK is deployed with security). By giving a service account within the user namespace elevated privileges, you are also allowing users to acquire those privileges, so ensure that the IPC_LOCK capability does not conflict with any security policies. IPC_LOCK enables mlock to protect the heap memory from being swapped. You can read more about mlock at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/sect-Realtime_Reference_Guide-Memory_allocation-Using_mlock_to_avoid_memory_faults.html.
If this is an issue, consider deploying the additional ELK stacks to a separate namespace. Additional namespaces can be used to host the additional ELK stacks and utilise the standard RBAC mechanisms within IBM Cloud Private to prevent unauthorized users from accessing the ELK pods. However, in this scenario (providing security is enabled), users in other namespaces would not be able to access the Kibana pods to view logs or to check their status for troubleshooting. If, as a Cluster Administrator, you do not require users to monitor the Kibana pods, then restricting access is recommended. If users should be able to view logs and the status of the Kibana pod (to troubleshoot access issues), an additional namespace can be used to host the Kibana pod, where users can be given a 'Viewer' role. This still provides access to the Kibana pods for diagnostics but prevents users from making any changes to its state or configuration, further protecting the ELK stack from malicious intent.
For development clusters, deploying the ELK stack to user namespaces may be sufficient. For production clusters where access to the core ELK components should be restricted, deploying the ELK stack to a dedicated management namespace is recommended.
Standard versus managed mode
The ELK Helm chart provides a mode parameter that defines whether ELK is deployed as ‘managed’ or ‘standard’. Managed mode is generally reserved to replace the platform ELK and contains several core functions that are only enabled when the ELK stack is deployed to the kube-system namespace, which is where the platform ELK is hosted. Deploying several managed ELK stacks will not work without additional modification and is not a recommended configuration. It is possible to deploy a managed ELK stack to another namespace, but additional configuration is still needed and is not covered in this book. The main functional differences between managed and standard mode are summarized in Table 5-4.
Table 5-4 Comparison of managed and standard mode
Attribute               Standard        Managed
Kibana Access method    NodePort        Ingress
Access Security         None            ICP Management Ingress
RBAC                    No              Yes
Host Affinity           Worker nodes    Management nodes
Each mode has pros and cons and the mode to choose entirely depends on what the ELK requirements are. For example, if the requirement is to deploy only one additional ELK stack dedicated to application logs, but it should be secure, implement RBAC mechanisms and not consume worker node resources, then the managed mode is suitable. The drawback is that it requires modifications to the deployed resources to work alongside the platform ELK. If multiple ELK stacks are needed for team or application dedicated ELK stacks where the use of NodePort (no authentication required) is acceptable, then standard mode is suitable.
The recommended way to deploy additional ELK stacks is by using the standard mode. Managed mode is possible to achieve, but introduces a lot of additional configuration to enable all the beneficial features of managed mode and it is not covered in this book.
Storage
Data node replicas in each deployment of ELK require a dedicated PersistentVolume (PV) and PersistentVolumeClaim (PVC). If a dynamic storage provider is available, ELK can be configured to use this during deployment. If dynamic provisioning is not available, then suitable PVs must be created first.
Deploying ELK stacks
This section will deploy a new standard mode ELK stack to the elk namespace and this namespace will be used in all subsequent example commands.
To create the namespace, use kubectl create namespace elk and then label it using kubectl label namespace elk name=elk. Alternatively, use the YAML definition in Example 5-16.
Example 5-16 elk namespace YAML definition
apiVersion: v1
kind: Namespace
metadata:
  labels:
    name: elk
  name: elk
RBAC
The ELK stack requires privileged containers and the IPC_LOCK and SYS_RESOURCE capabilities, which means giving the default Service Account (SA) in the elk namespace elevated privileges, as the default restricted policy is too restrictive for ELK to function. To allow this, a new Pod Security Policy (PSP) is required, as well as a Cluster Role and Role Binding to the new PSP. To create the required RBAC resources, perform the following steps:
1. Copy the elk-psp.yaml in Example 5-17 to a local file called elk-psp.yaml and create it in Kubernetes using kubectl create -f elk-psp.yaml.
Example 5-17 elk-psp.yaml
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: ibm-elk-psp
spec:
  allowPrivilegeEscalation: true
  privileged: true
  allowedCapabilities:
  - CHMOD
  - CHOWN
  - IPC_LOCK
  - SYS_RESOURCE
  forbiddenSysctls:
  - '*'
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim
  - hostPath
2. Create a Cluster Role to enable the use of the new PSP
kubectl -n elk create clusterrole elk-clusterrole --verb=use --resource=podsecuritypolicy --resource-name=ibm-elk-psp
3. In the elk namespace, create a Role Binding to the ibm-elk-psp Cluster Role
kubectl -n elk create rolebinding elk-rolebinding --clusterrole=elk-clusterrole --serviceaccount=elk:default
4. Verify the default Service Account in the elk namespace can use the ibm-elk-psp PSP
kubectl auth can-i --as=system:serviceaccount:elk:default -n elk use podsecuritypolicy/ibm-elk-psp
This should output a simple yes or no. If the output is no, check the correct names have been used when creating the Cluster Role or Role Binding.
After the above steps, RBAC is configured with the correct privileges for ELK.
Pulling images
As ELK is deployed to a dedicated namespace, it’s necessary to create an image pull secret so that it can pull the ELK images from the ibmcom namespace that contains the platform images.
Create an image pull secret in the elk namespace using the platform admin credentials.
kubectl -n elk create secret docker-registry ibmcomregkey --docker-server="mycluster.icp:8500/ibmcom" --docker-username="admin" --docker-password="admin" --docker-email="[email protected]"
Replace mycluster.icp and the username/password credentials with the values for the environment. This Secret will be used later to allow the ELK pods to pull images from the ibmcom namespace.
Security
It’s recommended to enable security on all deployed ELK stacks. You have the option of using the platform generated Certificate Authority (CA), supplying your own or letting Elasticsearch generate certificates internally. The recommended approach is to supply your own CA, or to create an entirely new CA specifically for each ELK stack, to provide as much isolation as possible in clusters using multiple ELK stacks. If the same CA key-pair is shared between ELK deployments and a malicious user gains access to it, it’s possible for that user to gain access to the other ELK stacks. If the cluster is a ‘trusted environment’ model, then this may not be a problem, but for other clusters where security and isolation of log data is a key requirement, it is recommended to use a new CA key-pair for each ELK deployment.
To create a dedicated CA for each ELK stack use the openssl command in Example 5-18 replacing the subject details if necessary.
Example 5-18 Creating a new CA using openssl
openssl req -newkey rsa:4096 -sha256 -nodes -keyout ca.key -x509 -days 36500 -out ca.crt -subj "/C=US/ST=NewYork/L=Armonk/O=IBM Cloud Private/CN=www.ibm.com"
This will output a ca.crt and ca.key file to your local machine. Run the command in Example 5-19 to create a new key-pair Kubernetes Secret from these files.
Example 5-19 Create secret from new CA
kubectl -n elk create secret generic elk-ca-secret --from-file=ca.crt --from-file=ca.key
This secret can be used later when deploying ELK.
To specify your own CA to use when deploying additional ELK stacks, three requirements must be met:
The CA must be stored in a Kubernetes secret.
The secret must exist in the namespace to which the ELK stack is deployed.
The contents of the certificate and its secret key must be stored in separately named fields (or keys) within the Kubernetes secret.
If the keys are stored locally, run the command in Example 5-20 to create a new secret, replacing <path-to-file> with the file path of the files.
Example 5-20 Create secret from custom CA
kubectl -n elk create secret generic elk-ca-secret --from-file=<path-to-file>/my_ca.crt --from-file=<path-to-file>/my_ca.key
Alternatively, Example 5-21 shows the YAML for a sample secret using defined CA certificates. You’ll need to paste the base64-encoded contents of my_ca.crt and my_ca.key into the YAML definition in your preferred editor.
Example 5-21 YAML definition for my-ca-secret
apiVersion: v1
kind: Secret
metadata:
  name: my-ca-secret
type: Opaque
data:
  my_ca.crt: ...
  my_ca.key: ...
This secret can be used later when deploying ELK.
If your own CA is not supplied, and the default IBM Cloud Private cluster certificates are suitable, they are located in the cluster-ca-cert Secret in the kube-system namespace. As ELK is deployed to another namespace, it’s important to copy this Secret to that namespace, otherwise the deployment will fail trying to locate it. This is only relevant if you are not using your own (or generated) CA for ELK. Use the command in Example 5-22 to copy the cluster-ca-cert to the elk namespace, using sed to replace the namespace name without manually editing it.
Example 5-22 Copying cluster-ca-cert to elk namespace
kubectl -n kube-system get secret cluster-ca-cert -o yaml | sed 's/namespace: kube-system/namespace: elk/g' | kubectl -n elk create -f -
Deploying ELK
The ibm-icplogging Helm chart contains all the Elasticsearch, Logstash, Kibana and Filebeat components required to deploy an ELK stack to an IBM Cloud Private cluster. The chart version used in this deployment is 2.2.0, which is the same as the platform ELK stack deployed during cluster installation in IBM Cloud Private Version 3.1.2 to ensure the images used throughout the platform are consistent between ELK deployments. This Helm chart can be retrieved from the mgmt-charts repository, as it is not (at the time of writing) published in the IBM public Helm chart repository.
The chart will install the following ELK components as pods in the cluster:
client
data
filebeat (one per node)
logstash
master
kibana (optional)
The goal of this example ELK deployment is to provide logging capabilities for applications running in the production namespace. ELK will be configured to monitor the dedicated production worker nodes, retrieving log data from applications in that namespace only.
To deploy the ELK Helm chart, perform the following steps:
1. Log in to IBM Cloud Private using cloudctl to ensure kubectl and helm command line utilities are configured.
 
2. If dynamic storage provisioning is enabled in the cluster, this can be used. If dynamic storage is not available, PersistentVolumes should be created prior to deploying ELK. In this deployment, the namespace isolation features in IBM Cloud Private 3.1.2 have been used to create a dedicated worker node for ELK. This means that a LocalVolume PersistentVolume can be used, as ELK will be running on only one node. Example 5-23 is a YAML definition for a PersistentVolume that uses LocalVolume, so the data node uses the /var/lib/icp/applogging/elk-data file system on the hosting dedicated worker node.
Example 5-23 YAML definition for a Persistent Volume using LocalVolume on a management node
apiVersion: v1
kind: PersistentVolume
metadata:
  name: applogging-datanode-172.24.19.212
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  local:
    path: /var/lib/icp/applogging/elk-data
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 172.24.19.212
  persistentVolumeReclaimPolicy: Retain
  storageClassName: logging-storage-datanode
In this deployment, two of these volumes are created; one for each management node in the cluster.
3. Retrieve the current platform ELK chart values and save to a local file. These values can be used to replace the defaults set in the Helm chart itself so that the correct images are used.
helm get values logging --tls > default-values.yaml
4. Create a file called override-values.yaml. This will be the customized configuration for ELK and is required to override some of the default values to tailor the resource names, curator duration, security values or number of replicas of each component in this deployment. Use the values in Example 5-24 as a template.
Example 5-24 override-values.yaml
cluster_domain: cluster.local
mode: standard
nameOverride: elk
general:
  mode: standard
image:
  pullPolicy: IfNotPresent
  pullSecret:
    enabled: true
    name: ibmcomregkey
curator:
  name: log-curator
  app:
    count: 28
elasticsearch:
  client:
    replicas: "1"
    name: client
  data:
    name: data
    replicas: "1"
    storage:
      persistent: true
      size: 20Gi
      storageClass: logging-storage-datanode
      useDynamicProvisioning: false
  master:
    name: master
    replicas: "1"
  name: elasticsearch
filebeat:
  name: filebeat-ds
  scope:
    namespaces:
    - production
    nodes:
      production: "true"
kibana:
  name: kibana
  install: true
  external:
logstash:
  name: logstash
  replicas: "1"
security:
  ca:
    external:
      certFieldName: ca.crt
      keyFieldName: ca.key
      secretName: elk-ca-secret
    origin: external
  enabled: true
xpack:
  monitoring: true
These values should be tailored to meet the requirements. If dynamic provisioning is enabled in the cluster, set elasticsearch.data.storageClass to the appropriate storage class name and elasticsearch.data.useDynamicProvisioning value to true.
In this values file, the kibana.external is intentionally left empty, so that Kubernetes will automatically assign a NodePort value from the default NodePort range. At the time of writing, the Helm chart does not support automatic NodePort assignment when deploying through the IBM Cloud Private catalog user interface, due to validation on empty fields. Therefore auto-assignment is only possible by deploying the Helm chart through the Helm CLI.
Additional values not in the override-values.yaml can also be assigned using the --set parameter in the helm install command.
5. Deploy the ibm-icplogging-2.2.0.tgz Helm chart using helm install, passing the values files created earlier. The values files order is important as Helm will override the values from each file in sequence. For example, the values set in override-values.yaml will replace any values in the default-values.yaml file as priority is given to the right-most file specified.
helm install ibm-icplogging-2.2.0.tgz --name app-logging --namespace elk -f default-values.yaml -f override-values.yaml --tls
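If additional values need to be set at install time, the same command can be extended with --set; this is only a sketch that reuses the curator retention parameter from Example 5-24, and values passed with --set take precedence over the values files.
helm install ibm-icplogging-2.2.0.tgz --name app-logging --namespace elk -f default-values.yaml -f override-values.yaml --set curator.app.count=14 --tls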
After some time, all resources should have been deployed to the elk namespace. To view all pods in the release, use kubectl -n elk get pods -l release=app-logging
[root@icp-ha-boot cluster]# kubectl -n elk get pods -l release=app-logging
NAME READY STATUS RESTARTS AGE
app-logging-elk-client-868db5cbd9-6mhjl 1/1 Running 0 1h
app-logging-elk-data-0 1/1 Running 0 1h
app-logging-elk-elasticsearch-pki-init-78bzt 0/1 Completed 0 1h
app-logging-elk-kibana-86df58d79d-wfgwx 1/1 Running 0 1h
app-logging-elk-kibana-init-m9r92 0/1 CrashLoopBackOff 22 1h
app-logging-elk-logstash-68f996bc5-92gpd 1/1 Running 0 1h
app-logging-elk-master-6c64857b5b-x4j9b 1/1 Running 0 1h
 
Tip: If the kibana-init pod fails, it’s because it could not initialize the default index in Kibana. This is not a problem, as the default index can be set through the Kibana UI.
6. Retrieve the NodePort for the Kibana service:
kubectl -n elk get service kibana -o=jsonpath='{.spec.ports[?(@.port==5601)].nodePort}'
7. Use the returned port to access Kibana via an IBM Cloud Private node, for example using the proxy node:
http://<proxy-ip>:<nodeport>
 
This will display the Kibana dashboard. See Figure 5-28.
Figure 5-28 New Kibana user interface
8. Set the default index to whichever value you choose. The default is logstash- but this may change depending on how you modify Logstash in this instance. Note that it is not possible to set the default index until data with that index actually exists in Elasticsearch, so ensure log data has been sent to Elasticsearch first.
Configuring namespace based indices
The default Logstash configuration forwards all log data to an index in the format logstash-<year-month-day>. Whilst this is suitable for the platform logs, it makes sense for additional ELK stacks, designed to collect logs for specific namespaces, to create indices based on the namespace the logs originate from. This can be achieved by editing the Logstash ConfigMap and modifying the output to use the kubernetes namespace field as the index name instead of the default logstash. To do this for the ELK stack deployed in previous sections, edit the app-logging-elk-logstash-config ConfigMap in the elk namespace and change the output.elasticsearch.index section from
output {
  elasticsearch {
    hosts => "elasticsearch:9200"
    index => "logstash-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
    ssl => true
    ssl_certificate_verification => true
    keystore => "/usr/share/elasticsearch/config/tls/logstash-elasticsearch-keystore.jks"
    keystore_password => "${APP_KEYSTORE_PASSWORD}"
    truststore => "/usr/share/elasticsearch/config/tls/truststore.jks"
    truststore_password => "${CA_TRUSTSTORE_PASSWORD}"
  }
}
to
output {
  elasticsearch {
    hosts => "elasticsearch:9200"
    index => "%{kubernetes.namespace}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
    ssl => true
    ssl_certificate_verification => true
    keystore => "/usr/share/elasticsearch/config/tls/logstash-elasticsearch-keystore.jks"
    keystore_password => "${APP_KEYSTORE_PASSWORD}"
    truststore => "/usr/share/elasticsearch/config/tls/truststore.jks"
    truststore_password => "${CA_TRUSTSTORE_PASSWORD}"
  }
}
Save and close to update the ConfigMap. Logstash will automatically reload the new configuration. If it does not reload after 3 minutes, delete the Logstash pod(s) to restart them.
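If a manual restart is needed, the pods can simply be deleted and the Deployment recreates them; a sketch, assuming the release name app-logging used earlier:
kubectl -n elk get pods | grep logstash
kubectl -n elk delete pod <logstash-pod-name>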
As Logstash does the buffering and transformation of log data, it contains a variety of useful functions to translate, mask or remove potentially sensitive fields and data from each log message, such as passwords or host data. More information about mutating the log data can be found at https://www.elastic.co/guide/en/logstash/5.5/plugins-filters-mutate.html.
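As an illustration, the following is a minimal sketch of such a mutate filter that could be added to the pipeline configuration; the password field name is an assumption and should be replaced with whatever sensitive fields exist in your log data.
filter {
  mutate {
    replace => { "password" => "[REDACTED]" }
  }
}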
Configuring the curator
If the default logstash index has been removed in favor of namespace based indices, the curator for this ELK stack also needs to be modified to clean up log data older than the number of days specified in the ELK deployment. As this Elasticsearch may contain logs for one or more namespaces that should be deleted after a certain time period, the default ‘logstash-’ prefix filter can be removed to catch all time-based indices. Use kubectl -n elk edit configmap app-logging-elk-elasticsearch-curator-config to edit the curator configuration and remove the first filtertype in action 1:
- filtertype: pattern
  kind: prefix
  value: logstash-
The resulting filters should look similar to the following, which applies to all indices in this ELK deployment:
filters:
- filtertype: age
  source: name
  direction: older
  timestring: '%Y.%m.%d'
  unit: days
  unit_count: 28
If only the namespace1, namespace2 and namespace3 indices should be deleted, you can use a regex pattern similar to the following:
filters:
- filtertype: pattern
  kind: regex
  value: '^(namespace1-|namespace2-|namespace3-).*$'
- filtertype: age
  source: name
  direction: older
  timestring: '%Y.%m.%d'
  unit: days
  unit_count: 28
Securing access to Kibana
As Kibana was installed with NodePort as the access method, Kibana is exposed to users outside of the cluster, which in most environments is not suitable. To secure Kibana, the NodePort should be removed and an Ingress added with the appropriate annotations to restrict access to Kibana to users of IBM Cloud Private only. At the time of writing, there is no segregation of access to Ingress resources via RBAC. To do this, perform the following steps:
1. Edit the kibana service to remove the spec.ports.nodePort field and value and change spec.type to ClusterIP, using kubectl -n elk edit service kibana. The Service should look similar to Example 5-25.
Example 5-25 Kibana service definition
apiVersion: v1
kind: Service
metadata:
  labels:
    app: app-logging-elk-elasticsearch
    chart: ibm-icplogging-2.2.0
    component: kibana
    heritage: Tiller
    release: app-logging
  name: kibana
  namespace: elk
spec:
  clusterIP: 10.0.144.80
  ports:
  - port: 5601
    protocol: TCP
    targetPort: ui
  selector:
    app: app-logging-elk-elasticsearch
    component: kibana
    role: kibana
  type: ClusterIP
2. Create an ingress using Example 5-26 as a template, replacing the ingress name and path if necessary.
Example 5-26 Kibana secure ingress
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: app-kibana-ingress
  namespace: elk
  annotations:
    icp.management.ibm.com/auth-type: "access-token"
    icp.management.ibm.com/rewrite-target: "/"
    icp.management.ibm.com/secure-backends: "true"
    kubernetes.io/ingress.class: "ibm-icp-management"
spec:
  rules:
  - http:
      paths:
      - path: "/app-kibana"
        backend:
          serviceName: "kibana"
          servicePort: 5601
3. Modify the Kibana ConfigMap to add the server.basePath value. This is required to prevent targeting the default Kibana instance when the /app-kibana ingress path is used in the browser. Edit the Kibana ConfigMap using kubectl -n elk edit configmap app-logging-elk-kibana-config and add server.basePath: "/app-kibana" to the kibana.yml content in the data section.
4. Test the new URL by logging out of IBM Cloud Private and trying to access https://<master-ip>:8443/app-kibana. This should redirect you to the log in page for IBM Cloud Private.
Scaling Filebeat namespaces
Filebeat controls the namespaces from which log data is sent to Elasticsearch. After deploying ELK, you can modify the Filebeat configuration to add or remove namespaces.
To do this with Helm, create a file called fb-ns.yaml with the content in Example 5-27
Example 5-27 fb-ns.yaml
filebeat:
  scope:
    namespaces:
    - production
    - pre-production
Use Helm to push the new configuration, passing the default parameters used during initial chart installation. These values are required as the default chart values are different, and Helm will use the default chart values as a base for changes.
Run the helm upgrade by using helm upgrade:
helm upgrade app-logging ibm-icplogging-2.2.0.tgz --reuse-values -f fb-ns.yaml --namespace elk --recreate-pods --tls
You can also do this without Helm by modifying the Kubernetes resources directly. Edit the input paths in the app-logging-elk-filebeat-ds-config ConfigMap in the elk namespace. For example, to add the pre-production namespace, add the line - "/var/log/containers/*_pre-production_*.log" to the input paths, so the prospector configuration looks like the following:
filebeat.prospectors:
- input_type: log
  paths:
  - "/var/log/containers/*_production_*.log"
  - "/var/log/containers/*_pre-production_*.log"
This method may be preferred if additional modifications have been made to the ELK stack.
5.2.9 Forwarding logs to external logging systems
A common requirement in IBM Cloud Private installations is to integrate with an existing logging solution, where the platform (and all workloads) logs should be sent to another, external logging system, either instead of, or as well as, the platform deployed Elasticsearch.
To achieve this, there are two options to consider:
Logs are sent only to the external logging system, Elasticsearch is redundant.
Logs are sent to both Elasticsearch and the external logging system.
In the case of option 1, the current Filebeat and Logstash components already deployed can be repurposed to ship all log data to the external logging system. Depending on the external system, only Filebeat may need to be retained. The output configuration for Filebeat and Logstash can be redirected to the external system instead of the platform Elasticsearch, but additional security certificates may need to be added as volumes and volume mounts to both Filebeat and/or Logstash, if the external system uses security.
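As a sketch of option 1, assuming the external system is an Elasticsearch instance reachable at elasticsearch.example.com:9200 over plain HTTP (a hypothetical endpoint), the Filebeat output section could point at it directly instead of at Logstash; the index name is also an assumption.
output.elasticsearch:
  hosts: ["elasticsearch.example.com:9200"]
  index: "icp-platform-%{+yyyy.MM.dd}"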
For option 2, depending on the external system, it’s recommended to deploy an additional Filebeat and Logstash to IBM Cloud Private. It’s possible to add an additional ‘pipeline’ to Logstash, so that it can stream logs to the platform Elasticsearch and the external system simultaneously, but this also introduces additional effort to debug in the event one pipeline fails. With separate deployments, it’s easy to determine which Logstash is failing and why.
Both Filebeat and Logstash have a number of plugins that enable output to a variety of endpoints. More information about the available output plugins for Filebeat can be found at https://www.elastic.co/guide/en/beats/filebeat/current/configuring-output.html and information about the available output plugins for Logstash can be found at https://www.elastic.co/guide/en/logstash/5.5/output-plugins.html.
If Vulnerability Advisor is deployed, it’s recommended to retain the current Elasticsearch, as Vulnerability Advisor stores data within Elasticsearch and may not function properly if it cannot reach it.
Deploying Filebeat and Logstash
At the time of writing, there is no independent Helm chart available to deploy Filebeat or Logstash separately. The deployment configuration will depend entirely on the level of security the external system uses, for example basic authentication versus TLS authentication.
In this example, an external Elasticsearch has been set up using HTTP to demonstrate how Filebeat and Logstash can be used in IBM Cloud Private to send logs to an external ELK. Filebeat and Logstash will use the default certificates generated by the platform to secure the communication between these components. Filebeat and Logstash are deployed to a dedicated namespace. Perform the following steps:
1. Create a new namespace called external-logging that will host the Filebeat Daemonset and Logstash Deployment containers:
kubectl create namespace external-logging
2. Create a new namespace called external that will host the containers to generate logs for this example:
kubectl create namespace external
3. Copy the required Secret and ConfigMaps to the external-logging namespace. Note that doing this provides any users with access to this namespace the ability to view the authentication certificates for the platform ELK. In this example, only the required certificates are extracted from the logging-elk-certs in kube-system.
a. Copy the required files from the logging-elk-certs Secret
i. Copy the logging-elk-certs Secret:
kubectl -n kube-system get secret logging-elk-certs -o yaml | sed "s/namespace: kube-system/namespace: external-logging/g" | kubectl -n external-logging create -f -
ii. Remove all entries in data except for the following, using kubectl -n external-logging edit secret logging-elk-certs:
- ca.crt
- logstash.crt
- logstash.key
- filebeat.crt
- filebeat.key
b. Copy the logging-elk-elasticsearch-pki-secret Secret to the external-logging namespace:
kubectl -n kube-system get secret logging-elk-elasticsearch-pki-secret -o yaml | sed "s/namespace: kube-system/namespace: external-logging/g" | kubectl -n external-logging create -f -
c. Copy the logging-elk-elasticsearch-entrypoint ConfigMap:
kubectl -n kube-system get configmap logging-elk-elasticsearch-entrypoint -o yaml | sed "s/namespace: kube-system/namespace: external-logging/g" | kubectl -n external-logging create -f -
4. Create a RoleBinding to the ibm-anyuid-hostaccess-clusterrole so that the Logstash pod can use the host network to reach the external Elasticsearch:
kubectl -n external-logging create rolebinding ibm-anyuid-hostaccess-rolebinding --clusterrole ibm-anyuid-hostaccess-clusterrole --serviceaccount=external-logging:default
 
Tip: This step is not needed if Logstash is being deployed to kube-system namespace, as it automatically inherits this privilege from the ibm-privileged-psp Pod Security Policy.
5. Create the filebeat-config.yaml in Example 5-28.
Example 5-28 filebeat-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: filebeat-ds
  name: filebeat-config
  namespace: external-logging
data:
  filebeat.yml: |-
    filebeat.prospectors:
    - input_type: log
      paths:
      - /var/log/containers/*_external_*.log
      scan_frequency: 10s
      symlinks: true
      json.message_key: log
      json.keys_under_root: true
      json.add_error_key: true
      multiline.pattern: '^\s'
      multiline.match: after
      fields_under_root: true
      fields:
        type: kube-logs
        node.hostname: ${NODE_HOSTNAME}
        pod.ip: ${POD_IP}
      tags:
      - k8s-app
    filebeat.config.modules:
      # Set to true to enable config reloading
      reload.enabled: true
    output.logstash:
      hosts: logstash:5044
      timeout: 15
      ssl.certificate_authorities: ["/usr/share/elasticsearch/config/tls/ca.crt"]
      ssl.certificate: "/usr/share/elasticsearch/config/tls/filebeat.crt"
      ssl.key: "/usr/share/elasticsearch/config/tls/filebeat.key"
      ssl.key_passphrase: "${APP_KEYSTORE_PASSWORD}"
    logging.level: info
6. Create the logstash-config.yaml in Example 5-29, replacing output.elasticsearch.hosts with your own Elasticsearch host.
Example 5-29 logstash-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: logstash
  name: logstash-config
  namespace: external-logging
data:
  k8s.conf: |-
    input {
      beats {
        port => 5044
        ssl => true
        ssl_certificate_authorities => ["/usr/share/elasticsearch/config/tls/ca.crt"]
        ssl_certificate => "/usr/share/elasticsearch/config/tls/logstash.crt"
        ssl_key => "/usr/share/elasticsearch/config/tls/logstash.key"
        ssl_key_passphrase => "${APP_KEYSTORE_PASSWORD}"
        ssl_verify_mode => "force_peer"
      }
    }

    filter {
      if [type] == "kube-logs" {
        mutate {
          rename => { "message" => "log" }
          remove_field => ["host"]
        }

        date {
          match => ["time", "ISO8601"]
        }

        dissect {
          mapping => {
            "source" => "/var/log/containers/%{kubernetes.pod}_%{kubernetes.namespace}_%{container_file_ext}"
          }
        }

        dissect {
          mapping => {
            "container_file_ext" => "%{container}.%{?file_ext}"
          }
          remove_field => ["host", "container_file_ext"]
        }

        grok {
          "match" => {
            "container" => "^%{DATA:kubernetes.container_name}-(?<kubernetes.container_id>[0-9A-Za-z]{64,64})"
          }
          remove_field => ["container"]
        }
      }
    }

    filter {
      # Drop empty lines
      if [log] =~ /^\s*$/ {
        drop { }
      }
      # Attempt to parse JSON, but ignore failures and pass entries on as-is
      json {
        source => "log"
        skip_on_invalid_json => true
      }
    }

    output {
      elasticsearch {
        hosts => ["9.30.123.123:9200"]
        index => "%{kubernetes.namespace}-%{+YYYY.MM.dd}"
        document_type => "%{[@metadata][type]}"
      }
    }
  logstash.yml: |-
    config.reload.automatic: true
    http.host: "0.0.0.0"
    path.config: /usr/share/logstash/pipeline
    xpack.monitoring.enabled: false
    xpack.monitoring.elasticsearch.url: "http://9.30.123.123:9200"
 
Important: This configuration does not use security. To enable security using keystores, add the following section to output.elasticsearch:
  ssl => true
  ssl_certificate_verification => true
  keystore => "/usr/share/elasticsearch/config/tls/keystore.jks"
  keystore_password => "${APP_KEYSTORE_PASSWORD}"
  truststore => "/usr/share/elasticsearch/config/tls/truststore.jks"
  truststore_password => "${CA_TRUSTSTORE_PASSWORD}"
7. Create the logstash-service.yaml in Example 5-30.
Example 5-30 logstash-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: logstash
  name: logstash
  namespace: external-logging
spec:
  ports:
  - name: beats
    port: 5044
    protocol: TCP
    targetPort: 5044
  selector:
    app: logstash
  type: ClusterIP
Alternatively, download the file from https://github.com/IBMRedbooks/SG248440-IBM-Cloud-Private-System-Administrator-s-Guide/tree/master/Ch7-Logging-and-monitoring/Deploying-Filebeat-and-Logstash/logstash-service.yaml.
8. Create the logstash-deployment.yaml in Example 5-31. In this deployment, the Logstash container will run on the management node, so the management node in your environment should have network connectivity to the external ELK. It is scheduled to the management node to prevent unauthorized access to the Logstash pod and protect the target Elasticsearch from unauthorized commands if a worker node is compromised.
Example 5-31 logstash-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: logstash
  name: logstash
  namespace: external-logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logstash
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        productID: none
        productName: Logstash
        productVersion: 5.5.1
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        app: logstash
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/arch
                operator: In
                values:
                - amd64
                - ppc64le
                - s390x
              - key: management
                operator: In
                values:
                - "true"
      containers:
      - command:
        - /bin/bash
        - /scripts/entrypoint.sh
        env:
        - name: LS_JAVA_OPTS
          value: -Xmx512m -Xms512m
        - name: CFG_BASEDIR
          value: /usr/share/logstash
        - name: CA_TRUSTSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: caTruststorePassword
              name: logging-elk-elasticsearch-pki-secret
        - name: APP_KEYSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: appKeystorePassword
              name: logging-elk-elasticsearch-pki-secret
        image: ibmcom/icp-logstash:5.5.1-f2
        imagePullPolicy: IfNotPresent
        name: logstash
        ports:
        - containerPort: 5044
          protocol: TCP
        resources:
          limits:
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/logstash/pipeline
          name: pipeline-config
        - mountPath: /usr/share/logstash/config/logstash.yml
          name: logstash-config
          subPath: logstash.yml
        - mountPath: /usr/share/logstash/data
          name: data
        - mountPath: /scripts
          name: entrypoint
        - mountPath: /usr/share/elasticsearch/config/tls
          name: certs
          readOnly: true
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: k8s.conf
            path: k8s.conf
          name: logstash-config
        name: pipeline-config
      - configMap:
          defaultMode: 420
          items:
          - key: logstash.yml
            path: logstash.yml
          name: logstash-config
        name: logstash-config
      - configMap:
          defaultMode: 365
          items:
          - key: logstash-entrypoint.sh
            path: entrypoint.sh
          - key: map-config.sh
            path: map-config.sh
          name: logging-elk-elasticsearch-entrypoint
        name: entrypoint
      - name: certs
        secret:
          defaultMode: 420
          secretName: logging-elk-certs
      - emptyDir: {}
        name: data
9. Create the filebeat-ds.yaml in Example 5-32.
Example 5-32 filebeat-ds.yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: filebeat
  name: filebeat-ds
  namespace: external-logging
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      annotations:
        productID: none
        productName: filebeat
        productVersion: 5.5.1
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        app: filebeat
    spec:
      containers:
      - env:
        - name: NODE_HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: CA_TRUSTSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: caTruststorePassword
              name: logging-elk-elasticsearch-pki-secret
        - name: APP_KEYSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: appKeystorePassword
              name: logging-elk-elasticsearch-pki-secret
        image: ibmcom/icp-filebeat:5.5.1-f2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - ps aux | grep '[f]ilebeat' || exit 1
          failureThreshold: 3
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: filebeat
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - ps aux | grep '[f]ilebeat' || exit 1
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/filebeat/filebeat.yml
          name: config
          subPath: filebeat.yml
        - mountPath: /usr/share/filebeat/data
          name: data
        - mountPath: /usr/share/elasticsearch/config/tls
          name: certs
          readOnly: true
        - mountPath: /var/log/containers
          name: container-log
          readOnly: true
        - mountPath: /var/log/pods
          name: pod-log
          readOnly: true
        - mountPath: /var/lib/docker/containers/
          name: docker-log
          readOnly: true
      restartPolicy: Always
      securityContext:
        runAsUser: 0
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: filebeat.yml
            path: filebeat.yml
          name: filebeat-config
        name: config
      - name: certs
        secret:
          defaultMode: 420
          secretName: logging-elk-certs
      - emptyDir: {}
        name: data
      - hostPath:
          path: /var/log/containers
          type: ""
        name: container-log
      - hostPath:
          path: /var/log/pods
          type: ""
        name: pod-log
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: docker-log
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
10. After all resources are deployed, check the pods are running:
[root@icp-ha-boot ~]# kubectl -n external-logging get pods
NAME READY STATUS RESTARTS AGE
filebeat-ds-4pf52 1/1 Running 0 81s
filebeat-ds-7hshw 1/1 Running 0 87s
filebeat-ds-bm2dd 1/1 Running 0 89s
filebeat-ds-ddk55 1/1 Running 0 87s
filebeat-ds-l5d2v 1/1 Running 0 84s
filebeat-ds-t26gt 1/1 Running 0 90s
filebeat-ds-x5xkx 1/1 Running 0 84s
logstash-6d7f976b97-f85ft 1/1 Running 0 11m
11. Create some workload in the external namespace, so that some log data is generated. In this example, two WebSphere Liberty Helm charts were deployed from the IBM public Helm chart repository. During start up the containers create log data which is sent to the external Elasticsearch, as shown in Figure 5-29.
 
Figure 5-29 Logs sent to external ELK
Reconfiguring the platform Logstash
If all platform and application logs are sent to an external system, requiring no use of the platform ELK, you can modify the Logstash outputs to send logs to the external system instead. This can be done by modifying the output section in the logging-elk-logstash-config ConfigMap in the kube-system namespace. The output section looks similar to the following:
output {
  elasticsearch {
    hosts => ["9.30.231.235:9200"]
    index => "%{kubernetes.namespace}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}
This output uses the Elasticsearch plug-in. A list of all available plug-ins can be found at https://www.elastic.co/guide/en/logstash/5.5/output-plugins.html. For example, to use a generic HTTP endpoint, use the following example:
output {
  http {
    http_method => "post"
    url => "http://<external_url>"
    <other_options>
    ...
  }
}
5.2.10 Forwarding logs from application log files
In many containerised applications today, not all the log data generated by the application is sent to stdout and stderr; some of it is written to a log file instead. Unless this log file is on a filesystem mounted from a PersistentVolume, the log data will be lost every time Kubernetes restarts the container. There are several ways in which log data from a log file in a container can make its way to Elasticsearch, but this section will focus on two common methods:
1. Using a Filebeat side car to read data from specific directories or files and stream the content to stdout or stderr.
2. Using a Filebeat side car to read data from specific directories or files and stream the output directly to Logstash.
Option 1 is the recommended option, as all the log data is sent to stdout and stderr. This also allows tools such as kubectl logs to read the log data, because Docker stores the stdout and stderr data on the host, where it is read by the kubelet. This data is also automatically sent to Elasticsearch using the default logging mechanisms. See Figure 5-30.
Figure 5-30 Side car logging to stdout and stderr
Option 2 is also a useful solution, as the data from specific log files can be parsed or transformed in the side car and pushed directly to a logging solution, whether it’s the platform ELK, an application dedicated logging system or an external logging system entirely. See Figure 5-31.
Figure 5-31 Side car logging directly to logging system
 
Important: At the time of writing, the current Filebeat image runs as the root user within the container, so ensure this complies with your security policies before giving the namespace access to a PodSecurityPolicy that allows containers to be run with the root user.
Using a Filebeat side car to forward log file messages to stdout and stderr
This example will use a simple WebSphere Liberty deployment and a Filebeat side car to send log data from a file populated within the WebSphere container to stdout, to simulate typical application logging. Similar functionality can be achieved by mounting another image, such as busybox, to tail the file and redirect it to stdout in the busybox container, but this approach does not scale and requires multiple busybox side car containers for multiple log files. Filebeat has a scalability advantage, as well as advanced data processing, to output the data to stdout and stderr flexibly.
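For comparison, the busybox approach would look similar to the following container fragment; it is only a sketch, assuming the application and the side car share the was-logging volume used in Example 5-34 and that the application writes to /var/log/app.log.
- name: app-log-tail
  image: busybox
  args: [/bin/sh, -c, 'tail -n+1 -f /var/log/app.log']
  volumeMounts:
  - name: was-logging
    mountPath: /var/log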
To create a WebSphere and Filebeat side car Deployment, perform the following steps:
1. Create a new namespace called sidecar for this example
kubectl create namespace sidecar
2. Creating a ConfigMap for the Filebeat configuration allows you to reuse the same settings for multiple deployments without redefining an instance of Filebeat every time. Alternatively, you can create one ConfigMap per deployment if your deployment requires very specific variable settings in Filebeat. This ConfigMap will be consumed by the Filebeat container as its core configuration data.
Create the ConfigMap in Example 5-33 to store the Filebeat configuration using kubectl create -f filebeat-sidecar-config.yaml.
Example 5-33 filebeat-sidecar-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-sidecar-config
  namespace: sidecar
data:
  filebeat.yml: |-
    filebeat.prospectors:
    - input_type: log
      paths: '${LOG_DIRS}'
      exclude_lines: '${EXCLUDE_LINES:[]}'
      include_lines: '${INCLUDE_LINES:[]}'

      ignore_older: '${IGNORE_OLDER:0}'
      scan_frequency: '${SCAN_FREQUENCY:10s}'
      symlinks: '${SYMLINKS:true}'
      max_bytes: '${MAX_BYTES:10485760}'
      harvester_buffer_size: '${HARVESTER_BUFFER_SIZE:16384}'

      multiline.pattern: '${MULTILINE_PATTERN:^s}'
      multiline.match: '${MULTILINE_MATCH:after}'
      multiline.negate: '${MULTILINE_NEGATE:false}'

    filebeat.config.modules:
      # Set to true to enable config reloading
      reload.enabled: true

    output.console:
      codec.format:
        string: '%{[message]}'

    logging.level: '${LOG_LEVEL:info}'
This configuration will capture all log messages from the specified files and relay them to stdout. It is worth noting that this simply relays all message types to stdout, which is later captured by the platform Filebeat monitoring the Docker logs. If you require additional formatting of log messages, consider the second approach, which sends formatted log data directly to Logstash.
3. Create a RoleBinding to the ibm-anyuid-clusterrole to enable the Filebeat container to run as the root user:
kubectl -n sidecar create rolebinding sidecar-anyuid-rolebinding --clusterrole=ibm-anyuid-clusterrole --serviceaccount=sidecar:default
4. Create a workload that writes data to a file, with a Filebeat side car. Example 5-34 uses a simple WebSphere Liberty deployment and a Filebeat side car to output the contents of /var/log/applogs/app.log to stdout. In this example, the string Logging data to app.log - <number>: <current-date-time> is output to the app.log file every second. Create the Deployment using kubectl create -f websphere-liberty-fb-sidecar-deployment.yaml.
Example 5-34 websphere-liberty-fb-sidecar-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: websphere-sidecar
  name: websphere-sidecar
  namespace: sidecar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: websphere-sidecar
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: websphere-sidecar
    spec:
      securityContext:
        runAsUser: 0
      containers:
      - name: ibm-websphere-liberty
        args: [/bin/sh, -c, 'i=0; while true; do echo "Logging data to app.log - $i: $(date)" >> /var/log/app.log; i=$((i+1)); sleep 1; done']
        image: websphere-liberty:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: JVM_ARGS
        - name: WLP_LOGGING_CONSOLE_FORMAT
          value: json
        - name: WLP_LOGGING_CONSOLE_LOGLEVEL
          value: info
        - name: WLP_LOGGING_CONSOLE_SOURCE
          value: message,trace,accessLog,ffdc
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: HTTPENDPOINT_HTTPSPORT
          value: "9443"
        - name: KEYSTORE_REQUIRED
          value: "false"
        resources: {}
        volumeMounts:
        - name: was-logging
          mountPath: /var/log
      - name: filebeat-sidecar
        image: ibmcom/icp-filebeat:5.5.1-f2
        env:
        - name: LOG_DIRS
          value: /var/log/applogs/app.log
        - name: NODE_HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        volumeMounts:
        - name: was-logging
          mountPath: /var/log/applogs
        - name: filebeat-config
          mountPath: /usr/share/filebeat/filebeat.yml
          subPath: filebeat.yml
      volumes:
      - name: was-logging
        emptyDir: {}
      - name: filebeat-config
        configMap:
          name: filebeat-sidecar-config
          items:
          - key: filebeat.yml
            path: filebeat.yml
In this deployment, there are two volumes specified. The filebeat-config volume mounts the ConfigMap data and the was-logging volume stores the application logs in the container. For each folder containing logs, you should create a new volume in a similar way for fine grained control over each file. For deployments that require persistent storage, replace emptyDir: {} with a PersistentVolumeClaim definition (a sketch follows this paragraph), as this deployment will lose its log data if it is restarted. The was-logging volume is mounted to the container running WebSphere and both the was-logging and filebeat-config volumes are mounted to the Filebeat container.
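As a minimal sketch of that substitution, assuming a PersistentVolumeClaim named was-logging-pvc has already been created with a suitable storage class, the volumes section might look like the following (the claim name is hypothetical):
volumes:
- name: was-logging
  persistentVolumeClaim:
    claimName: was-logging-pvc   # hypothetical claim; create the PVC separately before deploying
- name: filebeat-config
  configMap:
    name: filebeat-sidecar-config
    items:
    - key: filebeat.yml
      path: filebeat.yml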
When adding this to an existing deployment, there are several new sections that should be added. The following section defines the Filebeat container with the environment variables that set the directories for Filebeat to use. For the LOG_DIRS variable you can provide a single directory path or a comma-separated list of directories. The was-logging volume is mounted on the /var/log/applogs directory, which corresponds to the /var/log directory in the ibm-websphere-liberty container because both containers mount the same volume. In a real world example, this would be the filepath to the log file(s) you want Filebeat to scan and should match up with the mount path on the volume mount for the WebSphere container.
- name: filebeat-sidecar
  image: ibmcom/icp-filebeat:5.5.1-f2
  env:
  - name: LOG_DIRS
    value: /var/log/applogs/app.log
  - name: NODE_HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  volumeMounts:
  - name: was-logging
    mountPath: /var/log/applogs
  - name: filebeat-config
    mountPath: /usr/share/filebeat/filebeat.yml
    subPath: filebeat.yml
The key integration here is the volume mounts between the ibm-websphere-liberty container and the filebeat-sidecar container. This is what provides the Filebeat side car container access to the log files from the ibm-websphere-liberty container. If multiple log files need to be monitored from different locations within the file system, consider using multiple volumeMounts and update the filebeat-sidecar-config paths accordingly.
After a few minutes (to cater for the containers starting up) the log data should now be visible in Kibana after the platform Filebeat instance has successfully collected logs from Docker and pushed them through the default logging mechanism. See Figure 5-32 on page 216.
 
Figure 5-32 Data collected by the Filebeat
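Because the side car relays the file contents to its own stdout, the output can also be verified directly with kubectl before checking Kibana. The deployment and container names below match Example 5-34:
kubectl -n sidecar logs deployment/websphere-sidecar -c filebeat-sidecar --tail=5
Each line returned should contain the Logging data to app.log - <number> message written by the WebSphere container.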
Using a Filebeat side car to forward log file messages to Logstash
This example will use a simple WebSphere Liberty deployment and Filebeat side car to send log data from a file populated by WebSphere within the WebSphere container to the external Elasticsearch used earlier in this chapter, to simulate typical application logs that need to be sent off-site. As the instance of Logstash uses TLS encryption to secure communications within the cluster, it’s important to understand that some certificates will need to be imported to allow this solution to work. This example will use only the minimum certificates required to allow communication between the Filebeat sidecar and Logstash. To use a Filebeat side car to stream logs directly to Logstash, perform the following steps:
1. Copy the required Secret and ConfigMaps to the sidecar namespace. Note that doing this provides any users with access to this namespace the ability to view the authentication certificates for the platform ELK. In this example, only the required certificates are extracted from the logging-elk-certs in kube-system.
a. Copy the required files from the logging-elk-certs Secret:
i. Copy the logging-elk-certs Secret:
kubectl -n kube-system get secret logging-elk-certs -o yaml | sed "s/namespace: kube-system/namespace: sidecar/g" | kubectl -n sidecar create -f -
ii. Remove all entries in data except for the following, using kubectl -n sidecar edit secret logging-elk-certs:
- ca.crt
- filebeat.crt
- filebeat.key
b. Copy the logging-elk-elasticsearch-pki-secret Secret to the sidecar namespace:
kubectl -n kube-system get secret logging-elk-elasticsearch-pki-secret -o yaml | sed "s/namespace: kube-system/namespace: sidecar/g" | kubectl -n sidecar create -f -
 
 
Important: These commands will copy the certificates used within the target ELK stack, which could be used to access Elasticsearch itself. If this does not conform to security standards, consider deploying a dedicated ELK stack for a specific application, user or namespace and provide these certificates for that stack.
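Before continuing, a quick check that both objects now exist in the sidecar namespace:
kubectl -n sidecar get secret logging-elk-certs logging-elk-elasticsearch-pki-secret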
2. Create the filebeat-sidecar-logstash-config.yaml ConfigMap in Example 5-35 using kubectl create -f filebeat-sidecar-logstash-config.yaml.
Example 5-35 filebeat-sidecar-logstash-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-sidecar-logstash-config
  namespace: sidecar
data:
  filebeat.yml: |-
    filebeat.prospectors:
    - input_type: log
      paths: '${LOG_DIRS}'
      exclude_lines: '${EXCLUDE_LINES:[]}'
      include_lines: '${INCLUDE_LINES:[]}'
      ignore_older: '${IGNORE_OLDER:0}'
      scan_frequency: '${SCAN_FREQUENCY:10s}'
      symlinks: '${SYMLINKS:true}'
      max_bytes: '${MAX_BYTES:10485760}'
      harvester_buffer_size: '${HARVESTER_BUFFER_SIZE:16384}'

      multiline.pattern: '${MULTILINE_PATTERN:^s}'
      multiline.match: '${MULTILINE_MATCH:after}'
      multiline.negate: '${MULTILINE_NEGATE:false}'

      fields_under_root: '${FIELDS_UNDER_ROOT:true}'
      fields:
        type: '${FIELDS_TYPE:kube-logs}'
        node.hostname: '${NODE_HOSTNAME}'
        pod.ip: '${POD_IP}'
        kubernetes.namespace: '${NAMESPACE}'
        kubernetes.pod: '${POD_NAME}'
      tags: '${TAGS:sidecar-ls}'

    filebeat.config.modules:
      # Set to true to enable config reloading
      reload.enabled: true

    output.logstash:
      hosts: '${LOGSTASH:logstash.kube-system:5044}'
      timeout: 15
      ssl.certificate_authorities: ["/usr/share/elasticsearch/config/tls/ca.crt"]
      ssl.certificate: "/usr/share/elasticsearch/config/tls/filebeat.crt"
      ssl.key: "/usr/share/elasticsearch/config/tls/filebeat.key"
      ssl.key_passphrase: ${APP_KEYSTORE_PASSWORD}

    logging.level: '${LOG_LEVEL:info}'
This ConfigMap configures Filebeat to send its output to the platform Logstash over TLS, with added field identifiers that Logstash will use to filter data.
3. Create the websphere-sidecar-logstash-deployment.yaml in Example 5-36 using kubectl create -f websphere-sidecar-logstash-deployment.yaml.
Example 5-36 websphere-liberty-fb-sidecar-logstash-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: websphere-sidecar-logstash
  name: websphere-sidecar-logstash
  namespace: sidecar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: websphere-sidecar-logstash
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: websphere-sidecar-logstash
    spec:
      securityContext:
        runAsUser: 0
      containers:
      - name: ibm-websphere-liberty
        args: [/bin/sh, -c, 'i=0; while true; do echo "Using Logstash - Logging data to app.log - $i: $(date)" >> /var/log/app.log; i=$((i+1)); sleep 1; done']
        image: websphere-liberty:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: JVM_ARGS
        - name: WLP_LOGGING_CONSOLE_FORMAT
          value: json
        - name: WLP_LOGGING_CONSOLE_LOGLEVEL
          value: info
        - name: WLP_LOGGING_CONSOLE_SOURCE
          value: message,trace,accessLog,ffdc
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: HTTPENDPOINT_HTTPSPORT
          value: "9443"
        - name: KEYSTORE_REQUIRED
          value: "false"
        resources: {}
        volumeMounts:
        - name: was-logging
          mountPath: /var/log
      - name: filebeat-sidecar
        image: ibmcom/icp-filebeat:5.5.1-f2
        env:
        - name: LOG_DIRS
          value: /var/log/applogs/app.log
        - name: NODE_HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: CA_TRUSTSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: caTruststorePassword
              name: logging-elk-elasticsearch-pki-secret
        - name: APP_KEYSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: appKeystorePassword
              name: logging-elk-elasticsearch-pki-secret
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        volumeMounts:
        - name: was-logging
          mountPath: /var/log/applogs
        - name: filebeat-config
          mountPath: /usr/share/filebeat/filebeat.yml
          subPath: filebeat.yml
        - mountPath: /usr/share/elasticsearch/config/tls
          name: certs
          readOnly: true
      volumes:
      - name: was-logging
        emptyDir: {}
      - name: filebeat-config
        configMap:
          name: filebeat-sidecar-logstash-config
          items:
          - key: filebeat.yml
            path: filebeat.yml
      - name: certs
        secret:
          defaultMode: 420
          secretName: logging-elk-certs
This is similar to Example 5-34, but with a few key differences. First, the certificates in the logging-elk-certs Secret are mounted as a volume in the filebeat-sidecar container definition, allowing it to communicate securely using TLS with the Logstash instance. Second, additional environment variables provide the Filebeat side car with extra information that is forwarded to Logstash.
- name: NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
Without this, the log entries in Elasticsearch would not contain the namespace and pod name fields.
After a few minutes (to allow the containers to start up), the log data should be visible in Kibana once the Filebeat side car has forwarded the log file entries to Logstash, which processes and indexes them in Elasticsearch. See Figure 5-33.
Figure 5-33 Logs collected by the Filebeat instance
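To isolate these entries in Kibana, a lucene query that filters on the fields set by the ConfigMap in Example 5-35 can be used, for example:
tags: "sidecar-ls" AND kubernetes.namespace: "sidecar"
The tag value comes from the TAGS default defined in the Filebeat configuration, so adjust the query if a different tag was set.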
 
5.3 IBM Cloud Private Monitoring
IBM Cloud Private Version 3.1.2 uses the AlertManager, Prometheus, and Grafana stack for system monitoring. The components listed in Table 5-5 on page 222 run on the management nodes.
Table 5-5 Monitoring and alerting components
AlertManager
- Versions: AlertManager (0.5.0)
- Role: Handles alerts sent by the Prometheus server. It takes care of de-duplicating, grouping, and routing them to the correct receiver integration, such as Slack, email, or PagerDuty.
Grafana
- Versions: Grafana (5.2.0)
- Role: Data visualization and monitoring, with support for Prometheus as a data source.
Prometheus
- Versions: Prometheus (2.3.1), collectd_exporter (0.4.0), node_exporter (0.16.0), configmap_reload (0.2.2), elasticsearch-exporter (1.0.2), kube-state-metrics-exporter (1.3.0)
- Role: Collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
5.3.1 How Prometheus works
Prometheus is a monitoring platform that collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets. Prometheus offers a richer data model and query language, in addition to being easier to run and integrate into the environment to be monitored. Figure 5-34 shows the Prometheus internal architecture.
Figure 5-34 Prometheus internal architecture
Prometheus discovers targets to scrape from service discovery. The scrape discovery manager is a discovery manager that uses Prometheus’s service discovery functionality to find and continuously update the list of targets from which Prometheus should scrape metrics. It runs independently of the scrape manager which performs the actual target scrape and feeds it with a stream of target group updates over a synchronization channel.
Prometheus stores time series samples in a local time series database (TSDB) and optionally also forwards a copy of all samples to a set of configurable remote endpoints. Similarly, Prometheus reads data from the local TSDB and optionally also from remote endpoints. The scrape manager is responsible for scraping metrics from discovered monitoring targets and forwarding the resulting samples to the storage subsystem.
Every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels. The metric name specifies the general feature of a system that is measured (for example http_requests_total - the total number of HTTP requests received). It may contain ASCII letters and digits, as well as underscores and colons. It must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*.
The Prometheus client libraries offer four core metric types. These are currently differentiated only in the client libraries (to enable APIs tailored to the usage of the specific types) and in the wire protocol. The types, with example PromQL queries after the list, are:
Counter - A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
Gauge - A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also “counts” that can go up and down, like the number of concurrent requests.
Histogram - A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
Summary - Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
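The metric type determines how a series is typically queried in PromQL. The following queries are generic sketches: http_requests_total and node_memory_MemFree_bytes are standard counter and gauge examples used elsewhere in this chapter, while http_request_duration_seconds is a hypothetical histogram used only for illustration.
# Counter: convert a monotonically increasing counter into a per-second rate
rate(http_requests_total[5m])
# Gauge: read the current value directly, for example free memory on a node
node_memory_MemFree_bytes
# Histogram: estimate the 95th percentile from the bucket counters
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))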
The PromQL engine is responsible for evaluating PromQL expression queries against Prometheus's time series database. The engine does not run as its own actor goroutine, but is used as a library from the web interface and the rule manager. PromQL evaluation happens in multiple phases: when a query is created, its expression is parsed into an abstract syntax tree and results in an executable query. The subsequent execution phase first looks up and creates iterators for the necessary time series from the underlying storage. It then evaluates the PromQL expression on the iterators. Actual time series bulk data retrieval happens lazily during evaluation (at least in the case of the local TSDB). Expression evaluation returns a PromQL expression type, which most commonly is an instant vector or range vector of time series.
Prometheus serves its web UI and API on port 9090 by default. The web UI is available at / and serves a human-usable interface for running expression queries, inspecting active alerts, or getting other insight into the status of the Prometheus server.
For more information about Prometheus, see the Prometheus documentation at https://prometheus.io/docs/.
IBM Cloud Private provides the exporters listed in Table 5-6 to expose metrics:
Table 5-6 IBM Cloud Private exporters
node-exporter - Provides node-level metrics, including metrics for CPU, memory, disk, network, and other components.
kube-state-metrics - Provides metrics for Kubernetes objects, including metrics for pod, deployment, statefulset, daemonset, replicaset, configmap, service, job, and other objects.
elasticsearch-exporter - Provides metrics for the IBM Cloud Private Elasticsearch logging service, including the status of the Elasticsearch cluster, shards, and other components.
collectd-exporter - Provides metrics that are sent from the collectd network plug-in.
Role-based access control to IBM Cloud Private monitoring
A user with the role ClusterAdministrator, Administrator, or Operator can access the monitoring service. A user with the role ClusterAdministrator or Administrator can perform write operations in the monitoring service, including deleting Prometheus metrics data and updating Grafana configurations. Starting with version 1.2.0, the ibm-icpmonitoring Helm chart introduces an important feature: a new module that provides role-based access control (RBAC) for access to the Prometheus metrics data.
The RBAC module is effectively a proxy that sits in front of the Prometheus client pod. It examines requests for authorization headers and, at that point, enforces role-based controls. It does this by retrieving the user's current IBM Cloud Private access token, which is obtained when logging in to the IBM Cloud Private dashboard. From the access token, the proxy can identify the user and their roles, and use this information to filter the query results, ensuring users can only see metrics they are authorized to see.
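As a sketch of what this looks like in practice, an authenticated query can be sent to the Prometheus HTTP API through the proxy by passing the access token in an Authorization header. This assumes Prometheus is reachable at https://<master-ip>:8443/prometheus (see 5.3.4, "Accessing Prometheus, Alertmanager and Grafana dashboards") and that a valid access token has been obtained from an IBM Cloud Private login:
# Replace <master-ip> and supply a valid IBM Cloud Private access token
TOKEN=<access-token-from-icp-login>
curl -k -H "Authorization: Bearer $TOKEN" \
  "https://<master-ip>:8443/prometheus/api/v1/query?query=up"
The /api/v1/query endpoint is the standard Prometheus query API; the proxy filters the results according to the roles associated with the token.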
5.3.2 How AlertManager works
Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as:
Email
Generic Webhooks
HipChat
OpsGenie
PagerDuty
Pushover
Slack
The rule manager in Prometheus is responsible for evaluating recording and alerting rules on a periodic basis (as configured using the evaluation_interval configuration file setting). It evaluates all rules on every iteration using PromQL and writes the resulting time series back into the storage. The notifier takes alerts generated by the rule manager via its Send() method, enqueues them, and forwards them to all configured Alertmanager instances. The notifier serves to decouple generation of alerts from dispatching them to Alertmanager (which may fail or take time).
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration. The following describes the core concepts that the Alertmanager implements (a configuration sketch illustrating grouping and inhibition follows the list):
Grouping - Grouping categorizes alerts of similar nature into a single notification. This is especially useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
Inhibition - Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing.
Silences - Silences are a straightforward way to simply mute alerts for a given time. A silence is configured based on matchers, just like the routing tree. Incoming alerts are checked whether they match all the equality or regular expression matchers of an active silence. If they do, no notifications will be sent out for that alert.
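As an illustration of grouping and inhibition, the relevant sections of a generic alertmanager.yml might look like the following. This is a sketch, not the IBM Cloud Private default configuration, and the severity labels are hypothetical:
route:
  receiver: default-receiver
  group_by: ['alertname', 'instance']  # alerts sharing these labels are combined into one notification
  group_wait: 30s                      # wait briefly to collect alerts of the same group before notifying
inhibit_rules:
- source_match:
    severity: 'critical'               # if a critical alert is firing...
  target_match:
    severity: 'warning'                # ...suppress warning alerts...
  equal: ['instance']                  # ...for the same instance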
For more information about Alertmanager, see the Alertmanager documentation at https://prometheus.io/docs/alerting/alertmanager/.
5.3.3 How Grafana works
Grafana includes built-in support for Prometheus. Grafana exposes metrics for Prometheus on the /metrics endpoint. IBM Cloud Private comes with a Grafana dashboard to display the health status and various metrics about the cluster. Users can access the Prometheus UI directly at https://<master-ip>:8443/prometheus.
Grafana.com maintains a collection of shared dashboards that can be downloaded and used with standalone instances of Grafana. Prebuilt dashboards can be downloaded from https://grafana.com/dashboards.
A Prometheus data source can be configured in Grafana and its data used in graphs. For more details, see the Grafana documentation for the Prometheus data source.
IBM Cloud Private provides the Grafana dashboards listed in Table 5-7.
Table 5-7 IBM Cloud Private Grafana dashboards
ElasticSearch - Provides information about ElasticSearch cluster statistics, shards, and other system information.
Etcd by Prometheus - Etcd dashboard for the Prometheus metrics scraper.
Helm Release Metrics - Provides information about system metrics such as CPU and memory for each Helm release, filtered by pods.
ICP Namespaces Performance IBM Provided 2.5 - Provides information about namespace performance and status metrics.
Cluster Network Health (Calico) - Calico hosts, workload, and system metric performance information.
ICP Performance IBM Provided 2.5 - Provides TCP system performance information about nodes, memory, and containers.
Kubernetes Cluster Monitoring - Monitors Kubernetes clusters that use Prometheus. Provides information about cluster CPU, memory, and file system usage. The dashboard also provides statistics for individual pods, containers, and systemd services.
Kubernetes POD Overview - Monitors pod metrics such as CPU, memory, network, pod status, and restarts.
NGINX Ingress controller - Provides information about NGINX Ingress controller metrics that can be sorted by namespace, controller class, controller, and ingress.
Node Performance Summary - Provides information about system performance metrics such as CPU, memory, disk, and network for all nodes in the cluster.
Prometheus Stats - Dashboard for monitoring Prometheus v2.x.x.
Storage GlusterFS Health - Provides GlusterFS health metrics such as status, storage, and node.
Rook-Ceph - Dashboard that provides statistics about Ceph instances.
Storage Minio Health - Provides storage and network details about Minio server instances.
5.3.4 Accessing Prometheus, Alertmanager and Grafana dashboards
To access the Prometheus, Alertmanager and Grafana dashboard first log in to the IBM Cloud Private management console.
To access the Grafana dashboard, click Menu → Platform → Monitoring. Alternatively, you can open https://<IP_address>:<port>/grafana, where <IP_address> is the DNS or IP address that is used to access the IBM Cloud Private console. <port> is the port that is used to access the IBM Cloud Private console.
To access the Alertmanager dashboard, click Menu → Platform → Alerting. Alternatively, you can open https://<IP_address>:<port>/alertmanager.
To access the Prometheus dashboard, open https://<IP_address>:<port>/prometheus.
5.3.5 Configuring Prometheus Alertmanager and Grafana in IBM Cloud Private
Users can customize the monitoring service pre-installation or post-installation. If configuring pre-installation, make changes to the config.yaml file located in the /<installation_directory>/cluster folder. Users can customize the values of the parameters as required.
The monitoring.prometheus section has the following parameters:
prometheus.scrapeInterval - is the frequency to scrape targets in Prometheus
prometheus.evaluationInterval - is the frequency to evaluate rules in Prometheus
prometheus.retention - is the duration of time to retain the monitoring data
prometheus.persistentVolume.enabled - is a flag that users set to use a persistent volume for Prometheus. Setting the flag to false means that Prometheus does not use a persistent volume
prometheus.persistentVolume.storageClass - is the storage class to be used by Prometheus
prometheus.resources.limits.cpu - is the CPU limit that you set for the Prometheus container. The default value is 500 millicpu.
prometheus.resources.limits.memory - is the memory limit that you set for the Prometheus container. The default value is 512 MB.
The monitoring.alertmanager section has the following parameters:
alertmanager.persistentVolume.enabled - is a flag that you set to use a persistent volume for Alertmanager. The flag false means that you do not use a persistent volume
alertmanager.persistentVolume.storageClass - is the storage class to be used by Alertmanager
alertmanager.resources.limits.cpu - is the CPU limit that you set for the Alertmanager container. The default value is 200 millicpu.
alertmanager.resources.limits.memory - is the memory limit that you set for the Alertmanager container. The default value is 256 MB.
The monitoring.grafana section has the following parameters:
grafana.user - is the user name that you use to access Grafana.
grafana.password - is the password of the user who is specified in the grafana.user parameter.
grafana.persistentVolume.enabled - is a flag that you set to use a persistent volume for Grafana. The flag false means that you do not use a persistent volume.
grafana.persistentVolume.storageClass - is the storage class to be used by Grafana
grafana.resources.limits.cpu - is the CPU limit that you set for the Grafana container. The default value is 500 millicpu.
grafana.resources.limits.memory - is the memory limit that you set for the Grafana container. The default value is 512 MB.
The resulting entry in config.yaml might resemble the YAML in Example 5-37.
Example 5-37 Monitoring configuration in config.yaml
monitoring:
  prometheus:
    scrapeInterval: 1m
    evaluationInterval: 1m
    retention: 24h
    persistentVolume:
      enabled: false
      storageClass: "-"
    resources:
      limits:
        cpu: 500m
        memory: 2048Mi
      requests:
        cpu: 100m
        memory: 128Mi
  alertmanager:
    persistentVolume:
      enabled: false
      storageClass: "-"
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
      requests:
        cpu: 10m
        memory: 64Mi
  grafana:
    persistentVolume:
      enabled: false
      storageClass: "-"
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 100m
        memory: 128Mi
For more information and additional parameters to set, see https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.1.2/installing/monitoring.html
Example 5-38, Example 5-39, and Example 5-40 show the kubectl command to verify the configurations set for Prometheus, Alertmanager, and Grafana after they are deployed.
Example 5-38 Command to verify the configurations for Prometheus
kubectl -n kube-system get configmap monitoring-prometheus -o yaml
Example 5-39 Command to verify the configuration for Alertmanager
kubectl -n kube-system get configmap monitoring-prometheus-alertmanager -o yaml
Example 5-40 Command to verify the configuration for Grafana
kubectl -n kube-system get configmap monitoring-grafana -o yaml
5.3.6 Creating Prometheus alert rules
You can use the Kubernetes custom resource, AlertRule, to manage alert rules in IBM Cloud Private. Example 5-41 shows an alert rule that triggers an alert if node memory consumption is greater than 60%.
Create a rule file sample-rule.yaml, as shown in Example 5-41.
Example 5-41 Alert rule to monitor node memory
apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: AlertRule
metadata:
  name: sample-redbook-rule
spec:
  enabled: true
  data: |-
    groups:
    - name: redbook.rules
      rules:
      - alert: NodeMemoryUsage
        expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes) * 100 > 60
        annotations:
          DESCRIPTION: '{{ $labels.instance }}: Memory usage is above the 60% threshold. The current value is: {{ $value }}.'
          SUMMARY: '{{ $labels.instance }}: High memory usage detected'
To create the rule, run the command, as shown in Example 5-42.
Example 5-42 Create node memory monitoring rule
$ kubectl apply -f sample-rule.yaml -n kube-system
alertrule "sample-redbook-rule" configured
When the rule is created, you will be able to see the new alert in the Alerts tab in the Prometheus dashboard.
Figure 5-35 shows the Node Memory Alert rule created and started for one of the nodes.
Figure 5-35 Node memory usage alert rule
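To confirm that the custom resource exists, the AlertRule resources can also be listed with kubectl; this assumes the CRD registers the plural name alertrules:
kubectl -n kube-system get alertrules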
5.3.7 Configuring Alertmanager to integrate external alert service receivers
IBM Cloud Private has a built-in Alertmanager that provides the status and details of each triggered alert. You then have the option to view more details or silence the alert. Figure 5-36 on page 231 shows an alert triggered in the Alertmanager dashboard.
Figure 5-36 Alert triggered for NodeMemoryUsage Alert rule
This example will configure Alertmanager to send notifications to Slack. The Alertmanager uses the Incoming Webhooks feature of Slack, so first we need to set that up. Go to the Incoming Webhooks page in the App Directory and click Install (or Configure and then Add Configuration if it is already installed). Once a channel is configured for the incoming webhook, Slack provides a webhook URL. This URL must then be configured in the Alertmanager configuration, as shown in Example 5-44.
On the IBM Cloud Private boot node (or wherever kubectl is installed) run the following command to pull the current ConfigMap data into a local file. Example 5-43 shows how to get the alertmanager ConfigMap.
Example 5-43 Pull the current ConfigMap data into a local file
kubectl get configmap monitoring-prometheus-alertmanager -n kube-system -o yaml > monitoring-prometheus-alertmanager.yaml
Edit monitoring-prometheus-alertmanager.yaml as shown in Example 5-44.
Example 5-44 Slack configuration in Alertmanager ConfigMap
apiVersion: v1
data:
  alertmanager.yml: |-
    global:
    receivers:
      - name: default-receiver
        slack_configs:
        - api_url: https://hooks.slack.com/services/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
          channel: '#icp-notification'
    route:
      group_wait: 10s
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
kind: ConfigMap
metadata:
  creationTimestamp: 2019-02-20T17:07:20Z
  labels:
    app: monitoring-prometheus
    chart: ibm-icpmonitoring-1.4.0
    component: alertmanager
    heritage: Tiller
    release: monitoring
  name: monitoring-prometheus-alertmanager
  namespace: kube-system
  resourceVersion: "3894"
  selfLink: /api/v1/namespaces/kube-system/configmaps/monitoring-prometheus-alertmanager
  uid: fbd6cd7c-3531-11e9-99e8-06d591293f01
Example 5-44 on page 231 shows the Alertmanager ConfigMap with updated Slack configuration including the webhook URL and channel name. Save this file and run the command in Example 5-45 to update the Alertmanager configuration.
Example 5-45 Update Alertmanager configuration command
kubectl apply -f monitoring-prometheus-alertmanager.yaml
Figure 5-37 on page 233 shows that the Node memory usage alert is sent as a notification on Slack for the operations teams to look at.
Figure 5-37 Node memory usage notification on Slack
5.3.8 Using Grafana
As discussed in section 5.3.3, “How Grafana works” on page 225, users can use various types of dashboards with the Prometheus data source. Grafana is already configured to use the Prometheus time series data source.
Users can use existing dashboards like Prometheus stats to see various statistics of Prometheus monitoring system. Figure 5-38 shows the Prometheus statistics dashboard.
Figure 5-38 Prometheus Statistics dashboard in Grafana
To import dashboards available on the Grafana website, click Import and add the Grafana dashboard URL or ID. In this case we want to import a dashboard for Kubernetes Deployment, Statefulset, and Daemonset metrics from https://grafana.com/dashboards.
Figure 5-39 shows how to import an external Prometheus dashboard published on grafana.com.
Figure 5-39 Import an external Prometheus dashboard
Configure the data source to Prometheus and set a unique identifier for the dashboard. Alternatively, you can import the dashboard by exporting the JSON from grafana.com and importing it in your Grafana user interface.
This dashboard should now be available in the list of dashboards. Figure 5-40 shows the Kubernetes Deployment Statefulset Daemonset metrics dashboard.
Figure 5-40 Kubernetes Deployment Statefulset Daemonset metrics dashboard
 