Self-monitoring dashboards

vRealize Operations offers out-of-the-box dashboards that can help you monitor the overall status of a cluster and its nodes, including main services and key components.

These dashboards are fed by the built-in adapter that self-monitors node and cluster status. They give a holistic picture of what vRealize Operations is doing and make it easier to check alerts or symptoms on the self-monitoring objects and troubleshoot vRealize Operations. It is worth mentioning that these dashboards were created with technical support engineers in mind, not vRealize Operations end users.

By default, the dashboards are not listed:

You can access them by navigating to the Dashboards tab, then All Dashboards, and then vRealize Operations. Let's take a deeper look at some of these self-monitoring dashboards and how they can help when troubleshooting vRealize Operations.

The first part of the Self Cluster Statistics dashboard is the Top Processing Info widget:

The widget shows the following information:

  • Objects: Number of objects for which data is collected in the vRealize Operations cluster.
  • UI sessions: Number of users accessing the vRealize Operations Product UI.
  • Alerts: The acceptable number depends on the management packs installed, but values approaching 1M may result in performance problems.
  • Unique string metrics: Depends on the management packs installed, but values approaching 1M may result in performance problems.
  • Metrics: Number of metrics collected.
  • Forward data entries: Each vRealize Operations node can store a maximum of 19,999 entries in its forward data queue. If the value continuously stays at this maximum, it indicates that analytics may be busy.
  • Alarms: Depends on the management packs installed, but values approaching 1M may result in performance problems.
  • CIQ: Should not be invoked frequently; it normally starts at night and finishes after several hours.
  • Node count: Number of nodes available in the cluster.
  • Numeric properties: Total number of numeric properties in the cluster.
  • String properties: Total number of string properties in the cluster.
  • Load metrics: Metrics loaded into the analytics cache during the last collection cycle. Should be close to 0. Occasional spikes are expected, but they should not be frequent in a stable environment.
  • DT: Dynamic threshold calculation. There should not be frequent invocations; it normally starts at night and finishes after several hours.
  • Incoming metrics: Metrics collected during the last collection cycle.
  • Forward data entries: Summation of the forward data queue size on all nodes (a value above 20,000 usually means that some of the nodes have an issue processing data).
  • Remove metrics: Should be close to 0. Occasional spikes are expected, but they should not be frequent in a stable environment.
The dashboard gets updated every 5 minutes (the default collection cycle).
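The same counters can also be pulled programmatically, which is handy when you want to trend them outside of the product UI. The following Python sketch uses the vRealize Operations Suite API; the host name, credentials, and the adapterKind filter used to find the self-monitoring objects are assumptions for illustration, so verify the exact kind keys against the Suite API documentation for your version.

```python
# Minimal sketch: pull the latest self-monitoring stats via the Suite API.
# Host, credentials, and the adapterKind filter below are assumptions for
# illustration; check the Suite API docs for the exact keys in your version.
import requests
import urllib3

urllib3.disable_warnings()                     # lab-only; we skip cert checks below

VROPS = "https://vrops.example.com"            # placeholder node/cluster address
AUTH = {"username": "admin", "password": "changeme"}  # placeholder credentials

session = requests.Session()
session.verify = False                         # use proper certificates in production

# Acquire an API token and use it for subsequent calls
token = session.post(f"{VROPS}/suite-api/api/auth/token/acquire",
                     json=AUTH,
                     headers={"Accept": "application/json"}).json()["token"]
session.headers.update({"Authorization": f"vRealizeOpsToken {token}",
                        "Accept": "application/json"})

# Find self-monitoring resources (the adapter kind key here is an assumption)
resources = session.get(f"{VROPS}/suite-api/api/resources",
                        params={"adapterKind": "vRealizeOpsMgrAPI"}).json()

for res in resources.get("resourceList", []):
    res_id = res["identifier"]
    name = res["resourceKey"]["name"]
    stats = session.get(
        f"{VROPS}/suite-api/api/resources/{res_id}/stats/latest").json()
    for entry in stats.get("values", []):
        for stat in entry.get("stat-list", {}).get("stat", []):
            print(name, stat["statKey"]["key"], stat.get("data"))
```

A scheduled run of a script like this can be used to keep a record of object, metric, and forward data entry counts between collection cycles.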

Scrolling down, we can see the CPU/Memory/Disk widget:

The metrics useful for troubleshooting here are mainly:

  • Free Mem: Shows free memory available in the cluster
  • Swap: Shows whether any swapping is happening in the cluster (a quick node-local check is sketched after this list)
  • Used space: Shows used space in the cluster
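If you want to cross-check the Free Mem and Swap values directly on a node, the same information is available from the standard Linux /proc interface. This is a generic sketch to run locally on a vRealize Operations node; it is not part of the product itself.

```python
# Quick node-local check for free memory and swap usage, complementing the
# CPU/Memory/Disk widget. Generic Linux sketch using standard /proc files.
def meminfo() -> dict:
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.strip().split()[0])   # values are reported in kB
    return values

info = meminfo()
swap_used_kb = info["SwapTotal"] - info["SwapFree"]
print(f"Free memory: {info['MemAvailable'] / 1024:.0f} MB")
print(f"Swap in use: {swap_used_kb / 1024:.0f} MB")
```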

Next, let’s take a look at the Self Services Summary dashboard:

The Available db% value in the Nodes widget should be roughly equal across all nodes if they were all deployed together from the very beginning. If the Threads value in the Collectors widget approaches 30,000 threads, this may result in problems; you may need to check which management packs are running on those collectors.
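When the collector thread count looks suspicious, it can help to confirm it directly on the node before digging into individual management packs. The snippet below is a generic Linux sketch, not a vRealize Operations tool: it assumes shell access to the node and that the collector shows up as a Java process with "collector" in its command line, which may differ in your deployment.

```python
# Rough sketch: count the threads of the collector process on a vROps node.
# Run locally on the node; matching on "collector" in the Java command line
# is an assumption about how the process is named in your deployment.
import os

def thread_count(pid: str) -> int:
    """Read the Threads: field from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 0

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().decode(errors="ignore")
        if "java" in cmdline and "collector" in cmdline:
            print(f"PID {pid}: {thread_count(pid)} threads")
    except OSError:
        continue  # the process exited while we were scanning
```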

Next, let's take a look at the Self Performance Details dashboard:

On the Analytics Data Processing widget, we can find the following metrics:

  • DT read data points: The number of data points read from the FSDB during dynamic threshold (DT) calculation on the selected node.
  • Load object: The number of object cache load operations on the selected node (vRealize Operations loads the cache for an object when it starts monitoring it, for example, when a new object is autodiscovered). In a stable environment where the object count is not changing, this metric should remain at 0. The operation has some performance overhead, so watch out for it; consistently high non-zero values in each cycle may indicate an issue.
  • Load metrics: The number of metric cache load operations on the selected node (the number of metrics loaded for newly loaded objects).
  • Outcome metrics: The number of metrics returned after threshold checking was completed (incoming metrics plus metrics generated by vRealize Operations).
  • Reload object: The number of object cache reload operations (happens when an object is updated, a policy is updated, or the object changes its effective policy). Spikes can be expected, but in general the value should remain close to 0.
  • Reload metrics: The number of metric cache reload operations.
  • Build symptom time: The total time spent in the build symptom phase of threshold checking, summed across the threshold checking threads.
  • Th.ch. overall time: The total threshold checking duration, summed across the threshold checking threads.
  • Load property: The number of property cache load operations. Should be close to 0. Occasional spikes are expected, but they should not be frequent in a stable environment.
  • Remove property: The number of property cache remove operations.
  • Forward data entries: Should be close to 0. If it is close to 20,000, then data is being dropped.
  • CIQ Precomputations: Should show spikes every 1 hour 40 minutes. If spikes do not occur at this interval, it indicates that precomputation is taking a long time (a quick way to check the cadence via the API is sketched after this list).
  • CIQ Computations: There should not be frequent invocations. It should start at night and finish after several hours.
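If you want to verify the CIQ precomputation cadence without watching the widget, the analytics node's stats can be queried over a time window. The sketch below continues the earlier Suite API example (it reuses the authenticated session and the VROPS base URL from that block); the resource ID and the statKey string are placeholders, so look up the real key on the analytics node object in your environment.

```python
# Sketch: check how often "CIQ Precomputations" spikes over the last day,
# reusing the authenticated `session` and `VROPS` from the earlier example.
# The resource ID and statKey below are placeholders to adapt.
import time

NODE_ID = "<analytics-node-resource-id>"      # placeholder resource identifier
STAT_KEY = "CIQ Precomputations"              # placeholder stat key name

end = int(time.time() * 1000)                 # the Suite API uses epoch milliseconds
begin = end - 24 * 60 * 60 * 1000             # last 24 hours

resp = session.get(f"{VROPS}/suite-api/api/resources/{NODE_ID}/stats",
                   params={"statKey": STAT_KEY, "begin": begin, "end": end}).json()

for entry in resp.get("values", []):
    for stat in entry.get("stat-list", {}).get("stat", []):
        spikes = [ts for ts, value in zip(stat["timestamps"], stat["data"]) if value > 0]
        # Expect roughly one spike every 100 minutes; noticeably larger gaps
        # suggest that precomputation is taking longer than it should.
        for earlier, later in zip(spikes, spikes[1:]):
            print(f"gap between spikes: {(later - earlier) / 60000:.0f} minutes")
```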

Some of the metrics worth mentioning on the FSDB Data Processing widget are:

  • Store operations count: How many store operations are performed on the FSDB of the selected node.
  • Work items: The value should be close to 0. Occasional spikes are expected, but they should not be frequent. If you see values close to 400, threshold checking performance may degrade and data may be dropped. In that case, it usually means the underlying storage performance is not sufficient; make sure the vRealize Operations VMs have not been running on snapshots for a long time (see the sketch after this list).
  • FSDB state: Numerical representation of FSDB state (Initial, Synchronizing, Running, Balancing, and more).
  • Store time: The total time spent on store operations, summed across all storage threads.
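Because long-lived snapshots on the vRealize Operations node VMs are a common cause of the storage-related Work items behaviour described above, it can be worth checking for them programmatically in vCenter. The sketch below uses pyVmomi; the vCenter address, credentials, and the assumption that the node VM names contain "vrops" are placeholders to adapt to your environment.

```python
# Sketch: flag vROps node VMs that are running on snapshots, using pyVmomi.
# vCenter address, credentials, and the "vrops" name prefix are assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()        # lab-only; validate certs in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="changeme",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        # vm.snapshot is None when the VM has no snapshots
        if "vrops" in vm.name.lower() and vm.snapshot is not None:
            print(f"{vm.name} is running with snapshots - consider consolidating")
finally:
    Disconnect(si)
```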

Next is the Self Service Communications dashboard:

On the Collector Communications widget, the Rejected Objects value should be 0; otherwise, it means data was dropped because the forward data region was full and analytics was not able to process the incoming data in time.

Finally, let’s take a look at the vCenter Adapter Details dashboard:

Our main interest here is the Adapter Instance Details widget. The metrics of interest are:

  • New objects: In environments where there is not much object churn, this should be close to 0.
  • Collect time: Data collection should be complete within the monitoring interval (by default, 5 minutes).
  • Down resources: Number of resources reported as down by the adapter instance. As with property updates, relationship updates, and events, acceptable values depend on the management packs installed, but counts approaching 1M may result in performance problems. Check these metrics for all management packs, not just vSphere.

We didn't cover all of the available self-monitoring dashboards. Make sure to familiarize yourself with the rest of the dashboards and the information they provide, as it may turn out to be very valuable when troubleshooting vROps issues.
