Monitoring discrete sensors

The sensor list shows some sensors where the value is quite clear, such as for temperatures and fan RPMs. Some of these can be a bit trickier, though. For example, your sensor listing could have a sensor called Power Unit Stat or something similar. These are discrete sensors. You might think that they return 0 for an OK state and 1 for Failure, but they're usually more complicated than that. For example, the power unit sensor can actually return information about eight different states in one retrieved value.

Let's try to monitor it and see what value we can get in Zabbix for such a system:

  1. Navigate to Configuration | Hosts, click on Items next to IPMI host, and click on Create item. Fill in the following:
    • Name: Power Unit Stat (or, if your IPMI-capable device does not provide such a sensor, choose another useful sensor)
    • Type: IPMI agent
    • Key: Power_Unit_Stat
    • IPMI sensor: Power Unit Stat
  2. When done, click on the Add button at the bottom.
If normal sensors work but discrete ones do not, make sure you try with the latest version of the OpenIPMI library. Discrete sensors in OpenIPMI-2.0.16, 2.0.17, and 2.0.18 often have an additional 0 (or some other digit or letter) appended at the end.

Check this item in the Latest data section; it will likely return 0. But what could it return? It's actually a decimal representation of a binary value, where each bit identifies a specific state, most often a failure. For this sensor, the possible states are listed in the Intelligent Platform Management Interface Specification, Second Generation, v2.0.

The latest version of this specification can be found at http://www.intel.com/content/www/us/en/servers/ipmi/ipmi-home.html.

According to that specification, the meanings of the individual hex values are as follows:

Offset  State
00h     Power off/Power down
01h     Power cycle
02h     240 VA Power down
03h     Interlock power down
04h     AC lost/power input lost (the power source for the power unit was lost)
05h     Soft power control failure (the unit did not respond to a request to turn on)
06h     Power unit failure detected
07h     Predictive failure

Looking at the description of the first bit, a binary value of 0 means that the unit is running and reports no problems. A binary value of 1 means that the unit is powered down. We could compare the returned value to 0, and that would indicate that everything is fine with the unit, but what if we want to check some other bit, such as predictive failure? If only that bit were set, the item would return 128. As mentioned before, discrete items return a decimal representation of the binary value. The original binary value is 10000000, where the eighth bit, counting from the least significant, is set; counting from zero, that is bit 7, which corresponds to 07h in the previous table. By the way, this is also the reason why we left the Type of information field as Numeric (unsigned) and Data type as Decimal for this item: although the actual meaning is encoded in a binary representation, the value is transmitted as a decimal integer.

Thus, to check for a predictive failure, we could compare the value to 128, couldn't we? No, not really. If the system is down and also reports a predictive failure, the original binary value would be 10000001, and the decimal value would be 129. It gets even messier when we start to include other bits in there. This is also the reason it's not possible to use value mapping for such items at this time: any combination of bits could be set, up to all of them at once, and there would have to be a value-mapping entry for every possible bit combination (256 of them for eight bits). Oh, and we cannot detect a system being down just by checking for a value of 1, because a value of 129 and a whole bunch of other values would also mean that.
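To make the encoding more concrete, here is a minimal Python sketch that decodes such a decimal value back into the states listed earlier. It is only an illustration and not part of Zabbix; the names POWER_UNIT_STATES and decode_states are hypothetical and were chosen just for this example.

```python
# Illustration only: not Zabbix code. State names follow the list from the
# IPMI specification shown earlier; the list index is the bit offset.
POWER_UNIT_STATES = [
    "Power off/Power down",         # 00h
    "Power cycle",                  # 01h
    "240 VA Power down",            # 02h
    "Interlock power down",         # 03h
    "AC lost/power input lost",     # 04h
    "Soft power control failure",   # 05h
    "Power unit failure detected",  # 06h
    "Predictive failure",           # 07h
]


def decode_states(value):
    """Return the names of all states whose bit is set in the decimal value."""
    return [
        name
        for bit, name in enumerate(POWER_UNIT_STATES)
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # Print the decimal value, its 8-bit binary form, and the decoded states.
    for value in (0, 1, 128, 129):
        print(value, format(value, "08b"), decode_states(value))
```

Running it for 129 lists both the powered-down state and the predictive failure, which is exactly why a single equality comparison is not enough.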

If we can't compare the last value in a simple way, can we reasonably check these discrete sensor values at all? Luckily, yes; Zabbix provides a bitwise trigger function called band(), which was originally implemented specifically for discrete IPMI sensor monitoring.
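The idea behind a bitwise check can be sketched in a few lines of Python. This is not how Zabbix implements band(), and it is not trigger syntax; it only shows the underlying bitwise AND operation. The constant and function names are hypothetical.

```python
# Illustration only: the bitwise AND idea behind a mask check.
# Masks are built from the bit offsets in the specification.
POWERED_DOWN = 1 << 0        # offset 00h, decimal 1
PREDICTIVE_FAILURE = 1 << 7  # offset 07h, decimal 128


def has_state(value, mask):
    """Return True if every bit in mask is also set in value."""
    return (value & mask) == mask


if __name__ == "__main__":
    value = 129  # binary 10000001: powered down and predictive failure

    # A plain equality check misses the predictive failure bit.
    print(value == 128)

    # Masking looks only at the bits of interest and ignores the rest.
    print(has_state(value, PREDICTIVE_FAILURE))
    print(has_state(value, POWERED_DOWN))
```

With the mask applied, the other bits no longer matter, so the same check works for 128, 129, and any other combination that includes the bit in question.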
