Monitoring discrete sensors

The sensor list shows some sensors whose values are easy to interpret: temperatures, fan RPMs, and so on. Others are trickier. For example, your sensor listing could have a sensor called Power Unit Stat or similar. These are discrete sensors. You might hope that they would return 0 for an OK state and 1 for a failure, but they are usually more complicated. The power unit sensor, for instance, can return information about eight different states in a single retrieved value.

Let's try to monitor it and see what value we can get in Zabbix for such a system. Navigate to Configuration | Hosts, click on Items next to IPMI host, and click on Create item. Fill in the following:

  • Name: Enter Power Unit Stat (or, if your IPMI-capable device does not provide such a sensor, choose another useful sensor)
  • Type: IPMI agent
  • Key: Power_Unit_Stat
  • IPMI sensor: Power Unit Stat

When done, click on the Add button at the bottom.

Tip

If normal sensors work but discrete ones do not, make sure you try with the latest version of the OpenIPMI library—older versions add an extra .0 to discrete sensor names.

Check this item in the Latest data section; it likely returns 0. But what could it return? The value is a decimal representation of a binary value, where each bit identifies a specific state, most often a failure. For this sensor, the possible states are listed in the Intelligent Platform Management Interface Specification Second Generation v2.0.

Tip

The latest version of this specification can be reached at http://www.intel.com/content/www/us/en/servers/ipmi/ipmi-home.html.

According to it, the individual hex values have the following meaning:

Offset  State
00h     Power Off/Power Down
01h     Power Cycle
02h     240 VA Power Down
03h     Interlock Power Down
04h     AC lost/Power input lost (the power source for the power unit was lost)
05h     Soft Power Control Failure (the unit did not respond to a request to turn on)
06h     Power Unit Failure detected
07h     Predictive Failure

Looking at the description of the first bit, a binary value of 0 means that the unit is running and reports no problems. A binary value of 1 means that the unit is powered down. We could compare the returned value to 0, and that would indicate that everything is fine with the unit, but what if we would like to check some other bit—for example, the "Predictive failure" one? If only that bit were set, the item would return 128. As mentioned before, discrete items return a decimal representation of the binary value. The original binary value is 10000000 (or 07h in the previous table), where the eighth bit, counting from the least significant, is set. By the way, this is also the reason why we left the Type of information field as Numeric (unsigned) and Data type as Decimal for this item—although the actual meaning is encoded in a binary representation, the value is transmitted as a decimal integer.

Thus, to check for a predictive failure, we could compare the value to 128, couldn't we? No, not really. If the system is down and also reports a predictive failure, the original binary value would be 10000001, and the decimal value would be 129. It gets even messier when other bits are included. This is also the reason it is not possible to use value mapping for such items at this time: any combination of bits could be set, so there would have to be a value-mapping entry for every possible bit combination. For the same reason, we cannot detect a system being down just by checking for the value to be 1, because a value of 129 and a whole bunch of other values would also mean that.
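To make the encoding concrete, here is a minimal Python sketch that decodes a returned value into the states from the table above. It is only an illustration and not part of Zabbix; the names POWER_UNIT_STATES and decode_states are made up for this example.

```python
# State names copied from the offset table above.
# List index = bit position, where 0 is the least significant bit.
POWER_UNIT_STATES = [
    "Power Off/Power Down",
    "Power Cycle",
    "240 VA Power Down",
    "Interlock Power Down",
    "AC lost/Power input lost",
    "Soft Power Control Failure",
    "Power Unit Failure detected",
    "Predictive Failure",
]


def decode_states(value):
    """Return the names of all states whose bit is set in the decimal value."""
    return [
        name
        for bit, name in enumerate(POWER_UNIT_STATES)
        if value & (1 << bit)
    ]


print(decode_states(0))    # []
print(decode_states(128))  # ['Predictive Failure']
print(decode_states(129))  # ['Power Off/Power Down', 'Predictive Failure']
```

Note how 128 and 129 both include the predictive failure state, which is why a plain equality check against 128 is not enough.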

If we can't compare the last value in a simple way, can we reasonably check such discrete sensor values at all? Luckily, yes; Zabbix provides a bitwise trigger function called band(), which was originally implemented specifically for discrete IPMI sensor monitoring.

Using the bitwise trigger function

The special function band() is somewhat similar to the simple function last(), but instead of just returning the last value, it applies a bitmask with bitwise AND to the value and returns the result of this operation. If we wanted to check for the least significant bit, the one that lets us know whether the unit is powered on, we would use a bitmask of 1. Assuming some other bits have been set, we could receive a value of 170 from the monitored system. In binary, that would be 10101010. Bitwise AND would multiply each bit down:

 

          Decimal value   Binary value
Value     170             10101010
          Bitwise AND (multiplied down)
Mask      1               00000001
Result    0               00000000
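The same calculation can be reproduced outside Zabbix. The following Python sketch assumes nothing about how Zabbix implements band() internally; the helper name band is simply borrowed for readability and applies the & operator.

```python
def band(value, mask):
    """Apply a bitmask to a value using bitwise AND."""
    return value & mask


value = 170
mask = 1
result = band(value, mask)

# Show all three numbers as 8-digit binary, plus the decimal result.
print(f"{value:08b} & {mask:08b} = {result:08b} ({result})")
# 10101010 & 00000001 = 00000000 (0)
```

The least significant bit of 170 is not set, so the result is 0.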

The general syntax for the band() trigger function is as follows:

band(#number|seconds,mask)

Tip

It also supports a third parameter, time shift—we discussed time shifts in Chapter 6, Detecting Problems with Triggers.

While thinking about the binary representation, we have to use decimal numbers in Zabbix. In this case, it is simple—the trigger expression would be as follows:

{host:item.band(#1,1)}=1

We are checking the last value received with #1, applying a decimal mask of 1, and verifying whether the last bit is set.

As a more complicated example, let's say we wanted to check for bits (starting from the least significant) 3 and 5, and we received a value of 110 (in decimal):

 

          Decimal value   Binary value
Value     110             01101110
          Bitwise AND (multiplied down)
Mask      20              00010100
Result    4               00000100

A simple way to think about the operation of the mask would be that all the bits that match a 0 in the mask are set to 0, and all other bits pass through it as is. In this case, we are interested in whether both bits 3 and 5 are set, so the expression would be this:

{host:item.band(#1,20)}=20

In our value, only bit 3 was set, so the resulting value from the function was 4, and that does not match 20. Bits 3 and 5 are not both set, so the trigger expression evaluates to FALSE. If we wanted to check for bit number 3 being set and bit 5 not being set, we would compare the result to 4. And if we wanted to check for bit number 3 not being set and bit 5 being set, we would compare it to 16, because in binary, that is 00010000.
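Working out decimal masks by hand is error-prone, so a small helper can be handy. The Python sketch below is an illustration only; mask_for_bits is a hypothetical helper that uses the same bit numbering as the text, where bit 1 is the least significant.

```python
def mask_for_bits(*bits):
    """Build a decimal mask from 1-based bit numbers (1 = least significant)."""
    mask = 0
    for bit in bits:
        mask |= 1 << (bit - 1)
    return mask


mask = mask_for_bits(3, 5)  # 00010100 in binary
value = 110                 # 01101110 in binary
result = value & mask

print(mask, result)                # 20 4
print(result == mask)              # False: bits 3 and 5 are not both set
print(result == mask_for_bits(3))  # True: bit 3 set, bit 5 not set
print(result == mask_for_bits(5))  # False: that would require a result of 16
```

The three comparisons correspond to comparing the band() result to 20, 4, and 16 in a trigger expression.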

And now, let's get back to checking for the predictive failure bit being set. It was the eighth bit, so our mask should be 10000000, and we should compare the result to 10000000. But both of these should be in decimal format, so we should set both the mask and comparison values to 128.

Let's create a trigger in the frontend with this knowledge. Go to Configuration | Hosts, click on Triggers next to IPMI host, and click on Create trigger. Enter Power unit predictive failure on {HOST.NAME} in the Name field, and then click on Add next to the Expression field. Click on Select next to the Item field, and then choose Power Unit Stat. Set the Function dropdown to Bitwise AND of last (most recent) T value and mask = N, enter 128 in both the Mask and N fields, and then click on Insert. The resulting trigger expression should be this:

{IPMI host:Power_Unit_Stat.band(,128)}=128

Notice how the first function parameter is missing? As with the last() function, omitting this parameter is equal to setting it to #1, like in the earlier examples. This trigger expression will ignore the 7 least significant bits and check whether the result is set to 10000000 in binary, or 128 in decimal.

Bitwise comparison is possible with the count() function, too. Here, the syntax is potentially more confusing: both the pattern and mask are to be specified as the second parameter, separated with a slash. If the pattern and mask are equal, the mask can be omitted. Let's try to look at some examples to clear this up.

For example, to count how many values had the eighth bit set during the previous 10 minutes, the function part of the expression would be as follows:

count(10m,128,band)

Our pattern and mask were the same, so we could omit the mask part. The previous expression is equivalent to this:

count(10m,128/128,band)

If we would like to count how many values had bit 5 set and bit 3 not set during the previous 10 minutes, the function part of the expression would be this:

count(10m,16/20,band)

Here, the pattern is 16 or 10000, and the mask is 20 or 10100.
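The pattern/mask semantics can be modelled in a few lines of Python. This is a sketch of the behaviour as described above, not the Zabbix implementation; count_band and the sample values are made up for illustration.

```python
def count_band(values, pattern, mask=None):
    """Count values where (value AND mask) equals pattern.

    If the mask is omitted, it defaults to the pattern, mirroring
    the shorthand described in the text.
    """
    if mask is None:
        mask = pattern
    return sum(1 for v in values if (v & mask) == pattern)


# Hypothetical values collected during the evaluation period.
values = [0, 16, 20, 128, 144, 148]

print(count_band(values, 128))     # 3 (128, 144, and 148 have the eighth bit set)
print(count_band(values, 16, 20))  # 2 (16 and 144 have bit 5 set and bit 3 not set)
```

Values such as 20 and 148 are excluded by the second call because bit 3 is set in them as well.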

Beware of adding too many IPMI items against a single system—it is very easy to overload the IPMI controller.
