Working with counters
August 02. 2015
The principle of counters
When you write checks that deal with performance data (CPU usage, network traffic, disk IO), in many cases you will be confronted with counters. As an example, look into the file /proc/stat on your Linux box. Let's grep for the line beginning with processes:
user@host> grep processes /proc/stat
processes 205458
What does that mean? It's the number of processes created since the system has booted (not the number of processes currently running!).
Now let's do it again:
user@host> grep processes /proc/stat
processes 206160
What do we learn from this? The number of process creations has risen from 205458 to 206160. That is 702 new process creations. If we now assume that we have waited exactly 10 seconds between the two calls, then processes have been created at a rate of 70.2 process creations per second.
So if we have a counter and want to compute a rate, we need to compare the value of the counter at two points in time and need to know how much time has passed between the first and the second sample of the counter.
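The computation can be written down in a few lines of Python. This is only an illustrative sketch: the function name rate_per_second is made up for this example and is not part of Check_MK, and the timestamps 0 and 10 simply encode the assumption that the two samples were taken 10 seconds apart.

```python
def rate_per_second(old_value, old_time, new_value, new_time):
    """Return the average rate of a counter between two samples."""
    elapsed = new_time - old_time
    if elapsed <= 0:
        raise ValueError("the second sample must be taken after the first")
    return (new_value - old_value) / float(elapsed)


# The two /proc/stat samples from above, assumed to be 10 seconds apart.
rate = rate_per_second(205458, 0, 206160, 10)
print("%.1f process creations per second" % rate)
```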
Counters in Check_MK
The good news is: Check_MK supports check programmers with the handling of counters. It can keep a record of counter values and compute rates for you. Counter values are stored in /var/lib/check_mk/counters (OMD: tmp/check_mk/counters). The key to this is the helper function get_counter, which is called like this:
timedif, rate_per_sec = get_counter("some.unique.name", this_time, counter_value)
It is important to call this function with a unique name for each separate counter. This is usually done by using the check type and the item as a prefix (unless your check always uses None as item). Here is an example of how the check winperf.diskstat uses get_counter. It uses a separate counter for read IO and write IO:
read_timedif, read_per_sec = get_counter("diskstat.read", this_time, read_bytes_ctr)
write_timedif, write_per_sec = get_counter("diskstat.write", this_time, write_bytes_ctr)
The check if is a bit more complex, since it deals with many switch ports and for each port with several counters. It uses the counter name (name) and the port number/description (item) to make the counter unique:
timedif, rate = get_counter("if.%s.%s" % (name, item), this_time, saveint(counter))
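The get_counter function returns two values: the time difference to the previous sample of the counter (timedif) and the rate, in counter steps per second.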
Counter wraps and resets
There are two situations where you have to be careful when working with counters: wraps and resets. A wrap occurs when a counter with a limited precision overflows. The most prominent examples are the 32-bit counters in the SNMP IF-MIB that are used for the network traffic, for example ifOutOctets. After 4 GB of traffic over a port the counter wraps back to 0. Such a case must be detected and handled. Otherwise you would get negative values for your rate.
Another situation is a reboot of the target device. In that case all counters start again from 0. This must also be handled correctly in order to avoid anomalies.
get_counter performs a simple but effective wrap detection. If the new counter value is lower than the old one, a wrap or reset is assumed. Now something important happens: get_counter does not return any value, but raises a Python exception of type MKCounterWrapped. This exception immediately aborts the execution of the check. No check result will be sent to Nagios this turn!
You might think of this as a bug. But it's the only way to do it right! If a counter anomaly occurs then it is not possible to compute the rate in a reliable way. Even if we knew the maximum value of the counter, we would still have no way to distinguish between a wrap and a reset. And returning a rate of 0 would also be wrong: it might trigger false alarms, create invalid RRD graphs and send the user a check result that does not reflect reality.
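A small numeric sketch shows the ambiguity. The sample values below are invented for illustration, and a 32-bit counter and a 10 second interval are assumed. The same pair of samples yields a negative rate when computed naively, and two different positive rates depending on whether we guess a wrap or a reset:

```python
COUNTER_RANGE = 2 ** 32  # a 32-bit counter takes values 0 .. 2**32 - 1

old_value = 4294967000  # previous sample, 296 steps below the overflow
new_value = 704         # current sample, lower than the previous one
elapsed = 10.0          # seconds between the two samples (assumed)

# Naive computation: the difference is negative, and so is the rate.
naive_rate = (new_value - old_value) / elapsed

# Guess "it was a wrap": count the steps modulo the counter range.
steps_if_wrap = (new_value - old_value) % COUNTER_RANGE  # 296 + 704 = 1000

# Guess "it was a reset": the counter restarted at 0 and counted up to 704.
steps_if_reset = new_value

print(naive_rate)
print(steps_if_wrap / elapsed, steps_if_reset / elapsed)
```

Nothing in the two samples tells us which of the two guesses is the right one, which is why skipping the result is the safe choice.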
An MKCounterWrapped is also raised if get_counter is called for the very first time for a specific counter, because no rate can be computed without a previous value. This is the reason why some checks need a bit longer to leave the pending state when a new host is added to the monitoring.
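The behaviour described so far can be imitated with a few lines of Python. The following is a minimal sketch and not the actual Check_MK implementation: the names CounterWrapped and sketch_get_counter are hypothetical stand-ins for MKCounterWrapped and get_counter, the samples are kept in memory instead of in the counters file, and treating a non-positive time difference like a wrap is an assumption of this sketch.

```python
class CounterWrapped(Exception):
    """Hypothetical stand-in for Check_MK's MKCounterWrapped."""


_last_samples = {}  # counter name -> (timestamp, value), in memory only


def sketch_get_counter(name, this_time, this_value):
    previous = _last_samples.get(name)
    # Always remember the newest sample, even if no rate can be computed.
    _last_samples[name] = (this_time, this_value)

    if previous is None:
        raise CounterWrapped("counter %s initialized" % name)

    last_time, last_value = previous
    timedif = this_time - last_time
    if timedif <= 0:
        # Assumption of this sketch: no usable time difference, skip as well.
        raise CounterWrapped("no time difference for %s" % name)
    if this_value < last_value:
        raise CounterWrapped("wrap or reset of %s" % name)

    return timedif, (this_value - last_value) / float(timedif)


# First call: no previous value, so no rate can be computed yet.
try:
    sketch_get_counter("demo.processes", 100.0, 205458)
except CounterWrapped:
    print("first call: no rate yet")

# Second call, 10 seconds later: now a rate is available.
timedif, rate = sketch_get_counter("demo.processes", 110.0, 206160)
print(timedif, rate)
```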
If your check uses more than one or two counters then the time until the check first produces results might not be acceptable. The problem is that the MKCounterWrapped exception immediately aborts the check execution as soon as the first counter is initialized. The second time the check is called the second counter will be initialized and so on.
If you want all counters to be initialized at the first check then you need to:
The following pseudo-code illustrates how to do this correctly (variant a):
wrapped = False
for ...: # loop over counters
    try:
        timedif, rate = get_counter(.....)
        # process resulting rate...
    except MKCounterWrapped:
        wrapped = True
        # continue, other counters might wrap as well

# after all counters are handled
if wrapped:
    raise MKCounterWrapped("Counter wrap")
Variant b) is only possible if the counters just supply additional performance data but do not influence the check result.
In any case, make sure that your check always outputs the same number of performance variables. If some would be missing due to counter wraps, then output none at all. Graphing tools such as PNP4Nagios may break if the number of performance variables varies.
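Variant a) can also be tried out as a self-contained example. This is a sketch under stated assumptions: CounterWrapped and sketch_get_counter are hypothetical in-memory stand-ins for MKCounterWrapped and get_counter, and check_demo with its list of (name, value) pairs is an invented input format. The check either returns the complete set of performance variables or raises, so it never outputs a partial set.

```python
class CounterWrapped(Exception):
    """Hypothetical stand-in for Check_MK's MKCounterWrapped."""


_samples = {}  # counter name -> (timestamp, value), in memory only


def sketch_get_counter(name, this_time, value):
    previous = _samples.get(name)
    _samples[name] = (this_time, value)
    if previous is None or value < previous[1] or this_time <= previous[0]:
        raise CounterWrapped(name)
    timedif = this_time - previous[0]
    return timedif, (value - previous[1]) / float(timedif)


def check_demo(item, this_time, counters):
    """counters is a list of (name, value) pairs (invented format)."""
    wrapped = False
    perfdata = []
    for name, value in counters:
        try:
            timedif, rate = sketch_get_counter(
                "demo.%s.%s" % (name, item), this_time, value)
            perfdata.append((name, rate))
        except CounterWrapped:
            wrapped = True  # keep going so the other counters get initialized

    # After all counters are handled: all rates or no result at all.
    if wrapped:
        raise CounterWrapped("Counter wrap")
    return 0, "OK - %d rates computed" % len(perfdata), perfdata


# The first cycle is skipped, but both counters are initialized in one go.
try:
    check_demo("sda", 100.0, [("read", 1000), ("write", 5000)])
except CounterWrapped:
    print("first cycle skipped")

# The second cycle already delivers both rates.
state, text, perfdata = check_demo("sda", 110.0, [("read", 3000), ("write", 5500)])
print(state, text, perfdata)
```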
1.2.6b1 Werk #1723 - New check API function get_rate() as more intelligent replacement for get_counter()
The function get_counter() is now deprecated in the programming of checks. There is a new function called get_rate() that should be used as a replacement.
def get_rate(countername, this_time, this_val, allow_negative=False, onwrap=SKIP):
    ...
    return rate
The call syntax is almost the same, just with the new optional parameter onwrap. Important however: now just the rate (counter steps per second) is being returned. The formerly additional return value timedif has been dropped since it is of no real use. So the return type has changed from tuple to float.
The most important change, however, is in the handling of counter wraps. A counter wrap happens in three situations:
Wraps usually happen when a device reboots or when the valid range of the counter is exceeded and it wraps through again to zero.
The old function get_counter() used to raise an exception of type MKCounterWrapped. This exception was handled by the main core of Check_MK, which skipped that check for one cycle. The problem was checks with more than one counter: at the point of initialization the code of the check was aborted after the first of these counters had been initialized. If you had 10 counters, you would need 10 check cycles until the first check result was returned. To avoid that, the check had to catch the MKCounterWrapped itself and handle this situation, which is very ugly.
The new function get_rate implements a different approach. By default no exception is raised in case of a counter wrap; the value 0.00 is simply returned. But Check_MK keeps a record of this wrap event. After the check function has completed (and all counters are handled), Check_MK creates one final MKCounterWrapped exception, so that the (invalid) check result is skipped as it should be. This way the check programmer's burden is reduced a bit: even if the check has several counters, there is no need to catch counter wraps.
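The idea of deferring the exception can be sketched as follows. This is not the real Check_MK code: CounterWrapped, sketch_get_rate and run_check are hypothetical names, run_check merely plays the role of the core, the samples live in memory, and the parameters allow_negative and onwrap are left out entirely.

```python
class CounterWrapped(Exception):
    """Hypothetical stand-in for Check_MK's MKCounterWrapped."""


_samples = {}    # counter name -> (timestamp, value), in memory only
_wrap_seen = []  # counters that wrapped during the current check run


def sketch_get_rate(name, this_time, value):
    previous = _samples.get(name)
    _samples[name] = (this_time, value)
    if previous is None or value < previous[1] or this_time <= previous[0]:
        _wrap_seen.append(name)  # remember the event instead of raising
        return 0.0
    return (value - previous[1]) / float(this_time - previous[0])


def run_check(check_function, *args):
    """Play the role of the core: run the check, then raise once if needed."""
    del _wrap_seen[:]
    result = check_function(*args)
    if _wrap_seen:
        raise CounterWrapped("wrapped: " + ", ".join(_wrap_seen))
    return result


def check_demo(this_time, read_ctr, write_ctr):
    # No try/except needed in the check itself.
    read_rate = sketch_get_rate("demo.read", this_time, read_ctr)
    write_rate = sketch_get_rate("demo.write", this_time, write_ctr)
    return 0, "read %.1f/s, write %.1f/s" % (read_rate, write_rate)


# First cycle: both counters are initialized, the result is skipped once.
try:
    run_check(check_demo, 100.0, 1000, 5000)
    first_skipped = False
except CounterWrapped:
    first_skipped = True

# Second cycle: a regular result.
second_result = run_check(check_demo, 110.0, 3000, 5500)
print(first_skipped, second_result)
```

Compared with catching the exception for each counter, the check function stays free of wrap handling, and all counters are still initialized in the first cycle.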
1.2.6b1 Werk #1725 - The get_average() function from now on only returns one argument: the average
Note to all developers of checks that use get_average(): In order to simplify the check API the function get_average() from now on does not return the additional timedif value anymore - just the rate. Please check your checks for the usage of this function.