Guidelines for writing checks for the official distribution
December 22, 2014
Check file names should be short and unique. They must consist only of lower case characters, digits and underscores, and must begin with a lower case character.
Vendor-specific checks must be prefixed with a unique vendor abbreviation of your choosing. Example: fsc_ for Fujitsu Siemens Computers.
Product-specific checks must be prefixed with a product abbreviation, for example steelhead_status for a Riverbed Steelhead appliance.
SNMP-based checks: if the check makes use of a standardized MIB which is or might be implemented by more than one vendor, then the check should be named after the MIB, not after the vendor. The hr_* checks are an example.
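The character rules above can be tested mechanically. The following is a minimal sketch of a stand-alone helper script, not check code; the function name is_valid_check_name is hypothetical and not part of Check_MK, and it only tests the allowed characters, not uniqueness or the prefix conventions.

```python
import string

# Hypothetical helper for illustration; not part of Check_MK.
FIRST_CHARS = set(string.ascii_lowercase)
OTHER_CHARS = FIRST_CHARS | set(string.digits) | {"_"}


def is_valid_check_name(name):
    """Return True if name uses only lower case letters, digits and
    underscores and begins with a lower case letter."""
    if not name or name[0] not in FIRST_CHARS:
        return False
    return all(char in OTHER_CHARS for char in name)


print(is_valid_check_name("steelhead_status"))  # True
print(is_valid_check_name("Steelhead-Status"))  # False
```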
All checks must follow the same layout specified below:
Add an author
If the check is contributed by a third party (i.e., not by the developers
of Check_MK), the name and email address of the contributor should be added
as a comment, right after the header.
Avoid long lines. Ideally, your lines shouldn't exceed 100 chars.
Use four spaces to indent your code. Don't use tab chars!
And if you really can't live without tabs, set the tab width to 8 spaces.
For checks which are supposed to be part of the official Check_MK project
the file header with the copyright information must be present. This will be
automatically created if you call 'make headers' in the main source directory.
Including example output of the agent is very helpful for understanding how the check parser works.
TCP-Agent based checks must include an output example of the agent. If the agent output can have different formats or output styles, then put an example for each kind of style the check supports (e.g.: the output of multipath -l has changed its layout between SLES 10 and SLES 11).
For SNMP based checks, at least include examples if the kind of output is remarkable in some respect.
Configuration variables for main.mk should be named after the check if they are only used by this check. This does not hold for variables that are used by several checks (e.g. filesystem_default_levels is used by df, hr_fs, df_netapp, ...).
The variable that is used for the check's default parameters and entered in the inventory function must be named CHECKTYPE_default_levels (unless it is used by more than one check, see above). Example: check foo_bar has the configuration variable foo_bar_default_levels.
If a check does not use check parameters, the inventory function must return None as parameter and the check function must name the parameter argument _no_params.
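The two parameter conventions above might look as follows. This is a sketch under assumptions: the checks foo_bar and foo_baz are hypothetical, info is assumed to be a list of already split agent lines, the inventory function is assumed to return (item, parameter) pairs, and the default variable is assumed to be referenced by its name as a string.

```python
# Default levels (warn, crit) for the hypothetical check foo_bar.
foo_bar_default_levels = (80.0, 90.0)


def inventory_foo_bar(info):
    # Assumption: the default variable is entered by name, as a string.
    return [(line[0], "foo_bar_default_levels") for line in info]


def inventory_foo_baz(info):
    # A check without parameters returns None as parameter.
    return [(line[0], None) for line in info]


def check_foo_baz(item, _no_params, info):
    # The unused parameter argument is named _no_params.
    for line in info:
        if line[0] == item:
            return (0, "Status is %s" % line[1])
    return (3, "Sensor not found in agent output")
```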
Other details / required practices
Setting default values for configuration variables
Default values for check parameters
(e.g. switch_cpu_default_levels) must be chosen in a way that they
make sense for everybody, not just for your special case. If
you are unsure, choose levels that are too loose rather than too tight. This helps
avoid false alarms.
If the same configuration variable is used in multiple checks, it must be
set to a default value in all checks and the values must be identical!
Your check should assume that the agent is always producing valid data. It should not try to handle cases when the agent output is broken. Reason: broken agent output is already handled by Check_MK via Python exceptions. Intercepting these exceptions in your check code makes debugging of broken outputs much more difficult.
Do not handle cases in the agent output for which you have no indication
that they can actually happen.
int() vs. saveint() and float() vs. savefloat()
int() will throw an exception if the argument is not a valid number string (or if it is empty). Check_MK will catch the exception and set the check result to "UNKNOWN" with an appropriate error message. saveint(), however, will assume 0 if the argument cannot be converted to a valid integer.
Use saveint() in all cases when you know or suspect that your device may supply invalid data, but the check should work with the rest of the data and produce useful results. Disadvantage: you may never find out that the device has supplied invalid data, because the check won't tell you!
Use int() in all other cases, e.g. if you want to be notified
with an exception if the check has received invalid data from your device.
In most cases this is what you want!
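The difference can be seen in a small experiment. The saveint() below is a re-implementation for illustration only, written to match the behaviour described above; it is not the code shipped with Check_MK.

```python
def saveint(text):
    # Illustrative stand-in, not the Check_MK implementation:
    # fall back to 0 when the argument is not a valid integer.
    try:
        return int(text)
    except (ValueError, TypeError):
        return 0


print(saveint("42"))  # 42
print(saveint(""))    # 0, the invalid input goes unnoticed

try:
    int("")
except ValueError as error:
    # In a real check this exception would reach Check_MK instead.
    print("int() raised: %s" % error)
```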
Many checks have parameters defining warning and critical levels which are compared to an actual value. Please observe the following important rules and conventions if you are writing such checks.
Warning and critical levels should always be checked with >= and <=. Example: a check monitors the length of a mail queue. The critical upper level is at 100. This means that if the length is exactly 100, the check should already be critical. There might be a few exceptions to this where this wouldn't make sense.
If there are just upper or just lower levels, the input fields of the WATO ruleset definitions for such levels must be labelled Warning at ______, and Critical at ______.
If there are both upper and lower levels, the labelling should be:
Warning at or above ___, Critical at or above ___,
Warning at or below ___ and Critical at or below ___.
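The inclusive comparison with >= and <= can be sketched as follows. The function names are hypothetical; the states 0, 1 and 2 stand for OK, WARN and CRIT.

```python
def state_for_upper_levels(value, warn, crit):
    # Inclusive comparison: a value exactly at the level already triggers.
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def state_for_lower_levels(value, warn, crit):
    if value <= crit:
        return 2
    if value <= warn:
        return 1
    return 0


# Mail queue example: critical level 100, queue length exactly 100.
print(state_for_upper_levels(100, 80, 100))  # 2
```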
A check function producing several subresults (e.g. current usage and
growth) must use the yield statement to return these results. On
the other hand, a check generating exactly one result must use return.
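A sketch of both styles, using two hypothetical checks. The layout of info (a list of split lines) and the output texts are assumptions made for the example.

```python
def check_foo_usage(item, params, info):
    # Several subresults: each one is yielded separately.
    warn, crit = params
    for name, usage, growth in info:
        if name == item:
            usage = float(usage)
            if usage >= crit:
                state = 2
            elif usage >= warn:
                state = 1
            else:
                state = 0
            yield state, "Usage: %.1f%%" % usage
            yield 0, "Growth: %.1f%% per day" % float(growth)
            return
    yield 3, "Volume not found in agent output"


def check_foo_status(item, _no_params, info):
    # Exactly one result: plain return.
    for name, status in info:
        if name == item:
            return 0, "Status is %s" % status
    return 3, "Volume not found in agent output"
```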
Each check returns one line of text, the plugin output (sometimes called the check output). To unify things, the output must be formatted according to the following rules:
Format of Performance data
Always send int or float data as performance data. Do not attach a unit. Write temp instead of "%0.2fC" % temp!
If you need to omit fields in the middle of the data list (e.g. warn or crit), add a None instead, for example [("usage", usage, None, None, 0, size)]
If you need to omit fields at the end, simply omit them. Do not add trailing Nones.
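The rules for performance data can be illustrated with a hypothetical check foo_fs. The field order (name, value, warn, crit, min, max) is taken from the example above; the agent columns in kilobytes and the three-element return value (state, text, performance data) are assumptions of this sketch.

```python
def check_foo_fs(item, _no_params, info):
    for name, used_kb, size_kb in info:
        if name == item:
            # Plain numbers in bytes, no unit attached to the value.
            used = int(used_kb) * 1024
            size = int(size_kb) * 1024
            # No levels here: warn and crit in the middle become None.
            perfdata = [("usage", used, None, None, 0, size)]
            return 0, "Used: %d of %d bytes" % (used, size), perfdata
    return 3, "Filesystem not found in agent output"


# Fields at the end are simply left out, without trailing Nones.
temperature_perfdata = [("temp", 41.5)]
```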
Naming of performance data variables:
Always use the canonical unit: send Bytes, not KB, MB or GB. Send
Celsius, not Fahrenheit. Send Bits/sec, not MBits/sec. It is the task of
the graphing tool to do a useful scaling.
Only set "has_perfdata" to True in check_info
if the check really produces performance data output.
Each check returning performance data must have a dedicated PNP graph definition in pnp-templates. If the check has warning and critical levels, the graph must display these levels as yellow and red lines.
PNP graphs should always use the consolidation function MAX (there are some rare exceptions where only MIN makes sense).
However: the Average value which is printed in the labelling of the
graph must use the consolidation function AVERAGE. Using MAX
would compute the average of the maximum values - which is totally useless.
Each check returning performance data must also have an RRA definition
specifying which of MAX, MIN and AVERAGE is needed to display the
graph in its current (and maybe future) forms. These definitions are in
pnp-rraconf. Use a symlink here.
Each check returning performance data should have a Perf-O-Meter.
For checks which are part of Check_MK the Perf-O-Meter must be defined in
web/plugins/perfometer/check_mk.py. For third-party checks it should
be defined in a separate file in web/plugins/perfometer.
Only use numeric OIDs in your checks. Name-based OIDs rely on MIB files and the check won't work when the MIB files are not in place. Always have your OIDs start with a root, for example: .1.3.6.1.4.1
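A quick way to spot name-based OIDs is to test the notation. The helper below is a hypothetical sketch, not part of Check_MK; it accepts only dotted-decimal OIDs with a leading dot and says nothing about whether the OID exists on a device.

```python
def is_numeric_oid(oid):
    # Hypothetical helper: leading dot, then only decimal components.
    if not oid.startswith("."):
        return False
    return all(part.isdigit() for part in oid[1:].split("."))


print(is_numeric_oid(".1.3.6.1.2.1.1.1.0"))  # True
print(is_numeric_oid("sysDescr.0"))          # False
```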
Each check must have a check man page. This should be:
Information that must be contained in the check description:
Here are some frequent errors and further mixed guidelines:
When you output a number and a unit, always put one space in between: write 4.5 ms instead of 4.5ms. Only the percent sign is added without a space.
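As a small illustration of this spacing rule, a hypothetical formatting helper (not part of Check_MK) could look like this:

```python
def render_value(value, unit):
    # The percent sign is the only unit attached without a space.
    if unit == "%":
        return "%s%%" % value
    return "%s %s" % (value, unit)


print(render_value(4.5, "ms"))  # 4.5 ms
print(render_value(80, "%"))    # 80%
```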
Checks doing the same thing should always have the same (consistent) service description. Examples:
Service descriptions should be capitalized like English titles, e.g. "Source of Output"
If your check is accompanied by an agent plugin, you should observe the following rules:
A check with items must return with an UNKNOWN state (3) when the checked item is not found in the agent output or SNMP data. The text in that case should be: Thing not found in SNMP data or Thing not found in agent output (depending on the type of check) where Thing is the name of the item type, for example Database not found, Sensor not found, Domain not found. Do not:
The state markers (!) and (!!) must only be used in checks which can go warning or critical for several different reasons, like sub-checks.
Checks returning a temperature should have the plugin output in the form Temperature is 12 °C or Temperature is 12.5 °C
Your check must also work with Nagios as Core. If you use functions or variables from *.include files then you must declare them in check_info in the key "includes" and you must then test your check with Nagios as the core.
Never use a global import statement in a check file
Do not use datetime for date/time parsing. Use time. It can do all you need, really!
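For example, parsing a timestamp and computing a time difference needs nothing beyond the time module. This is a sketch: the timestamp format and the function names are assumptions, and the import sits inside the functions only so that the example runs on its own without a global import statement.

```python
def parse_timestamp(text):
    # Assumed format: "YYYY-MM-DD HH:MM:SS" in local time.
    import time
    return time.strptime(text, "%Y-%m-%d %H:%M:%S")


def seconds_between(earlier, later):
    # mktime() converts the parsed local time to seconds since the epoch.
    import time
    return time.mktime(parse_timestamp(later)) - time.mktime(parse_timestamp(earlier))


print(seconds_between("2014-12-22 10:00:00", "2014-12-22 10:00:30"))  # 30.0
```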
Do not use any other modules, except: sys, os, time, socket
If you need regular expressions, use the function regex(). Do not use re directly.
Neither the check function nor the inventory function may use the print command, or otherwise output any data to stdout or stderr, or communicate with the outside world in any other way. Rare exceptions are checks which need dedicated data storage (such as logwatch, which keeps unread log messages in files).
Never fetch SNMP data that is not actually used in the check or inventory function.