March 04. 2011
Writing a really good check has many aspects. If you want your check to be part
of the official Check_MK distribution, you have to make it adhere to the following
guidelines:
Naming
Check file names should be short and unique. They must consist
only of lower case characters, digits and underscores and must begin
with a lower case character.
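The rule above can be expressed as a regular expression. The following is a minimal sketch; the helper name is_valid_check_name is hypothetical and not part of Check_MK, and "lower case characters" is assumed to mean ASCII a-z.

```python
import re

# Pattern derived from the naming rule: the first character is a lower case
# letter, the rest are lower case letters, digits or underscores.
# Assumption: "lower case characters" means ASCII a-z only.
CHECK_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_check_name(name):
    """Return True if name satisfies the naming rule (hypothetical helper)."""
    return CHECK_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["fsc_fans", "Fsc_Fans", "fsc-fans", "_fsc"]:
    print(candidate, is_valid_check_name(candidate))
```

Only the first candidate passes; the others contain an upper case letter, a hyphen, or a leading underscore.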
Vendor specific checks must be prefixed with a unique vendor specific
abbreviation (of your own choosing). Example: fsc_ for Fujitsu Siemens Computers.
Product specific checks must be prefixed with a product abbreviation, for
example steelhead_status for a Steelhead appliance from Riverbed.
SNMP based checks: if the check makes use of a standardized MIB which
is or might be implemented by more than one vendor, then the check should
not be named after the vendor but after the MIB. Examples are the
hr_* checks.
The service descriptions of different check types that essentially
do the same thing must be identical (e.g. if/if64/ifoperstatus). Reason:
this makes rules in main.mk simpler for the user.
Order of implementation
All checks follow the same order of implementation:
- file header with GPL notice
- name and email address of the author, if the check was contributed
- example output as sent by the agent
- check_includes[] definition
- default settings of configuration variables
- helper functions and variables, if any are needed
- the inventory function
- the check function
- check_info[] definition
- snmp_info[] definition
- snmp_scan_functions[] definition
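The order listed above could look like the following skeleton. This is only a sketch under stated assumptions: the check name acme_temp, its agent section and its levels are invented for illustration, the registries are replaced by local stand-ins so the file runs on its own, and the function signatures and tuple layouts are simplified assumptions that may differ between Check_MK versions. Only the position of the perfdata flag (third element of the check_info entry) is taken from this document.

```python
# (file header with GPL notice goes here)

# Author: Jane Doe <jane@example.com>  (only for contributed checks)

# Example output from agent (hypothetical section):
# <<<acme_temp>>>
# sensor1 42
# sensor2 75

# Stand-ins for the registries that Check_MK provides when it loads a
# check file. They are defined here only so that the sketch is runnable.
check_info = {}
check_includes = {}

# check_includes[] definition (this sketch needs no include files)
check_includes["acme_temp"] = []

# Default settings of configuration variables: (warn, crit)
acme_temp_default_levels = (70, 80)


# Helper functions
def acme_temp_parse(info):
    # Turn the split agent lines into a dict: sensor name -> temperature
    return dict((line[0], int(line[1])) for line in info)


# The inventory function: one service per sensor (assumed tuple layout)
def inventory_acme_temp(info):
    return [(sensor, acme_temp_default_levels)
            for sensor in sorted(acme_temp_parse(info))]


# The check function (assumed signature and return layout)
def check_acme_temp(item, params, info):
    warn, crit = params
    temps = acme_temp_parse(info)
    if item not in temps:
        return (3, "sensor not found in agent output")
    temp = temps[item]
    if temp >= crit:
        state = 2
    elif temp >= warn:
        state = 1
    else:
        state = 0
    return (state, "temperature is %d (levels at %d/%d)" % (temp, warn, crit))


# check_info[] definition; the third element is the perfdata flag
check_info["acme_temp"] = (check_acme_temp, "Temperature %s", 0,
                           inventory_acme_temp)

# For an SNMP based check, the snmp_info[] and snmp_scan_functions[]
# definitions would follow here.
```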
Coding style
Add an author
If the check is contributed by a third party (like you), you must
add your name and your email address as a comment into the check, right after the header.
Readability, looks and indentation
Avoid long lines. Ideally your lines do not exceed 100 characters.
Use four spaces for indenting your code. Do not use tab characters.
If you really cannot live without tabs, set the tab width to 8 spaces.
File Header
For checks that are part of the official Check_MK distribution, the file
header with the copyright information must be present. It is created
automatically if you call 'make headers' in the main source directory.
Example agent output
Including example output of the agent is very helpful for understanding how the
check parser works.
TCP-Agent based checks must include an output example of the
agent. If the agent output can have different formats or output styles
then put an example for each kind of style the check supports
(e.g.: the output of multipath -l has changed its layout between SLES 10 and SLES 11).
For SNMP based checks include examples if the kind of output is
in some respect remarkable.
Configuration variables
Configuration variables for main.mk should be named after
the check if they are only used by this check. This does
not hold for variables that are used by several checks
(e.g. filesystem_default_levels is used by df, hr_fs, df_netapp, ...).
If a check does not use check parameters, then the inventory function
must return None as parameter and the check function must name
the parameter argument _no_params.
The names of the inventory and check functions must be
derived from the name of the check type, for example
inventory_h3c_lanswitch_cpu for the check h3c_lanswitch_cpu.
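A check without parameters might look like the sketch below. The check name acme_status and its agent data are hypothetical, and the function signatures and tuple layouts are simplified assumptions; the points taken from the text are the None parameter returned by the inventory function, the _no_params argument name, and the function names derived from the check name.

```python
def inventory_acme_status(info):
    # The check uses no parameters, so None is returned as parameter.
    return [(line[0], None) for line in info]


def check_acme_status(item, _no_params, info):
    # The argument name _no_params documents that parameters are unused.
    for line in info:
        if line[0] == item:
            if line[1] == "ok":
                return (0, "status is ok")
            return (2, "status is %s" % line[1])
    return (3, "item not found in agent output")
```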
Other details / expected practices
Setting default values for configuration variables
Default values for check parameters (e.g. switch_cpu_default_levels) must be
chosen in a way that they make sense for everybody, not just for your
special case.
If you are unsure, choose levels that are too loose rather than too tight.
This helps avoid false alarms.
Reuse of configuration variables
If the same configuration variable is used in multiple checks, all of them
must set a default value and all those values must be identical!
Error handling
Your check should assume that the agent always produces valid data.
It should not try to handle cases where the agent output is broken.
Such cases are handled by Check_MK itself via Python exceptions. Catching
them in the check would bypass the debug handler and make the code uglier.
int() vs. saveint() and float() vs. savefloat()
int(s) will throw an exception if s is not a valid number string (or is empty).
Check_MK will then catch the exception and set the check result to "UNKNOWN"
with a corresponding error message. saveint(s) will assume 0 if s is not valid.
Use saveint() in all places where you know or suspect that some
device does not supply valid data but the check can work with the rest of
the data and produce useful results.
Use int() in all other cases,
e.g. if the check does not make any sense if you have no valid data.
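The difference can be illustrated as follows. The saveint() defined here is only a stand-in that mirrors the behaviour described above, so that the sketch runs outside of Check_MK; the fan data layout and the helper parse_fan_line are hypothetical.

```python
def saveint(s):
    # Stand-in for the Check_MK helper: return 0 if s is not a valid number.
    try:
        return int(s)
    except (ValueError, TypeError):
        return 0


def parse_fan_line(line):
    # Hypothetical agent line: [name, rpm, error_count]
    name = line[0]
    # The check is pointless without a valid speed, so int() is used and
    # an exception is allowed to propagate to Check_MK.
    rpm = int(line[1])
    # Assumption: some devices report an empty or non-numeric error counter.
    # The check can still work without it, so saveint() is used.
    errors = saveint(line[2])
    return name, rpm, errors


print(parse_fan_line(["fan1", "3000", "n/a"]))  # ('fan1', 3000, 0)
```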
Performance data
Performance data flag
Only set the perfdata flag (the third parameter in the check_info declaration)
to 1 if the check really produces performance data output.
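A declaration with the flag set could look like the sketch below. The check acme_cpu is hypothetical, check_info is a local stand-in for the registry provided by Check_MK, and the layout of the perfdata tuples and of the other check_info elements is an assumption; only the position of the flag is taken from the text.

```python
check_info = {}  # stand-in for the registry provided by Check_MK


def inventory_acme_cpu(info):
    # One service without item if the agent sent any data
    if info:
        return [(None, (80, 90))]
    return []


def check_acme_cpu(item, params, info):
    warn, crit = params
    load = int(info[0][0])
    # Assumed perfdata layout: (name, value, warn, crit, min, max)
    perfdata = [("load", load, warn, crit, 0, 100)]
    if load >= crit:
        state = 2
    elif load >= warn:
        state = 1
    else:
        state = 0
    return (state, "CPU load is %d%%" % load, perfdata)


# Third element is 1 because check_acme_cpu really returns perfdata.
check_info["acme_cpu"] = (check_acme_cpu, "CPU load", 1, inventory_acme_cpu)
```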
PNP Graph definition
Each check that outputs performance data must have a dedicated PNP
graph definition in pnp-templates. If the check has warning and critical
levels then the graph must display those levels as yellow and red
lines.
RRA definition
Each check that outputs performance data must also have an RRA definition
that specifies which of MAX, MIN and AVERAGE is needed to display the
graph in its current (and maybe future) forms. Those are in pnp-rraconf.
Use a symlink here.
Perf-O-Meter
Each check that outputs performance data should have a Perf-O-Meter.
For checks that are part of Check_MK this must be done in
web/plugins/perfometer/check_mk.py, for third party checks this should
be done in a separate file in web/plugins/perfometer.
SNMP based checks
Only use numeric OIDs in your checks. Name based OIDs rely on MIB files
and the check won't work when the MIB files are not in place.
Always have your OIDs start with a root, for example: .1.3.6.1.4.1
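As an illustration, the sketch below declares SNMP data for a hypothetical check acme_fan using rooted numeric OIDs only. The enterprise number 99999, the tuple layout of the snmp_info entry, the scan function convention and the helper is_rooted_numeric_oid are assumptions made for this example; the dictionaries are local stand-ins for the registries provided by Check_MK.

```python
import re

snmp_info = {}            # stand-ins for the registries provided
snmp_scan_functions = {}  # by Check_MK


def is_rooted_numeric_oid(oid):
    # True for OIDs such as ".1.3.6.1.4.1": leading dot, digits only
    return re.fullmatch(r"(\.\d+)+", oid) is not None


# Assumed layout: (base OID, list of column suffixes below that base)
snmp_info["acme_fan"] = (".1.3.6.1.4.1.99999.1.2", ["1", "2"])

# Assumed convention: the scan function receives a callable that fetches
# the value of a single OID. .1.3.6.1.2.1.1.2.0 is sysObjectID.
snmp_scan_functions["acme_fan"] = lambda oid: oid(
    ".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.99999")

print(is_rooted_numeric_oid(snmp_info["acme_fan"][0]))  # True
print(is_rooted_numeric_oid("enterprises.99999.1.2"))   # False
```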
Forbidden things
Neither the check function nor the inventory function may use the print
command, write any other data to stdout or stderr, or otherwise communicate
with the outside. A rare exception to this are checks that need dedicated
data storage (such as logwatch: it keeps unread log messages in files).
Manpages
Each check must have a man page. This should be:
- complete
- precise
- terse
- helpful!
Information that must be contained in the check description:
- What exactly does the check do?
- Under which circumstances does the check status change to WARN/CRIT?
- Which devices are supported by the check?
- Does the check require some configuration of the agent or a separate agent plugin? (Example: the logwatch check requires the agent plugin mk_logwatch to be installed.)