Livecheck
Required version: 1.1.13i1
November 13, 2011
Introduction
Maybe the greatest performance bottleneck of Nagios is the execution of active checks. Even a perfectly tuned system rarely manages to execute more than a few thousand checks per minute. Even when using Check_MK, not all checks are passive: at least the host checks, the PING checks of ping-only hosts and of course the Check_MK check itself are always active checks.

To make things worse, the maximum check rate drops as your system grows: the more hosts and services your system manages, the fewer checks per second it is able to perform.

Why? If you take a closer look at how the Nagios core works, you will see that it needs to create a new process for each check it executes. Unix people speak of this as forking, because the system call that creates the process is fork(). This new process then prepares everything needed to execute the check plugin (e.g. check_icmp) and finally forks a second time in order to execute it.

Process creation is not only a CPU-intensive operation; it also becomes more expensive the larger the original process is (i.e. the more memory it uses). The problem is that fork() creates an exact copy of the original process, and even though that procedure is highly optimized by the Linux kernel, it is costly.

What's even worse is the fact that the forking of the Nagios core does not scale across multiple CPUs. If the actual execution of the active checks is efficient (such as a simple check_icmp) or enough CPU cores are available, you can easily run into a situation where your powerful 16-CPU server is limited to 100 checks per second while most of its CPU cores are idle most of the time.

What could a solution look like? Two possibilities come to mind:
Livecheck
Livecheck is a new feature in MK Livestatus. It is so simple that we decided not to write a separate broker module but to include it directly in livestatus.o. The code totals about 350 lines of C (non-empty, non-comment), half of which is in the Livestatus module and half of which is in a small C program called livecheck. The whole thing took about one day to implement.

It works like this: When the monitoring core starts, Livecheck creates a configurable number of helper processes, using the program livecheck. The core communicates with each helper via a Unix socket (one that does not appear in the filesystem). Whenever a check is to be executed, the core sends the necessary data (e.g. the host name, the service description and the command line) via the socket to one of the helpers. The helper reads that data and executes the check by forking and running the specified command. The result of the check is passed to Nagios by directly creating a check results file.

The net result is that only a very small helper program needs to be forked for the execution of a check, instead of the complete monitoring core. This alone brings a great speed-up. Furthermore, those forks are distributed over all available CPUs, not just one. But Livecheck makes use of even more room for improvement:
In first tests on a Lenovo laptop with a dual-core CPU running at 2800 MHz we easily managed to do about 300 ICMP checks per second. After a small alteration to the way Nagios schedules checks (which needs a small patch), we were amazed to see Nagios executing 2600 ICMP checks per second, with the Nagios process at 35% CPU usage and the total system way below 100%. The checks generated ICMP traffic of 45 MBit/s! At that check rate you could monitor 150,000 hosts while pinging each host once per minute (2600 checks per second is 156,000 checks per minute). Now imagine using a real server instead of a laptop!

Setting up Livecheck
Setting up Livecheck is easy:
Of course, a restart of your Nagios/Icinga is needed in order to activate the changes.

Testing
If everything is set up properly, you should get the following message in your nagios.log:
[1321193692] livestatus: Starting 20 livechecks helpers
If you show a process tree, you should see 20 subprocesses of Nagios called livecheck:
root@linux# pstree -u nagios
nagios-+-20*[livecheck]
       `-22*[{nagios}]
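
To check for the startup message automatically, a small script can scan nagios.log for it. The following is a minimal sketch in Python, assuming the message has exactly the format shown above. The function name `livecheck_helper_count` is hypothetical and not part of Check_MK or Livestatus.

```python
import re

# Matches the startup message shown above, for example:
# "[1321193692] livestatus: Starting 20 livechecks helpers"
STARTUP_RE = re.compile(r"^\[(\d+)\] livestatus: Starting (\d+) livechecks? helpers")


def livecheck_helper_count(log_lines):
    """Return the helper count from the last startup message, or None if absent."""
    count = None
    for line in log_lines:
        match = STARTUP_RE.match(line.strip())
        if match:
            count = int(match.group(2))
    return count


if __name__ == "__main__":
    sample = ["[1321193692] livestatus: Starting 20 livechecks helpers"]
    print(livecheck_helper_count(sample))
```

To use it on a real installation, pass the open log file instead of the sample list. The path of nagios.log depends on your setup.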
Tuning
Not much tuning is needed. The only configuration parameter is num_livecheck_helpers, which can be added to the broker configuration in nagios.cfg. It specifies the number of Livecheck helpers to start. The default value is 20. As long as this value is not less than the max_concurrent_checks setting of Nagios, there will always be a helper free when one is needed. If you run out of helpers, Livecheck automatically lets Nagios execute the excess checks in the usual way, so nothing will fail. In such a situation, the execution of those checks simply falls back to its normal speed.
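
The helper mechanism and the fallback described above can be illustrated with a short Python sketch. It is not the Livecheck implementation (which is written in C): all names (`HelperPool`, `start_helper`, `run_command`) are hypothetical, and as a simplification the helper sends its result back over the socket instead of creating a check results file. It assumes a Unix system and command lines without embedded newlines.

```python
import os
import shlex
import socket
import subprocess


def run_command(command):
    """Run a check command line; return (exit code, first line of output)."""
    try:
        proc = subprocess.run(shlex.split(command), capture_output=True, text=True)
    except OSError as exc:
        return 3, str(exc)  # 3 is the plugin exit code for UNKNOWN
    lines = proc.stdout.splitlines()
    return proc.returncode, (lines[0] if lines else "")


def helper_loop(sock):
    """Helper side: read one command per line, run it, send the result back."""
    for line in sock.makefile("r"):
        code, output = run_command(line.strip())
        sock.sendall(f"{code}\t{output}\n".encode())


def start_helper():
    """Fork one small helper connected by an unnamed Unix socket pair."""
    core_sock, helper_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:  # child process: becomes the helper
        try:
            core_sock.close()
            helper_loop(helper_sock)
        finally:
            os._exit(0)
    helper_sock.close()
    return pid, core_sock, core_sock.makefile("r")


class HelperPool:
    def __init__(self, num_helpers=20):  # 20 mirrors the documented default
        self.helpers = [start_helper() for _ in range(num_helpers)]
        self.free = list(self.helpers)

    def submit(self, command):
        """Hand the command to a free helper, or return None if all are busy."""
        if not self.free:
            return None
        helper = self.free.pop()
        helper[1].sendall(command.encode() + b"\n")
        return helper

    def collect(self, helper):
        """Wait for the helper's result and mark the helper as free again."""
        code, output = helper[2].readline().rstrip("\n").split("\t", 1)
        self.free.append(helper)
        return int(code), output

    def close(self):
        for pid, sock, reader in self.helpers:
            sock.shutdown(socket.SHUT_RDWR)  # helper sees end of input and exits
            reader.close()
            sock.close()
            os.waitpid(pid, 0)


if __name__ == "__main__":
    pool = HelperPool(num_helpers=1)
    busy = pool.submit("echo OK - first check")
    if pool.submit("echo OK - second check") is None:
        # No helper is free: execute the check directly, the usual way.
        print("fallback:", run_command("echo OK - second check"))
    print("helper:", pool.collect(busy))
    pool.close()
```

With only one helper, the second check finds no free helper and is executed directly, which corresponds to the fallback that occurs when the number of helpers is lower than the number of concurrent checks.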