Ethan Galstad of Nagios Enterprises interviews Mathias Kettner
We want to thank Ethan for this interview. The original English version
can be found here.
A German translation is also available.
Ethan Galstad: In this installment of the "Meet The Community"
series I interview Mathias Kettner, author of check_mk,
a unique addon for Nagios that greatly simplifies monitoring
remote system metrics.
Ethan Galstad: Can you tell us a bit about yourself?
Mathias Kettner: I’m from Munich in Germany and completed my studies in computer science in 1998. From autumn 1999 to spring 2001 I worked as a developer for SuSE in Nuremberg. During that time, amongst other things, I designed the system architecture for YaST2. Since I left SuSE I’ve been a self-employed Linux specialist and offer consulting and workshops for Linux and Open Source. The topic “Monitoring with Nagios” is a main focus.
EG: Can you give us a brief overview of what your project (check_mk) is?
MK: Check_mk is the product of years of Nagios consulting. Especially when you deal with large installations, the effort of creating and updating the configuration can be great. And of course performance problems arise at about 2,000 checks, sometimes even earlier.
Check_mk solves these problems with an astonishingly simple scheme. It uses its own, very simply built agents. Their specialty: the agent is not called separately for each check but always sends everything it knows about its host. For each host, Nagios triggers only one active check. That check fetches all data from the host at once, interprets it and submits the results of the various services via Nagios’ passive checks.
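The scheme described above can be illustrated with a minimal Python sketch. It is not check_mk's actual code. The sample agent output, the `<<<name>>>` section markers, the `fs_` service names and the thresholds are assumptions made for illustration; the only real interface used is the format of Nagios' `PROCESS_SERVICE_CHECK_RESULT` external command, which is how passive results are submitted.

```python
import time

# Hypothetical agent output: one block per section, each introduced by an
# assumed <<<name>>> header. Columns for "df": mount point, size, used.
SAMPLE_AGENT_OUTPUT = """\
<<<df>>>
/ 1000 900
/var 2000 500
<<<mem>>>
MemTotal 4096
MemFree 1024
"""


def parse_sections(raw):
    """Split the complete agent output into {section_name: [row, ...]}."""
    sections = {}
    current = None
    for line in raw.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.split())
    return sections


def check_filesystems(rows, warn=80.0, crit=90.0):
    """Yield (service, state, text) per filesystem (thresholds are made up)."""
    for mount, size, used in rows:
        pct = 100.0 * int(used) / int(size)
        state = 2 if pct >= crit else 1 if pct >= warn else 0
        yield ("fs_" + mount, state, "%.1f%% used" % pct)


def passive_result_commands(host, results, now=None):
    """Format results as Nagios external commands for passive checks."""
    ts = int(time.time()) if now is None else int(now)
    return [
        "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (ts, host, svc, state, text)
        for svc, state, text in results
    ]


if __name__ == "__main__":
    sections = parse_sections(SAMPLE_AGENT_OUTPUT)
    for cmd in passive_result_commands("web01", check_filesystems(sections["df"])):
        # In a real setup each line would be written to the Nagios command pipe.
        print(cmd)
```

One fetch per host yields results for any number of services, so the number of active checks grows with the number of hosts rather than with the number of services.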
EG: What are the primary advantages for a Nagios user to begin using check_mk?
MK: The most eye-catching advantage is that with check_mk much less work is needed to integrate new hosts into the monitoring. The agent – regardless of whether it is for Windows, Linux or UNIX – does not need to be configured. The Linux and UNIX agent is a portable shell script that does not need any compiled programs. The detection and integration of the majority of services and the creation of the configuration for Nagios happen automatically.
As the number of hosts and services grows, the performance benefit of check_mk becomes obvious. Check_mk makes tens of thousands of checks per minute possible – even if data is written to round robin databases.
EG: check_mk has a unique architectural design compared to other remote monitoring methods. What inspired you to come up with this idea?
MK: The idea for that kind of architecture has its origin in the monitoring of UNIX systems. NRPE is especially intricate here since there are no precompiled packages, and compiling NRPE on various UNIX versions overburdens many administrators – as not only NRPE, but also the plugins have to be compiled.
For that reason I developed a method based on a shell script and inetd. The idea of processing all of a host’s services in one single run, and sending the results to Nagios via the command pipe, came out of the blue one year ago.
Later I realized that this method can be extended nicely towards SNMP. When monitoring ports of a switch, check_mk processes the data from all ports in a single run. Furthermore it can detect which ports are in use (and thus should be monitored) and create Nagios services for them.
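The port detection described above might look roughly like the following Python sketch. It works on a hypothetical, already fetched interface table instead of making real SNMP requests. The row layout, the service naming and the template name `passive-port-template` are assumptions; the status value 1 for "up" follows `ifOperStatus` in the standard IF-MIB.

```python
# ifOperStatus value for "up" as defined in the IF-MIB.
IF_OPER_UP = 1


def inventory_ports(rows):
    """Return (index, name) for every port that is currently in use.

    rows: iterable of (index, name, oper_status) tuples, e.g. the result of
    one bulk walk over the switch's interface table (hypothetical layout).
    """
    return [(idx, name) for idx, name, status in rows if status == IF_OPER_UP]


def nagios_service_definitions(host, ports):
    """Render one Nagios service definition per detected port."""
    blocks = []
    for idx, name in ports:
        blocks.append(
            "define service {\n"
            "    host_name            %s\n"
            "    service_description  Port %d %s\n"
            "    use                  passive-port-template\n"  # hypothetical template
            "}\n" % (host, idx, name)
        )
    return "".join(blocks)


if __name__ == "__main__":
    table = [(1, "Gi0/1", 1), (2, "Gi0/2", 2), (3, "Gi0/3", 1)]
    print(nagios_service_definitions("switch01", inventory_ports(table)))
```

Ports that are down at inventory time are simply not turned into services, which keeps unused ports from raising alarms.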
EG: In your experience, do check_mk users usually replace their existing remote monitoring agents (NRPE, NSClient++, etc.) with this addon, or do they use them together?
MK: In principle, check_mk can be used in parallel with all other monitoring methods to any degree required. Once you’ve worked with check_mk for a few days, you probably won’t want to put up with the disadvantages of NRPE and NSClient++ any longer. A migration towards check_mk is usually fast. Single checks that are not easy to realise with check_mk, or where the effort of migration does not pay off, can be performed with a classic method in parallel with check_mk.
EG: Are there any shortcomings of using check_mk instead of dedicated agents like NRPE and NSClient++? For instance, are there certain types of information or metrics that can’t easily be monitored using check_mk and its architectural design?
MK: The architecture does not impose any restrictions. There are, however, a few cases where checks can be better implemented with the classical method. One example is checks of network services, like check_http, which do not require an agent.
Furthermore, the Windows agent is not yet as flexible as those for Linux and UNIX. Currently it is not scriptable and can only be extended in C. This is largely due to my decision to program the agent directly against the Win32 API. The advantage: the agent does not need .NET, Java or any other runtime environment – not even a special DLL – and is therefore perfectly portable.
EG: You have installed check_mk for some of your clients with large IT infrastructures. How does check_mk help Nagios to scale? What is the largest installation of check_mk that you know of?
MK: The largest installation currently performs 17,500 checks per minute, and is soon going to be expanded by several hundred Windows servers.
The load on the 4-CPU machine is currently 6, although most of the CPU time is IO wait (Linux includes processes waiting for disk IO in the load calculation). RRD data is written at a rate of about 5 MB per second. With the RRD cache deployed, I expect 30,000 to 40,000 checks per minute to be achievable.
EG: How long have you been working on check_mk?
MK: The current implementation – in Python – originated about a year and a half ago. Check_mk has been available under the GPL since the end of April 2009.
EG: You are the primary developer of check_mk. Are there other developers or contributors to the project?
MK: Currently I’m the only one working on the actual programming. A decisive part of the success and maturity of the project has been due to one of my customers – Karl-Heinz Fiebig. He is the master of the largest installation, produces many good ideas and is the first one taking the blame for my bugs. He fixes many problems in the code himself.
EG: How did you first come to know about Nagios and why did you decide to begin using it?
MK: I did my first project with Nagios in 2003. In order to save costs, an open source monitoring system was desirable. At that time Nagios was already the most prominent system of its kind. It could already fulfil all of the project’s requirements.
EG: What do you see as being the most advantageous reasons for using Nagios?
MK: Nagios is very flexible and allows the implementation of very individual monitoring solutions. Also the price is an important factor, because commercial monitoring systems still involve large licensing costs.
EG: Are there specific changes to Nagios that you’d like to see made in order to make check_mk integration simpler?
MK: The integration of Nagios and check_mk leaves nothing much to be desired. All necessary interfaces are present and they are also very simple and efficient.
EG: Are there any resources that you require in order to continue working on or improve the project?
MK: Most helpful are, on the one hand, projects for customers where I implement monitoring solutions with Nagios and check_mk, and thus further develop the software. On the other hand, feedback from the community is very important – whether it’s qualified bug reports, suggestions for important requirements or, last but not least, promotion of the project in the form of links, postings in forums, documentation in foreign languages and the like.
EG: What plans do you have for the future of check_mk?
MK: The most important issue is surely the expansion of the documentation. Furthermore I’d like to make the Windows agent more easily expandable. For this reason I’m still in search of a really persuasive idea that fits into the current minimalistic concept.
*Amendment of 10/29/2013
The figure of 17,500 has been out of date for a long time. Real-life
monitoring installations with Check_MK now execute 100,000 checks/min and more.
In our labs we reached a rate of 600,000 checks/min on a six-core CPU.
The Windows agent is now as easily expandable as the Linux and UNIX agents.