Monitoring of Logfiles
August 18, 2011
How check_mk monitors logfiles
Monitoring the contents of logfiles is an especially challenging task for a Nagios administrator. The key difficulty is that log messages are event based by nature, whereas Nagios is based on states. Check_mk's logwatch mechanism overcomes this problem by defining the OK state for a logfile as "no unacknowledged critical log messages".
When monitoring begins, a logfile starts in the state OK, regardless of its contents. When a new critical message is seen in the file, it is stored on the Nagios server for reference by the administrator. The state of the logfile changes to CRITICAL and stays in that state until the administrator acknowledges the messages. New critical messages arriving while in CRITICAL state are simply stored and do not change the state.
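The state logic described above can be sketched in a few lines of Python. This is only a toy model for illustration, not Check_mk's actual code; the class and method names are made up.

```python
class LogfileState:
    """Toy model of the logwatch state logic (hypothetical names)."""

    def __init__(self):
        self.state = "OK"    # a logfile always starts out OK
        self.stored = []     # unacknowledged critical messages

    def new_critical_message(self, line):
        self.stored.append(line)   # every critical message is stored
        self.state = "CRITICAL"    # no change if already CRITICAL

    def acknowledge(self):
        self.stored.clear()        # acknowledging discards the stored messages
        self.state = "OK"


log = LogfileState()
log.new_critical_message("md0: Fail event detected on md device")
log.new_critical_message("md1: Fail event detected on md device")
state_before_ack = log.state
stored_before_ack = len(log.stored)
log.acknowledge()
state_after_ack = log.state
```

The second message is stored but does not change the state; only the acknowledgement brings the logfile back to OK.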
Check_mk provides a webpage logwatch.php that displays log messages and allows you to delete (and thus acknowledge) them easily.
Logwatch on Linux and UNIX
Installing the logwatch extension
Logfiles on Linux and UNIX are monitored with the logwatch extension for the check_mk_agent. In the directory /usr/share/check_mk/agents you will find the file mk_logwatch. It is a small Python program that must be installed into the plugins directory of the agent (you specify that directory while running setup.sh). The default path for the plugins directory is /usr/lib/check_mk_agent/plugins. Please make sure that your host has Python 2.3 or later installed. On Linux this is most probably the case. On UNIX you will probably have to install it.
On Linux you can alternatively install the logwatch extension from its RPM or DEB package.
Logwatch needs to know which files to monitor and which patterns to look for. This is done in the configuration file logwatch.cfg on each target host. That file is searched for in the following directories:
If you've used the DEB or RPM package for installation or used the default settings for setup as root, the path to the file is /etc/check_mk/logwatch.cfg. That file lists all relevant logfiles and defines patterns that should indicate a critical or warning level if found in a log line. The following example defines some patterns for /var/log/messages:
/var/log/messages
 C Fail event detected on md device
 O Backup created*
 I mdadm.*: Rebuild.*event detected
 W mdadm\[
Each pattern is a regular expression and must be prefixed with one space, one of C, W, O and I and another space. The example above means:
You may list several logfiles separated by spaces:
/var/log/kern /var/log/kern.log
 C panic
 C Oops
It is also allowed to use shell globbing patterns in file names:
/sapdata/*/saptrans.log
 C critical.*error
 C some.*other.*thingy
An arbitrary number of such chunks can be listed in logwatch.cfg. Empty lines and comment lines will be ignored. This example defines different patterns for several logfiles:
# This is a comment: monitor system messages
/var/log/messages
 C Fail event detected on md device
 I mdadm.*: Rebuild.*event detected
 W mdadm\[

# Several instances of SAP log into different subdirectories
/sapdata/*/saptrans.log
 C critical.*error
 C some.*other.*thingy
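To make the file format concrete, here is a minimal Python sketch of how such a configuration could be split into chunks. It is not the parser used by mk_logwatch; the function name is hypothetical, and it treats every word on a filename line as a filename (it knows nothing about options on that line).

```python
def parse_logwatch_cfg(text):
    """Return a list of (filenames, patterns) chunks.

    patterns is a list of (level, regex_text) tuples.
    Hypothetical helper for illustration only.
    """
    chunks = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.lstrip().startswith("#"):
            continue  # empty lines and comment lines are ignored
        if line.startswith(" "):
            # Pattern line: one space, a level letter, one space, the regex.
            if (len(line) < 4 or line[1] not in "CWOI"
                    or line[2] != " " or not chunks):
                raise ValueError("bad pattern line: %r" % raw)
            chunks[-1][1].append((line[1], line[3:]))
        else:
            # Filename line: starts a new chunk with one or more logfiles.
            chunks.append((line.split(), []))
    return chunks


EXAMPLE = r"""# monitor system messages
/var/log/messages
 C Fail event detected on md device
 W mdadm\[

/var/log/kern /var/log/kern.log
 C panic
"""

chunks = parse_logwatch_cfg(EXAMPLE)
```

Each chunk pairs the list of filenames (or globbing patterns) with the patterns that follow it, which mirrors the layout of the file.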
Limiting the execution time
As of version 1.1.11i3, mk_logwatch allows you to limit the time needed to parse the new messages in a logfile. This helps in cases where logfiles are growing very fast (e.g. due to a recurring error, an endless loop or similar). Those cases often arise in the context of Java application servers logging long stack traces.
You can limit the number of new lines to be processed in a logfile as well as the time spent parsing the file. This is done by appending options to the filename lines:
/var/log/foobar.log maxlines=10000 maxtime=3 overflow=W
 C critical.*error
 C some.*other.*thingy
The options have the following meanings:
Note (1): When the number of new messages or the processing time is exceeded, the remaining new log messages are skipped and will not be parsed even in the next run. That way the agent always keeps in sync with the current end of the logfile. It follows that you might have to check the contents of the logfile manually if an overflow happened. We recommend leaving the overflow level set to C.
Note (2): It is not necessary to specify both maxlines and maxtime. It is also allowed to specify only one limit. The default is not to impose any limit at all.
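The effect of the two limits can be illustrated with a short Python sketch. It is an assumption-laden model, not the code of mk_logwatch: the function name is hypothetical, maxtime is taken to be in seconds, and overflow is taken to be the level reported when a limit is exceeded.

```python
import time


def process_new_lines(lines, maxlines=None, maxtime=None, overflow="C"):
    """Process new lines under optional limits.

    Returns (processed_lines, overflow_level); overflow_level is None
    when no limit was exceeded. Hypothetical helper for illustration.
    """
    start = time.monotonic()
    processed = []
    for count, line in enumerate(lines):
        if maxlines is not None and count >= maxlines:
            return processed, overflow   # remaining lines are skipped
        if maxtime is not None and time.monotonic() - start > maxtime:
            return processed, overflow   # time budget used up
        processed.append(line)
    return processed, None


limited, level = process_new_lines(["a", "b", "c"], maxlines=2, overflow="W")
unlimited, no_level = process_new_lines(["a", "b", "c"])
```

With maxlines=2 only the first two lines are processed and the overflow level W is reported; without any limit all lines are processed.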
Filtering filenames with regular expressions
New in 1.2.0p2: Sometimes the file matching patterns with * and ? are not specific enough to select the right logfiles. In such a case you can use the new options regex or iregex in order to further filter the filenames found by the pattern. Here is an example:
/var/log/*.log regex=/[A-Z]+\.log$
 C foo.*bar
 W some.*text
This includes only files whose path ends with a /, followed by one or more upper case letters, followed by .log, such as /var/log/FOO.log. The file /var/log/bar.log would be ignored by this line.
Note: In each logfile line you can use regex and iregex at most once.
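The filtering step can be tried out with Python's re module. The sketch below is only an illustration with a hypothetical function name. It assumes that the expression is searched anywhere in the path (which fits the example above) and that iregex is the case-insensitive variant of regex.

```python
import re


def filter_filenames(paths, regex=None, iregex=None):
    """Keep only the paths matching regex (or iregex, ignoring case).

    Hypothetical helper; at most one of the two options is expected.
    """
    if regex is not None:
        pattern = re.compile(regex)
    elif iregex is not None:
        pattern = re.compile(iregex, re.IGNORECASE)
    else:
        return list(paths)
    return [p for p in paths if pattern.search(p)]


# Stand-in for the files found by the globbing pattern /var/log/*.log
candidates = ["/var/log/FOO.log", "/var/log/bar.log"]

kept = filter_filenames(candidates, regex=r"/[A-Z]+\.log$")
kept_ignoring_case = filter_filenames(candidates, iregex=r"/[A-Z]+\.log$")
```

Here kept contains only /var/log/FOO.log, while the case-insensitive variant keeps both files.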
In order to only send new messages, mk_logwatch remembers the current byte offset of each logfile seen so far. It keeps that information in /etc/check_mk/logwatch.state. If a logfile is scanned for the very first time, all existing messages are considered to be historic and are ignored, regardless of any patterns. This behaviour is important: otherwise you would be bombarded with thousands of ancient messages when check_mk runs for the first time.
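The offset bookkeeping can be sketched as follows. This is a simplified illustration, not the code of mk_logwatch: the function name is hypothetical, the offsets live in a plain dictionary instead of the state file, and shrinking or rotated logfiles are not handled.

```python
import os
import tempfile


def read_new_lines(path, offsets):
    """Return the lines appended to path since the previous call.

    offsets maps path -> byte offset (an in-memory stand-in for the
    state file). Hypothetical helper for illustration.
    """
    if path not in offsets:
        # First scan: everything already in the file is historic.
        offsets[path] = os.path.getsize(path)
        return []
    with open(path, "rb") as f:
        f.seek(offsets[path])
        data = f.read()
        offsets[path] = f.tell()
    return data.decode("utf-8", "replace").splitlines()


with tempfile.TemporaryDirectory() as tmp:
    logfile = os.path.join(tmp, "demo.log")
    with open(logfile, "w") as f:
        f.write("ancient error\n")
    offsets = {}
    first_run = read_new_lines(logfile, offsets)    # nothing is reported
    with open(logfile, "a") as f:
        f.write("fresh error\n")
    second_run = read_new_lines(logfile, offsets)   # only the new line
```

The first run reports nothing even though the file already contains an error; the second run reports only the line appended in between.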
When something bad happens, it usually leaves more than a single line in the logfile. In order to make error diagnosis easier, logwatch always sends all new lines seen in a logfile if at least one of those lines is classified as warning or critical. If you monitor each host once a minute (a quasi-standard with Nagios), you will then see all messages that appeared in that last minute.
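A minimal Python sketch of this "all or nothing" behaviour is shown below. The function names are hypothetical, and the rule that the first matching pattern decides the level of a line is an assumption made for the sketch.

```python
import re


def classify(line, patterns):
    """Return the level of the first matching pattern, else None."""
    for level, regex_text in patterns:
        if re.search(regex_text, line):
            return level
    return None


def lines_to_send(new_lines, patterns):
    """Send all new lines if at least one of them is W or C."""
    levels = [classify(line, patterns) for line in new_lines]
    if any(level in ("W", "C") for level in levels):
        return list(new_lines)
    return []


patterns = [
    ("C", "Fail event detected on md device"),
    ("I", "mdadm.*: Rebuild.*event detected"),
    ("W", r"mdadm\["),
]

quiet = ["cron[1]: job started", "mdadm[7]: Rebuild20 event detected"]
noisy = ["cron[1]: job started",
         "kernel: Fail event detected on md device",
         "cron[1]: job done"]

sent_quiet = lines_to_send(quiet, patterns)
sent_noisy = lines_to_send(noisy, patterns)
```

For the quiet batch nothing is sent. For the noisy batch all three lines are sent, including the two unclassified cron lines that surround the critical message.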
Logwatch on Windows
The check_mk_agent.exe for Windows automatically monitors the Windows Eventlog. Its output is fully compatible with that of the logwatch extension for Linux/UNIX. The main difference is that Windows already classifies its messages with Warning or Error. Furthermore the agent automatically monitors all existing event logs it finds, so no configuration is needed by you at all on the target host. It is, however, possible to reclassify messages to a higher or lower level via the configuration variable logwatch_patterns. Messages classified as informational by Windows cannot be reclassified since they are not sent by the agent. Please refer to the article about the Windows agent for details on logwatch_patterns.
The logwatch web page
Whenever check_mk detects new log messages, it stores them on the Nagios host in a directory that defaults to /var/lib/check_mk/logwatch. Each host gets a subdirectory, and each logfile's messages are stored in one file.
The Nagios service that reflects a logfile is in warning or critical state if that file exists and contains at least one warning or critical message, respectively.
The /check_mk/logwatch.py web page allows you to browse the messages in that file and to acknowledge them once you consider the problem solved. Acknowledgement means deletion of the file. Shortly afterwards the service of the logfile enters OK state in Nagios.
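The relationship between the stored file and the service state can be modelled like this. The sketch is purely illustrative: the function names are hypothetical, and the on-disk format used here (each stored line starting with its level letter) is an assumption, not the format Check_MK actually uses.

```python
import os
import tempfile


def service_state(path):
    """Derive the service state from the stored message file."""
    if not os.path.exists(path):
        return "OK"                      # nothing unacknowledged
    with open(path) as f:
        levels = {line[:1] for line in f}
    if "C" in levels:
        return "CRITICAL"
    if "W" in levels:
        return "WARNING"
    return "OK"


def acknowledge(path):
    """Acknowledging simply deletes the stored message file."""
    if os.path.exists(path):
        os.remove(path)


with tempfile.TemporaryDirectory() as tmp:
    stored = os.path.join(tmp, "var_log_messages")   # made-up file name
    state_initial = service_state(stored)
    with open(stored, "w") as f:
        f.write("W mdadm[7]: something odd\n")
        f.write("C Fail event detected on md device\n")
    state_with_messages = service_state(stored)
    acknowledge(stored)
    state_after_ack = service_state(stored)
```

The service is OK while no file exists, becomes CRITICAL once the file contains a critical message, and returns to OK after the file has been deleted.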
The default Nagios templates of Check_MK automatically create notes_url entries for all logwatch based services to that page.
Limiting the size of unacknowledged messages
In some situations the number of error messages can get quite large in a short time. To keep the web pages usable, the logwatch check then stops storing new error messages on the monitoring server. The maximum size of a stored logfile is set to 500000 bytes. This can be overridden in main.mk by setting logwatch_max_filesize to another number:
# Limit maximum size of stored message per file to 10 KB
logwatch_max_filesize = 10000
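How such a size limit might work can be sketched in Python. This is not the code of the logwatch check: the function name is hypothetical, and the exact rule (new messages are dropped if appending them would push the file over the limit) is an assumption.

```python
import os
import tempfile


def store_messages(path, text, max_filesize=500000):
    """Append text to the stored file unless the size limit is hit.

    Returns True if the text was stored, False if it was dropped.
    Hypothetical helper for illustration.
    """
    data = text.encode("utf-8")
    current = os.path.getsize(path) if os.path.exists(path) else 0
    if current + len(data) > max_filesize:
        return False                     # limit reached: do not store
    with open(path, "ab") as f:
        f.write(data)
    return True


with tempfile.TemporaryDirectory() as tmp:
    stored = os.path.join(tmp, "var_log_messages")   # made-up file name
    first = store_messages(stored, "0123456789\n", max_filesize=20)
    second = store_messages(stored, "0123456789\n", max_filesize=20)
    final_size = os.path.getsize(stored)
```

With a limit of 20 bytes the first 11-byte message is stored and the second one is dropped, so the file stays at 11 bytes.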