1. Predictive Monitoring - Introduction
Defining warning and critical levels for checks which measure performance values is not always easy. Setting levels too low creates false alarms while choosing them too high makes the monitoring blind to problems. Let's take the CPU load of a system as an example: You might have a server that is idle most of the time but needs larger amounts of computing power for some short time periods on a regular basis.
Let's assume that a big job runs each day from approximately 1:00 am until 6:00 am, except on Saturdays and Sundays. During these time periods a CPU load of 10 is completely normal. During the rest of the time even a load of 3 could be suspicious. Wouldn't it be nice to be able to define the levels based on these times?
As of version 1.2.3i1 Check_MK can do that for you and even goes one step further. It can automatically learn what is normal and create such levels for you, while continuously adapting them to reality. Some people call this anomaly detection. Check_MK calls it Predictive Monitoring, since the levels for the future are based on a prediction which is computed from the values of the past. This is done by analysing historic data contained in the RRD files of PNP4Nagios.
2. Setting up Predictive Monitoring
All you have to do in order to use predictive monitoring with Check_MK is to make sure that your PNP4Nagios integration works and your RRD files are being created and updated correctly (users of OMD may safely take this for granted). The configuration is done on a per-service basis. When a check type supports predictive monitoring, you will be able to select it in the WATO ruleset for configuring the check. A few examples of check types supporting prediction are:
- CPU load
- Disk IO on Linux
- Disk IO on Windows
- Disk IO on HPUX LUNs
- MySQL Daemon InnoDB IO
- Context Switches, Process Creations and Major Page Faults on Linux
2.1. Base prediction on
Here you configure the interval at which you expect the performance data to repeat. If you select Day of the week then Check_MK will compute a different reference curve for every day of the week. Hour of the day creates just one curve to be used for every day and assumes that the measured value develops similarly each day. Minute of the hour assumes a cycle of 60 minutes and is mainly useful for testing the prediction algorithm.
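The effect of the three choices can be illustrated with a small Python sketch. It is not Check_MK's own code: the function name period_slot and the option strings are hypothetical and only mirror the labels above. For a given timestamp it returns which reference curve would apply and the position within that curve.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def period_slot(ts: datetime, period: str) -> tuple[str, int]:
    """Return (curve_name, offset_in_seconds) for a timestamp.

    The option strings are made up for this sketch and are not
    Check_MK's internal identifiers.
    """
    seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    if period == "day_of_week":
        # One curve per weekday, each covering 24 hours.
        return WEEKDAYS[ts.weekday()], seconds_into_day
    if period == "hour_of_day":
        # A single 24-hour curve shared by all days.
        return "everyday", seconds_into_day
    if period == "minute_of_hour":
        # A single 60-minute curve shared by all hours.
        return "everyhour", ts.minute * 60 + ts.second
    raise ValueError(f"unknown period: {period}")


# 2024-01-01 was a Monday.
ts = datetime(2024, 1, 1, 1, 30)
print(period_slot(ts, "day_of_week"))
print(period_slot(ts, "hour_of_day"))
print(period_slot(ts, "minute_of_hour"))
```

With Day of the week, a measurement from Monday 1:30 am is only ever compared with other Mondays, which is what lets a weekday-only job be told apart from the weekend.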
2.2. Time horizon
Here you specify how far into the past Check_MK shall look when it computes the reference value for the prediction. It does not make sense to go too far into the past, since your typical values might have changed by then. If you select Day of the week and a time horizon of 90 days, then for each day of the week, e.g. Monday, 12 reference Mondays of the past will be taken into account. For each minute of the day (depending on the resolution of your RRDs) the average of these 12 Mondays will be taken as a reference for computing the levels.
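The averaging described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual implementation: the helper names reference_days and reference_curve are hypothetical, the input is taken to be already fetched from the RRD as one list of values per reference day, and the population standard deviation is used without knowing which variant Check_MK applies.

```python
from statistics import mean, pstdev


def reference_days(horizon_days: int, period_days: int = 7) -> int:
    """Number of whole past periods that fit into the time horizon."""
    return horizon_days // period_days


def reference_curve(past_days: list[list[float]]) -> list[tuple[float, float]]:
    """Return (average, standard deviation) for each time slot.

    past_days holds one list of measurements per reference day. All
    lists must have the same number of slots, e.g. one per minute.
    """
    return [(mean(slot), pstdev(slot)) for slot in zip(*past_days)]


# A 90 day horizon contains 12 whole weeks, hence 12 reference Mondays.
print(reference_days(90))

# Three made-up Mondays with only two slots each, to keep it short.
mondays = [[1.0, 9.0], [3.0, 10.0], [2.0, 11.0]]
for average, deviation in reference_curve(mondays):
    print(round(average, 2), round(deviation, 2))
```

The standard deviation is computed alongside the average because one of the dynamic level options in the next section depends on it.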
2.3. Dynamic levels (upper bound)
Here you can specify how the dynamic warning/critical levels shall be computed from the reference value. You have three choices:
- Absolute difference to prediction: A fixed value will be added to the reference value.
- Relative difference to prediction: The reference value will be raised by the given percentage to compute the levels.
- In relation to standard deviation: The standard deviation tells us how much the values of the previous reference periods (e.g. the past Mondays) differ from each other. In other words: the more precise the prediction, the lower the standard deviation. Thus setting levels in relation to it creates stricter levels for time periods with a more precise prediction.
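The three choices boil down to simple arithmetic on the reference value. The following Python sketch shows one plausible reading of them; the function upper_levels and its option strings are hypothetical, and treating the standard deviation option as "reference plus a multiple of the deviation" is an assumption rather than something stated above.

```python
def upper_levels(reference: float, stdev: float, how: str,
                 warn: float, crit: float) -> tuple[float, float]:
    """Compute (warning, critical) upper levels from a reference value.

    how selects the method; the strings are made up for this sketch.
    """
    if how == "absolute":
        # warn/crit are fixed offsets added to the reference value.
        return reference + warn, reference + crit
    if how == "relative":
        # warn/crit are percentages by which the reference is raised.
        return reference * (1 + warn / 100), reference * (1 + crit / 100)
    if how == "stdev":
        # Assumption: warn/crit are multiples of the standard deviation.
        return reference + warn * stdev, reference + crit * stdev
    raise ValueError(f"unknown method: {how}")


# Predicted CPU load of 4.0 with a standard deviation of 0.5.
print(upper_levels(4.0, 0.5, "absolute", 2.0, 4.0))
print(upper_levels(4.0, 0.5, "relative", 50.0, 100.0))
print(upper_levels(4.0, 0.5, "stdev", 2.0, 4.0))
```

Note how the third variant tightens by itself: if the past Mondays agree closely, the deviation shrinks and the levels move closer to the reference curve.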
2.4. Dynamic levels (lower bound)
3. Analysing the prediction
As soon as you have configured a prediction for a service and the service has been checked once, an icon (bulb) leads you to the prediction analyzer:
Clicking on the bulb will show you a graph of the current prediction period. There you can see the predicted reference curve (black), the current value (blue) and the warning and critical areas (yellow and red):
Note: when changing prediction parameters the prediction will be updated as soon as:
- You have activated the changes
- The Check_MK service on the host has run once