1. BI - Make more out of your monitoring data
Check_MK Business Intelligence - or simply BI - is an addon to Multisite that helps you make more out of your monitoring data. Even medium-sized Nagios systems monitor several thousand individual items (hosts and services). While the classical GUI and views are sufficient for keeping track of those in many tasks, other tasks call for aggregated, top-level views.
For the admins whose task is to repair things, a detailed list of currently unhandled problems is just what they need. But other colleagues might have questions like "Which applications are affected by a certain problem?" or "What was the availability of my application XYZ?". They ask for tools that aggregate the basic details into higher-level information.
Check_MK BI comes with two modules addressing this topic:
- Aggregation - new in version 1.1.11i1
- Reporting - coming in the future
2. BI Aggregation
BI Aggregations compute the overall state of applications, hosts, or other items of interest from a subset of your basic Nagios hosts and services. Each aggregation defines a tree of dependencies visualized and executed by the BI component in Multisite.
Such aggregation trees help answer many questions that arise in daily monitoring, e.g.:
- What is the general state of application X?
- Which hardware and software components is X based on?
- Service Y is critical. Which applications are affected by this?
- I would like to shut down host Z for some time. Which applications will be affected?
- and so on...
2.1. How to aggregate - best or worst?
Compared to plain Nagios or NagVis - which always use the worst state of a list of items when displaying grouped data - Check_MK BI is much more flexible when aggregating information. Consider the following example:
You have several database instances running on HA clusters made of two nodes. You monitor both nodes and have also configured a virtual cluster host by making use of Check_MK Clusters. All components attached to the cluster might move from one node to the other.
By defining a tree of dependencies made out of your basic host and service states, you can compute an overall state of your database instance. Where things (hardware and software) are redundant, the aggregated state should use the best state of the underlying items. In other cases the worst state is used.
In other situations only the state of one of the underlying items is of interest and the rest should be ignored. A good example is the operating system state of the two physical cluster hosts where databases are running. Each database runs on only one of the cluster nodes at a time. So problems with memory consumption or a high CPU load on a node affect a database only if it is currently running on that node.
If you want to compute availability reports, this might be a very relevant case for you: you certainly do not want your availability to be reported as degraded just because of a high CPU load on the stand-by host.
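The three aggregation strategies described above (worst, best, and "only the node the item is running on") can be sketched in a few lines of Python. This is only an illustration, not Check_MK's actual implementation: the function names, the node names, and the severity ordering (UNKNOWN ranked between WARN and CRIT) are assumptions made for this example.

```python
# Hypothetical sketch of the aggregation strategies; not Check_MK code.
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3

# Assumed severity ranking: OK < WARN < UNKNOWN < CRIT
SEVERITY = {OK: 0, WARN: 1, UNKNOWN: 2, CRIT: 3}


def worst(states):
    """Return the most severe state (used for non-redundant items)."""
    return max(states, key=SEVERITY.__getitem__)


def best(states):
    """Return the least severe state (used for redundant items)."""
    return min(states, key=SEVERITY.__getitem__)


def running_on(active_node, states_by_node):
    """Return only the state of the node the item currently runs on."""
    return states_by_node[active_node]


# Example: a database on a two-node cluster, currently active on node2.
node_os = {"node1": WARN, "node2": OK}   # high CPU load on the stand-by node
hardware = best([OK, CRIT])              # redundant hardware, one part failed
os_state = running_on("node2", node_os)  # the stand-by node is ignored
database_instance = worst([OK, hardware, os_state])  # first item: DB service
```

In this example the aggregated state of the database instance is OK, because the failed hardware part is redundant and the load problem is on the stand-by node.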
The following screenshot shows a (somewhat simplified) aggregation for such a scenario:
2.2. Features and Advantages
Check_MK BI Aggregations provide a lot of interesting features, many of which are unique in the world of Nagios:
- Complete tree overview
- Flexible rule based configuration
- User definable aggregation functions
- Auto detection of services and hosts
- Very convenient business impact analysis
- Aggregations spanning several monitoring servers
- User-specific aggregations
3. BI Reporting
The second module of BI - the reporting - makes use of the BI aggregations to compute the availability of your applications. That means that you can get a report of how long a BI aggregate was in the OK, WARN, CRIT and UNKNOWN states within a certain time range in the past. Simply click on the availability icon located to the left of the BI aggregate:
This brings you to the availability reporting module, which shows you the percentage of time spent in each state. You can adjust the report in various ways with the availability options.
A click on Timeline brings you to a detailed view where you can investigate why the aggregate has been in a certain state in the past:
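To illustrate what such percentages mean, the following minimal sketch computes them from a list of state intervals. The `availability` function and the timeline format (tuples of start time, end time, and state name) are assumptions for this example and do not reflect how the reporting module stores or processes its data.

```python
def availability(timeline, start, end):
    """Return the percentage of [start, end) spent in each state.

    timeline: list of (from_time, until_time, state) tuples, times in seconds.
    Intervals are clipped to the requested time range.
    """
    span = end - start
    if span <= 0:
        raise ValueError("end must be later than start")

    seconds_in_state = {}
    for seg_start, seg_end, state in timeline:
        lo = max(seg_start, start)
        hi = min(seg_end, end)
        if hi > lo:  # the interval overlaps the requested range
            seconds_in_state[state] = seconds_in_state.get(state, 0) + (hi - lo)

    return {state: 100.0 * secs / span for state, secs in seconds_in_state.items()}


# Example: 900 seconds OK followed by 100 seconds CRIT
report = availability([(0, 900, "OK"), (900, 1000, "CRIT")], 0, 1000)
# report == {"OK": 90.0, "CRIT": 10.0}
```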
4. Permissions
By default a user is permitted to see all host and service states that they are allowed to see in the views. These permissions are configured via contacts and contact groups. In BI there is one restriction: a service is only shown in BI aggregations if the user is permitted to see both the service and its host. If the user is only permitted to see the service, but not the host, the service will not be added to the BI aggregations.
To override this behaviour you can enable the permission "BI - See all hosts and services" for a role. All users with this role assigned will be able to see the states of all hosts and services in the BI aggregations.
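The visibility rule can be summarized as a small predicate. The function and parameter names below are hypothetical and chosen only to restate the rule; they are not part of Check_MK.

```python
def service_visible_in_bi(may_see_host, may_see_service, see_all=False):
    """Decide whether a service appears in a user's BI aggregations.

    see_all stands for the role permission
    "BI - See all hosts and services".
    """
    if see_all:
        return True
    # Without the override, both the service and its host must be visible.
    return may_see_host and may_see_service
```

For example, `service_visible_in_bi(False, True)` is false because the host is not visible, while the same call with `see_all=True` is true.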
5. Scheduled Downtimes
As of version 1.2.5i1 Check_MK BI also takes into account scheduled downtimes of hosts and services. That means that a complete aggregate can also be "in scheduled downtime". This new state is automatically derived from the downtime state of the leaves by the following algorithm:
A BI aggregate is in scheduled downtime if it would have the state CRIT under the assumption that all hosts and services that are currently in scheduled downtime are DOWN or CRIT, respectively, and all other hosts and services are UP or OK, respectively.
That means that for computing the downtime state of the aggregate the current state of hosts and services is irrelevant. Only the current scheduled downtimes count.
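The downtime rule can be sketched as a re-evaluation of the aggregation tree with assumed leaf states. This is a simplified illustration, not Check_MK's implementation: the tree representation (a leaf is a string, an inner node is a tuple of an aggregation function and a list of children), the leaf names, and the mapping of a DOWN host to CRIT are assumptions for this example.

```python
OK, CRIT = 0, 2  # only these two states occur under the downtime assumption


def evaluate(node, leaf_state):
    """Evaluate a tree; leaf_state maps a leaf name to a state."""
    if isinstance(node, str):
        return leaf_state(node)
    func, children = node
    return func([evaluate(child, leaf_state) for child in children])


def in_scheduled_downtime(tree, leaves_in_downtime):
    """Apply the rule: leaves in downtime count as CRIT, all others as OK."""
    def assumed(leaf):
        return CRIT if leaf in leaves_in_downtime else OK
    return evaluate(tree, assumed) == CRIT


# worst == max and best == min for the two states used here
tree = (max, ["db service", (min, ["node1", "node2"])])

one_node = in_scheduled_downtime(tree, {"node1"})             # False
both_nodes = in_scheduled_downtime(tree, {"node1", "node2"})  # True
```

With only one of the two redundant nodes in downtime the aggregate is not in downtime, because the best-state node below it would still be OK.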
6. Acknowledgements
When a host or service is in a non-OK state then the user can acknowledge that problem. Just like the scheduled downtimes, the acknowledgement is saved as an additional attribute. As of Version 1.2.5i1 Check_MK BI now also aggregates acknowledgement information up to the top node. The following algorithm is being used for computing this from the states and acknowledgements of the nodes:
A BI aggregation is acknowledged if it would have an OK state under the assumption that all acknowledged hosts and services are UP or OK, respectively.
Note: You cannot directly acknowledge a BI aggregate that is in a problem state. You need to acknowledge its underlying host and service problems.
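The acknowledgement rule can be sketched in the same way as a re-evaluation of the tree: acknowledged leaves are assumed to be OK, all other leaves keep their real state. The tree representation, the leaf names, and the restriction to the states OK, WARN and CRIT are assumptions for this example, not Check_MK's actual data model.

```python
OK, WARN, CRIT = 0, 1, 2  # UNKNOWN is left out to keep the ordering simple


def evaluate(node, leaf_state):
    """Evaluate a tree; a leaf is a string, an inner node is (func, children)."""
    if isinstance(node, str):
        return leaf_state(node)
    func, children = node
    return func([evaluate(child, leaf_state) for child in children])


def is_acknowledged(tree, actual_states, acknowledged):
    """Apply the rule: acknowledged leaves count as OK, others keep their state."""
    def assumed(leaf):
        return OK if leaf in acknowledged else actual_states[leaf]
    return evaluate(tree, assumed) == OK


# worst == max and best == min for the integer states used here
tree = (max, ["db service", (min, ["node1", "node2"])])
actual = {"db service": CRIT, "node1": OK, "node2": CRIT}

with_ack = is_acknowledged(tree, actual, {"db service"})  # True
without_ack = is_acknowledged(tree, actual, set())        # False
```

Taken literally, the rule is also satisfied by an aggregate that is already OK, so this check is presumably only meaningful for aggregates that are in a problem state.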