Check_MK Business Intelligence - Aggregation Rules


This article is no longer maintained and may no longer be valid!

1. How to configure BI aggregations

The configuration of Check_MK BI aggregations is rule based. This means that you do not have to specify every single service for every aggregation explicitly (as with NagiosBP, for example). Rather you define rules, which are then used to create a large number of aggregations automatically - based on the actually existing services and hosts.

As an example, let's take the aggregation of the operating system state and hardware resources of a Linux server. You define once a set of rules describing how the states of hardware sensors, filesystems, performance values and other aspects should be aggregated together. Then you use this rule set for several or all Linux hosts which are currently available.

This approach has two advantages:

  • You have much less to configure.
  • Changes in your environment are automatically reflected.

As always, there is also a disadvantage: writing (intelligent) rules is a bit more challenging than configuring every service aggregation explicitly. As always with Check_MK: "Learn more, work less!"

2. Where to put the configuration

Since Check_MK BI is integrated into Multisite, the rule configuration is done in multisite.mk. In order to make things a bit more flexible, however, Multisite now has the option to split the configuration into several files. This works similarly to main.mk and conf.d/. The directory for Multisite is multisite.d, and all files ending in .mk will be read in lexicographical order - after multisite.mk. Please note that files for Multisite and Check_MK are not compatible and must not be mixed up!

Please also note that - as with conf.d - you need to extend lists by appending values to them with += in order not to lose values from previously read files in multisite.d or from multisite.mk.
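
The difference between = and += across several files can be illustrated with a small, self-contained sketch. It does not use Multisite itself: load_config and the three file contents are hypothetical stand-ins that merely imitate reading several Python configuration files into one shared namespace.

```python
# Sketch only: imitates reading multisite.mk and files from multisite.d
# into one shared namespace. load_config is a hypothetical helper and
# not part of Check_MK.

def load_config(sources):
    namespace = {"aggregations": [], "aggregation_rules": {}}
    for source in sources:           # files are read one after another
        exec(source, namespace)
    return namespace

multisite_mk = 'aggregations += [ ("Hosts", "cpu_usage", []) ]'
good_file    = 'aggregations += [ ("Hosts", "nic_eth0", []) ]'
bad_file     = 'aggregations = [ ("Hosts", "nic_eth0", []) ]'

appended = load_config([multisite_mk, good_file])["aggregations"]
overwritten = load_config([multisite_mk, bad_file])["aggregations"]

print(len(appended))      # both entries survive
print(len(overwritten))   # the entry from multisite.mk has been replaced
```

With += the second file extends the list, whereas a plain assignment silently discards everything that was read before.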

So in the following documentation - when referring to multisite.mk - we silently assume that the definitions can also be put in files below multisite.d.

3. First simple example

As a first example we make an explicit rule building a simple aggregation on the host zbghora35 of the two services CPU load and CPU utilization. Wait - haven't we said that explicit configuration is bad? Well, yes. But as a starting point for understanding rules it will do best.

Configuration is done in two variables: aggregation_rules and aggregations. First comes our rule:

multisite.mk
aggregation_rules["cpu_usage"] = (
  "CPU Usage",                            # description, title
  [],                                     # list of parameters (empty here)
  "worst",                                # aggregation function
  [                                       # list of elements, nodes
    ( "zbghora35", "CPU load" ),
    ( "zbghora35", "CPU utilization" ),
  ]
)

As you can see from this example, the rule cpu_usage is a tuple with the following four parts:

  1. The title or description of the rule. This will be shown as the aggregation's name in the GUI.
  2. A list of parameters. We do not need them in the example and put an empty list [] here. I will explain parameters later.
  3. The name of the aggregation function. "worst" will simply select the worst state of all nodes as total state of the aggregation. More details will come later.
  4. The list of elements in the aggregation.

Each element can either be a reference to another rule or a pair of host and service. Replace the service with the keyword HOST_STATE (without quotes) in order to refer to the Nagios host state of that host.

Defining a rule will not automatically create an aggregation object. For that purpose we must use (instantiate) the rule. Programmers might be inclined to speak of calling the rule - much like a function. This is done via the variable aggregations:

aggregations = [
  ( "Hosts", "cpu_usage", [] ) # call rule 'cpu_usage' with no parameters
]

Each entry in aggregations is a triple of:

  1. The aggregation group. Grouping aggregations is a mere display feature.
  2. The name of the rule to use.
  3. The list of arguments to call the rule with. Since our rule takes no arguments, we need not and must not specify any, and again put an empty list here.

Now let's test this. The good news here is: No restart of Nagios or Apache is needed! All changes to the configuration are effective immediately. If you select the view All Aggregations from your Views snapin, you should see something like the following (provided you have a host named zbghora35).

If you get just one line with CPU Usage, then click on it and the two nodes below it will be displayed.

4. Rules using other rules

Our next step is to make a slightly more complex tree. This is done by rules using other rules. Let's first create a rule of three services dealing with the network interface eth0 of that host:

aggregation_rules["nic_eth0"] = (
  "NIC eth0", [], "worst", [
    ( "zbghora35", "NIC eth0 link" ),
    ( "zbghora35", "NIC eth0 counters" ),
    ( "zbghora35", "NIC eth0 parameter" ),
  ]
)

Nothing new so far. Now we combine the information of the CPU, the NIC eth0, the state of the Check_MK service and the host status (PING) into one new rule called host_ressources:

aggregation_rules["host_ressources"] = (
  "Host ressources", [], "worst", [
    ( "cpu_usage", [] ),               # call rule cpu_usage
    ( "nic_eth0", [] ),                # call rule nic_eth0
    ( "zbghora35", "Check_MK$" ),      # state of service Check_MK
    ( "zbghora35", HOST_STATE ),       # host state of zbghora35
  ]
)

Please note that we appended a $ to Check_MK. This makes the pattern match the service name exactly. Otherwise the rule would match all services beginning with Check_MK (e.g. also Check_MK Inventory).
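
The effect of the trailing $ can be tried out in plain Python. The following sketch assumes that BI anchors the pattern at the beginning of the service name in the same way as Python's re.match; the helper service_matches and the list of services are made up for illustration.

```python
import re

def service_matches(pattern, service):
    # re.match anchors the pattern at the beginning of the string only,
    # so without a trailing $ it behaves like a prefix match.
    return re.match(pattern, service) is not None

# Hypothetical service names of one host
services = ["Check_MK", "Check_MK Inventory", "CPU load"]

prefix = [s for s in services if service_matches("Check_MK", s)]
exact = [s for s in services if service_matches("Check_MK$", s)]

print(prefix)   # services beginning with Check_MK
print(exact)    # only the service named exactly Check_MK
```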

Having that in mind we could simplify our rule nic_eth0:

aggregation_rules["nic_eth0"] = (
  "NIC eth0", [], "worst", [
    ( "zbghora35", "NIC eth0" ),  # Match all services beginning with NIC eth0
  ]
)

5. Using parameters

As you surely have noticed, in our above examples we simply hard-coded the host name zbghora35 into our rules. This is not very practicable, since we would need a separate set of rules for each host. It is much better to use a parameter for the host name.

Let's go back to our very first example and call this parameter HOST. Here is a new and better version of our first rule:

aggregation_rules["cpu_usage"] = (
  "CPU Usage", [ "HOST" ], "worst", [
    ( "$HOST$", "CPU load" ),
    ( "$HOST$", "CPU utilization" ),
  ]
)

We need to specify an argument when using this rule:

aggregations = [
  ( "Hosts", "cpu_usage", ["zbghora35"] ),
]

Now it's easy to use the same rule for a couple of different hosts:

aggregations = [
  ( "Hosts", "cpu_usage", ["zbghora35"] ),
  ( "Hosts", "cpu_usage", ["zbghora36"] ),
  ( "Hosts", "cpu_usage", ["zbghora37"] ),
]

What happens if one of those hosts does not exist in the Nagios data? Nothing. The rule simply does not apply, and no aggregation is created for that host. This is especially useful in multi-site setups when some sites are not reachable or deselected and thus some hosts are not visible in the GUI. BI will not complain. It will simply ignore these hosts.

The same holds if services specified in a rule are missing. These services are simply dropped from the rule for the affected hosts. This makes it easier to write generic rules for all hosts. Here is our complete example converted to the HOST parameter:

multisite.mk
aggregation_rules["cpu_usage"] = (
  "CPU Usage", [ "HOST" ], "worst", [
    ( "$HOST$", "CPU load" ),
    ( "$HOST$", "CPU utilization" ),
  ]
)

aggregation_rules["nic_eth0"] = (
  "NIC eth0", [ "HOST" ], "worst", [
    ( "$HOST$", "NIC eth0" ),
  ]
)

aggregation_rules["host_ressources"] = (
  "Host ressources", [ "HOST" ], "worst", [
    ( "cpu_usage", ["$HOST$" ] ),
    ( "nic_eth0", [ "$HOST$" ] ),
    ( "$HOST$", "Check_MK$" ),
    ( "$HOST$", HOST_STATE ),
  ]
)

aggregations = [
  ( "Hosts", "host_ressources", [ "zbghora35" ] ),
  ( "Hosts", "host_ressources", [ "zbghora36" ] ),
  ( "Hosts", "host_ressources", [ "zbghora37" ] ),
]
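
How the $HOST$ placeholders are filled in can be pictured with the following sketch. It is a hypothetical re-implementation for illustration only, not the code Multisite uses: substitute, expand and the HOST_STATE stand-in are assumptions, and dropping of missing hosts or services is left out.

```python
# Sketch only: hypothetical expansion of parameterized rules.
HOST_STATE = object()    # stand-in for the Multisite keyword

def substitute(text, params, args):
    # Replace each $NAME$ with the argument passed for that parameter
    for name, value in zip(params, args):
        text = text.replace("$%s$" % name, value)
    return text

def expand(rules, rule_name, args):
    title, params, func, elements = rules[rule_name]
    children = []
    for first, second in elements:
        if isinstance(second, list):        # reference to another rule
            sub_args = [substitute(a, params, args) for a in second]
            children.append(expand(rules, first, sub_args))
        else:                               # pair of host and service
            host = substitute(first, params, args)
            if isinstance(second, str):
                second = substitute(second, params, args)
            children.append((host, second))
    return (substitute(title, params, args), func, children)

rules = {
    "cpu_usage": ("CPU Usage", ["HOST"], "worst", [
        ("$HOST$", "CPU load"),
        ("$HOST$", "CPU utilization"),
    ]),
    "host_ressources": ("Host ressources", ["HOST"], "worst", [
        ("cpu_usage", ["$HOST$"]),
        ("$HOST$", "Check_MK$"),
        ("$HOST$", HOST_STATE),
    ]),
}

tree = expand(rules, "host_ressources", ["zbghora35"])
print(tree[2][0])   # the expanded cpu_usage subtree
```

Calling expand once per host name corresponds to one entry in aggregations.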

6. Automatic detection of (sensible) rules

Using the parameter HOST makes the configuration a lot simpler by allowing us to reuse rules for several hosts. But we still had to manually define which objects we put into a rule.

6.1. Using regular expressions

One of the key features of Check_MK BI now comes into play: the live automatic detection of objects. Let's say we want to add a topic Filesystems to our host rules containing all filesystems of the host (i.e. all services beginning with fs_). This can easily be done with:

multisite.mk
aggregation_rules["filesystems"] = (
  "Filesystems", [ "HOST" ], "worst", [
    ( "$HOST$", "fs_" ),
  ]
)

This will automatically add all services beginning with fs_ to the rule. To be exact, a regular expression match is done on the beginning of the service. The following entry would thus match all services containing the text WP1:

    ( "$HOST$", ".*WP1" ),

By using brackets and vertical bars you can put alternatives into one line:

    ( "$HOST$", "CPU (load|util)" ),

For those not familiar with regular expressions, here is a short summary of the most important special characters:

  .              matches one arbitrary character
  .*             matches any string, also an empty one
  +              the previous character at least once or multiple times
  [a-z]          one of the characters a, b, ..., z
  [a-z]+         a non-empty sequence of lower case characters
  [a-z0-9_]      one digit, lower case character or underscore
  [0-9]*         zero or more digits
  [a-z]+[0-9]*   one or more lower case characters, then zero or more digits (like eth0 or cluster12)
  (foo|bar)      matches foo as well as bar

6.2. Calling a rule for all or some hosts

Our next step is to call a rule automatically for all hosts - or only for hosts having certain host tags. The following example will call the rule host_ressources once for all hosts:

aggregations += [
  ( "Hosts", FOREACH_HOST, ALL_HOSTS, "host_ressources", ["$1$"] ),
]

Please watch carefully how the line is built out of five elements:

  1. The aggregation group
  2. The keyword FOREACH_HOST
  3. The keyword ALL_HOSTS
  4. The name of the rule to call
  5. The argument list, where "$1$" will be substituted with the host name

If you want to make use of host tags, put them as a list of strings in front of the keyword ALL_HOSTS. The following example will call the rule only for hosts having the tag lnx:

aggregations += [
  ( "Hosts", FOREACH_HOST, ["lnx"], ALL_HOSTS, "host_ressources", ["$1$"] ),
]
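
What FOREACH_HOST does with such an entry can be sketched as follows. The host inventory, the helper foreach_host and the assumption that a host must carry all listed tags are illustrative and not taken from the Check_MK sources.

```python
# Sketch only: hypothetical expansion of a FOREACH_HOST entry.
# Hypothetical inventory: host name -> set of host tags
hosts = {
    "zbghora35": {"lnx", "prod"},
    "zbghora36": {"lnx", "test"},
    "winsrv01": {"win", "prod"},
}

def foreach_host(group, required_tags, rule, arg_template):
    calls = []
    for name in sorted(hosts):
        # Assumption: a host is selected if it carries all required tags
        if set(required_tags) <= hosts[name]:
            args = [a.replace("$1$", name) for a in arg_template]
            calls.append((group, rule, args))
    return calls

all_calls = foreach_host("Hosts", [], "host_ressources", ["$1$"])
lnx_calls = foreach_host("Hosts", ["lnx"], "host_ressources", ["$1$"])

for call in lnx_calls:
    print(call)
```

Each resulting triple has the same shape as an explicit entry in aggregations.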

6.3. Iterating over parents and children of hosts

New in 1.2.0p2: The new iterator methods FOREACH_CHILD and FOREACH_PARENT let you make use of the parent/child relationship of hosts in order to call rules. The syntax is the same as with FOREACH_HOST, with the following two differences:

  1. The host's children or parents are selected instead of the matched hosts.
  2. The name of the child/parent will be added to the list of matching text groups for each hit.

What does this mean? Let's have a look at the following aggregation rule:

aggregation_rules["clusterstate"] = (
  "Cluster state $CLUSTER$",
  [ "CLUSTER" ],
  "best",
  [
    ( FOREACH_PARENT, "$CLUSTER$", "host", [ "$1$" ] ),
  ]
)

Now let us assume that a host foo is a Check_MK cluster with the three nodes node1, node2 and node3. Check_MK regards the nodes of a cluster as parents of the cluster. So calling the rule above for the host foo will create the following expansion:

aggregation_rules["clusterstate"] = (
  "Cluster state foo",
  [ "CLUSTER" ],
  "best",
  [
    ( "host", [ "node1" ] ),
    ( "host", [ "node2" ] ),
    ( "host", [ "node3" ] ),
  ]
)

Note: if you do not specify a fixed host name for the host but work with ALL_HOSTS, one rule incarnation will be created for each parent/child-relationship for each of the matching hosts, while $1$ will be the matching host and $2$ the parent or child:

aggregations = [
  ( "Test", FOREACH_CHILD, [ "sometag", "othertag" ],
            ALL_HOSTS, "myrule", [ "$1$", "$2$" ]),
]

The rule myrule will now be called once for each child of each host with the tags sometag and othertag - where $1$ will be substituted with the host and $2$ with the child. If a host is a child of more than one host, then the rule will be called several times accordingly.
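
The pairing of $1$ and $2$ can be illustrated with a short sketch. The parent table and the helper foreach_child are hypothetical; the sketch only shows which argument lists would result under the behaviour described above.

```python
# Sketch only: hypothetical expansion of a FOREACH_CHILD entry.
# Hypothetical topology: child host -> list of its parents
parents = {
    "web1": ["switch1"],
    "web2": ["switch1", "switch2"],
}

def foreach_child(matching_hosts, rule, arg_template):
    calls = []
    for host in matching_hosts:
        for child in sorted(parents):
            if host in parents[child]:
                # $1$ is the matching host, $2$ one of its children
                args = [a.replace("$1$", host).replace("$2$", child)
                        for a in arg_template]
                calls.append((rule, args))
    return calls

calls = foreach_child(["switch1", "switch2"], "myrule", ["$1$", "$2$"])
for call in calls:
    print(call)
```

Here web2 has two parents and therefore shows up in two calls.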

6.4. Calling a rule for each host having a certain service

We can build yet more intelligent rules by making BI watch out for certain services. Let's assume you want to call the above rule for all hosts having the service Check_MK. This can be done with the keyword FOREACH_SERVICE. After the keyword ALL_HOSTS you put a regular expression for the service(s) to look for:

aggregations += [
  ( "Hosts", FOREACH_SERVICE, ALL_HOSTS, "Check_MK$", "host_ressources",["$1$"]),
]

Please note that again the $ matches the end of the service name and thus avoids also matching services like Check_MK Version.

6.5. Extracting data from the service name

Let's now assume that we do not want to build rules for hosts but for applications. A good example are database instances. The Check_MK checks for ORACLE - as an example - always have a service ORA ... Sessions for each ORACLE instance running on that host. We now want to use that service in order to automatically create an aggregation object for each ORACLE instance found on any host. We might have created the following aggregate rule for that task:

aggregation_rules["db"] = (
  "$DB$",
  [ "HOST", "DB" ], # take two parameters
  "worst",
  [
      ( "$HOST$", "ORA $DB$ (Sessions|Logswitches)" ),
      ( "oracle_log",    [ "$HOST$", "$DB$" ] ),  # Logfiles
      ( "oracle_tbs",    [ "$HOST$", "$DB$" ] ),  # Tablespaces
      ( "db_host_state", [ "$HOST$" ]),           # general state of host
  ]
)

This rule needs two parameters: the name of the host and the name of the database instance. The following code does the trick of calling that rule exactly once for each database known in your monitoring system. This time we've put each element into a single line:

aggregations += [
  ( "ORACLE",                # name of the aggregation group
    FOREACH_SERVICE,         # iterate over found services
    ALL_HOSTS,               # on all hosts
    "ORA (.*) Logswitches",  # service pattern with group in brackets
    "db",                    # name of rule to call
    ["$1$", "$2$" ]          # arguments: first = HOST, second = DB
  ),
]

The brackets in (.*) are important! They make sure that $2$ is assigned the exact text that matched the .* in the brackets - in our case the name of the database instance!
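
How the group in brackets ends up as $2$ can be reproduced with Python's re module. The list of services below is invented, and the loop is only a sketch of what FOREACH_SERVICE is described to do, not its actual implementation.

```python
import re

# Hypothetical pairs of host name and service description
services = [
    ("dbsrv1", "ORA PROD Logswitches"),
    ("dbsrv1", "ORA PROD Sessions"),
    ("dbsrv2", "ORA TEST Logswitches"),
    ("dbsrv2", "CPU load"),
]

pattern = re.compile("ORA (.*) Logswitches")

calls = []
for host, description in services:
    match = pattern.match(description)
    if match:
        # $1$ is the host name, $2$ the text matched by the first group
        calls.append(("ORACLE", "db", [host] + list(match.groups())))

for call in calls:
    print(call)
```

Because only the Logswitches service is used for detection, each instance triggers exactly one call of the rule db.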

You might already see the great advantage of this approach: Your BI configuration will always show all databases in your system - without any reconfiguration. As soon as a new database appears in your monitoring data, it will immediately show up in your BI aggregations.

7. Aggregation functions

So far we have always assumed that the worst state in a list of nodes defines the total state. This is not always what we want. You might aggregate a pair of redundant network interfaces into a node and wish to use the best state.

The aggregation function determines how the state is computed. Currently the following functions are available:

  worst        Take the worst state of all subnodes in the order OK, PENDING, WARN, UNKNOWN, CRIT.
  best         Take the best state of all subnodes (according to the same order).
  count_ok     New in version 1.2.0: Just count the number of nodes in the state OK. By default the total state is OK if at least two nodes are OK; this number is configurable with the first parameter. If at least one node is OK (configurable with the second parameter), the total state is WARN. New in 1.2.0p2: the levels can also be specified as percentages.
  running_on   Take into account on which node of a cluster an application is currently running (sophisticated!).

If you use best and at least one node is OK, your total state will be OK as well. The following rule combines all NIC states into a new rule assuming that they are redundant (it calls the rule nic for all NICs matching $NIC$):

aggregation_rules["nic_redundant"] = (
  "Redundant NICs", [ "HOST", "NIC" ], "best", [
    ( "nic", [ "$HOST$", "$NIC$" ] ),
  ]
)

7.1. Parameters

Aggregation functions can be parameterized. The functions worst and best both take two optional integer arguments. Arguments are - Nagios like - separated with an exclamation mark right within the string. Those two arguments are:

  1. Index n (making the nth worst or nth best)
  2. The worst possible state for the aggregation as Nagios state (one of 0, 1, 2, 3 or -1, where -1 means PENDING)

Let's assume we have five subnodes with the states CRIT, CRIT, UNKNOWN, WARN and OK. Then here are a few examples of how different aggregations work:

  Function     Explanation                                   State
  worst!3      Take the third worst state.                   UNKNOWN
  best!2       Take the second best state.                   WARN
  worst!1!1    Take the worst state, but at most WARN (1).   WARN
  worst!1!0    Make this aggregation always be OK.           OK
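
The results for worst and best can be checked with a small re-implementation. This is a sketch under stated assumptions, not the code used by BI: the function aggregate, the severity table and the clamping of n to the number of nodes are illustrative, while the state order follows the description of worst above.

```python
# Sketch only: hypothetical re-implementation of worst!n!m and best!n!m.
# Rank of each Nagios state in the order OK, PENDING, WARN, UNKNOWN, CRIT
SEVERITY = {0: 0, -1: 1, 1: 2, 3: 3, 2: 4}
NAMES = {0: "OK", -1: "PENDING", 1: "WARN", 3: "UNKNOWN", 2: "CRIT"}

def aggregate(function, states):
    parts = function.split("!")
    name = parts[0]
    if name not in ("worst", "best"):
        raise ValueError("unsupported function: %s" % name)
    n = int(parts[1]) if len(parts) > 1 else 1
    cap = int(parts[2]) if len(parts) > 2 else 2    # default cap: CRIT
    ordered = sorted(states, key=SEVERITY.get, reverse=(name == "worst"))
    n = min(n, len(ordered))     # assumption: clamp n to the node count
    state = ordered[n - 1]
    if SEVERITY[state] > SEVERITY[cap]:
        state = cap              # never report worse than the cap
    return NAMES[state]

states = [2, 2, 3, 1, 0]         # CRIT, CRIT, UNKNOWN, WARN, OK

for function in ["worst!3", "best!2", "worst!1!1", "worst!1!0"]:
    print(function, aggregate(function, states))
```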

Examples for count_ok:

  Function           Explanation                                                                                   State
  count_ok!3         Make the state OK if at least three nodes are OK, and WARN if at least one node is OK.        WARN
  count_ok!3!3       The same, but never use the WARN state.                                                       CRIT
  count_ok!70%!50%   Specify levels as a percentage of the total number of nodes (new in 1.2.0p2).                 CRIT
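
The count_ok examples can be reproduced in the same way. Again this is only a sketch: the function count_ok below is a hypothetical re-implementation that assumes default levels of 2 and 1 as described above, CRIT when neither level is reached, and a non-empty list of nodes.

```python
# Sketch only: hypothetical re-implementation of count_ok.
def count_ok(function, states):
    parts = function.split("!")
    ok_level = parts[1] if len(parts) > 1 else "2"
    warn_level = parts[2] if len(parts) > 2 else "1"
    num_ok = sum(1 for s in states if s == 0)

    def reached(level):
        if level.endswith("%"):
            # percentage of all subnodes (assumes at least one node)
            return num_ok * 100.0 / len(states) >= float(level[:-1])
        return num_ok >= int(level)

    if reached(ok_level):
        return "OK"
    if reached(warn_level):
        return "WARN"
    return "CRIT"

states = [2, 2, 3, 1, 0]         # CRIT, CRIT, UNKNOWN, WARN, OK

for function in ["count_ok!3", "count_ok!3!3", "count_ok!70%!50%"]:
    print(function, count_ok(function, states))
```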

7.2. Custom aggregation functions

It is possible to write your own aggregation functions for BI. You can do this directly in the same configuration file as your BI rules. Of course you need some practice in Python programming in order to succeed. But this way you are completely free in how you combine things into a new status.

Here is a hello world example that will always aggregate to the status 'OK'. As a first step you define a new function that takes one argument. It returns a dictionary with the aggregated status and aggregated text output:

def aggr_hello(nodes):
    return { "state" : 0, "output" : "Hello World!" }

This function must then be declared:

aggregation_functions["hello"] = aggr_hello

Now you can use it instead of "best" or "worst" in any aggregation. In order to do something useful, however, you need to evaluate the argument nodes, of course. This argument contains a list of entries. Each describes the details about one subtree (node). A small modification to our hello world aggregation will print the complete nodes information in Python notation:

import pprint
def aggr_dns(nodes):
    return { "state" : 1, "output" : "<pre>%s</pre>" % pprint.pformat(nodes) }

Since this is a lot of output, let's just reduce it to the first node:

import pprint
def aggr_dns(nodes):
    return { "state" : 1, "output" : "<pre>%s</pre>" % pprint.pformat(nodes[0]) }

Each node entry consists of a pair of

  1. State information
  2. Tree information

In most cases you will need just the state information. It is a dictionary with at least two keys:

  • "state" - The state of this subnode (0, 1, 2 or 3)
  • "output" - The text output of the subnode

The tree information is static data and is also a dictionary. It has at least the following keys:

  • "title" - The title of the subnode
  • "reqhosts" - A list of the nodes contained in the subtree (as pairs of site and hostname)
  • "type" - The type of the subnode: 1 is a leaf node, 2 an aggregated node.
  • "host" - A pair of site and host (only for leaf nodes)
  • "service" - The service description (only for leaf nodes of services)
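
The two halves of a node entry can be used together as in the following sketch. The function aggr_list_problems, its name list_problems and the sample nodes are made up for illustration; only the keys "state", "output" and "title" are taken from the description above, and the empty aggregation_functions dictionary is a stand-in for the one Multisite provides.

```python
# Sketch only: a hypothetical custom aggregation function that reports
# the titles of all subnodes which are not OK.
def aggr_list_problems(nodes):
    # node[0] is the state information, node[1] the tree information
    problems = [node[1]["title"] for node in nodes if node[0]["state"] != 0]
    if problems:
        return {"state": 2, "output": "Problems: " + ", ".join(problems)}
    return {"state": 0, "output": "All %d subnodes are OK" % len(nodes)}

aggregation_functions = {}   # stand-in; already defined within Multisite
aggregation_functions["list_problems"] = aggr_list_problems

# Invented sample data in the documented shape
sample_nodes = [
    ({"state": 0, "output": "OK - load 0.5"}, {"title": "CPU load", "type": 1}),
    ({"state": 2, "output": "CRIT - link down"}, {"title": "NIC eth0 link", "type": 1}),
]

result = aggr_list_problems(sample_nodes)
print(result["state"], result["output"])
```

Testing a function with invented nodes like this makes it easier to find mistakes before the function is used in a rule.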

Here is a more complex example that computes the state of DNS from the point of view of a client. It assumes that a client can detect a failure of a DNS server only if the server is not responding at all (its host being down). It also assumes that there are six subnodes in the aggregation: the host state and the state of the DNS service of each of the three servers:

# Aggregation function for DNS
# Assumption: We have exactly six nodes
# Host State Server 1
# DNS Server 1
# Host State Server 2
# DNS Server 2
# Host State Server 3
# DNS Server 3

def aggr_dns(nodes):
    host_states = [ nodes[x][0]["state"] for x in [0,2,4] ]
    dns_states = [ nodes[x][0]["state"] for x in [1,3,5] ]
    if dns_states[0] == 0:
        state = 0
        text = "DNS 1 working"
    elif host_states[0] != 0 and dns_states[1] == 0:
        state = 1
        text = "DNS 1 down, DNS 2 working"
    elif host_states[0] != 0 and host_states[1] != 0 and dns_states[2] == 0:
        state = 1
        text = "DNS 1+2 down, DNS 3 working"
    else:
        state = 2
        text = "DNS not working!"

    return { "state" : state, "output" : text }

aggregation_functions["dns"] = aggr_dns

The following example uses that aggregation function:

aggregation_rules["dns"] = (
   "DNS overall",
   [],
   "dns",
   [
      ( "server1", HOST_STATE ),
      ( "server1", "DNS" ),
      ( "server2", HOST_STATE ),
      ( "server2", "DNS" ),
      ( "server3", HOST_STATE ),
      ( "server3", "DNS" ),
   ]
)

8. Rules spanning over multiple sites

A key feature of Check_MK BI is that its rules can span several Nagios instances (sites). This is possible because the aggregation is done by the GUI and is not integrated into the configuration of the monitoring core itself.

As long as all of your host names are unique within your monitoring environment there is nothing special you have to be aware of. Distributed aggregation rules simply work. It is possible, however, to make explicit use of the sites in order to create site-specific rules.

8.1. Explicit specification of a site

In some cases you might want to specify a site explicitly in a rule. This can be done by prefixing the host name with the site name and a hash mark (#):

aggregation_rules["sitetest1"] = (
  "Site Test 1", [  ], "worst", [
      ( "prod#xabc123", "Some Service" ),
  ]
)

Note: the name of the prefix is the key given in the sites-dictionary in multisite.mk (not the alias).
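
The site prefix notation is easy to take apart, as the following sketch shows. The helper split_site is hypothetical and only illustrates the notation; it assumes that host names themselves never contain a hash mark.

```python
# Sketch only: hypothetical helper that splits "site#host".
def split_site(hostspec):
    if "#" in hostspec:
        site, host = hostspec.split("#", 1)
        return site, host
    return None, hostspec    # no site given

print(split_site("prod#xabc123"))
print(split_site("zbghora35"))
```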

9. Iterating services/rules in aggregation rules (new in 1.1.13i1)

In previous versions it was only possible to create multiple aggregation instances by iterating over aggregation rules at the topmost level of the aggregations. This led to some overhead because one had to define explicit host/service names at lower levels.

Since 1.1.13i1 it is possible to iterate over hosts/services in aggregation elements on every level.

For example you can now create aggregations like: Add one sub-aggregation for each host which matches the tag webcluster to this aggregation. The definition of that aggregation would look like this:

aggregation_rules['web-cluster-aggr'] = (
    "All web cluster hosts",
    [],
    "worst",
    [
        (FOREACH_HOST, ['webcluster'], ALL_HOSTS, "host", ["$1$"] ),
    ]
)

This adds one instance of the aggregation host to the web-cluster-aggr aggregation for each host. The aggregation host is executed with the hostname as parameter.

Or another example: Add the service CPU load of each host with the tag oracle to the aggregation.

aggregation_rules['oracle-load'] = (
    "Oracle host CPU load",
    [],
    "worst",
    [
       (FOREACH_HOST, ['oracle'], ALL_HOSTS, "$HOST$", "CPU load" ),
    ]
)

All services starting with CPU load found on the hosts having the tag oracle are added to the aggregation.

10. One aggregation rule in several groups (new in 1.2.1i2)

In previous versions it was not possible to add a top-level aggregation rule to multiple aggregation groups. You had to declare this top-level aggregation once for each aggregation group. Each of those aggregations had to be calculated individually.

Since 1.2.1i2 it is possible to add a top-level aggregation to several groups by simply configuring a list of group names instead of a single group name. In this case the aggregation is calculated once and then added to all given aggregation groups.

This is the "old style" where the top-level rule had to be configured twice to be member of both aggregation groups:

aggregations = [
  ( "Hosts", "cpu_usage", ["zbghora35"] ),
  ( "Linux", "cpu_usage", ["zbghora35"] ),
]

This is the new style:

aggregations = [
  ( [ "Hosts", "Linux" ], "cpu_usage", ["zbghora35"] ),
]

With this configuration there is only one aggregation to be calculated instead of two. If you have a lot of aggregations in several groups, this should substantially reduce your total number of aggregations.
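
The saving can be made visible with a short comparison. The helpers group_names and groups_covered are hypothetical and only inspect the configuration entries; they say nothing about how Multisite handles groups internally.

```python
# Sketch only: compares the old and the new style of group assignment.
def group_names(entry):
    groups = entry[0]
    # A single group name and a list of names are both accepted
    return groups if isinstance(groups, list) else [groups]

def groups_covered(entries):
    return sorted({g for e in entries for g in group_names(e)})

old_style = [
    ("Hosts", "cpu_usage", ["zbghora35"]),
    ("Linux", "cpu_usage", ["zbghora35"]),
]
new_style = [
    (["Hosts", "Linux"], "cpu_usage", ["zbghora35"]),
]

print(len(old_style), groups_covered(old_style))
print(len(new_style), groups_covered(new_style))
```

Both variants cover the same two groups, but the new style needs only one entry.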

11. Single host aggregations (new in 1.2.1i2)

If you have a large monitoring environment with a lot of hosts and services, much time is spent looping through all these hosts and services during calculation of the aggregation structures. In many cases it is necessary to work with all of them since one aggregation can not know whether an aggregation in a lower level needs data from another host or service.

To tune this situation, 1.2.1i2 introduces a new variable named host_aggregations. It is handled like the aggregations variable, but with the assumption that all aggregations registered in this list only need information from one host. The rules registered in this list must depend only on information of this single host. Additionally, to make aggregations of clusters work, the data of the parent hosts is also available and can be used. Check_MK clusters have their node hosts configured as parents, so you can also add cluster aggregations to the host_aggregations list.
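
A configuration that makes use of this could look like the following sketch. It assumes that entries in host_aggregations have exactly the same format as those in aggregations; the first four assignments are placeholders so that the fragment also runs outside Multisite, where these names are already defined.

```python
# Placeholders for names that Multisite itself provides in multisite.mk
FOREACH_HOST = "FOREACH_HOST"
ALL_HOSTS = "ALL_HOSTS"
aggregations = []
host_aggregations = []

# Rules that only look at a single host (and its parents) go here ...
host_aggregations += [
  ( "Hosts", FOREACH_HOST, ALL_HOSTS, "host_ressources", ["$1$"] ),
]

# ... while rules combining several hosts stay in aggregations.
aggregations += [
  ( "Services", "dns", [] ),
]
```

The rule host_ressources from the earlier examples only refers to services of $HOST$, whereas the rule dns combines three servers and therefore does not qualify.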
