Check_MK Business Intelligence - Aggregation Rules
Required version: 1.1.11i1
April 21, 2011
How to configure BI aggregations

The configuration of Check_MK BI aggregations is rule based. That means that you do not have to specify every single service for every aggregation explicitly (as with NagiosBP, for example). Instead, you define rules which are then used to create a large number of aggregations automatically, based on the services and hosts that actually exist.

As an example, let's take the aggregation of the operating system state and hardware resources of a Linux server. You define once a set of rules describing how the states of hardware sensors, filesystems, performance values and other aspects should be aggregated. Then you use this rule set for several or all of the Linux hosts that are currently available. This approach has two advantages:
As always, there is also a disadvantage: writing (intelligent) rules is a bit more challenging than configuring everything explicitly. As always with Check_MK: "Learn more, work less!"

Where to put the configuration

Since Check_MK BI is integrated into Multisite, the rule configuration is done in multisite.mk. To make things more flexible, however, Multisite can now split its configuration into several files. This works much like main.mk and conf.d/. The directory for Multisite is multisite.d, and all files ending in .mk are read in lexicographical order, after multisite.mk. Please keep in mind that files for Multisite and Check_MK are not compatible and must not be mixed up! Please also note that, just as with conf.d, lists in multisite.d must be appended to with += if you do not want to lose values from previously read files.

In the following documentation, whenever I speak of multisite.mk, I silently assume that the definitions can also be put in files below multisite.d.

First simple example

As a first example we write an explicit rule that builds a simple aggregation of the two services CPU load and CPU utilization on the host zbghora35. Wait, haven't I said that explicit configuration is bad? Well, yes. But as a starting point for understanding rules it serves best. Configuration is done in two variables: aggregation_rules and aggregations. First comes our rule:

multisite.mk
aggregation_rules["cpu_usage"] = (
"CPU Usage", # description, title
[], # list of parameters (empty here)
"worst", # aggregation function
[ # list of elements, nodes
( "zbghora35", "CPU load" ),
( "zbghora35", "CPU utilization" ),
]
)
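As you can see from this example, the rule cpu_usage is a tuple with the following four parts:
1. A description, which is used as the title of the node
2. A list of parameters (empty here)
3. The aggregation function
4. A list of elements (nodes)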
Each element can either be a reference to another rule or a pair of host and service. Replace the service with the keyword HOST_STATE (without quotes) to refer to the Nagios host state of that host.

Defining a rule does not automatically create an aggregation object. For that purpose we must use (instantiate) the rule. Programmers might be inclined to speak of calling the rule, much like a function. This is done via the variable aggregations:

aggregations = [
( "Hosts", "cpu_usage", [] ), # call rule 'cpu_usage' with no parameters
]

Each entry in aggregations is a triple of:
1. The name of the aggregation group (here Hosts)
2. The name of the rule to call
3. The list of arguments for the rule
Now let's test this. The good news here is: No restart of Nagios or Apache is needed! All changes to the configuration take immediate effect. If you select the view All Aggregations from your Views snapin, you should see something like the following (provided you have a host named zbghora35).
If you get just one line with CPU Usage, then click on it and the two nodes below it will be displayed.

Rules using other rules

Our next step is to build a slightly more complex tree. This is done by rules that use other rules. Let's first create a rule of three services dealing with the network interface eth0 of that host:
aggregation_rules["nic_eth0"] = (
"NIC eth0", [], "worst", [
( "zbghora35", "NIC eth0 link" ),
( "zbghora35", "NIC eth0 counters" ),
( "zbghora35", "NIC eth0 parameter" ),
]
)
Nothing new so far. Now we combine the information of the CPU, the NIC eth0, the state of the Check_MK service and the host status (PING) into one new rule called host_ressources:
aggregation_rules["host_ressources"] = (
"Host ressources", [], "worst", [
( "cpu_usage", [] ), # call rule cpu_usage
( "nic_eth0", [] ), # call rule nic_eth0
( "zbghora35", "Check_MK$" ), # state of service Check_MK
( "zbghora35", HOST_STATE ), # host state of zbghora35
]
)
Please note that we appended a $ to Check_MK. This forces an exact match on the service name. Otherwise the rule would match all services beginning with Check_MK (e.g. also Check_MK Inventory). With that in mind we can simplify our rule nic_eth0:
aggregation_rules["nic_eth0"] = (
"NIC eth0", [], "worst", [
( "zbghora35", "NIC eth0" ), # Match all services beginning with NIC eth0
]
)
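The matching behaviour described above can be tried out in plain Python. This is only a sketch: it assumes that BI matches service patterns the way Python's re.match does (anchored at the beginning of the name), and the helper matching_services as well as the service list are made up for illustration.

```python
import re


def matching_services(pattern, services):
    """Return the services whose name matches pattern from the beginning.

    Hypothetical helper: re.match anchors at the start of the string,
    which is assumed to mirror how BI applies service patterns.
    """
    regex = re.compile(pattern)
    return [name for name in services if regex.match(name)]


# Made-up service list of one host
services = [
    "Check_MK",
    "Check_MK Inventory",
    "NIC eth0 link",
    "NIC eth0 counters",
    "NIC eth0 parameter",
    "NIC eth1 link",
]

check_mk_prefix = matching_services("Check_MK", services)   # prefix match
check_mk_exact = matching_services("Check_MK$", services)   # $ = end of name
nic_eth0 = matching_services("NIC eth0", services)          # all eth0 services

print(check_mk_prefix)
print(check_mk_exact)
print(nic_eth0)
```

Without the trailing $ the pattern Check_MK also picks up Check_MK Inventory, while NIC eth0 collects all three eth0 services and leaves eth1 alone.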
Using parameters

As you have surely noticed, in the examples above we simply hard coded the host name zbghora35 into our rules. This is not very practical. We would need a separate set of rules for each host. Not a nice idea, indeed. It is better to use a parameter for the host name. Let's go back to our very first example and call this parameter HOST. Here is a new and better version of our first rule:
aggregation_rules["cpu_usage"] = (
"CPU Usage", [ "HOST" ], "worst", [
( "$HOST$", "CPU load" ),
( "$HOST$", "CPU utilization" ),
]
)
When using that rule, we need to specify an argument:

aggregations = [
( "Hosts", "cpu_usage", ["zbghora35"] ),
]

Now it's easy to use that rule for several different hosts:

aggregations = [
( "Hosts", "cpu_usage", ["zbghora35"] ),
( "Hosts", "cpu_usage", ["zbghora36"] ),
( "Hosts", "cpu_usage", ["zbghora37"] ),
]

What happens if one of those hosts does not exist in the Nagios data? Nothing. The rule simply does not apply and no aggregation is created for that host. This is especially useful in multi-site setups when some sites are not reachable or deselected and thus some hosts are not visible in the GUI. BI will not complain. It will simply ignore those hosts. The same holds if services specified in a rule are missing. Those services are simply dropped from the rule for the affected hosts. This makes it easier to write generic rules for all hosts. Here is our complete example converted to the HOST parameter:

multisite.mk
aggregation_rules["cpu_usage"] = (
"CPU Usage", [ "HOST" ], "worst", [
( "$HOST$", "CPU load" ),
( "$HOST$", "CPU utilization" ),
]
)
aggregation_rules["nic_eth0"] = (
"NIC eth0", [ "HOST" ], "worst", [
( "$HOST$", "NIC eth0" ),
]
)
aggregation_rules["host_ressources"] = (
"Host ressources", [ "HOST" ], "worst", [
( "cpu_usage", ["$HOST$" ] ),
( "nic_eth0", [ "$HOST$" ] ),
( "$HOST$", "Check_MK$" ),
( "$HOST$", HOST_STATE ),
]
)
aggregations = [
( "Hosts", "host_ressources", [ "zbghora35" ] ),
( "Hosts", "host_ressources", [ "zbghora36" ] ),
( "Hosts", "host_ressources", [ "zbghora37" ] ),
]
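To illustrate what calling a rule with arguments amounts to, here is a minimal sketch of the placeholder substitution. It is not the Multisite implementation: the functions substitute and instantiate are hypothetical, and the sketch only assumes that every $NAME$ placeholder is replaced textually by the argument bound to the parameter NAME.

```python
def substitute(text, arguments):
    """Replace each $NAME$ placeholder in text with its argument value."""
    for name, value in arguments.items():
        text = text.replace("$%s$" % name, value)
    return text


def instantiate(rule, args):
    """Bind args to the rule's parameters and resolve all placeholders."""
    title, params, function, nodes = rule
    arguments = dict(zip(params, args))
    resolved = []
    for node in nodes:
        parts = []
        for part in node:
            if isinstance(part, str):
                parts.append(substitute(part, arguments))
            elif isinstance(part, list):
                # argument list of a nested rule call
                parts.append([substitute(p, arguments) for p in part])
            else:
                parts.append(part)  # e.g. a keyword constant such as HOST_STATE
        resolved.append(tuple(parts))
    return substitute(title, arguments), function, resolved


# A shortened version of the rule from the text (without HOST_STATE)
host_ressources = (
    "Host ressources", ["HOST"], "worst", [
        ("cpu_usage", ["$HOST$"]),
        ("$HOST$", "Check_MK$"),
    ]
)

result = instantiate(host_ressources, ["zbghora35"])
print(result)
```

Note that the trailing $ in Check_MK$ is left alone, because only complete $HOST$ placeholders are replaced.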
Automatic detection of (sensible) rules

Using the parameter HOST already makes the configuration a lot simpler by allowing us to reuse rules for several hosts. But we still had to define manually which objects we put into a rule.

Using regular expressions

One of the key features of Check_MK BI now comes into play: the live automatic detection of objects. Let's assume we want to add a topic Filesystems to our host rules that contains all filesystems of that host (i.e. all services beginning with fs_). This can easily be done with:

multisite.mk
aggregation_rules["filesystems"] = (
"Filesystems", [ "HOST" ], "worst", [
( "$HOST$", "fs_" ), # match all services beginning with fs_
]
)
This will automatically add all services beginning with fs_ to the rule. To be exact, a regular expression match is done against the beginning of the service name. The following entry would thus match all services containing the text WP1:
( "$HOST$", ".*WP1" ),
By using brackets and vertical bars you can put alternatives into one line:
( "$HOST$", "CPU (load|util)" ),
For those not familiar with regular expressions a short summary of the most important special characters:
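Calling a rule for all or some hosts

Our next step is to call a rule automatically for all hosts, or only for hosts having certain host tags. The following example will call the rule host_ressources once for each host:

aggregations += [
( "Hosts", FOREACH_HOST, ALL_HOSTS, "host_ressources", ["$1$"] ),
]

Please look carefully at how that line is built out of five elements:
1. The name of the aggregation group (here Hosts)
2. The keyword FOREACH_HOST
3. The keyword ALL_HOSTS, which selects the hosts to iterate over
4. The name of the rule to call
5. The list of arguments for the rule, where $1$ is replaced by the name of the current host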
If you want to make use of host tags, put them as a list of strings in front of the keyword ALL_HOSTS. The following example will call the rule only for hosts having the tag lnx:

aggregations += [
( "Hosts", FOREACH_HOST, ["lnx"], ALL_HOSTS, "host_ressources", ["$1$"] ),
]

Calling a rule for each host having a certain service

We can build even more intelligent rules by making BI watch out for certain services. Let's assume you want to call the rule above for all hosts having the service Check_MK. This can be done with the keyword FOREACH_SERVICE. After the keyword ALL_HOSTS you put a regular expression for the service(s) to look for:

aggregations += [
( "Hosts", FOREACH_SERVICE, ALL_HOSTS, "Check_MK$", "host_ressources", ["$1$"] ),
]

Please note that the $ again matches the end of the service name and thus avoids matching services like Check_MK Version as well.

Extracting data from the service name

Let's now assume that we do not want to build rules for hosts but for applications. Database instances are a good example. The Check_MK checks for ORACLE, for instance, always have a service ORA ... Sessions for each ORACLE instance running on that host. We now want to use that service to automatically create an aggregation object for each ORACLE instance found on any host. We might have created the following aggregation rule for that task:
aggregation_rules["db"] = (
"$DB$",
[ "HOST", "DB" ], # take two parameters
"worst",
[
( "$HOST$", "ORA $DB$ (Sessions|Logswitches)" ),
( "oracle_log", [ "$HOST$", "$DB$" ] ), # Logfiles
( "oracle_tbs", [ "$HOST$", "$DB$" ] ), # Tablespaces
( "db_host_state", [ "$HOST$" ]), # general state of host
]
)
This rule needs two parameters: the name of the host and the name of the database instance. The following code does the trick of calling that rule exactly once for each database known in your monitoring system. This time we've put each element on its own line:
aggregations += [
( "ORACLE", # name of the aggregation group
FOREACH_SERVICE, # iterate over found services
ALL_HOSTS, # on all hosts
"ORA (.*) Logswitches", # service pattern with group in brackets
"db", # name of rule to call
["$1$", "$2$" ] # arguments: first = HOST, second = DB
),
]
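How $1$ and $2$ get their values can be sketched in a few lines of Python. This is an illustration only, assuming that the service pattern is applied like Python's re.match and that each group in brackets is handed on as a further argument after the host name. The function foreach_service, the host names and the instance names are made up.

```python
import re


def foreach_service(pattern, services):
    """Return one argument list per matching service: [host, group1, ...].

    Hypothetical model of FOREACH_SERVICE: the first argument ($1$) is the
    host name, the following ones ($2$, ...) are the regex groups.
    """
    regex = re.compile(pattern)
    calls = []
    for host, service in services:
        match = regex.match(service)
        if match:
            calls.append([host] + list(match.groups()))
    return calls


# Made-up monitoring data: (host, service description)
services = [
    ("dbsrv1", "ORA PROD Logswitches"),
    ("dbsrv1", "ORA PROD Sessions"),
    ("dbsrv1", "ORA TEST Logswitches"),
    ("dbsrv2", "ORA CRM Logswitches"),
    ("dbsrv2", "CPU load"),
]

calls = foreach_service("ORA (.*) Logswitches", services)
for arguments in calls:
    print(arguments)
```

Each resulting list corresponds to one call of the rule db, with the host name as first and the instance name as second argument. Services that do not match the pattern, such as ORA PROD Sessions, do not trigger a call.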
The brackets in (.*) are important! They make sure that $2$ is assigned the exact text that matched the .* in the brackets, in our case the name of the database instance! You might already see the great advantage of this approach: your BI configuration will always show all databases in your system, without any reconfiguration. As soon as a new database appears in your monitoring data, it will immediately show up in your BI aggregations.

Aggregation functions

So far we have always assumed that the worst state in a list of nodes decides the total state. This is not always what we want. You might aggregate a pair of redundant network interfaces into one node and wish to use the best state. How the state is computed exactly is decided by the aggregation function. Currently there are three functions available:
If you use best and at least one node is OK, your total state will be OK as well. The following rule combines the states of all NICs into a new rule, assuming that they are redundant (it calls the rule nic for all NICs matching $NIC$):
aggregation_rules["nic_redundant"] = (
"Redundant NICs", [ "HOST", "NIC" ], "best", [
( "nic", [ "$HOST$", "$NIC$" ] ),
]
)
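The difference between worst and best can be made concrete with a small sketch. It is not the code used by Multisite, and it covers only the unparameterized case. The severity order OK < WARN < UNKNOWN < CRIT used here is an assumption made for this example.

```python
# Assumed severity order for this sketch: OK < WARN < UNKNOWN < CRIT
SEVERITY = {"OK": 0, "WARN": 1, "UNKNOWN": 2, "CRIT": 3}


def worst(states):
    """Total state is the most severe state among the nodes."""
    return max(states, key=lambda state: SEVERITY[state])


def best(states):
    """Total state is the least severe state among the nodes."""
    return min(states, key=lambda state: SEVERITY[state])


# Two redundant NICs: one is broken, the other one is fine
nic_states = ["CRIT", "OK"]

print(worst(nic_states))  # the broken NIC decides
print(best(nic_states))   # the healthy NIC decides
```

For a redundant pair, best keeps the node OK as long as one interface works, while worst would already report the failure of a single interface.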
Parameters

Aggregation functions can be parameterized. The functions worst and best both take two optional integer arguments. Arguments are separated, Nagios-like, with an exclamation mark right within the string. Those two arguments are:
Let's assume we have five subnodes with the states CRIT, CRIT, UNKNOWN, WARN and OK. Then here are a few examples of how different aggregations work:
Custom aggregation functions

In the future you will be able to code your own aggregation functions in Python directly in multisite.mk. In fact you could do this right now. It's just that the API is not yet finalized and there is no documentation yet.

Rules spanning multiple sites

An important key feature of Check_MK BI is that its rules can span several Nagios instances (sites). This is possible because the aggregation is done by the GUI and is not integrated into the configuration of the monitoring core itself. As long as all of your host names are unique within your monitoring environment, there is nothing special you have to be aware of. Distributed aggregation rules simply work. It is possible, however, to make explicit use of the sites in order to create site-specific rules.

Explicit specification of a site

In some cases you might want to specify a site explicitly in a rule. This can be done by prefixing the host name with the site name and a hash mark (#):
aggregation_rules["sitetest1"] = (
"Site Test 1", [ ], "worst", [
( "prod#xabc123", "Some Service" ),
]
)
Note: the name of the prefix is the key given in the sites dictionary in multisite.mk (not the alias).

Iterating services/rules in aggregation rules (1.1.13i1)

In previous versions it was only possible to create multiple aggregation instances by iterating over aggregation rules at the topmost level of the aggregations. This led to some overhead because one had to define explicit host/service names at lower levels. Since 1.1.13i1 it is possible to iterate over hosts/services in aggregation elements on every level. For example, you can now create aggregations like:

Add one sub-aggregation for each host which matches the tag webcluster to this aggregation.

The definition of that aggregation would look like this:
aggregation_rules['web-cluster-aggr'] = (
"All web cluster hosts",
[],
"worst",
[
(FOREACH_HOST, ['webcluster'], ALL_HOSTS, "host", ["$1$"] ),
]
)
This adds one instance of the aggregation host to the web-cluster-aggr aggregation for each host. The aggregation host is executed with the host name as parameter. Or another example:

Add the service CPU load of each host with the tag oracle to the aggregation:
aggregation_rules['oracle-load'] = (
"Oracle host CPU load",
[],
"worst",
[
(FOREACH_HOST, ['oracle'], ALL_HOSTS, "$HOST$", "CPU load" ),
]
)
All services starting with CPU load found on the hosts having the tag oracle are added to the aggregation.
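The host selection done by FOREACH_HOST can be pictured with the following sketch. It is a simplified model, not the Multisite code: the function foreach_host, the host names and the tag assignments are invented, and the sketch assumes that a host is selected only if it carries all of the listed tags.

```python
def foreach_host(required_tags, hosts):
    """Return one argument list [hostname] per host carrying all tags.

    Hypothetical model of FOREACH_HOST; hosts maps host names to tag lists.
    An empty tag list selects every host.
    """
    wanted = set(required_tags)
    return [
        [name]
        for name, tags in sorted(hosts.items())
        if wanted <= set(tags)
    ]


# Made-up hosts and their host tags
hosts = {
    "web01": ["lnx", "webcluster"],
    "web02": ["lnx", "webcluster"],
    "ora01": ["lnx", "oracle"],
}

web_calls = foreach_host(["webcluster"], hosts)
oracle_calls = foreach_host(["oracle"], hosts)
all_calls = foreach_host([], hosts)

print(web_calls)
print(oracle_calls)
print(all_calls)
```

Each inner list stands for one instantiation of the element, with the host name taking the place of $1$.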