1. BI - An example
This article demonstrates Check_MK BI (Business Intelligence) using a real-world example. All rules and aggregations are created using the WATO GUI. You can also find this example on our Demo-Server.
Let's assume you are running a webshop. Before setting it up in your monitoring system, you have drawn the following draft of its architecture:
There are some redundant components. So if one of these fails, the webshop is still available to the customer.
And there are two points of view:
- The admin point of view: The admin needs to know about every single failed component, because he needs to get it working again without the customer taking notice of the failure.
- The application point of view: The hotline staff who take the customers' calls are only interested in one question: Is the webshop (as a whole application) available to the customer at the moment or not?
You will probably take this point of view for your SLA reporting as well.
2. Starting with hard-coded aggregation rules
We start off with checks already configured for all the components of our webshop, as you can see here:
We first define the aggregation rule for the loadbalancers. We have two of them in an active/standby setup, so for the whole application it is sufficient if one of them is up and running.
We go to WATO and click "BI - Business Intelligence" and then "New Rule". We give it the ID "LB" and the title "Loadbalancers". As aggregation function we select "Best", because if one loadbalancer is dead (CRITICAL) and the other is OK, we want the whole aggregation to report OK.
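The effect of "Best" can be sketched in a few lines of Python. This is only an illustrative model, not Check_MK's own code: the numeric states follow the usual Nagios convention, and UNKNOWN is left out to keep the ordering simple.

```python
# Illustrative model only; this is not Check_MK code.
OK, WARN, CRIT = 0, 1, 2  # assumed severity order, UNKNOWN left out


def best(states):
    """Return the least severe state among the child nodes."""
    return min(states)


# One loadbalancer is dead, the other one is fine.
print(best([CRIT, OK]))  # prints 0, which stands for OK
```

Only when both children are CRITICAL does the rule itself report CRITICAL.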
Now we can add the children by clicking "Add child node generator" twice (for the two loadbalancers). We select "State of a service" in the two drop-down menus and type both host names and the service names into the text fields. Finally we click on "Create":
Next let us jump up one level and build the rule for the application (the highest level in our dependency tree). For the application to be available we need the internet connection to work AND the loadbalancer cluster to work AND the database to be available. (We leave the Apaches and JBosses out for the moment.) So our aggregation function is "Worst" here, because even if only the internet connection goes to CRITICAL, the whole webshop is unavailable and needs to get the status CRITICAL.
OK, "New Rule" again, ID "WebShop", title "WebShop", aggregation function "Worst". Click "Add child node generator" three times. For the first child we select "State of a service" again and enter "router" for the hostname and "Internet Connection" for the service.
For the second child we select "Call a Rule" and select the rule we created before: "LB - Loadbalancers". This means that the state reported by the LB rule is used here. For example, if both loadbalancers fail, we get CRITICAL. We can leave the arguments field blank since we do not use arguments at the moment. Arguments will be explained later.
For the third child we select "State of a service". We want to use it for the database, so we enter "mysqlserver" and "MySQL Daemon Sessions", for example. Click "Create" to finish this rule.
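To make the structure of the two rules easier to see, here is a minimal Python sketch of the resulting tree. It is a model for illustration and not how Check_MK stores or evaluates rules. The dictionary keys are made-up labels, and the loadbalancers are keyed by host name only, because this article does not name their service.

```python
# Illustrative model of the two rules; this is not Check_MK code.
OK, WARN, CRIT = 0, 1, 2  # assumed severity order, UNKNOWN left out


def best(states):
    return min(states)


def worst(states):
    return max(states)


def lb_rule(services):
    # Rule "LB": one working loadbalancer is enough.
    return best([services["loadbalancer1"], services["loadbalancer2"]])


def webshop_rule(services):
    # Rule "WebShop": every child has to be OK.
    return worst([
        services["router: Internet Connection"],
        lb_rule(services),  # corresponds to "Call a Rule"
        services["mysqlserver: MySQL Daemon Sessions"],
    ])


services = {
    "router: Internet Connection": OK,
    "loadbalancer1": OK,
    "loadbalancer2": OK,
    "mysqlserver: MySQL Daemon Sessions": CRIT,
}
print(webshop_rule(services))  # prints 2, which stands for CRITICAL
```

Calling `lb_rule` from inside `webshop_rule` mirrors what "Call a Rule" does: the parent only sees the aggregated state of the child rule.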
3. Define an Aggregation
A lot of work, two rules, but still nothing to see in Check_MK's Business Intelligence views. We need to change this now. The items displayed in the BI views are called "Aggregations", so let's define one: in WATO we navigate to "BI - Business Intelligence" and click "New Aggregation". In the field "Aggregation Groups" the name of the aggregation is assigned. We choose "WebShop", again without arguments. The aggregation should display the result of the WebShop rule, so we select "Call a Rule" and select the "WebShop" rule. Do not forget to click "Create".
4. Watch the first result and test it
Now, for the first time, we see the result of our work: In Check_MK under "Business Intelligence" in the "All Aggregations" view we see our WebShop and can fold up the subtrees.
To check whether the aggregation works we can use the simulation mode of BI, which allows us to simulate a different state by clicking on the little gear-wheel icon near a service. Click the icon near the "Internet Connection" twice to set it to CRITICAL and afterwards click on the headline "All Aggregations" to refresh the view. As a result of our simulation, the state of our WebShop changes to CRITICAL. This is what we expect: if the Internet Connection is down, our WebShop is unavailable to our customers.
Next we want to test the redundant loadbalancers. We put the router back from simulated "CRITICAL" to its current state by clicking its icon three times. Then we set loadbalancer1 to CRITICAL and refresh the view: The WebShop recovers to OK, just as we intended. If only one of the redundant components fails, it has no impact on our customer.
Finally we set the second loadbalancer to CRITICAL too, and the status of our WebShop turns red. That is also what we wanted: if both loadbalancers fail, our WebShop application is not available. Our tests have now finished successfully.
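The three simulation steps above can also be written down as a small table of expected results. The following Python sketch models the WebShop rule as a single function and replays the three cases. The function and case names are invented for this illustration and have nothing to do with Check_MK's simulation mode itself.

```python
# Illustrative replay of the three simulation steps; not Check_MK code.
OK, CRIT = 0, 2
STATE_NAMES = {OK: "OK", CRIT: "CRITICAL"}


def webshop(router, lb1, lb2, database):
    # "Worst" over router, loadbalancer cluster ("Best") and database.
    return max(router, min(lb1, lb2), database)


cases = {
    "internet connection down": (CRIT, OK, OK, OK),
    "loadbalancer1 down": (OK, CRIT, OK, OK),
    "both loadbalancers down": (OK, CRIT, CRIT, OK),
}

results = {name: STATE_NAMES[webshop(*states)] for name, states in cases.items()}
for name, state in results.items():
    print(f"{name}: WebShop is {state}")
```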
5. Adding webservers and application servers
As the next step we want to add the webservers and application servers. Here we need to pay attention to one important aspect: webserver1 and appserver1, as well as webserver2 and appserver2, are directly linked to the loadbalancers and to each other, with no redundant cross link in between. Therefore, if webserver1 and appserver2 fail at the same time, the WebShop is unavailable.
For the application to be OK we thus require that either both webserver1 and appserver1, or both webserver2 and appserver2, are OK. Back in WATO under "BI - Business Intelligence" we implement this requirement with two new rules, one "New Rule" for each line, filled in as shown here:
... and similarly for "webshopline2" with "webserver2" and "appserver2".
Now we have a rule for each single line. The next rule (one level up in the tree) states that we need at least one of the lines working for our WebShop to work:
We add the result of this rule to our top-level rule "WebShop": click the pen icon near the "WebShop" rule and add it as another child. We can push it up in the list with the blue arrow icon.
After saving we can view our aggregation, which now has one additional level in the tree, and again we can test it.
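Why the cross failure is fatal becomes obvious when the two levels are modelled in code. The sketch below assumes that each line rule uses "Worst" and that the rule combining the lines uses "Best", which is what the requirements above imply. It is a simplified illustration keyed by host name only, not Check_MK code.

```python
# Illustrative model of the line rules; not Check_MK code.
OK, CRIT = 0, 2


def line(webserver, appserver):
    # A single line needs both of its servers: "Worst".
    return max(webserver, appserver)


def lines(hosts):
    # At least one complete line has to work: "Best".
    return min(
        line(hosts["webserver1"], hosts["appserver1"]),
        line(hosts["webserver2"], hosts["appserver2"]),
    )


same_line = {"webserver1": CRIT, "appserver1": CRIT,
             "webserver2": OK, "appserver2": OK}
cross = {"webserver1": CRIT, "appserver1": OK,
         "webserver2": OK, "appserver2": CRIT}

print(lines(same_line))  # prints 0: line 2 is still complete
print(lines(cross))      # prints 2: no complete line is left
```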
6. Unifying things: Using Arguments
OK, now we have all the components and dependencies of our WebShop in our monitoring. This is a good time to look back and briefly think it over. We then notice one point that conflicts with a basic principle of computer science: we have done one thing twice, namely the rule describing a single line of our WebShop. We have it once for line 1 and once again for line 2. The only difference is the number used. Maybe the WebShop will grow in the future and need more lines. We do not want to create the same rule over and over again. So let's rewrite this part.
We create a new rule with ID "webshop_single_line". At first glance it looks very much like the old ones:
But you may have already noticed that it takes a parameter, which we call "LINENUM". We will fill this with 1 or 2 later. And we use this parameter (surrounded by $ signs) in both hostname fields.
The next step is to use this one rule in "webshoplines" instead of the two old ones. We edit the "webshoplines" rule and make it look like this:
We now call the same rule twice, passing 1 as the parameter in the first call and 2 in the second. The two old rules "webshopline1" and "webshopline2" can now be deleted by clicking the blue trash bin icon near each of them. If we look at our aggregation in Check_MK, it should still look very much the same as before. But when folded up, both lines now have the same title, and you need to look into them to see which line is meant. Not so nice! So let's once again edit our rule "webshop_single_line". We just change the "Rule Title" to "WebShop Line $LINENUM$". Yes, you can use the parameter in the title too! After saving, our aggregation looks exactly like before.
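The idea behind rule arguments is plain text substitution: every $NAME$ placeholder is replaced by the value passed in the call. Here is a minimal Python sketch of that idea. The dictionary layout and the helper names `expand` and `instantiate` are hypothetical and only serve the illustration; Check_MK's internal representation may look quite different.

```python
# Illustrative model of rule arguments; not Check_MK code.
def expand(template, args):
    """Replace each $NAME$ placeholder with the given argument value."""
    for name, value in args.items():
        template = template.replace(f"${name}$", str(value))
    return template


# Hypothetical representation of the rule "webshop_single_line".
single_line = {
    "title": "WebShop Line $LINENUM$",
    "hosts": ["webserver$LINENUM$", "appserver$LINENUM$"],
}


def instantiate(rule, **args):
    return {
        "title": expand(rule["title"], args),
        "hosts": [expand(host, args) for host in rule["hosts"]],
    }


# The rule "webshoplines" calls the same rule twice.
for number in (1, 2):
    print(instantiate(single_line, LINENUM=number))
```

Adding a third line later only requires a third call with LINENUM set to 3.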
7. N+1 redundancy
As time goes by, our WebShop gets more and more customers. At some point one webserver and one application server are no longer sufficient for the load. We need at least two lines online to get the work done. We decide on the concept of N+1 redundancy. That means we always have the number of servers needed to get the work done plus one. In other words, one server can fail at a time. If a second one fails before the first has come up again, we are unable to serve our customers. But that is a risk we are willing to accept.
So we add webserver3 and appserver3 for the moment. But which aggregation function should we now use in the webshoplines rule? We can no longer use "Best", because it would report OK as long as a single line is working. Instead we need to use the second worst state, now and in the future.
If one line goes to CRITICAL and all others stay OK, the second worst state is OK. If two (or more) go to CRITICAL, the second worst state is CRITICAL. In both cases this is what we want. Go and test it in Check_MK's BI views.
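Taking the second worst state amounts to sorting the child states by severity and picking the second entry. A small Python sketch of this, again as a simplified model with an assumed severity order and a helper name (`nth_worst`) made up for this example:

```python
# Illustrative model of "second worst"; not Check_MK code.
OK, WARN, CRIT = 0, 1, 2  # assumed severity order, UNKNOWN left out


def nth_worst(states, n):
    """Return the n-th worst state (n=1 is the plain worst state)."""
    ordered = sorted(states, reverse=True)
    # With fewer than n children, fall back to the least severe one.
    return ordered[min(n, len(ordered)) - 1]


# States of the three lines of our WebShop.
print(nth_worst([CRIT, OK, OK], 2))    # prints 0: one line may fail
print(nth_worst([CRIT, CRIT, OK], 2))  # prints 2: two failures are too many
```

Because the function only looks at the position in the sorted list, it keeps working unchanged when a fourth or fifth line is added.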
8. Try it out
You might now want to log into our Demo-Server and try it out yourself. The example shown in this article is available there too.