News from check_multi

check_multi stable 0.26 released

It's high time for the next release, bringing some new features and plenty of bug fixes.

Features
  • attribute [ <TAG>::<variable> ] = <expression>

    This feature is more powerful than you might expect. To some extent it works like check_generic does.

    It can set child specific variables, such as process_perfdata, timeout or displayed.
    But you can also set attributes like critical, warning, and unknown, which is pretty much the same as specifying state [ critical ] = <expression>, just on the child check level. It is evaluated directly after a child check has been executed and affects the state of that particular child check.
    Let me give an example:
    command   [ tmp_dir_permissions           ] = ls -ld /tmp
    attribute [ tmp_dir_permissions::critical ] = $tmp_dir_permissions$ !~ /^drwxrwxrwt/

    The output is:

    $ check_multi -f test_tmp_dir.cmd -r 1+4+64
    CRITICAL - 1 plugins checked, 1 critical (tmp_dir_permissions)
    [ 1] tmp_dir_permissions CRITICAL drwxr-xr-x 10 root root 4096 Nov 17 15:29 /tmp [CRITICAL: output matched rule '"$tmp_dir_permissions$" !~ /^drwxrwxrwt/']


  • attribute [ <variable> ] = <value>
    This is an option if you want to specify everything in your cmd files instead of using command line parameters like '-s <variable>=<value>'. Any -s option can be expressed with this attribute statement.
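    A hedged sketch of how this might look in a cmd file (the variable names are made up for illustration); the two attribute lines would replace '-s HOSTNAME=myhost -s timeout=30' on the command line:

    attribute [ HOSTNAME ] = myhost
    attribute [ timeout  ] = 30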
Other changes and fixes
  • Macros
    • FIX: macro handling for livestatus command
    • FIX: macro handling: complain about non-existing macros and remove them
  • Output
    • print matching state rule in long plugin output
    • provide error message if no HOSTNAME is specified for passive service definitions
  • Tags
    • Note: spaces in tags and macros are not allowed any more to prevent ambiguities
    • numerical tags are not allowed
  • Performance data
    • added perfdata for statusdat command
    • FIX: perfdata column for livestatus command
    • FIX: hostname treatment for PNP urls
    • FIX: cumulate perfdata handling
    • better perfdata error output for invalid UOM
  • XML output
    • inventory parameter: reduced XML for feed_passive (no execution, only output)
    • FIX: encoding of XML state elements
  • Miscellaneous
    • FIX: handling of non existing child state rules
    • child timeout increased to 11s to avoid collision with the 10s default
    • FIX: trimming blanks at line end with continuation lines
    • statusdat checks now only consider hard states
    • feed_passive with perfdata [ plugin ] at the end
    • contrib: notify_html_service.sh uses mime multipart elements
    • FIX: verbose mode


Have fun with check_multi,
-Matthias

Transports, buffers and multiline

The incredibly simple plugin interface is one key to the success of Nagios. A return code of 0, 1, 2 or 3 and some explanatory words, that's all.

Everybody was able to understand these two elements and started to write plugins, in every language one can imagine and for every OS that Nagios runs on. Nagios gained a famous reputation ('Yes we can plugin!'), and the only limitation was the skill of the plugin programmer.

As a side note: some plugins are actually not of the best quality, as we can see in the exchange repositories. But they cover the whole range of monitoring.


Plugin output limited

Let's talk about a small aspect of the plugin interface which is annoying and frustrates Nagios beginners in particular: the limited length of plugin output. It sounds pretty simple, but the devil is in the details.

If you take the output of the standard plugin check_disk, the length of output should not be a problem:

DISK OK - free space: / 947 MB (8% inode=74%);| /=10533MB;10884;9675;0;12094

But in the meantime there are plugins like check_multi, which transport much more data:

OK - 3 plugins checked, 3 ok
[ 1] disk DISK OK - free space: / 947 MB (8% inode=74%);
[ 2] load OK - load average: 1.43, 1.23, 1.52
[ 3] swap SWAP OK - 95% free (1935 MB out of 2048 MB) |check_multi::check_multi::plugins=3 time=0.046274 disk::check_disk::/=10533MB;11489;11852;0;12094 load::check_load::load1=1.430;5.000;10.000;0; load5=1.230;4.000;8.000;0; load15=1.520;3.000;6.000;0; swap::check_swap::swap=1935MB;0;0;0;2048

428 bytes instead of 83 bytes: If you still have Nagios 2 running, this would have blown the maximum length of plugin output.


Plugin buffer overflow? Nagios does not care

The plugin interface is simple, but that also means that Nagios does not care about the length of plugin output. If it exceeds the internal buffer length, nobody is informed and often nobody notices it. The content is simply cut off.

Bad news for the performance data, which is appended to the output: if the buffer is not long enough to hold the whole output, the performance data is missing or, even worse, corrupted. And no warning lamp alerts the monitoring admin.
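The effect can be reproduced with standard tools; here the buffer is artificially assumed to be 46 bytes to show what truncation does to the check_disk line from above:

```shell
# The check_disk output from above, cut off at an assumed 46-byte buffer
# (the real cap is MAX_PLUGIN_OUTPUT_LENGTH, see the table below).
output='DISK OK - free space: / 947 MB (8% inode=74%);| /=10533MB;10884;9675;0;12094'

truncated=$(printf '%s' "$output" | head -c 46)
printf '%s\n' "$truncated"

# The '|' separator and everything after it - the perfdata - are gone,
# and nothing warned us about it.
```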

Nagios 2 allowed 332 bytes of plugin output, but for Nagios 3 this was increased drastically. Have a look at how the maximum plugin output length evolved over the Nagios timeline. On the Nagios side it's the constant named MAX_PLUGIN_OUTPUT_LENGTH:

Maximum plugin output (bytes)   Nagios version   Include file
352                             1.0              common/objects.h
348                             2-0              include/objects.h
332                             2-1              include/objects.h
4096                            3-0a             include/nagios.h
8192                            3-0              include/nagios.h

Don't think that the 8K bytes are sufficient in all cases - check_multi's HTML mode can easily consume dozens of kilobytes.
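If you are unsure whether a given check's output fits, you can measure it before the core silently truncates it. A minimal sketch (the hardcoded string stands in for a real plugin call):

```shell
# Default cap in Nagios 3 (see the table above); adjust if you recompiled.
max_plugin_output_length=8192

# Stand-in for a real plugin invocation, e.g.: output=$(check_multi -f my.cmd)
output='OK - 3 plugins checked, 3 ok'

bytes=$(printf '%s' "$output" | wc -c)
if [ "$bytes" -gt "$max_plugin_output_length" ]; then
    echo "output is $bytes bytes: the core will cut it at $max_plugin_output_length"
else
    echo "output is $bytes bytes: fits into the buffer"
fi
```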


Increasing MAX_PLUGIN_OUTPUT_LENGTH - and some more

In principle the idea of increasing the constant MAX_PLUGIN_OUTPUT_LENGTH is correct: increase it, recompile and restart Nagios, done. Ethan himself gives a hint in nagios.h to also increase MAX_EXTERNAL_COMMAND_LENGTH for passive checks:

NOTE: Plugin length is artificially capped at 8k to prevent runaway plugins from returning MBs/GBs of data
back to Nagios.  If you increase the 8k cap by modifying this value, make sure you also increase the value
of MAX_EXTERNAL_COMMAND_LENGTH in common.h to allow for passive checks results received through the external
command file. EG 10/19/07

One remark on the buffer size: generally it's a good idea to restrict it. But increasing it to 32K or 64K should not be a problem for modern servers in the gigabit world, even if there are runaway plugins.
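The recompile route can be scripted. A sketch under the assumption of a Nagios 3 source tree; the paths and the 64K value are illustrative, not canon:

```shell
# Bump a #define in a header file before compiling.
bump_cap() {
    # $1 = header file, $2 = constant name, $3 = new value
    sed -i.bak "s/^#define[[:space:]]*$2[[:space:]].*/#define $2 $3/" "$1"
}

# In a real Nagios 3 source tree this would be roughly:
#   bump_cap include/nagios.h MAX_PLUGIN_OUTPUT_LENGTH    65536
#   bump_cap common/common.h  MAX_EXTERNAL_COMMAND_LENGTH 65536
# followed by ./configure && make all && make install and a Nagios restart.

# Demonstrate on a throwaway copy of the relevant line:
printf '#define MAX_PLUGIN_OUTPUT_LENGTH 8192\n' > /tmp/nagios_h_demo
bump_cap /tmp/nagios_h_demo MAX_PLUGIN_OUTPUT_LENGTH 65536
cat /tmp/nagios_h_demo
```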

Transports and oddities

Enlarging Nagios buffers is not all, since many plugins run on remote machines and their output has to be transferred to Nagios. This is where several transports enter the stage:

  • NRPE - Nagios remote plugin executor
  • NSCA - Nagios service check acceptor
  • SSH - check_by_ssh

There are more, but these are the most important in the Nagios world. Let's take a look at how they behave with large plugin output.

We will begin with the SSH transport, since it's not Nagios-specific and, in terms of transportation, the simplest. I know that some people will not agree, but here are my 2 cents: if you manage public key authentication properly, SSH is a simple, safe and robust transport. And whether you transfer 10K or 100K, who cares…

NRPE is a bit more tricky, and this comes from its internal implementation. In the original version it is a one-buffer transport and will fail if you don't adjust the small buffer sizes in common.h:

#define MAX_INPUT_BUFFER           2048    /* max size of most buffers we use */
[...]
#define MAX_PACKETBUFFER_LENGTH    1024    /* max amount of data we'll send in one query/response */

Ton Voon has provided an improvement which breaks this limitation. The best thing about Ton's patch is that you don't have to upgrade all machines at once; you can do it step by step, which is helpful especially for large installations.
Note: if you are running NRPE on Linux machines with a kernel before release 2.6.11, you will only be able to transport one buffer. This is an effect of the old single-buffer PIPE implementation. In 2.6.11 Linus Torvalds himself implemented a ring buffer which allows circular pipes. With the default kernel PIPE size of 4K and 16 buffers, NRPE can now transport 64K. So if you still see cut-off NRPE data, watch out for kernels 2.6.10 and below.
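Whether a given box is affected can be checked against its kernel release; a small sketch comparing `uname -r` with 2.6.11:

```shell
# Returns success (0) if the kernel release in $1 is 2.6.11 or newer,
# i.e. has the circular pipe implementation.
kernel_has_circular_pipes() {
    IFS=. read -r maj min patch <<EOF
$1
EOF
    patch=${patch%%[!0-9]*}   # strip suffixes like "-4-amd64"
    [ "$maj" -gt 2 ] && return 0
    [ "$maj" -eq 2 ] && [ "${min:-0}" -gt 6 ] && return 0
    [ "$maj" -eq 2 ] && [ "${min:-0}" -eq 6 ] && [ "${patch:-0}" -ge 11 ] && return 0
    return 1
}

if kernel_has_circular_pipes "$(uname -r)"; then
    echo "circular pipes available: NRPE can transport up to 64K"
else
    echo "pre-2.6.11 kernel: NRPE output is limited to a single 4K PIPE buffer"
fi
```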

NSCA is the nasty end, and if you ask me, it needs a reimplementation. There are several implementation itches which no longer fit into the current Nagios world:

  1. NSCA does not scale very well: it passes all messages to the Nagios CMD interface, which is well known for its traffic jams in large installations. And NSCA is often used precisely in such large installations to circumvent the Nagios scheduling bottleneck.
    There are numerous enhancements on both NSCA's sender and receiver side, but IMHO the only well-performing approach to insert checks into Nagios is via the checkresults interface.
  2. NSCA does not allow multiline: it reads the input up to the first newline and that's it.
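This is easy to demonstrate; `head -n 1` below mimics NSCA's read-to-first-newline behaviour on a shortened check_multi result:

```shell
# A multiline check_multi result (shortened from the example above).
multiline_result='OK - 3 plugins checked, 3 ok
[ 1] disk DISK OK - free space: / 947 MB (8% inode=74%);
[ 2] load OK - load average: 1.43, 1.23, 1.52'

# NSCA reads up to the first newline - the child check details are lost.
what_nsca_sees=$(printf '%s\n' "$multiline_result" | head -n 1)
printf '%s\n' "$what_nsca_sees"
```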


Recommendations for check_multi?
After our small walk through the puzzling world of Nagios transports, the conclusion for the use of check_multi is pretty clear: NRPE and SSH will work well, while NSCA is the black sheep of the family.
But this does not need to be a real disadvantage: in a check_multi driven Nagios infrastructure you don't need that many passive services with NSCA, because you can use active check_multi services instead.
In the end this means no more need for freshness checks and no more need for sophisticated distributed setups.

2009/11/27 09:14 · Matthias Flacke

check_multi stable 0.20 released

I confess, this release should have been launched much earlier. I know the OSS mantra: release early, release often. But there were so many enhancements, redesigns and fixes in a row that I simply failed to promote the trunk version into a new stable release ;-).

You can download it like always from here.

So these are the new features:

  • statusdat [ TAG ] = host:service
    gather states and output from existing service checks and integrate them seamlessly into your own checks. This is a good means of building Business Process Views using check_multi without the need to re-execute existing service checks.

  • One more for this new statusdat function:
    When you specify wildcards for hosts and services, check_multi will automatically expand them into additional child checks.
    And the data gathering from status.dat is done efficiently with a caching mechanism.

  • Support for passive feeding: there are several ways now to feed check_multi results directly into Nagios:
    1. via check_result files (direct and very fast)
    2. via send_nsca (needs nsca daemon on Nagios side)
    3. as a chain of commands: one check_multi sends, the other check_multi receives and inserts all child checks into Nagios queue.

  • eval is not counted any more for the number of plugins
    eeval is visible and therefore counted; eval was counted but not visible, and this confused some people. Now it's not counted any more.

  • COUNT(ALL) keyword added to state evaluations; now it's possible to specify:
    state [ CRITICAL ] = COUNT(CRITICAL) < COUNT(ALL)-1
    This feature is useful for cluster monitoring.

  • Some nifty improvements for the command line and the command files:
    1. the report options -r can be specified with a chain of plus-separated numbers instead of a sum:
      -r 1+2+4+8 is more readable than -r 15, isn't it?
    2. eval and eeval perl snippets don't need to be written with trailing \.
      This allows comment lines within the code as well as direct copy and paste from perl scripts.
      Nevertheless, the old trailing \ is still valid, so nobody needs to rewrite his command files.

  • At last: configure based installation, added tests (make test) and consolidated directory structure.
    By the way: if you have a complicated configure line and deleted your config.log, no problem: call check_multi -V, and it will print the complete configure line:
    nagios~> check_multi -V
    check_multi: v$Revision: 272 $ $Date: 2009-10-15 08:22:59 +0200 (Thu, 15 Oct 2009) $ $Author: flackem $
    configure  '--prefix=/opt/nagios'
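Back to the plus-separated report notation from above: it is plain addition over the report values, so both spellings select the same report. A one-line sanity check of the arithmetic:

```shell
# -r 1+2+4+8 and -r 15 request the same report level.
report=$((1 + 2 + 4 + 8))
echo "-r 1+2+4+8 equals -r $report"
```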


This version is working properly on 1,000 European servers in the data centers of the telecommunications company I work for. If you find oddities or bugs anyway, please report them in the German Nagios-Portal or send me a mail.

Cheers,
-Matthias

Distributed service checks with check_multi

One day a customer complained that he could not access a local intranet server. But I was sure that the server was running; Nagios showed green lights everywhere.
When I remotely connected to the customer's PC, however, I had to admit that he was right: there was no connection from his network to the intranet server due to routing problems. Whoops… 8-o

That afternoon I thought about ways to cover this situation in our Nagios monitoring. Nagios has no obvious or generic solution for this problem except setting up multiple checks from different hosts.

But wait a little bit, there is a conceptual problem with this: all these checks are associated with hosts other than the ones they really belong to. This will confuse the whole process (and the administrator as well ;-)). Notifications and escalations are based on the wrong host, and the statistics / SLAs are also affected.

check_multi provides a simple but effective solution for this scenario: a distributed monitoring which works as a service associated to the target host.



You need:

  1. remote access to some hosts in the wanted subnets
    (i.e. normally some of your monitored servers)
  2. a check plugin existing on each of these hosts, which can monitor the service
    (e.g. check_tcp)


You will then:

  1. schedule a check on each of these hosts towards your target host / target service
  2. add a flexible state evaluation of your results:
    • either all results have to be OK for the overall state being OK
    • or only some results OK will set the overall state to OK


That's it! :-)

And more: you can do this with one generic check_multi command file. Some parameters will control which hosts are to be checked and what check_command is used to examine the service.

Call / Parameters

check_multi -f distributed.cmd \
-s CHECK_COMMAND="check_tcp -p 80 -H hostx -t 5" \
-s HOSTS="host1,host2,host3,host4,host5" \
-s THRESHOLDS="-w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'"

As you can see in the source below, you can also set other parameters from command line.
But there are already some reasonable defaults available:

  1. TIMEOUT (default: 2)
    this default is shorter than the normal default of 10 seconds. This is feasible here because we have multiple checks where only some have to succeed; others may fail without influencing the overall result.
  2. REMOTE_CHECK (default: check_by_ssh -H \$host -t $timeout$ -C )
    If you're running NRPE you can adjust this default as well.
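Before wiring this into the cmd file, the expansion logic can be sketched in plain shell; the hosts and the check command here are placeholders, and the real work is done by the host_checks eeval in the command file below:

```shell
# One remote child check per host, built from a comma-separated host list.
hosts="host1,host2,host3"
check_command="check_tcp -p 80 -H hostx -t 5"

count=0
old_ifs=$IFS
IFS=,
for host in $hosts; do
    echo "command [ $host ] = check_by_ssh -H $host -t 2 -C \"$check_command\""
    count=$((count + 1))
done
IFS=$old_ifs
echo "$count child checks generated"
```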

Child checks

Most of the child checks are designed for parameter validation and visualization:

  1. check_command: the plugin command line to check the remote service
  2. timeout: the timeout for the remote check
    (note: not the timeout for the plugin, this has to be specified within the check_command itself)
  3. hosts_to_check: comma separated list of hosts to check
  4. remote_check: command to access a remote host, normally via NRPE or SSH.
  5. host_checks: this dynamically creates one child check per host and returns the number of child checks.

State evaluation

Code: distributed.cmd

#
# distributed.cmd
#
# Matthias Flacke, 21.11.2008
#
# calls different remote hosts with parametrized check and returns
# only critical if (nearly) all hosts return errors
#
# Call: check_multi -f distributed.cmd
#       -s CHECK_COMMAND="check_tcp -p 80 -H hostx -t 5" \
#       -s HOSTS="host1,host2,host3,host4,host5" \
#       -s THRESHOLDS="-w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'"
#
# caveat: take care of the different timeout thresholds!
#
eeval [ check_command ] = \
        if ( "$CHECK_COMMAND$" ) { \
                return "$CHECK_COMMAND$"; \
        } else {\
                print "Error: CHECK_COMMAND not defined. Exit.\n"; \
                exit 3; \
        }
#
eeval [ timeout ] = ( "$TIMEOUT$" eq "") ? "2" : "$TIMEOUT$";
#
eeval [ hosts_to_check ] = \
        if ( "$HOSTS$") { \
                return "$HOSTS$"; \
        } else {\
                print "Error: no HOSTS defined. Exit.\n"; \
                exit 3; \
        } \
#
eeval [ remote_check ] = \
        if ( "$REMOTE_CHECK$") { \
                return "$REMOTE_CHECK$"; \
        } else { \
                return "check_by_ssh -H \$host -t $timeout$ -C "; \
        }
#
eeval [ host_checks ] = \
        my $count=0; \
        my $disttest=''; \
        foreach my $host (split(/,/,'$HOSTS$')) { \
                $disttest.="-x \'command [ $host ] = $remote_check$ \"$CHECK_COMMAND$\"\' "; \
                $count++; \
        } \
        parse_lines("command [ distributed_check ] = check_multi -r 15 $disttest $THRESHOLDS$"); \
        $count;

Example output

[Screenshot: distributed website monitoring]

Using the STATE expression to write flexible plugins with check_multi

For an individual interpretation of data in check_multi, the builtin state evaluation is a good means. Internally it works with perl eval and is therefore extremely flexible.

Look at the following SNMP example as a simple introduction:

# sensor.cmd
#
# (c) Matthias Flacke
# 30.12.2007
#
# Flexible interpretation of snmp results
command [ sensor ] = /usr/bin/snmpget -v 1 -c $COMMUNITY$ -Oqv $HOSTNAME$ XYZ-MIB::Sensor.0
 
state [ UNKNOWN  ] =  $sensor$  !~ /[4627859]/
state [ OK       ] = ( $sensor$ == 4 || $sensor$ == 6 )
state [ WARNING  ] = ( $sensor$ == 2 || $sensor$ == 7 || $sensor$ == 8 )
state [ CRITICAL ] = ( $sensor$ == 5 || $sensor$ == 9 )



What is happening here?
The first part, the sensor command, is just a plain snmpget, as you probably often use in self-written plugins.

But instead of a big if-else-clause check_multi uses state expressions to assign SNMP values to the different result values OK, WARNING and CRITICAL.

UNKNOWN has a special role here: it is used when the SNMP value is not a member of the specified group of numbers.

So this example really does no more than a standard plugin could do as well. But it shows how quickly and reliably you can develop such an SNMP plugin with check_multi. And if you want to change a value afterwards, you don't need to bother a developer; any administrator can do it as well.

check_multi can do all the standard tasks like a normal plugin:

  • gathering data
  • validating the retrieved data (→ state [ UNKNOWN ] line)
  • evaluating the different results against rules
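For comparison, the same mapping written as a small shell function; the sensor values are taken from the example above:

```shell
# Mirror of the four state [...] lines: map a raw sensor value to a Nagios state.
sensor_state() {
    case "$1" in
        4|6)   echo "OK" ;;
        2|7|8) echo "WARNING" ;;
        5|9)   echo "CRITICAL" ;;
        *)     echo "UNKNOWN" ;;   # value outside the known group of numbers
    esac
}

sensor_state 6
```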



Where should check_multi run? On the local or on the remote server?

There is often a discussion about whether monitoring checks should run remotely or locally on the server which is to be monitored. The decision is sometimes easy, when the resource to be monitored is not available remotely, e.g. logfiles or disks.
But there are plenty of cases where you can do both, e.g. applications and services which are accessed via the network. Please don't think that network services have to be monitored remotely because otherwise there's no proof that they work over the network. You can give exactly one proof with your Nagios check, and that is for the Nagios server - where the server's users normally do not reside ;-)
So why not execute all checks on the remote server?

There are indeed some reasons for this approach:

  • All checks are consistent.
  • The transport problem has only to be solved once.
  • The customer pays for his own monitoring in terms of performance. The Nagios server does not need to bear the load of multiple checks; it receives only the results.


But the disadvantage of this approach lies just here - when the customer comes and asks:

  • what the heck are all these 'nagios' processes doing on my server?
  • And why are they running so often?



Generally, we won't find a generic solution for all cases. I have prepared a small comparison to help you find the criteria for your specific situation:

1. check_multi running on the Nagios server

  • Basic concept: check_multi runs on the Nagios server and accesses the remote server with each child check independently.
  • Resources / load: the Nagios server bears most of the execution load and needs more power; the footprint on the client server is not that big.
  • Network: the network load is higher; especially with SSH transports the effort grows due to multiple authentication steps.
  • Configuration: configuration is easily accessible; there is no need to distribute configuration files.
  • Plugins: as for configuration, it's sufficient to provide plugins only on the central Nagios server.
  • Local vs. remote monitoring: there are particular local checks like disk monitoring which need high effort to be run remotely.

2. check_multi running on the client server

  • Basic concept: check_multi runs on the remote server (here called the client) and executes the child checks locally.
  • Resources / load: the load on the Nagios server is very small; the major work is done on the client, which therefore carries the burden of its own monitoring.
  • Network: the network load is very small; there is only one connection to trigger the startup and transport the results back.
  • Configuration: the remote client needs configuration to be distributed and updated.
  • Plugins: plugins have to be distributed and updated.
  • Local vs. remote monitoring: every remote check can also be run as a local check. That means, vice versa, that all checks can be run from the remote server, even checks for network services: a plus in terms of homogeneity of monitoring.

How to monitor clients which are not up all the time?

The Nagios world consists of hosts and services. But what to do if the hosts do not matter? This is the case with all devices which are not necessarily up all the time, but have services that need to be monitored when they are.

E.g.

  • Printers, which should be monitored for toner and paper
  • Windows clients, whose patch level should be supervised
  • Salesmen's notebooks, which are not connected most of the time.
    But when they are, we want to check everything we can get from them.

Offline devices

The basic principle is really simple:

  1. When a device is offline, say OK and print a message “It's offline”
  2. When it's available, check it and treat it as a normal device.

This can be implemented using the following check_multi snippet:

#
# check_client.cmd
# 08.02.2009
#
# (c) Matthias Flacke
#
# Call:  check_multi -f check_client.cmd -s HOSTNAME=<host>
command [ ping     ] = check_icmp -H $HOSTNAME$ -w 500,5% -c 1000,10%
eval    [ offline  ] = if ($STATE_ping$ != $OK) { print "Host offline"; exit 0; }
 
# execute these commands only if host online
command [ command1 ]  = ...
command [ command2 ]  = ...


The trick happens in the eval line: if the ping in the command line before does not succeed, we don't bother any more about this client; we print a short message “Host offline” and exit with OK.
Note: the eval command will not be shown in the normal visualization.

That's all.

BTW, users reported that it's quite a miracle how, during the late afternoon, all client problems silently disappear host by host until everything is green :-)

The idea stems from this thread in the German Nagios Portal.

 
projects/check_multi/start.txt · Last modified: 2010/01/29 14:34 by flackem