It's high time for the next release, with some new features and plenty of bug fixes.
attribute [ <TAG>::<variable> ] = <expression>

Like state [ critical ] = <expression>, just on the child check level. It is evaluated directly after a child check has been executed and affects the state of that particular child check.

    command [ tmp_dir_permissions ] = ls -ld /tmp
    attribute [ tmp_dir_permissions::critical ] = $tmp_dir_permissions$ !~ /^drwxrwxrwt/
The output is:
    $ check_multi -f test_tmp_dir.cmd -r 1+4+64
    CRITICAL - 1 plugins checked, 1 critical (tmp_dir_permissions)
    [ 1] tmp_dir_permissions CRITICAL
    drwxr-xr-x 10 root root 4096 Nov 17 15:29 /tmp [CRITICAL: output matched rule '"$tmp_dir_permissions$" !~ /^drwxrwxrwt/']
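To make the rule easier to follow, here is a minimal Python sketch of what the expression `$tmp_dir_permissions$ !~ /^drwxrwxrwt/` tests. The helper name `is_critical` is made up for this illustration; check_multi itself evaluates the rule in Perl.

```python
import re

def is_critical(ls_output: str) -> bool:
    """Hypothetical helper: True when the `ls -ld` output does NOT start
    with drwxrwxrwt, mirroring the Perl test `!~ /^drwxrwxrwt/`."""
    # re.match anchors at the start of the string, like ^ in the Perl regex.
    return re.match(r"drwxrwxrwt", ls_output) is None

print(is_critical("drwxr-xr-x 10 root root 4096 Nov 17 15:29 /tmp"))  # True
print(is_critical("drwxrwxrwt 10 root root 4096 Nov 17 15:29 /tmp"))  # False
```

The first call corresponds to the output shown above: the sticky bit and the world-writable permissions are missing, so the child check turns CRITICAL.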
attribute [ <variable> ] = <value>
Have fun with check_multi,
-Matthias
The incredibly simple plugin interface is one of the keys to the success of Nagios. A return code of 0, 1, 2 or 3 and a few explanatory words, that's all.
Everybody was able to understand these two elements and started to write plugins, in every language one can imagine and for every OS Nagios runs on. Nagios earned its reputation ('Yes we can plugin!'), and the only limitation was the skill of the plugin programmer.
As a side note: some plugins are not of the best quality, as we can see in the exchange repositories. But together they cover the whole range of monitoring.
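To show how little is needed, here is a minimal sketch of the two elements in Python. The function `check_value` and its thresholds are invented for this illustration; only the return codes 0 to 3 and the "text, then performance data after the pipe" layout come from the plugin interface.

```python
# Return codes of the Nagios plugin interface.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def check_value(value, warn, crit):
    """Hypothetical check: compare a number against two thresholds and
    return (return code, output line)."""
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    # Some explaining words, then optional performance data after the pipe.
    output = f"{STATE_NAMES[state]} - value is {value}|value={value};{warn};{crit}"
    return state, output

state, output = check_value(42, warn=80, crit=90)
print(output)
# A real plugin would now end with: sys.exit(state)
```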
Plugin output limited
Let's talk about a small aspect of the plugin interface which is annoying and often frustrates Nagios beginners in particular: the limited length of plugin output. It sounds pretty simple, but the devil is in the details.
If you take the output of the standard plugin check_disk, the length of output should not be a problem:
DISK OK - free space: / 947 MB (8% inode=74%);| /=10533MB;10884;9675;0;12094
But in the meantime there are plugins like check_multi, which transport much more data:
    OK - 3 plugins checked, 3 ok
    [ 1] disk DISK OK - free space: / 947 MB (8% inode=74%);
    [ 2] load OK - load average: 1.43, 1.23, 1.52
    [ 3] swap SWAP OK - 95% free (1935 MB out of 2048 MB)
    |check_multi::check_multi::plugins=3 time=0.046274 disk::check_disk::/=10533MB;11489;11852;0;12094 load::check_load::load1=1.430;5.000;10.000;0; load5=1.230;4.000;8.000;0; load15=1.520;3.000;6.000;0; swap::check_swap::swap=1935MB;0;0;0;2048
428 bytes instead of 83 bytes: If you still have Nagios 2 running, this would have blown the maximum length of plugin output.
Plugin buffer overflow? Nagios does not care
The plugin interface is simple, but it also means that Nagios does not care about the length of plugin output. If it exceeds the internal buffer length, nobody is informed and often nobody notices it. The content is simply cut.
Bad news for the performance data, which is appended to the output: if the buffer is not long enough to hold the whole output, the performance data is missing or, even worse, corrupted. And no warning light alerts the monitoring admin.
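The effect can be sketched in a few lines of Python. This is a simplified model, not the Nagios code: the two helper functions are made up for the illustration, and the buffer is modelled as a plain cut after `max_length` characters.

```python
def truncate_like_fixed_buffer(plugin_output: str, max_length: int) -> str:
    """Cut the output silently, the way a fixed-size buffer would (simplified model)."""
    return plugin_output[:max_length]

def split_perfdata(plugin_output: str):
    """Split plugin output at the first pipe into text and performance data."""
    text, _, perfdata = plugin_output.partition("|")
    return text, perfdata

output = "DISK OK - free space: / 947 MB (8% inode=74%);| /=10533MB;10884;9675;0;12094"

# With a buffer that is too small, the performance data is cut in the middle...
text, perf = split_perfdata(truncate_like_fixed_buffer(output, 60))
print(repr(perf))

# ...and with an even smaller one it disappears completely.
text, perf_gone = split_perfdata(truncate_like_fixed_buffer(output, 20))
print(repr(perf_gone))
```

In the first case the remaining performance data still looks plausible at first glance, which is exactly why a cut value is worse than a missing one.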
Nagios 2 allowed 332 bytes of plugin output, but for Nagios 3 this was increased drastically. Have a look at how the maximum plugin output length evolved over the Nagios timeline.
On the Nagios side it's the constant named MAX_PLUGIN_OUTPUT_LENGTH:
| Maximum plugin output in bytes | Nagios version | Include file |
|---|---|---|
| 352 | 1.0 | common/objects.h |
| 348 | 2-0 | include/objects.h |
| 332 | 2-1 | include/objects.h |
| 4096 | 3-0a | include/nagios.h |
| 8192 | 3-0 | include/nagios.h |
Don't think that the 8K bytes are sufficient in all cases - check_multi's HTML mode can easily consume dozens of kilobytes.
Increasing MAX_PLUGIN_OUTPUT_LENGTH - and some more
In principle the idea of increasing the constant MAX_PLUGIN_OUTPUT_LENGTH is correct - increase it, recompile and restart Nagios, done.
Ethan himself gives a hint in nagios.h to also increase MAX_EXTERNAL_COMMAND_LENGTH for passive checks:
NOTE: Plugin length is artificially capped at 8k to prevent runaway plugins from returning MBs/GBs of data back to Nagios. If you increase the 8k cap by modifying this value, make sure you also increase the value of MAX_EXTERNAL_COMMAND_LENGTH in common.h to allow for passive checks results received through the external command file. EG 10/19/07
One remark on the buffer size: generally it's a good idea to restrict it. But increasing it to 32K or 64K should not be a problem for modern servers in the gigabit world, even if there are runaway plugins.
Transports and oddities
Enlarging the Nagios buffers is not all, since many plugins run on remote machines and their output has to be transferred to Nagios. Here several transports enter the stage: SSH, NRPE and NSCA.
There are more, but these are the most important in the Nagios world. Let's take a look at how they behave with large plugin output.
We will begin with SSH, since it's not Nagios-specific and, in terms of transport, the simplest. I know that some people will not agree, but here are my 2 cents: if you manage the public key authentication with SSH, it's a simple, safe and robust transport. And if you transfer 10K or 100K, who cares…
NRPE is a bit trickier, which is down to its internal implementation. The original version is a single-buffer transport and will fail on large output unless you adjust the small buffer sizes in common.h:
    #define MAX_INPUT_BUFFER 2048 /* max size of most buffers we use */
    [...]
    #define MAX_PACKETBUFFER_LENGTH 1024 /* max amount of data we'll send in one query/response */
Ton Voon has provided an improvement which removes this limitation. The best thing about Ton's patch is that you don't have to upgrade all machines at once. You can do it step by step, which is helpful especially for large installations.
Note: if you are running NRPE on Linux machines with a kernel older than 2.6.11, you will only be able to transport one buffer. This is caused by the old single-buffered pipe implementation. In 2.6.11 Linus Torvalds himself implemented a ring buffer which allowed circular pipes. With the default kernel pipe buffer size of 4K and 16 buffers, NRPE can now transport 64K. So if you still have problems with truncated NRPE data, watch out for kernels 2.6.10 and below.
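The arithmetic behind the 64K figure, as a small Python sketch. The constant and function names are chosen for this illustration only; the numbers (4K per buffer, 16 buffers, one buffer on older kernels) are the ones quoted above.

```python
PIPE_BUFFER_SIZE = 4 * 1024   # default kernel pipe buffer size quoted above (4K)
PIPE_BUFFER_COUNT = 16        # number of pipe buffers since kernel 2.6.11

def pipe_capacity(buffer_count: int = PIPE_BUFFER_COUNT) -> int:
    """Total number of bytes that fit into the pipe."""
    return PIPE_BUFFER_SIZE * buffer_count

def fits_in_pipe(output_bytes: int, buffer_count: int = PIPE_BUFFER_COUNT) -> bool:
    """True if plugin output of the given size fits into the pipe in one go."""
    return output_bytes <= pipe_capacity(buffer_count)

print(pipe_capacity())        # 65536, i.e. 64K with 16 buffers
print(pipe_capacity(1))       # 4096, the single buffer of older kernels
print(fits_in_pipe(8192, 1))  # False: 8K of output does not fit into one 4K buffer
```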
NSCA is the nasty end, and if you ask me, it needs a reimplementation. There are several implementation quirks which no longer fit into the current Nagios world:
Recommendations for check_multi?
After our small walk through the puzzling world of Nagios transports the conclusion for the use of check_multi is pretty clear: NRPE and SSH will work well, while NSCA is the black sheep in this family.
But this does not need to be a real disadvantage: in a check_multi driven Nagios infrastructure you don't need that many passive services with NSCA, because you can use active check_multi services instead.
In the end this means no more need for freshness checks and no more need for sophisticated distributed setups.
I confess - this release should have been launched much earlier. I know the OSS mantra: release early, release often. But there were so many enhancements, redesigns and fixes in a row that I simply never got around to turning the trunk version into a new stable release.
You can download it as always from here.
So these are the new features:
- statusdat [ TAG ] = host:service
- state [ CRITICAL ] = COUNT(CRITICAL) < COUNT(ALL)-1
- -r can be specified with a chain of plus-separated numbers instead of a sum: -r 1+2+4+8 is more readable than -r 15, isn't it?
- .\.\ is also valid, so nobody needs to rewrite his command files.
- make test) and consolidated directory structure.
- check_multi -V, and it will print the complete configure line:

      nagios~> check_multi -V
      check_multi: v$Revision: 272 $ $Date: 2009-10-15 08:22:59 +0200 (Thu, 15 Oct 2009) $ $Author: flackem $
      configure '--prefix=/opt/nagios'
This version is working properly on 1,000 European servers in the data centers of the telecommunication company I work for. If you find oddities or bugs anyway, please report them in the German Nagios-Portal or send me a mail.
Cheers,
-Matthias
One day a customer complained that he could not access a local intranet server. But I was sure that the server was running; Nagios showed green lights everywhere.
But when I remotely connected to the customer's PC, I had to admit that he was right: there was no connection from his network to the intranet server due to routing problems. Whoops…
That afternoon I thought about ways to cover this situation in our Nagios monitoring. Nagios has no obvious or generic solution for this problem except setting up multiple checks from different hosts.
But wait a moment - there is a conceptual problem with this. All these checks are associated with hosts other than the one they really belong to. This will confuse the whole process (and the administrator as well). Notifications and escalations are based on the wrong host, and the statistics / SLAs are also affected.
check_multi provides a simple but effective solution for this scenario: a distributed monitoring which works as a service associated to the target host.

You need:
You will then:
That's it!
And more: you can do this with one generic check_multi command file. Some parameters will control which hosts are to be checked and what check_command is used to examine the service.
    check_multi -f distributed.cmd \
        -s CHECK_COMMAND="check_tcp -p 80 -H hostx -t 5" \
        -s HOSTS="host1,host2,host3,host4,host5" \
        -s THRESHOLDS="-w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'"
As you can see in the source below, you can also set other parameters from the command line.
But there are already some reasonable defaults available:
check_by_ssh -H \$host -t $timeout$ -C )

Most of the child checks are designed for parameter validation and visualization:
    #
    # distributed.cmd
    #
    # Matthias Flacke, 21.11.2008
    #
    # calls different remote hosts with parametrized check and returns
    # only critical if (nearly) all hosts return errors
    #
    # Call: check_multi -f distributed.cmd
    #       -s CHECK_COMMAND="check_tcp -p 80 -H hostx -t 5" \
    #       -s HOSTS="host1,host2,host3,host4,host5" \
    #       -s THRESHOLDS="-w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'"
    #
    # caveat: take care of the different timeout thresholds!
    #
    eeval [ check_command ] = \
        if ( "$CHECK_COMMAND$" ) { \
            return "$CHECK_COMMAND$"; \
        } else {\
            print "Error: CHECK_COMMAND not defined. Exit.\n"; \
            exit 3; \
        }
    #
    eeval [ timeout ] = ( "$TIMEOUT$" eq "") ? "2" : "$TIMEOUT$";
    #
    eeval [ hosts_to_check ] = \
        if ( "$HOSTS$") { \
            return "$HOSTS$"; \
        } else {\
            print "Error: no HOSTS defined. Exit.\n"; \
            exit 3; \
        } \
    #
    eeval [ remote_check ] = \
        if ( "$REMOTE_CHECK$") { \
            return "$REMOTE_CHECK$"; \
        } else { \
            return "check_by_ssh -H \$host -t $timeout$ -C "; \
        }
    #
    eeval [ host_checks ] = \
        my $count=0; \
        my $disttest=''; \
        foreach my $host (split(/,/,'$HOSTS$')) { \
            $disttest.="-x \'command [ $host ] = $remote_check$ \"$CHECK_COMMAND$\"\' "; \
            $count++; \
        } \
        parse_lines("command [ distributed_check ] = check_multi -r 15 $disttest $THRESHOLDS$"); \
        $count;
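For readers who are less fluent in Perl, the loop in the host_checks block can be re-expressed as a rough Python sketch. The function `build_distributed_command` is invented for this illustration and is not part of check_multi; it only mimics how one `-x 'command [ host ] = ...'` option is generated per host. The macro substitution is reduced to simple string replacement, and the trailing blank of the remote check is stripped, so the spacing may differ slightly from what check_multi produces.

```python
def build_distributed_command(hosts, check_command, thresholds,
                              remote_check="check_by_ssh -H $host -t $timeout$ -C ",
                              timeout="2"):
    """Hypothetical re-expression of the host_checks loop.

    Returns the generated check_multi command line and the number of hosts."""
    parts = []
    for host in hosts.split(","):
        # Fill in the timeout and the current host name (simplified macro handling).
        remote = remote_check.replace("$timeout$", timeout).replace("$host", host).strip()
        parts.append(f"-x 'command [ {host} ] = {remote} \"{check_command}\"'")
    command = "check_multi -r 15 " + " ".join(parts) + " " + thresholds
    return command, len(parts)

cmd, count = build_distributed_command(
    "host1,host2",
    "check_tcp -p 80 -H hostx -t 5",
    "-w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'",
)
print(count)
print(cmd)
```

Each host contributes one child check named after itself, so the thresholds can simply count how many hosts report WARNING or CRITICAL.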
For an individual interpretation of data in check_multi, the built-in state evaluation is a good tool. Internally it works with Perl eval and is therefore extremely flexible.
Look at the following SNMP example as a simple introduction:
    # sensor.cmd
    #
    # (c) Matthias Flacke
    # 30.12.2007
    #
    # Flexible interpretation of snmp results
    command [ sensor ] = /usr/bin/snmpget -v 1 -c $COMMUNITY$ -Oqv $HOSTNAME$ XYZ-MIB::Sensor.0
    state [ UNKNOWN ] = $sensor$ !~ /[4627859]/
    state [ OK ] = ( $sensor$ == 4 || $sensor$ == 6 )
    state [ WARNING ] = ( $sensor$ == 2 || $sensor$ == 7 || $sensor$ == 8 )
    state [ CRITICAL ] = ( $sensor$ == 5 || $sensor$ == 9 )
What is happening here?
The first part, the sensor command, is just a plain snmpget, as you probably often use in self-written plugins.
But instead of a big if-else clause, check_multi uses state expressions to map the SNMP values to the result values OK, WARNING and CRITICAL.
UNKNOWN has a special role here: it is used when the SNMP value is not a member of the expected group of numbers.
So this example really does nothing more than a standard plugin can do. But it shows how quickly and reliably you can develop such an SNMP plugin with check_multi. And if you want to change a value afterwards, you do not need to bother a developer. Any administrator can do this as well.
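For comparison, this is roughly what the same mapping looks like when written by hand, here as a small Python sketch. The names `SENSOR_STATES` and `sensor_state` are invented for this illustration, and the sketch is a simplification: it uses a plain lookup table and treats everything that is not one of the known numbers as UNKNOWN, whereas check_multi evaluates the state expressions above in Perl.

```python
# Mapping of sensor values to states, taken from the state lines in sensor.cmd.
SENSOR_STATES = {
    4: "OK", 6: "OK",
    2: "WARNING", 7: "WARNING", 8: "WARNING",
    5: "CRITICAL", 9: "CRITICAL",
}

def sensor_state(raw_value: str) -> str:
    """Hypothetical helper: map the snmpget output to a state name."""
    try:
        value = int(raw_value.strip())
    except ValueError:
        # Not a number at all, e.g. an SNMP error message.
        return "UNKNOWN"
    return SENSOR_STATES.get(value, "UNKNOWN")

for raw in ("4", "7", "9", "3", "Timeout"):
    print(raw, "->", sensor_state(raw))
```

Changing a value means editing one table entry here, or one state line in the command file.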
check_multi can do all the standard tasks like a normal plugin:
There is often a discussion about whether monitoring checks should be run remotely or locally on the server which is to be monitored.
The decision is sometimes easy, when the resource to be monitored is not available remotely, e.g. logfiles or disks.
But there are plenty of cases where you can do both, e.g. applications and services which are accessed via the network. Please don't think that network services have to be monitored remotely because otherwise there's no proof that they work over the network. Your Nagios check can give exactly one proof, and that is for the Nagios server, which is where the server's users normally do not reside.
So why not execute all checks on the remote server?
There are indeed some reasons for this approach:
But the disadvantage of this approach lies just here - when the customer comes and asks:
Generally we won't find a generic solution for all cases. I have prepared a small table to help you find the criteria for your specific solution:
The Nagios world consists of hosts and services. But what to do if the hosts do not matter? This is the case with all devices which are not necessarily up all the time, but have services that need to be monitored when they are.
E.g.
The basic principle is really simple:
This can be implemented using the following check_multi snippet:
    #
    # check_client.cmd
    # 08.02.2009
    #
    # (c) Matthias Flacke
    #
    # Call: check_multi -f check_client.cmd -s HOSTNAME=<host>
    command [ ping ] = check_icmp -H $HOSTNAME$ -w 500,5% -c 1000,10%
    eval [ offline ] = if ($STATE_ping$ != $OK) { print "Host offline"; exit 0; }
    # execute these commands only if host online
    command [ command1 ] = ...
    command [ command2 ] = ...
The trick happens in the eval line: if the ping in the command line before it does not succeed, we don't bother about this client any more, print a short message “Host offline” and exit with OK.
Note: The eval command will not be shown in the normal visualization.
That's all.
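The control flow of the snippet can be sketched in Python as follows. The function `check_client` and its parameters are made up for this illustration. As a simplification it returns the highest return code of the remaining checks, which is not necessarily how check_multi determines its overall state.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_client(ping_state, run_checks):
    """Hypothetical sketch of the check_client.cmd logic.

    ping_state: return code of the ping check.
    run_checks: callable which runs the remaining checks and returns
                their return codes as a list."""
    if ping_state != OK:
        # Host is offline: report OK and skip all further checks.
        return OK, "Host offline"
    states = run_checks()
    # Simplification: take the highest return code as the overall result.
    return max(states, default=OK), f"{len(states)} checks run"

print(check_client(CRITICAL, lambda: [CRITICAL, CRITICAL]))  # (0, 'Host offline')
print(check_client(OK, lambda: [OK, WARNING]))               # (1, '2 checks run')
```

The important detail is the early return: when the client does not answer the ping, the other checks are never executed, so they cannot raise alarms for a machine that is simply switched off.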
BTW - users reported that it looks like quite a miracle when, during the late afternoon, all client problems silently disappear host by host until everything is green.
The idea stems from this thread in the German Nagios Portal