The incredibly simple plugin interface is one of the keys to the success of Nagios. A return code of 0, 1, 2 or 3 and a few explanatory words, that's all.
Everybody was able to understand these two elements and started to write plugins, in every language one can imagine and for every OS Nagios runs on. Nagios earned its reputation ('Yes we can plugin!'), and the only limitation was the skill of the plugin programmer.
As a side note: some plugins are not of the best quality, as we can see in the exchange repositories. But together they cover the whole range of monitoring.
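To make the two elements concrete, here is a minimal sketch of a plugin in Python. It is not taken from any existing plugin: the function names, the thresholds (20% and 10%) and the output wording are assumptions chosen for illustration only.

```python
import shutil

# The four return codes of the plugin interface.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(free_percent, warn=20.0, crit=10.0):
    """Map a free-space percentage to a return code (illustrative thresholds)."""
    if free_percent < crit:
        return CRITICAL
    if free_percent < warn:
        return WARNING
    return OK


def format_output(state, free_percent):
    """Build the text part: readable status, then '|' and performance data."""
    return "DISK %s - free space: %.0f%% | free=%.0f%%" % (
        STATE_NAMES[state], free_percent, free_percent)


def main(path="/"):
    """Print one line of output and return the plugin return code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print("DISK UNKNOWN - %s" % exc)
        return UNKNOWN
    if usage.total == 0:
        print("DISK UNKNOWN - %s reports a total size of 0" % path)
        return UNKNOWN
    free_percent = 100.0 * usage.free / usage.total
    state = evaluate(free_percent)
    print(format_output(state, free_percent))
    return state
```

To use this as a real plugin, the script would end with `sys.exit(main())` (after `import sys`), so that the return code reaches Nagios as the exit status of the process.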
Plugin output limited
Let's talk about a small aspect of the plugin interface that is annoying and frustrates Nagios beginners in particular - the limited length of plugin output. It sounds pretty simple, but the devil is in the details.
If you take the output of the standard plugin check_disk, the length of output should not be a problem:
DISK OK - free space: / 947 MB (8% inode=74%);| /=10533MB;10884;9675;0;12094
But in the meantime there are plugins like check_multi, which transport much more data:
OK - 3 plugins checked, 3 ok [ 1] disk DISK OK - free space: / 947 MB (8% inode=74%); [ 2] load OK - load average: 1.43, 1.23, 1.52 [ 3] swap SWAP OK - 95% free (1935 MB out of 2048 MB) |check_multi::check_multi::plugins=3 time=0.046274 disk::check_disk::/=10533MB;11489;11852;0;12094 load::check_load::load1=1.430;5.000;10.000;0; load5=1.230;4.000;8.000;0; load15=1.520;3.000;6.000;0; swap::check_swap::swap=1935MB;0;0;0;2048
That is 428 bytes instead of 83 bytes. If you still have Nagios 2 running, this exceeds the maximum length of plugin output.
Plugin buffer overflow? Nagios does not care
The plugin interface is simple, but that simplicity also means that Nagios does not care about the length of plugin output. If the output exceeds the internal buffer length, nobody is informed and often nobody notices. The content is simply cut off.
This is bad news for the performance data, which is appended to the output: if the buffer is not long enough to hold the whole output, the performance data is missing or, even worse, corrupted. And no warning light alerts the monitoring admin.
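The effect on the performance data can be simulated with a few lines of Python. This is only a sketch of the behaviour described above, not the Nagios code itself: the function names are made up, and the assumption is that the output is cut at a fixed byte count and then split at the first '|'.

```python
def truncate_output(output, max_bytes):
    """Cut plugin output to max_bytes, silently (assumed behaviour)."""
    # errors="ignore" drops a multi-byte character that was cut in half.
    return output.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def split_perfdata(output):
    """Split plugin output at the first '|' into (text, perfdata)."""
    text, _, perfdata = output.partition("|")
    return text.strip(), perfdata.strip()


def perfdata_state(original, max_bytes):
    """Tell what truncation does to the performance data.

    Returns 'intact', 'corrupted' (only a fragment survives) or
    'missing' (the cut happened at or before the '|').
    """
    _, full_perf = split_perfdata(original)
    _, cut_perf = split_perfdata(truncate_output(original, max_bytes))
    if cut_perf == full_perf:
        return "intact"
    return "corrupted" if cut_perf else "missing"


sample = "OK - all fine | load=1.5;5;10"
print(perfdata_state(sample, 8192))  # buffer large enough
print(perfdata_state(sample, 20))    # cut in the middle of the perfdata
print(perfdata_state(sample, 10))    # cut before the '|'
```

The 'corrupted' case is the nasty one: a fragment such as `load` still looks like performance data to whatever processes it downstream.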
Nagios 2 allowed 332 bytes of plugin output, but for Nagios 3 this was increased drastically. Have a look at how the maximum plugin output length evolved over the Nagios timeline.
On the Nagios side, the limit is the constant named MAX_PLUGIN_OUTPUT_LENGTH:
| Maximum plugin output in bytes | Nagios version | Include file |
|---|---|---|
| 352 | 1.0 | common/objects.h |
| 348 | 2-0 | include/objects.h |
| 332 | 2-1 | include/objects.h |
| 4096 | 3-0a | include/nagios.h |
| 8192 | 3-0 | include/nagios.h |
Don't assume that 8K bytes are sufficient in all cases - check_multi's HTML mode can easily consume dozens of kilobytes.
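A quick way to see which versions would cut a given output is to compare its byte length against the table. The following sketch copies the values from the table above; the helper name is made up, and it treats each limit as the usable length, which is a simplification.

```python
# Values of MAX_PLUGIN_OUTPUT_LENGTH, copied from the table above.
MAX_PLUGIN_OUTPUT_LENGTH = {
    "1.0": 352,
    "2-0": 348,
    "2-1": 332,
    "3-0a": 4096,
    "3-0": 8192,
}


def versions_that_truncate(output):
    """Return the Nagios versions whose buffer is too small for this output."""
    size = len(output.encode("utf-8"))  # count bytes, not characters
    return [version
            for version, limit in MAX_PLUGIN_OUTPUT_LENGTH.items()
            if size > limit]
```

For an output of 428 bytes, like the check_multi example, this returns the 1.0, 2-0 and 2-1 entries, while an 83-byte check_disk line fits everywhere.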
Increasing MAX_PLUGIN_OUTPUT_LENGTH - and some more
In principle the idea of increasing the constant MAX_PLUGIN_OUTPUT_LENGTH is correct - increase it, recompile and restart Nagios, done.
Ethan himself gives a hint in nagios.h to also increase MAX_EXTERNAL_COMMAND_LENGTH for passive checks:
NOTE: Plugin length is artificially capped at 8k to prevent runaway plugins from returning MBs/GBs of data back to Nagios. If you increase the 8k cap by modifying this value, make sure you also increase the value of MAX_EXTERNAL_COMMAND_LENGTH in common.h to allow for passive checks results received through the external command file. EG 10/19/07
One remark on the buffer size: restricting it is generally a good idea. But increasing it to 32K or 64K should not be a problem for modern servers in the gigabit world, even if there are runaway plugins.
Transports and oddities
Enlarging the Nagios buffers is not the whole story, since many plugins run on remote machines and their output has to be transferred to Nagios. Here several transports enter the stage: SSH, NRPE and NSCA.
There are more, but these are the most important in the Nagios world. Let's take a look at how they behave with large plugin output.
We will begin with SSH, since it is not Nagios-specific and, in terms of transport, the simplest. I know that some people will not agree, but here are my 2 cents: once you have set up public key authentication, SSH is a simple, safe and robust transport. And if you transfer 10K or 100K, who cares…
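As a rough sketch of the SSH transport, the following Python snippet runs a plugin on a remote machine and hands back its return code and output. The function names and the timeout are assumptions for illustration. It relies on two properties of the `ssh` client: it exits with the exit status of the remote command, and `BatchMode=yes` makes it fail instead of asking for a password.

```python
import subprocess


def build_ssh_command(host, plugin_command):
    """Build the argument list for running a plugin command on a remote host."""
    # BatchMode=yes: never prompt, which suits a non-interactive check.
    return ["ssh", "-o", "BatchMode=yes", host, plugin_command]


def run_remote_plugin(host, plugin_command, timeout=30):
    """Run the plugin over SSH and return (return_code, output)."""
    try:
        result = subprocess.run(
            build_ssh_command(host, plugin_command),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 3, "UNKNOWN - ssh transport failed: %s" % exc
    if result.returncode not in (0, 1, 2, 3):
        # ssh itself reports errors with 255, which is not a plugin state.
        return 3, "UNKNOWN - unexpected return code %d: %s" % (
            result.returncode, result.stdout.strip())
    return result.returncode, result.stdout.strip()
```

The output arrives through an ordinary pipe, so the transport itself does not impose a small fixed packet size on it.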
NRPE is a bit more tricky, and this comes from its internal implementation. In the original version it is a single-buffer transport and will fail on large output unless you adjust the small buffer sizes in common.h:
#define MAX_INPUT_BUFFER 2048 /* max size of most buffers we use */
[...]
#define MAX_PACKETBUFFER_LENGTH 1024 /* max amount of data we'll send in one query/response */
Ton Voon has provided an improvement which removes this limitation. The best thing about Ton's patch is that you don't have to upgrade all machines at once. You can do it step by step, which is especially helpful for large installations.
Note: if you are running NRPE on Linux machines with a kernel older than 2.6.11, you will only be able to transport one buffer. This is caused by the old single-buffered pipe implementation. In 2.6.11 Linus Torvalds himself implemented a ring buffer which allowed circular pipes. With the default kernel pipe buffer size of 4K and 16 buffers, NRPE can now transport 64K. So if you still have problems with cut NRPE data, watch out for 2.6.10 and below.
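The arithmetic and the kernel check from the note can be written down in a few lines of Python. This is a sketch with made-up function names. It only looks at the kernel release string, so the result is meaningful on Linux only.

```python
import platform
import re


def pipe_capacity(buffer_size=4096, buffers=16):
    """Total pipe capacity in bytes: 16 buffers of 4K give 64K."""
    return buffer_size * buffers


def parse_kernel_release(release):
    """Extract the leading numbers, e.g. '2.6.10-5-386' -> (2, 6, 10)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing third component (as in '6.1') counts as 0.
    return tuple(int(part or 0) for part in match.groups())


def has_ring_buffer_pipes(release=None):
    """True if the kernel release is 2.6.11 or newer."""
    if release is None:
        release = platform.release()  # release of the running kernel
    return parse_kernel_release(release) >= (2, 6, 11)
```

Calling `has_ring_buffer_pipes()` without arguments on the NRPE host tells you whether the single-buffer limit from the note applies there.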
NSCA is the nasty end - and if you ask me, it needs a reimplementation. It has several implementation quirks that no longer fit the current Nagios world.
Recommendations for check_multi?
After our short walk through the puzzling world of Nagios transports, the conclusion for the use of check_multi is pretty clear: NRPE and SSH will work well, while NSCA is the black sheep of this family.
But this does not need to be a real disadvantage: in a check_multi driven Nagios infrastructure you don't need that many passive services with NSCA, because you can use active check_multi services instead.
In the end this means no more need for freshness checks and no more need for sophisticated distributed setups.