======= check_multi feeds passive checks =======

{{check_multi:multi_feeds_passive.png|}}

Large environments are one of the main places to run //check_multi//, in order to
  * reduce the number of Nagios servers
  * avoid the complexity of a distributed environment

On the other hand, //check_multi// has clear disadvantages (''There can be only one'') when
  * heterogeneous groups are participating in the monitoring
  * one Nagios object is needed for every monitoring item (= check_multi child check).

This was the basic motivation for the improved implementation with
  * the standard //check_multi// plugin as an active data collector
  * an additional report mode which feeds all child results as passive checks into Nagios

=== Pros ===
  * The communication between Nagios server and client happens only once per machine, not once per check.
  * The multi check is active and therefore under Nagios scheduling control.
  * Since the individual services are passive, they do not put load on the Nagios scheduling queue.
  * All services are nevertheless Nagios services, so notification, escalation and reporting are available.
  * The performance gain is enormous: 25000 services per server are possible (see the Performance benchmarking section).

=== Cons ===
  * More complexity compared to an active service implementation (but less than a distributed setup).
  * The configuration has to be set up twice: once for the active check_multi check, once for the passive checks.

======= Basic design =======

===== How does it work? =====
  - //check_multi// acts as a normal active Nagios check and collects checks from a remote host.
  - Each child check has a corresponding passive check in Nagios with the same name.
  - //check_multi// takes each child check's output and return code (RC) and feeds them into the corresponding passive Nagios check.

That's all.
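To make the third step more concrete, here is a minimal sketch of what a single passive service result looks like when it is handed to Nagios. It uses Nagios' generic external command ''PROCESS_SERVICE_CHECK_RESULT'' purely as an illustration; check_multi's own feed mode may well use a different mechanism internally. The function name ''passive_result_line'', the sample plugin output and the command file path in the comment are assumptions, while ''host1'' and ''system_load'' are borrowed from the examples on this page.

```shell
#!/bin/sh
# Format one passive service result in Nagios' external command syntax:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
# Arguments: host name, service description, return code (0-3), plugin output.
# passive_result_line is a hypothetical helper, not part of check_multi.
passive_result_line() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# One child check result for the passive service "system_load" on "host1".
passive_result_line host1 system_load 0 'OK - load average: 0.10, 0.20, 0.30'

# To actually submit it, the line would be appended to the Nagios command
# file (the path below is an assumption and depends on the installation):
#   passive_result_line host1 system_load 0 'OK' >> /usr/local/nagios/var/rw/nagios.cmd
```

The service description in the result has to match the name of the passive service exactly, which is why each child check needs a passive service of the same name.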
===== Implementation details =====

There is a design problem when executing multiple remote checks within one collector check and then returning the results to the passive side of Nagios: the transport.
  * If you run //check_multi// on the Nagios server, you need a remote connection for each child check: very expensive.
  * If you run //check_multi// on the remote server, it is difficult to reach the Nagios input queue on the passive side.

The solution is: use check_multi **twice** in a command chain:\\
{{:check_multi:multi_feeds_passive_remote.png|}}
  - check_multi on the remote host gathers data.
  - check_multi on the Nagios server feeds passive services.

The first check_multi passes its results via XML to the second one.\\
\\
**Note:** the whole chain is started on the Nagios server. In case of DMZ host monitoring no inbound connections are used.

===== Remote examples =====
  - SSH:<code>check_by_ssh -H -c '/path/to/check_multi -f multi.cmd -r 256' | check_multi -f - -r 8192+8+1</code>
  - NRPE:<code>check_nrpe -H -c check_multi -a '-f multi.cmd -r 256' | check_multi -f - -r 8192+8+1</code>
  - NSCA:<code>check_nrpe -H -c check_multi -a '-f multi.cmd -r 4096+8+1'</code>This method needs a running nsca daemon on the Nagios server. Inbound connections are used, therefore this approach is **not** recommended for DMZ setups.

===== mod_gearman =====

check_multi can easily be integrated into Sven Nierlein's [[http://labs.consol.de/lang/de/nagios/mod-gearman/|mod_gearman]]. mod_gearman is a NEB module which runs checks over a [[http://gearman.org/index.php?id=protocol|gearman]] scheduling framework and passes the results back to Nagios. This allows thousands of checks to be run quite efficiently within a single Nagios instance.

A specific client, 'send_multi', is part of the mod_gearman package and can be used to feed the individual child check results into the gearman queues. This client is a small C binary and consumes far fewer resources than calling check_multi itself to pass checks into Nagios.

**Call example:**
<code>
$ check_multi -f multi.cmd -r 256 | \
  send_multi --server= --encryption=no --host="" --service=""
</code>

**Note:**
  * If you want to use only check_multi and no other workers, you can achieve this with the following NEB module settings:<code>
broker_module=/usr/local/share/nagios/mod_gearman.o \
  server=localhost \
  encryption=no \
  eventhandler=no \
  hosts=no \
  services=no \
  hostgroups=does_not_exist
</code>
  * Encryption is not necessary if you run both the check_multi checks and the Nagios check_results queue on the same server.

======= For the curious: example installation =======

This example installation is part of the sample-config directory in the //check_multi// package.\\
**Note**: it is a setup for one machine; no remote access is included in the configuration. For a basic understanding of the principle this does not matter anyway ;-)

**Cooking list:**
  - Download //check_multi//, latest SVN.
  - <code>./configure; make all</code>
  - <code>cd sample-config/feed_passive</code>
  - Install the feed_passive example files with<code>make install-config</code>This will add a directory ''/path/to/nagios/etc/check_multi/feed_passive''.
  - Add the feed_passive subdirectory as cfg_dir to **nagios.cfg**:<code>cfg_dir=/usr/local/nagios/etc/check_multi/feed_passive</code>
  - Reload / restart Nagios: et voila :-P\\ \\ {{:check_multi:examples:multi_feeds_passive_sample.png|one of the example hosts}}

  * Standard sizing is 10 hosts with 10 feed services and 100 passive services.
  * If you want to put more load on your system, go to ''/etc/check_multi/feed_passive'' and run<code>perl gencfg nhosts</code>then reload Nagios.

======= Installation =======

===== Prerequisites on the Nagios server =====
  - **mandatory - perl module XML::Simple**\\ Install XML::Simple on the Nagios server, either from your Linux distribution or directly from [[http://search.cpan.org/perldoc?XML::Simple|CPAN]]. It is only needed on the receiving side (the Nagios server); the senders (remote clients) do not need XML::Simple.
  - **optional - nagios.cfg settings**\\ I recommend setting some attributes for performance tuning and to avoid unnecessary logging:

^ setting ^ comment ^
| child_processes_fork_twice=0 | speeds up Nagios, one fork is enough |
| free_child_process_memory=0 | Linux can free memory much faster than Nagios |
| log_initial_states=0 | otherwise each day's log contains one unnecessary line per service |
| log_passive_checks=0 | saves lots of space in nagios.log |
| use_large_installation_tweaks=1 | another performance boost (e.g. no summary macros) |

None of these attributes is mandatory, but they will speed up your infrastructure in large setups.

===== check_multi command file =====

Just as an example, your mileage may vary ;)
<code>
#--- multi.cmd
command [ system_disk  ] = check_disk -w 5% -c 2% -p /
command [ system_load  ] = check_load -w 10,8,6 -c 20,18,16
command [ system_swap  ] = check_swap -w 90 -c 80
command [ system_users ] = check_users -w 5 -c 10
command [ procs_num    ] = check_procs
command [ procs_cpu    ] = check_procs -w 10 -c 20 --metric=CPU -v
command [ procs_mem    ] = check_procs -w 100000 -c 2000000 --metric=RSS -v
command [ procs_zombie ] = check_procs -w 1 -c 2 -s Z
command [ proc_cron    ] = check_procs -c 1: -C cron
command [ proc_syslogd ] = check_procs -c 1: -C syslogd

#--- avoid redundant states
state [ WARNING  ] = IGNORE
state [ CRITICAL ] = IGNORE
state [ UNKNOWN  ] = IGNORE
</code>

===== check_multi active service definition =====

This service runs on the remote host and gathers data:
  * check_multi report option ''-r 256+4+1'':
    - Mandatory: ''-r 256'' as XML output option
    - Recommended: ''-r 4'' for ERROR output
    - Last but not least: ''-r 1'' for detailed results in the status line
  * Example:<code>
define service {
   service_description  multi_feed
   host_name            host1
   check_command        check_multi!-f multi_small.cmd -r 256+4+1 -v
   event_handler        multi_feed_passive
   check_interval       5
   use                  local-service
}
</code>

===== Passive ''feeded'' service definition =====
  * Mandatory: ''passive_checks_enabled 1''
  * Mandatory: ''active_checks_enabled 0''
  * and the rest: YMMV
  * Example:<code>
define service {
   service_description     $THIS_NAME$
   host_name               $HOSTNAME$
   passive_checks_enabled  1
   active_checks_enabled   0
   check_command           check_dummy!0 "passive check"
   use                     local-service
}
</code>

You can //easily// generate these passive services via check_multi report mode 2048:
<code>
check_multi -f multi.cmd -r 2048 -s service_definition_template=/path/to/service_definition.tpl > services_passive.cfg
</code>
\\
Hint: create a one-liner which loops over your hosts and generates the service definitions in bulk. Whenever a host is added, rerun your script and reload Nagios to put the new passive services into effect.

======= Troubleshooting =======
  * If you see the message ''Passive check result was received for service '%s' on host '%s', but the service could not be found!'', double check the passive service definitions: each child check needs a passive service with exactly the same name on that host.
  * If you want to test with an XML file, you have to use a pipe, e.g. ''cat test.xml | check_multi -f - -r 8192''.
  * A common mistake is to specify the cmd file on the receiver side of the pipe, e.g. ''check_multi ... | check_multi -f - -f xyz.cmd''.\\ This makes you monitor your Nagios server, not the remote host. Please recall that the pipe has two parts:\\ - the sender part on the remote side, where the information is gathered (-> input)\\ - the local receiver part on the Nagios server, where the information is presented (-> output).
(Hint: ''check_multi -f -'' reads from STDIN)

======= Performance benchmarking =======

**Hardware**
  * Dual core Athlon X2/64 3600 with 1 GB RAM

**Nagios configuration**
  * 1000 hosts
  * 1000 active services
  * 25000 fed passive services

==== SAR stats ====
<code>
# sar -u
06:00:00 PM       CPU     %user     %nice   %system   %iowait     %idle
06:00:01 PM       all     32.28      0.00     28.37      0.71     38.63
06:10:01 PM       all     31.66      0.00     27.86      1.05     39.43
06:20:31 PM       all     31.60      0.00     28.06      1.25     39.09
06:30:01 PM       all     31.61      0.00     28.40      1.19     38.79
06:40:01 PM       all     31.55      0.00     28.39      1.16     38.90
06:50:01 PM       all     33.68      0.00     28.54      0.92     36.86

# sar -q
06:00:00 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
06:00:01 PM         3       166      1.99      2.52      3.75
06:10:01 PM         3       158      1.91      2.19      2.95
06:20:31 PM         2       155      1.53      1.93      2.44
06:30:01 PM         2       159      2.22      2.11      2.28
06:40:01 PM         2       155      1.76      1.96      2.10
06:50:01 PM         2       165      1.90      2.14      2.14
</code>

==== Nagiostats ====
<code>
Nagios Stats 3.1.2
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 06-23-2009
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 5s
Status File Version:                    3.1.2

Program Running Time:                   0d 1h 40m 5s
Nagios PID:                             4112
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         26001
Services Checked:                       26001
Services Scheduled:                     1001
Services Actively Checked:              1001
Services Passively Checked:             25000
Total Service State Change:             0.000 / 7.760 / 0.327 %
Active Service Latency:                 0.000 / 1.054 / 0.160 sec
Active Service Execution Time:          0.300 / 3.266 / 0.917 sec
Active Service State Change:            3.750 / 7.760 / 4.246 %
Active Services Last 1/5/15/60 min:     187 / 988 / 1001 / 1001
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 7.760 / 0.170 %
Passive Services Last 1/5/15/60 min:    4694 / 24676 / 25000 / 25000
Services Ok/Warn/Unk/Crit:              26001 / 0 / 0 / 0
Services Flapping:                      186
Services In Downtime:                   0

Total Hosts:                            1002
Hosts Checked:                          1002
Hosts Scheduled:                        1002
Hosts Actively Checked:                 1002
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.971 / 2.042 / 1.366 sec
Active Host Execution Time:             0.024 / 1.150 / 0.065 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        185 / 935 / 1002 / 1002
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  1002 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     220 / 968 / 2906
   Scheduled:                           220 / 968 / 2906
   On-demand:                           0 / 0 / 0
   Parallel:                            220 / 968 / 2906
   Serial:                              0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  200 / 1001 / 3003
   Scheduled:                           200 / 1001 / 3003
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 4975 / 25000 / 75000

External Commands Last 1/5/15 min:      0 / 0 / 0
</code>
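
As a rough plausibility check of the benchmark figures, the throughput can be worked out from the configuration above. This is only a back-of-the-envelope sketch: the 300 second interval is an assumption (''check_interval 5'' as in the service example, combined with Nagios' default interval length of 60 seconds), and the variable names are made up for illustration.

```shell
#!/bin/sh
# Figures taken from the benchmark configuration on this page.
feeds=1000        # active check_multi feed services
passive=25000     # passive services fed by them
interval_s=300    # assumption: check_interval 5 with a 60 s interval length

# Integer division is exact here: 25000 / 1000.
children_per_feed=$((passive / feeds))

# Passive results that Nagios has to process per second on average.
rate=$(awk -v n="$passive" -v t="$interval_s" 'BEGIN { printf "%.1f", n / t }')

echo "child checks per feed:      $children_per_feed"
echo "passive results per second: $rate"
```

This yields 25 child checks per feed and roughly 83 passive results per second, which is in line with the ''Passive Service Checks Last 1/5/15 min'' line of the nagiostats output (4975 in the last minute, 25000 in the last five minutes).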