My research project at work, CPR, consists of several "networks" of machines that each run full-mesh end-to-end measurements between themselves and the other machines in their network. Our biggest network is approaching 100 machines, but there are several other active networks. Each network has its configuration information in a centralized database and is managed by a tool called cpradmin. Initially there was just one network, so the tool was developed with one network in mind, but as the project grew I hackishly added support for multiple networks by letting the network name select the MySQL database for that particular network. Cpradmin sets up new machines and does things like configuring Smokeping, iptables, etc. on them. Cpradmin is also responsible for generating Nagios configuration information for each node: for example, the campus CPR nodes each check the availability of campus services such as IMAP, DNS, and DHCP.
As the number of nodes grew, we could no longer depend on our eyes to make sure they were all working, so I set up a central installation of Nagios to monitor reachability to the nodes and to verify that all the monitoring processes were running. This reused the Nagios configuration generation script already used for the nodes, and it worked fine at first, but because different tests run on each network of machines, it started to get out of hand. Additionally, the central install was monitoring just one network, and any hosts on other networks had to be added individually. That was obviously not good because it required human intervention, so I started on a new tool.
It's still not quite finished, but it's looking pretty good. Instead of using specific configuration information for each host, the new tool uses the general configuration information about the types of tests running in each network. For example, all campus nodes should have arpwatch running. The new script generates a comprehensive Nagios configuration file based on the types of services running in each network on each host, as well as things like disk space, system load, and latency. Whenever a new node is added to a mesh, cpradmin no longer has to add specific files to a list; it just regenerates the configuration file and sends Nagios a signal to reload its configuration. When this is all finished, we will no longer have to intervene manually with Nagios at all, and it will monitor close to 150 hosts with a total of close to 1000 tests and email us whenever the disk on one of them fills up.
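To make the idea concrete, here is a rough Python sketch of generating Nagios object definitions from per-network service types rather than per-host settings. It is not the actual script: the network names, host names, `check_<service>` command names, and the `generic-host` / `generic-service` templates are all assumptions for illustration, and the real tool would read its inventory from the database instead of hard-coded dictionaries.

```python
# Hypothetical data: checks that every host in a given network should have.
NETWORK_SERVICES = {
    "campus": ["arpwatch", "smokeping"],
    "wan": ["smokeping"],
}

# Checks applied to every host regardless of network.
COMMON_SERVICES = ["disk", "load", "latency"]

# Hypothetical inventory; a real tool would query this from its database.
NETWORK_HOSTS = {
    "campus": ["cpr-campus-01", "cpr-campus-02"],
    "wan": ["cpr-wan-01"],
}


def host_block(host):
    """Return a Nagios host definition (template name is an assumption)."""
    return (
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {host}\n"
        "}\n"
    )


def service_block(host, service):
    """Return a Nagios service definition for one check on one host."""
    return (
        "define service {\n"
        "    use                  generic-service\n"
        f"    host_name            {host}\n"
        f"    service_description  {service}\n"
        f"    check_command        check_{service}\n"  # hypothetical command
        "}\n"
    )


def generate_config(network_hosts, network_services, common_services):
    """Build one config string covering every host in every network."""
    blocks = []
    for network in sorted(network_hosts):
        # A host's checks come from its network, not from per-host settings.
        services = common_services + network_services.get(network, [])
        for host in sorted(network_hosts[network]):
            blocks.append(host_block(host))
            blocks.extend(service_block(host, s) for s in services)
    return "\n".join(blocks)


if __name__ == "__main__":
    print(generate_config(NETWORK_HOSTS, NETWORK_SERVICES, COMMON_SERVICES))
```

With this shape, adding a node only means adding it to the inventory and regenerating the file; nothing host-specific has to be written by hand.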
Up next is giving cpradmin real multi-network support so that it can do software upgrades on multiple networks all at once. (I've already completed multithreading support, so upgrading Smokeping on all the machines in a network takes only about as long as installing it on one node. The catch is that cpradmin has to be run once for each network, so the total is that time multiplied by the number of networks.)
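A minimal sketch of what "all networks at once" could look like, assuming a thread pool that treats every (network, node) pair as one job. The `upgrade_node` function is a hypothetical placeholder for whatever cpradmin really does on a node, and the network and host names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def upgrade_node(network, node, package):
    """Placeholder for the real per-node upgrade (hypothetical)."""
    return f"{network}/{node}: {package} upgraded"


def upgrade_all(networks, package, max_workers=32):
    """Upgrade every node in every network concurrently.

    `networks` maps a network name to its list of node names.
    Results come back in submission order.
    """
    jobs = [(net, node) for net, nodes in networks.items() for node in nodes]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upgrade_node, net, node, package) for net, node in jobs
        ]
        # result() re-raises any exception from the worker thread.
        return [f.result() for f in futures]


if __name__ == "__main__":
    example = {
        "campus": ["cpr-campus-01", "cpr-campus-02"],
        "wan": ["cpr-wan-01"],
    }
    for line in upgrade_all(example, "smokeping"):
        print(line)
```

Flattening the networks into a single job list means the wall-clock time is bounded by the slowest node and the pool size rather than by the number of networks.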