Research
Monitoring a centrally managed system with Nagios
Submitted by ckdake on Mon, 2007-10-15 17:37
My research project at work, CPR, consists of several "networks" of machines that each run full-mesh end-to-end measurements between themselves and the other machines in their network. Our biggest network is approaching 100 machines, but there are several other active networks. Each of these has its configuration information in a centralized database and is managed by a tool called cpradmin. Initially there was just one network, so the tool was developed with one network in mind. As the project has grown, I hackishly added support for multiple networks by letting the network name select the MySQL database for that particular network. Cpradmin sets up new machines and does things like configuring Smokeping, iptables, etc. on them. Cpradmin is also responsible for generating Nagios configuration information for each node: for example, the campus CPR nodes each check the availability of campus services such as IMAP, DNS, and DHCP.
As the number of nodes grew, we could no longer depend on our eyes to make sure they were all working, so I set up a central installation of Nagios to monitor reachability to the nodes and to verify that all the monitoring processes were running. This reused the existing Nagios configuration generation script that was written for the nodes and worked fine at first, but because each network of machines runs different tests, it started to get out of hand. Additionally, the central install was monitoring just one network, and any hosts on other networks had to be added individually. That was obviously not good because it required human intervention, so I started on a new tool.
It's still not quite finished, but it's looking pretty good. Instead of using specific configuration information for each host, the new tool uses the general configuration information about the types of tests running in each network. For example, all campus nodes should have arpwatch running. This new script generates a comprehensive Nagios configuration file based on the types of services running in each network on each host, as well as things like disk space, system load, latency, etc. Whenever a new node is added to a mesh, cpradmin no longer has to add specific files to a list; it just regenerates the configuration file and sends Nagios a signal to reload its configuration. When this is all finished, we will no longer have to manually intervene with Nagios at all, and it will monitor close to 150 hosts with a total of close to 1000 tests and email us whenever the disk on one of them fills up.
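To make the idea concrete, here is a minimal sketch of generating Nagios object definitions from per-network service lists. It is not the actual cpradmin code: the network names, host names, addresses, and `check_*` command names are all made up for illustration, and the `generic-host` and `generic-service` templates are assumed to be defined elsewhere in the Nagios configuration. In the real setup the data would come from the per-network databases rather than from hardcoded dictionaries.

```python
"""Sketch: build Nagios object definitions from per-network service lists."""

# Hypothetical data standing in for what the per-network databases hold.
NETWORK_SERVICES = {
    "campus": ["arpwatch", "smokeping"],
    "wan": ["smokeping"],
}
COMMON_CHECKS = ["disk", "load", "ping"]  # checked on every host

HOSTS = [
    {"name": "campus-node1", "address": "192.0.2.10", "network": "campus"},
    {"name": "wan-node1", "address": "192.0.2.20", "network": "wan"},
]


def render_object(kind, fields):
    """Render one 'define <kind> { ... }' block with aligned directives."""
    width = max(len(key) for key in fields)
    lines = [f"define {kind} {{"]
    lines += [f"    {key.ljust(width)}  {value}" for key, value in fields.items()]
    lines.append("}")
    return "\n".join(lines)


def generate_config(hosts, network_services, common_checks):
    """Return one config string: a host block plus its service blocks per host."""
    blocks = []
    for host in hosts:
        blocks.append(render_object("host", {
            "use": "generic-host",          # assumed template name
            "host_name": host["name"],
            "address": host["address"],
        }))
        # Every host gets the common checks plus its network's services.
        for check in common_checks + network_services[host["network"]]:
            blocks.append(render_object("service", {
                "use": "generic-service",   # assumed template name
                "host_name": host["name"],
                "service_description": check,
                "check_command": f"check_{check}",  # hypothetical command names
            }))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(generate_config(HOSTS, NETWORK_SERVICES, COMMON_CHECKS))
```

Adding a node then only means adding one entry to the host list and regenerating the file. The reload step is left out of the sketch; Nagios re-reads its configuration when the daemon receives SIGHUP, so the generator would write the file and then signal the running process.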
Up next is giving cpradmin real multi-network support so that it can do software upgrades on multiple networks all at once. (I've already completed multithreading support, so upgrading Smokeping on all the machines in a network takes only about as long as installing it on one node, but cpradmin still has to be run once for each network.)
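As a rough sketch of what "all at once" could look like, the snippet below flattens every network's nodes into one job list and runs the jobs on a single thread pool. This is an illustration rather than how cpradmin works: the inventory, the `upgrade_node` placeholder, and the `max_workers` value are all assumptions, and the real per-node work (logging in and installing a package) is replaced by a function that just returns a status string.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inventory: network name -> list of node names.
NETWORKS = {
    "campus": ["campus-node1", "campus-node2"],
    "wan": ["wan-node1"],
}


def upgrade_node(network, node, package):
    """Placeholder for the real per-node upgrade; only reports what it would do."""
    return f"{network}/{node}: upgraded {package}"


def upgrade_all(networks, package, max_workers=32):
    """Upgrade every node of every network in parallel, in one run."""
    jobs = [(net, node) for net, nodes in networks.items() for node in nodes]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upgrade_node, net, node, package)
                   for net, node in jobs]
        # Collect results in submission order; result() re-raises any
        # exception from the worker, so a failed node is not silently lost.
        return [future.result() for future in futures]


if __name__ == "__main__":
    for line in upgrade_all(NETWORKS, "smokeping"):
        print(line)
```

With one pool shared across networks, the wall-clock time is bounded by the slowest node (given enough workers) instead of adding up one run per network.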
So it begins.
Submitted by ckdake on Thu, 2007-01-18 00:17
It's week 2 of semester number who knows. It's a big number. Anyway, things are slowly settling in, and here's what I'll be spending my next few months doing:
CS 4270 - Data Communications Lab - Playing with routers and protocols and all that. It's a hands-on lab class that will typically occupy most of my Wednesday evening but not require a whole lot outside of that. My lab partner is Terry Turner from OIT, which is pretty convenient as I'll be working on other things with him at work this semester.
CS 6255 - Network Management - What I do at work and have been doing research on, except this time it's a class built around group projects. Russ Clark from OIT (RNOC) is teaching this one. I work with him in OIT pretty regularly, and my project for class will overlap somewhat with work. If there are enough groups doing work related to CPR, I may have a full plate helping them out, but if there aren't, I may do a project on my own or with a partner. The current candidate is "Managing mesh networks with GTSWD and CFengine" because we're working to make the configuration management of CPR nodes a bit easier and GTSWD is just about ready for prime time.
CS 7260 - Internetworking Architectures and Protocols - More networking class! This time with Nick Feamster, who is also my graduate project adviser. This is another group project class, and it looks like Chris Lewis and I will be working together again. We haven't decided on exactly what yet, but our last project was pretty awesome and we're thinking of building on it, somehow using libnetfilter_queue to pull packets out of the kernel into a userspace program and tinker with them there (encryption, compression, port changing, who knows). We'll see.
I'll also be attending the Computer Networking and Telecommunications Seminar, attending the NTG Student Reading Group, working on my research project, continuing to work at OIT in RNOC on CPR, doing some freelance top-secret web application architecture consulting, and somehow finding time for the rest of the now-regular craziness, including Thursday night mountain biking, the Friday night "midnight ride", etc. I also seem to have rediscovered my old habit of being more sociable than is probably good for me. Ah well, life is short!
For this summer, I've got an interview with Google sometime soon, maybe some others along the way, and I can certainly keep myself busy here at OIT and riding bikes if nothing else works out. Stay tuned...
An interesting place to be
Submitted by ckdake on Sun, 2006-10-08 17:38
Anyone who talks to me or digs around my website probably knows that I'm pretty deeply involved in a lot of things. Graduate school has been taking up most of my time over the last few weeks, with research related to computer science as well as public policy, and I've just been busy digging into things. As part of my research on distributed innovation for my public policy class, Technology, Regions, and Policy, I'm reading Thomas Friedman's "The World Is Flat." I've gotten about 1/3 of the way through it so far, and it, along with some discussions with people at work, got me thinking: I've gotten myself into a pretty specific niche, and it's going to be very interesting to see where it takes me. (Click read more)

