School
monitoring a centrally managed system with Nagios
Submitted by ckdake on Mon, 2007-10-15 17:37
My research project at work, CPR, consists of several "networks" of machines that each run full mesh end-to-end measurements between themselves and other machines in their network. Our biggest network is approaching 100 machines but there are several other active networks. Each of these has its configuration information in a centralized database and is managed by a tool called cpradmin. Initially, there was just one network, so the tool was developed with one network in mind, but as the project has grown, I hackishly added support for multiple networks by allowing the network name to specify the MySQL database for that particular network. Cpradmin sets up new machines and does things like configuring Smokeping, iptables, etc. on them. Cpradmin is also responsible for generating Nagios configuration information for each node: for example, the campus CPR nodes each check the availability of campus services such as IMAP, DNS, DHCP, etc.
As the number of nodes grew, we could no longer depend on our eyes to make sure they were all working, so I set up a central installation of Nagios to monitor reachability to the nodes and to verify that all the monitoring processes were running. This used the existing Nagios configuration generation script that was used for the nodes and worked fine, but due to the different tests being run on each network of machines, this started to get out of hand. Additionally, the central install was just monitoring one network and any hosts on other networks needed to be added individually. That obviously wasn't good because it required human intervention, so I started on a new tool.
It's still not quite finished, but is looking pretty good. Instead of using specific configuration information for each host, the new tool uses the general configuration information about the types of tests running in each network. For example, all campus nodes should have arpwatch running. This new script generates a comprehensive Nagios configuration file based on the types of services running in each network on each host, as well as things like disk space, system load, latency, etc. Whenever a new node is added to a mesh, cpradmin no longer has to add specific files to a list; it just regenerates the configuration file and sends Nagios a signal to reload its configuration. When this is all finished, we will no longer have to manually intervene with Nagios at all, and it will monitor close to 150 hosts with a total of close to 1000 tests and email us whenever the disk on one of them fills up.
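To make the "generate everything from per-network service types" idea concrete, here is a minimal sketch in Python. It is not the actual cpradmin code: the `NETWORKS` dictionary, the host names and addresses, and the `check_nrpe!check_proc_...` command names are all made up for illustration, and it assumes the stock `generic-host` and `generic-service` templates from the Nagios sample configuration are defined elsewhere.

```python
# Hypothetical per-network data. In the post this lives in the cpradmin
# database, whose schema is not shown, so this shape is an assumption.
NETWORKS = {
    "campus": {
        "hosts": {"cpr-campus-01": "10.0.0.11", "cpr-campus-02": "10.0.0.12"},
        "services": ["arpwatch", "smokeping"],
    },
    "wan": {
        "hosts": {"cpr-wan-01": "10.1.0.11"},
        "services": ["smokeping"],
    },
}

# Checks applied to every host regardless of network (command names assumed).
COMMON_CHECKS = {
    "Disk space": "check_nrpe!check_disk",
    "System load": "check_nrpe!check_load",
}


def host_block(name, address):
    """Render one Nagios host object definition."""
    return (
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def service_block(host, description, command):
    """Render one Nagios service object definition for a host."""
    return (
        "define service {\n"
        "    use                  generic-service\n"
        f"    host_name            {host}\n"
        f"    service_description  {description}\n"
        f"    check_command        {command}\n"
        "}\n"
    )


def generate_config(networks):
    """Build one configuration string covering every host in every network."""
    blocks = []
    for net_name in sorted(networks):
        net = networks[net_name]
        for host in sorted(net["hosts"]):
            blocks.append(host_block(host, net["hosts"][host]))
            # Checks every node gets, whatever network it belongs to.
            for description, command in COMMON_CHECKS.items():
                blocks.append(service_block(host, description, command))
            # Checks derived from the services this network is supposed to run.
            for proc in net["services"]:
                blocks.append(
                    service_block(host, f"Process {proc}", f"check_nrpe!check_proc_{proc}")
                )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(generate_config(NETWORKS))
```

Adding a node then only means adding a row to the network's host list and regenerating. Writing the output to a file Nagios reads and sending the Nagios process a SIGHUP is one common way to do the "signal to reload" step.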
Up next is giving real multi-network support to cpradmin so that it can do software upgrades on multiple networks all at once. (I've already completed multithreading support, so upgrading Smokeping on all the machines in a network takes about as long as installing it on one node, but cpradmin still has to be run once for each network.)
people and toys
Submitted by ckdake on Tue, 2007-10-02 20:21
It's been a while since I've updated, but there are good reasons. This is my last semester of grad school so I have a few projects to work on:
- Advanced Operating Systems is a project every few weeks. So far I've written priority and co-scheduling components for a user space thread scheduling library that works on multiprocessors, and I'm working on code to implement a shared camera driver across UML instances
- Networked Applications and Services has me working with 2 other people on a semester long project to analyze the social graph created by forum postings on Faster Mustache.
- My final 3 hours of research for my master's project is spent on the data storage model for CPR. We now have a real time database, a plan for file archiving, and a plan for long term SQL accessible data archiving for the 300k or so rows that get added to the database every day.
- And then 3 hours are spent working on IMS, specifically building monitoring and deployment management tools for the carrier side of the system.
- And then, the usual 20-or-so-hour-a-week job working on other aspects of CPR.
Given that, my schedule is pretty full, but there is still some time for other things. Two weekends ago was the third annual Gallery Developer Conference in San Francisco. It was a blast, as usual, and I took my share of pictures as well. The following week I got to attend a talk by Jim Lovell about his experiences in the Apollo program, followed by the GTISC Annual Security Summit with speakers such as Vint Cerf (one of the founders of the internet), the Information Assurance Technical Director from the NSA, and various influential people from the security industry.
Then there are the toys. First came a Garmin 60CSx GPS receiver. It does fun things like tracking where I bike, letting me tag places, and uploading all of the information into Google Earth. One of these days all that information will have a run-in with my photo library and there will be maps with pictures and so on. I'm looking forward to that day but it's a long way out. On a related note, I picked up a scanner: the Epson 4990 Photo. It's great and I've been scanning the shoe boxes of prints from 1990-2000. Eventually these will make it online, but I need to get together with some old calendars at my parents' house to figure out when most of them were taken. Lastly, I replaced my Motorola L6 with an iPhone. My only complaint is that the 1.1 firmware broke 3rd party application support, but all I need is an SSH client so if Apple would just do that, it'd be great. The firmware update is important to me because it fixed what I see as a major problem: with the 1.0 firmware, the UI didn't warn you if IMAP server credentials changed. This means that the iPhone mail client would send your user name and password to any mail server (read: hacker) that showed up in DNS on the phone as your mail server. Given the iPhone's ability to connect to any old WiFi network, this could be pretty disastrous.
In other news, my leg is mostly healed up and I've almost gotten everything right on the newest addition to the bike collection: a Redline 925 and the collection of various parts I've somehow attached to it. (I had a bike shop cut the fork and press the headset for me since I don't have those tools yet.)
TFStat - Traffic Flow Statistics
Submitted by ckdake on Tue, 2007-05-01 10:09
Another class, another group project. For CS7260 - Internet Architectures & Protocols, Chris Lewis and I worked together again to build what we think is a pretty neat networking tool. The full details are available on a page here: tfstat, but the abstract is copied below. This was a pretty fun project to work on. I got to learn how to use libpcap and did a lot of multithreading with persistent shared objects in perl, and I'll likely continue work on this in the fall to submit to a conference. And having pretty graphs as results is always fun!
Traditionally, researchers who wish to look at traffic flow have one option – Netflow. However, Netflow only allows researchers to get a limited view of the big picture, as the data comes from core routers – it is very difficult to get a view of the network from an end user's perspective. TFStat is a set of tools that solves this problem. TFStat allows researchers to get such data from the vantage point of the end user, and has the added benefit of being compatible with Netflow. This paper will present the 2 major components of TFStat, discuss the implementation, look at some experimental results running the tools, and finally note some areas for future research to expand on the project.
Database Clustering
Submitted by ckdake on Mon, 2007-04-30 13:48
My group project in CS6255: Network Management this semester was on a data storage system for CPR (the OIT project at Georgia Tech that I spend most of my time working on). The problem we were attempting to solve is real-time access and long-term data archiving of lots of data. Currently, the CPR monitoring mesh has close to 100 machines. Every 5 minutes, each of these machines issues 20 pings to every other CPR machine. The math works out to over 2 million rows of data being stored in a database every single day. Only keeping track of a day or a week isn't really a problem for a somewhat beefy machine, but what about 2 years of data? How could we build a system that let us add more hardware in the future without system downtime or losing data? Given the requirements of the system:
- Data needs to be inserted into a database in real time
- Recent data needs to be accessible in real time
- Graphs need to be generated using statistics on older data based on the host, target, and a time range
we came up with a solution that seems to do the trick. Below is a list of system components and a short description of how they all work:
- An application server with a database of data from the past 24 hours. It processes new data and communicates with the storage cluster.
- Cluster of database machines: currently just MySQL but could be any other storage engine that perl's DBI supports. They all run listening on a port that is firewalled so that only designated application servers can connect to the database
- XML file containing a list of each machine in the cluster and a probabilistic weight for that machine.
- A weight calculation script (that I wrote) that takes into account some Industrial Engineering problem solving techniques to figure out a close to ideal probabilistic weight for each machine in the cluster based on how much data that machine has already, what the rate of incoming data is, and how far in the future we'd like all the machines to have the same amount of data on them.
- A perl script that processes incoming data in real time, stores it on the application server, and picks one random database cluster machine to store it on as well. The probability that a machine in the cluster will get it is the weight of that machine divided by the combined weight of all the machines in the cluster.
- A perl script for running SELECT queries against the data. It fires off a thread to communicate with each database so that a query can be dispatched to each machine in the cluster at the same time and the results can be combined as soon as they come in.
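The weighting and random placement described in the list above can be sketched in a few lines of Python. The selection step follows the rule stated in the list (weight of a machine divided by the combined weight). The weight calculation, however, is only one simple balancing rule picked for illustration, not the Industrial Engineering method the real script uses, and the machine names and row counts are invented.

```python
import random


def balancing_weights(current_rows, incoming_rate, horizon):
    """Weights aimed at equal row counts on every machine after `horizon`.

    Illustrative rule only: each machine is weighted by how many rows it
    still needs to reach the projected per-machine average. Machines that
    are already above that average get weight 0.
    """
    n = len(current_rows)
    projected_total = sum(current_rows.values()) + incoming_rate * horizon
    target = projected_total / n
    needs = {m: max(0.0, target - rows) for m, rows in current_rows.items()}
    total_need = sum(needs.values())
    if total_need == 0:
        # Nothing to balance (e.g. no incoming data): spread evenly.
        return {m: 1.0 / n for m in current_rows}
    return {m: need / total_need for m, need in needs.items()}


def pick_machine(weights, rng=random):
    """Pick one machine with probability weight / sum(all weights)."""
    machines = list(weights)
    return rng.choices(machines, weights=[weights[m] for m in machines], k=1)[0]


if __name__ == "__main__":
    # Hypothetical cluster: db4 was added late and holds less data.
    rows = {"db1": 9_000_000, "db2": 9_000_000, "db3": 9_000_000, "db4": 3_000_000}
    weights = balancing_weights(rows, incoming_rate=2_000_000, horizon=30)
    print(weights)
    print(pick_machine(weights))
```

Because the weights only bias a random choice, a newly added machine starts absorbing a larger share of inserts as soon as the weight file is regenerated, without moving any existing rows.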
This is all working and storing data. Right now we're up to about 30 million rows of data in a cluster of 4 P4 machines with 512MB of RAM each, and it performs noticeably better than the one huge table on the Sun server with 2GB of RAM. We were hoping to make a perl DBI module containing the insert and select related code, but ended up not having enough time for this, so applications will have to be slightly modified to use the storage cluster. However, even considering this, it's definitely worth it to go to our form of clustering when you need to store lots and lots of data on commodity hardware.
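For a feel of how the fan-out SELECT works, here is a rough Python sketch of the same pattern: one worker thread per database, results merged as they arrive. The real code is perl on top of DBI and MySQL. In-memory SQLite databases stand in for the cluster machines here, and the `pings` table and its columns are assumptions made up for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_shard(rows):
    """Build an in-memory SQLite database standing in for one cluster machine."""
    # check_same_thread=False lets a worker thread use this connection;
    # each connection is still only used by one thread at a time below.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE pings (host TEXT, target TEXT, ts INTEGER, rtt_ms REAL)")
    conn.executemany("INSERT INTO pings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def query_shard(conn, sql, params):
    """Run one query against one shard and return all of its rows."""
    return conn.execute(sql, params).fetchall()


def query_cluster(shards, sql, params=()):
    """Send the same query to every shard in parallel and merge the rows."""
    results = []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(query_shard, conn, sql, params) for conn in shards]
        # as_completed yields each future as soon as its shard has answered.
        for future in as_completed(futures):
            results.extend(future.result())
    return results


if __name__ == "__main__":
    shards = [
        make_shard([("a", "b", 1, 10.0), ("a", "b", 3, 12.0)]),
        make_shard([("a", "b", 2, 11.0), ("a", "c", 2, 50.0)]),
    ]
    sql = ("SELECT ts, rtt_ms FROM pings "
           "WHERE host = ? AND target = ? AND ts BETWEEN ? AND ?")
    for row in sorted(query_cluster(shards, sql, ("a", "b", 1, 3))):
        print(row)
```

Since each row lands on a random machine, the merged rows come back in no particular order and have to be sorted by the caller. Aggregates such as an average also can't simply be concatenated across machines; the caller has to combine per-machine sums and counts instead.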
Fall 2007 Classes
Submitted by ckdake on Fri, 2007-04-13 09:01
That was easy. I don't have a whole lot I need to do this Fall, and half of my classes are permits and don't require registering early or anything. It's 8:52am, and since my 8:45am registration time began I've signed up for everything and filled out all the forms I need to. Here's what it's looking like for this Fall:
- CS6210 - Advanced Operating Systems with Karsten Schwan (MWF 2-3)
- CS7270 - Network Applications and Services with Constantine Dovrolis (TR 3-4:30)
- CS8001 - Network Seminar (time TBD)
- CS8902 - Masters Project with Nick Feamster
- CS8903 - Special Topics - Something on IMS with Russ Clark
So it begins.
Submitted by ckdake on Thu, 2007-01-18 00:17
It's week 2 of semester number who knows. It's a big number. Anyways, things are slowly settling in and here's what I'll be spending my next few months doing:
CS 4270 - Data Communications Lab - Playing with routers and protocols and all that. It's a hands on lab class that will typically occupy most of my Wednesday evening but not require a whole lot outside of that. My lab partner is Terry Turner from OIT which is pretty convenient as I'll be working on other things with him at work this semester.
CS 6255 - Network Management - What I do at work and have been doing research on; this time it's a class focused on group projects. Russ Clark from OIT (RNOC) is teaching this one. I work with him in OIT pretty regularly and my project for class will have some overlap with work. If there are enough groups doing work related to CPR I may have a full plate helping them out, but if there aren't I may do a project on my own or with a partner. The current candidate is "Managing mesh networks with GTSWD and CFengine" because we're working to make the configuration management of CPR nodes a bit easier and GTSWD is just about ready for prime time.
CS 7260 - Internetworking Architectures and Protocols - More networking class! This time with Nick Feamster, who is also my graduate project adviser. This is another group project class and it looks like Chris Lewis and I will be working together again. We haven't decided on exactly what yet, but our last project was pretty awesome and we're thinking of building on it, somehow using libnetfilter_queue to grab packets from the kernel before they get to userspace and tinkering with them (encryption, compression, port changing, who knows). We'll see.
I'll also be attending the Computer Networking and Telecommunications Seminar, attending the NTG Student Reading Group, working on my research project, continuing to work at OIT in RNOC on CPR, doing some freelance top secret web application architecture consulting, and somehow finding time for the rest of the now regular craziness including Thursday night mountain biking, Friday night "midnight ride", etc. I also seem to have rediscovered my old habit of being more sociable than is probably good for me. Ah well, life is short!
For this summer, I've got an interview with Google sometime soon, maybe some others along the way, and I can certainly keep myself busy here at OIT and riding bikes if nothing else works out. Stay tuned...
Fall 2006 comes to an end.
Submitted by ckdake on Fri, 2006-12-15 16:16
I made it through the semester somehow, and have two additional things to share from it. First of all, my term paper for "Technology, Regions, and Policy" on Distributed Innovation is now available on my school projects page. (I also just uploaded a paper from last fall on software patents.) My semester project for "Discrete Algorithms", which was also part of my research this semester, is also now available on my FaultFinder research page. This will eventually get updated as I work more on it, but it should be at least a few weeks.
More to come next semester! (And I may get around to uploading some old school projects over winter break..)
Transparent TCP Stream Encryption
Submitted by ckdake on Wed, 2006-12-06 00:49
School this semester has been very busy, but things are getting done and I'll post the neat things here as I get finished. The first project I completed this semester was my group project for CS6250: Advanced Computer Networks aka Router Architectures and Algorithms.
My group, consisting of me, Zack A, and Chris L (a usual in my CS groups), designed and implemented ZCC: a set of very easy to use and manage tools for encrypting TCP traffic. The end result didn't turn out quite as we originally planned (for example, it only works with IPv4 currently), but the key functionality is there and we were all very excited about it. Chris L and I are actually hoping to build on this next semester as a project in CS 7260: Internet Architecture & Protocols.
The neatest part of this project for me was becoming very familiar with the ip_queue kernel module and libipq, which seems to be something that almost no one is familiar with. Its manual page here hasn't been updated in a while, and its successor, "libnetfilter_queue" (which we found out about too late to use), is only available in very new Linux kernels (2.6.14 or newer). If you Google around for libipq, you'll see that we didn't have a lot to work with or examples to go by. (We were trying to modify the packet data before pushing it back into the queue, while most examples just show how to accept or drop packets.)
Read on for the details..

