CPR, SNMP, and the Sun T2000

A while back I got a Sun Microsystems T2000 server for a few months as part of their try and buy program. It went into the Research Network Operations Center rack in OIT at Georgia Tech, and we went to town on it, seeing what the box was capable of doing. This was a 4 core box with 4GB of RAM in it, and the beauty of this server is that in this configuration it can handle 32 simultaneous threads in hardware on the CPU at any one time. The clock speed is only 1GHz, but it's somewhat like having a 32 CPU machine. Better descriptions of this are available if you Google around a little bit, but here are our experiences with it.

Most of my tests focused on MySQL performance. The project I'm working on, CPR: Campus-Wide Network Performance Monitoring and Recovery, is a full mesh network of monitoring machines that all report back _lots_ of data. I wrote a little parody of a Microsoft SQL Server ad to sum up what we do with databases here:
"A University Research Network Analyzing Over 2 Million Records a Day Running on MySQL 4.

How does Georgia Tech predict failures for its 57,370 network ports on 1765 switches in 188 buildings? They import data from 54 systems into one data warehouse requiring over 100 million rows, all running on MySQL 4 with no downtime that's ever been noticed. Current deployment rates indicate that over the next 12 months, the system will acquire around 1 billion rows of additional data."
This is all currently running on a slightly dated Sun box, and we ran into some perhaps unique performance issues. Originally all of the machines reported back on the hour at the same time. Over 50 machines initiating SCP sessions at exactly the same time was bringing the box to its knees, and the data import script would take a while to run. Load averages on the box stayed over 8 most of the time, and if something went wrong it took days to get caught back up. We have been able to fix a lot of this by staggering the reporting times and rewriting the conversion, import, and archiving scripts, but more hardware would probably help, so I did some testing on the T2000.
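As a rough illustration of the staggering idea, here is a minimal Python sketch that spreads reporting machines evenly across the hour so they do not all open SCP sessions at once. This is not the actual CPR scheduling code; the function name `stagger_offsets` and the host names are hypothetical and exist only for the example.

```python
def stagger_offsets(hosts, window_seconds=3600):
    """Assign each host a start offset (in seconds) within the window.

    Hosts are sorted first so the assignment is deterministic, then
    spaced evenly so no two hosts begin their upload at the same time
    (as long as there are no more hosts than seconds in the window).
    """
    hosts = sorted(hosts)
    if not hosts:
        return {}
    step = window_seconds / len(hosts)
    return {host: int(i * step) for i, host in enumerate(hosts)}


if __name__ == "__main__":
    # Hypothetical host names; 54 matches the system count quoted above.
    machines = [f"cpr-node-{n:02d}" for n in range(54)]
    for host, offset in stagger_offsets(machines).items():
        print(f"{host}: report at minute {offset // 60:02d}, second {offset % 60:02d}")
```

With 54 machines this puts a little over a minute between uploads, instead of 54 simultaneous sessions at the top of the hour.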

The primary latency in what we do seems to be the disk IO for MySQL, and the T2000 was of little to no help with this. Additionally, the T2000 was unable to keep the entire database in RAM due to the way that memory is partitioned on a per-CPU basis. This meant slower searches (and some of our searches take over 10 minutes to run on the existing hardware) and no noticeable increase in insert performance. I didn't have time to test the simultaneous SSH connections from the CPR network, but performance there probably would have improved. Later on we would like every machine to report back every 5 minutes after running its test and push the data into the database right then, instead of scheduling the import separately. That would give us a more real-time picture of the network and would most likely be greatly helped by the T2000's architecture.

After my testing was finished, I handed off the box to Jerry Swann, another OIT employee who runs the SNMP monitoring for the campus network. Here is what he had to say about it:
We currently have an Ultra2, circa 1998, that is performing SNMP monitoring via MRTG of about 100 of our core network routers/switches and firewall instances. The system in question has been running OK since it was placed into service, but it has been unable to keep up with the new demands of monitoring all the edge devices we have on campus, due to load issues on the Ultra2.

Then we got the 4 core T2000. As a test of the box's backward compatibility, we copied the present polling directory structure from the old system to the new, including the program executables and binary data files. With minor changes, a symlink here and there to fix the program paths (/usr/local/bin vs /usr/bin), the code ran. Not only that, but it ran fast. Where before we had to control the number of devices polled in order to keep the total polling time under our 5 minute polling interval, the T2000 was able to poll all 100 of the previously configured devices in 50 seconds (down from 4min 30sec).

Well, that isn't much load, especially since I was trying to see what this new box can do. So I decided to use a list of all the network devices that our network admins maintain in a database called the Book of Knowledge to see if that would load up the T2000. There were now a total of 866 devices: all of the network pollable core routers, core switches, edge switches/stacks, and firewall instances.

After configuring the polling software to allow 300 concurrent processes, I started the poller polling. It was awesome: the load went up to 50 within a couple of minutes. It then topped out at about 75, and surprisingly enough the box was still usable for other things.

What was really great was that where the Ultra2 was monitoring a maximum of 1050 interfaces on those aforementioned 100 devices, the T2000 was now monitoring about 28000 interfaces every 5 minutes.

When you combine the fact that I only had to make minor changes to the code with the fact that, thanks to that backward compatibility, the box ran the same software so much faster, you really can't beat the T2000. Buy one today.
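To put Jerry's numbers in perspective, here is a small back-of-the-envelope calculation in Python. It uses only the figures quoted above (4min 30sec vs 50 seconds for 100 devices, 1050 vs about 28000 interfaces, and a 5 minute polling interval). The helper names are made up for this example, and the results are rough ratios, not benchmark measurements.

```python
POLL_INTERVAL_S = 5 * 60  # the 5 minute polling interval


def speedup(old_seconds, new_seconds):
    """How many times faster the new polling run is than the old one."""
    return old_seconds / new_seconds


def required_rate(interfaces, interval_seconds=POLL_INTERVAL_S):
    """Average interfaces per second needed to finish inside one interval."""
    return interfaces / interval_seconds


if __name__ == "__main__":
    # 100 devices: 4min 30sec on the Ultra2, 50 seconds on the T2000.
    print(f"polling speedup: {speedup(4 * 60 + 30, 50):.1f}x")
    # Interfaces monitored: 1050 on the Ultra2, about 28000 on the T2000.
    print(f"interface growth: {28000 / 1050:.1f}x")
    print(f"Ultra2 average: {required_rate(1050):.1f} interfaces/sec")
    print(f"T2000 average:  {required_rate(28000):.1f} interfaces/sec")
```

In other words, the same 100 devices polled a bit over five times faster, and the full device list works out to roughly 93 interfaces per second sustained across each 5 minute interval.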

All in all it was a pretty well powered box and a lot of fun to play with. RNOC ended up ordering 4 of the 8 core T2000s for various things, one of which will definitely be the SNMP monitoring. If you have similar tasks to do, the T2000 is highly recommended, especially if you are already running on a Solaris system.
