DISCERN: Distributed Intelligent System for Control, Evaluation, and Reporting on Networks.

This is my "scratch" page for my master's project. Don't read too deeply into what's on here; once things are more concrete I'll say so.

I'm working with Warren Matthews, Nick Feamster, and Constantine Dovrolis at Georgia Tech.

Problem Summary

Network monitoring is a very important part of maintaining reliability in large-scale systems, and much work has been done on how to collect performance data, how to store it centrally, and how to analyze it. However, as systems grow larger and algorithms grow more complex, the resources required to perform centralized analysis effectively become massive. Multi-terabyte disk arrays are required to store all of the information, and the computational overhead prevents effective analysis from happening in real time. Existing tools are very useful for after-the-fact analysis, but operators often still must wait for a phone call from a user experiencing a problem before they begin troubleshooting. The operator must then sift through all the different data sources and reports to narrow down the problem, which often leads to manually connecting to switches and routers and running network diagnostic tools like ping and traceroute by hand. This is not an effective way to find or solve problems, and it does not scale as the number of nodes on the network increases.

Proposed Solution

Several key components are required to begin to approach a solution to this problem. The first, a distributed mesh of machines to do the work, is already available through the CPR project that I currently work on in RNOC at the Office of Information Technology at Georgia Tech. This platform currently performs active monitoring and stores the results centrally for after-the-fact analysis. The second component is a framework that allows tests to be run on the machines, allows data to be collected and archived, and allows initial analysis to be performed by the monitoring machines themselves. Several frameworks already exist for running tests and sharing data; I'm evaluating whether to use one of the following:

  • Our existing homegrown SSH system that uses keys to allow machines to talk to one another
  • University of Washington’s Scriptroute tool
  • ESnet/GEANT2/Internet2's perfSONAR platform
  • NetRadar from the University of Delaware (not to be confused with NetRadar from a student at Johns Hopkins University, but maybe we'll use both)

The last component is some form of automated diagnostic system that takes in reports of irregularities, compares them with the network topology, and decides what tests need to be run. This automated system will then run tests as needed on machines using the testing framework and present a report to operators with all of the data, a suggestion of what the problem might be (e.g., "Router X is acting weird"), and links to graphs and data so the operator can do more in-depth research if needed. Initially this automated system would run centrally, but the end goal is to have automated analysis running on every machine in the mesh of monitoring nodes, so that any one of them can report a problem it sees before a user has to phone in. We also want to avoid the torrent of emails that tools like Nagios send when something goes down: an administrator only needs one email or page for a problem, not hundreds.
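To make the "one email, not hundreds" idea concrete, here is a minimal sketch of alert consolidation. It is only an illustration, not the project's design: the `Alert` fields, the `consolidate` function, and the 300-second window are all hypothetical choices. Alerts that point at the same suspected element and arrive within the window of the first alert in a group collapse into one summary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical; a real system would define its own schema.
    timestamp: float  # seconds since some epoch
    node: str         # monitoring node that raised the alert
    suspect: str      # network element the alert points at
    message: str


def _summarize(suspect, group):
    """Collapse a group of related alerts into a single summary record."""
    return {
        "suspect": suspect,
        "first_seen": group[0].timestamp,
        "count": len(group),
        "reporting_nodes": sorted({a.node for a in group}),
    }


def consolidate(alerts, window=300.0):
    """Group alerts by suspect, splitting a group when an alert arrives more
    than `window` seconds after the first alert of the current group."""
    by_suspect = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_suspect[alert.suspect].append(alert)

    summaries = []
    for suspect, group in by_suspect.items():
        current = [group[0]]
        for alert in group[1:]:
            if alert.timestamp - current[0].timestamp <= window:
                current.append(alert)
            else:
                summaries.append(_summarize(suspect, current))
                current = [alert]
        summaries.append(_summarize(suspect, current))
    return sorted(summaries, key=lambda s: s["first_seen"])


if __name__ == "__main__":
    raw = [
        Alert(0, "n1", "routerX", "loss"),
        Alert(10, "n2", "routerX", "loss"),
        Alert(20, "n3", "routerX", "latency"),
        Alert(5, "n2", "switchY", "down"),
    ]
    for summary in consolidate(raw):
        print(summary)
```

Grouping on a suspected element assumes some earlier step has already mapped each raw alert to a likely culprit; comparing reports against the topology, as described above, is one way that mapping could be produced.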

Things I'm working on

  • I'm employed at OIT working on things related to this project. A lot of my time goes into the operational side of things. There are scripts to set up new machines and add them to the mesh, manage the machines, send out updates, generate statistics, etc. I'm also doing some work on packaging up some of the tools from Internet2 that we are using, with hopes of them making it into the Red Hat and FreeBSD (and maybe Gentoo) package trees.
  • My project for Discrete Algorithms this semester will be on distributed algorithms for fault detection. Right now it's just something along the lines of this, but I need to produce "some meaningful application of these algorithms, so there needs to be some testing or innovations."
  • My networking class project is completely unrelated, alas.
  • Everything comes together in my master's project, which will likely be 3 hours a semester for 3 semesters.

Deliverables

This research has several goals. The current primary goal is the completion of a three-semester master's project, but progress will be measured by the other goals: several papers and a functioning system.

  • Fall 2006 MS Project Paper: building a new network monitoring system (for OSDI, Nick interested)
    • analysis of the current state of network monitoring (centralized data warehousing/analysis; see datapository/openview/snmp/etc)
    • our new collection and analysis infrastructure
    • explanation of the distributed approach and its scalability advantages
    • modularized data collection and reporting (inputs: smokeping/configs/traceroute/etc, outputs: statistics/etc)
    • smart network problem detection and automated in-depth analysis
    • consolidation of problem report data (no more 100s of nagios emails)
    • comprehensive reporting to allow operators to work more quickly and effectively
    • building the system to respond in real time
    • As you also said below, it's one thing to propose the algorithms and quite another to build the systems. This goes beyond just validating algorithms with real data; it also asks "can we actually build something useful with these algorithms?" - Nick
  • Fall 2006 Algorithm Design Class Project paper/implementation:
    • Implementing Duffield's "Simple Network Performance Tomography"
    • Combine existing smokeping data with layer 2 traceroute of mesh
    • Hopefully also run on a network simulator where we can introduce down links
  • Paper: extending distributed triggers and binary tomography by combining spatial and temporal correlations
    • overview of existing theory - improving existing techniques (or proposing new ones that work, in the case that the old ones are broken) - Nick
    • lots of work done on distributed triggers, but no one has implemented anything; verification only from simulations, usually on networks with a small number of links
    • distributed triggers: Joe Hellerstein hellerstein at cs dot berkeley dot edu (from Nick)
    • why combining spatial and temporal correlations is different and a Good Thing
    • in plain English: taking into account both trends over time and the physical layout of the network at the same time, so that algorithms can more effectively (and hopefully with less data) decide what indicates a problem
  • Paper: enhanced automated fault detection
    • overview of existing platform from 1/2 and existing algorithms
    • enhancements of spatial/temporal correlations
    • scaling time/day/week/month dependent statistics and comparisons
    • just in time data and trend distribution
  • The functional system
    • full-mesh trust scheme on cpr nodes so that any node can communicate with any other
    • immediate on-node analysis of data before it is stored locally for submission to the central statistics server
    • automated central statistical processing and interface for other machines to request statistics from central server
    • test result verification on cpr nodes (re-run tests), followed by comparison to central statistics by cpr node
    • Additional tests/data collection to respond to verifiable problems
    • report generation for centralized submission
    • centralized report condensation and reporting to operators
    • eventually, distributed report consolidation: i.e., the first node to see a problem correlates other nodes' reports and alerts the operator directly
  • paper(s)
    • Proving or disproving existing papers on tomography algorithms
    • We can use CPR to experimentally validate (or invalidate) some of the previously proposed tomographic techniques. I think this would be an interesting and important work. Especially if it turns out that these techniques do not work, for reasons that are either algorithmic or practical, I think that it will attract major interest. - Constantine
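As a rough illustration of the binary tomography item in the class-project deliverable above, here is a minimal sketch of the idea as I understand it: on a tree of paths from one source to many destinations, blame the link closest to the source under which every destination looks bad. This is my own simplified reading, not a faithful reimplementation of the published algorithm, and the tree layout and the function names (`all_bad_subtrees`, `suspect_links`) are hypothetical.

```python
def all_bad_subtrees(children, root, leaf_is_bad):
    """Return {node: True if every leaf below that node had a bad path}."""
    status = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            # A leaf is a measured destination: bad or good by observation.
            status[node] = leaf_is_bad[node]
        else:
            # Build the full list first so every child is visited.
            results = [visit(kid) for kid in kids]
            status[node] = all(results)
        return status[node]

    visit(root)
    return status


def suspect_links(children, root, leaf_is_bad):
    """Return the topmost links whose whole subtree is bad.

    Walk down from the root; when a child's subtree is entirely bad, blame
    the link into that child and do not descend further.
    """
    status = all_bad_subtrees(children, root, leaf_is_bad)
    suspects = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            if status[child]:
                suspects.append((parent, child))
            else:
                stack.append(child)
    return sorted(suspects)


if __name__ == "__main__":
    # Hypothetical topology: src -> r1 -> {r2 -> {a, b}, r3 -> {c, d}}
    tree = {"src": ["r1"], "r1": ["r2", "r3"], "r2": ["a", "b"], "r3": ["c", "d"]}
    observed = {"a": True, "b": True, "c": False, "d": True}
    print(suspect_links(tree, "src", observed))
```

In this example both destinations behind r2 are bad, so the link r1 to r2 is blamed rather than the two links below it, while d is blamed on its own last hop because its sibling c is fine. Real measurements would need a threshold to turn loss or delay into a good/bad label, which this sketch leaves out.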

Some conferences to think about for possible paper submission:

References