Building A New Network Monitoring System
DISCERN: Distributed Intelligent System for Control, Evaluation, and Reporting on Networks
A work in progress!
This draft has been moved to the GTNoise wiki; the end result will be posted back here.
Building a New Network Monitoring Infrastructure
Chris Kelly, Warren Matthews, Nick Feamster
{chris.kelly,warren.matthews}@oit.gatech.edu feamster@cc.gatech.edu
ABSTRACT
This paper proposes a distributed, scalable approach to network monitoring that addresses many shortcomings of current monitoring methods. Current methods suffer from at least one, and usually most, of the following limitations: a single point of view of the network, a single point of failure, reliance on network devices for reliable information, reliance solely on passive measurements, lack of support for new analysis methods or data sources, inability to scale to large networks, and inability to combine data from multiple sources. Our proposal, DISCERN, attempts to resolve each of these limitations with a modular, multi-tiered approach built on a distributed mesh of monitoring nodes.
Keywords
network monitoring, network performance, distributed triggers
1. Introduction
1.1 Current State of Monitoring
Network monitoring is essential to maintaining reliability in large-scale systems, and much work has been done on how to collect performance data, store it centrally, and analyze it. However, as systems grow larger and algorithms grow more complex, centralized analysis demands massive resources: multi-terabyte disk arrays are required to store all of the information, and the computational overhead prevents effective analysis in real time. Existing tools are very useful for after-the-fact analysis, but operators often still must wait for a phone call from a user experiencing a problem before beginning to troubleshoot. The operator must then sift through all the different data sources and reports to narrow down the problem, which often leads to manually connecting to switches and routers and running network diagnostic tools such as ping and traceroute by hand. This is not an effective way to find or solve problems, and it does not scale as the number of nodes on the network increases.
big problem: centralized view of the network; don't know what the user sees
see Datapository/OpenView/SNMP/etc.
goals include: data reduction, placement of computation/functionality, automated detection and comprehensive report generation
1.2 Goals
explanation of the distributed approach and its scalability advantages (old system: DDOS file uploading, centralized point of failure, multiple views of the network)
smart network problem detection and automated in-depth analysis (not relying on network devices, active and passive tests)
comprehensive reporting to allow operators to work more quickly and effectively (combining alarms, data from various sources, combination graphs of diverse data, taking network topology into account)
building the system to respond in real-time.
2. System Architecture
distributed mesh of measurement machines
distributed data collection and processing servers: receive data upload, share with top-tier data warehouse, do some analysis, admin tools (Explain push vs pull approaches, need for tiered due to SQL query length when all centralized)
admin tools for managing measurement machines, processing servers, etc (work at any level of the system)
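As a rough illustration of the push approach combined with data reduction on the measurement machines, the sketch below summarizes one interval of raw round-trip-time samples into a small record that a node could push to its collection server. The function names, field names, and JSON encoding are assumptions made for this example; the outline does not specify a wire format.

```python
import json
import statistics
from typing import List


def summarize(node: str, target: str, rtts_ms: List[float], sent: int) -> dict:
    """Reduce one interval of raw RTT samples to a small summary record.

    Hypothetical record layout: the field names are illustrative only.
    """
    received = len(rtts_ms)
    return {
        "node": node,
        "target": target,
        "sent": sent,
        # Probes that never returned count as loss.
        "loss_pct": 100.0 * (sent - received) / sent if sent else 0.0,
        "rtt_min_ms": min(rtts_ms) if rtts_ms else None,
        "rtt_median_ms": statistics.median(rtts_ms) if rtts_ms else None,
        "rtt_max_ms": max(rtts_ms) if rtts_ms else None,
    }


def encode_for_push(summary: dict) -> bytes:
    """Serialize a summary for upload; JSON is an assumed encoding."""
    return json.dumps(summary, sort_keys=True).encode("utf-8")


# Five probes sent, four answered.
record = summarize("node-1", "core-rtr", [10.0, 12.0, 11.0, 50.0], sent=5)
payload = encode_for_push(record)
```

Pushing a fixed-size summary per interval keeps the upload volume independent of the probe rate, which is one way the lower tier could shield the data warehouse from raw sample traffic.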
3. Data Collection
inputs: smokeping/traceroute/iperf/tcpdump/etc
modular, so new inputs can be easily added
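One possible way to keep the inputs modular is a small registry that each collector adds itself to, so that supporting a new tool means writing one function. This is only a sketch: the registry, the decorator, and the placeholder collectors are hypothetical, and the collectors return canned values rather than invoking the real tools.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps an input name to a function that returns samples.
Collector = Callable[[str], List[dict]]
COLLECTORS: Dict[str, Collector] = {}


def register(name: str) -> Callable[[Collector], Collector]:
    """Decorator that adds a collector to the registry under `name`."""
    def wrap(func: Collector) -> Collector:
        COLLECTORS[name] = func
        return func
    return wrap


@register("ping")
def fake_ping(target: str) -> List[dict]:
    # Placeholder: a real collector would run the tool and parse its output.
    return [{"input": "ping", "target": target, "rtt_ms": 12.5}]


@register("traceroute")
def fake_traceroute(target: str) -> List[dict]:
    # Placeholder value; no network traffic is generated here.
    return [{"input": "traceroute", "target": target, "hops": 7}]


def collect_all(target: str) -> List[dict]:
    """Run every registered collector against `target`, in name order."""
    samples: List[dict] = []
    for name in sorted(COLLECTORS):
        samples.extend(COLLECTORS[name](target))
    return samples


samples = collect_all("example.net")
```

Because `collect_all` only iterates over the registry, adding an iperf or tcpdump collector would not require changes to the collection loop.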
4. Analysis
initial real time analysis on nodes for immediate problems
calculating statistics over time on collection servers
modular to allow for numbers to be crunched for various data types
comparison of data with network topology to isolate location on network of problem
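The comparison of measurements against topology could, for example, take the links shared by every failing probe path and discard any link that also appears on a healthy path. The sketch below assumes each probe's hop-by-hop path is already known (say, from traceroute) and that a single faulty link explains the failures; the function names and data layout are illustrative, not part of the proposed system.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]


def path_links(path: List[str]) -> Set[Link]:
    """Return the set of directed links between consecutive hops."""
    return {(a, b) for a, b in zip(path, path[1:])}


def suspect_links(paths: Dict[str, List[str]], failed: Set[str]) -> Set[Link]:
    """Links common to all failing probes and absent from all healthy ones."""
    failing = [path_links(p) for name, p in paths.items() if name in failed]
    if not failing:
        return set()
    healthy: Set[Link] = set()
    for name, p in paths.items():
        if name not in failed:
            healthy |= path_links(p)
    return set.intersection(*failing) - healthy


# Hypothetical topology: three monitors probing three targets.
paths = {
    "a": ["m1", "r1", "r2", "t1"],
    "b": ["m2", "r1", "r2", "t2"],
    "c": ["m3", "r1", "r3", "t3"],
}
suspects = suspect_links(paths, failed={"a", "b"})
```

In this example probes `a` and `b` fail while `c` succeeds, so the only link shared by the failures and unused by the healthy path is `r1` to `r2`. Multiple vantage points are what make this narrowing possible, which a single central monitor cannot provide.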
5. Visualization
graphs that can combine data from any of the data sources onto one image
unique graphs that consolidate massive amounts of information into a form where problems are easy to detect
6. Triggers and Alerts
consolidation of problem report data (no more 100s of nagios emails)
comprehensive reporting with graphs and pointers to specific information to show what the problem might be
"top 10 problems" screen in the NOC
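A minimal sketch of alert consolidation, assuming each raw alert can be reduced to a (location, problem kind) pair: identical alerts are counted rather than sent individually, and the most frequent ones are ranked for a "top 10" display. The alert representation and the function name are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Alert = Tuple[str, str]  # (location, problem kind); an assumed representation


def top_problems(alerts: Iterable[Alert], limit: int = 10) -> List[Tuple[Alert, int]]:
    """Collapse duplicate alerts and return the most frequent ones.

    Ties are broken by the alert key so the ordering is deterministic.
    """
    counts = Counter(alerts)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


raw_alerts = (
    [("r1-r2", "loss")] * 3
    + [("r3", "latency")]
    + [("r1-r2", "latency")] * 2
)
summary = top_problems(raw_alerts, limit=2)
```

Six raw alerts become two ranked entries here; a real report would attach the graphs and pointers described above to each entry.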
7. References
W. Matthews, Original CPR whitepaper. 2004.
D. G. Andersen and N. Feamster. Challenges and opportunities in Internet data mining. Technical Report CMU–PDL–06–102, Carnegie Mellon University Parallel Data Laboratory, Jan. 2006. www.datapository.net.
G. Liu, A. K. Mok, and E. J. Yang. Composite events for network event correlation. Dept. of Computer Science, University of Texas at Austin.
A. Jain, J. M. Hellerstein, S. Ratnasamy, and D. Wetherall. A Wakeup Call for Internet Monitoring Systems: The Case for Distributed Triggers. Proc. 3rd ACM SIGCOMM Workshop on Hot Topics in Networks. 2004.
HP OpenView

