Building A New Network Monitoring System

DISCERN: Distributed Intelligent System for Control, Evaluation, and Reporting on Networks

A work in progress!

This has been moved to the GTNoise wiki, and the end result will be posted back here.

Building a New Network Monitoring Infrastructure

Chris Kelly, Warren Matthews, Nick Feamster
{chris.kelly,warren.matthews}@oit.gatech.edu feamster@cc.gatech.edu

ABSTRACT

This paper proposes a distributed, scalable approach to network monitoring that resolves many issues with current network monitoring methods. Current methods suffer from at least one, and usually most, of the following limitations: a single point of view of the network, a single point of failure, reliance on network devices for reliable information, reliance solely on passive measurements, lack of support for new analysis methods or data sources, inability to scale to large networks, and inability to combine data from multiple sources. This proposal, DISCERN, attempts to address every one of these limitations with a modular, multi-tiered approach built on a distributed mesh of monitoring nodes.

Keywords

network monitoring, network performance, distributed triggers

1. Introduction

1.1 Current State of Monitoring

Network monitoring is essential to maintaining reliability in large-scale systems, and much work has been done on how to collect performance data, how to store it centrally, and how to analyze it. However, as systems grow larger and algorithms grow more complex, the computing power required for effective centralized analysis becomes massive. Multi-terabyte disk arrays are required to store all of the information, and the computational overhead prevents analysis from running in real time. Existing tools are useful for after-the-fact analysis, but operators often must still wait for a phone call from a user experiencing a problem before they begin troubleshooting. The operator must then sift through all the different data sources and reports to narrow down the problem, which often leads to manually connecting to switches and routers and running network diagnostic tools such as ping and traceroute by hand. This is not an effective way to find or solve problems, and it does not scale as the number of nodes on the network increases.

big problem - centralized view of the network - don't know what the user sees

see datapository/openview/snmp/etc

goals include: data reduction, placement of computation/functionality, automated detection and comprehensive report generation

1.2 Goals

explanation of the distributed approach and its scalability advantages (old system: DDOS file uploading, centralized point of failure, multiple views of the network)

smart network problem detection and automated in-depth analysis (not relying on network devices, active and passive tests)

comprehensive reporting to allow operators to work more quickly and effectively (combining alarms, data from various sources, combination graphs of diverse data, taking network topology into account)

building the system to respond in real-time.

2. System Architecture

distributed mesh of measurement machines

distributed data collection and processing servers: receive data uploads, share with the top-tier data warehouse, do some analysis, admin tools (explain push vs. pull approaches, and the need for tiers due to SQL query length when everything is centralized)

admin tools for managing measurement machines, processing servers, etc (work at any level of the system)
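
The data reduction a collection server might perform before sharing with the top-tier data warehouse could look roughly like the sketch below. It assumes measurements arrive as (node, timestamp, RTT) tuples and that fixed-width time buckets are an acceptable summary upstream. The function name summarize, the tuple layout, and the 300-second bucket are illustrative assumptions, not part of the DISCERN design.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical raw sample layout: (node name, unix timestamp, round-trip time in ms).
Sample = Tuple[str, float, float]


def summarize(samples: List[Sample], bucket_s: int = 300) -> List[dict]:
    """Reduce raw samples to one summary row per node per time bucket."""
    buckets: Dict[Tuple[str, int], List[float]] = {}
    for node, ts, rtt_ms in samples:
        # Align each timestamp to the start of its bucket.
        start = int(ts // bucket_s) * bucket_s
        buckets.setdefault((node, start), []).append(rtt_ms)

    # Emit rows in a stable (node, start) order so uploads are reproducible.
    return [
        {
            "node": node,
            "start": start,
            "n": len(values),
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
        }
        for (node, start), values in sorted(buckets.items())
    ]
```

Only the summary rows would travel to the upper tier in this sketch, while the raw samples stay on the collection server for in-depth analysis when a problem is flagged.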

3. Data Collection

inputs: smokeping/traceroute/iperf/tcpdump/etc

modular, so new inputs can be easily added
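
One possible shape for the modular input layer is a small registry in which each input registers a collector function. The sketch below is only an assumption about how this could be done: the names COLLECTORS, register, and collect_all are hypothetical, and the dummy collector returns a fixed value instead of invoking a real tool.

```python
from typing import Callable, Dict, List

# Hypothetical collector signature: takes a target host, returns measurement records.
Collector = Callable[[str], List[dict]]

# Registry of available inputs, keyed by input name.
COLLECTORS: Dict[str, Collector] = {}


def register(name: str) -> Callable[[Collector], Collector]:
    """Decorator that adds a collector function to the registry under `name`."""
    def wrap(func: Collector) -> Collector:
        if name in COLLECTORS:
            raise ValueError(f"collector {name!r} already registered")
        COLLECTORS[name] = func
        return func
    return wrap


@register("dummy_latency")
def dummy_latency(target: str) -> List[dict]:
    # Placeholder value; a real module would run a measurement tool here.
    return [{"input": "dummy_latency", "target": target, "rtt_ms": 12.5}]


def collect_all(target: str) -> List[dict]:
    """Run every registered collector against one target, in name order."""
    results: List[dict] = []
    for name in sorted(COLLECTORS):
        results.extend(COLLECTORS[name](target))
    return results
```

Under this assumption, adding a new input means writing one decorated function; the code that schedules and uploads measurements does not change.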

4. Analysis

initial real time analysis on nodes for immediate problems

calculating statistics over time on collection servers

modular to allow for numbers to be crunched for various data types

comparison of data with network topology to isolate location on network of problem
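
As a rough illustration of comparing measurements with topology, the sketch below ranks links that appear on failing probe paths but on no healthy path. It assumes each probe's path is already known as a list of links (for example, derived from traceroute output) and that links are undirected. The function suspect_links and its data layout are hypothetical, and this simple set-difference heuristic is a stand-in for whatever localization method the final system adopts.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]


def _norm(link: Link) -> Link:
    """Treat links as undirected by storing endpoints in sorted order."""
    a, b = sorted(link)
    return (a, b)


def suspect_links(paths: Dict[str, List[Link]], failed: Set[str]) -> List[Tuple[Link, int]]:
    """Return links seen only on failing paths, most frequently implicated first."""
    healthy: Set[Link] = set()
    fail_counts: Counter = Counter()
    for probe, links in paths.items():
        normalized = {_norm(link) for link in links}
        if probe in failed:
            # Count each link once per failing probe that crosses it.
            fail_counts.update(normalized)
        else:
            healthy.update(normalized)

    # A link that carries at least one healthy probe is assumed to be working.
    suspects = [(link, n) for link, n in fail_counts.items() if link not in healthy]
    suspects.sort(key=lambda item: (-item[1], item[0]))
    return suspects
```

For example, if probes from n1 and n2 to a server both fail and share the links r1-r2 and r2-srv, while a probe between n1 and n2 through r1 succeeds, only the two shared links remain as suspects. Multiple views of the network are what make this narrowing possible.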

5. Visualization

graphs that can combine data from any of the data sources onto one image

unique graphs that consolidate massive amounts of information into a form in which problems are easy to detect

6. Triggers and Alerts

consolidation of problem report data (no more 100s of Nagios emails)

comprehensive reporting with graphs and pointers to specific information to show what the problem might be

"top 10 problems" screen in the NOC

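The consolidation and "top 10 problems" ideas above could be prototyped along the lines of the sketch below. It assumes each raw alert is a dict with location, kind, and time keys. That layout, the Problem class, and the ranking by alert count are illustrative assumptions and not a description of an existing DISCERN component.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Problem:
    """One consolidated problem built from many raw alerts."""
    location: str
    kind: str
    count: int
    first_seen: float
    last_seen: float


def consolidate(alerts: Iterable[dict]) -> List[Problem]:
    """Merge raw alerts that share a location and kind into a single problem."""
    problems: Dict[Tuple[str, str], Problem] = {}
    for alert in alerts:
        key = (alert["location"], alert["kind"])
        problem = problems.get(key)
        if problem is None:
            problem = Problem(alert["location"], alert["kind"], 0, alert["time"], alert["time"])
            problems[key] = problem
        problem.count += 1
        problem.first_seen = min(problem.first_seen, alert["time"])
        problem.last_seen = max(problem.last_seen, alert["time"])
    return list(problems.values())


def top_problems(problems: Iterable[Problem], n: int = 10) -> List[Problem]:
    """Rank by alert count, breaking ties in favor of the most recent problem."""
    return sorted(problems, key=lambda p: (-p.count, -p.last_seen))[:n]
```

In this sketch an operator would see one entry per problem with a count and a time range, in place of one email per raw alert.
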
7. References

W. Matthews, Original CPR whitepaper. 2004.

D. G. Andersen and N. Feamster. Challenges and opportunities in Internet data mining. Technical Report CMU–PDL–06–102, Carnegie Mellon University Parallel Data Laboratory, Jan. 2006. www.datapository.net.

G. Liu, A. K. Mok, and E. J. Yang. Composite events for network event correlation. Dept. of Comput. Sci., Texas Univ., Austin, TX.

A. Jain, J. M. Hellerstein, S. Ratnasamy, and D. Wetherall. A Wakeup Call for Internet Monitoring Systems: The Case for Distributed Triggers. Proc. 3rd ACM SIGCOMM Workshop on Hot Topics in Networks. 2004.

HP OpenView.