Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

The development of tools to help operators diagnose faults has been the subject of much research and commercial activity. Existing diagnostic systems are designed with large networks in mind and focus on performance and reachability issues. They fall short at helping the operators of small networks, who need detailed diagnosis.

This paper presents NetMedic, a system for detailed diagnosis of problems in small enterprise networks. To observe and diagnose a wide range of failure modes, it relies on:
i) using many variables to capture different aspects of component behavior, rather than a single, abstract variable that denotes overall health.
ii) linking faulty components to the components they affect through a chain in an estimated dependency graph. To infer whether a source component is impacting a destination, NetMedic looks in the history for times when the source's state was similar to its current state. If the destination's state at those times was also similar to its current state, we can infer that the source is impacting the destination; otherwise it is not. The primary difficulty in this estimation is that we do not know a priori how components interact.
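
The history-based inference in step ii) can be illustrated with a small Python sketch. This is a simplified illustration, not NetMedic's actual algorithm: the names `similarity` and `impact_score`, the distance-based similarity measure, and the parameter `k` are all assumptions made for the example. Each component state is taken to be a list of numeric variables, and the history is a list of such states, one per time step.

```python
from math import sqrt


def similarity(a, b):
    """Hypothetical similarity in (0, 1]; 1.0 means identical state vectors."""
    dist = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)


def impact_score(src_history, dst_history, src_now, dst_now, k=3):
    """Estimate how likely the source is impacting the destination.

    Finds the k past time steps where the source state was most similar
    to its current state, then averages how similar the destination state
    at those times was to the destination's current state.
    """
    if not src_history or len(src_history) != len(dst_history):
        raise ValueError("histories must be non-empty and of equal length")

    # Rank past time steps by how closely the source resembled its current state.
    ranked = sorted(
        range(len(src_history)),
        key=lambda t: similarity(src_history[t], src_now),
        reverse=True,
    )
    top = ranked[:k]

    # A high average means the destination behaved as it does now whenever
    # the source behaved as it does now, which suggests an impact.
    return sum(similarity(dst_history[t], dst_now) for t in top) / len(top)


# Example: one state variable per component, four past time steps.
src_history = [[0.1], [0.2], [0.9], [0.95]]
coupled_dst = [[0.1], [0.15], [0.8], [0.85]]    # follows the source
uncoupled_dst = [[0.1], [0.9], [0.1], [0.2]]    # unrelated to the source

coupled = impact_score(src_history, coupled_dst, [0.92], [0.82], k=2)
uncoupled = impact_score(src_history, uncoupled_dst, [0.92], [0.82], k=2)
```

In this toy data, `coupled` comes out higher than `uncoupled`, matching the intuition that an edge gets a stronger weight when the destination's past states line up with the source's. A real system would also need to normalize variables with different ranges and handle the case where no similar history exists.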

NetMedic built a dependency graph with roughly 1000 components and 3600 edges, with each component represented by roughly 35 state variables. It was evaluated by injecting faults comprising both fail-stop and performance problems. In 80% of the cases, the faulty component was the top identified cause. From their classification of the cases considered, the authors found that a diagnostic system should monitor individual applications and track application-specific health rather than generic health. They also found that the root cause in most cases is some other configuration element in the environment on which the application depends, such as the lower-layer services that are running or the firewall configuration.

1 comment:

  1. Nice summary. I hope someone does a course project in this general area.
