Novel Approaches in Network Fault Management

Ankur Gupta, Purnendu Prabhat

Abstract


As computer networks grow in size and complexity, managing them to ensure 24x7 uptime while meeting increasingly stringent Service Level Agreements (SLAs) and customer expectations becomes a critical issue. Although network management solutions have progressed significantly in recent years, extreme scale, new network paradigms and protocols, and increasingly heterogeneous networks make efficient fault management non-trivial. This paper identifies and reviews novel strategies, techniques, and ideas, found either in commercial solutions or in the literature, that alleviate some of these challenges in the network fault management domain. Some ideas on the future evolution of the domain are also presented.
