Packet-Level Telemetry in Large Datacenter Networks
Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, Dave Maltz, Lihua Yuan,
Ming Zhang, Ben Y. Zhao, Haitao Zheng
Proceedings of ACM SIGCOMM
Debugging faults in complex networks often requires capturing and analyzing traffic at the packet level. In this task, datacenter networks (DCNs) present unique challenges with their scale, traffic volume, and diversity of faults. To troubleshoot faults in a timely manner, DCN administrators must a) identify affected packets within a large volume of traffic; b) track them across multiple network components; c) analyze traffic traces for fault patterns; and d) test or confirm potential causes. To our knowledge, no tool today can achieve both the specificity and scale required for this task.
We present Everflow, a packet-level network telemetry system for large DCNs.
Everflow traces specific packets by implementing a powerful packet filter on
top of the "match and mirror" functionality of commodity switches. It shuffles
captured packets to multiple analysis servers using load balancers built on
switch ASICs, and it sends "guided probes" to test or confirm potential
faults. We present experiments that demonstrate Everflow's scalability, and
share experiences of troubleshooting network faults gathered from running it
for over 6 months in Microsoft's DCNs.
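
The "match and mirror" and shuffling ideas described above can be illustrated with a minimal sketch. This is not Everflow's implementation: the abstract states that matching and load balancing run in switch hardware, whereas the sketch below models them in software. The names Packet, MatchRule, pick_analyzer, and mirror, the rule fields, and the choice of hashing the 5-tuple so that packets of one flow reach the same analysis server are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet header summary (not Everflow's format)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    tcp_flags: frozenset = frozenset()


@dataclass(frozen=True)
class MatchRule:
    """Hypothetical match rule; a field left as None acts as a wildcard."""
    protocol: Optional[str] = None
    dst_port: Optional[int] = None
    required_flags: frozenset = frozenset()

    def matches(self, pkt: Packet) -> bool:
        if self.protocol is not None and pkt.protocol != self.protocol:
            return False
        if self.dst_port is not None and pkt.dst_port != self.dst_port:
            return False
        # Every required flag must be set on the packet.
        return self.required_flags <= pkt.tcp_flags


def pick_analyzer(pkt: Packet, analyzers: List[str]) -> str:
    """Map a packet to an analyzer by hashing its 5-tuple (assumed policy)."""
    key = f"{pkt.src_ip}|{pkt.dst_ip}|{pkt.src_port}|{pkt.dst_port}|{pkt.protocol}"
    digest = hashlib.sha256(key.encode()).digest()
    return analyzers[int.from_bytes(digest[:8], "big") % len(analyzers)]


def mirror(packets: List[Packet], rules: List[MatchRule],
           analyzers: List[str]) -> Dict[str, List[Packet]]:
    """Copy packets matching any rule to the analyzer chosen for their flow."""
    out: Dict[str, List[Packet]] = {a: [] for a in analyzers}
    for pkt in packets:
        if any(rule.matches(pkt) for rule in rules):
            out[pick_analyzer(pkt, analyzers)].append(pkt)
    return out


if __name__ == "__main__":
    syn_rule = MatchRule(protocol="tcp", required_flags=frozenset({"SYN"}))
    pkts = [
        Packet("10.0.0.1", "10.0.1.1", 40000, 443, "tcp", frozenset({"SYN"})),
        Packet("10.0.0.1", "10.0.1.1", 40000, 443, "tcp", frozenset({"ACK"})),
    ]
    result = mirror(pkts, [syn_rule], ["analyzer-0", "analyzer-1"])
    print({name: len(captured) for name, captured in result.items()})
```

In this sketch only the SYN packet is mirrored, and hashing on the 5-tuple keeps every mirrored packet of a flow on one analyzer, which is one plausible way to let a single server see a flow's whole trace.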