## Title: Harnessing System Observability for Modeling and Detecting Gray Failures

## Abstract:

Real-world distributed systems suffer unavailability due to various types of failure. But existing fault-tolerance mechanisms usually assume a simple fail-stop model, which cannot effectively capture many of the faults that occur in large-scale production systems. In this talk, I will discuss this problem, gray failure, with real-world examples that show its broad scope and consequences. Because of its inherently subtle nature, gray failure is difficult to characterize and detect. We argue that the key feature of gray failure is differential observability. I will present our solution based on this insight: Panorama, a system that exploits the inherent observability in large systems to detect complex failures, using static analysis to convert any component in a system into an in-situ observer.

## Bio:

Dr. Ryan Huang is an assistant professor in the Department of Computer Science at Johns Hopkins University. His research spans computer systems broadly, intersecting with programming languages and software engineering. He enjoys solving challenging system problems in real-world settings and seeking impact. His co-authored work won the OSDI '16 best paper award and was nominated for the best paper award at MICRO '18. Dr. Huang obtained his PhD from UC San Diego and his BS from Peking University.