Failure Prediction with Exact Localization
- funded by: LBNL
- funding level: $84,684
- duration: 10/18/2016 - 08/15/2017 (no-cost extension to 3/31/2018)
The objective of this work is to assess the potential of machine
learning techniques for pinpointing failures before they happen, with
high true-positive and low false-positive rates.
Publications:
- "Doomsday: Predicting Which Node Will Fail When
on Supercomputers" by Anwesha Das, Frank
Mueller, Paul Hargrove, Eric Roman, Scott Baden,
in Supercomputing (SC), Nov 2018 (accepted), Best Paper Candidate.
- "Desh: Deep Learning for System Health Prediction of Lead
Times to Failure in HPC" by
Anwesha Das, Frank Mueller, Charles Siegel, Abhinav Vishnu, in
High-Performance Parallel and Distributed Computing (HPDC), Jun
2018, pages 40-51.
Posters:
Theses: