MOLAR: Modular Linux and Adaptive Runtime Support for HEC OS/R research
- funded by: DOE
- funding level: $93,708 for NCSU, plus $18,000 cost sharing by NC State COE and CSC
- duration: 02/01/2005 - 01/31/2008 (no-cost extension until 01/31/2009)
- PIs (total funding: $1,200,000):
- Stephen L. Scott, Jeffrey Vetter, David Bernholdt, Christian Engelmann - ORNL
- Frank Mueller - North Carolina State University
- P. Sadayappan - Ohio State University
- Chokchai Leangsuksun - Louisiana Tech University
MOLAR is a multi-institution research effort that concentrates on
adaptive, reliable, and efficient operating and runtime system
solutions for ultra-scale high-end scientific computing on the next
generation of supercomputers. This research addresses the challenges
outlined by the FAST-OS (Forum to Address Scalable Technology for
runtime and Operating Systems) and HECRTF (High-End Computing
Revitalization Task Force) activities by providing modular Linux
and adaptable runtime support for high-end computing operating and
runtime systems.
The MOLAR research addresses these issues through the following goals.
- Create a modular and configurable Linux system that allows
customized changes based on the requirements of the applications,
runtime systems, and cluster management software.
- Build runtime systems that leverage the OS modularity and
configurability to improve efficiency, reliability, scalability,
ease-of-use, and provide support to legacy and promising programming
models.
- Advance computer reliability, availability and serviceability
(RAS) management systems to work cooperatively with the OS/R to
identify and preemptively resolve system issues.
- Explore the use of advanced monitoring and adaptation to improve
application performance and predictability of system interruptions.
The overall goal of the research conducted at NCSU will be to develop
scalable algorithms for high-availability without single points of
failure and without single points of control.
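As an illustration of this goal, avoiding a single point of control typically means replacing a central health monitor with a symmetric scheme in which every node watches a peer, for example its successor on a logical ring, and any watcher may declare a failure. The following is a minimal sketch of such ring-based heartbeat failure detection; the names, the shared failure set, and the tick-based clock are hypothetical simplifications for illustration, not MOLAR's actual implementation:

```python
# Toy sketch of decentralized, ring-based failure detection: every
# node monitors the next live node on a logical ring, so no single
# node is a point of failure or a point of control. Hypothetical
# illustration only; not the MOLAR implementation.

TIMEOUT = 3  # heartbeats missed before declaring a peer dead

class Node:
    def __init__(self, rank, ring):
        self.rank = rank
        self.ring = ring          # shared list of all ranks
        self.alive = True
        self.last_seen = {}       # peer rank -> ticks since last heartbeat

    def successor(self, dead):
        """Next rank on the ring that is not already declared dead."""
        n = len(self.ring)
        for step in range(1, n):
            peer = self.ring[(self.rank + step) % n]
            if peer not in dead:
                return peer
        return None

def simulate(num_nodes, crash_rank, ticks):
    """Run a toy simulation; return the set of ranks declared dead."""
    nodes = {r: Node(r, list(range(num_nodes))) for r in range(num_nodes)}
    nodes[crash_rank].alive = False
    declared_dead = set()
    for _ in range(ticks):
        for node in nodes.values():
            if not node.alive:
                continue
            peer = node.successor(declared_dead)
            if peer is None:
                continue
            if nodes[peer].alive:
                node.last_seen[peer] = 0        # heartbeat received
            else:
                node.last_seen[peer] = node.last_seen.get(peer, 0) + 1
                if node.last_seen[peer] >= TIMEOUT:
                    declared_dead.add(peer)     # any watcher may declare
    return declared_dead

print(simulate(num_nodes=4, crash_rank=2, ticks=10))  # -> {2}
```

After a declaration, survivors simply route monitoring around the failed rank; because detection responsibility is distributed symmetrically, no coordinator restart is ever needed.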
Publications:
- "MOLAR: adaptive runtime support for high-end computing operating
  and runtime systems" by Christian Engelmann, Stephen L. Scott,
  David E. Bernholdt, Narasimha R. Gottumukkala, Chokchai
  Leangsuksun, Jyothish Varma, Chao Wang, Frank Mueller, Aniruddha
  G. Shet, and P. Sadayappan, in ACM SIGOPS Operating Systems
  Review, Vol. 40, No. 2, April 2006, pages 63-72.
- "Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems"
  by J. Varma, C. Wang, F. Mueller, C. Engelmann, and S. Scott, in
  International Conference on Supercomputing, Jun 2006, pages 219-228.
- "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault
  Tolerance" by C. Wang, F. Mueller, C. Engelmann, and S. Scott, in
  International Parallel and Distributed Processing Symposium, Apr 2007.
- "Proactive Fault Tolerance for HPC with Xen Virtualization" by
  A. Nagarajan, F. Mueller, C. Engelmann, and S. Scott, in
  International Conference on Supercomputing, Jun 2007.
- "On-the-fly Recovery of Job Input Data in Supercomputers" by
  C. Wang, Z. Zhang, S. Vazhkudai, X. Ma, and F. Mueller, in
  International Conference on Parallel Processing, Sep 2008 (accepted).
- "Proactive Process-Level Live Migration in HPC Environments" by
  C. Wang and F. Mueller, in Supercomputing, Nov 2008 (accepted).
Theses: