MOLAR: Modular Linux and Adaptive Runtime Support for HEC OS/R research
- funded by: DOE
- funding level: $93,708 for NCSU, plus $18,000 cost sharing by NC State COE and CSC
- duration: 02/01/2005 - 01/31/2008 (no-cost extension until 01/31/2009)
- PIs (total funding: $1,200,000):
- Stephen L. Scott, Jeffrey Vetter, David Bernholdt, Christian Engelmann - ORNL
- Frank Mueller - North Carolina State University
- P. Sadayappan - Ohio State University
- Chokchai Leangsuksun - Louisiana Tech University
MOLAR is a multi-institution research effort that concentrates on
adaptive, reliable, and efficient operating and runtime system
solutions for ultra-scale high-end scientific computing on the next
generation of supercomputers. This research addresses the challenges
outlined by the FAST-OS (Forum to Address Scalable Technology for
runtime and Operating Systems) and HECRTF (High-End Computing
Revitalization Task Force) activities by providing modular Linux
and adaptable runtime support for high-end computing operating and
runtime systems.
The MOLAR research addresses these issues through the following goals.
- Create a modular and configurable Linux system that allows
customized changes based on the requirements of the applications,
runtime systems, and cluster management software.
- Build runtime systems that leverage the OS modularity and
configurability to improve efficiency, reliability, scalability,
ease-of-use, and provide support to legacy and promising programming
models.
- Advance computer reliability, availability and serviceability
(RAS) management systems to work cooperatively with the OS/R to
identify and preemptively resolve system issues.
- Explore the use of advanced monitoring and adaptation to improve
application performance and predictability of system interruptions.
The overall goal of the research conducted at NCSU will be to develop
scalable algorithms for high-availability without single points of
failure and without single points of control.
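As an illustration of this goal, avoiding a single point of control typically means replacing a central health monitor with a symmetric scheme in which every node watches a peer, for example its successor on a logical ring, and any watcher may declare a failure. The following is a minimal sketch of such ring-based heartbeat failure detection; the names, the shared failure set, and the tick-based clock are hypothetical simplifications for illustration, not MOLAR's actual implementation:

```python
# Toy sketch of decentralized, ring-based failure detection: every
# node monitors the next live node on a logical ring, so no single
# node is a point of failure or a point of control. Hypothetical
# illustration only; not the MOLAR implementation.

TIMEOUT = 3  # heartbeats missed before declaring a peer dead

class Node:
    def __init__(self, rank, ring):
        self.rank = rank
        self.ring = ring          # shared list of all ranks
        self.alive = True
        self.last_seen = {}       # peer rank -> ticks since last heartbeat

    def successor(self, dead):
        """Next rank on the ring that is not already declared dead."""
        n = len(self.ring)
        for step in range(1, n):
            peer = self.ring[(self.rank + step) % n]
            if peer not in dead:
                return peer
        return None

def simulate(num_nodes, crash_rank, ticks):
    """Run a toy simulation; return the set of ranks declared dead."""
    nodes = {r: Node(r, list(range(num_nodes))) for r in range(num_nodes)}
    nodes[crash_rank].alive = False
    declared_dead = set()
    for _ in range(ticks):
        for node in nodes.values():
            if not node.alive:
                continue
            peer = node.successor(declared_dead)
            if peer is None:
                continue
            if nodes[peer].alive:
                node.last_seen[peer] = 0        # heartbeat received
            else:
                node.last_seen[peer] = node.last_seen.get(peer, 0) + 1
                if node.last_seen[peer] >= TIMEOUT:
                    declared_dead.add(peer)     # any watcher may declare
    return declared_dead

print(simulate(num_nodes=4, crash_rank=2, ticks=10))  # -> {2}
```

After a declaration, survivors simply route monitoring around the failed rank; because detection responsibility is distributed symmetrically, no coordinator restart is ever needed.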
Publications:
- "MOLAR: adaptive runtime support for high-end computing operating
  and runtime systems" by Christian Engelmann, Stephen L. Scott,
  David E. Bernholdt, Narasimha R. Gottumukkala, Chokchai
  Leangsuksun, Jyothish Varma, Chao Wang, Frank Mueller, Aniruddha
  G. Shet, and P. Sadayappan, in ACM SIGOPS Operating Systems
  Review, Vol. 40, No. 2, April 2006, pages 63-72.
- "Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems"
  by J. Varma, C. Wang, F. Mueller, C. Engelmann, and S. Scott, in
  International Conference on Supercomputing, Jun 2006, pages 219-228.
- "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault
  Tolerance" by C. Wang, F. Mueller, C. Engelmann, and S. Scott, in
  International Parallel and Distributed Processing Symposium, Apr 2007.
- "Proactive Fault Tolerance for HPC with Xen Virtualization" by
  A. Nagarajan, F. Mueller, C. Engelmann, and S. Scott, in
  International Conference on Supercomputing, Jun 2007.
- "On-the-fly Recovery of Job Input Data in Supercomputers" by
  C. Wang, Z. Zhang, S. Vazhkudai, X. Ma, and F. Mueller, in
  International Conference on Parallel Processing, Sep 2008 (accepted).
- "Proactive Process-Level Live Migration in HPC Environments" by
  C. Wang and F. Mueller, in Supercomputing, Nov 2008 (accepted).
Theses: