'Crash-Test Dummy' for Supercomputers Helps Work Out the Bugs

Designing the next-generation supercomputers will require fundamental changes to the way they operate. But testing these changes can be problematic because it could break the supercomputer being worked on and result in huge financial and developmental setbacks.

So what is a supercomputer designer to do?

One possible solution worked out by researchers from North Carolina State University (NCSU) is to build a dedicated "crash-test dummy" system to help find ways around some of the major obstacles facing next-generation high-performance computing (HPC) systems, which serve as the middlemen between supercomputers and humans.

NCSU computer scientist Frank Mueller and his team last week debuted the most powerful academic HPC cluster in the state, called ARC. An HPC cluster is essentially a large group of computers working together as though they were one device. These clusters are extremely fast and are often used to conduct large-scale scientific computing projects, such as global climate simulations or DNA sequencing.

"We designed this facility to help larger installations in the future," Mueller told TechNewsDaily. "We are trying to find a way to increase the resiliency of HPCs, so our infrastructure could be used not only as a crash-test dummy but also for educational purposes."

Unlike the operators of other supercomputers, Mueller and his team are not afraid to try radical solutions to problems on ARC.

"We can do anything we want with it," he said. "We don't have to worry about delaying work being done on the large-scale systems at other institutions."

ARC, a mid-size computer cluster, was also launched to address scalability, a system's ability to seamlessly handle a lot of work and growth. On typical HPC systems, which have hundreds of thousands of components, the failure of a single component can wipe out hours of time and effort in an instant. Even more advanced systems could have millions of components, which increases the chance of failure.
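To see why component count matters, a rough back-of-the-envelope sketch helps. The Python snippet below assumes that components fail independently at a constant rate, which is a simplification, and the 1,000,000-hour per-component mean time between failures (MTBF) is an illustrative figure, not a number from Mueller's team.

```python
import math

# Assumption for illustration only: every component has the same MTBF
# and fails independently of the others (exponential lifetimes).
COMPONENT_MTBF_HOURS = 1_000_000.0  # hypothetical value


def system_mtbf_hours(n_components: int,
                      component_mtbf_hours: float = COMPONENT_MTBF_HOURS) -> float:
    """Expected time until the first failure anywhere in the system."""
    return component_mtbf_hours / n_components


def prob_any_failure(n_components: int, window_hours: float,
                     component_mtbf_hours: float = COMPONENT_MTBF_HOURS) -> float:
    """Probability that at least one component fails within the window."""
    system_rate = n_components / component_mtbf_hours  # failures per hour
    return 1.0 - math.exp(-system_rate * window_hours)


if __name__ == "__main__":
    for n in (100_000, 1_000_000):
        print(n, "components:",
              "system MTBF", system_mtbf_hours(n), "hours;",
              "P(failure within 24 h) =", round(prob_any_failure(n, 24.0), 4))
```

Under these assumptions, 100,000 components give a system-wide MTBF of about 10 hours, and a million components cut it to about one hour, so a long-running job becomes increasingly likely to be interrupted.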

To test possible solutions to this and other problems, ARC was created to allow researchers to make fundamental changes to an HPC system's entire software stack, including the operating system.

ARC gives users temporary administrator rights and allows them to replace arbitrary components of the software stack. These replacements range from entire operating systems, drivers and kernel modules to runtime libraries, middleware and system tools.

The team used ARC to test software problems as well, even going so far as to simulate crashes to test the system's resiliency.

"We have brought down parts of the entire system during stress tests," Mueller said. "This results in downtime and time-consuming software re-installations, which is something we can afford but large-scale systems cannot."

The group is also testing out software tricks that could help salvage supercomputer programs when they do crash.

"We have modified the operating system so that we can migrate a part of an application from one computer to another without stopping it," Mueller explained.

"This allows us to vacate computers that show 'health problems,' such as high processor temperatures, low fan speeds, [and] high memory failure rates."
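The policy Mueller describes can be pictured as a simple health check that flags computers to be vacated. The sketch below is purely illustrative: the thresholds, the `NodeHealth` fields and the `plan_migrations` helper are hypothetical and are not taken from ARC's software. The live-migration step itself, which the team built into the operating system, is not shown, and the sketch assumes every healthy node has room for one migrated workload.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical thresholds, chosen only for illustration.
MAX_CPU_TEMP_C = 85.0
MIN_FAN_RPM = 1500
MAX_MEM_ERRORS_PER_HOUR = 10.0


@dataclass
class NodeHealth:
    """Hypothetical snapshot of one node's sensor readings."""
    name: str
    cpu_temp_c: float
    fan_rpm: int
    mem_errors_per_hour: float


def is_unhealthy(node: NodeHealth) -> bool:
    """Flag a node showing any of the 'health problems' named in the article."""
    return (node.cpu_temp_c > MAX_CPU_TEMP_C
            or node.fan_rpm < MIN_FAN_RPM
            or node.mem_errors_per_hour > MAX_MEM_ERRORS_PER_HOUR)


def plan_migrations(nodes: List[NodeHealth]) -> List[Tuple[str, str]]:
    """Pair each unhealthy node with a healthy one as (source, target).

    Unhealthy nodes left over when healthy nodes run out are not paired.
    """
    unhealthy = [n for n in nodes if is_unhealthy(n)]
    healthy = [n for n in nodes if not is_unhealthy(n)]
    return [(src.name, dst.name) for src, dst in zip(unhealthy, healthy)]


if __name__ == "__main__":
    cluster = [
        NodeHealth("node01", cpu_temp_c=92.0, fan_rpm=3000, mem_errors_per_hour=0.0),
        NodeHealth("node02", cpu_temp_c=60.0, fan_rpm=3000, mem_errors_per_hour=0.0),
    ]
    for source, target in plan_migrations(cluster):
        print(f"vacate {source}: migrate its work to {target}")
```

A real system would also have to weigh how loaded each target already is and how urgent each warning sign is; the sketch only shows the decision of which nodes to vacate.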

In addition to serving as a test bed for innovation, ARC was also designed to encourage research and education in computer science.

"We are trying to educate our students about where technology is headed in the future, and how to solve problems and test large-scale installations," Mueller said.

"On a national scale, we want to pitch this to National Labs and large-scale research installations. It could potentially be an interest on the commercial side as well, from car companies to pharmaceuticals."