Crash-Test Dummy For High-Performance Computing

04.04.2011 | Matt Shipman

If crashing a car is expensive, think about the cost of crashing a supercomputer.

When you’re trying to solve large-scale problems, sometimes you’ve got to experiment with changes to the fundamental building blocks of the system itself. That means things can break. And when you’re talking about the fastest computers in the world, that would be very expensive. Solution? Build your own high-performance computing (HPC) system – then you can do whatever you want.

That’s what Frank Mueller did. HPC refers largely to scientific computing done at a very large scale, such as global climate simulations or DNA sequencing. HPC relies on machines made up of thousands of computer processors linked together and working in tandem. These HPC systems are incredibly fast. In fact, there’s a lot of competition to see which machine is fastest.

The system built by Mueller’s team was completed March 30, and will serve as a sort of crash-test dummy for potential new solutions to the major obstacles facing next generation HPC system design. “We can do anything we want with it,” Mueller says. “We can experiment with potential solutions to major problems, and we don’t have to worry about delaying work being done on the large-scale systems at other institutions.”

In 2010, an updated list of the fastest computers in the world was released. For the first time in years, the United States didn’t top the list – a machine in China took top prize.

The National Science Foundation (NSF) and U.S. Department of Energy (DOE) are now trying to regain the title. The fastest computers in the world currently operate at speeds measured in petaflops – meaning they can conduct 1,000 trillion operations per second. NSF and DOE are funding research to support a machine that would operate in exaflops, which would be 1,000 times faster than today’s fastest machines.
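To put those units in perspective, here is a quick back-of-the-envelope comparison in plain Python; the constants are simply the standard definitions of a petaflop and an exaflop:

```python
# Scale comparison: a petaflop is 10**15 floating-point operations per second
# (1,000 trillion), and an exaflop is 10**18, a thousand times more.
PETAFLOP = 10**15
EXAFLOP = 10**18

print(f"An exaflop machine is {EXAFLOP // PETAFLOP:,}x faster than a 1-petaflop machine.")
# -> An exaflop machine is 1,000x faster than a 1-petaflop machine.
```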

Researchers face several challenges in designing these exaflop machines – every problem that current HPC systems face would be magnified at exaflop scale.

For example, in current HPC systems, hours’ worth of computational effort is lost whenever a single component fails. Because current systems have hundreds of thousands of components, these failures are inevitable but relatively uncommon – they tend to happen once or twice a day. But in an exaflop system, there will be many millions of components – and since the failure rate grows with the component count, failures would become far more frequent, hurting the system’s efficiency.
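To see why, here is a rough sketch of the arithmetic, assuming components that fail independently and are all equally reliable, so that the system-wide mean time between failures (MTBF) shrinks in proportion to the component count. The 300-year per-component MTBF and both component counts are hypothetical figures, chosen only so that the current-generation case lands near the “once or twice a day” rate mentioned above:

```python
# Rough sketch: with independent, equally reliable components, the system MTBF
# is roughly the per-component MTBF divided by the number of components.
# The 300-year per-component MTBF and both component counts are hypothetical,
# chosen only to illustrate the scaling.
HOURS_PER_YEAR = 8760

def system_mtbf_hours(component_mtbf_years: float, num_components: int) -> float:
    """Approximate system MTBF in hours for independent, identical components."""
    return component_mtbf_years * HOURS_PER_YEAR / num_components

today = system_mtbf_hours(300, 200_000)        # hundreds of thousands of components
exascale = system_mtbf_hours(300, 10_000_000)  # many millions of components

print(f"Current-scale system: a failure roughly every {today:.0f} hours")
print(f"Exaflop-scale system: a failure roughly every {exascale * 60:.0f} minutes")
```

Under those assumptions, failures go from roughly once or twice a day to several times an hour.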

In order to test possible solutions to this and other problems, researchers need to make fundamental changes to the HPC system’s entire software stack, including the operating system.

“There is no way that large-scale HPC system operators, like Oak Ridge National Labs, would let us experiment with their systems,” says Mueller, a computer science professor at NC State. “We could break them.”

To get around the problem, Mueller and a team of researchers secured funding from NSF, NVIDIA and NC State to build their own HPC system – the largest academic HPC system in North Carolina.

Once Mueller and his team have shown that a solution has worked on their system, it can be tested on more powerful, high-profile systems – like the Jaguar supercomputer at Oak Ridge.

I won’t go into all of the technical details of the system at NC State (that information is available here), but here’s an overview: it has 1,728 processor cores and 36 NVIDIA Tesla C2050 GPUs spread across 108 compute nodes (32GB of RAM each).
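As a quick sanity check, the per-node breakdown implied by those headline numbers works out as follows; the article only gives totals, so these are averages rather than a claim about how the GPUs are physically distributed:

```python
# Per-node breakdown derived from the figures quoted above.
cores_total = 1728
gpus_total = 36
nodes = 108
ram_per_node_gb = 32

print(f"Cores per node:          {cores_total // nodes}")       # 16
print(f"Nodes per GPU (average): {nodes // gpus_total}")        # 3
print(f"Total system memory:     {nodes * ram_per_node_gb} GB") # 3456 GB
```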
