US researchers are working to perfect a series of simulations capable of rendering the performance of a nuclear weapon in precise molecular detail.
Due to their complexity, the simulations must be run on supercomputers with thousands of processors – inevitably creating a number of reliability and accuracy issues.
Indeed, the simulations, which are required to certify nuclear weapons more efficiently, may demand up to 100,000 machines, a scale essential to accurately render molecular-scale reactions taking place over milliseconds, or thousandths of a second.
Fortunately, researchers at Purdue and the National Nuclear Security Administration’s (NNSA) Lawrence Livermore National Laboratory say they have made progress in advancing the use of ultra-precise simulations.
“Such highly complex jobs must be split into many processes that execute in parallel on separate machines in large computer clusters,” explained Professor Saurabh Bagchi of Purdue University. “Due to natural faults in the execution environment, there is a high likelihood that some processing element will have an error during the application’s execution, resulting in corrupted memory or failed communication between machines. There are bottlenecks in terms of communication and computation.”
According to Bagchi, such errors compound the longer a simulation runs before the glitch is detected, potentially causing processes to stall or crash.
“We are particularly concerned with errors that corrupt data silently, possibly generating incorrect results with no indication that the error has occurred,” said Bronis R. de Supinski, co-leader of the ASC Application Development Environment Performance Team at Lawrence Livermore. “Errors that significantly reduce system performance are also a major concern since the systems on which the simulations run are very expensive.”
The researchers have, however, developed automated methods to detect a glitch soon after it occurs.
“You want the system to automatically pinpoint when and in what machine the error took place and also the part of the code that was involved,” Bagchi said. “Then, a developer can come in, look at it and fix the problem.”
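The article does not describe the detection machinery itself, but the kind of report Bagchi describes, capturing when an error occurred, on which machine, and in which part of the code, can be sketched in a few lines of Python. Everything here (the `run_step` wrapper, the report fields) is illustrative, not the researchers' actual tooling:

```python
import socket
import time
import traceback

def run_step(step_fn, step_name):
    """Run one simulation step and, on failure, record when the error
    happened, on which machine, and which code was involved.
    (Hypothetical report format; not the researchers' actual tool.)"""
    try:
        return step_fn()
    except Exception:
        report = {
            "when": time.strftime("%Y-%m-%d %H:%M:%S"),
            "machine": socket.gethostname(),  # which machine failed
            "step": step_name,                # which phase of the run
            "code": traceback.format_exc(),   # which part of the code
        }
        print(report)  # in practice, forwarded to a developer
        raise

# A failing step produces a report pinpointing all three facts:
# run_step(lambda: 1 / 0, "pressure update")
```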
One bottleneck arises from the fact that data from all of the machines are streamed to a central server.
“Streaming data to a central server works fine for a hundred machines, but it can’t keep up when you are streaming data from a thousand machines,” said Purdue doctoral student Ignacio Laguna, who worked with Lawrence Livermore computer scientists. “We’ve eliminated this central brain, so we no longer have that bottleneck.”
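The article does not spell out what replaced the central server, but a standard way to avoid a single collector is to aggregate results along a tree of processes, as MPI reductions do. The sketch below, using the mpi4py package, illustrates that general pattern only; it is not necessarily the team's actual design:

```python
from mpi4py import MPI  # assumes an MPI installation plus mpi4py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process summarises its own behaviour locally; this float is
# a placeholder standing in for real monitoring data.
local_summary = float(rank)

# reduce() combines values along a tree of processes, so each node
# merges only a few partial results from its children and no single
# "central brain" receives one stream per machine.
total = comm.reduce(local_summary, op=MPI.SUM, root=0)

if rank == 0:
    print("aggregated summary:", total)
```

Launched with something like `mpirun -n 1000 python aggregate.py`, the aggregation work at any one node stays small no matter how many machines participate.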
Each machine in the supercomputer cluster contains several cores, or processors, and each core might run one “process” during simulations. The researchers created an automated method for “clustering,” or grouping the large number of processes into a smaller number of “equivalence classes” with similar traits. This grouping makes it possible to quickly detect and pinpoint problems.
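The traits the researchers cluster on are not specified in the article, but the mechanics of grouping by a behavioural signature are simple to illustrate. In this hypothetical Python sketch, processes reporting the same (function, phase) pair, a made-up trait, fall into the same equivalence class:

```python
from collections import defaultdict

def cluster_processes(processes):
    """Group processes into equivalence classes keyed by a behavioural
    signature; the (function, phase) pair used here is a made-up trait."""
    classes = defaultdict(list)
    for proc in processes:
        classes[(proc["function"], proc["phase"])].append(proc["rank"])
    return classes

# Toy run: rank 2 is stuck somewhere different from its peers, so it
# ends up alone in its own class and is immediately suspicious.
procs = [
    {"rank": 0, "function": "exchange_halo", "phase": 7},
    {"rank": 1, "function": "exchange_halo", "phase": 7},
    {"rank": 2, "function": "wait_recv",     "phase": 3},
]
for signature, ranks in cluster_processes(procs).items():
    print(signature, "->", ranks)
```

With 100,000 processes, a class holding only one or two members is a strong hint about where the fault lies, which is what makes grouping faster than inspecting processes one by one.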
Lawrence Livermore computer scientist Todd Gamblin came up with the scalable clustering approach. A lingering bottleneck in using the simulations is a procedure called checkpointing: periodically storing data to prevent its loss in case a machine or application crashes. The information is saved in a file called a checkpoint and stored on a parallel file system separate from the machines on which the application runs.
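Checkpoint/restart itself is a simple pattern, even if doing it at scale is not. The minimal Python sketch below shows the basic loop; the path, interval, and state layout are hypothetical stand-ins, and a real run would write files for thousands of processes at once, which is exactly the traffic that overwhelms the file system in the quote that follows:

```python
import pickle
import time

CHECKPOINT_PATH = "/parallel_fs/sim.ckpt"  # hypothetical path on the parallel file system
CHECKPOINT_INTERVAL = 60.0                 # hypothetical seconds between saves

def simulate(state, total_steps):
    last_save = time.monotonic()
    for step in range(state["step"], total_steps):
        state["step"] = step
        # ... advance the simulation by one timestep here ...
        if time.monotonic() - last_save >= CHECKPOINT_INTERVAL:
            with open(CHECKPOINT_PATH, "wb") as f:
                pickle.dump(state, f)      # persist progress
            last_save = time.monotonic()
    return state

# After a crash, restart from the last checkpoint instead of step zero:
# with open(CHECKPOINT_PATH, "rb") as f:
#     state = pickle.load(f)
# simulate(state, total_steps)
```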
“The problem is that when you scale up to 10,000 machines, this parallel file system bogs down… It’s about 10 times too much activity for the system to handle, and this mismatch will just become worse because we are continuing to create faster and faster computers.
“[Yes], we’re beginning to solve the checkpointing problems… It’s not completely solved, but we are getting there. The checkpointing bottleneck must be solved in order for researchers to create supercomputers capable of ‘exascale computing,’ or 1,000 quadrillion operations per second. [And that is why this is] the Holy Grail of supercomputing,” Bagchi added.