Swarthmore College Department of Computer Science

Talk by Alison Norman, Department of Computer Science at the University of Texas at Austin

Towards Scalable Checkpointing in Supercomputing Applications
Thursday, February 17, 2011
SCI 240, 4:00 pm (refreshments at 3:45)

Abstract

Long-running parallel applications must occasionally save their state in a "checkpoint" so that the computation can be recovered after a failure in software, hardware, or the environment (e.g., power). However, current checkpointing methods are becoming untenable for large-scale parallel applications on supercomputers. Many applications checkpoint all of their parallel processes simultaneously, a technique that is easy to implement but can saturate the network and file system and significantly increase checkpoint overhead.

This talk introduces "compiler-assisted staggered checkpointing", in which processes checkpoint at different places in the application text, thereby reducing contention for the network and file system. Placing staggered checkpoints is algorithmically challenging because the number of possible solutions is enormous and the number of desirable solutions is small. We have developed a compiler algorithm that both places staggered checkpoints in an application and ensures that the resulting solution is desirable. The algorithm successfully places staggered checkpoints in parallel applications configured to use tens of thousands of processes. For our benchmarks, it finds checkpoint placements that make checkpointing significantly faster than the current state of the art.