Using Checkpoint/Restart [7]

The Cray checkpoint/restart facility enables you to save the state of a job to a checkpoint file and later restart the job from its most recent checkpoint. Cray checkpoint/restart is based on Berkeley Lab Checkpoint/Restart (BLCR). The supported workload management systems are Moab/TORQUE and PBS Professional.

Parallel applications must use MPI, SHMEM, or OpenMP; other parallel programming models are not supported. In general, MPI-2 applications are supported, but MPI process management is not supported. No changes to application source code are required to checkpoint and restart a job. For OpenMP applications, you must load the crprep module to link in the functionality required for checkpoint/restart.

Cray checkpoint/restart provides the qhold, qchkpt, qrls, and qrerun commands. See the qhold(1), qchkpt(1), qrls(1), and qrerun(1) man pages for details about these commands.

Note: Use the Cray checkpoint/restart commands, not the BLCR commands. The native BLCR cr_checkpoint and cr_restart commands are not supported. Also, use the Cray man pages; the BLCR cr_checkpoint(1) and cr_restart(1) man pages document some features that are not supported on Cray systems.
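
The following is a minimal sketch of how the four commands might be invoked on a job, not a definitive procedure. The job identifier 12345 is a placeholder for a value returned by qsub, and the DRY_RUN variable and run helper are illustrative additions that only print each command unless DRY_RUN is set to 0. The comments summarize the usual behavior of these commands in TORQUE-style batch systems and are an assumption here; the Cray man pages remain the authority for your system.

```shell
#!/bin/sh
# Sketch only: JOBID is a placeholder for an identifier returned by qsub.
JOBID=${JOBID:-12345}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands; 0 = execute them

# Illustrative helper (not part of Cray checkpoint/restart).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run qchkpt "$JOBID"   # checkpoint the running job and let it continue
run qhold "$JOBID"    # checkpoint the running job and place it on hold
run qrls "$JOBID"     # release the hold so the job can restart from its checkpoint
run qrerun "$JOBID"   # requeue a running job so that it is run again
```

In practice you would issue only the commands you need, for example qhold followed later by qrls, rather than all four in sequence.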

To use checkpoint/restart, you must load the workload management system module (moab or pbs) and the blcr module. Loading the blcr module causes subsequent compilations to link the libraries that make the application checkpointable.
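
As a rough illustration of the setup described above, the following sketch loads the modules and then compiles. The source file my_app.c is hypothetical, and the DRY_RUN variable and run helper are illustrative additions that only print each command unless DRY_RUN is set to 0. The sketch assumes the cc compiler driver and a shell in which the module command is already initialized; adjust both for your site.

```shell
#!/bin/sh
# Sketch only: my_app.c is a hypothetical source file. Set WLM_MODULE to
# the workload management system module your site uses (moab or pbs).
WLM_MODULE=${WLM_MODULE:-moab}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands; 0 = execute them

# Illustrative helper (not part of Cray checkpoint/restart).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run module load "$WLM_MODULE"   # workload management system module
run module load blcr            # load before compiling so the link picks up the libraries
run cc -o my_app my_app.c       # compile and link with the blcr module loaded
```

The order matters: an application compiled before the blcr module is loaded is not linked with the checkpoint libraries and would need to be rebuilt.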

Note: When you compile an application with checkpoint/restart support (that is, you load the blcr module), each processing element spawns a thread. Take this into account when you specify aprun placement options.

You should also be aware of the following checkpoint/restart usage restrictions:

For an example showing how to create, checkpoint, and restart a job, see Using Checkpoint/Restart Commands.