The Cray checkpoint/restart facility enables you to save job state to a checkpoint file and restart the job from its latest checkpoint at a future time. Cray checkpoint/restart is based on Berkeley Lab Checkpoint Restart (BLCR). The supported workload management systems are Moab/TORQUE and PBS Professional.
Parallel applications must use MPI, SHMEM, or OpenMP; other parallel programming models are not supported. In general, MPI-2 applications are supported, but MPI process management is not. No changes to application source code are required to checkpoint and restart a job. For OpenMP applications, you must load the crprep module to link in the functionality required for checkpoint/restart.
Cray checkpoint/restart provides these commands:
qhold, which checkpoints a job, releases resources assigned to the job, and places the job in hold state in the job queue.
qchkpt, which checkpoints a job while the job keeps running.
qrls, which releases a checkpointed job from hold state; the job resumes running.
qrerun, which restarts a previously checkpointed job that has completed, is still queued in the completed state, and has not yet exited the workload management system.
Note: A system variable sets the amount of time a job can remain in the queue in the completed state. Once a job has been removed from the queue, you can no longer use qrerun to restart it.
See the qhold(1), qchkpt(1), qrls(1), and qrerun(1) man pages for details about these commands.
Note: Use the Cray checkpoint/restart commands, not the BLCR commands. The native BLCR cr_checkpoint and cr_restart commands are not supported. Also, use the Cray man pages; the BLCR cr_checkpoint(1) and cr_restart(1) man pages document some features that are not supported on Cray systems.
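The commands listed above each take a job identifier as their argument. The following is a minimal sketch of how they might be invoked, not a definitive procedure. The job ID `12345.sdb` is a hypothetical placeholder, and the `run` helper and `DRY_RUN` variable are illustrative additions that only print each command unless dry-run mode is turned off.

```shell
#!/bin/sh
# Hypothetical job ID; substitute the ID reported when your job was submitted.
JOBID="${JOBID:-12345.sdb}"

# Illustrative helper (not part of Cray checkpoint/restart): prints each
# command by default. Set DRY_RUN=0 on a system where the commands exist.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run qchkpt "$JOBID"   # checkpoint the job; it keeps running
run qhold "$JOBID"    # checkpoint, release resources, place the job on hold
run qrls "$JOBID"     # release the held job; it resumes running

# Separate case: a checkpointed job that has completed but is still in the queue.
run qrerun "$JOBID"
```

In dry-run mode the script only echoes the four commands, which makes it safe to read through the sequence before trying it on a real job.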
To use checkpoint/restart, you must load the workload management system module (pbs) and the blcr module. Loading the blcr module causes subsequent compilations to link the libraries that make the application checkpointable.
Note: When you compile an application with checkpoint/restart support (that is, with the blcr module loaded), each processing element spawns a thread. Take this into account when you specify aprun placement options.
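The module setup described above might look like the following sketch. The module names pbs and blcr come from the text; the compiler driver `cc`, the source file `app.c`, and the output name `app` are assumptions used only for illustration, as is the `run` dry-run helper.

```shell
#!/bin/sh
# Illustrative helper (an assumption, not a Cray tool): prints each command
# by default. Set DRY_RUN=0 to execute the commands on a real system.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run module load pbs    # workload management system module
run module load blcr   # load before compiling so the libraries are linked

# Assumption: 'cc' is the compiler driver and app.c is a placeholder source file.
run cc -o app app.c
```

The order matters in this sketch: the blcr module is loaded before the compile step because only subsequent compilations link the checkpoint libraries.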
You should also be aware of the following checkpoint/restart usage restrictions:
You cannot checkpoint/restart applications that are launched interactively through aprun.
You cannot checkpoint/restart applications that use TCP/IP sockets.
Files are handled by reference only. The checkpoint facility (BLCR) captures the state only of those files that are open at checkpoint time. For example, if a file opened in append-only mode has grown since the checkpoint, at restart it will be truncated to the checkpointed size. If a file is shared for writing across compute nodes, it will be truncated to the smallest file position held by any of the sharers.
Linux asynchronous I/O is not supported.
Applications that connect stderr to a TTY are not supported.
Checkpoint/restart does not support applications being debugged with an interactive debugger.
Any pages pinned at the time of checkpoint will not be pinned at restart. Applications that pin pages will not restart properly unless they are implemented to handle previously pinned pages.
Any memory registration handled outside of a programming model such as MPICH2 or SHMEM must be handled by the checkpointed application. Those registrations do not persist across a restart.
Process groups and pages are not supported.
Unix System V shared memory and IPC (interprocess communication) resources are not supported.
CPU affinity is not supported on restart.
Cray performance tools shall provide reasonably accurate performance information across a checkpoint/restart. Other performance tools may not give accurate performance information across a checkpoint/restart.
/tmp is required for on-node MPI message transfers.
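The file-handling restriction in the list above (an appended file being cut back to its checkpointed size) can be illustrated with ordinary file operations. This is a simulation only: no checkpoint is taken, the temporary file and variable names are hypothetical, and it assumes the `truncate` utility from GNU coreutils is available.

```shell
#!/bin/sh
# Simulation only: imitates the truncate-on-restart effect described above.
log="$(mktemp)"

printf 'step 1\n' >> "$log"                    # written before the "checkpoint"
ckpt_size=$(( $(wc -c < "$log") + 0 ))         # size recorded at checkpoint time

printf 'step 2\n' >> "$log"                    # appended after the checkpoint
grown_size=$(( $(wc -c < "$log") + 0 ))

# "Restart": the file is cut back to the size it had at checkpoint time,
# so the data appended afterwards is lost.
truncate -s "$ckpt_size" "$log"
restart_size=$(( $(wc -c < "$log") + 0 ))

echo "checkpoint=$ckpt_size grown=$grown_size restart=$restart_size"
rm -f "$log"
```

An application that appends to a log or output file between checkpoints should therefore expect to regenerate anything written after the last checkpoint.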
For an example showing how to create, checkpoint, and restart a job, see Using Checkpoint/Restart Commands.