Checkpoint and restore functionality for CUDA is exposed through a command-line utility called `cuda-checkpoint`, which is available in the `bin` directory of this repo.
This utility can transparently checkpoint and restore CUDA state within a running Linux process,
and can be combined with CRIU (described below) to fully checkpoint CUDA applications.
Transparent, per-process checkpointing offers a middle ground between virtual machine checkpointing and application-driven checkpointing. It can be used in combination with containers to checkpoint the state of a complex application, facilitating use cases such as:
- fault tolerance (with periodic checkpoints),
- preemption of lower-priority work on a single node (by checkpointing the preempted task), and
- cluster scheduling (with migration).
*[Table: comparison of virtual machine, per-process, and application-driven checkpointing]*
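As a sketch of the fault-tolerance use case above, the hypothetical Python helper below builds (but does not run) the command sequence for one periodic checkpoint: suspend CUDA, dump with CRIU while leaving the process running, then resume CUDA. `checkpoint_commands` and `IMAGE_DIR` are invented names for illustration; actually running these commands requires a GPU, the target process, and appropriate privileges.

```python
IMAGE_DIR = "demo"  # hypothetical directory for CRIU checkpoint images

def checkpoint_commands(pid):
    """Build one periodic-checkpoint command sequence (illustrative only):
    suspend CUDA in the process, dump it with CRIU while leaving it
    running (`--leave-running`), then resume CUDA."""
    toggle = "cuda-checkpoint --toggle --pid %d" % pid
    dump = ("criu dump --shell-job --leave-running "
            "--images-dir %s --tree %d" % (IMAGE_DIR, pid))
    return [toggle, dump, toggle]  # suspend, dump, resume

for cmd in checkpoint_commands(298027):
    print(cmd)
```

In a real deployment this sequence would run on a timer, with image directories rotated so that a recent checkpoint is always available for restore.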
The most popular tool for transparent, per-process checkpointing is CRIU.
CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility for Linux,
maintained outside of NVIDIA, which can checkpoint and restore process trees.
CRIU exposes its functionality through a command-line program called `criu`
and operates by checkpointing and restoring every kernel-mode resource associated with a process. These resources include:
- anonymous memory,
- threads,
- regular files,
- sockets, and
- pipes between checkpointed processes.
Since the behavior of these resources is specified by Linux and is independent of the underlying hardware,
CRIU knows how to checkpoint and restore them.
In contrast, NVIDIA GPUs provide functionality beyond that of a standard Linux kernel, so CRIU cannot manage them on its own.
`cuda-checkpoint` adds this capability, and can therefore be used together with CRIU to checkpoint and restore a CUDA application.
`cuda-checkpoint` checkpoints and restores the CUDA state of a single Linux process.
The utility supports display driver version 550 and higher and is located in the `bin` directory of this repo.
```
localhost$ cuda-checkpoint --help
CUDA checkpoint and restore utility.
Toggles the state of CUDA within a process between suspended and running.
Version 550.54.09. Copyright (C) 2024 NVIDIA Corporation. All rights reserved.
    --toggle --pid <value>
        Toggle the state of CUDA in the specified process.
    --help
        Print help message.
```
The `cuda-checkpoint` binary can toggle the CUDA state of a process (specified by PID) between suspended and running.
A running-to-suspended transition is called a suspend, and the opposite transition is called a resume.
A process's CUDA state is initially running.

When `cuda-checkpoint` is used to suspend CUDA in a process:
- any CUDA driver APIs which launch work, manage resources, or otherwise impact GPU state are locked;
- already-submitted CUDA work, including stream callbacks, is completed;
- device memory is copied to the host, into allocations managed by the CUDA driver; and
- all of CUDA’s GPU resources are released.
`cuda-checkpoint` does not suspend CPU threads, which may continue to safely interact with CUDA by:

- calling runtime or driver APIs (which may block until CUDA is resumed), and
- accessing host memory (allocated by `cudaMallocHost` and similar APIs), which remains valid.
A suspended CUDA process no longer directly refers to any GPU hardware at the OS level and may therefore be checkpointed by a CPU checkpointing utility such as CRIU.
When a process’s CUDA state is resumed using `cuda-checkpoint`:
- GPUs are re-acquired by the process;
- device memory is copied back to the GPU, and GPU memory mappings are restored at their original addresses;
- CUDA objects such as streams and contexts are restored; and
- CUDA driver APIs are unlocked.
At this point, CUDA calls will unblock and CUDA may begin running on the GPU again.
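The blocking behavior described above can be modeled abstractly. The toy Python sketch below (`ToyCuda` is an invented model, not the actual driver mechanism) mimics how an API call made while CUDA is suspended simply blocks, then completes once the state is toggled back to running:

```python
import threading

class ToyCuda:
    """Toy model of suspend/resume blocking; invented for illustration,
    this is not how the driver is actually implemented."""
    def __init__(self):
        self._running = threading.Event()
        self._running.set()  # CUDA state is initially running

    def toggle(self):
        # Mimics `cuda-checkpoint --toggle`: flip suspended <-> running.
        if self._running.is_set():
            self._running.clear()   # suspend: subsequent API calls block
        else:
            self._running.set()     # resume: blocked API calls unblock

    def api_call(self):
        # A CPU thread may still call into "CUDA"; the call simply
        # blocks until the state is resumed, then proceeds.
        self._running.wait()
        return "done"

cuda = ToyCuda()
cuda.toggle()                        # suspend

result = []
t = threading.Thread(target=lambda: result.append(cuda.api_call()))
t.start()
t.join(timeout=0.2)
blocked_while_suspended = t.is_alive()   # call is still blocked

cuda.toggle()                        # resume
t.join()
print(blocked_while_suspended, result[0])  # prints: True done
```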
This example will use `cuda-checkpoint` and `criu` to checkpoint a CUDA application called `counter`.
Every time `counter` receives a packet, it increments GPU memory and replies with the updated value.
The source code for `counter` is shown below.
```cuda
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT 10000

__device__ int counter = 100;

__global__ void increment()
{
    counter++;
}

int main(void)
{
    cudaFree(0); // force CUDA context creation

    int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    sockaddr_in addr = {AF_INET, htons(PORT), inet_addr("127.0.0.1")};
    bind(sock, (sockaddr *)&addr, sizeof addr);

    while (true) {
        char buffer[16] = {0};
        sockaddr_in peer = {0};
        socklen_t inetSize = sizeof peer;
        int hCounter = 0;

        recvfrom(sock, buffer, sizeof buffer, 0, (sockaddr *)&peer, &inetSize);
        increment<<<1,1>>>();
        cudaMemcpyFromSymbol(&hCounter, counter, sizeof counter);

        size_t bytes = sprintf(buffer, "%d\n", hCounter);
        sendto(sock, buffer, bytes, 0, (sockaddr *)&peer, inetSize);
    }
    return 0;
}
```
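For readers without a GPU at hand, the request/reply protocol that `counter` implements can be exercised with a pure-Python stand-in. This is only a model: a host-side integer replaces the `__device__ int counter` that the real program keeps in GPU memory, and an ephemeral port is used instead of the hard-coded 10000.

```python
import socket
import threading

def serve(sock, requests):
    # Stand-in for counter's loop: a plain int replaces the
    # __device__ counter the real program keeps in GPU memory.
    counter = 100
    for _ in range(requests):
        data, peer = sock.recvfrom(16)
        counter += 1                      # models the increment<<<1,1>>>() launch
        sock.sendto(b"%d\n" % counter, peer)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))             # ephemeral port; the CUDA source uses 10000
addr = server.getsockname()
threading.Thread(target=serve, args=(server, 2), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
replies = []
for _ in range(2):                        # like `echo hello | nc -u localhost 10000`
    client.sendto(b"hello\n", addr)
    reply, _ = client.recvfrom(16)
    replies.append(reply.strip().decode())
print(replies)                            # prints: ['101', '102']
```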
The `counter` application can be built using `nvcc`.

```
localhost$ nvcc counter.cu -o counter
```
Next, launch `counter` and wait to be sure that it is listening on its socket (which is important if this demo is being run as a single script).

```
localhost# ./counter &
[1] 298027
localhost# sleep 1
```
Save `counter`’s PID for reference in subsequent commands.

```
localhost# PID=$!
```
Send `counter` a packet and observe the returned value. The initial value was 100 but the response is 101, showing that the GPU memory has changed since initialization.

```
localhost# echo hello | nc -u localhost 10000 -W 1
101
```
Use `nvidia-smi` to confirm that `counter` is running on a GPU.

```
localhost# nvidia-smi --query --display=PIDS | grep $PID
        Process ID : 298027
```
Use `cuda-checkpoint` to suspend `counter`’s CUDA state.

```
localhost# cuda-checkpoint --toggle --pid $PID
```
Use `nvidia-smi` to confirm that `counter` is no longer running on a GPU.

```
localhost# nvidia-smi --query --display=PIDS | grep $PID
```
Create a directory to hold the checkpoint image.

```
localhost# mkdir -p demo
```
Use `criu` to checkpoint `counter`.

```
localhost# criu dump --shell-job --images-dir demo --tree $PID
[1]+  Killed                 ./counter
```
Confirm that `counter` is no longer running.

```
localhost# ps --pid $PID
    PID TTY          TIME CMD
```
Use `criu` to restore `counter`.

```
localhost# criu restore --shell-job --restore-detached --images-dir demo
```
Use `cuda-checkpoint` to resume `counter`’s CUDA state.

```
localhost# cuda-checkpoint --toggle --pid $PID
```
Now that `counter` is fully restored, send it another packet. The response is 102, showing that the earlier GPU operations were persisted correctly!

```
localhost# echo hello | nc -u localhost 10000 -W 1
102
```
As of display driver version 550, checkpoint and restore functionality is still being actively developed. In particular, `cuda-checkpoint`:
- is x64 only,
- acts upon a single process, not a process tree,
- does not support UVM or IPC memory,
- does not support GPU migration,
- waits for already-submitted CUDA work to finish before completing a checkpoint, and
- does not attempt to keep the process in a good state if an error (such as the presence of a UVM allocation) is encountered during checkpoint or restore.
These limitations will be addressed in subsequent display driver releases and will not require an update to the `cuda-checkpoint` utility itself.
The `cuda-checkpoint` utility simply exposes functionality that is contained in the driver.
By downloading or using the software, you agree to the terms of the License Agreement for NVIDIA Software Development Kits — EULA.