Improve UX when creating and removing containers while containerd service is not running #46451
Description
This came out of writing up some examples for a question / discussion; perhaps we can improve the "interact with containers while containerd is down" UX a bit.
Here's an example where a container is started (and running), then containerd is restarted or stopped;
# create a container that prints the date every 2 seconds
docker run -d --rm --name mycontainer alpine sh -c 'while true; do date; sleep 2; done'
# check that everything is running
ps auxf
# ...
root 47222 0.2 4.2 1209564 42272 ? Ssl 11:36 0:00 /usr/bin/containerd
root 47370 0.0 0.9 720776 9188 ? Sl 11:38 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 -address /run/containerd/containerd.sock
root 47390 0.0 0.0 1604 936 ? Ss 11:38 0:00 \_ sh -c while true; do date; sleep 2; done
root 47480 0.0 0.0 1588 4 ? S 11:39 0:00 \_ sleep 2
# stop containerd and check that the container is still running
systemctl stop containerd.service
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4c330b522c68 alpine "sh -c 'while true; …" 2 minutes ago Up 2 minutes mycontainer
docker logs mycontainer
Mon Sep 11 11:28:10 UTC 2023
Mon Sep 11 11:28:12 UTC 2023
Mon Sep 11 11:28:14 UTC 2023
Mon Sep 11 11:28:16 UTC 2023
ps auxf
# ...
root 47370 0.0 0.9 720776 9308 ? Sl 11:38 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 -address /run/containerd/containerd.sock
root 47390 0.0 0.0 1604 936 ? Ss 11:38 0:00 \_ sh -c while true; do date; sleep 2; done
root 47584 0.0 0.0 1588 4 ? S 11:41 0:00 \_ sleep 2
Now, while containerd is stopped, try to start a new container; this failure is expected, but perhaps we can make the error message a bit more informative (mention "containerd is not available" or something);
docker run -it --rm alpine
docker: Error response from daemon: connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout": unavailable.
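As an illustration of the more informative error suggested above, a pre-flight check could probe the containerd socket and say so explicitly. This is a hypothetical sketch; it assumes the default socket path rather than reading it from the daemon configuration:

```shell
# Hypothetical pre-flight check: probe the containerd socket so we can
# report "containerd is not available" instead of a raw gRPC dial error.
containerd_available() {
    # Assumes the default socket path.
    sock="${1:-/run/containerd/containerd.sock}"
    [ -S "$sock" ]
}

if containerd_available; then
    echo "containerd socket present; proceeding"
else
    echo "error: containerd is not available (socket missing)" >&2
fi
```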
Now try to remove the container that's still running; this eventually returns (after ~10 seconds?) but with an error:
docker rm -fv mycontainer
Error response from daemon: Could not kill running container 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07, cannot remove - tried to kill container, but did not receive an exit event
Note that after this, the container's process is killed, but the shim is still running:
ps auxf
# ...
root 47370 0.0 0.9 720776 9676 ? Sl 11:38 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 -address /run/containerd/containerd.sock
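A leftover shim like this could be detected with a small check. A sketch (the `[c]` bracket trick keeps `pgrep -f` from matching the script's own command line, which also contains the pattern text):

```shell
# Sketch: detect a leftover containerd-shim for a given container ID.
shim_running() {
    cid="$1"
    pgrep -f "[c]ontainerd-shim-runc-v2.*-id ${cid}" >/dev/null
}

cid=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07
if shim_running "$cid"; then
    echo "shim for container ${cid} is still running"
else
    echo "no shim found for container ${cid}"
fi
```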
And `docker ps` still shows the container (but the logs don't get updated);
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4c330b522c68 alpine "sh -c 'while true; …" 4 minutes ago Up 4 minutes mycontainer
docker logs mycontainer
Mon Sep 11 11:28:10 UTC 2023
...
Mon Sep 11 11:31:56 UTC 2023
Logs from when containerd was stopped and the `docker rm` was tried; they mention that the service was stopped, but the shim process continues running (which is the expected behavior). However, because containerd is down, dockerd starts to log errors that it can't receive events;
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: Stopping containerd container runtime...
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.148357817Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = error reading from server: EOF" module=libcontainerd namespace=moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.148458280Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.149260128Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = error reading from server: EOF" module=libcontainerd namespace=plugins.moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.149299740Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: containerd.service: Deactivated successfully.
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: containerd.service: Unit process 47370 (containerd-shim) remains running after unit stopped.
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: Stopped containerd container runtime.
When trying to delete the container, the `kill` and `delete` fail;
Sep 11 11:42:47 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:42:47.851968146Z" level=error msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 error="context deadline exceeded"
Sep 11 11:42:47 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: docker-4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07.scope: Deactivated successfully.
Sep 11 11:42:49 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:42:49.857466567Z" level=error msg="Handler for DELETE /v1.43/containers/mycontainer returned error: Could not kill running container 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07, cannot remove - tried to kill container, but did not receive an exit event"
The container's state is incorrect at this point, because no event was received to update its status;
docker container inspect --format '{{ json .State }}' mycontainer | jq .
{
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 47390,
"ExitCode": 0,
"Error": "",
"StartedAt": "2023-09-11T11:39:00.39465654Z",
"FinishedAt": "0001-01-01T00:00:00Z"
}
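Given this stale state, one way to spot the mismatch is to cross-check the recorded `.State.Pid` against `/proc`. An illustrative sketch, with the PID hard-coded from the inspect output above:

```shell
# Sketch: check whether the PID recorded in the container's state still
# exists, as a hint that the reported "running" status is stale.
pid_alive() {
    [ -d "/proc/$1" ]
}

# 47390 is the .State.Pid from the inspect output above; with the daemon
# reachable it could be fetched with:
#   docker container inspect --format '{{ .State.Pid }}' mycontainer
pid=47390
if pid_alive "$pid"; then
    echo "pid $pid still exists; state may be accurate"
else
    echo "pid $pid is gone; the 'running' state is stale"
fi
```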
After starting containerd again, the container can be removed, but again takes longer than expected;
systemctl start containerd.service
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4c330b522c68 alpine "sh -c 'while true; …" 24 minutes ago Up 24 minutes mycontainer
docker rm -fv mycontainer
mycontainer
Logs from starting containerd and removing the container; they contain some messages about network interfaces (perhaps something to look into?):
Sep 11 12:03:18 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: Started containerd container runtime.
Sep 11 12:03:18 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:18.307703964Z" level=info msg="containerd successfully booted in 0.039373s"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T12:03:35.171231154Z" level=error msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 error="context deadline exceeded"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T12:03:35.187655527Z" level=info msg="ignoring event" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.188167420Z" level=info msg="shim disconnected" id=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.188604659Z" level=warning msg="cleaning up after shim disconnected" id=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 namespace=moby
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.188750132Z" level=info msg="cleaning up dead shim"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.199328471Z" level=warning msg="cleanup warnings time=\"2023-09-11T12:03:35Z\" level=info msg=\"starting signal loop\" namespace=moby pid=48633 runtime=io.containerd.runc.v2\n"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd-networkd[47914]: veth6b09306: Lost carrier
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.371657] docker0: port 1(veth6b09306) entered disabled state
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.375041] vethbfca065: renamed from eth0
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: WARNING:Unknown index 19 seen, reloading interface list
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd-udevd[48648]: Using default interface naming scheme 'v249'.
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd-networkd[47914]: veth6b09306: Link DOWN
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.397183] docker0: port 1(veth6b09306) entered disabled state
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.398437] device veth6b09306 left promiscuous mode
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.398443] docker0: port 1(veth6b09306) entered disabled state
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: ERROR:Unknown interface index 19 seen even after reload
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: WARNING:Unknown index 19 seen, reloading interface list
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: run-docker-netns-e93e84a80b3e.mount: Deactivated successfully.
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: var-lib-docker-overlay2-bb6c4fe149ed089867eff21680939d764a30de35a8035dfaa6ee559366410e1b-merged.mount: Deactivated successfully.
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: ERROR:Unknown interface index 19 seen even after reload
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T12:03:35.287777802Z" level=error msg="error removing container" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 error="removal of container 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 is already in progress"
Sep 11 12:03:36 ubuntu-s-1vcpu-1gb-ams3-01 systemd-networkd[47914]: docker0: Lost carrier
Possible improvements here;
- Should we log a more informative warning that containerd is not available to handle requests?
- TBD; do we want to (by default) still continue killing the container, or should we only do so "opt-in"?
- The good thing here is that we can still kill a container (i.e. the process), which can be useful
- The "bad" thing is that we don't inform the user why things took longer than expected (or that we couldn't connect to containerd)
- The "bad" thing is that the container's state no longer matches reality; do we have different means to get the actual state? (the shim is still running, so... maybe?)
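To sketch the "opt-in" idea from the list above: a caller (or dockerd itself) could wait for the containerd socket to come back before attempting the removal, instead of proceeding while no events can be received. A hypothetical helper; the socket path and timeout are assumptions:

```shell
# Sketch: poll for the containerd socket before retrying a removal.
wait_for_containerd() {
    sock="${1:-/run/containerd/containerd.sock}"
    timeout="${2:-30}"   # seconds to wait before giving up
    i=0
    while [ "$i" -lt "$timeout" ]; do
        [ -S "$sock" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Usage with the container from the example above:
#   wait_for_containerd && docker rm -fv mycontainer
```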
This was all on Docker 24.0.6, without the containerd image store enabled.