Improve UX when creating and removing containers while containerd service is not running #46451
Description
This came out of writing up some examples for a question / discussion; perhaps we can improve the "interact with containers while containerd is down" UX a bit.
Here's an example where a container is started (and running), then containerd is restarted or stopped;
# create a container that prints the date every 2 seconds
docker run -d --rm --name mycontainer alpine sh -c 'while true; do date; sleep 2; done'
# check that everything is running
ps auxf
# ...
root 47222 0.2 4.2 1209564 42272 ? Ssl 11:36 0:00 /usr/bin/containerd
root 47370 0.0 0.9 720776 9188 ? Sl 11:38 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 -address /run/containerd/containerd.sock
root 47390 0.0 0.0 1604 936 ? Ss 11:38 0:00 \_ sh -c while true; do date; sleep 2; done
root 47480 0.0 0.0 1588 4 ? S 11:39 0:00 \_ sleep 2
# stop containerd and check that the container is still running
systemctl stop containerd.service
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4c330b522c68 alpine "sh -c 'while true; …" 2 minutes ago Up 2 minutes mycontainer
docker logs mycontainer
Mon Sep 11 11:28:10 UTC 2023
Mon Sep 11 11:28:12 UTC 2023
Mon Sep 11 11:28:14 UTC 2023
Mon Sep 11 11:28:16 UTC 2023
ps auxf
# ...
root 47370 0.0 0.9 720776 9308 ? Sl 11:38 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 -address /run/containerd/containerd.sock
root 47390 0.0 0.0 1604 936 ? Ss 11:38 0:00 \_ sh -c while true; do date; sleep 2; done
root 47584 0.0 0.0 1588 4 ? S 11:41 0:00 \_ sleep 2
Now, while containerd is stopped, try to start a new container; this failure is expected, but perhaps we can make the error message a bit more informative (mention "containerd is not available" or something);
docker run -it --rm alpine
docker: Error response from daemon: connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout": unavailable.
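As an illustration of the more informative error suggested above, a pre-flight check could probe the containerd socket and say so explicitly. This is a hypothetical sketch; it assumes the default socket path rather than reading it from the daemon configuration:

```shell
# Hypothetical pre-flight check: probe the containerd socket so we can
# report "containerd is not available" instead of a raw gRPC dial error.
containerd_available() {
    # Assumes the default socket path.
    sock="${1:-/run/containerd/containerd.sock}"
    [ -S "$sock" ]
}

if containerd_available; then
    echo "containerd socket present; proceeding"
else
    echo "error: containerd is not available (socket missing)" >&2
fi
```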
Now try to remove the container that's still running; this eventually returns (after ~10 seconds?) but with an error:
docker rm -fv mycontainer
Error response from daemon: Could not kill running container 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07, cannot remove - tried to kill container, but did not receive an exit event
Note that after this, the container's process is killed, but the shim is still running:
ps auxf
# ...
root 47370 0.0 0.9 720776 9676 ? Sl 11:38 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 -address /run/containerd/containerd.sock
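A leftover shim like this could be detected with a small check. A sketch (the `[c]` bracket trick keeps `pgrep -f` from matching the script's own command line, which also contains the pattern text):

```shell
# Sketch: detect a leftover containerd-shim for a given container ID.
shim_running() {
    cid="$1"
    pgrep -f "[c]ontainerd-shim-runc-v2.*-id ${cid}" >/dev/null
}

cid=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07
if shim_running "$cid"; then
    echo "shim for container ${cid} is still running"
else
    echo "no shim found for container ${cid}"
fi
```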
And `docker ps` still shows the container (but the logs don't get updated);
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4c330b522c68 alpine "sh -c 'while true; …" 4 minutes ago Up 4 minutes mycontainer
docker logs mycontainer
Mon Sep 11 11:28:10 UTC 2023
...
Mon Sep 11 11:31:56 UTC 2023
Logs from when containerd was stopped and the `docker rm` was tried; they mention that the service was stopped, but the shim process continues running (which is the expected behavior). However, because containerd is down, dockerd starts to log errors that it can't receive events;
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: Stopping containerd container runtime...
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.148357817Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = error reading from server: EOF" module=libcontainerd namespace=moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.148458280Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.149260128Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = error reading from server: EOF" module=libcontainerd namespace=plugins.moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:41:03.149299740Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: containerd.service: Deactivated successfully.
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: containerd.service: Unit process 47370 (containerd-shim) remains running after unit stopped.
Sep 11 11:41:03 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: Stopped containerd container runtime.
When trying to delete the container, the `kill` and `delete` fail;
Sep 11 11:42:47 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:42:47.851968146Z" level=error msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 error="context deadline exceeded"
Sep 11 11:42:47 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: docker-4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07.scope: Deactivated successfully.
Sep 11 11:42:49 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T11:42:49.857466567Z" level=error msg="Handler for DELETE /v1.43/containers/mycontainer returned error: Could not kill running container 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07, cannot remove - tried to kill container, but did not receive an exit event"
The container's state is incorrect at this point, because no event was received to update its status;
docker container inspect --format '{{ json .State }}' mycontainer | jq .
{
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 47390,
"ExitCode": 0,
"Error": "",
"StartedAt": "2023-09-11T11:39:00.39465654Z",
"FinishedAt": "0001-01-01T00:00:00Z"
}
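Given this stale state, one way to spot the mismatch is to cross-check the recorded `.State.Pid` against `/proc`. An illustrative sketch, with the PID hard-coded from the inspect output above:

```shell
# Sketch: check whether the PID recorded in the container's state still
# exists, as a hint that the reported "running" status is stale.
pid_alive() {
    [ -d "/proc/$1" ]
}

# 47390 is the .State.Pid from the inspect output above; with the daemon
# reachable it could be fetched with:
#   docker container inspect --format '{{ .State.Pid }}' mycontainer
pid=47390
if pid_alive "$pid"; then
    echo "pid $pid still exists; state may be accurate"
else
    echo "pid $pid is gone; the 'running' state is stale"
fi
```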
After starting containerd again, the container can be removed, but again takes longer than expected;
systemctl start containerd.service
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4c330b522c68 alpine "sh -c 'while true; …" 24 minutes ago Up 24 minutes mycontainer
docker rm -fv mycontainer
mycontainer
Logs from starting containerd and removing the container; they contain some messages about network interfaces (perhaps something to look into?):
Sep 11 12:03:18 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: Started containerd container runtime.
Sep 11 12:03:18 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:18.307703964Z" level=info msg="containerd successfully booted in 0.039373s"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T12:03:35.171231154Z" level=error msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 error="context deadline exceeded"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T12:03:35.187655527Z" level=info msg="ignoring event" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.188167420Z" level=info msg="shim disconnected" id=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.188604659Z" level=warning msg="cleaning up after shim disconnected" id=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 namespace=moby
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.188750132Z" level=info msg="cleaning up dead shim"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 containerd[48604]: time="2023-09-11T12:03:35.199328471Z" level=warning msg="cleanup warnings time=\"2023-09-11T12:03:35Z\" level=info msg=\"starting signal loop\" namespace=moby pid=48633 runtime=io.containerd.runc.v2\n"
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd-networkd[47914]: veth6b09306: Lost carrier
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.371657] docker0: port 1(veth6b09306) entered disabled state
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.375041] vethbfca065: renamed from eth0
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: WARNING:Unknown index 19 seen, reloading interface list
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd-udevd[48648]: Using default interface naming scheme 'v249'.
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd-networkd[47914]: veth6b09306: Link DOWN
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.397183] docker0: port 1(veth6b09306) entered disabled state
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.398437] device veth6b09306 left promiscuous mode
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 kernel: [423810.398443] docker0: port 1(veth6b09306) entered disabled state
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: ERROR:Unknown interface index 19 seen even after reload
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: WARNING:Unknown index 19 seen, reloading interface list
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: run-docker-netns-e93e84a80b3e.mount: Deactivated successfully.
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 systemd[1]: var-lib-docker-overlay2-bb6c4fe149ed089867eff21680939d764a30de35a8035dfaa6ee559366410e1b-merged.mount: Deactivated successfully.
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 networkd-dispatcher[696]: ERROR:Unknown interface index 19 seen even after reload
Sep 11 12:03:35 ubuntu-s-1vcpu-1gb-ams3-01 dockerd[29327]: time="2023-09-11T12:03:35.287777802Z" level=error msg="error removing container" container=4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 error="removal of container 4c330b522c68b8ece1947deffddb92913f6412d0a4cbacc75772922d82bb6a07 is already in progress"
Sep 11 12:03:36 ubuntu-s-1vcpu-1gb-ams3-01 systemd-networkd[47914]: docker0: Lost carrier
Possible improvements here;
- Should we log a more informative warning that containerd is not available to handle requests?
- TBD; do we want to (by default) still continue killing the container, or should we only do so "opt-in"?
- The good thing here is that we can still kill a container (i.e. the process), which can be useful
- The "bad" thing is that we don't inform the user why things took longer than expected (or that we couldn't connect to containerd)
- The "bad" thing is that the container's state no longer matches reality; do we have different means to get the actual state? (the shim is still running, so... maybe?)
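To sketch the "opt-in" idea from the list above: a caller (or dockerd itself) could wait for the containerd socket to come back before attempting the removal, instead of proceeding while no events can be received. A hypothetical helper; the socket path and timeout are assumptions:

```shell
# Sketch: poll for the containerd socket before retrying a removal.
wait_for_containerd() {
    sock="${1:-/run/containerd/containerd.sock}"
    timeout="${2:-30}"   # seconds to wait before giving up
    i=0
    while [ "$i" -lt "$timeout" ]; do
        [ -S "$sock" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Usage with the container from the example above:
#   wait_for_containerd && docker rm -fv mycontainer
```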
This was all on Docker 24.0.6, without the containerd image store enabled.