runc-shim: publish init exit after all processes have exited #10647
Conversation
The runc task state machine prevents execs from being created after the init process has exited, but there are no guards against starting a created exec after the init process has exited. That leaves a small window for starting an exec to race our handling of the init process exiting.

Normally this is not an issue in practice: the kernel will atomically kill all processes in a PID namespace when its "init" process terminates, and will not allow new processes to fork(2) into the PID namespace afterwards. Therefore the racing exec is guaranteed by the kernel to not be running after the init process terminates.

On the other hand, when the container does not have a private PID namespace (i.e. the container's init process is not the "init" process of the container's PID namespace), the kernel does not automatically kill other container processes on init exit and will happily allow runc to start an exec process at any time. It is the runc shim's responsibility to clean up the container when the init process exits in this situation by killing all the container's remaining processes.

Block execs from being started after the container's init process has exited to prevent the processes from leaking, and to avoid violating the task service's assumption that an exec can be running iff the init process is also running.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Looks like you might be having trouble getting the watchdog timer to work as expected? Timers before Go 1.23 (which we are currently stuck on) were tricky to use correctly, but there are patterns which can be applied to make them work as expected.
Feel free to squash away my commit. It's broken anyway and (as @samuelkarp pointed out) the commit message is not entirely correct.
I agree. One ephemeral goroutine per container that is currently in the process of being torn down is cheap, and we always have the option to optimize later.
Oh! I don't usually use them like I did here, but I also wasn't aware of those pitfalls (or don't remember if I knew about them, and now I'm worrying about whether I've used them wrong before 😅)
I guess it's something to call out that this implementation is slightly different from what we discussed – what we talked about was:
What this does:
It felt cleaner to do it this way rather than add a subscriber to
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
It's not true that `s.mu` needs to be held when calling `handleProcessExit`; indeed, that hasn't been the case for a while – see 892dc54. Signed-off-by: Laura Brehm <laurabrehm@hey.com>
/cc @samuelkarp
The code is a bit complicated. We can delete `s.pendingExecs`.
Superseded by #10651
Building off of #10634, took a stab at an implementation of what we discussed in Slack, namely:
What this does:
This implementation in particular launches a goroutine per container init exit. I did write another implementation which doesn't require that, but it was a bit more complex, and I wasn't sure it was worth the payoff.
However, I'm sure others (cc @corhere) have better ideas, so I'd be happy to hear them.