
runc-shim: publish init exit after all processes have exited #10647

Closed · wanted to merge 3 commits

Conversation

laurazard (Member)

Building off of #10634, I took a stab at an implementation of what we discussed in Slack, namely:

1. process the init exit, squirrel away its exit event and block new exec starts
2. wait for all pending execs to settle to running or early-exit
3. kill all container processes
4. wait for the running count to drop to zero, with a time bound
5. wipe out the running list (so we don't publish exit events for late-reaped execs) and publish the squirreled-away init event

This implementation launches a goroutine per container-init exit. I did write another implementation that doesn't require that, but it was a bit more complex, and I wasn't sure it was worth the payoff.
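
For illustration, a rough sketch of the sequence above. The names (`container`, `runningEmpty`, `killAll`, `publish`) are hypothetical and do not match the shim's actual types:

```go
package shimsketch

import (
	"context"
	"time"
)

type exitEvent struct {
	pid    int
	status int
}

type container struct {
	execsSettled chan struct{}     // closed once pending execs have started or early-exited
	runningEmpty chan struct{}     // closed when the running-process count drops to zero
	running      map[int]exitEvent // running execs, keyed by pid
	killAll      func(context.Context) error
}

// handleInitExit runs in its own goroutine once the init exit has been
// observed and new exec starts have been blocked (step 1).
func handleInitExit(ctx context.Context, c *container, initExit exitEvent, publish func(exitEvent)) {
	// 2. wait for pending execs to settle to running or early-exit.
	<-c.execsSettled

	// 3. kill all remaining container processes.
	_ = c.killAll(ctx)

	// 4. wait for the running count to drop to zero, with a time bound.
	select {
	case <-c.runningEmpty:
	case <-time.After(5 * time.Second): // illustrative bound
	}

	// 5. wipe the running list so late-reaped execs don't get exit events,
	//    then publish the squirreled-away init exit.
	c.running = nil
	publish(initExit)
}
```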

However, I'm sure others (cc @corhere) have better ideas, so I'd be happy to hear them.

The runc task state machine prevents execs from being created after the
init process has exited, but there are no guards against starting a
created exec after the init process has exited. That leaves a small
window for starting an exec to race our handling of the init process
exiting. Normally this is not an issue in practice: the kernel will
atomically kill all processes in a PID namespace when its "init" process
terminates, and will not allow new processes to fork(2) into the PID
namespace afterwards. Therefore the racing exec is guaranteed by the
kernel to not be running after the init process terminates. On the other
hand, when the container does not have a private PID namespace (i.e. the
container's init process is not the "init" process of the container's
PID namespace), the kernel does not automatically kill other container
processes on init exit and will happily allow runc to start an exec
process at any time. It is the runc shim's responsibility to clean up
the container when the init process exits in this situation by killing
all the container's remaining processes. Block execs from being started
after the container's init process has exited to prevent the processes
from leaking, and to avoid violating the task service's assumption that
an exec can be running iff the init process is also running.

Signed-off-by: Cory Snider <csnider@mirantis.com>
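
For illustration, a minimal sketch of the kind of guard the commit message describes (hypothetical types and fields, not the shim's actual code):

```go
package shimsketch

import (
	"errors"
	"sync"
)

var errInitExited = errors.New("cannot start exec: container init has exited")

type task struct {
	mu         sync.Mutex
	initExited bool
}

// startExec refuses to start an exec after init has exited. Holding the lock
// across start serializes exec starts with init-exit handling; the real shim
// needs a finer-grained scheme, but the guard is the same idea.
func (t *task) startExec(start func() error) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.initExited {
		return errInitExited
	}
	return start()
}

// markInitExited is called from the exit-handling path when the init process
// is reaped, closing the race window described in the commit message.
func (t *task) markInitExited() {
	t.mu.Lock()
	t.initExited = true
	t.mu.Unlock()
}
```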
@k8s-ci-robot commented

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@corhere (Contributor) left a comment

Looks like you might be having trouble getting the watchdog timer to work as expected? Timers before language version go1.23 (which we are currently stuck in) were tricky to use correctly, but there are patterns which can be applied to make them work as expected.
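
For reference, a commonly used pattern for reusing a time.Timer before go1.23 is to stop the timer and drain its channel before resetting it; this sketch assumes no other goroutine receives from the timer's channel:

```go
package shimsketch

import "time"

// resetWatchdog resets t to fire after d. If the timer already fired, the
// pending value must be drained from t.C before Reset, or the stale expiry
// can be received later and mistaken for the new one.
func resetWatchdog(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		<-t.C // drain the expired tick; assumes nobody else reads t.C
	}
	t.Reset(d)
}
```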

Feel free to squash away my commit. It's broken anyway and (as @samuelkarp pointed out) the commit message is not entirely correct.

Review threads on cmd/containerd-shim-runc-v2/task/service.go (resolved)
@corhere (Contributor) commented Aug 27, 2024

This implementation in particular launches a goroutine per-container init exit. I did write another implementation which doesn't require that, but it was a bit more complex, and I wasn't sure it was worth the payoff.

I agree. One ephemeral goroutine per container that is currently in the process of being torn down is cheap, and we always have the option to optimize later.

@laurazard (Member, Author)

Looks like you might be having trouble getting the watchdog timer to work as expected? Timers before language version go1.23 (which we are currently stuck in) were tricky to use correctly, but there are patterns which can be applied to make them work as expected.

Oh! I don't usually use them like I did here, but I also wasn't aware/don't remember if I knew about those pitfalls (and now I'm worrying about whether I've used them wrong before 😅)

@laurazard (Member, Author)

I guess it's worth calling out that this implementation is slightly different from what we discussed – what we talked about was:

[...]
4. wait for the running count to drop to zero, with a time bound
5. wipe out the running list (so we don't publish exit events for late-reaped execs) and publish the squirreled-away init event

What this does:

[...]
4. Subscribe to exits + wipe out the running list (so processExits doesn't publish events for exiting execs)
5. Wait for running count to drop to 0/for us to receive exits for all the running execs, or timeout
6. publish the exec exits we received and then publish the squirreled-away init event

It felt cleaner to do it this way than to add a subscriber to s.exitSubscribers while still having s.processExits publish these exits, but let me know if you think otherwise.
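
For illustration, a rough sketch of steps 4–6 above – collecting exits for the still-running execs with a timeout before publishing. The names are hypothetical, not the shim's actual code:

```go
package shimsketch

import "time"

type exit struct {
	pid    int
	status int
}

// drainExecExits waits until an exit has been seen for every pid still marked
// as running, or until the timeout elapses, and returns whatever was collected
// so the caller can publish those exits before the init exit.
func drainExecExits(exits <-chan exit, running map[int]struct{}, timeout time.Duration) []exit {
	var collected []exit
	deadline := time.After(timeout)
	for len(running) > 0 {
		select {
		case e := <-exits:
			if _, ok := running[e.pid]; ok {
				delete(running, e.pid)
				collected = append(collected, e)
			}
		case <-deadline:
			return collected
		}
	}
	return collected
}
```

The caller would then publish the collected exec exits, followed by the squirreled-away init exit.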

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard laurazard marked this pull request as ready for review August 27, 2024 22:47
It's not true that `s.mu` needs to be held when calling `handleProcessExit`; indeed, that hasn't been the case for a while – see 892dc54.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard (Member, Author)

/cc @samuelkarp

@ningmingxiao (Contributor) commented Aug 28, 2024

The code is a bit complicated. We can delete s.pendingExecs – you can review #10647.

@laurazard (Member, Author)

Superseded by #10651.

@laurazard laurazard closed this Aug 29, 2024