runc-shim: publish init exit after all processes have exited #10647
Conversation
The runc task state machine prevents execs from being created after the init process has exited, but there are no guards against starting a created exec after the init process has exited. That leaves a small window for starting an exec to race our handling of the init process exiting.

Normally this is not an issue in practice: the kernel will atomically kill all processes in a PID namespace when its "init" process terminates, and will not allow new processes to fork(2) into the PID namespace afterwards. Therefore the racing exec is guaranteed by the kernel to not be running after the init process terminates.

On the other hand, when the container does not have a private PID namespace (i.e. the container's init process is not the "init" process of the container's PID namespace), the kernel does not automatically kill other container processes on init exit and will happily allow runc to start an exec process at any time. It is the runc shim's responsibility to clean up the container when the init process exits in this situation by killing all the container's remaining processes.

Block execs from being started after the container's init process has exited to prevent the processes from leaking, and to avoid violating the task service's assumption that an exec can be running iff the init process is also running.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Looks like you might be having trouble getting the watchdog timer to work as expected? Timers before Go 1.23 (which we are currently stuck on) were tricky to use correctly, but there are patterns which can be applied to make them work as expected.
Feel free to squash away my commit. It's broken anyway and (as @samuelkarp pointed out) the commit message is not entirely correct.
I agree. One ephemeral goroutine per container that is currently in the process of being torn down is cheap, and we always have the option to optimize later.
Oh! I don't usually use them like I did here, but I also wasn't aware of those pitfalls (or don't remember if I knew about them, and now I'm worrying about whether I've used them wrong before 😅)
I guess it's something to call out that this implementation is slightly different from what we discussed – what we talked about was:
What this does:
It felt cleaner to do it this way rather than add a subscriber to
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
It's not true that `s.mu` needs to be held when calling `handleProcessExit`; indeed, that hasn't been the case for a while – see 892dc54. Signed-off-by: Laura Brehm <laurabrehm@hey.com>
/cc @samuelkarp
The code is a bit complicated. We can delete `s.pendingExecs`.
Superseded by #10651
Building off of #10634, took a stab at an implementation of what we discussed in Slack, namely:
What this does:
This implementation in particular launches a goroutine per container init exit. I did write another implementation which doesn't require that, but it was a bit more complex, and I wasn't sure it was worth the payoff.
However, I'm sure others (cc @corhere) have better ideas, so I'd be happy to hear them.