runc-shim: fix races/prevent init exits from being dropped #10651
Conversation
Force-pushed from 651d0c5 to 3619240.
I like it!
Force-pushed from 9ab1a15 to 7e07ecb.
Force-pushed from 2c290ed to 655cb60.
Force-pushed from 655cb60 to d3629b0.
/cc @samuelkarp @dmcgowan
It's not true that `s.mu` needs to be held when calling `handleProcessExit`, and indeed hasn't been the case for a while – see 892dc54.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
Force-pushed from d3629b0 to c9c1815.
The runc task state machine prevents execs from being created after the init process has exited, but there are no guards against starting a created exec after the init process has exited. That leaves a small window for starting an exec to race our handling of the init process exiting.

Normally this is not an issue in practice: the kernel will atomically kill all processes in a PID namespace when its "init" process terminates, and will not allow new processes to fork(2) into the PID namespace afterwards. Therefore the racing exec is guaranteed by the kernel to not be running after the init process terminates. On the other hand, when the container does not have a private PID namespace (i.e. the container's init process is not the "init" process of the container's PID namespace), the kernel does not automatically kill other container processes on init exit and will happily allow runc to start an exec process at any time. It is the runc shim's responsibility to clean up the container when the init process exits in this situation by killing all the container's remaining processes.

Block execs from being started after the container's init process has exited to prevent the processes from leaking, and to avoid violating the task service's assumption that an exec can be running iff the init process is also running.

Signed-off-by: Cory Snider <csnider@mirantis.com>
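As an illustration of the guard this commit message describes, here is a minimal, self-contained sketch (not the PR's actual diff): it uses simplified string-keyed maps and a toy service type, whereas the real shim keys these maps by `*runc.Container`; the field names follow the discussion later in this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// toyShim models only the bookkeeping discussed in this PR: which containers'
// init processes have exited, and how many execs are currently running.
type toyShim struct {
	lifecycleMu       sync.Mutex
	containerInitExit map[string]bool // container ID -> init process has exited
	runningExecs      map[string]int  // container ID -> number of running execs
}

// startExec refuses to start an exec once the init process is known to have
// exited; otherwise it records the exec as running before launching it.
func (s *toyShim) startExec(containerID string) error {
	s.lifecycleMu.Lock()
	defer s.lifecycleMu.Unlock()
	if s.containerInitExit[containerID] {
		return fmt.Errorf("container %s: init process has already exited", containerID)
	}
	s.runningExecs[containerID]++
	return nil
}

// initExited records that the container's init process has exited, closing
// the window in which a racing exec could still be started.
func (s *toyShim) initExited(containerID string) {
	s.lifecycleMu.Lock()
	defer s.lifecycleMu.Unlock()
	s.containerInitExit[containerID] = true
}

func main() {
	s := &toyShim{
		containerInitExit: map[string]bool{},
		runningExecs:      map[string]int{},
	}
	fmt.Println(s.startExec("c1")) // <nil>: init still running
	s.initExited("c1")
	fmt.Println(s.startExec("c1")) // refused: init has exited
}
```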
Force-pushed from 74a7288 to 957a2b7.
I think it's fine to keep them separate; simpler to deal with logistically than both of us trying to push to the same branch in a single PR. I do intend to cherry-pick both this PR and that back to 1.7 and 1.6 though.
```go
s.lifecycleMu.Lock()
s.runningExecs[container]--
if ch, ok := s.execCountSubscribers[container]; ok {
	ch <- s.runningExecs[container]
}
```
Seems like this could block if the for loop in the goroutine in `handleInitExit` has already received an event but hasn't run the defer.
Let me try to think through this:

- Line 284 is `s.runningExecs[container]++`, protected by `s.lifecycleMu`. We can have two calls to `Start` that both (sequentially) invoke line 284.
- Now assume the init process exits, SIGCHLD is received by the shim, and `processExits` reads it, acquires `s.lifecycleMu` and stores `s.containerInitExit[container]`. No new execs can actually start after this, because of the check on line 280.
- Then, outside `s.lifecycleMu`, `handleInitExit` is invoked. We lock `s.lifecycleMu`, find `s.runningExecs[container]` is > 0, and make a 2-length channel in `s.execCountSubscribers[container]`.
- Both of the execs call `container.Start`, invoke runc, and fail because the container is no longer running.
- Each one acquires `s.lifecycleMu` and sends the count of runningExecs to the subscriber on line 299, first sending 1 then 0.
- `handleInitExit` reads from the channel, finds a 0, and breaks out of the loop.
- A new `Start` call comes in and acquires `s.lifecycleMu` ahead of `handleInitExit`'s `defer`, but fails to make progress because of the check on line 280.

I'm struggling to think of a situation where:

- `handleInitExit` is already in the channel loop (which means both (a) the init process has exited and we've stored it in `s.containerInitExit` already, and (b) `s.runningExecs[container]` is at least 1), and
- another exec can come in and increment `s.runningExecs` further after a 0 has been sent to `ch`.
Hmm, the channel created in `s.handleInitExit` has a buffer of size `s.runningExecs[c]`, and we block new execs from being started before `handleInitExit` is called, so at most we'll process `s.runningExecs[c]` exits – so I don't think this would ever block.
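To make the buffering argument concrete, here is a rough, self-contained sketch of the handshake being described. The string-keyed maps and helper names (`waitForExecs`, `execExited`) are simplifications and assumptions, not the shim's real signatures; the point is that sizing the buffer to the current exec count means the senders can never block.

```go
package main

import (
	"fmt"
	"sync"
)

type execTracker struct {
	lifecycleMu          sync.Mutex
	runningExecs         map[string]int
	execCountSubscribers map[string]chan<- int
}

// waitForExecs is called after the init exit has been stashed and new execs
// are already refused, so runningExecs[containerID] can only decrease.
func (t *execTracker) waitForExecs(containerID string, processInitExit func()) {
	t.lifecycleMu.Lock()
	n := t.runningExecs[containerID]
	if n == 0 {
		t.lifecycleMu.Unlock()
		processInitExit() // nothing in flight: handle the init exit now
		return
	}
	// Buffer the channel to the current exec count: at most n sends can ever
	// happen, so the senders in execExited never block on this channel.
	events := make(chan int, n)
	t.execCountSubscribers[containerID] = events
	t.lifecycleMu.Unlock()

	go func() {
		defer func() {
			t.lifecycleMu.Lock()
			delete(t.execCountSubscribers, containerID)
			t.lifecycleMu.Unlock()
		}()
		// Drain exec-exit notifications until no execs remain, then handle init.
		for remaining := range events {
			if remaining == 0 {
				break
			}
		}
		processInitExit()
	}()
}

// execExited decrements the running-exec count and, if a waiter is
// subscribed, reports the new count to it.
func (t *execTracker) execExited(containerID string) {
	t.lifecycleMu.Lock()
	defer t.lifecycleMu.Unlock()
	t.runningExecs[containerID]--
	if ch, ok := t.execCountSubscribers[containerID]; ok {
		ch <- t.runningExecs[containerID]
	}
}

func main() {
	t := &execTracker{
		runningExecs:         map[string]int{"c1": 2},
		execCountSubscribers: map[string]chan<- int{},
	}
	done := make(chan struct{})
	t.waitForExecs("c1", func() { close(done) })
	t.execExited("c1") // reports 1 remaining
	t.execExited("c1") // reports 0 remaining -> init exit gets handled
	<-done
	fmt.Println("init exit handled after all execs finished")
}
```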
It seems feasible that the shim is not even notified that the container init exited until after we have gotten the failure to start the exec here but before we take the lock at L296?
> It seems feasible that the shim is not even notified that the container init exited until after we have gotten the failure to start the exec here but before we take the lock at L296?
Following that thread:

- container init is started
- (1) container exec is started
  - we increment `s.runningExecs` in L284, then start the exec
- (1) the exec start call fails (L290)
- (2) container init exits and `s.processExits` acquires `s.lifecycleMu` and processes it
  - init exit gets stashed in `s.containerInitExits`, and `s.handleInitExit` gets called (and (2) unlocks `s.lifecycleMu`)
- (1) acquires the lock at L296, and decrements `s.runningExecs`
- (2) `s.handleInitExit` locks `s.lifecycleMu`, checks `s.runningExecs`, finds 0, and immediately processes the init exit

Or:

- container init is started
- (1) container exec is started
  - we increment `s.runningExecs` in L284, then start the exec
- (1) the exec start call fails (L290)
- (2) container init exits and `s.processExits` acquires `s.lifecycleMu` and processes it
  - init exit gets stashed in `s.containerInitExits`, and `s.handleInitExit` gets called (and (2) unlocks `s.lifecycleMu`)
- (2) `s.handleInitExit` locks `s.lifecycleMu`, checks `s.runningExecs`, finds 1, creates a buffered channel with length 1, launches the goroutine and unlocks `s.lifecycleMu`
- (1) acquires the lock at L296, decrements `s.runningExecs`, and sends (a 0) on the channel
- (2) the goroutine receives on the channel, finds a 0, and processes the init exit.
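Both scenarios above hinge on the same ordering: the init exit is stashed under `s.lifecycleMu` (so a racing `Start` call will refuse the exec) before `s.handleInitExit` runs. A rough sketch of that dispatch step, with simplified, assumed types rather than the shim's real ones:

```go
package sketch

import "sync"

type exit struct {
	containerID string
	isInit      bool
	status      int
}

type exitDispatcher struct {
	lifecycleMu       sync.Mutex
	containerInitExit map[string]exit // init exits are stashed here first
}

// processExit mirrors the ordering described above: record the init exit
// while holding lifecycleMu, then unlock and hand off to the init-exit
// handler (which may wait for running execs to drain before acting).
func (d *exitDispatcher) processExit(e exit, handleInitExit, handleProcessExit func(exit)) {
	if !e.isInit {
		handleProcessExit(e) // exec exits are handled directly
		return
	}
	d.lifecycleMu.Lock()
	// Any Start call that acquires lifecycleMu after this point sees the
	// stashed init exit and refuses to start the exec.
	d.containerInitExit[e.containerID] = e
	d.lifecycleMu.Unlock()
	handleInitExit(e)
}
```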
```go
runningExecs map[*runc.Container]int // container -> num running execs, guarded by lifecycleMu
// container -> subscription to exec exits/changes to s.runningExecs[container],
// guarded by lifecycleMu
execCountSubscribers map[*runc.Container]chan<- int
```
This may be better as a cond var with `lifecycleMu` to avoid more data races.
There'd still need to be some way for `handleInitExit` to be notified to check the count. That could be with a `chan<- struct{}`, but it seems cheaper to send the value over the `chan<- int` and avoid acquiring `s.lifecycleMu` if the value is > 0.
Right, I did have a branch somewhere with a `sync.Cond`, but I ended up going with Cory's suggestion for a map of channels since it seemed simpler/more verifiably correct. In general I think I prefer this approach since it reduces lock contention over `s.lifecycleMu` inside the goroutine, and we only have to lock it to set the channel and then remove it when the goroutine is done, but let me know if you have a strong opinion/have spotted any race condition in particular!
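For contrast, here is a hypothetical sketch of what the `sync.Cond` alternative mentioned above might have looked like (assumed, simplified types; this is not code from the PR or from that branch): every wakeup has to re-acquire `lifecycleMu` just to re-check the count, which is the extra contention the per-container channel avoids.

```go
package sketch

import "sync"

// condShim is a hypothetical variant using a condition variable instead of
// the per-container channel the PR actually adopts.
type condShim struct {
	lifecycleMu  sync.Mutex
	execsChanged *sync.Cond
	runningExecs map[string]int
}

func newCondShim() *condShim {
	s := &condShim{runningExecs: map[string]int{}}
	s.execsChanged = sync.NewCond(&s.lifecycleMu)
	return s
}

// waitForExecs blocks until no execs remain for the container, re-acquiring
// lifecycleMu on every wakeup to re-check the count.
func (s *condShim) waitForExecs(containerID string, processInitExit func()) {
	s.lifecycleMu.Lock()
	for s.runningExecs[containerID] > 0 {
		s.execsChanged.Wait()
	}
	s.lifecycleMu.Unlock()
	processInitExit()
}

// execExited decrements the count and wakes any waiter to re-check it.
func (s *condShim) execExited(containerID string) {
	s.lifecycleMu.Lock()
	s.runningExecs[containerID]--
	s.lifecycleMu.Unlock()
	s.execsChanged.Broadcast()
}
```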
LGTM, looks like @cpuguy83 has a few more comments
line 297: `s.runningExecs[container]--`
LGTM
/cherrypick release/1.7
@samuelkarp: once the present PR merges, I will cherry-pick it on top of release/1.7 in a new PR and assign it to you.
xref: #10603
@samuelkarp: #10651 failed to apply on top of branch "release/1.7".
Based off of #10647 and #10634 (and with inspiration from #10650), this replaces `s.pendingExecs` with `s.runningExecs`. The logic is changed to keep track of running execs instead of pending execs, and to handle a pending/pre-started exec the same as a running exec, which significantly simplifies the shim exit-processing logic.

closes #10589

This PR rewrites and simplifies a lot of the shim exit-processing logic to reduce its complexity, and also handles the case where the container doesn't have its own PID namespace, which means that we're not guaranteed to receive the init exit last.

This is achieved by replacing `s.pendingExecs` with `s.runningExecs`, which counts both (previously) pending and de facto running execs. The new exit-handling logic can be summed up by: stash container init exits in `s.containerInitExit`, and if a container's init process has exited, refuse new execs. The issue with init exits being dropped if new execs keep coming in is resolved by refusing new execs after the container's init process has exited.
Big thanks to both @corhere for the guidance and discussion and @samuelkarp for the input, discussion and repro in #10649.