Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/v2/runc: handle early exits w/o big locks #8617

Merged
merged 1 commit into from
Jun 14, 2023

Conversation

corhere
Copy link
Contributor

@corhere corhere commented May 31, 2023

eventSendMu is causing severe lock contention when multiple processes start and exit concurrently. Replace it with a different scheme for maintaining causality w.r.t. start and exit events for a process which does not rely on big locks for synchronization.

Keep track of all processes for which a Task(Exec)Start event has been published and have not yet exited in a map, keyed by their PID. Processing exits then is as simple as looking up which process corresponds to the PID. If there are no started processes known with that PID, the PID must either belong to a process which was started by s.Start() and before the s.Start() call has added the process to the map of running processes, or a reparented process which we don't care about. Handle the former case by having each s.Start() call subscribe to "early exit" events before starting the process. It checks if the PID has exited in the time between it starting the process and publishing the TaskStart event, handling the exit if it has. Exit events for reparented processes received when no s.Start() calls are in flight are immediately discarded, and events received during an s.Start() call are discarded when the s.Start() call returns.

Big thanks to @laurazard for analyzing the root cause of #8557 and for also developing a fix. I would not have come up with the idea for this solution without inspiration from hers.

@k8s-ci-robot
Copy link

Hi @corhere. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@corhere
Copy link
Contributor Author

corhere commented May 31, 2023

@k8s-ci-robot
Copy link

@corhere: GitHub didn't allow me to request PR reviews from the following users: laurazard.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @laurazard @thaJeztah @cpuguy83 @dmcgowan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@corhere corhere force-pushed the reduce-exec-lock-contention branch from 1476721 to 16698bf Compare June 1, 2023 18:32
@corhere
Copy link
Contributor Author

corhere commented Jun 2, 2023

I was able to get rid of the timed garbage collector by leveraging the fact that the only early exits which matter for a given s.Start call are the exits which are received in between container.Start() and the PID being added to s.running. I will squash down the branch if tests pass and reviewers are happy with the new approach.

@thaJeztah
Copy link
Member

Looks like one commit misses a DCO

@corhere corhere force-pushed the reduce-exec-lock-contention branch 3 times, most recently from 53dca61 to adbb1b1 Compare June 2, 2023 22:04
@laurazard
Copy link
Member

/cc @dmcgowan @fuweid @dcantah PTAL

I chatted with @corhere, he's a bit busy this week and the next but I can follow up on any comments (would be great to get this in!)

@k8s-ci-robot
Copy link

@laurazard: GitHub didn't allow me to request PR reviews from the following users: PTAL.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @dmcgowan @fuweid @dcantah PTAL

I chatted with @corhere, he's a bit busy this week and the next but I can follow up on any comments (would be great to get this in!)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dcantah
Copy link
Member

dcantah commented Jun 8, 2023

Looks like this caused TestRegressionIssue4769 to go a little haywire, it's failing for runc, crun and the rockylinux runs. @laurazard Can you look into this?

@laurazard
Copy link
Member

Aaahh, I see and can replicate it locally, I'll take a look, thanks @dcantah.

@laurazard laurazard force-pushed the reduce-exec-lock-contention branch 2 times, most recently from adbb1b1 to fc17ca0 Compare June 8, 2023 10:40
@@ -113,7 +113,7 @@ type service struct {
containers map[string]*runc.Container

lifecycleMu sync.Mutex
running map[int][]containerProcess // pid -> running process, guarded by lifecycleMu
running map[int]map[containerProcess]struct{} // pid -> running process, guarded by lifecycleMu
Copy link
Member

@laurazard laurazard Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here to get some input: the fix for the failure mentioned in #8617 (comment) has to do with us creating duplicate entries for the same containerProcess in s.running (first in s.Create, then in s.Start), which then causes us to send exitEvents for the same containerProcess twice.

I fixed this in c43966c by storing replacing the containerProcess slice with a map, and using the containerProcess as the key so that we don't keep the same containerProcess there twice. This works fine, but I wonder if we can make this more simple: I don't think we can assume pids are unique, although if we could we could do:

Suggested change
running map[int]map[containerProcess]struct{} // pid -> running process, guarded by lifecycleMu
running map[int]containerProcess // pid -> running process, guarded by lifecycleMu

If we can't assume that, I still wonder if we should do something like

Suggested change
running map[int]map[containerProcess]struct{} // pid -> running process, guarded by lifecycleMu
running map[int]map[string]containerProcess // pid -> running process, guarded by lifecycleMu

where the key for the inner map is something like container.ID, since I think containerID+processID should be unique.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corhere thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly question; if pid here is containerProcess.Process.Pid, could it then be map[containerProcess]struct{} ? (or map[containerID]map[containerProcess]struct{} if we need to index per container?

Copy link
Member

@laurazard laurazard Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not :'(. We have multiple processes per-container, and when we get the exit event back from runC we only get the specific process PID, so the "outer" map must be keyed by pid – so that when we get an exit event, we can check our s.running map with the exit-event's PID and get the containerProcesses we know are running for that PID so that we can process the exit event for them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ugh... gotcha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laurazard and I have concluded that the duplicate entries in s.running is really a symptom of an underlying bug in my implementation which could manifest as an exit event being published before a start event in certain cases. Deduplicating the running map values by containerProcess does not address the root cause. The correct solution, we believe, is to have (*service).Start() remove the container init process from s.running before calling container.Start() and add it back afterwards so that the process exiting before (*service).Start() publishes the TaskStart event is correctly handled as an early exit.

@corhere corhere force-pushed the reduce-exec-lock-contention branch from c43966c to 5c52599 Compare June 9, 2023 18:17
eventSendMu is causing severe lock contention when multiple processes
start and exit concurrently. Replace it with a different scheme for
maintaining causality w.r.t. start and exit events for a process which
does not rely on big locks for synchronization.

Keep track of all processes for which a Task(Exec)Start event has been
published and have not yet exited in a map, keyed by their PID.
Processing exits then is as simple as looking up which process
corresponds to the PID. If there are no started processes known with
that PID, the PID must either belong to a process which was started by
s.Start() and before the s.Start() call has added the process to the map
of running processes, or a reparented process which we don't care about.
Handle the former case by having each s.Start() call subscribe to exit
events before starting the process. It checks if the PID has exited in
the time between it starting the process and publishing the TaskStart
event, handling the exit if it has. Exit events for reparented processes
received when no s.Start() calls are in flight are immediately
discarded, and events received during an s.Start() call are discarded
when the s.Start() call returns.

Co-authored-by: Laura Brehm <laurabrehm@hey.com>
Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the reduce-exec-lock-contention branch from 5c52599 to 5cd6210 Compare June 9, 2023 20:54
@estesp estesp added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Jun 20, 2023
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 16, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Although under normal circumstances, init process wouldn't quit so
early. But if this screnario occurs, handleStarted() will call
handleProcessExit(). It will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

I found that after pr containerd#8617, handleProcessExit() won't access
s.containers, so we can remove the unnecessary lock guard to avoid
deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 16, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Although under normal circumstances, init process wouldn't quit so
early. But if this screnario occurs, handleStarted() will call
handleProcessExit(). It will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

I found that after pr containerd#8617, handleProcessExit() won't access
s.containers, so we can remove the unnecessary lock guard to avoid
deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 16, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Although under normal circumstances, init process wouldn't quit so
early. But if this screnario occurs, handleStarted() will call
handleProcessExit(). It will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

I found that after pr containerd#8617, handleProcessExit() won't access
s.containers, so we can remove the unnecessary lock guard to avoid
deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 17, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Although under normal circumstances, init process wouldn't quit so
early. But if this screnario occurs, handleStarted() will call
handleProcessExit(). It will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

I found that after pr containerd#8617, handleProcessExit() won't access
s.containers, so we can remove the unnecessary lock guard to avoid
deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 17, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

I found that after pr containerd#8617, handleProcessExit() won't access
s.containers, so we can remove the unnecessary lock guard to avoid
deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 17, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

I found that after pr containerd#8617, handleProcessExit() won't access
s.containers, so we can remove the unnecessary lock guard to avoid
deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 24, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So I add handleProcessExitNoLock() function which will not lock s.mu.
It can safely be called in create handler without deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 24, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So I add handleProcessExitNoLock() function which will not lock s.mu.
It can safely be called in create handler without deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 25, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So I add handleProcessExitNoLock() function which will not lock s.mu.
It can safely be called in create handler without deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 27, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Oct 1, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

For some historical reasons, the use of s.mu is a bit confusing,
sometimes used to protect s.containers only, and sometimes used as a
mutex for some functions.

According to analysis of @corhere in containerd#9103, we can know that:

Locking s.mu for Create() handler is introduced in containerd#1452, which was a
solution for the missed early-exits problem: Create() holds the mutex to
block checkProcesses() from handling exit events until after Create()
has added the container process to the s.processes map. This locking
logic was copied into the v2 shim implementation. containerd#8617 solves the same
problem using a strategy that does not rely on mutual exclusion between
container create and exit-event handling.

As for Shutdown(), the implementation was added in containerd#3004. And in this
initial implementation, the mutex is locked to safely access
s.containers, it is not unlocked because nothing matters after
os.Exit(0). Then ae87730 changed Shutdown() to return instead of
exiting, followed by containerd#4988 to unlock upon return. If the intention is
to block containers from being created while the shim's task service is
shutting down, locking a mutex is a poor solution since it makes the
create requests hang instead of returning an error, and is ineffective
after Shutdown() returns as the mutex is unlocked.

If Create() or handleProcessExit() enters again after Shutdown()
unlocking s.mu and returning, shim service will panic by sending to
a closed channel.

So I remove the unnecessary lock guards in service, and rename mu to
containersMu to make it clear that it is used to protect s.containers
only.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Oct 1, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

For some historical reasons, the use of s.mu is a bit confusing,
sometimes used to protect s.containers only, and sometimes used as a
mutex for some functions.

According to analysis of @corhere in containerd#9103, we can know that:

Locking s.mu for Create() handler is introduced in containerd#1452, which was a
solution for the missed early-exits problem: Create() holds the mutex to
block checkProcesses() from handling exit events until after Create()
has added the container process to the s.processes map. This locking
logic was copied into the v2 shim implementation. containerd#8617 solves the same
problem using a strategy that does not rely on mutual exclusion between
container create and exit-event handling.

As for Shutdown(), the implementation was added in containerd#3004. And in this
initial implementation, the mutex is locked to safely access
s.containers, it is not unlocked because nothing matters after
os.Exit(0). Then ae87730 changed Shutdown() to return instead of
exiting, followed by containerd#4988 to unlock upon return. If the intention is
to block containers from being created while the shim's task service is
shutting down, locking a mutex is a poor solution since it makes the
create requests hang instead of returning an error, and is ineffective
after Shutdown() returns as the mutex is unlocked.

If Create() or handleProcessExit() enters again after Shutdown()
unlocking s.mu and returning, shim service will panic by sending to
a closed channel.

So I remove the unnecessary lock guards in service, and rename mu to
containersMu to make it clear that it is used to protect s.containers
only.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Oct 2, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
thaJeztah pushed a commit to thaJeztah/containerd that referenced this pull request Oct 10, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 68dd47e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah pushed a commit to thaJeztah/containerd that referenced this pull request Oct 10, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 68dd47e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah pushed a commit to thaJeztah/containerd that referenced this pull request Oct 10, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 68dd47e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah pushed a commit to thaJeztah/containerd that referenced this pull request Oct 10, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 68dd47e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah pushed a commit to thaJeztah/containerd that referenced this pull request Oct 11, 2023
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 68dd47e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concurrent task execs take significantly longer than sequentially
10 participants