
runc-shim: fix races/prevent init exits from being dropped #10651

Merged – 3 commits merged into containerd:main on Sep 5, 2024

Conversation

@laurazard (Member) commented Aug 28, 2024:

Based on #10647 and #10634 (and with inspiration from #10650), this replaces s.pendingExecs with s.runningExecs. The logic is changed to keep track of running execs instead of pending execs, and to handle a pending/prestarted exec the same as a running exec, which significantly simplifies the shim exit-processing logic.

closes #10589


This PR rewrites and simplifies a lot of the shim exit-processing logic to reduce its complexity, and also handles the case where the container doesn't have its own pid-namespace, which means that we're not guaranteed to receive the init exit last.

This is achieved by replacing s.pendingExecs with s.runningExecs, for which both (previously) pending and de facto running execs are considered.

The new exit handling logic can be summed up by:

  • when we receive an init exit, stash it in s.containerInitExit, and once a container's init process has exited, refuse new execs.
  • (if the container does not have its own pidns) kill all running processes (if the container has a private pid-namespace, then all processes will be dead already).
  • wait for the container's running exec count (which includes execs which have been started but might still early exit) to get to 0.
  • publish the stashed away init exit.

The issue with init exits being dropped if new execs keep coming in is resolved by refusing new execs after the container's init process has exited.
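To make this concrete, here is a minimal standalone sketch of the bookkeeping described above. It is illustrative only, not the PR's actual code: the real implementation lives in cmd/containerd-shim-runc-v2/task/service.go, uses containerd's container type, and also kills remaining container processes when the container lacks a private pid-namespace (omitted here). The field names mirror the ones discussed in this PR; the string container IDs and the publish callback are stand-ins.

package shim

import "sync"

type exit struct{ status int }

type service struct {
	lifecycleMu          sync.Mutex
	runningExecs         map[string]int        // container -> number of running execs
	execCountSubscribers map[string]chan<- int // container -> exec-count change channel
	containerInitExit    map[string]exit       // stashed init exits
}

// handleInitExit stashes the container's init exit and defers publishing
// it until the running-exec count drops to zero.
func (s *service) handleInitExit(id string, e exit, publish func(exit)) {
	s.lifecycleMu.Lock()
	// Stash the exit; Start refuses new execs once this entry exists.
	s.containerInitExit[id] = e
	numRunning := s.runningExecs[id]
	if numRunning == 0 {
		s.lifecycleMu.Unlock()
		publish(e) // no execs in flight: publish immediately
		return
	}
	// The buffer is sized to numRunning so that senders never block:
	// no new execs can start now, so at most numRunning sends arrive.
	events := make(chan int, numRunning)
	s.execCountSubscribers[id] = events
	s.lifecycleMu.Unlock()

	go func() {
		defer func() {
			s.lifecycleMu.Lock()
			delete(s.execCountSubscribers, id)
			s.lifecycleMu.Unlock()
		}()
		for count := range events {
			if count == 0 {
				break
			}
		}
		publish(e) // publish the stashed init exit last
	}()
}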

Big thanks to both @corhere for the guidance and discussion and @samuelkarp for the input, discussion and repro in #10649.

@corhere (Contributor) left a comment:

I like it!

@laurazard force-pushed the shim-refactor-without-pending branch from 9ab1a15 to 7e07ecb on August 28, 2024 at 16:55
@laurazard force-pushed the shim-refactor-without-pending branch 2 times, most recently from 2c290ed to 655cb60 on August 28, 2024 at 17:09
@laurazard changed the title from "WIP: refactor runc-shim to remove s.pendingExecs" to "runc-shim: fix races/prevent init exits from being dropped" on Aug 28, 2024
@laurazard force-pushed the shim-refactor-without-pending branch from 655cb60 to d3629b0 on August 29, 2024 at 09:25
@laurazard (Member, Author) commented:

/cc @samuelkarp @dmcgowan

runc-shim: remove misleading comment

It's not true that `s.mu` needs to be held when calling
`handleProcessExit`, and indeed hasn't been the case for a
while – see 892dc54.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
runc-shim: refuse to start execs after init exits

The runc task state machine prevents execs from being created after the
init process has exited, but there are no guards against starting a
created exec after the init process has exited. That leaves a small
window for starting an exec to race our handling of the init process
exiting. Normally this is not an issue in practice: the kernel will
atomically kill all processes in a PID namespace when its "init" process
terminates, and will not allow new processes to fork(2) into the PID
namespace afterwards. Therefore the racing exec is guaranteed by the
kernel to not be running after the init process terminates. On the other
hand, when the container does not have a private PID namespace (i.e. the
container's init process is not the "init" process of the container's
PID namespace), the kernel does not automatically kill other container
processes on init exit and will happily allow runc to start an exec
process at any time. It is the runc shim's responsibility to clean up
the container when the init process exits in this situation by killing
all the container's remaining processes. Block execs from being started
after the container's init process has exited to prevent the processes
from leaking, and to avoid violating the task service's assumption that
an exec can be running iff the init process is also running.

Signed-off-by: Cory Snider <csnider@mirantis.com>
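As a hedged illustration of the guard this commit describes, continuing the toy service type from the sketch above (runExec stands in for the real container.Start path, and the error message is made up, so this is not the PR's actual diff):

import "errors"

// startExec refuses to launch an exec once the container's init exit has
// been stashed, and undoes the running-exec count if the start fails.
func (s *service) startExec(id string, runExec func() error) error {
	s.lifecycleMu.Lock()
	if _, initExited := s.containerInitExit[id]; initExited {
		s.lifecycleMu.Unlock()
		return errors.New("container init process has exited: refusing new exec")
	}
	s.runningExecs[id]++ // counted before runc is invoked
	s.lifecycleMu.Unlock()

	if err := runExec(); err != nil {
		// The exec never became running: undo the count and notify any
		// subscriber waiting in handleInitExit.
		s.lifecycleMu.Lock()
		s.runningExecs[id]--
		if ch, ok := s.execCountSubscribers[id]; ok {
			ch <- s.runningExecs[id]
		}
		s.lifecycleMu.Unlock()
		return err
	}
	return nil
}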
@laurazard force-pushed the shim-refactor-without-pending branch 2 times, most recently from 74a7288 to 957a2b7 on September 2, 2024 at 10:27
@samuelkarp (Member) commented:

> Would the test case in the other draft PR #10649 be useful to put in this PR itself?

I think it's fine to keep them separate; it's logistically simpler than both of us trying to push to the same branch in a single PR. I do intend to cherry-pick both this PR and that one back to 1.7 and 1.6, though.

s.lifecycleMu.Lock()
s.runningExecs[container]--
if ch, ok := s.execCountSubscribers[container]; ok {
ch <- s.runningExecs[container]
Member commented:

Seems like this could block if the for loop in the goroutine in handleInitExit has already received an event but hasn't run the defer.

Member replied:

Let me try to think through this:

  • Line 284 is s.runningExecs[container]++, protected by s.lifecycleMu. We can have two calls to Start that both (sequentially) invoke line 284.
  • Now assume the init process exits, SIGCHLD is received by the shim, and processExits reads it, acquires s.lifecycleMu, and stores s.containerInitExit[container]. No new execs can actually start after this, because of the check on line 280.
  • Then, outside s.lifecycleMu, handleInitExit is invoked. We lock s.lifecycleMu, find s.runningExecs[container] is > 0, and make a 2-length channel in s.execCountSubscribers[container].
  • Both of the execs call container.Start, invoke runc, and fail because the container is no longer running.
  • Each one acquires s.lifecycleMu and sends the count of runningExecs to the subscriber on line 299, first sending 1, then 0.
  • handleInitExit reads from the channel, finds a 0, and breaks out of the loop.
  • A new Start call comes in and acquires s.lifecycleMu ahead of handleInitExit's defer, but fails to make progress because of line 280.

I'm struggling to think of a situation where:

  • handleInitExit is already in the channel loop (which means both (a) the init process has exited and we've stored it in s.containerInitExit already, and (b) s.runningExecs[container] is at least 1), and
  • another exec can come in and increment s.runningExecs further after a 0 has been sent to ch.

@laurazard (Member, Author) replied:

Hmm, the channel created in s.handleInitExit has a buffer of size s.runningExecs[c], and we block new execs from being started before handleInitExit is called, so at most we'll process s.runningExecs[c] exits – I don't think this would ever block.
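A toy illustration of that capacity argument (assumptions: outstanding > 0, and no new execs can start once the init exit is stashed, so at most outstanding sends ever happen and none of them block):

// drainExecExits simulates handleInitExit's wait: each exec exit sends
// the new count; the buffered channel absorbs every send without blocking.
func drainExecExits(outstanding int, publish func()) {
	ch := make(chan int, outstanding) // capacity == execs counted at init exit
	for i := outstanding - 1; i >= 0; i-- {
		ch <- i // each send is buffered; the sender never waits
	}
	for count := range ch {
		if count == 0 {
			publish() // last exec gone: safe to publish the init exit
			return
		}
	}
}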

Member replied:

It seems feasible that the shim is not even notified that the container init exited until after we have gotten the failure to start the exec here but before we take the lock at L296?

@laurazard (Member, Author) replied:

> It seems feasible that the shim is not even notified that the container init exited until after we have gotten the failure to start the exec here but before we take the lock at L296?

Following that thread:

  • container init is started
  • (1) a container exec is started
    • we increment s.runningExecs in L284, then start the exec
  • (1) the exec start call fails (L290)
  • (2) container init exits and s.processExits acquires s.lifecycleMu and processes it
    • the init exit gets stashed in s.containerInitExit, and s.handleInitExit gets called (and (2) unlocks s.lifecycleMu)
  • (1) acquires the lock at L296 and decrements s.runningExecs
  • (2) s.handleInitExit locks s.lifecycleMu, checks s.runningExecs, finds 0, and immediately processes the init exit

Or:

  • container init is started
  • (1) a container exec is started
    • we increment s.runningExecs in L284, then start the exec
  • (1) the exec start call fails (L290)
  • (2) container init exits and s.processExits acquires s.lifecycleMu and processes it
    • the init exit gets stashed in s.containerInitExit, and s.handleInitExit gets called (and (2) unlocks s.lifecycleMu)
  • (2) s.handleInitExit locks s.lifecycleMu, checks s.runningExecs, finds 1, creates a buffered channel with capacity 1, launches the goroutine, and unlocks s.lifecycleMu
  • (1) acquires the lock at L296, decrements s.runningExecs, and sends (a 0) on the channel
  • (2) the goroutine receives on the channel, finds a 0, and processes the init exit

runningExecs map[*runc.Container]int // container -> num running execs, guarded by lifecycleMu
// container -> subscription to exec exits/changes to s.runningExecs[container],
// guarded by lifecycleMu
execCountSubscribers map[*runc.Container]chan<- int
Member commented:

This may be better as a cond var with lifecycleMu to avoid more data races.

Member replied:

There'd still need to be some way for handleInitExit to be notified to check the count. That could be with a chan<- struct{}, but it seems cheaper to send the value over the chan<- int and avoid acquiring s.lifecycleMu if the value is > 0.

@laurazard (Member, Author) replied:

Right, I did have a branch somewhere with a sync.Cond, but I ended up going with Cory's suggestion of a map of channels since it seemed simpler/more verifiably correct. In general I prefer this approach since it reduces lock contention on s.lifecycleMu inside the goroutine – we only have to lock it to set the channel and then to remove it when the goroutine is done – but let me know if you have a strong opinion or have spotted any race condition in particular!
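For contrast, here is a hypothetical sketch of what a sync.Cond variant might look like (not code from this PR or any branch of it). It shows the trade-off discussed above: every wakeup reacquires lifecycleMu to re-check the count, whereas the buffered channel delivers the count without the waiter touching the lock.

import "sync"

type condService struct {
	lifecycleMu  sync.Mutex
	execsChanged *sync.Cond // signalled whenever a running-exec count changes
	runningExecs map[string]int
}

func newCondService() *condService {
	s := &condService{runningExecs: make(map[string]int)}
	s.execsChanged = sync.NewCond(&s.lifecycleMu)
	return s
}

// waitForExecs blocks until the container has no running execs.
func (s *condService) waitForExecs(id string) {
	s.lifecycleMu.Lock()
	for s.runningExecs[id] > 0 {
		s.execsChanged.Wait() // atomically unlocks, sleeps, and relocks
	}
	s.lifecycleMu.Unlock()
}

// execExited records one exec exiting and wakes all waiters.
func (s *condService) execExited(id string) {
	s.lifecycleMu.Lock()
	s.runningExecs[id]--
	s.lifecycleMu.Unlock()
	s.execsChanged.Broadcast()
}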

@dims (Member) commented Sep 4, 2024:

LGTM, looks like @cpuguy83 has a few more comments

@ningmingxiao (Contributor) commented Sep 5, 2024:

func (e *execProcess) start(ctx context.Context) (err error) {
	// The reaper may receive exit signal right after
	// the container is started, before the e.pid is updated.
	// In that case, we want to block the signal handler to
	// access e.pid until it is updated.
	e.pid.Lock()
	defer e.pid.Unlock()

	var (
		socket  *runc.Socket
		pio     *processIO
		pidFile = newExecPidFile(e.path, e.id)
	)
	if e.stdio.Terminal {
		if socket, err = runc.NewTempConsoleSocket(); err != nil {
			return fmt.Errorf("failed to create runc console socket: %w", err)
		}
		defer socket.Close()
	} else {
		if pio, err = createIO(ctx, e.id, e.parent.IoUID, e.parent.IoGID, e.stdio); err != nil {
			return fmt.Errorf("failed to create init process I/O: %w", err)
		}
		e.io = pio
	}
	opts := &runc.ExecOpts{
		PidFile: pidFile.Path(),
		Detach:  true,
	}
	if pio != nil {
		opts.IO = pio.IO()
	}
	if socket != nil {
		opts.ConsoleSocket = socket
	}
	if err := e.parent.runtime.Exec(ctx, e.parent.id, e.spec, opts); err != nil {
		close(e.waitBlock)
		return e.parent.runtimeError(err, "OCI runtime exec failed")
	}
	if e.stdio.Stdin != "" {
		if err := e.openStdin(e.stdio.Stdin); err != nil {
			return err
		}
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if socket != nil {
		console, err := socket.ReceiveMaster()
		if err != nil {
			return fmt.Errorf("failed to retrieve console master: %w", err)
		}
		if e.console, err = e.parent.Platform.CopyConsole(ctx, console, e.id, e.stdio.Stdin, e.stdio.Stdout, e.stdio.Stderr, &e.wg); err != nil {
			return fmt.Errorf("failed to start console copy: %w", err)
		}
	} else {
		if err := pio.Copy(ctx, &e.wg); err != nil {
			return fmt.Errorf("failed to start io pipe copy: %w", err)
		}
	}
	pid, err := pidFile.Read()
	if err != nil {
		return fmt.Errorf("failed to retrieve OCI runtime exec pi: %wd", err)
	}
	e.pid.pid = pid
	return nil
}

Regarding line 297's s.runningExecs[container]--: the decrement happens after e.parent.runtime.Exec is called. But if e.parent.runtime.Exec itself succeeded (i.e. the runc exec is actually running) and a later step in start fails with some other error, s.runningExecs[container] may be decremented to the wrong value. Maybe we should record the exec process's pid.

@dims (Member) left a comment:

LGTM

@samuelkarp added this pull request to the merge queue on Sep 5, 2024
@samuelkarp (Member) commented:

/cherrypick release/1.7

@k8s-infra-cherrypick-robot

@samuelkarp: once the present PR merges, I will cherry-pick it on top of release/1.7 in a new PR and assign it to you.

In response to this:

> /cherrypick release/1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Sep 5, 2024
@samuelkarp added this pull request to the merge queue again on Sep 5, 2024
@dims (Member) commented Sep 5, 2024:

xref: #10603

Merged via the queue into containerd:main with commit c8b095f Sep 5, 2024
52 checks passed
@k8s-infra-cherrypick-robot

@samuelkarp: #10651 failed to apply on top of branch "release/1.7":

Applying: runc-shim: remove misleading comment
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
Applying: runc-shim: refuse to start execs after init exits
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
Applying: runc-shim: handle pending execs as running
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
CONFLICT (content): Merge conflict in runtime/v2/runc/task/service.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 runc-shim: handle pending execs as running
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

> /cherrypick release/1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@laurazard (Member, Author) commented:

Opened backports: #10676 and #10675.

@samuelkarp added the cherry-picked/1.6.x (PR commits are cherry-picked into release/1.6 branch) and cherry-picked/1.7.x (PR commits are cherry-picked into release/1.7 branch) labels, and removed the cherry-pick/1.6.x (Change to be cherry picked to release/1.6 branch) and cherry-pick/1.7.x (Change to be cherry picked to release/1.7 branch) labels, on Sep 6, 2024
Labels

area/runtime · cherry-picked/1.6.x · cherry-picked/1.7.x · kind/bug · ok-to-test · size/L

Projects

Status: Done

Development

Successfully merging this pull request may close these issues:

  • supposedly running container actually dead

9 participants