
runc-shim: fix races/prevent init exits from being dropped #10651

Merged – 3 commits merged into containerd:main on Sep 5, 2024

Conversation

@laurazard (Member) commented Aug 28, 2024:

Based on #10647 and #10634 (and with inspiration from #10650), this replaces s.pendingExecs with s.runningExecs. The logic is changed to keep track of running execs instead of pending execs, and to handle a pending/prestarted exec the same as a running exec, which significantly simplifies the shim exit-processing logic.

closes #10589


This PR rewrites and simplifies a lot of the shim exit-processing logic to reduce its complexity, and also handles the case where the container doesn't have its own pid-namespace, which means that we're not guaranteed to receive the init exit last.

This is achieved by replacing s.pendingExecs with s.runningExecs, for which both (previously) pending and de facto running execs are considered.

The new exit handling logic can be summed up by:

  • when we receive an init exit, stash it in s.containerInitExit, and once a container's init process has exited, refuse new execs.
  • (if the container does not have its own pidns) kill all running processes (if the container has a private pid-namespace, then all processes will be dead already).
  • wait for the container's running exec count (which includes execs which have been started but might still early exit) to get to 0.
  • publish the stashed away init exit.

The issue with init exits being dropped if new execs keep coming in is resolved by refusing new execs after the container's init process has exited.
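To make this concrete, here is a minimal standalone sketch of the bookkeeping described above. It is illustrative only, not the PR's actual code: the real implementation lives in cmd/containerd-shim-runc-v2/task/service.go, uses containerd's container type, and also kills remaining container processes when the container lacks a private pid-namespace (omitted here). The field names mirror the ones discussed in this PR; the string container IDs and the publish callback are stand-ins.

package shim

import "sync"

type exit struct{ status int }

type service struct {
	lifecycleMu          sync.Mutex
	runningExecs         map[string]int        // container -> number of running execs
	execCountSubscribers map[string]chan<- int // container -> exec-count change channel
	containerInitExit    map[string]exit       // stashed init exits
}

// handleInitExit stashes the container's init exit and defers publishing
// it until the running-exec count drops to zero.
func (s *service) handleInitExit(id string, e exit, publish func(exit)) {
	s.lifecycleMu.Lock()
	// Stash the exit; Start refuses new execs once this entry exists.
	s.containerInitExit[id] = e
	numRunning := s.runningExecs[id]
	if numRunning == 0 {
		s.lifecycleMu.Unlock()
		publish(e) // no execs in flight: publish immediately
		return
	}
	// The buffer is sized to numRunning so that senders never block:
	// no new execs can start now, so at most numRunning sends arrive.
	events := make(chan int, numRunning)
	s.execCountSubscribers[id] = events
	s.lifecycleMu.Unlock()

	go func() {
		defer func() {
			s.lifecycleMu.Lock()
			delete(s.execCountSubscribers, id)
			s.lifecycleMu.Unlock()
		}()
		for count := range events {
			if count == 0 {
				break
			}
		}
		publish(e) // publish the stashed init exit last
	}()
}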

Big thanks to both @corhere for the guidance and discussion and @samuelkarp for the input, discussion and repro in #10649.

@corhere (Contributor) left a comment:

I like it!

@laurazard force-pushed the shim-refactor-without-pending branch from 9ab1a15 to 7e07ecb on August 28, 2024 at 16:55
@laurazard force-pushed the shim-refactor-without-pending branch 2 times, most recently from 2c290ed to 655cb60 on August 28, 2024 at 17:09
@laurazard changed the title from "WIP: refactor runc-shim to remove s.pendingExecs" to "runc-shim: fix races/prevent init exits from being dropped" on Aug 28, 2024
@laurazard force-pushed the shim-refactor-without-pending branch from 655cb60 to d3629b0 on August 29, 2024 at 09:25
@laurazard (Member, Author) commented:

/cc @samuelkarp @dmcgowan

runc-shim: remove misleading comment

It's not true that `s.mu` needs to be held when calling
`handleProcessExit`, and indeed hasn't been the case for a
while – see 892dc54.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
runc-shim: refuse to start execs after init exits

The runc task state machine prevents execs from being created after the
init process has exited, but there are no guards against starting a
created exec after the init process has exited. That leaves a small
window for starting an exec to race our handling of the init process
exiting. Normally this is not an issue in practice: the kernel will
atomically kill all processes in a PID namespace when its "init" process
terminates, and will not allow new processes to fork(2) into the PID
namespace afterwards. Therefore the racing exec is guaranteed by the
kernel to not be running after the init process terminates. On the other
hand, when the container does not have a private PID namespace (i.e. the
container's init process is not the "init" process of the container's
PID namespace), the kernel does not automatically kill other container
processes on init exit and will happily allow runc to start an exec
process at any time. It is the runc shim's responsibility to clean up
the container when the init process exits in this situation by killing
all the container's remaining processes. Block execs from being started
after the container's init process has exited to prevent the processes
from leaking, and to avoid violating the task service's assumption that
an exec can be running iff the init process is also running.

Signed-off-by: Cory Snider <csnider@mirantis.com>
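As a hedged illustration of the guard this commit describes, continuing the toy service type from the sketch above (runExec stands in for the real container.Start path, and the error message is made up, so this is not the PR's actual diff):

import "errors"

// startExec refuses to launch an exec once the container's init exit has
// been stashed, and undoes the running-exec count if the start fails.
func (s *service) startExec(id string, runExec func() error) error {
	s.lifecycleMu.Lock()
	if _, initExited := s.containerInitExit[id]; initExited {
		s.lifecycleMu.Unlock()
		return errors.New("container init process has exited: refusing new exec")
	}
	s.runningExecs[id]++ // counted before runc is invoked
	s.lifecycleMu.Unlock()

	if err := runExec(); err != nil {
		// The exec never became running: undo the count and notify any
		// subscriber waiting in handleInitExit.
		s.lifecycleMu.Lock()
		s.runningExecs[id]--
		if ch, ok := s.execCountSubscribers[id]; ok {
			ch <- s.runningExecs[id]
		}
		s.lifecycleMu.Unlock()
		return err
	}
	return nil
}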
@laurazard force-pushed the shim-refactor-without-pending branch 2 times, most recently from 74a7288 to 957a2b7 on September 2, 2024 at 10:27
@samuelkarp (Member) commented:

> Would the test case in the other draft PR #10649 be useful to put in this PR itself?

I think it's fine to keep them separate; it's logistically simpler than both of us trying to push to the same branch in a single PR. I do intend to cherry-pick both this PR and that one back to 1.7 and 1.6, though.

s.lifecycleMu.Lock()
s.runningExecs[container]--
if ch, ok := s.execCountSubscribers[container]; ok {
ch <- s.runningExecs[container]
Member commented:

Seems like this could block if the for loop in the goroutine in handleInitExit has already received an event but hasn't run the defer.

Member replied:

Let me try to think through this:

  • Line 284 is s.runningExecs[container]++, protected by s.lifecycleMu. We can have two calls to Start that both (sequentially) invoke line 284.
  • Now assume the init process exits, SIGCHLD is received by the shim, and processExits reads it, acquires s.lifecycleMu, and stores s.containerInitExit[container]. No new execs can actually start after this, because of the check on line 280.
  • Then, outside s.lifecycleMu, handleInitExit is invoked. We lock s.lifecycleMu, find s.runningExecs[container] is > 0, and make a 2-length channel in s.execCountSubscribers[container].
  • Both of the execs call container.Start, invoke runc, and fail because the container is no longer running.
  • Each one acquires s.lifecycleMu and sends the count of runningExecs to the subscriber on line 299, first sending 1, then 0.
  • handleInitExit reads from the channel, finds a 0, and breaks out of the loop.
  • A new Start call comes in and acquires s.lifecycleMu ahead of handleInitExit's defer, but fails to make progress because of line 280.

I'm struggling to think of a situation where:

  • handleInitExit is already in the channel loop (which means both (a) the init process has exited and we've stored it in s.containerInitExit already, and (b) s.runningExecs[container] is at least 1), and
  • another exec can come in and increment s.runningExecs further after a 0 has been sent to ch.

@laurazard (Member, Author) replied:

Hmm, the channel created in s.handleInitExit has a buffer of size s.runningExecs[c], and we block new execs from being started before handleInitExit is called, so at most we'll process s.runningExecs[c] exits – I don't think this would ever block.
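A toy illustration of that capacity argument (assumptions: outstanding > 0, and no new execs can start once the init exit is stashed, so at most outstanding sends ever happen and none of them block):

// drainExecExits simulates handleInitExit's wait: each exec exit sends
// the new count; the buffered channel absorbs every send without blocking.
func drainExecExits(outstanding int, publish func()) {
	ch := make(chan int, outstanding) // capacity == execs counted at init exit
	for i := outstanding - 1; i >= 0; i-- {
		ch <- i // each send is buffered; the sender never waits
	}
	for count := range ch {
		if count == 0 {
			publish() // last exec gone: safe to publish the init exit
			return
		}
	}
}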

Member replied:

It seems feasible that the shim is not even notified that the container init exited until after we have gotten the failure to start the exec here but before we take the lock at L296?

@laurazard (Member, Author) replied:

> It seems feasible that the shim is not even notified that the container init exited until after we have gotten the failure to start the exec here but before we take the lock at L296?

Following that thread:

  • container init is started
  • (1) a container exec is started
    • we increment s.runningExecs in L284, then start the exec
  • (1) the exec start call fails (L290)
  • (2) container init exits and s.processExits acquires s.lifecycleMu and processes it
    • the init exit gets stashed in s.containerInitExit, and s.handleInitExit gets called (and (2) unlocks s.lifecycleMu)
  • (1) acquires the lock at L296 and decrements s.runningExecs
  • (2) s.handleInitExit locks s.lifecycleMu, checks s.runningExecs, finds 0, and immediately processes the init exit

Or:

  • container init is started
  • (1) a container exec is started
    • we increment s.runningExecs in L284, then start the exec
  • (1) the exec start call fails (L290)
  • (2) container init exits and s.processExits acquires s.lifecycleMu and processes it
    • the init exit gets stashed in s.containerInitExit, and s.handleInitExit gets called (and (2) unlocks s.lifecycleMu)
  • (2) s.handleInitExit locks s.lifecycleMu, checks s.runningExecs, finds 1, creates a buffered channel with capacity 1, launches the goroutine, and unlocks s.lifecycleMu
  • (1) acquires the lock at L296, decrements s.runningExecs, and sends (a 0) on the channel
  • (2) the goroutine receives on the channel, finds a 0, and processes the init exit

runningExecs map[*runc.Container]int // container -> num running execs, guarded by lifecycleMu
// container -> subscription to exec exits/changes to s.runningExecs[container],
// guarded by lifecycleMu
execCountSubscribers map[*runc.Container]chan<- int
Member commented:

This may be better as a cond var with lifecycleMu to avoid more data races.

Member replied:

There'd still need to be some way for handleInitExit to be notified to check the count. That could be with a chan<- struct{}, but it seems cheaper to send the value over the chan<- int and avoid acquiring s.lifecycleMu if the value is > 0.

@laurazard (Member, Author) replied:

Right, I did have a branch somewhere with a sync.Cond, but I ended up going with Cory's suggestion of a map of channels since it seemed simpler/more verifiably correct. In general I prefer this approach since it reduces lock contention on s.lifecycleMu inside the goroutine – we only have to lock it to set the channel and then to remove it when the goroutine is done – but let me know if you have a strong opinion or have spotted any race condition in particular!
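For contrast, here is a hypothetical sketch of what a sync.Cond variant might look like (not code from this PR or any branch of it). It shows the trade-off discussed above: every wakeup reacquires lifecycleMu to re-check the count, whereas the buffered channel delivers the count without the waiter touching the lock.

import "sync"

type condService struct {
	lifecycleMu  sync.Mutex
	execsChanged *sync.Cond // signalled whenever a running-exec count changes
	runningExecs map[string]int
}

func newCondService() *condService {
	s := &condService{runningExecs: make(map[string]int)}
	s.execsChanged = sync.NewCond(&s.lifecycleMu)
	return s
}

// waitForExecs blocks until the container has no running execs.
func (s *condService) waitForExecs(id string) {
	s.lifecycleMu.Lock()
	for s.runningExecs[id] > 0 {
		s.execsChanged.Wait() // atomically unlocks, sleeps, and relocks
	}
	s.lifecycleMu.Unlock()
}

// execExited records one exec exiting and wakes all waiters.
func (s *condService) execExited(id string) {
	s.lifecycleMu.Lock()
	s.runningExecs[id]--
	s.lifecycleMu.Unlock()
	s.execsChanged.Broadcast()
}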

@dims (Member) commented Sep 4, 2024:

LGTM, looks like @cpuguy83 has a few more comments

@ningmingxiao (Contributor) commented Sep 5, 2024:

func (e *execProcess) start(ctx context.Context) (err error) {
	// The reaper may receive exit signal right after
	// the container is started, before the e.pid is updated.
	// In that case, we want to block the signal handler to
	// access e.pid until it is updated.
	e.pid.Lock()
	defer e.pid.Unlock()

	var (
		socket  *runc.Socket
		pio     *processIO
		pidFile = newExecPidFile(e.path, e.id)
	)
	if e.stdio.Terminal {
		if socket, err = runc.NewTempConsoleSocket(); err != nil {
			return fmt.Errorf("failed to create runc console socket: %w", err)
		}
		defer socket.Close()
	} else {
		if pio, err = createIO(ctx, e.id, e.parent.IoUID, e.parent.IoGID, e.stdio); err != nil {
			return fmt.Errorf("failed to create init process I/O: %w", err)
		}
		e.io = pio
	}
	opts := &runc.ExecOpts{
		PidFile: pidFile.Path(),
		Detach:  true,
	}
	if pio != nil {
		opts.IO = pio.IO()
	}
	if socket != nil {
		opts.ConsoleSocket = socket
	}
	if err := e.parent.runtime.Exec(ctx, e.parent.id, e.spec, opts); err != nil {
		close(e.waitBlock)
		return e.parent.runtimeError(err, "OCI runtime exec failed")
	}
	if e.stdio.Stdin != "" {
		if err := e.openStdin(e.stdio.Stdin); err != nil {
			return err
		}
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if socket != nil {
		console, err := socket.ReceiveMaster()
		if err != nil {
			return fmt.Errorf("failed to retrieve console master: %w", err)
		}
		if e.console, err = e.parent.Platform.CopyConsole(ctx, console, e.id, e.stdio.Stdin, e.stdio.Stdout, e.stdio.Stderr, &e.wg); err != nil {
			return fmt.Errorf("failed to start console copy: %w", err)
		}
	} else {
		if err := pio.Copy(ctx, &e.wg); err != nil {
			return fmt.Errorf("failed to start io pipe copy: %w", err)
		}
	}
	pid, err := pidFile.Read()
	if err != nil {
		return fmt.Errorf("failed to retrieve OCI runtime exec pi: %wd", err)
	}
	e.pid.pid = pid
	return nil
}

Regarding line 297's s.runningExecs[container]--: the decrement happens after e.parent.runtime.Exec is called. But if e.parent.runtime.Exec itself succeeded (i.e. the runc exec is actually running) and a later step in start fails with some other error, s.runningExecs[container] may be decremented to the wrong value. Maybe we should record the exec process's pid.

@dims (Member) left a comment:

LGTM

@samuelkarp added this pull request to the merge queue on Sep 5, 2024
@samuelkarp (Member) commented:

/cherrypick release/1.7

@k8s-infra-cherrypick-robot

@samuelkarp: once the present PR merges, I will cherry-pick it on top of release/1.7 in a new PR and assign it to you.

In response to this:

> /cherrypick release/1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Sep 5, 2024
@samuelkarp added this pull request to the merge queue again on Sep 5, 2024
@dims (Member) commented Sep 5, 2024:

xref: #10603

Merged via the queue into containerd:main with commit c8b095f Sep 5, 2024
52 checks passed
@k8s-infra-cherrypick-robot

@samuelkarp: #10651 failed to apply on top of branch "release/1.7":

Applying: runc-shim: remove misleading comment
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
Applying: runc-shim: refuse to start execs after init exits
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
Applying: runc-shim: handle pending execs as running
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
CONFLICT (content): Merge conflict in runtime/v2/runc/task/service.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 runc-shim: handle pending execs as running
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

> /cherrypick release/1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@laurazard (Member, Author) commented:

Opened backports: #10676 and #10675.

@samuelkarp added the cherry-picked/1.6.x (PR commits are cherry-picked into release/1.6 branch) and cherry-picked/1.7.x (PR commits are cherry-picked into release/1.7 branch) labels, and removed the cherry-pick/1.6.x (Change to be cherry picked to release/1.6 branch) and cherry-pick/1.7.x (Change to be cherry picked to release/1.7 branch) labels, on Sep 6, 2024
Labels

area/runtime · cherry-picked/1.6.x · cherry-picked/1.7.x · kind/bug · ok-to-test · size/L

Projects

Status: Done

Development

Successfully merging this pull request may close these issues:

  • supposedly running container actually dead

9 participants