integration: regression test for issue 10589 #10649
Conversation
Skipping CI for Draft Pull Request.
@samuelkarp: GitHub didn't allow me to request PR reviews from the following users: corhere. Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
How to run and output
Force-pushed from 04ebfeb to 7627176.
```go
if err != nil {
	return err
}
if err := ready.Trigger(); err != nil {
```
Using fifo as controller looks great!
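For readers unfamiliar with the pattern, here is a minimal, hypothetical sketch of a fifo-based gate in the spirit of the ready.Trigger() call above. The names (fifoGate, Wait, Trigger) and the wiring are illustrative assumptions, not the PR's actual helper: the side wrapping "runc exec" blocks opening the fifo for reading until the test opens it for writing.

```go
// Sketch only (assumed names): a fifo used as a synchronization gate between
// a test and a wrapped "runc exec".
package gate

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

type fifoGate struct{ path string }

// newFifoGate creates the fifo both sides will rendezvous on.
func newFifoGate(dir string) (*fifoGate, error) {
	p := filepath.Join(dir, "exec-ready.fifo")
	if err := unix.Mkfifo(p, 0o600); err != nil {
		return nil, err
	}
	return &fifoGate{path: p}, nil
}

// Wait is called by the wrapped side before it execs: opening a fifo
// read-only blocks until some writer opens the other end.
func (g *fifoGate) Wait() error {
	f, err := os.OpenFile(g.path, os.O_RDONLY, 0)
	if err != nil {
		return err
	}
	return f.Close()
}

// Trigger is called by the test when it wants the exec to proceed: opening
// the fifo write-only unblocks the pending Wait.
func (g *fifoGate) Trigger() error {
	f, err := os.OpenFile(g.path, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	return f.Close()
}
```

Whichever side arrives at the fifo first simply blocks until the other opens it, so no polling or timeouts are needed to control when the exec actually starts.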
Thank you! Will try to run it and see if I can repro. Update: I can reproduce with this 🥳
FWIW, testing with #10647 (and with #10651), I see:
sudo make integration EXTRA_TESTFLAGS='-run TestIssue10589'
+ integration
INFO[0000] Using the following image list: {Alpine:ghcr.io/containerd/alpine:3.14.0 BusyBox:ghcr.io/containerd/busybox:1.36 Pause:registry.k8s.io/pause:3.10 ResourceConsumer:registry.k8s.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.2 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1 ArgsEscaped:cplatpublic.azurecr.io/args-escaped-test-image-ns:1.0 DockerSchema1:registry.k8s.io/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff}
INFO[0000] running tests against containerd revision=0574a010dd5afc1e98babe10911407ead1b095d7 runtime= snapshotter= version=v2.0.0-rc.3-215-g0574a010d
INFO[0000] start to pull seed image image="ghcr.io/containerd/busybox:1.36"
=== RUN TestIssue10589
container_linux_test.go:1597: 1. Create a sleeping container
log_hook.go:47: time="2024-08-28T14:41:38.190740914+01:00" level=debug msg="remote introspection plugin filters" func="proxy.(*introspectionRemote).Plugins" file="/home/laurabrehm/go/src/github.com/containerd/containerd/core/introspection/proxy/remote.go:57" filters="[type==io.containerd.snapshotter.v1, id==overlayfs]" testcase=TestIssue10589
container_linux_test.go:1633: 2. Create exec1
container_linux_test.go:1658: 3. Wait until exec1 is waiting to start
container_linux_test.go:1651: Starting exec1
container_linux_test.go:1663: 4. Kill the container init process
container_linux_test.go:1668: Killing main pid (871) of container TestIssue10589
container_linux_test.go:1674: container status running
container_linux_test.go:1677: 5. Create exec2
container_linux_test.go:1705: 6. Wait until exec2 is waiting to start
container_linux_test.go:1698: Starting exec2
container_linux_test.go:1701: error starting exec2: container TestIssue10589 init process is not running: failed precondition
which is expected: it means the shim prevented the new exec from being started since the init has exited.
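To make the "expected" part concrete, here is a hedged sketch of how a test could assert that starting exec2 fails with containerd's failed-precondition error once init has exited. The function and variable names are illustrative, not the PR's actual assertions, and the import path is the 1.x client's errdefs package.

```go
// Sketch only: assert the exec start error is the expected
// "failed precondition" rather than some other failure.
package integration

import (
	"testing"

	"github.com/containerd/containerd/errdefs"
)

func assertExecRejected(t *testing.T, startErr error) {
	t.Helper()
	if startErr == nil {
		t.Fatal("expected starting exec2 to fail after init exited")
	}
	if !errdefs.IsFailedPrecondition(startErr) {
		t.Fatalf("expected a failed-precondition error, got: %v", startErr)
	}
}
```

The next comment confirms that the test indeed expects an error at this point.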
Yay! Edit: The test already expects an error here.
Force-pushed from 0c7d310 to e70eefc.
@laurazard I've updated the logic here to not block when exec2 fails to start, and ran it against the latest in #10651, but I'm still seeing the task have an incorrect status:
Do I need to delay checking the task status here to give the shim some time to process the init exit? Edit: adding a 5-second sleep did not help. Full log
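For context, the "delay checking the task status" idea being discussed would typically be a poll with a deadline rather than a fixed sleep. A hedged sketch follows; the helper name and timings are illustrative, and as noted above the extra delay did not actually help here, since the bug was in the shim's state tracking rather than the test's timing.

```go
// Sketch only: poll the task status until it reaches the wanted state or a
// deadline passes, instead of sleeping for a fixed amount of time.
package integration

import (
	"context"
	"testing"
	"time"

	"github.com/containerd/containerd"
)

func waitForStatus(ctx context.Context, t *testing.T, task containerd.Task, want containerd.ProcessStatus, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for {
		st, err := task.Status(ctx)
		if err != nil {
			t.Fatal(err)
		}
		if st.Status == want {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("task status is %q, wanted %q after %v", st.Status, want, timeout)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```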
The latest in #10651 might've been bad, sorry! I made some changes and pushed what was missing, but I hadn't tried the no-timeout code yet, so it might be broken somewhere. I'll take a better look tomorrow.
This PR will remain WIP until #10651 is merged, at which point I will rebase it and mark it ready. Please feel free to review it now, however.
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Force-pushed from e70eefc to 376cb86.
Rebased now that #10651 merged.
Force-pushed from e768237 to 0c00392.
Latest push should fix the issue (mixing …).
This issue was caused by a race between init exits and new exec process tracking inside the shim. The test operates by controlling the time between when the shim invokes "runc exec" and when the actual "runc exec" is triggered. This allows validating that races for shim state tracking between pre- and post-start of the exec process do not exist.

Relates to containerd#10589

Signed-off-by: Samuel Karp <samuelkarp@google.com>
Force-pushed from 0c00392 to 18725f0.
LGTM
@samuelkarp Sorry to bother you, but it still blocks when exec2 fails to start. Could you help take a look?
https://github.com/containerd/containerd/releases/tag/v1.7.22 and https://github.com/containerd/containerd/releases/tag/v1.6.36 contain the fix for this issue.
@samuelkarp Do you mean that I should get the test from v1.6.36 or v1.7.22?
The test is expected to fail (and can block) on versions prior to 1.7.22. The test should pass on 1.7.22.
@samuelkarp Thanks for the reply.
The descriptions in the issue and PR are a little hard to grasp, so let me explain the issue a bit more for those who are interested; please correct me if I'm wrong. The code I dove into is 1.7.21.
So that's the race: handleStarted produces the event, s.processExits is the consumer, and cleanup is the cleaner. They share a lock; if cleanup acquires the lock before s.processExits, the issue occurs.
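To make the subscribe/deliver/cleanup ordering easier to picture, here is a deliberately simplified toy model of the kind of race being described. It is not the shim's actual code, and later comments in this thread refine the analysis.

```go
// Toy model only: a starter subscribes to exit events, processExits delivers
// them to subscribers, and a cleanup unsubscribes. If cleanup takes the lock
// before processExits delivers, the starter never sees the exit.
package main

import (
	"fmt"
	"sync"
)

type exit struct{ pid, status int }

type shim struct {
	mu          sync.Mutex
	subscribers map[*[]exit]struct{}
}

// subscribe registers a buffer for exit events and returns the cleanup that
// unregisters it, mirroring the exitSubscribers pattern discussed above.
func (s *shim) subscribe() (*[]exit, func()) {
	buf := new([]exit)
	s.mu.Lock()
	s.subscribers[buf] = struct{}{}
	s.mu.Unlock()
	return buf, func() {
		s.mu.Lock()
		delete(s.subscribers, buf) // if this wins the race, the exit below is lost
		s.mu.Unlock()
	}
}

// processExits delivers an exit to every currently registered subscriber.
func (s *shim) processExits(e exit) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for buf := range s.subscribers {
		*buf = append(*buf, e)
	}
}

func main() {
	s := &shim{subscribers: map[*[]exit]struct{}{}}
	buf, cleanup := s.subscribe()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.processExits(exit{pid: 871, status: 137}) // init exit arrives concurrently
	}()
	cleanup() // races with processExits for the lock
	wg.Wait()

	fmt.Printf("exits seen by the starter: %d\n", len(*buf)) // 0 or 1, depending on who won
}
```

Running this repeatedly prints 0 or 1 depending on which side takes the lock first, which is the flavor of nondeterminism the regression test is designed to pin down.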
Do you have a way to reproduce this, @linsite?
cleanup runs after handleStarted
handleStarted will make sure exits = nil
so delete(s.exitSubscribers, &exits) may not happen?
Yes, you're right, my previous analysis was wrong. The true problem is that concurrent exec calls can make …
The log from the failing 1.7.21 case validates the issue; the logging was added in the s.send func.
We've noticed an odd thing: this test case fails 100% of the time on 1.7.21 on a physical PC with a 2.8GHz CPU, but never fails on any virtual machine with a 2.6 or 2.2GHz vCPU. The OS is Ubuntu 18.04 or 20.04, and the Go version is 1.22.
1.7.22 fixes this; you can try this PR: #10675
Thanks :). I know the fix. All I want is to reproduce the issue and figure out the root cause. FYI, after some study, I found that by adding a time.Sleep(50 * time.Millisecond) after the 4th step, the issue can be triggered 100% of the time on 1.7.21. That ensures the init process exits while s.pendingExecs[cp.Container] is non-zero. Best wishes.
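As a concrete illustration of where that sleep would go, here is a sketch; the helper below is illustrative, while the real test simply inlines these steps between step 4 (killing init) and step 5 (creating exec2), and the import path is the 1.x client.

```go
// Sketch only: kill the container's init process, then widen the race window
// so the shim observes the init exit while an exec is still pending.
package integration

import (
	"context"
	"syscall"
	"testing"
	"time"

	"github.com/containerd/containerd"
)

func killInitAndWidenRaceWindow(ctx context.Context, t *testing.T, task containerd.Task) {
	t.Helper()
	// Step 4 from the test log: kill the container init process.
	if err := task.Kill(ctx, syscall.SIGKILL); err != nil {
		t.Fatal(err)
	}
	// The tweak described above: on 1.7.21 this reliably lets the init exit be
	// processed while s.pendingExecs for the container is still non-zero.
	time.Sleep(50 * time.Millisecond)
}
```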
We observe a very similar phenomenon with containerd v1.6.33. It happens while running "kubectl drain node". I only have part of the logs at hand. From the logs, the node drain hangs for hours and one pod/container keeps failing to be killed. It seems there is a race between the init exit and the exec process. Neither the PreStop hook nor the container exit ever completes.
The containerd log also reports the same for hours.
Can you run ps -auxf to show the process status (containerd main process and exec process status)?
Hi @ningmingxiao, unfortunately our customer upgraded all their clusters with the workaround anyhow, so we didn't get the chance to collect more of this data. Thanks for the guidance anyway!
This issue was caused by a race between init exits and new exec process tracking inside the shim. The test operates by controlling the time between when the shim invokes "runc exec" and when the actual "runc exec" is triggered. This allows validating that races for shim state tracking between pre- and post-start of the exec process do not exist.
Relates to #10589
/cc @laurazard @corhere