
integration: regression test for issue 10589 #10649

Merged · 2 commits · Sep 6, 2024

Conversation

samuelkarp
Member

@samuelkarp samuelkarp commented Aug 28, 2024

This issue was caused by a race between init-process exits and the tracking of new exec processes inside the shim. The test works by controlling the time between when the shim invokes "runc exec" and when the actual "runc exec" is triggered. This allows validating that no races exist in the shim's state tracking between the pre-start and post-start phases of an exec process.

Relates to #10589

/cc @laurazard @corhere

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot

@samuelkarp: GitHub didn't allow me to request PR reviews from the following users: corhere.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Marked as WIP until the fix is merged.

/cc @laurazard @corhere


@samuelkarp
Member Author

How to run and output
samuelkarp@containerd:~/go/src/github.com/containerd/containerd$ sudo PATH=$PATH ./script/setup/install-failpoint-binaries 
+ bin/cni-bridge-fp
+ bin/containerd-shim-runc-fp-v1
+ bin/runc-fp
samuelkarp@containerd:~/go/src/github.com/containerd/containerd$ sudo -E PATH=$PATH make integration EXTRA_TESTFLAGS='-run TestIssue10589'
+ integration
INFO[0000] Using the following image list: {Alpine:ghcr.io/containerd/alpine:3.14.0 BusyBox:ghcr.io/containerd/busybox:1.36 Pause:registry.k8s.io/pause:3.10 ResourceConsumer:registry.k8s.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.2 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1 ArgsEscaped:cplatpublic.azurecr.io/args-escaped-test-image-ns:1.0 DockerSchema1:registry.k8s.io/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff} 
INFO[0000] running tests against containerd              revision=d1fa7450b9848a40c6379a3740a38b72bf34d59a runtime= snapshotter= version=v2.0.0-rc.3-222-gd1fa7450b
INFO[0000] start to pull seed image                      image="ghcr.io/containerd/busybox:1.36"
=== RUN   TestIssue10589
    container_linux_test.go:1597: 1. Create a sleeping container
    log_hook.go:47: time="2024-08-28T04:11:38.779645316Z" level=debug msg="remote introspection plugin filters" func="proxy.(*introspectionRemote).Plugins" file="/home/samuelkarp/go/src/github.com/containerd/containerd/core/introspection/proxy/remote.go:57" filters="[type==io.containerd.snapshotter.v1, id==overlayfs]" testcase=TestIssue10589
    container_linux_test.go:1631: 2. Create exec1
    container_linux_test.go:1656: 3. Wait until exec1 is waiting to start
    container_linux_test.go:1649: Starting exec1
    container_linux_test.go:1661: 4. Kill the container init process
    container_linux_test.go:1666: Killing main pid (95403) of container TestIssue10589
    container_linux_test.go:1672: container status running
    container_linux_test.go:1675: 5. Create exec2
    container_linux_test.go:1703: 6. Wait until exec2 is waiting to start
    container_linux_test.go:1696: Starting exec2
    container_linux_test.go:1708: 7. Allow exec=1 to proceed
    container_linux_test.go:1713: container status running
    container_linux_test.go:1652: error starting exec1: OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
    container_linux_test.go:1717: container status running
    container_linux_test.go:1720: 8. Allow exec=2 to proceed
    container_linux_test.go:1725: container status running
    container_linux_test.go:1699: error starting exec2: OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
    container_linux_test.go:1729: container status running
    container_linux_test.go:1732: 9. Validate
    container_linux_test.go:1735: exec1 status: created
    container_linux_test.go:1739: exec2 status: created
    container_linux_test.go:1742: task status: running
    container_linux_test.go:1744: 
        	Error Trace:	/home/samuelkarp/go/src/github.com/containerd/containerd/integration/client/container_linux_test.go:1744
        	Error:      	Not equal: 
        	            	expected: "stopped"
        	            	actual  : "running"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(client.ProcessStatus) (len=7) "stopped"
        	            	+(client.ProcessStatus) (len=7) "running"
        	            	 
        	Test:       	TestIssue10589
--- FAIL: TestIssue10589 (0.18s)
FAIL
time="2024-08-28T04:11:37.013425349Z" level=info msg="starting containerd" revision=d1fa7450b9848a40c6379a3740a38b72bf34d59a version=v2.0.0-rc.3-222-gd1fa7450b
time="2024-08-28T04:11:37.021898040Z" level=warning msg="Configuration migrated from version 2, use `containerd config migrate` to avoid migration" t="5.69µs"
time="2024-08-28T04:11:37.021973660Z" level=info msg="loading plugin" id=io.containerd.image-verifier.v1.bindir type=io.containerd.image-verifier.v1
time="2024-08-28T04:11:37.022123140Z" level=info msg="loading plugin" id=io.containerd.internal.v1.opt type=io.containerd.internal.v1
time="2024-08-28T04:11:37.022169900Z" level=info msg="loading plugin" id=io.containerd.warning.v1.deprecations type=io.containerd.warning.v1
time="2024-08-28T04:11:37.022186400Z" level=info msg="loading plugin" id=io.containerd.content.v1.content type=io.containerd.content.v1
time="2024-08-28T04:11:37.022467100Z" level=info msg="loading plugin" id=io.containerd.event.v1.exchange type=io.containerd.event.v1
time="2024-08-28T04:11:37.022510350Z" level=info msg="loading plugin" id=io.containerd.monitor.task.v1.cgroups type=io.containerd.monitor.task.v1
time="2024-08-28T04:11:37.022824590Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.blockfile type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.022900730Z" level=info msg="skip loading plugin" error="no scratch file generator: skip plugin" id=io.containerd.snapshotter.v1.blockfile type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.022917710Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.btrfs type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.023144200Z" level=info msg="skip loading plugin" error="path /var/lib/containerd-test/io.containerd.snapshotter.v1.btrfs (ext4) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" id=io.containerd.snapshotter.v1.btrfs type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.023163110Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.devmapper type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.023173600Z" level=info msg="skip loading plugin" error="devmapper not configured: skip plugin" id=io.containerd.snapshotter.v1.devmapper type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.023180670Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.native type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.023251880Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.overlayfs type=io.containerd.snapshotter.v1
time="2024-08-28T04:11:37.023455210Z" level=info msg="loading plugin" id=io.containerd.metadata.v1.bolt type=io.containerd.metadata.v1
time="2024-08-28T04:11:37.023512270Z" level=info msg="metadata content store policy set" policy=shared
time="2024-08-28T04:11:37.033084420Z" level=info msg="loading plugin" id=io.containerd.gc.v1.scheduler type=io.containerd.gc.v1
time="2024-08-28T04:11:37.033172101Z" level=info msg="loading plugin" id=io.containerd.differ.v1.walking type=io.containerd.differ.v1
time="2024-08-28T04:11:37.033190230Z" level=info msg="loading plugin" id=io.containerd.lease.v1.manager type=io.containerd.lease.v1
time="2024-08-28T04:11:37.033218361Z" level=info msg="loading plugin" id=io.containerd.service.v1.containers-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.033242930Z" level=info msg="loading plugin" id=io.containerd.service.v1.content-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.033254330Z" level=info msg="loading plugin" id=io.containerd.service.v1.diff-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.033265851Z" level=info msg="loading plugin" id=io.containerd.service.v1.images-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.033284951Z" level=info msg="loading plugin" id=io.containerd.service.v1.introspection-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.033300340Z" level=info msg="loading plugin" id=io.containerd.service.v1.namespaces-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.033312540Z" level=info msg="loading plugin" id=io.containerd.service.v1.snapshots-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.033338560Z" level=info msg="loading plugin" id=io.containerd.shim.v1.manager type=io.containerd.shim.v1
time="2024-08-28T04:11:37.033352120Z" level=info msg="loading plugin" id=io.containerd.runtime.v2.task type=io.containerd.runtime.v2
time="2024-08-28T04:11:37.033499541Z" level=debug msg="loading tasks in namespace" namespace=testing
time="2024-08-28T04:11:37.033744600Z" level=info msg="connecting to shim TestIssue10589" address="unix:///run/containerd/s/b409d80ab8498366dd90ee591ff37adaa8bfe4302953978e2801b6158cc41403" namespace=testing protocol=ttrpc version=3
time="2024-08-28T04:11:37.033854621Z" level=error msg="copy shim log after reload" error="read /proc/self/fd/7: file already closed" namespace=testing
time="2024-08-28T04:11:37.033903201Z" level=warning msg="cleaning up after shim disconnected" id=TestIssue10589 namespace=testing
time="2024-08-28T04:11:37.033909921Z" level=info msg="cleaning up dead shim" namespace=testing
time="2024-08-28T04:11:37.201476835Z" level=warning msg="cleanup warnings time=\"2024-08-28T04:11:37Z\" level=debug msg=\"starting signal loop\" namespace=testing pid=95323 runtime=io.containerd.runc.v2\n" namespace=testing
time="2024-08-28T04:11:37.202085355Z" level=error msg="failed to load shim /run/containerd-test/io.containerd.runtime.v2.task/testing/TestIssue10589" error="unable to load shim \"TestIssue10589\": unable to make connection: failed to create TTRPC connection: dial unix /run/containerd/s/b409d80ab8498366dd90ee591ff37adaa8bfe4302953978e2801b6158cc41403: connect: connection refused" namespace=testing
time="2024-08-28T04:11:37.202152375Z" level=error msg="cleanup working directory in namespace" error="open /var/lib/containerd-test/io.containerd.runtime.v2.task/testing: no such file or directory" namespace=testing
time="2024-08-28T04:11:37.202181265Z" level=info msg="loading plugin" id=io.containerd.service.v1.tasks-service type=io.containerd.service.v1
time="2024-08-28T04:11:37.202203375Z" level=debug msg="No blockio config file specified, blockio not configured"
time="2024-08-28T04:11:37.202208585Z" level=debug msg="No RDT config file specified, RDT not configured"
time="2024-08-28T04:11:37.202216545Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.containers type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202228705Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.content type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202239035Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.diff type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202252095Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.events type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202263835Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.images type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202274065Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.introspection type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202291115Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.leases type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202308425Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.namespaces type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202318735Z" level=info msg="loading plugin" id=io.containerd.sandbox.store.v1.local type=io.containerd.sandbox.store.v1
time="2024-08-28T04:11:37.202327725Z" level=info msg="loading plugin" id=io.containerd.cri.v1.images type=io.containerd.cri.v1
time="2024-08-28T04:11:37.202384495Z" level=info msg="Get image filesystem path \"/var/lib/containerd-test/io.containerd.snapshotter.v1.overlayfs\" for snapshotter \"overlayfs\""
time="2024-08-28T04:11:37.202401495Z" level=info msg="Start snapshots syncer"
time="2024-08-28T04:11:37.202430995Z" level=info msg="loading plugin" id=io.containerd.cri.v1.runtime type=io.containerd.cri.v1
time="2024-08-28T04:11:37.202604625Z" level=info msg="starting cri plugin" config="{\"containerd\":{\"defaultRuntimeName\":\"runc\",\"runtimes\":{\"runc\":{\"runtimeType\":\"io.containerd.runc.v2\",\"runtimePath\":\"\",\"PodAnnotations\":null,\"ContainerAnnotations\":null,\"options\":{\"BinaryName\":\"\",\"CriuImagePath\":\"\",\"CriuWorkPath\":\"\",\"IoGid\":0,\"IoUid\":0,\"NoNewKeyring\":false,\"Root\":\"\",\"ShimCgroup\":\"\"},\"privileged_without_host_devices\":false,\"privileged_without_host_devices_all_devices_allowed\":false,\"baseRuntimeSpec\":\"\",\"cniConfDir\":\"\",\"cniMaxConfNum\":0,\"snapshotter\":\"\",\"sandboxer\":\"podsandbox\",\"io_type\":\"\"}},\"ignoreBlockIONotEnabledErrors\":false,\"ignoreRdtNotEnabledErrors\":false},\"cni\":{\"binDir\":\"/opt/cni/bin\",\"confDir\":\"/etc/cni/net.d\",\"maxConfNum\":1,\"setupSerially\":false,\"confTemplate\":\"\",\"ipPref\":\"\",\"useInternalLoopback\":false},\"enableSelinux\":false,\"selinuxCategoryRange\":1024,\"maxContainerLogSize\":16384,\"disableCgroup\":false,\"disableApparmor\":false,\"restrictOOMScoreAdj\":false,\"disableProcMount\":false,\"unsetSeccompProfile\":\"\",\"tolerateMissingHugetlbController\":true,\"disableHugetlbController\":true,\"device_ownership_from_security_context\":false,\"ignoreImageDefinedVolumes\":false,\"netnsMountsUnderStateDir\":false,\"enableUnprivilegedPorts\":true,\"enableUnprivilegedICMP\":true,\"enableCDI\":true,\"cdiSpecDirs\":[\"/etc/cdi\",\"/var/run/cdi\"],\"drainExecSyncIOTimeout\":\"0s\",\"ignoreDeprecationWarnings\":null,\"containerdRootDir\":\"/var/lib/containerd-test\",\"containerdEndpoint\":\"/run/containerd-test/containerd.sock\",\"rootDir\":\"/var/lib/containerd-test/io.containerd.grpc.v1.cri\",\"stateDir\":\"/run/containerd-test/io.containerd.grpc.v1.cri\"}"
time="2024-08-28T04:11:37.202651135Z" level=info msg="loading plugin" id=io.containerd.sandbox.controller.v1.podsandbox type=io.containerd.sandbox.controller.v1
time="2024-08-28T04:11:37.202716755Z" level=info msg="loading plugin" id=io.containerd.sandbox.controller.v1.shim type=io.containerd.sandbox.controller.v1
time="2024-08-28T04:11:37.202824215Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.sandbox-controllers type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202847185Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.sandboxes type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202858595Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.snapshots type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202873185Z" level=info msg="loading plugin" id=io.containerd.streaming.v1.manager type=io.containerd.streaming.v1
time="2024-08-28T04:11:37.202883865Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.streaming type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202892455Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.tasks type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202901105Z" level=info msg="loading plugin" id=io.containerd.transfer.v1.local type=io.containerd.transfer.v1
time="2024-08-28T04:11:37.202922405Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.transfer type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202931955Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.version type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.202939095Z" level=info msg="loading plugin" id=io.containerd.monitor.container.v1.restart type=io.containerd.monitor.container.v1
time="2024-08-28T04:11:37.202973815Z" level=info msg="loading plugin" id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
time="2024-08-28T04:11:37.202984855Z" level=info msg="skip loading plugin" error="skip plugin: tracing endpoint not configured" id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
time="2024-08-28T04:11:37.202991725Z" level=info msg="loading plugin" id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
time="2024-08-28T04:11:37.202999145Z" level=info msg="skip loading plugin" error="skip plugin: tracing endpoint not configured" id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
time="2024-08-28T04:11:37.203005325Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.healthcheck type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.203022425Z" level=info msg="loading plugin" id=io.containerd.nri.v1.nri type=io.containerd.nri.v1
time="2024-08-28T04:11:37.203040585Z" level=info msg="runtime interface created"
time="2024-08-28T04:11:37.203044885Z" level=info msg="created NRI interface"
time="2024-08-28T04:11:37.203055965Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.cri type=io.containerd.grpc.v1
time="2024-08-28T04:11:37.203063535Z" level=info msg="Connect containerd service"
time="2024-08-28T04:11:37.203273275Z" level=info msg="using experimental NRI integration - disable nri plugin to prevent this"
time="2024-08-28T04:11:37.204259855Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
time="2024-08-28T04:11:37.216144536Z" level=debug msg="runtime \"runc\" supports recursive read-only mounts"
time="2024-08-28T04:11:37.216189716Z" level=debug msg="runtime \"runc\" supports CRI userns: false"
time="2024-08-28T04:11:37.216302036Z" level=info msg="Start subscribing containerd event"
time="2024-08-28T04:11:37.216384846Z" level=info msg="Start recovering state"
time="2024-08-28T04:11:37.216487096Z" level=info msg="Start event monitor"
time="2024-08-28T04:11:37.216509776Z" level=info msg="Start cni network conf syncer for default"
time="2024-08-28T04:11:37.216518166Z" level=info msg="Start streaming server"
time="2024-08-28T04:11:37.216533856Z" level=info msg="Registered namespace \"k8s.io\" with NRI"
time="2024-08-28T04:11:37.216541286Z" level=info msg="runtime interface starting up..."
time="2024-08-28T04:11:37.216546656Z" level=info msg="starting plugins..."
time="2024-08-28T04:11:37.216601806Z" level=info msg=serving... address=/run/containerd-test/containerd.sock.ttrpc
time="2024-08-28T04:11:37.216663896Z" level=info msg=serving... address=/run/containerd-test/containerd.sock
time="2024-08-28T04:11:37.216693676Z" level=debug msg="sd notification" notified=false state="READY=1"
time="2024-08-28T04:11:37.216711386Z" level=info msg="containerd successfully booted in 0.203890s"
time="2024-08-28T04:11:37.915657885Z" level=debug msg="(*service).Write started" expected="sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c" ref="index-sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c" total=2295
time="2024-08-28T04:11:38.049151204Z" level=debug msg="(*service).Write started" expected="sha256:907ca53d7e2947e849b839b1cd258c98fd3916c60f2e6e70c30edbf741ab6754" ref="manifest-sha256:907ca53d7e2947e849b839b1cd258c98fd3916c60f2e6e70c30edbf741ab6754" total=528
time="2024-08-28T04:11:38.178064985Z" level=debug msg="(*service).Write started" expected="sha256:66ba00ad3de8677a3fa4bc4ea0fc46ebca0f14db46ca365e7f60833068dd0148" ref="config-sha256:66ba00ad3de8677a3fa4bc4ea0fc46ebca0f14db46ca365e7f60833068dd0148" total=1457
time="2024-08-28T04:11:38.387775073Z" level=debug msg="stat snapshot" key="sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2"
time="2024-08-28T04:11:38.388530593Z" level=debug msg="prepare snapshot" key="extract-388036413-qBsk sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2" parent=
time="2024-08-28T04:11:38.399581294Z" level=debug msg="(*service).Write started" expected="sha256:205dae5015e78dd8c4d302e3db4eb31576fac715b46d099fe09680ba28093a7a" ref="layer-sha256:205dae5015e78dd8c4d302e3db4eb31576fac715b46d099fe09680ba28093a7a" total=2592863
time="2024-08-28T04:11:38.680446748Z" level=debug msg=" not found" error="exec: \"igzip\": executable file not found in $PATH"
time="2024-08-28T04:11:38.680537898Z" level=debug msg=" not found" error="exec: \"unpigz\": executable file not found in $PATH"
time="2024-08-28T04:11:38.749701223Z" level=debug msg="diff applied" d=69.497216ms digest="sha256:205dae5015e78dd8c4d302e3db4eb31576fac715b46d099fe09680ba28093a7a" media=application/vnd.docker.image.rootfs.diff.tar.gzip size=2592863
time="2024-08-28T04:11:38.750908073Z" level=debug msg="commit snapshot" key="extract-388036413-qBsk sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2" name="sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2"
time="2024-08-28T04:11:38.764704904Z" level=debug msg="create image" name="ghcr.io/containerd/busybox:1.36" target="sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c"
time="2024-08-28T04:11:38.766344684Z" level=debug msg="Received containerd event timestamp - 2024-08-28 04:11:38.766251035 +0000 UTC, namespace - \"testing\", topic - \"/images/create\""
time="2024-08-28T04:11:38.766374605Z" level=debug msg="Ignoring events in namespace - \"testing\""
time="2024-08-28T04:11:38.780651316Z" level=debug msg="prepare snapshot" key=TestIssue10589 parent="sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2"
time="2024-08-28T04:11:38.791604237Z" level=debug msg="get snapshot mounts" key=TestIssue10589
time="2024-08-28T04:11:38.801761747Z" level=debug msg="get snapshot mounts" key=TestIssue10589
time="2024-08-28T04:11:38.811294418Z" level=info msg="connecting to shim TestIssue10589" address="unix:///run/containerd/s/b409d80ab8498366dd90ee591ff37adaa8bfe4302953978e2801b6158cc41403" namespace=testing protocol=ttrpc version=3
time="2024-08-28T04:11:38.815234988Z" level=debug msg="loading plugin" id=io.containerd.event.v1.publisher runtime=io.containerd.runc.v2 type=io.containerd.event.v1
time="2024-08-28T04:11:38.815309199Z" level=debug msg="loading plugin" id=io.containerd.internal.v1.shutdown runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
time="2024-08-28T04:11:38.815321628Z" level=debug msg="loading plugin" id=io.containerd.ttrpc.v1.task runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2024-08-28T04:11:38.815394018Z" level=debug msg="registering ttrpc service" id=io.containerd.ttrpc.v1.task runtime=io.containerd.runc.v2
time="2024-08-28T04:11:38.815541139Z" level=debug msg="serving api on socket" socket="[inherited from parent]"
time="2024-08-28T04:11:38.815599498Z" level=debug msg="serving api on socket" socket="[inherited from parent]"
time="2024-08-28T04:11:38.815618769Z" level=debug msg="starting signal loop" namespace=testing path=/run/containerd-test/io.containerd.runtime.v2.task/testing/TestIssue10589 pid=95376 runtime=io.containerd.runc.v2
time="2024-08-28T04:11:38.952703070Z" level=debug msg="received signal" signal=terminated
time="2024-08-28T04:11:38.952758230Z" level=debug msg="sd notification" notified=false state="STOPPING=1"
time="2024-08-28T04:11:38.952831060Z" level=info msg="Stop CRI service"
time="2024-08-28T04:11:38.953052540Z" level=debug msg="cni watcher channel is closed"
time="2024-08-28T04:11:38.953070410Z" level=info msg="Stop CRI service"
time="2024-08-28T04:11:38.953133010Z" level=info msg="Event monitor stopped"
time="2024-08-28T04:11:38.953170460Z" level=info msg="Stream server stopped"

exit status 1
FAIL	github.com/containerd/containerd/v2/integration/client	1.991s
make: *** [Makefile:218: integration] Error 1

@samuelkarp samuelkarp force-pushed the shim-exec-fp-test branch 2 times, most recently from 04ebfeb to 7627176 on August 28, 2024 04:19
if err != nil {
    return err
}
if err := ready.Trigger(); err != nil {
Member


Using fifo as controller looks great!

@laurazard
Member

laurazard commented Aug 28, 2024

Thank you! Will try to run it and see if I can repro.

Update: I can reproduce with this 🥳

@laurazard
Member

FWIW, testing with #10647 (and with #10651), I see:

sudo make integration EXTRA_TESTFLAGS='-run TestIssue10589'
+ integration
INFO[0000] Using the following image list: {Alpine:ghcr.io/containerd/alpine:3.14.0 BusyBox:ghcr.io/containerd/busybox:1.36 Pause:registry.k8s.io/pause:3.10 ResourceConsumer:registry.k8s.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.2 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1 ArgsEscaped:cplatpublic.azurecr.io/args-escaped-test-image-ns:1.0 DockerSchema1:registry.k8s.io/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff}
INFO[0000] running tests against containerd              revision=0574a010dd5afc1e98babe10911407ead1b095d7 runtime= snapshotter= version=v2.0.0-rc.3-215-g0574a010d
INFO[0000] start to pull seed image                      image="ghcr.io/containerd/busybox:1.36"
=== RUN   TestIssue10589
    container_linux_test.go:1597: 1. Create a sleeping container
    log_hook.go:47: time="2024-08-28T14:41:38.190740914+01:00" level=debug msg="remote introspection plugin filters" func="proxy.(*introspectionRemote).Plugins" file="/home/laurabrehm/go/src/github.com/containerd/containerd/core/introspection/proxy/remote.go:57" filters="[type==io.containerd.snapshotter.v1, id==overlayfs]" testcase=TestIssue10589
    container_linux_test.go:1633: 2. Create exec1
    container_linux_test.go:1658: 3. Wait until exec1 is waiting to start
    container_linux_test.go:1651: Starting exec1
    container_linux_test.go:1663: 4. Kill the container init process
    container_linux_test.go:1668: Killing main pid (871) of container TestIssue10589
    container_linux_test.go:1674: container status running
    container_linux_test.go:1677: 5. Create exec2
    container_linux_test.go:1705: 6. Wait until exec2 is waiting to start
    container_linux_test.go:1698: Starting exec2
    container_linux_test.go:1701: error starting exec2: container TestIssue10589 init process is not running: failed precondition

which is expected – it means the shim prevented the new exec from being started since the init had exited.

@samuelkarp
Member Author

samuelkarp commented Aug 28, 2024

which is expected – it means the shim prevented the new exec from being started since the init had exited.

Yay! We can update the test to pass in this case; I'll look at the exact error.

Edit: The test already expects an error here.

@samuelkarp samuelkarp force-pushed the shim-exec-fp-test branch 2 times, most recently from 0c7d310 to e70eefc on September 3, 2024 23:24
@samuelkarp
Member Author

samuelkarp commented Sep 3, 2024

@laurazard I've updated the logic here not to block when exec2 fails to start, and ran it against the latest in #10651, but I'm still seeing the task report an incorrect status:

    container_linux_test.go:1750: task status: running
    container_linux_test.go:1752: 
        	Error Trace:	/home/samuelkarp/go/src/github.com/containerd/containerd/integration/client/container_linux_test.go:1752
        	Error:      	Not equal: 
        	            	expected: "stopped"
        	            	actual  : "running"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(client.ProcessStatus) (len=7) "stopped"
        	            	+(client.ProcessStatus) (len=7) "running"
        	            	 
        	Test:       	TestIssue10589

Do I need to delay checking the task status here to give the shim some time to process the init exit?

Edit: adding a 5 second sleep did not help

Full log
samuelkarp@containerd:~/go/src/github.com/containerd/containerd$ sudo kill -9 $(pidof containerd-shim-runc-v2)
samuelkarp@containerd:~/go/src/github.com/containerd/containerd$ git pull --rebase
remote: Finding sources: 100% (6/6)
remote: Total 6 (delta 4), reused 6 (delta 4)
Unpacking objects: 100% (6/6), 3.19 KiB | 652.00 KiB/s, done.
From https://source.developers.google.com/p/samuelkarp-gke-dev/r/containerd
 + e73183aae...927ffbe86 shim-exec-fp-test -> samuelkarp-gke-dev/shim-exec-fp-test  (forced update)
Successfully rebased and updated refs/heads/shim-exec-fp-test.
samuelkarp@containerd:~/go/src/github.com/containerd/containerd$ sudo -E PATH=$PATH make integration EXTRA_TESTFLAGS='-run TestIssue10589'
+ integration
INFO[0000] Using the following image list: {Alpine:ghcr.io/containerd/alpine:3.14.0 BusyBox:ghcr.io/containerd/busybox:1.36 Pause:registry.k8s.io/pause:3.10 ResourceConsumer:registry.k8s.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.2 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1 ArgsEscaped:cplatpublic.azurecr.io/args-escaped-test-image-ns:1.0 DockerSchema1:registry.k8s.io/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff} 
INFO[0000] running tests against containerd              revision=0eb1d8ed220e07eca97b0533ec46b6e1dc1e4e7b runtime= snapshotter= version=v2.0.0-rc.3-213-g0eb1d8ed2
INFO[0000] start to pull seed image                      image="ghcr.io/containerd/busybox:1.36"
=== RUN   TestIssue10589
    container_linux_test.go:1597: 1. Create a sleeping container
    log_hook.go:47: time="2024-09-03T23:40:30.697982162Z" level=debug msg="remote introspection plugin filters" func="proxy.(*introspectionRemote).Plugins" file="/home/samuelkarp/go/src/github.com/containerd/containerd/core/introspection/proxy/remote.go:57" filters="[type==io.containerd.snapshotter.v1, id==overlayfs]" testcase=TestIssue10589
    container_linux_test.go:1633: 2. Create exec1
    container_linux_test.go:1658: 3. Wait until exec1 is waiting to start
    container_linux_test.go:1651: Starting exec1
    container_linux_test.go:1663: 4. Kill the container init process
    container_linux_test.go:1665: Killing main pid (152027) of container TestIssue10589
    container_linux_test.go:1669: container status running
    container_linux_test.go:1672: 5. Create exec2
    container_linux_test.go:1698: 6. Wait until exec2 is waiting to start
    container_linux_test.go:1691: Starting exec2
    container_linux_test.go:1694: error starting exec2: container TestIssue10589 init process is not running: failed precondition
    container_linux_test.go:1712: 7. Allow exec=1 to proceed
    container_linux_test.go:1717: container status running
    container_linux_test.go:1654: error starting exec1: OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
    container_linux_test.go:1721: container status running
    container_linux_test.go:1736: 8. Skip exec2
    container_linux_test.go:1740: 9. Validate
    container_linux_test.go:1744: exec1 status: created
    container_linux_test.go:1748: exec2 status: created
    container_linux_test.go:1751: task status: running
    container_linux_test.go:1753: 
        	Error Trace:	/home/samuelkarp/go/src/github.com/containerd/containerd/integration/client/container_linux_test.go:1753
        	Error:      	Not equal: 
        	            	expected: "stopped"
        	            	actual  : "running"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(client.ProcessStatus) (len=7) "stopped"
        	            	+(client.ProcessStatus) (len=7) "running"
        	            	 
        	Test:       	TestIssue10589
--- FAIL: TestIssue10589 (5.19s)
FAIL
time="2024-09-03T23:40:28.934014925Z" level=info msg="starting containerd" revision=0eb1d8ed220e07eca97b0533ec46b6e1dc1e4e7b version=v2.0.0-rc.3-213-g0eb1d8ed2
time="2024-09-03T23:40:28.941185304Z" level=warning msg="Configuration migrated from version 2, use `containerd config migrate` to avoid migration" t="4.74µs"
time="2024-09-03T23:40:28.941220324Z" level=info msg="loading plugin" id=io.containerd.image-verifier.v1.bindir type=io.containerd.image-verifier.v1
time="2024-09-03T23:40:28.941248234Z" level=info msg="loading plugin" id=io.containerd.internal.v1.opt type=io.containerd.internal.v1
time="2024-09-03T23:40:28.942123584Z" level=info msg="loading plugin" id=io.containerd.warning.v1.deprecations type=io.containerd.warning.v1
time="2024-09-03T23:40:28.942212594Z" level=info msg="loading plugin" id=io.containerd.event.v1.exchange type=io.containerd.event.v1
time="2024-09-03T23:40:28.942259214Z" level=info msg="loading plugin" id=io.containerd.monitor.task.v1.cgroups type=io.containerd.monitor.task.v1
time="2024-09-03T23:40:28.942529884Z" level=info msg="loading plugin" id=io.containerd.content.v1.content type=io.containerd.content.v1
time="2024-09-03T23:40:28.942790894Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.blockfile type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.942862044Z" level=info msg="skip loading plugin" error="no scratch file generator: skip plugin" id=io.containerd.snapshotter.v1.blockfile type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.942877564Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.btrfs type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.943105064Z" level=info msg="skip loading plugin" error="path /var/lib/containerd-test/io.containerd.snapshotter.v1.btrfs (ext4) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" id=io.containerd.snapshotter.v1.btrfs type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.943130674Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.devmapper type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.943148384Z" level=info msg="skip loading plugin" error="devmapper not configured: skip plugin" id=io.containerd.snapshotter.v1.devmapper type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.943155814Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.native type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.943286094Z" level=info msg="loading plugin" id=io.containerd.snapshotter.v1.overlayfs type=io.containerd.snapshotter.v1
time="2024-09-03T23:40:28.943473164Z" level=info msg="loading plugin" id=io.containerd.metadata.v1.bolt type=io.containerd.metadata.v1
time="2024-09-03T23:40:28.943530464Z" level=info msg="metadata content store policy set" policy=shared
time="2024-09-03T23:40:28.969587951Z" level=info msg="loading plugin" id=io.containerd.gc.v1.scheduler type=io.containerd.gc.v1
time="2024-09-03T23:40:28.969649381Z" level=info msg="loading plugin" id=io.containerd.differ.v1.walking type=io.containerd.differ.v1
time="2024-09-03T23:40:28.970055761Z" level=info msg="loading plugin" id=io.containerd.lease.v1.manager type=io.containerd.lease.v1
time="2024-09-03T23:40:28.970077261Z" level=info msg="loading plugin" id=io.containerd.service.v1.containers-service type=io.containerd.service.v1
time="2024-09-03T23:40:28.970088681Z" level=info msg="loading plugin" id=io.containerd.service.v1.content-service type=io.containerd.service.v1
time="2024-09-03T23:40:28.970098271Z" level=info msg="loading plugin" id=io.containerd.service.v1.diff-service type=io.containerd.service.v1
time="2024-09-03T23:40:28.970109691Z" level=info msg="loading plugin" id=io.containerd.service.v1.images-service type=io.containerd.service.v1
time="2024-09-03T23:40:28.970120361Z" level=info msg="loading plugin" id=io.containerd.service.v1.introspection-service type=io.containerd.service.v1
time="2024-09-03T23:40:28.970129001Z" level=info msg="loading plugin" id=io.containerd.service.v1.namespaces-service type=io.containerd.service.v1
time="2024-09-03T23:40:28.970137011Z" level=info msg="loading plugin" id=io.containerd.service.v1.snapshots-service type=io.containerd.service.v1
time="2024-09-03T23:40:28.970144941Z" level=info msg="loading plugin" id=io.containerd.shim.v1.manager type=io.containerd.shim.v1
time="2024-09-03T23:40:28.970165031Z" level=info msg="loading plugin" id=io.containerd.runtime.v2.task type=io.containerd.runtime.v2
time="2024-09-03T23:40:28.970282421Z" level=debug msg="loading tasks in namespace" namespace=testing
time="2024-09-03T23:40:28.970537531Z" level=info msg="connecting to shim TestIssue10589" address="unix:///run/containerd/s/b409d80ab8498366dd90ee591ff37adaa8bfe4302953978e2801b6158cc41403" namespace=testing protocol=ttrpc version=3
time="2024-09-03T23:40:28.970645481Z" level=error msg="copy shim log after reload" error="read /proc/self/fd/7: file already closed" namespace=testing
time="2024-09-03T23:40:28.970696981Z" level=warning msg="cleaning up after shim disconnected" id=TestIssue10589 namespace=testing
time="2024-09-03T23:40:28.970703491Z" level=info msg="cleaning up dead shim" namespace=testing
time="2024-09-03T23:40:29.081720708Z" level=warning msg="cleanup warnings time=\"2024-09-03T23:40:28Z\" level=debug msg=\"starting signal loop\" namespace=testing pid=151950 runtime=io.containerd.runc.v2\n" namespace=testing
time="2024-09-03T23:40:29.082343238Z" level=error msg="failed to load shim /run/containerd-test/io.containerd.runtime.v2.task/testing/TestIssue10589" error="unable to load shim \"TestIssue10589\": unable to make connection: failed to create TTRPC connection: dial unix /run/containerd/s/b409d80ab8498366dd90ee591ff37adaa8bfe4302953978e2801b6158cc41403: connect: connection refused" namespace=testing
time="2024-09-03T23:40:29.082422578Z" level=error msg="cleanup working directory in namespace" error="open /var/lib/containerd-test/io.containerd.runtime.v2.task/testing: no such file or directory" namespace=testing
time="2024-09-03T23:40:29.082455818Z" level=info msg="loading plugin" id=io.containerd.service.v1.tasks-service type=io.containerd.service.v1
time="2024-09-03T23:40:29.082477898Z" level=debug msg="No blockio config file specified, blockio not configured"
time="2024-09-03T23:40:29.082483028Z" level=debug msg="No RDT config file specified, RDT not configured"
time="2024-09-03T23:40:29.082490428Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.containers type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082503658Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.content type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082513458Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.diff type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082522838Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.events type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082532298Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.images type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082541318Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.introspection type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082550388Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.leases type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082560428Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.namespaces type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.082579678Z" level=info msg="loading plugin" id=io.containerd.sandbox.store.v1.local type=io.containerd.sandbox.store.v1
time="2024-09-03T23:40:29.082588888Z" level=info msg="loading plugin" id=io.containerd.cri.v1.images type=io.containerd.cri.v1
time="2024-09-03T23:40:29.082650887Z" level=info msg="Get image filesystem path \"/var/lib/containerd-test/io.containerd.snapshotter.v1.overlayfs\" for snapshotter \"overlayfs\""
time="2024-09-03T23:40:29.082666758Z" level=info msg="Start snapshots syncer"
time="2024-09-03T23:40:29.082701698Z" level=info msg="loading plugin" id=io.containerd.cri.v1.runtime type=io.containerd.cri.v1
time="2024-09-03T23:40:29.082939678Z" level=info msg="starting cri plugin" config="{\"containerd\":{\"defaultRuntimeName\":\"runc\",\"runtimes\":{\"runc\":{\"runtimeType\":\"io.containerd.runc.v2\",\"runtimePath\":\"\",\"PodAnnotations\":null,\"ContainerAnnotations\":null,\"options\":{\"BinaryName\":\"\",\"CriuImagePath\":\"\",\"CriuWorkPath\":\"\",\"IoGid\":0,\"IoUid\":0,\"NoNewKeyring\":false,\"Root\":\"\",\"ShimCgroup\":\"\"},\"privileged_without_host_devices\":false,\"privileged_without_host_devices_all_devices_allowed\":false,\"baseRuntimeSpec\":\"\",\"cniConfDir\":\"\",\"cniMaxConfNum\":0,\"snapshotter\":\"\",\"sandboxer\":\"podsandbox\",\"io_type\":\"\"}},\"ignoreBlockIONotEnabledErrors\":false,\"ignoreRdtNotEnabledErrors\":false},\"cni\":{\"binDir\":\"/opt/cni/bin\",\"confDir\":\"/etc/cni/net.d\",\"maxConfNum\":1,\"setupSerially\":false,\"confTemplate\":\"\",\"ipPref\":\"\",\"useInternalLoopback\":false},\"enableSelinux\":false,\"selinuxCategoryRange\":1024,\"maxContainerLogSize\":16384,\"disableCgroup\":false,\"disableApparmor\":false,\"restrictOOMScoreAdj\":false,\"disableProcMount\":false,\"unsetSeccompProfile\":\"\",\"tolerateMissingHugetlbController\":true,\"disableHugetlbController\":true,\"device_ownership_from_security_context\":false,\"ignoreImageDefinedVolumes\":false,\"netnsMountsUnderStateDir\":false,\"enableUnprivilegedPorts\":true,\"enableUnprivilegedICMP\":true,\"enableCDI\":true,\"cdiSpecDirs\":[\"/etc/cdi\",\"/var/run/cdi\"],\"drainExecSyncIOTimeout\":\"0s\",\"ignoreDeprecationWarnings\":null,\"containerdRootDir\":\"/var/lib/containerd-test\",\"containerdEndpoint\":\"/run/containerd-test/containerd.sock\",\"rootDir\":\"/var/lib/containerd-test/io.containerd.grpc.v1.cri\",\"stateDir\":\"/run/containerd-test/io.containerd.grpc.v1.cri\"}"
time="2024-09-03T23:40:29.083004528Z" level=info msg="loading plugin" id=io.containerd.sandbox.controller.v1.podsandbox type=io.containerd.sandbox.controller.v1
time="2024-09-03T23:40:29.083091738Z" level=info msg="loading plugin" id=io.containerd.sandbox.controller.v1.shim type=io.containerd.sandbox.controller.v1
time="2024-09-03T23:40:29.083230328Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.sandbox-controllers type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083258978Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.sandboxes type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083272687Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.snapshots type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083281527Z" level=info msg="loading plugin" id=io.containerd.streaming.v1.manager type=io.containerd.streaming.v1
time="2024-09-03T23:40:29.083292487Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.streaming type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083300758Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.tasks type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083309398Z" level=info msg="loading plugin" id=io.containerd.transfer.v1.local type=io.containerd.transfer.v1
time="2024-09-03T23:40:29.083331558Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.transfer type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083341178Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.version type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083351537Z" level=info msg="loading plugin" id=io.containerd.monitor.container.v1.restart type=io.containerd.monitor.container.v1
time="2024-09-03T23:40:29.083398968Z" level=info msg="loading plugin" id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
time="2024-09-03T23:40:29.083410917Z" level=info msg="skip loading plugin" error="skip plugin: tracing endpoint not configured" id=io.containerd.tracing.processor.v1.otlp type=io.containerd.tracing.processor.v1
time="2024-09-03T23:40:29.083417708Z" level=info msg="loading plugin" id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
time="2024-09-03T23:40:29.083425818Z" level=info msg="skip loading plugin" error="skip plugin: tracing endpoint not configured" id=io.containerd.internal.v1.tracing type=io.containerd.internal.v1
time="2024-09-03T23:40:29.083431908Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.healthcheck type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083440558Z" level=info msg="loading plugin" id=io.containerd.nri.v1.nri type=io.containerd.nri.v1
time="2024-09-03T23:40:29.083451987Z" level=info msg="runtime interface created"
time="2024-09-03T23:40:29.083456018Z" level=info msg="created NRI interface"
time="2024-09-03T23:40:29.083470567Z" level=info msg="loading plugin" id=io.containerd.grpc.v1.cri type=io.containerd.grpc.v1
time="2024-09-03T23:40:29.083488127Z" level=info msg="Connect containerd service"
time="2024-09-03T23:40:29.083662607Z" level=info msg="using experimental NRI integration - disable nri plugin to prevent this"
time="2024-09-03T23:40:29.084470628Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
time="2024-09-03T23:40:29.096181486Z" level=debug msg="runtime \"runc\" supports recursive read-only mounts"
time="2024-09-03T23:40:29.096210966Z" level=debug msg="runtime \"runc\" supports CRI userns: false"
time="2024-09-03T23:40:29.096385256Z" level=info msg="Start subscribing containerd event"
time="2024-09-03T23:40:29.096874226Z" level=info msg=serving... address=/run/containerd-test/containerd.sock.ttrpc
time="2024-09-03T23:40:29.096900866Z" level=info msg="Start recovering state"
time="2024-09-03T23:40:29.097190006Z" level=info msg="Start event monitor"
time="2024-09-03T23:40:29.097266246Z" level=info msg="Start cni network conf syncer for default"
time="2024-09-03T23:40:29.097281266Z" level=info msg="Start streaming server"
time="2024-09-03T23:40:29.097394796Z" level=info msg="Registered namespace \"k8s.io\" with NRI"
time="2024-09-03T23:40:29.097745326Z" level=info msg="runtime interface starting up..."
time="2024-09-03T23:40:29.096979176Z" level=info msg=serving... address=/run/containerd-test/containerd.sock
time="2024-09-03T23:40:29.097970266Z" level=info msg="starting plugins..."
time="2024-09-03T23:40:29.098171696Z" level=debug msg="sd notification" notified=false state="READY=1"
time="2024-09-03T23:40:29.098196976Z" level=info msg="containerd successfully booted in 0.164771s"
time="2024-09-03T23:40:29.825142583Z" level=debug msg="(*service).Write started" expected="sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c" ref="index-sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c" total=2295
time="2024-09-03T23:40:29.947087188Z" level=debug msg="(*service).Write started" expected="sha256:907ca53d7e2947e849b839b1cd258c98fd3916c60f2e6e70c30edbf741ab6754" ref="manifest-sha256:907ca53d7e2947e849b839b1cd258c98fd3916c60f2e6e70c30edbf741ab6754" total=528
time="2024-09-03T23:40:30.062308946Z" level=debug msg="(*service).Write started" expected="sha256:66ba00ad3de8677a3fa4bc4ea0fc46ebca0f14db46ca365e7f60833068dd0148" ref="config-sha256:66ba00ad3de8677a3fa4bc4ea0fc46ebca0f14db46ca365e7f60833068dd0148" total=1457
time="2024-09-03T23:40:30.283400659Z" level=debug msg="stat snapshot" key="sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2"
time="2024-09-03T23:40:30.284225090Z" level=debug msg="prepare snapshot" key="extract-283710330-DfmZ sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2" parent=
time="2024-09-03T23:40:30.296448668Z" level=debug msg="(*service).Write started" expected="sha256:205dae5015e78dd8c4d302e3db4eb31576fac715b46d099fe09680ba28093a7a" ref="layer-sha256:205dae5015e78dd8c4d302e3db4eb31576fac715b46d099fe09680ba28093a7a" total=2592863
time="2024-09-03T23:40:30.595056294Z" level=debug msg=" not found" error="exec: \"igzip\": executable file not found in $PATH"
time="2024-09-03T23:40:30.595146584Z" level=debug msg=" not found" error="exec: \"unpigz\": executable file not found in $PATH"
time="2024-09-03T23:40:30.664698076Z" level=debug msg="diff applied" d=69.820302ms digest="sha256:205dae5015e78dd8c4d302e3db4eb31576fac715b46d099fe09680ba28093a7a" media=application/vnd.docker.image.rootfs.diff.tar.gzip size=2592863
time="2024-09-03T23:40:30.665917956Z" level=debug msg="commit snapshot" key="extract-283710330-DfmZ sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2" name="sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2"
time="2024-09-03T23:40:30.681143974Z" level=debug msg="create image" name="ghcr.io/containerd/busybox:1.36" target="sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c"
time="2024-09-03T23:40:30.683020434Z" level=debug msg="Received containerd event timestamp - 2024-09-03 23:40:30.682829134 +0000 UTC, namespace - \"testing\", topic - \"/images/create\""
time="2024-09-03T23:40:30.683077154Z" level=debug msg="Ignoring events in namespace - \"testing\""
time="2024-09-03T23:40:30.699116212Z" level=debug msg="prepare snapshot" key=TestIssue10589 parent="sha256:b64792c17e4ad443d16b218afb3a8f5d03ca0f4ec49b11c1a7aebe17f6c3c1d2"
time="2024-09-03T23:40:30.713687280Z" level=debug msg="get snapshot mounts" key=TestIssue10589
time="2024-09-03T23:40:30.723742459Z" level=debug msg="get snapshot mounts" key=TestIssue10589
time="2024-09-03T23:40:30.734096348Z" level=info msg="connecting to shim TestIssue10589" address="unix:///run/containerd/s/b409d80ab8498366dd90ee591ff37adaa8bfe4302953978e2801b6158cc41403" namespace=testing protocol=ttrpc version=3
time="2024-09-03T23:40:30.738588157Z" level=debug msg="loading plugin" id=io.containerd.event.v1.publisher runtime=io.containerd.runc.v2 type=io.containerd.event.v1
time="2024-09-03T23:40:30.738642117Z" level=debug msg="loading plugin" id=io.containerd.internal.v1.shutdown runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
time="2024-09-03T23:40:30.738651857Z" level=debug msg="loading plugin" id=io.containerd.ttrpc.v1.task runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2024-09-03T23:40:30.738735987Z" level=debug msg="registering ttrpc service" id=io.containerd.ttrpc.v1.task runtime=io.containerd.runc.v2
time="2024-09-03T23:40:30.738895507Z" level=debug msg="serving api on socket" socket="[inherited from parent]"
time="2024-09-03T23:40:30.738923027Z" level=debug msg="serving api on socket" socket="[inherited from parent]"
time="2024-09-03T23:40:30.738933557Z" level=debug msg="starting signal loop" namespace=testing path=/run/containerd-test/io.containerd.runtime.v2.task/testing/TestIssue10589 pid=152002 runtime=io.containerd.runc.v2
time="2024-09-03T23:40:35.878313904Z" level=debug msg="received signal" signal=terminated
time="2024-09-03T23:40:35.878395314Z" level=debug msg="sd notification" notified=false state="STOPPING=1"
time="2024-09-03T23:40:35.878512814Z" level=info msg="Stop CRI service"
time="2024-09-03T23:40:35.878631224Z" level=debug msg="cni watcher channel is closed"
time="2024-09-03T23:40:35.878653624Z" level=info msg="Stop CRI service"
time="2024-09-03T23:40:35.878716394Z" level=info msg="Event monitor stopped"
time="2024-09-03T23:40:35.878749054Z" level=info msg="Stream server stopped"

exit status 1
FAIL	github.com/containerd/containerd/v2/integration/client	7.002s
make: *** [Makefile:218: integration] Error 1

@laurazard
Member

The latest in #10651 might've been bad, sorry! I made some changes/pushed what was missing, but I hadn't tried the no-timeout code yet so it might be broken somewhere. I'll take a better look tomorrow.

@samuelkarp
Member Author

This PR will remain WIP until #10651 is merged, at which point I will rebase it and mark it ready. Please feel free to review it now, however.

Signed-off-by: Samuel Karp <samuelkarp@google.com>
@samuelkarp
Member Author

Rebased now that #10651 merged.

@samuelkarp samuelkarp marked this pull request as ready for review September 5, 2024 19:49
integration/failpoint/const.go — review thread (outdated, resolved)
@samuelkarp
Member Author

The latest push should fix the issue (it was caused by mixing defer and t.Cleanup in the test's teardown).

This issue was caused by a race between init exits and new exec process
tracking inside the shim. The test operates by controlling the time
between when the shim invokes "runc exec" and when the actual "runc
exec" is triggered. This allows validating that races for shim state
tracking between pre- and post-start of the exec process do not exist.

Relates to containerd#10589

Signed-off-by: Samuel Karp <samuelkarp@google.com>
Member

@laurazard laurazard left a comment


LGTM

@dims dims added this pull request to the merge queue Sep 6, 2024
Merged via the queue into containerd:main with commit e013322 Sep 6, 2024
52 checks passed
@samuelkarp samuelkarp added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch labels Sep 6, 2024
@redcurtain

redcurtain commented Nov 15, 2024

@samuelkarp Sorry to bother you, but the test still hangs for me when exec2 fails to start. Could you help take a look?
containerd version 1.7.14

sudo -E PATH=$PATH make integration EXTRA_TESTFLAGS='-run TestIssue10589'
+ integration
INFO[0000] Using the following image list: {Alpine:ghcr.io/containerd/alpine:3.14.0 BusyBox:mirrors.sangfor.com/busybox:1.36 Pause:registry.k8s.io/pause:3.8 ResourceConsumer:registry.k8s.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.1 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1 ArgsEscaped:cplatpublic.azurecr.io/args-escaped-test-image-ns:1.0 DockerSchema1:registry.k8s.io/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff}
INFO[0000] running tests against containerd              revision=212e8b6fa2f44b9c21b2798135fc6fb7c53efc16 runtime= snapshotter= version=v1.6.4
INFO[0000] start to pull seed image                      image="mirrors.sangfor.com/busybox:1.36"
=== RUN   TestIssue10589
    container_linux_test.go:1645: 1. Create a sleeping container
    log_hook.go:47: time="2024-11-15T11:35:25.350129380+08:00" level=debug msg="remote introspection plugin filters" func="introspection.(*introspectionRemote).Plugins" file="/home/dht/containerd/containerd-1.7.14/services/introspection/introspection.go:46" filters="[type==io.containerd.snapshotter.v1, id==overlayfs]" testcase=TestIssue10589
    container_linux_test.go:1688: 2. Create exec1
    container_linux_test.go:1713: 3. Wait until exec1 is waiting to start
    container_linux_test.go:1706: Starting exec1
    container_linux_test.go:1718: 4. Kill the container init process
    container_linux_test.go:1720: Killing main pid (19680) of container TestIssue10589
    container_linux_test.go:1724: container status running
    container_linux_test.go:1727: 5. Create exec2
    container_linux_test.go:1753: 6. Wait until exec2 is waiting to start
    container_linux_test.go:1746: Starting exec2

panic: test timed out after 10m0s
running tests:
        TestIssue10589 (10m0s)

goroutine 74 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/go/src/testing/testing.go:2259 +0x259
created by time.goFunc
        /usr/local/go/src/time/sleep.go:176 +0x45

goroutine 1 [chan receive, 10 minutes]:
testing.(*T).Run(0xc0004a89c0, {0x1b61134, 0xe}, 0x1bba938)
        /usr/local/go/src/testing/testing.go:1649 +0x871
testing.runTests.func1(0x0?)
        /usr/local/go/src/testing/testing.go:2054 +0x85
testing.tRunner(0xc0004a89c0, 0xc0001dd568)
        /usr/local/go/src/testing/testing.go:1595 +0x262
testing.runTests(0xc000423860?, {0x26831c0, 0x67, 0x67}, {0x3648980?, 0x1d48a70?, 0x26e1820?})
        /usr/local/go/src/testing/testing.go:2052 +0x8ae
testing.(*M).Run(0xc000423860)
        /usr/local/go/src/testing/testing.go:1925 +0xcd8
github.com/containerd/containerd/integration/client.TestMain(0xe0272dbb69d1ea2c?)
        /home/dht/containerd/containerd-1.7.14/integration/client/client_test.go:164 +0x13b5
main.main()
        _testmain.go:257 +0x308

goroutine 6 [IO wait, 10 minutes]:
internal/poll.runtime_pollWait(0x155461f34eb0, 0x72)
        /usr/local/go/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc00007ef80, 0xc000490000?, 0x1)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0xb1
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00007ef60, {0xc000490000, 0x8000, 0x8000})
        /usr/local/go/src/internal/poll/fd_unix.go:164 +0x405
os.(*File).read(...)
        /usr/local/go/src/os/file_posix.go:29
os.(*File).Read(0xc00005e4f0, {0xc000490000, 0x8000, 0x8000})
        /usr/local/go/src/os/file.go:118 +0xad
io.copyBuffer({0x1d42f40, 0xc000013020}, {0x1d42d60, 0xc00005e4f0}, {0x0, 0x0, 0x0})
        /usr/local/go/src/io/io.go:430 +0x27e
io.Copy(...)
        /usr/local/go/src/io/io.go:389
os/exec.(*Cmd).writerDescriptor.func1()
        /usr/local/go/src/os/exec/exec.go:560 +0x5b
os/exec.(*Cmd).Start.func2(0xc000478140)
        /usr/local/go/src/os/exec/exec.go:717 +0x3d
created by os/exec.(*Cmd).Start in goroutine 1
        /usr/local/go/src/os/exec/exec.go:716 +0x101a

goroutine 66 [select, 10 minutes]:
github.com/containerd/containerd/integration/client.TestIssue10589(0xc0004a8b60)
        /home/dht/containerd/containerd-1.7.14/integration/client/container_linux_test.go:1759 +0x2578
testing.tRunner(0xc0004a8b60, 0x1bba938)
        /usr/local/go/src/testing/testing.go:1595 +0x262
created by testing.(*T).Run in goroutine 1
        /usr/local/go/src/testing/testing.go:1648 +0x846

goroutine 41 [select, 10 minutes]:
google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc0003acc60)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/transport.go:327 +0x114
google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/transport.go:342
google.golang.org/grpc.(*csAttempt).recvMsg(0xc000383380, {0x1a98ba0?, 0xc0003af380}, 0x0)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:1070 +0x16f
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc0003acad4?)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:927 +0x51
google.golang.org/grpc.(*clientStream).withRetry(0xc0003aca20, 0xc000413bf0, 0xc000413be0)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:776 +0x198
google.golang.org/grpc.(*clientStream).RecvMsg(0xc0003aca20, {0x1a98ba0?, 0xc0003af380})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:926 +0x1f6
google.golang.org/grpc.invoke({0x1d4dfd0, 0xc0003959b0}, {0x1b845ef, 0x29}, {0x1ac0de0, 0xc0003976d0}, {0x1a98ba0, 0xc0003af380}, 0x155462bc2f18?, {0xc000046620, ...})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/call.go:73 +0x13a
google.golang.org/grpc.(*ClientConn).Invoke(0xc000202400, {0x1d4dfd0, 0xc0003959b0}, {0x1b845ef, 0x29}, {0x1ac0de0, 0xc0003976d0}, {0x1a98ba0, 0xc0003af380}, {0x0, ...})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/call.go:37 +0x2f1
github.com/containerd/containerd/api/services/tasks/v1.(*tasksClient).Start(0xc00037a570, {0x1d4dfd0, 0xc0003959b0}, 0xc0001c7100?, {0x0, 0x0, 0x0})
        /home/dht/containerd/containerd-1.7.14/api/services/tasks/v1/tasks_grpc.pb.go:68 +0xee
github.com/containerd/containerd.(*process).Start(0xc0003af350, {0x1d4dfd0, 0xc0003959b0})
        /home/dht/containerd/containerd-1.7.14/process.go:121 +0x242
github.com/containerd/containerd/integration/client.TestIssue10589.func5()
        /home/dht/containerd/containerd-1.7.14/integration/client/container_linux_test.go:1747 +0x148
created by github.com/containerd/containerd/integration/client.TestIssue10589 in goroutine 66
        /home/dht/containerd/containerd-1.7.14/integration/client/container_linux_test.go:1744 +0x23d4

goroutine 42 [syscall, 10 minutes]:
syscall.Syscall6(0x9423a9?, 0xc0004b8e00?, 0xf4a07a?, 0xc000534840?, 0x32?, 0xc0000acd80?, 0xffffffffffffffff?)
        /usr/local/go/src/syscall/syscall_linux.go:91 +0x3d
golang.org/x/sys/unix.openat(0x942855?, {0xc0000acd80, 0x32}, 0x9423a9?, 0x0)
        /root/go/pkg/mod/golang.org/x/sys@v0.13.0/unix/zsyscall_linux.go:87 +0xd3
golang.org/x/sys/unix.Open(...)
        /root/go/pkg/mod/golang.org/x/sys@v0.13.0/unix/syscall_linux.go:109
github.com/containerd/containerd/pkg/fifosync.(*fifo).Wait(0xc000045c50)
        /home/dht/containerd/containerd-1.7.14/pkg/fifosync/fifo_unix.go:113 +0x6c
github.com/containerd/containerd/integration/client.TestIssue10589.func6()
        /home/dht/containerd/containerd-1.7.14/integration/client/container_linux_test.go:1756 +0x43
created by github.com/containerd/containerd/integration/client.TestIssue10589 in goroutine 66
        /home/dht/containerd/containerd-1.7.14/integration/client/container_linux_test.go:1755 +0x250b

goroutine 67 [select, 10 minutes]:
google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0xc000046660, {0x1d4e008, 0xc000096eb0})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/grpcsync/callback_serializer.go:83 +0x1e9
created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer in goroutine 66
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/grpcsync/callback_serializer.go:55 +0x24a

goroutine 85 [select, 10 minutes]:
google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00023a7e0)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/transport.go:327 +0x114
google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/transport.go:342
google.golang.org/grpc.(*csAttempt).recvMsg(0xc000583860, {0x1a98ba0?, 0xc000395cb0}, 0x0)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:1070 +0x16f
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc00023a654?)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:927 +0x51
google.golang.org/grpc.(*clientStream).withRetry(0xc00023a5a0, 0xc000125bf0, 0xc000125be0)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:776 +0x198
google.golang.org/grpc.(*clientStream).RecvMsg(0xc00023a5a0, {0x1a98ba0?, 0xc000395cb0})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/stream.go:926 +0x1f6
google.golang.org/grpc.invoke({0x1d4dfd0, 0xc0003959b0}, {0x1b845ef, 0x29}, {0x1ac0de0, 0xc0004187d0}, {0x1a98ba0, 0xc000395cb0}, 0x155462bc2108?, {0xc000046620, ...})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/call.go:73 +0x13a
google.golang.org/grpc.(*ClientConn).Invoke(0xc000202400, {0x1d4dfd0, 0xc0003959b0}, {0x1b845ef, 0x29}, {0x1ac0de0, 0xc0004187d0}, {0x1a98ba0, 0xc000395cb0}, {0x0, ...})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/call.go:37 +0x2f1
github.com/containerd/containerd/api/services/tasks/v1.(*tasksClient).Start(0xc000045b30, {0x1d4dfd0, 0xc0003959b0}, 0xc0000acb80?, {0x0, 0x0, 0x0})
        /home/dht/containerd/containerd-1.7.14/api/services/tasks/v1/tasks_grpc.pb.go:68 +0xee
github.com/containerd/containerd.(*process).Start(0xc0002e6240, {0x1d4dfd0, 0xc0003959b0})
        /home/dht/containerd/containerd-1.7.14/process.go:121 +0x242
github.com/containerd/containerd/integration/client.TestIssue10589.func4()
        /home/dht/containerd/containerd-1.7.14/integration/client/container_linux_test.go:1707 +0x148
created by github.com/containerd/containerd/integration/client.TestIssue10589 in goroutine 66
        /home/dht/containerd/containerd-1.7.14/integration/client/container_linux_test.go:1704 +0x1734

goroutine 83 [IO wait, 10 minutes]:
internal/poll.runtime_pollWait(0x155461f34db8, 0x72)
        /usr/local/go/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc0000b57a0, 0xc0003e8000?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0xb1
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000b5780, {0xc0003e8000, 0x8000, 0x8000})
        /usr/local/go/src/internal/poll/fd_unix.go:164 +0x405
net.(*netFD).Read(0xc0000b5780, {0xc0003e8000, 0x8000, 0x8000})
        /usr/local/go/src/net/fd_posix.go:55 +0x4b
net.(*conn).Read(0xc000522188, {0xc0003e8000, 0x8000, 0x8000})
        /usr/local/go/src/net/net.go:179 +0xad
bufio.(*Reader).Read(0xc00009ac60, {0xc0002a0040, 0x9, 0x9})
        /usr/local/go/src/bufio/bufio.go:244 +0x4be
io.ReadAtLeast({0x1d43220, 0xc00009ac60}, {0xc0002a0040, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:335 +0xd0
io.ReadFull(...)
        /usr/local/go/src/io/io.go:354
golang.org/x/net/http2.readFrameHeader({0xc0002a0040, 0x9, 0x9}, {0x1d43220, 0xc00009ac60})
        /root/go/pkg/mod/golang.org/x/net@v0.17.0/http2/frame.go:237 +0x9b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0002a0000)
        /root/go/pkg/mod/golang.org/x/net@v0.17.0/http2/frame.go:498 +0xf5
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc000324900, 0x1162f94?)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/http2_client.go:1594 +0x2e5
created by google.golang.org/grpc/internal/transport.newHTTP2Client in goroutine 70
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/http2_client.go:397 +0x30f6

goroutine 69 [select, 10 minutes]:
google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0xc0000467a0, {0x1d4e008, 0xc000096f50})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/grpcsync/callback_serializer.go:83 +0x1e9
created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer in goroutine 66
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/grpcsync/callback_serializer.go:55 +0x24a

goroutine 68 [select, 10 minutes]:
google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0xc000046740, {0x1d4e008, 0xc000096f00})
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/grpcsync/callback_serializer.go:83 +0x1e9
created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer in goroutine 66
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/grpcsync/callback_serializer.go:55 +0x24a

goroutine 84 [select, 10 minutes]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc000548ff0, 0x1)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/controlbuf.go:418 +0x1af
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc0000b6cb0)
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/controlbuf.go:552 +0x14f
google.golang.org/grpc/internal/transport.newHTTP2Client.func6()
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/http2_client.go:451 +0x168
created by google.golang.org/grpc/internal/transport.newHTTP2Client in goroutine 70
        /root/go/pkg/mod/google.golang.org/grpc@v1.58.3/internal/transport/http2_client.go:449 +0x3a6b
exit status 2
FAIL    github.com/containerd/containerd/integration/client     601.594s
make: *** [integration] Error 1

@samuelkarp
Copy link
Member Author

@redcurtain

containerd version 1.7.14
version=v1.6.4

https://github.com/containerd/containerd/releases/tag/v1.7.22 and https://github.com/containerd/containerd/releases/tag/v1.6.36 contain the fix for this issue.

@redcurtain
Copy link

@redcurtain

containerd version 1.7.14
version=v1.6.4

https://github.com/containerd/containerd/releases/tag/v1.7.22 and https://github.com/containerd/containerd/releases/tag/v1.6.36 contain the fix for this issue.

@samuelkarp
I don't understand clearly. I expected to reproduce the issue with my containerd version (1.7.14),
and then merge the patch from https://github.com/containerd/containerd/releases/tag/v1.7.22.
But I hit a block when reproducing with containerd v1.7.14.

Do you mean that I should get the test from v1.6.36 or v1.7.22?

@samuelkarp
Copy link
Member Author

samuelkarp commented Nov 15, 2024

reproduce the issue with my containerd version(1.7.14).

The test is expected to fail (and can block) on versions prior to 1.7.22. The test should pass on 1.7.22.

@redcurtain
Copy link

redcurtain commented Nov 16, 2024

reproduce the issue with my containerd version(1.7.14).

The test is expected to fail (and can block) on versions prior to 1.7.22. The test should pass on 1.7.22.

@samuelkarp Thanks for the reply.
The test passes with 1.7.22, but I still get the same block after merging PR #10675 onto 1.7.14.
I will look into it more.

@linsite
Copy link

linsite commented Nov 20, 2024

I see the descriptions in the issue and PR are a little obscure. Let me explain the issue a bit more for those who are interested; please correct me if I'm wrong.

The code I dive into is 1.7.21.

  1. The reason a defunct task exists is that an exit event goes missing in containerd-shim-runc-v2.
  2. Why is the event missing? 'cleanup' in preStart, which removes the subscriber from s.exitSubscribers, is called before s.processExits can see the event.

So that's the race: handleStarted produces the event, s.processExits is the consumer, and cleanup is the cleaner. They share a lock; if cleanup gets the lock before s.processExits, the issue occurs.
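To make that ordering concrete, here is a hypothetical miniature of the producer/consumer/cleanup pattern described above (the names `subscribers`, `produce`, and `cleanup` are illustrative, not the shim's actual code): if the cleanup path takes the lock and unsubscribes before the exit is delivered, the event is silently dropped.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	// One registered subscriber waiting for exit events (illustrative).
	subscribers := map[int][]string{1: {}}

	// Producer: deliver an exit event to subscriber 1 if still registered.
	produce := func() {
		mu.Lock()
		defer mu.Unlock()
		if _, ok := subscribers[1]; ok {
			subscribers[1] = append(subscribers[1], "init-exit")
		}
	}

	// Cleanup: unsubscribe, discarding anything not yet consumed.
	cleanup := func() {
		mu.Lock()
		defer mu.Unlock()
		delete(subscribers, 1)
	}

	// The race: cleanup wins the lock first, so the exit is never recorded.
	cleanup()
	produce()

	mu.Lock()
	defer mu.Unlock()
	fmt.Println("events delivered:", len(subscribers[1]))
}
```

Running this prints "events delivered: 0" — the shape of the lost-exit outcome, under the simplifying assumptions above.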

@ningmingxiao
Copy link
Contributor

do you have some ways to reproduce @linsite ?

@ningmingxiao
Copy link
Contributor

cleanup runs after handleStarted:

	handleStarted = func(c *runc.Container, p process.Process) {
		var pid int
		if p != nil {
			pid = p.Pid()
		}
		s.lifecycleMu.Lock()
		ees, exited := exits[pid]
		delete(s.exitSubscribers, &exits)
		exits = nil
		if pid == 0 || exited {
			s.lifecycleMu.Unlock()
			for _, ee := range ees {
				s.handleProcessExit(ee, c, p)
			}
		} else {
			// Process start was successful, add to `s.running`.
			s.running[pid] = append(s.running[pid], containerProcess{
				Container: c,
				Process:   p,
			})
			s.lifecycleMu.Unlock()
		}
	}

handleStarted will make sure exits = nil

	cleanup = func() {
		if exits != nil {
			s.lifecycleMu.Lock()
			defer s.lifecycleMu.Unlock()
			delete(s.exitSubscribers, &exits)
		}
	}

So delete(s.exitSubscribers, &exits) in cleanup may never run?
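A tiny sketch of that observation (simplified, with hypothetical names): once handleStarted has unsubscribed and cleared the shared `exits` variable, a later cleanup sees it cleared and its delete never executes — which is harmless, because the unsubscribe already happened.

```go
package main

import "fmt"

func main() {
	// Stand-in for s.exitSubscribers; the key is the subscription handle.
	exitSubscribers := map[string]bool{"sub": true}
	exits := "sub" // stand-in for the *map registered as a subscriber

	handleStarted := func() {
		delete(exitSubscribers, exits)
		exits = "" // handleStarted guarantees exits is cleared
	}
	cleanup := func() {
		if exits != "" {
			delete(exitSubscribers, exits)
		}
	}

	handleStarted()
	cleanup() // no-op: exits was already cleared by handleStarted
	fmt.Println("still subscribed:", exitSubscribers["sub"])
}
```

This prints "still subscribed: false": cleanup's delete is skipped but the subscription is already gone.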

@linsite
Copy link

linsite commented Nov 21, 2024

cleanup runs after handleStarted:

	handleStarted = func(c *runc.Container, p process.Process) {
		var pid int
		if p != nil {
			pid = p.Pid()
		}
		s.lifecycleMu.Lock()
		ees, exited := exits[pid]
		delete(s.exitSubscribers, &exits)
		exits = nil
		if pid == 0 || exited {
			s.lifecycleMu.Unlock()
			for _, ee := range ees {
				s.handleProcessExit(ee, c, p)
			}
		} else {
			// Process start was successful, add to `s.running`.
			s.running[pid] = append(s.running[pid], containerProcess{
				Container: c,
				Process:   p,
			})
			s.lifecycleMu.Unlock()
		}
	}

handleStarted will make sure exits = nil

	cleanup = func() {
		if exits != nil {
			s.lifecycleMu.Lock()
			defer s.lifecycleMu.Unlock()
			delete(s.exitSubscribers, &exits)
		}
	}

So delete(s.exitSubscribers, &exits) in cleanup may never run?

Yes, you're right; my previous analysis was wrong. The real problem is that concurrent exec calls can make s.pendingExecs nonzero; when the init process exits, it gets skipped in the processExits func, so no TaskExit event is ever sent.

if init && s.pendingExecs[cp.Container] != 0 {
	skipped = append(skipped, cp)
} else {
	cps = append(cps, cp)
}
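A self-contained sketch of why that skip can lose the event (the `pendingExecs` map, `exit` struct, and `handleExit` names here are illustrative, not the shim's real types): if the deferred init exit is never re-delivered after the pending execs settle, no TaskExit is ever emitted.

```go
package main

import "fmt"

type exit struct {
	container string
	init      bool
}

func main() {
	// Two "runc exec"s still in flight for this container (illustrative).
	pendingExecs := map[string]int{"ctr": 2}
	var emitted []string

	// Mirrors the quoted branch: init exits are deferred while execs are pending.
	handleExit := func(e exit) (skipped bool) {
		if e.init && pendingExecs[e.container] != 0 {
			return true // deferred; must be retried once the execs settle
		}
		emitted = append(emitted, "TaskExit:"+e.container)
		return false
	}

	skipped := handleExit(exit{container: "ctr", init: true})
	// If nothing ever retries the skipped exit, the TaskExit event is lost.
	fmt.Printf("skipped=%v, TaskExit events emitted=%d\n", skipped, len(emitted))
}
```

This prints "skipped=true, TaskExit events emitted=0", matching the missing-TaskExit symptom described above under these assumptions.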

The failing-case log from 1.7.21 (with logging added in the s.send func) validates the issue.

/tmp/323049941:113:time="2024-11-21T13:05:38.296583785+08:00" level=info msg="linlin Sending event: &events.TaskCreate{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), ContainerID:\"TestIssue10589\", Bundle:\"/run/containerd-test/io.containerd.runtime.v2.task/testing/TestIssue10589\", Rootfs:[]*types.Mount{(*types.Mount)(0xc000330850)}, IO:(*events.TaskIO)(0xc0002aa9c0), Checkpoint:\"\", Pid:0x9ea72}"
/tmp/323049941:121:time="2024-11-21T13:05:38.317175488+08:00" level=info msg="linlin Sending event: &events.TaskStart{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), ContainerID:\"TestIssue10589\", Pid:0x9ea72}"
/tmp/323049941:122:time="2024-11-21T13:05:38.322465889+08:00" level=info msg="linlin Sending event: &events.TaskExecAdded{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), ContainerID:\"TestIssue10589\", ExecID:\"exec1\"}"
/tmp/323049941:123:time="2024-11-21T13:05:38.334166390+08:00" level=info msg="linlin Sending event: &events.TaskExecAdded{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), ContainerID:\"TestIssue10589\", ExecID:\"exec2\"}"

@linsite
Copy link

linsite commented Nov 21, 2024

We've noticed an odd thing. This test case fails 100% of the time against 1.7.21 code on a physical PC with a 2.8 GHz CPU, but never fails on any virtual machine with a 2.6 or 2.2 GHz vCPU. The OS is Ubuntu 18.04 or 20.04; the Go version is 1.22.

@ningmingxiao
Copy link
Contributor

We've noticed an odd thing. This test case fails 100% of the time against 1.7.21 code on a physical PC with a 2.8 GHz CPU, but never fails on any virtual machine with a 2.6 or 2.2 GHz vCPU. The OS is Ubuntu 18.04 or 20.04; the Go version is 1.22.

1.7.22 fixes this; you can try PR #10675.

@linsite
Copy link

linsite commented Nov 21, 2024

Thanks, :) I know the fix. All I want is to reproduce the issue and figure out the root cause.

FYI, after some study, I found that adding a time.Sleep(50 * time.Millisecond) after the 4th step triggers the issue 100% of the time in 1.7.21. That ensures the init process exits while s.pendingExecs[cp.Container] is nonzero.

Best wishes.

@wyike
Copy link

wyike commented Dec 7, 2024

We observed a very similar phenomenon with containerd v1.6.33. It happens while running "kubectl drain node".

I only have part of the logs at hand. From the log, the node drain hangs for hours and one pod/container keeps failing to be killed. It seems there is a race between init exits and the exec process: neither the PreStop hook nor the container exit can ever finish.

./kubelet/drain.stdout.log:7:evicting pod tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0
./kubelet/kubelet.stderr.log:189151:E1125 13:25:04.871098    9713 kuberuntime_container.go:747] "Container termination failed with gracePeriod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4 containerName="monitor" containerID="containerd://19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409" gracePeriod=10
./kubelet/kubelet.stderr.log:189152:E1125 13:25:04.871150    9713 kuberuntime_container.go:772] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4 containerName="monitor" containerID={Type:containerd ID:19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409}
./kubelet/kubelet.stderr.log:189517:E1125 13:27:04.871591    9713 pod_workers.go:1294] "Error syncing pod, skipping" err="[failed to \"KillContainer\" for \"monitor\" with KillContainerError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\", failed to \"KillPodSandbox\" for \"9c55ca22-eb20-44ae-a90f-3db4024d3cb4\" with KillPodSandboxError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"]" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4
./kubelet/kubelet.stderr.log:189519:E1125 13:27:05.061383    9713 kuberuntime_container.go:634] "PreStop hook failed" err="rpc error: code = Unknown desc = failed to exec in container: failed to start exec \"ba8029c2fda9bb0a1cc5fbb4fcb3e58a391b44ad16073e6bbc1345bc2364de15\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4 containerName="monitor" containerID="containerd://19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409"
...
./kubelet/kubelet.stderr.log:202464:E1125 14:38:03.973276    9713 kuberuntime_container.go:634] "PreStop hook failed" err="rpc error: code = Unknown desc = failed to exec in container: failed to start exec \"86b0a50e384bef51dc6153a78bff53a52b31658b63859be227a6790d04915cae\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4 containerName="monitor" containerID="containerd://19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409"
kubelet/kubelet.stderr.log:189152:E1125 13:25:04.871150    9713 kuberuntime_container.go:772] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4 containerName="monitor" containerID={Type:containerd ID:19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409}
...
kubelet/kubelet.stderr.log:202097:E1125 14:36:03.459989    9713 kuberuntime_container.go:772] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409\" to be killed: wait container \"19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409\": context deadline exceeded" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4 containerName="monitor" containerID={Type:containerd ID:19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409}
kubelet/kubelet.stderr.log:202857:E1125 14:40:13.974144    9713 kuberuntime_container.go:772] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="tsmgr-3aac4723-0f75-4d1e-950f-6d738a30ebb7/postgres-db-monitor-0" podUID=9c55ca22-eb20-44ae-a90f-3db4024d3cb4 containerName="monitor" containerID={Type:containerd ID:19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409}

The containerd log also reports the same errors for hours:

containerd/containerd.stderr.log:1211:time="2024-11-25T13:22:54.871258928Z" level=info msg="Stop container \"19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409\" with signal terminated"
containerd/containerd.stderr.log:1568:time="2024-11-25T13:22:58.618693876Z" level=error msg="ExecSync for \"19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409\" failed" error="failed to exec in container: failed to start exec \"b3f29d2cd30300fda165f8bbb3594ffd127bd3178ff45567296f0820a82c39df\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"
...
containerd/containerd.stderr.log:16315:time="2024-11-25T14:40:16.388797736Z" level=error msg="ExecSync for \"19dff9cd5487c32c4f1bd5375e5eaa6392208fdb515a6bfb691a04b03bf28409\" failed" error="failed to exec in container: failed to start exec \"d0d70536a9f23370d07881b7dae74a67f2ffd016cd4e120c8bdeacb70cc464b1\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"

@ningmingxiao
Copy link
Contributor

Can you run ps auxf to show the process status (the containerd main process and the exec process),
plus ctr -n k8s.io task ls, and kill -10 <shim pid> (which will dump the shim's stacks into the containerd log)? @wyike

@wyike
Copy link

wyike commented Dec 11, 2024

Hi @ningmingxiao, unfortunately our customer upgraded all their clusters with a workaround, so we didn't get the chance to collect more of this data. Thanks for the guidance anyway!

Labels
area/runtime Runtime cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch kind/test size/L