-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core/mount: use ptrace instead of go:linkname #10611
Conversation
```bash $ go test -bench=. -benchmem ./ -exec sudo goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics BenchmarkBatchRunGetUsernsFD_Concurrent1-16 2398 532424 ns/op 1145 B/op 43 allocs/op BenchmarkBatchRunGetUsernsFD_Concurrent10-16 343 3701695 ns/op 11552 B/op 421 allocs/op PASS ok github.com/containerd/containerd/v2/core/mount 2.978s ``` Signed-off-by: Wei Fu <fuweid89@gmail.com>
Hmm. Never see this error before... REF: https://github.com/containerd/containerd/actions/runs/10454853534/job/28948642310?pr=10611 Updated:It's caused by golang/go#68984 |
50702d0
to
8c583f4
Compare
8c583f4
to
df00a75
Compare
ec69b27
to
a7ce13c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly LGTM, will review in detail later. Added two questions, but I think the answer is what I expect, so probably no changes needed for those things :)
// The usernsFD will hold the userns reference in kernel. Even if the | ||
// child process is reaped, the usernsFD is still valid. | ||
usernsFD, err := os.Open(fmt.Sprintf("/proc/%d/ns/user", proc.Pid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the kernel, do_exit()
from kernel/exit.c
is called when the process dies, which calls exit_task_namespaces()
to destroy the namespaces. So this only holds if we do the open()
before it is destroyed.
But IIUC this is what you mean, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this only holds if we do the open() before it is destroyed.
Yes.
Ptrace: true, | ||
Pdeathsig: syscall.SIGKILL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, SIGKILL does kill the process if it is using Ptrace: true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pdeathsig is used to kill the tracee if the parent process, like containerd, exits. So it won't cause leaky subprocess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that. My question is: if the process is waiting with Ptrace: true
, will it react to SIGKILL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. Yes. The tracee is in STOP status so kernel can kill it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left some nits, but this looks great to me :)
unix.ECHILD, werr) | ||
} | ||
|
||
// NOTE: The CLONE_PIDFD flag has been supported since Linux kernel v5.2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using this flag? Or maybe in a previous iteration? I can't find it now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't use this flag. I just left the comment. If kernel supports P_PIDFD in waitid, the kernel should support CLONE_PIDFD as well.
/ So assumption is that if waitid(2) supports P_PIDFD, current kernel should support CLONE_PIDFD as well.
return nil, fmt.Errorf("failed to start noop process for unshare: %w", err) | ||
} | ||
|
||
if pidfd == -1 || !sys.SupportsPidFD() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd check for pidfd support at the very beginning, and return an error if it doesn't. As you do in the go 1.23 file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check it in old kernel which doesn't support CLONE_PIDFD. If the kernel doesn't support CLONE_PIDFD, the os.StartProcess
should return error. If so, we don't need to check pidfd value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed that the os.StartProcess won't return error if the kernel doesn't support CLONE_PIDFD.
vagrant@vagrant:~$ uname -a
Linux vagrant 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
vagrant@vagrant:~$ sudo /vagrant/test -test.v -test.root -test.run TestIdmappedMount/IDMapMount
=== RUN TestIdmappedMount
=== RUN TestIdmappedMount/IDMapMount
pidfd = -1
mount_idmapped_linux_test.go:130:
Error Trace: /home/fuweid/go/src/github.com/containerd/containerd/core/mount/mount_idmapped_linux_test.go:130
Error: Received unexpected error:
failed to prevent pid reused issue because pidfd isn't supported
Test: TestIdmappedMount/IDMapMount
--- FAIL: TestIdmappedMount (0.00s)
--- FAIL: TestIdmappedMount/IDMapMount (0.00s)
FAIL
I applied the following patch based on this pull request.
diff --git a/core/mount/mount_idmapped_linux_test.go b/core/mount/mount_idmapped_linux_test.go
index 90e3a61e8..be802705a 100644
--- a/core/mount/mount_idmapped_linux_test.go
+++ b/core/mount/mount_idmapped_linux_test.go
@@ -24,7 +24,6 @@ import (
"syscall"
"testing"
- kernel "github.com/containerd/containerd/v2/pkg/kernelversion"
"github.com/containerd/continuity/testutil"
"github.com/stretchr/testify/require"
)
@@ -74,12 +73,14 @@ var (
func TestIdmappedMount(t *testing.T) {
testutil.RequiresRoot(t)
- k512 := kernel.KernelVersion{Kernel: 5, Major: 12}
- ok, err := kernel.GreaterEqualThan(k512)
- require.NoError(t, err)
- if !ok {
- t.Skip("GetUsernsFD requires kernel >= 5.12")
- }
+ /*
+ k512 := kernel.KernelVersion{Kernel: 5, Major: 12}
+ ok, err := kernel.GreaterEqualThan(k512)
+ require.NoError(t, err)
+ if !ok {
+ t.Skip("GetUsernsFD requires kernel >= 5.12")
+ }
+ */
t.Run("GetUsernsFD", testGetUsernsFD)
diff --git a/core/mount/mount_idmapped_utils_linux_go122.go b/core/mount/mount_idmapped_utils_linux_go122.go
index 21b1c81a3..a0b9546e4 100644
--- a/core/mount/mount_idmapped_utils_linux_go122.go
+++ b/core/mount/mount_idmapped_utils_linux_go122.go
@@ -48,7 +48,8 @@ func getUsernsFD(uidMaps, gidMaps []syscall.SysProcIDMap) (*os.File, error) {
return nil, fmt.Errorf("failed to start noop process for unshare: %w", err)
}
- if pidfd == -1 || !sys.SupportsPidFD() {
+ if pidfd == -1 {
+ fmt.Println("pidfd = ", pidfd)
proc.Kill()
proc.Wait()
return nil, fmt.Errorf("failed to prevent pid reused issue because pidfd isn't supported")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you do in the go 1.23 file
Since go1.23.0 has pidfd double-close issue, I check the SupportsPidFD
at the beginning.
But for the go1.22.0, I need to check pidfd
result. So I combine two condition checks together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh, I see now. Makse sense, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment
a7ce13c
to
9c91b2d
Compare
The Go runtime has started to [lock down future uses of linkname][1] since go1.23. In the go source code, containerd project has been marked in the comment, [hall of shame][2]. Well, the go:linkname is used to fork no-op subprocess efficiently. However, since that comment, I would like to use ptrace and remove go:linkname in the whole repository. With go1.22 `go:linkname`: ```bash $ go test -bench=. -benchmem ./ -exec sudo goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics BenchmarkBatchRunGetUsernsFD_Concurrent1-16 2440 533320 ns/op 1145 B/op 43 allocs/op BenchmarkBatchRunGetUsernsFD_Concurrent10-16 342 3661616 ns/op 11562 B/op 421 allocs/op PASS ok github.com/containerd/containerd/v2/core/mount 2.983s ``` With go1.22 `ptrace`: ```bash $ go test -bench=. -benchmem ./ -exec sudo goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics BenchmarkBatchRunGetUsernsFD_Concurrent1-16 1785 739557 ns/op 3948 B/op 68 allocs/op BenchmarkBatchRunGetUsernsFD_Concurrent10-16 328 4024300 ns/op 39601 B/op 671 allocs/op PASS ok github.com/containerd/containerd/v2/core/mount 3.104s ``` With go1.23 `ptrace`: ```bash $ go test -bench=. -benchmem ./ -exec sudo goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics BenchmarkBatchRunGetUsernsFD_Concurrent1-16 1815 723252 ns/op 4220 B/op 69 allocs/op BenchmarkBatchRunGetUsernsFD_Concurrent10-16 319 3957157 ns/op 42351 B/op 682 allocs/op PASS ok github.com/containerd/containerd/v2/core/mount 3.051s ``` Diff: The `ptrace` is slower than `go:linkname` mode. However, it's accepctable. ``` goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics │ go122-golinkname │ go122-ptrace │ go123-ptrace │ │ sec/op │ sec/op vs base │ sec/op vs base │ BatchRunGetUsernsFD_Concurrent1-16 533.3µ ± ∞ ¹ 739.6µ ± ∞ ¹ ~ (p=1.000 n=1) ² 723.3µ ± ∞ ¹ ~ (p=1.000 n=1) ² BatchRunGetUsernsFD_Concurrent10-16 3.662m ± ∞ ¹ 4.024m ± ∞ ¹ ~ (p=1.000 n=1) ² 3.957m ± ∞ ¹ ~ (p=1.000 n=1) ² geomean 1.397m 1.725m +23.45% 1.692m +21.06% ¹ need >= 6 samples for confidence interval at level 0.95 ² need >= 4 samples to detect a difference at alpha level 0.05 │ go122-golinkname │ go122-ptrace │ go123-ptrace │ │ B/op │ B/op vs base │ B/op vs base │ BatchRunGetUsernsFD_Concurrent1-16 1.118Ki ± ∞ ¹ 3.855Ki ± ∞ ¹ ~ (p=1.000 n=1) ² 4.121Ki ± ∞ ¹ ~ (p=1.000 n=1) ² BatchRunGetUsernsFD_Concurrent10-16 11.29Ki ± ∞ ¹ 38.67Ki ± ∞ ¹ ~ (p=1.000 n=1) ² 41.36Ki ± ∞ ¹ ~ (p=1.000 n=1) ² geomean 3.553Ki 12.21Ki +243.65% 13.06Ki +267.43% ¹ need >= 6 samples for confidence interval at level 0.95 ² need >= 4 samples to detect a difference at alpha level 0.05 │ go122-golinkname │ go122-ptrace │ go123-ptrace │ │ allocs/op │ allocs/op vs base │ allocs/op vs base │ BatchRunGetUsernsFD_Concurrent1-16 43.00 ± ∞ ¹ 68.00 ± ∞ ¹ ~ (p=1.000 n=1) ² 69.00 ± ∞ ¹ ~ (p=1.000 n=1) ² BatchRunGetUsernsFD_Concurrent10-16 421.0 ± ∞ ¹ 671.0 ± ∞ ¹ ~ (p=1.000 n=1) ² 682.0 ± ∞ ¹ ~ (p=1.000 n=1) ² geomean 134.5 213.6 +58.76% 216.9 +61.23% ¹ need >= 6 samples for confidence interval at level 0.95 ² need >= 4 samples to detect a difference at alpha level 0.05 ``` [1]: <golang/go#67401> [2]: <https://github.com/golang/go/blob/release-branch.go1.23/src/runtime/proc.go#L4820> Signed-off-by: Wei Fu <fuweid89@gmail.com>
9c91b2d
to
3cd8f97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Still LGTM :)
Seems like this is making it so this will only work on kernel >= 5.4. |
The re-exec solution is complicated because it requires all the binaries implement the command to support re-exec. For this case, we don't need to run a real command. With From https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/ :
|
ping @dmcgowan @samuelkarp @AkihiroSuda for more inputs. thanks |
Kubernetes userns KEP author here. As @fuweid was saying, we require idmap mounts for userns. idmap mounts was added later than pidfd support in Linux, so from a k8s point of view, this is not a problem for us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, LGTM.
containerd deleted unsafe, golinkname usage from whole project in the containerd/containerd#10611. This patch is to delete contained name in the comment. Change-Id: Ide55ad9c65b3b622650a0b5813a7817306e87d3f Reviewed-on: https://go-review.googlesource.com/c/go/+/609996 Auto-Submit: Ian Lance Taylor <iant@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Ian Lance Taylor <iant@google.com>
Related to this: golang/go#70352 |
The Go runtime has started to lock down future uses of linkname since
go1.23. In the go source code, containerd project has been marked in the
comment, hall of shame. Well, the go:linkname is used to fork no-op
subprocess efficiently. However, since that comment, I would like to use
ptrace and remove go:linkname in the whole repository.
With go1.22
go:linkname
:$ go test -bench=. -benchmem ./ -exec sudo goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics BenchmarkBatchRunGetUsernsFD_Concurrent1-16 2440 533320 ns/op 1145 B/op 43 allocs/op BenchmarkBatchRunGetUsernsFD_Concurrent10-16 342 3661616 ns/op 11562 B/op 421 allocs/op PASS ok github.com/containerd/containerd/v2/core/mount 2.983s
With go1.22
ptrace
:$ go test -bench=. -benchmem ./ -exec sudo goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics BenchmarkBatchRunGetUsernsFD_Concurrent1-16 1785 739557 ns/op 3948 B/op 68 allocs/op BenchmarkBatchRunGetUsernsFD_Concurrent10-16 328 4024300 ns/op 39601 B/op 671 allocs/op PASS ok github.com/containerd/containerd/v2/core/mount 3.104s
With go1.23
ptrace
:$ go test -bench=. -benchmem ./ -exec sudo goos: linux goarch: amd64 pkg: github.com/containerd/containerd/v2/core/mount cpu: AMD Ryzen 7 5800H with Radeon Graphics BenchmarkBatchRunGetUsernsFD_Concurrent1-16 1815 723252 ns/op 4220 B/op 69 allocs/op BenchmarkBatchRunGetUsernsFD_Concurrent10-16 319 3957157 ns/op 42351 B/op 682 allocs/op PASS ok github.com/containerd/containerd/v2/core/mount 3.051s
Diff:
The
ptrace
is slower thango:linkname
mode. However, it's accepctable.Signed-off-by: Wei Fu fuweid89@gmail.com
NOTE:
Highlight golang/go#68984