Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/mount: use ptrace instead of go:linkname #10611

Merged
merged 2 commits into from
Aug 30, 2024

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Aug 19, 2024

The Go runtime has started to lock down future uses of linkname since
go1.23. In the go source code, containerd project has been marked in the
comment, hall of shame. Well, the go:linkname is used to fork no-op
subprocess efficiently. However, since that comment, I would like to use
ptrace and remove go:linkname in the whole repository.

With go1.22 go:linkname:

$ go test -bench=.  -benchmem ./ -exec sudo
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
BenchmarkBatchRunGetUsernsFD_Concurrent1-16                 2440            533320 ns/op            1145 B/op         43 allocs/op
BenchmarkBatchRunGetUsernsFD_Concurrent10-16                 342           3661616 ns/op           11562 B/op        421 allocs/op
PASS
ok      github.com/containerd/containerd/v2/core/mount  2.983s

With go1.22 ptrace:

$ go test -bench=.  -benchmem ./ -exec sudo
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
BenchmarkBatchRunGetUsernsFD_Concurrent1-16                 1785            739557 ns/op            3948 B/op         68 allocs/op
BenchmarkBatchRunGetUsernsFD_Concurrent10-16                 328           4024300 ns/op           39601 B/op        671 allocs/op
PASS
ok      github.com/containerd/containerd/v2/core/mount  3.104s

With go1.23 ptrace:

$ go test -bench=.  -benchmem ./ -exec sudo
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
BenchmarkBatchRunGetUsernsFD_Concurrent1-16                 1815            723252 ns/op            4220 B/op         69 allocs/op
BenchmarkBatchRunGetUsernsFD_Concurrent10-16                 319           3957157 ns/op           42351 B/op        682 allocs/op
PASS
ok      github.com/containerd/containerd/v2/core/mount  3.051s

Diff:

The ptrace is slower than go:linkname mode. However, it's accepctable.

goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
                                    │ go122-golinkname │             go122-ptrace              │             go123-ptrace              │
                                    │      sec/op      │    sec/op     vs base                 │    sec/op     vs base                 │
BatchRunGetUsernsFD_Concurrent1-16        533.3µ ± ∞ ¹   739.6µ ± ∞ ¹        ~ (p=1.000 n=1) ²   723.3µ ± ∞ ¹        ~ (p=1.000 n=1) ²
BatchRunGetUsernsFD_Concurrent10-16       3.662m ± ∞ ¹   4.024m ± ∞ ¹        ~ (p=1.000 n=1) ²   3.957m ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                   1.397m         1.725m        +23.45%                   1.692m        +21.06%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                    │ go122-golinkname │              go122-ptrace               │              go123-ptrace               │
                                    │       B/op       │     B/op       vs base                  │     B/op       vs base                  │
BatchRunGetUsernsFD_Concurrent1-16       1.118Ki ± ∞ ¹   3.855Ki ± ∞ ¹         ~ (p=1.000 n=1) ²   4.121Ki ± ∞ ¹         ~ (p=1.000 n=1) ²
BatchRunGetUsernsFD_Concurrent10-16      11.29Ki ± ∞ ¹   38.67Ki ± ∞ ¹         ~ (p=1.000 n=1) ²   41.36Ki ± ∞ ¹         ~ (p=1.000 n=1) ²
geomean                                  3.553Ki         12.21Ki        +243.65%                   13.06Ki        +267.43%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                    │ go122-golinkname │             go122-ptrace             │             go123-ptrace             │
                                    │    allocs/op     │  allocs/op   vs base                 │  allocs/op   vs base                 │
BatchRunGetUsernsFD_Concurrent1-16         43.00 ± ∞ ¹   68.00 ± ∞ ¹        ~ (p=1.000 n=1) ²   69.00 ± ∞ ¹        ~ (p=1.000 n=1) ²
BatchRunGetUsernsFD_Concurrent10-16        421.0 ± ∞ ¹   671.0 ± ∞ ¹        ~ (p=1.000 n=1) ²   682.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                    134.5         213.6        +58.76%                   216.9        +61.23%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

Signed-off-by: Wei Fu fuweid89@gmail.com


NOTE:

Highlight golang/go#68984

```bash
$ go test -bench=.  -benchmem ./ -exec sudo
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
BenchmarkBatchRunGetUsernsFD_Concurrent1-16                 2398            532424 ns/op            1145 B/op         43 allocs/op
BenchmarkBatchRunGetUsernsFD_Concurrent10-16                 343           3701695 ns/op           11552 B/op        421 allocs/op
PASS
ok      github.com/containerd/containerd/v2/core/mount  2.978s
```

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@dosubot dosubot bot added area/runtime Runtime go Pull requests that update Go code labels Aug 19, 2024
@fuweid
Copy link
Member Author

fuweid commented Aug 19, 2024

default: time="2024-08-19T14:13:35.673075415Z" level=fatal msg="serve failure" address=/run/containerd-test/containerd.sock error="accept unix /run/containerd-test/containerd.sock: accept4: bad file descriptor"

Hmm. Never see this error before...

REF: https://github.com/containerd/containerd/actions/runs/10454853534/job/28948642310?pr=10611


Updated:

It's caused by golang/go#68984

Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly LGTM, will review in detail later. Added two questions, but I think the answer is what I expect, so probably no changes needed for those things :)

Comment on lines +68 to +70
// The usernsFD will hold the userns reference in kernel. Even if the
// child process is reaped, the usernsFD is still valid.
usernsFD, err := os.Open(fmt.Sprintf("/proc/%d/ns/user", proc.Pid))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the kernel, do_exit() from kernel/exit.c is called when the process dies, which calls exit_task_namespaces() to destroy the namespaces. So this only holds if we do the open() before it is destroyed.

But IIUC this is what you mean, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this only holds if we do the open() before it is destroyed.

Yes.

Comment on lines +42 to +43
Ptrace: true,
Pdeathsig: syscall.SIGKILL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, SIGKILL does kill the process if it is using Ptrace: true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pdeathsig is used to kill the tracee if the parent process, like containerd, exits. So it won't cause leaky subprocess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that. My question is: if the process is waiting with Ptrace: true, will it react to SIGKILL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. Yes. The tracee is in STOP status so kernel can kill it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks!

Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left some nits, but this looks great to me :)

unix.ECHILD, werr)
}

// NOTE: The CLONE_PIDFD flag has been supported since Linux kernel v5.2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using this flag? Or maybe in a previous iteration? I can't find it now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't use this flag. I just left the comment. If kernel supports P_PIDFD in waitid, the kernel should support CLONE_PIDFD as well.

/ So assumption is that if waitid(2) supports P_PIDFD, current kernel should support CLONE_PIDFD as well.

return nil, fmt.Errorf("failed to start noop process for unshare: %w", err)
}

if pidfd == -1 || !sys.SupportsPidFD() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd check for pidfd support at the very beginning, and return an error if it doesn't. As you do in the go 1.23 file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check it in old kernel which doesn't support CLONE_PIDFD. If the kernel doesn't support CLONE_PIDFD, the os.StartProcess should return error. If so, we don't need to check pidfd value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that the os.StartProcess won't return error if the kernel doesn't support CLONE_PIDFD.

vagrant@vagrant:~$ uname -a
Linux vagrant 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
vagrant@vagrant:~$ sudo /vagrant/test -test.v -test.root -test.run TestIdmappedMount/IDMapMount
=== RUN   TestIdmappedMount
=== RUN   TestIdmappedMount/IDMapMount
pidfd =  -1
    mount_idmapped_linux_test.go:130:
                Error Trace:    /home/fuweid/go/src/github.com/containerd/containerd/core/mount/mount_idmapped_linux_test.go:130
                Error:          Received unexpected error:
                                failed to prevent pid reused issue because pidfd isn't supported
                Test:           TestIdmappedMount/IDMapMount
--- FAIL: TestIdmappedMount (0.00s)
    --- FAIL: TestIdmappedMount/IDMapMount (0.00s)
FAIL

I applied the following patch based on this pull request.

diff --git a/core/mount/mount_idmapped_linux_test.go b/core/mount/mount_idmapped_linux_test.go
index 90e3a61e8..be802705a 100644
--- a/core/mount/mount_idmapped_linux_test.go
+++ b/core/mount/mount_idmapped_linux_test.go
@@ -24,7 +24,6 @@ import (
        "syscall"
        "testing"

-       kernel "github.com/containerd/containerd/v2/pkg/kernelversion"
        "github.com/containerd/continuity/testutil"
        "github.com/stretchr/testify/require"
 )
@@ -74,12 +73,14 @@ var (
 func TestIdmappedMount(t *testing.T) {
        testutil.RequiresRoot(t)

-       k512 := kernel.KernelVersion{Kernel: 5, Major: 12}
-       ok, err := kernel.GreaterEqualThan(k512)
-       require.NoError(t, err)
-       if !ok {
-               t.Skip("GetUsernsFD requires kernel >= 5.12")
-       }
+       /*
+               k512 := kernel.KernelVersion{Kernel: 5, Major: 12}
+               ok, err := kernel.GreaterEqualThan(k512)
+               require.NoError(t, err)
+               if !ok {
+                       t.Skip("GetUsernsFD requires kernel >= 5.12")
+               }
+       */

        t.Run("GetUsernsFD", testGetUsernsFD)

diff --git a/core/mount/mount_idmapped_utils_linux_go122.go b/core/mount/mount_idmapped_utils_linux_go122.go
index 21b1c81a3..a0b9546e4 100644
--- a/core/mount/mount_idmapped_utils_linux_go122.go
+++ b/core/mount/mount_idmapped_utils_linux_go122.go
@@ -48,7 +48,8 @@ func getUsernsFD(uidMaps, gidMaps []syscall.SysProcIDMap) (*os.File, error) {
                return nil, fmt.Errorf("failed to start noop process for unshare: %w", err)
        }

-       if pidfd == -1 || !sys.SupportsPidFD() {
+       if pidfd == -1 {
+               fmt.Println("pidfd = ", pidfd)
                proc.Kill()
                proc.Wait()
                return nil, fmt.Errorf("failed to prevent pid reused issue because pidfd isn't supported")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you do in the go 1.23 file

Since go1.23.0 has pidfd double-close issue, I check the SupportsPidFD at the beginning.
But for the go1.22.0, I need to check pidfd result. So I combine two condition checks together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh, I see now. Makse sense, thanks!

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment

core/mount/mount_idmapped_utils_linux.go Show resolved Hide resolved
@fuweid fuweid force-pushed the getridof-hall-of-shame branch from a7ce13c to 9c91b2d Compare August 24, 2024 02:59
The Go runtime has started to [lock down future uses of linkname][1] since
go1.23. In the go source code, containerd project has been marked in the
comment, [hall of shame][2]. Well, the go:linkname is used to fork no-op
subprocess efficiently. However, since that comment, I would like to use
ptrace and remove go:linkname in the whole repository.

With go1.22 `go:linkname`:

```bash
$ go test -bench=.  -benchmem ./ -exec sudo
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
BenchmarkBatchRunGetUsernsFD_Concurrent1-16                 2440            533320 ns/op            1145 B/op         43 allocs/op
BenchmarkBatchRunGetUsernsFD_Concurrent10-16                 342           3661616 ns/op           11562 B/op        421 allocs/op
PASS
ok      github.com/containerd/containerd/v2/core/mount  2.983s
```

With go1.22 `ptrace`:

```bash
$ go test -bench=.  -benchmem ./ -exec sudo
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
BenchmarkBatchRunGetUsernsFD_Concurrent1-16                 1785            739557 ns/op            3948 B/op         68 allocs/op
BenchmarkBatchRunGetUsernsFD_Concurrent10-16                 328           4024300 ns/op           39601 B/op        671 allocs/op
PASS
ok      github.com/containerd/containerd/v2/core/mount  3.104s
```

With go1.23 `ptrace`:

```bash
$ go test -bench=.  -benchmem ./ -exec sudo
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
BenchmarkBatchRunGetUsernsFD_Concurrent1-16                 1815            723252 ns/op            4220 B/op         69 allocs/op
BenchmarkBatchRunGetUsernsFD_Concurrent10-16                 319           3957157 ns/op           42351 B/op        682 allocs/op
PASS
ok      github.com/containerd/containerd/v2/core/mount  3.051s
```

Diff:

The `ptrace` is slower than `go:linkname` mode. However, it's accepctable.

```
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/v2/core/mount
cpu: AMD Ryzen 7 5800H with Radeon Graphics
                                    │ go122-golinkname │             go122-ptrace              │             go123-ptrace              │
                                    │      sec/op      │    sec/op     vs base                 │    sec/op     vs base                 │
BatchRunGetUsernsFD_Concurrent1-16        533.3µ ± ∞ ¹   739.6µ ± ∞ ¹        ~ (p=1.000 n=1) ²   723.3µ ± ∞ ¹        ~ (p=1.000 n=1) ²
BatchRunGetUsernsFD_Concurrent10-16       3.662m ± ∞ ¹   4.024m ± ∞ ¹        ~ (p=1.000 n=1) ²   3.957m ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                   1.397m         1.725m        +23.45%                   1.692m        +21.06%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                    │ go122-golinkname │              go122-ptrace               │              go123-ptrace               │
                                    │       B/op       │     B/op       vs base                  │     B/op       vs base                  │
BatchRunGetUsernsFD_Concurrent1-16       1.118Ki ± ∞ ¹   3.855Ki ± ∞ ¹         ~ (p=1.000 n=1) ²   4.121Ki ± ∞ ¹         ~ (p=1.000 n=1) ²
BatchRunGetUsernsFD_Concurrent10-16      11.29Ki ± ∞ ¹   38.67Ki ± ∞ ¹         ~ (p=1.000 n=1) ²   41.36Ki ± ∞ ¹         ~ (p=1.000 n=1) ²
geomean                                  3.553Ki         12.21Ki        +243.65%                   13.06Ki        +267.43%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                    │ go122-golinkname │             go122-ptrace             │             go123-ptrace             │
                                    │    allocs/op     │  allocs/op   vs base                 │  allocs/op   vs base                 │
BatchRunGetUsernsFD_Concurrent1-16         43.00 ± ∞ ¹   68.00 ± ∞ ¹        ~ (p=1.000 n=1) ²   69.00 ± ∞ ¹        ~ (p=1.000 n=1) ²
BatchRunGetUsernsFD_Concurrent10-16        421.0 ± ∞ ¹   671.0 ± ∞ ¹        ~ (p=1.000 n=1) ²   682.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                    134.5         213.6        +58.76%                   216.9        +61.23%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```

[1]: <golang/go#67401>
[2]: <https://github.com/golang/go/blob/release-branch.go1.23/src/runtime/proc.go#L4820>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@fuweid fuweid force-pushed the getridof-hall-of-shame branch from 9c91b2d to 3cd8f97 Compare August 26, 2024 13:21
Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Still LGTM :)

@cpuguy83
Copy link
Member

Seems like this is making it so this will only work on kernel >= 5.4.
Seems like we should have a less optimized version that doesn't require pidfd?
Maybe a full re-exec and pass fd back over unix socket?

@fuweid
Copy link
Member Author

fuweid commented Aug 27, 2024

Seems like this is making it so this will only work on kernel >= 5.4. Seems like we should have a less optimized version that doesn't require pidfd? Maybe a full re-exec and pass fd back over unix socket?

The re-exec solution is complicated because it requires all the binaries implement the command to support re-exec.
Both containerd and containerd-shim have to do that.

For this case, we don't need to run a real command. With ptrace, the execve(2) syscall will be stopped at exit-to-user loop. And then, if kubernetes users want to use user-namespace feature, at least, they should have >= v6.3 linux kernel.
The IDMapped mount is the key to have better experience for user-namespace. That's why I think pidfd is better than re-exec.

From https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/ :

This is a Linux-only feature and support is needed in Linux for idmap mounts on the filesystems used. This means:

On the node, the filesystem you use for /var/lib/kubelet/pods/, or the custom directory you configure for this, needs idmap mount support.

* All the filesystems used in the pod's volumes must support idmap mounts.

* In practice this means you need at least Linux 6.3, as tmpfs started supporting idmap mounts in that version. This is usually needed as several Kubernetes features use tmpfs (the service account token that is mounted by default uses a tmpfs, Secrets use a tmpfs, etc.)

* Some popular filesystems that support idmap mounts in Linux 6.3 are: btrfs, ext4, xfs, fat, tmpfs, overlayfs.

* In addition, the container runtime and its underlying OCI runtime must support user namespaces. The following OCI runtimes offer support:

[crun](https://github.com/containers/crun) version 1.9 or greater (it's recommend version 1.13+).

@fuweid
Copy link
Member Author

fuweid commented Aug 27, 2024

ping @dmcgowan @samuelkarp @AkihiroSuda for more inputs. thanks

@rata
Copy link
Contributor

rata commented Aug 27, 2024

Kubernetes userns KEP author here. As @fuweid was saying, we require idmap mounts for userns. idmap mounts was added later than pidfd support in Linux, so from a k8s point of view, this is not a problem for us.

@AkihiroSuda AkihiroSuda requested a review from cpuguy83 August 30, 2024 08:38
Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, LGTM.

@AkihiroSuda AkihiroSuda added this pull request to the merge queue Aug 30, 2024
Merged via the queue into containerd:main with commit 26d6fd0 Aug 30, 2024
52 checks passed
@fuweid fuweid deleted the getridof-hall-of-shame branch September 1, 2024 01:04
gopherbot pushed a commit to golang/go that referenced this pull request Sep 4, 2024
containerd deleted unsafe, golinkname usage from whole project in
the containerd/containerd#10611. This patch is
to delete contained name in the comment.

Change-Id: Ide55ad9c65b3b622650a0b5813a7817306e87d3f
Reviewed-on: https://go-review.googlesource.com/c/go/+/609996
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@kolyshkin
Copy link
Contributor

Related to this: golang/go#70352

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/runtime Runtime go Pull requests that update Go code size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants