Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP 4216: Implement support for image pull per runtime class #9832

Closed
wants to merge 6 commits into from

Conversation

kiashok
Copy link
Contributor

@kiashok kiashok commented Feb 15, 2024

This PR implements the main logic needed to support image pull per runtime class.

  • In K8s v1.29 CRI, a new field for RuntimeHandler was added in the ImageSpec struct. We should honor this field during image pull, image remove and containerd restart and ensure that image is pulled for the appropriate platform

  • With this feature, an image can now exist for multiple platforms. Therefore, a new platform label is added to every containerd image to indicate the platform that the image exists for.

  • With these changes, an image in CRI will now be referenced as a tuple of (imageID/imageName, platform) instead of just the imageName or imageID.

  • CRIImageService.RemoveImage() :

    • RuntimeHandler field passed from CRI needs to be honored here as well. When removing an image for a particular runtimehandler, we attempt to first just remove the platform image label for this platform from the containerd image. An image is completely removed from containerd image store only if there are no more platform image labels on this image.
  • ReloadImages(): When containerd is restarted, CRI's image cache is updated from reading containerd's image store. When containerd is started the CRIImageService.ImagePlatforms field could have been updated in containerd.toml. Therefore, on the reload it is important for us to parse every platform image label and try unpack on compatible snapshotters for this image. If the image could not be unpacked with any such snapshotter, then we attempt to cleanup by removing this platform image label from the containerd image.

P.S. :

  1. The first commit is merely a cherry-pick of PR that is currently in review in containerd/platform.
    This PR has two sign offs and waiting to be merged.
    Add format for platform string platforms#6

  2. Tests to verify new functionality will be added to this PR once there is consensus on this approach.

  • CRIImageService is responsible for translating runtime handler to its platform using the newly added CRIImageService.ImagePlatforms field.

@kiashok
Copy link
Contributor Author

kiashok commented Feb 15, 2024

@fuweid
Copy link
Member

fuweid commented Feb 17, 2024

cc @shuaichang

@dims
Copy link
Member

dims commented Feb 27, 2024

@henry118 do you mind reviewing this?

@kiashok needs a rebase as well!

@kiashok kiashok force-pushed the allChanges-kep-4216 branch from cdb46e2 to 25884d1 Compare March 4, 2024 20:05
@kiashok
Copy link
Contributor Author

kiashok commented Mar 4, 2024

@henry118 do you mind reviewing this?

@kiashok needs a rebase as well!

@dims has been rebased! Thanks.

@kiashok
Copy link
Contributor Author

kiashok commented Mar 4, 2024

@dmcgowan @mikebrow @fuweid could you please take a look when you have some time please? Thanks!

@TBBle
Copy link
Contributor

TBBle commented Mar 22, 2024

Flicking through this, I noticed a couple of unremoved TODOs that I think are being resolved by the adjacent changes, and also (not in this PR, but while checking context) I think I saw some confusion between "runtime_platforms" and "runtime_platform" for the TOML config migration code, which might potentially cause unexpected failures when testing this work. (Also, is DefaultRuntimeHandlerName used? That's CRI-specific, so I don't see why it would appear in the top-level defaults... I guess they're leftovers from earlier work.)

Anyway, so I better-understand, the approach being used here is to have the runtime handler config optionally include a platform string to override platforms.DefaultSpec()? So effectively for Hyper-V on theoretical future Windows Server with an ABI break, you'd need two runtimes, one for hyperv-2019 and one for hyperv-2022?

(Actually, if I'm browsing the code correctly, we don't provide a platform-per-handler, but have a mapping elsewhere in the config? I'm not sure what this looks like in the config file, once this has examples/docs added, that'll be clearer. I might be overlooking some code, I'm not super-familiar with the CRI plugin.)

If I'm browsing this correctly, we can't provide a single runtime handler with multiple platforms either, which is a shame, but that's not being affected by this PR AFAICT.

I don't think it does, but I wanted to confirm that this isn't blocking the idea of using different platform matchers for different runtime handlers, so that we can for example make runhcs-wcow-hypervisor a "take the most-up-to-date image we can run locally in HyperV isolation" runtime by adding something like platformMatcher = HyperVIsolationMatcher to the existing default config, and that'll do the right thing everywhere forever, irrespective of host platform OS Version.

kiashok added 6 commits April 11, 2024 14:28
Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
-Add platform image labels to transfer service
and CRI functions to help indicate what platform
the image is being pulled for.

-Update containerd image store with imageID,repoDigest
and repoTag when images are pulled with transfer service
to keep the behavior in sync with CRIImageService.ImagePull()

Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
- Honors runtimeHandler passed in CRI for
image pull code paths
- Extends basic functions needed to support
having multiple platforms of the same image pulled
in CRI image store.

Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
- Reload images by supporting multi snapshotter
and platform image labels.
If an image has not
been unpacked with any snapshotter, an attempt
is made to remove that platform image label.

Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
@kiashok
Copy link
Contributor Author

kiashok commented Apr 11, 2024

Flicking through this, I noticed a couple of unremoved TODOs that I think are being resolved by the adjacent changes, and also (not in this PR, but while checking context) I think I saw some confusion between "runtime_platforms" and "runtime_platform" for the TOML config migration code, which might potentially cause unexpected failures when testing this work. (Also, is DefaultRuntimeHandlerName used? That's CRI-specific, so I don't see why it would appear in the top-level defaults... I guess they're leftovers from earlier work.)

Anyway, so I better-understand, the approach being used here is to have the runtime handler config optionally include a platform string to override platforms.DefaultSpec()? So effectively for Hyper-V on theoretical future Windows Server with an ABI break, you'd need two runtimes, one for hyperv-2019 and one for hyperv-2022?

(Actually, if I'm browsing the code correctly, we don't provide a platform-per-handler, but have a mapping elsewhere in the config? I'm not sure what this looks like in the config file, once this has examples/docs added, that'll be clearer. I might be overlooking some code, I'm not super-familiar with the CRI plugin.)

If I'm browsing this correctly, we can't provide a single runtime handler with multiple platforms either, which is a shame, but that's not being affected by this PR AFAICT.

I don't think it does, but I wanted to confirm that this isn't blocking the idea of using different platform matchers for different runtime handlers, so that we can for example make runhcs-wcow-hypervisor a "take the most-up-to-date image we can run locally in HyperV isolation" runtime by adding something like platformMatcher = HyperVIsolationMatcher to the existing default config, and that'll do the right thing everywhere forever, irrespective of host platform OS Version.

@TBBle please refer to KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4216-image-pull-per-runtime-class#summary for details on this.
High level, you would need to specify platform per runtime handler in the containerd.toml. Next, while image is being pulled, you could specify a runtime handler to overrider the default platform matcher. This will be very helpful for cases where we want to pull a non-default image. For example, if host is WS2022 but we want to pull the nanoserver:ltsc2019 because we want to use that image for hyperv windows containers.
If not runtime handler is specified during image pull, then the default platform matcher will be used and it will look for exact host and guest match.

@kiashok
Copy link
Contributor Author

kiashok commented Apr 11, 2024

@dims @dmcgowan @mikebrow @fuweid rebased the PR. Still waiting for the PR on containerd/platform containerd/platforms#6 to be merged. The first commit in this PR can be removed once that is done. Rest of the commits are ready to be reviewed.
Will add unit tests once there is consensus on the approach taken.

cc @estesp

@k8s-ci-robot
Copy link

@kiashok: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-containerd-node-e2e 94ff140 link true /test pull-containerd-node-e2e
pull-containerd-build 94ff140 link true /test pull-containerd-build
pull-containerd-k8s-e2e-ec2 94ff140 link false /test pull-containerd-k8s-e2e-ec2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

}

for _, platform := range listOfPlatformsForImage {
images = append(images, NewImageWithPlatform(c, img, platforms.Only(platforms.MustParse(platform))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not a good to change basic API logic here. For containerd API, the image service already supports multiple platforms. It doesn't make sense to change here. We should add a wrapper function in CRI plugin side.


// PlatformLabelPrefix is common prefix for indicating what platforms
// an image has been pulled for
PlatformLabelPrefix = "containerd.io/imagePulledForPlatform"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow the design here. Would we just use existing API to check the platform part?
IMO, it's hard to make sure that the existing data is consistent with label.
The user might use ctr or other tools to pull images and change the labels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just like how we add CRI image labels for images pulled through ctr, we should do the same for platform label as well. We need some way to track which platform the image is being pulled for as it is needed by ImageService.UpdateImage() that gets called after an event is published. An alternative was to store this information as an exclusive field in containerd image struct but feedback given was that runtimehandler/platform info is specific to CRI and should not be saved to core containerd structures. Therefore, labels are the only alternative to achieve this.

@dmcgowan

@fidencio
Copy link
Contributor

fidencio commented Apr 18, 2024

I've started testing this PR as it is, and I will start dropping comments as I face issues.

The first one, building containerd:

+ bin/containerd
go build  -gcflags=-trimpath=/home/ubuntu/go/src -buildmode=pie  -o bin/containerd -ldflags '-X github.com/containerd/containerd/v2/version.Version=v2.0.0-rc.0-64-g94ff1406a -X github.com/containerd/containerd/v2/version.Revision=94ff1406abdc9aef6f1fcbccc91ec742b01d611d -X github.com/containerd/containerd/v2/version.Package=github.com/containerd/containerd -s -w ' -tags "urfave_cli_no_docs"  ./cmd/containerd
# github.com/containerd/containerd/v2/internal/cri/server
internal/cri/server/container_checkpoint_linux.go:96:27: not enough arguments in call to c.GetImage
	have (string)
	want (string, string)
make: *** [Makefile:259: bin/containerd] Error 1

Basically, CheckpointContainer needs a reliable way to get the runtimeHandler used there.

Maybe this would be what we're looking for?

diff --git a/internal/cri/server/container_checkpoint_linux.go b/internal/cri/server/container_checkpoint_linux.go
index f2211fdfe..e6493450c 100644
--- a/internal/cri/server/container_checkpoint_linux.go
+++ b/internal/cri/server/container_checkpoint_linux.go
@@ -93,7 +93,7 @@ func (c *criService) CheckpointContainer(ctx context.Context, r *runtime.Checkpo
        }
 
        imageRef := container.ImageRef
-       image, err := c.GetImage(imageRef)
+       image, err := c.GetImage(imageRef, container.Config.GetImage().RuntimeHandler)
        if err != nil {
                return nil, fmt.Errorf("getting container image failed: %w", err)
        }

A second thing here, that I was aware it wasn't being focused on this PR, but we'll need to take into consideration.

This PR is proposing a mapping between an Image and the Platform, but we may still get to the case that one specific Platform uses different snapshotters there. In case of linux/amd64, for instance, we may use overlayfs for vanilla container runtimes, but add devmapper / nydus for kata-containers / confidential-containers.

Then, what I'd expect as a user, that if user U1 schedules an nginx pod to run with runc (using overlayfs), and user U2 schedules an nginx pod to run with kata-containers (using nydus, in my tests), containerd would have a way to distinguish between the images used by each specific snapshotter.

This is not the case right now in the current upstream codebase, neither with this patch applied.

After starting an nginx pod with runc, and then trying to start the same pod with Kata Containers, I get the following error:

  Warning  FailedCreatePodSandBox  0s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "41f8e9fdab12ccf40cf05f6aaf97a2dd2e581a09a15f2cb462bffc821187a2cc": failed to create containerd task: failed to create shim task: failed to mount /run/kata-containers/shared/containers/41f8e9fdab12ccf40cf05f6aaf97a2dd2e581a09a15f2cb462bffc821187a2cc/rootfs to /run/kata-containers/41f8e9fdab12ccf40cf05f6aaf97a2dd2e581a09a15f2cb462bffc821187a2cc/rootfs, with error: ENOENT: No such file or directory: unknown

This is happening because, even nydus being properly set up and configured, containerd will assume that the already downloaded and cached, using overlayfs, "pause" and "nginx" images are returned instead of actually triggering a PullImage for the correct snapshotter.

@henry118
Copy link
Member

This PR is proposing a mapping between an Image and the Platform, but we may still get to the case that one specific Platform uses different snapshotters there. In case of linux/amd64, for instance, we may use overlayfs for vanilla container runtimes, but add devmapper / nydus for kata-containers / confidential-containers.

+1 here.

We also have use cases where using different snapshotters on the same platform, but under different runtime handlers. Can we instead maintain a mapping between image and runtime handlers?

@k8s-ci-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

// Ref of the image
Ref string
// Runtimehandler used for pulling this image
Platform string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions RuntimeHandler, but it seems we're using platform everywhere.
Shouldn't this actually be RuntimeHandler and we should ensure that the RuntimeHandler is actually used?

The same comment goes for the ImageIDKey struct and how it's used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we moved away from the design of storing the runtimeHandler as it was making changes to core containerd packages which we did not want to do ahead with. This comment needs to be updated to "Platform used for pulling this image". Will address this and push new changes.

@kiashok
Copy link
Contributor Author

kiashok commented Apr 22, 2024

I've started testing this PR as it is, and I will start dropping comments as I face issues.

The first one, building containerd:

+ bin/containerd
go build  -gcflags=-trimpath=/home/ubuntu/go/src -buildmode=pie  -o bin/containerd -ldflags '-X github.com/containerd/containerd/v2/version.Version=v2.0.0-rc.0-64-g94ff1406a -X github.com/containerd/containerd/v2/version.Revision=94ff1406abdc9aef6f1fcbccc91ec742b01d611d -X github.com/containerd/containerd/v2/version.Package=github.com/containerd/containerd -s -w ' -tags "urfave_cli_no_docs"  ./cmd/containerd
# github.com/containerd/containerd/v2/internal/cri/server
internal/cri/server/container_checkpoint_linux.go:96:27: not enough arguments in call to c.GetImage
	have (string)
	want (string, string)
make: *** [Makefile:259: bin/containerd] Error 1

Basically, CheckpointContainer needs a reliable way to get the runtimeHandler used there.

Maybe this would be what we're looking for?

diff --git a/internal/cri/server/container_checkpoint_linux.go b/internal/cri/server/container_checkpoint_linux.go
index f2211fdfe..e6493450c 100644
--- a/internal/cri/server/container_checkpoint_linux.go
+++ b/internal/cri/server/container_checkpoint_linux.go
@@ -93,7 +93,7 @@ func (c *criService) CheckpointContainer(ctx context.Context, r *runtime.Checkpo
        }
 
        imageRef := container.ImageRef
-       image, err := c.GetImage(imageRef)
+       image, err := c.GetImage(imageRef, container.Config.GetImage().RuntimeHandler)
        if err != nil {
                return nil, fmt.Errorf("getting container image failed: %w", err)
        }

A second thing here, that I was aware it wasn't being focused on this PR, but we'll need to take into consideration.

This PR is proposing a mapping between an Image and the Platform, but we may still get to the case that one specific Platform uses different snapshotters there. In case of linux/amd64, for instance, we may use overlayfs for vanilla container runtimes, but add devmapper / nydus for kata-containers / confidential-containers.

Please check the changes made in check.go . The platform associated with an image MAY map to multiple runtime handlers and the associated snapshotter with that runtime handler. Therefore, we go through the list of runtime handlers during reload time to ensure we unpack for the different snapshotters. If an image was not unpacked for a particular snapshotter, this will fail and we will continue with the rest of runtime handlers whose platform matches the image's platform.

Then, what I'd expect as a user, that if user U1 schedules an nginx pod to run with runc (using overlayfs), and user U2 schedules an nginx pod to run with kata-containers (using nydus, in my tests), containerd would have a way to distinguish between the images used by each specific snapshotter.

If I understand the example you are talking about, I think the above response applies to this as well. If I have missed something, please let me know.

@kiashok
Copy link
Contributor Author

kiashok commented Apr 22, 2024

This PR is proposing a mapping between an Image and the Platform, but we may still get to the case that one specific Platform uses different snapshotters there. In case of linux/amd64, for instance, we may use overlayfs for vanilla container runtimes, but add devmapper / nydus for kata-containers / confidential-containers.

+1 here.

We also have use cases where using different snapshotters on the same platform, but under different runtime handlers. Can we instead maintain a mapping between image and runtime handlers?

please refer to #9832 (comment)

@Kern--
Copy link
Contributor

Kern-- commented Apr 23, 2024

Please check the changes made in check.go . The platform associated with an image MAY map to multiple runtime handlers and the associated snapshotter with that runtime handler. Therefore, we go through the list of runtime handlers during reload time to ensure we unpack for the different snapshotters. If an image was not unpacked for a particular snapshotter, this will fail and we will continue with the rest of runtime handlers whose platform matches the image's platform.

I have a couple of questions/concerns about this:

  1. From what I understand, the changes in check will go through each snapshotter associated with a platform (looking up the mapping via configured runtime classes) and call images.Check for that snapshotter. images.Check will verify that all the snapshots exist, or unpack the content into the snapshotter from the local content store. Is that right?
  2. Does "during reload time" mean that containerd needs to be restarted in order for the unpack to happen in additional snapshotters?
  3. Unpacking into additional snapshotters doesn't work if one of the snapshotters is a remote snapshotter because the content was never in the content store to begin with
  4. IIRC, all of EKS, AKS, and GKE set discard_unpacked_layers=true in their first party k8s images which I think would mean the content wouldn't be available to unpack after the first pull.

@henry118
Copy link
Member

This PR is proposing a mapping between an Image and the Platform, but we may still get to the case that one specific Platform uses different snapshotters there. In case of linux/amd64, for instance, we may use overlayfs for vanilla container runtimes, but add devmapper / nydus for kata-containers / confidential-containers.

+1 here.
We also have use cases where using different snapshotters on the same platform, but under different runtime handlers. Can we instead maintain a mapping between image and runtime handlers?

please refer to #9832 (comment)

Thanks for the explanation. I understand the concern of introducing runtime handler concept to the core. Just wanted to bring up a use case from my side and I really hope this scenario can be supported by this patch. We have use cases that requires mapping of runtime classes to different snapshotters. In our setup the runtime classes (snapshotters) will share the same platform. Hope this scenario could be put into consideration in the design.

@kiashok
Copy link
Contributor Author

kiashok commented Apr 25, 2024

I've started testing this PR as it is, and I will start dropping comments as I face issues.

The first one, building containerd:

+ bin/containerd
go build  -gcflags=-trimpath=/home/ubuntu/go/src -buildmode=pie  -o bin/containerd -ldflags '-X github.com/containerd/containerd/v2/version.Version=v2.0.0-rc.0-64-g94ff1406a -X github.com/containerd/containerd/v2/version.Revision=94ff1406abdc9aef6f1fcbccc91ec742b01d611d -X github.com/containerd/containerd/v2/version.Package=github.com/containerd/containerd -s -w ' -tags "urfave_cli_no_docs"  ./cmd/containerd
# github.com/containerd/containerd/v2/internal/cri/server
internal/cri/server/container_checkpoint_linux.go:96:27: not enough arguments in call to c.GetImage
	have (string)
	want (string, string)
make: *** [Makefile:259: bin/containerd] Error 1

Basically, CheckpointContainer needs a reliable way to get the runtimeHandler used there.

Maybe this would be what we're looking for?

diff --git a/internal/cri/server/container_checkpoint_linux.go b/internal/cri/server/container_checkpoint_linux.go
index f2211fdfe..e6493450c 100644
--- a/internal/cri/server/container_checkpoint_linux.go
+++ b/internal/cri/server/container_checkpoint_linux.go
@@ -93,7 +93,7 @@ func (c *criService) CheckpointContainer(ctx context.Context, r *runtime.Checkpo
        }
 
        imageRef := container.ImageRef
-       image, err := c.GetImage(imageRef)
+       image, err := c.GetImage(imageRef, container.Config.GetImage().RuntimeHandler)
        if err != nil {
                return nil, fmt.Errorf("getting container image failed: %w", err)
        }

A second thing here, that I was aware it wasn't being focused on this PR, but we'll need to take into consideration.

This PR is proposing a mapping between an Image and the Platform, but we may still get to the case that one specific Platform uses different snapshotters there. In case of linux/amd64, for instance, we may use overlayfs for vanilla container runtimes, but add devmapper / nydus for kata-containers / confidential-containers.

Then, what I'd expect as a user, that if user U1 schedules an nginx pod to run with runc (using overlayfs), and user U2 schedules an nginx pod to run with kata-containers (using nydus, in my tests), containerd would have a way to distinguish between the images used by each specific snapshotter.

This is not the case right now in the current upstream codebase, neither with this patch applied.

After starting an nginx pod with runc, and then trying to start the same pod with Kata Containers, I get the following error:

  Warning  FailedCreatePodSandBox  0s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "41f8e9fdab12ccf40cf05f6aaf97a2dd2e581a09a15f2cb462bffc821187a2cc": failed to create containerd task: failed to create shim task: failed to mount /run/kata-containers/shared/containers/41f8e9fdab12ccf40cf05f6aaf97a2dd2e581a09a15f2cb462bffc821187a2cc/rootfs to /run/kata-containers/41f8e9fdab12ccf40cf05f6aaf97a2dd2e581a09a15f2cb462bffc821187a2cc/rootfs, with error: ENOENT: No such file or directory: unknown

This is happening because, even nydus being properly set up and configured, containerd will assume that the already downloaded and cached, using overlayfs, "pause" and "nginx" images are returned instead of actually triggering a PullImage for the correct snapshotter.

I don't think we should combine any already existing containerd bugs to this PR changes. Could you please track them separately by opening github issues on containerd? I think with the annotation approach you are trying, existing containerd changes should support it. I think we are trying to mix up too many things here.
Like we discussed in the community meeting, specifying platform with ImagePlatform on containerd.toml is NOT enforced. If one is mentioned, it will be honored and if not, default platform will be used.

Copy link

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 7 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the Stale label Jul 25, 2024
@kiashok kiashok closed this Jul 31, 2024
@kiashok
Copy link
Contributor Author

kiashok commented Jul 31, 2024

Discussed a newer approach with Derek and since it is significantly different from the approach taken here, sent out a new PR #10533

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants