-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minimal checkpointing support #104907
Minimal checkpointing support #104907
Conversation
Hi @adrianreber. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
This needs to be implemented behind a feature flag. /ok-to-test |
c4eb187
to
71f1bdb
Compare
/test pull-kubernetes-unit |
42e0bba
to
948556e
Compare
Added unit tests and a feature flag (disabled by default). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adrianreber this is super close.
can you address the following:
- ensure checkpoint directory is only created if the feature gate is enabled
- ensure i can checkpoint an init container and/or ephemeral container
Thanks!
|
||
ctx, cancel := func() (context.Context, context.CancelFunc) { | ||
defaultTimeout := int64(r.timeout / time.Second) | ||
if options.Timeout > defaultTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what could cause this timeout to need to vary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpoint duration directly depends on the number of memory pages the processes in the container require. For very large containers checkpointing can take longer than the default CRI timeout of 2 seconds.
pkg/kubelet/kubelet.go
Outdated
@@ -1251,6 +1254,9 @@ func (kl *Kubelet) setupDataDirs() error { | |||
if err := os.MkdirAll(kl.getPodResourcesDir(), 0750); err != nil { | |||
return fmt.Errorf("error creating podresources directory: %v", err) | |||
} | |||
if err := os.MkdirAll(kl.getCheckpointsDir(), 0750); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should only create the checkpoint directory if the feature gate is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Changed.
// checkpoint handles the checkpoint API request. It checks if the requested | ||
// podNamespace, pod and container actually exist and only then calls out | ||
// to the runtime to actually checkpoint the container. | ||
func (s *Server) checkpoint(request *restful.Request, response *restful.Response) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initially we want to avoid exposing checkpoint operations from the k8s control plane itself.
we need to better understand the pros/cons from a forensic use case.
containerName := request.PathParameter("containerName") | ||
|
||
found := false | ||
for _, container := range pod.Spec.Containers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means we are unable to checkpoint init containers or ephemeral containers.
can you add a check for pod.Spec.InitContainers
and a check for pod.Spec.EphemeralContainers
if the feature gate for ephemeral containers is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. I changed it and was able to checkpoint init and ephemeral containers.
This adds the minimal CRI API changes to support checkpoint/restore and to enable implementation of these checkpoint/restore interfaces on container engine level. Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
This adds the first infrastructure code parts to the kubelet to support checkpoint/restore. Signed-off-by: Adrian Reber <areber@redhat.com>
This is the first step to implement checkpointing and restoring of container and containers starting from the lowest layer in the kubelet. Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
This adds the last pieces to wire through the container checkpoint support in the kubelet. Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
343c6ac
to
92ea6e3
Compare
🎆
Done. Good point.
Also done. Thanks for your very specific review. That made it really easy to implement the requested changes. |
Thanks for the updates as requested. /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: adrianreber, derekwaynecarr The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test pull-kubernetes-e2e-kind-ipv6 |
/test pull-kubernetes-e2e-kind |
/test pull-kubernetes-e2e-gce-ubuntu-containerd |
1 similar comment
/test pull-kubernetes-e2e-gce-ubuntu-containerd |
@adrianreber: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/test pull-kubernetes-e2e-gce-ubuntu-containerd |
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
This PR introduces the minimal checkpoint support as discussed in kubernetes/enhancements#1990
With the changes in the PR it is only possible to checkpoint a container. One of the main use cases, as discussed in the KEP of this PR is the forensic use case: Checkpoint a container and leave it running, take the checkpoint image and restart the container somewhere else (sandboxed environment) for analysis.
Possible steps could be:
curl -skv -X POST "https://localhost:10250/checkpoint/default/counters/wildfly"
/var/lib/kubelet/checkpoints
and transfer it to another systemcrictl restore --import=<archive>
to restore a copy of the container somewhere else.In the KEP we mentioned that checkpoints will be stored in OCI images and can be pushed to a registry. This is still planned but currently not implemented on the container engine level. It will be implemented in the future.
Which issue(s) this PR fixes:
This does not really completely fix any open issues, but it is a step forward to close issues like:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: