-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Forensic Container Checkpointing KEP #1990
Conversation
Welcome @adrianreber! |
Hi @adrianreber. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
5529fa7
to
22ec713
Compare
Hi @adrianreber All KEPs must have an open issue linked to the PR, can you please create filling out all of the sections and link here? Thanks, |
@kikisdeliveryservice Thanks for letting me know. Here it is: |
According to the template the number in the file name should match the issue number, which would be |
@schrej Thanks for pointing that out. Will update this PR soon. |
22ec713
to
5058f15
Compare
@adrianreber while it is clear that we may want this type of API in the CRI to facilitate a future apiserver->kubelet<->runtime checkpoint/restore, but I am not inclined to add things to the CRI until we know how the higher order systems would utilize it. Can you maybe explore what it would mean to checkpoint/restore a Pod across hosts to maybe identify the questions that may arise? For example, do init containers re-run on the new host? What is the advantage of checkpoint/restore a pod versus just schedule a new pod. Basically, look at the problem from a Kube API level first before going to CRI level. |
Interesting subject. Use case examples from the below link:
Would go a long way just to make k8s aware of the additional possible runc container states and work them into valid container life cycle tooling/scenarios for k8s users. With the state of pod and node resource management being shared between kubelet and the container runtimes.. checkpoint restore of a pod and it's in use resources will be somewhat difficult and probably needs to be staged based on dependency chains. See discussions over on the sidecar/container dependency tree for bootup/tear down kep. |
I will try to be relatively verbose in my answers to hopefully avoid ambiguity. So I might repeat obvious things.
Looking first at the technical level. To migrate (checkpoint/restore) a Pod from one host to another, I would say we only need to worry about migrating single containers. A Pod, as far as I have seen it, is a pause container and multiple other containers sharing the (some) namespaces. Sharing namespaces upon restore, with the help of CRIU, is possible in runc once opencontainers/runc#2583 is merged. So on the lower level we have everything implemented and working. Migrating a Pod is now nothing more than checkpointing all containers in the Pod (except the pause container), destroying the Pod on the source system, creating a new Pod on the destination system (including a newly created pause container) and restoring all containers into the new Pod on the destination system. From what I have done so far in Podman and CRI-O, I can say, that everything I mentioned is doable. To migrate a container to a new host, the container is created with the same ID as it had before checkpointing, the corresponding image is pulled from the registry, also based on the image ID, config.json is adapted and then runc restores the container with the help of CRIU. The container can get a new ID if necessary, but, at least in Podman, I use the same ID as during checkpointing if possible. Migrating a Pod would mean creating a new Pod (based on the properties of the original Pod), creating a pause container, restoring all checkpointed containers into the new Pod and thus joining the corresponding namespaces. The one thing which I am not yet sure how to solve is how to transfer the checkpoint from the source host to the destination host.
If they are still running on the source of the migration, yes they will continue to run (but not re-run), as they will be migrated as any other container. We know the image the init container is based on, so restore can easily pull it from a registry and every other part of the state is included in the checkpoint. If they are no longer running in the Pod they will not be migrated. It is different from when a Pod restarts, where, according to the documentation, init containers are running again. I agree that init containers could sound like a possible problematic point. If the initialization somehow creates a connection between the containers and the host then it brakes when moving the container to another host. Not sure if this is acceptable, but this is something the user is responsible for: Just because it is possible it still might not be a good idea to do it, depending on the application. Related to init containers and relations between the host and the Pod/container. Podman, runc and CRIU are able to migrate established TCP connections. This, however, only works if the IP address of the migrated container is the same as of the checkpointed container. In Podman I am blocking migrations if the IP is not available during restore. CRIU offers the possibility to migrate with established TCP connections, but CRIU also offers to just close all TCP connections, so even if established TCP connections are open they will be closed on restore. The default behaviour is to not checkpoint a process if established TCP connections exists if neither the option to keep established connections or close established connections has been used. Mentioning this in case it is related to init containers and relations between the host and the Pod/container
Not 100% sure I understand this correctly. This can either mean that checkpointing is not useful at all, because containers are supposed to be stateless and can be stopped and started any time and the application will automatically adapt and scale. If this is your question, then it is really hard to answer, because checkpoint/restore is indeed somehow antithetical to the paradigm that containers are stateless. I guess I can only say, that they are not always stateless and especially looking at the already listed use cases (in the KEP) it is useful from my point of view. If your question is more about why migrate a Pod and all its container instead of why not create a new Pod and restore the containers into the new Pod, then I think the answer is easy, because I described it above, that this exactly how I would do it. Create a new Pod with a new pause container and restore the containers into that Pod and join the pause container's namespaces.
By providing the ability to checkpoint containers out of running Pods, even with keeping the container running, and the possibility to restore containers into existing Pods all possible use cases should be covered. Use cases could be:
|
Thanks for listing this. I will try to add additional comments to those examples.
This is part of the KEP document and currently already possible with my CRI-O changes.
Also a good one. This example, as well as the previous example, can be seen here using Podman: https://www.redhat.com/en/blog/container-migration-podman-rhel
Using the keep running functionality this is also possible using Podman and should also be easy to implement in CRI-O next to my existing changes.
Based on all the above also not difficult. Especially as CRIU provides tools to convert a checkpoint to a coredump.
This is of course one of the main goals, but, in the end, it is nothing more than checkpoint, transfer and restore.
Thanks for the pointer. Not yet clear what this means, but I will find out. Thanks @derekwaynecarr and @mikebrow for your feedback. Let me know (here or on Slack) if my answers are helpful and if there additional topics which need to be discussed. |
This is a simplification, but to express an idea:
If we zoom out and ask 'what is the user goal to do the checkpoint/restore?' we have to ask which of the items above persist, and manifest themselves on the new pod, how the scheduler schedules that new pod to the new host (and when). I assume at some point in a larger system architecture a pod is 'frozen' and then migrated to another clone of said pod scheduled to new host. There is a lot that has to happen in the pod API model and lifecycle to facilitate that beyond just the discrete containers. Maybe narrowing the use case to a particular set of pods would help. |
Maybe a good way to frame this is 'if a workload wanted to do a checkpoint/restore for something that was managed by a StatefulSet workload controller, what is the proposed set of changes to meet that use case?'. |
You could consider checkpointing to be part of the desired state of a If that's the case, then one way to implement this is to extend the Would that approach work? If so, the KEP for that would be to extend the kubelet's behavior either to allow a way for plugins to interact with a Pod's containers for checkpointing, or directly implement it. Later you might also need to add support to the scheduler to know that checkpointed Pods need special treatment, perhaps - that could be via a scheduling plugin. If this feature can work without changes to the code running in the cluster control plane, then I'd like it to work that way. |
Just musing on this briefly - the kubelet automatically registers “mirror” |
Fair, with a skim of the code the additional runc state(s), I see, are an artifact of the current implementation for the containerd ctr checkpoint command which uses runc's pause command before executing the checkpoint then resumes the container afterwards. Which may or may not be a valid/valuable pattern/use case for your kep. |
I've been working on Pod migration for a few months as I'm writing a masters thesis about it. I haven't shared too much so far, also because I'm not completely satisfied with my results yet and also because I needed to focus on writing for a while. Regarding the API extensions I tried to collect my thoughts in this google doc. I'll try to update it with my most recent ideas as it is probably a little bit outdated. In general there are a lot of good reasons to have a new resource and a migration controller to go along with it. First of all it makes it easier to trigger and track the migration from a user perspective. There are also things that are hard to do in an elegant way without it, most importantly deleting the old Pod if a migration takes place. My current approach for transferring the checkpoint between nodes is a mounted volume. As runc writes the checkpoint to disk this is probably one of the fastest ways to transfer it. A migration controller could use PersistentVolumeClaims to mount a volume on both nodes for the migration.
I would suggest to do this with two controllers. I'd extend StatefulSet or implement e.g. a MigratingStatefulSet that either
While I agree this is useful for some of the usecases, e.g. auto-scaling and faster Pod startup in general, it's probably not ideal for migration as this would be slower than moving containers directly. In addition, ways of migration with less downtime (CRIU supports pre- and post-copy) can't really be implemented that way. If migration is implemented, short downtime should be one of the goals as that's one of the main reasons for migration. In order to create and store checkpoints, a
I would argue to require a Pod to be finished with init containers before it can be migrated. If it's still running init containers it can probably also just be deleted and recreated. (This of course depends on what init containers get used for).
From what I've seen this is very tricky. Many networking plugins define a CIDR per node, which wouldn't allow to migrate the IP. I'm currently using a Service to abstract the changing IP, but that does break open tcp connections.
For basic migration it's actually not that hard. I'd suggest to define certain constrains for Pods that should be able to migrate, e.g. only RWMany Volumes, no extra/local resources, etc. Especially Pods that use special/local resources might be extremely hard to migrate. I've implemented a very rough but working prototye for Pod migration. I've tried to explain how to set it up here, as I did that rather quickly it's probably missing a few things. The binaries are also not up to date anymore. If you're interested I could do a demo though. I'll also share the thesis with the community once it's completed (and I'm happy with it) at the end of december. In case any of you would be open to read a earlier version and provide feedback, that would be a tremendous help, also because I don't have a lot of experience with Kubernetes yet. Please reach out on Slack if you'd like to. As the things I've said here will also appear in the thesis, please point out things that don't seem right, that also helps a lot! There is also an issue regarding Pod migration btw: kubernetes/kubernetes#3949 |
I am happy to see input from @schrej, because it seems he really thought about the high level view much more than me. I was always looking at the low-level implementation details. One, for me, very interesting result of this discussion is, that everyone, except me, is mostly talking about Pod migration. For me it was always more about container migration which would be limited by the Pod definition or policy. Right now I am thinking it probably is more or less the same, just viewing it from a different angle. What worries me most when talking about Pods instead of containers is that it seems to limit the possibilities of checkpoint/restore to migration only. Doing a fast startup of initialized and checkpointed container using restore seems to be missing from the Pod migration discussion. One of my goals to move forward with checkpoint/restore/migration was to focus on a limited set of features in the first step. That was also the reason I was only proposing checkpoint/restore at the API level. I understand now, that introducing changes on the API level requires a thought-out high level use case, which makes sense to not just introduce low-level API changes for nothing. Looking at the code @schrej has written, it is indeed very rough 😉 (just like @schrej mentioned), but maybe his thoughts on the Pod API can be very helpful to move this forward. There are now two implementations moving the migration topic forward: @schrej's and mine. On the lower, CRI-API, level the implementations are similar, which is probably a good sign that it is not totally wrong. Going back to the KEP process level: If @schrej would provide a KEP describing the Pod API changes he has so far and referencing this KEP, would that be a good step forward? Combing our two efforts into one, would that help? One effort, two KEPs. Would that work for you @schrej? I would be happy to help out with a KEP describing the Pod API changes and also doing the implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ab821b9
to
8f8cc15
Compare
|
||
Although *checkpoint* and *restore* can be used to implement container | ||
migration this KEP is only about enabling the forensic use case. Checkpointing | ||
a pod is not part of this proposal and left for future enhancements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpointing a pod may be required for the forensic use case in the case of sandboxed (e.g. micro VM or gVisor) pods. I think it's fine to leave that out of the initial scope, but just wanted to name it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Checkpointing a pod is not a problem and we demonstrated it in one of the previous proof of concept implementations. We just left it out for this KEP as we only wanted to focus on the things we want to do.
the corresponding code pull request is extended to support *checkpoint* and | ||
*restore* in the CRI API. The reason to add *restore* to the CRI API without | ||
implementing it in the kubelet is to make development and especially testing | ||
easier on the container engine level. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a corresponding proposal for the implementation of the API in cri-o and containerd? I think it would be helpful to get CRI implementations on board before dictating the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes: cri-o/cri-o#4199
We used the CRI-O implementation to checkpoint and restore containers and pods in one of the early proof of concept implementations.
There is also: kubernetes-sigs/cri-tools#662
|
||
The goal of this KEP is to introduce *checkpoint* and *restore* to the CRI API. | ||
This includes extending the *kubelet* API to support checkpointing single | ||
containers with the forensic use case in mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the scope of what's included in the checkpoint? Memory snapshot? Writeable layer snapshot? RO snapshot? What about volumes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simple answer is, everything that is in the container.
Everything external (devices, mounts, volumes) is not.
Volumes are special because they exist as unnamed volumes (I hope that is the right name) that is included in our current Podman implementation. External volumes are not included.
Everything external needs additional handling. Right now it is manually but it can become part of Kubernetes at some point. Depending on the type of external resources if it can be migrated to a new location.
API: | ||
|
||
``` | ||
curl -skv -X POST "https://localhost:10250/checkpoint/default/counters/wildfly" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be seen as a highly privileged operation. How will it be authorized?
This seems on par with exec permissions, but exec uses the highly overloaded proxy
subresource for the authorization check. I'd prefer to use a new dedicated subresource, preferably scoped to the container for this. I'm not aware of us doing this anywhere else, but it would be interesting if it required 2 authorization checks: that you have access to the checkpoint subresource on the node (since it's hitting a node API directly), and the checkpoint subresource on the pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be seen as a highly privileged operation. How will it be authorized?
Not sure I understand this question correctly. To do this operation the normal kubelet API access restrictions apply. I can only do it without authorization if I set KUBELET_AUTHORIZATION_WEBHOOK="false"
and KUBELET_FLAGS="--anonymous-auth=true --authorization-mode=AlwaysAllow"
. Isn't that enough? Are you thinking of additional checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kubelet authorization (without AlwaysAllow
) works by sending a SubjectAccessReview
request to the apiserver, which includes (among other things) the resource being checked (currently always node) and a subresource. You can find the mapping for the existing Kubelet API here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/auth.go#L59-L112
At the bottom of that link, you'll see that we use a dedicated subresource for some requests (logs, stats & metrics) but everything else just uses the highly overloaded proxy
subresource.
I'd like for all new Kubelet APIs to use a new dedicated subresource (e.g. checkpoint
). My comment about checking the subresource against the node & pod probably doesn't make sense as a 1-off (maybe something to consider if we ever overhaul the Kubelet API)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tallclair thanks for the feedback, for the forensic use case, the initial desire was root on the node but not root in the cluster.
@adrianreber i think the net of this is we need to ensure that the test case for this function does not redirect back to proxy, but is its own dedicated subresource. for beta, we should add a section on authorization.
see:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/auth_test.go#L111
@tallclair Thanks for taking a look at this KEP. For me it seems that most questions you asked have already been asked during the 1.5 years lifetime of this KEP. It is probably hard to find everything in more than 100 comments. Please have a look at my answers and if they answer your questions please mark the conversations as resolved so that I know if the answers work from your side. At some point this KEP was really huge and contained an answer for every possible question, but we removed most things to focus on the points we want to achieve. If you any further questions I am happy to answer them all. |
### Test Plan | ||
|
||
For alpha: | ||
- Unit tests available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There also should be an e2e test for this for alpha. I'm not sure how we would unit test the new endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I went for unit tests only is that the none of the CRI implementations are providing the necessary functionality yet. So we would get a "NotImplemented". I can do that. Would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, it would give an extension point for the future so not critical.
|
||
###### How can this feature be enabled / disabled in a live cluster? | ||
|
||
Using the feature gate `ContainerCheckpointRestore` the feature can be enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the template:
enhancements/keps/NNNN-kep-template/README.md
Lines 394 to 408 in c0e42f4
###### How can this feature be enabled / disabled in a live cluster? | |
<!-- | |
Pick one of these and delete the rest. | |
--> | |
- [ ] Feature gate (also fill in values in `kep.yaml`) | |
- Feature gate name: | |
- Components depending on the feature gate: | |
- [ ] Other | |
- Describe the mechanism: | |
- Will enabling / disabling the feature require downtime of the control | |
plane? | |
- Will enabling / disabling the feature require downtime or reprovisioning | |
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). |
The feature gate only applies to the kubelet, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the feature gate only applies to the kubelet.
###### Are there any tests for feature enablement/disablement? | ||
|
||
Unit tests will temporarily enable the `ContainerCheckpointRestore` feature gate | ||
to ensure that the unit tests are always running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This question is about testing turning the feature on and off, not merely testing with the feature gate on.
This will be covered so long as you complete manual testing of turning the feature gate on and off on an individual node.
|
||
## Drawbacks | ||
|
||
Not aware of any. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any performance or security implications of this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
8f8cc15
to
3c4eeea
Compare
@ehashman Thanks a lot for taking a look at this KEP. I tried to address a couple of your review comments. I was unsure about the feature gate name, but I changed it now to just About e2e testing. If I get a not implemented back from the CRI implementation, is that good enough? If the CRIs are adding support for it the tests need to be adapted as they will return something positive (I hope). I am not sure I filled out the Please take another look if you have a chance and let me know if my changes were not correct or if I need to provide more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
for PRR
|
||
###### Are there any tests for feature enablement/disablement? | ||
|
||
Currently no. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure you manually test this. (i.e. run kubelet with feature gate on, test feature, turn feature gate off and restart it, test that it's disabled)
Signed-off-by: Adrian Reber <areber@redhat.com>
3c4eeea
to
891ba31
Compare
/assign @dchen1107 |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
this can be refined further, but is good to proceed on implementation.
use case has been discussed extensively in sig.
API: | ||
|
||
``` | ||
curl -skv -X POST "https://localhost:10250/checkpoint/default/counters/wildfly" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tallclair thanks for the feedback, for the forensic use case, the initial desire was root on the node but not root in the cluster.
@adrianreber i think the net of this is we need to ensure that the test case for this function does not redirect back to proxy, but is its own dedicated subresource. for beta, we should add a section on authorization.
see:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/auth_test.go#L111
For alpha: | ||
- Unit tests available | ||
|
||
For beta: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add detail on checkpoint authorization, we will need to restrict access to the kubelet api resource.
on the container runtime, the actual checkpoint is stored in a location is restricted, but prior to beta, we need clear security practices documented.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: adrianreber, derekwaynecarr, ehashman, rst0git The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I want to thank everyone who was involved getting this reviewed and now merged. Thanks! The corresponding code pull request is ready for review at kubernetes/kubernetes#104907 I updated the pull request to include e2e tests and renamed the feature gate like suggested by @ehashman |
Hi @adrianreber, I really enjoyed your Kubecon talk from last October, and while I understand that this PR is very much closed, I thought I might add another use case that could rely upon the In the HPC world, the SLURM scheduler includes time slicing functionality that gives a cluster the ability to alternate between two or more batch jobs on a node by using the In certain contexts, I could see something like the SLURM gang scheduler being useful in the Kubernetes world. Namely, if a Kubernetes cluster is oriented primarily towards data processing and makes heavy use of the It might be argued that with certain Kubernetes distributions providing the ability to autoscale resources that this sort of Hopefully what I've said here isn't totally crazy and doesn't create unnecessary noise. It's just something that occurred to me when I was watching your talk. |
@jpellman A couple of years ago when I started to look into checkpoint/restore it was also in an HPC environment, so that I am aware of different HPC use cases. Thanks for bringing up additional use cases. I will try to keep them in mind when working on this in the future. |
So looking forward to this feature!!! I've been expecting CRIU + k8s since several years ago. I would like to supplement a use case in HPC world: usually training jobs(such as MPI jobs, pytorch jobs, etc.) are fragile to disaster, stateful(especially model states and optimizer states on GPU), not scalable at all. They may last days to weeks. State-of-the-art AI + k8s solution is not ideal. At first, any disastrous error, such as kernel panic, electricity outage, even OOM, etc. make the training jobs fail at all. Secondly, jobs are not scalable. If you create a job with N pods, each using Y GPUs, the size is static. You cannot resize it at runtime. Kubernetes features elastic, but AI cannot benefit from it. At last, GPU virtualization is not mature. Usually GPUs are still pass-through to container. As a result, in large clusters, resource fragment is a huge problem. For example, a server with 8 GPU, now holds 6 jobs, each using 1 GPU. The remaining 2 GPU is a fragment resource, it cannot serve a 8 GPU training job. Researchers complain about fragmentation, but they get really mad if you stop their job and move to another host. Let's raise a bigger problem, we take it for granted that AI on cloud(usually means k8s) is a trend. But AI on cloud is not equal to simply putting AI workload in container and developing some operators. "Count on the framework.", people may say, hoping pytorch gets rid of everything. Yes, pytorch is on the way. Training framework may do periodically checkpointing(not container checkpoint) and saves data on shared file system. Job will recover from that checkpoint. By this mean, job only loses part of its state due to checkpointing occurs at the end of an epoch. Pytorch also supports elastic training, making resizing at runtime possible. But are there alternatives? What can cloud provider do to accelerate the pace? Microsoft releases its work Singurality, an AI infra that enables cross-region scheduling, live migration, job scaling. But a first problem is that k8s does not support checkpointing! With this work, I see the possibility of building a real elastic AI infra. Jobs can be checkpointed and restored without loss, scheduler can do housework to avoid fragmentation. AI is far from cloud native, but I'm happy to see gaps are filled one by one. Hopefully one day, we can really enjoy cloud-native AI! |
I second your excitement about the potential of this for HPC jobs. As part of my master thesis (available soon), I also evaluated the potential of migration to increase resource utilization of HPC jobs with unknown resource requirements and it seems feasible even for huge containers of more than 100 GB. But I want to add that the proposal is only a first tiny step toward it. I think it will still take some time to have enough functionality for it to be usable in practice (it took almost one year to merge the minimal PR). For the time, I think we still need to rely on custom solutions. Some industry solutions are also available already: |
This KEP is about Forensic Container Checkpointing
Ref: #2008