-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA for 1.31 #125488
DRA for 1.31 #125488
Conversation
Skipping CI for Draft Pull Request. |
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This is a complete revamp of the original API. Some of the key differences: - refocused on structured parameters and allocating devices - support for constraints across devices - support for allocating "all" or a fixed amount of similar devices in a single request - no class for ResourceClaims, instead individual device requests are associated with a mandatory DeviceClass For the sake of simplicity, optional basic types (ints, strings) where the null value is the default are represented as values in the API types. This makes Go code simpler because it doesn't have to check for nil (consumers) and values can be set directly (producers). The effect is that in protobuf, these fields always get encoded because `opt` only has an effect for pointers. The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new "request" field. This is considered acceptable because the entire `claims` field in the pod spec is still alpha. The implementation is complete enough to bring up the apiserver. Adapting other components follows.
Publishing ResourceSlices now supports network-attached devices and the new v1alpha3 API. The logic for splitting up across different slices is missing.
This adds the ability to select specific requests inside a claim for a container. NodePrepareResources is always called, even if the claim is not used by any container. This could be useful for drivers where that call has some effect other than injecting CDI device IDs into containers. It also ensures that drivers can validate configs. The pod resource API can no longer report a class for each claim because there is no such 1:1 relationship anymore. Instead, that API reports claim, API devices (with driver/pool/device as ID) and CDI device IDs. The kubelet itself doesn't extract that information from the claim. Instead, it relies on drivers to report this information when the claim gets prepared. This isolates the kubelet from API changes. Because of a faulty E2E test, kubelet was told to contact the wrong driver for a claim. This was not visible in the kubelet log output. Now changes to the claim info cache are getting logged. While at it, naming of variables and some existing log output gets harmonized. Co-authored-by: Oksana Baranova <oksana.baranova@intel.com> Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
The advantages of using a validation admission policy (VAP) are that no changes are needed in Kubernetes and that admins have full flexibility if and how they want to control which users are allowed to use "admin access" in their requests. The downside is that without admins taking actions, the feature is enabled out-of-the-box in a cluster. Documentation for DRA will have to make it very clear that something needs to be done in multi-tenant clusters. The test/e2e/testing-manifests/dra/admin-access-policy.yaml shows how to do this. The corresponding E2E tests ensures that it actually works as intended. For some reason, adding the namespace to the message expression leads to a type check errors, so it's currently commented out.
The resource claim controller is completely agnostic to the claim spec. It doesn't care about classes or devices, therefore it needs no changes in 1.31 besides the v1alpha2 -> v1alpha3 renaming from a previous commit.
The structured parameter allocation logic was written from scratch in staging/src/k8s.io/dynamic-resource-allocation/structured where it might be useful for out-of-tree components. Besides the new features (amount, admin access) and API it now supports backtracking when the initial device selection doesn't lead to a complete allocation of all claims. Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com> Co-authored-by: John Belamaric <jbelamaric@google.com>
In the API, the effect of the feature gate is that alpha fields get dropped on create. They get preserved during updates if already set. The PodSchedulingContext registration is *not* restricted by the feature gate. This enables deleting stale PodSchedulingContext objects after disabling the feature gate. The scheduler checks the new feature gate before setting up an informer for PodSchedulingContext objects and when deciding whether it can schedule a pod. If any claim depends on a control plane controller, the scheduler bails out, leading to: Status: Pending ... Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. The rest of the changes prepare for testing the new feature separately from "structured parameters". The goal is to have base "dra" jobs which just enable and test those, then "classic-dra" jobs which add DRAControlPlaneController.
This fixes the message (node name and "cluster-scoped" were switched) and simplifies the VAP: - a single matchCondition short circuits completely unless they're a user we care about - variables to extract the userNodeName and objectNodeName once (using optionals to gracefully turn missing claims and fields into empty strings) - leaves very tiny concise validations Co-authored-by: Jordan Liggitt <liggitt@google.com>
Some of the E2E node tests were flaky. Their timeout apparently was chosen under the assumption that kubelet would retry immediately after a failed gRPC call, with a factor of 2 as safety margin. But according to kubernetes@0449cef, kubelet has a different, higher retry period of 90 seconds, which was exactly the test timeout. The test timeout has to be higher than that. As the tests don't use the gRPC call timeout anymore, it can be made private. While at it, the name and documentation gets updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
for scheduler changes.
PS: there was one nit for integration test coverage. For now DRA is using scheduler_perf tests for both perf and functionality tests. However, scheduler_perf is not included in a PR's presubmit-CI, so it makes sense to add some integration tests under test/integration/scheduler. Not a blocker though.
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra Setup failed with a flake (? "Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory").
It is enabled, and there is one test scenario ("SchedulingWithResourceClaimTemplateStructured") for the new API. More can and will be added once we start investigating performance a bit more. Right now, it's more about functionality. @klueska: would you do us the honor and LGTM together with lifting the hold? We have positive reviews and approval from (not collecting links, but they are there...):
Thanks everyone! |
Gladly. Thanks everyone! One step closer... /approve |
LGTM label has been added. Git tree hash: ce1d7771db93f922ecda8bf61041a6be1c84de6d
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Huang-Wei, klueska, pohly, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This behavior was useful for kubernetes#125488 but wasn't obvious when reading the documentation.
The manual deep comparison code is hard to maintain (would need to be updated in kubernetes#125488) and error prone. In fact, one test case failed when doing a full automatic comparison with cmp.Diff because it wasn't setting allMemory.
This behavior was useful for kubernetes#125488 but wasn't obvious when reading the documentation.
The manual deep comparison code is hard to maintain (would need to be updated in kubernetes#125488) and error prone. In fact, one test case failed when doing a full automatic comparison with cmp.Diff because it wasn't setting allMemory.
This behavior was useful for kubernetes#125488 but wasn't obvious when reading the documentation.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is an implementation of the revised API and features from kubernetes/enhancements#4709.
Which issue(s) this PR fixes:
Related-to:
Fixes: #125665, #124041, #125210
Special notes for your reviewer:
Several of these changes where already proposed earlier in separate PRs and/or might get split out. Here's a list:
In this PR, please only review commits starting with "DRA: remove immediate allocation".
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: