-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kube-scheduler: Support extender managed extended resources in kube-scheduler #60332
Conversation
/retest |
/retest |
/lgtm |
// PredicateMetadataProducer that creates predicate metadata with the provided | ||
// ignoredExtendedResources. | ||
func RegisterPredicateMetadataProducerWithIgnoredExtendedResources(ignoredExtendedResources sets.String) { | ||
RegisterPredicateMetadataProducer("PredicateWithIgnoredExtendedResources", func(pm *predicateMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the predicate name from PodFitsResources
to PredicateWithIgnoredExtendedResources
.
IIUC, the "precomp" function that sets the ignoredExtendedResources
will override the one for the same predicate created here:
kubernetes/pkg/scheduler/factory/plugins.go
Line 217 in 01bd3c4
predicates.RegisterPredicateMetadataProducer(policy.Name, precomputationFunction) |
Is this the case? And I don't see why predicateMetadataProducers
is not a slice here:
var predicateMetadataProducers = make(map[string]PredicateMetadataProducer) |
@@ -95,6 +134,7 @@ func (pfactory *PredicateMetadataFactory) GetMetadata(pod *v1.Pod, nodeNameToInf | |||
podRequest: GetResourceRequest(pod), | |||
podPorts: schedutil.GetContainerPorts(pod), | |||
matchingAntiAffinityTerms: matchingTerms, | |||
ignoredExtendedResources: sets.NewString(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? Can't it be default (nil)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed. Will remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// Missing extended resources will be ignored by default unless specified | ||
// in predicate metadata. Note that kube-scheduler sets this flag to false | ||
// in predicate metadata. | ||
ignoreMissingExtendedResources := true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of this variable is confusing. I think default value of "true" is a backward incompatible change. Besides, the value of predicateMeta.ignoreMissingExtendedResources
is false by default. So, I guess it makes sense to make the default value false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this over the weekend!
Yes, this is a backward incompatible change but I haven't come up with a promising solution yet.
In PodFitsResources
, we cast algorithm.PredicateMetadata
to predicate.predicateMetadata
and then use the metadata, including ignoreMissingExtendedResources
in this PR.
kubernetes/pkg/scheduler/algorithm/predicates/predicates.go
Lines 716 to 717 in 0394ffb
if predicateMeta, ok := meta.(*predicateMetadata); ok { | |
podRequest = predicateMeta.podRequest |
This is how the predicate is invoked in pod admission on node.
fit, reasons, err := predicates.GeneralPredicates(pod, nil, nodeInfo) |
nil
as the predicate metadata, which means we use the defaults in pod admission on node. That's why I set ignoreMissingExtendedResources
default to true
.
Ideally, kubelet should create a predicate metadata with ignoreMissingExtendedResources
set to true
and pass it to the predicate. But this is not possible because of the type cast in the predicate function.
I think the type cast should be removed, and the metadata should be accessed via the algorithm.PredicateMetadata
interface. kube-scheduler/node/clusterautoscalar can provide their own metadata implementation. But this will be a larger change. Maybe I'm missing some thing. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So, for now please add a comment and explain that this default value is chosen so that the kubelet code works fine and add the details.
My only concern right now is that this is backward incompatible from the kubelet side. It is however, backward compatible on the scheduler side, because if scheduler's policy config does not exists or misses this field, the default value of the field will be false and overrides the default value provided here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created algorithm.PredicateMetadataProducer
in kubelet using predicates.NewPredicateMetadataFactory()
.
It turned out that NewPredicateMetadataFactory
's podLister
argument is not used at all so I can just use this function in kubelet.
Now, the default value of predicateMeta.ignoreMissingExtendedResources
is false.
podName string | ||
extenders []algorithm.SchedulerExtender | ||
|
||
expectedBinderType string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you check the pointer to the extender instead of its string type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have access to the pointer to the default binder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Never mind.
pkg/scheduler/scheduler_test.go
Outdated
}, { | ||
sendPod: podWithID(testBindingPodName, ""), | ||
algo: mockScheduler{testNode.Name, nil}, | ||
expectBind: testBinding, // testBinding has a name different than testBindingPodName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused. Why is the name different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing and not that useful. I already added full tests in TestGetBinderFunc
in pkg/scheduler/factory/factory_test.go
so this should have been removed.
/hold cancel |
06d7769
to
33cb03b
Compare
// Missing extended resources will be ignored by default unless a predicate | ||
// metadata is provided. | ||
// | ||
// This predicate is used by scheduler, kubelet, cluster autoscaler and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchen1107 @mwielgus
This is a backward incompatible change in kubelet and autoscaler, but it applies only to Pods that have extended resources missing on the node. Do you guys approve such a change?
@@ -606,8 +606,16 @@ func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) { | |||
} | |||
|
|||
for _, removedResource := range removedDevicePlugins { | |||
glog.V(2).Infof("Remove capacity for %s", removedResource) | |||
delete(node.Status.Capacity, v1.ResourceName(removedResource)) | |||
glog.V(2).Infof("Set capacity for %s to 0 on device removal", removedResource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiayingz, please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should allocatable also be set to 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocatable will be set to capacity at line 644.
/lgtm for the device plugin related change. |
@bsalamat and @vishh. I changed the default behavior for cluster autoscaler back to what it does for now. I keep |
Per discussion with @vishh offline, I removed the |
pkg/kubelet/lifecycle/predicate.go
Outdated
|
||
// Remove the requests of the extended resources that are missing in the | ||
// node info. This is required to support cluster-level resources, which | ||
// are extended resources unknown to nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an issue where if a pod was manually (or incorrectly) bound to a node where a node level extended resource it requires is not found, then kubelet will not fail admission. We should document this as known issue and fix it with resource APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
@@ -39,7 +39,7 @@ func init() { | |||
// Register functions that extract metadata used by predicates and priorities computations. | |||
factory.RegisterPredicateMetadataProducerFactory( | |||
func(args factory.PluginFactoryArgs) algorithm.PredicateMetadataProducer { | |||
return predicates.NewPredicateMetadataFactory() | |||
return predicates.NewPredicateMetadataFactory(args.PodLister) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that argument even necessary? @bsalamat can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be removed. If yes, I can do that in a separate PR in 1.11.
/lgtm |
), | ||
expectedPod: makeTestPod( | ||
v1.ResourceList{}, | ||
v1.ResourceList{"foo.com/bar": quantity}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, shouldn't this resource request be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The request is in Limits
which is skipped in this logic. I will refine this test once the overall design is LGTM'd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. So this is limit and we leave limit unchanged. For extended resources, we actually require limits to equal to requests and in devicemanager, we actually examine resource Limits during resource allocation. Can we make sure to also remove limits for non-existing extended resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind. Whether we have Limit or not doesn't affect GeneralPredicate evaluation.
/lgtm |
/lgtm |
Can you squash the commits @yguo0905 ? /lgtm |
Co-authored-by: Yang Guo <ygg@google.com> Co-authored-by: Chun Chen <chenchun.feed@gmail.com>
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, jiayingz, vishh, yguo0905 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Automatic merge from submit-queue (batch tested with PRs 60236, 60332, 57375, 60451, 57408). If you want to cherry-pick this change to another branch, please follow the instructions here. |
What this PR does / why we need it:
This is the continuation of #58851.
IgnoredByScheduler
flag can be set along with each of such resources. If this flag is set to true, the scheduler will not check the resource in thePodFitsResources
predicate.PodFitsResources
predicate. Now, by default,PodFitsResources
will ignore the extended resources that are not in node status. This is required to support extender managed extended resources (including cluster-level resources) on node. Note that in kube-scheduler we override the default behavior by not ignoring such missing extended resources.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #53616 #58850
Special notes for your reviewer:
Release note: