Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kube-scheduler: Support extender managed extended resources in kube-scheduler #60332

Merged
merged 1 commit into from
Feb 28, 2018

Conversation

yguo0905
Copy link
Contributor

@yguo0905 yguo0905 commented Feb 23, 2018

What this PR does / why we need it:

This is the continuation of #58851.

  • This PR adds new extender configurations in scheduler policy config.
    • A set of extended resource names can be specified in an extender config. They are the resources that are managed by the extender. The scheduler will only send pods to the extender if the pod requests at least one of the extended resources in the set.
    • An IgnoredByScheduler flag can be set along with each of such resources. If this flag is set to true, the scheduler will not check the resource in the PodFitsResources predicate.
  • This PR also changes the default behavior of the PodFitsResources predicate. Now, by default, PodFitsResources will ignore the extended resources that are not in node status. This is required to support extender managed extended resources (including cluster-level resources) on node. Note that in kube-scheduler we override the default behavior by not ignoring such missing extended resources.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #53616 #58850

Special notes for your reviewer:

Release note:

Support extender managed extended resources in kube-scheduler

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 23, 2018
@yguo0905
Copy link
Contributor Author

/assign @bsalamat @vishh

@yguo0905
Copy link
Contributor Author

/retest

@vishh vishh added this to the v1.10 milestone Feb 24, 2018
@yguo0905
Copy link
Contributor Author

/retest

@vishh
Copy link
Contributor

vishh commented Feb 24, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 24, 2018
// PredicateMetadataProducer that creates predicate metadata with the provided
// ignoredExtendedResources.
func RegisterPredicateMetadataProducerWithIgnoredExtendedResources(ignoredExtendedResources sets.String) {
RegisterPredicateMetadataProducer("PredicateWithIgnoredExtendedResources", func(pm *predicateMetadata) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the predicate name from PodFitsResources to PredicateWithIgnoredExtendedResources.

IIUC, the "precomp" function that sets the ignoredExtendedResources will override the one for the same predicate created here:

predicates.RegisterPredicateMetadataProducer(policy.Name, precomputationFunction)

Is this the case? And I don't see why predicateMetadataProducers is not a slice here:

var predicateMetadataProducers = make(map[string]PredicateMetadataProducer)

@yguo0905 yguo0905 changed the title [Scheduler] Support extender managed extended resources [WIP] kube-scheduler: Support extender managed extended resources Feb 24, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2018
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2018
@@ -95,6 +134,7 @@ func (pfactory *PredicateMetadataFactory) GetMetadata(pod *v1.Pod, nodeNameToInf
podRequest: GetResourceRequest(pod),
podPorts: schedutil.GetContainerPorts(pod),
matchingAntiAffinityTerms: matchingTerms,
ignoredExtendedResources: sets.NewString(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? Can't it be default (nil)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed. Will remove this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Missing extended resources will be ignored by default unless specified
// in predicate metadata. Note that kube-scheduler sets this flag to false
// in predicate metadata.
ignoreMissingExtendedResources := true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of this variable is confusing. I think default value of "true" is a backward incompatible change. Besides, the value of predicateMeta.ignoreMissingExtendedResources is false by default. So, I guess it makes sense to make the default value false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this over the weekend!

Yes, this is a backward incompatible change but I haven't come up with a promising solution yet.

In PodFitsResources, we cast algorithm.PredicateMetadata to predicate.predicateMetadata and then use the metadata, including ignoreMissingExtendedResources in this PR.

if predicateMeta, ok := meta.(*predicateMetadata); ok {
podRequest = predicateMeta.podRequest

This is how the predicate is invoked in pod admission on node.

fit, reasons, err := predicates.GeneralPredicates(pod, nil, nodeInfo)
Here we pass nil as the predicate metadata, which means we use the defaults in pod admission on node. That's why I set ignoreMissingExtendedResources default to true.

Ideally, kubelet should create a predicate metadata with ignoreMissingExtendedResources set to true and pass it to the predicate. But this is not possible because of the type cast in the predicate function.

I think the type cast should be removed, and the metadata should be accessed via the algorithm.PredicateMetadata interface. kube-scheduler/node/clusterautoscalar can provide their own metadata implementation. But this will be a larger change. Maybe I'm missing some thing. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So, for now please add a comment and explain that this default value is chosen so that the kubelet code works fine and add the details.
My only concern right now is that this is backward incompatible from the kubelet side. It is however, backward compatible on the scheduler side, because if scheduler's policy config does not exists or misses this field, the default value of the field will be false and overrides the default value provided here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Besides scheduler and kubelet, cluster autoscaler and damonset controller also use this predicate. But I don't see any negative impacts from this change. This is also the behavior before 1.8 #53616.

@vishh @jiayingz

Copy link
Contributor Author

@yguo0905 yguo0905 Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created algorithm.PredicateMetadataProducer in kubelet using predicates.NewPredicateMetadataFactory().

It turned out that NewPredicateMetadataFactory's podLister argument is not used at all so I can just use this function in kubelet.

Now, the default value of predicateMeta.ignoreMissingExtendedResources is false.

podName string
extenders []algorithm.SchedulerExtender

expectedBinderType string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you check the pointer to the extender instead of its string type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have access to the pointer to the default binder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Never mind.

}, {
sendPod: podWithID(testBindingPodName, ""),
algo: mockScheduler{testNode.Name, nil},
expectBind: testBinding, // testBinding has a name different than testBindingPodName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused. Why is the name different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing and not that useful. I already added full tests in TestGetBinderFunc in pkg/scheduler/factory/factory_test.go so this should have been removed.

@yguo0905 yguo0905 changed the title [WIP] kube-scheduler: Support extender managed extended resources kube-scheduler: Support extender managed extended resources in kube-scheduler Feb 25, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 25, 2018
@yguo0905
Copy link
Contributor Author

/hold cancel

// Missing extended resources will be ignored by default unless a predicate
// metadata is provided.
//
// This predicate is used by scheduler, kubelet, cluster autoscaler and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchen1107 @mwielgus
This is a backward incompatible change in kubelet and autoscaler, but it applies only to Pods that have extended resources missing on the node. Do you guys approve such a change?

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 27, 2018
@@ -606,8 +606,16 @@ func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) {
}

for _, removedResource := range removedDevicePlugins {
glog.V(2).Infof("Remove capacity for %s", removedResource)
delete(node.Status.Capacity, v1.ResourceName(removedResource))
glog.V(2).Infof("Set capacity for %s to 0 on device removal", removedResource)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiayingz, please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should allocatable also be set to 0 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocatable will be set to capacity at line 644.

@jiayingz
Copy link
Contributor

/lgtm for the device plugin related change.

@yguo0905
Copy link
Contributor Author

@bsalamat and @vishh. I changed the default behavior for cluster autoscaler back to what it does for now.

I keep ignoreMissingExtendedResources as a flag rather than a list of resources that are required. I think ff we need to support new extended resources in cluster autoscaler, it's very likely that we will add the support in its cloud provider on a case by case basis.

@yguo0905
Copy link
Contributor Author

Per discussion with @vishh offline, I removed the ignoreMissingExtendedResource flag. Instead, we remove the requests of the extended resources from pod if the resources are missing from node status in kubelet.


// Remove the requests of the extended resources that are missing in the
// node info. This is required to support cluster-level resources, which
// are extended resources unknown to nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue where if a pod was manually (or incorrectly) bound to a node where a node level extended resource it requires is not found, then kubelet will not fail admission. We should document this as known issue and fix it with resource APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@@ -39,7 +39,7 @@ func init() {
// Register functions that extract metadata used by predicates and priorities computations.
factory.RegisterPredicateMetadataProducerFactory(
func(args factory.PluginFactoryArgs) algorithm.PredicateMetadataProducer {
return predicates.NewPredicateMetadataFactory()
return predicates.NewPredicateMetadataFactory(args.PodLister)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that argument even necessary? @bsalamat can this be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed. If yes, I can do that in a separate PR in 1.11.

@vishh
Copy link
Contributor

vishh commented Feb 27, 2018

/lgtm
/approve
@yguo0905 please update public documentation with the repercussions of this PR. It enables cluster level resources, but with a few caveats (lack of support for daemonsets, requires scheduler extender config, etc.)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 27, 2018
),
expectedPod: makeTestPod(
v1.ResourceList{},
v1.ResourceList{"foo.com/bar": quantity},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, shouldn't this resource request be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request is in Limits which is skipped in this logic. I will refine this test once the overall design is LGTM'd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. So this is limit and we leave limit unchanged. For extended resources, we actually require limits to equal to requests and in devicemanager, we actually examine resource Limits during resource allocation. Can we make sure to also remove limits for non-existing extended resources?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. Whether we have Limit or not doesn't affect GeneralPredicate evaluation.

@jiayingz
Copy link
Contributor

/lgtm

@bsalamat
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2018
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 27, 2018
@vishh
Copy link
Contributor

vishh commented Feb 28, 2018

Can you squash the commits @yguo0905 ?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2018
Co-authored-by: Yang Guo <ygg@google.com>
Co-authored-by: Chun Chen <chenchun.feed@gmail.com>
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2018
@vishh
Copy link
Contributor

vishh commented Feb 28, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, jiayingz, vishh, yguo0905

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 60236, 60332, 57375, 60451, 57408). If you want to cherry-pick this change to another branch, please follow the instructions here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

After 1.8, scheduler could reject unknown extended resource names
8 participants