Scheduler does not clear nominatedNodeName on Pod when node is no longer valid #85677

Closed
@losipiuk

Description

What happened:

The scheduler decided to schedule Pod P on Node N after preempting some pods from N, and set P.nominatedNodeName to N.
Some time later, before P was actually scheduled, node N was removed from the cluster.
The scheduler did not reset P.nominatedNodeName to an empty value.

What you expected to happen:

I would expect the scheduler to unset P.nominatedNodeName when the node is no longer valid.
The most obvious reason for a node to stop being valid is that it has been removed from the cluster.
But there are more cases to cover. Off the top of my head, I can think of:

  • an extra Pod was scheduled to the node and there are no longer enough resources
    • either scheduled by the scheduler
    • or placed directly as a static pod/daemonset
  • the node became unschedulable
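
The cases above could be folded into a single validity check. The following is only a rough sketch of the expected behavior, not the scheduler's actual code: `Node`, `Pod`, `nominationValid`, `clearStaleNomination` and the `FreeAfterPreemptionMilliCPU` field are hypothetical, simplified stand-ins, and CPU is used as the only resource for brevity.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for a cluster node.
type Node struct {
	Name          string
	Unschedulable bool
	// Assumption: CPU that will be free once the preempted victims are gone.
	FreeAfterPreemptionMilliCPU int64
}

// Pod is a simplified, hypothetical stand-in for a pending pod.
type Pod struct {
	Name              string
	NominatedNodeName string
	RequestMilliCPU   int64
}

// nominationValid reports whether the pod's nomination still points at a
// node that exists, is schedulable, and can fit the pod.
func nominationValid(p Pod, nodes map[string]Node) bool {
	if p.NominatedNodeName == "" {
		return true // nothing nominated, nothing to invalidate
	}
	n, ok := nodes[p.NominatedNodeName]
	if !ok {
		return false // node was removed from the cluster
	}
	if n.Unschedulable {
		return false // node became unschedulable
	}
	// An extra pod may have consumed the room the nomination relied on.
	return n.FreeAfterPreemptionMilliCPU >= p.RequestMilliCPU
}

// clearStaleNomination resets NominatedNodeName when the nomination is no
// longer valid and reports whether it changed the pod.
func clearStaleNomination(p *Pod, nodes map[string]Node) bool {
	if nominationValid(*p, nodes) {
		return false
	}
	p.NominatedNodeName = ""
	return true
}

func main() {
	// Node N is absent from the map, i.e. it was removed from the cluster.
	nodes := map[string]Node{
		"A": {Name: "A", FreeAfterPreemptionMilliCPU: 2000},
	}
	p := Pod{Name: "P", NominatedNodeName: "N", RequestMilliCPU: 500}
	cleared := clearStaleNomination(&p, nodes)
	fmt.Printf("cleared=%v nominatedNodeName=%q\n", cleared, p.NominatedNodeName)
}
```

Where and when such a check would run inside the scheduler (on node events, on every scheduling cycle, or both) is left open here.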

How to reproduce it (as minimally and precisely as possible):

We did not try to reproduce the issue manually, but the following setup should work.

  • create a cluster with nodes A and B.
  • ensure all system pods are running on A. Fill up A so it has no spare resources.
  • add a low-priority Pod L with a long termination time, with requests matching the allocatable resources of B. It should be scheduled to B.
  • add a high-priority Pod H. It should get B as its nominatedNodeName.
  • kill node B.
  • H.nominatedNodeName will still point to B, which is no longer in the cluster.
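
To confirm the end state of the steps above, one could compare each pending pod's nomination against the nodes that currently exist. A minimal sketch, assuming the pod and node listings have already been fetched by some other means; the `Pod` type and the `staleNominations` helper are hypothetical and not part of any Kubernetes client library.

```go
package main

import "fmt"

// Pod holds only the fields this check needs (hypothetical, simplified type).
type Pod struct {
	Name              string
	NominatedNodeName string
}

// staleNominations returns the names of pods whose NominatedNodeName refers
// to a node that is not in nodeNames.
func staleNominations(pods []Pod, nodeNames []string) []string {
	existing := make(map[string]bool, len(nodeNames))
	for _, n := range nodeNames {
		existing[n] = true
	}
	var stale []string
	for _, p := range pods {
		if p.NominatedNodeName != "" && !existing[p.NominatedNodeName] {
			stale = append(stale, p.Name)
		}
	}
	return stale
}

func main() {
	// State after the last step: node B is gone, H still nominates it.
	pods := []Pod{{Name: "H", NominatedNodeName: "B"}}
	fmt.Println(staleNominations(pods, []string{"A"}))
}
```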

Anything else we need to know?:

The current behavior does not play well with Cluster Autoscaler (CA), which assumes that a Pod with nominatedNodeName set will be scheduled shortly.
Such pods therefore do not trigger scale-up. If P.nominatedNodeName can stay incorrectly set for an unbounded period of time, CA will never
provision a node for P to run on.
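
The interaction can be illustrated with a toy model. This is not Cluster Autoscaler's real code, only a sketch of the assumption described above; `PendingPod` and `podsTriggeringScaleUp` are hypothetical names.

```go
package main

import "fmt"

// PendingPod is a hypothetical, simplified view of an unscheduled pod.
type PendingPod struct {
	Name              string
	NominatedNodeName string
}

// podsTriggeringScaleUp models the assumption described above: a pending pod
// with a nomination is expected to be scheduled soon, so it is skipped when
// deciding whether new nodes are needed.
func podsTriggeringScaleUp(pending []PendingPod) []string {
	var names []string
	for _, p := range pending {
		if p.NominatedNodeName != "" {
			continue // assumed to land on its nominated node shortly
		}
		names = append(names, p.Name)
	}
	return names
}

func main() {
	pending := []PendingPod{
		{Name: "H", NominatedNodeName: "B"}, // B was removed; the field is stale
		{Name: "other"},
	}
	// H is never considered for scale-up while the stale nomination remains.
	fmt.Println(podsTriggeringScaleUp(pending))
}
```

In this model H stays pending indefinitely: no node named B exists for it to land on, and nothing asks for a replacement node.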

Environment:

  • Kubernetes version (use kubectl version):

We observed the behavior during presubmit tests which run k8s built from HEAD of master on 27.11.2019.

Versions logged in test log:
I1125 18:55:11.691] Client Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.0-alpha.0.1183+414202578ba1ad", GitCommit:"414202578ba1ad53a26bc5126b3828a3a410097e", GitTreeState:"clean", BuildDate:"2019-11-25T17:04:18Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
I1125 18:55:11.692] Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.0-alpha.0.1183+414202578ba1ad", GitCommit:"414202578ba1ad53a26bc5126b3828a3a410097e", GitTreeState:"clean", BuildDate:"2019-11-25T17:04:18Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
    GCE

/sig scheduling

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
