Avoid moving pods out of unschedulable status unconditionally #94009

Closed
ahg-g opened this issue Aug 14, 2020 · 69 comments · Fixed by #98041 or #100026
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@ahg-g
Member

ahg-g commented Aug 14, 2020

Currently we unconditionally move pods out of unschedulable status whenever an event happens. We can optimize this by attaching a condition to each event so that we avoid doing it unnecessarily:

  1. PV, PVC, StorageClass or CSINode add/update: move only pods that reference a PVC.
  2. Service add/update: move pods only if the ServiceAffinity plugin is enabled in one of the profiles.
  3. Node add/update: test the pods against the kubelet admission logic, which includes the NodeAffinity, NodeName, NodePorts, NodeResources and TaintToleration filters.
  4. Pod delete: keep it as is (spreading or affinity constraints would have to be tested, which is expensive to do).

This is necessary to avoid wasting scheduling cycles; a rough sketch of the per-event gating follows below.

Related proposal: https://docs.google.com/document/d/1Dw1qPi4eryllSv0F419sKVbGXiPvSv6N_mJd6AVSg74/edit
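
A rough, hypothetical sketch of the per-event gating described above (types and function names are invented for illustration; the real logic lives in the scheduling queue and its event handlers):

```go
// Hypothetical sketch of per-event gating when moving pods out of the
// unschedulable queue. Types and names are illustrative, not scheduler code.
package main

import "fmt"

type EventType int

const (
	StorageUpdate EventType = iota // PV/PVC/StorageClass/CSINode add or update
	ServiceUpdate                  // Service add or update
	NodeUpdate                     // Node add or update
	PodDelete                      // Pod delete
)

type Pod struct {
	Name    string
	UsesPVC bool
}

// shouldRequeue decides whether an unschedulable pod is worth moving back to
// the active queue for a given cluster event, following the conditions listed
// in the issue description.
func shouldRequeue(event EventType, pod Pod, serviceAffinityEnabled, passesNodeAdmission bool) bool {
	switch event {
	case StorageUpdate:
		// 1. Only pods that reference a PVC can be affected by storage events.
		return pod.UsesPVC
	case ServiceUpdate:
		// 2. Only relevant if the ServiceAffinity plugin is enabled in a profile.
		return serviceAffinityEnabled
	case NodeUpdate:
		// 3. Re-run node-level admission checks (NodeAffinity, NodeName,
		// NodePorts, NodeResources, TaintToleration) against the new node.
		return passesNodeAdmission
	default:
		// 4. Pod delete (and anything else): keep current behavior and requeue,
		// since spreading/affinity are too expensive to re-evaluate here.
		return true
	}
}

func main() {
	p := Pod{Name: "web-0", UsesPVC: false}
	fmt.Println(shouldRequeue(StorageUpdate, p, false, false)) // false: no PVC, skip requeue
	fmt.Println(shouldRequeue(PodDelete, p, false, false))     // true: always requeue on pod delete
}
```

The point is that only cheap, per-pod checks run at event time; anything expensive (spreading, inter-pod affinity) keeps today's unconditional behavior.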

/sig scheduling
/assign @adtac

@ahg-g ahg-g added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 14, 2020
@k8s-ci-robot
Contributor

@ahg-g: GitHub didn't allow me to assign the following users: adtac.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 14, 2020
@adtac
Member

adtac commented Aug 14, 2020

/assign

@Huang-Wei
Member

/cc @denkensk

I'm wondering if we can broaden the scope: that is, not only enable/disable moving (all) pods, but also provide a mechanism for plugins to tell the scheduler framework which particular pods should be moved.

This would benefit @denkensk's coscheduling implementation, which needs to move the preceding (N-1) pods back to activeQ when the Nth pod arrives and the "minAvailable" requirement is satisfied. Moreover, it would help in getting rid of flushing pods back to activeQ #87850.
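
One hypothetical shape for that kind of hook (all names invented for illustration, not an existing framework API): the plugin hands the queue a predicate so only the pods it cares about, e.g. the other members of a pod group, are moved back to activeQ.

```go
// Hypothetical interface sketch: a plugin tells the scheduling queue which
// specific unschedulable pods to move back to activeQ. Names are invented.
package sketch

// PodMover would be implemented by the scheduling queue (illustrative only).
type PodMover interface {
	// MoveMatchingPods moves every unschedulable pod for which the predicate
	// returns true back to the active queue.
	MoveMatchingPods(predicate func(namespace, name string) bool)
}

// Example: a coscheduling-style plugin moves the other members of a pod group
// back to activeQ once the Nth member arrives and minAvailable can be met.
func onLastGroupMemberAdded(q PodMover, groupMembers map[string]bool) {
	q.MoveMatchingPods(func(namespace, name string) bool {
		return groupMembers[namespace+"/"+name]
	})
}
```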

@ahg-g
Member Author

ahg-g commented Aug 14, 2020

The suggestion is not to enable/disable moving all pods for every event; that is only the case for Service events. For the other two, node and volume events, we will test each pod and move only the ones that pass the condition.

I agree that our proposal of integrating this with filters/permit plugins is more generic, but it adds a lot of complexity and potential for bugs, so I think we need to be careful with that.

@ahg-g
Member Author

ahg-g commented Aug 14, 2020

One approach could be to start simple, validate and establish a reference point in terms of perf improvements, and then migrate to something more generic. The more generic approach will not be more performant than what is proposed in this issue.

@Huang-Wei
Member

I agree that our proposal of integrating this with filters/permit plugins is more generic, but it adds a lot of complexity and potential for bugs, so I think we need to be careful with that.

Yes, filters/permit run in the critical path, and moving Pods isn't a lock-free operation. What I want to say is: during the design/implementation of this proposal, keep in mind that plugins may need to actively "suggest" moving particular Pods back to activeQ.

One approach could be to start simple, validate and establish a reference point in terms of perf improvements, and then migrate to something more generic.

Agree.

@ahg-g
Member Author

ahg-g commented Aug 14, 2020

Yes, filters/permit run in the critical path, and moving Pods isn't a lock-free operation. What I want to say is: during the design/implementation of this proposal, keep in mind that plugins may need to actively "suggest" moving particular Pods back to activeQ.

Right, we already designed the interface for such a potential eventuality: https://docs.google.com/document/d/1Dw1qPi4eryllSv0F419sKVbGXiPvSv6N_mJd6AVSg74/edit.

@neolit123
Member

just fixing double "moving" in title:
/retitle Avoid moving pods out of unschedulable status unconditionally

@k8s-ci-robot k8s-ci-robot changed the title Avoid moving moving pods out of unschedulable status unconditionally Avoid moving pods out of unschedulable status unconditionally Aug 17, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2020
@ahg-g
Member Author

ahg-g commented Nov 16, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 16, 2020
@k8s-ci-robot k8s-ci-robot assigned Huang-Wei and unassigned Huang-Wei Dec 10, 2020
@Huang-Wei
Member

/assign
/unassign @adtac

@cwdsuzhou
Contributor

/cc

@Huang-Wei
Member

Actually, to get rid of the reaction to service changes... all plugins would have to opt out, so nvm.

Yes, to achieve that, every plugin has to implement EventsToRegister()...

BTW: one way to work around it is to add some temporary logic on the framework side - skip moving pods upon Service events if ServiceAffinity is not enabled - and remove this temporary logic once all plugins opt in to EventsToRegister() in the next release.
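
A rough sketch of the EventsToRegister() idea, with simplified stand-in types (the in-tree framework package defines its own ClusterEvent/ActionType; details may differ):

```go
// Rough sketch of the EventsToRegister() idea, with simplified stand-in types;
// the in-tree framework package defines its own ClusterEvent/ActionType.
package sketch

type ActionType int

const (
	Add ActionType = 1 << iota
	Update
	Delete
)

// ClusterEvent pairs a resource kind with the actions a plugin cares about.
type ClusterEvent struct {
	Resource   string
	ActionType ActionType
}

// EnqueuePlugin is what a plugin would implement so the framework only wires
// up informer handlers (and only moves pods) for events the plugin declares.
type EnqueuePlugin interface {
	EventsToRegister() []ClusterEvent
}

// A ServiceAffinity-style plugin would be the only one registering for Service
// events; if no plugin does, the framework can skip Service handlers entirely.
type serviceAffinity struct{}

func (serviceAffinity) EventsToRegister() []ClusterEvent {
	return []ClusterEvent{{Resource: "Service", ActionType: Add | Update}}
}
```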

@alculquicondor
Member

That wouldn't qualify for codefreeze though.

@Huang-Wei
Member

That wouldn't qualify for codefreeze though.

Nvm then :)

@ahg-g
Member Author

ahg-g commented Mar 11, 2021

We should simply not register for service updates at all if ServiceAffinity is not configured, which is the default.

@Huang-Wei
Member

We should simply not register for service updates at all if ServiceAffinity is not configured, which is the default.

Probably not a great idea to totally disable the registration, as an out-of-tree plugin may choose to respond to Service events.

@ahg-g
Member Author

ahg-g commented Mar 11, 2021

We need to find a way for external plugins to register events because we will remove this registration once the plugin is deprecated.

@Huang-Wei
Member

We need to find a way for external plugins to register events because we will remove this registration once the plugin is deprecated.

Yes, we should also take CRDs into consideration. Probably via a dynamic informer.

@ahg-g
Member Author

ahg-g commented Mar 12, 2021

I still think we should disable Service event handlers when the plugin is not registered. If an external plugin is using them, it could enable the ServiceAffinity filter plugin with empty affinity labels, or patch the event handlers, until we figure out a way for external plugins to register extra events.

btw, do you know of any external plugins that actually rely on service events?
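
A minimal sketch of what skipping the Service handlers could look like with client-go, assuming a hypothetical serviceAffinityEnabled flag derived from the configured profiles (the real scheduler wires its handlers elsewhere):

```go
// Sketch of the temporary workaround: only register Service event handlers
// when at least one profile enables ServiceAffinity. serviceAffinityEnabled
// and onServiceChange are hypothetical stand-ins for the real wiring.
package eventhandlers

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func addServiceHandlersIfNeeded(client kubernetes.Interface, serviceAffinityEnabled bool, onServiceChange func()) {
	if !serviceAffinityEnabled {
		// No plugin reacts to Service events, so don't wake up unschedulable
		// pods on Service add/update at all.
		return
	}
	factory := informers.NewSharedInformerFactory(client, 0)
	factory.Core().V1().Services().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { onServiceChange() },
		UpdateFunc: func(oldObj, newObj interface{}) { onServiceChange() },
	})
}
```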

@Huang-Wei
Member

I still think we should disable Service event handlers when the plugin is not registered. If an external plugin is using them, it could enable the ServiceAffinity filter plugin with empty affinity labels, or patch the event handlers, until we figure out a way for external plugins to register extra events.

That's fair, and I can work on this, adding some comments to caveat that this is a temporary workaround.

btw, do you know of any external plugins that actually rely on service events?

I'm not aware of any so far.

@alculquicondor
Member

tbh, I would say plugins depending on services is an antipattern, given the existence of pod affinity.

Probably some CRDs might make sense for specialized plugins.

@ahg-g
Member Author

ahg-g commented Mar 17, 2021

@Huang-Wei we should add metrics to track the efficiency of the solution. The metric would track the number of pods that were not put back into the queue, broken down by plugin and event type, for example.
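
For illustration, a sketch of such a metric using client_golang directly; the metric name and labels are assumptions, and the real scheduler would go through its own metrics package:

```go
// Illustrative metric: count pods that were *not* moved back to activeQ on a
// cluster event, labeled by plugin and event type. The metric name is an
// assumption, not an existing scheduler metric.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var podsNotRequeued = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "scheduler_unschedulable_pods_not_requeued_total",
		Help: "Pods left in the unschedulable queue on a cluster event, by plugin and event type.",
	},
	[]string{"plugin", "event"},
)

func init() {
	prometheus.MustRegister(podsNotRequeued)
}

// recordSkipped is called when the gating logic decides not to move a pod.
func recordSkipped(plugin, event string) {
	podsNotRequeued.WithLabelValues(plugin, event).Inc()
}
```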

@Huang-Wei
Member

we should add metrics to track the efficiency of the solution. The metric would track the number of pods that were not put back into the queue, broken down by plugin and event type, for example.

SG. I compiled a list of ongoing and to-do items here: #100347.

@Huang-Wei
Member

I would like to take the "podtopologyspread" plugin.

@yuzhiquan Are you still working on this?

@ahg-g
Member Author

ahg-g commented Apr 23, 2021

@Huang-Wei did we have PRs for number 2 and number 3?

@ahg-g
Member Author

ahg-g commented Apr 23, 2021

I am asking about the two points from the four mentioned in the description of this issue: #94009 (comment)

@yuzhiquan
Member

I am asking about the two points from the four mentioned in the description of this issue: #94009 (comment)

Oops, misunderstanding, ignore me.

@Huang-Wei
Member

Huang-Wei commented Apr 23, 2021

@Huang-Wei did we have PRs for number 2 and number 3?

number 3: #100049

@ahg-g
Member Author

ahg-g commented Apr 23, 2021

number 3: #100049

ah, yeah, I knew we worked on that in the previous release, lol. Thanks.
