Avoid moving pods out of unschedulable status unconditionally #94009

Closed
ahg-g opened this issue Aug 14, 2020 · 69 comments · Fixed by #98041 or #100026
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@ahg-g
Member

ahg-g commented Aug 14, 2020

Currently we unconditionally move pods out of unschedulable status whenever an event happens. We can optimize this by attaching a condition to each event so that we avoid doing it unnecessarily:

  1. PV, PVC, StorageClass or CSINode add/update: move only pods that reference a PVC.
  2. Service add/update: move pods only if the ServiceAffinity plugin is enabled in one of the profiles.
  3. Node add/update: test the pods against the kubelet admission logic, which includes the NodeAffinity, NodeName, NodePorts, NodeResources and TaintToleration filters.
  4. Pod delete: keep it as is (spreading or affinity constraints would have to be tested, which is expensive to do).

This is necessary to avoid wasting scheduling cycles; a rough sketch of the per-event gating follows below.

Related proposal: https://docs.google.com/document/d/1Dw1qPi4eryllSv0F419sKVbGXiPvSv6N_mJd6AVSg74/edit
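
A rough, hypothetical sketch of the per-event gating described above (types and function names are invented for illustration; the real logic lives in the scheduling queue and its event handlers):

```go
// Hypothetical sketch of per-event gating when moving pods out of the
// unschedulable queue. Types and names are illustrative, not scheduler code.
package main

import "fmt"

type EventType int

const (
	StorageUpdate EventType = iota // PV/PVC/StorageClass/CSINode add or update
	ServiceUpdate                  // Service add or update
	NodeUpdate                     // Node add or update
	PodDelete                      // Pod delete
)

type Pod struct {
	Name    string
	UsesPVC bool
}

// shouldRequeue decides whether an unschedulable pod is worth moving back to
// the active queue for a given cluster event, following the conditions listed
// in the issue description.
func shouldRequeue(event EventType, pod Pod, serviceAffinityEnabled, passesNodeAdmission bool) bool {
	switch event {
	case StorageUpdate:
		// 1. Only pods that reference a PVC can be affected by storage events.
		return pod.UsesPVC
	case ServiceUpdate:
		// 2. Only relevant if the ServiceAffinity plugin is enabled in a profile.
		return serviceAffinityEnabled
	case NodeUpdate:
		// 3. Re-run node-level admission checks (NodeAffinity, NodeName,
		// NodePorts, NodeResources, TaintToleration) against the new node.
		return passesNodeAdmission
	default:
		// 4. Pod delete (and anything else): keep current behavior and requeue,
		// since spreading/affinity are too expensive to re-evaluate here.
		return true
	}
}

func main() {
	p := Pod{Name: "web-0", UsesPVC: false}
	fmt.Println(shouldRequeue(StorageUpdate, p, false, false)) // false: no PVC, skip requeue
	fmt.Println(shouldRequeue(PodDelete, p, false, false))     // true: always requeue on pod delete
}
```

The point is that only cheap, per-pod checks run at event time; anything expensive (spreading, inter-pod affinity) keeps today's unconditional behavior.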

/sig scheduling
/assign @adtac

@ahg-g ahg-g added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 14, 2020
@k8s-ci-robot
Contributor

@ahg-g: GitHub didn't allow me to assign the following users: adtac.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 14, 2020
@adtac
Member

adtac commented Aug 14, 2020

/assign

@Huang-Wei
Member

/cc @denkensk

I'm wondering if we can broaden the scope: that is, not only enable/disable moving (all) pods, but also provide a mechanism for plugins to tell the scheduler framework which particular pods should be moved.

This would benefit @denkensk's coscheduling implementation, which needs to move the preceding (N-1) pods back to activeQ when the Nth pod arrives and the "minAvailable" requirement is satisfied. Moreover, it would help in getting rid of flushing pods back to activeQ #87850.
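
One hypothetical shape for that kind of hook (all names invented for illustration, not an existing framework API): the plugin hands the queue a predicate so only the pods it cares about, e.g. the other members of a pod group, are moved back to activeQ.

```go
// Hypothetical interface sketch: a plugin tells the scheduling queue which
// specific unschedulable pods to move back to activeQ. Names are invented.
package sketch

// PodMover would be implemented by the scheduling queue (illustrative only).
type PodMover interface {
	// MoveMatchingPods moves every unschedulable pod for which the predicate
	// returns true back to the active queue.
	MoveMatchingPods(predicate func(namespace, name string) bool)
}

// Example: a coscheduling-style plugin moves the other members of a pod group
// back to activeQ once the Nth member arrives and minAvailable can be met.
func onLastGroupMemberAdded(q PodMover, groupMembers map[string]bool) {
	q.MoveMatchingPods(func(namespace, name string) bool {
		return groupMembers[namespace+"/"+name]
	})
}
```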

@ahg-g
Member Author

ahg-g commented Aug 14, 2020

The suggestion is not to enable/disable moving all pods for every event; that is only the case for Service events. For the other two, node and volume events, we will test each pod and move only the ones that pass the condition.

I agree that our proposal of integrating this with filters/permit plugins is more generic, but it adds a lot of complexity and potential for bugs, so I think we need to be careful with that.

@ahg-g
Member Author

ahg-g commented Aug 14, 2020

One approach could be to start simple, validate and establish a reference point in terms of perf improvements, and then migrate to something more generic. The more generic approach will not be more performant than what is proposed in this issue.

@Huang-Wei
Member

I agree that our proposal of integrating this with filters/permit plugins is more generic, but it adds a lot of complexity and potential for bugs, so I think we need to be careful with that.

Yes, filters/permit run in the critical path, and moving Pods isn't a lock-free operation. What I want to say is: during the design/implementation of this proposal, keep in mind that plugins may need to actively "suggest" moving particular Pods back to activeQ.

One approach could be to start simple, validate and establish a reference point in terms of perf improvements, and then migrate to something more generic.

Agree.

@ahg-g
Member Author

ahg-g commented Aug 14, 2020

Yes, filters/permit run in the critical path, and moving Pods isn't a lock-free operation. What I want to say is: during the design/implementation of this proposal, keep in mind that plugins may need to actively "suggest" moving particular Pods back to activeQ.

Right, we already designed the interface for such a potential eventuality: https://docs.google.com/document/d/1Dw1qPi4eryllSv0F419sKVbGXiPvSv6N_mJd6AVSg74/edit.

@neolit123
Member

just fixing double "moving" in title:
/retitle Avoid moving pods out of unschedulable status unconditionally

@k8s-ci-robot k8s-ci-robot changed the title Avoid moving moving pods out of unschedulable status unconditionally Avoid moving pods out of unschedulable status unconditionally Aug 17, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2020
@ahg-g
Member Author

ahg-g commented Nov 16, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 16, 2020
@k8s-ci-robot k8s-ci-robot assigned Huang-Wei and unassigned Huang-Wei Dec 10, 2020
@Huang-Wei
Member

/assign
/unassign @adtac

@cwdsuzhou
Contributor

/cc

@Huang-Wei
Member

Actually, to get rid of the reaction to service changes... all plugins would have to opt out, so nvm.

Yes, to achieve that, every plugin has to implement EventsToRegister()...

BTW: one way to work around it is to add some temporary logic on the framework side - skip moving pods upon Service events if ServiceAffinity is not enabled - and remove this temporary logic once all plugins opt in to EventsToRegister() in the next release.
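
A rough sketch of the EventsToRegister() idea, with simplified stand-in types (the in-tree framework package defines its own ClusterEvent/ActionType; details may differ):

```go
// Rough sketch of the EventsToRegister() idea, with simplified stand-in types;
// the in-tree framework package defines its own ClusterEvent/ActionType.
package sketch

type ActionType int

const (
	Add ActionType = 1 << iota
	Update
	Delete
)

// ClusterEvent pairs a resource kind with the actions a plugin cares about.
type ClusterEvent struct {
	Resource   string
	ActionType ActionType
}

// EnqueuePlugin is what a plugin would implement so the framework only wires
// up informer handlers (and only moves pods) for events the plugin declares.
type EnqueuePlugin interface {
	EventsToRegister() []ClusterEvent
}

// A ServiceAffinity-style plugin would be the only one registering for Service
// events; if no plugin does, the framework can skip Service handlers entirely.
type serviceAffinity struct{}

func (serviceAffinity) EventsToRegister() []ClusterEvent {
	return []ClusterEvent{{Resource: "Service", ActionType: Add | Update}}
}
```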

@alculquicondor
Member

That wouldn't qualify for codefreeze though.

@Huang-Wei
Member

That wouldn't qualify for codefreeze though.

Nvm then :)

@ahg-g
Member Author

ahg-g commented Mar 11, 2021

We should simply not register for service updates at all if ServiceAffinity is not configured, which is the default.

@Huang-Wei
Member

We should simply not register for service updates at all if ServiceAffinity is not configured, which is the default.

Probably not a great idea to totally disable the registration, as an out-of-tree plugin may choose to respond to Service events.

@ahg-g
Member Author

ahg-g commented Mar 11, 2021

We need to find a way for external plugins to register events because we will remove this registration once the plugin is deprecated.

@Huang-Wei
Member

We need to find a way for external plugins to register events because we will remove this registration once the plugin is deprecated.

Yes, we should also take CRDs into consideration. Probably via a dynamic informer.

@ahg-g
Member Author

ahg-g commented Mar 12, 2021

I still think we should disable Service event handlers when the plugin is not registered. If an external plugin is using them, it could enable the ServiceAffinity filter plugin with empty affinity labels, or patch the event handlers, until we figure out a way for external plugins to register extra events.

btw, do you know of any external plugins that actually rely on service events?
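
A minimal sketch of what skipping the Service handlers could look like with client-go, assuming a hypothetical serviceAffinityEnabled flag derived from the configured profiles (the real scheduler wires its handlers elsewhere):

```go
// Sketch of the temporary workaround: only register Service event handlers
// when at least one profile enables ServiceAffinity. serviceAffinityEnabled
// and onServiceChange are hypothetical stand-ins for the real wiring.
package eventhandlers

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func addServiceHandlersIfNeeded(client kubernetes.Interface, serviceAffinityEnabled bool, onServiceChange func()) {
	if !serviceAffinityEnabled {
		// No plugin reacts to Service events, so don't wake up unschedulable
		// pods on Service add/update at all.
		return
	}
	factory := informers.NewSharedInformerFactory(client, 0)
	factory.Core().V1().Services().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { onServiceChange() },
		UpdateFunc: func(oldObj, newObj interface{}) { onServiceChange() },
	})
}
```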

@Huang-Wei
Member

I still think we should disable Service event handlers when the plugin is not registered. If an external plugin is using them, it could enable the ServiceAffinity filter plugin with empty affinity labels, or patch the event handlers, until we figure out a way for external plugins to register extra events.

That's fair, and I can work on this, adding some comments to caveat that this is a temporary workaround.

btw, do you know of any external plugins that actually rely on service events?

I'm not aware of any so far.

@alculquicondor
Member

tbh, I would say plugins depending on services is an antipattern, given the existence of pod affinity.

Probably some CRDs might make sense for specialized plugins.

@ahg-g
Member Author

ahg-g commented Mar 17, 2021

@Huang-Wei we should add metrics to track the efficiency of the solution. The metric would track the number of pods that were not put back into the queue, broken down by plugin and event type, for example.
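
For illustration, a sketch of such a metric using client_golang directly; the metric name and labels are assumptions, and the real scheduler would go through its own metrics package:

```go
// Illustrative metric: count pods that were *not* moved back to activeQ on a
// cluster event, labeled by plugin and event type. The metric name is an
// assumption, not an existing scheduler metric.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var podsNotRequeued = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "scheduler_unschedulable_pods_not_requeued_total",
		Help: "Pods left in the unschedulable queue on a cluster event, by plugin and event type.",
	},
	[]string{"plugin", "event"},
)

func init() {
	prometheus.MustRegister(podsNotRequeued)
}

// recordSkipped is called when the gating logic decides not to move a pod.
func recordSkipped(plugin, event string) {
	podsNotRequeued.WithLabelValues(plugin, event).Inc()
}
```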

@Huang-Wei
Member

we should add metrics to track the efficiency of the solution. The metric would track the number of pods that were not put back into the queue, broken down by plugin and event type, for example.

SG. I compiled a list of ongoing and to-do items here: #100347.

@Huang-Wei
Member

I would like to take the "podtopologyspread" plugin.

@yuzhiquan Are you still working on this?

@ahg-g
Member Author

ahg-g commented Apr 23, 2021

@Huang-Wei did we have PRs for number 2 and number 3?

@ahg-g
Member Author

ahg-g commented Apr 23, 2021

I am asking about the two points from the four mentioned in the description of this issue: #94009 (comment)

@yuzhiquan
Member

I am asking about the two points from the four mentioned in the description of this issue: #94009 (comment)

Oops, misunderstanding, ignore me.

@Huang-Wei
Member

Huang-Wei commented Apr 23, 2021

@Huang-Wei did we have PRs for number 2 and number 3?

number 3: #100049

@ahg-g
Member Author

ahg-g commented Apr 23, 2021

number 3: #100049

ah, yeah, I knew we worked on that in the previous release, lol. Thanks.
