Make sure the Volcano scheduler cache is synced before the first scheduling cycle by waiting for handlers to sync. #3177

Open
wants to merge 1 commit into
base: master

RamezesDong

@RamezesDong RamezesDong commented Nov 5, 2023

Ⅰ. Describe what this PR does

Issue kubernetes/kubernetes#116717 describes a bug in which event handlers may not have processed all initial events by the time the informer cache reports that it has synced. As a result, the scheduler can start scheduling from an incomplete view of the cluster. Kubernetes fixed this in its own scheduler in kubernetes/kubernetes#116729.

This PR makes sure the handlers have finished syncing before the scheduling cycles start, as the default scheduler does.
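
To illustrate the gap between "informer cache synced" and "handlers synced", here is a minimal standard-library simulation. It does not use client-go or Volcano code; `schedulerCache`, `fakeInformer`, `startFakeInformer`, and `firstCycleView` are hypothetical names invented for this sketch, and the 5 ms delay is an arbitrary stand-in for handler latency.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// schedulerCache is a hypothetical, minimal stand-in for a scheduler's
// internal cache, which is filled only by event handlers.
type schedulerCache struct {
	mu    sync.Mutex
	nodes map[string]struct{}
}

func (c *schedulerCache) addNode(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[name] = struct{}{}
}

func (c *schedulerCache) nodeCount() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.nodes)
}

// fakeInformer simulates an informer: listed is closed once the initial list
// is stored, handled is closed once the handler has processed every item.
type fakeInformer struct {
	listed  chan struct{}
	handled chan struct{}
}

func startFakeInformer(initial []string, onAdd func(string)) *fakeInformer {
	inf := &fakeInformer{listed: make(chan struct{}), handled: make(chan struct{})}
	close(inf.listed) // the informer store counts as synced right away
	go func() {
		defer close(inf.handled)
		for _, obj := range initial {
			time.Sleep(5 * time.Millisecond) // the handler lags behind the store
			onAdd(obj)
		}
	}()
	return inf
}

// firstCycleView returns how many nodes the first scheduling cycle would see.
func firstCycleView(waitForHandlers bool) int {
	c := &schedulerCache{nodes: map[string]struct{}{}}
	inf := startFakeInformer([]string{"node-a", "node-b", "node-c"}, c.addNode)
	<-inf.listed // analogous to waiting for the informer cache to sync
	if waitForHandlers {
		<-inf.handled // analogous to waiting for the handlers to sync
	}
	return c.nodeCount()
}

func main() {
	fmt.Println("cache sync only:   ", firstCycleView(false), "of 3 nodes visible")
	fmt.Println("plus handlers sync:", firstCycleView(true), "of 3 nodes visible")
}
```

With the extra wait, the first cycle always sees all three nodes. Without it, the count depends on timing and is usually lower, which is the incomplete starting state the PR is meant to rule out.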

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign thor-wl
You can assign the PR to them by writing /assign @thor-wl in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 5, 2023
@lowang-bh
Member

@RamezesDong
Author

RamezesDong commented Nov 6, 2023

same

The two PRs look similar. If the other one is merged, I will close this PR.


stale bot commented Mar 17, 2024

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2024
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2024
@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2024
@lowang-bh
Member

We should include this feature in the new release. @Monokaix @william-wang

@Monokaix
Member

Please rebase your PR :)

@RamezesDong
Author

I will rebase the code as soon as possible

@RamezesDong RamezesDong force-pushed the wait-for-handlers-sync branch from 52ce8c6 to daeb06f Compare May 16, 2024 09:21
@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2024
@RamezesDong
Author

@lowang-bh The rebase is done. Could you review the code, please?

@@ -88,7 +88,10 @@ func (pc *Scheduler) Run(stopCh <-chan struct{}) {
pc.cache.SetMetricsConf(pc.metricsConf)
pc.cache.Run(stopCh)
pc.cache.WaitForCacheSync(stopCh)
klog.V(2).Infof("Scheduler completes Initialization and start to run")
if err := pc.cache.WaitForHandlersSync(stopCh); err != nil {
panic(fmt.Sprintf("failed to wait for handlers sync: %v", err))
Member

What happens if we continue instead of panicking? Is that acceptable?

Author

An error is returned only when stopCh is closed, in which case the wait ends with ErrWaitTimeout. So we can make WaitForHandlersSync return nothing and remove this panic.
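
A wait without a return value, along the lines suggested here, might look like the sketch below. It is an assumption-laden illustration using only the standard library, not the code in this PR: `waitForHandlersSync`, `run`, the `hasSynced` callbacks, and the 10 ms poll interval are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// waitForHandlersSync blocks until every hasSynced function reports true or
// stopCh is closed. It returns nothing: closing stopCh is the only way to
// leave early, and the caller already watches that channel.
func waitForHandlersSync(stopCh <-chan struct{}, hasSynced ...func() bool) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		all := true
		for _, synced := range hasSynced {
			if !synced() {
				all = false
				break
			}
		}
		if all {
			return
		}
		select {
		case <-stopCh:
			return
		case <-ticker.C:
		}
	}
}

// run mimics the shape of a scheduler's Run method: wait, then start
// scheduling cycles unless shutdown was requested in the meantime.
func run(stopCh <-chan struct{}, hasSynced func() bool) string {
	waitForHandlersSync(stopCh, hasSynced)
	select {
	case <-stopCh:
		return "stopped before first cycle"
	default:
		return "scheduling started"
	}
}

func main() {
	stopCh := make(chan struct{})
	fmt.Println(run(stopCh, func() bool { return true }))

	close(stopCh) // simulate shutdown while handlers are still syncing
	fmt.Println(run(stopCh, func() bool { return false }))
}
```

Because the early exit only happens on shutdown, the caller does not need an error value or a panic; it checks the stop channel it already owns.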

@@ -88,7 +88,10 @@ func (pc *Scheduler) Run(stopCh <-chan struct{}) {
pc.cache.SetMetricsConf(pc.metricsConf)
pc.cache.Run(stopCh)
pc.cache.WaitForCacheSync(stopCh)
klog.V(2).Infof("Scheduler completes Initialization and start to run")
Member

We can add a log before the custom handler sync so that users can check the latency from the logs.

Author

Sure, I will add logs around waiting for the cache to initialize.
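
One possible shape for such logging is a small wrapper that logs before and after each wait and reports the elapsed time. This sketch uses the standard `log` package rather than klog, and `timedWait` is a hypothetical helper, not something in the PR; the sleeps stand in for the real waits.

```go
package main

import (
	"log"
	"time"
)

// timedWait logs before and after calling wait and returns how long the
// call blocked, so startup latency can be read from the logs.
func timedWait(name string, wait func()) time.Duration {
	log.Printf("waiting for %s", name)
	start := time.Now()
	wait()
	elapsed := time.Since(start)
	log.Printf("%s finished after %v", name, elapsed)
	return elapsed
}

func main() {
	// The sleeps are placeholders for the informer and handler waits.
	timedWait("informer cache sync", func() { time.Sleep(20 * time.Millisecond) })
	timedWait("event handlers sync", func() { time.Sleep(30 * time.Millisecond) })
}
```

Logging the two waits separately makes it possible to tell how much of the startup delay comes from the handlers rather than from the informer list.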

Signed-off-by: RamezesDong <donghouze666@outlook.com>
@RamezesDong RamezesDong force-pushed the wait-for-handlers-sync branch from daeb06f to 0810709 Compare May 18, 2024 11:53
@Monokaix
Member

Does the Volcano controller also need this change?

@RamezesDong
Author

Does the Volcano controller also need this change?

I don't think so. The controller only needs eventual consistency, so waiting for the cache sync is enough.

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2024
@volcano-sh-bot
Contributor

@RamezesDong: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


stale bot commented Feb 1, 2025

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2025
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
4 participants