Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface info of failed plugins during PreFilter, Filter and Permit #98041

Merged
merged 1 commit into from
Jan 29, 2021

Conversation

Huang-Wei
Copy link
Member

@Huang-Wei Huang-Wei commented Jan 14, 2021

What type of PR is this?

/kind feature
/sig scheduling

What this PR does / why we need it:

Basically when running plugins, the info that records the Pod failed on which plugins gets thrown away, and a high-level Status is returned. To resolve #94009, we need to know (1) which plugins are interested in which cluster events, and (2) which plugins a queued Pod failed on. This PR tries to resolve the latter concern.

Which issue(s) this PR fixes:

Pre-PR of #94009.

Special notes for your reviewer:

This PR only starts with surfacing failed plugins of PreFilter, Filter and Permit, if deem needed, we will extend the scope to other plugins.

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 14, 2021
@k8s-ci-robot
Copy link
Contributor

@Huang-Wei: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 14, 2021
@k8s-ci-robot k8s-ci-robot requested review from damemi and k82cn January 14, 2021 00:49
@Huang-Wei Huang-Wei marked this pull request as ready for review January 14, 2021 01:06
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 14, 2021
@Huang-Wei Huang-Wei changed the title Surface info of failed plugins during PerFilter and Filter [WIP] Surface info of failed plugins during PerFilter and Filter Jan 14, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 14, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2021
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 14, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2021
@Huang-Wei Huang-Wei changed the title [WIP] Surface info of failed plugins during PerFilter and Filter Surface info of failed plugins during PerFilter and Filter Jan 20, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2021
@Huang-Wei
Copy link
Member Author

/retest

@ahg-g @alculquicondor It's ready for review. This is the first PR to resolve #94009.

@@ -127,6 +131,16 @@ func (s *Status) Message() string {
return strings.Join(s.reasons, ", ")
}

// SetFailedPlugin sets the given plugin name to s.failedPlugin.
func (s *Status) SetFailedPlugin(plugin string) {
s.failedPlugin = plugin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since nil is a valid status, should we check and return if s == nil, in other functions as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory yes, but in practice it is not that necessary because we don't call this on nil or a concrect Success status. (If deem needed, it's even more necessary to guard AppendReason() than this)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, we can do that as a follow up.

@@ -109,6 +109,10 @@ type Status struct {
code Code
reasons []string
err error
// failedPlugin is an optional field that records the plugin name a Pod failed by.
// It's only set by the framework when a Unschedulable or UnschedulableAndUnresolvable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also include Error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read Error more as an abnormal fault, which is usually not associated with the plugin's semantics, such as "node not found", "internal labelSelector deducing error", etc. So my original idea was that for type of Error, we don't set failedPlugin, so in runtime:

  • if a Pod returns error on all nodes, its aggregated failedPlugin set is nil, so we move this pod upon any cluster event unconditionally
  • if a Pod return non-error on some nodes, we intersect the associated plugin name upon a cluster event with those failedPlugins

This may prevent intermittent internal errors from disabling the pod's move. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for the sake of separation of concerns, shouldn't the logic receiving the status decide how to deal with different codes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, even we set failedPlugin upon Error, that info wouldn't be carried all the way to the logic which decides to move pods or not:

Upon an error in (Pre)Filter, diagnosis will be discarded:

	feasibleNodes, diagnosis, err := g.findNodesThatFitPod(ctx, fwk, state, pod)
	if err != nil {
		return result, err
	}

Moreover during Filter(), error would be immediately returned once we hit the first Error (other in-parallel FIlter() would cancel()). So even we carry the failedPlugin (on errro) all the way to Pod moving logic, this failedPlugin might be misleading.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you have access to that when building the "diagnosis". In any case, my point is that "failed" != unschedulable. So we either include all failure modes, or name the field to indicate its limited scope.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, my point is that "failed" != unschedulable.

That's a fair point. I think I will leave the failedPlugin in Status (more specifically, just Error, Unschedulable and UnschedulableAndUnresolvable, not including Success, Wait and Skip). And in diagnosis, I will rename failedPlugin to unschedulablePlugins.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing the name to unschedulablePlugins sounds good to me.

pkg/scheduler/framework/runtime/framework.go Outdated Show resolved Hide resolved
pkg/scheduler/framework/runtime/framework.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 26, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 26, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 26, 2021
@Huang-Wei
Copy link
Member Author

/retest

@ahg-g
Copy link
Member

ahg-g commented Jan 28, 2021

/lgtm
/approve
/hold

not sure if you are waiting for others, so holding.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 28, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Copy link
Member

Giving a quick look after the fire has finished :)

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the PR description outdated? It seems that we did add it to Permit already.

pkg/scheduler/framework/interface.go Show resolved Hide resolved
NodeToStatusMap: framework.NodeToStatusMap{
"1": framework.NewStatus(framework.UnschedulableAndUnresolvable, "injected unschedulable status"),
"2": framework.NewStatus(framework.UnschedulableAndUnresolvable, "injected unschedulable status"),
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave a TODO as comment in the code too.


if err != nil {
t.Errorf("unexpected error: %v", err)
}

if len(nodeToStatusMap) != len(nodes) {
t.Errorf("unexpected failed status map: %v", nodeToStatusMap)
if len(diagnosis.NodeToStatusMap) != len(nodes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave as comment in code

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2021
@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2021
@Huang-Wei
Copy link
Member Author

Thanks @ahg-g @alculquicondor @chendave all for the review!

/hold cancel
/retest

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 28, 2021
@k8s-ci-robot k8s-ci-robot merged commit f402e47 into kubernetes:master Jan 29, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 29, 2021
@Huang-Wei Huang-Wei deleted the sched-enqueue-1 branch February 5, 2021 19:14
@Huang-Wei Huang-Wei changed the title Surface info of failed plugins during PerFilter and Filter Surface info of failed plugins during PerFilter, Filter and Permit Mar 7, 2021
@Huang-Wei Huang-Wei changed the title Surface info of failed plugins during PerFilter, Filter and Permit Surface info of failed plugins during PreFilter, Filter and Permit Mar 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid moving pods out of unschedulable status unconditionally
5 participants