-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Surface info of failed plugins during PreFilter, Filter and Permit #98041
Conversation
Skipping CI for Draft Pull Request. |
@Huang-Wei: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
81da742
to
1dea606
Compare
1dea606
to
b41f817
Compare
/retest @ahg-g @alculquicondor It's ready for review. This is the first PR to resolve #94009. |
@@ -127,6 +131,16 @@ func (s *Status) Message() string { | |||
return strings.Join(s.reasons, ", ") | |||
} | |||
|
|||
// SetFailedPlugin sets the given plugin name to s.failedPlugin. | |||
func (s *Status) SetFailedPlugin(plugin string) { | |||
s.failedPlugin = plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since nil is a valid status, should we check and return if s == nil
, in other functions as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory yes, but in practice it is not that necessary because we don't call this on nil
or a concrect Success
status. (If deem needed, it's even more necessary to guard AppendReason()
than this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, we can do that as a follow up.
pkg/scheduler/framework/interface.go
Outdated
@@ -109,6 +109,10 @@ type Status struct { | |||
code Code | |||
reasons []string | |||
err error | |||
// failedPlugin is an optional field that records the plugin name a Pod failed by. | |||
// It's only set by the framework when a Unschedulable or UnschedulableAndUnresolvable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we also include Error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read Error more as an abnormal fault, which is usually not associated with the plugin's semantics, such as "node not found", "internal labelSelector deducing error", etc. So my original idea was that for type of Error, we don't set failedPlugin, so in runtime:
- if a Pod returns error on all nodes, its aggregated failedPlugin set is nil, so we move this pod upon any cluster event unconditionally
- if a Pod return non-error on some nodes, we intersect the associated plugin name upon a cluster event with those failedPlugins
This may prevent intermittent internal errors from disabling the pod's move. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the sake of separation of concerns, shouldn't the logic receiving the status decide how to deal with different codes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, even we set failedPlugin upon Error, that info wouldn't be carried all the way to the logic which decides to move pods or not:
Upon an error in (Pre)Filter, diagnosis will be discarded:
feasibleNodes, diagnosis, err := g.findNodesThatFitPod(ctx, fwk, state, pod)
if err != nil {
return result, err
}
Moreover during Filter(), error would be immediately returned once we hit the first Error (other in-parallel FIlter() would cancel()). So even we carry the failedPlugin (on errro) all the way to Pod moving logic, this failedPlugin might be misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you have access to that when building the "diagnosis". In any case, my point is that "failed" != unschedulable. So we either include all failure modes, or name the field to indicate its limited scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, my point is that "failed" != unschedulable.
That's a fair point. I think I will leave the failedPlugin in Status (more specifically, just Error, Unschedulable and UnschedulableAndUnresolvable, not including Success, Wait and Skip). And in diagnosis, I will rename failedPlugin to unschedulablePlugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing the name to unschedulablePlugins sounds good to me.
2afff6f
to
aba4947
Compare
aba4947
to
c92cddc
Compare
/retest |
c92cddc
to
e7311b1
Compare
/lgtm not sure if you are waiting for others, so holding. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, Huang-Wei The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Giving a quick look after the fire has finished :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the PR description outdated? It seems that we did add it to Permit already.
NodeToStatusMap: framework.NodeToStatusMap{ | ||
"1": framework.NewStatus(framework.UnschedulableAndUnresolvable, "injected unschedulable status"), | ||
"2": framework.NewStatus(framework.UnschedulableAndUnresolvable, "injected unschedulable status"), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave a TODO as comment in the code too.
|
||
if err != nil { | ||
t.Errorf("unexpected error: %v", err) | ||
} | ||
|
||
if len(nodeToStatusMap) != len(nodes) { | ||
t.Errorf("unexpected failed status map: %v", nodeToStatusMap) | ||
if len(diagnosis.NodeToStatusMap) != len(nodes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave as comment in code
e7311b1
to
f8a6bdb
Compare
/lgtm |
Thanks @ahg-g @alculquicondor @chendave all for the review! /hold cancel |
What type of PR is this?
/kind feature
/sig scheduling
What this PR does / why we need it:
Basically when running plugins, the info that records the Pod failed on which plugins gets thrown away, and a high-level
Status
is returned. To resolve #94009, we need to know (1) which plugins are interested in which cluster events, and (2) which plugins a queued Pod failed on. This PR tries to resolve the latter concern.Which issue(s) this PR fixes:
Pre-PR of #94009.
Special notes for your reviewer:
This PR only starts with surfacing failed plugins of PreFilter, Filter and Permit, if deem needed, we will extend the scope to other plugins.
Does this PR introduce a user-facing change?: