pkg/kubelet: ensure node object exists before syncing pods #129464
base: master
Conversation
Adding the `do-not-merge/release-note-label-needed` label because no release-note block was detected. Please follow our release note process to remove it.
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
Welcome @bcho!
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.
Hi @bcho. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bcho. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/cc @neolit123
listing the node doesn't mean it has synced, correct?
from the ticket:
Kubelet should back off the node list-watch calls until the node object has been populated in the node informer cache.
but that can block indefinitely? how about a configurable timeout that can cover use cases like yours (a rough sketch follows after this comment).
However, in rare scenarios the node lister might be failing for some other reason before the first successful call (waiting for TLS bootstrapping, for example), which increases the back-off (maxing out at 30s). In theory, the gap between T6 and T5 could then be as long as 30s plus jitter, which aligns with the log timestamps we observed above.
30 seconds for a node list sounds like a lot. almost seems like this cluster would need 'priority and fairness' configured.
Or, we should invalidate the node informer cache after successfully registering the node from kubelet to maintain the correct version of node object in memory.
that seems like something that has to be done either way.
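For illustration only, a configurable timeout as suggested above could look roughly like the sketch below. The helper name, the polling interval, and the timeout parameter are all hypothetical; the kubelet does not currently expose such an option.

```go
package kubeletsketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// waitForNodeInCache polls the node informer cache until the node object
// appears or the (configurable) timeout expires, instead of blocking
// indefinitely on informer back-off.
func waitForNodeInCache(ctx context.Context, nodeLister corelisters.NodeLister, nodeName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			_, err := nodeLister.Get(nodeName)
			if apierrors.IsNotFound(err) {
				// The node has not been registered yet; keep polling until the deadline.
				return false, nil
			}
			if err != nil {
				return false, err
			}
			return true, nil
		})
}
```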
hi @neolit123
yep, this simple fix is to check that the informer has synced and that it has the node record read from kube-apiserver
If the node informer never reads the node object from the remote, it's supposed to be blocked or restarted because the states have drifted. But I didn't find a good way to do that in the current implementation.
Agreed, but these are the current default settings from the back-off manager used in the node lister. This is also why it might be better to start the node lister informer, or invalidate its cache, after registering the node. However, I am not sure what the impact on the static pod creation process would be.
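For context, an exponential back-off capped at 30s as described above behaves roughly like the following sketch. The concrete values are illustrative and not necessarily the exact defaults of the back-off manager used by the node lister.

```go
package kubeletsketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// illustrativeBackoff shows how the delay between failed list-watch attempts
// can grow toward 30s: each failure roughly doubles the wait up to the cap,
// so a node created just after a failed attempt may not be observed by the
// informer for up to ~30s plus jitter.
var illustrativeBackoff = wait.Backoff{
	Duration: 800 * time.Millisecond, // initial delay (illustrative value)
	Factor:   2.0,                    // double the delay on each failure
	Jitter:   0.1,                    // add up to 10% random jitter
	Steps:    10,                     // how many times the delay may grow
	Cap:      30 * time.Second,       // maximum delay between attempts
}
```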
what are the implications of doing it like that?
indeed, that might be non-trivial.
SIG Node (kubelet owners) should review and advise here. from my POV a configurable backoff / timeout option might make sense, if that can be set on the informer. i will run ok-to-test on this PR; static pods are quite tricky and i don't think we have full test coverage for all their edge cases. in any case, we must not break them.
/ok-to-test
The main issue is that the informer is hiding the short-watch signal. Even if we do a full list immediately after a short watch, there is still a chance the node object is not yet registered on the kube-apiserver side (event T4 in the issue). Given that this helper function runs in parallel with the node registration, the only viable way in the current implementation is to back off. Will try to join the SIG Node call, thanks for the pointer and reviews!
The nodeLister loop runs in parallel with the kubelet node registration call. In rare cases the nodeLister can list an empty node result and the subsequent watch can fail because kube-apiserver closes the connection for an unexpected reason. Such a case leaves the node informer with an empty cache that is marked as synced, and the informer has to back off until the next list-watch call to discover the created node object.
If kubelet syncs pods with node affinity before that next list-watch call, those pods fail with a NodeAffinity admission error.
This commit attempts to fix this by adding an extra check for the existence of the node object in the node informer cache before starting the pod sync.
fix: #129463
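A minimal sketch of the intended check, assuming access to the node lister and the informer's has-synced function; the function and variable names here are illustrative rather than the exact kubelet code in this change:

```go
package kubeletsketch

import (
	corelisters "k8s.io/client-go/listers/core/v1"
)

// nodeInformerReady reports whether pod sync may start: the node informer
// must report synced AND the node object for this kubelet must actually be
// present in the informer cache. A "synced" informer whose initial list
// raced with node registration and returned no nodes passes the first check
// but fails the second, which is the situation described above.
func nodeInformerReady(hasSynced func() bool, nodeLister corelisters.NodeLister, nodeName string) bool {
	if !hasSynced() {
		return false
	}
	if _, err := nodeLister.Get(nodeName); err != nil {
		// Not found (or a transient lister error): the node has not been
		// observed yet, so do not admit pods that need node affinity checks.
		return false
	}
	return true
}
```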
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #129463
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: