Ensure virtual nodes aren't stranded in GC graph #57503
Conversation
@kubernetes/sig-api-machinery-pr-reviews
```diff
@@ -259,6 +259,12 @@ func (gc *GarbageCollector) attemptToDeleteWorker() bool {
 		}
 		// retry if garbage collection of an object failed.
 		gc.attemptToDelete.AddRateLimited(item)
+	} else if !n.isObserved() {
+		// requeue if item hasn't been observed yet.
+		// otherwise a virtual node for an item added/removed during a watch outage can get orphaned.
```
Please don't use "orphaned". That term means something else in this area.
reworded comment
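For readers outside this codebase, here is a minimal, self-contained sketch of the requeue-until-observed pattern the added branch implements. It is not the actual garbage collector code; `node`, `processNext`, `attempt`, and `requeue` are hypothetical stand-ins for the real types and queue.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a GC graph node; "observed" means an
// informer event for the real object has been seen.
type node struct {
	observed bool
}

// processNext mirrors the shape of attemptToDeleteWorker after this change:
// retry on error, and also requeue nodes whose objects have not been observed
// yet, so a virtual node created during a watch outage keeps being re-verified
// instead of sitting in the graph indefinitely.
func processNext(n *node, attempt func(*node) error, requeue func(*node)) {
	if err := attempt(n); err != nil {
		// retry if garbage collection of the object failed
		requeue(n)
	} else if !n.observed {
		// requeue until the object is observed via the informer, or until an
		// attempt confirms it is gone and removes the node from the graph
		requeue(n)
	}
}

func main() {
	n := &node{}
	requeued := 0
	attempt := func(*node) error { return nil }
	requeue := func(*node) { requeued++ }

	processNext(n, attempt, requeue) // unobserved: requeued
	n.observed = true
	processNext(n, attempt, requeue) // observed: dropped from the queue

	fmt.Println("requeued", requeued, "time(s)") // prints: requeued 1 time(s)
}
```

The point of the new branch is liveness: previously, a nil error for a still-unobserved virtual node meant nothing would ever look at that node again.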
```diff
@@ -588,6 +592,9 @@ func (gb *GraphBuilder) processGraphChanges() bool {
 	glog.V(5).Infof("GraphBuilder process object: %s/%s, namespace %s, name %s, uid %s, event type %v", event.gvk.GroupVersion().String(), event.gvk.Kind, accessor.GetNamespace(), accessor.GetName(), string(accessor.GetUID()), event.eventType)
 	// Check if the node already exists
 	existingNode, found := gb.uidToNode.Read(accessor.GetUID())
+	if found {
+		existingNode.markObserved()
```
As discussed on slack, this is only safe because of callsite behavior. The method and workqueue are private, and it's not unusual in this area of the code, but it is comment-worthy.
This is what triggers marking the node observed in the "good" case when the informer sees the object.
added comments, and tightened up call sites so that the only thing exposed outside the normal informer paths is enqueuing virtual delete events into graphChanges
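To make the concern above concrete, here is a rough sketch of the observed-flag mechanics being discussed, assuming the flag is a `virtual` boolean guarded by a `sync.RWMutex` and that the graphChanges queue is fed only by the informers plus the explicitly-allowed virtual delete events. The types and names below are simplified stand-ins, not the real GraphBuilder.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a simplified stand-in for a GC graph node. virtual is true for
// nodes created from an ownerReference before the object itself was seen.
type node struct {
	virtualLock sync.RWMutex
	virtual     bool
}

// markObserved records that a real event for the object was seen.
func (n *node) markObserved() {
	n.virtualLock.Lock()
	defer n.virtualLock.Unlock()
	n.virtual = false
}

// isObserved reports whether the object behind this node has been observed.
func (n *node) isObserved() bool {
	n.virtualLock.RLock()
	defer n.virtualLock.RUnlock()
	return !n.virtual
}

// graphBuilder is a simplified stand-in for the component that drains the
// informer-fed graphChanges queue.
type graphBuilder struct {
	uidToNode map[string]*node
}

// processEvent mirrors the relevant slice of processGraphChanges: as soon as
// an event for a known UID arrives through the informer-fed queue, the node
// is marked observed. This is the "good" case; the requeue added to
// attemptToDeleteWorker covers objects the informers never report.
func (gb *graphBuilder) processEvent(uid string) {
	if existing, found := gb.uidToNode[uid]; found {
		existing.markObserved()
	}
	// ... the rest of graph maintenance (add/update/delete handling) ...
}

func main() {
	gb := &graphBuilder{uidToNode: map[string]*node{}}
	n := &node{virtual: true} // created from an ownerReference only
	gb.uidToNode["uid-1"] = n

	fmt.Println(n.isObserved()) // false: nothing has been seen yet
	gb.processEvent("uid-1")    // an informer event arrives for the object
	fmt.Println(n.isObserved()) // true
}
```

Marking observed only here is what makes the safety callsite-dependent: the flag is meaningful only if non-informer code cannot push arbitrary events into graphChanges, which is why the call sites were tightened to allow nothing but virtual delete events from outside the informer paths.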
lgtm
/approve
force-pushed from 74b8db2 to 4d06356
force-pushed from 4d06356 to df60789
comments addressed
Can we make this one happen? (regarding the milestone spam above)
Removing label
[MILESTONENOTIFIER] Milestone Removed From Pull Request
@caesarxuchao @ironcladlou @liggitt
Important: This pull request was missing labels required for the v1.9 milestone for more than 3 days:
priority: Must specify exactly one of
Removing label
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, liggitt
Associated issue: #56121
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
/test all
Tests are more than 96 hours old. Re-running tests.
@liggitt: The following tests failed:
Automatic merge from submit-queue (batch tested with PRs 57735, 57503). If you want to cherry-pick this change to another branch, please follow the instructions here.
Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
Fixes #56121
See #56121 (comment) for details on the sequence of events that can lead to virtual nodes getting stranded in the graph
(a branch with a commit that reliably triggers the cascading deletion test failure is at https://github.com/liggitt/kubernetes/commits/gc-debug-cascading... it's not easily made into a permanent test case because it only works when that test is run in isolation, and requires plumbing test hooks deep into the watch cache layer)