-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reevaluate eviction thresholds after reclaim functions #59841
Reevaluate eviction thresholds after reclaim functions #59841
Conversation
/assign @sjenning @Random-Liu |
/retest |
57e09c2
to
9b94a80
Compare
9b94a80
to
cfcd0df
Compare
I also added a node e2e test and verified that it fails with the current implementation and passes with this change |
cfcd0df
to
5e908f4
Compare
/lint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dashpole: 0 warnings.
In response to this:
/lint
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -80,5 +82,6 @@ func (cgc *realContainerGC) GarbageCollect() error { | |||
} | |||
|
|||
func (cgc *realContainerGC) DeleteAllUnusedContainers() error { | |||
glog.Infof("attempting to delete unused containers") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this the right log level? how often do we expect this will be output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essentially just moved from the deleteTerminatedContainers function in the eviction manager I removed. This keeps the log level the same as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will only be output when under disk pressure
nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim] | ||
for _, nodeReclaimFunc := range nodeReclaimFuncs { | ||
// attempt to reclaim the pressured resource. | ||
reclaimed, err := nodeReclaimFunc() | ||
nodeReclaimFunc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eating the error here. i don't think we want to bail out but not sure if we want it to go by unreported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah, good catch. i forgot to put error handling back in (i had moved it into the nodeReclaimFunc at one point).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
two comments, but overall lgtm. thanks! |
5e908f4
to
e0830d0
Compare
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dashpole, derekwaynecarr, sjenning The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Automatic merge from submit-queue (batch tested with PRs 59683, 59964, 59841, 59936, 59686). If you want to cherry-pick this change to another branch, please follow the instructions here. |
What this PR does / why we need it:
When the node comes under
DiskPressure
due to inodes or disk space, the eviction manager runs garbage collection functions to clean up dead containers and unused images.Currently, we use the strategy of trying to measure the disk space and inodes freed by garbage collection. However, as #46789 and #56573 point out, there are gaps in the implementation that can cause extra evictions even when they are not required. Furthermore, for nodes which frequently cycle through images, it results in a large number of evictions, as running out of inodes always causes an eviction.
This PR changes this strategy to call the garbage collection functions and ignore the results. Then, it triggers another collection of node-level metrics, and sees if the node is still under DiskPressure.
This way, we can simply observe the decrease in disk or inode usage, rather than trying to measure how much is freed.
Which issue(s) this PR fixes:
Fixes #46789
Fixes #56573
Related PR #56575
Special notes for your reviewer:
This will look cleaner after #57802 removes arguments from makeSignalObservations.
Release note:
/sig node
/kind bug
/priority important-soon
cc @kubernetes/sig-node-pr-reviews