Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reevaluate eviction thresholds after reclaim functions #59841

Merged
merged 1 commit into from
Feb 17, 2018

Conversation

dashpole
Copy link
Contributor

@dashpole dashpole commented Feb 14, 2018

What this PR does / why we need it:
When the node comes under DiskPressure due to inodes or disk space, the eviction manager runs garbage collection functions to clean up dead containers and unused images.
Currently, we use the strategy of trying to measure the disk space and inodes freed by garbage collection. However, as #46789 and #56573 point out, there are gaps in the implementation that can cause extra evictions even when they are not required. Furthermore, for nodes which frequently cycle through images, it results in a large number of evictions, as running out of inodes always causes an eviction.

This PR changes this strategy to call the garbage collection functions and ignore the results. Then, it triggers another collection of node-level metrics, and sees if the node is still under DiskPressure.
This way, we can simply observe the decrease in disk or inode usage, rather than trying to measure how much is freed.

Which issue(s) this PR fixes:
Fixes #46789
Fixes #56573
Related PR #56575

Special notes for your reviewer:
This will look cleaner after #57802 removes arguments from makeSignalObservations.

Release note:

NONE

/sig node
/kind bug
/priority important-soon
cc @kubernetes/sig-node-pr-reviews

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 14, 2018
@dashpole
Copy link
Contributor Author

cc @eastlondoner

@dashpole dashpole added this to the v1.10 milestone Feb 14, 2018
@dashpole
Copy link
Contributor Author

/assign @sjenning @Random-Liu

@sjenning
Copy link
Contributor

/retest

@dashpole dashpole force-pushed the metrics_after_reclaim branch from 57e09c2 to 9b94a80 Compare February 15, 2018 21:20
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 15, 2018
@dashpole dashpole force-pushed the metrics_after_reclaim branch from 9b94a80 to cfcd0df Compare February 15, 2018 22:04
@dashpole
Copy link
Contributor Author

I also added a node e2e test and verified that it fails with the current implementation and passes with this change

@dashpole dashpole force-pushed the metrics_after_reclaim branch from cfcd0df to 5e908f4 Compare February 16, 2018 00:55
@dashpole
Copy link
Contributor Author

/lint

Copy link
Contributor

@k8s-ci-robot k8s-ci-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dashpole: 0 warnings.

In response to this:

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -80,5 +82,6 @@ func (cgc *realContainerGC) GarbageCollect() error {
}

func (cgc *realContainerGC) DeleteAllUnusedContainers() error {
glog.Infof("attempting to delete unused containers")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this the right log level? how often do we expect this will be output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially just moved from the deleteTerminatedContainers function in the eviction manager I removed. This keeps the log level the same as before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only be output when under disk pressure

nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
for _, nodeReclaimFunc := range nodeReclaimFuncs {
// attempt to reclaim the pressured resource.
reclaimed, err := nodeReclaimFunc()
nodeReclaimFunc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eating the error here. i don't think we want to bail out but not sure if we want it to go by unreported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah, good catch. i forgot to put error handling back in (i had moved it into the nodeReclaimFunc at one point).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@sjenning
Copy link
Contributor

two comments, but overall lgtm. thanks!

@dashpole dashpole force-pushed the metrics_after_reclaim branch from 5e908f4 to e0830d0 Compare February 16, 2018 16:35
@sjenning
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 16, 2018
Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dashpole, derekwaynecarr, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2018
@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 59683, 59964, 59841, 59936, 59686). If you want to cherry-pick this change to another branch, please follow the instructions here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
6 participants