fix leaking memory backed volumes of terminated pods #36779
Conversation
```go
// Iterate through all pods and add to desired state of world if they don't
// exist but should
func (dswp *desiredStateOfWorldPopulator) findAndAddNewPods() {
	for _, pod := range dswp.podManager.GetPods() {
		if isPodTerminated(pod) {
			// Do not (re)add volumes for terminated pods
```
Unless "non-memory backed volume"?
I thought about this, but I couldn't think of a good reason to add a volume back to a node for a pod that is terminated. The only reason I'm filtering the ones we remove is to not make waves for the release. In general, I can't think of a reason we would want to leave volumes attached to a node for terminated pods except for debugging.
Neither can I, but in the interest of minimizing potentially disruptive changes, it makes sense to limit the scope of the change as much as possible. Thoughts?
@saad-ali the change required to check that would actually make this change more invasive, because I need to get the volume. In `findAndRemoveDeletedPods()`, that is already available via `volumeToMount`. In `findAndAddNewPods()`, the logic is inverted: for each pod, process its volumes. I'm not sure it is safe to call `dswp.desiredStateOfWorld.GetVolumesToMount()` in `findAndAddNewPods()`.
I can extend my e2e tests to include a check that ensures non-memory-backed volumes are untouched by this change?
Ack. That's fine. I'm a little nervous about it, but let's just get it in and give it time to bake and keep an eye out for strange volume behaviors.
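For context on the scope being debated here, a minimal sketch of what the memory-backed check could look like, covering the sources this PR removes (emptyDir with medium Memory, secrets, configmaps); the package and helper name are assumptions for illustration, not the PR's actual code:

```go
package volumeutil // illustrative package, not from the PR

import "k8s.io/kubernetes/pkg/api"

// isMemoryBackedVolume sketches the check under discussion: the volume
// sources the kubelet materializes on tmpfs and that this PR removes for
// terminated pods. The helper name is assumed.
func isMemoryBackedVolume(v *api.Volume) bool {
	switch {
	case v.EmptyDir != nil:
		// Only emptyDir with medium Memory lives on tmpfs.
		return v.EmptyDir.Medium == api.StorageMediumMemory
	case v.Secret != nil, v.ConfigMap != nil:
		// Secret and configmap volumes are always tmpfs-backed on the node.
		return true
	default:
		return false
	}
}
```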
@yujuhong Can you also take a look at this PR?
Is it possible to write a node E2E test that verifies this behavior?
For example, can we do a test like the following:
See example of something similar here:
@derekwaynecarr I'm looking into it
Change LGTM. Like @derekwaynecarr suggested, an e2e test would be nice.
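The inline example above is elided, but roughly, such a node test would run a pod with a memory-backed emptyDir to completion and then assert that the volume directory disappears from the kubelet pods dir. A sketch of the verification step only, as a Ginkgo/Gomega fragment (assumes the e2e framework's dot-imports, the `filepath`/`os`/`time` packages, and a `pod` already run to completion; the volume name is assumed):

```go
// Fragment only: the kubelet lays out emptyDir volumes under
// /var/lib/kubelet/pods/<pod-UID>/volumes/kubernetes.io~empty-dir/<name>.
volumePath := filepath.Join("/var/lib/kubelet/pods", string(pod.UID),
	"volumes/kubernetes.io~empty-dir", "memory-volume") // name assumed
Eventually(func() bool {
	_, err := os.Stat(volumePath)
	return os.IsNotExist(err) // true once the kubelet tears the volume down
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
```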
Force-pushed eaa1be0 to 010a826
@derekwaynecarr @yujuhong updated PR with e2e node test
Jenkins GCE Node e2e failed for commit 010a826c6a0b25dbcc0913357acba92fdeb1e2c3. Full PR test history.
Jenkins verification failed for commit 010a826c6a0b25dbcc0913357acba92fdeb1e2c3. Full PR test history.
Force-pushed 010a826 to 3dc3de1
```go
{
	Name: "kubelet-pods",
	VolumeSource: api.VolumeSource{
		HostPath: &api.HostPathVolumeSource{Path: "/var/lib/kubelet/pods"},
```
Can you put a `// TODO` to pull this value from the test context for the node (it needs to be added)?
one nit for a todo, otherwise this LGTM. @saad-ali - can you take a pass and mark for 1.5 milestone if you agree?
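The requested nit would presumably look something like this in the test fixture (illustrative placement, not the actual commit):

```go
VolumeSource: api.VolumeSource{
	// TODO: pull this path from the node test context instead of
	// hard-coding it (the context field needs to be added first).
	HostPath: &api.HostPathVolumeSource{Path: "/var/lib/kubelet/pods"},
},
```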
Force-pushed 3dc3de1 to b80bea4
One minor comment about potentially minimizing the change further
/lgtm
Automatic merge from submit-queue
Automatic merge from submit-queue (batch tested with PRs 39493, 39496)

kubelet: fix nil deref in volume type check

PR #36779, which addressed memory exhaustion from a build-up of terminated pods with memory-backed volumes on the node, introduced this. For the `VolumeSpec`, either the `Volume` or the `PersistentVolume` field is set, never both. This results in a nil deref on PVs. Since PVs are inherently not memory-backed, only local/ephemeral volumes need to be considered. This needs to go into 1.5 as well.

Fixes #39480

@saad-ali @derekwaynecarr @grosskur @gnufied

```release-note
fixes nil dereference when doing a volume type check on persistent volumes
```
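The guard the fix needs can be sketched as follows: a `volume.Spec` carries either a `Volume` or a `PersistentVolume` pointer, so the check must bail out before dereferencing `spec.Volume`. The helper name is assumed and reuses the earlier sketch; this is the shape of the fix, not its actual diff:

```go
package volumeutil // illustrative, not the actual fix

import "k8s.io/kubernetes/pkg/volume"

// isMemoryBackedSpec guards against the nil deref described above: a
// volume.Spec is populated with either Volume or PersistentVolume, never
// both, and PVs are by nature not memory-backed.
func isMemoryBackedSpec(spec *volume.Spec) bool {
	if spec.Volume == nil {
		return false // PersistentVolume-backed; nothing tmpfs here
	}
	return isMemoryBackedVolume(spec.Volume) // helper from the earlier sketch
}
```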
Automatic merge from submit-queue (batch tested with PRs 37228, 40146, 40075, 38789, 40189)

kubelet: storage: teardown terminated pod volumes

This is a continuation of the work done in #36779. There really is no reason to keep volumes for terminated pods attached on the node. This PR extends the removal of volumes on the node from memory-backed volumes (the current policy) to all volumes. @pmorie raised a concern about the impact on debugging volume-related issues if terminated pod volumes are removed. To address this, the PR adds a `--keep-terminated-pod-volumes` flag to the kubelet and sets it for `hack/local-up-cluster.sh`. For consideration in 1.6.

Fixes #35406

@derekwaynecarr @vishh @dashpole

```release-note
kubelet tears down pod volumes on pod termination rather than pod deletion
```
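Shape-wise, the follow-up's policy plus the new flag might gate teardown like this; all names here are assumptions about the change, not the actual diff:

```go
package teardownsketch // illustrative only

import "k8s.io/kubernetes/pkg/api"

// volumePolicy stands in for kubelet config; the field mirrors the
// proposed --keep-terminated-pod-volumes flag.
type volumePolicy struct {
	keepTerminatedPodVolumes bool
}

// isPodTerminated is simplified here: a pod in a final phase. The real
// check also considers deleted pods with no running containers.
func isPodTerminated(pod *api.Pod) bool {
	return pod.Status.Phase == api.PodSucceeded ||
		pod.Status.Phase == api.PodFailed
}

// shouldTearDownVolumes sketches the 1.6 policy: tear down all volumes of
// terminated pods unless the operator opted to keep them for debugging.
func (p *volumePolicy) shouldTearDownVolumes(pod *api.Pod) bool {
	return isPodTerminated(pod) && !p.keepTerminatedPodVolumes
}
```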
@sjenning Should this change also delete the emptyDir-Memory volume for OOMKilled terminated containers? I still get an OOMKilled loop when my cgroup memory limit is reached because of large files written to the emptyDir, with OpenShift 3.6.
Currently, we allow volumes to remain mounted on the node even though the pod is terminated. This creates a vector for a malicious user to exhaust memory on the node by creating memory-backed volumes containing large files.

This PR removes memory-backed volumes (emptyDir w/ medium Memory, secrets, configmaps) of terminated pods from the node.
@saad-ali @derekwaynecarr