GC pod ips #35572

bprashanth · 2016-10-26T01:31:31Z

Finally managed to write a failing test.
Supersedes #34373

GC pod ips

feiskyer · 2016-10-26T06:06:46Z

pkg/kubelet/network/kubenet/kubenet_linux.go

@@ -49,6 +49,7 @@ import (
 	"strconv"

 	"k8s.io/kubernetes/pkg/kubelet/network/hostport"
+	"path/filepath"


nit: move this above

yujuhong · 2016-10-26T15:03:54Z

test/utils/conditions.go

+	statuses := pod.Status.ContainerStatuses
+	if len(statuses) == 0 {
+		return states
+	} else {


nit: redundant else

yujuhong · 2016-10-26T15:09:30Z

test/utils/conditions.go

@@ -82,6 +82,23 @@ func FailedContainers(pod *api.Pod) map[string]ContainerFailures {
 	return states
 }

+// TerminatedContainers inspects all containers in a pod and returns those
+// that have terminated.


nit: explain what the map contains.
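A minimal sketch of what such a doc comment could look like, together with the early return from the previous nit written without the else. The types are simplified stand-ins for api.Pod and its container statuses, and the map's value type (termination reason keyed by container name) is an assumption for illustration, not necessarily what the PR returns.

```go
package main

import "fmt"

// Simplified stand-ins for the api.Pod status types (assumptions for this sketch).
type terminatedState struct {
	Reason   string
	ExitCode int
}

type containerStatus struct {
	Name       string
	Terminated *terminatedState // nil unless the container has terminated
}

type pod struct {
	ContainerStatuses []containerStatus
}

// terminatedContainers inspects all containers in a pod and returns a map
// from container name to termination reason. Containers that are waiting or
// running are omitted, so an empty map means nothing has terminated.
func terminatedContainers(p *pod) map[string]string {
	states := make(map[string]string)
	statuses := p.ContainerStatuses
	if len(statuses) == 0 {
		return states
	}
	// No else needed: the branch above has already returned.
	for _, status := range statuses {
		if status.Terminated != nil {
			states[status.Name] = status.Terminated.Reason
		}
	}
	return states
}

func main() {
	p := &pod{ContainerStatuses: []containerStatus{
		{Name: "web"},
		{Name: "sidecar", Terminated: &terminatedState{Reason: "Error", ExitCode: 1}},
	}}
	fmt.Println(terminatedContainers(p))
}
```

Since ranging over an empty slice is a no-op, the length check could arguably be dropped altogether; it is kept here only to mirror the shape of the diff.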

yujuhong · 2016-10-26T15:21:33Z

test/e2e_node/restart_test.go

+		// test node or default limits applied (if any). It's is essential
+		// that no containers end up in terminated. 100 was chosen because
+		// it's the max pods per node.
+		POD_COUNT             = 100


nit: why use all-caps names with underscores? The Go convention is to use mixedCaps names.

POSIX. updated.

yujuhong · 2016-10-26T15:24:29Z

test/e2e_node/restart_test.go

+		POD_CREATION_INTERVAL = 100 * time.Millisecond
+		RECOVER_TIMEOUT       = 5 * time.Minute
+		START_TIMEOUT         = 3 * time.Minute
+		MIN_PODS              = 20


What's the assumption about the IP capacity? It should affect the calculation of MIN_PODS and RESTART_COUNT.

yeah it's above both (255), added comments

yujuhong · 2016-10-26T16:16:56Z

test/e2e_node/restart_test.go

+
+		runningPods = []*api.Pod{}
+		for _, pod := range podList.Items {
+			if r, err := testutils.PodRunningReady(&pod); err != nil {


nit: combine both conditions into one?
; err != nil || !r {
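A rough sketch of the loop with the two conditions merged. The pod type and podRunningReady are simplified stand-ins for api.Pod and testutils.PodRunningReady; their fields and behavior are assumptions made only so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a simplified stand-in for api.Pod (assumption for this sketch).
type pod struct {
	Name  string
	Phase string
	Ready bool
}

// podRunningReady is a hypothetical stand-in for testutils.PodRunningReady:
// it errors if the pod is not running and otherwise reports readiness.
func podRunningReady(p *pod) (bool, error) {
	if p.Phase != "Running" {
		return false, errors.New("pod " + p.Name + " is in phase " + p.Phase)
	}
	return p.Ready, nil
}

// runningReady keeps only the pods that are both running and ready.
func runningReady(pods []pod) []*pod {
	ready := []*pod{}
	for i := range pods {
		p := &pods[i] // address of the slice element, not of the loop variable
		if r, err := podRunningReady(p); err != nil || !r {
			continue // one check covers both "errored" and "not ready"
		}
		ready = append(ready, p)
	}
	return ready
}

func main() {
	pods := []pod{
		{Name: "a", Phase: "Running", Ready: true},
		{Name: "b", Phase: "Pending"},
		{Name: "c", Phase: "Running", Ready: false},
	}
	for _, p := range runningReady(pods) {
		fmt.Println(p.Name)
	}
}
```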

yujuhong · 2016-10-26T16:21:41Z

test/e2e_node/restart_test.go

+				By("Confirm no containers have terminated")
+				for _, pod := range postRestartRunningPods {
+					if c := testutils.TerminatedContainers(pod); len(c) != 0 {
+						framework.Failf("Pod %v has failed containers %+v after docker restart, this might indicate an IP leak", pod.Name, c)


nit: s/Pod %v/Pod %q
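To illustrate why %q can be preferable for names in failure messages, here is a small comparison. The helper names withV and withQ are made up for this example; the point is only that %q quotes the string, so an empty or whitespace-padded name stays visible.

```go
package main

import "fmt"

// withV formats the pod name with %v, which prints the string as-is.
func withV(name string) string {
	return fmt.Sprintf("Pod %v has failed containers", name)
}

// withQ formats the pod name with %q, which prints a double-quoted,
// escaped Go string literal.
func withQ(name string) string {
	return fmt.Sprintf("Pod %q has failed containers", name)
}

func main() {
	for _, name := range []string{"web-1", ""} {
		fmt.Println(withV(name))
		fmt.Println(withQ(name))
	}
}
```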

yujuhong · 2016-10-26T16:31:28Z

pkg/kubelet/network/kubenet/kubenet_linux.go

+	// release leaked ips
+	for ip, containerID := range ipContainerIdMap {
+		// if the container is not running, release IP
+		if !runningContainerIDs.Has(containerID) {


nit (optional): reduce the indent by

if runningContainerIDs.Has(containerID) { continue }
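A sketch of the loop with the early continue applied. It uses plain maps in place of ipContainerIdMap and the runningContainerIDs set, and the helper leakedIPs only collects the IPs instead of releasing them; both simplifications are assumptions made to keep the example runnable on its own.

```go
package main

import (
	"fmt"
	"sort"
)

// leakedIPs returns the IPs whose owning container is not running.
// Hypothetical helper: the code under review releases each leaked IP in
// place, whereas this sketch just returns them in sorted order.
func leakedIPs(ipToContainer map[string]string, running map[string]bool) []string {
	var leaked []string
	for ip, containerID := range ipToContainer {
		if running[containerID] {
			continue // container is still running, keep its IP
		}
		leaked = append(leaked, ip)
	}
	sort.Strings(leaked) // map iteration order is random, so sort for stable output
	return leaked
}

func main() {
	ips := map[string]string{
		"10.0.0.2": "aaa",
		"10.0.0.3": "bbb",
		"10.0.0.4": "ccc",
	}
	running := map[string]bool{"bbb": true}
	fmt.Println(leakedIPs(ips, running))
}
```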

yujuhong · 2016-10-26T16:38:13Z

pkg/kubelet/network/kubenet/kubenet_linux.go

+	for _, pod := range pods {
+		containerID, err := plugin.host.GetRuntime().GetPodContainerID(pod)
+		if err != nil {
+			glog.Errorf("Failed to get infra containerID of %q/%q: %v", pod.Namespace, pod.Name, err)


Not sure if this should be an error. If docker has just restarted, kubelet may not have had the chance to start the infra container yet. Maybe Warningf is more appropriate.

changed to warning

yujuhong · 2016-10-26T17:03:21Z

pkg/kubelet/network/kubenet/kubenet_linux.go

+			continue
+		}
+
+		runningContainerIDs.Insert(strings.TrimSpace(containerID.ID))


What if the infra container has already terminated? You probably want to check the container state before inserting.

This is a race we can't easily avoid. If it exits after getNonExitedPods and this line, assume we get a teardown. It's more important that we detect the garbage in the ip dir.

My comment wasn't about the race condition. getNonExitedPods returns pods with at least one running container, which may include a pod with a running user container and a dead infra container. I don't see the state of the infra container being checked anywhere.

is that a problem?

Other than that the IPs used by those infra containers wouldn't be recycled, there is no problem.

And that pod will get cleaned up the normal way (teardown)? There's no way we restarted an old container, because we must've tried the infra container first and failed. So this must be a crashing infra container of a current user pod that will naturally die at some point in the future, no?

yujuhong · 2016-10-26T17:09:15Z

pkg/kubelet/network/kubenet/kubenet_linux.go

-// Assumes PodSpecs retrieved from the runtime include the name and ID of containers in
-// each pod.
-func (plugin *kubenetNetworkPlugin) getActivePods() ([]*hostport.ActivePod, error) {
+// getNonExitedPods returns a list of pods running or ready to run on this node


What "ready to run" means here is unclear. How about just "returns a list of pods in which at least one container is running"?

On the other hand, as I said in line 679, you probably just want a list of "non-terminated" infra containers. You can filter out pods with a terminated or non-existent infra container in one place.

Just reworded the comment, since this has already been tested and refactoring would require more stress testing for no functional benefit.

It's just a very convoluted way to get a set of running infra containers, but I get the need to keep the change minimal for cherrypicking.
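For the record, the "filter in one place" alternative discussed in this thread might look roughly like the sketch below. This is not what the PR does. The pod and container types, the Infra field, and runningInfraContainerIDs are all hypothetical; only the strings.TrimSpace on the ID mirrors the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// container and pod are simplified, hypothetical stand-ins for the
// runtime's view of a pod; Infra is nil when no infra container exists.
type container struct {
	ID      string
	Running bool
}

type pod struct {
	Name  string
	Infra *container
}

// runningInfraContainerIDs returns the set of infra container IDs that are
// currently running. Pods whose infra container is missing or terminated
// are skipped here, so callers never need to re-check container state.
func runningInfraContainerIDs(pods []pod) map[string]bool {
	ids := make(map[string]bool)
	for _, p := range pods {
		if p.Infra == nil || !p.Infra.Running {
			continue
		}
		ids[strings.TrimSpace(p.Infra.ID)] = true
	}
	return ids
}

func main() {
	pods := []pod{
		{Name: "healthy", Infra: &container{ID: "aaa\n", Running: true}},
		{Name: "dead-infra", Infra: &container{ID: "bbb", Running: false}},
		{Name: "no-infra"},
	}
	fmt.Println(runningInfraContainerIDs(pods))
}
```

With a set built this way, an IP held by a dead infra container would show up as leaked even while a user container in the same pod is still running, which is the case raised earlier in the thread.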

bprashanth

PTAL

bprashanth · 2016-10-26T21:19:48Z

test/e2e_node/restart_test.go

+					framework.Failf("Failed to start *any* pods after docker restart, this might indicate an IP leak")
+				}
+				By("Confirm no containers have terminated")
+				for _, pod := range postRestartRunningPods {


Yeah, that's lastState. I'm checking state, which shouldn't be terminated at the end of this experiment (it is in step 3 below):

1. Schedule new pods on a node whose IPs are all in use; they end up with state waiting (containerCreating, no available ips).

2. If a pod gets a chance to run, we have state running (startedAt blah).

3. If docker gets bounced and we end up with no available IPs, we have state terminated (finishedAt blah), Ready=false.

4. Now GC runs and frees up some IPs (in fact, with 100 pods, GC keeps running after the 3rd restart).

5. And we get state running (startedAt blah) with lastState terminated (finishedAt blah), Ready=true.

If GC hadn't run, we would be stuck at (3) with 100 containers in state terminated (finishedAt blah), Ready=false.
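The state versus lastState distinction above can be sketched as follows. The struct layout is a simplified, hypothetical stand-in for the container status in the pod API, and stuckTerminated is a made-up name; the only point is that the check looks at the current state and ignores the previous one.

```go
package main

import "fmt"

// containerState and containerStatus are simplified stand-ins for the pod
// API's container status types (assumptions for this sketch).
type containerState struct {
	Running    bool
	Terminated bool
}

type containerStatus struct {
	Name      string
	State     containerState // what the container is doing now
	LastState containerState // what it was doing before its latest restart
	Ready     bool
}

// stuckTerminated reports whether the container is terminated right now.
// LastState is deliberately ignored: a container that was terminated by the
// docker restart and has since come back is not a failure.
func stuckTerminated(s containerStatus) bool {
	return s.State.Terminated
}

func main() {
	// Recovered after GC freed an IP: running now, terminated previously.
	recovered := containerStatus{
		Name:      "c",
		State:     containerState{Running: true},
		LastState: containerState{Terminated: true},
		Ready:     true,
	}
	// No IP available after the restart: still terminated.
	leaked := containerStatus{
		Name:  "c",
		State: containerState{Terminated: true},
	}
	fmt.Println(stuckTerminated(recovered), stuckTerminated(leaked))
}
```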

bprashanth · 2016-10-26T22:39:44Z

@k8s-bot gce etcd3 e2e test this

yujuhong · 2016-10-26T22:49:04Z

LGTM.

k8s-ci-robot · 2016-10-26T23:47:45Z

Jenkins GCE etcd3 e2e failed for commit 18b2f085f36058447096aad601429e11e6c83250. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce etcd3 e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

bprashanth · 2016-10-27T02:12:58Z

@k8s-bot gce etcd3 e2e test this

bprashanth · 2016-10-27T04:52:16Z

Reviewers, I'm planning to apply the lgtm label tomorrow to cherrypick this for Friday's release. Tests passed and I have one LGTM.

bprashanth · 2016-10-27T18:58:08Z

Applying label, @dchen1107 I'll get verbal confirmation from you before cherrypicking

k8s-ci-robot · 2016-10-28T04:29:49Z

Jenkins verification failed for commit 18b2f085f36058447096aad601429e11e6c83250. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

The docker runtime doesn't tear down networking when GC-ing pods. rkt already does so make docker do it too. To ensure this happens, networking is always torn down for the container even if the container itself is not deleted. This prevents IPAM from leaking when the pod gets killed for some reason outside kubelet (like docker restart) or when pods are killed while kubelet isn't running. Fixes: kubernetes#14940 Related: kubernetes#35572

Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)

dockertools: call TearDownPod when GC-ing infra pods

The docker runtime doesn't tear down networking when GC-ing pods. rkt already does so make docker do it too. To ensure this happens, infra pods are now always GC-ed rather than gating them by containersToKeep. This prevents IPAM from leaking when the pod gets killed for some reason outside kubelet (like docker restart) or when pods are killed while kubelet isn't running.

Fixes: #14940
Related: #35572

bprashanth added the cherrypick-candidate label Oct 26, 2016

bprashanth added this to the v1.4 milestone Oct 26, 2016

bprashanth assigned thockin, dchen1107 and yujuhong Oct 26, 2016

googlebot added the cla: yes label Oct 26, 2016

bprashanth mentioned this pull request Oct 26, 2016

add ipam garbageCollection #34373

Closed

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Oct 26, 2016

feiskyer reviewed Oct 26, 2016

yujuhong reviewed Oct 26, 2016

yujuhong added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Oct 26, 2016

bprashanth force-pushed the ip_gc branch from e1a6523 to d111c21 Compare October 26, 2016 21:25

bprashanth commented Oct 26, 2016

bprashanth force-pushed the ip_gc branch from d111c21 to 18b2f08 Compare October 26, 2016 21:29

jessfraz added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 27, 2016

bprashanth added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2016

periodically GC pod ips

37bc34c

bprashanth force-pushed the ip_gc branch from 18b2f08 to 37bc34c Compare October 28, 2016 05:15

bprashanth added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 28, 2016
