
GC pod ips #35572

Merged
merged 1 commit into from Oct 28, 2016

Conversation

bprashanth
Contributor

@bprashanth bprashanth commented Oct 26, 2016

Finally managed to write a failing test.
Supersedes #34373

GC pod ips


@bprashanth bprashanth added this to the v1.4 milestone Oct 26, 2016
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Oct 26, 2016
@@ -49,6 +49,7 @@ import (
    "strconv"

    "k8s.io/kubernetes/pkg/kubelet/network/hostport"
    "path/filepath"
Member

nit: move this above

Contributor Author

done

statuses := pod.Status.ContainerStatuses
if len(statuses) == 0 {
    return states
} else {
Contributor

nit: redundant else
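For reference, this is roughly what dropping the redundant else looks like (a sketch; the else body is assumed to simply continue the function):

statuses := pod.Status.ContainerStatuses
if len(statuses) == 0 {
    return states
}
// handle the non-empty case here; no else is needed since the if branch returns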

@@ -82,6 +82,23 @@ func FailedContainers(pod *api.Pod) map[string]ContainerFailures {
return states
}

// TerminatedContainers inspects all containers in a pod and returns those
// that have terminated.
Contributor

nit: explain what the map contains.
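For context, one plausible shape for this helper (illustrative only, not necessarily the exact signature merged in the PR) maps each terminated container's ID to its name:

func TerminatedContainers(pod *api.Pod) map[string]string {
    states := make(map[string]string)
    for _, status := range pod.Status.ContainerStatuses {
        // a non-nil State.Terminated means the container is currently terminated
        if status.State.Terminated != nil {
            states[status.ContainerID] = status.Name
        }
    }
    return states
}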

// test node or default limits applied (if any). It is essential
// that no containers end up in the terminated state. 100 was chosen
// because it's the max pods per node.
POD_COUNT = 100
Contributor

nit: why use the all-caps-with-underscores names? Go convention is to use mixedCaps names.

Contributor Author

POSIX. updated.

POD_CREATION_INTERVAL = 100 * time.Millisecond
RECOVER_TIMEOUT = 5 * time.Minute
START_TIMEOUT = 3 * time.Minute
MIN_PODS = 20
Contributor

What's the assumption about the IP capacity? It should affect the calculation of MIN_PODS and RESTART_COUNT.

Contributor Author

Yeah, it's above both (255); added comments.
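A sketch of how the constants might read after the rename and the added capacity comment (values from the diff; names and comment wording are approximations, not the merged text):

const (
    // podCount is the max pods per node and stays below the ~255 IPs
    // available from the node's /24 pod range, so the node only runs
    // out of IPs if addresses leak.
    podCount            = 100
    podCreationInterval = 100 * time.Millisecond
    recoverTimeout      = 5 * time.Minute
    startTimeout        = 3 * time.Minute
    minPods             = 20
)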


runningPods = []*api.Pod{}
for _, pod := range podList.Items {
    if r, err := testutils.PodRunningReady(&pod); err != nil {
Contributor

nit: combine both conditions into one?
; err != nil || !r {
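Roughly what the combined form looks like (the loop body here, which skips pods that aren't running and ready, is an assumption for illustration):

for _, pod := range podList.Items {
    p := pod // copy so the pointer doesn't alias the loop variable
    if r, err := testutils.PodRunningReady(&p); err != nil || !r {
        continue
    }
    runningPods = append(runningPods, &p)
}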

By("Confirm no containers have terminated")
for _, pod := range postRestartRunningPods {
if c := testutils.TerminatedContainers(pod); len(c) != 0 {
framework.Failf("Pod %v has failed containers %+v after docker restart, this might indicate an IP leak", pod.Name, c)
Contributor

nit: s/Pod %v/Pod %q

// release leaked ips
for ip, containerID := range ipContainerIdMap {
    // if the container is not running, release IP
    if !runningContainerIDs.Has(containerID) {
Contributor

nit (optional): reduce the indent by

if runningContainerIDs.Has(containerID) {
    continue
}
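Spelled out, the flattened loop would look something like this (releaseIP is a hypothetical stand-in for whatever release call the real code makes):

// release leaked ips
for ip, containerID := range ipContainerIdMap {
    if runningContainerIDs.Has(containerID) {
        continue
    }
    // the container is not running, so its IP is considered leaked
    releaseIP(ip)
}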

for _, pod := range pods {
    containerID, err := plugin.host.GetRuntime().GetPodContainerID(pod)
    if err != nil {
        glog.Errorf("Failed to get infra containerID of %q/%q: %v", pod.Namespace, pod.Name, err)
Contributor

Not sure if this should be an error. If docker has just restarted, kubelet may not have had a chance to start the infra container yet. Maybe Warningf is more appropriate.

Contributor Author

changed to warning
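The adjusted call simply uses glog's warning level instead (a sketch of the change described above):

glog.Warningf("Failed to get infra containerID of %q/%q: %v", pod.Namespace, pod.Name, err)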

        continue
    }

    runningContainerIDs.Insert(strings.TrimSpace(containerID.ID))
Contributor

What if the infra container has already terminated? You probably want to check the container state before inserting.

Contributor Author

This is a race we can't easily avoid. If it exits between getNonExitedPods and this line, assume we get a teardown. It's more important that we detect the garbage in the ip dir.

Contributor

My comment wasn't about the race condition. getNonExitedPods returns pods with at least one running container, which may include a pod with a running user container and a dead infra container. I don't see the state of the infra container being checked anywhere.

Contributor Author

is that a problem?

Contributor

Other than the IPs used by those infra containers not being recycled, there is no problem.

Contributor Author

And that pod will get cleaned up the normal way (teardown)? There's no way we restarted an old container, because we must've tried the infra container first and failed, so this must be a crashing infra container of a current user pod that will naturally die at some point in the future. No?

// Assumes PodSpecs retrieved from the runtime include the name and ID of containers in
// each pod.
func (plugin *kubenetNetworkPlugin) getActivePods() ([]*hostport.ActivePod, error) {
// getNonExitedPods returns a list of pods running or ready to run on this node
Contributor

What "ready to run" means here is unclear...How about just "returns a list of pods where there are at least one running container in the pod"?

On the other hand, as I said in line 679, you probably just want a list of "non-terminated" infra containers. You can filter out pods with a terminated or non-existent infra container in one place.

Contributor Author

Just reworded the comment, since this has already been tested and refactoring would require more stress testing with no functional benefit.

Contributor

It's just a very convoluted way to get a set of running infra containers, but I get the need to keep the change minimal for cherrypicking.

@yujuhong yujuhong added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Oct 26, 2016
Contributor Author

@bprashanth bprashanth left a comment

PTAL

framework.Failf("Failed to start *any* pods after docker restart, this might indicate an IP leak")
}
By("Confirm no containers have terminated")
for _, pod := range postRestartRunningPods {
Contributor Author

Yeah, that's lastState; I'm checking state, which shouldn't be terminated at the end of this experiment (it is in step 3 below):

  1. Schedule new pods on a full-IP node; they end up with state waiting (containerCreating, no available ips).
  2. If a pod gets a chance to run, we have state running (startedAt blah).
  3. If docker gets bounced and we end up with no available ips, we have state terminated (finishedAt blah), Ready=false.
  4. Now GC runs and frees up some ips (in fact with 100 pods GC keeps running after the 3rd restart).
  5. And we get state running (startedAt blah) with lastState terminated (finishedAt blah), Ready=true.

If GC hadn't run, we would be stuck at (3): 100 containers with state terminated (finishedAt blah), ready=false.
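In code, the distinction above boils down to which field of the container status the test inspects (a sketch; the field names come from the pod status API, the surrounding loop is illustrative):

for _, status := range pod.Status.ContainerStatuses {
    if status.State.Terminated != nil {
        // currently terminated: the situation in step 3, treated as a possible IP leak
    }
    if status.LastTerminationState.Terminated != nil && status.State.Running != nil {
        // terminated once but running again: the recovered case in step 5, which is fine
    }
}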

@bprashanth
Contributor Author

@k8s-bot gce etcd3 e2e test this

@yujuhong
Contributor

LGTM.

@k8s-ci-robot
Contributor

Jenkins GCE etcd3 e2e failed for commit 18b2f085f36058447096aad601429e11e6c83250. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce etcd3 e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@bprashanth
Contributor Author

@k8s-bot gce etcd3 e2e test this

@bprashanth
Contributor Author

Reviewers, I'm planning to apply the lgtm label tomorrow to cherrypick this for Friday's release. Tests passed and I have one LGTM.

@jessfraz jessfraz added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 27, 2016
@bprashanth
Contributor Author

Applying the label. @dchen1107, I'll get verbal confirmation from you before cherrypicking.

@bprashanth bprashanth added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2016
@k8s-ci-robot
Contributor

Jenkins verification failed for commit 18b2f085f36058447096aad601429e11e6c83250. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@bprashanth bprashanth added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 28, 2016
dcbw added a commit to dcbw/kubernetes that referenced this pull request Dec 7, 2016
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does, so make docker do it too. To ensure this happens,
networking is always torn down for the container even if the
container itself is not deleted.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: kubernetes#14940
Related: kubernetes#35572
dcbw added a commit to dcbw/kubernetes that referenced this pull request Dec 9, 2016
dcbw added a commit to dcbw/kubernetes that referenced this pull request Dec 13, 2016
dcbw added a commit to dcbw/kubernetes that referenced this pull request Dec 14, 2016
dcbw added a commit to dcbw/kubernetes that referenced this pull request Dec 15, 2016
dcbw added a commit to dcbw/kubernetes that referenced this pull request Feb 16, 2017
k8s-github-robot pushed a commit that referenced this pull request Feb 17, 2017
Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)

dockertools: call TearDownPod when GC-ing infra pods

The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does, so make docker do it too. To ensure this happens,
infra pods are now always GC-ed rather than gating them by
containersToKeep.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: #14940
Related: #35572
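A rough illustration of the approach this commit message describes (the names and signatures here are approximations of the kubelet network plugin interface of that era, not the actual diff): when GC evicts an infra container, tear down its networking so the IP is returned to IPAM, even if the container record itself is kept around.

func teardownEvictedInfraContainer(plugin network.NetworkPlugin, namespace, name string, id kubecontainer.ContainerID) {
    // Always tear down networking for the evicted infra container; otherwise
    // its IP stays reserved in the plugin's IPAM store and eventually leaks.
    if err := plugin.TearDownPod(namespace, name, id); err != nil {
        glog.Warningf("Failed to tear down network for pod %q/%q: %v", namespace, name, err)
    }
}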
ahakanbaba pushed a commit to ahakanbaba/kubernetes that referenced this pull request Feb 17, 2017
Labels
cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.