
Fix race condition for consuming podIP via downward API #13052

Merged
merged 1 commit into kubernetes:master on Sep 2, 2015

Conversation

pmorie
Member

@pmorie pmorie commented Aug 21, 2015

Fixes the race that can make status.podIP be blank when consumed via the downward API.

@thockin @smarterclayton PTAL

```go
	return "", err
}
if inspectResult == nil {
	return "", nil // TODO: should be an error, probably
```
Contributor

Make these errors, then the caller can choose to ignore the error. If you can't get a network IP it's better to continue with no status IP. Don't log in this method.

Member Author

This if clause is bothering me; inspectContainer does it but it seems like it would represent a bug in the docker API to get an (empty container, nil) result from InspectContainer. It would be great if there were a comment on the original code with a reference to an issue either in kube or docker, but oh well. I'm strongly tempted to take this clause out...

Contributor

Write this code where you can tolerate not having the podIP, but you should
not ignore any errors in this function.

On Fri, Aug 21, 2015 at 4:42 PM, Paul Morie notifications@github.com wrote:

In pkg/kubelet/dockertools/manager.go, #13052 (comment):

```diff
@@ -276,6 +276,38 @@ type containerStatusResult struct {
 	err error
 }

+// determineInfraContainerIP inspects the given container and determines the IP address of that
+// container. The container is assumed to be the pod infra container; it is the responsibility
+// of the caller to ensure that the container ID passed is that of the infra container.
+func (dm *DockerManager) determineInfraContainerIP(dockerID string, pod *api.Pod) (string, error) {
+	inspectResult, err := dm.client.InspectContainer(dockerID)
+	if err != nil {
+		return "", err
+	}
+	if inspectResult == nil {
+		return "", nil // TODO: should be an error, probably
```


Clayton Coleman | Lead Engineer, OpenShift

@k8s-bot

k8s-bot commented Aug 21, 2015

GCE e2e build/test passed for commit 82ba717ce35deb4db96d35c454342e8b2d18a632.

```go
if dm.networkPlugin.Name() != network.DefaultPluginName {
	netStatus, err := dm.networkPlugin.Status(pod.Namespace, pod.Name, kubeletTypes.DockerID(dockerID))
	if err != nil {
		glog.Errorf("NetworkPlugin %s failed on the status hook for pod '%s' - %v", dm.networkPlugin.Name(), pod.Name, err)
```
Member Author

I'm wondering to myself if this should be an error in this and inspectContainer, which I cribbed this code from.

Contributor

Maybe, spawn an issue

On Fri, Aug 21, 2015 at 4:35 PM, Paul Morie notifications@github.com wrote:

In pkg/kubelet/dockertools/manager.go, #13052 (comment):

```diff
+	}
+	if inspectResult == nil {
+		return "", nil // TODO: should be an error, probably
+	}
+	result := ""
+	if inspectResult.State.Running {
+		if inspectResult.NetworkSettings != nil {
+			result = inspectResult.NetworkSettings.IPAddress
+		}
+		if dm.networkPlugin.Name() != network.DefaultPluginName {
+			netStatus, err := dm.networkPlugin.Status(pod.Namespace, pod.Name, kubeletTypes.DockerID(dockerID))
+			if err != nil {
+				glog.Errorf("NetworkPlugin %s failed on the status hook for pod '%s' - %v", dm.networkPlugin.Name(), pod.Name, err)
```


Member Author

Spawned #13055

@pmorie pmorie force-pushed the podip-fix branch 2 times, most recently from 1f1c4da to 2e92ea0 on August 21, 2015 at 22:16
@k8s-bot

k8s-bot commented Aug 21, 2015

GCE e2e build/test failed for commit 2e92ea03d6048068c7207a01e2310970e9e9681d.

@k8s-bot

k8s-bot commented Aug 21, 2015

GCE e2e build/test failed for commit 1f1c4da84887f8e2217824cfead1de0113797399.

@pmorie pmorie force-pushed the podip-fix branch 2 times, most recently from 0b111a6 to ab858a9 on August 22, 2015 at 00:35
@k8s-bot

k8s-bot commented Aug 22, 2015

GCE e2e build/test failed for commit 0b111a621de9022b40e1297b223bb9f653889c7c.

@k8s-bot

k8s-bot commented Aug 22, 2015

GCE e2e build/test failed for commit ab858a997bd7ac774e888845d969f6e78d1bb610.

```diff
@@ -285,16 +285,53 @@ type containerStatusResult struct {
 	err error
 }

+const podIPDownwardAPISelector = "status.podIP"
+
+func podDependsOnPodIP(pod *api.Pod) bool {
```
Contributor

godoc

@smarterclayton
Contributor

One nit: add this to the downward_api e2e test (please). The existing e2e failures look like flakes caused by over-scheduling.

@pmorie
Member Author

pmorie commented Aug 24, 2015

@smarterclayton Yep, I have been working on adding it to the e2e; an additional refactor was required for that. I'm going to split it out into another PR since it touches many files.

@k8s-bot

k8s-bot commented Aug 25, 2015

GCE e2e build/test failed for commit 6b5321d8a95e32788d81df4b1582b5868fb3804a.

@pmorie
Member Author

pmorie commented Aug 25, 2015

Interesting, this appears to be a legit failure on GCE. Going to try to reproduce.

@k8s-bot

k8s-bot commented Aug 25, 2015

GCE e2e build/test failed for commit b63215e3ae7f202707a95d29d28fa49725d1563d.

@pmorie
Member Author

pmorie commented Aug 25, 2015

I couldn't recreate the failure on GCE. Kicking the test runner -- I will need to reshuffle the commits in this branch before this goes in if the GCE run finishes successfully.

@k8s-bot

k8s-bot commented Aug 25, 2015

GCE e2e build/test failed for commit fe96aa0d7b3bcb6d7f2842b09e88b8b1fb963888.

@k8s-bot

k8s-bot commented Sep 1, 2015

GCE e2e build/test failed for commit 4ff66bd.

@yifan-gu
Contributor

yifan-gu commented Sep 1, 2015

@pmorie Thanks for the heads up. On the rkt side, since we haven't done much with the network plugin yet, we will revisit this later.
FWIW, this LGTM :)

@pmorie
Member Author

pmorie commented Sep 1, 2015

@smarterclayton squashed

@smarterclayton smarterclayton added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 1, 2015
@pmorie
Member Author

pmorie commented Sep 1, 2015

Thanks @yifan-gu

@smarterclayton
Contributor

LGTM, the e2e failures are flakes.

@pmorie
Member Author

pmorie commented Sep 2, 2015

@k8s-oncall

mwielgus added a commit that referenced this pull request Sep 2, 2015
Fix race condition for consuming podIP via downward API
@mwielgus mwielgus merged commit 3e99325 into kubernetes:master Sep 2, 2015
@pmorie
Member Author

pmorie commented Sep 2, 2015

@mwielgus I would have been fine waiting for the bot -- I just didn't know how long to expect to have to wait since LGTM was added yesterday

@wojtek-t
Member

wojtek-t commented Sep 8, 2015

@pmorie @smarterclayton - sorry for jumping late to the party, but it seems that this PR completely broke pod startup time latency (our scalability tests are just logging it, so I didn't catch it before - I will extend the tests today to check it). Previously the 99th percentile of pod startup time was less than 5 seconds; after this PR it usually exceeds 30 seconds.
I reverted it locally and that seems to help, so I'm going to revert this PR.

wojtek-t added a commit to wojtek-t/kubernetes that referenced this pull request Sep 8, 2015
@wojtek-t wojtek-t mentioned this pull request Sep 8, 2015
@smarterclayton
Contributor

What were the 90th and 95th percentiles?


@wojtek-t
Member

wojtek-t commented Sep 8, 2015

I didn't look for 95th percentile, but there was also significant difference (5-10x) in 50th and 90th percentile.

@wojtek-t
Member

wojtek-t commented Sep 8, 2015

So it wasn't only the long tail - it was just significantly worse at all percentiles.

@smarterclayton
Contributor

Ok, it sounds like we probably need to reuse the data we get from Docker when we create the pod infra container instead of calling list again. Do the scalability tests run with any network plugins? (I assume not.)


@wojtek-t
Member

wojtek-t commented Sep 8, 2015

We are running scalability tests on GCE (without any additional plugins).

@pmorie
Member Author

pmorie commented Sep 8, 2015

@smarterclayton @wojtek-t It's extremely curious that this would have affected performance that much. This PR doesn't add any new inspect calls, @smarterclayton; it already uses the result of the docker inspect performed when we started the infra container.

@smarterclayton
Contributor

All I can imagine is that the network plugin is being called, or some error
condition is being triggered that causes a bad code path to be executed
that we didn't spot while testing.


Labels
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

9 participants