
e2e flake: Downward API should provide pod IP as an env var #20403

Closed
gmarek opened this issue Feb 1, 2016 · 28 comments
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

gmarek (Contributor) commented Feb 1, 2016

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce/10735/

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/downward_api.go:81
"POD_IP=(?:\\d+)\\.(?:\\d+)\\.(?:\\d+)\\.(?:\\d+)" in container output
Expected
    POD_IP=
    KUBERNETES_PORT=tcp://10.0.0.1:443
    KUBERNETES_SERVICE_PORT=443
    HOSTNAME=downward-api-b0eb2988-c8ab-11e5-afda-42010af01555
    SHLVL=1
    HOME=/root
    KUBERNETES_PORT_443_TCP_ADDR=10.0.0.1
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    KUBERNETES_PORT_443_TCP_PORT=443
    KUBERNETES_PORT_443_TCP_PROTO=tcp
    KUBERNETES_PORT_443_TCP=tcp://10.0.0.1:443
    KUBERNETES_SERVICE_PORT_HTTPS=443
    PWD=/
    KUBERNETES_SERVICE_HOST=10.0.0.1
to match regular expression
    POD_IP=(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)

I'm not sure who's responsible for Downward API. cc @lavalamp @davidopp @bgrant0607

gmarek added the priority/important-soon and kind/flake labels Feb 1, 2016
goltermann assigned pmorie and unassigned lavalamp Feb 1, 2016
bgrant0607 (Member):

Previous occurrence #13690

pmorie (Member) commented Feb 2, 2016

@gmarek is this in a normal e2e? In a soak env?

gmarek (Contributor, Author) commented Feb 2, 2016

It's normal e2e.

pmorie (Member) commented Feb 2, 2016

@gmarek I mean, the context in which the failure was discovered -- previously this happened only in a soak env.

I'm trying to reproduce now.

gmarek (Contributor, Author) commented Feb 2, 2016

Yup, I understand. It happened in our main test suite.

pmorie (Member) commented Feb 2, 2016

Ran this about 500 times so far in a loop w/o a failure

pmorie (Member) commented Feb 3, 2016

Wasn't able to repro today, will continue tomorrow.

thockin (Member) commented Feb 3, 2016

Can we decorate the logs (in a non-horrible way) to make a failure more visible? That is a viable answer.

pmorie (Member) commented Feb 3, 2016

@thockin I was doing just that and found that the information is present in the logs, in a log message related to the hosts file mount. Analyzing now, it looks like on the node where the flake happened, a little over half the containers started with the podIP unknown.

I0201 06:19:08.157017    3382 kubelet.go:1157] Will create hosts mount for container:"dapi-container", podIP:: false

^^ example message

So, it looks like the underlying issue here is happening much more often than I personally thought. Will try to debug.

pmorie (Member) commented Feb 3, 2016

Never mind, it looks like most of the containers without a pod IP are infra containers. I remember now that this is expected for the infra container.

pmorie (Member) commented Feb 3, 2016

Looks like it happened 4 times to actual containers on this node that flaked during e2e.

$ grep "hosts mount" ~/pod-flake.log  | grep "podIP::" | grep -v POD
I0201 06:17:53.642291    3382 kubelet.go:1157] Will create hosts mount for container:"kube-proxy", podIP:: false
I0201 06:19:08.157017    3382 kubelet.go:1157] Will create hosts mount for container:"dapi-container", podIP:: false
I0201 06:19:49.106935    3382 kubelet.go:1157] Will create hosts mount for container:"c", podIP:: false
I0201 06:20:49.107846    3382 kubelet.go:1157] Will create hosts mount for container:"hostexec", podIP:: false

wojtek-t (Member) commented Feb 4, 2016

spxtr (Contributor) commented Feb 5, 2016

This is flaking 1 or 2 times per day on kubernetes-e2e-gce.

lavalamp (Member) commented Feb 5, 2016

thockin (Member) commented Feb 8, 2016

bumping to p0

thockin added the priority/critical-urgent label Feb 8, 2016
thockin removed the priority/important-soon label Feb 8, 2016
thockin (Member) commented Feb 22, 2016

@pmorie status on this?

ncdc (Member) commented Feb 24, 2016

@thockin fix PR has lgtm; running through jenkins now.

ncdc (Member) commented Feb 24, 2016

@thockin correction, PR has a regression for kube-managed /etc/hosts :-( we'll need to update it

ncdc (Member) commented Feb 24, 2016

I think I've figured out why this is flaking. Here's what I'm seeing locally and also what I've noticed from 1 of the GCE flakes:

  1. Kubelet sync loop processes a new pod
  2. Infra container is created
  3. Actual container is created (and it has the pod IP available correctly)
  4. Actual container fails to start for some reason
  5. Next sync loop iteration results in a new container
  6. New container is passed a pod IP of ""

It looks something like this in the logs (with lots of snipping):

kubelet.go:1203] container: e2e-tests-downward-api-57jz3/downward-api-53637d5a-db3a-11e5-8cef-001c42de2c3c/dapi-container podIP: "172.17.0.3" creating hosts mount: true
[snip]
manager.go:1881] Error running pod "downward-api-53637d5a-db3a-11e5-8cef-001c42de2c3c_e2e-tests-downward-api-57jz3(5363a626-db3a-11e5-82e5-001c42de2c3c)" container "dapi-container": runContainer: API error (500): Cannot start container 90cffa61d617ab4f98a3dc72f7549c5a5cb85487957a167e180f1747571a8b43: [8] System error: open /sys/fs/cgroup/cpu,cpuacct/system.slice/docker-90cffa61d617ab4f98a3dc72f7549c5a5cb85487957a167e180f1747571a8b43.scope/cpu.shares: no such file or directory
[snip]
kubelet.go:1203] container: e2e-tests-downward-api-57jz3/downward-api-53637d5a-db3a-11e5-8cef-001c42de2c3c/dapi-container podIP: "" creating hosts mount: false
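
For anyone following along, here is a minimal, self-contained Go sketch of the sequence described above; all of the names (syncState, syncPod, determineIP) are hypothetical stand-ins, not the real kubelet code:

package main

import "fmt"

// Hypothetical, heavily simplified model of the failure mode: the pod IP is
// only determined in the sync iteration that creates the infra container, so
// a container retried in a later iteration can be handed an empty IP.
type syncState struct {
	podIP string
}

func syncPod(state *syncState, infraAlreadyExists bool, determineIP func() string) string {
	if !infraAlreadyExists {
		// Only the iteration that creates the infra container learns the IP.
		state.podIP = determineIP()
	}
	// Later iterations return whatever was (not) carried over.
	return state.podIP
}

func main() {
	lookup := func() string { return "172.17.0.3" }

	// Iteration 1: infra container created; dapi-container gets the IP but fails to start.
	fmt.Println(syncPod(&syncState{}, false, lookup)) // "172.17.0.3"

	// Iteration 2: state is rebuilt, the infra container already exists, the IP
	// is never re-determined, and the retried container sees POD_IP="".
	fmt.Println(syncPod(&syncState{}, true, lookup)) // ""
}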

wojtek-t (Member):

@pmorie @ncdc - what's the status of this?

pmorie (Member) commented Feb 29, 2016

@wojtek-t I'm working on an e2e that captures the existing failure, hope to have a PR in the next day or so.

ncdc (Member) commented Mar 7, 2016

I was testing Paul's fix for this on Friday, but unfortunately the lack of #22607 in the tree resulted in a different error condition. Instead of failing because POD_IP was blank, it failed because the pod was marked as Failed and the container never ran.

yujuhong (Contributor) commented Mar 8, 2016

@ncdc I think the problem originated from relying on passing the pod IP to downstream functions by modifying the api.Pod object:

pod.Status.PodIP = dm.determineContainerIP(pod.Name, pod.Namespace, podInfraContainer)

The api.Pod object should be treated as read-only in most kubelet functions; if you write to it, you cannot assume your change won't be overwritten.

@pmorie's PR (#22666) builds on top of this hack, so the getPodIP() function won't always return the correct IP. Having a one-line hack is one thing, but wrapping it in a seemingly proper function makes me uncomfortable. Since DockerManager.SyncPod has the most up-to-date information about the pod IP, I think we should just pass that down the call stack. (Alternatively, we can always query the container runtime for the correct IP address, but IMO that's too expensive.)

The call stack that needs to be modified:

  -> containerRuntime.SyncPod
    -> dm.runContainerInPod
      -> kubelet.GenerateRunContainerOptions
        -> kubelet.makeMounts
          -> kubelet.makeHostsMount
        -> kubelet.makeEnvironmentVariables
          -> kubelet.podFieldSelectorRuntimeValue
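
To make that concrete, here is a rough sketch of the parameter-threading; the function names loosely mirror the call stack above but are simplified stand-ins, not the real kubelet signatures:

package main

import "fmt"

// Sketch: SyncPod determines the authoritative pod IP once and passes it down
// explicitly; nothing below reads (or writes) pod.Status.PodIP.

func makeEnvironmentVariables(podIP string) map[string]string {
	// The downward API value comes from the parameter, not from shared state.
	return map[string]string{"POD_IP": podIP}
}

func generateRunContainerOptions(podIP string) map[string]string {
	return makeEnvironmentVariables(podIP)
}

func runContainerInPod(podIP string) map[string]string {
	return generateRunContainerOptions(podIP)
}

func syncPod(podIP string) map[string]string {
	return runContainerInPod(podIP)
}

func main() {
	fmt.Println(syncPod("172.17.0.3")) // map[POD_IP:172.17.0.3]
}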

/cc @kubernetes/sig-node, FYI.

ncdc (Member) commented Mar 8, 2016

@yujuhong cool, thanks for the detailed explanation. Where in SyncPod does the pod IP exist other than the line you linked above, which only gets called when the infra container is created?

yujuhong (Contributor) commented Mar 8, 2016

If the infra container was not created in the same sync iteration, you should be able to get the pod IP from podStatus *kubecontainer.PodStatus, which is passed to SyncPod().

ncdc (Member) commented Mar 8, 2016

Ok, so if we put if/else logic in for that check, then we can pass the pod IP around as needed, right?
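
For illustration, that if/else might look roughly like the following sketch; the types and names here are hypothetical, and podStatus is a stand-in for kubecontainer.PodStatus rather than the real type:

package main

import "fmt"

// Hypothetical sketch: prefer the IP determined for an infra container created
// in this sync iteration, otherwise fall back to the IP already recorded in
// the runtime pod status that was passed to SyncPod.
type podStatus struct {
	IP string
}

func podIPForSync(infraJustCreated bool, determinedIP string, status podStatus) string {
	if infraJustCreated {
		return determinedIP
	}
	return status.IP
}

func main() {
	fmt.Println(podIPForSync(true, "172.17.0.3", podStatus{}))        // infra created this iteration
	fmt.Println(podIPForSync(false, "", podStatus{IP: "172.17.0.3"})) // existing infra container
}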

pmorie (Member) commented May 5, 2016

Has anyone hit this recently? Afaik, it can be closed. I haven't seen it linked to anything recently.
