
e2e flake: Downward API should provide pod IP as an env var #20403

Closed
gmarek opened this issue Feb 1, 2016 · 28 comments
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

gmarek (Contributor) commented Feb 1, 2016

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce/10735/

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/downward_api.go:81
"POD_IP=(?:\\d+)\\.(?:\\d+)\\.(?:\\d+)\\.(?:\\d+)" in container output
Expected
    POD_IP=
    KUBERNETES_PORT=tcp://10.0.0.1:443
    KUBERNETES_SERVICE_PORT=443
    HOSTNAME=downward-api-b0eb2988-c8ab-11e5-afda-42010af01555
    SHLVL=1
    HOME=/root
    KUBERNETES_PORT_443_TCP_ADDR=10.0.0.1
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    KUBERNETES_PORT_443_TCP_PORT=443
    KUBERNETES_PORT_443_TCP_PROTO=tcp
    KUBERNETES_PORT_443_TCP=tcp://10.0.0.1:443
    KUBERNETES_SERVICE_PORT_HTTPS=443
    PWD=/
    KUBERNETES_SERVICE_HOST=10.0.0.1
to match regular expression
    POD_IP=(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)

I'm not sure who's responsible for Downward API. cc @lavalamp @davidopp @bgrant0607

gmarek added the priority/important-soon and kind/flake labels Feb 1, 2016
goltermann assigned pmorie and unassigned lavalamp Feb 1, 2016
bgrant0607 (Member):

Previous occurrence #13690

pmorie (Member) commented Feb 2, 2016

@gmarek is this in a normal e2e? In a soak env?

gmarek (Contributor, Author) commented Feb 2, 2016

It's normal e2e.

pmorie (Member) commented Feb 2, 2016

@gmarek I mean, the context in which the failure was discovered -- previously this happened only in a soak env.

I'm trying to reproduce now.

gmarek (Contributor, Author) commented Feb 2, 2016

Yup, I understand. It happened in our main test suite.

pmorie (Member) commented Feb 2, 2016

Ran this about 500 times so far in a loop w/o a failure

pmorie (Member) commented Feb 3, 2016

Wasn't able to repro today, will continue tomorrow.

thockin (Member) commented Feb 3, 2016

Can we decorate the logs (in a non-horrible way) to make a failure more visible? That is a viable answer.

pmorie (Member) commented Feb 3, 2016

@thockin I was doing just that and found that the information is present in the logs, in a log message related to the hosts file mount. Analyzing now, it looks like on the node where the flake happened, a little over half the containers started with the podIP unknown.

I0201 06:19:08.157017    3382 kubelet.go:1157] Will create hosts mount for container:"dapi-container", podIP:: false

^^ example message

So, it looks like the underlying issue here is happening much more often than I personally thought. Will try to debug.

pmorie (Member) commented Feb 3, 2016

Never mind, it looks like most of the containers without a pod IP are infra containers. I remember now that this is expected for the infra container.

pmorie (Member) commented Feb 3, 2016

Looks like it happened 4 times to actual containers on this node that flaked during e2e.

$ grep "hosts mount" ~/pod-flake.log  | grep "podIP::" | grep -v POD
I0201 06:17:53.642291    3382 kubelet.go:1157] Will create hosts mount for container:"kube-proxy", podIP:: false
I0201 06:19:08.157017    3382 kubelet.go:1157] Will create hosts mount for container:"dapi-container", podIP:: false
I0201 06:19:49.106935    3382 kubelet.go:1157] Will create hosts mount for container:"c", podIP:: false
I0201 06:20:49.107846    3382 kubelet.go:1157] Will create hosts mount for container:"hostexec", podIP:: false

wojtek-t (Member) commented Feb 4, 2016

spxtr (Contributor) commented Feb 5, 2016

This is flaking 1 or 2 times per day on kubernetes-e2e-gce.

lavalamp (Member) commented Feb 5, 2016

thockin (Member) commented Feb 8, 2016

bumping to p0

thockin added the priority/critical-urgent label Feb 8, 2016
thockin removed the priority/important-soon label Feb 8, 2016
thockin (Member) commented Feb 22, 2016

@pmorie status on this?

ncdc (Member) commented Feb 24, 2016

@thockin fix PR has lgtm; running through jenkins now.

ncdc (Member) commented Feb 24, 2016

@thockin correction, PR has a regression for kube-managed /etc/hosts :-( we'll need to update it

ncdc (Member) commented Feb 24, 2016

I think I've figured out why this is flaking. Here's what I'm seeing locally and also what I've noticed from 1 of the GCE flakes:

  1. Kubelet sync loop processes a new pod
  2. Infra container is created
  3. Actual container is created (and it has the pod IP available correctly)
  4. Actual container fails to start for some reason
  5. Next sync loop iteration results in a new container
  6. New container is passed a pod IP of ""

It looks something like this in the logs (with lots of snipping):

kubelet.go:1203] container: e2e-tests-downward-api-57jz3/downward-api-53637d5a-db3a-11e5-8cef-001c42de2c3c/dapi-container podIP: "172.17.0.3" creating hosts mount: true
[snip]
manager.go:1881] Error running pod "downward-api-53637d5a-db3a-11e5-8cef-001c42de2c3c_e2e-tests-downward-api-57jz3(5363a626-db3a-11e5-82e5-001c42de2c3c)" container "dapi-container": runContainer: API error (500): Cannot start container 90cffa61d617ab4f98a3dc72f7549c5a5cb85487957a167e180f1747571a8b43: [8] System error: open /sys/fs/cgroup/cpu,cpuacct/system.slice/docker-90cffa61d617ab4f98a3dc72f7549c5a5cb85487957a167e180f1747571a8b43.scope/cpu.shares: no such file or directory
[snip]
kubelet.go:1203] container: e2e-tests-downward-api-57jz3/downward-api-53637d5a-db3a-11e5-8cef-001c42de2c3c/dapi-container podIP: "" creating hosts mount: false
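
For anyone following along, here is a minimal, self-contained Go sketch of the sequence described above; all of the names (syncState, syncPod, determineIP) are hypothetical stand-ins, not the real kubelet code:

package main

import "fmt"

// Hypothetical, heavily simplified model of the failure mode: the pod IP is
// only determined in the sync iteration that creates the infra container, so
// a container retried in a later iteration can be handed an empty IP.
type syncState struct {
	podIP string
}

func syncPod(state *syncState, infraAlreadyExists bool, determineIP func() string) string {
	if !infraAlreadyExists {
		// Only the iteration that creates the infra container learns the IP.
		state.podIP = determineIP()
	}
	// Later iterations return whatever was (not) carried over.
	return state.podIP
}

func main() {
	lookup := func() string { return "172.17.0.3" }

	// Iteration 1: infra container created; dapi-container gets the IP but fails to start.
	fmt.Println(syncPod(&syncState{}, false, lookup)) // "172.17.0.3"

	// Iteration 2: state is rebuilt, the infra container already exists, the IP
	// is never re-determined, and the retried container sees POD_IP="".
	fmt.Println(syncPod(&syncState{}, true, lookup)) // ""
}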

wojtek-t (Member):

@pmorie @ncdc - what's the status of this?

pmorie (Member) commented Feb 29, 2016

@wojtek-t I'm working on an e2e that captures the existing failure, hope to have a PR in the next day or so.

ncdc (Member) commented Mar 7, 2016

I was testing Paul's fix for this on Friday, but unfortunately the lack of #22607 in the tree resulted in a different error condition. Instead of failing because POD_IP was blank, it failed because the pod was marked as Failed and the container never ran.

yujuhong (Contributor) commented Mar 8, 2016

@ncdc I think the problem originated from relying on passing the pod IP to downstream functions by modifying the api.Pod object:

pod.Status.PodIP = dm.determineContainerIP(pod.Name, pod.Namespace, podInfraContainer)

The api.Pod object should be treated as read-only in most kubelet functions; if you write to it, you cannot assume your change won't be overwritten.

@pmorie's PR (#22666) builds on top of this hack, so the getPodIP() function won't always return the correct IP. Having a one-line hack is one thing, but wrapping it in a seemingly proper function makes me uncomfortable. Since DockerManager.SyncPod has the most up-to-date information about the pod IP, I think we should just pass that down the call stack. (Alternatively, we can always query the container runtime for the correct IP address, but IMO that's too expensive.)

The call stack that needs to be modified:

  -> containerRuntime.SyncPod
    -> dm.runContainerInPod
      -> kubelet.GenerateRunContainerOptions
        -> kubelet.makeMounts
          -> kubelet.makeHostsMount
        -> kubelet.makeEnvironmentVariables
          -> kubelet.podFieldSelectorRuntimeValue
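
To make that concrete, here is a rough sketch of the parameter-threading; the function names loosely mirror the call stack above but are simplified stand-ins, not the real kubelet signatures:

package main

import "fmt"

// Sketch: SyncPod determines the authoritative pod IP once and passes it down
// explicitly; nothing below reads (or writes) pod.Status.PodIP.

func makeEnvironmentVariables(podIP string) map[string]string {
	// The downward API value comes from the parameter, not from shared state.
	return map[string]string{"POD_IP": podIP}
}

func generateRunContainerOptions(podIP string) map[string]string {
	return makeEnvironmentVariables(podIP)
}

func runContainerInPod(podIP string) map[string]string {
	return generateRunContainerOptions(podIP)
}

func syncPod(podIP string) map[string]string {
	return runContainerInPod(podIP)
}

func main() {
	fmt.Println(syncPod("172.17.0.3")) // map[POD_IP:172.17.0.3]
}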

/cc @kubernetes/sig-node, FYI.

ncdc (Member) commented Mar 8, 2016

@yujuhong cool, thanks for the detailed explanation. Where in SyncPod does the pod IP exist other than the line you linked above, which only gets called when the infra container is created?

yujuhong (Contributor) commented Mar 8, 2016

If the infra container was not created in the same sync iteration, you should be able to get the pod IP from podStatus *kubecontainer.PodStatus, which is passed to SyncPod().

ncdc (Member) commented Mar 8, 2016

Ok, so if we put if/else logic in for that check, then we can pass the pod IP around as needed, right?
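
For illustration, that if/else might look roughly like the following sketch; the types and names here are hypothetical, and podStatus is a stand-in for kubecontainer.PodStatus rather than the real type:

package main

import "fmt"

// Hypothetical sketch: prefer the IP determined for an infra container created
// in this sync iteration, otherwise fall back to the IP already recorded in
// the runtime pod status that was passed to SyncPod.
type podStatus struct {
	IP string
}

func podIPForSync(infraJustCreated bool, determinedIP string, status podStatus) string {
	if infraJustCreated {
		return determinedIP
	}
	return status.IP
}

func main() {
	fmt.Println(podIPForSync(true, "172.17.0.3", podStatus{}))        // infra created this iteration
	fmt.Println(podIPForSync(false, "", podStatus{IP: "172.17.0.3"})) // existing infra container
}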

pmorie (Member) commented May 5, 2016

Has anyone hit this recently? Afaik, it can be closed. I haven't seen it linked to anything recently.
