
DNS e2e pretty busted since ~16:30 on Wed 4/29 #7548

Closed · ghost opened this issue Apr 30, 2015 · 10 comments

Assignee: cjcullen
Labels: area/test, area/test-infra, priority/important-soon
Milestone: v1.0

Comments

ghost commented Apr 30, 2015

I haven't looked into the details yet, but it seems that a PR went in at around 16:30 PDT today (Wed 04/29) that is causing significantly more e2e failures than usual in our continuous integration.

I'll dig into it in the morning unless one of the oncalls gets there before me.

ghost added the area/test, area/test-infra, and priority/critical-urgent labels on Apr 30, 2015
ghost assigned cjcullen on Apr 30, 2015
ghost added this to the v1.0 milestone on Apr 30, 2015
ghost commented Apr 30, 2015

Wild guess that it might be this one, based purely on merge time:

yifan-gu (Contributor) commented

Bad news :( Which tests are failing?

ghost commented Apr 30, 2015

A variety, and they're intermittent:

  • Services should provide DNS for the cluster
  • Density [Performance suite] should allow starting 30 pods per node
  • Shell tests that services.sh passes
  • Monitoring verify monitoring pods and all cluster nodes are available on influxdb using heapster.
  • kubectl guestbook should create and stop a working application
  • Cluster level logging using Elasticsearch should check that logs from pods on all nodes are ingested into Elasticsearch

... all failed at least once...

ghost commented Apr 30, 2015

Tests have been passing consistently since I filed this issue, so I'm dropping this to P1 while the investigation continues. I'm pretty sure it will keep cropping up until we figure out the root cause.

ghost added the priority/important-soon label and removed the priority/critical-urgent label on Apr 30, 2015
ghost commented Apr 30, 2015

As I guessed, this cropped up again. I'm going to try to reproduce it locally with verbose cluster logging to figure out what's happening. The DNS service is not working, which is breaking the other tests. It's not yet clear whether DNS, Services, or Pods in general are flaky.

Test Result (4 failures / +4)

Kubernetes e2e Suite run 1 of 1.Cluster level logging using Elasticsearch should check that logs from pods on all nodes are ingested into Elasticsearch
Kubernetes e2e Suite run 1 of 1.kubectl guestbook should create and stop a working application
Kubernetes e2e Suite run 1 of 1.Density [Performance suite] should allow starting 30 pods per node
Kubernetes e2e Suite run 1 of 1.Services should provide DNS for the cluster

Identified problems

Cluster level logging using Elasticsearch should check that logs from pods on all nodes are ingested into Elasticsearch

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/es_cluster_logging.go:46
Failed to find all 200 log lines

kubectl guestbook should create and stop a working application

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/kubectl.go:125
Frontend service did not start serving content in 600 seconds.

Density [Performance suite] should allow starting 30 pods per node

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/density.go:158
Expected error:
<*errors.errorString | 0xc208a36240>: {
s: "Error: Pod my-hostname-density60-4363cd45-ef6e-11e4-a8a5-42010af01555zzng7: Container my-hostname-density60-4363cd45-ef6e-11e4-a8a5-42010af01555 was found to have terminated 1 times",
}
Error: Pod my-hostname-density60-4363cd45-ef6e-11e4-a8a5-42010af01555zzng7: Container my-hostname-density60-4363cd45-ef6e-11e4-a8a5-42010af01555 was found to have terminated 1 times
not to have occurred

Services should provide DNS for the cluster

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:165
Expected
: 3
to equal
: 0

cjcullen (Member) commented
I'm also running the DNS test in a loop locally to see if I can get a failed cluster to pick at.

cjcullen commented May 1, 2015

I can consistently get DNS to fail by killing the kube-dns-xxxxx pod. And I can consistently get it to come back to life by killing the kube2sky container.

"kubectl stop pods kube-dns-xxxxx" will cause the DNS RC to recreate the kube-dns pod, and the kube2sky logs will show that the etcd client is hung on the first write to etcd.

Then, "docker stop xxxxxxxxxxx" on the kube2sky container will cause kubelet to restart kube2sky, and it will successfully write all service DNS entries to skyDNS's etcd.

nikhiljindal mentioned this issue on May 1, 2015
ghost changed the title from "e2e pretty busted since ~16:30 on Wed 4/29" to "DNS e2e pretty busted since ~16:30 on Wed 4/29" on May 1, 2015
vmarmol commented May 2, 2015

Might be worth timing out the kube2sky Set operation and retrying?
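A rough sketch of that idea is below. The etcdSetter interface, the hangingClient fake, the key path, and the 2s/3-attempt numbers are all illustrative assumptions, not kube2sky's real client or configuration, and the eventual fix in #7675 may well look different.

```go
// A sketch of "time out the Set and retry": stop waiting for a hung write and
// try again, rather than blocking forever on the first etcd write.
package main

import (
	"fmt"
	"sync"
	"time"
)

// etcdSetter stands in for whichever etcd client kube2sky uses; only the Set
// call matters for this sketch.
type etcdSetter interface {
	Set(key, value string, ttl uint64) error
}

// setWithTimeoutAndRetry runs client.Set in a goroutine and stops waiting for
// any single attempt after perTryTimeout, retrying up to attempts times. Note
// it does not cancel the hung call; it only stops blocking on it.
func setWithTimeoutAndRetry(client etcdSetter, key, value string, ttl uint64,
	perTryTimeout time.Duration, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		done := make(chan error, 1)
		go func() { done <- client.Set(key, value, ttl) }()
		select {
		case err := <-done:
			if err == nil {
				return nil
			}
			lastErr = err
		case <-time.After(perTryTimeout):
			lastErr = fmt.Errorf("etcd Set of %q timed out after %v", key, perTryTimeout)
		}
	}
	return lastErr
}

// hangingClient simulates the reported failure mode: the first write hangs
// forever, later writes succeed.
type hangingClient struct {
	mu    sync.Mutex
	calls int
}

func (c *hangingClient) Set(key, value string, ttl uint64) error {
	c.mu.Lock()
	c.calls++
	first := c.calls == 1
	c.mu.Unlock()
	if first {
		select {} // hang forever, like the stuck first write to etcd
	}
	return nil
}

func main() {
	c := &hangingClient{}
	err := setWithTimeoutAndRetry(c, "/skydns/local/cluster/svc", "10.0.0.1", 0, 2*time.Second, 3)
	fmt.Println("result:", err) // first attempt times out, second succeeds
}
```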

ghost commented May 5, 2015

I think that we can close this one now? (closed by #7675)

cjcullen commented May 5, 2015

Agreed. I wanted to hold off until the flakiness aged out of Jenkins, which looks like it has now happened.

cjcullen closed this as completed on May 5, 2015