Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration test flaky due to missing endpoints #6045

Closed
dchen1107 opened this issue Mar 27, 2015 · 10 comments
Closed

integration test flaky due to missing endpoints #6045

dchen1107 opened this issue Mar 27, 2015 · 10 comments
Labels
area/test-infra priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@dchen1107
Copy link
Member

I0327 07:02:03.729736 732 integration.go:287] default/service2 endpoint: 1.2.3.4:1234 &api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"foo", UID:"2823d7fd-d44f-11e4-9efc-02420a0000c3", APIVersion:"", ResourceVersion:"186", FieldPath:""}
I0327 07:02:03.730711 732 integration.go:283] Error on creating endpoints: endpoints "service1" not found
F0327 07:02:03.730752 732 integration.go:785] FAILED: service in other namespace should have no endpoints: false
!!! Error in hack/test-integration.sh:47
'"${KUBE_OUTPUT_HOSTBIN}/integration" --v=2 --apiVersion="$1"' exited with status 255
Call stack:

@dchen1107 dchen1107 added area/test-infra priority/backlog Higher priority than priority/awaiting-more-evidence. labels Mar 27, 2015
@bgrant0607 bgrant0607 added team/cluster priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Apr 4, 2015
@bgrant0607 bgrant0607 added this to the v1.0 milestone Apr 4, 2015
@bgrant0607
Copy link
Member

This is failing quite often now, esp. on v1beta3 builds. Not yet sure if it's 100% broken yet, though.

@ghodss
Copy link
Contributor

ghodss commented Apr 4, 2015

I'm seeing the error Dawn wrote above as well as the following error near constantly on v1beta3.

...
I0405 08:29:34.537944    2619 event.go:196] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"static-pod-from-spec-localhost", UID:"7fc40faf8939a9956e10aa9c9da4ffe8", APIVersion:"v1beta2",
ResourceVersion:"", FieldPath:"spec.containers{static-container}"}): reason: 'started' Started with docker id /k8s_static-container.55bee96_static-pod-from-spec-localhost_default_7fc40faf8939a9956e10aa9c9
da4ffe8_351700d1
I0405 08:29:35.088769    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:36.090548    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:37.089696    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
E0405 08:29:38.064989    2619 reflector.go:143] Failed to watch *api.Node: field label not supported: name
I0405 08:29:38.089777    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
E0405 08:29:39.064974    2619 reflector.go:143] Failed to watch *api.Node: field label not supported: name
I0405 08:29:39.089454    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:40.089832    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:41.089971    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:42.088700    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
E0405 08:29:42.464833    2619 reflector.go:143] Failed to watch *api.Node: field label not supported: name
I0405 08:29:43.088886    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
E0405 08:29:43.865463    2619 reflector.go:143] Failed to watch *api.Node: field label not supported: name
I0405 08:29:44.089599    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:45.088995    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:46.088937    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
I0405 08:29:47.089790    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
E0405 08:29:47.265761    2619 reflector.go:143] Failed to watch *api.Node: field label not supported: name
I0405 08:29:48.093708    2619 integration.go:285] Error on creating endpoints: endpoints "service1" not found
F0405 08:29:48.093903    2619 integration.go:841] FAILED: unexpected endpoints: timed out waiting for the condition
!!! Error in hack/test-integration.sh:47
  '"${KUBE_OUTPUT_HOSTBIN}/integration" --v=2 --apiVersion="$1"' exited with status 255
Call stack:
  1: hack/test-integration.sh:47 runTests(...)
  2: hack/test-integration.sh:60 main(...)
Exiting with status 1
+++ [0405 08:29:48] Integration test cleanup complete

@bgrant0607
Copy link
Member

Not 100% broken. Last build passed after being restarted.

@ghodss
Copy link
Contributor

ghodss commented Apr 5, 2015

So I spent some time digging into this. Both errors seem to have the root cause that the endpoints controller (pkg/service/endpoints_controller.go) is taking forever (on the order of tens of seconds) to do its job of synchronizing the endpoint objects to the current state of services and pods. If you increase the time out to 60 seconds instead of 30, these errors both go away.

There are a few reasons it's so slow. The endpoints controller is a serial loop that does 4-5 calls to the apiserver per service that it synchronizes endpoints for. I added some logging to pkg/service/endpoints_controller.go and found that, at least on the Linux VM on my mac, it was taking about 2-3 seconds for each and every apiserver call. That's also the reason that this has also gotten worse recently - the addition of patchservice as a test object gives the endpoint controller even more work to do, and delays the creation of endpoints for runServiceTest by an average of an extra 10 seconds or so.

There are a few ways to resolve this issue, and we should probably do them sequentially:

  1. Increase the timeout. This makes the test less flakey without penalizing tests that run quickly, but if someone actually does break the tests, it may take them an extra minute or two to get feedback.
  2. Figure out why apiserver is taking so long to respond to endpoints_controller's API calls. If this is expected, we can skip this step, but 2-3 seconds for simple CRUD calls seems ridiculously high.
  3. Batch the calls together in pkg/service/endpoints_controller.go to improve performance.
  4. After 2. and/or 3. are done, revert 1. :)

cc @thockin

@thockin
Copy link
Member

thockin commented Apr 7, 2015

Also discussed in #6518

@wojtek-t
Copy link
Member

wojtek-t commented Apr 8, 2015

cc @fgrzadkowski

@ghodss
Copy link
Contributor

ghodss commented Apr 9, 2015

Do we have any idea why apiserver is so slow in the integration test? (referring back to 2. in #6045 (comment))

@wojtek-t
Copy link
Member

There are few open issues touching this (e.g. #6518 as mentioned by @thockin).
But I think that the main reason was that we were running all the integration tests in parallel. This has been fixed yesterday (see e.g. #6669).
Do you still observe this issue?

@ghost ghost added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Apr 13, 2015
@piosz
Copy link
Member

piosz commented Apr 21, 2015

I ran integration tests 100 times from today's head and all passes, so closing the issue.

@piosz piosz closed this as completed Apr 21, 2015
@ghodss
Copy link
Contributor

ghodss commented Apr 25, 2015

Yes, 2. looks much better - most calls are now taking a few milliseconds, although https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/service/endpoints_controller.go#L337 still takes about 200ms, but it's not a big deal atm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test-infra priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

No branches or pull requests

6 participants