Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestClientGoCustomResourceExample flakes #49956

Closed
liggitt opened this issue Aug 1, 2017 · 6 comments
Closed

TestClientGoCustomResourceExample flakes #49956

liggitt opened this issue Aug 1, 2017 · 6 comments
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@liggitt liggitt added kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 1, 2017
@liggitt liggitt self-assigned this Aug 1, 2017
@liggitt
Copy link
Member Author

liggitt commented Aug 1, 2017

@wojtek-t
Copy link
Member

wojtek-t commented Aug 1, 2017

@smarterclayton

@smarterclayton
Copy link
Contributor

Could this be interactions with the aggregator cache?

@smarterclayton
Copy link
Contributor

E0728 21:35:49.536152   31354 customresource_discovery_controller.go:155] timed out waiting for caches to sync

Is that bad?

@liggitt
Copy link
Member Author

liggitt commented Aug 1, 2017

Could this be interactions with the aggregator cache?

no, TestClientGoCustomResourceExample tests a CRD server all alone, no aggregator involved

@liggitt
Copy link
Member Author

liggitt commented Aug 1, 2017

with lots of tracing added, I can reproduce:

controller lister/watcher:

  1. lists before watch cache is ready, goes to etcd, gets empty list at resourceVersion=408
  2. watches from resourceVersion=408 before watch cache completes first list, gets initEvents=[]
  3. registers a watcher with the watch cache at resourceVersion=408

client:

  1. creates an object with resourceVersion 409

cacher lister/watcher:

  1. lists, gets list with a single item at resourceVersion=409
  2. replaces store content with the list containing the single item, and sets the store resource version to 409
  3. watches etcd from resourceVersion=409

processEvent is never called and the watcher is never sent the 409 watch event

DEBUG: controller list /cr.client-go.k8s.io/examples bypassed the watch cache, got err=<nil>
DEBUG: controller list returned &v1.ExampleList{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"/apis/cr.client-go.k8s.io/v1/examples", ResourceVersion:"408"}, Items:[]v1.Example{}}
DEBUG: Test created version 409
DEBUG: before wait at 408
DEBUG: cacherListerWatcher list /cr.client-go.k8s.io/examples from , got list &unstructured.UnstructuredList{Object:map[string]interface {}{"apiVersion":"cr.client-go.k8s.io/v1", "kind":"ExampleList", "metadata":map[string]interface {}{"resourceVersion":"409"}}, Items:[]unstructured.Unstructured{unstructured.Unstructured{Object:map[string]interface {}{"spec":map[string]interface {}{"foo":"hello", "bar":true}, "status":map[string]interface {}{"message":"Created, not processed yet", "state":"Created"}, "apiVersion":"cr.client-go.k8s.io/v1", "kind":"Example", "metadata":map[string]interface {}{"selfLink":"", "uid":"ea82703b-7704-11e7-9125-acbc32c1ca87", "deletionGracePeriodSeconds":interface {}(nil), "deletionTimestamp":interface {}(nil), "name":"example1", "namespace":"default", "resourceVersion":"409", "clusterName":"", "creationTimestamp":"2017-08-01T22:01:05Z"}}}}}
DEBUG: replaced with 409
DEBUG: cacherListerWatcher watch /cr.client-go.k8s.io/examples from 409
DEBUG: initEvents since 408: []*storage.watchCacheEvent{}, <nil>
DEBUG: Added watcher at 408
DEBUG: process(): waiting for new event
DEBUG: controller watch established with no err

@liggitt liggitt added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Aug 1, 2017
k8s-github-robot pushed a commit that referenced this issue Aug 2, 2017
Automatic merge from submit-queue (batch tested with PRs 49992, 48861, 49267, 49356, 49886)

Correctly handle empty watch event cache

Fixes #49956

Introduced by ada6023 which did not adjust the oldest available resourceVersion for an empty watch event cache.

Exposed by 74b9ba3, which allowed controllers to get list results from etcd before the watch cache is ready (normally they list with resourceVersion=0 which serves the list request from the watch cache, blocking until it is ready)

When the watch cache had an empty cache of watch events, it currently allows establishing a watch as if it can deliver a watch event for its currently synced resourceVersion. This results in an off-by-one error which can result in a missed watch event.

Scenario:

bob:
1. creates object at resourceVersion=11

sally:
1. does a list API request, gets a list resourceVersion of 10 (just before bob creates the object)
2. starts watch handled by watch cache at resourceVersion=10

Watch cache:
1. initial list gets resourceVersion=11, including the item created by bob
2. when determining the initial watch events to send to sally's watch, there are no watch events in the cache, so no initial watch events are sent.
3. the cache listerwatcher watches etcd starting at resourceVersion=11, so future events are fed into the event cache and to sally's watch

The watch cache should have dropped sally's watch from resourceVersion=10 with a "gone" error, since it can't deliver the watch event for resourceVersion=11. This would force sally to relist (where she would get a list at resourceVersion=11) and rewatch (from resourceVersion=11)

This particularly affects tests that create CRD/TPRs and establish watches on the new types as the storage layer's watch cache is also populating for that type.

```release-note
Fix a bug in watch cache sometimes causing missing events after watch cache initialization.
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
Projects
None yet
Development

No branches or pull requests

3 participants