Return "410 Gone" errors via watch stream when using watch cache #25369
Conversation
func newErrWatcher(err error) *errWatcher {
	// Create an error event
	errEvent := watch.Event{Type: watch.Error}
It would probably be better if this could use apiserver.StatusForError.
But it's fine for here.
ah, I didn't know about errToAPIStatus
... I'd like to keep this targeted for picking; I can rework to expose and use errToAPIStatus (maybe moved to the api/errors package?) as a follow-up.
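For context, here is a minimal sketch (not the PR's exact code) of the kind of error watcher discussed in this thread: a watch.Interface that delivers a single structured error event and then closes. The Status construction shown inline is exactly the part the reviewers suggest delegating to a helper such as errToAPIStatus / apiserver.StatusForError; the import paths assume the 1.2/1.3-era package layout.

```go
import (
	"net/http"

	apierrors "k8s.io/kubernetes/pkg/api/errors"
	"k8s.io/kubernetes/pkg/api/unversioned"
	"k8s.io/kubernetes/pkg/watch"
)

// errWatcher implements watch.Interface: it emits one error event and closes.
type errWatcher struct {
	result chan watch.Event
}

func newErrWatcher(err error) *errWatcher {
	// Create an error event carrying a structured Status so clients can
	// distinguish "resource version too old" (410 Gone) from other failures.
	errEvent := watch.Event{Type: watch.Error}
	switch e := err.(type) {
	case *apierrors.StatusError:
		// The error already carries an API Status (e.g. 410 Gone); reuse it.
		errEvent.Object = &e.ErrStatus
	default:
		// Fall back to a generic internal-error Status. This is the piece a
		// shared helper like errToAPIStatus could take over.
		errEvent.Object = &unversioned.Status{
			Status:  unversioned.StatusFailure,
			Message: err.Error(),
			Reason:  unversioned.StatusReasonInternalError,
			Code:    http.StatusInternalServerError,
		}
	}

	// Buffer the channel so sending never blocks, then close it so a client
	// ranging over ResultChan() terminates after the single event.
	w := &errWatcher{result: make(chan watch.Event, 1)}
	w.result <- errEvent
	close(w.result)
	return w
}

// ResultChan returns the channel carrying the single error event.
func (w *errWatcher) ResultChan() <-chan watch.Event { return w.result }

// Stop is a no-op; the result channel is already closed.
func (w *errWatcher) Stop() {}
```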
LGTM
rebased
LGTM - thanks!
/cc @hongchaodeng
rebased
The test flake doesn't look related to #25037.

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0xff3f77]
goroutine 581 [running]:
panic(0x21d0c00, 0xc8200120d0)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
k8s.io/kubernetes/pkg/controller/framework.(*Controller).HasSynced(0x0, 0x463ec0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:108 +0x47
k8s.io/kubernetes/pkg/controller/framework.(*sharedIndexInformer).HasSynced(0xc820019380, 0xa2e2ae)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/framework/shared_informer.go:161 +0x25
k8s.io/kubernetes/pkg/controller/framework.(SharedIndexInformer).HasSynced-fm(0xe00000028)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:100 +0x32
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).syncService(0xc820019440, 0xc8209ea960, 0x12)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:310 +0xf8
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).worker.func1(0xc820019440)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:299 +0xe3
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).worker(0xc820019440)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:300 +0x21
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).(k8s.io/kubernetes/pkg/controller/endpoint.worker)-fm()
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:151 +0x20
k8s.io/kubernetes/pkg/util/wait.JitterUntil.func1(0xc820d0c330)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:66 +0x4b
k8s.io/kubernetes/pkg/util/wait.JitterUntil(0xc820d0c330, 0x3b9aca00, 0x0, 0xc820112ba0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:67 +0x68
k8s.io/kubernetes/pkg/util/wait.Until(0xc820d0c330, 0x3b9aca00, 0xc820112ba0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:47 +0x3e
created by k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).Run
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:151 +0x147
!!! Error in ./hack/test-integration.sh:58
'KUBE_TEST_API_VERSIONS="$1" "${KUBE_OUTPUT_HOSTBIN}/integration" --v=${LOG_LEVEL} --max-concurrency="${KUBE_INTEGRATION_TEST_MAX_CONCURRENCY}" --watch-cache=true' exited with status 2
Call stack:
1: ./hack/test-integration.sh:58 runTests(...)
2: ./hack/test-integration.sh:83 main(...)

I'm more worried that there is some race or assumption made in the integration test.
GCE e2e build/test passed for commit f80b59b.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit f80b59b.
Automatic merge from submit-queue
Commit found in the "release-1.2" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
…9-upstream-release-1.2 Automated cherry pick of #25369
…k-of-#25369-upstream-release-1.2 Automated cherry pick of kubernetes#25369
Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200 rather than status code 410. This event is not a fatal error; it simply indicates that a watch should be restarted, and k8s will fire it for any watch that has been open for longer than thirty minutes. In Linkerd, `Watchable` currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in `Watchable` and is passed downstream; in the case of issue #1636, it reaches `EndpointsNamer`, which does not know how to handle this event. This leads to namers intermittently failing to resolve k8s endpoints. Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in `Watchable` to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously. I've added a test to `v1.ApiTest` that replicates the Kubernetes bug. Fixes #1636
Fixes #25151
This PR updates the API server to produce the same results when a watch is attempted with a resourceVersion that is too old, regardless of whether the etcd watch cache is enabled. The expected result is a 200 HTTP status, with a single watch event of type ERROR. Previously, the watch cache would deliver a 410 HTTP response.

This is the uncached watch impl: once the resourceVersion parses, there is no path that returns a direct error, so all errors would have to be returned as an ERROR event via the ResultChan().
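To illustrate what this means from a client's perspective, here is a hedged sketch (not code from this PR, and under the same 1.2/1.3-era package assumptions as above): with the fix, a watch started from a too-old resourceVersion still gets a 200 response, and the expiry arrives as a single ERROR event whose object is a Status with code 410, which the client treats as a signal to re-list and re-watch.

```go
import (
	"net/http"

	"k8s.io/kubernetes/pkg/api/unversioned"
	"k8s.io/kubernetes/pkg/watch"
)

// resourceVersionTooOld drains a watch stream and reports whether it ended
// because the requested resourceVersion had expired (410 Gone in the Status).
func resourceVersionTooOld(w watch.Interface) bool {
	defer w.Stop()
	for event := range w.ResultChan() {
		if event.Type != watch.Error {
			// Handle Added/Modified/Deleted events as usual.
			continue
		}
		if status, ok := event.Object.(*unversioned.Status); ok && status.Code == http.StatusGone {
			// The server reports "resource version too old" as a structured
			// Status on the stream; the caller should re-list and restart
			// the watch from the fresh resourceVersion.
			return true
		}
		// Some other error ended the watch.
		return false
	}
	return false
}
```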