
Return "410 Gone" errors via watch stream when using watch cache #25369

Merged: 1 commit into kubernetes:master from liggitt:cached-watch, May 11, 2016

Conversation

@liggitt liggitt commented May 9, 2016

Fixes #25151

This PR updates the API server to produce the same results when a watch is attempted with a resourceVersion that is too old, regardless of whether the etcd watch cache is enabled. The expected result is a 200 http status, with a single watch event of type ERROR. Previously, the watch cache would deliver a 410 http response.
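For illustration only, the error now arrives inside the stream rather than as the HTTP status. A sketch of the single expected event, assuming the unversioned.Status type of this era (the exact Reason and Message values are assumptions, not the PR's verified payload):

expected := watch.Event{
    Type: watch.Error,
    Object: &unversioned.Status{
        Status:  unversioned.StatusFailure,
        Code:    http.StatusGone, // the 410 travels inside the event; the HTTP response itself is 200
        Reason:  unversioned.StatusReasonExpired,
        Message: "too old resource version",
    },
}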

This is the uncached watch impl:

// Implements storage.Interface.
func (h *etcdHelper) WatchList(ctx context.Context, key string, resourceVersion string, filter storage.FilterFunc) (watch.Interface, error) {
    if ctx == nil {
        glog.Errorf("Context is nil")
    }
    // The only direct error return: a malformed resourceVersion.
    watchRV, err := storage.ParseWatchResourceVersion(resourceVersion)
    if err != nil {
        return nil, err
    }
    key = h.prefixEtcdKey(key)
    w := newEtcdWatcher(true, h.quorum, exceptKey(key), filter, h.codec, h.versioner, nil, h)
    // The watch runs asynchronously; anything that fails from here on is
    // surfaced as an ERROR event on w.ResultChan(), not as a returned error.
    go w.etcdWatch(ctx, h.etcdKeysAPI, key, watchRV)
    return w, nil
}

Once the resourceVersion parses, there is no path that returns a direct error, so all errors have to be delivered as an ERROR event via ResultChan().


func newErrWatcher(err error) *errWatcher {
    // Create an error event
    errEvent := watch.Event{Type: watch.Error}
Contributor

It would probably be better if this could use apiserver.StatusForError.

Contributor

But it's fine here.

Member Author

Ah, I didn't know about errToAPIStatus... I'd like to keep this targeted for picking; I can rework to expose and use errToAPIStatus (maybe moved to the api/errors package?) as a follow-up.
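For reference, the diff view truncates the newErrWatcher snippet above. As a minimal sketch of the pattern under discussion (the channel buffering and hand-built Status are assumptions, not necessarily the PR's exact code), an errWatcher delivers one error event and then ends the stream:

// errWatcher implements watch.Interface by emitting a single error event.
type errWatcher struct {
    result chan watch.Event
}

func newErrWatcher(err error) *errWatcher {
    // Wrap the failure in a Status object carried by the event.
    errEvent := watch.Event{
        Type:   watch.Error,
        Object: &unversioned.Status{Status: unversioned.StatusFailure, Message: err.Error()},
    }

    // Buffer exactly one event, deliver it, then close the channel so the
    // consumer sees the error followed by a normal end-of-stream.
    w := &errWatcher{result: make(chan watch.Event, 1)}
    w.result <- errEvent
    close(w.result)
    return w
}

// ResultChan implements watch.Interface.
func (w *errWatcher) ResultChan() <-chan watch.Event { return w.result }

// Stop implements watch.Interface; nothing is running, so it is a no-op.
func (w *errWatcher) Stop() {}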

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels May 9, 2016
@liggitt liggitt added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label May 9, 2016
@liggitt liggitt added this to the v1.2 milestone May 9, 2016
@liggitt liggitt added cherrypick-candidate release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels May 9, 2016
@lavalamp (Member) commented May 9, 2016

LGTM

@lavalamp lavalamp added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2016
@lavalamp lavalamp assigned lavalamp and unassigned wojtek-t May 9, 2016
@liggitt liggitt added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 9, 2016
@liggitt (Member, Author) commented May 10, 2016

rebased

@liggitt liggitt added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 10, 2016
@wojtek-t (Member)

LGTM - thanks!

@xiang90 (Contributor) commented May 10, 2016

/cc @hongchaodeng

@liggitt (Member, Author) commented May 10, 2016

rebased

@liggitt liggitt added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 10, 2016
@liggitt (Member, Author) commented May 10, 2016

@k8s-bot unit test this issue #25349

@liggitt (Member, Author) commented May 10, 2016

@k8s-bot unit test this issue #25037

@hongchaodeng (Contributor)

The test flake doesn't look related to #25037.
Logs are here.

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0xff3f77]

goroutine 581 [running]:
panic(0x21d0c00, 0xc8200120d0)
    /usr/local/go/src/runtime/panic.go:481 +0x3e6
k8s.io/kubernetes/pkg/controller/framework.(*Controller).HasSynced(0x0, 0x463ec0)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:108 +0x47
k8s.io/kubernetes/pkg/controller/framework.(*sharedIndexInformer).HasSynced(0xc820019380, 0xa2e2ae)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/framework/shared_informer.go:161 +0x25
k8s.io/kubernetes/pkg/controller/framework.(SharedIndexInformer).HasSynced-fm(0xe00000028)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:100 +0x32
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).syncService(0xc820019440, 0xc8209ea960, 0x12)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:310 +0xf8
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).worker.func1(0xc820019440)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:299 +0xe3
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).worker(0xc820019440)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:300 +0x21
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).(k8s.io/kubernetes/pkg/controller/endpoint.worker)-fm()
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:151 +0x20
k8s.io/kubernetes/pkg/util/wait.JitterUntil.func1(0xc820d0c330)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:66 +0x4b
k8s.io/kubernetes/pkg/util/wait.JitterUntil(0xc820d0c330, 0x3b9aca00, 0x0, 0xc820112ba0)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:67 +0x68
k8s.io/kubernetes/pkg/util/wait.Until(0xc820d0c330, 0x3b9aca00, 0xc820112ba0)
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:47 +0x3e
created by k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).Run
    /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:151 +0x147
!!! Error in ./hack/test-integration.sh:58
  'KUBE_TEST_API_VERSIONS="$1" "${KUBE_OUTPUT_HOSTBIN}/integration" --v=${LOG_LEVEL} --max-concurrency="${KUBE_INTEGRATION_TEST_MAX_CONCURRENCY}" --watch-cache=true' exited with status 2
Call stack:
  1: ./hack/test-integration.sh:58 runTests(...)
  2: ./hack/test-integration.sh:83 main(...)

I'm more worried that there is some race or bad assumption in the integration test.
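For context, the receiver in the (*Controller).HasSynced frame above is 0x0, i.e. a nil-receiver method call. A minimal standalone reproduction of that crash pattern, with hypothetical stand-in types:

package main

// Controller stands in for pkg/controller/framework.Controller.
type Controller struct{ synced bool }

// HasSynced dereferences the receiver, so calling it through a nil
// *Controller panics with "invalid memory address or nil pointer dereference".
func (c *Controller) HasSynced() bool { return c.synced }

func main() {
    var c *Controller // never initialized, e.g. lost in a startup race
    _ = c.HasSynced() // panics here
}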

@k8s-bot commented May 11, 2016

GCE e2e build/test passed for commit f80b59b.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented May 11, 2016

GCE e2e build/test passed for commit f80b59b.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit aa8fddb into kubernetes:master May 11, 2016
@liggitt liggitt deleted the cached-watch branch May 11, 2016 01:35
@liggitt liggitt changed the title Return 'too old' errors from watch cache via watch stream Return "410 Gone" errors via watch stream when using watch cache May 11, 2016
@roberthbailey roberthbailey added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label May 23, 2016
@k8s-cherrypick-bot

Commit found in the "release-1.2" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.

roberthbailey added a commit that referenced this pull request May 23, 2016
…9-upstream-release-1.2

Automated cherry pick of #25369
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…k-of-#25369-upstream-release-1.2

Automated cherry pick of kubernetes#25369
shouhong pushed a commit to shouhong/kubernetes that referenced this pull request Feb 14, 2017
…k-of-#25369-upstream-release-1.2

Automated cherry pick of kubernetes#25369
hawkw added a commit to linkerd/linkerd that referenced this pull request Sep 20, 2017
Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200 rather than status code 410. This event is not a fatal error and simply indicates that a watch should be restarted; k8s will fire this event for any watch that has been open for longer than thirty minutes. In Linkerd, `Watchable` currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in `Watchable` and is passed downstream; in the case of issue #1636, it reaches `EndpointsNamer`, which does not know how to handle this event. This leads to namers intermittently failing to resolve k8s endpoints.

Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in `Watchable` to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously.

I've added a test to `v1.ApiTest` that replicates the Kubernetes bug.

Fixes #1636
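The consumer-side contract described above can be sketched in Go (hypothetical helper names; Linkerd's actual workaround is in Scala):

// watchUntilGone is a hypothetical sketch of the client-side contract this
// thread converges on: "resource version too old" arrives as an ERROR event
// on a 200 stream, so the client restarts the watch instead of treating the
// HTTP response itself as a 410.
func watchUntilGone(start func() (watch.Interface, error), handle func(watch.Event)) error {
    for {
        w, err := start()
        if err != nil {
            return err
        }
        restart := false
        for event := range w.ResultChan() {
            if event.Type == watch.Error {
                if status, ok := event.Object.(*unversioned.Status); ok && status.Code == http.StatusGone {
                    restart = true // resourceVersion expired: re-list and re-watch
                    break
                }
                return fmt.Errorf("watch failed: %#v", event.Object)
            }
            handle(event)
        }
        w.Stop()
        if !restart {
            return nil // stream ended normally
        }
    }
}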
Labels
cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Watch cache regression: changes behavior of "resource version too old" error