Return "410 Gone" errors via watch stream when using watch cache #25369
Conversation
func newErrWatcher(err error) *errWatcher {
	// Create an error event
	errEvent := watch.Event{Type: watch.Error}
It would probably be better if this could use apiserver.StatusForError.
But it's fine for here.
ah, I didn't know about errToAPIStatus
... I'd like to keep this targeted for picking; I can rework to expose and use errToAPIStatus (maybe moved to the api/errors package?) as a follow-up.
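For context, here is a minimal sketch (not the PR's exact code) of the kind of error watcher discussed in this thread: a watch.Interface that delivers a single structured error event and then closes. The Status construction shown inline is exactly the part the reviewers suggest delegating to a helper such as errToAPIStatus / apiserver.StatusForError; the import paths assume the 1.2/1.3-era package layout.

```go
import (
	"net/http"

	apierrors "k8s.io/kubernetes/pkg/api/errors"
	"k8s.io/kubernetes/pkg/api/unversioned"
	"k8s.io/kubernetes/pkg/watch"
)

// errWatcher implements watch.Interface: it emits one error event and closes.
type errWatcher struct {
	result chan watch.Event
}

func newErrWatcher(err error) *errWatcher {
	// Create an error event carrying a structured Status so clients can
	// distinguish "resource version too old" (410 Gone) from other failures.
	errEvent := watch.Event{Type: watch.Error}
	switch e := err.(type) {
	case *apierrors.StatusError:
		// The error already carries an API Status (e.g. 410 Gone); reuse it.
		errEvent.Object = &e.ErrStatus
	default:
		// Fall back to a generic internal-error Status. This is the piece a
		// shared helper like errToAPIStatus could take over.
		errEvent.Object = &unversioned.Status{
			Status:  unversioned.StatusFailure,
			Message: err.Error(),
			Reason:  unversioned.StatusReasonInternalError,
			Code:    http.StatusInternalServerError,
		}
	}

	// Buffer the channel so sending never blocks, then close it so a client
	// ranging over ResultChan() terminates after the single event.
	w := &errWatcher{result: make(chan watch.Event, 1)}
	w.result <- errEvent
	close(w.result)
	return w
}

// ResultChan returns the channel carrying the single error event.
func (w *errWatcher) ResultChan() <-chan watch.Event { return w.result }

// Stop is a no-op; the result channel is already closed.
func (w *errWatcher) Stop() {}
```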
LGTM
rebased
LGTM - thanks!
/cc @hongchaodeng
rebased
The test flake doesn't look related to #25037.

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0xff3f77]
goroutine 581 [running]:
panic(0x21d0c00, 0xc8200120d0)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
k8s.io/kubernetes/pkg/controller/framework.(*Controller).HasSynced(0x0, 0x463ec0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:108 +0x47
k8s.io/kubernetes/pkg/controller/framework.(*sharedIndexInformer).HasSynced(0xc820019380, 0xa2e2ae)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/framework/shared_informer.go:161 +0x25
k8s.io/kubernetes/pkg/controller/framework.(SharedIndexInformer).HasSynced-fm(0xe00000028)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:100 +0x32
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).syncService(0xc820019440, 0xc8209ea960, 0x12)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:310 +0xf8
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).worker.func1(0xc820019440)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:299 +0xe3
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).worker(0xc820019440)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:300 +0x21
k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).(k8s.io/kubernetes/pkg/controller/endpoint.worker)-fm()
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:151 +0x20
k8s.io/kubernetes/pkg/util/wait.JitterUntil.func1(0xc820d0c330)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:66 +0x4b
k8s.io/kubernetes/pkg/util/wait.JitterUntil(0xc820d0c330, 0x3b9aca00, 0x0, 0xc820112ba0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:67 +0x68
k8s.io/kubernetes/pkg/util/wait.Until(0xc820d0c330, 0x3b9aca00, 0xc820112ba0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:47 +0x3e
created by k8s.io/kubernetes/pkg/controller/endpoint.(*EndpointController).Run
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/endpoint/endpoints_controller.go:151 +0x147
!!! Error in ./hack/test-integration.sh:58
'KUBE_TEST_API_VERSIONS="$1" "${KUBE_OUTPUT_HOSTBIN}/integration" --v=${LOG_LEVEL} --max-concurrency="${KUBE_INTEGRATION_TEST_MAX_CONCURRENCY}" --watch-cache=true' exited with status 2
Call stack:
1: ./hack/test-integration.sh:58 runTests(...)
2: ./hack/test-integration.sh:83 main(...)

I'm more worried that there is some race or assumption made in the integration test.
GCE e2e build/test passed for commit f80b59b.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit f80b59b.
Automatic merge from submit-queue
Commit found in the "release-1.2" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
…9-upstream-release-1.2 Automated cherry pick of #25369
…k-of-#25369-upstream-release-1.2 Automated cherry pick of kubernetes#25369
Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200 rather than status code 410. This event is not a fatal error; it simply indicates that a watch should be restarted, and k8s will fire it for any watch that has been open for longer than thirty minutes. In Linkerd, `Watchable` currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in `Watchable` and is passed downstream; in the case of issue #1636, it reaches `EndpointsNamer`, which does not know how to handle this event. This leads to namers intermittently failing to resolve k8s endpoints. Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in `Watchable` to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously. I've added a test to `v1.ApiTest` that replicates the Kubernetes bug. Fixes #1636
Fixes #25151
This PR updates the API server to produce the same results when a watch is attempted with a resourceVersion that is too old, regardless of whether the etcd watch cache is enabled. The expected result is a 200 HTTP status, with a single watch event of type ERROR. Previously, the watch cache would deliver a 410 HTTP response.

This is the uncached watch impl: once the resourceVersion parses, there is no path that returns a direct error, so all errors would have to be returned as an ERROR event via the ResultChan().
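To illustrate what this means from a client's perspective, here is a hedged sketch (not code from this PR, and under the same 1.2/1.3-era package assumptions as above): with the fix, a watch started from a too-old resourceVersion still gets a 200 response, and the expiry arrives as a single ERROR event whose object is a Status with code 410, which the client treats as a signal to re-list and re-watch.

```go
import (
	"net/http"

	"k8s.io/kubernetes/pkg/api/unversioned"
	"k8s.io/kubernetes/pkg/watch"
)

// resourceVersionTooOld drains a watch stream and reports whether it ended
// because the requested resourceVersion had expired (410 Gone in the Status).
func resourceVersionTooOld(w watch.Interface) bool {
	defer w.Stop()
	for event := range w.ResultChan() {
		if event.Type != watch.Error {
			// Handle Added/Modified/Deleted events as usual.
			continue
		}
		if status, ok := event.Object.(*unversioned.Status); ok && status.Code == http.StatusGone {
			// The server reports "resource version too old" as a structured
			// Status on the stream; the caller should re-list and restart
			// the watch from the fresh resourceVersion.
			return true
		}
		// Some other error ended the watch.
		return false
	}
	return false
}
```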