Workaround for K8s watch events with incorrect status codes #1649
Conversation
Awesome!
// special case to handle Kubernetes bug where "too old resource version" errors
// are returned with status code 200 rather than status code 410.
// see https://github.com/kubernetes/kubernetes/issues/35068 for details.
case e: Watch.Error[O] if e.status.message.exists(_.startsWith("too old resource version")) =>
Can we check e.status.code for 410 instead of inspecting the message?
That's definitely better than string comparison from a performance perspective. I felt like inspecting the message might be more future-proof – what if Kubernetes were to change the status code returned for this event, or added additional events with status code 410 that should be handled differently – but if you don't think that's as big a concern, I can change this?
I think that looking for a code is actually more future-proof than looking at a human-readable string. Plus, it's already the case that we look for the HTTP status code 410.
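To make the trade-off in this thread concrete, here is a minimal, self-contained sketch, not Linkerd's actual `Watchable` code; `Status` and `WatchError` are simplified, illustrative stand-ins for the k8s API objects in the diff, contrasting message-based and code-based detection of the "Gone" event:

```scala
// Simplified stand-ins for the k8s status objects referenced in the diff above.
// These are illustrative types, not Linkerd's actual definitions.
object GoneDetection {
  case class Status(code: Option[Int], message: Option[String])
  case class WatchError(status: Status)

  // Message-based detection, as written in the PR's diff.
  def isGoneByMessage(e: WatchError): Boolean =
    e.status.message.exists(_.startsWith("too old resource version"))

  // Code-based detection, as the reviewer suggests: the embedded Status
  // object carries 410 (Gone) even when the HTTP response code is 200.
  def isGoneByCode(e: WatchError): Boolean =
    e.status.code.contains(410)

  def main(args: Array[String]): Unit = {
    // An event shaped like the buggy one: the HTTP stream said 200, but the
    // embedded Status still reports the "too old" condition.
    val evt = WatchError(Status(Some(410), Some("too old resource version: 153 (21700)")))
    println(isGoneByMessage(evt)) // true
    println(isGoneByCode(evt))    // true
  }
}
```

Both predicates catch the well-formed event; the discussion above is about which is more robust if Kubernetes ever changes the message wording versus the code.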
// are returned with status code 200 rather than status code 410.
// see https://github.com/kubernetes/kubernetes/issues/35068 for details.
case e: Watch.Error[O] if e.status.message.exists(_.startsWith("too old resource version")) =>
  log.warning(
Is it true that this k8s bug has no adverse effects now that we're detecting "too old" events with status code 200? If so, I don't think we need to log. (We will continue to log at trace level that a watch was restarted.)
Hmm. I thought it was worthwhile to encourage users to upgrade their Kubernetes version, so that we can eventually remove this workaround? I suspect it adds at least a minor performance hit.
I suspect this is pretty cheap, actually. The objects have already been parsed, after all. And we will never be able to assume that all users have upgraded off of the affected K8s version. The best we can do is eventually drop support for certain k8s versions.
Commits 8dba935 to 1cd767c (compare)
⭐ 🕵️♀️ 👣
ahhhhh mazing
aaaah my build finally succeeded! 🙌
When a websocket connection is closed between Chrome and a server, we get a 1006 error code signifying abnormal closure of the websocket connection. It seems as if we only get this error on Chrome web clients. Firefox and Safari do not encounter this issue. The solution is to suppress 1006 errors that occur in the web browser since the connection is closed anyway. There is no negative side effect that occurs when the connection is closed abnormally and so the error message is benign. fixes linkerd#1630 Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
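The suppression this commit message describes can be sketched as follows. This is an illustrative model, not Linkerd's actual code: close code 1006 is RFC 6455's "abnormal closure", which a browser reports when the connection drops without a close frame, so it can be filtered before surfacing errors.

```scala
// Illustrative sketch of suppressing the benign 1006 close code described
// in the commit message above; names here are assumptions, not real APIs.
object WsCloseCodes {
  // RFC 6455 close status codes relevant to this change.
  val NormalClosure = 1000
  val AbnormalClosure = 1006 // no close frame received; benign since the connection is gone anyway

  // Decide whether a websocket close code should be surfaced as an error.
  // 1006 is suppressed, matching the reasoning in the commit message.
  def shouldSurfaceError(code: Int): Boolean =
    code != NormalClosure && code != AbnormalClosure
}
```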
Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200, rather than status code 410. This event is not a fatal error and simply indicates that a watch should be restarted – k8s will fire this event for any watch that has been open for longer than thirty minutes. In Linkerd, `Watchable` currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in `Watchable` and is passed downstream – in the case of issue #1636, to `EndpointsNamer`, which does not know how to handle this event. This leads to namers intermittently failing to resolve k8s endpoints.

Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in `Watchable` to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously.

I've added a test to `v1.ApiTest` that replicates the Kubernetes bug.

Fixes #1636
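The restart behavior the description attributes to `Watchable` can be sketched as a small dispatch model. This is a simplified illustration under assumed names (`WatchEvent`, `dispatch`), not the actual Linkerd implementation:

```scala
// Illustrative model of the workaround: "too old" errors restart the watch
// instead of being passed downstream (e.g. to EndpointsNamer).
object WatchRestartSketch {
  sealed trait WatchEvent
  final case class Modified(resourceVersion: String) extends WatchEvent
  final case class Error(code: Option[Int], message: Option[String]) extends WatchEvent

  // "Resource version too old" may arrive with code 410, or (due to the
  // k8s bug this PR works around) with code 200 and only the message.
  private def isTooOld(e: Error): Boolean =
    e.code.contains(410) || e.message.exists(_.startsWith("too old resource version"))

  // Returns the action a watch loop would take for an event.
  def dispatch(ev: WatchEvent): String = ev match {
    case e: Error if isTooOld(e) => "restart"         // restart the watch silently-ish
    case _                       => "pass-downstream" // all other events flow through
  }
}
```

The key point mirrored from the PR: the guard matches the buggy 200-coded event as well as the correct 410-coded one, so neither reaches downstream consumers.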