Workaround for K8s watch events with incorrect status codes #1649

Merged: 6 commits into master from eliza/too-old-resource-version on Sep 20, 2017

Conversation

@hawkw (Contributor) commented Sep 20, 2017

Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200, rather than status code 410. This event is not a fatal error and simply indicates that a watch should be restarted – k8s will fire this event for any watch that has been open for longer than thirty minutes. In Linkerd, Watchable currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in Watchable and is instead passed downstream – in the case of issue #1636, to EndpointsNamer, which does not know how to handle this event. This leads to namers intermittently failing to resolve k8s endpoints.

Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in Watchable to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously.
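
The shape of the workaround, as a minimal sketch – these are simplified stand-ins for Linkerd's k8s watch types, with the case-class names and fields assumed from the diff excerpts quoted below, not the merged code:

// Minimal sketch, not the merged change: simplified stand-ins for
// Linkerd's k8s watch model, with names assumed from this PR's diff.
case class Status(code: Option[Int], message: Option[String])

sealed trait Watch[+O]
object Watch {
  case class Modified[O](obj: O) extends Watch[O]
  case class Error[O](status: Status) extends Watch[O]
}

// Decide whether a watch event means "restart the watch" rather than
// "pass an error downstream".
def shouldRestartWatch[O](event: Watch[O]): Boolean = event match {
  // the well-behaved case: k8s reports "resource version too old" as 410 Gone
  case Watch.Error(Status(Some(410), _)) => true
  // workaround for kubernetes/kubernetes#35068: the same event can arrive
  // with status code 200, so also match on the message prefix
  case Watch.Error(Status(_, Some(msg))) if msg.startsWith("too old resource version") => true
  case _ => false
}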

I've added a test to v1.ApiTest that replicates the Kubernetes bug.
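
For reference, the buggy event such a test needs to replicate looks roughly like this (an approximation reusing the sketch types above; the message format follows the k8s "too old resource version: %d (%d)" error, and the real test in v1.ApiTest drives the actual client):

// Approximation of the buggy event, reusing the sketch types above.
// Kubernetes should send this with code 410 (Gone) but, per the bug, sends 200.
val buggyEvent: Watch[String] = Watch.Error(
  Status(code = Some(200), message = Some("too old resource version: 1 (123)")))

assert(shouldRestartWatch(buggyEvent), "watch should restart rather than fail downstream")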

Fixes #1636

@hawkw self-assigned this Sep 20, 2017
@hawkw requested review from olix0r and adleong on September 20, 2017 17:03
@adleong (Member) left a comment:

Awesome!

// special case to handle Kubernetes bug where "too old resource version" errors
// are returned with status code 200 rather than status code 410.
// see https://github.com/kubernetes/kubernetes/issues/35068 for details.
case e: Watch.Error[O] if e.status.message.exists(_.startsWith("too old resource version")) =>
Member commented:

Can we check e.status.code for 410 instead of inspecting the message?

@hawkw (Contributor, Author) replied:

That's definitely better than string comparison from a performance perspective. I felt like inspecting the message might be more future-proof (what if Kubernetes were to change the status code returned for this event, or add additional events with status code 410 that should be handled differently?) – but if you don't think that's as big a concern, I can change this.

Member replied:

I think that looking for a code is actually more future-proof than looking at a human-readable string. Plus, we already look for the HTTP status code 410.
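
Concretely, a code-first guard could keep the message match only as a fallback for the buggy 200 case (a sketch using the simplified Status above; the merged diff may differ):

// Prefer the canonical 410 (Gone) check; keep the message-prefix match
// only as a fallback for the buggy 200 case (kubernetes/kubernetes#35068).
def isResourceVersionTooOld(status: Status): Boolean =
  status.code.contains(410) ||
    status.message.exists(_.startsWith("too old resource version"))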

// special case to handle Kubernetes bug where "too old resource version" errors
// are returned with status code 200 rather than status code 410.
// see https://github.com/kubernetes/kubernetes/issues/35068 for details.
case e: Watch.Error[O] if e.status.message.exists(_.startsWith("too old resource version")) =>
  log.warning(
Member commented:

Is it true that this k8s bug has no adverse effects now that we're detecting "too old" events with status code 200? If so, I don't think we need to log. (we will continue to log at trace level that a watch was restarted)

@hawkw (Contributor, Author) replied:

Hmm. I thought it was worthwhile to encourage users to upgrade their Kubernetes version, so that we can eventually remove this workaround? I suspect it adds at least a minor performance hit.

Member replied:

I suspect this is pretty cheap, actually. The objects have already been parsed, after all. And we will never be able to assume that all users have upgraded off of the affected K8s version. The best we can do is eventually drop support for certain k8s versions.

@hawkw force-pushed the eliza/too-old-resource-version branch from 8dba935 to 1cd767c on September 20, 2017 17:57
@adleong (Member) left a comment:

⭐ 🕵️‍♀️ 👣

@olix0r (Member) left a comment:

ahhhhh mazing

@hawkw (Contributor, Author) commented Sep 20, 2017

aaaah my build finally succeeded! 🙌

@hawkw merged commit 2f286fe into master on Sep 20, 2017
@hawkw added this to the 1.2.2 milestone on Sep 22, 2017
Tim-Brooks pushed a commit to Tim-Brooks/linkerd that referenced this pull request on Dec 20, 2018:
When a websocket connection is closed between Chrome and a server, we get a 1006 error code signifying abnormal closure of the websocket connection. It seems as if we only get this error on Chrome web clients. Firefox and Safari do not encounter this issue.

The solution is to suppress 1006 errors that occur in the web browser, since the connection is closed anyway. No negative side effect occurs when the connection is closed abnormally, so the error message is benign.

fixes linkerd#1630  

Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
Successfully merging this pull request may close: HTTP 504 Gateway timeouts after upgrading to linkerd 1.2.0 (#1636)