Workaround for K8s watch events with incorrect status codes #1649

Merged: 6 commits into master from eliza/too-old-resource-version on Sep 20, 2017

Conversation

@hawkw (Contributor) commented Sep 20, 2017

Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200, rather than status code 410. This event is not a fatal error and simply indicates that a watch should be restarted – k8s will fire this event for any watch that has been open for longer than thirty minutes. In Linkerd, Watchable currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in Watchable and is instead passed downstream – in the case of issue #1636, to EndpointsNamer, which does not know how to handle this event. This leads to namers intermittently failing to resolve k8s endpoints.

Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in Watchable to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously.
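
The shape of the workaround, as a minimal sketch – these are simplified stand-ins for Linkerd's k8s watch types, with the case-class names and fields assumed from the diff excerpts quoted below, not the merged code:

// Minimal sketch, not the merged change: simplified stand-ins for
// Linkerd's k8s watch model, with names assumed from this PR's diff.
case class Status(code: Option[Int], message: Option[String])

sealed trait Watch[+O]
object Watch {
  case class Modified[O](obj: O) extends Watch[O]
  case class Error[O](status: Status) extends Watch[O]
}

// Decide whether a watch event means "restart the watch" rather than
// "pass an error downstream".
def shouldRestartWatch[O](event: Watch[O]): Boolean = event match {
  // the well-behaved case: k8s reports "resource version too old" as 410 Gone
  case Watch.Error(Status(Some(410), _)) => true
  // workaround for kubernetes/kubernetes#35068: the same event can arrive
  // with status code 200, so also match on the message prefix
  case Watch.Error(Status(_, Some(msg))) if msg.startsWith("too old resource version") => true
  case _ => false
}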

I've added a test to v1.ApiTest that replicates the Kubernetes bug.
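
For reference, the buggy event such a test needs to replicate looks roughly like this (an approximation reusing the sketch types above; the message format follows the k8s "too old resource version: %d (%d)" error, and the real test in v1.ApiTest drives the actual client):

// Approximation of the buggy event, reusing the sketch types above.
// Kubernetes should send this with code 410 (Gone) but, per the bug, sends 200.
val buggyEvent: Watch[String] = Watch.Error(
  Status(code = Some(200), message = Some("too old resource version: 1 (123)")))

assert(shouldRestartWatch(buggyEvent), "watch should restart rather than fail downstream")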

Fixes #1636

@hawkw self-assigned this Sep 20, 2017
@hawkw requested review from olix0r and adleong on September 20, 2017 17:03
@adleong (Member) left a comment:

Awesome!

// special case to handle Kubernetes bug where "too old resource version" errors
// are returned with status code 200 rather than status code 410.
// see https://github.com/kubernetes/kubernetes/issues/35068 for details.
case e: Watch.Error[O] if e.status.message.exists(_.startsWith("too old resource version")) =>
Member commented:

Can we check e.status.code for 410 instead of inspecting the message?

@hawkw (Contributor, Author) replied:

That's definitely better than string comparison from a performance perspective. I felt like inspecting the message might be more future-proof (what if Kubernetes were to change the status code returned for this event, or add additional events with status code 410 that should be handled differently?) – but if you don't think that's as big a concern, I can change this.

Member replied:

I think that looking for a code is actually more future-proof than looking at a human-readable string. Plus, we already look for the HTTP status code 410.
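
Concretely, a code-first guard could keep the message match only as a fallback for the buggy 200 case (a sketch using the simplified Status above; the merged diff may differ):

// Prefer the canonical 410 (Gone) check; keep the message-prefix match
// only as a fallback for the buggy 200 case (kubernetes/kubernetes#35068).
def isResourceVersionTooOld(status: Status): Boolean =
  status.code.contains(410) ||
    status.message.exists(_.startsWith("too old resource version"))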

// special case to handle Kubernetes bug where "too old resource version" errors
// are returned with status code 200 rather than status code 410.
// see https://github.com/kubernetes/kubernetes/issues/35068 for details.
case e: Watch.Error[O] if e.status.message.exists(_.startsWith("too old resource version")) =>
  log.warning(
Member commented:

Is it true that this k8s bug has no adverse effects now that we're detecting "too old" events with status code 200? If so, I don't think we need to log. (we will continue to log at trace level that a watch was restarted)

@hawkw (Contributor, Author) replied:

Hmm. I thought it was worthwhile to encourage users to upgrade their Kubernetes version, so that we can eventually remove this workaround? I suspect it adds at least a minor performance hit.

Member replied:

I suspect this is pretty cheap, actually. The objects have already been parsed, after all. And we will never be able to assume that all users have upgraded off of the affected K8s version. The best we can do is eventually drop support for certain k8s versions.

@hawkw force-pushed the eliza/too-old-resource-version branch from 8dba935 to 1cd767c on September 20, 2017 17:57
@adleong (Member) left a comment:

⭐ 🕵️‍♀️ 👣

@olix0r (Member) left a comment:

ahhhhh mazing

@hawkw (Contributor, Author) commented Sep 20, 2017

aaaah my build finally succeeded! 🙌

@hawkw merged commit 2f286fe into master on Sep 20, 2017
@hawkw added this to the 1.2.2 milestone on Sep 22, 2017
Tim-Brooks pushed a commit to Tim-Brooks/linkerd that referenced this pull request on Dec 20, 2018:
When a websocket connection is closed between Chrome and a server, we get a 1006 error code signifying abnormal closure of the websocket connection. It seems as if we only get this error on Chrome web clients. Firefox and Safari do not encounter this issue.

The solution is to suppress 1006 errors that occur in the web browser, since the connection is closed anyway. No negative side effect occurs when the connection is closed abnormally, so the error message is benign.

fixes linkerd#1630  

Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
Successfully merging this pull request may close: HTTP 504 Gateway timeouts after upgrading to linkerd 1.2.0 (#1636)