API Server returns HTTP 200 OK on "too old resource version" errors #35068

kelseyhightower · 2016-10-18T21:58:41Z

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

"too old resource version"

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.0", GitCommit:"a16c0a7f71a6f93c7e0f222d961f4675cd97a46b", GitTreeState:"clean", BuildDate:"2016-09-26T18:16:57Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.1", GitCommit:"33cf7b9acbb2cb7c9c72a10d6636321fb180b159", GitTreeState:"clean", BuildDate:"2016-10-10T18:13:36Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

gcloud container clusters list

NAME  ZONE        MASTER_VERSION  MASTER_IP       MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
k0    us-west1-b  1.4.1           104.199.114.88  n1-standard-1  1.4.1         3          RUNNING

What happened:

When the resource version is too old the API server returns HTTP 200 OK.

$ curl -i http://127.0.0.1:8001/api/v1/watch/namespaces/default/endpoints/nginx?resourceVersion=240385

HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 18 Oct 2016 21:47:16 GMT
Content-Length: 176

{"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 240385 (250923)","reason":"Gone","code":410}}

What you expected to happen:

When the resource version is too old the API server returns HTTP 410 to match the error code in the response body.
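
Until the status line and the body agree, a client can only see this failure by decoding the body. A minimal Python sketch of that check, using the exact body from the response above; `embedded_status_code` is a hypothetical helper written for illustration, not part of any Kubernetes client library:

```python
import json

# Body returned by the watch request above (the HTTP status line said 200).
BODY = (
    '{"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},'
    '"status":"Failure","message":"too old resource version: 240385 (250923)",'
    '"reason":"Gone","code":410}}'
)


def embedded_status_code(body):
    """Return the code of an embedded Status object, or None for a normal event."""
    event = json.loads(body)
    obj = event.get("object") or {}
    if event.get("type") == "ERROR" and obj.get("kind") == "Status":
        return obj.get("code")
    return None


http_status = 200  # what the HTTP status line reported
body_status = embedded_status_code(BODY)
if body_status is not None and body_status != http_status:
    print(f"mismatch: HTTP {http_status}, body reports {body_status}")
```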

How to reproduce it (as minimally and precisely as possible):

Create a deployment and expose it.
Scale up the deployment to 3 replicas.
Wait ~30 minutes, then get the endpoints backing the service.
Perform a watch on the endpoints using the resource version from the previous get.

Anything else we need to know:

No.


ghost · 2016-10-18T23:16:05Z

Looks like a straight bug to me.

kelseyhightower · 2016-10-19T02:11:50Z

Another thing that I find odd is that the resource version will remain too old and there is no way to retrieve a resource version that is current and will work with a future watch request.

Hours later, with no changes to the nginx endpoints, I get the following:

Get the current nginx endpoint and capture the resource version (250647):

$ curl http://127.0.0.1:8001/api/v1/namespaces/default/endpoints/nginx

{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "nginx",
    "namespace": "default",
    "selfLink": "/api/v1/namespaces/default/endpoints/nginx",
    "uid": "ce671363-942c-11e6-87dc-42010a8a008d",
    "resourceVersion": "250647",
    "creationTimestamp": "2016-10-17T05:44:44Z",
    "labels": {
      "run": "nginx"
    }
  },
  "subsets": [
    {
      "addresses": [
        {
          "ip": "10.176.0.31",
          "nodeName": "gke-k0-default-pool-12695b58-7ocd",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-xkhwx",
            "uid": "c230d8f9-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240348"
          }
        },
        {
          "ip": "10.176.0.32",
          "nodeName": "gke-k0-default-pool-12695b58-7ocd",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-j1rih",
            "uid": "c2314f78-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240345"
          }
        },
        {
          "ip": "10.176.2.28",
          "nodeName": "gke-k0-default-pool-12695b58-rltz",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-17g3g",
            "uid": "54b83562-955b-11e6-87dc-42010a8a008d",
            "resourceVersion": "228832"
          }
        },
        {
          "ip": "10.176.2.59",
          "nodeName": "gke-k0-default-pool-12695b58-rltz",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-l1dfz",
            "uid": "c230f0ec-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240372"
          }
        },
        {
          "ip": "10.176.2.61",
          "nodeName": "gke-k0-default-pool-12695b58-rltz",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-rchna",
            "uid": "c230b4c2-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240385"
          }
        }
      ],
      "ports": [
        {
          "port": 80,
          "protocol": "TCP"
        }
      ]
    }
  ]
}

Start a watch using the resource version (250647):

$ curl -i http://127.0.0.1:8001/api/v1/watch/namespaces/default/endpoints/nginx?resourceVersion=250647

HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 19 Oct 2016 02:08:12 GMT
Content-Length: 176

{"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 250647 (275904)","reason":"Gone","code":410}}

Is there another way to get the current resource version? My current workaround is to check whether the event type is ERROR and, if so, restart the watch at resourceVersion 0. The drawback is that you then need to ignore the first ADDED event, or you end up in a fast loop that can DDoS the API server.
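
For anyone who needs the same workaround, here is a rough Python sketch of it. It is only an illustration under assumptions: `open_watch` is a hypothetical callable that returns the JSON lines of one watch request, and the restart limit and delay are made-up defaults added to avoid the fast loop, not values from the API server.

```python
import json
import time


def watch_with_restart(open_watch, handle, resource_version,
                       max_restarts=3, delay=1.0, sleep=time.sleep):
    """Consume watch events, restarting at resourceVersion 0 on a 410 ERROR event.

    open_watch(resource_version) is a hypothetical callable returning an
    iterable of JSON lines; handle(event) receives every other event.
    Returns the number of restarts once a stream ends without a 410.
    """
    restarts = 0
    while True:
        # After a restart at 0, the first ADDED event is ignored.
        skip_first_added = resource_version == "0"
        gone = False
        for line in open_watch(resource_version):
            event = json.loads(line)
            if event.get("type") == "ERROR":
                gone = (event.get("object") or {}).get("code") == 410
                break
            if skip_first_added and event.get("type") == "ADDED":
                skip_first_added = False
                continue
            skip_first_added = False
            handle(event)
        if not gone:
            return restarts  # stream ended, or failed with a different error
        if restarts >= max_restarts:
            raise RuntimeError("watch kept returning 410 Gone")
        restarts += 1
        sleep(delay * restarts)  # back off so restarts cannot hammer the server
        resource_version = "0"
```

Passing `sleep` and `open_watch` in as parameters keeps the loop testable without a cluster; a real client would wrap its HTTP request in `open_watch`.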

lavalamp · 2016-11-16T23:37:26Z

Is this a bug in the kubectl proxy or a bug in apiserver? I'm having a hard time believing that we return 200s from apiserver in this condition.

lavalamp · 2016-11-17T18:01:19Z

Oh, wait a second. I forgot how this worked. The 200 is because the connection was established; the server sent it before getting the error from etcd. There (unfortunately) have to be two distinct error mechanisms for streaming connections, since you only get one chance to return a status code but an error can happen at any point.

It looks bad when an error is the first thing returned, but IIRC changing this is actually technically difficult. And since clients need to handle errors in this form anyway, it may actually be a good thing that it's easy to produce one, so people will be surprised long before they get to production.
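
To make the two mechanisms concrete, here is a small Python sketch of a client-side wrapper that folds both into one exception. `WatchError` and `iter_watch_events` are hypothetical names for illustration, and the fallback code `0` for a Status without a code is an arbitrary choice.

```python
import json


class WatchError(Exception):
    """Raised for a watch failure from either error channel."""

    def __init__(self, code, message, in_stream):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.in_stream = in_stream  # True if the error arrived after the 200


def iter_watch_events(http_status, lines):
    """Yield watch events, raising WatchError from either error channel."""
    # Channel 1: the status line, sent once when the response starts.
    if http_status != 200:
        raise WatchError(http_status, "watch request rejected", in_stream=False)
    # Channel 2: ERROR events inside a stream that already returned 200.
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "ERROR":
            status = event.get("object") or {}
            raise WatchError(status.get("code") or 0,
                             status.get("message", ""), in_stream=True)
        yield event
```

Callers then handle `WatchError` with `code == 410` the same way regardless of where in the response it surfaced.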

@kelseyhightower

sjenning · 2017-06-20T16:44:34Z

Just stumbled on this issue. The resourceVersion starting at a point that is too old is addressed in kubectl with #27392.

As for watching from the current resourceVersion, the resourceVersion returned in lists is always the current resourceVersion. Setting resourceVersion=0 in the watch also works.
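
A short Python sketch of the list-then-watch approach, for illustration only. The sample list body is hand-written with made-up values, the helper names are hypothetical, and the collection watch path is assumed to follow the same URL shape as the requests earlier in this thread.

```python
import json
from urllib.parse import urlencode

# Hand-written, trimmed list response; the values are illustrative only.
SAMPLE_LIST = (
    '{"kind":"EndpointsList","apiVersion":"v1",'
    '"metadata":{"resourceVersion":"275904"},'
    '"items":[{"metadata":{"name":"nginx","resourceVersion":"250647"}}]}'
)


def list_resource_version(list_body):
    """Return the resourceVersion of the list itself, not of any item in it."""
    return json.loads(list_body)["metadata"]["resourceVersion"]


def watch_url(base, namespace, resource, resource_version):
    """Build a watch URL in the same shape as the requests in this thread."""
    query = urlencode({"resourceVersion": resource_version})
    return f"{base}/api/v1/watch/namespaces/{namespace}/{resource}?{query}"


rv = list_resource_version(SAMPLE_LIST)
print(watch_url("http://127.0.0.1:8001", "default", "endpoints", rv))
```

The point of the sketch is the distinction between the list's `metadata.resourceVersion` and the per-item value: the watch starts from the former.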

sjenning · 2017-06-20T17:05:54Z

I think this is fixed by #25369

Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200, rather than status code 410. This event is not a fatal error and simply indicates that a watch should be restarted – k8s will fire this event for any watch that has been open for longer than thirty minutes. In Linkerd, `Watchable` currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in `Watchable` and is passed downstream – in the case of issue #1636, to `EndpointsNamer`, which does not know how to handle this event. This leads to namers intermittently failing to resolve k8s endpoints. Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in `Watchable` to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously. I've added a test to `v1.ApiTest` that replicates the Kubernetes bug. Fixes #1636

fejta-bot · 2017-12-29T02:14:37Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot · 2018-01-28T02:23:15Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot · 2018-02-27T03:08:50Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

References:
* ManageIQ/kubeclient#452 (comment)
* fabric8io/kubernetes-client#1800 (comment)
* kubernetes/kubernetes#25151 (comment)
* kubernetes/kubernetes#35068 (comment)
* https://github.com/kubernetes/kubernetes/blob/dde6e8e7465468c32642659cb708a5cc922add64/test/e2e/apimachinery/protocol.go#L68-L75
* https://kubernetes.io/docs/reference/using-api/api-concepts/#410-gone-responses
* https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
* https://www.baeldung.com/java-kubernetes-watch#1-resource-versions

k8s-github-robot added area/kubectl sig/api-machinery labels Oct 18, 2016

ghost added the priority/important-soon label Oct 18, 2016

ghost assigned lavalamp Oct 18, 2016

ghost closed this as completed Oct 18, 2016

ghost reopened this Oct 18, 2016

lavalamp added area/apiserver and removed area/kubectl labels Nov 16, 2016

lavalamp added area/kubectl priority/backlog and removed area/kubectl priority/important-soon labels Nov 16, 2016

bseibel mentioned this issue Sep 20, 2017

HTTP 504 Gateway timeouts after upgrading to linkerd 1.2.0 linkerd/linkerd#1636

Closed

hawkw added a commit to linkerd/linkerd that referenced this issue Sep 20, 2017

Add test for handling kubernetes/kubernetes#35068.

92fcd6d

hawkw added a commit to linkerd/linkerd that referenced this issue Sep 20, 2017

Add workaround for kubernetes/kubernetes#35068 to ServiceNamer

6f3a62f

hawkw mentioned this issue Sep 20, 2017

Workaround for K8s watch events with incorrect status codes linkerd/linkerd#1649

Merged

k8s-ci-robot added the lifecycle/stale label Dec 29, 2017

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jan 28, 2018

k8s-ci-robot closed this as completed Feb 27, 2018

jakerobb mentioned this issue Feb 28, 2018

Watch error details not available to client kubernetes-client/java#206

Closed

cben mentioned this issue Jul 10, 2020

How to detect 410 Gone in watch response? ManageIQ/kubeclient#452

Open

stubbornTanzhe mentioned this issue Jul 11, 2022

help request: not forward first 200 response from apiserver to client apache/apisix#7426

Closed
