API Server returns HTTP 200 OK on "too old resource version" errors #35068

Closed
kelseyhightower opened this issue Oct 18, 2016 · 9 comments
Labels: area/apiserver, lifecycle/rotten, priority/backlog, sig/api-machinery

Comments

@kelseyhightower
Contributor

kelseyhightower commented Oct 18, 2016

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

  • "too old resource version"

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.0", GitCommit:"a16c0a7f71a6f93c7e0f222d961f4675cd97a46b", GitTreeState:"clean", BuildDate:"2016-09-26T18:16:57Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.1", GitCommit:"33cf7b9acbb2cb7c9c72a10d6636321fb180b159", GitTreeState:"clean", BuildDate:"2016-10-10T18:13:36Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

gcloud container clusters list
NAME  ZONE        MASTER_VERSION  MASTER_IP       MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
k0    us-west1-b  1.4.1           104.199.114.88  n1-standard-1  1.4.1         3          RUNNING

What happened:

When the resource version is too old the API server returns HTTP 200 OK.

$ curl -i http://127.0.0.1:8001/api/v1/watch/namespaces/default/endpoints/nginx?resourceVersion=240385
HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 18 Oct 2016 21:47:16 GMT
Content-Length: 176

{"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 240385 (250923)","reason":"Gone","code":410}}

What you expected to happen:

When the resource version is too old the API server returns HTTP 410 to match the error code in the response body.

How to reproduce it (as minimally and precisely as possible):

  • Create a deployment and expose it.
  • Scale up the deployment to 3 nodes
  • Wait ~30 mins and get the endpoints backing the service
  • Perform a watch on the endpoint using the resource version from the previous get

Anything else we need to know:

No.

@k8s-github-robot k8s-github-robot added area/kubectl sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Oct 18, 2016
@ghost ghost added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 18, 2016
@ghost ghost assigned lavalamp Oct 18, 2016
@ghost

ghost commented Oct 18, 2016

Looks like a straight bug to me.

@ghost ghost closed this as completed Oct 18, 2016
@ghost ghost reopened this Oct 18, 2016
@kelseyhightower
Contributor Author

Another thing that I find odd is that the resource version will remain too old and there is no way to retrieve a resource version that is current and will work with a future watch request.

Hours later and no changes to the nginx endpoints results in the following:

Get the current nginx endpoint and capture the resource version (250647):

$ curl http://127.0.0.1:8001/api/v1/namespaces/default/endpoints/nginx
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "nginx",
    "namespace": "default",
    "selfLink": "/api/v1/namespaces/default/endpoints/nginx",
    "uid": "ce671363-942c-11e6-87dc-42010a8a008d",
    "resourceVersion": "250647",
    "creationTimestamp": "2016-10-17T05:44:44Z",
    "labels": {
      "run": "nginx"
    }
  },
  "subsets": [
    {
      "addresses": [
        {
          "ip": "10.176.0.31",
          "nodeName": "gke-k0-default-pool-12695b58-7ocd",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-xkhwx",
            "uid": "c230d8f9-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240348"
          }
        },
        {
          "ip": "10.176.0.32",
          "nodeName": "gke-k0-default-pool-12695b58-7ocd",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-j1rih",
            "uid": "c2314f78-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240345"
          }
        },
        {
          "ip": "10.176.2.28",
          "nodeName": "gke-k0-default-pool-12695b58-rltz",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-17g3g",
            "uid": "54b83562-955b-11e6-87dc-42010a8a008d",
            "resourceVersion": "228832"
          }
        },
        {
          "ip": "10.176.2.59",
          "nodeName": "gke-k0-default-pool-12695b58-rltz",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-l1dfz",
            "uid": "c230f0ec-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240372"
          }
        },
        {
          "ip": "10.176.2.61",
          "nodeName": "gke-k0-default-pool-12695b58-rltz",
          "targetRef": {
            "kind": "Pod",
            "namespace": "default",
            "name": "nginx-1172225296-rchna",
            "uid": "c230b4c2-956a-11e6-87dc-42010a8a008d",
            "resourceVersion": "240385"
          }
        }
      ],
      "ports": [
        {
          "port": 80,
          "protocol": "TCP"
        }
      ]
    }
  ]
}

Start a watch using the resource version (250647):

$ curl -i http://127.0.0.1:8001/api/v1/watch/namespaces/default/endpoints/nginx?resourceVersion=250647
HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 19 Oct 2016 02:08:12 GMT
Content-Length: 176

{"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 250647 (275904)","reason":"Gone","code":410}}

Is there another way to get the current resource version? My current workaround is to check whether the type is ERROR and, if so, restart the watch at 0. The drawback is that you need to ignore the first response with type ADDED, or else you end up in a fast loop and DDoS the API server.
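
A minimal sketch of that workaround in Python, for illustration only. `open_watch` is a hypothetical callable (not part of any Kubernetes client) that opens a watch at the given resourceVersion and returns the decoded events; `max_restarts` is also an assumption, added so that a watch that keeps failing cannot loop forever.

```python
from typing import Callable, Iterable, Iterator


def watch_with_restart(
    open_watch: Callable[[str], Iterable[dict]],  # hypothetical: opens one watch stream
    resource_version: str,
    max_restarts: int = 3,  # assumption: cap restarts to avoid a fast loop
) -> Iterator[dict]:
    """Yield watch events, restarting at resourceVersion=0 after an ERROR event."""
    restarts = 0
    skip_initial_added = False
    while True:
        restarted = False
        for event in open_watch(resource_version):
            if event.get("type") == "ERROR":
                if restarts >= max_restarts:
                    message = event.get("object", {}).get("message")
                    raise RuntimeError("watch kept failing: %r" % message)
                restarts += 1
                resource_version = "0"
                skip_initial_added = True
                restarted = True
                break
            # Remember the newest version seen, including the skipped event's.
            resource_version = event["object"]["metadata"]["resourceVersion"]
            if skip_initial_added and event.get("type") == "ADDED":
                # First ADDED after a restart at 0 only replays current state.
                skip_initial_added = False
                continue
            skip_initial_added = False
            yield event
        if not restarted:
            # The stream ended without an error; a real client would reconnect.
            return
```

Because the stream opener is injected, the restart logic can be exercised with canned events before pointing it at a real API server.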

@lavalamp
Member

Is this a bug in the kubectl proxy or a bug in apiserver? I'm having a hard time believing that we return 200s from apiserver in this condition.

@lavalamp lavalamp added area/kubectl priority/backlog Higher priority than priority/awaiting-more-evidence. and removed area/kubectl priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 16, 2016
@lavalamp
Member

Oh, wait a second. I forgot how this worked. The 200 is because the connection was established; it did that before getting the error from etcd. There (unfortunately) have to be two distinct error mechanisms for streaming connections, since you only get one chance to return a status code but an error can happen at any point.

It looks bad when an error is the first thing returned, but IIRC changing this is actually technically difficult. And since clients need to handle errors in this form anyway, it may actually be a good thing that it's easy to produce one, so people will be surprised long before they get to production.
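
To illustrate the two error mechanisms, here is a rough Python sketch of a client-side reader that checks both. The names `WatchError` and `read_watch` are hypothetical and not taken from any Kubernetes client library; the sketch assumes the watch body arrives as one JSON event per line, as in the curl output earlier in this issue.

```python
import json
from typing import Iterable, Iterator


class WatchError(Exception):
    """A watch failure reported by either mechanism (hypothetical helper type)."""

    def __init__(self, code: int, message: str, in_stream: bool):
        super().__init__(message)
        self.code = code
        self.in_stream = in_stream  # True if the error arrived as a watch event


def read_watch(http_status: int, lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded watch events, raising WatchError for either kind of failure."""
    # Mechanism 1: the request failed before any streaming started.
    if not 200 <= http_status < 300:
        raise WatchError(http_status, "watch request failed", in_stream=False)
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        # Mechanism 2: the connection is already 200 OK, so the error can only
        # be delivered as an ERROR event carrying a Status object.
        if event.get("type") == "ERROR":
            status = event.get("object", {})
            raise WatchError(status.get("code", 0), status.get("message", ""), in_stream=True)
        yield event
```

A caller can then treat `code == 410` the same way regardless of which mechanism reported it, for example by re-listing and starting a new watch.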

@kelseyhightower

@sjenning
Contributor

sjenning commented Jun 20, 2017

Just stumbled on this issue. The resourceVersion starting at a point that is too old is addressed in kubectl with #27392.

As far as watching from the current resourceVersion, the resourceVersion returned in lists is always the current resourceVersion. Setting resourceVersion=0 in the watch also works.
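
As a rough sketch of the list-then-watch approach, the Python below takes the body of a list response and builds a watch URL from the list's own `metadata.resourceVersion`. The helper name `watch_url_after_list` and its parameters are hypothetical; the watch path follows the `/api/v1/watch/...` form used in the curl commands earlier in this issue, and the endpoints resource is hard-coded purely for illustration.

```python
import json
from urllib.parse import urlencode


def watch_url_after_list(base_url: str, list_body: str, namespace: str = "default") -> str:
    """Build a watch URL that starts at the resourceVersion of a list response."""
    listing = json.loads(list_body)
    # Use the version of the list itself, not of any individual item in it.
    version = listing["metadata"]["resourceVersion"]
    query = urlencode({"resourceVersion": version})
    return "%s/api/v1/watch/namespaces/%s/endpoints?%s" % (
        base_url.rstrip("/"),
        namespace,
        query,
    )
```

The list body would come from a GET on `/api/v1/namespaces/default/endpoints`; in the single-object GET shown earlier, `metadata.resourceVersion` (250647) is the version of that one object, which is why it could already be too old for a watch.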

@sjenning
Contributor

I think this is fixed by #25369

hawkw added a commit to linkerd/linkerd that referenced this issue Sep 20, 2017
Due to a regression in some versions of Kubernetes (kubernetes/kubernetes#35068), the "resource version too old" watch event sometimes has HTTP status code 200, rather than status code 410. This event is not a fatal error and simply indicates that a watch should be restarted: k8s will fire this event for any watch that has been open for longer than thirty minutes. In Linkerd, `Watchable` currently detects this event by matching the HTTP status code of the watch event, and restarts the watch when it occurs. However, when the event is fired with the incorrect status code, the error is not handled in `Watchable` and is passed downstream (in the case of issue #1636, to `EndpointsNamer`, which does not know how to handle this event). This leads to namers intermittently failing to resolve k8s endpoints.

Although this issue seems to have been fixed upstream in kubernetes/kubernetes#25369, many users of Linkerd are running versions of Kubernetes where it still occurs. Therefore, I've added a workaround in `Watchable` to detect "resource version too old" events with status code 200 and restart the watch rather than passing these events downstream. When this occurs, Linkerd now logs a warning indicating that, although the error was handled, Kubernetes behaved erroneously.

I've added a test to `v1.ApiTest` that replicates the Kubernetes bug.

Fixes #1636
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 29, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 28, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

6 participants