
kubectl apply should retry stale resource version. #15493

Closed
j3ffml opened this issue Oct 12, 2015 · 34 comments · Fixed by #26557
Labels
area/app-lifecycle area/kubectl kind/flake Categorizes issue or PR as related to a flaky test. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@j3ffml
Contributor

j3ffml commented Oct 12, 2015

From an e2e test failure:

Error from server: error when applying patch:
    {"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"kind\":\"ReplicationController\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"redis-master\",\"creationTimestamp\":null,\"labels\":{\"app\":\"redis\",\"kubectl.kubernetes.io/apply-test\":\"ADDED\",\"role\":\"master\"}},\"spec\":{\"replicas\":1,\"selector\":{\"app\":\"redis\",\"kubectl.kubernetes.io/apply-test\":\"ADDED\",\"role\":\"master\"},\"template\":{\"metadata\":{\"creationTimestamp\":null,\"labels\":{\"app\":\"redis\",\"kubectl.kubernetes.io/apply-test\":\"ADDED\",\"role\":\"master\"}},\"spec\":{\"containers\":[{\"name\":\"redis-master\",\"image\":\"redis\",\"ports\":[{\"name\":\"redis-server\",\"containerPort\":6379}],\"resources\":{}}]}}},\"status\":{\"replicas\":0}}"},"creationTimestamp":null,"labels":{"kubectl.kubernetes.io/apply-test":"ADDED"}},"spec":{"selector":{"kubectl.kubernetes.io/apply-test":"ADDED"},"template":{"metadata":{"labels":{"kubectl.kubernetes.io/apply-test":"ADDED"}}}}}
    to:
    &{0xc208226190 0xc2080cbea0 e2e-tests-kubectl-ztbsd redis-master STDIN 0xc208568ff0 0xc2083aa960 1387}
    for: "STDIN": replicationControllers "redis-master" cannot be updated: the object has been modified; please apply your changes to the latest version and try again

The command should retry on resource version mismatches, as scale and rolling-update do.

cc @kubernetes/kubectl

@timothysc
Member

/cc @jayunit100 fyi.

@jayunit100
Member

From the kubectl apply perspective, isn't the failure intentional (i.e., an optimistic mutation)? Or maybe I'm missing something. I would think it's the user's job to retry?

@jackgr
Contributor

jackgr commented Jan 27, 2016

The desired behavior that we settled on previously is to detect version mismatches at the client and report an error. The user can force replacement using --force. See #16543, #16569 and #17237.

@bgrant0607 bgrant0607 added priority/backlog Higher priority than priority/awaiting-more-evidence. area/app-lifecycle labels Jan 28, 2016
@bgrant0607
Member

@jackgr What is the mismatch here? If it is just resourceVersion, kubectl should retry. It should only bail on a meaningful conflict.

We're working on eliminating flaky tests (and flaky parts of the system).

Also, that log message isn't very useful.

@bgrant0607 bgrant0607 added this to the next-candidate milestone Jan 28, 2016
@jackgr
Contributor

jackgr commented Jan 28, 2016

@bgrant0607 per discussion on #16543, we can't determine the correct behavior when the resource version changes without more information, so best to catch version mismatch at the client and bail, and let the user replace the object with --force if they want. We can also make the error message less verbose.

@bgrant0607
Member

@jackgr I don't see any mention of resourceVersion in #16543. It shouldn't be serialized in the annotation, because the user shouldn't specify it in their config. The resourceVersion-based precondition is intended to guard against concurrent read-modify-write operations. kubectl apply should be able to retry the entire diff&patch in the case of this type of conflict.

That said, it's not obvious to me from the log message what the actual problem was in this case.

@jackgr
Contributor

jackgr commented Jan 28, 2016

I'm sorry, my mistake. You're right that resourceVersion is not in the annotation, and that we should retry on resourceVersion failure.

I was confusing this with the api version change issue discussed in #16543. This is actually a much simpler issue, in the sense that it's an optimistic lock conflict.

If we encounter a lock conflict, then we should fetch a fresh copy of the object, and try again to apply the new configuration supplied by the user.

  • If we don't get a merge conflict when we resubmit, then all's well.
  • If we get a merge conflict because the object has changed out from under the user, then we report it in those terms, and the user can decide what to do (e.g., replace using --force).

Does this fit your expectation of how it should be handled?
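
For illustration, the flow described above might look roughly like the following (a minimal sketch only, with hypothetical getLive and diffAndPatch helpers standing in for fetching the live object and computing/submitting the patch; it is not the actual kubectl code):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for an apiserver 409 Conflict (stale resourceVersion).
var errConflict = errors.New("conflict: object has been modified")

// applyWithRetry sketches the flow described above: fetch the live object,
// compute and submit a patch; on a version conflict, refetch and recompute;
// after maxRetries, surface the error so the user can decide (e.g. --force).
func applyWithRetry(getLive func() ([]byte, error), diffAndPatch func(live []byte) error, maxRetries int) error {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		live, err := getLive()
		if err != nil {
			return err
		}
		lastErr = diffAndPatch(live)
		if lastErr == nil {
			return nil // patch applied cleanly
		}
		if !errors.Is(lastErr, errConflict) {
			return lastErr // genuine merge conflict or other error: report it, do not retry
		}
		// stale resourceVersion: loop, refetch, and recompute the patch
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, lastErr)
}

func main() {
	// Fake client calls: the first patch attempt hits a conflict, the second succeeds.
	attempts := 0
	err := applyWithRetry(
		func() ([]byte, error) { return []byte(`{"kind":"ReplicationController"}`), nil },
		func([]byte) error {
			attempts++
			if attempts == 1 {
				return errConflict
			}
			return nil
		},
		5,
	)
	fmt.Println("apply result:", err, "after", attempts, "patch attempts")
}
```

The key point is that only conflict errors trigger a refetch-and-recompute; any other failure, including a genuine merge conflict, is reported to the user immediately.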

@bprashanth
Contributor

This is causing test flake:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/20534/kubernetes-pull-build-test-e2e-gce/28891
http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/28891/

    to:
    &{0xc20827c190 0xc2080cbf80 e2e-tests-kubectl-o39wt redis-master STDIN 0xc208504690 0xc208562780 4508}
    for: "STDIN": replicationControllers "redis-master" cannot be updated: the object has been modified; please apply your changes to the latest version and try again


not to have occurred

@bprashanth bprashanth added the kind/flake Categorizes issue or PR as related to a flaky test. label Feb 18, 2016
@bgrant0607 bgrant0607 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Feb 18, 2016
@lavalamp
Member

@jackgr yes, your understanding is correct. Do you have time to fix this? We're trying to get all our tests fixed. Additionally this sounds like a legit system bug, not just a flaky test problem.

@jackgr
Contributor

jackgr commented Feb 19, 2016

@lavalamp Maybe. Swamped right now, but can fit it in given enough notice. When is the fix needed?

@lavalamp
Member

@jackgr asap. We'll find someone else.

@bprashanth
Contributor

This should just be a generic GET/Update retry loop, e.g.:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replication/replication_controller_utils.go#L53
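
For reference, such a generic GET/Update retry loop can be expressed with today's client-go conflict-retry helper; the sketch below is only an illustration (namespace, name, and replica count are placeholders, and it assumes a cluster reachable through the default kubeconfig):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); assumes a reachable cluster.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	rcs := kubernetes.NewForConfigOrDie(config).CoreV1().ReplicationControllers("default")

	// Generic GET/Update retry loop: refetch on every attempt so the update is
	// always made against the latest resourceVersion; RetryOnConflict retries
	// only when the apiserver returns a 409 Conflict.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rc, getErr := rcs.Get(context.TODO(), "redis-master", metav1.GetOptions{})
		if getErr != nil {
			return getErr
		}
		replicas := int32(2) // placeholder mutation
		rc.Spec.Replicas = &replicas
		_, updateErr := rcs.Update(context.TODO(), rc, metav1.UpdateOptions{})
		return updateErr
	})
	fmt.Println("update result:", err)
}
```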

@jackgr
Contributor

jackgr commented Feb 19, 2016

Go for it.


@lavalamp
Member

Chao, do you have cycles to take a look?

@caesarxuchao
Member

I thought Patch was retried on the server. I'll investigate.

@caesarxuchao
Member

Patch is retried server-side five times before an error is returned: https://github.com/kubernetes/kubernetes/blob/master/pkg/apiserver/resthandler.go#L92, so I expect this issue to occur very rarely. In fact, it has passed the last ~400 kubernetes-gce-e2e tests.

If we want to make this test less flaky, I'd recommend increasing the number of retries on the server side. What do you guys think?

@lavalamp
Member

It'll fail if an actual conflicting change was made, though. Do we know what the nature of the conflict was? Maybe first we need to add some log messages to figure out what the conflict was next time it occurs.


@bprashanth
Contributor

so I expect this issue to occur very rarely. In fact, it has passed the last ~400 kubernetes-gce-e2e tests.

This happened last week: #15493 (comment)

@caesarxuchao
Member

@lavalamp you are right. I thought an actual conflicting change would result in a different error message, but it doesn't. I'll add the logs.

@bgrant0607
Member

I only see 3 reports of failures since October, so I'm going to say, no, it doesn't need to be fixed in 1.2.

@adohe-zz

@bgrant0607 any progress on this now? I have returned to the kubectl apply issues and hope I can fix all apply-related issues before the 1.3.0 release.

@caesarxuchao
Member

I'll try to check the flake rate again. IMO if the flake rate is non-negligible, we should increase the retry limits of Patch, or more aggressively, just remove the limit. https://github.com/kubernetes/kubernetes/blob/master/pkg/apiserver/resthandler.go#L92

@smarterclayton
Contributor

Infinite retry still scares me, mostly because we may have failure modes that could race for a very long time.


@bgrant0607
Member

@adohe I don't think anyone has worked on this.

Though the original failing test/logs weren't recorded, it appears to be a failure of "should apply a new configuration to an existing RC":
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl.go#L541

@caesarxuchao There is a retry loop in the Patch handler in apiserver, but if resourceVersion is specified, the retry loop will always fail in the case of a conflict. (We should detect that and bail out, but I don't think we do.)

In the case of such a conflict, apply should diff again, create another patch, and try again.

I agree with @smarterclayton that it shouldn't retry forever. We should use the same retry limits we use for other commands.

@caesarxuchao
Member

@smarterclayton we have an infinite loop in GuaranteedUpdate (https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/etcd/etcd_helper.go#L610), maybe we should apply a retry limit there as well?
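
As an illustration of capping such a loop, a bounded read-modify-write helper might look like this (a hypothetical sketch, not the actual GuaranteedUpdate code; tryUpdate stands in for one optimistic update attempt):

```go
package main

import "fmt"

// cappedUpdateLoop is a hypothetical sketch of putting a retry limit on an
// otherwise unbounded optimistic-update loop. tryUpdate performs one
// read-modify-write attempt and reports whether it lost the
// optimistic-concurrency race.
func cappedUpdateLoop(tryUpdate func() (conflicted bool, err error), limit int) error {
	for attempt := 1; attempt <= limit; attempt++ {
		conflicted, err := tryUpdate()
		if err != nil {
			return err
		}
		if !conflicted {
			return nil // the write landed against the version we read
		}
		// someone else wrote in between; re-read and try again
	}
	return fmt.Errorf("still conflicting after %d attempts; giving up", limit)
}

func main() {
	attempts := 0
	err := cappedUpdateLoop(func() (bool, error) {
		attempts++
		return attempts < 3, nil // lose the race twice, then succeed
	}, 5)
	fmt.Println("result:", err, "after", attempts, "attempts")
}
```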

@caesarxuchao
Member

There is a retry loop in the Patch handler in apiserver, but if resourceVersion is specified, the retry loop will always fail in the case of a conflict. (We should detect that and bail out, but I don't think we do.)
In the case of such a conflict, apply should diff again, create another patch, and try again.

Agree. Just note that the test flake doesn't specify resourceVersion.

@bgrant0607 bgrant0607 added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 25, 2016
@pwittrock
Member

@adohe Since this is causing flaky test failures, it is something we should focus on. Is this something I should pick up?

@bgrant0607 bgrant0607 assigned pwittrock and unassigned bgrant0607 May 31, 2016
@bgrant0607 bgrant0607 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels May 31, 2016
@pwittrock
Member

FYI: @adohe has a PR out for this under review

@pwittrock pwittrock added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 8, 2016
@pwittrock
Member

Making this a P3 (not release-blocking) since it looks like we haven't seen many instances since Feb 2016.

k8s-github-robot pushed a commit that referenced this issue Jun 13, 2016
Automatic merge from submit-queue

kubectl apply retry stale resource version

```release-note
kubectl apply: retry applying a patch if a version conflict error is encountered
```


fixes #15493 
@pwittrock I just got my original implementation back, ptal.