-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry studyjobcontroller #411
Conversation
/hold |
Hi @toshiiw. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @YujiOshima |
ret := reconcile.Result{} | ||
var updateErr error | ||
if update || err != nil { | ||
updateErr := r.Update(context.TODO(), instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the root cause of the race condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I heard that some jobs can finish meanwhile and the StudyJob get updated.
I've never seen the actual failure though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like this #166 (comment)
2018/09/13 06:44:09 Fail to Update StudyJob c1e236530496ef99 : Operation cannot be fulfilled on studyjobs.kubeflow.org "grid-example": the object has been modified; please apply your changes to the latest version and try again
When any updates worker or metrics-collector happen during reconcile is running, the update for studyjob will be failed.
/ok-to-test |
@YujiOshima I was asking if we know the reason behind the error. Else, this error will be hidden with this fix. |
I think we can't know the exactly reason. |
The reason why we need this is avoiding endless retry. |
@YujiOshima Can you please explain more on this situation? Why will this error happen during these conditions? Error says that the object got modified before the update call.
|
@johnugeorge When any updates on a resource that is corresponding to a studyjob, the operator can't update the studyjob. A studyjob named "A" has a worker "A-w1" and a metrics collector cron-job "A-mc1". |
|
fc82e20
to
51948ea
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: If they are not already assigned, you can assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
Related: kubernetes-sigs/controller-runtime#343 Since we are watching on job pods, there are higher chances of Reconcile functions called continuously. Immediate Get after Update can have stale entries. This will cause the next Update to fail. However, this should not be a problem since we can handle this during the next Reconcile. |
@johnugeorge Thanks. So we should not believe values in CR, believe values in DB. |
This patch treats the suggestionCount in the DB as the single source of the truth. Inconsistency with this should be resolved in the next reconcile run. |
It is not accurate to say that values in CR should not be believed. From the k8s philosophy, We have to make sure that design is level triggered based than edge triggered based. If it is a level triggered design, reconcile function will make sure that actual state moves towards the desired state. i think, we should think after v2API because there will be significant code changes. We have to ensure this. |
@johnugeorge Agreed. The current Reconcile() is written somewhat in a edge-triggered way. We might need to put some worker related info in CR or DB to ensure a level-triggered design works. |
/close Thanks for the PR. But we already dropped v1alpha1. We can't merge it since v1alpha1 is deprecated. |
@gaocegege: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This hopefully fixes Update race condition from Reconcile.
This change is