Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry studyjobcontroller #411

Closed
wants to merge 2 commits into from

Conversation

toshiiw
Copy link
Contributor

@toshiiw toshiiw commented Feb 26, 2019

This hopefully fixes Update race condition from Reconcile.


This change is Reviewable

@toshiiw
Copy link
Contributor Author

toshiiw commented Feb 26, 2019

/hold

@k8s-ci-robot
Copy link

Hi @toshiiw. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@toshiiw
Copy link
Contributor Author

toshiiw commented Feb 26, 2019

/assign @YujiOshima

ret := reconcile.Result{}
var updateErr error
if update || err != nil {
updateErr := r.Update(context.TODO(), instance)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the root cause of the race condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I heard that some jobs can finish meanwhile and the StudyJob get updated.
I've never seen the actual failure though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this #166 (comment)
2018/09/13 06:44:09 Fail to Update StudyJob c1e236530496ef99 : Operation cannot be fulfilled on studyjobs.kubeflow.org "grid-example": the object has been modified; please apply your changes to the latest version and try again

When any updates worker or metrics-collector happen during reconcile is running, the update for studyjob will be failed.

@YujiOshima
Copy link
Contributor

/ok-to-test

@johnugeorge
Copy link
Member

@YujiOshima I was asking if we know the reason behind the error. Else, this error will be hidden with this fix.

@YujiOshima
Copy link
Contributor

I think we can't know the exactly reason.
But because of mutex I believe the reasons of this error are only completion of Worker, creation of job from MetricsCollector cronjob, and Fail of Worker.
Since these updates are not controled by studyjob operator.
And they can handle at the next reconcile.

@YujiOshima
Copy link
Contributor

The reason why we need this is avoiding endless retry.
When we use a heavy suggestion service which take longer than one minute for creating a suggestion, the metrics collector must create a new job during studyjob operator wait for the suggestion.
Then the studyjob will fail to update and retry.
The studyjob won't update any more.

@johnugeorge
Copy link
Member

I think we can't know the exactly reason.
But because of mutex I believe the reasons of this error are only completion of Worker, creation of job from MetricsCollector cronjob, and Fail of Worker.

@YujiOshima Can you please explain more on this situation? Why will this error happen during these conditions? Error says that the object got modified before the update call.

Since these updates are not controled by studyjob operator.
And they can handle at the next reconcile.

@YujiOshima
Copy link
Contributor

@johnugeorge When any updates on a resource that is corresponding to a studyjob, the operator can't update the studyjob.

A studyjob named "A" has a worker "A-w1" and a metrics collector cron-job "A-mc1".
First, The operator get instance by r.Get here
And the operator waiting for some process (e.g. GetSuggestion).
During the operator waiting, "A-mc1" create a job. Since "A-mc1" is a cron-job, it will create a job every minute.
At this time, the studyjob CR is modified from the operator getted.
So the operator get error when call r.Update here

@johnugeorge
Copy link
Member

At this time, the studyjob CR is modified from the operator getted.
@YujiOshima In the above example, i didn't understand how studyJobCR is getting modified after mc job is created. Can you point out to the code? AFAIK, error happens when update is called on a object that already got modified from the initial Get

@toshiiw toshiiw force-pushed the retry-studyjobcontroller branch from fc82e20 to 51948ea Compare March 5, 2019 09:04
@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: yujioshima

If they are not already assigned, you can assign the PR to them by writing /assign @yujioshima in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@toshiiw
Copy link
Contributor Author

toshiiw commented Mar 5, 2019

/hold cancel
Seems to work in my local env.

@johnugeorge
Copy link
Member

Related: kubernetes-sigs/controller-runtime#343

Since we are watching on job pods, there are higher chances of Reconcile functions called continuously. Immediate Get after Update can have stale entries. This will cause the next Update to fail. However, this should not be a problem since we can handle this during the next Reconcile.

@YujiOshima
Copy link
Contributor

@johnugeorge Thanks. So we should not believe values in CR, believe values in DB.
I think it would be a problem for currently Katib. For example, when fail to update after getSuggestion, the suggestionCount will be inconsistent between DB and CR.
Should we fix that bug and other related bugs now or think about this after v2API? WDYT?

@toshiiw
Copy link
Contributor Author

toshiiw commented Mar 6, 2019

For example, when fail to update after getSuggestion, the suggestionCount will be inconsistent between DB and CR.

This patch treats the suggestionCount in the DB as the single source of the truth. Inconsistency with this should be resolved in the next reconcile run.

@YujiOshima
Copy link
Contributor

@toshiiw When the operator get stale version of CR, operator won't set false to nextSuggestionSchedule here even the GetSuggestion is alrealy called in previous Reconcile.
Then GetSuggestion is called duplicatedly and create workers but update will be fail. So the workers will be orphan.

@johnugeorge
Copy link
Member

It is not accurate to say that values in CR should not be believed. From the k8s philosophy, We have to make sure that design is level triggered based than edge triggered based. If it is a level triggered design, reconcile function will make sure that actual state moves towards the desired state.

i think, we should think after v2API because there will be significant code changes. We have to ensure this.

@toshiiw
Copy link
Contributor Author

toshiiw commented Mar 6, 2019

@johnugeorge Agreed. The current Reconcile() is written somewhat in a edge-triggered way. We might need to put some worker related info in CR or DB to ensure a level-triggered design works.

@gaocegege
Copy link
Member

/close

Thanks for the PR. But we already dropped v1alpha1. We can't merge it since v1alpha1 is deprecated.

@k8s-ci-robot
Copy link

@gaocegege: Closed this PR.

In response to this:

/close

Thanks for the PR. But we already dropped v1alpha1. We can't merge it since v1alpha1 is deprecated.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants