
fix: requeue when expire time is not up yet #1614

Merged
merged 2 commits into kubeflow:master on Jun 16, 2022

Conversation

Member

@Garrybest Garrybest commented Jun 14, 2022

Signed-off-by: Garrybest garrybest@foxmail.com

What this PR does / why we need it:
When reconciling jobs, if the expire time is not up yet, the common.JobController requeues the object into a FakeWorkQueue, which does not work with Reconcilers.

I would like to check the TTL after the job has finished and, if necessary, requeue the object until the expire time is up.
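
For illustration, a minimal sketch of the requeue-after pattern the fix relies on: a controller-runtime Reconciler returns ctrl.Result{RequeueAfter: ...} so it is invoked again once the TTL expires, instead of adding the key to a work queue that no Reconciler consumes. The helper requeueAfterTTL and its arguments are hypothetical, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// requeueAfterTTL is a hypothetical helper: given a finished job's completion
// time and its TTL, it returns a ctrl.Result asking controller-runtime to run
// Reconcile again once the TTL expires.
func requeueAfterTTL(completionTime time.Time, ttl time.Duration) ctrl.Result {
	remaining := time.Until(completionTime.Add(ttl))
	if remaining > 0 {
		// Expire time is not up yet: requeue the object for later.
		return ctrl.Result{RequeueAfter: remaining}
	}
	// TTL already expired: no requeue needed; the finished job can be cleaned up now.
	return ctrl.Result{}
}

func main() {
	// Example: a job that finished 30s ago with a 60s TTL is reconciled again in ~30s.
	res := requeueAfterTTL(time.Now().Add(-30*time.Second), 60*time.Second)
	fmt.Printf("requeue after: %s\n", res.RequeueAfter.Round(time.Second))
}
```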

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1533

Checklist:

  • Docs included if any changes are user facing

@johnugeorge
Member

Can you add a test as well?

@coveralls

coveralls commented Jun 14, 2022

Pull Request Test Coverage Report for Build 2506345128

  • 30 of 60 (50.0%) changed or added relevant lines in 6 files are covered.
  • 12 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-1.4%) to 40.051%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/common/util/util.go | 19 | 20 | 95.0% |
| pkg/controller.v1/tensorflow/tfjob_controller.go | 5 | 8 | 62.5% |
| pkg/controller.v1/mpi/mpijob_controller.go | 3 | 8 | 37.5% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 3 | 8 | 37.5% |
| pkg/controller.v1/mxnet/mxjob_controller.go | 0 | 8 | 0.0% |
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 8 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controller.v1/mxnet/mxjob_controller.go | 1 | 0% |
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 1 | 0% |
| pkg/controller.v1/mpi/mpijob_controller.go | 10 | 76.93% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 2502090657 | -1.4% |
| Covered Lines | 2337 |
| Relevant Lines | 5835 |

💛 - Coveralls

@Garrybest
Member Author

Thanks for the reminder, I have added a unit test. BTW, I tested it in my own minikube cluster with TTL seconds set, and it works as expected.

@johnugeorge
Member

/assign @zw0610
/assign @gaocegege

```diff
@@ -162,6 +162,15 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
 		return ctrl.Result{}, err
 	}
 
+	t, err := util.DurationUntilExpireTime(&pytorchjob.Spec.RunPolicy, pytorchjob.Status)
+	if err != nil {
+		logrus.Warnf("Reconcile PyTorchJob error %v", err)
```
Member

We are using a mix of logger and logrus in the code. This needs to be cleaned up in a separate PR.

Member Author
Oh, exactly.

@zw0610
Member

zw0610 commented Jun 15, 2022

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jun 15, 2022

Signed-off-by: Garrybest <garrybest@foxmail.com>
Signed-off-by: Garrybest <garrybest@foxmail.com>
@johnugeorge
Member

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Garrybest, johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit fbcb6f3 into kubeflow:master Jun 16, 2022
@HeGaoYuan
Contributor

@Garrybest @zw0610 @johnugeorge

Don't you think this implementation is rather strange?

@HeGaoYuan
Contributor

It is just like a "patch".
RunPolicy.ActiveDeadlineSeconds has the same problem and has not been fixed yet:

```go
if pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds != nil {
	logger.Infof("Job with ActiveDeadlineSeconds will sync after %d seconds", *pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds)
	r.WorkQueue.AddAfter(pytorchjobKey, time.Duration(*pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds)*time.Second)
}
```
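
For comparison, a minimal sketch of how the same RequeueAfter approach could cover this case as well, assuming the pytorchjob variable and Reconcile context from the snippet above (this is not code from the PR):

```go
// Sketch only: return RequeueAfter from Reconcile instead of adding the key to
// the (fake) work queue, so the job is reconciled again once the deadline elapses.
if pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds != nil {
	d := time.Duration(*pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds) * time.Second
	return ctrl.Result{RequeueAfter: d}, nil
}
```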

Development

Successfully merging this pull request may close these issues.

Job TTLs not working
6 participants