
fix PyTorchJob status inaccuracy when task replica scales down #1593

Merged
merged 1 commit into kubeflow:master on Jun 10, 2022
Conversation

PeterChg
Contributor

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@aws-kf-ci-bot
Contributor

Hi @PeterChg. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@PeterChg
Contributor Author

/assign @Jeffwan

@coveralls

coveralls commented May 18, 2022

Pull Request Test Coverage Report for Build 2473439800

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • 15 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.01%) to 36.924%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 15 | 77.05%

Totals Coverage Status
Change from base Build 2471254094: +0.01%
Covered Lines: 2307
Relevant Lines: 6248

💛 - Coveralls

@gaocegege
Member

/ok-to-test

@gaocegege
Member

Could you please explain why the PyTorch operator needs such a new check?

@PeterChg
Contributor Author

PeterChg commented May 18, 2022

Could you please explain why the PyTorch operator needs such a new check?

When the HPA scales down the task replicas of a PyTorchJob, the PyTorch operator terminates the redundant pods. Those pods report the Failed phase before they disappear, which causes the PyTorchJob itself to be marked as failed. So we need to ignore the failure status caused by proactively terminated pods.
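To make the idea concrete, here is a rough Go sketch (not the exact change in this PR; the helper name and the replica-index label lookup are assumptions): when tallying failed pods for the job status, skip pods whose replica index is at or above the currently desired replica count, since those pods were terminated intentionally during scale-down.

```go
package sketch

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// countRealFailures is a hypothetical helper: it tallies Failed pods for the
// job status but skips pods whose replica index is at or above the desired
// replica count, because those pods were terminated on purpose by a
// scale-down and only transiently report the Failed phase.
func countRealFailures(pods []*corev1.Pod, desiredReplicas int) int {
	failed := 0
	for _, pod := range pods {
		if pod.Status.Phase != corev1.PodFailed {
			continue
		}
		// Assumes the replica index is exposed via this label.
		idx, err := strconv.Atoi(pod.Labels["training.kubeflow.org/replica-index"])
		if err == nil && idx >= desiredReplicas {
			// Failed only because of proactive termination during
			// scale-down; do not count it against the job.
			continue
		}
		failed++
	}
	return failed
}
```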

@gaocegege
Member

SGTM

@PeterChg
Contributor Author

PeterChg commented May 19, 2022

/cc @Jeffwan

Excuse me, is there something wrong with the kubeflow-training-operator-presubmit test environment configuration?
The following error messages seem to have nothing to do with my changes.

2022-05-19 02:07:33 [✖] AWS::EC2::RouteTable/PrivateRouteTableUSWEST2B: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EC2::InternetGateway/InternetGateway: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EC2::RouteTable/PrivateRouteTableUSWEST2C: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EC2::NatGateway/NATGateway: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::IAM::Policy/PolicyELBPermissions: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::IAM::Policy/PolicyCloudWatchMetrics: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EKS::Cluster/ControlPlane: CREATE_FAILED – "Resource handler returned message: "unsupported Kubernetes version (Service: Eks, Status Code: 400, Request ID: 740cefd2-d290-4dcd-9109-03558148946e)" (RequestToken: 1bea15d8-3e3f-12f0-c2d1-ab02f0637a04, HandlerErrorCode: InvalidRequest)"

@PeterChg
Contributor Author

/retest

@aws-kf-ci-bot
Contributor

@PeterChg: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
kubeflow-training-operator-presubmit | 305349b | link | /test kubeflow-training-operator-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@johnugeorge
Member

Can you rebase?

@PeterChg
Contributor Author

PeterChg commented Jun 6, 2022

/retest

@google-oss-prow google-oss-prow bot added size/XXL and removed size/XS labels Jun 6, 2022
@PeterChg
Contributor Author

PeterChg commented Jun 6, 2022

Can you rebase?

Done.

@google-oss-prow google-oss-prow bot added size/XS and removed size/XXL labels Jun 6, 2022
@johnugeorge
Member

What about other frameworks?

/cc @gaocegege
/cc @zw0610

@google-oss-prow google-oss-prow bot requested a review from gaocegege June 6, 2022 05:07
Member

@gaocegege gaocegege left a comment

@zw0610 Should we make the change in MPIJob?

@zw0610
Member

zw0610 commented Jun 6, 2022

@gaocegege I believe so.

@PeterChg
Contributor Author

PeterChg commented Jun 6, 2022

What about other frameworks?

/cc @gaocegege /cc @zw0610

This bug occurs when the number of pods is proactively scaled up or down, as in the HPA scenario. At the moment, PyTorchJob is the only job type where this situation can arise.
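As a rough illustration of that distinction (hypothetical, simplified types; the real training-operator API differs in detail), the relaxed failure handling would be applied only to jobs that carry an elastic policy, so traditional PyTorchJobs keep the strict behavior:

```go
package sketch

// ElasticPolicy and PyTorchJobSpec are simplified stand-ins for the real
// training-operator types, used here only to illustrate the guard.
type ElasticPolicy struct {
	MinReplicas *int32
	MaxReplicas *int32
}

type PyTorchJobSpec struct {
	ElasticPolicy *ElasticPolicy
}

// shouldIgnoreScaleDownFailures applies the relaxed failure handling only to
// elastic jobs (e.g. ones an HPA can shrink); traditional jobs keep the
// strict "any Failed pod fails the job" semantics.
func shouldIgnoreScaleDownFailures(spec PyTorchJobSpec) bool {
	return spec.ElasticPolicy != nil
}
```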

@gaocegege
Member

/approve
/lgtm

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, PeterChg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Member

/hold

@gaocegege
Member

@zw0610 @johnugeorge

Could we merge this PR?

@PeterChg

Thanks for your contribution! 🎉 👍

@zw0610
Member

zw0610 commented Jun 7, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

@PeterChg
Contributor Author

PeterChg commented Jun 7, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

A traditional PyTorchJob will not encounter a scenario like this.

@PeterChg
Contributor Author

PeterChg commented Jun 7, 2022

MPIJob also supports shrinking replicas, but we can do that fix in another pull request.

I will submit another pull request.

@zw0610
Member

zw0610 commented Jun 7, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

A traditional PyTorchJob will not encounter a scenario like this.

Great! In that case, I think we can merge this pull request. @gaocegege

@PeterChg PeterChg closed this Jun 8, 2022
@PeterChg PeterChg reopened this Jun 10, 2022
@PeterChg
Contributor Author

PeterChg commented Jun 10, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

A traditional PyTorchJob will not encounter a scenario like this.

Great! In that case, I think we can merge this pull request. @gaocegege

@gaocegege
Any other concerns? This fix has been running in our cluster for weeks, covering a variety of job types, including both elastic and traditional PyTorchJobs. So far everything seems normal!

@johnugeorge
Member

Can you rebase and resolve merge conflicts?

@johnugeorge
Member

/lgtm

Member

@gaocegege gaocegege left a comment

/lgtm

Thanks for your contribution! 🎉 👍

@PeterChg
Contributor Author

/hold

Excuse me, will this label affect whether the PR can finally be merged?

@gaocegege
Member

/hold cancel

Sorry, my bad!

@google-oss-prow google-oss-prow bot merged commit 2e87235 into kubeflow:master Jun 10, 2022