
fix PyTorchJob status inaccuracy when task replica scales down #1593

Merged
merged 1 commit into kubeflow:master on Jun 10, 2022
Conversation

PeterChg
Contributor

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@aws-kf-ci-bot
Contributor

Hi @PeterChg. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@PeterChg
Contributor Author

/assign @Jeffwan

@coveralls

coveralls commented May 18, 2022

Pull Request Test Coverage Report for Build 2473439800

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • 15 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.01%) to 36.924%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 15 | 77.05%

Totals Coverage Status
Change from base Build 2471254094: +0.01%
Covered Lines: 2307
Relevant Lines: 6248

💛 - Coveralls

@gaocegege
Member

/ok-to-test

@gaocegege
Member

Could you please explain why the PyTorch operator needs such a new check?

@PeterChg
Contributor Author

PeterChg commented May 18, 2022

Could you please explain why the PyTorch operator needs such a new check?

When the HPA scales down the task replicas of a PyTorchJob, the PyTorch operator terminates the redundant pods. Those pods report the Failed phase before they disappear, which causes the PyTorchJob itself to be marked as failed. So we need to ignore the failure status caused by proactively terminated pods.
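To make the idea concrete, here is a rough Go sketch (not the exact change in this PR; the helper name and the replica-index label lookup are assumptions): when tallying failed pods for the job status, skip pods whose replica index is at or above the currently desired replica count, since those pods were terminated intentionally during scale-down.

```go
package sketch

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// countRealFailures is a hypothetical helper: it tallies Failed pods for the
// job status but skips pods whose replica index is at or above the desired
// replica count, because those pods were terminated on purpose by a
// scale-down and only transiently report the Failed phase.
func countRealFailures(pods []*corev1.Pod, desiredReplicas int) int {
	failed := 0
	for _, pod := range pods {
		if pod.Status.Phase != corev1.PodFailed {
			continue
		}
		// Assumes the replica index is exposed via this label.
		idx, err := strconv.Atoi(pod.Labels["training.kubeflow.org/replica-index"])
		if err == nil && idx >= desiredReplicas {
			// Failed only because of proactive termination during
			// scale-down; do not count it against the job.
			continue
		}
		failed++
	}
	return failed
}
```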

@gaocegege
Member

SGTM

@PeterChg
Contributor Author

PeterChg commented May 19, 2022

/cc @Jeffwan

Excuse me, is there something wrong with the kubeflow-training-operator-presubmit test environment configuration?
The following error messages seem to have nothing to do with my changes.

2022-05-19 02:07:33 [✖] AWS::EC2::RouteTable/PrivateRouteTableUSWEST2B: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EC2::InternetGateway/InternetGateway: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EC2::RouteTable/PrivateRouteTableUSWEST2C: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EC2::NatGateway/NATGateway: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::IAM::Policy/PolicyELBPermissions: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::IAM::Policy/PolicyCloudWatchMetrics: CREATE_FAILED – "Resource creation cancelled"
2022-05-19 02:07:33 [✖] AWS::EKS::Cluster/ControlPlane: CREATE_FAILED – "Resource handler returned message: "unsupported Kubernetes version (Service: Eks, Status Code: 400, Request ID: 740cefd2-d290-4dcd-9109-03558148946e)" (RequestToken: 1bea15d8-3e3f-12f0-c2d1-ab02f0637a04, HandlerErrorCode: InvalidRequest)"

@PeterChg
Contributor Author

/retest

@aws-kf-ci-bot
Contributor

@PeterChg: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
kubeflow-training-operator-presubmit | 305349b | link | /test kubeflow-training-operator-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@johnugeorge
Member

Can you rebase?

@PeterChg
Contributor Author

PeterChg commented Jun 6, 2022

/retest

@google-oss-prow google-oss-prow bot added size/XXL and removed size/XS labels Jun 6, 2022
@PeterChg
Contributor Author

PeterChg commented Jun 6, 2022

Can you rebase?

Done.

@google-oss-prow google-oss-prow bot added size/XS and removed size/XXL labels Jun 6, 2022
@johnugeorge
Member

What about other frameworks?

/cc @gaocegege
/cc @zw0610

@google-oss-prow google-oss-prow bot requested a review from gaocegege June 6, 2022 05:07
Member

@gaocegege gaocegege left a comment

@zw0610 Should we make the change in MPIJob?

@zw0610
Member

zw0610 commented Jun 6, 2022

@gaocegege I believe so.

@PeterChg
Contributor Author

PeterChg commented Jun 6, 2022

What about other frameworks?

/cc @gaocegege /cc @zw0610

This bug occurs when the number of pods is proactively scaled up or down, as in the HPA scenario. At the moment, PyTorchJob is the only job type where this situation can arise.
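As a rough illustration of that distinction (hypothetical, simplified types; the real training-operator API differs in detail), the relaxed failure handling would be applied only to jobs that carry an elastic policy, so traditional PyTorchJobs keep the strict behavior:

```go
package sketch

// ElasticPolicy and PyTorchJobSpec are simplified stand-ins for the real
// training-operator types, used here only to illustrate the guard.
type ElasticPolicy struct {
	MinReplicas *int32
	MaxReplicas *int32
}

type PyTorchJobSpec struct {
	ElasticPolicy *ElasticPolicy
}

// shouldIgnoreScaleDownFailures applies the relaxed failure handling only to
// elastic jobs (e.g. ones an HPA can shrink); traditional jobs keep the
// strict "any Failed pod fails the job" semantics.
func shouldIgnoreScaleDownFailures(spec PyTorchJobSpec) bool {
	return spec.ElasticPolicy != nil
}
```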

@gaocegege
Member

/approve
/lgtm

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, PeterChg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Member

/hold

@gaocegege
Member

@zw0610 @johnugeorge

Could we merge this PR?

@PeterChg

Thanks for your contribution! 🎉 👍

@zw0610
Member

zw0610 commented Jun 7, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

@PeterChg
Contributor Author

PeterChg commented Jun 7, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

A traditional PyTorchJob will not encounter a scenario like this.

@PeterChg
Contributor Author

PeterChg commented Jun 7, 2022

MPIJob also supports shrinking replicas, but we can do that fix in another pull request.

I will submit another pull request.

@zw0610
Member

zw0610 commented Jun 7, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

A traditional PyTorchJob will not encounter a scenario like this.

Great! In that case, I think we can merge this pull request. @gaocegege

@PeterChg PeterChg closed this Jun 8, 2022
@PeterChg PeterChg reopened this Jun 10, 2022
@PeterChg
Contributor Author

PeterChg commented Jun 10, 2022

Does this pull request assume all PyTorchJobs are elastic? I'm wondering what the expected behavior is for a traditional PyTorchJob if the replicas are scaled down.

A traditional PyTorchJob will not encounter a scenario like this.

Great! In that case, I think we can merge this pull request. @gaocegege

@gaocegege
Any other concerns? This fix has been running in our cluster for weeks, covering a variety of job types, including both elastic and traditional PyTorchJobs. So far everything seems normal!

@johnugeorge
Member

Can you rebase and resolve merge conflicts?

@johnugeorge
Member

/lgtm

Member

@gaocegege gaocegege left a comment

/lgtm

Thanks for your contribution! 🎉 👍

@PeterChg
Contributor Author

/hold

Excuse me, will this label affect whether the PR can finally be merged?

@gaocegege
Member

/hold cancel

Sorry, my bad!

@google-oss-prow google-oss-prow bot merged commit 2e87235 into kubeflow:master Jun 10, 2022