PodGroup is constantly created and deleted after a TFJob succeeds or fails #1426
Comments
The PodGroup is created and deleted many times, which is weird. /cc @kubeflow/wg-training-leads
Maybe it is related to Expectations.
If the job status is succeeded or failed, we should actually skip reconciling. Does this problem happen only after a long time, like 24 hours? I am wondering why it starts reconciling the PodGroup again after that long.
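To illustrate the "skip reconciling when the job is finished" idea, here is a minimal sketch. It is not the operator's actual code (the operator is written in Go); `Job`, `is_finished`, `reconcile`, and the in-memory `pod_groups` set are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field

# Hypothetical terminal condition names, modeled on the states discussed here.
SUCCEEDED = "Succeeded"
FAILED = "Failed"


@dataclass
class Job:
    name: str
    # Condition types recorded for the job, e.g. ["Created", "Running"].
    conditions: list = field(default_factory=list)


def is_finished(job):
    """Return True if the job has reached a terminal condition."""
    return any(c in (SUCCEEDED, FAILED) for c in job.conditions)


def reconcile(job, pod_groups):
    """Sync the PodGroup for a job.

    pod_groups is a plain set standing in for cluster state (an assumption
    made to keep the sketch self-contained).
    """
    if is_finished(job):
        # Terminal job: do not create a PodGroup again.
        return "skipped"
    if job.name not in pod_groups:
        pod_groups.add(job.name)
        return "created"
    return "unchanged"
```

With a guard like this, a reconcile call for a finished job returns early and never recreates the PodGroup, so nothing would be left to delete again on the next pass.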
there is a job completed on
tf-operator log
pod/service/podgroup of tfjob v1-tensorflow-1006160359409
I upgraded from tf-operator v1.2.1 to training-operator v1.3.0, and the problem is resolved.
tf-operator version: v1.2.1
The TFJob completed successfully on 2021/09/29, but its PodGroup was still being constantly created and deleted on 2021/09/30.
Notably, we set ttlSecondsAfterFinished to 3 days.
Deleting the PodGroup by accessing the apiserver directly may take some time, leading to long scheduling latency for other jobs.
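For reference, a quick check of the dates above, as a sketch only. The time of day at which the job finished is not given in the report, so midnight is assumed here, and `ttl_expiry` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def ttl_expiry(finished_at, ttl_seconds):
    """Return the time at which a finished job becomes eligible for cleanup."""
    return finished_at + timedelta(seconds=ttl_seconds)


# Completion date from the report; the time of day is assumed to be midnight.
finished_at = datetime(2021, 9, 29)
ttl_seconds = 3 * 24 * 60 * 60  # ttlSecondsAfterFinished of 3 days

expiry = ttl_expiry(finished_at, ttl_seconds)
observed = datetime(2021, 9, 30)  # date the create/delete loop was seen

print(expiry.date())      # 2021-10-02
print(observed < expiry)  # True
```

So on 2021/09/30 the finished TFJob had not yet reached its TTL and presumably still existed in the cluster, which would explain why the operator could keep seeing it.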