tf_job_client blocks forever #606
/priority p1
jlewi
added a commit
to jlewi/k8s
that referenced
this issue
May 25, 2018
…ect. * TFJob wait should run the request asyncronously so we don't end up blocking forever. Fix kubeflow#606
This was referenced May 25, 2018
yph152
pushed a commit
to yph152/tf-operator
that referenced
this issue
Jun 18, 2018
…ect (kubeflow#607) * TFJob client should not block forever trying to get the namespace object. * TFJob wait should run the request asyncronously so we don't end up blocking forever. Fix kubeflow#606 * Fix lint.
jetmuffin
pushed a commit
to jetmuffin/tf-operator
that referenced
this issue
Jul 9, 2018
…ect (kubeflow#607) * TFJob client should not block forever trying to get the namespace object. * TFJob wait should run the request asyncronously so we don't end up blocking forever. Fix kubeflow#606 * Fix lint.
In kubeflow/kubeflow#866 I observed test_runner.py seemingly hang while waiting for TFJobs to complete.
Here are some sample logs from gubernator and the workflow.
Since the code isn't timing out while waiting for the TFJob, and we don't print any log messages indicating that we are polling for job status, my conjecture is that we are hanging here on the HTTP request to the K8s API server.
This makes a synchronous HTTP call, and I'm guessing it's blocking forever. We could potentially fix this by making it an async call by passing it "async=True"; this would return a thread object which we could then use to enforce a timeout.
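To make the idea concrete, here is a minimal sketch of enforcing a timeout on a call that might block forever. It does not use the Kubernetes client at all: `call_with_timeout`, `RequestTimeoutError`, and `slow_request` are hypothetical names invented for illustration, and `slow_request` merely stands in for the HTTP request to the API server. The sketch runs the call in a daemon thread and gives up after a deadline, which is roughly what waiting on the thread object returned by an async request would let us do.

```python
import threading
import time


class RequestTimeoutError(Exception):
    """Raised when the wrapped call does not finish before the deadline."""


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func(*args, **kwargs) in a worker thread; wait at most timeout_seconds.

    Hypothetical helper, not part of tf_job_client or the Kubernetes client.
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as e:  # hand the error back to the calling thread
            outcome["error"] = e

    # A daemon thread does not keep the process alive if the call never returns.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)

    if worker.is_alive():
        # The call is still blocked; stop waiting instead of hanging forever.
        raise RequestTimeoutError(
            "call did not finish within %s seconds" % timeout_seconds)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def slow_request(seconds):
    """Stand-in for a blocking HTTP request (assumption for the demo)."""
    time.sleep(seconds)
    return "done"
```

A caller would wrap the blocking request, for example `call_with_timeout(slow_request, 0.05, 0.5)`, and treat `RequestTimeoutError` as a signal to log and retry or fail the test. Note that the worker thread is abandoned rather than cancelled, so this bounds how long the caller waits but does not free the underlying connection.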