tf_job_client blocks forever #606

Closed
jlewi opened this issue May 25, 2018 · 1 comment

jlewi commented May 25, 2018

In kubeflow/kubeflow#866 I observed test_runner.py seemingly hanging while waiting for TFJobs to complete.

Here are some sample logs:

INFO|2018-05-25T03:17:37|/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-minikube-866-b0e8431-1663-0780/src/kubeflow/tf-operator/py/test_runner.py|276| Created job simple-tfjob-minikube in namespaces kubeflow-presubmit-kubeflow-e2e-minikube-866-b0e8431-1663-0780
INFO|2018-05-25T03:17:37|/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-minikube-866-b0e8431-1663-0780/src/kubeflow/tf-operator/py/tf_job_client.py|96| Job simple-tfjob-minikube in namespace kubeflow-presubmit-kubeflow-e2e-minikube-866-b0e8431-1663-0780; uid=2c9d85e5-5fca-11e8-8d50-42010a8e000e; phase=Creating, state=Running,
INFO|2018-05-25T03:18:07|/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-minikube-866-b0e8431-1663-0780/src/kubeflow/tf-operator/py/tf_job_client.py|96| Job simple-tfjob-minikube in namespace kubeflow-presubmit-kubeflow-e2e-minikube-866-b0e8431-1663-0780; uid=2c9d85e5-5fca-11e8-8d50-42010a8e000e; phase=Running, state=Running,

From gubernator and workflow

Since the code isn't timing out while waiting for the TFJob, and we don't print log messages indicating that we are polling for job status, my conjecture is that we are hanging here, in the HTTP request to the K8s API server.

    # Synchronous GET of the TFJob custom resource via the K8s CustomObjects API.
    results = crd_api.get_namespaced_custom_object(
      TF_JOB_GROUP, TF_JOB_VERSION, namespace, TF_JOB_PLURAL, name)

This is making a synchronous HTTP call and I'm guessing it's blocking forever. We could potentially fix this by making it an async call by passing "async=True"; that would return a thread object which we could then use to enforce a timeout.
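A minimal sketch of that idea, assuming the kubernetes Python client's async request support (the keyword was spelled async at the time and is async_req in newer client releases); the helper name, constants, and timeout values below are illustrative, not the actual values in tf_job_client.py. The async call returns a thread-backed ApplyResult whose get() accepts a timeout:

    import logging
    from multiprocessing import TimeoutError

    TF_JOB_GROUP = "kubeflow.org"
    TF_JOB_VERSION = "v1alpha2"   # assumed; use whatever tf_job_client.py defines
    TF_JOB_PLURAL = "tfjobs"
    TIMEOUT_SECONDS = 120         # hypothetical per-request timeout

    def get_tfjob_with_timeout(crd_api, namespace, name):
      """GET the TFJob custom object, but give up after TIMEOUT_SECONDS."""
      # async_req=True (async=True in older clients) returns immediately with a
      # thread-backed ApplyResult instead of blocking on the HTTP request.
      thread = crd_api.get_namespaced_custom_object(
        TF_JOB_GROUP, TF_JOB_VERSION, namespace, TF_JOB_PLURAL, name,
        async_req=True)
      try:
        # get() blocks for at most TIMEOUT_SECONDS, then raises TimeoutError.
        return thread.get(TIMEOUT_SECONDS)
      except TimeoutError:
        logging.error("Timed out getting TFJob %s in namespace %s.", name, namespace)
        raise

Usage would be the same as today (crd_api = kubernetes.client.CustomObjectsApi() after loading kube config), with the caller deciding whether a timeout should fail the test or just trigger another poll.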


jlewi commented May 25, 2018

/priority p1

jlewi added a commit to jlewi/k8s that referenced this issue May 25, 2018
…ect.

* TFJob wait should run the request asynchronously so we don't end up blocking
  forever.

Fix kubeflow#606
gaocegege pushed a commit that referenced this issue May 25, 2018
…ect (#607)

* TFJob client should not block forever trying to get the namespace object.

* TFJob wait should run the request asynchronously so we don't end up blocking
  forever.

Fix #606

* Fix lint.
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
…ect (kubeflow#607)

* TFJob client should not block forever trying to get the namespace object.

* TFJob wait should run the request asynchronously so we don't end up blocking
  forever.

Fix kubeflow#606

* Fix lint.
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018
…ect (kubeflow#607)

* TFJob client should not block forever trying to get the namespace object.

* TFJob wait should run the request asynchronously so we don't end up blocking
  forever.

Fix kubeflow#606

* Fix lint.