Presubmit unhealthy? Networking issues #869
Comments
#866 is one PR where I observed a number of test flakes. I'm observing a variety of failure modes.
Failure #1: TFJob client appears to hang trying to contact the K8s API server to get job status (kubeflow/training-operator#606).
Failure #2: TFServing fails
Both of these are suggestive of some form of networking issue.
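For failure #1, one way to keep a presubmit from hanging indefinitely is to bound each status call and the overall wait, so a stuck connection surfaces as an explicit timeout. The following is only a minimal sketch: `get_status` is a hypothetical callable standing in for whatever the TFJob client does to fetch job status, and the function names and the status strings `"Succeeded"` and `"Failed"` are assumptions for illustration, not the actual client API.

```python
import threading
import time


class StatusTimeout(Exception):
    """Raised when a status call or the overall wait exceeds its time limit."""


def call_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread and give up after timeout_s seconds.

    The worker thread is not killed on timeout; it is only abandoned.
    Marking it as a daemon keeps it from blocking interpreter exit.
    """
    result = {}

    def target():
        try:
            result["value"] = fn()
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise StatusTimeout(f"no response within {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def wait_for_job(get_status, deadline_s, per_call_timeout_s, poll_interval_s=1.0):
    """Poll get_status (hypothetical) until a terminal status or the deadline."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        status = call_with_timeout(get_status, per_call_timeout_s)
        # Terminal status names are assumed for this sketch.
        if status in ("Succeeded", "Failed"):
            return status
        time.sleep(poll_interval_s)
    raise StatusTimeout(f"job not finished within {deadline_s}s")
```

With something like this, a hung API call would show up in the test log as a `StatusTimeout` rather than as a job that never returns, which makes it easier to tell a networking flake apart from a real test failure.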
It looks like the kube-dns pods might be having some issues
Although I guess it's possible that this is yet another problem caused by networking issues.
I'm going to try deleting all the VMs in the cluster. They should get recreated and hopefully when they get rescheduled any transient issues will be addressed.
Thanks Jeremy, I think it's good now.
https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-presubmit