GCP deployment manager test handle internal errors #833
Comments
Encountered this again.
gcloud --project=kubeflow-ci deployment-manager operations describe operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e
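When the describe output is noisy, narrowing it to the error field helps; the projection below is a sketch using standard gcloud --format syntax:

```
gcloud --project=kubeflow-ci deployment-manager operations describe \
  operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e \
  --format="yaml(error)"
```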
The deployment:
Let's try reissuing the delete.
The error shown in the Pantheon UI is:
I added the ABANDON deletePolicy to all the K8s resources. Now I can delete all the K8s resources and the node pools, but I get the following error:
I tried setting the deletePolicy on that resource as well, but it looks like it didn't work.
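For comparison, the deployment-level knob is the --delete-policy flag on gcloud; a sketch with a hypothetical deployment name:

```
gcloud --project=kubeflow-ci deployment-manager deployments delete \
  my-kubeflow --delete-policy=ABANDON --quiet
```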
* This config creates the K8s resources needed to run the bootstrapper.
* Enable the ResourceManager API; this is used to get IAM policies.
* Add IAM roles to the cloudservices account. This is needed so that the deployment manager has sufficient RBAC permissions to do what it needs to.
* Delete initialNodeCount and just make the default node pool a 1 CPU node pool.
* The bootstrapper isn't running successfully; it looks like it's trying to create a pytorch component, but it's using an older version of the registry which doesn't include the pytorch operator.
* Set the delete policy on K8s resources to ABANDON, otherwise we get internal errors.
* We can use actions to enable APIs; then we won't try to delete the API when the deployment is deleted, which causes errors (a sketch of this follows the list).

fix kubeflow#833
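A minimal sketch of the actions approach, assuming the gcp-types/servicemanagement-v1 provider is available; the resource name and project here are placeholders:

```yaml
resources:
- name: enable-resource-manager-api
  # An action runs at create/update time but leaves no resource behind,
  # so deleting the deployment no longer tries to disable the API.
  action: gcp-types/servicemanagement-v1:servicemanagement.services.enable
  properties:
    consumerId: project:kubeflow-ci
    serviceName: cloudresourcemanager.googleapis.com
```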
So I'm observing two failure modes on #823:
Happened again on presubmit for #833:
Looks like the deployment was successfully deleted.
The resource not found errors are caused because we delete the type providers before the K8s resources, so we get errors when we try to delete the K8s resources. We can fix this by using a $ref to refer to the type provider in the type field of the K8s resources, which makes the dependency (and therefore the delete ordering) explicit.
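A minimal sketch of the $ref fix, with hypothetical type provider and resource names (the collection path is illustrative):

```yaml
resources:
- name: kubeflow-namespace
  # $(ref...) makes Deployment Manager record an explicit dependency on
  # the type provider, so the provider is deleted only after this resource.
  type: kubeflow-ci/$(ref.k8s-type-provider.name):/api/v1/namespaces
  properties:
    apiVersion: v1
    kind: Namespace
    metadata:
      name: kubeflow
```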
Create python scripts for deploying Kubeflow on GCP via deployment manager. (#866)
* The scripts replace our bash commands.
* For teardown we want to add retries to better handle INTERNAL_ERRORS with deployment manager that are causing the test to be flaky. Related to #836 (verify Kubeflow deployed correctly with deployment manager).
* Fix resource_not_found errors in delete (#833)
* The not found error was due to the type providers for K8s resources being deleted before the corresponding K8s resources. So the subsequent delete of the K8s resource would fail because the type provider did not exist.
* We fix this by using a $ref to refer to the type provider in the type field of K8s resources.
* deletePolicy can't be set per resource.
* Autoformat jsonnet.
#866 should add retries that handle internal errors.
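A minimal sketch of such a retry loop, assuming teardown shells out to gcloud; the attempt count and backoff are illustrative:

```python
import subprocess
import time

def delete_deployment(project, name, attempts=5, backoff=60):
    """Delete a deployment, retrying on transient INTERNAL_ERRORs."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            ["gcloud", "--project=" + project, "deployment-manager",
             "deployments", "delete", name, "--quiet"],
            capture_output=True, text=True)
        if result.returncode == 0:
            return
        # Deployment manager intermittently fails teardown with
        # INTERNAL_ERROR; retry those, surface anything else immediately.
        if "INTERNAL_ERROR" not in result.stderr or attempt == attempts:
            raise RuntimeError("deployments delete failed: " + result.stderr)
        time.sleep(backoff)
```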
Signed-off-by: Ce Gao <gaoce@caicloud.io>
* Remove 1.15 selectors for Seldon
* Add kubeflow namespace selectors to webhook configuration
* Change seldon webhook conf to matchLabels
In #823 we observed test flakes during teardown due to internal errors.
We should make the tests more robust to these flakes.
/priority p1