Create a python script to deploy Kubeflow on GCP via deployment manager. #866
Conversation
/retest
testing/deploy_kubeflow_gcp.py (outdated)

    test_suite.run()

    if __name__ == "__main__":
      logging.basicConfig(
test_helper.init() takes care of logging initialization
https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/test_helper.py#L159
@@ -0,0 +1,4 @@
INFO|2018-05-23T18:14:50|deploy_kubeflow_gcp.py:118| Creating deployment jlewi-kubeflow-test3 in project cloud-ml-dev
Checked in by accident?
Most recent failure was a flake in gke_e2e trying to contact github.com
Test flake looks unrelated
So I think this is ready for review.
/assign @kunmingg
/retest
More random test failures
/retest
Create a python script to deploy Kubeflow on GCP via deployment manager.

* The script replaces our bash commands.
* For teardown we add retries to better handle INTERNAL_ERRORS with deployment manager that are causing the test to be flaky.

Related to kubeflow#836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (kubeflow#833)
* The not found error was due to the type providers for K8s resources being deleted before the corresponding K8s resources, so the subsequent delete of the K8s resource would fail because the type provider no longer existed.
* We fix this by using a $ref to refer to the type provider in the type field of K8s resources.
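The teardown retries mentioned in the commit message could look roughly like the sketch below. delete_with_retries and the INTERNAL_ERROR substring match are illustrative assumptions, not the PR's actual implementation.

```python
import time

def delete_with_retries(delete_fn, retries=3, backoff_seconds=5):
  """Call delete_fn, retrying when deployment manager reports an internal error.

  delete_fn is a hypothetical stand-in for the actual deployment-manager
  delete call; any exception whose message contains INTERNAL_ERROR is
  treated as transient and retried, everything else is re-raised.
  """
  for attempt in range(1, retries + 1):
    try:
      return delete_fn()
    except Exception as e:  # pylint: disable=broad-except
      if "INTERNAL_ERROR" not in str(e) or attempt == retries:
        raise
      # Back off linearly before retrying the flaky delete.
      time.sleep(backoff_seconds * attempt)
```

A non-retryable failure (for example a permission error) still surfaces immediately, so retries only mask the flaky INTERNAL_ERROR case.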
/lgtm
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jlewi. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
Timed out waiting for TfJob; everything else passed. The workflow logs indicate the TFJob started and entered the running state pretty quickly, but then it appears to have gotten stuck. I looked at the Kubernetes events related to this namespace, and it looks like most of the pods start, but the master is stuck waiting for one of them. The pod logs for that pod, however, indicated it started just fine.
/retest
/retest
/retest
/retest
/test all
/lgtm
Create a python script to deploy Kubeflow on GCP via deployment manager. (kubeflow#866)

* Create python scripts for deploying Kubeflow on GCP via deployment manager.
* The script replaces our bash commands.
* For teardown we add retries to better handle INTERNAL_ERRORS with deployment manager that are causing the test to be flaky.

Related to kubeflow#836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (kubeflow#833)
* The not found error was due to the type providers for K8s resources being deleted before the corresponding K8s resources, so the subsequent delete of the K8s resource would fail because the type provider no longer existed.
* We fix this by using a $ref to refer to the type provider in the type field of K8s resources.
* deletePolicy can't be set per resource.
* Autoformat jsonnet.
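The $ref fix described in the commit message can be sketched as a Deployment Manager config fragment. The resource names, collection path, and properties below are illustrative assumptions, not the PR's actual templates.

```yaml
resources:
# A type provider that teaches Deployment Manager how to talk to the
# cluster's Kubernetes API (credentials and endpoint details elided).
- name: kubeflow-type-provider
  type: deploymentmanager.v2beta.typeProvider
  properties:
    descriptorUrl: https://CLUSTER_ENDPOINT/openapi/v2

# A K8s resource whose type field uses $(ref...) to point at the type
# provider. The explicit reference gives Deployment Manager a dependency
# edge, so on teardown it deletes this resource before the type provider,
# avoiding the resource_not_found errors described above.
- name: example-configmap
  type: MY_PROJECT/$(ref.kubeflow-type-provider.name):/api/v1/namespaces/{namespace}/configmaps
  properties:
    apiVersion: v1
    kind: ConfigMap
    namespace: kubeflow
    metadata:
      name: example-configmap
```

Without the $ref, nothing orders the deletes, which is why the type provider could disappear before the K8s resources that still needed it.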