Create a python script to deploy Kubeflow on GCP via deployment manager. #866

Merged: 3 commits, May 25, 2018

@jlewi jlewi commented May 24, 2018

  • The script replaces our bash commands.
  • For teardown, we want to add retries to better handle the INTERNAL_ERRORs
    from deployment manager that are making the test flaky.
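
The retry idea in the teardown bullet could look roughly like the sketch below. This is not the code from this PR: `InternalError`, `delete_with_retries`, and the attempt and delay values are hypothetical stand-ins for however the script calls Deployment Manager and detects an INTERNAL_ERROR.

```python
import logging
import time


class InternalError(Exception):
    """Hypothetical marker for a transient INTERNAL_ERROR from the backend."""


def delete_with_retries(delete_fn, max_attempts=3, delay_seconds=1.0,
                        sleep=time.sleep):
    """Call delete_fn, retrying only when it raises InternalError.

    delete_fn stands in for the real teardown call. Any other exception
    propagates immediately so genuine failures are not hidden.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return delete_fn()
        except InternalError as e:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            logging.warning("Attempt %s of %s failed with %s; retrying in %ss",
                            attempt, max_attempts, e, delay_seconds)
            sleep(delay_seconds)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays; only the error class treated as transient is retried.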

Related to #836 (verify Kubeflow deployed correctly with deployment manager).

  • Fix resource_not_found errors in delete (#833, "GCP deployment manager test handle internal errors").

  • The not found error was due to the type providers for K8s resources
    being deleted before the corresponding K8s resources. So the subsequent
    delete of the K8s resource would fail because the type provider did not
    exist.

  • We fix this by using a $ref to refer to the type provider in the type field
    of K8s resources.
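
To illustrate the fix described above, the sketch below builds a Deployment Manager resource whose `type` field contains a `$(ref...)` expression pointing at the type provider, so the K8s resource explicitly refers to the type provider instead of naming it with a literal string. The helper name, the resource names, the collection path, and the exact layout of the type string are assumptions for illustration; the config actually changed in this PR may look different.

```python
def k8s_resource(name, project, type_provider, collection, properties):
    """Build a resource dict whose type refers to the type provider by reference.

    The "$(ref.<resource>.name)" expression is Deployment Manager's reference
    syntax. All argument values used with this helper are hypothetical.
    """
    type_ref = "{0}/$(ref.{1}.name):{2}".format(project, type_provider,
                                                collection)
    return {
        "name": name,
        "type": type_ref,
        "properties": properties,
    }


# Hypothetical usage: a namespace created through a K8s type provider.
resource = k8s_resource(
    name="admin-namespace",
    project="my-project",
    type_provider="my-cluster-type",
    collection="/api/v1/namespaces",
    properties={"apiVersion": "v1", "kind": "Namespace"},
)
```

With a literal type string there is nothing tying the resource to the type provider, which matches the ordering problem described in the bullets above.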


@k8s-ci-robot k8s-ci-robot requested review from jimexist and willingc May 24, 2018 01:16
@jlewi jlewi changed the title from "Create a python script to deploy Kubeflow on GCP via deployment manager." to "[WIP] Create a python script to deploy Kubeflow on GCP via deployment manager." May 24, 2018
jlewi commented May 24, 2018

/retest

test_suite.run()

if __name__ == "__main__":
  logging.basicConfig(

test_helper.init() takes care of logging initialization

https://github.com/kubeflow/testing/blob/master/py/kubeflow/testing/test_helper.py#L159

@@ -0,0 +1,4 @@
INFO|2018-05-23T18:14:50|deploy_kubeflow_gcp.py:118| Creating deployment jlewi-kubeflow-test3 in project cloud-ml-dev

Checked in by accident?

jlewi commented May 24, 2018

Most recent failure was a flake in gke_e2e trying to contact github.com

@jlewi jlewi force-pushed the gke_deploy_test branch from 41ee3e4 to 3fa104e May 24, 2018 18:34

jlewi commented May 24, 2018

Test flake looks unrelated

Step 1/26 : FROM golang:1.8.2 as builder
Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

So I think this is ready for review.

@jlewi jlewi changed the title from "[WIP] Create a python script to deploy Kubeflow on GCP via deployment manager." to "Create a python script to deploy Kubeflow on GCP via deployment manager." May 24, 2018

jlewi commented May 24, 2018

/assign @kunmingg

jlewi commented May 24, 2018

/retest

jlewi commented May 24, 2018

More random test failures

Traceback (most recent call last):
 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
   "__main__", fname, loader, pkg_name)
 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
   exec code in run_globals
 File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-866-3fa104e-1651-5d37/src/kubeflow/kubeflow/testing/test_deploy.py", line 679, in <module>
   main()
 File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-866-3fa104e-1651-5d37/src/kubeflow/kubeflow/testing/test_deploy.py", line 669, in main
   wrap_test(args)
 File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-866-3fa104e-1651-5d37/src/kubeflow/kubeflow/testing/test_deploy.py", line 265, in wrap_test
   test_util.wrap_test(run, test_case)
 File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-866-3fa104e-1651-5d37/src/kubeflow/testing/py/kubeflow/testing/test_util.py", line 88, in wrap_test
   test_case.failure = "Test failed; " + e.message
TypeError: cannot concatenate 'str' and 'ConnectionError' objects
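
The traceback shows the test wrapper itself crashing: `e.message` held a `ConnectionError` object rather than a string, so the `+` concatenation raised `TypeError` and masked the original failure. One way to make such a handler robust is to format the exception instead of concatenating it. The sketch below is illustrative only and is not the actual `test_util.wrap_test` code; `failure_message` is a hypothetical helper, and the example uses Python 3's built-in `ConnectionError` to stand in for whichever class was raised in the run above.

```python
def failure_message(exc):
    """Return a failure string for any exception, whatever its args contain.

    str.format calls str() on the exception, so this works even when the
    exception wraps another exception object instead of a plain string.
    """
    return "Test failed; {0}".format(exc)


# An exception whose only argument is another exception, similar in shape
# to the case in the traceback.
wrapped = RuntimeError(ConnectionError("connection refused"))
message = failure_message(wrapped)
```

Here `message` is an ordinary string, so assigning it to a field such as `test_case.failure` cannot raise a second error inside the handler.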

jlewi commented May 24, 2018

/retest

@jlewi jlewi force-pushed the gke_deploy_test branch from 3fa104e to c5834bc May 24, 2018 22:54

@kunmingg kunmingg left a comment

/lgtm

jlewi commented May 25, 2018

/retest

jlewi commented May 25, 2018

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 25, 2018

jlewi commented May 25, 2018

/retest

jlewi commented May 25, 2018

This test run timed out waiting for the TFJob; everything else passed:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/866/kubeflow-presubmit/1661/

The workflow logs indicate the TFJob started and entered the running state pretty quickly, but then it appears to have gotten stuck.

I looked at the Kubernetes events related to this namespace and it looks like most of the pods

The master starts and is stuck waiting for:

CreateSession still waiting for response from worker: /job:ps/replica:0/task:1

But the logs for that pod indicate it started just fine.

jlewi commented May 25, 2018

/retest

jlewi commented May 25, 2018

/retest

jlewi commented May 25, 2018

/retest

jlewi commented May 25, 2018

/retest

jlewi commented May 25, 2018

/test all

@ankushagarwal

/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 0f0cb1c into kubeflow:master May 25, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
…er. (kubeflow#866)

* Create python scripts for deploying Kubeflow on GCP via deployment manager.

* The scripts replace our bash commands
* For teardown we want to add retries to better handle INTERNAL_ERRORS
  with deployment manager that are causing the test to be flaky.

Related to kubeflow#836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (kubeflow#833)

* The not found error was due to the type providers for K8s resources
  being deleted before the corresponding K8s resources. So the subsequent
  delete of the K8s resource would fail because the type provider did not
  exist.

* We fix this by using a $ref to refer to the type provider in the type field
  of K8s resources.

* deletePolicy can't be set per resource

* Autoformat jsonnet.