
[v1alpha2] Add CI test #589

Closed
gaocegege opened this issue May 11, 2018 · 10 comments
@gaocegege (Member)

Just as v1alpha1 does.

@jlewi (Contributor) commented May 21, 2018

Bumping this to P1 because we need it to have v1alpha2 ready for 0.2.
/priority p1

@jlewi (Contributor) commented May 21, 2018

It looks like, at a minimum, we need:

  • ksonnet changes to support deploying the v1alpha2 controller
  • (Possibly) changes to the v1alpha2 and v1alpha1 controllers so that they only claim specific versions of TFJobs, allowing us to co-deploy both versions
  • Upgrade the E2E test workflow to deploy both controllers
  • Upgrade the E2E test to run jobs using both versions

@gaocegege (Member, Author)

I will take a look. We need an E2E test to accelerate development of v1alpha2.

@gaocegege (Member, Author)

/assign @gaocegege

@gaocegege (Member, Author)

I am afraid that I cannot handle this issue.

/unassign

@jlewi (Contributor) commented Jun 11, 2018

There was a bunch of prior work in kubeflow/kubeflow#852

But that issue used mnist, which I think is overly complicated and doesn't make it easy to test distributed communication patterns.

For example, the test doesn't appear to have caught #634

Using a complicated test like mnist also leads to flakes like kubeflow/kubeflow#974

Let's use this issue to update the test to use a simple TFJob, like we do for v1alpha1.

@jlewi (Contributor) commented Jun 11, 2018

It looks like kubeflow/kubeflow#974 added the v1alpha2 test to Kubeflow but not to the tf-operator repository.

@jlewi (Contributor) commented Jun 12, 2018

One complication is that the current tf-job-operator test sets up a GPU cluster to run GPU tests.

Our DM config doesn't fully set up GPUs; we still need to create the daemonset (see kubeflow/kubeflow#288).

Here's what I think we should do:

  1. We should copy over the E2E workflow used by kubeflow/kubeflow
  2. We should strip out the non tf-operator tests
  3. We should add steps to run with/without GPUs for the current workflow
  4. We should parameterize the workflow by the version
  5. We should add separate workflows for v1alpha1 and v1alpha2 to prow_config.yaml
  6. We should disable the v1alpha2 version until the requisite issues are fixed.
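Steps 4 and 5 above might look roughly like the following in prow_config.yaml. This is a hypothetical sketch: the workflow names, `app_dir`, and the parameter key are illustrative, not the actual entries in the repo.

```yaml
# Hypothetical prow_config.yaml entries; names and parameters are illustrative.
workflows:
  - app_dir: kubeflow/tf-operator/test/workflows
    component: workflows
    name: tfjob-e2e-v1alpha1
    params:
      tfJobVersion: v1alpha1
  # v1alpha2 entry would be added but left disabled (step 6)
  # until the requisite issues are fixed:
  - app_dir: kubeflow/tf-operator/test/workflows
    component: workflows
    name: tfjob-e2e-v1alpha2
    params:
      tfJobVersion: v1alpha2
```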

jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
* The tests are currently disabled because they aren't passing yet:
  termination policy isn't handled correctly (kubeflow#634)

* Changed the v1alpha2 test to use the same smoke test as v1alpha1, as
  opposed to using mnist; mnist was causing problems because of issues
  downloading the data (see kubeflow/kubeflow#974)

* We want a simpler test that allows for more direct testing of the distributed
  communication pattern
* Also mnist is expensive in that it tries to download data.

* Add a parameter tfJobVersion to the deploy script so we can control
  whether we deploy v1alpha1 or v1alpha2

* Parameterize the E2E test workflow by the TFJob version we want to run.

* Update test-app: we need to pull in a version of the app which
  has the TFJobVersion flag.

* Create a script to regenerate the test-app for future use.

Related to kubeflow#589
k8s-ci-robot pushed a commit that referenced this issue Jun 13, 2018
* Changes to support v1alpha2 testing in presubmits.

* Fix versionTag logic; we need to allow for case where versionTag is an
empty string.
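The versionTag fix described above can be sketched roughly as follows. This is a hypothetical helper, not the actual deploy-script code; the function and image names are illustrative.

```python
def image_name(registry, version_tag):
    """Hypothetical sketch of the versionTag handling described above.

    An empty versionTag must be allowed: in that case we fall back to an
    untagged (default) image instead of emitting a malformed "image:" ref.
    """
    base = registry + "/tf_operator"
    if version_tag:  # covers both None and the empty string
        return base + ":" + version_tag
    return base
```

For example, `image_name("gcr.io/kubeflow", "")` falls back to the untagged image rather than failing on the empty string.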
@jlewi (Contributor) commented Jun 13, 2018

#646 created some v1alpha2 E2E tests, but they aren't enabled.

We will enable them in a follow-on PR, because I'm not sure yet whether they are passing; there may be additional fixes that are needed.

Getting them committed and triaging in follow-on PRs will be easier than holding them until the PRs are committed.

jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
* Only the tests for v1alpha1 are enabled. A follow-on PR will see
  if v1alpha2 is working and enable the tests for v1alpha2.

* Fix versionTag logic; we need to allow for the case where versionTag is an
  empty string.

* To facilitate these E2E tests, we create a test server to be run
  inside the replicas. This server allows us to control what the process
  does via RPC, which lets the test runner control when a replica exits.

* The test harness needs to route requests through the APIServer proxy.

* Events no longer appear to be showing up for all services/pods even
  though all services and pods are being created, so we turn the failure
  into a warning instead of a test failure.

* Print out the TFJob spec and events to aid debugging test failures.

Fixes: kubeflow#653 test server

Fixes: kubeflow#235 E2E test case for when chief is worker 0

Related: kubeflow#589 CI for v1alpha2
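The RPC-controlled test server described in the commit above could be sketched like this. This is a minimal illustrative version using Python's stdlib XML-RPC, not the actual harness code in the repo; the class and function names are assumptions.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer

class ReplicaControl:
    """Minimal sketch: lets an E2E test runner decide when a replica exits."""

    def __init__(self):
        self._exit = threading.Event()
        self.exit_code = 0

    def exit(self, code=0):
        # Called remotely by the test runner to make this replica exit.
        self.exit_code = code
        self._exit.set()
        return True

    def wait(self, timeout=None):
        # The replica's main loop blocks here until told to exit;
        # returns True once exit() has been called.
        return self._exit.wait(timeout)

def serve(control, port=8000):
    # Expose the control object over XML-RPC in a background thread,
    # so the test runner can call control.exit(code) remotely.
    server = SimpleXMLRPCServer(("0.0.0.0", port), allow_none=True,
                                logRequests=False)
    server.register_instance(control)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In the real harness the requests would additionally be routed through the APIServer proxy, as noted above.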
k8s-ci-robot pushed a commit that referenced this issue Jun 14, 2018
* Add E2E tests that verify termination policy is handled correctly.

* Fix bug in waiting for pods; we were exiting prematurely.
* Fix bug in getting the message from an event.
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
* Changes to support v1alpha2 testing in presubmits.

yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
* Add E2E tests that verify termination policy is handled correctly.
@jlewi (Contributor) commented Jun 20, 2018

Enabled in #667

jlewi closed this as completed Jun 20, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018
* Changes to support v1alpha2 testing in presubmits.

jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018
* Add E2E tests that verify termination policy is handled correctly.
3 participants