
[v1alpha2] Add CI test #589

Closed
gaocegege opened this issue May 11, 2018 · 10 comments
@gaocegege (Member)

Just as v1alpha1 does.

@jlewi (Contributor) commented May 21, 2018

Bumping this to P1 because we need it to have v1alpha2 ready for 0.2.
/priority p1

@jlewi (Contributor) commented May 21, 2018

It looks like, at a minimum, we need:

  • ksonnet changes to support deploying the v1alpha2 controller
  • (Possibly) changes to the v1alpha2 and v1alpha1 controllers so that they only claim specific versions of TFJobs, allowing us to co-deploy both versions
  • Upgrade the E2E test workflow to deploy both controllers
  • Upgrade the E2E test to run jobs using both versions

@gaocegege (Member, Author)

I will take a look. We need an E2E test to accelerate development of v1alpha2.

@gaocegege (Member, Author)

/assign @gaocegege

@gaocegege (Member, Author)

I am afraid that I cannot handle this issue.

/unassign

@jlewi (Contributor) commented Jun 11, 2018

There was a bunch of prior work in kubeflow/kubeflow#852

But that issue used mnist, which I think is overly complicated and doesn't make it easy to test distributed communication patterns.

For example, the test doesn't appear to have caught #634

Using a complicated test like mnist also leads to flakes like kubeflow/kubeflow#974

Let's use this issue to update the test to use a simple TFJob, like we do for v1alpha1.

@jlewi (Contributor) commented Jun 11, 2018

It looks like kubeflow/kubeflow#974 added the v1alpha2 test to Kubeflow but not to the tf-operator repository.

@jlewi (Contributor) commented Jun 12, 2018

One complication is that the current tf-job-operator test sets up a GPU cluster to run GPU tests.

Our DM config doesn't fully set up GPUs; we still need to create the daemonset (see kubeflow/kubeflow#288).

Here's what I think we should do:

  1. We should copy over the E2E workflow used by kubeflow/kubeflow
  2. We should strip out the non tf-operator tests
  3. We should add steps to run with/without GPUs for the current workflow
  4. We should parameterize the workflow by the version
  5. We should add separate workflows for v1alpha1 and v1alpha2 to prow_config.yaml
  6. We should disable the v1alpha2 version until the requisite issues are fixed.
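Steps 4 and 5 above might look roughly like the following in prow_config.yaml. This is a hypothetical sketch: the workflow names, `app_dir`, and the parameter key are illustrative, not the actual entries in the repo.

```yaml
# Hypothetical prow_config.yaml entries; names and parameters are illustrative.
workflows:
  - app_dir: kubeflow/tf-operator/test/workflows
    component: workflows
    name: tfjob-e2e-v1alpha1
    params:
      tfJobVersion: v1alpha1
  # v1alpha2 entry would be added but left disabled (step 6)
  # until the requisite issues are fixed:
  - app_dir: kubeflow/tf-operator/test/workflows
    component: workflows
    name: tfjob-e2e-v1alpha2
    params:
      tfJobVersion: v1alpha2
```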

jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
* The tests are currently disabled because they aren't passing yet:
  termination policy isn't handled correctly (kubeflow#634)

* Changed the v1alpha2 test to use the same smoke test as v1alpha1, as
  opposed to using mnist; mnist was causing problems because of issues
  downloading the data (see kubeflow/kubeflow#974)

* We want a simpler test that allows for more direct testing of the distributed
  communication pattern
* Also mnist is expensive in that it tries to download data.

* Add a parameter tfJobVersion to the deploy script so we can control
  whether we deploy v1alpha1 or v1alpha2

* Parameterize the E2E test workflow by the TFJob version we want to run.

* Update test-app: we need to pull in a version of the app which
  has the TFJobVersion flag.

* Create a script to regenerate the test-app for future use.

Related to kubeflow#589
k8s-ci-robot pushed a commit that referenced this issue Jun 13, 2018
* Changes to support v1alpha2 testing in presubmits.

* Fix versionTag logic; we need to allow for case where versionTag is an
empty string.
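The versionTag fix described above can be sketched roughly as follows. This is a hypothetical helper, not the actual deploy-script code; the function and image names are illustrative.

```python
def image_name(registry, version_tag):
    """Hypothetical sketch of the versionTag handling described above.

    An empty versionTag must be allowed: in that case we fall back to an
    untagged (default) image instead of emitting a malformed "image:" ref.
    """
    base = registry + "/tf_operator"
    if version_tag:  # covers both None and the empty string
        return base + ":" + version_tag
    return base
```

For example, `image_name("gcr.io/kubeflow", "")` falls back to the untagged image rather than failing on the empty string.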
@jlewi (Contributor) commented Jun 13, 2018

#646 created some v1alpha2 E2E tests, but they aren't enabled.

We will enable them in a follow-on PR, because I'm not sure yet whether they are passing; there may be additional fixes that are needed.

Getting them committed and triaging in follow-on PRs will be easier than holding them until the PRs are committed.

jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
* Only the tests for v1alpha1 are enabled. A follow-on PR will see
  if v1alpha2 is working and enable the tests for v1alpha2.

* Fix versionTag logic; we need to allow for the case where versionTag is an
  empty string.

* To facilitate these E2E tests, we create a test server to be run
  inside the replicas. This server allows us to control what the process
  does via RPC, which lets the test runner control when a replica exits.

* The test harness needs to route requests through the APIServer proxy.

* Events no longer appear to be showing up for all services/pods even
  though all services and pods are being created, so we turn the failure
  into a warning instead of a test failure.

* Print out the TFJob spec and events to aid debugging test failures.

Fixes: kubeflow#653 test server

Fixes: kubeflow#235 E2E test case for when chief is worker 0

Related: kubeflow#589 CI for v1alpha2
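The RPC-controlled test server described in the commit above could be sketched like this. This is a minimal illustrative version using Python's stdlib XML-RPC, not the actual harness code in the repo; the class and function names are assumptions.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer

class ReplicaControl:
    """Minimal sketch: lets an E2E test runner decide when a replica exits."""

    def __init__(self):
        self._exit = threading.Event()
        self.exit_code = 0

    def exit(self, code=0):
        # Called remotely by the test runner to make this replica exit.
        self.exit_code = code
        self._exit.set()
        return True

    def wait(self, timeout=None):
        # The replica's main loop blocks here until told to exit;
        # returns True once exit() has been called.
        return self._exit.wait(timeout)

def serve(control, port=8000):
    # Expose the control object over XML-RPC in a background thread,
    # so the test runner can call control.exit(code) remotely.
    server = SimpleXMLRPCServer(("0.0.0.0", port), allow_none=True,
                                logRequests=False)
    server.register_instance(control)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In the real harness the requests would additionally be routed through the APIServer proxy, as noted above.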
k8s-ci-robot pushed a commit that referenced this issue Jun 14, 2018
* Add E2E tests that verify termination policy is handled correctly.

* Fix bug in waiting for pods; we were exiting prematurely.
* Fix bug in getting the message from an event.
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
* Changes to support v1alpha2 testing in presubmits.

yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
* Add E2E tests that verify termination policy is handled correctly.
@jlewi (Contributor) commented Jun 20, 2018

Enabled in #667

jlewi closed this as completed Jun 20, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018
* Changes to support v1alpha2 testing in presubmits.

jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018
* Add E2E tests that verify termination policy is handled correctly.
3 participants