[v1alpha2] Add CI test #589
Comments
Bumping this to P1 because we need it to have v1alpha2 ready for 0.2.
It looks like at a minimum we need
I will take a look. We need an e2e test to accelerate development of v1alpha2.
/assign @gaocegege
I am afraid that I cannot handle the issue. /unassign
There was a bunch of prior work in kubeflow/kubeflow#852, but that issue used mnist, which I think is overly complicated and doesn't make it easy to test distributed communication patterns. For example, the test doesn't appear to have caught #634. Using a complicated test like mnist also leads to flakes like kubeflow/kubeflow#974. Let's use this issue to update the test to use a simple TFJob like we do for v1alpha1.
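For reference, a simple smoke-test TFJob of the kind discussed here could be submitted through the Kubernetes custom-objects API. This is only a minimal sketch: the `kubeflow.org/v1alpha2` group/version and `tfReplicaSpecs` layout follow the v1alpha2 API discussed in this issue, but the image, namespace, and replica counts are placeholders, not the actual CI configuration.

```python
# Minimal sketch: submit a simple v1alpha2 TFJob via the Kubernetes API.
# The image name and namespace are placeholders, not what the CI uses.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

tf_job = {
    "apiVersion": "kubeflow.org/v1alpha2",
    "kind": "TFJob",
    "metadata": {"name": "simple-smoke-test", "namespace": "default"},
    "spec": {
        "tfReplicaSpecs": {
            # A small PS/Worker topology is enough to exercise the
            # distributed communication pattern without downloading data.
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "gcr.io/example/tf-smoke-test:latest",  # placeholder image
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "Never",
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "gcr.io/example/tf-smoke-test:latest",  # placeholder image
                }]}},
            },
        }
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1alpha2",
    namespace="default",
    plural="tfjobs",
    body=tf_job,
)
```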
It looks like kubeflow/kubeflow#974 added the v1alpha2 test to Kubeflow but not to the TFOperator repository.
One complication is that the current tf-job-operator test is setting up a GPU cluster to run GPU tests. Our DM config doesn't fully set up GPUs; we still need to create the daemonset (see kubeflow/kubeflow#288). Here's what I think we should do.
* Changes to support v1alpha2 testing in presubmits.
* The tests are currently disabled because they aren't passing yet; termination policy isn't handled correctly (#634).
* Changed the v1alpha2 test to use the same smoke test as used by v1alpha1, as opposed to using mnist. mnist was causing problems because of issues downloading the data; see kubeflow/kubeflow#974.
  * We want a simpler test that allows for more direct testing of the distributed communication pattern.
  * Also, mnist is expensive in that it tries to download data.
* Add a parameter tfJobVersion to the deploy script so we can control whether we deploy v1alpha1 or v1alpha2.
* Parameterize the E2E test workflow by the TFJob version we want to run.
* Update test-app - we need to pull in a version of the app which has the TFJobVersion flag.
* Create a script to regenerate the test-app for future use.

Related to #589

* Fix versionTag logic; we need to allow for the case where versionTag is an empty string.
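As a rough illustration of the tfJobVersion and versionTag changes described above, the deploy script could expose them roughly as follows. This is only a sketch; the flag names, defaults, and fallback behavior are assumptions, not the actual script's interface.

```python
# Illustrative sketch only: how a deploy script might accept a tfJobVersion
# parameter to choose between v1alpha1 and v1alpha2, and tolerate an empty
# versionTag. Flag names and defaults are assumptions, not the real script.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Deploy the TFJob operator for E2E tests.")
    parser.add_argument(
        "--tf_job_version",
        default="v1alpha1",
        choices=["v1alpha1", "v1alpha2"],
        help="Which TFJob API version the operator and tests should use.")
    parser.add_argument(
        "--version_tag",
        default="",
        help="Image tag to deploy; may be empty, in which case a default is used.")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # Allow an empty version_tag (see the versionTag fix above) by deriving a
    # fallback tag instead of failing.
    tag = args.version_tag or "latest"
    print("Deploying TFJob operator %s with image tag %s" % (args.tf_job_version, tag))
```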
#646 created some v1alpha2 E2E tests but they aren't enabled. We will enable them in a follow-on PR. The reason to enable them in a follow-on PR is that I'm not sure yet whether they are passing; there may be additional fixes that are needed. Getting them committed and triaging in follow-on PRs will be easier than holding them until the PRs are committed.
* Add E2E tests that verify termination policy is handled correctly.
* Only the tests for v1alpha1 are enabled. A follow-on PR will see if v1alpha2 is working and enable the tests for v1alpha2.
* Fix versionTag logic; we need to allow for the case where versionTag is an empty string.
* To facilitate these E2E tests, we create a test server to be run inside the replicas. This server allows us to control what the process does via RPC. This allows the test runner to control when a replica exits.
* The test harness needs to route requests through the APIServer proxy.
* Events no longer appear to be showing up for all services / pods even though all services and pods are being created, so we turn that failure into a warning instead of a test failure.
* Print out the TFJob spec and events to aid debugging test failures.

Fix #653 test server
Fixes: #235 E2E test case for when chief is worker 0
Related: #589 CI for v1alpha2

* Fix bug in wait for pods; we were exiting prematurely.
* Fix bug in getting message from event.
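The in-replica test server and the APIServer proxy routing described above could look roughly like the sketch below. The endpoint path, port, and exit handling are illustrative assumptions, not the actual implementation from the PR; the pod proxy URL shape is the standard Kubernetes API server proxy path.

```python
# Rough sketch: a tiny in-replica test server that lets the test runner decide
# when the process exits, reached through the Kubernetes API server proxy.
# Paths, port, and exit behavior are illustrative, not the PR's implementation.
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class TestServerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/exit"):
            # Acknowledge the request, then terminate the replica shortly after
            # so the test runner can verify how the operator handles the exit.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"exiting\n")
            threading.Timer(0.5, lambda: os._exit(0)).start()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")

def serve(port=2222):
    HTTPServer(("0.0.0.0", port), TestServerHandler).serve_forever()

def pod_proxy_url(api_host, namespace, pod, port):
    # The test runner reaches the server through the API server's pod proxy
    # rather than hitting the pod directly; names here are placeholders.
    return "%s/api/v1/namespaces/%s/pods/%s:%s/proxy/exit" % (
        api_host, namespace, pod, port)

if __name__ == "__main__":
    serve()
```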
Enabled in #667.
Just as v1alpha1 does.