
[v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653

Closed
jlewi opened this issue Jun 13, 2018 · 0 comments

jlewi commented Jun 13, 2018

We'd like to write more comprehensive E2E test cases to verify the controller works as expected.

We have a number of issues related to adding more tests:
#646 - Add tests for termination behavior
#651 - Add Evaluator tests

Currently our E2E tests are based on [tf_smoke.py](https://github.com/kubeflow/tf-operator/blob/master/examples/tf_sample/tf_sample/tf_smoke.py). This verifies that ops can be assigned to different devices, which is a good starting point for checking that we can start TF servers and that they can communicate with one another.

It's not, however, a good test of controller behavior. For controller behavior, the things we'd like to test are:

  1. We want to verify that TF_CONFIG is set correctly
  2. We want to verify that restarts are handled correctly
  3. We want to verify that TFJob status/conditions are updated correctly

I think a better approach to writing these tests would be to run, in each TF replica, a simple Python server that exposes handlers like "/restart", "/continue", and "/get_tf_config", allowing the test harness to control the behavior of the process. This would make it much easier for the test harness to simulate certain conditions and verify that the controller works as expected.
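
A minimal sketch of what such a control server could look like, using only the Python standard library (the handler names and the exitCode query parameter are assumptions for illustration, not a final API):

```python
# Minimal sketch of the proposed in-replica control server (assumed handler
# names and query parameter; not a final API). /get_tf_config returns the
# TF_CONFIG env var so the harness can verify it; /quit makes the process
# exit with a caller-supplied exit code.
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


class ControlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/get_tf_config":
            body = os.environ.get("TF_CONFIG", "{}").encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        elif parsed.path == "/quit":
            # The exit code comes from the query string, e.g. /quit?exitCode=1.
            code = int(parse_qs(parsed.query).get("exitCode", ["0"])[0])
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"exiting\n")
            # Exit on a short timer so this HTTP response is flushed first.
            threading.Timer(0.5, os._exit, args=(code,)).start()
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ControlHandler).serve_forever()
```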

Tests like tf_smoke.py could still be useful for verifying that TF_CONFIG and TF libraries work together.

I think we should start by creating a server suitable for testing pod restart behavior. So we need the following:

  1. A Python server with a "/quit" handler
    • In response to "/quit" the process will exit with the provided exit code
  2. A test runner script (a sketch follows this list)
    • The test should verify that if a pod exits with a retryable exit code the process is restarted, and that for a permanent error the job is marked as failed.
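
A rough sketch of the test runner flow, assuming the replica's control server is reachable at some URL (for example through the API server proxy) and that helpers for reading the pod's restart count and the TFJob phase already exist; every name below is a placeholder rather than the actual harness:

```python
# Rough sketch of the restart test (placeholder names; not the actual harness).
# The concrete retryable/permanent exit codes depend on the controller's
# restart-policy and exit-code conventions, so they are parameters here.
import time

import requests


def ask_replica_to_quit(replica_url, exit_code):
    # replica_url points at the in-pod control server proposed above.
    requests.get(f"{replica_url}/quit", params={"exitCode": exit_code}, timeout=10)


def wait_for(predicate, timeout_seconds=300, interval_seconds=10):
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval_seconds)
    return False


def run_restart_test(replica_url, get_restart_count, get_tfjob_phase,
                     retryable_code, permanent_code):
    # Retryable exit code: the controller should restart the pod.
    before = get_restart_count()
    ask_replica_to_quit(replica_url, retryable_code)
    assert wait_for(lambda: get_restart_count() > before), "pod was not restarted"

    # Permanent exit code: the controller should mark the TFJob as Failed.
    ask_replica_to_quit(replica_url, permanent_code)
    assert wait_for(lambda: get_tfjob_phase() == "Failed"), "TFJob not marked Failed"
```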
jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
k8s-ci-robot pushed a commit that referenced this issue Jun 14, 2018
* Add E2E tests that verify termination policy is handled correctly.

* Only the tests for v1alpha1 are enabled. A follow-on PR will check whether v1alpha2 is working and enable the tests for v1alpha2.

* Fix versionTag logic; we need to allow for case where versionTag is an

* To facilitate these E2E tests, we create a test server to be run inside the replicas. This server allows us to control what the process does via RPC, so the test runner can control when a replica exits.

* Test harness needs to route requests through the APIServer proxy

* Events no longer appear to show up for all services/pods even though all services and pods are being created, so we turn this into a warning instead of a test failure.

* Print out the TFJob spec and events to aid debugging test failures.

Fix #653 test server

Fixes: #235 E2E test case for when chief is worker 0

Related: #589 CI for v1alpha2

* Fix bug in wait for pods; we were exiting prematurely.
* Fix bug in getting message from event.
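
Regarding the "route requests through the APIServer proxy" item in the commit message above: rather than talking to pod IPs directly, the harness can reach the in-pod control server through the Kubernetes API server's pod proxy subresource. A small illustration of that URL shape, with placeholder namespace, pod name, and port:

```python
# Illustration of the API server pod-proxy path (placeholder values).
API_SERVER = "https://<api-server-host>"   # e.g. taken from the kubeconfig
NAMESPACE = "default"
POD_NAME = "mytfjob-worker-0"
PORT = 8080

proxy_url = (
    f"{API_SERVER}/api/v1/namespaces/{NAMESPACE}"
    f"/pods/{POD_NAME}:{PORT}/proxy/quit?exitCode=1"
)
```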
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018