
[v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653

Closed
jlewi opened this issue Jun 13, 2018 · 0 comments

jlewi commented Jun 13, 2018

We'd like to write more comprehensive E2E test cases to verify the controller works as expected.

We have a number of issues related to adding more tests:
#646 - Add tests for termination behavior
#651 - Add Evaluator tests

Currently our E2E tests are based on [tf_smoke.py](https://github.com/kubeflow/tf-operator/blob/master/examples/tf_sample/tf_sample/tf_smoke.py). This verifies that ops can be assigned to different devices, which is a good starting point for checking that we can start TF servers and that they can communicate with one another.

It's not, however, a good test of controller behavior. For controller behavior, the things we'd like to test are:

  1. We want to verify that TF_CONFIG is set correctly
  2. We want to verify that restarts are handled correctly
  3. We want to verify that TFJob status/conditions are updated correctly

I think a better approach to writing these tests would be to run, in each TF replica, a simple Python server that exposes handlers like "/restart", "/continue", and "/get_tf_config", allowing the test harness to control the behavior of the process. This would make it much easier for the test harness to simulate certain conditions and verify that the controller works as expected.
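
A minimal sketch of what such a control server could look like, using only the Python standard library (the handler names and the exitCode query parameter are assumptions for illustration, not a final API):

```python
# Minimal sketch of the proposed in-replica control server (assumed handler
# names and query parameter; not a final API). /get_tf_config returns the
# TF_CONFIG env var so the harness can verify it; /quit makes the process
# exit with a caller-supplied exit code.
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


class ControlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/get_tf_config":
            body = os.environ.get("TF_CONFIG", "{}").encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        elif parsed.path == "/quit":
            # The exit code comes from the query string, e.g. /quit?exitCode=1.
            code = int(parse_qs(parsed.query).get("exitCode", ["0"])[0])
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"exiting\n")
            # Exit on a short timer so this HTTP response is flushed first.
            threading.Timer(0.5, os._exit, args=(code,)).start()
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ControlHandler).serve_forever()
```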

Tests like tf_smoke.py could still be useful for verifying that TF_CONFIG and TF libraries work together.

I think we should start by creating a server suitable for testing pod restart behavior. So we need the following:

  1. A Python server with a "/quit" handler
    • In response to "/quit" the process will exit with the provided exit code
  2. A test runner script (a sketch follows this list)
    • The test should verify that if a pod exits with a retryable exit code the process is restarted, and that for a permanent error the job is marked as failed.
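
A rough sketch of the test runner flow, assuming the replica's control server is reachable at some URL (for example through the API server proxy) and that helpers for reading the pod's restart count and the TFJob phase already exist; every name below is a placeholder rather than the actual harness:

```python
# Rough sketch of the restart test (placeholder names; not the actual harness).
# The concrete retryable/permanent exit codes depend on the controller's
# restart-policy and exit-code conventions, so they are parameters here.
import time

import requests


def ask_replica_to_quit(replica_url, exit_code):
    # replica_url points at the in-pod control server proposed above.
    requests.get(f"{replica_url}/quit", params={"exitCode": exit_code}, timeout=10)


def wait_for(predicate, timeout_seconds=300, interval_seconds=10):
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval_seconds)
    return False


def run_restart_test(replica_url, get_restart_count, get_tfjob_phase,
                     retryable_code, permanent_code):
    # Retryable exit code: the controller should restart the pod.
    before = get_restart_count()
    ask_replica_to_quit(replica_url, retryable_code)
    assert wait_for(lambda: get_restart_count() > before), "pod was not restarted"

    # Permanent exit code: the controller should mark the TFJob as Failed.
    ask_replica_to_quit(replica_url, permanent_code)
    assert wait_for(lambda: get_tfjob_phase() == "Failed"), "TFJob not marked Failed"
```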
jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
k8s-ci-robot pushed a commit that referenced this issue Jun 14, 2018
* Add E2E tests that verify termination policy is handled correctly.

* Only the tests for v1alpha1 are enabled. A follow-on PR will check whether v1alpha2 is working and enable the tests for v1alpha2.

* Fix versionTag logic; we need to allow for case where versionTag is an

* To facilitate these E2E tests, we create a test server to be run inside the replicas. This server allows us to control what the process does via RPC, so the test runner can control when a replica exits.

* Test harness needs to route requests through the APIServer proxy

* Events no longer appear to show up for all services/pods even though all services and pods are being created, so we turn this into a warning instead of a test failure.

* Print out the TFJob spec and events to aid debugging test failures.

Fix #653 test server

Fixes: #235 E2E test case for when chief is worker 0

Related: #589 CI for v1alpha2

* Fix bug in wait for pods; we were exiting prematurely.
* Fix bug in getting message from event.
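
Regarding the "route requests through the APIServer proxy" item in the commit message above: rather than talking to pod IPs directly, the harness can reach the in-pod control server through the Kubernetes API server's pod proxy subresource. A small illustration of that URL shape, with placeholder namespace, pod name, and port:

```python
# Illustration of the API server pod-proxy path (placeholder values).
API_SERVER = "https://<api-server-host>"   # e.g. taken from the kubeconfig
NAMESPACE = "default"
POD_NAME = "mytfjob-worker-0"
PORT = 8080

proxy_url = (
    f"{API_SERVER}/api/v1/namespaces/{NAMESPACE}"
    f"/pods/{POD_NAME}:{PORT}/proxy/quit?exitCode=1"
)
```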
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018