Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e: existing_arrikto: initial e2e test #4154

Merged

Conversation

yanniszark
Copy link
Contributor

@yanniszark yanniszark commented Sep 19, 2019

Initial E2E test for existing_arrikto.
This should ensure that nothing breaks with new PRs.
In the near future, we should move this to a pyfunc.

Signed-off-by: Yannis Zarkadas yanniszark@arrikto.com


This change is Reviewable

@yanniszark
Copy link
Contributor Author

/retest

@yanniszark yanniszark force-pushed the feature-kfctl-existing-arrikto-e2e-test branch from 9201469 to 526fd5a Compare September 19, 2019 16:58
@yanniszark yanniszark force-pushed the feature-kfctl-existing-arrikto-e2e-test branch from ea977a8 to d22854b Compare September 19, 2019 18:03
@nrchakradhar
Copy link
Contributor

@yanniszark
For my understanding can you please let me know what is the reason behind having cluster_creation as a script inside kfctl_go_test.py instead of a separate step in the "dag" which is added only if cluster_creation param is True? This would make it similar to the delete_cluster_step addition.

There is hard coding for cluster name, zone, in

util.run(["gcloud", "container", "cluster", "delete",
"kfctl-existing-arrikto-cluster", "--zone", "us-central1-a"],
cwd=os.getcwd)

that should match with
gcloud container clusters create "${CLUSTER_NAME}" \
--project "${PROJECT}" \
--zone "${ZONE}" \

By making creation also a step, it would be possible to remove the hard coding also.
If its cumbersome to change and better modular approach will follow with py_func then its okay.

@yanniszark
Copy link
Contributor Author

@nrchakradhar currently, this is a dirty approach to get it working.
Once I have something working, I can begin to clean it up.

For the separate DAG step, I was actually thinking of including deletion in the kfctl_delete step instead of making a new step.
kfctl would create and delete a cluster in those steps for other configs, so it's not a stretch imo.

@yanniszark yanniszark force-pushed the feature-kfctl-existing-arrikto-e2e-test branch 3 times, most recently from 3e5af2c to 4f92110 Compare September 20, 2019 18:14
@yanniszark
Copy link
Contributor Author

/retest

@yanniszark yanniszark force-pushed the feature-kfctl-existing-arrikto-e2e-test branch 6 times, most recently from 5b69308 to 2062950 Compare September 23, 2019 17:02
@yanniszark
Copy link
Contributor Author

/retest

1 similar comment
@yanniszark
Copy link
Contributor Author

/retest

@yanniszark yanniszark force-pushed the feature-kfctl-existing-arrikto-e2e-test branch 5 times, most recently from bf6a724 to 29d2a8f Compare September 24, 2019 13:45
@yanniszark yanniszark force-pushed the feature-kfctl-existing-arrikto-e2e-test branch from 0af6407 to d687f30 Compare October 11, 2019 12:01
@k8s-ci-robot k8s-ci-robot removed the lgtm label Oct 11, 2019
@yanniszark
Copy link
Contributor Author

/retest

@yanniszark
Copy link
Contributor Author

@jlewi I'm getting failed tests from the kfctl-go-build-deploy test.
It's timing out while waiting for components to be ready.
AFAIK, this is a new test, any idea why it's failing when it seems to use more or less the same config as the other GCP tests?

/retest

@@ -33,7 +33,13 @@ local runPath = srcDir + "/testing/workflows/run.sh";
local kfCtlPath = srcDir + "/bootstrap/bin/kfctl";
local kubeConfig = testDir + "/kfctl_test/.kube/kubeconfig";

// Name for the Kubeflow app.
// cluster_creation_script specifies the script to run in order to create
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we move to the argo_client/tekton_client rather than jsonnet in a future PR ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the issue is #4189

@yanniszark
Copy link
Contributor Author

/retest

@yanniszark
Copy link
Contributor Author

@jlewi could you lgtm again now that I have rebased?

"pytorch-operator",
"katib-controller",
"tensorboard",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we testing tensorboard? That shouldn't even be deployed.

@jlewi
Copy link
Contributor

jlewi commented Oct 12, 2019

@yanniszark The changes to kfctl_is_ready_test.py seem a bit risky right now. If the test is flaky adding more tests to it is only likely to exacerbate the problem. Can we pull the changes to kfctl_is_ready_test.py into a separate PR?

I think before we increase coverage of is_read_test we need to split it up; if we test more things in a single test the test is just going to become more unreliable.

@jlewi
Copy link
Contributor

jlewi commented Oct 12, 2019

@yanniszark I think an easy thing to do would be break it up into multiple tests corresponding to different related groups of servicess

e.g.

def test_kfserving
_wait_for_deployments(....)
_wait_for_statefulsets(...)

def test_gcp
_wait_for_deployments(....)
_wait_for_statefulsets(...)

That should hopefully give us finer grained signal about what's flaky. We can also easily use annotations to disable one subset or another.

But I'd suggest we do that in a separate PR so we don't keep blocking the arrikto E2E test.

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Related:
googleapis/google-api-python-client#299

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
@yanniszark yanniszark force-pushed the feature-kfctl-existing-arrikto-e2e-test branch from 552b5fb to 538e267 Compare October 14, 2019 15:33
@yanniszark
Copy link
Contributor Author

@jlewi as asked I reverted the extra coverage of the kf_is_ready test.
I filed #4294 to track this and I believe it's important to tackle it soon, since we currently don't test a lot of the installed applications.

Can you please comment on the issue with your suggestions on implementation?

Some questions I had on the approach you proposed were:

If the presubmit passes we should be good to go, based on your latest review.

@jlewi
Copy link
Contributor

jlewi commented Oct 14, 2019

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b9d90d9 into kubeflow:master Oct 14, 2019
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
* kfctl: existing_arrikto: initial e2e test

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* fix deployment name and review comment

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* fix bad syntax with extend

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* increase timeout for service ip discovery

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* update deployments to reflect latest manifests

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* skip knative for existing_arrikto

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* bump machine size and increase timeout

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* python script: fix error in delete step

Related:
googleapis/google-api-python-client#299

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* be more conservative in adding new test

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* revert checklist

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 12, 2021
* kfctl: existing_arrikto: initial e2e test

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* fix deployment name and review comment

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* fix bad syntax with extend

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* increase timeout for service ip discovery

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* update deployments to reflect latest manifests

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* skip knative for existing_arrikto

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* bump machine size and increase timeout

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* python script: fix error in delete step

Related:
googleapis/google-api-python-client#299

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* be more conservative in adding new test

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* revert checklist

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants