
add e2e test scripts in prow #19

Merged 5 commits from the e2e-test branch into kubeflow:master on Aug 17, 2018

Conversation

@everpeace (Contributor) commented Aug 7, 2018

fixes #3

@jlewi @gaocegege Would you mind reviewing this PR? I added an e2e test with prow by referring to the pytorch repo.



@everpeace (Contributor Author)

/cc @jlewi

@k8s-ci-robot requested a review from jlewi on August 7, 2018 03:49
@jlewi (Contributor) left a comment

Reviewable status: 0 of 18 files reviewed, 4 unresolved discussions (waiting on @everpeace, @disktnk, @gaocegege, and @jlewi)


scripts/create-cluster.sh, line 17 at r1 (raw file):

# limitations under the License.

# This shell script is used to build a cluster and create a namespace from our

Rather than creating a custom script, can we reuse the existing scripts to deploy Kubeflow? I think the preferred solution would be to wait for kubeflow/kubeflow#1308 to be submitted.

That will create the script kfctl.sh. You can use that to deploy Kubeflow, creating a GKE cluster. You can then install Chainer and deploy it.
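For context, a rough sketch of the kfctl.sh flow being suggested here, assuming the script exposes init/generate/apply subcommands for the GCP platform (only the init and delete platform subcommands are directly confirmed later in this thread; the variable names below are illustrative, not taken from this PR):

# Hypothetical sketch: create a GKE cluster and deploy Kubeflow with kfctl.sh,
# then layer the Chainer components on top with ksonnet as run-test.sh does.
PROJECT=kubeflow-ci        # GCP project used by the CI infrastructure
KFAPP=chainer-e2e-test     # hypothetical name for the Kubeflow deployment

${KUBEFLOW_REPO}/scripts/kfctl.sh init ${KFAPP} --platform gcp --project ${PROJECT}
cd ${KFAPP}
${KUBEFLOW_REPO}/scripts/kfctl.sh generate platform   # GKE deployment config
${KUBEFLOW_REPO}/scripts/kfctl.sh apply platform      # create the GKE cluster
${KUBEFLOW_REPO}/scripts/kfctl.sh generate k8s        # Kubeflow manifests
${KUBEFLOW_REPO}/scripts/kfctl.sh apply k8s           # deploy Kubeflow onto the cluster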


scripts/create-cluster.sh, line 35 at r1 (raw file):

gcloud --project ${PROJECT} beta container clusters create ${CLUSTER_NAME} \
    --zone ${ZONE} \
    --accelerator type=nvidia-tesla-k80,count=2 \

Are you actually testing GPUs? And do you really need two of them?


scripts/run-test.sh, line 36 at r1 (raw file):

echo "Activating service-account"
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}

We should use kfctl.sh because this is effectively duplicating the logic in kfctl.sh.


scripts/run-test.sh, line 67 at r1 (raw file):

ks generate chainer-job-simple ${MNIST_TEST} \
  --image=${REGISTRY}/${MNIST_TEST}:${VERSION} \
  --gpus=0 \

Looks like you aren't using GPUs, so let's not create GPU clusters.
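For reference, a minimal sketch of the create-cluster.sh command quoted above with the --accelerator line simply dropped (same variables as the original script; any other flags it passes would stay unchanged):

# Sketch: the same cluster-create call without requesting GPU accelerators,
# since the ChainerJob in run-test.sh runs with --gpus=0.
gcloud --project ${PROJECT} beta container clusters create ${CLUSTER_NAME} \
    --zone ${ZONE}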

@jlewi (Contributor) commented Aug 7, 2018

I opened kubeflow/kubeflow#1325 to make it easier to share code across E2E tests.

@everpeace (Contributor Author) left a comment

Reviewable status: 0 of 18 files reviewed, 4 unresolved discussions (waiting on @everpeace, @disktnk, and @gaocegege)


scripts/create-cluster.sh, line 17 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Rather than creating a custom script, can we reuse the existing scripts to deploy Kubeflow? I think the preferred solution would be to wait for kubeflow/kubeflow#1308 to be submitted.

That will create the script kfctl.sh. You can use that to deploy Kubeflow, creating a GKE cluster. You can then install Chainer and deploy it.

Thanks. I'll change this to use kfctl.sh.


scripts/create-cluster.sh, line 35 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Are you actually testing GPUs? And do you really need two of them?

No, I will fix it.

@jlewi (Contributor) commented Aug 8, 2018

Thanks. Please let me know when this is ready for another look.

@everpeace (Contributor Author)

@jlewi I changed the script to use kfctl.sh. I haven't actually run kfctl.sh, so there may be some mistakes. Please take a look. Thanks in advance 🙇

@jlewi (Contributor) left a comment

Reviewed 4 of 5 files at r2.
Reviewable status: 4 of 19 files reviewed, 5 unresolved discussions (waiting on @jlewi, @everpeace, @disktnk, and @gaocegege)


scripts/create-cluster.sh, line 35 at r2 (raw file):

kfctl

install?


scripts/kfctl-util.sh, line 24 at r2 (raw file):

    mkdir -p ${kfctl_dir}
    cd ${kfctl_dir}
    curl https://raw.githubusercontent.com/kubeflow/kubeflow/${kubeflow_version}/scripts/download.sh | sh

You shouldn't need to do this.

If you look at our other workflows, the first step is checkout, which checks out the code:
https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/components/workflows.libsonnet#L301

As part of that we can specify other repositories to check out and at which version (e.g. head). The code will get checked out to a shared NFS directory, and you can then refer to it.
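A minimal sketch of what referring to that shared checkout could look like, assuming the checkout step places repositories under the /mnt/test-data-volume/<workflow-name>/src layout that appears in the logs later in this thread (the variable names are illustrative):

# Sketch: reuse kfctl.sh from the kubeflow/kubeflow repo that the checkout step
# already placed on the shared NFS volume, instead of downloading it with curl.
SRC_DIR=/mnt/test-data-volume/${WORKFLOW_NAME}/src   # hypothetical checkout root
KUBEFLOW_REPO=${SRC_DIR}/kubeflow/kubeflow

${KUBEFLOW_REPO}/scripts/kfctl.sh init ${KFAPP} --platform gcp --project ${PROJECT}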

@jlewi (Contributor) commented Aug 8, 2018

It doesn't look like prow is triggering any tests. If you set up prow to run the presubmits, you will be able to verify as part of this PR that the tests work correctly before you check them in.

@everpeace (Contributor Author) left a comment

Reviewable status: 1 of 19 files reviewed, 5 unresolved discussions (waiting on @jlewi, @everpeace, @disktnk, and @gaocegege)


scripts/create-cluster.sh, line 17 at r1 (raw file):

Previously, everpeace (Shingo Omura) wrote…

Thanks. I'll change this to use kfctl.sh.

Done.


scripts/create-cluster.sh, line 35 at r2 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…
kfctl

install?

thanks for the catch 🙇‍♂️


scripts/kfctl-util.sh, line 24 at r2 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

You shouldn't need to do this.

If you look at our other workflows, the first step is checkout, which checks out the code:
https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/components/workflows.libsonnet#L301

As part of that we can specify other repositories to check out and at which version (e.g. head). The code will get checked out to a shared NFS directory, and you can then refer to it.

Nice 👍 I'll switch to using checkout.sh.


scripts/run-test.sh, line 36 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

We should use kfctl.sh because this is effectively duplicating the logic in kfctl.sh

Done.


scripts/run-test.sh, line 67 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Looks like you aren't using GPUs so lets not create GPU clusters.

Done.

@everpeace (Contributor Author) left a comment

Reviewable status: 1 of 19 files reviewed, 5 unresolved discussions (waiting on @jlewi, @disktnk, and @gaocegege)


scripts/create-cluster.sh, line 35 at r2 (raw file):

Previously, everpeace (Shingo Omura) wrote…

thanks for the catch 🙇‍♂️

Done.


scripts/kfctl-util.sh, line 24 at r2 (raw file):

Previously, everpeace (Shingo Omura) wrote…

Nice 👍 I'll switch to using checkout.sh.

Done.

@everpeace (Contributor Author)

@jlewi I changed the scripts to use checkout.sh instead of downloading kfctl.sh, and I opened a PR in kubernetes/test-infra to activate prow for this repo. PTAL 🙇

@jlewi (Contributor) commented Aug 9, 2018

I approved the other PR. As soon as the test is running and passing, I think we can submit this PR.

@everpeace (Contributor Author)

@jlewi Hi, I'm now struggling with the kfctl.sh init step in the setup-cluster argo step because of the error below:

++ gcloud projects describe kubeflow-ci '--format=value(project_number)'
ERROR: (gcloud.projects.describe) User [593963025935-compute@developer.gserviceaccount.com] does not have permission to access project [kubeflow-ci] (or it may not exist): Request had insufficient authentication scopes.
+ PROJECT_NUMBER=

ref: http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-chainer-operator-presubmit-e2e-19-8c9c3fc-7-577d?tab=workflow&nodeId=kubeflow-chainer-operator-presubmit-e2e-19-8c9c3fc-7-577d-3717692062&sidePanel=logs%3Akubeflow-chainer-operator-presubmit-e2e-19-8c9c3fc-7-577d-3717692062%3A

Would you mind giving me some advice?

@jlewi (Contributor) commented Aug 13, 2018

You're getting a permission error because you are relying on the VM service account, which doesn't have sufficient scopes.

You need to do two things:

  1. You need to mount a secret into your steps and set the environment variable GOOGLE_APPLICATION_CREDENTIALS.

Your argo workflow needs to define a volume corresponding to the secret in our kubeflow-ci cluster containing credentials for a GCP service account; see here
https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/components/kfctl_test.jsonnet#L319

Your argo step template needs to add this secret as a volume mount like this
https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/components/kfctl_test.jsonnet#L110

Then you need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the secret like so
https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/components/kfctl_test.jsonnet#L80

  2. You need to activate the service account by running the appropriate gcloud command; see here:
    https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/run.sh#L7
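For reference, the activation step is the same one-liner already used in run-test.sh earlier in this thread; the key file is whatever path the GOOGLE_APPLICATION_CREDENTIALS variable points at once the secret is mounted:

# Activate the GCP service account mounted into the step via the secret, so that
# subsequent gcloud calls run with its permissions instead of the VM's default
# compute service account (which lacks the required scopes).
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}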

@everpeace force-pushed the e2e-test branch 6 times, most recently from 9a03d7b to 807524b on August 14, 2018 08:46
@everpeace (Contributor Author) commented Aug 14, 2018

@jlewi Thank you very much for helping me! I'm now trying to get the e2e test working. However, I found that kfctl.sh delete platform does NOT delete the platform correctly: kubeflow/kubeflow#1360

I ran my e2e test many times without knowing that... and I finally hit a quota error in my e2e test. Sorry... 🙇

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-chainer-operator-presubmit-e2e-19-3580caf-26-1324?tab=workflow&nodeId=kubeflow-chainer-operator-presubmit-e2e-19-3580caf-26-1324-615064191&sidePanel=logs%3Akubeflow-chainer-operator-presubmit-e2e-19-3580caf-26-1324-615064191%3Amain

message: "{\"ResourceType\":\"container.v1.cluster\",\"ResourceErrorCode\":\"403\"\
   ,\"ResourceErrorMessage\":{\"code\":403,\"message\":\"Insufficient regional quota\
   \ to satisfy request: resource \\\"CPUS\\\": request requires '16.0' and is short\
   \ '8.0'. project has a quota of '320.0' with '8.0' available.\",\"status\":\"\
   PERMISSION_DENIED\",\"statusMessage\":\"Forbidden\",\"requestPath\":\"https://container.googleapis.com/v1/projects/kub
eflow-ci/zones/us-east1-d/clusters\"\
   ,\"httpMethod\":\"POST\"}}"

So, would you mind deleting the old existing clusters? 🙇 Thank you in advance.

// I will squash my small commits once the e2e test passes.

@everpeace (Contributor Author)

@jlewi Thanks again for cleaning up the garbage 🙇 This PR is finally ready for review!

But there is one concern about the clean-up job: it seems that deleting service accounts fails due to a permission issue. Could you give advice on how to delete the job's service accounts correctly?

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-chainer-operator-presubmit-e2e-19-444c6e2-32-3021?tab=workflow&nodeId=kubeflow-chainer-operator-presubmit-e2e-19-444c6e2-32-3021-566898882&sidePanel=logs%3Akubeflow-chainer-operator-presubmit-e2e-19-444c6e2-32-3021-566898882%3Amain

...
+ SA=z-19-444c6e2-32-3021-user@kubeflow-ci.iam.gserviceaccount.com
+ python /mnt/test-data-volume/kubeflow-chainer-operator-presubmit-e2e-19-444c6e2-32-3021/src/kubeflow/kubeflow/scripts/gke/delete_role_bindings.py --project=kubeflow-ci --service_account=z-19-444c6e2-32-3021-user@kubeflow-ci.iam.gserviceaccount.com
Service account z-19-444c6e2-32-3021-admin@kubeflow-ci.iam.gserviceaccount.com doesn't exist or you do not have permission to access service accounts.
+ deleteSa z-19-444c6e2-32-3021-user@kubeflow-ci.iam.gserviceaccount.com
+ local SA=z-19-444c6e2-32-3021-user@kubeflow-ci.iam.gserviceaccount.com
++ gcloud --project=kubeflow-ci iam service-accounts describe z-19-444c6e2-32-3021-user@kubeflow-ci.iam.gserviceaccount.com
Service account z-19-444c6e2-32-3021-user@kubeflow-ci.iam.gserviceaccount.com doesn't exist or you do not have permission to access service accounts.
+ O='ERROR: (gcloud.iam.service-accounts.describe) PERMISSION_DENIED: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/-/serviceAccounts/z-19-444c6e2-32-3021-user@kubeflow-ci.iam.gserviceaccount.com.'
+ local RESULT=1

@everpeace force-pushed the e2e-test branch 13 times, most recently from 261e390 to 1a7115a on August 16, 2018 14:47
@everpeace (Contributor Author) commented Aug 16, 2018

@jlewi I updated my branch to address your comments. Please take a look again. Thanks in advance 🙇

@everpeace (Contributor Author) left a comment

Reviewable status: 0 of 20 files reviewed, 7 unresolved discussions (waiting on @jlewi, @disktnk, and @gaocegege)


prow_config.yaml, line 6 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

chainer--e2e

This will be used as the prefix of the argo workflow name, so it's better to pick something more descriptive.

Done. However, 'chainer-e2e' exceeds the Kubernetes label key length limit, so I changed it to just 'chainer'.


test/workflows/components/workflows.jsonnet, line 7 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Delete the todo.

Done.


test/workflows/components/workflows.jsonnet, line 13 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

nit: I would suggest moving the logic inside workflows.libsonnet into workflows.jsonnet and renaming it to something more descriptive, e.g. e2e_test.jsonnet.

Done.

@jlewi (Contributor) commented Aug 17, 2018

/lgtm
/approve

@everpeace (Contributor Author)

@jlewi Thanks! I'm going to approve it myself this time. I'll add you to the OWNERS file later 👍

@everpeace (Contributor Author)

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: everpeace, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@everpeace (Contributor Author) left a comment

:lgtm:

Reviewable status: 0 of 20 files reviewed, all discussions resolved (waiting on @jlewi, @disktnk, and @gaocegege)

@everpeace (Contributor Author) left a comment

Reviewed 11 of 18 files at r1, 6 of 7 files at r3, 3 of 4 files at r4.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @disktnk and @gaocegege)

@k8s-ci-robot merged commit 26e3994 into kubeflow:master on Aug 17, 2018