/cc @jlewi
Reviewable status: 0 of 18 files reviewed, 4 unresolved discussions (waiting on @everpeace, @disktnk, @gaocegege, and @jlewi)
scripts/create-cluster.sh, line 17 at r1 (raw file):
```shell
# limitations under the License.

# This shell script is used to build a cluster and create a namespace from our
```
Rather than creating a custom script can we reuse the existing scripts to deploy Kubeflow?
I think the preferred solution would be to wait for kubeflow/kubeflow#1308 to be submitted. That will create the script kfctl.sh. You can use that to deploy Kubeflow, creating a GKE cluster. You can then install chainer and deploy it.
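A rough sketch of what a kfctl.sh-based deployment could look like. The subcommands and flags below are assumptions modeled on the early kfctl.sh interface, not the confirmed script; the dry-run wrapper echoes each command instead of executing it so the sketch runs without GCP access.

```shell
# Hypothetical sketch of deploying via kfctl.sh instead of a custom script.
# The subcommands and flags are assumptions, not the confirmed interface.
DRY_RUN=${DRY_RUN:-true}

run() {
  # In dry-run mode, print the command instead of executing it.
  if [ "${DRY_RUN}" = "true" ]; then echo "+ $*"; else "$@"; fi
}

run kfctl.sh init chainer-test --platform gcp --project "${PROJECT:-my-project}"
run kfctl.sh generate platform
run kfctl.sh apply platform
run kfctl.sh generate k8s
run kfctl.sh apply k8s
```

Setting DRY_RUN=false would execute the commands for real, assuming kfctl.sh is on the PATH.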
scripts/create-cluster.sh, line 35 at r1 (raw file):
```shell
gcloud --project ${PROJECT} beta container clusters create ${CLUSTER_NAME} \
  --zone ${ZONE} \
  --accelerator type=nvidia-tesla-k80,count=2 \
```
Are you actually testing GPUs? And do you really need two of them?
scripts/run-test.sh, line 36 at r1 (raw file):
```shell
echo "Activating service-account"
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
```
We should use kfctl.sh because this is effectively duplicating the logic in kfctl.sh
scripts/run-test.sh, line 67 at r1 (raw file):
```shell
ks generate chainer-job-simple ${MNIST_TEST} \
  --image=${REGISTRY}/${MNIST_TEST}:${VERSION} \
  --gpus=0 \
```
Looks like you aren't using GPUs, so let's not create GPU clusters.
I opened kubeflow/kubeflow#1325 to make it easier to share code across E2E tests.
Reviewable status: 0 of 18 files reviewed, 4 unresolved discussions (waiting on @everpeace, @disktnk, and @gaocegege)
scripts/create-cluster.sh, line 17 at r1 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Rather than creating a custom script can we reuse the existing scripts to deploy Kubeflow?
I think the preferred solution would be to wait for kubeflow/kubeflow#1308 to be submitted. That will create the script kfctl.sh. You can use that to deploy Kubeflow, creating a GKE cluster. You can then install chainer and deploy it.
Thanks. I changed this to use kfctl.sh.
scripts/create-cluster.sh, line 35 at r1 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Are you actually testing GPUs? And do you really need two of them?
No, I will fix it.
Thanks. Please let me know when this is ready for another look.
@jlewi I changed the script to use kfctl.sh.
Reviewed 4 of 5 files at r2.
Reviewable status: 4 of 19 files reviewed, 5 unresolved discussions (waiting on @jlewi, @everpeace, @disktnk, and @gaocegege)
scripts/create-cluster.sh, line 35 at r2 (raw file):
kfctl
install?
scripts/kfctl-util.sh, line 24 at r2 (raw file):
```shell
mkdir -p ${kfctl_dir}
cd ${kfctl_dir}
curl https://raw.githubusercontent.com/kubeflow/kubeflow/${kubeflow_version}/scripts/download.sh | sh
```
You shouldn't need to do this.
If you look at our other workflows the first step is checkout which checks out the code
https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/components/workflows.libsonnet#L301
As part of that we can specify other repositories to check out and what version (e.g. HEAD). The code will get checked out to a shared NFS directory and you can then refer to it.
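In other words, once the checkout step has run, a test script can reference kfctl.sh from the shared checkout directory rather than downloading it. A minimal sketch; the SRC_DIR path here is a hypothetical example of the shared-volume layout, not the actual directory structure.

```shell
# Sketch: use the kubeflow/kubeflow checkout on the shared test volume instead
# of curl-ing download.sh. SRC_DIR is an assumed path, not the actual layout.
SRC_DIR=${SRC_DIR:-/mnt/test-data-volume/src/kubeflow/kubeflow}
KFCTL="${SRC_DIR}/scripts/kfctl.sh"

echo "using kfctl from checkout: ${KFCTL}"
```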
It doesn't look like prow is triggering any tests. If you set up prow to run the presubmits, then you will be able to verify as part of this PR that the tests are working correctly before you check them in.
Reviewable status: 1 of 19 files reviewed, 5 unresolved discussions (waiting on @jlewi, @everpeace, @disktnk, and @gaocegege)
scripts/create-cluster.sh, line 17 at r1 (raw file):
Previously, everpeace (Shingo Omura) wrote…
Thanks. I changed this to use kfctl.sh.
Done.
scripts/create-cluster.sh, line 35 at r2 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
kfctl
install?
thanks for the catch 🙇♂️
scripts/kfctl-util.sh, line 24 at r2 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
You shouldn't need to do this.
If you look at our other workflows the first step is checkout which checks out the code
https://github.com/kubeflow/kubeflow/blob/master/testing/workflows/components/workflows.libsonnet#L301
As part of that we can specify other repositories to check out and what version (e.g. HEAD). The code will get checked out to a shared NFS directory and you can then refer to it.
Nice 👍 I'll switch to using checkout.sh.
scripts/run-test.sh, line 36 at r1 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
We should use kfctl.sh because this is effectively duplicating the logic in kfctl.sh
Done.
scripts/run-test.sh, line 67 at r1 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Looks like you aren't using GPUs so lets not create GPU clusters.
Done.
Reviewable status: 1 of 19 files reviewed, 5 unresolved discussions (waiting on @jlewi, @disktnk, and @gaocegege)
scripts/create-cluster.sh, line 35 at r2 (raw file):
Previously, everpeace (Shingo Omura) wrote…
thanks for the catch 🙇♂️
Done.
scripts/kfctl-util.sh, line 24 at r2 (raw file):
Previously, everpeace (Shingo Omura) wrote…
Nice 👍 I'll switch to using checkout.sh.
Done.
@jlewi I changed it to use checkout.sh instead of downloading.
I approved the other PR. As soon as the test is running and passing, I think we can submit this PR.
@jlewi Hi, I'm now struggling with a permission error. Would you mind giving me some advice?
You're getting a permission error because you are relying on the VM service account, which doesn't have sufficient scopes. You need to do two things:
- Your argo workflow needs to define a volume corresponding to the secret in our kubeflow-ci cluster containing credentials for a GCP service account; see here.
- Your argo step template needs to add this secret as a volume mount, like this.
Then you need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the secret, like so.
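Putting the last step together as a sketch: the mount path /secret/gcp-credentials/key.json is an assumption and depends on how the volume mount is declared in the workflow; the gcloud call is left commented so the sketch runs without GCP access.

```shell
# Sketch: point GOOGLE_APPLICATION_CREDENTIALS at the mounted secret, then
# activate the service account. The mount path is an assumed example.
export GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS:-/secret/gcp-credentials/key.json}
echo "credentials file: ${GOOGLE_APPLICATION_CREDENTIALS}"

# Requires gcloud and the mounted secret, so left commented in this sketch:
# gcloud auth activate-service-account --key-file="${GOOGLE_APPLICATION_CREDENTIALS}"
```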
Force-pushed from 9a03d7b to 807524b.
@jlewi Thank you very much for helping me! I'm now trying to get the e2e test working. However, I found that I had run my e2e test many times without realizing it, and I finally hit a quota error in my e2e test. Sorry 🙇
So, would you mind deleting the old existing clusters? 🙇 Thank you in advance. (I will squash my small commits once the e2e test passes.)
@jlewi Thanks again for cleaning up the garbage 🙇 Finally, this PR is ready for review! But there is one concern with the clean-up job: it seems that deleting service accounts fails due to a permission issue. Could you give advice on how to delete the job's service accounts correctly?
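For reference, a dry-run sketch of the service-account cleanup in question. The account email pattern and project name are assumptions, and the command is printed rather than executed so the sketch has no GCP dependency.

```shell
# Hypothetical cleanup step; the SA_EMAIL naming pattern and PROJECT default
# are assumptions, not the actual values used by the test infrastructure.
PROJECT=${PROJECT:-kubeflow-ci}
SA_EMAIL="chainer-e2e-test@${PROJECT}.iam.gserviceaccount.com"

# Print the command instead of running it; actually deleting service accounts
# requires delete permission on the project, which the test's credentials may lack.
echo "gcloud iam service-accounts delete ${SA_EMAIL} --project=${PROJECT} --quiet"
```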
Force-pushed from 261e390 to 1a7115a.
@jlewi I fixed my branch per your comments. Please take a look again. Thanks in advance 🙇
Reviewable status: 0 of 20 files reviewed, 7 unresolved discussions (waiting on @jlewi, @disktnk, and @gaocegege)
prow_config.yaml, line 6 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
chainer--e2e
This will be used as the prefix of the argo workflow name, so it's better to pick something more descriptive.
Done. However, 'chainer-e2e' exceeds the Kubernetes label key length limit, so I changed it to just 'chainer'.
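The limit in question: Kubernetes caps label name segments and label values at 63 characters, and the workflow-name prefix plus the generated suffix must fit within it. A quick shell check; the 40-character suffix budget here is an assumed example, not the actual suffix length prow generates.

```shell
# Check that a workflow name prefix leaves room for a generated suffix within
# Kubernetes' 63-character label limit. SUFFIX_BUDGET is an assumed example.
LIMIT=63
SUFFIX_BUDGET=40

check_prefix() {
  prefix="$1"
  if [ $(( ${#prefix} + SUFFIX_BUDGET )) -le ${LIMIT} ]; then
    echo "${prefix}: ok"
  else
    echo "${prefix}: too long"
  fi
}

check_prefix "chainer"
check_prefix "chainer-operator-e2e-test-workflow"
```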
test/workflows/components/workflows.jsonnet, line 7 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Delete the todo.
Done.
test/workflows/components/workflows.jsonnet, line 13 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
nit: I would suggest moving the logic inside workflows.libsonnet into workflows.jsonnet and renaming it to something more descriptive e.g.
e2e_test.jsonnet.
Done.
/lgtm
@jlewi Thanks! I'm going to approve it myself this time. I'll add you to the OWNERS file later 👍
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: everpeace, jlewi
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Reviewable status: 0 of 20 files reviewed, all discussions resolved (waiting on @jlewi, @disktnk, and @gaocegege)
Reviewed 11 of 18 files at r1, 6 of 7 files at r3, 3 of 4 files at r4.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @disktnk and @gaocegege)
fixes #3
@jlewi @gaocegege Would you mind reviewing this PR? I added an e2e test with prow, referring to the pytorch repo.