Update ci tests for mnist example #684

jinchihe · 2019-11-28T03:11:27Z

fixes: #677

The PR is going to:

Change tfjob_test.py to use pytest for better reporting of test results
Change deploy_test.py to use pytest
Replace the ksonnet workflow with a workflow defined in Python
Remove the GOOGLE_APPLICATION_CREDENTIALS setting and use workload identity instead
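
For readers unfamiliar with the pytest style mentioned above, here is a minimal sketch of what such a test can look like. It is not the actual tfjob_test.py from this PR: the helper `tfjob_final_condition` and the hand-written `tfjob` dictionary are hypothetical, and the `status.conditions` layout is assumed for illustration.

```python
# Hypothetical helper; the real tfjob_test.py is not shown in this thread.
def tfjob_final_condition(tfjob):
    """Return the type of the last condition whose status is "True", or None."""
    conditions = tfjob.get("status", {}).get("conditions") or []
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else None


def test_tfjob_succeeded():
    # A hand-written stand-in (assumed shape) for the object a cluster would return.
    tfjob = {
        "status": {
            "conditions": [
                {"type": "Created", "status": "True"},
                {"type": "Running", "status": "False"},
                {"type": "Succeeded", "status": "True"},
            ]
        }
    }
    # pytest reports the compared values when a plain assert fails.
    assert tfjob_final_condition(tfjob) == "Succeeded"
```

Running `pytest` on a file containing this function would collect `test_tfjob_succeeded` by its `test_` prefix; no test class or runner boilerplate is needed.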

jinchihe · 2019-12-03T03:13:37Z

/retest

jinchihe · 2019-12-03T06:33:49Z

That's strange. The job cannot get the training data from TensorFlow. Is this caused by a network issue?

Traceback (most recent call last):
  File "/opt/model.py", line 204, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/opt/model.py", line 147, in main
    mnist = tf.contrib.learn.datasets.DATASETS['mnist'](TF_DATA_DIR)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 300, in load_mnist
    return read_data_sets(train_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 260, in read_data_sets
    source_url + TRAIN_IMAGES)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 219, in maybe_download
    temp_file_name, _ = urlretrieve_with_retry(source_url)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 172, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 200, in urlretrieve_with_retry
    return urllib.request.urlretrieve(url, filename)
  File "/usr/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "/usr/lib/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 443, in open_https
    h.endheaders(data)
  File "/usr/lib/python2.7/httplib.py", line 1053, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 897, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 859, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 1270, in connect
    HTTPConnection.connect(self)
  File "/usr/lib/python2.7/httplib.py", line 836, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 575, in create_connection
    raise err
IOError: [Errno socket error] [Errno 99] Cannot assign requested address

jinchihe · 2019-12-03T12:06:14Z

Another problem: it seems there is no permission to upload data to GCS.

tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request (HTTP response code 403, error code 0, error message ''), response '{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "insufficientPermissions",
    "message": "Insufficient Permission"
   }
  ],
  "code": 403,
  "message": "Insufficient Permission"
 }
}
'
	 when initiating an upload to gs://kubeflow-ci-deployment_ci-temp/mnist/models/1201767042638680064/

jinchihe · 2019-12-03T13:59:57Z

Hello @jlewi , the PR is ready for reviewing, thanks!
/assign @jlewi

jlewi · 2019-12-05T00:20:27Z

Thank you so much @jinchihe !

jlewi

Reviewed 18 of 24 files at r1.
Reviewable status: 18 of 24 files reviewed, 1 unresolved discussion (waiting on @jinchihe, @lluunn, and @texasmichelle)

mnist/training/GCS/kustomization.yaml, line 9 at r1 (raw file):

gcr.io/kubeflow-ci/mnist/model

How come we are using an image in gcr.io/kubeflow-ci and not gcr.io/kubeflow-examples?

Also how come you are changing newTag to latest?

Won't this end up using whatever image was most recently built by the tests? Couldn't that cause problems, because we could end up using an image that was built by a presubmit which made broken changes to the serving code?

Should we instead leave this image as gcr.io/kubeflow-examples/mnist/model and point to a well-defined newTag?

This would correspond to the last known good image and would have to be updated manually. So it could get out of sync with the code checked in (until we automate updating it).

Could our tests run kustomize edit set image to point to a specific image in kubeflow-ci corresponding to the test?

jlewi

Apologies for the delay; last week was a holiday.

Reviewable status: 18 of 24 files reviewed, 1 unresolved discussion (waiting on @jinchihe, @lluunn, and @texasmichelle)

jinchihe · 2019-12-06T07:18:14Z

Hello @jlewi, thanks for your comments. Updated as below:

Changed the image from gcr.io/kubeflow-ci to gcr.io/kubeflow-examples.
Pointed to a well-defined newTag whose image includes the new model.py (the old image includes an out-of-date model.py).
For testing, we use kustomize edit set image to switch to the image built by the current PR in the build steps.
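
As a rough illustration of the last point, the sketch below builds a `kustomize edit set image` invocation from Python. It is not the code from this PR: the function names, the directory argument, and the example tag are hypothetical, and only the `NAME=NEWNAME:NEWTAG` argument form of the kustomize CLI is relied on.

```python
import subprocess


def set_image_command(name, new_name, new_tag):
    """Build the argument list for `kustomize edit set image NAME=NEWNAME:NEWTAG`."""
    return ["kustomize", "edit", "set", "image", f"{name}={new_name}:{new_tag}"]


def set_image(kustomize_dir, name, new_name, new_tag):
    """Run the command in the directory that holds kustomization.yaml.

    Requires the kustomize binary on PATH; raises CalledProcessError on failure.
    """
    subprocess.run(set_image_command(name, new_name, new_tag),
                   cwd=kustomize_dir, check=True)


# Example: the tag "pr-build-tag" is a placeholder, not a real image tag.
cmd = set_image_command(
    "gcr.io/kubeflow-examples/mnist/model",
    "gcr.io/kubeflow-ci/mnist/model",
    "pr-build-tag",
)
```

With this approach the checked-in kustomization.yaml keeps pointing at the last known good image, and only the test's working copy is rewritten to use the image built for the PR.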

jlewi · 2019-12-07T00:53:41Z

/lgtm
/approve

k8s-ci-robot · 2019-12-07T00:53:51Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jlewi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the do-not-merge/work-in-progress label Nov 28, 2019

k8s-ci-robot requested review from lluunn and texasmichelle November 28, 2019 03:11

k8s-ci-robot added the size/L label Nov 28, 2019

jinchihe force-pushed the update_mnist_ci_test branch 7 times, most recently from 9fdbc7b to aa159ab Compare November 29, 2019 02:27

k8s-ci-robot added size/XL and removed size/L labels Nov 29, 2019

jinchihe force-pushed the update_mnist_ci_test branch 17 times, most recently from 602e20e to 625f79a Compare December 2, 2019 02:36

jinchihe force-pushed the update_mnist_ci_test branch 4 times, most recently from 71bf729 to 5ba5b0e Compare December 3, 2019 02:54

jinchihe force-pushed the update_mnist_ci_test branch 2 times, most recently from d14b442 to 3fcdd4a Compare December 3, 2019 11:29

jinchihe force-pushed the update_mnist_ci_test branch 3 times, most recently from 0adaf2e to bc44e57 Compare December 3, 2019 13:47

jinchihe changed the title ~~WIP: update ci tests for mnist example~~ Update ci tests for mnist example Dec 3, 2019

k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 3, 2019

k8s-ci-robot assigned jlewi Dec 3, 2019

jlewi suggested changes Dec 5, 2019

jlewi reviewed Dec 5, 2019

jinchihe force-pushed the update_mnist_ci_test branch 2 times, most recently from 69cdaf8 to a067214 Compare December 6, 2019 06:49

update ci tests for mnist example

8d14433

jinchihe force-pushed the update_mnist_ci_test branch from a067214 to 8d14433 Compare December 6, 2019 07:14

k8s-ci-robot added the lgtm label Dec 7, 2019

k8s-ci-robot added the approved label Dec 7, 2019

k8s-ci-robot merged commit 1e38524 into kubeflow:master Dec 7, 2019

jinchihe mentioned this pull request Dec 7, 2019

WIP: Resolve two problems in ci/cd testing. #668

Closed
