Update ci tests for mnist example #684

Merged: 1 commit into kubeflow:master from update_mnist_ci_test on Dec 7, 2019

Conversation

jinchihe
Member

@jinchihe jinchihe commented Nov 28, 2019

fixes: #677

This PR will:

  1. Change tfjob_test.py to use pytest for better reporting of test results

  2. Change deploy_test.py to use pytest

  3. Replace the ksonnet workflow with a workflow defined in Python

  4. Remove the GOOGLE_APPLICATION_CREDENTIALS setting and use Workload Identity instead
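As a rough illustration of the pytest style referred to in items 1 and 2, the sketch below polls a job status until it reaches a terminal state and asserts on the result. It is not the code from tfjob_test.py: the helper `wait_for_condition`, the status strings, and the fake status source are all hypothetical, and the sketch assumes Python 3.

```python
import time


def wait_for_condition(get_status, expected, timeout_s=60.0, poll_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a value in `expected` or time runs out.

    Hypothetical helper: get_status stands in for a call that reads the
    job's current condition from the cluster.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in expected:
            return status
        if clock() >= deadline:
            raise TimeoutError("job did not reach one of %s" % sorted(expected))
        sleep(poll_s)


def test_job_reaches_succeeded():
    # Fake status source; a real test would query the Kubernetes API instead.
    statuses = iter(["Created", "Running", "Succeeded"])
    result = wait_for_condition(lambda: next(statuses),
                                {"Succeeded", "Failed"},
                                timeout_s=5, poll_s=0,
                                sleep=lambda seconds: None)
    # pytest reports a failing assert with the compared values.
    assert result == "Succeeded"
```

Running `pytest` on a file containing this function collects `test_job_reaches_succeeded` automatically and reports it as an individual pass or failure.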



@jinchihe jinchihe force-pushed the update_mnist_ci_test branch 7 times, most recently from 9fdbc7b to aa159ab Compare November 29, 2019 02:27
@jinchihe jinchihe force-pushed the update_mnist_ci_test branch 17 times, most recently from 602e20e to 625f79a Compare December 2, 2019 02:36
@jinchihe jinchihe force-pushed the update_mnist_ci_test branch 4 times, most recently from 71bf729 to 5ba5b0e Compare December 3, 2019 02:54
@jinchihe
Member Author

jinchihe commented Dec 3, 2019

/retest

@jinchihe
Member Author

jinchihe commented Dec 3, 2019

That's strange. The job cannot download the training data from TensorFlow; could this be caused by a network issue?

Traceback (most recent call last):
  File "/opt/model.py", line 204, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/opt/model.py", line 147, in main
    mnist = tf.contrib.learn.datasets.DATASETS['mnist'](TF_DATA_DIR)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 300, in load_mnist
    return read_data_sets(train_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 260, in read_data_sets
    source_url + TRAIN_IMAGES)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 219, in maybe_download
    temp_file_name, _ = urlretrieve_with_retry(source_url)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 172, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py", line 200, in urlretrieve_with_retry
    return urllib.request.urlretrieve(url, filename)
  File "/usr/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "/usr/lib/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 443, in open_https
    h.endheaders(data)
  File "/usr/lib/python2.7/httplib.py", line 1053, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 897, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 859, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 1270, in connect
    HTTPConnection.connect(self)
  File "/usr/lib/python2.7/httplib.py", line 836, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 575, in create_connection
    raise err
IOError: [Errno socket error] [Errno 99] Cannot assign requested address
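If the failure above is transient, one common mitigation is to retry the download with exponential backoff. The sketch below is only an illustration under that assumption: the `retry` helper and its parameters are hypothetical, it is not part of model.py or TensorFlow, and it assumes Python 3.

```python
import time


def retry(fn, attempts=3, base_delay_s=1.0, retry_on=(IOError,),
          sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper: the delay doubles after each failed attempt, and
    the last exception is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * (2 ** attempt))
```

A caller would wrap the flaky step, for example `retry(lambda: download(url))`, where `download` is a placeholder for whatever function fetches the dataset.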

@jinchihe jinchihe force-pushed the update_mnist_ci_test branch 2 times, most recently from d14b442 to 3fcdd4a Compare December 3, 2019 11:29
@jinchihe
Member Author

jinchihe commented Dec 3, 2019

Another problem: the job seems to have no permission to upload data to GCS.

tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request (HTTP response code 403, error code 0, error message ''), response '{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "insufficientPermissions",
    "message": "Insufficient Permission"
   }
  ],
  "code": 403,
  "message": "Insufficient Permission"
 }
}
'
	 when initiating an upload to gs://kubeflow-ci-deployment_ci-temp/mnist/models/1201767042638680064/

@jinchihe jinchihe force-pushed the update_mnist_ci_test branch 3 times, most recently from 0adaf2e to bc44e57 Compare December 3, 2019 13:47
@jinchihe jinchihe changed the title WIP: update ci tests for mnist example Update ci tests for mnist example Dec 3, 2019
@jinchihe
Member Author

jinchihe commented Dec 3, 2019

Hello @jlewi, the PR is ready for review, thanks!
/assign @jlewi

@jlewi
Contributor

jlewi commented Dec 5, 2019

Thank you so much @jinchihe !

Contributor

@jlewi jlewi left a comment

Reviewed 18 of 24 files at r1.
Reviewable status: 18 of 24 files reviewed, 1 unresolved discussion (waiting on @jinchihe, @lluunn, and @texasmichelle)


mnist/training/GCS/kustomization.yaml, line 9 at r1 (raw file):

gcr.io/kubeflow-ci/mnist/model

How come we are using an image in gcr.io/kubeflow-ci and not gcr.io/kubeflow-examples?

Also how come you are changing newTag to latest?

Won't this end up using whatever image was most recently built by the tests? Couldn't that cause problems, because we could end up using an image built by a presubmit that made broken changes to the serving code?

Should we instead leave this image as gcr.io/kubeflow-examples/mnist/model and point to a well-defined newTag?

This would correspond to the last known good image and would have to be updated manually, so it could get out of sync with the code checked in (until we automate updating it).

Could our tests run kustomize edit set image to point to a specific image in kubeflow-ci corresponding to the test?

Contributor

@jlewi jlewi left a comment

Apologies for the delay; last week was a holiday.

Reviewable status: 18 of 24 files reviewed, 1 unresolved discussion (waiting on @jinchihe, @lluunn, and @texasmichelle)

@jinchihe jinchihe force-pushed the update_mnist_ci_test branch 2 times, most recently from 69cdaf8 to a067214 Compare December 6, 2019 06:49
@jinchihe jinchihe force-pushed the update_mnist_ci_test branch from a067214 to 8d14433 Compare December 6, 2019 07:14
@jinchihe
Member Author

jinchihe commented Dec 6, 2019

Hello @jlewi Thanks for your comments. Updated as below:

  • Updated from gcr.io/kubeflow-ci to gcr.io/kubeflow-examples.
  • Pointed to a well-defined newTag whose image includes the new model.py (the old image included an out-of-date model.py).
  • For testing, we use kustomize edit set image to switch to the image built by the current PR in the build steps.
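
The image override described in the last bullet could look roughly like the sketch below. It is not the test code from this PR: the function names, the tag value, and the directory argument are hypothetical, and it assumes the `kustomize` binary is on the PATH when `set_image` is actually called.

```python
import subprocess


def set_image_command(image_name, new_name, new_tag):
    """Build the argv for `kustomize edit set image <name>=<newName>:<newTag>`."""
    return ["kustomize", "edit", "set", "image",
            "{0}={1}:{2}".format(image_name, new_name, new_tag)]


def set_image(kustomize_dir, image_name, new_name, new_tag):
    """Run the command in the directory that holds kustomization.yaml."""
    subprocess.check_call(set_image_command(image_name, new_name, new_tag),
                          cwd=kustomize_dir)


# Example argv only; "test-build-123" is a made-up tag for illustration.
print(set_image_command("gcr.io/kubeflow-examples/mnist/model",
                        "gcr.io/kubeflow-ci/mnist/model",
                        "test-build-123"))
```

With this approach the checked-in kustomization.yaml keeps pointing at the known good image in gcr.io/kubeflow-examples, and only the test's working copy is rewritten to use the freshly built image.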

@jlewi
Contributor

jlewi commented Dec 7, 2019

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1e38524 into kubeflow:master Dec 7, 2019
Successfully merging this pull request may close these issues.

Update mnist tests so that we get good signal in periodic dashboards