Fix a bunch of issues with logging. #811

Merged 2 commits on Aug 30, 2018

@jlewi jlewi commented Aug 30, 2018

  • I ran into these issues while trying to understand why my job was
    marked as failed even though there was no useful information about
    why the pod failed.

  • Log events that indicate the exit code of pods.

  • In the JSON payload, use the syntax namespace + "." + name, not
    namespace + "/" + name; a period is more consistent with K8s usage.

  • Don't log the event "TFJob is terminated, deleting pods and services".
    This event ended up being triggered repeatedly: because of CleanPodPolicy,
    the number of completed pods is nonzero, so the event statement kept
    getting called. The event is unnecessary because we will create
    events corresponding to the actual pods/services deleted.
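
For illustration only, here is a minimal Go sketch of the key format and the exit-code message described in the list above. It is not the code from this PR: the names `jobKey`, `exitCodeMessage`, and `logEntry`, the JSON field names, and the message wording are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobKey joins namespace and name with a period (hypothetical helper),
// the form preferred above over namespace + "/" + name.
func jobKey(namespace, name string) string {
	return namespace + "." + name
}

// exitCodeMessage builds the text of an event that reports a pod's exit
// code. The wording is an assumption, not the operator's actual message.
func exitCodeMessage(namespace, podName string, exitCode int32) string {
	return fmt.Sprintf("Pod %s exited with code %d", jobKey(namespace, podName), exitCode)
}

// logEntry is a hypothetical structured log payload.
type logEntry struct {
	Job     string `json:"job"`
	Message string `json:"message"`
}

func main() {
	entry := logEntry{
		Job:     jobKey("kubeflow", "mnist"),
		Message: exitCodeMessage("kubeflow", "mnist-worker-0", 1),
	}
	payload, err := json.Marshal(entry)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

With a message like this attached to the job, a failed pod's exit code is visible without having to inspect the pod itself.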


jlewi commented Aug 30, 2018

/assign @gaocegege
/assign @richardsliu

@johnugeorge
Member

/lgtm

coveralls commented Aug 30, 2018

Coverage Status

Coverage increased (+0.2%) to 58.222% when pulling 5eb6e59 on jlewi:fix_logging into f78e619 on kubeflow:master.

@TravisBuddy

Travis tests have failed

Hey @jlewi,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

2nd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tfcontroller/tfjob.go:19:2:warning: unused variable or constant terminatedTFJobReason (varcheck)

3rd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tfcontroller/tfjob.go:19:2:warning: unused variable or constant terminatedTFJobReason (varcheck)


@gaocegege
Member

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 30, 2018
@johnugeorge
Member

/lgtm
