[executor] Allow retrying failed pods that haven't had any containers start #4147
Conversation
…tart Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
	return
}

if pod.Status.Phase == v1.PodFailed {
The code in this if statement is largely the only new code in this file; the rest is just split out from job_event_reporter.
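To make that concrete, here is a minimal sketch of the shape such a branch could take. It is not the PR's actual code: the `failedPodChecker` interface, the `Action` values, and the `hasAnyContainerStarted`/`returnLease`/`reportFailed` callbacks are illustrative stand-ins for the executor's real types.

```go
package executor

import (
	v1 "k8s.io/api/core/v1"
)

// Action is an illustrative value describing what to do with a failed pod.
type Action int

const (
	ActionFail Action = iota
	ActionRetry
)

// failedPodChecker is a hypothetical stand-in for the configured checks this PR
// wires up; the real type lives elsewhere in the executor.
type failedPodChecker interface {
	GetAction(pod *v1.Pod, events []*v1.Event) (Action, string)
}

// onPodFailed sketches the new branch: a pod that reached PodFailed without any
// container starting is run through the checks, and a matching check returns the
// lease (so the job is retried) instead of reporting a terminal failure.
func onPodFailed(
	pod *v1.Pod,
	events []*v1.Event,
	checker failedPodChecker,
	hasAnyContainerStarted func(*v1.Pod) bool,
	returnLease func(pod *v1.Pod, reason string),
	reportFailed func(pod *v1.Pod),
) {
	if pod.Status.Phase != v1.PodFailed {
		return
	}
	if !hasAnyContainerStarted(pod) {
		if action, reason := checker.GetAction(pod, events); action == ActionRetry {
			returnLease(pod, reason)
			return
		}
	}
	reportFailed(pod)
}
```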
podEventChecks := make([]podEventCheck, 0, len(checks))

for _, check := range checks {
	re, err := regexp.Compile(check.Regexp)
At some point it would be nice to see if we can get viper to unmarshal regexes directly and push stuff like this into config validation.
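One way to do that, sketched below rather than taken from this PR, is a mapstructure decode hook registered via `viper.DecodeHook`, so malformed patterns are rejected at config-load time instead of being compiled later in application code. The `podEventCheckConfig` struct and the `failedPodChecks` key are assumptions, not the executor's actual schema.

```go
package config

import (
	"reflect"
	"regexp"

	"github.com/mitchellh/mapstructure"
	"github.com/spf13/viper"
)

// stringToRegexpHookFunc returns a decode hook that compiles a config string
// into a *regexp.Regexp while the config is being unmarshalled.
func stringToRegexpHookFunc() mapstructure.DecodeHookFunc {
	return func(from reflect.Type, to reflect.Type, data interface{}) (interface{}, error) {
		if from.Kind() != reflect.String {
			return data, nil
		}
		if to != reflect.TypeOf(&regexp.Regexp{}) && to != reflect.TypeOf(regexp.Regexp{}) {
			return data, nil
		}
		return regexp.Compile(reflect.ValueOf(data).String())
	}
}

// podEventCheckConfig is an illustrative config shape, not the real one.
type podEventCheckConfig struct {
	Regexp *regexp.Regexp
	Action string
}

// loadFailedPodChecks shows how the hook could be registered. Passing
// viper.DecodeHook replaces viper's default hooks, so the usual duration and
// slice hooks are re-added alongside the regexp hook.
func loadFailedPodChecks(v *viper.Viper) ([]podEventCheckConfig, error) {
	var checks []podEventCheckConfig
	err := v.UnmarshalKey("failedPodChecks", &checks, viper.DecodeHook(
		mapstructure.ComposeDecodeHookFunc(
			mapstructure.StringToTimeDurationHookFunc(),
			mapstructure.StringToSliceHookFunc(","),
			stringToRegexpHookFunc(),
		),
	))
	return checks, err
}
```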
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

# Conflicts:
#	internal/executor/application.go
#	internal/executor/reporter/job_event_reporter.go
This PR adds configurable retrying of failed pods, much like the existing pending pod checks.

The current caveat is that we only allow this for pods that have not started any containers. We may relax this restriction at some point, but we want to be conservative while adding this feature.

For now this feature will largely be used to retry pods that have hit the various pod allocation issues you can run into (GPU drivers failing to allocate GPUs, Kubernetes failing to allocate containers, etc.). A sketch of the kind of eligibility check this implies follows below.
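The sketch below shows one way the "no containers have started" condition could be determined from the pod status. This is an assumption about the intent, not the PR's implementation, and the function name is hypothetical.

```go
package executor

import (
	v1 "k8s.io/api/core/v1"
)

// hasStartedAnyContainer reports whether any of the pod's init or main
// containers ever got going; a failed pod would only be considered for retry
// when this returns false.
func hasStartedAnyContainer(pod *v1.Pod) bool {
	statuses := make([]v1.ContainerStatus, 0,
		len(pod.Status.InitContainerStatuses)+len(pod.Status.ContainerStatuses))
	statuses = append(statuses, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, s := range statuses {
		// A container counts as started if it is or was running, has terminated,
		// or has been restarted at least once.
		if s.State.Running != nil ||
			s.State.Terminated != nil ||
			s.LastTerminationState.Terminated != nil ||
			s.RestartCount > 0 {
			return true
		}
	}
	return false
}
```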