Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[executor] Allow retrying fail pods that haven't had any containers start #4147

Merged
merged 10 commits into from
Jan 20, 2025

Conversation

JamesMurkin
Copy link
Contributor

This PR makes it so you can configurable retry failed pods, much like the pending pod checks

The current caveat is that we only allow this for pods that have not started any containers, maybe we'll adjust this restriction at some point but we want to be conservative adding this feature for now

This feature for now is largely going to used to retry pods that have hit various Pod Allocation issues you can have (gpu drivers failing to allocate gpus, k8s failing to allocate containers etc)

…tart

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
return
}

if pod.Status.Phase == v1.PodFailed {
Copy link
Contributor Author

@JamesMurkin JamesMurkin Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in this if statement is largely the only new code in this file, the rest is just split out from job_event_reporter

podEventChecks := make([]podEventCheck, 0, len(checks))

for _, check := range checks {
re, err := regexp.Compile(check.Regexp)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point it would be nice to see if we can get viper to marshall regexes and push stuff like this to config validation.

d80tb7
d80tb7 previously approved these changes Jan 17, 2025
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

# Conflicts:
#	internal/executor/application.go
#	internal/executor/reporter/job_event_reporter.go
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
@JamesMurkin JamesMurkin enabled auto-merge (squash) January 20, 2025 13:48
@JamesMurkin JamesMurkin merged commit 48fdf31 into master Jan 20, 2025
20 checks passed
@JamesMurkin JamesMurkin deleted the retry_failed_pods branch January 20, 2025 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants