[executor] Allow retrying fail pods that haven't had any containers start #4147

JamesMurkin · 2025-01-16T11:23:17Z

This PR makes it possible to configure retries for failed pods, much like the existing pending pod checks.

The current caveat is that we only allow this for pods that have not started any containers. We may relax this restriction at some point, but we want to be conservative when adding this feature for now.

For now, this feature will largely be used to retry pods that have hit various pod allocation issues (GPU drivers failing to allocate GPUs, k8s failing to allocate containers, etc.).
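As a rough illustration of the restriction described above, here is a minimal sketch of a "no containers started" gate. The types are simplified, hypothetical stand-ins for the Kubernetes pod status types, and `hasStartedAnyContainer` and `isRetryCandidate` are names invented for this example; the actual checks in this PR may look quite different.

```go
package main

import "fmt"

// ContainerState, ContainerStatus and PodStatus are simplified, hypothetical
// stand-ins for the Kubernetes v1 types; only the fields this sketch needs
// are modelled.
type ContainerState struct {
	Running    bool // the container is currently running
	Terminated bool // the container ran at some point and has exited
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct {
	Phase                 string
	InitContainerStatuses []ContainerStatus
	ContainerStatuses     []ContainerStatus
}

// hasStartedAnyContainer reports whether any init or main container is
// running or has already exited (assumption: both count as "started").
func hasStartedAnyContainer(status PodStatus) bool {
	all := append([]ContainerStatus{}, status.InitContainerStatuses...)
	all = append(all, status.ContainerStatuses...)
	for _, c := range all {
		if c.State.Running || c.State.Terminated {
			return true
		}
	}
	return false
}

// isRetryCandidate applies the conservative rule: only failed pods whose
// containers never started are considered for a retry.
func isRetryCandidate(status PodStatus) bool {
	return status.Phase == "Failed" && !hasStartedAnyContainer(status)
}

func main() {
	neverStarted := PodStatus{
		Phase:             "Failed",
		ContainerStatuses: []ContainerStatus{{Name: "main"}},
	}
	ranThenFailed := PodStatus{
		Phase: "Failed",
		ContainerStatuses: []ContainerStatus{
			{Name: "main", State: ContainerState{Terminated: true}},
		},
	}
	fmt.Println(isRetryCandidate(neverStarted))  // true
	fmt.Println(isRetryCandidate(ranThenFailed)) // false
}
```

Treating a terminated container as "started" is a simplification for the sketch; a real check would need to decide how to handle containers that terminated without ever running.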

JamesMurkin · 2025-01-16T11:39:24Z

internal/executor/service/job_state_reporter.go

+		return
+	}
+
+	if pod.Status.Phase == v1.PodFailed {


The code in this if statement is largely the only new code in this file; the rest is just split out from job_event_reporter.

d80tb7 · 2025-01-17T09:37:18Z

internal/executor/podchecks/failedpodchecks/event_checks.go

+	podEventChecks := make([]podEventCheck, 0, len(checks))
+
+	for _, check := range checks {
+		re, err := regexp.Compile(check.Regexp)


At some point it would be nice to see if we can get viper to unmarshal regexes and push checks like this into config validation.
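One possible shape for that idea, sketched under assumptions: a small wrapper type that compiles its pattern while the config is being decoded, so an invalid regex fails at load time rather than when the checks are built. The `Regexp` wrapper, `EventCheck`, and `parseChecks` are hypothetical names, and `encoding/json` is used here only because it honours `encoding.TextUnmarshaler` out of the box; whether viper would call this method depends on how its decode hooks are configured, which this sketch does not cover.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Regexp is a hypothetical wrapper that compiles its pattern during decoding.
type Regexp struct {
	compiled *regexp.Regexp
}

// UnmarshalText implements encoding.TextUnmarshaler, so any decoder that
// honours that interface rejects invalid patterns while loading config.
func (r *Regexp) UnmarshalText(text []byte) error {
	re, err := regexp.Compile(string(text))
	if err != nil {
		return fmt.Errorf("invalid regexp %q: %w", text, err)
	}
	r.compiled = re
	return nil
}

// MatchString reports whether s matches the compiled pattern.
func (r *Regexp) MatchString(s string) bool {
	return r.compiled.MatchString(s)
}

// EventCheck is a hypothetical config entry for a failed pod event check.
type EventCheck struct {
	Regexp Regexp `json:"regexp"`
}

// parseChecks decodes a JSON list of checks; a bad pattern surfaces as a
// decode error instead of being discovered later.
func parseChecks(raw []byte) ([]EventCheck, error) {
	var checks []EventCheck
	if err := json.Unmarshal(raw, &checks); err != nil {
		return nil, err
	}
	return checks, nil
}

func main() {
	checks, err := parseChecks([]byte(`[{"regexp": "failed to allocate .*gpu"}]`))
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(checks[0].Regexp.MatchString("failed to allocate 2 gpu devices"))

	// An unbalanced parenthesis is rejected at decode time.
	_, err = parseChecks([]byte(`[{"regexp": "("}]`))
	fmt.Println(err != nil)
}
```

With a type like this, the explicit `regexp.Compile` loop would no longer need its own error path, since every check arriving from config would already hold a compiled pattern.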

JamesMurkin added 7 commits January 15, 2025 12:45

[executor] Allow retrying fail pods that haven't had any containers s…

db319c9

…tart Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Split event reporter and state reporter

9e9a8f6

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Improve messages

601db0a

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Add unit tests

f8aa3db

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Format

90a0125

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

FIx error

c8d8cba

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

Merge branch 'master' into retry_failed_pods

5a4ba5f

JamesMurkin commented Jan 16, 2025

d80tb7 reviewed Jan 17, 2025

d80tb7 previously approved these changes Jan 17, 2025

JamesMurkin added 2 commits January 20, 2025 12:43

Merge branch 'master' into retry_failed_pods

72d3cb2

Merge branch 'master' into retry_failed_pods

04a7de2

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> # Conflicts: # internal/executor/application.go # internal/executor/reporter/job_event_reporter.go

JamesMurkin dismissed d80tb7’s stale review via 04a7de2 January 20, 2025 12:48

Use common logger

cac263f

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

JamesMurkin enabled auto-merge (squash) January 20, 2025 13:48

d80tb7 approved these changes Jan 20, 2025

JamesMurkin merged commit 48fdf31 into master Jan 20, 2025
20 checks passed

JamesMurkin deleted the retry_failed_pods branch January 20, 2025 13:55
