Case study: Prow Jobs (and why Jobs are not enough for k8s testing) #43964
Comments
I don't think prow's original usage of k8s jobs was ever appropriate. We're running 200+ single-pod controllers at once when we could just be running one. We did have a use-case for jobs that k8s doesn't support afaik. We wanted to be able to run a pod and not restart it if it exits with a non-zero exit code. It should only restart the pod if it's stopped unnaturally, such as if the node goes down or it gets preempted.
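A minimal sketch of the half that a bare pod already gives you, assuming a hypothetical test entrypoint `run-e2e.sh`: with `restartPolicy: Never` the kubelet leaves a non-zero exit alone (the wanted behavior), but nothing reschedules a bare pod when its node goes away or it gets preempted, which is exactly the missing piece described above.

```yaml
# Sketch only: the kubelet never restarts this pod's containers after a
# non-zero exit, but the pod is also never recreated if the node dies or the
# pod is preempted, so some controller still has to handle that case.
apiVersion: v1
kind: Pod
metadata:
  name: one-shot-test            # hypothetical name
spec:
  restartPolicy: Never           # do not restart containers that exit non-zero
  containers:
  - name: test
    image: busybox               # placeholder image
    command: ["sh", "-c", "./run-e2e.sh"]   # hypothetical test entrypoint
```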
Sure. Currently we're transitioning away from k8s jobs that essentially acted as controllers for a single pod. Here's how they work:
There are a handful of reasons why this is a bad scheme, described in kubernetes/test-infra#2054. Instead, this is a perfect use case for a ThirdPartyResource plus a custom controller. It's not really a deficiency of k8s jobs, but just a poor choice on my part to use them as one-off controllers in the first place. Whether my custom controller actually creates k8s jobs or k8s pods depends on what I mentioned before. Right now I create a pod with …
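For context, ThirdPartyResources were later superseded by CustomResourceDefinitions. A minimal sketch of declaring such a resource with the current apiextensions API might look like the following; the group and kind names are purely illustrative and are not the actual ProwJob schema.

```yaml
# Illustrative only: one custom resource type whose instances a single custom
# controller watches, instead of running one Job "controller" per test run.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: cijobs.example.prow.dev          # must be <plural>.<group>
spec:
  group: example.prow.dev
  scope: Namespaced
  names:
    kind: CIJob
    singular: cijob
    plural: cijobs
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            x-kubernetes-preserve-unknown-fields: true   # e.g. the pod spec to run
          status:
            type: object
            x-kubernetes-preserve-unknown-fields: true   # e.g. state, start/finish times
```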
We've built a system around bare Pods to encapsulate our batch job system. The current Job system is opinionated to the point of not being useful. Continuously running the same failing job (or group of failing jobs) and blocking resources for other ones to run is not a great default choice. Worse yet, the documentation is not great: the restart policy documentation is confusing ... because "Never" still restarts, especially when the pod fails to run due to a bad image URI or missing repository keys. Stuff people can forget to set up or fat-finger. This was reported by others in different GitHub issues, but generally closed as WORKS AS DESIGNED.
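The surprise being described is that `restartPolicy` lives in the pod template, so `Never` only tells the kubelet not to restart containers in place; the Job controller still creates replacement pods for failed ones. A minimal sketch that reproduces the retry loop (name and image are placeholders):

```yaml
# With restartPolicy: Never the failed pod itself is not restarted, but the
# Job controller keeps creating new pods for it; before backoffLimit existed,
# that retry loop was effectively unbounded.
apiVersion: batch/v1
kind: Job
metadata:
  name: always-fails              # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never        # pod-level policy, not a Job-level one
      containers:
      - name: main
        image: busybox            # placeholder image
        command: ["sh", "-c", "exit 1"]   # simulate a task that always fails
```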
I'd like to stress that my use case was never a good fit for k8s jobs, opinionated or not. I do not feel like this issue demonstrates a failure of k8s jobs. The transition to ProwJobs is now complete, and it works great!
Closing in favor of #30243 as it seems that it will address the main concern both @spxtr and @mtanski have.
@mtanski this is an open source project; you can help improve things.
@Kargakis I did take a look at it again, in order to fix it. But it looks like it's already fixed ... or at least it's now documented that `restartPolicy` applies to the pod and not the Job.
Automatic merge from submit-queue: Backoff policy and failed pod limit. This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964. @erictune I've mentioned this to you during the last sig-apps, ptal. @kubernetes/sig-apps-feature-requests ptal. @lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi, since you all were interested in this.
Sometimes, as happened to me, a Pod's `restartPolicy` is mistakenly taken for the corresponding Job's restart policy. That was concluded before, here: https://github.com/kubernetes/community/pull/583/files. The confusion happened here: kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964, and here: jaegertracing/jaeger-kubernetes#32. This commit tries to clarify that there is no `restartPolicy` for the Job itself, and that using either of `backoffLimit` and `activeDeadlineSeconds` may result in permanent failure.
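A hedged sketch of the two Job-level fields being referenced (and that the backoff-policy PR above introduced the retry cap), with illustrative values: once either `backoffLimit` or `activeDeadlineSeconds` is exceeded, the Job is marked failed permanently, independent of the pod template's `restartPolicy`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bounded-retries           # placeholder name
spec:
  backoffLimit: 3                 # retry at most 3 failed pods, then fail the Job
  activeDeadlineSeconds: 600      # fail the whole Job after 10 minutes regardless
  template:
    spec:
      restartPolicy: OnFailure    # applies to the pod's containers only
      containers:
      - name: main
        image: busybox            # placeholder image
        command: ["sh", "-c", "./run-tests.sh"]   # hypothetical batch task
```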
prow is starting to migrate away from Jobs and into ProwJobs (a TPR). See kubernetes/test-infra#2054 + kubernetes/test-infra#2218 + kubernetes/test-infra#2243. @kubernetes/sig-apps-misc I think it's worth investigating whether this split is really necessary and whether k8s-native Jobs can be evolved to continue being used for testing k8s while staying generic at the same time.
cc: @kubernetes/sig-testing-misc @erictune @soltysh