Case study: Prow Jobs (and why Jobs are not enough for k8s testing) #43964

Closed
0xmichalis opened this issue Apr 2, 2017 · 7 comments
Labels
area/workload-api/job · priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done.) · sig/apps (Categorizes an issue or PR as relevant to SIG Apps.)

Comments

@0xmichalis
Contributor

Prow is starting to migrate away from Jobs to ProwJobs (a ThirdPartyResource). See kubernetes/test-infra#2054 + kubernetes/test-infra#2218 + kubernetes/test-infra#2243. @kubernetes/sig-apps-misc I think it's worth investigating whether this split is really necessary and whether native k8s Jobs can be evolved to keep serving k8s testing while staying generic at the same time.

cc: @kubernetes/sig-testing-misc @erictune @soltysh

0xmichalis added the area/workload-api/job, priority/awaiting-more-evidence, and sig/apps labels on Apr 2, 2017
@spxtr
Contributor

spxtr commented Apr 3, 2017

I don't think prow's original usage of k8s jobs was ever appropriate. We're running 200+ single-pod controllers at once when we could just be running one.

We did have a use case for Jobs that k8s doesn't support, AFAIK: we wanted to be able to run a pod and not restart it if it exits with a non-zero exit code. The pod should only be restarted if it stops unnaturally, for example because the node goes down or the pod gets preempted.

@soltysh
Contributor

soltysh commented Apr 7, 2017

@spxtr mind sharing more details about why you're not using Jobs? The second part of your issue should hopefully be addressed by #30243. Since we're gathering requirements for that feature right now, any input you can provide in that issue is more than welcome.

@spxtr
Contributor

spxtr commented Apr 7, 2017

Sure. Currently we're transitioning away from k8s Jobs that essentially acted as controllers for a single pod. Here's how they work:

  1. Someone says @k8s-bot test this in a comment on GitHub. A webhook finds its way to the prow cluster.
  2. We spin up a handful of k8s Jobs based on the webhook information. We store information about what the job should test in its labels, annotations, and args. It's a mess.
  3. The Job controller sees a new Job and starts the appropriate pod. That pod runs this program. For test jobs that run on k8s, it starts a new pod, watches it, and reports to GitHub. For Jenkins jobs, it starts one of those, watches it, and reports to GitHub. This code is not good. It updates the Job status by grabbing its job-name label from the Downward API (see the sketch after this list). Kinda neat, but it feels like a bit of a hack.
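
A minimal sketch of the Downward API trick from step 3, not the actual prow manifest (the Job name and image here are hypothetical): the Job controller stamps each pod with a `job-name` label, and a `downwardAPI` volume projects the pod's labels into a file that the wrapper program can read to find and update its own Job.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pull-test-example                      # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: runner
        image: example.com/test-runner:latest  # hypothetical image
        volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
          readOnly: true
      volumes:
      - name: podinfo
        downwardAPI:
          items:
          # Projects all pod labels (including the job-name label added by
          # the Job controller) into /etc/podinfo/labels.
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
```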

There are a handful of reasons why this is a bad scheme, described in kubernetes/test-infra#2054. Instead, this is a perfect use case for a ThirdPartyResource plus a custom controller. It's not really a deficiency of k8s Jobs, just a poor choice on my part to use them as one-off controllers in the first place.
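
For reference, a hedged sketch of the general shape of a ThirdPartyResource registration at that time; the group and kind below are hypothetical, not the actual ProwJob definition. A custom controller would then watch objects of this kind instead of driving k8s Jobs.

```yaml
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  # TPR names were of the form <kind-with-dashes>.<api-group>
  name: prow-job.testing.example.com
description: "A single test run triggered by prow"
versions:
- name: v1
```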

Whether my custom controller actually creates k8s Jobs or k8s pods depends on what I mentioned before. Right now I create a pod with RestartPolicy: "Never". This way, when the user's tests fail, it doesn't rerun them! However, when a node upgrade comes around or something else evicts the pod, I do want to rerun them. It would be nice if Jobs could run a pod to completion, pass or fail.
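
A minimal sketch of that bare-pod approach, with hypothetical names and arguments: restartPolicy: Never means a non-zero exit is final, but nothing recreates the pod if the node disappears, which is exactly the gap described above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-run-abc123                        # hypothetical name
spec:
  restartPolicy: Never                         # failed tests are not rerun automatically
  containers:
  - name: test
    image: example.com/test-runner:latest      # hypothetical image
    args: ["--repo=kubernetes/kubernetes", "--pr=12345"]   # hypothetical args
```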

@mtanski

mtanski commented Apr 18, 2017

We've built a system around bare Pods to encapsulate our batch job system. The current Job system is opinionated to the point of not being useful. Continuously rerunning the same failing job (or group of failing jobs) and blocking resources that other jobs could use is not a great default choice.

Worse yet, the documentation is not great; the restart policy documentation is confusing, because with "Never" the Job still gets retried, especially when the pod fails to run due to a bad image URI or missing repository keys. Those are things people can forget to set up or can fat-finger.

This has been reported by others in different GitHub issues, but those were generally closed as WORKS AS DESIGNED.
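
For context, a hedged sketch of the knobs that later came out of #30243 and are referenced in the commits further down this thread. The field values and names below are illustrative only: restartPolicy lives on the pod template (there is no Job-level restart policy), while backoffLimit and activeDeadlineSeconds cap how long the Job controller keeps retrying with new pods.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-example               # hypothetical name
spec:
  backoffLimit: 3                   # stop creating replacement pods after 3 failed attempts
  activeDeadlineSeconds: 600        # give up on the Job entirely after 10 minutes
  template:
    spec:
      # restartPolicy applies to containers within each pod, not to the Job;
      # with "Never" the kubelet does not restart the container in place, but
      # the Job controller can still create new pods up to backoffLimit.
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/batch-worker:latest   # hypothetical image
```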

@spxtr
Contributor

spxtr commented Apr 18, 2017

I'd like to stress that my use case was never a good fit for k8s jobs, opinionated or not. I do not feel like this issue demonstrates a failure of k8s jobs.

The transition to ProwJobs is now complete, and it works great!

@0xmichalis
Contributor Author

Closing in favor of #30243, as it seems it will address the main concern both @spxtr and @mtanski have.

> Worse yet, the documentation is not great; the restart policy documentation is confusing, because with "Never" the Job still gets retried, especially when the pod fails to run due to a bad image URI or missing repository keys. Those are things people can forget to set up or can fat-finger.

@mtanski this is an open source project, you can help improve things.

@mtanski

mtanski commented Apr 19, 2017

@Kargakis I took another look at it, intending to fix it, but it looks like it's already fixed ... at least it's now documented that restartPolicy applies to the Pod and not the Job.

k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 28, 2017
Automatic merge from submit-queue

Backoff policy and failed pod limit

This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964.

@erictune I've mentioned this to you during last sig-apps, ptal
@kubernetes/sig-apps-feature-requests ptal
@lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi since you all were interested in this
jgehrcke added a commit to jgehrcke/website that referenced this issue Jan 10, 2020
Sometimes, as it happened to me, a Pod's `restartPolicy` 
is mistakenly taken as the corresponding Job's restart policy.

That was concluded before, here:
https://github.com/kubernetes/community/pull/583/files

The confusion happened here:
kubernetes/kubernetes#30243
kubernetes/kubernetes#43964

And here:
jaegertracing/jaeger-kubernetes#32

This commit tries to clarify that there is no `restartPolicy` for
the job itself, and that using either of `backoffLimit` and
`activeDeadlineSeconds` may result in permanent failure.