Case study: Prow Jobs (and why Jobs are not enough for k8s testing) #43964

Closed
0xmichalis opened this issue Apr 2, 2017 · 7 comments
Labels
area/workload-api/job · priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done.) · sig/apps (Categorizes an issue or PR as relevant to SIG Apps.)

Comments

@0xmichalis
Contributor

Prow is starting to migrate away from Jobs to ProwJobs (a ThirdPartyResource). See kubernetes/test-infra#2054 + kubernetes/test-infra#2218 + kubernetes/test-infra#2243. @kubernetes/sig-apps-misc I think it's worth investigating whether this split is really necessary and whether native k8s Jobs can be evolved to keep serving k8s testing while staying generic at the same time.

cc: @kubernetes/sig-testing-misc @erictune @soltysh

0xmichalis added the area/workload-api/job, priority/awaiting-more-evidence, and sig/apps labels on Apr 2, 2017
@spxtr
Contributor

spxtr commented Apr 3, 2017

I don't think prow's original usage of k8s jobs was ever appropriate. We're running 200+ single-pod controllers at once when we could just be running one.

We did have a use case for Jobs that k8s doesn't support, AFAIK: we wanted to be able to run a pod and not restart it if it exits with a non-zero exit code. The pod should only be restarted if it stops unnaturally, for example because the node goes down or the pod gets preempted.

@soltysh
Contributor

soltysh commented Apr 7, 2017

@spxtr mind sharing more details about why you're not using Jobs? The second part of your issue should hopefully be addressed by #30243. Since we're gathering requirements for that feature right now, any input you can provide in that issue is more than welcome.

@spxtr
Contributor

spxtr commented Apr 7, 2017

Sure. Currently we're transitioning away from k8s Jobs that essentially acted as controllers for a single pod. Here's how they work:

  1. Someone says @k8s-bot test this in a comment on GitHub. A webhook finds its way to the prow cluster.
  2. We spin up a handful of k8s Jobs based on the webhook information. We store information about what the job should test in its labels, annotations, and args. It's a mess.
  3. The Job controller sees a new Job and starts the appropriate pod. That pod runs this program. For test jobs that run on k8s, it starts a new pod, watches it, and reports to GitHub. For Jenkins jobs, it starts one of those, watches it, and reports to GitHub. This code is not good. It updates the Job status by grabbing its job-name label from the Downward API (see the sketch after this list). Kinda neat, but it feels like a bit of a hack.
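
A minimal sketch of the Downward API trick from step 3, not the actual prow manifest (the Job name and image here are hypothetical): the Job controller stamps each pod with a `job-name` label, and a `downwardAPI` volume projects the pod's labels into a file that the wrapper program can read to find and update its own Job.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pull-test-example                      # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: runner
        image: example.com/test-runner:latest  # hypothetical image
        volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
          readOnly: true
      volumes:
      - name: podinfo
        downwardAPI:
          items:
          # Projects all pod labels (including the job-name label added by
          # the Job controller) into /etc/podinfo/labels.
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
```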

There are a handful of reasons why this is a bad scheme, described in kubernetes/test-infra#2054. Instead, this is a perfect use case for a ThirdPartyResource plus a custom controller. It's not really a deficiency of k8s Jobs, just a poor choice on my part to use them as one-off controllers in the first place.
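
For reference, a hedged sketch of the general shape of a ThirdPartyResource registration at that time; the group and kind below are hypothetical, not the actual ProwJob definition. A custom controller would then watch objects of this kind instead of driving k8s Jobs.

```yaml
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  # TPR names were of the form <kind-with-dashes>.<api-group>
  name: prow-job.testing.example.com
description: "A single test run triggered by prow"
versions:
- name: v1
```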

Whether my custom controller actually creates k8s Jobs or k8s pods depends on what I mentioned before. Right now I create a pod with RestartPolicy: "Never". This way, when the user's tests fail, it doesn't rerun them! However, when a node upgrade comes around or something else evicts the pod, I do want to rerun them. It would be nice if Jobs could run a pod to completion, pass or fail.
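
A minimal sketch of that bare-pod approach, with hypothetical names and arguments: restartPolicy: Never means a non-zero exit is final, but nothing recreates the pod if the node disappears, which is exactly the gap described above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-run-abc123                        # hypothetical name
spec:
  restartPolicy: Never                         # failed tests are not rerun automatically
  containers:
  - name: test
    image: example.com/test-runner:latest      # hypothetical image
    args: ["--repo=kubernetes/kubernetes", "--pr=12345"]   # hypothetical args
```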

@mtanski

mtanski commented Apr 18, 2017

We've built a system around bare Pods to encapsulate our batch job system. The current Job system is opinionated to the point of not being useful. Continuously rerunning the same failing job (or group of failing jobs) and blocking resources that other jobs could use is not a great default choice.

Worse yet, the documentation is not great; the restart policy documentation is confusing, because with "Never" the Job still gets retried, especially when the pod fails to run due to a bad image URI or missing repository keys. Those are things people can forget to set up or can fat-finger.

This has been reported by others in different GitHub issues, but those were generally closed as WORKS AS DESIGNED.
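
For context, a hedged sketch of the knobs that later came out of #30243 and are referenced in the commits further down this thread. The field values and names below are illustrative only: restartPolicy lives on the pod template (there is no Job-level restart policy), while backoffLimit and activeDeadlineSeconds cap how long the Job controller keeps retrying with new pods.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-example               # hypothetical name
spec:
  backoffLimit: 3                   # stop creating replacement pods after 3 failed attempts
  activeDeadlineSeconds: 600        # give up on the Job entirely after 10 minutes
  template:
    spec:
      # restartPolicy applies to containers within each pod, not to the Job;
      # with "Never" the kubelet does not restart the container in place, but
      # the Job controller can still create new pods up to backoffLimit.
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/batch-worker:latest   # hypothetical image
```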

@spxtr
Contributor

spxtr commented Apr 18, 2017

I'd like to stress that my use case was never a good fit for k8s jobs, opinionated or not. I do not feel like this issue demonstrates a failure of k8s jobs.

The transition to ProwJobs is now complete, and it works great!

@0xmichalis
Contributor Author

Closing in favor of #30243, as it seems it will address the main concern both @spxtr and @mtanski have.

> Worse yet, the documentation is not great; the restart policy documentation is confusing, because with "Never" the Job still gets retried, especially when the pod fails to run due to a bad image URI or missing repository keys. Those are things people can forget to set up or can fat-finger.

@mtanski this is an open source project, you can help improve things.

@mtanski

mtanski commented Apr 19, 2017

@Kargakis I took another look at it, intending to fix it, but it looks like it's already fixed ... at least it's now documented that restartPolicy applies to the Pod and not the Job.

k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 28, 2017
Automatic merge from submit-queue

Backoff policy and failed pod limit

This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964.

@erictune I've mentioned this to you during last sig-apps, ptal
@kubernetes/sig-apps-feature-requests ptal
@lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi since you all were interested in this
jgehrcke added a commit to jgehrcke/website that referenced this issue Jan 10, 2020
Sometimes, as it happened to me, a Pod's `restartPolicy` 
is mistakenly taken as the corresponding Job's restart policy.

That was concluded before, here:
https://github.com/kubernetes/community/pull/583/files

The confusion happened here:
kubernetes/kubernetes#30243
kubernetes/kubernetes#43964

And here:
jaegertracing/jaeger-kubernetes#32

This commit tries to clarify that there is no `restartPolicy` for
the job itself, and that using either of `backoffLimit` and
`activeDeadlineSeconds` may result in permanent failure.