
Case report: Job somehow made 84k pods #27997

Closed
lavalamp opened this issue Jun 24, 2016 · 25 comments · Fixed by #48075
Labels: area/workload-api/job, kind/bug, priority/important-soon, sig/apps

Comments

@lavalamp
Member

Observed a case where a Job somehow created 84k pods. I think the user had set parallelism to 0 to pause it by the time I saw it. So I'm not sure exactly how to reproduce it, but we should think about implementing some sanity limits. P1 since I was paged about this.
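
Note: pausing a Job by setting its parallelism to 0, as the user apparently did, amounts to patching `spec.parallelism`. A minimal sketch, with the job name as a placeholder, which stops the controller from creating new pods for that Job until parallelism is raised again:

```sh
kubectl patch job <job-name> -p '{"spec":{"parallelism":0}}'
```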

lavalamp added the priority/important-soon and team/control-plane labels on Jun 24, 2016
@lavalamp
Member Author

@erictune

@fabioy
Contributor

fabioy commented Jun 24, 2016

Meanwhile, some related issues to help mitigate this: #13774, #25146 (maybe).

@mikedanese
Member

mikedanese commented Jun 24, 2016

I know how to make infinite pods:

$ cat docs/user-guide/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 100
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000); exit 1"]
      restartPolicy: Never
$ kubectl create -f docs/user-guide/job.yaml

mikedanese added this to the v1.3 milestone on Jun 24, 2016
mikedanese self-assigned this on Jun 24, 2016
@mikedanese
Member

mikedanese commented Jun 24, 2016

IMO, the internal type of parallelism and completions should not be pointers

@mikedanese
Member

mikedanese commented Jun 24, 2016

> IMO, the internal type of parallelism and completions should not be pointers

@erictune if you say this is the right way to fix it, I will go ahead and do it. Alternatively, we can fix this by hardening our validation and defaulting, but there's always the chance that something else will slip in. I don't think there's any reason for those fields to ever be pointers by the time they reach the internal code.

@lavalamp
Member Author

This isn't new in 1.3 and I don't think we're going to hold the release for it, but if you get a fix merged by tomorrow's burndown meeting that's OK with me, too.

@mikedanese
Member

Easy fix in #28008, but I still think that the internal types should become non-pointers. That's a bigger change.

mikedanese removed this from the v1.3 milestone on Jun 24, 2016
@mikedanese
Member

Actually, I think I was mistaken and the behavior I observed is by design. I'll revisit tomorrow.

@soltysh
Contributor

soltysh commented Jun 24, 2016

I've commented on Mike's PR.

@soltysh
Contributor

soltysh commented Jun 24, 2016

@mikedanese with your example (every run of the job is a failure), the job will keep creating as many pods as you allow it to. We don't have any mechanism preventing you from doing so. We could add something like that, though.

@soltysh
Contributor

soltysh commented Jun 24, 2016

I've carefully checked the job logic once again; we have two necessary checks in place:

  1. finish a job when at least one successful pod is there and no more active pods (see here);
  2. not to start new pods when there's at least one successful execution (see here).

With the current logic, though, it can easily be overused (see Mike's example, where each pod execution fails); on the other hand, any job will create tons of pods when every execution is failing. That's how jobs were designed. Unless we know the exact use case that caused this (@lavalamp, any chance of that?), it's hard to guess.

Still, although it's valid to have some safety net, I'm thinking (we've already struggled with a similar use case some time ago, with scaling RCs) this is what quota is for, right?
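
A minimal sketch of the kind of quota that would apply here (the namespace name and limit are arbitrary): a namespace-scoped ResourceQuota with a hard `pods` count, so that once the cap is hit, further pod creations in the namespace are rejected by the ResourceQuota admission plugin and a flood of pods stuck in Pending stays bounded.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-cap
  namespace: batch-jobs   # arbitrary example namespace
spec:
  hard:
    pods: "100"           # reject new pods once 100 non-terminal pods exist in the namespace
```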

@lavalamp
Member Author

It's worth noting that these pods were Pending, not Failed.

@soltysh
Contributor

soltysh commented Jun 24, 2016

Hmm.... that's interesting... I'll try to dig into it next week.

soltysh self-assigned this and unassigned mikedanese on Jun 24, 2016
@randalloveson

I believe I'm seeing something a lot like this. No one job with anywhere near 84k pods, but many jobs did create multiple pods while their existing pods were in a Pending state (this is with parallelism 1, completions 1, and container restart policy Never). I would think that a job would only make a new pod if it were sure the old one had failed or otherwise wasn't going to make any more containers, but it appears that, at least under extremely heavy load (~4,000 jobs/hour), jobs will make more pods to fulfill themselves even if the old ones are Pending.

@soltysh
Contributor

soltysh commented Jul 7, 2016

@randalloveson that is a valid clue; I'll check whether our tests cover the case where pods are stuck in a Pending state.

@soltysh
Contributor

soltysh commented Jul 7, 2016

It doesn't look like it has to do with pending pods; see #28597, where I've added that test case explicitly. @randalloveson if you can provide a reproducer, that would be very helpful.

@randalloveson

@soltysh I believe the reproduction condition might be when requests to etcd for Pod status start running up against some timeout used by the controller. When that happens, it seems like the Job that spawned the Pod assumes the worst and makes a new Pod, exacerbating the problem.

Either that, or Jobs really will just ignore an existing Pod they created that is known to be in a Pending state and make more, which, were it the case, would be a really destructive design.

@soltysh
Contributor

soltysh commented Jul 7, 2016

I'm starting to wonder if this has something to do with #28486. I need to do more thorough testing of that case.

k8s-github-robot pushed a commit that referenced this issue Jul 8, 2016
Automatic merge from submit-queue

Added test case covering pending pods in syncJob

@randalloveson suggested in #27997 that we might not take pending pods into consideration; while checking that, I wrote an additional test case for `syncJob`.

@randalloveson @erictune ptal

k8s-github-robot added the needs-sig label on May 31, 2017
@0xmichalis
Contributor

/sig apps

k8s-ci-robot added the sig/apps label on Jun 23, 2017
0xmichalis added the area/workload-api/job and kind/bug labels and removed the needs-sig and team/control-plane (deprecated) labels on Jun 23, 2017
@0xmichalis
Contributor

Not sure this was ever fixed

0xmichalis reopened this on Jun 23, 2017
@mikedanese
Member

Do you want to fix the operator error mentioned here: #27997 (comment) ?

@0xmichalis
Contributor

We can identify failures and ratelimit the culprits. This is the same problem as when you specify a Deployment with a bad node selector.

@soltysh
Contributor

soltysh commented Jul 31, 2017

This will be fixed when we introduce the job failure policy.

@splittingfield

splittingfield commented Aug 25, 2017

Do we have an update on this job failure policy? I think we have hit a related issue.

We recently found an issue in which a Job had an init container that was exiting with a non-zero code (there was a bug in the init container, and it was actually impossible for it to exit cleanly). This led to the Job being restarted repeatedly. Soon, the entire K8s cluster (1.6) across all namespaces became sluggish and unresponsive, requiring manual deletion, etc.

We understand that restartPolicy: Never does not apply to Jobs (#20255). However, we are unable to set activeDeadlineSeconds, as these init containers are moving large amounts of data and the Job itself can be arbitrarily long (easily multiple days if training a large deep learning model).

I am happy to help collect more data to debug this issue if there is anything you would want us to share. (We are currently collecting data ourselves for a post-mortem.)

@soltysh
Contributor

soltysh commented Aug 29, 2017

There's kubernetes/community#583 addressing this issue, along with the implementation in #48075 and #51153.
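
For reference, the backoff policy from #48075 eventually surfaced as `spec.backoffLimit` on the Job. A minimal sketch of Mike's always-failing example with a retry cap added; the parallelism and limit values here are arbitrary:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  backoffLimit: 4          # mark the Job as failed after the retry budget is exhausted, instead of retrying forever
  parallelism: 2
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000); exit 1"]
      restartPolicy: Never
```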

k8s-github-robot pushed a commit that referenced this issue Sep 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920)

[API] Feature/job failure policy

**What this PR does / why we need it**: Implements the Backoff policy and failed pod limit defined in kubernetes/community#583

**Which issue this PR fixes**: 
fixes #27997, fixes #30243

**Special notes for your reviewer**:
This is a WIP PR; I updated the API batchv1.JobSpec in order to prepare the backoff policy implementation in the JobController.

**Release note**:
```release-note
Add backoff policy and failed pod limit for a job
```