
Case report: Job somehow made 84k pods #27997

Closed
lavalamp opened this issue Jun 24, 2016 · 25 comments · Fixed by #48075
Labels: area/workload-api/job, kind/bug, priority/important-soon, sig/apps

Comments

@lavalamp
Member

Observed a case where a Job somehow created 84k pods. I think the user had set parallelism to 0 to pause it by the time I saw it. So I'm not sure exactly how to reproduce it, but we should think about implementing some sanity limits. P1 since I was paged about this.
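
Note: pausing a Job by setting its parallelism to 0, as the user apparently did, amounts to patching `spec.parallelism`. A minimal sketch, with the job name as a placeholder, which stops the controller from creating new pods for that Job until parallelism is raised again:

```sh
kubectl patch job <job-name> -p '{"spec":{"parallelism":0}}'
```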

lavalamp added the priority/important-soon and team/control-plane labels on Jun 24, 2016
@lavalamp
Member Author

@erictune

@fabioy
Contributor

fabioy commented Jun 24, 2016

Meanwhile, some related issues to help mitigate this: #13774, #25146 (maybe).

@mikedanese
Member

mikedanese commented Jun 24, 2016

I know how to make infinite pods:

$ cat docs/user-guide/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 100
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000); exit 1"]
      restartPolicy: Never
$ kubectl create -f docs/user-guide/job.yaml

mikedanese added this to the v1.3 milestone on Jun 24, 2016
mikedanese self-assigned this on Jun 24, 2016
@mikedanese
Member

mikedanese commented Jun 24, 2016

IMO, the internal type of parallelism and completions should not be pointers

@mikedanese
Member

mikedanese commented Jun 24, 2016

> IMO, the internal type of parallelism and completions should not be pointers

@erictune if you say this is the right way to fix it, I will go ahead and do it. Alternatively, we can fix this by hardening our validation and defaulting, but there's always the chance that something else will slip in. I don't think there's any reason for those fields to ever be pointers by the time they reach the internal code.

@lavalamp
Member Author

This isn't new in 1.3 and I don't think we're going to hold the release for it, but if you get a fix merged by tomorrow's burndown meeting that's OK with me, too.

@mikedanese
Member

Easy fix in #28008, but I still think that the internal types should become non-pointers. That's a bigger change.

mikedanese removed this from the v1.3 milestone on Jun 24, 2016
@mikedanese
Member

Actually, I think I was mistaken and the behavior I observed is by design. I'll revisit tomorrow.

@soltysh
Contributor

soltysh commented Jun 24, 2016

I've commented on Mike's PR.

@soltysh
Contributor

soltysh commented Jun 24, 2016

@mikedanese with your example (every run of the job is a failure), the job will keep creating as many pods as you allow it to. We don't have any mechanism preventing you from doing so. We could add something like that, though.

@soltysh
Contributor

soltysh commented Jun 24, 2016

I've carefully checked the job logic once again; we have two necessary checks in place:

  1. finish a job when at least one successful pod is there and no more active pods (see here);
  2. not to start new pods when there's at least one successful execution (see here).

With the current logic, though, it can easily be overused (see Mike's example, where each pod execution fails); on the other hand, any job will create tons of pods when every execution is failing. That's how jobs were designed. Unless we know the exact use case that caused this (@lavalamp, any chance of that?), it's hard to guess.

Still, although it's valid to have some safety net, I'm thinking (we've already struggled with a similar use case some time ago, with scaling RCs) this is what quota is for, right?
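
A minimal sketch of the kind of quota that would apply here (the namespace name and limit are arbitrary): a namespace-scoped ResourceQuota with a hard `pods` count, so that once the cap is hit, further pod creations in the namespace are rejected by the ResourceQuota admission plugin and a flood of pods stuck in Pending stays bounded.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-cap
  namespace: batch-jobs   # arbitrary example namespace
spec:
  hard:
    pods: "100"           # reject new pods once 100 non-terminal pods exist in the namespace
```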

@lavalamp
Member Author

It's worth noting that these pods were Pending, not Failed.

@soltysh
Contributor

soltysh commented Jun 24, 2016

Hmm.... that's interesting... I'll try to dig into it next week.

soltysh self-assigned this and unassigned mikedanese on Jun 24, 2016
@randalloveson

I believe I'm seeing something a lot like this. No one job with anywhere near 84k pods, but many jobs did create multiple pods while their existing pods were in a Pending state (this is with parallelism 1, completions 1, and container restart policy Never). I would think that a job would only make a new pod if it were sure the old one had failed or otherwise wasn't going to make any more containers, but it appears that, at least under extremely heavy load (~4,000 jobs/hour), jobs will make more pods to fulfill themselves even if the old ones are Pending.

@soltysh
Contributor

soltysh commented Jul 7, 2016

@randalloveson that is a valid clue; I'll check whether our tests cover the case where pods are stuck in a Pending state.

@soltysh
Contributor

soltysh commented Jul 7, 2016

It doesn't look like it has to do with pending pods; see #28597, where I've added that test case explicitly. @randalloveson if you can provide a reproducer, that would be very helpful.

@randalloveson

@soltysh I believe the reproduction condition might be when requests to etcd for Pod status start running up against some timeout used by the controller. When that happens, it seems like the Job that spawned the Pod assumes the worst and makes a new Pod, exacerbating the problem.

Either that, or Jobs really will just ignore an existing Pod they created that is known to be in a Pending state and make more, which, were it the case, would be a really destructive design.

@soltysh
Contributor

soltysh commented Jul 7, 2016

I'm starting to wonder if this has something to do with #28486. I need to do more thorough testing of that case.

k8s-github-robot pushed a commit that referenced this issue Jul 8, 2016
Automatic merge from submit-queue

Added test case covering pending pods in syncJob

@randalloveson suggested in #27997 that we might not take pending pods into consideration; while checking that, I wrote an additional test case for `syncJob`.

@randalloveson @erictune ptal

k8s-github-robot added the needs-sig label on May 31, 2017
@0xmichalis
Contributor

/sig apps

k8s-ci-robot added the sig/apps label on Jun 23, 2017
0xmichalis added the area/workload-api/job and kind/bug labels and removed the needs-sig and team/control-plane (deprecated) labels on Jun 23, 2017
@0xmichalis
Contributor

Not sure this was ever fixed

0xmichalis reopened this on Jun 23, 2017
@mikedanese
Member

Do you want to fix the operator error mentioned here: #27997 (comment) ?

@0xmichalis
Contributor

We can identify failures and ratelimit the culprits. This is the same problem as when you specify a Deployment with a bad node selector.

@soltysh
Contributor

soltysh commented Jul 31, 2017

This will be fixed when we introduce the job failure policy.

@splittingfield

splittingfield commented Aug 25, 2017

Do we have an update on this job failure policy? I think we have hit a related issue.

We recently found an issue in which a Job had an init container that was exiting with a non-zero code (there was a bug in the init container, and it was actually impossible for it to exit cleanly). This led to the Job being restarted repeatedly. Soon, the entire K8s cluster (1.6) across all namespaces became sluggish and unresponsive, requiring manual deletion, etc.

We understand that restartPolicy: Never does not apply to Jobs (#20255). However, we are unable to set activeDeadlineSeconds, as these init containers are moving large amounts of data and the Job itself can be arbitrarily long (easily multiple days if training a large deep learning model).

I am happy to help collect more data to debug this issue if there is anything you would want us to share. (We are currently collecting data ourselves for a post-mortem.)

@soltysh
Contributor

soltysh commented Aug 29, 2017

There's kubernetes/community#583 addressing this issue, along with the implementation in #48075 and #51153.
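
For reference, the backoff policy from #48075 eventually surfaced as `spec.backoffLimit` on the Job. A minimal sketch of Mike's always-failing example with a retry cap added; the parallelism and limit values here are arbitrary:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  backoffLimit: 4          # mark the Job as failed after the retry budget is exhausted, instead of retrying forever
  parallelism: 2
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000); exit 1"]
      restartPolicy: Never
```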

k8s-github-robot pushed a commit that referenced this issue Sep 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920)

[API] Feature/job failure policy

**What this PR does / why we need it**: Implements the Backoff policy and failed pod limit defined in kubernetes/community#583

**Which issue this PR fixes**: 
fixes #27997, fixes #30243

**Special notes for your reviewer**:
This is a WIP PR; I updated the API batchv1.JobSpec in order to prepare the backoff policy implementation in the JobController.

**Release note**:
```release-note
Add backoff policy and failed pod limit for a job
```