Case report: Job somehow made 84k pods #27997
I know how to make infinite pods:

$ cat docs/user-guide/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 100
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000); exit 1"]
      restartPolicy: Never
$ kubectl create -f docs/user-guide/job.yaml
IMO, the internal types of parallelism and completions should not be pointers.
@erictune if you say this is the right way to fix it, I will go ahead and do it. Alternatively, we can fix this by hardening our validation and defaulting, but there's always the opportunity that something else will slip in. I don't think there's any reason for those fields to ever be pointers by the time they reach the internal code.
This isn't new in 1.3 and I don't think we're going to hold the release for it, but if you get a fix merged by tomorrow's burndown meeting that's OK with me, too.
Easy fix in #28008, but I still think that the internal types should become non-pointers. That's a bigger change.
Actually, I think I was mistaken and the behavior I observed is by design. I'll revisit tomorrow.
I've commented on Mike's PR.
@mikedanese with your example (each job's run is a failure), the job will keep creating as many pods as you allow it to. We don't have any mechanism preventing you from doing so. We could add something like that, though.
I've carefully checked the job logic once again; we have two necessary checks in place:
With the current logic, though, it can easily be overused (see Mike's example, where each pod execution fails), but on the other hand every job will create tons of pods when each execution is failing. That's how jobs were designed. Unless we know the exact use case that caused this (@lavalamp, any chance for that?), it's hard to guess. And although it's still valid to have some safety net, I'm thinking (we've already struggled with a use case similar to this one some time ago, with scaling RCs) that this is what quota is for, right?
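For illustration, a minimal sketch of the kind of quota I mean, assuming a namespace-scoped ResourceQuota that caps the total pod count (the name, namespace, and limit below are made-up examples):

```yaml
# Hypothetical example: cap how many pods a namespace may hold at once,
# so a runaway Job cannot create pods without bound.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-cap          # made-up name
  namespace: batch-jobs  # made-up namespace
spec:
  hard:
    pods: "500"          # made-up limit
```

With a quota like this in place, pod creation beyond the limit is rejected by the API server, so a misbehaving Job runs out of headroom instead of flooding the cluster.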
It's worth noting that these pods were Pending, not Failed. |
Hmm.... that's interesting... I'll try to dig into it next week. |
I believe I'm seeing something a lot like this. No one job with anywhere near 48k pods, but many jobs did create multiple pods when their existing pods were in a Pending state (this is with parallelism 1, completions 1, and container restart policy Never). I would think that the jobs would only make a new pod if they were sure the old one had failed or otherwise wasn't going to make any more containers, but it appears that, at least under extremely heavy load (~4,000 jobs/hour), they will make more pods to fulfill themselves even if the old ones are Pending.
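To make the reported shape concrete, here is a sketch of a Job like the ones described above, with placeholder names and a placeholder image (the actual workload isn't given in the report):

```yaml
# Sketch of the reported configuration: a single-pod, single-completion Job.
# The expectation is that at most one pod exists for it until that pod has
# clearly failed.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job             # placeholder name
spec:
  parallelism: 1
  completions: 1
  template:
    spec:
      containers:
      - name: worker                 # placeholder name
        image: example/worker:1.0    # placeholder image
      restartPolicy: Never
```

Under heavy load, many Jobs of this shape were observed with more than one pod, all still Pending.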
@randalloveson that is a valid clue; I'll check whether our tests cover the case where pods are stuck in the Pending state.
It doesn't look like it had to do with pending pods; see #28597, where I've added that test case explicitly. @randalloveson can you provide a reproducer? That would be very helpful.
@soltysh I believe the reproduction conditions might be when requests to etcd for Pod status start exceeding some timeout used by the controller. When that happens, it seems like the Job that spawned the Pod assumes the worst and makes a new Pod, exacerbating the problem. That, or Jobs really will just ignore an existing Pod they created that is known to be in a Pending state and make more, which, were it the case, would be a really destructive design.
I'm starting to wonder if this has something to do with #28486. I need to do more thorough testing in that case.
Automatic merge from submit-queue

Added test case covering pending pods in syncJob

@randalloveson suggested in #27997 that we might not take pending pods into consideration; while checking that, I wrote an additional test case for `syncJob`. @randalloveson @erictune ptal
/sig apps
Not sure this was ever fixed
Do you want to fix the operator error mentioned here: #27997 (comment)?
We can identify failures and rate-limit the culprits. This is the same problem as when you specify a Deployment with a bad node selector.
This will be fixed when we introduce the job failure policy.
Do we have an update on this job failure policy? I think we have hit a related issue. We recently found a case in which a Job had an init container that was exiting with a non-zero code (there was a bug in the init container and it would actually be impossible for it to exit cleanly). This led to the Job being restarted repeatedly. Soon, the entire K8s cluster (1.6) across all namespaces became sluggish and unresponsive, requiring manual deletion, etc. I am happy to help collect more data to debug this issue if there is anything you would want us to share (we are currently collecting data ourselves for a post-mortem).
There's kubernetes/community#583 addressing this issue, along with the implementation in #48075 and #51153.
Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920)

[API] Feature/job failure policy

**What this PR does / why we need it**: Implements the backoff policy and failed pod limit defined in kubernetes/community#583

**Which issue this PR fixes**: fixes #27997, fixes #30243

**Special notes for your reviewer**: This is a WIP PR; I updated the API batchv1.JobSpec in order to prepare the backoff policy implementation in the JobController.

**Release note**:
```release-note
Add backoff policy and failed pod limit for a job
```
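For anyone following along, a minimal sketch of how the new knob is meant to be used, assuming the `backoffLimit` field from the proposal above plus the existing `activeDeadlineSeconds` field of the Job spec; the values here are arbitrary examples:

```yaml
# Sketch: bound retries for a failing Job. Once backoffLimit is exceeded,
# the Job is marked failed and the controller stops creating new pods.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi  # reusing the reproduction from the top of this issue
spec:
  backoffLimit: 4             # example value: give up after 4 failed retries
  activeDeadlineSeconds: 600  # example value: extra bound on total runtime
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000); exit 1"]
      restartPolicy: Never
```

A cap like this is exactly what was missing in the original report, where nothing stopped the controller from retrying indefinitely.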
Observed a case where a Job somehow created 84k pods. I think the user had set parallelism to 0 to pause it by the time I saw it. So I'm not sure how exactly to reproduce, but we should think about implementing some sanity limits. P1 since I was paged about this.