Spinning up a large number of erroneous pods #64

Closed
swatisehgal opened this issue Feb 2, 2017 · 4 comments

Comments

@swatisehgal
Contributor

In case of certain configuration errors, e.g. SSL misconfiguration, the Kubernetes job fails to complete successfully. It keeps creating new pods and eventually ends up with ~5000 pods in the Error state.
It should stop or time out. A simple fix could be setting the RestartPolicy to OnFailure or Never. Are there any other recommendations for gracefully handling this error?
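
For illustration, the workaround suggested above would look roughly like the following in the Job's pod template. This is a minimal sketch assuming a generic Job manifest; the name and image are placeholders, not the actual NFD deployment files. Note that a Job's pod template only accepts Never or OnFailure as restartPolicy.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: node-feature-discovery        # placeholder name
spec:
  template:
    spec:
      containers:
      - name: node-feature-discovery
        image: node-feature-discovery:latest   # placeholder image reference
      # Jobs only allow Never or OnFailure here; Always is not permitted.
      restartPolicy: Never
```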

@balajismaniam
Contributor

@swatisehgal Sorry for the delay in response. I don't think setting restartPolicy=Never will resolve this issue. We set job.spec.completions to a positive value. As a result, if a pod spawned by the job fails, the pod will be restarted even if restartPolicy = Never. This is my expectation but I haven't tested it yet. Also, see https://kubernetes.io/docs/user-guide/jobs/#handling-pod-and-container-failures.
Did you get a chance to test if setting restartPolicy=Never resolves this issue? I will check if there is any good way to handle this.
Also, we expect the Kubernetes cluster to be configured properly before using NFD. This is only an issue if containers or pods fail due to misconfiguration.
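
A minimal sketch of the interaction described above (names and values are placeholders): restartPolicy: Never only stops the kubelet from restarting containers inside an existing pod, while spec.completions makes the Job controller keep creating replacement pods until enough pods have succeeded.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: node-feature-discovery        # placeholder name
spec:
  # The Job only finishes once this many pods have succeeded, so the
  # controller keeps creating replacement pods for every failure...
  completions: 1
  template:
    spec:
      containers:
      - name: node-feature-discovery
        image: node-feature-discovery:latest   # placeholder image reference
      # ...even with restartPolicy: Never, which only prevents the kubelet
      # from restarting containers inside an already-created pod.
      restartPolicy: Never
```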

@okartau
Contributor

okartau commented Feb 13, 2018

Does the backoff-policy functionality, which was added in Sep 2017, cover this issue?
With backoffLimit set to a low value, there should not be a large number of repeatedly failing pods.
As the default value of backoffLimit is 6, the number of restarts should be limited to 6 in the default config.
Can that be verified with the original reporter's case, using some recent k8s version?
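
A sketch of the setting being referred to, assuming the same kind of generic Job manifest as above (name, image, and the chosen limit are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: node-feature-discovery        # placeholder name
spec:
  # Number of retries before the Job is marked as failed; defaults to 6,
  # so failed pods no longer pile up into the thousands.
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: node-feature-discovery
        image: node-feature-discovery:latest   # placeholder image reference
      restartPolicy: Never
```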

@marquiz
Contributor

marquiz commented Jul 11, 2018

@swatisehgal: any comments on this? I think the job BackoffLimit introduced in Kubernetes v1.8 should mitigate this as @okartau described.

I would be inclined to close this issue.

@marquiz
Contributor

marquiz commented Aug 17, 2018

Closing this for now. Please re-open if you still see this issue

marquiz closed this as completed Aug 17, 2018