Rolling updates should fail early if a pod fails to start #18568
Comments
/subscribe
I was trying to see if I could put together a PoC for this, but it doesn't look like there is a programmatic way to check for a failed image pull (as of the 1.1 codebase at least). It is always in […]. It would work for […].
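A minimal sketch of what such a check could look like against today's pod API, using the current core/v1 types (which postdate the 1.1 codebase discussed above); the reason strings are whatever kubelet happens to set and, as noted later in this thread, are not a stable API:

```go
// Sketch only: spotting an image-pull failure from pod status using the
// current k8s.io/api/core/v1 types (which postdate the 1.1 codebase above).
package pullcheck

import corev1 "k8s.io/api/core/v1"

// imagePullFailed reports whether any container in the pod is stuck in a
// Waiting state whose reason looks like an image-pull error.
func imagePullFailed(pod *corev1.Pod) bool {
	for _, st := range pod.Status.ContainerStatuses {
		w := st.State.Waiting
		if w == nil {
			continue
		}
		// "ErrImagePull"/"ImagePullBackOff" are what recent kubelets set;
		// the 1.1-era "PullImageError" is included for completeness.
		switch w.Reason {
		case "ErrImagePull", "ImagePullBackOff", "PullImageError":
			return true
		}
	}
	return false
}
```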
The more I think about it, the more I think it's a bug in kubelet with docker. I think it should report the pod as Failed.
/subscribe
+1
I don't think the container can be marked as failed.
I agree with @vishh that image pulling failures could be transient and retries are needed. Perhaps we can do a better job of distinguishing permanent errors, such as the image not existing, and surface them.
See also #19343
I also agree that image pull failures could be due to races and transient failures. I like the idea of surfacing this information to the user through status and/or events. We intend "kubectl rollout status" to report rollout status updates as they occur. This could be one type of update reported.
@bgrant0607: Should the error be exposed via a new […]?
@vishh No. It's a Reason for ContainerStateWaiting.
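For reference, an abridged sketch of where that Reason lives in the pod status; field names follow the current core/v1 package, and the real ContainerState also carries Running and Terminated members, omitted here:

```go
// Abridged sketch of the relevant core/v1 pod status types, to show where
// the Reason string lives. The real ContainerState also has Running and
// Terminated members; they are omitted here.
package podstate

type ContainerStateWaiting struct {
	// Reason is a brief, machine-oriented string such as "ErrImagePull".
	// Per the discussion above, its values are not treated as a stable API.
	Reason string `json:"reason,omitempty"`
	// Message is a human-readable explanation of why the container waits.
	Message string `json:"message,omitempty"`
}

type ContainerState struct {
	// Waiting is non-nil (and Reason populated) while the container cannot
	// start, e.g. during a failing image pull.
	Waiting *ContainerStateWaiting `json:"waiting,omitempty"`
}
```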
Is Reason considered to be a stable API?
I view this as similar to #2529. The system should degrade gracefully in the presence of failures. We're not continuing to work on kubectl rolling-update, but we are working to address this type of thing in Deployment.

Currently we stop progressing the rollout when we hit maxUnavailable. That allows the user to configure the desired amount of parallelism and risk. If a container/pod simply fails to become ready, we can't currently distinguish that from normal behavior, but the idea of "permanently failed" deployments will add a progress timeout that could be used to detect this: #19343. That doesn't enable fast failure in the case of observed problems, admittedly. Since it's hard to distinguish permanent from transient failures, we'd probably need to add some kind of flag(s) to request fast failure in the case of certain problems (in addition to the timeout).

Failure to pull the image isn't the only failure scenario. For instance, crash loops and deadlocks of newly updated images or configurations (env, command, args, secrets, configmaps, etc.) are also common.

@vishh So far, I've resisted treating Reason as a stable API. At minimum, I reserve the right to create new, more specific Reason values at any time. If we need to, we could add image-related failure info to ContainerStatus, much as we've added ExitCode, RestartCount, etc. It may be a little tricky to represent due to all the possible means of getting images onto the node (external copy, pull, mount, peer-to-peer copy, etc.). Also, sufficiently slow progress might as well be considered failure.
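The progress-timeout idea referenced here (#19343) later surfaced as the Deployment's spec.progressDeadlineSeconds and a Progressing status condition. A rough sketch, assuming the current apps/v1 API and client-go rather than anything that existed at the time of this comment, of how a caller could detect a stalled rollout:

```go
// Sketch: detect a Deployment whose rollout has stopped progressing.
// Assumes the apps/v1 API and current client-go, both of which postdate
// this comment; this is not the rolling-update or Deployment controller code.
package rolloutcheck

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rolloutStalled returns true once the Progressing condition reports that
// spec.progressDeadlineSeconds was exceeded, i.e. the Deployment is
// considered permanently failed in the sense of #19343.
func rolloutStalled(ctx context.Context, cs kubernetes.Interface, ns, name string) (bool, error) {
	d, err := cs.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentProgressing &&
			c.Status == corev1.ConditionFalse &&
			c.Reason == "ProgressDeadlineExceeded" {
			return true, nil
		}
	}
	return false, nil
}
```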
Infant mortality was also discussed a little here: #19343 (comment)
@smarterclayton mentioned somewhere else that ImagePull errors may be transient.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
/subscribe
Is there an equivalent issue for kubectl rollout with deployments? They seem to have the same behavior of basically surfacing no useful information when a pod fails to start for any reason.
In 1.0.x, rolling update didn't block, so we could check things like a failed pod while waiting for the update to complete.
In 1.1.x, rolling update does block, up to the timeout amount. This is great since we had to script it externally before. However, it waits the entire timeout even if the update fails immediately. For instance, if you specify a non-existent image, the pod will report a PullImageError. At that point the rolling update could abort immediately; instead it blocks for the 5-minute timeout. When frequently developing and deploying, this 5-minute timeout becomes painful.

I suggest rolling-update fail immediately if any of the new pods are not in a 'Pending' or 'Running' state. Not sure if there is a set of non-error states we can use.
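A rough illustration of that fast-fail idea, not kubectl's actual rolling-update code: the helper below, its label-selector handling, and the simplified completion check (pod phase only, no readiness) are assumptions for the sketch, and it uses current client-go signatures rather than the 1.1-era client.

```go
// Illustrative only: poll the pods created by the new rollout and abort
// early on an obviously fatal state instead of waiting out the full timeout.
// waitForRollout, its selector handling, and the simplified completion check
// (pod phase only, no readiness) are assumptions for this sketch, not
// kubectl's actual rolling-update implementation.
package rollcheck

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func waitForRollout(ctx context.Context, cs kubernetes.Interface, ns, selector string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return err
		}
		allRunning := len(pods.Items) > 0
		for _, p := range pods.Items {
			// Fail fast on anything outside 'Pending' or 'Running'.
			if p.Status.Phase != corev1.PodPending && p.Status.Phase != corev1.PodRunning {
				return fmt.Errorf("pod %s entered phase %s; aborting", p.Name, p.Status.Phase)
			}
			// Also treat an image-pull error as fatal rather than timing out.
			for _, st := range p.Status.ContainerStatuses {
				if w := st.State.Waiting; w != nil && (w.Reason == "ErrImagePull" || w.Reason == "PullImageError") {
					return fmt.Errorf("pod %s cannot pull image: %s", p.Name, w.Message)
				}
			}
			if p.Status.Phase != corev1.PodRunning {
				allRunning = false
			}
		}
		if allRunning {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out after %s waiting for rollout", timeout)
}
```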