
Rolling updates should fail early if a pod fails to start #18568

Open
jsravn opened this issue Dec 11, 2015 · 21 comments
Labels
area/app-lifecycle area/workload-api/deployment lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@jsravn
Contributor

jsravn commented Dec 11, 2015

In 1.0.x, rolling update didn't block, so we could check for things like a failed pod ourselves while waiting for the update to complete.

In 1.1.x, rolling update blocks for up to the timeout. This is great, since we previously had to script it externally. However, it waits out the entire timeout even if the update fails immediately. For instance, if you specify a non-existent image, the pod reports a PullImageError; at that point the rolling update could abort immediately, but instead it blocks for the full 5-minute timeout. When developing and deploying frequently, this 5-minute timeout becomes painful.

I suggest that rolling-update fail immediately if any of the new pods is not in a 'Pending' or 'Running' state. I'm not sure whether there is a well-defined set of non-error states we can use.
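A rough sketch of what such a fast-fail check could look like, written against today's client-go API rather than the 1.1 codebase this comment refers to. The label selector argument and the set of "fatal" waiting reasons are assumptions, and, as discussed further down the thread, the reason strings are runtime-dependent and not a stable API:

```go
package rollcheck

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Assumed set of waiting reasons that look unrecoverable for a rollout.
var fatalWaitingReasons = map[string]bool{
	"ErrImagePull":     true,
	"ImagePullBackOff": true,
	"InvalidImageName": true,
}

// checkNewPods returns an error as soon as any pod matching the new
// controller's selector looks permanently stuck, instead of waiting for
// the full rolling-update timeout.
func checkNewPods(ctx context.Context, c kubernetes.Interface, ns, selector string) error {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodFailed {
			return fmt.Errorf("pod %s failed: %s", pod.Name, pod.Status.Reason)
		}
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil && fatalWaitingReasons[w.Reason] {
				return fmt.Errorf("pod %s container %s stuck: %s (%s)",
					pod.Name, cs.Name, w.Reason, w.Message)
			}
		}
	}
	return nil
}
```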

@mqliang
Contributor

mqliang commented Dec 12, 2015

/subscribe

@jsravn
Contributor Author

jsravn commented Dec 14, 2015

I was trying to see if I could put together a PoC for this, but it doesn't look like there is a programmatic way to check for a failed image pull (as of the 1.1 codebase, at least). The pod always stays in the Pending state. The only indicator of a problem is the container state's reason string field, but that can be almost anything depending on the container manager (and has changed from 1.1 to master for the image pull error on Docker), so I don't think I can rely on it.

It would work for the Failed state, but I believe that only happens if the application in the container fails; a failed image pull just stays stuck in Pending. It feels like we need another state, such as FailedDeploy, or we could lump it into Failed. I don't think a pod should be considered Pending if the error is unrecoverable.

@jsravn
Contributor Author

jsravn commented Dec 14, 2015

The more I think about it, the more I think it's a bug in kubelet with docker. I think it should report the pod as Failed if the image pull fails. Then we can have some sensible policies to deal with the failure (whether it's to abort the rolling update, or to try pulling the image again).

@mqliang
Contributor

mqliang commented Dec 15, 2015

/subscribe

@nikhiljindal nikhiljindal added area/app-lifecycle sig/node Categorizes an issue or PR as relevant to SIG Node. labels Dec 17, 2015
@0xmichalis
Contributor

> The more I think about it, the more I think it's a bug in kubelet with docker. I think it should report the pod as Failed if the image pull fails. Then we can have some sensible policies to deal with the failure (whether it's to abort the rolling update, or to try pulling the image again).

+1

cc: @vishh @yujuhong

@vishh
Contributor

vishh commented Jan 15, 2016

I don't think the container can be marked as Failed, because it is possible for a retry of an image pull to succeed. I do agree that we need to provide some means to determine whether a container is failing to start because of image-related issues. cc @bgrant0607

@bgrant0607 bgrant0607 added team/ux and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 15, 2016
@bgrant0607
Member

@yujuhong
Contributor

I agree with @vishh that image-pull failures can be transient and retries are needed. Perhaps we can do a better job of distinguishing permanent errors, such as the image not existing, and surfacing them.

@bgrant0607 bgrant0607 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jan 22, 2016
@bgrant0607 bgrant0607 added this to the next-candidate milestone Jan 22, 2016
@bgrant0607
Member

See also #19343

@bgrant0607
Member

I also agree that image pull failures could be due to races and transient failures.

I like the idea of surfacing this information to the user through status and/or events.

We intend "kubectl rollout status" to report rollout status updates as they occur. This could be one type of update reported.

@vishh
Contributor

vishh commented Jan 22, 2016

@bgrant0607: Should the error be exposed via a new ContainerState?

@bgrant0607
Member

@vishh No. It's a Reason for ContainerStateWaiting.
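For reference, a minimal sketch of where that Reason surfaces in the pod status (field names from the current core/v1 types; the reason values in the comment are illustrative examples, not a stable contract):

```go
package rollcheck

import corev1 "k8s.io/api/core/v1"

// waitingReasons shows the field path: Pod -> Status.ContainerStatuses[i]
// -> State.Waiting -> Reason/Message; no new ContainerState is involved.
func waitingReasons(pod *corev1.Pod) map[string]string {
	reasons := map[string]string{}
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil {
			// e.g. "ErrImagePull" or "ImagePullBackOff", with a free-form
			// Message filled in by the container runtime.
			reasons[cs.Name] = w.Reason
		}
	}
	return reasons
}
```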

@vishh
Contributor

vishh commented Jan 22, 2016

Is the Reason string considered a stable API for users to build on top of?


@bgrant0607 bgrant0607 removed the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 26, 2016
@bgrant0607
Member

I view this as similar to #2529. The system should degrade gracefully in the presence of failures.

We're not continuing to work on kubectl rolling-update, but we are working to address this type of thing in Deployment.

Currently we stop progressing the rollout when we hit maxUnavailable. That allows the user to configure the desired amount of parallelism and risk.

If a container/pod simply fails to become ready, we can't currently distinguish that from normal behavior, but the idea of "permanently failed" deployments will add a progress timeout that could be used to detect this: #19343.

That doesn't enable fast failure in the case of observed problems, admittedly.

Since it's hard to distinguish permanent from transient failures, we'd probably need to add some kind of flag(s) to request fast failure in the case of certain problems (in addition to the timeout).

Failure to pull the image isn't the only failure scenario. For instance, crash loops and deadlocks of newly updated images or configurations (env, command, args, secrets, configmaps, etc.) are also common.

@vishh So far, I've resisted treating Reason as a stable API. At minimum, I reserve the right to create new, more specific Reason values at any time. If we need to, we could add image-related failure info to ContainerStatus, much as we've added ExitCode, RestartCount, etc. It may be a little tricky to represent due to all the possible means of getting images onto the node (external copy, pull, mount, peer-to-peer copy, etc.). Also, sufficiently slow progress might as well be considered failure.
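As a forward-looking sketch of how the #19343 progress timeout can be consumed once it lands: the deployment controller records a Progressing condition whose reason flips to ProgressDeadlineExceeded when the deadline elapses, so a caller can fail fast on that instead of polling pods. Written against the current apps/v1 types, which postdate this comment, so treat it as illustrative:

```go
package rollcheck

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// rolloutFailed reports whether the Deployment's Progressing condition has
// been marked false with reason ProgressDeadlineExceeded, i.e. the progress
// deadline discussed in #19343 has tripped.
func rolloutFailed(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == "ProgressDeadlineExceeded" {
			return true
		}
	}
	return false
}
```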

@bgrant0607
Member

Infant mortality was also discussed a little here: #19343 (comment)

@0xmichalis
Contributor

@smarterclayton mentioned somewhere else that ImagePull errors may be transient.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2018
@bgrant0607
Member

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jan 22, 2018
@wengych

wengych commented May 15, 2018

/subscribe

@lukeschlather

Is there an equivalent issue for kubectl rollout with Deployments? They seem to have the same behavior of surfacing essentially no useful information when a pod fails to start, for any reason.

Projects
Status: Needs Triage
Development

No branches or pull requests