
Rolling updates should fail early if a pod fails to start #18568

Open
jsravn opened this issue Dec 11, 2015 · 21 comments
Labels
area/app-lifecycle area/workload-api/deployment lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@jsravn
Contributor

jsravn commented Dec 11, 2015

In 1.0.x, rolling update didn't block, so we could check for things like a failed pod ourselves while waiting for the update to complete.

In 1.1.x, rolling update blocks for up to the timeout. This is great, since we previously had to script it externally. However, it waits out the entire timeout even if the update fails immediately. For instance, if you specify a non-existent image, the pod reports a PullImageError; at that point the rolling update could abort immediately, but instead it blocks for the full 5-minute timeout. When developing and deploying frequently, this 5-minute timeout becomes painful.

I suggest that rolling-update fail immediately if any of the new pods is not in a 'Pending' or 'Running' state. I'm not sure whether there is a well-defined set of non-error states we can use.
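A rough sketch of what such a fast-fail check could look like, written against today's client-go API rather than the 1.1 codebase this comment refers to. The label selector argument and the set of "fatal" waiting reasons are assumptions, and, as discussed further down the thread, the reason strings are runtime-dependent and not a stable API:

```go
package rollcheck

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Assumed set of waiting reasons that look unrecoverable for a rollout.
var fatalWaitingReasons = map[string]bool{
	"ErrImagePull":     true,
	"ImagePullBackOff": true,
	"InvalidImageName": true,
}

// checkNewPods returns an error as soon as any pod matching the new
// controller's selector looks permanently stuck, instead of waiting for
// the full rolling-update timeout.
func checkNewPods(ctx context.Context, c kubernetes.Interface, ns, selector string) error {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodFailed {
			return fmt.Errorf("pod %s failed: %s", pod.Name, pod.Status.Reason)
		}
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil && fatalWaitingReasons[w.Reason] {
				return fmt.Errorf("pod %s container %s stuck: %s (%s)",
					pod.Name, cs.Name, w.Reason, w.Message)
			}
		}
	}
	return nil
}
```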

@mqliang
Contributor

mqliang commented Dec 12, 2015

/subscribe

@jsravn
Contributor Author

jsravn commented Dec 14, 2015

I was trying to see if I could put together a PoC for this, but it doesn't look like there is a programmatic way to check for a failed image pull (as of the 1.1 codebase, at least). The pod always stays in the Pending state. The only indicator of a problem is the container state's reason string field, but that can be almost anything depending on the container manager (and has changed from 1.1 to master for the image pull error on Docker), so I don't think I can rely on it.

It would work for the Failed state, but I believe that only happens if the application in the container fails; a failed image pull just stays stuck in Pending. It feels like we need another state, such as FailedDeploy, or we could lump it into Failed. I don't think a pod should be considered Pending if the error is unrecoverable.

@jsravn
Contributor Author

jsravn commented Dec 14, 2015

The more I think about it, the more I think it's a bug in kubelet with docker. I think it should report the pod as Failed if the image pull fails. Then we can have some sensible policies to deal with the failure (whether it's to abort the rolling update, or to try pulling the image again).

@mqliang
Contributor

mqliang commented Dec 15, 2015

/subscribe

@nikhiljindal nikhiljindal added area/app-lifecycle sig/node Categorizes an issue or PR as relevant to SIG Node. labels Dec 17, 2015
@0xmichalis
Contributor

> The more I think about it, the more I think it's a bug in kubelet with docker. I think it should report the pod as Failed if the image pull fails. Then we can have some sensible policies to deal with the failure (whether it's to abort the rolling update, or to try pulling the image again).

+1

cc: @vishh @yujuhong

@vishh
Contributor

vishh commented Jan 15, 2016

I don't think the container can be marked as Failed, because it is possible for a retry of an image pull to succeed. I do agree that we need to provide some means to determine whether a container is failing to start because of image-related issues. cc @bgrant0607

@bgrant0607 bgrant0607 added team/ux and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 15, 2016
@bgrant0607
Member

@yujuhong
Contributor

I agree with @vishh that image-pull failures can be transient and retries are needed. Perhaps we can do a better job of distinguishing permanent errors, such as the image not existing, and surfacing them.

@bgrant0607 bgrant0607 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jan 22, 2016
@bgrant0607 bgrant0607 added this to the next-candidate milestone Jan 22, 2016
@bgrant0607
Member

See also #19343

@bgrant0607
Member

I also agree that image pull failures could be due to races and transient failures.

I like the idea of surfacing this information to the user through status and/or events.

We intend "kubectl rollout status" to report rollout status updates as they occur. This could be one type of update reported.

@vishh
Contributor

vishh commented Jan 22, 2016

@bgrant0607: Should the error be exposed via a new ContainerState?

@bgrant0607
Member

@vishh No. It's a Reason for ContainerStateWaiting.
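For reference, a minimal sketch of where that Reason surfaces in the pod status (field names from the current core/v1 types; the reason values in the comment are illustrative examples, not a stable contract):

```go
package rollcheck

import corev1 "k8s.io/api/core/v1"

// waitingReasons shows the field path: Pod -> Status.ContainerStatuses[i]
// -> State.Waiting -> Reason/Message; no new ContainerState is involved.
func waitingReasons(pod *corev1.Pod) map[string]string {
	reasons := map[string]string{}
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil {
			// e.g. "ErrImagePull" or "ImagePullBackOff", with a free-form
			// Message filled in by the container runtime.
			reasons[cs.Name] = w.Reason
		}
	}
	return reasons
}
```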

@vishh
Contributor

vishh commented Jan 22, 2016

Is the Reason string considered a stable API for users to build on top of?


@bgrant0607 bgrant0607 removed the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 26, 2016
@bgrant0607
Member

I view this as similar to #2529. The system should degrade gracefully in the presence of failures.

We're not continuing to work on kubectl rolling-update, but we are working to address this type of thing in Deployment.

Currently we stop progressing the rollout when we hit maxUnavailable. That allows the user to configure the desired amount of parallelism and risk.

If a container/pod simply fails to become ready, we can't currently distinguish that from normal behavior, but the idea of "permanently failed" deployments will add a progress timeout that could be used to detect this: #19343.

That doesn't enable fast failure in the case of observed problems, admittedly.

Since it's hard to distinguish permanent from transient failures, we'd probably need to add some kind of flag(s) to request fast failure in the case of certain problems (in addition to the timeout).

Failure to pull the image isn't the only failure scenario. For instance, crash loops and deadlocks of newly updated images or configurations (env, command, args, secrets, configmaps, etc.) are also common.

@vishh So far, I've resisted treating Reason as a stable API. At minimum, I reserve the right to create new, more specific Reason values at any time. If we need to, we could add image-related failure info to ContainerStatus, much as we've added ExitCode, RestartCount, etc. It may be a little tricky to represent due to all the possible means of getting images onto the node (external copy, pull, mount, peer-to-peer copy, etc.). Also, sufficiently slow progress might as well be considered failure.
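As a forward-looking sketch of how the #19343 progress timeout can be consumed once it lands: the deployment controller records a Progressing condition whose reason flips to ProgressDeadlineExceeded when the deadline elapses, so a caller can fail fast on that instead of polling pods. Written against the current apps/v1 types, which postdate this comment, so treat it as illustrative:

```go
package rollcheck

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// rolloutFailed reports whether the Deployment's Progressing condition has
// been marked false with reason ProgressDeadlineExceeded, i.e. the progress
// deadline discussed in #19343 has tripped.
func rolloutFailed(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == "ProgressDeadlineExceeded" {
			return true
		}
	}
	return false
}
```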

@bgrant0607
Member

Infant mortality was also discussed a little here: #19343 (comment)

@0xmichalis
Contributor

@smarterclayton mentioned somewhere else that ImagePull errors may be transient.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2018
@bgrant0607
Member

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jan 22, 2018
@wengych

wengych commented May 15, 2018

/subscribe

@lukeschlather

Is there an equivalent issue for kubectl rollout with Deployments? They seem to have the same behavior of surfacing essentially no useful information when a pod fails to start, for any reason.

Projects
Status: Needs Triage
Development

No branches or pull requests