RFC: Allow perm-failed deployments #14519
Comments
Could permafail be represented as a condition on the deployment? |
There are two targets:
In 2, the user reasonably expects that changing the deployment would result in a new attempt to deploy. So inert isn't really the right knob - the user should expect to know that "state XYZ did not complete within the constraints indicated", but that's not really the same as inert, which is effectively "I don't want to consider any future states". There has to be some mechanism whereby the deployment controller can a) record and b) observe that a given state is unreachable. If the deployment changes from a->b, and b is unreachable, and the user changes back to state a, the deployment should be able to proceed. It has to be easy for the "unreachable" state to be cleared - unreachable is something the deployment controller is indicating on the deployment. It's also reasonable for a user to expect that there be some way to continue to observe "unreachable" states even after the deployment controller has moved on.

Our choice in OpenShift was to mark the replication controller with a failed indication, which the deployment controller acknowledged by stopping processing if that was the current state. If the user retries a particular state in OpenShift, we create a new RC since we use versioned numbers, so there is no previous unreachable state to clear. In a hash-based state model, we'd need to clear the unreachable / failed flag on the RC somehow to allow the deployment controller to proceed. We'd likely want to preserve a record that those states were unreachable.

It's also reasonable for a user to expect that they have to take affirmative action to retry a failed state. If the deployment controller gets restarted and tries to retry a failed state, that could (for some types of deployment) be seriously damaging to a production deployment. So for some patterns we need to be very careful not to have a path whereby a previously "failed" state is suddenly reattempted. |
Yes, we can add a "Failed" condition to indicate permafail. We also need a way for the user to retry a failed deployment without updating it (for example, after adding a missing image). |
What if the deployment controller:
Deleting the annotation from the rc would cause a retry. Events could be raised to inform users of the transitions. |
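A minimal sketch of how that marker could work, assuming a hypothetical annotation key (`deployment.kubernetes.io/perma-failed` is made up for illustration) and using the apps/v1 ReplicaSet type in place of the rc mentioned above:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// permaFailedAnnotation is a hypothetical key, not part of the Kubernetes API;
// it stands in for the failure marker discussed in this comment.
const permaFailedAnnotation = "deployment.kubernetes.io/perma-failed"

// isPermaFailed reports whether the controller previously marked this object's
// state as unreachable. Deleting the annotation clears the marker and lets the
// deployment controller retry that state.
func isPermaFailed(rs *appsv1.ReplicaSet) bool {
	_, ok := rs.Annotations[permaFailedAnnotation]
	return ok
}

func main() {
	rs := &appsv1.ReplicaSet{}
	rs.Annotations = map[string]string{permaFailedAnnotation: "image pull failed"}
	fmt.Println(isPermaFailed(rs)) // true: the controller would stop syncing this state
}
```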
Need a condition, like Progressing or Stuck. |
We discussed adding a timeout param in the spec, which controls the failed condition (the DeploymentController sets that condition on the deployment if there has been no progress for that much time). One of the names suggested was maxProgressThresholdSeconds. We also discussed adding a subresource |
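A rough sketch of that idea, with the field name taken from the suggestion above; the struct shapes here are simplified, hypothetical stand-ins rather than the actual Kubernetes API:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified shapes for illustration only.
type DeploymentSpec struct {
	// MaxProgressThresholdSeconds: how long the controller waits without
	// observed progress before marking the deployment as failed.
	MaxProgressThresholdSeconds int64
}

type DeploymentCondition struct {
	Type   string // e.g. "Failed"
	Status string // "True" / "False"
	Reason string
}

// failedCondition returns a "Failed" condition when no progress has been
// observed for longer than the configured threshold.
func failedCondition(spec DeploymentSpec, lastProgress, now time.Time) *DeploymentCondition {
	threshold := time.Duration(spec.MaxProgressThresholdSeconds) * time.Second
	if now.Sub(lastProgress) > threshold {
		return &DeploymentCondition{Type: "Failed", Status: "True", Reason: "NoProgress"}
	}
	return nil
}

func main() {
	spec := DeploymentSpec{MaxProgressThresholdSeconds: 600}
	last := time.Now().Add(-15 * time.Minute)
	fmt.Println(failedCondition(spec, last, time.Now())) // condition set: 15m without progress > 10m threshold
}
```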
I was thinking about the timeout param in the spec, but I believe a timeout-based approach here adds more complexity than, say, a retries-based approach, since the user has to know beforehand how long a timeout they need, and the time a deployment takes to come up varies too. We would also have to always make sure that …

How about a retries-based approach? The user would note in the spec how many times they want the deployment retried in case of an error returned while executing the strategy. I would imagine a corresponding status field noting how many times the deployment has been retried. If spec.retries == status.retries and the deployment still isn't successful, then we mark it as failed (another status field or an annotation?). Then … Thoughts? |
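A sketch of the retries-based bookkeeping, under the caveat that `spec.retries` and `status.retries` are hypothetical fields from this comment and never became part of the API:

```go
package main

import "fmt"

// Hypothetical fields for the retries-based proposal; not part of the real API.
type RetrySpec struct {
	Retries int32 // how many times the user wants a failing rollout retried
}

type RetryStatus struct {
	Retries int32 // how many times the controller has retried so far
	Failed  bool  // set once the retry budget is exhausted
}

// recordFailure bumps the retry counter after a strategy error and marks the
// deployment as failed once spec.retries is reached.
func recordFailure(spec RetrySpec, status *RetryStatus) {
	if status.Failed {
		return
	}
	status.Retries++
	if status.Retries >= spec.Retries {
		status.Failed = true
	}
}

func main() {
	spec := RetrySpec{Retries: 3}
	status := &RetryStatus{}
	for i := 0; i < 5; i++ {
		recordFailure(spec, status)
	}
	fmt.Printf("retries=%d failed=%v\n", status.Retries, status.Failed) // retries=3 failed=true
}
```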
Number of retries also seems fine to me. |
If, during a given sync, progress can't be made due to some transient error (say, failing to update because of a connectivity issue), I probably want to continue trying to sync until connectivity is restored. Given that, I wonder if retry counts might not apply uniformly to all classes of error. Can we think of any examples of specific error types where retry thresholds would be more appropriate than time-based ones? Progress time thresholds seem useful even given the validation complexity. My gut says deployment users are probably interested in expressing timeouts (either overall or progress-based) in terms of wall time. |
I am pretty sure there are users who would want timeout-based deployments, but a timeout feels more error-prone than a retry count. Or so I think; I don't have a strong case for the retry count either. That being said, if we think a timeout is more useful, then I will try that approach and see where it goes. |
I think both have value. I agree that a timeout is prone to undesirable failure due to the unpredictability of timing within the cluster generally. I think that automation in general around this condition is problematic and needs to come with disclaimers. Maybe this highlights the importance of giving users clear visibility into status, so they can easily tell when deployments aren't progressing and intervene manually. |
We are more interested in automatic rollbacks. Is this on the roadmap at all? If I start a deployment and it is not progressing at all, should it simply roll back automatically to the last known good version rather than only allowing manual intervention? Or maybe a policy is needed where the user selects whether they want manual vs. automatic. Thoughts? |
Automatic rollback in case of no progress is definitely something we want. |
Automatic rollback is #23211 |
Ok. Actually, what we want is for lifecycle hooks to be taken into account in the deployment status. If a hook ends with an error, the deployment should be considered a failure. The main process of the pod may be working well and passing all probes (liveness/readiness), but if a hook returns an error, the deployment status should be failed, because the hook is an important part of the pod and its readiness. |
@livelace hm, so your problem was hooks? It wasn't clear to me from your original issue. I have answered downstream. Hooks in Kube deployments are not there yet: #14512 |
Automatic merge from submit-queue: Add perma-failed deployments API. @kubernetes/deployment @smarterclayton API for #14519. Docs at kubernetes/website#1337
Automatic merge from submit-queue: Controller changes for perma-failed deployments. This PR adds support for reporting failed deployments based on a timeout parameter defined in the spec. If there is no progress for the amount of time defined as progressDeadlineSeconds, the deployment will be marked as failed via a Progressing condition with a ProgressDeadlineExceeded reason. Follow-up to #19343. Docs at kubernetes/website#1337. Fixes #14519 @kubernetes/deployment @smarterclayton
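For reference, a small example of consuming the merged API with the current apps/v1 types; the `getCondition` helper and the `deadline` value are illustrative, but `progressDeadlineSeconds`, the `Progressing` condition type, and the `ProgressDeadlineExceeded` reason match what was merged:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// getCondition is a small helper (not from the library) that finds a condition
// of the given type on the deployment status.
func getCondition(status appsv1.DeploymentStatus, t appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
	for i := range status.Conditions {
		if status.Conditions[i].Type == t {
			return &status.Conditions[i]
		}
	}
	return nil
}

// hasProgressDeadlineExceeded reports whether the controller gave up on the
// rollout: the Progressing condition goes False with reason
// ProgressDeadlineExceeded once spec.progressDeadlineSeconds elapses with no progress.
func hasProgressDeadlineExceeded(d *appsv1.Deployment) bool {
	c := getCondition(d.Status, appsv1.DeploymentProgressing)
	return c != nil && c.Status == corev1.ConditionFalse && c.Reason == "ProgressDeadlineExceeded"
}

func main() {
	deadline := int32(600) // give up reporting progress after 10 minutes without any
	d := &appsv1.Deployment{Spec: appsv1.DeploymentSpec{ProgressDeadlineSeconds: &deadline}}
	fmt.Println(hasProgressDeadlineExceeded(d)) // false: no conditions reported yet
}
```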
There are times when the deployment system can infer that the latest deployment state has no reasonable chance of being realized (e.g. a bad or unpullable image). The current deployment controller design will continue to try reconciling indefinitely regardless of the possibility of success. If, based on inference or user constraints (e.g. timeout conditions specified in an enhancement to the deployment API #1743), the system is ready to give up its best-effort attempt, the deployment could be somehow marked as "permanently failed" for a given spec hash so that the system won't continue thrashing on a doomed deployment.
There is a bit of functional overlap with inert deployments (#14516) in that both concepts result in the deployment controller "ignoring" a deployment whose state still needs to be realized, but inert deployments as described don't seem to capture all the context users would want in this case (i.e., it's not enough to just mark a doomed deployment inert without more context about why, and without UX safety nets to distinguish the repercussions of re-activating a suddenly inert deployment vs. a permafailed deployment).
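A toy sketch of the "give up per spec hash" idea described above; the in-memory map and the hash values are purely illustrative, since the real mechanism would record the failure on the API object itself (a condition or annotation) rather than in controller memory:

```go
package main

import "fmt"

// failedHashes records pod-template hashes the controller has given up on.
type failedHashes map[string]bool

// shouldSync returns false when the current spec hash was previously marked
// permanently failed, so the controller stops thrashing on a doomed rollout.
// A spec change produces a new hash that is not in the record, so syncing
// resumes automatically.
func shouldSync(failed failedHashes, specHash string) bool {
	return !failed[specHash]
}

func main() {
	failed := failedHashes{"7d4b5c9f": true}    // hash of the doomed template (made up)
	fmt.Println(shouldSync(failed, "7d4b5c9f")) // false: permanently failed
	fmt.Println(shouldSync(failed, "a1e2f3c4")) // true: new spec, try again
}
```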