Add perma-failed deployments API #19343
Conversation
@@ -296,6 +300,17 @@ type DeploymentStatus struct {

 	// Total number of non-terminated pods targeted by this deployment that have the desired template spec.
 	UpdatedReplicas int `json:"updatedReplicas,omitempty"`

+	// Stuck indicates if the current deployment has failed.
+	Stuck bool `json:"stuck,omitempty"`
Rather than a boolean, we need to express failed as a condition (i.e. using ConditionStatus).
Thanks, didn't know about ConditionStatus, will switch to it.
After digging into other uses of ConditionStatus in the API, I see that everywhere it is wrapped in a higher-level <Resource>Condition, so I guess we need to follow that pattern and have a DeploymentCondition here.
The new timeout/condition handling looks great.
@@ -4374,7 +4374,7 @@
     "properties": {
      "type": {
       "type": "string",
-      "description": "Type of job condition, currently only Complete."
+      "description": "Type of job condition."
Not entirely clear why you're changing a job resource field in this PR.
removed
@@ -396,6 +397,23 @@ func (dc *DeploymentController) syncDeployment(key string) error {
 		return nil
 	}

+	// If the deployment times out, we will need to add a failed condition in its
+	// status (if it's not already there), scale down all of its replication
+	// controllers, and finally update its status to reflect zero replicas.
The comment needs to be updated to reflect the removal of the auto-scaledown behavior.
Other than one remaining nit this is looking good to me... I'd like to get feedback on the condition-clearing approach (cc @nikhiljindal).
Hm, after playing around with e2e, I realized that the current condition-clearing approach isn't enough.
Or even better, how about adding a condition instead of an annotation?
Thinking about this more, should the controller really be enforcing the timeout when no conditions are present? It could be a while between creation and the first sync, and during that time I don't think it counts as time the deployment is actually taking, since the controller hasn't even tried working with the deployment yet. What if you only check timeouts when there's a condition present?
Put another way, the timeout should always be relative to progress, and the lack of a condition implies that we haven't even started trying to make progress, implying timeouts shouldn't yet be in effect.
Jenkins verification failed for commit 35cdedf48a5fa6c4489674afadaebbdc01049824. Full PR test history. The magic incantation to run this job again is |
	// The config this deployment is rolling back to. Will be cleared after rollback is done.
	// +optional
	RollbackTo *RollbackConfig `json:"rollbackTo,omitempty" protobuf:"bytes,8,opt,name=rollbackTo"`

	// The maximum time in seconds for a deployment to make progress before it
	// is considered to be failed. Failed deployments are not retried by the
@Kargakis What does "failed deployments are not retried" mean? Is that still accurate, if we're not implementing auto-rollback yet?
I think the wording here is not accurate enough, will update.
	// The maximum time in seconds for a deployment to make progress before it
	// is considered to be failed. Failed deployments are not retried by the
	// deployment controller and require user action. Failure causes are surfaced
	// in the deployment status as Conditions. This is not set by default.
Please specify which specific condition type(s). They aren't otherwise discoverable currently.
ok
// These are valid conditions of a deployment.
const (
	// DeploymentAvailable means the deployment is available, ie. at least the minimum available
Please refer to these types by the string names not variable names.
ok
@Kargakis API looks ok, thanks. Added a few comments about the documentation/descriptions.
Jenkins GKE smoke e2e failed for commit cf9fd31. Full PR test history. The magic incantation to run this job again is
@bgrant0607 addressed all of your latest comments. Applying lgtm
Marking as P2 to avoid any more rebases for today :)
Automatic merge from submit-queue
@grodrigues3 @apelisse @fejta Another PR plagued by rebases.
Unfortunately, it's not obvious from the timeline what the hot spot for the rebases is. Looking at the final code, though, it looks like there are generated docs that include a timestamp (a recipe for disaster), and lots of generated code that is doomed to conflict. Questions are:
@apelisse primary source of conflicts was generated docs and protobuf.
Protobuf must conflict (because we have to uniquely allocate proto tags). Docs are common points of conflict for almost every API change, for sure.
…failed

Automatic merge from submit-queue

Controller changes for perma-failed deployments

This PR adds support for reporting failed deployments based on a timeout parameter defined in the spec. If there is no progress for the amount of time defined as progressDeadlineSeconds, then the deployment will be marked as failed by a Progressing condition with a ProgressDeadlineExceeded reason.

Follow-up to #19343
Docs at kubernetes/website#1337
Fixes #14519

@kubernetes/deployment @smarterclayton
stop using factory.OpenshiftClientConfig Origin-commit: df81ecaf3ca02d8de059e5a437a546a49ef53415
@kubernetes/deployment @smarterclayton
API for #14519
Docs at kubernetes/website#1337