Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add perma-failed deployments API #19343

Merged
merged 2 commits into from
Oct 28, 2016
Merged

Add perma-failed deployments API #19343

merged 2 commits into from
Oct 28, 2016

Conversation

0xmichalis
Copy link
Contributor

@0xmichalis 0xmichalis commented Jan 6, 2016

@kubernetes/deployment @smarterclayton

API for #14519

Docs at kubernetes/website#1337


This change is Reviewable

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/new-api size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 6, 2016
@@ -296,6 +300,17 @@ type DeploymentStatus struct {

// Total number of non-terminated pods targeted by this deployment that have the desired template spec.
UpdatedReplicas int `json:"updatedReplicas,omitempty"`

// Stuck indicates if the current deployment has failed.
Stuck bool `json:"stuck,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a boolean, we need to express failed as a condition (i.e. using ConditionStatus).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, didn't know about ConditionStatus, will switch to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After digging in other uses of ConditionStatus in the api I see that everywhere it is wrapped in a higher level <Resource>Condition so I guess we need to follow that pattern and have a DeploymentCondition here.

@ironcladlou
Copy link
Contributor

The new timeout/condition handling looks great.

@0xmichalis 0xmichalis changed the title [WIP] Add perma-failed deployments Add perma-failed deployments Jan 7, 2016
@@ -4374,7 +4374,7 @@
"properties": {
"type": {
"type": "string",
"description": "Type of job condition, currently only Complete."
"description": "Type of job condition."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely clear why you're changing a job resource field in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 8, 2016
@@ -396,6 +397,23 @@ func (dc *DeploymentController) syncDeployment(key string) error {
return nil
}

// If the deployment times out, we will need to add a failed condition in its
// status (if it's not already there), scale down all of its replication
// controllers, and finally update its status to reflect zero replicas.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs updated to reflect the removal of the auto-scaledown behavior

@ironcladlou
Copy link
Contributor

Other than one remaining nit this is looking good to me... I'd like to get feedback on the condition clearing approach (cc @nikhiljindal).

@0xmichalis
Copy link
Contributor Author

Hm, after playing around with e2e, I realized that the current condition clearing approach isn't enough. dc.timedOut will fallback in the creationTimestamp of the deployment in case of no found conditions. Imagine a deployment with maxProgressThresholdSeconds=60, failed yesterday, and today someone retried it. It would fail on the spot. Should the retry endpoint add (and update in subsequent calls) a timestamp annotation in the deployment?

@0xmichalis
Copy link
Contributor Author

Or even better, how about adding a condition instead of an annotation?

@ironcladlou
Copy link
Contributor

Hm, after playing around with e2e, I realized that the current condition clearing approach isn't enough. dc.timedOut will fallback in the creationTimestamp of the deployment in case of no found conditions. Imagine a deployment with maxProgressThresholdSeconds=60, failed yesterday, and today someone retried it. It would fail on the spot. Should the retry endpoint add (and update in subsequent calls) a timestamp annotation in the deployment?

Thinking about this more, should the controller really be enforcing timeout when no conditions are present? It could be a while between creation and first sync, and during that time I don't think it counts as the time the deployment is actually taking since it hasn't even tried working with the deployment yet.

What if you only check timeouts when there's a condition present?

@ironcladlou
Copy link
Contributor

Put another way, the timeout should always be relative to progress, and lack of a condition implies that we haven't even started trying to make progress yet, implying timeouts shouldn't yet be effective.

@lavalamp lavalamp removed their assignment Jan 8, 2016
@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 26, 2016
@k8s-ci-robot
Copy link
Contributor

Jenkins verification failed for commit 35cdedf48a5fa6c4489674afadaebbdc01049824. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

// The config this deployment is rolling back to. Will be cleared after rollback is done.
// +optional
RollbackTo *RollbackConfig `json:"rollbackTo,omitempty" protobuf:"bytes,8,opt,name=rollbackTo"`

// The maximum time in seconds for a deployment to make progress before it
// is considered to be failed. Failed deployments are not retried by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kargakis
What does "failed deployments are not retried" mean? Is that still accurate, if we're not implementing auto-rollback yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wording here is not accurate enough, will update.

// The maximum time in seconds for a deployment to make progress before it
// is considered to be failed. Failed deployments are not retried by the
// deployment controller and require user action. Failure causes are surfaced
// in the deployment status as Conditions. This is not set by default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please specify which specific condition type(s). They aren't otherwise discoverable currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok


// These are valid conditions of a deployment.
const (
// DeploymentAvailable means the deployment is available, ie. at least the minimum available
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to these types by the string names not variable names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@bgrant0607
Copy link
Member

@Kargakis API looks ok, thanks. Added a few comments about the documentation/descriptions.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2016
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2016
@k8s-ci-robot
Copy link
Contributor

Jenkins GKE smoke e2e failed for commit cf9fd31. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@0xmichalis
Copy link
Contributor Author

@k8s-bot test this issue: #33388

@0xmichalis
Copy link
Contributor Author

@bgrant0607 addressed all of your latest comments. Applying lgtm

@0xmichalis 0xmichalis added lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 27, 2016
@0xmichalis
Copy link
Contributor Author

Marking as P2 to avoid any more rebases for today:)

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit b98a990 into kubernetes:master Oct 28, 2016
@bgrant0607
Copy link
Member

@grodrigues3 @apelisse @fejta Another PR plagued by rebases.

@0xmichalis 0xmichalis deleted the perma-failed-deployments branch October 28, 2016 12:11
@apelisse
Copy link
Member

apelisse commented Nov 1, 2016

Unfortunately, it's not obvious by looking at the timeline what is the hot-point for rebase.

By looking at the final code though, it looks like there are generated docs that include a timestamp (recipe for disaster), and lots of generated code that is doomed to conflict.

Questions are:

  • Does it need to be rebased on a regular basis, or can it just wait for the review to be done, then rebase/quick validation/set to highest priority so that it goes quickly through the queue?
  • How much effort would it be to remove this generated code/docs?
  • Are there any other source of conflicts @Kargakis ?

@0xmichalis
Copy link
Contributor Author

@apelisse primary source of conflicts was generated docs and protobuf.

@smarterclayton
Copy link
Contributor

Protobuf must conflict (because we have to uniquely allocate proto tags).
So nothing I'm aware of we can do there.

Docs are common points of conflict for almost every API change for sure.

On Nov 1, 2016, at 12:22 PM, Michail Kargakis notifications@github.com
wrote:

@apelisse https://github.com/apelisse primary source of conflicts was
generated docs and protobuf.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#19343 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABG_pw4ArJzbkSrMh_99Ot24wgfvkx2_ks5q52dagaJpZM4G_44l
.

k8s-github-robot pushed a commit that referenced this pull request Nov 4, 2016
…failed

Automatic merge from submit-queue

Controller changes for perma failed deployments

This PR adds support for reporting failed deployments based on a timeout
parameter defined in the spec. If there is no progress for the amount
of time defined as progressDeadlineSeconds then the deployment will be
marked as failed by a Progressing condition with a ProgressDeadlineExceeded
reason.

Follow-up to #19343

Docs at kubernetes/website#1337

Fixes #14519

@kubernetes/deployment @smarterclayton
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Apr 16, 2018
stop using factory.OpenshiftClientConfig

Origin-commit: df81ecaf3ca02d8de059e5a437a546a49ef53415
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/app-lifecycle kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.