Introduce stable pods #478

Closed
wants to merge 1 commit into from

Conversation

Contributor

@0xmichalis commented Mar 23, 2017

@kubernetes/sig-apps-api-reviews @kubernetes/api-reviewers

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Mar 23, 2017
by specifying MinReadySeconds in their PodSpec (w/o updating the ReplicaSet pod template).

## kubelet changes
For a Pod that specifies MinReadySeconds, kubelet will need to check (after MinReadySeconds)
Member

i think in the absence of min ready seconds, available condition would just map with ready, no? i would rather all things report a condition instead of just a few...

Contributor Author

Agreed, will update.
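For illustration, here is a minimal Go sketch (not part of the PR) of how a kubelet-side helper could derive availability from the existing Ready condition, matching the behavior discussed above: with no MinReadySeconds, Available would simply mirror Ready. The package and helper names are hypothetical, and the MinReadySeconds value is assumed to come from the proposed PodSpec field.

```go
package podavail

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// getReadyCondition returns the Ready condition from the Pod status, or nil.
func getReadyCondition(status *v1.PodStatus) *v1.PodCondition {
	for i := range status.Conditions {
		if status.Conditions[i].Type == v1.PodReady {
			return &status.Conditions[i]
		}
	}
	return nil
}

// IsAvailable reports whether the Pod should carry Available=True at "now".
// minReadySeconds is assumed to come from the proposed PodSpec field.
func IsAvailable(pod *v1.Pod, minReadySeconds int32, now time.Time) bool {
	ready := getReadyCondition(&pod.Status)
	if ready == nil || ready.Status != v1.ConditionTrue {
		return false
	}
	if minReadySeconds == 0 {
		// No stability window requested: Available maps directly to Ready.
		return true
	}
	// Otherwise the Pod must have stayed Ready for the whole window.
	return now.Sub(ready.LastTransitionTime.Time) >= time.Duration(minReadySeconds)*time.Second
}
```

Anything more elaborate, such as restarting the window when a container crashes, would layer on top of a check like this.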


Both problems above can be solved by moving MinReadySeconds in the PodSpec. Once kubelet
observes that a Pod has been ready for at least MinReadySeconds without any of its containers
crashing, it will update the PodStatus with an Available condition set to Status=True.
Member

Can I still ask for a better name? Is "Stable" or "StableReady" or "ReadyStable" problematic?

Contributor Author

Stable sgtm

// without any of its containers crashing, for it to be considered available.
// Defaults to 0 (pod will be considered available as soon as it is ready)
// +optional
MinReadySeconds *int32 `json:"minReadySeconds,omitempty"`
Member

Correspondingly "StableReadySeconds", or Min... ?

Contributor Author

@smarterclayton @bgrant0607 thoughts? This makes sense if we roll with "Stable" but this entails that we then need to update the status fields in our controllers from AvailableReplicas to StableReplicas (something to keep in mind for v2 I guess at this point)

Member

Why? You don't need to use the same name everywhere - names are contextual.

Contributor Author

Trying to convey the same thing via different names is at least confusing to me as a user.

Contributor

You must be "Ready" for "X Seconds" in order to be "Stable". "StableAfterReadySeconds" is awkward but is somewhat closer.

Contributor Author

I think I still prefer MinReadySeconds ("minimum seconds the pod needs to be Ready"), but I won't hang on it if you guys feel strongly about changing it.

Contributor

I don't know that I love any of the new names.

Member

SecondsRequiredStableBeforeReady?

Contributor Author

@davidopp isn't it an API convention to follow the *Seconds pattern? I think it is, but I may be wrong.


## kubelet changes
For a Pod that specifies MinReadySeconds, kubelet will need to check (after MinReadySeconds)
if any of the Pod containers has crashed. If not, it will switch the Available condition to
Member

Does this flip to false every time a container crashes, or is it latched to true once it is achieved?

Member

after the initial period, if a container has crashed, how long does it wait? We don't have enough info (I think) to know WHEN it crashed. Maybe restate this as kubelet will reset the timer every time a container crashes until this condition is achieved?
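A sketch of the timer-reset interpretation suggested in this comment, assuming kubelet keeps a snapshot of container restart counts from the moment the Pod last became Ready. The snapshot and helper are hypothetical, not existing kubelet code.

```go
package podavail

import v1 "k8s.io/api/core/v1"

// crashedSinceReady reports whether any container restarted after the Pod last
// transitioned to Ready. restartsAtReady is a hypothetical snapshot of each
// container's RestartCount taken at that Ready transition.
func crashedSinceReady(pod *v1.Pod, restartsAtReady map[string]int32) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount > restartsAtReady[cs.Name] {
			// A container restarted after the Pod became Ready; the
			// MinReadySeconds window would start over from the new Ready time.
			return true
		}
	}
	return false
}
```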

Member

Is this really EVERY container? What if it is OK for my helper container to crash occasionally? Reality is that it happens...

Contributor Author

@0xmichalis commented Mar 23, 2017

I would expect this to be similar to the Ready condition. If a Pod is Ready and then one of its containers crashes, the Pod stops being Ready.

Member

I don't think that's true, is it?

Contributor

If Stable is just a longer Ready.... why not increase the amount of time before you are Ready?

Contributor Author

That's a good question, today MinReadySeconds is just that, an additional delay on top of Ready.

Contributor Author

And the answer is that a longer wait for Ready means you are not receiving traffic.

Contributor

I guess you need to take traffic before you know whether you're ok. If you have a liveness or readiness check that also checks your business metrics, you can use that to signal a problem.

  1. created
  2. check readiness (no traffic, so no errors in your metrics)
  3. take traffic, users getting 500's
  4. check readiness (got errors, fail)
  5. go out of rotation
  6. ???

What happens at step six? You just oscillate in and out, causing some percentage of users to get failures? In OpenShift we can fail the deployment config, and I guess Deployments can hold the deployment and back off creation. Stability would be false, since we're oscillating.

Contributor Author

So there seem to be two types of errors an application can have that result in a 500 a user will see: 1) errors that make the application exit. In this case the pod goes out of rotation anyway, it never transitions to Stable, and the Deployment blocks. We can verify these simply by looking at the container statuses. 2) Bugs that do not cause the application to exit. The pod does not go out of rotation, but the Deployment still blocks. We can't rely on container statuses here, and readiness checks may not be enough unless business metrics are involved.

Both cases are covered by rerunning the readiness check, I guess. I need to call this out. Does it also mean that the endpoints controller will need to remove endpoints for Pods that never reach Stable?

Step 6 does not seem to be something that needs to be handled in this proposal. We have ProgressDeadlineSeconds, and there is AutoRollback that still needs to be implemented.

For a Pod that specifies MinReadySeconds, kubelet will need to check (after MinReadySeconds)
if any of the Pod containers has crashed. If not, it will switch the Available condition to
Status=True in the status of the Pod. Pods that don't specify MinReadySeconds won't have
the Available condition set in their status.
Member

Most conditions are assumed to have their default value == the Go zero value, and are named as such, I think. Brian has better context on this, but it seems weird to me that nonexistent implies true

Contributor Author

Yes, similar to Derek's comment, we probably want to add the condition to all Pods and default to false. I was thinking that such an action would be disruptive, but then again, in order to upgrade a Node you need to drain it first, so I am probably wrong.
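A hedged sketch of what reporting the condition on every Pod (defaulting to false) might look like on the kubelet side; the condition type value and the helper are assumptions based on this proposal, not merged API.

```go
package podavail

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodAvailable is the condition type proposed here; the value is an assumption.
const PodAvailable v1.PodConditionType = "Available"

// setAvailableCondition upserts the Available condition on the Pod status,
// so every Pod reports it, defaulting to False until the window is satisfied.
func setAvailableCondition(status *v1.PodStatus, available bool, now metav1.Time) {
	newStatus := v1.ConditionFalse
	if available {
		newStatus = v1.ConditionTrue
	}
	for i := range status.Conditions {
		if status.Conditions[i].Type == PodAvailable {
			if status.Conditions[i].Status != newStatus {
				status.Conditions[i].Status = newStatus
				status.Conditions[i].LastTransitionTime = now
			}
			return
		}
	}
	status.Conditions = append(status.Conditions, v1.PodCondition{
		Type:               PodAvailable,
		Status:             newStatus,
		LastTransitionTime: now,
	})
}
```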

@0xmichalis changed the title from "Introduce available pods" to "Introduce stable pods" on Mar 24, 2017
@thockin
Member

thockin commented Mar 24, 2017 via email

@thockin
Member

thockin commented Mar 24, 2017 via email

@0xmichalis
Contributor Author

0xmichalis commented Mar 24, 2017

really.... I do not think that is what SHOULD happen... But I accept that it might be what DOES happen.

I remember that this didn't use to be the case, but my understanding is that you don't want to continue sending traffic to broken Pods. Furthermore, it may be just your sidecar that is failing, but there is no way today to distinguish between the main container and the sidecar via the API... Or at least not one I am aware of.

@thockin
Member

thockin commented Mar 24, 2017 via email

@0xmichalis
Contributor Author

0xmichalis commented Mar 24, 2017 via email

// without any of its containers crashing, for it to be considered available.
// Defaults to 0 (pod will be considered available as soon as it is ready)
// +optional
MinReadySeconds *int32 `json:"minReadySeconds,omitempty"`
Member

SecondsRequiredStableBeforeReady?

```
and a new PodConditionType:
```go
// PodAvailable is added in a ready pod that has MinReadySeconds specified. The pod
Member

The second sentence of this comment is confusing. BTW is it worth stating somewhere in this doc how Ready vs Available are treated by services?

Contributor Author

Tried to rephrase this one. Also added a sentence where I explicitly mention that Services are not going to be affected by this change.


## Future work
The PDB controller will need to be extended to recognize the Available condition in Pods.
It may also need to get into the bussiness of estimating availability for Deployments/ReplicaSets
Member

business

Contributor Author

done

backwards-compatible with old kubelets.

## Future work
The PDB controller will need to be extended to recognize the Available condition in Pods.
Member

Might want to cite #34776 where we had some discussion of this topic.

Contributor Author

done

@0xmichalis
Contributor Author

Addressed most of the comments, PTAL.

Still tentative is the name in the PodSpec. I still prefer MinReadySeconds over all current suggestions because 1) it's less ugly and 2) it's already used by three different controllers (so people should already be familiar with it).

@0xmichalis
Contributor Author

cc: @mfojtik since somebody from PM may need to drive the implementation :)

@0xmichalis
Contributor Author

@kubernetes/sig-node-api-reviews

2. PodDisruptionBudget is working with Ready Pods and has no notion of MinReadySeconds
when used by workload controllers.
3. The workload controllers cannot take advantage of readiness checks when checking the
state of their underlying Pods.
Member

Another problem I found while debugging is that it's very hard to reason about which pods are not available from a controller, because it only reports aggregated data. Imagine a ReplicaSet with 200 replicas, and only 3 of them are unavailable.

Contributor Author

Agreed. Updated the proposal.

the PodSpec if it is specified either in their Spec or PodTemplateSpec. The value in the
PodTemplateSpec should stomp the value in the Spec. For Pods that do not specify
MinReadySeconds (those running on old kubelets), the controllers will continue to use the
current approach for estimating stability. Eventually, we should switch all controllers that
Member

Can we make PodTemplate.Spec.MinReadySeconds default to whatever is set in the controller spec when creating pods (we don't need to touch PodTemplate so won't cause additional rollouts)? It overwrites controller Spec.MinReadySeconds only when it's set (not nil).

Contributor Author

This is what I mean by "The ReplicaSet and DaemonSet controllers will create new Pods by setting MinReadySeconds in the PodSpec if it is specified either in their Spec or PodTemplateSpec. The value in the PodTemplateSpec should stomp the value in the Spec."

Member

gotcha

Contributor Author

I also mention it above in the API changes section.
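For clarity, a tiny sketch of the precedence being described, where a value in the PodTemplateSpec stomps the controller-level Spec value when the controller creates Pods; both fields are part of this proposal, not existing API, and the helper name is hypothetical.

```go
// effectiveMinReadySeconds returns the MinReadySeconds value a controller
// would write into a new Pod's spec: the PodTemplateSpec value wins when set,
// otherwise the controller-level Spec value is used.
func effectiveMinReadySeconds(controllerSpec, podTemplate *int32) *int32 {
	if podTemplate != nil {
		return podTemplate
	}
	return controllerSpec
}
```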

@0xmichalis
Contributor Author

@tnozicka


Question: Do we need a separate kind of probe (stabilityProbe?) or should we re-use
the readinessProbe? What if a user needs to take into account different kinds of metrics
when doing the stability check than those used in her readiness probe?
Member

Reusing readinessProbe makes sense to me.

Contributor Author

sgtm

@0xmichalis
Contributor Author

@wanghaoran1988 is going to start working on the kubelet side of things

Pods can be created using it without updating the controller's PodTemplateSpec.

### kubelet changes
For a Pod that specifies MinReadySeconds, kubelet will need to rerun the readiness check
Contributor

@Kargakis Why do we need to rerun the readiness check? Can we just wait "minReadySeconds" seconds after it's ready and check whether the pod is still ready?

Contributor Author

My understanding is that ideally that check would be able to identify failures that are not caught by readiness (by using metrics exposed by the app), so we may want to think about introducing a net-new probe. @kubernetes/sig-node-proposals @janetkuo @smarterclayton Or do you guys think re-running the readiness check is enough?
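To illustrate the open question, a hypothetical shape for a dedicated probe field alongside the existing container probes. This field was never adopted and is only a sketch of the alternative to reusing readinessProbe.

```go
import v1 "k8s.io/api/core/v1"

// ContainerSketch mirrors the existing probe fields of v1.Container and adds
// the hypothetical StabilityProbe discussed above; it is not real API.
type ContainerSketch struct {
	ReadinessProbe *v1.Probe `json:"readinessProbe,omitempty"`
	LivenessProbe  *v1.Probe `json:"livenessProbe,omitempty"`
	// StabilityProbe would run at the end of the MinReadySeconds window and
	// could check application/business metrics not covered by readiness.
	StabilityProbe *v1.Probe `json:"stabilityProbe,omitempty"`
}
```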

@0xmichalis
Contributor Author

I'm not able to move this forward anymore so I am going to close.

@0xmichalis closed this on Aug 10, 2017
@kow3ns
Member

kow3ns commented Aug 10, 2017

@Kargakis why are you closing this?

@0xmichalis
Contributor Author

@Kargakis why are you closing this?

I'm not able to move this forward

@0xmichalis
Contributor Author

@davidopp
Member

FWIW I think this is a good idea in general and we should try to implement it, or something like it, eventually.

@kow3ns
Member

kow3ns commented Aug 10, 2017

concur with @davidopp

@mattmb

mattmb commented Sep 21, 2022

I think this is still a good idea and want to share some context. I'm also interested in whether there are any related developments elsewhere that anyone can link me to, or whether the right SIGs/contributors already know about this suggestion.

At Yelp we've had one form or another of this problem since before we even used Kubernetes! I would describe it like so: "When we launch a new container it takes some time to be discoverable in whatever service mesh technology you choose. In large clusters, proxies or iptables rules or whatever might have to be updated on many nodes or inside many pods. If you remove old versions of the container before this happens, your service will be under capacity and likely to send errors".

Before Kubernetes we had our own deployment code that checked a sample of our hosts to verify that the new containers had converged in our mesh before the deployment proceeded to kill the old containers. In Kubernetes we "abuse" the readiness check by doing a similar check of our service mesh before saying a Pod is ready. We could alternatively use min ready seconds, but in cases where our mesh infrastructure goes wrong or is really slow for some reason, that timer would expire and rather than a stalled deployment we would have a service outage. I think in general people would prefer a stalled deployment.

So I do think a second condition is needed for the deployment controller (or replica set controller?) to know when it's safe to proceed. I think it should also be possible to define a separate probe, because what you should check before proceeding depends on your choice of service mesh. I don't know enough about Kubernetes internals to comment with authority, but it feels like it could be done in a backwards-compatible way where the deployment controller falls back to the readiness condition when this new condition isn't defined. And then users have a choice: min ready seconds defines this new condition, or they can define a custom check that overrides it.

@thockin
Member

thockin commented Sep 21, 2022 via email

@mattmb

mattmb commented Sep 21, 2022

I think probably it doesn't (but it's certainly possible I misunderstand something). If the deployment controller and the service/endpoint controller both wait for all the conditions to be ready, then we still have the same problem. Hypothetically, if we could tell a deployment to only proceed on one readiness gate and a service to only mark an endpoint ready on a different readiness gate, then I think that would probably work. But I don't think that's currently possible? I actually can't find the documentation that says when an endpoint is added to or removed from a service...

@thockin
Member

thockin commented Sep 21, 2022 via email

@mattmb

mattmb commented Sep 21, 2022

Endpoints are added when the corresponding pod is ready, as indicated by the Ready condition. I think I see what you are asking for - you want pods to be included in the service when they are passing readiness probes, but you don't want rollouts to proceed until that service is confirmed "healthy" in some statistical manner. Is that right?

Yep exactly, and indeed I think it depends on what service mesh or routing people use as to what that check would be. But the overall problem in my mind is generic to any mesh technology.

So what you'd really need is something in ReplicaSet (and Deployment, which configures RS) which says "which pod condition should I look at to determine pod-ready" (the proverbial "another level of indirection"). Then you could set that to a "service is happy" indicator, which you can set by your own heuristics. Looking at the code for RS controller, that actually seems quite simple to implement, but I am not sure what other considerations are required - that subsystem is not my expertise. We would need a sig-apps KEP, for sure, though it should be pretty short (relatively).

I think that sounds pretty sensible and would work well for our case. I'll have some discussions about this, I'm not sure how urgently we need to fix our current approach but I suspect we'll have to tackle this eventually so would be nice to get it described in a KEP and contribute back if we can.
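A rough sketch of the indirection described above, purely illustrative; neither the ReplicaSet field nor the helper exists in Kubernetes today, and the names are hypothetical.

```go
import v1 "k8s.io/api/core/v1"

// ReplicaSetSpecSketch shows the hypothetical knob: the Pod condition type the
// controller should treat as "ready" when counting replicas for a rollout.
type ReplicaSetSpecSketch struct {
	// ReadyConditionType defaults to "Ready" when empty, preserving today's behavior.
	ReadyConditionType v1.PodConditionType `json:"readyConditionType,omitempty"`
}

// podMeetsCondition reports whether the Pod has the given condition set to True,
// falling back to the standard Ready condition when no type is specified.
func podMeetsCondition(pod *v1.Pod, condType v1.PodConditionType) bool {
	if condType == "" {
		condType = v1.PodReady
	}
	for _, c := range pod.Status.Conditions {
		if c.Type == condType && c.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}
```

A rollout controller could then count only Pods for which podMeetsCondition returns true, while the endpoints controller keeps using the plain Ready condition.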

@thockin
Member

thockin commented Sep 21, 2022 via email
