Gracefully delete pods from the Kubelet #8641
Conversation
@@ -1156,6 +1156,7 @@ func (dm *DockerManager) killContainer(containerID types.UID) error {
 	if inspect != nil && inspect.Config != nil && inspect.Config.Labels != nil {
 		preStop, found = inspect.Config.Labels[kubernetesPodLabel]
 	}
+	gracePeriod := uint(30)
Why is this defaulting in kubelet rather than in the API defaulting code? I'd prefer the latter.
Accidental, will move it there.
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain @ixdy.
@@ -1170,6 +1171,9 @@ func (dm *DockerManager) killContainer(containerID types.UID) error {
 			break
 		}
 	}
+	if pod.Spec.TerminationGracePeriodSeconds != nil {
+		gracePeriod = uint(*pod.Spec.TerminationGracePeriodSeconds)
+	}
Why aren't we using DeletionTimestamp here instead? Or do we need to convert that to a duration?
I expected:
- User sets terminationGracePeriodSeconds or it gets set to 30s by default
- On deletion, deletionTimestamp is set, to Now+terminationGracePeriodSeconds by default by controllers
- Upon noticing deletionTimestamp is set, prestop hook is executed
- After prestop hook completes, stop is called with time deletionTimestamp-Now
Eventually we want to "right align" notifications, but the above seems like a reasonable starting point.
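A minimal sketch of the flow described above, assuming hypothetical names (`remainingGrace` and `defaultTerminationGracePeriod` are illustrative, not the actual Kubelet code): the stop timeout is derived as deletionTimestamp minus now, measured on the local clock.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed default; the discussion below settles on 30s pending tuning.
const defaultTerminationGracePeriod = 30 * time.Second

// remainingGrace derives the stop timeout as deletionTimestamp-Now,
// falling back to the default when no deletion is pending.
func remainingGrace(deletionTimestamp *time.Time) time.Duration {
	if deletionTimestamp == nil {
		return defaultTerminationGracePeriod
	}
	remaining := time.Until(*deletionTimestamp)
	if remaining < 0 {
		return 0 // already past the deadline: hard kill
	}
	return remaining
}

func main() {
	deadline := time.Now().Add(25 * time.Second)
	fmt.Println(remainingGrace(&deadline).Round(time.Second)) // ~25s
}
```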
DeletionTimestamp is important for history (mostly) and clients. I think the spec being changed fits with other uses, so I'd rather not have two durations on the object.
The timestamp has to be measured from the local clock, so it's similar to Derek's changes in that the local time matters more.
Had to refactor the pod manager to ensure the policy is carried through properly; it now tracks execution of the PreStop hook. I left a 2 second buffer on stop container always, on the grounds that SIGKILL is very dangerous to some software and SIGTERM should get a tiny window to execute, at least until we can limit preStop.
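A minimal sketch of that 2 second buffer, under assumed names (`minimumGracePeriod` and `stopTimeout` are illustrative): whatever grace remains after the preStop hook runs, the stop call always gets a small floor so SIGTERM is never followed instantly by SIGKILL.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed name for the always-present SIGTERM window.
const minimumGracePeriod = 2 * time.Second

// stopTimeout clamps the remaining grace so the container always gets
// a small window between SIGTERM and SIGKILL, even if the preStop
// hook consumed the whole budget.
func stopTimeout(remaining time.Duration) time.Duration {
	if remaining < minimumGracePeriod {
		return minimumGracePeriod
	}
	return remaining
}

func main() {
	fmt.Println(stopTimeout(0))                // 2s
	fmt.Println(stopTimeout(10 * time.Second)) // 10s
}
```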
From v1beta3/types.go:
// DeletionTimestamp is the time after which this resource will be deleted. This
// field is set by the server when a graceful deletion is requested by the user, and is not
// directly settable by a client. The resource will be deleted (no longer visible from
// resource lists, and not reachable by name) after the time in this field. Once set, this
// value may not be unset or be set further into the future, although it may be shortened
// or the resource may be deleted prior to this time. For example, a user may request that
// a pod is deleted in 30 seconds. The Kubelet will react by sending a graceful termination
// signal to the containers in the pod. Once the resource is deleted in the API, the Kubelet
// will send a hard termination signal to the container.
DeletionTimestamp *util.Time
...
// Optional duration in seconds before the object should be deleted. Value must be non-negative integer.
// The value zero indicates delete immediately. If this value is nil, the default grace period for the
// specified type will be used.
GracePeriodSeconds *int64
...
// Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request.
// Value must be non-negative integer. The value zero indicates delete immediately.
// If this value is nil, the default grace period will be used instead.
// The grace period is the duration in seconds after the processes running in the pod are sent
// a termination signal and the time when the processes are forcibly halted with a kill signal.
// Set this value longer than the expected cleanup time for your process.
TerminationGracePeriodSeconds *int64
TerminationGracePeriodSeconds is intended to be the default grace period, such as that used by the replication controller and node controller, or by users when they don't override it. DeletionTimestamp is intended to reflect the actual grace period. It should be initialized from GracePeriodSeconds. If no value is provided for GracePeriodSeconds, it should be initialized from TerminationGracePeriodSeconds (for pods).
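A sketch of that precedence, with simplified stand-ins for the API structs (`resolveGracePeriod` is an invented helper, not apiserver code): an explicit DeleteOptions.GracePeriodSeconds wins, then the pod's TerminationGracePeriodSeconds, then the type default.

```go
package main

import "fmt"

// Assumed type default for pods.
const defaultGracePeriodSeconds int64 = 30

// resolveGracePeriod mirrors the initialization order described above:
// DeleteOptions.GracePeriodSeconds, else TerminationGracePeriodSeconds,
// else the default. DeletionTimestamp would then be set to Now plus
// the resolved value.
func resolveGracePeriod(deleteOptions, terminationGracePeriod *int64) int64 {
	if deleteOptions != nil {
		return *deleteOptions // explicit value on the DELETE request
	}
	if terminationGracePeriod != nil {
		return *terminationGracePeriod // pod-level default from the spec
	}
	return defaultGracePeriodSeconds
}

func main() {
	ten := int64(10)
	fmt.Println(resolveGracePeriod(&ten, nil)) // 10
	fmt.Println(resolveGracePeriod(nil, nil))  // 30
}
```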
If we don't want to drive decisions off DeletionTimestamp due to potential clock skew, then we need to record the value of DeleteOptions.GracePeriodSeconds and use that.
I'm recording that into pod.Spec.TerminationGracePeriodSeconds - is that not sufficient? It seems like that's the proper place: we record the user's intent to set a custom termination grace period (with their delete) into the spec.
944f2bb to df43161 (force-pushed)
Regarding the rc part of this change: what's the intended behavior? Are you trying to get an updated status.replicas, or do you actually want the rc to try and create more replicas right away (#1535 isn't too clear)? Calling deletePod won't have an effect if the pod is still in the store (which it will be, until the Informer observes a deletion via the watch, which IIUC will only happen when the kubelet finally sends a Delete(pod, grace) to the apiserver).
df43161 to 58ba208 (force-pushed)
Ah, if you added a new state, yes. Please check for that explicitly instead of the deletion timestamp.
See my latest changes - DeletionTimestamp is not a state - it's an irrevocable transition, so I'm filtering on it.
Why not
So Brian had already raised concerns about adding new states - DeletionTimestamp is a condition driven by the core API object logic (part of the fundamentals of the DELETE action), so reflecting it somewhere else seemed unnecessary. It's effectively equivalent to CreationTimestamp (and that's part of the core POST action). I would prefer not to introduce new Phases, and I don't really want to reflect it as a condition because it's part of the object lifecycle. But I'm willing to discuss that more.
I agree about having too many phases, but currently the rcs only filter based on phase; they never open up a pod to check other user-defined fields (breaking this won't be too bad I suppose, but I could then make the rcs check restart policy, and so on; I suppose deletionTimestamp isn't directly specified by the user, so maybe it's close to phase). I'm curious how we're going to surface a pending delete (via kubectl?).
We're not adding any new phases.
I added a filter based on metadata.deletionTimestamp - are you saying that the cache doesn't contain that info and my logic is flawed? In kubectl I think we'd hijack the status field for now.
The logic is right.
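For illustration, a pared-down version of the filter under discussion (`Pod` here is a stand-in for api.Pod; this is a sketch, not the PR's exact code): pods with a deletion timestamp are excluded from the rc's active count.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a minimal stand-in for api.Pod.
type Pod struct {
	Name              string
	DeletionTimestamp *time.Time
}

// filterActivePods drops pods that are slated for deletion, so the rc
// treats them as already gone and creates replacements.
func filterActivePods(pods []Pod) []Pod {
	var active []Pod
	for _, p := range pods {
		if p.DeletionTimestamp == nil {
			active = append(active, p)
		}
	}
	return active
}

func main() {
	now := time.Now()
	pods := []Pod{{Name: "a"}, {Name: "b", DeletionTimestamp: &now}}
	fmt.Println(len(filterActivePods(pods))) // 1
}
```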
To answer questions:

The default: Many applications just want to drain in-flight requests. For that scenario, 10-15s is probably enough. I think Heroku gives 10s. 10-15s is also probably long enough for compiles and unit tests to finish rather than immediately committing suicide. For databases, even 30s isn't long enough. I'm not sure where the sweet spot is for most open-source apps. Certainly we shouldn't make the default longer than 30s, and I'm inclined to make it a bit shorter.

In theory, we shouldn't have to remove pods being gracefully terminated from endpoints. The app will be notified and then could change its reported readiness. However, for that to reflect back into readiness quickly we might need to initiate an off-cycle readiness probe, which could race against notification handling in the app. That all seems too complicated, so probably removing the pods from endpoints upon setting deletionTimestamp is best. Applications want to continue to run to drain requests in flight but want to stop new requests from being sent ASAP.

I would have the replication controller replace the pod.
@@ -204,6 +204,10 @@ func (rm *ReplicationManager) getPodControllers(pod *api.Pod) *api.ReplicationCo
 // When a pod is created, enqueue the controller that manages it and update its expectations.
 func (rm *ReplicationManager) addPod(obj interface{}) {
 	pod := obj.(*api.Pod)
+	if pod.DeletionTimestamp != nil {
@bprashanth I worry I don't understand the replication controller (as it is now) well enough to debug what I've added. A few questions:
- Are there known scenarios today where fast rescale triggers the RC loop to forget a pod?
- Do I even need these notifications, or would it be more accurate to simply let the controller handle it now that I have the check in filterActivePods?
If you rescale it up to 50, the rc will create 50 replicas. If you rescale it down again soon after, the rc will ignore the downsize till the watch has notified it of the 50 replicas having been created, after which it will go about deleting all 50. The key here is the Expectations thingy. When the rc fires off creates/deletes it sets the number it expects to see via watch, and goes to sleep till the manager does in fact see that many. Watch latencies sometimes tend to be higher than I like on some cloud providers :(

The rc manager will never sync the same rc more than once simultaneously (the work queue enforces this), which is why the size up and immediate size down isn't racy.

You need to invoke deletePod because you need the DeletionsObserved call to wake up the rc, if you want it to create more replicas. Without this, you will wake up the rc, it will say 'hey i expect another delete so im going back to sleep' (but it will update status.Replicas).
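A rough model of that Expectations mechanism, with invented names and none of the real store/watch plumbing: the rc records how many creates/deletes it expects, observed events decrement the counters, and a sync is a no-op until both reach zero.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Expectations tracks outstanding creates/deletes the rc has issued.
type Expectations struct {
	adds, dels int64
}

func (e *Expectations) Expect(adds, dels int64) {
	atomic.StoreInt64(&e.adds, adds)
	atomic.StoreInt64(&e.dels, dels)
}

// DeletionObserved is the call the deletePod handler must make so the
// rc wakes up instead of going back to sleep.
func (e *Expectations) DeletionObserved() { atomic.AddInt64(&e.dels, -1) }
func (e *Expectations) CreationObserved() { atomic.AddInt64(&e.adds, -1) }

// Fulfilled reports whether the rc may act on the current store state.
func (e *Expectations) Fulfilled() bool {
	return atomic.LoadInt64(&e.adds) <= 0 && atomic.LoadInt64(&e.dels) <= 0
}

func main() {
	var e Expectations
	e.Expect(0, 1)
	fmt.Println(e.Fulfilled()) // false: still waiting on one delete
	e.DeletionObserved()
	fmt.Println(e.Fulfilled()) // true: safe to resync and create replacements
}
```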
Ok, so we're touching base at 50 then? I thought we made a statement a while back that touching base was bad?
50 was a random example, sorry, can you elaborate?
The rc will only create BurstReplicas at a time (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/controller/replication_controller.go#L69) but that's a rate limit. Our scalability tests periodically spin up rcs with 3000 replicas.
I mean, if I say 50, then 30, then 10, the manager has to touch base at 50 before going back down to 10?
The comment is better placed above the deletePod in updatePod, apologies for not mentioning this.
If one can create an rc with a pod template that has the deletion timestamp set, it will just keep creating pods. Are we disallowing this in validation or is it intended?
It's cleared during creation - that said, I'm making an additional guard to prevent deletion grace period seconds from being reset and will move your comment.
ah, then we'll never get add pods with the deletion timestamp set unless you restart the controller manager or something, right? (We get add pod events when the pod isn't in the store but is present in whatever the apiserver sends us - this could either be a watch event or the output of a relist.) Please add that as a comment here.
Comments updated. I also added stronger checking and reset behavior to ValidateObjectMetaUpdate (deletionGracePeriodSeconds is immutable) and rest.BeforeCreate() to guard against objects being created with those values.
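A hedged sketch of those two guards, with simplified fields and error handling (the real code lives in ValidateObjectMetaUpdate and rest.BeforeCreate; only the shape is illustrated here): the grace period may not change once set, and creates always start with both deletion fields cleared.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ObjectMeta is a minimal stand-in for the API object's metadata.
type ObjectMeta struct {
	DeletionTimestamp          *time.Time
	DeletionGracePeriodSeconds *int64
}

// validateMetaUpdate rejects any update that alters or clears an
// already-set deletionGracePeriodSeconds.
func validateMetaUpdate(newMeta, oldMeta ObjectMeta) error {
	if oldMeta.DeletionGracePeriodSeconds == nil {
		return nil
	}
	if newMeta.DeletionGracePeriodSeconds == nil ||
		*newMeta.DeletionGracePeriodSeconds != *oldMeta.DeletionGracePeriodSeconds {
		return errors.New("deletionGracePeriodSeconds is immutable once set")
	}
	return nil
}

// beforeCreate clears deletion fields so a template (e.g. an rc pod
// template) cannot create an object that is already "being deleted".
func beforeCreate(meta *ObjectMeta) {
	meta.DeletionTimestamp = nil
	meta.DeletionGracePeriodSeconds = nil
}

func main() {
	thirty := int64(30)
	err := validateMetaUpdate(ObjectMeta{}, ObjectMeta{DeletionGracePeriodSeconds: &thirty})
	fmt.Println(err) // immutable: the update is rejected

	m := ObjectMeta{DeletionGracePeriodSeconds: &thirty}
	beforeCreate(&m)
	fmt.Println(m.DeletionGracePeriodSeconds) // <nil>
}
```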
LGTM thanks
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
This commit wires together the graceful delete option for pods on the Kubelet. When a pod is deleted on the API server, a grace period is calculated that is based on Pod.Spec.TerminationGracePeriodSeconds, the user's provided grace period, or a default. The grace period can only shrink once set. The value provided by the user (or the default) is set onto metadata as DeletionGracePeriodSeconds.

When the Kubelet sees a pod with DeletionTimestamp set, it uses the value of ObjectMeta.DeletionGracePeriodSeconds as the grace period sent to Docker. When updating status, if the pod has DeletionTimestamp set and all containers are terminated, the Kubelet will update the status one last time and then invoke Delete(pod, grace: 0) to clean up the pod immediately.
Pods that are slated for deletion should be excluded from replication and endpoints immediately.
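A sketch of that final status-sync step, against an assumed client interface (`PodClient` and `finalizePod` are illustrative, not the Kubelet's actual types): once DeletionTimestamp is set and every container has terminated, push one last status and delete with grace 0.

```go
package main

import "fmt"

// PodClient is an assumed stand-in for the Kubelet's API client.
type PodClient interface {
	UpdateStatus(name, status string) error
	Delete(name string, gracePeriodSeconds int64) error
}

// finalizePod performs the last status update and the immediate delete
// once a terminating pod has fully stopped.
func finalizePod(c PodClient, name string, deletionRequested, allTerminated bool) error {
	if !deletionRequested || !allTerminated {
		return nil // nothing to finalize yet
	}
	if err := c.UpdateStatus(name, "Terminated"); err != nil {
		return err
	}
	// grace 0: the API server removes the pod record immediately.
	return c.Delete(name, 0)
}

type fakeClient struct{}

func (fakeClient) UpdateStatus(name, status string) error {
	fmt.Printf("status(%s) = %s\n", name, status)
	return nil
}

func (fakeClient) Delete(name string, grace int64) error {
	fmt.Printf("delete(%s, grace=%d)\n", name, grace)
	return nil
}

func main() {
	_ = finalizePod(fakeClient{}, "web-1", true, true)
}
```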
c3e2095 to f12a68c (force-pushed)
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
Gracefully delete pods from the Kubelet
I think this caused e2e failures:
That line number is at the end of the graceful deletion e2e test.
Reverted by PR #9143
A graceful deletion does a compare and swap, which means it's subject to races with other updates (in this case, pod status updates). That's "expected" within the design of the pull, but I can use GuaranteedUpdate instead.
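A toy illustration of the difference, with an invented in-memory store (the real GuaranteedUpdate is an etcd storage helper; only the retry shape is shown here): a bare compare-and-swap fails when a concurrent status write bumps the resource version, while the retrying helper re-reads and reapplies the mutation until it lands.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrConflict = errors.New("resource version conflict")

type Object struct {
	ResourceVersion int
	Data            string
}

// Store is a toy single-object store with optimistic concurrency.
type Store struct{ cur Object }

func (s *Store) Get() Object { return s.cur }

func (s *Store) CompareAndSwap(old, updated Object) error {
	if s.cur.ResourceVersion != old.ResourceVersion {
		return ErrConflict // someone else wrote since our read
	}
	updated.ResourceVersion = old.ResourceVersion + 1
	s.cur = updated
	return nil
}

// GuaranteedUpdate retries the mutation on conflict, so a racing pod
// status update cannot make the graceful-delete write fail outright.
func GuaranteedUpdate(s *Store, mutate func(Object) Object) error {
	for {
		old := s.Get()
		err := s.CompareAndSwap(old, mutate(old))
		if errors.Is(err, ErrConflict) {
			continue // re-read and reapply
		}
		return err
	}
}

func main() {
	s := &Store{}
	_ = GuaranteedUpdate(s, func(o Object) Object { o.Data = "graceful"; return o })
	fmt.Println(s.cur) // {1 graceful}
}
```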
@xiang90 are there any performance implications to large numbers of short TTLs being present in the cluster? At least one scalability test indicated list pods (a large recursive query) saw a latency spike - is there any locking around the TTL expiration code that could starve GETs?
@smarterclayton What is a large number?
4000-5000 documents retrieved by the get taking 16s (99th percentile), maybe 100-200 pending TTL expiration in that group? I expect a large number to expire at the same time.
100-200 expirations at a time should be fine and should not cause a problem. If this is not the case, then it is a bug that etcd needs to fix.
I'm going to try to create a reproduction scenario to isolate the behavior and determine if it's etcd or just unexpected higher level implications. Thanks
I'm tempted to make a change to not use TTL, given the other work done here and the potential risk of a fix/debug in etcd. Once scheduled, the kubelet has ownership of the pod and manages deletion. If the kubelet is unable to end the pod, the node controller will handle it. Pre-schedule, a delete doesn't need grace, so it can be immediate. I think that combination removes the need for TTL. Going to try that. Also going to make graceful deletion a runtime-enabled feature so we can test it without breaking the critical path.
SGTM, thanks.
+1 to elevator discussions :-)
Moving to #9165 - will start this behind a feature flag so we can selectively enable it and test it.
This commit wires together the graceful delete option for pods on the Kubelet. When a pod is deleted on the API server, a grace period is calculated that is based on Pod.Spec.TerminationGracePeriodSeconds, the user's provided grace period, or a default (30 seconds). The grace period can only shrink once set. The value provided by the user (or the default) is set onto pod.Spec directly.

When the Kubelet sees a pod with DeletionTimestamp set, it uses the value of Spec.TerminationGracePeriodSeconds as the grace period sent to Docker. When updating status, if the pod has DeletionTimestamp set and all containers are terminated, the Kubelet will update the status one last time and then invoke Delete(pod, grace: 0) to clean up the pod immediately.
This still needs tests and some e2e testing but is functionally complete.
Why this matters:
Known gaps:
Open questions
Implements #1535 for pods