
Gracefully delete pods from the Kubelet #8641

Merged: 10 commits into kubernetes:master on Jun 2, 2015

Conversation

smarterclayton (Contributor)

This commit wires together the graceful delete option for pods on the Kubelet. When a pod is deleted on the API server, a grace period is calculated from Pod.Spec.TerminationGracePeriodSeconds, the user's provided grace period, or a default (30 seconds). The grace period can only shrink once set. The value provided by the user (or the default) is set onto pod.Spec directly.

When the Kubelet sees a pod with DeletionTimestamp set, it uses the value of Spec.TerminationGracePeriodSeconds as the grace period sent to Docker. When updating status, if the pod has DeletionTimestamp set and all containers are terminated, the Kubelet will update the status one last time and then invoke Delete(pod, grace: 0) to clean up the pod immediately.
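A minimal sketch of that selection and shrink-only rule (the helper name, its signature, and the example are illustrative assumptions, not the PR's actual code):

    package main

    import "fmt"

    // effectiveGracePeriod sketches the behavior described above: prefer the
    // grace period supplied with the delete request, fall back to the pod
    // spec's TerminationGracePeriodSeconds, then to the 30-second default,
    // and never let an already-set grace period grow.
    func effectiveGracePeriod(existing, requested, specSeconds *int64) int64 {
        grace := int64(30) // default named in the description above
        if specSeconds != nil {
            grace = *specSeconds
        }
        if requested != nil {
            grace = *requested
        }
        if existing != nil && *existing < grace {
            grace = *existing // the grace period can only shrink once set
        }
        return grace
    }

    func main() {
        spec, req, prior := int64(60), int64(10), int64(5)
        fmt.Println(effectiveGracePeriod(nil, nil, &spec))     // 60
        fmt.Println(effectiveGracePeriod(&prior, &req, &spec)) // 5: cannot grow again
    }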

This still needs tests and some e2e testing but is functionally complete.

Why this matters:

  • No one wants their database SIGKILL'ed

Known gaps:

  • test behavior when the kubelet is restarted
  • test that the graceful maximum is enforced
  • show an indication in pod describe that the resource is being deleted
  • verify the replication controller handles this gracefully

Open questions:

  • Should we remove deleting pods (DeletionTimestamp) from replication controller and endpoints controller counts immediately?
  • Is 30s a good default?

Implements #1535 for pods

@bgrant0607 (Member)

cc @piosz @davidopp @vmarmol

@@ -1156,6 +1156,7 @@ func (dm *DockerManager) killContainer(containerID types.UID) error {
if inspect != nil && inspect.Config != nil && inspect.Config.Labels != nil {
preStop, found = inspect.Config.Labels[kubernetesPodLabel]
}
gracePeriod := uint(30)
@bgrant0607 (Member)

Why is this defaulting in kubelet rather than in the API defaulting code? I'd prefer the latter.

@smarterclayton (Contributor, Author)

Accidental, will move it there.


@k8s-bot

k8s-bot commented May 21, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain @ixdy.

@@ -1170,6 +1171,9 @@ func (dm *DockerManager) killContainer(containerID types.UID) error {
break
}
}
if pod.Spec.TerminationGracePeriodSeconds != nil {
gracePeriod = uint(*pod.Spec.TerminationGracePeriodSeconds)
@bgrant0607 (Member)

Why aren't we using DeletionTimestamp here instead? Or do we need to convert that to a duration?

I expected:

  1. User sets terminationGracePeriodSeconds, or it gets set to 30s by default
  2. On deletion, deletionTimestamp is set to Now+terminationGracePeriodSeconds (by default, by controllers)
  3. Upon noticing deletionTimestamp is set, the preStop hook is executed
  4. After the preStop hook completes, stop is called with duration deletionTimestamp-Now

Eventually we want to "right align" notifications, but the above seems like a reasonable starting point.

@smarterclayton (Contributor, Author)

  • Why aren't we using DeletionTimestamp here instead? Or do we need to convert that to a duration?

DeletionTimestamp is important for history (mostly) and clients. I think changing the spec fits with other uses, so I'd rather not have two durations on the object.

  • On deletion, deletionTimestamp is set to Now+terminationGracePeriodSeconds by default by controllers

The timestamp has to be measured from the local clock, so it's similar to Derek's changes in that the local time matters more.

@smarterclayton (Contributor, Author)

Had to refactor the pod manager to ensure the policy is carried through properly, but it tracks execution of preStop. I always leave a 2-second buffer on the stop-container call, on the grounds that SIGKILL is very dangerous to some software and SIGTERM should get a tiny window to execute, at least until we can limit preStop.
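For illustration only, one possible reading of how such a buffer could factor into the seconds handed to the runtime's stop call; the 2-second floor semantics and the function below are assumptions, not the PR's actual kubelet code:

    package main

    import "time"

    // graceForStop returns the seconds handed to the stop call: whatever
    // remains of the pod's grace period, but never less than a small floor so
    // SIGTERM gets a window before SIGKILL.
    func graceForStop(deadline, now time.Time) uint {
        const minimumSeconds = 2 // the buffer mentioned above (assumed semantics)
        remaining := int64(deadline.Sub(now).Seconds())
        if remaining < minimumSeconds {
            return minimumSeconds
        }
        return uint(remaining)
    }

    func main() {
        deadline := time.Now().Add(500 * time.Millisecond)
        _ = graceForStop(deadline, time.Now()) // clamps to the 2-second floor
    }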


@bgrant0607 (Member)

From v1beta3/types.go:

    // DeletionTimestamp is the time after which this resource will be deleted. This
    // field is set by the server when a graceful deletion is requested by the user, and is not
    // directly settable by a client. The resource will be deleted (no longer visible from
    // resource lists, and not reachable by name) after the time in this field. Once set, this
    // value may not be unset or be set further into the future, although it may be shortened
    // or the resource may be deleted prior to this time. For example, a user may request that
    // a pod is deleted in 30 seconds. The Kubelet will react by sending a graceful termination
    // signal to the containers in the pod. Once the resource is deleted in the API, the Kubelet
    // will send a hard termination signal to the container.
    DeletionTimestamp *util.Time
...
    // Optional duration in seconds before the object should be deleted. Value must be non-negative integer.
    // The value zero indicates delete immediately. If this value is nil, the default grace period for the
    // specified type will be used.
    GracePeriodSeconds *int64
...
    // Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request.
    // Value must be non-negative integer. The value zero indicates delete immediately.
    // If this value is nil, the default grace period will be used instead.
    // The grace period is the duration in seconds after the processes running in the pod are sent
    // a termination signal and the time when the processes are forcibly halted with a kill signal.
    // Set this value longer than the expected cleanup time for your process.
    TerminationGracePeriodSeconds *int64

TerminationGracePeriodSeconds is intended to be the default grace period, such as that used by replication controller and node controller, or by users in the case that they don't override it. DeletionTimestamp is intended to reflect the actual grace period. It should be initialized from GracePeriodSeconds. If no value is provided for GracePeriodSeconds, it should be initialized from TerminationGracePeriodSeconds (for pods).
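A sketch of that initialization order, with illustrative names (the actual apiserver code path is not shown in this thread):

    package sketch

    import "time"

    // deletionDeadline computes the grace period and the resulting
    // DeletionTimestamp: DeleteOptions.GracePeriodSeconds wins if provided,
    // otherwise the pod's TerminationGracePeriodSeconds is used.
    func deletionDeadline(now time.Time, requestGrace, terminationGrace *int64) (time.Time, int64) {
        grace := int64(30) // assumed fallback when neither field is set
        switch {
        case requestGrace != nil:
            grace = *requestGrace
        case terminationGrace != nil:
            grace = *terminationGrace
        }
        return now.Add(time.Duration(grace) * time.Second), grace
    }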

@bgrant0607 (Member)

If we don't want to drive decisions off DeletionTimestamp due to potential clock skew, then we need to record the value of DeleteOptions.GracePeriodSeconds and use that.

@smarterclayton (Contributor, Author)

I'm doing that: recording it into pod.Spec.TerminationGracePeriodSeconds. Is that not sufficient? It seems like the proper place, since we record the user's intent to set a custom termination grace period (with their delete) into the spec.

@smarterclayton force-pushed the gracefully_delete_pods branch from 944f2bb to df43161 on May 22, 2015 03:04
@bprashanth (Contributor)

Regarding the RC part of this change: what's the intended behavior? Are you trying to get an updated status.replicas, or do you actually want the RC to try and create more replicas right away (#1535 isn't too clear)?

Calling deletePod won't have an effect if the pod is still in the store (which it will be, until the Informer observes a deletion via the watch, which IIUC will only happen when the kubelet finally sends a Delete(pod, grace) to the apiserver).

@smarterclayton (Contributor, Author)

Right away, I think. A pod marked for deletion will be gone within TerminationGracePeriodSeconds, and for anything managed by a controller it's better to be moving toward creation immediately, since it is impossible to transition out of a pending deletion.

Hrm... I'll dig into the logic more. The pod should be filtered from the visible list (which I'll add), which looks like filterActivePods.

@smarterclayton force-pushed the gracefully_delete_pods branch from df43161 to 58ba208 on May 22, 2015 17:03
@bprashanth (Contributor)

Ah, if you added a new state, yes. Please check for that explicitly instead of the deletion timestamp.

@smarterclayton (Contributor, Author)

See my latest changes. DeletionTimestamp is not a State; it's an irrevocable transition, so I'm filtering on it.

@bprashanth (Contributor)

Why not DeletionPending? (Don't see this being discussed on the bug, 'pologies if it was already ruled out). Are we going to surface a deletion timestamp in kubectl?

@smarterclayton (Contributor, Author)

So Brian had already raised concerns about adding new states - DeletionTimestamp is a condition driven by the core API object logic (part of the fundamentals of the DELETE action), so reflecting it somewhere else seemed unnecessary. It's effectively equivalent to CreationTimestamp (and that's part of the core POST action). I would prefer not to introduce new Phases - I don't really want to reflect it as a condition because it's part of the object lifecycle. But I'm willing to discuss more about that.


@bprashanth (Contributor)

I agree about having too many phases, but currently the RCs only filter based on phase; they never open up a pod to check other user-defined fields (breaking this won't be too bad, I suppose, but I could then make the RCs check restart policy, and so on. I suppose deletionTimestamp isn't directly specified by the user, so maybe it's close to phase). I'm curious how we're going to surface a pending delete (via Message in kubectl?).

@bgrant0607 (Member)

cc @dchen1107 @brendandburns

@bgrant0607 (Member)

We're not adding any new phases.

@smarterclayton (Contributor, Author)

I added a filter based on metadata.deletionTimestamp - are you saying that the cache doesn't contain that info and my logic is flawed?

In kubectl I think we'd hijack the status field for now.


@bprashanth (Contributor)

The logic is right.
I was looking for a way to surface the same thing without having to check metadata and hoping it would tie in with whatever way we used for kubectl, so there would be a direct correlation when we debug.

@bgrant0607 (Member)

To answer questions:

The default: Many applications just want to drain in-flight requests. For that scenario, 10-15s is probably enough. I think Heroku gives 10s. 10-15s is also probably long enough for compiles and unit tests to finish rather than immediately committing suicide. For databases, even 30s isn't long enough. I'm not sure where the sweet spot is for most open-source apps. Certainly we shouldn't make the default longer than 30s, and I'm inclined to make it a bit shorter.

In theory, we shouldn't have to remove pods being gracefully terminated from endpoints. The app will be notified and then could change its reported readiness. However, for that to reflect back into readiness quickly we might need to initiate an off-cycle readiness probe, which could race against notification handling in the app. That all seems too complicated, so probably removing the pods from endpoints upon setting deletionTimestamp is best. Applications want to continue to run to drain requests in flight but want to stop new requests from being sent ASAP.

I would have the replication controller replace the pod.

@@ -204,6 +204,10 @@ func (rm *ReplicationManager) getPodControllers(pod *api.Pod) *api.ReplicationCo
// When a pod is created, enqueue the controller that manages it and update it's expectations.
func (rm *ReplicationManager) addPod(obj interface{}) {
pod := obj.(*api.Pod)
if pod.DeletionTimestamp != nil {
@smarterclayton (Contributor, Author)

@bprashanth I worry I don't understand the replication controller (as it is now) well enough to debug what I've added. A few questions:

  1. Are there known scenarios today where fast rescale triggers the RC loop to forget a pod?
  2. Do I even need these notifications, or would it be more accurate to simply let the controller handle it now that I have the check in filterActivePods?

@bprashanth (Contributor)

If you rescale it up to 50, the rc will create 50 replicas. If you rescale it down again soon after, the rc will ignore the downsize till the watch has notified it of the 50 replicas having been created, after which it will go about deleting all 50. The key here is the Expectations thingy. When the rc fires off creates/deletes it sets the number it expects to see via watch, and goes to sleep till the manager does in fact see these many. Watch latencies sometimes tend to be higher than I like on some cloud providers :(

The rc manager will never sync the same rc more than once simultaneously, the work queue enforces this, which is why the size up and immediate size down isn't racy.

You need to invoke deletePod because you need the DeletionsObserved call to wake up the RC if you want it to create more replicas. Without this you will wake up the RC, and it will say "hey, I expect another delete, so I'm going back to sleep" (but it will update status.Replicas).
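A stripped-down sketch of the expectations mechanism described above (the type and method names are illustrative approximations, not the controller framework's actual code):

    package main

    import "sync"

    // expectations records how many creates and deletes the RC has issued and
    // is still waiting to observe via the watch; a sync only proceeds once the
    // counts drain to zero.
    type expectations struct {
        mu            sync.Mutex
        adds, deletes int
    }

    func (e *expectations) Expect(adds, deletes int) {
        e.mu.Lock()
        defer e.mu.Unlock()
        e.adds, e.deletes = adds, deletes
    }

    // CreationObserved and DeletionObserved are called from the watch event
    // handlers (e.g. deletePod) to wake the controller back up.
    func (e *expectations) CreationObserved() { e.observe(-1, 0) }
    func (e *expectations) DeletionObserved() { e.observe(0, -1) }

    func (e *expectations) observe(da, dd int) {
        e.mu.Lock()
        defer e.mu.Unlock()
        e.adds += da
        e.deletes += dd
    }

    // Fulfilled reports whether the controller may run another sync for this RC.
    func (e *expectations) Fulfilled() bool {
        e.mu.Lock()
        defer e.mu.Unlock()
        return e.adds <= 0 && e.deletes <= 0
    }

    func main() {
        var e expectations
        e.Expect(0, 1)       // one delete was issued
        _ = e.Fulfilled()    // false: still waiting on the watch
        e.DeletionObserved() // the deletePod handler saw it
        _ = e.Fulfilled()    // true: the RC can sync and create a replacement
    }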

@smarterclayton (Contributor, Author)

OK, so we're touching base at 50 then? I thought we made a statement a while back that touching base was bad?

@bprashanth (Contributor)

50 was a random example, sorry, can you elaborate?
The rc will only create BurstReplicas at a time (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/controller/replication_controller.go#L69) but that's a rate limit. Our scalability tests periodically spin up rcs with 3000 replicas.

@smarterclayton (Contributor, Author)

I mean, if I say 50, then 30, then 10, the manager has to touch base at 50 before going back down to 10?

@bprashanth (Contributor)

The comment is better placed above the deletePod in updatePod, pologies for not mentioning this.

If one can create an rc with a pod template that has the deletion timestamp set, it will just keep creating pods. Are we disallowing this in validation or is it intended?

@smarterclayton (Contributor, Author)

It's cleared during creation - that said, I'm making an additional guard to prevent deletion grace period seconds from being reset and will move your comment.


@bprashanth (Contributor)

Ah, then we'll never get added pods with the deletion timestamp set unless you restart the controller manager or something, right? (We get add-pod events when the pod isn't in the store but is present in whatever the apiserver sends us; this could either be a watch event or the output of a relist.) Please add that as a comment here.

@smarterclayton (Contributor, Author)

Comments updated. I also added stronger checking and reset behavior to ValidateObjectMetaUpdate (deletionGracePeriodSeconds is immutable) and rest.BeforeCreate() to guard against objects being created with those values.
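As a hedged illustration of what such a guard might check (the function below and its exact semantics are assumptions, not the actual ValidateObjectMetaUpdate code):

    package sketch

    // validateDeletionGraceUpdate sketches the immutability rule described
    // above: a regular update may not change DeletionGracePeriodSeconds once
    // it has been set (the graceful-delete path itself is what sets it).
    func validateDeletionGraceUpdate(newGrace, oldGrace *int64) []string {
        var errs []string
        switch {
        case oldGrace == nil:
            // not yet set; nothing to protect
        case newGrace != nil && *newGrace == *oldGrace:
            // unchanged: fine
        default:
            errs = append(errs, "metadata.deletionGracePeriodSeconds may not be changed once set")
        }
        return errs
    }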


@bprashanth (Contributor)

LGTM thanks

@k8s-bot

k8s-bot commented Jun 1, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

This commit wires together the graceful delete option for pods on the Kubelet. When a pod is deleted on the API server, a grace period is calculated from Pod.Spec.TerminationGracePeriodSeconds, the user's provided grace period, or a default. The grace period can only shrink once set. The value provided by the user (or the default) is set onto metadata as DeletionGracePeriod.

When the Kubelet sees a pod with DeletionTimestamp set, it uses the value of ObjectMeta.GracePeriodSeconds as the grace period sent to Docker. When updating status, if the pod has DeletionTimestamp set and all containers are terminated, the Kubelet will update the status one last time and then invoke Delete(pod, grace: 0) to clean up the pod immediately.

Pods that are slated for deletion should be excluded from replication and endpoints immediately.
@smarterclayton force-pushed the gracefully_delete_pods branch from c3e2095 to f12a68c on June 1, 2015 23:24
@k8s-bot

k8s-bot commented Jun 1, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

bgrant0607 added a commit that referenced this pull request Jun 2, 2015
@bgrant0607 merged commit b7ae48e into kubernetes:master on Jun 2, 2015
@bgrant0607 (Member)

I think this caused e2e failures:

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/pods.go:263
Failed to observe pod deletion: pods "pod-update-7ee1e9d8-096f-11e5-9297-42010af01555" cannot be updated: 101: Compare failed ([3094 != 3095]) [3095]
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/pods.go:263
Failed to observe pod deletion: pods "pod-update-fd42a5f2-0968-11e5-b74b-42010af01555" cannot be updated: 101: Compare failed ([2947 != 2949]) [2949]

That line number is at the end of the graceful deletion e2e test.

@bgrant0607 (Member)

Reverted by PR #9143

@smarterclayton (Contributor, Author)

A graceful deletion does a compare-and-swap, which means it's subject to races with other updates (in this case, pod status updates). That's "expected" within the design of the pull request, but I can use GuaranteedUpdate instead.
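For reference, a generic sketch of the retry-on-conflict idea behind GuaranteedUpdate (the helper below is illustrative, not the actual storage helper in the codebase):

    package sketch

    import "errors"

    // errConflict stands in for the "Compare failed" error seen in the e2e
    // output above: the write lost a compare-and-swap race.
    var errConflict = errors.New("conflict: resource version changed")

    // guaranteedUpdate re-reads the object, re-applies the mutation, and
    // retries until the compare-and-swap succeeds or a non-conflict error
    // occurs, so a concurrent status update no longer fails the deletion.
    func guaranteedUpdate(
        read func() (data, version string, err error),
        write func(data, version string) error,
        mutate func(string) string,
    ) error {
        for {
            data, version, err := read()
            if err != nil {
                return err
            }
            err = write(mutate(data), version)
            if err == nil {
                return nil
            }
            if !errors.Is(err, errConflict) {
                return err
            }
            // Lost the race (e.g. to a pod status update): loop and retry.
        }
    }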


@smarterclayton (Contributor, Author)

@xiang90 are there any performance implications to large numbers of relatively short TTLs being present in the cluster? At least one scalability test indicated that list pods (a large recursive query) saw a latency spike - is there any locking around the TTL expiration code that could starve GETs?

@xiang90 (Contributor) commented Jun 2, 2015

@smarterclayton What is a large number?

@smarterclayton (Contributor, Author)

4000-5000 documents retrieved by the GET, taking 16s at the 99th percentile, with maybe 100-200 pending TTL expiration in that group. I expect a large number to expire at the same time.


@xiang90 (Contributor) commented Jun 2, 2015

100-200 expiration at a time should be fine and should not cause a problem. If this is not the case, then it is a bug that etcd needs to fix.

@smarterclayton (Contributor, Author)

I'm going to try to create a reproduction scenario to isolate the behavior and determine if it's etcd or just unexpected higher level implications. Thanks


@smarterclayton (Contributor, Author)

I'm tempted to make a change to not use TTL, given the other work done here and the potential risk of a fix/debug in etcd. Once scheduled, the kubelet has ownership of the pod and manages deletion. If the kubelet is unable to end the pod, the node controller will handle it. Before scheduling, a delete doesn't need grace, so it can be immediate. I think that combination removes the need for TTL. Going to try that. I'm also going to make graceful deletion a runtime-enabled feature so we can test it without breaking the critical path.


@bgrant0607 (Member)

SGTM, thanks.

@derekwaynecarr (Member)

+1 to elevator discussions :-)



@smarterclayton (Contributor, Author)

Moving to #9165 - will start this behind a feature flag so we can selectively enable it and test it.

Labels: lgtm, release-note