
Enable graceful deletion using reconciliation loops in the Kubelet without TTL #9165

Merged
8 commits merged on Aug 18, 2015

Conversation

smarterclayton
Contributor

Restores graceful delete (merged prior to 1.0 and reverted in #8641). Removes the use of TTL which causes latency issues in the API server - instead, the Kubelet and node controller are responsible for managing the deletion of terminated pods.

  • A new metadata field, DeletionGracePeriodSeconds, is added to ObjectMeta; it records the grace period the user requested on delete and complements DeletionTimestamp (the time the object is expected to be deleted). See the sketch after this list.
  • The kubelet, during SyncPods, ensures pods that are terminated and not running on the Kubelet are cleaned up.
  • The node controller ensures that pods deleted when a node is evacuated are cleaned up (best effort)
  • A future controller (described in "Node controller should delete pods assigned to deleted nodes" #11631) would terminate pods on nodes that don't exist, as well as guarantee that pods that appear never to have been terminated by a kubelet are also deleted
  • The namespace controller waits until all pods are totally gone before removing the namespace. The default grace period is used.
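
A minimal sketch of the first bullet, assuming the field shape described above (the real ObjectMeta carries many more fields; only the two relevant to graceful deletion are shown):

    // Sketch only: a pared-down illustration of how DeletionGracePeriodSeconds
    // sits next to DeletionTimestamp; the real ObjectMeta has many more fields.
    package sketch

    import "time"

    type ObjectMeta struct {
        // DeletionTimestamp is the time at which the object is expected to be
        // deleted; the server sets it when a graceful delete is requested.
        DeletionTimestamp *time.Time

        // DeletionGracePeriodSeconds records the grace period the user asked
        // for on delete, so the kubelet knows how long to allow before SIGKILL.
        DeletionGracePeriodSeconds *int64
    }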

@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@bprashanth
Contributor

FWIW, I found the 30s default confusing when I synced and created pods; it felt like Kubernetes was just unresponsive.

@smarterclayton
Contributor Author

Very few things should ever go to 30s - I think that lag was indicative of a race or other problem in the code, not the actual deletion taking that long. I've never seen a very long deletion.

@bprashanth
Contributor

You're probably right; I think I was trying to delete a failed container with restart never. Anyway, I'll keep an eye out for that case should it pop up again.

@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@bprashanth
Contributor

The explanation for the observed sluggishness is that I never realized how slow docker stop can be:
Previously:

create pod foo
delete pod foo
create pod foo

would just work. Now the third create is forced to wait on the docker stop.

@smarterclayton
Contributor Author

I think with my branch there's a gap condition in the kubelet status manager where it is losing status updates. It might be QPS limits on either docker info or the API, or it might be a race between the status updates and the kills. Again, because it's now part of the blocking path, it's noticeable.


@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@smarterclayton
Contributor Author

It looks like the delays are due to status being missed. It's possible that something needs to be cleared in the Kubelet that I'm missing.

Ignoring same pod status for slow-rc-0ht7n_default - old: {Pending [{Ready False}]  127.0.0.1  2015-06-03 03:24:35.100566274 +0000 UTC [{webserver {%!s(*api.ContainerStateWaiting=&{Image: gcr.io/google_containers/nettest:1.5 is ready, container is creating}) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} {%!s(*api.ContainerStateWaiting=<nil>) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} %!s(bool=false) %!s(int=0) gcr.io/google_containers/nettest:1.5  }]} new: {Pending [{Ready False}]  127.0.0.1  2015-06-03 03:24:35.100566274 +0000 UTC [{webserver {%!s(*api.ContainerStateWaiting=&{Image: gcr.io/google_containers/nettest:1.5 is ready, container is creating}) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} {%!s(*api.ContainerStateWaiting=<nil>) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} %!s(bool=false) %!s(int=0) gcr.io/google_containers/nettest:1.5  }]}

@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@smarterclayton
Contributor Author

This is now behind a feature gate (--runtime-config=feature/gracefulDelete) on the apiserver. I need to tweak the Kubelet behavior with this change to use pod.Spec.TerminationGracePeriodSeconds when pod.DeletionGracePeriodSeconds is not set. Then this should be safe to deliver and a strict improvement over our current code (because it would allow users to specify deletion grace period seconds), and we can test graceful deletion independently.
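
A rough sketch of that fallback; the helper name and standalone form are assumptions, not the actual kubelet code:

    package sketch

    // effectiveGracePeriodSeconds prefers the grace period recorded on the
    // object at delete time, then the pod spec's TerminationGracePeriodSeconds,
    // and finally falls back to the 30s default discussed in this thread.
    func effectiveGracePeriodSeconds(deletionGracePeriod, specGracePeriod *int64) int64 {
        if deletionGracePeriod != nil {
            return *deletionGracePeriod
        }
        if specGracePeriod != nil {
            return *specGracePeriod
        }
        return 30
    }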

@smarterclayton
Contributor Author

Those changes are added. I think this is ready for review (assuming we're ok with the feature gate).


@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@yujuhong
Contributor

yujuhong commented Jun 3, 2015

Is docker StopContainer(id string, timeout uint) error a blocking call? In my experience, docker stop -t 60 <container_id> does block until it times out. If StopContainer is also a blocking call, we can't just add this feature without parallelizing the container kills, since all killing is done in the main loop (SyncPods). The best way for the kubelet to handle this is to let the per-pod goroutines kill the containers individually. That would be a slightly bigger change to the kubelet, though.
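
For illustration, a sketch of the fan-out being discussed (the function and names are assumed, not kubelet code): with a blocking stop, killing serially costs the sum of all grace periods, while killing per pod in parallel costs only the slowest one:

    package sketch

    import "sync"

    // killPodsInParallel stops every terminating pod in its own goroutine so
    // the caller waits only as long as the slowest pod's grace period, rather
    // than the sum of all of them.
    func killPodsInParallel(podIDs []string, stop func(podID string)) {
        var wg sync.WaitGroup
        for _, id := range podIDs {
            wg.Add(1)
            go func(id string) {
                defer wg.Done()
                stop(id) // may block for up to this pod's grace period
            }(id)
        }
        wg.Wait()
    }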

@smarterclayton
Contributor Author

Pod death is KillPod which is parallel.


@yujuhong
Contributor

yujuhong commented Jun 3, 2015

Pod death is KillPod which is parallel.

It is, because I added that. However, the latency is still dominated by the slowest timeout. I think the default timeout is 10 seconds, so it's somewhat tolerable. If you allow it to go beyond that, it'd slow down kubelet response time. We wanted to move pod killing to a per-pod goroutine, but didn't have time to do so before v1.0.

@smarterclayton
Contributor Author

I thought syncPod was managed by the workers and that the workers were parallel? I don't disagree that the current path will slow down the kubelet, but SIGKILLing apps with stateful disk is way worse than latency on the kubelet. 10s is probably too short for most databases or other stateful systems, so the solution probably has to be on the kubelet side.


@yujuhong
Contributor

yujuhong commented Jun 3, 2015

syncPod is managed by the workers, but SyncPods is not. Pod killing happens in SyncPods.

The right way to fix it is to move pod killing into per-pod workers. The (perhaps) slightly easier way is to spawn another goroutine for pod killing and related cleanup (basically the second half of SyncPods), so at least it doesn't slow down other active pods.

Agreed that some pods need more time to terminate. At the least, we should have some safeguard (an upper limit) to avoid users specifying a really high timeout (which would render the kubelet unresponsive).
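
A sketch of the "slightly easier way" described above (all names are assumed; only the shape of the idea): run the kill-and-cleanup half of a sync pass on its own goroutine, guarded against overlapping passes, so slow stops don't delay active pods:

    package sketch

    import "sync/atomic"

    // cleanupRunning guards against overlapping cleanup passes.
    var cleanupRunning int32

    // syncPass hands the blocking kill/cleanup work to a background goroutine
    // so the main loop can keep servicing active pods.
    func syncPass(activePods, terminatedPods []string,
        syncActive func(pods []string), killAndCleanup func(pods []string)) {

        if atomic.CompareAndSwapInt32(&cleanupRunning, 0, 1) {
            go func() {
                defer atomic.StoreInt32(&cleanupRunning, 0)
                killAndCleanup(terminatedPods) // may block on docker stop
            }()
        }
        syncActive(activePods) // active pods are not delayed by terminations
    }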

@smarterclayton
Contributor Author

Is pod kill in per-pod significantly harder?


@@ -54,8 +54,17 @@ const (

maxReasonCacheEntries = 200

kubernetesPodLabel = "io.kubernetes.pod.data"
kubernetesContainerLabel = "io.kubernetes.container.name"
// In order to avoid unnecessary SIGKILLs, give every container a minimum grace
Contributor

2s not 1m?

Contributor Author

This is really just to guard against forcing SIGKILL. We default pod.Spec.TerminationGracePeriodSeconds to 30 - this just guarantees you never get less than 2s to respond to SIGTERM.
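
Roughly, as a hypothetical restatement of that guard (not the code under review):

    package sketch

    // minimumGracePeriodSeconds is the floor discussed in this thread: even if
    // a caller asks for less, the container gets at least 2s to react to
    // SIGTERM before the SIGKILL follows.
    const minimumGracePeriodSeconds = 2

    func gracePeriodWithFloor(requested int64) int64 {
        if requested < minimumGracePeriodSeconds {
            return minimumGracePeriodSeconds
        }
        return requested
    }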

roberthbailey added a commit to roberthbailey/kubernetes that referenced this pull request Aug 19, 2015
This reverts commit 4f856b5, reversing
changes made to d78525a.

Conflicts:
	pkg/kubelet/status_manager.go
roberthbailey added a commit that referenced this pull request Aug 19, 2015
Revert "Merge pull request #9165 from smarterclayton/graceful"
@smarterclayton
Contributor Author

Will chase down what's causing e2e flakes. Seemed to be something recent.

@smarterclayton
Contributor Author

@roberthbailey do you have the seed # from that run?

@bgrant0607
Member

Seed: 1439942057

So far, I see "pod already exists" errors. Maybe setting the default grace period to 0 will work around the problem?

@bgrant0607
Member

For example:

Pod Disks should schedule a pod w/ a RW PD shared between multiple containers, write to PD, delete pod, verify contents, and repeat in rapid succession

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:231
Failed to create host0Pod: pods "pd-test-711bdc52-460c-11e5-a99d-42010af01555" already exists
Expected error:
    <*errors.StatusError | 0xc2087d9e80>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "pods \"pd-test-711bdc52-460c-11e5-a99d-42010af01555\" already exists",
            Reason: "AlreadyExists",
            Details: {
                Name: "pd-test-711bdc52-460c-11e5-a99d-42010af01555",
                Kind: "pods",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 409,
        },
    }
    pods "pd-test-711bdc52-460c-11e5-a99d-42010af01555" already exists
not to have occurred

@roberthbailey
Contributor

The run with the 10 failures had a seed of 1439937740.

@bgrant0607
Member

GCE e2e seed: 1439944232

scheduler_predicates, resize_nodes, pd, and at least some of the other tests appear to fail pretty consistently.

@smarterclayton
Contributor Author

Yeah, I've found places where we assume there is no overlap on RC stop, but because the RC disregards terminating pods, stop no longer guarantees things are actually fully stopped. I'm a little concerned about changing the RC manager to treat scale 0 and deletion timestamp differently from scale > 0 and having a deletion timestamp, although that's probably closest to the desired behavior. The other options are to add additional client logic for terminating pods (which we already rejected for other reasons), or to accept that scale=0 on an RC means "once we've sent a graceful deletion signal to all pods".

The other errors were mismatches between the e2e test and naive pod list behavior for an RC or service (listing the pods is not enough, terminating has to be filtered out) for verification. The node problem was a further race condition in the node controller which I'll fix by not deleting the node until there are zero pods left scheduled on the node (should have been that way before, but got mixed up with Brendan's changes).
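
A sketch of the kind of filtering those checks need (the type and names are stand-ins defined here only so the snippet is self-contained): a pod with a deletion timestamp set is already terminating and shouldn't be counted as active:

    package sketch

    import "time"

    // podMeta is a pared-down stand-in for the API object's metadata.
    type podMeta struct {
        Name              string
        DeletionTimestamp *time.Time
    }

    // filterOutTerminating drops pods that are merely waiting out a graceful
    // deletion, so RC and service checks only count pods that are truly active.
    func filterOutTerminating(pods []podMeta) []podMeta {
        var active []podMeta
        for _, p := range pods {
            if p.DeletionTimestamp != nil {
                continue
            }
            active = append(active, p)
        }
        return active
    }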


@bgrant0607
Member

We need graceful termination of RCs instead of the client-side stop implementation. I don't think we should change the semantics of scaling to 0 generally.

@smarterclayton
Contributor Author

Ok - are we ok to leave that wrinkle for RC then?


@smarterclayton
Contributor Author

I've done what I should have done from the beginning and carved this up - commits 1-4 are "safe", 5 may have a bug that is fixed in the other cleanup e2e pulls, and I'm looking at making 6 safer. 7 is the only commit that actually turns on graceful deletion, so assuming we get through 1-6, only 7 would have to be rolled back.

Sorry for not doing this earlier...

@smarterclayton
Contributor Author

#12963 needs a final once-over but should be ready for another merge attempt. Only getting service flakes at this point.

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.