
Enable graceful deletion using reconciliation loops in the Kubelet without TTL #9165

Merged
8 commits merged on Aug 18, 2015

Conversation

smarterclayton
Contributor

Restores graceful delete (merged prior to 1.0 and reverted in #8641). Removes the use of TTL which causes latency issues in the API server - instead, the Kubelet and node controller are responsible for managing the deletion of terminated pods.

  • A new metadata field, DeletionGracePeriodSeconds, is added to ObjectMeta; it records the grace period the user requested on delete and complements DeletionTimestamp (the time the object is expected to be deleted). See the sketch after this list.
  • The kubelet, during SyncPods, ensures pods that are terminated and not running on the Kubelet are cleaned up.
  • The node controller ensures that pods deleted when a node is evacuated are cleaned up (best effort)
  • A future controller (described in "Node controller should delete pods assigned to deleted nodes" #11631) would terminate pods on nodes that don't exist, as well as guarantee that pods that appear never to have been terminated by a kubelet are also deleted
  • The namespace controller waits until all pods are totally gone before removing the namespace. The default grace period is used.
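
A minimal sketch of the first bullet, assuming the field shape described above (the real ObjectMeta carries many more fields; only the two relevant to graceful deletion are shown):

    // Sketch only: a pared-down illustration of how DeletionGracePeriodSeconds
    // sits next to DeletionTimestamp; the real ObjectMeta has many more fields.
    package sketch

    import "time"

    type ObjectMeta struct {
        // DeletionTimestamp is the time at which the object is expected to be
        // deleted; the server sets it when a graceful delete is requested.
        DeletionTimestamp *time.Time

        // DeletionGracePeriodSeconds records the grace period the user asked
        // for on delete, so the kubelet knows how long to allow before SIGKILL.
        DeletionGracePeriodSeconds *int64
    }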

@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@bprashanth
Contributor

FWIW, I found the 30s default confusing when I synced and created pods; it felt like Kubernetes was just unresponsive.

@smarterclayton
Contributor Author

Very few things should ever go to 30s - I think that lag was indicative of a race or other problem in the code, not the actual deletion taking that long. I've never seen a very long deletion.

@bprashanth
Contributor

You're probably right; I think I was trying to delete a failed container with restart never. Anyway, I'll keep an eye out for that case should it pop up again.

@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@bprashanth
Contributor

The explanation for the observed sluggishness is that I never realized how slow docker stop can be:
Previously:

create pod foo
delete pod foo
create pod foo

would just work. Now the third create is forced to wait on the docker stop.

@smarterclayton
Contributor Author

I think with my branch there's a gap condition in the kubelet status manager where it is losing status updates. It might be QPS limits on either docker info or the API, or it might be a race between the status updates and the kills. Again, because it's now part of the blocking path, it's noticeable.


@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@smarterclayton
Contributor Author

It looks like the delays are due to status being missed. It's possible that something needs to be cleared in the Kubelet that I'm missing.

Ignoring same pod status for slow-rc-0ht7n_default - old: {Pending [{Ready False}]  127.0.0.1  2015-06-03 03:24:35.100566274 +0000 UTC [{webserver {%!s(*api.ContainerStateWaiting=&{Image: gcr.io/google_containers/nettest:1.5 is ready, container is creating}) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} {%!s(*api.ContainerStateWaiting=<nil>) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} %!s(bool=false) %!s(int=0) gcr.io/google_containers/nettest:1.5  }]} new: {Pending [{Ready False}]  127.0.0.1  2015-06-03 03:24:35.100566274 +0000 UTC [{webserver {%!s(*api.ContainerStateWaiting=&{Image: gcr.io/google_containers/nettest:1.5 is ready, container is creating}) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} {%!s(*api.ContainerStateWaiting=<nil>) %!s(*api.ContainerStateRunning=<nil>) %!s(*api.ContainerStateTerminated=<nil>)} %!s(bool=false) %!s(int=0) gcr.io/google_containers/nettest:1.5  }]}

@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@smarterclayton
Contributor Author

This is now behind a feature gate (--runtime-config=feature/gracefulDelete) on the apiserver. I need to tweak the Kubelet behavior with this change to use pod.Spec.TerminationGracePeriodSeconds when pod.DeletionGracePeriodSeconds is not set. Then this should be safe to deliver and a strict improvement over our current code (because it would allow users to specify deletion grace period seconds), and we can test graceful deletion independently.
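
A rough sketch of that fallback; the helper name and standalone form are assumptions, not the actual kubelet code:

    package sketch

    // effectiveGracePeriodSeconds prefers the grace period recorded on the
    // object at delete time, then the pod spec's TerminationGracePeriodSeconds,
    // and finally falls back to the 30s default discussed in this thread.
    func effectiveGracePeriodSeconds(deletionGracePeriod, specGracePeriod *int64) int64 {
        if deletionGracePeriod != nil {
            return *deletionGracePeriod
        }
        if specGracePeriod != nil {
            return *specGracePeriod
        }
        return 30
    }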

@smarterclayton
Contributor Author

Those changes are added. I think this is ready for review (assuming we're ok with the feature gate).


@k8s-bot

k8s-bot commented Jun 3, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@yujuhong
Contributor

yujuhong commented Jun 3, 2015

Is docker StopContainer(id string, timeout uint) error a blocking call? In my experience, docker stop -t 60 <container_id> does block until it times out. If StopContainer is also a blocking call, we can't just add this feature without parallelizing the container kills, since all killing is done in the main loop (SyncPods). The best way for the kubelet to handle this is to let the per-pod goroutines kill the containers individually. That would be a slightly bigger change to the kubelet, though.
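
For illustration, a sketch of the fan-out being discussed (the function and names are assumed, not kubelet code): with a blocking stop, killing serially costs the sum of all grace periods, while killing per pod in parallel costs only the slowest one:

    package sketch

    import "sync"

    // killPodsInParallel stops every terminating pod in its own goroutine so
    // the caller waits only as long as the slowest pod's grace period, rather
    // than the sum of all of them.
    func killPodsInParallel(podIDs []string, stop func(podID string)) {
        var wg sync.WaitGroup
        for _, id := range podIDs {
            wg.Add(1)
            go func(id string) {
                defer wg.Done()
                stop(id) // may block for up to this pod's grace period
            }(id)
        }
        wg.Wait()
    }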

@smarterclayton
Contributor Author

Pod death is KillPod which is parallel.


@yujuhong
Contributor

yujuhong commented Jun 3, 2015

Pod death is KillPod which is parallel.

It is, because I added that. However, the latency is still dominated by the slowest timeout. I think the default timeout is 10 seconds, so it's somewhat tolerable. If you allow it to go beyond that, it'd slow down kubelet response time. We wanted to move pod killing to a per-pod goroutine, but didn't have time to do so before v1.0.

@smarterclayton
Contributor Author

I thought syncPod was managed by the workers and that the workers were parallel? I don't disagree that the current path will slow down the kubelet, but SIGKILLing apps with stateful disk is way worse than latency on the kubelet. 10s is probably too short for most databases or other stateful systems, so the solution probably has to be on the kubelet side.


@yujuhong
Contributor

yujuhong commented Jun 3, 2015

syncPod is managed by the workers, but SyncPods is not. Pod killing happens in SyncPods.

The right way to fix it is to move pod killing into per-pod workers. The (perhaps) slightly easier way is to spawn another goroutine for pod killing and related cleanup (basically the second half of SyncPods), so at least it doesn't slow down other active pods.

Agreed that some pods need more time to terminate. At the least, we should have some safeguard (an upper limit) to avoid users specifying a really high timeout (which would render the kubelet unresponsive).
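
A sketch of the "slightly easier way" described above (all names are assumed; only the shape of the idea): run the kill-and-cleanup half of a sync pass on its own goroutine, guarded against overlapping passes, so slow stops don't delay active pods:

    package sketch

    import "sync/atomic"

    // cleanupRunning guards against overlapping cleanup passes.
    var cleanupRunning int32

    // syncPass hands the blocking kill/cleanup work to a background goroutine
    // so the main loop can keep servicing active pods.
    func syncPass(activePods, terminatedPods []string,
        syncActive func(pods []string), killAndCleanup func(pods []string)) {

        if atomic.CompareAndSwapInt32(&cleanupRunning, 0, 1) {
            go func() {
                defer atomic.StoreInt32(&cleanupRunning, 0)
                killAndCleanup(terminatedPods) // may block on docker stop
            }()
        }
        syncActive(activePods) // active pods are not delayed by terminations
    }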

@smarterclayton
Contributor Author

Is pod kill in per-pod significantly harder?


@@ -54,8 +54,17 @@ const (

maxReasonCacheEntries = 200

kubernetesPodLabel = "io.kubernetes.pod.data"
kubernetesContainerLabel = "io.kubernetes.container.name"
// In order to avoid unnecessary SIGKILLs, give every container a minimum grace
Contributor

2s not 1m?

Contributor Author

This is really just to guard against forcing SIGKILL. We default pod.Spec.TerminationGracePeriodSeconds to 30 - this just guarantees you never get less than 2s to respond to SIGTERM.
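
Roughly, as a hypothetical restatement of that guard (not the code under review):

    package sketch

    // minimumGracePeriodSeconds is the floor discussed in this thread: even if
    // a caller asks for less, the container gets at least 2s to react to
    // SIGTERM before the SIGKILL follows.
    const minimumGracePeriodSeconds = 2

    func gracePeriodWithFloor(requested int64) int64 {
        if requested < minimumGracePeriodSeconds {
            return minimumGracePeriodSeconds
        }
        return requested
    }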

roberthbailey added a commit to roberthbailey/kubernetes that referenced this pull request Aug 19, 2015
This reverts commit 4f856b5, reversing
changes made to d78525a.

Conflicts:
	pkg/kubelet/status_manager.go
roberthbailey added a commit that referenced this pull request Aug 19, 2015
Revert "Merge pull request #9165 from smarterclayton/graceful"
@smarterclayton
Contributor Author

Will chase down what's causing e2e flakes. Seemed to be something recent.

@smarterclayton
Contributor Author

@roberthbailey do you have the seed # from that run?

@bgrant0607
Member

Seed: 1439942057

So far, I see "pod already exists" errors. Maybe setting the default grace period to 0 will work around the problem?

@bgrant0607
Member

For example:

Pod Disks should schedule a pod w/ a RW PD shared between multiple containers, write to PD, delete pod, verify contents, and repeat in rapid succession

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:231
Failed to create host0Pod: pods "pd-test-711bdc52-460c-11e5-a99d-42010af01555" already exists
Expected error:
    <*errors.StatusError | 0xc2087d9e80>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "pods \"pd-test-711bdc52-460c-11e5-a99d-42010af01555\" already exists",
            Reason: "AlreadyExists",
            Details: {
                Name: "pd-test-711bdc52-460c-11e5-a99d-42010af01555",
                Kind: "pods",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 409,
        },
    }
    pods "pd-test-711bdc52-460c-11e5-a99d-42010af01555" already exists
not to have occurred

@roberthbailey
Contributor

The run with the 10 failures had a seed of 1439937740.

@bgrant0607
Member

GCE e2e seed: 1439944232

scheduler_predicates, resize_nodes, pd, and at least some of the other tests appear to fail pretty consistently.

@smarterclayton
Contributor Author

Yeah, I've found places where we assume there is no overlap on RC stop, but because the RC disregards terminating pods, stop no longer guarantees things are actually fully stopped. I'm a little concerned about changing the RC manager to treat scale 0 and deletion timestamp differently from scale > 0 and having a deletion timestamp, although that's probably closest to the desired behavior. The other options are to add additional client logic for terminating pods (which we already rejected for other reasons), or to accept that scale=0 on an RC means "once we've sent a graceful deletion signal to all pods".

The other errors were mismatches between the e2e test and naive pod list behavior for an RC or service (listing the pods is not enough, terminating has to be filtered out) for verification. The node problem was a further race condition in the node controller which I'll fix by not deleting the node until there are zero pods left scheduled on the node (should have been that way before, but got mixed up with Brendan's changes).
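
A sketch of the kind of filtering those checks need (the type and names are stand-ins defined here only so the snippet is self-contained): a pod with a deletion timestamp set is already terminating and shouldn't be counted as active:

    package sketch

    import "time"

    // podMeta is a pared-down stand-in for the API object's metadata.
    type podMeta struct {
        Name              string
        DeletionTimestamp *time.Time
    }

    // filterOutTerminating drops pods that are merely waiting out a graceful
    // deletion, so RC and service checks only count pods that are truly active.
    func filterOutTerminating(pods []podMeta) []podMeta {
        var active []podMeta
        for _, p := range pods {
            if p.DeletionTimestamp != nil {
                continue
            }
            active = append(active, p)
        }
        return active
    }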


@bgrant0607
Member

We need graceful termination of RCs instead of the client-side stop implementation. I don't think we should change the semantics of scaling to 0 generally.

@smarterclayton
Contributor Author

Ok - are we ok to leave that wrinkle for RC then?


@smarterclayton
Contributor Author

I've done what I should have done from the beginning and carved this up - commits 1-4 are "safe", 5 may have a bug that is fixed in the other cleanup e2e pulls, and I'm looking at making 6 safer. 7 is the only commit that actually turns on graceful deletion, so assuming we get through 1-6, only 7 would have to be rolled back.

Sorry for not doing this earlier...

@smarterclayton
Contributor Author

#12963 needs a final once-over but should be ready for another merge attempt. Only getting service flakes at this point.

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.