Enable graceful deletion using reconciliation loops in the Kubelet without TTL #9165
Conversation
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
FWIW, I found the 30s default confusing when I synced and created pods; it felt like Kubernetes was just unresponsive.
Very few things should ever go to 30s - I think that lag was indicative of a race or other problem in the code, not the actual deletion taking that long. I've never seen a very long deletion.
You're probably right; I think I was trying to delete a failed container with restart never. Anyway, I'll keep an eye out for that case should it pop up again.
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
The explanation for the observed sluggishness is that I never realized how slow … would just work. Now the third create is forced to wait on the docker stop.
I think with my branch there's a gap condition in the kubelet status manager where it is losing status updates. It might be qps, on either docker info or the API, or it might be a race between the status updates and the kills. Again, because it's now part of the blocking path it's noticeable.
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
It looks like the delays are due to status being missed. It's possible that something needs to be cleared in the Kubelet that I'm missing.
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
This is now behind a feature gate (--runtime-config=feature/gracefulDelete) on the apiserver. I need to tweak the Kubelet behavior with this change to use pod.Spec.TerminationGracePeriodSeconds when pod.DeletionGracePeriodSeconds is not set; then this should be safe to deliver and a strict improvement over our current code (because it would allow users to specify deletion grace period seconds), and we can test graceful deletion independently.
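For illustration only, here is a minimal sketch of that fallback using simplified stand-in types rather than the real API structs (the helper and the hard-coded default are assumptions; only the field names come from the API):

```go
package main

import "fmt"

// Simplified stand-ins for the API types discussed above (not the real structs).
type PodSpec struct {
	TerminationGracePeriodSeconds *int64
}

type Pod struct {
	DeletionGracePeriodSeconds *int64 // stamped by the apiserver on delete
	Spec                       PodSpec
}

// gracePeriod picks the grace period the kubelet would honor: the per-deletion
// value when the apiserver set one, otherwise the pod's spec-level default,
// otherwise an assumed 30s fallback.
func gracePeriod(pod *Pod) int64 {
	if pod.DeletionGracePeriodSeconds != nil {
		return *pod.DeletionGracePeriodSeconds
	}
	if pod.Spec.TerminationGracePeriodSeconds != nil {
		return *pod.Spec.TerminationGracePeriodSeconds
	}
	return 30 // assumed default, not taken from this PR
}

func main() {
	ten := int64(10)
	fmt.Println(gracePeriod(&Pod{Spec: PodSpec{TerminationGracePeriodSeconds: &ten}})) // 10
}
```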
Those changes are added. I think this is ready for review (assuming we're OK with the feature gate).
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
Is docker …
Pod death is KillPod, which is parallel.
It is, because I added that. However, the latency is still dominated by the slowest timeout. I think the default timeout is 10 seconds, so it's somewhat tolerable. If you allow it to go beyond that, it'd slow down kubelet response time. We wanted to move pod killing to a per-pod goroutine, but didn't have time to do so before v1.0.
I thought SyncPod was managed by the workers and that the workers were parallel? I don't disagree that the current path will slow down the kubelet, but SIGKILLing apps with stateful disk is way worse than latency on the kubelet. 10s is probably too short for most databases or other stateful systems, so the solution probably has to be on the kubelet side.
The right way to fix it is to move pod killing into per-pod workers. The (perhaps) slightly easier way is to spawn another goroutine for pod killing and related cleanup (basically the second half of …). Agreed that some pods need more time to terminate. At the least we should have some safeguard (an upper limit) to avoid users specifying a really high timeout, which would render the kubelet unresponsive.
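A rough, hypothetical sketch of the per-pod-goroutine idea being discussed (made-up names and timings, not the kubelet's actual code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// killPod stands in for stopping a pod's containers with its grace period.
func killPod(name string, grace time.Duration) {
	fmt.Printf("stopping %s (grace %s)\n", name, grace)
	time.Sleep(grace) // placeholder for the SIGTERM wait / docker stop
	fmt.Printf("%s stopped\n", name)
}

func main() {
	pods := map[string]time.Duration{
		"db-0":  3 * time.Second, // a stateful pod asking for more time
		"web-0": 1 * time.Second,
	}

	// Killing each pod in its own goroutine keeps one slow pod from blocking
	// the others, which is the point of moving this work off the shared path.
	var wg sync.WaitGroup
	for name, grace := range pods {
		wg.Add(1)
		go func(name string, grace time.Duration) {
			defer wg.Done()
			killPod(name, grace)
		}(name, grace)
	}
	wg.Wait()
}
```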
Is moving pod kill into per-pod workers significantly harder?
@@ -54,8 +54,17 @@ const (
maxReasonCacheEntries = 200

kubernetesPodLabel = "io.kubernetes.pod.data"
kubernetesContainerLabel = "io.kubernetes.container.name"
// In order to avoid unnecessary SIGKILLs, give every container a minimum grace …
2s not 1m?
This is really just to guard against forcing SIGKILL. We default pod.Spec.TerminationGracePeriodSeconds to 30 - this just guarantees you never get less than 2s to respond to SIGTERM.
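A minimal sketch of that floor, assuming a clamping helper along these lines (the 2-second constant comes from the diff above; the helper name is hypothetical):

```go
package kubelet

// minimumGracePeriodSeconds guards against an effective immediate SIGKILL:
// every container gets at least this long to respond to SIGTERM, even if the
// pod requested less. The normal default is 30s via TerminationGracePeriodSeconds.
const minimumGracePeriodSeconds = 2

// effectiveGrace clamps a requested grace period to the floor above.
func effectiveGrace(requested int64) int64 {
	if requested < minimumGracePeriodSeconds {
		return minimumGracePeriodSeconds
	}
	return requested
}
```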
Revert "Merge pull request #9165 from smarterclayton/graceful"
Will chase down what's causing the e2e flakes. Seemed to be something recent.
@roberthbailey do you have the seed # from that run?
Seed: 1439942057. So far, I see "pod already exists" errors. Maybe setting the default grace period to 0 will work around the problem?
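For illustration, a sketch of deleting with an explicit zero grace period, written against today's client-go (the namespace and pod name are placeholders; the client API at the time of this PR differed):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A zero grace period removes the pod immediately, so a test that reuses
	// the same pod name does not hit "pod already exists" while the old pod
	// is still terminating.
	zero := int64(0)
	err = cs.CoreV1().Pods("e2e-tests").Delete(context.TODO(), "test-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &zero})
	if err != nil {
		panic(err)
	}
}
```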
For example: …
The run with the 10 failures had a seed of 1439937740.
GCE e2e seed: 1439944232. scheduler_predicates, resize_nodes, pd, and at least some of the other tests appear to fail pretty consistently.
Yeah, I've found places where we assume there is no overlap on RC stop, but … The other errors were mismatches between the e2e test and naive pod list …
We need graceful termination of RCs instead of the client-side stop implementation. I don't think we should change the semantics of scaling to 0 generally.
OK - are we OK to leave that wrinkle for RC then?
Revert "Revert "Merge pull request #9165 from smarterclayton/graceful"" This reverts commit 08e6a43.
I've done what I should have done from the beginning and carved this up: 1-4 are "safe", 5 may have a bug that is fixed in the other cleanup e2e pulls, and I'm looking at making 6 safer. 7 is the only commit that actually turns on graceful deletion, so assuming we get through 1-6, only 7 would have to be rolled back. Sorry for not doing this earlier...
#12963 needs a final once-over but should be ready for another merge attempt. Only getting services flakes at this point.
Restores graceful delete (merged prior to 1.0 and reverted in #8641). Removes the use of TTL which causes latency issues in the API server - instead, the Kubelet and node controller are responsible for managing the deletion of terminated pods.
Introduces DeletionGracePeriodSeconds, which complements DeletionTimestamp (the time the object is expected to be deleted).
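For reference, a small illustrative helper showing how a client could read the two fields together (the ObjectMeta fields are real; the helper itself is not part of this PR):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// beingDeleted reports whether the pod has been marked for graceful deletion
// and, if so, how many seconds it was granted.
func beingDeleted(pod *v1.Pod) (bool, int64) {
	if pod.DeletionTimestamp == nil {
		return false, 0
	}
	var grace int64
	if pod.DeletionGracePeriodSeconds != nil {
		grace = *pod.DeletionGracePeriodSeconds
	}
	return true, grace
}

func main() {
	var pod v1.Pod
	fmt.Println(beingDeleted(&pod)) // false 0 until the apiserver stamps deletion
}
```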