Have an option to keep Pod around for debugging #14602
@bparees - this struck me as something you would find of interest. I also think it's related to this as one potential means to support it.
Thanks @derekwaynecarr, the ability to get the pod logs definitely seems extremely useful. In relation to #14561, it would also be extremely useful, imho, to go beyond what is described here: actually commit the containers of the failed/moved pod and push them to a registry, so I can pull the image (basically a snapshot at that point), investigate the ephemeral filesystem, attempt to start the process, etc.
Sort of related: #3949
I have a slightly different use case, still sort of related to what @davidopp mentioned. When implementing #17940 (and the still-to-come #17244) I was struck that we don't have any option to gracefully terminate a pod. The aforementioned issues/PRs deal with a job that should be terminated under certain conditions (timeout, remote termination). Such a pod should then end up in a failed state, denoting that it was terminated prematurely. @davidopp, does this also fit your use case, or should I create a separate issue for this topic?
Related: #2789
Note that if you don't care about the resources consumed, it's easy to just keep a pod around by changing its labels to orphan it from its controller. |
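For example, here is a minimal sketch of that label trick; the pod name and label key (`app=myapp`, `myapp-abc12`) are hypothetical, and it assumes the controller selects pods purely by that label:

```sh
# Hypothetical names: a controller selects pods via app=myapp and one of its
# replicas is the pod myapp-abc12. Overwriting the label orphans the pod from
# the controller, so a replacement is created while the old pod keeps running.
kubectl label pod myapp-abc12 --overwrite app=myapp-debug

# Inspect the orphaned pod at leisure:
kubectl logs myapp-abc12
kubectl exec -it myapp-abc12 -- /bin/sh

# Clean up once done debugging; it is no longer managed by the controller:
kubectl delete pod myapp-abc12
```

The orphaned pod does keep consuming node resources until you delete it, which is where this workaround falls short for the deadline case discussed above.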
I guess not. If I'm setting a deadline, I'm more interested in limiting resource consumption. At least that's how I see it.
We received an interesting suggestion today which has multiple layers to it, but the basic use case is that they would like us to keep Pod state around for debugging when a Pod is "moved" to another machine. This would presumably be an option in the Pod, rather than default behavior, since in the normal case people probably only care about the logs and not the full state.
[Please read "move" here in the proper cattle-centric way, meaning "kill old Pod and create a fungible replica on a different machine"]
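Purely as an illustration of the kind of opt-in being requested, and not a real API: a hypothetical sketch of what such a per-Pod option could look like, where the field name `retainPodStateForDebugging` is invented for this example.

```yaml
# Hypothetical sketch only: "retainPodStateForDebugging" is NOT an existing
# field; it simply illustrates an opt-in, per-Pod knob for the behavior
# described above.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # keep logs, containers, host directories, and Pod-scoped directories
  # around on the old node when the Pod is replaced elsewhere
  retainPodStateForDebugging: true
  containers:
  - name: app
    image: example/app:1.0
```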
Today, IIUC, the only scenario where we "move" a Pod to a new machine is if the machine fails. So the feature request is that when the machine comes back up, we want it to still have the logs, containers, host directories, Pod-scoped directories, etc. intact. I guess this is somewhat related to the Borg critical data concept, although for a very different purpose.
In the future I imagine we will want other scenarios to trigger the Pod to "move" -- for example, a restart loop (kubelet keeps restarting the container locally due to repeated OOM or something else that boils down to "the Pod is just not going to work here, but might work on another machine"). In this case you'd also want to keep the same state from the old Pod around despite creating a replacement replica elsewhere.
The last use case that was brought up was debugging deadlock. There would be an option to start a new replica on another node while keeping the deadlocked Pod/container running so that you can attach a debugger to the process that is deadlocked. Of course you'd also like to keep the same state we've been talking about for the other cases (logs, containers, host directories, Pod-scoped directories).
@mikedanese does this sound like a fairly accurate description of the request? Anything I left out?