Have an option to keep Pod around for debugging #14602

Closed
davidopp opened this issue Sep 26, 2015 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@davidopp
Member

We received an interesting suggestion today which has multiple layers to it but basically the use case was that they would like us to keep Pod state around for debugging in the case where a Pod is "moved" to another machine. This would presumably be an option in the Pod, rather than default behavior, since in the normal case people probably only care about the logs and not the full state.

[Please read "move" here in the proper cattle-centric way, meaning "kill old Pod and create a fungible replica on a different machine"]

Today, IIUC, the only scenario where we "move" a Pod to a new machine is when the machine fails. So the feature request is that when the machine comes back up, its logs, containers, host directories, Pod-scoped directories, etc. are still intact. I guess this is somewhat related to the Borg critical data concept, although for a very different purpose.

In the future I imagine we will want some other scenarios to trigger the Pod to "move" -- for example there is a restart loop (kubelet keeps restarting the container locally due to repeated OOM or something else that boils down to "the Pod is just not going to work here, but might work in another machine"). In this case you'd want to also keep the same state from the old Pod around despite creating a replacement replica elsewhere.

The last use case that was brought up was debugging deadlock. There would be an option to start a new replica on another node while keeping the deadlocked Pod/container running so that you can attach a debugger to the process that is deadlocked. Of course you'd also like to keep the same state we've been talking about for the other cases (logs, containers, host directories, Pod-scoped directories).

@mikedanese does this sound like a fairly accurate description of the request? Anything I left out?

@davidopp davidopp added priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. team/control-plane labels Sep 26, 2015
@derekwaynecarr
Member

@bparees - this struck me as something you would find of interest and want to weigh in on.

I also think it's related to this as one potential means to support debugging of the container state:

#14561


@bparees
Contributor

bparees commented Sep 27, 2015

thanks @derekwaynecarr, definitely the ability to get the pod logs seems extremely useful.

in relation to #14561, the ability to go beyond what is described here would also be extremely useful imho: actually commit the containers of the failed/moved pod and push the resulting image to a registry, so i can pull it (basically a snapshot at that point), investigate the ephemeral filesystem, attempt to start the process, etc.

@bgrant0607
Member

Sort of related: #3949

@soltysh
Contributor

soltysh commented Nov 30, 2015

I have a slightly different use-case, still sort of related to what @davidopp mentioned. While implementing #17940 and the upcoming #17244, it struck me that we don't have any option to gracefully terminate a pod. The aforementioned issues/PRs deal with a job that should be terminated upon certain conditions (timeout, remote termination). Such a pod should then be in a failed state, denoting that it was terminated prematurely. @davidopp does this also fit your use-case, or should I rather create a separate issue on the topic?

@bgrant0607
Member

Related: #2789

@bgrant0607
Member

Note that if you don't care about the resources consumed, it's easy to just keep a pod around by changing its labels to orphan it from its controller.
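To illustrate why relabeling orphans a pod, here is a minimal sketch of equality-based label selection. It is a toy model, not the controller's actual code: the function names, the pod names (`web-1`, `web-2`), and the `app` label key are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Labels = Dict[str, str]


def selector_matches(selector: Labels, labels: Labels) -> bool:
    """Equality-based match: every selector key/value must appear in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def owned_pods(selector: Labels, pods: Dict[str, Labels]) -> List[str]:
    """Return the names of the pods a controller with this selector would manage."""
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]


# Hypothetical controller selector and pods (names are assumptions for this sketch).
selector = {"app": "web"}
pods = {
    "web-1": {"app": "web"},
    "web-2": {"app": "web"},
}

before = owned_pods(selector, pods)

# Change the label on web-1 so it no longer matches the selector.
# The pod itself is untouched; the controller simply stops counting it
# and would create a replacement to get back to its desired replica count.
pods["web-1"] = {"app": "web-debug"}

after = owned_pods(selector, pods)

print(before)
print(after)
```

On a real cluster the relabeling step would be something along the lines of `kubectl label pod web-1 app=web-debug --overwrite` (pod name and label value again hypothetical). The orphaned pod keeps running and keeps consuming resources until it is deleted by hand.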

@soltysh
Contributor

soltysh commented Dec 1, 2015

> Note that if you don't care about the resources consumed, it's easy to just keep a pod around by changing its labels to orphan it from its controller.

I guess not. If I'm setting a deadline, I'm interested in limiting resource consumption. At least that's how I see it.

@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Freeze the issue for 90d with /lifecycle frozen.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 13, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


7 participants