
Ensure all the running containers are killed on daemon shutdown #12400

Merged

Conversation

coolljt0725
Contributor

Signed-off-by: Lei Jitang <leijitang@huawei.com>

On daemon shutdown, the daemon will be killed after 10s whether the running containers have been killed or not (the running containers are supposed to be killed within 10s). This causes problems for containers that are not killed within 10s. One problem is an error when removing these containers after the next daemon start: the container's rootfs is still mounted, and the rootfs is supposed to be unmounted when the container is killed.
Steps to reproduce (storage driver: overlay):
Step 1. Run a container: docker run -ti --name test busybox

[l00284783@localhost docker]$ docker run -ti --name test busybox
/ #

Step 2. Shutdown the docker daemon

^CINFO[0127] Received signal 'interrupt', starting shutdown of docker...
DEBU[0127] starting clean shutdown of all containers...
DEBU[0127] stopping aeb7a17120cd72e045d30e727b36a56b71663f700a69212c08b9060c8964b992
DEBU[0127] Sending 15 to aeb7a17120cd72e045d30e727b36a56b71663f700a69212c08b9060c8964b992
INFO[0127] -job serveapi(unix:///var/run/docker.sock) OK

Step 3. Restart the docker daemon
Step 4. Remove the container test that was started in step 1

[l00284783@localhost docker]$ docker rm test
Error response from daemon: Cannot destroy container test: Driver overlay failed to remove root filesystem aeb7a17120cd72e045d30e727b36a56b71663f700a69212c08b9060c8964b992: readdirent: no such file or directory
FATA[0000] Error: failed to remove one or more containers

I have tested with the overlay and devicemapper storage drivers; both have the same problem. devicemapper reports the device as busy.

The cause is that the container test isn't killed within 10s, but the docker daemon times out and is killed, so the container cleanup never runs and the container is left with its rootfs still mounted.

My fix is to use container.Stop(10) instead of c.KillSig(15). container.Stop(10) sends SIGTERM to the container and waits 10s; if the container fails to exit within 10s of the SIGTERM, it sends SIGKILL. The fix also changes the daemon shutdown timeout from 10s to 15s.
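The TERM-then-KILL escalation described above can be sketched in shell (a hypothetical stand-in for the actual Go change; `stop_with_timeout` and its polling loop are illustrative, not the daemon's real code):

```shell
# Hypothetical sketch of the escalation: SIGTERM, wait up to $2 seconds
# for a clean exit, then SIGKILL. Not the actual daemon implementation.
stop_with_timeout() {
  pid=$1
  timeout=$2
  kill -TERM "$pid" 2>/dev/null || return 0  # already gone
  i=0
  while [ "$i" -lt "$timeout" ]; do
    # Note: kill -0 reports an un-reaped zombie child as alive; a real
    # implementation waits on the process instead of polling like this.
    kill -0 "$pid" 2>/dev/null || return 0   # exited gracefully
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid" 2>/dev/null              # grace period over: force-kill
}
```

A process that handles SIGTERM gets the full grace period to shut down cleanly; one that ignores it is force-killed at the end, so its rootfs can still be unmounted afterwards.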

WDYT? ping @cpuguy83 @estesp @LK4D4 @jfrazelle

@jessfraz
Contributor

Can we have a daemon-restart integration-cli test for this?

@jessfraz
Contributor

Also, thanks for hunting this down; I'm sure it was not the easiest thing to debug.

@coolljt0725 coolljt0725 force-pushed the kill_all_containers_on_daemon_shutdown branch from 6ee7159 to 6d3c837 Compare April 16, 2015 02:18
@coolljt0725
Contributor Author

ping @jfrazelle @cpuguy83, I added a test. Can we run the test with --storage-driver=overlay or devicemapper on Jenkins? The default graphdriver, vfs, will not trigger the problem.

@jessfraz
Contributor

Ah, I see. Well, we do this on all master builds for all the graph drivers, so it's OK if it succeeds on vfs without the patch. I'll test personally to make sure it works with overlay before merging :)

But if we have a test, then the overlay master build will fail on Jenkins if we ever have a regression.


@jessfraz
Contributor

Ping @crosbymichael @LK4D4, this is what I was thinking was related to your state-on-shutdown PR.


@jessfraz
Contributor

Can you PTAL at #12423 and see if it solves this for you?


@coolljt0725
Contributor Author

@jfrazelle I have tested; #12423 can't solve this, it solves a different problem.

@coolljt0725
Contributor Author

This one is to ensure the container is cleaned up (its rootfs unmounted) on docker daemon shutdown.

@coolljt0725
Contributor Author

ping @jfrazelle @cpuguy83

@cpuguy83
Member

In some other refactoring here, @estesp and I felt we should bump this engine wait to something significantly higher. The reason is that it may take longer than 10s (or even 15s) to clean up the containers. If it is taking too long, the user can send 2 more TERMs (or INTs) to the daemon to forcibly stop it, or they can just wait for the timeout.

So if we set the timeout really high it shouldn't really hurt anything. I believe the reason we only ever send TERM to the containers on shutdown is to make sure we aren't just killing them... but I suppose these need to be handled at some point anyway.

@coolljt0725
Contributor Author

@cpuguy83

The reason is that it may take longer than 10s (or even 15s) to clean up the containers. If it is taking too long, the user can send 2 more TERMs (or INTs) to the daemon to forcibly stop it,

My fix is to use c.Stop(10) to stop the container; it sends SIGTERM to the container and waits 10 seconds, and if the container fails to exit within 10 seconds it sends SIGKILL.

So if we set the timeout really high it shouldn't really hurt anything.
I believe the reason we are only ever sending TERM to the containers on shutdown is to make sure we aren't just killing them... but I suppose these need to be handled at some point anyway.

But just setting the timeout really high may not work for some containers in interactive (-it) mode. I have tested locally: I set the timeout to 60 seconds and started a container with -ti (docker run -ti busybox), then shut down the daemon, but the container couldn't stop within 60 seconds. So I think sending SIGTERM to the container, waiting a few seconds, and then sending SIGKILL if it hasn't exited is a much more reasonable approach.
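The behavior described above can be reproduced outside docker. In this illustrative stand-in, `trap "" TERM` simulates a PID 1 process (like the interactive shell started by `docker run -ti busybox`) that a plain SIGTERM will never stop, no matter how long the timeout:

```shell
# Start a process that ignores SIGTERM, as an interactive container's
# PID 1 effectively does for unhandled signals.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
pid=$!

kill -TERM "$pid"                 # ignored: the trap discards it
sleep 2
kill -0 "$pid" && echo "still running after SIGTERM"

kill -KILL "$pid"                 # SIGKILL cannot be trapped or ignored
wait "$pid" 2>/dev/null
kill -0 "$pid" 2>/dev/null || echo "gone after SIGKILL"
```

This is why raising the timeout alone doesn't help here: such a process only exits on SIGKILL, so an escalation step is needed before the daemon goes away.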

@coolljt0725
Contributor Author

ping @jfrazelle @cpuguy83 @LK4D4 WDYT?

@cpuguy83
Member

I think this is a dup of #11988, which was closed yesterday because we don't want to SIGKILL containers on daemon exit.

@LK4D4
Contributor

LK4D4 commented Apr 22, 2015

Hmm, but it will wait for 10 seconds. I dunno; you will kill them anyway by killing their parent, docker.

@LK4D4
Contributor

LK4D4 commented May 5, 2015

The design seems legit to me; it is basically the same, but more understandable and safer.
ping @docker/core-maintainers

@estesp
Contributor

estesp commented May 6, 2015

@LK4D4 what about the comments of @unclejack and others from the other PR (for example: #11988 (comment)) that in some cases it is potentially very bad to SIGKILL, even after waiting and finding the container unresponsive? However, it seems like the problem of un-cleaned-up container rootfs's needs to be fixed. This PR seems to fix it cleanly, but it still leaves open a window of opportunity for a container with a long shutdown cycle (closing a ton of connections, performing a significantly large store-to-disk operation, etc.) to be unceremoniously killed in the middle of what would be a reasonable shutdown if the timeout didn't exist. As @cpuguy83 said, we thought about making the timeout much larger to accommodate, but even that is hacky when it is unpredictable how much time is required.

@estesp
Contributor

estesp commented May 6, 2015

By the way, with engine gone, this PR needs a rebase to not use engine.

@LK4D4
Contributor

LK4D4 commented May 6, 2015

@estesp I'm not sure what will happen to those processes without a parent. I think they will probably just die. I'll try tomorrow.

@LK4D4
Contributor

LK4D4 commented May 19, 2015

@rohansingh I think separate will be easier.

@cpuguy83
Member

LGTM

@coolljt0725 coolljt0725 force-pushed the kill_all_containers_on_daemon_shutdown branch from 28f6007 to 958f980 Compare May 20, 2015 00:58
@coolljt0725
Contributor Author

Updated per @LK4D4's suggestion.

@cpuguy83
Member

Also I believe this closes #7086

@cpuguy83
Member

Also needs another rebase. :(

@davidxia
Contributor

@rohansingh @LK4D4 I opened a PR here for increasing upstart kill timeout: #13359

@coolljt0725 coolljt0725 force-pushed the kill_all_containers_on_daemon_shutdown branch from 958f980 to 555f91b Compare May 21, 2015 01:00
@coolljt0725
Contributor Author

@cpuguy83 rebased

@coolljt0725
Contributor Author

TestLogsFollowGoroutinesWithStdout failed on the Windows Jenkins; I think it's unrelated.

@coolljt0725 coolljt0725 force-pushed the kill_all_containers_on_daemon_shutdown branch 2 times, most recently from 2ecbe0b to 7a616d2 Compare May 25, 2015 12:56
@thaJeztah thaJeztah added this to the 1.7.0 milestone May 26, 2015
@thaJeztah
Member

@coolljt0725 I added this to the 1.7 milestone, could you rebase the PR (again .. )?

@coolljt0725 coolljt0725 force-pushed the kill_all_containers_on_daemon_shutdown branch from 7a616d2 to bdb7707 Compare May 27, 2015 01:07
Signed-off-by: Lei Jitang <leijitang@huawei.com>
@coolljt0725
Contributor Author

ping @LK4D4 @jfrazelle @thaJeztah rebased

@coolljt0725
Contributor Author

ping @cpuguy83 @LK4D4 @jfrazelle @thaJeztah any news about this?

@cpuguy83
Member

Still LGTM

@calavera
Contributor

LGTM. Merging 🤘

calavera added a commit that referenced this pull request May 29, 2015
…mon_shutdown

Ensure all the running containers are killed on daemon shutdown
@calavera calavera merged commit 04c6f09 into moby:master May 29, 2015
@jessfraz
Contributor

cherry-picked
