Wait for discovery on container start error #22561
Conversation
(force-pushed from a0b7398 to 4657391)
```go
@@ -380,8 +380,22 @@ func (daemon *Daemon) restore() error {
			}
		}
	}

	if err := daemon.containerStart(c); err != nil {
```
Right now libnetwork only returns a `fmt.Errorf` error; it would be nice to have a typed error that we could check instead of assuming a discovery issue.
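A minimal sketch of what such a typed-error check could look like. All names here (`ErrDiscoveryNotReady`, `startContainer`) are illustrative assumptions, not libnetwork's actual API:

```go
package main

import "fmt"

// ErrDiscoveryNotReady is a hypothetical typed error that a network
// library could return when the cluster store is not yet reachable.
// libnetwork does not actually define this type.
type ErrDiscoveryNotReady struct {
	Store string
}

func (e ErrDiscoveryNotReady) Error() string {
	return fmt.Sprintf("discovery not ready: store %s unreachable", e.Store)
}

// startContainer stands in for daemon.containerStart in this sketch.
func startContainer() error {
	return ErrDiscoveryNotReady{Store: "etcd://127.0.0.1"}
}

func main() {
	if err := startContainer(); err != nil {
		// Check the concrete error type instead of assuming every
		// start failure is a discovery issue.
		if dErr, ok := err.(ErrDiscoveryNotReady); ok {
			fmt.Println("waiting for discovery:", dErr.Store)
		} else {
			fmt.Println("fatal:", err)
		}
	}
}
```

With a typed error, the caller can fall through to normal error handling for any failure that is not discovery-related.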
I think instead of even trying to start a container and failing, we should check `NetworkByName` for all the networks that the container is configured with. If the call returns `ErrNoSuchNetwork`, then we can wait for a discovery event.
This makes the check more explicit, and avoids even trying to start the container until the store is up for name lookup.
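The suggested pre-check can be sketched as follows. This is a toy stand-in, assuming a `networkByName` lookup and an `ErrNoSuchNetwork` sentinel; the real `controller.NetworkByName` and error type live in libnetwork:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoSuchNetwork mirrors the libnetwork error returned when a
// network name cannot be resolved; the real type lives in libnetwork.
var ErrNoSuchNetwork = errors.New("no such network")

// networkByName stands in for controller.NetworkByName in this sketch;
// here only "bridge" is known locally.
func networkByName(name string) error {
	if name != "bridge" {
		return ErrNoSuchNetwork
	}
	return nil
}

// shouldWaitForDiscovery reports whether any of the container's
// configured networks cannot be resolved yet, meaning the store is
// not up for name lookup and we should wait for a discovery event
// before attempting to start the container at all.
func shouldWaitForDiscovery(networks []string) bool {
	for _, n := range networks {
		if err := networkByName(n); err == ErrNoSuchNetwork {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldWaitForDiscovery([]string{"bridge"}))
	fmt.Println(shouldWaitForDiscovery([]string{"bridge", "test"}))
}
```

A container attached only to locally-known networks starts immediately; one attached to a network that needs the store waits for discovery first.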
This seems to work for me. It really speeds up start-up times. Edit: I spoke too soon. It looks like start times are not that much faster :/ It does lead to containers starting up successfully though :D
(force-pushed from cf8dc80 to faeb23f)
Updated with @mavenugo's suggestion.
(force-pushed from 8430be0 to 972ebf7)
```go
if err := daemon.containerStart(c); err != nil {
	logrus.Errorf("Failed to start container %s: %s", c.ID, err)
	if err != nil {
```
I don't understand this trick
Remnant of a previous iteration.
Will remove.
fixed
(force-pushed from 972ebf7 to b410494)
```go
close(d.readyCh)
readyChanClosed = true
t.Stop()
t = d.ticker
```
If the store is unreachable during the daemon restart (for any reason), will the ticker be aggressively set to 500 msec forever? I think we should add a timeout after which it goes back to the normal default/user-configured heartbeat timer.
Updated
(force-pushed from 236f06b to c1a2bec)
This gives discovery a chance to initialize, particularly if the K/V store being used is in a container.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(force-pushed from c1a2bec to 2dce79e)
Thanks @cpuguy83. LGTM. @vikstrous, can you please try this patch and confirm it handles your scenario?
I tried to apply this patch on docker 1.11, and after a restart some of my containers failed to start. It seems like it's still not succeeding at launching the containers after failing the first time.
My testing was wrong. It works as expected now. Update: on the second restart it fails again. I don't know if it's a new error or the same one. We are investigating...
@cpuguy83 I tried the patch multiple times and it works as advertised. I also think the 60-second timeout for the restart policy is reasonable in order for the KV container to come up. LGTM
I'm not fond of the long timeout, but I understand we need a fix.
- What I did

Allow the `cluster-store` to run in a container and start reliably on daemon restart.

- How I did it
On the first container start, if an error is encountered, wait for discovery to finish initialization.
This allows a containerized K/V store to come up while the failed container waits for the discovery service to reach the K/V store and become ready, then attempts to start again.
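The retry flow described above can be sketched with a ready channel. The channel and function names (`discoveryReady`, `containerStart`) are illustrative, not the PR's actual identifiers; the 60-second timeout comes from the review discussion below:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errStoreDown = errors.New("cluster store unreachable")

// containerStart stands in for daemon.containerStart; in this sketch
// it fails until the containerized K/V store has come up.
func containerStart(storeUp <-chan struct{}) error {
	select {
	case <-storeUp:
		return nil
	default:
		return errStoreDown
	}
}

func main() {
	discoveryReady := make(chan struct{}) // closed once discovery reaches the store
	go func() {
		time.Sleep(100 * time.Millisecond) // simulated K/V container start-up
		close(discoveryReady)
	}()

	if err := containerStart(discoveryReady); err != nil {
		fmt.Println("first start failed:", err)
		// Wait for discovery to finish initialization, then retry.
		select {
		case <-discoveryReady:
		case <-time.After(60 * time.Second): // timeout from the review discussion
		}
		if err := containerStart(discoveryReady); err == nil {
			fmt.Println("container started after discovery became ready")
		}
	}
}
```

The key point is that the first failure does not abort restore; it parks the container until discovery signals readiness, then retries.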
- How to verify it

```
docker run -d --name etcd --net=host --restart=always quay.io/coreos/etcd
etcd://127.0.0.1
docker network create -d overlay test
docker run -d --name test --restart=always --net=test busybox top
```
Note: there seems to be an issue in docker-in-docker where a net namespace is leaked, causing a `file exists` error when it tries to start the `test` container after daemon restart, but that is orthogonal to this PR.

- Description for the changelog
Fixes issues with restarting containers after a daemon restart, for containers using networks which require the cluster store, when the cluster store is containerized.
Fixes #22486