
Wait for discovery on container start error #22561

Merged (1 commit, May 16, 2016)

Conversation

cpuguy83
Member

@cpuguy83 cpuguy83 commented May 6, 2016

- What I did
Allow the cluster-store to run in a container and start reliably on daemon restart.

- How I did it
If an error is encountered on the first container start, wait for discovery to finish initializing before retrying.
This gives a containerized K/V store a chance to come up: the failed container waits for the discovery service to reach the K/V store and become ready, then attempts to start again.

- How to verify it

  1. Start a fresh daemon.
  2. `docker run -d --name etcd --net=host --restart=always quay.io/coreos/etcd`
  3. Restart the daemon with cluster opts, setting the K/V store to `etcd://127.0.0.1`
  4. `docker network create -d overlay test`
  5. `docker run -d --name test --restart=always --net=test busybox top`
  6. Restart the daemon.

Note: there seems to be an issue in docker-in-docker where a leaked network namespace causes a `file exists` error when the daemon tries to start the test container after a restart, but that is orthogonal to this PR.

- Description for the changelog

Fixes issues with restarting containers after a daemon restart, for containers attached to networks that require the cluster store when the cluster store itself is containerized.

Fixes #22486

@@ -380,8 +380,22 @@ func (daemon *Daemon) restore() error {
}
}
}

if err := daemon.containerStart(c); err != nil {
Member Author:

Right now libnetwork only returns a fmt.Errorf error; it would be nice to have a typed error that we could check instead of assuming a discovery issue.

Contributor:

I think instead of trying to start a container and failing, we should call NetworkByName for all the networks the container is configured with. If the call returns ErrNoSuchNetwork, we can wait for the discovery event.

This makes the check explicit and avoids even trying to start the container until the store is up for name lookups.
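This suggested pre-check can be sketched as below; `networkByName` and the simulated store are stand-ins for the real libnetwork API:

```go
// Sketch: before starting a container, look up each configured network
// and report whether we should wait for the discovery event.
package main

import (
	"errors"
	"fmt"
)

var ErrNoSuchNetwork = errors.New("no such network")

// networkByName simulates a lookup against a store that only knows "bridge".
func networkByName(name string) error {
	if name != "bridge" {
		return ErrNoSuchNetwork
	}
	return nil
}

// needsDiscovery reports whether any of the container's networks is not
// yet resolvable, meaning the start should wait for discovery.
func needsDiscovery(networks []string) bool {
	for _, n := range networks {
		if errors.Is(networkByName(n), ErrNoSuchNetwork) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsDiscovery([]string{"bridge"}))            // false
	fmt.Println(needsDiscovery([]string{"bridge", "overlay"})) // true
}
```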

@vikstrous
Copy link
Contributor

vikstrous commented May 7, 2016

This seems to work for me. It really speeds up start up times.

Edit: I spoke too soon. It looks like start times are not that much faster :/ It does lead to containers starting up successfully though :D

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch 2 times, most recently from cf8dc80 to faeb23f on May 9, 2016 14:36
@cpuguy83
Copy link
Member Author

cpuguy83 commented May 9, 2016

Updated with @mavenugo's suggestion.

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch 2 times, most recently from 8430be0 to 972ebf7 on May 9, 2016 15:21
if err := daemon.containerStart(c); err != nil {
logrus.Errorf("Failed to start container %s: %s", c.ID, err)
if err != nil {
Contributor:

I don't understand this trick

Member Author:

Remnant of a previous iteration; will remove.

Member Author:

fixed

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch from 972ebf7 to b410494 on May 9, 2016 17:40
close(d.readyCh)
readyChanClosed = true
t.Stop()
t = d.ticker
Contributor:

If the store is unreachable during the daemon restart (for any reason), will the ticker be aggressively set to 500 msec forever? I think we should add a timeout after which it goes back to the normal default/user-configured heartbeat timer.

Member Author:

Updated

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch 2 times, most recently from 236f06b to c1a2bec on May 11, 2016 13:45
This gives discovery a chance to initialize, particularly if the K/V
store being used is in a container.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch from c1a2bec to 2dce79e on May 11, 2016 13:49
@mavenugo
Contributor

Thanks @cpuguy83. LGTM.

@vikstrous can you please try this patch and confirm whether it handles your scenario?

@vikstrous
Contributor

vikstrous commented May 12, 2016

I tried to apply this patch on docker 1.11, and after a restart some of my containers failed to start with `"Error": "network dtr-ol not found"`.

It seems like it's still not succeeding at launching the containers after failing the first time.

@thaJeztah thaJeztah modified the milestones: 1.11.2, 1.12.0 May 12, 2016
@vikstrous
Contributor

vikstrous commented May 13, 2016

My testing was wrong. It works as expected now.

Update: On the second restart it fails again. I don't know if it's a new error or the same one. We are investigating...

@mavenugo
Contributor

@cpuguy83 I tried the patch multiple times and it works as advertised. I also think the 60-second timeout for the restart policy is reasonable, to give the KV container time to come up.

LGTM

@tiborvass
Contributor

I'm not fond of the long timeout, but I understand we need a fix.
LGTM

@tiborvass tiborvass merged commit 0088b8f into moby:master May 16, 2016
Successfully merging this pull request may close these issues.

race conditions with restart policies and overlay networks in docker 1.11.0-1.11.1
7 participants