
Wait for discovery on container start error #22561

Merged (1 commit, May 16, 2016)

Conversation

cpuguy83
Member

@cpuguy83 cpuguy83 commented May 6, 2016

- What I did
Allow the cluster-store to run in a container and start reliably on daemon restart.

- How I did it
If an error is encountered on the first container start, wait for discovery to finish initializing before retrying.
This gives a containerized K/V store a chance to come up: the failed container waits for the discovery service to reach the K/V store and become ready, then attempts to start again.

- How to verify it

  1. Start a fresh daemon.
  2. `docker run -d --name etcd --net=host --restart=always quay.io/coreos/etcd`
  3. Restart the daemon with cluster opts, setting the K/V store to `etcd://127.0.0.1`
  4. `docker network create -d overlay test`
  5. `docker run -d --name test --restart=always --net=test busybox top`
  6. Restart the daemon.

Note: there seems to be an issue in docker-in-docker where a leaked network namespace causes a `file exists` error when the daemon tries to start the test container after a restart, but that is orthogonal to this PR.

- Description for the changelog

Fixes issues with restarting containers after a daemon restart, for containers attached to networks that require the cluster store when the cluster store itself is containerized.

Fixes #22486

@@ -380,8 +380,22 @@ func (daemon *Daemon) restore() error {
}
}
}

if err := daemon.containerStart(c); err != nil {
Member Author:

Right now libnetwork only returns a fmt.Errorf error; it would be nice to have a typed error that we could check instead of assuming a discovery issue.

Contributor:

I think instead of trying to start a container and failing, we should call NetworkByName for all the networks the container is configured with. If the call returns ErrNoSuchNetwork, we can wait for the discovery event.

This makes the check explicit and avoids even trying to start the container until the store is up for name lookups.
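This suggested pre-check can be sketched as below; `networkByName` and the simulated store are stand-ins for the real libnetwork API:

```go
// Sketch: before starting a container, look up each configured network
// and report whether we should wait for the discovery event.
package main

import (
	"errors"
	"fmt"
)

var ErrNoSuchNetwork = errors.New("no such network")

// networkByName simulates a lookup against a store that only knows "bridge".
func networkByName(name string) error {
	if name != "bridge" {
		return ErrNoSuchNetwork
	}
	return nil
}

// needsDiscovery reports whether any of the container's networks is not
// yet resolvable, meaning the start should wait for discovery.
func needsDiscovery(networks []string) bool {
	for _, n := range networks {
		if errors.Is(networkByName(n), ErrNoSuchNetwork) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsDiscovery([]string{"bridge"}))            // false
	fmt.Println(needsDiscovery([]string{"bridge", "overlay"})) // true
}
```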

@vikstrous
Copy link
Contributor

vikstrous commented May 7, 2016

This seems to work for me. It really speeds up start up times.

Edit: I spoke too soon. It looks like start times are not that much faster :/ It does lead to containers starting up successfully though :D

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch 2 times, most recently from cf8dc80 to faeb23f on May 9, 2016 14:36
@cpuguy83
Copy link
Member Author

cpuguy83 commented May 9, 2016

Updated with @mavenugo's suggestion.

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch 2 times, most recently from 8430be0 to 972ebf7 on May 9, 2016 15:21
if err := daemon.containerStart(c); err != nil {
logrus.Errorf("Failed to start container %s: %s", c.ID, err)
if err != nil {
Contributor:

I don't understand this trick

Member Author:

Remnant of a previous iteration; will remove.

Member Author:

fixed

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch from 972ebf7 to b410494 on May 9, 2016 17:40
close(d.readyCh)
readyChanClosed = true
t.Stop()
t = d.ticker
Contributor:

If the store is unreachable during the daemon restart (for any reason), will the ticker be aggressively set to 500 msec forever? I think we should add a timeout after which it goes back to the normal default/user-configured heartbeat timer.

Member Author:

Updated

@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch 2 times, most recently from 236f06b to c1a2bec on May 11, 2016 13:45
This gives discovery a chance to initialize, particularly if the K/V
store being used is in a container.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@cpuguy83 cpuguy83 force-pushed the delay_start_for_discovery branch from c1a2bec to 2dce79e on May 11, 2016 13:49
@mavenugo
Contributor

Thanks @cpuguy83. LGTM.

@vikstrous can you please try this patch and confirm whether it handles your scenario?

@vikstrous
Contributor

vikstrous commented May 12, 2016

I tried to apply this patch on docker 1.11, and after a restart some of my containers failed to start with `"Error": "network dtr-ol not found"`.

It seems like it's still not succeeding at launching the containers after failing the first time.

@thaJeztah thaJeztah modified the milestones: 1.11.2, 1.12.0 May 12, 2016
@vikstrous
Contributor

vikstrous commented May 13, 2016

My testing was wrong. It works as expected now.

Update: On the second restart it fails again. I don't know if it's a new error or the same one. We are investigating...

@mavenugo
Contributor

@cpuguy83 I tried the patch multiple times and it works as advertised. I also think the 60-second timeout for the restart policy is reasonable, to give the KV container time to come up.

LGTM

@tiborvass
Contributor

I'm not fond of the long timeout, but I understand we need a fix.
LGTM

@tiborvass tiborvass merged commit 0088b8f into moby:master May 16, 2016
Successfully merging this pull request may close these issues.

race conditions with restart policies and overlay networks in docker 1.11.0-1.11.1
7 participants