
Daemon Restart: attempt to wait for container deps #18208

Merged: 1 commit merged into moby:master from the restart_links branch, Jan 4, 2016

Conversation

cpuguy83 (Member):

This provides a best effort on daemon restarts to restart containers
which have linked containers that are not up yet instead of failing.

// Best effort to wait for container deps to be up
switch e := err.(type) {
case errcode.Error:
	if e.ErrorCode() == derr.ErrorCodeLinkNotRunning {
Contributor:
you can also invert it: if e.ErrorCode() != derr.ErrorCodeLinkNotRunning { break }, and remove the conditional a few lines below 😸

Contributor:
I also think we have a function to check those error codes without adding switch statements everywhere, but I cannot find it now.

cpuguy83 (Member, Author):
@duglin ?

Also, I have the switch because I was also trying to fix another case where overlay networks aren't ready -- but there appears to be a deeper issue in libnetwork that needs to get fixed first.

Contributor:

I think at one point I did have a util func to avoid making everyone write switch statements but it never made the cut.

@cpuguy83 cpuguy83 force-pushed the restart_links branch 2 times, most recently from c3f52e1 to fa70c7f Compare November 24, 2015 21:02
	}
	err = nil
}
if err != nil {
Contributor:
this is not necessary anymore 😸

@calavera (Contributor):

nevermind about the function, I might have dreamed about it

@cpuguy83 (Member, Author):

@calavera Updated

// This file contains all of the errors that can be generated from the
// docker engine but are not tied to any specific top-level component.

const errGroup = "engine"

// IsErr returns a bool indicating if the passed in error matches the expected error
func IsErr(err error, expected errcode.ErrorCode) bool {
Contributor:

this is a really bad func name if you think about it ;)

Maybe use something more descriptive, but seeing that this is not used in this PR, I would remove this file from the PR.

cpuguy83 (Member, Author):
It is used in this PR

@cpuguy83 (Member, Author):

@crosbymichael updated the fn name.

@calavera (Contributor):

LGTM

Edited:

I was thinking about an alternative approach:

We put the containers that fail to start in a queue instead of retrying. When all the other containers have finished, we try to restart the failed ones again sequentially. I don't know if it's better, what do you think?
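The queue-then-retry alternative can be sketched as below. This is an illustrative model, not docker's code: `container`, `startAll`, and the dependency check are invented names, and real containers would be started rather than tracked in a map.

```go
package main

import "fmt"

// container models a container and the linked containers it depends on.
// All names here are illustrative, not docker's.
type container struct {
	name string
	deps []string
}

// startAll starts every container once, queues the failures, then retries
// the queue sequentially after everything else has had a chance to come up,
// as the comment above proposes.
func startAll(containers []container) []string {
	up := map[string]bool{}
	canStart := func(c container) bool {
		for _, d := range c.deps {
			if !up[d] {
				return false
			}
		}
		return true
	}

	var order []string
	var queue []container
	for _, c := range containers {
		if canStart(c) {
			up[c.name] = true
			order = append(order, c.name)
		} else {
			queue = append(queue, c) // a link is not running yet; retry later
		}
	}
	// Sequential retry pass over the containers that failed to start.
	for _, c := range queue {
		if canStart(c) {
			up[c.name] = true
			order = append(order, c.name)
		}
	}
	return order
}

func main() {
	// web depends on db, but appears first in the restore order.
	fmt.Println(startAll([]container{
		{name: "web", deps: []string{"db"}},
		{name: "db"},
	})) // [db web]
}
```

One retry pass is enough only when dependency chains are short; deeper chains would need to loop until the queue stops shrinking.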

@crosbymichael (Contributor):

you can easily build a little dependency graph to fix the start order instead of thrashing the containers trying to start them.
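The dependency-graph idea amounts to a small topological sort over container links. A hedged sketch using Kahn's algorithm, with invented names (`startOrder`, the `links` map); docker's actual containers and link metadata are more involved:

```go
package main

import "fmt"

// startOrder topologically sorts containers by their links (Kahn's
// algorithm), so dependencies start before the containers that link to
// them. links maps a container name to the names it depends on.
func startOrder(links map[string][]string) []string {
	indegree := map[string]int{}
	dependents := map[string][]string{}
	for c, deps := range links {
		indegree[c] = len(deps)
		for _, d := range deps {
			dependents[d] = append(dependents[d], c)
		}
	}
	var ready []string
	for c, n := range indegree {
		if n == 0 {
			ready = append(ready, c)
		}
	}
	var order []string
	for len(ready) > 0 {
		c := ready[0]
		ready = ready[1:]
		order = append(order, c)
		for _, dep := range dependents[c] {
			indegree[dep]--
			if indegree[dep] == 0 {
				ready = append(ready, dep)
			}
		}
	}
	return order // containers caught in a link cycle are left out
}

func main() {
	order := startOrder(map[string][]string{
		"web":   {"db", "cache"},
		"db":    {},
		"cache": {},
	})
	fmt.Println(order[len(order)-1]) // web: it links to everything else, so it starts last
}
```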

@cpuguy83 (Member, Author):

PTAL

}

for _, c := range children {
	if _, err := c.waitRunning(1 * time.Second); err != nil {
cpuguy83 (Member, Author):
I just picked a duration here.

@cpuguy83 (Member, Author):

Going for something simple, but I think this might be too simple.
I'll have to work on something else. Closing for now.

@cpuguy83 cpuguy83 closed this Nov 25, 2015
@cpuguy83 cpuguy83 reopened this Nov 25, 2015
@cpuguy83 cpuguy83 force-pushed the restart_links branch 3 times, most recently from b562af0 to f572850 Compare November 25, 2015 18:51
@cpuguy83 (Member, Author):

Ok, PTAL.

@crosbymichael (Contributor):

This is closer but there is too much concurrency where it is not needed

@cpuguy83 (Member, Author):

Ah, good point, updated to make waiting on children happen in one goroutine per container instead of for each child.
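That concurrency shape (one goroutine per container, children waited on sequentially inside it) can be sketched as below. `startContainers` and `waitChild` are invented names standing in for the daemon's restore loop and `waitRunning`.

```go
package main

import (
	"fmt"
	"sync"
)

// startContainers launches one goroutine per container; that goroutine
// waits on each of the container's children sequentially, instead of
// spawning an extra goroutine per child. waitChild stands in for waiting
// until a linked container is running.
func startContainers(children map[string][]string, waitChild func(name string)) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		started []string
	)
	for name, deps := range children {
		wg.Add(1)
		go func(name string, deps []string) { // one goroutine per container
			defer wg.Done()
			for _, d := range deps { // children are waited on sequentially
				waitChild(d)
			}
			mu.Lock()
			started = append(started, name)
			mu.Unlock()
		}(name, deps)
	}
	wg.Wait()
	return started
}

func main() {
	started := startContainers(
		map[string][]string{"web": {"db", "cache"}, "worker": {"db"}},
		func(string) {}, // pretend every child is already up
	)
	fmt.Println(len(started)) // 2
}
```

Containers still restore concurrently with each other, but the per-child fan-out that prompted the "too much concurrency" comment is gone.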

@tiborvass (Contributor):

@cpuguy83 needs rebase

@cpuguy83 cpuguy83 force-pushed the restart_links branch 2 times, most recently from 0f9315f to bd0c54b Compare December 8, 2015 01:55
@cpuguy83 (Member, Author) commented Dec 8, 2015:

rebased

@@ -353,17 +353,43 @@ func (daemon *Daemon) restore() error {
// The container register failed should not be started.
return
}
}(c.container, c.registered)
Contributor:
Does this "upper" goroutine still need to exist now that the actual container start operations are in the "lower" goroutine (added below) with the range restartContainers loop?

I tried to dig through daemon.Register and daemon.generateNewName (which seem to be the only external calls in the goroutine now) to see if there is any reason for this to be handled per-container in a concurrent manner, and nothing jumps out at me.

cpuguy83 (Member, Author):
Hmmm, you are probably right...
Maybe restarting the daemon after an unclean shutdown would be a bit slower (making a few syscalls to clean up)...

@cpuguy83 (Member, Author):

@estesp Updated with your suggestion.

@estesp (Contributor) commented Dec 17, 2015:

This looks a lot better to me without two loops of goroutines. Not sure if @crosbymichael wanted to take another look given the comment re: "too much concurrency"?

@thaJeztah (Member):

ping @calavera @crosbymichael if you can have another look?

This provides a best effort on daemon restarts to restart containers
which have linked containers that are not up yet instead of failing.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@cpuguy83 (Member, Author):

Rebased

@calavera (Contributor):

LGTM

// running before we try to start the container
children, err := daemon.children(container.Name)
if err != nil {
	logrus.Warnf("error getting children for %s: %v", container.Name, err)
Contributor:
should have a return here or you will probably get a panic from a nil container

cpuguy83 (Member, Author):
I don't think we'll hit a panic here: we aren't doing anything with the values except checking whether we have a start notifier for them later, which can be nil and be ok, and if children is nil that's fine too, because we just range over it.

I'd also prefer not to return here, since that would make the container never start.
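The nil-safety being relied on here is a language guarantee, not something special about this code: ranging over a nil slice runs the loop body zero times, and indexing a nil map yields the zero value. A small standalone demonstration (the variable names are illustrative):

```go
package main

import "fmt"

func main() {
	// Ranging over a nil slice is safe: the body just never runs. This is
	// why the missing return after the Warnf does not cause a panic when
	// the children lookup fails and leaves the slice nil.
	var children []string // nil, as when the lookup errors
	count := 0
	for range children {
		count++
	}
	fmt.Println(count) // 0

	// Likewise, reading a missing key from a nil map yields the zero
	// value, so a nil map of start notifiers is also safe to consult.
	var notifiers map[string]chan struct{}
	fmt.Println(notifiers["web"] == nil) // true
}
```

Writing to a nil map would panic, but this path only reads, so the early return really is unnecessary.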

Contributor:
ok, i see that now. thanks

@crosbymichael (Contributor):

LGTM

crosbymichael added a commit that referenced this pull request Jan 4, 2016
Daemon Restart: attempt to wait for container deps
@crosbymichael crosbymichael merged commit 04234bd into moby:master Jan 4, 2016
@cpuguy83 cpuguy83 deleted the restart_links branch January 5, 2016 15:56
@thaJeztah thaJeztah added this to the 1.10.0 milestone Feb 11, 2016
8 participants